* [RFC PATCH v5 000/104] KVM TDX basic feature support
@ 2022-03-04 19:48 isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
                   ` (105 more replies)
  0 siblings, 106 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Hi.  Now that the TDX host kernel patch series has been posted, I've rebased
this patch series on top of it and made it work.

  https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/

Changes from v4:
- rebased to the TDX host kernel patch series.
- included all the patches needed to make this patch series work.
- added [MARKER] patches to make the patch layering clear.

Thanks,


* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.

A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.

We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).

In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.


* The organization of this patch series
This patch series is on top of the patch series "TDX host kernel support":
https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/

This patch series is available at
https://github.com/intel/tdx/releases/tag/kvm-upstream
The corresponding patches to qemu are available at
https://github.com/intel/qemu-tdx/commits/tdx-upstream

The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.

The layers below are chosen so that the device model, for example qemu, can
exercise each layer step by step: check if TDX is supported, create a TD VM,
create a TD vcpu, allow vcpu running, populate TD guest private memory, and
handle vcpu exits/hypercalls/interrupts to run the TD fully.

  TDX vcpu
  interrupt/exits/hypercall<------------\
        ^                               |
        |                               |
  TD finalization                       |
        ^                               |
        |                               |
  TDX EPT violation<------------\       |
        ^                       |       |
        |                       |       |
  TD vcpu enter/exit            |       |
        ^                       |       |
        |                       |       |
  TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
        ^                       |                       ^
        |                       |                       |
  TD VM creation/destruction    \---------------KVM TDP MMU hooks
        ^                                               ^
        |                                               |
  TDX architectural definitions                 KVM TDP refactoring for TDX
        ^                                               ^
        |                                               |
   TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
   coexistence          support


The following are explanations of each layer.  Each layer has a dummy commit
whose subject starts with [MARKER]; it is intended to help identify where each
layer starts.

TDX host kernel support:
        https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
        The guts of the system-wide initialization of the TDX module.  There is
        an independent patch series for the host x86 side.  The TDX KVM patches
        call functions that patch series provides to initialize the TDX module.

TDX, VMX coexistence:
        Infrastructure to allow TDX to coexist with VMX and trigger the
        initialization of the TDX module.
        This layer starts with
        "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
        Add TDX architectural definitions and helper functions
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
        Guest TD creation/destruction: allocation and release of the TDX
        specific VM structure.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
        Guest TD vcpu creation/destruction: allocation and release of the TDX
        specific vcpu structure.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
        Handle secure EPT violations to populate guest pages with TDX
        SEAMCALLs.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD finalization:
        Create an initial guest memory image with TDX measurement and finalize
        the TD so that it can run.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD finalization"
TD vcpu enter/exit:
        Allow a TDX vcpu to enter into the TD and exit from the TD.  Save CPU
        state before entering the TD and restore CPU state after exiting it.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu exits/interrupts/hypercalls:
        Handle various exits/hypercalls and allow interrupts to be injected so
        that the TD vcpu can continue running.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"

KVM MMU GPA stolen bits:
        Introduce a framework to handle the stolen (repurposed) bit of the GPA.
        TDX repurposes one bit of the GPA to indicate whether it is shared or
        private.  If it's shared, it's the same as the conventional VMX EPT
        case and the VMM can access shared guest pages.  If it's private, it's
        handled by the Secure EPT and the guest page is encrypted.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
        TDX Secure EPT requires different constants, e.g. the initial EPT entry
        value.  Various refactoring for those differences.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
        Introduce a framework for the TDP MMU to add hooks in addition to
        direct EPT access.  TDX adds Secure EPT, which is an enhancement to VMX
        EPT.  Unlike conventional VMX EPT, the CPU can't directly read/write
        Secure EPT; instead, TDX SEAMCALLs are used to operate on Secure EPT.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
        Introduce a framework to handle switching guest pages from private to
        shared and vice versa.  For a given GPA, a guest page can be assigned
        to either a private GPA or a shared GPA, exclusively.  With the TDX
        MapGPA hypercall, the guest TD converts GPA assignments from private
        (or shared) to shared (or private).
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA"

KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
        Guest private memory requires different memory management in KVM.  That
        patch series proposes a way to do it; this series integrates with it.

(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.

The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.

TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode.  A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.

TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
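
To make the calling convention concrete, a host-side wrapper for one such
interface function could look like the sketch below.  Note that __seamcall(),
the TDH_VP_ENTER leaf value and the argument layout are assumptions standing
in for the helpers this patch series actually adds.

  /*
   * Illustrative sketch of invoking SEAMCALL(TDH.VP.ENTER) from KVM.
   * __seamcall(), TDH_VP_ENTER and the argument order are stand-ins for
   * the real helpers introduced later in this series.
   */
  #define TDH_VP_ENTER  0       /* leaf number passed in RAX (illustrative) */

  static u64 tdh_vp_enter(hpa_t tdvpr)
  {
          /* RCX carries the HPA of the TDVPR page; the leaf selects the function. */
          return __seamcall(TDH_VP_ENTER, tdvpr, 0, 0, 0, NULL);
  }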

TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and is encrypted (using the guest TD's ephemeral private key).  At a high
level, TDCS holds information for controlling TD operation as a whole:
execution controls, EPTP, MSR bitmaps, etc., which KVM needs to set up.  Note
that MSR bitmaps are held as part of TDCS (unlike VMX) because they are meant
to have the same value for all VCPUs of the same TD.

Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.

Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM.  These control
structures are pointed to by fields in the TD VMCS.

The above means that 1) KVM needs to allocate different data structures for TDs,
2) KVM can reuse the existing code for TDs for some operations, and 3) KVM needs
to define TD-specific handling for other operations, redirecting them to TDX
specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
vmx_callback();".
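
As a sketch, such a callback wrapper would look like the following; the
function names here are only examples of the pattern, not the final API of
this series.

  /* Pick the TDX or VMX backend per vcpu (pattern sketch only). */
  static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
  {
          if (is_td_vcpu(vcpu))
                  tdx_flush_tlb(vcpu);            /* TDX handling via SEAMCALLs */
          else
                  vmx_flush_tlb_all(vcpu);        /* existing VMX path */
  }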

* TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.

In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.

During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX.  Since guest TDs usually require I/O and the data
exchange needs to be done via shared memory, KVM needs to use the current EPT
functionality even for TDs.
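
Conceptually, the shared/private split can be pictured as below.  The bit
position (51 or 47) comes from the TD configuration, and kvm_gpa_shared_mask()
is a hypothetical helper used only for illustration.

  /* Sketch: classify a GPA as shared or private based on the SHARED bit. */
  static bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
  {
          u64 shared_mask = kvm_gpa_shared_mask(kvm);     /* e.g. BIT_ULL(51) */

          return !(gpa & shared_mask);
  }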

* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, so conceptually Secure EPT is a subset of EPT.  Since
execution of such interface functions takes much longer than accessing memory
directly, in KVM we use the existing TDP code to mirror the Secure EPT for the
TD.

This way, we can effectively walk Secure EPT without using the TDX interface
functions.

* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.

* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU.  KVM currently
owns TSC virtualization for VMs, but the TDX module does so for TDs.

* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCE.  Instead, a PV way is
needed for the TD to communicate with the VMM.  For now, KVM silently ignores
MCE injection requests from the VMM.  MSRs related to MCE (e.g., MCE bank
registers) can be naturally emulated by paravirtualizing MSR access.

For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
available.

* Restrictions or future work
Some features are not included to reduce patch size.  Those features are
addressed as future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more

* Prerequisites
The TDX module needs to be loaded and initialized; that is out of the scope of
this patch series.  Another independent patch series for the common x86 code is
planned.  It defines CONFIG_INTEL_TDX_HOST, which this patch series uses.  It's
assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX module is initialized and
the TDX module APIs for the TDX guest life cycle, like tdh.mng.init, are ready
for KVM to use.

Concretely, global initialization, LP (Logical Processor) initialization,
global configuration, key configuration, and TDMR and PAMT initialization are
done.  The state of the TDX module is SYS_READY.  Please refer to the TDX
module specification, the chapter "Intel TDX Module Lifecycle State Machine".

** Detecting the TDX module readiness
The TDX host patch series implements detection of the TDX module availability
and its initialization so that KVM can use it.  It also manages the Host KeyIDs
(HKIDs) assigned to guest TDs.
The assumed APIs the TDX host patch series provides are listed below; a usage
sketch follows the list.
- int seamrr_enabled()
  Check if the required CPU feature (SEAM mode) is available.  This only checks
  CPU feature availability.  At this point, the TDX module may not be ready for
  KVM to use.
- int init_tdx(void);
  Initialize the TDX module so that it is ready for KVM to use.
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
  Return the system-wide information about the TDX module.  NULL if the TDX
  module isn't initialized.
- u32 tdx_get_global_keyid(void);
  Return the global key ID used by the TDX module itself.
- int tdx_keyid_alloc(void);
  Allocate an HKID for a guest TD.
- void tdx_keyid_free(int keyid);
  Free the HKID of a guest TD.
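
A minimal sketch of how the KVM side could consume these APIs during
initialization follows; the placement and error codes are illustrative, not
what this series finally does.

  static const struct tdsysinfo_struct *tdx_sysinfo;

  static int __init tdx_hardware_setup(void)
  {
          if (!seamrr_enabled())
                  return -EOPNOTSUPP;     /* no SEAM mode, no TDX */

          if (init_tdx())
                  return -EIO;            /* TDX module initialization failed */

          tdx_sysinfo = tdx_get_sysinfo();
          if (!tdx_sysinfo)
                  return -EIO;

          return 0;
  }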

(****)
* TDX KVM high-level design
- Host key ID management
A Host Key ID (HKID) needs to be assigned to each TDX guest for memory
encryption.  It is assumed that the TDX host patch series implements the
necessary functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void)
and void tdx_keyid_free(int keyid).

- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct vcpu_vmx.  To
identify the VM, introduce a VM type that specifies which type, VMX (default)
or TDX, is used.
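
As a sketch, the TDX-specific containers mirror the VMX ones; the fields and
the KVM_X86_TDX_VM constant below are placeholders, not the final definitions.

  /* Sketch of TDX-specific VM/vcpu containers; fields are placeholders. */
  struct kvm_tdx {
          struct kvm kvm;
          /* TDX-specific per-VM state (TDR/TDCS pages, HKID, ...) follows. */
  };

  struct vcpu_tdx {
          struct kvm_vcpu vcpu;
          /* TDX-specific per-vcpu state (TDVPR/TDVPX pages, ...) follows. */
  };

  static inline bool is_td(struct kvm *kvm)
  {
          return kvm->arch.vm_type == KVM_X86_TDX_VM;
  }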

- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, set initial guest memory and measurement.

The creation of a TDX VM requires five additional operations on top of the
conventional VM creation flow (a sketch of the overall flow follows the list).
  - Get KVM system capability to check if TDX VM type is supported
  - VM creation (KVM_CREATE_VM)
  - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
  - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
  - VCPU creation (KVM_CREATE_VCPU)
  - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
  - New: Initialize guest memory as boot state and extend the measurement with
    the memory.  KVM_TDX_INIT_MEM_REGION.
  - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
    TDX VM contents.
  - VCPU RUN (KVM_VCPU_RUN)
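
From the userspace VMM point of view, the flow could look roughly like the
sketch below.  struct kvm_tdx_cmd, its field layout and the KVM_TDX_* command
IDs are illustrative stand-ins for the uapi definitions this series adds.

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Illustrative command IDs and argument structure. */
  enum kvm_tdx_cmd_id {
          KVM_TDX_GET_CAPABILITY,
          KVM_TDX_INIT_VM,
          KVM_TDX_INIT_VCPU,
          KVM_TDX_INIT_MEM_REGION,
          KVM_TDX_FINALIZE,
  };

  struct kvm_tdx_cmd {
          __u32 id;       /* one of the command IDs above */
          __u32 flags;
          __u64 data;     /* command-specific payload (pointer or value) */
  };

  static void create_td(int vm_fd, int vcpu_fd)
  {
          struct kvm_tdx_cmd cmd = { .id = KVM_TDX_GET_CAPABILITY };

          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);    /* TDX system parameters */

          cmd.id = KVM_TDX_INIT_VM;
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);    /* TD-wide parameters */

          /* KVM_CREATE_VCPU is issued via the usual KVM path before this. */
          cmd.id = KVM_TDX_INIT_VCPU;
          ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);  /* per-vcpu parameters */

          cmd.id = KVM_TDX_INIT_MEM_REGION;
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);    /* load + measure boot image */

          cmd.id = KVM_TDX_FINALIZE;
          ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);    /* complete the measurement */
  }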

- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it; for example, it can't access CPU registers, inject
exceptions, or access guest memory.  Those operations are silently ignored,
returning zero or the initial reset value when requested via KVM API ioctls.

    VM/VCPU state and callbacks for TDX specific operations.
    Define tdx specific VM state and VCPU state instead of VMX ones.  Redirect
    operations to TDX specific callbacks.  "if (tdx) tdx_op() else vmx_op()".

    Operations on the CPU state
    Silently ignore operations on the guest state.  For example, a write to
    CPU registers is ignored and a read from CPU registers returns 0.

    . ignore access to CPU registers except for allowed ones.
    . TSC: add a check whether the TSC is immutable and return an error,
      because the KVM implementation updates the internal TSC state and it's
      difficult to back out those changes.  Instead, skip the logic.
    . dirty logging: add a check whether dirty logging is supported.
    . exceptions/SMI/MCE/SIPI/INIT: silently ignore

    Note: virtual external interrupt and NMI can be injected into TDX guests.

- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate
whether the guest physical address is private (the bit is cleared) or shared
(the bit is set).  This bit is called a stolen bit.

  - Stolen bits framework
    Systematically track which guest physical addresses, shared or private,
    are used.

  - Shared EPT and Secure EPT
    There are two EPTs: the Shared EPT (the conventional one) and the Secure
    EPT (the new one).  The Shared EPT is handled as before for GPAs with the
    stolen bit set.  The Secure EPT points to private guest pages.  To resolve
    an EPT violation, KVM walks one of the two EPTs based on the faulted GPA.
    Because it's costly to access the Secure EPT with SEAMCALLs while walking
    EPTs for private guest physical addresses, another private EPT is used as
    a mirror of the Secure EPT with the existing logic, at the cost of extra
    memory.

The following depicts the relationship.

                    KVM                             |       TDX module
                     |                              |           |
        -------------+----------                    |           |
        |                      |                    |           |
        V                      V                    |           |
     shared GPA           private GPA               |           |
  CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
        |                      |                    |           |
        |                      |                    |           |
        V                      V                    |           V
  shared EPT                private EPT<-------mirror----->Secure EPT
        |                      |                    |           |
        |                      \--------------------+------\    |
        |                                           |      |    |
        V                                           |      V    V
  shared guest page                                 |    private guest page
                                                    |
                                                    |
                              non-encrypted memory  |    encrypted memory
                                                    |

  - Operating on Secure EPT
    Use the TDX module APIs to operate on the Secure EPT.  To call the TDX APIs
    while resolving EPT violations, add hooks for the additional operations and
    wire them to the TDX backend.
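
As a sketch, such a hook could look like the following; the hook name, the
tdh_mem_page_aug() wrapper and the kvm_tdx_tdr() accessor are illustrative,
not the final API of this series.

  /*
   * Called by the TDP MMU when a leaf SPTE for a private GPA is installed,
   * propagating the change to the Secure EPT with a SEAMCALL.
   */
  static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
                                       enum pg_level level, kvm_pfn_t pfn)
  {
          gpa_t gpa = gfn_to_gpa(gfn);
          hpa_t hpa = pfn_to_hpa(pfn);

          /* Ask the TDX module to map the page into the TD's Secure EPT. */
          return tdh_mem_page_aug(kvm_tdx_tdr(kvm), gpa, hpa);
  }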

* References

[1] TDX specification
   https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
[3] Intel CPU Architectural Extensions Specification
   https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 EAS
   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
[5] Intel TDX Loader Interface Specification
   https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
[7] Intel TDX Virtual Firmware Design Guide
   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
[8] intel public github
   kvm TDX branch: https://github.com/intel/tdx/tree/kvm
   TDX guest branch: https://github.com/intel/tdx/tree/guest
   qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
    https://github.com/tianocore/edk2-staging/tree/TDVF


Chao Gao (1):
  KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
    wrmsr

Isaku Yamahata (73):
  x86/virt/tdx: export platform_has_tdx
  KVM: TDX: Detect CPU feature on kernel module initialization
  KVM: x86: Refactor KVM VMX module init/exit functions
  KVM: TDX: Add placeholders for TDX VM/vcpu structure
  x86/virt/tdx: Add a helper function to return system wide info about
    TDX module
  KVM: TDX: Add a function to initialize TDX module
  KVM: TDX: Make TDX VM type supported
  [MARKER] The start of TDX KVM patch series: TDX architectural
    definitions
  KVM: TDX: Define TDX architectural definitions
  KVM: TDX: Add a function for KVM to invoke SEAMCALL
  KVM: TDX: add a helper function for KVM to issue SEAMCALL
  KVM: TDX: Add helper functions to print TDX SEAMCALL error
  [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
  KVM: TDX: allocate per-package mutex
  x86/cpu: Add helper functions to allocate/free MKTME keyid
  KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  [MARKER] The start of TDX KVM patch series: TD vcpu
    creation/destruction
  KVM: TDX: allocate/free TDX vcpu structure
  [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits
  KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
    TDX
  KVM: x86/mmu: Disallow fast page fault on private GPA
  [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
  KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
  KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
  KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  [MARKER] The start of TDX KVM patch series: TDX EPT violation
  KVM: TDX: TDP MMU TDX support
  [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
  KVM: x86/mmu: steal software usable bit for EPT to represent shared
    page
  KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping
  KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT
  KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
  KVM: x86/mmu: Forcibly use TDP MMU for TDX
  [MARKER] The start of TDX KVM patch series: TD finalization
  KVM: TDX: Create initial guest memory
  KVM: TDX: Finalize VM initialization
  [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
  KVM: TDX: Add helper assembly function to TDX vcpu
  KVM: TDX: Implement TDX vcpu enter/exit path
  KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  KVM: TDX: restore host xsave state when exit from the guest TD
  KVM: TDX: restore user ret MSRs
  [MARKER] The start of TDX KVM patch series: TD vcpu
    exits/interrupts/hypercalls
  KVM: TDX: complete interrupts after tdexit
  KVM: TDX: restore debug store when TD exit
  KVM: TDX: handle vcpu migration over logical processor
  KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the
    guest TD
  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
    behavior
  KVM: TDX: Implement interrupt injection
  KVM: TDX: Implements vcpu request_immediate_exit
  KVM: TDX: Implement methods to inject NMI
  KVM: TDX: Add a place holder to handle TDX VM exit
  KVM: TDX: handle EXIT_REASON_OTHER_SMI
  KVM: TDX: handle ept violation/misconfig exit
  KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
  KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
  KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
  KVM: TDX: Handle TDX PV CPUID hypercall
  KVM: TDX: Handle TDX PV HLT hypercall
  KVM: TDX: Handle TDX PV port io hypercall
  KVM: TDX: Implement callbacks for MSR operations for TDX
  KVM: TDX: Handle TDX PV rdmsr hypercall
  KVM: TDX: Handle TDX PV wrmsr hypercall
  KVM: TDX: Handle TDX PV report fatal error hypercall
  KVM: TDX: Handle TDX PV map_gpa hypercall
  KVM: TDX: Silently discard SMI request
  KVM: TDX: Silently ignore INIT/SIPI
  Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
  KVM: x86: design documentation on TDX support of x86 KVM TDP MMU

Kai Huang (1):
  KVM: x86: Introduce hooks to free VM callback prezap and vm_free

Rick Edgecombe (1):
  KVM: x86: Add infrastructure for stolen GPA bits

Sean Christopherson (26):
  KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
  KVM: Enable hardware before doing arch VM initialization
  KVM: x86: Introduce vm_type to differentiate default VMs from
    confidential VMs
  KVM: TDX: Add TDX "architectural" error codes
  KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
  KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
  KVM: Add max_vcpus field in common 'struct kvm'
  KVM: TDX: create/destroy VM structure
  KVM: TDX: Do TDX specific vcpu initialization
  KVM: x86/mmu: Disallow dirty logging for x86 TDX
  KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
  KVM: x86/mmu: Allow non-zero init value for shadow PTE
  KVM: x86/mmu: Allow per-VM override of the TDP max page level
  KVM: VMX: Split out guts of EPT violation to common/exposed function
  KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  KVM: TDX: Add load_mmu_pgd method for TDX
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  KVM: x86: Add option to force LAPIC expiration wait
  KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
    argument
  KVM: VMX: Move NMI/exception handler to common helper
  KVM: x86: Split core of hypercall emulation to helper function
  KVM: TDX: Add a placeholder for handler of TDX hypercalls
    (TDG.VP.VMCALL)
  KVM: TDX: Handle TDX PV MMIO hypercall
  KVM: TDX: Add methods to ignore accesses to CPU state

Xiaoyao Li (1):
  KVM: TDX: initialize VM with TDX specific parameters

Yuan Yao (1):
  KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c

 Documentation/virt/kvm/api.rst                |   24 +-
 .../virt/kvm/intel-tdx-layer-status.rst       |   33 +
 Documentation/virt/kvm/intel-tdx.rst          |  360 +++
 Documentation/virt/kvm/tdx-tdp-mmu.rst        |  466 ++++
 arch/arm64/include/asm/kvm_host.h             |    3 -
 arch/arm64/kvm/arm.c                          |    6 +-
 arch/arm64/kvm/vgic/vgic-init.c               |    6 +-
 arch/x86/events/intel/ds.c                    |    1 +
 arch/x86/include/asm/kvm-x86-ops.h            |    5 +
 arch/x86/include/asm/kvm_host.h               |   38 +-
 arch/x86/include/asm/tdx.h                    |   61 +
 arch/x86/include/asm/vmx.h                    |    2 +
 arch/x86/include/uapi/asm/kvm.h               |   59 +
 arch/x86/include/uapi/asm/vmx.h               |    5 +-
 arch/x86/kvm/Kconfig                          |    4 +
 arch/x86/kvm/Makefile                         |    3 +-
 arch/x86/kvm/lapic.c                          |   25 +-
 arch/x86/kvm/lapic.h                          |    2 +-
 arch/x86/kvm/mmu.h                            |   65 +-
 arch/x86/kvm/mmu/mmu.c                        |  232 +-
 arch/x86/kvm/mmu/mmu_internal.h               |   84 +
 arch/x86/kvm/mmu/paging_tmpl.h                |   25 +-
 arch/x86/kvm/mmu/spte.c                       |   48 +-
 arch/x86/kvm/mmu/spte.h                       |   40 +-
 arch/x86/kvm/mmu/tdp_iter.h                   |    2 +-
 arch/x86/kvm/mmu/tdp_mmu.c                    |  642 ++++-
 arch/x86/kvm/mmu/tdp_mmu.h                    |   16 +-
 arch/x86/kvm/svm/svm.c                        |   10 +-
 arch/x86/kvm/vmx/common.h                     |  155 ++
 arch/x86/kvm/vmx/main.c                       | 1026 ++++++++
 arch/x86/kvm/vmx/posted_intr.c                |    8 +-
 arch/x86/kvm/vmx/seamcall.S                   |   55 +
 arch/x86/kvm/vmx/seamcall.h                   |   25 +
 arch/x86/kvm/vmx/tdx.c                        | 2337 +++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                        |  253 ++
 arch/x86/kvm/vmx/tdx_arch.h                   |  158 ++
 arch/x86/kvm/vmx/tdx_errno.h                  |   29 +
 arch/x86/kvm/vmx/tdx_error.c                  |   22 +
 arch/x86/kvm/vmx/tdx_ops.h                    |  174 ++
 arch/x86/kvm/vmx/vmenter.S                    |  146 +
 arch/x86/kvm/vmx/vmx.c                        |  619 ++---
 arch/x86/kvm/vmx/x86_ops.h                    |  235 ++
 arch/x86/kvm/x86.c                            |  123 +-
 arch/x86/kvm/x86.h                            |    8 +
 arch/x86/virt/tdxcall.S                       |    8 +-
 arch/x86/virt/vmx/tdx.c                       |   50 +-
 arch/x86/virt/vmx/tdx.h                       |   52 -
 include/linux/kvm_host.h                      |    2 +
 include/uapi/linux/kvm.h                      |    1 +
 tools/arch/x86/include/uapi/asm/kvm.h         |   59 +
 tools/include/uapi/linux/kvm.h                |    1 +
 virt/kvm/kvm_main.c                           |   35 +-
 52 files changed, 7142 insertions(+), 706 deletions(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst
 create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
 create mode 100644 arch/x86/kvm/vmx/common.h
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/seamcall.S
 create mode 100644 arch/x86/kvm/vmx/seamcall.h
 create mode 100644 arch/x86/kvm/vmx/tdx.c
 create mode 100644 arch/x86/kvm/vmx/tdx.h
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h

-- 
2.25.1



* [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 13:45   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 002/104] x86/virt/tdx: export platform_has_tdx isaku.yamahata
                   ` (104 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

KVM accesses the Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on a VM.  TDX instead defines its own data structures and TDX
SEAMCALL APIs for the VMM to operate on a Trust Domain (TD).

Trust Domain Virtual Processor State (TDVPS) is the root control structure
of a TD VCPU.  It helps the TDX module control the operation of the VCPU,
and holds the VCPU state while the VCPU is not running. TDVPS is opaque to
software and DMA access, accessible only by using the TDX module interface
functions (such as TDH.VP.RD, TDH.VP.WR, ...).  TDVPS includes TD VMCS, and
TD VMCS auxiliary structures, such as virtual APIC page, virtualization
exception information, etc.  TDVPS is composed of Trust Domain Virtual
Processor Root (TDVPR) which is the root page of TDVPS and Trust Domain
Virtual Processor eXtension (TDVPX) pages which extend TDVPR to help
provide enough physical space for the logical TDVPS structure.

Also, we have a new structure, the Trust Domain Control Structure (TDCS), which
is the main control structure of a guest TD and is encrypted (using the guest
TD's ephemeral private key).  At a high level, TDCS holds information for
controlling TD operation as a whole: execution controls, EPTP, MSR bitmaps,
etc., which KVM needs to set up.  Note that MSR bitmaps are held as part of
TDCS (unlike
VMX) because they are meant to have the same value for all VCPUs of the
same TD.  TDCS is a multi-page logical structure composed of multiple Trust
Domain Control Extension (TDCX) physical pages.  Trust Domain Root (TDR) is
the root control structure of a guest TD and is encrypted using the TDX
global private key. It holds a minimal set of state variables that enable
guest TD control even during times when the TD's private key is not known,
or when the TD's key management state does not permit access to memory
encrypted using the TD's private key.

The following shows the relationship between those structures.

        TDR--> TDCS                     per-TD
         |       \--> TDCX
         \
          \--> TDVPS                    per-TD VCPU
                 \--> TDVPR and TDVPX

The existing global struct kvm_x86_ops already defines an interface which
fits with TDX.  But kvm_x86_ops is system-wide, not a per-VM structure.  To
allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
"if (tdx) tdx_op() else vmx_op()" to switch VMX or TDX at run time.

To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
both VMs and TDs.  Use 'vt' for the naming scheme as a nod to VT-x and as a
concatenation of VmxTdx.

The current code looks as follows.
In vmx.c
  static vmx_op() { ... }
  static struct kvm_x86_ops vmx_x86_ops = {
        .op = vmx_op,
  };
  initialization code

The eventually converted code will look like
In vmx.c, keep the VMX operations.
  vmx_op() { ... }
  VMX initialization
In tdx.c, define the TDX operations.
  tdx_op() { ... }
  TDX initialization
In x86_ops.h, declare the VMX and TDX operations.
  vmx_op();
  tdx_op();
In main.c, define common wrappers for VMX and TDX.
  static vt_op() { if (tdx) tdx_op() else vmx_op() }
  static struct kvm_x86_ops vt_x86_ops = {
        .op = vt_op,
  };
  initialization to call VMX and TDX initialization

Opportunistically, fix the naming inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile      |   2 +-
 arch/x86/kvm/vmx/main.c    | 154 ++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 360 +++++++++++--------------------------
 arch/x86/kvm/vmx/x86_ops.h | 126 +++++++++++++
 4 files changed, 385 insertions(+), 257 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 30f244b64523..ee4d0999f20f 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -22,7 +22,7 @@ kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
 kvm-$(CONFIG_KVM_XEN)	+= xen.o
 
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
-			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
+			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..b08ea9c42a11
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,154 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+	.name = "kvm_intel",
+
+	.hardware_unsetup = vmx_hardware_unsetup,
+
+	.hardware_enable = vmx_hardware_enable,
+	.hardware_disable = vmx_hardware_disable,
+	.cpu_has_accelerated_tpr = report_flexpriority,
+	.has_emulated_msr = vmx_has_emulated_msr,
+
+	.vm_size = sizeof(struct kvm_vmx),
+	.vm_init = vmx_vm_init,
+
+	.vcpu_create = vmx_vcpu_create,
+	.vcpu_free = vmx_vcpu_free,
+	.vcpu_reset = vmx_vcpu_reset,
+
+	.prepare_guest_switch = vmx_prepare_switch_to_guest,
+	.vcpu_load = vmx_vcpu_load,
+	.vcpu_put = vmx_vcpu_put,
+
+	.update_exception_bitmap = vmx_update_exception_bitmap,
+	.get_msr_feature = vmx_get_msr_feature,
+	.get_msr = vmx_get_msr,
+	.set_msr = vmx_set_msr,
+	.get_segment_base = vmx_get_segment_base,
+	.get_segment = vmx_get_segment,
+	.set_segment = vmx_set_segment,
+	.get_cpl = vmx_get_cpl,
+	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+	.set_cr0 = vmx_set_cr0,
+	.is_valid_cr4 = vmx_is_valid_cr4,
+	.set_cr4 = vmx_set_cr4,
+	.set_efer = vmx_set_efer,
+	.get_idt = vmx_get_idt,
+	.set_idt = vmx_set_idt,
+	.get_gdt = vmx_get_gdt,
+	.set_gdt = vmx_set_gdt,
+	.set_dr7 = vmx_set_dr7,
+	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+	.cache_reg = vmx_cache_reg,
+	.get_rflags = vmx_get_rflags,
+	.set_rflags = vmx_set_rflags,
+	.get_if_flag = vmx_get_if_flag,
+
+	.tlb_flush_all = vmx_flush_tlb_all,
+	.tlb_flush_current = vmx_flush_tlb_current,
+	.tlb_flush_gva = vmx_flush_tlb_gva,
+	.tlb_flush_guest = vmx_flush_tlb_guest,
+
+	.vcpu_pre_run = vmx_vcpu_pre_run,
+	.run = vmx_vcpu_run,
+	.handle_exit = vmx_handle_exit,
+	.skip_emulated_instruction = vmx_skip_emulated_instruction,
+	.update_emulated_instruction = vmx_update_emulated_instruction,
+	.set_interrupt_shadow = vmx_set_interrupt_shadow,
+	.get_interrupt_shadow = vmx_get_interrupt_shadow,
+	.patch_hypercall = vmx_patch_hypercall,
+	.set_irq = vmx_inject_irq,
+	.set_nmi = vmx_inject_nmi,
+	.queue_exception = vmx_queue_exception,
+	.cancel_injection = vmx_cancel_injection,
+	.interrupt_allowed = vmx_interrupt_allowed,
+	.nmi_allowed = vmx_nmi_allowed,
+	.get_nmi_mask = vmx_get_nmi_mask,
+	.set_nmi_mask = vmx_set_nmi_mask,
+	.enable_nmi_window = vmx_enable_nmi_window,
+	.enable_irq_window = vmx_enable_irq_window,
+	.update_cr8_intercept = vmx_update_cr8_intercept,
+	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+	.load_eoi_exitmap = vmx_load_eoi_exitmap,
+	.apicv_post_state_restore = vmx_apicv_post_state_restore,
+	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
+	.hwapic_irr_update = vmx_hwapic_irr_update,
+	.hwapic_isr_update = vmx_hwapic_isr_update,
+	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+	.sync_pir_to_irr = vmx_sync_pir_to_irr,
+	.deliver_interrupt = vmx_deliver_interrupt,
+	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+	.set_tss_addr = vmx_set_tss_addr,
+	.set_identity_map_addr = vmx_set_identity_map_addr,
+	.get_mt_mask = vmx_get_mt_mask,
+
+	.get_exit_info = vmx_get_exit_info,
+
+	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+	.write_tsc_offset = vmx_write_tsc_offset,
+	.write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+	.load_mmu_pgd = vmx_load_mmu_pgd,
+
+	.check_intercept = vmx_check_intercept,
+	.handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+	.request_immediate_exit = vmx_request_immediate_exit,
+
+	.sched_in = vmx_sched_in,
+
+	.cpu_dirty_log_size = PML_ENTITY_NUM,
+	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+	.pmu_ops = &intel_pmu_ops,
+	.nested_ops = &vmx_nested_ops,
+
+	.update_pi_irte = pi_update_irte,
+	.start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+	.set_hv_timer = vmx_set_hv_timer,
+	.cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+	.setup_mce = vmx_setup_mce,
+
+	.smi_allowed = vmx_smi_allowed,
+	.enter_smm = vmx_enter_smm,
+	.leave_smm = vmx_leave_smm,
+	.enable_smi_window = vmx_enable_smi_window,
+
+	.can_emulate_instruction = vmx_can_emulate_instruction,
+	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+	.migrate_timers = vmx_migrate_timers,
+
+	.msr_filter_changed = vmx_msr_filter_changed,
+	.complete_emulated_msr = kvm_complete_insn_gp,
+
+	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+	.cpu_has_kvm_support = vmx_cpu_has_kvm_support,
+	.disabled_by_bios = vmx_disabled_by_bios,
+	.check_processor_compatibility = vmx_check_processor_compat,
+	.hardware_setup = vmx_hardware_setup,
+	.handle_intel_pt_intr = NULL,
+
+	.runtime_ops = &vt_x86_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index efda5e4d6247..f6f5d0dac579 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -66,6 +66,7 @@
 #include "vmcs12.h"
 #include "vmx.h"
 #include "x86.h"
+#include "x86_ops.h"
 
 MODULE_AUTHOR("Qumranet");
 MODULE_LICENSE("GPL");
@@ -541,7 +542,7 @@ static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu)
 	return flexpriority_enabled && lapic_in_kernel(vcpu);
 }
 
-static inline bool report_flexpriority(void)
+bool report_flexpriority(void)
 {
 	return flexpriority_enabled;
 }
@@ -1316,7 +1317,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
  */
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -1327,7 +1328,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx->host_debugctlmsr = get_debugctlmsr();
 }
 
-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	vmx_vcpu_pi_put(vcpu);
 
@@ -1381,7 +1382,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 		vmx->emulation_required = vmx_emulation_required(vcpu);
 }
 
-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
 {
 	return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
 }
@@ -1487,8 +1488,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
 	return 0;
 }
 
-static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
-					void *insn, int insn_len)
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+				void *insn, int insn_len)
 {
 	/*
 	 * Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1572,7 +1573,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
  * Recognizes a pending MTF VM-exit and records the nested state for later
  * delivery.
  */
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1595,7 +1596,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 		vmx->nested.mtf_pending = false;
 }
 
-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	vmx_update_emulated_instruction(vcpu);
 	return skip_emulated_instruction(vcpu);
@@ -1614,7 +1615,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
-static void vmx_queue_exception(struct kvm_vcpu *vcpu)
+void vmx_queue_exception(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned nr = vcpu->arch.exception.nr;
@@ -1727,12 +1728,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
 	return kvm_default_tsc_scaling_ratio;
 }
 
-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
 }
 
-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
 {
 	vmcs_write64(TSC_MULTIPLIER, multiplier);
 }
@@ -1756,7 +1757,7 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
 	return !(val & ~valid_bits);
 }
 
-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
 {
 	switch (msr->index) {
 	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
@@ -1776,7 +1777,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_uret_msr *msr;
@@ -1954,7 +1955,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_uret_msr *msr;
@@ -2267,7 +2268,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return ret;
 }
 
-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
 	unsigned long guest_owned_bits;
 
@@ -2310,12 +2311,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 	}
 }
 
-static __init int cpu_has_kvm_support(void)
+__init int vmx_cpu_has_kvm_support(void)
 {
 	return cpu_has_vmx();
 }
 
-static __init int vmx_disabled_by_bios(void)
+__init int vmx_disabled_by_bios(void)
 {
 	return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
 	       !boot_cpu_has(X86_FEATURE_VMX);
@@ -2341,7 +2342,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
 	return -EFAULT;
 }
 
-static int hardware_enable(void)
+int vmx_hardware_enable(void)
 {
 	int cpu = raw_smp_processor_id();
 	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2382,7 +2383,7 @@ static void vmclear_local_loaded_vmcss(void)
 		__loaded_vmcs_clear(v);
 }
 
-static void hardware_disable(void)
+void vmx_hardware_disable(void)
 {
 	vmclear_local_loaded_vmcss();
 
@@ -2924,7 +2925,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 
 #endif
 
-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -2954,7 +2955,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
 	return to_vmx(vcpu)->vpid;
 }
 
-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u64 root_hpa = mmu->root_hpa;
@@ -2970,7 +2971,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 		vpid_sync_context(vmx_get_current_vpid(vcpu));
 }
 
-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 {
 	/*
 	 * vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -2979,7 +2980,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 	vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
 }
 
-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3134,8 +3135,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 	return eptp;
 }
 
-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
-			     int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 {
 	struct kvm *kvm = vcpu->kvm;
 	bool update_guest_cr3 = true;
@@ -3163,8 +3163,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	/*
 	 * We operate under the default treatment of SMM, so VMX cannot be
@@ -3280,7 +3279,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	var->g = (ar >> 15) & 1;
 }
 
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
 	struct kvm_segment s;
 
@@ -3360,14 +3359,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
 }
 
-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 {
 	__vmx_set_segment(vcpu, var, seg);
 
 	to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
 }
 
-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
 	u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
 
@@ -3375,25 +3374,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 	*l = (ar >> 13) & 1;
 }
 
-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
 	dt->address = vmcs_readl(GUEST_IDTR_BASE);
 }
 
-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_IDTR_BASE, dt->address);
 }
 
-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
 	dt->address = vmcs_readl(GUEST_GDTR_BASE);
 }
 
-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -3889,7 +3888,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
 	}
 }
 
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	void *vapic_page;
@@ -3909,7 +3908,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	return ((rvi & 0xf0) > (vppr & 0xf0));
 }
 
-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 i;
@@ -4041,8 +4040,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	return 0;
 }
 
-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
-				  int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector)
 {
 	struct kvm_vcpu *vcpu = apic->vcpu;
 
@@ -4185,7 +4184,7 @@ static u32 vmx_vmexit_ctrl(void)
 		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
 }
 
-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4508,7 +4507,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmx->pi_desc.sn = 1;
 }
 
-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4565,12 +4564,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vpid_sync_context(vmx->vpid);
 }
 
-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
 }
 
-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
 {
 	if (!enable_vnmi ||
 	    vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4581,7 +4580,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
 	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
 }
 
-static void vmx_inject_irq(struct kvm_vcpu *vcpu)
+void vmx_inject_irq(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	uint32_t intr;
@@ -4609,7 +4608,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
 	vmx_clear_hlt(vcpu);
 }
 
-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4687,7 +4686,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
 		 GUEST_INTR_STATE_NMI));
 }
 
-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return -EBUSY;
@@ -4709,7 +4708,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
 
-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return -EBUSY;
@@ -4724,7 +4723,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !vmx_interrupt_blocked(vcpu);
 }
 
-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
 	void __user *ret;
 
@@ -4744,7 +4743,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 	return init_rmode_tss(kvm, ret);
 }
 
-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
 {
 	to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
 	return 0;
@@ -5023,8 +5022,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
 	return kvm_fast_pio(vcpu, size, port, in);
 }
 
-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
 {
 	/*
 	 * Patch in the VMCALL instruction:
@@ -5234,7 +5232,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 	return kvm_complete_insn_gp(vcpu, err);
 }
 
-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 {
 	get_debugreg(vcpu->arch.db[0], 0);
 	get_debugreg(vcpu->arch.db[1], 1);
@@ -5253,7 +5251,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 	set_debugreg(DR6_RESERVED, 6);
 }
 
-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vmcs_writel(GUEST_DR7, val);
 }
@@ -5519,7 +5517,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	if (vmx_emulation_required_with_pending_exception(vcpu)) {
 		kvm_prepare_emulation_failure_exit(vcpu);
@@ -5756,9 +5754,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
-			      u64 *info1, u64 *info2,
-			      u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6191,7 +6188,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	return 0;
 }
 
-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 {
 	int ret = __vmx_handle_exit(vcpu, exit_fastpath);
 
@@ -6279,7 +6276,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 		: "eax", "ebx", "ecx", "edx");
 }
 
-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	int tpr_threshold;
@@ -6349,7 +6346,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 	vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
 
@@ -6377,7 +6374,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	put_page(page);
 }
 
-static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
 {
 	u16 status;
 	u8 old;
@@ -6411,7 +6408,7 @@ static void vmx_set_rvi(int vector)
 	}
 }
 
-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 {
 	/*
 	 * When running L2, updating RVI is only relevant when
@@ -6425,7 +6422,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 		vmx_set_rvi(max_irr);
 }
 
-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int max_irr;
@@ -6471,7 +6468,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	return max_irr;
 }
 
-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 {
 	if (!kvm_vcpu_apicv_active(vcpu))
 		return;
@@ -6482,7 +6479,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 	vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
 }
 
-static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6554,7 +6551,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
 }
 
-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6571,7 +6568,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
  * The kvm parameter can be NULL (module initialization, or invocation before
  * VM creation). Be sure to check the kvm parameter before using it.
  */
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
 {
 	switch (index) {
 	case MSR_IA32_SMBASE:
@@ -6692,7 +6689,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 				  IDT_VECTORING_ERROR_CODE);
 }
 
-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
 	__vmx_complete_interrupts(vcpu,
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -6788,7 +6785,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	guest_state_exit_irqoff();
 }
 
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long cr4;
@@ -6969,7 +6966,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	return vmx_exit_handlers_fastpath(vcpu);
 }
 
-static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6980,7 +6977,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
 	free_loaded_vmcs(vmx->loaded_vmcs);
 }
 
-static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct vmx_uret_msr *tsx_ctrl;
 	struct vcpu_vmx *vmx;
@@ -7085,7 +7082,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
 {
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
@@ -7116,7 +7113,7 @@ static int vmx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
-static int __init vmx_check_processor_compat(void)
+int __init vmx_check_processor_compat(void)
 {
 	struct vmcs_config vmcs_conf;
 	struct vmx_capability vmx_cap;
@@ -7139,7 +7136,7 @@ static int __init vmx_check_processor_compat(void)
 	return 0;
 }
 
-static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
 
@@ -7328,7 +7325,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -7433,7 +7430,7 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
 }
 
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->req_immediate_exit = true;
 }
@@ -7472,10 +7469,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
 	return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
 }
 
-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
-			       struct x86_instruction_info *info,
-			       enum x86_intercept_stage stage,
-			       struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+		       struct x86_instruction_info *info,
+		       enum x86_intercept_stage stage,
+		       struct x86_exception *exception)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
@@ -7540,8 +7537,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
 	return 0;
 }
 
-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
-			    bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+		bool *expired)
 {
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -7580,13 +7577,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	return 0;
 }
 
-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->hv_deadline_tsc = -1;
 }
 #endif
 
-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	if (!kvm_pause_in_guest(vcpu->kvm))
 		shrink_ple_window(vcpu);
@@ -7612,7 +7609,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 		secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
 }
 
-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.mcg_cap & MCG_LMCE_P)
 		to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -7622,7 +7619,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 			~FEAT_CTL_LMCE_ENABLED;
 }
 
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	/* we need a nested vmexit to enter SMM, postpone if run is pending */
 	if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -7630,7 +7627,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !is_smm(vcpu);
 }
 
-static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -7644,7 +7641,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
 	return 0;
 }
 
-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int ret;
@@ -7665,17 +7662,17 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
 	return 0;
 }
 
-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
 {
 	/* RSM will cause a vmexit anyway.  */
 }
 
-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
 }
 
-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 {
 	if (is_guest_mode(vcpu)) {
 		struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -7685,7 +7682,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
 {
 	kvm_set_posted_intr_wakeup_handler(NULL);
 
@@ -7695,7 +7692,7 @@ static void hardware_unsetup(void)
 	free_kvm_area();
 }
 
-static bool vmx_check_apicv_inhibit_reasons(ulong bit)
+bool vmx_check_apicv_inhibit_reasons(ulong bit)
 {
 	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
 			  BIT(APICV_INHIBIT_REASON_ABSENT) |
@@ -7705,143 +7702,6 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
 	return supported & BIT(bit);
 }
 
-static struct kvm_x86_ops vmx_x86_ops __initdata = {
-	.name = "kvm_intel",
-
-	.hardware_unsetup = hardware_unsetup,
-
-	.hardware_enable = hardware_enable,
-	.hardware_disable = hardware_disable,
-	.cpu_has_accelerated_tpr = report_flexpriority,
-	.has_emulated_msr = vmx_has_emulated_msr,
-
-	.vm_size = sizeof(struct kvm_vmx),
-	.vm_init = vmx_vm_init,
-
-	.vcpu_create = vmx_create_vcpu,
-	.vcpu_free = vmx_free_vcpu,
-	.vcpu_reset = vmx_vcpu_reset,
-
-	.prepare_guest_switch = vmx_prepare_switch_to_guest,
-	.vcpu_load = vmx_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
-
-	.update_exception_bitmap = vmx_update_exception_bitmap,
-	.get_msr_feature = vmx_get_msr_feature,
-	.get_msr = vmx_get_msr,
-	.set_msr = vmx_set_msr,
-	.get_segment_base = vmx_get_segment_base,
-	.get_segment = vmx_get_segment,
-	.set_segment = vmx_set_segment,
-	.get_cpl = vmx_get_cpl,
-	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
-	.set_cr0 = vmx_set_cr0,
-	.is_valid_cr4 = vmx_is_valid_cr4,
-	.set_cr4 = vmx_set_cr4,
-	.set_efer = vmx_set_efer,
-	.get_idt = vmx_get_idt,
-	.set_idt = vmx_set_idt,
-	.get_gdt = vmx_get_gdt,
-	.set_gdt = vmx_set_gdt,
-	.set_dr7 = vmx_set_dr7,
-	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
-	.cache_reg = vmx_cache_reg,
-	.get_rflags = vmx_get_rflags,
-	.set_rflags = vmx_set_rflags,
-	.get_if_flag = vmx_get_if_flag,
-
-	.tlb_flush_all = vmx_flush_tlb_all,
-	.tlb_flush_current = vmx_flush_tlb_current,
-	.tlb_flush_gva = vmx_flush_tlb_gva,
-	.tlb_flush_guest = vmx_flush_tlb_guest,
-
-	.vcpu_pre_run = vmx_vcpu_pre_run,
-	.run = vmx_vcpu_run,
-	.handle_exit = vmx_handle_exit,
-	.skip_emulated_instruction = vmx_skip_emulated_instruction,
-	.update_emulated_instruction = vmx_update_emulated_instruction,
-	.set_interrupt_shadow = vmx_set_interrupt_shadow,
-	.get_interrupt_shadow = vmx_get_interrupt_shadow,
-	.patch_hypercall = vmx_patch_hypercall,
-	.set_irq = vmx_inject_irq,
-	.set_nmi = vmx_inject_nmi,
-	.queue_exception = vmx_queue_exception,
-	.cancel_injection = vmx_cancel_injection,
-	.interrupt_allowed = vmx_interrupt_allowed,
-	.nmi_allowed = vmx_nmi_allowed,
-	.get_nmi_mask = vmx_get_nmi_mask,
-	.set_nmi_mask = vmx_set_nmi_mask,
-	.enable_nmi_window = vmx_enable_nmi_window,
-	.enable_irq_window = vmx_enable_irq_window,
-	.update_cr8_intercept = vmx_update_cr8_intercept,
-	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
-	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
-	.load_eoi_exitmap = vmx_load_eoi_exitmap,
-	.apicv_post_state_restore = vmx_apicv_post_state_restore,
-	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
-	.hwapic_irr_update = vmx_hwapic_irr_update,
-	.hwapic_isr_update = vmx_hwapic_isr_update,
-	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
-	.sync_pir_to_irr = vmx_sync_pir_to_irr,
-	.deliver_interrupt = vmx_deliver_interrupt,
-	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
-	.set_tss_addr = vmx_set_tss_addr,
-	.set_identity_map_addr = vmx_set_identity_map_addr,
-	.get_mt_mask = vmx_get_mt_mask,
-
-	.get_exit_info = vmx_get_exit_info,
-
-	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
-	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
-	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
-	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
-	.write_tsc_offset = vmx_write_tsc_offset,
-	.write_tsc_multiplier = vmx_write_tsc_multiplier,
-
-	.load_mmu_pgd = vmx_load_mmu_pgd,
-
-	.check_intercept = vmx_check_intercept,
-	.handle_exit_irqoff = vmx_handle_exit_irqoff,
-
-	.request_immediate_exit = vmx_request_immediate_exit,
-
-	.sched_in = vmx_sched_in,
-
-	.cpu_dirty_log_size = PML_ENTITY_NUM,
-	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
-	.pmu_ops = &intel_pmu_ops,
-	.nested_ops = &vmx_nested_ops,
-
-	.update_pi_irte = pi_update_irte,
-	.start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
-	.set_hv_timer = vmx_set_hv_timer,
-	.cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
-	.setup_mce = vmx_setup_mce,
-
-	.smi_allowed = vmx_smi_allowed,
-	.enter_smm = vmx_enter_smm,
-	.leave_smm = vmx_leave_smm,
-	.enable_smi_window = vmx_enable_smi_window,
-
-	.can_emulate_instruction = vmx_can_emulate_instruction,
-	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
-	.migrate_timers = vmx_migrate_timers,
-
-	.msr_filter_changed = vmx_msr_filter_changed,
-	.complete_emulated_msr = kvm_complete_insn_gp,
-
-	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
 static unsigned int vmx_handle_intel_pt_intr(void)
 {
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -7882,9 +7742,7 @@ static __init void vmx_setup_user_return_msrs(void)
 		kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
 }
 
-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
 {
 	unsigned long host_bndcfgs;
 	struct desc_ptr dt;
@@ -7944,16 +7802,16 @@ static __init int hardware_setup(void)
 	 * using the APIC_ACCESS_ADDR VMCS field.
 	 */
 	if (!flexpriority_enabled)
-		vmx_x86_ops.set_apic_access_page_addr = NULL;
+		vt_x86_ops.set_apic_access_page_addr = NULL;
 
 	if (!cpu_has_vmx_tpr_shadow())
-		vmx_x86_ops.update_cr8_intercept = NULL;
+		vt_x86_ops.update_cr8_intercept = NULL;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
 	    && enable_ept) {
-		vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		vmx_x86_ops.tlb_remote_flush_with_range =
+		vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
+		vt_x86_ops.tlb_remote_flush_with_range =
 				hv_remote_flush_tlb_with_range;
 	}
 #endif
@@ -7969,7 +7827,7 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_apicv())
 		enable_apicv = 0;
 	if (!enable_apicv)
-		vmx_x86_ops.sync_pir_to_irr = NULL;
+		vt_x86_ops.sync_pir_to_irr = NULL;
 
 	if (cpu_has_vmx_tsc_scaling()) {
 		kvm_has_tsc_control = true;
@@ -7996,7 +7854,7 @@ static __init int hardware_setup(void)
 		enable_pml = 0;
 
 	if (!enable_pml)
-		vmx_x86_ops.cpu_dirty_log_size = 0;
+		vt_x86_ops.cpu_dirty_log_size = 0;
 
 	if (!cpu_has_vmx_preemption_timer())
 		enable_preemption_timer = false;
@@ -8023,9 +7881,9 @@ static __init int hardware_setup(void)
 	}
 
 	if (!enable_preemption_timer) {
-		vmx_x86_ops.set_hv_timer = NULL;
-		vmx_x86_ops.cancel_hv_timer = NULL;
-		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+		vt_x86_ops.set_hv_timer = NULL;
+		vt_x86_ops.cancel_hv_timer = NULL;
+		vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
 	}
 
 	kvm_mce_cap_supported |= MCG_LMCE_P;
@@ -8035,9 +7893,9 @@ static __init int hardware_setup(void)
 	if (!enable_ept || !cpu_has_vmx_intel_pt())
 		pt_mode = PT_MODE_SYSTEM;
 	if (pt_mode == PT_MODE_HOST_GUEST)
-		vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+		vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
 	else
-		vmx_init_ops.handle_intel_pt_intr = NULL;
+		vt_init_ops.handle_intel_pt_intr = NULL;
 
 	setup_default_sgx_lepubkeyhash();
 
@@ -8061,16 +7919,6 @@ static __init int hardware_setup(void)
 	return r;
 }
 
-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
-	.cpu_has_kvm_support = cpu_has_kvm_support,
-	.disabled_by_bios = vmx_disabled_by_bios,
-	.check_processor_compatibility = vmx_check_processor_compat,
-	.hardware_setup = hardware_setup,
-	.handle_intel_pt_intr = NULL,
-
-	.runtime_ops = &vmx_x86_ops,
-};
-
 static void vmx_cleanup_l1d_flush(void)
 {
 	if (vmx_l1d_flush_pages) {
@@ -8149,7 +7997,7 @@ static int __init vmx_init(void)
 		}
 
 		if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
-			vmx_x86_ops.enable_direct_tlbflush
+			vt_x86_ops.enable_direct_tlbflush
 				= hv_enable_direct_tlbflush;
 
 	} else {
@@ -8157,8 +8005,8 @@ static int __init vmx_init(void)
 	}
 #endif
 
-	r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
-		     __alignof__(struct vcpu_vmx), THIS_MODULE);
+	r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
+		__alignof__(struct vcpu_vmx), THIS_MODULE);
 	if (r)
 		return r;
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..40c64fb1f505
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,126 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/virtext.h>
+
+#include "x86.h"
+
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+__init int vmx_cpu_has_kvm_support(void);
+__init int vmx_disabled_by_bios(void);
+int __init vmx_check_processor_compat(void);
+__init int vmx_hardware_setup(void);
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+void vmx_hardware_unsetup(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+bool report_flexpriority(void);
+int vmx_vm_init(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+				void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+			struct x86_instruction_info *info,
+			enum x86_intercept_stage stage,
+			struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(ulong bit);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_queue_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+		bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 002/104] x86/virt/tdx: export platform_has_tdx
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization isaku.yamahata
                   ` (103 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX KVM uses platform_has_tdx() via hardware_setup(), which is built as part
of the kvm_intel module.  Export the symbol so the module can use it.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/virt/vmx/tdx.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/x86/virt/vmx/tdx.c b/arch/x86/virt/vmx/tdx.c
index 60d58b2daabd..da4d1df95503 100644
--- a/arch/x86/virt/vmx/tdx.c
+++ b/arch/x86/virt/vmx/tdx.c
@@ -1630,3 +1630,4 @@ bool platform_has_tdx(void)
 {
 	return seamrr_enabled() && tdx_keyid_sufficient();
 }
+EXPORT_SYMBOL_GPL(platform_has_tdx);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 002/104] x86/virt/tdx: export platform_has_tdx isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 13:49   ` Paolo Bonzini
  2022-04-08 16:46   ` Sean Christopherson
  2022-03-04 19:48 ` [RFC PATCH v5 004/104] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
                   ` (102 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX requires several initialization steps for KVM to create guest TDs:
detect the CPU feature, enable VMX (TDX is based on VMX), detect TDX module
availability, and initialize the TDX module.  This patch implements the
first step, CPU feature detection.  Because VMX has not yet been enabled via
the VMXON instruction when the KVM kernel module is initialized, defer the
remaining initialization steps until VMX is enabled by the hardware_enable
callback.

Introduce a module parameter, enable_tdx, to explicitly enable TDX KVM
support.  It's off by default to keep the same behavior for those who don't
use TDX.  Implement CPU feature detection at KVM kernel module
initialization, in the hardware_setup callback, to check whether the CPU
feature is available and to read the relevant CPU parameters.
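
For reference, a minimal sketch of where the deferred part is expected to
hook in (illustrative only; vt_hardware_enable() is a hypothetical name and
is not added by this patch):

static int vt_hardware_enable(void)
{
        int ret = vmx_hardware_enable();

        /*
         * VMXON has run on this CPU at this point, so the remaining,
         * SEAMCALL-based TDX initialization can be done from here or from
         * any later path that runs with VMX enabled.
         */
        return ret;
}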

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile      |  1 +
 arch/x86/kvm/vmx/main.c    | 15 ++++++++++-
 arch/x86/kvm/vmx/tdx.c     | 53 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  6 +++++
 4 files changed, 74 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/vmx/tdx.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index ee4d0999f20f..e2c05195cb95 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -24,6 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b08ea9c42a11..b79fcc8d81dd 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,19 @@
 #include "nested.h"
 #include "pmu.h"
 
+static __init int vt_hardware_setup(void)
+{
+	int ret;
+
+	ret = vmx_hardware_setup();
+	if (ret)
+		return ret;
+
+	tdx_hardware_setup(&vt_x86_ops);
+
+	return 0;
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -147,7 +160,7 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
 	.cpu_has_kvm_support = vmx_cpu_has_kvm_support,
 	.disabled_by_bios = vmx_disabled_by_bios,
 	.check_processor_compatibility = vmx_check_processor_compat,
-	.hardware_setup = vmx_hardware_setup,
+	.hardware_setup = vt_hardware_setup,
 	.handle_intel_pt_intr = NULL,
 
 	.runtime_ops = &vt_x86_ops,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
new file mode 100644
index 000000000000..1acf08c310c4
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -0,0 +1,53 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/tdx.h>
+
+#include "capabilities.h"
+#include "x86_ops.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) "tdx: " fmt
+
+static bool __read_mostly enable_tdx = true;
+module_param_named(tdx, enable_tdx, bool, 0644);
+
+static u64 hkid_mask __ro_after_init;
+static u8 hkid_start_pos __ro_after_init;
+
+static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+	u32 max_pa;
+
+	if (!enable_ept) {
+		pr_warn("Cannot enable TDX with EPT disabled\n");
+		return -EINVAL;
+	}
+
+	if (!platform_has_tdx()) {
+		pr_warn("Cannot enable TDX with SEAMRR disabled\n");
+		return -ENODEV;
+	}
+
+	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
+		return -EIO;
+
+	max_pa = cpuid_eax(0x80000008) & 0xff;
+	hkid_start_pos = boot_cpu_data.x86_phys_bits;
+	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
+
+	return 0;
+}
+
+void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+	/*
+	 * This function is called at the initialization.  No need to protect
+	 * enable_tdx.
+	 */
+	if (!enable_tdx)
+		return;
+
+	if (__tdx_hardware_setup(&vt_x86_ops))
+		enable_tdx = false;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 40c64fb1f505..ccf98e79d8c3 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -123,4 +123,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 #endif
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
+#ifdef CONFIG_INTEL_TDX_HOST
+void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+#else
+static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
+#endif
+
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 004/104] KVM: Enable hardware before doing arch VM initialization
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (2 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:00   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions isaku.yamahata
                   ` (101 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
accommodate Intel's TDX, which needs VMX to be enabled during VM init in
order to make SEAMCALLs.

This also provides consistent ordering between kvm_create_vm() and
kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
hardware_disable_all().

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 virt/kvm/kvm_main.c | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0afc016cc54d..52f72a366beb 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1105,19 +1105,19 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		rcu_assign_pointer(kvm->buses[i],
 			kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT));
 		if (!kvm->buses[i])
-			goto out_err_no_arch_destroy_vm;
+			goto out_err_no_disable;
 	}
 
 	kvm->max_halt_poll_ns = halt_poll_ns;
 
-	r = kvm_arch_init_vm(kvm, type);
-	if (r)
-		goto out_err_no_arch_destroy_vm;
-
 	r = hardware_enable_all();
 	if (r)
 		goto out_err_no_disable;
 
+	r = kvm_arch_init_vm(kvm, type);
+	if (r)
+		goto out_err_no_arch_destroy_vm;
+
 #ifdef CONFIG_HAVE_KVM_IRQFD
 	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
 #endif
@@ -1145,10 +1145,10 @@ static struct kvm *kvm_create_vm(unsigned long type)
 		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
 #endif
 out_err_no_mmu_notifier:
-	hardware_disable_all();
-out_err_no_disable:
 	kvm_arch_destroy_vm(kvm);
 out_err_no_arch_destroy_vm:
+	hardware_disable_all();
+out_err_no_disable:
 	WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
 	for (i = 0; i < KVM_NR_BUSES; i++)
 		kfree(kvm_get_bus(kvm, i));
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (3 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 004/104] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 13:54   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 006/104] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
                   ` (100 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Currently, the KVM VMX module initialization and exit functions are each a
single function.  Refactor the module initialization into a KVM common part
and a VMX part so that the TDX-specific part can be added cleanly.
Opportunistically refactor the module exit function as well.

The current module initialization flow is: 1) calculate the sizes of the
VMX kvm structure and VMX vcpu structure, 2) report those sizes to the KVM
common layer and do the KVM common initialization, and 3) do the VMX
specific system-wide initialization.

Refactor the KVM VMX module initialization function into separate functions
with a wrapper, keeping the VMX logic in vmx.c out of main.c, the file that
is common to VMX and TDX.  The wrapper is
"vt_init() { vmx_pre_kvm_init(); kvm_init(); vmx_init(); }" in main.c, with
vmx_pre_kvm_init() and vmx_init() in vmx.c.  vmx_pre_kvm_init() calculates
the size and alignment of the VMX vcpu structure, kvm_init() does the
system-wide initialization of the KVM common layer, and vmx_init() does the
system-wide VMX initialization.

The KVM common layer allocates struct kvm with the size reported by the
architecture-specific code.  The KVM VMX module defines its per-VM
structure as struct kvm_vmx { struct kvm kvm; /* VMX-specific members */ }
and uses it wherever it needs VMX-specific state.  The vcpu structure is
handled the same way.  The TDX KVM patches will define TDX-specific kvm and
vcpu structures and add tdx_pre_kvm_init() to report their sizes to the KVM
common layer.

The current module exit function is likewise a single function that mixes
VMX-specific logic with common KVM logic.  Refactor it the same way.  This
is purely a refactoring to keep the VMX-specific logic in vmx.c out of
main.c.
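
A minimal sketch of the containment pattern that the size reporting is
built around (the VMX-specific field shown is illustrative, not the full
definition):

struct kvm_vmx {
        struct kvm kvm;                 /* must stay the first member */
        unsigned int tss_addr;          /* VMX-specific members follow */
};

static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
{
        return container_of(kvm, struct kvm_vmx, kvm);
}

The common layer allocates vm_size bytes and only ever handles the embedded
struct kvm, so reporting sizeof(struct kvm_vmx) (or, later, the larger of
that and sizeof(struct kvm_tdx)) is all that is needed for the allocation
to have room for the vendor-specific members.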

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 33 +++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 97 +++++++++++++++++++-------------------
 arch/x86/kvm/vmx/x86_ops.h |  5 +-
 3 files changed, 86 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b79fcc8d81dd..8ff13c7881f2 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -165,3 +165,36 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
 
 	.runtime_ops = &vt_x86_ops,
 };
+
+static int __init vt_init(void)
+{
+	unsigned int vcpu_size = 0, vcpu_align = 0;
+	int r;
+
+	vmx_pre_kvm_init(&vcpu_size, &vcpu_align);
+
+	r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
+	if (r)
+		goto err_vmx_post_exit;
+
+	r = vmx_init();
+	if (r)
+		goto err_kvm_exit;
+
+	return 0;
+
+err_kvm_exit:
+	kvm_exit();
+err_vmx_post_exit:
+	vmx_post_kvm_exit();
+	return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+	vmx_exit();
+	kvm_exit();
+	vmx_post_kvm_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f6f5d0dac579..7838cd177f0e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7929,47 +7929,12 @@ static void vmx_cleanup_l1d_flush(void)
 	l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
 }
 
-static void vmx_exit(void)
+void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align)
 {
-#ifdef CONFIG_KEXEC_CORE
-	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
-	synchronize_rcu();
-#endif
-
-	kvm_exit();
-
-#if IS_ENABLED(CONFIG_HYPERV)
-	if (static_branch_unlikely(&enable_evmcs)) {
-		int cpu;
-		struct hv_vp_assist_page *vp_ap;
-		/*
-		 * Reset everything to support using non-enlightened VMCS
-		 * access later (e.g. when we reload the module with
-		 * enlightened_vmcs=0)
-		 */
-		for_each_online_cpu(cpu) {
-			vp_ap =	hv_get_vp_assist_page(cpu);
-
-			if (!vp_ap)
-				continue;
-
-			vp_ap->nested_control.features.directhypercall = 0;
-			vp_ap->current_nested_vmcs = 0;
-			vp_ap->enlighten_vmentry = 0;
-		}
-
-		static_branch_disable(&enable_evmcs);
-	}
-#endif
-	vmx_cleanup_l1d_flush();
-
-	allow_smaller_maxphyaddr = false;
-}
-module_exit(vmx_exit);
-
-static int __init vmx_init(void)
-{
-	int r, cpu;
+	if (sizeof(struct vcpu_vmx) > *vcpu_size)
+		*vcpu_size = sizeof(struct vcpu_vmx);
+	if (__alignof__(struct vcpu_vmx) > *vcpu_align)
+		*vcpu_align = __alignof__(struct vcpu_vmx);
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	/*
@@ -8004,11 +7969,38 @@ static int __init vmx_init(void)
 		enlightened_vmcs = false;
 	}
 #endif
+}
 
-	r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
-		__alignof__(struct vcpu_vmx), THIS_MODULE);
-	if (r)
-		return r;
+void vmx_post_kvm_exit(void)
+{
+#if IS_ENABLED(CONFIG_HYPERV)
+	if (static_branch_unlikely(&enable_evmcs)) {
+		int cpu;
+		struct hv_vp_assist_page *vp_ap;
+		/*
+		 * Reset everything to support using non-enlightened VMCS
+		 * access later (e.g. when we reload the module with
+		 * enlightened_vmcs=0)
+		 */
+		for_each_online_cpu(cpu) {
+			vp_ap =	hv_get_vp_assist_page(cpu);
+
+			if (!vp_ap)
+				continue;
+
+			vp_ap->nested_control.features.directhypercall = 0;
+			vp_ap->current_nested_vmcs = 0;
+			vp_ap->enlighten_vmentry = 0;
+		}
+
+		static_branch_disable(&enable_evmcs);
+	}
+#endif
+}
+
+int __init vmx_init(void)
+{
+	int r, cpu;
 
 	/*
 	 * Must be called after kvm_init() so enable_ept is properly set
@@ -8018,10 +8010,8 @@ static int __init vmx_init(void)
 	 * mitigation mode.
 	 */
 	r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
-	if (r) {
-		vmx_exit();
+	if (r)
 		return r;
-	}
 
 	for_each_possible_cpu(cpu) {
 		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
@@ -8045,4 +8035,15 @@ static int __init vmx_init(void)
 
 	return 0;
 }
-module_init(vmx_init);
+
+void vmx_exit(void)
+{
+#ifdef CONFIG_KEXEC_CORE
+	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
+	synchronize_rcu();
+#endif
+
+	vmx_cleanup_l1d_flush();
+
+	allow_smaller_maxphyaddr = false;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ccf98e79d8c3..7da541e1c468 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -8,7 +8,10 @@
 
 #include "x86.h"
 
-extern struct kvm_x86_init_ops vt_init_ops __initdata;
+void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align);
+int __init vmx_init(void);
+void vmx_exit(void);
+void vmx_post_kvm_exit(void);
 
 __init int vmx_cpu_has_kvm_support(void);
 __init int vmx_disabled_by_bios(void);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 006/104] KVM: TDX: Add placeholders for TDX VM/vcpu structure
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (4 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 13:55   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
                   ` (99 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add placeholder TDX VM/vcpu structures that overlay the VMX VM/vcpu
structures.  Initialize the VM structure size and the vcpu size/alignment
so that the x86 KVM common code knows those sizes irrespective of whether
VMX or TDX is in use.  These structures will be populated as the guest
creation logic develops.

Add helper functions to check whether a VM is a guest TD, and conversion
functions between the KVM VM/vcpu and the TDX VM/vcpu structures.
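
A brief sketch of how these helpers are expected to be used once
TDX-specific callbacks exist (vt_flush_tlb_all() and tdx_flush_tlb() are
hypothetical names here, not part of this patch):

static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
{
        if (is_td_vcpu(vcpu)) {
                tdx_flush_tlb(vcpu);    /* hypothetical TDX path */
                return;
        }

        vmx_flush_tlb_all(vcpu);
}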

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  3 +++
 arch/x86/kvm/vmx/tdx.c     | 11 +++++++++
 arch/x86/kvm/vmx/tdx.h     | 47 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 65 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx.h

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8ff13c7881f2..28a7597d0782 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -171,6 +171,9 @@ static int __init vt_init(void)
 	unsigned int vcpu_size = 0, vcpu_align = 0;
 	int r;
 
+	/* tdx_pre_kvm_init must be called before vmx_pre_kvm_init(). */
+	tdx_pre_kvm_init(&vcpu_size, &vcpu_align, &vt_x86_ops.vm_size);
+
 	vmx_pre_kvm_init(&vcpu_size, &vcpu_align);
 
 	r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1acf08c310c4..8ed3ec342e28 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,6 +5,7 @@
 
 #include "capabilities.h"
 #include "x86_ops.h"
+#include "tdx.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) "tdx: " fmt
@@ -51,3 +52,13 @@ void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	if (__tdx_hardware_setup(&vt_x86_ops))
 		enable_tdx = false;
 }
+
+void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+			unsigned int *vcpu_align, unsigned int *vm_size)
+{
+	*vcpu_size = sizeof(struct vcpu_tdx);
+	*vcpu_align = __alignof__(struct vcpu_tdx);
+
+	if (sizeof(struct kvm_tdx) > *vm_size)
+		*vm_size = sizeof(struct kvm_tdx);
+}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
new file mode 100644
index 000000000000..daf6bfc6502a
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -0,0 +1,47 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_H
+#define __KVM_X86_TDX_H
+
+#ifdef CONFIG_INTEL_TDX_HOST
+struct kvm_tdx {
+	struct kvm kvm;
+};
+
+struct vcpu_tdx {
+	struct kvm_vcpu	vcpu;
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+	/*
+	 * TDX VM type isn't defined yet.
+	 * return kvm->arch.vm_type == KVM_X86_TDX_VM;
+	 */
+	return false;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+	return is_td(vcpu->kvm);
+}
+
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+	return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+	return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+#else
+struct kvm_tdx;
+struct vcpu_tdx;
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_H */
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 7da541e1c468..1bad27e592b5 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -127,8 +127,12 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_INTEL_TDX_HOST
+void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
+			unsigned int *vcpu_align, unsigned int *vm_size);
 void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 #else
+static inline void tdx_pre_kvm_init(
+	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
 static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
 #endif
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (5 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 006/104] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 13:59   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize " isaku.yamahata
                   ` (98 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX KVM needs system-wide information about the TDX module, struct
tdsysinfo_struct.  Add a helper function, tdx_get_sysinfo(), that returns
it so that KVM does not have to fetch it itself and repeat the various
error checks.  Move the struct definition to the common header <asm/tdx.h>.
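
A short sketch of the intended caller side (example_check_tdx_module() is a
hypothetical function; only tdx_get_sysinfo() and the struct fields come
from this patch):

static int example_check_tdx_module(void)
{
        const struct tdsysinfo_struct *tdsysinfo = tdx_get_sysinfo();

        /* NULL until the TDX module has been initialized. */
        if (!tdsysinfo)
                return -EOPNOTSUPP;

        pr_info("TDX module %u.%u, build %u\n", tdsysinfo->major_version,
                tdsysinfo->minor_version, tdsysinfo->build_num);
        return 0;
}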

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h | 55 ++++++++++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx.c    | 16 +++++++++--
 arch/x86/virt/vmx/tdx.h    | 52 -----------------------------------
 3 files changed, 69 insertions(+), 54 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 24f2b7e8b280..9a8dc6afcb63 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -82,15 +82,70 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_HOST
+struct tdx_cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
+	 * is 1024B defined by TDX architecture.  Use a union with
+	 * specific padding to make 'sizeof(struct tdsysinfo_struct)'
+	 * equal to 1024.
+	 */
+	union {
+		struct tdx_cpuid_config	cpuid_configs[0];
+		u8			reserved5[892];
+	};
+} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
+
 void tdx_detect_cpu(struct cpuinfo_x86 *c);
 int tdx_detect(void);
 int tdx_init(void);
 bool platform_has_tdx(void);
+const struct tdsysinfo_struct *tdx_get_sysinfo(void);
 #else
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
 static inline int tdx_detect(void) { return -ENODEV; }
 static inline int tdx_init(void) { return -ENODEV; }
 static inline bool platform_has_tdx(void) { return false; }
+struct tdsysinfo_struct;
+static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
 #endif /* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx.c b/arch/x86/virt/vmx/tdx.c
index da4d1df95503..e45f188479cb 100644
--- a/arch/x86/virt/vmx/tdx.c
+++ b/arch/x86/virt/vmx/tdx.c
@@ -641,7 +641,7 @@ static int sanitize_cmrs(struct cmr_info *cmr_array, int cmr_num)
 	return 0;
 }
 
-static int tdx_get_sysinfo(void)
+static int __tdx_get_sysinfo(void)
 {
 	struct tdx_module_output out;
 	u64 tdsysinfo_sz, cmr_num;
@@ -676,6 +676,18 @@ static int tdx_get_sysinfo(void)
 	return sanitize_cmrs(tdx_cmr_array, cmr_num);
 }
 
+const struct tdsysinfo_struct *tdx_get_sysinfo(void)
+{
+       const struct tdsysinfo_struct *r = NULL;
+
+       mutex_lock(&tdx_module_lock);
+       if (tdx_module_status == TDX_MODULE_INITIALIZED)
+	       r = &tdx_sysinfo;
+       mutex_unlock(&tdx_module_lock);
+       return r;
+}
+EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
+
 /*
  * Only E820_TYPE_RAM and E820_TYPE_PRAM are considered as candidate for
  * TDX usable memory.  The latter is treated as RAM because it is created
@@ -1383,7 +1395,7 @@ static int init_tdx_module(void)
 		goto out;
 
 	/* Get TDX module information and CMRs */
-	ret = tdx_get_sysinfo();
+	ret = __tdx_get_sysinfo();
 	if (ret)
 		goto out;
 
diff --git a/arch/x86/virt/vmx/tdx.h b/arch/x86/virt/vmx/tdx.h
index 212f83374c0a..b071d299327b 100644
--- a/arch/x86/virt/vmx/tdx.h
+++ b/arch/x86/virt/vmx/tdx.h
@@ -37,58 +37,6 @@ struct cmr_info {
 #define MAX_CMRS			32
 #define CMR_INFO_ARRAY_ALIGNMENT	512
 
-struct cpuid_config {
-	u32	leaf;
-	u32	sub_leaf;
-	u32	eax;
-	u32	ebx;
-	u32	ecx;
-	u32	edx;
-} __packed;
-
-#define TDSYSINFO_STRUCT_SIZE		1024
-#define TDSYSINFO_STRUCT_ALIGNMENT	1024
-
-struct tdsysinfo_struct {
-	/* TDX-SEAM Module Info */
-	u32	attributes;
-	u32	vendor_id;
-	u32	build_date;
-	u16	build_num;
-	u16	minor_version;
-	u16	major_version;
-	u8	reserved0[14];
-	/* Memory Info */
-	u16	max_tdmrs;
-	u16	max_reserved_per_tdmr;
-	u16	pamt_entry_size;
-	u8	reserved1[10];
-	/* Control Struct Info */
-	u16	tdcs_base_size;
-	u8	reserved2[2];
-	u16	tdvps_base_size;
-	u8	tdvps_xfam_dependent_size;
-	u8	reserved3[9];
-	/* TD Capabilities */
-	u64	attributes_fixed0;
-	u64	attributes_fixed1;
-	u64	xfam_fixed0;
-	u64	xfam_fixed1;
-	u8	reserved4[32];
-	u32	num_cpuid_config;
-	/*
-	 * The actual number of CPUID_CONFIG depends on above
-	 * 'num_cpuid_config'.  The size of 'struct tdsysinfo_struct'
-	 * is 1024B defined by TDX architecture.  Use a union with
-	 * specific padding to make 'sizeof(struct tdsysinfo_struct)'
-	 * equal to 1024.
-	 */
-	union {
-		struct cpuid_config	cpuid_configs[0];
-		u8			reserved5[892];
-	};
-} __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
-
 struct tdmr_reserved_area {
 	u64 offset;
 	u64 size;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (6 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:03   ` Paolo Bonzini
  2022-03-31  3:31   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 009/104] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
                   ` (97 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Memory used by TDX is encrypted with an encryption key.  An encryption key
is assigned to each guest TD, and that TD's memory is encrypted with it.
The VMM calculates the Trust Domain Memory Ranges (TDMRs), the ranges of
memory pages that can hold TDX memory.  The VMM also allocates memory
regions for the Physical Address Metadata Table (PAMT), which the TDX
module uses to track page state: whether a page is used for TDX memory,
which guest TD it is assigned to, and so on.  The VMM hands the PAMT
regions to the TDX module, which initializes them; the PAMT is also
encrypted.

TDX requires more initialization steps in addition to VMX.  As a
preparation step, check whether the CPU feature is available and enable
VMX, because the TDX module API requires VMX to be enabled to be
functional.  The next step is basic platform initialization: check whether
the TDX module API is available, call the system-wide initialization API
(TDH.SYS.INIT), and call the LP initialization API (TDH.SYS.LP.INIT).
Lastly, get the system-wide parameters (TDH.SYS.INFO), allocate the PAMT
for the TDX module to track page states (TDH.SYS.CONFIG), configure the
encryption key (TDH.SYS.KEY.CONFIG), and initialize the PAMT
(TDH.SYS.TDMR.INIT).

The TDX host patch series implements those details and provides the APIs:
seamrr_enabled() to check whether the CPU feature is available, tdx_detect()
and tdx_init() to detect and initialize the TDX module, and
tdx_get_sysinfo() to get the TDX system parameters.

Add a wrapper function to initialize the TDX module and get the system-wide
parameters via those APIs.  Because TDX requires VMX to be enabled, the
wrapper will be called on demand when the first guest TD is created, via
the x86 KVM init_vm callback.
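
A rough sketch of that on-demand call site (vt_vm_init() as shown here is
hypothetical; the actual wiring into the init_vm callback is done by later
patches):

static int vt_vm_init(struct kvm *kvm)
{
        int ret;

        if (is_td(kvm)) {
                /* The first guest TD triggers TDX module initialization. */
                ret = tdx_module_setup();
                if (ret)
                        return ret;
        }

        return vmx_vm_init(kvm);
}

This relies on the earlier change that makes kvm_create_vm() call
hardware_enable_all() before kvm_arch_init_vm(), so VMX is already enabled
when the init_vm callback runs and the SEAMCALLs issued during module setup
can succeed.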

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h |  4 ++
 2 files changed, 93 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8ed3ec342e28..8adc87ad1807 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -13,9 +13,98 @@
 static bool __read_mostly enable_tdx = true;
 module_param_named(tdx, enable_tdx, bool, 0644);
 
+#define TDX_MAX_NR_CPUID_CONFIGS					\
+	((sizeof(struct tdsysinfo_struct) -				\
+		offsetof(struct tdsysinfo_struct, cpuid_configs))	\
+		/ sizeof(struct tdx_cpuid_config))
+
+struct tdx_capabilities {
+	u8 tdcs_nr_pages;
+	u8 tdvpx_nr_pages;
+
+	u64 attrs_fixed0;
+	u64 attrs_fixed1;
+	u64 xfam_fixed0;
+	u64 xfam_fixed1;
+
+	u32 nr_cpuid_configs;
+	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
+};
+
+/* Capabilities of KVM + the TDX module. */
+struct tdx_capabilities tdx_caps;
+
 static u64 hkid_mask __ro_after_init;
 static u8 hkid_start_pos __ro_after_init;
 
+static int __tdx_module_setup(void)
+{
+	const struct tdsysinfo_struct *tdsysinfo;
+	int ret = 0;
+
+	BUILD_BUG_ON(sizeof(*tdsysinfo) != 1024);
+	BUILD_BUG_ON(TDX_MAX_NR_CPUID_CONFIGS != 37);
+
+	ret = tdx_detect();
+	if (ret) {
+		pr_info("Failed to detect TDX module.\n");
+		return ret;
+	}
+
+	ret = tdx_init();
+	if (ret) {
+		pr_info("Failed to initialize TDX module.\n");
+		return ret;
+	}
+
+	tdsysinfo = tdx_get_sysinfo();
+	if (tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS)
+		return -EIO;
+
+	tdx_caps = (struct tdx_capabilities) {
+		.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
+		/*
+		 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
+		 * -1 for TDVPR.
+		 */
+		.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
+		.attrs_fixed0 = tdsysinfo->attributes_fixed0,
+		.attrs_fixed1 = tdsysinfo->attributes_fixed1,
+		.xfam_fixed0 =	tdsysinfo->xfam_fixed0,
+		.xfam_fixed1 = tdsysinfo->xfam_fixed1,
+		.nr_cpuid_configs = tdsysinfo->num_cpuid_config,
+	};
+	if (!memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
+			tdsysinfo->num_cpuid_config *
+			sizeof(struct tdx_cpuid_config)))
+		return -EIO;
+
+	return 0;
+}
+
+int tdx_module_setup(void)
+{
+	static DEFINE_MUTEX(tdx_init_lock);
+	static bool __read_mostly tdx_module_initialized;
+	int ret = 0;
+
+	mutex_lock(&tdx_init_lock);
+
+	if (!tdx_module_initialized) {
+		if (enable_tdx) {
+			ret = __tdx_module_setup();
+			if (ret)
+				enable_tdx = false;
+			else
+				tdx_module_initialized = true;
+		} else
+			ret = -EOPNOTSUPP;
+	}
+
+	mutex_unlock(&tdx_init_lock);
+	return ret;
+}
+
 static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 {
 	u32 max_pa;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index daf6bfc6502a..d448e019602c 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -3,6 +3,8 @@
 #define __KVM_X86_TDX_H
 
 #ifdef CONFIG_INTEL_TDX_HOST
+int tdx_module_setup(void);
+
 struct kvm_tdx {
 	struct kvm kvm;
 };
@@ -35,6 +37,8 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
 #else
+static inline int tdx_module_setup(void) { return -ENODEV; };
+
 struct kvm_tdx;
 struct vcpu_tdx;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 009/104] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (7 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize " isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:07   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported isaku.yamahata
                   ` (96 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
some operations (e.g., memory read/write, register state access, etc).

Introduce vm_type to track the type of the VM in x86 KVM.  Other arch KVMs
already use a VM type: KVM_CREATE_VM accepts a type argument and the arch
vm init code receives it, so follow that convention.  Further, different
policies can be applied based on vm_type.  Define KVM_X86_DEFAULT_VM for
the default VM type and KVM_X86_TDX_VM for the Intel TDX VM type.  The
wrapper function is defined as "bool is_td(kvm) { return kvm->arch.vm_type
== KVM_X86_TDX_VM; }".

Add a capability, KVM_CAP_VM_TYPES, to allow the device model, e.g. qemu,
to query which VM types are supported by KVM.  Introducing a new capability
together with vm_type is chosen to align x86 with the other arch KVMs that
already have VM types.  The other arch KVMs use different, arch-specific
names to query the supported VM types and there is no common name for it,
so a new name was chosen.
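
For illustration only (not part of this patch), a device model could probe
the capability and pick the VM type roughly like this; error handling is
omitted and the file-descriptor handling is simplified:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int create_vm(void)
  {
  	int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);
  	int types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
  	unsigned long type = KVM_X86_DEFAULT_VM;

  	if (types > 0 && (types & (1 << KVM_X86_TDX_VM)))
  		type = KVM_X86_TDX_VM;

  	/* KVM_CREATE_VM takes the requested VM type as its argument. */
  	return ioctl(kvm_fd, KVM_CREATE_VM, type);
  }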

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/api.rst        | 15 +++++++++++++++
 arch/x86/include/asm/kvm-x86-ops.h    |  1 +
 arch/x86/include/asm/kvm_host.h       |  2 ++
 arch/x86/include/uapi/asm/kvm.h       |  3 +++
 arch/x86/kvm/svm/svm.c                |  6 ++++++
 arch/x86/kvm/vmx/main.c               |  1 +
 arch/x86/kvm/vmx/tdx.h                |  6 +-----
 arch/x86/kvm/vmx/vmx.c                |  5 +++++
 arch/x86/kvm/vmx/x86_ops.h            |  1 +
 arch/x86/kvm/x86.c                    |  9 ++++++++-
 include/uapi/linux/kvm.h              |  1 +
 tools/arch/x86/include/uapi/asm/kvm.h |  3 +++
 tools/include/uapi/linux/kvm.h        |  1 +
 13 files changed, 48 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f3172376ec3..b1e142719ec0 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,15 +147,30 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+^^^^
+
+Supported VM types can be queried via KVM_CAP_VM_TYPES, which returns a
+bitmap of the supported VM types.  Bit @n being set means that the VM type
+with value @n is supported.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^^^^^
+
 To use hardware assisted virtualization on MIPS (VZ ASE) rather than
 the default trap & emulate implementation (which changes the virtual
 memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
 flag KVM_VM_MIPS_VZ.
 
+ARM64:
+^^^^^^
 
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index d39e0de06be2..8125d43d3566 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -18,6 +18,7 @@ KVM_X86_OP_NULL(hardware_unsetup)
 KVM_X86_OP_NULL(cpu_has_accelerated_tpr)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_NULL(vm_destroy)
 KVM_X86_OP(vcpu_create)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b61e46743184..8de357a9ad30 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1048,6 +1048,7 @@ struct kvm_x86_msr_filter {
 #define APICV_INHIBIT_REASON_ABSENT	7
 
 struct kvm_arch {
+	unsigned long vm_type;
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
@@ -1322,6 +1323,7 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
+	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index bf6e96011dfe..71a5851475e7 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_TDX_VM		1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index fd3a00c892c7..778075b71dc3 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4512,6 +4512,11 @@ static void svm_vm_destroy(struct kvm *kvm)
 	sev_vm_destroy(kvm);
 }
 
+static bool svm_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM;
+}
+
 static int svm_vm_init(struct kvm *kvm)
 {
 	if (!pause_filter_count || !pause_filter_thresh)
@@ -4539,6 +4544,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_free = svm_free_vcpu,
 	.vcpu_reset = svm_vcpu_reset,
 
+	.is_vm_type_supported = svm_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_svm),
 	.vm_init = svm_vm_init,
 	.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 28a7597d0782..77da926ee505 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -29,6 +29,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.is_vm_type_supported = vmx_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index d448e019602c..616fbf79b129 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -15,11 +15,7 @@ struct vcpu_tdx {
 
 static inline bool is_td(struct kvm *kvm)
 {
-	/*
-	 * TDX VM type isn't defined yet.
-	 * return kvm->arch.vm_type == KVM_X86_TDX_VM;
-	 */
-	return false;
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
 }
 
 static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7838cd177f0e..3c7b3f245fee 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7079,6 +7079,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	return err;
 }
 
+bool vmx_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM;
+}
+
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 1bad27e592b5..f7327bc73be0 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -25,6 +25,7 @@ void vmx_hardware_unsetup(void);
 int vmx_hardware_enable(void);
 void vmx_hardware_disable(void);
 bool report_flexpriority(void);
+bool vmx_is_vm_type_supported(unsigned long type);
 int vmx_vm_init(struct kvm *kvm);
 int vmx_vcpu_create(struct kvm_vcpu *vcpu);
 int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index aa2942060154..f6438750d190 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4344,6 +4344,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 			r = sizeof(struct kvm_xsave);
 		break;
 	}
+	case KVM_CAP_VM_TYPES:
+		r = BIT(KVM_X86_DEFAULT_VM);
+		if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
+			r |= BIT(KVM_X86_TDX_VM);
+		break;
 	default:
 		break;
 	}
@@ -11583,9 +11588,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	int ret;
 	unsigned long flags;
 
-	if (type)
+	if (!static_call(kvm_x86_is_vm_type_supported)(type))
 		return -EINVAL;
 
+	kvm->arch.vm_type = type;
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		return ret;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 507ee1f2aa96..bb3e29770f76 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1135,6 +1135,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_XSAVE2 208
 #define KVM_CAP_SYS_ATTRIBUTES 209
 #define KVM_CAP_PPC_AIL_MODE_3 210
+#define KVM_CAP_VM_TYPES 211
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index bf6e96011dfe..71a5851475e7 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_TDX_VM		1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 507ee1f2aa96..1a99d0aae852 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1135,6 +1135,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_XSAVE2 208
 #define KVM_CAP_SYS_ATTRIBUTES 209
 #define KVM_CAP_PPC_AIL_MODE_3 210
+#define KVM_CAP_VM_TYPES 211
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (8 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 009/104] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 23:08   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 011/104] [MARKER] The start of TDX KVM patch series: TDX architectural definitions isaku.yamahata
                   ` (95 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

As the first step of TDX VM support, report to the device model, e.g. qemu,
that the TDX VM type is supported.  The callback used to create a guest TD
is the vm_init callback invoked for KVM_CREATE_VM.  Add a placeholder
there, and initialize the TDX module on demand, because by the time that
callback runs VMX has already been enabled by the hardware_enable callback
(vmx_hardware_enable).
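
With only this patch applied the device model can select the TDX VM type
but still cannot create a TD; a sketch of what it observes (reusing the
hypothetical KVM_CREATE_VM usage shown in the previous patch's example):

  /* tdx_module_setup() runs, then vt_vm_init() rejects TD creation. */
  vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
  /* vm_fd == -1 and errno == EOPNOTSUPP at this point in the series. */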

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 24 ++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     |  5 +++++
 arch/x86/kvm/vmx/vmx.c     |  5 -----
 arch/x86/kvm/vmx/x86_ops.h |  3 ++-
 4 files changed, 29 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 77da926ee505..8103d1c32cc9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -5,6 +5,12 @@
 #include "vmx.h"
 #include "nested.h"
 #include "pmu.h"
+#include "tdx.h"
+
+static bool vt_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
+}
 
 static __init int vt_hardware_setup(void)
 {
@@ -19,6 +25,20 @@ static __init int vt_hardware_setup(void)
 	return 0;
 }
 
+static int vt_vm_init(struct kvm *kvm)
+{
+	int ret;
+
+	if (is_td(kvm)) {
+		ret = tdx_module_setup();
+		if (ret)
+			return ret;
+		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
+	}
+
+	return vmx_vm_init(kvm);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -29,9 +49,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
-	.is_vm_type_supported = vmx_is_vm_type_supported,
+	.is_vm_type_supported = vt_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
-	.vm_init = vmx_vm_init,
+	.vm_init = vt_vm_init,
 
 	.vcpu_create = vmx_vcpu_create,
 	.vcpu_free = vmx_vcpu_free,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8adc87ad1807..e8d293a3c11c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -105,6 +105,11 @@ int tdx_module_setup(void)
 	return ret;
 }
 
+bool tdx_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_TDX_VM && READ_ONCE(enable_tdx);
+}
+
 static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 {
 	u32 max_pa;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 3c7b3f245fee..7838cd177f0e 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7079,11 +7079,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	return err;
 }
 
-bool vmx_is_vm_type_supported(unsigned long type)
-{
-	return type == KVM_X86_DEFAULT_VM;
-}
-
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f7327bc73be0..78331dbc29f7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -25,7 +25,6 @@ void vmx_hardware_unsetup(void);
 int vmx_hardware_enable(void);
 void vmx_hardware_disable(void);
 bool report_flexpriority(void);
-bool vmx_is_vm_type_supported(unsigned long type);
 int vmx_vm_init(struct kvm *kvm);
 int vmx_vcpu_create(struct kvm_vcpu *vcpu);
 int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
@@ -130,10 +129,12 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_INTEL_TDX_HOST
 void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
 			unsigned int *vcpu_align, unsigned int *vm_size);
+bool tdx_is_vm_type_supported(unsigned long type);
 void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 #else
 static inline void tdx_pre_kvm_init(
 	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
+static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
 #endif
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 011/104] [MARKER] The start of TDX KVM patch series: TDX architectural definitions
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (9 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 012/104] KVM: TDX: Define " isaku.yamahata
                   ` (94 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TDX architectural
definitions.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 .../virt/kvm/intel-tdx-layer-status.rst       | 29 +++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
new file mode 100644
index 000000000000..d4db90cb098b
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -0,0 +1,29 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Layer status
+============
+What qemu can do
+----------------
+- TDX VM TYPE is exposed to Qemu.
+- Qemu can try to create VM of TDX VM type and then fails.
+
+Patch Layer status
+------------------
+  Patch layer                          Status
+* TDX, VMX coexistence:                 Applied
+* TDX architectural definitions:        Applying
+* TD VM creation/destruction:           Not yet
+* TD vcpu creation/destruction:         Not yet
+* TDX EPT violation:                    Not yet
+* TD finalization:                      Not yet
+* TD vcpu enter/exit:                   Not yet
+* TD vcpu interrupts/exit/hypercall:    Not yet
+
+* KVM MMU GPA stolen bits:              Not yet
+* KVM TDP refactoring for TDX:          Not yet
+* KVM TDP MMU hooks:                    Not yet
+* KVM TDP MMU MapGPA:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 012/104] KVM: TDX: Define TDX architectural definitions
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (10 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 011/104] [MARKER] The start of TDX KVM patch series: TDX architectural definitions isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:30   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 013/104] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
                   ` (93 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add the architectural definitions that KVM needs to issue TDX SEAMCALLs.

The structures and values are architecturally defined in the ABI Reference
chapter of the TDX module specification.
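
As a worked example of the TSC-frequency helpers defined at the end of the
header (the numbers are illustrative, not from the patch):

  /* A guest TSC frequency of 2.5 GHz, i.e. 2,500,000 kHz: */
  u16 tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(2500000);	/* 100 units of 25MHz */

  /* Converting back for KVM's kHz-based bookkeeping: */
  u32 tsc_khz = TDX_TSC_25MHZ_TO_KHZ(tsc_frequency);	/* 2500000 kHz again */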

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_arch.h | 158 ++++++++++++++++++++++++++++++++++++
 1 file changed, 158 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..3824491d22dc
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,158 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER			0
+#define TDH_MNG_ADDCX			1
+#define TDH_MEM_PAGE_ADD		2
+#define TDH_MEM_SEPT_ADD		3
+#define TDH_VP_ADDCX			4
+#define TDH_MEM_PAGE_AUG		6
+#define TDH_MEM_RANGE_BLOCK		7
+#define TDH_MNG_KEY_CONFIG		8
+#define TDH_MNG_CREATE			9
+#define TDH_VP_CREATE			10
+#define TDH_MNG_RD			11
+#define TDH_MR_EXTEND			16
+#define TDH_MR_FINALIZE			17
+#define TDH_VP_FLUSH			18
+#define TDH_MNG_VPFLUSHDONE		19
+#define TDH_MNG_KEY_FREEID		20
+#define TDH_MNG_INIT			21
+#define TDH_VP_INIT			22
+#define TDH_VP_RD			26
+#define TDH_MNG_KEY_RECLAIMID		27
+#define TDH_PHYMEM_PAGE_RECLAIM		28
+#define TDH_MEM_PAGE_REMOVE		29
+#define TDH_MEM_TRACK			38
+#define TDH_MEM_RANGE_UNBLOCK		39
+#define TDH_PHYMEM_CACHE_WB		40
+#define TDH_PHYMEM_PAGE_WBINVD		41
+#define TDH_VP_WR			43
+#define TDH_SYS_LP_SHUTDOWN		44
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO		0x10000
+#define TDG_VP_VMCALL_MAP_GPA				0x10001
+#define TDG_VP_VMCALL_GET_QUOTE				0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR		0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT	0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH			BIT_ULL(63)
+#define TDX_CLASS_SHIFT			56
+#define TDX_FIELD_MASK			GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field)	\
+	(((non_arch) ? TDX_NON_ARCH : 0) |		\
+	 ((u64)(class) << TDX_CLASS_SHIFT) |		\
+	 ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field)			\
+	__BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field)		\
+	__BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field)		BUILD_TDX_FIELD(0, (field))
+
+enum tdx_guest_other_state {
+	TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+	struct {
+		u64 vmxip	: 1;
+		u64 reserved	: 63;
+	};
+	u64 full;
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field)		BUILD_TDX_FIELD(17, (field))
+#define TDVPS_STATE_NON_ARCH(field)	BUILD_TDX_FIELD_NON_ARCH(17, (field))
+
+/* Management class fields */
+enum tdx_guest_management {
+	TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_guest_management */
+#define TDVPS_MANAGEMENT(field)		BUILD_TDX_FIELD(32, (field))
+
+enum tdx_tdcs_execution_control {
+	TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field)		BUILD_TDX_FIELD(17, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE		256
+
+struct tdx_cpuid_value {
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+	u32 edx;
+} __packed;
+
+#define TDX_TD_ATTRIBUTE_DEBUG		BIT_ULL(0)
+#define TDX_TD_ATTRIBUTE_PKS		BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL		BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON	BIT_ULL(63)
+
+#define TDX_TD_XFAM_LBR			BIT_ULL(15)
+#define TDX_TD_XFAM_AMX			(BIT_ULL(17) | BIT_ULL(18))
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+	u64 attributes;
+	u64 xfam;
+	u32 max_vcpus;
+	u32 reserved0;
+
+	u64 eptp_controls;
+	u64 exec_controls;
+	u16 tsc_frequency;
+	u8  reserved1[38];
+
+	u64 mrconfigid[6];
+	u64 mrowner[6];
+	u64 mrownerconfig[6];
+	u64 reserved2[4];
+
+	union {
+		struct tdx_cpuid_value cpuid_values[0];
+		u8 reserved3[768];
+	};
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_EXEC_CONTROL_MAX_GPAW      BIT_ULL(0)
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz.  The
+ * frequency must be between 100MHz and 10GHz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz)	((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz)	((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
+
+#endif /* __KVM_X86_TDX_ARCH_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 013/104] KVM: TDX: Add TDX "architectural" error codes
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (11 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 012/104] KVM: TDX: Define " isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:08   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL isaku.yamahata
                   ` (92 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add error codes for the TDX SEAMCALLs, both for the VMM side (the TDH
SEAMCALL leaves) and for the guest side (TDG.VP.VMCALL).  KVM issues TDX
SEAMCALLs and checks their error codes.  KVM also handles hypercalls from
the TDX guest and may need to return an error, so error codes for the TDX
guest side are needed as well.

A TDX SEAMCALL uses bits 31:0 to return additional information, so these
error codes only match RAX[63:32] exactly.  The error codes for
TDG.VP.VMCALL are defined by the TDX Guest-Host-Communication Interface
specification.
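
A minimal sketch of how a caller is expected to classify an error with
these definitions (err is assumed to hold the RAX value returned by some
SEAMCALL; the surrounding code is hypothetical):

  if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_EPT_WALK_FAILED) {
  	/* Bits 31:0 of err carry leaf-specific detail. */
  }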

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_errno.h | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..5c878488795d
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,29 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural status code for SEAMCALL */
+
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_SUCCESS				0x0000000000000000ULL
+#define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
+#define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
+#define TDX_LIFECYCLE_STATE_INCORRECT		0xC000060700000000ULL
+#define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
+#define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
+#define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
+#define TDX_KEY_CONFIGURED			0x0000081500000000ULL
+#define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
+
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDG_VP_VMCALL_SUCCESS			0x0000000000000000ULL
+#define TDG_VP_VMCALL_INVALID_OPERAND		0x8000000000000000ULL
+#define TDG_VP_VMCALL_TDREPORT_FAILED		0x8000000000000001ULL
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (12 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 013/104] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:10   ` Paolo Bonzini
  2022-03-13 22:42   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 015/104] KVM: TDX: add a helper function for KVM to issue SEAMCALL isaku.yamahata
                   ` (91 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add an assembly function for KVM to call the TDX module because __seamcall,
defined in arch/x86/virt/vmx/seamcall.S, doesn't fit the KVM use case.

The TDX module API returns extended error information in registers rcx,
rdx, r8, r9, r10, and r11 in addition to the status code in RAX.  KVM uses
that extended error information together with the status code.  Update the
assembly code so it can optionally return those outputs even in the error
case, and define a KVM-specific variant for calling the TDX module.

A SEAMCALL to the SEAM module (P-SEAMLDR or TDX module) can fail with
VmFailInvalid, indicated by CF=1, when VMX hasn't been enabled by the VMXON
instruction.  Because KVM guarantees that VMX is enabled, the VmFailInvalid
error can't happen; don't check for it in the KVM variant.
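
A sketch of the intended usage from C (the kvm_seamcall() prototype is
added in the next patch; read_tdr_field() and its arguments are
hypothetical, and the struct tdx_module_output layout comes from the TDX
host series):

  static u64 read_tdr_field(hpa_t tdr_pa, u64 field, u64 *value)
  {
  	struct tdx_module_output out;
  	u64 err;

  	err = kvm_seamcall(TDH_MNG_RD, tdr_pa, field, 0, 0, 0, &out);
  	if (err) {
  		/* out.rcx/rdx/r8..r11 hold the leaf-specific error details. */
  		return err;
  	}

  	*value = out.r8;	/* TDH.MNG.RD returns the field contents in R8. */
  	return 0;
  }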

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile       |  2 +-
 arch/x86/kvm/vmx/seamcall.S | 55 +++++++++++++++++++++++++++++++++++++
 arch/x86/virt/tdxcall.S     |  8 ++++--
 3 files changed, 62 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/seamcall.S

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index e2c05195cb95..e8f83a7d0dc3 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -24,7 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
-kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
 
diff --git a/arch/x86/kvm/vmx/seamcall.S b/arch/x86/kvm/vmx/seamcall.S
new file mode 100644
index 000000000000..4a15017fc7dd
--- /dev/null
+++ b/arch/x86/kvm/vmx/seamcall.S
@@ -0,0 +1,55 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#include <linux/linkage.h>
+#include <asm/export.h>
+#include <asm/frame.h>
+
+#include "../../virt/tdxcall.S"
+
+/*
+ * kvm_seamcall()  - Host-side interface functions to SEAM software (TDX module)
+ *
+ * Transform function call register arguments into the SEAMCALL register
+ * ABI.  Return the completion status of the SEAMCALL.  Additional output
+ * operands are saved in @out (if it is provided by the user).
+ * It doesn't check TDX_SEAMCALL_VMFAILINVALID unlike __seamcall() because KVM
+ * guarantees that VMX is enabled so that TDX_SEAMCALL_VMFAILINVALID doesn't
+ * happen.  In the case of error completion status code, extended error code may
+ * be stored in leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ * SEAMCALL ABI:
+ *-------------------------------------------------------------------------
+ * Input Registers:
+ *
+ * RAX                 - SEAMCALL Leaf number.
+ * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
+ *
+ * Output Registers:
+ *
+ * RAX                 - SEAMCALL completion status code.
+ * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
+ *
+ *-------------------------------------------------------------------------
+ *
+ * kvm_seamcall() function ABI:
+ *
+ * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
+ * @rcx (RSI)          - Input parameter 1, moved to RCX
+ * @rdx (RDX)          - Input parameter 2, moved to RDX
+ * @r8  (RCX)          - Input parameter 3, moved to R8
+ * @r9  (R8)           - Input parameter 4, moved to R9
+ *
+ * @out (R9)           - struct tdx_module_output pointer
+ *                       stored temporarily in R12 (not
+ *                       shared with the TDX module). It
+ *                       can be NULL.
+ *
+ * Return (via RAX) the completion status of the SEAMCALL
+ */
+SYM_FUNC_START(kvm_seamcall)
+        FRAME_BEGIN
+        TDX_MODULE_CALL host=1 error_check=0
+        FRAME_END
+        ret
+SYM_FUNC_END(kvm_seamcall)
+EXPORT_SYMBOL_GPL(kvm_seamcall)
diff --git a/arch/x86/virt/tdxcall.S b/arch/x86/virt/tdxcall.S
index 90569faedacc..2e614b6b5f1e 100644
--- a/arch/x86/virt/tdxcall.S
+++ b/arch/x86/virt/tdxcall.S
@@ -13,7 +13,7 @@
 #define tdcall		.byte 0x66,0x0f,0x01,0xcc
 #define seamcall	.byte 0x66,0x0f,0x01,0xcf
 
-.macro TDX_MODULE_CALL host:req
+.macro TDX_MODULE_CALL host:req error_check=1
 	/*
 	 * R12 will be used as temporary storage for struct tdx_module_output
 	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
@@ -51,9 +51,11 @@
 	 *
 	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
 	 * This value will never be used as actual SEAMCALL error code.
-	 */
+	*/
+	.if \error_check
 	jnc .Lno_vmfailinvalid
 	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
+	.endif
 .Lno_vmfailinvalid:
 	.else
 	tdcall
@@ -66,8 +68,10 @@
 	pop %r12
 
 	/* Check for success: 0 - Successful, otherwise failed */
+	.if \error_check
 	test %rax, %rax
 	jnz .Lno_output_struct
+	.endif
 
 	/*
 	 * Since this function can be initiated without an output pointer,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 015/104] KVM: TDX: add a helper function for KVM to issue SEAMCALL
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (13 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:11   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 016/104] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module isaku.yamahata
                   ` (90 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TODO: Consolidate seamcall helper function with TDX host/guest patch series.
For now, this is kept to make this patch series compile/work.

A VMM interacts with the TDX module using a new instruction (SEAMCALL).  A
TDX VMM uses SEAMCALLs where a VMX VMM would have interacted directly with
VMX instructions.  For instance, a TDX VMM does not have full access to the
VM control structure that corresponds to the VMX VMCS.  Instead, the VMM
induces the TDX module to act on its behalf via SEAMCALLs.

Add a helper function for KVM C code to execute the SEAMCALL instruction,
hiding the SEAMCALL ABI details.  Although the x86 TDX host patch series
defines a similar wrapper, the KVM TDX patch series defines its own because
the KVM TDX path is performance-critical, unlike the x86 TDX one that only
does one-time initialization.  The difference is that the KVM variant omits
the check for an error that is known not to happen, so the compiler can
optimize the call sites better.  The wrapper function in the x86 TDX host
patch series is written in assembly with the error check so that it can
detect errors that can occur only during initialization.

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/seamcall.h | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/seamcall.h

diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
new file mode 100644
index 000000000000..604792e9a59f
--- /dev/null
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -0,0 +1,23 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_VMX_SEAMCALL_H
+#define __KVM_VMX_SEAMCALL_H
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+#ifdef __ASSEMBLY__
+
+.macro seamcall
+	.byte 0x66, 0x0f, 0x01, 0xcf
+.endm
+
+#else
+
+struct tdx_module_output;
+u64 kvm_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
+		struct tdx_module_output *out);
+
+#endif /* !__ASSEMBLY__ */
+
+#endif	/* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_VMX_SEAMCALL_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 016/104] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (14 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 015/104] KVM: TDX: add a helper function for KVM to issue SEAMCALL isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
                   ` (89 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

The TDX SEAMCALL interface is defined in the TDX module specification.
Define C wrapper functions for the SEAMCALLs for readability.
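
For illustration, a call site is then expected to read roughly like this
(tdr_pa and tdcs_pa are hypothetical hpa_t values owned by the caller):

  u64 err;

  /* Add one TDCS page to the TD identified by its TDR page. */
  err = tdh_mng_addcx(tdr_pa, tdcs_pa);
  if (WARN_ON_ONCE(err))
  	pr_err("TDH_MNG_ADDCX failed: 0x%llx\n", err);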

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_ops.h | 161 +++++++++++++++++++++++++++++++++++++
 1 file changed, 161 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..0bed43879b82
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+#include "seamcall.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+	return kvm_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+				struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, out);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+				struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, out);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+	return kvm_seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+				struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, out);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+				struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+	return kvm_seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+	return kvm_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+	return kvm_seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MNG_RD, tdr, field, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa, struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+	return kvm_seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+	return kvm_seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+	return kvm_seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+	return kvm_seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params, struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, 0, out);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+	return kvm_seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field, struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_VP_RD, tdvpr, field, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+	return kvm_seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page, struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+				struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, 0, out);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+	return kvm_seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+	return kvm_seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+					struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, 0, out);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+	return kvm_seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+	return kvm_seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+			struct tdx_module_output *out)
+{
+	return kvm_seamcall(TDH_VP_WR, tdvpr, field, val, mask, 0, out);
+}
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (15 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 016/104] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-13 14:12   ` Paolo Bonzini
  2022-04-15 16:54   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 018/104] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction isaku.yamahata
                   ` (88 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add helper functions to print out errors from the TDX module in a uniform
manner.
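
A minimal usage sketch (the wrapper, leaf number and output struct come
from the earlier patches in this series; tdr_pa, gpa, hpa and source are
hypothetical values owned by the caller):

  struct tdx_module_output out;
  u64 err;

  err = tdh_mem_page_add(tdr_pa, gpa, hpa, source, &out);
  if (err) {
  	pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
  	return -EIO;
  }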

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile        |  2 +-
 arch/x86/kvm/vmx/seamcall.h  |  2 ++
 arch/x86/kvm/vmx/tdx_error.c | 22 ++++++++++++++++++++++
 3 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index e8f83a7d0dc3..3d6550c73fb5 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -24,7 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
-kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o vmx/tdx_error.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
 
diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
index 604792e9a59f..5ac419cd8e27 100644
--- a/arch/x86/kvm/vmx/seamcall.h
+++ b/arch/x86/kvm/vmx/seamcall.h
@@ -16,6 +16,8 @@ struct tdx_module_output;
 u64 kvm_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
 		struct tdx_module_output *out);
 
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
+
 #endif /* !__ASSEMBLY__ */
 
 #endif	/* CONFIG_INTEL_TDX_HOST */
diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
new file mode 100644
index 000000000000..61ed855d1188
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_error.c
@@ -0,0 +1,22 @@
+// SPDX-License-Identifier: GPL-2.0
+/* functions to record TDX SEAMCALL error */
+
+#include <linux/kernel.h>
+#include <linux/bug.h>
+
+#include "tdx_ops.h"
+
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out)
+{
+	if (!out) {
+		pr_err_ratelimited("SEAMCALL[%lld] failed: 0x%llx\n",
+				op, error_code);
+		return;
+	}
+
+	pr_err_ratelimited(
+		"SEAMCALL[%lld] failed: 0x%llx "
+		"RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
+		op, error_code,
+		out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
+}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 018/104] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (16 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 019/104] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
                   ` (87 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD VM
creation/destruction.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index d4db90cb098b..4066495da3d1 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -15,8 +15,8 @@ Patch Layer status
 ------------------
   Patch layer                          Status
 * TDX, VMX coexistence:                 Applied
-* TDX architectural definitions:        Applying
-* TD VM creation/destruction:           Not yet
+* TDX architectural definitions:        Applied
+* TD VM creation/destruction:           Applying
 * TD vcpu creation/destruction:         Not yet
 * TDX EPT violation:                    Not yet
 * TD finalization:                      Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 019/104] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (17 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 018/104] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-15 16:55   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex isaku.yamahata
                   ` (86 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Stub in kvm_tdx, vcpu_tdx, and their various accessors.  TDX defines
SEAMCALL APIs to access the TDX control structures that correspond to the
VMX VMCS.  Introduce helper accessors to hide the SEAMCALL ABI details.
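
For illustration, once a vCPU's TDVPR/TDVPX pages exist, TD VMCS and
management fields are accessed like this (a sketch; tdx is assumed to be a
struct vcpu_tdx pointer and GUEST_RIP is the usual VMCS field encoding from
asm/vmx.h):

  /* Natural-width TD VMCS field, read and written via TDH.VP.RD/WR. */
  u64 rip = td_vmcs_read64(tdx, GUEST_RIP);
  td_vmcs_write64(tdx, GUEST_RIP, rip + 2);

  /* Request a pending NMI through the management class. */
  td_management_write8(tdx, TD_VCPU_PEND_NMI, 1);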

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.h | 101 +++++++++++++++++++++++++++++++++++++++++
 1 file changed, 101 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 616fbf79b129..e4bb8831764e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -3,14 +3,29 @@
 #define __KVM_X86_TDX_H
 
 #ifdef CONFIG_INTEL_TDX_HOST
+
+#include "tdx_ops.h"
+
 int tdx_module_setup(void);
 
+struct tdx_td_page {
+	unsigned long va;
+	hpa_t pa;
+	bool added;
+};
+
 struct kvm_tdx {
 	struct kvm kvm;
+
+	struct tdx_td_page tdr;
+	struct tdx_td_page *tdcs;
 };
 
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
+
+	struct tdx_td_page tdvpr;
+	struct tdx_td_page *tdvpx;
 };
 
 static inline bool is_td(struct kvm *kvm)
@@ -32,6 +47,92 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 {
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
+
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
+			 "Read/Write to TD VMCS *_HIGH fields not supported");
+
+	BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+	BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+			 (((field) & 0x6000) == 0x2000 ||
+			  ((field) & 0x6000) == 0x6000),
+			 "Invalid TD VMCS access for 64-bit field");
+	BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+			 ((field) & 0x6000) == 0x4000,
+			 "Invalid TD VMCS access for 32-bit field");
+	BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+			 ((field) & 0x6000) == 0x0000,
+			 "Invalid TD VMCS access for 16-bit field");
+}
+
+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
+							u32 field)		\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_rd(tdx->tdvpr.pa, TDVPS_##uclass(field), &out);		\
+	if (unlikely(err)) {							\
+		pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n",		\
+		       field, err);						\
+		return 0;							\
+	}									\
+	return (u##bits)out.r8;							\
+}										\
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx,	\
+						      u32 field, u##bits val)	\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), val,		\
+		      GENMASK_ULL(bits - 1, 0), &out);				\
+	if (unlikely(err))							\
+		pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n",	\
+		       field, (u64)val, err);					\
+}										\
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx,	\
+						       u32 field, u64 bit)	\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), bit, bit,		\
+			&out);							\
+	if (unlikely(err))							\
+		pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n",	\
+		       field, bit, err);					\
+}										\
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx,	\
+							 u32 field, u64 bit)	\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), 0, bit,		\
+			&out);							\
+	if (unlikely(err))							\
+		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n",	\
+		       field, bit,  err);					\
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
+TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
 #else
 static inline int tdx_module_setup(void) { return -ENODEV; };
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (18 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 019/104] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 12:39   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free isaku.yamahata
                   ` (85 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Several TDX SEAMCALLs are per-package in scope (concretely, per memory
controller) and need to be serialized per package.  Allocate a mutex per
package for that purpose.
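
A sketch of the intended usage pattern for such a call (the key
configuration flow itself comes in a later patch; tdr_pa is a hypothetical
TDR page address, and the caller is assumed to already run on a CPU of the
target package):

  int pkg = topology_physical_package_id(raw_smp_processor_id());
  u64 err;

  mutex_lock(&tdx_mng_key_config_lock[pkg]);
  err = tdh_mng_key_config(tdr_pa);		/* per-package SEAMCALL */
  mutex_unlock(&tdx_mng_key_config_lock[pkg]);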

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  8 +++++++-
 arch/x86/kvm/vmx/tdx.c     | 18 ++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 27 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 8103d1c32cc9..6111c6485d8e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -25,6 +25,12 @@ static __init int vt_hardware_setup(void)
 	return 0;
 }
 
+static void vt_hardware_unsetup(void)
+{
+	tdx_hardware_unsetup();
+	vmx_hardware_unsetup();
+}
+
 static int vt_vm_init(struct kvm *kvm)
 {
 	int ret;
@@ -42,7 +48,7 @@ static int vt_vm_init(struct kvm *kvm)
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
-	.hardware_unsetup = vmx_hardware_unsetup,
+	.hardware_unsetup = vt_hardware_unsetup,
 
 	.hardware_enable = vmx_hardware_enable,
 	.hardware_disable = vmx_hardware_disable,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e8d293a3c11c..1c8222f54764 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -34,6 +34,8 @@ struct tdx_capabilities {
 /* Capabilities of KVM + the TDX module. */
 struct tdx_capabilities tdx_caps;
 
+static struct mutex *tdx_mng_key_config_lock;
+
 static u64 hkid_mask __ro_after_init;
 static u8 hkid_start_pos __ro_after_init;
 
@@ -112,7 +114,9 @@ bool tdx_is_vm_type_supported(unsigned long type)
 
 static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 {
+	int max_pkgs;
 	u32 max_pa;
+	int i;
 
 	if (!enable_ept) {
 		pr_warn("Cannot enable TDX with EPT disabled\n");
@@ -127,6 +131,14 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
 		return -EIO;
 
+	max_pkgs = topology_max_packages();
+	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
+				   GFP_KERNEL);
+	if (!tdx_mng_key_config_lock)
+		return -ENOMEM;
+	for (i = 0; i < max_pkgs; i++)
+		mutex_init(&tdx_mng_key_config_lock[i]);
+
 	max_pa = cpuid_eax(0x80000008) & 0xff;
 	hkid_start_pos = boot_cpu_data.x86_phys_bits;
 	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
@@ -147,6 +159,12 @@ void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 		enable_tdx = false;
 }
 
+void tdx_hardware_unsetup(void)
+{
+	/* kfree accepts NULL. */
+	kfree(tdx_mng_key_config_lock);
+}
+
 void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
 			unsigned int *vcpu_align, unsigned int *vm_size)
 {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 78331dbc29f7..da32b4b86b19 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -131,11 +131,13 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
 			unsigned int *vcpu_align, unsigned int *vm_size);
 bool tdx_is_vm_type_supported(unsigned long type);
 void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+void tdx_hardware_unsetup(void);
 #else
 static inline void tdx_pre_kvm_init(
 	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
+static inline void tdx_hardware_unsetup(void) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (19 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-31  3:02   ` Kai Huang
  2022-04-05 12:40   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 022/104] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
                   ` (84 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Kai Huang <kai.huang@intel.com>

Before tearing down private page tables, TDX requires some resources of the
guest TD to be destroyed (e.g. the keyID must have been reclaimed).  Add a
prezap callback that is invoked before the private page tables are torn
down.

TDX also needs to free some per-VM resources only after the other resources
(e.g. vCPU-related resources) have been freed.  Add a vm_free callback at
the end of kvm_arch_destroy_vm().

Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h | 2 ++
 arch/x86/include/asm/kvm_host.h    | 2 ++
 arch/x86/kvm/x86.c                 | 8 ++++++++
 3 files changed, 12 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 8125d43d3566..ef48dcc98cfc 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,7 +20,9 @@ KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
+KVM_X86_OP_NULL(mmu_prezap)
 KVM_X86_OP_NULL(vm_destroy)
+KVM_X86_OP_NULL(vm_free)
 KVM_X86_OP(vcpu_create)
 KVM_X86_OP(vcpu_free)
 KVM_X86_OP(vcpu_reset)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8de357a9ad30..5ff7a0fba311 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1326,7 +1326,9 @@ struct kvm_x86_ops {
 	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
+	void (*mmu_prezap)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
+	void (*vm_free)(struct kvm *kvm);
 
 	/* Create, but do not attach this VCPU */
 	int (*vcpu_create)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f6438750d190..a48f5c69fadb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -11779,6 +11779,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	static_call_cond(kvm_x86_vm_free)(kvm);
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
@@ -12036,6 +12037,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
+	/*
+	 * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
+	 * tearing down private page tables, TDX requires some TD resources to
+	 * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
+	 * kvm_x86_mmu_prezap() for this.
+	 */
+	static_call_cond(kvm_x86_mmu_prezap)(kvm);
 	kvm_mmu_zap_all(kvm);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 022/104] KVM: Add max_vcpus field in common 'struct kvm'
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (20 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 12:42   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid isaku.yamahata
                   ` (83 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

For TDX guests, the maximum number of vCPUs needs to be specified when the
TDX guest VM is initialized (i.e. when the TDX data corresponding to the
guest TD is created), before any vCPU is created.  KVM therefore needs to
record the maximum number of vCPUs at VM creation (KVM_CREATE_VM) and
return an error if the number of created vCPUs would exceed it.

Because arm64's struct kvm_arch already has a max_vcpus member, move it to
the common struct kvm and initialize it to KVM_MAX_VCPUS before
kvm_arch_init_vm(), instead of adding an equivalent field to the x86
struct kvm_arch.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/arm64/include/asm/kvm_host.h | 3 ---
 arch/arm64/kvm/arm.c              | 6 +++---
 arch/arm64/kvm/vgic/vgic-init.c   | 6 +++---
 include/linux/kvm_host.h          | 1 +
 virt/kvm/kvm_main.c               | 3 ++-
 5 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 5bc01e62c08a..27249d634605 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -107,9 +107,6 @@ struct kvm_arch {
 	/* VTCR_EL2 value for this VM */
 	u64    vtcr;
 
-	/* The maximum number of vCPUs depends on the used GIC model */
-	int max_vcpus;
-
 	/* Interrupt controller */
 	struct vgic_dist	vgic;
 
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index ecc5958e27fe..defec2cd94bd 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -153,7 +153,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm_vgic_early_init(kvm);
 
 	/* The maximum number of VCPUs is limited by the host's GIC model */
-	kvm->arch.max_vcpus = kvm_arm_default_max_vcpus();
+	kvm->max_vcpus = kvm_arm_default_max_vcpus();
 
 	set_default_spectre(kvm);
 
@@ -229,7 +229,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_MAX_VCPUS:
 	case KVM_CAP_MAX_VCPU_ID:
 		if (kvm)
-			r = kvm->arch.max_vcpus;
+			r = kvm->max_vcpus;
 		else
 			r = kvm_arm_default_max_vcpus();
 		break;
@@ -305,7 +305,7 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
 	if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
 		return -EBUSY;
 
-	if (id >= kvm->arch.max_vcpus)
+	if (id >= kvm->max_vcpus)
 		return -EINVAL;
 
 	return 0;
diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
index fc00304fe7d8..77feafd5c0e3 100644
--- a/arch/arm64/kvm/vgic/vgic-init.c
+++ b/arch/arm64/kvm/vgic/vgic-init.c
@@ -98,11 +98,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
 	ret = 0;
 
 	if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
-		kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS;
+		kvm->max_vcpus = VGIC_V2_MAX_CPUS;
 	else
-		kvm->arch.max_vcpus = VGIC_V3_MAX_CPUS;
+		kvm->max_vcpus = VGIC_V3_MAX_CPUS;
 
-	if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) {
+	if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) {
 		ret = -E2BIG;
 		goto out_unlock;
 	}
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index f11039944c08..a56044a31bc6 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -715,6 +715,7 @@ struct kvm {
 	 * and is accessed atomically.
 	 */
 	atomic_t online_vcpus;
+	int max_vcpus;
 	int created_vcpus;
 	int last_boosted_vcpu;
 	struct list_head vm_list;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 52f72a366beb..3adee9c6b370 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1075,6 +1075,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
 	spin_lock_init(&kvm->gpc_lock);
 
 	INIT_LIST_HEAD(&kvm->devices);
+	kvm->max_vcpus = KVM_MAX_VCPUS;
 
 	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
 
@@ -3718,7 +3719,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
 		return -EINVAL;
 
 	mutex_lock(&kvm->lock);
-	if (kvm->created_vcpus == KVM_MAX_VCPUS) {
+	if (kvm->created_vcpus >= kvm->max_vcpus) {
 		mutex_unlock(&kvm->lock);
 		return -EINVAL;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (21 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 022/104] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-31  1:21   ` Kai Huang
  2022-04-05 13:08   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure isaku.yamahata
                   ` (82 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

An MKTME keyID is assigned to each guest TD, and the memory controller
encrypts the guest TD's memory with that keyID.  Add helper functions to
allocate and free MKTME keyIDs so that TDX KVM can assign a keyID to each
TD.

Also export the MKTME global keyID that is used to encrypt the TDX module
and its memory.
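
A minimal usage sketch from the KVM side (illustrative only; the real
callers are added when TD creation/destruction is implemented):

	int hkid = tdx_keyid_alloc();

	if (hkid < 0)
		return -EBUSY;		/* no TDX private keyID available */

	/* ... configure the keyID for the TD via SEAMCALLs ... */

	tdx_keyid_free(hkid);		/* a no-op for keyid <= 0 */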

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h |  6 ++++++
 arch/x86/virt/vmx/tdx.c    | 33 ++++++++++++++++++++++++++++++++-
 2 files changed, 38 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 9a8dc6afcb63..73bb472bd515 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -139,6 +139,9 @@ int tdx_detect(void);
 int tdx_init(void);
 bool platform_has_tdx(void);
 const struct tdsysinfo_struct *tdx_get_sysinfo(void);
+u32 tdx_get_global_keyid(void);
+int tdx_keyid_alloc(void);
+void tdx_keyid_free(int keyid);
 #else
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
 static inline int tdx_detect(void) { return -ENODEV; }
@@ -146,6 +149,9 @@ static inline int tdx_init(void) { return -ENODEV; }
 static inline bool platform_has_tdx(void) { return false; }
 struct tdsysinfo_struct;
 static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
+static inline u32 tdx_get_global_keyid(void) { return 0; };
+static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
+static inline void tdx_keyid_free(int keyid) { }
 #endif /* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx.c b/arch/x86/virt/vmx/tdx.c
index e45f188479cb..d714106321d4 100644
--- a/arch/x86/virt/vmx/tdx.c
+++ b/arch/x86/virt/vmx/tdx.c
@@ -113,7 +113,13 @@ static int tdx_cmr_num;
 static struct tdsysinfo_struct tdx_sysinfo;
 
 /* TDX global KeyID to protect TDX metadata */
-static u32 tdx_global_keyid;
+static u32 __read_mostly tdx_global_keyid;
+
+u32 tdx_get_global_keyid(void)
+{
+	return tdx_global_keyid;
+}
+EXPORT_SYMBOL_GPL(tdx_get_global_keyid);
 
 static bool enable_tdx_host;
 
@@ -189,6 +195,31 @@ static void detect_seam(struct cpuinfo_x86 *c)
 		detect_seam_ap(c);
 }
 
+/* TDX KeyID pool */
+static DEFINE_IDA(tdx_keyid_pool);
+
+int tdx_keyid_alloc(void)
+{
+	if (WARN_ON_ONCE(!tdx_keyid_start || !tdx_keyid_num))
+		return -EINVAL;
+
+	/* The first keyID is reserved for the global key. */
+	return ida_alloc_range(&tdx_keyid_pool, tdx_keyid_start + 1,
+			       tdx_keyid_start + tdx_keyid_num - 1,
+			       GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(tdx_keyid_alloc);
+
+void tdx_keyid_free(int keyid)
+{
+	/* keyid = 0 is reserved. */
+	if (keyid <= 0)
+		return;
+
+	ida_free(&tdx_keyid_pool, keyid);
+}
+EXPORT_SYMBOL_GPL(tdx_keyid_free);
+
 static void detect_tdx_keyids_bsp(struct cpuinfo_x86 *c)
 {
 	u64 keyid_part;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (22 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-31  4:17   ` Kai Huang
  2022-04-05 12:44   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
                   ` (81 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

As the first step of creating a TDX guest, implement creation and
destruction of the VM structure.  Assign a Host Key ID (HKID) to the TDX
guest for memory encryption and allocate the extra pages the TDX module
requires for the guest.  On destruction, free the allocated pages and the
HKID.

Add a second kvm_x86_ops hook in kvm_arch_destroy_vm() to support TDX's
destruction path, which needs to first put the VM into a teardown state,
then free per-vCPU resources, and finally free per-VM resources.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c      |  16 +-
 arch/x86/kvm/vmx/tdx.c       | 312 +++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h       |   2 +
 arch/x86/kvm/vmx/tdx_errno.h |   2 +-
 arch/x86/kvm/vmx/tdx_ops.h   |   8 +
 arch/x86/kvm/vmx/x86_ops.h   |   8 +
 6 files changed, 346 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6111c6485d8e..5c3a904a30e8 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -39,12 +39,24 @@ static int vt_vm_init(struct kvm *kvm)
 		ret = tdx_module_setup();
 		if (ret)
 			return ret;
-		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
+		return tdx_vm_init(kvm);
 	}
 
 	return vmx_vm_init(kvm);
 }
 
+static void vt_mmu_prezap(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_mmu_prezap(kvm);
+}
+
+static void vt_vm_free(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_vm_free(kvm);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -58,6 +70,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.is_vm_type_supported = vt_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vt_vm_init,
+	.mmu_prezap = vt_mmu_prezap,
+	.vm_free = vt_vm_free,
 
 	.vcpu_create = vmx_vcpu_create,
 	.vcpu_free = vmx_vcpu_free,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1c8222f54764..702953fd365f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -31,14 +31,324 @@ struct tdx_capabilities {
 	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
 };
 
+/* KeyID used by TDX module */
+static u32 tdx_global_keyid __read_mostly;
+
 /* Capabilities of KVM + the TDX module. */
 struct tdx_capabilities tdx_caps;
 
+static DEFINE_MUTEX(tdx_lock);
 static struct mutex *tdx_mng_key_config_lock;
 
 static u64 hkid_mask __ro_after_init;
 static u8 hkid_start_pos __ro_after_init;
 
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+	pa &= ~hkid_mask;
+	pa |= (u64)hkid << hkid_start_pos;
+
+	return pa;
+}
+
+static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->tdr.added;
+}
+
+static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
+{
+	tdx_keyid_free(kvm_tdx->hkid);
+	kvm_tdx->hkid = -1;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->hkid > 0;
+}
+
+static void tdx_clear_page(unsigned long page)
+{
+	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+	unsigned long i;
+
+	/* Zeroing the page is only necessary for systems with MKTME-i. */
+	if (!static_cpu_has(X86_FEATURE_MOVDIR64B))
+		return;
+
+	for (i = 0; i < 4096; i += 64)
+		/* MOVDIR64B [rdx], es:rdi */
+		asm (".byte 0x66, 0x0f, 0x38, 0xf8, 0x3a"
+		     : : "d" (zero_page), "D" (page + i) : "memory");
+}
+
+static int __tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
+{
+	struct tdx_module_output out;
+	u64 err;
+
+	err = tdh_phymem_page_reclaim(pa, &out);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
+		return -EIO;
+	}
+
+	if (do_wb) {
+		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+			return -EIO;
+		}
+	}
+
+	tdx_clear_page(va);
+	return 0;
+}
+
+static int tdx_reclaim_page(unsigned long va, hpa_t pa)
+{
+	return __tdx_reclaim_page(va, pa, false, 0);
+}
+
+static int tdx_alloc_td_page(struct tdx_td_page *page)
+{
+	page->va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!page->va)
+		return -ENOMEM;
+
+	page->pa = __pa(page->va);
+	return 0;
+}
+
+static void tdx_mark_td_page_added(struct tdx_td_page *page)
+{
+	WARN_ON_ONCE(page->added);
+	page->added = true;
+}
+
+static void tdx_reclaim_td_page(struct tdx_td_page *page)
+{
+	if (page->added) {
+		if (tdx_reclaim_page(page->va, page->pa))
+			return;
+
+		page->added = false;
+	}
+	free_page(page->va);
+}
+
+static int tdx_do_tdh_phymem_cache_wb(void *param)
+{
+	u64 err = 0;
+
+	/*
+	 * Multiple guest TDs may be destroyed simultaneously.  Prevent
+	 * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
+	 */
+	mutex_lock(&tdx_lock);
+	do {
+		err = tdh_phymem_cache_wb(!!err);
+	} while (err == TDX_INTERRUPTED_RESUMABLE);
+	mutex_unlock(&tdx_lock);
+
+	/* Other thread may have done for us. */
+	if (err == TDX_NO_HKID_READY_TO_WBCACHE)
+		err = TDX_SUCCESS;
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+void tdx_mmu_prezap(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages;
+	bool cpumask_allocated;
+	u64 err;
+	int ret;
+	int i;
+
+	if (!is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (!is_td_created(kvm_tdx))
+		goto free_hkid;
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_key_reclaimid(kvm_tdx->tdr.pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_RECLAIMID, err, NULL);
+		return;
+	}
+
+	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
+	for_each_online_cpu(i) {
+		if (cpumask_allocated &&
+			cpumask_test_and_set_cpu(topology_physical_package_id(i),
+						packages))
+			continue;
+
+		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
+		if (ret)
+			break;
+	}
+	free_cpumask_var(packages);
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_key_freeid(kvm_tdx->tdr.pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
+		return;
+	}
+
+free_hkid:
+	tdx_hkid_free(kvm_tdx);
+}
+
+void tdx_vm_free(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int i;
+
+	/* Can't reclaim or free TD pages if teardown failed. */
+	if (is_hkid_assigned(kvm_tdx))
+		return;
+
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
+		tdx_reclaim_td_page(&kvm_tdx->tdcs[i]);
+	kfree(kvm_tdx->tdcs);
+
+	if (kvm_tdx->tdr.added &&
+		__tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, true,
+				tdx_global_keyid))
+		return;
+
+	free_page(kvm_tdx->tdr.va);
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+	hpa_t *tdr_p = param;
+	int cpu, cur_pkg;
+	u64 err;
+
+	cpu = raw_smp_processor_id();
+	cur_pkg = topology_physical_package_id(cpu);
+
+	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
+	do {
+		err = tdh_mng_key_config(*tdr_p);
+	} while (err == TDX_KEY_GENERATION_FAILED);
+	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
+
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+int tdx_vm_init(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages;
+	int ret, i;
+	u64 err;
+
+	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
+	kvm->max_vcpus = 0;
+
+	kvm_tdx->hkid = tdx_keyid_alloc();
+	if (kvm_tdx->hkid < 0)
+		return -EBUSY;
+
+	ret = tdx_alloc_td_page(&kvm_tdx->tdr);
+	if (ret)
+		goto free_hkid;
+
+	kvm_tdx->tdcs = kcalloc(tdx_caps.tdcs_nr_pages, sizeof(*kvm_tdx->tdcs),
+				GFP_KERNEL_ACCOUNT);
+	if (!kvm_tdx->tdcs)
+		goto free_tdr;
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		ret = tdx_alloc_td_page(&kvm_tdx->tdcs[i]);
+		if (ret)
+			goto free_tdcs;
+	}
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_create(kvm_tdx->tdr.pa, kvm_tdx->hkid);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_CREATE, err, NULL);
+		ret = -EIO;
+		goto free_tdcs;
+	}
+	tdx_mark_td_page_added(&kvm_tdx->tdr);
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto free_tdcs;
+	}
+	for_each_online_cpu(i) {
+		if (cpumask_test_and_set_cpu(topology_physical_package_id(i),
+						packages))
+			continue;
+
+		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
+				&kvm_tdx->tdr.pa, 1);
+		if (ret)
+			break;
+	}
+	free_cpumask_var(packages);
+	if (ret)
+		goto teardown;
+
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		err = tdh_mng_addcx(kvm_tdx->tdr.pa, kvm_tdx->tdcs[i].pa);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
+			ret = -EIO;
+			goto teardown;
+		}
+		tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
+	}
+
+	/*
+	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
+	 * ioctl() to configure the CPUID values for the TD.
+	 */
+	return 0;
+
+	/*
+	 * The sequence for freeing resources from a partially initialized TD
+	 * varies based on where in the initialization flow failure occurred.
+	 * Simply use the full teardown and destroy, which naturally play nice
+	 * with partial initialization.
+	 */
+teardown:
+	tdx_mmu_prezap(kvm);
+	tdx_vm_free(kvm);
+	return ret;
+
+free_tdcs:
+	/* @i points at the TDCS page that failed allocation. */
+	for (--i; i >= 0; i--)
+		free_page(kvm_tdx->tdcs[i].va);
+	kfree(kvm_tdx->tdcs);
+free_tdr:
+	free_page(kvm_tdx->tdr.va);
+free_hkid:
+	tdx_hkid_free(kvm_tdx);
+	return ret;
+}
+
 static int __tdx_module_setup(void)
 {
 	const struct tdsysinfo_struct *tdsysinfo;
@@ -59,6 +369,8 @@ static int __tdx_module_setup(void)
 		return ret;
 	}
 
+	tdx_global_keyid = tdx_get_global_keyid();
+
 	tdsysinfo = tdx_get_sysinfo();
 	if (tdx_caps.nr_cpuid_configs > TDX_MAX_NR_CPUID_CONFIGS)
 		return -EIO;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e4bb8831764e..860136ed70f5 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -19,6 +19,8 @@ struct kvm_tdx {
 
 	struct tdx_td_page tdr;
 	struct tdx_td_page *tdcs;
+
+	int hkid;
 };
 
 struct vcpu_tdx {
diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
index 5c878488795d..590fcfdd1899 100644
--- a/arch/x86/kvm/vmx/tdx_errno.h
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -12,11 +12,11 @@
 #define TDX_SUCCESS				0x0000000000000000ULL
 #define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
 #define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
-#define TDX_LIFECYCLE_STATE_INCORRECT		0xC000060700000000ULL
 #define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
 #define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
 #define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
 #define TDX_KEY_CONFIGURED			0x0000081500000000ULL
+#define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
 #define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
 
 /*
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 0bed43879b82..3dd5b4c3f04c 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -6,6 +6,7 @@
 
 #include <linux/compiler.h>
 
+#include <asm/cacheflush.h>
 #include <asm/asm.h>
 #include <asm/kvm_host.h>
 
@@ -15,8 +16,14 @@
 
 #ifdef CONFIG_INTEL_TDX_HOST
 
+static inline void tdx_clflush_page(hpa_t addr)
+{
+	clflush_cache_range(__va(addr), PAGE_SIZE);
+}
+
 static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 {
+	tdx_clflush_page(addr);
 	return kvm_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
 }
 
@@ -56,6 +63,7 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)
 
 static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
 {
+	tdx_clflush_page(tdr);
 	return kvm_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
 }
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index da32b4b86b19..2b2738c768d6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -132,12 +132,20 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
 bool tdx_is_vm_type_supported(unsigned long type);
 void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);
+
+int tdx_vm_init(struct kvm *kvm);
+void tdx_mmu_prezap(struct kvm *kvm);
+void tdx_vm_free(struct kvm *kvm);
 #else
 static inline void tdx_pre_kvm_init(
 	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
 static inline void tdx_hardware_unsetup(void) {}
+
+static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
+static inline void tdx_mmu_prezap(struct kvm *kvm) {}
+static inline void tdx_vm_free(struct kvm *kvm) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (23 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 12:50   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters isaku.yamahata
                   ` (80 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a placeholder function for the TDX-specific VM-scoped ioctl exposed as
mem_enc_op.  TDX-specific sub-commands will be added later to retrieve and
pass TDX-specific parameters.

KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest state-protected VMs and defines technology-specific subcommands.
Despite its name, the subcommands are not limited to memory encryption;
various technology-specific operations are defined under it.  It is
therefore natural to repurpose KVM_MEMORY_ENCRYPT_OP for TDX-specific
operations and define TDX subcommands under it.

TDX requires both VM-scoped and vCPU-scoped TDX-specific operations for the
device model (for example, qemu): getting system-wide parameters,
TDX-specific VM initialization, and TDX-specific vCPU initialization.
Supporting these requires KVM vCPU-scoped operations in addition to the
existing VM-scoped operations.
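
For reference, a device model is expected to reach this path roughly as in
the sketch below (illustrative only; vm_fd is a placeholder for the TD's VM
file descriptor and no subcommand is defined by this patch yet):

	#include <string.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	/* Issue a TDX subcommand on a TDX VM file descriptor. */
	static int tdx_vm_cmd(int vm_fd, __u32 id, __u32 metadata, __u64 data)
	{
		struct kvm_tdx_cmd cmd;

		memset(&cmd, 0, sizeof(cmd));
		cmd.id = id;
		cmd.metadata = metadata;
		cmd.data = data;

		/* For a non-TD VM the ioctl fails with ENOTTY. */
		return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	}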

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h       | 11 +++++++++++
 arch/x86/kvm/vmx/main.c               | 10 ++++++++++
 arch/x86/kvm/vmx/tdx.c                | 24 ++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h            |  4 ++++
 tools/arch/x86/include/uapi/asm/kvm.h | 11 +++++++++++
 5 files changed, 60 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 71a5851475e7..2ad61caf4e0b 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -528,4 +528,15 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_TDX_VM		1
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	__u32 id;
+	__u32 metadata;
+	__u64 data;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 5c3a904a30e8..fc497f50e0e1 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -57,6 +57,14 @@ static void vt_vm_free(struct kvm *kvm)
 		return tdx_vm_free(kvm);
 }
 
+static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
+{
+	if (!is_td(kvm))
+		return -ENOTTY;
+
+	return tdx_vm_ioctl(kvm, argp);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -195,6 +203,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.mem_enc_op = vt_mem_enc_op,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 702953fd365f..8c67444d052a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -349,6 +349,30 @@ int tdx_vm_init(struct kvm *kvm)
 	return ret;
 }
 
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_tdx_cmd tdx_cmd;
+	int r;
+
+	if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+		return -EFAULT;
+
+	mutex_lock(&kvm->lock);
+
+	switch (tdx_cmd.id) {
+	default:
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+		r = -EFAULT;
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
 static int __tdx_module_setup(void)
 {
 	const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 2b2738c768d6..9d88ba9b093b 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -136,6 +136,8 @@ void tdx_hardware_unsetup(void);
 int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_prezap(struct kvm *kvm);
 void tdx_vm_free(struct kvm *kvm);
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline void tdx_pre_kvm_init(
 	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
@@ -146,6 +148,8 @@ static inline void tdx_hardware_unsetup(void) {}
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_prezap(struct kvm *kvm) {}
 static inline void tdx_vm_free(struct kvm *kvm) {}
+
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 71a5851475e7..2ad61caf4e0b 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -528,4 +528,15 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_TDX_VM		1
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	__u32 id;
+	__u32 metadata;
+	__u64 data;
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (24 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 12:52   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
                   ` (79 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement a VM-scoped subcommand to get system-wide parameters.  Although
these parameters are system-wide rather than per-VM, the subcommand is
VM-scoped because
- The device model needs the TDX system-wide parameters after creating the
  KVM VM.
- This subcommand requires the TDX module to be initialized.  For lazy
  initialization of the TDX module, a VM-scoped ioctl is a better fit.
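
A rough sketch of the expected call from the device model (a fragment,
illustrative only; vm_fd and NR_CONFIGS are placeholders and includes are
omitted):

	#define NR_CONFIGS 32	/* generous guess; -E2BIG means it was too small */

	struct kvm_tdx_capabilities *caps;
	struct kvm_tdx_cmd cmd = { .id = KVM_TDX_CAPABILITIES };

	caps = calloc(1, sizeof(*caps) +
		      NR_CONFIGS * sizeof(struct kvm_tdx_cpuid_config));
	caps->nr_cpuid_configs = NR_CONFIGS;
	cmd.data = (__u64)(unsigned long)caps;

	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
		return -errno;
	/* caps->nr_cpuid_configs and caps->cpuid_configs[] are now filled in. */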

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h       | 22 ++++++++++++++
 arch/x86/kvm/vmx/tdx.c                | 41 +++++++++++++++++++++++++++
 tools/arch/x86/include/uapi/asm/kvm.h | 22 ++++++++++++++
 3 files changed, 85 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 2ad61caf4e0b..70f9be4ea575 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -530,6 +530,8 @@ struct kvm_pmu_event_filter {
 
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+
 	KVM_TDX_CMD_NR_MAX,
 };
 
@@ -539,4 +541,24 @@ struct kvm_tdx_cmd {
 	__u64 data;
 };
 
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+
+	__u32 nr_cpuid_configs;
+	__u32 padding;
+	struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8c67444d052a..20b45bb0b032 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -349,6 +349,44 @@ int tdx_vm_init(struct kvm *kvm)
 	return ret;
 }
 
+static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx_capabilities __user *user_caps;
+	struct kvm_tdx_capabilities caps;
+
+	BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+		     sizeof(struct tdx_cpuid_config));
+
+	WARN_ON(cmd->id != KVM_TDX_CAPABILITIES);
+	if (cmd->metadata)
+		return -EINVAL;
+
+	user_caps = (void __user *)cmd->data;
+	if (copy_from_user(&caps, user_caps, sizeof(caps)))
+		return -EFAULT;
+
+	if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
+		return -E2BIG;
+
+	caps = (struct kvm_tdx_capabilities) {
+		.attrs_fixed0 = tdx_caps.attrs_fixed0,
+		.attrs_fixed1 = tdx_caps.attrs_fixed1,
+		.xfam_fixed0 = tdx_caps.xfam_fixed0,
+		.xfam_fixed1 = tdx_caps.xfam_fixed1,
+		.nr_cpuid_configs = tdx_caps.nr_cpuid_configs,
+		.padding = 0,
+	};
+
+	if (copy_to_user(user_caps, &caps, sizeof(caps)))
+		return -EFAULT;
+	if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
+			 tdx_caps.nr_cpuid_configs *
+			 sizeof(struct tdx_cpuid_config)))
+		return -EFAULT;
+
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -360,6 +398,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	mutex_lock(&kvm->lock);
 
 	switch (tdx_cmd.id) {
+	case KVM_TDX_CAPABILITIES:
+		r = tdx_capabilities(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 2ad61caf4e0b..70f9be4ea575 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -530,6 +530,8 @@ struct kvm_pmu_event_filter {
 
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+
 	KVM_TDX_CMD_NR_MAX,
 };
 
@@ -539,4 +541,24 @@ struct kvm_tdx_cmd {
 	__u64 data;
 };
 
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+
+	__u32 nr_cpuid_configs;
+	__u32 padding;
+	struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (25 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-31  4:55   ` Kai Huang
  2022-04-05 12:58   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 028/104] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction isaku.yamahata
                   ` (78 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Xiaoyao Li <xiaoyao.li@intel.com>

TDX requires additional parameters for a TDX VM for confidential execution,
to protect the confidentiality of its memory contents and its CPU state
from any other software, including the VMM.  Those parameters must be
provided when the guest TD VM is created, before any vCPU is created: the
maximum number of vCPUs, the TSC frequency (which is the same for all vCPUs
and cannot be changed), the CPUIDs emulated by the TDX module (which the
guest can therefore trust), and SHA-384 values for measurement.

Add a new subcommand, KVM_TDX_INIT_VM, to pass these parameters for the TDX
guest.  It assigns an encryption key to the TDX guest for memory
encryption; TDX encrypts memory on a per-guest basis.  The device model
passes the per-VM parameters for the TDX guest: the maximum number of
vCPUs, the TSC frequency (the TDX guest has a fixed VM-wide TSC frequency,
not per-vCPU, which the guest cannot change), attributes (production or
debug), available extended features (reflected into the guest XCR0 and
IA32_XSS MSR), CPUIDs, SHA-384 measurements, and so on.
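
For illustration, a minimal sketch of the expected call from the device
model (a fragment; vm_fd and cpuid_data are placeholders, where cpuid_data
points to a struct kvm_cpuid2 the device model has already built):

	struct kvm_tdx_init_vm init_vm;
	struct kvm_tdx_cmd cmd = { .id = KVM_TDX_INIT_VM };

	memset(&init_vm, 0, sizeof(init_vm));	/* reserved[] must be zero */
	init_vm.max_vcpus = 1;
	init_vm.tsc_khz = 0;		/* 0: fall back to the host's max_tsc_khz */
	init_vm.attributes = 0;		/* production TD; no debug, no perfmon */
	init_vm.cpuid = (__u64)(unsigned long)cpuid_data;
	/* mrconfigid/mrowner/mrownerconfig left as all-zero measurements. */

	cmd.data = (__u64)(unsigned long)&init_vm;
	if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
		return -errno;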

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h       |   2 +
 arch/x86/include/uapi/asm/kvm.h       |  12 ++
 arch/x86/kvm/vmx/tdx.c                | 200 ++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                |  26 ++++
 arch/x86/kvm/x86.c                    |   3 +-
 arch/x86/kvm/x86.h                    |   2 +
 tools/arch/x86/include/uapi/asm/kvm.h |  12 ++
 7 files changed, 256 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 5ff7a0fba311..290e200f012c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1234,6 +1234,8 @@ struct kvm_arch {
 	hpa_t	hv_root_tdp;
 	spinlock_t hv_root_tdp_lock;
 #endif
+
+	gfn_t gfn_shared_mask;
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 70f9be4ea575..6e26dde0dce6 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -561,4 +562,15 @@ struct kvm_tdx_capabilities {
 	struct kvm_tdx_cpuid_config cpuid_configs[0];
 };
 
+struct kvm_tdx_init_vm {
+	__u32 max_vcpus;
+	__u32 tsc_khz;
+	__u64 attributes;
+	__u64 cpuid;
+	__u64 mrconfigid[6];	/* sha384 digest */
+	__u64 mrowner[6];	/* sha384 digest */
+	__u64 mrownerconfig[6];	/* sha384 digest */
+	__u64 reserved[43];	/* must be zero for future extensibility */
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 20b45bb0b032..236faaca68a0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -387,6 +387,203 @@ static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return 0;
 }
 
+static struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(struct kvm_tdx *kvm_tdx,
+						u32 function, u32 index)
+{
+	struct kvm_cpuid_entry2 *e;
+	int i;
+
+	for (i = 0; i < kvm_tdx->cpuid_nent; i++) {
+		e = &kvm_tdx->cpuid_entries[i];
+
+		if (e->function == function && (e->index == index ||
+		    !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
+			return e;
+	}
+	return NULL;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+			struct kvm_tdx_init_vm *init_vm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_cpuid_config *config;
+	struct kvm_cpuid_entry2 *entry;
+	struct tdx_cpuid_value *value;
+	u64 guest_supported_xcr0;
+	u64 guest_supported_xss;
+	u32 guest_tsc_khz;
+	int max_pa;
+	int i;
+
+	/* init_vm->reserved must be zero */
+	if (find_first_bit((unsigned long *)init_vm->reserved,
+			   sizeof(init_vm->reserved) * 8) !=
+	    sizeof(init_vm->reserved) * 8)
+		return -EINVAL;
+
+	td_params->max_vcpus = init_vm->max_vcpus;
+
+	td_params->attributes = init_vm->attributes;
+	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
+		pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
+			"host perf registers properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	/* TODO: Enforce consistent CPUID features for all vCPUs. */
+	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
+		config = &tdx_caps.cpuid_configs[i];
+
+		entry = tdx_find_cpuid_entry(kvm_tdx, config->leaf,
+					     config->sub_leaf);
+		if (!entry)
+			continue;
+
+		/*
+		 * Non-configurable bits must be '0', even if they are fixed to
+		 * '1' by the TDX module, i.e. mask off non-configurable bits.
+		 */
+		value = &td_params->cpuid_values[i];
+		value->eax = entry->eax & config->eax;
+		value->ebx = entry->ebx & config->ebx;
+		value->ecx = entry->ecx & config->ecx;
+		value->edx = entry->edx & config->edx;
+	}
+
+	max_pa = 36;
+	entry = tdx_find_cpuid_entry(kvm_tdx, 0x80000008, 0);
+	if (entry)
+		max_pa = entry->eax & 0xff;
+
+	td_params->eptp_controls = VMX_EPTP_MT_WB;
+	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+		td_params->eptp_controls |= VMX_EPTP_PWL_5;
+		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
+	} else {
+		td_params->eptp_controls |= VMX_EPTP_PWL_4;
+	}
+
+	/* Setup td_params.xfam */
+	entry = tdx_find_cpuid_entry(kvm_tdx, 0xd, 0);
+	if (entry)
+		guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
+	else
+		guest_supported_xcr0 = 0;
+	guest_supported_xcr0 &= supported_xcr0;
+
+	entry = tdx_find_cpuid_entry(kvm_tdx, 0xd, 1);
+	if (entry)
+		guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
+	else
+		guest_supported_xss = 0;
+	/* PT can be exposed to TD guest regardless of KVM's XSS support */
+	guest_supported_xss &= (supported_xss | XFEATURE_MASK_PT);
+
+	td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
+	if (td_params->xfam & TDX_TD_XFAM_LBR) {
+		pr_warn("TD doesn't support LBR. KVM needs to save/restore "
+			"IA32_LBR_DEPTH properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (td_params->xfam & TDX_TD_XFAM_AMX) {
+		pr_warn("TD doesn't support AMX. KVM needs to save/restore "
+			"IA32_XFD, IA32_XFD_ERR properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (init_vm->tsc_khz)
+		guest_tsc_khz = init_vm->tsc_khz;
+	else
+		guest_tsc_khz = max_tsc_khz;
+	td_params->tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(guest_tsc_khz);
+
+#define BUILD_BUG_ON_MEMCPY(dst, src)				\
+	do {							\
+		BUILD_BUG_ON(sizeof(dst) != sizeof(src));	\
+		memcpy((dst), (src), sizeof(dst));		\
+	} while (0)
+
+	BUILD_BUG_ON_MEMCPY(td_params->mrconfigid, init_vm->mrconfigid);
+	BUILD_BUG_ON_MEMCPY(td_params->mrowner, init_vm->mrowner);
+	BUILD_BUG_ON_MEMCPY(td_params->mrownerconfig, init_vm->mrownerconfig);
+
+	return 0;
+}
+
+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_cpuid2 __user *user_cpuid;
+	struct kvm_tdx_init_vm init_vm;
+	struct td_params *td_params;
+	struct tdx_module_output out;
+	struct kvm_cpuid2 cpuid;
+	int ret;
+	u64 err;
+
+	BUILD_BUG_ON(sizeof(init_vm) != 512);
+	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+	if (is_td_initialized(kvm))
+		return -EINVAL;
+
+	if (cmd->metadata)
+		return -EINVAL;
+
+	if (copy_from_user(&init_vm, (void __user *)cmd->data, sizeof(init_vm)))
+		return -EFAULT;
+
+	if (init_vm.max_vcpus > KVM_MAX_VCPUS)
+		return -EINVAL;
+
+	user_cpuid = (void *)init_vm.cpuid;
+	if (copy_from_user(&cpuid, user_cpuid, sizeof(cpuid)))
+		return -EFAULT;
+
+	if (cpuid.nent > KVM_MAX_CPUID_ENTRIES)
+		return -E2BIG;
+
+	if (copy_from_user(&kvm_tdx->cpuid_entries, user_cpuid->entries,
+			   cpuid.nent * sizeof(struct kvm_cpuid_entry2)))
+		return -EFAULT;
+
+	td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL_ACCOUNT);
+	if (!td_params)
+		return -ENOMEM;
+
+	kvm_tdx->cpuid_nent = cpuid.nent;
+
+	ret = setup_tdparams(kvm, td_params, &init_vm);
+	if (ret)
+		goto free_tdparams;
+
+	err = tdh_mng_init(kvm_tdx->tdr.pa, __pa(td_params), &out);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_INIT, err, &out);
+		ret = -EIO;
+		goto free_tdparams;
+	}
+
+	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+	kvm_tdx->attributes = td_params->attributes;
+	kvm_tdx->xfam = td_params->xfam;
+	kvm_tdx->tsc_khz = TDX_TSC_25MHZ_TO_KHZ(td_params->tsc_frequency);
+	kvm->max_vcpus = td_params->max_vcpus;
+
+	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
+		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
+	else
+		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+
+free_tdparams:
+	kfree(td_params);
+	if (ret)
+		kvm_tdx->cpuid_nent = 0;
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -401,6 +598,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_CAPABILITIES:
 		r = tdx_capabilities(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_INIT_VM:
+		r = tdx_td_init(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 860136ed70f5..f116c40ac319 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -20,7 +20,15 @@ struct kvm_tdx {
 	struct tdx_td_page tdr;
 	struct tdx_td_page *tdcs;
 
+	u64 attributes;
+	u64 xfam;
 	int hkid;
+
+	int cpuid_nent;
+	struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
+
+	u64 tsc_offset;
+	unsigned long tsc_khz;
 };
 
 struct vcpu_tdx {
@@ -50,6 +58,11 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
 
+static inline bool is_td_initialized(struct kvm *kvm)
+{
+	return !!kvm->max_vcpus;
+}
+
 static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
 {
 	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
@@ -135,6 +148,19 @@ TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
 TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
 TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
 
+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+	struct tdx_module_output out;
+	u64 err;
+
+	err = tdh_mng_rd(kvm_tdx->tdr.pa, TDCS_EXEC(field), &out);
+	if (unlikely(err)) {
+		pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err);
+		return 0;
+	}
+	return out.r8;
+}
+
 #else
 static inline int tdx_module_setup(void) { return -ENODEV; };
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a48f5c69fadb..734699bd940f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2274,7 +2274,8 @@ static atomic_t kvm_guest_has_master_clock = ATOMIC_INIT(0);
 #endif
 
 static DEFINE_PER_CPU(unsigned long, cpu_tsc_khz);
-static unsigned long max_tsc_khz;
+unsigned long max_tsc_khz;
+EXPORT_SYMBOL_GPL(max_tsc_khz);
 
 static u32 adjust_tsc_khz(u32 khz, s32 ppm)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f11d945ac41f..5ff3badc3f2b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -302,6 +302,8 @@ extern int pi_inject_timer;
 
 extern bool report_ignored_msrs;
 
+extern unsigned long max_tsc_khz;
+
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
 	return pvclock_scale_delta(nsec, vcpu->arch.virtual_tsc_mult,
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 70f9be4ea575..6e26dde0dce6 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -561,4 +562,15 @@ struct kvm_tdx_capabilities {
 	struct kvm_tdx_cpuid_config cpuid_configs[0];
 };
 
+struct kvm_tdx_init_vm {
+	__u32 max_vcpus;
+	__u32 tsc_khz;
+	__u64 attributes;
+	__u64 cpuid;
+	__u64 mrconfigid[6];	/* sha384 digest */
+	__u64 mrowner[6];	/* sha384 digest */
+	__u64 mrownerconfig[6];	/* sha384 digest */
+	__u64 reserved[43];	/* must be zero for future extensibility */
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 028/104] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (26 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 029/104] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
                   ` (77 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of the patch series for TD vcpu
creation/destruction.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 4066495da3d1..9b63afa6cd1a 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -9,15 +9,15 @@ Layer status
 What qemu can do
 ----------------
 - TDX VM TYPE is exposed to Qemu.
-- Qemu can try to create VM of TDX VM type and then fails.
+- Qemu can create/destroy guest of TDX vm type.
 
 Patch Layer status
 ------------------
   Patch layer                          Status
 * TDX, VMX coexistence:                 Applied
 * TDX architectural definitions:        Applied
-* TD VM creation/destruction:           Applying
-* TD vcpu creation/destruction:         Not yet
+* TD VM creation/destruction:           Applied
+* TD vcpu creation/destruction:         Applying
 * TDX EPT violation:                    Not yet
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 029/104] KVM: TDX: allocate/free TDX vcpu structure
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (27 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 028/104] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 13:04   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 030/104] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
                   ` (76 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The next step of TDX guest creation is to create vCPUs.  Allocate the TDX
vCPU structures and initialize them.  Also allocate the pages that the TDX
module requires for the TDX vCPU.

In the conventional (non-TDX) case, CPUID is empty at vCPU initialization
and is configured afterwards.  Because TDX supports only x2APIC mode, CPUID
is forcibly initialized to support x2APIC as part of vCPU initialization.
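
As a rough sketch of forcing x2APIC via the APIC base MSR (a hedged
illustration, not necessarily the exact code; tdx_vcpu_reset() below
declares apic_base_msr for this purpose, and the BSP check here is
simplified):

	struct msr_data apic_base_msr;

	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
	if (vcpu->vcpu_id == 0)			/* simplified BSP check */
		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
	apic_base_msr.host_initiated = true;
	if (kvm_set_apic_base(vcpu, &apic_base_msr))
		pr_warn("forcing x2APIC failed\n");	/* sketch-only handling */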

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  30 +++++++-
 arch/x86/kvm/vmx/tdx.c     | 142 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_ops.h |   2 +
 arch/x86/kvm/vmx/x86_ops.h |   8 +++
 arch/x86/kvm/x86.c         |   3 +-
 arch/x86/kvm/x86.h         |   1 +
 6 files changed, 182 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index fc497f50e0e1..a11d3e870a98 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -65,6 +65,30 @@ static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
 	return tdx_vm_ioctl(kvm, argp);
 }
 
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_create(vcpu);
+
+	return vmx_vcpu_create(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_free(vcpu);
+
+	return vmx_vcpu_free(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_reset(vcpu, init_event);
+
+	return vmx_vcpu_reset(vcpu, init_event);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -81,9 +105,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.mmu_prezap = vt_mmu_prezap,
 	.vm_free = vt_vm_free,
 
-	.vcpu_create = vmx_vcpu_create,
-	.vcpu_free = vmx_vcpu_free,
-	.vcpu_reset = vmx_vcpu_reset,
+	.vcpu_create = vt_vcpu_create,
+	.vcpu_free = vt_vcpu_free,
+	.vcpu_reset = vt_vcpu_reset,
 
 	.prepare_guest_switch = vmx_prepare_switch_to_guest,
 	.vcpu_load = vmx_vcpu_load,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 236faaca68a0..e43ca93ff95b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
 #include "capabilities.h"
 #include "x86_ops.h"
 #include "tdx.h"
+#include "x86.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) "tdx: " fmt
@@ -51,6 +52,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 	return pa;
 }
 
+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+	return tdx->tdvpr.added;
+}
+
 static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
 {
 	return kvm_tdx->tdr.added;
@@ -349,6 +355,142 @@ int tdx_vm_init(struct kvm *kvm)
 	return ret;
 }
 
+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int ret, i;
+
+	ret = tdx_alloc_td_page(&tdx->tdvpr);
+	if (ret)
+		return ret;
+
+	tdx->tdvpx = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx),
+			GFP_KERNEL_ACCOUNT);
+	if (!tdx->tdvpx) {
+		ret = -ENOMEM;
+		goto free_tdvpr;
+	}
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		ret = tdx_alloc_td_page(&tdx->tdvpx[i]);
+		if (ret)
+			goto free_tdvpx;
+	}
+
+	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+	vcpu->arch.cr0_guest_owned_bits = -1ul;
+	vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
+	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+	vcpu->arch.guest_state_protected =
+		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+
+	return 0;
+
+free_tdvpx:
+	/* @i points at the TDVPX page that failed allocation. */
+	for (--i; i >= 0; i--)
+		free_page(tdx->tdvpx[i].va);
+	kfree(tdx->tdvpx);
+free_tdvpr:
+	free_page(tdx->tdvpr.va);
+
+	return ret;
+}
+
+void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	/* Can't reclaim or free pages if teardown failed. */
+	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
+		return;
+
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
+		tdx_reclaim_td_page(&tdx->tdvpx[i]);
+	kfree(tdx->tdvpx);
+	tdx_reclaim_td_page(&tdx->tdvpr);
+}
+
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct msr_data apic_base_msr;
+	u64 err;
+	int i;
+
+	/* TDX doesn't support INIT event. */
+	if (WARN_ON(init_event))
+		goto td_bugged;
+	/* TDX supports only X2APIC enabled. */
+	if (WARN_ON(!vcpu->arch.apic))
+		goto td_bugged;
+	if (WARN_ON(is_td_vcpu_created(tdx)))
+		goto td_bugged;
+
+	/*
+	 * In TDX case, tsc frequency is per-VM and determined by the parameter
+	 * tdh_mng_init().  Forcibly set it instead of max_tsc_khz set by
+	 * kvm_arch_vcpu_create().
+	 *
+	 * This function is called after kvm_arch_vcpu_create() calling
+	 * kvm_set_tsc_khz().
+	 */
+	kvm_set_tsc_khz(vcpu, kvm_tdx->tsc_khz);
+
+	err = tdh_vp_create(kvm_tdx->tdr.pa, tdx->tdvpr.pa);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_VP_CREATE, err, NULL);
+		goto td_bugged;
+	}
+	tdx_mark_td_page_added(&tdx->tdvpr);
+
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		err = tdh_vp_addcx(tdx->tdvpr.pa, tdx->tdvpx[i].pa);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
+			goto td_bugged;
+		}
+		tdx_mark_td_page_added(&tdx->tdvpx[i]);
+	}
+
+	if (!vcpu->arch.cpuid_entries) {
+		/*
+		 * On cpu creation, cpuid entry is blank.  Forcibly enable
+		 * X2APIC feature to allow X2APIC.
+		 */
+		struct kvm_cpuid_entry2 *e;
+
+		e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);
+		*e  = (struct kvm_cpuid_entry2) {
+			.function = 1,	/* Features for X2APIC */
+			.index = 0,
+			.eax = 0,
+			.ebx = 0,
+			.ecx = 1ULL << 21,	/* X2APIC */
+			.edx = 0,
+		};
+		vcpu->arch.cpuid_entries = e;
+		vcpu->arch.cpuid_nent = 1;
+	}
+	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
+	if (kvm_vcpu_is_reset_bsp(vcpu))
+		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
+	apic_base_msr.host_initiated = true;
+	if (WARN_ON(kvm_set_apic_base(vcpu, &apic_base_msr)))
+		goto td_bugged;
+
+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+	return;
+
+td_bugged:
+	vcpu->kvm->vm_bugged = true;
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 3dd5b4c3f04c..dc76b3a5cf96 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -41,6 +41,7 @@ static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
 
 static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
 {
+	tdx_clflush_page(addr);
 	return kvm_seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, 0, NULL);
 }
 
@@ -69,6 +70,7 @@ static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
 
 static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
 {
+	tdx_clflush_page(tdvpr);
 	return kvm_seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, 0, NULL);
 }
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 9d88ba9b093b..f1640f201a19 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -137,6 +137,10 @@ int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_prezap(struct kvm *kvm);
 void tdx_vm_free(struct kvm *kvm);
 
+int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline void tdx_pre_kvm_init(
@@ -149,6 +153,10 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_prezap(struct kvm *kvm) {}
 static inline void tdx_vm_free(struct kvm *kvm) {}
 
+static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 734699bd940f..158e1891ac14 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2322,7 +2322,7 @@ static int set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz, bool scale)
 	return 0;
 }
 
-static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
+int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 {
 	u32 thresh_lo, thresh_hi;
 	int use_scaling = 0;
@@ -2354,6 +2354,7 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 	}
 	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
 }
+EXPORT_SYMBOL_GPL(kvm_set_tsc_khz);
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 5ff3badc3f2b..f15bf1c0aeb1 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -303,6 +303,7 @@ extern int pi_inject_timer;
 extern bool report_ignored_msrs;
 
 extern unsigned long max_tsc_khz;
+int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz);
 
 static inline u64 nsec_to_cycles(struct kvm_vcpu *vcpu, u64 nsec)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 030/104] KVM: TDX: Do TDX specific vcpu initialization
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (28 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 029/104] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 031/104] [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits isaku.yamahata
                   ` (75 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

A TD guest vcpu needs to be configured before it is ready to run, which
requires additional information from the device model (e.g. qemu); one 64-bit
value is passed to the vcpu's RCX as an initial value.  Repurpose
KVM_MEMORY_ENCRYPT_OP to also be vcpu-scoped and add a new sub-command,
KVM_TDX_INIT_VCPU, under it for this additional vcpu configuration.

Add a callback for vCPU-scoped KVM_MEMORY_ENCRYPT_OP operations and handle
the new KVM_TDX_INIT_VCPU subcommand there to perform the additional vcpu
initialization.
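
As an illustrative sketch (not part of this patch), a device model would
issue the new subcommand on the vcpu fd roughly as follows; the struct
kvm_tdx_cmd layout is the one assumed by this series' uapi header:

  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int tdx_init_vcpu(int vcpu_fd, __u64 initial_rcx)
  {
          struct kvm_tdx_cmd cmd;

          memset(&cmd, 0, sizeof(cmd));
          cmd.id = KVM_TDX_INIT_VCPU;
          cmd.data = initial_rcx;  /* handed to the TD vcpu's RCX via TDH.VP.INIT */

          /* KVM_MEMORY_ENCRYPT_OP is now accepted on the vcpu fd as well. */
          return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }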

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h       |  1 +
 arch/x86/include/uapi/asm/kvm.h       |  1 +
 arch/x86/kvm/vmx/main.c               | 25 +++++++++++++-------
 arch/x86/kvm/vmx/tdx.c                | 34 +++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                |  4 ++++
 arch/x86/kvm/vmx/x86_ops.h            |  2 ++
 arch/x86/kvm/x86.c                    |  6 +++++
 tools/arch/x86/include/uapi/asm/kvm.h |  1 +
 8 files changed, 66 insertions(+), 8 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 290e200f012c..208b29b0e637 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1483,6 +1483,7 @@ struct kvm_x86_ops {
 	void (*enable_smi_window)(struct kvm_vcpu *vcpu);
 
 	int (*mem_enc_op)(struct kvm *kvm, void __user *argp);
+	int (*mem_enc_op_vcpu)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_reg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unreg_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 6e26dde0dce6..9702f0d95776 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -532,6 +532,7 @@ struct kvm_pmu_event_filter {
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a11d3e870a98..3eb9db6d83ac 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -57,14 +57,6 @@ static void vt_vm_free(struct kvm *kvm)
 		return tdx_vm_free(kvm);
 }
 
-static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
-{
-	if (!is_td(kvm))
-		return -ENOTTY;
-
-	return tdx_vm_ioctl(kvm, argp);
-}
-
 static int vt_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -89,6 +81,22 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
+{
+	if (!is_td(kvm))
+		return -ENOTTY;
+
+	return tdx_vm_ioctl(kvm, argp);
+}
+
+static int vt_mem_enc_op_vcpu(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	if (!is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return tdx_vcpu_ioctl(vcpu, argp);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = "kvm_intel",
 
@@ -229,6 +237,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 
 	.mem_enc_op = vt_mem_enc_op,
+	.mem_enc_op_vcpu = vt_mem_enc_op_vcpu,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e43ca93ff95b..f86a257dd71b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -73,6 +73,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
 	return kvm_tdx->hkid > 0;
 }
 
+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->finalized;
+}
+
 static void tdx_clear_page(unsigned long page)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -756,6 +761,35 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	return r;
 }
 
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct kvm_tdx_cmd cmd;
+	u64 err;
+
+	if (tdx->initialized)
+		return -EINVAL;
+
+	if (!is_td_initialized(vcpu->kvm) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd.metadata || cmd.id != KVM_TDX_INIT_VCPU)
+		return -EINVAL;
+
+	err = tdh_vp_init(tdx->tdvpr.pa, cmd.data);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_VP_INIT, err, NULL);
+		return -EIO;
+	}
+
+	tdx->initialized = true;
+	return 0;
+}
+
 static int __tdx_module_setup(void)
 {
 	const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index f116c40ac319..4ce7fcab6f64 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -27,6 +27,8 @@ struct kvm_tdx {
 	int cpuid_nent;
 	struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
 
+	bool finalized;
+
 	u64 tsc_offset;
 	unsigned long tsc_khz;
 };
@@ -36,6 +38,8 @@ struct vcpu_tdx {
 
 	struct tdx_td_page tdvpr;
 	struct tdx_td_page *tdvpx;
+
+	bool initialized;
 };
 
 static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f1640f201a19..81f246493ec7 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -142,6 +142,7 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 #else
 static inline void tdx_pre_kvm_init(
 	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
@@ -158,6 +159,7 @@ static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 158e1891ac14..c52a052e208c 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5680,6 +5680,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_SET_DEVICE_ATTR:
 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
 		break;
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -EINVAL;
+		if (!kvm_x86_ops.mem_enc_op_vcpu)
+			goto out;
+		r = kvm_x86_ops.mem_enc_op_vcpu(vcpu, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 6e26dde0dce6..9702f0d95776 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -532,6 +532,7 @@ struct kvm_pmu_event_filter {
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
 
 	KVM_TDX_CMD_NR_MAX,
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 031/104] [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (29 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 030/104] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
                   ` (74 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit marks the start of the patch series for KVM MMU GPA
stolen bits.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 9b63afa6cd1a..32e93d63972a 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -10,6 +10,7 @@ What qemu can do
 ----------------
 - TDX VM TYPE is exposed to Qemu.
 - Qemu can create/destroy guest of TDX vm type.
+- Qemu can create/destroy vcpu of TDX vm type.
 
 Patch Layer status
 ------------------
@@ -17,13 +18,13 @@ Patch Layer status
 * TDX, VMX coexistence:                 Applied
 * TDX architectural definitions:        Applied
 * TD VM creation/destruction:           Applied
-* TD vcpu creation/destruction:         Applying
+* TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Not yet
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
-* KVM MMU GPA stolen bits:              Not yet
+* KVM MMU GPA stolen bits:              Applying
 * KVM TDP refactoring for TDX:          Not yet
 * KVM TDP MMU hooks:                    Not yet
 * KVM TDP MMU MapGPA:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (30 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 031/104] [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-31 11:23   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
                   ` (73 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

To keep the non-TDX case intact, introduce a new config option for private
KVM MMU support.  At the moment, this is a synonym for CONFIG_INTEL_TDX_HOST
&& CONFIG_KVM_INTEL.  The new flag makes it clear that the config is only for
the x86 KVM MMU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 2b1548da00eb..2db590845927 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -136,4 +136,8 @@ config KVM_MMU_AUDIT
 config KVM_EXTERNAL_WRITE_TRACKING
 	bool
 
+config KVM_MMU_PRIVATE
+	def_bool y
+	depends on INTEL_TDX_HOST && KVM_INTEL
+
 endif # VIRTUALIZATION
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (31 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-31 11:16   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 034/104] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX isaku.yamahata
                   ` (72 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Rick Edgecombe <rick.p.edgecombe@intel.com>

Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
perspective) to a single GPA (from a memslot perspective). GPA aliasing
will be used to repurpose GPA bits as attribute bits, e.g. to expose an
execute-only permission bit to the guest. To keep the implementation
simple (relatively speaking), GPA aliasing is only supported via TDP.

Today KVM assumes two things that are broken by GPA aliasing.
  1. GPAs coming from hardware can be simply shifted to get the GFNs.
  2. GPA bits 51:MAXPHYADDR are reserved to zero.

With GPA aliasing, translating a GPA to GFN requires masking off the
repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.

To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
that is, bits stolen from the GPA to act as new virtualized attribute
bits. A bit in the mask will cause the MMU code to create aliases of the
GPA. It can also be used to find the GFN out of a GPA coming from a tdp
fault.

To handle case (1) from above, retain any stolen bits when passing a GPA
in KVM's MMU code, but strip them when converting to a GFN so that the
GFN contains only the "real" GFN, i.e. never has repurposed bits set.
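
As an illustrative sketch (not part of this patch), suppose bit 51 were the
stolen "shared" bit; the actual mask comes from kvm->arch.gfn_shared_mask,
which a later TDX patch sets.  Both aliases of a GPA then resolve to the
same GFN:

  /*
   * private alias: 0x0000000012345000 -> GFN 0x12345
   * shared  alias: 0x0008000012345000 -> GFN 0x12345
   */
  static inline gfn_t example_gpa_to_gfn(struct kvm *kvm, gpa_t gpa)
  {
          /* Strip the stolen bits before the usual GPA -> GFN shift. */
          return (gpa & ~kvm_gpa_stolen_mask(kvm)) >> PAGE_SHIFT;
  }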

GFNs (without stolen bits) continue to be used to:
  - Specify physical memory by userspace via memslots
  - Map GPAs to TDP PTEs via RMAP
  - Specify dirty tracking and write protection
  - Look up MTRR types
  - Inject async page faults

Since there are now multiple aliases for the same aliased GPA, when
userspace memory backing the memslots is paged out, both aliases need to be
modified. Fortunately, this happens automatically. Since rmap supports
multiple mappings for the same GFN for PTE shadowing based paging, by
adding/removing each alias PTE with its GFN, kvm_handle_hva() based
operations will be applied to both aliases.

In the case of the rmap being removed in the future, the needed
information could be recovered by iterating over the stolen bits and
walking the TDP page tables.

For TLB flushes that are address based, make sure to flush both aliases
in the case of stolen bits.

Only support stolen bits in 64 bit guest paging modes (long, PAE).
Features that use this infrastructure should restrict the stolen bits to
exclude the other paging modes. Don't support stolen bits for shadow EPT.

Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.h              | 51 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/mmu.c          | 19 ++++++++++--
 arch/x86/kvm/mmu/paging_tmpl.h  | 25 +++++++++-------
 4 files changed, 84 insertions(+), 13 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 208b29b0e637..d8b78d6abc10 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1235,7 +1235,9 @@ struct kvm_arch {
 	spinlock_t hv_root_tdp_lock;
 #endif
 
+#ifdef CONFIG_KVM_MMU_PRIVATE
 	gfn_t gfn_shared_mask;
+#endif
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e9fbb2c8bbe2..3fb530359f81 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -365,4 +365,55 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
 		return gpa;
 	return translate_nested_gpa(vcpu, gpa, access, exception);
 }
+
+static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_MMU_PRIVATE
+	return kvm->arch.gfn_shared_mask;
+#else
+	return 0;
+#endif
+}
+
+static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
+{
+	return gfn_to_gpa(kvm_gfn_stolen_mask(kvm));
+}
+
+static inline gpa_t kvm_gpa_unalias(struct kvm *kvm, gpa_t gpa)
+{
+	return gpa & ~kvm_gpa_stolen_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_unalias(struct kvm *kvm, gfn_t gfn)
+{
+	return gfn & ~kvm_gfn_stolen_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_shared(struct kvm *kvm, gfn_t gfn)
+{
+	return gfn | kvm_gfn_stolen_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_private(struct kvm *kvm, gfn_t gfn)
+{
+	return gfn & ~kvm_gfn_stolen_mask(kvm);
+}
+
+static inline gpa_t kvm_gpa_private(struct kvm *kvm, gpa_t gpa)
+{
+	return gpa & ~kvm_gpa_stolen_mask(kvm);
+}
+
+static inline bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
+{
+	gfn_t mask = kvm_gfn_stolen_mask(kvm);
+
+	return mask && !(gfn & mask);
+}
+
+static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
+{
+	return kvm_is_private_gfn(kvm, gpa_to_gfn(gpa));
+}
 #endif
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8e24f73bf60b..b68191aa39bf 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -276,11 +276,24 @@ static inline bool kvm_available_flush_tlb_with_range(void)
 static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
 		struct kvm_tlb_range *range)
 {
-	int ret = -ENOTSUPP;
+	int ret = -EOPNOTSUPP;
+	u64 gfn_stolen_mask;
 
-	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
+	/*
+	 * Fall back to the big hammer flush if there is more than one
+	 * GPA alias that needs to be flushed.
+	 */
+	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
+	if (hweight64(gfn_stolen_mask) > 1)
+		goto generic_flush;
+
+	if (range && kvm_available_flush_tlb_with_range()) {
+		/* Callback should flush both private GFN and shared GFN. */
+		range->start_gfn = kvm_gfn_unalias(kvm, range->start_gfn);
 		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
+	}
 
+generic_flush:
 	if (ret)
 		kvm_flush_remote_tlbs(kvm);
 }
@@ -4010,7 +4023,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	unsigned long mmu_seq;
 	int r;
 
-	fault->gfn = fault->addr >> PAGE_SHIFT;
+	fault->gfn = kvm_gfn_unalias(vcpu->kvm, gpa_to_gfn(fault->addr));
 	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
 
 	if (page_fault_handle_page_track(vcpu, fault))
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 5b5bdac97c7b..70aec31dee06 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -25,7 +25,8 @@
 	#define guest_walker guest_walker64
 	#define FNAME(name) paging##64_##name
 	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~kvm_gpa_stolen_mask(vcpu->kvm) & \
+					     PT64_LVL_ADDR_MASK(lvl))
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -44,7 +45,7 @@
 	#define guest_walker guest_walker32
 	#define FNAME(name) paging##32_##name
 	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
 	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT32_LEVEL_BITS
@@ -58,7 +59,7 @@
 	#define guest_walker guest_walkerEPT
 	#define FNAME(name) ept_##name
 	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
-	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
+	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
 	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
 	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
 	#define PT_LEVEL_BITS PT64_LEVEL_BITS
@@ -75,7 +76,7 @@
 #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
 
 #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
-#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
+#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
 
 /*
  * The guest_walker structure emulates the behavior of the hardware page
@@ -96,9 +97,9 @@ struct guest_walker {
 	struct x86_exception fault;
 };
 
-static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
+static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
 {
-	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
+	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
 }
 
 static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
@@ -395,7 +396,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 		--walker->level;
 
 		index = PT_INDEX(addr, walker->level);
-		table_gfn = gpte_to_gfn(pte);
+		table_gfn = gpte_to_gfn(vcpu, pte);
 		offset    = index * sizeof(pt_element_t);
 		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
 
@@ -460,7 +461,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
 	if (unlikely(errcode))
 		goto error;
 
-	gfn = gpte_to_gfn_lvl(pte, walker->level);
+	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
 	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
 
 	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
@@ -555,12 +556,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	gfn_t gfn;
 	kvm_pfn_t pfn;
 
+	WARN_ON(gpte & kvm_gpa_stolen_mask(vcpu->kvm));
+
 	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
 		return false;
 
 	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
 
-	gfn = gpte_to_gfn(gpte);
+	gfn = gpte_to_gfn(vcpu, gpte);
 	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
 	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
 
@@ -656,6 +659,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
 	WARN_ON_ONCE(gw->gfn != base_gfn);
 	direct_access = gw->pte_access;
 
+	WARN_ON(fault->addr & kvm_gpa_stolen_mask(vcpu->kvm));
+
 	top_level = vcpu->arch.mmu->root_level;
 	if (top_level == PT32E_ROOT_LEVEL)
 		top_level = PT32_ROOT_LEVEL;
@@ -1080,7 +1085,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 			continue;
 		}
 
-		gfn = gpte_to_gfn(gpte);
+		gfn = gpte_to_gfn(vcpu, gpte);
 		pte_access = sp->role.access;
 		pte_access &= FNAME(gpte_access)(gpte);
 		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 034/104] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (32 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-03-04 19:48 ` [RFC PATCH v5 035/104] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
                   ` (71 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit marks the start of the patch series for KVM TDP
refactoring for TDX.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 32e93d63972a..b4c10eb46b8d 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -24,7 +24,7 @@ Patch Layer status
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
-* KVM MMU GPA stolen bits:              Applying
-* KVM TDP refactoring for TDX:          Not yet
+* KVM MMU GPA stolen bits:              Applied
+* KVM TDP refactoring for TDX:          Applying
 * KVM TDP MMU hooks:                    Not yet
 * KVM TDP MMU MapGPA:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 035/104] KVM: x86/mmu: Disallow dirty logging for x86 TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (33 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 034/104] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 13:09   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 036/104] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
                   ` (70 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX doesn't support dirty logging.  Report that dirty logging isn't supported
so that the device model, for example qemu, can handle it properly.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/x86.c       |  5 +++++
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 15 ++++++++++++---
 3 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c52a052e208c..da411bcd8cbc 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12876,6 +12876,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+bool kvm_arch_dirty_log_supported(struct kvm *kvm)
+{
+	return kvm->arch.vm_type != KVM_X86_TDX_VM;
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index a56044a31bc6..86f984e0c93f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1423,6 +1423,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
 int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
+bool kvm_arch_dirty_log_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3adee9c6b370..ae3bf553f215 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1423,9 +1423,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
-static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
+bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
 {
-	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+	return true;
+}
+
+static int check_memory_region_flags(struct kvm *kvm,
+				     const struct kvm_userspace_memory_region *mem)
+{
+	u32 valid_flags = 0;
+
+	if (kvm_arch_dirty_log_supported(kvm))
+		valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
 
 #ifdef __KVM_HAVE_READONLY_MEM
 	valid_flags |= KVM_MEM_READONLY;
@@ -1826,7 +1835,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
 	int as_id, id;
 	int r;
 
-	r = check_memory_region_flags(mem);
+	r = check_memory_region_flags(kvm, mem);
 	if (r)
 		return r;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 036/104] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (34 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 035/104] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 13:17   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
                   ` (69 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Explicitly check for an MMIO spte in the fast page fault flow.  TDX will
use a not-present entry for MMIO sptes, which can be mistaken for an
access-tracked spte since both have SPTE_SPECIAL_MASK set.

The fast page fault path handles changing access bits without taking
mmu_lock, for example clearing the write-protect bit for dirty page tracking.
MMIO emulation is handled in a slow path, so this doesn't affect the default
VM case.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b68191aa39bf..9907cb759fd1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3167,7 +3167,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			break;
 
 		sp = sptep_to_sp(sptep);
-		if (!is_last_spte(spte, sp->role.level))
+		if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
 			break;
 
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (35 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 036/104] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-01  5:13   ` Kai Huang
  2022-04-05 14:10   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
                   ` (68 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX will run with EPT violation #VEs enabled for shared EPT, which means
KVM needs to set the "suppress #VE" bit in unused PTEs to avoid
unintentionally reflecting not-present EPT violations into the guest.

Because guest memory is protected with TDX, the VMM can't parse instructions
in guest memory.  Instead, an MMIO hypercall is used to pass the necessary
information to the VMM.

To make unmodified device drivers work, the guest TD expects a #VE on access
to shared GPAs used for MMIO.  The guest #VE handler converts the MMIO access
into an MMIO hypercall.  The VMM enables #VE delivery for such a GPA by
clearing the "suppress #VE" bit in the EPT entry, but before doing so it has
to learn, via an EPT violation, that the GPA is used for MMIO.  So the
execution flow looks like:

- allocate an unused shared EPT entry with the suppress #VE bit set.
- the guest access causes an EPT violation on that GPA.
- the VMM figures out that the faulted GPA is for MMIO.
- the VMM clears the suppress #VE bit.
- the guest TD gets a #VE and converts the MMIO access into an MMIO hypercall.
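
As an illustrative sketch (not part of this patch), a later TDX patch is
expected to plug the suppress #VE bit into the new hook; the suppress #VE
bit is bit 63 of an EPT entry, and the macro name below is only a
placeholder:

  #define EXAMPLE_EPT_SUPPRESS_VE     BIT_ULL(63)

  static void example_tdx_mmu_hardware_setup(void)
  {
          /*
           * Non-present SPTEs keep "suppress #VE" set so that a
           * not-present access causes an EPT violation to KVM instead
           * of a #VE injected into the guest.
           */
          kvm_mmu_set_spte_init_value(EXAMPLE_EPT_SUPPRESS_VE);
  }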

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h      |  1 +
 arch/x86/kvm/mmu/mmu.c  | 50 +++++++++++++++++++++++++++++++++++------
 arch/x86/kvm/mmu/spte.c | 10 +++++++++
 arch/x86/kvm/mmu/spte.h |  2 ++
 4 files changed, 56 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 3fb530359f81..0ae91b8b25df 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -66,6 +66,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_spte_init_value(u64 init_value);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
 void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 9907cb759fd1..a474f2e76d78 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -617,9 +617,9 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 	int level = sptep_to_sp(sptep)->role.level;
 
 	if (!spte_has_volatile_bits(old_spte))
-		__update_clear_spte_fast(sptep, 0ull);
+		__update_clear_spte_fast(sptep, shadow_init_value);
 	else
-		old_spte = __update_clear_spte_slow(sptep, 0ull);
+		old_spte = __update_clear_spte_slow(sptep, shadow_init_value);
 
 	if (!is_shadow_present_pte(old_spte))
 		return old_spte;
@@ -651,7 +651,7 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
-	__update_clear_spte_fast(sptep, 0ull);
+	__update_clear_spte_fast(sptep, shadow_init_value);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -737,6 +737,42 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+static inline void kvm_init_shadow_page(void *page)
+{
+#ifdef CONFIG_X86_64
+	int ign;
+
+	asm volatile (
+		"rep stosq\n\t"
+		: "=c"(ign), "=D"(page)
+		: "a"(shadow_init_value), "c"(4096/8), "D"(page)
+		: "memory"
+	);
+#else
+	BUG();
+#endif
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+	int start, end, i, r;
+
+	if (shadow_init_value)
+		start = kvm_mmu_memory_cache_nr_free_objects(mc);
+
+	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+	if (r)
+		return r;
+
+	if (shadow_init_value) {
+		end = kvm_mmu_memory_cache_nr_free_objects(mc);
+		for (i = start; i < end; i++)
+			kvm_init_shadow_page(mc->objects[i]);
+	}
+	return 0;
+}
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -746,8 +782,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
-				       PT64_ROOT_MAX_LEVEL);
+	r = mmu_topup_shadow_page_cache(vcpu);
 	if (r)
 		return r;
 	if (maybe_indirect) {
@@ -3146,7 +3181,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu_page *sp;
 	int ret = RET_PF_INVALID;
-	u64 spte = 0ull;
+	u64 spte = shadow_init_value;
 	u64 *sptep = NULL;
 	uint retry_count = 0;
 
@@ -5598,7 +5633,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
-	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	if (!shadow_init_value)
+		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 73cfe62fdad1..5071e8332db2 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -35,6 +35,7 @@ u64 __read_mostly shadow_mmio_access_mask;
 u64 __read_mostly shadow_present_mask;
 u64 __read_mostly shadow_me_mask;
 u64 __read_mostly shadow_acc_track_mask;
+u64 __read_mostly shadow_init_value;
 
 u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
@@ -223,6 +224,14 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
 	return new_spte;
 }
 
+void kvm_mmu_set_spte_init_value(u64 init_value)
+{
+	if (WARN_ON(!IS_ENABLED(CONFIG_X86_64) && init_value))
+		init_value = 0;
+	shadow_init_value = init_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_spte_init_value);
+
 static u8 kvm_get_shadow_phys_bits(void)
 {
 	/*
@@ -367,6 +376,7 @@ void kvm_mmu_reset_all_pte_masks(void)
 	shadow_present_mask	= PT_PRESENT_MASK;
 	shadow_acc_track_mask	= 0;
 	shadow_me_mask		= sme_me_mask;
+	shadow_init_value	= 0;
 
 	shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITEABLE;
 	shadow_mmu_writable_mask  = DEFAULT_SPTE_MMU_WRITEABLE;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index be6a007a4af3..8e13a35ab8c9 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -171,6 +171,8 @@ extern u64 __read_mostly shadow_mmio_access_mask;
 extern u64 __read_mostly shadow_present_mask;
 extern u64 __read_mostly shadow_me_mask;
 
+extern u64 __read_mostly shadow_init_value;
+
 /*
  * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
  * shadow_acc_track_mask is the set of bits to be cleared in non-accessed
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (36 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-01  5:15   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 039/104] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
                   ` (67 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

In the existing x86 KVM MMU code, struct kvm_page_fault already has a
max_level member whose initial value is KVM_MAX_HUGEPAGE_LEVEL.  The KVM page
fault handler denies page sizes larger than max_level.

Add a per-VM member to indicate the allowed maximum page size, with
KVM_MAX_HUGEPAGE_LEVEL as the default value, and use it to initialize
max_level in struct kvm_page_fault.

For a guest TD, the per-VM value is set so that the maximum allowed page size
is 4K, i.e. the only allowed page size is 4K and large pages are disabled.
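
As an illustrative sketch (not part of this patch), a later TDX patch is
expected to use the new per-VM knob to restrict a guest TD to 4K mappings:

  static void example_tdx_limit_page_size(struct kvm *kvm)
  {
          /* Large pages are not yet supported for the guest TD. */
          kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
  }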

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/mmu.h              | 2 +-
 arch/x86/kvm/mmu/mmu.c          | 2 ++
 3 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d8b78d6abc10..d33d79f2af2d 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1053,6 +1053,7 @@ struct kvm_arch {
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
+	int tdp_max_page_level;
 	u8 mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 0ae91b8b25df..650989c37f2e 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -192,7 +192,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
 		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
 
-		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
+		.max_level = vcpu->kvm->arch.tdp_max_page_level,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
 	};
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index a474f2e76d78..e9212394a530 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5782,6 +5782,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	node->track_write = kvm_mmu_pte_write;
 	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
 	kvm_page_track_register_notifier(kvm, node);
+
+	kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 039/104] KVM: x86/mmu: Disallow fast page fault on private GPA
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (37 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 13:22   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 040/104] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
                   ` (66 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX requires a TDX SEAMCALL to operate on the Secure EPT instead of direct
memory access, and a TDX SEAMCALL is a heavyweight operation.  A fast page
fault on a private GPA doesn't make sense, so disallow fast page faults on
private GPAs.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e9212394a530..d8c1505155b0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3185,6 +3185,13 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	u64 *sptep = NULL;
 	uint retry_count = 0;
 
+	/*
+	 * TDX private mapping doesn't support fast page fault because the EPT
+	 * entry needs TDX SEAMCALL. not direct memory access.
+	 */
+	if (kvm_is_private_gpa(vcpu->kvm, fault->addr))
+		return ret;
+
 	if (!page_fault_can_be_fast(fault))
 		return ret;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 040/104] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (38 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 039/104] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 14:43   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 041/104] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
                   ` (65 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

What differs for a TDX EPT violation is how the information, the GPA and the
exit qualification, is retrieved.  To share the code that handles EPT
violations, split out the guts of the EPT violation handler so that the
VMX/TDX exit handlers can call it after retrieving the GPA and exit
qualification.
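
As an illustrative sketch (not part of this patch), the TDX EPT violation
exit handler added later in the series is expected to call the common helper
after retrieving the exit information from the TDX module; the tdexit_*()
accessors below are only placeholders:

  static gpa_t tdexit_gpa(struct kvm_vcpu *vcpu);
  static unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu);

  static int example_tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
  {
          /* TDX provides the GPA and exit qualification without VMCS reads. */
          return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu),
                                            tdexit_exit_qual(vcpu));
  }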

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 35 +++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 34 ++++++----------------------------
 2 files changed, 41 insertions(+), 28 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/common.h

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..1052b3c93eb8
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,35 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+					     unsigned long exit_qualification)
+{
+	u64 error_code;
+
+	/* Is it a read fault? */
+	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+		     ? PFERR_USER_MASK : 0;
+	/* Is it a write fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+		      ? PFERR_WRITE_MASK : 0;
+	/* Is it a fetch fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+		      ? PFERR_FETCH_MASK : 0;
+	/* ept page table entry is present? */
+	error_code |= (exit_qualification &
+		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
+			EPT_VIOLATION_EXECUTABLE))
+		      ? PFERR_PRESENT_MASK : 0;
+
+	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
+	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 7838cd177f0e..0edeeed0b4c8 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -50,6 +50,7 @@
 #include <asm/vmx.h>
 
 #include "capabilities.h"
+#include "common.h"
 #include "cpuid.h"
 #include "evmcs.h"
 #include "hyperv.h"
@@ -5381,11 +5382,10 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 
 static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
-	unsigned long exit_qualification;
-	gpa_t gpa;
-	u64 error_code;
+	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
+	gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 
-	exit_qualification = vmx_get_exit_qual(vcpu);
+	trace_kvm_page_fault(gpa, exit_qualification);
 
 	/*
 	 * EPT violation happened while executing iret from NMI,
@@ -5394,31 +5394,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	 * AAK134, BY25.
 	 */
 	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
-			enable_vnmi &&
-			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
+	    enable_vnmi && (exit_qualification & INTR_INFO_UNBLOCK_NMI))
 		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
 
-	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
-	trace_kvm_page_fault(gpa, exit_qualification);
-
-	/* Is it a read fault? */
-	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
-		     ? PFERR_USER_MASK : 0;
-	/* Is it a write fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
-		      ? PFERR_WRITE_MASK : 0;
-	/* Is it a fetch fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
-		      ? PFERR_FETCH_MASK : 0;
-	/* ept page table entry is present? */
-	error_code |= (exit_qualification &
-		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
-			EPT_VIOLATION_EXECUTABLE))
-		      ? PFERR_PRESENT_MASK : 0;
-
-	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
-	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
 	vcpu->arch.exit_qualification = exit_qualification;
 
 	/*
@@ -5432,7 +5410,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 041/104] KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (39 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 040/104] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 14:48   ` Paolo Bonzini
  2022-03-04 19:48 ` [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis isaku.yamahata
                   ` (64 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

The EPT MMU masks are used by both VMX and TDX.  The values need to be
initialized in common code before either the VMX- or TDX-specific
initialization code runs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 5 +++++
 arch/x86/kvm/vmx/vmx.c  | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 3eb9db6d83ac..51aaafe6b432 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -4,6 +4,7 @@
 #include "x86_ops.h"
 #include "vmx.h"
 #include "nested.h"
+#include "mmu.h"
 #include "pmu.h"
 #include "tdx.h"
 
@@ -22,6 +23,10 @@ static __init int vt_hardware_setup(void)
 
 	tdx_hardware_setup(&vt_x86_ops);
 
+	if (enable_ept)
+		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
+				      cpu_has_vmx_ept_execute_only());
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0edeeed0b4c8..07fd892768be 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7817,10 +7817,6 @@ __init int vmx_hardware_setup(void)
 
 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
 
-	if (enable_ept)
-		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
-				      cpu_has_vmx_ept_execute_only());
-
 	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
 			  ept_caps_to_lpage_level(vmx_capability.ept));
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (40 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 041/104] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 15:25   ` Paolo Bonzini
  2022-04-06 11:06   ` Kai Huang
  2022-03-04 19:48 ` [RFC PATCH v5 043/104] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
                   ` (63 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Define the EPT Violation #VE control bit, #VE info VMCS fields, and the
suppress #VE bit for EPT entries.

TDX will use a shadow PTE entry value for MMIO different from VMX.  Add
members to kvm_arch and track the MMIO value/mask per VM.  By using a per-VM
EPT entry value for MMIO, the existing VMX logic keeps working.

In the VMX VM case, the EPT entry for MMIO is either a non-present PTE
(present bit cleared) with no backing guest physical address (on EPT
violation, KVM searches the backing guest memory and finds there is no
backing guest page), or a value that triggers EPT misconfiguration as a fast
path: once MMIO is triggered on the EPT entry, the entry is updated so that
future accesses are recognized as MMIO without searching the backing guest
pages.  KVM then parses the guest instruction to figure out the
address/value/width of the MMIO access.

In the guest TD case, the guest memory is protected, so the VMM can't parse
the guest instruction that triggered the EPT violation.  Instead, the VMM
sets up the (shared) EPT to trigger #VE.  When the guest TD issues MMIO, #VE
is injected and the guest #VE handler converts the MMIO access into an MMIO
hypercall that passes the address/value/width to the VMM (or the guest
directly paravirtualizes MMIO into the hypercall).  The VMM can then handle
the MMIO hypercall without parsing the guest instruction.

When the guest accesses a GPA, if the "EPT-violation #VE" control bit is set
and the suppress #VE bit in the EPT entry is cleared, #VE (virtualization
exception) is injected into the guest.  Because the TDX guest vCPU state and
memory are protected, a VMM can't emulate MMIO for the TDX guest on EPT
violation by snooping vCPU state and parsing the instruction to figure out
the MMIO address and value.  Instead, PV MMIO (the MMIO hypercall) is
adopted: on EPT violation, the CPU injects #VE into the guest and the guest
converts the MMIO instruction into PV MMIO, or the guest directly issues the
MMIO hypercall.
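
As a rough conceptual sketch of the #VE condition above (hardware behaviour,
not code added by this patch; the helper names are made up for illustration):

  if (ept_violation_ve_enabled() &&          /* "EPT-violation #VE" exec control on */
      !(epte & VMX_EPT_SUPPRESS_VE_BIT))     /* suppress #VE (bit 63) clear in the entry */
          inject_ve_into_guest();            /* guest #VE handler issues the MMIO hypercall */
  else
          ept_violation_vmexit_to_vmm();     /* normal EPT violation exit to the VMM */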

The existing VMX code uses zero as the initial value for EPT entries.  TDX
will enable the EPT-violation #VE VM-execution control and requires the
suppress #VE bit to be cleared in a shared EPT entry in order to inject #VE
into the TDX guest.  To keep the existing behavior, where an EPT violation
exits to the VMM, the suppress #VE bit needs to be set by default.  Allow
specifying an initial value for EPT entries and, if TDX is enabled, set the
initial EPT entry value with the suppress #VE bit set.  The EPT-violation #VE
VM-execution control will be enabled, and the suppress #VE bit will be
cleared for the TDX shared EPT entries that should deliver #VE.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  3 +++
 arch/x86/include/asm/vmx.h      |  1 +
 arch/x86/kvm/mmu.h              |  6 ++++--
 arch/x86/kvm/mmu/mmu.c          | 19 +++++++++++------
 arch/x86/kvm/mmu/spte.c         | 38 ++++++++++++++++-----------------
 arch/x86/kvm/mmu/spte.h         |  9 ++++----
 arch/x86/kvm/mmu/tdp_mmu.c      |  6 +++---
 arch/x86/kvm/svm/svm.c          |  2 +-
 arch/x86/kvm/vmx/main.c         |  7 ++++--
 arch/x86/kvm/vmx/tdx.c          |  2 +-
 arch/x86/kvm/vmx/tdx.h          |  2 ++
 arch/x86/kvm/vmx/vmx.c          |  8 +++++++
 12 files changed, 63 insertions(+), 40 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d33d79f2af2d..fcab2337819c 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1069,6 +1069,9 @@ struct kvm_arch {
 	 */
 	spinlock_t mmu_unsync_pages_lock;
 
+	u64 shadow_mmio_value;
+	u64 shadow_mmio_mask;
+
 	struct list_head assigned_dev_head;
 	struct iommu_domain *iommu_domain;
 	bool iommu_noncoherent;
diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 0ffaa3156a4e..88d9b8cc7dde 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -498,6 +498,7 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT			(1ull << 8)
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
 #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
 						 VMX_EPT_WRITABLE_MASK |       \
 						 VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 650989c37f2e..b49841e4faaa 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -64,8 +64,10 @@ static __always_inline u64 rsvd_bits(int s, int e)
 	return ((2ULL << (e - s)) - 1) << s;
 }
 
-void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
+void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
+				u64 access_mask);
+void kvm_mmu_set_default_mmio_spte_mask(u64 mask);
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value);
 void kvm_mmu_set_spte_init_value(u64 init_value);
 
 void kvm_init_mmu(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index d8c1505155b0..6e9847b1124b 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2336,7 +2336,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 				return kvm_mmu_prepare_zap_page(kvm, child,
 								invalid_list);
 		}
-	} else if (is_mmio_spte(pte)) {
+	} else if (is_mmio_spte(kvm, pte)) {
 		mmu_spte_clear_no_track(spte);
 	}
 	return 0;
@@ -3069,9 +3069,12 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		/*
 		 * If MMIO caching is disabled, emulate immediately without
 		 * touching the shadow page tables as attempting to install an
-		 * MMIO SPTE will just be an expensive nop.
+		 * MMIO SPTE will just be an expensive nop.  This excludes the
+		 * INTEL TD guest, which also uses shadow_mmio_value == 0 but
+		 * does so in order to emulate MMIO accesses.
 		 */
-		if (unlikely(!shadow_mmio_value)) {
+		if (unlikely(!vcpu->kvm->arch.shadow_mmio_value)
+		    && !kvm_gfn_stolen_mask(vcpu->kvm)) {
 			*ret_val = RET_PF_EMULATE;
 			return true;
 		}
@@ -3209,7 +3212,8 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			break;
 
 		sp = sptep_to_sp(sptep);
-		if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
+		if (!is_last_spte(spte, sp->role.level) ||
+			is_mmio_spte(vcpu->kvm, spte))
 			break;
 
 		/*
@@ -3892,7 +3896,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 	if (WARN_ON(reserved))
 		return -EINVAL;
 
-	if (is_mmio_spte(spte)) {
+	if (is_mmio_spte(vcpu->kvm, spte)) {
 		gfn_t gfn = get_mmio_spte_gfn(spte);
 		unsigned int access = get_mmio_spte_access(spte);
 
@@ -4294,7 +4298,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
 static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
 			   unsigned int access)
 {
-	if (unlikely(is_mmio_spte(*sptep))) {
+	if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
 		if (gfn != get_mmio_spte_gfn(*sptep)) {
 			mmu_spte_clear_no_track(sptep);
 			return true;
@@ -5791,6 +5795,9 @@ void kvm_mmu_init_vm(struct kvm *kvm)
 	kvm_page_track_register_notifier(kvm, node);
 
 	kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
+	kvm_mmu_set_mmio_spte_mask(kvm, shadow_default_mmio_mask,
+				   shadow_default_mmio_mask,
+				   ACC_WRITE_MASK | ACC_USER_MASK);
 }
 
 void kvm_mmu_uninit_vm(struct kvm *kvm)
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 5071e8332db2..ea83927b9231 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -29,8 +29,7 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 u64 __read_mostly shadow_user_mask;
 u64 __read_mostly shadow_accessed_mask;
 u64 __read_mostly shadow_dirty_mask;
-u64 __read_mostly shadow_mmio_value;
-u64 __read_mostly shadow_mmio_mask;
+u64 __read_mostly shadow_default_mmio_mask;
 u64 __read_mostly shadow_mmio_access_mask;
 u64 __read_mostly shadow_present_mask;
 u64 __read_mostly shadow_me_mask;
@@ -59,10 +58,11 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 	u64 spte = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
 
-	WARN_ON_ONCE(!shadow_mmio_value);
+	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
+		     !kvm_gfn_stolen_mask(vcpu->kvm));
 
 	access &= shadow_mmio_access_mask;
-	spte |= shadow_mmio_value | access;
+	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
 	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
 	spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
 		<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
@@ -279,7 +279,8 @@ u64 mark_spte_for_access_track(u64 spte)
 	return spte;
 }
 
-void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
+void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
+				u64 access_mask)
 {
 	BUG_ON((u64)(unsigned)access_mask != access_mask);
 	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
@@ -308,39 +309,32 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 	    WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
 		mmio_value = 0;
 
-	shadow_mmio_value = mmio_value;
-	shadow_mmio_mask  = mmio_mask;
+	kvm->arch.shadow_mmio_value = mmio_value;
+	kvm->arch.shadow_mmio_mask = mmio_mask;
 	shadow_mmio_access_mask = access_mask;
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
-void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
+void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value)
 {
 	shadow_user_mask	= VMX_EPT_READABLE_MASK;
 	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
 	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
 	shadow_nx_mask		= 0ull;
 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
-	shadow_present_mask	= has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+	shadow_present_mask	=
+		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | init_value;
 	shadow_acc_track_mask	= VMX_EPT_RWX_MASK;
 	shadow_me_mask		= 0ull;
 
 	shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
 	shadow_mmu_writable_mask  = EPT_SPTE_MMU_WRITABLE;
-
-	/*
-	 * EPT Misconfigurations are generated if the value of bits 2:0
-	 * of an EPT paging-structure entry is 110b (write/execute).
-	 */
-	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
-				   VMX_EPT_RWX_MASK, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
 
 void kvm_mmu_reset_all_pte_masks(void)
 {
 	u8 low_phys_bits;
-	u64 mask;
 
 	shadow_phys_bits = kvm_get_shadow_phys_bits();
 
@@ -389,9 +383,13 @@ void kvm_mmu_reset_all_pte_masks(void)
 	 * PTEs and so the reserved PA approach must be disabled.
 	 */
 	if (shadow_phys_bits < 52)
-		mask = BIT_ULL(51) | PT_PRESENT_MASK;
+		shadow_default_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
 	else
-		mask = 0;
+		shadow_default_mmio_mask = 0;
+}
 
-	kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
+void kvm_mmu_set_default_mmio_spte_mask(u64 mask)
+{
+	shadow_default_mmio_mask = mask;
 }
+EXPORT_SYMBOL_GPL(kvm_mmu_set_default_mmio_spte_mask);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 8e13a35ab8c9..bde843bce878 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -165,8 +165,7 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
 extern u64 __read_mostly shadow_user_mask;
 extern u64 __read_mostly shadow_accessed_mask;
 extern u64 __read_mostly shadow_dirty_mask;
-extern u64 __read_mostly shadow_mmio_value;
-extern u64 __read_mostly shadow_mmio_mask;
+extern u64 __read_mostly shadow_default_mmio_mask;
 extern u64 __read_mostly shadow_mmio_access_mask;
 extern u64 __read_mostly shadow_present_mask;
 extern u64 __read_mostly shadow_me_mask;
@@ -229,10 +228,10 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
  */
 extern u8 __read_mostly shadow_phys_bits;
 
-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
 {
-	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
-	       likely(shadow_mmio_value);
+	return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
+		likely(kvm->arch.shadow_mmio_value || kvm_gfn_stolen_mask(kvm));
 }
 
 static inline bool is_shadow_present_pte(u64 pte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bc9e3553fba2..ebd0a02620e8 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -447,8 +447,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		 * impact the guest since both the former and current SPTEs
 		 * are nonpresent.
 		 */
-		if (WARN_ON(!is_mmio_spte(old_spte) &&
-			    !is_mmio_spte(new_spte) &&
+		if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
+			    !is_mmio_spte(kvm, new_spte) &&
 			    !is_removed_spte(new_spte)))
 			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
 			       "should not be replaced with another,\n"
@@ -927,7 +927,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	}
 
 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
-	if (unlikely(is_mmio_spte(new_spte))) {
+	if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
 		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
 				     new_spte);
 		ret = RET_PF_EMULATE;
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 778075b71dc3..c7eec23e9ebe 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4704,7 +4704,7 @@ static __init void svm_adjust_mmio_mask(void)
 	 */
 	mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
 
-	kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
+	kvm_mmu_set_default_mmio_spte_mask(mask);
 }
 
 static __init void svm_set_cpu_caps(void)
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 51aaafe6b432..b242a9dc9e29 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -23,9 +23,12 @@ static __init int vt_hardware_setup(void)
 
 	tdx_hardware_setup(&vt_x86_ops);
 
-	if (enable_ept)
+	if (enable_ept) {
+		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
 		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
-				      cpu_has_vmx_ept_execute_only());
+				      cpu_has_vmx_ept_execute_only(), init_value);
+		kvm_mmu_set_spte_init_value(init_value);
+	}
 
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f86a257dd71b..c3434b33c452 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -11,7 +11,7 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "tdx: " fmt
 
-static bool __read_mostly enable_tdx = true;
+bool __read_mostly enable_tdx = true;
 module_param_named(tdx, enable_tdx, bool, 0644);
 
 #define TDX_MAX_NR_CPUID_CONFIGS					\
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 4ce7fcab6f64..b32e068c51b4 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -6,6 +6,7 @@
 
 #include "tdx_ops.h"
 
+extern bool __read_mostly enable_tdx;
 int tdx_module_setup(void);
 
 struct tdx_td_page {
@@ -166,6 +167,7 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
 }
 
 #else
+#define enable_tdx false
 static inline int tdx_module_setup(void) { return -ENODEV; };
 
 struct kvm_tdx;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 07fd892768be..00f88aa25047 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7065,6 +7065,14 @@ int vmx_vm_init(struct kvm *kvm)
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
 
+	/*
+	 * EPT Misconfigurations can be generated if the value of bits 2:0
+	 * of an EPT paging-structure entry is 110b (write/execute).
+	 */
+	if (enable_ept)
+		kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
+					   VMX_EPT_MISCONFIG_WX_VALUE, 0);
+
 	if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
 		switch (l1tf_mitigation) {
 		case L1TF_MITIGATION_OFF:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 043/104] KVM: TDX: Add load_mmu_pgd method for TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (41 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis isaku.yamahata
@ 2022-03-04 19:48 ` isaku.yamahata
  2022-04-05 14:51   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 044/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks isaku.yamahata
                   ` (62 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:48 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

For virtual I/O, the guest TD shares guest pages with the VMM without
encryption.  The shared EPT is used to map such guest pages in an
unprotected way.

Add the VMCS field encoding for the shared EPTP, which will be used by
TDX to have separate EPT walks for private GPAs (existing EPTP) versus
shared GPAs (new shared EPTP).

Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
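
As a rough illustration of the resulting two-EPT layout (conceptual only;
"gfn_shared_mask" and the walk_*() helpers below are illustrative names, the
actual shared-bit handling uses the GPA stolen-bits infrastructure from
earlier in this series):

  /* Conceptual GPA routing inside a TD: */
  if (gpa & gfn_shared_mask) {
          /* Shared GPA: translated via the EPT that KVM builds and that
           * tdx_load_mmu_pgd() installs into SHARED_EPT_POINTER. */
          walk_shared_ept(gpa);
  } else {
          /* Private GPA: translated via the Secure EPT, managed by the
           * TDX module through SEAMCALLs. */
          walk_secure_ept(gpa);
  }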

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/vmx.h |  1 +
 arch/x86/kvm/vmx/main.c    | 11 ++++++++++-
 arch/x86/kvm/vmx/tdx.c     |  5 +++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 88d9b8cc7dde..a2402d1bde04 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -221,6 +221,7 @@ enum vmcs_field {
 	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
 	TSC_MULTIPLIER                  = 0x00002032,
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
+	SHARED_EPT_POINTER		= 0x0000203C,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
 	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
 	VMCS_LINK_POINTER               = 0x00002800,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b242a9dc9e29..6969e3557bd4 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,15 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+			int pgd_level)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+
+	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
 static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -205,7 +214,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.write_tsc_offset = vmx_write_tsc_offset,
 	.write_tsc_multiplier = vmx_write_tsc_multiplier,
 
-	.load_mmu_pgd = vmx_load_mmu_pgd,
+	.load_mmu_pgd = vt_load_mmu_pgd,
 
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c3434b33c452..51098e10b6a0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -496,6 +496,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 81f246493ec7..ad9b1c883761 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -143,6 +143,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline void tdx_pre_kvm_init(
 	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
@@ -160,6 +162,8 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 044/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (42 preceding siblings ...)
  2022-03-04 19:48 ` [RFC PATCH v5 043/104] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value isaku.yamahata
                   ` (61 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of KVM TDP MMU
hooks.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index b4c10eb46b8d..9b2b811c5bdf 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -25,6 +25,6 @@ Patch Layer status
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA stolen bits:              Applied
-* KVM TDP refactoring for TDX:          Applying
-* KVM TDP MMU hooks:                    Not yet
+* KVM TDP refactoring for TDX:          Applied
+* KVM TDP MMU hooks:                    Applying
 * KVM TDP MMU MapGPA:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (43 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 044/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 14:22   ` Paolo Bonzini
  2022-04-06 23:30   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 046/104] KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map() isaku.yamahata
                   ` (60 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special constant: an
intermediate value that indicates one thread is operating on the SPTE, where
the value only needs to be semi-arbitrary.  For TDX (more precisely, to use
#VE), the value should also include the suppress #VE bit, which is
shadow_init_value.

Define SHADOW_REMOVED_SPTE as shadow_init_value | REMOVED_SPTE, and replace
REMOVED_SPTE with SHADOW_REMOVED_SPTE to use suppress #VE bit properly for
TDX.
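
For illustration, with TDX enabled the pieces compose as follows (the
suppress #VE initial value comes from an earlier patch in this series;
REMOVED_SPTE is the existing 0x5a0ULL marker):

  shadow_init_value   = VMX_EPT_SUPPRESS_VE_BIT;            /* bit 63 */
  SHADOW_REMOVED_SPTE = shadow_init_value | REMOVED_SPTE;   /* bit 63 | 0x5a0 */
  /* Still non-present and free of RWX/PFN bits, but #VE stays suppressed
   * while another thread operates on the SPTE. */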

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.h    | 14 ++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++++++-------
 2 files changed, 28 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index bde843bce878..e88f796724b4 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -194,7 +194,9 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  * If a thread running without exclusive control of the MMU lock must perform a
  * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
  * non-present intermediate value. Other threads which encounter this value
- * should not modify the SPTE.
+ * should not modify the SPTE.  When TDX is enabled, shadow_init_value, which
+ * has the "suppress #VE" bit set, is also included in the removed SPTE value,
+ * because the TDX module always enables "EPT violation #VE".
  *
  * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
  * bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
@@ -207,9 +209,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
 
+/*
+ * See the comment above REMOVED_SPTE.  SHADOW_REMOVED_SPTE is the actual
+ * intermediate value set to the removed SPTE.  When TDX is enabled, it also
+ * sets the "suppress #VE" bit, otherwise it equals REMOVED_SPTE.
+ */
+extern u64 __read_mostly shadow_init_value;
+#define SHADOW_REMOVED_SPTE	(shadow_init_value | REMOVED_SPTE)
+
 static inline bool is_removed_spte(u64 spte)
 {
-	return spte == REMOVED_SPTE;
+	return spte == SHADOW_REMOVED_SPTE;
 }
 
 /*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index ebd0a02620e8..b6ec2f112c26 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -338,7 +338,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 			 * value to the removed SPTE value.
 			 */
 			for (;;) {
-				old_child_spte = xchg(sptep, REMOVED_SPTE);
+				old_child_spte = xchg(sptep, SHADOW_REMOVED_SPTE);
 				if (!is_removed_spte(old_child_spte))
 					break;
 				cpu_relax();
@@ -365,10 +365,10 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 			 * the two branches consistent and simplifies
 			 * the function.
 			 */
-			WRITE_ONCE(*sptep, REMOVED_SPTE);
+			WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
 		}
 		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
-				    old_child_spte, REMOVED_SPTE, level,
+				    old_child_spte, SHADOW_REMOVED_SPTE, level,
 				    shared);
 	}
 
@@ -537,7 +537,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * immediately installing a present entry in its place
 	 * before the TLBs are flushed.
 	 */
-	if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
+	if (!tdp_mmu_set_spte_atomic(kvm, iter, SHADOW_REMOVED_SPTE))
 		return false;
 
 	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
@@ -550,8 +550,16 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * special removed SPTE value. No bookkeeping is needed
 	 * here since the SPTE is going from non-present
 	 * to non-present.
+	 *
+	 * Set the non-present value to shadow_init_value, rather than 0.
+	 * This is because when TDX is enabled, the TDX module always
+	 * enables "EPT-violation #VE", so KVM needs to keep the
+	 * "suppress #VE" bit set in EPT table entries in order to get a
+	 * real EPT violation rather than a TDVMCALL.  KVM writes
+	 * shadow_init_value (which has the "suppress #VE" bit set) so
+	 * that the bit stays set when EPT table entries are zapped.
 	 */
-	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
+	WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
 
 	return true;
 }
@@ -748,7 +756,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		if (!shared) {
-			tdp_mmu_set_spte(kvm, &iter, 0);
+			/* see comments in tdp_mmu_zap_spte_atomic() */
+			tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
 			flush = true;
 		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
 			/*
@@ -1135,7 +1144,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	 * invariant that the PFN of a present * leaf SPTE can never change.
 	 * See __handle_changed_spte().
 	 */
-	tdp_mmu_set_spte(kvm, iter, 0);
+	tdp_mmu_set_spte(kvm, iter, shadow_init_value);
 
 	if (!pte_write(range->pte)) {
 		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 046/104] KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (44 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 14:53   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page isaku.yamahata
                   ` (59 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Factor out the non-leaf SPTE population logic from kvm_tdp_mmu_map().  The
MapGPA hypercall needs to populate non-leaf SPTEs in order to record which
GPA, private or shared, is allowed in the leaf EPT entry.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 48 ++++++++++++++++++++++++--------------
 1 file changed, 30 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b6ec2f112c26..8db262440d5c 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -955,6 +955,31 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	return ret;
 }
 
+static bool tdp_mmu_populate_nonleaf(
+	struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
+{
+	struct kvm_mmu_page *sp;
+	u64 *child_pt;
+	u64 new_spte;
+
+	WARN_ON(is_shadow_present_pte(iter->old_spte));
+	WARN_ON(is_removed_spte(iter->old_spte));
+
+	sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1);
+	child_pt = sp->spt;
+
+	new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);
+
+	if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) {
+		tdp_mmu_free_sp(sp);
+		return false;
+	}
+
+	tdp_mmu_link_page(vcpu->kvm, sp, account_nx);
+	trace_kvm_mmu_get_page(sp, true);
+	return true;
+}
+
 /*
  * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
  * page tables and SPTEs to translate the faulting guest physical address.
@@ -963,9 +988,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
-	struct kvm_mmu_page *sp;
-	u64 *child_pt;
-	u64 new_spte;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1000,6 +1022,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		}
 
 		if (!is_shadow_present_pte(iter.old_spte)) {
+			bool account_nx;
+
 			/*
 			 * If SPTE has been frozen by another thread, just
 			 * give up and retry, avoiding unnecessary page table
@@ -1008,22 +1032,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 			if (is_removed_spte(iter.old_spte))
 				break;
 
-			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
-			child_pt = sp->spt;
-
-			new_spte = make_nonleaf_spte(child_pt,
-						     !shadow_accessed_mask);
-
-			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
-				tdp_mmu_link_page(vcpu->kvm, sp,
-						  fault->huge_page_disallowed &&
-						  fault->req_level >= iter.level);
-
-				trace_kvm_mmu_get_page(sp, true);
-			} else {
-				tdp_mmu_free_sp(sp);
+			account_nx = fault->huge_page_disallowed &&
+				fault->req_level >= iter.level;
+			if (!tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
 				break;
-			}
 		}
 	}
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (45 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 046/104] KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map() isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 14:58   ` Paolo Bonzini
  2022-04-06 23:43   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
                   ` (58 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a private pointer to kvm_mmu_page for private EPT.

To resolve a KVM page fault on a private GPA, an additional page will be
allocated for the Secure EPT in addition to the private EPT page.  Add a
memory cache for it and top up the cache before resolving a KVM page fault,
similar to the shared EPT page.  For the TDP MMU, allocation of that memory
is done by alloc_tdp_mmu_page(), and freeing is done on behalf of
kvm_tdp_mmu_zap_all(), called by kvm_mmu_zap_all().  A private EPT page needs
to carry one more page, used for the Secure EPT, in addition to the private
EPT page itself.  Add a private pointer to struct kvm_mmu_page for that
purpose, and add helper functions to allocate/free a page for the Secure EPT.
Also add helper functions to check whether a given kvm_mmu_page is private.
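
A minimal sketch of the intended allocation flow (the private/shared decision
and the call into the TDP MMU are wired up by a later patch in this series;
"mapping_private" is a placeholder name here, not code from this patch):

  sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
  sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
  if (mapping_private)
          kvm_mmu_alloc_private_sp(vcpu, sp);  /* backs sp->private_sp from the new cache */
  else
          kvm_mmu_init_private_sp(sp);         /* sp->private_sp = NULL */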

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/mmu/mmu.c          |  9 ++++
 arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c      |  3 ++
 4 files changed, 97 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index fcab2337819c..0c8cc7d73371 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -689,6 +689,7 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	struct kvm_mmu_memory_cache mmu_private_sp_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 6e9847b1124b..8def8b97978f 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -758,6 +758,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
 	int start, end, i, r;
 
+	if (kvm_gfn_stolen_mask(vcpu->kvm)) {
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
+					       PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
+	}
+
 	if (shadow_init_value)
 		start = kvm_mmu_memory_cache_nr_free_objects(mc);
 
@@ -799,6 +806,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
@@ -1791,6 +1799,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
 	if (!direct)
 		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
+	kvm_mmu_init_private_sp(sp);
 
 	/*
 	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index da6166b5c377..80f7a74a71dc 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -53,6 +53,10 @@ struct kvm_mmu_page {
 	u64 *spt;
 	/* hold the gfn of each spte inside spt */
 	gfn_t *gfns;
+#ifdef CONFIG_KVM_MMU_PRIVATE
+	/* associated private shadow page, e.g. SEPT page */
+	void *private_sp;
+#endif
 	/* Currently serving as active root */
 	union {
 		int root_count;
@@ -104,6 +108,86 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 	return kvm_mmu_role_as_id(sp->role);
 }
 
+/*
+ * TDX vcpu allocates page for root Secure EPT page and assigns to CPU secure
+ * EPT pointer.  KVM doesn't need to allocate and link to the secure EPT.
+ * Dummy value to make is_private_sp() return true.
+ */
+#define KVM_MMU_PRIVATE_SP_ROOT	((void *)1)
+
+#ifdef CONFIG_KVM_MMU_PRIVATE
+static inline bool is_private_sp(struct kvm_mmu_page *sp)
+{
+	return !!sp->private_sp;
+}
+
+static inline bool is_private_spte(u64 *sptep)
+{
+	return is_private_sp(sptep_to_sp(sptep));
+}
+
+static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
+{
+	return sp->private_sp;
+}
+
+static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp)
+{
+	sp->private_sp = NULL;
+}
+
+/* Valid sp->role.level is required. */
+static inline void kvm_mmu_alloc_private_sp(struct kvm_vcpu *vcpu,
+					struct kvm_mmu_page *sp)
+{
+	if (vcpu->arch.mmu->shadow_root_level == sp->role.level)
+		sp->private_sp = KVM_MMU_PRIVATE_SP_ROOT;
+	else
+		sp->private_sp =
+			kvm_mmu_memory_cache_alloc(
+				&vcpu->arch.mmu_private_sp_cache);
+	/*
+	 * Because mmu_private_sp_cache is topped up before starting to
+	 * resolve the KVM page fault, the allocation above shouldn't fail.
+	 */
+	WARN_ON_ONCE(!sp->private_sp);
+}
+
+static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
+{
+	if (sp->private_sp != KVM_MMU_PRIVATE_SP_ROOT)
+		free_page((unsigned long)sp->private_sp);
+}
+#else
+static inline bool is_private_sp(struct kvm_mmu_page *sp)
+{
+	return false;
+}
+
+static inline bool is_private_spte(u64 *sptep)
+{
+	return false;
+}
+
+static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
+{
+	return NULL;
+}
+
+static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp)
+{
+}
+
+static inline void kvm_mmu_alloc_private_sp(struct kvm_vcpu *vcpu,
+					struct kvm_mmu_page *sp)
+{
+}
+
+static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
+{
+}
+#endif
+
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 8db262440d5c..a68f3a22836b 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -59,6 +59,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
+	if (is_private_sp(sp))
+		kvm_mmu_free_private_sp(sp);
 	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
@@ -184,6 +186,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->role.word = page_role_for_level(vcpu, level).word;
 	sp->gfn = gfn;
 	sp->tdp_mmu_page = true;
+	kvm_mmu_init_private_sp(sp);
 
 	trace_kvm_mmu_get_page(sp, true);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (46 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  0:50   ` Kai Huang
  2022-04-29  0:28   ` Sagi Shahar
  2022-03-04 19:49 ` [RFC PATCH v5 049/104] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
                   ` (57 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Use the private EPT to mirror the Secure EPT.  On a change to a private EPT
entry, invoke a kvm_x86_ops hook in __handle_changed_spte() to propagate the
change to the Secure EPT.

On EPT violation, determine which EPT to use, private or shared, based on
the faulting GPA.  When allocating an EPT page, record whether it is private
or shared in the page role.  The private/shared distinction is passed down
to functions as an argument as necessary.

When zapping an EPT entry, the entry is frozen with a special EPT value that
clears the present bit.  After the TLB shootdown, the entry is set to its
eventual value.  When populating an EPT entry, the entry is set atomically.

For TDX, a TDX SEAMCALL is needed to update the Secure EPT, in addition to
the direct access to the private EPT entry.  For the zapping case, freezing
the EPT entry works: the TDX SEAMCALL can be issued in addition to the TLB
shootdown.  For populating the private EPT entry, there can be a race
condition without further protection:

  vcpu 1: populating 2M private EPT entry
  vcpu 2: populating 4K private EPT entry
  vcpu 2: TDX SEAMCALL to update 4K secure EPT => error
  vcpu 1: TDX SEAMCALL to update 2M secure EPT

To avoid the race, the frozen EPT entry is utilized.  Instead of atomically
updating the private EPT entry directly to the final value, freeze the
entry, call the hook that invokes the TDX SEAMCALL, and then set the entry
to the final value (unfreeze).
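
A minimal sketch of that freeze-then-unfreeze update (simplified from
tdp_mmu_set_spte_atomic() in the diff below; error handling and the iterator
plumbing are omitted):

  /* Freeze: park the private SPTE on the removed-SPTE marker. */
  if (cmpxchg64(sptep, old_spte, SHADOW_REMOVED_SPTE) != old_spte)
          return false;                 /* lost the race, caller retries */
  /* Propagate the change to the Secure EPT via the
   * kvm_x86_handle_changed_private_spte() hook (TDX SEAMCALL). */
  __handle_changed_spte(kvm, as_id, gfn, true, old_spte, new_spte, level, true);
  /* Unfreeze: publish the final value. */
  WRITE_ONCE(*sptep, new_spte);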

Support 4K page only at this stage.  2M page support can be done in future
patches.

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   2 +
 arch/x86/include/asm/kvm_host.h    |   8 +
 arch/x86/kvm/mmu/mmu.c             |  31 +++-
 arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 226 +++++++++++++++++++++++------
 arch/x86/kvm/mmu/tdp_mmu.h         |  13 +-
 virt/kvm/kvm_main.c                |   1 +
 7 files changed, 232 insertions(+), 51 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index ef48dcc98cfc..7e27b73d839f 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -91,6 +91,8 @@ KVM_X86_OP(set_tss_addr)
 KVM_X86_OP(set_identity_map_addr)
 KVM_X86_OP(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP(free_private_sp)
+KVM_X86_OP(handle_changed_private_spte)
 KVM_X86_OP_NULL(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 0c8cc7d73371..8406f8b5ab74 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -433,6 +433,7 @@ struct kvm_mmu {
 			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
 	hpa_t root_hpa;
+	hpa_t private_root_hpa;
 	gpa_t root_pgd;
 	union kvm_mmu_role mmu_role;
 	u8 root_level;
@@ -1433,6 +1434,13 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+			       void *private_sp);
+	void (*handle_changed_private_spte)(
+		struct kvm *kvm, gfn_t gfn, enum pg_level level,
+		kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
+		kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8def8b97978f..0ec9548ff4dd 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3422,6 +3422,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u8 shadow_root_level = mmu->shadow_root_level;
+	gfn_t gfn_stolen = kvm_gfn_stolen_mask(vcpu->kvm);
 	hpa_t root;
 	unsigned i;
 	int r;
@@ -3432,7 +3433,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
-		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+		if (gfn_stolen && !VALID_PAGE(mmu->private_root_hpa)) {
+			root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
+			mmu->private_root_hpa = root;
+		}
+		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
 		mmu->root_hpa = root;
 	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
 		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
@@ -5596,6 +5601,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 	int i;
 
 	mmu->root_hpa = INVALID_PAGE;
+	mmu->private_root_hpa = INVALID_PAGE;
 	mmu->root_pgd = 0;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
@@ -5772,6 +5778,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
 
 	write_unlock(&kvm->mmu_lock);
 
+	/*
+	 * For now the private root is never invalidated while the VM is running,
+	 * so this can only happen for shared roots.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_zap_invalidated_roots(kvm);
@@ -5871,7 +5881,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 	if (is_tdp_mmu_enabled(kvm)) {
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
 			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
-							  gfn_end, flush);
+							  gfn_end, flush,
+							  false);
 	}
 
 	if (flush)
@@ -5904,6 +5915,11 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 	}
 
+	/*
+	 * For now this can only happen for a non-TD VM, because TD private
+	 * mappings don't support write protection.  kvm_tdp_mmu_wrprot_slot()
+	 * will give a WARN() if it hits this for a TD.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
@@ -5952,6 +5968,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 		sp = sptep_to_sp(sptep);
 		pfn = spte_to_pfn(*sptep);
 
+		/* Private page dirty logging is not supported. */
+		KVM_BUG_ON(is_private_spte(sptep), kvm);
+
 		/*
 		 * We cannot do huge page mapping for indirect shadow pages,
 		 * which are found on the last rmap (level = 1) when not using
@@ -5992,6 +6011,11 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 	}
 
+	/*
+	 * This should only be reachable in the log-dirty case, which TD private
+	 * mappings don't support so far.  kvm_tdp_mmu_zap_collapsible_sptes()
+	 * internally gives a WARN() when it hits this.
+	 */
 	if (is_tdp_mmu_enabled(kvm)) {
 		read_lock(&kvm->mmu_lock);
 		kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
@@ -6266,6 +6290,9 @@ int kvm_mmu_module_init(void)
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
+	if (is_tdp_mmu_enabled(vcpu->kvm))
+		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
+				NULL);
 	free_mmu_pages(&vcpu->arch.root_mmu);
 	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index e19cabbcb65c..ad22d5b691c5 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -28,7 +28,7 @@ struct tdp_iter {
 	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
 	tdp_ptep_t sptep;
-	/* The lowest GFN mapped by the current SPTE */
+	/* The lowest GFN (stolen bits included) mapped by the current SPTE */
 	gfn_t gfn;
 	/* The level of the root page given to the iterator */
 	int root_level;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index a68f3a22836b..acba2590b51e 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -53,6 +53,11 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 	rcu_barrier();
 }
 
+static gfn_t tdp_iter_gfn_unalias(struct kvm *kvm, struct tdp_iter *iter)
+{
+	return kvm_gfn_unalias(kvm, iter->gfn);
+}
+
 static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 			  gfn_t start, gfn_t end, bool can_yield, bool flush,
 			  bool shared);
@@ -175,10 +180,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
 }
 
 static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
-					       int level)
+					       int level, bool private)
 {
 	struct kvm_mmu_page *sp;
 
+	WARN_ON(level != vcpu->arch.mmu->shadow_root_level &&
+		kvm_is_private_gfn(vcpu->kvm, gfn) != private);
+	WARN_ON(level == vcpu->arch.mmu->shadow_root_level && gfn != 0);
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
@@ -186,14 +194,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
 	sp->role.word = page_role_for_level(vcpu, level).word;
 	sp->gfn = gfn;
 	sp->tdp_mmu_page = true;
-	kvm_mmu_init_private_sp(sp);
+
+	if (private)
+		kvm_mmu_alloc_private_sp(vcpu, sp);
+	else
+		kvm_mmu_init_private_sp(sp);
 
 	trace_kvm_mmu_get_page(sp, true);
 
 	return sp;
 }
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
+						      bool private)
 {
 	union kvm_mmu_page_role role;
 	struct kvm *kvm = vcpu->kvm;
@@ -206,11 +219,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	/* Check for an existing root before allocating a new one. */
 	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
 		if (root->role.word == role.word &&
+		    is_private_sp(root) == private &&
 		    kvm_tdp_mmu_get_root(kvm, root))
 			goto out;
 	}
 
-	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
+	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level,
+			private);
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
 	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
@@ -218,12 +233,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 out:
-	return __pa(root->spt);
+	return root;
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
+{
+	return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared);
+				bool private_spte, u64 old_spte,
+				u64 new_spte, int level, bool shared);
 
 static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 {
@@ -321,6 +341,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 	int level = sp->role.level;
 	gfn_t base_gfn = sp->gfn;
 	int i;
+	bool private_sp = is_private_sp(sp);
 
 	trace_kvm_mmu_prepare_zap_page(sp);
 
@@ -370,7 +391,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 			 */
 			WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
 		}
-		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
+		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, private_sp,
 				    old_child_spte, SHADOW_REMOVED_SPTE, level,
 				    shared);
 	}
@@ -378,6 +399,17 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
 	kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
 					   KVM_PAGES_PER_HPAGE(level + 1));
 
+	if (private_sp &&
+		WARN_ON(static_call(kvm_x86_free_private_sp)(
+				kvm, sp->gfn, sp->role.level,
+				kvm_mmu_private_sp(sp)))) {
+		/*
+		 * Failed to unlink Secure EPT page and there is nothing to do
+		 * further.  Intentionally leak the page to prevent the kernel
+		 * from accessing the encrypted page.
+		 */
+		kvm_mmu_init_private_sp(sp);
+	}
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
@@ -386,6 +418,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
  * @kvm: kvm instance
  * @as_id: the address space of the paging structure the SPTE was a part of
  * @gfn: the base GFN that was mapped by the SPTE
+ * @private_spte: the SPTE is private or not
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
  * @level: the level of the PT the SPTE is part of in the paging structure
@@ -397,14 +430,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
  * This function must be called for all TDP SPTE modifications.
  */
 static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				  u64 old_spte, u64 new_spte, int level,
-				  bool shared)
+				  bool private_spte, u64 old_spte,
+				  u64 new_spte, int level, bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
-	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	bool pfn_changed = old_pfn != new_pfn;
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
@@ -468,23 +503,49 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	if (was_leaf && is_dirty_spte(old_spte) &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_pfn_dirty(old_pfn);
+
+	/*
+	 * Special handling for the private mapping.  We are either
+	 * setting up a new mapping at a middle-level page table or a
+	 * leaf, or tearing down an existing mapping.
+	 */
+	if (private_spte) {
+		void *sept_page = NULL;
+
+		if (is_present && !is_leaf) {
+			struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(new_pfn));
+
+			sept_page = kvm_mmu_private_sp(sp);
+			WARN_ON(!sept_page);
+			WARN_ON(sp->role.level + 1 != level);
+			WARN_ON(sp->gfn != gfn);
+		}
+
+		static_call(kvm_x86_handle_changed_private_spte)(
+			kvm, gfn, level,
+			old_pfn, was_present, was_leaf,
+			new_pfn, is_present, is_leaf, sept_page);
+	}
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
 	 * the paging structure.
 	 */
-	if (was_present && !was_leaf && (pfn_changed || !is_present))
+	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
+		WARN_ON(private_spte !=
+			is_private_spte(spte_to_child_pt(old_spte, level)));
 		handle_removed_tdp_mmu_page(kvm,
 				spte_to_child_pt(old_spte, level), shared);
+	}
 }
 
 static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared)
+				bool private_spte, u64 old_spte, u64 new_spte,
+				int level, bool shared)
 {
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
-			      shared);
+	__handle_changed_spte(kvm, as_id, gfn, private_spte,
+			old_spte, new_spte, level, shared);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
 	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
 				      new_spte, level);
@@ -505,6 +566,10 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter,
 					   u64 new_spte)
 {
+	bool freeze_spte = is_private_spte(iter->sptep) &&
+		!is_removed_spte(new_spte);
+	u64 tmp_spte = freeze_spte ? SHADOW_REMOVED_SPTE : new_spte;
+
 	WARN_ON_ONCE(iter->yielded);
 
 	lockdep_assert_held_read(&kvm->mmu_lock);
@@ -521,13 +586,16 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * does not hold the mmu_lock.
 	 */
 	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
-		      new_spte) != iter->old_spte)
+		      tmp_spte) != iter->old_spte)
 		return false;
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, true);
+	__handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
+			      iter->old_spte, new_spte, iter->level, true);
 	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
 
+	if (freeze_spte)
+		WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+
 	return true;
 }
 
@@ -603,8 +671,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 
 	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, false);
+	__handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
+			      iter->old_spte, new_spte, iter->level, false);
 	if (record_acc_track)
 		handle_changed_spte_acc_track(iter->old_spte, new_spte,
 					      iter->level);
@@ -644,9 +712,10 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 			continue;					\
 		else
 
-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
-			 _mmu->shadow_root_level, _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)		\
+	for_each_tdp_pte(_iter,							\
+		__va((_private) ? _mmu->private_root_hpa : _mmu->root_hpa),	\
+		_mmu->shadow_root_level, _start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -731,6 +800,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	 */
 	end = min(end, max_gfn_host);
 
+	/*
+	 * Extend [start, end) to include GFN shared bit when TDX is enabled,
+	 * and for shared mapping range.
+	 */
+	if (is_private_sp(root)) {
+		start = kvm_gfn_private(kvm, start);
+		end = kvm_gfn_private(kvm, end);
+	} else {
+		start = kvm_gfn_shared(kvm, start);
+		end = kvm_gfn_shared(kvm, end);
+	}
+
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
 	rcu_read_lock();
@@ -783,13 +864,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
  * MMU lock.
  */
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush)
+				 gfn_t end, bool can_yield, bool flush,
+				 bool zap_private)
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
+	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) {
+		/* Skip private page table if not requested */
+		if (!zap_private && is_private_sp(root))
+			continue;
 		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
 				      false);
+	}
 
 	return flush;
 }
@@ -800,7 +886,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
 	int i;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
-		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
+		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush, true);
 
 	if (flush)
 		kvm_flush_remote_tlbs(kvm);
@@ -851,6 +937,13 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 	while (root) {
 		next_root = next_invalidated_root(kvm, root);
 
+		/*
+		 * Private table is only torn down when VM is destroyed.
+		 * It is a bug to zap private table here.
+		 */
+		if (WARN_ON(is_private_sp(root)))
+			goto out;
+
 		rcu_read_unlock();
 
 		flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
@@ -865,7 +958,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
 
 		rcu_read_lock();
 	}
-
+out:
 	rcu_read_unlock();
 
 	if (flush)
@@ -897,9 +990,16 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 	struct kvm_mmu_page *root;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
-	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
+	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+		/*
+		 * Skip private root since private page table
+		 * is only torn down when VM is destroyed.
+		 */
+		if (is_private_sp(root))
+			continue;
 		if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
 			root->role.invalid = true;
+	}
 }
 
 /*
@@ -914,14 +1014,23 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	u64 new_spte;
 	int ret = RET_PF_FIXED;
 	bool wrprot = false;
+	unsigned long pte_access = ACC_ALL;
 
 	WARN_ON(sp->role.level != fault->goal_level);
+
+	/* TDX shared GPAs are not executable, enforce this for the SDV. */
+	if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
+		pte_access &= ~ACC_EXEC_MASK;
+
 	if (unlikely(!fault->slot))
-		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
+		new_spte = make_mmio_spte(vcpu,
+				tdp_iter_gfn_unalias(vcpu->kvm, iter),
+				pte_access);
 	else
-		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+		wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
+				tdp_iter_gfn_unalias(vcpu->kvm, iter),
+				fault->pfn, iter->old_spte, fault->prefetch,
+				true, fault->map_writable, &new_spte);
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -959,7 +1068,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 }
 
 static bool tdp_mmu_populate_nonleaf(
-	struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
+	struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool is_private,
+	bool account_nx)
 {
 	struct kvm_mmu_page *sp;
 	u64 *child_pt;
@@ -968,7 +1078,7 @@ static bool tdp_mmu_populate_nonleaf(
 	WARN_ON(is_shadow_present_pte(iter->old_spte));
 	WARN_ON(is_removed_spte(iter->old_spte));
 
-	sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1);
+	sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1, is_private);
 	child_pt = sp->spt;
 
 	new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);
@@ -991,6 +1101,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	struct tdp_iter iter;
+	gfn_t raw_gfn;
+	bool is_private;
 	int ret;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -999,7 +1111,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 	rcu_read_lock();
 
-	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+	raw_gfn = gpa_to_gfn(fault->addr);
+	is_private = kvm_is_private_gfn(vcpu->kvm, raw_gfn);
+
+	if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn)) {
+		if (is_private) {
+			rcu_read_unlock();
+			return -EFAULT;
+		}
+	}
+
+	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
 		if (fault->nx_huge_page_workaround_enabled)
 			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
 
@@ -1015,6 +1137,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		    is_large_pte(iter.old_spte)) {
 			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
 				break;
+			/*
+			 * TODO: large page support.
+			 * Doesn't support large page for TDX now
+			 */
+			WARN_ON(is_private_spte(&iter.old_spte));
+
 
 			/*
 			 * The iter must explicitly re-read the spte here
@@ -1037,7 +1165,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 			account_nx = fault->huge_page_disallowed &&
 				fault->req_level >= iter.level;
-			if (!tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
+			if (!tdp_mmu_populate_nonleaf(
+					vcpu, &iter, is_private, account_nx))
 				break;
 		}
 	}
@@ -1058,9 +1187,12 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
 {
 	struct kvm_mmu_page *root;
 
-	for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
+	for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) {
+		if (is_private_sp(root))
+			continue;
 		flush = zap_gfn_range(kvm, root, range->start, range->end,
-				      range->may_block, flush, false);
+				range->may_block, flush, false);
+	}
 
 	return flush;
 }
@@ -1513,10 +1645,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	int leaf = -1;
+	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
 
 	*root_level = vcpu->arch.mmu->shadow_root_level;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	if (WARN_ON(is_private))
+		return leaf;
+
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
@@ -1542,12 +1678,16 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	tdp_ptep_t sptep = NULL;
+	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	if (is_private)
+		goto out;
+
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		*spte = iter.old_spte;
 		sptep = iter.sptep;
 	}
-
+out:
 	/*
 	 * Perform the rcu_dereference to get the raw spte pointer value since
 	 * we are passing it up to fast_page_fault, which is shared with the
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 3899004a5d91..7c62f694a465 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -5,7 +5,7 @@
 
 #include <linux/kvm_host.h>
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
 
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
 						     struct kvm_mmu_page *root)
@@ -20,11 +20,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);
 
 bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush);
+				 gfn_t end, bool can_yield, bool flush,
+				 bool zap_private);
 static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
-					     gfn_t start, gfn_t end, bool flush)
+					     gfn_t start, gfn_t end, bool flush,
+					     bool zap_private)
 {
-	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
+	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
+			zap_private);
 }
 static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 {
@@ -41,7 +44,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	 */
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
-					   sp->gfn, end, false, false);
+					   sp->gfn, end, false, false, false);
 }
 
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ae3bf553f215..d4e117f5b5b9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -190,6 +190,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
 
 	return true;
 }
+EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);
 
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 049/104] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (47 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:15   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 050/104] [MARKER] The start of TDX KVM patch series: TDX EPT violation isaku.yamahata
                   ` (56 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Some KVM MMU operations (dirty page logging, page migration, page aging)
aren't supported for private GFNs (yet).  Silently return when such an
operation is requested for a private mapping.

For dirty logging and page aging, the GFN should be recorded un-aliased,
i.e. without the shared/private stolen bit.  Insert GFN un-aliasing where
necessary.
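
A minimal sketch of the two idioms this patch applies, assuming the
is_private_spte() and kvm_gfn_unalias() helpers introduced earlier in this
series; example_handler() and do_bookkeeping() are hypothetical stand-ins
for the real hooks touched in the diff below:

  static bool example_handler(struct kvm *kvm, struct tdp_iter *iter)
  {
          /* The operation isn't supported for private mappings; silently bail. */
          if (is_private_spte(iter->sptep))
                  return false;

          /* Memslot bookkeeping wants the GFN without the stolen alias bit. */
          return do_bookkeeping(kvm, kvm_gfn_unalias(kvm, iter->gfn));
  }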

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 100 +++++++++++++++++++++++++++++++++++--
 1 file changed, 96 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index acba2590b51e..1949f81027a0 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -257,11 +257,20 @@ static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 }
 
 static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
+					  bool private_spte,
 					  u64 old_spte, u64 new_spte, int level)
 {
 	bool pfn_changed;
 	struct kvm_memory_slot *slot;
 
+	/*
+	 * TDX doesn't support live migration.  Never mark private page as
+	 * dirty in log-dirty bitmap, since it's not possible for userspace
+	 * hypervisor to live migrate private page anyway.
+	 */
+	if (private_spte)
+		return;
+
 	if (level > PG_LEVEL_4K)
 		return;
 
@@ -269,6 +278,8 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	if ((!is_writable_pte(old_spte) || pfn_changed) &&
 	    is_writable_pte(new_spte)) {
+		/* For memory slot operations, use GFN without aliasing */
+		gfn = kvm_gfn_unalias(kvm, gfn);
 		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
 		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
@@ -547,7 +558,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	__handle_changed_spte(kvm, as_id, gfn, private_spte,
 			old_spte, new_spte, level, shared);
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
-	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
+	handle_changed_spte_dirty_log(kvm, as_id, gfn, private_spte, old_spte,
 				      new_spte, level);
 }
 
@@ -678,6 +689,7 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
 					      iter->level);
 	if (record_dirty_log)
 		handle_changed_spte_dirty_log(kvm, iter->as_id, iter->gfn,
+					      is_private_spte(iter->sptep),
 					      iter->old_spte, new_spte,
 					      iter->level);
 }
@@ -1215,7 +1227,23 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	 * into this helper allow blocking; it'd be dead, wasteful code.
 	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
-		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+		/*
+		 * For TDX shared mapping, set GFN shared bit to the range,
+		 * so the handler() doesn't need to set it, to avoid duplicated
+		 * code in multiple handler()s.
+		 */
+		gfn_t start;
+		gfn_t end;
+
+		if (is_private_sp(root)) {
+			start = kvm_gfn_private(kvm, range->start);
+			end = kvm_gfn_private(kvm, range->end);
+		} else {
+			start = kvm_gfn_shared(kvm, range->start);
+			end = kvm_gfn_shared(kvm, range->end);
+		}
+
+		tdp_root_for_each_leaf_pte(iter, root, start, end)
 			ret |= handler(kvm, &iter, range);
 	}
 
@@ -1237,6 +1265,15 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 	if (!is_accessed_spte(iter->old_spte))
 		return false;
 
+	/*
+	 * First TDX generation doesn't support clearing A bit for private
+	 * mapping, since there's no secure EPT API to support it.  However
+	 * it's a legitimate request for TDX guest, so just return w/o a
+	 * WARN().
+	 */
+	if (is_private_spte(iter->sptep))
+		return false;
+
 	new_spte = iter->old_spte;
 
 	if (spte_ad_enabled(new_spte)) {
@@ -1281,6 +1318,13 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	/* Huge pages aren't expected to be modified without first being zapped. */
 	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
 
+	/*
+	 * .change_pte() callback should not happen for private page, because
+	 * for now TDX private pages are pinned during VM's life time.
+	 */
+	if (WARN_ON(is_private_spte(iter->sptep)))
+		return false;
+
 	if (iter->level != PG_LEVEL_4K ||
 	    !is_shadow_present_pte(iter->old_spte))
 		return false;
@@ -1332,6 +1376,16 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	/*
+	 * First TDX generation doesn't support write protecting private
+	 * mappings, since there's no secure EPT API to support it.  It
+	 * is a bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return spte_set;
+	start = kvm_gfn_shared(kvm, start);
+	end = kvm_gfn_shared(kvm, end);
+
 	rcu_read_lock();
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
@@ -1398,6 +1452,16 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 	u64 new_spte;
 	bool spte_set = false;
 
+	/*
+	 * First TDX generation doesn't support clearing dirty bit,
+	 * since there's no secure EPT API to support it.  It is a
+	 * bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return spte_set;
+	start = kvm_gfn_shared(kvm, start);
+	end = kvm_gfn_shared(kvm, end);
+
 	rcu_read_lock();
 
 	tdp_root_for_each_leaf_pte(iter, root, start, end) {
@@ -1467,6 +1531,15 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
 	struct tdp_iter iter;
 	u64 new_spte;
 
+	/*
+	 * First TDX generation doesn't support clearing dirty bit,
+	 * since there's no secure EPT API to support it.  It is a
+	 * bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return;
+	gfn = kvm_gfn_shared(kvm, gfn);
+
 	rcu_read_lock();
 
 	tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
@@ -1530,6 +1603,16 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 	struct tdp_iter iter;
 	kvm_pfn_t pfn;
 
+	/*
+	 * This should only be reachable in case of log-dirty, which TD
+	 * private mapping doesn't support so far.  Give a WARN() if it
+	 * hits private mapping.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return;
+	start = kvm_gfn_shared(kvm, start);
+	end = kvm_gfn_shared(kvm, end);
+
 	rcu_read_lock();
 
 	tdp_root_for_each_pte(iter, root, start, end) {
@@ -1543,8 +1626,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
 
 		pfn = spte_to_pfn(iter.old_spte);
 		if (kvm_is_reserved_pfn(pfn) ||
-		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
-							    pfn, PG_LEVEL_NUM))
+		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot,
+			    tdp_iter_gfn_unalias(kvm, &iter), pfn,
+			    PG_LEVEL_NUM))
 			continue;
 
 		/* Note, a successful atomic zap also does a remote TLB flush. */
@@ -1590,6 +1674,14 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
 
+	/*
+	 * First TDX generation doesn't support write protecting private
+	 * mappings, since there's no secure EPT API to support it.  It
+	 * is a bug to reach here for TDX guest.
+	 */
+	if (WARN_ON(is_private_sp(root)))
+		return spte_set;
+
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 050/104] [MARKER] The start of TDX KVM patch series: TDX EPT violation
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (48 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 049/104] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 051/104] KVM: TDX: TDP MMU TDX support isaku.yamahata
                   ` (55 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TDX EPT
violation.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 9b2b811c5bdf..a14355332d44 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -19,12 +19,12 @@ Patch Layer status
 * TDX architectural definitions:        Applied
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
-* TDX EPT violation:                    Not yet
+* TDX EPT violation:                    Applying
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA stolen bits:              Applied
 * KVM TDP refactoring for TDX:          Applied
-* KVM TDP MMU hooks:                    Applying
+* KVM TDP MMU hooks:                    Applied
 * KVM TDP MMU MapGPA:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 051/104] KVM: TDX: TDP MMU TDX support
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (49 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 050/104] [MARKER] The start of TDX KVM patch series: TDX EPT violation isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  2:20   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 052/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA isaku.yamahata
                   ` (54 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB shootdown,
propagating private EPT entry changes to the Secure EPT, and freeing Secure
EPT pages.

The TLB flush hook handles both shared and private EPT.  It flushes the
shared EPT the same way as VMX and also waits for the TDX TLB shootdown to
complete.

The hook to free a Secure EPT page unlinks the page from the Secure EPT so
that it can be freed back to the OS.

For propagating an entry change to the Secure EPT, the possible changes are
present -> non-present (zapping) and non-present -> present (population).
On population, simply link the Secure EPT page or the private guest page
into the Secure EPT with a TDX SEAMCALL.
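
A condensed sketch of that population path, using the SEAMCALL wrappers from
tdx_ops.h in the diff below (locking and error handling omitted):

  /* Non-leaf: link a new Secure EPT page; leaf: map the private guest page. */
  if (!is_leaf)
          err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
                                 pg_level_to_tdx_sept_level(level),
                                 __pa(sept_page), &out);
  else
          err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gfn_to_gpa(gfn),
                                 pfn_to_hpa(pfn), &out);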

Because the TDP MMU allows concurrent zapping/population, zapping requires a
synchronous TLB shootdown while the EPT entry is frozen.  It blocks the
Secure EPT entry, increments the TLB tracking counter, sends IPIs to remote
vcpus to trigger the TLB flush, and only then unlinks the private guest page
from the Secure EPT.

For simplicity, batched zapping with the exclusive lock held is handled the
same way as concurrent zapping.  Although this is inefficient, it can be
optimized in the future.
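
The resulting order for zapping a private leaf, condensed from
tdx_handle_changed_private_spte() in the diff below (error handling and the
unassigned-HKID / not-yet-finalized cases omitted):

  /* 1. Block the Secure EPT entry (TDH.MEM.RANGE.BLOCK). */
  tdx_sept_zap_private_spte(kvm, gfn, level);
  /* 2. TLB tracking and shootdown: TDH.MEM.TRACK plus IPIs to all vcpus. */
  kvm_flush_remote_tlbs(kvm);
  /* 3. Only now unlink and unpin the guest page (TDH.MEM.PAGE.REMOVE). */
  tdx_sept_drop_private_spte(kvm, gfn, level, old_pfn);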

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  40 +++++-
 arch/x86/kvm/vmx/tdx.c     | 246 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |  14 +++
 arch/x86/kvm/vmx/tdx_ops.h |   3 +
 arch/x86/kvm/vmx/x86_ops.h |   2 +
 5 files changed, 301 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 6969e3557bd4..f571b07c2aae 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,38 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_flush_tlb(vcpu);
+
+	vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_flush_tlb(vcpu);
+
+	vmx_flush_tlb_current(vcpu);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_flush_tlb_guest(vcpu);
+}
+
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			int pgd_level)
 {
@@ -162,10 +194,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_rflags = vmx_set_rflags,
 	.get_if_flag = vmx_get_if_flag,
 
-	.tlb_flush_all = vmx_flush_tlb_all,
-	.tlb_flush_current = vmx_flush_tlb_current,
-	.tlb_flush_gva = vmx_flush_tlb_gva,
-	.tlb_flush_guest = vmx_flush_tlb_guest,
+	.tlb_flush_all = vt_flush_tlb_all,
+	.tlb_flush_current = vt_flush_tlb_current,
+	.tlb_flush_gva = vt_flush_tlb_gva,
+	.tlb_flush_guest = vt_flush_tlb_guest,
 
 	.vcpu_pre_run = vmx_vcpu_pre_run,
 	.run = vmx_vcpu_run,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 51098e10b6a0..5d74ae001e4f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -5,7 +5,9 @@
 
 #include "capabilities.h"
 #include "x86_ops.h"
+#include "mmu.h"
 #include "tdx.h"
+#include "vmx.h"
 #include "x86.h"
 
 #undef pr_fmt
@@ -272,6 +274,15 @@ int tdx_vm_init(struct kvm *kvm)
 	int ret, i;
 	u64 err;
 
+	/*
+	 * To generate EPT violation to inject #VE instead of EPT MISCONFIG,
+	 * set RWX=0.
+	 */
+	kvm_mmu_set_mmio_spte_mask(kvm, 0, VMX_EPT_RWX_MASK, 0);
+
+	/* TODO: Enable 2mb and 1gb large page support. */
+	kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+
 	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
 	kvm->max_vcpus = 0;
 
@@ -331,6 +342,8 @@ int tdx_vm_init(struct kvm *kvm)
 		tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
 	}
 
+	spin_lock_init(&kvm_tdx->seamcall_lock);
+
 	/*
 	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
 	 * ioctl() to define the configure CPUID values for the TD.
@@ -501,6 +514,220 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
 }
 
+static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+					enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	struct tdx_module_output out;
+	u64 err;
+
+	if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
+		return;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return;
+
+	/* Pin the page, TDX KVM doesn't yet support page migration. */
+	get_page(pfn_to_page(pfn));
+
+	if (likely(is_td_finalized(kvm_tdx))) {
+		err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
+		if (KVM_BUG_ON(err, kvm))
+			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
+		return;
+	}
+}
+
+static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	spin_lock(&kvm_tdx->seamcall_lock);
+	__tdx_sept_set_private_spte(kvm, gfn, level, pfn);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+}
+
+static void tdx_sept_drop_private_spte(
+	struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	hpa_t hpa_with_hkid;
+	struct tdx_module_output out;
+	u64 err = 0;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return;
+
+	spin_lock(&kvm_tdx->seamcall_lock);
+	if (is_hkid_assigned(kvm_tdx)) {
+		err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
+			goto unlock;
+		}
+
+		hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+			goto unlock;
+		}
+	} else
+		err = tdx_reclaim_page((unsigned long)__va(hpa), hpa);
+
+unlock:
+	spin_unlock(&kvm_tdx->seamcall_lock);
+
+	if (!err)
+		put_page(pfn_to_page(pfn));
+}
+
+static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
+				    enum pg_level level, void *sept_page)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = __pa(sept_page);
+	struct tdx_module_output out;
+	u64 err;
+
+	spin_lock(&kvm_tdx->seamcall_lock);
+	err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	struct tdx_module_output out;
+	u64 err;
+
+	spin_lock(&kvm_tdx->seamcall_lock);
+	err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
+	spin_unlock(&kvm_tdx->seamcall_lock);
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
+}
+
+static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				    void *sept_page)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int ret;
+
+	/*
+	 * free_private_sp() is (obviously) called when a shadow page is being
+	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
+	 */
+	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
+		return -EINVAL;
+
+	spin_lock(&kvm_tdx->seamcall_lock);
+	ret = tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
+	spin_unlock(&kvm_tdx->seamcall_lock);
+
+	return ret;
+}
+
+static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx;
+	u64 err;
+
+	if (!is_td(kvm))
+		return -EOPNOTSUPP;
+
+	kvm_tdx = to_kvm_tdx(kvm);
+	if (!is_hkid_assigned(kvm_tdx))
+		return 0;
+
+	/* If TD isn't finalized, it's before any vcpu running. */
+	if (unlikely(!is_td_finalized(kvm_tdx)))
+		return 0;
+
+	kvm_tdx->tdh_mem_track = true;
+
+	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
+
+	err = tdh_mem_track(kvm_tdx->tdr.pa);
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+
+	WRITE_ONCE(kvm_tdx->tdh_mem_track, false);
+
+	return 0;
+}
+
+static void tdx_handle_changed_private_spte(
+	struct kvm *kvm, gfn_t gfn, enum pg_level level,
+	kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
+	kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page)
+{
+	WARN_ON(!is_td(kvm));
+	lockdep_assert_held(&kvm->mmu_lock);
+
+	if (is_present) {
+		/* TDP MMU doesn't change present -> present */
+		WARN_ON(was_present);
+
+		/*
+		 * Use different call to either set up middle level
+		 * private page table, or leaf.
+		 */
+		if (is_leaf)
+			tdx_sept_set_private_spte(kvm, gfn, level, new_pfn);
+		else {
+			WARN_ON(!sept_page);
+			if (tdx_sept_link_private_sp(kvm, gfn, level, sept_page))
+				/* failed to update Secure-EPT.  */
+				WARN_ON(1);
+		}
+	} else if (was_leaf) {
+		/* non-present -> non-present doesn't make sense. */
+		WARN_ON(!was_present);
+
+		/*
+		 * Zap private leaf SPTE.  Zapping private table is done
+		 * below in handle_removed_tdp_mmu_page().
+		 */
+		tdx_sept_zap_private_spte(kvm, gfn, level);
+
+		/*
+		 * TDX requires TLB tracking before dropping private page.  Do
+		 * it here, although it is also done later.
+		 * If hkid isn't assigned, the guest is destroying and no vcpu
+		 * runs further.  TLB shootdown isn't needed.
+		 *
+		 * TODO: implement with_range version for optimization.
+		 * kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+		 *   => tdx_sept_tlb_remote_flush_with_range(kvm, gfn,
+		 *                                 KVM_PAGES_PER_HPAGE(level));
+		 */
+		if (is_hkid_assigned(to_kvm_tdx(kvm)))
+			kvm_flush_remote_tlbs(kvm);
+
+		tdx_sept_drop_private_spte(kvm, gfn, level, old_pfn);
+	}
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
@@ -736,6 +963,21 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	u64 root_hpa = mmu->root_hpa;
+
+	/* Flush the shared EPTP, if it's valid. */
+	if (VALID_PAGE(root_hpa))
+		ept_sync_context(construct_eptp(vcpu, root_hpa,
+						mmu->shadow_root_level));
+
+	while (READ_ONCE(kvm_tdx->tdh_mem_track))
+		cpu_relax();
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -901,6 +1143,10 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	hkid_start_pos = boot_cpu_data.x86_phys_bits;
 	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
 
+	x86_ops->tlb_remote_flush = tdx_sept_tlb_remote_flush;
+	x86_ops->free_private_sp = tdx_sept_free_private_sp;
+	x86_ops->handle_changed_private_spte = tdx_handle_changed_private_spte;
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index b32e068c51b4..906666c7c70b 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -29,9 +29,17 @@ struct kvm_tdx {
 	struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
 
 	bool finalized;
+	bool tdh_mem_track;
 
 	u64 tsc_offset;
 	unsigned long tsc_khz;
+
+	/*
+	 * Lock to prevent seamcalls from running concurrently
+	 * when TDP MMU is enabled, because TDP fault handler
+	 * runs concurrently.
+	 */
+	spinlock_t seamcall_lock;
 };
 
 struct vcpu_tdx {
@@ -166,6 +174,12 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
 	return out.r8;
 }
 
+static __always_inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+	WARN_ON(level == PG_LEVEL_NONE);
+	return level - 1;
+}
+
 #else
 #define enable_tdx false
 static inline int tdx_module_setup(void) { return -ENODEV; };
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index dc76b3a5cf96..cb40edc8c245 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -30,12 +30,14 @@ static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
 				struct tdx_module_output *out)
 {
+	tdx_clflush_page(hpa);
 	return kvm_seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, out);
 }
 
 static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
 				struct tdx_module_output *out)
 {
+	tdx_clflush_page(page);
 	return kvm_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, out);
 }
 
@@ -48,6 +50,7 @@ static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
 static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
 				struct tdx_module_output *out)
 {
+	tdx_clflush_page(hpa);
 	return kvm_seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, out);
 }
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ad9b1c883761..922a3799336e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -144,6 +144,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
+void tdx_flush_tlb(struct kvm_vcpu *vcpu);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline void tdx_pre_kvm_init(
@@ -163,6 +164,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
+static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 052/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (50 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 051/104] KVM: TDX: TDP MMU TDX support isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 053/104] KVM: x86/mmu: steal software usable bit for EPT to represent shared page isaku.yamahata
                   ` (53 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of KVM TDP MMU
MapGPA.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index a14355332d44..d56b3890ddfe 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -11,6 +11,7 @@ What qemu can do
 - TDX VM TYPE is exposed to Qemu.
 - Qemu can create/destroy guest of TDX vm type.
 - Qemu can create/destroy vcpu of TDX vm type.
+- Qemu can populate initial guest memory image.
 
 Patch Layer status
 ------------------
@@ -19,7 +20,7 @@ Patch Layer status
 * TDX architectural definitions:        Applied
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
-* TDX EPT violation:                    Applying
+* TDX EPT violation:                    Applied
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 053/104] KVM: x86/mmu: steal software usable bit for EPT to represent shared page
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (51 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 052/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 15:21   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 054/104] KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping isaku.yamahata
                   ` (52 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

With TDX, all GFNs are private at guest boot time.  At run time the guest TD
can explicitly convert a GFN from private to shared, or vice versa, with the
MapGPA hypercall.  Once specified, the given GFN can't be used the other
way.  That is, if a guest tells KVM that a GFN is shared, it can't be used
as private, and vice versa.

KVM needs to record this.  Steal a software-usable SPTE bit for it from the
MMIO generation bits.
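
Since the MMIO generation's high bits now end at bit 61 (see the diff below),
the stolen bit can't collide with an MMIO SPTE's generation.  A sketch of an
assertion capturing that invariant (not part of the diff), assuming the
existing MMIO_SPTE_GEN_*_MASK definitions in spte.h:

  /* Bit 62 must not be covered by the (narrowed) MMIO generation masks. */
  static_assert(!(SPTE_PRIVATE_PROHIBIT &
                  (MMIO_SPTE_GEN_LOW_MASK | MMIO_SPTE_GEN_HIGH_MASK)));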

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.h | 15 ++++++++++++---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index e88f796724b4..25dffdb488d1 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -14,6 +14,9 @@
  */
 #define SPTE_MMU_PRESENT_MASK		BIT_ULL(11)
 
+/* Masks that used to track for shared GPA **/
+#define SPTE_PRIVATE_PROHIBIT	BIT_ULL(62)
+
 /*
  * TDP SPTES (more specifically, EPT SPTEs) may not have A/D bits, and may also
  * be restricted to using write-protection (for L2 when CPU dirty logging, i.e.
@@ -124,7 +127,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
  * the memslots generation and is derived as follows:
  *
  * Bits 0-7 of the MMIO generation are propagated to spte bits 3-10
- * Bits 8-18 of the MMIO generation are propagated to spte bits 52-62
+ * Bits 8-18 of the MMIO generation are propagated to spte bits 52-61
  *
  * The KVM_MEMSLOT_GEN_UPDATE_IN_PROGRESS flag is intentionally not included in
  * the MMIO generation number, as doing so would require stealing a bit from
@@ -138,7 +141,7 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
 #define MMIO_SPTE_GEN_LOW_END		10
 
 #define MMIO_SPTE_GEN_HIGH_START	52
-#define MMIO_SPTE_GEN_HIGH_END		62
+#define MMIO_SPTE_GEN_HIGH_END		61
 
 #define MMIO_SPTE_GEN_LOW_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_END, \
 						    MMIO_SPTE_GEN_LOW_START)
@@ -151,7 +154,7 @@ static_assert(!(SPTE_MMU_PRESENT_MASK &
 #define MMIO_SPTE_GEN_HIGH_BITS		(MMIO_SPTE_GEN_HIGH_END - MMIO_SPTE_GEN_HIGH_START + 1)
 
 /* remember to adjust the comment above as well if you change these */
-static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
+static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 10);
 
 #define MMIO_SPTE_GEN_LOW_SHIFT		(MMIO_SPTE_GEN_LOW_START - 0)
 #define MMIO_SPTE_GEN_HIGH_SHIFT	(MMIO_SPTE_GEN_HIGH_START - MMIO_SPTE_GEN_LOW_BITS)
@@ -208,6 +211,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
+static_assert(!(REMOVED_SPTE & SPTE_PRIVATE_PROHIBIT));
 
 /*
  * See above comment around REMOVED_SPTE.  SHADOW_REMOVED_SPTE is the actual
@@ -222,6 +226,11 @@ static inline bool is_removed_spte(u64 spte)
 	return spte == SHADOW_REMOVED_SPTE;
 }
 
+static inline bool is_private_prohibit_spte(u64 spte)
+{
+	return !!(spte & SPTE_PRIVATE_PROHIBIT);
+}
+
 /*
  * In some cases, we need to preserve the GFN of a non-present or reserved
  * SPTE when we usurp the upper five bits of the physical address space to
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 054/104] KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (52 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 053/104] KVM: x86/mmu: steal software usable bit for EPT to represent shared page isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  1:43   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 055/104] KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT isaku.yamahata
                   ` (51 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

SPTE_PRIVATE_PROHIBIT specifies whether the shared or the private GPA is
allowed.  It needs to be preserved across zapping of the EPT entry.
Currently a zapped EPT entry is unconditionally initialized to
shadow_init_value, which clears the SPTE_PRIVATE_PROHIBIT bit.  To carry the
SPTE_PRIVATE_PROHIBIT bit, introduce a helper function that returns the
initial value for a zapped entry with the SPTE_PRIVATE_PROHIBIT bit
preserved, and replace shadow_init_value with it.
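
A small illustration (not part of the diff) of what the new
shadow_init_spte() helper preserves:

  u64 old_spte = shadow_init_value | SPTE_PRIVATE_PROHIBIT; /* shared-only GFN */
  u64 new_spte = shadow_init_spte(old_spte);
  /* new_spte keeps SPTE_PRIVATE_PROHIBIT set; a bare shadow_init_value
   * would have silently dropped it. */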

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 19 +++++++++++++++----
 1 file changed, 15 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 1949f81027a0..6d750563824d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -610,6 +610,12 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	return true;
 }
 
+static u64 shadow_init_spte(u64 old_spte)
+{
+	return shadow_init_value |
+		(is_private_prohibit_spte(old_spte) ? SPTE_PRIVATE_PROHIBIT : 0);
+}
+
 static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 					   struct tdp_iter *iter)
 {
@@ -641,7 +647,8 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * shadow_init_value (which sets "suppress #VE" bit) so it
 	 * can be set when EPT table entries are zapped.
 	 */
-	WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
+	WRITE_ONCE(*rcu_dereference(iter->sptep),
+		shadow_init_spte(iter->old_spte));
 
 	return true;
 }
@@ -853,7 +860,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
 
 		if (!shared) {
 			/* see comments in tdp_mmu_zap_spte_atomic() */
-			tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
+			tdp_mmu_set_spte(kvm, &iter,
+					shadow_init_spte(iter.old_spte));
 			flush = true;
 		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
 			/*
@@ -1038,11 +1046,14 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 		new_spte = make_mmio_spte(vcpu,
 				tdp_iter_gfn_unalias(vcpu->kvm, iter),
 				pte_access);
-	else
+	else {
 		wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
 				tdp_iter_gfn_unalias(vcpu->kvm, iter),
 				fault->pfn, iter->old_spte, fault->prefetch,
 				true, fault->map_writable, &new_spte);
+		if (is_private_prohibit_spte(iter->old_spte))
+			new_spte |= SPTE_PRIVATE_PROHIBIT;
+	}
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1335,7 +1346,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	 * invariant that the PFN of a present * leaf SPTE can never change.
 	 * See __handle_changed_spte().
 	 */
-	tdp_mmu_set_spte(kvm, iter, shadow_init_value);
+	tdp_mmu_set_spte(kvm, iter, shadow_init_spte(iter->old_spte));
 
 	if (!pte_write(range->pte)) {
 		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 055/104] KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (53 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 054/104] KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 056/104] KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX isaku.yamahata
                   ` (50 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Use the SPTE_PRIVATE_PROHIBIT bit in the shared and private EPT to determine
which mapping, shared or private, is allowed.  If the requested mapping
isn't allowed, return RET_PF_RETRY to wait for another vcpu to change it
with the MapGPA hypercall.
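
For a TDX VM, the check added to tdp_mmu_map_handle_target_level() below
reduces to the following for a fault backed by a memslot:

  if (is_private_spte(iter->sptep)) {
          /* Private fault, but the guest declared the GFN shared: retry. */
          if (is_private_prohibit_spte(iter->old_spte))
                  return RET_PF_RETRY;
  } else {
          /* Shared fault, but the GFN is still private-only: retry. */
          if (!is_private_prohibit_spte(iter->old_spte))
                  return RET_PF_RETRY;
  }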

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.h    |  2 +-
 arch/x86/kvm/mmu/tdp_mmu.c | 22 +++++++++++++++++++---
 2 files changed, 20 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 25dffdb488d1..9c37381a6762 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -223,7 +223,7 @@ extern u64 __read_mostly shadow_init_value;
 
 static inline bool is_removed_spte(u64 spte)
 {
-	return spte == SHADOW_REMOVED_SPTE;
+	return (spte & ~SPTE_PRIVATE_PROHIBIT) == SHADOW_REMOVED_SPTE;
 }
 
 static inline bool is_private_prohibit_spte(u64 spte)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6d750563824d..f6bd35831e32 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1038,9 +1038,25 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 
 	WARN_ON(sp->role.level != fault->goal_level);
 
-	/* TDX shared GPAs are not executable, enforce this for the SDV. */
-	if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
-		pte_access &= ~ACC_EXEC_MASK;
+	if (kvm_gfn_stolen_mask(vcpu->kvm)) {
+		if (is_private_spte(iter->sptep)) {
+			/*
+			 * This GPA is not allowed to map as private.  Let
+			 * vcpu loop in page fault until other vcpu change it
+			 * by MapGPA hypercall.
+			 */
+			if (fault->slot &&
+				is_private_prohibit_spte(iter->old_spte))
+				return RET_PF_RETRY;
+		} else {
+			/* This GPA is not allowed to map as shared. */
+			if (fault->slot &&
+				!is_private_prohibit_spte(iter->old_spte))
+				return RET_PF_RETRY;
+			/* TDX shared GPAs are not executable, enforce this. */
+			pte_access &= ~ACC_EXEC_MASK;
+		}
+	}
 
 	if (unlikely(!fault->slot))
 		new_spte = make_mmio_spte(vcpu,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 056/104] KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (54 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 055/104] KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 057/104] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
                   ` (49 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX Guest-Hypervisor Communication Interface (GHCI) specification
defines the MapGPA hypercall for a guest TD to request that the host VMM map
a given GPA range as private or shared.

It means the guest TD will use the GPA as shared (or private) and won't use
it as private (or shared).  The VMM should enforce this GPA usage.  The VMM
doesn't have to map the GPA at the time of the hypercall.

- Zap the aliased region.
  If a shared (or private) GPA is requested, zap the private (or shared)
  alias (modulo the shared bit).

- Record that the requested GPA is shared (or private) with
  SPTE_PRIVATE_PROHIBIT in the SPTEs of both the shared and private EPT
  tables.
  - With SPTE_PRIVATE_PROHIBIT set, a shared GPA is allowed.
  - With SPTE_PRIVATE_PROHIBIT cleared, a private GPA is allowed.

  The reason to record SPTE_PRIVATE_PROHIBIT in both the shared and private
  EPT is to optimize the EPT violation path for the normal guest TD
  execution path and penalize the MapGPA hypercall instead.

  If the guest TD faults on a GPA that isn't allowed (modulo the shared
  bit), KVM doesn't resolve the EPT violation and lets the vcpu retry.  The
  vcpu will keep faulting until another vcpu maps the region with the MapGPA
  hypercall.  With the initial SPTE value (shadow_init_value),
  SPTE_PRIVATE_PROHIBIT is cleared, so the default behavior doesn't change.

- Don't map the GPA.
  The GPA is mapped on the next EPT violation.  A condensed sketch of the
  resulting handling follows this list.
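
A condensed sketch of the flow implemented below by kvm_mmu_map_gpa() and the
kvm_tdp_mmu_update_{shared,private}_spte() helpers (memslot iteration,
locking and error handling trimmed):

  allow_private = kvm_is_private_gfn(kvm, start);
  start = kvm_gfn_unalias(kvm, start);
  end = kvm_gfn_unalias(kvm, end);

  /*
   * Per 4K entry, under the write mmu_lock:
   *   MapGPA(private): shared EPT  -> zap and clear SPTE_PRIVATE_PROHIBIT
   *                    private EPT -> clear SPTE_PRIVATE_PROHIBIT
   *   MapGPA(shared):  shared EPT  -> set SPTE_PRIVATE_PROHIBIT
   *                    private EPT -> zap and set SPTE_PRIVATE_PROHIBIT
   */
  kvm_tdp_mmu_map_gpa(vcpu, &start, end, allow_private);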

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h         |   2 +
 arch/x86/kvm/mmu/mmu.c     |  56 ++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c | 208 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.h |   3 +
 4 files changed, 269 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index b49841e4faaa..ac4540aa694d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -305,6 +305,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end);
 
 int kvm_arch_write_log_dirty(struct kvm_vcpu *vcpu);
 
+int kvm_mmu_map_gpa(struct kvm_vcpu *vcpu, gfn_t *startp, gfn_t end);
+
 int kvm_mmu_post_init_vm(struct kvm *kvm);
 void kvm_mmu_pre_destroy_vm(struct kvm *kvm);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 0ec9548ff4dd..e2d4a7d546e1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6119,6 +6119,62 @@ void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen)
 	}
 }
 
+int kvm_mmu_map_gpa(struct kvm_vcpu *vcpu, gfn_t *startp, gfn_t end)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_memslots *slots;
+	struct kvm_memslot_iter iter;
+	gfn_t start = *startp;
+	bool allow_private;
+	int ret;
+
+	if (!kvm_gfn_stolen_mask(kvm))
+		return -EOPNOTSUPP;
+
+	ret = mmu_topup_memory_caches(vcpu, false);
+	if (ret)
+		return ret;
+
+	allow_private = kvm_is_private_gfn(kvm, start);
+	start = kvm_gfn_unalias(kvm, start);
+	end = kvm_gfn_unalias(kvm, end);
+
+	mutex_lock(&kvm->slots_lock);
+	write_lock(&kvm->mmu_lock);
+
+	slots = __kvm_memslots(kvm, 0 /* only normal ram. not SMM. */);
+	kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+		struct kvm_memory_slot *memslot = iter.slot;
+		gfn_t s = max(start, memslot->base_gfn);
+		gfn_t e = min(end, memslot->base_gfn + memslot->npages);
+
+		if (WARN_ON_ONCE(s >= e))
+			continue;
+		if (is_tdp_mmu_enabled(kvm)) {
+			ret = kvm_tdp_mmu_map_gpa(vcpu, &s, e, allow_private);
+			if (ret) {
+				start = s;
+				break;
+			}
+		} else {
+			ret = -EOPNOTSUPP;
+			break;
+		}
+	}
+
+	write_unlock(&kvm->mmu_lock);
+	mutex_unlock(&kvm->slots_lock);
+
+	if (ret == -EAGAIN) {
+		if (allow_private)
+			*startp = kvm_gfn_private(kvm, start);
+		else
+			*startp = kvm_gfn_shared(kvm, start);
+	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_gpa);
+
 static unsigned long
 mmu_shrink_scan(struct shrinker *shrink, struct shrink_control *sc)
 {
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f6bd35831e32..b33ace3d4456 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -533,6 +533,13 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 			WARN_ON(sp->gfn != gfn);
 		}
 
+		/*
+		 * SPTE_PRIVATE_PROHIBIT is only changed by map_gpa that obtains
+		 * write lock of mmu_lock.
+		 */
+		WARN_ON(shared &&
+			(is_private_prohibit_spte(old_spte) !=
+				is_private_prohibit_spte(new_spte)));
 		static_call(kvm_x86_handle_changed_private_spte)(
 			kvm, gfn, level,
 			old_pfn, was_present, was_leaf,
@@ -1751,6 +1758,207 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	return spte_set;
 }
 
+typedef void (*update_spte_t)(
+	struct kvm *kvm, struct tdp_iter *iter, bool allow_private);
+
+static int kvm_tdp_mmu_update_range(struct kvm_vcpu *vcpu, bool is_private,
+				gfn_t start, gfn_t end, gfn_t *nextp,
+				update_spte_t fn, bool allow_private)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct tdp_iter iter;
+	int ret = 0;
+
+	rcu_read_lock();
+	tdp_mmu_for_each_pte(iter, vcpu->arch.mmu, is_private, start, end) {
+		if (iter.level == PG_LEVEL_4K) {
+			fn(kvm, &iter, allow_private);
+			continue;
+		}
+
+		/*
+		 * Which GPA is allowed, private or shared, is recorded at 4K
+		 * granularity in the private leaf spte as SPTE_PRIVATE_PROHIBIT.
+		 * Break a large page into 4K entries.
+		 */
+		if (is_shadow_present_pte(iter.old_spte) &&
+			is_large_pte(iter.old_spte)) {
+			/*
+			 * TODO: large page support.
+			 * Doesn't support large page for TDX now
+			 */
+			WARN_ON_ONCE(true);
+			tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
+			iter.old_spte = READ_ONCE(*rcu_dereference(iter.sptep));
+		}
+
+		if (!is_shadow_present_pte(iter.old_spte)) {
+			/*
+			 * Guarantee that alloc_tdp_mmu_page() succeeds, which
+			 * assumes page allocation from the cache always succeeds.
+			 */
+			if (vcpu->arch.mmu_page_header_cache.nobjs == 0 ||
+				vcpu->arch.mmu_shadow_page_cache.nobjs == 0 ||
+				vcpu->arch.mmu_private_sp_cache.nobjs == 0) {
+				ret = -EAGAIN;
+				break;
+			}
+			/*
+			 * write lock of mmu_lock is held.  No other thread
+			 * freezes SPTE.
+			 */
+			if (!tdp_mmu_populate_nonleaf(
+					vcpu, &iter, is_private, false)) {
+				/* As the write lock is held, this case shouldn't happen. */
+				WARN_ON_ONCE(true);
+				ret = -EAGAIN;
+				break;
+			}
+		}
+	}
+	rcu_read_unlock();
+
+	if (ret == -EAGAIN)
+		*nextp = iter.next_last_level_gfn;
+
+	return ret;
+}
+
+static void kvm_tdp_mmu_update_shared_spte(
+	struct kvm *kvm, struct tdp_iter *iter, bool allow_private)
+{
+	u64 new_spte;
+
+	WARN_ON(kvm_is_private_gfn(kvm, iter->gfn));
+	if (allow_private) {
+		/* Zap SPTE and clear PRIVATE_PROHIBIT */
+		new_spte = shadow_init_value;
+		if (new_spte != iter->old_spte)
+			tdp_mmu_set_spte(kvm, iter, new_spte);
+	} else {
+		new_spte = iter->old_spte | SPTE_PRIVATE_PROHIBIT;
+		/* No side effect is needed */
+		if (new_spte != iter->old_spte)
+			WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+	}
+}
+
+static void kvm_tdp_mmu_update_private_spte(
+	struct kvm *kvm, struct tdp_iter *iter, bool allow_private)
+{
+	u64 new_spte;
+
+	WARN_ON(!kvm_is_private_gfn(kvm, iter->gfn));
+	if (allow_private) {
+		new_spte = iter->old_spte & ~SPTE_PRIVATE_PROHIBIT;
+		/* No side effect is needed */
+		if (new_spte != iter->old_spte)
+			WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+	} else {
+		if (is_shadow_present_pte(iter->old_spte)) {
+			/* Zap SPTE */
+			new_spte = shadow_init_value | SPTE_PRIVATE_PROHIBIT;
+			tdp_mmu_set_spte(kvm, iter, new_spte);
+		} else {
+			new_spte = iter->old_spte | SPTE_PRIVATE_PROHIBIT;
+			/* No side effect is needed */
+			WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
+		}
+	}
+}
+
+/*
+ * Whether a GPA may be mapped private or shared is recorded in both the
+ * private and shared leaf spte entries as the SPTE_PRIVATE_PROHIBIT bit.  They must match.
+ * private leaf spte entry
+ * - present: private mapping is allowed. (already mapped)
+ * - non-present: private mapping is allowed.
+ * - present | PRIVATE_PROHIBIT: invalid state.
+ * - non-present | SPTE_PRIVATE_PROHIBIT: shared mapping is allowed.
+ *                                        may or may not be mapped as shared.
+ * shared leaf spte entry
+ * - present: invalid state
+ * - non-present: private mapping is allowed.
+ * - present | PRIVATE_PROHIBIT: shared mapping is allowed (already mapped)
+ * - non-present | PRIVATE_PROHIBIT: shared mapping is allowed.
+ *
+ * state change of private spte:
+ * map_gpa(private):
+ *      private EPT entry: clear PRIVATE_PROHIBIT
+ *	  present: nop
+ *	  non-present: nop
+ *	  non-present | PRIVATE_PROHIBIT -> non-present
+ *	shared EPT entry: zap and clear PRIVATE_PROHIBIT
+ *	  any -> non-present
+ * map_gpa(shared):
+ *	private EPT entry: zap and set PRIVATE_PROHIBIT
+ *	  present     -> non-present | PRIVATE_PROHIBIT
+ *	  non-present -> non-present | PRIVATE_PROHIBIT
+ *	  non-present | PRIVATE_PROHIBIT: nop
+ *	shared EPT entry: set PRIVATE_PROHIBIT
+ *	  present | PRIVATE_PROHIBIT: nop
+ *	  non-present -> non-present | PRIVATE_PROHIBIT
+ *	  non-present | PRIVATE_PROHIBIT: nop
+ * map(private GPA):
+ *	private EPT entry: try to populate
+ *	  present: nop
+ *	  non-present -> present
+ *	  non-present | PRIVATE_PROHIBIT: nop. looping on EPT violation
+ *	shared EPT entry: nop
+ * map(shared GPA):
+ *	private EPT entry: nop
+ *	shared EPT entry: populate
+ *	  present | PRIVATE_PROHIBIT: nop
+ *	  non-present | PRIVATE_PROHIBIT -> present | PRIVATE_PROHIBIT
+ *	  non-present: nop. looping on EPT violation
+ * zap(private GPA):
+ *	private EPT entry: zap and keep PRIVATE_PROHIBIT
+ *	  present | PRIVATE_PROHIBIT -> non-present | PRIVATE_PROHIBIT
+ *	  non-present: nop as is_shadow_present_pte() is checked
+ *	  non-present | PRIVATE_PROHIBIT: nop by is_shadow_present_pte()
+ *	shared EPT entry: nop
+ * zap(shared GPA):
+ *	private EPT entry: nop
+ *	shared EPT entry: zap and keep PRIVATE_PROHIBIT
+ *	  present | PRIVATE_PROHIBIT -> non-present | PRIVATE_PROHIBIT
+ *	  non-present | PRIVATE_PROHIBIT: nop
+ *	  non-present: nop.
+ */
+int kvm_tdp_mmu_map_gpa(struct kvm_vcpu *vcpu,
+			gfn_t *startp, gfn_t end, bool allow_private)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	gfn_t start = *startp;
+	gfn_t next;
+	int ret = 0;
+
+	lockdep_assert_held_write(&kvm->mmu_lock);
+	WARN_ON(start & kvm_gfn_stolen_mask(kvm));
+	WARN_ON(end & kvm_gfn_stolen_mask(kvm));
+
+	if (!VALID_PAGE(mmu->root_hpa) || !VALID_PAGE(mmu->private_root_hpa))
+		return -EINVAL;
+
+	next = end;
+	ret = kvm_tdp_mmu_update_range(
+		vcpu, false, kvm_gfn_shared(kvm, start), kvm_gfn_shared(kvm, end),
+		&next, kvm_tdp_mmu_update_shared_spte, allow_private);
+	if (ret) {
+		kvm_flush_remote_tlbs_with_address(kvm, start, next - start);
+		return ret;
+	}
+
+	ret = kvm_tdp_mmu_update_range(vcpu, true, start, end, &next,
+				kvm_tdp_mmu_update_private_spte, allow_private);
+	if (ret == -EAGAIN) {
+		*startp = next;
+		end = *startp;
+	}
+	kvm_flush_remote_tlbs_with_address(kvm, start, end - start);
+	return ret;
+}
+
 /*
  * Return the level of the lowest level SPTE added to sptes.
  * That SPTE may be non-present.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 7c62f694a465..0f83960d92aa 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -74,6 +74,9 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 				   struct kvm_memory_slot *slot, gfn_t gfn,
 				   int min_level);
 
+int kvm_tdp_mmu_map_gpa(struct kvm_vcpu *vcpu,
+			gfn_t *startp, gfn_t end, bool is_private);
+
 static inline void kvm_tdp_mmu_walk_lockless_begin(void)
 {
 	rcu_read_lock();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 057/104] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (55 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 056/104] KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 058/104] KVM: x86/mmu: Forcibly use TDP MMU for TDX isaku.yamahata
                   ` (48 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path.  This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.
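
As an illustration of the intended use, a later patch in this series
(KVM_TDX_INIT_MEM_REGION) drives the helper roughly as in the sketch below;
the gpa value and error code here are illustrative only:

  kvm_pfn_t pfn;

  /* Pre-fault a private GPA at 4K granularity and consume the pfn. */
  pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_WRITE_MASK | PFERR_USER_MASK,
                             PG_LEVEL_4K);
  if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
          return -EFAULT;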

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 38 ++++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index ac4540aa694d..bd93e7235812 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -205,6 +205,9 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 	return vcpu->arch.mmu->page_fault(vcpu, &fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level);
+
 /*
  * Currently, we have two sorts of write-protection, a) the first one
  * write-protects guest page to sync the guest modification, b) another one is
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e2d4a7d546e1..72d8f200c819 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4192,6 +4192,44 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level)
+{
+	int r;
+	struct kvm_page_fault fault = (struct kvm_page_fault) {
+		.addr = gpa,
+		.error_code = error_code,
+		.exec = error_code & PFERR_FETCH_MASK,
+		.write = error_code & PFERR_WRITE_MASK,
+		.present = error_code & PFERR_PRESENT_MASK,
+		.rsvd = error_code & PFERR_RSVD_MASK,
+		.user = error_code & PFERR_USER_MASK,
+		.prefetch = false,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
+	};
+
+	if (mmu_topup_memory_caches(vcpu, false))
+		return KVM_PFN_ERR_FAULT;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		fault.max_level = max_level;
+		fault.req_level = PG_LEVEL_4K;
+		fault.goal_level = PG_LEVEL_4K;
+		r = direct_page_fault(vcpu, &fault);
+	} while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+	return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 058/104] KVM: x86/mmu: Forcibly use TDP MMU for TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (56 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 057/104] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  1:49   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 059/104] [MARKER] The start of TDX KVM patch series: TD finalization isaku.yamahata
                   ` (47 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

At this point, TDX supports the TDP MMU and doesn't support the legacy MMU.
Forcibly use the TDP MMU for TDX regardless of the kernel parameter that
disables the TDP MMU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index b33ace3d4456..9df6aa4da202 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -16,7 +16,12 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
 /* Initializes the TDP MMU for the VM, if enabled. */
 bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 {
-	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
+	/*
+	 *  Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
+	 *  of TDX.
+	 */
+	if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
+		(!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
 		return false;
 
 	/* This should not be changed for the lifetime of the VM. */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 059/104] [MARKER] The start of TDX KVM patch series: TD finalization
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (57 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 058/104] KVM: x86/mmu: Forcibly use TDP MMU for TDX isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 060/104] KVM: TDX: Create initial guest memory isaku.yamahata
                   ` (46 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD finalization.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index d56b3890ddfe..3737b966ea07 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -21,11 +21,11 @@ Patch Layer status
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Applied
-* TD finalization:                      Not yet
+* TD finalization:                      Applying
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA stolen bits:              Applied
 * KVM TDP refactoring for TDX:          Applied
 * KVM TDP MMU hooks:                    Applied
-* KVM TDP MMU MapGPA:                   Not yet
+* KVM TDP MMU MapGPA:                   Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 060/104] KVM: TDX: Create initial guest memory
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (58 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 059/104] [MARKER] The start of TDX KVM patch series: TD finalization isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  2:30   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 061/104] KVM: TDX: Finalize VM initialization isaku.yamahata
                   ` (45 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because guest memory is protected in TDX, creating the initial guest memory
requires a dedicated TDX module API, tdh_mem_page_add, instead of directly
copying the memory contents into guest memory as is done for the default VM
type.  The KVM MMU page fault handler callback, private_page_add, handles it.

Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of the VM-scoped
KVM_MEMORY_ENCRYPT_OP ioctl.  It assigns the guest page, copies the initial
memory contents into guest memory, and encrypts the guest memory.  At the
same time, it optionally extends the memory measurement of the TDX guest.
It calls the KVM MMU page fault (EPT-violation) handler to trigger the
callbacks for this.
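
For illustration, userspace would drive this subcommand roughly as sketched
below.  The kvm_tdx_cmd field names (id/metadata/data) are assumed from how
tdx_vm_ioctl() consumes them, and the GPA is a made-up example:

  /* Hypothetical userspace sketch: add one measured page to the TD. */
  struct kvm_tdx_init_mem_region region = {
          .source_addr = (__u64)source_page,   /* page-aligned user memory */
          .gpa = 0x80000000,                   /* page-aligned private GPA */
          .nr_pages = 1,
  };
  struct kvm_tdx_cmd cmd = {
          .id = KVM_TDX_INIT_MEM_REGION,
          .metadata = KVM_TDX_MEASURE_MEMORY_REGION,
          .data = (__u64)&region,
  };

  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);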

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h       |   9 ++
 arch/x86/kvm/mmu/mmu.c                |   1 +
 arch/x86/kvm/vmx/tdx.c                | 128 ++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                |   2 +
 tools/arch/x86/include/uapi/asm/kvm.h |   9 ++
 5 files changed, 149 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9702f0d95776..77f46260d868 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -574,4 +575,12 @@ struct kvm_tdx_init_vm {
 	__u64 reserved[43];	/* must be zero for future extensibility */
 };
 
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 72d8f200c819..23c954035227 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5226,6 +5226,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
+EXPORT_SYMBOL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5d74ae001e4f..cd726c41d362 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -514,6 +514,21 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
 }
 
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+{
+	struct tdx_module_output out;
+	u64 err;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+		err = tdh_mr_extend(kvm_tdx->tdr.pa, gpa + i, &out);
+		if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
+			pr_tdx_error(TDH_MR_EXTEND, err, &out);
+			break;
+		}
+	}
+}
+
 static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 					enum pg_level level, kvm_pfn_t pfn)
 {
@@ -521,6 +536,7 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	hpa_t hpa = pfn_to_hpa(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	struct tdx_module_output out;
+	hpa_t source_pa;
 	u64 err;
 
 	if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
@@ -533,12 +549,33 @@ static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	/* Pin the page, TDX KVM doesn't yet support page migration. */
 	get_page(pfn_to_page(pfn));
 
+	/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
 	if (likely(is_td_finalized(kvm_tdx))) {
 		err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
 		if (KVM_BUG_ON(err, kvm))
 			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
 		return;
 	}
+
+	/*
+	 * With the TDP MMU, the fault handler can run concurrently.  Note that
+	 * 'source_pa' is a TD-scope variable, so if multiple threads reached
+	 * here and all needed to access 'source_pa', this would break.
+	 * Fortunately that can't happen: the TDH_MEM_PAGE_ADD path below is
+	 * only used while the VM is being created, before it runs, via the
+	 * KVM_TDX_INIT_MEM_REGION ioctl (which always uses vcpu 0's page table
+	 * and is protected by vcpu->mutex).
+	 */
+	WARN_ON(kvm_tdx->source_pa == INVALID_PAGE);
+	source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
+
+	err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &out);
+	if (KVM_BUG_ON(err, kvm))
+		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
+	else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
+		tdx_measure_page(kvm_tdx, gpa);
+
+	kvm_tdx->source_pa = INVALID_PAGE;
 }
 
 static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
@@ -978,6 +1015,94 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu)
 		cpu_relax();
 }
 
+#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_mem_region region;
+	struct kvm_vcpu *vcpu;
+	struct page *page;
+	kvm_pfn_t pfn;
+	int idx, ret = 0;
+
+	/* The BSP vCPU must be created before initializing memory regions. */
+	if (!atomic_read(&kvm->online_vcpus))
+		return -EINVAL;
+
+	if (cmd->metadata & ~KVM_TDX_MEASURE_MEMORY_REGION)
+		return -EINVAL;
+
+	if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
+		return -EFAULT;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) ||
+	    !IS_ALIGNED(region.gpa, PAGE_SIZE) ||
+	    !region.nr_pages ||
+	    region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+	    !kvm_is_private_gpa(kvm, region.gpa) ||
+	    !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
+		return -EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	vcpu_load(vcpu);
+	idx = srcu_read_lock(&kvm->srcu);
+
+	kvm_mmu_reload(vcpu);
+
+	while (region.nr_pages) {
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		if (need_resched())
+			cond_resched();
+
+
+		/* Pin the source page. */
+		ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
+		if (ret < 0)
+			break;
+		if (ret != 1) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
+				     (cmd->metadata & KVM_TDX_MEASURE_MEMORY_REGION);
+
+		pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+					   PG_LEVEL_4K);
+		if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
+			ret = -EFAULT;
+		else
+			ret = 0;
+
+		put_page(page);
+		if (ret)
+			break;
+
+		region.source_addr += PAGE_SIZE;
+		region.gpa += PAGE_SIZE;
+		region.nr_pages--;
+	}
+
+	srcu_read_unlock(&kvm->srcu, idx);
+	vcpu_put(vcpu);
+
+	mutex_unlock(&vcpu->mutex);
+
+	if (copy_to_user((void __user *)cmd->data, &region, sizeof(region)))
+		ret = -EFAULT;
+
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -995,6 +1120,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_INIT_VM:
 		r = tdx_td_init(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_INIT_MEM_REGION:
+		r = tdx_init_mem_region(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 906666c7c70b..bf9865a88991 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -28,6 +28,8 @@ struct kvm_tdx {
 	int cpuid_nent;
 	struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
 
+	hpa_t source_pa;
+
 	bool finalized;
 	bool tdh_mem_track;
 
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 9702f0d95776..77f46260d868 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -574,4 +575,12 @@ struct kvm_tdx_init_vm {
 	__u64 reserved[43];	/* must be zero for future extensibility */
 };
 
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 061/104] KVM: TDX: Finalize VM initialization
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (59 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 060/104] KVM: TDX: Create initial guest memory isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 13:52   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 062/104] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit isaku.yamahata
                   ` (44 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

To protect the initial contents of the guest TD, the TDX module measures
the guest TD during the build process, producing a SHA-384 measurement.
The measurement of the guest TD contents must be completed before the
guest TD is ready to run.

Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
to run.
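
As with the other subcommands, userspace invokes this through the VM-scoped
KVM_MEMORY_ENCRYPT_OP ioctl; a minimal sketch, with the kvm_tdx_cmd field
name (id) assumed as in the previous patch:

  struct kvm_tdx_cmd cmd = {
          .id = KVM_TDX_FINALIZE_VM,
  };

  /* Finalize the measurement; the TD becomes runnable after this. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);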

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h       |  1 +
 arch/x86/kvm/vmx/tdx.c                | 21 +++++++++++++++++++++
 tools/arch/x86/include/uapi/asm/kvm.h |  1 +
 3 files changed, 23 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 77f46260d868..943219a08fcd 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
 	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cd726c41d362..85d5f961d97e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1103,6 +1103,24 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+
+	if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	err = tdh_mr_finalize(kvm_tdx->tdr.pa);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
+		return -EIO;
+	}
+
+	kvm_tdx->finalized = true;
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -1123,6 +1141,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_INIT_MEM_REGION:
 		r = tdx_init_mem_region(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_FINALIZE_VM:
+		r = tdx_td_finalizemr(kvm);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 77f46260d868..943219a08fcd 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
 	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 062/104] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (60 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 061/104] KVM: TDX: Finalize VM initialization isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 063/104] KVM: TDX: Add helper assembly function to TDX vcpu isaku.yamahata
                   ` (43 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD vcpu
enter/exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 3737b966ea07..e6af9ad4e23f 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -12,6 +12,7 @@ What qemu can do
 - Qemu can create/destroy guest of TDX vm type.
 - Qemu can create/destroy vcpu of TDX vm type.
 - Qemu can populate initial guest memory image.
+- Qemu can finalize guest TD.
 
 Patch Layer status
 ------------------
@@ -21,8 +22,8 @@ Patch Layer status
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Applied
-* TD finalization:                      Applying
-* TD vcpu enter/exit:                   Not yet
+* TD finalization:                      Applied
+* TD vcpu enter/exit:                   Applying
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA stolen bits:              Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 063/104] KVM: TDX: Add helper assembly function to TDX vcpu
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (61 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 062/104] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
                   ` (42 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX defines an API to run a TDX vcpu with its own ABI.  Add an assembly
helper function that runs the TDX vcpu and hides the special ABI, so that C
code can call it with the standard function call ABI.
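
For reference, the C-side declaration and call site added by the next patch
in the series look like this (sketch):

  u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);

  /* Enter the TD vcpu; the return value is the TD-exit reason. */
  tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);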

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/vmenter.S | 146 +++++++++++++++++++++++++++++++++++++
 1 file changed, 146 insertions(+)

diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 435c187927c4..33dc5aa2f0db 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -2,6 +2,7 @@
 #include <linux/linkage.h>
 #include <asm/asm.h>
 #include <asm/bitsperlong.h>
+#include <asm/errno.h>
 #include <asm/kvm_vcpu_regs.h>
 #include <asm/nospec-branch.h>
 #include <asm/segment.h>
@@ -28,6 +29,13 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
+#ifdef CONFIG_INTEL_TDX_HOST
+#define TDENTER 		0
+#define EXIT_REASON_TDCALL	77
+#define TDENTER_ERROR_BIT	63
+#include "seamcall.h"
+#endif
+
 .section .noinstr.text, "ax"
 
 /**
@@ -328,3 +336,141 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
 	pop %_ASM_BP
 	RET
 SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+.pushsection .noinstr.text, "ax"
+
+/**
+ * __tdx_vcpu_run - Call SEAMCALL(TDENTER) to run a TD vcpu
+ * @tdvpr:	physical address of TDVPR
+ * @regs:	void * (to registers of TDVCPU)
+ * @gpr_mask:	non-zero if guest registers need to be loaded prior to TDENTER
+ *
+ * Returns:
+ *	TD-Exit Reason
+ *
+ * Note: KVM doesn't support using XMM in its hypercalls; it's the HyperV
+ *	 code's responsibility to save/restore XMM registers on TDVMCALL.
+ */
+SYM_FUNC_START(__tdx_vcpu_run)
+	push %rbp
+	mov  %rsp, %rbp
+
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+	push %rbx
+
+	/* Save @regs, which is needed after TDENTER to capture output. */
+	push %rsi
+
+	/* Load @tdvpr to RCX */
+	mov %rdi, %rcx
+
+	/* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
+	test %dx, %dx
+	je 1f
+
+	/* Load @regs to RAX, which will be clobbered with $TDENTER anyways. */
+	mov %rsi, %rax
+
+	mov VCPU_RBX(%rax), %rbx
+	mov VCPU_RDX(%rax), %rdx
+	mov VCPU_RBP(%rax), %rbp
+	mov VCPU_RSI(%rax), %rsi
+	mov VCPU_RDI(%rax), %rdi
+
+	mov VCPU_R8 (%rax),  %r8
+	mov VCPU_R9 (%rax),  %r9
+	mov VCPU_R10(%rax), %r10
+	mov VCPU_R11(%rax), %r11
+	mov VCPU_R12(%rax), %r12
+	mov VCPU_R13(%rax), %r13
+	mov VCPU_R14(%rax), %r14
+	mov VCPU_R15(%rax), %r15
+
+	/*  Load TDENTER to RAX.  This kills the @regs pointer! */
+1:	mov $TDENTER, %rax
+
+2:	seamcall
+
+	/* Skip to the exit path if TDENTER failed. */
+	bt $TDENTER_ERROR_BIT, %rax
+	jc 4f
+
+	/* Temporarily save the TD-Exit reason. */
+	push %rax
+
+	/* check if TD-exit due to TDVMCALL */
+	cmp $EXIT_REASON_TDCALL, %ax
+
+	/* Reload @regs to RAX. */
+	mov 8(%rsp), %rax
+
+	/* Jump on non-TDVMCALL */
+	jne 3f
+
+	/* Save all output from SEAMCALL(TDENTER) */
+	mov %rbx, VCPU_RBX(%rax)
+	mov %rbp, VCPU_RBP(%rax)
+	mov %rsi, VCPU_RSI(%rax)
+	mov %rdi, VCPU_RDI(%rax)
+	mov %r10, VCPU_R10(%rax)
+	mov %r11, VCPU_R11(%rax)
+	mov %r12, VCPU_R12(%rax)
+	mov %r13, VCPU_R13(%rax)
+	mov %r14, VCPU_R14(%rax)
+	mov %r15, VCPU_R15(%rax)
+
+3:	mov %rcx, VCPU_RCX(%rax)
+	mov %rdx, VCPU_RDX(%rax)
+	mov %r8,  VCPU_R8 (%rax)
+	mov %r9,  VCPU_R9 (%rax)
+
+	/*
+	 * Clear all general purpose registers except RSP and RAX to prevent
+	 * speculative use of the guest's values.
+	 */
+	xor %rbx, %rbx
+	xor %rcx, %rcx
+	xor %rdx, %rdx
+	xor %rsi, %rsi
+	xor %rdi, %rdi
+	xor %rbp, %rbp
+	xor %r8,  %r8
+	xor %r9,  %r9
+	xor %r10, %r10
+	xor %r11, %r11
+	xor %r12, %r12
+	xor %r13, %r13
+	xor %r14, %r14
+	xor %r15, %r15
+
+	/* Restore the TD-Exit reason to RAX for return. */
+	pop %rax
+
+	/* "POP" @regs. */
+4:	add $8, %rsp
+	pop %rbx
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	pop %rbp
+	ret
+
+5:	cmpb $0, kvm_rebooting
+	je 6f
+	mov $-EFAULT, %rax
+	jmp 4b
+6:	ud2
+	_ASM_EXTABLE(2b, 5b)
+
+SYM_FUNC_END(__tdx_vcpu_run)
+
+.popsection
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (62 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 063/104] KVM: TDX: Add helper assembly function to TDX vcpu isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-22 17:28   ` Erdem Aktas
  2022-03-04 19:49 ` [RFC PATCH v5 065/104] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
                   ` (41 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement running a TDX vcpu.  Once a vcpu runs on a logical processor
(LP), the TDX vcpu is associated with it.  When the TDX vcpu moves to
another LP, it needs to flush its state on the previous LP.  When a TDX
vcpu is destroyed, it needs to complete that flush and flush the CPU memory
cache.  Track which LP the TDX vcpu ran on and flush as necessary.

Do nothing on the sched_in event, as TDX doesn't support pause-loop exiting.

TDX vcpu execution requires restoring the PMU debug store after returning
back to KVM because the TDX module unconditionally resets the value.  To
reuse the existing code, export perf_restore_debug_store.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 10 +++++++++-
 arch/x86/kvm/vmx/tdx.c     | 34 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     | 33 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 arch/x86/kvm/x86.c         |  1 +
 5 files changed, 79 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f571b07c2aae..2e5a7a72d560 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,14 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_run(vcpu);
+
+	return vmx_vcpu_run(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -200,7 +208,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.tlb_flush_guest = vt_flush_tlb_guest,
 
 	.vcpu_pre_run = vmx_vcpu_pre_run,
-	.run = vmx_vcpu_run,
+	.run = vt_vcpu_run,
 	.handle_exit = vmx_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 85d5f961d97e..ebe4f9bf19e7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -10,6 +10,9 @@
 #include "vmx.h"
 #include "x86.h"
 
+#include <trace/events/kvm.h>
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt) "tdx: " fmt
 
@@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
+
+static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+					struct vcpu_tdx *tdx)
+{
+	guest_enter_irqoff();
+	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
+	guest_exit_irqoff();
+}
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (unlikely(vcpu->kvm->vm_bugged)) {
+		tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
+		return EXIT_FASTPATH_NONE;
+	}
+
+	trace_kvm_entry(vcpu);
+
+	tdx_vcpu_enter_exit(vcpu, tdx);
+
+	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+	trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+	if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
+		return EXIT_FASTPATH_NONE;
+	return EXIT_FASTPATH_NONE;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index bf9865a88991..e950404ce5de 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -44,12 +44,45 @@ struct kvm_tdx {
 	spinlock_t seamcall_lock;
 };
 
+union tdx_exit_reason {
+	struct {
+		/* 31:0 mirror the VMX Exit Reason format */
+		u64 basic		: 16;
+		u64 reserved16		: 1;
+		u64 reserved17		: 1;
+		u64 reserved18		: 1;
+		u64 reserved19		: 1;
+		u64 reserved20		: 1;
+		u64 reserved21		: 1;
+		u64 reserved22		: 1;
+		u64 reserved23		: 1;
+		u64 reserved24		: 1;
+		u64 reserved25		: 1;
+		u64 bus_lock_detected	: 1;
+		u64 enclave_mode	: 1;
+		u64 smi_pending_mtf	: 1;
+		u64 smi_from_vmx_root	: 1;
+		u64 reserved30		: 1;
+		u64 failed_vmentry	: 1;
+
+		/* 63:32 are TDX specific */
+		u64 details_l1		: 8;
+		u64 class		: 8;
+		u64 reserved61_48	: 14;
+		u64 non_recoverable	: 1;
+		u64 error		: 1;
+	};
+	u64 full;
+};
+
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
 	struct tdx_td_page tdvpr;
 	struct tdx_td_page *tdvpx;
 
+	union tdx_exit_reason exit_reason;
+
 	bool initialized;
 };
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 922a3799336e..44404dd25737 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -140,6 +140,7 @@ void tdx_vm_free(struct kvm *kvm);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -160,6 +161,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index da411bcd8cbc..66400810d54f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -300,6 +300,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
 };
 
 u64 __read_mostly host_xcr0;
+EXPORT_SYMBOL_GPL(host_xcr0);
 u64 __read_mostly supported_xcr0;
 EXPORT_SYMBOL_GPL(supported_xcr0);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 065/104] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (63 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 13:56   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 066/104] KVM: TDX: restore host xsave state when exit from the guest TD isaku.yamahata
                   ` (40 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

On entering/exiting a TDX vcpu, the preserved or clobbered CPU state differs
from the VMX case.  Add TDX hooks to save/restore host/guest CPU state.
Save/restore the kernel GS base MSR.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 28 +++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 39 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |  4 ++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2e5a7a72d560..f9d43f2de145 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -89,6 +89,30 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+	 * guest state of a TD is obviously off limits.  Deferring MSRs and DRs
+	 * is pointless because the TDX module needs to load *something* so as
+	 * not to expose guest state.
+	 */
+	if (is_td_vcpu(vcpu)) {
+		tdx_prepare_switch_to_guest(vcpu);
+		return;
+	}
+
+	vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_put(vcpu);
+
+	return vmx_vcpu_put(vcpu);
+}
+
 static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -174,9 +198,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_free = vt_vcpu_free,
 	.vcpu_reset = vt_vcpu_reset,
 
-	.prepare_guest_switch = vmx_prepare_switch_to_guest,
+	.prepare_guest_switch = vt_prepare_switch_to_guest,
 	.vcpu_load = vmx_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
+	.vcpu_put = vt_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
 	.get_msr_feature = vmx_get_msr_feature,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index ebe4f9bf19e7..7a288aae03ba 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/cpu.h>
+#include <linux/mmu_context.h>
 
 #include <asm/tdx.h>
 
@@ -407,6 +408,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_state_protected =
 		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
 
+	tdx->host_state_need_save = true;
+	tdx->host_state_need_restore = false;
+
 	return 0;
 
 free_tdvpx:
@@ -420,6 +424,39 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (!tdx->host_state_need_save)
+		return;
+
+	if (likely(is_64bit_mm(current->mm)))
+		tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+	else
+		tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+	tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	tdx->host_state_need_save = true;
+	if (!tdx->host_state_need_restore)
+		return;
+
+	wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+	tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	vmx_vcpu_pi_put(vcpu);
+	tdx_prepare_switch_to_host(vcpu);
+}
+
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -535,6 +572,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
+	tdx->host_state_need_restore = true;
+
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e950404ce5de..8b1cf9c158e3 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -84,6 +84,10 @@ struct vcpu_tdx {
 	union tdx_exit_reason exit_reason;
 
 	bool initialized;
+
+	bool host_state_need_save;
+	bool host_state_need_restore;
+	u64 msr_host_kernel_gs_base;
 };
 
 static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 44404dd25737..8b871c5f52cf 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -141,6 +141,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -162,6 +164,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 066/104] KVM: TDX: restore host xsave state when exit from the guest TD
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (64 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 065/104] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 067/104] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
                   ` (39 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

On exit from the guest TD, the xsave state is clobbered.  Restore the xsave
state on TD exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7a288aae03ba..54be5be1a06c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
 #include <linux/cpu.h>
 #include <linux/mmu_context.h>
 
+#include <asm/fpu/xcr.h>
 #include <asm/tdx.h>
 
 #include "capabilities.h"
@@ -549,6 +550,22 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
+	    host_xcr0 != (kvm_tdx->xfam & supported_xcr0))
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
+	    host_xss != (kvm_tdx->xfam & (supported_xss | XFEATURE_MASK_PT)))
+		wrmsrl(MSR_IA32_XSS, host_xss);
+	if (static_cpu_has(X86_FEATURE_PKU) &&
+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+		write_pkru(vcpu->arch.host_pkru);
+}
+
 u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
 
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
@@ -572,6 +589,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
+	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 067/104] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (65 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 066/104] KVM: TDX: restore host xsave state when exit from the guest TD isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:02   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 068/104] KVM: TDX: restore user ret MSRs isaku.yamahata
                   ` (38 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Chao Gao <chao.gao@intel.com>

Several MSRs are constant and only used in userspace (ring 3).  But VMs may
have different values.  KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when
the kernel is about to return to userspace.  To eliminate unnecessary wrmsr,
KVM also caches the value it wrote to an MSR last time.

The TDX module unconditionally resets some of these MSRs to architectural
INIT state on TD exit.  This makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware, which may mislead
kvm_on_user_return() into skipping the restoration of some MSRs to the
host's values.  kvm_set_user_return_msr() can correct this case, but it is
not optimal because it always does a wrmsr.  So, introduce a variation of
kvm_set_user_return_msr() that updates the cached values and skips the
wrmsr.
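
The next patch ("KVM: TDX: restore user ret MSRs") uses the new helper on TD
exit roughly as follows:

  /* Resync the cached 'curr' values with the INIT values the TDX module
   * left in hardware, without an extra wrmsr. */
  for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
          kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
                                       tdx_uret_msrs[i].defval);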

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 25 ++++++++++++++++++++-----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8406f8b5ab74..b6396d11139e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1894,6 +1894,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_update_cache(unsigned int index, u64 val);
 
 static inline bool kvm_is_supported_user_return_msr(u32 msr)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 66400810d54f..45e8a02e99bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -427,6 +427,15 @@ static void kvm_user_return_msr_cpu_online(void)
 	}
 }
 
+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+	if (!msrs->registered) {
+		msrs->urn.on_user_return = kvm_on_user_return;
+		user_return_notifier_register(&msrs->urn);
+		msrs->registered = true;
+	}
+}
+
 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
 	unsigned int cpu = smp_processor_id();
@@ -441,15 +450,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 		return 1;
 
 	msrs->values[slot].curr = value;
-	if (!msrs->registered) {
-		msrs->urn.on_user_return = kvm_on_user_return;
-		user_return_notifier_register(&msrs->urn);
-		msrs->registered = true;
-	}
+	kvm_user_return_register_notifier(msrs);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
 
+/* Update the cache, "curr", and register the notifier */
+void kvm_user_return_update_cache(unsigned int slot, u64 value)
+{
+	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+	msrs->values[slot].curr = value;
+	kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
+
 static void drop_user_return_notifiers(void)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 068/104] KVM: TDX: restore user ret MSRs
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (66 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 067/104] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:06   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 069/104] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls isaku.yamahata
                   ` (37 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Several user-return MSRs are clobbered on TD exit.  Restore those values on
TD exit, before returning to ring 3.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 54be5be1a06c..c1366aac7d96 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -550,6 +550,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+struct tdx_uret_msr {
+	u32 msr;
+	unsigned int slot;
+	u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+	{.msr = MSR_SYSCALL_MASK,},
+	{.msr = MSR_STAR,},
+	{.msr = MSR_LSTAR,},
+	{.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_update_cache(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+		kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
+					     tdx_uret_msrs[i].defval);
+}
+
 static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -589,6 +611,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
+	tdx_user_return_update_cache();
 	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
@@ -1371,6 +1394,16 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
 		return -EIO;
 
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+		if (tdx_uret_msrs[i].slot == -1) {
+			/* If any MSR isn't supported, it is a KVM bug */
+			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+				tdx_uret_msrs[i].msr);
+			return -EIO;
+		}
+	}
+
 	max_pkgs = topology_max_packages();
 	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
 				   GFP_KERNEL);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 069/104] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (67 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 068/104] KVM: TDX: restore user ret MSRs isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 070/104] KVM: TDX: complete interrupts after tdexit isaku.yamahata
                   ` (36 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD vcpu
exits, interrupts, and hypercalls.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index e6af9ad4e23f..1aad7ceb0573 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -13,6 +13,7 @@ What qemu can do
 - Qemu can create/destroy vcpu of TDX vm type.
 - Qemu can populate initial guest memory image.
 - Qemu can finalize guest TD.
+- Qemu can start to run vcpus, but the vcpus cannot make progress yet.
 
 Patch Layer status
 ------------------
@@ -23,7 +24,7 @@ Patch Layer status
 * TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Applied
 * TD finalization:                      Applied
-* TD vcpu enter/exit:                   Applying
+* TD vcpu enter/exit:                   Applied
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA stolen bits:              Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 070/104] KVM: TDX: complete interrupts after tdexit
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (68 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 069/104] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:07   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 071/104] KVM: TDX: restore debug store when TD exit isaku.yamahata
                   ` (35 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

This corresponds to VMX __vmx_complete_interrupts().  Because TDX
virtualizes the vAPIC, KVM only needs to care about NMI injection.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
 1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c1366aac7d96..3cb2fbd1c12c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -550,6 +550,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+	/* Avoid costly SEAMCALL if no nmi was injected */
+	if (vcpu->arch.nmi_injected)
+		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+							      TD_VCPU_PEND_NMI);
+}
+
 struct tdx_uret_msr {
 	u32 msr;
 	unsigned int slot;
@@ -618,6 +626,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
+	tdx_complete_interrupts(vcpu);
+
 	if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
 		return EXIT_FASTPATH_NONE;
 	return EXIT_FASTPATH_NONE;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
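
A tiny standalone model of the completion step above, for illustration.
The TD-scope pending-NMI field, read with td_management_read8(tdx,
TD_VCPU_PEND_NMI) in the patch, is modeled here as a plain byte; the names
are stand-ins, not kernel APIs.

#include <stdbool.h>
#include <stdio.h>

/* Model of the TD_VCPU_PEND_NMI field owned by the TDX module. */
static unsigned char td_pend_nmi;

struct vcpu { bool nmi_injected; };

/* If KVM posted an NMI and the TDX module has not delivered it yet,
 * PEND_NMI is still set, so the NMI stays marked as injected. */
static void complete_interrupts(struct vcpu *v)
{
	if (v->nmi_injected)	/* skip the costly read if nothing was injected */
		v->nmi_injected = td_pend_nmi;
}

int main(void)
{
	struct vcpu v = { .nmi_injected = true };

	td_pend_nmi = 0;	/* the TDX module delivered the NMI during TD entry */
	complete_interrupts(&v);
	printf("nmi_injected=%d\n", v.nmi_injected);	/* 0: injection completed */
	return 0;
}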

* [RFC PATCH v5 071/104] KVM: TDX: restore debug store when TD exit
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (69 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 070/104] KVM: TDX: complete interrupts after tdexit isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:20   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 072/104] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
                   ` (34 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because the debug store is clobbered, restore it on TD exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/events/intel/ds.c | 1 +
 arch/x86/kvm/vmx/tdx.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 376cc3d66094..cdba4227ad3b 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2256,3 +2256,4 @@ void perf_restore_debug_store(void)
 
 	wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
 }
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3cb2fbd1c12c..37cf7d43435d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -620,6 +620,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
 	tdx_user_return_update_cache();
+	perf_restore_debug_store();
 	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 072/104] KVM: TDX: handle vcpu migration over logical processor
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (70 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 071/104] KVM: TDX: restore debug store when TD exit isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:14   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD isaku.yamahata
                   ` (33 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

For vcpu migration, in the case of VMX, the VMCS is flushed on the source
pcpu and loaded on the target pcpu.  There are corresponding TDX SEAMCALL
APIs; call them on vcpu migration.  The logic is mostly the same as VMX
except that the TDX SEAMCALLs are used.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 20 +++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 51 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f9d43f2de145..2cd5ba0e8788 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -121,6 +121,14 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
 	return vmx_vcpu_run(vcpu);
 }
 
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_load(vcpu, cpu);
+
+	return vmx_vcpu_load(vcpu, cpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -162,6 +170,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
 }
 
+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_sched_in(vcpu, cpu);
+}
+
 static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -199,7 +215,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_reset = vt_vcpu_reset,
 
 	.prepare_guest_switch = vt_prepare_switch_to_guest,
-	.vcpu_load = vmx_vcpu_load,
+	.vcpu_load = vt_vcpu_load,
 	.vcpu_put = vt_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
@@ -285,7 +301,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.request_immediate_exit = vmx_request_immediate_exit,
 
-	.sched_in = vmx_sched_in,
+	.sched_in = vt_sched_in,
 
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
 	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 37cf7d43435d..a6b1a8ce888d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -85,6 +85,18 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
 	return kvm_tdx->finalized;
 }
 
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
+	 * otherwise, a different CPU can see vcpu->cpu == -1 and add the vCPU
+	 * to its list before it's deleted from this CPU's list.
+	 */
+	smp_wmb();
+
+	vcpu->cpu = -1;
+}
+
 static void tdx_clear_page(unsigned long page)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -155,6 +167,39 @@ static void tdx_reclaim_td_page(struct tdx_td_page *page)
 	free_page(page->va);
 }
 
+static void tdx_flush_vp(void *arg)
+{
+	struct kvm_vcpu *vcpu = arg;
+	u64 err;
+
+	/* Task migration can race with CPU offlining. */
+	if (vcpu->cpu != raw_smp_processor_id())
+		return;
+
+	/*
+	 * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized.  The
+	 * list tracking still needs to be updated so that it's correct if/when
+	 * the vCPU does get initialized.
+	 */
+	if (is_td_vcpu_created(to_tdx(vcpu))) {
+		err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
+		if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+			if (WARN_ON_ONCE(err))
+				pr_tdx_error(TDH_VP_FLUSH, err, NULL);
+		}
+	}
+
+	tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+	if (unlikely(vcpu->cpu == -1))
+		return;
+
+	smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
+}
+
 static int tdx_do_tdh_phymem_cache_wb(void *param)
 {
 	u64 err = 0;
@@ -425,6 +470,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (vcpu->cpu != cpu)
+		tdx_flush_vp_on_cpu(vcpu);
+}
+
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 8b871c5f52cf..ceafd6e18f4e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -143,6 +143,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -166,6 +167,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
 static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
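
A minimal single-threaded sketch of the migration flow described above, for
illustration.  tdh_vp_flush() is a stub for the TDH.VP.FLUSH SEAMCALL, and
the IPI and memory-barrier details of the real code are omitted; the names
used here are stand-ins, not the kernel helpers.

#include <stdio.h>

struct tdx_vcpu { int cpu; int created; };

/* Stub for the TDH.VP.FLUSH SEAMCALL: flush the vcpu's state from the
 * pcpu it is currently associated with.  Always succeeds here. */
static int tdh_vp_flush(struct tdx_vcpu *v) { (void)v; return 0; }

static void flush_vp(struct tdx_vcpu *v, int this_cpu)
{
	/* Task migration can race with CPU offlining in the real code. */
	if (v->cpu != this_cpu)
		return;
	if (v->created)
		tdh_vp_flush(v);
	v->cpu = -1;			/* disassociate from the old pcpu */
}

static void vcpu_load(struct tdx_vcpu *v, int cpu)
{
	if (v->cpu == cpu)
		return;
	if (v->cpu != -1)
		flush_vp(v, v->cpu);	/* the patch does this via an IPI to v->cpu */
	v->cpu = cpu;			/* associate with the new pcpu */
}

int main(void)
{
	struct tdx_vcpu v = { .cpu = -1, .created = 1 };

	vcpu_load(&v, 0);	/* first load: no flush needed */
	vcpu_load(&v, 2);	/* migration: flush on cpu 0, then associate with cpu 2 */
	printf("vcpu now associated with cpu %d\n", v.cpu);
	return 0;
}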

* [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (71 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 072/104] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-23  0:54   ` Erdem Aktas
  2022-03-04 19:49 ` [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
                   ` (32 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

When shutting down the machine, (VMX or TDX) vcpus need to be shut down on
each pcpu.  Do the same for TDX with the TDX SEAMCALL APIs.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 23 +++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 70 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h     |  2 ++
 arch/x86/kvm/vmx/x86_ops.h |  4 +++
 4 files changed, 95 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2cd5ba0e8788..882358ac270b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -13,6 +13,25 @@ static bool vt_is_vm_type_supported(unsigned long type)
 	return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
 }
 
+static int vt_hardware_enable(void)
+{
+	int ret;
+
+	ret = vmx_hardware_enable();
+	if (ret)
+		return ret;
+
+	tdx_hardware_enable();
+	return 0;
+}
+
+static void vt_hardware_disable(void)
+{
+	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+	tdx_hardware_disable();
+	vmx_hardware_disable();
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -199,8 +218,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.hardware_unsetup = vt_hardware_unsetup,
 
-	.hardware_enable = vmx_hardware_enable,
-	.hardware_disable = vmx_hardware_disable,
+	.hardware_enable = vt_hardware_enable,
+	.hardware_disable = vt_hardware_disable,
 	.cpu_has_accelerated_tpr = report_flexpriority,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a6b1a8ce888d..690298fb99c7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
 static DEFINE_MUTEX(tdx_lock);
 static struct mutex *tdx_mng_key_config_lock;
 
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
+ * Protected by interrupt mask.  This list is manipulated in process context
+ * of vcpu and IPI callback.  See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
 static u64 hkid_mask __ro_after_init;
 static u8 hkid_start_pos __ro_after_init;
 
@@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
 
 static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
 {
+	list_del(&to_tdx(vcpu)->cpu_list);
+
 	/*
 	 * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
 	 * otherwise, a different CPU can see vcpu->cpu == -1 and add the vCPU
@@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
 	vcpu->cpu = -1;
 }
 
+void tdx_hardware_enable(void)
+{
+	INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
+}
+
+void tdx_hardware_disable(void)
+{
+	int cpu = raw_smp_processor_id();
+	struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+	struct vcpu_tdx *tdx, *tmp;
+
+	/* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+	list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
+		tdx_disassociate_vp(&tdx->vcpu);
+}
+
 static void tdx_clear_page(unsigned long page)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	cpumask_var_t packages;
 	bool cpumask_allocated;
+	struct kvm_vcpu *vcpu;
 	u64 err;
 	int ret;
 	int i;
+	unsigned long j;
 
 	if (!is_hkid_assigned(kvm_tdx))
 		return;
@@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
 		return;
 	}
 
+	kvm_for_each_vcpu(j, vcpu, kvm)
+		tdx_flush_vp_on_cpu(vcpu);
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+		return;
+	}
+
 	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
 	for_each_online_cpu(i) {
 		if (cpumask_allocated &&
@@ -472,8 +511,22 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
-	if (vcpu->cpu != cpu)
-		tdx_flush_vp_on_cpu(vcpu);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (vcpu->cpu == cpu)
+		return;
+
+	tdx_flush_vp_on_cpu(vcpu);
+
+	local_irq_disable();
+	/*
+	 * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+	 * vcpu->cpu is read before tdx->cpu_list.
+	 */
+	smp_rmb();
+
+	list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+	local_irq_enable();
 }
 
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -522,6 +575,19 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 		tdx_reclaim_td_page(&tdx->tdvpx[i]);
 	kfree(tdx->tdvpx);
 	tdx_reclaim_td_page(&tdx->tdvpr);
+
+	/*
+	 * kvm_free_vcpus()
+	 *   -> kvm_unload_vcpu_mmu()
+	 *
+	 * does vcpu_load() for every vcpu after they have already been
+	 * disassociated from the per-CPU list by tdx_vm_teardown().  So we
+	 * need to disassociate them again; otherwise the freed vcpu data
+	 * will be accessed when list_{del,add}() is done on the
+	 * associated_tdvcpus list later.
+	 */
+	tdx_flush_vp_on_cpu(vcpu);
+	WARN_ON(vcpu->cpu != -1);
 }
 
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 8b1cf9c158e3..180360a65545 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -81,6 +81,8 @@ struct vcpu_tdx {
 	struct tdx_td_page tdvpr;
 	struct tdx_td_page *tdvpx;
 
+	struct list_head cpu_list;
+
 	union tdx_exit_reason exit_reason;
 
 	bool initialized;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ceafd6e18f4e..aae0f4449ec5 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -132,6 +132,8 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
 bool tdx_is_vm_type_supported(unsigned long type);
 void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);
+void tdx_hardware_enable(void);
+void tdx_hardware_disable(void);
 
 int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_prezap(struct kvm *kvm);
@@ -156,6 +158,8 @@ static inline void tdx_pre_kvm_init(
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
 static inline void tdx_hardware_unsetup(void) {}
+static inline void tdx_hardware_enable(void) {}
+static inline void tdx_hardware_disable(void) {}
 
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_prezap(struct kvm *kvm) {}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
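
A small standalone model of the per-CPU associated_tdvcpus tracking above,
for illustration.  The real code uses struct list_head plus smp_wmb() and
runs under an interrupt mask; this sketch is single-threaded and uses a
plain singly linked list, and all names are stand-ins.

#include <stddef.h>
#include <stdio.h>

#define NR_CPUS 2

struct tdx_vcpu {
	int cpu;
	struct tdx_vcpu *next;	/* stands in for the list_head cpu_list */
};

static struct tdx_vcpu *associated_tdvcpus[NR_CPUS];

static void hardware_enable(int cpu)
{
	associated_tdvcpus[cpu] = NULL;	/* INIT_LIST_HEAD() in the patch */
}

static void associate(struct tdx_vcpu *v, int cpu)
{
	v->cpu = cpu;
	v->next = associated_tdvcpus[cpu];
	associated_tdvcpus[cpu] = v;
}

static void hardware_disable(int cpu)
{
	/* Disassociate every vcpu still tracked on this pcpu; the patch
	 * walks the list with list_for_each_entry_safe() because each
	 * entry is deleted as it is visited. */
	struct tdx_vcpu *v = associated_tdvcpus[cpu];

	while (v) {
		struct tdx_vcpu *next = v->next;

		v->cpu = -1;
		v->next = NULL;
		v = next;
	}
	associated_tdvcpus[cpu] = NULL;
}

int main(void)
{
	struct tdx_vcpu a, b;

	hardware_enable(0);
	associate(&a, 0);
	associate(&b, 0);
	hardware_disable(0);
	printf("a.cpu=%d b.cpu=%d\n", a.cpu, b.cpu);	/* both -1 */
	return 0;
}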

* [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (72 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on descroing the guest TD isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:32   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
                   ` (31 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a flag, KVM_DEBUGREG_AUTO_SWITCH, to skip saving/restoring DRs
irrespective of any other flags.  TDX-SEAM unconditionally saves and
restores guest DRs and resets them to the architectural INIT state on TD
exit.  So KVM needs to save host DRs before TD entry without restoring
guest DRs, and restore host DRs after TD exit.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 10 ++++++++--
 arch/x86/kvm/vmx/tdx.c          |  1 +
 arch/x86/kvm/x86.c              |  3 ++-
 3 files changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index b6396d11139e..489374a57b66 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -535,8 +535,14 @@ struct kvm_pmu {
 struct kvm_pmu_ops;
 
 enum {
-	KVM_DEBUGREG_BP_ENABLED = 1,
-	KVM_DEBUGREG_WONT_EXIT = 2,
+	KVM_DEBUGREG_BP_ENABLED		= BIT(0),
+	KVM_DEBUGREG_WONT_EXIT		= BIT(1),
+	KVM_DEBUGREG_RELOAD		= BIT(2),
+	/*
+	 * Guest debug registers are saved/restored by hardware on exit from
+	 * or entry to the guest.  KVM needn't switch them.
+	 */
+	KVM_DEBUGREG_AUTO_SWITCH	= BIT(3),
 };
 
 struct kvm_mtrr_range {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 690298fb99c7..3a0e826fbe0c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -485,6 +485,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
 
+	vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 45e8a02e99bf..89d04cd64cd0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10084,7 +10084,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_fpu.xfd_err)
 		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
 
-	if (unlikely(vcpu->arch.switch_db_regs)) {
+	if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
 		set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -10126,6 +10126,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 */
 	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
 		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+		WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
 		static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
 		kvm_update_dr0123(vcpu);
 		kvm_update_dr7(vcpu);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
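
A short sketch of the flag handling above, for illustration only.  It
reuses the flag names added by the patch but is a standalone model, not the
kernel header; need_manual_dr_switch() is a hypothetical name for the check
done in vcpu_enter_guest().

#include <stdio.h>

#define BIT(n)	(1u << (n))

enum {
	KVM_DEBUGREG_BP_ENABLED		= BIT(0),
	KVM_DEBUGREG_WONT_EXIT		= BIT(1),
	KVM_DEBUGREG_RELOAD		= BIT(2),
	KVM_DEBUGREG_AUTO_SWITCH	= BIT(3),	/* hardware switches DRs itself */
};

/* Model of the vcpu_enter_guest() check: only load guest DRs when some
 * flag other than AUTO_SWITCH is set. */
static int need_manual_dr_switch(unsigned int switch_db_regs)
{
	return (switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH) != 0;
}

int main(void)
{
	printf("TDX vcpu: %d\n", need_manual_dr_switch(KVM_DEBUGREG_AUTO_SWITCH));	/* 0 */
	printf("VMX + bp: %d\n", need_manual_dr_switch(KVM_DEBUGREG_BP_ENABLED));	/* 1 */
	return 0;
}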

* [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (73 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-08 16:24   ` Sean Christopherson
  2022-03-04 19:49 ` [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
                   ` (30 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
interrupt to support TDX's usage of APICv.  Unlike VMX, TDX doesn't have
access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
i.e. needs to generate a posted interrupt and more importantly can't
manually move requested interrupts into the vIRR (which it also doesn't
have access to).

Because pi_has_pending_interrupt() is a heavy operation which uses two
atomic test-bit operations and one atomic 256-bit bitmap check, introduce a
new callback for this check instead of reusing the
dy_apicv_has_pending_interrupt() callback, to avoid affecting the existing
code.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/vmx/main.c         | 9 +++++++++
 arch/x86/kvm/x86.c              | 5 ++++-
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 489374a57b66..8dab9f16f559 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1491,6 +1491,7 @@ struct kvm_x86_ops {
 	void (*start_assignment)(struct kvm *kvm);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
 	bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+	bool (*apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
 
 	int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 			    bool *expired);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 882358ac270b..d75caf0d6861 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -148,6 +148,14 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	return vmx_vcpu_load(vcpu, cpu);
 }
 
+static bool vt_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return pi_has_pending_interrupt(vcpu);
+
+	return false;
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -297,6 +305,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.sync_pir_to_irr = vmx_sync_pir_to_irr,
 	.deliver_interrupt = vmx_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+	.apicv_has_pending_interrupt = vt_apicv_has_pending_interrupt,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 89d04cd64cd0..314ae43e07bf 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12111,7 +12111,10 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
 
 	if (kvm_arch_interrupt_allowed(vcpu) &&
 	    (kvm_cpu_has_interrupt(vcpu) ||
-	    kvm_guest_apic_has_interrupt(vcpu)))
+	     kvm_guest_apic_has_interrupt(vcpu) ||
+	     (vcpu->arch.apicv_active &&
+	      kvm_x86_ops.apicv_has_pending_interrupt &&
+	      kvm_x86_ops.apicv_has_pending_interrupt(vcpu))))
 		return true;
 
 	if (kvm_hv_has_stimer_pending(vcpu))
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
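
A simplified, non-atomic sketch of why the check is heavy: the
posted-interrupt descriptor carries a 256-bit PIR plus control bits, and
the pending-interrupt test has to look at both.  The struct layout and
helper names below are illustrative stand-ins, not the real struct pi_desc
or pi_has_pending_interrupt().

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Simplified posted-interrupt descriptor: 256-bit PIR plus an ON bit.
 * The real descriptor also has SN and the notification vector. */
struct pi_desc {
	uint64_t pir[4];	/* one bit per vector, 0..255 */
	bool on;		/* outstanding notification */
};

static bool pir_has_bit_set(const struct pi_desc *pi)
{
	for (int i = 0; i < 4; i++)
		if (pi->pir[i])
			return true;
	return false;
}

/* Model of the pending check: an outstanding notification or any bit set
 * in the PIR means the vCPU has an event to handle. */
static bool has_pending_interrupt(const struct pi_desc *pi)
{
	return pi->on || pir_has_bit_set(pi);
}

int main(void)
{
	struct pi_desc pi = { .pir = { 0 }, .on = false };

	pi.pir[0x30 / 64] |= 1ull << (0x30 % 64);	/* post vector 0x30 */
	printf("pending=%d\n", has_pending_interrupt(&pi));
	return 0;
}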

* [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (74 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:33   ` Paolo Bonzini
  2022-04-08 16:36   ` Sean Christopherson
  2022-03-04 19:49 ` [RFC PATCH v5 077/104] KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c isaku.yamahata
                   ` (29 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add an option to skip the IRR check in kvm_wait_lapic_expire().  This
will be used by TDX to wait if there is an outstanding notification for
a TD, i.e. a virtual interrupt is being triggered via posted interrupt
processing.  KVM TDX doesn't emulate PI processing, i.e. there will
never be a bit set in IRR/ISR, so the default behavior for APICv of
querying the IRR doesn't work as intended.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/lapic.c   | 4 ++--
 arch/x86/kvm/lapic.h   | 2 +-
 arch/x86/kvm/svm/svm.c | 2 +-
 arch/x86/kvm/vmx/vmx.c | 2 +-
 4 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9322e6340a74..d49f029ef0e3 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1620,12 +1620,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
 		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
 }
 
-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
 {
 	if (lapic_in_kernel(vcpu) &&
 	    vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
 	    vcpu->arch.apic->lapic_timer.timer_advance_ns &&
-	    lapic_timer_int_injected(vcpu))
+	    (force_wait || lapic_timer_int_injected(vcpu)))
 		__kvm_wait_lapic_expire(vcpu);
 }
 EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2b44e533fc8d..2a0119ef9e96 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
 
 bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
 
-void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
+void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);
 
 void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
 			      unsigned long *vcpu_bitmap);
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index c7eec23e9ebe..a46415845f48 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -3766,7 +3766,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
 	clgi();
 	kvm_load_guest_xsave_state(vcpu);
 
-	kvm_wait_lapic_expire(vcpu);
+	kvm_wait_lapic_expire(vcpu, false);
 
 	/*
 	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 00f88aa25047..9b7bd52d19a9 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6838,7 +6838,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	if (enable_preemption_timer)
 		vmx_update_hv_timer(vcpu);
 
-	kvm_wait_lapic_expire(vcpu);
+	kvm_wait_lapic_expire(vcpu, false);
 
 	/*
 	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
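
A tiny sketch of the new condition, for illustration.  Previously the wait
was gated on the timer interrupt being visible as injected via the IRR; the
force_wait flag lets TDX, for which KVM never sees IRR bits, request the
wait unconditionally.  The names below are stand-ins for the kernel
helpers.

#include <stdbool.h>
#include <stdio.h>

struct lapic_timer {
	unsigned long long expired_tscdeadline;
	unsigned int timer_advance_ns;
};

/* Stand-in for lapic_timer_int_injected(); with TDX the vIRR is not
 * visible to KVM, so this is effectively always false for a TD vcpu. */
static bool timer_int_injected(void) { return false; }

static bool should_wait(const struct lapic_timer *t, bool force_wait)
{
	return t->expired_tscdeadline && t->timer_advance_ns &&
	       (force_wait || timer_int_injected());
}

int main(void)
{
	struct lapic_timer t = { .expired_tscdeadline = 1, .timer_advance_ns = 1000 };

	printf("vmx: %d\n", should_wait(&t, false));	/* 0: IRR check fails */
	printf("tdx: %d\n", should_wait(&t, true));	/* 1: forced wait */
	return 0;
}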

* [RFC PATCH v5 077/104] KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (75 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:36   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 078/104] KVM: TDX: Implement interrupt injection isaku.yamahata
                   ` (28 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Yuan Yao <yuan.yao@intel.com>

The helper function, vcpu_to_pi_desc(), is defined to get the posted
interrupt descriptor from a vcpu.  There is one place that doesn't use it
and instead references vmx_vcpu->pi_desc directly, which is inconsistent.

For TDX, a TDX vcpu structure will be defined and the helper function,
vcpu_to_pi_desc(), will return tdx_vcpu->pi_desc for the TDX case instead
of vmx_vcpu->pi_desc.  The direct reference to vmx_vcpu->pi_desc doesn't
work for TDX.

Replace the direct reference to vmx_vcpu->pi_desc with the helper function,
vcpu_to_pi_desc(), for consistency and for TDX.

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 2 +-
 arch/x86/kvm/vmx/x86_ops.h     | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index aa1fe9085d77..c8a81c916eed 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -311,7 +311,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
 			continue;
 		}
 
-		vcpu_info.pi_desc_addr = __pa(&to_vmx(vcpu)->pi_desc);
+		vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
 		vcpu_info.vector = irq.vector;
 
 		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, e->gsi,
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index aae0f4449ec5..0f1a28f67e60 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -147,6 +147,9 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 
+void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector);
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 078/104] KVM: TDX: Implement interrupt injection
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (76 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 077/104] KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-06 11:47   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 079/104] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
                   ` (27 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX supports interrupt injection into a vcpu via posted interrupts.  Wire
up the corresponding kvm x86 operations to posted interrupts.  Move
kvm_vcpu_trigger_posted_interrupt() from vmx.c to common.h to share the
code.

VMX can inject an interrupt by setting the interrupt information field,
VM_ENTRY_INTR_INFO_FIELD, of the VMCS.  TDX supports interrupt injection
only via posted interrupts.  Ignore the execution path that accesses
VM_ENTRY_INTR_INFO_FIELD.

As the CPU state is protected and APICv is enabled for the TDX guest, the
VMM can inject an interrupt by updating the posted-interrupt descriptor.
Treat interrupts as always injectable.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h      | 70 +++++++++++++++++++++++++
 arch/x86/kvm/vmx/main.c        | 93 ++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/posted_intr.c |  6 +++
 arch/x86/kvm/vmx/tdx.c         | 33 ++++++++++++
 arch/x86/kvm/vmx/tdx.h         |  3 ++
 arch/x86/kvm/vmx/vmx.c         | 57 +--------------------
 arch/x86/kvm/vmx/x86_ops.h     |  7 ++-
 7 files changed, 203 insertions(+), 66 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 1052b3c93eb8..79a4517e43d1 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,6 +4,7 @@
 
 #include <linux/kvm_host.h>
 
+#include "posted_intr.h"
 #include "mmu.h"
 
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -32,4 +33,73 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+						     int pi_vec)
+{
+#ifdef CONFIG_SMP
+	if (vcpu->mode == IN_GUEST_MODE) {
+		/*
+		 * The vector of interrupt to be delivered to vcpu had
+		 * been set in PIR before this function.
+		 *
+		 * Following cases will be reached in this block, and
+		 * we always send a notification event in all cases as
+		 * explained below.
+		 *
+		 * Case 1: vcpu keeps in non-root mode. Sending a
+		 * notification event posts the interrupt to vcpu.
+		 *
+		 * Case 2: vcpu exits to root mode and is still
+		 * runnable. PIR will be synced to vIRR before the
+		 * next vcpu entry. Sending a notification event in
+		 * this case has no effect, as vcpu is not in root
+		 * mode.
+		 *
+		 * Case 3: vcpu exits to root mode and is blocked.
+		 * vcpu_block() has already synced PIR to vIRR and
+		 * never blocks vcpu if vIRR is not cleared. Therefore,
+		 * a blocked vcpu here does not wait for any requested
+		 * interrupts in PIR, and sending a notification event
+		 * which has no effect is safe here.
+		 */
+
+		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
+		return;
+	}
+#endif
+	/*
+	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+	 * otherwise do nothing as KVM will grab the highest priority pending
+	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+	 */
+	kvm_vcpu_wake_up(vcpu);
+}
+
+/*
+ * Send interrupt to vcpu via posted interrupt way.
+ * 1. If target vcpu is running(non-root mode), send posted interrupt
+ * notification to vcpu and hardware will sync PIR to vIRR atomically.
+ * 2. If target vcpu isn't running(root mode), kick it to pick up the
+ * interrupt from PIR in next vmentry.
+ */
+static inline void __vmx_deliver_posted_interrupt(
+	struct kvm_vcpu *vcpu, struct pi_desc *pi_desc, int vector)
+{
+	if (pi_test_and_set_pir(vector, pi_desc))
+		return;
+
+	/* If a previous notification has sent the IPI, nothing to do.  */
+	if (pi_test_and_set_on(pi_desc))
+		return;
+
+	/*
+	 * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+	 * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+	 */
+	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+}
+
+
 #endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d75caf0d6861..a0bcc4dca678 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -148,6 +148,34 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	return vmx_vcpu_load(vcpu, cpu);
 }
 
+static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_apicv_post_state_restore(vcpu);
+
+	return vmx_apicv_post_state_restore(vcpu);
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return -1;
+
+	return vmx_sync_pir_to_irr(vcpu);
+}
+
+static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector)
+{
+	if (is_td_vcpu(apic->vcpu)) {
+		tdx_deliver_interrupt(apic, delivery_mode, trig_mode,
+					     vector);
+		return;
+	}
+
+	vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
+}
+
 static bool vt_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -205,6 +233,53 @@ static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
 	vmx_sched_in(vcpu, cpu);
 }
 
+static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+	vmx_set_interrupt_shadow(vcpu, mask);
+}
+
+static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_get_interrupt_shadow(vcpu);
+}
+
+static void vt_inject_irq(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_inject_irq(vcpu);
+}
+
+static void vt_cancel_injection(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_cancel_injection(vcpu);
+}
+
+static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_interrupt_allowed(vcpu, for_injection);
+}
+
+static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_enable_irq_window(vcpu);
+}
+
 static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -279,31 +354,31 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.handle_exit = vmx_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
-	.set_interrupt_shadow = vmx_set_interrupt_shadow,
-	.get_interrupt_shadow = vmx_get_interrupt_shadow,
+	.set_interrupt_shadow = vt_set_interrupt_shadow,
+	.get_interrupt_shadow = vt_get_interrupt_shadow,
 	.patch_hypercall = vmx_patch_hypercall,
-	.set_irq = vmx_inject_irq,
+	.set_irq = vt_inject_irq,
 	.set_nmi = vmx_inject_nmi,
 	.queue_exception = vmx_queue_exception,
-	.cancel_injection = vmx_cancel_injection,
-	.interrupt_allowed = vmx_interrupt_allowed,
+	.cancel_injection = vt_cancel_injection,
+	.interrupt_allowed = vt_interrupt_allowed,
 	.nmi_allowed = vmx_nmi_allowed,
 	.get_nmi_mask = vmx_get_nmi_mask,
 	.set_nmi_mask = vmx_set_nmi_mask,
 	.enable_nmi_window = vmx_enable_nmi_window,
-	.enable_irq_window = vmx_enable_irq_window,
+	.enable_irq_window = vt_enable_irq_window,
 	.update_cr8_intercept = vmx_update_cr8_intercept,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
 	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
 	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
-	.apicv_post_state_restore = vmx_apicv_post_state_restore,
+	.apicv_post_state_restore = vt_apicv_post_state_restore,
 	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
 	.hwapic_irr_update = vmx_hwapic_irr_update,
 	.hwapic_isr_update = vmx_hwapic_isr_update,
 	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
-	.sync_pir_to_irr = vmx_sync_pir_to_irr,
-	.deliver_interrupt = vmx_deliver_interrupt,
+	.sync_pir_to_irr = vt_sync_pir_to_irr,
+	.deliver_interrupt = vt_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
 	.apicv_has_pending_interrupt = vt_apicv_has_pending_interrupt,
 
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index c8a81c916eed..e22c3015f064 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -7,6 +7,7 @@
 #include "lapic.h"
 #include "irq.h"
 #include "posted_intr.h"
+#include "tdx.h"
 #include "trace.h"
 #include "vmx.h"
 
@@ -31,6 +32,11 @@ static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
 
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
+#ifdef CONFIG_INTEL_TDX_HOST
+	if (is_td_vcpu(vcpu))
+		return &(to_tdx(vcpu)->pi_desc);
+#endif
+
 	return &(to_vmx(vcpu)->pi_desc);
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3a0e826fbe0c..bdc658ca9e4f 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,6 +7,7 @@
 
 #include "capabilities.h"
 #include "x86_ops.h"
+#include "common.h"
 #include "mmu.h"
 #include "tdx.h"
 #include "vmx.h"
@@ -494,6 +495,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_state_protected =
 		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
 
+	tdx->pi_desc.nv = POSTED_INTR_VECTOR;
+	tdx->pi_desc.sn = 1;
+
 	tdx->host_state_need_save = true;
 	tdx->host_state_need_restore = false;
 
@@ -514,6 +518,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 
+	vmx_vcpu_pi_load(vcpu, cpu);
 	if (vcpu->cpu == cpu)
 		return;
 
@@ -735,6 +740,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	trace_kvm_entry(vcpu);
 
+	if (pi_test_on(&tdx->pi_desc)) {
+		apic->send_IPI_self(POSTED_INTR_VECTOR);
+
+		kvm_wait_lapic_expire(vcpu, true);
+	}
+
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
 	tdx_user_return_update_cache();
@@ -1008,6 +1019,24 @@ static void tdx_handle_changed_private_spte(
 	}
 }
 
+void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	pi_clear_on(&tdx->pi_desc);
+	memset(tdx->pi_desc.pir, 0, sizeof(tdx->pi_desc.pir));
+}
+
+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	/* TDX supports only posted interrupt.  No lapic emulation. */
+	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
@@ -1425,6 +1454,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 		return -EIO;
 	}
 
+	td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
+	td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
+	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
+
 	tdx->initialized = true;
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 180360a65545..7cd81780f3fa 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -83,6 +83,9 @@ struct vcpu_tdx {
 
 	struct list_head cpu_list;
 
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
 	union tdx_exit_reason exit_reason;
 
 	bool initialized;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 9b7bd52d19a9..4bd1e61b8d45 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3931,48 +3931,6 @@ void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 	pt_update_intercept_for_msr(vcpu);
 }
 
-static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
-						     int pi_vec)
-{
-#ifdef CONFIG_SMP
-	if (vcpu->mode == IN_GUEST_MODE) {
-		/*
-		 * The vector of interrupt to be delivered to vcpu had
-		 * been set in PIR before this function.
-		 *
-		 * Following cases will be reached in this block, and
-		 * we always send a notification event in all cases as
-		 * explained below.
-		 *
-		 * Case 1: vcpu keeps in non-root mode. Sending a
-		 * notification event posts the interrupt to vcpu.
-		 *
-		 * Case 2: vcpu exits to root mode and is still
-		 * runnable. PIR will be synced to vIRR before the
-		 * next vcpu entry. Sending a notification event in
-		 * this case has no effect, as vcpu is not in root
-		 * mode.
-		 *
-		 * Case 3: vcpu exits to root mode and is blocked.
-		 * vcpu_block() has already synced PIR to vIRR and
-		 * never blocks vcpu if vIRR is not cleared. Therefore,
-		 * a blocked vcpu here does not wait for any requested
-		 * interrupts in PIR, and sending a notification event
-		 * which has no effect is safe here.
-		 */
-
-		apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
-		return;
-	}
-#endif
-	/*
-	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
-	 * otherwise do nothing as KVM will grab the highest priority pending
-	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
-	 */
-	kvm_vcpu_wake_up(vcpu);
-}
-
 static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 						int vector)
 {
@@ -4024,20 +3982,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	if (!vcpu->arch.apicv_active)
 		return -1;
 
-	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
-		return 0;
-
-	/* If a previous notification has sent the IPI, nothing to do.  */
-	if (pi_test_and_set_on(&vmx->pi_desc))
-		return 0;
-
-	/*
-	 * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
-	 * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
-	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
-	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
-	 */
-	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+	__vmx_deliver_posted_interrupt(vcpu, &vmx->pi_desc, vector);
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 0f1a28f67e60..c3768a20347f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -148,7 +148,8 @@ void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 
 void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
-int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector);
+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -176,6 +177,10 @@ static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 
+static inline void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
+static inline void tdx_deliver_interrupt(
+	struct kvm_lapic *apic, int delivery_mode, int trig_mode, int vector) {}
+
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
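
A standalone sketch of the posted-interrupt delivery flow shared above:
set the vector bit in the PIR, set ON, and only when ON was previously
clear either send the notification IPI (vcpu in guest mode) or wake the
vcpu.  C11 atomics stand in for the kernel's pi_test_and_set_pir() and
pi_test_and_set_on(); the IPI and wake-up are stubs, and the whole thing is
illustrative, not the kernel implementation.

#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

struct pi_desc {
	_Atomic unsigned long pir[4];	/* 256-bit posted interrupt requests */
	atomic_bool on;			/* outstanding notification */
};

enum vcpu_mode { OUTSIDE_GUEST_MODE, IN_GUEST_MODE };

/* Stubs for the two notification paths. */
static void send_pi_ipi(void)  { printf("send posted-interrupt IPI\n"); }
static void wake_up_vcpu(void) { printf("wake up blocked vcpu\n"); }

static bool test_and_set_bit(_Atomic unsigned long *word, int bit)
{
	unsigned long mask = 1ul << bit;
	return atomic_fetch_or(word, mask) & mask;
}

static void deliver_posted_interrupt(struct pi_desc *pi, enum vcpu_mode mode,
				     int vector)
{
	/* Vector already pending: nothing to do. */
	if (test_and_set_bit(&pi->pir[vector / 64], vector % 64))
		return;
	/* A previous notification already sent the IPI. */
	if (atomic_exchange(&pi->on, true))
		return;
	if (mode == IN_GUEST_MODE)
		send_pi_ipi();		/* hardware syncs PIR to vIRR */
	else
		wake_up_vcpu();		/* picked up on the next entry */
}

int main(void)
{
	struct pi_desc pi = { 0 };

	deliver_posted_interrupt(&pi, IN_GUEST_MODE, 0x30);
	deliver_posted_interrupt(&pi, IN_GUEST_MODE, 0x30);	/* no second IPI */
	return 0;
}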

* [RFC PATCH v5 079/104] KVM: TDX: Implements vcpu request_immediate_exit
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (77 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 078/104] KVM: TDX: Implement interrupt injection isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-06 12:49   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 080/104] KVM: TDX: Implement methods to inject NMI isaku.yamahata
                   ` (26 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Now that we are able to inject interrupts into a TDX vcpu, it is ready for
the TDX vcpu to block.  Wire up the kvm x86 methods for blocking/unblocking
a vcpu for TDX.  To unblock on pending events, the request-immediate-exit
method is also needed.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a0bcc4dca678..404a260796e4 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -280,6 +280,14 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
 	vmx_enable_irq_window(vcpu);
 }
 
+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return __kvm_request_immediate_exit(vcpu);
+
+	vmx_request_immediate_exit(vcpu);
+}
+
 static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -402,7 +410,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
 
-	.request_immediate_exit = vmx_request_immediate_exit,
+	.request_immediate_exit = vt_request_immediate_exit,
 
 	.sched_in = vt_sched_in,
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 080/104] KVM: TDX: Implement methods to inject NMI
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (78 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 079/104] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-06 12:47   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 081/104] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
                   ` (25 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX vcpu control structure defines one bit for a pending NMI, so the
VMM can inject an NMI by setting the bit without knowing the TDX vcpu's NMI
state.  Because the vcpu state is protected, the VMM can't know the NMI
state of a TDX vcpu.  The TDX module handles the actual injection and the
NMI state transitions.

Add methods for NMI and treat NMIs as always injectable.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 62 +++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/tdx.c     |  5 +++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 404a260796e4..aa84c13f8ee1 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -216,6 +216,58 @@ static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	vmx_flush_tlb_guest(vcpu);
 }
 
+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_inject_nmi(vcpu);
+
+	vmx_inject_nmi(vcpu);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	/*
+	 * The TDX module manages NMI windows and NMI reinjection, and hides NMI
+	 * blocking; all KVM can do is throw an NMI over the wall.
+	 */
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Assume NMIs are always unmasked.  KVM could query PEND_NMI and treat
+	 * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
+	 * expensive and the end result is unchanged as the only relevant usage
+	 * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
+	 * only changes whether KVM or the TDX module drops an NMI.
+	 */
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+	/* Refer the comment in vt_get_nmi_mask(). */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_enable_nmi_window(vcpu);
+}
+
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			int pgd_level)
 {
@@ -366,14 +418,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.get_interrupt_shadow = vt_get_interrupt_shadow,
 	.patch_hypercall = vmx_patch_hypercall,
 	.set_irq = vt_inject_irq,
-	.set_nmi = vmx_inject_nmi,
+	.set_nmi = vt_inject_nmi,
 	.queue_exception = vmx_queue_exception,
 	.cancel_injection = vt_cancel_injection,
 	.interrupt_allowed = vt_interrupt_allowed,
-	.nmi_allowed = vmx_nmi_allowed,
-	.get_nmi_mask = vmx_get_nmi_mask,
-	.set_nmi_mask = vmx_set_nmi_mask,
-	.enable_nmi_window = vmx_enable_nmi_window,
+	.nmi_allowed = vt_nmi_allowed,
+	.get_nmi_mask = vt_get_nmi_mask,
+	.set_nmi_mask = vt_set_nmi_mask,
+	.enable_nmi_window = vt_enable_nmi_window,
 	.enable_irq_window = vt_enable_irq_window,
 	.update_cr8_intercept = vmx_update_cr8_intercept,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index bdc658ca9e4f..273898de9f7a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -763,6 +763,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	return EXIT_FASTPATH_NONE;
 }
 
+void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index c3768a20347f..31be5e8a1d5c 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -150,6 +150,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector);
+void tdx_inject_nmi(struct kvm_vcpu *vcpu);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -180,6 +181,7 @@ static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
 static inline void tdx_deliver_interrupt(
 	struct kvm_lapic *apic, int delivery_mode, int trig_mode, int vector) {}
+static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread
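
A tiny sketch of the injection model described above, for illustration:
KVM only sets the pending-NMI bit and the TDX module does the actual
injection; after TD exit the bit is read back (see the earlier
"complete interrupts after tdexit" patch) to see whether delivery
happened.  The TD_VCPU_PEND_NMI byte, accessed with
td_management_write8()/read8() in the patch series, is modeled as a plain
variable here.

#include <stdio.h>

/* Model of the TD_VCPU_PEND_NMI byte in the TDX vcpu management area. */
static unsigned char pend_nmi;

static void tdx_inject_nmi(void)   { pend_nmi = 1; }	/* throw the NMI over the wall */
static void tdx_module_entry(void) { pend_nmi = 0; }	/* module delivers it during TD entry */

int main(void)
{
	tdx_inject_nmi();
	tdx_module_entry();
	printf("still pending: %d\n", pend_nmi);	/* 0: delivered */
	return 0;
}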

* [RFC PATCH v5 081/104] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (79 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 080/104] KVM: TDX: Implement methods to inject NMI isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-06 12:49   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 082/104] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
                   ` (24 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX uses a different ABI to get information about VM exits.  Pass
intr_info to the NMI and INTR handlers instead of pulling it from
vcpu_vmx, in preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to the VMM, RAX holds the status and exit reason,
RCX holds the exit qualification, etc., rather than the VMCS fields,
because the VMM doesn't have access to the VMCS.  The eventual code will
be:

VMX:
  - get exit reason, intr_info, exit_qualification, and etc from VMCS
  - call NMI/INTR handlers (common code)

TDX:
  - get exit reason, intr_info, exit_qualification, and etc from guest
    registers
  - call NMI/INTR handlers (common code)
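
As an illustration only (not part of this patch), the two call sites end
up along these lines; the TDX-side helper tdexit_intr_info(), which reads
the value from a guest GPR saved at TD exit, is introduced later in this
series:

	/* VMX: exit information is read from the VMCS. */
	handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));

	/* TDX (later patch): it comes from guest GPRs saved at TD exit. */
	handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));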

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 4bd1e61b8d45..008400927144 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6442,28 +6442,27 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
 }
 
-static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
 {
 	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
 
 	/* if exit due to PF check for async PF */
 	if (is_page_fault(intr_info))
-		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
 	/* if exit due to NM, handle before interrupts are enabled */
 	else if (is_nm_fault(intr_info))
-		handle_nm_fault_irqoff(&vmx->vcpu);
+		handle_nm_fault_irqoff(vcpu);
 	/* Handle machine checks before interrupts are enabled */
 	else if (is_machine_check(intr_info))
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
 	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
+		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
 }
 
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+					     u32 intr_info)
 {
-	u32 intr_info = vmx_get_intr_info(vcpu);
 	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
 	gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
@@ -6482,9 +6481,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
-		handle_external_interrupt_irqoff(vcpu);
+		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
 	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
-		handle_exception_nmi_irqoff(vmx);
+		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
 /*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 082/104] KVM: VMX: Move NMI/exception handler to common helper
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (80 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 081/104] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
                   ` (23 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX handles NMI/exception exits mostly the same as the VMX case.  The
difference is how the exit qualification is retrieved.  To share the code
with TDX, move the NMI/exception handlers to a common header.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 50 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 60 ++++++---------------------------------
 2 files changed, 59 insertions(+), 51 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 79a4517e43d1..1e24ede1c581 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,8 +4,58 @@
 
 #include <linux/kvm_host.h>
 
+#include <asm/traps.h>
+
 #include "posted_intr.h"
 #include "mmu.h"
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu);
+void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
+
+static inline void vmx_handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
+						unsigned long entry)
+{
+	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
+
+	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
+	vmx_do_interrupt_nmi_irqoff(entry);
+	kvm_after_interrupt(vcpu);
+}
+
+static inline void vmx_handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu,
+						u32 intr_info)
+{
+	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
+
+	/* if exit due to PF check for async PF */
+	if (is_page_fault(intr_info))
+		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+	/* if exit due to NM, handle before interrupts are enabled */
+	else if (is_nm_fault(intr_info))
+		vmx_handle_nm_fault_irqoff(vcpu);
+	/* Handle machine checks before interrupts are enabled */
+	else if (is_machine_check(intr_info))
+		kvm_machine_check();
+	/* We need to handle NMIs before interrupts are enabled */
+	else if (is_nmi(intr_info))
+		vmx_handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+							u32 intr_info)
+{
+	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+	gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
+		return;
+
+	vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+}
 
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 					     unsigned long exit_qualification)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 008400927144..b235541e4b27 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -457,7 +457,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
 	vmx->segment_cache.bitmask = 0;
 }
 
-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 static bool __read_mostly enlightened_vmcs = true;
@@ -4046,7 +4046,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
 
-	vmcs_writel(HOST_IDTR_BASE, host_idt_base);   /* 22.2.4 */
+	vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base);   /* 22.2.4 */
 
 	vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */
 
@@ -4789,10 +4789,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 	intr_info = vmx_get_intr_info(vcpu);
 
 	if (is_machine_check(intr_info) || is_nmi(intr_info))
-		return 1; /* handled by handle_exception_nmi_irqoff() */
+		return 1; /* handled by vmx_handle_exception_nmi_irqoff() */
 
 	/*
-	 * Queue the exception here instead of in handle_nm_fault_irqoff().
+	 * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
 	 * This ensures the nested_vmx check is not skipped so vmexit can
 	 * be reflected to L1 (when it intercepts #NM) before reaching this
 	 * point.
@@ -6410,19 +6410,7 @@ void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
 }
 
-void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
-					unsigned long entry)
-{
-	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
-
-	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
-	vmx_do_interrupt_nmi_irqoff(entry);
-	kvm_after_interrupt(vcpu);
-}
-
-static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * Save xfd_err to guest_fpu before interrupt is enabled, so the
@@ -6442,37 +6430,6 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
 }
 
-static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
-	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-
-	/* if exit due to PF check for async PF */
-	if (is_page_fault(intr_info))
-		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
-	/* if exit due to NM, handle before interrupts are enabled */
-	else if (is_nm_fault(intr_info))
-		handle_nm_fault_irqoff(vcpu);
-	/* Handle machine checks before interrupts are enabled */
-	else if (is_machine_check(intr_info))
-		kvm_machine_check();
-	/* We need to handle NMIs before interrupts are enabled */
-	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
-					     u32 intr_info)
-{
-	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-	gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
-	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
-	    "KVM: unexpected VM-Exit interrupt info: 0x%x", intr_info))
-		return;
-
-	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
-}
-
 void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6481,9 +6438,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
-		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+		vmx_handle_external_interrupt_irqoff(vcpu,
+						     vmx_get_intr_info(vcpu));
 	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
-		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
+		vmx_handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
 /*
@@ -7679,7 +7637,7 @@ __init int vmx_hardware_setup(void)
 	int r;
 
 	store_idt(&dt);
-	host_idt_base = dt.address;
+	vmx_host_idt_base = dt.address;
 
 	vmx_setup_user_return_msrs();
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (81 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 082/104] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-21 18:32   ` Sagi Shahar
  2022-03-04 19:49 ` [RFC PATCH v5 084/104] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
                   ` (22 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
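
As a rough sketch of the intended reuse (the caller below is illustrative
only; the actual TDX wiring appears later in this series), a non-VMX path
gathers the arguments from its own register ABI and hands them to the
common helper:

	/* Illustrative TDX-style caller: args come from R10-R14, not RAX/RBX/... */
	nr = kvm_r10_read(vcpu);
	a0 = kvm_r11_read(vcpu);
	a1 = kvm_r12_read(vcpu);
	a2 = kvm_r13_read(vcpu);
	a3 = kvm_r14_read(vcpu);
	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true);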

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/x86.c              | 54 ++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8dab9f16f559..33b75b0e3de1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1818,6 +1818,10 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate,
 void __kvm_request_apicv_update(struct kvm *kvm, bool activate,
 				unsigned long bit);
 
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+				      unsigned long a0, unsigned long a1,
+				      unsigned long a2, unsigned long a3,
+				      int op_64_bit);
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 314ae43e07bf..9acb33a17445 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9090,26 +9090,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+				      unsigned long a0, unsigned long a1,
+				      unsigned long a2, unsigned long a3,
+				      int op_64_bit)
 {
-	unsigned long nr, a0, a1, a2, a3, ret;
-	int op_64_bit;
-
-	if (kvm_xen_hypercall_enabled(vcpu->kvm))
-		return kvm_xen_hypercall(vcpu);
-
-	if (kvm_hv_hypercall_enabled(vcpu))
-		return kvm_hv_hypercall(vcpu);
-
-	nr = kvm_rax_read(vcpu);
-	a0 = kvm_rbx_read(vcpu);
-	a1 = kvm_rcx_read(vcpu);
-	a2 = kvm_rdx_read(vcpu);
-	a3 = kvm_rsi_read(vcpu);
+	unsigned long ret;
 
 	trace_kvm_hypercall(nr, a0, a1, a2, a3);
 
-	op_64_bit = is_64_bit_hypercall(vcpu);
 	if (!op_64_bit) {
 		nr &= 0xFFFFFFFF;
 		a0 &= 0xFFFFFFFF;
@@ -9118,11 +9107,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		a3 &= 0xFFFFFFFF;
 	}
 
-	if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
-		ret = -KVM_EPERM;
-		goto out;
-	}
-
 	ret = -KVM_ENOSYS;
 
 	switch (nr) {
@@ -9181,6 +9165,34 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		ret = -KVM_ENOSYS;
 		break;
 	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3, ret;
+	int op_64_bit;
+
+	if (kvm_xen_hypercall_enabled(vcpu->kvm))
+		return kvm_xen_hypercall(vcpu);
+
+	if (kvm_hv_hypercall_enabled(vcpu))
+		return kvm_hv_hypercall(vcpu);
+
+	nr = kvm_rax_read(vcpu);
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+	op_64_bit = is_64_bit_mode(vcpu);
+
+	if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+		ret = -KVM_EPERM;
+		goto out;
+	}
+
+	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
 out:
 	if (!op_64_bit)
 		ret = (u32)ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 084/104] KVM: TDX: Add a place holder to handle TDX VM exit
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (82 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:20   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 085/104] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
                   ` (21 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up the handle_exit and handle_exit_irqoff methods and add a
placeholder to handle TDX VM exits.  Add helper functions to get the exit
info, exit qualification, etc.
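
For readability, the guest-register convention that the new tdexit_*()
accessors in this patch rely on is summarized here:

	/*
	 * At TD exit, the TDX module passes exit information in guest GPRs:
	 *   RCX = exit qualification
	 *   RDX = extended exit qualification
	 *   R8  = guest physical address (EPT violation)
	 *   R9  = interrupt/exception information
	 */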

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 35 +++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 79 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h | 11 ++++++
 3 files changed, 122 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index aa84c13f8ee1..1e65406e3882 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -148,6 +148,23 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	return vmx_vcpu_load(vcpu, cpu);
 }
 
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+			     enum exit_fastpath_completion fastpath)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit(vcpu, fastpath);
+
+	return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit_irqoff(vcpu);
+
+	vmx_handle_exit_irqoff(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -340,6 +357,18 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
 	vmx_request_immediate_exit(vcpu);
 }
 
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+				error_code);
+		return;
+	}
+
+	vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
 static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -411,7 +440,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.vcpu_pre_run = vmx_vcpu_pre_run,
 	.run = vt_vcpu_run,
-	.handle_exit = vmx_handle_exit,
+	.handle_exit = vt_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
 	.set_interrupt_shadow = vt_set_interrupt_shadow,
@@ -446,7 +475,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_identity_map_addr = vmx_set_identity_map_addr,
 	.get_mt_mask = vmx_get_mt_mask,
 
-	.get_exit_info = vmx_get_exit_info,
+	.get_exit_info = vt_get_exit_info,
 
 	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
 
@@ -460,7 +489,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.load_mmu_pgd = vt_load_mmu_pgd,
 
 	.check_intercept = vmx_check_intercept,
-	.handle_exit_irqoff = vmx_handle_exit_irqoff,
+	.handle_exit_irqoff = vt_handle_exit_irqoff,
 
 	.request_immediate_exit = vt_request_immediate_exit,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 273898de9f7a..155208a8d768 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -68,6 +68,26 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 	return pa;
 }
 
+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+	return kvm_rcx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+	return kvm_rdx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+	return kvm_r8_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+	return kvm_r9_read(vcpu);
+}
+
 static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
 {
 	return tdx->tdvpr.added;
@@ -768,6 +788,25 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
 	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
 }
 
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u16 exit_reason = tdx->exit_reason.basic;
+
+	if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+		vmx_handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));
+	else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+		vmx_handle_external_interrupt_irqoff(vcpu,
+						     tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+	vcpu->mmio_needed = 0;
+	return 0;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1042,6 +1081,46 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
 }
 
+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
+{
+	union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+	if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+		if (exit_reason.basic == EXIT_REASON_TRIPLE_FAULT)
+			return tdx_handle_triple_fault(vcpu);
+
+		kvm_pr_unimpl("TD exit 0x%llx, %d\n",
+			exit_reason.full, exit_reason.basic);
+		goto unhandled_exit;
+	}
+
+	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+	switch (exit_reason.basic) {
+	default:
+		break;
+	}
+
+unhandled_exit:
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+	vcpu->run->hw.hardware_exit_reason = exit_reason.full;
+	return 0;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	*reason = tdx->exit_reason.full;
+
+	*info1 = tdexit_exit_qual(vcpu);
+	*info2 = tdexit_ext_exit_qual(vcpu);
+
+	*intr_info = tdexit_intr_info(vcpu);
+	*error_code = 0;
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 31be5e8a1d5c..c0a34186bc37 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -146,11 +146,16 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+		enum exit_fastpath_completion fastpath);
 
 void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector);
 void tdx_inject_nmi(struct kvm_vcpu *vcpu);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -177,11 +182,17 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTP
 static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
+		enum exit_fastpath_completion fastpath) { return 0; }
 
 static inline void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
 static inline void tdx_deliver_interrupt(
 	struct kvm_lapic *apic, int delivery_mode, int trig_mode, int vector) {}
 static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+static inline void tdx_get_exit_info(
+	struct kvm_vcpu *vcpu, u32 *reason, u64 *info1, u64 *info2,
+	u32 *intr_info, u32 *error_code) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 085/104] KVM: TDX: handle EXIT_REASON_OTHER_SMI
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (83 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 084/104] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:29   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
                   ` (20 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

If control reaches EXIT_REASON_OTHER_SMI, the #SMI was delivered and
handled right after returning from the TDX module to KVM, so nothing
needs to be done in KVM.  Continue TDX vcpu execution.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/vmx.h | 1 +
 arch/x86/kvm/vmx/tdx.c          | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 946d761adbd3..3d9b4598e166 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -34,6 +34,7 @@
 #define EXIT_REASON_TRIPLE_FAULT        2
 #define EXIT_REASON_INIT_SIGNAL			3
 #define EXIT_REASON_SIPI_SIGNAL         4
+#define EXIT_REASON_OTHER_SMI           6
 
 #define EXIT_REASON_INTERRUPT_WINDOW    7
 #define EXIT_REASON_NMI_WINDOW          8
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 155208a8d768..6fbe89bcfe1e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1097,6 +1097,13 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
+	case EXIT_REASON_OTHER_SMI:
+		/*
+		 * If reach here, it's not a MSMI.
+		 * #SMI is delivered and handled right after SEAMRET, nothing
+		 * needs to be done in KVM.
+		 */
+		return 1;
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (84 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 085/104] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-06 20:50   ` Sagi Shahar
  2022-03-04 19:49 ` [RFC PATCH v5 087/104] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
                   ` (19 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

On EPT violation, call a common function, __vmx_handle_ept_violation(),
to trigger the x86 MMU code.  On EPT misconfiguration, exit to ring 3
with KVM_EXIT_UNKNOWN, because EPT misconfiguration can't happen: MMIO is
triggered by TDG.VP.VMCALL instead.  There is no point in setting up an
EPT misconfiguration for the fast MMIO path.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 38 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6fbe89bcfe1e..2c35dcad077e 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1081,6 +1081,40 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
 }
 
+#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
+
+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qual;
+
+	if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu)))
+		exit_qual = TDX_SEPT_PFERR;
+	else {
+		exit_qual = tdexit_exit_qual(vcpu);
+		if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+			pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
+				tdexit_gpa(vcpu), kvm_rip_read(vcpu));
+			vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
+			vcpu->run->ex.exception = PF_VECTOR;
+			vcpu->run->ex.error_code = exit_qual;
+			return 0;
+		}
+	}
+
+	trace_kvm_page_fault(tdexit_gpa(vcpu), exit_qual);
+	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+	WARN_ON(1);
+
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+	vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+
+	return 0;
+}
+
 int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 {
 	union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
@@ -1097,6 +1131,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
+	case EXIT_REASON_EPT_VIOLATION:
+		return tdx_handle_ept_violation(vcpu);
+	case EXIT_REASON_EPT_MISCONFIG:
+		return tdx_handle_ept_misconfig(vcpu);
 	case EXIT_REASON_OTHER_SMI:
 		/*
 		 * If reach here, it's not a MSMI.
@@ -1378,8 +1416,6 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu)
 		cpu_relax();
 }
 
-#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
-
 static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 087/104] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (85 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:49   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers isaku.yamahata
                   ` (18 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because guest TD state is protected, exceptions in guest TDs can't be
intercepted.  The TDX VMM doesn't need to handle exceptions.
tdx_handle_exit_irqoff() already handles NMI and machine check, so ignore
them here and continue guest TD execution.

For external interrupts, increment the stats, the same as in the VMX
case.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2c35dcad077e..dc83414cb72a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -800,6 +800,23 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 						     tdexit_intr_info(vcpu));
 }
 
+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
+{
+	u32 intr_info = tdexit_intr_info(vcpu);
+
+	if (is_nmi(intr_info) || is_machine_check(intr_info))
+		return 1;
+
+	kvm_pr_unimpl("unexpected exception 0x%x\n", intr_info);
+	return -EFAULT;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+	++vcpu->stat.irq_exits;
+	return 1;
+}
+
 static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
 {
 	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -1131,6 +1148,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
+	case EXIT_REASON_EXCEPTION_NMI:
+		return tdx_handle_exception(vcpu);
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return tdx_handle_external_interrupt(vcpu);
 	case EXIT_REASON_EPT_VIOLATION:
 		return tdx_handle_ept_violation(vcpu);
 	case EXIT_REASON_EPT_MISCONFIG:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (86 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 087/104] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  4:06   ` Kai Huang
  2022-04-15 14:50   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
                   ` (17 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX defines an ABI for the TDX guest to issue hypercalls with the
TDG.VP.VMCALL API.  To get hypercall arguments and to set return values,
add accessors for the guest vcpu registers.
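
For reference, the register mapping wrapped by these accessors (the
vendor/KVM use of a non-zero R10 comes from a later patch in this
series):

	/*
	 * TDG.VP.VMCALL register usage as wrapped by these accessors:
	 *   R10 = exit type (non-zero = vendor-specific, e.g. KVM hypercall);
	 *         on return, the TDVMCALL return code
	 *   R11 = exit reason (leaf); on return, the return value
	 *   R12-R15 = parameters p1-p4
	 */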

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 35 +++++++++++++++++++++++++++++++++++
 1 file changed, 35 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dc83414cb72a..8695836ce796 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -88,6 +88,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
 	return kvm_r9_read(vcpu);
 }
 
+#define BUILD_TDVMCALL_ACCESSORS(param, gpr)					\
+static __always_inline								\
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)			\
+{										\
+	return kvm_##gpr##_read(vcpu);						\
+}										\
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu,	\
+						     unsigned long val)		\
+{										\
+	kvm_##gpr##_write(vcpu, val);						\
+}
+BUILD_TDVMCALL_ACCESSORS(p1, r12);
+BUILD_TDVMCALL_ACCESSORS(p2, r13);
+BUILD_TDVMCALL_ACCESSORS(p3, r14);
+BUILD_TDVMCALL_ACCESSORS(p4, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+	return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_exit_reason(struct kvm_vcpu *vcpu)
+{
+	return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+						     long val)
+{
+	kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+						    unsigned long val)
+{
+	kvm_r11_write(vcpu, val);
+}
+
 static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
 {
 	return tdx->tdvpr.added;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (87 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07  4:15   ` Kai Huang
  2022-03-04 19:49 ` [RFC PATCH v5 090/104] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL isaku.yamahata
                   ` (16 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM.  When the guest
TD issues TDG.VP.VMCALL, the guest TD exits to the VMM with a new exit
reason of TDVMCALL.  The arguments from the guest TD and the values
returned by the VMM are passed in the guest registers.  The guest RCX
register indicates which registers are used.

Define the TDVMCALL exit reason, which is carved out from the VMX exit
reason namespace, as the TDVMCALL exit from the TDX guest to TDX-SEAM is
really just a VM-Exit.  Add a placeholder to handle the TDVMCALL exit.
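
A short note on the RCX layout as this series interprets it (matching the
tdvmcall union added to struct vcpu_tdx below):

	/*
	 * Guest RCX at TDVMCALL exit is a bitmap of the registers exposed
	 * to the VMM:
	 *   bits  0-15: GPRs (tdvmcall.gpr_mask)
	 *   bits 16-31: XMM registers (tdvmcall.xmm_mask)
	 * handle_tdvmcall() currently rejects calls that expose XMM state.
	 */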

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/vmx.h |  4 +++-
 arch/x86/kvm/vmx/tdx.c          | 27 ++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h          | 13 +++++++++++++
 3 files changed, 42 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index 3d9b4598e166..cb0a0565219a 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -92,6 +92,7 @@
 #define EXIT_REASON_UMWAIT              67
 #define EXIT_REASON_TPAUSE              68
 #define EXIT_REASON_BUS_LOCK            74
+#define EXIT_REASON_TDCALL              77
 
 #define VMX_EXIT_REASONS \
 	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
@@ -154,7 +155,8 @@
 	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
 	{ EXIT_REASON_UMWAIT,                "UMWAIT" }, \
 	{ EXIT_REASON_TPAUSE,                "TPAUSE" }, \
-	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }
+	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }, \
+	{ EXIT_REASON_TDCALL,                "TDCALL" }
 
 #define VMX_EXIT_REASON_FLAGS \
 	{ VMX_EXIT_REASONS_FAILED_VMENTRY,	"FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8695836ce796..86daafd9eec0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -780,7 +780,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 					struct vcpu_tdx *tdx)
 {
 	guest_enter_irqoff();
-	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
+	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs,
+					tdx->tdvmcall.regs_mask);
 	guest_exit_irqoff();
 }
 
@@ -815,6 +816,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
 		return EXIT_FASTPATH_NONE;
+
+	if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+		tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+	else
+		tdx->tdvmcall.rcx = 0;
 	return EXIT_FASTPATH_NONE;
 }
 
@@ -859,6 +865,23 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (unlikely(tdx->tdvmcall.xmm_mask))
+		goto unsupported;
+
+	switch (tdvmcall_exit_reason(vcpu)) {
+	default:
+		break;
+	}
+
+unsupported:
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	return 1;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1187,6 +1210,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 		return tdx_handle_exception(vcpu);
 	case EXIT_REASON_EXTERNAL_INTERRUPT:
 		return tdx_handle_external_interrupt(vcpu);
+	case EXIT_REASON_TDCALL:
+		return handle_tdvmcall(vcpu);
 	case EXIT_REASON_EPT_VIOLATION:
 		return tdx_handle_ept_violation(vcpu);
 	case EXIT_REASON_EPT_MISCONFIG:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 7cd81780f3fa..9e8ed9b3119e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -86,6 +86,19 @@ struct vcpu_tdx {
 	/* Posted interrupt descriptor */
 	struct pi_desc pi_desc;
 
+	union {
+		struct {
+			union {
+				struct {
+					u16 gpr_mask;
+					u16 xmm_mask;
+				};
+				u32 regs_mask;
+			};
+			u32 reserved;
+		};
+		u64 rcx;
+	} tdvmcall;
 	union tdx_exit_reason exit_reason;
 
 	bool initialized;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 090/104] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (88 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
                   ` (15 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX Guest-Host Communication Interface (GHCI) specification defines
the ABI for the guest TD to issue hypercalls.  It reserves vendor-specific
arguments for VMM-specific use.  Use this vendor-specific mechanism for
the KVM hypercall and handle it.
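
As a hedged, guest-side illustration of the ABI the handler below expects
(the guest helper itself is not shown and is hypothetical here):

	/*
	 * Guest TD side of a KVM hypercall over TDG.VP.VMCALL:
	 *   R10 = KVM hypercall number (non-zero, i.e. vendor-specific leaf)
	 *   R11..R14 = arguments a0..a3
	 *   TDCALL(TDG.VP.VMCALL)
	 *   R10 = hypercall return code on return to the guest
	 */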

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 86daafd9eec0..53f59fb92dcf 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -865,6 +865,34 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3, ret;
+
+	/*
+	 * ABI for KVM tdvmcall argument:
+	 * In Guest-Hypervisor Communication Interface(GHCI) specification,
+	 * Non-zero leaf number (R10 != 0) is defined to indicate
+	 * vendor-specific.  KVM uses this for KVM hypercall.  NOTE: KVM
+	 * hypercall number starts from one.  Zero isn't used for KVM hypercall
+	 * number.
+	 *
+	 * R10: KVM hypercall number
+	 * arguments: R11, R12, R13, R14.
+	 */
+	nr = kvm_r10_read(vcpu);
+	a0 = kvm_r11_read(vcpu);
+	a1 = kvm_r12_read(vcpu);
+	a2 = kvm_r13_read(vcpu);
+	a3 = kvm_r14_read(vcpu);
+
+	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true);
+
+	tdvmcall_set_return_code(vcpu, ret);
+
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -872,6 +900,9 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 	if (unlikely(tdx->tdvmcall.xmm_mask))
 		goto unsupported;
 
+	if (tdvmcall_exit_type(vcpu))
+		return tdx_emulate_vmcall(vcpu);
+
 	switch (tdvmcall_exit_reason(vcpu)) {
 	default:
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (89 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 090/104] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07 13:16   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
                   ` (14 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV CPUID hypercall to the KVM backend function.
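
For reference, the register mapping handled by tdx_emulate_cpuid() below
(as used in this patch):

	/*
	 * TDG.VP.VMCALL<INSTRUCTION.CPUID>:
	 *   in:  R12 = CPUID leaf (EAX), R13 = sub-leaf (ECX)
	 *   out: R12 = EAX, R13 = EBX, R14 = ECX, R15 = EDX
	 */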

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 27 +++++++++++++++++++++++++++
 1 file changed, 27 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 53f59fb92dcf..f7c9170d596a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -893,6 +893,30 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+	u32 eax, ebx, ecx, edx;
+
+	/* EAX and ECX for cpuid is stored in R12 and R13. */
+	eax = tdvmcall_p1_read(vcpu);
+	ecx = tdvmcall_p2_read(vcpu);
+
+	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true);
+
+	/*
+	 * The returned value for CPUID (EAX, EBX, ECX, and EDX) is stored into
+	 * R12, R13, R14, and R15.
+	 */
+	tdvmcall_p1_write(vcpu, eax);
+	tdvmcall_p2_write(vcpu, ebx);
+	tdvmcall_p3_write(vcpu, ecx);
+	tdvmcall_p4_write(vcpu, edx);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -904,6 +928,9 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_vmcall(vcpu);
 
 	switch (tdvmcall_exit_reason(vcpu)) {
+	case EXIT_REASON_CPUID:
+		return tdx_emulate_cpuid(vcpu);
+
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (90 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-07 13:56   ` Paolo Bonzini
  2022-04-07 14:51   ` Sean Christopherson
  2022-03-04 19:49 ` [RFC PATCH v5 093/104] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
                   ` (13 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV HLT hypercall to the KVM backend function.

When the guest issues HLT, the hypercall instruction can come right after
the CLI instruction: the guest atomically unmasks virtual interrupts and
issues the HLT hypercall.  A virtual interrupt can arrive right after CLI,
before the switch back to the VMM.  In such a case, the VMM should return
to the guest without losing the interrupt.  Check whether an interrupt
arrived before the TDX module switched to the VMM, and if so return to
the guest.
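
A sketch of the window being handled, under the assumptions stated in the
comment added by this patch:

	/*
	 * guest TD:   CLI; ...; TDG.VP.VMCALL<HLT> (p1 = 0, interrupts not
	 *             masked for the purpose of the hypercall)
	 * interrupt:  may become pending anywhere in this window
	 * TDX module: exits to KVM with the TDVMCALL<HLT> exit reason
	 * KVM:        VMXIP set in TD_VCPU_STATE_DETAILS -> resume the vcpu;
	 *             otherwise -> kvm_emulate_halt_noskip()
	 */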

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 45 +++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f7c9170d596a..b0dcc2421649 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -917,6 +917,48 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+	bool interrupt_disabled = tdvmcall_p1_read(vcpu);
+	union tdx_vcpu_state_details details;
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+	if (!interrupt_disabled) {
+		/*
+		 * Virtual interrupt can arrive after TDG.VM.VMCALL<HLT> during
+		 * the TDX module executing.  On the other hand, KVM doesn't
+		 * know if vcpu was executing in the guest TD or the TDX module.
+		 *
+		 * CPU mode transition:
+		 * TDG.VP.VMCALL<HLT> (SEAM VMX non-root mode) ->
+		 * the TDX module (SEAM VMX root mode) ->
+		 * KVM (Legacy VMX root mode)
+		 *
+		 * If virtual interrupt arrives to this vcpu
+		 * - In the guest TD executing:
+		 *   KVM can handle it in the same way to the VMX case.
+		 * - During the TDX module executing:
+		 *   The TDX modules switches to KVM with TDG.VM.VMCALL<HLT>
+		 *   exit reason.  KVM thinks the guest was running.  So KVM
+		 *   vcpu wake up logic doesn't kick in.  Check if virtual
+		 *   interrupt is pending and resume vcpu without blocking vcpu.
+		 * - KVM executing:
+		 *   The existing logic wakes up the target vcpu on injecting
+		 *   virtual interrupt in the same way to the VMX case.
+		 *
+		 * Check if the interrupt is already pending.  If yes, resume
+		 * vcpu from guest HLT without emulating hlt instruction.
+		 */
+		details.full = td_state_non_arch_read64(
+			to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
+		if (details.vmxip)
+			return 1;
+	}
+
+	return kvm_emulate_halt_noskip(vcpu);
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -930,7 +972,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 	switch (tdvmcall_exit_reason(vcpu)) {
 	case EXIT_REASON_CPUID:
 		return tdx_emulate_cpuid(vcpu);
-
+	case EXIT_REASON_HLT:
+		return tdx_emulate_hlt(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 093/104] KVM: TDX: Handle TDX PV port io hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (91 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 14:59   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 094/104] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
                   ` (12 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV port IO hypercall to the KVM backend function.
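
For reference, the register ABI handled by tdx_emulate_io() below (as
read by the tdvmcall_p*() accessors in this patch):

	/*
	 * TDG.VP.VMCALL<INSTRUCTION.IO>:
	 *   R12 (p1) = access size: 1, 2 or 4 bytes
	 *   R13 (p2) = direction: 0 = IN (read), non-zero = OUT (write)
	 *   R14 (p3) = port number
	 *   R15 (p4) = value to write (OUT only)
	 * On return, R11 holds the value read (IN only).
	 */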

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 55 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 55 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b0dcc2421649..c900347d0bc7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -959,6 +959,59 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
 	return kvm_emulate_halt_noskip(vcpu);
 }
 
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	unsigned long val = 0;
+	int ret;
+
+	WARN_ON(vcpu->arch.pio.count != 1);
+
+	ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+					 vcpu->arch.pio.port, &val, 1);
+	WARN_ON(!ret);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	tdvmcall_set_return_val(vcpu, val);
+
+	return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	unsigned long val = 0;
+	unsigned int port;
+	int size, ret;
+
+	++vcpu->stat.io_exits;
+
+	size = tdvmcall_p1_read(vcpu);
+	port = tdvmcall_p3_read(vcpu);
+
+	if (size != 1 && size != 2 && size != 4) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	if (!tdvmcall_p2_read(vcpu)) {
+		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+		if (!ret)
+			vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+		else
+			tdvmcall_set_return_val(vcpu, val);
+	} else {
+		val = tdvmcall_p4_read(vcpu);
+		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+		/* No need for a complete_userspace_io callback. */
+		vcpu->arch.pio.count = 0;
+	}
+	if (ret)
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return ret;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -974,6 +1027,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_cpuid(vcpu);
 	case EXIT_REASON_HLT:
 		return tdx_emulate_hlt(vcpu);
+	case EXIT_REASON_IO_INSTRUCTION:
+		return tdx_emulate_io(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 094/104] KVM: TDX: Handle TDX PV MMIO hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (92 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 093/104] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 15:05   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 095/104] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
                   ` (11 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Export kvm_io_bus_read() and the kvm_mmio tracepoint, and wire up the TDX
PV MMIO hypercall to the KVM backend functions.

kvm_io_bus_read/write() looks up the KVM device emulated in the kernel at
the given MMIO address and emulates the MMIO.  As TDX PV MMIO also needs
this, export kvm_io_bus_read(); kvm_io_bus_write() is already exported.
TDX PV MMIO emulates some of the MMIO itself.  To add tracepoints
consistently with the rest of x86 KVM, export the kvm_mmio tracepoint.
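
For reference, the MMIO hypercall register layout handled below (as read
by the tdvmcall_p*() accessors in this patch):

	/*
	 * MMIO TDVMCALL (leaf EXIT_REASON_EPT_VIOLATION) as handled by
	 * tdx_emulate_mmio():
	 *   R12 (p1) = access size: 1, 2, 4 or 8 bytes
	 *   R13 (p2) = direction: 0 = read, 1 = write
	 *   R14 (p3) = MMIO GPA (the shared bit is stripped before use)
	 *   R15 (p4) = value to write (writes only)
	 * On return, R11 holds the value read (reads only).
	 */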

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c     |   1 +
 virt/kvm/kvm_main.c    |   2 +
 3 files changed, 117 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c900347d0bc7..914af5da4805 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1012,6 +1012,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+	unsigned long val = 0;
+	gpa_t gpa;
+	int size;
+
+	WARN_ON(vcpu->mmio_needed != 1);
+	vcpu->mmio_needed = 0;
+
+	if (!vcpu->mmio_is_write) {
+		gpa = vcpu->mmio_fragments[0].gpa;
+		size = vcpu->mmio_fragments[0].len;
+
+		memcpy(&val, vcpu->run->mmio.data, size);
+		tdvmcall_set_return_val(vcpu, val);
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	}
+	return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+				 unsigned long val)
+{
+	if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+	    kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+	return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+	unsigned long val;
+
+	if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+	    kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	tdvmcall_set_return_val(vcpu, val);
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_memory_slot *slot;
+	int size, write, r;
+	unsigned long val;
+	gpa_t gpa;
+
+	WARN_ON(vcpu->mmio_needed);
+
+	size = tdvmcall_p1_read(vcpu);
+	write = tdvmcall_p2_read(vcpu);
+	gpa = tdvmcall_p3_read(vcpu);
+	val = write ? tdvmcall_p4_read(vcpu) : 0;
+
+	if (size != 1 && size != 2 && size != 4 && size != 8)
+		goto error;
+	if (write != 0 && write != 1)
+		goto error;
+
+	/* Strip the shared bit, allow MMIO with and without it set. */
+	gpa = kvm_gpa_unalias(vcpu->kvm, gpa);
+
+	if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
+		goto error;
+
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+	if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
+		goto error;
+
+	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+		trace_kvm_fast_mmio(gpa);
+		return 1;
+	}
+
+	if (write)
+		r = tdx_mmio_write(vcpu, gpa, size, val);
+	else
+		r = tdx_mmio_read(vcpu, gpa, size);
+	if (!r) {
+		/* Kernel completed device emulation. */
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+		return 1;
+	}
+
+	/* Request the device emulation to userspace device model. */
+	vcpu->mmio_needed = 1;
+	vcpu->mmio_is_write = write;
+	vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+	vcpu->run->mmio.phys_addr = gpa;
+	vcpu->run->mmio.len = size;
+	vcpu->run->mmio.is_write = write;
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+	if (write) {
+		memcpy(vcpu->run->mmio.data, &val, size);
+	} else {
+		vcpu->mmio_fragments[0].gpa = gpa;
+		vcpu->mmio_fragments[0].len = size;
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+	}
+	return 0;
+
+error:
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1029,6 +1141,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_hlt(vcpu);
 	case EXIT_REASON_IO_INSTRUCTION:
 		return tdx_emulate_io(vcpu);
+	case EXIT_REASON_EPT_VIOLATION:
+		return tdx_emulate_mmio(vcpu);
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9acb33a17445..483fa46b1be7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12915,6 +12915,7 @@ bool kvm_arch_dirty_log_supported(struct kvm *kvm)
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d4e117f5b5b9..6db075db6098 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2259,6 +2259,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
 
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {
@@ -5126,6 +5127,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 	r = __kvm_io_bus_read(vcpu, bus, &range, val);
 	return r < 0 ? r : 0;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
 
 /* Caller must hold slots_lock. */
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 095/104] KVM: TDX: Implement callbacks for MSR operations for TDX
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (93 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 094/104] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 15:07   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 096/104] KVM: TDX: Handle TDX PV rdmsr hypercall isaku.yamahata
                   ` (10 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the set_msr/get_msr/has_emulated_msr methods for TDX to handle
hypercalls from the guest TD for paravirtualized rdmsr and wrmsr.  The TDX
module virtualizes MSRs.  For some MSRs, it injects #VE to the guest TD
upon RDMSR or WRMSR.  The exact list of such MSRs is defined in the spec.

Upon #VE, the guest TD may execute the hypercalls
TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
which are defined in the GHCI (Guest-Host Communication Interface), so that
the host VMM (e.g. KVM) can virtualize the MSRs.
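
For reference, the hypercall parameters as consumed by the KVM backend in the
following patches (a sketch; the pN names follow the tdvmcall_pN_read()
accessors):

  /*
   * TDG.VP.VMCALL<INSTRUCTION.RDMSR>: p1 = MSR index; the MSR value is
   *                                   returned via tdvmcall_set_return_val().
   * TDG.VP.VMCALL<INSTRUCTION.WRMSR>: p1 = MSR index, p2 = value to write.
   * MSRs that aren't emulated fail with TDG_VP_VMCALL_INVALID_OPERAND.
   */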

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 34 +++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 68 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  6 ++++
 3 files changed, 105 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 1e65406e3882..a528cfdbce54 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -165,6 +165,34 @@ static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 	vmx_handle_exit_irqoff(vcpu);
 }
 
+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_set_msr(vcpu, msr_info);
+
+	return vmx_set_msr(vcpu, msr_info);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+	if (kvm && is_td(kvm))
+		return tdx_is_emulated_msr(index, true);
+
+	return vmx_has_emulated_msr(kvm, index);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_get_msr(vcpu, msr_info);
+
+	return vmx_get_msr(vcpu, msr_info);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -393,7 +421,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.hardware_enable = vt_hardware_enable,
 	.hardware_disable = vt_hardware_disable,
 	.cpu_has_accelerated_tpr = report_flexpriority,
-	.has_emulated_msr = vmx_has_emulated_msr,
+	.has_emulated_msr = vt_has_emulated_msr,
 
 	.is_vm_type_supported = vt_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
@@ -411,8 +439,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
 	.get_msr_feature = vmx_get_msr_feature,
-	.get_msr = vmx_get_msr,
-	.set_msr = vmx_set_msr,
+	.get_msr = vt_get_msr,
+	.set_msr = vt_set_msr,
 	.get_segment_base = vmx_get_segment_base,
 	.get_segment = vmx_get_segment,
 	.set_segment = vmx_set_segment,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 914af5da4805..cec2660206bd 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1517,6 +1517,74 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 	*error_code = 0;
 }
 
+bool tdx_is_emulated_msr(u32 index, bool write)
+{
+	switch (index) {
+	case MSR_IA32_UCODE_REV:
+	case MSR_IA32_ARCH_CAPABILITIES:
+	case MSR_IA32_POWER_CTL:
+	case MSR_MTRRcap:
+	case 0x200 ... 0x26f:
+		/* IA32_MTRR_PHYS{BASE, MASK}, IA32_MTRR_FIX*_* */
+	case MSR_IA32_CR_PAT:
+	case MSR_MTRRdefType:
+	case MSR_IA32_TSC_DEADLINE:
+	case MSR_IA32_MISC_ENABLE:
+	case MSR_KVM_STEAL_TIME:
+	case MSR_KVM_POLL_CONTROL:
+	case MSR_PLATFORM_INFO:
+	case MSR_MISC_FEATURES_ENABLES:
+	case MSR_IA32_MCG_CAP:
+	case MSR_IA32_MCG_STATUS:
+	case MSR_IA32_MCG_CTL:
+	case MSR_IA32_MCG_EXT_CTL:
+	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(28) - 1:
+		/* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC} */
+		return true;
+	case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+		/*
+		 * x2APIC registers that are virtualized by the CPU can't be
+		 * emulated, KVM doesn't have access to the virtual APIC page.
+		 */
+		switch (index) {
+		case X2APIC_MSR(APIC_TASKPRI):
+		case X2APIC_MSR(APIC_PROCPRI):
+		case X2APIC_MSR(APIC_EOI):
+		case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+		case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+		case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+			return false;
+		default:
+			return true;
+		}
+	case MSR_IA32_APICBASE:
+	case MSR_EFER:
+		return !write;
+	case MSR_IA32_MCx_CTL2(0) ... MSR_IA32_MCx_CTL2(31):
+		/*
+		 * 0x280 - 0x29f: The x86 common code doesn't emulate MCx_CTL2.
+		 * Refer to kvm_{get,set}_msr_common(),
+		 * kvm_mtrr_{get, set}_msr(), and msr_mtrr_valid().
+		 */
+	default:
+		return false;
+	}
+}
+
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	if (tdx_is_emulated_msr(msr->index, false))
+		return kvm_get_msr_common(vcpu, msr);
+	return 1;
+}
+
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	if (tdx_is_emulated_msr(msr->index, true))
+		return kvm_set_msr_common(vcpu, msr);
+	return 1;
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index c0a34186bc37..dcaa5806802e 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -156,6 +156,9 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 void tdx_inject_nmi(struct kvm_vcpu *vcpu);
 void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+bool tdx_is_emulated_msr(u32 index, bool write);
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -193,6 +196,9 @@ static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
 static inline void tdx_get_exit_info(
 	struct kvm_vcpu *vcpu, u32 *reason, u64 *info1, u64 *info2,
 	u32 *intr_info, u32 *error_code) {}
+static inline bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
+static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 096/104] KVM: TDX: Handle TDX PV rdmsr hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (94 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 095/104] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 15:08   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 097/104] KVM: TDX: Handle TDX PV wrmsr hypercall isaku.yamahata
                   ` (9 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV rdmsr hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index cec2660206bd..dd7aaa28bf3a 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1124,6 +1124,23 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+	u32 index = tdvmcall_p1_read(vcpu);
+	u64 data;
+
+	if (kvm_get_msr(vcpu, index, &data)) {
+		trace_kvm_msr_read_ex(index);
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+	trace_kvm_msr_read(index, data);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	tdvmcall_set_return_val(vcpu, data);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1143,6 +1160,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_io(vcpu);
 	case EXIT_REASON_EPT_VIOLATION:
 		return tdx_emulate_mmio(vcpu);
+	case EXIT_REASON_MSR_READ:
+		return tdx_emulate_rdmsr(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 097/104] KVM: TDX: Handle TDX PV wrmsr hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (95 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 096/104] KVM: TDX: Handle TDX PV rdmsr hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 098/104] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
                   ` (8 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV wrmsr hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index dd7aaa28bf3a..123d4322da99 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1141,6 +1141,22 @@ static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+	u32 index = tdvmcall_p1_read(vcpu);
+	u64 data = tdvmcall_p2_read(vcpu);
+
+	if (kvm_set_msr(vcpu, index, data)) {
+		trace_kvm_msr_write_ex(index, data);
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	trace_kvm_msr_write(index, data);
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1162,6 +1178,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_mmio(vcpu);
 	case EXIT_REASON_MSR_READ:
 		return tdx_emulate_rdmsr(vcpu);
+	case EXIT_REASON_MSR_WRITE:
+		return tdx_emulate_wrmsr(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 098/104] KVM: TDX: Handle TDX PV report fatal error hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (96 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 097/104] KVM: TDX: Handle TDX PV wrmsr hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-15 15:13   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 099/104] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
                   ` (7 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV report fatal error hypercall to KVM_SYSTEM_EVENT_CRASH KVM
exit event.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 123d4322da99..4d668a6c7dc9 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1157,6 +1157,15 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
+{
+	/* Exit to userspace device model for teardown. */
+	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
+	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
+	vcpu->run->system_event.flags = tdvmcall_p1_read(vcpu);
+	return 0;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1180,6 +1189,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_rdmsr(vcpu);
 	case EXIT_REASON_MSR_WRITE:
 		return tdx_emulate_wrmsr(vcpu);
+	case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+		return tdx_report_fatal_error(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 099/104] KVM: TDX: Handle TDX PV map_gpa hypercall
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (97 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 098/104] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:49 ` [RFC PATCH v5 100/104] KVM: TDX: Silently discard SMI request isaku.yamahata
                   ` (6 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV map_gpa hypercall to the kvm/mmu backend.
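
For reference, the MapGPA parameters as handled below (a sketch; the pN names
follow the tdvmcall_pN_read() accessors used in this patch):

  /*
   * TDG.VP.VMCALL<MapGPA>:
   *   p1 = starting GPA; its shared bit selects whether the range is
   *        converted to shared or to private
   *   p2 = size in bytes
   * Both must be 4KB aligned, the range must not wrap or cross the
   * shared/private boundary, and the conversion is done in chunks of at
   * most 16MB with cond_resched() in between.
   */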

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 59 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 59 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4d668a6c7dc9..e5eccec0e24d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1166,6 +1166,63 @@ static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	gpa_t gpa = tdvmcall_p1_read(vcpu);
+	gpa_t size = tdvmcall_p2_read(vcpu);
+	gpa_t end = gpa + size;
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
+		end < gpa ||
+		end > kvm_gfn_stolen_mask(kvm) << (PAGE_SHIFT + 1) ||
+		kvm_is_private_gpa(kvm, gpa) != kvm_is_private_gpa(kvm, end))
+		return 1;
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+#define TDX_MAP_GPA_SIZE_MAX   (16 * 1024 * 1024)
+	while (gpa < end) {
+		gfn_t s = gpa_to_gfn(gpa);
+		gfn_t e = gpa_to_gfn(
+			min(roundup(gpa + 1, TDX_MAP_GPA_SIZE_MAX), end));
+		int ret = kvm_mmu_map_gpa(vcpu, &s, e);
+
+		if (ret == -EAGAIN)
+			e = s;
+		else if (ret) {
+			tdvmcall_set_return_code(vcpu,
+						TDG_VP_VMCALL_INVALID_OPERAND);
+			break;
+		}
+
+		gpa = gfn_to_gpa(e);
+
+		/*
+		 * TODO:
+		 * Interrupt this hypercall invocation to return remaining
+		 * region to the guest and let the guest to resume the
+		 * hypercall.
+		 *
+		 * The TDX Guest-Hypervisor Communication Interface(GHCI)
+		 * specification and guest implementation need to be updated.
+		 *
+		 * if (gpa < end && need_resched()) {
+		 *	size = end - gpa;
+		 *	kvm_r12_write(vcpu, gpa);
+		 *	kvm_r13_write(vcpu, size);
+		 *	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INTERRUPTED_RESUME);
+		 *	break;
+		 * }
+		 */
+		if (gpa < end && need_resched())
+			cond_resched();
+	}
+
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -1191,6 +1248,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_wrmsr(vcpu);
 	case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
 		return tdx_report_fatal_error(vcpu);
+	case TDG_VP_VMCALL_MAP_GPA:
+		return tdx_map_gpa(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 100/104] KVM: TDX: Silently discard SMI request
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (98 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 099/104] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:41   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
                   ` (5 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX doesn't support system-management mode (SMM) and system-management
interrupt (SMI) in guest TDs.  Because guest state (vCPU state, memory
state) is protected, the VMM must go through the TDX module APIs to change
guest state, e.g. to inject an SMI or to switch the vCPU mode into SMM.
The TDX module provides no way for the VMM to inject an SMI into a guest TD
or to switch the guest vCPU mode into SMM.

We have two options in KVM when handling SMM or SMI in the guest TD or the
device model (e.g. QEMU): 1) silently ignore the request or 2) return a
meaningful error.

For simplicity, option 1) is implemented.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 43 ++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.c     | 25 ++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  8 +++++++
 3 files changed, 72 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a528cfdbce54..478aa63acefa 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -193,6 +193,41 @@ static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return vmx_get_msr(vcpu, msr_info);
 }
 
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_smi_allowed(vcpu, for_injection);
+
+	return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_enter_smm(vcpu, smstate);
+
+	return vmx_enter_smm(vcpu, smstate);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_leave_smm(vcpu, smstate);
+
+	return vmx_leave_smm(vcpu, smstate);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_enable_smi_window(vcpu);
+		return;
+	}
+
+	/* RSM will cause a vmexit anyway.  */
+	vmx_enable_smi_window(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -539,10 +574,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.setup_mce = vmx_setup_mce,
 
-	.smi_allowed = vmx_smi_allowed,
-	.enter_smm = vmx_enter_smm,
-	.leave_smm = vmx_leave_smm,
-	.enable_smi_window = vmx_enable_smi_window,
+	.smi_allowed = vt_smi_allowed,
+	.enter_smm = vt_enter_smm,
+	.leave_smm = vt_leave_smm,
+	.enable_smi_window = vt_enable_smi_window,
 
 	.can_emulate_instruction = vmx_can_emulate_instruction,
 	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e5eccec0e24d..7bbf6271967b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1692,6 +1692,31 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	return 1;
 }
 
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	/* SMI isn't supported for TDX. */
+	return false;
+}
+
+int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
+{
+	/* smi_allowed() is always false for TDX as above. */
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
+{
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+	/* SMI isn't supported for TDX.  Silently discard SMI request. */
+	vcpu->arch.smi_pending = false;
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index dcaa5806802e..19d793609cc4 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -159,6 +159,10 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 bool tdx_is_emulated_msr(u32 index, bool write);
 int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -199,6 +203,10 @@ static inline void tdx_get_exit_info(
 static inline bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
 static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
+static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate) { return 0; }
+static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate) { return 0; }
+static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (99 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 100/104] KVM: TDX: Silently discard SMI request isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:48   ` Paolo Bonzini
  2022-03-04 19:49 ` [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
                   ` (4 subsequent siblings)
  105 siblings, 1 reply; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX module doesn't provide an API for the VMM to inject an INIT IPI or
SIPI.  Instead it defines a different protocol to boot application
processors.  Ignore INIT and SIPI events for the TDX guest.

There are two options: 1) (silently) ignore the INIT/SIPI request or
2) return an error to the guest TD somehow.  Given that the TDX guest is
paravirtualized to boot APs, option 1 is chosen for simplicity.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/lapic.c    | 21 +++++++++++++++++----
 arch/x86/kvm/vmx/main.c | 10 +++++++++-
 arch/x86/kvm/x86.h      |  5 +++++
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d49f029ef0e3..e27653d5e630 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2921,11 +2921,20 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 
 	if (test_bit(KVM_APIC_INIT, &pe)) {
 		clear_bit(KVM_APIC_INIT, &apic->pending_events);
-		kvm_vcpu_reset(vcpu, true);
-		if (kvm_vcpu_is_bsp(apic->vcpu))
+		if (kvm_init_sipi_unsupported(vcpu->kvm))
+			/*
+			 * TDX doesn't support INIT.  Ignore INIT event.  In the
+			 * case of SIPI, the callback of
+			 * vcpu_deliver_sipi_vector ignores it.
+			 */
 			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-		else
-			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+		else {
+			kvm_vcpu_reset(vcpu, true);
+			if (kvm_vcpu_is_bsp(apic->vcpu))
+				vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+			else
+				vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+		}
 	}
 	if (test_bit(KVM_APIC_SIPI, &pe)) {
 		clear_bit(KVM_APIC_SIPI, &apic->pending_events);
@@ -2933,6 +2942,10 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 			/* evaluate pending_events before reading the vector */
 			smp_rmb();
 			sipi_vector = apic->sipi_vector;
+			/*
+			 * If INIT/SIPI isn't supported, the callback ignores
+			 * the SIPI request.
+			 */
 			kvm_x86_ops.vcpu_deliver_sipi_vector(vcpu, sipi_vector);
 			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
 		}
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 478aa63acefa..de9b4a270f20 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -264,6 +264,14 @@ static bool vt_apicv_has_pending_interrupt(struct kvm_vcpu *vcpu)
 	return false;
 }
 
+static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	kvm_vcpu_deliver_sipi_vector(vcpu, vector);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -586,7 +594,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.msr_filter_changed = vmx_msr_filter_changed,
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
-	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+	.vcpu_deliver_sipi_vector = vt_vcpu_deliver_sipi_vector,
 
 	.mem_enc_op = vt_mem_enc_op,
 	.mem_enc_op_vcpu = vt_mem_enc_op_vcpu,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index f15bf1c0aeb1..c789d72ab408 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -405,6 +405,11 @@ static inline void kvm_machine_check(void)
 #endif
 }
 
+static __always_inline bool kvm_init_sipi_unsupported(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
 void kvm_load_guest_xsave_state(struct kvm_vcpu *vcpu);
 void kvm_load_host_xsave_state(struct kvm_vcpu *vcpu);
 int kvm_spec_ctrl_test_value(u64 value);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (100 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-04-05 15:56   ` Paolo Bonzini
  2022-04-12  6:49   ` Xiaoyao Li
  2022-03-04 19:49 ` [RFC PATCH v5 103/104] Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX) isaku.yamahata
                   ` (3 subsequent siblings)
  105 siblings, 2 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX protects TDX guest state from the VMM.  Implement access methods for
TDX guest state that ignore writes and return zero for reads.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 465 +++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.c     |  44 ++++
 arch/x86/kvm/vmx/x86_ops.h |  17 ++
 3 files changed, 483 insertions(+), 43 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index de9b4a270f20..0515998f7fa5 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -228,6 +228,46 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
 	vmx_enable_smi_window(vcpu);
 }
 
+static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+				       void *insn, int insn_len)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_can_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+				 struct x86_instruction_info *info,
+				 enum x86_intercept_stage stage,
+				 struct x86_exception *exception)
+{
+	/*
+	 * This call back is triggered by the x86 instruction emulator. TDX
+	 * doesn't allow guest memory inspection.
+	 */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return X86EMUL_UNHANDLEABLE;
+
+	return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_apic_init_signal_blocked(vcpu);
+}
+
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_set_virtual_apic_mode(vcpu);
+
+	return vmx_set_virtual_apic_mode(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -236,6 +276,31 @@ static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	return vmx_apicv_post_state_restore(vcpu);
 }
 
+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_hwapic_isr_update(vcpu, max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	/* TDX doesn't support L2 at the moment. */
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return false;
+
+	return vmx_guest_apic_has_interrupt(vcpu);
+}
+
 static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -272,6 +337,179 @@ static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
 	kvm_vcpu_deliver_sipi_vector(vcpu, vector);
 }
 
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_segment_base(vcpu, seg);
+
+	return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			      int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_segment(vcpu, var, seg);
+
+	vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			      int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_cpl(vcpu);
+
+	return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(dt, 0, sizeof(*dt));
+		return;
+	}
+
+	vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(dt, 0, sizeof(*dt));
+		return;
+	}
+
+	vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+	 * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+	 * reach here for TD vcpu.
+	 */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_cache_reg(vcpu, reg);
+		return;
+	}
+
+	vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_rflags(vcpu);
+
+	return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_get_if_flag(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -388,6 +626,15 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 	return vmx_get_interrupt_shadow(vcpu);
 }
 
+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+				  unsigned char *hypercall)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_patch_hypercall(vcpu, hypercall);
+}
+
 static void vt_inject_irq(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -396,6 +643,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu)
 	vmx_inject_irq(vcpu);
 }
 
+static void vt_queue_exception(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_queue_exception(vcpu);
+}
+
 static void vt_cancel_injection(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -428,6 +683,130 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
 	vmx_request_immediate_exit(vcpu);
 }
 
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
+	vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
+	vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_set_identity_map_addr(kvm, ident_addr);
+}
+
+static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+	if (is_td_vcpu(vcpu)) {
+		if (is_mmio)
+			return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+	}
+
+	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+	/* TDX doesn't support L2 guest at the moment. */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+	/* TDX doesn't support L2 guest at the moment. */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+	/* In TDX, tsc offset can't be changed. */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_write_tsc_offset(vcpu, offset);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+{
+	/* In TDX, tsc multiplier can't be changed. */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_write_tsc_multiplier(vcpu, multiplier);
+}
+
+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_update_cpu_dirty_logging(vcpu);
+}
+
+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+			      bool *expired)
+{
+	/* VMX-preemption timer isn't available for TDX. */
+	if (is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+	/* VMX-preemption timer can't be set.  See vt_set_hv_timer(). */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
 static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
@@ -480,29 +859,29 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_load = vt_vcpu_load,
 	.vcpu_put = vt_vcpu_put,
 
-	.update_exception_bitmap = vmx_update_exception_bitmap,
+	.update_exception_bitmap = vt_update_exception_bitmap,
 	.get_msr_feature = vmx_get_msr_feature,
 	.get_msr = vt_get_msr,
 	.set_msr = vt_set_msr,
-	.get_segment_base = vmx_get_segment_base,
-	.get_segment = vmx_get_segment,
-	.set_segment = vmx_set_segment,
-	.get_cpl = vmx_get_cpl,
-	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
-	.set_cr0 = vmx_set_cr0,
+	.get_segment_base = vt_get_segment_base,
+	.get_segment = vt_get_segment,
+	.set_segment = vt_set_segment,
+	.get_cpl = vt_get_cpl,
+	.get_cs_db_l_bits = vt_get_cs_db_l_bits,
+	.set_cr0 = vt_set_cr0,
 	.is_valid_cr4 = vmx_is_valid_cr4,
-	.set_cr4 = vmx_set_cr4,
-	.set_efer = vmx_set_efer,
-	.get_idt = vmx_get_idt,
-	.set_idt = vmx_set_idt,
-	.get_gdt = vmx_get_gdt,
-	.set_gdt = vmx_set_gdt,
-	.set_dr7 = vmx_set_dr7,
-	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
-	.cache_reg = vmx_cache_reg,
-	.get_rflags = vmx_get_rflags,
-	.set_rflags = vmx_set_rflags,
-	.get_if_flag = vmx_get_if_flag,
+	.set_cr4 = vt_set_cr4,
+	.set_efer = vt_set_efer,
+	.get_idt = vt_get_idt,
+	.set_idt = vt_set_idt,
+	.get_gdt = vt_get_gdt,
+	.set_gdt = vt_set_gdt,
+	.set_dr7 = vt_set_dr7,
+	.sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+	.cache_reg = vt_cache_reg,
+	.get_rflags = vt_get_rflags,
+	.set_rflags = vt_set_rflags,
+	.get_if_flag = vt_get_if_flag,
 
 	.tlb_flush_all = vt_flush_tlb_all,
 	.tlb_flush_current = vt_flush_tlb_current,
@@ -516,10 +895,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.update_emulated_instruction = vmx_update_emulated_instruction,
 	.set_interrupt_shadow = vt_set_interrupt_shadow,
 	.get_interrupt_shadow = vt_get_interrupt_shadow,
-	.patch_hypercall = vmx_patch_hypercall,
+	.patch_hypercall = vt_patch_hypercall,
 	.set_irq = vt_inject_irq,
 	.set_nmi = vt_inject_nmi,
-	.queue_exception = vmx_queue_exception,
+	.queue_exception = vt_queue_exception,
 	.cancel_injection = vt_cancel_injection,
 	.interrupt_allowed = vt_interrupt_allowed,
 	.nmi_allowed = vt_nmi_allowed,
@@ -527,39 +906,39 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_nmi_mask = vt_set_nmi_mask,
 	.enable_nmi_window = vt_enable_nmi_window,
 	.enable_irq_window = vt_enable_irq_window,
-	.update_cr8_intercept = vmx_update_cr8_intercept,
-	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
-	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
-	.load_eoi_exitmap = vmx_load_eoi_exitmap,
+	.update_cr8_intercept = vt_update_cr8_intercept,
+	.set_virtual_apic_mode = vt_set_virtual_apic_mode,
+	.set_apic_access_page_addr = vt_set_apic_access_page_addr,
+	.refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
+	.load_eoi_exitmap = vt_load_eoi_exitmap,
 	.apicv_post_state_restore = vt_apicv_post_state_restore,
 	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
-	.hwapic_irr_update = vmx_hwapic_irr_update,
-	.hwapic_isr_update = vmx_hwapic_isr_update,
-	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+	.hwapic_irr_update = vt_hwapic_irr_update,
+	.hwapic_isr_update = vt_hwapic_isr_update,
+	.guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
 	.sync_pir_to_irr = vt_sync_pir_to_irr,
 	.deliver_interrupt = vt_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
 	.apicv_has_pending_interrupt = vt_apicv_has_pending_interrupt,
 
-	.set_tss_addr = vmx_set_tss_addr,
-	.set_identity_map_addr = vmx_set_identity_map_addr,
-	.get_mt_mask = vmx_get_mt_mask,
+	.set_tss_addr = vt_set_tss_addr,
+	.set_identity_map_addr = vt_set_identity_map_addr,
+	.get_mt_mask = vt_get_mt_mask,
 
 	.get_exit_info = vt_get_exit_info,
 
-	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+	.vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
 
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
 
-	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
-	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
-	.write_tsc_offset = vmx_write_tsc_offset,
-	.write_tsc_multiplier = vmx_write_tsc_multiplier,
+	.get_l2_tsc_offset = vt_get_l2_tsc_offset,
+	.get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+	.write_tsc_offset = vt_write_tsc_offset,
+	.write_tsc_multiplier = vt_write_tsc_multiplier,
 
 	.load_mmu_pgd = vt_load_mmu_pgd,
 
-	.check_intercept = vmx_check_intercept,
+	.check_intercept = vt_check_intercept,
 	.handle_exit_irqoff = vt_handle_exit_irqoff,
 
 	.request_immediate_exit = vt_request_immediate_exit,
@@ -567,7 +946,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.sched_in = vt_sched_in,
 
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
-	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+	.update_cpu_dirty_logging = vt_update_cpu_dirty_logging,
 
 	.pmu_ops = &intel_pmu_ops,
 	.nested_ops = &vmx_nested_ops,
@@ -576,8 +955,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.start_assignment = vmx_pi_start_assignment,
 
 #ifdef CONFIG_X86_64
-	.set_hv_timer = vmx_set_hv_timer,
-	.cancel_hv_timer = vmx_cancel_hv_timer,
+	.set_hv_timer = vt_set_hv_timer,
+	.cancel_hv_timer = vt_cancel_hv_timer,
 #endif
 
 	.setup_mce = vmx_setup_mce,
@@ -587,8 +966,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.leave_smm = vt_leave_smm,
 	.enable_smi_window = vt_enable_smi_window,
 
-	.can_emulate_instruction = vmx_can_emulate_instruction,
-	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+	.can_emulate_instruction = vt_can_emulate_instruction,
+	.apic_init_signal_blocked = vt_apic_init_signal_blocked,
 	.migrate_timers = vmx_migrate_timers,
 
 	.msr_filter_changed = vmx_msr_filter_changed,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7bbf6271967b..55a6fd218fc7 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3,6 +3,7 @@
 #include <linux/mmu_context.h>
 
 #include <asm/fpu/xcr.h>
+#include <asm/virtext.h>
 #include <asm/tdx.h>
 
 #include "capabilities.h"
@@ -1717,6 +1718,49 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
 	vcpu->arch.smi_pending = false;
 }
 
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	/* Only x2APIC mode is supported for TD. */
+	WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
+int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	kvm_register_mark_available(vcpu, reg);
+	switch (reg) {
+	case VCPU_REGS_RSP:
+	case VCPU_REGS_RIP:
+	case VCPU_EXREG_PDPTR:
+	case VCPU_EXREG_CR0:
+	case VCPU_EXREG_CR3:
+	case VCPU_EXREG_CR4:
+		break;
+	default:
+		KVM_BUG_ON(1, vcpu->kvm);
+		break;
+	}
+}
+
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	return 0;
+}
+
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+	memset(var, 0, sizeof(*var));
+}
+
 static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 19d793609cc4..7cd29b586e43 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -163,6 +163,14 @@ int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
 int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
 int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
 void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+
+int tdx_get_cpl(struct kvm_vcpu *vcpu);
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
+bool tdx_is_emulated_msr(u32 index, bool write);
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -203,10 +211,19 @@ static inline void tdx_get_exit_info(
 static inline bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
 static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+
 static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
 static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate) { return 0; }
 static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate) { return 0; }
 static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
+static inline void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+
+static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
+static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0;}
+static inline void tdx_get_segment(
+	struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 103/104] Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (101 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
@ 2022-03-04 19:49 ` isaku.yamahata
  2022-03-04 19:50 ` [RFC PATCH v5 104/104] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU isaku.yamahata
                   ` (2 subsequent siblings)
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:49 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add documentation for Intel Trust Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/api.rst       |   9 +-
 Documentation/virt/kvm/intel-tdx.rst | 360 +++++++++++++++++++++++++++
 2 files changed, 368 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index b1e142719ec0..f86f547f2de4 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1384,6 +1384,9 @@ It is recommended to use this API instead of the KVM_SET_MEMORY_REGION ioctl.
 The KVM_SET_MEMORY_REGION does not allow fine grained control over memory
 allocation and is deprecated.
 
+For a TDX guest, deleting/moving a memory region loses the guest memory
+contents.  Read-only regions aren't supported.  Only as-id 0 is supported.
+
 
 4.36 KVM_SET_TSS_ADDR
 ---------------------
@@ -4539,7 +4542,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
 
 :Capability: basic
 :Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
 :Parameters: an opaque platform specific structure (in/out)
 :Returns: 0 on success; -1 on error
 
@@ -4551,6 +4554,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
 (SEV) commands on AMD Processors. The SEV commands are defined in
 Documentation/virt/kvm/amd-memory-encryption.rst.
 
+Currently, this ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel Processors. The TDX commands are defined in
+Documentation/virt/kvm/intel-tdx.rst.
+
 4.111 KVM_MEMORY_ENCRYPT_REG_REGION
 -----------------------------------
 
diff --git a/Documentation/virt/kvm/intel-tdx.rst b/Documentation/virt/kvm/intel-tdx.rst
new file mode 100644
index 000000000000..ec4381b0a26c
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx.rst
@@ -0,0 +1,360 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions, which isolates VMs from the
+virtual-machine manager (VMM)/hypervisor and any other software on the
+platform. [1]
+For details, the specifications [2], [3], [4], [5], [6], [7] are
+available.
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+  /* Trust Domain eXtension sub-ioctl() commands. */
+  enum kvm_tdx_cmd_id {
+          KVM_TDX_CAPABILITIES = 0,
+          KVM_TDX_INIT_VM,
+          KVM_TDX_INIT_VCPU,
+          KVM_TDX_INIT_MEM_REGION,
+          KVM_TDX_FINALIZE_VM,
+
+          KVM_TDX_CMD_NR_MAX,
+  };
+
+  struct kvm_tdx_cmd {
+          __u32 id;             /* tdx_cmd_id */
+          __u32 metadata;       /* sub command specific */
+          __u64 data;           /* sub command specific */
+  };
+
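+A sub-command is issued by filling struct kvm_tdx_cmd and passing it to the
+KVM_MEMORY_ENCRYPT_OP ioctl on the VM (or vcpu) file descriptor.  A minimal
+sketch, assuming the caller has already prepared a struct kvm_tdx_init_vm in
+init_vm:
+
+::
+
+  struct kvm_tdx_cmd cmd = {
+          .id       = KVM_TDX_INIT_VM,
+          .metadata = 0,
+          .data     = (__u64)(unsigned long)&init_vm,
+  };
+
+  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
+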
+
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: vm ioctl
+
+A subset of TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAM call is
+returned, which describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- metadata: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+
+::
+
+  struct kvm_tdx_cpuid_config {
+          __u32 leaf;
+          __u32 sub_leaf;
+          __u32 eax;
+          __u32 ebx;
+          __u32 ecx;
+          __u32 edx;
+  };
+
+  struct kvm_tdx_capabilities {
+          __u64 attrs_fixed0;
+          __u64 attrs_fixed1;
+          __u64 xfam_fixed0;
+          __u64 xfam_fixed1;
+
+          __u32 nr_cpuid_configs;
+          struct kvm_tdx_cpuid_config cpuid_configs[0];
+  };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX, which corresponds to the
+TDH.MNG.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VM
+- metadata: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- reserved: must be 0
+
+::
+
+  struct kvm_tdx_init_vm {
+          __u32 max_vcpus;
+          __u32 reserved;
+          __u64 attributes;
+          __u64 cpuid;  /* pointer to struct kvm_cpuid2 */
+          __u64 mrconfigid[6];          /* sha384 digest */
+          __u64 mrowner[6];             /* sha384 digest */
+          __u64 mrownerconfig[6];       /* sha384 digest */
+          __u64 reserved[43];           /* must be zero for future extensibility */
+  };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX, which corresponds to the
+TDH.VP.INIT TDX SEAM call.
+
+- id: KVM_TDX_INIT_VCPU
+- metadata: must be 0
+- data: initial value of the guest TD VCPU RCX
+
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a contiguous memory region, which corresponds to the
+TDH.MEM.PAGE.ADD TDX SEAM call.
+If the KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends the
+measurement, which corresponds to the TDH.MR.EXTEND TDX SEAM call.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- metadata: flags
+            currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+
+::
+
+  #define KVM_TDX_MEASURE_MEMORY_REGION   (1UL << 0)
+
+  struct kvm_tdx_init_mem_region {
+          __u64 source_addr;
+          __u64 gpa;
+          __u64 nr_pages;
+  };
+
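+A minimal sketch of adding and measuring the initial TD image; tdvf_buf,
+tdvf_gpa and tdvf_size are placeholders for a caller-provided page-aligned
+source buffer, destination GPA and image size:
+
+::
+
+  struct kvm_tdx_init_mem_region region = {
+          .source_addr = (__u64)(unsigned long)tdvf_buf,
+          .gpa         = tdvf_gpa,
+          .nr_pages    = tdvf_size / 4096,
+  };
+  struct kvm_tdx_cmd cmd = {
+          .id       = KVM_TDX_INIT_MEM_REGION,
+          .metadata = KVM_TDX_MEASURE_MEMORY_REGION,
+          .data     = (__u64)(unsigned long)&region,
+  };
+
+  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
+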
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete the measurement of the initial TD contents and mark the TD ready to
+run, which corresponds to the TDH.MR.FINALIZE TDX SEAM call.
+
+- id: KVM_TDX_FINALIZE_VM
+- metadata: ignored
+- data: ignored
+
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, the new TDX ioctls need to be called.  The
+control flow is as follows; a minimal usage sketch is shown after the list.
+
+#. system wide capability check
+  * KVM_CAP_VM_TYPES: check if VM type is supported and if TDX_VM_TYPE is
+    supported.
+
+#. creating VM
+  * KVM_CREATE_VM
+  * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+  * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+  * KVM_CREATE_VCPU
+  * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+
+#. initializing guest memory
+  * allocate guest memory and initialize pages the same as in the normal KVM
+    case.  In the TDX case, additionally parse and load TDVF into guest memory.
+  * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+    If the pages have initial contents as above, those pages need to be added.
+    Otherwise the contents will be lost and the guest sees zero pages.
+  * KVM_TDX_FINALIZE_VM: Finalize the VM and the measurement.
+    This must be after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
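+A minimal sketch of the corresponding ioctl sequence (error handling, memslot
+setup and TDVF loading are omitted; the cmd_* variables are struct kvm_tdx_cmd
+instances filled in as described above, and passing the TDX VM type to
+KVM_CREATE_VM follows the KVM_X86_TDX_VM name used elsewhere in this series):
+
+::
+
+  vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
+
+  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd_caps);        /* KVM_TDX_CAPABILITIES */
+  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd_init_vm);     /* KVM_TDX_INIT_VM */
+
+  vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
+  ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd_init_vcpu); /* KVM_TDX_INIT_VCPU */
+
+  /* set up guest memory and load TDVF, then: */
+  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd_init_mem);    /* KVM_TDX_INIT_MEM_REGION */
+  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd_finalize);    /* KVM_TDX_FINALIZE_VM */
+
+  ioctl(vcpu_fd, KVM_RUN, 0);
+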
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy (normal VMX) VMs and new TD VMs to
+coexist; otherwise the benefit of VM flexibility would be lost.  The main
+issue is that the logic of the kvm_x86_ops callbacks for TDX differs from
+that of VMX, while kvm_x86_ops is a single global variable, not per-VM and
+not per-vcpu.
+
+Several points to be considered:
+  . No or minimal overhead when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+  . Avoid the overhead of indirect calls via function pointers.
+  . Contain the changes under the arch/x86/kvm/vmx directory and share logic
+    with VMX for maintenance.
+    Even though the way to operate on a VM (VMX instructions vs TDX
+    SEAM calls) differs, the basic idea remains the same, so much of the
+    logic can be shared.
+  . Future maintenance
+    No huge change of kvm_x86_ops is expected in the (near) future, so
+    a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: The current choice
+  Introduce a dedicated file, arch/x86/kvm/vmx/main.c (the name main.c
+  is chosen simply to indicate the main entry points for callbacks), and
+  wrap all the callbacks with
+  "if (is-tdx) tdx-callback() else vmx-callback()".  A sketch of such a
+  wrapper appears below.
+
+  Pros:
+  - No major change in common x86 KVM code.  The change is (mostly)
+    contained under arch/x86/kvm/vmx/.
+  - When TDX is disabled (CONFIG_INTEL_TDX_HOST=n), the overhead is
+    optimized out.
+  - Micro-optimization by avoiding an extra indirect function call.
+  Cons:
+  - A lot of boilerplate in arch/x86/kvm/vmx/main.c.
+
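+A minimal sketch of such a wrapper (conceptually what main.c does; the exact
+callback shown is illustrative):
+
+::
+
+  static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+  {
+          if (is_td_vcpu(vcpu))
+                  return tdx_vcpu_create(vcpu);
+          return vmx_vcpu_create(vcpu);
+  }
+
+  struct kvm_x86_ops vt_x86_ops __initdata = {
+          /* ... */
+          .vcpu_create = vt_vcpu_create,
+          /* ... */
+  };
+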
+Alternative:
+- Introduce another callback layer under arch/x86/kvm/vmx.
+  Pros:
+  - No major change in common x86 KVM code. The change is (mostly)
+    contained under arch/x86/kvm/vmx/.
+  - clear separation on callbacks.
+  Cons:
+  - Overhead in VMX even when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+
+- Allow per-VM kvm_x86_ops callbacks instead of global kvm_x86_ops
+  Pros:
+  - clear separation on callbacks.
+  Cons:
+  - Big change in common x86 code.
+  - Overhead in common code even when TDX is
+    disabled (CONFIG_INTEL_TDX_HOST=n).
+
+- Introduce new directory arch/x86/kvm/tdx
+  Pros:
+  - It clarifies that TDX is different from VMX.
+  Cons:
+  - Given how much code is shared with VMX, a separate directory complicates
+    that sharing.
+
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle Secure/Shared-EPT.  The
+high-level execution flow is mostly the same as the normal EPT case:
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution (or emulate MMIO).
+The difference is that the S-EPT is operated on (read/write) via TDX SEAM
+calls, which are expensive, instead of directly reading/writing EPT entries.
+One bit of the GPA (bit 51 or 47) is repurposed to indicate shared with the
+host (if set to 1) or private to the TD (if cleared to 0).
+
+- The current implementation
+  . Reuse the existing MMU code with minimal updates, because the
+    execution flow is mostly the same.  An additional operation, the TDX
+    call for the S-EPT, is needed, so add hooks for it to kvm_x86_ops.
+  . For performance, minimize the TDX SEAM calls used to operate on the
+    S-EPT.  When getting the corresponding S-EPT pages/entry from the
+    faulting GPA, don't use a TDX SEAM call to read the S-EPT entry.
+    Instead, create a shadow copy in host memory: repurpose the existing
+    kvm_mmu_page as the shadow copy of the S-EPT and associate the S-EPT
+    with it.
+  . Treat the shared bit as an attribute.  Mask/unmask the bit where
+    necessary to keep the existing traversal code working (see the sketch
+    after this list).
+    Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+    for the special case:
+    = 0 for the non-TDX case
+    = bit 51 or 47 set for the TDX case.
+
+  Pros:
+  - Large code reuse with minimal new hooks.
+  - The execution path stays the same.
+  Cons:
+  - Complicates the existing code.
+  - Repurposing kvm_mmu_page as a shadow of the Secure-EPT can be confusing.
+
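+A sketch of the shared-bit masking helpers (helper names are illustrative, not
+necessarily the ones used in the patches):
+
+::
+
+  /* kvm->arch.gfn_shared_mask is 0 for non-TDX VMs. */
+  static inline gfn_t kvm_gfn_private(const struct kvm *kvm, gfn_t gfn)
+  {
+          return gfn & ~kvm->arch.gfn_shared_mask;        /* clear shared bit */
+  }
+
+  static inline gfn_t kvm_gfn_shared(const struct kvm *kvm, gfn_t gfn)
+  {
+          return gfn | kvm->arch.gfn_shared_mask;         /* set shared bit */
+  }
+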
+Alternative:
+- Replace direct read/write on EPT entry with TDX-SEAM call by
+  introducing callbacks on EPT entry.
+  Pros:
+  - Straightforward.
+  Cons:
+  - Too many touch points.
+  - Too slow due to TDX-SEAM calls.
+  - Overhead even when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+
+- Sprinkle "if (is-tdx)" for TDX special case
+  Pros:
+  - Straightforward.
+  Cons:
+  - The result is non-generic and ugly.
+  - Puts TDX-specific logic into common KVM MMU code.
+
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs.  The operations on TD
+VMs are specific to TDX.
+
+- Piggyback on and repurpose KVM_MEMORY_ENCRYPT_OP
+  Although not every operation is memory encryption, repurpose it to carry
+  the TDX-specific ioctls.
+  Pros:
+  - No major change in common x86 KVM code.
+  Cons:
+  - The operations aren't actually memory encryption, but operations
+    on TD VMs.
+
+Alternative:
+- Introduce new ioctl for guest protection like
+  KVM_GUEST_PROTECTION_OP and introduce subcommand for TDX.
+  Pros:
+  - Clean name.
+  Cons:
+  - One more new ioctl for guest protection.
+  - Confusion between KVM_MEMORY_ENCRYPT_OP and KVM_GUEST_PROTECTION_OP.
+
+- Rename KVM_MEMORY_ENCRYPT_OP to KVM_GUEST_PROTECTION_OP and keep
+  KVM_MEMORY_ENCRYPT_OP as the same value for uapi compatibility, i.e.
+  "#define KVM_MEMORY_ENCRYPT_OP KVM_GUEST_PROTECTION_OP".
+  Pros:
+  - A more suitable name without introducing a new ioctl.
+  Cons:
+  - May confuse existing user programs.
+
+
+References
+==========
+
+.. [1] TDX specification
+   https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+   kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+   TDX guest branch: https://github.com/intel/tdx/tree/guest
+.. [9] tdvf
+    https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+     Enable Hardware Isolated VMs
+     https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+     Architectural Extensions for Hardware Virtual Machine Isolation
+     to Advance Confidential Computing in Public Clouds - Ravi Sahita
+     & Jun Nakajima, Intel Corporation
+     https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+     https://lkml.org/lkml/2020/10/20/66
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* [RFC PATCH v5 104/104] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (102 preceding siblings ...)
  2022-03-04 19:49 ` [RFC PATCH v5 103/104] Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX) isaku.yamahata
@ 2022-03-04 19:50 ` isaku.yamahata
  2022-03-07  7:44 ` [RFC PATCH v5 000/104] KVM TDX basic feature support Christoph Hellwig
  2022-04-15 15:18 ` Paolo Bonzini
  105 siblings, 0 replies; 310+ messages in thread
From: isaku.yamahata @ 2022-03-04 19:50 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a high level design document on TDX changes to TDP MMU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/tdx-tdp-mmu.rst | 466 +++++++++++++++++++++++++
 1 file changed, 466 insertions(+)
 create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst

diff --git a/Documentation/virt/kvm/tdx-tdp-mmu.rst b/Documentation/virt/kvm/tdx-tdp-mmu.rst
new file mode 100644
index 000000000000..3f69d178f2e4
--- /dev/null
+++ b/Documentation/virt/kvm/tdx-tdp-mmu.rst
@@ -0,0 +1,466 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Design of TDP MMU for TDX support
+=================================
+This document describes a (high-level) design for TDX support in the TDP MMU of
+x86 KVM.
+
+In this document, we use "TD" or "guest TD" to differentiate it from the current
+"VM" (Virtual Machine), which is supported by KVM today.
+
+
+Background of TDX
+=================
+TD private memory is designed to hold TD private content, encrypted by the CPU
+using the TD ephemeral key.  An encryption engine holds a table of encryption
+keys, and an encryption key is selected for each memory transaction based on a
+Host Key Identifier (HKID).  By design, the host VMM does not have access to the
+encryption keys.
+
+In the first generation of MKTME, HKID is "stolen" from the physical address by
+allocating a configurable number of bits from the top of the physical address.
+The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and
+private HKIDs for SEAM-mode-only accesses.  We use 0 for the shared HKID on the
+host so that MKTME can be opaque or bypassed on the host.
+
+During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
+as either shared or private, based on the value of a new SHARED bit in the Guest
+Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
+(Extended Page Table) or "Shared EPT" (in this document), which resides in the
+host VMM memory.  The Shared EPT is directly managed by the host VMM - the same
+as with the current VMX.  Since guest TDs usually require I/O and the data
+exchange needs to be done via shared memory, KVM needs to use the current
+EPT functionality even for TDs.
+
+The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
+pages are encrypted and integrity-protected with the TD's ephemeral private key.
+Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface
+functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT
+because not all functionalities are available.
+
+Since the execution of such interface functions takes much longer than
+accessing memory directly, in KVM we use the existing TDP code to mirror the
+Secure EPT for the TD.  And we think there are at least two options today in
+terms of the timing for executing such SEAMCALLs:
+
+1. synchronous, i.e. while walking the TDP page tables, or
+2. post-walk, i.e. record what needs to be done to the real Secure EPT during
+   the walk, and execute SEAMCALLs later.
+
+Option 1 seems to be more intuitive and simpler, but the Secure EPT
+concurrency rules are different from those of the TDP or EPT.  For example,
+TDH.MEM.SEPT.RD() acquires shared access to the whole Secure EPT tree of the
+target TD.
+
+Secure EPT(SEPT) operations
+---------------------------
+Secure EPT is an Extended Page Table for GPA-to-HPA translation of TD private
+GPAs.  A Secure EPT is designed to be encrypted with the TD's ephemeral private
+key.  SEPT pages are allocated by the host VMM via Intel TDX functions, but
+their content is intended to be hidden and is not architectural.
+
+Unlike the conventional EPT, the host VMM can't directly read/write its
+entries.  Instead, the TDX SEAMCALL API is used.  Several SEAMCALLs correspond
+to operations on EPT entries.
+
+* TDH.MEM.SEPT.ADD():
+  Add a secure EPT page to the secure EPT tree.  This corresponds to updating
+  a non-leaf EPT entry with the present bit set.
+
+* TDH.MEM.SEPT.REMOVE():
+  Remove a secure EPT page from the secure EPT tree.  There is no
+  corresponding EPT operation.
+
+* TDH.MEM.SEPT.RD():
+  Read a secure EPT entry.  This corresponds to reading an EPT entry as
+  memory.  Please note that this is much slower than a direct memory read.
+
+* TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
+  Add a private page to the secure EPT tree.  This corresponds to updating a
+  leaf EPT entry with the present bit set.
+
+* TDH.MEM.PAGE.REMOVE():
+  Remove a private page from the secure EPT tree.  There is no corresponding
+  EPT operation.
+
+* TDH.MEM.RANGE.BLOCK():
+  This (mostly) corresponds to clearing the present bit of a leaf EPT entry.
+  Note that the private page is still linked in the secure EPT.  To remove it
+  from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() need to
+  be called.
+
+* TDH.MEM.TRACK():
+  Increment the TLB epoch counter.  This (mostly) corresponds to an EPT TLB
+  flush.  Note that the private page is still linked in the secure EPT.  To
+  remove it from the secure EPT, TDH.MEM.PAGE.REMOVE() needs to be called.
+
+
+Adding private page
+-------------------
+The procedure of populating the private page looks as follows.
+
+1. TDH.MEM.SEPT.ADD(512G level)
+2. TDH.MEM.SEPT.ADD(1G level)
+3. TDH.MEM.SEPT.ADD(2M level)
+4. TDH.MEM.PAGE.AUG(4K level)
+
+Those operations correspond to updating the EPT entries.
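+
+A condensed sketch of the sequence above (helper names and the pre-allocated
+Secure EPT pages are illustrative, not the actual hooks):
+
+::
+
+  static void tdx_sept_map_private_gfn(struct kvm *kvm, gfn_t gfn, hpa_t hpa)
+  {
+          /* Build the missing non-leaf Secure EPT pages, top-down. */
+          tdh_mem_sept_add(kvm, gfn, PG_LEVEL_512G, sept_page_512g);
+          tdh_mem_sept_add(kvm, gfn, PG_LEVEL_1G, sept_page_1g);
+          tdh_mem_sept_add(kvm, gfn, PG_LEVEL_2M, sept_page_2m);
+          /* Install the 4K leaf: AUG after finalization, ADD during TD build. */
+          tdh_mem_page_aug(kvm, gfn, PG_LEVEL_4K, hpa);
+  }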
+
+Dropping private page and TLB shootdown
+---------------------------------------
+The procedure of dropping the private page looks as follows (a condensed
+sketch follows the steps).
+
+1. TDH.MEM.RANGE.BLOCK(4K level)
+   This mostly corresponds to clearing the present bit in the EPT entry.  It
+   prevents (blocks) new TLB entries from being created in the future.  Note
+   that the private page is still linked in the secure EPT tree and existing
+   TLB entries aren't flushed.
+2. TDH.MEM.TRACK(range) and TLB shootdown
+   This mostly corresponds to the EPT TLB shootdown.  Because all vcpus share
+   the same Secure EPT, all vcpus need to flush their TLBs.
+   * TDH.MEM.TRACK(range) by one vcpu.  It increments the global internal TLB
+     epoch counter.
+   * Send an IPI to the remote vcpus.
+   * The other vcpus exit from the guest TD to the VMM and then re-enter via
+     TDH.VP.ENTER().
+   * TDH.VP.ENTER() checks the TLB epoch counter and, if the vcpu's TLB is
+     stale, flushes it.
+   Note that only a single vcpu issues TDH.MEM.TRACK().
+   Note that the private page is still linked in the secure EPT tree, unlike
+   the conventional EPT.
+3. TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or
+   TDH.MEM.PAGE.REMOVE()
+   There is no corresponding operation in the conventional EPT.
+   * When changing the page size (e.g. 4K <-> 2M), TDH.MEM.PAGE.PROMOTE() or
+     TDH.MEM.PAGE.DEMOTE() is used.  During those operations, the guest page is
+     kept referenced in the Secure EPT.
+   * When migrating a page, TDH.MEM.PAGE.RELOCATE() is used.  It requires both
+     the source page and the destination page.
+   * When destroying the TD, TDH.MEM.PAGE.REMOVE() removes the private page
+     from the secure EPT tree.  In this case a TLB shootdown is not needed
+     because the vcpus don't run anymore.
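+
+A condensed sketch of the steps above (helper names are illustrative wrappers
+around the SEAMCALLs, not the actual KVM functions):
+
+::
+
+  static void tdx_sept_drop_private_gfn(struct kvm *kvm, gfn_t gfn, hpa_t hpa)
+  {
+          tdh_mem_range_block(kvm, gfn, PG_LEVEL_4K); /* ~ clear present bit */
+          tdh_mem_track(kvm);                         /* bump the TLB epoch */
+          kvm_flush_remote_tlbs(kvm);                 /* IPI, vcpus re-enter */
+          tdh_mem_page_remove(kvm, gfn, PG_LEVEL_4K, hpa); /* unlink the page */
+  }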
+
+The basic idea for TDX support
+==============================
+Because shared EPT is the same as the existing EPT, use the existing logic for
+shared EPT.  On the other hand, secure EPT requires additional operations
+instead of directly reading/writing of the EPT entry.
+
+On an EPT violation, the KVM MMU walks down the EPT tree from the root,
+determines the EPT entry to operate on, and updates the entry.  If necessary, a
+TLB shootdown is done.  Because it's very slow to walk the secure EPT directly
+via the TDX SEAMCALL TDH.MEM.SEPT.RD(), a mirror of the secure EPT is created
+and maintained.  Hooks are added to the KVM MMU to reuse the existing code.
+
+EPT violation on shared GPA
+---------------------------
+(1) EPT violation on shared GPA or zapping shared GPA
+    walk down shared EPT tree (the existing code)
+        |
+        |
+        V
+shared EPT tree (CPU refers.)
+(2) update the EPT entry. (the existing code)
+    TLB shootdown in the case of zapping.
+
+
+EPT violation on private GPA
+----------------------------
+(1) EPT violation on private GPA or zapping private GPA
+    walk down the mirror of secure EPT tree (mostly same as the existing code)
+        |
+        |
+        V
+mirror of secure EPT tree (KVM MMU software only. reuse of the existing code)
+(2) update the (mirrored) EPT entry. (mostly same as the existing code)
+(3) call the hooks with what EPT entry is changed
+        |
+        NEW: hooks in KVM MMU
+        |
+        V
+secure EPT root(CPU refers)
+(4) the TDX backend calls necessary TDX SEAMCALLs to update real secure EPT.
+
+The major modification is to add hooks for the TDX backend for additional
+operations and to pass down which EPT, shared EPT, or private EPT is used, and
+twist the behavior if we're operating on private EPT.
+
+The following depicts the relationship.
+::
+
+                    KVM                             |       TDX module
+                     |                              |           |
+        -------------+----------                    |           |
+        |                      |                    |           |
+        V                      V                    |           |
+     shared GPA           private GPA               |           |
+  CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
+        |                      |                    |           |
+        |                      |                    |           |
+        V                      V                    |           V
+  shared EPT                private EPT<-------mirror----->Secure EPT
+        |                      |                    |           |
+        |                      \--------------------+------\    |
+        |                                           |      |    |
+        V                                           |      V    V
+  shared guest page                                 |    private guest page
+                                                    |
+                                                    |
+                              non-encrypted memory  |    encrypted memory
+                                                    |
+
+shared EPT: CPU and KVM walk with shared GPA
+            Maintained by the existing code
+private EPT: KVM walks with private GPA
+             Maintained by the twisted existing code
+secure EPT: CPU walks with private GPA.
+            Maintained by TDX module with TDX SEAMCALLs via hooks
+
+
+Tracking private EPT page
+=========================
+Shared EPT pages are managed by struct kvm_mmu_page.  They are linked in a list
+structure.  When necessary, the list is traversed to operate on them.  Private
+EPT pages have different characteristics.  For example, private pages can't be
+swapped out.  When shrinking memory, we'd like to traverse only shared EPT
+pages and skip private EPT pages.  Likewise, page migration isn't supported for
+private pages (yet).  Introduce an additional list so that shared EPT pages and
+private EPT pages are tracked independently.
+
+At the beginning of an EPT violation, the fault handler knows the faulting GPA,
+and thus which EPT to operate on, private or shared.  If it's the private EPT,
+an additional task is done, something like "if (private) { callback a hook }".
+Since the fault handler goes through deep function calls, it's cumbersome to
+carry the information of which EPT is being operated on.  Options to mitigate
+this are:
+
+1. Pass the information as an argument for the function call.
+2. Record the information in struct kvm_mmu_page somehow.
+3. Record the information in vcpu structure.
+
+Option 2 was chosen because option 1 requires modifying all the functions,
+which would adversely affect the normal case.  Option 3 doesn't work well
+because in some cases we need to walk both the private and the shared EPT.
+
+The role of the EPT page can be utilized and one bit can be carved out from the
+unused bits in struct kvm_mmu_page_role.  When allocating the EPT page,
+initialize the information.  struct kvm_mmu_page is mostly available because
+we're operating on EPT pages.
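+
+A sketch of carving out the bit (the field name is illustrative):
+
+::
+
+  union kvm_mmu_page_role {
+          u32 word;
+          struct {
+                  unsigned level:4;
+                  /* ... existing fields ... */
+                  unsigned is_private:1;  /* page mirrors the Secure EPT */
+          };
+  };
+
+  static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+  {
+          return sp->role.is_private;
+  }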
+
+
+The conversion of private GPA and shared GPA
+============================================
+A page at a given GPA can be assigned as either a private GPA or a shared GPA,
+but not both at the same time.  The GPA can't be accessed simultaneously via
+both the private GPA and the shared GPA.  On guest startup, all the GPAs are
+assigned as private.  The guest converts a range of GPAs from private (or
+shared) to shared (or private) with the MapGPA hypercall.  The MapGPA hypercall
+takes the start GPA and the size of the region.  If the given start GPA is
+shared, the VMM converts the region into shared (if it's already shared, nop).
+If the start GPA is private, the VMM converts the region into private.  It
+implies the guest won't access the private (or shared) region after converting
+it to shared (or private).
+
+If the guest TD triggers an EPT violation on an already converted region, the
+access won't be allowed (it loops in EPT violation) until another vcpu converts
+the region back.
+
+The KVM MMU records which kind of GPA access is allowed, private or shared.  It
+steals a software-usable bit from the MMU present mask: SEPT_PRIVATE_PROHIBIT.
+The bit is recorded in both the shared EPT and the mirror of the secure EPT.
+
+* If SEPT_PRIVATE_PROHIBIT is cleared in the shared EPT and the mirror of the
+  secure EPT:
+  Private GPA access is allowed.  Shared GPA access is not allowed.
+
+* If SEPT_PRIVATE_PROHIBIT is set in the shared EPT and the mirror of the
+  secure EPT:
+  Private GPA access is not allowed.  Shared GPA access is allowed.
+
+The default is that SEPT_PRIVATE_PROHIBIT is cleared so that the existing KVM
+MMU code (mostly) works.
+
+The reason the bit is recorded in both the shared and the private EPT is to
+optimize the EPT violation path at the cost of the MapGPA hypercall path.
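+
+A sketch of the bit and its check (the exact bit position is illustrative):
+
+::
+
+  /* A software-available bit in a non-present SPTE. */
+  #define SEPT_PRIVATE_PROHIBIT   BIT_ULL(62)
+
+  static inline bool spte_private_prohibit(u64 spte)
+  {
+          return !!(spte & SEPT_PRIVATE_PROHIBIT);
+  }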
+
+The state machine of EPT entry
+------------------------------
+The combinations of (private EPT entry, shared EPT entry) are enumerated below;
+a MapGPA conversion sketch follows the state machine.
+
+(private EPT entry, shared EPT entry) =
+        (non-present, non-present):             private mapping is allowed
+        (present, non-present):                 private mapping is mapped
+        (non-present | PRIVATE_PROHIBIT, non-present | PRIVATE_PROHIBIT):
+                                                shared mapping is allowed
+        (non-present | PRIVATE_PROHIBIT, present | PRIVATE_PROHIBIT):
+                                                shared mapping is mapped
+        (present | PRIVATE_PROHIBIT, any)       invalid combination
+
+* map_gpa(private GPA): Mark the region where private GPA access is allowed (NEW)
+        private EPT entry: clear PRIVATE_PROHIBIT
+          present: nop
+          non-present: nop
+          non-present | PRIVATE_PROHIBIT -> non-present (clear PRIVATE_PROHIBIT)
+
+        shared EPT entry: zap the entry, clear PRIVATE_PROHIBIT
+          present: invalid
+          non-present -> non-present: nop
+          present | PRIVATE_PROHIBIT -> non-present
+          non-present | PRIVATE_PROHIBIT -> non-present
+
+* map_gpa(shared GPA): Mark the region where shared GPA access is allowed (NEW)
+        private EPT entry: zap and set PRIVATE_PROHIBIT
+          present     -> non-present | PRIVATE_PROHIBIT
+          non-present -> non-present | PRIVATE_PROHIBIT
+          non-present | PRIVATE_PROHIBIT: nop
+
+        shared EPT entry: set PRIVATE_PROHIBIT
+          present: invalid
+          non-present -> non-present | PRIVATE_PROHIBIT
+          present | PRIVATE_PROHIBIT -> present | PRIVATE_PROHIBIT: nop
+          non-present | PRIVATE_PROHIBIT -> non-present | PRIVATE_PROHIBIT: nop
+
+* map(private GPA)
+        private EPT entry
+          present: nop
+          non-present -> present
+          non-present | PRIVATE_PROHIBIT: nop. looping on EPT violation(NEW)
+
+        shared EPT entry: nop
+
+* map(shared GPA)
+        private EPT entry: nop
+
+        shared EPT entry
+          present: invalid
+          present | PRIVATE_PROHIBIT: nop
+          non-present | PRIVATE_PROHIBIT -> present | PRIVATE_PROHIBIT
+          non-present: nop. looping on EPT violation(NEW)
+
+* zap(private GPA)
+        private EPT entry: zap the entry with keeping PRIVATE_PROHIBIT
+          present -> non-present
+          present | PRIVATE_PROHIBIT: invalid
+          non-present: nop as is_shadow_present_pte() is checked
+          non-present | PRIVATE_PROHIBIT: nop as is_shadow_present_pte() is
+                                          checked
+
+        shared EPT entry: nop
+
+* zap(shared GPA)
+        private EPT entry: nop
+
+        shared EPT entry: zap
+          any -> non-present
+          present: invalid
+          present | PRIVATE_PROHIBIT -> non-present | PRIVATE_PROHIBIT
+          non-present: nop as is_shadow_present_pte() is checked
+          non-present | PRIVATE_PROHIBIT: nop as is_shadow_present_pte() is
+                                          checked
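+
+A sketch of the MapGPA(shared GPA) conversion following the state machine above
+(helper names are illustrative, not actual KVM functions):
+
+::
+
+  static void map_gpa_to_shared(struct kvm *kvm, gfn_t start, gfn_t end)
+  {
+          /* Private EPT: zap any present mapping and set PRIVATE_PROHIBIT. */
+          zap_private_range_set_prohibit(kvm, start, end);
+          /*
+           * Shared EPT: only set PRIVATE_PROHIBIT; the shared mapping is
+           * created on the next EPT violation for the shared GPA.
+           */
+          set_prohibit_shared_range(kvm, start, end);
+  }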
+
+
+The original TDP MMU and race condition
+=======================================
+Because vcpus share the EPT, once an EPT entry is zapped, we need to shoot down
+the TLB: send an IPI to the remote vcpus and have them flush their own TLBs.
+Until the TLB shootdown is done, vcpus may still reference the zapped guest
+page.
+
+The TDP MMU takes the read lock of mmu_lock to mitigate vcpu contention.  With
+the read lock held, it relies on atomic updates of the EPT entries.  (The
+legacy MMU, on the other hand, takes the write lock.)  When a vcpu is
+populating/zapping an EPT entry with the read lock held, another vcpu may be
+populating or zapping the same EPT entry at the same time.
+
+To avoid this race condition, the entry is frozen: the EPT entry is set to the
+special value REMOVED_SPTE, which clears the present bit.  Then, after the TLB
+shootdown, the EPT entry is updated to its final value.
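+
+A sketch of the freeze step (abbreviated from the TDP MMU; the helper name is
+illustrative):
+
+::
+
+  static bool try_freeze_spte(u64 *sptep, u64 old_spte)
+  {
+          /* Atomically replace the entry with the non-present REMOVED_SPTE. */
+          return cmpxchg64(sptep, old_spte, REMOVED_SPTE) == old_spte;
+  }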
+
+Concurrent zapping
+------------------
+1. read lock
+2. freeze the EPT entry (atomically set the value to REMOVED_SPTE)
+   If another vcpu has already frozen the entry, restart the page fault.
+3. TLB shootdown
+   * send IPI to remote vcpus
+   * TLB flush (local and remote)
+   A TLB shootdown is needed for each entry update because of the
+   concurrency.
+4. atomically set the EPT entry to the final value
+5. read unlock
+
+Concurrent populating
+---------------------
+In the case of populating the non-present EPT entry, atomically update the EPT
+entry.
+1. read lock
+2. atomically update the EPT entry
+   If another vcpu has frozen or updated the entry, restart the page fault.
+3. read unlock
+
+In the case of updating a present EPT entry (e.g. page migration), the
+operation is split into two steps: zapping the entry and then populating it.
+1. read lock
+2. zap the EPT entry, following the concurrent zapping case above.
+3. populate the non-present EPT entry.
+4. read unlock
+
+Non-concurrent batched zapping
+------------------------------
+In some cases, zapping a range of entries is done exclusively with the write
+lock held.  In this case, the TLB shootdowns are batched into one.
+
+1. write lock
+2. zap the EPT entries by traversing them
+3. TLB shootdown
+4. write unlock
+
+
+For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored
+EPT entry.
+
+TDX concurrent zapping
+----------------------
+Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
+
+1. read lock
+2. freeze the EPT entry (set the value to REMOVED_SPTE)
+3. TLB shootdown via a hook
+   * TDH.MEM.RANGE.BLOCK()
+   * TDH.MEM.TRACK()
+   * send IPI to remote vcpus
+4. set the EPT entry to the final value
+5. read unlock
+
+TDX concurrent populating
+-------------------------
+TDX SEAMCALLs are required in addition to operating on the mirrored EPT entry.
+The entry is frozen, following the zapping case, to avoid the race condition,
+and a hook is added (a sketch of such a hook follows the steps below).
+
+1. read lock
+2. freeze the EPT entry
+3. hook
+   * TDH_MEM_SEPT_ADD() for non-leaf or TDH_MEM_PAGE_AUG() for leaf.
+4. set the EPT entry to the final value
+5. read unlock
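+
+A sketch of such a populate hook (names are illustrative):
+
+::
+
+  static void tdx_sept_set_spte_hook(struct kvm *kvm, gfn_t gfn, int level,
+                                     kvm_pfn_t pfn, bool is_leaf)
+  {
+          if (is_leaf)
+                  tdh_mem_page_aug(kvm, gfn, level, pfn_to_hpa(pfn));
+          else
+                  tdh_mem_sept_add(kvm, gfn, level, pfn_to_hpa(pfn));
+  }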
+
+Without freezing the entry, the following race can happen.  Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
+
+
+TDX non-concurrent batched zapping
+----------------------------------
+For simplicity, the procedure of concurrent populating is utilized.  The
+procedure can be optimized later.
+
+
+Co-existing with unmapping guest private memory
+===============================================
+TODO.  This needs to be addressed.
+
+
+Restrictions or future work
+===========================
+The following features aren't supported yet.
+
+* optimizing non-concurrent zap
+* Large page
+* Page migration
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 000/104] KVM TDX basic feature support
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (103 preceding siblings ...)
  2022-03-04 19:50 ` [RFC PATCH v5 104/104] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU isaku.yamahata
@ 2022-03-07  7:44 ` Christoph Hellwig
  2022-03-13 14:00   ` Paolo Bonzini
  2022-04-15 15:18 ` Paolo Bonzini
  105 siblings, 1 reply; 310+ messages in thread
From: Christoph Hellwig @ 2022-03-07  7:44 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

A series of 104 patches is completely unreviewable, please split it into
reasonable chunks.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
  2022-03-04 19:48 ` [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
@ 2022-03-13 13:45   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 13:45 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> KVM accesses Virtual Machine Control Structure (VMCS) with VMX instructions
> to operate on VM.  TDX defines its data structure and TDX SEAMCALL APIs for
> VMM to operate on Trust Domain (TD) instead.
> 
> Trust Domain Virtual Processor State (TDVPS) is the root control structure
> of a TD VCPU.  It helps the TDX module control the operation of the VCPU,
> and holds the VCPU state while the VCPU is not running. TDVPS is opaque to
> software and DMA access, accessible only by using the TDX module interface
> functions (such as TDH.VP.RD, TDH.VP.WR ,..).  TDVPS includes TD VMCS, and
> TD VMCS auxiliary structures, such as virtual APIC page, virtualization
> exception information, etc.  TDVPS is composed of Trust Domain Virtual
> Processor Root (TDVPR) which is the root page of TDVPS and Trust Domain
> Virtual Processor eXtension (TDVPX) pages which extend TDVPR to help
> provide enough physical space for the logical TDVPS structure.
> 
> Also, we have a new structure, Trust Domain Control Structure (TDCS) is the
> main control structure of a guest TD, and encrypted (using the guest TD's
> ephemeral private key).  At a high level, TDCS holds information for
> controlling TD operation as a whole, execution, EPTP, MSR bitmaps, etc. KVM
> needs to set it up.  Note that MSR bitmaps are held as part of TDCS (unlike
> VMX) because they are meant to have the same value for all VCPUs of the
> same TD.  TDCS is a multi-page logical structure composed of multiple Trust
> Domain Control Extension (TDCX) physical pages.  Trust Domain Root (TDR) is
> the root control structure of a guest TD and is encrypted using the TDX
> global private key. It holds a minimal set of state variables that enable
> guest TD control even during times when the TD's private key is not known,
> or when the TD's key management state does not permit access to memory
> encrypted using the TD's private key.
> 
> The following shows the relationship between those structures.
> 
>          TDR--> TDCS                     per-TD
>           |       \--> TDCX
>           \
>            \--> TDVPS                    per-TD VCPU
>                   \--> TDVPR and TDVPX
> 
> The existing global struct kvm_x86_ops already defines an interface which
> fits with TDX.  But kvm_x86_ops is system-wide, not per-VM structure.  To
> allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
> "if (tdx) tdx_op() else vmx_op()" to switch VMX or TDX at run time.
> 
> To split the runtime switch, the VMX implementation, and the TDX
> implementation, add main.c, and move out the vmx_x86_ops hooks in
> preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
> both VMs and TDs.  Use 'vt' for the naming scheme as a nod to VT-x and as a
> concatenation of VmxTdx.
> 
> The current code looks as follows.
> In vmx.c
>    static vmx_op() { ... }
>    static struct kvm_x86_ops vmx_x86_ops = {
>          .op = vmx_op,
>    initialization code
> 
> The eventually converted code will look like
> In vmx.c, keep the VMX operations.
>    vmx_op() { ... }
>    VMX initialization
> In tdx.c, define the TDX operations.
>    tdx_op() { ... }
>    TDX initialization
> In x86_ops.h, declare the VMX and TDX operations.
>    vmx_op();
>    tdx_op();
> In main.c, define common wrappers for VMX and TDX.
>    static vt_ops() { if (tdx) tdx_ops() else vmx_ops() }
>    static struct kvm_x86_ops vt_x86_ops = {
>          .op = vt_op,
>    initialization to call VMX and TDX initialization
> 
> Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
> vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().
> 
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
>   arch/x86/kvm/Makefile      |   2 +-
>   arch/x86/kvm/vmx/main.c    | 154 ++++++++++++++++
>   arch/x86/kvm/vmx/vmx.c     | 360 +++++++++++--------------------------
>   arch/x86/kvm/vmx/x86_ops.h | 126 +++++++++++++
>   4 files changed, 385 insertions(+), 257 deletions(-)
>   create mode 100644 arch/x86/kvm/vmx/main.c
>   create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 30f244b64523..ee4d0999f20f 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -22,7 +22,7 @@ kvm-$(CONFIG_X86_64) += mmu/tdp_iter.o mmu/tdp_mmu.o
>   kvm-$(CONFIG_KVM_XEN)	+= xen.o
>   
>   kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
> -			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o
> +			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>   kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
>   
>   kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> new file mode 100644
> index 000000000000..b08ea9c42a11
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -0,0 +1,154 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/moduleparam.h>
> +
> +#include "x86_ops.h"
> +#include "vmx.h"
> +#include "nested.h"
> +#include "pmu.h"
> +
> +struct kvm_x86_ops vt_x86_ops __initdata = {
> +	.name = "kvm_intel",
> +
> +	.hardware_unsetup = vmx_hardware_unsetup,
> +
> +	.hardware_enable = vmx_hardware_enable,
> +	.hardware_disable = vmx_hardware_disable,
> +	.cpu_has_accelerated_tpr = report_flexpriority,
> +	.has_emulated_msr = vmx_has_emulated_msr,
> +
> +	.vm_size = sizeof(struct kvm_vmx),
> +	.vm_init = vmx_vm_init,
> +
> +	.vcpu_create = vmx_vcpu_create,
> +	.vcpu_free = vmx_vcpu_free,
> +	.vcpu_reset = vmx_vcpu_reset,
> +
> +	.prepare_guest_switch = vmx_prepare_switch_to_guest,
> +	.vcpu_load = vmx_vcpu_load,
> +	.vcpu_put = vmx_vcpu_put,
> +
> +	.update_exception_bitmap = vmx_update_exception_bitmap,
> +	.get_msr_feature = vmx_get_msr_feature,
> +	.get_msr = vmx_get_msr,
> +	.set_msr = vmx_set_msr,
> +	.get_segment_base = vmx_get_segment_base,
> +	.get_segment = vmx_get_segment,
> +	.set_segment = vmx_set_segment,
> +	.get_cpl = vmx_get_cpl,
> +	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
> +	.set_cr0 = vmx_set_cr0,
> +	.is_valid_cr4 = vmx_is_valid_cr4,
> +	.set_cr4 = vmx_set_cr4,
> +	.set_efer = vmx_set_efer,
> +	.get_idt = vmx_get_idt,
> +	.set_idt = vmx_set_idt,
> +	.get_gdt = vmx_get_gdt,
> +	.set_gdt = vmx_set_gdt,
> +	.set_dr7 = vmx_set_dr7,
> +	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
> +	.cache_reg = vmx_cache_reg,
> +	.get_rflags = vmx_get_rflags,
> +	.set_rflags = vmx_set_rflags,
> +	.get_if_flag = vmx_get_if_flag,
> +
> +	.tlb_flush_all = vmx_flush_tlb_all,
> +	.tlb_flush_current = vmx_flush_tlb_current,
> +	.tlb_flush_gva = vmx_flush_tlb_gva,
> +	.tlb_flush_guest = vmx_flush_tlb_guest,
> +
> +	.vcpu_pre_run = vmx_vcpu_pre_run,
> +	.run = vmx_vcpu_run,
> +	.handle_exit = vmx_handle_exit,
> +	.skip_emulated_instruction = vmx_skip_emulated_instruction,
> +	.update_emulated_instruction = vmx_update_emulated_instruction,
> +	.set_interrupt_shadow = vmx_set_interrupt_shadow,
> +	.get_interrupt_shadow = vmx_get_interrupt_shadow,
> +	.patch_hypercall = vmx_patch_hypercall,
> +	.set_irq = vmx_inject_irq,
> +	.set_nmi = vmx_inject_nmi,
> +	.queue_exception = vmx_queue_exception,
> +	.cancel_injection = vmx_cancel_injection,
> +	.interrupt_allowed = vmx_interrupt_allowed,
> +	.nmi_allowed = vmx_nmi_allowed,
> +	.get_nmi_mask = vmx_get_nmi_mask,
> +	.set_nmi_mask = vmx_set_nmi_mask,
> +	.enable_nmi_window = vmx_enable_nmi_window,
> +	.enable_irq_window = vmx_enable_irq_window,
> +	.update_cr8_intercept = vmx_update_cr8_intercept,
> +	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
> +	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
> +	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
> +	.load_eoi_exitmap = vmx_load_eoi_exitmap,
> +	.apicv_post_state_restore = vmx_apicv_post_state_restore,
> +	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
> +	.hwapic_irr_update = vmx_hwapic_irr_update,
> +	.hwapic_isr_update = vmx_hwapic_isr_update,
> +	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
> +	.sync_pir_to_irr = vmx_sync_pir_to_irr,
> +	.deliver_interrupt = vmx_deliver_interrupt,
> +	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
> +
> +	.set_tss_addr = vmx_set_tss_addr,
> +	.set_identity_map_addr = vmx_set_identity_map_addr,
> +	.get_mt_mask = vmx_get_mt_mask,
> +
> +	.get_exit_info = vmx_get_exit_info,
> +
> +	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
> +
> +	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> +
> +	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
> +	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
> +	.write_tsc_offset = vmx_write_tsc_offset,
> +	.write_tsc_multiplier = vmx_write_tsc_multiplier,
> +
> +	.load_mmu_pgd = vmx_load_mmu_pgd,
> +
> +	.check_intercept = vmx_check_intercept,
> +	.handle_exit_irqoff = vmx_handle_exit_irqoff,
> +
> +	.request_immediate_exit = vmx_request_immediate_exit,
> +
> +	.sched_in = vmx_sched_in,
> +
> +	.cpu_dirty_log_size = PML_ENTITY_NUM,
> +	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> +
> +	.pmu_ops = &intel_pmu_ops,
> +	.nested_ops = &vmx_nested_ops,
> +
> +	.update_pi_irte = pi_update_irte,
> +	.start_assignment = vmx_pi_start_assignment,
> +
> +#ifdef CONFIG_X86_64
> +	.set_hv_timer = vmx_set_hv_timer,
> +	.cancel_hv_timer = vmx_cancel_hv_timer,
> +#endif
> +
> +	.setup_mce = vmx_setup_mce,
> +
> +	.smi_allowed = vmx_smi_allowed,
> +	.enter_smm = vmx_enter_smm,
> +	.leave_smm = vmx_leave_smm,
> +	.enable_smi_window = vmx_enable_smi_window,
> +
> +	.can_emulate_instruction = vmx_can_emulate_instruction,
> +	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> +	.migrate_timers = vmx_migrate_timers,
> +
> +	.msr_filter_changed = vmx_msr_filter_changed,
> +	.complete_emulated_msr = kvm_complete_insn_gp,
> +
> +	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> +};
> +
> +struct kvm_x86_init_ops vt_init_ops __initdata = {
> +	.cpu_has_kvm_support = vmx_cpu_has_kvm_support,
> +	.disabled_by_bios = vmx_disabled_by_bios,
> +	.check_processor_compatibility = vmx_check_processor_compat,
> +	.hardware_setup = vmx_hardware_setup,
> +	.handle_intel_pt_intr = NULL,
> +
> +	.runtime_ops = &vt_x86_ops,
> +};
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index efda5e4d6247..f6f5d0dac579 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -66,6 +66,7 @@
>   #include "vmcs12.h"
>   #include "vmx.h"
>   #include "x86.h"
> +#include "x86_ops.h"
>   
>   MODULE_AUTHOR("Qumranet");
>   MODULE_LICENSE("GPL");
> @@ -541,7 +542,7 @@ static inline bool cpu_need_virtualize_apic_accesses(struct kvm_vcpu *vcpu)
>   	return flexpriority_enabled && lapic_in_kernel(vcpu);
>   }
>   
> -static inline bool report_flexpriority(void)
> +bool report_flexpriority(void)
>   {
>   	return flexpriority_enabled;
>   }
> @@ -1316,7 +1317,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
>    * Switches to specified vcpu, until a matching vcpu_put(), but assumes
>    * vcpu mutex is already taken.
>    */
> -static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -1327,7 +1328,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	vmx->host_debugctlmsr = get_debugctlmsr();
>   }
>   
> -static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_put(struct kvm_vcpu *vcpu)
>   {
>   	vmx_vcpu_pi_put(vcpu);
>   
> @@ -1381,7 +1382,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
>   		vmx->emulation_required = vmx_emulation_required(vcpu);
>   }
>   
> -static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
> +bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
>   {
>   	return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
>   }
> @@ -1487,8 +1488,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
>   	return 0;
>   }
>   
> -static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> -					void *insn, int insn_len)
> +bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> +				void *insn, int insn_len)
>   {
>   	/*
>   	 * Emulation of instructions in SGX enclaves is impossible as RIP does
> @@ -1572,7 +1573,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
>    * Recognizes a pending MTF VM-exit and records the nested state for later
>    * delivery.
>    */
> -static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
> +void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
>   {
>   	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
> @@ -1595,7 +1596,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
>   		vmx->nested.mtf_pending = false;
>   }
>   
> -static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
> +int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
>   {
>   	vmx_update_emulated_instruction(vcpu);
>   	return skip_emulated_instruction(vcpu);
> @@ -1614,7 +1615,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
>   		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
>   }
>   
> -static void vmx_queue_exception(struct kvm_vcpu *vcpu)
> +void vmx_queue_exception(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	unsigned nr = vcpu->arch.exception.nr;
> @@ -1727,12 +1728,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
>   	return kvm_default_tsc_scaling_ratio;
>   }
>   
> -static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> +void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
>   {
>   	vmcs_write64(TSC_OFFSET, offset);
>   }
>   
> -static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
> +void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
>   {
>   	vmcs_write64(TSC_MULTIPLIER, multiplier);
>   }
> @@ -1756,7 +1757,7 @@ static inline bool vmx_feature_control_msr_valid(struct kvm_vcpu *vcpu,
>   	return !(val & ~valid_bits);
>   }
>   
> -static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
> +int vmx_get_msr_feature(struct kvm_msr_entry *msr)
>   {
>   	switch (msr->index) {
>   	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
> @@ -1776,7 +1777,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
>    * Returns 0 on success, non-0 otherwise.
>    * Assumes vcpu_load() was already called.
>    */
> -static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	struct vmx_uret_msr *msr;
> @@ -1954,7 +1955,7 @@ static u64 vcpu_supported_debugctl(struct kvm_vcpu *vcpu)
>    * Returns 0 on success, non-0 otherwise.
>    * Assumes vcpu_load() was already called.
>    */
> -static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	struct vmx_uret_msr *msr;
> @@ -2267,7 +2268,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>   	return ret;
>   }
>   
> -static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> +void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>   {
>   	unsigned long guest_owned_bits;
>   
> @@ -2310,12 +2311,12 @@ static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>   	}
>   }
>   
> -static __init int cpu_has_kvm_support(void)
> +__init int vmx_cpu_has_kvm_support(void)
>   {
>   	return cpu_has_vmx();
>   }
>   
> -static __init int vmx_disabled_by_bios(void)
> +__init int vmx_disabled_by_bios(void)
>   {
>   	return !boot_cpu_has(X86_FEATURE_MSR_IA32_FEAT_CTL) ||
>   	       !boot_cpu_has(X86_FEATURE_VMX);
> @@ -2341,7 +2342,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
>   	return -EFAULT;
>   }
>   
> -static int hardware_enable(void)
> +int vmx_hardware_enable(void)
>   {
>   	int cpu = raw_smp_processor_id();
>   	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
> @@ -2382,7 +2383,7 @@ static void vmclear_local_loaded_vmcss(void)
>   		__loaded_vmcs_clear(v);
>   }
>   
> -static void hardware_disable(void)
> +void vmx_hardware_disable(void)
>   {
>   	vmclear_local_loaded_vmcss();
>   
> @@ -2924,7 +2925,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
>   
>   #endif
>   
> -static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
> +void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -2954,7 +2955,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
>   	return to_vmx(vcpu)->vpid;
>   }
>   
> -static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
> +void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_mmu *mmu = vcpu->arch.mmu;
>   	u64 root_hpa = mmu->root_hpa;
> @@ -2970,7 +2971,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
>   		vpid_sync_context(vmx_get_current_vpid(vcpu));
>   }
>   
> -static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> +void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
>   {
>   	/*
>   	 * vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
> @@ -2979,7 +2980,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
>   	vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
>   }
>   
> -static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   {
>   	/*
>   	 * vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
> @@ -3134,8 +3135,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
>   	return eptp;
>   }
>   
> -static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> -			     int root_level)
> +void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
>   {
>   	struct kvm *kvm = vcpu->kvm;
>   	bool update_guest_cr3 = true;
> @@ -3163,8 +3163,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>   		vmcs_writel(GUEST_CR3, guest_cr3);
>   }
>   
> -
> -static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> +bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
>   {
>   	/*
>   	 * We operate under the default treatment of SMM, so VMX cannot be
> @@ -3280,7 +3279,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
>   	var->g = (ar >> 15) & 1;
>   }
>   
> -static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> +u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
>   {
>   	struct kvm_segment s;
>   
> @@ -3360,14 +3359,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
>   	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
>   }
>   
> -static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> +void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
>   {
>   	__vmx_set_segment(vcpu, var, seg);
>   
>   	to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
>   }
>   
> -static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
> +void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
>   {
>   	u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
>   
> @@ -3375,25 +3374,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
>   	*l = (ar >> 13) & 1;
>   }
>   
> -static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
>   {
>   	dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
>   	dt->address = vmcs_readl(GUEST_IDTR_BASE);
>   }
>   
> -static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
>   {
>   	vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
>   	vmcs_writel(GUEST_IDTR_BASE, dt->address);
>   }
>   
> -static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
>   {
>   	dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
>   	dt->address = vmcs_readl(GUEST_GDTR_BASE);
>   }
>   
> -static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
>   {
>   	vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
>   	vmcs_writel(GUEST_GDTR_BASE, dt->address);
> @@ -3889,7 +3888,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
>   	}
>   }
>   
> -static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
> +bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	void *vapic_page;
> @@ -3909,7 +3908,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
>   	return ((rvi & 0xf0) > (vppr & 0xf0));
>   }
>   
> -static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
> +void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	u32 i;
> @@ -4041,8 +4040,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
>   	return 0;
>   }
>   
> -static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> -				  int trig_mode, int vector)
> +void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> +			   int trig_mode, int vector)
>   {
>   	struct kvm_vcpu *vcpu = apic->vcpu;
>   
> @@ -4185,7 +4184,7 @@ static u32 vmx_vmexit_ctrl(void)
>   		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
>   }
>   
> -static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> +void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -4508,7 +4507,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
>   	vmx->pi_desc.sn = 1;
>   }
>   
> -static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -4565,12 +4564,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	vpid_sync_context(vmx->vpid);
>   }
>   
> -static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
> +void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
>   {
>   	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
>   }
>   
> -static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
> +void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
>   {
>   	if (!enable_vnmi ||
>   	    vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
> @@ -4581,7 +4580,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
>   	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
>   }
>   
> -static void vmx_inject_irq(struct kvm_vcpu *vcpu)
> +void vmx_inject_irq(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	uint32_t intr;
> @@ -4609,7 +4608,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
>   	vmx_clear_hlt(vcpu);
>   }
>   
> -static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
> +void vmx_inject_nmi(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -4687,7 +4686,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
>   		 GUEST_INTR_STATE_NMI));
>   }
>   
> -static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>   {
>   	if (to_vmx(vcpu)->nested.nested_run_pending)
>   		return -EBUSY;
> @@ -4709,7 +4708,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
>   		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
>   }
>   
> -static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>   {
>   	if (to_vmx(vcpu)->nested.nested_run_pending)
>   		return -EBUSY;
> @@ -4724,7 +4723,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>   	return !vmx_interrupt_blocked(vcpu);
>   }
>   
> -static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
> +int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
>   {
>   	void __user *ret;
>   
> @@ -4744,7 +4743,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
>   	return init_rmode_tss(kvm, ret);
>   }
>   
> -static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> +int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
>   {
>   	to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
>   	return 0;
> @@ -5023,8 +5022,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
>   	return kvm_fast_pio(vcpu, size, port, in);
>   }
>   
> -static void
> -vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
> +void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
>   {
>   	/*
>   	 * Patch in the VMCALL instruction:
> @@ -5234,7 +5232,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
>   	return kvm_complete_insn_gp(vcpu, err);
>   }
>   
> -static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
> +void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
>   {
>   	get_debugreg(vcpu->arch.db[0], 0);
>   	get_debugreg(vcpu->arch.db[1], 1);
> @@ -5253,7 +5251,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
>   	set_debugreg(DR6_RESERVED, 6);
>   }
>   
> -static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
> +void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
>   {
>   	vmcs_writel(GUEST_DR7, val);
>   }
> @@ -5519,7 +5517,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
>   	return 1;
>   }
>   
> -static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
> +int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
>   {
>   	if (vmx_emulation_required_with_pending_exception(vcpu)) {
>   		kvm_prepare_emulation_failure_exit(vcpu);
> @@ -5756,9 +5754,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
>   static const int kvm_vmx_max_exit_handlers =
>   	ARRAY_SIZE(kvm_vmx_exit_handlers);
>   
> -static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> -			      u64 *info1, u64 *info2,
> -			      u32 *intr_info, u32 *error_code)
> +void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> +		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -6191,7 +6188,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   	return 0;
>   }
>   
> -static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
> +int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
>   {
>   	int ret = __vmx_handle_exit(vcpu, exit_fastpath);
>   
> @@ -6279,7 +6276,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
>   		: "eax", "ebx", "ecx", "edx");
>   }
>   
> -static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> +void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
>   {
>   	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>   	int tpr_threshold;
> @@ -6349,7 +6346,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
>   	vmx_update_msr_bitmap_x2apic(vcpu);
>   }
>   
> -static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> +void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>   {
>   	struct page *page;
>   
> @@ -6377,7 +6374,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
>   	put_page(page);
>   }
>   
> -static void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> +void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
>   {
>   	u16 status;
>   	u8 old;
> @@ -6411,7 +6408,7 @@ static void vmx_set_rvi(int vector)
>   	}
>   }
>   
> -static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> +void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
>   {
>   	/*
>   	 * When running L2, updating RVI is only relevant when
> @@ -6425,7 +6422,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
>   		vmx_set_rvi(max_irr);
>   }
>   
> -static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
> +int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	int max_irr;
> @@ -6471,7 +6468,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   	return max_irr;
>   }
>   
> -static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> +void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
>   {
>   	if (!kvm_vcpu_apicv_active(vcpu))
>   		return;
> @@ -6482,7 +6479,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
>   	vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
>   }
>   
> -static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> +void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -6554,7 +6551,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
>   	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
>   }
>   
> -static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> +void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -6571,7 +6568,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>    * The kvm parameter can be NULL (module initialization, or invocation before
>    * VM creation). Be sure to check the kvm parameter before using it.
>    */
> -static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
> +bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
>   {
>   	switch (index) {
>   	case MSR_IA32_SMBASE:
> @@ -6692,7 +6689,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
>   				  IDT_VECTORING_ERROR_CODE);
>   }
>   
> -static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
> +void vmx_cancel_injection(struct kvm_vcpu *vcpu)
>   {
>   	__vmx_complete_interrupts(vcpu,
>   				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
> @@ -6788,7 +6785,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>   	guest_state_exit_irqoff();
>   }
>   
> -static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
> +fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	unsigned long cr4;
> @@ -6969,7 +6966,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>   	return vmx_exit_handlers_fastpath(vcpu);
>   }
>   
> -static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_free(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -6980,7 +6977,7 @@ static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
>   	free_loaded_vmcs(vmx->loaded_vmcs);
>   }
>   
> -static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
> +int vmx_vcpu_create(struct kvm_vcpu *vcpu)
>   {
>   	struct vmx_uret_msr *tsx_ctrl;
>   	struct vcpu_vmx *vmx;
> @@ -7085,7 +7082,7 @@ static int vmx_create_vcpu(struct kvm_vcpu *vcpu)
>   #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>   #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>   
> -static int vmx_vm_init(struct kvm *kvm)
> +int vmx_vm_init(struct kvm *kvm)
>   {
>   	if (!ple_gap)
>   		kvm->arch.pause_in_guest = true;
> @@ -7116,7 +7113,7 @@ static int vmx_vm_init(struct kvm *kvm)
>   	return 0;
>   }
>   
> -static int __init vmx_check_processor_compat(void)
> +int __init vmx_check_processor_compat(void)
>   {
>   	struct vmcs_config vmcs_conf;
>   	struct vmx_capability vmx_cap;
> @@ -7139,7 +7136,7 @@ static int __init vmx_check_processor_compat(void)
>   	return 0;
>   }
>   
> -static u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
>   {
>   	u8 cache;
>   
> @@ -7328,7 +7325,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
>   		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
>   }
>   
> -static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -7433,7 +7430,7 @@ static __init void vmx_set_cpu_caps(void)
>   		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
>   }
>   
> -static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
> +void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
>   {
>   	to_vmx(vcpu)->req_immediate_exit = true;
>   }
> @@ -7472,10 +7469,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
>   	return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
>   }
>   
> -static int vmx_check_intercept(struct kvm_vcpu *vcpu,
> -			       struct x86_instruction_info *info,
> -			       enum x86_intercept_stage stage,
> -			       struct x86_exception *exception)
> +int vmx_check_intercept(struct kvm_vcpu *vcpu,
> +		       struct x86_instruction_info *info,
> +		       enum x86_intercept_stage stage,
> +		       struct x86_exception *exception)
>   {
>   	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>   
> @@ -7540,8 +7537,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
>   	return 0;
>   }
>   
> -static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> -			    bool *expired)
> +int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> +		bool *expired)
>   {
>   	struct vcpu_vmx *vmx;
>   	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
> @@ -7580,13 +7577,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
>   	return 0;
>   }
>   
> -static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
> +void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
>   {
>   	to_vmx(vcpu)->hv_deadline_tsc = -1;
>   }
>   #endif
>   
> -static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
>   {
>   	if (!kvm_pause_in_guest(vcpu->kvm))
>   		shrink_ple_window(vcpu);
> @@ -7612,7 +7609,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
>   		secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
>   }
>   
> -static void vmx_setup_mce(struct kvm_vcpu *vcpu)
> +void vmx_setup_mce(struct kvm_vcpu *vcpu)
>   {
>   	if (vcpu->arch.mcg_cap & MCG_LMCE_P)
>   		to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
> @@ -7622,7 +7619,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
>   			~FEAT_CTL_LMCE_ENABLED;
>   }
>   
> -static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>   {
>   	/* we need a nested vmexit to enter SMM, postpone if run is pending */
>   	if (to_vmx(vcpu)->nested.nested_run_pending)
> @@ -7630,7 +7627,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
>   	return !is_smm(vcpu);
>   }
>   
> -static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
> +int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   
> @@ -7644,7 +7641,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate)
>   	return 0;
>   }
>   
> -static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
> +int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
>   {
>   	struct vcpu_vmx *vmx = to_vmx(vcpu);
>   	int ret;
> @@ -7665,17 +7662,17 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate)
>   	return 0;
>   }
>   
> -static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
> +void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
>   {
>   	/* RSM will cause a vmexit anyway.  */
>   }
>   
> -static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
>   {
>   	return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
>   }
>   
> -static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
> +void vmx_migrate_timers(struct kvm_vcpu *vcpu)
>   {
>   	if (is_guest_mode(vcpu)) {
>   		struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
> @@ -7685,7 +7682,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
>   	}
>   }
>   
> -static void hardware_unsetup(void)
> +void vmx_hardware_unsetup(void)
>   {
>   	kvm_set_posted_intr_wakeup_handler(NULL);
>   
> @@ -7695,7 +7692,7 @@ static void hardware_unsetup(void)
>   	free_kvm_area();
>   }
>   
> -static bool vmx_check_apicv_inhibit_reasons(ulong bit)
> +bool vmx_check_apicv_inhibit_reasons(ulong bit)
>   {
>   	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
>   			  BIT(APICV_INHIBIT_REASON_ABSENT) |
> @@ -7705,143 +7702,6 @@ static bool vmx_check_apicv_inhibit_reasons(ulong bit)
>   	return supported & BIT(bit);
>   }
>   
> -static struct kvm_x86_ops vmx_x86_ops __initdata = {
> -	.name = "kvm_intel",
> -
> -	.hardware_unsetup = hardware_unsetup,
> -
> -	.hardware_enable = hardware_enable,
> -	.hardware_disable = hardware_disable,
> -	.cpu_has_accelerated_tpr = report_flexpriority,
> -	.has_emulated_msr = vmx_has_emulated_msr,
> -
> -	.vm_size = sizeof(struct kvm_vmx),
> -	.vm_init = vmx_vm_init,
> -
> -	.vcpu_create = vmx_create_vcpu,
> -	.vcpu_free = vmx_free_vcpu,
> -	.vcpu_reset = vmx_vcpu_reset,
> -
> -	.prepare_guest_switch = vmx_prepare_switch_to_guest,
> -	.vcpu_load = vmx_vcpu_load,
> -	.vcpu_put = vmx_vcpu_put,
> -
> -	.update_exception_bitmap = vmx_update_exception_bitmap,
> -	.get_msr_feature = vmx_get_msr_feature,
> -	.get_msr = vmx_get_msr,
> -	.set_msr = vmx_set_msr,
> -	.get_segment_base = vmx_get_segment_base,
> -	.get_segment = vmx_get_segment,
> -	.set_segment = vmx_set_segment,
> -	.get_cpl = vmx_get_cpl,
> -	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
> -	.set_cr0 = vmx_set_cr0,
> -	.is_valid_cr4 = vmx_is_valid_cr4,
> -	.set_cr4 = vmx_set_cr4,
> -	.set_efer = vmx_set_efer,
> -	.get_idt = vmx_get_idt,
> -	.set_idt = vmx_set_idt,
> -	.get_gdt = vmx_get_gdt,
> -	.set_gdt = vmx_set_gdt,
> -	.set_dr7 = vmx_set_dr7,
> -	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
> -	.cache_reg = vmx_cache_reg,
> -	.get_rflags = vmx_get_rflags,
> -	.set_rflags = vmx_set_rflags,
> -	.get_if_flag = vmx_get_if_flag,
> -
> -	.tlb_flush_all = vmx_flush_tlb_all,
> -	.tlb_flush_current = vmx_flush_tlb_current,
> -	.tlb_flush_gva = vmx_flush_tlb_gva,
> -	.tlb_flush_guest = vmx_flush_tlb_guest,
> -
> -	.vcpu_pre_run = vmx_vcpu_pre_run,
> -	.run = vmx_vcpu_run,
> -	.handle_exit = vmx_handle_exit,
> -	.skip_emulated_instruction = vmx_skip_emulated_instruction,
> -	.update_emulated_instruction = vmx_update_emulated_instruction,
> -	.set_interrupt_shadow = vmx_set_interrupt_shadow,
> -	.get_interrupt_shadow = vmx_get_interrupt_shadow,
> -	.patch_hypercall = vmx_patch_hypercall,
> -	.set_irq = vmx_inject_irq,
> -	.set_nmi = vmx_inject_nmi,
> -	.queue_exception = vmx_queue_exception,
> -	.cancel_injection = vmx_cancel_injection,
> -	.interrupt_allowed = vmx_interrupt_allowed,
> -	.nmi_allowed = vmx_nmi_allowed,
> -	.get_nmi_mask = vmx_get_nmi_mask,
> -	.set_nmi_mask = vmx_set_nmi_mask,
> -	.enable_nmi_window = vmx_enable_nmi_window,
> -	.enable_irq_window = vmx_enable_irq_window,
> -	.update_cr8_intercept = vmx_update_cr8_intercept,
> -	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
> -	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
> -	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
> -	.load_eoi_exitmap = vmx_load_eoi_exitmap,
> -	.apicv_post_state_restore = vmx_apicv_post_state_restore,
> -	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
> -	.hwapic_irr_update = vmx_hwapic_irr_update,
> -	.hwapic_isr_update = vmx_hwapic_isr_update,
> -	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
> -	.sync_pir_to_irr = vmx_sync_pir_to_irr,
> -	.deliver_interrupt = vmx_deliver_interrupt,
> -	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
> -
> -	.set_tss_addr = vmx_set_tss_addr,
> -	.set_identity_map_addr = vmx_set_identity_map_addr,
> -	.get_mt_mask = vmx_get_mt_mask,
> -
> -	.get_exit_info = vmx_get_exit_info,
> -
> -	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
> -
> -	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
> -
> -	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
> -	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
> -	.write_tsc_offset = vmx_write_tsc_offset,
> -	.write_tsc_multiplier = vmx_write_tsc_multiplier,
> -
> -	.load_mmu_pgd = vmx_load_mmu_pgd,
> -
> -	.check_intercept = vmx_check_intercept,
> -	.handle_exit_irqoff = vmx_handle_exit_irqoff,
> -
> -	.request_immediate_exit = vmx_request_immediate_exit,
> -
> -	.sched_in = vmx_sched_in,
> -
> -	.cpu_dirty_log_size = PML_ENTITY_NUM,
> -	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> -
> -	.pmu_ops = &intel_pmu_ops,
> -	.nested_ops = &vmx_nested_ops,
> -
> -	.update_pi_irte = pi_update_irte,
> -	.start_assignment = vmx_pi_start_assignment,
> -
> -#ifdef CONFIG_X86_64
> -	.set_hv_timer = vmx_set_hv_timer,
> -	.cancel_hv_timer = vmx_cancel_hv_timer,
> -#endif
> -
> -	.setup_mce = vmx_setup_mce,
> -
> -	.smi_allowed = vmx_smi_allowed,
> -	.enter_smm = vmx_enter_smm,
> -	.leave_smm = vmx_leave_smm,
> -	.enable_smi_window = vmx_enable_smi_window,
> -
> -	.can_emulate_instruction = vmx_can_emulate_instruction,
> -	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> -	.migrate_timers = vmx_migrate_timers,
> -
> -	.msr_filter_changed = vmx_msr_filter_changed,
> -	.complete_emulated_msr = kvm_complete_insn_gp,
> -
> -	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> -};
> -
>   static unsigned int vmx_handle_intel_pt_intr(void)
>   {
>   	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
> @@ -7882,9 +7742,7 @@ static __init void vmx_setup_user_return_msrs(void)
>   		kvm_add_user_return_msr(vmx_uret_msrs_list[i]);
>   }
>   
> -static struct kvm_x86_init_ops vmx_init_ops __initdata;
> -
> -static __init int hardware_setup(void)
> +__init int vmx_hardware_setup(void)
>   {
>   	unsigned long host_bndcfgs;
>   	struct desc_ptr dt;
> @@ -7944,16 +7802,16 @@ static __init int hardware_setup(void)
>   	 * using the APIC_ACCESS_ADDR VMCS field.
>   	 */
>   	if (!flexpriority_enabled)
> -		vmx_x86_ops.set_apic_access_page_addr = NULL;
> +		vt_x86_ops.set_apic_access_page_addr = NULL;
>   
>   	if (!cpu_has_vmx_tpr_shadow())
> -		vmx_x86_ops.update_cr8_intercept = NULL;
> +		vt_x86_ops.update_cr8_intercept = NULL;
>   
>   #if IS_ENABLED(CONFIG_HYPERV)
>   	if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
>   	    && enable_ept) {
> -		vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
> -		vmx_x86_ops.tlb_remote_flush_with_range =
> +		vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
> +		vt_x86_ops.tlb_remote_flush_with_range =
>   				hv_remote_flush_tlb_with_range;
>   	}
>   #endif
> @@ -7969,7 +7827,7 @@ static __init int hardware_setup(void)
>   	if (!cpu_has_vmx_apicv())
>   		enable_apicv = 0;
>   	if (!enable_apicv)
> -		vmx_x86_ops.sync_pir_to_irr = NULL;
> +		vt_x86_ops.sync_pir_to_irr = NULL;
>   
>   	if (cpu_has_vmx_tsc_scaling()) {
>   		kvm_has_tsc_control = true;
> @@ -7996,7 +7854,7 @@ static __init int hardware_setup(void)
>   		enable_pml = 0;
>   
>   	if (!enable_pml)
> -		vmx_x86_ops.cpu_dirty_log_size = 0;
> +		vt_x86_ops.cpu_dirty_log_size = 0;
>   
>   	if (!cpu_has_vmx_preemption_timer())
>   		enable_preemption_timer = false;
> @@ -8023,9 +7881,9 @@ static __init int hardware_setup(void)
>   	}
>   
>   	if (!enable_preemption_timer) {
> -		vmx_x86_ops.set_hv_timer = NULL;
> -		vmx_x86_ops.cancel_hv_timer = NULL;
> -		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
> +		vt_x86_ops.set_hv_timer = NULL;
> +		vt_x86_ops.cancel_hv_timer = NULL;
> +		vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
>   	}
>   
>   	kvm_mce_cap_supported |= MCG_LMCE_P;
> @@ -8035,9 +7893,9 @@ static __init int hardware_setup(void)
>   	if (!enable_ept || !cpu_has_vmx_intel_pt())
>   		pt_mode = PT_MODE_SYSTEM;
>   	if (pt_mode == PT_MODE_HOST_GUEST)
> -		vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
> +		vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
>   	else
> -		vmx_init_ops.handle_intel_pt_intr = NULL;
> +		vt_init_ops.handle_intel_pt_intr = NULL;
>   
>   	setup_default_sgx_lepubkeyhash();
>   
> @@ -8061,16 +7919,6 @@ static __init int hardware_setup(void)
>   	return r;
>   }
>   
> -static struct kvm_x86_init_ops vmx_init_ops __initdata = {
> -	.cpu_has_kvm_support = cpu_has_kvm_support,
> -	.disabled_by_bios = vmx_disabled_by_bios,
> -	.check_processor_compatibility = vmx_check_processor_compat,
> -	.hardware_setup = hardware_setup,
> -	.handle_intel_pt_intr = NULL,
> -
> -	.runtime_ops = &vmx_x86_ops,
> -};
> -
>   static void vmx_cleanup_l1d_flush(void)
>   {
>   	if (vmx_l1d_flush_pages) {
> @@ -8149,7 +7997,7 @@ static int __init vmx_init(void)
>   		}
>   
>   		if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
> -			vmx_x86_ops.enable_direct_tlbflush
> +			vt_x86_ops.enable_direct_tlbflush
>   				= hv_enable_direct_tlbflush;
>   
>   	} else {
> @@ -8157,8 +8005,8 @@ static int __init vmx_init(void)
>   	}
>   #endif
>   
> -	r = kvm_init(&vmx_init_ops, sizeof(struct vcpu_vmx),
> -		     __alignof__(struct vcpu_vmx), THIS_MODULE);
> +	r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
> +		__alignof__(struct vcpu_vmx), THIS_MODULE);
>   	if (r)
>   		return r;
>   
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> new file mode 100644
> index 000000000000..40c64fb1f505
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -0,0 +1,126 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVM_X86_VMX_X86_OPS_H
> +#define __KVM_X86_VMX_X86_OPS_H
> +
> +#include <linux/kvm_host.h>
> +
> +#include <asm/virtext.h>
> +
> +#include "x86.h"
> +
> +extern struct kvm_x86_init_ops vt_init_ops __initdata;
> +
> +__init int vmx_cpu_has_kvm_support(void);
> +__init int vmx_disabled_by_bios(void);
> +int __init vmx_check_processor_compat(void);
> +__init int vmx_hardware_setup(void);
> +
> +extern struct kvm_x86_ops vt_x86_ops __initdata;
> +extern struct kvm_x86_init_ops vt_init_ops __initdata;
> +
> +void vmx_hardware_unsetup(void);
> +int vmx_hardware_enable(void);
> +void vmx_hardware_disable(void);
> +bool report_flexpriority(void);
> +int vmx_vm_init(struct kvm *kvm);
> +int vmx_vcpu_create(struct kvm_vcpu *vcpu);
> +int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> +fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
> +void vmx_vcpu_free(struct kvm_vcpu *vcpu);
> +void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
> +void vmx_vcpu_put(struct kvm_vcpu *vcpu);
> +int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
> +void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
> +int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
> +void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
> +int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
> +int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +int vmx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
> +int vmx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
> +void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
> +bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> +				void *insn, int insn_len);
> +int vmx_check_intercept(struct kvm_vcpu *vcpu,
> +			struct x86_instruction_info *info,
> +			enum x86_intercept_stage stage,
> +			struct x86_exception *exception);
> +bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
> +void vmx_migrate_timers(struct kvm_vcpu *vcpu);
> +void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
> +void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
> +bool vmx_check_apicv_inhibit_reasons(ulong bit);
> +void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
> +void vmx_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr);
> +bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
> +int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
> +void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
> +			   int trig_mode, int vector);
> +void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
> +bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
> +void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
> +void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> +void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
> +int vmx_get_msr_feature(struct kvm_msr_entry *msr);
> +int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
> +u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
> +void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
> +void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
> +int vmx_get_cpl(struct kvm_vcpu *vcpu);
> +void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
> +void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
> +void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
> +void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> +bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
> +int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
> +void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
> +void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
> +void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
> +void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> +unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
> +void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
> +bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
> +void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
> +void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
> +void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
> +void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
> +void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
> +u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
> +void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
> +void vmx_inject_irq(struct kvm_vcpu *vcpu);
> +void vmx_inject_nmi(struct kvm_vcpu *vcpu);
> +void vmx_queue_exception(struct kvm_vcpu *vcpu);
> +void vmx_cancel_injection(struct kvm_vcpu *vcpu);
> +int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
> +bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
> +void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
> +void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
> +void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
> +void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
> +void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
> +void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
> +void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
> +int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
> +int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
> +u64 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
> +void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> +		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
> +u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
> +u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
> +void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
> +void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
> +void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
> +void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
> +void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
> +#ifdef CONFIG_X86_64
> +int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> +		bool *expired);
> +void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
> +#endif
> +void vmx_setup_mce(struct kvm_vcpu *vcpu);
> +
> +#endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization
  2022-03-04 19:48 ` [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization isaku.yamahata
@ 2022-03-13 13:49   ` Paolo Bonzini
  2022-03-14 18:34     ` Isaku Yamahata
  2022-04-08 16:46   ` Sean Christopherson
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 13:49 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX requires several initialization steps for KVM to create guest TDs:
> detect the CPU feature, enable VMX (TDX is based on VMX), detect TDX module
> availability, and initialize the TDX module.  This patch implements the
> first step, CPU feature detection.  Because VMX is not yet enabled by the
> VMXON instruction at KVM kernel module initialization, defer the remaining
> initialization steps until VMX is enabled by the hardware_enable callback.
> 
> Introduce a module parameter, enable_tdx, to explicitly enable TDX KVM
> support.  It's off by default to keep the same behavior for those who don't
> use TDX.  Implement CPU feature detection in the hardware_setup callback at
> KVM kernel module initialization to check whether the CPU feature is
> available and to read some CPU parameters.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/Makefile      |  1 +
>   arch/x86/kvm/vmx/main.c    | 15 ++++++++++-
>   arch/x86/kvm/vmx/tdx.c     | 53 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/x86_ops.h |  6 +++++
>   4 files changed, 74 insertions(+), 1 deletion(-)
>   create mode 100644 arch/x86/kvm/vmx/tdx.c
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index ee4d0999f20f..e2c05195cb95 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,6 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
>   kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>   			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>   kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
>   
>   kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
>   
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b08ea9c42a11..b79fcc8d81dd 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -6,6 +6,19 @@
>   #include "nested.h"
>   #include "pmu.h"
>   
> +static __init int vt_hardware_setup(void)
> +{
> +	int ret;
> +
> +	ret = vmx_hardware_setup();
> +	if (ret)
> +		return ret;
> +
> +	tdx_hardware_setup(&vt_x86_ops);
> +
> +	return 0;
> +}
> +
>   struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.name = "kvm_intel",
>   
> @@ -147,7 +160,7 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
>   	.cpu_has_kvm_support = vmx_cpu_has_kvm_support,
>   	.disabled_by_bios = vmx_disabled_by_bios,
>   	.check_processor_compatibility = vmx_check_processor_compat,
> -	.hardware_setup = vmx_hardware_setup,
> +	.hardware_setup = vt_hardware_setup,
>   	.handle_intel_pt_intr = NULL,
>   
>   	.runtime_ops = &vt_x86_ops,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> new file mode 100644
> index 000000000000..1acf08c310c4
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpu.h>
> +
> +#include <asm/tdx.h>
> +
> +#include "capabilities.h"
> +#include "x86_ops.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +static bool __read_mostly enable_tdx = true;
> +module_param_named(tdx, enable_tdx, bool, 0644);
> +
> +static u64 hkid_mask __ro_after_init;
> +static u8 hkid_start_pos __ro_after_init;
> +
> +static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> +	u32 max_pa;
> +
> +	if (!enable_ept) {
> +		pr_warn("Cannot enable TDX with EPT disabled\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!platform_has_tdx()) {
> +		pr_warn("Cannot enable TDX with SEAMRR disabled\n");
> +		return -ENODEV;
> +	}

This will cause a pr_warn in the logs on all machines that don't have 
TDX.  Perhaps you can restrict the pr_warn() to machines that have 
__seamrr_enabled() == true?
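
For illustration, the suggested guard might look roughly like this (a
sketch only; it assumes the SEAMRR helper is reachable from KVM, which
the posted patches don't show, and the message wording is likewise an
assumption):

	if (!platform_has_tdx()) {
		/*
		 * Avoid warning on every machine without TDX: only complain
		 * when SEAMRR is enabled but TDX still cannot be used.
		 */
		if (__seamrr_enabled())
			pr_warn("TDX module unusable despite SEAMRR being enabled\n");
		return -ENODEV;
	}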

Paolo

> +	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
> +		return -EIO;
> +
> +	max_pa = cpuid_eax(0x80000008) & 0xff;
> +	hkid_start_pos = boot_cpu_data.x86_phys_bits;
> +	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
> +
> +	return 0;
> +}
> +
> +void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> +	/*
> +	 * This function is called at the initialization.  No need to protect
> +	 * enable_tdx.
> +	 */
> +	if (!enable_tdx)
> +		return;
> +
> +	if (__tdx_hardware_setup(&vt_x86_ops))
> +		enable_tdx = false;
> +}
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 40c64fb1f505..ccf98e79d8c3 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -123,4 +123,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
>   #endif
>   void vmx_setup_mce(struct kvm_vcpu *vcpu);
>   
> +#ifdef CONFIG_INTEL_TDX_HOST
> +void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +#else
> +static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
> +#endif
> +
>   #endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions
  2022-03-04 19:48 ` [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions isaku.yamahata
@ 2022-03-13 13:54   ` Paolo Bonzini
  2022-03-14 19:22     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 13:54 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Currently, the KVM VMX module initialization and exit functions are each a
> single function.  Refactor the KVM VMX module initialization functions into
> a KVM common part and a VMX part so that a TDX-specific part can be added
> cleanly.  Opportunistically refactor the module exit function as well.
> 
> The current module initialization flow is: 1) calculate the sizes of the
> VMX kvm structure and the VMX vcpu structure, 2) report those sizes to the
> KVM common layer and do KVM common initialization, and 3) do VMX-specific
> system-wide initialization.
> 
> Refactor the KVM VMX module initialization function into several functions
> with a wrapper so that the VMX logic stays in vmx.c while main.c holds the
> code common to VMX and TDX.  The wrapper,
> "vt_init() { vmx_pre_kvm_init(); kvm_init(); vmx_init(); }", lives in
> main.c, and vmx_pre_kvm_init() and vmx_init() live in vmx.c.
> vmx_pre_kvm_init() calculates the size and alignment of the VMX vcpu
> structure, kvm_init() does system-wide initialization of the KVM common
> layer, and vmx_init() does system-wide VMX initialization.
> 
> The KVM architecture-common layer allocates struct kvm with the reported
> size for architecture-specific code.  The KVM VMX module defines its
> structure as struct kvm_vmx { struct kvm kvm; /* VMX-specific members */ }
> and uses it as its kvm structure; the vcpu structure is handled similarly.
> The TDX KVM patches will define TDX-specific kvm and vcpu structures and
> add tdx_pre_kvm_init() to report their sizes to the KVM common layer.
> 
> The current module exit function is also a single function, a combination
> of VMX-specific logic and common KVM logic.  Refactor it into VMX-specific
> logic and KVM common logic.  This is just refactoring that keeps the
> VMX-specific logic in vmx.c, out of main.c.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    | 33 +++++++++++++
>   arch/x86/kvm/vmx/vmx.c     | 97 +++++++++++++++++++-------------------
>   arch/x86/kvm/vmx/x86_ops.h |  5 +-
>   3 files changed, 86 insertions(+), 49 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b79fcc8d81dd..8ff13c7881f2 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -165,3 +165,36 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
>   
>   	.runtime_ops = &vt_x86_ops,
>   };
> +
> +static int __init vt_init(void)
> +{
> +	unsigned int vcpu_size = 0, vcpu_align = 0;
> +	int r;
> +
> +	vmx_pre_kvm_init(&vcpu_size, &vcpu_align);
> +
> +	r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
> +	if (r)
> +		goto err_vmx_post_exit;
> +
> +	r = vmx_init();
> +	if (r)
> +		goto err_kvm_exit;
> +
> +	return 0;
> +
> +err_kvm_exit:
> +	kvm_exit();
> +err_vmx_post_exit:
> +	vmx_post_kvm_exit();
> +	return r;
> +}
> +module_init(vt_init);
> +
> +static void vt_exit(void)
> +{
> +	vmx_exit();
> +	kvm_exit();
> +	vmx_post_kvm_exit();
> +}
> +module_exit(vt_exit);
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index f6f5d0dac579..7838cd177f0e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7929,47 +7929,12 @@ static void vmx_cleanup_l1d_flush(void)
>   	l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
>   }
>   
> -static void vmx_exit(void)
> +void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align)
>   {
> -#ifdef CONFIG_KEXEC_CORE
> -	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
> -	synchronize_rcu();
> -#endif
> -
> -	kvm_exit();
> -
> -#if IS_ENABLED(CONFIG_HYPERV)
> -	if (static_branch_unlikely(&enable_evmcs)) {
> -		int cpu;
> -		struct hv_vp_assist_page *vp_ap;
> -		/*
> -		 * Reset everything to support using non-enlightened VMCS
> -		 * access later (e.g. when we reload the module with
> -		 * enlightened_vmcs=0)
> -		 */
> -		for_each_online_cpu(cpu) {
> -			vp_ap =	hv_get_vp_assist_page(cpu);
> -
> -			if (!vp_ap)
> -				continue;
> -
> -			vp_ap->nested_control.features.directhypercall = 0;
> -			vp_ap->current_nested_vmcs = 0;
> -			vp_ap->enlighten_vmentry = 0;
> -		}
> -
> -		static_branch_disable(&enable_evmcs);
> -	}
> -#endif
> -	vmx_cleanup_l1d_flush();
> -
> -	allow_smaller_maxphyaddr = false;
> -}
> -module_exit(vmx_exit);
> -
> -static int __init vmx_init(void)
> -{
> -	int r, cpu;
> +	if (sizeof(struct vcpu_vmx) > *vcpu_size)
> +		*vcpu_size = sizeof(struct vcpu_vmx);
> +	if (__alignof__(struct vcpu_vmx) > *vcpu_align)
> +		*vcpu_align = __alignof__(struct vcpu_vmx);

Please keep these four lines in vt_init, and rename the rest of 
vmx_pre_kvm_init to hv_vp_assist_page_init.  Likewise, rename 
vmx_post_kvm_exit to hv_vp_assist_page_exit.

Adjusting the vcpu_size and vcpu_align for TDX (I guess) can be added 
later when TDX ops are introduced.
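
As a rough sketch of that shape (hv_vp_assist_page_init() and
hv_vp_assist_page_exit() are the suggested names, not functions in the
posted patch):

	static int __init vt_init(void)
	{
		unsigned int vcpu_size, vcpu_align;
		int r;

		/* Keep the size/alignment computation here ... */
		vcpu_size = sizeof(struct vcpu_vmx);
		vcpu_align = __alignof__(struct vcpu_vmx);

		/* ... and move only the Hyper-V assist page handling out. */
		hv_vp_assist_page_init();

		r = kvm_init(&vt_init_ops, vcpu_size, vcpu_align, THIS_MODULE);
		if (r)
			goto err_hv_exit;

		r = vmx_init();
		if (r)
			goto err_kvm_exit;

		return 0;

	err_kvm_exit:
		kvm_exit();
	err_hv_exit:
		hv_vp_assist_page_exit();
		return r;
	}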

Paolo

>   
>   #if IS_ENABLED(CONFIG_HYPERV)
>   	/*
> @@ -8004,11 +7969,38 @@ static int __init vmx_init(void)
>   		enlightened_vmcs = false;
>   	}
>   #endif
> +}
>   
> -	r = kvm_init(&vt_init_ops, sizeof(struct vcpu_vmx),
> -		__alignof__(struct vcpu_vmx), THIS_MODULE);
> -	if (r)
> -		return r;
> +void vmx_post_kvm_exit(void)
> +{
> +#if IS_ENABLED(CONFIG_HYPERV)
> +	if (static_branch_unlikely(&enable_evmcs)) {
> +		int cpu;
> +		struct hv_vp_assist_page *vp_ap;
> +		/*
> +		 * Reset everything to support using non-enlightened VMCS
> +		 * access later (e.g. when we reload the module with
> +		 * enlightened_vmcs=0)
> +		 */
> +		for_each_online_cpu(cpu) {
> +			vp_ap =	hv_get_vp_assist_page(cpu);
> +
> +			if (!vp_ap)
> +				continue;
> +
> +			vp_ap->nested_control.features.directhypercall = 0;
> +			vp_ap->current_nested_vmcs = 0;
> +			vp_ap->enlighten_vmentry = 0;
> +		}
> +
> +		static_branch_disable(&enable_evmcs);
> +	}
> +#endif
> +}
> +
> +int __init vmx_init(void)
> +{
> +	int r, cpu;
>   
>   	/*
>   	 * Must be called after kvm_init() so enable_ept is properly set
> @@ -8018,10 +8010,8 @@ static int __init vmx_init(void)
>   	 * mitigation mode.
>   	 */
>   	r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
> -	if (r) {
> -		vmx_exit();
> +	if (r)
>   		return r;
> -	}
>   
>   	for_each_possible_cpu(cpu) {
>   		INIT_LIST_HEAD(&per_cpu(loaded_vmcss_on_cpu, cpu));
> @@ -8045,4 +8035,15 @@ static int __init vmx_init(void)
>   
>   	return 0;
>   }
> -module_init(vmx_init);
> +
> +void vmx_exit(void)
> +{
> +#ifdef CONFIG_KEXEC_CORE
> +	RCU_INIT_POINTER(crash_vmclear_loaded_vmcss, NULL);
> +	synchronize_rcu();
> +#endif
> +
> +	vmx_cleanup_l1d_flush();
> +
> +	allow_smaller_maxphyaddr = false;
> +}
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ccf98e79d8c3..7da541e1c468 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -8,7 +8,10 @@
>   
>   #include "x86.h"
>   
> -extern struct kvm_x86_init_ops vt_init_ops __initdata;
> +void __init vmx_pre_kvm_init(unsigned int *vcpu_size, unsigned int *vcpu_align);
> +int __init vmx_init(void);
> +void vmx_exit(void);
> +void vmx_post_kvm_exit(void);
>   
>   __init int vmx_cpu_has_kvm_support(void);
>   __init int vmx_disabled_by_bios(void);


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 006/104] KVM: TDX: Add placeholders for TDX VM/vcpu structure
  2022-03-04 19:48 ` [RFC PATCH v5 006/104] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
@ 2022-03-13 13:55   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 13:55 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> +void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
> +			unsigned int *vcpu_align, unsigned int *vm_size)
> +{
> +	*vcpu_size = sizeof(struct vcpu_tdx);
> +	*vcpu_align = __alignof__(struct vcpu_tdx);
> +
> +	if (sizeof(struct kvm_tdx) > *vm_size)
> +		*vm_size = sizeof(struct kvm_tdx);
> +}

No need for this function, I would just do

	vcpu_size = max(sizeof vcpu_vmx, sizeof vcpu_tdx);
	vcpu_align = max(...);
	vt_x86_ops.vm_size = max(...);

in vt_init.
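
Spelled out, that would be something like the following in vt_init()
(a sketch of the suggestion; it assumes struct vcpu_tdx and struct
kvm_tdx are visible from main.c):

	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
	vcpu_align = max(__alignof__(struct vcpu_vmx),
			 __alignof__(struct vcpu_tdx));
	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct kvm_tdx));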

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module
  2022-03-04 19:48 ` [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
@ 2022-03-13 13:59   ` Paolo Bonzini
  2022-03-13 23:02     ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 13:59 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> Signed-off-by: Isaku Yamahata<isaku.yamahata@intel.com>
> ---
>   arch/x86/include/asm/tdx.h | 55 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/virt/vmx/tdx.c    | 16 +++++++++--
>   arch/x86/virt/vmx/tdx.h    | 52 -----------------------------------
>   3 files changed, 69 insertions(+), 54 deletions(-)

Patch looks good, but place these definitions in 
arch/x86/include/asm/tdx.h already in Kai's series if possible.

Apart from that,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 004/104] KVM: Enable hardware before doing arch VM initialization
  2022-03-04 19:48 ` [RFC PATCH v5 004/104] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
@ 2022-03-13 14:00   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:00 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Swap the order of hardware_enable_all() and kvm_arch_init_vm() to
> accommodate Intel's TDX, which needs VMX to be enabled during VM init in
> order to make SEAMCALLs.
> 
> This also provides consistent ordering between kvm_create_vm() and
> kvm_destroy_vm() with respect to calling kvm_arch_destroy_vm() and
> hardware_disable_all().
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

and please submit this as a preparation patch that can be committed 
separately.

Paolo

> ---
>   virt/kvm/kvm_main.c | 14 +++++++-------
>   1 file changed, 7 insertions(+), 7 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 0afc016cc54d..52f72a366beb 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1105,19 +1105,19 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   		rcu_assign_pointer(kvm->buses[i],
>   			kzalloc(sizeof(struct kvm_io_bus), GFP_KERNEL_ACCOUNT));
>   		if (!kvm->buses[i])
> -			goto out_err_no_arch_destroy_vm;
> +			goto out_err_no_disable;
>   	}
>   
>   	kvm->max_halt_poll_ns = halt_poll_ns;
>   
> -	r = kvm_arch_init_vm(kvm, type);
> -	if (r)
> -		goto out_err_no_arch_destroy_vm;
> -
>   	r = hardware_enable_all();
>   	if (r)
>   		goto out_err_no_disable;
>   
> +	r = kvm_arch_init_vm(kvm, type);
> +	if (r)
> +		goto out_err_no_arch_destroy_vm;
> +
>   #ifdef CONFIG_HAVE_KVM_IRQFD
>   	INIT_HLIST_HEAD(&kvm->irq_ack_notifier_list);
>   #endif
> @@ -1145,10 +1145,10 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   		mmu_notifier_unregister(&kvm->mmu_notifier, current->mm);
>   #endif
>   out_err_no_mmu_notifier:
> -	hardware_disable_all();
> -out_err_no_disable:
>   	kvm_arch_destroy_vm(kvm);
>   out_err_no_arch_destroy_vm:
> +	hardware_disable_all();
> +out_err_no_disable:
>   	WARN_ON_ONCE(!refcount_dec_and_test(&kvm->users_count));
>   	for (i = 0; i < KVM_NR_BUSES; i++)
>   		kfree(kvm_get_bus(kvm, i));


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 000/104] KVM TDX basic feature support
  2022-03-07  7:44 ` [RFC PATCH v5 000/104] KVM TDX basic feature support Christoph Hellwig
@ 2022-03-13 14:00   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:00 UTC (permalink / raw)
  To: Christoph Hellwig, isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On 3/7/22 08:44, Christoph Hellwig wrote:
> A series of 104 patches is completely unreviewable, please split it into
> reasonable chunks.

It is split into 5-15 patch chunks, and I'm going to review it mostly 
according to the separation.  It's just posted together because it 
doesn't really accomplish anything until all the chunks are merged together.

 From the cover letter:

>> TDX, VMX coexistence:
>>         Infrastructure to allow TDX to coexist with VMX and trigger the
>>         initialization of the TDX module.
>>         This layer starts with
>>         "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
>> TDX architectural definitions:
>>         Add TDX architectural definitions and helper functions
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
>> TD VM creation/destruction:
>>         Guest TD creation/destroy: allocation and releasing of TDX specific vm
>>         and vcpu structure.  Create an initial guest memory image with TDX
>>         measurement.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
>> TD vcpu creation/destruction:
>>         Guest TD creation/destroy: allocation and releasing of TDX specific vm
>>         and vcpu structure.  Create an initial guest memory image with TDX
>>         measurement.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
>> TDX EPT violation:
>>         Create an initial guest memory image with TDX measurement.  Handle
>>         secure EPT violations to populate guest pages with TDX SEAMCALLs.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
>> TD vcpu enter/exit:
>>         Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
>>         entering into TD.  Restore CPU state after exiting from TD.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
>> TD vcpu interrupts/exit/hypercall:
>>         Handle various exits/hypercalls and allow interrupts to be injected so
>>         that TD vcpu can continue running.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
>> 
>> KVM MMU GPA stolen bits:
>>         Introduce framework to handle the stolen/repurposed bit of GPA.  TDX
>>         repurposed a bit of GPA to indicate shared or private. If it's shared,
>>         it's the same as the conventional VMX EPT case.  VMM can access shared
>>         guest pages.  If it's private, it's handled by Secure-EPT and the guest
>>         page is encrypted.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
>> KVM TDP refactoring for TDX:
>>         TDX Secure EPT requires different constants, e.g. the initial EPT
>>         entry value.  Various refactoring for those differences.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
>> KVM TDP MMU hooks:
>>         Introduce framework to TDP MMU to add hooks in addition to direct EPT
>>         access TDX added Secure EPT which is an enhancement to VMX EPT.  Unlike
>>         conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
>>         use TDX SEAMCALLs to operate on Secure EPT.
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
>> KVM TDP MMU MapGPA:
>>         Introduce framework to handle switching guest pages from private/shared
>>         to shared/private.  For a given GPA, a guest page can be assigned to a
>>         private GPA or a shared GPA exclusively.  With TDX MapGPA hypercall,
>>         guest TD converts GPA assignments from private (or shared) to shared (or
>>         private).
>>         This layer starts with
>>         "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-04 19:48 ` [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize " isaku.yamahata
@ 2022-03-13 14:03   ` Paolo Bonzini
  2022-03-14 19:45     ` Isaku Yamahata
  2022-03-31  3:31   ` Kai Huang
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:03 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> +
> +	if (!tdx_module_initialized) {
> +		if (enable_tdx) {
> +			ret = __tdx_module_setup();
> +			if (ret)
> +				enable_tdx = false;

"enable_tdx = false" isn't great to do only when a VM is created.  Does 
it make sense to anticipate this to the point when the kvm_intel.ko 
module is loaded?
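
Concretely, the decision would then be made once, on the hardware-setup
path, e.g. (a sketch of the suggestion only; note the earlier patch
defers this precisely because VMX is not yet enabled by VMXON at module
load, so the module setup may need to stay tied to hardware_enable):

	void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
	{
		if (!enable_tdx)
			return;

		/* Decide at module load rather than at first VM creation. */
		if (__tdx_hardware_setup(x86_ops) || __tdx_module_setup())
			enable_tdx = false;
	}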

Paolo

> +			else
> +				tdx_module_initialized = true;
> +		} else
> +			ret = -EOPNOTSUPP;
> +	}
> +


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 009/104] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs
  2022-03-04 19:48 ` [RFC PATCH v5 009/104] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
@ 2022-03-13 14:07   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:07 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
> some operations (e.g., memory read/write, register state access, etc.).
> 
> Introduce vm_type to track the type of the VM in x86 KVM.  Other arch KVMs
> already use vm_type, KVM_CREATE_VM accepts a vm_type, and the x86 KVM
> callback vm_init accepts a vm_type, so follow them.  Further, a different
> policy can be applied based on vm_type.  Define KVM_X86_DEFAULT_VM for the
> default VM and KVM_X86_TDX_VM for the Intel TDX VM.  The wrapper function
> will be defined as "bool is_td(kvm) { return vm_type == KVM_X86_TDX_VM; }".
> 
> Add a capability, KVM_CAP_VM_TYPES, to allow the device model, e.g. qemu,
> to query which VM types are supported by KVM.  This approach (introducing a
> new capability and adding vm_type) is chosen to align with other arch KVMs
> that already have VM types.  Other arch KVMs use different names to query
> supported vm types and there is no common name for it, so a new name was
> chosen.
> 
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> ---
>   Documentation/virt/kvm/api.rst        | 15 +++++++++++++++
>   arch/x86/include/asm/kvm-x86-ops.h    |  1 +
>   arch/x86/include/asm/kvm_host.h       |  2 ++
>   arch/x86/include/uapi/asm/kvm.h       |  3 +++
>   arch/x86/kvm/svm/svm.c                |  6 ++++++
>   arch/x86/kvm/vmx/main.c               |  1 +
>   arch/x86/kvm/vmx/tdx.h                |  6 +-----
>   arch/x86/kvm/vmx/vmx.c                |  5 +++++
>   arch/x86/kvm/vmx/x86_ops.h            |  1 +
>   arch/x86/kvm/x86.c                    |  9 ++++++++-
>   include/uapi/linux/kvm.h              |  1 +
>   tools/arch/x86/include/uapi/asm/kvm.h |  3 +++
>   tools/include/uapi/linux/kvm.h        |  1 +
>   13 files changed, 48 insertions(+), 6 deletions(-)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f3172376ec3..b1e142719ec0 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -147,15 +147,30 @@ described as 'basic' will be available.
>   The new VM has no virtual cpus and no memory.
>   You probably want to use 0 as machine type.
>   
> +X86:
> +^^^^
> +
> +Supported vm types can be queried from KVM_CAP_VM_TYPES, which returns the
> +bitmap of supported vm types. The 1-setting of bit @n means vm type with
> +value @n is supported.
> +
> +S390:
> +^^^^^
> +
>   In order to create user controlled virtual machines on S390, check
>   KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
>   privileged user (CAP_SYS_ADMIN).
>   
> +MIPS:
> +^^^^^
> +
>   To use hardware assisted virtualization on MIPS (VZ ASE) rather than
>   the default trap & emulate implementation (which changes the virtual
>   memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
>   flag KVM_VM_MIPS_VZ.
>   
> +ARM64:
> +^^^^^^
>   
>   On arm64, the physical address size for a VM (IPA Size limit) is limited
>   to 40bits by default. The limit can be configured if the host supports the
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index d39e0de06be2..8125d43d3566 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -18,6 +18,7 @@ KVM_X86_OP_NULL(hardware_unsetup)
>   KVM_X86_OP_NULL(cpu_has_accelerated_tpr)
>   KVM_X86_OP(has_emulated_msr)
>   KVM_X86_OP(vcpu_after_set_cpuid)
> +KVM_X86_OP(is_vm_type_supported)
>   KVM_X86_OP(vm_init)
>   KVM_X86_OP_NULL(vm_destroy)
>   KVM_X86_OP(vcpu_create)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index b61e46743184..8de357a9ad30 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1048,6 +1048,7 @@ struct kvm_x86_msr_filter {
>   #define APICV_INHIBIT_REASON_ABSENT	7
>   
>   struct kvm_arch {
> +	unsigned long vm_type;
>   	unsigned long n_used_mmu_pages;
>   	unsigned long n_requested_mmu_pages;
>   	unsigned long n_max_mmu_pages;
> @@ -1322,6 +1323,7 @@ struct kvm_x86_ops {
>   	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
>   	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
>   
> +	bool (*is_vm_type_supported)(unsigned long vm_type);
>   	unsigned int vm_size;
>   	int (*vm_init)(struct kvm *kvm);
>   	void (*vm_destroy)(struct kvm *kvm);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index bf6e96011dfe..71a5851475e7 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
>   #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
>   #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
>   
> +#define KVM_X86_DEFAULT_VM	0
> +#define KVM_X86_TDX_VM		1
> +
>   #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index fd3a00c892c7..778075b71dc3 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4512,6 +4512,11 @@ static void svm_vm_destroy(struct kvm *kvm)
>   	sev_vm_destroy(kvm);
>   }
>   
> +static bool svm_is_vm_type_supported(unsigned long type)
> +{
> +	return type == KVM_X86_DEFAULT_VM;
> +}
> +
>   static int svm_vm_init(struct kvm *kvm)
>   {
>   	if (!pause_filter_count || !pause_filter_thresh)
> @@ -4539,6 +4544,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>   	.vcpu_free = svm_free_vcpu,
>   	.vcpu_reset = svm_vcpu_reset,
>   
> +	.is_vm_type_supported = svm_is_vm_type_supported,
>   	.vm_size = sizeof(struct kvm_svm),
>   	.vm_init = svm_vm_init,
>   	.vm_destroy = svm_vm_destroy,
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 28a7597d0782..77da926ee505 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -29,6 +29,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.cpu_has_accelerated_tpr = report_flexpriority,
>   	.has_emulated_msr = vmx_has_emulated_msr,
>   
> +	.is_vm_type_supported = vmx_is_vm_type_supported,
>   	.vm_size = sizeof(struct kvm_vmx),
>   	.vm_init = vmx_vm_init,
>   
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index d448e019602c..616fbf79b129 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -15,11 +15,7 @@ struct vcpu_tdx {
>   
>   static inline bool is_td(struct kvm *kvm)
>   {
> -	/*
> -	 * TDX VM type isn't defined yet.
> -	 * return kvm->arch.vm_type == KVM_X86_TDX_VM;
> -	 */
> -	return false;
> +	return kvm->arch.vm_type == KVM_X86_TDX_VM;
>   }
>   
>   static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7838cd177f0e..3c7b3f245fee 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7079,6 +7079,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
>   	return err;
>   }
>   
> +bool vmx_is_vm_type_supported(unsigned long type)
> +{
> +	return type == KVM_X86_DEFAULT_VM;
> +}
> +
>   #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>   #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>   
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 1bad27e592b5..f7327bc73be0 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -25,6 +25,7 @@ void vmx_hardware_unsetup(void);
>   int vmx_hardware_enable(void);
>   void vmx_hardware_disable(void);
>   bool report_flexpriority(void);
> +bool vmx_is_vm_type_supported(unsigned long type);
>   int vmx_vm_init(struct kvm *kvm);
>   int vmx_vcpu_create(struct kvm_vcpu *vcpu);
>   int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index aa2942060154..f6438750d190 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4344,6 +4344,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   			r = sizeof(struct kvm_xsave);
>   		break;
>   	}
> +	case KVM_CAP_VM_TYPES:
> +		r = BIT(KVM_X86_DEFAULT_VM);
> +		if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
> +			r |= BIT(KVM_X86_TDX_VM);
> +		break;
>   	default:
>   		break;
>   	}
> @@ -11583,9 +11588,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	int ret;
>   	unsigned long flags;
>   
> -	if (type)
> +	if (!static_call(kvm_x86_is_vm_type_supported)(type))
>   		return -EINVAL;
>   
> +	kvm->arch.vm_type = type;
> +
>   	ret = kvm_page_track_init(kvm);
>   	if (ret)
>   		return ret;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 507ee1f2aa96..bb3e29770f76 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1135,6 +1135,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_XSAVE2 208
>   #define KVM_CAP_SYS_ATTRIBUTES 209
>   #define KVM_CAP_PPC_AIL_MODE_3 210
> +#define KVM_CAP_VM_TYPES 211
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index bf6e96011dfe..71a5851475e7 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -525,4 +525,7 @@ struct kvm_pmu_event_filter {
>   #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
>   #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
>   
> +#define KVM_X86_DEFAULT_VM	0
> +#define KVM_X86_TDX_VM		1
> +
>   #endif /* _ASM_X86_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 507ee1f2aa96..1a99d0aae852 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -1135,6 +1135,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_XSAVE2 208
>   #define KVM_CAP_SYS_ATTRIBUTES 209
>   #define KVM_CAP_PPC_AIL_MODE_3 210
> +#define KVM_CAP_VM_TYPES 211
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
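
For illustration only (not part of the patch), a userspace VMM could consume the
new capability and VM type roughly as follows; KVM_CAP_VM_TYPES, KVM_X86_DEFAULT_VM
and KVM_X86_TDX_VM come from the updated uapi headers above, everything else is
standard KVM ioctl usage:

#include <fcntl.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Sketch: pick the TDX VM type when the host advertises it. */
static int create_vm(void)
{
	int kvm_fd = open("/dev/kvm", O_RDWR);
	int types, vm_fd;
	unsigned long type = KVM_X86_DEFAULT_VM;

	if (kvm_fd < 0)
		return -1;

	/* Returns a bitmask of supported VM types (bit KVM_X86_TDX_VM set when TDX is usable). */
	types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
	if (types > 0 && (types & (1 << KVM_X86_TDX_VM)))
		type = KVM_X86_TDX_VM;

	/* The VM type is passed as the KVM_CREATE_VM "machine type" argument. */
	vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, type);
	return vm_fd;
}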


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 013/104] KVM: TDX: Add TDX "architectural" error codes
  2022-03-04 19:48 ` [RFC PATCH v5 013/104] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
@ 2022-03-13 14:08   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:08 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Add error codes for the TDX SEAMCALLs, both for the TDX VMM side (TDH
> SEAMCALLs) and for the TDX guest side (TDG.VP.VMCALL).  KVM issues the TDX
> SEAMCALLs and checks their error codes.  KVM also handles hypercalls from the
> TDX guest and may return an error, so error codes for the TDX guest are
> needed as well.
> 
> TDX SEAMCALLs use bits 31:0 to return additional information, so these error
> codes will only exactly match RAX[63:32].  Error codes for TDG.VP.VMCALL are
> defined by the TDX Guest-Host-Communication Interface spec.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


> ---
>   arch/x86/kvm/vmx/tdx_errno.h | 29 +++++++++++++++++++++++++++++
>   1 file changed, 29 insertions(+)
>   create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
> 
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> new file mode 100644
> index 000000000000..5c878488795d
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -0,0 +1,29 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* architectural status code for SEAMCALL */
> +
> +#ifndef __KVM_X86_TDX_ERRNO_H
> +#define __KVM_X86_TDX_ERRNO_H
> +
> +#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
> +
> +/*
> + * TDX SEAMCALL Status Codes (returned in RAX)
> + */
> +#define TDX_SUCCESS				0x0000000000000000ULL
> +#define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
> +#define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
> +#define TDX_LIFECYCLE_STATE_INCORRECT		0xC000060700000000ULL
> +#define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
> +#define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
> +#define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
> +#define TDX_KEY_CONFIGURED			0x0000081500000000ULL
> +#define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
> +
> +/*
> + * TDG.VP.VMCALL Status Codes (returned in R10)
> + */
> +#define TDG_VP_VMCALL_SUCCESS			0x0000000000000000ULL
> +#define TDG_VP_VMCALL_INVALID_OPERAND		0x8000000000000000ULL
> +#define TDG_VP_VMCALL_TDREPORT_FAILED		0x8000000000000001ULL
> +
> +#endif /* __KVM_X86_TDX_ERRNO_H */
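
To make the RAX[63:32] point from the changelog concrete: because bits 31:0 of
the SEAMCALL return value may carry leaf-specific detail, a caller that wants to
classify an error generally masks them off before comparing against the codes
above.  A minimal sketch (the helper name is illustrative, not from the patch):

/* Compare only the status class in RAX[63:32]; RAX[31:0] may hold extra detail. */
static inline bool tdx_status_is(u64 err, u64 status)
{
	return (err & TDX_SEAMCALL_STATUS_MASK) == status;
}

/* e.g. tdx_status_is(err, TDX_EPT_WALK_FAILED) */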


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL
  2022-03-04 19:48 ` [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL isaku.yamahata
@ 2022-03-13 14:10   ` Paolo Bonzini
  2022-03-13 22:42   ` Kai Huang
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:10 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add an assembly function for KVM to call the TDX module because __seamcall
> defined in arch/x86/virt/vmx/seamcall.S doesn't fit the KVM use case.
> 
> The TDX module API returns extended error information in registers RCX, RDX,
> R8, R9, R10, and R11, in the error case as well as the success case.  KVM
> uses that extended error information in addition to the status code returned
> in RAX.  Update the assembly code to optionally return those outputs even in
> the error case and define a KVM-specific variant for calling the TDX module.
> 
> A SEAMCALL to the SEAM module (P-SEAMLDR or TDX module) can result in a
> VMfailInvalid error, indicated by CF=1, when VMX isn't enabled by the VMXON
> instruction.  Because KVM guarantees that VMX is enabled, the VMfailInvalid
> error can't happen, so don't check for it in the KVM variant.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/Makefile       |  2 +-
>   arch/x86/kvm/vmx/seamcall.S | 55 +++++++++++++++++++++++++++++++++++++
>   arch/x86/virt/tdxcall.S     |  8 ++++--
>   3 files changed, 62 insertions(+), 3 deletions(-)
>   create mode 100644 arch/x86/kvm/vmx/seamcall.S
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index e2c05195cb95..e8f83a7d0dc3 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,7 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
>   kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>   			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>   kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> -kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o
>   
>   kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
>   
> diff --git a/arch/x86/kvm/vmx/seamcall.S b/arch/x86/kvm/vmx/seamcall.S
> new file mode 100644
> index 000000000000..4a15017fc7dd
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/seamcall.S
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/export.h>
> +#include <asm/frame.h>
> +
> +#include "../../virt/tdxcall.S"
> +
> +/*
> + * kvm_seamcall()  - Host-side interface functions to SEAM software (TDX module)
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI.  Return the completion status of the SEAMCALL.  Additional output
> + * operands are saved in @out (if it is provided by the user).
> + * It doesn't check TDX_SEAMCALL_VMFAILINVALID unlike __seamcall() because KVM
> + * guarantees that VMX is enabled so that TDX_SEAMCALL_VMFAILINVALID doesn't
> + * happen.  In the case of error completion status code, extended error code may
> + * be stored in leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX                 - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX                 - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * kvm_seamcall() function ABI:
> + *
> + * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI)          - Input parameter 1, moved to RCX
> + * @rdx (RDX)          - Input parameter 2, moved to RDX
> + * @r8  (RCX)          - Input parameter 3, moved to R8
> + * @r9  (R8)           - Input parameter 4, moved to R9
> + *
> + * @out (R9)           - struct tdx_module_output pointer
> + *                       stored temporarily in R12 (not
> + *                       shared with the TDX module). It
> + *                       can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL
> + */
> +SYM_FUNC_START(kvm_seamcall)
> +        FRAME_BEGIN
> +        TDX_MODULE_CALL host=1 error_check=0
> +        FRAME_END
> +        ret
> +SYM_FUNC_END(kvm_seamcall)
> +EXPORT_SYMBOL_GPL(kvm_seamcall)
> diff --git a/arch/x86/virt/tdxcall.S b/arch/x86/virt/tdxcall.S
> index 90569faedacc..2e614b6b5f1e 100644
> --- a/arch/x86/virt/tdxcall.S
> +++ b/arch/x86/virt/tdxcall.S
> @@ -13,7 +13,7 @@
>   #define tdcall		.byte 0x66,0x0f,0x01,0xcc
>   #define seamcall	.byte 0x66,0x0f,0x01,0xcf
>   
> -.macro TDX_MODULE_CALL host:req
> +.macro TDX_MODULE_CALL host:req error_check=1

Perhaps name this argument ext_error_out and reverse it (that is, 0 is 
the current behavior while 1 is what KVM needs).

Paolo

>   	/*
>   	 * R12 will be used as temporary storage for struct tdx_module_output
>   	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
> @@ -51,9 +51,11 @@
>   	 *
>   	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
>   	 * This value will never be used as actual SEAMCALL error code.
> -	 */
> +	*/
> +	.if \error_check
>   	jnc .Lno_vmfailinvalid
>   	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> +	.endif
>   .Lno_vmfailinvalid:
>   	.else
>   	tdcall
> @@ -66,8 +68,10 @@
>   	pop %r12
>   
>   	/* Check for success: 0 - Successful, otherwise failed */
> +	.if \error_check
>   	test %rax, %rax
>   	jnz .Lno_output_struct
> +	.endif
>   
>   	/*
>   	 * Since this function can be initiated without an output pointer,


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 015/104] KVM: TDX: add a helper function for KVM to issue SEAMCALL
  2022-03-04 19:48 ` [RFC PATCH v5 015/104] KVM: TDX: add a helper function for KVM to issue SEAMCALL isaku.yamahata
@ 2022-03-13 14:11   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:11 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TODO: Consolidate seamcall helper function with TDX host/guest patch series.
> For now, this is kept to make this patch series compile/work.
> 
> A VMM interacts with the TDX module using a new instruction (SEAMCALL).  A
> TDX VMM uses SEAMCALLs where a VMX VMM would have directly interacted with
> VMX instructions.  For instance, a TDX VMM does not have full access to the
> VM control structure corresponding to the VMX VMCS.  Instead, the VMM induces
> the TDX module to act on its behalf via SEAMCALLs.
> 
> Add a helper function for KVM C code to execute the SEAMCALL instruction and
> hide the SEAMCALL ABI details.  Although the x86 TDX host patch series
> defines a similar wrapper, the KVM TDX patch series defines its own because
> the KVM TDX case is performance-critical, unlike the x86 TDX one, which only
> does one-time initialization.  The difference is that the KVM TDX wrapper is
> defined as a static inline function without a check for an error that is
> known not to happen, so that the compiler can optimize it better.  The
> wrapper function in the x86 TDX host patch is written in assembly with the
> error check so that it can detect errors that can only occur during
> initialization.

I assume whatever survives of this patch will be merged in the previous one.

Paolo

> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/seamcall.h | 23 +++++++++++++++++++++++
>   1 file changed, 23 insertions(+)
>   create mode 100644 arch/x86/kvm/vmx/seamcall.h
> 
> diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
> new file mode 100644
> index 000000000000..604792e9a59f
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/seamcall.h
> @@ -0,0 +1,23 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef __KVM_VMX_SEAMCALL_H
> +#define __KVM_VMX_SEAMCALL_H
> +
> +#ifdef CONFIG_INTEL_TDX_HOST
> +
> +#ifdef __ASSEMBLY__
> +
> +.macro seamcall
> +	.byte 0x66, 0x0f, 0x01, 0xcf
> +.endm
> +
> +#else
> +
> +struct tdx_module_output;
> +u64 kvm_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
> +		struct tdx_module_output *out);
> +
> +#endif /* !__ASSEMBLY__ */
> +
> +#endif	/* CONFIG_INTEL_TDX_HOST */
> +
> +#endif /* __KVM_VMX_SEAMCALL_H */
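
For context, the expected consumer of this helper is a thin static inline
wrapper per SEAMCALL leaf (the real wrappers appear later in the series in
tdx_ops.h).  A sketch, with the leaf/operand mapping assumed from the TDX module
spec rather than taken from this patch:

/* Sketch only: TDH.MNG.CREATE takes the TDR page PA in RCX and the HKID in RDX. */
static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
{
	return kvm_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
}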


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error
  2022-03-04 19:48 ` [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
@ 2022-03-13 14:12   ` Paolo Bonzini
  2022-04-15 16:54   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:12 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add helper functions to print out errors from the TDX module in a uniform
> manner.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/Makefile        |  2 +-
>   arch/x86/kvm/vmx/seamcall.h  |  2 ++
>   arch/x86/kvm/vmx/tdx_error.c | 22 ++++++++++++++++++++++
>   3 files changed, 25 insertions(+), 1 deletion(-)
>   create mode 100644 arch/x86/kvm/vmx/tdx_error.c
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index e8f83a7d0dc3..3d6550c73fb5 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,7 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
>   kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>   			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>   kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> -kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o vmx/tdx_error.o
>   
>   kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
>   
> diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
> index 604792e9a59f..5ac419cd8e27 100644
> --- a/arch/x86/kvm/vmx/seamcall.h
> +++ b/arch/x86/kvm/vmx/seamcall.h
> @@ -16,6 +16,8 @@ struct tdx_module_output;
>   u64 kvm_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
>   		struct tdx_module_output *out);
>   
> +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
> +
>   #endif /* !__ASSEMBLY__ */
>   
>   #endif	/* CONFIG_INTEL_TDX_HOST */
> diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
> new file mode 100644
> index 000000000000..61ed855d1188
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_error.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* functions to record TDX SEAMCALL error */
> +
> +#include <linux/kernel.h>
> +#include <linux/bug.h>
> +
> +#include "tdx_ops.h"
> +
> +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out)
> +{
> +	if (!out) {
> +		pr_err_ratelimited("SEAMCALL[%lld] failed: 0x%llx\n",
> +				op, error_code);
> +		return;
> +	}
> +
> +	pr_err_ratelimited(
> +		"SEAMCALL[%lld] failed: 0x%llx "
> +		"RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
> +		op, error_code,
> +		out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
> +}

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
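
For context, the expected call pattern (illustrative only, not part of this
patch) is to pass the failing leaf and the extended-output struct straight
through, e.g.:

/* Sketch of a call site; TDH.MEM.TRACK takes the TDR page PA in RCX. */
static u64 tdh_mem_track(hpa_t tdr)
{
	struct tdx_module_output out;
	u64 err;

	err = kvm_seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, 0, &out);
	if (WARN_ON_ONCE(err))
		pr_tdx_error(TDH_MEM_TRACK, err, &out);
	return err;
}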

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 012/104] KVM: TDX: Define TDX architectural definitions
  2022-03-04 19:48 ` [RFC PATCH v5 012/104] KVM: TDX: Define " isaku.yamahata
@ 2022-03-13 14:30   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-13 14:30 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Define architectural definitions for KVM to issue the TDX SEAMCALLs.
> 
> These are structures and values that are architecturally defined in the ABI
> Reference chapter of the TDX module specifications.
> 
> Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx_arch.h | 158 ++++++++++++++++++++++++++++++++++++
>   1 file changed, 158 insertions(+)
>   create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
> new file mode 100644
> index 000000000000..3824491d22dc
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_arch.h
> @@ -0,0 +1,158 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +/* architectural constants/data definitions for TDX SEAMCALLs */
> +
> +#ifndef __KVM_X86_TDX_ARCH_H
> +#define __KVM_X86_TDX_ARCH_H
> +
> +#include <linux/types.h>
> +
> +/*
> + * TDX SEAMCALL API function leaves
> + */
> +#define TDH_VP_ENTER			0
> +#define TDH_MNG_ADDCX			1
> +#define TDH_MEM_PAGE_ADD		2
> +#define TDH_MEM_SEPT_ADD		3
> +#define TDH_VP_ADDCX			4
> +#define TDH_MEM_PAGE_AUG		6
> +#define TDH_MEM_RANGE_BLOCK		7
> +#define TDH_MNG_KEY_CONFIG		8
> +#define TDH_MNG_CREATE			9
> +#define TDH_VP_CREATE			10
> +#define TDH_MNG_RD			11
> +#define TDH_MR_EXTEND			16
> +#define TDH_MR_FINALIZE			17
> +#define TDH_VP_FLUSH			18
> +#define TDH_MNG_VPFLUSHDONE		19
> +#define TDH_MNG_KEY_FREEID		20
> +#define TDH_MNG_INIT			21
> +#define TDH_VP_INIT			22
> +#define TDH_VP_RD			26
> +#define TDH_MNG_KEY_RECLAIMID		27
> +#define TDH_PHYMEM_PAGE_RECLAIM		28
> +#define TDH_MEM_PAGE_REMOVE		29
> +#define TDH_MEM_TRACK			38
> +#define TDH_MEM_RANGE_UNBLOCK		39
> +#define TDH_PHYMEM_CACHE_WB		40
> +#define TDH_PHYMEM_PAGE_WBINVD		41
> +#define TDH_VP_WR			43
> +#define TDH_SYS_LP_SHUTDOWN		44
> +
> +#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO		0x10000
> +#define TDG_VP_VMCALL_MAP_GPA				0x10001
> +#define TDG_VP_VMCALL_GET_QUOTE				0x10002
> +#define TDG_VP_VMCALL_REPORT_FATAL_ERROR		0x10003
> +#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT	0x10004
> +
> +/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
> +#define TDX_NON_ARCH			BIT_ULL(63)
> +#define TDX_CLASS_SHIFT			56
> +#define TDX_FIELD_MASK			GENMASK_ULL(31, 0)
> +
> +#define __BUILD_TDX_FIELD(non_arch, class, field)	\
> +	(((non_arch) ? TDX_NON_ARCH : 0) |		\
> +	 ((u64)(class) << TDX_CLASS_SHIFT) |		\
> +	 ((u64)(field) & TDX_FIELD_MASK))
> +
> +#define BUILD_TDX_FIELD(class, field)			\
> +	__BUILD_TDX_FIELD(false, (class), (field))
> +
> +#define BUILD_TDX_FIELD_NON_ARCH(class, field)		\
> +	__BUILD_TDX_FIELD(true, (class), (field))
> +
> +
> +/* @field is the VMCS field encoding */
> +#define TDVPS_VMCS(field)		BUILD_TDX_FIELD(0, (field))
> +
> +enum tdx_guest_other_state {
> +	TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
> +};
> +
> +union tdx_vcpu_state_details {
> +	struct {
> +		u64 vmxip	: 1;
> +		u64 reserved	: 63;
> +	};
> +	u64 full;
> +};
> +
> +/* @field is any of enum tdx_guest_other_state */
> +#define TDVPS_STATE(field)		BUILD_TDX_FIELD(17, (field))
> +#define TDVPS_STATE_NON_ARCH(field)	BUILD_TDX_FIELD_NON_ARCH(17, (field))
> +
> +/* Management class fields */
> +enum tdx_guest_management {
> +	TD_VCPU_PEND_NMI = 11,
> +};
> +
> +/* @field is any of enum tdx_guest_management */
> +#define TDVPS_MANAGEMENT(field)		BUILD_TDX_FIELD(32, (field))
> +
> +enum tdx_tdcs_execution_control {
> +	TD_TDCS_EXEC_TSC_OFFSET = 10,
> +};
> +
> +/* @field is any of enum tdx_tdcs_execution_control */
> +#define TDCS_EXEC(field)		BUILD_TDX_FIELD(17, (field))
> +
> +#define TDX_EXTENDMR_CHUNKSIZE		256
> +
> +struct tdx_cpuid_value {
> +	u32 eax;
> +	u32 ebx;
> +	u32 ecx;
> +	u32 edx;
> +} __packed;
> +
> +#define TDX_TD_ATTRIBUTE_DEBUG		BIT_ULL(0)
> +#define TDX_TD_ATTRIBUTE_PKS		BIT_ULL(30)
> +#define TDX_TD_ATTRIBUTE_KL		BIT_ULL(31)
> +#define TDX_TD_ATTRIBUTE_PERFMON	BIT_ULL(63)
> +
> +#define TDX_TD_XFAM_LBR			BIT_ULL(15)
> +#define TDX_TD_XFAM_AMX			(BIT_ULL(17) | BIT_ULL(18))
> +
> +/*
> + * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
> + */
> +struct td_params {
> +	u64 attributes;
> +	u64 xfam;
> +	u32 max_vcpus;
> +	u32 reserved0;
> +
> +	u64 eptp_controls;
> +	u64 exec_controls;
> +	u16 tsc_frequency;
> +	u8  reserved1[38];
> +
> +	u64 mrconfigid[6];
> +	u64 mrowner[6];
> +	u64 mrownerconfig[6];
> +	u64 reserved2[4];
> +
> +	union {
> +		struct tdx_cpuid_value cpuid_values[0];
> +		u8 reserved3[768];
> +	};
> +} __packed __aligned(1024);
> +
> +/*
> + * Guest uses MAX_PA for GPAW when set.
> + * 0: GPA.SHARED bit is GPA[47]
> + * 1: GPA.SHARED bit is GPA[51]
> + */
> +#define TDX_EXEC_CONTROL_MAX_GPAW      BIT_ULL(0)
> +
> +/*
> + * TDX requires the frequency to be defined in units of 25MHz, which is the
> + * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
> + * module can only program frequencies that are multiples of 25MHz.  The
> + * frequency must be between 100mhz and 10ghz (inclusive).
> + */
> +#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz)	((tsc_in_khz) / (25 * 1000))
> +#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz)	((tsc_in_25mhz) * (25 * 1000))
> +#define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
> +#define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
> +
> +#endif /* __KVM_X86_TDX_ARCH_H */
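
As a quick sanity check of the 25MHz unit conversion above (values are
illustrative, not from the patch):

/*
 * Example: a 2.3 GHz guest TSC is 2300000 kHz, so
 * TDX_TSC_KHZ_TO_25MHZ(2300000) = 2300000 / 25000 = 92 units, and
 * TDX_TSC_25MHZ_TO_KHZ(92) = 2300000 kHz again.  A non-multiple such as
 * 2310000 kHz rounds down to the same 92 units (2300 MHz), and the result
 * must stay within TDX_MIN/MAX_TSC_FREQUENCY_KHZ, i.e. [100 MHz, 10 GHz].
 */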


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL
  2022-03-04 19:48 ` [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL isaku.yamahata
  2022-03-13 14:10   ` Paolo Bonzini
@ 2022-03-13 22:42   ` Kai Huang
  1 sibling, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-03-13 22:42 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson, kirill.shutemov

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add an assembly function for KVM to call the TDX module because __seamcall
> defined in arch/x86/virt/vmx/seamcall.S doesn't fit the KVM use case.
> 
> The TDX module API returns extended error information in registers RCX, RDX,
> R8, R9, R10, and R11, in the error case as well as the success case.  KVM
> uses that extended error information in addition to the status code returned
> in RAX.  Update the assembly code to optionally return those outputs even in
> the error case and define a KVM-specific variant for calling the TDX module.

+Kirill.

Kirill's patches haven't been merged yet.  Can Kirill just change his patch so
you don't have to change the assembly code again?  Btw this change is
reasonable for the host kernel support series as well.  Sorry that I failed to
notice it earlier.

Btw, a SEAMCALL C function is already implemented in the host kernel support series:

https://lore.kernel.org/kvm/269a053607357eedd9a1e8ddf0e7240ae0c3985c.1647167475.git.kai.huang@intel.com/

I think maybe you can just export __seamcall() and use it?

> 
> A SEAMCALL to the SEAM module (P-SEAMLDR or TDX module) can result in a
> VMfailInvalid error, indicated by CF=1, when VMX isn't enabled by the VMXON
> instruction.  Because KVM guarantees that VMX is enabled, the VMfailInvalid
> error can't happen, so don't check for it in the KVM variant.

Perhaps you can introduce kvm_seamcall() which calls __seamcall(), but WARN()
if it returns VMfailInvalid?

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/Makefile       |  2 +-
>  arch/x86/kvm/vmx/seamcall.S | 55 +++++++++++++++++++++++++++++++++++++
>  arch/x86/virt/tdxcall.S     |  8 ++++--
>  3 files changed, 62 insertions(+), 3 deletions(-)
>  create mode 100644 arch/x86/kvm/vmx/seamcall.S
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index e2c05195cb95..e8f83a7d0dc3 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,7 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>  			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>  kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> -kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o
>  
>  kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
>  
> diff --git a/arch/x86/kvm/vmx/seamcall.S b/arch/x86/kvm/vmx/seamcall.S
> new file mode 100644
> index 000000000000..4a15017fc7dd
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/seamcall.S
> @@ -0,0 +1,55 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#include <linux/linkage.h>
> +#include <asm/export.h>
> +#include <asm/frame.h>
> +
> +#include "../../virt/tdxcall.S"
> +
> +/*
> + * kvm_seamcall()  - Host-side interface functions to SEAM software (TDX module)
> + *
> + * Transform function call register arguments into the SEAMCALL register
> + * ABI.  Return the completion status of the SEAMCALL.  Additional output
> + * operands are saved in @out (if it is provided by the user).
> + * It doesn't check TDX_SEAMCALL_VMFAILINVALID unlike __seamcall() because KVM
> + * guarantees that VMX is enabled so that TDX_SEAMCALL_VMFAILINVALID doesn't
> + * happen.  In the case of error completion status code, extended error code may
> + * be stored in leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + * SEAMCALL ABI:
> + *-------------------------------------------------------------------------
> + * Input Registers:
> + *
> + * RAX                 - SEAMCALL Leaf number.
> + * RCX,RDX,R8-R9       - SEAMCALL Leaf specific input registers.
> + *
> + * Output Registers:
> + *
> + * RAX                 - SEAMCALL completion status code.
> + * RCX,RDX,R8-R11      - SEAMCALL Leaf specific output registers.
> + *
> + *-------------------------------------------------------------------------
> + *
> + * kvm_seamcall() function ABI:
> + *
> + * @fn  (RDI)          - SEAMCALL Leaf number, moved to RAX
> + * @rcx (RSI)          - Input parameter 1, moved to RCX
> + * @rdx (RDX)          - Input parameter 2, moved to RDX
> + * @r8  (RCX)          - Input parameter 3, moved to R8
> + * @r9  (R8)           - Input parameter 4, moved to R9
> + *
> + * @out (R9)           - struct tdx_module_output pointer
> + *                       stored temporarily in R12 (not
> + *                       shared with the TDX module). It
> + *                       can be NULL.
> + *
> + * Return (via RAX) the completion status of the SEAMCALL
> + */
> +SYM_FUNC_START(kvm_seamcall)
> +        FRAME_BEGIN
> +        TDX_MODULE_CALL host=1 error_check=0
> +        FRAME_END
> +        ret
> +SYM_FUNC_END(kvm_seamcall)
> +EXPORT_SYMBOL_GPL(kvm_seamcall)

> diff --git a/arch/x86/virt/tdxcall.S b/arch/x86/virt/tdxcall.S
> index 90569faedacc..2e614b6b5f1e 100644
> --- a/arch/x86/virt/tdxcall.S
> +++ b/arch/x86/virt/tdxcall.S
> @@ -13,7 +13,7 @@
>  #define tdcall		.byte 0x66,0x0f,0x01,0xcc
>  #define seamcall	.byte 0x66,0x0f,0x01,0xcf
>  
> -.macro TDX_MODULE_CALL host:req
> +.macro TDX_MODULE_CALL host:req error_check=1
>  	/*
>  	 * R12 will be used as temporary storage for struct tdx_module_output
>  	 * pointer. Since R12-R15 registers are not used by TDCALL/SEAMCALL
> @@ -51,9 +51,11 @@
>  	 *
>  	 * Set %rax to TDX_SEAMCALL_VMFAILINVALID for VMfailInvalid.
>  	 * This value will never be used as actual SEAMCALL error code.
> -	 */
> +	*/
> +	.if \error_check
>  	jnc .Lno_vmfailinvalid
>  	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
> +	.endif
>  .Lno_vmfailinvalid:
>  	.else
>  	tdcall
> @@ -66,8 +68,10 @@
>  	pop %r12
>  
>  	/* Check for success: 0 - Successful, otherwise failed */
> +	.if \error_check
>  	test %rax, %rax
>  	jnz .Lno_output_struct
> +	.endif
>  

Checking VMfailInvalid and checking whether %rax is 0 are totally different
things.  I don't think it's good to use one single 'error_check' parameter to
handle both.

As I mentioned above, I think checking whether %rax is 0 can just be removed
in Kirill's patch.  But I don't have a strong opinion on whether you should add
another parameter to determine whether to check VMfailInvalid, or whether you
should just call __seamcall(), which is implemented in the host support series.
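
Concretely, the suggestion amounts to something like the sketch below (the
__seamcall() signature is assumed from the host kernel support series, not
taken from this patch):

/* Sketch: thin KVM wrapper over an exported __seamcall(). */
static inline u64 kvm_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
			       struct tdx_module_output *out)
{
	u64 ret = __seamcall(op, rcx, rdx, r8, r9, out);

	/* KVM guarantees VMXON has been done, so VMfailInvalid is a KVM bug. */
	WARN_ON_ONCE(ret == TDX_SEAMCALL_VMFAILINVALID);
	return ret;
}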


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module
  2022-03-13 13:59   ` Paolo Bonzini
@ 2022-03-13 23:02     ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-03-13 23:02 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Sun, 2022-03-13 at 14:59 +0100, Paolo Bonzini wrote:
> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > Signed-off-by: Isaku Yamahata<isaku.yamahata@intel.com>
> > ---
> >   arch/x86/include/asm/tdx.h | 55 ++++++++++++++++++++++++++++++++++++++
> >   arch/x86/virt/vmx/tdx.c    | 16 +++++++++--
> >   arch/x86/virt/vmx/tdx.h    | 52 -----------------------------------
> >   3 files changed, 69 insertions(+), 54 deletions(-)
> 
> Patch looks good, but place these definitions in 
> arch/x86/include/asm/tdx.h already in Kai's series if possible.
> 
> Apart from that,
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> Paolo

Does it make more sense for me to just include this patch (and a couple of
other patches, such as those exporting information about TDX KeyIDs) in the
host kernel support series?

-- 
Thanks,
-Kai

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported
  2022-03-04 19:48 ` [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported isaku.yamahata
@ 2022-03-13 23:08   ` Kai Huang
  2022-03-15 21:03     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-13 23:08 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> As the first step of TDX VM support, report to the device model (e.g. qemu)
> that the TDX VM type is supported.  The callback to create a guest TD is the
> vm_init callback for KVM_CREATE_VM.  Add a placeholder function and call a
> function to initialize the TDX module on demand, because by the time that
> callback runs, VMX has been enabled by the hardware_enable callback
> (vmx_hardware_enable).

Should we put this patch at the end of the series until all the changes required
to run a TD are introduced?  This patch essentially tells userspace that KVM is
ready to support a TD, but actually it's not ready.  And this might also cause
a bisect issue, I suppose?

-- 
Thanks,
-Kai

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 24 ++++++++++++++++++++++--
>  arch/x86/kvm/vmx/tdx.c     |  5 +++++
>  arch/x86/kvm/vmx/vmx.c     |  5 -----
>  arch/x86/kvm/vmx/x86_ops.h |  3 ++-
>  4 files changed, 29 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 77da926ee505..8103d1c32cc9 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -5,6 +5,12 @@
>  #include "vmx.h"
>  #include "nested.h"
>  #include "pmu.h"
> +#include "tdx.h"
> +
> +static bool vt_is_vm_type_supported(unsigned long type)
> +{
> +	return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
> +}
>  
>  static __init int vt_hardware_setup(void)
>  {
> @@ -19,6 +25,20 @@ static __init int vt_hardware_setup(void)
>  	return 0;
>  }
>  
> +static int vt_vm_init(struct kvm *kvm)
> +{
> +	int ret;
> +
> +	if (is_td(kvm)) {
> +		ret = tdx_module_setup();
> +		if (ret)
> +			return ret;
> +		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
> +	}
> +
> +	return vmx_vm_init(kvm);
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = "kvm_intel",
>  
> @@ -29,9 +49,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.cpu_has_accelerated_tpr = report_flexpriority,
>  	.has_emulated_msr = vmx_has_emulated_msr,
>  
> -	.is_vm_type_supported = vmx_is_vm_type_supported,
> +	.is_vm_type_supported = vt_is_vm_type_supported,
>  	.vm_size = sizeof(struct kvm_vmx),
> -	.vm_init = vmx_vm_init,
> +	.vm_init = vt_vm_init,
>  
>  	.vcpu_create = vmx_vcpu_create,
>  	.vcpu_free = vmx_vcpu_free,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8adc87ad1807..e8d293a3c11c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -105,6 +105,11 @@ int tdx_module_setup(void)
>  	return ret;
>  }
>  
> +bool tdx_is_vm_type_supported(unsigned long type)
> +{
> +	return type == KVM_X86_TDX_VM && READ_ONCE(enable_tdx);
> +}
> +
>  static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>  {
>  	u32 max_pa;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 3c7b3f245fee..7838cd177f0e 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7079,11 +7079,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
>  	return err;
>  }
>  
> -bool vmx_is_vm_type_supported(unsigned long type)
> -{
> -	return type == KVM_X86_DEFAULT_VM;
> -}
> -
>  #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>  #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>  
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index f7327bc73be0..78331dbc29f7 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -25,7 +25,6 @@ void vmx_hardware_unsetup(void);
>  int vmx_hardware_enable(void);
>  void vmx_hardware_disable(void);
>  bool report_flexpriority(void);
> -bool vmx_is_vm_type_supported(unsigned long type);
>  int vmx_vm_init(struct kvm *kvm);
>  int vmx_vcpu_create(struct kvm_vcpu *vcpu);
>  int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
> @@ -130,10 +129,12 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
>  #ifdef CONFIG_INTEL_TDX_HOST
>  void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
>  			unsigned int *vcpu_align, unsigned int *vm_size);
> +bool tdx_is_vm_type_supported(unsigned long type);
>  void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
>  #else
>  static inline void tdx_pre_kvm_init(
>  	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
> +static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
>  static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
>  #endif
>  



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization
  2022-03-13 13:49   ` Paolo Bonzini
@ 2022-03-14 18:34     ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-14 18:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

Thanks for comment.

On Sun, Mar 13, 2022 at 02:49:51PM +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > +static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> > +{
> > +	u32 max_pa;
> > +
> > +	if (!enable_ept) {
> > +		pr_warn("Cannot enable TDX with EPT disabled\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (!platform_has_tdx()) {
> > +		pr_warn("Cannot enable TDX with SEAMRR disabled\n");
> > +		return -ENODEV;
> > +	}
> 
> This will cause a pr_warn in the logs on all machines that don't have TDX.
> Perhaps you can restrict the pr_warn() to machines that have
> __seamrr_enabled() == true?

Makes sense. I'll include the following change.

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 73bb472bd515..aa02c98afd11 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -134,6 +134,7 @@ struct tdsysinfo_struct {
        };
 } __packed __aligned(TDSYSINFO_STRUCT_ALIGNMENT);
 
+bool __seamrr_enabled(void);
 void tdx_detect_cpu(struct cpuinfo_x86 *c);
 int tdx_detect(void);
 int tdx_init(void);
@@ -143,6 +144,7 @@ u32 tdx_get_global_keyid(void);
 int tdx_keyid_alloc(void);
 void tdx_keyid_free(int keyid);
 #else
+static inline bool __seamrr_enabled(void) { return false; }
 static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
 static inline int tdx_detect(void) { return -ENODEV; }
 static inline int tdx_init(void) { return -ENODEV; }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 66dffe815e63..880d8291b380 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2625,7 +2625,8 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
        }
 
        if (!platform_has_tdx()) {
-               pr_warn("Cannot enable TDX with SEAMRR disabled\n");
+               if (__seamrr_enabled())
+                       pr_warn("Cannot enable TDX with SEAMRR disabled\n");
                return -ENODEV;
        }
 
diff --git a/arch/x86/virt/vmx/tdx.c b/arch/x86/virt/vmx/tdx.c
index d99961b7cb02..bb578a72b2da 100644
--- a/arch/x86/virt/vmx/tdx.c
+++ b/arch/x86/virt/vmx/tdx.c
@@ -186,10 +186,11 @@ static const struct kernel_param_ops tdx_trace_ops = {
 module_param_cb(tdx_trace_level, &tdx_trace_ops, &tdx_trace_level, 0644);
 MODULE_PARM_DESC(tdx_trace_level, "TDX module trace level");
 
-static bool __seamrr_enabled(void)
+bool __seamrr_enabled(void)
 {
        return (seamrr_mask & SEAMRR_ENABLED_BITS) == SEAMRR_ENABLED_BITS;
 }
+EXPORT_SYMBOL_GPL(__seamrr_enabled);
 
 static void detect_seam_bsp(struct cpuinfo_x86 *c)
 {


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions
  2022-03-13 13:54   ` Paolo Bonzini
@ 2022-03-14 19:22     ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-14 19:22 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

Thanks for review.

On Sun, Mar 13, 2022 at 02:54:15PM +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > -static int __init vmx_init(void)
> > -{
> > -	int r, cpu;
> > +	if (sizeof(struct vcpu_vmx) > *vcpu_size)
> > +		*vcpu_size = sizeof(struct vcpu_vmx);
> > +	if (__alignof__(struct vcpu_vmx) > *vcpu_align)
> > +		*vcpu_align = __alignof__(struct vcpu_vmx);
> 
> Please keep these four lines in vt_init, and rename the rest of
> vmx_pre_kvm_init to hv_vp_assist_page_init.  Likewise, rename
> vmx_post_kvm_exit to hv_vp_assist_page_exit.
> 
> Adjusting the vcpu_size and vcpu_align for TDX (I guess) can be added later
> when TDX ops are introduced.

sure. I'll make it like
  vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
  ...


Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-13 14:03   ` Paolo Bonzini
@ 2022-03-14 19:45     ` Isaku Yamahata
  2022-03-31  0:03       ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-14 19:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Sun, Mar 13, 2022 at 03:03:40PM +0100,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > +
> > +	if (!tdx_module_initialized) {
> > +		if (enable_tdx) {
> > +			ret = __tdx_module_setup();
> > +			if (ret)
> > +				enable_tdx = false;
> 
> "enable_tdx = false" isn't great to do only when a VM is created.  Does it
> make sense to anticipate this to the point when the kvm_intel.ko module is
> loaded?

It's possible.  I have the following two reasons for choosing to defer TDX module
initialization until the first TD is created.  Given those reasons, do you still
want the initialization at kvm_intel.ko module load time?  If yes, I'll change it.

- memory overhead: Initializing the TDX module requires allocating physically
contiguous memory whose size is about 0.43% of the system memory.
If the user doesn't create any TDs, that memory is wasted.

- VMXON on all pCPUs: TDX module initialization requires enabling VMX (VMXON)
on all present pCPUs.  vmx_hardware_enable(), which is called when a guest is
created, already does that, so it naturally fits with initializing the TDX module
when the first TD is created.  I wanted to avoid adding code to do VMXON when
kvm_intel.ko is loaded.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported
  2022-03-13 23:08   ` Kai Huang
@ 2022-03-15 21:03     ` Isaku Yamahata
  2022-03-15 21:47       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-15 21:03 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Mon, Mar 14, 2022 at 12:08:59PM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > As the first step of TDX VM support, report to the device model (e.g. qemu)
> > that the TDX VM type is supported.  The callback to create a guest TD is the
> > vm_init callback for KVM_CREATE_VM.  Add a placeholder function and call a
> > function to initialize the TDX module on demand, because by the time that
> > callback runs, VMX has been enabled by the hardware_enable callback
> > (vmx_hardware_enable).
> 
> Should we put this patch at the end of the series until all the changes required
> to run a TD are introduced?  This patch essentially tells userspace that KVM is
> ready to support a TD, but actually it's not ready.  And this might also cause
> a bisect issue, I suppose?

The intention is that developers can exercise the new code step-by-step even if
the TDX KVM isn't complete.
How about introducing a new config option and removing it at the end of the patch series?

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 2b1548da00eb..a3287440aa9e 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -98,6 +98,20 @@ config X86_SGX_KVM
 
          If unsure, say N.
 
+config X86_TDX_KVM_EXPERIMENTAL
+       bool "EXPERIMENTAL Trust Domian Extensions (TDX) KVM support"
+       default n
+       depends on INTEL_TDX_HOST
+       depends on KVM_INTEL
+       help
+         Enable experimental TDX KVM support.  TDX KVM needs many patches and
+         the patches will be merged step by step, not at once. Even if TDX KVM
+         support is incomplete, enable TDX KVM support so that developers can
+         exercise TDX KVM code.  TODO: Remove this configuration once the
+         (first step of) TDX KVM support is complete.
+
+         If unsure, say N.
+
 config KVM_AMD
        tristate "KVM for AMD processors support"
        depends on KVM
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b16e2ed3b204..e31d6902e49c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -170,7 +170,11 @@ int tdx_module_setup(void)
 
 bool tdx_is_vm_type_supported(unsigned long type)
 {
+#ifdef CONFIG_X86_TDX_KVM_EXPERIMENTAL
        return type == KVM_X86_TDX_VM && READ_ONCE(enable_tdx);
+#else
+       return false;
+#endif
 }
 
 static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)


-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported
  2022-03-15 21:03     ` Isaku Yamahata
@ 2022-03-15 21:47       ` Kai Huang
  2022-03-15 21:49         ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-15 21:47 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, 2022-03-15 at 14:03 -0700, Isaku Yamahata wrote:
> On Mon, Mar 14, 2022 at 12:08:59PM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > As the first step of TDX VM support, report to the device model (e.g. qemu)
> > > that the TDX VM type is supported.  The callback to create a guest TD is the
> > > vm_init callback for KVM_CREATE_VM.  Add a placeholder function and call a
> > > function to initialize the TDX module on demand, because by the time that
> > > callback runs, VMX has been enabled by the hardware_enable callback
> > > (vmx_hardware_enable).
> > 
> > Should we put this patch at the end of the series until all the changes required
> > to run a TD are introduced?  This patch essentially tells userspace that KVM is
> > ready to support a TD, but actually it's not ready.  And this might also cause
> > a bisect issue, I suppose?
> 
> The intention is that developers can exercise the new code step-by-step even if
> the TDX KVM isn't complete.

What is the purpose/value of allowing developers to exercise the new code
step-by-step?  Userspace cannot create a TD successfully anyway until all the
patches are ready.

-- 
Thanks,
-Kai

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported
  2022-03-15 21:47       ` Kai Huang
@ 2022-03-15 21:49         ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-03-15 21:49 UTC (permalink / raw)
  To: Kai Huang, Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On 3/15/22 22:47, Kai Huang wrote:
>> The intention is that developers can exercise the new code step-by-step even if
>> the TDX KVM isn't complete.
> What is the purpose/value of allowing developers to exercise the new code
> step-by-step?  Userspace cannot create a TD successfully anyway until all the
> patches are ready.

We can move this to the end when the patch is committed, but I think 
there is value in showing that the series works (for partial definitions 
of "work") at every step of the enablement process.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function
  2022-03-04 19:49 ` [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
@ 2022-03-21 18:32   ` Sagi Shahar
  2022-03-23 17:53     ` Isaku Yamahata
  2022-04-07 13:12     ` Paolo Bonzini
  0 siblings, 2 replies; 310+ messages in thread
From: Sagi Shahar @ 2022-03-21 18:32 UTC (permalink / raw)
  To: Yamahata, Isaku
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Erdem Aktas, Connor Kuehl, Sean Christopherson

On Fri, Mar 4, 2022 at 12:00 PM <isaku.yamahata@intel.com> wrote:
>
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> By necessity, TDX will use a different register ABI for hypercalls.
> Break out the core functionality so that it may be reused for TDX.
>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  4 +++
>  arch/x86/kvm/x86.c              | 54 ++++++++++++++++++++-------------
>  2 files changed, 37 insertions(+), 21 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8dab9f16f559..33b75b0e3de1 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1818,6 +1818,10 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate,
>  void __kvm_request_apicv_update(struct kvm *kvm, bool activate,
>                                 unsigned long bit);
>
> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> +                                     unsigned long a0, unsigned long a1,
> +                                     unsigned long a2, unsigned long a3,
> +                                     int op_64_bit);
>  int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>
>  int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 314ae43e07bf..9acb33a17445 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -9090,26 +9090,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>         return kvm_skip_emulated_instruction(vcpu);
>  }
>
> -int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> +                                     unsigned long a0, unsigned long a1,
> +                                     unsigned long a2, unsigned long a3,
> +                                     int op_64_bit)
>  {
> -       unsigned long nr, a0, a1, a2, a3, ret;
> -       int op_64_bit;
> -
> -       if (kvm_xen_hypercall_enabled(vcpu->kvm))
> -               return kvm_xen_hypercall(vcpu);
> -
> -       if (kvm_hv_hypercall_enabled(vcpu))
> -               return kvm_hv_hypercall(vcpu);
> -
> -       nr = kvm_rax_read(vcpu);
> -       a0 = kvm_rbx_read(vcpu);
> -       a1 = kvm_rcx_read(vcpu);
> -       a2 = kvm_rdx_read(vcpu);
> -       a3 = kvm_rsi_read(vcpu);
> +       unsigned long ret;
>
>         trace_kvm_hypercall(nr, a0, a1, a2, a3);
>
> -       op_64_bit = is_64_bit_hypercall(vcpu);
>         if (!op_64_bit) {
>                 nr &= 0xFFFFFFFF;
>                 a0 &= 0xFFFFFFFF;
> @@ -9118,11 +9107,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>                 a3 &= 0xFFFFFFFF;
>         }
>
> -       if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> -               ret = -KVM_EPERM;
> -               goto out;
> -       }
> -
>         ret = -KVM_ENOSYS;
>
>         switch (nr) {
> @@ -9181,6 +9165,34 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>                 ret = -KVM_ENOSYS;
>                 break;
>         }
> +       return ret;
> +}
> +EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> +
> +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> +{
> +       unsigned long nr, a0, a1, a2, a3, ret;
> +       int op_64_bit;
> +
> +       if (kvm_xen_hypercall_enabled(vcpu->kvm))
> +               return kvm_xen_hypercall(vcpu);
> +
> +       if (kvm_hv_hypercall_enabled(vcpu))
> +               return kvm_hv_hypercall(vcpu);
> +
> +       nr = kvm_rax_read(vcpu);
> +       a0 = kvm_rbx_read(vcpu);
> +       a1 = kvm_rcx_read(vcpu);
> +       a2 = kvm_rdx_read(vcpu);
> +       a3 = kvm_rsi_read(vcpu);
> +       op_64_bit = is_64_bit_mode(vcpu);

I think this should be "op_64_bit = is_64_bit_hypercall(vcpu);"
is_64_bit_mode was replaced with is_64_bit_hypercall to support
protected guests here:
https://lore.kernel.org/all/87cztf8h43.fsf@vitty.brq.redhat.com/T/

Without it, op_64_bit will be set to 0 for TD VMs, which will cause the
upper 32 bits of the registers to be cleared in __kvm_emulate_hypercall().

> +
> +       if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> +               ret = -KVM_EPERM;
> +               goto out;
> +       }
> +
> +       ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
>  out:
>         if (!op_64_bit)
>                 ret = (u32)ret;
> --
> 2.25.1
>

Sagi

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path
  2022-03-04 19:49 ` [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
@ 2022-03-22 17:28   ` Erdem Aktas
  2022-03-23 17:55     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Erdem Aktas @ 2022-03-22 17:28 UTC (permalink / raw)
  To: Yamahata, Isaku
  Cc: open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Connor Kuehl, Sean Christopherson

On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> This patch implements running a TDX vcpu.  Once a vcpu runs on a logical
> processor (LP), the TDX vcpu is associated with it.  When the TDX vcpu
> moves to another LP, its state on the previous LP needs to be flushed.
> When destroying a TDX vcpu, it needs to complete the flush and flush the
> CPU memory cache.  Track which LP the TDX vcpu ran on and flush it as
> necessary.
> 
> Do nothing on the sched_in event, as TDX doesn't support pause-loop exiting.
> 
> TDX vcpu execution requires restoring the PMU debug store after returning
> back to KVM because the TDX module unconditionally resets the value.  To
> reuse the existing code, export perf_restore_debug_store.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 10 +++++++++-
>  arch/x86/kvm/vmx/tdx.c     | 34 ++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h     | 33 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h |  2 ++
>  arch/x86/kvm/x86.c         |  1 +
>  5 files changed, 79 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index f571b07c2aae..2e5a7a72d560 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,14 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         return vmx_vcpu_reset(vcpu, init_event);
>  }
>
> +static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
> +{
> +       if (is_td_vcpu(vcpu))
> +               return tdx_vcpu_run(vcpu);
> +
> +       return vmx_vcpu_run(vcpu);
> +}
> +
>  static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
>  {
>         if (is_td_vcpu(vcpu))
> @@ -200,7 +208,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>         .tlb_flush_guest = vt_flush_tlb_guest,
>
>         .vcpu_pre_run = vmx_vcpu_pre_run,
> -       .run = vmx_vcpu_run,
> +       .run = vt_vcpu_run,
>         .handle_exit = vmx_handle_exit,
>         .skip_emulated_instruction = vmx_skip_emulated_instruction,
>         .update_emulated_instruction = vmx_update_emulated_instruction,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 85d5f961d97e..ebe4f9bf19e7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -10,6 +10,9 @@
>  #include "vmx.h"
>  #include "x86.h"
>
> +#include <trace/events/kvm.h>
> +#include "trace.h"
> +
>  #undef pr_fmt
>  #define pr_fmt(fmt) "tdx: " fmt
>
> @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>         vcpu->kvm->vm_bugged = true;
>  }
>
> +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> +
> +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> +                                       struct vcpu_tdx *tdx)
> +{
> +       guest_enter_irqoff();
> +       tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> +       guest_exit_irqoff();
> +}
> +
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> +{
> +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +       if (unlikely(vcpu->kvm->vm_bugged)) {
> +               tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> +               return EXIT_FASTPATH_NONE;
> +       }
> +
> +       trace_kvm_entry(vcpu);
> +
> +       tdx_vcpu_enter_exit(vcpu, tdx);
> +
> +       vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> +       trace_kvm_exit(vcpu, KVM_ISA_VMX);
> +
> +       if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> +               return EXIT_FASTPATH_NONE;

Looks like the above if statement has no effect. Just checking if this
is intentional.

> +       return EXIT_FASTPATH_NONE;
> +}
> +
>  void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  {
>         td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index bf9865a88991..e950404ce5de 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -44,12 +44,45 @@ struct kvm_tdx {
>         spinlock_t seamcall_lock;
>  };
>
> +union tdx_exit_reason {
> +       struct {
> +               /* 31:0 mirror the VMX Exit Reason format */
> +               u64 basic               : 16;
> +               u64 reserved16          : 1;
> +               u64 reserved17          : 1;
> +               u64 reserved18          : 1;
> +               u64 reserved19          : 1;
> +               u64 reserved20          : 1;
> +               u64 reserved21          : 1;
> +               u64 reserved22          : 1;
> +               u64 reserved23          : 1;
> +               u64 reserved24          : 1;
> +               u64 reserved25          : 1;
> +               u64 bus_lock_detected   : 1;
> +               u64 enclave_mode        : 1;
> +               u64 smi_pending_mtf     : 1;
> +               u64 smi_from_vmx_root   : 1;
> +               u64 reserved30          : 1;
> +               u64 failed_vmentry      : 1;
> +
> +               /* 63:32 are TDX specific */
> +               u64 details_l1          : 8;
> +               u64 class               : 8;
> +               u64 reserved61_48       : 14;
> +               u64 non_recoverable     : 1;
> +               u64 error               : 1;
> +       };
> +       u64 full;
> +};
> +
>  struct vcpu_tdx {
>         struct kvm_vcpu vcpu;
>
>         struct tdx_td_page tdvpr;
>         struct tdx_td_page *tdvpx;
>
> +       union tdx_exit_reason exit_reason;
> +
>         bool initialized;
>  };
>
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 922a3799336e..44404dd25737 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -140,6 +140,7 @@ void tdx_vm_free(struct kvm *kvm);
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
>
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -160,6 +161,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
>
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index da411bcd8cbc..66400810d54f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -300,6 +300,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
>  };
>
>  u64 __read_mostly host_xcr0;
> +EXPORT_SYMBOL_GPL(host_xcr0);
>  u64 __read_mostly supported_xcr0;
>  EXPORT_SYMBOL_GPL(supported_xcr0);
>
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD
  2022-03-04 19:49 ` [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD isaku.yamahata
@ 2022-03-23  0:54   ` Erdem Aktas
  2022-03-23 19:08     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Erdem Aktas @ 2022-03-23  0:54 UTC (permalink / raw)
  To: Yamahata, Isaku
  Cc: open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Connor Kuehl, Sean Christopherson

On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> When shutting down the machine, (VMX or TDX) vcpus needs to be shutdown on
> each pcpu.  Do the similar for TDX with TDX SEAMCALL APIs.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 23 +++++++++++--
>  arch/x86/kvm/vmx/tdx.c     | 70 ++++++++++++++++++++++++++++++++++++--
>  arch/x86/kvm/vmx/tdx.h     |  2 ++
>  arch/x86/kvm/vmx/x86_ops.h |  4 +++
>  4 files changed, 95 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 2cd5ba0e8788..882358ac270b 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -13,6 +13,25 @@ static bool vt_is_vm_type_supported(unsigned long type)
>         return type == KVM_X86_DEFAULT_VM || tdx_is_vm_type_supported(type);
>  }
>
> +static int vt_hardware_enable(void)
> +{
> +       int ret;
> +
> +       ret = vmx_hardware_enable();
> +       if (ret)
> +               return ret;
> +
> +       tdx_hardware_enable();
> +       return 0;
> +}
> +
> +static void vt_hardware_disable(void)
> +{
> +       /* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
> +       tdx_hardware_disable();
> +       vmx_hardware_disable();
> +}
> +
>  static __init int vt_hardware_setup(void)
>  {
>         int ret;
> @@ -199,8 +218,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
>         .hardware_unsetup = vt_hardware_unsetup,
>
> -       .hardware_enable = vmx_hardware_enable,
> -       .hardware_disable = vmx_hardware_disable,
> +       .hardware_enable = vt_hardware_enable,
> +       .hardware_disable = vt_hardware_disable,
>         .cpu_has_accelerated_tpr = report_flexpriority,
>         .has_emulated_msr = vmx_has_emulated_msr,
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index a6b1a8ce888d..690298fb99c7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
>  static DEFINE_MUTEX(tdx_lock);
>  static struct mutex *tdx_mng_key_config_lock;
>
> +/*
> + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> + * is brought down to invoke TDH_VP_FLUSH on the approapriate TD vCPUS.
> + * Protected by interrupt mask.  This list is manipulated in process context
> + * of vcpu and IPI callback.  See tdx_flush_vp_on_cpu().
> + */
> +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> +
>  static u64 hkid_mask __ro_after_init;
>  static u8 hkid_start_pos __ro_after_init;
>
> @@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
>
>  static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
>  {
> +       list_del(&to_tdx(vcpu)->cpu_list);
> +
>         /*
>          * Ensure tdx->cpu_list is updated is before setting vcpu->cpu to -1,
>          * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> @@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
>         vcpu->cpu = -1;
>  }
>
> +void tdx_hardware_enable(void)
> +{
> +       INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
> +}
> +
> +void tdx_hardware_disable(void)
> +{
> +       int cpu = raw_smp_processor_id();
> +       struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> +       struct vcpu_tdx *tdx, *tmp;
> +
> +       /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> +       list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
> +               tdx_disassociate_vp(&tdx->vcpu);
> +}
> +
>  static void tdx_clear_page(unsigned long page)
>  {
>         const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
>         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
>         cpumask_var_t packages;
>         bool cpumask_allocated;
> +       struct kvm_vcpu *vcpu;
>         u64 err;
>         int ret;
>         int i;
> +       unsigned long j;
>
>         if (!is_hkid_assigned(kvm_tdx))
>                 return;
> @@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
>                 return;
>         }
>
> +       kvm_for_each_vcpu(j, vcpu, kvm)
> +               tdx_flush_vp_on_cpu(vcpu);
> +
> +       mutex_lock(&tdx_lock);
> +       err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);

Hi Isaku,

I am wondering about the impact of failures in these functions. Is
there any other function which recovers from failures here?
When I look at the tdx_flush_vp function, it seems like it can fail
due to task migration, so tdx_flush_vp_on_cpu might also fail, and if it
fails, tdh_mng_vpflushdone returns an error. Since tdx_vm_teardown does not
return any error, how can the VMM free the keyid used by this TD?
Will it be stuck in the "used" state forever?
Also, if tdx_vm_teardown fails, kvm_tdx->hkid is never set to -1,
which will prevent tdx_vcpu_free from freeing and reclaiming the resources
allocated for the vcpu.

-Erdem
> +       mutex_unlock(&tdx_lock);
> +       if (WARN_ON_ONCE(err)) {
> +               pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
> +               return;
> +       }
> +
>         cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
>         for_each_online_cpu(i) {
>                 if (cpumask_allocated &&
> @@ -472,8 +511,22 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>
>  void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>  {
> -       if (vcpu->cpu != cpu)
> -               tdx_flush_vp_on_cpu(vcpu);
> +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +       if (vcpu->cpu == cpu)
> +               return;
> +
> +       tdx_flush_vp_on_cpu(vcpu);
> +
> +       local_irq_disable();
> +       /*
> +        * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
> +        * vcpu->cpu is read before tdx->cpu_list.
> +        */
> +       smp_rmb();
> +
> +       list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
> +       local_irq_enable();
>  }
>
>  void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> @@ -522,6 +575,19 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>                 tdx_reclaim_td_page(&tdx->tdvpx[i]);
>         kfree(tdx->tdvpx);
>         tdx_reclaim_td_page(&tdx->tdvpr);
> +
> +       /*
> +        * kvm_free_vcpus()
> +        *   -> kvm_unload_vcpu_mmu()
> +        *
> +        * does vcpu_load() for every vcpu after they already disassociated
> +        * from the per cpu list when tdx_vm_teardown(). So we need to
> +        * disassociate them again, otherwise the freed vcpu data will be
> +        * accessed when do list_{del,add}() on associated_tdvcpus list
> +        * later.
> +        */
> +       tdx_flush_vp_on_cpu(vcpu);
> +       WARN_ON(vcpu->cpu != -1);
>  }
>
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 8b1cf9c158e3..180360a65545 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -81,6 +81,8 @@ struct vcpu_tdx {
>         struct tdx_td_page tdvpr;
>         struct tdx_td_page *tdvpx;
>
> +       struct list_head cpu_list;
> +
>         union tdx_exit_reason exit_reason;
>
>         bool initialized;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ceafd6e18f4e..aae0f4449ec5 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -132,6 +132,8 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
>  bool tdx_is_vm_type_supported(unsigned long type);
>  void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
>  void tdx_hardware_unsetup(void);
> +void tdx_hardware_enable(void);
> +void tdx_hardware_disable(void);
>
>  int tdx_vm_init(struct kvm *kvm);
>  void tdx_mmu_prezap(struct kvm *kvm);
> @@ -156,6 +158,8 @@ static inline void tdx_pre_kvm_init(
>  static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
>  static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
>  static inline void tdx_hardware_unsetup(void) {}
> +static inline void tdx_hardware_enable(void) {}
> +static inline void tdx_hardware_disable(void) {}
>
>  static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
>  static inline void tdx_mmu_prezap(struct kvm *kvm) {}
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function
  2022-03-21 18:32   ` Sagi Shahar
@ 2022-03-23 17:53     ` Isaku Yamahata
  2022-04-07 13:12     ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-23 17:53 UTC (permalink / raw)
  To: Sagi Shahar
  Cc: Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Jim Mattson, Erdem Aktas, Connor Kuehl,
	Sean Christopherson

On Mon, Mar 21, 2022 at 11:32:21AM -0700,
Sagi Shahar <sagis@google.com> wrote:

> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 314ae43e07bf..9acb33a17445 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -9090,26 +9090,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
> >         return kvm_skip_emulated_instruction(vcpu);
> >  }
> >
> > -int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
> > +                                     unsigned long a0, unsigned long a1,
> > +                                     unsigned long a2, unsigned long a3,
> > +                                     int op_64_bit)
> >  {
> > -       unsigned long nr, a0, a1, a2, a3, ret;
> > -       int op_64_bit;
> > -
> > -       if (kvm_xen_hypercall_enabled(vcpu->kvm))
> > -               return kvm_xen_hypercall(vcpu);
> > -
> > -       if (kvm_hv_hypercall_enabled(vcpu))
> > -               return kvm_hv_hypercall(vcpu);
> > -
> > -       nr = kvm_rax_read(vcpu);
> > -       a0 = kvm_rbx_read(vcpu);
> > -       a1 = kvm_rcx_read(vcpu);
> > -       a2 = kvm_rdx_read(vcpu);
> > -       a3 = kvm_rsi_read(vcpu);
> > +       unsigned long ret;
> >
> >         trace_kvm_hypercall(nr, a0, a1, a2, a3);
> >
> > -       op_64_bit = is_64_bit_hypercall(vcpu);
> >         if (!op_64_bit) {
> >                 nr &= 0xFFFFFFFF;
> >                 a0 &= 0xFFFFFFFF;
> > @@ -9118,11 +9107,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >                 a3 &= 0xFFFFFFFF;
> >         }
> >
> > -       if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> > -               ret = -KVM_EPERM;
> > -               goto out;
> > -       }
> > -
> >         ret = -KVM_ENOSYS;
> >
> >         switch (nr) {
> > @@ -9181,6 +9165,34 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> >                 ret = -KVM_ENOSYS;
> >                 break;
> >         }
> > +       return ret;
> > +}
> > +EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
> > +
> > +int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
> > +{
> > +       unsigned long nr, a0, a1, a2, a3, ret;
> > +       int op_64_bit;
> > +
> > +       if (kvm_xen_hypercall_enabled(vcpu->kvm))
> > +               return kvm_xen_hypercall(vcpu);
> > +
> > +       if (kvm_hv_hypercall_enabled(vcpu))
> > +               return kvm_hv_hypercall(vcpu);
> > +
> > +       nr = kvm_rax_read(vcpu);
> > +       a0 = kvm_rbx_read(vcpu);
> > +       a1 = kvm_rcx_read(vcpu);
> > +       a2 = kvm_rdx_read(vcpu);
> > +       a3 = kvm_rsi_read(vcpu);
> > +       op_64_bit = is_64_bit_mode(vcpu);
> 
> I think this should be "op_64_bit = is_64_bit_hypercall(vcpu);"
> is_64_bit_mode was replaced with is_64_bit_hypercall to support
> protected guests here:
> https://lore.kernel.org/all/87cztf8h43.fsf@vitty.brq.redhat.com/T/
> 
> Without it, op_64_bit will be set to 0 for TD VMs, which will cause the
> upper 32 bits of the registers to be cleared in __kvm_emulate_hypercall.

Oops, thanks for pointing it out.  I'll fix it up with next respin.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path
  2022-03-22 17:28   ` Erdem Aktas
@ 2022-03-23 17:55     ` Isaku Yamahata
  2022-03-23 20:05       ` Erdem Aktas
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-23 17:55 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Yamahata, Isaku, open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Connor Kuehl, Sean Christopherson

On Tue, Mar 22, 2022 at 10:28:42AM -0700,
Erdem Aktas <erdemaktas@google.com> wrote:

> On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
> > @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >         vcpu->kvm->vm_bugged = true;
> >  }
> >
> > +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> > +
> > +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > +                                       struct vcpu_tdx *tdx)
> > +{
> > +       guest_enter_irqoff();
> > +       tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> > +       guest_exit_irqoff();
> > +}
> > +
> > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > +{
> > +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> > +       if (unlikely(vcpu->kvm->vm_bugged)) {
> > +               tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > +               return EXIT_FASTPATH_NONE;
> > +       }
> > +
> > +       trace_kvm_entry(vcpu);
> > +
> > +       tdx_vcpu_enter_exit(vcpu, tdx);
> > +
> > +       vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > +       trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > +
> > +       if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> > +               return EXIT_FASTPATH_NONE;
> 
> Looks like the above if statement has no effect. Just checking if this
> is intentional.

I'm not sure if I get your point.  tdx->exit_reason is updated by the above
tdx_vcpu_enter_exit().  So it makes sense to check .error or .non_recoverable.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD
  2022-03-23  0:54   ` Erdem Aktas
@ 2022-03-23 19:08     ` Isaku Yamahata
  2022-03-23 20:17       ` Erdem Aktas
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-23 19:08 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Yamahata, Isaku, open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Connor Kuehl, Sean Christopherson

On Tue, Mar 22, 2022 at 05:54:45PM -0700,
Erdem Aktas <erdemaktas@google.com> wrote:

> On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index a6b1a8ce888d..690298fb99c7 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
> >  static DEFINE_MUTEX(tdx_lock);
> >  static struct mutex *tdx_mng_key_config_lock;
> >
> > +/*
> > + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> > + * is brought down to invoke TDH_VP_FLUSH on the approapriate TD vCPUS.
> > + * Protected by interrupt mask.  This list is manipulated in process context
> > + * of vcpu and IPI callback.  See tdx_flush_vp_on_cpu().
> > + */
> > +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> > +
> >  static u64 hkid_mask __ro_after_init;
> >  static u8 hkid_start_pos __ro_after_init;
> >
> > @@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> >
> >  static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> >  {
> > +       list_del(&to_tdx(vcpu)->cpu_list);
> > +
> >         /*
> >          * Ensure tdx->cpu_list is updated is before setting vcpu->cpu to -1,
> >          * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> > @@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> >         vcpu->cpu = -1;
> >  }
> >
> > +void tdx_hardware_enable(void)
> > +{
> > +       INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
> > +}
> > +
> > +void tdx_hardware_disable(void)
> > +{
> > +       int cpu = raw_smp_processor_id();
> > +       struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> > +       struct vcpu_tdx *tdx, *tmp;
> > +
> > +       /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> > +       list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
> > +               tdx_disassociate_vp(&tdx->vcpu);
> > +}
> > +
> >  static void tdx_clear_page(unsigned long page)
> >  {
> >         const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > @@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
> >         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> >         cpumask_var_t packages;
> >         bool cpumask_allocated;
> > +       struct kvm_vcpu *vcpu;
> >         u64 err;
> >         int ret;
> >         int i;
> > +       unsigned long j;
> >
> >         if (!is_hkid_assigned(kvm_tdx))
> >                 return;
> > @@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
> >                 return;
> >         }
> >
> > +       kvm_for_each_vcpu(j, vcpu, kvm)
> > +               tdx_flush_vp_on_cpu(vcpu);
> > +
> > +       mutex_lock(&tdx_lock);
> > +       err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
> 
> Hi Isaku,

Hi.


> I am wondering about the impact of failures in these functions. Is
> there any other function which recovers from failures here?
> When I look at the tdx_flush_vp function, it seems like it can fail
> due to task migration, so tdx_flush_vp_on_cpu might also fail, and if it
> fails, tdh_mng_vpflushdone returns an error. Since tdx_vm_teardown does not
> return any error, how can the VMM free the keyid used by this TD?
> Will it be stuck in the "used" state forever?
> Also, if tdx_vm_teardown fails, kvm_tdx->hkid is never set to -1,
> which will prevent tdx_vcpu_free from freeing and reclaiming the resources
> allocated for the vcpu.

mmu_prezap() is called via release callback of mmu notifier when the last mmu
reference of this process is dropped.  It is after all kvm vcpu fd and kvm vm
fd were closed.  vcpu will never run.  But we still hold kvm_vcpu structures.
There is no race between tdh_vp_flush()/tdh_mng_vpflushdone() here and process
migration.  tdh_vp_flush()/tdh_mng_vpflushdone() should succeed.

The cpuid check in tdx_flush_vp() is for vcpu_load() which may race with process
migration. 

Anyway what if one of those TDX seamcalls fails?  HKID is leaked and will be
never used because there is no good way to free and use HKID safely.  Such
failure is due to unknown issue and probably a bug.

One mitigation is to add pr_err() when HKID leak happens.  I'll add such message
on next respin.
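
Something along these lines (just a sketch, the actual message may differ):

	mutex_lock(&tdx_lock);
	err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
	mutex_unlock(&tdx_lock);
	if (WARN_ON_ONCE(err)) {
		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
		/* The HKID stays marked as used; complain loudly about the leak. */
		pr_err("tdx: HKID %d is leaked, TDH.MNG.VPFLUSHDONE failed\n",
		       kvm_tdx->hkid);
		return;
	}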

thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path
  2022-03-23 17:55     ` Isaku Yamahata
@ 2022-03-23 20:05       ` Erdem Aktas
  2022-03-23 22:48         ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Erdem Aktas @ 2022-03-23 20:05 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Yamahata, Isaku, open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, Paolo Bonzini, Jim Mattson, Connor Kuehl,
	Sean Christopherson

On Wed, Mar 23, 2022 at 10:55 AM Isaku Yamahata
<isaku.yamahata@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 10:28:42AM -0700,
> Erdem Aktas <erdemaktas@google.com> wrote:
>
> > On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
> > > @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >         vcpu->kvm->vm_bugged = true;
> > >  }
> > >
> > > +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> > > +
> > > +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > > +                                       struct vcpu_tdx *tdx)
> > > +{
> > > +       guest_enter_irqoff();
> > > +       tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> > > +       guest_exit_irqoff();
> > > +}
> > > +
> > > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > > +{
> > > +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +
> > > +       if (unlikely(vcpu->kvm->vm_bugged)) {
> > > +               tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > > +               return EXIT_FASTPATH_NONE;
> > > +       }
> > > +
> > > +       trace_kvm_entry(vcpu);
> > > +
> > > +       tdx_vcpu_enter_exit(vcpu, tdx);
> > > +
> > > +       vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > > +       trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > > +
> > > +       if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> > > +               return EXIT_FASTPATH_NONE;
> >
> > Looks like the above if statement has no effect. Just checking if this
> > is intentional.
>
> I'm not sure if I get your point.  tdx->exit_reason is updated by the above
> tdx_vcpu_enter_exit().  So it makes sense to check .error or .non_recoverable.
> --
> Isaku Yamahata <isaku.yamahata@gmail.com>

What I mean is, if there is an error, it returns EXIT_FASTPATH_NONE
but if there is no error, it still returns EXIT_FASTPATH_NONE.

The code is like below; the if-statement might be there as a
placeholder to check errors, but it has no impact on what is returned
from this function.

       if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
               return EXIT_FASTPATH_NONE;
       return EXIT_FASTPATH_NONE;

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD
  2022-03-23 19:08     ` Isaku Yamahata
@ 2022-03-23 20:17       ` Erdem Aktas
  0 siblings, 0 replies; 310+ messages in thread
From: Erdem Aktas @ 2022-03-23 20:17 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Yamahata, Isaku, open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, Paolo Bonzini, Jim Mattson, Connor Kuehl,
	Sean Christopherson

So tdh_vp_flush should always succeed while the VM is being torn down.
Thanks Isaku for the explanation, and I think it would be great to add
the error message.

-Erdem

On Wed, Mar 23, 2022 at 12:08 PM Isaku Yamahata
<isaku.yamahata@gmail.com> wrote:
>
> On Tue, Mar 22, 2022 at 05:54:45PM -0700,
> Erdem Aktas <erdemaktas@google.com> wrote:
>
> > On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index a6b1a8ce888d..690298fb99c7 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -48,6 +48,14 @@ struct tdx_capabilities tdx_caps;
> > >  static DEFINE_MUTEX(tdx_lock);
> > >  static struct mutex *tdx_mng_key_config_lock;
> > >
> > > +/*
> > > + * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
> > > + * is brought down to invoke TDH_VP_FLUSH on the approapriate TD vCPUS.
> > > + * Protected by interrupt mask.  This list is manipulated in process context
> > > + * of vcpu and IPI callback.  See tdx_flush_vp_on_cpu().
> > > + */
> > > +static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
> > > +
> > >  static u64 hkid_mask __ro_after_init;
> > >  static u8 hkid_start_pos __ro_after_init;
> > >
> > > @@ -87,6 +95,8 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> > >
> > >  static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > >  {
> > > +       list_del(&to_tdx(vcpu)->cpu_list);
> > > +
> > >         /*
> > >          * Ensure tdx->cpu_list is updated is before setting vcpu->cpu to -1,
> > >          * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
> > > @@ -97,6 +107,22 @@ static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> > >         vcpu->cpu = -1;
> > >  }
> > >
> > > +void tdx_hardware_enable(void)
> > > +{
> > > +       INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, raw_smp_processor_id()));
> > > +}
> > > +
> > > +void tdx_hardware_disable(void)
> > > +{
> > > +       int cpu = raw_smp_processor_id();
> > > +       struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
> > > +       struct vcpu_tdx *tdx, *tmp;
> > > +
> > > +       /* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
> > > +       list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
> > > +               tdx_disassociate_vp(&tdx->vcpu);
> > > +}
> > > +
> > >  static void tdx_clear_page(unsigned long page)
> > >  {
> > >         const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > @@ -230,9 +256,11 @@ void tdx_mmu_prezap(struct kvm *kvm)
> > >         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > >         cpumask_var_t packages;
> > >         bool cpumask_allocated;
> > > +       struct kvm_vcpu *vcpu;
> > >         u64 err;
> > >         int ret;
> > >         int i;
> > > +       unsigned long j;
> > >
> > >         if (!is_hkid_assigned(kvm_tdx))
> > >                 return;
> > > @@ -248,6 +276,17 @@ void tdx_mmu_prezap(struct kvm *kvm)
> > >                 return;
> > >         }
> > >
> > > +       kvm_for_each_vcpu(j, vcpu, kvm)
> > > +               tdx_flush_vp_on_cpu(vcpu);
> > > +
> > > +       mutex_lock(&tdx_lock);
> > > +       err = tdh_mng_vpflushdone(kvm_tdx->tdr.pa);
> >
> > Hi Isaku,
>
> Hi.
>
>
> > I am wondering about the impact of failures in these functions. Is
> > there any other function which recovers from failures here?
> > When I look at the tdx_flush_vp function, it seems like it can fail
> > due to task migration, so tdx_flush_vp_on_cpu might also fail, and if it
> > fails, tdh_mng_vpflushdone returns an error. Since tdx_vm_teardown does not
> > return any error, how can the VMM free the keyid used by this TD?
> > Will it be stuck in the "used" state forever?
> > Also, if tdx_vm_teardown fails, kvm_tdx->hkid is never set to -1,
> > which will prevent tdx_vcpu_free from freeing and reclaiming the resources
> > allocated for the vcpu.
>
> mmu_prezap() is called via release callback of mmu notifier when the last mmu
> reference of this process is dropped.  It is after all kvm vcpu fd and kvm vm
> fd were closed.  vcpu will never run.  But we still hold kvm_vcpu structures.
> There is no race between tdh_vp_flush()/tdh_mng_vpflushdone() here and process
> migration.  tdh_vp_flush()/tdh_mng_vpflushdone() should succeed.
>
> The cpuid check in tdx_flush_vp() is for vcpu_load() which may race with process
> migration.
>
> Anyway, what if one of those TDX seamcalls fails?  The HKID is leaked and will
> never be used because there is no good way to free and reuse an HKID safely.  Such
> failure is due to unknown issue and probably a bug.
>
> One mitigation is to add pr_err() when HKID leak happens.  I'll add such message
> on next respin.
>
> thanks,
> --
> Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path
  2022-03-23 20:05       ` Erdem Aktas
@ 2022-03-23 22:48         ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-23 22:48 UTC (permalink / raw)
  To: Erdem Aktas
  Cc: Isaku Yamahata, Yamahata, Isaku,
	open list:KERNEL VIRTUAL MACHINE (KVM),
	open list, Paolo Bonzini, Jim Mattson, Connor Kuehl,
	Sean Christopherson

On Wed, Mar 23, 2022 at 01:05:27PM -0700,
Erdem Aktas <erdemaktas@google.com> wrote:

> On Wed, Mar 23, 2022 at 10:55 AM Isaku Yamahata
> <isaku.yamahata@gmail.com> wrote:
> >
> > On Tue, Mar 22, 2022 at 10:28:42AM -0700,
> > Erdem Aktas <erdemaktas@google.com> wrote:
> >
> > > On Fri, Mar 4, 2022 at 11:50 AM <isaku.yamahata@intel.com> wrote:
> > > > @@ -509,6 +512,37 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > >         vcpu->kvm->vm_bugged = true;
> > > >  }
> > > >
> > > > +u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
> > > > +
> > > > +static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
> > > > +                                       struct vcpu_tdx *tdx)
> > > > +{
> > > > +       guest_enter_irqoff();
> > > > +       tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> > > > +       guest_exit_irqoff();
> > > > +}
> > > > +
> > > > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > > > +{
> > > > +       struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > > +
> > > > +       if (unlikely(vcpu->kvm->vm_bugged)) {
> > > > +               tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > > > +               return EXIT_FASTPATH_NONE;
> > > > +       }
> > > > +
> > > > +       trace_kvm_entry(vcpu);
> > > > +
> > > > +       tdx_vcpu_enter_exit(vcpu, tdx);
> > > > +
> > > > +       vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
> > > > +       trace_kvm_exit(vcpu, KVM_ISA_VMX);
> > > > +
> > > > +       if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
> > > > +               return EXIT_FASTPATH_NONE;
> > >
> > > Looks like the above if statement has no effect. Just checking if this
> > > is intentional.
> >
> > I'm not sure if I get your point.  tdx->exit_reason is updated by the above
> > tdx_vcpu_enter_exit().  So it makes sense to check .error or .non_recoverable.
> > --
> > Isaku Yamahata <isaku.yamahata@gmail.com>
> 
> What I mean is, if there is an error, it returns EXIT_FASTPATH_NONE
> but if there is no error, it still returns EXIT_FASTPATH_NONE.
> 
> The code is like below; the if-statement might be there as a
> placeholder to check errors, but it has no impact on what is returned
> from this function.
> 
>        if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
>                return EXIT_FASTPATH_NONE;
>        return EXIT_FASTPATH_NONE;

Got it. It doesn't make sense. I'll fix it with the next respin.
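
Probably it's just a matter of dropping the redundant branch, e.g. (sketch
only, the actual respin may differ):

	/*
	 * TDX doesn't use the exit fastpath, and the error/non_recoverable
	 * bits are consumed later via tdx->exit_reason, so always take the
	 * slow path here.
	 */
	return EXIT_FASTPATH_NONE;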
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-14 19:45     ` Isaku Yamahata
@ 2022-03-31  0:03       ` Sean Christopherson
  2022-03-31  1:02         ` Kai Huang
  2022-03-31 17:03         ` Isaku Yamahata
  0 siblings, 2 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-03-31  0:03 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl

On Mon, Mar 14, 2022, Isaku Yamahata wrote:
> On Sun, Mar 13, 2022 at 03:03:40PM +0100,
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > +
> > > +	if (!tdx_module_initialized) {
> > > +		if (enable_tdx) {
> > > +			ret = __tdx_module_setup();
> > > +			if (ret)
> > > +				enable_tdx = false;
> > 
> > "enable_tdx = false" isn't great to do only when a VM is created.  Does it
> > make sense to anticipate this to the point when the kvm_intel.ko module is
> > loaded?
> 
> It's possible.  I have the following two reasons to chose to defer TDX module
> initialization until creating first TD.  Given those reasons, do you still want
> the initialization at loading kvm_intel.ko module?  If yes, I'll change it.

Yes, TDX module setup needs to be done at load time.  The loss of memory is
unfortunate, e.g. if the host is part of a pool that _might_ run TDX guests, but
the alternatives are worse.  If TDX fails to initialize, e.g. due to low mem,
then the host will be unable to run TDX guests despite saying "I support TDX".
Or this gem :-)

	/*
	 * TDH.SYS.KEY.CONFIG may fail with entropy error (which is
	 * a recoverable error).  Assume this is exceedingly rare and
	 * just return error if encountered instead of retrying.
	 */

The CPU overhead of initializing the TDX module is also non-trivial, and it
doesn't affect just this CPU, e.g. all CPUs need to do certain SEAMCALLs and at
least one WBINVD.  That can cause noisy neighbor problems.

> - memory over head: The initialization of TDX module requires to allocate
> physically contiguous memory whose size is about 0.43% of the system memory.
> If user don't use TD, it will be wasted.
> 
> - VMXON on all pCPUs: The TDX module initialization requires to enable VMX
> (VMXON) on all present pCPUs.  vmx_hardware_enable() which is called on creating
> guest does it.  It naturally fits with the TDX module initialization at creating
> first TD.  I wanted to avoid code to enable VMXON on loading the kvm_intel.ko.

That's a solvable problem, though making it work without exporting hardware_enable_all()
could get messy.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-31  0:03       ` Sean Christopherson
@ 2022-03-31  1:02         ` Kai Huang
  2022-03-31 17:03         ` Isaku Yamahata
  1 sibling, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-03-31  1:02 UTC (permalink / raw)
  To: Sean Christopherson, Isaku Yamahata
  Cc: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl


> > 
> > - VMXON on all pCPUs: The TDX module initialization requires to enable VMX
> > (VMXON) on all present pCPUs.  vmx_hardware_enable() which is called on creating
> > guest does it.  It naturally fits with the TDX module initialization at creating
> > first TD.  I wanted to avoid code to enable VMXON on loading the kvm_intel.ko.
> 
> That's a solvable problem, though making it work without exporting hardware_enable_all()
> could get messy.

Could you elaborate a little bit on how to resolve?

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid
  2022-03-04 19:48 ` [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid isaku.yamahata
@ 2022-03-31  1:21   ` Kai Huang
  2022-03-31 20:15     ` Isaku Yamahata
  2022-04-05 13:08   ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-31  1:21 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> MKTME keyid is assigned to guest TD.  The memory controller encrypts guest
> TD memory with key id.  Add helper functions to allocate/free MKTME keyid
> so that TDX KVM assign keyid.

Using MKTME keyid is wrong, or at least not accurate I think.  We should
explicitly use "TDX private KeyID", which is clearly documented in the spec:
  
https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf

Also, the description of the IA32_MKTME_KEYID_PARTITIONING MSR clearly says TDX private
KeyIDs span the range (NUM_MKTME_KIDS+1) through
(NUM_MKTME_KIDS+NUM_TDX_PRIV_KIDS).  So please just use TDX private KeyID here.
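
I.e. the partitioning is roughly (sketch; the MSR constant name is taken from
the TDX host series, field layout per the spec above):

	u64 keyid_part;
	u32 num_mktme_kids, num_tdx_priv_kids;

	rdmsrl(MSR_IA32_MKTME_KEYID_PARTITIONING, keyid_part);
	num_mktme_kids    = keyid_part & 0xffffffff;	/* bits 31:0  */
	num_tdx_priv_kids = keyid_part >> 32;		/* bits 63:32 */
	/*
	 * TDX private KeyIDs:
	 * [num_mktme_kids + 1, num_mktme_kids + num_tdx_priv_kids]
	 */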


> 
> Also export MKTME global keyid that is used to encrypt TDX module and its
> memory.

This needs an explanation of why the global keyID needs to be exported.

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/tdx.h |  6 ++++++
>  arch/x86/virt/vmx/tdx.c    | 33 ++++++++++++++++++++++++++++++++-
>  2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 9a8dc6afcb63..73bb472bd515 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -139,6 +139,9 @@ int tdx_detect(void);
>  int tdx_init(void);
>  bool platform_has_tdx(void);
>  const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> +u32 tdx_get_global_keyid(void);
> +int tdx_keyid_alloc(void);
> +void tdx_keyid_free(int keyid);
>  #else
>  static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
>  static inline int tdx_detect(void) { return -ENODEV; }
> @@ -146,6 +149,9 @@ static inline int tdx_init(void) { return -ENODEV; }
>  static inline bool platform_has_tdx(void) { return false; }
>  struct tdsysinfo_struct;
>  static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
> +static inline u32 tdx_get_global_keyid(void) { return 0; };
> +static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
> +static inline void tdx_keyid_free(int keyid) { }
>  #endif /* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx.c b/arch/x86/virt/vmx/tdx.c
> index e45f188479cb..d714106321d4 100644
> --- a/arch/x86/virt/vmx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx.c
> @@ -113,7 +113,13 @@ static int tdx_cmr_num;
>  static struct tdsysinfo_struct tdx_sysinfo;
>  
>  /* TDX global KeyID to protect TDX metadata */
> -static u32 tdx_global_keyid;
> +static u32 __read_mostly tdx_global_keyid;
> +
> +u32 tdx_get_global_keyid(void)
> +{
> +	return tdx_global_keyid;
> +}
> +EXPORT_SYMBOL_GPL(tdx_get_global_keyid);
>  
>  static bool enable_tdx_host;
>  
> @@ -189,6 +195,31 @@ static void detect_seam(struct cpuinfo_x86 *c)
>  		detect_seam_ap(c);
>  }
>  
> +/* TDX KeyID pool */
> +static DEFINE_IDA(tdx_keyid_pool);
> +
> +int tdx_keyid_alloc(void)
> +{
> +	if (WARN_ON_ONCE(!tdx_keyid_start || !tdx_keyid_num))
> +		return -EINVAL;
> +
> +	/* The first keyID is reserved for the global key. */
> +	return ida_alloc_range(&tdx_keyid_pool, tdx_keyid_start + 1,
> +			       tdx_keyid_start + tdx_keyid_num - 1,
> +			       GFP_KERNEL);
> +}
> +EXPORT_SYMBOL_GPL(tdx_keyid_alloc);
> +
> +void tdx_keyid_free(int keyid)
> +{
> +	/* keyid = 0 is reserved. */
> +	if (!keyid || keyid <= 0)
> +		return;
> +
> +	ida_free(&tdx_keyid_pool, keyid);
> +}
> +EXPORT_SYMBOL_GPL(tdx_keyid_free);
> +
>  static void detect_tdx_keyids_bsp(struct cpuinfo_x86 *c)
>  {
>  	u64 keyid_part;


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free
  2022-03-04 19:48 ` [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free isaku.yamahata
@ 2022-03-31  3:02   ` Kai Huang
  2022-03-31 19:54     ` Isaku Yamahata
  2022-04-05 12:40   ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-31  3:02 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Kai Huang <kai.huang@intel.com>
> 
> Before tearing down private page tables, TDX requires some resources of the
> guest TD to be destroyed (i.e. keyID must have been reclaimed, etc).  Add
> prezap callback before tearing down private page tables for it.
> 
> TDX needs to free some resources after other resources (i.e. vcpu related
> resources).  Add vm_free callback at the end of kvm_arch_destroy_vm().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h | 2 ++
>  arch/x86/include/asm/kvm_host.h    | 2 ++
>  arch/x86/kvm/x86.c                 | 8 ++++++++
>  3 files changed, 12 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8125d43d3566..ef48dcc98cfc 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -20,7 +20,9 @@ KVM_X86_OP(has_emulated_msr)
>  KVM_X86_OP(vcpu_after_set_cpuid)
>  KVM_X86_OP(is_vm_type_supported)
>  KVM_X86_OP(vm_init)
> +KVM_X86_OP_NULL(mmu_prezap)
>  KVM_X86_OP_NULL(vm_destroy)
> +KVM_X86_OP_NULL(vm_free)
>  KVM_X86_OP(vcpu_create)
>  KVM_X86_OP(vcpu_free)
>  KVM_X86_OP(vcpu_reset)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8de357a9ad30..5ff7a0fba311 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1326,7 +1326,9 @@ struct kvm_x86_ops {
>  	bool (*is_vm_type_supported)(unsigned long vm_type);
>  	unsigned int vm_size;
>  	int (*vm_init)(struct kvm *kvm);
> +	void (*mmu_prezap)(struct kvm *kvm);
>  	void (*vm_destroy)(struct kvm *kvm);
> +	void (*vm_free)(struct kvm *kvm);
>  
>  	/* Create, but do not attach this VCPU */
>  	int (*vcpu_create)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f6438750d190..a48f5c69fadb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11779,6 +11779,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_page_track_cleanup(kvm);
>  	kvm_xen_destroy_vm(kvm);
>  	kvm_hv_destroy_vm(kvm);
> +	static_call_cond(kvm_x86_vm_free)(kvm);
>  }
>  
>  static void memslot_rmap_free(struct kvm_memory_slot *slot)
> @@ -12036,6 +12037,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>  
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> +	/*
> +	 * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
> +	 * tearing down private page tables, TDX requires some TD resources to
> +	 * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
> +	 * kvm_x86_mmu_prezap() for this.
> +	 */
> +	static_call_cond(kvm_x86_mmu_prezap)(kvm);
>  	kvm_mmu_zap_all(kvm);
>  }
>  

The two callbacks are introduced here but they are actually implemented 2
patches later (patch 24, "KVM: TDX: create/destroy VM structure").  Why not just
squash this patch into patch 24?  Or at least put this patch right before
patch 24.

Please feel free to remove my SoB and From if that makes the squash awkward.



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-04 19:48 ` [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize " isaku.yamahata
  2022-03-13 14:03   ` Paolo Bonzini
@ 2022-03-31  3:31   ` Kai Huang
  2022-03-31 19:41     ` Isaku Yamahata
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-31  3:31 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Memory used for TDX is encrypted with an encryption key.  An encryption key
> is assigned to guest TD, and TDX memory is encrypted.  VMM calculates Trust
> Domain Memory Range (TDMR), a range of memory pages that can hold TDX
> memory encrypted with an encryption key.  VMM allocates memory regions for
> Physical Address Metadata Table (PAMT) which the TDX module uses to track
> page states. Used for TDX memory, assigned to which guest TD, etc.  VMM
> gives PAMT regions to the TDX module and initializes it which is also
> encrypted.

Not sure why the above is related to this patch.  Perhaps you can just say the TDX
module is detected and initialized on demand via tdx_detect() and tdx_init().

> 
> TDX requires more initialization steps in addition to VMX.  As a
> preparation step, check if the CPU feature is available and enable VMX
> because the TDX module API requires VMX to be enabled to be functional.

Those are not reflected in this patch either.

> The next step is basic platform initialization.  Check if TDX module API is
> available, call system-wide initialization API (TDH.SYS.INIT), and call LP
> initialization API (TDH.SYS.LP.INIT).  Lastly, get system-wide
> parameters (TDH.SYS.INFO), allocate PAMT for TDX module to track page
> states (TDH.SYS.CONFIG), configure encryption key (TDH.SYS.KEY.CONFIG), and
> initialize PAMT (TDH.SYS.TDMR.INIT).

Again, not sure why those are related.

> 
> A TDX host patch series implements those details and it provides APIs,
> seamrr_enabled() to check if CPU feature is available, init_tdx() to
> initialize the TDX module, tdx_get_tdsysinfo() to get TDX system
> parameters.

init_tdx() -> tdx_init().

"A TDX host patch series" should not be in the formal commit message, I suppose.

> 
> Add a wrapper function to initialize the TDX module and get system-wide
> parameters via those APIs.  Because TDX requires VMX enabled, It will be
> called on-demand when the first guest TD is created via x86 KVM init_vm
> callback.

Why not just merge this patch with the change where you implement the init_vm
callback?  Then you can just declare this patch as "detect and initialize TDX
module when the first VM is created", or something like that.

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 89 ++++++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h |  4 ++
>  2 files changed, 93 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8ed3ec342e28..8adc87ad1807 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -13,9 +13,98 @@
>  static bool __read_mostly enable_tdx = true;
>  module_param_named(tdx, enable_tdx, bool, 0644);
>  
> +#define TDX_MAX_NR_CPUID_CONFIGS					\
> +	((sizeof(struct tdsysinfo_struct) -				\
> +		offsetof(struct tdsysinfo_struct, cpuid_configs))	\
> +		/ sizeof(struct tdx_cpuid_config))
> +
> +struct tdx_capabilities {
> +	u8 tdcs_nr_pages;
> +	u8 tdvpx_nr_pages;
> +
> +	u64 attrs_fixed0;
> +	u64 attrs_fixed1;
> +	u64 xfam_fixed0;
> +	u64 xfam_fixed1;
> +
> +	u32 nr_cpuid_configs;
> +	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
> +};
> +
> +/* Capabilities of KVM + the TDX module. */
> +struct tdx_capabilities tdx_caps;
> +
>  static u64 hkid_mask __ro_after_init;
>  static u8 hkid_start_pos __ro_after_init;

These two seem to be unused in this patch.

Please make sure each patch compiles w/o warnings.

>  
> +static int __tdx_module_setup(void)
> +{
> +	const struct tdsysinfo_struct *tdsysinfo;
> +	int ret = 0;
> +
> +	BUILD_BUG_ON(sizeof(*tdsysinfo) != 1024);
> +	BUILD_BUG_ON(TDX_MAX_NR_CPUID_CONFIGS != 37);
> +
> +	ret = tdx_detect();
> +	if (ret) {
> +		pr_info("Failed to detect TDX module.\n");
> +		return ret;
> +	}
> +
> +	ret = tdx_init();
> +	if (ret) {
> +		pr_info("Failed to initialize TDX module.\n");
> +		return ret;
> +	}
> +
> +	tdsysinfo = tdx_get_sysinfo();
> +	if (tdx_caps.nr_cpuid_configs > TDX_MAX_NR_CPUID_CONFIGS)
> +		return -EIO;
> +
> +	tdx_caps = (struct tdx_capabilities) {
> +		.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
> +		/*
> +		 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
> +		 * -1 for TDVPR.
> +		 */
> +		.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
> +		.attrs_fixed0 = tdsysinfo->attributes_fixed0,
> +		.attrs_fixed1 = tdsysinfo->attributes_fixed1,
> +		.xfam_fixed0 =	tdsysinfo->xfam_fixed0,
> +		.xfam_fixed1 = tdsysinfo->xfam_fixed1,
> +		.nr_cpuid_configs = tdsysinfo->num_cpuid_config,
> +	};
> +	if (!memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
> +			tdsysinfo->num_cpuid_config *
> +			sizeof(struct tdx_cpuid_config)))
> +		return -EIO;
> +
> +	return 0;
> +}
> +
> +int tdx_module_setup(void)
> +{
> +	static DEFINE_MUTEX(tdx_init_lock);
> +	static bool __read_mostly tdx_module_initialized;
> +	int ret = 0;
> +
> +	mutex_lock(&tdx_init_lock);

It took me a while to figure out why this mutex is needed.  Please see my above
suggestion to merge this patch into the change that implements the init_vm() callback.

> +
> +	if (!tdx_module_initialized) {
> +		if (enable_tdx) {
> +			ret = __tdx_module_setup();

I think you can move tdx_detect() and tdx_init() out of your mutex.  They are
internally protected by mutex.

> +			if (ret)
> +				enable_tdx = false;
> +			else
> +				tdx_module_initialized = true;
> +		} else
> +			ret = -EOPNOTSUPP;
> +	}
> +
> +	mutex_unlock(&tdx_init_lock);
> +	return ret;
> +}
> +
>  static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>  {
>  	u32 max_pa;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index daf6bfc6502a..d448e019602c 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -3,6 +3,8 @@
>  #define __KVM_X86_TDX_H
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
> +int tdx_module_setup(void);
> +
>  struct kvm_tdx {
>  	struct kvm kvm;
>  };
> @@ -35,6 +37,8 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
>  	return container_of(vcpu, struct vcpu_tdx, vcpu);
>  }
>  #else
> +static inline int tdx_module_setup(void) { return -ENODEV; };
> +
>  struct kvm_tdx;
>  struct vcpu_tdx;
>  


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-03-04 19:48 ` [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure isaku.yamahata
@ 2022-03-31  4:17   ` Kai Huang
  2022-03-31 22:12     ` Isaku Yamahata
  2022-04-05 12:44   ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-31  4:17 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> As the first step to create TDX guest, create/destroy VM struct.  Assign
> Host Key ID (HKID) to the TDX guest for memory encryption and allocate
> extra pages for the TDX guest. On destruction, free allocated pages, and
> HKID.
> 
> Add a second kvm_x86_ops hook in kvm_arch_vm_destroy() to support TDX's
> destruction path, which needs to first put the VM into a teardown state,
> then free per-vCPU resources, and finally free per-VM resources.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c      |  16 +-
>  arch/x86/kvm/vmx/tdx.c       | 312 +++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h       |   2 +
>  arch/x86/kvm/vmx/tdx_errno.h |   2 +-
>  arch/x86/kvm/vmx/tdx_ops.h   |   8 +
>  arch/x86/kvm/vmx/x86_ops.h   |   8 +
>  6 files changed, 346 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 6111c6485d8e..5c3a904a30e8 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -39,12 +39,24 @@ static int vt_vm_init(struct kvm *kvm)
>  		ret = tdx_module_setup();
>  		if (ret)
>  			return ret;
> -		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
> +		return tdx_vm_init(kvm);
>  	}
>  
>  	return vmx_vm_init(kvm);
>  }
>  
> +static void vt_mmu_prezap(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return tdx_mmu_prezap(kvm);
> +}
> +
> +static void vt_vm_free(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return tdx_vm_free(kvm);
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = "kvm_intel",
>  
> @@ -58,6 +70,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.is_vm_type_supported = vt_is_vm_type_supported,
>  	.vm_size = sizeof(struct kvm_vmx),
>  	.vm_init = vt_vm_init,
> +	.mmu_prezap = vt_mmu_prezap,
> +	.vm_free = vt_vm_free,
>  
>  	.vcpu_create = vmx_vcpu_create,
>  	.vcpu_free = vmx_vcpu_free,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1c8222f54764..702953fd365f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -31,14 +31,324 @@ struct tdx_capabilities {
>  	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
>  };
>  
> +/* KeyID used by TDX module */
> +static u32 tdx_global_keyid __read_mostly;
> +

It's really not clear why you need to know tdx_global_keyid in the context of
creating/destroying a TD.

>  /* Capabilities of KVM + the TDX module. */
>  struct tdx_capabilities tdx_caps;
>  
> +static DEFINE_MUTEX(tdx_lock);
>  static struct mutex *tdx_mng_key_config_lock;
>  
>  static u64 hkid_mask __ro_after_init;
>  static u8 hkid_start_pos __ro_after_init;
>  
> +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> +{
> +	pa &= ~hkid_mask;
> +	pa |= (u64)hkid << hkid_start_pos;
> +
> +	return pa;
> +}
> +
> +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->tdr.added;
> +}
> +
> +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> +{
> +	tdx_keyid_free(kvm_tdx->hkid);
> +	kvm_tdx->hkid = -1;
> +}
> +
> +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->hkid > 0;
> +}
> +
> +static void tdx_clear_page(unsigned long page)
> +{
> +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> +	unsigned long i;
> +
> +	/* Zeroing the page is only necessary for systems with MKTME-i. */

"only necessary for systems  with MKTME-i" because of what?

Please be more clear: on an MKTME-i system, when re-assigning a page from an old
keyid to a new keyid, MOVDIR64B is required to clear/write the page with the new
keyid, to prevent an integrity error when the page is later read with the new keyid.

However, the new keyid is essentially 0, and in practice the integrity check for
keyid 0 is disabled in the current generation of MKTME-i, so I guess we are safe
even if we don't use MOVDIR64B to clear the page for TDX here.  But I agree it's
better to do it.
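
Something along these lines would work for me (wording is mine, just a sketch):

	/*
	 * On MKTME-i parts, re-assigning a page from one KeyID to another
	 * requires clearing the page via MOVDIR64B with the new KeyID,
	 * otherwise a later read with the new KeyID can hit an integrity
	 * error.  The new KeyID here is effectively KeyID 0.
	 */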

> +	if (!static_cpu_has(X86_FEATURE_MOVDIR64B))
> +		return;
> +
> +	for (i = 0; i < 4096; i += 64)
> +		/* MOVDIR64B [rdx], es:rdi */
> +		asm (".byte 0x66, 0x0f, 0x38, 0xf8, 0x3a"
> +		     : : "d" (zero_page), "D" (page + i) : "memory");
> +}
> +
> +static int __tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
> +{
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	err = tdh_phymem_page_reclaim(pa, &out);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> +		return -EIO;
> +	}
> +
> +	if (do_wb) {

In the callers, please add some comments explaining why do_wb is needed, and why
it is not needed.

> +		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> +			return -EIO;
> +		}
> +	}
> +
> +	tdx_clear_page(va);
> +	return 0;
> +}
> +
> +static int tdx_reclaim_page(unsigned long va, hpa_t pa)
> +{
> +	return __tdx_reclaim_page(va, pa, false, 0);
> +}
> +
> +static int tdx_alloc_td_page(struct tdx_td_page *page)
> +{
> +	page->va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!page->va)
> +		return -ENOMEM;
> +
> +	page->pa = __pa(page->va);
> +	return 0;
> +}
> +
> +static void tdx_mark_td_page_added(struct tdx_td_page *page)
> +{
> +	WARN_ON_ONCE(page->added);
> +	page->added = true;
> +}
> +
> +static void tdx_reclaim_td_page(struct tdx_td_page *page)
> +{
> +	if (page->added) {
> +		if (tdx_reclaim_page(page->va, page->pa))
> +			return;
> +
> +		page->added = false;
> +	}
> +	free_page(page->va);
> +}
> +
> +static int tdx_do_tdh_phymem_cache_wb(void *param)
> +{
> +	u64 err = 0;
> +
> +	/*
> +	 * We can destroy multiple guest TDs simultaneously.  Prevent
> +	 * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> +	 */
> +	mutex_lock(&tdx_lock);
> +	do {
> +		err = tdh_phymem_cache_wb(!!err);
> +	} while (err == TDX_INTERRUPTED_RESUMABLE);
> +	mutex_unlock(&tdx_lock);
> +
> +	/* Other thread may have done for us. */
> +	if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> +		err = TDX_SUCCESS;
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +void tdx_mmu_prezap(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages;
> +	bool cpumask_allocated;
> +	u64 err;
> +	int ret;
> +	int i;
> +
> +	if (!is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	if (!is_td_created(kvm_tdx))
> +		goto free_hkid;
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_key_reclaimid(kvm_tdx->tdr.pa);
> +	mutex_unlock(&tdx_lock);

Please add a comment explaining why the mutex is needed.

> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_RECLAIMID, err, NULL);
> +		return;
> +	}
> +
> +	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> +	for_each_online_cpu(i) {
> +		if (cpumask_allocated &&
> +			cpumask_test_and_set_cpu(topology_physical_package_id(i),
> +						packages))
> +			continue;
> +
> +		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
> +		if (ret)
> +			break;
> +	}
> +	free_cpumask_var(packages);
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_key_freeid(kvm_tdx->tdr.pa);
> +	mutex_unlock(&tdx_lock);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> +		return;
> +	}
> +
> +free_hkid:
> +	tdx_hkid_free(kvm_tdx);
> +}
> +
> +void tdx_vm_free(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	int i;
> +
> +	/* Can't reclaim or free TD pages if teardown failed. */
> +	if (is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
> +		tdx_reclaim_td_page(&kvm_tdx->tdcs[i]);
> +	kfree(kvm_tdx->tdcs);
> +
> +	if (kvm_tdx->tdr.added &&
> +		__tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, true,
> +				tdx_global_keyid))
> +		return;
> +
> +	free_page(kvm_tdx->tdr.va);
> +}
> +
> +static int tdx_do_tdh_mng_key_config(void *param)
> +{
> +	hpa_t *tdr_p = param;
> +	int cpu, cur_pkg;
> +	u64 err;
> +
> +	cpu = raw_smp_processor_id();
> +	cur_pkg = topology_physical_package_id(cpu);
> +
> +	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
> +	do {
> +		err = tdh_mng_key_config(*tdr_p);
> +	} while (err == TDX_KEY_GENERATION_FAILED);
> +	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);

Why not squash patch 20 ("KVM: TDX: allocate per-package mutex") into this
patch?

> +
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +int tdx_vm_init(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages;
> +	int ret, i;
> +	u64 err;
> +
> +	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
> +	kvm->max_vcpus = 0;
> +
> +	kvm_tdx->hkid = tdx_keyid_alloc();
> +	if (kvm_tdx->hkid < 0)
> +		return -EBUSY;
> +
> +	ret = tdx_alloc_td_page(&kvm_tdx->tdr);
> +	if (ret)
> +		goto free_hkid;
> +
> +	kvm_tdx->tdcs = kcalloc(tdx_caps.tdcs_nr_pages, sizeof(*kvm_tdx->tdcs),
> +				GFP_KERNEL_ACCOUNT);
> +	if (!kvm_tdx->tdcs)
> +		goto free_tdr;
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		ret = tdx_alloc_td_page(&kvm_tdx->tdcs[i]);
> +		if (ret)
> +			goto free_tdcs;
> +	}
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_create(kvm_tdx->tdr.pa, kvm_tdx->hkid);
> +	mutex_unlock(&tdx_lock);

Please add a comment explaining why locking is needed.

> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> +		ret = -EIO;
> +		goto free_tdcs;
> +	}
> +	tdx_mark_td_page_added(&kvm_tdx->tdr);
> +
> +	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> +		ret = -ENOMEM;
> +		goto free_tdcs;
> +	}
> +	for_each_online_cpu(i) {
> +		if (cpumask_test_and_set_cpu(topology_physical_package_id(i),
> +						packages))
> +			continue;
> +
> +		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> +				&kvm_tdx->tdr.pa, 1);
> +		if (ret)
> +			break;
> +	}
> +	free_cpumask_var(packages);
> +	if (ret)
> +		goto teardown;
> +
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		err = tdh_mng_addcx(kvm_tdx->tdr.pa, kvm_tdx->tdcs[i].pa);
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> +			ret = -EIO;
> +			goto teardown;
> +		}
> +		tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
> +	}
> +
> +	/*
> +	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
> +	 * ioctl() to configure the CPUID values for the TD.
> +	 */
> +	return 0;
> +
> +	/*
> +	 * The sequence for freeing resources from a partially initialized TD
> +	 * varies based on where in the initialization flow failure occurred.
> +	 * Simply use the full teardown and destroy, which naturally play nice
> +	 * with partial initialization.
> +	 */
> +teardown:
> +	tdx_mmu_prezap(kvm);
> +	tdx_vm_free(kvm);
> +	return ret;
> +
> +free_tdcs:
> +	/* @i points at the TDCS page that failed allocation. */
> +	for (--i; i >= 0; i--)
> +		free_page(kvm_tdx->tdcs[i].va);
> +	kfree(kvm_tdx->tdcs);
> +free_tdr:
> +	free_page(kvm_tdx->tdr.va);
> +free_hkid:
> +	tdx_hkid_free(kvm_tdx);
> +	return ret;
> +}
> +
>  static int __tdx_module_setup(void)
>  {
>  	const struct tdsysinfo_struct *tdsysinfo;
> @@ -59,6 +369,8 @@ static int __tdx_module_setup(void)
>  		return ret;
>  	}
>  
> +	tdx_global_keyid = tdx_get_global_keyid();
> +

Again, it's really confusing why this is needed.

>  	tdsysinfo = tdx_get_sysinfo();
>  	if (tdx_caps.nr_cpuid_configs > TDX_MAX_NR_CPUID_CONFIGS)
>  		return -EIO;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index e4bb8831764e..860136ed70f5 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -19,6 +19,8 @@ struct kvm_tdx {
>  
>  	struct tdx_td_page tdr;
>  	struct tdx_td_page *tdcs;
> +
> +	int hkid;
>  };
>  
>  struct vcpu_tdx {
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index 5c878488795d..590fcfdd1899 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -12,11 +12,11 @@
>  #define TDX_SUCCESS				0x0000000000000000ULL
>  #define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
>  #define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
> -#define TDX_LIFECYCLE_STATE_INCORRECT		0xC000060700000000ULL
>  #define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
>  #define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
>  #define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
>  #define TDX_KEY_CONFIGURED			0x0000081500000000ULL
> +#define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
>  #define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
>  
>  /*
> diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> index 0bed43879b82..3dd5b4c3f04c 100644
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -6,6 +6,7 @@
>  
>  #include <linux/compiler.h>
>  
> +#include <asm/cacheflush.h>
>  #include <asm/asm.h>
>  #include <asm/kvm_host.h>
>  
> @@ -15,8 +16,14 @@
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  
> +static inline void tdx_clflush_page(hpa_t addr)
> +{
> +	clflush_cache_range(__va(addr), PAGE_SIZE);
> +}
> +
>  static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
>  {
> +	tdx_clflush_page(addr);

Please add a comment to explain why clflush is needed.

And you don't need the tdx_clflush_page() wrapper -- it's not a TDX-specific
op.  You can just use clflush_cache_range().
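
E.g. something like this (just a sketch, untested):

	static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
	{
		/*
		 * Flush dirty cache lines before handing the page to the TDX
		 * module (my guess at the reason -- the module will access the
		 * page with a different KeyID; please spell it out here).
		 */
		clflush_cache_range(__va(addr), PAGE_SIZE);
		return kvm_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
	}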

>  	return kvm_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
>  }
>  
> @@ -56,6 +63,7 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)
>  
>  static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
>  {
> +	tdx_clflush_page(tdr);
>  	return kvm_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index da32b4b86b19..2b2738c768d6 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -132,12 +132,20 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
>  bool tdx_is_vm_type_supported(unsigned long type);
>  void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
>  void tdx_hardware_unsetup(void);
> +
> +int tdx_vm_init(struct kvm *kvm);
> +void tdx_mmu_prezap(struct kvm *kvm);
> +void tdx_vm_free(struct kvm *kvm);
>  #else
>  static inline void tdx_pre_kvm_init(
>  	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
>  static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
>  static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
>  static inline void tdx_hardware_unsetup(void) {}
> +
> +static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> +static inline void tdx_mmu_prezap(struct kvm *kvm) {}
> +static inline void tdx_vm_free(struct kvm *kvm) {}
>  #endif
>  
>  #endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-03-04 19:48 ` [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
@ 2022-03-31  4:55   ` Kai Huang
  2022-04-05 13:01     ` Paolo Bonzini
  2022-04-08  2:18     ` Isaku Yamahata
  2022-04-05 12:58   ` Paolo Bonzini
  1 sibling, 2 replies; 310+ messages in thread
From: Kai Huang @ 2022-03-31  4:55 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> TDX requires additional parameters for a TDX VM for confidential execution, to
> protect the confidentiality of its memory contents and its CPU state from any
> other software, including the VMM.  Those parameters must be passed when the
> guest TD VM is created, before any vcpu is created: the number of vcpus, the
> TSC frequency (which is the same for all vcpus and cannot be changed), the
> CPUIDs emulated by the TDX module (so the guest can trust those CPUIDs), and
> sha384 values for measurement.
> 
> Add a new subcommand, KVM_TDX_INIT_VM, to pass these parameters for the TDX
> guest.  It assigns an encryption key to the TDX guest for memory encryption;
> TDX encrypts memory on a per-guest basis.  The device model passes the per-VM
> parameters for the TDX guest: the maximum number of vcpus, the TSC frequency
> (a TDX guest has a fixed VM-wide TSC frequency, not per-vcpu, and the guest
> cannot change it), attributes (production or debug), available extended
> features (reflected into the guest XCR0 and IA32_XSS MSR), cpuids, sha384
> measurements, etc.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h       |   2 +
>  arch/x86/include/uapi/asm/kvm.h       |  12 ++
>  arch/x86/kvm/vmx/tdx.c                | 200 ++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h                |  26 ++++
>  arch/x86/kvm/x86.c                    |   3 +-
>  arch/x86/kvm/x86.h                    |   2 +
>  tools/arch/x86/include/uapi/asm/kvm.h |  12 ++
>  7 files changed, 256 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 5ff7a0fba311..290e200f012c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1234,6 +1234,8 @@ struct kvm_arch {
>  	hpa_t	hv_root_tdp;
>  	spinlock_t hv_root_tdp_lock;
>  #endif
> +
> +	gfn_t gfn_shared_mask;
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 70f9be4ea575..6e26dde0dce6 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
>  /* Trust Domain eXtension sub-ioctl() commands. */
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
> +	KVM_TDX_INIT_VM,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };
> @@ -561,4 +562,15 @@ struct kvm_tdx_capabilities {
>  	struct kvm_tdx_cpuid_config cpuid_configs[0];
>  };
>  
> +struct kvm_tdx_init_vm {
> +	__u32 max_vcpus;
> +	__u32 tsc_khz;
> +	__u64 attributes;
> +	__u64 cpuid;

Is it better to append all CPUIDs directly into this structure, perhaps at the
end of the structure, to make it more consistent with TD_PARAMS?

Also, I think somewhere in the commit message or comments we should explain why
CPUIDs are passed here (why the existing KVM_SET_CPUID2 is not sufficient).
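
E.g. roughly like below (just a sketch; I'm assuming a variable-size CPUID
array at the tail, similar to kvm_tdx_capabilities, is acceptable for the uapi):

	struct kvm_tdx_init_vm {
		__u32 max_vcpus;
		__u32 tsc_khz;
		__u64 attributes;
		__u64 mrconfigid[6];	/* sha384 digest */
		__u64 mrowner[6];	/* sha384 digest */
		__u64 mrownerconfig[6];	/* sha384 digest */
		__u64 reserved[43];	/* must be zero for future extensibility */
		struct kvm_cpuid2 cpuid;	/* CPUID leaves appended at the end, as in TD_PARAMS */
	};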

> +	__u64 mrconfigid[6];	/* sha384 digest */
> +	__u64 mrowner[6];	/* sha384 digest */
> +	__u64 mrownerconfig[6];	/* sha348 digest */
> +	__u64 reserved[43];	/* must be zero for future extensibility */
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 20b45bb0b032..236faaca68a0 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -387,6 +387,203 @@ static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  	return 0;
>  }
>  
> +static struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(struct kvm_tdx *kvm_tdx,
> +						u32 function, u32 index)
> +{
> +	struct kvm_cpuid_entry2 *e;
> +	int i;
> +
> +	for (i = 0; i < kvm_tdx->cpuid_nent; i++) {
> +		e = &kvm_tdx->cpuid_entries[i];
> +
> +		if (e->function == function && (e->index == index ||
> +		    !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
> +			return e;
> +	}
> +	return NULL;
> +}
> +
> +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> +			struct kvm_tdx_init_vm *init_vm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct tdx_cpuid_config *config;
> +	struct kvm_cpuid_entry2 *entry;
> +	struct tdx_cpuid_value *value;
> +	u64 guest_supported_xcr0;
> +	u64 guest_supported_xss;
> +	u32 guest_tsc_khz;
> +	int max_pa;
> +	int i;
> +
> +	/* init_vm->reserved must be zero */
> +	if (find_first_bit((unsigned long *)init_vm->reserved,
> +			   sizeof(init_vm->reserved) * 8) !=
> +	    sizeof(init_vm->reserved) * 8)
> +		return -EINVAL;
> +
> +	td_params->max_vcpus = init_vm->max_vcpus;
> +
> +	td_params->attributes = init_vm->attributes;
> +	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> +		pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
> +			"host perf registers properly.\n");
> +		return -EOPNOTSUPP;
> +	}

PERFMON can be supported, but it's not supported in this series, so perhaps add a
comment to explain that it's a TODO?
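
E.g. (sketch only):

	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
		/*
		 * TODO: support TDX_TD_ATTRIBUTE_PERFMON once KVM properly
		 * saves/restores the host perf registers.
		 */
		pr_warn("TD doesn't support perfmon yet.\n");
		return -EOPNOTSUPP;
	}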

> +
> +	/* TODO: Enforce consistent CPUID features for all vCPUs. */

I guess you have to enforce this when KVM_SET_CPUID2 is done after a vcpu is
created?  Then this comment shouldn't be here, because the enforcement isn't
something you can do in setup_tdparams().

> +	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
> +		config = &tdx_caps.cpuid_configs[i];
> +
> +		entry = tdx_find_cpuid_entry(kvm_tdx, config->leaf,
> +					     config->sub_leaf);
> +		if (!entry)
> +			continue;
> +
> +		/*
> +		 * Non-configurable bits must be '0', even if they are fixed to
> +		 * '1' by the TDX module, i.e. mask off non-configurable bits.
> +		 */
> +		value = &td_params->cpuid_values[i];
> +		value->eax = entry->eax & config->eax;
> +		value->ebx = entry->ebx & config->ebx;
> +		value->ecx = entry->ecx & config->ecx;
> +		value->edx = entry->edx & config->edx;
> +	}
> +
> +	max_pa = 36;
> +	entry = tdx_find_cpuid_entry(kvm_tdx, 0x80000008, 0);
> +	if (entry)
> +		max_pa = entry->eax & 0xff;
> +
> +	td_params->eptp_controls = VMX_EPTP_MT_WB;
> +	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
> +		td_params->eptp_controls |= VMX_EPTP_PWL_5;
> +		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
> +	} else {
> +		td_params->eptp_controls |= VMX_EPTP_PWL_4;
> +	}

Not quite sure, but could we support a >48-bit GPA with 4-level EPT?




^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-03-04 19:48 ` [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
@ 2022-03-31 11:16   ` Kai Huang
  2022-04-01  2:10     ` Kai Huang
                       ` (2 more replies)
  0 siblings, 3 replies; 310+ messages in thread
From: Kai Huang @ 2022-03-31 11:16 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> 
> Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> perspective) to a single GPA (from a memslot perspective). GPA aliasing
> will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> execute-only permission bit to the guest. To keep the implementation
> simple (relatively speaking), GPA aliasing is only supported via TDP.
> 
> Today KVM assumes two things that are broken by GPA aliasing.
>   1. GPAs coming from hardware can be simply shifted to get the GFNs.
>   2. GPA bits 51:MAXPHYADDR are reserved to zero.
> 
> With GPA aliasing, translating a GPA to GFN requires masking off the
> repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
> 
> To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> that is, bits stolen from the GPA to act as new virtualized attribute
> bits. A bit in the mask will cause the MMU code to create aliases of the
> GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> fault.
> 
> To handle case (1) from above, retain any stolen bits when passing a GPA
> in KVM's MMU code, but strip them when converting to a GFN so that the
> GFN contains only the "real" GFN, i.e. never has repurposed bits set.
> 
> GFNs (without stolen bits) continue to be used to:
>   - Specify physical memory by userspace via memslots
>   - Map GPAs to TDP PTEs via RMAP
>   - Specify dirty tracking and write protection
>   - Look up MTRR types
>   - Inject async page faults
> 
> Since there are now multiple aliases for the same aliased GPA, when
> userspace memory backing the memslots is paged out, both aliases need to be
> modified. Fortunately, this happens automatically. Since rmap supports
> multiple mappings for the same GFN for PTE shadowing based paging, by
> adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> operations will be applied to both aliases.
> 
> In the case of the rmap being removed in the future, the needed
> information could be recovered by iterating over the stolen bits and
> walking the TDP page tables.
> 
> For TLB flushes that are address based, make sure to flush both aliases
> in the case of stolen bits.
> 
> Only support stolen bits in 64 bit guest paging modes (long, PAE).
> Features that use this infrastructure should restrict the stolen bits to
> exclude the other paging modes. Don't support stolen bits for shadow EPT.
> 
> Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h              | 51 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/mmu.c          | 19 ++++++++++--
>  arch/x86/kvm/mmu/paging_tmpl.h  | 25 +++++++++-------
>  4 files changed, 84 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 208b29b0e637..d8b78d6abc10 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1235,7 +1235,9 @@ struct kvm_arch {
>  	spinlock_t hv_root_tdp_lock;
>  #endif
>  
> +#ifdef CONFIG_KVM_MMU_PRIVATE
>  	gfn_t gfn_shared_mask;
> +#endif
>  };
>  
>  struct kvm_vm_stat {
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index e9fbb2c8bbe2..3fb530359f81 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -365,4 +365,55 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
>  		return gpa;
>  	return translate_nested_gpa(vcpu, gpa, access, exception);
>  }
> +
> +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> +{
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> +	return kvm->arch.gfn_shared_mask;
> +#else
> +	return 0;
> +#endif
> +}
> +
> +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> +{
> +	return gfn_to_gpa(kvm_gfn_stolen_mask(kvm));
> +}
> +
> +static inline gpa_t kvm_gpa_unalias(struct kvm *kvm, gpa_t gpa)
> +{
> +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> +}
> +
> +static inline gfn_t kvm_gfn_unalias(struct kvm *kvm, gfn_t gfn)
> +{
> +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> +}
> +
> +static inline gfn_t kvm_gfn_shared(struct kvm *kvm, gfn_t gfn)
> +{
> +	return gfn | kvm_gfn_stolen_mask(kvm);
> +}
> +
> +static inline gfn_t kvm_gfn_private(struct kvm *kvm, gfn_t gfn)
> +{
> +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> +}
> +
> +static inline gpa_t kvm_gpa_private(struct kvm *kvm, gpa_t gpa)
> +{
> +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> +}
> +
> +static inline bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
> +{
> +	gfn_t mask = kvm_gfn_stolen_mask(kvm);
> +
> +	return mask && !(gfn & mask);
> +}
> +
> +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> +{
> +	return kvm_is_private_gfn(kvm, gpa_to_gfn(gpa));
> +}

The patch title and commit message say nothing about private/shared, but only
mention stolen bits in general.  It's weird to introduce those *private*-related
helpers here.

I think you can just ditch the concept of a generic stolen-bits infrastructure
and adopt only what TDX needs.
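
E.g. TDX only needs a single "shared" bit, so something like below might be
enough (sketch only; the names are made up by me):

	static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
	{
		/* Non-zero only for TDX: one GPA bit repurposed as the shared bit. */
		return kvm->arch.gfn_shared_mask;
	}

	static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
	{
		gfn_t mask = kvm_gfn_shared_mask(kvm);

		return mask && !(gpa_to_gfn(gpa) & mask);
	}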


>  #endif
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8e24f73bf60b..b68191aa39bf 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -276,11 +276,24 @@ static inline bool kvm_available_flush_tlb_with_range(void)
>  static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
>  		struct kvm_tlb_range *range)
>  {
> -	int ret = -ENOTSUPP;
> +	int ret = -EOPNOTSUPP;

This change doesn't belong in this patch.

> +	u64 gfn_stolen_mask;
>  
> -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> +	/*
> +	 * Fall back to the big hammer flush if there is more than one
> +	 * GPA alias that needs to be flushed.
> +	 */
> +	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> +	if (hweight64(gfn_stolen_mask) > 1)
> +		goto generic_flush;
> +
> +	if (range && kvm_available_flush_tlb_with_range()) {
> +		/* Callback should flush both private GFN and shared GFN. */
> +		range->start_gfn = kvm_gfn_unalias(kvm, range->start_gfn);

This seems wrong.  The intention of this function seems to be to flush the TLB
for all aliases of a given GFN range.  Here you unconditionally change the range
to always exclude the stolen bits.

>  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> +	}

And you always fall through to do the big-hammer flush, which is obviously not
intended.

>  
> +generic_flush:
>  	if (ret)
>  		kvm_flush_remote_tlbs(kvm);
>  }
> @@ -4010,7 +4023,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>  	unsigned long mmu_seq;
>  	int r;
>  
> -	fault->gfn = fault->addr >> PAGE_SHIFT;
> +	fault->gfn = kvm_gfn_unalias(vcpu->kvm, gpa_to_gfn(fault->addr));
>  	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
>  
>  	if (page_fault_handle_page_track(vcpu, fault))
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 5b5bdac97c7b..70aec31dee06 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -25,7 +25,8 @@
>  	#define guest_walker guest_walker64
>  	#define FNAME(name) paging##64_##name
>  	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~kvm_gpa_stolen_mask(vcpu->kvm) & \
> +					     PT64_LVL_ADDR_MASK(lvl))
>  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -44,7 +45,7 @@
>  	#define guest_walker guest_walker32
>  	#define FNAME(name) paging##32_##name
>  	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
>  	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
>  	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
>  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> @@ -58,7 +59,7 @@
>  	#define guest_walker guest_walkerEPT
>  	#define FNAME(name) ept_##name
>  	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
> -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
>  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
>  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
>  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> @@ -75,7 +76,7 @@
>  #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
>  
>  #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
>  
>  /*
>   * The guest_walker structure emulates the behavior of the hardware page
> @@ -96,9 +97,9 @@ struct guest_walker {
>  	struct x86_exception fault;
>  };
>  
> -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
>  {
> -	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> +	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
>  }
>  
>  static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> @@ -395,7 +396,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  		--walker->level;
>  
>  		index = PT_INDEX(addr, walker->level);
> -		table_gfn = gpte_to_gfn(pte);
> +		table_gfn = gpte_to_gfn(vcpu, pte);
>  		offset    = index * sizeof(pt_element_t);
>  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
>  
> @@ -460,7 +461,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
>  	if (unlikely(errcode))
>  		goto error;
>  
> -	gfn = gpte_to_gfn_lvl(pte, walker->level);
> +	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
>  	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
>  
>  	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> @@ -555,12 +556,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>  	gfn_t gfn;
>  	kvm_pfn_t pfn;
>  
> +	WARN_ON(gpte & kvm_gpa_stolen_mask(vcpu->kvm));
> +
>  	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
>  		return false;
>  
>  	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
>  
> -	gfn = gpte_to_gfn(gpte);
> +	gfn = gpte_to_gfn(vcpu, gpte);
>  	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
>  	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
>  
> @@ -656,6 +659,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
>  	WARN_ON_ONCE(gw->gfn != base_gfn);
>  	direct_access = gw->pte_access;
>  
> +	WARN_ON(fault->addr & kvm_gpa_stolen_mask(vcpu->kvm));
> +
>  	top_level = vcpu->arch.mmu->root_level;
>  	if (top_level == PT32E_ROOT_LEVEL)
>  		top_level = PT32_ROOT_LEVEL;
> @@ -1080,7 +1085,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  			continue;
>  		}
>  
> -		gfn = gpte_to_gfn(gpte);
> +		gfn = gpte_to_gfn(vcpu, gpte);
>  		pte_access = sp->role.access;
>  		pte_access &= FNAME(gpte_access)(gpte);
>  		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);

In the commit message you mentioned "Don't support stolen bits for shadow EPT"
(you actually mean the shadow MMU, I suppose), yet there's a bunch of code
changes to the shadow MMU.


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  2022-03-04 19:48 ` [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
@ 2022-03-31 11:23   ` Kai Huang
  2022-04-01  1:51     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-03-31 11:23 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> To keep the non-TDX case intact, introduce a new config option for
> private KVM MMU support.  At the moment, this is a synonym for
> CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL.  The new flag makes it clear
> that the config is only for the x86 KVM MMU.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/Kconfig | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 2b1548da00eb..2db590845927 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -136,4 +136,8 @@ config KVM_MMU_AUDIT
>  config KVM_EXTERNAL_WRITE_TRACKING
>  	bool
>  
> +config KVM_MMU_PRIVATE
> +	def_bool y
> +	depends on INTEL_TDX_HOST && KVM_INTEL
> +
>  endif # VIRTUALIZATION

I am really not sure why we need this.  Roughly looking at the MMU-related
patches, this new config option is hardly used.  You have many code changes
related to handling private/shared, but they are not under this config option.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-31  0:03       ` Sean Christopherson
  2022-03-31  1:02         ` Kai Huang
@ 2022-03-31 17:03         ` Isaku Yamahata
  2022-03-31 19:34           ` Sean Christopherson
  1 sibling, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-31 17:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel,
	Jim Mattson, erdemaktas, Connor Kuehl

On Thu, Mar 31, 2022 at 12:03:15AM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Mon, Mar 14, 2022, Isaku Yamahata wrote:
> > On Sun, Mar 13, 2022 at 03:03:40PM +0100,
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > > +
> > > > +	if (!tdx_module_initialized) {
> > > > +		if (enable_tdx) {
> > > > +			ret = __tdx_module_setup();
> > > > +			if (ret)
> > > > +				enable_tdx = false;
> > > 
> > > "enable_tdx = false" isn't great to do only when a VM is created.  Does it
> > > make sense to anticipate this to the point when the kvm_intel.ko module is
> > > loaded?
> > 
> > It's possible.  I have the following two reasons for choosing to defer TDX module
> > initialization until the first TD is created.  Given those reasons, do you still want
> > the initialization at kvm_intel.ko module load time?  If yes, I'll change it.
> 
> Yes, TDX module setup needs to be done at load time.  The loss of memory is
> unfortunate, e.g. if the host is part of a pool that _might_ run TDX guests, but
> the alternatives are worse.  If TDX fails to initialize, e.g. due to low mem,
> then the host will be unable to run TDX guests despite saying "I support TDX".
> Or this gem :-)

Ok.


> > - memory overhead: Initializing the TDX module requires allocating physically
> > contiguous memory whose size is about 0.43% of system memory.  If the user
> > doesn't run any TD, it is wasted.
> > 
> > - VMXON on all pCPUs: Initializing the TDX module requires VMX to be enabled
> > (VMXON) on all present pCPUs.  vmx_hardware_enable(), which is called when a
> > guest is created, does that.  It naturally fits with initializing the TDX
> > module when the first TD is created.  I wanted to avoid code that enables
> > VMXON when loading kvm_intel.ko.
> 
> That's a solvable problem, though making it work without exporting hardware_enable_all()
> could get messy.

Could you please explain why it's a bad idea to export it?

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-31 17:03         ` Isaku Yamahata
@ 2022-03-31 19:34           ` Sean Christopherson
       [not found]             ` <20220401032741.GA2806@gao-cwp>
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-03-31 19:34 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl, Chao Gao

+Chao Gao

On Thu, Mar 31, 2022, Isaku Yamahata wrote:
> On Thu, Mar 31, 2022 at 12:03:15AM +0000, Sean Christopherson <seanjc@google.com> wrote:
> > On Mon, Mar 14, 2022, Isaku Yamahata wrote:
> > > - VMXON on all pCPUs: The TDX module initialization requires to enable VMX
> > > (VMXON) on all present pCPUs.  vmx_hardware_enable() which is called on creating
> > > guest does it.  It naturally fits with the TDX module initialization at creating
> > > first TD.  I wanted to avoid code to enable VMXON on loading the kvm_intel.ko.
> > 
> > That's a solvable problem, though making it work without exporting hardware_enable_all()
> > could get messy.
> 
> Could you please explain any reason why it's bad idea to export it?

I'd really prefer to keep the hardware enable/disable logic internal to kvm_main.c
so that all architectures share a common flow, and so that kvm_main.c is the sole
owner.  I'm worried that exposing the helper will lead to other arch/vendor usage,
and that will end up with what is effectively duplicate flows.  Deduplicating arch
code into generic KVM is usually very difficult.

This might also be a good opportunity to make KVM slightly more robust.  Ooh, and
we can kill two birds with one stone.  There's an in-flight series to add compatibility
checks to hotplug[*].  But rather than special case hotplug, what if we instead do
hardware enable/disable during module load, and move the compatibility check into
the hardware_enable path?  That fixes the hotplug issue, gives TDX a window for running
post-VMXON code in kvm_init(), and makes the broadcast IPI less wasteful on architectures
that don't have compatibility checks.

I'm thinking something like this, maybe as a modification to patch 6 in Chao's
series, or more likely as a patch 7 so that the hotplug compat checks still get
in even if the early hardware enable doesn't work on all architectures for some
reason.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 69c318fdff61..c6572a056072 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4838,8 +4838,13 @@ static void hardware_enable_nolock(void *junk)

        cpumask_set_cpu(cpu, cpus_hardware_enabled);

+       r = kvm_arch_check_processor_compat();
+       if (r)
+               goto out;
+
        r = kvm_arch_hardware_enable();

+out:
        if (r) {
                cpumask_clear_cpu(cpu, cpus_hardware_enabled);
                atomic_inc(&hardware_enable_failed);
@@ -5636,18 +5641,6 @@ void kvm_unregister_perf_callbacks(void)
 }
 #endif

-struct kvm_cpu_compat_check {
-       void *opaque;
-       int *ret;
-};
-
-static void check_processor_compat(void *data)
-{
-       struct kvm_cpu_compat_check *c = data;
-
-       *c->ret = kvm_arch_check_processor_compat(c->opaque);
-}
-
 int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
                  struct module *module)
 {
@@ -5679,13 +5672,13 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
        if (r < 0)
                goto out_free_1;

-       c.ret = &r;
-       c.opaque = opaque;
-       for_each_online_cpu(cpu) {
-               smp_call_function_single(cpu, check_processor_compat, &c, 1);
-               if (r < 0)
-                       goto out_free_2;
-       }
+       r = hardware_enable_all();
+       if (r)
+               goto out_free_2;
+
+       kvm_arch_post_hardware_enable_setup();
+
+       hardware_disable_all();

        r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
                                      kvm_starting_cpu, kvm_dying_cpu);

[*] https://lore.kernel.org/all/20211227081515.2088920-7-chao.gao@intel.com

^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-31  3:31   ` Kai Huang
@ 2022-03-31 19:41     ` Isaku Yamahata
  2022-04-01  6:56       ` Xiaoyao Li
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-31 19:41 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, Mar 31, 2022 at 04:31:10PM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>

> > Add a wrapper function to initialize the TDX module and get system-wide
> > parameters via those APIs.  Because TDX requires VMX to be enabled, it will
> > be called on demand when the first guest TD is created via the x86 KVM
> > init_vm callback.
> 
> Why not just merge this patch with the change where you implement the init_vm
> callback?  Then you can just describe this patch as "detect and initialize the
> TDX module when the first VM is created", or something like that.

Ok.  Anyway, in the next respin, TDX module initialization will be done when
loading kvm_intel.ko.  So this whole part will change and become part of
module loading.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free
  2022-03-31  3:02   ` Kai Huang
@ 2022-03-31 19:54     ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-31 19:54 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, Mar 31, 2022 at 04:02:24PM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Kai Huang <kai.huang@intel.com>
> > 
> > Before tearing down private page tables, TDX requires some resources of the
> > guest TD to be destroyed (i.e. keyID must have been reclaimed, etc).  Add
> > prezap callback before tearing down private page tables for it.
> > 
> > TDX needs to free some resources after other resources (i.e. vcpu related
> > resources).  Add vm_free callback at the end of kvm_arch_destroy_vm().
> > 
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/kvm-x86-ops.h | 2 ++
> >  arch/x86/include/asm/kvm_host.h    | 2 ++
> >  arch/x86/kvm/x86.c                 | 8 ++++++++
> >  3 files changed, 12 insertions(+)
> > 
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 8125d43d3566..ef48dcc98cfc 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -20,7 +20,9 @@ KVM_X86_OP(has_emulated_msr)
> >  KVM_X86_OP(vcpu_after_set_cpuid)
> >  KVM_X86_OP(is_vm_type_supported)
> >  KVM_X86_OP(vm_init)
> > +KVM_X86_OP_NULL(mmu_prezap)
> >  KVM_X86_OP_NULL(vm_destroy)
> > +KVM_X86_OP_NULL(vm_free)
> >  KVM_X86_OP(vcpu_create)
> >  KVM_X86_OP(vcpu_free)
> >  KVM_X86_OP(vcpu_reset)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 8de357a9ad30..5ff7a0fba311 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1326,7 +1326,9 @@ struct kvm_x86_ops {
> >  	bool (*is_vm_type_supported)(unsigned long vm_type);
> >  	unsigned int vm_size;
> >  	int (*vm_init)(struct kvm *kvm);
> > +	void (*mmu_prezap)(struct kvm *kvm);
> >  	void (*vm_destroy)(struct kvm *kvm);
> > +	void (*vm_free)(struct kvm *kvm);
> >  
> >  	/* Create, but do not attach this VCPU */
> >  	int (*vcpu_create)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index f6438750d190..a48f5c69fadb 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -11779,6 +11779,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
> >  	kvm_page_track_cleanup(kvm);
> >  	kvm_xen_destroy_vm(kvm);
> >  	kvm_hv_destroy_vm(kvm);
> > +	static_call_cond(kvm_x86_vm_free)(kvm);
> >  }
> >  
> >  static void memslot_rmap_free(struct kvm_memory_slot *slot)
> > @@ -12036,6 +12037,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
> >  
> >  void kvm_arch_flush_shadow_all(struct kvm *kvm)
> >  {
> > +	/*
> > +	 * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
> > +	 * tearing down private page tables, TDX requires some TD resources to
> > +	 * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
> > +	 * kvm_x86_mmu_prezap() for this.
> > +	 */
> > +	static_call_cond(kvm_x86_mmu_prezap)(kvm);
> >  	kvm_mmu_zap_all(kvm);
> >  }
> >  
> 
> The two callbacks are introduced here but they are actually implemented two
> patches later (patch 24, "KVM: TDX: create/destroy VM structure").  Why not
> just squash this patch into patch 24?  Or at least put this patch right before
> patch 24.

Ok. I'll squash this patch into it.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid
  2022-03-31  1:21   ` Kai Huang
@ 2022-03-31 20:15     ` Isaku Yamahata
  2022-04-06  1:55       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-31 20:15 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, Mar 31, 2022 at 02:21:06PM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > MKTME keyid is assigned to guest TD.  The memory controller encrypts guest
> > TD memory with key id.  Add helper functions to allocate/free MKTME keyid
> > so that TDX KVM assign keyid.
> 
> Using MKTME keyid is wrong, or at least not accurate, I think.  We should
> explicitly use "TDX private KeyID", which is clearly documented in the spec:
>   
> https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> 
> Also, description of IA32_MKTME_KEYID_PARTITIONING MSR clearly says TDX private
> KeyIDs span the range (NUM_MKTME_KIDS+1) through
> (NUM_MKTME_KIDS+NUM_TDX_PRIV_KIDS).  So please just use TDX private KeyID here.
> 
> 
> > 
> > Also export MKTME global keyid that is used to encrypt TDX module and its
> > memory.
> 
> This needs an explanation of why the global keyID needs to be exported.

How about the followings?

A TDX private host key id is assigned to each guest TD.  The memory
controller encrypts guest TD memory with the assigned host key id (HKID).
Add helper functions to allocate/free TDX private host key ids so that TDX
KVM can manage them.

Also export the global TDX private host key id that is used to encrypt the
TDX module, its memory and some dynamic data (e.g. TDR).  When the VMM
releases an encrypted page for reuse, the page needs to be flushed with the
host key id it was used with.  The VMM needs the global TDX private host key
id to flush the pages that the TDX module accesses with it.

Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-03-31  4:17   ` Kai Huang
@ 2022-03-31 22:12     ` Isaku Yamahata
  2022-03-31 23:41       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-03-31 22:12 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, Mar 31, 2022 at 05:17:37PM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 

> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 1c8222f54764..702953fd365f 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -31,14 +31,324 @@ struct tdx_capabilities {
> >  	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
> >  };
> >  
> > +/* KeyID used by TDX module */
> > +static u32 tdx_global_keyid __read_mostly;
> > +
> 
> It's really not clear why you need to know tdx_global_keyid in the context of
> creating/destroying a TD.

The TDX module maps TDR with the TDX global key id.  This page includes the
key id assigned to this TD.  Then, the TDX module maps the other TD-related
pages with that HKID.  Unlike the other TD-related pages, TDR therefore
requires the TDX global key id for cache flush.
I'll add a comment.
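
For example, something like this (draft wording):

	/*
	 * The TDX global KeyID, used by the TDX module itself.  TDR is
	 * encrypted with this global KeyID (and holds the per-TD KeyID), so
	 * unlike the other TD-private pages, reclaiming TDR needs a cache
	 * writeback/flush with the global KeyID.
	 */
	static u32 tdx_global_keyid __read_mostly;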


> >  /* Capabilities of KVM + the TDX module. */
> >  struct tdx_capabilities tdx_caps;
> >  
> > +static DEFINE_MUTEX(tdx_lock);
> >  static struct mutex *tdx_mng_key_config_lock;
> >  
> >  static u64 hkid_mask __ro_after_init;
> >  static u8 hkid_start_pos __ro_after_init;
> >  
> > +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > +{
> > +	pa &= ~hkid_mask;
> > +	pa |= (u64)hkid << hkid_start_pos;
> > +
> > +	return pa;
> > +}
> > +
> > +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> > +{
> > +	return kvm_tdx->tdr.added;
> > +}
> > +
> > +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> > +{
> > +	tdx_keyid_free(kvm_tdx->hkid);
> > +	kvm_tdx->hkid = -1;
> > +}
> > +
> > +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> > +{
> > +	return kvm_tdx->hkid > 0;
> > +}
> > +
> > +static void tdx_clear_page(unsigned long page)
> > +{
> > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > +	unsigned long i;
> > +
> > +	/* Zeroing the page is only necessary for systems with MKTME-i. */
> 
> "only necessary for systems  with MKTME-i" because of what?
> 
> Please be more clear: on an MKTME-i system, when re-assigning a page from an old
> keyid to a new keyid, MOVDIR64B is required to clear/write the page with the new
> keyid, to prevent an integrity error when the page is later read with the new keyid.

Let me borrow this sentence as a comment on it.


> > +static int __tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
> > +{
> > +	struct tdx_module_output out;
> > +	u64 err;
> > +
> > +	err = tdh_phymem_page_reclaim(pa, &out);
> > +	if (WARN_ON_ONCE(err)) {
> > +		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> > +		return -EIO;
> > +	}
> > +
> > +	if (do_wb) {
> 
> In the callers, please add some comments explaining why do_wb is needed, and why
> it is not needed.

Will do.

> > +static int tdx_do_tdh_mng_key_config(void *param)
> > +{
> > +	hpa_t *tdr_p = param;
> > +	int cpu, cur_pkg;
> > +	u64 err;
> > +
> > +	cpu = raw_smp_processor_id();
> > +	cur_pkg = topology_physical_package_id(cpu);
> > +
> > +	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
> > +	do {
> > +		err = tdh_mng_key_config(*tdr_p);
> > +	} while (err == TDX_KEY_GENERATION_FAILED);
> > +	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
> 
> Why not squash patch 20 ("KVM: TDX: allocate per-package mutex") into this
> patch?

Will do.


> > +
> > +	if (WARN_ON_ONCE(err)) {
> > +		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> > +		return -EIO;
> > +	}
> > +
> > +	return 0;
> > +}
> > +
> > +int tdx_vm_init(struct kvm *kvm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	cpumask_var_t packages;
> > +	int ret, i;
> > +	u64 err;
> > +
> > +	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
> > +	kvm->max_vcpus = 0;
> > +
> > +	kvm_tdx->hkid = tdx_keyid_alloc();
> > +	if (kvm_tdx->hkid < 0)
> > +		return -EBUSY;
> > +
> > +	ret = tdx_alloc_td_page(&kvm_tdx->tdr);
> > +	if (ret)
> > +		goto free_hkid;
> > +
> > +	kvm_tdx->tdcs = kcalloc(tdx_caps.tdcs_nr_pages, sizeof(*kvm_tdx->tdcs),
> > +				GFP_KERNEL_ACCOUNT);
> > +	if (!kvm_tdx->tdcs)
> > +		goto free_tdr;
> > +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> > +		ret = tdx_alloc_td_page(&kvm_tdx->tdcs[i]);
> > +		if (ret)
> > +			goto free_tdcs;
> > +	}
> > +
> > +	mutex_lock(&tdx_lock);
> > +	err = tdh_mng_create(kvm_tdx->tdr.pa, kvm_tdx->hkid);
> > +	mutex_unlock(&tdx_lock);
> 
> Please add a comment explaining why locking is needed.

I'll add a comment on tdx_lock, not on each TDX seamcall.
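
Something like this, for example (draft; the exact wording may need adjusting):

	/*
	 * Serialize SEAMCALLs that operate on global TDX module state, e.g.
	 * TD and KeyID management, so that concurrent TD creation/destruction
	 * doesn't make them fail with TDX_BUSY.
	 */
	static DEFINE_MUTEX(tdx_lock);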


> > diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> > index 0bed43879b82..3dd5b4c3f04c 100644
> > --- a/arch/x86/kvm/vmx/tdx_ops.h
> > +++ b/arch/x86/kvm/vmx/tdx_ops.h
> > @@ -6,6 +6,7 @@
> >  
> >  #include <linux/compiler.h>
> >  
> > +#include <asm/cacheflush.h>
> >  #include <asm/asm.h>
> >  #include <asm/kvm_host.h>
> >  
> > @@ -15,8 +16,14 @@
> >  
> >  #ifdef CONFIG_INTEL_TDX_HOST
> >  
> > +static inline void tdx_clflush_page(hpa_t addr)
> > +{
> > +	clflush_cache_range(__va(addr), PAGE_SIZE);
> > +}
> > +
> >  static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
> >  {
> > +	tdx_clflush_page(addr);
> 
> Please add a comment to explain why clflush is needed.
> 
> And you don't need the tdx_clflush_page() wrapper -- it's not a TDX-specific
> op.  You can just use clflush_cache_range().

Will remove it.

The plan was to enhance tdx_clflush_page(addr, page_level) to support large
pages, to avoid repeating page_level_to_size.  I'll defer that to large page
support.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-03-31 22:12     ` Isaku Yamahata
@ 2022-03-31 23:41       ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-03-31 23:41 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, 2022-03-31 at 15:12 -0700, Isaku Yamahata wrote:
> > It's really not clear why you need to know tdx_global_keyid in the context
> > of
> > creating/destroying a TD.
> 
> The TDX module maps TDR with the TDX global key id.  This page includes the
> key id assigned to this TD.  Then, the TDX module maps the other TD-related
> pages with that HKID.  Unlike the other TD-related pages, TDR therefore
> requires the TDX global key id for cache flush.
> I'll add a comment.

Sorry I forgot this. Thanks.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  2022-03-31 11:23   ` Kai Huang
@ 2022-04-01  1:51     ` Isaku Yamahata
  2022-04-01  2:13       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-01  1:51 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Fri, Apr 01, 2022 at 12:23:28AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > To Keep the case of non TDX intact, introduce a new config option for
> > private KVM MMU support.  At the moment, this is synonym for
> > CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL.  The new flag make it clear
> > that the config is only for x86 KVM MMU.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/Kconfig | 4 ++++
> >  1 file changed, 4 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > index 2b1548da00eb..2db590845927 100644
> > --- a/arch/x86/kvm/Kconfig
> > +++ b/arch/x86/kvm/Kconfig
> > @@ -136,4 +136,8 @@ config KVM_MMU_AUDIT
> >  config KVM_EXTERNAL_WRITE_TRACKING
> >  	bool
> >  
> > +config KVM_MMU_PRIVATE
> > +	def_bool y
> > +	depends on INTEL_TDX_HOST && KVM_INTEL
> > +
> >  endif # VIRTUALIZATION
> 
> I am really not sure why need this.  Roughly looking at MMU related patches this
> new config option is hardly used.  You have many code changes related to
> handling private/shared but they are not under this config option.

I don't want to use CONFIG_INTEL_TDX_HOST in KVM MMU code.  I think the changes
to KVM MMU should be somewhat independent from TDX.  But it seems that failed,
based on your feedback.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-03-31 11:16   ` Kai Huang
@ 2022-04-01  2:10     ` Kai Huang
  2022-04-01  2:34     ` Isaku Yamahata
  2022-04-05 13:55     ` Paolo Bonzini
  2 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-01  2:10 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-04-01 at 00:16 +1300, Kai Huang wrote:
> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> > perspective) to a single GPA (from a memslot perspective). GPA aliasing
> > will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> > execute-only permission bit to the guest. To keep the implementation
> > simple (relatively speaking), GPA aliasing is only supported via TDP.
> > 
> > Today KVM assumes two things that are broken by GPA aliasing.
> >   1. GPAs coming from hardware can be simply shifted to get the GFNs.
> >   2. GPA bits 51:MAXPHYADDR are reserved to zero.
> > 
> > With GPA aliasing, translating a GPA to GFN requires masking off the
> > repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
> > 
> > To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> > that is, bits stolen from the GPA to act as new virtualized attribute
> > bits. A bit in the mask will cause the MMU code to create aliases of the
> > GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> > fault.
> > 
> > To handle case (1) from above, retain any stolen bits when passing a GPA
> > in KVM's MMU code, but strip them when converting to a GFN so that the
> > GFN contains only the "real" GFN, i.e. never has repurposed bits set.
> > 
> > GFNs (without stolen bits) continue to be used to:
> >   - Specify physical memory by userspace via memslots
> >   - Map GPAs to TDP PTEs via RMAP
> >   - Specify dirty tracking and write protection
> >   - Look up MTRR types
> >   - Inject async page faults
> > 
> > Since there are now multiple aliases for the same aliased GPA, when
> > userspace memory backing the memslots is paged out, both aliases need to be
> > modified. Fortunately, this happens automatically. Since rmap supports
> > multiple mappings for the same GFN for PTE shadowing based paging, by
> > adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> > operations will be applied to both aliases.
> > 
> > In the case of the rmap being removed in the future, the needed
> > information could be recovered by iterating over the stolen bits and
> > walking the TDP page tables.
> > 
> > For TLB flushes that are address based, make sure to flush both aliases
> > in the case of stolen bits.
> > 
> > Only support stolen bits in 64 bit guest paging modes (long, PAE).
> > Features that use this infrastructure should restrict the stolen bits to
> > exclude the other paging modes. Don't support stolen bits for shadow EPT.
> > 
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu.h              | 51 +++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/mmu.c          | 19 ++++++++++--
> >  arch/x86/kvm/mmu/paging_tmpl.h  | 25 +++++++++-------
> >  4 files changed, 84 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 208b29b0e637..d8b78d6abc10 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1235,7 +1235,9 @@ struct kvm_arch {
> >  	spinlock_t hv_root_tdp_lock;
> >  #endif
> >  
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> >  	gfn_t gfn_shared_mask;
> > +#endif
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index e9fbb2c8bbe2..3fb530359f81 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -365,4 +365,55 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> >  		return gpa;
> >  	return translate_nested_gpa(vcpu, gpa, access, exception);
> >  }
> > +
> > +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> > +{
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> > +	return kvm->arch.gfn_shared_mask;
> > +#else
> > +	return 0;
> > +#endif
> > +}
> > +
> > +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> > +{
> > +	return gfn_to_gpa(kvm_gfn_stolen_mask(kvm));
> > +}
> > +
> > +static inline gpa_t kvm_gpa_unalias(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_unalias(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_shared(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn | kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gpa_t kvm_gpa_private(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> > +}
> > +
> > +static inline bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	gfn_t mask = kvm_gfn_stolen_mask(kvm);
> > +
> > +	return mask && !(gfn & mask);
> > +}
> > +
> > +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return kvm_is_private_gfn(kvm, gpa_to_gfn(gpa));
> > +}
> 
> The patch title and commit message say nothing about private/shared, but only
> mention stolen bits in general.  It's weird to introduce those *private* related
> helpers here.
> 
> I think you can just ditch the concept of stolen bit infrastructure, but just
> adopt what TDX needs.
> 
> 
> >  #endif
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8e24f73bf60b..b68191aa39bf 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -276,11 +276,24 @@ static inline bool kvm_available_flush_tlb_with_range(void)
> >  static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> >  		struct kvm_tlb_range *range)
> >  {
> > -	int ret = -ENOTSUPP;
> > +	int ret = -EOPNOTSUPP;
> 
> Change doesn't belong to this patch.
> 
> > +	u64 gfn_stolen_mask;
> >  
> > -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> > +	/*
> > +	 * Fall back to the big hammer flush if there is more than one
> > +	 * GPA alias that needs to be flushed.
> > +	 */
> > +	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> > +	if (hweight64(gfn_stolen_mask) > 1)
> > +		goto generic_flush;
> > +
> > +	if (range && kvm_available_flush_tlb_with_range()) {
> > +		/* Callback should flush both private GFN and shared GFN. */
> > +		range->start_gfn = kvm_gfn_unalias(kvm, range->start_gfn);
> 
> This seems wrong.  It seems the intention of this function is to flush TLB for
> all aliases for a given GFN range.  Here it seems you are unconditionally change
> to range to always exclude the stolen bits.
> 
> >  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> > +	}
> 
> And you always fall through to do big hammer flush, which is obviously not
> intended.
> 
> >  
> > +generic_flush:
> >  	if (ret)
> >  		kvm_flush_remote_tlbs(kvm);
> >  }
> > @@ -4010,7 +4023,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	unsigned long mmu_seq;
> >  	int r;
> >  
> > -	fault->gfn = fault->addr >> PAGE_SHIFT;
> > +	fault->gfn = kvm_gfn_unalias(vcpu->kvm, gpa_to_gfn(fault->addr));
> >  	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> >  
> >  	if (page_fault_handle_page_track(vcpu, fault))

Looking at the code more, I think this patch is broken.  There are a couple of
issues, if I understand correctly:

- Rick's original patch has stolen_bits_mask encoded in 'struct kvm_mmu_page',
so basically a new page table is allocated for each aliasing GPA.  Sean
suggested using role.private instead of stolen_bits_mask, so I changed that,
but it was lost in this patch too.  Essentially, with this patch, all aliasing
GFNs share the same page table and the same mapping.  There's a slight
difference between the TDP MMU and the legacy MMU, in that the former purely
uses 'fault->gfn' (which doesn't have the aliasing bit) to iterate the page
table and the latter uses 'fault->addr' (which contains the aliasing bit), but
this makes little difference.  With this patch, all aliasing GFNs share the
page table and the mapping.  This is not what we want, and this is wrong.

- The original change to get the GFN w/o aliasing for the MTRR check (below) is
lost.  And there are some other changes that are also lost (such as not
supporting aliasing for private (user-invisible, not TDX private) memory
slots), but it's not immediately apparent to me whether this is an issue.

@@ -3833,7 +3865,7 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, gpa_t gpa, u32 error_code,
 	     max_level > PG_LEVEL_4K;
 	     max_level--) {
 		int page_num = KVM_PAGES_PER_HPAGE(max_level);
-		gfn_t base = (gpa >> PAGE_SHIFT) & ~(page_num - 1);
+		gfn_t base = vcpu_gpa_to_gfn_unalias(vcpu, gpa) & ~(page_num - 1);

Another thing: the above change to kvm_flush_remote_tlbs_with_range(), which
makes it flush TLBs for the mappings of all aliases of a given GFN range,
doesn't fit TDX.  TDX private and shared mappings cannot co-exist, therefore
when a page that has multiple aliases mapped to it is taken out, only one
mapping is valid (not to mention a private page cannot be taken out).  This is
one of the reasons I think this GPA stolen bits infrastructure isn't really
mandatory for TDX.  I think it's OK to ditch this infrastructure and adopt
what TDX needs (the concept of private/shared mappings).


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  2022-04-01  1:51     ` Isaku Yamahata
@ 2022-04-01  2:13       ` Kai Huang
  2022-04-05 13:48         ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-01  2:13 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, 2022-03-31 at 18:51 -0700, Isaku Yamahata wrote:
> On Fri, Apr 01, 2022 at 12:23:28AM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > To Keep the case of non TDX intact, introduce a new config option for
> > > private KVM MMU support.  At the moment, this is synonym for
> > > CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL.  The new flag make it clear
> > > that the config is only for x86 KVM MMU.
> > > 
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > ---
> > >  arch/x86/kvm/Kconfig | 4 ++++
> > >  1 file changed, 4 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> > > index 2b1548da00eb..2db590845927 100644
> > > --- a/arch/x86/kvm/Kconfig
> > > +++ b/arch/x86/kvm/Kconfig
> > > @@ -136,4 +136,8 @@ config KVM_MMU_AUDIT
> > >  config KVM_EXTERNAL_WRITE_TRACKING
> > >  	bool
> > >  
> > > +config KVM_MMU_PRIVATE
> > > +	def_bool y
> > > +	depends on INTEL_TDX_HOST && KVM_INTEL
> > > +
> > >  endif # VIRTUALIZATION
> > 
> > I am really not sure why need this.  Roughly looking at MMU related patches this
> > new config option is hardly used.  You have many code changes related to
> > handling private/shared but they are not under this config option.
> 
> I don't want to use CONFIG_INTEL_TDX_HOST in KVM MMU code.  I think the changes
> to KVM MMU should be somewhat independent from TDX.  But it seems that failed,
> based on your feedback.

Why do you need to use any config?  As I said, the majority of your changes to
the MMU are not under any config.  But I'll leave this to the
maintainers/reviewers.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-03-31 11:16   ` Kai Huang
  2022-04-01  2:10     ` Kai Huang
@ 2022-04-01  2:34     ` Isaku Yamahata
  2022-04-05 14:02       ` Paolo Bonzini
  2022-04-05 14:02       ` Paolo Bonzini
  2022-04-05 13:55     ` Paolo Bonzini
  2 siblings, 2 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-01  2:34 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson,
	Chao Peng

Added Peng Chao.

On Fri, Apr 01, 2022 at 12:16:41AM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > 
> > Add support in KVM's MMU for aliasing multiple GPAs (from a hardware
> > perspective) to a single GPA (from a memslot perspective). GPA aliasing
> > will be used to repurpose GPA bits as attribute bits, e.g. to expose an
> > execute-only permission bit to the guest. To keep the implementation
> > simple (relatively speaking), GPA aliasing is only supported via TDP.
> > 
> > Today KVM assumes two things that are broken by GPA aliasing.
> >   1. GPAs coming from hardware can be simply shifted to get the GFNs.
> >   2. GPA bits 51:MAXPHYADDR are reserved to zero.
> > 
> > With GPA aliasing, translating a GPA to GFN requires masking off the
> > repurposed bit, and a repurposed bit may reside in 51:MAXPHYADDR.
> > 
> > To support GPA aliasing, introduce the concept of per-VM GPA stolen bits,
> > that is, bits stolen from the GPA to act as new virtualized attribute
> > bits. A bit in the mask will cause the MMU code to create aliases of the
> > GPA. It can also be used to find the GFN out of a GPA coming from a tdp
> > fault.
> > 
> > To handle case (1) from above, retain any stolen bits when passing a GPA
> > in KVM's MMU code, but strip them when converting to a GFN so that the
> > GFN contains only the "real" GFN, i.e. never has repurposed bits set.
> > 
> > GFNs (without stolen bits) continue to be used to:
> >   - Specify physical memory by userspace via memslots
> >   - Map GPAs to TDP PTEs via RMAP
> >   - Specify dirty tracking and write protection
> >   - Look up MTRR types
> >   - Inject async page faults
> > 
> > Since there are now multiple aliases for the same aliased GPA, when
> > userspace memory backing the memslots is paged out, both aliases need to be
> > modified. Fortunately, this happens automatically. Since rmap supports
> > multiple mappings for the same GFN for PTE shadowing based paging, by
> > adding/removing each alias PTE with its GFN, kvm_handle_hva() based
> > operations will be applied to both aliases.
> > 
> > In the case of the rmap being removed in the future, the needed
> > information could be recovered by iterating over the stolen bits and
> > walking the TDP page tables.
> > 
> > For TLB flushes that are address based, make sure to flush both aliases
> > in the case of stolen bits.
> > 
> > Only support stolen bits in 64 bit guest paging modes (long, PAE).
> > Features that use this infrastructure should restrict the stolen bits to
> > exclude the other paging modes. Don't support stolen bits for shadow EPT.
> > 
> > Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/kvm_host.h |  2 ++
> >  arch/x86/kvm/mmu.h              | 51 +++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/mmu/mmu.c          | 19 ++++++++++--
> >  arch/x86/kvm/mmu/paging_tmpl.h  | 25 +++++++++-------
> >  4 files changed, 84 insertions(+), 13 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 208b29b0e637..d8b78d6abc10 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1235,7 +1235,9 @@ struct kvm_arch {
> >  	spinlock_t hv_root_tdp_lock;
> >  #endif
> >  
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> >  	gfn_t gfn_shared_mask;
> > +#endif
> >  };
> >  
> >  struct kvm_vm_stat {
> > diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> > index e9fbb2c8bbe2..3fb530359f81 100644
> > --- a/arch/x86/kvm/mmu.h
> > +++ b/arch/x86/kvm/mmu.h
> > @@ -365,4 +365,55 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
> >  		return gpa;
> >  	return translate_nested_gpa(vcpu, gpa, access, exception);
> >  }
> > +
> > +static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
> > +{
> > +#ifdef CONFIG_KVM_MMU_PRIVATE
> > +	return kvm->arch.gfn_shared_mask;
> > +#else
> > +	return 0;
> > +#endif
> > +}
> > +
> > +static inline gpa_t kvm_gpa_stolen_mask(struct kvm *kvm)
> > +{
> > +	return gfn_to_gpa(kvm_gfn_stolen_mask(kvm));
> > +}
> > +
> > +static inline gpa_t kvm_gpa_unalias(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_unalias(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_shared(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn | kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gfn_t kvm_gfn_private(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	return gfn & ~kvm_gfn_stolen_mask(kvm);
> > +}
> > +
> > +static inline gpa_t kvm_gpa_private(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return gpa & ~kvm_gpa_stolen_mask(kvm);
> > +}
> > +
> > +static inline bool kvm_is_private_gfn(struct kvm *kvm, gfn_t gfn)
> > +{
> > +	gfn_t mask = kvm_gfn_stolen_mask(kvm);
> > +
> > +	return mask && !(gfn & mask);
> > +}
> > +
> > +static inline bool kvm_is_private_gpa(struct kvm *kvm, gpa_t gpa)
> > +{
> > +	return kvm_is_private_gfn(kvm, gpa_to_gfn(gpa));
> > +}
> 
> The patch title and commit message say nothing about private/shared, but only
> mention stolen bits in general.  It's weird to introduce those *private* related
> helpers here.
> 
> I think you can just ditch the concept of stolen bit infrastructure, but just
> adopt what TDX needs.

Sure, this patch has changed heavily from the original patch by now.  One
suggestion is that private/shared is a characteristic of the KVM page fault,
not of the GPA/GFN; it's TDX specific.  Concretely (see the sketch after this
list):

- Add a helper function to check if KVM MMU is TD or VM. Right now
  kvm_gfn_stolen_mask() is used.  Probably kvm_mmu_has_private_bit().
  (any better name?)
- Let's keep address conversion functions: address => unalias/shared/private
- Add struct kvm_page_fault.is_private
  see how kvm_is_private_{gpa, gfn}() can be removed (or reduced).
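
A rough sketch of what I mean (names are tentative, not the actual patch):

	/* True if this VM repurposes a GPA bit as the shared/private attribute. */
	static inline bool kvm_mmu_has_private_bit(struct kvm *kvm)
	{
		return kvm_gfn_stolen_mask(kvm) != 0;
	}

plus a new field in struct kvm_page_fault:

	/* Derived once from the faulting GPA when the fault struct is built. */
	bool is_private;

so that most of the MMU code can test fault->is_private instead of calling
kvm_is_private_{gpa, gfn}() everywhere.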


> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8e24f73bf60b..b68191aa39bf 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -276,11 +276,24 @@ static inline bool kvm_available_flush_tlb_with_range(void)
> >  static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> >  		struct kvm_tlb_range *range)
> >  {
> > -	int ret = -ENOTSUPP;
> > +	int ret = -EOPNOTSUPP;
> 
> Change doesn't belong to this patch.

Will fix it.


> > +	u64 gfn_stolen_mask;
> >  
> > -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> > +	/*
> > +	 * Fall back to the big hammer flush if there is more than one
> > +	 * GPA alias that needs to be flushed.
> > +	 */
> > +	gfn_stolen_mask = kvm_gfn_stolen_mask(kvm);
> > +	if (hweight64(gfn_stolen_mask) > 1)
> > +		goto generic_flush;
> > +
> > +	if (range && kvm_available_flush_tlb_with_range()) {
> > +		/* Callback should flush both private GFN and shared GFN. */
> > +		range->start_gfn = kvm_gfn_unalias(kvm, range->start_gfn);
> 
> This seems wrong.  It seems the intention of this function is to flush TLB for
> all aliases for a given GFN range.  Here it seems you are unconditionally change
> to range to always exclude the stolen bits.

Ooh, right. This alias knowledge belongs to TDX.  This unaliasing should be
dropped here and put in tdx.c.  I'll fix it.


> >  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> > +	}
> 
> And you always fall through to do big hammer flush, which is obviously not
> intended.

Please notice "if (ret)".  If it succeeded, big hammer flush is skipped.


> > +generic_flush:
> >  	if (ret)
> >  		kvm_flush_remote_tlbs(kvm);
> >  }
> > @@ -4010,7 +4023,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >  	unsigned long mmu_seq;
> >  	int r;
> >  
> > -	fault->gfn = fault->addr >> PAGE_SHIFT;
> > +	fault->gfn = kvm_gfn_unalias(vcpu->kvm, gpa_to_gfn(fault->addr));
> >  	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
> >  
> >  	if (page_fault_handle_page_track(vcpu, fault))
> > diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> > index 5b5bdac97c7b..70aec31dee06 100644
> > --- a/arch/x86/kvm/mmu/paging_tmpl.h
> > +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> > @@ -25,7 +25,8 @@
> >  	#define guest_walker guest_walker64
> >  	#define FNAME(name) paging##64_##name
> >  	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
> > -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_ADDR_MASK(vcpu, lvl) (~kvm_gpa_stolen_mask(vcpu->kvm) & \
> > +					     PT64_LVL_ADDR_MASK(lvl))
> >  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > @@ -44,7 +45,7 @@
> >  	#define guest_walker guest_walker32
> >  	#define FNAME(name) paging##32_##name
> >  	#define PT_BASE_ADDR_MASK PT32_BASE_ADDR_MASK
> > -	#define PT_LVL_ADDR_MASK(lvl) PT32_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT32_LVL_ADDR_MASK(lvl)
> >  	#define PT_LVL_OFFSET_MASK(lvl) PT32_LVL_OFFSET_MASK(lvl)
> >  	#define PT_INDEX(addr, level) PT32_INDEX(addr, level)
> >  	#define PT_LEVEL_BITS PT32_LEVEL_BITS
> > @@ -58,7 +59,7 @@
> >  	#define guest_walker guest_walkerEPT
> >  	#define FNAME(name) ept_##name
> >  	#define PT_BASE_ADDR_MASK GUEST_PT64_BASE_ADDR_MASK
> > -	#define PT_LVL_ADDR_MASK(lvl) PT64_LVL_ADDR_MASK(lvl)
> > +	#define PT_LVL_ADDR_MASK(vcpu, lvl) PT64_LVL_ADDR_MASK(lvl)
> >  	#define PT_LVL_OFFSET_MASK(lvl) PT64_LVL_OFFSET_MASK(lvl)
> >  	#define PT_INDEX(addr, level) PT64_INDEX(addr, level)
> >  	#define PT_LEVEL_BITS PT64_LEVEL_BITS
> > @@ -75,7 +76,7 @@
> >  #define PT_GUEST_ACCESSED_MASK (1 << PT_GUEST_ACCESSED_SHIFT)
> >  
> >  #define gpte_to_gfn_lvl FNAME(gpte_to_gfn_lvl)
> > -#define gpte_to_gfn(pte) gpte_to_gfn_lvl((pte), PG_LEVEL_4K)
> > +#define gpte_to_gfn(vcpu, pte) gpte_to_gfn_lvl(vcpu, pte, PG_LEVEL_4K)
> >  
> >  /*
> >   * The guest_walker structure emulates the behavior of the hardware page
> > @@ -96,9 +97,9 @@ struct guest_walker {
> >  	struct x86_exception fault;
> >  };
> >  
> > -static gfn_t gpte_to_gfn_lvl(pt_element_t gpte, int lvl)
> > +static gfn_t gpte_to_gfn_lvl(struct kvm_vcpu *vcpu, pt_element_t gpte, int lvl)
> >  {
> > -	return (gpte & PT_LVL_ADDR_MASK(lvl)) >> PAGE_SHIFT;
> > +	return (gpte & PT_LVL_ADDR_MASK(vcpu, lvl)) >> PAGE_SHIFT;
> >  }
> >  
> >  static inline void FNAME(protect_clean_gpte)(struct kvm_mmu *mmu, unsigned *access,
> > @@ -395,7 +396,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >  		--walker->level;
> >  
> >  		index = PT_INDEX(addr, walker->level);
> > -		table_gfn = gpte_to_gfn(pte);
> > +		table_gfn = gpte_to_gfn(vcpu, pte);
> >  		offset    = index * sizeof(pt_element_t);
> >  		pte_gpa   = gfn_to_gpa(table_gfn) + offset;
> >  
> > @@ -460,7 +461,7 @@ static int FNAME(walk_addr_generic)(struct guest_walker *walker,
> >  	if (unlikely(errcode))
> >  		goto error;
> >  
> > -	gfn = gpte_to_gfn_lvl(pte, walker->level);
> > +	gfn = gpte_to_gfn_lvl(vcpu, pte, walker->level);
> >  	gfn += (addr & PT_LVL_OFFSET_MASK(walker->level)) >> PAGE_SHIFT;
> >  
> >  	if (PTTYPE == 32 && walker->level > PG_LEVEL_4K && is_cpuid_PSE36())
> > @@ -555,12 +556,14 @@ FNAME(prefetch_gpte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >  	gfn_t gfn;
> >  	kvm_pfn_t pfn;
> >  
> > +	WARN_ON(gpte & kvm_gpa_stolen_mask(vcpu->kvm));
> > +
> >  	if (FNAME(prefetch_invalid_gpte)(vcpu, sp, spte, gpte))
> >  		return false;
> >  
> >  	pgprintk("%s: gpte %llx spte %p\n", __func__, (u64)gpte, spte);
> >  
> > -	gfn = gpte_to_gfn(gpte);
> > +	gfn = gpte_to_gfn(vcpu, gpte);
> >  	pte_access = sp->role.access & FNAME(gpte_access)(gpte);
> >  	FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> >  
> > @@ -656,6 +659,8 @@ static int FNAME(fetch)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault,
> >  	WARN_ON_ONCE(gw->gfn != base_gfn);
> >  	direct_access = gw->pte_access;
> >  
> > +	WARN_ON(fault->addr & kvm_gpa_stolen_mask(vcpu->kvm));
> > +
> >  	top_level = vcpu->arch.mmu->root_level;
> >  	if (top_level == PT32E_ROOT_LEVEL)
> >  		top_level = PT32_ROOT_LEVEL;
> > @@ -1080,7 +1085,7 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
> >  			continue;
> >  		}
> >  
> > -		gfn = gpte_to_gfn(gpte);
> > +		gfn = gpte_to_gfn(vcpu, gpte);
> >  		pte_access = sp->role.access;
> >  		pte_access &= FNAME(gpte_access)(gpte);
> >  		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> 
> In commit message you mentioned "Don't support stolen bits for shadow EPT" (you
> actually mean shadow MMU I suppose), yet there's bunch of code change to shadow
> MMU.


Those are not needed. I'll drop them.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
       [not found]             ` <20220401032741.GA2806@gao-cwp>
@ 2022-04-01  5:07               ` Chao Gao
  0 siblings, 0 replies; 310+ messages in thread
From: Chao Gao @ 2022-04-01  5:07 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel,
	Jim Mattson, erdemaktas, Connor Kuehl, Chao Gao

The original reply was sent to Sean only by mistake. Add others back.

On Fri, Apr 01, 2022 at 11:27:42AM +0800, Chao Gao wrote:
>On Thu, Mar 31, 2022 at 07:34:12PM +0000, Sean Christopherson wrote:
>>+Chao Gao
>>
>>On Thu, Mar 31, 2022, Isaku Yamahata wrote:
>>> On Thu, Mar 31, 2022 at 12:03:15AM +0000, Sean Christopherson <seanjc@google.com> wrote:
>>> > On Mon, Mar 14, 2022, Isaku Yamahata wrote:
>>> > > - VMXON on all pCPUs: The TDX module initialization requires to enable VMX
>>> > > (VMXON) on all present pCPUs.  vmx_hardware_enable() which is called on creating
>>> > > guest does it.  It naturally fits with the TDX module initialization at creating
>>> > > first TD.  I wanted to avoid code to enable VMXON on loading the kvm_intel.ko.
>>> > 
>>> > That's a solvable problem, though making it work without exporting hardware_enable_all()
>>> > could get messy.
>>> 
>>> Could you please explain any reason why it's bad idea to export it?
>>
>>I'd really prefer to keep the hardware enable/disable logic internal to kvm_main.c
>>so that all architectures share a common flow, and so that kvm_main.c is the sole
>>owner.  I'm worried that exposing the helper will lead to other arch/vendor usage,
>>and that will end up with what is effectively duplicate flows.  Deduplicating arch
>>code into generic KVM is usually very difficult.
>>
>>This might also be a good opportunity to make KVM slightly more robust.  Ooh, and
>>we can kill two birds with one stone.  There's an in-flight series to add compatibility
>>checks to hotplug[*].  But rather than special case hotplug, what if we instead do
>>hardware enable/disable during module load, and move the compatibility check into
>>the hardware_enable path?  That fixes the hotplug issue, gives TDX a window for running
>>post-VMXON code in kvm_init(), and makes the broadcast IPI less wasteful on architectures
>>that don't have compatiblity checks.
>
>Sounds good. But more time is wasted on compat checks on architectures
>that have them, because they are done each time hardware is enabled.
>A solution for this is to cache the result of kvm_arch_check_processor_compat().
>
>>
>>I'm thinking something like this, maybe as a modificatyion to patch 6 in Chao's
>>series, or more likely as a patch 7 so that the hotplug compat checks still get
>>in even
>
>>if the early hardware enable doesn't work on all architectures for some
>>reason.
>
>By "early", do you mean hardware enable during module loading or during CPU hotplug?
>
>And if the below change is put into my series, kvm_arch_post_hardware_enable_setup()
>will be an empty function for all architectures until the TDX series gets merged.
>So, I prefer to drop kvm_arch_post_hardware_enable_setup() and let the TDX series
>introduce it.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-03-04 19:48 ` [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
@ 2022-04-01  5:13   ` Kai Huang
  2022-04-01  7:13     ` Kai Huang
  2022-04-05 14:13     ` Paolo Bonzini
  2022-04-05 14:10   ` Paolo Bonzini
  1 sibling, 2 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-01  5:13 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> TDX will run with EPT violation #VEs enabled for shared EPT, which means
> KVM needs to set the "suppress #VE" bit in unused PTEs to avoid
> unintentionally reflecting not-present EPT violations into the guest.

This sentence is hard to interpret.  Please add more sentences to elaborate on
"TDX will run with EPT violation #VEs enabled for shared EPT".  Also, this
patch is the first time "shared EPT" is introduced, so perhaps you should also
explain it here.  Or you could even move patch 43 ("KVM: TDX: Add load_mmu_pgd
method for TDX") before this one.

"reflecting non-present EPT violations into the guest" could be hard to
interpret.  Perhaps you can be more explicit and say the VMM wants to get an
EPT violation for normal (shared) memory accesses rather than cause a #VE in
the guest.

Mentioning that you want an EPT violation instead of a #VE for normal (shared)
memory accesses also completes your statement below about wanting a #VE for
MMIO, so that people have a clear picture of when a #VE is delivered and when
it is not.

> 
> Because guest memory is protected with TDX, VMM can't parse instructions
> in the guest memory.  Instead, MMIO hypercall is used to pass necessary
> information to VMM.
> 
> To make unmodified device driver work, guest TD expects #VE on accessing
> shared GPA.  The #VE handler converts MMIO access into MMIO hypercall with
> the EPT entry of enabled "#VE" by clearing "suppress #VE" bit.  Before VMM
> enabling #VE, it needs to figure out the given GPA is for MMIO by EPT
> violation.  So the execution flow looks like
> 
> - allocate unused shared EPT entry with suppress #VE bit set.

allocate -> Allocate

> - EPT violation on that GPA.
> - VMM figures out the faulted GPA is for MMIO.
> - VMM clears the suppress #VE bit.
> - Guest TD gets #VE, and converts MMIO access into MMIO hypercall.

Here you have described both normal memory accesses and MMIO, so it's a good
time to summarize the purpose of this patch: for both cases you want the PTE
to have the "suppress #VE" bit set initially when it is allocated, therefore
allow a non-zero init value for the PTE.

> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu.h      |  1 +
>  arch/x86/kvm/mmu/mmu.c  | 50 +++++++++++++++++++++++++++++++++++------
>  arch/x86/kvm/mmu/spte.c | 10 +++++++++
>  arch/x86/kvm/mmu/spte.h |  2 ++
>  4 files changed, 56 insertions(+), 7 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 3fb530359f81..0ae91b8b25df 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -66,6 +66,7 @@ static __always_inline u64 rsvd_bits(int s, int e)
>  
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> +void kvm_mmu_set_spte_init_value(u64 init_value);
>  
>  void kvm_init_mmu(struct kvm_vcpu *vcpu);
>  void kvm_init_shadow_npt_mmu(struct kvm_vcpu *vcpu, unsigned long cr0,
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 9907cb759fd1..a474f2e76d78 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -617,9 +617,9 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>  	int level = sptep_to_sp(sptep)->role.level;
>  
>  	if (!spte_has_volatile_bits(old_spte))
> -		__update_clear_spte_fast(sptep, 0ull);
> +		__update_clear_spte_fast(sptep, shadow_init_value);
>  	else
> -		old_spte = __update_clear_spte_slow(sptep, 0ull);
> +		old_spte = __update_clear_spte_slow(sptep, shadow_init_value);

I guess it's better to have a comment here.  Allowing a non-zero init value
for shadow PTEs doesn't necessarily mean that initial value should be used
when a PTE is zapped.  I think mmu_spte_clear_track_bits() is only called for
mappings of normal (shared) memory, not MMIO?  Then perhaps it's better to
have a comment explaining that we want "suppress #VE" set to get a real EPT
violation for normal memory accesses from the guest.
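
Something like this, for instance (only a sketch of the wording, feel free to
rephrase):

	/*
	 * Zap to shadow_init_value rather than 0 so that, when "suppress #VE"
	 * is in use (TDX shared EPT), a guest access to the non-present SPTE
	 * results in a real EPT violation to the VMM instead of a #VE
	 * injected into the guest.
	 */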

>  
>  	if (!is_shadow_present_pte(old_spte))
>  		return old_spte;
> @@ -651,7 +651,7 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>   */
>  static void mmu_spte_clear_no_track(u64 *sptep)
>  {
> -	__update_clear_spte_fast(sptep, 0ull);
> +	__update_clear_spte_fast(sptep, shadow_init_value);
>  }

Similar here.  It seems mmu_spte_clear_no_track() is used to zap non-leaf
PTEs, which don't require state tracking, so theoretically they could be set
to 0.  But this also seems to be called to zap MMIO PTEs, so it looks like it
needs to be set to shadow_init_value.  Anyway, it looks like it deserves a
comment?

Btw, the above two changes to mmu_spte_clear_track_bits() and
mmu_spte_clear_no_track() seem a little bit out of scope for what this patch
claims to do.  Allowing a non-zero init value for shadow PTEs doesn't
necessarily mean the initial value should be used when a PTE is zapped.  Maybe
we can further improve the patch title and commit message a little bit.  Such
as: allow a non-zero value for empty (or invalid?) PTEs?  "Non-present"
doesn't seem to fit here.

>  
>  static u64 mmu_spte_get_lockless(u64 *sptep)
> @@ -737,6 +737,42 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +#ifdef CONFIG_X86_64
> +	int ign;
> +
> +	asm volatile (
> +		"rep stosq\n\t"
> +		: "=c"(ign), "=D"(page)
> +		: "a"(shadow_init_value), "c"(4096/8), "D"(page)
> +		: "memory"
> +	);
> +#else
> +	BUG();
> +#endif
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> +	int start, end, i, r;
> +
> +	if (shadow_init_value)
> +		start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +
> +	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> +	if (r)
> +		return r;
> +
> +	if (shadow_init_value) {
> +		end = kvm_mmu_memory_cache_nr_free_objects(mc);
> +		for (i = start; i < end; i++)
> +			kvm_init_shadow_page(mc->objects[i]);
> +	}
> +	return 0;
> +}
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -746,8 +782,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>  	if (r)
>  		return r;
> -	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> -				       PT64_ROOT_MAX_LEVEL);
> +	r = mmu_topup_shadow_page_cache(vcpu);
>  	if (r)
>  		return r;
>  	if (maybe_indirect) {
> @@ -3146,7 +3181,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_mmu_page *sp;
>  	int ret = RET_PF_INVALID;
> -	u64 spte = 0ull;
> +	u64 spte = shadow_init_value;

I don't quite understand this change.  'spte' is set to the last level PTE of
the given GFN if a mapping is found.  Otherwise fast_page_fault() returns
RET_PF_INVALID.  In both cases, the initial value doesn't matter.

Am I wrong?

>  	u64 *sptep = NULL;
>  	uint retry_count = 0;
>  
> @@ -5598,7 +5633,8 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>  	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>  
> -	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +	if (!shadow_init_value)
> +		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>  
>  	vcpu->arch.mmu = &vcpu->arch.root_mmu;
>  	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 73cfe62fdad1..5071e8332db2 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -35,6 +35,7 @@ u64 __read_mostly shadow_mmio_access_mask;
>  u64 __read_mostly shadow_present_mask;
>  u64 __read_mostly shadow_me_mask;
>  u64 __read_mostly shadow_acc_track_mask;
> +u64 __read_mostly shadow_init_value;
>  
>  u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>  u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> @@ -223,6 +224,14 @@ u64 kvm_mmu_changed_pte_notifier_make_spte(u64 old_spte, kvm_pfn_t new_pfn)
>  	return new_spte;
>  }
>  
> +void kvm_mmu_set_spte_init_value(u64 init_value)
> +{
> +	if (WARN_ON(!IS_ENABLED(CONFIG_X86_64) && init_value))
> +		init_value = 0;
> +	shadow_init_value = init_value;
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_spte_init_value);
> +
>  static u8 kvm_get_shadow_phys_bits(void)
>  {
>  	/*
> @@ -367,6 +376,7 @@ void kvm_mmu_reset_all_pte_masks(void)
>  	shadow_present_mask	= PT_PRESENT_MASK;
>  	shadow_acc_track_mask	= 0;
>  	shadow_me_mask		= sme_me_mask;
> +	shadow_init_value	= 0;
>  
>  	shadow_host_writable_mask = DEFAULT_SPTE_HOST_WRITEABLE;
>  	shadow_mmu_writable_mask  = DEFAULT_SPTE_MMU_WRITEABLE;
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index be6a007a4af3..8e13a35ab8c9 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -171,6 +171,8 @@ extern u64 __read_mostly shadow_mmio_access_mask;
>  extern u64 __read_mostly shadow_present_mask;
>  extern u64 __read_mostly shadow_me_mask;
>  
> +extern u64 __read_mostly shadow_init_value;
> +
>  /*
>   * SPTEs in MMUs without A/D bits are marked with SPTE_TDP_AD_DISABLED_MASK;
>   * shadow_acc_track_mask is the set of bits to be cleared in non-accessed


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-03-04 19:48 ` [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
@ 2022-04-01  5:15   ` Kai Huang
  2022-04-01 14:08     ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-01  5:15 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> In the existing x86 KVM MMU code, there is already max_level member in
> struct kvm_page_fault with KVM_MAX_HUGEPAGE_LEVEL initial value.  The KVM
> page fault handler denies page size larger than max_level.
> 
> Add per-VM member to indicate the allowed maximum page size with
> KVM_MAX_HUGEPAGE_LEVEL as default value and initialize max_level in struct
> kvm_page_fault with it.
> 
> For the guest TD, the set per-VM value for allows maximum page size to 4K
> page size.  Then only allowed page size is 4K.  It means large page is
> disabled.

Not supporting large pages for TDs is the reason you want this change, not
the result.  Please refine a little bit.

> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h | 1 +
>  arch/x86/kvm/mmu.h              | 2 +-
>  arch/x86/kvm/mmu/mmu.c          | 2 ++
>  3 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d8b78d6abc10..d33d79f2af2d 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1053,6 +1053,7 @@ struct kvm_arch {
>  	unsigned long n_requested_mmu_pages;
>  	unsigned long n_max_mmu_pages;
>  	unsigned int indirect_shadow_pages;
> +	int tdp_max_page_level;
>  	u8 mmu_valid_gen;
>  	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
>  	struct list_head active_mmu_pages;
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 0ae91b8b25df..650989c37f2e 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -192,7 +192,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>  		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
>  		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(),
>  
> -		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
> +		.max_level = vcpu->kvm->arch.tdp_max_page_level,
>  		.req_level = PG_LEVEL_4K,
>  		.goal_level = PG_LEVEL_4K,
>  	};
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index a474f2e76d78..e9212394a530 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -5782,6 +5782,8 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>  	node->track_write = kvm_mmu_pte_write;
>  	node->track_flush_slot = kvm_mmu_invalidate_zap_pages_in_memslot;
>  	kvm_page_track_register_notifier(kvm, node);
> +
> +	kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
>  }
>  
>  void kvm_mmu_uninit_vm(struct kvm *kvm)


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-03-31 19:41     ` Isaku Yamahata
@ 2022-04-01  6:56       ` Xiaoyao Li
  2022-04-01 20:18         ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-01  6:56 UTC (permalink / raw)
  To: Isaku Yamahata, Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On 4/1/2022 3:41 AM, Isaku Yamahata wrote:
> On Thu, Mar 31, 2022 at 04:31:10PM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
>> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
>>> Add a wrapper function to initialize the TDX module and get system-wide
>>> parameters via those APIs.  Because TDX requires VMX enabled, It will be
>>> called on-demand when the first guest TD is created via x86 KVM init_vm
>>> callback.
>>
>> Why not just merge this patch with the change where you implement the init_vm
>> callback?  Then you can just declare this patch as "detect and initialize TDX
>> module when first VM is created", or something like that..
> 
> Ok. Anyway, in the next respin, tdx module initialization will be done when
> loading kvm_intel.ko.  So the whole part will be changed and will be a part
> of module loading.

Will we change the GET_TDX_CAPABILITIES ioctl back to KVM scope?

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-04-01  5:13   ` Kai Huang
@ 2022-04-01  7:13     ` Kai Huang
  2022-04-05 14:14       ` Paolo Bonzini
  2022-04-05 14:13     ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-01  7:13 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-04-01 at 18:13 +1300, Kai Huang wrote:
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -617,9 +617,9 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm,
> > u64 *sptep)
> >   	int level = sptep_to_sp(sptep)->role.level;
> >   
> >   	if (!spte_has_volatile_bits(old_spte))
> > -		__update_clear_spte_fast(sptep, 0ull);
> > +		__update_clear_spte_fast(sptep, shadow_init_value);
> >   	else
> > -		old_spte = __update_clear_spte_slow(sptep, 0ull);
> > +		old_spte = __update_clear_spte_slow(sptep,
> > shadow_init_value);
> 
> I guess it's better to have some comment here.  Allow non-zero init value for
> shadow PTE doesn't necessarily mean the initial value should be used when one
> PTE is zapped.  I think mmu_spte_clear_track_bits() is only called for mapping
> of normal (shared) memory but not MMIO? Then perhaps it's better to have a
> comment to explain we want "suppress #VE" set to get a real EPT violation for
> normal memory access from guest?

Btw, I think the relevant part of the TDP MMU change should be included in
this patch too, otherwise the TDP MMU is broken with this patch.

Actually, in this series the legacy MMU is not supported to work with TDX, so
the above change to the legacy MMU doesn't actually matter.  Instead, the TDP
MMU change should be here (see the sketch below).
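
I.e. something along these lines in tdp_mmu.c (illustrative only; I haven't
audited every zapping path, so the exact call sites may differ):

-	tdp_mmu_set_spte(kvm, &iter, 0);
+	tdp_mmu_set_spte(kvm, &iter, shadow_init_value);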

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-04-01  5:15   ` Kai Huang
@ 2022-04-01 14:08     ` Sean Christopherson
  2022-04-01 20:28       ` Isaku Yamahata
  2022-04-01 22:27       ` Kai Huang
  0 siblings, 2 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-01 14:08 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl

On Fri, Apr 01, 2022, Kai Huang wrote:
> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > In the existing x86 KVM MMU code, there is already max_level member in
> > struct kvm_page_fault with KVM_MAX_HUGEPAGE_LEVEL initial value.  The KVM
> > page fault handler denies page size larger than max_level.
> > 
> > Add per-VM member to indicate the allowed maximum page size with
> > KVM_MAX_HUGEPAGE_LEVEL as default value and initialize max_level in struct
> > kvm_page_fault with it.
> > 
> > For the guest TD, the set per-VM value for allows maximum page size to 4K
> > page size.  Then only allowed page size is 4K.  It means large page is
> > disabled.
> 
> > Not supporting large pages for TDs is the reason you want this change, not
> > the result.  Please refine a little bit.

Not supporting huge pages was fine for the PoC, but I'd prefer not to merge TDX
without support for huge pages.  Has any work been put into enabling huge pages?
If so, what's the technical blocker?  If not...

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-04-01  6:56       ` Xiaoyao Li
@ 2022-04-01 20:18         ` Isaku Yamahata
  2022-04-02  2:40           ` Xiaoyao Li
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-01 20:18 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Isaku Yamahata, Kai Huang, isaku.yamahata, kvm, linux-kernel,
	Paolo Bonzini, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Fri, Apr 01, 2022 at 02:56:40PM +0800,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> On 4/1/2022 3:41 AM, Isaku Yamahata wrote:
> > On Thu, Mar 31, 2022 at 04:31:10PM +1300,
> > Kai Huang <kai.huang@intel.com> wrote:
> > 
> > > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > > > Add a wrapper function to initialize the TDX module and get system-wide
> > > > parameters via those APIs.  Because TDX requires VMX enabled, It will be
> > > > called on-demand when the first guest TD is created via x86 KVM init_vm
> > > > callback.
> > > 
> > > Why not just merge this patch with the change where you implement the init_vm
> > > callback?  Then you can just declare this patch as "detect and initialize TDX
> > > module when first VM is created", or something like that..
> > 
> > Ok. Anyway, in the next respin, tdx module initialization will be done when
> > loading kvm_intel.ko.  So the whole part will be changed and will be a part
> > of module loading.
> 
> Will we change the GET_TDX_CAPABILITIES ioctl back to KVM scope?

No, because a system-scoped KVM_TDX_CAPABILITIES requires one more callback
for it.  We can reduce the change.

Or do you have any use case for a system-scoped KVM_TDX_CAPABILITIES?
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-04-01 14:08     ` Sean Christopherson
@ 2022-04-01 20:28       ` Isaku Yamahata
  2022-04-01 20:53         ` Sean Christopherson
  2022-04-01 22:27       ` Kai Huang
  1 sibling, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-01 20:28 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Jim Mattson, erdemaktas, Connor Kuehl

On Fri, Apr 01, 2022 at 02:08:38PM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, Apr 01, 2022, Kai Huang wrote:
> > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > 
> > > In the existing x86 KVM MMU code, there is already max_level member in
> > > struct kvm_page_fault with KVM_MAX_HUGEPAGE_LEVEL initial value.  The KVM
> > > page fault handler denies page size larger than max_level.
> > > 
> > > Add per-VM member to indicate the allowed maximum page size with
> > > KVM_MAX_HUGEPAGE_LEVEL as default value and initialize max_level in struct
> > > kvm_page_fault with it.
> > > 
> > > For the guest TD, the set per-VM value for allows maximum page size to 4K
> > > page size.  Then only allowed page size is 4K.  It means large page is
> > > disabled.
> > 
> > Not supporting large pages for TDs is the reason you want this change, not
> > the result.  Please refine a little bit.
> 
> Not supporting huge pages was fine for the PoC, but I'd prefer not to merge TDX
> without support for huge pages.  Has any work been put into enabling huge pages?
> If so, what's the technical blocker?  If not...

I wanted to get feedback on the approach (always set the SPTE to REMOVED_SPTE,
do the callback, then set the SPTE to the final value instead of relying on an
atomic SPTE update) before going further with large pages.
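
For context, the flow being prototyped is roughly the following (pseudocode;
the helper names below are made up purely for illustration):

	/* 1. Park concurrent walkers on the special REMOVED_SPTE value. */
	freeze_spte(sptep, REMOVED_SPTE);
	/* 2. Let the TDX backend update the private S-EPT via SEAMCALL. */
	set_private_spte_callback(kvm, gfn, level, pfn);
	/* 3. Install the final SPTE value directly, no cmpxchg needed. */
	install_spte(sptep, new_spte);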
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-04-01 20:28       ` Isaku Yamahata
@ 2022-04-01 20:53         ` Sean Christopherson
  0 siblings, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-01 20:53 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl

On Fri, Apr 01, 2022, Isaku Yamahata wrote:
> On Fri, Apr 01, 2022 at 02:08:38PM +0000,
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Fri, Apr 01, 2022, Kai Huang wrote:
> > > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > 
> > > > In the existing x86 KVM MMU code, there is already max_level member in
> > > > struct kvm_page_fault with KVM_MAX_HUGEPAGE_LEVEL initial value.  The KVM
> > > > page fault handler denies page size larger than max_level.
> > > > 
> > > > Add per-VM member to indicate the allowed maximum page size with
> > > > KVM_MAX_HUGEPAGE_LEVEL as default value and initialize max_level in struct
> > > > kvm_page_fault with it.
> > > > 
> > > > For the guest TD, the set per-VM value for allows maximum page size to 4K
> > > > page size.  Then only allowed page size is 4K.  It means large page is
> > > > disabled.
> > > 
> > > Not supporting large pages for TDs is the reason you want this change, not
> > > the result.  Please refine a little bit.
> > 
> > Not supporting huge pages was fine for the PoC, but I'd prefer not to merge TDX
> > without support for huge pages.  Has any work been put into enabling huge pages?
> > If so, what's the technical blocker?  If not...
> 
> I wanted to get feedback on the approach (always set the SPTE to REMOVED_SPTE,
> do the callback, then set the SPTE to the final value instead of relying on an
> atomic SPTE update) before going further with large pages.

Pretty please with a cherry on top, send an email calling out which areas and
patches you'd like "immediate" feedback on.  Putting that information in the cover
letter would have been extremely helpful.  I realize it's hard to balance providing
context for folks who don't know TDX with "instructions" for reviewers, but one of
the most helpful things you can do for reviewers is to make it explicitly clear
what _your_ expectations and wants are, _why_ you posted the series.   Usually that
information is implied, i.e. you want your patches merged, but that's obviously not
the case here.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-04-01 14:08     ` Sean Christopherson
  2022-04-01 20:28       ` Isaku Yamahata
@ 2022-04-01 22:27       ` Kai Huang
  2022-04-02  0:08         ` Sean Christopherson
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-01 22:27 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl

On Fri, 2022-04-01 at 14:08 +0000, Sean Christopherson wrote:
> On Fri, Apr 01, 2022, Kai Huang wrote:
> > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > 
> > > In the existing x86 KVM MMU code, there is already a max_level member in
> > > struct kvm_page_fault, initialized to KVM_MAX_HUGEPAGE_LEVEL.  The KVM
> > > page fault handler denies page sizes larger than max_level.
> > > 
> > > Add a per-VM member to indicate the allowed maximum page size, with
> > > KVM_MAX_HUGEPAGE_LEVEL as the default value, and use it to initialize
> > > max_level in struct kvm_page_fault.
> > > 
> > > For a guest TD, set the per-VM value so that the maximum allowed page size
> > > is 4K.  Then the only allowed page size is 4K, i.e. large pages are
> > > disabled.
> > 
> > Not supporting large pages for TDs is the reason you want this change, not
> > the result.  Please refine it a little bit.
> 
> Not supporting huge pages was fine for the PoC, but I'd prefer not to merge TDX
> without support for huge pages.  Has any work been put into enabling huge pages?
> If so, what's the technical blocker?  If not...

Hi Sean,

Is there any reason large page support must be included in the initial merge of
TDX?  Large page support is more of a performance improvement, I think.  Given
this series is already very big, perhaps we can do it later.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-04-01 22:27       ` Kai Huang
@ 2022-04-02  0:08         ` Sean Christopherson
  2022-04-04  0:41           ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-02  0:08 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl

On Sat, Apr 02, 2022, Kai Huang wrote:
> On Fri, 2022-04-01 at 14:08 +0000, Sean Christopherson wrote:
> > On Fri, Apr 01, 2022, Kai Huang wrote:
> > > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > 
> > > > In the existing x86 KVM MMU code, there is already a max_level member in
> > > > struct kvm_page_fault, initialized to KVM_MAX_HUGEPAGE_LEVEL.  The KVM
> > > > page fault handler denies page sizes larger than max_level.
> > > > 
> > > > Add a per-VM member to indicate the allowed maximum page size, with
> > > > KVM_MAX_HUGEPAGE_LEVEL as the default value, and use it to initialize
> > > > max_level in struct kvm_page_fault.
> > > > 
> > > > For a guest TD, set the per-VM value so that the maximum allowed page size
> > > > is 4K.  Then the only allowed page size is 4K, i.e. large pages are
> > > > disabled.
> > > 
> > > Not supporting large pages for TDs is the reason you want this change, not
> > > the result.  Please refine it a little bit.
> > 
> > Not supporting huge pages was fine for the PoC, but I'd prefer not to merge TDX
> > without support for huge pages.  Has any work been put into enabling huge pages?
> > If so, what's the technical blocker?  If not...
> 
> Hi Sean,
> 
> Is there any reason large page support must be included in the initial merge of
> TDX?  Large page support is more of a performance improvement, I think.  Given
> this series is already very big, perhaps we can do it later.

I'm ok punting 1gb for now, but I want to have a high level of confidence that 2mb
pages will work without requiring significant churn in KVM on top of the initial
TDX support.  I suspect gaining that level of confidence will mean getting 95%+ of
the way to a fully working code base.  IIRC, 2mb wasn't expected to be terrible, it
was 1gb support where things started to get messy.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize TDX module
  2022-04-01 20:18         ` Isaku Yamahata
@ 2022-04-02  2:40           ` Xiaoyao Li
  0 siblings, 0 replies; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-02  2:40 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On 4/2/2022 4:18 AM, Isaku Yamahata wrote:
> On Fri, Apr 01, 2022 at 02:56:40PM +0800,
> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> 
>> On 4/1/2022 3:41 AM, Isaku Yamahata wrote:
>>> On Thu, Mar 31, 2022 at 04:31:10PM +1300,
>>> Kai Huang <kai.huang@intel.com> wrote:
>>>
>>>> On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
>>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>
>>>>> Add a wrapper function to initialize the TDX module and get system-wide
>>>>> parameters via those APIs.  Because TDX requires VMX to be enabled, it will
>>>>> be called on-demand when the first guest TD is created, via the x86 KVM
>>>>> init_vm callback.
>>>>
>>>> Why not just merge this patch with the change where you implement the init_vm
>>>> callback?  Then you can just declare this patch as "detect and initialize TDX
>>>> module when first VM is created", or something like that..
>>>
>>> Ok. Anyway, in the next respin, tdx module initialization will be done when
>>> loading kvm_intel.ko.  So that whole part will be changed and will become
>>> part of module loading.
>>
>> Will we change the GET_TDX_CAPABILITIES ioctl back to KVM scope?
> 
> No, because a system-scoped KVM_TDX_CAPABILITIES would require one more callback
> for it.  This way we can reduce the change.
> 
> Or do you have any use case for system scoped KVM_TDX_CAPABILITIES?

No. Just to confirm.

On the other hand, a VM-scoped ioctl seems more flexible if different
capabilities are reported per VM in the future.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2022-04-02  0:08         ` Sean Christopherson
@ 2022-04-04  0:41           ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-04  0:41 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl

On Sat, 2022-04-02 at 00:08 +0000, Sean Christopherson wrote:
> On Sat, Apr 02, 2022, Kai Huang wrote:
> > On Fri, 2022-04-01 at 14:08 +0000, Sean Christopherson wrote:
> > > On Fri, Apr 01, 2022, Kai Huang wrote:
> > > > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > > > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > > > > 
> > > > > In the existing x86 KVM MMU code, there is already a max_level member in
> > > > > struct kvm_page_fault, initialized to KVM_MAX_HUGEPAGE_LEVEL.  The KVM
> > > > > page fault handler denies page sizes larger than max_level.
> > > > > 
> > > > > Add a per-VM member to indicate the allowed maximum page size, with
> > > > > KVM_MAX_HUGEPAGE_LEVEL as the default value, and use it to initialize
> > > > > max_level in struct kvm_page_fault.
> > > > > 
> > > > > For a guest TD, set the per-VM value so that the maximum allowed page size
> > > > > is 4K.  Then the only allowed page size is 4K, i.e. large pages are
> > > > > disabled.
> > > > 
> > > > Not supporting large pages for TDs is the reason you want this change, not
> > > > the result.  Please refine it a little bit.
> > > 
> > > Not supporting huge pages was fine for the PoC, but I'd prefer not to merge TDX
> > > without support for huge pages.  Has any work been put into enabling huge pages?
> > > If so, what's the technical blocker?  If not...
> > 
> > Hi Sean,
> > 
> > Is there any reason large page support must be included in the initial merge of
> > TDX?  Large page support is more of a performance improvement, I think.  Given
> > this series is already very big, perhaps we can do it later.
> 
> I'm ok punting 1gb for now, but I want to have a high level of confidence that 2mb
> pages will work without requiring significant churn in KVM on top of the initial
> TDX support.  I suspect gaining that level of confidence will mean getting 95%+ of
> the way to a fully working code base.  IIRC, 2mb wasn't expected to be terrible, it
> was 1gb support where things started to get messy.

OK no argument here :)

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex
  2022-03-04 19:48 ` [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex isaku.yamahata
@ 2022-04-05 12:39   ` Paolo Bonzini
  2022-04-08  0:44     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:39 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Several TDX SEAMCALLs are per-package scoped (concretely, per memory
> controller) and need to be serialized per package.  Allocate a mutex per
> package for this.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    |  8 +++++++-
>   arch/x86/kvm/vmx/tdx.c     | 18 ++++++++++++++++++
>   arch/x86/kvm/vmx/x86_ops.h |  2 ++
>   3 files changed, 27 insertions(+), 1 deletion(-)

Please define here the lock/unlock functions as well:

static inline int tdx_mng_key_lock(void)
{
	int cpu = get_cpu();
	int cur_pkg = topology_physical_package_id(cpu);

	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
	return cur_pkg;
}

static inline void tdx_mng_key_unlock(int cur_pkg)
{
	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
	put_cpu();
}
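
For example, tdx_do_tdh_mng_key_config() from patch 024 could then become
(just a sketch using the helpers above, not a final implementation):

static int tdx_do_tdh_mng_key_config(void *param)
{
	hpa_t *tdr_p = param;
	int cur_pkg;
	u64 err;

	/* Pin to a CPU and take that CPU's per-package mutex. */
	cur_pkg = tdx_mng_key_lock();
	do {
		err = tdh_mng_key_config(*tdr_p);
	} while (err == TDX_KEY_GENERATION_FAILED);
	tdx_mng_key_unlock(cur_pkg);

	if (WARN_ON_ONCE(err)) {
		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
		return -EIO;
	}

	return 0;
}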

Thanks,

Paolo


> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 8103d1c32cc9..6111c6485d8e 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -25,6 +25,12 @@ static __init int vt_hardware_setup(void)
>   	return 0;
>   }
>   
> +static void vt_hardware_unsetup(void)
> +{
> +	tdx_hardware_unsetup();
> +	vmx_hardware_unsetup();
> +}
> +
>   static int vt_vm_init(struct kvm *kvm)
>   {
>   	int ret;
> @@ -42,7 +48,7 @@ static int vt_vm_init(struct kvm *kvm)
>   struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.name = "kvm_intel",
>   
> -	.hardware_unsetup = vmx_hardware_unsetup,
> +	.hardware_unsetup = vt_hardware_unsetup,
>   
>   	.hardware_enable = vmx_hardware_enable,
>   	.hardware_disable = vmx_hardware_disable,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e8d293a3c11c..1c8222f54764 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -34,6 +34,8 @@ struct tdx_capabilities {
>   /* Capabilities of KVM + the TDX module. */
>   struct tdx_capabilities tdx_caps;
>   
> +static struct mutex *tdx_mng_key_config_lock;
> +
>   static u64 hkid_mask __ro_after_init;
>   static u8 hkid_start_pos __ro_after_init;
>   
> @@ -112,7 +114,9 @@ bool tdx_is_vm_type_supported(unsigned long type)
>   
>   static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>   {
> +	int max_pkgs;
>   	u32 max_pa;
> +	int i;
>   
>   	if (!enable_ept) {
>   		pr_warn("Cannot enable TDX with EPT disabled\n");
> @@ -127,6 +131,14 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>   	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
>   		return -EIO;
>   
> +	max_pkgs = topology_max_packages();
> +	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
> +				   GFP_KERNEL);
> +	if (!tdx_mng_key_config_lock)
> +		return -ENOMEM;
> +	for (i = 0; i < max_pkgs; i++)
> +		mutex_init(&tdx_mng_key_config_lock[i]);
> +
>   	max_pa = cpuid_eax(0x80000008) & 0xff;
>   	hkid_start_pos = boot_cpu_data.x86_phys_bits;
>   	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
> @@ -147,6 +159,12 @@ void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>   		enable_tdx = false;
>   }
>   
> +void tdx_hardware_unsetup(void)
> +{
> +	/* kfree accepts NULL. */
> +	kfree(tdx_mng_key_config_lock);
> +}
> +
>   void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
>   			unsigned int *vcpu_align, unsigned int *vm_size)
>   {
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 78331dbc29f7..da32b4b86b19 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -131,11 +131,13 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
>   			unsigned int *vcpu_align, unsigned int *vm_size);
>   bool tdx_is_vm_type_supported(unsigned long type);
>   void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +void tdx_hardware_unsetup(void);
>   #else
>   static inline void tdx_pre_kvm_init(
>   	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
>   static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
>   static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
> +static inline void tdx_hardware_unsetup(void) {}
>   #endif
>   
>   #endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free
  2022-03-04 19:48 ` [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free isaku.yamahata
  2022-03-31  3:02   ` Kai Huang
@ 2022-04-05 12:40   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:40 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Kai Huang <kai.huang@intel.com>
> 
> Before tearing down private page tables, TDX requires some resources of the
> guest TD to be destroyed (i.e. the keyID must have been reclaimed, etc.).  Add
> a prezap callback that is invoked before tearing down private page tables.
> 
> TDX also needs to free some resources after other resources (i.e. vcpu-related
> resources) are freed.  Add a vm_free callback at the end of kvm_arch_destroy_vm().
> 
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/asm/kvm-x86-ops.h | 2 ++
>   arch/x86/include/asm/kvm_host.h    | 2 ++
>   arch/x86/kvm/x86.c                 | 8 ++++++++
>   3 files changed, 12 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 8125d43d3566..ef48dcc98cfc 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -20,7 +20,9 @@ KVM_X86_OP(has_emulated_msr)
>   KVM_X86_OP(vcpu_after_set_cpuid)
>   KVM_X86_OP(is_vm_type_supported)
>   KVM_X86_OP(vm_init)
> +KVM_X86_OP_NULL(mmu_prezap)
>   KVM_X86_OP_NULL(vm_destroy)
> +KVM_X86_OP_NULL(vm_free)
>   KVM_X86_OP(vcpu_create)
>   KVM_X86_OP(vcpu_free)
>   KVM_X86_OP(vcpu_reset)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8de357a9ad30..5ff7a0fba311 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1326,7 +1326,9 @@ struct kvm_x86_ops {
>   	bool (*is_vm_type_supported)(unsigned long vm_type);
>   	unsigned int vm_size;
>   	int (*vm_init)(struct kvm *kvm);
> +	void (*mmu_prezap)(struct kvm *kvm);
>   	void (*vm_destroy)(struct kvm *kvm);
> +	void (*vm_free)(struct kvm *kvm);
>   
>   	/* Create, but do not attach this VCPU */
>   	int (*vcpu_create)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f6438750d190..a48f5c69fadb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -11779,6 +11779,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>   	kvm_page_track_cleanup(kvm);
>   	kvm_xen_destroy_vm(kvm);
>   	kvm_hv_destroy_vm(kvm);
> +	static_call_cond(kvm_x86_vm_free)(kvm);
>   }
>   
>   static void memslot_rmap_free(struct kvm_memory_slot *slot)
> @@ -12036,6 +12037,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
>   
>   void kvm_arch_flush_shadow_all(struct kvm *kvm)
>   {
> +	/*
> +	 * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
> +	 * tearing down private page tables, TDX requires some TD resources to
> +	 * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
> +	 * kvm_x86_mmu_prezap() for this.
> +	 */
> +	static_call_cond(kvm_x86_mmu_prezap)(kvm);
>   	kvm_mmu_zap_all(kvm);

Please rename the hook to (*flush_shadow_all_private).
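
I.e., roughly (a sketch of the rename, mirroring the hunks above):

In arch/x86/include/asm/kvm-x86-ops.h:

	KVM_X86_OP_NULL(flush_shadow_all_private)

In struct kvm_x86_ops:

	void (*flush_shadow_all_private)(struct kvm *kvm);

In kvm_arch_flush_shadow_all():

	static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);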

Otherwise ok,

Paolo

>   }
>   


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 022/104] KVM: Add max_vcpus field in common 'struct kvm'
  2022-03-04 19:48 ` [RFC PATCH v5 022/104] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
@ 2022-04-05 12:42   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:42 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> For TDX guests, the maximum number of vcpus needs to be specified when the
> TDX guest VM is initialized (i.e. when the TDX data corresponding to the TDX
> guest is created), before creating any vcpu.  KVM needs to record the maximum
> number of vcpus on VM creation (KVM_CREATE_VM) and return an error if the
> number of vcpus exceeds it.
> 
> Because there is already a max_vcpus member in the arm64 struct kvm_arch, move
> it to the common struct kvm and initialize it to KVM_MAX_VCPUS before
> kvm_arch_init_vm(), instead of adding it to the x86 struct kvm_arch.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/arm64/include/asm/kvm_host.h | 3 ---
>   arch/arm64/kvm/arm.c              | 6 +++---
>   arch/arm64/kvm/vgic/vgic-init.c   | 6 +++---
>   include/linux/kvm_host.h          | 1 +
>   virt/kvm/kvm_main.c               | 3 ++-
>   5 files changed, 9 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
> index 5bc01e62c08a..27249d634605 100644
> --- a/arch/arm64/include/asm/kvm_host.h
> +++ b/arch/arm64/include/asm/kvm_host.h
> @@ -107,9 +107,6 @@ struct kvm_arch {
>   	/* VTCR_EL2 value for this VM */
>   	u64    vtcr;
>   
> -	/* The maximum number of vCPUs depends on the used GIC model */
> -	int max_vcpus;
> -
>   	/* Interrupt controller */
>   	struct vgic_dist	vgic;
>   
> diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
> index ecc5958e27fe..defec2cd94bd 100644
> --- a/arch/arm64/kvm/arm.c
> +++ b/arch/arm64/kvm/arm.c
> @@ -153,7 +153,7 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	kvm_vgic_early_init(kvm);
>   
>   	/* The maximum number of VCPUs is limited by the host's GIC model */
> -	kvm->arch.max_vcpus = kvm_arm_default_max_vcpus();
> +	kvm->max_vcpus = kvm_arm_default_max_vcpus();
>   
>   	set_default_spectre(kvm);
>   
> @@ -229,7 +229,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_MAX_VCPUS:
>   	case KVM_CAP_MAX_VCPU_ID:
>   		if (kvm)
> -			r = kvm->arch.max_vcpus;
> +			r = kvm->max_vcpus;
>   		else
>   			r = kvm_arm_default_max_vcpus();
>   		break;
> @@ -305,7 +305,7 @@ int kvm_arch_vcpu_precreate(struct kvm *kvm, unsigned int id)
>   	if (irqchip_in_kernel(kvm) && vgic_initialized(kvm))
>   		return -EBUSY;
>   
> -	if (id >= kvm->arch.max_vcpus)
> +	if (id >= kvm->max_vcpus)
>   		return -EINVAL;
>   
>   	return 0;
> diff --git a/arch/arm64/kvm/vgic/vgic-init.c b/arch/arm64/kvm/vgic/vgic-init.c
> index fc00304fe7d8..77feafd5c0e3 100644
> --- a/arch/arm64/kvm/vgic/vgic-init.c
> +++ b/arch/arm64/kvm/vgic/vgic-init.c
> @@ -98,11 +98,11 @@ int kvm_vgic_create(struct kvm *kvm, u32 type)
>   	ret = 0;
>   
>   	if (type == KVM_DEV_TYPE_ARM_VGIC_V2)
> -		kvm->arch.max_vcpus = VGIC_V2_MAX_CPUS;
> +		kvm->max_vcpus = VGIC_V2_MAX_CPUS;
>   	else
> -		kvm->arch.max_vcpus = VGIC_V3_MAX_CPUS;
> +		kvm->max_vcpus = VGIC_V3_MAX_CPUS;
>   
> -	if (atomic_read(&kvm->online_vcpus) > kvm->arch.max_vcpus) {
> +	if (atomic_read(&kvm->online_vcpus) > kvm->max_vcpus) {
>   		ret = -E2BIG;
>   		goto out_unlock;
>   	}
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index f11039944c08..a56044a31bc6 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -715,6 +715,7 @@ struct kvm {
>   	 * and is accessed atomically.
>   	 */
>   	atomic_t online_vcpus;
> +	int max_vcpus;
>   	int created_vcpus;
>   	int last_boosted_vcpu;
>   	struct list_head vm_list;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 52f72a366beb..3adee9c6b370 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1075,6 +1075,7 @@ static struct kvm *kvm_create_vm(unsigned long type)
>   	spin_lock_init(&kvm->gpc_lock);
>   
>   	INIT_LIST_HEAD(&kvm->devices);
> +	kvm->max_vcpus = KVM_MAX_VCPUS;
>   
>   	BUILD_BUG_ON(KVM_MEM_SLOTS_NUM > SHRT_MAX);
>   
> @@ -3718,7 +3719,7 @@ static int kvm_vm_ioctl_create_vcpu(struct kvm *kvm, u32 id)
>   		return -EINVAL;
>   
>   	mutex_lock(&kvm->lock);
> -	if (kvm->created_vcpus == KVM_MAX_VCPUS) {
> +	if (kvm->created_vcpus >= kvm->max_vcpus) {
>   		mutex_unlock(&kvm->lock);
>   		return -EINVAL;
>   	}

Queued this one already, thanks.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-03-04 19:48 ` [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure isaku.yamahata
  2022-03-31  4:17   ` Kai Huang
@ 2022-04-05 12:44   ` Paolo Bonzini
  2022-04-08  0:51     ` Isaku Yamahata
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:44 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> As the first step toward creating a TDX guest, create/destroy the VM struct.
> Assign a Host Key ID (HKID) to the TDX guest for memory encryption and
> allocate extra pages for the TDX guest.  On destruction, free the allocated
> pages and the HKID.
> 
> Add a second kvm_x86_ops hook in kvm_arch_destroy_vm() to support TDX's
> destruction path, which needs to first put the VM into a teardown state,
> then free per-vCPU resources, and finally free per-VM resources.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c      |  16 +-
>   arch/x86/kvm/vmx/tdx.c       | 312 +++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/tdx.h       |   2 +
>   arch/x86/kvm/vmx/tdx_errno.h |   2 +-
>   arch/x86/kvm/vmx/tdx_ops.h   |   8 +
>   arch/x86/kvm/vmx/x86_ops.h   |   8 +
>   6 files changed, 346 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 6111c6485d8e..5c3a904a30e8 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -39,12 +39,24 @@ static int vt_vm_init(struct kvm *kvm)
>   		ret = tdx_module_setup();
>   		if (ret)
>   			return ret;
> -		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
> +		return tdx_vm_init(kvm);
>   	}
>   
>   	return vmx_vm_init(kvm);
>   }
>   
> +static void vt_mmu_prezap(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return tdx_mmu_prezap(kvm);
> +}

Please rename the function to explain what it does, for example 
tdx_mmu_release_hkid.

Paolo

> +static void vt_vm_free(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return tdx_vm_free(kvm);
> +}

> +
>   struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.name = "kvm_intel",
>   
> @@ -58,6 +70,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.is_vm_type_supported = vt_is_vm_type_supported,
>   	.vm_size = sizeof(struct kvm_vmx),
>   	.vm_init = vt_vm_init,
> +	.mmu_prezap = vt_mmu_prezap,
> +	.vm_free = vt_vm_free,
>   
>   	.vcpu_create = vmx_vcpu_create,
>   	.vcpu_free = vmx_vcpu_free,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 1c8222f54764..702953fd365f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -31,14 +31,324 @@ struct tdx_capabilities {
>   	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
>   };
>   
> +/* KeyID used by TDX module */
> +static u32 tdx_global_keyid __read_mostly;
> +
>   /* Capabilities of KVM + the TDX module. */
>   struct tdx_capabilities tdx_caps;
>   
> +static DEFINE_MUTEX(tdx_lock);
>   static struct mutex *tdx_mng_key_config_lock;
>   
>   static u64 hkid_mask __ro_after_init;
>   static u8 hkid_start_pos __ro_after_init;
>   
> +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> +{
> +	pa &= ~hkid_mask;
> +	pa |= (u64)hkid << hkid_start_pos;
> +
> +	return pa;
> +}
> +
> +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->tdr.added;
> +}
> +
> +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> +{
> +	tdx_keyid_free(kvm_tdx->hkid);
> +	kvm_tdx->hkid = -1;
> +}
> +
> +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->hkid > 0;
> +}
> +
> +static void tdx_clear_page(unsigned long page)
> +{
> +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> +	unsigned long i;
> +
> +	/* Zeroing the page is only necessary for systems with MKTME-i. */
> +	if (!static_cpu_has(X86_FEATURE_MOVDIR64B))
> +		return;
> +
> +	for (i = 0; i < 4096; i += 64)
> +		/* MOVDIR64B [rdx], es:rdi */
> +		asm (".byte 0x66, 0x0f, 0x38, 0xf8, 0x3a"
> +		     : : "d" (zero_page), "D" (page + i) : "memory");
> +}
> +
> +static int __tdx_reclaim_page(unsigned long va, hpa_t pa, bool do_wb, u16 hkid)
> +{
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	err = tdh_phymem_page_reclaim(pa, &out);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> +		return -EIO;
> +	}
> +
> +	if (do_wb) {
> +		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> +			return -EIO;
> +		}
> +	}
> +
> +	tdx_clear_page(va);
> +	return 0;
> +}
> +
> +static int tdx_reclaim_page(unsigned long va, hpa_t pa)
> +{
> +	return __tdx_reclaim_page(va, pa, false, 0);
> +}
> +
> +static int tdx_alloc_td_page(struct tdx_td_page *page)
> +{
> +	page->va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!page->va)
> +		return -ENOMEM;
> +
> +	page->pa = __pa(page->va);
> +	return 0;
> +}
> +
> +static void tdx_mark_td_page_added(struct tdx_td_page *page)
> +{
> +	WARN_ON_ONCE(page->added);
> +	page->added = true;
> +}
> +
> +static void tdx_reclaim_td_page(struct tdx_td_page *page)
> +{
> +	if (page->added) {
> +		if (tdx_reclaim_page(page->va, page->pa))
> +			return;
> +
> +		page->added = false;
> +	}
> +	free_page(page->va);
> +}
> +
> +static int tdx_do_tdh_phymem_cache_wb(void *param)
> +{
> +	u64 err = 0;
> +
> +	/*
> +	 * Multiple guest TDs can be destroyed simultaneously.  Prevent
> +	 * tdh_phymem_cache_wb from returning TDX_BUSY by serialization.
> +	 */
> +	mutex_lock(&tdx_lock);
> +	do {
> +		err = tdh_phymem_cache_wb(!!err);
> +	} while (err == TDX_INTERRUPTED_RESUMABLE);
> +	mutex_unlock(&tdx_lock);
> +
> +	/* Other thread may have done for us. */
> +	if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> +		err = TDX_SUCCESS;
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +void tdx_mmu_prezap(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages;
> +	bool cpumask_allocated;
> +	u64 err;
> +	int ret;
> +	int i;
> +
> +	if (!is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	if (!is_td_created(kvm_tdx))
> +		goto free_hkid;
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_key_reclaimid(kvm_tdx->tdr.pa);
> +	mutex_unlock(&tdx_lock);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_RECLAIMID, err, NULL);
> +		return;
> +	}
> +
> +	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> +	for_each_online_cpu(i) {
> +		if (cpumask_allocated &&
> +			cpumask_test_and_set_cpu(topology_physical_package_id(i),
> +						packages))
> +			continue;
> +
> +		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
> +		if (ret)
> +			break;
> +	}
> +	free_cpumask_var(packages);
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_key_freeid(kvm_tdx->tdr.pa);
> +	mutex_unlock(&tdx_lock);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> +		return;
> +	}
> +
> +free_hkid:
> +	tdx_hkid_free(kvm_tdx);
> +}
> +
> +void tdx_vm_free(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	int i;
> +
> +	/* Can't reclaim or free TD pages if teardown failed. */
> +	if (is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
> +		tdx_reclaim_td_page(&kvm_tdx->tdcs[i]);
> +	kfree(kvm_tdx->tdcs);
> +
> +	if (kvm_tdx->tdr.added &&
> +		__tdx_reclaim_page(kvm_tdx->tdr.va, kvm_tdx->tdr.pa, true,
> +				tdx_global_keyid))
> +		return;
> +
> +	free_page(kvm_tdx->tdr.va);
> +}
> +
> +static int tdx_do_tdh_mng_key_config(void *param)
> +{
> +	hpa_t *tdr_p = param;
> +	int cpu, cur_pkg;
> +	u64 err;
> +
> +	cpu = raw_smp_processor_id();
> +	cur_pkg = topology_physical_package_id(cpu);
> +
> +	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
> +	do {
> +		err = tdh_mng_key_config(*tdr_p);
> +	} while (err == TDX_KEY_GENERATION_FAILED);
> +	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
> +
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +int tdx_vm_init(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages;
> +	int ret, i;
> +	u64 err;
> +
> +	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
> +	kvm->max_vcpus = 0;
> +
> +	kvm_tdx->hkid = tdx_keyid_alloc();
> +	if (kvm_tdx->hkid < 0)
> +		return -EBUSY;
> +
> +	ret = tdx_alloc_td_page(&kvm_tdx->tdr);
> +	if (ret)
> +		goto free_hkid;
> +
> +	kvm_tdx->tdcs = kcalloc(tdx_caps.tdcs_nr_pages, sizeof(*kvm_tdx->tdcs),
> +				GFP_KERNEL_ACCOUNT);
> +	if (!kvm_tdx->tdcs)
> +		goto free_tdr;
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		ret = tdx_alloc_td_page(&kvm_tdx->tdcs[i]);
> +		if (ret)
> +			goto free_tdcs;
> +	}
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_create(kvm_tdx->tdr.pa, kvm_tdx->hkid);
> +	mutex_unlock(&tdx_lock);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> +		ret = -EIO;
> +		goto free_tdcs;
> +	}
> +	tdx_mark_td_page_added(&kvm_tdx->tdr);
> +
> +	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> +		ret = -ENOMEM;
> +		goto free_tdcs;
> +	}
> +	for_each_online_cpu(i) {
> +		if (cpumask_test_and_set_cpu(topology_physical_package_id(i),
> +						packages))
> +			continue;
> +
> +		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> +				&kvm_tdx->tdr.pa, 1);
> +		if (ret)
> +			break;
> +	}
> +	free_cpumask_var(packages);
> +	if (ret)
> +		goto teardown;
> +
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		err = tdh_mng_addcx(kvm_tdx->tdr.pa, kvm_tdx->tdcs[i].pa);
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> +			ret = -EIO;
> +			goto teardown;
> +		}
> +		tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
> +	}
> +
> +	/*
> +	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
> +	 * ioctl() to define the configure CPUID values for the TD.
> +	 */
> +	return 0;
> +
> +	/*
> +	 * The sequence for freeing resources from a partially initialized TD
> +	 * varies based on where in the initialization flow failure occurred.
> +	 * Simply use the full teardown and destroy, which naturally play nice
> +	 * with partial initialization.
> +	 */
> +teardown:
> +	tdx_mmu_prezap(kvm);
> +	tdx_vm_free(kvm);
> +	return ret;
> +
> +free_tdcs:
> +	/* @i points at the TDCS page that failed allocation. */
> +	for (--i; i >= 0; i--)
> +		free_page(kvm_tdx->tdcs[i].va);
> +	kfree(kvm_tdx->tdcs);
> +free_tdr:
> +	free_page(kvm_tdx->tdr.va);
> +free_hkid:
> +	tdx_hkid_free(kvm_tdx);
> +	return ret;
> +}
> +
>   static int __tdx_module_setup(void)
>   {
>   	const struct tdsysinfo_struct *tdsysinfo;
> @@ -59,6 +369,8 @@ static int __tdx_module_setup(void)
>   		return ret;
>   	}
>   
> +	tdx_global_keyid = tdx_get_global_keyid();
> +
>   	tdsysinfo = tdx_get_sysinfo();
>   	if (tdx_caps.nr_cpuid_configs > TDX_MAX_NR_CPUID_CONFIGS)
>   		return -EIO;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index e4bb8831764e..860136ed70f5 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -19,6 +19,8 @@ struct kvm_tdx {
>   
>   	struct tdx_td_page tdr;
>   	struct tdx_td_page *tdcs;
> +
> +	int hkid;
>   };
>   
>   struct vcpu_tdx {
> diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
> index 5c878488795d..590fcfdd1899 100644
> --- a/arch/x86/kvm/vmx/tdx_errno.h
> +++ b/arch/x86/kvm/vmx/tdx_errno.h
> @@ -12,11 +12,11 @@
>   #define TDX_SUCCESS				0x0000000000000000ULL
>   #define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
>   #define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
> -#define TDX_LIFECYCLE_STATE_INCORRECT		0xC000060700000000ULL
>   #define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
>   #define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
>   #define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
>   #define TDX_KEY_CONFIGURED			0x0000081500000000ULL
> +#define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
>   #define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
>   
>   /*
> diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> index 0bed43879b82..3dd5b4c3f04c 100644
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -6,6 +6,7 @@
>   
>   #include <linux/compiler.h>
>   
> +#include <asm/cacheflush.h>
>   #include <asm/asm.h>
>   #include <asm/kvm_host.h>
>   
> @@ -15,8 +16,14 @@
>   
>   #ifdef CONFIG_INTEL_TDX_HOST
>   
> +static inline void tdx_clflush_page(hpa_t addr)
> +{
> +	clflush_cache_range(__va(addr), PAGE_SIZE);
> +}
> +
>   static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
>   {
> +	tdx_clflush_page(addr);
>   	return kvm_seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, 0, NULL);
>   }
>   
> @@ -56,6 +63,7 @@ static inline u64 tdh_mng_key_config(hpa_t tdr)
>   
>   static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
>   {
> +	tdx_clflush_page(tdr);
>   	return kvm_seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, 0, NULL);
>   }
>   
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index da32b4b86b19..2b2738c768d6 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -132,12 +132,20 @@ void __init tdx_pre_kvm_init(unsigned int *vcpu_size,
>   bool tdx_is_vm_type_supported(unsigned long type);
>   void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
>   void tdx_hardware_unsetup(void);
> +
> +int tdx_vm_init(struct kvm *kvm);
> +void tdx_mmu_prezap(struct kvm *kvm);
> +void tdx_vm_free(struct kvm *kvm);
>   #else
>   static inline void tdx_pre_kvm_init(
>   	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
>   static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
>   static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
>   static inline void tdx_hardware_unsetup(void) {}
> +
> +static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> +static inline void tdx_mmu_prezap(struct kvm *kvm) {}
> +static inline void tdx_vm_free(struct kvm *kvm) {}
>   #endif
>   
>   #endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2022-03-04 19:48 ` [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
@ 2022-04-05 12:50   ` Paolo Bonzini
  2022-04-08  0:56     ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:50 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a placeholder function for the TDX-specific VM-scoped ioctl as mem_enc_op.
> TDX-specific sub-commands will be added to retrieve/pass TDX-specific
> parameters.
> 
> KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
> guest state-protected VMs.  It defined subcommands for technology-specific
> operations under KVM_MEMORY_ENCRYPT_OP.  Despite its name, the subcommands
> are not limited to memory encryption; various technology-specific operations
> are defined.  It's natural to repurpose KVM_MEMORY_ENCRYPT_OP for TDX-specific
> operations and define subcommands under it.
> 
> The device model, for example qemu, needs VM-scoped and vCPU-scoped
> TDX-specific operations: getting system-wide parameters, TDX-specific VM
> initialization, and TDX-specific vCPU initialization.  This requires KVM
> vCPU-scoped operations in addition to the existing VM-scoped operations.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/uapi/asm/kvm.h       | 11 +++++++++++
>   arch/x86/kvm/vmx/main.c               | 10 ++++++++++
>   arch/x86/kvm/vmx/tdx.c                | 24 ++++++++++++++++++++++++
>   arch/x86/kvm/vmx/x86_ops.h            |  4 ++++
>   tools/arch/x86/include/uapi/asm/kvm.h | 11 +++++++++++
>   5 files changed, 60 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 71a5851475e7..2ad61caf4e0b 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -528,4 +528,15 @@ struct kvm_pmu_event_filter {
>   #define KVM_X86_DEFAULT_VM	0
>   #define KVM_X86_TDX_VM		1
>   
> +/* Trust Domain eXtension sub-ioctl() commands. */
> +enum kvm_tdx_cmd_id {
> +	KVM_TDX_CMD_NR_MAX,
> +};
> +
> +struct kvm_tdx_cmd {
> +	__u32 id;
> +	__u32 metadata;
> +	__u64 data;
> +};

Please include some initial documentation here already; for example, it
is not clear what "metadata" is.

Also please add

	u32 error;
	u32 unused;

for two reasons: 1) consistency with kvm_sev_cmd 2) error codes should 
be returned to userspace and not just sent through pr_tdx_error.
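
Something along these lines, perhaps (a sketch only; the comments are my
guesses at the intended semantics and should be adjusted to the actual
sub-commands):

struct kvm_tdx_cmd {
	/* enum kvm_tdx_cmd_id */
	__u32 id;
	/* Flags/metadata for the sub-command; must be 0 unless @id defines them. */
	__u32 metadata;
	/* Sub-command specific data, typically a pointer to a userspace struct. */
	__u64 data;
	/* TDX SEAMCALL status code, reported back to userspace on failure. */
	__u32 error;
	__u32 unused;
};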

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  2022-03-04 19:48 ` [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters isaku.yamahata
@ 2022-04-05 12:52   ` Paolo Bonzini
  2022-04-06  1:54     ` Xiaoyao Li
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:52 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> Implement a VM-scoped subcommand to get system-wide parameters.  Although
> these are system-wide parameters, not per-VM ones, this subcommand is
> VM-scoped because
> - The device model needs the TDX system-wide parameters after creating a KVM VM.
> - This subcommand requires the TDX module to be initialized.  For lazy
>    initialization of the TDX module, a VM-scoped ioctl is better.

Since there was agreement to install the TDX module on load, please 
place this ioctl on the /dev/kvm file descriptor.

At least for SEV, there were cases where the system-wide parameters are 
needed outside KVM, so it's better to avoid requiring a VM file descriptor.
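
For example, from the device model's side it would then look roughly like this
(a sketch only; whether the sub-ioctl keeps the KVM_MEMORY_ENCRYPT_OP umbrella
on /dev/kvm or gets a dedicated ioctl number is still to be decided):

	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_CAPABILITIES,
		.data = (__u64)(unsigned long)caps,	/* caller-allocated output buffer */
	};

	/* kvm_fd is the /dev/kvm fd, not a VM fd. */
	ret = ioctl(kvm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);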

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-03-04 19:48 ` [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
  2022-03-31  4:55   ` Kai Huang
@ 2022-04-05 12:58   ` Paolo Bonzini
  2022-04-07  1:29     ` Xiaoyao Li
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 12:58 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> +	td_params->attributes = init_vm->attributes;
> +	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> +		pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
> +			"host perf registers properly.\n");
> +		return -EOPNOTSUPP;
> +	}

Why does KVM have to hardcode this (and LBR/AMX below)?  Is the level of 
hardware support available from tdx_caps, for example through the CPUID 
configs (0xA for this one, 0xD for LBR and AMX)?

> +	/* PT can be exposed to TD guest regardless of KVM's XSS support */
> +	guest_supported_xss &= (supported_xss | XFEATURE_MASK_PT);
> +	td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> +	if (td_params->xfam & TDX_TD_XFAM_LBR) {
> +		pr_warn("TD doesn't support LBR. KVM needs to save/restore "
> +			"IA32_LBR_DEPTH properly.\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (td_params->xfam & TDX_TD_XFAM_AMX) {
> +		pr_warn("TD doesn't support AMX. KVM needs to save/restore "
> +			"IA32_XFD, IA32_XFD_ERR properly.\n");
> +		return -EOPNOTSUPP;
> +	}

> 
> +	if (init_vm->tsc_khz)
> +		guest_tsc_khz = init_vm->tsc_khz;
> +	else
> +		guest_tsc_khz = max_tsc_khz;

You can just use kvm->arch.default_tsc_khz in the latest kvm/queue.
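
One possible reading (sketch only):

	if (init_vm->tsc_khz)
		guest_tsc_khz = init_vm->tsc_khz;
	else
		guest_tsc_khz = kvm->arch.default_tsc_khz;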

> +#define BUILD_BUG_ON_MEMCPY(dst, src)				\
> +	do {							\
> +		BUILD_BUG_ON(sizeof(dst) != sizeof(src));	\
> +		memcpy((dst), (src), sizeof(dst));		\
> +	} while (0)
> +
> +	BUILD_BUG_ON_MEMCPY(td_params->mrconfigid, init_vm->mrconfigid);
> +	BUILD_BUG_ON_MEMCPY(td_params->mrowner, init_vm->mrowner);
> +	BUILD_BUG_ON_MEMCPY(td_params->mrownerconfig, init_vm->mrownerconfig);
> +


Please rename to MEMCPY_SAME_SIZE.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-03-31  4:55   ` Kai Huang
@ 2022-04-05 13:01     ` Paolo Bonzini
  2022-04-06  2:06       ` Xiaoyao Li
  2022-04-08  2:18     ` Isaku Yamahata
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:01 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/31/22 06:55, Kai Huang wrote:
>>   
>> +struct kvm_tdx_init_vm {
>> +	__u32 max_vcpus;
>> +	__u32 tsc_khz;
>> +	__u64 attributes;
>> +	__u64 cpuid;
> Is it better to append all CPUIDs directly into this structure, perhaps at the
> end of this structure, to make it more consistent with TD_PARAMS?
> 
> Also, I think somewhere in the commit message or comments we should explain why
> CPUIDs are passed here (why the existing KVM_SET_CPUID2 is not sufficient).
> 

Indeed, it would be easier to use the existing cpuid data in struct 
kvm_vcpu, because right now there is no way to ensure that they are 
consistent.

Why is KVM_SET_CPUID2 not enough?  Are there any modifications done by 
KVM that affect the measurement?

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 029/104] KVM: TDX: allocate/free TDX vcpu structure
  2022-03-04 19:48 ` [RFC PATCH v5 029/104] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
@ 2022-04-05 13:04   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:04 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> +	/*
> +	 * In TDX case, tsc frequency is per-VM and determined by the parameter
> +	 * tdh_mng_init().  Forcibly set it instead of max_tsc_khz set by
> +	 * kvm_arch_vcpu_create().
> +	 *
> +	 * This function is called after kvm_arch_vcpu_create() calling
> +	 * kvm_set_tsc_khz().
> +	 */
> +	kvm_set_tsc_khz(vcpu, kvm_tdx->tsc_khz);
> +

I think this is not needed anymore, now that there is 
kvm->arch.default_tsc_khz.  If so, exporting the function is not needed 
either.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid
  2022-03-04 19:48 ` [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid isaku.yamahata
  2022-03-31  1:21   ` Kai Huang
@ 2022-04-05 13:08   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:08 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> An MKTME keyid is assigned to each guest TD.  The memory controller encrypts
> guest TD memory with that keyid.  Add helper functions to allocate/free MKTME
> keyids so that TDX KVM can assign a keyid to a TD.
> 
> Also export the MKTME global keyid that is used to encrypt the TDX module and
> its memory.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/asm/tdx.h |  6 ++++++
>   arch/x86/virt/vmx/tdx.c    | 33 ++++++++++++++++++++++++++++++++-
>   2 files changed, 38 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 9a8dc6afcb63..73bb472bd515 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -139,6 +139,9 @@ int tdx_detect(void);
>   int tdx_init(void);
>   bool platform_has_tdx(void);
>   const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> +u32 tdx_get_global_keyid(void);
> +int tdx_keyid_alloc(void);
> +void tdx_keyid_free(int keyid);
>   #else
>   static inline void tdx_detect_cpu(struct cpuinfo_x86 *c) { }
>   static inline int tdx_detect(void) { return -ENODEV; }
> @@ -146,6 +149,9 @@ static inline int tdx_init(void) { return -ENODEV; }
>   static inline bool platform_has_tdx(void) { return false; }
>   struct tdsysinfo_struct;
>   static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
> +static inline u32 tdx_get_global_keyid(void) { return 0; };
> +static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
> +static inline void tdx_keyid_free(int keyid) { }
>   #endif /* CONFIG_INTEL_TDX_HOST */
>   
>   #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx.c b/arch/x86/virt/vmx/tdx.c
> index e45f188479cb..d714106321d4 100644
> --- a/arch/x86/virt/vmx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx.c
> @@ -113,7 +113,13 @@ static int tdx_cmr_num;
>   static struct tdsysinfo_struct tdx_sysinfo;
>   
>   /* TDX global KeyID to protect TDX metadata */
> -static u32 tdx_global_keyid;
> +static u32 __read_mostly tdx_global_keyid;
> +
> +u32 tdx_get_global_keyid(void)
> +{
> +	return tdx_global_keyid;
> +}
> +EXPORT_SYMBOL_GPL(tdx_get_global_keyid);
>   
>   static bool enable_tdx_host;
>   
> @@ -189,6 +195,31 @@ static void detect_seam(struct cpuinfo_x86 *c)
>   		detect_seam_ap(c);
>   }
>   
> +/* TDX KeyID pool */
> +static DEFINE_IDA(tdx_keyid_pool);
> +
> +int tdx_keyid_alloc(void)
> +{
> +	if (WARN_ON_ONCE(!tdx_keyid_start || !tdx_keyid_num))
> +		return -EINVAL;
> +
> +	/* The first keyID is reserved for the global key. */
> +	return ida_alloc_range(&tdx_keyid_pool, tdx_keyid_start + 1,
> +			       tdx_keyid_start + tdx_keyid_num - 1,
> +			       GFP_KERNEL);
> +}
> +EXPORT_SYMBOL_GPL(tdx_keyid_alloc);
> +
> +void tdx_keyid_free(int keyid)
> +{
> +	/* keyid = 0 is reserved. */
> +	if (!keyid || keyid <= 0)
> +		return;
> +
> +	ida_free(&tdx_keyid_pool, keyid);
> +}
> +EXPORT_SYMBOL_GPL(tdx_keyid_free);
> +
>   static void detect_tdx_keyids_bsp(struct cpuinfo_x86 *c)
>   {
>   	u64 keyid_part;

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 035/104] KVM: x86/mmu: Disallow dirty logging for x86 TDX
  2022-03-04 19:48 ` [RFC PATCH v5 035/104] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
@ 2022-04-05 13:09   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:09 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> TDX doesn't support dirty logging.  Report that dirty logging isn't supported
> so that the device model, for example qemu, can handle it properly.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/x86.c       |  5 +++++
>   include/linux/kvm_host.h |  1 +
>   virt/kvm/kvm_main.c      | 15 ++++++++++++---
>   3 files changed, 18 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c52a052e208c..da411bcd8cbc 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12876,6 +12876,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
>   }
>   EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>   
> +bool kvm_arch_dirty_log_supported(struct kvm *kvm)
> +{
> +	return kvm->arch.vm_type != KVM_X86_TDX_VM;
> +}
> +
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index a56044a31bc6..86f984e0c93f 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1423,6 +1423,7 @@ bool kvm_arch_dy_has_pending_interrupt(struct kvm_vcpu *vcpu);
>   int kvm_arch_post_init_vm(struct kvm *kvm);
>   void kvm_arch_pre_destroy_vm(struct kvm *kvm);
>   int kvm_arch_create_vm_debugfs(struct kvm *kvm);
> +bool kvm_arch_dirty_log_supported(struct kvm *kvm);
>   
>   #ifndef __KVM_HAVE_ARCH_VM_ALLOC
>   /*
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 3adee9c6b370..ae3bf553f215 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -1423,9 +1423,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
>   	}
>   }
>   
> -static int check_memory_region_flags(const struct kvm_userspace_memory_region *mem)
> +bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
>   {
> -	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
> +	return true;
> +}
> +
> +static int check_memory_region_flags(struct kvm *kvm,
> +				     const struct kvm_userspace_memory_region *mem)
> +{
> +	u32 valid_flags = 0;
> +
> +	if (kvm_arch_dirty_log_supported(kvm))
> +		valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
>   
>   #ifdef __KVM_HAVE_READONLY_MEM
>   	valid_flags |= KVM_MEM_READONLY;
> @@ -1826,7 +1835,7 @@ int __kvm_set_memory_region(struct kvm *kvm,
>   	int as_id, id;
>   	int r;
>   
> -	r = check_memory_region_flags(mem);
> +	r = check_memory_region_flags(kvm, mem);
>   	if (r)
>   		return r;
>   

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 036/104] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
  2022-03-04 19:48 ` [RFC PATCH v5 036/104] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
@ 2022-04-05 13:17   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:17 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Explicitly check for an MMIO spte in the fast page fault flow.  TDX will
> use a not-present entry for MMIO sptes, which can be mistaken for an
> access-tracked spte since both have SPTE_SPECIAL_MASK set.
> 
> The fast page fault path handles changing access bits without taking
> mmu_lock, for example clearing the write protect bit for dirty page
> tracking.  MMIO emulation is handled in a slow path.  So it doesn't affect

"MMIO sptes are handled in handle_mmio_page_fault for non-TDX VMs, so 
this patch does not affect them.  TDX will handle MMIO emulation through 
a hypercall instead".

For this comment, it is not necessary to talk about the slow path, since 
that is just where MMIO sptes are installed.  If the slow path is 
reached, fast_page_fault must not have seen is_mmio_spte(spte).

> @@ -3167,7 +3167,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   			break;
>   
>   		sp = sptep_to_sp(sptep);
> -		if (!is_last_spte(spte, sp->role.level))
> +		if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
>   			break;
>   
>   		/*

I would include the check a couple lines before:

	if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))

This matches what is in the commit message: the problem is that MMIO 
SPTEs are present in the TDX case, so you need to check them even if 
is_shadow_present_pte(spte) returns true.
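
I.e., roughly (a sketch of the resulting code; the surrounding lines are the
existing fast_page_fault() loop):

		if (!is_shadow_present_pte(spte) || is_mmio_spte(spte))
			break;

		sp = sptep_to_sp(sptep);
		if (!is_last_spte(spte, sp->role.level))
			break;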

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 039/104] KVM: x86/mmu: Disallow fast page fault on private GPA
  2022-03-04 19:48 ` [RFC PATCH v5 039/104] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
@ 2022-04-05 13:22   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:22 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX requires TDX SEAMCALLs to operate on the Secure EPT instead of direct
> memory access, and a TDX SEAMCALL is a heavy operation.  Fast page fault on a
> private GPA doesn't make sense.  Disallow fast page fault on private GPAs.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/mmu/mmu.c | 7 +++++++
>   1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e9212394a530..d8c1505155b0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3185,6 +3185,13 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   	u64 *sptep = NULL;
>   	uint retry_count = 0;
>   
> +	/*
> +	 * TDX private mapping doesn't support fast page fault because the EPT
> +	 * entry needs TDX SEAMCALL. not direct memory access.

"the EPT entry is read/written with TDX SEAMCALLs instead of direct 
memory access".

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> +	 */
> +	if (kvm_is_private_gpa(vcpu->kvm, fault->addr))
> +		return ret;
> +
>   	if (!page_fault_can_be_fast(fault))
>   		return ret;
>   


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  2022-04-01  2:13       ` Kai Huang
@ 2022-04-05 13:48         ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:48 UTC (permalink / raw)
  To: Kai Huang, Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On 4/1/22 04:13, Kai Huang wrote:
>> I don't want to use CONFIG_INTEL_TDX_HOST in KVM MMU code.  I think the change
>> to the KVM MMU should be somewhat independent from TDX.  But it seems that
>> failed, based on your feedback.
> 
> Why do you need to use any config?  As I said, the majority of your changes to
> the MMU are not under any config.  But I'll leave this to maintainers/reviewers.

There are few uses, but the effect should be pretty large, because the 
config symbol replaces variable accesses with constants:

+static inline gfn_t kvm_gfn_stolen_mask(struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_MMU_PRIVATE
+	return kvm->arch.gfn_shared_mask;
+#else
+	return 0;
+#endif
+}

Please keep it.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-03-31 11:16   ` Kai Huang
  2022-04-01  2:10     ` Kai Huang
  2022-04-01  2:34     ` Isaku Yamahata
@ 2022-04-05 13:55     ` Paolo Bonzini
  2022-04-06  2:23       ` Kai Huang
  2 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 13:55 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/31/22 13:16, Kai Huang wrote:
>> +	if (range && kvm_available_flush_tlb_with_range()) {
>> +		/* Callback should flush both private GFN and shared GFN. */
>> +		range->start_gfn = kvm_gfn_unalias(kvm, range->start_gfn);
> This seems wrong.  It seems the intention of this function is to flush TLB for
> all aliases for a given GFN range.  Here it seems you are unconditionally changing
> the range to always exclude the stolen bits.

He passes the "low" range with bits cleared, and expects the callback to 
take care of both.  That seems okay (apart from the incorrect 
fallthrough that you pointed out).

>>
>>  
>> -		gfn = gpte_to_gfn(gpte);
>> +		gfn = gpte_to_gfn(vcpu, gpte);
>>  		pte_access = sp->role.access;
>>  		pte_access &= FNAME(gpte_access)(gpte);
>>  		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> 
> In the commit message you mentioned "Don't support stolen bits for shadow EPT" (you
> actually mean shadow MMU, I suppose), yet there's a bunch of code changes to the
> shadow MMU.

It's a bit ugly, but it's uglier to keep two versions of gpte_to_gfn.

Perhaps the commit message can be rephrased to "Stolen bits are not 
supported in the shadow MMU; they will be used only for TDX which uses 
the TDP MMU exclusively as it does not support nested virtualization. 
Therefore, the gfn_shared_mask will always be zero in that case".

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-04-01  2:34     ` Isaku Yamahata
@ 2022-04-05 14:02       ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:02 UTC (permalink / raw)
  To: Isaku Yamahata, Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson, Chao Peng

On 4/1/22 04:34, Isaku Yamahata wrote:
> Sure, this patch has changed heavily from the original patch now.  One suggestion
> is that private/shared is a characteristic of the kvm page fault, not of the gpa/gfn.
> It's TDX specific.
> 
> - Add a helper function to check if KVM MMU is TD or VM. Right now
>    kvm_gfn_stolen_mask() is used.  Probably kvm_mmu_has_private_bit().
>    (any better name?)

I think use of kvm_gfn_stolen_mask() should be minimized anyway.  I 
would rename it to kvm_{gfn,gpa}_private_mask and not return bool.

> - Let's keep address conversion functions: address => unalias/shared/private

unalias is the same as private.  It doesn't seem to have a lot of uses. 
  I would just inline "x & ~gfn_private_mask", or "x & 
~kvm_gfn_private_mask(kvm)"; or the same with gpa of course.

The shared and private conversion functions should remain.
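
Concretely, I'm thinking of something along these lines (untested sketch; the
exact names and where the mask bits live are just my reading of the patch):

static inline gfn_t kvm_gfn_private_mask(struct kvm *kvm)
{
#ifdef CONFIG_KVM_MMU_PRIVATE
	return kvm->arch.gfn_shared_mask;
#else
	return 0;
#endif
}

/* "unalias" == private: clear the mask bit(s). */
static inline gfn_t kvm_gfn_private(struct kvm *kvm, gfn_t gfn)
{
	return gfn & ~kvm_gfn_private_mask(kvm);
}

static inline gfn_t kvm_gfn_shared(struct kvm *kvm, gfn_t gfn)
{
	return gfn | kvm_gfn_private_mask(kvm);
}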

> - Add struct kvm_page_fault.is_private
>    see how kvm_is_private_{gpa, gfn}() can be removed (or reduced).

Agreed.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-03-04 19:48 ` [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
  2022-04-01  5:13   ` Kai Huang
@ 2022-04-05 14:10   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:10 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
>   static void mmu_spte_clear_no_track(u64 *sptep)
>   {
> -	__update_clear_spte_fast(sptep, 0ull);
> +	__update_clear_spte_fast(sptep, shadow_init_value);
>   }
>   

Please WARN_ON_ONCE if shadow_init_value is nonzero, and then keep 0ull 
as the argument.
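
That is, something like this (untested):

static void mmu_spte_clear_no_track(u64 *sptep)
{
	WARN_ON_ONCE(shadow_init_value);
	__update_clear_spte_fast(sptep, 0ull);
}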

I have not thought much about the steps that would be needed if we were to flip
both bit 0 and bit 63, so let's at least document that with a WARN.

Otherwise,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-04-01  5:13   ` Kai Huang
  2022-04-01  7:13     ` Kai Huang
@ 2022-04-05 14:13     ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:13 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/1/22 07:13, Kai Huang wrote:
>> @@ -617,9 +617,9 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>>   	int level = sptep_to_sp(sptep)->role.level;
>>   
>>   	if (!spte_has_volatile_bits(old_spte))
>> -		__update_clear_spte_fast(sptep, 0ull);
>> +		__update_clear_spte_fast(sptep, shadow_init_value);
>>   	else
>> -		old_spte = __update_clear_spte_slow(sptep, 0ull);
>> +		old_spte = __update_clear_spte_slow(sptep, shadow_init_value);

(FWIW this one should also assume that the init_value is zero, and WARN 
if not).

> I guess it's better to have some comment here.  Allow non-zero init value for
> shadow PTE doesn't necessarily mean the initial value should be used when one
> PTE is zapped.  I think mmu_spte_clear_track_bits() is only called for mapping
> of normal (shared) memory but not MMIO? Then perhaps it's better to have a
> comment to explain we want "suppress #VE" set to get a real EPT violation for
> normal memory access from guest?
> 
>>   
>>   	if (!is_shadow_present_pte(old_spte))
>>   		return old_spte;
>> @@ -651,7 +651,7 @@ static int mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>>    */
>>   static void mmu_spte_clear_no_track(u64 *sptep)
>>   {
>> -	__update_clear_spte_fast(sptep, 0ull);
>> +	__update_clear_spte_fast(sptep, shadow_init_value);
>>   }
> Similar here.  It seems mmu_spte_clear_no_track() is used to zap non-leaf PTEs, which
> don't require state tracking, so theoretically it can be set to 0.  But it seems this
> is also called to zap MMIO PTEs, so it looks like it needs to be set to shadow_init_value.
> Anyway, this looks like it deserves a comment?
> 
> Btw, the above two changes, to mmu_spte_clear_track_bits() and
> mmu_spte_clear_no_track(), seem a little bit out-of-scope of what this patch
> claims to do.  Allowing a non-zero init value for shadow PTEs doesn't necessarily mean
> the initial value should be used when one PTE is zapped.  Maybe we can further
> improve the patch title and commit message a little bit.  Such as: "Allow non-zero
> value for empty (or invalid?) PTE"?  "Non-present" doesn't seem to fit here.

I think shadow_init_value is not named well.  Let's rename it to 
shadow_nonpresent_value, and the commit to "Allow non-zero value for 
non-present SPTE".  That explains why it's used for zapping.

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-04-01  7:13     ` Kai Huang
@ 2022-04-05 14:14       ` Paolo Bonzini
  2022-04-08 18:38         ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:14 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/1/22 09:13, Kai Huang wrote:
> Btw, I think the relevant part of TDP MMU change should be included in this
> patch too otherwise TDP MMU is broken with this patch.

I agree.

Paolo

> Actually in this series legacy MMU is not supported to work with TDX, so above
> change to legacy MMU doesn't matter actually.  Instead, TDP MMU change should be
> here.


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
  2022-03-04 19:49 ` [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value isaku.yamahata
@ 2022-04-05 14:22   ` Paolo Bonzini
  2022-04-06 23:35     ` Sean Christopherson
  2022-04-06 23:30   ` Kai Huang
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:22 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDP MMU uses REMOVED_SPTE = 0x5a0ULL as a special, semi-arbitrary constant to
> mark the intermediate value indicating that one thread is operating on the SPTE.
> For TDX (more correctly, to use #VE), the value should also include the
> "suppress #VE" bit, which is shadow_init_value.
> 
> Define SHADOW_REMOVED_SPTE as shadow_init_value | REMOVED_SPTE, and replace
> REMOVED_SPTE with SHADOW_REMOVED_SPTE to use suppress #VE bit properly for
> TDX.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/mmu/spte.h    | 14 ++++++++++++--
>   arch/x86/kvm/mmu/tdp_mmu.c | 23 ++++++++++++++++-------
>   2 files changed, 28 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index bde843bce878..e88f796724b4 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -194,7 +194,9 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>    * If a thread running without exclusive control of the MMU lock must perform a
>    * multi-part operation on an SPTE, it can set the SPTE to REMOVED_SPTE as a
>    * non-present intermediate value. Other threads which encounter this value
> - * should not modify the SPTE.
> + * should not modify the SPTE.  When TDX is enabled, shadow_init_value, which
> + * is "suppress #VE" bit set, is also set to removed SPTE, because TDX module
> + * always enables "EPT violation #VE".
>    *
>    * Use a semi-arbitrary value that doesn't set RWX bits, i.e. is not-present on
>    * bot AMD and Intel CPUs, and doesn't set PFN bits, i.e. doesn't create a L1TF
> @@ -207,9 +209,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>   /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
>   static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
>   
> +/*
> + * See above comment around REMOVED_SPTE.  SHADOW_REMOVED_SPTE is the actual
> + * intermediate value set to the removed SPET.  When TDX is enabled, it sets
> + * the "suppress #VE" bit, otherwise it's REMOVED_SPTE.
> + */
> +extern u64 __read_mostly shadow_init_value;
> +#define SHADOW_REMOVED_SPTE	(shadow_init_value | REMOVED_SPTE)

Please rename the existing REMOVED_SPTE to REMOVED_SPTE_MASK, and call 
this simply REMOVED_SPTE.  This also makes the patch smaller.
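
In other words, roughly (untested sketch):

/* Semi-arbitrary non-present marker, see the comment above. */
#define REMOVED_SPTE_MASK	0x5a0ULL

/* The value actually written to a removed SPTE. */
#define REMOVED_SPTE		(shadow_init_value | REMOVED_SPTE_MASK)

with the static_assert above then checking REMOVED_SPTE_MASK instead.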

Paolo

>   }
>   
>   /*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index ebd0a02620e8..b6ec2f112c26 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -338,7 +338,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>   			 * value to the removed SPTE value.
>   			 */
>   			for (;;) {
> -				old_child_spte = xchg(sptep, REMOVED_SPTE);
> +				old_child_spte = xchg(sptep, SHADOW_REMOVED_SPTE);
>   				if (!is_removed_spte(old_child_spte))
>   					break;
>   				cpu_relax();
> @@ -365,10 +365,10 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>   			 * the two branches consistent and simplifies
>   			 * the function.
>   			 */
> -			WRITE_ONCE(*sptep, REMOVED_SPTE);
> +			WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
>   		}
>   		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> -				    old_child_spte, REMOVED_SPTE, level,
> +				    old_child_spte, SHADOW_REMOVED_SPTE, level,
>   				    shared);
>   	}
>   
> @@ -537,7 +537,7 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>   	 * immediately installing a present entry in its place
>   	 * before the TLBs are flushed.
>   	 */
> -	if (!tdp_mmu_set_spte_atomic(kvm, iter, REMOVED_SPTE))
> +	if (!tdp_mmu_set_spte_atomic(kvm, iter, SHADOW_REMOVED_SPTE))
>   		return false;
>   
>   	kvm_flush_remote_tlbs_with_address(kvm, iter->gfn,
> @@ -550,8 +550,16 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>   	 * special removed SPTE value. No bookkeeping is needed
>   	 * here since the SPTE is going from non-present
>   	 * to non-present.
> +	 *
> +	 * Set non-present value to shadow_init_value, rather than 0.
> +	 * It is because when TDX is enabled, TDX module always
> +	 * enables "EPT-violation #VE", so KVM needs to set
> +	 * "suppress #VE" bit in EPT table entries, in order to get
> +	 * real EPT violation, rather than TDVMCALL.  KVM sets
> +	 * shadow_init_value (which sets "suppress #VE" bit) so it
> +	 * can be set when EPT table entries are zapped.
>   	 */
> -	WRITE_ONCE(*rcu_dereference(iter->sptep), 0);
> +	WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
>   
>   	return true;
>   }
> @@ -748,7 +756,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   			continue;
>   
>   		if (!shared) {
> -			tdp_mmu_set_spte(kvm, &iter, 0);
> +			/* see comments in tdp_mmu_zap_spte_atomic() */
> +			tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
>   			flush = true;
>   		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
>   			/*
> @@ -1135,7 +1144,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>   	 * invariant that the PFN of a present * leaf SPTE can never change.
>   	 * See __handle_changed_spte().
>   	 */
> -	tdp_mmu_set_spte(kvm, iter, 0);
> +	tdp_mmu_set_spte(kvm, iter, shadow_init_value);
>   
>   	if (!pte_write(range->pte)) {
>   		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 040/104] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2022-03-04 19:48 ` [RFC PATCH v5 040/104] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
@ 2022-04-05 14:43   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:43 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> The difference for a TDX EPT violation is how the information, the GPA and the
> exit qualification, is retrieved.  To share the code that handles EPT violations,
> split out the guts of the EPT violation handler so that the VMX/TDX exit handlers
> can call it after retrieving the GPA and exit qualification.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/common.h | 35 +++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/vmx.c    | 34 ++++++----------------------------
>   2 files changed, 41 insertions(+), 28 deletions(-)
>   create mode 100644 arch/x86/kvm/vmx/common.h
> 
> diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
> new file mode 100644
> index 000000000000..1052b3c93eb8
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/common.h
> @@ -0,0 +1,35 @@
> +/* SPDX-License-Identifier: GPL-2.0-only */
> +#ifndef __KVM_X86_VMX_COMMON_H
> +#define __KVM_X86_VMX_COMMON_H
> +
> +#include <linux/kvm_host.h>
> +
> +#include "mmu.h"
> +
> +static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
> +					     unsigned long exit_qualification)
> +{
> +	u64 error_code;
> +
> +	/* Is it a read fault? */
> +	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
> +		     ? PFERR_USER_MASK : 0;
> +	/* Is it a write fault? */
> +	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
> +		      ? PFERR_WRITE_MASK : 0;
> +	/* Is it a fetch fault? */
> +	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
> +		      ? PFERR_FETCH_MASK : 0;
> +	/* ept page table entry is present? */
> +	error_code |= (exit_qualification &
> +		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
> +			EPT_VIOLATION_EXECUTABLE))
> +		      ? PFERR_PRESENT_MASK : 0;
> +
> +	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
> +	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> +
> +	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +}
> +
> +#endif /* __KVM_X86_VMX_COMMON_H */
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 7838cd177f0e..0edeeed0b4c8 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -50,6 +50,7 @@
>   #include <asm/vmx.h>
>   
>   #include "capabilities.h"
> +#include "common.h"
>   #include "cpuid.h"
>   #include "evmcs.h"
>   #include "hyperv.h"
> @@ -5381,11 +5382,10 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
>   
>   static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   {
> -	unsigned long exit_qualification;
> -	gpa_t gpa;
> -	u64 error_code;
> +	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
> +	gpa_t gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
>   
> -	exit_qualification = vmx_get_exit_qual(vcpu);
> +	trace_kvm_page_fault(gpa, exit_qualification);
>   
>   	/*
>   	 * EPT violation happened while executing iret from NMI,
> @@ -5394,31 +5394,9 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   	 * AAK134, BY25.
>   	 */
>   	if (!(to_vmx(vcpu)->idt_vectoring_info & VECTORING_INFO_VALID_MASK) &&
> -			enable_vnmi &&
> -			(exit_qualification & INTR_INFO_UNBLOCK_NMI))
> +	    enable_vnmi && (exit_qualification & INTR_INFO_UNBLOCK_NMI))
>   		vmcs_set_bits(GUEST_INTERRUPTIBILITY_INFO, GUEST_INTR_STATE_NMI);
>   
> -	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
> -	trace_kvm_page_fault(gpa, exit_qualification);
> -
> -	/* Is it a read fault? */
> -	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
> -		     ? PFERR_USER_MASK : 0;
> -	/* Is it a write fault? */
> -	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
> -		      ? PFERR_WRITE_MASK : 0;
> -	/* Is it a fetch fault? */
> -	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
> -		      ? PFERR_FETCH_MASK : 0;
> -	/* ept page table entry is present? */
> -	error_code |= (exit_qualification &
> -		       (EPT_VIOLATION_READABLE | EPT_VIOLATION_WRITABLE |
> -			EPT_VIOLATION_EXECUTABLE))
> -		      ? PFERR_PRESENT_MASK : 0;
> -
> -	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
> -	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
> -
>   	vcpu->arch.exit_qualification = exit_qualification;
>   
>   	/*
> @@ -5432,7 +5410,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
>   	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
>   		return kvm_emulate_instruction(vcpu, 0);
>   
> -	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
> +	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
>   }
>   
>   static int handle_ept_misconfig(struct kvm_vcpu *vcpu)

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 041/104] KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  2022-03-04 19:48 ` [RFC PATCH v5 041/104] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
@ 2022-04-05 14:48   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:48 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> EPT MMU masks are used commonly for VMX and TDX.  The value needs to be
> initialized in common code before both VMX/TDX-specific initialization
> code.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c | 5 +++++
>   arch/x86/kvm/vmx/vmx.c  | 4 ----
>   2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 3eb9db6d83ac..51aaafe6b432 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -4,6 +4,7 @@
>   #include "x86_ops.h"
>   #include "vmx.h"
>   #include "nested.h"
> +#include "mmu.h"
>   #include "pmu.h"
>   #include "tdx.h"
>   
> @@ -22,6 +23,10 @@ static __init int vt_hardware_setup(void)
>   
>   	tdx_hardware_setup(&vt_x86_ops);
>   
> +	if (enable_ept)
> +		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> +				      cpu_has_vmx_ept_execute_only());
> +
>   	return 0;
>   }
>   
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 0edeeed0b4c8..07fd892768be 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7817,10 +7817,6 @@ __init int vmx_hardware_setup(void)
>   
>   	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
>   
> -	if (enable_ept)
> -		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> -				      cpu_has_vmx_ept_execute_only());
> -
>   	kvm_configure_mmu(enable_ept, 0, vmx_get_max_tdp_level(),
>   			  ept_caps_to_lpage_level(vmx_capability.ept));
>   

RB


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 043/104] KVM: TDX: Add load_mmu_pgd method for TDX
  2022-03-04 19:48 ` [RFC PATCH v5 043/104] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
@ 2022-04-05 14:51   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:51 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> For virtual IO, the guest TD shares guest pages with VMM without
> encryption.  Shared EPT is used to map guest pages in unprotected way.
> 
> Add the VMCS field encoding for the shared EPTP, which will be used by
> TDX to have separate EPT walks for private GPAs (existing EPTP) versus
> shared GPAs (new shared EPTP).
> 
> Set shared EPT pointer value for the TDX guest to initialize TDX MMU.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/asm/vmx.h |  1 +
>   arch/x86/kvm/vmx/main.c    | 11 ++++++++++-
>   arch/x86/kvm/vmx/tdx.c     |  5 +++++
>   arch/x86/kvm/vmx/x86_ops.h |  4 ++++
>   4 files changed, 20 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 88d9b8cc7dde..a2402d1bde04 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -221,6 +221,7 @@ enum vmcs_field {
>   	ENCLS_EXITING_BITMAP_HIGH	= 0x0000202F,
>   	TSC_MULTIPLIER                  = 0x00002032,
>   	TSC_MULTIPLIER_HIGH             = 0x00002033,
> +	SHARED_EPT_POINTER		= 0x0000203C,
>   	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
>   	GUEST_PHYSICAL_ADDRESS_HIGH     = 0x00002401,
>   	VMCS_LINK_POINTER               = 0x00002800,
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index b242a9dc9e29..6969e3557bd4 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,15 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	return vmx_vcpu_reset(vcpu, init_event);
>   }
>   
> +static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> +			int pgd_level)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> +
> +	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> +}
> +
>   static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
>   {
>   	if (!is_td(kvm))
> @@ -205,7 +214,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.write_tsc_offset = vmx_write_tsc_offset,
>   	.write_tsc_multiplier = vmx_write_tsc_multiplier,
>   
> -	.load_mmu_pgd = vmx_load_mmu_pgd,
> +	.load_mmu_pgd = vt_load_mmu_pgd,
>   
>   	.check_intercept = vmx_check_intercept,
>   	.handle_exit_irqoff = vmx_handle_exit_irqoff,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c3434b33c452..51098e10b6a0 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -496,6 +496,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	vcpu->kvm->vm_bugged = true;
>   }
>   
> +void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
> +{
> +	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> +}
> +
>   static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   {
>   	struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 81f246493ec7..ad9b1c883761 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -143,6 +143,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> +
> +void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
>   #else
>   static inline void tdx_pre_kvm_init(
>   	unsigned int *vcpu_size, unsigned int *vcpu_align, unsigned int *vm_size) {}
> @@ -160,6 +162,8 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
> +
> +static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
>   #endif
>   
>   #endif /* __KVM_X86_VMX_X86_OPS_H */

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 046/104] KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
  2022-03-04 19:49 ` [RFC PATCH v5 046/104] KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map() isaku.yamahata
@ 2022-04-05 14:53   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:53 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Factor out non-leaf SPTE population logic from kvm_tdp_mmu_map().  MapGPA
> hypercall needs to populate non-leaf SPTE to record which GPA, private or
> shared, is allowed in the leaf EPT entry.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

and feel free to rebase/resubmit this one, with subject "KVM: 
x86/tdp_mmu: extract tdp_mmu_populate_nonleaf()".

Paolo

> ---
>   arch/x86/kvm/mmu/tdp_mmu.c | 48 ++++++++++++++++++++++++--------------
>   1 file changed, 30 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b6ec2f112c26..8db262440d5c 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -955,6 +955,31 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>   	return ret;
>   }
>   
> +static bool tdp_mmu_populate_nonleaf(
> +	struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
> +{
> +	struct kvm_mmu_page *sp;
> +	u64 *child_pt;
> +	u64 new_spte;
> +
> +	WARN_ON(is_shadow_present_pte(iter->old_spte));
> +	WARN_ON(is_removed_spte(iter->old_spte));
> +
> +	sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1);
> +	child_pt = sp->spt;
> +
> +	new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);
> +
> +	if (!tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte)) {
> +		tdp_mmu_free_sp(sp);
> +		return false;
> +	}
> +
> +	tdp_mmu_link_page(vcpu->kvm, sp, account_nx);
> +	trace_kvm_mmu_get_page(sp, true);
> +	return true;
> +}
> +
>   /*
>    * Handle a TDP page fault (NPT/EPT violation/misconfiguration) by installing
>    * page tables and SPTEs to translate the faulting guest physical address.
> @@ -963,9 +988,6 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   {
>   	struct kvm_mmu *mmu = vcpu->arch.mmu;
>   	struct tdp_iter iter;
> -	struct kvm_mmu_page *sp;
> -	u64 *child_pt;
> -	u64 new_spte;
>   	int ret;
>   
>   	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -1000,6 +1022,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   		}
>   
>   		if (!is_shadow_present_pte(iter.old_spte)) {
> +			bool account_nx;
> +
>   			/*
>   			 * If SPTE has been frozen by another thread, just
>   			 * give up and retry, avoiding unnecessary page table
> @@ -1008,22 +1032,10 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   			if (is_removed_spte(iter.old_spte))
>   				break;
>   
> -			sp = alloc_tdp_mmu_page(vcpu, iter.gfn, iter.level - 1);
> -			child_pt = sp->spt;
> -
> -			new_spte = make_nonleaf_spte(child_pt,
> -						     !shadow_accessed_mask);
> -
> -			if (tdp_mmu_set_spte_atomic(vcpu->kvm, &iter, new_spte)) {
> -				tdp_mmu_link_page(vcpu->kvm, sp,
> -						  fault->huge_page_disallowed &&
> -						  fault->req_level >= iter.level);
> -
> -				trace_kvm_mmu_get_page(sp, true);
> -			} else {
> -				tdp_mmu_free_sp(sp);
> +			account_nx = fault->huge_page_disallowed &&
> +				fault->req_level >= iter.level;
> +			if (!tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
>   				break;
> -			}
>   		}
>   	}
>   


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-03-04 19:49 ` [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page isaku.yamahata
@ 2022-04-05 14:58   ` Paolo Bonzini
  2022-04-06 23:43   ` Kai Huang
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 14:58 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a private pointer to kvm_mmu_page for private EPT.
> 
> To resolve a KVM page fault on a private GPA, an additional page will be
> allocated for the Secure EPT in addition to the private EPT.  Add a memory
> allocator for it and top up that allocator before resolving a KVM page fault,
> similar to the shared EPT page.  For the TDP MMU, allocation of that memory
> will be done by alloc_tdp_mmu_page(), and freeing will be done on behalf of
> kvm_tdp_mmu_zap_all() called by kvm_mmu_zap_all().  A private EPT page needs
> to carry one more page used for the Secure EPT in addition to the private EPT
> page.  Add a private pointer to struct kvm_mmu_page for that purpose, and add
> helper functions to allocate/free a page for the Secure EPT.  Also add helper
> functions to check whether a given kvm_mmu_page is private.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  1 +
>   arch/x86/kvm/mmu/mmu.c          |  9 ++++
>   arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
>   arch/x86/kvm/mmu/tdp_mmu.c      |  3 ++
>   4 files changed, 97 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fcab2337819c..0c8cc7d73371 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -689,6 +689,7 @@ struct kvm_vcpu_arch {
>   	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>   	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
>   	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	struct kvm_mmu_memory_cache mmu_private_sp_cache;
>   
>   	/*
>   	 * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e9847b1124b..8def8b97978f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -758,6 +758,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
>   	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
>   	int start, end, i, r;
>   
> +	if (kvm_gfn_stolen_mask(vcpu->kvm)) {
> +		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> +					       PT64_ROOT_MAX_LEVEL);
> +		if (r)
> +			return r;
> +	}
> +
>   	if (shadow_init_value)
>   		start = kvm_mmu_memory_cache_nr_free_objects(mc);
>   
> @@ -799,6 +806,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
>   {
>   	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
>   	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
> +	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_sp_cache);
>   	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_gfn_array_cache);
>   	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
>   }
> @@ -1791,6 +1799,7 @@ static struct kvm_mmu_page *kvm_mmu_alloc_page(struct kvm_vcpu *vcpu, int direct
>   	if (!direct)
>   		sp->gfns = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_gfn_array_cache);
>   	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> +	kvm_mmu_init_private_sp(sp);
>   
>   	/*
>   	 * active_mmu_pages must be a FIFO list, as kvm_zap_obsolete_pages()
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index da6166b5c377..80f7a74a71dc 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -53,6 +53,10 @@ struct kvm_mmu_page {
>   	u64 *spt;
>   	/* hold the gfn of each spte inside spt */
>   	gfn_t *gfns;
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> +	/* associated private shadow page, e.g. SEPT page */
> +	void *private_sp;
> +#endif
>   	/* Currently serving as active root */
>   	union {
>   		int root_count;
> @@ -104,6 +108,86 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
>   	return kvm_mmu_role_as_id(sp->role);
>   }
>   
> +/*
> + * TDX vcpu allocates page for root Secure EPT page and assigns to CPU secure
> + * EPT pointer.  KVM doesn't need to allocate and link to the secure EPT.
> + * Dummy value to make is_pivate_sp() return true.
> + */
> +#define KVM_MMU_PRIVATE_SP_ROOT	((void *)1)
> +
> +#ifdef CONFIG_KVM_MMU_PRIVATE
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> +	return !!sp->private_sp;
> +}
> +
> +static inline bool is_private_spte(u64 *sptep)
> +{
> +	return is_private_sp(sptep_to_sp(sptep));
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> +	return sp->private_sp;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp)
> +{
> +	sp->private_sp = NULL;
> +}
> +
> +/* Valid sp->role.level is required. */
> +static inline void kvm_mmu_alloc_private_sp(struct kvm_vcpu *vcpu,
> +					struct kvm_mmu_page *sp)
> +{
> +	if (vcpu->arch.mmu->shadow_root_level == sp->role.level)
> +		sp->private_sp = KVM_MMU_PRIVATE_SP_ROOT;
> +	else
> +		sp->private_sp =
> +			kvm_mmu_memory_cache_alloc(
> +				&vcpu->arch.mmu_private_sp_cache);
> +	/*
> +	 * Because mmu_private_sp_cache is topped up before staring kvm page
> +	 * fault resolving, the allocation above shouldn't fail.
> +	 */
> +	WARN_ON_ONCE(!sp->private_sp);
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> +	if (sp->private_sp != KVM_MMU_PRIVATE_SP_ROOT)
> +		free_page((unsigned long)sp->private_sp);
> +}
> +#else
> +static inline bool is_private_sp(struct kvm_mmu_page *sp)
> +{
> +	return false;
> +}
> +
> +static inline bool is_private_spte(u64 *sptep)
> +{
> +	return false;
> +}
> +
> +static inline void *kvm_mmu_private_sp(struct kvm_mmu_page *sp)
> +{
> +	return NULL;
> +}
> +
> +static inline void kvm_mmu_init_private_sp(struct kvm_mmu_page *sp)
> +{
> +}
> +
> +static inline void kvm_mmu_alloc_private_sp(struct kvm_vcpu *vcpu,
> +					struct kvm_mmu_page *sp)
> +{
> +}
> +
> +static inline void kvm_mmu_free_private_sp(struct kvm_mmu_page *sp)
> +{
> +}
> +#endif
> +
>   static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
>   {
>   	/*
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 8db262440d5c..a68f3a22836b 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -59,6 +59,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   
>   static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
>   {
> +	if (is_private_sp(sp))
> +		kvm_mmu_free_private_sp(sp);
>   	free_page((unsigned long)sp->spt);
>   	kmem_cache_free(mmu_page_header_cache, sp);
>   }
> @@ -184,6 +186,7 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>   	sp->role.word = page_role_for_level(vcpu, level).word;
>   	sp->gfn = gfn;
>   	sp->tdp_mmu_page = true;
> +	kvm_mmu_init_private_sp(sp);
>   
>   	trace_kvm_mmu_get_page(sp, true);
>   

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 049/104] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2022-03-04 19:49 ` [RFC PATCH v5 049/104] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
@ 2022-04-05 15:15   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:15 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>   static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
> +					  bool private_spte,
>   					  u64 old_spte, u64 new_spte, int level)
>   {
>   	bool pfn_changed;
>   	struct kvm_memory_slot *slot;
>   
> +	/*
> +	 * TDX doesn't support live migration.  Never mark private page as
> +	 * dirty in log-dirty bitmap, since it's not possible for userspace
> +	 * hypervisor to live migrate private page anyway.
> +	 */
> +	if (private_spte)
> +		return;

This should not be needed, patch 35 will block it anyway because 
kvm_slot_dirty_track_enabled() will return false.

> @@ -1215,7 +1227,23 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
>   	 * into this helper allow blocking; it'd be dead, wasteful code.
>   	 */
>   	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> -		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> +		/*
> +		 * For TDX shared mapping, set GFN shared bit to the range,
> +		 * so the handler() doesn't need to set it, to avoid duplicated
> +		 * code in multiple handler()s.
> +		 */
> +		gfn_t start;
> +		gfn_t end;
> +
> +		if (is_private_sp(root)) {
> +			start = kvm_gfn_private(kvm, range->start);
> +			end = kvm_gfn_private(kvm, range->end);
> +		} else {
> +			start = kvm_gfn_shared(kvm, range->start);
> +			end = kvm_gfn_shared(kvm, range->end);
> +		}

I think this could be a separate function kvm_gfn_for_root(kvm, root, ...).
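
For instance (untested sketch, using the helpers this series already adds):

static gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
			      gfn_t gfn)
{
	return is_private_sp(root) ? kvm_gfn_private(kvm, gfn) :
				     kvm_gfn_shared(kvm, gfn);
}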

> @@ -1237,6 +1265,15 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>   	if (!is_accessed_spte(iter->old_spte))
>   		return false;
>   
> +	/*
> +	 * First TDX generation doesn't support clearing A bit for private
> +	 * mapping, since there's no secure EPT API to support it.  However
> +	 * it's a legitimate request for TDX guest, so just return w/o a
> +	 * WARN().
> +	 */
> +	if (is_private_spte(iter->sptep))
> +		return false;

Please add instead a "bool only_shared" argument to 
kvm_tdp_mmu_handle_gfn, since you can check "only_shared && 
is_private_sp(root)" just once (instead of checking it once per PTE).
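
That is, roughly (untested, reusing the kvm_gfn_for_root() idea from my other
reply, and assuming the handler call keeps its current shape):

	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
		if (only_shared && is_private_sp(root))
			continue;

		start = kvm_gfn_for_root(kvm, root, range->start);
		end = kvm_gfn_for_root(kvm, root, range->end);

		tdp_root_for_each_leaf_pte(iter, root, start, end)
			ret |= handler(kvm, &iter, range);
	}

with kvm_tdp_mmu_age_gfn_range() and friends passing only_shared = true.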

>   	new_spte = iter->old_spte;
>   
>   	if (spte_ad_enabled(new_spte)) {
> @@ -1281,6 +1318,13 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>   	/* Huge pages aren't expected to be modified without first being zapped. */
>   	WARN_ON(pte_huge(range->pte) || range->start + 1 != range->end);
>   
> +	/*
> +	 * .change_pte() callback should not happen for private page, because
> +	 * for now TDX private pages are pinned during VM's life time.
> +	 */
> +	if (WARN_ON(is_private_spte(iter->sptep)))
> +		return false;
> +
>   	if (iter->level != PG_LEVEL_4K ||
>   	    !is_shadow_present_pte(iter->old_spte))
>   		return false;
> @@ -1332,6 +1376,16 @@ static bool wrprot_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   	u64 new_spte;
>   	bool spte_set = false;
>   
> +	/*
> +	 * First TDX generation doesn't support write protecting private
> +	 * mappings, since there's no secure EPT API to support it.  It
> +	 * is a bug to reach here for TDX guest.
> +	 */
> +	if (WARN_ON(is_private_sp(root)))
> +		return spte_set;

Isn't this function unreachable even for the shared address range?  If 
so, this WARN should be in kvm_tdp_mmu_wrprot_slot, and also it should 
check if !kvm_arch_dirty_log_supported(kvm).

> +	start = kvm_gfn_shared(kvm, start);
> +	end = kvm_gfn_shared(kvm, end);

... and then these two lines are unnecessary.
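
In kvm_tdp_mmu_wrprot_slot() that could then be as simple as (untested sketch,
assuming the kvm_arch_dirty_log_supported() check mentioned above exists):

	if (WARN_ON_ONCE(!kvm_arch_dirty_log_supported(kvm)))
		return false;

at the top of the function, leaving wrprot_gfn_range() itself untouched.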

>   	rcu_read_lock();
>   
>   	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
> @@ -1398,6 +1452,16 @@ static bool clear_dirty_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   	u64 new_spte;
>   	bool spte_set = false;
>   
> +	/*
> +	 * First TDX generation doesn't support clearing dirty bit,
> +	 * since there's no secure EPT API to support it.  It is a
> +	 * bug to reach here for TDX guest.
> +	 */
> +	if (WARN_ON(is_private_sp(root)))
> +		return spte_set;

Same here, can you check it in kvm_tdp_mmu_clear_dirty_slot?

> +	start = kvm_gfn_shared(kvm, start);
> +	end = kvm_gfn_shared(kvm, end);

Same here.

>   	rcu_read_lock();
>   
>   	tdp_root_for_each_leaf_pte(iter, root, start, end) {
> @@ -1467,6 +1531,15 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
>   	struct tdp_iter iter;
>   	u64 new_spte;
>   
> +	/*
> +	 * First TDX generation doesn't support clearing dirty bit,
> +	 * since there's no secure EPT API to support it.  It is a
> +	 * bug to reach here for TDX guest.
> +	 */
> +	if (WARN_ON(is_private_sp(root)))
> +		return;

IIRC this is reachable from userspace, so WARN_ON is not possible.  But, 
again please check

	if (!kvm_arch_dirty_log_supported(kvm))
		return;

in kvm_tdp_mmu_clear_dirty_pt_masked.

> +	gfn = kvm_gfn_shared(kvm, gfn);

Also unnecessary, I think.

>   	rcu_read_lock();
>   
>   	tdp_root_for_each_leaf_pte(iter, root, gfn + __ffs(mask),
> @@ -1530,6 +1603,16 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>   	struct tdp_iter iter;
>   	kvm_pfn_t pfn;
>   
> +	/*
> +	 * This should only be reachable in case of log-dirty, which TD
> +	 * private mapping doesn't support so far.  Give a WARN() if it
> +	 * hits private mapping.
> +	 */
> +	if (WARN_ON(is_private_sp(root)))
> +		return;
> +	start = kvm_gfn_shared(kvm, start);
> +	end = kvm_gfn_shared(kvm, end);

I think this is not a big deal and you can leave it as is. 
Alternatively, please move the WARN to 
kvm_tdp_mmu_zap_collapsible_sptes().  It is also only reachable if you 
can set KVM_MEM_LOG_DIRTY_PAGES in the first place.

Paolo

>   	rcu_read_lock();
>   
>   	tdp_root_for_each_pte(iter, root, start, end) {
> @@ -1543,8 +1626,9 @@ static void zap_collapsible_spte_range(struct kvm *kvm,
>   
>   		pfn = spte_to_pfn(iter.old_spte);
>   		if (kvm_is_reserved_pfn(pfn) ||
> -		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot, iter.gfn,
> -							    pfn, PG_LEVEL_NUM))
> +		    iter.level >= kvm_mmu_max_mapping_level(kvm, slot,
> +			    tdp_iter_gfn_unalias(kvm, &iter), pfn,
> +			    PG_LEVEL_NUM))
>   			continue;
>   
>   		/* Note, a successful atomic zap also does a remote TLB flush. */
> @@ -1590,6 +1674,14 @@ static bool write_protect_gfn(struct kvm *kvm, struct kvm_mmu_page *root,
>   
>   	BUG_ON(min_level > KVM_MAX_HUGEPAGE_LEVEL);
>   
> +	/*
> +	 * First TDX generation doesn't support write protecting private
> +	 * mappings, since there's no secure EPT API to support it.  It
> +	 * is a bug to reach here for TDX guest.
> +	 */
> +	if (WARN_ON(is_private_sp(root)))
> +		return spte_set;
> +
>   	rcu_read_lock();
>   
>   	for_each_tdp_pte_min_level(iter, root->spt, root->role.level,


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-03-04 19:48 ` [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis isaku.yamahata
@ 2022-04-05 15:25   ` Paolo Bonzini
  2022-04-08 18:46     ` Isaku Yamahata
  2022-04-06 11:06   ` Kai Huang
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:25 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> +	if (enable_ept) {
> +		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
>   		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> -				      cpu_has_vmx_ept_execute_only());
> +				      cpu_has_vmx_ept_execute_only(), init_value);
> +		kvm_mmu_set_spte_init_value(init_value);
> +	}

I think kvm-intel.ko should use VMX_EPT_SUPPRESS_VE_BIT unconditionally 
as the init value.  The bit is ignored anyway if the "EPT-violation #VE" 
execution control is 0.  Otherwise looks good, but I have a couple more 
crazy ideas:

1) there could even be a test mode where KVM enables the execution 
control, traps #VE in the exception bitmap, and shouts loudly if it gets 
a #VE.  That might avoid hard-to-find bugs due to forgetting about 
VMX_EPT_SUPPRESS_VE_BIT.

2) or even, perhaps the init_value for the TDP MMU could set bit 63 
_unconditionally_, because KVM always sets the NX bit on AMD hardware. 
That would remove the whole infrastructure to keep shadow_init_value, 
because it would be constant 0 in mmu.c and constant BIT(63) in tdp_mmu.c.
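
Going back to the first point, with the suppress-#VE bit set unconditionally
the hunk above would simply become (untested sketch, keeping the new
kvm_mmu_set_* signatures from this patch):

	if (enable_ept) {
		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
				      cpu_has_vmx_ept_execute_only(),
				      VMX_EPT_SUPPRESS_VE_BIT);
		kvm_mmu_set_spte_init_value(VMX_EPT_SUPPRESS_VE_BIT);
	}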

Sean, what do you think?

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
  2022-03-04 19:49 ` [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
@ 2022-04-05 15:32   ` Paolo Bonzini
  2022-04-06 23:28     ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:32 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>   	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
>   
> +	vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
>   	vcpu->arch.cr0_guest_owned_bits = -1ul;
>   	vcpu->arch.cr4_guest_owned_bits = -1ul;
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 45e8a02e99bf..89d04cd64cd0 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -10084,7 +10084,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   	if (vcpu->arch.guest_fpu.xfd_err)
>   		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>   
> -	if (unlikely(vcpu->arch.switch_db_regs)) {
> +	if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
>   		set_debugreg(0, 7);
>   		set_debugreg(vcpu->arch.eff_db[0], 0);
>   		set_debugreg(vcpu->arch.eff_db[1], 1);

I'm confused.  I'm probably missing something obvious, but where does
KVM_DEBUGREG_AUTO_SWITCH affect the behavior of KVM?  vcpu_enter_guest
would still write %db0-%db3 if KVM_DEBUGREG_BP_ENABLED is set, for
example.

Do you mean:

	if (unlikely(vcpu->arch.switch_db_regs) &&
	    !unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH)) {

?

Paolo
  
> @@ -10126,6 +10126,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
>   	 */
>   	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
>   		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
> +		WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
>   		static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
>   		kvm_update_dr0123(vcpu);
>   		kvm_update_dr7(vcpu);


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait
  2022-03-04 19:49 ` [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
@ 2022-04-05 15:33   ` Paolo Bonzini
  2022-04-08 16:36   ` Sean Christopherson
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:33 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Add an option to skip the IRR check in kvm_wait_lapic_expire().  This
> will be used by TDX to wait if there is an outstanding notification for
> a TD, i.e. a virtual interrupt is being triggered via posted interrupt
> processing.  KVM TDX doesn't emulate PI processing, i.e. there will
> never be a bit set in IRR/ISR, so the default behavior for APICv of
> querying the IRR doesn't work as intended.

Would be better to explain "doesn't work as intended" more verbosely. 
Otherwise,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/lapic.c   | 4 ++--
>   arch/x86/kvm/lapic.h   | 2 +-
>   arch/x86/kvm/svm/svm.c | 2 +-
>   arch/x86/kvm/vmx/vmx.c | 2 +-
>   4 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9322e6340a74..d49f029ef0e3 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1620,12 +1620,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
>   		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
>   }
>   
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
>   {
>   	if (lapic_in_kernel(vcpu) &&
>   	    vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
>   	    vcpu->arch.apic->lapic_timer.timer_advance_ns &&
> -	    lapic_timer_int_injected(vcpu))
> +	    (force_wait || lapic_timer_int_injected(vcpu)))
>   		__kvm_wait_lapic_expire(vcpu);
>   }
>   EXPORT_SYMBOL_GPL(kvm_wait_lapic_expire);
> diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
> index 2b44e533fc8d..2a0119ef9e96 100644
> --- a/arch/x86/kvm/lapic.h
> +++ b/arch/x86/kvm/lapic.h
> @@ -233,7 +233,7 @@ static inline int kvm_lapic_latched_init(struct kvm_vcpu *vcpu)
>   
>   bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
>   
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait);
>   
>   void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
>   			      unsigned long *vcpu_bitmap);
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index c7eec23e9ebe..a46415845f48 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -3766,7 +3766,7 @@ static __no_kcsan fastpath_t svm_vcpu_run(struct kvm_vcpu *vcpu)
>   	clgi();
>   	kvm_load_guest_xsave_state(vcpu);
>   
> -	kvm_wait_lapic_expire(vcpu);
> +	kvm_wait_lapic_expire(vcpu, false);
>   
>   	/*
>   	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 00f88aa25047..9b7bd52d19a9 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6838,7 +6838,7 @@ fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
>   	if (enable_preemption_timer)
>   		vmx_update_hv_timer(vcpu);
>   
> -	kvm_wait_lapic_expire(vcpu);
> +	kvm_wait_lapic_expire(vcpu, false);
>   
>   	/*
>   	 * If this vCPU has touched SPEC_CTRL, restore the guest's value if


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 077/104] KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c
  2022-03-04 19:49 ` [RFC PATCH v5 077/104] KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c isaku.yamahata
@ 2022-04-05 15:36   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:36 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Yuan Yao <yuan.yao@intel.com>
> 
> The helper function, vcpu_to_pi_desc(), is defined to get the posted
> interrupt descriptor from the vcpu.  There is one place that doesn't use it
> and instead references vmx_vcpu->pi_desc directly.  It's inconsistent.
> 
> For TDX, TDX vcpu structure will be defined and the helper function,
> vcpu_to_pi_desc(), will return tdx_vcpu->pi_desc for TDX case instead of
> vmx_vcpu->pi_desc.  The direct reference to vmx_vcpu->pi_desc doesn't work
> for TDX.
> 
> Replace vmx_vcpu->pi_desc with the helper function, vcpu_pi_desc() for
> consistency and TDX.
> 
> Signed-off-by: Yuan Yao <yuan.yao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/posted_intr.c | 2 +-
>   arch/x86/kvm/vmx/x86_ops.h     | 3 +++
>   2 files changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
> index aa1fe9085d77..c8a81c916eed 100644
> --- a/arch/x86/kvm/vmx/posted_intr.c
> +++ b/arch/x86/kvm/vmx/posted_intr.c
> @@ -311,7 +311,7 @@ int pi_update_irte(struct kvm *kvm, unsigned int host_irq, uint32_t guest_irq,
>   			continue;
>   		}
>   
> -		vcpu_info.pi_desc_addr = __pa(&to_vmx(vcpu)->pi_desc);
> +		vcpu_info.pi_desc_addr = __pa(vcpu_to_pi_desc(vcpu));
>   		vcpu_info.vector = irq.vector;
>   
>   		trace_kvm_pi_irte_update(host_irq, vcpu->vcpu_id, e->gsi,
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index aae0f4449ec5..0f1a28f67e60 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -147,6 +147,9 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_put(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   
> +void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
> +int tdx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector);
> +
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>   

Applied the first hunk, the second should be squashed somewhere else.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 100/104] KVM: TDX: Silently discard SMI request
  2022-03-04 19:49 ` [RFC PATCH v5 100/104] KVM: TDX: Silently discard SMI request isaku.yamahata
@ 2022-04-05 15:41   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:41 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX doesn't support system-management mode (SMM) and system-management
> interrupt (SMI) in guest TDs.  Because guest state (vcpu state, memory
> state) is protected, changing it, e.g. injecting an SMI or switching the vcpu
> into SMM, must go through the TDX module APIs.  The TDX module
> doesn't provide a way for the VMM to inject an SMI into a guest TD, nor a way
> to switch the guest vcpu mode into SMM.
> 
> We have two options in KVM when handling SMM or SMI in the guest TD or the
> device model (e.g. QEMU): 1) silently ignore the request or 2) return a
> meaningful error.
> 
> For simplicity, we implemented the option 1).

Please also:

1) return zero from vmx_has_emulated_msr(MSR_IA32_SMBASE) for TDX 
virtual machines.

2) do a check for static_call(kvm_x86_has_emulated_msr)(kvm, 
MSR_IA32_SMBASE) in kvm_vcpu_ioctl_smi and __apic_accept_irq.

3) WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
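
A minimal sketch of what (1)-(3) could look like, purely for illustration;
the wrapper name, exact hunks and error code below are assumptions, not the
final patches:

/* (1) vmx/main.c: report SMBASE as not emulated for a TD (sketch). */
static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
{
	if (kvm && is_td(kvm) && index == MSR_IA32_SMBASE)
		return false;

	return vmx_has_emulated_msr(kvm, index);
}

/* (2) x86.c: reject the SMI ioctl when SMBASE isn't emulated (sketch). */
static int kvm_vcpu_ioctl_smi(struct kvm_vcpu *vcpu)
{
	if (!static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
		return -ENOTTY;	/* the error code here is an assumption */

	kvm_make_request(KVM_REQ_SMI, vcpu);
	return 0;
}

/* (3) vmx/tdx.c: this path should never be reached for a TD (sketch). */
int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
{
	WARN_ON_ONCE(1);
	return false;
}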

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI
  2022-03-04 19:49 ` [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
@ 2022-04-05 15:48   ` Paolo Bonzini
  2022-04-05 17:53     ` Tom Lendacky
  2022-04-07 11:09     ` Xiaoyao Li
  0 siblings, 2 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:48 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson, Tom Lendacky

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> +		if (kvm_init_sipi_unsupported(vcpu->kvm))
> +			/*
> +			 * TDX doesn't support INIT.  Ignore INIT event.  In the
> +			 * case of SIPI, the callback of
> +			 * vcpu_deliver_sipi_vector ignores it.
> +			 */
>   			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> -		else
> -			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> +		else {
> +			kvm_vcpu_reset(vcpu, true);
> +			if (kvm_vcpu_is_bsp(apic->vcpu))
> +				vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +			else
> +				vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> +		}

Should you check vcpu->arch.guest_state_protected instead of 
special-casing TDX?  KVM_APIC_INIT is not valid for SEV-ES either, if I 
remember correctly.
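
For reference, a rough sketch of that alternative, only to illustrate the
suggestion (whether kvm_vcpu_reset() can really be skipped for all protected
guests is discussed below in this thread):

	if (vcpu->arch.guest_state_protected)
		/*
		 * INIT is ignored for protected guests; in the case of
		 * SIPI, the vcpu_deliver_sipi_vector callback ignores it.
		 */
		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
	else {
		kvm_vcpu_reset(vcpu, true);
		if (kvm_vcpu_is_bsp(apic->vcpu))
			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
		else
			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
	}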

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state
  2022-03-04 19:49 ` [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
@ 2022-04-05 15:56   ` Paolo Bonzini
  2022-04-08  3:50     ` Isaku Yamahata
  2022-04-12  6:49   ` Xiaoyao Li
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-05 15:56 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> TDX protects TDX guest state from the VMM.  Implement access methods for
> TDX guest state that ignore accesses or return zero.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

For most of these, it would be interesting to see which paths actually 
can be hit.  For SEV, it's all cut out by

         if (vcpu->arch.guest_state_protected)
                 return 0;

in functions such as __set_sregs_common.  Together with the fact that 
TDX does not get to e.g. handle_set_cr0, this should prevent most such 
calls from happening.  So most of these should be KVM_BUG_ON or WARN_ON, 
not just returns.
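
For example, a sketch only, taking vt_set_cr0 from the patch below (auditing
which paths can actually be hit is still needed):

static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
{
	/*
	 * CR0 writes should never reach here for a TD; flag it instead of
	 * silently returning.
	 */
	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
		return;

	vmx_set_cr0(vcpu, cr0);
}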

Thanks,

Paolo

> ---
>   arch/x86/kvm/vmx/main.c    | 465 +++++++++++++++++++++++++++++++++----
>   arch/x86/kvm/vmx/tdx.c     |  44 ++++
>   arch/x86/kvm/vmx/x86_ops.h |  17 ++
>   3 files changed, 483 insertions(+), 43 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index de9b4a270f20..0515998f7fa5 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -228,6 +228,46 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
>   	vmx_enable_smi_window(vcpu);
>   }
>   
> +static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
> +				       void *insn, int insn_len)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return false;
> +
> +	return vmx_can_emulate_instruction(vcpu, emul_type, insn, insn_len);
> +}
> +
> +static int vt_check_intercept(struct kvm_vcpu *vcpu,
> +				 struct x86_instruction_info *info,
> +				 enum x86_intercept_stage stage,
> +				 struct x86_exception *exception)
> +{
> +	/*
> +	 * This call back is triggered by the x86 instruction emulator. TDX
> +	 * doesn't allow guest memory inspection.
> +	 */
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return X86EMUL_UNHANDLEABLE;
> +
> +	return vmx_check_intercept(vcpu, info, stage, exception);
> +}
> +
> +static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return true;
> +
> +	return vmx_apic_init_signal_blocked(vcpu);
> +}
> +
> +static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_set_virtual_apic_mode(vcpu);
> +
> +	return vmx_set_virtual_apic_mode(vcpu);
> +}
> +
>   static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -236,6 +276,31 @@ static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>   	return vmx_apicv_post_state_restore(vcpu);
>   }
>   
> +static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	return vmx_hwapic_irr_update(vcpu, max_irr);
> +}
> +
> +static void vt_hwapic_isr_update(struct kvm_vcpu *vcpu, int max_isr)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	return vmx_hwapic_isr_update(vcpu, max_isr);
> +}
> +
> +static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
> +{
> +	/* TDX doesn't support L2 at the moment. */
> +	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
> +		return false;
> +
> +	return vmx_guest_apic_has_interrupt(vcpu);
> +}
> +
>   static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -272,6 +337,179 @@ static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
>   	kvm_vcpu_deliver_sipi_vector(vcpu, vector);
>   }
>   
> +static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	return vmx_vcpu_after_set_cpuid(vcpu);
> +}
> +
> +static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_update_exception_bitmap(vcpu);
> +}
> +
> +static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_segment_base(vcpu, seg);
> +
> +	return vmx_get_segment_base(vcpu, seg);
> +}
> +
> +static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
> +			      int seg)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_segment(vcpu, var, seg);
> +
> +	vmx_get_segment(vcpu, var, seg);
> +}
> +
> +static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
> +			      int seg)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_segment(vcpu, var, seg);
> +}
> +
> +static int vt_get_cpl(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_cpl(vcpu);
> +
> +	return vmx_get_cpl(vcpu);
> +}
> +
> +static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_get_cs_db_l_bits(vcpu, db, l);
> +}
> +
> +static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_cr0(vcpu, cr0);
> +}
> +
> +static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_cr4(vcpu, cr4);
> +}
> +
> +static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return 0;
> +
> +	return vmx_set_efer(vcpu, efer);
> +}
> +
> +static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		memset(dt, 0, sizeof(*dt));
> +		return;
> +	}
> +
> +	vmx_get_idt(vcpu, dt);
> +}
> +
> +static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_idt(vcpu, dt);
> +}
> +
> +static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		memset(dt, 0, sizeof(*dt));
> +		return;
> +	}
> +
> +	vmx_get_gdt(vcpu, dt);
> +}
> +
> +static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_gdt(vcpu, dt);
> +}
> +
> +static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_dr7(vcpu, val);
> +}
> +
> +static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * MOV-DR exiting is always cleared for TD guest, even in debug mode.
> +	 * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
> +	 * reach here for TD vcpu.
> +	 */
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return;
> +
> +	vmx_sync_dirty_debug_regs(vcpu);
> +}
> +
> +static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_cache_reg(vcpu, reg);
> +		return;
> +	}
> +
> +	vmx_cache_reg(vcpu, reg);
> +}
> +
> +static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_rflags(vcpu);
> +
> +	return vmx_get_rflags(vcpu);
> +}
> +
> +static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_rflags(vcpu, rflags);
> +}
> +
> +static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return false;
> +
> +	return vmx_get_if_flag(vcpu);
> +}
> +
>   static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -388,6 +626,15 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
>   	return vmx_get_interrupt_shadow(vcpu);
>   }
>   
> +static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
> +				  unsigned char *hypercall)
> +{
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return;
> +
> +	vmx_patch_hypercall(vcpu, hypercall);
> +}
> +
>   static void vt_inject_irq(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -396,6 +643,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu)
>   	vmx_inject_irq(vcpu);
>   }
>   
> +static void vt_queue_exception(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_queue_exception(vcpu);
> +}
> +
>   static void vt_cancel_injection(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -428,6 +683,130 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
>   	vmx_request_immediate_exit(vcpu);
>   }
>   
> +static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_update_cr8_intercept(vcpu, tpr, irr);
> +}
> +
> +static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
> +{
> +	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
> +		return;
> +
> +	vmx_set_apic_access_page_addr(vcpu);
> +}
> +
> +static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
> +{
> +	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
> +		return;
> +
> +	vmx_refresh_apicv_exec_ctrl(vcpu);
> +}
> +
> +static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> +}
> +
> +static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
> +{
> +	if (is_td(kvm))
> +		return 0;
> +
> +	return vmx_set_tss_addr(kvm, addr);
> +}
> +
> +static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
> +{
> +	if (is_td(kvm))
> +		return 0;
> +
> +	return vmx_set_identity_map_addr(kvm, ident_addr);
> +}
> +
> +static u64 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		if (is_mmio)
> +			return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> +		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> +	}
> +
> +	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
> +}
> +
> +static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
> +{
> +	/* TDX doesn't support L2 guest at the moment. */
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return 0;
> +
> +	return vmx_get_l2_tsc_offset(vcpu);
> +}
> +
> +static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
> +{
> +	/* TDX doesn't support L2 guest at the moment. */
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return 0;
> +
> +	return vmx_get_l2_tsc_multiplier(vcpu);
> +}
> +
> +static void vt_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
> +{
> +	/* In TDX, tsc offset can't be changed. */
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_write_tsc_offset(vcpu, offset);
> +}
> +
> +static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
> +{
> +	/* In TDX, tsc multiplier can't be changed. */
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_write_tsc_multiplier(vcpu, multiplier);
> +}
> +
> +static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_update_cpu_dirty_logging(vcpu);
> +}
> +
> +#ifdef CONFIG_X86_64
> +static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
> +			      bool *expired)
> +{
> +	/* VMX-preemption timer isn't available for TDX. */
> +	if (is_td_vcpu(vcpu))
> +		return -EINVAL;
> +
> +	return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
> +}
> +
> +static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
> +{
> +	/* VMX-preemption timer can't be set.  See vt_set_hv_timer(). */
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return;
> +
> +	vmx_cancel_hv_timer(vcpu);
> +}
> +#endif
> +
>   static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
>   			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
>   {
> @@ -480,29 +859,29 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.vcpu_load = vt_vcpu_load,
>   	.vcpu_put = vt_vcpu_put,
>   
> -	.update_exception_bitmap = vmx_update_exception_bitmap,
> +	.update_exception_bitmap = vt_update_exception_bitmap,
>   	.get_msr_feature = vmx_get_msr_feature,
>   	.get_msr = vt_get_msr,
>   	.set_msr = vt_set_msr,
> -	.get_segment_base = vmx_get_segment_base,
> -	.get_segment = vmx_get_segment,
> -	.set_segment = vmx_set_segment,
> -	.get_cpl = vmx_get_cpl,
> -	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
> -	.set_cr0 = vmx_set_cr0,
> +	.get_segment_base = vt_get_segment_base,
> +	.get_segment = vt_get_segment,
> +	.set_segment = vt_set_segment,
> +	.get_cpl = vt_get_cpl,
> +	.get_cs_db_l_bits = vt_get_cs_db_l_bits,
> +	.set_cr0 = vt_set_cr0,
>   	.is_valid_cr4 = vmx_is_valid_cr4,
> -	.set_cr4 = vmx_set_cr4,
> -	.set_efer = vmx_set_efer,
> -	.get_idt = vmx_get_idt,
> -	.set_idt = vmx_set_idt,
> -	.get_gdt = vmx_get_gdt,
> -	.set_gdt = vmx_set_gdt,
> -	.set_dr7 = vmx_set_dr7,
> -	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
> -	.cache_reg = vmx_cache_reg,
> -	.get_rflags = vmx_get_rflags,
> -	.set_rflags = vmx_set_rflags,
> -	.get_if_flag = vmx_get_if_flag,
> +	.set_cr4 = vt_set_cr4,
> +	.set_efer = vt_set_efer,
> +	.get_idt = vt_get_idt,
> +	.set_idt = vt_set_idt,
> +	.get_gdt = vt_get_gdt,
> +	.set_gdt = vt_set_gdt,
> +	.set_dr7 = vt_set_dr7,
> +	.sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
> +	.cache_reg = vt_cache_reg,
> +	.get_rflags = vt_get_rflags,
> +	.set_rflags = vt_set_rflags,
> +	.get_if_flag = vt_get_if_flag,
>   
>   	.tlb_flush_all = vt_flush_tlb_all,
>   	.tlb_flush_current = vt_flush_tlb_current,
> @@ -516,10 +895,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.update_emulated_instruction = vmx_update_emulated_instruction,
>   	.set_interrupt_shadow = vt_set_interrupt_shadow,
>   	.get_interrupt_shadow = vt_get_interrupt_shadow,
> -	.patch_hypercall = vmx_patch_hypercall,
> +	.patch_hypercall = vt_patch_hypercall,
>   	.set_irq = vt_inject_irq,
>   	.set_nmi = vt_inject_nmi,
> -	.queue_exception = vmx_queue_exception,
> +	.queue_exception = vt_queue_exception,
>   	.cancel_injection = vt_cancel_injection,
>   	.interrupt_allowed = vt_interrupt_allowed,
>   	.nmi_allowed = vt_nmi_allowed,
> @@ -527,39 +906,39 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.set_nmi_mask = vt_set_nmi_mask,
>   	.enable_nmi_window = vt_enable_nmi_window,
>   	.enable_irq_window = vt_enable_irq_window,
> -	.update_cr8_intercept = vmx_update_cr8_intercept,
> -	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
> -	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
> -	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
> -	.load_eoi_exitmap = vmx_load_eoi_exitmap,
> +	.update_cr8_intercept = vt_update_cr8_intercept,
> +	.set_virtual_apic_mode = vt_set_virtual_apic_mode,
> +	.set_apic_access_page_addr = vt_set_apic_access_page_addr,
> +	.refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
> +	.load_eoi_exitmap = vt_load_eoi_exitmap,
>   	.apicv_post_state_restore = vt_apicv_post_state_restore,
>   	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
> -	.hwapic_irr_update = vmx_hwapic_irr_update,
> -	.hwapic_isr_update = vmx_hwapic_isr_update,
> -	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
> +	.hwapic_irr_update = vt_hwapic_irr_update,
> +	.hwapic_isr_update = vt_hwapic_isr_update,
> +	.guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
>   	.sync_pir_to_irr = vt_sync_pir_to_irr,
>   	.deliver_interrupt = vt_deliver_interrupt,
>   	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
>   	.apicv_has_pending_interrupt = vt_apicv_has_pending_interrupt,
>   
> -	.set_tss_addr = vmx_set_tss_addr,
> -	.set_identity_map_addr = vmx_set_identity_map_addr,
> -	.get_mt_mask = vmx_get_mt_mask,
> +	.set_tss_addr = vt_set_tss_addr,
> +	.set_identity_map_addr = vt_set_identity_map_addr,
> +	.get_mt_mask = vt_get_mt_mask,
>   
>   	.get_exit_info = vt_get_exit_info,
>   
> -	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
> +	.vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
>   
>   	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
>   
> -	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
> -	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
> -	.write_tsc_offset = vmx_write_tsc_offset,
> -	.write_tsc_multiplier = vmx_write_tsc_multiplier,
> +	.get_l2_tsc_offset = vt_get_l2_tsc_offset,
> +	.get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
> +	.write_tsc_offset = vt_write_tsc_offset,
> +	.write_tsc_multiplier = vt_write_tsc_multiplier,
>   
>   	.load_mmu_pgd = vt_load_mmu_pgd,
>   
> -	.check_intercept = vmx_check_intercept,
> +	.check_intercept = vt_check_intercept,
>   	.handle_exit_irqoff = vt_handle_exit_irqoff,
>   
>   	.request_immediate_exit = vt_request_immediate_exit,
> @@ -567,7 +946,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.sched_in = vt_sched_in,
>   
>   	.cpu_dirty_log_size = PML_ENTITY_NUM,
> -	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> +	.update_cpu_dirty_logging = vt_update_cpu_dirty_logging,
>   
>   	.pmu_ops = &intel_pmu_ops,
>   	.nested_ops = &vmx_nested_ops,
> @@ -576,8 +955,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.start_assignment = vmx_pi_start_assignment,
>   
>   #ifdef CONFIG_X86_64
> -	.set_hv_timer = vmx_set_hv_timer,
> -	.cancel_hv_timer = vmx_cancel_hv_timer,
> +	.set_hv_timer = vt_set_hv_timer,
> +	.cancel_hv_timer = vt_cancel_hv_timer,
>   #endif
>   
>   	.setup_mce = vmx_setup_mce,
> @@ -587,8 +966,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.leave_smm = vt_leave_smm,
>   	.enable_smi_window = vt_enable_smi_window,
>   
> -	.can_emulate_instruction = vmx_can_emulate_instruction,
> -	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
> +	.can_emulate_instruction = vt_can_emulate_instruction,
> +	.apic_init_signal_blocked = vt_apic_init_signal_blocked,
>   	.migrate_timers = vmx_migrate_timers,
>   
>   	.msr_filter_changed = vmx_msr_filter_changed,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 7bbf6271967b..55a6fd218fc7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -3,6 +3,7 @@
>   #include <linux/mmu_context.h>
>   
>   #include <asm/fpu/xcr.h>
> +#include <asm/virtext.h>
>   #include <asm/tdx.h>
>   
>   #include "capabilities.h"
> @@ -1717,6 +1718,49 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
>   	vcpu->arch.smi_pending = false;
>   }
>   
> +void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
> +{
> +	/* Only x2APIC mode is supported for TD. */
> +	WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
> +}
> +
> +int tdx_get_cpl(struct kvm_vcpu *vcpu)
> +{
> +	return 0;
> +}
> +
> +void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> +{
> +	kvm_register_mark_available(vcpu, reg);
> +	switch (reg) {
> +	case VCPU_REGS_RSP:
> +	case VCPU_REGS_RIP:
> +	case VCPU_EXREG_PDPTR:
> +	case VCPU_EXREG_CR0:
> +	case VCPU_EXREG_CR3:
> +	case VCPU_EXREG_CR4:
> +		break;
> +	default:
> +		KVM_BUG_ON(1, vcpu->kvm);
> +		break;
> +	}
> +}
> +
> +unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
> +{
> +	return 0;
> +}
> +
> +u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
> +{
> +	return 0;
> +}
> +
> +void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
> +{
> +	memset(var, 0, sizeof(*var));
> +}
> +
>   static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   {
>   	struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 19d793609cc4..7cd29b586e43 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -163,6 +163,14 @@ int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
>   int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate);
>   int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate);
>   void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
> +void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
> +
> +int tdx_get_cpl(struct kvm_vcpu *vcpu);
> +void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
> +unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
> +bool tdx_is_emulated_msr(u32 index, bool write);
> +u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
> +void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -203,10 +211,19 @@ static inline void tdx_get_exit_info(
>   static inline bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
>   static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>   static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
> +
>   static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
>   static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, char *smstate) { return 0; }
>   static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const char *smstate) { return 0; }
>   static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
> +
> +static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
> +static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
> +static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
> +static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0;}
> +static inline void tdx_get_segment(
> +	struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg) {}
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI
  2022-04-05 15:48   ` Paolo Bonzini
@ 2022-04-05 17:53     ` Tom Lendacky
  2022-04-07 11:09     ` Xiaoyao Li
  1 sibling, 0 replies; 310+ messages in thread
From: Tom Lendacky @ 2022-04-05 17:53 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/5/22 10:48, Paolo Bonzini wrote:
> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>> +        if (kvm_init_sipi_unsupported(vcpu->kvm))
>> +            /*
>> +             * TDX doesn't support INIT.  Ignore INIT event.  In the
>> +             * case of SIPI, the callback of
>> +             * vcpu_deliver_sipi_vector ignores it.
>> +             */
>>               vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> -        else
>> -            vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> +        else {
>> +            kvm_vcpu_reset(vcpu, true);
>> +            if (kvm_vcpu_is_bsp(apic->vcpu))
>> +                vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> +            else
>> +                vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> +        }
> 
> Should you check vcpu->arch.guest_state_protected instead of 
> special-casing TDX?  KVM_APIC_INIT is not valid for SEV-ES either, if I 
> remember correctly.

While the INIT doesn't update any actual state that is in the encrypted 
VMSA, SEV-ES still calls kvm_vcpu_reset() to allow KVM to set any internal 
tracking state, etc. I haven't ever tested SEV-ES where that is bypassed.

Thanks,
Tom

> 
> Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  2022-04-05 12:52   ` Paolo Bonzini
@ 2022-04-06  1:54     ` Xiaoyao Li
  2022-04-07  1:07       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-06  1:54 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
>> Implement a VM-scoped subcommand to get system-wide parameters.  Although
>> these are system-wide parameters, not per-VM, this subcommand is VM-scoped
>> because
>> - The device model needs TDX system-wide parameters after creating a KVM VM.
>> - This subcommand requires the TDX module to be initialized.  For lazy
>>    initialization of the TDX module, a VM-scoped ioctl is better.
> 
> Since there was agreement to install the TDX module on load, please 
> place this ioctl on the /dev/kvm file descriptor.
> 
> At least for SEV, there were cases where the system-wide parameters are 
> needed outside KVM, so it's better to avoid requiring a VM file descriptor.

I don't have a strong preference between a KVM-scope and a VM-scope ioctl.

Initially, we made it KVM-scope and changed it to VM-scope in this
version. Yes, it returns the info from the TDX module, which doesn't vary
per VM. However, what if we want to return different capabilities
(software-controlled capabilities) per VM? Part of the TDX capabilities
serves a role similar to get_supported_cpuid, so making it KVM-wide lacks
the flexibility to return differentiated capabilities for different TDs.


> Thanks,
> 
> Paolo
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid
  2022-03-31 20:15     ` Isaku Yamahata
@ 2022-04-06  1:55       ` Kai Huang
  2022-04-07  1:00         ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-06  1:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, 2022-03-31 at 13:15 -0700, Isaku Yamahata wrote:
> On Thu, Mar 31, 2022 at 02:21:06PM +1300,
> Kai Huang <kai.huang@intel.com> wrote:
> 
> > On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > MKTME keyid is assigned to guest TD.  The memory controller encrypts guest
> > > TD memory with key id.  Add helper functions to allocate/free MKTME keyid
> > > so that TDX KVM assign keyid.
> > 
> > Using MKTME keyid is wrong, at least not accurate I think.  We should use
> > explicitly use "TDX private KeyID", which is clearly documented in the spec:
> >   
> > https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> > 
> > Also, description of IA32_MKTME_KEYID_PARTITIONING MSR clearly says TDX private
> > KeyIDs span the range (NUM_MKTME_KIDS+1) through
> > (NUM_MKTME_KIDS+NUM_TDX_PRIV_KIDS).  So please just use TDX private KeyID here.
> > 
> > 
> > > 
> > > Also export MKTME global keyid that is used to encrypt TDX module and its
> > > memory.
> > 
> > This needs explanation why the global keyID needs to be exported.
> 
> How about the followings?
> 
> TDX private host key id is assigned to guest TD.  The memory controller
> encrypts guest TD memory with the assigned host key id (HIKD).  Add helper
> functions to allocate/free TDX private host key id so that TDX KVM manage
> it.

HIKD -> HKID.  

You may also want to use KeyID in consistent way (KeyID, keyid, key id, etc).
The spec uses KeyID.

> 
> Also export the global TDX private host key id that is used to encrypt TDX
> module, its memory and some dynamic data (e.g. TDR).  When VMM releasing
> encrypted page to reuse it, the page needs to be flushed with the used host
> key id.  VMM needs the global TDX private host key id to flush such pages
> TDX module accesses with the global TDX private host key id.
> 
> 

Fine to me.


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-04-05 13:01     ` Paolo Bonzini
@ 2022-04-06  2:06       ` Xiaoyao Li
  2022-04-06 11:27         ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-06  2:06 UTC (permalink / raw)
  To: Paolo Bonzini, Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/5/2022 9:01 PM, Paolo Bonzini wrote:
> On 3/31/22 06:55, Kai Huang wrote:
>>> +struct kvm_tdx_init_vm {
>>> +    __u32 max_vcpus;
>>> +    __u32 tsc_khz;
>>> +    __u64 attributes;
>>> +    __u64 cpuid;
>> Is it better to append all CPUIDs directly into this structure, 
>> perhaps at end
>> of this structure, to make it more consistent with TD_PARAMS?
>>
>> Also, I think somewhere in commit message or comments we should 
>> explain why
>> CPUIDs are passed here (why existing KVM_SET_CUPID2 is not sufficient).
>>
> 
> Indeed, it would be easier to use the existing cpuid data in struct 
> kvm_vcpu, because right now there is no way to ensure that they are 
> consistent.
> 
> Why is KVM_SET_CPUID2 not enough?  Are there any modifications done by 
> KVM that affect the measurement?

Then we get the situation that KVM_TDX_INIT_VM must be called after 1
vcpu is created. It seems illogical that VM-scope initialization can
still fail after 1 vcpu has already been created successfully.

> Thanks,
> 
> Paolo
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-04-05 13:55     ` Paolo Bonzini
@ 2022-04-06  2:23       ` Kai Huang
  2022-04-06 11:26         ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-06  2:23 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

> 
> > > 
> > >  
> > > -		gfn = gpte_to_gfn(gpte);
> > > +		gfn = gpte_to_gfn(vcpu, gpte);
> > >  		pte_access = sp->role.access;
> > >  		pte_access &= FNAME(gpte_access)(gpte);
> > >  		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
> > 
> > In commit message you mentioned "Don't support stolen bits for shadow EPT" (you
> > actually mean shadow MMU I suppose), yet there's bunch of code change to shadow
> > MMU.
> 
> It's a bit ugly, but it's uglier to keep two versions of gpte_to_gfn.

gpte_to_gfn() is only used in paging_tmpl.h.  Could you elaborate why we need to
keep two versions of it?


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-03-04 19:48 ` [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis isaku.yamahata
  2022-04-05 15:25   ` Paolo Bonzini
@ 2022-04-06 11:06   ` Kai Huang
  2022-04-07  3:05     ` Kai Huang
  2022-04-08 19:12     ` Isaku Yamahata
  1 sibling, 2 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-06 11:06 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:48 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Define the EPT Violation #VE control bit, #VE info VMCS fields, and the
> suppress #VE bit for EPT entries.

It appears only the last one is introduced in this patch.

> 
> TDX will use a different shadow PTE entry value for MMIO from VMX.  Add
> members to kvm_arch and track value for MMIO per-VM.  By using per-VM EPT
> entry value for MMIO, the existing VMX logic is kept working.
> 
> In the case of VMX VM case, the EPT entry for MMIO is non-present PTE
> (present bit cleared) without backing guest physical address (on EPT
> violation, KVM seaching backing guest memory and it finds there is no
> backing guest page.) or the value to trigger EPT misconfiguration.  For
> fast path. Once MMIO is triggered on the EPT entry, the EPT entry is
> updated for the future MMIO.  It allows KVM to understand the memory access
> is for MMIO without searching backing guest pages.). And then KVM parses
> guest instruction to figure out address/value/width for MMIO.

What does "For fast path." meaning?

There are also grammar issues in above paragraph.

> 
> In the case of the guest TD, the guest memory is protected so that VMM
> can't parse guest instruction to trigger EPT violation.  
> 

Even if the VMM can parse the guest instruction, it cannot trigger an EPT
violation.  Only the guest can trigger an EPT violation.

> Instead VMM sets
> up (Shared) EPT to trigger #VE.  When the guest TD issues MMIO, #VE is
> injected.  guest VE handler converts MMIO access into MMIO hypercall to
> pass address/value/width for MMIO to VMM. (or directly paravirtualize MMIO
> into hypercall.)  Then VMM can handle the MMIO hypercall without parsing
> guest instruction.

There are a couple of grammar issues in the above two paragraphs.

> 
> When the guest accesses GPA if "the EPT Violation #VE" control bit is set
> and EPT SUPPRESS VE bit in EPT entry is cleared, #VE, virtualization
> exception, is injected into the guest.  Because the TDX guest vCPU state
> and memory are protected, a VMM can't emulate MMIO by the TDX guest on EPT
> violation by snooping vCPU state and parsing instruction to figure out MMIO
> address and value.  Instead, PV MMIO (MMIO hypercall) is adapted.  On EPT
> violation, CPU injects #VE to guest and the guest converts MMIO instruction
> into PV MMIO.  Or guest directly issues MMIO hypercall.

Isn't this paragraph kinda duplicated with the above paragraph?

> 
> The existing VMX code uses zero as an initial value for EPT entry.  TDX
> will enable EPT-violation #VE VM-execution control and requires suppress VE
> bit cleared in shared EPT entry to inject #VE into the TDX guest.  To keep
> the same behavior for VMX, suppress VE bit needs to be set.  Allow to
> specify an initial value for EPT entry and if TDX is enabled, set initial
> EPT entry value to suppress VE bit set.  EPT-violation #VE VM-execution
> control will be enabled, and For TDX shared EPT suppress VE bit will be
> cleared for TDX shared EPT entry.

Isn't the last paragraph talking about the same thing as you did in patch 37
"KVM: x86/mmu: Allow non-zero init value for shadow PTE"?  That patch already
sets a non-zero initial PTE value.

Btw, please put this patch and patch 37 together since they handle similar
things (right now there are a couple of unrelated patches in the middle).

This patch talks about changing MMIO value/mask from global to per-VM tracking.
Please focus on explaining the rationale behind this -- i.e. unlike a legacy VMX
VM (and legacy AMD VMs), a TD guest requires a different MMIO value/mask in order
to configure the MMIO EPT entry to not generate an EPT misconfiguration, but
instead to inject #VE by setting up a non-present SPTE which causes an EPT
violation with the "Suppress #VE" bit clear.


> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  3 +++
>  arch/x86/include/asm/vmx.h      |  1 +
>  arch/x86/kvm/mmu.h              |  6 ++++--
>  arch/x86/kvm/mmu/mmu.c          | 19 +++++++++++------
>  arch/x86/kvm/mmu/spte.c         | 38 ++++++++++++++++-----------------
>  arch/x86/kvm/mmu/spte.h         |  9 ++++----
>  arch/x86/kvm/mmu/tdp_mmu.c      |  6 +++---
>  arch/x86/kvm/svm/svm.c          |  2 +-
>  arch/x86/kvm/vmx/main.c         |  7 ++++--
>  arch/x86/kvm/vmx/tdx.c          |  2 +-
>  arch/x86/kvm/vmx/tdx.h          |  2 ++
>  arch/x86/kvm/vmx/vmx.c          |  8 +++++++
>  12 files changed, 63 insertions(+), 40 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index d33d79f2af2d..fcab2337819c 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1069,6 +1069,9 @@ struct kvm_arch {
>  	 */
>  	spinlock_t mmu_unsync_pages_lock;
>  
> +	u64 shadow_mmio_value;
> +	u64 shadow_mmio_mask;
> +
>  	struct list_head assigned_dev_head;
>  	struct iommu_domain *iommu_domain;
>  	bool iommu_noncoherent;
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 0ffaa3156a4e..88d9b8cc7dde 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -498,6 +498,7 @@ enum vmcs_field {
>  #define VMX_EPT_IPAT_BIT    			(1ull << 6)
>  #define VMX_EPT_ACCESS_BIT			(1ull << 8)
>  #define VMX_EPT_DIRTY_BIT			(1ull << 9)
> +#define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
>  #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
>  						 VMX_EPT_WRITABLE_MASK |       \
>  						 VMX_EPT_EXECUTABLE_MASK)
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index 650989c37f2e..b49841e4faaa 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -64,8 +64,10 @@ static __always_inline u64 rsvd_bits(int s, int e)
>  	return ((2ULL << (e - s)) - 1) << s;
>  }
>  
> -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> -void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
> +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
> +				u64 access_mask);
> +void kvm_mmu_set_default_mmio_spte_mask(u64 mask);
> +void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value);
>  void kvm_mmu_set_spte_init_value(u64 init_value);
>  
>  void kvm_init_mmu(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index d8c1505155b0..6e9847b1124b 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2336,7 +2336,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
>  				return kvm_mmu_prepare_zap_page(kvm, child,
>  								invalid_list);
>  		}
> -	} else if (is_mmio_spte(pte)) {
> +	} else if (is_mmio_spte(kvm, pte)) {
>  		mmu_spte_clear_no_track(spte);
>  	}
>  	return 0;
> @@ -3069,9 +3069,12 @@ static bool handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
>  		/*
>  		 * If MMIO caching is disabled, emulate immediately without
>  		 * touching the shadow page tables as attempting to install an
> -		 * MMIO SPTE will just be an expensive nop.
> +		 * MMIO SPTE will just be an expensive nop, but excludes the
> +		 * INTEL TD guest due to it also uses shadow_mmio_value = 0
> +		 * to emulating MMIO access.

The comment doesn't seem accurate.  If I read correctly, for a TD guest,
shadow_mmio_value is initialized to shadow_default_mmio_mask, which isn't always
0.  See the changes to kvm_mmu_reset_all_pte_masks().

>  		 */
> -		if (unlikely(!shadow_mmio_value)) {
> +		if (unlikely(!vcpu->kvm->arch.shadow_mmio_value)
> +		    && !kvm_gfn_stolen_mask(vcpu->kvm)) {

I don't like using kvm_gfn_stolen_mask() here.  Similar to the comment below
related to is_mmio_spte().

>  			*ret_val = RET_PF_EMULATE;
>  			return true;
>  		}
> @@ -3209,7 +3212,8 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  			break;
>  
>  		sp = sptep_to_sp(sptep);
> -		if (!is_last_spte(spte, sp->role.level) || is_mmio_spte(spte))
> +		if (!is_last_spte(spte, sp->role.level) ||
> +			is_mmio_spte(vcpu->kvm, spte))
>  			break;
>  
>  		/*
> @@ -3892,7 +3896,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
>  	if (WARN_ON(reserved))
>  		return -EINVAL;
>  
> -	if (is_mmio_spte(spte)) {
> +	if (is_mmio_spte(vcpu->kvm, spte)) {
>  		gfn_t gfn = get_mmio_spte_gfn(spte);
>  		unsigned int access = get_mmio_spte_access(spte);
>  
> @@ -4294,7 +4298,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
>  static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
>  			   unsigned int access)
>  {
> -	if (unlikely(is_mmio_spte(*sptep))) {
> +	if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
>  		if (gfn != get_mmio_spte_gfn(*sptep)) {
>  			mmu_spte_clear_no_track(sptep);
>  			return true;
> @@ -5791,6 +5795,9 @@ void kvm_mmu_init_vm(struct kvm *kvm)
>  	kvm_page_track_register_notifier(kvm, node);
>  
>  	kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
> +	kvm_mmu_set_mmio_spte_mask(kvm, shadow_default_mmio_mask,
> +				   shadow_default_mmio_mask,
> +				   ACC_WRITE_MASK | ACC_USER_MASK);
>  }
>  
>  void kvm_mmu_uninit_vm(struct kvm *kvm)
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index 5071e8332db2..ea83927b9231 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -29,8 +29,7 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
>  u64 __read_mostly shadow_user_mask;
>  u64 __read_mostly shadow_accessed_mask;
>  u64 __read_mostly shadow_dirty_mask;
> -u64 __read_mostly shadow_mmio_value;
> -u64 __read_mostly shadow_mmio_mask;
> +u64 __read_mostly shadow_default_mmio_mask;
>  u64 __read_mostly shadow_mmio_access_mask;
>  u64 __read_mostly shadow_present_mask;
>  u64 __read_mostly shadow_me_mask;
> @@ -59,10 +58,11 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
>  	u64 spte = generation_mmio_spte_mask(gen);
>  	u64 gpa = gfn << PAGE_SHIFT;
>  
> -	WARN_ON_ONCE(!shadow_mmio_value);
> +	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
> +		     !kvm_gfn_stolen_mask(vcpu->kvm));
>  
>  	access &= shadow_mmio_access_mask;
> -	spte |= shadow_mmio_value | access;
> +	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
>  	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
>  	spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
>  		<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
> @@ -279,7 +279,8 @@ u64 mark_spte_for_access_track(u64 spte)
>  	return spte;
>  }
>  
> -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
> +				u64 access_mask)
>  {
>  	BUG_ON((u64)(unsigned)access_mask != access_mask);
>  	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
> @@ -308,39 +309,32 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
>  	    WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
>  		mmio_value = 0;
>  
> -	shadow_mmio_value = mmio_value;
> -	shadow_mmio_mask  = mmio_mask;
> +	kvm->arch.shadow_mmio_value = mmio_value;
> +	kvm->arch.shadow_mmio_mask = mmio_mask;
>  	shadow_mmio_access_mask = access_mask;
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>  
> -void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> +void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value)
>  {
>  	shadow_user_mask	= VMX_EPT_READABLE_MASK;
>  	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
>  	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
>  	shadow_nx_mask		= 0ull;
>  	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
> -	shadow_present_mask	= has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> +	shadow_present_mask	=
> +		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | init_value;

This change doesn't seem to make any sense.  Why should the "Suppress #VE" bit
be set for a present PTE?

>  	shadow_acc_track_mask	= VMX_EPT_RWX_MASK;
>  	shadow_me_mask		= 0ull;
>  
>  	shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
>  	shadow_mmu_writable_mask  = EPT_SPTE_MMU_WRITABLE;
> -
> -	/*
> -	 * EPT Misconfigurations are generated if the value of bits 2:0
> -	 * of an EPT paging-structure entry is 110b (write/execute).
> -	 */
> -	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> -				   VMX_EPT_RWX_MASK, 0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
>  
>  void kvm_mmu_reset_all_pte_masks(void)
>  {
>  	u8 low_phys_bits;
> -	u64 mask;
>  
>  	shadow_phys_bits = kvm_get_shadow_phys_bits();
>  
> @@ -389,9 +383,13 @@ void kvm_mmu_reset_all_pte_masks(void)
>  	 * PTEs and so the reserved PA approach must be disabled.
>  	 */
>  	if (shadow_phys_bits < 52)
> -		mask = BIT_ULL(51) | PT_PRESENT_MASK;
> +		shadow_default_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;

Hmm...  Not related to this patch, but it seems there's a bug here.  On an
MKTME-enabled system (but not TDX) with 52 physical bits, shadow_phys_bits will
be set to < 52 (depending on how many MKTME KeyIDs are configured by the BIOS).
In this case, bit 51 is set, but bit 51 actually isn't a reserved bit here.
Instead, it is an MKTME KeyID bit.  Therefore, the above setting won't cause a
#PF, but will use a non-zero MKTME KeyID to access the physical address.

Paolo/Sean, any comments here?

>  	else
> -		mask = 0;
> +		shadow_default_mmio_mask = 0;
> +}
>  
> -	kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
> +void kvm_mmu_set_default_mmio_spte_mask(u64 mask)
> +{
> +	shadow_default_mmio_mask = mask;
>  }
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_default_mmio_spte_mask);
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 8e13a35ab8c9..bde843bce878 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -165,8 +165,7 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
>  extern u64 __read_mostly shadow_user_mask;
>  extern u64 __read_mostly shadow_accessed_mask;
>  extern u64 __read_mostly shadow_dirty_mask;
> -extern u64 __read_mostly shadow_mmio_value;
> -extern u64 __read_mostly shadow_mmio_mask;
> +extern u64 __read_mostly shadow_default_mmio_mask;
>  extern u64 __read_mostly shadow_mmio_access_mask;
>  extern u64 __read_mostly shadow_present_mask;
>  extern u64 __read_mostly shadow_me_mask;
> @@ -229,10 +228,10 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
>   */
>  extern u8 __read_mostly shadow_phys_bits;
>  
> -static inline bool is_mmio_spte(u64 spte)
> +static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
>  {
> -	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
> -	       likely(shadow_mmio_value);
> +	return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
> +		likely(kvm->arch.shadow_mmio_value || kvm_gfn_stolen_mask(kvm));

I don't like using kvm_gfn_stolen_mask() to check whether an SPTE is MMIO.
kvm_gfn_stolen_mask() really doesn't imply anything regarding setting up the
value of the MMIO SPTE.  At least, I guess we can use some is_protected_vm() sort
of thing, since it implies guest memory is protected and therefore the legacy way
of handling MMIO doesn't work (i.e. you cannot parse the MMIO instruction).

>  }
>  
>  static inline bool is_shadow_present_pte(u64 pte)
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index bc9e3553fba2..ebd0a02620e8 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -447,8 +447,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  		 * impact the guest since both the former and current SPTEs
>  		 * are nonpresent.
>  		 */
> -		if (WARN_ON(!is_mmio_spte(old_spte) &&
> -			    !is_mmio_spte(new_spte) &&
> +		if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
> +			    !is_mmio_spte(kvm, new_spte) &&
>  			    !is_removed_spte(new_spte)))
>  			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
>  			       "should not be replaced with another,\n"
> @@ -927,7 +927,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	}
>  
>  	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> -	if (unlikely(is_mmio_spte(new_spte))) {
> +	if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
>  		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
>  				     new_spte);
>  		ret = RET_PF_EMULATE;
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 778075b71dc3..c7eec23e9ebe 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4704,7 +4704,7 @@ static __init void svm_adjust_mmio_mask(void)
>  	 */
>  	mask = (mask_bit < 52) ? rsvd_bits(mask_bit, 51) | PT_PRESENT_MASK : 0;
>  
> -	kvm_mmu_set_mmio_spte_mask(mask, mask, PT_WRITABLE_MASK | PT_USER_MASK);
> +	kvm_mmu_set_default_mmio_spte_mask(mask);
>  }
>  
>  static __init void svm_set_cpu_caps(void)
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 51aaafe6b432..b242a9dc9e29 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -23,9 +23,12 @@ static __init int vt_hardware_setup(void)
>  
>  	tdx_hardware_setup(&vt_x86_ops);
>  
> -	if (enable_ept)
> +	if (enable_ept) {
> +		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
>  		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> -				      cpu_has_vmx_ept_execute_only());
> +				      cpu_has_vmx_ept_execute_only(), init_value);
> +		kvm_mmu_set_spte_init_value(init_value);
> +	}
>  
>  	return 0;
>  }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index f86a257dd71b..c3434b33c452 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -11,7 +11,7 @@
>  #undef pr_fmt
>  #define pr_fmt(fmt) "tdx: " fmt
>  
> -static bool __read_mostly enable_tdx = true;
> +bool __read_mostly enable_tdx = true;
>  module_param_named(tdx, enable_tdx, bool, 0644);
>  
>  #define TDX_MAX_NR_CPUID_CONFIGS					\
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 4ce7fcab6f64..b32e068c51b4 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -6,6 +6,7 @@
>  
>  #include "tdx_ops.h"
>  
> +extern bool __read_mostly enable_tdx;
>  int tdx_module_setup(void);
>  
>  struct tdx_td_page {
> @@ -166,6 +167,7 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
>  }
>  
>  #else
> +#define enable_tdx false
>  static inline int tdx_module_setup(void) { return -ENODEV; };
>  
>  struct kvm_tdx;
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 07fd892768be..00f88aa25047 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7065,6 +7065,14 @@ int vmx_vm_init(struct kvm *kvm)
>  	if (!ple_gap)
>  		kvm->arch.pause_in_guest = true;
>  
> +	/*
> +	 * EPT Misconfigurations can be generated if the value of bits 2:0
> +	 * of an EPT paging-structure entry is 110b (write/execute).
> +	 */
> +	if (enable_ept)
> +		kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
> +					   VMX_EPT_MISCONFIG_WX_VALUE, 0);

Should be:

	kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
				   	VMX_EPT_RWX_MASK, 0);

> +
>  	if (boot_cpu_has(X86_BUG_L1TF) && enable_ept) {
>  		switch (l1tf_mitigation) {
>  		case L1TF_MITIGATION_OFF:


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits
  2022-04-06  2:23       ` Kai Huang
@ 2022-04-06 11:26         ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-06 11:26 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/6/22 04:23, Kai Huang wrote:
>>
>>>>
>>>>   
>>>> -		gfn = gpte_to_gfn(gpte);
>>>> +		gfn = gpte_to_gfn(vcpu, gpte);
>>>>   		pte_access = sp->role.access;
>>>>   		pte_access &= FNAME(gpte_access)(gpte);
>>>>   		FNAME(protect_clean_gpte)(vcpu->arch.mmu, &pte_access, gpte);
>>>
>>> In commit message you mentioned "Don't support stolen bits for shadow EPT" (you
>>> actually mean shadow MMU I suppose), yet there's bunch of code change to shadow
>>> MMU.
>>
>> It's a bit ugly, but it's uglier to keep two versions of gpte_to_gfn.
> 
> gpte_to_gfn() is only used in paging_tmpl.h.  Could you elaborate why we need to
> keep two versions of it?

You're right.  Yeah, considering page table walks are not supported when 
private memory is available, it shouldn't be necessary to change 
paging_tmpl.h.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-04-06  2:06       ` Xiaoyao Li
@ 2022-04-06 11:27         ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-06 11:27 UTC (permalink / raw)
  To: Xiaoyao Li, Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/6/22 04:06, Xiaoyao Li wrote:
>>
>> Indeed, it would be easier to use the existing cpuid data in struct 
>> kvm_vcpu, because right now there is no way to ensure that they are 
>> consistent.
>>
>> Why is KVM_SET_CPUID2 not enough?  Are there any modifications done by 
>> KVM that affect the measurement?
> 
> Then we get the situation that KVM_TDX_INIT_VM must be called after 1 
> vcpu is created. It seems illogical that it has chance to fail the VM 
> scope initialization after 1 vcpu is successfully created.

I see.  Yeah, it makes sense to have the CPUID in KVM_TDX_INIT_VM then.
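
Purely as an illustration of the resulting layout (the field names for the
appended CPUID data below are assumptions, loosely following Kai's earlier
suggestion to keep it consistent with TD_PARAMS, not the series' final form):

struct kvm_tdx_init_vm {
	__u32 max_vcpus;
	__u32 tsc_khz;
	__u64 attributes;
	/* Hypothetical: CPUID entries appended at the end of the structure. */
	__u32 cpuid_nent;
	__u32 padding;
	struct kvm_cpuid_entry2 cpuid_entries[];
};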

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 078/104] KVM: TDX: Implement interrupt injection
  2022-03-04 19:49 ` [RFC PATCH v5 078/104] KVM: TDX: Implement interrupt injection isaku.yamahata
@ 2022-04-06 11:47   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-06 11:47 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

Please do not duplicate code, for example:

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>   
> +void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	pi_clear_on(&tdx->pi_desc);
> +	memset(tdx->pi_desc.pir, 0, sizeof(tdx->pi_desc.pir));
> +}

This is the same as vmx_apicv_post_state_restore.  Please write this like:

void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
{
	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
	pi_clear_on(pi);
	memset(pi->pir, 0, sizeof(pi->pir));
}


Otherwise,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 080/104] KVM: TDX: Implement methods to inject NMI
  2022-03-04 19:49 ` [RFC PATCH v5 080/104] KVM: TDX: Implement methods to inject NMI isaku.yamahata
@ 2022-04-06 12:47   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-06 12:47 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> The TDX vcpu control structure defines one bit for a pending NMI, so the VMM
> can inject an NMI by setting the bit without knowing the TDX vcpu's NMI state.
> Because the vcpu state is protected, the VMM can't know the NMI state of a
> TDX vcpu.  The TDX module handles the actual injection and NMI state
> transitions.
> 
> Add methods for NMI and treat NMIs as if they can always be injected.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    | 62 +++++++++++++++++++++++++++++++++++---
>   arch/x86/kvm/vmx/tdx.c     |  5 +++
>   arch/x86/kvm/vmx/x86_ops.h |  2 ++
>   3 files changed, 64 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 404a260796e4..aa84c13f8ee1 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -216,6 +216,58 @@ static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
>   	vmx_flush_tlb_guest(vcpu);
>   }
>   
> +static void vt_inject_nmi(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_inject_nmi(vcpu);
> +
> +	vmx_inject_nmi(vcpu);
> +}
> +
> +static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
> +{
> +	/*
> +	 * The TDX module manages NMI windows and NMI reinjection, and hides NMI
> +	 * blocking, all KVM can do is throw an NMI over the wall.
> +	 */
> +	if (is_td_vcpu(vcpu))
> +		return true;
> +
> +	return vmx_nmi_allowed(vcpu, for_injection);
> +}
> +
> +static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Assume NMIs are always unmasked.  KVM could query PEND_NMI and treat
> +	 * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
> +	 * expensive and the end result is unchanged as the only relevant usage
> +	 * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
> +	 * only changes whether KVM or the TDX module drops an NMI.
> +	 */
> +	if (is_td_vcpu(vcpu))
> +		return false;
> +
> +	return vmx_get_nmi_mask(vcpu);
> +}
> +
> +static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_set_nmi_mask(vcpu, masked);
> +}
> +
> +static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
> +{
> +	/* Refer the comment in vt_get_nmi_mask(). */
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_enable_nmi_window(vcpu);
> +}
> +
>   static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>   			int pgd_level)
>   {
> @@ -366,14 +418,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.get_interrupt_shadow = vt_get_interrupt_shadow,
>   	.patch_hypercall = vmx_patch_hypercall,
>   	.set_irq = vt_inject_irq,
> -	.set_nmi = vmx_inject_nmi,
> +	.set_nmi = vt_inject_nmi,
>   	.queue_exception = vmx_queue_exception,
>   	.cancel_injection = vt_cancel_injection,
>   	.interrupt_allowed = vt_interrupt_allowed,
> -	.nmi_allowed = vmx_nmi_allowed,
> -	.get_nmi_mask = vmx_get_nmi_mask,
> -	.set_nmi_mask = vmx_set_nmi_mask,
> -	.enable_nmi_window = vmx_enable_nmi_window,
> +	.nmi_allowed = vt_nmi_allowed,
> +	.get_nmi_mask = vt_get_nmi_mask,
> +	.set_nmi_mask = vt_set_nmi_mask,
> +	.enable_nmi_window = vt_enable_nmi_window,
>   	.enable_irq_window = vt_enable_irq_window,
>   	.update_cr8_intercept = vmx_update_cr8_intercept,
>   	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index bdc658ca9e4f..273898de9f7a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -763,6 +763,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>   	return EXIT_FASTPATH_NONE;
>   }
>   
> +void tdx_inject_nmi(struct kvm_vcpu *vcpu)
> +{
> +	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
> +}
> +
>   void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   {
>   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index c3768a20347f..31be5e8a1d5c 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -150,6 +150,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
>   void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
>   			   int trig_mode, int vector);
> +void tdx_inject_nmi(struct kvm_vcpu *vcpu);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -180,6 +181,7 @@ static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>   static inline void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_deliver_interrupt(
>   	struct kvm_lapic *apic, int delivery_mode, int trig_mode, int vector) {}
> +static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 079/104] KVM: TDX: Implements vcpu request_immediate_exit
  2022-03-04 19:49 ` [RFC PATCH v5 079/104] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
@ 2022-04-06 12:49   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-06 12:49 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Now we are able to inject interrupts into TDX vcpu, it's ready to block TDX
> vcpu.  Wire up kvm x86 methods for blocking/unblocking vcpu for TDX.  To
> unblock on pending events, request immediate exit methods is also needed.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c | 10 +++++++++-
>   1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index a0bcc4dca678..404a260796e4 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -280,6 +280,14 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
>   	vmx_enable_irq_window(vcpu);
>   }
>   
> +static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return __kvm_request_immediate_exit(vcpu);
> +
> +	vmx_request_immediate_exit(vcpu);
> +}
> +
>   static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
>   {
>   	if (!is_td(kvm))
> @@ -402,7 +410,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.check_intercept = vmx_check_intercept,
>   	.handle_exit_irqoff = vmx_handle_exit_irqoff,
>   
> -	.request_immediate_exit = vmx_request_immediate_exit,
> +	.request_immediate_exit = vt_request_immediate_exit,
>   
>   	.sched_in = vt_sched_in,
>   

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 081/104] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
  2022-03-04 19:49 ` [RFC PATCH v5 081/104] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
@ 2022-04-06 12:49   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-06 12:49 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> TDX uses different ABI to get information about VM exit.  Pass intr_info to
> the NMI and INTR handlers instead of pulling it from vcpu_vmx in
> preparation for sharing the bulk of the handlers with TDX.
> 
> When the guest TD exits to VMM, RAX holds status and exit reason, RCX holds
> exit qualification etc rather than the VMCS fields because VMM doesn't have
> access to the VMCS.  The eventual code will be
> 
> VMX:
>    - get exit reason, intr_info, exit_qualification, and etc from VMCS
>    - call NMI/INTR handlers (common code)
> 
> TDX:
>    - get exit reason, intr_info, exit_qualification, and etc from guest
>      registers
>    - call NMI/INTR handlers (common code)
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/vmx.c | 17 ++++++++---------
>   1 file changed, 8 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 4bd1e61b8d45..008400927144 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -6442,28 +6442,27 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
>   		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
>   }
>   
> -static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
> +static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
>   {
>   	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
> -	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
>   
>   	/* if exit due to PF check for async PF */
>   	if (is_page_fault(intr_info))
> -		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
> +		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
>   	/* if exit due to NM, handle before interrupts are enabled */
>   	else if (is_nm_fault(intr_info))
> -		handle_nm_fault_irqoff(&vmx->vcpu);
> +		handle_nm_fault_irqoff(vcpu);
>   	/* Handle machine checks before interrupts are enabled */
>   	else if (is_machine_check(intr_info))
>   		kvm_machine_check();
>   	/* We need to handle NMIs before interrupts are enabled */
>   	else if (is_nmi(intr_info))
> -		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
> +		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
>   }
>   
> -static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
> +static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
> +					     u32 intr_info)
>   {
> -	u32 intr_info = vmx_get_intr_info(vcpu);
>   	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
>   	gate_desc *desc = (gate_desc *)host_idt_base + vector;
>   
> @@ -6482,9 +6481,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>   		return;
>   
>   	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
> -		handle_external_interrupt_irqoff(vcpu);
> +		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
>   	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
> -		handle_exception_nmi_irqoff(vmx);
> +		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
>   }
>   
>   /*

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit
  2022-03-04 19:49 ` [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
@ 2022-04-06 20:50   ` Sagi Shahar
  2022-04-07  1:09     ` Xiaoyao Li
  0 siblings, 1 reply; 310+ messages in thread
From: Sagi Shahar @ 2022-04-06 20:50 UTC (permalink / raw)
  To: Yamahata, Isaku
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Erdem Aktas, Connor Kuehl, Sean Christopherson

On Fri, Mar 4, 2022 at 12:23 PM <isaku.yamahata@intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> On EPT violation, call a common function, __vmx_handle_ept_violation() to
> trigger x86 MMU code.  On EPT misconfiguration, exit to ring 3 with
> KVM_EXIT_UNKNOWN.  because EPT misconfiguration can't happen as MMIO is
> trigged by TDG.VP.VMCALL. No point to set a misconfiguration value for the
> fast path.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 38 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 6fbe89bcfe1e..2c35dcad077e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1081,6 +1081,40 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
>         __vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
>  }
>
> +#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)

TDX_SEPT_PFERR is defined using the PFERR_* bitmask, but
__vmx_handle_ept_violation() accepts an EPT_VIOLATION_* bitmask,
so (PFERR_WRITE_MASK | PFERR_USER_MASK) will get interpreted as
(EPT_VIOLATION_ACC_WRITE | EPT_VIOLATION_ACC_INSTR), which will then be
translated to (PFERR_WRITE_MASK | PFERR_FETCH_MASK). Was that the
intention of this code?
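
For reference, the relevant bit values (approximate, per
arch/x86/include/asm/kvm_host.h and arch/x86/include/asm/vmx.h around this
series; double check against the actual tree) make the collision visible:

	#define PFERR_WRITE_MASK	(1U << 1)	/* 0x2  */
	#define PFERR_USER_MASK		(1U << 2)	/* 0x4  */
	#define PFERR_FETCH_MASK	(1U << 4)	/* 0x10 */

	#define EPT_VIOLATION_ACC_WRITE	(1 << 1)	/* 0x2  */
	#define EPT_VIOLATION_ACC_INSTR	(1 << 2)	/* 0x4  */

i.e. TDX_SEPT_PFERR == 0x6, which in exit-qualification terms reads as
ACC_WRITE | ACC_INSTR.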

> +
> +static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
> +{
> +       unsigned long exit_qual;
> +
> +       if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu)))
> +               exit_qual = TDX_SEPT_PFERR;
> +       else {
> +               exit_qual = tdexit_exit_qual(vcpu);
> +               if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
> +                       pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
> +                               tdexit_gpa(vcpu), kvm_rip_read(vcpu));
> +                       vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
> +                       vcpu->run->ex.exception = PF_VECTOR;
> +                       vcpu->run->ex.error_code = exit_qual;
> +                       return 0;
> +               }
> +       }
> +
> +       trace_kvm_page_fault(tdexit_gpa(vcpu), exit_qual);
> +       return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
> +}
> +
> +static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
> +{
> +       WARN_ON(1);
> +
> +       vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
> +       vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
> +
> +       return 0;
> +}
> +
>  int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
>  {
>         union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> @@ -1097,6 +1131,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
>         WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>
>         switch (exit_reason.basic) {
> +       case EXIT_REASON_EPT_VIOLATION:
> +               return tdx_handle_ept_violation(vcpu);
> +       case EXIT_REASON_EPT_MISCONFIG:
> +               return tdx_handle_ept_misconfig(vcpu);
>         case EXIT_REASON_OTHER_SMI:
>                 /*
>                  * If reach here, it's not a MSMI.
> @@ -1378,8 +1416,6 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu)
>                 cpu_relax();
>  }
>
> -#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
> -
>  static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  {
>         struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> --
> 2.25.1
>

Sagi

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
  2022-04-05 15:32   ` Paolo Bonzini
@ 2022-04-06 23:28     ` Sean Christopherson
  0 siblings, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-06 23:28 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On Tue, Apr 05, 2022, Paolo Bonzini wrote:
> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> >   	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> > +	vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
> >   	vcpu->arch.cr0_guest_owned_bits = -1ul;
> >   	vcpu->arch.cr4_guest_owned_bits = -1ul;
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 45e8a02e99bf..89d04cd64cd0 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -10084,7 +10084,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
> >   	if (vcpu->arch.guest_fpu.xfd_err)
> >   		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
> > -	if (unlikely(vcpu->arch.switch_db_regs)) {
> > +	if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
> >   		set_debugreg(0, 7);
> >   		set_debugreg(vcpu->arch.eff_db[0], 0);
> >   		set_debugreg(vcpu->arch.eff_db[1], 1);
> 
> I'm confused.  I'm probably missing something obvious, but where does
> KVM_DEBUGREG_AUTO_SWITCH affect the behavior of KVM?  vcpu_enter_guest
> would still write %db0-%db3 if KVM_DEBUGREG_BP_ENABLED is set, for
> example.
> 
> Do you mean:
> 
> 	if (unlikely(vcpu->arch.switch_db_regs) &&
> 	    !unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH)) {
> 
> ?

No, you're confused because there's a crucial chunk of this patch that's
missing.  The host restore path was originally

        if (hw_breakpoint_active() &&
            !(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCHED))
                hw_breakpoint_restore();

The TDX module context switches the guest _and_ host debug registers.  It restores
the host DRs because it needs to write _something_ to hide guest state, so it might
as well restore the host values.  The above was an optimization to avoid rewriting
all debug registers.

However, this patch was written before commit f85d40160691 ("KVM: X86: Disable
hardware breakpoints unconditionally before kvm_x86->run()").  I suspect that when
rebasing to newer KVM, the Intel folks discovered that DR7 isn't getting restored
because the GUEST_DR7 in the SEAM VMCS will hold the zero'd value.

But rather than completely drop the flag, KVM can still avoid writing everything
except DR7, i.e.

	if (hw_breakpoint_active()) {
		if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
			hw_breakpoint_restore();
		else
			set_debugreg(__this_cpu_read(cpu_dr7), 7);
	}

with a comment explaining that DR7 always needs to be restored because it's
preemptively cleared to play nice with the non-instrumentable sections.
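
Spelled out, one possible rendering of that snippet with such a comment
(a sketch, not Sean's exact wording) would be:

	if (hw_breakpoint_active()) {
		if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH)) {
			hw_breakpoint_restore();
		} else {
			/*
			 * DR7 is preemptively cleared before entering the
			 * guest to play nice with the non-instrumentable
			 * sections, so the value restored on exit is the
			 * cleared one.  DR7 therefore always needs to be
			 * restored here, even though the other debug
			 * registers were handled by the TDX module.
			 */
			set_debugreg(__this_cpu_read(cpu_dr7), 7);
		}
	}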

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
  2022-03-04 19:49 ` [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value isaku.yamahata
  2022-04-05 14:22   ` Paolo Bonzini
@ 2022-04-06 23:30   ` Kai Huang
  1 sibling, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-06 23:30 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDP MMU uses REMOVED_SPTE = 0x5a0ULL as special constant to indicate the
> intermediate value to indicate one thread is operating on it and the value
> should be semi-arbitrary value.  For TDX (more correctly to use #VE), the
> value should include suppress #VE value which is shadow_init_value.
> 
> Define SHADOW_REMOVED_SPTE as shadow_init_value | REMOVED_SPTE, and replace
> REMOVED_SPTE with SHADOW_REMOVED_SPTE to use suppress #VE bit properly for
> TDX.

Like we discussed, this patch should be merged with patch "KVM: x86/mmu: Allow
non-zero init value for shadow PTE".


-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
  2022-04-05 14:22   ` Paolo Bonzini
@ 2022-04-06 23:35     ` Sean Christopherson
  2022-04-07 13:52       ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-06 23:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On Tue, Apr 05, 2022, Paolo Bonzini wrote:
> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> > @@ -207,9 +209,17 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >   /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> >   static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> > +/*
> > + * See above comment around REMOVED_SPTE.  SHADOW_REMOVED_SPTE is the actual
> > + * intermediate value set to the removed SPET.  When TDX is enabled, it sets
> > + * the "suppress #VE" bit, otherwise it's REMOVED_SPTE.
> > + */
> > +extern u64 __read_mostly shadow_init_value;
> > +#define SHADOW_REMOVED_SPTE	(shadow_init_value | REMOVED_SPTE)
> 
> Please rename the existing REMOVED_SPTE to REMOVED_SPTE_MASK, and call this
> simply REMOVED_SPTE.  This also makes the patch smaller.

Can we name it either __REMOVE_SPTE or REMOVED_SPTE_VAL?  It's most definitely
not a mask, it's a full value, e.g. spte |= REMOVED_SPTE_MASK is completely wrong.

Other than that, 100% agree with avoiding churn.
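
Illustratively (a sketch only, final names to be settled), the combined
suggestion would end up as something like:

	/* Full SPTE value, not a mask, hence no _MASK suffix. */
	#define REMOVED_SPTE_VAL	0x5a0ULL

	/* Actual intermediate value installed in a removed SPTE. */
	#define REMOVED_SPTE		(shadow_init_value | REMOVED_SPTE_VAL)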

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-03-04 19:49 ` [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page isaku.yamahata
  2022-04-05 14:58   ` Paolo Bonzini
@ 2022-04-06 23:43   ` Kai Huang
  2022-04-07 13:52     ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-06 23:43 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a private pointer to kvm_mmu_page for private EPT.
> 
> To resolve KVM page fault on private GPA, it will allocate additional page
> for Secure EPT in addition to private EPT.  Add memory allocator for it and
> topup its memory allocator before resolving KVM page fault similar to
> shared EPT page.  Allocation of those memory will be done for TDP MMU by
> alloc_tdp_mmu_page().  Freeing those memory will be done for TDP MMU on
> behalf of kvm_tdp_mmu_zap_all() called by kvm_mmu_zap_all().  Private EPT
> page needs to carry one more page used for Secure EPT in addition to the
> private EPT page.  Add private pointer to struct kvm_mmu_page for that
> purpose and Add helper functions to allocate/free a page for Secure EPT.
> Also add helper functions to check if a given kvm_mmu_page is private.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  1 +
>  arch/x86/kvm/mmu/mmu.c          |  9 ++++
>  arch/x86/kvm/mmu/mmu_internal.h | 84 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c      |  3 ++
>  4 files changed, 97 insertions(+)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index fcab2337819c..0c8cc7d73371 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -689,6 +689,7 @@ struct kvm_vcpu_arch {
>  	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
>  	struct kvm_mmu_memory_cache mmu_gfn_array_cache;
>  	struct kvm_mmu_memory_cache mmu_page_header_cache;
> +	struct kvm_mmu_memory_cache mmu_private_sp_cache;
>  
>  	/*
>  	 * QEMU userspace and the guest each have their own FPU state.
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 6e9847b1124b..8def8b97978f 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -758,6 +758,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
>  	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
>  	int start, end, i, r;
>  
> +	if (kvm_gfn_stolen_mask(vcpu->kvm)) {

Please get rid of kvm_gfn_stolen_mask().

> +		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_sp_cache,
> +					       PT64_ROOT_MAX_LEVEL);
> +		if (r)
> +			return r;
> +	}
> +
>  	if (shadow_init_value)
>  		start = kvm_mmu_memory_cache_nr_free_objects(mc);
>  

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2022-03-04 19:49 ` [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
@ 2022-04-07  0:50   ` Kai Huang
  2022-04-25 19:10     ` Sagi Shahar
  2022-04-29  0:28   ` Sagi Shahar
  1 sibling, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-07  0:50 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Use private EPT to mirror Secure EPT, and On the change of the private EPT

"on the change".

> entry, invoke kvm_x86_ops hook in __handle_changed_spte() to propagate the
> change to Secure EPT.
> 
> On EPT violation, determine which EPT to use, private or shared, based on
> faulted GPA.  When allocating an EPT page, record it (private or shared) in
> the page role.  The private is passed down to the function as an argument
> as necessary.  When the private EPT entry is changed, call the hook.
> 
> When zapping EPT, the EPT entry is frozen with the special EPT value that
> clears the present bit. After the TLB shootdown, the entry is set to the
> eventual value.  On populating the EPT entry, atomically set the entry.
> 
> For TDX, TDX SEAMCALL to update Secure EPT in addition to direct access to
> the private EPT entry.  For the zapping case, freeing the EPT entry
> works. It can call TDX SEAMCALL in addition to TLB shootdown.  For
> populating the private EPT entry, there can be a race condition without
> further protection
> 
>   vcpu 1: populating 2M private EPT entry
>   vcpu 2: populating 4K private EPT entry
>   vcpu 2: TDX SEAMCALL to update 4K secure EPT => error
>   vcpu 1: TDX SEAMCALL to update 4M secure EPT

2M secure EPT

> 
> To avoid the race, the frozen EPT entry is utilized.  Instead of atomic
> update of the private EPT entry, freeze the entry, call the hook that
> invokes TDX SEAMCALL, set the entry to the final value (unfreeze).
> 
> Support 4K page only at this stage.  2M page support can be done in future
> patches.

Btw, I'd like to see this as a patch that handles the schematic of private
mapping in the MMU code, and to have it placed close to the other infrastructure
patches such as "stolen GPA infrastructure" and "private_sp", so people can get
a clear view of what the schematic of "private mapping" means and how it is
handled before jumping into TDX details.

In this patch, you have a "mirrored private page table"; this is an important
concept, so please explain it in the commit message.  Only with this does your
race condition case above make sense.

Also, you mentioned you would record private or shared in the page role, but I
don't see it.  Anyway, you also have SPTE_PRIVATE_PROHIBIT.  It's not entirely
clear to me why you need it, or what the advantage is of using page.role versus
it.  Please put those patches together so we can have a clear understanding of
your decisions.

IMHO, in general please reorganize the MMU-related patches in the following way:

1) Patches to introduce the schematic of private mapping, and how to handle it.
You can use TDX as background to explain, but IMHO please use a common name such
as "private page table" instead of "Secure EPT".  Those patches can include:
  - concept of protected VM, private/shared mapping (current "GPA stolen bits
    infrastructure" patch).
  - shadow_nonpresent_mask to support setting "Suppress #VE" bit.
  - per-VM MMIO value/mask
  - 'kvm_mmu_page->private_sp'
  - mirrored private page table (this patch).
  - SPTE_PRIVATE_PROHIBIT (vs role.private, etc)
  - Anything else?

2) After you have clearly explained the schematic of private mapping, you can
declare that some features cannot work with private mapping, such as log-dirty
and fast page fault, so that you can disable them in separate patches.

3) TDX specific handling.

Order of 2) and 3) doesn't matter, of course.

And when introducing a patch, please make sure it doesn't break existing things,
even logically.  While I understand we want to split out independent small pieces
of logic as small patches (as preparation for the main patches), please don't
break anything in any one patch.

My 2 cents above.

> 
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |   2 +
>  arch/x86/include/asm/kvm_host.h    |   8 +
>  arch/x86/kvm/mmu/mmu.c             |  31 +++-
>  arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c         | 226 +++++++++++++++++++++++------
>  arch/x86/kvm/mmu/tdp_mmu.h         |  13 +-
>  virt/kvm/kvm_main.c                |   1 +
>  7 files changed, 232 insertions(+), 51 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index ef48dcc98cfc..7e27b73d839f 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -91,6 +91,8 @@ KVM_X86_OP(set_tss_addr)
>  KVM_X86_OP(set_identity_map_addr)
>  KVM_X86_OP(get_mt_mask)
>  KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP(free_private_sp)
> +KVM_X86_OP(handle_changed_private_spte)
>  KVM_X86_OP_NULL(has_wbinvd_exit)
>  KVM_X86_OP(get_l2_tsc_offset)
>  KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0c8cc7d73371..8406f8b5ab74 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -433,6 +433,7 @@ struct kvm_mmu {
>  			 struct kvm_mmu_page *sp);
>  	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
>  	hpa_t root_hpa;
> +	hpa_t private_root_hpa;
>  	gpa_t root_pgd;
>  	union kvm_mmu_role mmu_role;
>  	u8 root_level;
> @@ -1433,6 +1434,13 @@ struct kvm_x86_ops {
>  	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  			     int root_level);
>  
> +	int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +			       void *private_sp);
> +	void (*handle_changed_private_spte)(
> +		struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +		kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
> +		kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page);
> +
>  	bool (*has_wbinvd_exit)(void);
>  
>  	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8def8b97978f..0ec9548ff4dd 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3422,6 +3422,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	u8 shadow_root_level = mmu->shadow_root_level;
> +	gfn_t gfn_stolen = kvm_gfn_stolen_mask(vcpu->kvm);
>  	hpa_t root;
>  	unsigned i;
>  	int r;
> @@ -3432,7 +3433,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  		goto out_unlock;
>  
>  	if (is_tdp_mmu_enabled(vcpu->kvm)) {
> -		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> +		if (gfn_stolen && !VALID_PAGE(mmu->private_root_hpa)) {
> +			root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> +			mmu->private_root_hpa = root;
> +		}
> +		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
>  		mmu->root_hpa = root;
>  	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
>  		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> @@ -5596,6 +5601,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>  	int i;
>  
>  	mmu->root_hpa = INVALID_PAGE;
> +	mmu->private_root_hpa = INVALID_PAGE;
>  	mmu->root_pgd = 0;
>  	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>  		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> @@ -5772,6 +5778,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>  
>  	write_unlock(&kvm->mmu_lock);
>  
> +	/*
> +	 * For now private root is never invalidate during VM is running,
> +	 * so this can only happen for shared roots.
> +	 */
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		read_lock(&kvm->mmu_lock);
>  		kvm_tdp_mmu_zap_invalidated_roots(kvm);
> @@ -5871,7 +5881,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
>  			flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> -							  gfn_end, flush);
> +							  gfn_end, flush,
> +							  false);
>  	}
>  
>  	if (flush)
> @@ -5904,6 +5915,11 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  
> +	/*
> +	 * For now this can only happen for non-TD VM, because TD private
> +	 * mapping doesn't support write protection.  kvm_tdp_mmu_wrprot_slot()
> +	 * will give a WARN() if it hits for TD.
> +	 */
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		read_lock(&kvm->mmu_lock);
>  		flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
> @@ -5952,6 +5968,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  		sp = sptep_to_sp(sptep);
>  		pfn = spte_to_pfn(*sptep);
>  
> +		/* Private page dirty logging is not supported. */
> +		KVM_BUG_ON(is_private_spte(sptep), kvm);
> +
>  		/*
>  		 * We cannot do huge page mapping for indirect shadow pages,
>  		 * which are found on the last rmap (level = 1) when not using
> @@ -5992,6 +6011,11 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>  		write_unlock(&kvm->mmu_lock);
>  	}
>  
> +	/*
> +	 * This should only be reachable in case of log-dirty, wihch TD private
> +	 * mapping doesn't support so far.  kvm_tdp_mmu_zap_collapsible_sptes()
> +	 * internally gives a WARN() when it hits.
> +	 */
>  	if (is_tdp_mmu_enabled(kvm)) {
>  		read_lock(&kvm->mmu_lock);
>  		kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
> @@ -6266,6 +6290,9 @@ int kvm_mmu_module_init(void)
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>  {
>  	kvm_mmu_unload(vcpu);
> +	if (is_tdp_mmu_enabled(vcpu->kvm))
> +		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> +				NULL);
>  	free_mmu_pages(&vcpu->arch.root_mmu);
>  	free_mmu_pages(&vcpu->arch.guest_mmu);
>  	mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index e19cabbcb65c..ad22d5b691c5 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -28,7 +28,7 @@ struct tdp_iter {
>  	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>  	/* A pointer to the current SPTE */
>  	tdp_ptep_t sptep;
> -	/* The lowest GFN mapped by the current SPTE */
> +	/* The lowest GFN (stolen bits included) mapped by the current SPTE */
>  	gfn_t gfn;
>  	/* The level of the root page given to the iterator */
>  	int root_level;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a68f3a22836b..acba2590b51e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,6 +53,11 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>  	rcu_barrier();
>  }
>  
> +static gfn_t tdp_iter_gfn_unalias(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +	return kvm_gfn_unalias(kvm, iter->gfn);
> +}
> +
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  gfn_t start, gfn_t end, bool can_yield, bool flush,
>  			  bool shared);
> @@ -175,10 +180,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
>  }
>  
>  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -					       int level)
> +					       int level, bool private)
>  {
>  	struct kvm_mmu_page *sp;
>  
> +	WARN_ON(level != vcpu->arch.mmu->shadow_root_level &&
> +		kvm_is_private_gfn(vcpu->kvm, gfn) != private);
> +	WARN_ON(level == vcpu->arch.mmu->shadow_root_level && gfn != 0);
>  	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>  	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>  	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> @@ -186,14 +194,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>  	sp->role.word = page_role_for_level(vcpu, level).word;
>  	sp->gfn = gfn;
>  	sp->tdp_mmu_page = true;
> -	kvm_mmu_init_private_sp(sp);
> +
> +	if (private)
> +		kvm_mmu_alloc_private_sp(vcpu, sp);
> +	else
> +		kvm_mmu_init_private_sp(sp);
>  
>  	trace_kvm_mmu_get_page(sp, true);
>  
>  	return sp;
>  }
>  
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
> +						      bool private)
>  {
>  	union kvm_mmu_page_role role;
>  	struct kvm *kvm = vcpu->kvm;
> @@ -206,11 +219,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  	/* Check for an existing root before allocating a new one. */
>  	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
>  		if (root->role.word == role.word &&
> +		    is_private_sp(root) == private &&
>  		    kvm_tdp_mmu_get_root(kvm, root))
>  			goto out;
>  	}
>  
> -	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> +	root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level,
> +			private);
>  	refcount_set(&root->tdp_mmu_root_count, 1);
>  
>  	spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -218,12 +233,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>  
>  out:
> -	return __pa(root->spt);
> +	return root;
> +}
> +
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
> +{
> +	return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
>  }
>  
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level,
> -				bool shared);
> +				bool private_spte, u64 old_spte,
> +				u64 new_spte, int level, bool shared);
>  
>  static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
>  {
> @@ -321,6 +341,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>  	int level = sp->role.level;
>  	gfn_t base_gfn = sp->gfn;
>  	int i;
> +	bool private_sp = is_private_sp(sp);
>  
>  	trace_kvm_mmu_prepare_zap_page(sp);
>  
> @@ -370,7 +391,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>  			 */
>  			WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
>  		}
> -		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> +		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, private_sp,
>  				    old_child_spte, SHADOW_REMOVED_SPTE, level,
>  				    shared);
>  	}
> @@ -378,6 +399,17 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>  	kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
>  					   KVM_PAGES_PER_HPAGE(level + 1));
>  
> +	if (private_sp &&
> +		WARN_ON(static_call(kvm_x86_free_private_sp)(
> +				kvm, sp->gfn, sp->role.level,
> +				kvm_mmu_private_sp(sp)))) {
> +		/*
> +		 * Failed to unlink Secure EPT page and there is nothing to do
> +		 * further.  Intentionally leak the page to prevent the kernel
> +		 * from accessing the encrypted page.
> +		 */
> +		kvm_mmu_init_private_sp(sp);
> +	}
>  	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
>  
> @@ -386,6 +418,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>   * @kvm: kvm instance
>   * @as_id: the address space of the paging structure the SPTE was a part of
>   * @gfn: the base GFN that was mapped by the SPTE
> + * @private_spte: the SPTE is private or not
>   * @old_spte: The value of the SPTE before the change
>   * @new_spte: The value of the SPTE after the change
>   * @level: the level of the PT the SPTE is part of in the paging structure
> @@ -397,14 +430,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>   * This function must be called for all TDP SPTE modifications.
>   */
>  static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				  u64 old_spte, u64 new_spte, int level,
> -				  bool shared)
> +				  bool private_spte, u64 old_spte,
> +				  u64 new_spte, int level, bool shared)
>  {
>  	bool was_present = is_shadow_present_pte(old_spte);
>  	bool is_present = is_shadow_present_pte(new_spte);
>  	bool was_leaf = was_present && is_last_spte(old_spte, level);
>  	bool is_leaf = is_present && is_last_spte(new_spte, level);
> -	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> +	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> +	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> +	bool pfn_changed = old_pfn != new_pfn;
>  
>  	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
>  	WARN_ON(level < PG_LEVEL_4K);
> @@ -468,23 +503,49 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  
>  	if (was_leaf && is_dirty_spte(old_spte) &&
>  	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> -		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +		kvm_set_pfn_dirty(old_pfn);
> +
> +	/*
> +	 * Special handling for the private mapping.  We are either
> +	 * setting up new mapping at middle level page table, or leaf,
> +	 * or tearing down existing mapping.
> +	 */
> +	if (private_spte) {
> +		void *sept_page = NULL;
> +
> +		if (is_present && !is_leaf) {
> +			struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(new_pfn));
> +
> +			sept_page = kvm_mmu_private_sp(sp);
> +			WARN_ON(!sept_page);
> +			WARN_ON(sp->role.level + 1 != level);
> +			WARN_ON(sp->gfn != gfn);
> +		}
> +
> +		static_call(kvm_x86_handle_changed_private_spte)(
> +			kvm, gfn, level,
> +			old_pfn, was_present, was_leaf,
> +			new_pfn, is_present, is_leaf, sept_page);
> +	}
>  
>  	/*
>  	 * Recursively handle child PTs if the change removed a subtree from
>  	 * the paging structure.
>  	 */
> -	if (was_present && !was_leaf && (pfn_changed || !is_present))
> +	if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> +		WARN_ON(private_spte !=
> +			is_private_spte(spte_to_child_pt(old_spte, level)));
>  		handle_removed_tdp_mmu_page(kvm,
>  				spte_to_child_pt(old_spte, level), shared);
> +	}
>  }
>  
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level,
> -				bool shared)
> +				bool private_spte, u64 old_spte, u64 new_spte,
> +				int level, bool shared)
>  {
> -	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> -			      shared);
> +	__handle_changed_spte(kvm, as_id, gfn, private_spte,
> +			old_spte, new_spte, level, shared);
>  	handle_changed_spte_acc_track(old_spte, new_spte, level);
>  	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
>  				      new_spte, level);
> @@ -505,6 +566,10 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  					   struct tdp_iter *iter,
>  					   u64 new_spte)
>  {
> +	bool freeze_spte = is_private_spte(iter->sptep) &&
> +		!is_removed_spte(new_spte);
> +	u64 tmp_spte = freeze_spte ? SHADOW_REMOVED_SPTE : new_spte;
> +
>  	WARN_ON_ONCE(iter->yielded);
>  
>  	lockdep_assert_held_read(&kvm->mmu_lock);
> @@ -521,13 +586,16 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	 * does not hold the mmu_lock.
>  	 */
>  	if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> -		      new_spte) != iter->old_spte)
> +		      tmp_spte) != iter->old_spte)
>  		return false;
>  
> -	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -			      new_spte, iter->level, true);
> +	__handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
> +			      iter->old_spte, new_spte, iter->level, true);
>  	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
>  
> +	if (freeze_spte)
> +		WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
> +
>  	return true;
>  }
>  
> @@ -603,8 +671,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>  
>  	WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
>  
> -	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -			      new_spte, iter->level, false);
> +	__handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
> +			      iter->old_spte, new_spte, iter->level, false);
>  	if (record_acc_track)
>  		handle_changed_spte_acc_track(iter->old_spte, new_spte,
>  					      iter->level);
> @@ -644,9 +712,10 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>  			continue;					\
>  		else
>  
> -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
> -	for_each_tdp_pte(_iter, __va(_mmu->root_hpa),		\
> -			 _mmu->shadow_root_level, _start, _end)
> +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)		\
> +	for_each_tdp_pte(_iter,							\
> +		__va((_private) ? _mmu->private_root_hpa : _mmu->root_hpa),	\
> +		_mmu->shadow_root_level, _start, _end)
>  
>  /*
>   * Yield if the MMU lock is contended or this thread needs to return control
> @@ -731,6 +800,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  	 */
>  	end = min(end, max_gfn_host);
>  
> +	/*
> +	 * Extend [start, end) to include GFN shared bit when TDX is enabled,
> +	 * and for shared mapping range.
> +	 */
> +	if (is_private_sp(root)) {
> +		start = kvm_gfn_private(kvm, start);
> +		end = kvm_gfn_private(kvm, end);
> +	} else {
> +		start = kvm_gfn_shared(kvm, start);
> +		end = kvm_gfn_shared(kvm, end);
> +	}
> +
>  	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>  
>  	rcu_read_lock();
> @@ -783,13 +864,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * MMU lock.
>   */
>  bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -				 gfn_t end, bool can_yield, bool flush)
> +				 gfn_t end, bool can_yield, bool flush,
> +				 bool zap_private)
>  {
>  	struct kvm_mmu_page *root;
>  
> -	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) {
> +		/* Skip private page table if not requested */
> +		if (!zap_private && is_private_sp(root))
> +			continue;
>  		flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
>  				      false);
> +	}
>  
>  	return flush;
>  }
> @@ -800,7 +886,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>  	int i;
>  
>  	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
> +		flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush, true);
>  
>  	if (flush)
>  		kvm_flush_remote_tlbs(kvm);
> @@ -851,6 +937,13 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  	while (root) {
>  		next_root = next_invalidated_root(kvm, root);
>  
> +		/*
> +		 * Private table is only torn down when VM is destroyed.
> +		 * It is a bug to zap private table here.
> +		 */
> +		if (WARN_ON(is_private_sp(root)))
> +			goto out;
> +
>  		rcu_read_unlock();
>  
>  		flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
> @@ -865,7 +958,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>  
>  		rcu_read_lock();
>  	}
> -
> +out:
>  	rcu_read_unlock();
>  
>  	if (flush)
> @@ -897,9 +990,16 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
>  	struct kvm_mmu_page *root;
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
> -	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
> +	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> +		/*
> +		 * Skip private root since private page table
> +		 * is only torn down when VM is destroyed.
> +		 */
> +		if (is_private_sp(root))
> +			continue;
>  		if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
>  			root->role.invalid = true;
> +	}
>  }
>  
>  /*
> @@ -914,14 +1014,23 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	u64 new_spte;
>  	int ret = RET_PF_FIXED;
>  	bool wrprot = false;
> +	unsigned long pte_access = ACC_ALL;
>  
>  	WARN_ON(sp->role.level != fault->goal_level);
> +
> +	/* TDX shared GPAs are no executable, enforce this for the SDV. */
> +	if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
> +		pte_access &= ~ACC_EXEC_MASK;
> +
>  	if (unlikely(!fault->slot))
> -		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> +		new_spte = make_mmio_spte(vcpu,
> +				tdp_iter_gfn_unalias(vcpu->kvm, iter),
> +				pte_access);
>  	else
> -		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> -					 fault->pfn, iter->old_spte, fault->prefetch, true,
> -					 fault->map_writable, &new_spte);
> +		wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
> +				tdp_iter_gfn_unalias(vcpu->kvm, iter),
> +				fault->pfn, iter->old_spte, fault->prefetch,
> +				true, fault->map_writable, &new_spte);
>  
>  	if (new_spte == iter->old_spte)
>  		ret = RET_PF_SPURIOUS;
> @@ -959,7 +1068,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  }
>  
>  static bool tdp_mmu_populate_nonleaf(
> -	struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
> +	struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool is_private,
> +	bool account_nx)
>  {
>  	struct kvm_mmu_page *sp;
>  	u64 *child_pt;
> @@ -968,7 +1078,7 @@ static bool tdp_mmu_populate_nonleaf(
>  	WARN_ON(is_shadow_present_pte(iter->old_spte));
>  	WARN_ON(is_removed_spte(iter->old_spte));
>  
> -	sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1);
> +	sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1, is_private);
>  	child_pt = sp->spt;
>  
>  	new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);
> @@ -991,6 +1101,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	struct tdp_iter iter;
> +	gfn_t raw_gfn;
> +	bool is_private;
>  	int ret;
>  
>  	kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -999,7 +1111,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  	rcu_read_lock();
>  
> -	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> +	raw_gfn = gpa_to_gfn(fault->addr);
> +	is_private = kvm_is_private_gfn(vcpu->kvm, raw_gfn);
> +
> +	if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn)) {
> +		if (is_private) {
> +			rcu_read_unlock();
> +			return -EFAULT;
> +		}
> +	}
> +
> +	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
>  		if (fault->nx_huge_page_workaround_enabled)
>  			disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>  
> @@ -1015,6 +1137,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  		    is_large_pte(iter.old_spte)) {
>  			if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>  				break;
> +			/*
> +			 * TODO: large page support.
> +			 * Doesn't support large page for TDX now
> +			 */
> +			WARN_ON(is_private_spte(&iter.old_spte));
> +
>  
>  			/*
>  			 * The iter must explicitly re-read the spte here
> @@ -1037,7 +1165,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  
>  			account_nx = fault->huge_page_disallowed &&
>  				fault->req_level >= iter.level;
> -			if (!tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
> +			if (!tdp_mmu_populate_nonleaf(
> +					vcpu, &iter, is_private, account_nx))
>  				break;
>  		}
>  	}
> @@ -1058,9 +1187,12 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>  {
>  	struct kvm_mmu_page *root;
>  
> -	for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> +	for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) {
> +		if (is_private_sp(root))
> +			continue;
>  		flush = zap_gfn_range(kvm, root, range->start, range->end,
> -				      range->may_block, flush, false);
> +				range->may_block, flush, false);
> +	}
>  
>  	return flush;
>  }
> @@ -1513,10 +1645,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	gfn_t gfn = addr >> PAGE_SHIFT;
>  	int leaf = -1;
> +	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
>  
>  	*root_level = vcpu->arch.mmu->shadow_root_level;
>  
> -	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> +	if (WARN_ON(is_private))
> +		return leaf;
> +
> +	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
>  		leaf = iter.level;
>  		sptes[leaf] = iter.old_spte;
>  	}
> @@ -1542,12 +1678,16 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
>  	struct kvm_mmu *mmu = vcpu->arch.mmu;
>  	gfn_t gfn = addr >> PAGE_SHIFT;
>  	tdp_ptep_t sptep = NULL;
> +	bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
>  
> -	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> +	if (is_private)
> +		goto out;
> +
> +	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
>  		*spte = iter.old_spte;
>  		sptep = iter.sptep;
>  	}
> -
> +out:
>  	/*
>  	 * Perform the rcu_dereference to get the raw spte pointer value since
>  	 * we are passing it up to fast_page_fault, which is shared with the
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3899004a5d91..7c62f694a465 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -5,7 +5,7 @@
>  
>  #include <linux/kvm_host.h>
>  
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
>  
>  __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
>  						     struct kvm_mmu_page *root)
> @@ -20,11 +20,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			  bool shared);
>  
>  bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -				 gfn_t end, bool can_yield, bool flush);
> +				 gfn_t end, bool can_yield, bool flush,
> +				 bool zap_private);
>  static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> -					     gfn_t start, gfn_t end, bool flush)
> +					     gfn_t start, gfn_t end, bool flush,
> +					     bool zap_private)
>  {
> -	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> +	return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
> +			zap_private);
>  }
>  static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> @@ -41,7 +44,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  	 */
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  	return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
> -					   sp->gfn, end, false, false);
> +					   sp->gfn, end, false, false, false);
>  }
>  
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ae3bf553f215..d4e117f5b5b9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -190,6 +190,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>  
>  	return true;
>  }
> +EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);
>  
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put()


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid
  2022-04-06  1:55       ` Kai Huang
@ 2022-04-07  1:00         ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  1:00 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

> 
> > 
> > Also export the global TDX private host key id that is used to encrypt TDX
> > module, its memory and some dynamic data (e.g. TDR).  
> > 

Sorry, I was replying too quickly.

This sentence is not correct.  Hardware doesn't use the global KeyID to encrypt
the TDX module itself.  In the current generation of TDX, the global KeyID is
used to encrypt TDX memory metadata (PAMTs) and TDRs.


> > When VMM releasing
> > encrypted page to reuse it, the page needs to be flushed with the used host
> > key id.  VMM needs the global TDX private host key id to flush such pages
> > TDX module accesses with the global TDX private host key id.
> > 
> > 
> 
> Fine to me.
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  2022-04-06  1:54     ` Xiaoyao Li
@ 2022-04-07  1:07       ` Kai Huang
  2022-04-07  1:17         ` Xiaoyao Li
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-07  1:07 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Wed, 2022-04-06 at 09:54 +0800, Xiaoyao Li wrote:
> On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
> > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > Implement a VM-scoped subcomment to get system-wide parameters.  Although
> > > this is system-wide parameters not per-VM, this subcomand is VM-scoped
> > > because
> > > - Device model needs TDX system-wide parameters after creating KVM VM.
> > > - This subcommands requires to initialize TDX module.  For lazy
> > >    initialization of the TDX module, vm-scope ioctl is better.
> > 
> > Since there was agreement to install the TDX module on load, please 
> > place this ioctl on the /dev/kvm file descriptor.
> > 
> > At least for SEV, there were cases where the system-wide parameters are 
> > needed outside KVM, so it's better to avoid requiring a VM file descriptor.
> 
> I don't have strong preference on KVM-scope ioctl or VM-scope.
> 
> Initially, we made it KVM-scope and change it to VM-scope in this 
> version. Yes, it returns the info from TDX module, which doesn't vary 
> per VM. However, what if we want to return different capabilities 
> (software controlled capabilities) per VM? 
> 

In this case, you don't return different capabilities; instead, you return the
same capabilities but control them on a per-VM basis.

> Part of the TDX capabilities 
> serves like get_supported_cpuid, making it KVM wide lacks the 
> flexibility to return differentiated capabilities for different TDs.
> 
> 
> > Thanks,
> > 
> > Paolo
> > 
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit
  2022-04-06 20:50   ` Sagi Shahar
@ 2022-04-07  1:09     ` Xiaoyao Li
  0 siblings, 0 replies; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-07  1:09 UTC (permalink / raw)
  To: Sagi Shahar, Yamahata, Isaku
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Erdem Aktas, Connor Kuehl, Sean Christopherson

On 4/7/2022 4:50 AM, Sagi Shahar wrote:
> On Fri, Mar 4, 2022 at 12:23 PM <isaku.yamahata@intel.com> wrote:
>>
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> On EPT violation, call a common function, __vmx_handle_ept_violation() to
>> trigger x86 MMU code.  On EPT misconfiguration, exit to ring 3 with
>> KVM_EXIT_UNKNOWN.  because EPT misconfiguration can't happen as MMIO is
>> trigged by TDG.VP.VMCALL. No point to set a misconfiguration value for the
>> fast path.
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> ---
>>   arch/x86/kvm/vmx/tdx.c | 40 ++++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 38 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
>> index 6fbe89bcfe1e..2c35dcad077e 100644
>> --- a/arch/x86/kvm/vmx/tdx.c
>> +++ b/arch/x86/kvm/vmx/tdx.c
>> @@ -1081,6 +1081,40 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
>>          __vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
>>   }
>>
>> +#define TDX_SEPT_PFERR (PFERR_WRITE_MASK | PFERR_USER_MASK)
> 
> TDX_SEPT_PFERR is defined using PFERR_.* bitmask but
> __vmx_handle_ept_violation is accepting an EPT_VIOLATION_.* bitmask.
> so (PFERR_WRITE_MASK | PFERR_USER_MASK) will get interpreted as
> (EPT_VIOLATION_ACC_WRITE | EPT_VIOLATION_ACC_INSTR) which will get
> translated to (PFERR_WRITE_MASK | PFERR_FETCH_MASK). Was that the
> intention of this code?

No. It's a mistake. We have corrected it internally; you can find the corrected
code in the github repo or see it in the next version.




^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  2022-04-07  1:07       ` Kai Huang
@ 2022-04-07  1:17         ` Xiaoyao Li
  2022-04-08  0:58           ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-07  1:17 UTC (permalink / raw)
  To: Kai Huang, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/7/2022 9:07 AM, Kai Huang wrote:
> On Wed, 2022-04-06 at 09:54 +0800, Xiaoyao Li wrote:
>> On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
>>> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> >>>> Implement a VM-scoped subcommand to get system-wide parameters.  Although
> >>>> these are system-wide parameters, not per-VM, this subcommand is VM-scoped
>>>> because
>>>> - Device model needs TDX system-wide parameters after creating KVM VM.
> >>>> - This subcommand requires the TDX module to be initialized.  For lazy
>>>>     initialization of the TDX module, vm-scope ioctl is better.
>>>
>>> Since there was agreement to install the TDX module on load, please
>>> place this ioctl on the /dev/kvm file descriptor.
>>>
>>> At least for SEV, there were cases where the system-wide parameters are
>>> needed outside KVM, so it's better to avoid requiring a VM file descriptor.
>>
>> I don't have strong preference on KVM-scope ioctl or VM-scope.
>>
>> Initially, we made it KVM-scope and change it to VM-scope in this
>> version. Yes, it returns the info from TDX module, which doesn't vary
>> per VM. However, what if we want to return different capabilities
>> (software controlled capabilities) per VM?
>>
> 
> In this case, you don't return different capabilities; instead, you return the
> same capabilities but control them on a per-VM basis.

Yes, so I'm not arguing it or insisting on per-VM.

I just wanted to voice my concern, since it's user ABI.

>> Part of the TDX capabilities
>> serves like get_supported_cpuid, making it KVM wide lacks the
>> flexibility to return differentiated capabilities for different TDs.
>>
>>
>>> Thanks,
>>>
>>> Paolo
>>>
>>
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-04-05 12:58   ` Paolo Bonzini
@ 2022-04-07  1:29     ` Xiaoyao Li
  2022-04-07  1:51       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-07  1:29 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/5/2022 8:58 PM, Paolo Bonzini wrote:
> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
>> +    td_params->attributes = init_vm->attributes;
>> +    if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
>> +        pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
>> +            "host perf registers properly.\n");
>> +        return -EOPNOTSUPP;
>> +    }
> 
> Why does KVM have to hardcode this (and LBR/AMX below)?  Is the level of 
> hardware support available from tdx_caps, for example through the CPUID 
> configs (0xA for this one, 0xD for LBR and AMX)?

It's wrong code. PMU is allowed.

AMX and LBR are disallowed because, at the time we wrote the code, they
were not supported by KVM. Now AMX should be allowed, but (arch-)LBR
should still be blocked until KVM merges arch-LBR support.

>> +    /* PT can be exposed to TD guest regardless of KVM's XSS support */
>> +    guest_supported_xss &= (supported_xss | XFEATURE_MASK_PT);
>> +    td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
>> +    if (td_params->xfam & TDX_TD_XFAM_LBR) {
>> +        pr_warn("TD doesn't support LBR. KVM needs to save/restore "
>> +            "IA32_LBR_DEPTH properly.\n");
>> +        return -EOPNOTSUPP;
>> +    }
>> +
>> +    if (td_params->xfam & TDX_TD_XFAM_AMX) {
>> +        pr_warn("TD doesn't support AMX. KVM needs to save/restore "
>> +            "IA32_XFD, IA32_XFD_ERR properly.\n");
>> +        return -EOPNOTSUPP;
>> +    }
> 
>>
>> +    if (init_vm->tsc_khz)
>> +        guest_tsc_khz = init_vm->tsc_khz;
>> +    else
>> +        guest_tsc_khz = max_tsc_khz;
> 
> You can just use kvm->arch.default_tsc_khz in the latest kvm/queue.

yes. will change it.


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 054/104] KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping
  2022-03-04 19:49 ` [RFC PATCH v5 054/104] KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping isaku.yamahata
@ 2022-04-07  1:43   ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  1:43 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> SPTE_PRIVATE_PROHIBIT specifies whether the shared or private GPA is allowed or not.
> It needs to be kept across zapping of the EPT entry.  Currently the EPT entry is
> initialized to shadow_init_value unconditionally, which clears the
> SPTE_PRIVATE_PROHIBIT bit.  To carry the SPTE_PRIVATE_PROHIBIT bit, introduce a
> helper function to get the initial value for a zapped entry with the
> SPTE_PRIVATE_PROHIBIT bit.  Replace shadow_init_value with it.

Isn't it better to merge patches 53-55, especially 54-55, together?

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 19 +++++++++++++++----
>  1 file changed, 15 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 1949f81027a0..6d750563824d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -610,6 +610,12 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	return true;
>  }
>  
> +static u64 shadow_init_spte(u64 old_spte)
> +{
> +	return shadow_init_value |
> +		(is_private_prohibit_spte(old_spte) ? SPTE_PRIVATE_PROHIBIT : 0);
> +}
> +
>  static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  					   struct tdp_iter *iter)
>  {
> @@ -641,7 +647,8 @@ static inline bool tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  	 * shadow_init_value (which sets "suppress #VE" bit) so it
>  	 * can be set when EPT table entries are zapped.
>  	 */
> -	WRITE_ONCE(*rcu_dereference(iter->sptep), shadow_init_value);
> +	WRITE_ONCE(*rcu_dereference(iter->sptep),
> +		shadow_init_spte(iter->old_spte));
>  
>  	return true;
>  }

In this and the next patch (54-55), in all the code paths you already have
iter->sptep, from which you can get the sp->private_sp and check it using
is_private_sp().  Why do we need this SPTE_PRIVATE_PROHIBIT bit?
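
A minimal sketch of that alternative (is_private_sp() and private_sp are the
helper/field introduced earlier in this series, sptep_to_sp() is existing KVM
MMU code; this assumes a private_sp never contains shared mappings):

	static bool iter_maps_private_gpa(struct tdp_iter *iter)
	{
		/*
		 * Derive private vs. shared from the shadow page containing
		 * the SPTE instead of carrying a software bit in every SPTE.
		 */
		struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(iter->sptep));

		return is_private_sp(sp);
	}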

Are you suggesting we can have mixed private/shared mapping under a private_sp?

> @@ -853,7 +860,8 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>  
>  		if (!shared) {
>  			/* see comments in tdp_mmu_zap_spte_atomic() */
> -			tdp_mmu_set_spte(kvm, &iter, shadow_init_value);
> +			tdp_mmu_set_spte(kvm, &iter,
> +					shadow_init_spte(iter.old_spte));
>  			flush = true;
>  		} else if (!tdp_mmu_zap_spte_atomic(kvm, &iter)) {
>  			/*
> @@ -1038,11 +1046,14 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  		new_spte = make_mmio_spte(vcpu,
>  				tdp_iter_gfn_unalias(vcpu->kvm, iter),
>  				pte_access);
> -	else
> +	else {
>  		wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
>  				tdp_iter_gfn_unalias(vcpu->kvm, iter),
>  				fault->pfn, iter->old_spte, fault->prefetch,
>  				true, fault->map_writable, &new_spte);
> +		if (is_private_prohibit_spte(iter->old_spte))
> +			new_spte |= SPTE_PRIVATE_PROHIBIT;
> +	}
>  
>  	if (new_spte == iter->old_spte)
>  		ret = RET_PF_SPURIOUS;
> @@ -1335,7 +1346,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>  	 * invariant that the PFN of a present * leaf SPTE can never change.
>  	 * See __handle_changed_spte().
>  	 */
> -	tdp_mmu_set_spte(kvm, iter, shadow_init_value);
> +	tdp_mmu_set_spte(kvm, iter, shadow_init_spte(iter->old_spte));
>  
>  	if (!pte_write(range->pte)) {
>  		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 058/104] KVM: x86/mmu: Focibly use TDP MMU for TDX
  2022-03-04 19:49 ` [RFC PATCH v5 058/104] KVM: x86/mmu: Focibly use TDP MMU for TDX isaku.yamahata
@ 2022-04-07  1:49   ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  1:49 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> At this point, TDX supports the TDP MMU and doesn't support the legacy MMU.
> Forcibly use the TDP MMU for TDX regardless of the kernel parameter that
> disables the TDP MMU.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index b33ace3d4456..9df6aa4da202 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -16,7 +16,12 @@ module_param_named(tdp_mmu, tdp_mmu_enabled, bool, 0644);
>  /* Initializes the TDP MMU for the VM, if enabled. */
>  bool kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  {
> -	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
> +	/*
> +	 *  Because TDX supports only TDP MMU, forcibly use TDP MMU in the case
> +	 *  of TDX.
> +	 */
> +	if (kvm->arch.vm_type != KVM_X86_TDX_VM &&
> +		(!tdp_enabled || !READ_ONCE(tdp_mmu_enabled)))
>  		return false;
>  
>  	/* This should not be changed for the lifetime of the VM. */

Please move this patch forward, before introducing any private/shared mapping
support; otherwise nothing prevents you from creating a TD against the legacy MMU,
which is broken (especially since you have already allowed userspace to create a TD
in patch 10 "KVM: TDX: Make TDX VM type supported").

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-04-07  1:29     ` Xiaoyao Li
@ 2022-04-07  1:51       ` Kai Huang
  2022-04-08  3:33         ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-07  1:51 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Thu, 2022-04-07 at 09:29 +0800, Xiaoyao Li wrote:
> On 4/5/2022 8:58 PM, Paolo Bonzini wrote:
> > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > +    td_params->attributes = init_vm->attributes;
> > > +    if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> > > +        pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
> > > +            "host perf registers properly.\n");
> > > +        return -EOPNOTSUPP;
> > > +    }
> > 
> > Why does KVM have to hardcode this (and LBR/AMX below)?  Is the level of 
> > hardware support available from tdx_caps, for example through the CPUID 
> > configs (0xA for this one, 0xD for LBR and AMX)?
> 
> It's wrong code. PMU is allowed.
> 
> AMX and LBR are disallowed because, at the time we wrote the code, they
> were not supported by KVM. Now AMX should be allowed, but (arch-)LBR
> should still be blocked until KVM merges arch-LBR support.

I think Isaku's idea is that we don't support them in the first submission?

If so, as I suggested, we should add a TODO comment.
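
For illustration only, the TODO could look something like this (the wording is
mine, not from the series; the check itself is the one quoted above):

	/*
	 * TODO: LBR is rejected only because KVM doesn't yet save/restore
	 * IA32_LBR_DEPTH across TD entry/exit.  Lift this restriction once
	 * arch-LBR support is merged into KVM.
	 */
	if (td_params->xfam & TDX_TD_XFAM_LBR)
		return -EOPNOTSUPP;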


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 051/104] KVM: TDX: TDP MMU TDX support
  2022-03-04 19:49 ` [RFC PATCH v5 051/104] KVM: TDX: TDP MMU TDX support isaku.yamahata
@ 2022-04-07  2:20   ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  2:20 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Implement hooks of the TDP MMU for the TDX backend: TLB flush, TLB shootdown,
> propagating a changed private EPT entry to the Secure EPT, and freeing a Secure
> EPT page.
> 
> TLB flush handles both shared EPT and private EPT.  It flushes shared EPT
> same as VMX.  It also waits for the TDX TLB shootdown.
> 
> For the hook to free Secure EPT page, unlinks the Secure EPT page from the
> Secure EPT so that the page can be freed to OS.
> 
> Propagating the entry change to Secure EPT.  The possible entry changes are
> present -> non-present(zapping) and non-present -> present(population).  On
> population just link the Secure EPT page or the private guest page to the
> Secure EPT by TDX SEAMCALL.
> 
> Because TDP MMU allows concurrent zapping/population, zapping requires
> synchronous TLB shootdown with the frozen EPT entry.  It zaps the secure
> entry, increments TLB counter, sends IPI to remote vcpus to trigger TLB
> flush, and then unlinks the private guest page from the Secure EPT.
> 
> For simplicity, batched zapping with exclude lock is handled as concurrent
> zapping.  Although it's inefficient, it can be optimized in the future.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    |  40 +++++-
>  arch/x86/kvm/vmx/tdx.c     | 246 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h     |  14 +++
>  arch/x86/kvm/vmx/tdx_ops.h |   3 +
>  arch/x86/kvm/vmx/x86_ops.h |   2 +
>  5 files changed, 301 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 6969e3557bd4..f571b07c2aae 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,38 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	return vmx_vcpu_reset(vcpu, init_event);
>  }
>  
> +static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_flush_tlb(vcpu);
> +
> +	vmx_flush_tlb_all(vcpu);
> +}
> +
> +static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_flush_tlb(vcpu);
> +
> +	vmx_flush_tlb_current(vcpu);
> +}
> +
> +static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
> +{
> +	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
> +		return;
> +
> +	vmx_flush_tlb_gva(vcpu, addr);
> +}
> +
> +static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_flush_tlb_guest(vcpu);
> +}
> +
>  static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  			int pgd_level)
>  {
> @@ -162,10 +194,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.set_rflags = vmx_set_rflags,
>  	.get_if_flag = vmx_get_if_flag,
>  
> -	.tlb_flush_all = vmx_flush_tlb_all,
> -	.tlb_flush_current = vmx_flush_tlb_current,
> -	.tlb_flush_gva = vmx_flush_tlb_gva,
> -	.tlb_flush_guest = vmx_flush_tlb_guest,
> +	.tlb_flush_all = vt_flush_tlb_all,
> +	.tlb_flush_current = vt_flush_tlb_current,
> +	.tlb_flush_gva = vt_flush_tlb_gva,
> +	.tlb_flush_guest = vt_flush_tlb_guest,
>  
>  	.vcpu_pre_run = vmx_vcpu_pre_run,
>  	.run = vmx_vcpu_run,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 51098e10b6a0..5d74ae001e4f 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -5,7 +5,9 @@
>  
>  #include "capabilities.h"
>  #include "x86_ops.h"
> +#include "mmu.h"
>  #include "tdx.h"
> +#include "vmx.h"
>  #include "x86.h"
>  
>  #undef pr_fmt
> @@ -272,6 +274,15 @@ int tdx_vm_init(struct kvm *kvm)
>  	int ret, i;
>  	u64 err;
>  
> +	/*
> +	 * To generate EPT violation to inject #VE instead of EPT MISCONFIG,
> +	 * set RWX=0.
> +	 */
> +	kvm_mmu_set_mmio_spte_mask(kvm, 0, VMX_EPT_RWX_MASK, 0);

I literally spent a couple of minutes looking for this chunk while I was looking
at the patch "KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis".

> +
> +	/* TODO: Enable 2mb and 1gb large page support. */
> +	kvm->arch.tdp_max_page_level = PG_LEVEL_4K;

Why don't you move this chunk before the MMU code changes, where you declared
many times in the code that large pages are not supported?

> +
>  	/* vCPUs can't be created until after KVM_TDX_INIT_VM. */
>  	kvm->max_vcpus = 0;
>  
> @@ -331,6 +342,8 @@ int tdx_vm_init(struct kvm *kvm)
>  		tdx_mark_td_page_added(&kvm_tdx->tdcs[i]);
>  	}
>  
> +	spin_lock_init(&kvm_tdx->seamcall_lock);
> +
>  	/*
>  	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
>  	 * ioctl() to define the configure CPUID values for the TD.
> @@ -501,6 +514,220 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
>  }
>  
> +static void __tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> +					enum pg_level level, kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	hpa_t hpa = pfn_to_hpa(pfn);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	if (WARN_ON_ONCE(is_error_noslot_pfn(pfn) || kvm_is_reserved_pfn(pfn)))
> +		return;
> +
> +	/* TODO: handle large pages. */
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +		return;
> +
> +	/* Pin the page, TDX KVM doesn't yet support page migration. */
> +	get_page(pfn_to_page(pfn));

I think some of the MMU code changes have the logic that private mappings are
not zapped during the VM's runtime.  That logic depends on the page being pinned,
which you are doing here.

> +
> +	if (likely(is_td_finalized(kvm_tdx))) {
> +		err = tdh_mem_page_aug(kvm_tdx->tdr.pa, gpa, hpa, &out);
> +		if (KVM_BUG_ON(err, kvm))
> +			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
> +		return;
> +	}
> +}
> +
> +static void tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
> +				      enum pg_level level, kvm_pfn_t pfn)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +
> +	spin_lock(&kvm_tdx->seamcall_lock);
> +	__tdx_sept_set_private_spte(kvm, gfn, level, pfn);
> +	spin_unlock(&kvm_tdx->seamcall_lock);
> +}
> +
> +static void tdx_sept_drop_private_spte(
> +	struct kvm *kvm, gfn_t gfn, enum pg_level level, kvm_pfn_t pfn)
> +{
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	hpa_t hpa = pfn_to_hpa(pfn);
> +	hpa_t hpa_with_hkid;
> +	struct tdx_module_output out;
> +	u64 err = 0;
> +
> +	/* TODO: handle large pages. */
> +	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
> +		return;
> +
> +	spin_lock(&kvm_tdx->seamcall_lock);
> +	if (is_hkid_assigned(kvm_tdx)) {
> +		err = tdh_mem_page_remove(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
> +		if (KVM_BUG_ON(err, kvm)) {
> +			pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
> +			goto unlock;
> +		}
> +
> +		hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
> +		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> +			goto unlock;
> +		}
> +	} else
> +		err = tdx_reclaim_page((unsigned long)__va(hpa), hpa);
> +
> +unlock:
> +	spin_unlock(&kvm_tdx->seamcall_lock);
> +
> +	if (!err)
> +		put_page(pfn_to_page(pfn));
> +}
> +
> +static int tdx_sept_link_private_sp(struct kvm *kvm, gfn_t gfn,
> +				    enum pg_level level, void *sept_page)
> +{
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	hpa_t hpa = __pa(sept_page);
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	spin_lock(&kvm_tdx->seamcall_lock);
> +	err = tdh_mem_sept_add(kvm_tdx->tdr.pa, gpa, tdx_level, hpa, &out);
> +	spin_unlock(&kvm_tdx->seamcall_lock);
> +	if (KVM_BUG_ON(err, kvm)) {
> +		pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +static void tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
> +				      enum pg_level level)
> +{
> +	int tdx_level = pg_level_to_tdx_sept_level(level);
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	gpa_t gpa = gfn_to_gpa(gfn);
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	spin_lock(&kvm_tdx->seamcall_lock);
> +	err = tdh_mem_range_block(kvm_tdx->tdr.pa, gpa, tdx_level, &out);
> +	spin_unlock(&kvm_tdx->seamcall_lock);
> +	if (KVM_BUG_ON(err, kvm))
> +		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
> +}
> +
> +static int tdx_sept_free_private_sp(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +				    void *sept_page)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	int ret;
> +
> +	/*
> +	 * free_private_sp() is (obviously) called when a shadow page is being
> +	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
> +	 */

I seem to remember you once said private memory can be zapped when a memory
slot is moved/deleted?

> +	if (KVM_BUG_ON(is_hkid_assigned(to_kvm_tdx(kvm)), kvm))
> +		return -EINVAL;
> +
> +	spin_lock(&kvm_tdx->seamcall_lock);
> +	ret = tdx_reclaim_page((unsigned long)sept_page, __pa(sept_page));
> +	spin_unlock(&kvm_tdx->seamcall_lock);
> +
> +	return ret;
> +}
> +
> +static int tdx_sept_tlb_remote_flush(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx;
> +	u64 err;
> +
> +	if (!is_td(kvm))
> +		return -EOPNOTSUPP;
> +
> +	kvm_tdx = to_kvm_tdx(kvm);
> +	if (!is_hkid_assigned(kvm_tdx))
> +		return 0;
> +
> +	/* If TD isn't finalized, it's before any vcpu running. */
> +	if (unlikely(!is_td_finalized(kvm_tdx)))
> +		return 0;
> +
> +	kvm_tdx->tdh_mem_track = true;
> +
> +	kvm_make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH);
> +
> +	err = tdh_mem_track(kvm_tdx->tdr.pa);
> +	if (KVM_BUG_ON(err, kvm))
> +		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
> +
> +	WRITE_ONCE(kvm_tdx->tdh_mem_track, false);
> +
> +	return 0;

The whole TLB flush mechanism definitely needs more explanation, in either the
commit message or comments.  How can people understand this magic with so little
information?

> +}
> +
> +static void tdx_handle_changed_private_spte(
> +	struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +	kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
> +	kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page)
> +{
> +	WARN_ON(!is_td(kvm));
> +	lockdep_assert_held(&kvm->mmu_lock);
> +
> +	if (is_present) {
> +		/* TDP MMU doesn't change present -> present */
> +		WARN_ON(was_present);
> +
> +		/*
> +		 * Use different call to either set up middle level
> +		 * private page table, or leaf.
> +		 */
> +		if (is_leaf)
> +			tdx_sept_set_private_spte(kvm, gfn, level, new_pfn);
> +		else {
> +			WARN_ON(!sept_page);
> +			if (tdx_sept_link_private_sp(kvm, gfn, level, sept_page))
> +				/* failed to update Secure-EPT.  */
> +				WARN_ON(1);
> +		}
> +	} else if (was_leaf) {
> +		/* non-present -> non-present doesn't make sense. */
> +		WARN_ON(!was_present);
> +
> +		/*
> +		 * Zap private leaf SPTE.  Zapping private table is done
> +		 * below in handle_removed_tdp_mmu_page().
> +		 */
> +		tdx_sept_zap_private_spte(kvm, gfn, level);
> +
> +		/*
> +		 * TDX requires TLB tracking before dropping private page.  Do
> +		 * it here, although it is also done later.
> +		 * If hkid isn't assigned, the guest is destroying and no vcpu
> +		 * runs further.  TLB shootdown isn't needed.
> +		 *
> +		 * TODO: implement with_range version for optimization.
> +		 * kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
> +		 *   => tdx_sept_tlb_remote_flush_with_range(kvm, gfn,
> +		 *                                 KVM_PAGES_PER_HPAGE(level));
> +		 */
> +		if (is_hkid_assigned(to_kvm_tdx(kvm)))
> +			kvm_flush_remote_tlbs(kvm);
> +
> +		tdx_sept_drop_private_spte(kvm, gfn, level, old_pfn);
> +	}
> +}
> +
>  static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  {
>  	struct kvm_tdx_capabilities __user *user_caps;
> @@ -736,6 +963,21 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>  	return ret;
>  }
>  
> +void tdx_flush_tlb(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	struct kvm_mmu *mmu = vcpu->arch.mmu;
> +	u64 root_hpa = mmu->root_hpa;
> +
> +	/* Flush the shared EPTP, if it's valid. */
> +	if (VALID_PAGE(root_hpa))
> +		ept_sync_context(construct_eptp(vcpu, root_hpa,
> +						mmu->shadow_root_level));
> +
> +	while (READ_ONCE(kvm_tdx->tdh_mem_track))
> +		cpu_relax();
> +}
> +
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_tdx_cmd tdx_cmd;
> @@ -901,6 +1143,10 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>  	hkid_start_pos = boot_cpu_data.x86_phys_bits;
>  	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
>  
> +	x86_ops->tlb_remote_flush = tdx_sept_tlb_remote_flush;
> +	x86_ops->free_private_sp = tdx_sept_free_private_sp;
> +	x86_ops->handle_changed_private_spte = tdx_handle_changed_private_spte;
> +
>  	return 0;
>  }
>  
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index b32e068c51b4..906666c7c70b 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -29,9 +29,17 @@ struct kvm_tdx {
>  	struct kvm_cpuid_entry2 cpuid_entries[KVM_MAX_CPUID_ENTRIES];
>  
>  	bool finalized;
> +	bool tdh_mem_track;
>  
>  	u64 tsc_offset;
>  	unsigned long tsc_khz;
> +
> +	/*
> +	 * Lock to prevent seamcalls from running concurrently
> +	 * when TDP MMU is enabled, because TDP fault handler
> +	 * runs concurrently.
> +	 */
> +	spinlock_t seamcall_lock;

Please also explain why relevant SEAMCALLs cannot run concurrently.

>  };
>  
>  struct vcpu_tdx {
> @@ -166,6 +174,12 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
>  	return out.r8;
>  }
>  
> +static __always_inline int pg_level_to_tdx_sept_level(enum pg_level level)
> +{
> +	WARN_ON(level == PG_LEVEL_NONE);
> +	return level - 1;
> +}
> +
>  #else
>  #define enable_tdx false
>  static inline int tdx_module_setup(void) { return -ENODEV; };
> diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
> index dc76b3a5cf96..cb40edc8c245 100644
> --- a/arch/x86/kvm/vmx/tdx_ops.h
> +++ b/arch/x86/kvm/vmx/tdx_ops.h
> @@ -30,12 +30,14 @@ static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
>  static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
>  				struct tdx_module_output *out)
>  {
> +	tdx_clflush_page(hpa);

I think those flush changes can be done together when the tdh_mem_xx helpers are
introduced, with a single explanation of why the flush is needed.  You really don't
need to do each of them in a separate patch.

>  	return kvm_seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, 0, out);
>  }
>  
>  static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
>  				struct tdx_module_output *out)
>  {
> +	tdx_clflush_page(page);
>  	return kvm_seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, 0, out);
>  }
>  
> @@ -48,6 +50,7 @@ static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
>  static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
>  				struct tdx_module_output *out)
>  {
> +	tdx_clflush_page(hpa);
>  	return kvm_seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, 0, out);
>  }
>  
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index ad9b1c883761..922a3799336e 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -144,6 +144,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>  
> +void tdx_flush_tlb(struct kvm_vcpu *vcpu);
>  void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
>  #else
>  static inline void tdx_pre_kvm_init(
> @@ -163,6 +164,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>  
> +static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
>  #endif
>  


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 060/104] KVM: TDX: Create initial guest memory
  2022-03-04 19:49 ` [RFC PATCH v5 060/104] KVM: TDX: Create initial guest memory isaku.yamahata
@ 2022-04-07  2:30   ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  2:30 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> +	/*
> +	 * In case of TDP MMU, fault handler can run concurrently.  Note
> +	 * 'source_pa' is a TD scope variable, meaning if there are multiple
> +	 * threads reaching here with all needing to access 'source_pa', it
> +	 * will break.  However fortunately this won't happen, because below
> +	 * TDH_MEM_PAGE_ADD code path is only used when VM is being created
> +	 * before it is running, using KVM_TDX_INIT_MEM_REGION ioctl (which
> +	 * always uses vcpu 0's page table and protected by vcpu->mutex).
> +	 */
> +	WARN_ON(kvm_tdx->source_pa == INVALID_PAGE);

We can just KVM_BUG_ON() and return here.
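
In code, the suggestion amounts to something like this (a sketch only, assuming
the enclosing helper returns void as in the quoted hunk):

	if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm))
		return;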

> +	source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
> +
> +	err = tdh_mem_page_add(kvm_tdx->tdr.pa, gpa, hpa, source_pa, &out);
> +	if (KVM_BUG_ON(err, kvm))
> +		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
> +	else if ((kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION))
> +		tdx_measure_page(kvm_tdx, gpa);
> +
> +	kvm_tdx->source_pa = INVALID_PAGE;


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-04-06 11:06   ` Kai Huang
@ 2022-04-07  3:05     ` Kai Huang
  2022-04-08 19:12     ` Isaku Yamahata
  1 sibling, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  3:05 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Wed, 2022-04-06 at 23:06 +1200, Kai Huang wrote:
> >   void kvm_mmu_reset_all_pte_masks(void)
> >   {
> >   	u8 low_phys_bits;
> > -	u64 mask;
> >   
> >   	shadow_phys_bits = kvm_get_shadow_phys_bits();
> >   
> > @@ -389,9 +383,13 @@ void kvm_mmu_reset_all_pte_masks(void)
> >   	 * PTEs and so the reserved PA approach must be disabled.
> >   	 */
> >   	if (shadow_phys_bits < 52)
> > -		mask = BIT_ULL(51) | PT_PRESENT_MASK;
> > +		shadow_default_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
> 
> Hmm...  Not related to this patch, but it seems there's a bug here.  On a
> MKTME
> enabled system (but not TDX) with 52 physical bits, the shadow_phys_bits will
> be
> set to < 52 (depending on how many MKTME KeyIDs are configured by BIOS).  In
> this case, bit 51 is set, but actually bit 51 isn't a reserved bit in this
> case.
> Instead, it is a MKTME KeyID bit.  Therefore, above setting won't cause #PF,
> but
> will use a non-zero MKTME keyID to access the physical address.
> 
> Paolo/Sean, any comments here?

After looking at the code more carefully, this is not correct.  shadow_phys_bits
will be 52 on a MKTME-enabled system.  Please ignore this.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
  2022-03-04 19:49 ` [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers isaku.yamahata
@ 2022-04-07  4:06   ` Kai Huang
  2022-04-15 14:50   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07  4:06 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX defines ABI for the TDX guest to call hypercall with TDG.VP.VMCALL API.
> To get hypercall arguments and to set return values, add accessors to guest
> vcpu registers.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 35 +++++++++++++++++++++++++++++++++++
>  1 file changed, 35 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index dc83414cb72a..8695836ce796 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -88,6 +88,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
>  	return kvm_r9_read(vcpu);
>  }
>  
> +#define BUILD_TDVMCALL_ACCESSORS(param, gpr)					\
> +static __always_inline								\
> +unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)			\
> +{										\
> +	return kvm_##gpr##_read(vcpu);						\
> +}										\
> +static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu,	\
> +						     unsigned long val)		\
> +{										\
> +	kvm_##gpr##_write(vcpu, val);						\
> +}
> +BUILD_TDVMCALL_ACCESSORS(p1, r12);
> +BUILD_TDVMCALL_ACCESSORS(p2, r13);
> +BUILD_TDVMCALL_ACCESSORS(p3, r14);
> +BUILD_TDVMCALL_ACCESSORS(p4, r15);

Are they needed? Do those helpers provide more information than just using
kvm_{reg}_read/write()?

> +
> +static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_r10_read(vcpu);
> +}
> +static __always_inline unsigned long tdvmcall_exit_reason(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_r11_read(vcpu);
> +}
> +static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
> +						     long val)
> +{
> +	kvm_r10_write(vcpu, val);
> +}
> +static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
> +						    unsigned long val)
> +{
> +	kvm_r11_write(vcpu, val);
> +}
> +
>  static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
>  {
>  	return tdx->tdvpr.added;



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-03-04 19:49 ` [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
@ 2022-04-07  4:15   ` Kai Huang
  2022-04-07 13:14     ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-07  4:15 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> The TDX module specification defines TDG.VP.VMCALL API (TDVMCALL for short)
> for the guest TD to call hypercall to VMM.  When the guest TD issues
> TDG.VP.VMCALL, the guest TD exits to VMM with a new exit reason of
> TDVMCALL.  The arguments from the guest TD and returned values from the VMM
> are passed in the guest registers.  The guest RCX registers indicates which
> registers are used.
> 
> Define the TDVMCALL exit reason, which is carved out from the VMX exit
> reason namespace as the TDVMCALL exit from TDX guest to TDX-SEAM is really
> just a VM-Exit.  Add a place holder to handle TDVMCALL exit.
> 
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/uapi/asm/vmx.h |  4 +++-
>  arch/x86/kvm/vmx/tdx.c          | 27 ++++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/tdx.h          | 13 +++++++++++++
>  3 files changed, 42 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
> index 3d9b4598e166..cb0a0565219a 100644
> --- a/arch/x86/include/uapi/asm/vmx.h
> +++ b/arch/x86/include/uapi/asm/vmx.h
> @@ -92,6 +92,7 @@
>  #define EXIT_REASON_UMWAIT              67
>  #define EXIT_REASON_TPAUSE              68
>  #define EXIT_REASON_BUS_LOCK            74
> +#define EXIT_REASON_TDCALL              77
>  
>  #define VMX_EXIT_REASONS \
>  	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
> @@ -154,7 +155,8 @@
>  	{ EXIT_REASON_XRSTORS,               "XRSTORS" }, \
>  	{ EXIT_REASON_UMWAIT,                "UMWAIT" }, \
>  	{ EXIT_REASON_TPAUSE,                "TPAUSE" }, \
> -	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }
> +	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }, \
> +	{ EXIT_REASON_TDCALL,                "TDCALL" }
>  
>  #define VMX_EXIT_REASON_FLAGS \
>  	{ VMX_EXIT_REASONS_FAILED_VMENTRY,	"FAILED_VMENTRY" }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 8695836ce796..86daafd9eec0 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -780,7 +780,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
>  					struct vcpu_tdx *tdx)
>  {
>  	guest_enter_irqoff();
> -	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs, 0);
> +	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr.pa, vcpu->arch.regs,
> +					tdx->tdvmcall.regs_mask);
>  	guest_exit_irqoff();
>  }
>  
> @@ -815,6 +816,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>  
>  	if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
>  		return EXIT_FASTPATH_NONE;
> +
> +	if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
> +		tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
> +	else
> +		tdx->tdvmcall.rcx = 0;
>  	return EXIT_FASTPATH_NONE;
>  }
>  
> @@ -859,6 +865,23 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
>  	return 0;
>  }
>  
> +static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (unlikely(tdx->tdvmcall.xmm_mask))
> +		goto unsupported;

Put a comment explaining this logic?

> +
> +	switch (tdvmcall_exit_reason(vcpu)) {

Could we rename tdvmcall_exit_reason() to something like tdvmcall_leaf()?

Btw, why couldn't we merge the previous patch into this one, so we don't have to
look back and forth to figure out exactly what those functions do?


> +	default:
> +		break;
> +	}
> +
> +unsupported:
> +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> +	return 1;
> +}
> +
>  void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>  {
>  	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> @@ -1187,6 +1210,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
>  		return tdx_handle_exception(vcpu);
>  	case EXIT_REASON_EXTERNAL_INTERRUPT:
>  		return tdx_handle_external_interrupt(vcpu);
> +	case EXIT_REASON_TDCALL:
> +		return handle_tdvmcall(vcpu);
>  	case EXIT_REASON_EPT_VIOLATION:
>  		return tdx_handle_ept_violation(vcpu);
>  	case EXIT_REASON_EPT_MISCONFIG:
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 7cd81780f3fa..9e8ed9b3119e 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -86,6 +86,19 @@ struct vcpu_tdx {
>  	/* Posted interrupt descriptor */
>  	struct pi_desc pi_desc;
>  
> +	union {
> +		struct {
> +			union {
> +				struct {
> +					u16 gpr_mask;
> +					u16 xmm_mask;
> +				};
> +				u32 regs_mask;
> +			};
> +			u32 reserved;
> +		};
> +		u64 rcx;
> +	} tdvmcall;
>  	union tdx_exit_reason exit_reason;
>  
>  	bool initialized;


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI
  2022-04-05 15:48   ` Paolo Bonzini
  2022-04-05 17:53     ` Tom Lendacky
@ 2022-04-07 11:09     ` Xiaoyao Li
  2022-04-07 12:12       ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-07 11:09 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson, Tom Lendacky

On 4/5/2022 11:48 PM, Paolo Bonzini wrote:
> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>> +        if (kvm_init_sipi_unsupported(vcpu->kvm))
>> +            /*
>> +             * TDX doesn't support INIT.  Ignore INIT event.  In the
>> +             * case of SIPI, the callback of
>> +             * vcpu_deliver_sipi_vector ignores it.
>> +             */
>>               vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> -        else
>> -            vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> +        else {
>> +            kvm_vcpu_reset(vcpu, true);
>> +            if (kvm_vcpu_is_bsp(apic->vcpu))
>> +                vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>> +            else
>> +                vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>> +        }
> 
> Should you check vcpu->arch.guest_state_protected instead of 
> special-casing TDX? 

We cannot use vcpu->arch.guest_state_protected because TDX supports
debug TDs, whose state is not protected.

At least we need another flag, I think.

> KVM_APIC_INIT is not valid for SEV-ES either, if I 
> remember correctly.
> 
> Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI
  2022-04-07 11:09     ` Xiaoyao Li
@ 2022-04-07 12:12       ` Paolo Bonzini
  2022-04-08  3:40         ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 12:12 UTC (permalink / raw)
  To: Xiaoyao Li, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson, Tom Lendacky

On 4/7/22 13:09, Xiaoyao Li wrote:
> On 4/5/2022 11:48 PM, Paolo Bonzini wrote:
>> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>>> +        if (kvm_init_sipi_unsupported(vcpu->kvm))
>>> +            /*
>>> +             * TDX doesn't support INIT.  Ignore INIT event.  In the
>>> +             * case of SIPI, the callback of
>>> +             * vcpu_deliver_sipi_vector ignores it.
>>> +             */
>>>               vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>> -        else
>>> -            vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>>> +        else {
>>> +            kvm_vcpu_reset(vcpu, true);
>>> +            if (kvm_vcpu_is_bsp(apic->vcpu))
>>> +                vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
>>> +            else
>>> +                vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
>>> +        }
>>
>> Should you check vcpu->arch.guest_state_protected instead of 
>> special-casing TDX? 
> 
> We cannot use vcpu->arch.guest_state_protected because TDX supports 
> debug TD, of which the states are not protected.
> 
> At least we need another flag, I think.

Let's add .deliver_init to the kvm_x86_ops then.
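
A minimal sketch of what that could look like (the callback name comes from the
suggestion above; the TDX side just keeps the "ignore INIT, stay runnable"
behaviour discussed earlier, it is not code from the series):

	/*
	 * In struct kvm_x86_ops:
	 *	void (*deliver_init)(struct kvm_vcpu *vcpu);
	 * VMX/SVM would keep today's behaviour; the TDX callback ignores INIT.
	 */
	static void tdx_deliver_init(struct kvm_vcpu *vcpu)
	{
		/* TDX doesn't support INIT; leave the vCPU runnable. */
		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
	}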

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function
  2022-03-21 18:32   ` Sagi Shahar
  2022-03-23 17:53     ` Isaku Yamahata
@ 2022-04-07 13:12     ` Paolo Bonzini
  2022-04-08  5:34       ` Isaku Yamahata
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 13:12 UTC (permalink / raw)
  To: Sagi Shahar, Yamahata, Isaku
  Cc: kvm, linux-kernel, isaku.yamahata, Jim Mattson, Erdem Aktas,
	Connor Kuehl, Sean Christopherson

On 3/21/22 19:32, Sagi Shahar wrote:
> On Fri, Mar 4, 2022 at 12:00 PM <isaku.yamahata@intel.com> wrote:
>>
>> From: Sean Christopherson <sean.j.christopherson@intel.com>
>>
>> By necessity, TDX will use a different register ABI for hypercalls.
>> Break out the core functionality so that it may be reused for TDX.
>>
>> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> ---
>>   arch/x86/include/asm/kvm_host.h |  4 +++
>>   arch/x86/kvm/x86.c              | 54 ++++++++++++++++++++-------------
>>   2 files changed, 37 insertions(+), 21 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
>> index 8dab9f16f559..33b75b0e3de1 100644
>> --- a/arch/x86/include/asm/kvm_host.h
>> +++ b/arch/x86/include/asm/kvm_host.h
>> @@ -1818,6 +1818,10 @@ void kvm_request_apicv_update(struct kvm *kvm, bool activate,
>>   void __kvm_request_apicv_update(struct kvm *kvm, bool activate,
>>                                  unsigned long bit);
>>
>> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>> +                                     unsigned long a0, unsigned long a1,
>> +                                     unsigned long a2, unsigned long a3,
>> +                                     int op_64_bit);
>>   int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
>>
>>   int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 314ae43e07bf..9acb33a17445 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -9090,26 +9090,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
>>          return kvm_skip_emulated_instruction(vcpu);
>>   }
>>
>> -int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
>> +unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
>> +                                     unsigned long a0, unsigned long a1,
>> +                                     unsigned long a2, unsigned long a3,
>> +                                     int op_64_bit)
>>   {
>> -       unsigned long nr, a0, a1, a2, a3, ret;
>> -       int op_64_bit;
>> -
>> -       if (kvm_xen_hypercall_enabled(vcpu->kvm))
>> -               return kvm_xen_hypercall(vcpu);
>> -
>> -       if (kvm_hv_hypercall_enabled(vcpu))
>> -               return kvm_hv_hypercall(vcpu);

Please keep Xen and Hyper-V hypercalls in kvm_emulate_hypercall (more on
this in the reply to patch 89).  __kvm_emulate_hypercall should only
handle KVM hypercalls.
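
To illustrate the requested split (a rough sketch of the shape, not the actual
patch; the register ABI and is_64_bit_hypercall() shown here are the ones already
used by kvm_emulate_hypercall() today):

	int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
	{
		unsigned long nr, a0, a1, a2, a3, ret;
		int op_64_bit;

		/* Xen and Hyper-V dispatch stays in the public entry point... */
		if (kvm_xen_hypercall_enabled(vcpu->kvm))
			return kvm_xen_hypercall(vcpu);
		if (kvm_hv_hypercall_enabled(vcpu))
			return kvm_hv_hypercall(vcpu);

		nr = kvm_rax_read(vcpu);
		a0 = kvm_rbx_read(vcpu);
		a1 = kvm_rcx_read(vcpu);
		a2 = kvm_rdx_read(vcpu);
		a3 = kvm_rsi_read(vcpu);
		op_64_bit = is_64_bit_hypercall(vcpu);

		/* ...while the helper handles only KVM hypercalls. */
		ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);

		if (!op_64_bit)
			ret = (u32)ret;
		kvm_rax_write(vcpu, ret);

		++vcpu->stat.hypercalls;
		return kvm_skip_emulated_instruction(vcpu);
	}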

>> +       if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
>> +               ret = -KVM_EPERM;
>> +               goto out;
>> +       }

Is this guaranteed by TDG.VP.VMCALL?

Paolo

>> +       ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
>>   out:
>>          if (!op_64_bit)
>>                  ret = (u32)ret;
>> --
>> 2.25.1
>>
> 
> Sagi
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-04-07  4:15   ` Kai Huang
@ 2022-04-07 13:14     ` Paolo Bonzini
  2022-04-07 14:39       ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 13:14 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/7/22 06:15, Kai Huang wrote:
>>   
>> +static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>> +{
>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>> +
>> +	if (unlikely(tdx->tdvmcall.xmm_mask))
>> +		goto unsupported;
> Put a comment explaining this logic?
> 

This only seems to be necessary for Hyper-V hypercalls, which, however,
are not supported by this series in TDX guests (because kvm_hv_hypercall
still calls kvm_*_read; likewise for Xen).

So for now this conditional can be dropped.

Since


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
@ 2022-04-07 13:16   ` Paolo Bonzini
  2022-04-07 14:48     ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 13:16 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up TDX PV CPUID hypercall to the KVM backend function.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 27 +++++++++++++++++++++++++++
>   1 file changed, 27 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 53f59fb92dcf..f7c9170d596a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -893,6 +893,30 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
>   	return 1;
>   }
>   
> +static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
> +{
> +	u32 eax, ebx, ecx, edx;
> +
> +	/* EAX and ECX for cpuid is stored in R12 and R13. */
> +	eax = tdvmcall_p1_read(vcpu);
> +	ecx = tdvmcall_p2_read(vcpu);
> +
> +	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true);
> +
> +	/*
> +	 * The returned value for CPUID (EAX, EBX, ECX, and EDX) is stored into
> +	 * R12, R13, R14, and R15.
> +	 */
> +	tdvmcall_p1_write(vcpu, eax);
> +	tdvmcall_p2_write(vcpu, ebx);
> +	tdvmcall_p3_write(vcpu, ecx);
> +	tdvmcall_p4_write(vcpu, edx);
> +
> +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +
> +	return 1;
> +}
> +
>   static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -904,6 +928,9 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   		return tdx_emulate_vmcall(vcpu);
>   
>   	switch (tdvmcall_exit_reason(vcpu)) {
> +	case EXIT_REASON_CPUID:
> +		return tdx_emulate_cpuid(vcpu);
> +
>   	default:
>   		break;
>   	}

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

but I don't think tdvmcall_*_{read,write} add much.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-04-06 23:43   ` Kai Huang
@ 2022-04-07 13:52     ` Paolo Bonzini
  2022-04-07 22:53       ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 13:52 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/7/22 01:43, Kai Huang wrote:
>> +	if (kvm_gfn_stolen_mask(vcpu->kvm)) {
> Please get rid of kvm_gfn_stolen_mask().
> 

Kai, please follow the other reviews that I have posted in the last few 
days.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
  2022-04-06 23:35     ` Sean Christopherson
@ 2022-04-07 13:52       ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 13:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On 4/7/22 01:35, Sean Christopherson wrote:
>> Please rename the existing REMOVED_SPTE to REMOVED_SPTE_MASK, and call this
>> simply REMOVED_SPTE.  This also makes the patch smaller.
> Can we name it either __REMOVE_SPTE or REMOVED_SPTE_VAL?  It's most definitely
> not a mask, it's a full value, e.g. spte |= REMOVED_SPTE_MASK is completely wrong.

REMOVED_SPTE_VAL is fine.

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
@ 2022-04-07 13:56   ` Paolo Bonzini
  2022-04-07 15:02     ` Sean Christopherson
  2022-04-07 14:51   ` Sean Christopherson
  1 sibling, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 13:56 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> +	bool interrupt_disabled = tdvmcall_p1_read(vcpu);

Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?

> +		 * Virtual interrupt can arrive after TDG.VM.VMCALL<HLT> during
> +		 * the TDX module executing.  On the other hand, KVM doesn't
> +		 * know if vcpu was executing in the guest TD or the TDX module.

I don't understand this; why isn't it enough to check PI.ON or something 
like that as part of HLT emulation?

> +		details.full = td_state_non_arch_read64(
> +			to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);

The TDX documentation says "the meaning of the field may change with Intel
TDX module version"; where is this field documented?  I cannot find any
"other guest state" fields in the TDX documentation.

Paolo

> +		if (details.vmxip)
> +			return 1;
> +	}
> +
> +	return kvm_emulate_halt_noskip(vcpu);


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-04-07 13:14     ` Paolo Bonzini
@ 2022-04-07 14:39       ` Sean Christopherson
  2022-04-07 18:04         ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-07 14:39 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, isaku.yamahata,
	Jim Mattson, erdemaktas, Connor Kuehl

On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 4/7/22 06:15, Kai Huang wrote:
> > > +static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> > > +{
> > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +
> > > +	if (unlikely(tdx->tdvmcall.xmm_mask))
> > > +		goto unsupported;
> > Put a comment explaining this logic?
> > 
> 
> This only seems to be necessary for Hyper-V hypercalls, which however are
> not supported by this series in TDX guests (because the kvm_hv_hypercall
> still calls kvm_*_read, likewise for Xen).
> 
> So for now this conditional can be dropped.

I'd prefer to keep the sanity check, it's a cheap and easy way to detect a clear
cut guest bug.  E.g. KVM would be within its rights to write garbage to the XMM
registers in this case.  Even though KVM isn't to be trusted, KVM can still be
nice to the guest.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall
  2022-04-07 13:16   ` Paolo Bonzini
@ 2022-04-07 14:48     ` Sean Christopherson
  2022-04-07 18:03       ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-07 14:48 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Wire up TDX PV CPUID hypercall to the KVM backend function.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >   arch/x86/kvm/vmx/tdx.c | 27 +++++++++++++++++++++++++++
> >   1 file changed, 27 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 53f59fb92dcf..f7c9170d596a 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -893,6 +893,30 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
> >   	return 1;
> >   }
> > +static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
> > +{
> > +	u32 eax, ebx, ecx, edx;
> > +
> > +	/* EAX and ECX for cpuid is stored in R12 and R13. */
> > +	eax = tdvmcall_p1_read(vcpu);
> > +	ecx = tdvmcall_p2_read(vcpu);
> > +
> > +	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, true);
> > +
> > +	/*
> > +	 * The returned value for CPUID (EAX, EBX, ECX, and EDX) is stored into
> > +	 * R12, R13, R14, and R15.
> > +	 */
> > +	tdvmcall_p1_write(vcpu, eax);
> > +	tdvmcall_p2_write(vcpu, ebx);
> > +	tdvmcall_p3_write(vcpu, ecx);
> > +	tdvmcall_p4_write(vcpu, edx);
> > +
> > +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> > +
> > +	return 1;
> > +}
> > +
> >   static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> >   {
> >   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > @@ -904,6 +928,9 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> >   		return tdx_emulate_vmcall(vcpu);
> >   	switch (tdvmcall_exit_reason(vcpu)) {
> > +	case EXIT_REASON_CPUID:
> > +		return tdx_emulate_cpuid(vcpu);
> > +

Spurious whitespace that gets deleted by the HLT patch.

> >   	default:
> >   		break;
> >   	}
> 
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> 
> but I don't think tdvmcall_*_{read,write} add much.

They provided a lot more value when the ABI was still in flux, but I still like
having them.  That said, either the comments about R12..R15 need to go, or the
wrappers need to go.  Having both is confusing.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
  2022-04-07 13:56   ` Paolo Bonzini
@ 2022-04-07 14:51   ` Sean Christopherson
  1 sibling, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-07 14:51 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl

On Fri, Mar 04, 2022, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up TDX PV HLT hypercall to the KVM backend function.
> 
> When the guest issues HLT, the hypercall instruction can be the right after
> CLI instruction.  Atomically unmask virtual interrupt and issue HLT
> hypercall. The virtual interrupts can arrive right after CLI instruction
> before switching back to VMM.  In such a case, the VMM should return to the
> guest without losing the interrupt.  Check if interrupts arrived before the
> TDX module switching to VMM.  And return to the guest in such cases.

Pretty sure you mean STI, not CLI.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-07 13:56   ` Paolo Bonzini
@ 2022-04-07 15:02     ` Sean Christopherson
  2022-04-07 15:56       ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-07 15:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> > +	bool interrupt_disabled = tdvmcall_p1_read(vcpu);
> 
> Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
> 
> > +		 * Virtual interrupt can arrive after TDG.VM.VMCALL<HLT> during
> > +		 * the TDX module executing.  On the other hand, KVM doesn't
> > +		 * know if vcpu was executing in the guest TD or the TDX module.
> 
> I don't understand this; why isn't it enough to check PI.ON or something
> like that as part of HLT emulation?

Ooh, I think I remember what this is.  This is for the case where the virtual
interrupt is recognized, i.e. set in vmcs.RVI, between the STI and "HLT".  KVM
doesn't have access to RVI and the interrupt is no longer in the PID (because it
was "recognized".  It doesn't get delivered in the guest because the TDCALL
completes before interrupts are enabled.

I lobbied to get this fixed in the TDX module by immediately resuming the guest
in this case, but obviously that was unsuccessful.
 
> > +		details.full = td_state_non_arch_read64(
> > +			to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
> 
> TDX documentation says "the meaning of the field may change with Intel TDX
> module version", where is this field documented?  I cannot find any "other
> guest state" fields in the TDX documentation.

IMO we should put a stake in the ground and refuse to accept code that consumes
"non-architectural" state.  It's all software, having non-architectural APIs is
completely ridiculous.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-07 15:02     ` Sean Christopherson
@ 2022-04-07 15:56       ` Paolo Bonzini
  2022-04-07 16:08         ` Sean Christopherson
  2022-04-08  4:58         ` Isaku Yamahata
  0 siblings, 2 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 15:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On 4/7/22 17:02, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Paolo Bonzini wrote:
>> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
>>> +	bool interrupt_disabled = tdvmcall_p1_read(vcpu);
>>
>> Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
>>
>>> +		 * Virtual interrupt can arrive after TDG.VM.VMCALL<HLT> during
>>> +		 * the TDX module executing.  On the other hand, KVM doesn't
>>> +		 * know if vcpu was executing in the guest TD or the TDX module.
>>
>> I don't understand this; why isn't it enough to check PI.ON or something
>> like that as part of HLT emulation?
> 
> Ooh, I think I remember what this is.  This is for the case where the virtual
> interrupt is recognized, i.e. set in vmcs.RVI, between the STI and "HLT".  KVM
> doesn't have access to RVI and the interrupt is no longer in the PID (because it
> was "recognized".  It doesn't get delivered in the guest because the TDCALL
> completes before interrupts are enabled.
> 
> I lobbied to get this fixed in the TDX module by immediately resuming the guest
> in this case, but obviously that was unsuccessful.

So the TDX module sets RVI while in an STI interrupt shadow.  So far so 
good.  Then:

- it receives the HLT TDCALL from the guest.  The interrupt shadow at 
this point is gone.

- it knows that there is an interrupt that can be delivered (RVI > PPR 
&& EFLAGS.IF=1, the other conditions of 29.2.2 don't matter)

- it forwards the HLT TDCALL nevertheless, to a clueless hypervisor that 
has no way to glean either RVI or PPR?

It's absurd that this be treated as anything but a bug.


Until that is fixed, KVM needs to do something like:

- every time a bit is set in PID.PIR, set tdx->buggy_hlt_workaround = 1

- every time TDG.VP.VMCALL<HLT> is received, 
xchg(&tdx->buggy_hlt_workaround, 0) and return immediately to the guest 
if it is 1.

Basically an internal version of PID.ON.
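
In code, that would be roughly the following (untested sketch; the field name is
just the placeholder above, and where exactly KVM sets PIR bits for a TD vCPU is
assumed):

	/* wherever KVM sets a bit in this vCPU's PID.PIR: */
	WRITE_ONCE(tdx->buggy_hlt_workaround, 1);

	/* in the TDG.VP.VMCALL<HLT> handler: */
	if (xchg(&tdx->buggy_hlt_workaround, 0))
		return 1;	/* go back to the guest, an IRQ may already be in RVI */

	return kvm_emulate_halt_noskip(vcpu);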

>>> +		details.full = td_state_non_arch_read64(
>>> +			to_tdx(vcpu), TD_VCPU_STATE_DETAILS_NON_ARCH);
>>
>> TDX documentation says "the meaning of the field may change with Intel TDX
>> module version", where is this field documented?  I cannot find any "other
>> guest state" fields in the TDX documentation.
> 
> IMO we should put a stake in the ground and refuse to accept code that consumes
> "non-architectural" state.  It's all software, having non-architectural APIs is
> completely ridiculous.

Having them is fine, *using* them to work around undocumented bugs is 
the ridiculous part.

You didn't answer the other question, which is "Where is R12 documented 
for TDG.VP.VMCALL<Instruction.HLT>?" though...  Should I be worried? :)


Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-07 15:56       ` Paolo Bonzini
@ 2022-04-07 16:08         ` Sean Christopherson
  2022-04-08  4:58         ` Isaku Yamahata
  1 sibling, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-07 16:08 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 4/7/22 17:02, Sean Christopherson wrote:
> > On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> > > On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> > > > +	bool interrupt_disabled = tdvmcall_p1_read(vcpu);
> > > 
> > > Where is R12 documented for TDG.VP.VMCALL<Instruction.HLT>?
> > > 
> > > > +		 * Virtual interrupt can arrive after TDG.VM.VMCALL<HLT> during
> > > > +		 * the TDX module executing.  On the other hand, KVM doesn't
> > > > +		 * know if vcpu was executing in the guest TD or the TDX module.
> > > 
> > > I don't understand this; why isn't it enough to check PI.ON or something
> > > like that as part of HLT emulation?
> > 
> > Ooh, I think I remember what this is.  This is for the case where the virtual
> > interrupt is recognized, i.e. set in vmcs.RVI, between the STI and "HLT".  KVM
> > doesn't have access to RVI and the interrupt is no longer in the PID (because it
> > was "recognized".  It doesn't get delivered in the guest because the TDCALL
> > completes before interrupts are enabled.
> > 
> > I lobbied to get this fixed in the TDX module by immediately resuming the guest
> > in this case, but obviously that was unsuccessful.
> 
> So the TDX module sets RVI while in an STI interrupt shadow.  So far so
> good.  Then:
> 
> - it receives the HLT TDCALL from the guest.  The interrupt shadow at this
> point is gone.
> 
> - it knows that there is an interrupt that can be delivered (RVI > PPR &&
> EFLAGS.IF=1, the other conditions of 29.2.2 don't matter)
> 
> - it forwards the HLT TDCALL nevertheless, to a clueless hypervisor that has
> no way to glean either RVI or PPR?
> 
> It's absurd that this be treated as anything but a bug.

That's what I said!  :-)

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall
  2022-04-07 14:48     ` Sean Christopherson
@ 2022-04-07 18:03       ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 18:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On 4/7/22 16:48, Sean Christopherson wrote:
>> Reviewed-by: Paolo Bonzini<pbonzini@redhat.com>
>>
>> but I don't think tdvmcall_*_{read,write} add much.
> They provided a lot more value when the ABI was still in flux, but I still like
> having them.  That said, either the comments about R12..R15 need to go, or the
> wrappers need to go.  Having both is confusing.
> 

Fair enough, let's keep them but rename them a0..a3 for consistency with 
kvm_emulate_hypercall.
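
E.g. just a sketch of the rename, assuming the wrappers keep reading R12..R15
via the existing GPR accessors:

static __always_inline unsigned long tdvmcall_a0_read(struct kvm_vcpu *vcpu)
{
	return kvm_r12_read(vcpu);
}

static __always_inline void tdvmcall_a0_write(struct kvm_vcpu *vcpu,
					      unsigned long val)
{
	kvm_r12_write(vcpu, val);
}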

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-04-07 14:39       ` Sean Christopherson
@ 2022-04-07 18:04         ` Paolo Bonzini
  2022-04-07 18:11           ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 18:04 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, isaku.yamahata,
	Jim Mattson, erdemaktas, Connor Kuehl

On 4/7/22 16:39, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Paolo Bonzini wrote:
>> On 4/7/22 06:15, Kai Huang wrote:
>>>> +static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>>>> +{
>>>> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
>>>> +
>>>> +	if (unlikely(tdx->tdvmcall.xmm_mask))
>>>> +		goto unsupported;
>>> Put a comment explaining this logic?
>>>
>>
>> This only seems to be necessary for Hyper-V hypercalls, which however are
>> not supported by this series in TDX guests (because the kvm_hv_hypercall
>> still calls kvm_*_read, likewise for Xen).
>>
>> So for now this conditional can be dropped.
> 
> I'd prefer to keep the sanity check, it's a cheap and easy way to detect a clear
> cut guest bug.

I don't think it's necessarily a guest bug, just silly but valid behavior.

Paolo

>    E.g. KVM would be within its rights to write garbage the XMM
> registers in this case.  Even though KVM isn't to be trusted, KVM can still be
> nice to the guest.



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-04-07 18:04         ` Paolo Bonzini
@ 2022-04-07 18:11           ` Sean Christopherson
  2022-04-07 23:20             ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-07 18:11 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, isaku.yamahata,
	Jim Mattson, erdemaktas, Connor Kuehl

On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> On 4/7/22 16:39, Sean Christopherson wrote:
> > On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> > > On 4/7/22 06:15, Kai Huang wrote:
> > > > > +static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> > > > > +{
> > > > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > > > +
> > > > > +	if (unlikely(tdx->tdvmcall.xmm_mask))
> > > > > +		goto unsupported;
> > > > Put a comment explaining this logic?
> > > > 
> > > 
> > > This only seems to be necessary for Hyper-V hypercalls, which however are
> > > not supported by this series in TDX guests (because the kvm_hv_hypercall
> > > still calls kvm_*_read, likewise for Xen).
> > > 
> > > So for now this conditional can be dropped.
> > 
> > I'd prefer to keep the sanity check, it's a cheap and easy way to detect a clear
> > cut guest bug.
> 
> I don't think it's necessarily a guest bug, just silly but valid behavior.

It's a bug from a security perspective given that letting the host unnecessarily
manipulate register state is an exploit waiting to happen.

Though for KVM to reject the TDVMCALLs, the GHCI should really be updated to state
that exposing more state than is required _may_ be considered invalid ("may" so
that KVM isn't required to check the mask on every exit, which IMO is beyond tedious).

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-04-07 13:52     ` Paolo Bonzini
@ 2022-04-07 22:53       ` Kai Huang
  2022-04-07 23:03         ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Kai Huang @ 2022-04-07 22:53 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Thu, 2022-04-07 at 15:52 +0200, Paolo Bonzini wrote:
> On 4/7/22 01:43, Kai Huang wrote:
> > > +	if (kvm_gfn_stolen_mask(vcpu->kvm)) {
> > Please get rid of kvm_gfn_stolen_mask().
> > 
> 
> Kai, please follow the other reviews that I have posted in the last few 
> days.
> 
> Paolo
> 

Do you mean the reply below?

"I think use of kvm_gfn_stolen_mask() should be minimized anyway.  I 
would rename it to to kvm_{gfn,gpa}_private_mask and not return bool."

I also mean we should not use kvm_gfn_stolen_mask().  I don't have a strong
opinion on the new name, though kvm_is_protected_vm() would be my preference.



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-04-07 22:53       ` Kai Huang
@ 2022-04-07 23:03         ` Paolo Bonzini
  2022-04-07 23:24           ` Kai Huang
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-07 23:03 UTC (permalink / raw)
  To: Kai Huang, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/8/22 00:53, Kai Huang wrote:
>>
> Do you mean below reply?
> 
> "I think use of kvm_gfn_stolen_mask() should be minimized anyway.  I
> would rename it to to kvm_{gfn,gpa}_private_mask and not return bool."
> 
> I also mean we should not use kvm_gfn_stolen_mask().  I don't have opinion on
> the new name.  Perhaps kvm_is_protected_vm() is my preference though.

But this is one of the cases where it would survive, even with the 
changed name.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2022-04-07 18:11           ` Sean Christopherson
@ 2022-04-07 23:20             ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07 23:20 UTC (permalink / raw)
  To: Sean Christopherson, Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl

On Thu, 2022-04-07 at 18:11 +0000, Sean Christopherson wrote:
> On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> > On 4/7/22 16:39, Sean Christopherson wrote:
> > > On Thu, Apr 07, 2022, Paolo Bonzini wrote:
> > > > On 4/7/22 06:15, Kai Huang wrote:
> > > > > > +static int handle_tdvmcall(struct kvm_vcpu *vcpu)
> > > > > > +{
> > > > > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > > > > +
> > > > > > +	if (unlikely(tdx->tdvmcall.xmm_mask))
> > > > > > +		goto unsupported;
> > > > > Put a comment explaining this logic?
> > > > > 
> > > > 
> > > > This only seems to be necessary for Hyper-V hypercalls, which however are
> > > > not supported by this series in TDX guests (because the kvm_hv_hypercall
> > > > still calls kvm_*_read, likewise for Xen).
> > > > 
> > > > So for now this conditional can be dropped.
> > > 
> > > I'd prefer to keep the sanity check, it's a cheap and easy way to detect a clear
> > > cut guest bug.
> > 
> > I don't think it's necessarily a guest bug, just silly but valid behavior.
> 
> It's a bug from a security perspective given that letting the host unnecessarily
> manipulate register state is an exploit waiting to happen.

Security from the guest's perspective, or the host's?  It's the guest's
responsibility to make sure it doesn't expose unnecessary security holes.
In this particular case, if the guest does, then that's the guest's business, but
I don't see how it can compromise the host's security, or another VM's security.

As Paolo said, it's a valid operation from the guest, so perhaps an alternative
is for KVM to unconditionally clear the XMM registers instead of rejecting this
VMCALL.
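
Either way it's only a couple of lines; e.g. the rejection variant would look
roughly like this (a sketch; the return code is the GHCI "invalid operand"
value, and the exact constant name in the series may differ):

	if (unlikely(tdx->tdvmcall.xmm_mask)) {
		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
		return 1;
	}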

> 
> Though for KVM to reject the TDVMCALLs, the GHCI should really be updated to state
> that exposing more state than is required _may_ be considered invalid ("may" so
> that KVM isn't required to check the mask on every exit, which IMO is beyond tedious).

Independent of this issue, I guess it's good for the GHCI to have this anyway.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
  2022-04-07 23:03         ` Paolo Bonzini
@ 2022-04-07 23:24           ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-07 23:24 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Fri, 2022-04-08 at 01:03 +0200, Paolo Bonzini wrote:
> On 4/8/22 00:53, Kai Huang wrote:
> > > 
> > Do you mean below reply?
> > 
> > "I think use of kvm_gfn_stolen_mask() should be minimized anyway.  I
> > would rename it to to kvm_{gfn,gpa}_private_mask and not return bool."
> > 
> > I also mean we should not use kvm_gfn_stolen_mask().  I don't have opinion on
> > the new name.  Perhaps kvm_is_protected_vm() is my preference though.
> 
> But this is one of the case where it would survive, even with the 
> changed name.
> 
> Paolo
> 

Perhaps I confused you (sorry about that).  Yes we do need the check here.  I
just dislike the function name.

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex
  2022-04-05 12:39   ` Paolo Bonzini
@ 2022-04-08  0:44     ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  0:44 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, Apr 05, 2022 at 02:39:51PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Several TDX SEAMCALLs are per-package scope (concretely per memory
> > controller) and they need to be serialized per-package.  Allocate mutex for
> > it.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >   arch/x86/kvm/vmx/main.c    |  8 +++++++-
> >   arch/x86/kvm/vmx/tdx.c     | 18 ++++++++++++++++++
> >   arch/x86/kvm/vmx/x86_ops.h |  2 ++
> >   3 files changed, 27 insertions(+), 1 deletion(-)
> 
> Please define here the lock/unlock functions as well:
> 
> static inline int tdx_mng_key_lock(void)
> {
> 	int cpu = get_cpu();
> 	cur_pkg = topology_physical_package_id(cpu);
> 
> 	mutex_lock(&tdx_mng_key_config_lock[cur_pkg]);
> 	return cur_pkg;
> }
> 
> static inline void tdx_mng_key_unlock(int cur_pkg)
> {
> 	mutex_unlock(&tdx_mng_key_config_lock[cur_pkg]);
> 	put_cpu();
> }

Sure, will do in the next respin.
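
For reference, a call site would then pair up roughly like this (the SEAMCALL
wrapper name and the tdr field are assumed from the rest of the series):

	int pkg = tdx_mng_key_lock();

	err = tdh_mng_key_config(kvm_tdx->tdr.pa);	/* per-package SEAMCALL */
	tdx_mng_key_unlock(pkg);
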
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-04-05 12:44   ` Paolo Bonzini
@ 2022-04-08  0:51     ` Isaku Yamahata
  2022-04-15 13:47       ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  0:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, Apr 05, 2022 at 02:44:04PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 6111c6485d8e..5c3a904a30e8 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -39,12 +39,24 @@ static int vt_vm_init(struct kvm *kvm)
> >   		ret = tdx_module_setup();
> >   		if (ret)
> >   			return ret;
> > -		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
> > +		return tdx_vm_init(kvm);
> >   	}
> >   	return vmx_vm_init(kvm);
> >   }
> > +static void vt_mmu_prezap(struct kvm *kvm)
> > +{
> > +	if (is_td(kvm))
> > +		return tdx_mmu_prezap(kvm);
> > +}
> 
> Please rename the function to explain what it does, for example
> tdx_mmu_release_hkid.

In patch 021/104, you suggested flush_shadow_all_private().
Which do you prefer, flush_shadow_all_private or tdx_mmu_release_hkid?
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2022-04-05 12:50   ` Paolo Bonzini
@ 2022-04-08  0:56     ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  0:56 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, Apr 05, 2022 at 02:50:29PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Add a place holder function for TDX specific VM-scoped ioctl as mem_enc_op.
> > TDX specific sub-commands will be added to retrieve/pass TDX specific
> > parameters.
> > 
> > KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for
> > guest state-protected VM.  It defined subcommands for technology-specific
> > operations under KVM_MEMORY_ENCRYPT_OP.  Despite its name, the subcommands
> > are not limited to memory encryption, but various technology-specific
> > operations are defined.  It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
> > for TDX specific operations and define subcommands.
> > 
> > TDX requires VM-scoped, and VCPU-scoped TDX-specific operations for device
> > model, for example, qemu.  Getting system-wide parameters, TDX-specific VM
> > initialization, and TDX-specific vCPU initialization.  Which requires KVM
> > vCPU-scoped operations in addition to the existing VM-scoped operations.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >   arch/x86/include/uapi/asm/kvm.h       | 11 +++++++++++
> >   arch/x86/kvm/vmx/main.c               | 10 ++++++++++
> >   arch/x86/kvm/vmx/tdx.c                | 24 ++++++++++++++++++++++++
> >   arch/x86/kvm/vmx/x86_ops.h            |  4 ++++
> >   tools/arch/x86/include/uapi/asm/kvm.h | 11 +++++++++++
> >   5 files changed, 60 insertions(+)
> > 
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 71a5851475e7..2ad61caf4e0b 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -528,4 +528,15 @@ struct kvm_pmu_event_filter {
> >   #define KVM_X86_DEFAULT_VM	0
> >   #define KVM_X86_TDX_VM		1
> > +/* Trust Domain eXtension sub-ioctl() commands. */
> > +enum kvm_tdx_cmd_id {
> > +	KVM_TDX_CMD_NR_MAX,
> > +};
> > +
> > +struct kvm_tdx_cmd {
> > +	__u32 id;
> > +	__u32 metadata;
> > +	__u64 data;
> > +};
> 
> Please include some initial documentation here already, for example it is
> not clear what "metadata" is.
> 
> Also please add
> 
> 	u32 error;
> 	u32 unused;
> 
> for two reasons: 1) consistency with kvm_sev_cmd 2) error codes should be
> returned to userspace and not just sent through pr_tdx_error.

Sure.
For now, metadata is only used to specify flags specific to the id, so I'll
rename it to flags.
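
Putting the rename and your request together, the uAPI struct would end up
looking roughly like this (field widths and placement still to be settled):

struct kvm_tdx_cmd {
	__u32 id;
	__u32 flags;	/* was "metadata": flags specific to @id */
	__u64 data;
	__u32 error;	/* SEAMCALL status returned to userspace */
	__u32 unused;
};
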
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
  2022-04-07  1:17         ` Xiaoyao Li
@ 2022-04-08  0:58           ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  0:58 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Kai Huang, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel,
	isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Thu, Apr 07, 2022 at 09:17:51AM +0800,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> On 4/7/2022 9:07 AM, Kai Huang wrote:
> > On Wed, 2022-04-06 at 09:54 +0800, Xiaoyao Li wrote:
> > > On 4/5/2022 8:52 PM, Paolo Bonzini wrote:
> > > > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > > > Implement a VM-scoped subcomment to get system-wide parameters.  Although
> > > > > this is system-wide parameters not per-VM, this subcomand is VM-scoped
> > > > > because
> > > > > - Device model needs TDX system-wide parameters after creating KVM VM.
> > > > > - This subcommands requires to initialize TDX module.  For lazy
> > > > >     initialization of the TDX module, vm-scope ioctl is better.
> > > > 
> > > > Since there was agreement to install the TDX module on load, please
> > > > place this ioctl on the /dev/kvm file descriptor.
> > > > 
> > > > At least for SEV, there were cases where the system-wide parameters are
> > > > needed outside KVM, so it's better to avoid requiring a VM file descriptor.
> > > 
> > > I don't have strong preference on KVM-scope ioctl or VM-scope.
> > > 
> > > Initially, we made it KVM-scope and change it to VM-scope in this
> > > version. Yes, it returns the info from TDX module, which doesn't vary
> > > per VM. However, what if we want to return different capabilities
> > > (software controlled capabilities) per VM?
> > > 
> > 
> > In this case, you don't return different capabilities, instead, you return the
> > same capabilities but control the capabilities on per-VM basis.
> 
> yes, so I'm not arguing it or insisting on per-VM.
> 
> I just speak out my concern since it's user ABI.

The reason I made this a VM-scoped API was to reduce the number of patches, given
qemu's usage.  Now that Paolo has requested it, I'll change it to a KVM-scoped API.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-03-31  4:55   ` Kai Huang
  2022-04-05 13:01     ` Paolo Bonzini
@ 2022-04-08  2:18     ` Isaku Yamahata
  1 sibling, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  2:18 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Thu, Mar 31, 2022 at 05:55:01PM +1300,
Kai Huang <kai.huang@intel.com> wrote:

> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index 70f9be4ea575..6e26dde0dce6 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
> >  /* Trust Domain eXtension sub-ioctl() commands. */
> >  enum kvm_tdx_cmd_id {
> >  	KVM_TDX_CAPABILITIES = 0,
> > +	KVM_TDX_INIT_VM,
> >  
> >  	KVM_TDX_CMD_NR_MAX,
> >  };
> > @@ -561,4 +562,15 @@ struct kvm_tdx_capabilities {
> >  	struct kvm_tdx_cpuid_config cpuid_configs[0];
> >  };
> >  
> > +struct kvm_tdx_init_vm {
> > +	__u32 max_vcpus;
> > +	__u32 tsc_khz;
> > +	__u64 attributes;
> > +	__u64 cpuid;
> 
> Is it better to append all CPUIDs directly into this structure, perhaps at end
> of this structure, to make it more consistent with TD_PARAMS?
> 
> Also, I think somewhere in commit message or comments we should explain why
> CPUIDs are passed here (why existing KVM_SET_CUPID2 is not sufficient).

Ok, let's change the data structure to match more with TD_PARAMS.
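
Something along these lines, perhaps (just a sketch; the reserved-area size and
the exact entry type appended at the tail are illustrative):

struct kvm_tdx_init_vm {
	__u32 max_vcpus;
	__u32 tsc_khz;
	__u64 attributes;
	__u64 reserved[46];			/* size is illustrative */
	/* CPUID leaves appended at the tail, TD_PARAMS-style */
	struct kvm_cpuid_entry2 cpuid_entries[0];
};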


> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 20b45bb0b032..236faaca68a0 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -387,6 +387,203 @@ static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> >  	return 0;
> >  }
> >  
> > +static struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(struct kvm_tdx *kvm_tdx,
> > +						u32 function, u32 index)
> > +{
> > +	struct kvm_cpuid_entry2 *e;
> > +	int i;
> > +
> > +	for (i = 0; i < kvm_tdx->cpuid_nent; i++) {
> > +		e = &kvm_tdx->cpuid_entries[i];
> > +
> > +		if (e->function == function && (e->index == index ||
> > +		    !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
> > +			return e;
> > +	}
> > +	return NULL;
> > +}
> > +
> > +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> > +			struct kvm_tdx_init_vm *init_vm)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> > +	struct tdx_cpuid_config *config;
> > +	struct kvm_cpuid_entry2 *entry;
> > +	struct tdx_cpuid_value *value;
> > +	u64 guest_supported_xcr0;
> > +	u64 guest_supported_xss;
> > +	u32 guest_tsc_khz;
> > +	int max_pa;
> > +	int i;
> > +
> > +	/* init_vm->reserved must be zero */
> > +	if (find_first_bit((unsigned long *)init_vm->reserved,
> > +			   sizeof(init_vm->reserved) * 8) !=
> > +	    sizeof(init_vm->reserved) * 8)
> > +		return -EINVAL;
> > +
> > +	td_params->max_vcpus = init_vm->max_vcpus;
> > +
> > +	td_params->attributes = init_vm->attributes;
> > +	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> > +		pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
> > +			"host perf registers properly.\n");
> > +		return -EOPNOTSUPP;
> > +	}
> 
> PERFMON can be supported but it's not support in this series, so perhaps add a
> comment to explain it's a TODO?

Yes, good idea. Will do.


> > +	max_pa = 36;
> > +	entry = tdx_find_cpuid_entry(kvm_tdx, 0x80000008, 0);
> > +	if (entry)
> > +		max_pa = entry->eax & 0xff;
> > +
> > +	td_params->eptp_controls = VMX_EPTP_MT_WB;
> > +	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
> > +		td_params->eptp_controls |= VMX_EPTP_PWL_5;
> > +		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
> > +	} else {
> > +		td_params->eptp_controls |= VMX_EPTP_PWL_4;
> > +	}
> 
> Not quite sure, but could we support >48 GPA with 4-level EPT?

No.  Per the "5-level paging and 5-level EPT" document, section 4.1 ("4-level
EPT"): "4-level EPT is limited to translating 48-bit guest-physical addresses."
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters
  2022-04-07  1:51       ` Kai Huang
@ 2022-04-08  3:33         ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  3:33 UTC (permalink / raw)
  To: Kai Huang
  Cc: Xiaoyao Li, Paolo Bonzini, isaku.yamahata, kvm, linux-kernel,
	isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On Thu, Apr 07, 2022 at 01:51:38PM +1200,
Kai Huang <kai.huang@intel.com> wrote:

> On Thu, 2022-04-07 at 09:29 +0800, Xiaoyao Li wrote:
> > On 4/5/2022 8:58 PM, Paolo Bonzini wrote:
> > > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > > +    td_params->attributes = init_vm->attributes;
> > > > +    if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> > > > +        pr_warn("TD doesn't support perfmon. KVM needs to save/restore "
> > > > +            "host perf registers properly.\n");
> > > > +        return -EOPNOTSUPP;
> > > > +    }
> > > 
> > > Why does KVM have to hardcode this (and LBR/AMX below)?  Is the level of 
> > > hardware support available from tdx_caps, for example through the CPUID 
> > > configs (0xA for this one, 0xD for LBR and AMX)?
> > 
> > It's wrong code. PMU is allowed.
> > 
> > AMX and LBR are disallowed because, at the time we wrote the code, they
> > were not supported by KVM. Now AMX should be allowed, but (arch-)LBR
> > should still be blocked until KVM merges arch-LBR support.
> 
> I think Isaku's idea is we don't support them in the first submission?
> 
> If so as I suggested, we should add a TODO in comment..

Sure, will add a TODO comment.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI
  2022-04-07 12:12       ` Paolo Bonzini
@ 2022-04-08  3:40         ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  3:40 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Xiaoyao Li, isaku.yamahata, kvm, linux-kernel, isaku.yamahata,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson,
	Tom Lendacky

On Thu, Apr 07, 2022 at 02:12:28PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 4/7/22 13:09, Xiaoyao Li wrote:
> > On 4/5/2022 11:48 PM, Paolo Bonzini wrote:
> > > On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> > > > +        if (kvm_init_sipi_unsupported(vcpu->kvm))
> > > > +            /*
> > > > +             * TDX doesn't support INIT.  Ignore INIT event.  In the
> > > > +             * case of SIPI, the callback of
> > > > +             * vcpu_deliver_sipi_vector ignores it.
> > > > +             */
> > > >               vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > > > -        else
> > > > -            vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> > > > +        else {
> > > > +            kvm_vcpu_reset(vcpu, true);
> > > > +            if (kvm_vcpu_is_bsp(apic->vcpu))
> > > > +                vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > > > +            else
> > > > +                vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
> > > > +        }
> > > 
> > > Should you check vcpu->arch.guest_state_protected instead of
> > > special-casing TDX?
> > 
> > We cannot use vcpu->arch.guest_state_protected because TDX supports
> > debug TD, of which the states are not protected.
> > 
> > At least we need another flag, I think.
> 
> Let's add .deliver_init to the kvm_x86_ops then.

Will do.
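
A minimal shape for that hook might look like the sketch below (is_td_vcpu() is
from this series; the common-code helper name is hypothetical):

	/* kvm_x86_ops */
	void (*deliver_init)(struct kvm_vcpu *vcpu);

	/* vmx/main.c */
	static void vt_deliver_init(struct kvm_vcpu *vcpu)
	{
		/* TDX doesn't support INIT; silently ignore it. */
		if (is_td_vcpu(vcpu))
			return;

		kvm_vcpu_deliver_init(vcpu);	/* hypothetical common helper */
	}
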
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state
  2022-04-05 15:56   ` Paolo Bonzini
@ 2022-04-08  3:50     ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  3:50 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, Apr 05, 2022 at 05:56:36PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > TDX protects TDX guest state from VMM.  Implements to access methods for
> > TDX guest state to ignore them or return zero.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For most of these, it would be interesting to see which paths actually can
> be hit.  For SEV, it's all cut out by
> 
>         if (vcpu->arch.guest_state_protected)
>                 return 0;
> 
> in functions such as __set_sregs_common.  Together with the fact that TDX
> does not get to e.g. handle_set_cr0, this should prevent most such calls
> from happening.  So most of these should be KVM_BUG_ON or WARN_ON, not just
> returns.

If debug mode is enabled, guest state isn't protected: memory/CPU state can
be read/written via SEAMCALLs, so guest_state_protected isn't set to true.

Anyway, this patch series doesn't support debug mode well for now, so I will
go with adding KVM_BUG_ON/WARN_ON.
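
For example, one of the ignore-style wrappers might then become roughly
(sketch only):

	static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
	{
		/* Reaching here for a TD is a KVM or userspace bug. */
		if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
			return;

		vmx_set_cr0(vcpu, cr0);
	}
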
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-07 15:56       ` Paolo Bonzini
  2022-04-07 16:08         ` Sean Christopherson
@ 2022-04-08  4:58         ` Isaku Yamahata
  2022-04-08  9:57           ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  4:58 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sean Christopherson, isaku.yamahata, kvm, linux-kernel,
	isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl

On Thu, Apr 07, 2022 at 05:56:05PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> You didn't answer the other question, which is "Where is R12 documented for
> TDG.VP.VMCALL<Instruction.HLT>?" though...  Should I be worried? :)

It's publicly documented.

Guest-Host-Communication Interface (GHCI) spec, 344426-003US, February 2022.
3.8 TDG.VP.VMCALL<Instruction.HLT>
R12 Interrupt Blocked Flag.
    The TD is expected to clear this flag iff RFLAGS.IF == 1 or the TDCALL instruction
    (that invoked TDG.VP.TDVMCALL(Instruction.HLT)) immediately follows an STI
    instruction, otherwise this flag should be set.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function
  2022-04-07 13:12     ` Paolo Bonzini
@ 2022-04-08  5:34       ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08  5:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Sagi Shahar, Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Jim Mattson, Erdem Aktas, Connor Kuehl, Sean Christopherson

On Thu, Apr 07, 2022 at 03:12:57PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> > > +       if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
> > > +               ret = -KVM_EPERM;
> > > +               goto out;
> > > +       }
> 
> Is this guaranteed by TDG.VP.VMCALL?

Yes. The TDCALL instruction in a TD results in #GP(0) if CPL > 0.
It's documented in the Trust Domain CPU Architectural Extensions spec:
https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf

Anyway, the VMM can't know the TD guest's CPL (or other CPU state).
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-08  4:58         ` Isaku Yamahata
@ 2022-04-08  9:57           ` Paolo Bonzini
  2022-04-08 14:51             ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-08  9:57 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Sean Christopherson, isaku.yamahata, kvm, linux-kernel,
	Jim Mattson, erdemaktas, Connor Kuehl

On 4/8/22 06:58, Isaku Yamahata wrote:
> On Thu, Apr 07, 2022 at 05:56:05PM +0200,
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
>> You didn't answer the other question, which is "Where is R12 documented for
>> TDG.VP.VMCALL<Instruction.HLT>?" though...  Should I be worried? :)
> 
> It's publicly documented.
> 
> Guest-Host-Communication Interface(GHCI) spec, 344426-003US Feburary 2022.
> 3.8 TDG.VP.VMCALL<Instruction.HLT>
> R12 Interrupt Blocked Flag.
>      The TD is expected to clear this flag iff RFLAGS.IF == 1 or the TDCALL instruction
>      (that invoked TDG.VP.TDVMCALL(Instruction.HLT)) immediately follows an STI
>      instruction, otherwise this flag should be set.

Oh, Google doesn't know about this version of the spec...  It can be 
downloaded from 
https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html 
though.

I also found VCPU_STATE_DETAILS in 
https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf:

   Bit 0: VMXIP, indicates that a virtual interrupt is pending
   delivery, i.e. VMCS.RVI[7:4] > TDVPS.VAPIC.VPPR[7:4]

It also documents how it has to be used.  So this looks more or less 
okay, just rename "vmxip" to "interrupt_pending_delivery".

The VCPU_STATE_DETAILS being "non-architectural" is still worrisome.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-08  9:57           ` Paolo Bonzini
@ 2022-04-08 14:51             ` Sean Christopherson
  2022-04-11 17:40               ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-08 14:51 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Isaku Yamahata, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl

On Fri, Apr 08, 2022, Paolo Bonzini wrote:
> On 4/8/22 06:58, Isaku Yamahata wrote:
> > On Thu, Apr 07, 2022 at 05:56:05PM +0200,
> > Paolo Bonzini <pbonzini@redhat.com> wrote:
> > 
> > > You didn't answer the other question, which is "Where is R12 documented for
> > > TDG.VP.VMCALL<Instruction.HLT>?" though...  Should I be worried? :)
> > 
> > It's publicly documented.
> > 
> > Guest-Host-Communication Interface(GHCI) spec, 344426-003US Feburary 2022.
> > 3.8 TDG.VP.VMCALL<Instruction.HLT>
> > R12 Interrupt Blocked Flag.
> >      The TD is expected to clear this flag iff RFLAGS.IF == 1 or the TDCALL instruction
> >      (that invoked TDG.VP.TDVMCALL(Instruction.HLT)) immediately follows an STI
> >      instruction, otherwise this flag should be set.
> 
> Oh, Google doesn't know about this version of the spec...  It can be
> downloaded from https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
> though.
> 
> I also found VCPU_STATE_DETAILS in https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf:
> 
>   Bit 0: VMXIP, indicates that a virtual interrupt is pending
>   delivery, i.e. VMCS.RVI[7:4] > TDVPS.VAPIC.VPPR[7:4]
> 
> It also documents how it has to be used.  So this looks more or less okay,
> just rename "vmxip" to "interrupt_pending_delivery".

If we're keeping the call back into SEAM, then this belongs in the path of
apic_has_interrupt_for_ppr(), not in the HLT-exit path.  To avoid multiple SEAMCALLs
in a single exit, VCPU_EXREG_RVI can be added.
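
I.e., roughly, treat RVI like the other lazily-read exit registers (a sketch;
the SEAMCALL accessor and the cached field are assumed):

	if (!kvm_register_is_available(vcpu, VCPU_EXREG_RVI)) {
		tdx->rvi = td_vcpu_state_rvi(vcpu);	/* at most one SEAMCALL per exit */
		kvm_register_mark_available(vcpu, VCPU_EXREG_RVI);
	}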

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  2022-03-04 19:49 ` [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
@ 2022-04-08 16:24   ` Sean Christopherson
  2022-04-15 14:20     ` Paolo Bonzini
  0 siblings, 1 reply; 310+ messages in thread
From: Sean Christopherson @ 2022-04-08 16:24 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl

On Fri, Mar 04, 2022, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
> interrupt to support TDX's usage of APICv.  Unlike VMX, TDX doesn't have
> access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,

Based on the discussion in the HLT patch, this is no longer true.

> i.e. needs to generate a posted interrupt and more importantly can't
> manually move requested interrupts into the vIRR (which it also doesn't
> have access to).
> 
> Because pi_has_pending_interrupt() is heavy operation which uses two atomic
> test bit operations and one atomic 256 bit bitmap check, introduce new
> callback for this check instead of reusing dy_apicv_has_pending_interrupt()
> callback to avoid affecting the exiting code.

...

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 89d04cd64cd0..314ae43e07bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12111,7 +12111,10 @@ static inline bool kvm_vcpu_has_events(struct kvm_vcpu *vcpu)
>  
>  	if (kvm_arch_interrupt_allowed(vcpu) &&
>  	    (kvm_cpu_has_interrupt(vcpu) ||
> -	    kvm_guest_apic_has_interrupt(vcpu)))
> +	     kvm_guest_apic_has_interrupt(vcpu) ||
> +	     (vcpu->arch.apicv_active &&
> +	      kvm_x86_ops.apicv_has_pending_interrupt &&
> +	      kvm_x86_ops.apicv_has_pending_interrupt(vcpu))))

This is pretty gross (fully realizing that I wrote this patch).  It's also arguably
wrong as it really should be called from apic_has_interrupt_for_ppr().

  1. The hook implies it is valid for APICv in general, which is misleading.

  2. It's wasted effort for VMX.
 
  3. It does a poor job of conveying _why_ TDX is different.

  4. KVM is unnecessarily processing its useless "copy" of the PPR/IRR for TDX
     vCPUs.  It's functionally not an issue unless userspace stuffs garbage into
     KVM's vAPIC, but it's unnecessary work.

Rather than hook this path, I would rather we tag kvm_apic as having some of its
state protected.  Then kvm_cpu_has_interrupt() can invoke the alternative,
protected-apic-only hook when appropriate, and kvm_apic_has_interrupt() can bail
immediately instead of doing useless processing of stale vAPIC state.

Note, the below moves the !apic check from tdx_vcpu_reset() to tdx_vcpu_create().
That part should be hoisted earlier in the series; there's no reason to wait until
RESET to perform the check, and I suspect the WARN_ON() can be triggered by userspace.

Compile tested only...

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 8 Apr 2022 08:56:27 -0700
Subject: [PATCH] KVM: TDX: Add support for find pending IRQ in a protected
 local APIC

Add a flag and a hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ.  For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM.  As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM.  To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector.  And
to determine if TDX vCPU has a pending interrupt, KVM must check if there
is an outstanding notification.

Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC, the only valid operation is querying _if_ an IRQ is
pending, KVM can't do anything based on _which_ IRQ is pending.

Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks.  A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.

Note, this doesn't handle interrupts that have been delivered to the vCPU
but not yet recognized by the core, i.e. interrupts that are sitting in
vmcs.GUEST_INTR_STATUS.  Querying that state requires a SEAMCALL and will
be supported in a future patch.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/irq.c                 |  3 +++
 arch/x86/kvm/lapic.c               |  3 +++
 arch/x86/kvm/lapic.h               |  2 ++
 arch/x86/kvm/vmx/main.c            | 11 +++++++++++
 arch/x86/kvm/vmx/tdx.c             |  9 ++++++---
 7 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 7e27b73d839f..ce705d0c6241 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -110,6 +110,7 @@ KVM_X86_OP_NULL(update_pi_irte)
 KVM_X86_OP_NULL(start_assignment)
 KVM_X86_OP_NULL(apicv_post_state_restore)
 KVM_X86_OP_NULL(dy_apicv_has_pending_interrupt)
+KVM_X86_OP_NULL(protected_apic_has_interrupt)
 KVM_X86_OP_NULL(set_hv_timer)
 KVM_X86_OP_NULL(cancel_hv_timer)
 KVM_X86_OP(setup_mce)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 489374a57b66..b3dcc0814461 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1491,6 +1491,7 @@ struct kvm_x86_ops {
 	void (*start_assignment)(struct kvm *kvm);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
 	bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+	bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);

 	int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 			    bool *expired);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 172b05343cfd..24f180c538b0 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -96,6 +96,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 	if (kvm_cpu_has_extint(v))
 		return 1;

+	if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+		return static_call(kvm_x86_protected_apic_has_interrupt)(v);
+
 	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9322e6340a74..50a483abc0fe 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2503,6 +2503,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	if (!kvm_apic_present(vcpu))
 		return -1;

+	if (apic->guest_apic_protected)
+		return -1;
+
 	__apic_update_ppr(apic, &ppr);
 	return apic_has_interrupt_for_ppr(apic, ppr);
 }
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 2b44e533fc8d..7b62f1889a98 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -52,6 +52,8 @@ struct kvm_lapic {
 	bool sw_enabled;
 	bool irr_pending;
 	bool lvt0_in_nmi_mode;
+	/* Select registers in the vAPIC cannot be read/written. */
+	bool guest_apic_protected;
 	/* Number of bits set in ISR. */
 	s16 isr_count;
 	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 882358ac270b..31aab8add010 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -42,6 +42,9 @@ static __init int vt_hardware_setup(void)

 	tdx_hardware_setup(&vt_x86_ops);

+	if (!enable_tdx)
+		vt_x86_ops.protected_apic_has_interrupt = NULL;
+
 	if (enable_ept) {
 		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
 		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
@@ -148,6 +151,13 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	return vmx_vcpu_load(vcpu, cpu);
 }

+static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	KVM_BUG_ON(!is_td_vcpu(vcpu), vcpu->kvm);
+
+	return pi_has_pending_interrupt(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -297,6 +307,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.sync_pir_to_irr = vmx_sync_pir_to_irr,
 	.deliver_interrupt = vmx_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+	.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,

 	.set_tss_addr = vmx_set_tss_addr,
 	.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3a0e826fbe0c..7b9370384ce4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -467,6 +467,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	int ret, i;

+	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
+	if (!vcpu->arch.apic)
+		return -EINVAL;
+
+	vcpu->arch.apic->guest_apic_protected = true;
+
 	ret = tdx_alloc_td_page(&tdx->tdvpr);
 	if (ret)
 		return ret;
@@ -602,9 +608,6 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	/* TDX doesn't support INIT event. */
 	if (WARN_ON(init_event))
 		goto td_bugged;
-	/* TDX supports only X2APIC enabled. */
-	if (WARN_ON(!vcpu->arch.apic))
-		goto td_bugged;
 	if (WARN_ON(is_td_vcpu_created(tdx)))
 		goto td_bugged;


base-commit: f88e9fa63cbd87cda9352ee9a86a6f815744be33
--


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait
  2022-03-04 19:49 ` [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
  2022-04-05 15:33   ` Paolo Bonzini
@ 2022-04-08 16:36   ` Sean Christopherson
  1 sibling, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-08 16:36 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl

On Fri, Mar 04, 2022, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Add an option to skip the IRR check-in kvm_wait_lapic_expire().  This
> will be used by TDX to wait if there is an outstanding notification for
> a TD, i.e. a virtual interrupt is being triggered via posted interrupt
> processing.  KVM TDX doesn't emulate PI processing, i.e. there will
> never be a bit set in IRR/ISR, so the default behavior for APICv of
> querying the IRR doesn't work as intended.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/lapic.c   | 4 ++--
>  arch/x86/kvm/lapic.h   | 2 +-
>  arch/x86/kvm/svm/svm.c | 2 +-
>  arch/x86/kvm/vmx/vmx.c | 2 +-
>  4 files changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
> index 9322e6340a74..d49f029ef0e3 100644
> --- a/arch/x86/kvm/lapic.c
> +++ b/arch/x86/kvm/lapic.c
> @@ -1620,12 +1620,12 @@ static void __kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
>  		__wait_lapic_expire(vcpu, tsc_deadline - guest_tsc);
>  }
>  
> -void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu)
> +void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu, bool force_wait)
>  {
>  	if (lapic_in_kernel(vcpu) &&
>  	    vcpu->arch.apic->lapic_timer.expired_tscdeadline &&
>  	    vcpu->arch.apic->lapic_timer.timer_advance_ns &&
> -	    lapic_timer_int_injected(vcpu))
> +	    (force_wait || lapic_timer_int_injected(vcpu)))
>  		__kvm_wait_lapic_expire(vcpu);

If the guest_apic_protected idea works, rather than require TDX to tell the local
APIC that it should wait, the common code can instead assume a timer IRQ is pending
if the IRR holds garbage.

Again, compile tested only...

From: Sean Christopherson <seanjc@google.com>
Date: Fri, 8 Apr 2022 09:24:39 -0700
Subject: [PATCH] KVM: x86: Assume timer IRQ was injected if APIC state is
 protected

If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path.  The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.

Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/lapic.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 50a483abc0fe..e5555dce8db8 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1531,8 +1531,17 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
 static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+	u32 reg;

+	/*
+	 * Assume a timer IRQ was "injected" if the APIC is protected.  KVM's
+	 * copy of the vIRR is bogus, it's the responsibility of the caller to
+	 * precisely check whether or not a timer IRQ is pending.
+	 */
+	if (apic->guest_apic_protected)
+		return true;
+
+	reg  = kvm_lapic_get_reg(apic, APIC_LVTT);
 	if (kvm_apic_hw_enabled(apic)) {
 		int vec = reg & APIC_VECTOR_MASK;
 		void *bitmap = apic->regs + APIC_ISR;

base-commit: 33f2439cd63c84fcbc8b4cdd4eb731e83deead90
--

^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization
  2022-03-04 19:48 ` [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization isaku.yamahata
  2022-03-13 13:49   ` Paolo Bonzini
@ 2022-04-08 16:46   ` Sean Christopherson
  1 sibling, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-08 16:46 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl

On Fri, Mar 04, 2022, isaku.yamahata@intel.com wrote:
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> new file mode 100644
> index 000000000000..1acf08c310c4
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -0,0 +1,53 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpu.h>
> +
> +#include <asm/tdx.h>
> +
> +#include "capabilities.h"
> +#include "x86_ops.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) "tdx: " fmt
> +
> +static bool __read_mostly enable_tdx = true;
> +module_param_named(tdx, enable_tdx, bool, 0644);

This is comically unsafe; userspace must not be allowed to toggle enable_tdx
after KVM is loaded.
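
For example, a minimal sketch of the fix (a read-only parameter, with the
default tied to the Kconfig option; not taken from this series):

	/* 0444: enable_tdx can only be chosen at module load time. */
	static bool __read_mostly enable_tdx = IS_ENABLED(CONFIG_INTEL_TDX_HOST);
	module_param_named(tdx, enable_tdx, bool, 0444);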

> +static u64 hkid_mask __ro_after_init;
> +static u8 hkid_start_pos __ro_after_init;
> +
> +static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> +	u32 max_pa;
> +
> +	if (!enable_ept) {
> +		pr_warn("Cannot enable TDX with EPT disabled\n");
> +		return -EINVAL;
> +	}
> +
> +	if (!platform_has_tdx()) {
> +		pr_warn("Cannot enable TDX with SEAMRR disabled\n");
> +		return -ENODEV;
> +	}
> +
> +	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
> +		return -EIO;
> +
> +	max_pa = cpuid_eax(0x80000008) & 0xff;
> +	hkid_start_pos = boot_cpu_data.x86_phys_bits;
> +	hkid_mask = GENMASK_ULL(max_pa - 1, hkid_start_pos);
> +
> +	return 0;
> +}
> +
> +void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> +	/*
> +	 * This function is called at the initialization.  No need to protect
> +	 * enable_tdx.
> +	 */
> +	if (!enable_tdx)
> +		return;
> +
> +	if (__tdx_hardware_setup(&vt_x86_ops))
> +		enable_tdx = false;

Clearing enable_tdx here unnecessarily risks introducing bugs in the caller,
e.g. acting on enable_tdx before tdx_hardware_setup() is invoked.  I'm guessing
this was the result of trying to defer module load until VM creation.  With that
gone, the flag can be moved to vmx/main.c, as there should be zero reason for
tdx.c to check/modify enable_tdx, i.e. functions in tdx.c should never be called
if enable_tdx is false.

An alternative to

	if (enable_tdx && tdx_hardware_setup(&vt_x86_ops))
		enable_tdx = false;

would be

	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);

I actually prefer the latter (no "if"), but I already generated and wiped the below
diff before thinking of that.

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b79fcc8d81dd..43e13c2a804e 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -6,6 +6,9 @@
 #include "nested.h"
 #include "pmu.h"

+static bool __read_mostly enable_tdx = IS_ENABLED(CONFIG_INTEL_TDX_HOST);
+module_param_named(tdx, enable_tdx, bool, 0644);
+
 static __init int vt_hardware_setup(void)
 {
        int ret;
@@ -14,7 +17,8 @@ static __init int vt_hardware_setup(void)
        if (ret)
                return ret;

-       tdx_hardware_setup(&vt_x86_ops);
+       if (enable_tdx && tdx_hardware_setup(&vt_x86_ops))
+               enable_tdx = false;

        return 0;
 }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 1acf08c310c4..3f660f323426 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -9,13 +9,10 @@
 #undef pr_fmt
 #define pr_fmt(fmt) "tdx: " fmt

-static bool __read_mostly enable_tdx = true;
-module_param_named(tdx, enable_tdx, bool, 0644);
-
 static u64 hkid_mask __ro_after_init;
 static u8 hkid_start_pos __ro_after_init;

-static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+static int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 {
        u32 max_pa;

@@ -38,16 +35,3 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)

        return 0;
 }
-
-void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
-{
-       /*
-        * This function is called at the initialization.  No need to protect
-        * enable_tdx.
-        */
-       if (!enable_tdx)
-               return;
-
-       if (__tdx_hardware_setup(&vt_x86_ops))
-               enable_tdx = false;
-}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index ccf98e79d8c3..fd60128eb10a 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -124,9 +124,9 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 void vmx_setup_mce(struct kvm_vcpu *vcpu);

 #ifdef CONFIG_INTEL_TDX_HOST
-void __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 #else
-static inline void tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {}
+static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 #endif

 #endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply related	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE
  2022-04-05 14:14       ` Paolo Bonzini
@ 2022-04-08 18:38         ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08 18:38 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Kai Huang, isaku.yamahata, kvm, linux-kernel, isaku.yamahata,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, Apr 05, 2022 at 04:14:25PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 4/1/22 09:13, Kai Huang wrote:
> > Btw, I think the relevant part of TDP MMU change should be included in this
> > patch too otherwise TDP MMU is broken with this patch.
> 
> I agree.
> 
> Paolo
> 
> > Actually in this series legacy MMU is not supported to work with TDX, so above
> > change to legacy MMU doesn't matter actually.  Instead, TDP MMU change should be
> > here.

Sure, will reorganize it in the next respin.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-04-05 15:25   ` Paolo Bonzini
@ 2022-04-08 18:46     ` Isaku Yamahata
  2022-04-19 19:55       ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08 18:46 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Tue, Apr 05, 2022 at 05:25:34PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > +	if (enable_ept) {
> > +		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
> >   		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> > -				      cpu_has_vmx_ept_execute_only());
> > +				      cpu_has_vmx_ept_execute_only(), init_value);
> > +		kvm_mmu_set_spte_init_value(init_value);
> > +	}
> 
> I think kvm-intel.ko should use VMX_EPT_SUPPRESS_VE_BIT unconditionally as
> the init value.  The bit is ignored anyway if the "EPT-violation #VE"
> execution control is 0.  Otherwise looks good, but I have a couple more
> crazy ideas:
> 
> 1) there could even be a test mode where KVM enables the execution control,
> traps #VE in the exception bitmap, and shouts loudly if it gets a #VE.  That
> might avoid hard-to-find bugs due to forgetting about
> VMX_EPT_SUPPRESS_VE_BIT.
> 
> 2) or even, perhaps the init_value for the TDP MMU could set bit 63
> _unconditionally_, because KVM always sets the NX bit on AMD hardware. That
> would remove the whole infrastructure to keep shadow_init_value, because it
> would be constant 0 in mmu.c and constant BIT(63) in tdp_mmu.c.
> 
> Sean, what do you think?

Then I'll start with 1), because it's a bit hard for me to test 2) on real AMD
hardware.  If someone is willing to test 2), I'm quite happy to implement 2)
on top of 1); 2) isn't exclusive with 1).
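
Roughly, 1) could look like the following (just a sketch; the knob name and
the vmx->ve_info page are placeholders, and a real patch would also need to
check that the CPU actually supports the "EPT-violation #VE" control):

	static bool __read_mostly ept_violation_ve_test;
	module_param(ept_violation_ve_test, bool, 0444);

	static void vmx_setup_ve_test_mode(struct vcpu_vmx *vmx)
	{
		if (!ept_violation_ve_test)
			return;

		/* Deliver #VE instead of an EPT violation for non-suppressed SPTEs. */
		secondary_exec_controls_setbit(vmx, SECONDARY_EXEC_EPT_VIOLATION_VE);
		vmcs_write64(VE_INFORMATION_ADDRESS, __pa(vmx->ve_info));

		/* Intercept #VE (vector 20) so KVM, not the guest, sees it. */
		vmcs_write32(EXCEPTION_BITMAP,
			     vmcs_read32(EXCEPTION_BITMAP) | (1u << X86_TRAP_VE));
	}

handle_exception_nmi() would then WARN loudly if a #VE ever arrives, since
that would mean some SPTE was installed without VMX_EPT_SUPPRESS_VE_BIT.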
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-04-06 11:06   ` Kai Huang
  2022-04-07  3:05     ` Kai Huang
@ 2022-04-08 19:12     ` Isaku Yamahata
  2022-04-08 23:34       ` Kai Huang
  1 sibling, 1 reply; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-08 19:12 UTC (permalink / raw)
  To: Kai Huang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	Jim Mattson, erdemaktas, Connor Kuehl, Sean Christopherson

On Wed, Apr 06, 2022 at 11:06:41PM +1200,
Kai Huang <kai.huang@intel.com> wrote:

> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index 5071e8332db2..ea83927b9231 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -29,8 +29,7 @@ u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> >  u64 __read_mostly shadow_user_mask;
> >  u64 __read_mostly shadow_accessed_mask;
> >  u64 __read_mostly shadow_dirty_mask;
> > -u64 __read_mostly shadow_mmio_value;
> > -u64 __read_mostly shadow_mmio_mask;
> > +u64 __read_mostly shadow_default_mmio_mask;
> >  u64 __read_mostly shadow_mmio_access_mask;
> >  u64 __read_mostly shadow_present_mask;
> >  u64 __read_mostly shadow_me_mask;
> > @@ -59,10 +58,11 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
> >  	u64 spte = generation_mmio_spte_mask(gen);
> >  	u64 gpa = gfn << PAGE_SHIFT;
> >  
> > -	WARN_ON_ONCE(!shadow_mmio_value);
> > +	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
> > +		     !kvm_gfn_stolen_mask(vcpu->kvm));
> >  
> >  	access &= shadow_mmio_access_mask;
> > -	spte |= shadow_mmio_value | access;
> > +	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
> >  	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
> >  	spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
> >  		<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
> > @@ -279,7 +279,8 @@ u64 mark_spte_for_access_track(u64 spte)
> >  	return spte;
> >  }
> >  
> > -void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> > +void kvm_mmu_set_mmio_spte_mask(struct kvm *kvm, u64 mmio_value, u64 mmio_mask,
> > +				u64 access_mask)
> >  {
> >  	BUG_ON((u64)(unsigned)access_mask != access_mask);
> >  	WARN_ON(mmio_value & shadow_nonpresent_or_rsvd_lower_gfn_mask);
> > @@ -308,39 +309,32 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
> >  	    WARN_ON(mmio_value && (REMOVED_SPTE & mmio_mask) == mmio_value))
> >  		mmio_value = 0;
> >  
> > -	shadow_mmio_value = mmio_value;
> > -	shadow_mmio_mask  = mmio_mask;
> > +	kvm->arch.shadow_mmio_value = mmio_value;
> > +	kvm->arch.shadow_mmio_mask = mmio_mask;
> >  	shadow_mmio_access_mask = access_mask;
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
> >  
> > -void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
> > +void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only, u64 init_value)
> >  {
> >  	shadow_user_mask	= VMX_EPT_READABLE_MASK;
> >  	shadow_accessed_mask	= has_ad_bits ? VMX_EPT_ACCESS_BIT : 0ull;
> >  	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
> >  	shadow_nx_mask		= 0ull;
> >  	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
> > -	shadow_present_mask	= has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> > +	shadow_present_mask	=
> > +		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | init_value;
> 
> This change doesn't seem make any sense.  Why should "Suppress #VE" bit be set
> for a present PTE?

Because a W or NX violation also needs #VE.  Although the name says present,
the mask actually means readable.


> >  	shadow_acc_track_mask	= VMX_EPT_RWX_MASK;
> >  	shadow_me_mask		= 0ull;
> >  
> >  	shadow_host_writable_mask = EPT_SPTE_HOST_WRITABLE;
> >  	shadow_mmu_writable_mask  = EPT_SPTE_MMU_WRITABLE;
> > -
> > -	/*
> > -	 * EPT Misconfigurations are generated if the value of bits 2:0
> > -	 * of an EPT paging-structure entry is 110b (write/execute).
> > -	 */
> > -	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> > -				   VMX_EPT_RWX_MASK, 0);
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
> >  
> >  void kvm_mmu_reset_all_pte_masks(void)
> >  {
> >  	u8 low_phys_bits;
> > -	u64 mask;
> >  
> >  	shadow_phys_bits = kvm_get_shadow_phys_bits();
> >  
> > @@ -389,9 +383,13 @@ void kvm_mmu_reset_all_pte_masks(void)
> >  	 * PTEs and so the reserved PA approach must be disabled.
> >  	 */
> >  	if (shadow_phys_bits < 52)
> > -		mask = BIT_ULL(51) | PT_PRESENT_MASK;
> > +		shadow_default_mmio_mask = BIT_ULL(51) | PT_PRESENT_MASK;
> 
> Hmm...  Not related to this patch, but it seems there's a bug here.  On a MKTME
> enabled system (but not TDX) with 52 physical bits, the shadow_phys_bits will be
> set to < 52 (depending on how many MKTME KeyIDs are configured by BIOS).  In
> this case, bit 51 is set, but actually bit 51 isn't a reserved bit in this case.
> Instead, it is a MKTME KeyID bit.  Therefore, above setting won't cause #PF, but
> will use a non-zero MKTME keyID to access the physical address.
> 
> Paolo/Sean, any comments here?
> 
> >  	else
> > -		mask = 0;
> > +		shadow_default_mmio_mask = 0;
> > +}
> >  
> > -	kvm_mmu_set_mmio_spte_mask(mask, mask, ACC_WRITE_MASK | ACC_USER_MASK);
> > +void kvm_mmu_set_default_mmio_spte_mask(u64 mask)
> > +{
> > +	shadow_default_mmio_mask = mask;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_mmu_set_default_mmio_spte_mask);
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index 8e13a35ab8c9..bde843bce878 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -165,8 +165,7 @@ extern u64 __read_mostly shadow_x_mask; /* mutual exclusive with nx_mask */
> >  extern u64 __read_mostly shadow_user_mask;
> >  extern u64 __read_mostly shadow_accessed_mask;
> >  extern u64 __read_mostly shadow_dirty_mask;
> > -extern u64 __read_mostly shadow_mmio_value;
> > -extern u64 __read_mostly shadow_mmio_mask;
> > +extern u64 __read_mostly shadow_default_mmio_mask;
> >  extern u64 __read_mostly shadow_mmio_access_mask;
> >  extern u64 __read_mostly shadow_present_mask;
> >  extern u64 __read_mostly shadow_me_mask;
> > @@ -229,10 +228,10 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_lower_gfn_mask;
> >   */
> >  extern u8 __read_mostly shadow_phys_bits;
> >  
> > -static inline bool is_mmio_spte(u64 spte)
> > +static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
> >  {
> > -	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
> > -	       likely(shadow_mmio_value);
> > +	return (spte & kvm->arch.shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
> > +		likely(kvm->arch.shadow_mmio_value || kvm_gfn_stolen_mask(kvm));
> 
> I don't like using kvm_gfn_stolen_mask() to check whether an SPTE is MMIO.
> kvm_gfn_stolen_mask() really doesn't imply anything regarding setting up the
> value of an MMIO SPTE.  At least, I guess we can use some is_protected_vm()
> sort of thing, since it implies guest memory is protected and therefore the
> legacy way of handling MMIO doesn't work (i.e. you cannot parse the MMIO
> instruction).

As discussed in the other thread, let's rename those functions.
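e.g. something along the lines of

	/* Hypothetical helper; the name is a placeholder. */
	static inline bool kvm_is_protected_vm(struct kvm *kvm)
	{
		return kvm_gfn_stolen_mask(kvm) != 0;
	}

so the MMIO-SPTE check reads as a statement about the VM type rather than
about stolen GPA bits.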


> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 07fd892768be..00f88aa25047 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -7065,6 +7065,14 @@ int vmx_vm_init(struct kvm *kvm)
> >  	if (!ple_gap)
> >  		kvm->arch.pause_in_guest = true;
> >  
> > +	/*
> > +	 * EPT Misconfigurations can be generated if the value of bits 2:0
> > +	 * of an EPT paging-structure entry is 110b (write/execute).
> > +	 */
> > +	if (enable_ept)
> > +		kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
> > +					   VMX_EPT_MISCONFIG_WX_VALUE, 0);
> 
> Should be:
> 
> 	kvm_mmu_set_mmio_spte_mask(kvm, VMX_EPT_MISCONFIG_WX_VALUE,
> 				   	VMX_EPT_RWX_MASK, 0);

Thanks for catching it.  It's fixed in the github repo.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-04-08 19:12     ` Isaku Yamahata
@ 2022-04-08 23:34       ` Kai Huang
  0 siblings, 0 replies; 310+ messages in thread
From: Kai Huang @ 2022-04-08 23:34 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Paolo Bonzini, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Fri, 2022-04-08 at 12:12 -0700, Isaku Yamahata wrote:
> > > -	shadow_present_mask	= has_exec_only ? 0ull :
> > > VMX_EPT_READABLE_MASK;
> > > +	shadow_present_mask	=
> > > +		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) |
> > > init_value;
> > 
> > This change doesn't seem make any sense.  Why should "Suppress #VE" bit be
> > set
> > for a present PTE?
> 
> Because W or NX violation also needs #VE.  Although the name uses present,
> it's
> actually readable.

Yeah I forgot this.  Thanks!

-- 
Thanks,
-Kai



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-08 14:51             ` Sean Christopherson
@ 2022-04-11 17:40               ` Paolo Bonzini
  2022-04-14 17:09                 ` Sean Christopherson
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-11 17:40 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Isaku Yamahata, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl

On 4/8/22 16:51, Sean Christopherson wrote:
>> It also documents how it has to be used.  So this looks more or less okay,
>> just rename "vmxip" to "interrupt_pending_delivery".
> 
> If we're keeping the call back into SEAM, then this belongs in the path of
> apic_has_interrupt_for_ppr(), not in the HLT-exit path.  To avoid multiple SEAMCALLS
> in a single exit, VCPU_EXREG_RVI can be added.

But apic_has_interrupt_for_ppr takes a PPR argument and that is not 
available.

So I suppose you mean kvm_apic_has_interrupt?  You would change that to 
a callback, like

         if (!kvm_apic_present(vcpu))
                 return -1;

	return static_call(kvm_x86_apic_has_interrupt)(vcpu);
}

and the default version would also be inlined in kvm_get_apic_interrupt, 
like

-       int vector = kvm_apic_has_interrupt(vcpu);
         struct kvm_lapic *apic = vcpu->arch.apic;
         u32 ppr;

-       if (vector == -1)
+       if (!kvm_apic_present(vcpu))
                 return -1;
+       __apic_update_ppr(apic, &ppr);
+	vector = apic_has_interrupt_for_ppr(apic, ppr);

Checking the SEAM state (which would likewise not be VCPU_EXREG_RVI, but 
more like VCPU_EXREG_INTR_PENDING) would be done in the tdx case of 
kvm_x86_apic_has_interrupt.
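
Putting the two fragments together, a sketch of the shape (with
kvm_x86_apic_has_interrupt as the hypothetical new hook):

	int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
	{
		if (!kvm_apic_present(vcpu))
			return -1;

		return static_call(kvm_x86_apic_has_interrupt)(vcpu);
	}

The default (VMX/SVM) implementation would do the usual

	__apic_update_ppr(apic, &ppr);
	return apic_has_interrupt_for_ppr(apic, ppr);

while the TDX implementation would consult the cached pending-interrupt
state instead of the stale vIRR.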

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state
  2022-03-04 19:49 ` [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
  2022-04-05 15:56   ` Paolo Bonzini
@ 2022-04-12  6:49   ` Xiaoyao Li
  2022-04-12  6:52     ` Paolo Bonzini
  1 sibling, 1 reply; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-12  6:49 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On 3/5/2022 3:49 AM, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> TDX protects TDX guest state from the VMM.  Implement access methods for
> TDX guest state that ignore writes or return zero.
> 

...

> +void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
> +{
> +	kvm_register_mark_available(vcpu, reg);
> +	switch (reg) {
> +	case VCPU_REGS_RSP:
> +	case VCPU_REGS_RIP:
> +	case VCPU_EXREG_PDPTR:
> +	case VCPU_EXREG_CR0:
> +	case VCPU_EXREG_CR3:
> +	case VCPU_EXREG_CR4:
> +		break;
> +	default:
> +		KVM_BUG_ON(1, vcpu->kvm);
> +		break;
> +	}
> +}

Isaku,

We missed one case where some GPRs are accessible by KVM/userspace on a
TDVMCALL exit.

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state
  2022-04-12  6:49   ` Xiaoyao Li
@ 2022-04-12  6:52     ` Paolo Bonzini
  2022-04-12  7:31       ` Xiaoyao Li
  0 siblings, 1 reply; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-12  6:52 UTC (permalink / raw)
  To: Xiaoyao Li, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/12/22 08:49, Xiaoyao Li wrote:
> 
>> +void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>> +{
>> +    kvm_register_mark_available(vcpu, reg);
>> +    switch (reg) {
>> +    case VCPU_REGS_RSP:
>> +    case VCPU_REGS_RIP:
>> +    case VCPU_EXREG_PDPTR:
>> +    case VCPU_EXREG_CR0:
>> +    case VCPU_EXREG_CR3:
>> +    case VCPU_EXREG_CR4:
>> +        break;
>> +    default:
>> +        KVM_BUG_ON(1, vcpu->kvm);
>> +        break;
>> +    }
>> +}
> 
> Isaku,
> 
> We missed one case that some GPRs are accessible by KVM/userspace for 
> TDVMCALL exit.

If a register is not in the VMX_REGS_LAZY_LOAD_SET it will never be 
passed to tdx_cache_reg.  As far as I understand those TDVMCALL 
registers do not include either RSP or RIP.

Paolo


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state
  2022-04-12  6:52     ` Paolo Bonzini
@ 2022-04-12  7:31       ` Xiaoyao Li
  0 siblings, 0 replies; 310+ messages in thread
From: Xiaoyao Li @ 2022-04-12  7:31 UTC (permalink / raw)
  To: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 4/12/2022 2:52 PM, Paolo Bonzini wrote:
> On 4/12/22 08:49, Xiaoyao Li wrote:
>>
>>> +void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
>>> +{
>>> +    kvm_register_mark_available(vcpu, reg);
>>> +    switch (reg) {
>>> +    case VCPU_REGS_RSP:
>>> +    case VCPU_REGS_RIP:
>>> +    case VCPU_EXREG_PDPTR:
>>> +    case VCPU_EXREG_CR0:
>>> +    case VCPU_EXREG_CR3:
>>> +    case VCPU_EXREG_CR4:
>>> +        break;
>>> +    default:
>>> +        KVM_BUG_ON(1, vcpu->kvm);
>>> +        break;
>>> +    }
>>> +}
>>
>> Isaku,
>>
>> We missed one case that some GPRs are accessible by KVM/userspace for 
>> TDVMCALL exit.
> 
> If a register is not in the VMX_REGS_LAZY_LOAD_SET it will never be 
> passed to tdx_cache_reg.  As far as I understand those TDVMCALL 
> registers do not include either RSP or RIP.

Sorry, I should not have kept the code snippet of tdx_cache_reg() as a
reference; it misled you and other people.

I just want to point out that in certain TDVMCALL cases, GPRs might
be accessible.

> Paolo
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall
  2022-04-11 17:40               ` Paolo Bonzini
@ 2022-04-14 17:09                 ` Sean Christopherson
  0 siblings, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-14 17:09 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: Isaku Yamahata, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl

On Mon, Apr 11, 2022, Paolo Bonzini wrote:
> On 4/8/22 16:51, Sean Christopherson wrote:
> > > It also documents how it has to be used.  So this looks more or less okay,
> > > just rename "vmxip" to "interrupt_pending_delivery".
> > 
> > If we're keeping the call back into SEAM, then this belongs in the path of
> > apic_has_interrupt_for_ppr(), not in the HLT-exit path.  To avoid multiple SEAMCALLS
> > in a single exit, VCPU_EXREG_RVI can be added.
> 
> But apic_has_interrupt_for_ppr takes a PPR argument and that is not
> available.
> 
> So I suppose you mean kvm_apic_has_interrupt?

Yeah, I realized that when I actually tried to implement my idea in code :-)

My hopefully-fully-thought-out idea for handling this:

https://lore.kernel.org/all/YlBhuWElVRwYrrS+@google.com
https://lore.kernel.org/all/YlBkiOmTGk8VlWFh@google.com

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure
  2022-04-08  0:51     ` Isaku Yamahata
@ 2022-04-15 13:47       ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 13:47 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: isaku.yamahata, kvm, linux-kernel, Jim Mattson, erdemaktas,
	Connor Kuehl, Sean Christopherson

On 4/8/22 02:51, Isaku Yamahata wrote:
>> Please rename the function to explain what it does, for example
>> tdx_mmu_release_hkid.
> In patch 021/104, you suggested flush_shadow_all_private().
> Which do you prefer, flush_shadow_all_private or tdx_mmu_release_hkid?

vt_mmu_prezap should become vt_flush_shadow_all_private.

tdx_mmu_prezap should become tdx_mmu_release_hkid.

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 061/104] KVM: TDX: Finalize VM initialization
  2022-03-04 19:49 ` [RFC PATCH v5 061/104] KVM: TDX: Finalize VM initialization isaku.yamahata
@ 2022-04-15 13:52   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 13:52 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> To protect the initial contents of the guest TD, the TDX module measures
> the guest TD during the build process as a SHA-384 measurement.  The
> measurement of the guest TD contents needs to be completed to make the
> guest TD ready to run.
> 
> Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
> KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
> to run.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/uapi/asm/kvm.h       |  1 +
>   arch/x86/kvm/vmx/tdx.c                | 21 +++++++++++++++++++++
>   tools/arch/x86/include/uapi/asm/kvm.h |  1 +
>   3 files changed, 23 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 77f46260d868..943219a08fcd 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
>   	KVM_TDX_INIT_VM,
>   	KVM_TDX_INIT_VCPU,
>   	KVM_TDX_INIT_MEM_REGION,
> +	KVM_TDX_FINALIZE_VM,
>   
>   	KVM_TDX_CMD_NR_MAX,
>   };
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cd726c41d362..85d5f961d97e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1103,6 +1103,24 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   	return ret;
>   }
>   
> +static int tdx_td_finalizemr(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	u64 err;
> +
> +	if (!is_td_initialized(kvm) || is_td_finalized(kvm_tdx))
> +		return -EINVAL;
> +
> +	err = tdh_mr_finalize(kvm_tdx->tdr.pa);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
> +		return -EIO;
> +	}
> +
> +	kvm_tdx->finalized = true;
> +	return 0;
> +}
> +
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>   {
>   	struct kvm_tdx_cmd tdx_cmd;
> @@ -1123,6 +1141,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>   	case KVM_TDX_INIT_MEM_REGION:
>   		r = tdx_init_mem_region(kvm, &tdx_cmd);
>   		break;
> +	case KVM_TDX_FINALIZE_VM:
> +		r = tdx_td_finalizemr(kvm);
> +		break;
>   	default:
>   		r = -EINVAL;
>   		goto out;
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index 77f46260d868..943219a08fcd 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -534,6 +534,7 @@ enum kvm_tdx_cmd_id {
>   	KVM_TDX_INIT_VM,
>   	KVM_TDX_INIT_VCPU,
>   	KVM_TDX_INIT_MEM_REGION,
> +	KVM_TDX_FINALIZE_VM,
>   
>   	KVM_TDX_CMD_NR_MAX,
>   };

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Note however that errors should be passed back in the struct.
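
Something along these lines, assuming struct kvm_tdx_cmd grows a __u64
"error" field (not present in this revision), tdx_td_finalizemr() is passed
the cmd, and tdx_vm_ioctl() copies the struct back to userspace:

	err = tdh_mr_finalize(kvm_tdx->tdr.pa);
	cmd->error = err;	/* hand the raw TDX status back to userspace */
	if (WARN_ON_ONCE(err)) {
		pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
		return -EIO;
	}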

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 065/104] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2022-03-04 19:49 ` [RFC PATCH v5 065/104] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
@ 2022-04-15 13:56   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 13:56 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> On entering/exiting a TDX vcpu, the preserved or clobbered CPU state differs
> from the VMX case.  Add TDX hooks to save/restore host/guest CPU state, and
> save/restore the kernel GS base MSR.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    | 28 +++++++++++++++++++++++++--
>   arch/x86/kvm/vmx/tdx.c     | 39 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/tdx.h     |  4 ++++
>   arch/x86/kvm/vmx/x86_ops.h |  4 ++++
>   4 files changed, 73 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 2e5a7a72d560..f9d43f2de145 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -89,6 +89,30 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	return vmx_vcpu_reset(vcpu, init_event);
>   }
>   
> +static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * All host state is saved/restored across SEAMCALL/SEAMRET, and the
> +	 * guest state of a TD is obviously off limits.  Deferring MSRs and DRs
> +	 * is pointless because the TDX module needs to load *something* so as
> +	 * not to expose guest state.
> +	 */
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_prepare_switch_to_guest(vcpu);
> +		return;
> +	}
> +
> +	vmx_prepare_switch_to_guest(vcpu);
> +}
> +
> +static void vt_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_put(vcpu);
> +
> +	return vmx_vcpu_put(vcpu);
> +}
> +
>   static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -174,9 +198,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.vcpu_free = vt_vcpu_free,
>   	.vcpu_reset = vt_vcpu_reset,
>   
> -	.prepare_guest_switch = vmx_prepare_switch_to_guest,
> +	.prepare_guest_switch = vt_prepare_switch_to_guest,
>   	.vcpu_load = vmx_vcpu_load,
> -	.vcpu_put = vmx_vcpu_put,
> +	.vcpu_put = vt_vcpu_put,
>   
>   	.update_exception_bitmap = vmx_update_exception_bitmap,
>   	.get_msr_feature = vmx_get_msr_feature,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index ebe4f9bf19e7..7a288aae03ba 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1,5 +1,6 @@
>   // SPDX-License-Identifier: GPL-2.0
>   #include <linux/cpu.h>
> +#include <linux/mmu_context.h>
>   
>   #include <asm/tdx.h>
>   
> @@ -407,6 +408,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   	vcpu->arch.guest_state_protected =
>   		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
>   
> +	tdx->host_state_need_save = true;
> +	tdx->host_state_need_restore = false;
> +
>   	return 0;
>   
>   free_tdvpx:
> @@ -420,6 +424,39 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   	return ret;
>   }
>   
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	if (!tdx->host_state_need_save)
> +		return;
> +
> +	if (likely(is_64bit_mm(current->mm)))
> +		tdx->msr_host_kernel_gs_base = current->thread.gsbase;
> +	else
> +		tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
> +
> +	tdx->host_state_need_save = false;
> +}
> +
> +static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	tdx->host_state_need_save = true;
> +	if (!tdx->host_state_need_restore)
> +		return;
> +
> +	wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
> +	tdx->host_state_need_restore = false;
> +}
> +
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	vmx_vcpu_pi_put(vcpu);
> +	tdx_prepare_switch_to_host(vcpu);
> +}
> +
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -535,6 +572,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>   
>   	tdx_vcpu_enter_exit(vcpu, tdx);
>   
> +	tdx->host_state_need_restore = true;
> +
>   	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>   	trace_kvm_exit(vcpu, KVM_ISA_VMX);
>   
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index e950404ce5de..8b1cf9c158e3 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -84,6 +84,10 @@ struct vcpu_tdx {
>   	union tdx_exit_reason exit_reason;
>   
>   	bool initialized;
> +
> +	bool host_state_need_save;
> +	bool host_state_need_restore;
> +	u64 msr_host_kernel_gs_base;
>   };
>   
>   static inline bool is_td(struct kvm *kvm)
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 44404dd25737..8b871c5f52cf 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -141,6 +141,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
>   fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
> +void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_put(struct kvm_vcpu *vcpu);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -162,6 +164,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>   static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
>   static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
> +static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 067/104] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  2022-03-04 19:49 ` [RFC PATCH v5 067/104] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
@ 2022-04-15 14:02   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:02 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Chao Gao <chao.gao@intel.com>
> 
> Several MSRs are constant and only used in userspace (ring 3).  But VMs may
> have different values.  KVM uses kvm_set_user_return_msr() to switch to
> guest's values and leverages user return notifier to restore them when the
> kernel is to return to userspace.  To eliminate unnecessary wrmsr, KVM also
> caches the value it wrote to an MSR last time.
> 
> TDX module unconditionally resets some of these MSRs to architectural INIT
> state on TD exit.  This makes the cached values in kvm_user_return_msrs
> inconsistent with the values in hardware.  This inconsistency needs to be
> fixed.  Otherwise, it may mislead kvm_on_user_return() to skip restoring
> some MSRs to the host's values.  kvm_set_user_return_msr() can help correct
> this case, but it is not optimal as it always does a wrmsr.  So, introduce
> a variation of kvm_set_user_return_msr() to update cached values and skip
> that wrmsr.
> 
> Signed-off-by: Chao Gao <chao.gao@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/asm/kvm_host.h |  1 +
>   arch/x86/kvm/x86.c              | 25 ++++++++++++++++++++-----
>   2 files changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 8406f8b5ab74..b6396d11139e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1894,6 +1894,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
>   int kvm_add_user_return_msr(u32 msr);
>   int kvm_find_user_return_msr(u32 msr);
>   int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
> +void kvm_user_return_update_cache(unsigned int index, u64 val);
>   
>   static inline bool kvm_is_supported_user_return_msr(u32 msr)
>   {
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 66400810d54f..45e8a02e99bf 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -427,6 +427,15 @@ static void kvm_user_return_msr_cpu_online(void)
>   	}
>   }
>   
> +static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
> +{
> +	if (!msrs->registered) {
> +		msrs->urn.on_user_return = kvm_on_user_return;
> +		user_return_notifier_register(&msrs->urn);
> +		msrs->registered = true;
> +	}
> +}
> +
>   int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
>   {
>   	unsigned int cpu = smp_processor_id();
> @@ -441,15 +450,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
>   		return 1;
>   
>   	msrs->values[slot].curr = value;
> -	if (!msrs->registered) {
> -		msrs->urn.on_user_return = kvm_on_user_return;
> -		user_return_notifier_register(&msrs->urn);
> -		msrs->registered = true;
> -	}
> +	kvm_user_return_register_notifier(msrs);
>   	return 0;
>   }
>   EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
>   
> +/* Update the cache, "curr", and register the notifier */
> +void kvm_user_return_update_cache(unsigned int slot, u64 value)
> +{
> +	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
> +
> +	msrs->values[slot].curr = value;
> +	kvm_user_return_register_notifier(msrs);
> +}
> +EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
> +
>   static void drop_user_return_notifiers(void)
>   {
>   	unsigned int cpu = smp_processor_id();

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 068/104] KVM: TDX: restore user ret MSRs
  2022-03-04 19:49 ` [RFC PATCH v5 068/104] KVM: TDX: restore user ret MSRs isaku.yamahata
@ 2022-04-15 14:06   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:06 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Several user ret MSRs are clobbered on TD exit.  Restore those values on
> TD exit and before returning to ring 3.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 33 +++++++++++++++++++++++++++++++++
>   1 file changed, 33 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 54be5be1a06c..c1366aac7d96 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -550,6 +550,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	vcpu->kvm->vm_bugged = true;
>   }
>   
> +struct tdx_uret_msr {
> +	u32 msr;
> +	unsigned int slot;
> +	u64 defval;
> +};
> +
> +static struct tdx_uret_msr tdx_uret_msrs[] = {
> +	{.msr = MSR_SYSCALL_MASK,},
> +	{.msr = MSR_STAR,},
> +	{.msr = MSR_LSTAR,},
> +	{.msr = MSR_TSC_AUX,},
> +};
> +
> +static void tdx_user_return_update_cache(void)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
> +		kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
> +					     tdx_uret_msrs[i].defval);
> +}
> +
>   static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
>   {
>   	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> @@ -589,6 +611,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>   
>   	tdx_vcpu_enter_exit(vcpu, tdx);
>   
> +	tdx_user_return_update_cache();
>   	tdx_restore_host_xsave_state(vcpu);
>   	tdx->host_state_need_restore = true;
>   
> @@ -1371,6 +1394,16 @@ static int __init __tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>   	if (WARN_ON_ONCE(x86_ops->tlb_remote_flush))
>   		return -EIO;
>   
> +	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
> +		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
> +		if (tdx_uret_msrs[i].slot == -1) {
> +			/* If any MSR isn't supported, it is a KVM bug */
> +			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
> +				tdx_uret_msrs[i].msr);
> +			return -EIO;
> +		}
> +	}
> +
>   	max_pkgs = topology_max_packages();
>   	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
>   				   GFP_KERNEL);

I wonder if you only need to do this if 
!this_cpu_ptr(user_return_msrs)->registered, but not a big deal.

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 070/104] KVM: TDX: complete interrupts after tdexit
  2022-03-04 19:49 ` [RFC PATCH v5 070/104] KVM: TDX: complete interrupts after tdexit isaku.yamahata
@ 2022-04-15 14:07   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:07 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> This corresponds to VMX __vmx_complete_interrupts().  Because TDX
> virtualizes the vAPIC, KVM only needs to care about NMI injection.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
>   1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c1366aac7d96..3cb2fbd1c12c 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -550,6 +550,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>   	vcpu->kvm->vm_bugged = true;
>   }
>   
> +static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
> +{
> +	/* Avoid costly SEAMCALL if no nmi was injected */
> +	if (vcpu->arch.nmi_injected)
> +		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
> +							      TD_VCPU_PEND_NMI);
> +}
> +
>   struct tdx_uret_msr {
>   	u32 msr;
>   	unsigned int slot;
> @@ -618,6 +626,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>   	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
>   	trace_kvm_exit(vcpu, KVM_ISA_VMX);
>   
> +	tdx_complete_interrupts(vcpu);
> +
>   	if (tdx->exit_reason.error || tdx->exit_reason.non_recoverable)
>   		return EXIT_FASTPATH_NONE;
>   	return EXIT_FASTPATH_NONE;

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 072/104] KVM: TDX: handle vcpu migration over logical processor
  2022-03-04 19:49 ` [RFC PATCH v5 072/104] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
@ 2022-04-15 14:14   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:14 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For vcpu migration, in the case of VMX, the VMCS is flushed on the source
> pcpu and loaded on the target pcpu.  There are corresponding TDX SEAMCALL
> APIs; call them on vcpu migration.  The logic is mostly the same as VMX
> except that TDX SEAMCALLs are used.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    | 20 +++++++++++++--
>   arch/x86/kvm/vmx/tdx.c     | 51 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/x86_ops.h |  2 ++
>   3 files changed, 71 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index f9d43f2de145..2cd5ba0e8788 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -121,6 +121,14 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
>   	return vmx_vcpu_run(vcpu);
>   }
>   
> +static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_load(vcpu, cpu);
> +
> +	return vmx_vcpu_load(vcpu, cpu);
> +}
> +
>   static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -162,6 +170,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>   	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
>   }
>   
> +static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return;
> +
> +	vmx_sched_in(vcpu, cpu);
> +}
> +
>   static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
>   {
>   	if (!is_td(kvm))
> @@ -199,7 +215,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.vcpu_reset = vt_vcpu_reset,
>   
>   	.prepare_guest_switch = vt_prepare_switch_to_guest,
> -	.vcpu_load = vmx_vcpu_load,
> +	.vcpu_load = vt_vcpu_load,
>   	.vcpu_put = vt_vcpu_put,
>   
>   	.update_exception_bitmap = vmx_update_exception_bitmap,
> @@ -285,7 +301,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   
>   	.request_immediate_exit = vmx_request_immediate_exit,
>   
> -	.sched_in = vmx_sched_in,
> +	.sched_in = vt_sched_in,
>   
>   	.cpu_dirty_log_size = PML_ENTITY_NUM,
>   	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 37cf7d43435d..a6b1a8ce888d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -85,6 +85,18 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
>   	return kvm_tdx->finalized;
>   }
>   
> +static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
> +{
> +	/*
> +	 * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1,
> +	 * otherwise a different CPU can see vcpu->cpu = -1 and add the vCPU
> +	 * to its list before it's deleted from this CPU's list.
> +	 */
> +	smp_wmb();
> +
> +	vcpu->cpu = -1;
> +}
> +
>   static void tdx_clear_page(unsigned long page)
>   {
>   	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -155,6 +167,39 @@ static void tdx_reclaim_td_page(struct tdx_td_page *page)
>   	free_page(page->va);
>   }
>   
> +static void tdx_flush_vp(void *arg)
> +{
> +	struct kvm_vcpu *vcpu = arg;
> +	u64 err;
> +
> +	/* Task migration can race with CPU offlining. */
> +	if (vcpu->cpu != raw_smp_processor_id())
> +		return;
> +
> +	/*
> +	 * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized.  The
> +	 * list tracking still needs to be updated so that it's correct if/when
> +	 * the vCPU does get initialized.
> +	 */
> +	if (is_td_vcpu_created(to_tdx(vcpu))) {
> +		err = tdh_vp_flush(to_tdx(vcpu)->tdvpr.pa);
> +		if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
> +			if (WARN_ON_ONCE(err))
> +				pr_tdx_error(TDH_VP_FLUSH, err, NULL);
> +		}
> +	}
> +
> +	tdx_disassociate_vp(vcpu);
> +}
> +
> +static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
> +{
> +	if (unlikely(vcpu->cpu == -1))
> +		return;
> +
> +	smp_call_function_single(vcpu->cpu, tdx_flush_vp, vcpu, 1);
> +}
> +
>   static int tdx_do_tdh_phymem_cache_wb(void *param)
>   {
>   	u64 err = 0;
> @@ -425,6 +470,12 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>   	return ret;
>   }
>   
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	if (vcpu->cpu != cpu)
> +		tdx_flush_vp_on_cpu(vcpu);
> +}
> +
>   void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 8b871c5f52cf..ceafd6e18f4e 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -143,6 +143,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
>   fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
>   void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_put(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -166,6 +167,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
>   static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
>   static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

This patch and the next one might even be squashed together.

Otherwise

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
  2022-04-08 16:24   ` Sean Christopherson
@ 2022-04-15 14:20     ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:20 UTC (permalink / raw)
  To: Sean Christopherson, isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl

On 4/8/22 18:24, Sean Christopherson wrote:
>> Return true for kvm_vcpu_has_events() if the vCPU has a pending APICv
>> interrupt to support TDX's usage of APICv.  Unlike VMX, TDX doesn't have
>> access to vmcs.GUEST_INTR_STATUS and so can't emulate posted interrupts,
> Based on the discussion in the HLT patch, this is no longer true.
> 

It's still true; it only has access to RVI > PPR (which is enough to check
whether the vCPU is runnable).

> Rather than hook this path, I would rather we tag kvm_apic has having some of its
> state protected.  Then kvm_cpu_has_interrupt() can invoke the alternative,
> protected-apic-only hook when appropriate, and kvm_apic_has_interrupt() can bail
> immediately instead of doing useless processing of stale vAPIC state.

Agreed, this is similar to my suggestion on the HLT patch:

https://lkml.kernel.org/r/a7d28775-2dbe-7d97-7053-e182bd5be51c@redhat.com

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 084/104] KVM: TDX: Add a place holder to handle TDX VM exit
  2022-03-04 19:49 ` [RFC PATCH v5 084/104] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
@ 2022-04-15 14:20   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:20 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up handle_exit and handle_exit_irqoff methods and add a place holder
> to handle VM exit.  Add helper functions to get exit info, exit
> qualification, etc.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    | 35 +++++++++++++++--
>   arch/x86/kvm/vmx/tdx.c     | 79 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/x86_ops.h | 11 ++++++
>   3 files changed, 122 insertions(+), 3 deletions(-)

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index aa84c13f8ee1..1e65406e3882 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -148,6 +148,23 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
>   	return vmx_vcpu_load(vcpu, cpu);
>   }
>   
> +static int vt_handle_exit(struct kvm_vcpu *vcpu,
> +			     enum exit_fastpath_completion fastpath)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_handle_exit(vcpu, fastpath);
> +
> +	return vmx_handle_exit(vcpu, fastpath);
> +}
> +
> +static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_handle_exit_irqoff(vcpu);
> +
> +	vmx_handle_exit_irqoff(vcpu);
> +}
> +
>   static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -340,6 +357,18 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
>   	vmx_request_immediate_exit(vcpu);
>   }
>   
> +static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> +			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
> +{
> +	if (is_td_vcpu(vcpu)) {
> +		tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
> +				error_code);
> +		return;
> +	}
> +
> +	vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
> +}
> +
>   static int vt_mem_enc_op(struct kvm *kvm, void __user *argp)
>   {
>   	if (!is_td(kvm))
> @@ -411,7 +440,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   
>   	.vcpu_pre_run = vmx_vcpu_pre_run,
>   	.run = vt_vcpu_run,
> -	.handle_exit = vmx_handle_exit,
> +	.handle_exit = vt_handle_exit,
>   	.skip_emulated_instruction = vmx_skip_emulated_instruction,
>   	.update_emulated_instruction = vmx_update_emulated_instruction,
>   	.set_interrupt_shadow = vt_set_interrupt_shadow,
> @@ -446,7 +475,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.set_identity_map_addr = vmx_set_identity_map_addr,
>   	.get_mt_mask = vmx_get_mt_mask,
>   
> -	.get_exit_info = vmx_get_exit_info,
> +	.get_exit_info = vt_get_exit_info,
>   
>   	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
>   
> @@ -460,7 +489,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.load_mmu_pgd = vt_load_mmu_pgd,
>   
>   	.check_intercept = vmx_check_intercept,
> -	.handle_exit_irqoff = vmx_handle_exit_irqoff,
> +	.handle_exit_irqoff = vt_handle_exit_irqoff,
>   
>   	.request_immediate_exit = vt_request_immediate_exit,
>   
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 273898de9f7a..155208a8d768 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -68,6 +68,26 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
>   	return pa;
>   }
>   
> +static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_rcx_read(vcpu);
> +}
> +
> +static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_rdx_read(vcpu);
> +}
> +
> +static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_r8_read(vcpu);
> +}
> +
> +static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_r9_read(vcpu);
> +}
> +
>   static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
>   {
>   	return tdx->tdvpr.added;
> @@ -768,6 +788,25 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
>   	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
>   }
>   
> +void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	u16 exit_reason = tdx->exit_reason.basic;
> +
> +	if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
> +		vmx_handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));
> +	else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
> +		vmx_handle_external_interrupt_irqoff(vcpu,
> +						     tdexit_intr_info(vcpu));
> +}
> +
> +static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> +	vcpu->mmio_needed = 0;
> +	return 0;
> +}
> +
>   void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
>   {
>   	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
> @@ -1042,6 +1081,46 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
>   	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
>   }
>   
> +int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
> +{
> +	union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
> +
> +	if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
> +		if (exit_reason.basic == EXIT_REASON_TRIPLE_FAULT)
> +			return tdx_handle_triple_fault(vcpu);
> +
> +		kvm_pr_unimpl("TD exit 0x%llx, %d\n",
> +			exit_reason.full, exit_reason.basic);
> +		goto unhandled_exit;
> +	}
> +
> +	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
> +
> +	switch (exit_reason.basic) {
> +	default:
> +		break;
> +	}
> +
> +unhandled_exit:
> +	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
> +	vcpu->run->hw.hardware_exit_reason = exit_reason.full;
> +	return 0;
> +}
> +
> +void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> +		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
> +{
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +
> +	*reason = tdx->exit_reason.full;
> +
> +	*info1 = tdexit_exit_qual(vcpu);
> +	*info2 = tdexit_ext_exit_qual(vcpu);
> +
> +	*intr_info = tdexit_intr_info(vcpu);
> +	*error_code = 0;
> +}
> +
>   static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   {
>   	struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 31be5e8a1d5c..c0a34186bc37 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -146,11 +146,16 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
>   void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_put(struct kvm_vcpu *vcpu);
>   void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
> +void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
> +int tdx_handle_exit(struct kvm_vcpu *vcpu,
> +		enum exit_fastpath_completion fastpath);
>   
>   void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
>   void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
>   			   int trig_mode, int vector);
>   void tdx_inject_nmi(struct kvm_vcpu *vcpu);
> +void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
> +		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -177,11 +182,17 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTP
>   static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
> +static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
> +static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
> +		enum exit_fastpath_completion fastpath) { return 0; }
>   
>   static inline void tdx_apicv_post_state_restore(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_deliver_interrupt(
>   	struct kvm_lapic *apic, int delivery_mode, int trig_mode, int vector) {}
>   static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_get_exit_info(
> +	struct kvm_vcpu *vcpu, u32 *reason, u64 *info1, u64 *info2,
> +	u32 *intr_info, u32 *error_code) {}
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 071/104] KVM: TDX: restore debug store when TD exit
  2022-03-04 19:49 ` [RFC PATCH v5 071/104] KVM: TDX: restore debug store when TD exit isaku.yamahata
@ 2022-04-15 14:20   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:20 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Because debug store is clobbered, restore it on TD exit.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/events/intel/ds.c | 1 +
>   arch/x86/kvm/vmx/tdx.c     | 1 +
>   2 files changed, 2 insertions(+)
> 
> diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
> index 376cc3d66094..cdba4227ad3b 100644
> --- a/arch/x86/events/intel/ds.c
> +++ b/arch/x86/events/intel/ds.c
> @@ -2256,3 +2256,4 @@ void perf_restore_debug_store(void)
>   
>   	wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
>   }
> +EXPORT_SYMBOL_GPL(perf_restore_debug_store);
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 3cb2fbd1c12c..37cf7d43435d 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -620,6 +620,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
>   	tdx_vcpu_enter_exit(vcpu, tdx);
>   
>   	tdx_user_return_update_cache();
> +	perf_restore_debug_store();
>   	tdx_restore_host_xsave_state(vcpu);
>   	tdx->host_state_need_restore = true;
>   

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 085/104] KVM: TDX: handle EXIT_REASON_OTHER_SMI
  2022-03-04 19:49 ` [RFC PATCH v5 085/104] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
@ 2022-04-15 14:29   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:29 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> If the control reaches EXIT_REASON_OTHER_SMI, the #SMI has already been
> delivered and handled right after returning from the TDX module to KVM, so
> nothing needs to be done in KVM.  Continue TDX vcpu execution.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/include/uapi/asm/vmx.h | 1 +
>   arch/x86/kvm/vmx/tdx.c          | 7 +++++++
>   2 files changed, 8 insertions(+)
> 
> diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
> index 946d761adbd3..3d9b4598e166 100644
> --- a/arch/x86/include/uapi/asm/vmx.h
> +++ b/arch/x86/include/uapi/asm/vmx.h
> @@ -34,6 +34,7 @@
>   #define EXIT_REASON_TRIPLE_FAULT        2
>   #define EXIT_REASON_INIT_SIGNAL			3
>   #define EXIT_REASON_SIPI_SIGNAL         4
> +#define EXIT_REASON_OTHER_SMI           6
>   
>   #define EXIT_REASON_INTERRUPT_WINDOW    7
>   #define EXIT_REASON_NMI_WINDOW          8
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 155208a8d768..6fbe89bcfe1e 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1097,6 +1097,13 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
>   	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>   
>   	switch (exit_reason.basic) {
> +	case EXIT_REASON_OTHER_SMI:
> +		/*
> +		 * If reach here, it's not a MSMI.

Please expand MSMI.  Otherwise,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

Paolo

> +		 * #SMI is delivered and handled right after SEAMRET, nothing
> +		 * needs to be done in KVM.
> +		 */
> +		return 1;
>   	default:
>   		break;
>   	}


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 087/104] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
  2022-03-04 19:49 ` [RFC PATCH v5 087/104] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
@ 2022-04-15 14:49   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:49 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Because guest TD state is protected, exceptions in guest TDs can't be
> intercepted.  TDX VMM doesn't need to handle exceptions.
> tdx_handle_exit_irqoff() handles NMI and machine check.  Ignore NMI and
> machine check and continue guest TD execution.
> 
> For external interrupts, increment the stats the same as in the VMX case.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 21 +++++++++++++++++++++
>   1 file changed, 21 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 2c35dcad077e..dc83414cb72a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -800,6 +800,23 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>   						     tdexit_intr_info(vcpu));
>   }
>   
> +static int tdx_handle_exception(struct kvm_vcpu *vcpu)
> +{
> +	u32 intr_info = tdexit_intr_info(vcpu);
> +
> +	if (is_nmi(intr_info) || is_machine_check(intr_info))
> +		return 1;
> +
> +	kvm_pr_unimpl("unexpected exception 0x%x\n", intr_info);
> +	return -EFAULT;
> +}
> +
> +static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
> +{
> +	++vcpu->stat.irq_exits;
> +	return 1;
> +}
> +
>   static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
>   {
>   	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
> @@ -1131,6 +1148,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
>   	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
>   
>   	switch (exit_reason.basic) {
> +	case EXIT_REASON_EXCEPTION_NMI:
> +		return tdx_handle_exception(vcpu);
> +	case EXIT_REASON_EXTERNAL_INTERRUPT:
> +		return tdx_handle_external_interrupt(vcpu);
>   	case EXIT_REASON_EPT_VIOLATION:
>   		return tdx_handle_ept_violation(vcpu);
>   	case EXIT_REASON_EPT_MISCONFIG:

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
  2022-03-04 19:49 ` [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers isaku.yamahata
  2022-04-07  4:06   ` Kai Huang
@ 2022-04-15 14:50   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:50 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX defines an ABI for the TDX guest to issue hypercalls via the TDG.VP.VMCALL API.
> To get hypercall arguments and to set return values, add accessors to the guest
> vcpu registers.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 35 +++++++++++++++++++++++++++++++++++
>   1 file changed, 35 insertions(+)

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index dc83414cb72a..8695836ce796 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -88,6 +88,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
>   	return kvm_r9_read(vcpu);
>   }
>   
> +#define BUILD_TDVMCALL_ACCESSORS(param, gpr)					\
> +static __always_inline								\
> +unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)			\
> +{										\
> +	return kvm_##gpr##_read(vcpu);						\
> +}										\
> +static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu,	\
> +						     unsigned long val)		\
> +{										\
> +	kvm_##gpr##_write(vcpu, val);						\
> +}
> +BUILD_TDVMCALL_ACCESSORS(p1, r12);
> +BUILD_TDVMCALL_ACCESSORS(p2, r13);
> +BUILD_TDVMCALL_ACCESSORS(p3, r14);
> +BUILD_TDVMCALL_ACCESSORS(p4, r15);
> +
> +static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_r10_read(vcpu);
> +}
> +static __always_inline unsigned long tdvmcall_exit_reason(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_r11_read(vcpu);
> +}
> +static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
> +						     long val)
> +{
> +	kvm_r10_write(vcpu, val);
> +}
> +static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
> +						    unsigned long val)
> +{
> +	kvm_r11_write(vcpu, val);
> +}
> +
>   static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
>   {
>   	return tdx->tdvpr.added;


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 093/104] KVM: TDX: Handle TDX PV port io hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 093/104] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
@ 2022-04-15 14:59   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 14:59 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up TDX PV port IO hypercall to the KVM backend function.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 55 ++++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 55 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index b0dcc2421649..c900347d0bc7 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -959,6 +959,59 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
>   	return kvm_emulate_halt_noskip(vcpu);
>   }
>   
> +static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
> +{
> +	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +	unsigned long val = 0;
> +	int ret;
> +
> +	WARN_ON(vcpu->arch.pio.count != 1);
> +
> +	ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
> +					 vcpu->arch.pio.port, &val, 1);
> +	WARN_ON(!ret);
> +
> +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +	tdvmcall_set_return_val(vcpu, val);
> +
> +	return 1;
> +}
> +
> +static int tdx_emulate_io(struct kvm_vcpu *vcpu)
> +{
> +	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
> +	unsigned long val = 0;
> +	unsigned int port;
> +	int size, ret;
> +
> +	++vcpu->stat.io_exits;
> +
> +	size = tdvmcall_p1_read(vcpu);
> +	port = tdvmcall_p3_read(vcpu);
> +
> +	if (size != 1 && size != 2 && size != 4) {
> +		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> +		return 1;
> +	}
> +
> +	if (!tdvmcall_p2_read(vcpu)) {
> +		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
> +		if (!ret)
> +			vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
> +		else
> +			tdvmcall_set_return_val(vcpu, val);
> +	} else {
> +		val = tdvmcall_p4_read(vcpu);
> +		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
> +
> +		/* No need for a complete_userspace_io callback. */
> +		vcpu->arch.pio.count = 0;
> +	}
> +	if (ret)
> +		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +	return ret;
> +}
> +
>   static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -974,6 +1027,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   		return tdx_emulate_cpuid(vcpu);
>   	case EXIT_REASON_HLT:
>   		return tdx_emulate_hlt(vcpu);
> +	case EXIT_REASON_IO_INSTRUCTION:
> +		return tdx_emulate_io(vcpu);
>   	default:
>   		break;
>   	}

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 094/104] KVM: TDX: Handle TDX PV MMIO hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 094/104] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
@ 2022-04-15 15:05   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 15:05 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Export kvm_io_bus_read and kvm_mmio tracepoint and wire up TDX PV MMIO
> hypercall to the KVM backend functions.
> 
> kvm_io_bus_read/write() looks up the KVM device emulated in the kernel for the
> given MMIO address and emulates the MMIO access.  As TDX PV MMIO also needs this,
> export kvm_io_bus_read(); kvm_io_bus_write() is already exported.  TDX PV MMIO
> emulates part of the MMIO handling itself.  To add the trace point consistently
> with the rest of x86 KVM, export the kvm_mmio tracepoint.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/x86.c     |   1 +
>   virt/kvm/kvm_main.c    |   2 +
>   3 files changed, 117 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index c900347d0bc7..914af5da4805 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1012,6 +1012,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
>   	return ret;
>   }
>   
> +static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
> +{
> +	unsigned long val = 0;
> +	gpa_t gpa;
> +	int size;
> +
> +	WARN_ON(vcpu->mmio_needed != 1);
> +	vcpu->mmio_needed = 0;
> +
> +	if (!vcpu->mmio_is_write) {
> +		gpa = vcpu->mmio_fragments[0].gpa;
> +		size = vcpu->mmio_fragments[0].len;
> +
> +		memcpy(&val, vcpu->run->mmio.data, size);
> +		tdvmcall_set_return_val(vcpu, val);
> +		trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> +	}
> +	return 1;
> +}
> +
> +static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
> +				 unsigned long val)
> +{
> +	if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> +	    kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> +		return -EOPNOTSUPP;
> +
> +	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
> +	return 0;
> +}
> +
> +static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
> +{
> +	unsigned long val;
> +
> +	if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
> +	    kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
> +		return -EOPNOTSUPP;
> +
> +	tdvmcall_set_return_val(vcpu, val);
> +	trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
> +	return 0;
> +}
> +
> +static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_memory_slot *slot;
> +	int size, write, r;
> +	unsigned long val;
> +	gpa_t gpa;
> +
> +	WARN_ON(vcpu->mmio_needed);
> +
> +	size = tdvmcall_p1_read(vcpu);
> +	write = tdvmcall_p2_read(vcpu);
> +	gpa = tdvmcall_p3_read(vcpu);
> +	val = write ? tdvmcall_p4_read(vcpu) : 0;
> +
> +	if (size != 1 && size != 2 && size != 4 && size != 8)
> +		goto error;
> +	if (write != 0 && write != 1)
> +		goto error;
> +
> +	/* Strip the shared bit, allow MMIO with and without it set. */
> +	gpa = kvm_gpa_unalias(vcpu->kvm, gpa);
> +
> +	if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
> +		goto error;
> +
> +	slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
> +	if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
> +		goto error;
> +
> +	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
> +		trace_kvm_fast_mmio(gpa);
> +		return 1;
> +	}
> +
> +	if (write)
> +		r = tdx_mmio_write(vcpu, gpa, size, val);
> +	else
> +		r = tdx_mmio_read(vcpu, gpa, size);
> +	if (!r) {
> +		/* Kernel completed device emulation. */
> +		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +		return 1;
> +	}
> +
> +	/* Request the device emulation to userspace device model. */
> +	vcpu->mmio_needed = 1;
> +	vcpu->mmio_is_write = write;
> +	vcpu->arch.complete_userspace_io = tdx_complete_mmio;
> +
> +	vcpu->run->mmio.phys_addr = gpa;
> +	vcpu->run->mmio.len = size;
> +	vcpu->run->mmio.is_write = write;
> +	vcpu->run->exit_reason = KVM_EXIT_MMIO;
> +
> +	if (write) {
> +		memcpy(vcpu->run->mmio.data, &val, size);
> +	} else {
> +		vcpu->mmio_fragments[0].gpa = gpa;
> +		vcpu->mmio_fragments[0].len = size;
> +		trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
> +	}
> +	return 0;
> +
> +error:
> +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +	return 1;
> +}
> +
>   static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -1029,6 +1141,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   		return tdx_emulate_hlt(vcpu);
>   	case EXIT_REASON_IO_INSTRUCTION:
>   		return tdx_emulate_io(vcpu);
> +	case EXIT_REASON_EPT_VIOLATION:
> +		return tdx_emulate_mmio(vcpu);
>   	default:
>   		break;
>   	}
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9acb33a17445..483fa46b1be7 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12915,6 +12915,7 @@ bool kvm_arch_dirty_log_supported(struct kvm *kvm)
>   
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
> +EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
>   EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d4e117f5b5b9..6db075db6098 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2259,6 +2259,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
>   
>   	return NULL;
>   }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
>   
>   bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
>   {
> @@ -5126,6 +5127,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
>   	r = __kvm_io_bus_read(vcpu, bus, &range, val);
>   	return r < 0 ? r : 0;
>   }
> +EXPORT_SYMBOL_GPL(kvm_io_bus_read);
>   
>   /* Caller must hold slots_lock. */
>   int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 095/104] KVM: TDX: Implement callbacks for MSR operations for TDX
  2022-03-04 19:49 ` [RFC PATCH v5 095/104] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
@ 2022-04-15 15:07   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 15:07 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Implement set_msr/get_msr/has_emulated_msr methods for TDX to handle
> hypercalls from the guest TD for paravirtualized rdmsr and wrmsr.  The TDX
> module virtualizes MSRs.  For some MSRs, it injects #VE into the guest TD
> upon RDMSR or WRMSR.  The exact list of such MSRs is defined in the spec.
> 
> Upon #VE, the guest TD may execute hypercalls,
> TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
> which are defined in GHCI (Guest-Host Communication Interface) so that the
> host VMM (e.g. KVM) can virtualize the MSRs.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/main.c    | 34 +++++++++++++++++--
>   arch/x86/kvm/vmx/tdx.c     | 68 ++++++++++++++++++++++++++++++++++++++
>   arch/x86/kvm/vmx/x86_ops.h |  6 ++++
>   3 files changed, 105 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 1e65406e3882..a528cfdbce54 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -165,6 +165,34 @@ static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
>   	vmx_handle_exit_irqoff(vcpu);
>   }
>   
> +static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +{
> +	if (unlikely(is_td_vcpu(vcpu)))
> +		return tdx_set_msr(vcpu, msr_info);
> +
> +	return vmx_set_msr(vcpu, msr_info);
> +}
> +
> +/*
> + * The kvm parameter can be NULL (module initialization, or invocation before
> + * VM creation). Be sure to check the kvm parameter before using it.
> + */
> +static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
> +{
> +	if (kvm && is_td(kvm))
> +		return tdx_is_emulated_msr(index, true);
> +
> +	return vmx_has_emulated_msr(kvm, index);
> +}
> +
> +static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> +{
> +	if (unlikely(is_td_vcpu(vcpu)))
> +		return tdx_get_msr(vcpu, msr_info);
> +
> +	return vmx_get_msr(vcpu, msr_info);
> +}
> +
>   static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
>   {
>   	if (is_td_vcpu(vcpu))
> @@ -393,7 +421,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.hardware_enable = vt_hardware_enable,
>   	.hardware_disable = vt_hardware_disable,
>   	.cpu_has_accelerated_tpr = report_flexpriority,
> -	.has_emulated_msr = vmx_has_emulated_msr,
> +	.has_emulated_msr = vt_has_emulated_msr,
>   
>   	.is_vm_type_supported = vt_is_vm_type_supported,
>   	.vm_size = sizeof(struct kvm_vmx),
> @@ -411,8 +439,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   
>   	.update_exception_bitmap = vmx_update_exception_bitmap,
>   	.get_msr_feature = vmx_get_msr_feature,
> -	.get_msr = vmx_get_msr,
> -	.set_msr = vmx_set_msr,
> +	.get_msr = vt_get_msr,
> +	.set_msr = vt_set_msr,
>   	.get_segment_base = vmx_get_segment_base,
>   	.get_segment = vmx_get_segment,
>   	.set_segment = vmx_set_segment,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 914af5da4805..cec2660206bd 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1517,6 +1517,74 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
>   	*error_code = 0;
>   }
>   
> +bool tdx_is_emulated_msr(u32 index, bool write)
> +{
> +	switch (index) {
> +	case MSR_IA32_UCODE_REV:
> +	case MSR_IA32_ARCH_CAPABILITIES:
> +	case MSR_IA32_POWER_CTL:
> +	case MSR_MTRRcap:
> +	case 0x200 ... 0x26f:
> +		/* IA32_MTRR_PHYS{BASE, MASK}, IA32_MTRR_FIX*_* */
> +	case MSR_IA32_CR_PAT:
> +	case MSR_MTRRdefType:
> +	case MSR_IA32_TSC_DEADLINE:
> +	case MSR_IA32_MISC_ENABLE:
> +	case MSR_KVM_STEAL_TIME:
> +	case MSR_KVM_POLL_CONTROL:
> +	case MSR_PLATFORM_INFO:
> +	case MSR_MISC_FEATURES_ENABLES:
> +	case MSR_IA32_MCG_CAP:
> +	case MSR_IA32_MCG_STATUS:
> +	case MSR_IA32_MCG_CTL:
> +	case MSR_IA32_MCG_EXT_CTL:
> +	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_MISC(28) - 1:
> +		/* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC} */
> +		return true;
> +	case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
> +		/*
> +		 * x2APIC registers that are virtualized by the CPU can't be
> +		 * emulated, KVM doesn't have access to the virtual APIC page.
> +		 */
> +		switch (index) {
> +		case X2APIC_MSR(APIC_TASKPRI):
> +		case X2APIC_MSR(APIC_PROCPRI):
> +		case X2APIC_MSR(APIC_EOI):
> +		case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
> +		case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
> +		case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
> +			return false;
> +		default:
> +			return true;
> +		}
> +	case MSR_IA32_APICBASE:
> +	case MSR_EFER:
> +		return !write;
> +	case MSR_IA32_MCx_CTL2(0) ... MSR_IA32_MCx_CTL2(31):
> +		/*
> +		 * 0x280 - 0x29f: The x86 common code doesn't emulate MCx_CTL2.
> +		 * Refer to kvm_{get,set}_msr_common(),
> +		 * kvm_mtrr_{get, set}_msr(), and msr_mtrr_valid().
> +		 */
> +	default:
> +		return false;
> +	}
> +}
> +
> +int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> +{
> +	if (tdx_is_emulated_msr(msr->index, false))
> +		return kvm_get_msr_common(vcpu, msr);
> +	return 1;
> +}
> +
> +int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
> +{
> +	if (tdx_is_emulated_msr(msr->index, true))
> +		return kvm_set_msr_common(vcpu, msr);
> +	return 1;
> +}
> +
>   static int tdx_capabilities(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
>   {
>   	struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index c0a34186bc37..dcaa5806802e 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -156,6 +156,9 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
>   void tdx_inject_nmi(struct kvm_vcpu *vcpu);
>   void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
>   		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
> +bool tdx_is_emulated_msr(u32 index, bool write);
> +int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
> +int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
>   
>   int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>   int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -193,6 +196,9 @@ static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
>   static inline void tdx_get_exit_info(
>   	struct kvm_vcpu *vcpu, u32 *reason, u64 *info1, u64 *info2,
>   	u32 *intr_info, u32 *error_code) {}
> +static inline bool tdx_is_emulated_msr(u32 index, bool write) { return false; }
> +static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
> +static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
>   
>   static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>   static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 096/104] KVM: TDX: Handle TDX PV rdmsr hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 096/104] KVM: TDX: Handle TDX PV rdmsr hypercall isaku.yamahata
@ 2022-04-15 15:08   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 15:08 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up TDX PV rdmsr hypercall to the KVM backend function.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 19 +++++++++++++++++++
>   1 file changed, 19 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index cec2660206bd..dd7aaa28bf3a 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1124,6 +1124,23 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
>   	return 1;
>   }
>   
> +static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
> +{
> +	u32 index = tdvmcall_p1_read(vcpu);
> +	u64 data;
> +
> +	if (kvm_get_msr(vcpu, index, &data)) {
> +		trace_kvm_msr_read_ex(index);
> +		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> +		return 1;
> +	}
> +	trace_kvm_msr_read(index, data);
> +
> +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
> +	tdvmcall_set_return_val(vcpu, data);
> +	return 1;
> +}
> +
>   static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   {
>   	struct vcpu_tdx *tdx = to_tdx(vcpu);
> @@ -1143,6 +1160,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>   		return tdx_emulate_io(vcpu);
>   	case EXIT_REASON_EPT_VIOLATION:
>   		return tdx_emulate_mmio(vcpu);
> +	case EXIT_REASON_MSR_READ:
> +		return tdx_emulate_rdmsr(vcpu);
>   	default:
>   		break;
>   	}

Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>

and feel free to squash with the wrmsr one.

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 098/104] KVM: TDX: Handle TDX PV report fatal error hypercall
  2022-03-04 19:49 ` [RFC PATCH v5 098/104] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
@ 2022-04-15 15:13   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 15:13 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Wire up TDX PV report fatal error hypercall to KVM_SYSTEM_EVENT_CRASH KVM
> exit event.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.c | 11 +++++++++++
>   1 file changed, 11 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 123d4322da99..4d668a6c7dc9 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1157,6 +1157,15 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
>   	return 1;
>   }
>   
> +static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
> +{
> +	/* Exit to userspace device model for teardown. */
> +	vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
> +	vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH;
> +	vcpu->run->system_event.flags = tdvmcall_p1_read(vcpu);
> +	return 0;
> +}

With the latest SEV changes we have a data[] member.  Please set type 
instead to KVM_SYSTEM_EVENT_CRASH|KVM_SYSTEM_EVENT_NDATA_VALID and ndata 
to 1, so that the value of p1 (which will be a0 in the next series) can 
go in data[0].
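A sketch of what that would look like in tdx_report_fatal_error() (following the
wording above; KVM_SYSTEM_EVENT_NDATA_VALID and the ndata/data[] fields come from
the SEV series referenced, so treat this as illustrative until rebased on it):

  static int tdx_report_fatal_error(struct kvm_vcpu *vcpu)
  {
          /* Exit to userspace device model for teardown. */
          vcpu->run->exit_reason = KVM_EXIT_SYSTEM_EVENT;
          vcpu->run->system_event.type = KVM_SYSTEM_EVENT_CRASH |
                                         KVM_SYSTEM_EVENT_NDATA_VALID;
          vcpu->run->system_event.ndata = 1;
          vcpu->run->system_event.data[0] = tdvmcall_p1_read(vcpu);
          return 0;
  }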

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 000/104] KVM TDX basic feature support
  2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
                   ` (104 preceding siblings ...)
  2022-03-07  7:44 ` [RFC PATCH v5 000/104] KVM TDX basic feature support Christoph Hellwig
@ 2022-04-15 15:18 ` Paolo Bonzini
  2022-04-15 17:05   ` Paolo Bonzini
  2022-04-15 21:19   ` Isaku Yamahata
  105 siblings, 2 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 15:18 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Hi.  Now TDX host kernel patch series was posted, I've rebased this patch
> series to it and make it work.
> 
>    https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> 
> Changes from v4:
> - rebased to TDX host kernel patch series.
> - include all the patches to make this patch series working.
> - add [MARKER] patches to mark the patch layer clear.

I think I have reviewed everything except the TDP MMU parts (48, 54-57). 
  I will do those next week, but in the meanwhile feel free to send v6 
if you have it ready.  A lot of the requests have been cosmetic.

If you would like to use something like Trello to track all the changes, 
and submit before you have done all of them, that's fine by me.

Paolo

> Thanks,
> 
> 
> * What's TDX?
> TDX stands for Trust Domain Extensions, which extends Intel Virtual Machines
> Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
> Domain (TD) for confidential computing.
> 
> A TD runs in a CPU mode that is designed to protect the confidentiality of its
> memory contents and its CPU state from any other software, including the hosting
> Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
> 
> We have more detailed explanations below (***).
> We have the high-level design of TDX KVM below (****).
> 
> In this patch series, we use "TD" or "guest TD" to differentiate it from the
> current "VM" (Virtual Machine), which is supported by KVM today.
> 
> 
> * The organization of this patch series
> This patch series is on top of the patches series "TDX host kernel support":
> https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> 
> this patch series is available at
> https://github.com/intel/tdx/releases/tag/kvm-upstream
> The corresponding patches to qemu are available at
> https://github.com/intel/qemu-tdx/commits/tdx-upstream
> 
> The relations of the layers are depicted as follows.
> The arrows below show the order of patch reviews we would like to have.
> 
> The below layers are chosen so that the device model, for example, qemu can
> exercise each layering step by step.  Check if TDX is supported, create TD VM,
> create TD vcpu, allow vcpu running, populate TD guest private memory, and handle
> vcpu exits/hypercalls/interrupts to run TD fully.
> 
>    TDX vcpu
>    interrupt/exits/hypercall<------------\
>          ^                               |
>          |                               |
>    TD finalization                       |
>          ^                               |
>          |                               |
>    TDX EPT violation<------------\       |
>          ^                       |       |
>          |                       |       |
>    TD vcpu enter/exit            |       |
>          ^                       |       |
>          |                       |       |
>    TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
>          ^                       |                       ^
>          |                       |                       |
>    TD VM creation/destruction    \---------------KVM TDP MMU hooks
>          ^                                               ^
>          |                                               |
>    TDX architectural definitions                 KVM TDP refactoring for TDX
>          ^                                               ^
>          |                                               |
>     TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
>     coexistence          support
> 
> 
> The following are explanations of each layer.  Each layer has a dummy commit
> that starts with [MARKER] in the subject.  It is intended to help identify where
> each layer starts.
> 
> TDX host kernel support:
>          https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
>          The guts of system-wide initialization of TDX module.  There is an
>          independent patch series for host x86.  TDX KVM patches call functions
>          this patch series provides to initialize the TDX module.
> 
> TDX, VMX coexistence:
>          Infrastructure to allow TDX to coexist with VMX and trigger the
>          initialization of the TDX module.
>          This layer starts with
>          "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
> TDX architectural definitions:
>          Add TDX architectural definitions and helper functions
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
> TD VM creation/destruction:
>          Guest TD creation/destruction: allocation and releasing of TDX specific vm
>          and vcpu structure.  Create an initial guest memory image with TDX
>          measurement.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
> TD vcpu creation/destruction:
>          Guest TD vcpu creation/destruction: allocation and releasing of TDX specific vm
>          and vcpu structure.  Create an initial guest memory image with TDX
>          measurement.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
> TDX EPT violation:
>          Create an initial guest memory image with TDX measurement.  Handle
>          secure EPT violations to populate guest pages with TDX SEAMCALLs.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
> TD vcpu enter/exit:
>          Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
>          entering into TD.  Restore CPU state after exiting from TD.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
> TD vcpu interrupts/exit/hypercall:
>          Handle various exits/hypercalls and allow interrupts to be injected so
>          that TD vcpu can continue running.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"
> 
> KVM MMU GPA stolen bits:
>          Introduce a framework to handle the stolen (repurposed) bit of the GPA.
>          TDX repurposed a bit of the GPA to indicate shared or private. If it's shared,
>          it's the same as the conventional VMX EPT case.  VMM can access shared
>          guest pages.  If it's private, it's handled by Secure-EPT and the guest
>          page is encrypted.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
> KVM TDP refactoring for TDX:
>          TDX Secure EPT requires different constants, e.g. the initial EPT entry
>          value, etc.  Various refactoring is needed for those differences.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
> KVM TDP MMU hooks:
>          Introduce a framework to the TDP MMU to add hooks in addition to direct EPT
>          access.  TDX added Secure EPT, which is an enhancement to VMX EPT.  Unlike
>          conventional VMX EPT, CPU can't directly read/write Secure EPT. Instead,
>          use TDX SEAMCALLs to operate on Secure EPT.
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
> KVM TDP MMU MapGPA:
>          Introduce framework to handle switching guest pages from private/shared
>          to shared/private.  For a given GPA, a guest page can be assigned to a
>          private GPA or a shared GPA exclusively.  With TDX MapGPA hypercall,
>          guest TD converts GPA assignments from private (or shared) to shared (or
>          private).
>          This layer starts with
>          "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "
> 
> KVM guest private memory: (not shown in the above diagram)
> [PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
> memory: https://lkml.org/lkml/2022/1/18/395
>          Guest private memory requires different memory management in KVM.  The
>          patch proposes a way for it.  Integration with TDX KVM.
> 
> (***)
> * TDX module
> A CPU-attested software module called the "TDX module" is designed to implement
> the TDX architecture, and it is loaded by the UEFI firmware today. It can be
> loaded by the kernel or driver at runtime, but in this patch series we assume
> that the TDX module is already loaded and initialized.
> 
> The TDX module provides two main new logical modes of operation built upon the
> new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
> architecture. TDX root mode is mostly identical to the VMX root operation mode,
> and the TDX functions (described later) are triggered by the new SEAMCALL
> instruction with the desired interface function selected by an input operand
> (leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
> non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
> operation (i.e. guest VM), with changes and restrictions to better assure that
> no other software or hardware has direct visibility of the TD memory and state.
> 
> TDX transitions between TDX root operation and TDX non-root operation include TD
> Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
> TDX root mode.  A TD Exit might be asynchronous, triggered by some external
> event (e.g., external interrupt or SMI) or an exception, or it might be
> synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.
> 
> TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
> of the TDX interface functions as mentioned above, and "TDH" stands for Trust
> Domain Host. Those host-side TDX interface functions are categorized into
> various areas just for better organization, such as SYS (TDX module management),
> MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
> etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.
> 
> TDCS (Trust Domain Control Structure) is the main control structure of a guest
> TD, and encrypted (using the guest TD's ephemeral private key).  At a high
> level, TDCS holds information for controlling TD operation as a whole,
> execution, EPTP, MSR bitmaps, etc. that KVM needs to set up.  Note that MSR
> bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
> same value for all VCPUs of the same TD.
> 
> Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
> TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
> the VCPU state while the VCPU is not running. TDVPS is opaque to software and
> DMA access, accessible only by using the TDX module interface functions (such as
> TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
> such as virtual APIC page, virtualization exception information, etc.
> 
> Several VMX control structures (such as Shared EPT and Posted interrupt
> descriptor) are directly managed and accessed by the host VMM.  These control
> structures are pointed to by fields in the TD VMCS.
> 
> The above means that 1) KVM needs to allocate different data structures for TDs,
> 2) KVM can reuse the existing code for TDs for some operations, 3) it needs to
> define TD-specific handling for others, redirecting such operations to the
> TDX specific callbacks, like "if (is_td_vcpu(vcpu))
> tdx_callback() else vmx_callback();".
> 
> * TD Private Memory
> TD private memory is designed to hold TD private content, encrypted by the CPU
> using the TD ephemeral key. An encryption engine holds a table of encryption
> keys, and an encryption key is selected for each memory transaction based on a
> Host Key Identifier (HKID). By design, the host VMM does not have access to the
> encryption keys.
> 
> In the first generation of MKTME, HKID is "stolen" from the physical address by
> allocating a configurable number of bits from the top of the physical
> address. The HKID space is partitioned into shared HKIDs for legacy MKTME
> accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
> HKID on the host so that MKTME can be opaque or bypassed on the host.
> 
> During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
> as either shared or private, based on the value of a new SHARED bit in the Guest
> Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
> (Extended Page Table) or "Shared EPT" (in this document), which resides in host
> VMM memory. The Shared EPT is directly managed by the host VMM - the same as
> with the current VMX. Since guest TDs usually require I/O and the data exchange
> needs to be done via shared memory, KVM needs to use the current EPT
> functionality even for TDs.
> 
> * Secure EPT and Mirroring using the TDP code
> The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
> pages are encrypted and integrity-protected with the TD's ephemeral private
> key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
> interface functions, and thus conceptually Secure EPT is a subset of EPT (why
> "subset"). Since execution of such interface functions takes much longer time
> than accessing memory directly, in KVM we use the existing TDP code to minor the
> Secure EPT for the TD.
> 
> This way, we can effectively walk Secure EPT without using the TDX interface
> functions.
> 
> * VM life cycle and TDX specific operations
> The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
> example, a TD needs to boot in private memory, and the host software cannot copy
> the initial image to private memory.
> 
> * TSC Virtualization
> The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
> (e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
> by TD configuration, i.e. when the TD is created, not per VCPU.  The current KVM
> owns TSC virtualization for VMs, but the TDX module does for TDs.
> 
> * MCE support for TDs
> The TDX module doesn't allow the VMM to inject MCE.  Instead, a PV way is needed
> for the TD to communicate with the VMM.  For now, KVM silently ignores MCE
> requests by the VMM.  MSRs related to MCE (e.g. MCE bank registers) can be
> naturally emulated by paravirtualizing MSR access.
> 
> For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
> available.
> 
> * Restrictions or future work
> Some features are not included to reduce patch size.  Those features are
> addressed as future independent patch series.
> - large page (2M, 1G)
> - qemu gdb stub
> - guest PMU
> - and more
> 
> * Prerequisites
> It's required to load the TDX module and initialize it.  It's out of the scope
> of this patch series.  Another independent patch for the common x86 code is
> planned.  It defines CONFIG_INTEL_TDX_HOST and this patch series uses
> CONFIG_INTEL_TDX_HOST.  It's assumed that, with CONFIG_INTEL_TDX_HOST=y, the TDX
> module is initialized and the TDX module APIs for the TDX guest life cycle, such
> as tdh.mng.init, are ready for KVM to use.
> 
> Concretely, global initialization, LP (Logical Processor) initialization, global
> configuration, the key configuration, and TDMR and PAMT initialization are done.
> The state of the TDX module is SYS_READY.  Please refer to the TDX module
> specification, the chapter "Intel TDX Module Lifecycle State Machine".
> 
> ** Detecting the TDX module readiness.
> TDX host patch series implements the detection of the TDX module availability
> and its initialization so that KVM can use it.  Also it manages Host KeyID
> (HKID) assigned to guest TD.
> The assumed APIs the TDX host patch series provides are
> - int seamrr_enabled()
>    Check if the required CPU feature (SEAM mode) is available. This only checks CPU
>    feature availability.  At this point, the TDX module may not be ready for KVM
>    to use.
> - int init_tdx(void);
>    Initialization of TDX module so that the TDX module is ready for KVM to use.
> - const struct tdsysinfo_struct *tdx_get_sysinfo(void);
>    Return the system wide information about the TDX module.  NULL if the TDX
>    isn't initialized.
> - u32 tdx_get_global_keyid(void);
>    Return global key id that is used for the TDX module itself.
> - int tdx_keyid_alloc(void);
>    Allocate HKID for guest TD.
> - void tdx_keyid_free(int keyid);
>    Free HKID for guest TD.
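A rough sketch of how the KVM side could consume the assumed API above at module
initialization (the function name and the error handling are illustrative only):

  static int __init kvm_tdx_module_setup(void)
  {
          const struct tdsysinfo_struct *sysinfo;
          int ret;

          /* SEAM mode is a hard requirement; bail early if it is absent. */
          if (!seamrr_enabled())
                  return -ENODEV;

          /* Bring the TDX module to SYS_READY. */
          ret = init_tdx();
          if (ret)
                  return ret;

          /* System-wide information about the initialized TDX module. */
          sysinfo = tdx_get_sysinfo();
          if (!sysinfo)
                  return -EIO;

          pr_info("TDX module ready, global keyid %u\n", tdx_get_global_keyid());
          return 0;
  }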
> 
> (****)
> * TDX KVM high-level design
> - Host key ID management
> Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
> It is assumed that the TDX host patch series implements the necessary functions:
> u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
> void tdx_keyid_free(int keyid).
> 
> - Data structures and VM type
> Because TDX is different from VMX, define its own VM/VCPU structures, struct
> kvm_tdx and struct vcpu_tdx instead of struct kvm_vmx and struct vcpu_vmx.  To
> identify the VM, introduce VM-type to specify which VM type, VMX (default) or
> TDX, is used.
> 
> - VM life cycle and TDX specific operations
> Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
> New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
> parameters, set initial guest memory and measurement.
> 
> The creation of a TDX VM requires five additional operations on top of the
> conventional VM creation.
>    - Get KVM system capability to check if TDX VM type is supported
>    - VM creation (KVM_CREATE_VM)
>    - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
>    - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
>    - VCPU creation (KVM_CREATE_VCPU)
>    - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
>    - New: Initialize guest memory as boot state and extend the measurement with
>      the memory.  KVM_TDX_INIT_MEM_REGION.
>    - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
>      TDX VM contents.
>    - VCPU RUN (KVM_VCPU_RUN)
> 
> - Protected guest state
> Because the guest state (CPU state and guest memory) is protected, the KVM VMM
> can't operate on it: for example, accessing CPU registers, injecting
> exceptions, and accessing guest memory.  Those operations are silently
> ignored, returning zero or the initial reset value when requested via
> KVM API ioctls.
> 
>      VM/VCPU state and callbacks for TDX specific operations.
>      Define tdx specific VM state and VCPU state instead of VMX ones.  Redirect
>      operations to TDX specific callbacks.  "if (tdx) tdx_op() else vmx_op()".
> 
>      Operations on the CPU state
>      silently ignore operations on the guest state.  For example, the write to
>      CPU registers is ignored and the read from CPU registers returns 0.
> 
>      . ignore access to CPU registers except for allowed ones.
>      . TSC: add a check if tsc is immutable and return an error.  Because the KVM
>        implementation updates the internal tsc state and it's difficult to back
>        out those changes.  Instead, skip the logic.
>      . dirty logging: add check if dirty logging is supported.
>      . exceptions/SMI/MCE/SIPI/INIT: silently ignore
> 
>      Note: virtual external interrupt and NMI can be injected into TDX guests.
> 
> - KVM MMU integration
> One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
> the guest physical address is private (the bit is cleared) or shared (the bit is
> set).  The bits are called stolen bits.
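A minimal sketch of this stolen-bit convention (the macro and helper names are
made up for illustration; the series itself handles this in its GPA stolen bits
framework and strips the bit with kvm_gpa_unalias() in the MMIO path above):

  #define TDX_SHARED_BIT_PWL_5    BIT_ULL(51)     /* 5-level EPT */
  #define TDX_SHARED_BIT_PWL_4    BIT_ULL(47)     /* 4-level EPT */

  static inline bool tdx_is_private_gpa(gpa_t gpa, u64 shared_bit)
  {
          /* Shared bit cleared => private GPA; shared bit set => shared GPA. */
          return !(gpa & shared_bit);
  }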
> 
>    - Stolen bits framework
>      systematically tracks which guest physical address, shared or private, is
>      used.
> 
>    - Shared EPT and secure EPT
>      There are two EPTs: Shared EPT (the conventional one) and Secure
>      EPT (the new one). Shared EPT is handled the same as conventional EPT
>      for GPAs with the stolen bit set.  Secure EPT points to private guest pages.  To resolve
>      EPT violation, KVM walks one of two EPTs based on faulted GPA.
>      Because it's costly to access secure EPT during walking EPTs with
>      SEAMCALLs for the private guest physical address, another private
>      EPT is used as a shadow of Secure-EPT with the existing logic at
>      the cost of extra memory.
> 
> The following depicts the relationship.
> 
>                      KVM                             |       TDX module
>                       |                              |           |
>          -------------+----------                    |           |
>          |                      |                    |           |
>          V                      V                    |           |
>       shared GPA           private GPA               |           |
>    CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
>          |                      |                    |           |
>          |                      |                    |           |
>          V                      V                    |           V
>    shared EPT                private EPT<-------mirror----->Secure EPT
>          |                      |                    |           |
>          |                      \--------------------+------\    |
>          |                                           |      |    |
>          V                                           |      V    V
>    shared guest page                                 |    private guest page
>                                                      |
>                                                      |
>                                non-encrypted memory  |    encrypted memory
>                                                      |
> 
>    - Operating on Secure EPT
>      Use the TDX module APIs to operate on the Secure EPT.  To call the TDX
>      APIs while resolving an EPT violation, add hooks for the additional
>      operations and wire them to the TDX backend.
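> 
>      A rough sketch of the shape of such hooks (names, signatures and the
>      exact set of operations are made up here; the real hooks are added by
>      the "KVM TDP MMU hooks" patches):
> 
>        /* Private mappings go through SEAMCALLs instead of direct writes. */
>        struct private_ept_ops {
>                int (*link_private_sp)(gfn_t gfn, int level, void *sept_page);
>                int (*set_private_spte)(gfn_t gfn, int level, kvm_pfn_t pfn);
>                int (*zap_private_spte)(gfn_t gfn, int level);
>        };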
> 
> * References
> 
> [1] TDX specification
>     https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
> [2] Intel Trust Domain Extensions (Intel TDX)
>     https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
> [3] Intel CPU Architectural Extensions Specification
>     https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
> [4] Intel TDX Module 1.0 EAS
>     https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1eas-v0.85.039.pdf
> [5] Intel TDX Loader Interface Specification
>    https://software.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
> [6] Intel TDX Guest-Hypervisor Communication Interface
>     https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
> [7] Intel TDX Virtual Firmware Design Guide
>     https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf
> [8] intel public github
>     kvm TDX branch: https://github.com/intel/tdx/tree/kvm
>     TDX guest branch: https://github.com/intel/tdx/tree/guest
>     qemu TDX https://github.com/intel/qemu-tdx
> [9] TDVF
>      https://github.com/tianocore/edk2-staging/tree/TDVF
> 
> 
> Chao Gao (1):
>    KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
>      wrmsr
> 
> Isaku Yamahata (73):
>    x86/virt/tdx: export platform_has_tdx
>    KVM: TDX: Detect CPU feature on kernel module initialization
>    KVM: x86: Refactor KVM VMX module init/exit functions
>    KVM: TDX: Add placeholders for TDX VM/vcpu structure
>    x86/virt/tdx: Add a helper function to return system wide info about
>      TDX module
>    KVM: TDX: Add a function to initialize TDX module
>    KVM: TDX: Make TDX VM type supported
>    [MARKER] The start of TDX KVM patch series: TDX architectural
>      definitions
>    KVM: TDX: Define TDX architectural definitions
>    KVM: TDX: Add a function for KVM to invoke SEAMCALL
>    KVM: TDX: add a helper function for KVM to issue SEAMCALL
>    KVM: TDX: Add helper functions to print TDX SEAMCALL error
>    [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
>    KVM: TDX: allocate per-package mutex
>    x86/cpu: Add helper functions to allocate/free MKTME keyid
>    KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
>    KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters
>    [MARKER] The start of TDX KVM patch series: TD vcpu
>      creation/destruction
>    KVM: TDX: allocate/free TDX vcpu structure
>    [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits
>    KVM: x86/mmu: introduce config for PRIVATE KVM MMU
>    [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
>      TDX
>    KVM: x86/mmu: Disallow fast page fault on private GPA
>    [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
>    KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value
>    KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map()
>    KVM: x86/mmu: add a private pointer to struct kvm_mmu_page
>    KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
>    KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
>    [MARKER] The start of TDX KVM patch series: TDX EPT violation
>    KVM: TDX: TDP MMU TDX support
>    [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA
>    KVM: x86/mmu: steal software usable bit for EPT to represent shared
>      page
>    KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping
>    KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT
>    KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX
>    KVM: x86/mmu: Focibly use TDP MMU for TDX
>    [MARKER] The start of TDX KVM patch series: TD finalization
>    KVM: TDX: Create initial guest memory
>    KVM: TDX: Finalize VM initialization
>    [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
>    KVM: TDX: Add helper assembly function to TDX vcpu
>    KVM: TDX: Implement TDX vcpu enter/exit path
>    KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
>    KVM: TDX: restore host xsave state when exit from the guest TD
>    KVM: TDX: restore user ret MSRs
>    [MARKER] The start of TDX KVM patch series: TD vcpu
>      exits/interrupts/hypercalls
>    KVM: TDX: complete interrupts after tdexit
>    KVM: TDX: restore debug store when TD exit
>    KVM: TDX: handle vcpu migration over logical processor
>    KVM: TDX: track LP tdx vcpu run and teardown vcpus on descroing the
>      guest TD
>    KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
>      behavior
>    KVM: TDX: Implement interrupt injection
>    KVM: TDX: Implements vcpu request_immediate_exit
>    KVM: TDX: Implement methods to inject NMI
>    KVM: TDX: Add a place holder to handle TDX VM exit
>    KVM: TDX: handle EXIT_REASON_OTHER_SMI
>    KVM: TDX: handle ept violation/misconfig exit
>    KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
>    KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers
>    KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
>    KVM: TDX: Handle TDX PV CPUID hypercall
>    KVM: TDX: Handle TDX PV HLT hypercall
>    KVM: TDX: Handle TDX PV port io hypercall
>    KVM: TDX: Implement callbacks for MSR operations for TDX
>    KVM: TDX: Handle TDX PV rdmsr hypercall
>    KVM: TDX: Handle TDX PV wrmsr hypercall
>    KVM: TDX: Handle TDX PV report fatal error hypercall
>    KVM: TDX: Handle TDX PV map_gpa hypercall
>    KVM: TDX: Silently discard SMI request
>    KVM: TDX: Silently ignore INIT/SIPI
>    Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX)
>    KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
> 
> Kai Huang (1):
>    KVM: x86: Introduce hooks to free VM callback prezap and vm_free
> 
> Rick Edgecombe (1):
>    KVM: x86: Add infrastructure for stolen GPA bits
> 
> Sean Christopherson (26):
>    KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
>    KVM: Enable hardware before doing arch VM initialization
>    KVM: x86: Introduce vm_type to differentiate default VMs from
>      confidential VMs
>    KVM: TDX: Add TDX "architectural" error codes
>    KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
>    KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
>    KVM: Add max_vcpus field in common 'struct kvm'
>    KVM: TDX: create/destroy VM structure
>    KVM: TDX: Do TDX specific vcpu initialization
>    KVM: x86/mmu: Disallow dirty logging for x86 TDX
>    KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault
>    KVM: x86/mmu: Allow non-zero init value for shadow PTE
>    KVM: x86/mmu: Allow per-VM override of the TDP max page level
>    KVM: VMX: Split out guts of EPT violation to common/exposed function
>    KVM: VMX: Move setting of EPT MMU masks to common VT-x code
>    KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
>    KVM: TDX: Add load_mmu_pgd method for TDX
>    KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
>    KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events()
>    KVM: x86: Add option to force LAPIC expiration wait
>    KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
>      argument
>    KVM: VMX: Move NMI/exception handler to common helper
>    KVM: x86: Split core of hypercall emulation to helper function
>    KVM: TDX: Add a placeholder for handler of TDX hypercalls
>      (TDG.VP.VMCALL)
>    KVM: TDX: Handle TDX PV MMIO hypercall
>    KVM: TDX: Add methods to ignore accesses to CPU state
> 
> Xiaoyao Li (1):
>    KVM: TDX: initialize VM with TDX specific parameters
> 
> Yuan Yao (1):
>    KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c
> 
>   Documentation/virt/kvm/api.rst                |   24 +-
>   .../virt/kvm/intel-tdx-layer-status.rst       |   33 +
>   Documentation/virt/kvm/intel-tdx.rst          |  360 +++
>   Documentation/virt/kvm/tdx-tdp-mmu.rst        |  466 ++++
>   arch/arm64/include/asm/kvm_host.h             |    3 -
>   arch/arm64/kvm/arm.c                          |    6 +-
>   arch/arm64/kvm/vgic/vgic-init.c               |    6 +-
>   arch/x86/events/intel/ds.c                    |    1 +
>   arch/x86/include/asm/kvm-x86-ops.h            |    5 +
>   arch/x86/include/asm/kvm_host.h               |   38 +-
>   arch/x86/include/asm/tdx.h                    |   61 +
>   arch/x86/include/asm/vmx.h                    |    2 +
>   arch/x86/include/uapi/asm/kvm.h               |   59 +
>   arch/x86/include/uapi/asm/vmx.h               |    5 +-
>   arch/x86/kvm/Kconfig                          |    4 +
>   arch/x86/kvm/Makefile                         |    3 +-
>   arch/x86/kvm/lapic.c                          |   25 +-
>   arch/x86/kvm/lapic.h                          |    2 +-
>   arch/x86/kvm/mmu.h                            |   65 +-
>   arch/x86/kvm/mmu/mmu.c                        |  232 +-
>   arch/x86/kvm/mmu/mmu_internal.h               |   84 +
>   arch/x86/kvm/mmu/paging_tmpl.h                |   25 +-
>   arch/x86/kvm/mmu/spte.c                       |   48 +-
>   arch/x86/kvm/mmu/spte.h                       |   40 +-
>   arch/x86/kvm/mmu/tdp_iter.h                   |    2 +-
>   arch/x86/kvm/mmu/tdp_mmu.c                    |  642 ++++-
>   arch/x86/kvm/mmu/tdp_mmu.h                    |   16 +-
>   arch/x86/kvm/svm/svm.c                        |   10 +-
>   arch/x86/kvm/vmx/common.h                     |  155 ++
>   arch/x86/kvm/vmx/main.c                       | 1026 ++++++++
>   arch/x86/kvm/vmx/posted_intr.c                |    8 +-
>   arch/x86/kvm/vmx/seamcall.S                   |   55 +
>   arch/x86/kvm/vmx/seamcall.h                   |   25 +
>   arch/x86/kvm/vmx/tdx.c                        | 2337 +++++++++++++++++
>   arch/x86/kvm/vmx/tdx.h                        |  253 ++
>   arch/x86/kvm/vmx/tdx_arch.h                   |  158 ++
>   arch/x86/kvm/vmx/tdx_errno.h                  |   29 +
>   arch/x86/kvm/vmx/tdx_error.c                  |   22 +
>   arch/x86/kvm/vmx/tdx_ops.h                    |  174 ++
>   arch/x86/kvm/vmx/vmenter.S                    |  146 +
>   arch/x86/kvm/vmx/vmx.c                        |  619 ++---
>   arch/x86/kvm/vmx/x86_ops.h                    |  235 ++
>   arch/x86/kvm/x86.c                            |  123 +-
>   arch/x86/kvm/x86.h                            |    8 +
>   arch/x86/virt/tdxcall.S                       |    8 +-
>   arch/x86/virt/vmx/tdx.c                       |   50 +-
>   arch/x86/virt/vmx/tdx.h                       |   52 -
>   include/linux/kvm_host.h                      |    2 +
>   include/uapi/linux/kvm.h                      |    1 +
>   tools/arch/x86/include/uapi/asm/kvm.h         |   59 +
>   tools/include/uapi/linux/kvm.h                |    1 +
>   virt/kvm/kvm_main.c                           |   35 +-
>   52 files changed, 7142 insertions(+), 706 deletions(-)
>   create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst
>   create mode 100644 Documentation/virt/kvm/intel-tdx.rst
>   create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
>   create mode 100644 arch/x86/kvm/vmx/common.h
>   create mode 100644 arch/x86/kvm/vmx/main.c
>   create mode 100644 arch/x86/kvm/vmx/seamcall.S
>   create mode 100644 arch/x86/kvm/vmx/seamcall.h
>   create mode 100644 arch/x86/kvm/vmx/tdx.c
>   create mode 100644 arch/x86/kvm/vmx/tdx.h
>   create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
>   create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
>   create mode 100644 arch/x86/kvm/vmx/tdx_error.c
>   create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
>   create mode 100644 arch/x86/kvm/vmx/x86_ops.h
> 


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 053/104] KVM: x86/mmu: steal software usable bit for EPT to represent shared page
  2022-03-04 19:49 ` [RFC PATCH v5 053/104] KVM: x86/mmu: steal software usable bit for EPT to represent shared page isaku.yamahata
@ 2022-04-15 15:21   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 15:21 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:49, isaku.yamahata@intel.com wrote:
> +/* Masks that used to track for shared GPA **/
> +#define SPTE_PRIVATE_PROHIBIT	BIT_ULL(62)
> +

Please rename this to SPTE_SHARED_MAPPING_MASK, or even just 
SPTE_SHARED_MASK.
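
I.e. something along these lines (same bit, just renamed; illustrative):

	#define SPTE_SHARED_MASK	BIT_ULL(62)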

Paolo

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error
  2022-03-04 19:48 ` [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
  2022-03-13 14:12   ` Paolo Bonzini
@ 2022-04-15 16:54   ` Paolo Bonzini
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 16:54 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add helper functions to print out errors from the TDX module in a uniform
> manner.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/Makefile        |  2 +-
>   arch/x86/kvm/vmx/seamcall.h  |  2 ++
>   arch/x86/kvm/vmx/tdx_error.c | 22 ++++++++++++++++++++++
>   3 files changed, 25 insertions(+), 1 deletion(-)
>   create mode 100644 arch/x86/kvm/vmx/tdx_error.c

When rebasing against tip/x86/tdx,  the new .c file needs to include 
asm/tdx.h.
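
For instance (illustrative only, untested):

	--- a/arch/x86/kvm/vmx/tdx_error.c
	+++ b/arch/x86/kvm/vmx/tdx_error.c
	@@
	 #include <linux/kernel.h>
	 #include <linux/bug.h>
	+#include <asm/tdx.h>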

Paolo

> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index e8f83a7d0dc3..3d6550c73fb5 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -24,7 +24,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
>   kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>   			   vmx/evmcs.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>   kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> -kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/seamcall.o vmx/tdx_error.o
>   
>   kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o svm/sev.o
>   
> diff --git a/arch/x86/kvm/vmx/seamcall.h b/arch/x86/kvm/vmx/seamcall.h
> index 604792e9a59f..5ac419cd8e27 100644
> --- a/arch/x86/kvm/vmx/seamcall.h
> +++ b/arch/x86/kvm/vmx/seamcall.h
> @@ -16,6 +16,8 @@ struct tdx_module_output;
>   u64 kvm_seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9, u64 r10,
>   		struct tdx_module_output *out);
>   
> +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
> +
>   #endif /* !__ASSEMBLY__ */
>   
>   #endif	/* CONFIG_INTEL_TDX_HOST */
> diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
> new file mode 100644
> index 000000000000..61ed855d1188
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx_error.c
> @@ -0,0 +1,22 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/* functions to record TDX SEAMCALL error */
> +
> +#include <linux/kernel.h>
> +#include <linux/bug.h>
> +
> +#include "tdx_ops.h"
> +
> +void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out)
> +{
> +	if (!out) {
> +		pr_err_ratelimited("SEAMCALL[%lld] failed: 0x%llx\n",
> +				op, error_code);
> +		return;
> +	}
> +
> +	pr_err_ratelimited(
> +		"SEAMCALL[%lld] failed: 0x%llx "
> +		"RCX 0x%llx, RDX 0x%llx, R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
> +		op, error_code,
> +		out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
> +}


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 019/104] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers
  2022-03-04 19:48 ` [RFC PATCH v5 019/104] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
@ 2022-04-15 16:55   ` Paolo Bonzini
  0 siblings, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 16:55 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson

On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> Stub in kvm_tdx, vcpu_tdx, and their various accessors.  TDX defines
> SEAMCALL APIs to access TDX control structures corresponding to the VMX
> VMCS.  Introduce helper accessors to hide its SEAMCALL ABI details.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>   arch/x86/kvm/vmx/tdx.h | 101 +++++++++++++++++++++++++++++++++++++++++
>   1 file changed, 101 insertions(+)

When rebasing against tip/x86/tdx,  the new .h file needs to include 
asm/tdx.h.

Paolo

> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 616fbf79b129..e4bb8831764e 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -3,14 +3,29 @@
>   #define __KVM_X86_TDX_H
>   
>   #ifdef CONFIG_INTEL_TDX_HOST
> +
> +#include "tdx_ops.h"
> +
>   int tdx_module_setup(void);
>   
> +struct tdx_td_page {
> +	unsigned long va;
> +	hpa_t pa;
> +	bool added;
> +};
> +
>   struct kvm_tdx {
>   	struct kvm kvm;
> +
> +	struct tdx_td_page tdr;
> +	struct tdx_td_page *tdcs;
>   };
>   
>   struct vcpu_tdx {
>   	struct kvm_vcpu	vcpu;
> +
> +	struct tdx_td_page tdvpr;
> +	struct tdx_td_page *tdvpx;
>   };
>   
>   static inline bool is_td(struct kvm *kvm)
> @@ -32,6 +47,92 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
>   {
>   	return container_of(vcpu, struct vcpu_tdx, vcpu);
>   }
> +
> +static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
> +{
> +	BUILD_BUG_ON_MSG(__builtin_constant_p(field) && (field) & 0x1,
> +			 "Read/Write to TD VMCS *_HIGH fields not supported");
> +
> +	BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
> +
> +	BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
> +			 (((field) & 0x6000) == 0x2000 ||
> +			  ((field) & 0x6000) == 0x6000),
> +			 "Invalid TD VMCS access for 64-bit field");
> +	BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
> +			 ((field) & 0x6000) == 0x4000,
> +			 "Invalid TD VMCS access for 32-bit field");
> +	BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
> +			 ((field) & 0x6000) == 0x0000,
> +			 "Invalid TD VMCS access for 16-bit field");
> +}
> +
> +static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
> +static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
> +
> +#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
> +static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
> +							u32 field)		\
> +{										\
> +	struct tdx_module_output out;						\
> +	u64 err;								\
> +										\
> +	tdvps_##lclass##_check(field, bits);					\
> +	err = tdh_vp_rd(tdx->tdvpr.pa, TDVPS_##uclass(field), &out);		\
> +	if (unlikely(err)) {							\
> +		pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n",		\
> +		       field, err);						\
> +		return 0;							\
> +	}									\
> +	return (u##bits)out.r8;							\
> +}										\
> +static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx,	\
> +						      u32 field, u##bits val)	\
> +{										\
> +	struct tdx_module_output out;						\
> +	u64 err;								\
> +										\
> +	tdvps_##lclass##_check(field, bits);					\
> +	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), val,		\
> +		      GENMASK_ULL(bits - 1, 0), &out);				\
> +	if (unlikely(err))							\
> +		pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n",	\
> +		       field, (u64)val, err);					\
> +}										\
> +static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx,	\
> +						       u32 field, u64 bit)	\
> +{										\
> +	struct tdx_module_output out;						\
> +	u64 err;								\
> +										\
> +	tdvps_##lclass##_check(field, bits);					\
> +	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), bit, bit,		\
> +			&out);							\
> +	if (unlikely(err))							\
> +		pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n",	\
> +		       field, bit, err);					\
> +}										\
> +static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx,	\
> +							 u32 field, u64 bit)	\
> +{										\
> +	struct tdx_module_output out;						\
> +	u64 err;								\
> +										\
> +	tdvps_##lclass##_check(field, bits);					\
> +	err = tdh_vp_wr(tdx->tdvpr.pa, TDVPS_##uclass(field), 0, bit,		\
> +			&out);							\
> +	if (unlikely(err))							\
> +		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n",	\
> +		       field, bit,  err);					\
> +}
> +
> +TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
> +TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
> +TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
> +
> +TDX_BUILD_TDVPS_ACCESSORS(64, STATE_NON_ARCH, state_non_arch);
> +TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
> +
>   #else
>   static inline int tdx_module_setup(void) { return -ENODEV; };
>   


^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 000/104] KVM TDX basic feature support
  2022-04-15 15:18 ` Paolo Bonzini
@ 2022-04-15 17:05   ` Paolo Bonzini
  2022-04-15 21:19   ` Isaku Yamahata
  1 sibling, 0 replies; 310+ messages in thread
From: Paolo Bonzini @ 2022-04-15 17:05 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Jim Mattson, erdemaktas, Connor Kuehl,
	Sean Christopherson, Eduardo Lima

On 4/15/22 17:18, Paolo Bonzini wrote:
> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Hi.  Now TDX host kernel patch series was posted, I've rebased this patch
>> series to it and make it work.
>>
>>    https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
>>
>> Changes from v4:
>> - rebased to TDX host kernel patch series.
>> - include all the patches to make this patch series working.
>> - add [MARKER] patches to mark the patch layer clear.
> 
> I think I have reviewed everything except the TDP MMU parts (48, 54-57). 
>   I will do those next week, but in the meanwhile feel free to send v6 
> if you have it ready.  A lot of the requests have been cosmetic.
> 
> If you would like to use something like Trello to track all the changes, 
> and submit before you have done all of them, that's fine by me.

Also, I have now pushed what (I think) should be all that's needed to 
run TDX guests at branch kvm-tdx-5.17 of 
https://git.kernel.org/pub/scm/virt/kvm/kvm.git.  It's only 
compile-tested for now, but if I missed something please report so that 
it can be used by people doing other work (including QEMU, TDVF and guest).

Thanks,

Paolo



^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 000/104] KVM TDX basic feature support
  2022-04-15 15:18 ` Paolo Bonzini
  2022-04-15 17:05   ` Paolo Bonzini
@ 2022-04-15 21:19   ` Isaku Yamahata
  1 sibling, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-15 21:19 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Jim Mattson,
	erdemaktas, Connor Kuehl, Sean Christopherson

On Fri, Apr 15, 2022 at 05:18:42PM +0200,
Paolo Bonzini <pbonzini@redhat.com> wrote:

> On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Hi.  Now TDX host kernel patch series was posted, I've rebased this patch
> > series to it and make it work.
> > 
> >    https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
> > 
> > Changes from v4:
> > - rebased to TDX host kernel patch series.
> > - include all the patches to make this patch series working.
> > - add [MARKER] patches to mark the patch layer clear.
> 
> I think I have reviewed everything except the TDP MMU parts (48, 54-57).  I
> will do those next week, but in the meanwhile feel free to send v6 if you
> have it ready.  A lot of the requests have been cosmetic.

Thank you so much. I'm updating patches now.


> If you would like to use something like Trello to track all the changes, and
> submit before you have done all of them, that's fine by me.

Sure. I've created a public Trello board.
If you want to edit it, please let me know and I'll add you as a project member.

https://trello.com/kvmtdxreview

thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis
  2022-04-08 18:46     ` Isaku Yamahata
@ 2022-04-19 19:55       ` Sean Christopherson
  0 siblings, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-19 19:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, isaku.yamahata, kvm, linux-kernel, Jim Mattson,
	erdemaktas, Connor Kuehl

Sorry, missed my name...

On Fri, Apr 08, 2022, Isaku Yamahata wrote:
> On Tue, Apr 05, 2022 at 05:25:34PM +0200,
> Paolo Bonzini <pbonzini@redhat.com> wrote:
> 
> > On 3/4/22 20:48, isaku.yamahata@intel.com wrote:
> > > +	if (enable_ept) {
> > > +		const u64 init_value = enable_tdx ? VMX_EPT_SUPPRESS_VE_BIT : 0ull;
> > >   		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
> > > -				      cpu_has_vmx_ept_execute_only());
> > > +				      cpu_has_vmx_ept_execute_only(), init_value);
> > > +		kvm_mmu_set_spte_init_value(init_value);
> > > +	}
> > 
> > I think kvm-intel.ko should use VMX_EPT_SUPPRESS_VE_BIT unconditionally as
> > the init value.  The bit is ignored anyway if the "EPT-violation #VE"
> > execution control is 0.
> > Otherwise looks good, but I have a couple more crazy ideas:
> > 
> > 1) there could even be a test mode where KVM enables the execution control,
> > traps #VE in the exception bitmap, and shouts loudly if it gets a #VE.  That
> > might avoid hard-to-find bugs due to forgetting about
> > VMX_EPT_SUPPRESS_VE_BIT.
> > 
> > 2) or even, perhaps the init_value for the TDP MMU could set bit 63
> > _unconditionally_, because KVM always sets the NX bit on AMD hardware.

Heh, took me a minute to realize you mean EFER.NX.  To clarify:

KVM requires NX support in hardware

	if (!boot_cpu_has(X86_FEATURE_NX)) {
		pr_err_ratelimited("NX (Execute Disable) not supported\n");
		return -EOPNOTSUPP;
	}

and 64-bit or PAE paging to enable NPT

	if (!IS_ENABLED(CONFIG_X86_64) && !IS_ENABLED(CONFIG_X86_PAE))
		npt_enabled = false;

and the _kernel_ forces EFER.NX=1 for 64-bit and PAE kernels.

But whether or not EFER.NX is enabled is irrelevant, it's only the initial value,
i.e. the SPTE is guaranteed to be !PRESENT, so hardware will never generate a
reserved bit #PF.

> > That would remove the whole infrastructure to keep shadow_init_value,
> > because it would be constant 0 in mmu.c and constant BIT(63) in tdp_mmu.c.
> > 
> > Sean, what do you think?

I like #2, though I suspect we'll still want shadow_init_value so that the MMU
caches can be shared without creating a mess.   But I still like keeping that
detail in the MMUs and out of the vendor modules, even though there's obviously
a hard dependency on the MMU doing the right thing.
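
To make option #2 concrete, here is a minimal sketch of the idea; the
constant and helper names below are illustrative only (not the eventual
patch) and assume the usual kernel headers for u64/BIT_ULL:

	/*
	 * Hypothetical: the TDP MMU's "empty" SPTE always carries bit 63.
	 * On Intel that is the EPT "Suppress #VE" bit; for legacy/AMD paging
	 * it is the NX bit, but because the SPTE is !PRESENT the hardware
	 * never consumes it, so no reserved-bit #PF is possible regardless
	 * of EFER.NX.
	 */
	#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)

	/* tdp_mmu.c: constant non-zero value for empty SPTEs. */
	static inline u64 tdp_mmu_empty_spte(void)
	{
		return SHADOW_NONPRESENT_VALUE;
	}

	/* mmu.c (legacy shadow MMU): empty SPTEs stay all-zero. */
	static inline u64 shadow_mmu_empty_spte(void)
	{
		return 0;
	}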

> Then, I'll start with 1) because it's a bit hard for me to test 2) with real AMD
> hardware.  If someone is willing to test 2), I'm quite fine to implement 2)
> on top of 1).  2) isn't exclusive with 1).

I can test #2.

Tangentially related, the kvm_gfn_stolen_mask() exception to MMIO SPTEs is unnecessarily
convoluted and gross.  That's partly my fault as I should have just updated
enable_mmio_caching when hardware can't support it instead of using shadow_mmio_value
to convey that information.  I'll submit a patch to fix that, then is_mmio_spte()
can be left alone in the TDX series.
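
Roughly, that direction looks like the sketch below; the helper and the
check are hypothetical and only illustrate the idea, not the actual
spte.c change:

	static bool enable_mmio_caching = true;
	static u64 shadow_mmio_value;
	static u64 shadow_mmio_mask;

	static void set_mmio_spte_value(u64 mmio_value, u64 mmio_mask)
	{
		/*
		 * If the requested value cannot actually be generated by
		 * hardware (it has bits outside the usable mask), disable
		 * MMIO caching explicitly rather than smuggling "disabled"
		 * through shadow_mmio_value == 0.
		 */
		if (WARN_ON(mmio_value & ~mmio_mask))
			enable_mmio_caching = false;

		shadow_mmio_value = enable_mmio_caching ? mmio_value : 0;
		shadow_mmio_mask  = mmio_mask;
	}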

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2022-04-07  0:50   ` Kai Huang
@ 2022-04-25 19:10     ` Sagi Shahar
  2022-04-26 21:12       ` Isaku Yamahata
  0 siblings, 1 reply; 310+ messages in thread
From: Sagi Shahar @ 2022-04-25 19:10 UTC (permalink / raw)
  To: Kai Huang
  Cc: Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Jim Mattson, Erdem Aktas, Connor Kuehl,
	Sean Christopherson

On Wed, Apr 6, 2022 at 5:50 PM Kai Huang <kai.huang@intel.com> wrote:
>
> On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> >
> > Use private EPT to mirror Secure EPT, and On the change of the private EPT
>
> "on the change".
>
> > entry, invoke kvm_x86_ops hook in __handle_changed_spte() to propagate the
> > change to Secure EPT.
> >
> > On EPT violation, determine which EPT to use, private or shared, based on
> > faulted GPA.  When allocating an EPT page, record it (private or shared) in
> > the page role.  The private is passed down to the function as an argument
> > as necessary.  When the private EPT entry is changed, call the hook.
> >
> > When zapping EPT, the EPT entry is frozen with the special EPT value that
> > clears the present bit. After the TLB shootdown, the entry is set to the
> > eventual value.  On populating the EPT entry, atomically set the entry.
> >
> > For TDX, TDX SEAMCALL to update Secure EPT in addition to direct access to
> > the private EPT entry.  For the zapping case, freeing the EPT entry
> > works. It can call TDX SEAMCALL in addition to TLB shootdown.  For
> > populating the private EPT entry, there can be a race condition without
> > further protection
> >
> >   vcpu 1: populating 2M private EPT entry
> >   vcpu 2: populating 4K private EPT entry
> >   vcpu 2: TDX SEAMCALL to update 4K secure EPT => error
> >   vcpu 1: TDX SEAMCALL to update 4M secure EPT
>
> 2M secure EPT
>
> >
> > To avoid the race, the frozen EPT entry is utilized.  Instead of atomic
> > update of the private EPT entry, freeze the entry, call the hook that
> > invokes TDX SEAMCALL, set the entry to the final value (unfreeze).
> >
> > Support 4K page only at this stage.  2M page support can be done in future
> > patches.
>
> Btw, I'd like to see this as a patch to handle schematic of private mapping in
> MMU code, and put this patch close to other infrastructure patches such as
> "stolen GPA infrastructure" and "private_sp", so people can get a clear view on
> what does the schematic of "private mapping" meaning and how to handle it before
> jumping to TDX details.
>
> In this patch, you have a "mirrored private page table", this is an important
> concept and please explain it in commit message.  Only with this, your above
> race condition case makes sense.
>
> Also, you mentioned you will record private or shared in page role.  It seems I
> don't see it.  Anyway, you also have SPTE_PRIVATE_PROHIBIT.  It's not entirely
> clearly to me why you need it, or what's advantage between using page.role vs
> it.  Please put those patches together so we can have a clear understanding on
> your decisions.
>
> IMHO, in general please reorganize the MMU related patches into below way:
>
> 1) Patches to introduce schematic of private mapping, and how to handle.  You
> can use TDX as background to explain, but IMHO please use common name such as
> "private page table" instead of "Secure EPT".  Those patches can include:
>   - concept of protected VM, private/shared mapping (current "GPA stolen bits
>     infrastructure" patch).
>   - shadow_nonpresent_mask to support setting "Suppress #VE" bit.
>   - per-VM MMIO value/mask
>   - 'kvm_mmu_page->private_sp'
>   - mirrored private page table (this patch).
>   - SPTE_PRIVATE_PROHIBIT (vs role.private, etc)
>   - Anything else?
>
> 2) After you clearly explain the schematic of private mapping, you can declare
> some features cannot work with private mapping, such as log-dirty, fast page
> fault, so that you can disable them in separate patches
>
> 3) TDX specific handling.
>
> Order of 2) and 3) doesn't matter, of course.
>
> And when introduce a patch, please make sure it doesn't break existing things,
> even logically.  While I understand we want to split out independent small logic
> as small patches (as preparation to support main patches), please don't break
> anything in one patch.
>
> My 2cents above.
>
> >
> > Co-developed-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Kai Huang <kai.huang@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/kvm-x86-ops.h |   2 +
> >  arch/x86/include/asm/kvm_host.h    |   8 +
> >  arch/x86/kvm/mmu/mmu.c             |  31 +++-
> >  arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
> >  arch/x86/kvm/mmu/tdp_mmu.c         | 226 +++++++++++++++++++++++------
> >  arch/x86/kvm/mmu/tdp_mmu.h         |  13 +-
> >  virt/kvm/kvm_main.c                |   1 +
> >  7 files changed, 232 insertions(+), 51 deletions(-)
> >
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index ef48dcc98cfc..7e27b73d839f 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -91,6 +91,8 @@ KVM_X86_OP(set_tss_addr)
> >  KVM_X86_OP(set_identity_map_addr)
> >  KVM_X86_OP(get_mt_mask)
> >  KVM_X86_OP(load_mmu_pgd)
> > +KVM_X86_OP(free_private_sp)
> > +KVM_X86_OP(handle_changed_private_spte)
> >  KVM_X86_OP_NULL(has_wbinvd_exit)
> >  KVM_X86_OP(get_l2_tsc_offset)
> >  KVM_X86_OP(get_l2_tsc_multiplier)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 0c8cc7d73371..8406f8b5ab74 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -433,6 +433,7 @@ struct kvm_mmu {
> >                        struct kvm_mmu_page *sp);
> >       void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
> >       hpa_t root_hpa;
> > +     hpa_t private_root_hpa;
> >       gpa_t root_pgd;
> >       union kvm_mmu_role mmu_role;
> >       u8 root_level;
> > @@ -1433,6 +1434,13 @@ struct kvm_x86_ops {
> >       void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >                            int root_level);
> >
> > +     int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +                            void *private_sp);
> > +     void (*handle_changed_private_spte)(
> > +             struct kvm *kvm, gfn_t gfn, enum pg_level level,
> > +             kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
> > +             kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page);
> > +
> >       bool (*has_wbinvd_exit)(void);
> >
> >       u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 8def8b97978f..0ec9548ff4dd 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -3422,6 +3422,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >  {
> >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> >       u8 shadow_root_level = mmu->shadow_root_level;
> > +     gfn_t gfn_stolen = kvm_gfn_stolen_mask(vcpu->kvm);
> >       hpa_t root;
> >       unsigned i;
> >       int r;
> > @@ -3432,7 +3433,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
> >               goto out_unlock;
> >
> >       if (is_tdp_mmu_enabled(vcpu->kvm)) {
> > -             root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> > +             if (gfn_stolen && !VALID_PAGE(mmu->private_root_hpa)) {
> > +                     root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> > +                     mmu->private_root_hpa = root;
> > +             }
> > +             root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
> >               mmu->root_hpa = root;
> >       } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
> >               root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> > @@ -5596,6 +5601,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
> >       int i;
> >
> >       mmu->root_hpa = INVALID_PAGE;
> > +     mmu->private_root_hpa = INVALID_PAGE;
> >       mmu->root_pgd = 0;
> >       for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
> >               mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> > @@ -5772,6 +5778,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
> >
> >       write_unlock(&kvm->mmu_lock);
> >
> > +     /*
> > +      * For now the private root is never invalidated while the VM is running,
> > +      * so this can only happen for shared roots.
> > +      */
> >       if (is_tdp_mmu_enabled(kvm)) {
> >               read_lock(&kvm->mmu_lock);
> >               kvm_tdp_mmu_zap_invalidated_roots(kvm);
> > @@ -5871,7 +5881,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
> >       if (is_tdp_mmu_enabled(kvm)) {
> >               for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> >                       flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> > -                                                       gfn_end, flush);
> > +                                                       gfn_end, flush,
> > +                                                       false);
> >       }
> >
> >       if (flush)
> > @@ -5904,6 +5915,11 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
> >               write_unlock(&kvm->mmu_lock);
> >       }
> >
> > +     /*
> > +      * For now this can only happen for non-TD VM, because TD private
> > +      * mapping doesn't support write protection.  kvm_tdp_mmu_wrprot_slot()
> > +      * will give a WARN() if it hits for TD.
> > +      */
> >       if (is_tdp_mmu_enabled(kvm)) {
> >               read_lock(&kvm->mmu_lock);
> >               flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
> > @@ -5952,6 +5968,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >               sp = sptep_to_sp(sptep);
> >               pfn = spte_to_pfn(*sptep);
> >
> > +             /* Private page dirty logging is not supported. */
> > +             KVM_BUG_ON(is_private_spte(sptep), kvm);
> > +
> >               /*
> >                * We cannot do huge page mapping for indirect shadow pages,
> >                * which are found on the last rmap (level = 1) when not using
> > @@ -5992,6 +6011,11 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
> >               write_unlock(&kvm->mmu_lock);
> >       }
> >
> > +     /*
> > +      * This should only be reachable in case of log-dirty, which TD private
> > +      * mapping doesn't support so far.  kvm_tdp_mmu_zap_collapsible_sptes()
> > +      * internally gives a WARN() when it hits.
> > +      */
> >       if (is_tdp_mmu_enabled(kvm)) {
> >               read_lock(&kvm->mmu_lock);
> >               kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
> > @@ -6266,6 +6290,9 @@ int kvm_mmu_module_init(void)
> >  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
> >  {
> >       kvm_mmu_unload(vcpu);
> > +     if (is_tdp_mmu_enabled(vcpu->kvm))
> > +             mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> > +                             NULL);
> >       free_mmu_pages(&vcpu->arch.root_mmu);
> >       free_mmu_pages(&vcpu->arch.guest_mmu);
> >       mmu_free_memory_caches(vcpu);
> > diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> > index e19cabbcb65c..ad22d5b691c5 100644
> > --- a/arch/x86/kvm/mmu/tdp_iter.h
> > +++ b/arch/x86/kvm/mmu/tdp_iter.h
> > @@ -28,7 +28,7 @@ struct tdp_iter {
> >       tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
> >       /* A pointer to the current SPTE */
> >       tdp_ptep_t sptep;
> > -     /* The lowest GFN mapped by the current SPTE */
> > +     /* The lowest GFN (stolen bits included) mapped by the current SPTE */
> >       gfn_t gfn;
> >       /* The level of the root page given to the iterator */
> >       int root_level;
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index a68f3a22836b..acba2590b51e 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -53,6 +53,11 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
> >       rcu_barrier();
> >  }
> >
> > +static gfn_t tdp_iter_gfn_unalias(struct kvm *kvm, struct tdp_iter *iter)
> > +{
> > +     return kvm_gfn_unalias(kvm, iter->gfn);
> > +}
> > +
> >  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >                         gfn_t start, gfn_t end, bool can_yield, bool flush,
> >                         bool shared);
> > @@ -175,10 +180,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
> >  }
> >
> >  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> > -                                            int level)
> > +                                            int level, bool private)
> >  {
> >       struct kvm_mmu_page *sp;
> >
> > +     WARN_ON(level != vcpu->arch.mmu->shadow_root_level &&
> > +             kvm_is_private_gfn(vcpu->kvm, gfn) != private);
> > +     WARN_ON(level == vcpu->arch.mmu->shadow_root_level && gfn != 0);
> >       sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
> >       sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
> >       set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> > @@ -186,14 +194,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> >       sp->role.word = page_role_for_level(vcpu, level).word;
> >       sp->gfn = gfn;
> >       sp->tdp_mmu_page = true;
> > -     kvm_mmu_init_private_sp(sp);
> > +
> > +     if (private)
> > +             kvm_mmu_alloc_private_sp(vcpu, sp);
> > +     else
> > +             kvm_mmu_init_private_sp(sp);
> >
> >       trace_kvm_mmu_get_page(sp, true);
> >
> >       return sp;
> >  }
> >
> > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> > +static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
> > +                                                   bool private)
> >  {
> >       union kvm_mmu_page_role role;
> >       struct kvm *kvm = vcpu->kvm;
> > @@ -206,11 +219,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >       /* Check for an existing root before allocating a new one. */
> >       for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
> >               if (root->role.word == role.word &&
> > +                 is_private_sp(root) == private &&
> >                   kvm_tdp_mmu_get_root(kvm, root))
> >                       goto out;
> >       }
> >
> > -     root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> > +     root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level,
> > +                     private);
> >       refcount_set(&root->tdp_mmu_root_count, 1);
> >
> >       spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> > @@ -218,12 +233,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> >       spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
> >
> >  out:
> > -     return __pa(root->spt);
> > +     return root;
> > +}
> > +
> > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
> > +{
> > +     return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
> >  }
> >
> >  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > -                             u64 old_spte, u64 new_spte, int level,
> > -                             bool shared);
> > +                             bool private_spte, u64 old_spte,
> > +                             u64 new_spte, int level, bool shared);
> >
> >  static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
> >  {
> > @@ -321,6 +341,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> >       int level = sp->role.level;
> >       gfn_t base_gfn = sp->gfn;
> >       int i;
> > +     bool private_sp = is_private_sp(sp);
> >
> >       trace_kvm_mmu_prepare_zap_page(sp);
> >
> > @@ -370,7 +391,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> >                        */
> >                       WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
> >               }
> > -             handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> > +             handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, private_sp,
> >                                   old_child_spte, SHADOW_REMOVED_SPTE, level,
> >                                   shared);
> >       }
> > @@ -378,6 +399,17 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> >       kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
> >                                          KVM_PAGES_PER_HPAGE(level + 1));
> >
> > +     if (private_sp &&
> > +             WARN_ON(static_call(kvm_x86_free_private_sp)(
> > +                             kvm, sp->gfn, sp->role.level,
> > +                             kvm_mmu_private_sp(sp)))) {
> > +             /*
> > +              * Failed to unlink Secure EPT page and there is nothing to do
> > +              * further.  Intentionally leak the page to prevent the kernel
> > +              * from accessing the encrypted page.
> > +              */
> > +             kvm_mmu_init_private_sp(sp);
> > +     }
> >       call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> >  }
> >
> > @@ -386,6 +418,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> >   * @kvm: kvm instance
> >   * @as_id: the address space of the paging structure the SPTE was a part of
> >   * @gfn: the base GFN that was mapped by the SPTE
> > + * @private_spte: the SPTE is private or not
> >   * @old_spte: The value of the SPTE before the change
> >   * @new_spte: The value of the SPTE after the change
> >   * @level: the level of the PT the SPTE is part of in the paging structure
> > @@ -397,14 +430,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
> >   * This function must be called for all TDP SPTE modifications.
> >   */
> >  static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > -                               u64 old_spte, u64 new_spte, int level,
> > -                               bool shared)
> > +                               bool private_spte, u64 old_spte,
> > +                               u64 new_spte, int level, bool shared)
> >  {
> >       bool was_present = is_shadow_present_pte(old_spte);
> >       bool is_present = is_shadow_present_pte(new_spte);
> >       bool was_leaf = was_present && is_last_spte(old_spte, level);
> >       bool is_leaf = is_present && is_last_spte(new_spte, level);
> > -     bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> > +     kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> > +     kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> > +     bool pfn_changed = old_pfn != new_pfn;
> >
> >       WARN_ON(level > PT64_ROOT_MAX_LEVEL);
> >       WARN_ON(level < PG_LEVEL_4K);
> > @@ -468,23 +503,49 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> >
> >       if (was_leaf && is_dirty_spte(old_spte) &&
> >           (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> > -             kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > +             kvm_set_pfn_dirty(old_pfn);
> > +
> > +     /*
> > +      * Special handling for the private mapping.  We are either
> > +      * setting up new mapping at middle level page table, or leaf,
> > +      * or tearing down existing mapping.
> > +      */
> > +     if (private_spte) {
> > +             void *sept_page = NULL;
> > +
> > +             if (is_present && !is_leaf) {
> > +                     struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(new_pfn));
> > +
> > +                     sept_page = kvm_mmu_private_sp(sp);
> > +                     WARN_ON(!sept_page);
> > +                     WARN_ON(sp->role.level + 1 != level);
> > +                     WARN_ON(sp->gfn != gfn);
> > +             }
> > +
> > +             static_call(kvm_x86_handle_changed_private_spte)(
> > +                     kvm, gfn, level,
> > +                     old_pfn, was_present, was_leaf,
> > +                     new_pfn, is_present, is_leaf, sept_page);
> > +     }
> >
> >       /*
> >        * Recursively handle child PTs if the change removed a subtree from
> >        * the paging structure.
> >        */
> > -     if (was_present && !was_leaf && (pfn_changed || !is_present))
> > +     if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> > +             WARN_ON(private_spte !=
> > +                     is_private_spte(spte_to_child_pt(old_spte, level)));
> >               handle_removed_tdp_mmu_page(kvm,
> >                               spte_to_child_pt(old_spte, level), shared);
> > +     }
> >  }
> >
> >  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> > -                             u64 old_spte, u64 new_spte, int level,
> > -                             bool shared)
> > +                             bool private_spte, u64 old_spte, u64 new_spte,
> > +                             int level, bool shared)
> >  {
> > -     __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> > -                           shared);
> > +     __handle_changed_spte(kvm, as_id, gfn, private_spte,
> > +                     old_spte, new_spte, level, shared);
> >       handle_changed_spte_acc_track(old_spte, new_spte, level);
> >       handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
> >                                     new_spte, level);
> > @@ -505,6 +566,10 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >                                          struct tdp_iter *iter,
> >                                          u64 new_spte)
> >  {
> > +     bool freeze_spte = is_private_spte(iter->sptep) &&
> > +             !is_removed_spte(new_spte);
> > +     u64 tmp_spte = freeze_spte ? SHADOW_REMOVED_SPTE : new_spte;
> > +
> >       WARN_ON_ONCE(iter->yielded);
> >
> >       lockdep_assert_held_read(&kvm->mmu_lock);
> > @@ -521,13 +586,16 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
> >        * does not hold the mmu_lock.
> >        */
> >       if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> > -                   new_spte) != iter->old_spte)
> > +                   tmp_spte) != iter->old_spte)
> >               return false;
> >
> > -     __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > -                           new_spte, iter->level, true);
> > +     __handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
> > +                           iter->old_spte, new_spte, iter->level, true);
> >       handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
> >
> > +     if (freeze_spte)
> > +             WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
> > +
> >       return true;
> >  }
> >
> > @@ -603,8 +671,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
> >
> >       WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
> >
> > -     __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> > -                           new_spte, iter->level, false);
> > +     __handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
> > +                           iter->old_spte, new_spte, iter->level, false);
> >       if (record_acc_track)
> >               handle_changed_spte_acc_track(iter->old_spte, new_spte,
> >                                             iter->level);
> > @@ -644,9 +712,10 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
> >                       continue;                                       \
> >               else
> >
> > -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)              \
> > -     for_each_tdp_pte(_iter, __va(_mmu->root_hpa),           \
> > -                      _mmu->shadow_root_level, _start, _end)
> > +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)            \
> > +     for_each_tdp_pte(_iter,                                                 \
> > +             __va((_private) ? _mmu->private_root_hpa : _mmu->root_hpa),     \
> > +             _mmu->shadow_root_level, _start, _end)
> >
> >  /*
> >   * Yield if the MMU lock is contended or this thread needs to return control
> > @@ -731,6 +800,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >        */
> >       end = min(end, max_gfn_host);
> >
> > +     /*
> > +      * Extend [start, end) to include GFN shared bit when TDX is enabled,
> > +      * and for shared mapping range.
> > +      */
> > +     if (is_private_sp(root)) {
> > +             start = kvm_gfn_private(kvm, start);
> > +             end = kvm_gfn_private(kvm, end);
> > +     } else {
> > +             start = kvm_gfn_shared(kvm, start);
> > +             end = kvm_gfn_shared(kvm, end);
> > +     }
> > +
> >       kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> >
> >       rcu_read_lock();
> > @@ -783,13 +864,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
> >   * MMU lock.
> >   */
> >  bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> > -                              gfn_t end, bool can_yield, bool flush)
> > +                              gfn_t end, bool can_yield, bool flush,
> > +                              bool zap_private)
> >  {
> >       struct kvm_mmu_page *root;
> >
> > -     for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
> > +     for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) {
> > +             /* Skip private page table if not requested */
> > +             if (!zap_private && is_private_sp(root))
> > +                     continue;
> >               flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
> >                                     false);
> > +     }
> >
> >       return flush;
> >  }
> > @@ -800,7 +886,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
> >       int i;
> >
> >       for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> > -             flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
> > +             flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush, true);
> >
> >       if (flush)
> >               kvm_flush_remote_tlbs(kvm);
> > @@ -851,6 +937,13 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> >       while (root) {
> >               next_root = next_invalidated_root(kvm, root);
> >
> > +             /*
> > +              * Private table is only torn down when VM is destroyed.
> > +              * It is a bug to zap private table here.
> > +              */
> > +             if (WARN_ON(is_private_sp(root)))
> > +                     goto out;
> > +
> >               rcu_read_unlock();
> >
> >               flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
> > @@ -865,7 +958,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
> >
> >               rcu_read_lock();
> >       }
> > -
> > +out:
> >       rcu_read_unlock();
> >
> >       if (flush)
> > @@ -897,9 +990,16 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
> >       struct kvm_mmu_page *root;
> >
> >       lockdep_assert_held_write(&kvm->mmu_lock);
> > -     list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
> > +     list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> > +             /*
> > +              * Skip private root since private page table
> > +              * is only torn down when VM is destroyed.
> > +              */
> > +             if (is_private_sp(root))
> > +                     continue;
> >               if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
> >                       root->role.invalid = true;
> > +     }
> >  }
> >
> >  /*
> > @@ -914,14 +1014,23 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >       u64 new_spte;
> >       int ret = RET_PF_FIXED;
> >       bool wrprot = false;
> > +     unsigned long pte_access = ACC_ALL;
> >
> >       WARN_ON(sp->role.level != fault->goal_level);
> > +
> > +     /* TDX shared GPAs are not executable, enforce this for the SDV. */
> > +     if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))

This should be:
if (kvm_gfn_stolen_mask(vcpu->kvm) && !kvm_is_private_gfn(vcpu->kvm, iter->gfn))

Otherwise, when TDX is disabled, all EPTs are going to be considered
as shared non-executable EPTs.
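
In other words, with the suggested guard applied the hunk would read
roughly as follows (kvm_gfn_stolen_mask() and kvm_is_private_gfn() are the
helpers introduced earlier in this series):

	/* TDX shared GPAs are not executable; enforce this for the SDV. */
	if (kvm_gfn_stolen_mask(vcpu->kvm) &&
	    !kvm_is_private_gfn(vcpu->kvm, iter->gfn))
		pte_access &= ~ACC_EXEC_MASK;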

>
> > +             pte_access &= ~ACC_EXEC_MASK;
> > +
> >       if (unlikely(!fault->slot))
> > -             new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> > +             new_spte = make_mmio_spte(vcpu,
> > +                             tdp_iter_gfn_unalias(vcpu->kvm, iter),
> > +                             pte_access);
> >       else
> > -             wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> > -                                      fault->pfn, iter->old_spte, fault->prefetch, true,
> > -                                      fault->map_writable, &new_spte);
> > +             wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
> > +                             tdp_iter_gfn_unalias(vcpu->kvm, iter),
> > +                             fault->pfn, iter->old_spte, fault->prefetch,
> > +                             true, fault->map_writable, &new_spte);
> >
> >       if (new_spte == iter->old_spte)
> >               ret = RET_PF_SPURIOUS;
> > @@ -959,7 +1068,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> >  }
> >
> >  static bool tdp_mmu_populate_nonleaf(
> > -     struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
> > +     struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool is_private,
> > +     bool account_nx)
> >  {
> >       struct kvm_mmu_page *sp;
> >       u64 *child_pt;
> > @@ -968,7 +1078,7 @@ static bool tdp_mmu_populate_nonleaf(
> >       WARN_ON(is_shadow_present_pte(iter->old_spte));
> >       WARN_ON(is_removed_spte(iter->old_spte));
> >
> > -     sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1);
> > +     sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1, is_private);
> >       child_pt = sp->spt;
> >
> >       new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);
> > @@ -991,6 +1101,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >  {
> >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> >       struct tdp_iter iter;
> > +     gfn_t raw_gfn;
> > +     bool is_private;
> >       int ret;
> >
> >       kvm_mmu_hugepage_adjust(vcpu, fault);
> > @@ -999,7 +1111,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >
> >       rcu_read_lock();
> >
> > -     tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> > +     raw_gfn = gpa_to_gfn(fault->addr);
> > +     is_private = kvm_is_private_gfn(vcpu->kvm, raw_gfn);
> > +
> > +     if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn)) {
> > +             if (is_private) {
> > +                     rcu_read_unlock();
> > +                     return -EFAULT;
> > +             }
> > +     }
> > +
> > +     tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
> >               if (fault->nx_huge_page_workaround_enabled)
> >                       disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
> >
> > @@ -1015,6 +1137,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                   is_large_pte(iter.old_spte)) {
> >                       if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> >                               break;
> > +                     /*
> > +                      * TODO: large page support.
> > +                      * Doesn't support large page for TDX now
> > +                      */
> > +                     WARN_ON(is_private_spte(&iter.old_spte));
> > +
> >
> >                       /*
> >                        * The iter must explicitly re-read the spte here
> > @@ -1037,7 +1165,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >
> >                       account_nx = fault->huge_page_disallowed &&
> >                               fault->req_level >= iter.level;
> > -                     if (!tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
> > +                     if (!tdp_mmu_populate_nonleaf(
> > +                                     vcpu, &iter, is_private, account_nx))
> >                               break;
> >               }
> >       }
> > @@ -1058,9 +1187,12 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
> >  {
> >       struct kvm_mmu_page *root;
> >
> > -     for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> > +     for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) {
> > +             if (is_private_sp(root))
> > +                     continue;
> >               flush = zap_gfn_range(kvm, root, range->start, range->end,
> > -                                   range->may_block, flush, false);
> > +                             range->may_block, flush, false);
> > +     }
> >
> >       return flush;
> >  }
> > @@ -1513,10 +1645,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
> >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> >       gfn_t gfn = addr >> PAGE_SHIFT;
> >       int leaf = -1;
> > +     bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
> >
> >       *root_level = vcpu->arch.mmu->shadow_root_level;
> >
> > -     tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > +     if (WARN_ON(is_private))
> > +             return leaf;
> > +
> > +     tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
> >               leaf = iter.level;
> >               sptes[leaf] = iter.old_spte;
> >       }
> > @@ -1542,12 +1678,16 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
> >       struct kvm_mmu *mmu = vcpu->arch.mmu;
> >       gfn_t gfn = addr >> PAGE_SHIFT;
> >       tdp_ptep_t sptep = NULL;
> > +     bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
> >
> > -     tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> > +     if (is_private)
> > +             goto out;
> > +
> > +     tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
> >               *spte = iter.old_spte;
> >               sptep = iter.sptep;
> >       }
> > -
> > +out:
> >       /*
> >        * Perform the rcu_dereference to get the raw spte pointer value since
> >        * we are passing it up to fast_page_fault, which is shared with the
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> > index 3899004a5d91..7c62f694a465 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.h
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> > @@ -5,7 +5,7 @@
> >
> >  #include <linux/kvm_host.h>
> >
> > -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> > +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
> >
> >  __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
> >                                                    struct kvm_mmu_page *root)
> > @@ -20,11 +20,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
> >                         bool shared);
> >
> >  bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> > -                              gfn_t end, bool can_yield, bool flush);
> > +                              gfn_t end, bool can_yield, bool flush,
> > +                              bool zap_private);
> >  static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> > -                                          gfn_t start, gfn_t end, bool flush)
> > +                                          gfn_t start, gfn_t end, bool flush,
> > +                                          bool zap_private)
> >  {
> > -     return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> > +     return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
> > +                     zap_private);
> >  }
> >  static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >  {
> > @@ -41,7 +44,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
> >        */
> >       lockdep_assert_held_write(&kvm->mmu_lock);
> >       return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
> > -                                        sp->gfn, end, false, false);
> > +                                        sp->gfn, end, false, false, false);
> >  }
> >
> >  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index ae3bf553f215..d4e117f5b5b9 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -190,6 +190,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
> >
> >       return true;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);
> >
> >  /*
> >   * Switches to specified vcpu, until a matching vcpu_put()
>

Trying again in plain text mode. Sorry but it looks like I turned it
off at some point and forgot about it.

Sagi

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2022-04-25 19:10     ` Sagi Shahar
@ 2022-04-26 21:12       ` Isaku Yamahata
  0 siblings, 0 replies; 310+ messages in thread
From: Isaku Yamahata @ 2022-04-26 21:12 UTC (permalink / raw)
  To: Sagi Shahar
  Cc: Kai Huang, Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Jim Mattson, Erdem Aktas, Connor Kuehl,
	Sean Christopherson

On Mon, Apr 25, 2022 at 12:10:22PM -0700,
Sagi Shahar <sagis@google.com> wrote:

> On Wed, Apr 6, 2022 at 5:50 PM Kai Huang <kai.huang@intel.com> wrote:
> >
> > On Fri, 2022-03-04 at 11:49 -0800, isaku.yamahata@intel.com wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
...
> > > @@ -914,14 +1014,23 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
> > >       u64 new_spte;
> > >       int ret = RET_PF_FIXED;
> > >       bool wrprot = false;
> > > +     unsigned long pte_access = ACC_ALL;
> > >
> > >       WARN_ON(sp->role.level != fault->goal_level);
> > > +
> > > +     /* TDX shared GPAs are not executable, enforce this for the SDV. */
> > > +     if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
> 
> This should be:
> if (kvm_gfn_stolen_mask(vcpu->kvm) && !kvm_is_private_gfn(vcpu->kvm, iter->gfn))
> 
> Otherwise, when TDX is disabled, all EPTs are going to be considered
> as shared non-executable EPTs.

Oops, will fix it. Thank you for pointing it out.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 310+ messages in thread

* Re: [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2022-03-04 19:49 ` [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
  2022-04-07  0:50   ` Kai Huang
@ 2022-04-29  0:28   ` Sagi Shahar
  2022-04-29  0:46     ` Sean Christopherson
  1 sibling, 1 reply; 310+ messages in thread
From: Sagi Shahar @ 2022-04-29  0:28 UTC (permalink / raw)
  To: Yamahata, Isaku
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, Jim Mattson,
	Erdem Aktas, Connor Kuehl, Sean Christopherson


On Fri, Mar 4, 2022 at 12:14 PM <isaku.yamahata@intel.com> wrote:
>
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Use private EPT to mirror Secure EPT, and On the change of the private EPT
> entry, invoke kvm_x86_ops hook in __handle_changed_spte() to propagate the
> change to Secure EPT.
>
> On EPT violation, determine which EPT to use, private or shared, based on
> faulted GPA.  When allocating an EPT page, record it (private or shared) in
> the page role.  The private is passed down to the function as an argument
> as necessary.  When the private EPT entry is changed, call the hook.
>
> When zapping EPT, the EPT entry is frozen with the special EPT value that
> clears the present bit. After the TLB shootdown, the entry is set to the
> eventual value.  On populating the EPT entry, atomically set the entry.
>
> For TDX, TDX SEAMCALL to update Secure EPT in addition to direct access to
> the private EPT entry.  For the zapping case, freeing the EPT entry
> works. It can call TDX SEAMCALL in addition to TLB shootdown.  For
> populating the private EPT entry, there can be a race condition without
> further protection
>
>   vcpu 1: populating 2M private EPT entry
>   vcpu 2: populating 4K private EPT entry
>   vcpu 2: TDX SEAMCALL to update 4K secure EPT => error
>   vcpu 1: TDX SEAMCALL to update 4M secure EPT
>
> To avoid the race, the frozen EPT entry is utilized.  Instead of atomic
> update of the private EPT entry, freeze the entry, call the hook that
> invokes TDX SEAMCALL, set the entry to the final value (unfreeze).
>
> Support 4K page only at this stage.  2M page support can be done in future
> patches.
>
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |   2 +
>  arch/x86/include/asm/kvm_host.h    |   8 +
>  arch/x86/kvm/mmu/mmu.c             |  31 +++-
>  arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
>  arch/x86/kvm/mmu/tdp_mmu.c         | 226 +++++++++++++++++++++++------
>  arch/x86/kvm/mmu/tdp_mmu.h         |  13 +-
>  virt/kvm/kvm_main.c                |   1 +
>  7 files changed, 232 insertions(+), 51 deletions(-)
>
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index ef48dcc98cfc..7e27b73d839f 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -91,6 +91,8 @@ KVM_X86_OP(set_tss_addr)
>  KVM_X86_OP(set_identity_map_addr)
>  KVM_X86_OP(get_mt_mask)
>  KVM_X86_OP(load_mmu_pgd)
> +KVM_X86_OP(free_private_sp)
> +KVM_X86_OP(handle_changed_private_spte)
>  KVM_X86_OP_NULL(has_wbinvd_exit)
>  KVM_X86_OP(get_l2_tsc_offset)
>  KVM_X86_OP(get_l2_tsc_multiplier)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 0c8cc7d73371..8406f8b5ab74 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -433,6 +433,7 @@ struct kvm_mmu {
>                          struct kvm_mmu_page *sp);
>         void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
>         hpa_t root_hpa;
> +       hpa_t private_root_hpa;
>         gpa_t root_pgd;
>         union kvm_mmu_role mmu_role;
>         u8 root_level;
> @@ -1433,6 +1434,13 @@ struct kvm_x86_ops {
>         void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>                              int root_level);
>
> +       int (*free_private_sp)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +                              void *private_sp);
> +       void (*handle_changed_private_spte)(
> +               struct kvm *kvm, gfn_t gfn, enum pg_level level,
> +               kvm_pfn_t old_pfn, bool was_present, bool was_leaf,
> +               kvm_pfn_t new_pfn, bool is_present, bool is_leaf, void *sept_page);
> +
>         bool (*has_wbinvd_exit)(void);
>
>         u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 8def8b97978f..0ec9548ff4dd 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3422,6 +3422,7 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>  {
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         u8 shadow_root_level = mmu->shadow_root_level;
> +       gfn_t gfn_stolen = kvm_gfn_stolen_mask(vcpu->kvm);
>         hpa_t root;
>         unsigned i;
>         int r;
> @@ -3432,7 +3433,11 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
>                 goto out_unlock;
>
>         if (is_tdp_mmu_enabled(vcpu->kvm)) {
> -               root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
> +               if (gfn_stolen && !VALID_PAGE(mmu->private_root_hpa)) {
> +                       root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
> +                       mmu->private_root_hpa = root;
> +               }
> +               root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
>                 mmu->root_hpa = root;
>         } else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
>                 root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level, true);
> @@ -5596,6 +5601,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
>         int i;
>
>         mmu->root_hpa = INVALID_PAGE;
> +       mmu->private_root_hpa = INVALID_PAGE;
>         mmu->root_pgd = 0;
>         for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
>                 mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
> @@ -5772,6 +5778,10 @@ static void kvm_mmu_zap_all_fast(struct kvm *kvm)
>
>         write_unlock(&kvm->mmu_lock);
>
> +       /*
> +        * For now the private root is never invalidated while the VM is running,
> +        * so this can only happen for shared roots.
> +        */
>         if (is_tdp_mmu_enabled(kvm)) {
>                 read_lock(&kvm->mmu_lock);
>                 kvm_tdp_mmu_zap_invalidated_roots(kvm);
> @@ -5871,7 +5881,8 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
>         if (is_tdp_mmu_enabled(kvm)) {
>                 for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
>                         flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, gfn_start,
> -                                                         gfn_end, flush);
> +                                                         gfn_end, flush,
> +                                                         false);
>         }
>
>         if (flush)
> @@ -5904,6 +5915,11 @@ void kvm_mmu_slot_remove_write_access(struct kvm *kvm,
>                 write_unlock(&kvm->mmu_lock);
>         }
>
> +       /*
> +        * For now this can only happen for non-TD VM, because TD private
> +        * mapping doesn't support write protection.  kvm_tdp_mmu_wrprot_slot()
> +        * will give a WARN() if it hits for TD.
> +        */
>         if (is_tdp_mmu_enabled(kvm)) {
>                 read_lock(&kvm->mmu_lock);
>                 flush |= kvm_tdp_mmu_wrprot_slot(kvm, memslot, start_level);
> @@ -5952,6 +5968,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>                 sp = sptep_to_sp(sptep);
>                 pfn = spte_to_pfn(*sptep);
>
> +               /* Private page dirty logging is not supported. */
> +               KVM_BUG_ON(is_private_spte(sptep), kvm);
> +
>                 /*
>                  * We cannot do huge page mapping for indirect shadow pages,
>                  * which are found on the last rmap (level = 1) when not using
> @@ -5992,6 +6011,11 @@ void kvm_mmu_zap_collapsible_sptes(struct kvm *kvm,
>                 write_unlock(&kvm->mmu_lock);
>         }
>
> +       /*
> +        * This should only be reachable in case of log-dirty, which TD private
> +        * mapping doesn't support so far.  kvm_tdp_mmu_zap_collapsible_sptes()
> +        * internally gives a WARN() when it hits.
> +        */
>         if (is_tdp_mmu_enabled(kvm)) {
>                 read_lock(&kvm->mmu_lock);
>                 kvm_tdp_mmu_zap_collapsible_sptes(kvm, slot);
> @@ -6266,6 +6290,9 @@ int kvm_mmu_module_init(void)
>  void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
>  {
>         kvm_mmu_unload(vcpu);
> +       if (is_tdp_mmu_enabled(vcpu->kvm))
> +               mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
> +                               NULL);
>         free_mmu_pages(&vcpu->arch.root_mmu);
>         free_mmu_pages(&vcpu->arch.guest_mmu);
>         mmu_free_memory_caches(vcpu);
> diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
> index e19cabbcb65c..ad22d5b691c5 100644
> --- a/arch/x86/kvm/mmu/tdp_iter.h
> +++ b/arch/x86/kvm/mmu/tdp_iter.h
> @@ -28,7 +28,7 @@ struct tdp_iter {
>         tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
>         /* A pointer to the current SPTE */
>         tdp_ptep_t sptep;
> -       /* The lowest GFN mapped by the current SPTE */
> +       /* The lowest GFN (stolen bits included) mapped by the current SPTE */
>         gfn_t gfn;
>         /* The level of the root page given to the iterator */
>         int root_level;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index a68f3a22836b..acba2590b51e 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -53,6 +53,11 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
>         rcu_barrier();
>  }
>
> +static gfn_t tdp_iter_gfn_unalias(struct kvm *kvm, struct tdp_iter *iter)
> +{
> +       return kvm_gfn_unalias(kvm, iter->gfn);
> +}
> +
>  static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>                           gfn_t start, gfn_t end, bool can_yield, bool flush,
>                           bool shared);
> @@ -175,10 +180,13 @@ static union kvm_mmu_page_role page_role_for_level(struct kvm_vcpu *vcpu,
>  }
>
>  static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
> -                                              int level)
> +                                              int level, bool private)
>  {
>         struct kvm_mmu_page *sp;
>
> +       WARN_ON(level != vcpu->arch.mmu->shadow_root_level &&
> +               kvm_is_private_gfn(vcpu->kvm, gfn) != private);
> +       WARN_ON(level == vcpu->arch.mmu->shadow_root_level && gfn != 0);
>         sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
>         sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
>         set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
> @@ -186,14 +194,19 @@ static struct kvm_mmu_page *alloc_tdp_mmu_page(struct kvm_vcpu *vcpu, gfn_t gfn,
>         sp->role.word = page_role_for_level(vcpu, level).word;
>         sp->gfn = gfn;
>         sp->tdp_mmu_page = true;
> -       kvm_mmu_init_private_sp(sp);
> +
> +       if (private)
> +               kvm_mmu_alloc_private_sp(vcpu, sp);
> +       else
> +               kvm_mmu_init_private_sp(sp);
>
>         trace_kvm_mmu_get_page(sp, true);
>
>         return sp;
>  }
>
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
> +static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
> +                                                     bool private)
>  {
>         union kvm_mmu_page_role role;
>         struct kvm *kvm = vcpu->kvm;
> @@ -206,11 +219,13 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>         /* Check for an existing root before allocating a new one. */
>         for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
>                 if (root->role.word == role.word &&
> +                   is_private_sp(root) == private &&
>                     kvm_tdp_mmu_get_root(kvm, root))
>                         goto out;
>         }
>
> -       root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level);
> +       root = alloc_tdp_mmu_page(vcpu, 0, vcpu->arch.mmu->shadow_root_level,
> +                       private);
>         refcount_set(&root->tdp_mmu_root_count, 1);
>
>         spin_lock(&kvm->arch.tdp_mmu_pages_lock);
> @@ -218,12 +233,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>         spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
>
>  out:
> -       return __pa(root->spt);
> +       return root;
> +}
> +
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
> +{
> +       return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
>  }
>
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                               u64 old_spte, u64 new_spte, int level,
> -                               bool shared);
> +                               bool private_spte, u64 old_spte,
> +                               u64 new_spte, int level, bool shared);
>
>  static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
>  {
> @@ -321,6 +341,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>         int level = sp->role.level;
>         gfn_t base_gfn = sp->gfn;
>         int i;
> +       bool private_sp = is_private_sp(sp);
>
>         trace_kvm_mmu_prepare_zap_page(sp);
>
> @@ -370,7 +391,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>                          */
>                         WRITE_ONCE(*sptep, SHADOW_REMOVED_SPTE);
>                 }
> -               handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> +               handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, private_sp,
>                                     old_child_spte, SHADOW_REMOVED_SPTE, level,
>                                     shared);
>         }
> @@ -378,6 +399,17 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>         kvm_flush_remote_tlbs_with_address(kvm, base_gfn,
>                                            KVM_PAGES_PER_HPAGE(level + 1));
>
> +       if (private_sp &&
> +               WARN_ON(static_call(kvm_x86_free_private_sp)(
> +                               kvm, sp->gfn, sp->role.level,
> +                               kvm_mmu_private_sp(sp)))) {
> +               /*
> +                * Failed to unlink the Secure EPT page and nothing further
> +                * can be done.  Intentionally leak the page to prevent the
> +                * kernel from accessing the encrypted page.
> +                */
> +               kvm_mmu_init_private_sp(sp);
> +       }
>         call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
>  }
>
> @@ -386,6 +418,7 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>   * @kvm: kvm instance
>   * @as_id: the address space of the paging structure the SPTE was a part of
>   * @gfn: the base GFN that was mapped by the SPTE
> + * @private_spte: whether the SPTE is private
>   * @old_spte: The value of the SPTE before the change
>   * @new_spte: The value of the SPTE after the change
>   * @level: the level of the PT the SPTE is part of in the paging structure
> @@ -397,14 +430,16 @@ static void handle_removed_tdp_mmu_page(struct kvm *kvm, tdp_ptep_t pt,
>   * This function must be called for all TDP SPTE modifications.
>   */
>  static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                                 u64 old_spte, u64 new_spte, int level,
> -                                 bool shared)
> +                                 bool private_spte, u64 old_spte,
> +                                 u64 new_spte, int level, bool shared)
>  {
>         bool was_present = is_shadow_present_pte(old_spte);
>         bool is_present = is_shadow_present_pte(new_spte);
>         bool was_leaf = was_present && is_last_spte(old_spte, level);
>         bool is_leaf = is_present && is_last_spte(new_spte, level);
> -       bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
> +       kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
> +       kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
> +       bool pfn_changed = old_pfn != new_pfn;
>
>         WARN_ON(level > PT64_ROOT_MAX_LEVEL);
>         WARN_ON(level < PG_LEVEL_4K);
> @@ -468,23 +503,49 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>
>         if (was_leaf && is_dirty_spte(old_spte) &&
>             (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> -               kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> +               kvm_set_pfn_dirty(old_pfn);
> +
> +       /*
> +        * Special handling for private mappings: we are either setting up a
> +        * new mapping at a middle-level page table or a leaf, or tearing
> +        * down an existing mapping.
> +        */
> +       if (private_spte) {
> +               void *sept_page = NULL;
> +
> +               if (is_present && !is_leaf) {
> +                       struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(new_pfn));
> +
> +                       sept_page = kvm_mmu_private_sp(sp);
> +                       WARN_ON(!sept_page);
> +                       WARN_ON(sp->role.level + 1 != level);
> +                       WARN_ON(sp->gfn != gfn);
> +               }
> +
> +               static_call(kvm_x86_handle_changed_private_spte)(
> +                       kvm, gfn, level,
> +                       old_pfn, was_present, was_leaf,
> +                       new_pfn, is_present, is_leaf, sept_page);
> +       }
>
>         /*
>          * Recursively handle child PTs if the change removed a subtree from
>          * the paging structure.
>          */
> -       if (was_present && !was_leaf && (pfn_changed || !is_present))
> +       if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> +               WARN_ON(private_spte !=
> +                       is_private_spte(spte_to_child_pt(old_spte, level)));
>                 handle_removed_tdp_mmu_page(kvm,
>                                 spte_to_child_pt(old_spte, level), shared);
> +       }
>  }
>
>  static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -                               u64 old_spte, u64 new_spte, int level,
> -                               bool shared)
> +                               bool private_spte, u64 old_spte, u64 new_spte,
> +                               int level, bool shared)
>  {
> -       __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> -                             shared);
> +       __handle_changed_spte(kvm, as_id, gfn, private_spte,
> +                       old_spte, new_spte, level, shared);
>         handle_changed_spte_acc_track(old_spte, new_spte, level);
>         handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
>                                       new_spte, level);
> @@ -505,6 +566,10 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>                                            struct tdp_iter *iter,
>                                            u64 new_spte)
>  {
> +       bool freeze_spte = is_private_spte(iter->sptep) &&
> +               !is_removed_spte(new_spte);
> +       u64 tmp_spte = freeze_spte ? SHADOW_REMOVED_SPTE : new_spte;
> +
>         WARN_ON_ONCE(iter->yielded);
>
>         lockdep_assert_held_read(&kvm->mmu_lock);
> @@ -521,13 +586,16 @@ static inline bool tdp_mmu_set_spte_atomic(struct kvm *kvm,
>          * does not hold the mmu_lock.
>          */
>         if (cmpxchg64(rcu_dereference(iter->sptep), iter->old_spte,
> -                     new_spte) != iter->old_spte)
> +                     tmp_spte) != iter->old_spte)
>                 return false;
>
> -       __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -                             new_spte, iter->level, true);
> +       __handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
> +                             iter->old_spte, new_spte, iter->level, true);
>         handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
>
> +       if (freeze_spte)
> +               WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
> +
>         return true;
>  }
>
> @@ -603,8 +671,8 @@ static inline void __tdp_mmu_set_spte(struct kvm *kvm, struct tdp_iter *iter,
>
>         WRITE_ONCE(*rcu_dereference(iter->sptep), new_spte);
>
> -       __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -                             new_spte, iter->level, false);
> +       __handle_changed_spte(kvm, iter->as_id, iter->gfn, is_private_spte(iter->sptep),
> +                             iter->old_spte, new_spte, iter->level, false);
>         if (record_acc_track)
>                 handle_changed_spte_acc_track(iter->old_spte, new_spte,
>                                               iter->level);
> @@ -644,9 +712,10 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
>                         continue;                                       \
>                 else
>
> -#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)                \
> -       for_each_tdp_pte(_iter, __va(_mmu->root_hpa),           \
> -                        _mmu->shadow_root_level, _start, _end)
> +#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)              \
> +       for_each_tdp_pte(_iter,                                                 \
> +               __va((_private) ? _mmu->private_root_hpa : _mmu->root_hpa),     \
> +               _mmu->shadow_root_level, _start, _end)
>
>  /*
>   * Yield if the MMU lock is contended or this thread needs to return control
> @@ -731,6 +800,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>          */
>         end = min(end, max_gfn_host);
>
> +       /*
> +        * Adjust [start, end) to carry the GFN stolen bits matching the
> +        * root type (private vs. shared) when TDX is enabled.
> +        */
> +       if (is_private_sp(root)) {
> +               start = kvm_gfn_private(kvm, start);
> +               end = kvm_gfn_private(kvm, end);
> +       } else {
> +               start = kvm_gfn_shared(kvm, start);
> +               end = kvm_gfn_shared(kvm, end);
> +       }
> +
>         kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>
>         rcu_read_lock();
> @@ -783,13 +864,18 @@ static bool zap_gfn_range(struct kvm *kvm, struct kvm_mmu_page *root,
>   * MMU lock.
>   */
>  bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -                                gfn_t end, bool can_yield, bool flush)
> +                                gfn_t end, bool can_yield, bool flush,
> +                                bool zap_private)
>  {
>         struct kvm_mmu_page *root;
>
> -       for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false)
> +       for_each_tdp_mmu_root_yield_safe(kvm, root, as_id, false) {
> +               /* Skip private page table if not requested */
> +               if (!zap_private && is_private_sp(root))
> +                       continue;
>                 flush = zap_gfn_range(kvm, root, start, end, can_yield, flush,
>                                       false);
> +       }
>
>         return flush;
>  }
> @@ -800,7 +886,7 @@ void kvm_tdp_mmu_zap_all(struct kvm *kvm)
>         int i;
>
>         for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
> -               flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
> +               flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush, true);
>
>         if (flush)
>                 kvm_flush_remote_tlbs(kvm);
> @@ -851,6 +937,13 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>         while (root) {
>                 next_root = next_invalidated_root(kvm, root);
>
> +               /*
> +                * The private page table is only torn down when the VM is
> +                * destroyed.  It is a bug to zap a private root here.
> +                */
> +               if (WARN_ON(is_private_sp(root)))
> +                       goto out;
> +
>                 rcu_read_unlock();
>
>                 flush = zap_gfn_range(kvm, root, 0, -1ull, true, flush, true);
> @@ -865,7 +958,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm)
>
>                 rcu_read_lock();
>         }
> -
> +out:
>         rcu_read_unlock();
>
>         if (flush)
> @@ -897,9 +990,16 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
>         struct kvm_mmu_page *root;
>
>         lockdep_assert_held_write(&kvm->mmu_lock);
> -       list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link)
> +       list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
> +               /*
> +                * Skip private roots since the private page table is only
> +                * torn down when the VM is destroyed.
> +                */
> +               if (is_private_sp(root))
> +                       continue;
>                 if (refcount_inc_not_zero(&root->tdp_mmu_root_count))
>                         root->role.invalid = true;
> +       }
>  }
>
>  /*
> @@ -914,14 +1014,23 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>         u64 new_spte;
>         int ret = RET_PF_FIXED;
>         bool wrprot = false;
> +       unsigned long pte_access = ACC_ALL;
>
>         WARN_ON(sp->role.level != fault->goal_level);
> +
> +       /* TDX shared GPAs are not executable; enforce this for the SDV. */
> +       if (!kvm_is_private_gfn(vcpu->kvm, iter->gfn))
> +               pte_access &= ~ACC_EXEC_MASK;
> +
>         if (unlikely(!fault->slot))
> -               new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
> +               new_spte = make_mmio_spte(vcpu,
> +                               tdp_iter_gfn_unalias(vcpu->kvm, iter),
> +                               pte_access);
>         else
> -               wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
> -                                        fault->pfn, iter->old_spte, fault->prefetch, true,
> -                                        fault->map_writable, &new_spte);
> +               wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
> +                               tdp_iter_gfn_unalias(vcpu->kvm, iter),
> +                               fault->pfn, iter->old_spte, fault->prefetch,
> +                               true, fault->map_writable, &new_spte);
>
>         if (new_spte == iter->old_spte)
>                 ret = RET_PF_SPURIOUS;
> @@ -959,7 +1068,8 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  }
>
>  static bool tdp_mmu_populate_nonleaf(
> -       struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool account_nx)
> +       struct kvm_vcpu *vcpu, struct tdp_iter *iter, bool is_private,
> +       bool account_nx)
>  {
>         struct kvm_mmu_page *sp;
>         u64 *child_pt;
> @@ -968,7 +1078,7 @@ static bool tdp_mmu_populate_nonleaf(
>         WARN_ON(is_shadow_present_pte(iter->old_spte));
>         WARN_ON(is_removed_spte(iter->old_spte));
>
> -       sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1);
> +       sp = alloc_tdp_mmu_page(vcpu, iter->gfn, iter->level - 1, is_private);
>         child_pt = sp->spt;
>
>         new_spte = make_nonleaf_spte(child_pt, !shadow_accessed_mask);
> @@ -991,6 +1101,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>  {
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         struct tdp_iter iter;
> +       gfn_t raw_gfn;
> +       bool is_private;
>         int ret;
>
>         kvm_mmu_hugepage_adjust(vcpu, fault);
> @@ -999,7 +1111,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
>         rcu_read_lock();
>
> -       tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
> +       raw_gfn = gpa_to_gfn(fault->addr);
> +       is_private = kvm_is_private_gfn(vcpu->kvm, raw_gfn);
> +
> +       if (is_error_noslot_pfn(fault->pfn) || kvm_is_reserved_pfn(fault->pfn)) {
> +               if (is_private) {
> +                       rcu_read_unlock();
> +                       return -EFAULT;
> +               }
> +       }
> +
> +       tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
>                 if (fault->nx_huge_page_workaround_enabled)
>                         disallowed_hugepage_adjust(fault, iter.old_spte, iter.level);
>
> @@ -1015,6 +1137,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>                     is_large_pte(iter.old_spte)) {
>                         if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
>                                 break;
> +                       /*
> +                        * TODO: large page support.
> +                        * Large pages are not supported for TDX yet.
> +                        */
> +                       WARN_ON(is_private_spte(&iter.old_spte));

The above line causes a null pointer dereference when running the KVM
unit tests.  It should be is_private_spte(iter.sptep) instead of
is_private_spte(&iter.old_spte).  While old_spte holds a snapshot of the
value pointed to by sptep, &old_spte is not equivalent to sptep.
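
Something like the following (untested, against this patch) should do it:

-                       WARN_ON(is_private_spte(&iter.old_spte));
+                       WARN_ON(is_private_spte(iter.sptep));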

>
> +
>
>
>                         /*
>                          * The iter must explicitly re-read the spte here
> @@ -1037,7 +1165,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>
>                         account_nx = fault->huge_page_disallowed &&
>                                 fault->req_level >= iter.level;
> -                       if (!tdp_mmu_populate_nonleaf(vcpu, &iter, account_nx))
> +                       if (!tdp_mmu_populate_nonleaf(
> +                                       vcpu, &iter, is_private, account_nx))
>                                 break;
>                 }
>         }
> @@ -1058,9 +1187,12 @@ bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
>  {
>         struct kvm_mmu_page *root;
>
> -       for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false)
> +       for_each_tdp_mmu_root_yield_safe(kvm, root, range->slot->as_id, false) {
> +               if (is_private_sp(root))
> +                       continue;
>                 flush = zap_gfn_range(kvm, root, range->start, range->end,
> -                                     range->may_block, flush, false);
> +                               range->may_block, flush, false);
> +       }
>
>         return flush;
>  }
> @@ -1513,10 +1645,14 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         gfn_t gfn = addr >> PAGE_SHIFT;
>         int leaf = -1;
> +       bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
>
>         *root_level = vcpu->arch.mmu->shadow_root_level;
>
> -       tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> +       if (WARN_ON(is_private))
> +               return leaf;
> +
> +       tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
>                 leaf = iter.level;
>                 sptes[leaf] = iter.old_spte;
>         }
> @@ -1542,12 +1678,16 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
>         struct kvm_mmu *mmu = vcpu->arch.mmu;
>         gfn_t gfn = addr >> PAGE_SHIFT;
>         tdp_ptep_t sptep = NULL;
> +       bool is_private = kvm_is_private_gfn(vcpu->kvm, gfn);
>
> -       tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
> +       if (is_private)
> +               goto out;
> +
> +       tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
>                 *spte = iter.old_spte;
>                 sptep = iter.sptep;
>         }
> -
> +out:
>         /*
>          * Perform the rcu_dereference to get the raw spte pointer value since
>          * we are passing it up to fast_page_fault, which is shared with the
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
> index 3899004a5d91..7c62f694a465 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.h
> +++ b/arch/x86/kvm/mmu/tdp_mmu.h
> @@ -5,7 +5,7 @@
>
>  #include <linux/kvm_host.h>
>
> -hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
> +hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
>
>  __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm *kvm,
>                                                      struct kvm_mmu_page *root)
> @@ -20,11 +20,14 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
>                           bool shared);
>
>  bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
> -                                gfn_t end, bool can_yield, bool flush);
> +                                gfn_t end, bool can_yield, bool flush,
> +                                bool zap_private);
>  static inline bool kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id,
> -                                            gfn_t start, gfn_t end, bool flush)
> +                                            gfn_t start, gfn_t end, bool flush,
> +                                            bool zap_private)
>  {
> -       return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush);
> +       return __kvm_tdp_mmu_zap_gfn_range(kvm, as_id, start, end, true, flush,
> +                       zap_private);
>  }
>  static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  {
> @@ -41,7 +44,7 @@ static inline bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>          */
>         lockdep_assert_held_write(&kvm->mmu_lock);
>         return __kvm_tdp_mmu_zap_gfn_range(kvm, kvm_mmu_page_as_id(sp),
> -                                          sp->gfn, end, false, false);
> +                                          sp->gfn, end, false, false, false);
>  }
>
>  void kvm_tdp_mmu_zap_all(struct kvm *kvm);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ae3bf553f215..d4e117f5b5b9 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -190,6 +190,7 @@ bool kvm_is_reserved_pfn(kvm_pfn_t pfn)
>
>         return true;
>  }
> +EXPORT_SYMBOL_GPL(kvm_is_reserved_pfn);
>
>  /*
>   * Switches to specified vcpu, until a matching vcpu_put()
> --
> 2.25.1
>

Sagi


* Re: [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2022-04-29  0:28   ` Sagi Shahar
@ 2022-04-29  0:46     ` Sean Christopherson
  0 siblings, 0 replies; 310+ messages in thread
From: Sean Christopherson @ 2022-04-29  0:46 UTC (permalink / raw)
  To: Sagi Shahar
  Cc: Yamahata, Isaku, kvm, linux-kernel, isaku.yamahata,
	Paolo Bonzini, Jim Mattson, Erdem Aktas, Connor Kuehl

On Thu, Apr 28, 2022, Sagi Shahar wrote:
> > @@ -468,23 +503,49 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> >
> >         if (was_leaf && is_dirty_spte(old_spte) &&
> >             (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
> > -               kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> > +               kvm_set_pfn_dirty(old_pfn);
> > +
> > +       /*
> > +        * Special handling for private mappings: we are either setting up a
> > +        * new mapping at a middle-level page table or a leaf, or tearing
> > +        * down an existing mapping.
> > +        */
> > +       if (private_spte) {
> > +               void *sept_page = NULL;
> > +
> > +               if (is_present && !is_leaf) {
> > +                       struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(new_pfn));
> > +
> > +                       sept_page = kvm_mmu_private_sp(sp);
> > +                       WARN_ON(!sept_page);
> > +                       WARN_ON(sp->role.level + 1 != level);
> > +                       WARN_ON(sp->gfn != gfn);
> > +               }
> > +
> > +               static_call(kvm_x86_handle_changed_private_spte)(
> > +                       kvm, gfn, level,
> > +                       old_pfn, was_present, was_leaf,
> > +                       new_pfn, is_present, is_leaf, sept_page);
> > +       }
> >
> >         /*
> >          * Recursively handle child PTs if the change removed a subtree from
> >          * the paging structure.
> >          */
> > -       if (was_present && !was_leaf && (pfn_changed || !is_present))
> > +       if (was_present && !was_leaf && (pfn_changed || !is_present)) {
> > +               WARN_ON(private_spte !=
> > +                       is_private_spte(spte_to_child_pt(old_spte, level)));

This sanity check is pointless.  The private flag comes from the parent shadow
page role, and that's not changing.
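
E.g. the removal path, handle_removed_tdp_mmu_page(), derives the flag from
the parent sp a few lines up in this same patch:

        bool private_sp = is_private_sp(sp);
        ...
        handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn, private_sp,
                            old_child_spte, SHADOW_REMOVED_SPTE, level, shared);

and any child page table under that root is allocated with the same flag
(tdp_mmu_populate_nonleaf() passes it straight to alloc_tdp_mmu_page()), so as
far as I can tell the two sides can never disagree.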

> > @@ -1015,6 +1137,12 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
> >                     is_large_pte(iter.old_spte)) {
> >                         if (!tdp_mmu_zap_spte_atomic(vcpu->kvm, &iter))
> >                                 break;
> > +                       /*
> > +                        * TODO: large page support.
> > +                        * Large pages are not supported for TDX yet.
> > +                        */
> > +                       WARN_ON(is_private_spte(&iter.old_spte));
> 
> The above line is causing a null ptr dereferencing when running the
> KVM unit tests.
> It should be is_private_spte(iter.sptep) instead of
> is_private_spte(&iter.old_spte)
> While old_spte holds a snapshot of the value pointed to by sptep,
> &old_spte is not equivalent to sptep.

Bug aside, the name is really, really bad.  All of the existing helpers with an
"is_blah_spte()" name take an SPTE value, not a pointer to an SPTE.

is_private_sptep() is the obvious choice.  That makes me a bit nervous too, and
I don't love having to go back to the parent to query private vs shared.

That said, I think it's worth waiting to see the next version of this series before
going behind the bikeshed, I suspect many/most of the calls will go away, i.e. we
might find a better option presents itself.


end of thread

Thread overview: 310+ messages
2022-03-04 19:48 [RFC PATCH v5 000/104] KVM TDX basic feature support isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 001/104] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
2022-03-13 13:45   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 002/104] x86/virt/tdx: export platform_has_tdx isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 003/104] KVM: TDX: Detect CPU feature on kernel module initialization isaku.yamahata
2022-03-13 13:49   ` Paolo Bonzini
2022-03-14 18:34     ` Isaku Yamahata
2022-04-08 16:46   ` Sean Christopherson
2022-03-04 19:48 ` [RFC PATCH v5 004/104] KVM: Enable hardware before doing arch VM initialization isaku.yamahata
2022-03-13 14:00   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 005/104] KVM: x86: Refactor KVM VMX module init/exit functions isaku.yamahata
2022-03-13 13:54   ` Paolo Bonzini
2022-03-14 19:22     ` Isaku Yamahata
2022-03-04 19:48 ` [RFC PATCH v5 006/104] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
2022-03-13 13:55   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 007/104] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
2022-03-13 13:59   ` Paolo Bonzini
2022-03-13 23:02     ` Kai Huang
2022-03-04 19:48 ` [RFC PATCH v5 008/104] KVM: TDX: Add a function to initialize " isaku.yamahata
2022-03-13 14:03   ` Paolo Bonzini
2022-03-14 19:45     ` Isaku Yamahata
2022-03-31  0:03       ` Sean Christopherson
2022-03-31  1:02         ` Kai Huang
2022-03-31 17:03         ` Isaku Yamahata
2022-03-31 19:34           ` Sean Christopherson
     [not found]             ` <20220401032741.GA2806@gao-cwp>
2022-04-01  5:07               ` Chao Gao
2022-03-31  3:31   ` Kai Huang
2022-03-31 19:41     ` Isaku Yamahata
2022-04-01  6:56       ` Xiaoyao Li
2022-04-01 20:18         ` Isaku Yamahata
2022-04-02  2:40           ` Xiaoyao Li
2022-03-04 19:48 ` [RFC PATCH v5 009/104] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
2022-03-13 14:07   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 010/104] KVM: TDX: Make TDX VM type supported isaku.yamahata
2022-03-13 23:08   ` Kai Huang
2022-03-15 21:03     ` Isaku Yamahata
2022-03-15 21:47       ` Kai Huang
2022-03-15 21:49         ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 011/104] [MARKER] The start of TDX KVM patch series: TDX architectural definitions isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 012/104] KVM: TDX: Define " isaku.yamahata
2022-03-13 14:30   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 013/104] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
2022-03-13 14:08   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 014/104] KVM: TDX: Add a function for KVM to invoke SEAMCALL isaku.yamahata
2022-03-13 14:10   ` Paolo Bonzini
2022-03-13 22:42   ` Kai Huang
2022-03-04 19:48 ` [RFC PATCH v5 015/104] KVM: TDX: add a helper function for KVM to issue SEAMCALL isaku.yamahata
2022-03-13 14:11   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 016/104] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 017/104] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
2022-03-13 14:12   ` Paolo Bonzini
2022-04-15 16:54   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 018/104] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 019/104] KVM: TDX: Stub in tdx.h with structs, accessors, and VMCS helpers isaku.yamahata
2022-04-15 16:55   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 020/104] KVM: TDX: allocate per-package mutex isaku.yamahata
2022-04-05 12:39   ` Paolo Bonzini
2022-04-08  0:44     ` Isaku Yamahata
2022-03-04 19:48 ` [RFC PATCH v5 021/104] KVM: x86: Introduce hooks to free VM callback prezap and vm_free isaku.yamahata
2022-03-31  3:02   ` Kai Huang
2022-03-31 19:54     ` Isaku Yamahata
2022-04-05 12:40   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 022/104] KVM: Add max_vcpus field in common 'struct kvm' isaku.yamahata
2022-04-05 12:42   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 023/104] x86/cpu: Add helper functions to allocate/free MKTME keyid isaku.yamahata
2022-03-31  1:21   ` Kai Huang
2022-03-31 20:15     ` Isaku Yamahata
2022-04-06  1:55       ` Kai Huang
2022-04-07  1:00         ` Kai Huang
2022-04-05 13:08   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 024/104] KVM: TDX: create/destroy VM structure isaku.yamahata
2022-03-31  4:17   ` Kai Huang
2022-03-31 22:12     ` Isaku Yamahata
2022-03-31 23:41       ` Kai Huang
2022-04-05 12:44   ` Paolo Bonzini
2022-04-08  0:51     ` Isaku Yamahata
2022-04-15 13:47       ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 025/104] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
2022-04-05 12:50   ` Paolo Bonzini
2022-04-08  0:56     ` Isaku Yamahata
2022-03-04 19:48 ` [RFC PATCH v5 026/104] KVM: TDX: x86: Add vm ioctl to get TDX systemwide parameters isaku.yamahata
2022-04-05 12:52   ` Paolo Bonzini
2022-04-06  1:54     ` Xiaoyao Li
2022-04-07  1:07       ` Kai Huang
2022-04-07  1:17         ` Xiaoyao Li
2022-04-08  0:58           ` Isaku Yamahata
2022-03-04 19:48 ` [RFC PATCH v5 027/104] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
2022-03-31  4:55   ` Kai Huang
2022-04-05 13:01     ` Paolo Bonzini
2022-04-06  2:06       ` Xiaoyao Li
2022-04-06 11:27         ` Paolo Bonzini
2022-04-08  2:18     ` Isaku Yamahata
2022-04-05 12:58   ` Paolo Bonzini
2022-04-07  1:29     ` Xiaoyao Li
2022-04-07  1:51       ` Kai Huang
2022-04-08  3:33         ` Isaku Yamahata
2022-03-04 19:48 ` [RFC PATCH v5 028/104] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 029/104] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
2022-04-05 13:04   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 030/104] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 031/104] [MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 032/104] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
2022-03-31 11:23   ` Kai Huang
2022-04-01  1:51     ` Isaku Yamahata
2022-04-01  2:13       ` Kai Huang
2022-04-05 13:48         ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 033/104] KVM: x86: Add infrastructure for stolen GPA bits isaku.yamahata
2022-03-31 11:16   ` Kai Huang
2022-04-01  2:10     ` Kai Huang
2022-04-01  2:34     ` Isaku Yamahata
2022-04-05 14:02       ` Paolo Bonzini
2022-04-05 14:02       ` Paolo Bonzini
2022-04-05 13:55     ` Paolo Bonzini
2022-04-06  2:23       ` Kai Huang
2022-04-06 11:26         ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 034/104] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX isaku.yamahata
2022-03-04 19:48 ` [RFC PATCH v5 035/104] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
2022-04-05 13:09   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 036/104] KVM: x86/mmu: Explicitly check for MMIO spte in fast page fault isaku.yamahata
2022-04-05 13:17   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 037/104] KVM: x86/mmu: Allow non-zero init value for shadow PTE isaku.yamahata
2022-04-01  5:13   ` Kai Huang
2022-04-01  7:13     ` Kai Huang
2022-04-05 14:14       ` Paolo Bonzini
2022-04-08 18:38         ` Isaku Yamahata
2022-04-05 14:13     ` Paolo Bonzini
2022-04-05 14:10   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 038/104] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
2022-04-01  5:15   ` Kai Huang
2022-04-01 14:08     ` Sean Christopherson
2022-04-01 20:28       ` Isaku Yamahata
2022-04-01 20:53         ` Sean Christopherson
2022-04-01 22:27       ` Kai Huang
2022-04-02  0:08         ` Sean Christopherson
2022-04-04  0:41           ` Kai Huang
2022-03-04 19:48 ` [RFC PATCH v5 039/104] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
2022-04-05 13:22   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 040/104] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
2022-04-05 14:43   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 041/104] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
2022-04-05 14:48   ` Paolo Bonzini
2022-03-04 19:48 ` [RFC PATCH v5 042/104] KVM: x86/mmu: Track shadow MMIO value/mask on a per-VM basis isaku.yamahata
2022-04-05 15:25   ` Paolo Bonzini
2022-04-08 18:46     ` Isaku Yamahata
2022-04-19 19:55       ` Sean Christopherson
2022-04-06 11:06   ` Kai Huang
2022-04-07  3:05     ` Kai Huang
2022-04-08 19:12     ` Isaku Yamahata
2022-04-08 23:34       ` Kai Huang
2022-03-04 19:48 ` [RFC PATCH v5 043/104] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
2022-04-05 14:51   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 044/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 045/104] KVM: x86/tdp_mmu: make REMOVED_SPTE include shadow_initial value isaku.yamahata
2022-04-05 14:22   ` Paolo Bonzini
2022-04-06 23:35     ` Sean Christopherson
2022-04-07 13:52       ` Paolo Bonzini
2022-04-06 23:30   ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 046/104] KVM: x86/tdp_mmu: refactor kvm_tdp_mmu_map() isaku.yamahata
2022-04-05 14:53   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 047/104] KVM: x86/mmu: add a private pointer to struct kvm_mmu_page isaku.yamahata
2022-04-05 14:58   ` Paolo Bonzini
2022-04-06 23:43   ` Kai Huang
2022-04-07 13:52     ` Paolo Bonzini
2022-04-07 22:53       ` Kai Huang
2022-04-07 23:03         ` Paolo Bonzini
2022-04-07 23:24           ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 048/104] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
2022-04-07  0:50   ` Kai Huang
2022-04-25 19:10     ` Sagi Shahar
2022-04-26 21:12       ` Isaku Yamahata
2022-04-29  0:28   ` Sagi Shahar
2022-04-29  0:46     ` Sean Christopherson
2022-03-04 19:49 ` [RFC PATCH v5 049/104] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
2022-04-05 15:15   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 050/104] [MARKER] The start of TDX KVM patch series: TDX EPT violation isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 051/104] KVM: TDX: TDP MMU TDX support isaku.yamahata
2022-04-07  2:20   ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 052/104] [MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 053/104] KVM: x86/mmu: steal software usable bit for EPT to represent shared page isaku.yamahata
2022-04-15 15:21   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 054/104] KVM: x86/tdp_mmu: Keep PRIVATE_PROHIBIT bit when zapping isaku.yamahata
2022-04-07  1:43   ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 055/104] KVM: x86/tdp_mmu: prevent private/shared map based on PRIVATE_PROHIBIT isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 056/104] KVM: x86/tdp_mmu: implement MapGPA hypercall for TDX isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 057/104] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 058/104] KVM: x86/mmu: Forcibly use TDP MMU for TDX isaku.yamahata
2022-04-07  1:49   ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 059/104] [MARKER] The start of TDX KVM patch series: TD finalization isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 060/104] KVM: TDX: Create initial guest memory isaku.yamahata
2022-04-07  2:30   ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 061/104] KVM: TDX: Finalize VM initialization isaku.yamahata
2022-04-15 13:52   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 062/104] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 063/104] KVM: TDX: Add helper assembly function to TDX vcpu isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 064/104] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
2022-03-22 17:28   ` Erdem Aktas
2022-03-23 17:55     ` Isaku Yamahata
2022-03-23 20:05       ` Erdem Aktas
2022-03-23 22:48         ` Isaku Yamahata
2022-03-04 19:49 ` [RFC PATCH v5 065/104] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
2022-04-15 13:56   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 066/104] KVM: TDX: restore host xsave state when exit from the guest TD isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 067/104] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
2022-04-15 14:02   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 068/104] KVM: TDX: restore user ret MSRs isaku.yamahata
2022-04-15 14:06   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 069/104] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 070/104] KVM: TDX: complete interrupts after tdexit isaku.yamahata
2022-04-15 14:07   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 071/104] KVM: TDX: restore debug store when TD exit isaku.yamahata
2022-04-15 14:20   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 072/104] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
2022-04-15 14:14   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 073/104] KVM: TDX: track LP tdx vcpu run and teardown vcpus on destroying the guest TD isaku.yamahata
2022-03-23  0:54   ` Erdem Aktas
2022-03-23 19:08     ` Isaku Yamahata
2022-03-23 20:17       ` Erdem Aktas
2022-03-04 19:49 ` [RFC PATCH v5 074/104] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
2022-04-05 15:32   ` Paolo Bonzini
2022-04-06 23:28     ` Sean Christopherson
2022-03-04 19:49 ` [RFC PATCH v5 075/104] KVM: x86: Check for pending APICv interrupt in kvm_vcpu_has_events() isaku.yamahata
2022-04-08 16:24   ` Sean Christopherson
2022-04-15 14:20     ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 076/104] KVM: x86: Add option to force LAPIC expiration wait isaku.yamahata
2022-04-05 15:33   ` Paolo Bonzini
2022-04-08 16:36   ` Sean Christopherson
2022-03-04 19:49 ` [RFC PATCH v5 077/104] KVM: TDX: Use vcpu_to_pi_desc() uniformly in posted_intr.c isaku.yamahata
2022-04-05 15:36   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 078/104] KVM: TDX: Implement interrupt injection isaku.yamahata
2022-04-06 11:47   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 079/104] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
2022-04-06 12:49   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 080/104] KVM: TDX: Implement methods to inject NMI isaku.yamahata
2022-04-06 12:47   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 081/104] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
2022-04-06 12:49   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 082/104] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 083/104] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
2022-03-21 18:32   ` Sagi Shahar
2022-03-23 17:53     ` Isaku Yamahata
2022-04-07 13:12     ` Paolo Bonzini
2022-04-08  5:34       ` Isaku Yamahata
2022-03-04 19:49 ` [RFC PATCH v5 084/104] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
2022-04-15 14:20   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 085/104] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
2022-04-15 14:29   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 086/104] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
2022-04-06 20:50   ` Sagi Shahar
2022-04-07  1:09     ` Xiaoyao Li
2022-03-04 19:49 ` [RFC PATCH v5 087/104] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
2022-04-15 14:49   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 088/104] KVM: TDX: Add TDG.VP.VMCALL accessors to access guest vcpu registers isaku.yamahata
2022-04-07  4:06   ` Kai Huang
2022-04-15 14:50   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 089/104] KVM: TDX: Add a placeholder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
2022-04-07  4:15   ` Kai Huang
2022-04-07 13:14     ` Paolo Bonzini
2022-04-07 14:39       ` Sean Christopherson
2022-04-07 18:04         ` Paolo Bonzini
2022-04-07 18:11           ` Sean Christopherson
2022-04-07 23:20             ` Kai Huang
2022-03-04 19:49 ` [RFC PATCH v5 090/104] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 091/104] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
2022-04-07 13:16   ` Paolo Bonzini
2022-04-07 14:48     ` Sean Christopherson
2022-04-07 18:03       ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 092/104] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
2022-04-07 13:56   ` Paolo Bonzini
2022-04-07 15:02     ` Sean Christopherson
2022-04-07 15:56       ` Paolo Bonzini
2022-04-07 16:08         ` Sean Christopherson
2022-04-08  4:58         ` Isaku Yamahata
2022-04-08  9:57           ` Paolo Bonzini
2022-04-08 14:51             ` Sean Christopherson
2022-04-11 17:40               ` Paolo Bonzini
2022-04-14 17:09                 ` Sean Christopherson
2022-04-07 14:51   ` Sean Christopherson
2022-03-04 19:49 ` [RFC PATCH v5 093/104] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
2022-04-15 14:59   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 094/104] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
2022-04-15 15:05   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 095/104] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
2022-04-15 15:07   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 096/104] KVM: TDX: Handle TDX PV rdmsr hypercall isaku.yamahata
2022-04-15 15:08   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 097/104] KVM: TDX: Handle TDX PV wrmsr hypercall isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 098/104] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
2022-04-15 15:13   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 099/104] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
2022-03-04 19:49 ` [RFC PATCH v5 100/104] KVM: TDX: Silently discard SMI request isaku.yamahata
2022-04-05 15:41   ` Paolo Bonzini
2022-03-04 19:49 ` [RFC PATCH v5 101/104] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
2022-04-05 15:48   ` Paolo Bonzini
2022-04-05 17:53     ` Tom Lendacky
2022-04-07 11:09     ` Xiaoyao Li
2022-04-07 12:12       ` Paolo Bonzini
2022-04-08  3:40         ` Isaku Yamahata
2022-03-04 19:49 ` [RFC PATCH v5 102/104] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
2022-04-05 15:56   ` Paolo Bonzini
2022-04-08  3:50     ` Isaku Yamahata
2022-04-12  6:49   ` Xiaoyao Li
2022-04-12  6:52     ` Paolo Bonzini
2022-04-12  7:31       ` Xiaoyao Li
2022-03-04 19:49 ` [RFC PATCH v5 103/104] Documentation/virtual/kvm: Document on Trust Domain Extensions(TDX) isaku.yamahata
2022-03-04 19:50 ` [RFC PATCH v5 104/104] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU isaku.yamahata
2022-03-07  7:44 ` [RFC PATCH v5 000/104] KVM TDX basic feature support Christoph Hellwig
2022-03-13 14:00   ` Paolo Bonzini
2022-04-15 15:18 ` Paolo Bonzini
2022-04-15 17:05   ` Paolo Bonzini
2022-04-15 21:19   ` Isaku Yamahata
