* [PATCH v11 000/113] KVM TDX basic feature support
@ 2023-01-12 16:31 isaku.yamahata
From: isaku.yamahata @ 2023-01-12 16:31 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

KVM TDX basic feature support

Hello.  This is v11 of the patch series for KVM TDX support.  It is based on
v6.2-rc3 + the following patch series.

The related patch series this one is based on:
- fd-based approach for supporting KVM, v10
  https://lore.kernel.org/lkml/20221202061347.1070246-1-chao.p.peng@linux.intel.com/
- TDX host kernel support v8
  https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com/
- KVM: Rework kvm_init() and hardware enabling v2
  https://lore.kernel.org/kvm/20221130230934.1014142-1-seanjc@google.com/

The tree can be found at https://github.com/intel/tdx/tree/kvm-upstream
How to run/test: It's described at https://github.com/intel/tdx/wiki/TDX-KVM

Major changes from v10:
- rebased to v6.2-rc3
- support MTRR with its own patches
- Integrated fd-based private page v10
- Integrated TDX host kernel support v8
- Integrated kvm_init rework v2
- removed struct tdx_td_page and its initialization logic
- cleaned up mmio spte and require enable_mmio_caching=true for TDX
- removed dubious WARN_ON_ONCE()
- split a patch adding methods as nop into several patches

Thanks,
Isaku Yamahata

Changes from v9:
- rebased to v6.1-rc2
- Integrated fd-based private page v9 as prerequisite.
- Integrated TDX host kernel support v6
- TDP MMU: Make handle_changed_spte() return a value.
- TDX: removed seamcall_lock and return -EAGAIN so that TDP MMU can retry

Changes from v8:
- rebased to v6.0-rc7
- Integrated with kvm hardware initialization.  Check that all packages have at
  least one online CPU when creating a guest TD and refuse CPU offlining while
  guest TDs are running.
- Integrated fd-based private page v8 as prerequisite.
- TDP MMU: Introduced more callbacks instead of single callback.

Changes from v7:
- Use xarray to track whether GFN is private or shared. Drop SPTE_SHARED_MASK.
  The complex state machine with SPTE_SHARED_MASK was ditched.
- Large page support is implemented. But will be posted as independent RFC patch.
- fd-based private page v7 is integrated.  This is mostly the same as Chao's
  patches.  It's in the github tree.

Changes from v6:
- rebased to v5.19

Changes from v5:
- export __seamcall and use it
- move mutex lock from callee function of smp_call_on_cpu to the caller.
- rename mmu_prezap => flush_shadow_all_private() and tdx_mmu_release_hkid
- updated comment
- drop the use of tdh_mng_key.reclaimid(): as the function is for backward
  compatibility to only return success
- struct kvm_tdx_cmd: metadata => flags, added __u64 error.
- make this ioctl systemwide ioctl
- ABI change to struct kvm_init_vm
- guest_tsc_khz: use kvm->arch.default_tsc_khz
- rename BUILD_BUG_ON_MEMCPY to MEMCPY_SAME_SIZE
- drop exporting kvm_set_tsc_khz().
- fix kvm_tdp_page_fault() for mtrr emulation
- rename it to kvm_gfn_shared_mask(), dropped kvm_gpa_shared_mask()
- drop kvm_is_private_gfn(), kept kvm_is_private_gpa()
  keep kvm_{gfn, gpa}_private(), kvm_gpa_private()
- update commit message
- rename shadow_init_value => shadow_nonpresent_value
- added ept_violation_ve_test mode
- shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in tdp_mmu.c
- legacy MMU case
  => - mmu_topup_shadow_page_cache(), kvm_mmu_create()
     - FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
- #VE warning:
- rename: REMOVED_SPTE => __REMOVED_SPTE, SHADOW_REMOVED_SPTE => REMOVED_SPTE
- merged this patch into "KVM: x86/mmu: Allow non-zero init value for shadow
  PTE" as we discussed
- fix issue pointed out by Sagi: the !is_private check =>
  (kvm_gfn_shared_mask && !is_private)
- introduce kvm_gfn_for_root(kvm, root, gfn)
- add only_shared argument to kvm_tdp_mmu_handle_gfn()
- use kvm_arch_dirty_log_supported()
- rename SPTE_PRIVATE_PROHIBIT to SPTE_SHARED_MASK.
- rename: is_private_prohibit_spte() => spte_shared_mask()
- fix: shadow_nonpresent_value => SHADOW_NONPRESENT_VALUE in comment
- dropped this patch as the change was merged into kvm/queue
- update vt_apicv_post_state_restore()
- use is_64_bit_hypercall()
- comment: expand MSMI -> Machine Check System Management Interrupt
- fixed TDX_SEPT_PFERR
- tdvmcall_p[1234]_{write, read}() => tdvmcall_a[0123]_{read,write}()
- rename tdvmcall_exit_reason() => tdvmcall_leaf()
- remove optional zero check of argument.
- do a check for static_call(kvm_x86_has_emulated_msr)(kvm, MSR_IA32_SMBASE)
   in kvm_vcpu_ioctl_smi and __apic_accept_irq.
- WARN_ON_ONCE in tdx_smi_allowed and tdx_enable_smi_window.
- introduce vcpu_deliver_init to x86_ops
- sprinkled KVM_BUG_ON()

Changes from v4:
- rebased to TDX host kernel patch series.
- include all the patches to make this patch series working.
- add [MARKER] patches to clearly mark each patch layer.

---
* What's TDX?
TDX stands for Trust Domain Extensions, which extends Intel Virtual Machine
Extensions (VMX) to introduce a kind of virtual machine guest called a Trust
Domain (TD) for confidential computing.

A TD runs in a CPU mode that is designed to protect the confidentiality of its
memory contents and its CPU state from any other software, including the hosting
Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.

We have more detailed explanations below (***).
We have the high-level design of TDX KVM below (****).

In this patch series, we use "TD" or "guest TD" to differentiate it from the
current "VM" (Virtual Machine), which is supported by KVM today.


* The organization of this patch series
This patch series is on top of the patches series "TDX host kernel support":
https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/

This patch series is available at
https://github.com/intel/tdx/tree/kvm-upstream

The related repositories (TDX qemu, TDX OVMF(tdvf) etc) are described at
https://github.com/intel/tdx/wiki/TDX-KVM

The relations of the layers are depicted as follows.
The arrows below show the order of patch reviews we would like to have.

The layers below are chosen so that the device model, for example qemu, can
exercise each layer step by step: check if TDX is supported, create a TD VM,
create a TD vcpu, allow vcpu running, populate TD guest private memory, and
handle vcpu exits/hypercalls/interrupts to run the TD fully.

  TDX vcpu
  interrupt/exits/hypercall<------------\
        ^                               |
        |                               |
  TD finalization                       |
        ^                               |
        |                               |
  TDX EPT violation<------------\       |
        ^                       |       |
        |                       |       |
  TD vcpu enter/exit            |       |
        ^                       |       |
        |                       |       |
  TD vcpu creation/destruction  |       \-------KVM TDP MMU MapGPA
        ^                       |                       ^
        |                       |                       |
  TD VM creation/destruction    \---------------KVM TDP MMU hooks
        ^                                               ^
        |                                               |
  TDX architectural definitions                 KVM TDP refactoring for TDX
        ^                                               ^
        |                                               |
   TDX, VMX    <--------TDX host kernel         KVM MMU GPA stolen bits
   coexistence          support


The following are explanations of each layer.  Each layer has a dummy commit
that starts with [MARKER] in the subject.  It is intended to help identify where
each layer starts.

TDX host kernel support:
        https://lore.kernel.org/lkml/cover.1646007267.git.kai.huang@intel.com/
        The guts of system-wide initialization of the TDX module.  It is an
        independent patch series for host x86.  The TDX KVM patches call the
        functions this patch series provides to initialize the TDX module.

TDX, VMX coexistence:
        Infrastructure to allow TDX to coexist with VMX and trigger the
        initialization of the TDX module.
        This layer starts with
        "KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX"
TDX architectural definitions:
        Add TDX architectural definitions and helper functions
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TDX architectural definitions".
TD VM creation/destruction:
        Guest TD creation/destruction: allocation and release of the TDX
        specific VM structure.  Create an initial guest memory image with TDX
        measurement.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD VM creation/destruction".
TD vcpu creation/destruction:
        Guest TD vcpu creation/destruction: allocation and release of the TDX
        specific vcpu structure.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction"
TDX EPT violation:
        Create an initial guest memory image with TDX measurement.  Handle
        secure EPT violations to populate guest pages with TDX SEAMCALLs.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TDX EPT violation"
TD vcpu enter/exit:
        Allow TDX vcpu to enter into TD and exit from TD.  Save CPU state before
        entering into TD.  Restore CPU state after exiting from TD.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu enter/exit"
TD vcpu interrupts/exit/hypercall:
        Handle various exits/hypercalls and allow interrupts to be injected so
        that TD vcpu can continue running.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls"

KVM MMU GPA shared bit:
        Introduce a framework to handle the shared bit of the GPA.  TDX
        repurposes one bit of the GPA to indicate whether it is shared or
        private.  If it's shared, it's the same as the conventional VMX EPT
        case and the VMM can access shared guest pages.  If it's private, it's
        handled by Secure EPT and the guest page is encrypted.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM MMU GPA stolen bits"
KVM TDP refactoring for TDX:
        TDX Secure EPT requires different constants, e.g. the initial EPT
        entry value.  Various refactoring for those differences.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX"
KVM TDP MMU hooks:
        Introduce a framework for the TDP MMU to add hooks in addition to
        direct EPT access.  TDX added Secure EPT, which is an enhancement to
        VMX EPT.  Unlike conventional VMX EPT, the CPU can't directly
        read/write Secure EPT.  Instead, TDX SEAMCALLs are used to operate on
        Secure EPT.
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks"
KVM TDP MMU MapGPA:
        Introduce a framework to handle switching guest pages between private
        and shared.  For a given GPA, a guest page can be assigned to either a
        private GPA or a shared GPA exclusively.  With the TDX MapGPA
        hypercall, the guest TD converts GPA assignments from private (or
        shared) to shared (or private).
        This layer starts with
        "[MARKER] The start of TDX KVM patch series: KVM TDP MMU MapGPA "

KVM guest private memory: (not shown in the above diagram)
[PATCH v4 00/12] KVM: mm: fd-based approach for supporting KVM guest private
memory: https://lkml.org/lkml/2022/1/18/395
        Guest private memory requires different memory management in KVM.  The
        patch series proposes a way to do it and is integrated with TDX KVM.

(***)
* TDX module
A CPU-attested software module called the "TDX module" is designed to implement
the TDX architecture, and it is loaded by the UEFI firmware today. It can be
loaded by the kernel or driver at runtime, but in this patch series we assume
that the TDX module is already loaded and initialized.

The TDX module provides two main new logical modes of operation built upon the
new SEAM (Secure Arbitration Mode) root and non-root CPU modes added to the VMX
architecture. TDX root mode is mostly identical to the VMX root operation mode,
and the TDX functions (described later) are triggered by the new SEAMCALL
instruction with the desired interface function selected by an input operand
(leaf number, in RAX). TDX non-root mode is used for TD guest operation.  TDX
non-root operation (i.e. "guest TD" mode) is similar to the VMX non-root
operation (i.e. guest VM), with changes and restrictions to better assure that
no other software or hardware has direct visibility of the TD memory and state.

TDX transitions between TDX root operation and TDX non-root operation include TD
Entries, from TDX root to TDX non-root mode, and TD Exits from TDX non-root to
TDX root mode.  A TD Exit might be asynchronous, triggered by some external
event (e.g., external interrupt or SMI) or an exception, or it might be
synchronous, triggered by a TDCALL (TDG.VP.VMCALL) function.

TD VCPUs can be entered using SEAMCALL(TDH.VP.ENTER) by KVM. TDH.VP.ENTER is one
of the TDX interface functions as mentioned above, and "TDH" stands for Trust
Domain Host. Those host-side TDX interface functions are categorized into
various areas just for better organization, such as SYS (TDX module management),
MNG (TD management), VP (VCPU), PHYSMEM (physical memory), MEM (private memory),
etc. For example, SEAMCALL(TDH.SYS.INFO) returns the TDX module information.

TDCS (Trust Domain Control Structure) is the main control structure of a guest
TD, and encrypted (using the guest TD's ephemeral private key).  At a high
level, TDCS holds information for controlling TD operation as a whole,
execution, EPTP, MSR bitmaps, etc., that KVM needs to set up.  Note that MSR
bitmaps are held as part of TDCS (unlike VMX) because they are meant to have the
same value for all VCPUs of the same TD.

Trust Domain Virtual Processor State (TDVPS) is the root control structure of a
TD VCPU.  It helps the TDX module control the operation of the VCPU, and holds
the VCPU state while the VCPU is not running. TDVPS is opaque to software and
DMA access, accessible only by using the TDX module interface functions (such as
TDH.VP.RD, TDH.VP.WR). TDVPS includes TD VMCS, and TD VMCS auxiliary structures,
such as virtual APIC page, virtualization exception information, etc.

Several VMX control structures (such as Shared EPT and Posted interrupt
descriptor) are directly managed and accessed by the host VMM.  These control
structures are pointed to by fields in the TD VMCS.

The above means that 1) KVM needs to allocate different data structures for
TDs, 2) KVM can reuse the existing code for some operations on TDs, and 3) it
needs to define TD-specific handling for other operations, redirecting them to
the TDX specific callbacks, like "if (is_td_vcpu(vcpu)) tdx_callback() else
vmx_callback();".

* TD Private Memory
TD private memory is designed to hold TD private content, encrypted by the CPU
using the TD ephemeral key. An encryption engine holds a table of encryption
keys, and an encryption key is selected for each memory transaction based on a
Host Key Identifier (HKID). By design, the host VMM does not have access to the
encryption keys.

In the first generation of MKTME, HKID is "stolen" from the physical address by
allocating a configurable number of bits from the top of the physical
address. The HKID space is partitioned into shared HKIDs for legacy MKTME
accesses and private HKIDs for SEAM-mode-only accesses. We use 0 for the shared
HKID on the host so that MKTME can be opaque or bypassed on the host.

During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
as either shared or private, based on the value of a new SHARED bit in the Guest
Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
(Extended Page Table) or "Shared EPT" (in this document), which resides in host
VMM memory. The Shared EPT is directly managed by the host VMM - the same as
with the current VMX.  Since guest TDs usually require I/O and the data exchange
needs to be done via shared memory, KVM needs to use the current EPT
functionality even for TDs.

* Secure EPT and Mirroring using the TDP code
The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
pages are encrypted and integrity-protected with the TD's ephemeral private
key.  Secure EPT can be managed _indirectly_ by the host VMM, using the TDX
interface functions, and thus conceptually Secure EPT is a subset of EPT.
Since execution of such interface functions takes a much longer time than
accessing memory directly, in KVM we use the existing TDP code to mirror the
Secure EPT for the TD.

This way, we can effectively walk Secure EPT without using the TDX interface
functions.

* VM life cycle and TDX specific operations
The userspace VMM, such as QEMU, needs to build and treat TDs differently.  For
example, a TD needs to boot in private memory, and the host software cannot copy
the initial image to private memory.

* TSC Virtualization
The TDX module helps TDs maintain reliable TSC (Time Stamp Counter) values
(e.g. consistent among the TD VCPUs) and the virtual TSC frequency is determined
by TD configuration, i.e. when the TD is created, not per VCPU.  Currently KVM
owns TSC virtualization for VMs, but the TDX module does so for TDs.

* MCE support for TDs
The TDX module doesn't allow the VMM to inject MCE.  Instead, a PV way is
needed for the TD to communicate with the VMM.  For now, KVM silently ignores
MCE requests by the VMM.  MSRs related to MCE (e.g., MCE bank registers) can be
naturally emulated by paravirtualizing MSR access.

For details, the specifications [1], [2], [3], [4], [5], [6], [7] are
available.

* Restrictions or future work
Some features are not included to reduce patch size.  Those features are
addressed as future independent patch series.
- large page (2M, 1G)
- qemu gdb stub
- guest PMU
- and more

* Prerequisites
It's required to load the TDX module and initialize it.  That is out of the
scope of this patch series.  Another independent patch series for the common x86
code is planned.  It defines CONFIG_INTEL_TDX_HOST, and this patch series uses
CONFIG_INTEL_TDX_HOST.  It's assumed that with CONFIG_INTEL_TDX_HOST=y, the TDX
module is initialized and the TDX module APIs for the TDX guest life cycle, like
tdh.mng.init, are ready for KVM to use.

Concretely, global initialization, LP (Logical Processor) initialization,
global configuration, key configuration, and TDMR and PAMT initialization are
done.  The state of the TDX module is SYS_READY.  Please refer to the TDX module
specification, the chapter on the Intel TDX Module Lifecycle State Machine.

** Detecting the TDX module readiness.
The TDX host patch series implements detection of the TDX module availability
and its initialization so that KVM can use it.  It also manages the Host KeyIDs
(HKIDs) assigned to guest TDs.
The APIs the TDX host patch series is assumed to provide are listed below (a
usage sketch follows the list):
- const struct tdsysinfo_struct *tdx_get_sysinfo(void);
  Return system-wide information about the TDX module.  NULL if the TDX module
  isn't initialized.
- int tdx_enable(void);
  Initialize the TDX module so that it is ready for KVM to use.
- extern u32 tdx_global_keyid __read_mostly;
  Global host key ID that is used for the TDX module itself.
- u32 tdx_get_num_keyid(void);
  Return the number of available TDX private host key IDs.
- int tdx_keyid_alloc(void);
  Allocate an HKID for a guest TD.
- void tdx_keyid_free(int keyid);
  Free the HKID of a guest TD.
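
The sketch below is illustrative only and is not code from this series: the
example_* helpers are hypothetical, and only the tdx_* calls come from the
assumed API list above (declarations assumed to live in <asm/tdx.h>).

  #include <linux/errno.h>
  #include <asm/tdx.h>

  static const struct tdsysinfo_struct *tdx_sysinfo;

  static int example_tdx_hardware_setup(void)
  {
          int r;

          /* Initialize the TDX module; fails if TDX is unusable on this host. */
          r = tdx_enable();
          if (r)
                  return r;

          /* Cache the system-wide TDX module information for later use. */
          tdx_sysinfo = tdx_get_sysinfo();
          if (!tdx_sysinfo)
                  return -ENODEV;

          return 0;
  }

  static int example_td_create(void)
  {
          /* Each guest TD needs its own private host key ID (HKID). */
          int hkid = tdx_keyid_alloc();

          if (hkid < 0)
                  return hkid;

          /* ... configure the TD with this HKID via SEAMCALLs ... */
          return hkid;
  }

  static void example_td_destroy(int hkid)
  {
          /* Return the HKID to the pool once the TD is torn down. */
          tdx_keyid_free(hkid);
  }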

(****)
* TDX KVM high-level design
- Host key ID management
Host Key ID (HKID) needs to be assigned to each TDX guest for memory encryption.
It is assumed that the TDX host patch series implements the necessary
functions: u32 tdx_get_global_keyid(void), int tdx_keyid_alloc(void), and
void tdx_keyid_free(int keyid).

- Data structures and VM type
Because TDX is different from VMX, define its own VM/VCPU structures, struct
kvm_tdx and struct vcpu_tdx, instead of struct kvm_vmx and struct vcpu_vmx (a
shape-only sketch follows).  To identify the VM, introduce a VM type to specify
which type, VMX (default) or TDX, is used.
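
A shape-only sketch (the real definitions live in this series'
arch/x86/kvm/vmx/tdx.h; the member comments are illustrative):

  #include <linux/kvm_host.h>

  struct kvm_tdx {
          struct kvm kvm;
          /* TDX specific per-VM state (e.g. HKID, TDCS pages) follows. */
  };

  struct vcpu_tdx {
          struct kvm_vcpu vcpu;
          /* TDX specific per-vcpu state (e.g. TDVPS pages) follows. */
  };

  /* Like to_kvm_vmx()/to_vmx(), recover the TDX variant with container_of(). */
  static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
  {
          return container_of(kvm, struct kvm_tdx, kvm);
  }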

- VM life cycle and TDX specific operations
Re-purpose the existing KVM_MEMORY_ENCRYPT_OP to add TDX specific operations.
New commands are used to get the TDX system parameters, set TDX specific VM/VCPU
parameters, and set the initial guest memory and measurement.

The creation of a TDX VM requires five new operations in addition to the
conventional VM creation flow (a hypothetical userspace sequence is sketched
after the list):
  - Get KVM system capability to check if TDX VM type is supported
  - VM creation (KVM_CREATE_VM)
  - New: Get the TDX specific system parameters.  KVM_TDX_GET_CAPABILITY.
  - New: Set TDX specific VM parameters.  KVM_TDX_INIT_VM.
  - VCPU creation (KVM_CREATE_VCPU)
  - New: Set TDX specific VCPU parameters.  KVM_TDX_INIT_VCPU.
  - New: Initialize guest memory as boot state and extend the measurement with
    the memory.  KVM_TDX_INIT_MEM_REGION.
  - New: Finalize VM. KVM_TDX_FINALIZE. Complete measurement of the initial
    TDX VM contents.
  - VCPU RUN (KVM_VCPU_RUN)
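
A hypothetical userspace sequence (error handling, declarations, and command
payload setup elided; tdx_vm_type and the *_cmd variables are placeholders, the
sub-command payloads are carried in struct kvm_tdx_cmd as defined by this
series' uapi headers, and vcpu-scoped commands are assumed to target the vcpu
fd):

  int kvm_fd  = open("/dev/kvm", O_RDWR);
  int vm_fd   = ioctl(kvm_fd, KVM_CREATE_VM, tdx_vm_type);   /* TDX VM type */

  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &caps_cmd);        /* KVM_TDX_GET_CAPABILITY */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &init_vm_cmd);     /* KVM_TDX_INIT_VM */

  int vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
  ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &init_vcpu_cmd); /* KVM_TDX_INIT_VCPU */

  /* Populate and measure the initial guest image, one region at a time. */
  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &init_mem_region_cmd); /* KVM_TDX_INIT_MEM_REGION */

  ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &finalize_cmd);    /* KVM_TDX_FINALIZE */

  ioctl(vcpu_fd, KVM_RUN, NULL);                          /* run the TD vcpu */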

- Protected guest state
Because the guest state (CPU state and guest memory) is protected, the KVM VMM
can't operate on it, for example by accessing CPU registers, injecting
exceptions, or accessing guest memory.  Those operations are silently ignored,
returning zero or the initial reset value when requested via KVM API ioctls.

    VM/VCPU state and callbacks for TDX specific operations:
    Define TDX specific VM state and VCPU state instead of the VMX ones.
    Redirect operations to the TDX specific callbacks.  "if (tdx) tdx_op() else
    vmx_op()".

    Operations on the CPU state:
    Silently ignore operations on the guest state (see the sketch after this
    section).  For example, a write to CPU registers is ignored and a read from
    CPU registers returns 0.

    . ignore access to CPU registers except for allowed ones.
    . TSC: add a check whether the TSC is immutable and return an error.  The
      KVM implementation updates the internal TSC state and it's difficult to
      back out those changes.  Instead, skip the logic.
    . dirty logging: add a check whether dirty logging is supported.
    . exceptions/SMI/MCE/SIPI/INIT: silently ignore

    Note: virtual external interrupt and NMI can be injected into TDX guests.
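
    For illustration only (not the exact code in this series), such a
    register-access handler could look like:

      static void example_td_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
      {
              if (WARN_ON_ONCE(reg >= NR_VCPU_REGS))
                      return;

              /* The TD's register file is opaque to KVM: reads report 0. */
              kvm_register_mark_available(vcpu, reg);
              vcpu->arch.regs[reg] = 0;
      }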

- KVM MMU integration
One bit of the guest physical address (bit 51 or 47) is repurposed to indicate if
the guest physical address is private (the bit is cleared) or shared (the bit is
set).  The bits are called stolen bits.

  - Stolen bits framework
    systematically tracks which guest physical address, shared or private, is
    used.

  - Shared EPT and Secure EPT
    There are two EPTs: Shared EPT (the conventional one) and Secure EPT (the
    new one).  Shared EPT handles GPAs with the stolen bit set, the same as the
    conventional VMX case.  Secure EPT points to private guest pages.  To
    resolve an EPT violation, KVM walks one of the two EPTs based on the
    faulted GPA (a sketch of the private-vs-shared check follows this item).
    Because it's costly to access Secure EPT with SEAMCALLs while walking EPTs
    for a private guest physical address, another private EPT is used as a
    shadow of Secure EPT with the existing logic, at the cost of extra memory.
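
    A sketch of the private-vs-shared check, using the kvm_gfn_shared_mask()
    helper this series introduces (the exact implementation may differ):

      static bool example_is_private_gpa(struct kvm *kvm, gpa_t gpa)
      {
              gfn_t mask = kvm_gfn_shared_mask(kvm);

              /* Private iff a shared mask exists and the shared bit is clear. */
              return mask && !(gpa_to_gfn(gpa) & mask);
      }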

The following depicts the relationship.

                    KVM                             |       TDX module
                     |                              |           |
        -------------+----------                    |           |
        |                      |                    |           |
        V                      V                    |           |
     shared GPA           private GPA               |           |
  CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
        |                      |                    |           |
        |                      |                    |           |
        V                      V                    |           V
  shared EPT                private EPT--------mirror----->Secure EPT
        |                      |                    |           |
        |                      \--------------------+------\    |
        |                                           |      |    |
        V                                           |      V    V
  shared guest page                                 |    private guest page
                                                    |
                                                    |
                              non-encrypted memory  |    encrypted memory
                                                    |

  - Operating on Secure EPT
    Use the TDX module APIs to operate on Secure EPT.  To call the TDX APIs
    while resolving an EPT violation, add hooks for the additional operations
    and wire them to the TDX backend (a conceptual sketch follows).
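
    Conceptual sketch only: the hook shape and the example_tdx_* helper are
    illustrative names, not the exact callbacks this series adds, while
    is_td() is a helper from this series.

      static void example_set_private_spte(struct kvm *kvm, gfn_t gfn,
                                           enum pg_level level, kvm_pfn_t pfn)
      {
              /*
               * The shared EPT can be written directly, but a Secure EPT
               * change must go through a SEAMCALL (e.g. TDH.MEM.PAGE.ADD or
               * TDH.MEM.PAGE.AUG) issued by the TDX backend.
               */
              if (is_td(kvm))
                      example_tdx_map_private_page(kvm, gfn, level, pfn);
      }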

* References

[1] TDX specification
   https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html
[2] Intel Trust Domain Extensions (Intel TDX)
   https://cdrdv2.intel.com/v1/dl/getContent/726790
[3] Intel CPU Architectural Extensions Specification
   https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-cpu-architectural-specification.pdf
[4] Intel TDX Module 1.0 Specification
   https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-module-1.0-public-spec-v0.931.pdf
[5] Intel TDX Loader Interface Specification
  https://www.intel.com/content/dam/develop/external/us/en/documents-tps/intel-tdx-seamldr-interface-specification.pdf
[6] Intel TDX Guest-Hypervisor Communication Interface
   https://cdrdv2.intel.com/v1/dl/getContent/726790
[7] Intel TDX Virtual Firmware Design Guide
   https://www.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.01.pdf
[8] intel public github
   kvm TDX branch: https://github.com/intel/tdx/tree/kvm
   TDX guest branch: https://github.com/intel/tdx/tree/guest
   qemu TDX https://github.com/intel/qemu-tdx
[9] TDVF
    https://github.com/tianocore/edk2-staging/tree/TDVF
    This was merged into EDK2 main branch. https://github.com/tianocore/edk2

Chao Gao (1):
  KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o
    wrmsr

Chao Peng (1):
  KVM: TDX: Use private memory for TDX

Isaku Yamahata (88):
  KVM: x86/vmx: Refactor KVM VMX module init/exit functions
  KVM: TDX: Add placeholders for TDX VM/vcpu structure
  KVM: TDX: Initialize the TDX module when loading the KVM intel kernel
    module
  KVM: TDX: Make TDX VM type supported
  [MARKER] The start of TDX KVM patch series: TDX architectural
    definitions
  KVM: TDX: Define TDX architectural definitions
  KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
  KVM: TDX: Add helper functions to print TDX SEAMCALL error
  [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
  x86/cpu: Add helper functions to allocate/free TDX private host key id
  x86/virt/tdx: Add a helper function to return system wide info about
    TDX module
  KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  KVM: TDX: initialize VM with TDX specific parameters
  KVM: TDX: Make pmu_intel.c ignore guest TD case
  KVM: TDX: Refuse to unplug the last cpu on the package
  [MARKER] The start of TDX KVM patch series: TD vcpu
    creation/destruction
  KVM: TDX: allocate/free TDX vcpu structure
  KVM: TDX: Do TDX specific vcpu initialization
  [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
  KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  KVM: x86/mmu: Add address conversion functions for TDX shared bit of
    GPA
  [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for
    TDX
  KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask
  KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
  KVM: x86/mmu: Disallow fast page fault on private GPA
  KVM: VMX: Introduce test mode related to EPT violation VE
  [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
  KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at
    allocation
  KVM: x86/mmu: Require TDP MMU for TDX
  KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role
  KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
  KVM: Add flags to struct kvm_gfn_range
  KVM: x86/tdp_mmu: Make handle_changed_spte() return value
  KVM: x86/mmu: Make make_spte() aware of shared GPA for MTRR
  KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  [MARKER] The start of TDX KVM patch series: TDX EPT violation
  KVM: x86/mmu: Disallow dirty logging for x86 TDX
  KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  KVM: x86/VMX: introduce vmx tlb_remote_flush and
    tlb_remote_flush_with_range
  KVM: TDX: TDP MMU TDX support
  KVM: TDX: MTRR: implement get_mt_mask() for TDX
  [MARKER] The start of TDX KVM patch series: TD finalization
  KVM: TDX: Create initial guest memory
  KVM: TDX: Finalize VM initialization
  [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
  KVM: TDX: Add helper assembly function to TDX vcpu
  KVM: TDX: Implement TDX vcpu enter/exit path
  KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  KVM: TDX: restore host xsave state when exit from the guest TD
  KVM: TDX: restore user ret MSRs
  [MARKER] The start of TDX KVM patch series: TD vcpu
    exits/interrupts/hypercalls
  KVM: TDX: complete interrupts after tdexit
  KVM: TDX: restore debug store when TD exit
  KVM: TDX: handle vcpu migration over logical processor
  KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched
    behavior
  KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
  KVM: TDX: Implement interrupt injection
  KVM: TDX: Implements vcpu request_immediate_exit
  KVM: TDX: Implement methods to inject NMI
  KVM: TDX: Add a place holder to handle TDX VM exit
  KVM: TDX: handle EXIT_REASON_OTHER_SMI
  KVM: TDX: handle ept violation/misconfig exit
  KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
  KVM: TDX: Add a place holder for handler of TDX hypercalls
    (TDG.VP.VMCALL)
  KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
  KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL
  KVM: TDX: Handle TDX PV CPUID hypercall
  KVM: TDX: Handle TDX PV HLT hypercall
  KVM: TDX: Handle TDX PV port io hypercall
  KVM: TDX: Implement callbacks for MSR operations for TDX
  KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
  KVM: TDX: Handle TDX PV report fatal error hypercall
  KVM: TDX: Handle TDX PV map_gpa hypercall
  KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
  KVM: TDX: Silently discard SMI request
  KVM: TDX: Silently ignore INIT/SIPI
  KVM: TDX: Add methods to ignore guest instruction emulation
  KVM: TDX: Add a method to ignore dirty logging
  KVM: TDX: Add methods to ignore VMX preemption timer
  KVM: TDX: Add methods to ignore accesses to TSC
  KVM: TDX: Ignore setting up mce
  KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch
  KVM: TDX: Add methods to ignore virtual apic related operation
  Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
  KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
  [MARKER] the end of (the first phase of) TDX KVM patch series

Sean Christopherson (20):
  KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
  KVM: x86: Introduce vm_type to differentiate default VMs from
    confidential VMs
  KVM: TDX: Add TDX "architectural" error codes
  KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
  KVM: TDX: create/destroy VM structure
  KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed
    SPTE
  KVM: x86/mmu: Allow per-VM override of the TDP max page level
  KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases
  KVM: VMX: Split out guts of EPT violation to common/exposed function
  KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  KVM: TDX: Add accessors VMX VMCS helpers
  KVM: TDX: Add load_mmu_pgd method for TDX
  KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  KVM: TDX: Add support for find pending IRQ in a protected local APIC
  KVM: x86: Assume timer IRQ was injected if APIC state is proteced
  KVM: VMX: Modify NMI and INTR handlers to take intr_info as function
    argument
  KVM: VMX: Move NMI/exception handler to common helper
  KVM: x86: Split core of hypercall emulation to helper function
  KVM: TDX: Handle TDX PV MMIO hypercall
  KVM: TDX: Add methods to ignore accesses to CPU state

Yan Zhao (1):
  KVM: x86/mmu: TDX: Do not enable page track for TD guest

Yao Yuan (1):
  KVM: TDX: Handle vmentry failure for INTEL TD guest

Yuan Yao (1):
  KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT

 Documentation/virt/kvm/api.rst         |   30 +-
 Documentation/virt/kvm/index.rst       |    3 +
 Documentation/virt/kvm/intel-tdx.rst   |  347 +++
 Documentation/virt/kvm/tdx-tdp-mmu.rst |  417 ++++
 arch/x86/events/intel/ds.c             |    1 +
 arch/x86/include/asm/kvm-x86-ops.h     |   13 +
 arch/x86/include/asm/kvm_host.h        |   75 +-
 arch/x86/include/asm/tdx.h             |   74 +-
 arch/x86/include/asm/vmx.h             |   14 +
 arch/x86/include/uapi/asm/kvm.h        |   93 +
 arch/x86/include/uapi/asm/vmx.h        |    5 +-
 arch/x86/kvm/Kconfig                   |    4 +
 arch/x86/kvm/Makefile                  |    3 +-
 arch/x86/kvm/irq.c                     |    3 +
 arch/x86/kvm/kvm_onhyperv.c            |    5 +-
 arch/x86/kvm/kvm_onhyperv.h            |    1 +
 arch/x86/kvm/lapic.c                   |   33 +-
 arch/x86/kvm/lapic.h                   |    2 +
 arch/x86/kvm/mmu.h                     |   36 +
 arch/x86/kvm/mmu/mmu.c                 |  222 +-
 arch/x86/kvm/mmu/mmu_internal.h        |  108 +-
 arch/x86/kvm/mmu/page_track.c          |    3 +
 arch/x86/kvm/mmu/paging_tmpl.h         |    3 +-
 arch/x86/kvm/mmu/spte.c                |   22 +-
 arch/x86/kvm/mmu/spte.h                |   29 +-
 arch/x86/kvm/mmu/tdp_iter.h            |   14 +-
 arch/x86/kvm/mmu/tdp_mmu.c             |  460 +++-
 arch/x86/kvm/mmu/tdp_mmu.h             |    7 +-
 arch/x86/kvm/smm.h                     |    7 +-
 arch/x86/kvm/svm/svm.c                 |    7 +
 arch/x86/kvm/svm/svm_onhyperv.h        |    1 +
 arch/x86/kvm/vmx/common.h              |  174 ++
 arch/x86/kvm/vmx/main.c                | 1118 ++++++++++
 arch/x86/kvm/vmx/pmu_intel.c           |   46 +-
 arch/x86/kvm/vmx/pmu_intel.h           |   28 +
 arch/x86/kvm/vmx/posted_intr.c         |   43 +-
 arch/x86/kvm/vmx/posted_intr.h         |   13 +
 arch/x86/kvm/vmx/tdx.c                 | 2753 ++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h                 |  257 +++
 arch/x86/kvm/vmx/tdx_arch.h            |  166 ++
 arch/x86/kvm/vmx/tdx_errno.h           |   38 +
 arch/x86/kvm/vmx/tdx_error.c           |   21 +
 arch/x86/kvm/vmx/tdx_ops.h             |  218 ++
 arch/x86/kvm/vmx/vmcs.h                |    5 +
 arch/x86/kvm/vmx/vmenter.S             |  156 ++
 arch/x86/kvm/vmx/vmx.c                 |  727 +++----
 arch/x86/kvm/vmx/vmx.h                 |   52 +-
 arch/x86/kvm/vmx/x86_ops.h             |  256 +++
 arch/x86/kvm/x86.c                     |  142 +-
 arch/x86/kvm/x86.h                     |    2 +
 arch/x86/virt/vmx/tdx/seamcall.S       |    2 +
 arch/x86/virt/vmx/tdx/tdx.c            |   56 +-
 arch/x86/virt/vmx/tdx/tdx.h            |   51 -
 include/linux/kvm_host.h               |   11 +-
 include/uapi/linux/kvm.h               |   58 +
 tools/arch/x86/include/uapi/asm/kvm.h  |   95 +
 tools/include/uapi/linux/kvm.h         |    1 +
 virt/kvm/kvm_main.c                    |   50 +-
 58 files changed, 7823 insertions(+), 758 deletions(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst
 create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst
 create mode 100644 arch/x86/kvm/vmx/common.h
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/pmu_intel.h
 create mode 100644 arch/x86/kvm/vmx/tdx.c
 create mode 100644 arch/x86/kvm/vmx/tdx.h
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h


base-commit: 40c18f363a0806d4f566e8a9a9bd2d7766a72cf5
prerequisite-patch-id: cb32f271215dd2080619b4a794331c8b1c518ae6
prerequisite-patch-id: 57f151e3b417beb4764b2f479e2c2cee9c24f113
prerequisite-patch-id: 3c93e412ef811eb92d0c9e7442108e57f4c0161d
prerequisite-patch-id: 9acd31e1c509affb363584255f595df52bc4aed0
prerequisite-patch-id: 63aec280487c8a9b2d47a80181b51d1f75869d47
prerequisite-patch-id: 748006fa086b15f077d6fe72ef1dd45b5d97f4d5
prerequisite-patch-id: 05377b9d771a7d1f9793d3eca9ecb18d07186c1e
prerequisite-patch-id: d89ca3c47a1995e7977058f4a2a59619bc283d9f
prerequisite-patch-id: 7817b1c896f650b8e8649028150fd6e076b35f61
prerequisite-patch-id: b9513f1b50aa3dfb13737677a0515820dcaf4a1e
prerequisite-patch-id: 8e65b22aaa8336bb25f2f368a9923e04630e626b
prerequisite-patch-id: b709d3632f19c91c124ddce7261f6eaf4853144e
prerequisite-patch-id: df4a8604d9d10328fc5996dc5133061dbd4b00c7
prerequisite-patch-id: 1c206d7035a2978c1c09521b78dd4ea81665d34a
prerequisite-patch-id: 56a319ef5c97bd66ed952513358a2d45e9f59606
prerequisite-patch-id: 343e636e705b0ee6944efbc35d457d4ff898c4b3
prerequisite-patch-id: f10c8636aaeb0fb9b7b47a1111b9c4b6ba12b1c1
prerequisite-patch-id: 003948d57f9828bbb460b680c386e30c6331c68c
prerequisite-patch-id: dc172265f050831b96447522dc8fab4231fd50d9
prerequisite-patch-id: 8d1ef7b82aba942d60171ed0877479d2a4297404
prerequisite-patch-id: e78505575e9e9f08d0166fa3eaf05c140929e319
prerequisite-patch-id: bcd4a65da6528d07f1edd628ec7f1d8df66222a0
prerequisite-patch-id: 2597d4d142448ff4796f4b9fcc5ab2e9873ae391
prerequisite-patch-id: 73023e0f63805ae226ffffe6af0798adb14c9e63
prerequisite-patch-id: daaff2e774b9e2aa139fbf89486650acc815cfa4
prerequisite-patch-id: 027b97fe2a2d10188ad1b18154a5b3a494fcdb3b
prerequisite-patch-id: 945c0546972666678bfb2b5c7737370e9f0897cb
prerequisite-patch-id: 94475f3a196844c29c9ffcfb728cecbc2baa0900
prerequisite-patch-id: 4eb88b47e0ed46c9034008bb5c9350c9e940c454
prerequisite-patch-id: 261cfb768e11e24317a4113f3cec3e65ba021696
prerequisite-patch-id: 8fbfa7cdd01d715cb8bbfe20041d0fecc691f72d
prerequisite-patch-id: c0a4c61d4e210aa0c691b26b3d4a0ab06efa617d
prerequisite-patch-id: 66322a79e5a22623c2398e9663ad23597b11347d
prerequisite-patch-id: 569bc795ecfd8c434c75bbc0202fa058c423e40e
prerequisite-patch-id: bb6bb2b65db2bc477c4b2e0e4144b9606db713a6
prerequisite-patch-id: 982f8129cad940137f6cf2c6234300b2e7d60c80
prerequisite-patch-id: f4fa117498ff3dd92db3b47c2d92e42a3d5c20f1
prerequisite-patch-id: 6f33a6b551b2c459a0083a9ec9a73fe51eb8bce1
prerequisite-patch-id: 8cc565c48b64291afb9cc817671d046cdc28a4ac
prerequisite-patch-id: 27f2d18b5ce2af402462117b522ad5a827ba3e82
prerequisite-patch-id: 40e5e3850a9402a43c31d4305f9c2e8051026328
prerequisite-patch-id: 871f2d5e5df13094f62fb8a52cb7d88ebb6146ba
prerequisite-patch-id: 176eac82fd72836b18ae8d50460f194421a00a02
prerequisite-patch-id: d38e46840d508967d1566c02ca5d1fecc39ad4ef
prerequisite-patch-id: 3160d1d6c295d51a610c71f8f9b8a0d83ee6d66c
prerequisite-patch-id: eb533cad65d1717efc48978ea6bc1c7a00cecfd2
prerequisite-patch-id: 1b4777af5019256652cf46a9719b645f35ad3545
prerequisite-patch-id: ab67f6fbf39573bf1926ed47f0f377d1907d93f9
prerequisite-patch-id: e53b862a3f19bfae6bad83a14e65399317e8f412
prerequisite-patch-id: f8c467f752f8ae76e4371637fc7f7f29ebba44c2
prerequisite-patch-id: 7746a8aef667b0d76e867bb3fe84c37e18ba9cf5
prerequisite-patch-id: ad820a6ec0b7d37bfbbd2ccba36ecb6def99f26a
prerequisite-patch-id: 24d59ae1238fb43f7337a0d08478f394441b04dc
prerequisite-patch-id: 42287c3913d32bf80cc28be41d8bf9972139dc33
prerequisite-patch-id: be3e7f787d9912862b913df033f4087d25ec5aa4
prerequisite-patch-id: a0a7ae7f18baf2ba244a5f2c38375c3352c933b8
prerequisite-patch-id: f9c3d2dd6a62adcd5c3899a5ae8aca6e511f2f4a
prerequisite-patch-id: 103d70d6a171ec88afbe7b79dacc38f45141e1f8
prerequisite-patch-id: c02ee58512800188946f1a4aecdee9c8fd9783bc
prerequisite-patch-id: 64aadbfc41067b73b942410fe4900d83e8b2fa34
prerequisite-patch-id: 97871a9f8b43b5b45c39008fe90a461eacb4eade
prerequisite-patch-id: f2be318daa505a3ec4d310438484bd27c47f08dd
prerequisite-patch-id: 4d458c35e2988d7ffa25f1b7945e84a43efffb5e
prerequisite-patch-id: 1c2f09dc37a0b4ddd186f474207d98dafa8de411
prerequisite-patch-id: c3813c0f10f83c0c0b93a84e1c31c304ff5b48e0
prerequisite-patch-id: 06060a0723d48ca605a960d819b9e3c7fe0a2bb1
prerequisite-patch-id: 3023cf5f10af5d6d129d8483ce0b28b3c2ebff57
prerequisite-patch-id: ce552534be3aa425b06f91a5e39dc7c8521cffd1
prerequisite-patch-id: ae8a3a131d7fa122e8b707cb3cbd86e74dec0cad
prerequisite-patch-id: 0c98c88257806ae7c017ce0e8db05cd77de74c82
prerequisite-patch-id: fd3fa935c3b980085ac0cb39476fb643f49a4f86
prerequisite-patch-id: 98ab27b7920907a340c10e04eadb4e6cd4a6a790
prerequisite-patch-id: 5c65c807fd9fcd77b54ca0dd0d6247910795dfbf
prerequisite-patch-id: 17e3e055e5093b47691fec7df42ecbed663b8ace
prerequisite-patch-id: dc1c8837ae740b52a864a0160d3fba842b119b09
prerequisite-patch-id: c79948a58e937f74edd94585dae796b066805569
prerequisite-patch-id: d0a71fc6ae91999a93703727e682532fe5d12c9a
-- 
2.25.1



* [PATCH v11 001/113] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX
@ 2023-01-12 16:31 ` isaku.yamahata
From: isaku.yamahata @ 2023-01-12 16:31 UTC
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson, Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

KVM accesses the Virtual Machine Control Structure (VMCS) with VMX instructions
to operate on a VM.  TDX doesn't allow the VMM to operate on the VMCS directly.
Instead, TDX has its own data structures, and TDX SEAMCALL APIs for the VMM to
indirectly operate on those data structures.  This means we must have a TDX
version of kvm_x86_ops.

The existing global struct kvm_x86_ops already defines an interface which
fits with TDX.  But kvm_x86_ops is a system-wide, not per-VM, structure.  To
allow VMX to coexist with TDs, the kvm_x86_ops callbacks will have wrappers
"if (tdx) tdx_op() else vmx_op()" to switch between VMX and TDX at run time.

To split the runtime switch, the VMX implementation, and the TDX
implementation, add main.c, and move out the vmx_x86_ops hooks in
preparation for adding TDX, which can coexist with VMX, i.e. KVM can run
both VMs and TDs.  Use 'vt' for the naming scheme as a nod to VT-x and as a
concatenation of VmxTdx.

The current code looks as follows.
In vmx.c
  static vmx_op() { ... }
  static struct kvm_x86_ops vmx_x86_ops = {
        .op = vmx_op,
  initialization code

The eventually converted code will look like
In vmx.c, keep the VMX operations.
  vmx_op() { ... }
  VMX initialization
In tdx.c, define the TDX operations.
  tdx_op() { ... }
  TDX initialization
In x86_ops.h, declare the VMX and TDX operations.
  vmx_op();
  tdx_op();
In main.c, define common wrappers for VMX and TDX.
  static vt_ops() { if (tdx) tdx_ops() else vmx_ops() }
  static struct kvm_x86_ops vt_x86_ops = {
        .op = vt_op,
  initialization to call VMX and TDX initialization
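
For illustration only (is_td_vcpu() and tdx_vcpu_reset() come from later
patches in the series, not this one), a fully converted hook will eventually
dispatch like:

  static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
  {
          if (is_td_vcpu(vcpu)) {
                  tdx_vcpu_reset(vcpu, init_event);
                  return;
          }

          vmx_vcpu_reset(vcpu, init_event);
  }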

Opportunistically, fix the name inconsistency from vmx_create_vcpu() and
vmx_free_vcpu() to vmx_vcpu_create() and vmx_vcpu_free().

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/Makefile      |   2 +-
 arch/x86/kvm/vmx/main.c    | 156 ++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 361 +++++++++++--------------------------
 arch/x86/kvm/vmx/x86_ops.h | 125 +++++++++++++
 4 files changed, 384 insertions(+), 260 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/main.c
 create mode 100644 arch/x86/kvm/vmx/x86_ops.h

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 80e3fe184d17..0e894ae23cbc 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -23,7 +23,7 @@ kvm-$(CONFIG_KVM_XEN)	+= xen.o
 kvm-$(CONFIG_KVM_SMM)	+= smm.o
 
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
-			   vmx/hyperv.o vmx/nested.o vmx/posted_intr.o
+			   vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
new file mode 100644
index 000000000000..a39d9d68b1b3
--- /dev/null
+++ b/arch/x86/kvm/vmx/main.c
@@ -0,0 +1,156 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/moduleparam.h>
+
+#include "x86_ops.h"
+#include "vmx.h"
+#include "nested.h"
+#include "pmu.h"
+
+struct kvm_x86_ops vt_x86_ops __initdata = {
+	.name = KBUILD_MODNAME,
+
+	.check_processor_compatibility = vmx_check_processor_compat,
+
+	.hardware_unsetup = vmx_hardware_unsetup,
+
+	.hardware_enable = vmx_hardware_enable,
+	.hardware_disable = vmx_hardware_disable,
+	.has_emulated_msr = vmx_has_emulated_msr,
+
+	.vm_size = sizeof(struct kvm_vmx),
+	.vm_init = vmx_vm_init,
+	.vm_destroy = vmx_vm_destroy,
+
+	.vcpu_precreate = vmx_vcpu_precreate,
+	.vcpu_create = vmx_vcpu_create,
+	.vcpu_free = vmx_vcpu_free,
+	.vcpu_reset = vmx_vcpu_reset,
+
+	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+	.vcpu_load = vmx_vcpu_load,
+	.vcpu_put = vmx_vcpu_put,
+
+	.update_exception_bitmap = vmx_update_exception_bitmap,
+	.get_msr_feature = vmx_get_msr_feature,
+	.get_msr = vmx_get_msr,
+	.set_msr = vmx_set_msr,
+	.get_segment_base = vmx_get_segment_base,
+	.get_segment = vmx_get_segment,
+	.set_segment = vmx_set_segment,
+	.get_cpl = vmx_get_cpl,
+	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
+	.set_cr0 = vmx_set_cr0,
+	.is_valid_cr4 = vmx_is_valid_cr4,
+	.set_cr4 = vmx_set_cr4,
+	.set_efer = vmx_set_efer,
+	.get_idt = vmx_get_idt,
+	.set_idt = vmx_set_idt,
+	.get_gdt = vmx_get_gdt,
+	.set_gdt = vmx_set_gdt,
+	.set_dr7 = vmx_set_dr7,
+	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
+	.cache_reg = vmx_cache_reg,
+	.get_rflags = vmx_get_rflags,
+	.set_rflags = vmx_set_rflags,
+	.get_if_flag = vmx_get_if_flag,
+
+	.flush_tlb_all = vmx_flush_tlb_all,
+	.flush_tlb_current = vmx_flush_tlb_current,
+	.flush_tlb_gva = vmx_flush_tlb_gva,
+	.flush_tlb_guest = vmx_flush_tlb_guest,
+
+	.vcpu_pre_run = vmx_vcpu_pre_run,
+	.vcpu_run = vmx_vcpu_run,
+	.handle_exit = vmx_handle_exit,
+	.skip_emulated_instruction = vmx_skip_emulated_instruction,
+	.update_emulated_instruction = vmx_update_emulated_instruction,
+	.set_interrupt_shadow = vmx_set_interrupt_shadow,
+	.get_interrupt_shadow = vmx_get_interrupt_shadow,
+	.patch_hypercall = vmx_patch_hypercall,
+	.inject_irq = vmx_inject_irq,
+	.inject_nmi = vmx_inject_nmi,
+	.inject_exception = vmx_inject_exception,
+	.cancel_injection = vmx_cancel_injection,
+	.interrupt_allowed = vmx_interrupt_allowed,
+	.nmi_allowed = vmx_nmi_allowed,
+	.get_nmi_mask = vmx_get_nmi_mask,
+	.set_nmi_mask = vmx_set_nmi_mask,
+	.enable_nmi_window = vmx_enable_nmi_window,
+	.enable_irq_window = vmx_enable_irq_window,
+	.update_cr8_intercept = vmx_update_cr8_intercept,
+	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
+	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
+	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+	.load_eoi_exitmap = vmx_load_eoi_exitmap,
+	.apicv_post_state_restore = vmx_apicv_post_state_restore,
+	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
+	.hwapic_irr_update = vmx_hwapic_irr_update,
+	.hwapic_isr_update = vmx_hwapic_isr_update,
+	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+	.sync_pir_to_irr = vmx_sync_pir_to_irr,
+	.deliver_interrupt = vmx_deliver_interrupt,
+	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+
+	.set_tss_addr = vmx_set_tss_addr,
+	.set_identity_map_addr = vmx_set_identity_map_addr,
+	.get_mt_mask = vmx_get_mt_mask,
+
+	.get_exit_info = vmx_get_exit_info,
+
+	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+
+	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
+
+	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
+	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
+	.write_tsc_offset = vmx_write_tsc_offset,
+	.write_tsc_multiplier = vmx_write_tsc_multiplier,
+
+	.load_mmu_pgd = vmx_load_mmu_pgd,
+
+	.check_intercept = vmx_check_intercept,
+	.handle_exit_irqoff = vmx_handle_exit_irqoff,
+
+	.request_immediate_exit = vmx_request_immediate_exit,
+
+	.sched_in = vmx_sched_in,
+
+	.cpu_dirty_log_size = PML_ENTITY_NUM,
+	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+
+	.nested_ops = &vmx_nested_ops,
+
+	.pi_update_irte = vmx_pi_update_irte,
+	.pi_start_assignment = vmx_pi_start_assignment,
+
+#ifdef CONFIG_X86_64
+	.set_hv_timer = vmx_set_hv_timer,
+	.cancel_hv_timer = vmx_cancel_hv_timer,
+#endif
+
+	.setup_mce = vmx_setup_mce,
+
+#ifdef CONFIG_KVM_SMM
+	.smi_allowed = vmx_smi_allowed,
+	.enter_smm = vmx_enter_smm,
+	.leave_smm = vmx_leave_smm,
+	.enable_smi_window = vmx_enable_smi_window,
+#endif
+
+	.can_emulate_instruction = vmx_can_emulate_instruction,
+	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+	.migrate_timers = vmx_migrate_timers,
+
+	.msr_filter_changed = vmx_msr_filter_changed,
+	.complete_emulated_msr = kvm_complete_insn_gp,
+
+	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+};
+
+struct kvm_x86_init_ops vt_init_ops __initdata = {
+	.hardware_setup = vmx_hardware_setup,
+	.handle_intel_pt_intr = NULL,
+
+	.runtime_ops = &vt_x86_ops,
+	.pmu_ops = &intel_pmu_ops,
+};
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f290781394e7..63155166de45 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -66,6 +66,7 @@
 #include "vmcs12.h"
 #include "vmx.h"
 #include "x86.h"
+#include "x86_ops.h"
 #include "smm.h"
 
 MODULE_AUTHOR("Qumranet");
@@ -524,8 +525,6 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
 static unsigned long host_idt_base;
 
 #if IS_ENABLED(CONFIG_HYPERV)
-static struct kvm_x86_ops vmx_x86_ops __initdata;
-
 static bool __read_mostly enlightened_vmcs = true;
 module_param(enlightened_vmcs, bool, 0444);
 
@@ -583,9 +582,8 @@ static __init void hv_init_evmcs(void)
 		}
 
 		if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH)
-			vmx_x86_ops.enable_l2_tlb_flush
+			vt_x86_ops.enable_l2_tlb_flush
 				= hv_enable_l2_tlb_flush;
-
 	} else {
 		enlightened_vmcs = false;
 	}
@@ -1456,7 +1454,7 @@ void vmx_vcpu_load_vmcs(struct kvm_vcpu *vcpu, int cpu,
  * Switches to specified vcpu, until a matching vcpu_put(), but assumes
  * vcpu mutex is already taken.
  */
-static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -1467,7 +1465,7 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	vmx->host_debugctlmsr = get_debugctlmsr();
 }
 
-static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
+void vmx_vcpu_put(struct kvm_vcpu *vcpu)
 {
 	vmx_vcpu_pi_put(vcpu);
 
@@ -1521,7 +1519,7 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
 		vmx->emulation_required = vmx_emulation_required(vcpu);
 }
 
-static bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu)
 {
 	return vmx_get_rflags(vcpu) & X86_EFLAGS_IF;
 }
@@ -1627,8 +1625,8 @@ static int vmx_rtit_ctl_check(struct kvm_vcpu *vcpu, u64 data)
 	return 0;
 }
 
-static bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
-					void *insn, int insn_len)
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+				void *insn, int insn_len)
 {
 	/*
 	 * Emulation of instructions in SGX enclaves is impossible as RIP does
@@ -1712,7 +1710,7 @@ static int skip_emulated_instruction(struct kvm_vcpu *vcpu)
  * Recognizes a pending MTF VM-exit and records the nested state for later
  * delivery.
  */
-static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -1743,7 +1741,7 @@ static void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu)
 	}
 }
 
-static int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu)
 {
 	vmx_update_emulated_instruction(vcpu);
 	return skip_emulated_instruction(vcpu);
@@ -1762,7 +1760,7 @@ static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
 		vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
 }
 
-static void vmx_inject_exception(struct kvm_vcpu *vcpu)
+void vmx_inject_exception(struct kvm_vcpu *vcpu)
 {
 	struct kvm_queued_exception *ex = &vcpu->arch.exception;
 	u32 intr_info = ex->vector | INTR_INFO_VALID_MASK;
@@ -1883,12 +1881,12 @@ u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
 	return kvm_caps.default_tsc_scaling_ratio;
 }
 
-static void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
 {
 	vmcs_write64(TSC_OFFSET, offset);
 }
 
-static void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
 {
 	vmcs_write64(TSC_MULTIPLIER, multiplier);
 }
@@ -1942,7 +1940,7 @@ static inline bool is_vmx_feature_control_msr_valid(struct vcpu_vmx *vmx,
 	return !(msr->data & ~valid_bits);
 }
 
-static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
+int vmx_get_msr_feature(struct kvm_msr_entry *msr)
 {
 	switch (msr->index) {
 	case MSR_IA32_VMX_BASIC ... MSR_IA32_VMX_VMFUNC:
@@ -1959,7 +1957,7 @@ static int vmx_get_msr_feature(struct kvm_msr_entry *msr)
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_uret_msr *msr;
@@ -2138,7 +2136,7 @@ static u64 vmx_get_supported_debugctl(struct kvm_vcpu *vcpu, bool host_initiated
  * Returns 0 on success, non-0 otherwise.
  * Assumes vcpu_load() was already called.
  */
-static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	struct vmx_uret_msr *msr;
@@ -2473,7 +2471,7 @@ static int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	return ret;
 }
 
-static void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
 {
 	unsigned long guest_owned_bits;
 
@@ -2758,7 +2756,7 @@ static bool kvm_is_vmx_supported(void)
 	return true;
 }
 
-static int vmx_check_processor_compat(void)
+int vmx_check_processor_compat(void)
 {
 	int cpu = raw_smp_processor_id();
 	struct vmcs_config vmcs_conf;
@@ -2800,7 +2798,7 @@ static int kvm_cpu_vmxon(u64 vmxon_pointer)
 	return -EFAULT;
 }
 
-static int vmx_hardware_enable(void)
+int vmx_hardware_enable(void)
 {
 	int cpu = raw_smp_processor_id();
 	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
@@ -2841,7 +2839,7 @@ static void vmclear_local_loaded_vmcss(void)
 		__loaded_vmcs_clear(v);
 }
 
-static void vmx_hardware_disable(void)
+void vmx_hardware_disable(void)
 {
 	vmclear_local_loaded_vmcss();
 
@@ -3153,7 +3151,7 @@ static void exit_lmode(struct kvm_vcpu *vcpu)
 
 #endif
 
-static void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -3183,7 +3181,7 @@ static inline int vmx_get_current_vpid(struct kvm_vcpu *vcpu)
 	return to_vmx(vcpu)->vpid;
 }
 
-static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 {
 	struct kvm_mmu *mmu = vcpu->arch.mmu;
 	u64 root_hpa = mmu->root.hpa;
@@ -3199,7 +3197,7 @@ static void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 		vpid_sync_context(vmx_get_current_vpid(vcpu));
 }
 
-static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 {
 	/*
 	 * vpid_sync_vcpu_addr() is a nop if vpid==0, see the comment in
@@ -3208,7 +3206,7 @@ static void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 	vpid_sync_vcpu_addr(vmx_get_current_vpid(vcpu), addr);
 }
 
-static void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu)
 {
 	/*
 	 * vpid_sync_context() is a nop if vpid==0, e.g. if enable_vpid==0 or a
@@ -3363,8 +3361,7 @@ u64 construct_eptp(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 	return eptp;
 }
 
-static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
-			     int root_level)
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level)
 {
 	struct kvm *kvm = vcpu->kvm;
 	bool update_guest_cr3 = true;
@@ -3392,8 +3389,7 @@ static void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 		vmcs_writel(GUEST_CR3, guest_cr3);
 }
 
-
-static bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
 {
 	/*
 	 * We operate under the default treatment of SMM, so VMX cannot be
@@ -3509,7 +3505,7 @@ void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	var->g = (ar >> 15) & 1;
 }
 
-static u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
 {
 	struct kvm_segment s;
 
@@ -3589,14 +3585,14 @@ void __vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 	vmcs_write32(sf->ar_bytes, vmx_segment_access_rights(var));
 }
 
-static void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
 {
 	__vmx_set_segment(vcpu, var, seg);
 
 	to_vmx(vcpu)->emulation_required = vmx_emulation_required(vcpu);
 }
 
-static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 {
 	u32 ar = vmx_read_guest_seg_ar(to_vmx(vcpu), VCPU_SREG_CS);
 
@@ -3604,25 +3600,25 @@ static void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
 	*l = (ar >> 13) & 1;
 }
 
-static void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	dt->size = vmcs_read32(GUEST_IDTR_LIMIT);
 	dt->address = vmcs_readl(GUEST_IDTR_BASE);
 }
 
-static void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_IDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_IDTR_BASE, dt->address);
 }
 
-static void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	dt->size = vmcs_read32(GUEST_GDTR_LIMIT);
 	dt->address = vmcs_readl(GUEST_GDTR_BASE);
 }
 
-static void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
 {
 	vmcs_write32(GUEST_GDTR_LIMIT, dt->size);
 	vmcs_writel(GUEST_GDTR_BASE, dt->address);
@@ -4120,7 +4116,7 @@ void pt_update_intercept_for_msr(struct kvm_vcpu *vcpu)
 	}
 }
 
-static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	void *vapic_page;
@@ -4140,7 +4136,7 @@ static bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	return ((rvi & 0xf0) > (vppr & 0xf0));
 }
 
-static void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	u32 i;
@@ -4281,8 +4277,8 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	return 0;
 }
 
-static void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
-				  int trig_mode, int vector)
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector)
 {
 	struct kvm_vcpu *vcpu = apic->vcpu;
 
@@ -4444,7 +4440,7 @@ static u32 vmx_vmexit_ctrl(void)
 		~(VM_EXIT_LOAD_IA32_PERF_GLOBAL_CTRL | VM_EXIT_LOAD_IA32_EFER);
 }
 
-static void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4709,7 +4705,7 @@ static int vmx_alloc_ipiv_pid_table(struct kvm *kvm)
 	return 0;
 }
 
-static int vmx_vcpu_precreate(struct kvm *kvm)
+int vmx_vcpu_precreate(struct kvm *kvm)
 {
 	return vmx_alloc_ipiv_pid_table(kvm);
 }
@@ -4861,7 +4857,7 @@ static void __vmx_vcpu_reset(struct kvm_vcpu *vcpu)
 	vmx->pi_desc.sn = 1;
 }
 
-static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -4920,12 +4916,12 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vmx_update_fb_clear_dis(vcpu, vmx);
 }
 
-static void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu)
 {
 	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_INTR_WINDOW_EXITING);
 }
 
-static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
 {
 	if (!enable_vnmi ||
 	    vmcs_read32(GUEST_INTERRUPTIBILITY_INFO) & GUEST_INTR_STATE_STI) {
@@ -4936,7 +4932,7 @@ static void vmx_enable_nmi_window(struct kvm_vcpu *vcpu)
 	exec_controls_setbit(to_vmx(vcpu), CPU_BASED_NMI_WINDOW_EXITING);
 }
 
-static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	uint32_t intr;
@@ -4964,7 +4960,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 	vmx_clear_hlt(vcpu);
 }
 
-static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
+void vmx_inject_nmi(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -5042,7 +5038,7 @@ bool vmx_nmi_blocked(struct kvm_vcpu *vcpu)
 		 GUEST_INTR_STATE_NMI));
 }
 
-static int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return -EBUSY;
@@ -5064,7 +5060,7 @@ bool vmx_interrupt_blocked(struct kvm_vcpu *vcpu)
 		(GUEST_INTR_STATE_STI | GUEST_INTR_STATE_MOV_SS));
 }
 
-static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	if (to_vmx(vcpu)->nested.nested_run_pending)
 		return -EBUSY;
@@ -5079,7 +5075,7 @@ static int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !vmx_interrupt_blocked(vcpu);
 }
 
-static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 {
 	void __user *ret;
 
@@ -5099,7 +5095,7 @@ static int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr)
 	return init_rmode_tss(kvm, ret);
 }
 
-static int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
 {
 	to_kvm_vmx(kvm)->ept_identity_map_addr = ident_addr;
 	return 0;
@@ -5380,8 +5376,7 @@ static int handle_io(struct kvm_vcpu *vcpu)
 	return kvm_fast_pio(vcpu, size, port, in);
 }
 
-static void
-vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall)
 {
 	/*
 	 * Patch in the VMCALL instruction:
@@ -5591,7 +5586,7 @@ static int handle_dr(struct kvm_vcpu *vcpu)
 	return kvm_complete_insn_gp(vcpu, err);
 }
 
-static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 {
 	get_debugreg(vcpu->arch.db[0], 0);
 	get_debugreg(vcpu->arch.db[1], 1);
@@ -5610,7 +5605,7 @@ static void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
 	set_debugreg(DR6_RESERVED, 6);
 }
 
-static void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
 {
 	vmcs_writel(GUEST_DR7, val);
 }
@@ -5881,7 +5876,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
-static int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	if (vmx_emulation_required_with_pending_exception(vcpu)) {
 		kvm_prepare_emulation_failure_exit(vcpu);
@@ -6145,9 +6140,8 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
 static const int kvm_vmx_max_exit_handlers =
 	ARRAY_SIZE(kvm_vmx_exit_handlers);
 
-static void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
-			      u64 *info1, u64 *info2,
-			      u32 *intr_info, u32 *error_code)
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		       u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6590,7 +6584,7 @@ static int __vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 	return 0;
 }
 
-static int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath)
 {
 	int ret = __vmx_handle_exit(vcpu, exit_fastpath);
 
@@ -6678,7 +6672,7 @@ static noinstr void vmx_l1d_flush(struct kvm_vcpu *vcpu)
 		: "eax", "ebx", "ecx", "edx");
 }
 
-static void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 	int tpr_threshold;
@@ -6748,7 +6742,7 @@ void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
 	vmx_update_msr_bitmap_x2apic(vcpu);
 }
 
-static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 {
 	struct page *page;
 
@@ -6776,7 +6770,7 @@ static void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
 	put_page(page);
 }
 
-static void vmx_hwapic_isr_update(int max_isr)
+void vmx_hwapic_isr_update(int max_isr)
 {
 	u16 status;
 	u8 old;
@@ -6810,7 +6804,7 @@ static void vmx_set_rvi(int vector)
 	}
 }
 
-static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 {
 	/*
 	 * When running L2, updating RVI is only relevant when
@@ -6824,7 +6818,7 @@ static void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
 		vmx_set_rvi(max_irr);
 }
 
-static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int max_irr;
@@ -6870,7 +6864,7 @@ static int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 	return max_irr;
 }
 
-static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 {
 	if (!kvm_vcpu_apicv_active(vcpu))
 		return;
@@ -6881,7 +6875,7 @@ static void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 	vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
 }
 
-static void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6954,7 +6948,7 @@ static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
 	vcpu->arch.at_instruction_boundary = true;
 }
 
-static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -6971,7 +6965,7 @@ static void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
  * The kvm parameter can be NULL (module initialization, or invocation before
  * VM creation). Be sure to check the kvm parameter before using it.
  */
-static bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index)
 {
 	switch (index) {
 	case MSR_IA32_SMBASE:
@@ -7094,7 +7088,7 @@ static void vmx_complete_interrupts(struct vcpu_vmx *vmx)
 				  IDT_VECTORING_ERROR_CODE);
 }
 
-static void vmx_cancel_injection(struct kvm_vcpu *vcpu)
+void vmx_cancel_injection(struct kvm_vcpu *vcpu)
 {
 	__vmx_complete_interrupts(vcpu,
 				  vmcs_read32(VM_ENTRY_INTR_INFO_FIELD),
@@ -7228,7 +7222,7 @@ static noinstr void vmx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 	guest_state_exit_irqoff();
 }
 
-static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	unsigned long cr3, cr4;
@@ -7394,7 +7388,7 @@ static fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu)
 	return vmx_exit_handlers_fastpath(vcpu);
 }
 
-static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
+void vmx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -7405,7 +7399,7 @@ static void vmx_vcpu_free(struct kvm_vcpu *vcpu)
 	free_loaded_vmcs(vmx->loaded_vmcs);
 }
 
-static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
+int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct vmx_uret_msr *tsx_ctrl;
 	struct vcpu_vmx *vmx;
@@ -7514,7 +7508,7 @@ static int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
-static int vmx_vm_init(struct kvm *kvm)
+int vmx_vm_init(struct kvm *kvm)
 {
 	if (!ple_gap)
 		kvm->arch.pause_in_guest = true;
@@ -7545,7 +7539,7 @@ static int vmx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
-static u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	u8 cache;
 
@@ -7717,7 +7711,7 @@ static void update_intel_pt_cfg(struct kvm_vcpu *vcpu)
 		vmx->pt_desc.ctl_bitmask &= ~(0xfULL << (32 + i * 4));
 }
 
-static void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -7853,7 +7847,7 @@ static __init void vmx_set_cpu_caps(void)
 		kvm_cpu_cap_check_and_set(X86_FEATURE_WAITPKG);
 }
 
-static void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->req_immediate_exit = true;
 }
@@ -7892,10 +7886,10 @@ static int vmx_check_intercept_io(struct kvm_vcpu *vcpu,
 	return intercept ? X86EMUL_UNHANDLEABLE : X86EMUL_CONTINUE;
 }
 
-static int vmx_check_intercept(struct kvm_vcpu *vcpu,
-			       struct x86_instruction_info *info,
-			       enum x86_intercept_stage stage,
-			       struct x86_exception *exception)
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+			struct x86_instruction_info *info,
+			enum x86_intercept_stage stage,
+			struct x86_exception *exception)
 {
 	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
 
@@ -7960,8 +7954,8 @@ static inline int u64_shl_div_u64(u64 a, unsigned int shift,
 	return 0;
 }
 
-static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
-			    bool *expired)
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+		     bool *expired)
 {
 	struct vcpu_vmx *vmx;
 	u64 tscl, guest_tscl, delta_tsc, lapic_timer_advance_cycles;
@@ -8000,13 +7994,13 @@ static int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 	return 0;
 }
 
-static void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu)
 {
 	to_vmx(vcpu)->hv_deadline_tsc = -1;
 }
 #endif
 
-static void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu)
 {
 	if (!kvm_pause_in_guest(vcpu->kvm))
 		shrink_ple_window(vcpu);
@@ -8032,7 +8026,7 @@ void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 		secondary_exec_controls_clearbit(vmx, SECONDARY_EXEC_ENABLE_PML);
 }
 
-static void vmx_setup_mce(struct kvm_vcpu *vcpu)
+void vmx_setup_mce(struct kvm_vcpu *vcpu)
 {
 	if (vcpu->arch.mcg_cap & MCG_LMCE_P)
 		to_vmx(vcpu)->msr_ia32_feature_control_valid_bits |=
@@ -8043,7 +8037,7 @@ static void vmx_setup_mce(struct kvm_vcpu *vcpu)
 }
 
 #ifdef CONFIG_KVM_SMM
-static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 {
 	/* we need a nested vmexit to enter SMM, postpone if run is pending */
 	if (to_vmx(vcpu)->nested.nested_run_pending)
@@ -8051,7 +8045,7 @@ static int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
 	return !is_smm(vcpu);
 }
 
-static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 
@@ -8072,7 +8066,7 @@ static int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
 	return 0;
 }
 
-static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
 	int ret;
@@ -8093,18 +8087,18 @@ static int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
 	return 0;
 }
 
-static void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu)
 {
 	/* RSM will cause a vmexit anyway.  */
 }
 #endif
 
-static bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 {
 	return to_vmx(vcpu)->nested.vmxon && !is_guest_mode(vcpu);
 }
 
-static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
+void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 {
 	if (is_guest_mode(vcpu)) {
 		struct hrtimer *timer = &to_vmx(vcpu)->nested.preemption_timer;
@@ -8114,7 +8108,7 @@ static void vmx_migrate_timers(struct kvm_vcpu *vcpu)
 	}
 }
 
-static void vmx_hardware_unsetup(void)
+void vmx_hardware_unsetup(void)
 {
 	kvm_set_posted_intr_wakeup_handler(NULL);
 
@@ -8124,7 +8118,7 @@ static void vmx_hardware_unsetup(void)
 	free_kvm_area();
 }
 
-static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 {
 	ulong supported = BIT(APICV_INHIBIT_REASON_DISABLE) |
 			  BIT(APICV_INHIBIT_REASON_ABSENT) |
@@ -8136,154 +8130,13 @@ static bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason)
 	return supported & BIT(reason);
 }
 
-static void vmx_vm_destroy(struct kvm *kvm)
+void vmx_vm_destroy(struct kvm *kvm)
 {
 	struct kvm_vmx *kvm_vmx = to_kvm_vmx(kvm);
 
 	free_pages((unsigned long)kvm_vmx->pid_table, vmx_get_pid_table_order(kvm));
 }
 
-static struct kvm_x86_ops vmx_x86_ops __initdata = {
-	.name = KBUILD_MODNAME,
-
-	.check_processor_compatibility = vmx_check_processor_compat,
-
-	.hardware_unsetup = vmx_hardware_unsetup,
-
-	.hardware_enable = vmx_hardware_enable,
-	.hardware_disable = vmx_hardware_disable,
-	.has_emulated_msr = vmx_has_emulated_msr,
-
-	.vm_size = sizeof(struct kvm_vmx),
-	.vm_init = vmx_vm_init,
-	.vm_destroy = vmx_vm_destroy,
-
-	.vcpu_precreate = vmx_vcpu_precreate,
-	.vcpu_create = vmx_vcpu_create,
-	.vcpu_free = vmx_vcpu_free,
-	.vcpu_reset = vmx_vcpu_reset,
-
-	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
-	.vcpu_load = vmx_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
-
-	.update_exception_bitmap = vmx_update_exception_bitmap,
-	.get_msr_feature = vmx_get_msr_feature,
-	.get_msr = vmx_get_msr,
-	.set_msr = vmx_set_msr,
-	.get_segment_base = vmx_get_segment_base,
-	.get_segment = vmx_get_segment,
-	.set_segment = vmx_set_segment,
-	.get_cpl = vmx_get_cpl,
-	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
-	.set_cr0 = vmx_set_cr0,
-	.is_valid_cr4 = vmx_is_valid_cr4,
-	.set_cr4 = vmx_set_cr4,
-	.set_efer = vmx_set_efer,
-	.get_idt = vmx_get_idt,
-	.set_idt = vmx_set_idt,
-	.get_gdt = vmx_get_gdt,
-	.set_gdt = vmx_set_gdt,
-	.set_dr7 = vmx_set_dr7,
-	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
-	.cache_reg = vmx_cache_reg,
-	.get_rflags = vmx_get_rflags,
-	.set_rflags = vmx_set_rflags,
-	.get_if_flag = vmx_get_if_flag,
-
-	.flush_tlb_all = vmx_flush_tlb_all,
-	.flush_tlb_current = vmx_flush_tlb_current,
-	.flush_tlb_gva = vmx_flush_tlb_gva,
-	.flush_tlb_guest = vmx_flush_tlb_guest,
-
-	.vcpu_pre_run = vmx_vcpu_pre_run,
-	.vcpu_run = vmx_vcpu_run,
-	.handle_exit = vmx_handle_exit,
-	.skip_emulated_instruction = vmx_skip_emulated_instruction,
-	.update_emulated_instruction = vmx_update_emulated_instruction,
-	.set_interrupt_shadow = vmx_set_interrupt_shadow,
-	.get_interrupt_shadow = vmx_get_interrupt_shadow,
-	.patch_hypercall = vmx_patch_hypercall,
-	.inject_irq = vmx_inject_irq,
-	.inject_nmi = vmx_inject_nmi,
-	.inject_exception = vmx_inject_exception,
-	.cancel_injection = vmx_cancel_injection,
-	.interrupt_allowed = vmx_interrupt_allowed,
-	.nmi_allowed = vmx_nmi_allowed,
-	.get_nmi_mask = vmx_get_nmi_mask,
-	.set_nmi_mask = vmx_set_nmi_mask,
-	.enable_nmi_window = vmx_enable_nmi_window,
-	.enable_irq_window = vmx_enable_irq_window,
-	.update_cr8_intercept = vmx_update_cr8_intercept,
-	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
-	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
-	.load_eoi_exitmap = vmx_load_eoi_exitmap,
-	.apicv_post_state_restore = vmx_apicv_post_state_restore,
-	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
-	.hwapic_irr_update = vmx_hwapic_irr_update,
-	.hwapic_isr_update = vmx_hwapic_isr_update,
-	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
-	.sync_pir_to_irr = vmx_sync_pir_to_irr,
-	.deliver_interrupt = vmx_deliver_interrupt,
-	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
-
-	.set_tss_addr = vmx_set_tss_addr,
-	.set_identity_map_addr = vmx_set_identity_map_addr,
-	.get_mt_mask = vmx_get_mt_mask,
-
-	.get_exit_info = vmx_get_exit_info,
-
-	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
-
-	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
-
-	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
-	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
-	.write_tsc_offset = vmx_write_tsc_offset,
-	.write_tsc_multiplier = vmx_write_tsc_multiplier,
-
-	.load_mmu_pgd = vmx_load_mmu_pgd,
-
-	.check_intercept = vmx_check_intercept,
-	.handle_exit_irqoff = vmx_handle_exit_irqoff,
-
-	.request_immediate_exit = vmx_request_immediate_exit,
-
-	.sched_in = vmx_sched_in,
-
-	.cpu_dirty_log_size = PML_ENTITY_NUM,
-	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
-
-	.nested_ops = &vmx_nested_ops,
-
-	.pi_update_irte = vmx_pi_update_irte,
-	.pi_start_assignment = vmx_pi_start_assignment,
-
-#ifdef CONFIG_X86_64
-	.set_hv_timer = vmx_set_hv_timer,
-	.cancel_hv_timer = vmx_cancel_hv_timer,
-#endif
-
-	.setup_mce = vmx_setup_mce,
-
-#ifdef CONFIG_KVM_SMM
-	.smi_allowed = vmx_smi_allowed,
-	.enter_smm = vmx_enter_smm,
-	.leave_smm = vmx_leave_smm,
-	.enable_smi_window = vmx_enable_smi_window,
-#endif
-
-	.can_emulate_instruction = vmx_can_emulate_instruction,
-	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
-	.migrate_timers = vmx_migrate_timers,
-
-	.msr_filter_changed = vmx_msr_filter_changed,
-	.complete_emulated_msr = kvm_complete_insn_gp,
-
-	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
-};
-
 static unsigned int vmx_handle_intel_pt_intr(void)
 {
 	struct kvm_vcpu *vcpu = kvm_get_running_vcpu();
@@ -8349,9 +8202,7 @@ static void __init vmx_setup_me_spte_mask(void)
 	kvm_mmu_set_me_spte_mask(0, me_mask);
 }
 
-static struct kvm_x86_init_ops vmx_init_ops __initdata;
-
-static __init int hardware_setup(void)
+__init int vmx_hardware_setup(void)
 {
 	unsigned long host_bndcfgs;
 	struct desc_ptr dt;
@@ -8420,16 +8271,16 @@ static __init int hardware_setup(void)
 	 * using the APIC_ACCESS_ADDR VMCS field.
 	 */
 	if (!flexpriority_enabled)
-		vmx_x86_ops.set_apic_access_page_addr = NULL;
+		vt_x86_ops.set_apic_access_page_addr = NULL;
 
 	if (!cpu_has_vmx_tpr_shadow())
-		vmx_x86_ops.update_cr8_intercept = NULL;
+		vt_x86_ops.update_cr8_intercept = NULL;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
 	    && enable_ept) {
-		vmx_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		vmx_x86_ops.tlb_remote_flush_with_range =
+		vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
+		vt_x86_ops.tlb_remote_flush_with_range =
 				hv_remote_flush_tlb_with_range;
 	}
 #endif
@@ -8445,7 +8296,7 @@ static __init int hardware_setup(void)
 	if (!cpu_has_vmx_apicv())
 		enable_apicv = 0;
 	if (!enable_apicv)
-		vmx_x86_ops.sync_pir_to_irr = NULL;
+		vt_x86_ops.sync_pir_to_irr = NULL;
 
 	if (!enable_apicv || !cpu_has_vmx_ipiv())
 		enable_ipiv = false;
@@ -8481,7 +8332,7 @@ static __init int hardware_setup(void)
 		enable_pml = 0;
 
 	if (!enable_pml)
-		vmx_x86_ops.cpu_dirty_log_size = 0;
+		vt_x86_ops.cpu_dirty_log_size = 0;
 
 	if (!cpu_has_vmx_preemption_timer())
 		enable_preemption_timer = false;
@@ -8506,9 +8357,9 @@ static __init int hardware_setup(void)
 	}
 
 	if (!enable_preemption_timer) {
-		vmx_x86_ops.set_hv_timer = NULL;
-		vmx_x86_ops.cancel_hv_timer = NULL;
-		vmx_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
+		vt_x86_ops.set_hv_timer = NULL;
+		vt_x86_ops.cancel_hv_timer = NULL;
+		vt_x86_ops.request_immediate_exit = __kvm_request_immediate_exit;
 	}
 
 	kvm_caps.supported_mce_cap |= MCG_LMCE_P;
@@ -8519,9 +8370,9 @@ static __init int hardware_setup(void)
 	if (!enable_ept || !enable_pmu || !cpu_has_vmx_intel_pt())
 		pt_mode = PT_MODE_SYSTEM;
 	if (pt_mode == PT_MODE_HOST_GUEST)
-		vmx_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
+		vt_init_ops.handle_intel_pt_intr = vmx_handle_intel_pt_intr;
 	else
-		vmx_init_ops.handle_intel_pt_intr = NULL;
+		vt_init_ops.handle_intel_pt_intr = NULL;
 
 	setup_default_sgx_lepubkeyhash();
 
@@ -8544,14 +8395,6 @@ static __init int hardware_setup(void)
 	return r;
 }
 
-static struct kvm_x86_init_ops vmx_init_ops __initdata = {
-	.hardware_setup = hardware_setup,
-	.handle_intel_pt_intr = NULL,
-
-	.runtime_ops = &vmx_x86_ops,
-	.pmu_ops = &intel_pmu_ops,
-};
-
 static void vmx_cleanup_l1d_flush(void)
 {
 	if (vmx_l1d_flush_pages) {
@@ -8595,7 +8438,7 @@ static int __init vmx_init(void)
 	 */
 	hv_init_evmcs();
 
-	r = kvm_x86_vendor_init(&vmx_init_ops);
+	r = kvm_x86_vendor_init(&vt_init_ops);
 	if (r)
 		return r;
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
new file mode 100644
index 000000000000..e9ec4d259ff5
--- /dev/null
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -0,0 +1,125 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_X86_OPS_H
+#define __KVM_X86_VMX_X86_OPS_H
+
+#include <linux/kvm_host.h>
+
+#include <asm/virtext.h>
+
+#include "x86.h"
+
+__init int vmx_hardware_setup(void);
+
+extern struct kvm_x86_ops vt_x86_ops __initdata;
+extern struct kvm_x86_init_ops vt_init_ops __initdata;
+
+void vmx_hardware_unsetup(void);
+int vmx_check_processor_compat(void);
+int vmx_hardware_enable(void);
+void vmx_hardware_disable(void);
+int vmx_vm_init(struct kvm *kvm);
+void vmx_vm_destroy(struct kvm *kvm);
+int vmx_vcpu_precreate(struct kvm *kvm);
+int vmx_vcpu_create(struct kvm_vcpu *vcpu);
+int vmx_vcpu_pre_run(struct kvm_vcpu *vcpu);
+fastpath_t vmx_vcpu_run(struct kvm_vcpu *vcpu);
+void vmx_vcpu_free(struct kvm_vcpu *vcpu);
+void vmx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+void vmx_vcpu_put(struct kvm_vcpu *vcpu);
+int vmx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t exit_fastpath);
+void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int vmx_skip_emulated_instruction(struct kvm_vcpu *vcpu);
+void vmx_update_emulated_instruction(struct kvm_vcpu *vcpu);
+int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+#ifdef CONFIG_KVM_SMM
+int vmx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int vmx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void vmx_enable_smi_window(struct kvm_vcpu *vcpu);
+#endif
+bool vmx_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+				void *insn, int insn_len);
+int vmx_check_intercept(struct kvm_vcpu *vcpu,
+			struct x86_instruction_info *info,
+			enum x86_intercept_stage stage,
+			struct x86_exception *exception);
+bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
+void vmx_migrate_timers(struct kvm_vcpu *vcpu);
+void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
+void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
+bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
+void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
+void vmx_hwapic_isr_update(int max_isr);
+bool vmx_guest_apic_has_interrupt(struct kvm_vcpu *vcpu);
+int vmx_sync_pir_to_irr(struct kvm_vcpu *vcpu);
+void vmx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector);
+void vmx_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu);
+bool vmx_has_emulated_msr(struct kvm *kvm, u32 index);
+void vmx_msr_filter_changed(struct kvm_vcpu *vcpu);
+void vmx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu);
+int vmx_get_msr_feature(struct kvm_msr_entry *msr);
+int vmx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info);
+u64 vmx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void vmx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+void vmx_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+int vmx_get_cpl(struct kvm_vcpu *vcpu);
+void vmx_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l);
+void vmx_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0);
+void vmx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
+void vmx_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+bool vmx_is_valid_cr4(struct kvm_vcpu *vcpu, unsigned long cr4);
+int vmx_set_efer(struct kvm_vcpu *vcpu, u64 efer);
+void vmx_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt);
+void vmx_set_dr7(struct kvm_vcpu *vcpu, unsigned long val);
+void vmx_sync_dirty_debug_regs(struct kvm_vcpu *vcpu);
+void vmx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long vmx_get_rflags(struct kvm_vcpu *vcpu);
+void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
+bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
+void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
+void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
+u32 vmx_get_interrupt_shadow(struct kvm_vcpu *vcpu);
+void vmx_patch_hypercall(struct kvm_vcpu *vcpu, unsigned char *hypercall);
+void vmx_inject_irq(struct kvm_vcpu *vcpu, bool reinjected);
+void vmx_inject_nmi(struct kvm_vcpu *vcpu);
+void vmx_inject_exception(struct kvm_vcpu *vcpu);
+void vmx_cancel_injection(struct kvm_vcpu *vcpu);
+int vmx_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int vmx_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+bool vmx_get_nmi_mask(struct kvm_vcpu *vcpu);
+void vmx_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked);
+void vmx_enable_nmi_window(struct kvm_vcpu *vcpu);
+void vmx_enable_irq_window(struct kvm_vcpu *vcpu);
+void vmx_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr);
+void vmx_set_apic_access_page_addr(struct kvm_vcpu *vcpu);
+void vmx_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu);
+void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap);
+int vmx_set_tss_addr(struct kvm *kvm, unsigned int addr);
+int vmx_set_identity_map_addr(struct kvm *kvm, u64 ident_addr);
+u8 vmx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
+void vmx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+u64 vmx_get_l2_tsc_offset(struct kvm_vcpu *vcpu);
+u64 vmx_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu);
+void vmx_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset);
+void vmx_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier);
+void vmx_request_immediate_exit(struct kvm_vcpu *vcpu);
+void vmx_sched_in(struct kvm_vcpu *vcpu, int cpu);
+void vmx_update_cpu_dirty_logging(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_X86_64
+int vmx_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+		bool *expired);
+void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
+#endif
+void vmx_setup_mce(struct kvm_vcpu *vcpu);
+
+#endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1



* [PATCH v11 002/113] KVM: x86/vmx: Refactor KVM VMX module init/exit functions
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 001/113] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 003/113] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
                   ` (110 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Currently, the KVM VMX module initialization and exit functions are each a
single function.  Refactor the KVM VMX module initialization function into a
KVM-common part and a VMX part so that the TDX-specific part can be added
cleanly.  Opportunistically refactor the module exit function as well.

The current module initialization flow is:
0.) check if VMX is supported,
1.) hyper-v specific initialization,
2.) system-wide x86 specific and vendor specific initialization,
3.) final VMX specific system-wide initialization,
4.) calculate the sizes of the VMX kvm structure and the VMX vcpu structure,
5.) report those sizes to the KVM common layer and run KVM common
    initialization.

Refactor the KVM VMX module initialization function into smaller functions
plus a wrapper, so that the VMX-only logic stays in vmx.c while the code
common to VMX and TDX lives in main.c.  Introduce a wrapper function for
vmx_init().

The KVM architecture-common layer allocates struct kvm with the size reported
by the architecture-specific code.  The KVM VMX module defines its VM
structure as struct kvm_vmx { struct kvm kvm; /* VMX specific members */ }
and uses it as its per-VM structure.  The same applies to the vcpu structure.
The TDX KVM patches will define TDX-specific kvm and vcpu structures.
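
As a rough illustration of that overlay pattern (a sketch only, not part of
this patch; struct kvm_vmx and to_kvm_vmx() already exist in vmx.h), the
common layer only ever sees struct kvm, and the VMX code recovers its wrapper
with container_of():

	struct kvm_vmx {
		struct kvm kvm;		/* must stay the first member */
		/* VMX specific members follow */
	};

	static inline struct kvm_vmx *to_kvm_vmx(struct kvm *kvm)
	{
		return container_of(kvm, struct kvm_vmx, kvm);
	}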

The current module exit function is also a single function, a combination of
VMX-specific logic and common KVM logic.  Refactor it into VMX-specific logic
and KVM-common logic.  This is purely a refactoring that keeps the
VMX-specific logic in vmx.c and out of main.c.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 51 +++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 54 +++++---------------------------------
 arch/x86/kvm/vmx/x86_ops.h | 13 ++++++++-
 3 files changed, 69 insertions(+), 49 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a39d9d68b1b3..58474511a057 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -154,3 +154,54 @@ struct kvm_x86_init_ops vt_init_ops __initdata = {
 	.runtime_ops = &vt_x86_ops,
 	.pmu_ops = &intel_pmu_ops,
 };
+
+static int __init vt_init(void)
+{
+	unsigned int vcpu_size, vcpu_align;
+	int r;
+
+	if (!kvm_is_vmx_supported())
+		return -EOPNOTSUPP;
+
+	/*
+	 * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
+	 * to unwind if a later step fails.
+	 */
+	hv_init_evmcs();
+
+	r = kvm_x86_vendor_init(&vt_init_ops);
+	if (r)
+		return r;
+
+	r = vmx_init();
+	if (r)
+		goto err_vmx_init;
+
+	/*
+	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
+	 * exposed to userspace!
+	 */
+	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
+	vcpu_size = sizeof(struct vcpu_vmx);
+	vcpu_align = __alignof__(struct vcpu_vmx);
+	r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
+	if (r)
+		goto err_kvm_init;
+
+	return 0;
+
+err_kvm_init:
+	vmx_exit();
+err_vmx_init:
+	kvm_x86_vendor_exit();
+	return r;
+}
+module_init(vt_init);
+
+static void vt_exit(void)
+{
+	kvm_exit();
+	kvm_x86_vendor_exit();
+	vmx_exit();
+}
+module_exit(vt_exit);
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 63155166de45..5de1792c9902 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -553,7 +553,7 @@ static int hv_enable_l2_tlb_flush(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
-static __init void hv_init_evmcs(void)
+__init void hv_init_evmcs(void)
 {
 	int cpu;
 
@@ -589,7 +589,7 @@ static __init void hv_init_evmcs(void)
 	}
 }
 
-static void hv_reset_evmcs(void)
+void hv_reset_evmcs(void)
 {
 	struct hv_vp_assist_page *vp_ap;
 
@@ -613,10 +613,6 @@ static void hv_reset_evmcs(void)
 	vp_ap->current_nested_vmcs = 0;
 	vp_ap->enlighten_vmentry = 0;
 }
-
-#else /* IS_ENABLED(CONFIG_HYPERV) */
-static void hv_init_evmcs(void) {}
-static void hv_reset_evmcs(void) {}
 #endif /* IS_ENABLED(CONFIG_HYPERV) */
 
 /*
@@ -2738,7 +2734,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 	return 0;
 }
 
-static bool kvm_is_vmx_supported(void)
+bool kvm_is_vmx_supported(void)
 {
 	int cpu = raw_smp_processor_id();
 
@@ -8405,7 +8401,7 @@ static void vmx_cleanup_l1d_flush(void)
 	l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
 }
 
-static void __vmx_exit(void)
+void vmx_exit(void)
 {
 	allow_smaller_maxphyaddr = false;
 
@@ -8416,32 +8412,10 @@ static void __vmx_exit(void)
 	vmx_cleanup_l1d_flush();
 }
 
-static void vmx_exit(void)
-{
-	kvm_exit();
-	kvm_x86_vendor_exit();
-
-	__vmx_exit();
-}
-module_exit(vmx_exit);
-
-static int __init vmx_init(void)
+int __init vmx_init(void)
 {
 	int r, cpu;
 
-	if (!kvm_is_vmx_supported())
-		return -EOPNOTSUPP;
-
-	/*
-	 * Note, hv_init_evmcs() touches only VMX knobs, i.e. there's nothing
-	 * to unwind if a later step fails.
-	 */
-	hv_init_evmcs();
-
-	r = kvm_x86_vendor_init(&vt_init_ops);
-	if (r)
-		return r;
-
 	/*
 	 * Must be called after common x86 init so enable_ept is properly set
 	 * up. Hand the parameter mitigation value in which was stored in
@@ -8451,7 +8425,7 @@ static int __init vmx_init(void)
 	 */
 	r = vmx_setup_l1d_flush(vmentry_l1d_flush_param);
 	if (r)
-		goto err_l1d_flush;
+		return r;
 
 	vmx_setup_fb_clear_ctrl();
 
@@ -8475,21 +8449,5 @@ static int __init vmx_init(void)
 	if (!enable_ept)
 		allow_smaller_maxphyaddr = true;
 
-	/*
-	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
-	 * exposed to userspace!
-	 */
-	r = kvm_init(sizeof(struct vcpu_vmx), __alignof__(struct vcpu_vmx),
-		     THIS_MODULE);
-	if (r)
-		goto err_kvm_init;
-
 	return 0;
-
-err_kvm_init:
-	__vmx_exit();
-err_l1d_flush:
-	kvm_x86_vendor_exit();
-	return r;
 }
-module_init(vmx_init);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index e9ec4d259ff5..051b5c4b5c2f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -8,11 +8,22 @@
 
 #include "x86.h"
 
-__init int vmx_hardware_setup(void);
+#if IS_ENABLED(CONFIG_HYPERV)
+__init void hv_init_evmcs(void);
+void hv_reset_evmcs(void);
+#else /* IS_ENABLED(CONFIG_HYPERV) */
+static inline void hv_init_evmcs(void) {}
+static inline void hv_reset_evmcs(void) {}
+#endif /* IS_ENABLED(CONFIG_HYPERV) */
+
+bool kvm_is_vmx_supported(void);
+int __init vmx_init(void);
+void vmx_exit(void);
 
 extern struct kvm_x86_ops vt_x86_ops __initdata;
 extern struct kvm_x86_init_ops vt_init_ops __initdata;
 
+__init int vmx_hardware_setup(void);
 void vmx_hardware_unsetup(void);
 int vmx_check_processor_compat(void);
 int vmx_hardware_enable(void);
-- 
2.25.1



* [PATCH v11 003/113] KVM: TDX: Add placeholders for TDX VM/vcpu structure
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 001/113] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 002/113] KVM: x86/vmx: Refactor KVM VMX module init/exit functions isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module isaku.yamahata
                   ` (109 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add placeholder TDX VM/vcpu structures that overlay the VMX VM/vcpu
structures.  Initialize the VM structure size and the vcpu size/alignment so
that the x86 KVM common code knows those sizes irrespective of VMX or TDX.
Those structures will be populated as the guest-creation logic develops.

Add helper functions to check whether a VM is a guest TD, and conversion
functions between the KVM VM/vcpu and the TDX VM/vcpu structures.
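
As a rough usage sketch (hypothetical caller, not part of this patch), later
code can branch on the VM type and recover the TDX wrapper structure:

	if (is_td(kvm)) {
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);

		/* operate on TDX specific members of kvm_tdx */
	}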

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c |  8 +++---
 arch/x86/kvm/vmx/tdx.h  | 54 +++++++++++++++++++++++++++++++++++++++++
 2 files changed, 59 insertions(+), 3 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/tdx.h

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 58474511a057..18f659d1d456 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -5,6 +5,7 @@
 #include "vmx.h"
 #include "nested.h"
 #include "pmu.h"
+#include "tdx.h"
 
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
@@ -181,9 +182,10 @@ static int __init vt_init(void)
 	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
 	 * exposed to userspace!
 	 */
-	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
-	vcpu_size = sizeof(struct vcpu_vmx);
-	vcpu_align = __alignof__(struct vcpu_vmx);
+	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct kvm_tdx));
+	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
+	vcpu_align = max(__alignof__(struct vcpu_vmx),
+			 __alignof__(struct vcpu_tdx));
 	r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
 	if (r)
 		goto err_kvm_init;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
new file mode 100644
index 000000000000..060bf48ec3d6
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -0,0 +1,54 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_TDX_H
+#define __KVM_X86_TDX_H
+
+#ifdef CONFIG_INTEL_TDX_HOST
+struct kvm_tdx {
+	struct kvm kvm;
+	/* TDX specific members follow. */
+};
+
+struct vcpu_tdx {
+	struct kvm_vcpu	vcpu;
+	/* TDX specific members follow. */
+};
+
+static inline bool is_td(struct kvm *kvm)
+{
+	/*
+	 * TDX VM type isn't defined yet.
+	 * return kvm->arch.vm_type == KVM_X86_TDX_VM;
+	 */
+	return false;
+}
+
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
+{
+	return is_td(vcpu->kvm);
+}
+
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm)
+{
+	return container_of(kvm, struct kvm_tdx, kvm);
+}
+
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
+{
+	return container_of(vcpu, struct vcpu_tdx, vcpu);
+}
+#else
+struct kvm_tdx {
+	struct kvm kvm;
+};
+
+struct vcpu_tdx {
+	struct kvm_vcpu	vcpu;
+};
+
+static inline bool is_td(struct kvm *kvm) { return false; }
+static inline bool is_td_vcpu(struct kvm_vcpu *vcpu) { return false; }
+static inline struct kvm_tdx *to_kvm_tdx(struct kvm *kvm) { return NULL; }
+static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu) { return NULL; }
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_H */
-- 
2.25.1



* [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (2 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 003/113] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-13 12:31   ` Zhi Wang
  2023-01-16  3:48   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 005/113] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
                   ` (108 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX requires several initialization steps before KVM can create guest TDs:
detect the CPU feature, enable VMX (TDX is built on top of VMX), detect the
TDX module's availability, and initialize the module.  This patch implements
those steps.

There are several options for when to initialize the TDX module: A.) at
kernel module load time, or B.) when the first guest TD is created.  A.) was
chosen.  With B.), a user would only hit a TDX initialization error when
trying to create the first guest TD, and a machine that fails to initialize
the TDX module cannot boot any guest TD at all.  Such a failure is an
undesirable surprise, because the user expects the machine to be able to run
guest TDs when it actually cannot.  So A.) is better than B.).

Introduce a module parameter, enable_tdx, to explicitly enable TDX KVM
support.  It's off by default to keep the same behavior for those who don't
use TDX.  Implement the hardware_setup method to detect the TDX feature of
the CPU.  Because TDX requires all present CPUs to enable VMX (VMXON), the
x86-specific kvm_arch_post_hardware_enable_setup overrides the existing weak
symbol of the same name, which is called at KVM module initialization.
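
As a usage note (an assumption about the usual module-parameter plumbing, not
spelled out in this patch): with kvm-intel built as a module, TDX support
would be requested explicitly with something like "modprobe kvm-intel tdx=1",
or with "kvm_intel.tdx=1" on the kernel command line when built in.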

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile      |  1 +
 arch/x86/kvm/vmx/main.c    | 33 +++++++++++++++++++++++-----
 arch/x86/kvm/vmx/tdx.c     | 44 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c     | 39 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h | 10 +++++++++
 5 files changed, 122 insertions(+), 5 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/tdx.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 0e894ae23cbc..4b01ab842ab7 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -25,6 +25,7 @@ kvm-$(CONFIG_KVM_SMM)	+= smm.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
 			   svm/sev.o svm/hyperv.o
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 18f659d1d456..f5d1166d2718 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -7,6 +7,22 @@
 #include "pmu.h"
 #include "tdx.h"
 
+static bool enable_tdx __ro_after_init = IS_ENABLED(CONFIG_INTEL_TDX_HOST);
+module_param_named(tdx, enable_tdx, bool, 0444);
+
+static __init int vt_hardware_setup(void)
+{
+	int ret;
+
+	ret = vmx_hardware_setup();
+	if (ret)
+		return ret;
+
+	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
+
+	return 0;
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
 
@@ -149,7 +165,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
-	.hardware_setup = vmx_hardware_setup,
+	.hardware_setup = vt_hardware_setup,
 	.handle_intel_pt_intr = NULL,
 
 	.runtime_ops = &vt_x86_ops,
@@ -182,10 +198,17 @@ static int __init vt_init(void)
 	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
 	 * exposed to userspace!
 	 */
-	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct kvm_tdx));
-	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
-	vcpu_align = max(__alignof__(struct vcpu_vmx),
-			 __alignof__(struct vcpu_tdx));
+	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
+	vcpu_size = sizeof(struct vcpu_vmx);
+	vcpu_align = __alignof__(struct vcpu_vmx);
+	if (enable_tdx) {
+		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
+					   sizeof(struct kvm_tdx));
+		vcpu_size = max_t(unsigned int, vcpu_size,
+				  sizeof(struct vcpu_tdx));
+		vcpu_align = max_t(unsigned int, vcpu_align,
+				   __alignof__(struct vcpu_tdx));
+	}
 	r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
 	if (r)
 		goto err_kvm_init;
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
new file mode 100644
index 000000000000..d7a276118940
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -0,0 +1,44 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/cpu.h>
+
+#include <asm/tdx.h>
+
+#include "capabilities.h"
+#include "x86_ops.h"
+#include "tdx.h"
+#include "x86.h"
+
+#undef pr_fmt
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
+
+static int __init tdx_module_setup(void)
+{
+	int ret;
+
+	ret = tdx_enable();
+	if (ret) {
+		pr_info("Failed to initialize TDX module.\n");
+		return ret;
+	}
+
+	pr_info("TDX is supported.\n");
+	return 0;
+}
+
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
+{
+	int r;
+
+	if (!enable_ept) {
+		pr_warn("Cannot enable TDX with EPT disabled\n");
+		return -EINVAL;
+	}
+
+	/* TDX requires VMX. */
+	r = vmxon_all();
+	if (!r)
+		r = tdx_module_setup();
+	vmxoff_all();
+
+	return r;
+}
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5de1792c9902..5dc7687dcf16 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8147,6 +8147,45 @@ static unsigned int vmx_handle_intel_pt_intr(void)
 	return 1;
 }
 
+static __init void vmxon(void *arg)
+{
+	int cpu = raw_smp_processor_id();
+	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
+	atomic_t *failed = arg;
+	int r;
+
+	if (cr4_read_shadow() & X86_CR4_VMXE) {
+		r = -EBUSY;
+		goto out;
+	}
+
+	r = kvm_cpu_vmxon(phys_addr);
+out:
+	if (r)
+		atomic_inc(failed);
+}
+
+__init int vmxon_all(void)
+{
+	atomic_t failed = ATOMIC_INIT(0);
+
+	on_each_cpu(vmxon, &failed, 1);
+
+	if (atomic_read(&failed))
+		return -EBUSY;
+	return 0;
+}
+
+static __init void vmxoff(void *junk)
+{
+	cpu_vmxoff();
+}
+
+__init void vmxoff_all(void)
+{
+	on_each_cpu(vmxoff, NULL, 1);
+}
+
 static __init void vmx_setup_user_return_msrs(void)
 {
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 051b5c4b5c2f..fbc57fcbdd21 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -20,6 +20,10 @@ bool kvm_is_vmx_supported(void);
 int __init vmx_init(void);
 void vmx_exit(void);
 
+__init int vmxon_all(void);
+__init void vmxoff_all(void);
+__init int vmx_hardware_setup(void);
+
 extern struct kvm_x86_ops vt_x86_ops __initdata;
 extern struct kvm_x86_init_ops vt_init_ops __initdata;
 
@@ -133,4 +137,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 #endif
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
+#ifdef CONFIG_INTEL_TDX_HOST
+int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+#else
+static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
+#endif
+
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1



* [PATCH v11 005/113] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (3 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-17  3:31   ` Binbin Wu
  2023-01-12 16:31 ` [PATCH v11 006/113] KVM: TDX: Make TDX VM type supported isaku.yamahata
                   ` (107 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson, Xiaoyao Li

From: Sean Christopherson <sean.j.christopherson@intel.com>

Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
some operations (e.g., memory read/write and register state access).

Introduce vm_type so that x86 KVM can track the type of each VM.  Other arch
KVMs already use vm_type: KVM_CREATE_VM accepts a vm type and the x86 KVM
callback vm_init accepts vm_type, so follow them.  Further, a different
policy can be applied based on vm_type.  Define KVM_X86_DEFAULT_VM for the
default VM type and KVM_X86_TDX_VM for Intel TDX VMs.  The wrapper function
will be defined as "bool is_td(kvm) { return kvm->arch.vm_type ==
KVM_X86_TDX_VM; }"

Add a capability, KVM_CAP_VM_TYPES, to allow the device model, e.g. qemu, to
query which VM types are supported by KVM.  This approach (introducing a new
capability and adding vm_type) is chosen to align with the other arch KVMs
that already have VM types.  The other arch KVMs use different names to query
their supported vm types and there is no common name for it, so a new name
was chosen.
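
For illustration, a hypothetical userspace snippet (not part of this patch)
that queries the bitmap and creates a TD-type VM could look roughly like
this, assuming the new constants are visible through the uapi headers:

	#include <fcntl.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static int create_td_vm(void)
	{
		int kvm_fd, types;

		kvm_fd = open("/dev/kvm", O_RDWR);
		if (kvm_fd < 0)
			return -1;

		/* Bitmap of supported VM types; bit n set => type n is supported. */
		types = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_VM_TYPES);
		if (types < 0 || !(types & (1 << KVM_X86_TDX_VM)))
			return -1;

		/* The vm type is passed as the KVM_CREATE_VM machine type argument. */
		return ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
	}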

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 Documentation/virt/kvm/api.rst        | 21 +++++++++++++++++++++
 arch/x86/include/asm/kvm-x86-ops.h    |  1 +
 arch/x86/include/asm/kvm_host.h       |  2 ++
 arch/x86/include/uapi/asm/kvm.h       |  3 +++
 arch/x86/kvm/svm/svm.c                |  6 ++++++
 arch/x86/kvm/vmx/main.c               |  1 +
 arch/x86/kvm/vmx/tdx.h                |  6 +-----
 arch/x86/kvm/vmx/vmx.c                |  5 +++++
 arch/x86/kvm/vmx/x86_ops.h            |  1 +
 arch/x86/kvm/x86.c                    |  9 ++++++++-
 include/uapi/linux/kvm.h              |  1 +
 tools/arch/x86/include/uapi/asm/kvm.h |  3 +++
 tools/include/uapi/linux/kvm.h        |  1 +
 13 files changed, 54 insertions(+), 6 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 98459999273c..d2baa05f7c04 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -147,10 +147,31 @@ described as 'basic' will be available.
 The new VM has no virtual cpus and no memory.
 You probably want to use 0 as machine type.
 
+X86:
+^^^^
+
+Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
+bitmap of supported vm types. The 1-setting of bit @n means vm type with
+value @n is supported.
+
+S390:
+^^^^^
+
 In order to create user controlled virtual machines on S390, check
 KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
 privileged user (CAP_SYS_ADMIN).
 
+MIPS:
+^^^^^
+
+To use hardware assisted virtualization on MIPS (VZ ASE) rather than
+the default trap & emulate implementation (which changes the virtual
+memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
+flag KVM_VM_MIPS_VZ.
+
+ARM64:
+^^^^^^
+
 On arm64, the physical address size for a VM (IPA Size limit) is limited
 to 40bits by default. The limit can be configured if the host supports the
 extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index dba2909e5ae2..59181b12ad70 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
 KVM_X86_OP(hardware_unsetup)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
+KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
 KVM_X86_OP_OPTIONAL(vm_destroy)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 60dc8f1631de..c6ccfce7dc9e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1212,6 +1212,7 @@ enum kvm_apicv_inhibit {
 };
 
 struct kvm_arch {
+	unsigned long vm_type;
 	unsigned long n_used_mmu_pages;
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
@@ -1536,6 +1537,7 @@ struct kvm_x86_ops {
 	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
+	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index e48deab8901d..a4cca6bc6b06 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -529,4 +529,7 @@ struct kvm_pmu_event_filter {
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_TDX_VM		1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 799b24801d31..55f2e0a9b0f6 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4682,6 +4682,11 @@ static void svm_vm_destroy(struct kvm *kvm)
 	sev_vm_destroy(kvm);
 }
 
+static bool svm_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM;
+}
+
 static int svm_vm_init(struct kvm *kvm)
 {
 	if (!pause_filter_count || !pause_filter_thresh)
@@ -4710,6 +4715,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.vcpu_free = svm_vcpu_free,
 	.vcpu_reset = svm_vcpu_reset,
 
+	.is_vm_type_supported = svm_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_svm),
 	.vm_init = svm_vm_init,
 	.vm_destroy = svm_vm_destroy,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f5d1166d2718..3b24e32077d6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -34,6 +34,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.hardware_disable = vmx_hardware_disable,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
+	.is_vm_type_supported = vmx_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vmx_vm_init,
 	.vm_destroy = vmx_vm_destroy,
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 060bf48ec3d6..473013265bd8 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -15,11 +15,7 @@ struct vcpu_tdx {
 
 static inline bool is_td(struct kvm *kvm)
 {
-	/*
-	 * TDX VM type isn't defined yet.
-	 * return kvm->arch.vm_type == KVM_X86_TDX_VM;
-	 */
-	return false;
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
 }
 
 static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5dc7687dcf16..f1dea386d6c2 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7501,6 +7501,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	return err;
 }
 
+bool vmx_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM;
+}
+
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index fbc57fcbdd21..6980126bc32a 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -32,6 +32,7 @@ void vmx_hardware_unsetup(void);
 int vmx_check_processor_compat(void);
 int vmx_hardware_enable(void);
 void vmx_hardware_disable(void);
+bool vmx_is_vm_type_supported(unsigned long type);
 int vmx_vm_init(struct kvm *kvm);
 void vmx_vm_destroy(struct kvm *kvm);
 int vmx_vcpu_precreate(struct kvm *kvm);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 07e8ab791e37..68bff699096a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4535,6 +4535,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_X86_NOTIFY_VMEXIT:
 		r = kvm_caps.has_notify_vmexit;
 		break;
+	case KVM_CAP_VM_TYPES:
+		r = BIT(KVM_X86_DEFAULT_VM);
+		if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
+			r |= BIT(KVM_X86_TDX_VM);
+		break;
 	default:
 		break;
 	}
@@ -12126,9 +12131,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	int ret;
 	unsigned long flags;
 
-	if (type)
+	if (!static_call(kvm_x86_is_vm_type_supported)(type))
 		return -EINVAL;
 
+	kvm->arch.vm_type = type;
+
 	ret = kvm_page_track_init(kvm);
 	if (ret)
 		goto out;
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 679d293ece0f..2a47fd0e51fd 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1212,6 +1212,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
 #define KVM_CAP_MEMORY_ATTRIBUTES 226
+#define KVM_CAP_VM_TYPES 227
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 649e50a8f9dd..b67d2d59eb6c 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -524,4 +524,7 @@ struct kvm_pmu_event_filter {
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_TDX_VM		1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
index 20522d4ba1e0..792a4889d1f4 100644
--- a/tools/include/uapi/linux/kvm.h
+++ b/tools/include/uapi/linux/kvm.h
@@ -1175,6 +1175,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
 #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
 #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
+#define KVM_CAP_VM_TYPES 227
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 006/113] KVM: TDX: Make TDX VM type supported
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (4 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 005/113] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 007/113] [MARKER] The start of TDX KVM patch series: TDX architectural definitions isaku.yamahata
                   ` (106 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

NOTE: This patch is placed at this point in the patch series so that
developers can test the code part way through the series, even though the
series doesn't provide functional features until all of its patches are
applied.  When merging this patch series, this patch can be moved to the
end.

As the first step of TDX VM support, report to the device model, e.g.
qemu, that the TDX VM type is supported.  The callback that creates a
guest TD is the vm_init callback invoked from KVM_CREATE_VM.
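
As a hedged sketch (not part of this patch), a device model would request
a TD roughly as below; with only this patch applied the request still
fails because vt_vm_init() returns -EOPNOTSUPP for a TD:

  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static int create_td(void)
  {
          int kvm_fd = open("/dev/kvm", O_RDWR | O_CLOEXEC);

          if (kvm_fd < 0)
                  return kvm_fd;

          /* With only this patch, vt_vm_init() makes this fail. */
          return ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);
  }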

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 18 ++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     |  6 ++++++
 arch/x86/kvm/vmx/vmx.c     |  5 -----
 arch/x86/kvm/vmx/x86_ops.h |  3 ++-
 4 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 3b24e32077d6..e3c5e9250990 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -10,6 +10,12 @@
 static bool enable_tdx __ro_after_init = IS_ENABLED(CONFIG_INTEL_TDX_HOST);
 module_param_named(tdx, enable_tdx, bool, 0444);
 
+static bool vt_is_vm_type_supported(unsigned long type)
+{
+	return type == KVM_X86_DEFAULT_VM ||
+		(enable_tdx && tdx_is_vm_type_supported(type));
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -23,6 +29,14 @@ static __init int vt_hardware_setup(void)
 	return 0;
 }
 
+static int vt_vm_init(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
+
+	return vmx_vm_init(kvm);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
 
@@ -34,9 +48,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.hardware_disable = vmx_hardware_disable,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
-	.is_vm_type_supported = vmx_is_vm_type_supported,
+	.is_vm_type_supported = vt_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
-	.vm_init = vmx_vm_init,
+	.vm_init = vt_vm_init,
 	.vm_destroy = vmx_vm_destroy,
 
 	.vcpu_precreate = vmx_vcpu_precreate,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d7a276118940..6c7d9ec53046 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -25,6 +25,12 @@ static int __init tdx_module_setup(void)
 	return 0;
 }
 
+bool tdx_is_vm_type_supported(unsigned long type)
+{
+	/* enable_tdx check is done by the caller. */
+	return type == KVM_X86_TDX_VM;
+}
+
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 {
 	int r;
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index f1dea386d6c2..5dc7687dcf16 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -7501,11 +7501,6 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
 	return err;
 }
 
-bool vmx_is_vm_type_supported(unsigned long type)
-{
-	return type == KVM_X86_DEFAULT_VM;
-}
-
 #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
 
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 6980126bc32a..8fd34842a06b 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -32,7 +32,6 @@ void vmx_hardware_unsetup(void);
 int vmx_check_processor_compat(void);
 int vmx_hardware_enable(void);
 void vmx_hardware_disable(void);
-bool vmx_is_vm_type_supported(unsigned long type);
 int vmx_vm_init(struct kvm *kvm);
 void vmx_vm_destroy(struct kvm *kvm);
 int vmx_vcpu_precreate(struct kvm *kvm);
@@ -140,8 +139,10 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_INTEL_TDX_HOST
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+bool tdx_is_vm_type_supported(unsigned long type);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
+static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 007/113] [MARKER] The start of TDX KVM patch series: TDX architectural definitions
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (5 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 006/113] KVM: TDX: Make TDX VM type supported isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 008/113] KVM: TDX: Define " isaku.yamahata
                   ` (105 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit marks the start of the section of the patch series for
TDX architectural definitions.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 .../virt/kvm/intel-tdx-layer-status.rst       | 28 +++++++++++++++++++
 1 file changed, 28 insertions(+)
 create mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
new file mode 100644
index 000000000000..db32e89e16e9
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -0,0 +1,28 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Layer status
+============
+What qemu can do
+----------------
+- TDX VM TYPE is exposed to Qemu.
+- Qemu can try to create a VM of the TDX VM type, and the creation then fails.
+
+Patch Layer status
+------------------
+  Patch layer                          Status
+* TDX, VMX coexistence:                 Applied
+* TDX architectural definitions:        Applying
+* TD VM creation/destruction:           Not yet
+* TD vcpu creation/destruction:         Not yet
+* TDX EPT violation:                    Not yet
+* TD finalization:                      Not yet
+* TD vcpu enter/exit:                   Not yet
+* TD vcpu interrupts/exit/hypercall:    Not yet
+
+* KVM MMU GPA shared bits:              Not yet
+* KVM TDP refactoring for TDX:          Not yet
+* KVM TDP MMU hooks:                    Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 008/113] KVM: TDX: Define TDX architectural definitions
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (6 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 007/113] [MARKER] The start of TDX KVM patch series: TDX architectural definitions isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 009/113] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
                   ` (104 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

Define architectural definitions for KVM to issue the TDX SEAMCALLs.

The structures and values are architecturally defined in the ABI Reference
chapter of the TDX module specification.
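
As a hedged illustration of the units used by td_params.tsc_frequency (the
2.1 GHz value below is an arbitrary example, not taken from the spec):

  /*
   * A 2.1 GHz guest TSC, expressed in kHz as KVM tracks it, becomes
   * 84 units of 25 MHz.
   */
  u16 tsc_frequency = TDX_TSC_KHZ_TO_25MHZ(2100 * 1000);  /* == 84 */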

Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx_arch.h | 166 ++++++++++++++++++++++++++++++++++++
 1 file changed, 166 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_arch.h

diff --git a/arch/x86/kvm/vmx/tdx_arch.h b/arch/x86/kvm/vmx/tdx_arch.h
new file mode 100644
index 000000000000..18604734fb14
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_arch.h
@@ -0,0 +1,166 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_ARCH_H
+#define __KVM_X86_TDX_ARCH_H
+
+#include <linux/types.h>
+
+/*
+ * TDX SEAMCALL API function leaves
+ */
+#define TDH_VP_ENTER			0
+#define TDH_MNG_ADDCX			1
+#define TDH_MEM_PAGE_ADD		2
+#define TDH_MEM_SEPT_ADD		3
+#define TDH_VP_ADDCX			4
+#define TDH_MEM_PAGE_RELOCATE		5
+#define TDH_MEM_PAGE_AUG		6
+#define TDH_MEM_RANGE_BLOCK		7
+#define TDH_MNG_KEY_CONFIG		8
+#define TDH_MNG_CREATE			9
+#define TDH_VP_CREATE			10
+#define TDH_MNG_RD			11
+#define TDH_MR_EXTEND			16
+#define TDH_MR_FINALIZE			17
+#define TDH_VP_FLUSH			18
+#define TDH_MNG_VPFLUSHDONE		19
+#define TDH_MNG_KEY_FREEID		20
+#define TDH_MNG_INIT			21
+#define TDH_VP_INIT			22
+#define TDH_VP_RD			26
+#define TDH_MNG_KEY_RECLAIMID		27
+#define TDH_PHYMEM_PAGE_RECLAIM		28
+#define TDH_MEM_PAGE_REMOVE		29
+#define TDH_MEM_SEPT_REMOVE		30
+#define TDH_MEM_TRACK			38
+#define TDH_MEM_RANGE_UNBLOCK		39
+#define TDH_PHYMEM_CACHE_WB		40
+#define TDH_PHYMEM_PAGE_WBINVD		41
+#define TDH_VP_WR			43
+#define TDH_SYS_LP_SHUTDOWN		44
+
+#define TDG_VP_VMCALL_GET_TD_VM_CALL_INFO		0x10000
+#define TDG_VP_VMCALL_MAP_GPA				0x10001
+#define TDG_VP_VMCALL_GET_QUOTE				0x10002
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR		0x10003
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT	0x10004
+
+/* TDX control structure (TDR/TDCS/TDVPS) field access codes */
+#define TDX_NON_ARCH			BIT_ULL(63)
+#define TDX_CLASS_SHIFT			56
+#define TDX_FIELD_MASK			GENMASK_ULL(31, 0)
+
+#define __BUILD_TDX_FIELD(non_arch, class, field)	\
+	(((non_arch) ? TDX_NON_ARCH : 0) |		\
+	 ((u64)(class) << TDX_CLASS_SHIFT) |		\
+	 ((u64)(field) & TDX_FIELD_MASK))
+
+#define BUILD_TDX_FIELD(class, field)			\
+	__BUILD_TDX_FIELD(false, (class), (field))
+
+#define BUILD_TDX_FIELD_NON_ARCH(class, field)		\
+	__BUILD_TDX_FIELD(true, (class), (field))
+
+
+/* Class code for TD */
+#define TD_CLASS_EXECUTION_CONTROLS	17ULL
+
+/* Class code for TDVPS */
+#define TDVPS_CLASS_VMCS		0ULL
+#define TDVPS_CLASS_GUEST_GPR		16ULL
+#define TDVPS_CLASS_OTHER_GUEST		17ULL
+#define TDVPS_CLASS_MANAGEMENT		32ULL
+
+enum tdx_tdcs_execution_control {
+	TD_TDCS_EXEC_TSC_OFFSET = 10,
+};
+
+/* @field is any of enum tdx_tdcs_execution_control */
+#define TDCS_EXEC(field)		BUILD_TDX_FIELD(TD_CLASS_EXECUTION_CONTROLS, (field))
+
+/* @field is the VMCS field encoding */
+#define TDVPS_VMCS(field)		BUILD_TDX_FIELD(TDVPS_CLASS_VMCS, (field))
+
+enum tdx_vcpu_guest_other_state {
+	TD_VCPU_STATE_DETAILS_NON_ARCH = 0x100,
+};
+
+union tdx_vcpu_state_details {
+	struct {
+		u64 vmxip	: 1;
+		u64 reserved	: 63;
+	};
+	u64 full;
+};
+
+/* @field is any of enum tdx_guest_other_state */
+#define TDVPS_STATE(field)		BUILD_TDX_FIELD(TDVPS_CLASS_OTHER_GUEST, (field))
+#define TDVPS_STATE_NON_ARCH(field)	BUILD_TDX_FIELD_NON_ARCH(TDVPS_CLASS_OTHER_GUEST, (field))
+
+/* Management class fields */
+enum tdx_vcpu_guest_management {
+	TD_VCPU_PEND_NMI = 11,
+};
+
+/* @field is any of enum tdx_vcpu_guest_management */
+#define TDVPS_MANAGEMENT(field)		BUILD_TDX_FIELD(TDVPS_CLASS_MANAGEMENT, (field))
+
+#define TDX_EXTENDMR_CHUNKSIZE		256
+
+struct tdx_cpuid_value {
+	u32 eax;
+	u32 ebx;
+	u32 ecx;
+	u32 edx;
+} __packed;
+
+#define TDX_TD_ATTRIBUTE_DEBUG		BIT_ULL(0)
+#define TDX_TD_ATTRIBUTE_PKS		BIT_ULL(30)
+#define TDX_TD_ATTRIBUTE_KL		BIT_ULL(31)
+#define TDX_TD_ATTRIBUTE_PERFMON	BIT_ULL(63)
+
+/*
+ * TD_PARAMS is provided as an input to TDH_MNG_INIT, the size of which is 1024B.
+ */
+struct td_params {
+	u64 attributes;
+	u64 xfam;
+	u32 max_vcpus;
+	u32 reserved0;
+
+	u64 eptp_controls;
+	u64 exec_controls;
+	u16 tsc_frequency;
+	u8  reserved1[38];
+
+	u64 mrconfigid[6];
+	u64 mrowner[6];
+	u64 mrownerconfig[6];
+	u64 reserved2[4];
+
+	union {
+		struct tdx_cpuid_value cpuid_values[0];
+		u8 reserved3[768];
+	};
+} __packed __aligned(1024);
+
+/*
+ * Guest uses MAX_PA for GPAW when set.
+ * 0: GPA.SHARED bit is GPA[47]
+ * 1: GPA.SHARED bit is GPA[51]
+ */
+#define TDX_EXEC_CONTROL_MAX_GPAW      BIT_ULL(0)
+
+/*
+ * TDX requires the frequency to be defined in units of 25MHz, which is the
+ * frequency of the core crystal clock on TDX-capable platforms, i.e. the TDX
+ * module can only program frequencies that are multiples of 25MHz.  The
+ * frequency must be between 100MHz and 10GHz (inclusive).
+ */
+#define TDX_TSC_KHZ_TO_25MHZ(tsc_in_khz)	((tsc_in_khz) / (25 * 1000))
+#define TDX_TSC_25MHZ_TO_KHZ(tsc_in_25mhz)	((tsc_in_25mhz) * (25 * 1000))
+#define TDX_MIN_TSC_FREQUENCY_KHZ		(100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ		(10 * 1000 * 1000)
+
+#endif /* __KVM_X86_TDX_ARCH_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 009/113] KVM: TDX: Add TDX "architectural" error codes
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (7 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 008/113] KVM: TDX: Define " isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 010/113] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module isaku.yamahata
                   ` (103 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Add error codes for the TDX SEAMCALLs, both for the VMM side (TDH
SEAMCALLs) and for the TDX guest side (TDG.VP.VMCALL).  KVM issues the TDX
SEAMCALLs and checks their error codes.  KVM also handles hypercalls from
the TDX guest and may return an error, so error codes for the TDX guest
are needed as well.

TDX SEAMCALLs use bits 31:0 to return additional information, so these
error codes only exactly match RAX[63:32].  Error codes for TDG.VP.VMCALL
are defined by the TDX Guest-Host-Communication Interface spec.
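
A minimal sketch of the intended comparison, assuming a SEAMCALL wrapper
such as the ones added in the next patch and an illustrative tdr argument:

  u64 err = tdh_mng_key_config(tdr);

  /* RAX[63:32] is the status class; RAX[31:0] carries operand detail. */
  if ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY)
          return -EAGAIN;  /* e.g. let the caller retry on contention */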

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx_errno.h | 38 ++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_errno.h

diff --git a/arch/x86/kvm/vmx/tdx_errno.h b/arch/x86/kvm/vmx/tdx_errno.h
new file mode 100644
index 000000000000..389b1b53da25
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_errno.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* architectural status code for SEAMCALL */
+
+#ifndef __KVM_X86_TDX_ERRNO_H
+#define __KVM_X86_TDX_ERRNO_H
+
+#define TDX_SEAMCALL_STATUS_MASK		0xFFFFFFFF00000000ULL
+
+/*
+ * TDX SEAMCALL Status Codes (returned in RAX)
+ */
+#define TDX_SUCCESS				0x0000000000000000ULL
+#define TDX_NON_RECOVERABLE_VCPU		0x4000000100000000ULL
+#define TDX_INTERRUPTED_RESUMABLE		0x8000000300000000ULL
+#define TDX_OPERAND_BUSY			0x8000020000000000ULL
+#define TDX_VCPU_NOT_ASSOCIATED			0x8000070200000000ULL
+#define TDX_KEY_GENERATION_FAILED		0x8000080000000000ULL
+#define TDX_KEY_STATE_INCORRECT			0xC000081100000000ULL
+#define TDX_KEY_CONFIGURED			0x0000081500000000ULL
+#define TDX_NO_HKID_READY_TO_WBCACHE		0x0000082100000000ULL
+#define TDX_EPT_WALK_FAILED			0xC0000B0000000000ULL
+
+/*
+ * TDG.VP.VMCALL Status Codes (returned in R10)
+ */
+#define TDG_VP_VMCALL_SUCCESS			0x0000000000000000ULL
+#define TDG_VP_VMCALL_RETRY			0x0000000000000001ULL
+#define TDG_VP_VMCALL_INVALID_OPERAND		0x8000000000000000ULL
+#define TDG_VP_VMCALL_TDREPORT_FAILED		0x8000000000000001ULL
+
+/*
+ * TDX module operand ID, appears in 31:0 part of error code as
+ * detail information
+ */
+#define TDX_OPERAND_ID_RCX			0x01
+#define TDX_OPERAND_ID_SEPT			0x92
+
+#endif /* __KVM_X86_TDX_ERRNO_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 010/113] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (8 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 009/113] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 011/113] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
                   ` (102 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

A VMM interacts with the TDX module using a new instruction (SEAMCALL).  A
TDX VMM uses SEAMCALLs where a VMX VMM would have directly interacted with
VMX instructions.  For instance, a TDX VMM does not have full access to the
VM control structure corresponding to the VMX VMCS.  Instead, the VMM
induces the TDX module to act on its behalf via SEAMCALLs.

Export __seamcall and define C wrapper functions for SEAMCALLs for
readability.

Some SEAMCALL APIs donate host pages to the TDX module or to a guest TD,
and the donated pages are encrypted.  Some of those SEAMCALLs flush the
cache lines (typically with the movdir64b instruction) and some don't.
Those that don't clear the cache lines require the VMM to flush them to
avoid cache-line aliasing.
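
A hedged sketch of a caller donating a control page with one of these
wrappers; tdr_pa is an illustrative TDR physical address, and this
particular wrapper flushes the donated page's cache lines itself:

  struct page *page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
  u64 err;

  if (!page)
          return -ENOMEM;

  /* tdh_mng_addcx() clflushes the donated page before the SEAMCALL. */
  err = tdh_mng_addcx(tdr_pa, page_to_phys(page));
  if (err) {
          __free_page(page);
          return -EIO;
  }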

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h       |   2 +
 arch/x86/kvm/vmx/tdx_ops.h       | 185 +++++++++++++++++++++++++++++++
 arch/x86/virt/vmx/tdx/seamcall.S |   2 +
 3 files changed, 189 insertions(+)
 create mode 100644 arch/x86/kvm/vmx/tdx_ops.h

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 5c5ecfddb15b..0f71d3856ede 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -107,6 +107,8 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
 int tdx_enable(void);
+u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
+	       struct tdx_module_output *out);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_enable(void)  { return -EINVAL; }
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
new file mode 100644
index 000000000000..85adbf49c277
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -0,0 +1,185 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/* constants/data definitions for TDX SEAMCALLs */
+
+#ifndef __KVM_X86_TDX_OPS_H
+#define __KVM_X86_TDX_OPS_H
+
+#include <linux/compiler.h>
+
+#include <asm/cacheflush.h>
+#include <asm/asm.h>
+#include <asm/kvm_host.h>
+
+#include "tdx_errno.h"
+#include "tdx_arch.h"
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
+{
+	clflush_cache_range(__va(addr), PAGE_SIZE);
+	return __seamcall(TDH_MNG_ADDCX, addr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source,
+				   struct tdx_module_output *out)
+{
+	clflush_cache_range(__va(hpa), PAGE_SIZE);
+	return __seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
+}
+
+static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
+				   struct tdx_module_output *out)
+{
+	clflush_cache_range(__va(page), PAGE_SIZE);
+	return __seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
+}
+
+static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
+				      struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MEM_SEPT_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_vp_addcx(hpa_t tdvpr, hpa_t addr)
+{
+	clflush_cache_range(__va(addr), PAGE_SIZE);
+	return __seamcall(TDH_VP_ADDCX, addr, tdvpr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_page_relocate(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+					struct tdx_module_output *out)
+{
+	clflush_cache_range(__va(hpa), PAGE_SIZE);
+	return __seamcall(TDH_MEM_PAGE_RELOCATE, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
+				   struct tdx_module_output *out)
+{
+	clflush_cache_range(__va(hpa), PAGE_SIZE);
+	return __seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
+}
+
+static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
+				      struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_config(hpa_t tdr)
+{
+	return __seamcall(TDH_MNG_KEY_CONFIG, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_create(hpa_t tdr, int hkid)
+{
+	clflush_cache_range(__va(tdr), PAGE_SIZE);
+	return __seamcall(TDH_MNG_CREATE, tdr, hkid, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_create(hpa_t tdr, hpa_t tdvpr)
+{
+	clflush_cache_range(__va(tdvpr), PAGE_SIZE);
+	return __seamcall(TDH_VP_CREATE, tdvpr, tdr, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_rd(hpa_t tdr, u64 field, struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MNG_RD, tdr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mr_extend(hpa_t tdr, gpa_t gpa,
+				struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MR_EXTEND, gpa, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_mr_finalize(hpa_t tdr)
+{
+	return __seamcall(TDH_MR_FINALIZE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_flush(hpa_t tdvpr)
+{
+	return __seamcall(TDH_VP_FLUSH, tdvpr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_vpflushdone(hpa_t tdr)
+{
+	return __seamcall(TDH_MNG_VPFLUSHDONE, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_key_freeid(hpa_t tdr)
+{
+	return __seamcall(TDH_MNG_KEY_FREEID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mng_init(hpa_t tdr, hpa_t td_params,
+			       struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MNG_INIT, tdr, td_params, 0, 0, out);
+}
+
+static inline u64 tdh_vp_init(hpa_t tdvpr, u64 rcx)
+{
+	return __seamcall(TDH_VP_INIT, tdvpr, rcx, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_rd(hpa_t tdvpr, u64 field,
+			    struct tdx_module_output *out)
+{
+	return __seamcall(TDH_VP_RD, tdvpr, field, 0, 0, out);
+}
+
+static inline u64 tdh_mng_key_reclaimid(hpa_t tdr)
+{
+	return __seamcall(TDH_MNG_KEY_RECLAIMID, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_reclaim(hpa_t page,
+					  struct tdx_module_output *out)
+{
+	return __seamcall(TDH_PHYMEM_PAGE_RECLAIM, page, 0, 0, 0, out);
+}
+
+static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
+				      struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_sys_lp_shutdown(void)
+{
+	return __seamcall(TDH_SYS_LP_SHUTDOWN, 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_track(hpa_t tdr)
+{
+	return __seamcall(TDH_MEM_TRACK, tdr, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
+					struct tdx_module_output *out)
+{
+	return __seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, out);
+}
+
+static inline u64 tdh_phymem_cache_wb(bool resume)
+{
+	return __seamcall(TDH_PHYMEM_CACHE_WB, resume ? 1 : 0, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_phymem_page_wbinvd(hpa_t page)
+{
+	return __seamcall(TDH_PHYMEM_PAGE_WBINVD, page, 0, 0, 0, NULL);
+}
+
+static inline u64 tdh_vp_wr(hpa_t tdvpr, u64 field, u64 val, u64 mask,
+			    struct tdx_module_output *out)
+{
+	return __seamcall(TDH_VP_WR, tdvpr, field, val, mask, out);
+}
+#endif /* CONFIG_INTEL_TDX_HOST */
+
+#endif /* __KVM_X86_TDX_OPS_H */
diff --git a/arch/x86/virt/vmx/tdx/seamcall.S b/arch/x86/virt/vmx/tdx/seamcall.S
index f81be6b9c133..b90a7fe05494 100644
--- a/arch/x86/virt/vmx/tdx/seamcall.S
+++ b/arch/x86/virt/vmx/tdx/seamcall.S
@@ -1,5 +1,6 @@
 /* SPDX-License-Identifier: GPL-2.0 */
 #include <linux/linkage.h>
+#include <asm/export.h>
 #include <asm/frame.h>
 
 #include "tdxcall.S"
@@ -50,3 +51,4 @@ SYM_FUNC_START(__seamcall)
 	FRAME_END
 	RET
 SYM_FUNC_END(__seamcall)
+EXPORT_SYMBOL_GPL(__seamcall)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 011/113] KVM: TDX: Add helper functions to print TDX SEAMCALL error
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (9 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 010/113] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 012/113] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction isaku.yamahata
                   ` (101 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add helper functions to print out errors from the TDX module in a uniform
manner.
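
A minimal sketch of the intended call pattern (tdr_pa is an illustrative
TDR physical address):

  struct tdx_module_output out;
  u64 err;

  err = tdh_mng_rd(tdr_pa, TDCS_EXEC(TD_TDCS_EXEC_TSC_OFFSET), &out);
  if (WARN_ON_ONCE(err)) {
          pr_tdx_error(TDH_MNG_RD, err, &out);
          return -EIO;
  }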

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Makefile        |  2 +-
 arch/x86/kvm/vmx/tdx_error.c | 21 +++++++++++++++++++++
 arch/x86/kvm/vmx/tdx_ops.h   |  3 +++
 3 files changed, 25 insertions(+), 1 deletion(-)
 create mode 100644 arch/x86/kvm/vmx/tdx_error.c

diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
index 4b01ab842ab7..e3354b784e10 100644
--- a/arch/x86/kvm/Makefile
+++ b/arch/x86/kvm/Makefile
@@ -25,7 +25,7 @@ kvm-$(CONFIG_KVM_SMM)	+= smm.o
 kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
 			   vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
 kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
-kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
+kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o vmx/tdx_error.o
 
 kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
 			   svm/sev.o svm/hyperv.o
diff --git a/arch/x86/kvm/vmx/tdx_error.c b/arch/x86/kvm/vmx/tdx_error.c
new file mode 100644
index 000000000000..574b72d34e1e
--- /dev/null
+++ b/arch/x86/kvm/vmx/tdx_error.c
@@ -0,0 +1,21 @@
+// SPDX-License-Identifier: GPL-2.0
+/* functions to record TDX SEAMCALL error */
+
+#include <linux/kernel.h>
+#include <linux/bug.h>
+
+#include "tdx_ops.h"
+
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out)
+{
+	if (!out) {
+		pr_err_ratelimited("SEAMCALL[%lld] failed: 0x%llx\n",
+				   op, error_code);
+		return;
+	}
+
+	pr_err_ratelimited("SEAMCALL[%lld] failed: 0x%llx RCX 0x%llx, RDX 0x%llx,"
+			   " R8 0x%llx, R9 0x%llx, R10 0x%llx, R11 0x%llx\n",
+			   op, error_code,
+			   out->rcx, out->rdx, out->r8, out->r9, out->r10, out->r11);
+}
diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 85adbf49c277..8cc2f01c509b 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -9,12 +9,15 @@
 #include <asm/cacheflush.h>
 #include <asm/asm.h>
 #include <asm/kvm_host.h>
+#include <asm/tdx.h>
 
 #include "tdx_errno.h"
 #include "tdx_arch.h"
 
 #ifdef CONFIG_INTEL_TDX_HOST
 
+void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
+
 static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 {
 	clflush_cache_range(__va(addr), PAGE_SIZE);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 012/113] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (10 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 011/113] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id isaku.yamahata
                   ` (100 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit marks the start of the section of the patch series for
TD VM creation/destruction.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index db32e89e16e9..221372cfb4af 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -15,8 +15,8 @@ Patch Layer status
 ------------------
   Patch layer                          Status
 * TDX, VMX coexistence:                 Applied
-* TDX architectural definitions:        Applying
-* TD VM creation/destruction:           Not yet
+* TDX architectural definitions:        Applied
+* TD VM creation/destruction:           Applying
 * TD vcpu creation/destruction:         Not yet
 * TDX EPT violation:                    Not yet
 * TD finalization:                      Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (11 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 012/113] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-13 12:47   ` Zhi Wang
  2023-01-12 16:31 ` [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
                   ` (99 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

A TDX private host key id (HKID) is assigned to each guest TD.  The memory
controller encrypts guest TD memory with the assigned TDX HKID.  Add helper
functions to allocate/free TDX private HKIDs so that TDX KVM can manage
them.

Also export the global TDX private HKID that is used to encrypt the TDX
module, its memory and some dynamic data (TDR).  When the VMM releases an
encrypted page to reuse it, the page needs to be flushed with the HKID that
was used for it.  The VMM needs the global TDX private HKID to flush such
pages.
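
A minimal sketch of the expected usage by TDX KVM (the surrounding TD
lifecycle is elided):

  int hkid = tdx_keyid_alloc();

  if (hkid < 0)
          return hkid;  /* pool exhausted or TDX not initialized */

  /* ... create and run the guest TD with this HKID ... */

  /* On TD teardown, after the needed cache flushes, return it. */
  tdx_keyid_free(hkid);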

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h  | 12 ++++++++++++
 arch/x86/virt/vmx/tdx/tdx.c | 35 +++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 0f71d3856ede..ed9cf61ff8b4 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -107,11 +107,23 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #ifdef CONFIG_INTEL_TDX_HOST
 bool platform_tdx_enabled(void);
 int tdx_enable(void);
+/*
+ * Key id globally used by TDX module: TDX module maps TDR with this TDX global
+ * key id.  TDR includes key id assigned to the TD.  Then TDX module maps other
+ * TD-related pages with the assigned key id.  TDR requires this TDX global key
+ * id for cache flush unlike other TD-related pages.
+ */
+extern u32 tdx_global_keyid __read_mostly;
+int tdx_keyid_alloc(void);
+void tdx_keyid_free(int keyid);
+
 u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
 #else	/* !CONFIG_INTEL_TDX_HOST */
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_enable(void)  { return -EINVAL; }
+static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
+static inline void tdx_keyid_free(int keyid) { }
 #endif	/* CONFIG_INTEL_TDX_HOST */
 
 #endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index eba7e62cebec..d18ab5c4d447 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -51,6 +51,10 @@ static DEFINE_MUTEX(tdx_module_lock);
 /* All TDX-usable memory regions */
 static LIST_HEAD(tdx_memlist);
 
+/* TDX module global KeyID.  Used in TDH.SYS.CONFIG ABI. */
+u32 tdx_global_keyid __read_mostly;
+EXPORT_SYMBOL_GPL(tdx_global_keyid);
+
 /*
  * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
  * This is used in TDX initialization error paths to take it from
@@ -132,6 +136,31 @@ static struct notifier_block tdx_memory_nb = {
 	.notifier_call = tdx_memory_notifier,
 };
 
+/* TDX KeyID pool */
+static DEFINE_IDA(tdx_keyid_pool);
+
+int tdx_keyid_alloc(void)
+{
+	if (WARN_ON_ONCE(!tdx_keyid_start || !nr_tdx_keyids))
+		return -EINVAL;
+
+	/* The first keyID is reserved for the global key. */
+	return ida_alloc_range(&tdx_keyid_pool, tdx_keyid_start + 1,
+			       tdx_keyid_start + nr_tdx_keyids - 1,
+			       GFP_KERNEL);
+}
+EXPORT_SYMBOL_GPL(tdx_keyid_alloc);
+
+void tdx_keyid_free(int keyid)
+{
+	/* keyid = 0 is reserved. */
+	if (WARN_ON_ONCE(keyid <= 0))
+		return;
+
+	ida_free(&tdx_keyid_pool, keyid);
+}
+EXPORT_SYMBOL_GPL(tdx_keyid_free);
+
 static int __init tdx_init(void)
 {
 	int err;
@@ -1161,6 +1190,12 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out_free_pamts;
 
+	/*
+	 * Reserve the first TDX KeyID as global KeyID to protect
+	 * TDX module metadata.
+	 */
+	tdx_global_keyid = tdx_keyid_start;
+
 	/* Initialize TDMRs to complete the TDX module initialization */
 	ret = init_tdmrs(&tdmr_list);
 	if (ret)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (12 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16  4:19   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 015/113] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters isaku.yamahata
                   ` (98 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX KVM needs system-wide information about the TDX module, struct
tdsysinfo_struct.  Add a helper function, tdx_get_sysinfo(), to return it
instead of having KVM retrieve it with various error checks.  Make KVM call
the function and stash the info.  Move the struct definition to the common
header arch/x86/include/asm/tdx.h.
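
A hedged sketch of a caller of the new helper; a NULL return means the TDX
module has not been initialized:

  const struct tdsysinfo_struct *tdsysinfo = tdx_get_sysinfo();

  if (!tdsysinfo)
          return -EOPNOTSUPP;

  pr_info("TDX module %u.%u build %u\n", tdsysinfo->major_version,
          tdsysinfo->minor_version, tdsysinfo->build_num);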

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h  | 54 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.c      | 49 ++++++++++++++++++++++++++++++++-
 arch/x86/virt/vmx/tdx/tdx.c | 21 ++++++++++++---
 arch/x86/virt/vmx/tdx/tdx.h | 51 -----------------------------------
 4 files changed, 119 insertions(+), 56 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index ed9cf61ff8b4..2ca6e8ce1e43 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -105,6 +105,58 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_HOST
+struct tdx_cpuid_config {
+	u32	leaf;
+	u32	sub_leaf;
+	u32	eax;
+	u32	ebx;
+	u32	ecx;
+	u32	edx;
+} __packed;
+
+#define TDSYSINFO_STRUCT_SIZE		1024
+#define TDSYSINFO_STRUCT_ALIGNMENT	1024
+
+/*
+ * The size of this structure itself is flexible.  The actual structure
+ * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be
+ * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT().
+ */
+struct tdsysinfo_struct {
+	/* TDX-SEAM Module Info */
+	u32	attributes;
+	u32	vendor_id;
+	u32	build_date;
+	u16	build_num;
+	u16	minor_version;
+	u16	major_version;
+	u8	reserved0[14];
+	/* Memory Info */
+	u16	max_tdmrs;
+	u16	max_reserved_per_tdmr;
+	u16	pamt_entry_size;
+	u8	reserved1[10];
+	/* Control Struct Info */
+	u16	tdcs_base_size;
+	u8	reserved2[2];
+	u16	tdvps_base_size;
+	u8	tdvps_xfam_dependent_size;
+	u8	reserved3[9];
+	/* TD Capabilities */
+	u64	attributes_fixed0;
+	u64	attributes_fixed1;
+	u64	xfam_fixed0;
+	u64	xfam_fixed1;
+	u8	reserved4[32];
+	u32	num_cpuid_config;
+	/*
+	 * The actual number of CPUID_CONFIG depends on above
+	 * 'num_cpuid_config'.
+	 */
+	DECLARE_FLEX_ARRAY(struct tdx_cpuid_config, cpuid_configs);
+} __packed;
+
+const struct tdsysinfo_struct *tdx_get_sysinfo(void);
 bool platform_tdx_enabled(void);
 int tdx_enable(void);
 /*
@@ -120,6 +172,8 @@ void tdx_keyid_free(int keyid);
 u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
 	       struct tdx_module_output *out);
 #else	/* !CONFIG_INTEL_TDX_HOST */
+struct tdsysinfo_struct;
+static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
 static inline bool platform_tdx_enabled(void) { return false; }
 static inline int tdx_enable(void)  { return -EINVAL; }
 static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6c7d9ec53046..2adf5551ab26 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -11,9 +11,34 @@
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#define TDX_MAX_NR_CPUID_CONFIGS					\
+	((TDSYSINFO_STRUCT_SIZE -					\
+		offsetof(struct tdsysinfo_struct, cpuid_configs))	\
+		/ sizeof(struct tdx_cpuid_config))
+
+struct tdx_capabilities {
+	u8 tdcs_nr_pages;
+	u8 tdvpx_nr_pages;
+
+	u64 attrs_fixed0;
+	u64 attrs_fixed1;
+	u64 xfam_fixed0;
+	u64 xfam_fixed1;
+
+	u32 nr_cpuid_configs;
+	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
+};
+
+/* Capabilities of KVM + the TDX module. */
+static struct tdx_capabilities tdx_caps;
+
 static int __init tdx_module_setup(void)
 {
-	int ret;
+	const struct tdsysinfo_struct *tdsysinfo;
+	int ret = 0;
+
+	BUILD_BUG_ON(sizeof(*tdsysinfo) > TDSYSINFO_STRUCT_SIZE);
+	BUILD_BUG_ON(TDX_MAX_NR_CPUID_CONFIGS != 37);
 
 	ret = tdx_enable();
 	if (ret) {
@@ -21,6 +46,28 @@ static int __init tdx_module_setup(void)
 		return ret;
 	}
 
+	tdsysinfo = tdx_get_sysinfo();
+	if (tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS)
+		return -EIO;
+
+	tdx_caps = (struct tdx_capabilities) {
+		.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
+		/*
+		 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
+		 * -1 for TDVPR.
+		 */
+		.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
+		.attrs_fixed0 = tdsysinfo->attributes_fixed0,
+		.attrs_fixed1 = tdsysinfo->attributes_fixed1,
+		.xfam_fixed0 =	tdsysinfo->xfam_fixed0,
+		.xfam_fixed1 = tdsysinfo->xfam_fixed1,
+		.nr_cpuid_configs = tdsysinfo->num_cpuid_config,
+	};
+	if (!memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
+			tdsysinfo->num_cpuid_config *
+			sizeof(struct tdx_cpuid_config)))
+		return -EIO;
+
 	pr_info("TDX is supported.\n");
 	return 0;
 }
diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
index d18ab5c4d447..65c0024fd3a9 100644
--- a/arch/x86/virt/vmx/tdx/tdx.c
+++ b/arch/x86/virt/vmx/tdx/tdx.c
@@ -279,7 +279,7 @@ static void print_cmrs(struct cmr_info *cmr_array, int nr_cmrs)
  * kernel stack.  @sysinfo must have been padded to have enough room
  * to save the TDSYSINFO_STRUCT.
  */
-static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
+static int __tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 			   struct cmr_info *cmr_array)
 {
 	struct tdx_module_output out;
@@ -308,6 +308,21 @@ static int tdx_get_sysinfo(struct tdsysinfo_struct *sysinfo,
 	return 0;
 }
 
+static DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
+			     TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
+
+const struct tdsysinfo_struct *tdx_get_sysinfo(void)
+{
+	const struct tdsysinfo_struct *r = NULL;
+
+	mutex_lock(&tdx_module_lock);
+	if (tdx_module_status == TDX_MODULE_INITIALIZED)
+		r = &PADDED_STRUCT(tdsysinfo);
+	mutex_unlock(&tdx_module_lock);
+	return r;
+}
+EXPORT_SYMBOL_GPL(tdx_get_sysinfo);
+
 /*
  * Add a memory region as a TDX memory block.  The caller must make sure
  * all memory regions are added in address ascending order and don't
@@ -1118,8 +1133,6 @@ static int init_tdx_module(void)
 	 * They are 1024 bytes and 512 bytes respectively but it's fine to
 	 * keep them in the stack as this function is only called once.
 	 */
-	DECLARE_PADDED_STRUCT(tdsysinfo_struct, tdsysinfo,
-			TDSYSINFO_STRUCT_SIZE, TDSYSINFO_STRUCT_ALIGNMENT);
 	struct cmr_info cmr_array[MAX_CMRS] __aligned(CMR_INFO_ARRAY_ALIGNMENT);
 	struct tdsysinfo_struct *sysinfo = &PADDED_STRUCT(tdsysinfo);
 	struct tdmr_info_list tdmr_list;
@@ -1134,7 +1147,7 @@ static int init_tdx_module(void)
 	if (ret)
 		goto out;
 
-	ret = tdx_get_sysinfo(sysinfo, cmr_array);
+	ret = __tdx_get_sysinfo(sysinfo, cmr_array);
 	if (ret)
 		goto out;
 
diff --git a/arch/x86/virt/vmx/tdx/tdx.h b/arch/x86/virt/vmx/tdx/tdx.h
index 8abfbcc23be1..9658cd89b579 100644
--- a/arch/x86/virt/vmx/tdx/tdx.h
+++ b/arch/x86/virt/vmx/tdx/tdx.h
@@ -34,15 +34,6 @@ struct cmr_info {
 #define MAX_CMRS			32
 #define CMR_INFO_ARRAY_ALIGNMENT	512
 
-struct cpuid_config {
-	u32	leaf;
-	u32	sub_leaf;
-	u32	eax;
-	u32	ebx;
-	u32	ecx;
-	u32	edx;
-} __packed;
-
 #define DECLARE_PADDED_STRUCT(type, name, size, alignment)	\
 	struct type##_padded {					\
 		union {						\
@@ -53,48 +44,6 @@ struct cpuid_config {
 
 #define PADDED_STRUCT(name)	(name##_padded.name)
 
-#define TDSYSINFO_STRUCT_SIZE		1024
-#define TDSYSINFO_STRUCT_ALIGNMENT	1024
-
-/*
- * The size of this structure itself is flexible.  The actual structure
- * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be
- * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT().
- */
-struct tdsysinfo_struct {
-	/* TDX-SEAM Module Info */
-	u32	attributes;
-	u32	vendor_id;
-	u32	build_date;
-	u16	build_num;
-	u16	minor_version;
-	u16	major_version;
-	u8	reserved0[14];
-	/* Memory Info */
-	u16	max_tdmrs;
-	u16	max_reserved_per_tdmr;
-	u16	pamt_entry_size;
-	u8	reserved1[10];
-	/* Control Struct Info */
-	u16	tdcs_base_size;
-	u8	reserved2[2];
-	u16	tdvps_base_size;
-	u8	tdvps_xfam_dependent_size;
-	u8	reserved3[9];
-	/* TD Capabilities */
-	u64	attributes_fixed0;
-	u64	attributes_fixed1;
-	u64	xfam_fixed0;
-	u64	xfam_fixed1;
-	u8	reserved4[32];
-	u32	num_cpuid_config;
-	/*
-	 * The actual number of CPUID_CONFIG depends on above
-	 * 'num_cpuid_config'.
-	 */
-	DECLARE_FLEX_ARRAY(struct cpuid_config, cpuid_configs);
-} __packed;
-
 struct tdmr_reserved_area {
 	u64 offset;
 	u64 size;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 015/113] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (13 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
                   ` (97 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Implement a system-scoped ioctl to get system-wide parameters for TDX.
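
For illustration only (not part of this patch), a userspace sketch of
issuing the new KVM_TDX_CAPABILITIES command on the /dev/kvm fd; the
buffer size of 64 cpuid configs is an arbitrary assumption:

  #include <err.h>
  #include <stdlib.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  static struct kvm_tdx_capabilities *tdx_get_caps(int kvm_fd)
  {
          int nr = 64;  /* illustrative upper bound on cpuid configs */
          struct kvm_tdx_capabilities *caps;
          struct kvm_tdx_cmd cmd = { .id = KVM_TDX_CAPABILITIES };

          caps = calloc(1, sizeof(*caps) +
                           nr * sizeof(struct kvm_tdx_cpuid_config));
          if (!caps)
                  err(1, "calloc");
          caps->nr_cpuid_configs = nr;
          cmd.data = (__u64)(unsigned long)caps;

          if (ioctl(kvm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd))
                  err(1, "KVM_TDX_CAPABILITIES");
          return caps;
  }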

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h    |  1 +
 arch/x86/include/asm/kvm_host.h       |  1 +
 arch/x86/include/uapi/asm/kvm.h       | 48 +++++++++++++++++++++++++++
 arch/x86/kvm/vmx/main.c               |  2 ++
 arch/x86/kvm/vmx/tdx.c                | 46 +++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h            |  2 ++
 arch/x86/kvm/x86.c                    |  6 ++++
 tools/arch/x86/include/uapi/asm/kvm.h | 48 +++++++++++++++++++++++++++
 8 files changed, 154 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 59181b12ad70..e6708bb3f4f6 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -118,6 +118,7 @@ KVM_X86_OP(enter_smm)
 KVM_X86_OP(leave_smm)
 KVM_X86_OP(enable_smi_window)
 #endif
+KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index c6ccfce7dc9e..234c28c8e6ee 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1693,6 +1693,7 @@ struct kvm_x86_ops {
 	void (*enable_smi_window)(struct kvm_vcpu *vcpu);
 #endif
 
+	int (*dev_mem_enc_ioctl)(void __user *argp);
 	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
 	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index a4cca6bc6b06..610b1de9eb2b 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -532,4 +532,52 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_TDX_VM		1
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	/* enum kvm_tdx_cmd_id */
+	__u32 id;
+	/* flags for sub-command. If sub-command doesn't use this, set zero. */
+	__u32 flags;
+	/*
+	 * data for each sub-command. An immediate or a pointer to the actual
+	 * data in process virtual address.  If sub-command doesn't use it,
+	 * set zero.
+	 */
+	__u64 data;
+	/*
+	 * Auxiliary error code.  The sub-command may return TDX SEAMCALL
+	 * status code in addition to -Exxx.
+	 * Defined for consistency with struct kvm_sev_cmd.
+	 */
+	__u64 error;
+	/* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+	__u64 unused;
+};
+
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+
+	__u32 nr_cpuid_configs;
+	__u32 padding;
+	struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index e3c5e9250990..16053ec3e0ae 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -177,6 +177,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+
+	.dev_mem_enc_ioctl = tdx_dev_ioctl,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2adf5551ab26..f46579102a44 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -32,6 +32,52 @@ struct tdx_capabilities {
 /* Capabilities of KVM + the TDX module. */
 static struct tdx_capabilities tdx_caps;
 
+int tdx_dev_ioctl(void __user *argp)
+{
+	struct kvm_tdx_capabilities __user *user_caps;
+	struct kvm_tdx_capabilities caps;
+	struct kvm_tdx_cmd cmd;
+
+	BUILD_BUG_ON(sizeof(struct kvm_tdx_cpuid_config) !=
+		     sizeof(struct tdx_cpuid_config));
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+	if (cmd.flags || cmd.error || cmd.unused)
+		return -EINVAL;
+	/*
+	 * Currently only KVM_TDX_CAPABILITIES is defined for system-scoped
+	 * mem_enc_ioctl().
+	 */
+	if (cmd.id != KVM_TDX_CAPABILITIES)
+		return -EINVAL;
+
+	user_caps = (void __user *)cmd.data;
+	if (copy_from_user(&caps, user_caps, sizeof(caps)))
+		return -EFAULT;
+
+	if (caps.nr_cpuid_configs < tdx_caps.nr_cpuid_configs)
+		return -E2BIG;
+
+	caps = (struct kvm_tdx_capabilities) {
+		.attrs_fixed0 = tdx_caps.attrs_fixed0,
+		.attrs_fixed1 = tdx_caps.attrs_fixed1,
+		.xfam_fixed0 = tdx_caps.xfam_fixed0,
+		.xfam_fixed1 = tdx_caps.xfam_fixed1,
+		.nr_cpuid_configs = tdx_caps.nr_cpuid_configs,
+		.padding = 0,
+	};
+
+	if (copy_to_user(user_caps, &caps, sizeof(caps)))
+		return -EFAULT;
+	if (copy_to_user(user_caps->cpuid_configs, &tdx_caps.cpuid_configs,
+			 tdx_caps.nr_cpuid_configs *
+			 sizeof(struct tdx_cpuid_config)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int __init tdx_module_setup(void)
 {
 	const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 8fd34842a06b..f0803cc15151 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -140,9 +140,11 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 #ifdef CONFIG_INTEL_TDX_HOST
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 bool tdx_is_vm_type_supported(unsigned long type);
+int tdx_dev_ioctl(void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
+static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 68bff699096a..1152a9dc6d84 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4693,6 +4693,12 @@ long kvm_arch_dev_ioctl(struct file *filp,
 		r = kvm_x86_dev_has_attr(&attr);
 		break;
 	}
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -EINVAL;
+		if (!kvm_x86_ops.dev_mem_enc_ioctl)
+			goto out;
+		r = static_call(kvm_x86_dev_mem_enc_ioctl)(argp);
+		break;
 	default:
 		r = -EINVAL;
 		break;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index b67d2d59eb6c..04562740691b 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -527,4 +527,52 @@ struct kvm_pmu_event_filter {
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_TDX_VM		1
 
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	/* enum kvm_tdx_cmd_id */
+	__u32 id;
+	/* flags for sub-command. If sub-command doesn't use this, set zero. */
+	__u32 flags;
+	/*
+	 * data for each sub-command. An immediate or a pointer to the actual
+	 * data in process virtual address.  If sub-command doesn't use it,
+	 * set zero.
+	 */
+	__u64 data;
+	/*
+	 * Auxiliary error code.  The sub-command may return TDX SEAMCALL
+	 * status code in addition to -Exxx.
+	 * Defined for consistency with struct kvm_sev_cmd.
+	 */
+	__u64 error;
+	/* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+	__u64 unused;
+};
+
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+
+	__u32 nr_cpuid_configs;
+	__u32 padding;
+	struct kvm_tdx_cpuid_config cpuid_configs[0];
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (14 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 015/113] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-19  2:40   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP isaku.yamahata
                   ` (96 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a placeholder function for the TDX-specific VM-scoped mem_enc_op
ioctl.  TDX-specific sub-commands will be added later to retrieve/pass
TDX-specific parameters.

KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific to
guest state-protected VMs, and sub-commands for technology-specific
operations are defined under it.  Despite its name, the sub-commands are
not limited to memory encryption; various technology-specific operations
are defined.  It's natural to repurpose KVM_MEMORY_ENCRYPT_OP for
TDX-specific operations and define sub-commands for them.

TDX requires VM-scoped TDX-specific operations for the device model, for
example qemu, such as getting system-wide parameters and TDX-specific VM
initialization.
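
For illustration only (not part of this patch), a device model is expected
to drive these sub-commands through the existing KVM_MEMORY_ENCRYPT_OP
ioctl on the VM file descriptor, wrapping each sub-command in struct
kvm_tdx_cmd.  A minimal userspace sketch, assuming the uapi headers from
this series (the sub-command ids themselves are only added by later
patches):

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /* Sketch: issue one VM-scoped TDX sub-command on a KVM VM fd. */
  static int tdx_vm_cmd(int vm_fd, __u32 id, __u32 flags, void *data)
  {
          struct kvm_tdx_cmd cmd = {
                  .id = id,
                  .flags = flags,
                  .data = (__u64)(unsigned long)data,
                  /* .error and .unused must stay zero. */
          };

          return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
  }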

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  9 +++++++++
 arch/x86/kvm/vmx/tdx.c     | 26 ++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 3 files changed, 39 insertions(+)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 16053ec3e0ae..781fbc896120 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -37,6 +37,14 @@ static int vt_vm_init(struct kvm *kvm)
 	return vmx_vm_init(kvm);
 }
 
+static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
+{
+	if (!is_td(kvm))
+		return -ENOTTY;
+
+	return tdx_vm_ioctl(kvm, argp);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
 
@@ -179,6 +187,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
 
 	.dev_mem_enc_ioctl = tdx_dev_ioctl,
+	.mem_enc_ioctl = vt_mem_enc_ioctl,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f46579102a44..2bd1cc37abab 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -78,6 +78,32 @@ int tdx_dev_ioctl(void __user *argp)
 	return 0;
 }
 
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
+{
+	struct kvm_tdx_cmd tdx_cmd;
+	int r;
+
+	if (copy_from_user(&tdx_cmd, argp, sizeof(struct kvm_tdx_cmd)))
+		return -EFAULT;
+	if (tdx_cmd.error || tdx_cmd.unused)
+		return -EINVAL;
+
+	mutex_lock(&kvm->lock);
+
+	switch (tdx_cmd.id) {
+	default:
+		r = -EINVAL;
+		goto out;
+	}
+
+	if (copy_to_user(argp, &tdx_cmd, sizeof(struct kvm_tdx_cmd)))
+		r = -EFAULT;
+
+out:
+	mutex_unlock(&kvm->lock);
+	return r;
+}
+
 static int __init tdx_module_setup(void)
 {
 	const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f0803cc15151..bb4090dbae37 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -141,10 +141,14 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 bool tdx_is_vm_type_supported(unsigned long type);
 int tdx_dev_ioctl(void __user *argp);
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
+
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (15 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-13 12:55   ` Zhi Wang
  2023-01-16  4:44   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 018/113] KVM: TDX: create/destroy VM structure isaku.yamahata
                   ` (95 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX attestation includes the maximum number of vcpus that the guest can
accommodate.  For that, the maximum number of vcpus needs to be specified
by userspace instead of using the constant KVM_MAX_VCPUS.  Make
KVM_ENABLE_CAP support KVM_CAP_MAX_VCPUS for this purpose.
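
For example, a minimal userspace sketch (not part of this patch; it only
uses the existing KVM_ENABLE_CAP VM ioctl) that shrinks the limit before
the first KVM_CREATE_VCPU:

  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /*
   * Sketch: cap the number of vcpus for this VM.  Only decreasing the
   * limit is allowed, and it must be done before any vcpu is created.
   */
  static int set_max_vcpus(int vm_fd, unsigned int nr_vcpus)
  {
          struct kvm_enable_cap cap = {
                  .cap = KVM_CAP_MAX_VCPUS,
                  .args = { nr_vcpus },
          };

          return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
  }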

Suggested-by: Sagi Shahar <sagis@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 virt/kvm/kvm_main.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index a235b628b32f..1cfa7da92ad0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4945,7 +4945,27 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
 		}
 
 		mutex_unlock(&kvm->slots_lock);
+		return r;
+	}
+	case KVM_CAP_MAX_VCPUS: {
+		int r;
 
+		if (cap->flags || cap->args[0] == 0)
+			return -EINVAL;
+		if (cap->args[0] >  kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MAX_VCPUS))
+			return -E2BIG;
+
+		mutex_lock(&kvm->lock);
+		/* Only decreasing is allowed. */
+		if (cap->args[0] > kvm->max_vcpus)
+			r = -E2BIG;
+		else if (kvm->created_vcpus)
+			r = -EBUSY;
+		else {
+			kvm->max_vcpus = cap->args[0];
+			r = 0;
+		}
+		mutex_unlock(&kvm->lock);
 		return r;
 	}
 	default:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (16 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-13 13:12   ` Zhi Wang
       [not found]   ` <080e0a246e927545718b6f427dfdcdde505a8859.camel@intel.com>
  2023-01-12 16:31 ` [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
                   ` (94 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson, Kai Huang

From: Sean Christopherson <sean.j.christopherson@intel.com>

As the first step toward creating a TDX guest, add VM struct creation and
destruction.  Assign a TDX private Host Key ID (HKID) to the TDX guest for
memory encryption and allocate extra pages for the TDX guest.  On
destruction, free the allocated pages and the HKID.

Before tearing down private page tables, TDX requires some resources of
the guest TD to be destroyed (i.e. the HKID must have been reclaimed,
etc.).  Add a flush_shadow_all_private callback that runs before the
private page tables are torn down for this purpose.

Add a vm_free() kvm_x86_ops hook at the end of kvm_arch_destroy_vm()
because some per-VM TDX resources, e.g. the TDR, need to be freed after
other TDX resources, e.g. the HKID, have been freed.

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

---
Changes v10 -> v11:
- Fix double free in tdx_vm_free() by setting the pointer to NULL.
- replace struct tdx_td_page tdr and tdcs from struct kvm_tdx with
  unsigned long

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   2 +
 arch/x86/include/asm/kvm_host.h    |   2 +
 arch/x86/kvm/vmx/main.c            |  34 ++-
 arch/x86/kvm/vmx/tdx.c             | 415 +++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h             |   6 +-
 arch/x86/kvm/vmx/x86_ops.h         |   9 +
 arch/x86/kvm/x86.c                 |   8 +
 7 files changed, 472 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e6708bb3f4f6..552de893af75 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -22,7 +22,9 @@ KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
 KVM_X86_OP(vm_init)
+KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
 KVM_X86_OP_OPTIONAL(vm_destroy)
+KVM_X86_OP_OPTIONAL(vm_free)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
 KVM_X86_OP(vcpu_create)
 KVM_X86_OP(vcpu_free)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 234c28c8e6ee..e199ddf0bb00 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1540,7 +1540,9 @@ struct kvm_x86_ops {
 	bool (*is_vm_type_supported)(unsigned long vm_type);
 	unsigned int vm_size;
 	int (*vm_init)(struct kvm *kvm);
+	void (*flush_shadow_all_private)(struct kvm *kvm);
 	void (*vm_destroy)(struct kvm *kvm);
+	void (*vm_free)(struct kvm *kvm);
 
 	/* Create, but do not attach this VCPU */
 	int (*vcpu_precreate)(struct kvm *kvm);
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 781fbc896120..c5f2515026e9 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -29,14 +29,40 @@ static __init int vt_hardware_setup(void)
 	return 0;
 }
 
+static void vt_hardware_unsetup(void)
+{
+	tdx_hardware_unsetup();
+	vmx_hardware_unsetup();
+}
+
 static int vt_vm_init(struct kvm *kvm)
 {
 	if (is_td(kvm))
-		return -EOPNOTSUPP;	/* Not ready to create guest TD yet. */
+		return tdx_vm_init(kvm);
 
 	return vmx_vm_init(kvm);
 }
 
+static void vt_flush_shadow_all_private(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		tdx_mmu_release_hkid(kvm);
+}
+
+static void vt_vm_destroy(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return;
+
+	vmx_vm_destroy(kvm);
+}
+
+static void vt_vm_free(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		tdx_vm_free(kvm);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -50,7 +76,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.check_processor_compatibility = vmx_check_processor_compat,
 
-	.hardware_unsetup = vmx_hardware_unsetup,
+	.hardware_unsetup = vt_hardware_unsetup,
 
 	.hardware_enable = vmx_hardware_enable,
 	.hardware_disable = vmx_hardware_disable,
@@ -59,7 +85,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.is_vm_type_supported = vt_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
 	.vm_init = vt_vm_init,
-	.vm_destroy = vmx_vm_destroy,
+	.flush_shadow_all_private = vt_flush_shadow_all_private,
+	.vm_destroy = vt_vm_destroy,
+	.vm_free = vt_vm_free,
 
 	.vcpu_precreate = vmx_vcpu_precreate,
 	.vcpu_create = vmx_vcpu_create,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2bd1cc37abab..d11950d18226 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,6 +6,7 @@
 #include "capabilities.h"
 #include "x86_ops.h"
 #include "tdx.h"
+#include "tdx_ops.h"
 #include "x86.h"
 
 #undef pr_fmt
@@ -32,6 +33,250 @@ struct tdx_capabilities {
 /* Capabilities of KVM + the TDX module. */
 static struct tdx_capabilities tdx_caps;
 
+/*
+ * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
+ * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) try to acquire a global lock
+ * internally in the TDX module.  If that fails, TDX_OPERAND_BUSY is returned
+ * without spinning or waiting due to a constraint on execution time.  It's the
+ * caller's responsibility to avoid races (or retry on TDX_OPERAND_BUSY).  Use
+ * this mutex to avoid races in the TDX module; the kernel knows scheduling better.
+ */
+static DEFINE_MUTEX(tdx_lock);
+static struct mutex *tdx_mng_key_config_lock;
+
+static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
+{
+	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
+}
+
+static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->tdr_pa;
+}
+
+static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
+{
+	tdx_keyid_free(kvm_tdx->hkid);
+	kvm_tdx->hkid = 0;
+}
+
+static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->hkid > 0;
+}
+
+static void tdx_clear_page(unsigned long page_pa)
+{
+	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
+	void *page = __va(page_pa);
+	unsigned long i;
+
+	if (!static_cpu_has(X86_FEATURE_MOVDIR64B)) {
+		clear_page(page);
+		return;
+	}
+
+	/*
+	 * Zeroing the page is only necessary for systems with MKTME-i:
+	 * when re-assigning a page from an old keyid to a new keyid, MOVDIR64B is
+	 * required to clear/write the page with the new keyid to prevent an
+	 * integrity error when the page is read with the new keyid.
+	 *
+	 * clflush doesn't flush cache with HKID set.
+	 * The cache line could be poisoned (even without MKTME-i), clear the
+	 * poison bit.
+	 */
+	for (i = 0; i < PAGE_SIZE; i += 64)
+		movdir64b(page + i, zero_page);
+	/*
+	 * MOVDIR64B store uses WC buffer.  Prevent following memory reads
+	 * from seeing potentially poisoned cache.
+	 */
+	__mb();
+}
+
+static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
+{
+	struct tdx_module_output out;
+	u64 err;
+
+	do {
+		err = tdh_phymem_page_reclaim(pa, &out);
+		/*
+		 * TDH.PHYMEM.PAGE.RECLAIM is allowed only when the TD is in
+		 * shutdown state, i.e. the TD is being destroyed.
+		 * TDH.PHYMEM.PAGE.RECLAIM requires the TDR and the target page.
+		 * Because we're destroying the TD, contention on the TDR is rare.
+		 */
+	} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
+		return -EIO;
+	}
+
+	if (do_wb) {
+		/*
+		 * Only TDR page gets into this path.  No contention is expected
+		 * because of the last page of TD.
+		 */
+		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+			return -EIO;
+		}
+	}
+
+	tdx_clear_page(pa);
+	return 0;
+}
+
+static void tdx_reclaim_td_page(unsigned long td_page_pa)
+{
+	if (!td_page_pa)
+		return;
+	/*
+	 * TDCX pages are being reclaimed.  The TDX module maps TDCX with the
+	 * HKID assigned to the TD.  The cache associated with the TD was
+	 * already flushed by TDH.PHYMEM.CACHE.WB before reaching here, so the
+	 * cache doesn't need to be flushed again.
+	 */
+	if (WARN_ON(tdx_reclaim_page(td_page_pa, false, 0)))
+		/* If reclaim failed, leak the page. */
+		return;
+	free_page((unsigned long)__va(td_page_pa));
+}
+
+static int tdx_do_tdh_phymem_cache_wb(void *param)
+{
+	u64 err = 0;
+
+	do {
+		err = tdh_phymem_cache_wb(!!err);
+	} while (err == TDX_INTERRUPTED_RESUMABLE);
+
+	/* Other thread may have done for us. */
+	if (err == TDX_NO_HKID_READY_TO_WBCACHE)
+		err = TDX_SUCCESS;
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+void tdx_mmu_release_hkid(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages;
+	bool cpumask_allocated;
+	u64 err;
+	int ret;
+	int i;
+
+	if (!is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (!is_td_created(kvm_tdx))
+		goto free_hkid;
+
+	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
+	cpus_read_lock();
+	for_each_online_cpu(i) {
+		if (cpumask_allocated &&
+			cpumask_test_and_set_cpu(topology_physical_package_id(i),
+						packages))
+			continue;
+
+		/*
+		 * Multiple guest TDs may be destroyed simultaneously.
+		 * Prevent tdh_phymem_cache_wb from returning TDX_BUSY by
+		 * serialization.
+		 */
+		mutex_lock(&tdx_lock);
+		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb, NULL, 1);
+		mutex_unlock(&tdx_lock);
+		if (ret)
+			break;
+	}
+	cpus_read_unlock();
+	free_cpumask_var(packages);
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
+		pr_err("tdh_mng_key_freeid failed. HKID %d is leaked.\n",
+			kvm_tdx->hkid);
+		return;
+	}
+
+free_hkid:
+	tdx_hkid_free(kvm_tdx);
+}
+
+void tdx_vm_free(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	int i;
+
+	/* Can't reclaim or free TD pages if teardown failed. */
+	if (is_hkid_assigned(kvm_tdx))
+		return;
+
+	if (kvm_tdx->tdcs_pa) {
+		for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
+			tdx_reclaim_td_page(kvm_tdx->tdcs_pa[i]);
+		kfree(kvm_tdx->tdcs_pa);
+		kvm_tdx->tdcs_pa = NULL;
+	}
+
+	if (!kvm_tdx->tdr_pa)
+		return;
+	/*
+	 * TDX module maps TDR with TDX global HKID.  TDX module may access TDR
+	 * while operating on TD (Especially reclaiming TDCS).  Cache flush with
+	 * TDX global HKID is needed.
+	 */
+	if (tdx_reclaim_page(kvm_tdx->tdr_pa, true, tdx_global_keyid))
+		return;
+
+	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
+	kvm_tdx->tdr_pa = 0;
+}
+
+static int tdx_do_tdh_mng_key_config(void *param)
+{
+	hpa_t *tdr_p = param;
+	u64 err;
+
+	do {
+		err = tdh_mng_key_config(*tdr_p);
+
+		/*
+		 * If it failed to generate a random key, retry it because this
+		 * is typically caused by an entropy error of the CPU's random
+		 * number generator.
+		 */
+	} while (err == TDX_KEY_GENERATION_FAILED);
+
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm);
+
+int tdx_vm_init(struct kvm *kvm)
+{
+	/* Place holder for now. */
+	return __tdx_td_init(kvm);
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
@@ -78,6 +323,160 @@ int tdx_dev_ioctl(void __user *argp)
 	return 0;
 }
 
+static int __tdx_td_init(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	cpumask_var_t packages;
+	unsigned long *tdcs_pa = NULL;
+	unsigned long tdr_pa = 0;
+	unsigned long va;
+	int ret, i;
+	u64 err;
+
+	ret = tdx_keyid_alloc();
+	if (ret < 0)
+		return ret;
+	kvm_tdx->hkid = ret;
+
+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!va)
+		goto free_hkid;
+	tdr_pa = __pa(va);
+
+	tdcs_pa = kcalloc(tdx_caps.tdcs_nr_pages, sizeof(*kvm_tdx->tdcs_pa),
+			  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!tdcs_pa)
+		goto free_tdr;
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
+		if (!va)
+			goto free_tdcs;
+		tdcs_pa[i] = __pa(va);
+	}
+
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
+		ret = -ENOMEM;
+		goto free_tdcs;
+	}
+	cpus_read_lock();
+	/*
+	 * At least one CPU of each package needs to be online in order to
+	 * program the host key id on all packages.  Check for that.
+	 */
+	for_each_present_cpu(i)
+		cpumask_set_cpu(topology_physical_package_id(i), packages);
+	for_each_online_cpu(i)
+		cpumask_clear_cpu(topology_physical_package_id(i), packages);
+	if (!cpumask_empty(packages)) {
+		ret = -EIO;
+		/*
+		 * Because it's hard for a human operator to figure out the
+		 * reason, print a warning.
+		 */
+		pr_warn("All packages need to have online CPU to create TD. Online CPU and retry.\n");
+		goto free_packages;
+	}
+
+	/*
+	 * Acquire global lock to avoid TDX_OPERAND_BUSY:
+	 * TDH.MNG.CREATE and other APIs try to lock the global Key Owner
+	 * Table (KOT) to track the assigned TDX private HKID.  It doesn't spin
+	 * to acquire the lock; it returns TDX_OPERAND_BUSY instead and lets the
+	 * caller handle the contention.  This is because of the limited time
+	 * usable inside the TDX module, and the OS/VMM knows better about process
+	 * scheduling.
+	 *
+	 * APIs to acquire the lock of KOT:
+	 * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
+	 * TDH.PHYMEM.CACHE.WB.
+	 */
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_CREATE, err, NULL);
+		ret = -EIO;
+		goto free_packages;
+	}
+	kvm_tdx->tdr_pa = tdr_pa;
+
+	for_each_online_cpu(i) {
+		int pkg = topology_physical_package_id(i);
+
+		if (cpumask_test_and_set_cpu(pkg, packages))
+			continue;
+
+		/*
+		 * Program the memory controller in the package with an
+		 * encryption key associated to a TDX private host key id
+		 * assigned to this TDR.  Concurrent operations on same memory
+		 * controller results in TDX_OPERAND_BUSY.  Avoid this race by
+		 * mutex.
+		 */
+		mutex_lock(&tdx_mng_key_config_lock[pkg]);
+		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
+				      &kvm_tdx->tdr_pa, true);
+		mutex_unlock(&tdx_mng_key_config_lock[pkg]);
+		if (ret)
+			break;
+	}
+	cpus_read_unlock();
+	free_cpumask_var(packages);
+	if (ret)
+		goto teardown;
+
+	kvm_tdx->tdcs_pa = tdcs_pa;
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
+		if (WARN_ON_ONCE(err)) {
+			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
+			for (i++; i < tdx_caps.tdcs_nr_pages; i++) {
+				free_page((unsigned long)__va(tdcs_pa[i]));
+				tdcs_pa[i] = 0;
+			}
+			ret = -EIO;
+			goto teardown;
+		}
+	}
+
+	/*
+	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
+	 * ioctl() to configure the CPUID values for the TD.
+	 */
+	return 0;
+
+	/*
+	 * The sequence for freeing resources from a partially initialized TD
+	 * varies based on where in the initialization flow failure occurred.
+	 * Simply use the full teardown and destroy, which naturally play nice
+	 * with partial initialization.
+	 */
+teardown:
+	tdx_mmu_release_hkid(kvm);
+	tdx_vm_free(kvm);
+	return ret;
+
+free_packages:
+	cpus_read_unlock();
+	free_cpumask_var(packages);
+free_tdcs:
+	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
+		if (tdcs_pa[i])
+			free_page((unsigned long)__va(tdcs_pa[i]));
+	}
+	kfree(tdcs_pa);
+	kvm_tdx->tdcs_pa = NULL;
+
+free_tdr:
+	if (tdr_pa)
+		free_page((unsigned long)__va(tdr_pa));
+	kvm_tdx->tdr_pa = 0;
+free_hkid:
+	if (is_hkid_assigned(kvm_tdx))
+		tdx_hkid_free(kvm_tdx);
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -152,6 +551,8 @@ bool tdx_is_vm_type_supported(unsigned long type)
 
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 {
+	int max_pkgs;
+	int i;
 	int r;
 
 	if (!enable_ept) {
@@ -159,6 +560,14 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 		return -EINVAL;
 	}
 
+	max_pkgs = topology_max_packages();
+	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
+				   GFP_KERNEL);
+	if (!tdx_mng_key_config_lock)
+		return -ENOMEM;
+	for (i = 0; i < max_pkgs; i++)
+		mutex_init(&tdx_mng_key_config_lock[i]);
+
 	/* TDX requires VMX. */
 	r = vmxon_all();
 	if (!r)
@@ -167,3 +576,9 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 
 	return r;
 }
+
+void tdx_hardware_unsetup(void)
+{
+	/* kfree accepts NULL. */
+	kfree(tdx_mng_key_config_lock);
+}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 473013265bd8..e78d72cf4c3a 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -5,7 +5,11 @@
 #ifdef CONFIG_INTEL_TDX_HOST
 struct kvm_tdx {
 	struct kvm kvm;
-	/* TDX specific members follow. */
+
+	unsigned long tdr_pa;
+	unsigned long *tdcs_pa;
+
+	int hkid;
 };
 
 struct vcpu_tdx {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index bb4090dbae37..3d0f519727c6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -139,15 +139,24 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_INTEL_TDX_HOST
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
+void tdx_hardware_unsetup(void);
 bool tdx_is_vm_type_supported(unsigned long type);
 int tdx_dev_ioctl(void __user *argp);
 
+int tdx_vm_init(struct kvm *kvm);
+void tdx_mmu_release_hkid(struct kvm *kvm);
+void tdx_vm_free(struct kvm *kvm);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
+static inline void tdx_hardware_unsetup(void) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
 
+static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
+static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
+static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
+static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1152a9dc6d84..0fa91a9708aa 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12337,6 +12337,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
 	kvm_page_track_cleanup(kvm);
 	kvm_xen_destroy_vm(kvm);
 	kvm_hv_destroy_vm(kvm);
+	static_call_cond(kvm_x86_vm_free)(kvm);
 }
 
 static void memslot_rmap_free(struct kvm_memory_slot *slot)
@@ -12647,6 +12648,13 @@ void kvm_arch_commit_memory_region(struct kvm *kvm,
 
 void kvm_arch_flush_shadow_all(struct kvm *kvm)
 {
+	/*
+	 * kvm_mmu_zap_all() zaps both private and shared page tables.  Before
+	 * tearing down private page tables, TDX requires some TD resources to
+	 * be destroyed (i.e. keyID must have been reclaimed, etc).  Invoke
+	 * kvm_x86_flush_shadow_all_private() for this.
+	 */
+	static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
 	kvm_mmu_zap_all(kvm);
 }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (17 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 018/113] KVM: TDX: create/destroy VM structure isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-13 14:58   ` Zhi Wang
                     ` (2 more replies)
  2023-01-12 16:31 ` [PATCH v11 020/113] KVM: TDX: Make pmu_intel.c ignore guest TD case isaku.yamahata
                   ` (93 subsequent siblings)
  112 siblings, 3 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Xiaoyao Li

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX requires additional parameters for a TDX VM for confidential execution,
to protect the confidentiality of its memory contents and its CPU state
from any other software, including the VMM.  When creating a guest TD VM,
before creating any vcpu, userspace must specify: the number of vcpus, the
TSC frequency (which is the same for all vcpus and can't be changed later),
the CPUIDs that are emulated by the TDX module (so the guest can trust
those CPUIDs), and sha384 values for measurement.

Add a new subcommand, KVM_TDX_INIT_VM, to pass these parameters for the
TDX guest.  It assigns an encryption key to the TDX guest for memory
encryption; TDX encrypts memory on a per-guest basis.  Through this
subcommand the device model passes per-VM parameters for the TDX guest:
the maximum number of vcpus, the TSC frequency (the TDX guest has a fixed
VM-wide TSC frequency, not per-vcpu, and the guest cannot change it),
attributes (production or debug), available extended features (which are
reflected into the guest XCR0 and IA32_XSS MSR), CPUIDs, sha384
measurements, and so on.

This subcommand is called before vcpu creation and KVM_SET_CPUID2, i.e.
the CPUID configurations aren't available at that point.  So the CPUID
configuration values need to be passed in struct kvm_tdx_init_vm.  It's
the device model's responsibility to keep the CPUID configuration
consistent between KVM_TDX_INIT_VM and KVM_SET_CPUID2.
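
A minimal userspace sketch (not part of this patch) of issuing the new
sub-command; it assumes the uapi headers from this series and that the
device model has already prepared the CPUID entries it also intends to
use for KVM_SET_CPUID2:

  #include <err.h>
  #include <stdlib.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /*
   * Sketch: initialize the TD before any vcpu is created.  nent must not
   * exceed KVM_MAX_CPUID_ENTRIES, and the unused tail of the buffer must
   * remain zeroed (calloc takes care of that).
   */
  static void tdx_init_vm(int vm_fd, const struct kvm_cpuid_entry2 *entries,
                          __u32 nent)
  {
          struct kvm_tdx_init_vm *init_vm;
          struct kvm_tdx_cmd cmd = { .id = KVM_TDX_INIT_VM };

          init_vm = calloc(1, sizeof(*init_vm));
          if (!init_vm)
                  err(1, "calloc");

          init_vm->attributes = 0;        /* e.g. a non-debug TD */
          init_vm->cpuid.nent = nent;
          memcpy(init_vm->entries, entries, nent * sizeof(*entries));

          cmd.data = (__u64)(unsigned long)init_vm;
          if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
                  err(1, "KVM_TDX_INIT_VM");

          free(init_vm);
  }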

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h            |   3 +
 arch/x86/include/uapi/asm/kvm.h       |  31 ++++
 arch/x86/kvm/vmx/tdx.c                | 229 ++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.h                |  20 +++
 tools/arch/x86/include/uapi/asm/kvm.h |  33 ++++
 5 files changed, 306 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index 2ca6e8ce1e43..d7ce2217279f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -105,6 +105,9 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
 #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
 
 #ifdef CONFIG_INTEL_TDX_HOST
+
+/* -1 indicates CPUID leaf with no sub-leaves. */
+#define TDX_CPUID_NO_SUBLEAF	((u32)-1)
 struct tdx_cpuid_config {
 	u32	leaf;
 	u32	sub_leaf;
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 610b1de9eb2b..b8f28d86d4fd 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -535,6 +535,7 @@ struct kvm_pmu_event_filter {
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -580,4 +581,34 @@ struct kvm_tdx_capabilities {
 	struct kvm_tdx_cpuid_config cpuid_configs[0];
 };
 
+struct kvm_tdx_init_vm {
+	__u64 attributes;
+	__u64 mrconfigid[6];	/* sha384 digest */
+	__u64 mrowner[6];	/* sha384 digest */
+	__u64 mrownerconfig[6];	/* sha384 digest */
+	union {
+		/*
+		 * KVM_TDX_INIT_VM is called before vcpu creation, thus before
+		 * KVM_SET_CPUID2.  CPUID configurations need to be passed.
+		 *
+		 * This configuration supersedes KVM_SET_CPUID{,2}.
+		 * The user space VMM, e.g. qemu, should make them consistent
+		 * with these values.
+		 * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(256)
+		 * = 8KB.
+		 */
+		struct {
+			struct kvm_cpuid2 cpuid;
+			/* 8KB with KVM_MAX_CPUID_ENTRIES. */
+			struct kvm_cpuid_entry2 entries[];
+		};
+		/*
+		 * For future extensibility.
+		 * The size(struct kvm_tdx_init_vm) = 16KB.
+		 * This should be enough given sizeof(TD_PARAMS) = 1024
+		 */
+		__u64 reserved[2029];
+	};
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d11950d18226..0b309bbfe4e5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,7 +6,6 @@
 #include "capabilities.h"
 #include "x86_ops.h"
 #include "tdx.h"
-#include "tdx_ops.h"
 #include "x86.h"
 
 #undef pr_fmt
@@ -269,12 +268,15 @@ static int tdx_do_tdh_mng_key_config(void *param)
 	return 0;
 }
 
-static int __tdx_td_init(struct kvm *kvm);
-
 int tdx_vm_init(struct kvm *kvm)
 {
-	/* Place holder for now. */
-	return __tdx_td_init(kvm);
+	/*
+	 * This function initializes only KVM software construct.  It doesn't
+	 * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
+	 * It is handled by KVM_TDX_INIT_VM, __tdx_td_init().
+	 */
+
+	return 0;
 }
 
 int tdx_dev_ioctl(void __user *argp)
@@ -323,9 +325,147 @@ int tdx_dev_ioctl(void __user *argp)
 	return 0;
 }
 
-static int __tdx_td_init(struct kvm *kvm)
+/*
+ * cpuid entry lookup in TDX cpuid config way.
+ * The difference is how to specify index(subleaves).
+ * Specify index to TDX_CPUID_NO_SUBLEAF for CPUID leaf with no-subleaves.
+ */
+static const struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(const struct kvm_cpuid2 *cpuid,
+							   u32 function, u32 index)
+{
+	int i;
+
+	/* In TDX CPU CONFIG, TDX_CPUID_NO_SUBLEAF means index = 0. */
+	if (index == TDX_CPUID_NO_SUBLEAF)
+		index = 0;
+
+	for (i = 0; i < cpuid->nent; i++) {
+		const struct kvm_cpuid_entry2 *e = &cpuid->entries[i];
+
+		if (e->function == function &&
+		    (e->index == index ||
+		     !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
+			return e;
+	}
+	return NULL;
+}
+
+static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
+			struct kvm_tdx_init_vm *init_vm)
+{
+	const struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
+	const struct kvm_cpuid_entry2 *entry;
+	u64 guest_supported_xcr0;
+	u64 guest_supported_xss;
+	int max_pa;
+	int i;
+
+	if (kvm->created_vcpus)
+		return -EBUSY;
+	td_params->max_vcpus = kvm->max_vcpus;
+	td_params->attributes = init_vm->attributes;
+	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
+		/*
+		 * TODO: save/restore PMU related registers around TDENTER.
+		 * Once it's done, remove this guard.
+		 */
+		pr_warn("TD doesn't support perfmon yet. KVM needs to save/restore "
+			"host perf registers properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
+		const struct tdx_cpuid_config *config = &tdx_caps.cpuid_configs[i];
+		const struct kvm_cpuid_entry2 *entry =
+			tdx_find_cpuid_entry(cpuid, config->leaf, config->sub_leaf);
+		struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
+
+		if (!entry)
+			continue;
+
+		value->eax = entry->eax & config->eax;
+		value->ebx = entry->ebx & config->ebx;
+		value->ecx = entry->ecx & config->ecx;
+		value->edx = entry->edx & config->edx;
+	}
+
+	max_pa = 36;
+	entry = tdx_find_cpuid_entry(cpuid, 0x80000008, 0);
+	if (entry)
+		max_pa = entry->eax & 0xff;
+
+	td_params->eptp_controls = VMX_EPTP_MT_WB;
+	/*
+	 * No CPU supports 4-level && max_pa > 48.
+	 * "5-level paging and 5-level EPT" section 4.1 4-level EPT
+	 * "4-level EPT is limited to translating 48-bit guest-physical
+	 *  addresses."
+	 * cpu_has_vmx_ept_5levels() check is just in case.
+	 */
+	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+		td_params->eptp_controls |= VMX_EPTP_PWL_5;
+		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
+	} else {
+		td_params->eptp_controls |= VMX_EPTP_PWL_4;
+	}
+
+	/* Setup td_params.xfam */
+	entry = tdx_find_cpuid_entry(cpuid, 0xd, 0);
+	if (entry)
+		guest_supported_xcr0 = (entry->eax | ((u64)entry->edx << 32));
+	else
+		guest_supported_xcr0 = 0;
+	guest_supported_xcr0 &= kvm_caps.supported_xcr0;
+
+	entry = tdx_find_cpuid_entry(cpuid, 0xd, 1);
+	if (entry)
+		guest_supported_xss = (entry->ecx | ((u64)entry->edx << 32));
+	else
+		guest_supported_xss = 0;
+	/* PT can be exposed to TD guest regardless of KVM's XSS support */
+	guest_supported_xss &= (kvm_caps.supported_xss | XFEATURE_MASK_PT);
+
+	td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
+	if (td_params->xfam & XFEATURE_MASK_LBR) {
+		/*
+		 * TODO: once KVM supports LBR(save/restore LBR related
+		 * registers around TDENTER), remove this guard.
+		 */
+		pr_warn("TD doesn't support LBR yet. KVM needs to save/restore "
+			"IA32_LBR_DEPTH properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	if (td_params->xfam & XFEATURE_MASK_XTILE) {
+		/*
+		 * TODO: once KVM supports AMX(save/restore AMX related
+		 * registers around TDENTER), remove this guard.
+		 */
+		pr_warn("TD doesn't support AMX yet. KVM needs to save/restore "
+			"IA32_XFD, IA32_XFD_ERR properly.\n");
+		return -EOPNOTSUPP;
+	}
+
+	td_params->tsc_frequency =
+		TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);
+
+#define MEMCPY_SAME_SIZE(dst, src)				\
+	do {							\
+		BUILD_BUG_ON(sizeof(dst) != sizeof(src));	\
+		memcpy((dst), (src), sizeof(dst));		\
+	} while (0)
+
+	MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
+	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
+	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
+
+	return 0;
+}
+
+static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_module_output out;
 	cpumask_var_t packages;
 	unsigned long *tdcs_pa = NULL;
 	unsigned long tdr_pa = 0;
@@ -439,10 +579,13 @@ static int __tdx_td_init(struct kvm *kvm)
 		}
 	}
 
-	/*
-	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT requires a dedicated
-	 * ioctl() to configure the CPUID values for the TD.
-	 */
+	err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_INIT, err, &out);
+		ret = -EIO;
+		goto teardown;
+	}
+
 	return 0;
 
 	/*
@@ -477,6 +620,69 @@ static int __tdx_td_init(struct kvm *kvm)
 	return ret;
 }
 
+static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_vm *init_vm = NULL;
+	struct td_params *td_params = NULL;
+	void *entries_end;
+	int ret;
+
+	BUILD_BUG_ON(sizeof(*init_vm) != 16 * 1024);
+	BUILD_BUG_ON((sizeof(*init_vm) - offsetof(typeof(*init_vm), entries)) /
+		     sizeof(init_vm->entries[0]) < KVM_MAX_CPUID_ENTRIES);
+	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
+
+	if (is_hkid_assigned(kvm_tdx))
+		return -EINVAL;
+
+	if (cmd->flags)
+		return -EINVAL;
+
+	init_vm = kzalloc(sizeof(*init_vm), GFP_KERNEL);
+	if (!init_vm)
+		return -ENOMEM;
+	if (copy_from_user(init_vm, (void __user *)cmd->data, sizeof(*init_vm))) {
+		ret = -EFAULT;
+		goto out;
+	}
+
+	ret = -EINVAL;
+	if (init_vm->cpuid.padding)
+		goto out;
+	/* init_vm->entries shouldn't overrun. */
+	entries_end = init_vm->entries + init_vm->cpuid.nent;
+	if (entries_end > (void *)(init_vm + 1))
+		goto out;
+	/* Unused part must be zero. */
+	if (memchr_inv(entries_end, 0, (void *)(init_vm + 1) - entries_end))
+		goto out;
+
+	td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL);
+	if (!td_params) {
+		ret = -ENOMEM;
+		goto out;
+	}
+
+	ret = setup_tdparams(kvm, td_params, init_vm);
+	if (ret)
+		goto out;
+
+	ret = __tdx_td_init(kvm, td_params);
+	if (ret)
+		goto out;
+
+	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
+	kvm_tdx->attributes = td_params->attributes;
+	kvm_tdx->xfam = td_params->xfam;
+
+out:
+	/* kfree() accepts NULL. */
+	kfree(init_vm);
+	kfree(td_params);
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -490,6 +696,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	mutex_lock(&kvm->lock);
 
 	switch (tdx_cmd.id) {
+	case KVM_TDX_INIT_VM:
+		r = tdx_td_init(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e78d72cf4c3a..1b950f98242e 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -2,6 +2,8 @@
 #ifndef __KVM_X86_TDX_H
 #define __KVM_X86_TDX_H
 
+#include "tdx_ops.h"
+
 #ifdef CONFIG_INTEL_TDX_HOST
 struct kvm_tdx {
 	struct kvm kvm;
@@ -9,7 +11,11 @@ struct kvm_tdx {
 	unsigned long tdr_pa;
 	unsigned long *tdcs_pa;
 
+	u64 attributes;
+	u64 xfam;
 	int hkid;
+
+	u64 tsc_offset;
 };
 
 struct vcpu_tdx {
@@ -36,6 +42,20 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 {
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
+
+static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
+{
+	struct tdx_module_output out;
+	u64 err;
+
+	err = tdh_mng_rd(kvm_tdx->tdr_pa, TDCS_EXEC(field), &out);
+	if (unlikely(err)) {
+		pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field, err);
+		return 0;
+	}
+	return out.r8;
+}
+
 #else
 struct kvm_tdx {
 	struct kvm kvm;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 04562740691b..eb800965b589 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -530,6 +530,7 @@ struct kvm_pmu_event_filter {
 /* Trust Domain eXtension sub-ioctl() commands. */
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -575,4 +576,36 @@ struct kvm_tdx_capabilities {
 	struct kvm_tdx_cpuid_config cpuid_configs[0];
 };
 
+struct kvm_tdx_init_vm {
+	__u64 attributes;
+	__u32 max_vcpus;
+	__u32 padding;
+	__u64 mrconfigid[6];    /* sha384 digest */
+	__u64 mrowner[6];       /* sha384 digest */
+	__u64 mrownerconfig[6]; /* sha384 digest */
+	union {
+		/*
+		 * KVM_TDX_INIT_VM is called before vcpu creation, thus before
+		 * KVM_SET_CPUID2.  CPUID configurations need to be passed.
+		 *
+		 * This configuration supersedes KVM_SET_CPUID{,2}.
+		 * The user space VMM, e.g. qemu, should make them consistent
+		 * with these values.
+		 * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(256)
+		 * = 8KB.
+		 */
+		struct {
+			struct kvm_cpuid2 cpuid;
+			/* 8KB with KVM_MAX_CPUID_ENTRIES. */
+			struct kvm_cpuid_entry2 entries[];
+		};
+		/*
+		 * For future extensibility.
+		 * The size(struct kvm_tdx_init_vm) = 16KB.
+		 * This should be enough given sizeof(TD_PARAMS) = 1024
+		 */
+		__u64 reserved[2028];
+	};
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 020/113] KVM: TDX: Make pmu_intel.c ignore guest TD case
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (18 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package isaku.yamahata
                   ` (92 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX KVM doesn't support PMU yet (supporting it is future work in another
patch series), and pmu_intel.c touches VMX-specific structures during vcpu
initialization.  As a workaround, add a dummy structure to struct vcpu_tdx
so that pmu_intel.c can ignore the TDX case.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/pmu_intel.c | 46 +++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/pmu_intel.h | 28 ++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h       | 11 +++++++--
 arch/x86/kvm/vmx/vmx.c       |  2 +-
 arch/x86/kvm/vmx/vmx.h       | 32 +------------------------
 5 files changed, 84 insertions(+), 35 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/pmu_intel.h

diff --git a/arch/x86/kvm/vmx/pmu_intel.c b/arch/x86/kvm/vmx/pmu_intel.c
index efce9ad70e4e..39f43b0290c5 100644
--- a/arch/x86/kvm/vmx/pmu_intel.c
+++ b/arch/x86/kvm/vmx/pmu_intel.c
@@ -19,6 +19,7 @@
 #include "lapic.h"
 #include "nested.h"
 #include "pmu.h"
+#include "tdx.h"
 
 #define MSR_PMC_FULL_WIDTH_BIT      (MSR_IA32_PMC0 - MSR_IA32_PERFCTR0)
 
@@ -37,6 +38,26 @@ static struct kvm_event_hw_type_mapping intel_arch_events[] = {
 /* mapping between fixed pmc index and intel_arch_events array */
 static int fixed_pmc_events[] = {1, 0, 7};
 
+struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_TDX_HOST
+	if (is_td_vcpu(vcpu))
+		return &to_tdx(vcpu)->lbr_desc;
+#endif
+
+	return &to_vmx(vcpu)->lbr_desc;
+}
+
+struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
+{
+#ifdef CONFIG_INTEL_TDX_HOST
+	if (is_td_vcpu(vcpu))
+		return &to_tdx(vcpu)->lbr_desc.records;
+#endif
+
+	return &to_vmx(vcpu)->lbr_desc.records;
+}
+
 static void reprogram_fixed_counters(struct kvm_pmu *pmu, u64 data)
 {
 	struct kvm_pmc *pmc;
@@ -169,6 +190,23 @@ static inline struct kvm_pmc *get_fw_gp_pmc(struct kvm_pmu *pmu, u32 msr)
 	return get_gp_pmc(pmu, msr, MSR_IA32_PMC0);
 }
 
+bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+	return cpuid_model_is_consistent(vcpu);
+}
+
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
+{
+	struct x86_pmu_lbr *lbr = vcpu_to_lbr_records(vcpu);
+
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return lbr->nr && (vcpu_get_perf_capabilities(vcpu) & PMU_CAP_LBR_FMT);
+}
+
 static bool intel_pmu_is_valid_lbr_msr(struct kvm_vcpu *vcpu, u32 index)
 {
 	struct x86_pmu_lbr *records = vcpu_to_lbr_records(vcpu);
@@ -279,6 +317,9 @@ int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu)
 					PERF_SAMPLE_BRANCH_USER,
 	};
 
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return 0;
+
 	if (unlikely(lbr_desc->event)) {
 		__set_bit(INTEL_PMC_IDX_FIXED_VLBR, pmu->pmc_in_use);
 		return 0;
@@ -588,7 +629,7 @@ static void intel_pmu_refresh(struct kvm_vcpu *vcpu)
 		INTEL_PMC_MAX_GENERIC, pmu->nr_arch_fixed_counters);
 
 	perf_capabilities = vcpu_get_perf_capabilities(vcpu);
-	if (cpuid_model_is_consistent(vcpu) &&
+	if (intel_pmu_lbr_is_compatible(vcpu) &&
 	    (perf_capabilities & PMU_CAP_LBR_FMT))
 		x86_perf_get_lbr(&lbr_desc->records);
 	else
@@ -644,6 +685,9 @@ static void intel_pmu_reset(struct kvm_vcpu *vcpu)
 	struct kvm_pmc *pmc = NULL;
 	int i;
 
+	if (is_td_vcpu(vcpu))
+		return;
+
 	for (i = 0; i < KVM_INTEL_PMC_MAX_GENERIC; i++) {
 		pmc = &pmu->gp_counters[i];
 
diff --git a/arch/x86/kvm/vmx/pmu_intel.h b/arch/x86/kvm/vmx/pmu_intel.h
new file mode 100644
index 000000000000..66bba47c1269
--- /dev/null
+++ b/arch/x86/kvm/vmx/pmu_intel.h
@@ -0,0 +1,28 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef __KVM_X86_VMX_PMU_INTEL_H
+#define  __KVM_X86_VMX_PMU_INTEL_H
+
+struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu);
+struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu);
+
+bool intel_pmu_lbr_is_compatible(struct kvm_vcpu *vcpu);
+bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu);
+int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
+
+struct lbr_desc {
+	/* Basic info about guest LBR records. */
+	struct x86_pmu_lbr records;
+
+	/*
+	 * Emulate LBR feature via passthrough LBR registers when the
+	 * per-vcpu guest LBR event is scheduled on the current pcpu.
+	 *
+	 * The records may be inaccurate if the host reclaims the LBR.
+	 */
+	struct perf_event *event;
+
+	/* True if LBRs are marked as not intercepted in the MSR bitmap */
+	bool msr_passthrough;
+};
+
+#endif /* __KVM_X86_VMX_PMU_INTEL_H */
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 1b950f98242e..af7fdc1516d5 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -2,9 +2,11 @@
 #ifndef __KVM_X86_TDX_H
 #define __KVM_X86_TDX_H
 
+#ifdef CONFIG_INTEL_TDX_HOST
+
+#include "pmu_intel.h"
 #include "tdx_ops.h"
 
-#ifdef CONFIG_INTEL_TDX_HOST
 struct kvm_tdx {
 	struct kvm kvm;
 
@@ -20,7 +22,12 @@ struct kvm_tdx {
 
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
-	/* TDX specific members follow. */
+
+	/*
+	 * Dummy to make pmu_intel not corrupt memory.
+	 * TODO: Support PMU for TDX.  Future work.
+	 */
+	struct lbr_desc lbr_desc;
 };
 
 static inline bool is_td(struct kvm *kvm)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5dc7687dcf16..5b8369e67939 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -2434,7 +2434,7 @@ int vmx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 			if ((data & PMU_CAP_LBR_FMT) !=
 			    (kvm_caps.supported_perf_cap & PMU_CAP_LBR_FMT))
 				return 1;
-			if (!cpuid_model_is_consistent(vcpu))
+			if (!intel_pmu_lbr_is_compatible(vcpu))
 				return 1;
 		}
 		if (data & PERF_CAP_PEBS_FORMAT) {
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index a3da84f4ea45..d49d0ace9fb8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -11,6 +11,7 @@
 #include "capabilities.h"
 #include "../kvm_cache_regs.h"
 #include "posted_intr.h"
+#include "pmu_intel.h"
 #include "vmcs.h"
 #include "vmx_ops.h"
 #include "../cpuid.h"
@@ -105,22 +106,6 @@ static inline bool intel_pmu_has_perf_global_ctrl(struct kvm_pmu *pmu)
 	return pmu->version > 1;
 }
 
-struct lbr_desc {
-	/* Basic info about guest LBR records. */
-	struct x86_pmu_lbr records;
-
-	/*
-	 * Emulate LBR feature via passthrough LBR registers when the
-	 * per-vcpu guest LBR event is scheduled on the current pcpu.
-	 *
-	 * The records may be inaccurate if the host reclaims the LBR.
-	 */
-	struct perf_event *event;
-
-	/* True if LBRs are marked as not intercepted in the MSR bitmap */
-	bool msr_passthrough;
-};
-
 /*
  * The nested_vmx structure is part of vcpu_vmx, and holds information we need
  * for correct emulation of VMX (i.e., nested VMX) on this vcpu.
@@ -650,21 +635,6 @@ static inline struct vcpu_vmx *to_vmx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_vmx, vcpu);
 }
 
-static inline struct lbr_desc *vcpu_to_lbr_desc(struct kvm_vcpu *vcpu)
-{
-	return &to_vmx(vcpu)->lbr_desc;
-}
-
-static inline struct x86_pmu_lbr *vcpu_to_lbr_records(struct kvm_vcpu *vcpu)
-{
-	return &vcpu_to_lbr_desc(vcpu)->records;
-}
-
-static inline bool intel_pmu_lbr_is_enabled(struct kvm_vcpu *vcpu)
-{
-	return !!vcpu_to_lbr_records(vcpu)->nr;
-}
-
 void intel_pmu_cross_mapped_check(struct kvm_pmu *pmu);
 int intel_pmu_create_guest_lbr_event(struct kvm_vcpu *vcpu);
 void vmx_passthrough_lbr_msrs(struct kvm_vcpu *vcpu);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (19 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 020/113] KVM: TDX: Make pmu_intel.c ignore guest TD case isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 10:23   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 022/113] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction isaku.yamahata
                   ` (91 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

In order to reclaim a TDX HKID (i.e. when deleting a guest TD), KVM needs
to call TDH.PHYMEM.PAGE.WBINVD on all packages.  If any TDX HKID is in
use, refuse to offline the last online cpu of a package; the offline
attempt then fails with -EBUSY.  Add an arch callback for cpu offline to
implement this.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/vmx/main.c            |  1 +
 arch/x86/kvm/vmx/tdx.c             | 40 +++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/x86_ops.h         |  2 ++
 arch/x86/kvm/x86.c                 |  5 ++++
 include/linux/kvm_host.h           |  1 +
 virt/kvm/kvm_main.c                | 12 +++++++--
 8 files changed, 60 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 552de893af75..1a27f3aee982 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -18,6 +18,7 @@ KVM_X86_OP(check_processor_compatibility)
 KVM_X86_OP(hardware_enable)
 KVM_X86_OP(hardware_disable)
 KVM_X86_OP(hardware_unsetup)
+KVM_X86_OP_OPTIONAL_RET0(offline_cpu)
 KVM_X86_OP(has_emulated_msr)
 KVM_X86_OP(vcpu_after_set_cpuid)
 KVM_X86_OP(is_vm_type_supported)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index e199ddf0bb00..30f4ddb18548 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1534,6 +1534,7 @@ struct kvm_x86_ops {
 	int (*hardware_enable)(void);
 	void (*hardware_disable)(void);
 	void (*hardware_unsetup)(void);
+	int (*offline_cpu)(void);
 	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
 	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c5f2515026e9..ddf0742f1f67 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -77,6 +77,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.check_processor_compatibility = vmx_check_processor_compat,
 
 	.hardware_unsetup = vt_hardware_unsetup,
+	.offline_cpu = tdx_offline_cpu,
 
 	.hardware_enable = vmx_hardware_enable,
 	.hardware_disable = vmx_hardware_disable,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0b309bbfe4e5..557a609c5147 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -42,6 +42,7 @@ static struct tdx_capabilities tdx_caps;
  */
 static DEFINE_MUTEX(tdx_lock);
 static struct mutex *tdx_mng_key_config_lock;
+static atomic_t nr_configured_hkid;
 
 static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 {
@@ -209,7 +210,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 		pr_err("tdh_mng_key_freeid failed. HKID %d is leaked.\n",
 			kvm_tdx->hkid);
 		return;
-	}
+	} else
+		atomic_dec(&nr_configured_hkid);
 
 free_hkid:
 	tdx_hkid_free(kvm_tdx);
@@ -560,6 +562,8 @@ static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params)
 		if (ret)
 			break;
 	}
+	if (!ret)
+		atomic_inc(&nr_configured_hkid);
 	cpus_read_unlock();
 	free_cpumask_var(packages);
 	if (ret)
@@ -791,3 +795,37 @@ void tdx_hardware_unsetup(void)
 	/* kfree accepts NULL. */
 	kfree(tdx_mng_key_config_lock);
 }
+
+int tdx_offline_cpu(void)
+{
+	int curr_cpu = smp_processor_id();
+	cpumask_var_t packages;
+	int ret = 0;
+	int i;
+
+	if (!atomic_read(&nr_configured_hkid))
+		return 0;
+
+	/*
+	 * To reclaim an hkid, TDH.PHYMEM.PAGE.WBINVD needs to run on all packages.
+	 * If this is the last online cpu on the package, refuse to offline it.
+	 */
+	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
+		return -ENOMEM;
+
+	for_each_online_cpu(i) {
+		if (i != curr_cpu)
+			cpumask_set_cpu(topology_physical_package_id(i), packages);
+	}
+	if (!cpumask_test_cpu(topology_physical_package_id(curr_cpu), packages))
+		ret = -EBUSY;
+	free_cpumask_var(packages);
+	if (ret)
+		/*
+		 * Warn so that a human operator can understand why the
+		 * CPU could not be offlined.
+		 */
+		pr_warn("TDX requires all packages to have an online CPU.  "
+			"Delete all TDs in order to offline all CPUs of a package.\n");
+	return ret;
+}
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 3d0f519727c6..6c40dda1cc2f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -142,6 +142,7 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);
 bool tdx_is_vm_type_supported(unsigned long type);
 int tdx_dev_ioctl(void __user *argp);
+int tdx_offline_cpu(void);
 
 int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_release_hkid(struct kvm *kvm);
@@ -152,6 +153,7 @@ static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline void tdx_hardware_unsetup(void) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
+static inline int tdx_offline_cpu(void) { return 0; }
 
 static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 0fa91a9708aa..1fb135e0c98f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12100,6 +12100,11 @@ void kvm_arch_hardware_disable(void)
 	drop_user_return_notifiers();
 }
 
+int kvm_arch_offline_cpu(unsigned int cpu)
+{
+	return static_call(kvm_x86_offline_cpu)();
+}
+
 bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
 {
 	return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 6fada852c064..cd1f3634dd6a 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1459,6 +1459,7 @@ static inline void kvm_create_vcpu_debugfs(struct kvm_vcpu *vcpu) {}
 int kvm_arch_hardware_enable(void);
 void kvm_arch_hardware_disable(void);
 #endif
+int kvm_arch_offline_cpu(unsigned int cpu);
 int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
 bool kvm_arch_vcpu_in_kernel(struct kvm_vcpu *vcpu);
 int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1cfa7da92ad0..6c61b71b56d2 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -5516,13 +5516,21 @@ static void hardware_disable_nolock(void *junk)
 	__this_cpu_write(hardware_enabled, false);
 }
 
+__weak int kvm_arch_offline_cpu(unsigned int cpu)
+{
+	return 0;
+}
+
 static int kvm_offline_cpu(unsigned int cpu)
 {
+	int r = 0;
+
 	mutex_lock(&kvm_lock);
-	if (kvm_usage_count)
+	r = kvm_arch_offline_cpu(cpu);
+	if (!r && kvm_usage_count)
 		hardware_disable_nolock(NULL);
 	mutex_unlock(&kvm_lock);
-	return 0;
+	return r;
 }
 
 static void hardware_disable_all_nolock(void)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 022/113] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (20 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
                   ` (90 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD vcpu
creation/destruction.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 221372cfb4af..a4ee04271d66 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -9,15 +9,15 @@ Layer status
 What qemu can do
 ----------------
 - TDX VM TYPE is exposed to Qemu.
-- Qemu can try to create VM of TDX VM type and then fails.
+- Qemu can create/destroy guest of TDX vm type.
 
 Patch Layer status
 ------------------
   Patch layer                          Status
 * TDX, VMX coexistence:                 Applied
 * TDX architectural definitions:        Applied
-* TD VM creation/destruction:           Applying
-* TD vcpu creation/destruction:         Not yet
+* TD VM creation/destruction:           Applied
+* TD vcpu creation/destruction:         Applying
 * TDX EPT violation:                    Not yet
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (21 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 022/113] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 10:46   ` Zhi Wang
  2023-01-19  0:45   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
                   ` (89 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

The next step of TDX guest creation is to create vcpus.  Allocate the TDX
vcpu structures and partially initialize them.  Allocate the pages of the
TDX vcpu for the TDX module, but do not yet donate those pages to the TDX
module.

In the conventional (non-TDX) case, cpuid is empty at vcpu initialization
and is configured after the vcpu is initialized.  Because TDX supports
only X2APIC mode, cpuid is forcibly initialized to advertise X2APIC at
vcpu initialization.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
Changes v10 -> v11:
- NULL check of kvmalloc_array() in tdx_vcpu_reset. Move it to
  tdx_vcpu_create()

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 40 ++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 75 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h | 10 +++++
 arch/x86/kvm/x86.c         |  2 +
 4 files changed, 123 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ddf0742f1f67..59813ca05f36 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -63,6 +63,38 @@ static void vt_vm_free(struct kvm *kvm)
 		tdx_vm_free(kvm);
 }
 
+static int vt_vcpu_precreate(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_vcpu_precreate(kvm);
+}
+
+static int vt_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_create(vcpu);
+
+	return vmx_vcpu_create(vcpu);
+}
+
+static void vt_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_free(vcpu);
+
+	return vmx_vcpu_free(vcpu);
+}
+
+static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_reset(vcpu, init_event);
+
+	return vmx_vcpu_reset(vcpu, init_event);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -90,10 +122,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vm_destroy = vt_vm_destroy,
 	.vm_free = vt_vm_free,
 
-	.vcpu_precreate = vmx_vcpu_precreate,
-	.vcpu_create = vmx_vcpu_create,
-	.vcpu_free = vmx_vcpu_free,
-	.vcpu_reset = vmx_vcpu_reset,
+	.vcpu_precreate = vt_vcpu_precreate,
+	.vcpu_create = vt_vcpu_create,
+	.vcpu_free = vt_vcpu_free,
+	.vcpu_reset = vt_vcpu_reset,
 
 	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
 	.vcpu_load = vmx_vcpu_load,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 557a609c5147..099f0737a5aa 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -281,6 +281,81 @@ int tdx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+int tdx_vcpu_create(struct kvm_vcpu *vcpu)
+{
+	struct kvm_cpuid_entry2 *e;
+
+	/*
+	 * On vcpu creation, the cpuid entries are blank.  Forcibly enable
+	 * the X2APIC feature so that X2APIC can be used.
+	 * Because vcpu_reset() can't return an error, do the allocation here.
+	 */
+	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
+	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
+	e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);
+	if (!e)
+		return -ENOMEM;
+	*e  = (struct kvm_cpuid_entry2) {
+		.function = 1,	/* Features for X2APIC */
+		.index = 0,
+		.eax = 0,
+		.ebx = 0,
+		.ecx = 1ULL << 21,	/* X2APIC */
+		.edx = 0,
+	};
+	vcpu->arch.cpuid_entries = e;
+	vcpu->arch.cpuid_nent = 1;
+
+	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
+	if (!vcpu->arch.apic)
+		return -EINVAL;
+
+	fpstate_set_confidential(&vcpu->arch.guest_fpu);
+
+	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
+
+	vcpu->arch.cr0_guest_owned_bits = -1ul;
+	vcpu->arch.cr4_guest_owned_bits = -1ul;
+
+	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
+	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
+	vcpu->arch.guest_state_protected =
+		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+
+	return 0;
+}
+
+void tdx_vcpu_free(struct kvm_vcpu *vcpu)
+{
+	/* This is stub for now.  More logic will come. */
+}
+
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
+{
+	struct msr_data apic_base_msr;
+
+	/* TDX doesn't support INIT event. */
+	if (WARN_ON_ONCE(init_event))
+		goto td_bugged;
+
+	/* TDX requires X2APIC. */
+	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
+	if (kvm_vcpu_is_reset_bsp(vcpu))
+		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
+	apic_base_msr.host_initiated = true;
+	if (WARN_ON_ONCE(kvm_set_apic_base(vcpu, &apic_base_msr)))
+		goto td_bugged;
+
+	/*
+	 * Don't update mp_state to runnable because more initialization
+	 * is needed by TDX_VCPU_INIT.
+	 */
+	return;
+
+td_bugged:
+	vcpu->kvm->vm_bugged = true;
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 6c40dda1cc2f..37ab2cfd35bc 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -147,7 +147,12 @@ int tdx_offline_cpu(void);
 int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_release_hkid(struct kvm *kvm);
 void tdx_vm_free(struct kvm *kvm);
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+
+int tdx_vcpu_create(struct kvm_vcpu *vcpu);
+void tdx_vcpu_free(struct kvm_vcpu *vcpu);
+void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline void tdx_hardware_unsetup(void) {}
@@ -159,7 +164,12 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
 static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
 static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
 static inline void tdx_vm_free(struct kvm *kvm) {}
+
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
+static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1fb135e0c98f..e8bc66031a1d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -492,6 +492,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
 	kvm_recalculate_apic_map(vcpu->kvm);
 	return 0;
 }
+EXPORT_SYMBOL_GPL(kvm_set_apic_base);
 
 /*
  * Handle a fault on a hardware virtualization (VMX or SVM) instruction.
@@ -12109,6 +12110,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
 {
 	return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);
 
 bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
 {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (22 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 16:07   ` Zhi Wang
  2023-01-19 10:37   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 025/113] KVM: TDX: Use private memory for TDX isaku.yamahata
                   ` (88 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

A TD guest vcpu needs to be configured before it is ready to run.  This
requires additional information from the device model (e.g. qemu); one
64-bit value is passed to the vcpu's RCX as an initial value.  Repurpose
KVM_MEMORY_ENCRYPT_OP to be vcpu-scoped and add a new sub-command,
KVM_TDX_INIT_VCPU, under it for this additional vcpu configuration.

Add a callback for vCPU-scoped KVM_MEMORY_ENCRYPT_OP operations and add
the new sub-command, KVM_TDX_INIT_VCPU, for further vcpu initialization.
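
For illustration, below is a minimal userspace sketch (not part of the
patch) of how a device model might invoke the new sub-command on a vcpu
fd.  It assumes the updated uapi headers from this series are installed;
the fields of struct kvm_tdx_cmd mirror the checks in tdx_vcpu_ioctl()
added by this patch, and "vcpu_fd" and the RCX value are placeholders.

#include <string.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int tdx_init_vcpu(int vcpu_fd, __u64 initial_rcx)
{
	struct kvm_tdx_cmd cmd;

	memset(&cmd, 0, sizeof(cmd));	/* flags/error/unused must be 0 */
	cmd.id = KVM_TDX_INIT_VCPU;
	cmd.data = initial_rcx;		/* forwarded to the vcpu's RCX */

	/* vcpu-scoped KVM_MEMORY_ENCRYPT_OP added by this patch */
	return ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
}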

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h    |   1 +
 arch/x86/include/asm/kvm_host.h       |   1 +
 arch/x86/include/uapi/asm/kvm.h       |   1 +
 arch/x86/kvm/vmx/main.c               |   9 ++
 arch/x86/kvm/vmx/tdx.c                | 147 +++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h                |   7 ++
 arch/x86/kvm/vmx/x86_ops.h            |  10 +-
 arch/x86/kvm/x86.c                    |   6 ++
 tools/arch/x86/include/uapi/asm/kvm.h |   1 +
 9 files changed, 178 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 1a27f3aee982..e3e9b1c2599b 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -123,6 +123,7 @@ KVM_X86_OP(enable_smi_window)
 #endif
 KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
+KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 30f4ddb18548..35773f925cc5 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1698,6 +1698,7 @@ struct kvm_x86_ops {
 
 	int (*dev_mem_enc_ioctl)(void __user *argp);
 	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
+	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
 	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
 	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index b8f28d86d4fd..9236c1699c48 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -536,6 +536,7 @@ struct kvm_pmu_event_filter {
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 59813ca05f36..23b3ffc3fe23 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -103,6 +103,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 	return tdx_vm_ioctl(kvm, argp);
 }
 
+static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	if (!is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return tdx_vcpu_ioctl(vcpu, argp);
+}
+
 struct kvm_x86_ops vt_x86_ops __initdata = {
 	.name = KBUILD_MODNAME,
 
@@ -249,6 +257,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.dev_mem_enc_ioctl = tdx_dev_ioctl,
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
+	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
 };
 
 struct kvm_x86_init_ops vt_init_ops __initdata = {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 099f0737a5aa..e2f5a07ad4e5 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -49,6 +49,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
 }
 
+static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
+{
+	return tdx->tdvpr_pa;
+}
+
 static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
 {
 	return kvm_tdx->tdr_pa;
@@ -65,6 +70,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
 	return kvm_tdx->hkid > 0;
 }
 
+static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
+{
+	return kvm_tdx->finalized;
+}
+
 static void tdx_clear_page(unsigned long page_pa)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -327,7 +337,21 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
-	/* This is stub for now.  More logic will come. */
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	int i;
+
+	/* Can't reclaim or free pages if teardown failed. */
+	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
+		return;
+
+	if (tdx->tdvpx_pa) {
+		for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
+			tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
+		kfree(tdx->tdvpx_pa);
+		tdx->tdvpx_pa = NULL;
+	}
+	tdx_reclaim_td_page(tdx->tdvpr_pa);
+	tdx->tdvpr_pa = 0;
 }
 
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -337,6 +361,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	/* TDX doesn't support INIT event. */
 	if (WARN_ON_ONCE(init_event))
 		goto td_bugged;
+	if (WARN_ON_ONCE(is_td_vcpu_created(to_tdx(vcpu))))
+		goto td_bugged;
 
 	/* TDX requires X2APIC. */
 	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
@@ -791,6 +817,125 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	return r;
 }
 
+static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	unsigned long *tdvpx_pa = NULL;
+	unsigned long tdvpr_pa;
+	unsigned long va;
+	int ret, i;
+	u64 err;
+
+	if (is_td_vcpu_created(tdx))
+		return -EINVAL;
+
+	va = __get_free_page(GFP_KERNEL_ACCOUNT);
+	if (!va)
+		return -ENOMEM;
+	tdvpr_pa = __pa(va);
+
+	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
+			   GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+	if (!tdvpx_pa) {
+		ret = -ENOMEM;
+		goto free_tdvpr;
+	}
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		va = __get_free_page(GFP_KERNEL_ACCOUNT);
+		if (!va)
+			goto free_tdvpx;
+		tdvpx_pa[i] = __pa(va);
+	}
+
+	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
+	if (WARN_ON_ONCE(err)) {
+		ret = -EIO;
+		pr_tdx_error(TDH_VP_CREATE, err, NULL);
+		goto td_bugged_free_tdvpx;
+	}
+	tdx->tdvpr_pa = tdvpr_pa;
+
+	tdx->tdvpx_pa = tdvpx_pa;
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
+		if (WARN_ON_ONCE(err)) {
+			ret = -EIO;
+			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
+			for (; i < tdx_caps.tdvpx_nr_pages; i++) {
+				free_page((unsigned long)__va(tdvpx_pa[i]));
+				tdvpx_pa[i] = 0;
+			}
+			goto td_bugged;
+		}
+	}
+
+	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
+	if (WARN_ON_ONCE(err)) {
+		ret = -EIO;
+		pr_tdx_error(TDH_VP_INIT, err, NULL);
+		goto td_bugged;
+	}
+
+	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+
+	return 0;
+
+td_bugged_free_tdvpx:
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
+		free_page((unsigned long)__va(tdvpx_pa[i]));
+		tdvpx_pa[i] = 0;
+	}
+	kfree(tdvpx_pa);
+td_bugged:
+	vcpu->kvm->vm_bugged = true;
+	return ret;
+
+free_tdvpx:
+	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
+		if (tdvpx_pa[i])
+			free_page((unsigned long)__va(tdvpx_pa[i]));
+	kfree(tdvpx_pa);
+	tdx->tdvpx_pa = NULL;
+free_tdvpr:
+	if (tdvpr_pa)
+		free_page((unsigned long)__va(tdvpr_pa));
+	tdx->tdvpr_pa = 0;
+
+	return ret;
+}
+
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	struct kvm_tdx_cmd cmd;
+	int ret;
+
+	if (tdx->vcpu_initialized)
+		return -EINVAL;
+
+	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	if (copy_from_user(&cmd, argp, sizeof(cmd)))
+		return -EFAULT;
+
+	if (cmd.error || cmd.unused)
+		return -EINVAL;
+
+	/* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
+	if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
+		return -EINVAL;
+
+	ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
+	if (ret)
+		return ret;
+
+	tdx->vcpu_initialized = true;
+	return 0;
+}
+
 static int __init tdx_module_setup(void)
 {
 	const struct tdsysinfo_struct *tdsysinfo;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index af7fdc1516d5..e909883d60fa 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,12 +17,19 @@ struct kvm_tdx {
 	u64 xfam;
 	int hkid;
 
+	bool finalized;
+
 	u64 tsc_offset;
 };
 
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
+	unsigned long tdvpr_pa;
+	unsigned long *tdvpx_pa;
+
+	bool vcpu_initialized;
+
 	/*
 	 * Dummy to make pmu_intel not corrupt memory.
 	 * TODO: Support PMU for TDX.  Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 37ab2cfd35bc..fba8d0800597 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -148,11 +148,12 @@ int tdx_vm_init(struct kvm *kvm);
 void tdx_mmu_release_hkid(struct kvm *kvm);
 void tdx_vm_free(struct kvm *kvm);
 
-int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
-
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+
+int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
+int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline void tdx_hardware_unsetup(void) {}
@@ -165,11 +166,12 @@ static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
 static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
 static inline void tdx_vm_free(struct kvm *kvm) {}
 
-static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
-
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+
+static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
+static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e8bc66031a1d..d548d3af6428 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5976,6 +5976,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 	case KVM_SET_DEVICE_ATTR:
 		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
 		break;
+	case KVM_MEMORY_ENCRYPT_OP:
+		r = -ENOTTY;
+		if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
+			goto out;
+		r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
+		break;
 	default:
 		r = -EINVAL;
 	}
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index eb800965b589..6971f1288043 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
 enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
 
 	KVM_TDX_CMD_NR_MAX,
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 025/113] KVM: TDX: Use private memory for TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (23 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 10:45   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 026/113] [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits isaku.yamahata
                   ` (87 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Chao Peng

From: Chao Peng <chao.p.peng@linux.intel.com>

Override kvm_arch_has_private_mem() to use fd-based private memory.
Return true when the VM type is KVM_X86_TDX_VM.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/x86.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d548d3af6428..a8b555935fd8 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13498,6 +13498,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+bool kvm_arch_has_private_mem(struct kvm *kvm)
+{
+	return kvm->arch.vm_type == KVM_X86_TDX_VM;
+}
+
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 026/113] [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (24 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 025/113] KVM: TDX: Use private memory for TDX isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 027/113] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
                   ` (86 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of KVM MMU GPA
shared bits.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index a4ee04271d66..88343749d4c2 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -10,6 +10,7 @@ What qemu can do
 ----------------
 - TDX VM TYPE is exposed to Qemu.
 - Qemu can create/destroy guest of TDX vm type.
+- Qemu can create/destroy vcpu of TDX vm type.
 
 Patch Layer status
 ------------------
@@ -17,12 +18,12 @@ Patch Layer status
 * TDX, VMX coexistence:                 Applied
 * TDX architectural definitions:        Applied
 * TD VM creation/destruction:           Applied
-* TD vcpu creation/destruction:         Applying
+* TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Not yet
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
-* KVM MMU GPA shared bits:              Not yet
+* KVM MMU GPA shared bits:              Applying
 * KVM TDP refactoring for TDX:          Not yet
 * KVM TDP MMU hooks:                    Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 027/113] KVM: x86/mmu: introduce config for PRIVATE KVM MMU
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (25 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 026/113] [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 028/113] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA isaku.yamahata
                   ` (85 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

To keep the non-TDX case intact, introduce a new config option for
private KVM MMU support.  At the moment, this is a synonym for
CONFIG_INTEL_TDX_HOST && CONFIG_KVM_INTEL.  The separate config makes it
clear that it applies only to the x86 KVM MMU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/Kconfig | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 84d21ef8b3d2..a2db2169ea9c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -145,4 +145,8 @@ config KVM_XEN
 config KVM_EXTERNAL_WRITE_TRACKING
 	bool
 
+config KVM_MMU_PRIVATE
+	def_bool y
+	depends on INTEL_TDX_HOST && KVM_INTEL
+
 endif # VIRTUALIZATION
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 028/113] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (26 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 027/113] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 029/113] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX isaku.yamahata
                   ` (84 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Rick Edgecombe

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX repurposes one GPA bit (bit 51 or bit 47, depending on configuration)
to indicate whether the GPA is private (if cleared) or shared (if set) with
the VMM.  If the GPA.shared bit is set, the GPA is covered by the existing
conventional EPT pointed to by the EPTP.  If the GPA.shared bit is cleared,
the GPA is covered by the TDX module and the VMM has to issue SEAMCALLs to
operate on it.

Add a member to remember the GPA shared bit for each guest TD, add address
conversion functions between private GPA and shared GPA, and add a helper
to test whether a GPA is private.

Because struct kvm_arch (or rather struct kvm, which includes struct
kvm_arch; see kvm_arch_alloc_vm(), which passes __GFP_ZERO) is zero-cleared
when allocated, the new member remembering the GPA shared bit is guaranteed
to be zero with this patch unless it is initialized explicitly.
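
As a standalone illustration (not kernel code), the sketch below mimics
the arithmetic of the new helpers for a TD whose shared bit is GPA bit 51,
i.e. a gfn shared mask of BIT(51 - 12); for non-TD VMs the mask is zero
and kvm_is_private_gpa() returns false.

#include <stdint.h>
#include <stdio.h>

#define PAGE_SHIFT		12
#define GFN_SHARED_MASK		(1ULL << (51 - PAGE_SHIFT))	/* assumed TD config */

static uint64_t gfn_shared(uint64_t gfn)  { return gfn | GFN_SHARED_MASK; }
static uint64_t gfn_private(uint64_t gfn) { return gfn & ~GFN_SHARED_MASK; }
static int gpa_is_private(uint64_t gpa)
{
	return !((gpa >> PAGE_SHIFT) & GFN_SHARED_MASK);
}

int main(void)
{
	uint64_t gfn = 0x1234;

	printf("shared  gfn: %#llx\n", (unsigned long long)gfn_shared(gfn));
	printf("private gfn: %#llx\n", (unsigned long long)gfn_private(gfn));
	printf("gpa 0x1234000 is private: %d\n", gpa_is_private(0x1234000ULL));
	return 0;
}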

Co-developed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 ++++
 arch/x86/kvm/mmu.h              | 32 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.c          |  5 +++++
 3 files changed, 41 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 35773f925cc5..73c987b3d2b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1439,6 +1439,10 @@ struct kvm_arch {
 	 */
 #define SPLIT_DESC_CACHE_MIN_NR_OBJECTS (SPTE_ENT_PER_PAGE + 1)
 	struct kvm_mmu_memory_cache split_desc_cache;
+
+#ifdef CONFIG_KVM_MMU_PRIVATE
+	gfn_t gfn_shared_mask;
+#endif
 };
 
 struct kvm_vm_stat {
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 6bdaacb6faa0..a45f7a96b821 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -278,4 +278,36 @@ static inline gpa_t kvm_translate_gpa(struct kvm_vcpu *vcpu,
 		return gpa;
 	return translate_nested_gpa(vcpu, gpa, access, exception);
 }
+
+static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
+{
+#ifdef CONFIG_KVM_MMU_PRIVATE
+	return kvm->arch.gfn_shared_mask;
+#else
+	return 0;
+#endif
+}
+
+static inline gfn_t kvm_gfn_shared(const struct kvm *kvm, gfn_t gfn)
+{
+	return gfn | kvm_gfn_shared_mask(kvm);
+}
+
+static inline gfn_t kvm_gfn_private(const struct kvm *kvm, gfn_t gfn)
+{
+	return gfn & ~kvm_gfn_shared_mask(kvm);
+}
+
+static inline gpa_t kvm_gpa_private(const struct kvm *kvm, gpa_t gpa)
+{
+	return gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(kvm));
+}
+
+static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
+{
+	gfn_t mask = kvm_gfn_shared_mask(kvm);
+
+	return mask && !(gpa_to_gfn(gpa) & mask);
+}
+
 #endif
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e2f5a07ad4e5..a7d42c05a758 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -781,6 +781,11 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	kvm_tdx->attributes = td_params->attributes;
 	kvm_tdx->xfam = td_params->xfam;
 
+	if (td_params->exec_controls & TDX_EXEC_CONTROL_MAX_GPAW)
+		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(51));
+	else
+		kvm->arch.gfn_shared_mask = gpa_to_gfn(BIT_ULL(47));
+
 out:
 	/* kfree() accepts NULL. */
 	kfree(init_vm);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 029/113] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (27 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 028/113] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE isaku.yamahata
                   ` (83 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of KVM TDP
refactoring for TDX.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 88343749d4c2..f10aff0b060e 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -24,6 +24,6 @@ Patch Layer status
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
-* KVM MMU GPA shared bits:              Applying
-* KVM TDP refactoring for TDX:          Not yet
+* KVM MMU GPA shared bits:              Applied
+* KVM TDP refactoring for TDX:          Applying
 * KVM TDP MMU hooks:                    Not yet
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (28 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 029/113] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-25  9:24   ` Zhi Wang
  2023-01-12 16:31 ` [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE isaku.yamahata
                   ` (82 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX support will need the "suppress #VE" bit (bit 63) set as the
initial value for SPTEs.  To reduce the size of the code change, introduce
a new macro, SHADOW_NONPRESENT_VALUE, for the initial value of a shadow
page table entry (SPTE) and replace the hard-coded value 0 with it.
Initialize shadow page tables with this value.

The plan is to unconditionally set the "suppress #VE" bit for both AMD and
Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and
ignored for non-present SPTE; 2) for conventional VMX guests, KVM never
enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is
ignored by hardware.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c         | 50 ++++++++++++++++++++++++++++++----
 arch/x86/kvm/mmu/paging_tmpl.h |  3 +-
 arch/x86/kvm/mmu/spte.h        |  2 ++
 arch/x86/kvm/mmu/tdp_mmu.c     | 15 +++++-----
 4 files changed, 56 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 15d0e8f11d53..59befdfeec23 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -540,9 +540,9 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
 
 	if (!is_shadow_present_pte(old_spte) ||
 	    !spte_has_volatile_bits(old_spte))
-		__update_clear_spte_fast(sptep, 0ull);
+		__update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
 	else
-		old_spte = __update_clear_spte_slow(sptep, 0ull);
+		old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);
 
 	if (!is_shadow_present_pte(old_spte))
 		return old_spte;
@@ -576,7 +576,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
  */
 static void mmu_spte_clear_no_track(u64 *sptep)
 {
-	__update_clear_spte_fast(sptep, 0ull);
+	__update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
 }
 
 static u64 mmu_spte_get_lockless(u64 *sptep)
@@ -644,6 +644,39 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
 	}
 }
 
+#ifdef CONFIG_X86_64
+static inline void kvm_init_shadow_page(void *page)
+{
+	memset64(page, SHADOW_NONPRESENT_VALUE, 4096 / 8);
+}
+
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
+	int start, end, i, r;
+
+	start = kvm_mmu_memory_cache_nr_free_objects(mc);
+	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
+
+	/*
+	 * Note, topup may have allocated objects even if it failed to allocate
+	 * the minimum number of objects required to make forward progress _at
+	 * this time_.  Initialize newly allocated objects even on failure, as
+	 * userspace can free memory and rerun the vCPU in response to -ENOMEM.
+	 */
+	end = kvm_mmu_memory_cache_nr_free_objects(mc);
+	for (i = start; i < end; i++)
+		kvm_init_shadow_page(mc->objects[i]);
+	return r;
+}
+#else
+static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
+{
+	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
+					  PT64_ROOT_MAX_LEVEL);
+}
+#endif /* CONFIG_X86_64 */
+
 static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 {
 	int r;
@@ -653,8 +686,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
 				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
 	if (r)
 		return r;
-	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
-				       PT64_ROOT_MAX_LEVEL);
+	r = mmu_topup_shadow_page_cache(vcpu);
 	if (r)
 		return r;
 	if (maybe_indirect) {
@@ -5920,7 +5952,13 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
 	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
 
-	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
+	/*
+	 * When X86_64, initial SEPT entries are initialized with
+	 * SHADOW_NONPRESENT_VALUE.  Otherwise zeroed.  See
+	 * mmu_topup_shadow_page_cache().
+	 */
+	if (!IS_ENABLED(CONFIG_X86_64))
+		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
 
 	vcpu->arch.mmu = &vcpu->arch.root_mmu;
 	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0f6455072055..42d7106c7350 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -1036,7 +1036,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
 		gpa_t pte_gpa;
 		gfn_t gfn;
 
-		if (!sp->spt[i])
+		/* spt[i] has initial value of shadow page table allocation */
+		if (sp->spt[i] == SHADOW_NONPRESENT_VALUE)
 			continue;
 
 		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 0d8deefee66c..f190eaf6b2b5 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -148,6 +148,8 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
 
 #define MMIO_SPTE_GEN_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
 
+#define SHADOW_NONPRESENT_VALUE	0ULL
+
 extern u64 __read_mostly shadow_host_writable_mask;
 extern u64 __read_mostly shadow_mmu_writable_mask;
 extern u64 __read_mostly shadow_nx_mask;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 12e430a4ebc3..9cf5844dd34a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -701,7 +701,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * here since the SPTE is going from non-present to non-present.  Use
 	 * the raw write helper to avoid an unnecessary check on volatile bits.
 	 */
-	__kvm_tdp_mmu_write_spte(iter->sptep, 0);
+	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
 
 	return 0;
 }
@@ -878,8 +878,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			continue;
 
 		if (!shared)
-			tdp_mmu_set_spte(kvm, &iter, 0);
-		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
+			tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
+		else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
 			goto retry;
 	}
 }
@@ -935,8 +935,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
 	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
 		return false;
 
-	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
-			   sp->gfn, sp->role.level + 1, true, true);
+	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
+			   SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
+			   true, true);
 
 	return true;
 }
@@ -970,7 +971,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 		    !is_last_spte(iter.old_spte, iter.level))
 			continue;
 
-		tdp_mmu_set_spte(kvm, &iter, 0);
+		tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
 		flush = true;
 	}
 
@@ -1339,7 +1340,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
 	 * invariant that the PFN of a present * leaf SPTE can never change.
 	 * See __handle_changed_spte().
 	 */
-	tdp_mmu_set_spte(kvm, iter, 0);
+	tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
 
 	if (!pte_write(range->pte)) {
 		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (29 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 10:54   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 032/113] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask isaku.yamahata
                   ` (81 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

For a TD guest, the current way to emulate MMIO doesn't work any more, as
KVM is not able to access the private memory of the TD guest to do the
emulation.  Instead, the TD guest expects to receive a #VE when it accesses
MMIO, and it can then explicitly make a hypercall to KVM to get the
expected information.

To achieve this, the TDX module always enables "EPT-violation #VE" in the
VMCS control.  Accordingly, for the MMIO SPTE of a shared GPA:
1. KVM needs to set the "suppress #VE" bit in the non-present SPTE so that
an EPT violation happens when the TD accesses the MMIO range.  2. On the
EPT violation, KVM installs an MMIO SPTE with the "suppress #VE" bit
cleared so that the TD guest receives a #VE instead of an EPT
misconfiguration, unlike the VMX case.  For a shared GPA that is not
populated yet, an EPT violation needs to be triggered when the TD guest
accesses it, so the non-present SPTE value for shared GPAs should have the
"suppress #VE" bit set.

Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
REMOVED_SPTE.  Unconditionally set the "suppress #VE" bit (which is bit 63)
for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
present bit is off; 2) for normal VMX guest, KVM never enables the
"EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
hardware.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/vmx.h |  1 +
 arch/x86/kvm/mmu/spte.h    | 15 ++++++++++++++-
 arch/x86/kvm/mmu/tdp_mmu.c |  8 ++++++++
 3 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 498dc600bd5c..cdbf12c1a83c 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -511,6 +511,7 @@ enum vmcs_field {
 #define VMX_EPT_IPAT_BIT    			(1ull << 6)
 #define VMX_EPT_ACCESS_BIT			(1ull << 8)
 #define VMX_EPT_DIRTY_BIT			(1ull << 9)
+#define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
 #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
 						 VMX_EPT_WRITABLE_MASK |       \
 						 VMX_EPT_EXECUTABLE_MASK)
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f190eaf6b2b5..471378ee9071 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -148,7 +148,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
 
 #define MMIO_SPTE_GEN_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
 
+/*
+ * Non-present SPTE value for both VMX and SVM for TDP MMU.
+ * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
+ * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
+ *              bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
+ * For TDX:
+ *   TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
+ */
+#ifdef CONFIG_X86_64
+#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)
+static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
+#else
 #define SHADOW_NONPRESENT_VALUE	0ULL
+#endif
 
 extern u64 __read_mostly shadow_host_writable_mask;
 extern u64 __read_mostly shadow_mmu_writable_mask;
@@ -195,7 +208,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
  *
  * Only used by the TDP MMU.
  */
-#define REMOVED_SPTE	0x5a0ULL
+#define REMOVED_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
 
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 9cf5844dd34a..6111e3e9266d 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -700,6 +700,14 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
 	 * overwrite the special removed SPTE value. No bookkeeping is needed
 	 * here since the SPTE is going from non-present to non-present.  Use
 	 * the raw write helper to avoid an unnecessary check on volatile bits.
+	 *
+	 * Set non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
+	 * This is because, when TDX is enabled, the TDX module always
+	 * enables "EPT-violation #VE", so KVM needs to set the
+	 * "suppress #VE" bit in EPT table entries in order to get a
+	 * real EPT violation rather than a TDVMCALL.  KVM sets
+	 * SHADOW_NONPRESENT_VALUE (which sets the "suppress #VE" bit) so
+	 * that the bit is set when EPT table entries are zapped.
 	 */
 	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 032/113] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (30 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 10:59   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis isaku.yamahata
                   ` (80 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

To use the same shadow_mmio_mask for both TDX and VMX, add the
Suppress-VE bit to shadow_mmio_mask so that shadow_mmio_mask can be common
to both VMX and TDX.

TDX will need shadow_mmio_mask to be VMX_SUPPRESS_VE | RWX and
shadow_mmio_value to be 0 so that an EPT violation is triggered.  For VMX,
VMX_SUPPRESS_VE doesn't matter because the SPTE value is required to cause
an EPT misconfiguration; the additional bit doesn't affect the VMX logic
that adds the bit to shadow_mmio_{value, mask}.
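
The standalone sketch below (not kernel code) illustrates why this choice
of mask and value works for TDX: an ordinary non-present SPTE keeps
"suppress #VE" set and therefore does not match the MMIO check, while an
MMIO SPTE (suppress #VE cleared, RWX cleared) does.  The mmio_caching
check of the real is_mmio_spte() is omitted here.

#include <stdint.h>
#include <assert.h>

#define EPT_RWX_MASK		0x7ULL
#define EPT_SUPPRESS_VE_BIT	(1ULL << 63)

#define TDX_MMIO_MASK		(EPT_RWX_MASK | EPT_SUPPRESS_VE_BIT)
#define TDX_MMIO_VALUE		0ULL

static int is_tdx_mmio_spte(uint64_t spte)
{
	return (spte & TDX_MMIO_MASK) == TDX_MMIO_VALUE;
}

int main(void)
{
	uint64_t nonpresent = EPT_SUPPRESS_VE_BIT;	/* SHADOW_NONPRESENT_VALUE */
	uint64_t mmio_spte = 0;				/* suppress #VE cleared */

	assert(!is_tdx_mmio_spte(nonpresent));
	assert(is_tdx_mmio_spte(mmio_spte));
	return 0;
}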

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index fce6f047399f..cc0bc058fb25 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -431,7 +431,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
 	shadow_nx_mask		= 0ull;
 	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
-	shadow_present_mask	= has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
+	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
+	shadow_present_mask	=
+		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;
 	/*
 	 * EPT overrides the host MTRRs, and so KVM must program the desired
 	 * memtype directly into the SPTEs.  Note, this mask is just the mask
@@ -448,7 +450,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
 	 * of an EPT paging-structure entry is 110b (write/execute).
 	 */
 	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
-				   VMX_EPT_RWX_MASK, 0);
+				   VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (31 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 032/113] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 11:16   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 034/113] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
                   ` (79 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX will use a different shadow PTE entry value for MMIO than VMX does.
Add a member to kvm_arch and track the MMIO value per-VM instead of in a
global variable.  By using a per-VM EPT entry value for MMIO, the existing
VMX logic keeps working.  Introduce a separate setter function so that a
guest TD can override the value later.

Also require MMIO SPTE caching for TDX.  This is always the case in
practice, because TDX requires EPT and KVM's EPT allows MMIO SPTE caching.
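
For context, a hedged sketch (not a hunk from this series) of how a later
TDX patch is expected to use the new setter, given the mask and value
called out in the previous patch; the function name here is made up for
the example.

static void tdx_setup_mmio_spte_value(struct kvm *kvm)
{
	/*
	 * For a TD, clear "suppress #VE" in MMIO SPTEs: with
	 * shadow_mmio_mask = VMX_EPT_SUPPRESS_VE_BIT | RWX, a value of 0
	 * makes a TD access to the MMIO range deliver #VE to the guest.
	 * Non-TD VMs keep the default installed by kvm_mmu_init_vm().
	 */
	if (is_td(kvm))
		kvm_mmu_set_mmio_spte_value(kvm, 0);
}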

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  2 ++
 arch/x86/kvm/mmu.h              |  1 +
 arch/x86/kvm/mmu/mmu.c          |  7 ++++---
 arch/x86/kvm/mmu/spte.c         | 10 ++++++++--
 arch/x86/kvm/mmu/spte.h         |  4 ++--
 arch/x86/kvm/mmu/tdp_mmu.c      | 14 +++++++++++---
 6 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 73c987b3d2b6..807da4b95aba 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1243,6 +1243,8 @@ struct kvm_arch {
 	 */
 	spinlock_t mmu_unsync_pages_lock;
 
+	u64 shadow_mmio_value;
+
 	struct list_head assigned_dev_head;
 	struct iommu_domain *iommu_domain;
 	bool iommu_noncoherent;
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index a45f7a96b821..50d240d52697 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -101,6 +101,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)
 }
 
 void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
 void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
 
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 59befdfeec23..8d3d7deebdd0 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -2450,7 +2450,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
 				return kvm_mmu_prepare_zap_page(kvm, child,
 								invalid_list);
 		}
-	} else if (is_mmio_spte(pte)) {
+	} else if (is_mmio_spte(kvm, pte)) {
 		mmu_spte_clear_no_track(spte);
 	}
 	return 0;
@@ -4119,7 +4119,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
 	if (WARN_ON(reserved))
 		return -EINVAL;
 
-	if (is_mmio_spte(spte)) {
+	if (is_mmio_spte(vcpu->kvm, spte)) {
 		gfn_t gfn = get_mmio_spte_gfn(spte);
 		unsigned int access = get_mmio_spte_access(spte);
 
@@ -4628,7 +4628,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
 static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
 			   unsigned int access)
 {
-	if (unlikely(is_mmio_spte(*sptep))) {
+	if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
 		if (gfn != get_mmio_spte_gfn(*sptep)) {
 			mmu_spte_clear_no_track(sptep);
 			return true;
@@ -6111,6 +6111,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
 	int r;
 
+	kvm->arch.shadow_mmio_value = shadow_mmio_value;
 	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
 	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
 	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cc0bc058fb25..a23e9205fc42 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,10 +74,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 	u64 spte = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
 
-	WARN_ON_ONCE(!shadow_mmio_value);
+	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
 
 	access &= shadow_mmio_access_mask;
-	spte |= shadow_mmio_value | access;
+	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
 	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
 	spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
 		<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
@@ -413,6 +413,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
 }
 EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
 
+void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
+{
+	kvm->arch.shadow_mmio_value = mmio_value;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
+
 void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
 {
 	/* shadow_me_value must be a subset of shadow_me_mask */
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 471378ee9071..256395eb593f 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -251,9 +251,9 @@ static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
 	return to_shadow_page(__pa(sptep));
 }
 
-static inline bool is_mmio_spte(u64 spte)
+static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
 {
-	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
+	return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
 	       likely(enable_mmio_caching);
 }
 
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6111e3e9266d..dffacb7eb15a 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -19,6 +19,14 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 {
 	struct workqueue_struct *wq;
 
+	/*
+	 * TDs require mmio_caching to clear suppress_ve bit of SPTE for GPA
+	 * of MMIO so that TD can convert #VE triggered by MMIO into
+	 * TDG.VP.VMCALL<MMIO>.
+	 */
+	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !enable_mmio_caching)
+		return -EOPNOTSUPP;
+
 	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
 		return 0;
 
@@ -587,8 +595,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 		 * impact the guest since both the former and current SPTEs
 		 * are nonpresent.
 		 */
-		if (WARN_ON(!is_mmio_spte(old_spte) &&
-			    !is_mmio_spte(new_spte) &&
+		if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
+			    !is_mmio_spte(kvm, new_spte) &&
 			    !is_removed_spte(new_spte)))
 			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
 			       "should not be replaced with another,\n"
@@ -1114,7 +1122,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	}
 
 	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
-	if (unlikely(is_mmio_spte(new_spte))) {
+	if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
 		vcpu->stat.pf_mmio_spte_created++;
 		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
 				     new_spte);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 034/113] KVM: x86/mmu: Disallow fast page fault on private GPA
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (32 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 035/113] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
                   ` (78 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX requires TDX SEAMCALLs to operate on Secure EPT instead of direct memory
access, and a TDX SEAMCALL is a heavyweight operation.  Fast page fault on a
private GPA doesn't make sense.  Disallow fast page fault on private GPA.
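
kvm_is_private_gpa() comes from an earlier patch in this series.  As a
reminder, a minimal sketch of the check, assuming a TD's GPA space is split
by a shared bit as described in the earlier patches:

	static inline bool kvm_is_private_gpa(const struct kvm *kvm, gpa_t gpa)
	{
		gfn_t mask = kvm_gfn_shared_mask(kvm);

		/* Private iff the VM has a shared bit and the GPA lacks it. */
		return mask && !(gpa_to_gfn(gpa) & mask);
	}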

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/mmu/mmu.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8d3d7deebdd0..5b111a325434 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3273,8 +3273,16 @@ static int handle_abnormal_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fau
 	return RET_PF_CONTINUE;
 }
 
-static bool page_fault_can_be_fast(struct kvm_page_fault *fault)
+static bool page_fault_can_be_fast(struct kvm *kvm, struct kvm_page_fault *fault)
 {
+	/*
+	 * TDX private mapping doesn't support fast page fault because the EPT
+	 * entry is read/written with TDX SEAMCALLs instead of direct memory
+	 * access.
+	 */
+	if (kvm_is_private_gpa(kvm, fault->addr))
+		return false;
+
 	/*
 	 * Page faults with reserved bits set, i.e. faults on MMIO SPTEs, only
 	 * reach the common page fault handler if the SPTE has an invalid MMIO
@@ -3384,7 +3392,7 @@ static int fast_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	u64 *sptep = NULL;
 	uint retry_count = 0;
 
-	if (!page_fault_can_be_fast(fault))
+	if (!page_fault_can_be_fast(vcpu->kvm, fault))
 		return ret;
 
 	walk_shadow_page_lockless_begin(vcpu);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 035/113] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (33 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 034/113] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-16 11:29   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 036/113] KVM: VMX: Introduce test mode related to EPT violation VE isaku.yamahata
                   ` (77 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX requires special handling to support large private pages.  For
simplicity, only support 4K pages for TD guests for now.  Add per-VM maximum
page level support to allow different maximum page sizes for TD guests and
conventional VMX guests.
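
As a sketch of how the new field would be consumed (the tdx_vm_init() caller
below is a hypothetical placeholder for a later patch in this series):

	static int tdx_vm_init(struct kvm *kvm)
	{
		/* TDs map private memory with 4K pages only, for now. */
		kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
		return 0;
	}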

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 1 +
 arch/x86/kvm/mmu/mmu.c          | 1 +
 arch/x86/kvm/mmu/mmu_internal.h | 2 +-
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 807da4b95aba..f5b51bdef0c6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1217,6 +1217,7 @@ struct kvm_arch {
 	unsigned long n_requested_mmu_pages;
 	unsigned long n_max_mmu_pages;
 	unsigned int indirect_shadow_pages;
+	int tdp_max_page_level;
 	u8 mmu_valid_gen;
 	struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES];
 	struct list_head active_mmu_pages;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5b111a325434..801ef6c41847 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6141,6 +6141,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
 	kvm->arch.split_desc_cache.kmem_cache = pte_list_desc_cache;
 	kvm->arch.split_desc_cache.gfp_zero = __GFP_ZERO;
 
+	kvm->arch.tdp_max_page_level = KVM_MAX_HUGEPAGE_LEVEL;
 	return 0;
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 5ccf08183b00..6767bc9b7c5c 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -276,7 +276,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.nx_huge_page_workaround_enabled =
 			is_nx_huge_page_enabled(vcpu->kvm),
 
-		.max_level = KVM_MAX_HUGEPAGE_LEVEL,
+		.max_level = vcpu->kvm->arch.tdp_max_page_level,
 		.req_level = PG_LEVEL_4K,
 		.goal_level = PG_LEVEL_4K,
 	};
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 036/113] KVM: VMX: Introduce test mode related to EPT violation VE
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (34 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 035/113] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 037/113] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks isaku.yamahata
                   ` (76 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

To support TDX, KVM is enhanced to operate with #VE.  For TDX, KVM programs
the CPU to inject #VE conditionally and sets the #VE suppress bit in EPT
entries.  For the VMX case, #VE isn't used; if #VE happens for VMX, it's a
bug.  To be defensive (to test that the VMX case isn't broken), introduce
the module option ept_violation_ve_test; when it's set, intercept unexpected
#VE and report an error.
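
As a usage note, the parameter is read-only at runtime (0444), so it has to
be given when loading the module, e.g.:

	modprobe kvm_intel ept_violation_ve_test=1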

Suggested-by: Paolo Bonzini <pbonzini@redhat.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/vmx.h | 12 +++++++
 arch/x86/kvm/vmx/vmcs.h    |  5 +++
 arch/x86/kvm/vmx/vmx.c     | 69 +++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/vmx.h     |  6 +++-
 4 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index cdbf12c1a83c..752d53652007 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -68,6 +68,7 @@
 #define SECONDARY_EXEC_ENCLS_EXITING		VMCS_CONTROL_BIT(ENCLS_EXITING)
 #define SECONDARY_EXEC_RDSEED_EXITING		VMCS_CONTROL_BIT(RDSEED_EXITING)
 #define SECONDARY_EXEC_ENABLE_PML               VMCS_CONTROL_BIT(PAGE_MOD_LOGGING)
+#define SECONDARY_EXEC_EPT_VIOLATION_VE		VMCS_CONTROL_BIT(EPT_VIOLATION_VE)
 #define SECONDARY_EXEC_PT_CONCEAL_VMX		VMCS_CONTROL_BIT(PT_CONCEAL_VMX)
 #define SECONDARY_EXEC_XSAVES			VMCS_CONTROL_BIT(XSAVES)
 #define SECONDARY_EXEC_MODE_BASED_EPT_EXEC	VMCS_CONTROL_BIT(MODE_BASED_EPT_EXEC)
@@ -223,6 +224,8 @@ enum vmcs_field {
 	VMREAD_BITMAP_HIGH              = 0x00002027,
 	VMWRITE_BITMAP                  = 0x00002028,
 	VMWRITE_BITMAP_HIGH             = 0x00002029,
+	VE_INFORMATION_ADDRESS		= 0x0000202A,
+	VE_INFORMATION_ADDRESS_HIGH	= 0x0000202B,
 	XSS_EXIT_BITMAP                 = 0x0000202C,
 	XSS_EXIT_BITMAP_HIGH            = 0x0000202D,
 	ENCLS_EXITING_BITMAP		= 0x0000202E,
@@ -628,4 +631,13 @@ enum vmx_l1d_flush_state {
 
 extern enum vmx_l1d_flush_state l1tf_vmx_mitigation;
 
+struct vmx_ve_information {
+	u32 exit_reason;
+	u32 delivery;
+	u64 exit_qualification;
+	u64 guest_linear_address;
+	u64 guest_physical_address;
+	u16 eptp_index;
+};
+
 #endif
diff --git a/arch/x86/kvm/vmx/vmcs.h b/arch/x86/kvm/vmx/vmcs.h
index ac290a44a693..9277676057a7 100644
--- a/arch/x86/kvm/vmx/vmcs.h
+++ b/arch/x86/kvm/vmx/vmcs.h
@@ -140,6 +140,11 @@ static inline bool is_nm_fault(u32 intr_info)
 	return is_exception_n(intr_info, NM_VECTOR);
 }
 
+static inline bool is_ve_fault(u32 intr_info)
+{
+	return is_exception_n(intr_info, VE_VECTOR);
+}
+
 /* Undocumented: icebp/int1 */
 static inline bool is_icebp(u32 intr_info)
 {
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 5b8369e67939..0df044357e09 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -127,6 +127,9 @@ module_param(error_on_inconsistent_vmcs_config, bool, 0444);
 static bool __read_mostly dump_invalid_vmcs = 0;
 module_param(dump_invalid_vmcs, bool, 0644);
 
+static bool __read_mostly ept_violation_ve_test;
+module_param(ept_violation_ve_test, bool, 0444);
+
 #define MSR_BITMAP_MODE_X2APIC		1
 #define MSR_BITMAP_MODE_X2APIC_APICV	2
 
@@ -844,6 +847,13 @@ void vmx_update_exception_bitmap(struct kvm_vcpu *vcpu)
 
 	eb = (1u << PF_VECTOR) | (1u << UD_VECTOR) | (1u << MC_VECTOR) |
 	     (1u << DB_VECTOR) | (1u << AC_VECTOR);
+	/*
+	 * #VE isn't used for VMX, but for TDX.  To test against unexpected
+	 * change related to #VE for VMX, intercept unexpected #VE and warn on
+	 * it.
+	 */
+	if (ept_violation_ve_test)
+		eb |= 1u << VE_VECTOR;
 	/*
 	 * Guest access to VMware backdoor ports could legitimately
 	 * trigger #GP because of TSS I/O permission bitmap.
@@ -2615,6 +2625,9 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 					&_cpu_based_2nd_exec_control))
 			return -EIO;
 	}
+	if (!ept_violation_ve_test)
+		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
+
 #ifndef CONFIG_X86_64
 	if (!(_cpu_based_2nd_exec_control &
 				SECONDARY_EXEC_VIRTUALIZE_APIC_ACCESSES))
@@ -2639,6 +2652,7 @@ static int setup_vmcs_config(struct vmcs_config *vmcs_conf,
 			return -EIO;
 
 		vmx_cap->ept = 0;
+		_cpu_based_2nd_exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
 	}
 	if (!(_cpu_based_2nd_exec_control & SECONDARY_EXEC_ENABLE_VPID) &&
 	    vmx_cap->vpid) {
@@ -4599,6 +4613,7 @@ static u32 vmx_secondary_exec_control(struct vcpu_vmx *vmx)
 		exec_control &= ~SECONDARY_EXEC_ENABLE_VPID;
 	if (!enable_ept) {
 		exec_control &= ~SECONDARY_EXEC_ENABLE_EPT;
+		exec_control &= ~SECONDARY_EXEC_EPT_VIOLATION_VE;
 		enable_unrestricted_guest = 0;
 	}
 	if (!enable_unrestricted_guest)
@@ -4726,8 +4741,40 @@ static void init_vmcs(struct vcpu_vmx *vmx)
 
 	exec_controls_set(vmx, vmx_exec_control(vmx));
 
-	if (cpu_has_secondary_exec_ctrls())
+	if (cpu_has_secondary_exec_ctrls()) {
 		secondary_exec_controls_set(vmx, vmx_secondary_exec_control(vmx));
+		if (secondary_exec_controls_get(vmx) &
+		    SECONDARY_EXEC_EPT_VIOLATION_VE) {
+			if (!vmx->ve_info) {
+				/* ve_info must be page aligned. */
+				struct page *page;
+
+				BUILD_BUG_ON(sizeof(*vmx->ve_info) > PAGE_SIZE);
+				page = alloc_page(GFP_KERNEL_ACCOUNT | __GFP_ZERO);
+				if (page)
+					vmx->ve_info = page_to_virt(page);
+			}
+			if (vmx->ve_info) {
+				/*
+				 * Allow #VE delivery. CPU sets this field to
+				 * 0xFFFFFFFF on #VE delivery.  Another #VE can
+				 * occur only if software clears the field.
+				 */
+				vmx->ve_info->delivery = 0;
+				vmcs_write64(VE_INFORMATION_ADDRESS,
+					     __pa(vmx->ve_info));
+			} else {
+				/*
+				 * Because SECONDARY_EXEC_EPT_VIOLATION_VE is
+				 * used only when ept_violation_ve_test is true,
+				 * it's okay to go with the bit disabled.
+				 */
+				pr_err("Failed to allocate ve_info. disabling EPT_VIOLATION_VE.\n");
+				secondary_exec_controls_clearbit(vmx,
+								 SECONDARY_EXEC_EPT_VIOLATION_VE);
+			}
+		}
+	}
 
 	if (cpu_has_tertiary_exec_ctrls())
 		tertiary_exec_controls_set(vmx, vmx_tertiary_exec_control(vmx));
@@ -5207,6 +5254,12 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 	if (is_invalid_opcode(intr_info))
 		return handle_ud(vcpu);
 
+	/*
+	 * #VE isn't supposed to happen; treat an unexpected #VE as a KVM bug.
+	 */
+	if (KVM_BUG_ON(is_ve_fault(intr_info), vcpu->kvm))
+		return -EIO;
+
 	error_code = 0;
 	if (intr_info & INTR_INFO_DELIVER_CODE_MASK)
 		error_code = vmcs_read32(VM_EXIT_INTR_ERROR_CODE);
@@ -6395,6 +6448,18 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
 	if (secondary_exec_control & SECONDARY_EXEC_ENABLE_VPID)
 		pr_err("Virtual processor ID = 0x%04x\n",
 		       vmcs_read16(VIRTUAL_PROCESSOR_ID));
+	if (secondary_exec_control & SECONDARY_EXEC_EPT_VIOLATION_VE) {
+		struct vmx_ve_information *ve_info;
+
+		pr_err("VE info address = 0x%016llx\n",
+		       vmcs_read64(VE_INFORMATION_ADDRESS));
+		ve_info = __va(vmcs_read64(VE_INFORMATION_ADDRESS));
+		pr_err("ve_info: 0x%08x 0x%08x 0x%016llx 0x%016llx 0x%016llx 0x%04x\n",
+		       ve_info->exit_reason, ve_info->delivery,
+		       ve_info->exit_qualification,
+		       ve_info->guest_linear_address,
+		       ve_info->guest_physical_address, ve_info->eptp_index);
+	}
 }
 
 /*
@@ -7393,6 +7458,8 @@ void vmx_vcpu_free(struct kvm_vcpu *vcpu)
 	free_vpid(vmx->vpid);
 	nested_vmx_free_vcpu(vcpu);
 	free_loaded_vmcs(vmx->loaded_vmcs);
+	if (vmx->ve_info)
+		free_page((unsigned long)vmx->ve_info);
 }
 
 int vmx_vcpu_create(struct kvm_vcpu *vcpu)
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index d49d0ace9fb8..1813caeb24d8 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -359,6 +359,9 @@ struct vcpu_vmx {
 		DECLARE_BITMAP(read, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 		DECLARE_BITMAP(write, MAX_POSSIBLE_PASSTHROUGH_MSRS);
 	} shadow_msr_intercept;
+
+	/* ve_info must be page aligned. */
+	struct vmx_ve_information *ve_info;
 };
 
 struct kvm_vmx {
@@ -570,7 +573,8 @@ static inline u8 vmx_get_rvi(void)
 	 SECONDARY_EXEC_ENABLE_VMFUNC |					\
 	 SECONDARY_EXEC_BUS_LOCK_DETECTION |				\
 	 SECONDARY_EXEC_NOTIFY_VM_EXITING |				\
-	 SECONDARY_EXEC_ENCLS_EXITING)
+	 SECONDARY_EXEC_ENCLS_EXITING |					\
+	 SECONDARY_EXEC_EPT_VIOLATION_VE)
 
 #define KVM_REQUIRED_VMX_TERTIARY_VM_EXEC_CONTROL 0
 #define KVM_OPTIONAL_VMX_TERTIARY_VM_EXEC_CONTROL			\
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 037/113] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (35 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 036/113] KVM: VMX: Introduce test mode related to EPT violation VE isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 038/113] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation isaku.yamahata
                   ` (75 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of KVM TDP MMU
hooks.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index f10aff0b060e..f4aba85148e3 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -25,5 +25,5 @@ Patch Layer status
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA shared bits:              Applied
-* KVM TDP refactoring for TDX:          Applying
-* KVM TDP MMU hooks:                    Not yet
+* KVM TDP refactoring for TDX:          Applied
+* KVM TDP MMU hooks:                    Applying
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 038/113] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (36 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 037/113] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 039/113] KVM: x86/mmu: Require TDP MMU for TDX isaku.yamahata
                   ` (74 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Refactor tdp_mmu_alloc_sp() and tdp_mmu_init_sp() and eliminate
tdp_mmu_init_child_sp().  Currently tdp_mmu_init_sp() (or
tdp_mmu_init_child_sp()) sets kvm_mmu_page.role after tdp_mmu_alloc_sp()
allocates struct kvm_mmu_page and its page table page.  This patch makes
tdp_mmu_alloc_sp() initialize kvm_mmu_page.role instead of
tdp_mmu_init_sp().

To handle private page tables, an is_private argument needs to be passed
down.  Given that the page level is already passed down, it would be
cumbersome to add one more parameter about the sp.  Instead, replace the
level argument with union kvm_mmu_page_role.  Thus the number of arguments
doesn't increase and more info about the sp can be passed down.

For a private sp, a secure page table is also allocated in addition to
struct kvm_mmu_page and its page table (the spt member).  The allocation
functions (tdp_mmu_alloc_sp() and __tdp_mmu_alloc_sp_for_split()) need to
know whether the allocation is for a conventional page table or a private
page table.  Pass union kvm_mmu_page_role to those functions and initialize
the role member of struct kvm_mmu_page.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_iter.h | 12 ++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c  | 44 ++++++++++++++++---------------------
 2 files changed, 31 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index f0af385c56e0..9e56a5b1024c 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -115,4 +115,16 @@ void tdp_iter_start(struct tdp_iter *iter, struct kvm_mmu_page *root,
 void tdp_iter_next(struct tdp_iter *iter);
 void tdp_iter_restart(struct tdp_iter *iter);
 
+static inline union kvm_mmu_page_role tdp_iter_child_role(struct tdp_iter *iter)
+{
+	union kvm_mmu_page_role child_role;
+	struct kvm_mmu_page *parent_sp;
+
+	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
+
+	child_role = parent_sp->role;
+	child_role.level--;
+	return child_role;
+}
+
 #endif /* __KVM_X86_MMU_TDP_ITER_H */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index dffacb7eb15a..fdcff390ebc2 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -279,24 +279,30 @@ static struct kvm_mmu_page *tdp_mmu_next_root(struct kvm *kvm,
 		    kvm_mmu_page_as_id(_root) != _as_id) {		\
 		} else
 
-static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu,
+					     union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
 	sp = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_page_header_cache);
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
+	sp->role = role;
 
 	return sp;
 }
 
 static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
-			    gfn_t gfn, union kvm_mmu_page_role role)
+			    gfn_t gfn)
 {
 	INIT_LIST_HEAD(&sp->possible_nx_huge_page_link);
 
 	set_page_private(virt_to_page(sp->spt), (unsigned long)sp);
 
-	sp->role = role;
+	/*
+	 * role must be set before calling this function.  At least role.level
+	 * is not 0 (PG_LEVEL_NONE).
+	 */
+	WARN_ON_ONCE(!sp->role.word);
 	sp->gfn = gfn;
 	sp->ptep = sptep;
 	sp->tdp_mmu_page = true;
@@ -304,20 +310,6 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
 	trace_kvm_mmu_get_page(sp, true);
 }
 
-static void tdp_mmu_init_child_sp(struct kvm_mmu_page *child_sp,
-				  struct tdp_iter *iter)
-{
-	struct kvm_mmu_page *parent_sp;
-	union kvm_mmu_page_role role;
-
-	parent_sp = sptep_to_sp(rcu_dereference(iter->sptep));
-
-	role = parent_sp->role;
-	role.level--;
-
-	tdp_mmu_init_sp(child_sp, iter->sptep, iter->gfn, role);
-}
-
 hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 {
 	union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
@@ -336,8 +328,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 			goto out;
 	}
 
-	root = tdp_mmu_alloc_sp(vcpu);
-	tdp_mmu_init_sp(root, NULL, 0, role);
+	root = tdp_mmu_alloc_sp(vcpu, role);
+	tdp_mmu_init_sp(root, NULL, 0);
 
 	refcount_set(&root->tdp_mmu_root_count, 1);
 
@@ -1212,8 +1204,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 		 * The SPTE is either non-present or points to a huge page that
 		 * needs to be split.
 		 */
-		sp = tdp_mmu_alloc_sp(vcpu);
-		tdp_mmu_init_child_sp(sp, &iter);
+		sp = tdp_mmu_alloc_sp(vcpu, tdp_iter_child_role(&iter));
+		tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
@@ -1442,7 +1434,7 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mmu_page_role role)
 {
 	struct kvm_mmu_page *sp;
 
@@ -1452,6 +1444,7 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
 	if (!sp)
 		return NULL;
 
+	sp->role = role;
 	sp->spt = (void *)__get_free_page(gfp);
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
@@ -1465,6 +1458,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 						       struct tdp_iter *iter,
 						       bool shared)
 {
+	union kvm_mmu_page_role role = tdp_iter_child_role(iter);
 	struct kvm_mmu_page *sp;
 
 	/*
@@ -1476,7 +1470,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	 * If this allocation fails we drop the lock and retry with reclaim
 	 * allowed.
 	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT, role);
 	if (sp)
 		return sp;
 
@@ -1488,7 +1482,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT, role);
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
@@ -1583,7 +1577,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 				continue;
 		}
 
-		tdp_mmu_init_child_sp(sp, &iter);
+		tdp_mmu_init_sp(sp, iter.sptep, iter.gfn);
 
 		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
 			goto retry;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 039/113] KVM: x86/mmu: Require TDP MMU for TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (37 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 038/113] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-19 11:37   ` Huang, Kai
  2023-01-12 16:31 ` [PATCH v11 040/113] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role isaku.yamahata
                   ` (73 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Require the TDP MMU for guest TDs; the so-called "shadow" MMU does not
support mapping guest private memory, i.e. does not support Secure-EPT.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index fdcff390ebc2..6c3ce4121a46 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -27,6 +27,13 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
 	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !enable_mmio_caching)
 		return -EOPNOTSUPP;
 
+	/*
+	 * Because only the TDP MMU supports TDX, require the TDP MMU for guest
+	 * TDs.
+	 */
+	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !tdp_enabled)
+		return -EOPNOTSUPP;
+
 	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
 		return 0;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 040/113] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (38 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 039/113] KVM: x86/mmu: Require TDP MMU for TDX isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 041/113] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page isaku.yamahata
                   ` (72 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because TDX support introduces private mapping, add a new member in union
kvm_mmu_page_role with access functions to check the member.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 27 +++++++++++++++++++++++++++
 arch/x86/kvm/mmu/mmu_internal.h |  5 +++++
 arch/x86/kvm/mmu/spte.h         |  6 ++++++
 3 files changed, 38 insertions(+)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f5b51bdef0c6..1bcd118eef31 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -342,7 +342,12 @@ union kvm_mmu_page_role {
 		unsigned ad_disabled:1;
 		unsigned guest_mode:1;
 		unsigned passthrough:1;
+#ifdef CONFIG_KVM_MMU_PRIVATE
+		unsigned is_private:1;
+		unsigned :4;
+#else
 		unsigned :5;
+#endif
 
 		/*
 		 * This is left at the top of the word so that
@@ -354,6 +359,28 @@ union kvm_mmu_page_role {
 	};
 };
 
+#ifdef CONFIG_KVM_MMU_PRIVATE
+static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role)
+{
+	return !!role.is_private;
+}
+
+static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role)
+{
+	role->is_private = 1;
+}
+#else
+static inline bool kvm_mmu_page_role_is_private(union kvm_mmu_page_role role)
+{
+	return false;
+}
+
+static inline void kvm_mmu_page_role_set_private(union kvm_mmu_page_role *role)
+{
+	WARN_ON_ONCE(1);
+}
+#endif
+
 /*
  * kvm_mmu_extended_role complements kvm_mmu_page_role, tracking properties
  * relevant to the current MMU configuration.   When loading CR0, CR4, or EFER,
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6767bc9b7c5c..a20b54060bc8 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -143,6 +143,11 @@ static inline int kvm_mmu_page_as_id(struct kvm_mmu_page *sp)
 	return kvm_mmu_role_as_id(sp->role);
 }
 
+static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+{
+	return kvm_mmu_page_role_is_private(sp->role);
+}
+
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 256395eb593f..7046671b08cb 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -251,6 +251,12 @@ static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
 	return to_shadow_page(__pa(sptep));
 }
 
+static inline bool is_private_sptep(u64 *sptep)
+{
+	WARN_ON_ONCE(!sptep);
+	return is_private_sp(sptep_to_sp(sptep));
+}
+
 static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
 {
 	return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 041/113] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (39 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 040/113] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 042/113] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
                   ` (71 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

For a private GPA, the CPU refers to a private page table whose contents are
encrypted.  Dedicated APIs have to be used to operate on it (e.g. updating
or reading its PTE entries), and they are expensive.

When KVM resolves a KVM page fault, it walks the page tables.  To reuse the
existing KVM MMU code and mitigate the heavy cost of directly walking the
private page table, allocate one more page for a dummy page table that
mirrors the private one and that the KVM MMU code can walk directly.
Resolve the KVM page fault with the existing code, and do the additional
operations necessary for the private page table.  To distinguish the cases,
the existing KVM page table is called a shared page table (i.e. not
associated with a private page table), and the page table associated with a
private page table is called a private page table.  The relationship is
depicted below.

Add a private pointer to struct kvm_mmu_page for private page table and
add helper functions to allocate/initialize/free a private page table
page.

              KVM page fault                     |
                     |                           |
                     V                           |
        -------------+----------                 |
        |                      |                 |
        V                      V                 |
     shared GPA           private GPA            |
        |                      |                 |
        V                      V                 |
    shared PT root      dummy PT root            |    private PT root
        |                      |                 |           |
        V                      V                 |           V
     shared PT            dummy PT ----propagate---->   private PT
        |                      |                 |           |
        |                      \-----------------+------\    |
        |                                        |      |    |
        V                                        |      V    V
  shared guest page                              |    private guest page
                                                 |
                           non-encrypted memory  |    encrypted memory
                                                 |
PT: page table
- Shared PT is visible to KVM and it is used by CPU.
- Private PT is used by CPU but it is invisible to KVM.
- Dummy PT is visible to KVM but not used by CPU.  It is used to
  propagate PT change to the actual private PT which is used by CPU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  5 ++
 arch/x86/kvm/mmu/mmu.c          |  8 ++++
 arch/x86/kvm/mmu/mmu_internal.h | 83 +++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c      |  1 +
 4 files changed, 93 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 1bcd118eef31..d1cc1e95108e 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -814,6 +814,11 @@ struct kvm_vcpu_arch {
 	struct kvm_mmu_memory_cache mmu_shadow_page_cache;
 	struct kvm_mmu_memory_cache mmu_shadowed_info_cache;
 	struct kvm_mmu_memory_cache mmu_page_header_cache;
+	/*
+	 * This cache is to allocate private page table. E.g.  Secure-EPT used
+	 * by the TDX module.
+	 */
+	struct kvm_mmu_memory_cache mmu_private_spt_cache;
 
 	/*
 	 * QEMU userspace and the guest each have their own FPU state.
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 801ef6c41847..c943ea003b72 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -655,6 +655,13 @@ static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
 	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
 	int start, end, i, r;
 
+	if (kvm_gfn_shared_mask(vcpu->kvm)) {
+		r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_private_spt_cache,
+					       PT64_ROOT_MAX_LEVEL);
+		if (r)
+			return r;
+	}
+
 	start = kvm_mmu_memory_cache_nr_free_objects(mc);
 	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
 
@@ -704,6 +711,7 @@ static void mmu_free_memory_caches(struct kvm_vcpu *vcpu)
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_pte_list_desc_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadow_page_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_shadowed_info_cache);
+	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_private_spt_cache);
 	kvm_mmu_free_memory_cache(&vcpu->arch.mmu_page_header_cache);
 }
 
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index a20b54060bc8..6743c5868ff2 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -93,7 +93,23 @@ struct kvm_mmu_page {
 		int root_count;
 		refcount_t tdp_mmu_root_count;
 	};
-	unsigned int unsync_children;
+	union {
+		struct {
+			unsigned int unsync_children;
+			/*
+			 * Number of writes since the last time traversal
+			 * visited this page.
+			 */
+			atomic_t write_flooding_count;
+		};
+#ifdef CONFIG_KVM_MMU_PRIVATE
+		/*
+		 * Associated private shadow page table, e.g. Secure-EPT page
+		 * passed to the TDX module.
+		 */
+		void *private_spt;
+#endif
+	};
 	union {
 		struct kvm_rmap_head parent_ptes; /* rmap pointers to parent sptes */
 		tdp_ptep_t ptep;
@@ -122,9 +138,6 @@ struct kvm_mmu_page {
 	int clear_spte_count;
 #endif
 
-	/* Number of writes since the last time traversal visited this page.  */
-	atomic_t write_flooding_count;
-
 #ifdef CONFIG_X86_64
 	/* Used for freeing the page asynchronously if it is a TDP MMU page. */
 	struct rcu_head rcu_head;
@@ -148,6 +161,68 @@ static inline bool is_private_sp(const struct kvm_mmu_page *sp)
 	return kvm_mmu_page_role_is_private(sp->role);
 }
 
+#ifdef CONFIG_KVM_MMU_PRIVATE
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+	return sp->private_spt;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+	sp->private_spt = private_spt;
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+	bool is_root = vcpu->arch.root_mmu.root_role.level == sp->role.level;
+
+	KVM_BUG_ON(!kvm_mmu_page_role_is_private(sp->role), vcpu->kvm);
+	if (is_root)
+		/*
+		 * Because the TDX module assigns the root Secure-EPT page and
+		 * sets it in the Secure-EPTP when the TD vcpu is created, a
+		 * secure page table for the root isn't needed.
+		 */
+		sp->private_spt = NULL;
+	else {
+		/*
+		 * Because the TDX module doesn't trust VMM and initializes
+		 * the pages itself, KVM doesn't initialize them.  Allocate
+		 * pages with garbage and give them to the TDX module.
+		 */
+		sp->private_spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_private_spt_cache);
+		/*
+		 * Because mmu_private_spt_cache is topped up before starting kvm
+		 * page fault resolving, the allocation above shouldn't fail.
+		 */
+		WARN_ON_ONCE(!sp->private_spt);
+	}
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+	if (sp->private_spt)
+		free_page((unsigned long)sp->private_spt);
+}
+#else
+static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
+{
+	return NULL;
+}
+
+static inline void kvm_mmu_init_private_spt(struct kvm_mmu_page *sp, void *private_spt)
+{
+}
+
+static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
+{
+}
+
+static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
+{
+}
+#endif
+
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
 {
 	/*
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 6c3ce4121a46..55ebb7f2b275 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -82,6 +82,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm)
 
 static void tdp_mmu_free_sp(struct kvm_mmu_page *sp)
 {
+	kvm_mmu_free_private_spt(sp);
 	free_page((unsigned long)sp->spt);
 	kmem_cache_free(mmu_page_header_cache, sp);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 042/113] KVM: Add flags to struct kvm_gfn_range
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (40 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 041/113] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases isaku.yamahata
                   ` (70 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

For TDX, kvm_unmap_gfn_range() needs to know the reason for the callback:
mmu notifier, set memory attributes ioctl, or restrictedmem notifier.  Based
on the reason, TDX changes its behavior.  For the mmu notifier, it's an
operation on a shared memory slot, so zap the shared PTEs.  For set memory
attributes, it's a private<->shared conversion, so zap the original PTEs.
For restrictedmem, it's punching a hole in the range, so zap the
corresponding PTEs.
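
A consumer can then distinguish the reasons roughly as below (a sketch only;
the real consumer is the kvm_unmap_gfn_range() change later in this series):

	if (range->flags & KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM)
		zap_private = true;	/* PUNCH_HOLE: drop private pages */
	else if (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR)
		zap_private = !(range->attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
	else
		zap_private = false;	/* mmu notifier: shared mappings only */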

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 include/linux/kvm_host.h | 9 ++++++++-
 virt/kvm/kvm_main.c      | 5 ++++-
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index cd1f3634dd6a..0c3b9cf0a731 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -256,12 +256,19 @@ bool kvm_setup_async_pf(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu);
 #endif
 
+#define KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM	BIT(0)
+#define KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR	BIT(1)
+
 struct kvm_gfn_range {
 	struct kvm_memory_slot *slot;
 	gfn_t start;
 	gfn_t end;
-	pte_t pte;
+	union {
+		pte_t pte;
+		u64 attrs;
+	};
 	bool may_block;
+	unsigned int flags;
 };
 bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 6c61b71b56d2..aef8802b188e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -669,6 +669,7 @@ static __always_inline int __kvm_handle_hva_range(struct kvm *kvm,
 			gfn_range.start = hva_to_gfn_memslot(hva_start, slot);
 			gfn_range.end = hva_to_gfn_memslot(hva_end + PAGE_SIZE - 1, slot);
 			gfn_range.slot = slot;
+			gfn_range.flags = 0;
 
 			if (!locked) {
 				locked = true;
@@ -971,6 +972,7 @@ static void kvm_restrictedmem_invalidate_begin(struct restrictedmem_notifier *no
 	gfn_range.slot = slot;
 	gfn_range.pte = __pte(0);
 	gfn_range.may_block = true;
+	gfn_range.flags = KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM;
 
 	idx = srcu_read_lock(&kvm->srcu);
 	KVM_MMU_LOCK(kvm);
@@ -2511,8 +2513,9 @@ static void kvm_unmap_mem_range(struct kvm *kvm, gfn_t start, gfn_t end,
 	int i;
 	int r = 0;
 
-	gfn_range.pte = __pte(0);
+	gfn_range.attrs = attrs;
 	gfn_range.may_block = true;
+	gfn_range.flags = KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR;
 
 	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
 		slots = __kvm_memslots(kvm, i);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (41 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 042/113] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 044/113] KVM: x86/tdp_mmu: Make handle_changed_spte() return value isaku.yamahata
                   ` (69 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX architecturally supports only the write-back (WB) memory type for
private memory, so a (virtualized) memory type change doesn't make sense
for private memory.  Also, page migration isn't supported for TDX yet.
(TDX architecturally supports page migration; it's a KVM and kernel
implementation issue.)

Regarding memory type changes (MTRR virtualization and LAPIC page mapping
changes), pages are zapped by kvm_zap_gfn_range().  On the next KVM page
fault, the SPTE entry with the new memory type for the page is populated.
Regarding page migration, pages are zapped by the mmu notifier.  On the next
KVM page fault, the newly migrated page is populated.  Don't zap private
pages on unmapping for those two cases.

When deleting/moving a KVM memory slot, zap private pages; typically this
happens when tearing down the VM.  Don't invalidate private page tables,
i.e. zap only leaf SPTEs, for a KVM mmu that has a shared bit mask.  The
existing kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with the
read-lock of mmu_lock so that other vcpus can operate on the KVM mmu
concurrently.  It marks the root page table invalid and zaps the SPTEs of
the root page tables.  The TDX module doesn't allow unlinking a protected
root page table from the hardware and then allocating a new one for it,
i.e. replacing a protected root page table.  Instead, zap only leaf SPTEs
for a KVM mmu with a shared bit mask set.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     | 81 ++++++++++++++++++++++++++++++++++++--
 arch/x86/kvm/mmu/tdp_mmu.c | 24 ++++++++---
 arch/x86/kvm/mmu/tdp_mmu.h |  5 ++-
 3 files changed, 99 insertions(+), 11 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index c943ea003b72..5cb34a65c114 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -1604,8 +1604,28 @@ bool kvm_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 	if (kvm_memslots_have_rmaps(kvm))
 		flush = kvm_handle_gfn_range(kvm, range, kvm_zap_rmap);
 
-	if (is_tdp_mmu_enabled(kvm))
-		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush);
+	if (is_tdp_mmu_enabled(kvm)) {
+		bool zap_private;
+
+		if (range->flags & KVM_GFN_RANGE_FLAGS_RESTRICTED_MEM) {
+			WARN_ON_ONCE(!kvm_slot_can_be_private(range->slot));
+			/*
+			 * For private slot, the callback is triggered
+			 * via PUNCH_HOLE (fallocate(PUNCH_HOLE) or truncate).
+			 * private-shared conversion is done by
+			 * KVM_SET_MEMORY_ATTRIBUTES.
+			 */
+			zap_private = true;
+		} else if (range->flags & KVM_GFN_RANGE_FLAGS_SET_MEM_ATTR)
+			zap_private = !(range->attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
+		else
+			/*
+			 * For now private pages are pinned during VM's life
+			 * time.
+			 */
+			zap_private = false;
+		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, range, flush, zap_private);
+	}
 
 	return flush;
 }
@@ -6115,11 +6135,54 @@ static bool kvm_has_zapped_obsolete_pages(struct kvm *kvm)
 	return unlikely(!list_empty_careful(&kvm->arch.zapped_obsolete_pages));
 }
 
+static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
+{
+	bool flush = false;
+
+	write_lock(&kvm->mmu_lock);
+
+	/*
+	 * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't required, worst
+	 * case scenario we'll have unused shadow pages lying around until they
+	 * are recycled due to age or when the VM is destroyed.
+	 */
+	if (is_tdp_mmu_enabled(kvm)) {
+		struct kvm_gfn_range range = {
+		      .slot = slot,
+		      .start = slot->base_gfn,
+		      .end = slot->base_gfn + slot->npages,
+		      .may_block = true,
+		};
+
+		/*
+		 * This handles both private and shared gfns.  All private
+		 * pages should be zapped on memslot deletion.
+		 */
+		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush, true);
+	} else {
+		/* TDX supports only TDP-MMU case. */
+		WARN_ON_ONCE(1);
+		flush = true;
+	}
+	if (flush)
+		kvm_flush_remote_tlbs(kvm);
+
+	write_unlock(&kvm->mmu_lock);
+}
+
 static void kvm_mmu_invalidate_zap_pages_in_memslot(struct kvm *kvm,
 			struct kvm_memory_slot *slot,
 			struct kvm_page_track_notifier_node *node)
 {
-	kvm_mmu_zap_all_fast(kvm);
+	if (kvm_gfn_shared_mask(kvm))
+		/*
+		 * Secure-EPT requires to release PTs from the leaf.  The
+		 * optimization to zap root PT first with child PT doesn't
+		 * work.
+		 */
+		kvm_mmu_zap_memslot(kvm, slot);
+	else
+		kvm_mmu_zap_all_fast(kvm);
 }
 
 int kvm_mmu_init_vm(struct kvm *kvm)
@@ -6224,8 +6287,18 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t gfn_start, gfn_t gfn_end)
 
 	if (is_tdp_mmu_enabled(kvm)) {
 		for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
+			/*
+			 * zap_private = false.  Zap only shared pages.
+			 *
+			 * kvm_zap_gfn_range() is used when the MTRR or PAT
+			 * memory type is changed.  On the next kvm page fault,
+			 * the spte is repopulated with the updated memory type.
+			 * Because only WB is supported for private pages, there
+			 * is no need to zap private pages.
+			 */
 			flush = kvm_tdp_mmu_zap_leafs(kvm, i, gfn_start,
-						      gfn_end, true, flush);
+						      gfn_end, true, flush,
+						      false);
 	}
 
 	if (flush)
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 55ebb7f2b275..7ab498b80214 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -966,7 +966,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
  * operation can cause a soft lockup.
  */
 static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
-			      gfn_t start, gfn_t end, bool can_yield, bool flush)
+			      gfn_t start, gfn_t end, bool can_yield, bool flush,
+			      bool zap_private)
 {
 	struct tdp_iter iter;
 
@@ -974,6 +975,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
+	WARN_ON_ONCE(zap_private && !is_private_sp(root));
+	if (!zap_private && is_private_sp(root))
+		return false;
+
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
@@ -1006,12 +1011,13 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
  * more SPTEs were zapped since the MMU lock was last acquired.
  */
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
-			   bool can_yield, bool flush)
+			   bool can_yield, bool flush, bool zap_private)
 {
 	struct kvm_mmu_page *root;
 
 	for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
-		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
+		flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush,
+					  zap_private && is_private_sp(root));
 
 	return flush;
 }
@@ -1071,6 +1077,12 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
 	list_for_each_entry(root, &kvm->arch.tdp_mmu_roots, link) {
+		/*
+		 * Skip private root since private page table
+		 * is only torn down when VM is destroyed.
+		 */
+		if (is_private_sp(root))
+			continue;
 		if (!root->role.invalid &&
 		    !WARN_ON_ONCE(!kvm_tdp_mmu_get_root(root))) {
 			root->role.invalid = true;
@@ -1255,11 +1267,13 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return ret;
 }
 
+/* Used by mmu notifier via kvm_unmap_gfn_range() */
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
-				 bool flush)
+				 bool flush, bool zap_private)
 {
 	return kvm_tdp_mmu_zap_leafs(kvm, range->slot->as_id, range->start,
-				     range->end, range->may_block, flush);
+				     range->end, range->may_block, flush,
+				     zap_private);
 }
 
 typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index d3714200b932..e37881d922ba 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -18,7 +18,8 @@ void kvm_tdp_mmu_put_root(struct kvm *kvm, struct kvm_mmu_page *root,
 			  bool shared);
 
 bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start,
-				 gfn_t end, bool can_yield, bool flush);
+			   gfn_t end, bool can_yield, bool flush,
+			   bool zap_private);
 bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp);
 void kvm_tdp_mmu_zap_all(struct kvm *kvm);
 void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm);
@@ -27,7 +28,7 @@ void kvm_tdp_mmu_zap_invalidated_roots(struct kvm *kvm);
 int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
 
 bool kvm_tdp_mmu_unmap_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range,
-				 bool flush);
+				 bool flush, bool zap_private);
 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
 bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 044/113] KVM: x86/tdp_mmu: Make handle_changed_spte() return value
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (42 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-02-16 16:39   ` Zhi Wang
  2023-01-12 16:31 ` [PATCH v11 045/113] KVM: x86/mmu: Make make_spte() aware of shared GPA for MTRR isaku.yamahata
                   ` (68 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

A TDX operation can fail with TDX_OPERAND_BUSY when multiple vcpus try to
operate on the same TDX resource, such as the Secure EPT.  The TDX module
doesn't spin; it returns a busy error to the VMM, so the VMM has to take
action, e.g. retry.

Because the TDP MMU uses a read spin lock for scalability, a spinlock around
the SEAMCALL would defeat the TDP MMU's effort.  The other option is to let
the SEAMCALL fail and have the page fault handler retry.  Make
handle_changed_spte() and its callers return a value so that the kvm page
fault handler can bail out in such cases.  This patch makes it return only
zero.
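
As a sketch of how a later patch would consume the return value (the -EAGAIN
source and the retry policy below are assumptions based on the description
above):

	ret = tdp_mmu_set_spte_atomic(kvm, &iter, new_spte);
	if (ret == -EAGAIN) {
		/*
		 * A private-SPT callback hit TDX_OPERAND_BUSY; bail out and
		 * let the vcpu fault again instead of spinning.
		 */
		return RET_PF_RETRY;
	}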

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/tdp_mmu.c | 72 +++++++++++++++++++++++++-------------
 1 file changed, 47 insertions(+), 25 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 7ab498b80214..4fb07f91e5d6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -349,9 +349,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	return __pa(root->spt);
 }
 
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared);
+static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+					    u64 old_spte, u64 new_spte, int level,
+					    bool shared);
 
 static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
 {
@@ -445,6 +445,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
 	int level = sp->role.level;
 	gfn_t base_gfn = sp->gfn;
+	int ret;
 	int i;
 
 	trace_kvm_mmu_prepare_zap_page(sp);
@@ -516,8 +517,14 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 			old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
 							  REMOVED_SPTE, level);
 		}
-		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
-				    old_spte, REMOVED_SPTE, level, shared);
+		ret = handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
+					  old_spte, REMOVED_SPTE, level, shared);
+		/*
+		 * We are removing page tables.  In the TDX case, private page
+		 * tables are zapped only when tearing down the VM, so there is
+		 * no race condition.
+		 */
+		WARN_ON_ONCE(ret);
 	}
 
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
@@ -538,9 +545,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * Handle bookkeeping that might result from the modification of a SPTE.
  * This function must be called for all TDP SPTE modifications.
  */
-static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				  u64 old_spte, u64 new_spte, int level,
-				  bool shared)
+static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+					      u64 old_spte, u64 new_spte, int level,
+					      bool shared)
 {
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
@@ -576,7 +583,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	}
 
 	if (old_spte == new_spte)
-		return;
+		return 0;
 
 	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
 
@@ -605,7 +612,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 			       "a temporary removed SPTE.\n"
 			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
 			       as_id, gfn, old_spte, new_spte, level);
-		return;
+		return 0;
 	}
 
 	if (is_leaf != was_leaf)
@@ -624,17 +631,25 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
 	if (was_present && !was_leaf &&
 	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+
+	return 0;
 }
 
-static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-				u64 old_spte, u64 new_spte, int level,
-				bool shared)
+static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
+					    u64 old_spte, u64 new_spte, int level,
+					    bool shared)
 {
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
-			      shared);
+	int ret;
+
+	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
+				    shared);
+	if (ret)
+		return ret;
+
 	handle_changed_spte_acc_track(old_spte, new_spte, level);
 	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
 				      new_spte, level);
+	return 0;
 }
 
 /*
@@ -653,12 +668,14 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
  * * -EBUSY - If the SPTE cannot be set. In this case this function will have
  *            no side-effects other than setting iter->old_spte to the last
  *            known value of the spte.
+ * * -EAGAIN - Same as -EBUSY, but the error comes from the callbacks for private spt
  */
-static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
-					  struct tdp_iter *iter,
-					  u64 new_spte)
+static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
+						       struct tdp_iter *iter,
+						       u64 new_spte)
 {
 	u64 *sptep = rcu_dereference(iter->sptep);
+	int ret;
 
 	/*
 	 * The caller is responsible for ensuring the old SPTE is not a REMOVED
@@ -677,15 +694,16 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
 		return -EBUSY;
 
-	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-			      new_spte, iter->level, true);
-	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
+	ret = __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
+				    new_spte, iter->level, true);
+	if (!ret)
+		handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
 
-	return 0;
+	return ret;
 }
 
-static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
-					  struct tdp_iter *iter)
+static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
+						       struct tdp_iter *iter)
 {
 	int ret;
 
@@ -750,6 +768,8 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 			      u64 old_spte, u64 new_spte, gfn_t gfn, int level,
 			      bool record_acc_track, bool record_dirty_log)
 {
+	int ret;
+
 	lockdep_assert_held_write(&kvm->mmu_lock);
 
 	/*
@@ -763,7 +783,9 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
-	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+	/* The write spin lock is held, so there is no race.  It should succeed. */
+	WARN_ON_ONCE(ret);
 
 	if (record_acc_track)
 		handle_changed_spte_acc_track(old_spte, new_spte, level);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 045/113] KVM: x86/mmu: Make make_spte() aware of shared GPA for MTRR
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (43 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 044/113] KVM: x86/tdp_mmu: Make handle_changed_spte() return value isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 046/113] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
                   ` (67 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

For TDX MTRR support of shared GPAs, the get_mt_mask() method needs to know
whether the given gfn is shared or private.  Make make_spte() aware of shared
GPAs and rename its gfn parameter to gfn_including_shared to make this
explicit; a small sketch of the aliasing idea follows.
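
A minimal sketch of the aliasing idea (illustrative only: example_gfn_for_memslot()
is a made-up helper, kvm_gfn_shared_mask() is the helper used by this series):

static gfn_t example_gfn_for_memslot(struct kvm *kvm, gfn_t gfn_including_shared)
{
	/* Strip the shared alias bit before memslot lookups and the like. */
	return gfn_including_shared & ~kvm_gfn_shared_mask(kvm);
}

The full GFN, shared bit included, is still passed to get_mt_mask() so the
backend can tell whether the access went through the shared or the private
alias.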

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.c | 5 +++--
 arch/x86/kvm/mmu/spte.h | 2 +-
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index a23e9205fc42..7171df3e262a 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -136,13 +136,14 @@ bool spte_has_volatile_bits(u64 spte)
 
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
-	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+	       unsigned int pte_access, gfn_t gfn_including_shared, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte)
 {
 	int level = sp->role.level;
 	u64 spte = SPTE_MMU_PRESENT_MASK;
 	bool wrprot = false;
+	gfn_t gfn = gfn_including_shared & ~kvm_gfn_shared_mask(vcpu->kvm);
 
 	WARN_ON_ONCE(!pte_access && !shadow_present_mask);
 
@@ -190,7 +191,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 		spte |= PT_PAGE_SIZE_MASK;
 
 	if (shadow_memtype_mask)
-		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
+		spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn_including_shared,
 							 kvm_is_mmio_pfn(pfn));
 	if (host_writable)
 		spte |= shadow_host_writable_mask;
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 7046671b08cb..067ea1ae3a13 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -481,7 +481,7 @@ bool spte_has_volatile_bits(u64 spte);
 
 bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
 	       const struct kvm_memory_slot *slot,
-	       unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
+	       unsigned int pte_access, gfn_t gfn_including_shared, kvm_pfn_t pfn,
 	       u64 old_spte, bool prefetch, bool can_unsync,
 	       bool host_writable, u64 *new_spte);
 u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 046/113] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (44 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 045/113] KVM: x86/mmu: Make make_spte() aware of shared GPA for MTRR isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 047/113] [MARKER] The start of TDX KVM patch series: TDX EPT violation isaku.yamahata
                   ` (66 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Kai Huang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Allocate a protected page table for each private page table, and add hooks
to operate on the protected page table.  This patch adds allocation/free of
protected page tables and the hooks.  When calling the hooks to update an
SPTE entry, freeze the entry, call the hooks, and unfreeze the entry to allow
concurrent updates on the page tables, which is the advantage of the TDP MMU.
As kvm_gfn_shared_mask() always returns false, those hooks aren't called yet
with this patch.

When the faulting GPA is private, the KVM page fault is called a private
fault.  When resolving a private KVM fault, allocate a protected page table
and call the hooks to operate on it.  On a change of a private PTE entry,
invoke the kvm_x86_ops hook in __handle_changed_spte() to propagate the
change to the protected page table.  The following depicts the relationship.

  private KVM page fault   |
      |                    |
      V                    |
 private GPA               |     CPU protected EPTP
      |                    |           |
      V                    |           V
 private PT root           |     protected PT root
      |                    |           |
      V                    |           V
   private PT --hook to propagate-->protected PT
      |                    |           |
      \--------------------+------\    |
                           |      |    |
                           |      V    V
                           |    private guest page
                           |
                           |
     non-encrypted memory  |    encrypted memory
                           |
PT: page table

The existing KVM TDP MMU code uses atomic updates of the SPTE.  When
populating an EPT entry, it atomically sets the entry.  Zapping an SPTE,
however, requires a TLB shootdown.  To address this, the entry is frozen with
a special SPTE value that clears the present bit.  After the TLB shootdown,
the entry is set to its eventual value (unfreeze).

For the protected page table, hooks are called to update the protected page
table in addition to the direct access to the private SPTE.  For the zapping
case, freezing the SPTE works: hooks can be called in addition to the TLB
shootdown.  For populating the private SPTE entry, there can be a race
condition without further protection:

  vcpu 1: populating 2M private SPTE
  vcpu 2: populating 4K private SPTE
  vcpu 2: TDX SEAMCALL to update 4K protected SPTE => error
  vcpu 1: TDX SEAMCALL to update 2M protected SPTE

To avoid the race, the frozen SPTE is utilized.  Instead of an atomic update
of the private entry, freeze the entry, call the hook that updates the
protected SPTE, and then set the entry to the final value, as sketched below.
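
A minimal sketch of that freeze/unfreeze flow (simplified; everything except
REMOVED_SPTE and try_cmpxchg64() is illustrative, the real logic lives in
tdp_mmu_set_spte_atomic() in this patch):

static int example_set_private_spte(struct kvm *kvm, u64 *sptep, u64 old_spte,
				    u64 new_spte,
				    int (*propagate)(struct kvm *kvm, u64 spte))
{
	int ret;

	/* 1. Freeze: take ownership of the entry so concurrent updaters back off. */
	if (!try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE))
		return -EBUSY;

	/* 2. Propagate the change to the protected (Secure) EPT via the hook. */
	ret = propagate(kvm, new_spte);
	if (ret) {
		WRITE_ONCE(*sptep, old_spte);	/* roll back on failure */
		return ret;
	}

	/* 3. Unfreeze: publish the final value. */
	WRITE_ONCE(*sptep, new_spte);
	return 0;
}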

Only 4K pages are supported at this stage.  2M page support can be done in
future patches.

Co-developed-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Kai Huang <kai.huang@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |   5 +
 arch/x86/include/asm/kvm_host.h    |  11 ++
 arch/x86/kvm/mmu/mmu.c             |  15 +-
 arch/x86/kvm/mmu/mmu_internal.h    |  18 +++
 arch/x86/kvm/mmu/tdp_iter.h        |   2 +-
 arch/x86/kvm/mmu/tdp_mmu.c         | 243 +++++++++++++++++++++++++----
 arch/x86/kvm/mmu/tdp_mmu.h         |   2 +-
 virt/kvm/kvm_main.c                |   1 +
 8 files changed, 263 insertions(+), 34 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index e3e9b1c2599b..99ac85c3c8aa 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -95,6 +95,11 @@ KVM_X86_OP_OPTIONAL_RET0(set_tss_addr)
 KVM_X86_OP_OPTIONAL_RET0(set_identity_map_addr)
 KVM_X86_OP_OPTIONAL_RET0(get_mt_mask)
 KVM_X86_OP(load_mmu_pgd)
+KVM_X86_OP_OPTIONAL(link_private_spt)
+KVM_X86_OP_OPTIONAL(free_private_spt)
+KVM_X86_OP_OPTIONAL(set_private_spte)
+KVM_X86_OP_OPTIONAL(remove_private_spte)
+KVM_X86_OP_OPTIONAL(zap_private_spte)
 KVM_X86_OP(has_wbinvd_exit)
 KVM_X86_OP(get_l2_tsc_offset)
 KVM_X86_OP(get_l2_tsc_multiplier)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index d1cc1e95108e..487ff9f4fe1a 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -474,6 +474,7 @@ struct kvm_mmu {
 			 struct kvm_mmu_page *sp);
 	void (*invlpg)(struct kvm_vcpu *vcpu, gva_t gva, hpa_t root_hpa);
 	struct kvm_mmu_root_info root;
+	hpa_t private_root_hpa;
 	union kvm_cpu_role cpu_role;
 	union kvm_mmu_page_role root_role;
 
@@ -1679,6 +1680,16 @@ struct kvm_x86_ops {
 	void (*load_mmu_pgd)(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			     int root_level);
 
+	int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				void *private_spt);
+	int (*free_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				void *private_spt);
+	int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				 kvm_pfn_t pfn);
+	int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+				    kvm_pfn_t pfn);
+	int (*zap_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level);
+
 	bool (*has_wbinvd_exit)(void);
 
 	u64 (*get_l2_tsc_offset)(struct kvm_vcpu *vcpu);
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 5cb34a65c114..484e615196aa 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3674,7 +3674,12 @@ static int mmu_alloc_direct_roots(struct kvm_vcpu *vcpu)
 		goto out_unlock;
 
 	if (is_tdp_mmu_enabled(vcpu->kvm)) {
-		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu);
+		if (kvm_gfn_shared_mask(vcpu->kvm) &&
+		    !VALID_PAGE(mmu->private_root_hpa)) {
+			root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, true);
+			mmu->private_root_hpa = root;
+		}
+		root = kvm_tdp_mmu_get_vcpu_root_hpa(vcpu, false);
 		mmu->root.hpa = root;
 	} else if (shadow_root_level >= PT64_ROOT_4LEVEL) {
 		root = mmu_alloc_root(vcpu, 0, 0, shadow_root_level);
@@ -4396,7 +4401,7 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	unsigned long mmu_seq;
 	int r;
 
-	fault->gfn = fault->addr >> PAGE_SHIFT;
+	fault->gfn = gpa_to_gfn(fault->addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
 	fault->slot = kvm_vcpu_gfn_to_memslot(vcpu, fault->gfn);
 
 	if (page_fault_handle_page_track(vcpu, fault))
@@ -5932,6 +5937,7 @@ static int __kvm_mmu_create(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu)
 
 	mmu->root.hpa = INVALID_PAGE;
 	mmu->root.pgd = 0;
+	mmu->private_root_hpa = INVALID_PAGE;
 	for (i = 0; i < KVM_MMU_NUM_PREV_ROOTS; i++)
 		mmu->prev_roots[i] = KVM_MMU_ROOT_INFO_INVALID;
 
@@ -6155,7 +6161,7 @@ static void kvm_mmu_zap_memslot(struct kvm *kvm, struct kvm_memory_slot *slot)
 		};
 
 		/*
-		 * this handles both private gfn and shared gfn.
+		 * This handles both private gfn and shared gfn.
 		 * All private page should be zapped on memslot deletion.
 		 */
 		flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, flush, true);
@@ -6966,6 +6972,9 @@ int kvm_mmu_vendor_module_init(void)
 void kvm_mmu_destroy(struct kvm_vcpu *vcpu)
 {
 	kvm_mmu_unload(vcpu);
+	if (is_tdp_mmu_enabled(vcpu->kvm))
+		mmu_free_root_page(vcpu->kvm, &vcpu->arch.mmu->private_root_hpa,
+				NULL);
 	free_mmu_pages(&vcpu->arch.root_mmu);
 	free_mmu_pages(&vcpu->arch.guest_mmu);
 	mmu_free_memory_caches(vcpu);
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index 6743c5868ff2..0ed802dc8627 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -6,6 +6,8 @@
 #include <linux/kvm_host.h>
 #include <asm/kvm_host.h>
 
+#include "mmu.h"
+
 #undef MMU_DEBUG
 
 #ifdef MMU_DEBUG
@@ -204,6 +206,15 @@ static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
 	if (sp->private_spt)
 		free_page((unsigned long)sp->private_spt);
 }
+
+static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
+				     gfn_t gfn)
+{
+	if (is_private_sp(root))
+		return kvm_gfn_private(kvm, gfn);
+	else
+		return kvm_gfn_shared(kvm, gfn);
+}
 #else
 static inline void *kvm_mmu_private_spt(struct kvm_mmu_page *sp)
 {
@@ -221,6 +232,12 @@ static inline void kvm_mmu_alloc_private_spt(struct kvm_vcpu *vcpu, struct kvm_m
 static inline void kvm_mmu_free_private_spt(struct kvm_mmu_page *sp)
 {
 }
+
+static inline gfn_t kvm_gfn_for_root(struct kvm *kvm, struct kvm_mmu_page *root,
+				     gfn_t gfn)
+{
+	return gfn;
+}
 #endif
 
 static inline bool kvm_mmu_page_ad_need_write_protect(struct kvm_mmu_page *sp)
@@ -355,6 +372,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
 		.is_tdp = likely(vcpu->arch.mmu->page_fault == kvm_tdp_page_fault),
 		.nx_huge_page_workaround_enabled =
 			is_nx_huge_page_enabled(vcpu->kvm),
+		.is_private = kvm_is_private_gpa(vcpu->kvm, cr2_or_gpa),
 
 		.max_level = vcpu->kvm->arch.tdp_max_page_level,
 		.req_level = PG_LEVEL_4K,
diff --git a/arch/x86/kvm/mmu/tdp_iter.h b/arch/x86/kvm/mmu/tdp_iter.h
index 9e56a5b1024c..eab62baf8549 100644
--- a/arch/x86/kvm/mmu/tdp_iter.h
+++ b/arch/x86/kvm/mmu/tdp_iter.h
@@ -71,7 +71,7 @@ struct tdp_iter {
 	tdp_ptep_t pt_path[PT64_ROOT_MAX_LEVEL];
 	/* A pointer to the current SPTE */
 	tdp_ptep_t sptep;
-	/* The lowest GFN mapped by the current SPTE */
+	/* The lowest GFN (shared bits included) mapped by the current SPTE */
 	gfn_t gfn;
 	/* The level of the root page given to the iterator */
 	int root_level;
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4fb07f91e5d6..5ce0328c71df 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -296,6 +296,9 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp(struct kvm_vcpu *vcpu,
 	sp->spt = kvm_mmu_memory_cache_alloc(&vcpu->arch.mmu_shadow_page_cache);
 	sp->role = role;
 
+	if (kvm_mmu_page_role_is_private(role))
+		kvm_mmu_alloc_private_spt(vcpu, sp);
+
 	return sp;
 }
 
@@ -318,7 +321,8 @@ static void tdp_mmu_init_sp(struct kvm_mmu_page *sp, tdp_ptep_t sptep,
 	trace_kvm_mmu_get_page(sp, true);
 }
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
+static struct kvm_mmu_page *kvm_tdp_mmu_get_vcpu_root(struct kvm_vcpu *vcpu,
+						      bool private)
 {
 	union kvm_mmu_page_role role = vcpu->arch.mmu->root_role;
 	struct kvm *kvm = vcpu->kvm;
@@ -330,6 +334,8 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	 * Check for an existing root before allocating a new one.  Note, the
 	 * role check prevents consuming an invalid root.
 	 */
+	if (private)
+		kvm_mmu_page_role_set_private(&role);
 	for_each_tdp_mmu_root(kvm, root, kvm_mmu_role_as_id(role)) {
 		if (root->role.word == role.word &&
 		    kvm_tdp_mmu_get_root(root))
@@ -346,11 +352,17 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
 	spin_unlock(&kvm->arch.tdp_mmu_pages_lock);
 
 out:
-	return __pa(root->spt);
+	return root;
+}
+
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private)
+{
+	return __pa(kvm_tdp_mmu_get_vcpu_root(vcpu, private)->spt);
 }
 
 static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-					    u64 old_spte, u64 new_spte, int level,
+					    u64 old_spte, u64 new_spte,
+					    union kvm_mmu_page_role role,
 					    bool shared);
 
 static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
@@ -377,6 +389,8 @@ static void handle_changed_spte_dirty_log(struct kvm *kvm, int as_id, gfn_t gfn,
 
 	if ((!is_writable_pte(old_spte) || pfn_changed) &&
 	    is_writable_pte(new_spte)) {
+		/* For memory slot operations, use GFN without aliasing */
+		gfn = gfn & ~kvm_gfn_shared_mask(kvm);
 		slot = __gfn_to_memslot(__kvm_memslots(kvm, as_id), gfn);
 		mark_page_dirty_in_slot(kvm, slot, gfn);
 	}
@@ -518,7 +532,8 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
 							  REMOVED_SPTE, level);
 		}
 		ret = handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
-					  old_spte, REMOVED_SPTE, level, shared);
+					  old_spte, REMOVED_SPTE, sp->role,
+					  shared);
 		/*
 		 * We are removing page tables.  In the TDX case, private page
 		 * tables are zapped only when tearing down the VM, so there is
 		 * no race condition.
 		WARN_ON_ONCE(ret);
 	}
 
+	if (is_private_sp(sp) &&
+	    WARN_ON(static_call(kvm_x86_free_private_spt)(kvm, sp->gfn, sp->role.level,
+							  kvm_mmu_private_spt(sp)))) {
+		/*
+		 * Failed to unlink the Secure EPT page and there is nothing
+		 * further to do.  Intentionally leak the page to prevent the
+		 * kernel from accessing the encrypted page.
+		 */
+		kvm_mmu_init_private_spt(sp, NULL);
+	}
+
 	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
 }
 
+static void *get_private_spt(gfn_t gfn, u64 new_spte, int level)
+{
+	if (is_shadow_present_pte(new_spte) && !is_last_spte(new_spte, level)) {
+		struct kvm_mmu_page *sp = to_shadow_page(pfn_to_hpa(spte_to_pfn(new_spte)));
+		void *private_spt = kvm_mmu_private_spt(sp);
+
+		WARN_ON_ONCE(!private_spt);
+		WARN_ON_ONCE(sp->role.level + 1 != level);
+		WARN_ON_ONCE(sp->gfn != gfn);
+		return private_spt;
+	}
+
+	return NULL;
+}
+
+static int __must_check handle_changed_private_spte(struct kvm *kvm, gfn_t gfn,
+						    u64 old_spte, u64 new_spte,
+						    int level)
+{
+	bool was_present = is_shadow_present_pte(old_spte);
+	bool is_present = is_shadow_present_pte(new_spte);
+	bool was_leaf = was_present && is_last_spte(old_spte, level);
+	bool is_leaf = is_present && is_last_spte(new_spte, level);
+	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	int ret = 0;
+
+	lockdep_assert_held(&kvm->mmu_lock);
+	if (is_present) {
+		/* TDP MMU doesn't change present -> present */
+		KVM_BUG_ON(was_present, kvm);
+
+		/*
+		 * Use a different call to set up either a middle-level
+		 * private page table or a leaf.
+		 */
+		if (is_leaf)
+			ret = static_call(kvm_x86_set_private_spte)(kvm, gfn, level, new_pfn);
+		else {
+			void *private_spt = get_private_spt(gfn, new_spte, level);
+
+			KVM_BUG_ON(!private_spt, kvm);
+			ret = static_call(kvm_x86_link_private_spt)(kvm, gfn, level, private_spt);
+		}
+	} else if (was_leaf) {
+		/* non-present -> non-present doesn't make sense. */
+		KVM_BUG_ON(!was_present, kvm);
+		/*
+		 * Zap the private leaf SPTE.  Zapping the private table itself
+		 * is done below in handle_removed_pt().
+		 */
+		lockdep_assert_held_write(&kvm->mmu_lock);
+		ret = static_call(kvm_x86_zap_private_spte)(kvm, gfn, level);
+		if (!ret) {
+			ret = static_call(kvm_x86_remove_private_spte)(kvm, gfn, level, old_pfn);
+			WARN_ON_ONCE(ret);
+		}
+	}
+	return ret;
+}
+
 /**
  * __handle_changed_spte - handle bookkeeping associated with an SPTE change
  * @kvm: kvm instance
@@ -537,7 +624,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * @gfn: the base GFN that was mapped by the SPTE
  * @old_spte: The value of the SPTE before the change
  * @new_spte: The value of the SPTE after the change
- * @level: the level of the PT the SPTE is part of in the paging structure
+ * @role: the role of the PT the SPTE is part of in the paging structure
  * @shared: This operation may not be running under the exclusive use of
  *	    the MMU lock and the operation must synchronize with other
  *	    threads that might be modifying SPTEs.
@@ -546,14 +633,18 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
  * This function must be called for all TDP SPTE modifications.
  */
 static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-					      u64 old_spte, u64 new_spte, int level,
-					      bool shared)
+					      u64 old_spte, u64 new_spte,
+					      union kvm_mmu_page_role role, bool shared)
 {
+	bool is_private = kvm_mmu_page_role_is_private(role);
+	int level = role.level;
 	bool was_present = is_shadow_present_pte(old_spte);
 	bool is_present = is_shadow_present_pte(new_spte);
 	bool was_leaf = was_present && is_last_spte(old_spte, level);
 	bool is_leaf = is_present && is_last_spte(new_spte, level);
-	bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+	kvm_pfn_t old_pfn = spte_to_pfn(old_spte);
+	kvm_pfn_t new_pfn = spte_to_pfn(new_spte);
+	bool pfn_changed = old_pfn != new_pfn;
 
 	WARN_ON(level > PT64_ROOT_MAX_LEVEL);
 	WARN_ON(level < PG_LEVEL_4K);
@@ -620,7 +711,7 @@ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t
 
 	if (was_leaf && is_dirty_spte(old_spte) &&
 	    (!is_present || !is_dirty_spte(new_spte) || pfn_changed))
-		kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+		kvm_set_pfn_dirty(old_pfn);
 
 	/*
 	 * Recursively handle child PTs if the change removed a subtree from
@@ -629,26 +720,42 @@ static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t
 	 * pages are kernel allocations and should never be migrated.
 	 */
 	if (was_present && !was_leaf &&
-	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
+	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed))) {
+		KVM_BUG_ON(is_private != is_private_sptep(spte_to_child_pt(old_spte, level)),
+			   kvm);
 		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
+	}
 
+	/*
+	 * Special handling for the private mapping.  We are either setting up
+	 * a new mapping at a middle-level page table or a leaf, or tearing
+	 * down an existing mapping.
+	 *
+	 * This runs after the lower page table has been handled by
+	 * handle_removed_pt() above.  Secure-EPT requires removing Secure-EPT
+	 * tables after removing their children.
+	 */
+	if (is_private &&
+	    /* Ignore change of software only bits. e.g. host_writable */
+	    (was_leaf != is_leaf || was_present != is_present || pfn_changed))
+		return handle_changed_private_spte(kvm, gfn, old_spte, new_spte, role.level);
 	return 0;
 }
 
 static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
-					    u64 old_spte, u64 new_spte, int level,
+					    u64 old_spte, u64 new_spte,
+					    union kvm_mmu_page_role role,
 					    bool shared)
 {
 	int ret;
 
-	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
-				    shared);
+	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, shared);
 	if (ret)
 		return ret;
 
-	handle_changed_spte_acc_track(old_spte, new_spte, level);
+	handle_changed_spte_acc_track(old_spte, new_spte, role.level);
 	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
-				      new_spte, level);
+				      new_spte, role.level);
 	return 0;
 }
 
@@ -674,6 +781,24 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 						       struct tdp_iter *iter,
 						       u64 new_spte)
 {
+	/*
+	 * For a conventional page table, the update flow is
+	 * - update the SPTE with an atomic operation
+	 * - handle the changed SPTE: __handle_changed_spte()
+	 * NOTE: __handle_changed_spte() (and the functions it calls) must be
+	 * safe against concurrent updates.  Zapping an SPTE is the exception;
+	 * see tdp_mmu_zap_spte_atomic().
+	 *
+	 * For a private page table, callbacks are needed to propagate the SPTE
+	 * change into the protected page table.  In order to atomically update
+	 * both the SPTE and the protected page table with the callbacks,
+	 * utilize a frozen SPTE.
+	 * - Freeze the SPTE. Set entry to REMOVED_SPTE.
+	 * - Trigger callbacks for protected page tables. __handle_changed_spte()
+	 * - Unfreeze the SPTE.  Set the entry to new_spte.
+	 */
+	bool freeze_spte = is_private_sptep(iter->sptep) && !is_removed_spte(new_spte);
+	u64 tmp_spte = freeze_spte ? REMOVED_SPTE : new_spte;
 	u64 *sptep = rcu_dereference(iter->sptep);
 	int ret;
 
@@ -691,14 +816,24 @@ static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
 	 * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
 	 * does not hold the mmu_lock.
 	 */
-	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
+	if (!try_cmpxchg64(sptep, &iter->old_spte, tmp_spte))
 		return -EBUSY;
 
 	ret = __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
-				    new_spte, iter->level, true);
+				    new_spte, sptep_to_sp(sptep)->role, true);
 	if (!ret)
 		handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
 
+	if (ret) {
+		/*
+		 * !freeze_spte means this fault isn't private, so no operation
+		 * on the Secure EPT was attempted.
+		 */
+		WARN_ON_ONCE(!freeze_spte);
+		__kvm_tdp_mmu_write_spte(sptep, iter->old_spte);
+	} else if (freeze_spte)
+		__kvm_tdp_mmu_write_spte(sptep, new_spte);
+
 	return ret;
 }
 
@@ -768,6 +903,7 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 			      u64 old_spte, u64 new_spte, gfn_t gfn, int level,
 			      bool record_acc_track, bool record_dirty_log)
 {
+	union kvm_mmu_page_role role;
 	int ret;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
@@ -783,7 +919,9 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
 
 	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
 
-	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
+	role = sptep_to_sp(sptep)->role;
+	role.level = level;
+	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, role, false);
 	/* The write spin lock is held, so there is no race.  It should succeed. */
 	WARN_ON_ONCE(ret);
 
@@ -837,8 +975,11 @@ static inline void tdp_mmu_set_spte_no_dirty_log(struct kvm *kvm,
 			continue;					\
 		else
 
-#define tdp_mmu_for_each_pte(_iter, _mmu, _start, _end)		\
-	for_each_tdp_pte(_iter, to_shadow_page(_mmu->root.hpa), _start, _end)
+#define tdp_mmu_for_each_pte(_iter, _mmu, _private, _start, _end)	\
+	for_each_tdp_pte(_iter,						\
+		 to_shadow_page((_private) ? _mmu->private_root_hpa :	\
+				_mmu->root.hpa),			\
+		_start, _end)
 
 /*
  * Yield if the MMU lock is contended or this thread needs to return control
@@ -1001,6 +1142,14 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
 	if (!zap_private && is_private_sp(root))
 		return false;
 
+	/*
+	 * start and end don't have the GFN shared bit set.  This function zaps
+	 * a region including its alias.  Adjust the shared bit of [start, end)
+	 * if the root is shared.
+	 */
+	start = kvm_gfn_for_root(kvm, root, start);
+	end = kvm_gfn_for_root(kvm, root, end);
+
 	rcu_read_lock();
 
 	for_each_tdp_pte_min_level(iter, root, PG_LEVEL_4K, start, end) {
@@ -1131,10 +1280,19 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 
 	if (unlikely(!fault->slot))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
-	else
-		wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
-					 fault->pfn, iter->old_spte, fault->prefetch, true,
-					 fault->map_writable, &new_spte);
+	else {
+		unsigned long pte_access = ACC_ALL;
+
+		/* TDX shared GPAs are not executable, enforce this for the SDV. */
+		if (kvm_gfn_shared_mask(vcpu->kvm) && !fault->is_private)
+			pte_access &= ~ACC_EXEC_MASK;
+
+		wrprot = make_spte(vcpu, sp, fault->slot, pte_access,
+				   gpa_to_gfn(fault->addr)/* include shared bit */,
+				   fault->pfn, iter->old_spte,
+				   fault->prefetch, true, fault->map_writable,
+				   &new_spte);
+	}
 
 	if (new_spte == iter->old_spte)
 		ret = RET_PF_SPURIOUS;
@@ -1213,6 +1371,8 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	struct kvm *kvm = vcpu->kvm;
 	struct tdp_iter iter;
 	struct kvm_mmu_page *sp;
+	gfn_t raw_gfn;
+	bool is_private = fault->is_private;
 	int ret = RET_PF_RETRY;
 
 	kvm_mmu_hugepage_adjust(vcpu, fault);
@@ -1221,7 +1381,17 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 	rcu_read_lock();
 
-	tdp_mmu_for_each_pte(iter, mmu, fault->gfn, fault->gfn + 1) {
+	raw_gfn = gpa_to_gfn(fault->addr);
+
+	if (is_error_noslot_pfn(fault->pfn) ||
+	    !kvm_pfn_to_refcounted_page(fault->pfn)) {
+		if (is_private) {
+			rcu_read_unlock();
+			return -EFAULT;
+		}
+	}
+
+	tdp_mmu_for_each_pte(iter, mmu, is_private, raw_gfn, raw_gfn + 1) {
 		int r;
 
 		if (fault->nx_huge_page_workaround_enabled)
@@ -1251,9 +1421,14 @@ int kvm_tdp_mmu_map(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 
 		sp->nx_huge_page_disallowed = fault->huge_page_disallowed;
 
-		if (is_shadow_present_pte(iter.old_spte))
+		if (is_shadow_present_pte(iter.old_spte)) {
+			/*
+			 * TODO: large page support.
+			 * Doesn't support large page for TDX now
+			 */
+			KVM_BUG_ON(is_private_sptep(iter.sptep), vcpu->kvm);
 			r = tdp_mmu_split_huge_page(kvm, &iter, sp, true);
-		else
+		} else
 			r = tdp_mmu_link_sp(kvm, &iter, sp, true);
 
 		/*
@@ -1490,6 +1665,8 @@ static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp, union kvm_mm
 
 	sp->role = role;
 	sp->spt = (void *)__get_free_page(gfp);
+	/* TODO: large page support for private GPA. */
+	WARN_ON_ONCE(kvm_mmu_page_role_is_private(role));
 	if (!sp->spt) {
 		kmem_cache_free(mmu_page_header_cache, sp);
 		return NULL;
@@ -1505,6 +1682,11 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 	union kvm_mmu_page_role role = tdp_iter_child_role(iter);
 	struct kvm_mmu_page *sp;
 
+	KVM_BUG_ON(kvm_mmu_page_role_is_private(role) !=
+		   is_private_sptep(iter->sptep), kvm);
+	/* TODO: Large page isn't supported for private SPTE yet. */
+	KVM_BUG_ON(kvm_mmu_page_role_is_private(role), kvm);
+
 	/*
 	 * Since we are allocating while under the MMU lock we have to be
 	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
@@ -1933,7 +2115,7 @@ int kvm_tdp_mmu_get_walk(struct kvm_vcpu *vcpu, u64 addr, u64 *sptes,
 
 	*root_level = vcpu->arch.mmu->root_role.level;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		leaf = iter.level;
 		sptes[leaf] = iter.old_spte;
 	}
@@ -1960,7 +2142,10 @@ u64 *kvm_tdp_mmu_fast_pf_get_last_sptep(struct kvm_vcpu *vcpu, u64 addr,
 	gfn_t gfn = addr >> PAGE_SHIFT;
 	tdp_ptep_t sptep = NULL;
 
-	tdp_mmu_for_each_pte(iter, mmu, gfn, gfn + 1) {
+	/* fast page fault for private GPA isn't supported. */
+	WARN_ON_ONCE(kvm_is_private_gpa(vcpu->kvm, addr));
+
+	tdp_mmu_for_each_pte(iter, mmu, false, gfn, gfn + 1) {
 		*spte = iter.old_spte;
 		sptep = iter.sptep;
 	}
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index e37881d922ba..4ea676ea25fd 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -7,7 +7,7 @@
 
 #include "spte.h"
 
-hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu);
+hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu, bool private);
 
 __must_check static inline bool kvm_tdp_mmu_get_root(struct kvm_mmu_page *root)
 {
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index aef8802b188e..db04c57acf77 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -205,6 +205,7 @@ struct page *kvm_pfn_to_refcounted_page(kvm_pfn_t pfn)
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_pfn_to_refcounted_page);
 
 /*
  * Switches to specified vcpu, until a matching vcpu_put()
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 047/113] [MARKER] The start of TDX KVM patch series: TDX EPT violation
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (45 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 046/113] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 048/113] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
                   ` (65 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TDX EPT
violation.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index f4aba85148e3..9b3ab0363184 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -19,11 +19,11 @@ Patch Layer status
 * TDX architectural definitions:        Applied
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
-* TDX EPT violation:                    Not yet
+* TDX EPT violation:                    Applying
 * TD finalization:                      Not yet
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA shared bits:              Applied
 * KVM TDP refactoring for TDX:          Applied
-* KVM TDP MMU hooks:                    Applying
+* KVM TDP MMU hooks:                    Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 048/113] KVM: x86/mmu: Disallow dirty logging for x86 TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (46 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 047/113] [MARKER] The start of TDX KVM patch series: TDX EPT violation isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 049/113] KVM: x86/mmu: TDX: Do not enable page track for TD guest isaku.yamahata
                   ` (64 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX doesn't support dirty logging.  Report that dirty logging isn't supported
so that the device model, for example qemu, can handle it properly.
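
From userspace the visible effect is that enabling dirty logging on a memslot
of a TD fails.  A hedged userspace-side sketch (not part of the patch; it only
assumes the existing KVM_SET_USER_MEMORY_REGION ABI, which rejects unsupported
flags with -EINVAL):

#include <errno.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

static int example_enable_dirty_log(int vm_fd,
				    struct kvm_userspace_memory_region *region)
{
	region->flags |= KVM_MEM_LOG_DIRTY_PAGES;
	if (ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, region) < 0) {
		if (errno == EINVAL)
			return 0;	/* e.g. a TD: run without dirty logging */
		return -errno;
	}
	return 1;			/* dirty logging enabled */
}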

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/x86.c       |  5 +++++
 include/linux/kvm_host.h |  1 +
 virt/kvm/kvm_main.c      | 10 +++++++++-
 3 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a8b555935fd8..5b4d5f8128a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13498,6 +13498,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
 }
 EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
 
+bool kvm_arch_dirty_log_supported(struct kvm *kvm)
+{
+	return kvm->arch.vm_type != KVM_X86_TDX_VM;
+}
+
 bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
 	return kvm->arch.vm_type == KVM_X86_TDX_VM;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 0c3b9cf0a731..d6e4da96130f 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1476,6 +1476,7 @@ int kvm_arch_post_init_vm(struct kvm *kvm);
 void kvm_arch_pre_destroy_vm(struct kvm *kvm);
 int kvm_arch_create_vm_debugfs(struct kvm *kvm);
 bool kvm_arch_has_private_mem(struct kvm *kvm);
+bool kvm_arch_dirty_log_supported(struct kvm *kvm);
 
 #ifndef __KVM_HAVE_ARCH_VM_ALLOC
 /*
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index db04c57acf77..251bb7c59c88 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1664,10 +1664,18 @@ static void kvm_replace_memslot(struct kvm *kvm,
 	}
 }
 
+bool __weak kvm_arch_dirty_log_supported(struct kvm *kvm)
+{
+	return true;
+}
+
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_user_mem_region *mem)
 {
-	u32 valid_flags = KVM_MEM_LOG_DIRTY_PAGES;
+	u32 valid_flags = 0;
+
+	if (kvm_arch_dirty_log_supported(kvm))
+		valid_flags |= KVM_MEM_LOG_DIRTY_PAGES;
 
 	if (kvm_arch_has_private_mem(kvm))
 		valid_flags |= KVM_MEM_PRIVATE;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 049/113] KVM: x86/mmu: TDX: Do not enable page track for TD guest
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (47 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 048/113] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:31 ` [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
                   ` (63 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Yan Zhao,
	Yuan Yao

From: Yan Zhao <yan.y.zhao@intel.com>

TDX does not support write protection and hence page tracking.
Though !tdp_enabled and kvm_shadow_root_allocated(kvm) are always false
for a TD guest, the function should also return false when external write
tracking is enabled.

Cc: Yuan Yao <yuan.yao@linux.intel.com>
Signed-off-by: Yan Zhao <yan.y.zhao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/page_track.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/x86/kvm/mmu/page_track.c b/arch/x86/kvm/mmu/page_track.c
index 0a2ac438d647..571c2c40004a 100644
--- a/arch/x86/kvm/mmu/page_track.c
+++ b/arch/x86/kvm/mmu/page_track.c
@@ -22,6 +22,9 @@
 
 bool kvm_page_track_write_tracking_enabled(struct kvm *kvm)
 {
+	if (kvm->arch.vm_type == KVM_X86_TDX_VM)
+		return false;
+
 	return IS_ENABLED(CONFIG_KVM_EXTERNAL_WRITE_TRACKING) ||
 	       !tdp_enabled || kvm_shadow_root_allocated(kvm);
 }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (48 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 049/113] KVM: x86/mmu: TDX: Do not enable page track for TD guest isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-17  2:40   ` Huang, Kai
  2023-02-17  8:27   ` Zhi Wang
  2023-01-12 16:31 ` [PATCH v11 051/113] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
                   ` (62 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Some KVM MMU operations (dirty page logging, page migration, page aging)
aren't supported for private GFNs (yet) with the first generation of TDX.
Silently return on unsupported TDX KVM MMU operations.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/mmu.c     |  3 +++
 arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/x86.c         |  3 +++
 3 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 484e615196aa..ad0482a101a3 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6635,6 +6635,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
 	for_each_rmap_spte(rmap_head, &iter, sptep) {
 		sp = sptep_to_sp(sptep);
 
+		/* Private page dirty logging is not supported yet. */
+		KVM_BUG_ON(is_private_sptep(sptep), kvm);
+
 		/*
 		 * We cannot do huge page mapping for indirect shadow pages,
 		 * which are found on the last rmap (level = 1) when not using
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 5ce0328c71df..69e202bd1897 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1478,7 +1478,8 @@ typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
 
 static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 						   struct kvm_gfn_range *range,
-						   tdp_handler_t handler)
+						   tdp_handler_t handler,
+						   bool only_shared)
 {
 	struct kvm_mmu_page *root;
 	struct tdp_iter iter;
@@ -1489,9 +1490,23 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
 	 * into this helper allow blocking; it'd be dead, wasteful code.
 	 */
 	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
+		gfn_t start;
+		gfn_t end;
+
+		if (only_shared && is_private_sp(root))
+			continue;
+
 		rcu_read_lock();
 
-		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
+		/*
+		 * For TDX shared mappings, set the GFN shared bit on the range
+		 * so that the handler() doesn't need to set it, avoiding
+		 * duplicated code in multiple handler()s.
+		 */
+		start = kvm_gfn_for_root(kvm, root, range->start);
+		end = kvm_gfn_for_root(kvm, root, range->end);
+
+		tdp_root_for_each_leaf_pte(iter, root, start, end)
 			ret |= handler(kvm, &iter, range);
 
 		rcu_read_unlock();
@@ -1535,7 +1550,12 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
 
 bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
+	/*
+	 * The first TDX generation doesn't support clearing the A bit for
+	 * private mappings, since there's no Secure EPT API to support it.
+	 * However, it's a legitimate request for a TDX guest.
+	 */
+	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range, true);
 }
 
 static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1546,7 +1566,8 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
 
 bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 {
-	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
+	/* The first TDX generation doesn't support A bit. */
+	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn, true);
 }
 
 static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
@@ -1591,8 +1612,11 @@ bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
 	 * No need to handle the remote TLB flush under RCU protection, the
 	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
 	 * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
+	 *
+	 * .change_pte() callback should not happen for private page, because
+	 * for now TDX private pages are pinned during VM's life time.
 	 */
-	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
+	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn, true);
 }
 
 /*
@@ -1974,6 +1998,13 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
 	struct kvm_mmu_page *root;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
+	/*
+	 * First TDX generation doesn't support clearing dirty bit,
+	 * since there's no secure EPT API to support it.  For now silently
+	 * ignore KVM_CLEAR_DIRTY_LOG.
+	 */
+	if (!kvm_arch_dirty_log_supported(kvm))
+		return;
 	for_each_tdp_mmu_root(kvm, root, slot->as_id)
 		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
 }
@@ -2093,6 +2124,15 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
 	bool spte_set = false;
 
 	lockdep_assert_held_write(&kvm->mmu_lock);
+
+	/*
+	 * First TDX generation doesn't support write protecting private
+	 * mappings, silently ignore the request.  KVM_GET_DIRTY_LOG etc
+	 * can reach here, no warning.
+	 */
+	if (!kvm_arch_dirty_log_supported(kvm))
+		return false;
+
 	for_each_tdp_mmu_root(kvm, root, slot->as_id)
 		spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5b4d5f8128a5..c4579e696d39 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12526,6 +12526,9 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
 	u32 new_flags = new ? new->flags : 0;
 	bool log_dirty_pages = new_flags & KVM_MEM_LOG_DIRTY_PAGES;
 
+	if (!kvm_arch_dirty_log_supported(kvm) && log_dirty_pages)
+		return;
+
 	/*
 	 * Update CPU dirty logging if dirty logging is being toggled.  This
 	 * applies to all operations.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 051/113] KVM: VMX: Split out guts of EPT violation to common/exposed function
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (49 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
@ 2023-01-12 16:31 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 052/113] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
                   ` (61 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:31 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson, Kai Huang

From: Sean Christopherson <sean.j.christopherson@intel.com>

What differs for a TDX EPT violation is how the information, GPA and exit
qualification, is retrieved.  To share the EPT violation handling code, split
out the guts of the EPT violation handler so that the VMX/TDX exit handlers
can call it after retrieving the GPA and exit qualification, as sketched
below.
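
A minimal sketch of the intended reuse (illustrative only; the TDX-side
handler name and how the exit information is fetched are assumptions, while
__vmx_handle_ept_violation() and trace_kvm_page_fault() are taken from this
patch):

static int example_tdx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
					    unsigned long exit_qual)
{
	/* gpa/exit_qual come from the TD exit information, not the VMCS. */
	trace_kvm_page_fault(vcpu, gpa, exit_qual);
	return __vmx_handle_ept_violation(vcpu, gpa, exit_qual);
}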

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
Reviewed-by: Kai Huang <kai.huang@intel.com>
---
 arch/x86/kvm/vmx/common.h | 33 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 25 +++----------------------
 2 files changed, 36 insertions(+), 22 deletions(-)
 create mode 100644 arch/x86/kvm/vmx/common.h

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
new file mode 100644
index 000000000000..235908f3e044
--- /dev/null
+++ b/arch/x86/kvm/vmx/common.h
@@ -0,0 +1,33 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+#ifndef __KVM_X86_VMX_COMMON_H
+#define __KVM_X86_VMX_COMMON_H
+
+#include <linux/kvm_host.h>
+
+#include "mmu.h"
+
+static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
+					     unsigned long exit_qualification)
+{
+	u64 error_code;
+
+	/* Is it a read fault? */
+	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
+		     ? PFERR_USER_MASK : 0;
+	/* Is it a write fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
+		      ? PFERR_WRITE_MASK : 0;
+	/* Is it a fetch fault? */
+	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
+		      ? PFERR_FETCH_MASK : 0;
+	/* ept page table entry is present? */
+	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
+		      ? PFERR_PRESENT_MASK : 0;
+
+	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
+	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
+
+	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+}
+
+#endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 0df044357e09..6394c1241374 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -51,6 +51,7 @@
 #include <asm/vmx.h>
 
 #include "capabilities.h"
+#include "common.h"
 #include "cpuid.h"
 #include "hyperv.h"
 #include "kvm_onhyperv.h"
@@ -5791,11 +5792,8 @@ static int handle_task_switch(struct kvm_vcpu *vcpu)
 
 static int handle_ept_violation(struct kvm_vcpu *vcpu)
 {
-	unsigned long exit_qualification;
+	unsigned long exit_qualification = vmx_get_exit_qual(vcpu);
 	gpa_t gpa;
-	u64 error_code;
-
-	exit_qualification = vmx_get_exit_qual(vcpu);
 
 	/*
 	 * EPT violation happened while executing iret from NMI,
@@ -5810,23 +5808,6 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 
 	gpa = vmcs_read64(GUEST_PHYSICAL_ADDRESS);
 	trace_kvm_page_fault(vcpu, gpa, exit_qualification);
-
-	/* Is it a read fault? */
-	error_code = (exit_qualification & EPT_VIOLATION_ACC_READ)
-		     ? PFERR_USER_MASK : 0;
-	/* Is it a write fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_WRITE)
-		      ? PFERR_WRITE_MASK : 0;
-	/* Is it a fetch fault? */
-	error_code |= (exit_qualification & EPT_VIOLATION_ACC_INSTR)
-		      ? PFERR_FETCH_MASK : 0;
-	/* ept page table entry is present? */
-	error_code |= (exit_qualification & EPT_VIOLATION_RWX_MASK)
-		      ? PFERR_PRESENT_MASK : 0;
-
-	error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED) != 0 ?
-	       PFERR_GUEST_FINAL_MASK : PFERR_GUEST_PAGE_MASK;
-
 	vcpu->arch.exit_qualification = exit_qualification;
 
 	/*
@@ -5840,7 +5821,7 @@ static int handle_ept_violation(struct kvm_vcpu *vcpu)
 	if (unlikely(allow_smaller_maxphyaddr && kvm_vcpu_is_illegal_gpa(vcpu, gpa)))
 		return kvm_emulate_instruction(vcpu, 0);
 
-	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
+	return __vmx_handle_ept_violation(vcpu, gpa, exit_qualification);
 }
 
 static int handle_ept_misconfig(struct kvm_vcpu *vcpu)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 052/113] KVM: VMX: Move setting of EPT MMU masks to common VT-x code
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (50 preceding siblings ...)
  2023-01-12 16:31 ` [PATCH v11 051/113] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 053/113] KVM: TDX: Add accessors VMX VMCS helpers isaku.yamahata
                   ` (60 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

The EPT MMU masks are used commonly for VMX and TDX.  The values need to be
initialized in common code before both the VMX- and TDX-specific
initialization code runs.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 5 +++++
 arch/x86/kvm/vmx/vmx.c  | 4 ----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 23b3ffc3fe23..9f817f9a8c69 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -4,6 +4,7 @@
 #include "x86_ops.h"
 #include "vmx.h"
 #include "nested.h"
+#include "mmu.h"
 #include "pmu.h"
 #include "tdx.h"
 
@@ -26,6 +27,10 @@ static __init int vt_hardware_setup(void)
 
 	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
 
+	if (enable_ept)
+		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
+				      cpu_has_vmx_ept_execute_only());
+
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6394c1241374..2b0de8ba86b1 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8394,10 +8394,6 @@ __init int vmx_hardware_setup(void)
 
 	set_bit(0, vmx_vpid_bitmap); /* 0 is reserved for host */
 
-	if (enable_ept)
-		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
-				      cpu_has_vmx_ept_execute_only());
-
 	/*
 	 * Setup shadow_me_value/shadow_me_mask to include MKTME KeyID
 	 * bits to shadow_zero_check.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 053/113] KVM: TDX: Add accessors VMX VMCS helpers
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (51 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 052/113] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 054/113] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
                   ` (59 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX defines SEAMCALL APIs to access TDX control structures corresponding to
the VMX VMCS.  Introduce helper accessors that hide the SEAMCALL ABI details.
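
For illustration, a minimal sketch of the generated accessors in use (the
helpers below are hypothetical; GUEST_RIP is a natural-width VMCS field,
which TDX treats as 64-bit):

	/* Hypothetical example: read a 64-bit TD VMCS field via TDH.VP.RD. */
	static u64 example_read_guest_rip(struct vcpu_tdx *tdx)
	{
		return td_vmcs_read64(tdx, GUEST_RIP);
	}

	/* Hypothetical example: write a 64-bit TD VMCS field via TDH.VP.WR. */
	static void example_write_guest_rip(struct vcpu_tdx *tdx, u64 rip)
	{
		td_vmcs_write64(tdx, GUEST_RIP, rip);
	}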

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.h | 95 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 95 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e909883d60fa..237c8038eb6a 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -57,6 +57,101 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu *vcpu)
 	return container_of(vcpu, struct vcpu_tdx, vcpu);
 }
 
+static __always_inline void tdvps_vmcs_check(u32 field, u8 bits)
+{
+#define VMCS_ENC_ACCESS_TYPE_MASK	0x1UL
+#define VMCS_ENC_ACCESS_TYPE_FULL	0x0UL
+#define VMCS_ENC_ACCESS_TYPE_HIGH	0x1UL
+#define VMCS_ENC_ACCESS_TYPE(field)	((field) & VMCS_ENC_ACCESS_TYPE_MASK)
+
+	/* TDX is 64bit only.  HIGH field isn't supported. */
+	BUILD_BUG_ON_MSG(__builtin_constant_p(field) &&
+			 VMCS_ENC_ACCESS_TYPE(field) == VMCS_ENC_ACCESS_TYPE_HIGH,
+			 "Read/Write to TD VMCS *_HIGH fields not supported");
+
+	BUILD_BUG_ON(bits != 16 && bits != 32 && bits != 64);
+
+#define VMCS_ENC_WIDTH_MASK	GENMASK(14, 13)
+#define VMCS_ENC_WIDTH_16BIT	(0UL << 13)
+#define VMCS_ENC_WIDTH_64BIT	(1UL << 13)
+#define VMCS_ENC_WIDTH_32BIT	(2UL << 13)
+#define VMCS_ENC_WIDTH_NATURAL	(3UL << 13)
+#define VMCS_ENC_WIDTH(field)	((field) & VMCS_ENC_WIDTH_MASK)
+
+	/* TDX is 64bit only.  i.e. natural width = 64bit. */
+	BUILD_BUG_ON_MSG(bits != 64 && __builtin_constant_p(field) &&
+			 (VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_64BIT ||
+			  VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_NATURAL),
+			 "Invalid TD VMCS access for 64-bit field");
+	BUILD_BUG_ON_MSG(bits != 32 && __builtin_constant_p(field) &&
+			 VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_32BIT,
+			 "Invalid TD VMCS access for 32-bit field");
+	BUILD_BUG_ON_MSG(bits != 16 && __builtin_constant_p(field) &&
+			 VMCS_ENC_WIDTH(field) == VMCS_ENC_WIDTH_16BIT,
+			 "Invalid TD VMCS access for 16-bit field");
+}
+
+static __always_inline void tdvps_state_non_arch_check(u64 field, u8 bits) {}
+static __always_inline void tdvps_management_check(u64 field, u8 bits) {}
+
+#define TDX_BUILD_TDVPS_ACCESSORS(bits, uclass, lclass)				\
+static __always_inline u##bits td_##lclass##_read##bits(struct vcpu_tdx *tdx,	\
+							u32 field)		\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_rd(tdx->tdvpr_pa, TDVPS_##uclass(field), &out);		\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm)) {					\
+		pr_err("TDH_VP_RD["#uclass".0x%x] failed: 0x%llx\n",		\
+		       field, err);						\
+		return 0;							\
+	}									\
+	return (u##bits)out.r8;							\
+}										\
+static __always_inline void td_##lclass##_write##bits(struct vcpu_tdx *tdx,	\
+						      u32 field, u##bits val)	\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), val,		\
+		      GENMASK_ULL(bits - 1, 0), &out);				\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
+		pr_err("TDH_VP_WR["#uclass".0x%x] = 0x%llx failed: 0x%llx\n",	\
+		       field, (u64)val, err);					\
+}										\
+static __always_inline void td_##lclass##_setbit##bits(struct vcpu_tdx *tdx,	\
+						       u32 field, u64 bit)	\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), bit, bit, &out);	\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
+		pr_err("TDH_VP_WR["#uclass".0x%x] |= 0x%llx failed: 0x%llx\n",	\
+		       field, bit, err);					\
+}										\
+static __always_inline void td_##lclass##_clearbit##bits(struct vcpu_tdx *tdx,	\
+							 u32 field, u64 bit)	\
+{										\
+	struct tdx_module_output out;						\
+	u64 err;								\
+										\
+	tdvps_##lclass##_check(field, bits);					\
+	err = tdh_vp_wr(tdx->tdvpr_pa, TDVPS_##uclass(field), 0, bit, &out);	\
+	if (KVM_BUG_ON(err, tdx->vcpu.kvm))					\
+		pr_err("TDH_VP_WR["#uclass".0x%x] &= ~0x%llx failed: 0x%llx\n",	\
+		       field, bit,  err);					\
+}
+
+TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
+TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
+
 static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
 {
 	struct tdx_module_output out;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 054/113] KVM: TDX: Add load_mmu_pgd method for TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (52 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 053/113] KVM: TDX: Add accessors VMX VMCS helpers isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range isaku.yamahata
                   ` (58 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

For virtual I/O, the guest TD shares guest pages with the VMM without
encryption.  Shared EPT is used to map such guest pages in an unprotected
way.

Add the VMCS field encoding for the shared EPTP, which will be used by
TDX to have separate EPT walks for private GPAs (existing EPTP) versus
shared GPAs (new shared EPTP).

Set the shared EPT pointer value for the TDX guest to initialize the TDX MMU.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/vmx.h |  1 +
 arch/x86/kvm/vmx/main.c    | 11 ++++++++++-
 arch/x86/kvm/vmx/tdx.c     |  5 +++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 752d53652007..1205018b5b6b 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -234,6 +234,7 @@ enum vmcs_field {
 	TSC_MULTIPLIER_HIGH             = 0x00002033,
 	TERTIARY_VM_EXEC_CONTROL	= 0x00002034,
 	TERTIARY_VM_EXEC_CONTROL_HIGH	= 0x00002035,
+	SHARED_EPT_POINTER		= 0x0000203C,
 	PID_POINTER_TABLE		= 0x00002042,
 	PID_POINTER_TABLE_HIGH		= 0x00002043,
 	GUEST_PHYSICAL_ADDRESS          = 0x00002400,
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 9f817f9a8c69..a2ffef95bf9d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -100,6 +100,15 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
+			int pgd_level)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+
+	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -220,7 +229,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.write_tsc_offset = vmx_write_tsc_offset,
 	.write_tsc_multiplier = vmx_write_tsc_multiplier,
 
-	.load_mmu_pgd = vmx_load_mmu_pgd,
+	.load_mmu_pgd = vt_load_mmu_pgd,
 
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index a7d42c05a758..aa07e03843b6 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -382,6 +382,11 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
+{
+	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index fba8d0800597..27dd778aed6a 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -154,6 +154,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
+
+void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline void tdx_hardware_unsetup(void) {}
@@ -172,6 +174,8 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
+
+static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (53 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 054/113] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-17  2:06   ` Huang, Kai
  2023-01-12 16:32 ` [PATCH v11 056/113] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT isaku.yamahata
                   ` (57 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This is preparation for TDX to define its own tlb_remote_flush and
tlb_remote_flush_with_range.  Currently the VMX code leaves tlb_remote_flush
and tlb_remote_flush_with_range as NULL by default, and they are set to
non-NULL methods only in the nested Hyper-V guest case.

To let the TDX code override those two methods consistently with the other
methods, define vmx_tlb_remote_flush and vmx_tlb_remote_flush_with_range as
nops and call the Hyper-V code only in the nested Hyper-V guest case.
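
For reference, a rough sketch of why returning -EOPNOTSUPP acts as a nop:
the x86 arch hook falls back to a full flush when the op reports failure
(paraphrased from kvm_arch_flush_remote_tlb() in the base tree; the exact
code may differ):

	static inline int kvm_arch_flush_remote_tlb(struct kvm *kvm)
	{
		if (kvm_x86_ops.tlb_remote_flush &&
		    !static_call(kvm_x86_tlb_remote_flush)(kvm))
			return 0;	/* hypervisor-assisted flush done */

		return -ENOTSUPP;	/* caller falls back to KVM_REQ_TLB_FLUSH */
	}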

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/kvm_onhyperv.c     |  5 ++++-
 arch/x86/kvm/kvm_onhyperv.h     |  1 +
 arch/x86/kvm/mmu/mmu.c          |  2 +-
 arch/x86/kvm/svm/svm_onhyperv.h |  1 +
 arch/x86/kvm/vmx/main.c         |  2 ++
 arch/x86/kvm/vmx/vmx.c          | 34 ++++++++++++++++++++++++++++-----
 arch/x86/kvm/vmx/x86_ops.h      |  3 +++
 7 files changed, 41 insertions(+), 7 deletions(-)

diff --git a/arch/x86/kvm/kvm_onhyperv.c b/arch/x86/kvm/kvm_onhyperv.c
index 482d6639ef88..d2e71c7ab4f4 100644
--- a/arch/x86/kvm/kvm_onhyperv.c
+++ b/arch/x86/kvm/kvm_onhyperv.c
@@ -94,11 +94,14 @@ int hv_remote_flush_tlb(struct kvm *kvm)
 }
 EXPORT_SYMBOL_GPL(hv_remote_flush_tlb);
 
+bool hv_use_remote_flush_tlb __ro_after_init;
+EXPORT_SYMBOL_GPL(hv_use_remote_flush_tlb);
+
 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
 {
 	struct kvm_arch *kvm_arch = &vcpu->kvm->arch;
 
-	if (kvm_x86_ops.tlb_remote_flush == hv_remote_flush_tlb) {
+	if (hv_use_remote_flush_tlb) {
 		spin_lock(&kvm_arch->hv_root_tdp_lock);
 		vcpu->arch.hv_root_tdp = root_tdp;
 		if (root_tdp != kvm_arch->hv_root_tdp)
diff --git a/arch/x86/kvm/kvm_onhyperv.h b/arch/x86/kvm/kvm_onhyperv.h
index 287e98ef9df3..9a07a34666fb 100644
--- a/arch/x86/kvm/kvm_onhyperv.h
+++ b/arch/x86/kvm/kvm_onhyperv.h
@@ -10,6 +10,7 @@
 int hv_remote_flush_tlb_with_range(struct kvm *kvm,
 		struct kvm_tlb_range *range);
 int hv_remote_flush_tlb(struct kvm *kvm);
+extern bool hv_use_remote_flush_tlb __ro_after_init;
 void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp);
 #else /* !CONFIG_HYPERV */
 static inline void hv_track_root_tdp(struct kvm_vcpu *vcpu, hpa_t root_tdp)
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index ad0482a101a3..4f3f4cdc67ed 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -244,7 +244,7 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
 {
 	int ret = -ENOTSUPP;
 
-	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
+	if (range && kvm_available_flush_tlb_with_range())
 		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
 
 	if (ret)
diff --git a/arch/x86/kvm/svm/svm_onhyperv.h b/arch/x86/kvm/svm/svm_onhyperv.h
index 6981c1e9a809..3ebe08f9d647 100644
--- a/arch/x86/kvm/svm/svm_onhyperv.h
+++ b/arch/x86/kvm/svm/svm_onhyperv.h
@@ -38,6 +38,7 @@ static inline void svm_hv_hardware_setup(void)
 		svm_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
 		svm_x86_ops.tlb_remote_flush_with_range =
 				hv_remote_flush_tlb_with_range;
+		hv_use_remote_flush_tlb = true;
 	}
 
 	if (ms_hyperv.nested_features & HV_X64_NESTED_DIRECT_FLUSH) {
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index a2ffef95bf9d..d0d8cfa89344 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -179,6 +179,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.flush_tlb_all = vmx_flush_tlb_all,
 	.flush_tlb_current = vmx_flush_tlb_current,
+	.tlb_remote_flush = vmx_tlb_remote_flush,
+	.tlb_remote_flush_with_range = vmx_tlb_remote_flush_with_range,
 	.flush_tlb_gva = vmx_flush_tlb_gva,
 	.flush_tlb_guest = vmx_flush_tlb_guest,
 
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 2b0de8ba86b1..d67877a7dcc6 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -3208,6 +3208,33 @@ void vmx_flush_tlb_current(struct kvm_vcpu *vcpu)
 		vpid_sync_context(vmx_get_current_vpid(vcpu));
 }
 
+int vmx_tlb_remote_flush(struct kvm *kvm)
+{
+#if IS_ENABLED(CONFIG_HYPERV)
+	if (hv_use_remote_flush_tlb)
+		return hv_remote_flush_tlb(kvm);
+#endif
+	/*
+	 * fallback to KVM_REQ_TLB_FLUSH.
+	 * See kvm_arch_flush_remote_tlb() and kvm_flush_remote_tlbs().
+	 */
+	return -EOPNOTSUPP;
+}
+
+int vmx_tlb_remote_flush_with_range(struct kvm *kvm,
+				    struct kvm_tlb_range *range)
+{
+#if IS_ENABLED(CONFIG_HYPERV)
+	if (hv_use_remote_flush_tlb)
+		return hv_remote_flush_tlb_with_range(kvm, range);
+#endif
+	/*
+	 * fallback to tlb_remote_flush. See
+	 * kvm_flush_remote_tlbs_with_range()
+	 */
+	return -EOPNOTSUPP;
+}
+
 void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
 {
 	/*
@@ -8361,11 +8388,8 @@ __init int vmx_hardware_setup(void)
 
 #if IS_ENABLED(CONFIG_HYPERV)
 	if (ms_hyperv.nested_features & HV_X64_NESTED_GUEST_MAPPING_FLUSH
-	    && enable_ept) {
-		vt_x86_ops.tlb_remote_flush = hv_remote_flush_tlb;
-		vt_x86_ops.tlb_remote_flush_with_range =
-				hv_remote_flush_tlb_with_range;
-	}
+	    && enable_ept)
+		hv_use_remote_flush_tlb = true;
 #endif
 
 	if (!cpu_has_vmx_ple()) {
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 27dd778aed6a..d7745ac380ed 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -99,6 +99,9 @@ void vmx_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags);
 bool vmx_get_if_flag(struct kvm_vcpu *vcpu);
 void vmx_flush_tlb_all(struct kvm_vcpu *vcpu);
 void vmx_flush_tlb_current(struct kvm_vcpu *vcpu);
+int vmx_tlb_remote_flush(struct kvm *kvm);
+int vmx_tlb_remote_flush_with_range(struct kvm *kvm,
+				    struct kvm_tlb_range *range);
 void vmx_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr);
 void vmx_flush_tlb_guest(struct kvm_vcpu *vcpu);
 void vmx_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 056/113] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (54 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 057/113] KVM: TDX: TDP MMU TDX support isaku.yamahata
                   ` (56 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Yuan Yao

From: Yuan Yao <yuan.yao@intel.com>

The TDX module internally uses locks to protect internal resources.  It
tries to acquire the locks.  If it fails to obtain a lock, it returns the
TDX_OPERAND_BUSY error without spinning because of its execution time
limitation.

The TDX SEAMCALL API reference describes which resources each SEAMCALL uses,
so it is known which TDX SEAMCALLs can contend on which resources.  The VMM
can avoid contention inside the TDX module by serializing contentious TDX
SEAMCALLs with, for example, a spinlock.  Because the OS knows its process
scheduling and scalability better, a lock at the OS/VMM layer works better
than simply retrying TDX SEAMCALLs.

The TDH.MEM.* APIs, except for TDH.MEM.TRACK, operate on a Secure EPT tree,
and the TDX module internally tries to acquire the lock of the Secure EPT
tree.  They return TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT on failure to get
the lock.  TDX KVM allows the SEPT callbacks to return an error so that the
TDP MMU layer can retry.

TDH.VP.ENTER is an exception because of zero-step attack mitigation.
Normally TDH.VP.ENTER uses only TD vcpu resources and doesn't cause
contention.  When a zero-step attack is suspected, it obtains the Secure EPT
tree lock and tracks the GPAs causing Secure EPT faults.  Thus TDH.VP.ENTER
may result in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT.  Also, TDH.MEM.*
SEAMCALLs may result in TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT.

Retry the TDX TDH.MEM.* APIs and TDH.VP.ENTER on this error because it is a
rare event caused by zero-step attack mitigation, and a spinlock cannot be
used for TDH.VP.ENTER due to its indefinite execution time.
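
A minimal sketch of how a SEPT-operating caller might consume the retrying
wrapper's result (the function below is hypothetical; the real SEPT hooks in
a later patch of this series follow the same pattern):

	static int example_sept_add(struct kvm_tdx *kvm_tdx, gpa_t gpa,
				    int tdx_level, hpa_t sept_page)
	{
		struct tdx_module_output out;
		u64 err;

		err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, tdx_level,
				       sept_page, &out);
		if (err == TDX_ERROR_SEPT_BUSY)
			return -EAGAIN;	/* let the TDP MMU retry the fault */
		if (KVM_BUG_ON(err, &kvm_tdx->kvm))
			return -EIO;
		return 0;
	}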

Signed-off-by: Yuan Yao <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx_ops.h | 42 ++++++++++++++++++++++++++++++++------
 1 file changed, 36 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx_ops.h b/arch/x86/kvm/vmx/tdx_ops.h
index 8cc2f01c509b..86330d0e4b22 100644
--- a/arch/x86/kvm/vmx/tdx_ops.h
+++ b/arch/x86/kvm/vmx/tdx_ops.h
@@ -18,6 +18,36 @@
 
 void pr_tdx_error(u64 op, u64 error_code, const struct tdx_module_output *out);
 
+/*
+ * The TDX module acquires its internal locks for resources.  It doesn't spin
+ * to get locks because of its restriction on allowed execution time.  Instead,
+ * it returns TDX_OPERAND_BUSY with an operand id.
+ *
+ * Multiple vcpus can operate on the SEPT.  Also, with zero-step attack
+ * mitigation, TDH.VP.ENTER may rarely acquire the SEPT lock and release it
+ * when a zero-step attack is suspected.  It results in TDX_OPERAND_BUSY |
+ * TDX_OPERAND_ID_SEPT for a TDH.MEM.* operation.  Note: TDH.MEM.TRACK is an
+ * exception.
+ *
+ * Because the TDP MMU uses a read lock for scalability, a spin lock around the
+ * SEAMCALL would spoil the TDP MMU's effort.  Retry several times with the
+ * assumption that SEPT lock contention is rare.  But don't loop forever to
+ * avoid a lockup.  Let the TDP MMU retry.
+ */
+#define TDX_ERROR_SEPT_BUSY    (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)
+
+static inline u64 seamcall_sept(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
+				struct tdx_module_output *out)
+{
+#define SEAMCALL_RETRY_MAX     16
+	int retry = SEAMCALL_RETRY_MAX;
+	u64 ret;
+
+	do {
+		ret = __seamcall(op, rcx, rdx, r8, r9, out);
+	} while (ret == TDX_ERROR_SEPT_BUSY && retry-- > 0);
+	return ret;
+}
+
 static inline u64 tdh_mng_addcx(hpa_t tdr, hpa_t addr)
 {
 	clflush_cache_range(__va(addr), PAGE_SIZE);
@@ -28,14 +58,14 @@ static inline u64 tdh_mem_page_add(hpa_t tdr, gpa_t gpa, hpa_t hpa, hpa_t source
 				   struct tdx_module_output *out)
 {
 	clflush_cache_range(__va(hpa), PAGE_SIZE);
-	return __seamcall(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
+	return seamcall_sept(TDH_MEM_PAGE_ADD, gpa, tdr, hpa, source, out);
 }
 
 static inline u64 tdh_mem_sept_add(hpa_t tdr, gpa_t gpa, int level, hpa_t page,
 				   struct tdx_module_output *out)
 {
 	clflush_cache_range(__va(page), PAGE_SIZE);
-	return __seamcall(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
+	return seamcall_sept(TDH_MEM_SEPT_ADD, gpa | level, tdr, page, 0, out);
 }
 
 static inline u64 tdh_mem_sept_remove(hpa_t tdr, gpa_t gpa, int level,
@@ -61,13 +91,13 @@ static inline u64 tdh_mem_page_aug(hpa_t tdr, gpa_t gpa, hpa_t hpa,
 				   struct tdx_module_output *out)
 {
 	clflush_cache_range(__va(hpa), PAGE_SIZE);
-	return __seamcall(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
+	return seamcall_sept(TDH_MEM_PAGE_AUG, gpa, tdr, hpa, 0, out);
 }
 
 static inline u64 tdh_mem_range_block(hpa_t tdr, gpa_t gpa, int level,
 				      struct tdx_module_output *out)
 {
-	return __seamcall(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, out);
+	return seamcall_sept(TDH_MEM_RANGE_BLOCK, gpa | level, tdr, 0, 0, out);
 }
 
 static inline u64 tdh_mng_key_config(hpa_t tdr)
@@ -149,7 +179,7 @@ static inline u64 tdh_phymem_page_reclaim(hpa_t page,
 static inline u64 tdh_mem_page_remove(hpa_t tdr, gpa_t gpa, int level,
 				      struct tdx_module_output *out)
 {
-	return __seamcall(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, out);
+	return seamcall_sept(TDH_MEM_PAGE_REMOVE, gpa | level, tdr, 0, 0, out);
 }
 
 static inline u64 tdh_sys_lp_shutdown(void)
@@ -165,7 +195,7 @@ static inline u64 tdh_mem_track(hpa_t tdr)
 static inline u64 tdh_mem_range_unblock(hpa_t tdr, gpa_t gpa, int level,
 					struct tdx_module_output *out)
 {
-	return __seamcall(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, out);
+	return seamcall_sept(TDH_MEM_RANGE_UNBLOCK, gpa | level, tdr, 0, 0, out);
 }
 
 static inline u64 tdh_phymem_cache_wb(bool resume)
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 057/113] KVM: TDX: TDP MMU TDX support
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (55 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 056/113] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX isaku.yamahata
                   ` (55 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the TDP MMU hooks for the TDX backend: TLB flush, TLB shootdown,
propagating private EPT entry changes to the Secure EPT, and freeing Secure
EPT pages.  The TLB flush handles both shared EPT and private EPT.  It
flushes the shared EPT the same way as VMX and also waits for the TDX TLB
shootdown.  The hook to free a Secure EPT page unlinks the page from the
Secure EPT so that the page can be freed back to the OS.

Propagate the entry change to the Secure EPT.  The possible entry changes
are present -> non-present (zapping) and non-present -> present (population).
On population, just link the Secure EPT page or the private guest page into
the Secure EPT with a TDX SEAMCALL.  Because the TDP MMU allows concurrent
zapping/population, zapping requires a synchronous TLB shootdown with the
frozen EPT entry: it zaps the Secure EPT entry, increments the TLB counter,
sends an IPI to remote vcpus to trigger a TLB flush, and then unlinks the
private guest page from the Secure EPT.  For simplicity, batched zapping
under the exclusive lock is handled as concurrent zapping.  Although it's
inefficient, it can be optimized in the future.
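
For illustration, a simplified sketch of the zapping order only (the real
flow is driven by the TDP MMU through the hooks added below; the helper here
is hypothetical and error handling is abbreviated):

	static int example_zap_private_page(struct kvm *kvm, gfn_t gfn,
					    enum pg_level level, kvm_pfn_t pfn)
	{
		int r;

		/* 1. TDH.MEM.RANGE.BLOCK: stop new TLB translations. */
		r = tdx_sept_zap_private_spte(kvm, gfn, level);
		if (r)
			return r;	/* -EAGAIN lets the TDP MMU retry */

		/* 2. TDH.MEM.TRACK + IPI: remote vcpus flush on re-entry. */
		kvm_flush_remote_tlbs(kvm);

		/* 3. TDH.MEM.PAGE.REMOVE: now the page can be unlinked. */
		return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
	}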

For MMIO SPTE, the spte value changes as follows.
initial value (suppress VE bit is set)
-> Guest issues MMIO and triggers EPT violation
-> KVM updates SPTE value to MMIO value (suppress VE bit is cleared)
-> Guest MMIO resumes.  It triggers VE exception in guest TD
-> Guest VE handler issues TDG.VP.VMCALL<MMIO>
-> KVM handles MMIO
-> Guest VE handler resumes its execution after MMIO instruction

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu/spte.c    |   3 +-
 arch/x86/kvm/vmx/main.c    |  61 +++++++-
 arch/x86/kvm/vmx/tdx.c     | 302 ++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h     |   7 +
 arch/x86/kvm/vmx/x86_ops.h |   4 +
 5 files changed, 369 insertions(+), 8 deletions(-)

diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index 7171df3e262a..9c874bca69f6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -74,7 +74,8 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
 	u64 spte = generation_mmio_spte_mask(gen);
 	u64 gpa = gfn << PAGE_SHIFT;
 
-	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
+	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value &&
+		     !kvm_gfn_shared_mask(vcpu->kvm));
 
 	access &= shadow_mmio_access_mask;
 	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d0d8cfa89344..770d1b29d1c3 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -100,6 +100,55 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_flush_tlb(vcpu);
+
+	vmx_flush_tlb_all(vcpu);
+}
+
+static void vt_flush_tlb_current(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_flush_tlb(vcpu);
+
+	vmx_flush_tlb_current(vcpu);
+}
+
+static int vt_tlb_remote_flush(struct kvm *kvm)
+{
+	if (is_td(kvm))
+		return tdx_sept_tlb_remote_flush(kvm);
+
+	return vmx_tlb_remote_flush(kvm);
+}
+
+static int vt_tlb_remote_flush_with_range(struct kvm *kvm,
+					  struct kvm_tlb_range *range)
+{
+	if (is_td(kvm))
+		return -EOPNOTSUPP; /* fall back to tlb_remote_flush */
+
+	return vmx_tlb_remote_flush_with_range(kvm, range);
+}
+
+static void vt_flush_tlb_gva(struct kvm_vcpu *vcpu, gva_t addr)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_flush_tlb_gva(vcpu, addr);
+}
+
+static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_flush_tlb_guest(vcpu);
+}
+
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			int pgd_level)
 {
@@ -177,12 +226,12 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_rflags = vmx_set_rflags,
 	.get_if_flag = vmx_get_if_flag,
 
-	.flush_tlb_all = vmx_flush_tlb_all,
-	.flush_tlb_current = vmx_flush_tlb_current,
-	.tlb_remote_flush = vmx_tlb_remote_flush,
-	.tlb_remote_flush_with_range = vmx_tlb_remote_flush_with_range,
-	.flush_tlb_gva = vmx_flush_tlb_gva,
-	.flush_tlb_guest = vmx_flush_tlb_guest,
+	.flush_tlb_all = vt_flush_tlb_all,
+	.flush_tlb_current = vt_flush_tlb_current,
+	.tlb_remote_flush = vt_tlb_remote_flush,
+	.tlb_remote_flush_with_range = vt_tlb_remote_flush_with_range,
+	.flush_tlb_gva = vt_flush_tlb_gva,
+	.flush_tlb_guest = vt_flush_tlb_guest,
 
 	.vcpu_pre_run = vmx_vcpu_pre_run,
 	.vcpu_run = vmx_vcpu_run,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index aa07e03843b6..e68816999387 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -6,7 +6,9 @@
 #include "capabilities.h"
 #include "x86_ops.h"
 #include "tdx.h"
+#include "vmx.h"
 #include "x86.h"
+#include "mmu.h"
 
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
@@ -282,6 +284,22 @@ static int tdx_do_tdh_mng_key_config(void *param)
 
 int tdx_vm_init(struct kvm *kvm)
 {
+	/*
+	 * Because the guest TD is protected, the VMM can't parse instructions
+	 * in the TD.  Instead, the guest uses the MMIO hypercall.  For an
+	 * unmodified device driver, #VE needs to be injected for MMIO and the
+	 * #VE handler in the TD converts the MMIO instruction into an MMIO
+	 * hypercall.
+	 *
+	 * The SPTE value for MMIO needs to be set up so that #VE is injected
+	 * into the TD instead of triggering EPT MISCONFIG.
+	 * - RWX=0 so that EPT violation is triggered.
+	 * - suppress #VE bit is cleared to inject #VE.
+	 */
+	kvm_mmu_set_mmio_spte_value(kvm, 0);
+
+	/* TODO: Enable 2mb and 1gb large page support. */
+	kvm->arch.tdp_max_page_level = PG_LEVEL_4K;
+
 	/*
 	 * This function initializes only KVM software construct.  It doesn't
 	 * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
@@ -387,6 +405,261 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
 }
 
+static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
+{
+	struct page *page = pfn_to_page(pfn);
+
+	put_page(page);
+}
+
+static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, kvm_pfn_t pfn)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	struct tdx_module_output out;
+	u64 err;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	/*
+	 * Because restricted mem doesn't support page migration with
+	 * a_ops->migrate_page (yet), no callback is triggered for KVM on
+	 * page migration.  Until restricted mem supports page migration,
+	 * prevent page migration.
+	 * TODO: Once restricted mem introduces callback on page migration,
+	 * implement it and remove get_page/put_page().
+	 */
+	get_page(pfn_to_page(pfn));
+
+	if (likely(is_td_finalized(kvm_tdx))) {
+		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
+		if (err == TDX_ERROR_SEPT_BUSY) {
+			tdx_unpin(kvm, pfn);
+			return -EAGAIN;
+		}
+		if (KVM_BUG_ON(err, kvm)) {
+			pr_tdx_error(TDH_MEM_PAGE_AUG, err, &out);
+			tdx_unpin(kvm, pfn);
+			return -EIO;
+		}
+		return 0;
+	}
+
+	/* TODO: tdh_mem_page_add() comes here for the initial memory. */
+
+	return 0;
+}
+
+static int tdx_sept_drop_private_spte(struct kvm *kvm, gfn_t gfn,
+				       enum pg_level level, kvm_pfn_t pfn)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct tdx_module_output out;
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = pfn_to_hpa(pfn);
+	hpa_t hpa_with_hkid;
+	u64 err;
+
+	/* TODO: handle large pages. */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	if (!is_hkid_assigned(kvm_tdx)) {
+		/*
+		 * The HKID assigned to this TD was already freed and cache
+		 * was already flushed. We don't have to flush again.
+		 */
+		err = tdx_reclaim_page(hpa, false, 0);
+		if (KVM_BUG_ON(err, kvm))
+			return -EIO;
+		tdx_unpin(kvm, pfn);
+		return 0;
+	}
+
+	do {
+		/*
+		 * When zapping private page, write lock is held. So no race
+		 * condition with other vcpu sept operation.  Race only with
+		 * TDH.VP.ENTER.
+		 */
+		err = tdh_mem_page_remove(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+	} while (err == TDX_ERROR_SEPT_BUSY);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_PAGE_REMOVE, err, &out);
+		return -EIO;
+	}
+
+	hpa_with_hkid = set_hkid_to_hpa(hpa, (u16)kvm_tdx->hkid);
+	do {
+		/*
+		 * TDX_OPERAND_BUSY can happen on locking PAMT entry.  Because
+		 * this page was removed above, other thread shouldn't be
+		 * repeatedly operating on this page.  Just retry loop.
+		 */
+		err = tdh_phymem_page_wbinvd(hpa_with_hkid);
+	} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
+		return -EIO;
+	}
+	tdx_unpin(kvm, pfn);
+	return 0;
+}
+
+static int tdx_sept_link_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, void *private_spt)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	hpa_t hpa = __pa(private_spt);
+	struct tdx_module_output out;
+	u64 err;
+
+	err = tdh_mem_sept_add(kvm_tdx->tdr_pa, gpa, tdx_level, hpa, &out);
+	if (err == TDX_ERROR_SEPT_BUSY)
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_SEPT_ADD, err, &out);
+		return -EIO;
+	}
+
+	return 0;
+}
+
+static int tdx_sept_zap_private_spte(struct kvm *kvm, gfn_t gfn,
+				      enum pg_level level)
+{
+	int tdx_level = pg_level_to_tdx_sept_level(level);
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	gpa_t gpa = gfn_to_gpa(gfn);
+	struct tdx_module_output out;
+	u64 err;
+
+	/* For now large page isn't supported yet. */
+	WARN_ON_ONCE(level != PG_LEVEL_4K);
+	err = tdh_mem_range_block(kvm_tdx->tdr_pa, gpa, tdx_level, &out);
+	if (err == TDX_ERROR_SEPT_BUSY)
+		return -EAGAIN;
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_RANGE_BLOCK, err, &out);
+		return -EIO;
+	}
+	return 0;
+}
+
+/*
+ * TLB shoot down procedure:
+ * There is a global epoch counter and each vcpu has a local epoch counter.
+ * - TDH.MEM.RANGE.BLOCK(TDR, level, range) on one vcpu
+ *   This blocks the subsequent creation of TLB translations on that range.
+ *   This corresponds to clearing the present bits (all RWX) in the EPT entry.
+ * - TDH.MEM.TRACK(TDR): advances the epoch counter, which is global.
+ * - IPI to remote vcpus
+ * - TD exit and re-entry with TDH.VP.ENTER on remote vcpus
+ * - On re-entry, the TDX module compares the local epoch counter with the
+ *   global epoch counter.  If the local epoch counter is older than the
+ *   global epoch counter, it updates the local epoch counter and flushes the
+ *   TLB.
+ */
+static void tdx_track(struct kvm_tdx *kvm_tdx)
+{
+	u64 err;
+
+	KVM_BUG_ON(!is_hkid_assigned(kvm_tdx), &kvm_tdx->kvm);
+	/* If TD isn't finalized, it's before any vcpu running. */
+	if (unlikely(!is_td_finalized(kvm_tdx)))
+		return;
+
+	/*
+	 * tdx_flush_tlb() waits for this function to issue TDH.MEM.TRACK() by
+	 * the counter.  The counter is used instead of bool because multiple
+	 * TDH_MEM_TRACK() can be issued concurrently by multiple vcpus.
+	 */
+	atomic_inc(&kvm_tdx->tdh_mem_track);
+	/*
+	 * KVM_REQ_TLB_FLUSH waits for the empty IPI handler, ack_flush(), with
+	 * KVM_REQUEST_WAIT.
+	 */
+	kvm_make_all_cpus_request(&kvm_tdx->kvm, KVM_REQ_TLB_FLUSH);
+
+	do {
+		/*
+		 * kvm_flush_remote_tlbs() doesn't allow returning an error
+		 * and retrying.
+		 */
+		err = tdh_mem_track(kvm_tdx->tdr_pa);
+	} while ((err & TDX_SEAMCALL_STATUS_MASK) == TDX_OPERAND_BUSY);
+
+	/* Release remote vcpu waiting for TDH.MEM.TRACK in tdx_flush_tlb(). */
+	atomic_dec(&kvm_tdx->tdh_mem_track);
+
+	if (KVM_BUG_ON(err, &kvm_tdx->kvm))
+		pr_tdx_error(TDH_MEM_TRACK, err, NULL);
+
+}
+
+static int tdx_sept_free_private_spt(struct kvm *kvm, gfn_t gfn,
+				     enum pg_level level, void *private_spt)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+
+	/*
+	 * The HKID assigned to this TD was already freed and cache was
+	 * already flushed. We don't have to flush again.
+	 */
+	if (!is_hkid_assigned(kvm_tdx))
+		return tdx_reclaim_page(__pa(private_spt), false, 0);
+
+	/*
+	 * free_private_spt() is (obviously) called when a shadow page is being
+	 * zapped.  KVM doesn't (yet) zap private SPs while the TD is active.
+	 * Note: This function is for private shadow pages, not for private
+	 * guest pages.  A private guest page can be zapped while the TD is
+	 * active, e.g. for shared <-> private conversion and slot move/deletion.
+	 */
+	KVM_BUG_ON(is_hkid_assigned(kvm_tdx), kvm);
+	return -EINVAL;
+}
+
+int tdx_sept_tlb_remote_flush(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx;
+
+	if (!is_td(kvm))
+		return -EOPNOTSUPP;
+
+	kvm_tdx = to_kvm_tdx(kvm);
+	if (is_hkid_assigned(kvm_tdx))
+		tdx_track(kvm_tdx);
+
+	return 0;
+}
+
+static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
+					 enum pg_level level, kvm_pfn_t pfn)
+{
+	/*
+	 * TDX requires TLB tracking before dropping private page.  Do
+	 * it here, although it is also done later.
+	 * If hkid isn't assigned, the guest is destroying and no vcpu
+	 * runs further.  TLB shootdown isn't needed.
+	 *
+	 * TODO: implement with_range version for optimization.
+	 * kvm_flush_remote_tlbs_with_address(kvm, gfn, 1);
+	 *   => tdx_sept_tlb_remote_flush_with_range(kvm, gfn,
+	 *                                 KVM_PAGES_PER_HPAGE(level));
+	 */
+	if (is_hkid_assigned(to_kvm_tdx(kvm)))
+		kvm_flush_remote_tlbs(kvm);
+
+	return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
@@ -798,6 +1071,25 @@ static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+void tdx_flush_tlb(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+	struct kvm_mmu *mmu = vcpu->arch.mmu;
+	u64 root_hpa = mmu->root.hpa;
+
+	/* Flush the shared EPTP, if it's valid. */
+	if (VALID_PAGE(root_hpa))
+		ept_sync_context(construct_eptp(vcpu, root_hpa,
+						mmu->root_role.level));
+
+	/*
+	 * See tdx_track().  Wait for the TLB shootdown initiator to finish
+	 * TDH_MEM_TRACK() so that the TLB is flushed on the next TDENTER.
+	 */
+	while (atomic_read(&kvm_tdx->tdh_mem_track))
+		cpu_relax();
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -1016,8 +1308,16 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 	if (!r)
 		r = tdx_module_setup();
 	vmxoff_all();
+	if (r)
+		return r;
 
-	return r;
+	x86_ops->link_private_spt = tdx_sept_link_private_spt;
+	x86_ops->free_private_spt = tdx_sept_free_private_spt;
+	x86_ops->set_private_spte = tdx_sept_set_private_spte;
+	x86_ops->remove_private_spte = tdx_sept_remove_private_spte;
+	x86_ops->zap_private_spte = tdx_sept_zap_private_spte;
+
+	return 0;
 }
 
 void tdx_hardware_unsetup(void)
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 237c8038eb6a..5cc5d1a29c08 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -18,6 +18,7 @@ struct kvm_tdx {
 	int hkid;
 
 	bool finalized;
+	atomic_t tdh_mem_track;
 
 	u64 tsc_offset;
 };
@@ -165,6 +166,12 @@ static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 fiel
 	return out.r8;
 }
 
+static __always_inline int pg_level_to_tdx_sept_level(enum pg_level level)
+{
+	WARN_ON_ONCE(level == PG_LEVEL_NONE);
+	return level - 1;
+}
+
 #else
 struct kvm_tdx {
 	struct kvm kvm;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index d7745ac380ed..8ae689929347 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -158,6 +158,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
+void tdx_flush_tlb(struct kvm_vcpu *vcpu);
+int tdx_sept_tlb_remote_flush(struct kvm *kvm);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
@@ -178,6 +180,8 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
+static inline void tdx_flush_tlb(struct kvm_vcpu *vcpu) {}
+static inline int tdx_sept_tlb_remote_flush(struct kvm *kvm) { return 0; }
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (56 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 057/113] KVM: TDX: TDP MMU TDX support isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-17  3:11   ` Huang, Kai
  2023-02-03  6:55   ` Yuan Yao
  2023-01-12 16:32 ` [PATCH v11 059/113] [MARKER] The start of TDX KVM patch series: TD finalization isaku.yamahata
                   ` (54 subsequent siblings)
  112 siblings, 2 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Although TDX supports only WB for private GPA, MTRR/PAT for shared GPA
should be supported.  Implement get_mt_mask() following the VMX case.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 10 +++++++++-
 arch/x86/kvm/vmx/tdx.c     | 19 +++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 770d1b29d1c3..4319f6d7a4da 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -158,6 +158,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
 }
 
+static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_mt_mask(vcpu, gfn, is_mmio);
+
+	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -267,7 +275,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.set_identity_map_addr = vmx_set_identity_map_addr,
-	.get_mt_mask = vmx_get_mt_mask,
+	.get_mt_mask = vt_get_mt_mask,
 
 	.get_exit_info = vmx_get_exit_info,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e68816999387..c4c5a8f786c1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -309,6 +309,25 @@ int tdx_vm_init(struct kvm *kvm)
 	return 0;
 }
 
+u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
+{
+	/* TDX private GPA is always WB. */
+	if (gfn & kvm_gfn_shared_mask(vcpu->kvm)) {
+		WARN_ON_ONCE(is_mmio);
+		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
+	}
+
+	if (is_mmio)
+		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
+
+	/*
+	 * Device assignment without VT-d snooping capability with shared GPA
+	 * is dubious.
+	 */
+	WARN_ON_ONCE(kvm_arch_has_noncoherent_dma(vcpu->kvm));
+	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
+}
+
 int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 {
 	struct kvm_cpuid_entry2 *e;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 8ae689929347..d903e0f606d3 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -154,6 +154,7 @@ void tdx_vm_free(struct kvm *kvm);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -176,6 +177,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 059/113] [MARKER] The start of TDX KVM patch series: TD finalization
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (57 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 060/113] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
                   ` (53 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD finalization.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 9b3ab0363184..c081217a0036 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -11,6 +11,7 @@ What qemu can do
 - TDX VM TYPE is exposed to Qemu.
 - Qemu can create/destroy guest of TDX vm type.
 - Qemu can create/destroy vcpu of TDX vm type.
+- Qemu can populate initial guest memory image.
 
 Patch Layer status
 ------------------
@@ -19,8 +20,8 @@ Patch Layer status
 * TDX architectural definitions:        Applied
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
-* TDX EPT violation:                    Applying
-* TD finalization:                      Not yet
+* TDX EPT violation:                    Applied
+* TD finalization:                      Applying
 * TD vcpu enter/exit:                   Not yet
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 060/113] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (58 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 059/113] [MARKER] The start of TDX KVM patch series: TD finalization isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 061/113] KVM: TDX: Create initial guest memory isaku.yamahata
                   ` (52 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Introduce a helper to directly (pun intended) fault-in a TDP page
without having to go through the full page fault path.  This allows
TDX to get the resulting pfn and also allows the RET_PF_* enums to
stay in mmu.c where they belong.
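
A minimal usage sketch (the surrounding caller and the choice of error code
are hypothetical; the real user is the TDX initial-memory ioctl added later
in this series):

	kvm_pfn_t pfn;

	pfn = kvm_mmu_map_tdp_page(vcpu, gpa, PFERR_WRITE_MASK, PG_LEVEL_4K);
	if (is_error_noslot_pfn(pfn) || vcpu->kvm->vm_bugged)
		return -EFAULT;
	/* 'pfn' now backs 'gpa' in the direct TDP mapping. */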

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/mmu.h     |  3 +++
 arch/x86/kvm/mmu/mmu.c | 44 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index 50d240d52697..e2a0dfbee56d 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -154,6 +154,9 @@ static inline void kvm_mmu_load_pgd(struct kvm_vcpu *vcpu)
 					  vcpu->arch.mmu->root_role.level);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level);
+
 /*
  * Check if a given access (described through the I/D, W/R and U/S bits of a
  * page fault error code pfec) causes a permission fault with the given PTE
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4f3f4cdc67ed..4058545a4851 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4524,6 +4524,50 @@ int kvm_tdp_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
 	return direct_page_fault(vcpu, fault);
 }
 
+kvm_pfn_t kvm_mmu_map_tdp_page(struct kvm_vcpu *vcpu, gpa_t gpa,
+			       u32 error_code, int max_level)
+{
+	int r;
+	struct kvm_page_fault fault = (struct kvm_page_fault) {
+		.addr = gpa,
+		.error_code = error_code,
+		.exec = error_code & PFERR_FETCH_MASK,
+		.write = error_code & PFERR_WRITE_MASK,
+		.present = error_code & PFERR_PRESENT_MASK,
+		.rsvd = error_code & PFERR_RSVD_MASK,
+		.user = error_code & PFERR_USER_MASK,
+		.prefetch = false,
+		.is_tdp = true,
+		.nx_huge_page_workaround_enabled = is_nx_huge_page_enabled(vcpu->kvm),
+		.is_private = kvm_is_private_gpa(vcpu->kvm, gpa),
+	};
+
+	WARN_ON_ONCE(!vcpu->arch.mmu->root_role.direct);
+	fault.gfn = gpa_to_gfn(fault.addr) & ~kvm_gfn_shared_mask(vcpu->kvm);
+	fault.slot = kvm_vcpu_gfn_to_memslot(vcpu, fault.gfn);
+
+	if (mmu_topup_memory_caches(vcpu, false))
+		return KVM_PFN_ERR_FAULT;
+
+	/*
+	 * Loop on the page fault path to handle the case where an mmu_notifier
+	 * invalidation triggers RET_PF_RETRY.  In the normal page fault path,
+	 * KVM needs to resume the guest in case the invalidation changed any
+	 * of the page fault properties, i.e. the gpa or error code.  For this
+	 * path, the gpa and error code are fixed by the caller, and the caller
+	 * expects failure if and only if the page fault can't be fixed.
+	 */
+	do {
+		fault.max_level = max_level;
+		fault.req_level = PG_LEVEL_4K;
+		fault.goal_level = PG_LEVEL_4K;
+
+		r = direct_page_fault(vcpu, &fault);
+	} while (r == RET_PF_RETRY && !is_error_noslot_pfn(fault.pfn));
+	return fault.pfn;
+}
+EXPORT_SYMBOL_GPL(kvm_mmu_map_tdp_page);
+
 static void nonpaging_init_context(struct kvm_mmu *context)
 {
 	context->page_fault = nonpaging_page_fault;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 061/113] KVM: TDX: Create initial guest memory
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (59 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 060/113] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 062/113] KVM: TDX: Finalize VM initialization isaku.yamahata
                   ` (51 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because the guest memory is protected in TDX, the creation of the initial
guest memory requires a dedicated TDX module API, tdh_mem_page_add, instead
of directly copying the memory contents into the guest memory as is done for
the default VM type.  The KVM MMU page fault handler callback,
private_page_add, handles it.

Define a new subcommand, KVM_TDX_INIT_MEM_REGION, of the VM-scoped
KVM_MEMORY_ENCRYPT_OP.  It assigns the guest page, copies the initial memory
contents into the guest memory, and encrypts the guest memory.  At the same
time, it optionally extends the memory measurement of the TDX guest.  It
calls the KVM MMU page fault (EPT-violation) handler to trigger the
callbacks for it.
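
For illustration, a userspace sketch of driving this subcommand (assumes a
kernel with this series applied and the kvm_tdx_cmd layout defined earlier
in the series; error handling is omitted and the helper name is
hypothetical):

	#include <sys/ioctl.h>
	#include <linux/kvm.h>
	#include <asm/kvm.h>

	static int tdx_add_initial_memory(int vm_fd, void *src, __u64 gpa,
					  __u64 nr_pages)
	{
		struct kvm_tdx_init_mem_region region = {
			.source_addr = (__u64)(unsigned long)src,
			.gpa = gpa,
			.nr_pages = nr_pages,
		};
		struct kvm_tdx_cmd cmd = {
			.id = KVM_TDX_INIT_MEM_REGION,
			.flags = KVM_TDX_MEASURE_MEMORY_REGION,
			.data = (__u64)(unsigned long)&region,
		};

		/* VM-scoped ioctl; the BSP vcpu must already have been created. */
		return ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
	}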

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h       |   9 ++
 arch/x86/kvm/mmu/mmu.c                |   1 +
 arch/x86/kvm/vmx/tdx.c                | 156 +++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h                |   2 +
 tools/arch/x86/include/uapi/asm/kvm.h |   9 ++
 5 files changed, 172 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 9236c1699c48..5280d175623d 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -537,6 +537,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -612,4 +613,12 @@ struct kvm_tdx_init_vm {
 	};
 };
 
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 4058545a4851..13701f0ca5e1 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -5536,6 +5536,7 @@ int kvm_mmu_load(struct kvm_vcpu *vcpu)
 out:
 	return r;
 }
+EXPORT_SYMBOL(kvm_mmu_load);
 
 void kvm_mmu_unload(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index c4c5a8f786c1..5f1c459da363 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -424,6 +424,21 @@ void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
 }
 
+static void tdx_measure_page(struct kvm_tdx *kvm_tdx, hpa_t gpa)
+{
+	struct tdx_module_output out;
+	u64 err;
+	int i;
+
+	for (i = 0; i < PAGE_SIZE; i += TDX_EXTENDMR_CHUNKSIZE) {
+		err = tdh_mr_extend(kvm_tdx->tdr_pa, gpa + i, &out);
+		if (KVM_BUG_ON(err, &kvm_tdx->kvm)) {
+			pr_tdx_error(TDH_MR_EXTEND, err, &out);
+			break;
+		}
+	}
+}
+
 static void tdx_unpin(struct kvm *kvm, kvm_pfn_t pfn)
 {
 	struct page *page = pfn_to_page(pfn);
@@ -438,12 +453,10 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	hpa_t hpa = pfn_to_hpa(pfn);
 	gpa_t gpa = gfn_to_gpa(gfn);
 	struct tdx_module_output out;
+	hpa_t source_pa;
+	bool measure;
 	u64 err;
 
-	/* TODO: handle large pages. */
-	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
-		return -EINVAL;
-
 	/*
 	 * Because restricted mem doesn't support page migration with
 	 * a_ops->migrate_page (yet), no callback is triggered for KVM on
@@ -454,7 +467,12 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 	 */
 	get_page(pfn_to_page(pfn));
 
+	/* Build-time faults are induced and handled via TDH_MEM_PAGE_ADD. */
 	if (likely(is_td_finalized(kvm_tdx))) {
+		/* TODO: handle large pages. */
+		if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+			return -EINVAL;
+
 		err = tdh_mem_page_aug(kvm_tdx->tdr_pa, gpa, hpa, &out);
 		if (err == TDX_ERROR_SEPT_BUSY) {
 			tdx_unpin(kvm, pfn);
@@ -468,7 +486,45 @@ static int tdx_sept_set_private_spte(struct kvm *kvm, gfn_t gfn,
 		return 0;
 	}
 
-	/* TODO: tdh_mem_page_add() comes here for the initial memory. */
+	/*
+	 * KVM_INIT_MEM_REGION, tdx_init_mem_region(), supports only 4K page
+	 * because tdh_mem_page_add() supports only 4K page.
+	 */
+	if (KVM_BUG_ON(level != PG_LEVEL_4K, kvm))
+		return -EINVAL;
+
+	/*
+	 * In the case of the TDP MMU, the fault handler can run concurrently.
+	 * Note 'source_pa' is a TD-scope variable, meaning that if multiple
+	 * threads reach here, all needing to access 'source_pa', it will
+	 * break.  Fortunately this won't happen, because the TDH_MEM_PAGE_ADD
+	 * code path below is only used while the VM is being created, before
+	 * it runs, via the KVM_TDX_INIT_MEM_REGION ioctl (which always uses
+	 * vcpu 0's page table and is protected by vcpu->mutex).
+	 */
+	if (KVM_BUG_ON(kvm_tdx->source_pa == INVALID_PAGE, kvm)) {
+		tdx_unpin(kvm, pfn);
+		return -EINVAL;
+	}
+
+	source_pa = kvm_tdx->source_pa & ~KVM_TDX_MEASURE_MEMORY_REGION;
+	measure = kvm_tdx->source_pa & KVM_TDX_MEASURE_MEMORY_REGION;
+	kvm_tdx->source_pa = INVALID_PAGE;
+
+	do {
+		err = tdh_mem_page_add(kvm_tdx->tdr_pa, gpa, hpa, source_pa,
+				       &out);
+		/*
+		 * This path is executed while populating the initial guest
+		 * memory image, i.e. before running any vcpu.  Races are rare.
+		 */
+	} while (err == TDX_ERROR_SEPT_BUSY);
+	if (KVM_BUG_ON(err, kvm)) {
+		pr_tdx_error(TDH_MEM_PAGE_ADD, err, &out);
+		tdx_unpin(kvm, pfn);
+		return -EIO;
+	} else if (measure)
+		tdx_measure_page(kvm_tdx, gpa);
 
 	return 0;
 }
@@ -1109,6 +1165,93 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu)
 		cpu_relax();
 }
 
+#define TDX_SEPT_PFERR	PFERR_WRITE_MASK
+
+static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	struct kvm_tdx_init_mem_region region;
+	struct kvm_vcpu *vcpu;
+	struct page *page;
+	kvm_pfn_t pfn;
+	int idx, ret = 0;
+
+	/* The BSP vCPU must be created before initializing memory regions. */
+	if (!atomic_read(&kvm->online_vcpus))
+		return -EINVAL;
+
+	if (cmd->flags & ~KVM_TDX_MEASURE_MEMORY_REGION)
+		return -EINVAL;
+
+	if (copy_from_user(&region, (void __user *)cmd->data, sizeof(region)))
+		return -EFAULT;
+
+	/* Sanity check */
+	if (!IS_ALIGNED(region.source_addr, PAGE_SIZE) ||
+	    !IS_ALIGNED(region.gpa, PAGE_SIZE) ||
+	    !region.nr_pages ||
+	    region.gpa + (region.nr_pages << PAGE_SHIFT) <= region.gpa ||
+	    !kvm_is_private_gpa(kvm, region.gpa) ||
+	    !kvm_is_private_gpa(kvm, region.gpa + (region.nr_pages << PAGE_SHIFT)))
+		return -EINVAL;
+
+	vcpu = kvm_get_vcpu(kvm, 0);
+	if (mutex_lock_killable(&vcpu->mutex))
+		return -EINTR;
+
+	vcpu_load(vcpu);
+	idx = srcu_read_lock(&kvm->srcu);
+
+	kvm_mmu_reload(vcpu);
+
+	while (region.nr_pages) {
+		if (signal_pending(current)) {
+			ret = -ERESTARTSYS;
+			break;
+		}
+
+		if (need_resched())
+			cond_resched();
+
+		/* Pin the source page. */
+		ret = get_user_pages_fast(region.source_addr, 1, 0, &page);
+		if (ret < 0)
+			break;
+		if (ret != 1) {
+			ret = -ENOMEM;
+			break;
+		}
+
+		kvm_tdx->source_pa = pfn_to_hpa(page_to_pfn(page)) |
+				     (cmd->flags & KVM_TDX_MEASURE_MEMORY_REGION);
+
+		pfn = kvm_mmu_map_tdp_page(vcpu, region.gpa, TDX_SEPT_PFERR,
+					   PG_LEVEL_4K);
+		if (is_error_noslot_pfn(pfn) || kvm->vm_bugged)
+			ret = -EFAULT;
+		else
+			ret = 0;
+
+		put_page(page);
+		if (ret)
+			break;
+
+		region.source_addr += PAGE_SIZE;
+		region.gpa += PAGE_SIZE;
+		region.nr_pages--;
+	}
+
+	srcu_read_unlock(&kvm->srcu, idx);
+	vcpu_put(vcpu);
+
+	mutex_unlock(&vcpu->mutex);
+
+	if (copy_to_user((void __user *)cmd->data, &region, sizeof(region)))
+		ret = -EFAULT;
+
+	return ret;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -1125,6 +1268,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_INIT_VM:
 		r = tdx_td_init(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_INIT_MEM_REGION:
+		r = tdx_init_mem_region(kvm, &tdx_cmd);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 5cc5d1a29c08..d8b4102c67fc 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -17,6 +17,8 @@ struct kvm_tdx {
 	u64 xfam;
 	int hkid;
 
+	hpa_t source_pa;
+
 	bool finalized;
 	atomic_t tdh_mem_track;
 
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 6971f1288043..6587da064a61 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -532,6 +532,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_CAPABILITIES = 0,
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
 
 	KVM_TDX_CMD_NR_MAX,
 };
@@ -609,4 +610,12 @@ struct kvm_tdx_init_vm {
 	};
 };
 
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
+
 #endif /* _ASM_X86_KVM_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 062/113] KVM: TDX: Finalize VM initialization
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (60 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 061/113] KVM: TDX: Create initial guest memory isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 063/113] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit isaku.yamahata
                   ` (50 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

To protect the initial contents of the guest TD, the TDX module measures
the guest TD during the build process with a SHA-384 measurement.  The
measurement of the guest TD contents must be finalized before the guest TD
is ready to run.

Add a new subcommand, KVM_TDX_FINALIZE_VM, for VM-scoped
KVM_MEMORY_ENCRYPT_OP to finalize the measurement and mark the TDX VM ready
to run.
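
As a rough userspace illustration (not part of this patch), a VMM would
issue the subcommand through the vm-scoped ioctl once all initial memory
regions have been added.  This minimal sketch assumes the kvm_tdx_cmd
layout used by this series (id/flags/data/error) and a vm_fd already
created by the VMM; error handling is reduced to err():

	#include <err.h>
	#include <sys/ioctl.h>
	#include <linux/kvm.h>

	static void tdx_finalize_vm(int vm_fd)
	{
		struct kvm_tdx_cmd cmd = {
			.id = KVM_TDX_FINALIZE_VM,
		};

		/* The VM-scoped KVM_MEMORY_ENCRYPT_OP carries the TDX subcommand. */
		if (ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd) < 0)
			err(1, "KVM_TDX_FINALIZE_VM");
	}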

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/kvm.h       |  1 +
 arch/x86/kvm/vmx/tdx.c                | 31 +++++++++++++++++++++++++++
 tools/arch/x86/include/uapi/asm/kvm.h |  1 +
 3 files changed, 33 insertions(+)

diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 5280d175623d..24353eb901c0 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -538,6 +538,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
 	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5f1c459da363..d04e2729d2a3 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1252,6 +1252,34 @@ static int tdx_init_mem_region(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
 	return ret;
 }
 
+static int tdx_td_finalizemr(struct kvm *kvm)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
+	u64 err;
+
+	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
+		return -EINVAL;
+
+	err = tdh_mr_finalize(kvm_tdx->tdr_pa);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MR_FINALIZE, err, NULL);
+		return -EIO;
+	}
+
+	/*
+	 * Blindly do TDH.MEM.TRACK after finalizing the measurement to handle
+	 * the case where SEPT entries were zapped/blocked, e.g. from failed
+	 * NUMA balancing, after they were added to the TD via
+	 * tdx_init_mem_region().  The TDX module doesn't allow TDH.MEM.TRACK
+	 * prior to TDH.MR.FINALIZE, and conversely requires TDH.MEM.TRACK for
+	 * entries that were TDH.MEM.RANGE.BLOCK'd prior to TDH.MR.FINALIZE.
+	 */
+	(void)tdh_mem_track(to_kvm_tdx(kvm)->tdr_pa);
+
+	kvm_tdx->finalized = true;
+	return 0;
+}
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 {
 	struct kvm_tdx_cmd tdx_cmd;
@@ -1271,6 +1299,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
 	case KVM_TDX_INIT_MEM_REGION:
 		r = tdx_init_mem_region(kvm, &tdx_cmd);
 		break;
+	case KVM_TDX_FINALIZE_VM:
+		r = tdx_td_finalizemr(kvm);
+		break;
 	default:
 		r = -EINVAL;
 		goto out;
diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
index 6587da064a61..7bab32a0b068 100644
--- a/tools/arch/x86/include/uapi/asm/kvm.h
+++ b/tools/arch/x86/include/uapi/asm/kvm.h
@@ -533,6 +533,7 @@ enum kvm_tdx_cmd_id {
 	KVM_TDX_INIT_VM,
 	KVM_TDX_INIT_VCPU,
 	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
 
 	KVM_TDX_CMD_NR_MAX,
 };
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 063/113] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (61 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 062/113] KVM: TDX: Finalize VM initialization isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 064/113] KVM: TDX: Add helper assembly function to TDX vcpu isaku.yamahata
                   ` (49 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD vcpu
enter/exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index c081217a0036..58bff496abda 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -12,6 +12,7 @@ What qemu can do
 - Qemu can create/destroy guest of TDX vm type.
 - Qemu can create/destroy vcpu of TDX vm type.
 - Qemu can populate initial guest memory image.
+- Qemu can finalize guest TD.
 
 Patch Layer status
 ------------------
@@ -21,8 +22,8 @@ Patch Layer status
 * TD VM creation/destruction:           Applied
 * TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Applied
-* TD finalization:                      Applying
-* TD vcpu enter/exit:                   Not yet
+* TD finalization:                      Applied
+* TD vcpu enter/exit:                   Applying
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA shared bits:              Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 064/113] KVM: TDX: Add helper assembly function to TDX vcpu
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (62 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 063/113] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 065/113] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
                   ` (48 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX defines an API to run a TDX vcpu with its own ABI.  Define an assembly
helper function that runs a TDX vcpu and hides the special ABI so that C
code can call it with the normal function call ABI.
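
From the C side the helper then looks like an ordinary function.  A later
patch in this series declares and calls it roughly as sketched below (the
call site here is illustrative only; the return value is the TD-Exit reason
reported by TDH.VP.ENTER):

	u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);

	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr_pa, vcpu->arch.regs, 0);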

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/tdx.h |   3 +-
 arch/x86/kvm/vmx/vmenter.S | 156 +++++++++++++++++++++++++++++++++++++
 2 files changed, 158 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
index d7ce2217279f..144b719d607f 100644
--- a/arch/x86/include/asm/tdx.h
+++ b/arch/x86/include/asm/tdx.h
@@ -18,7 +18,8 @@
  * Bits 47:40 == 0xFF indicate Reserved status code class that never used by
  * TDX module.
  */
-#define TDX_ERROR			_BITUL(63)
+#define TDX_ERROR_BIT			63
+#define TDX_ERROR			_BITUL(TDX_ERROR_BIT)
 #define TDX_SW_ERROR			(TDX_ERROR | GENMASK_ULL(47, 40))
 #define TDX_SEAMCALL_VMFAILINVALID	(TDX_SW_ERROR | _UL(0xFFFF0000))
 
diff --git a/arch/x86/kvm/vmx/vmenter.S b/arch/x86/kvm/vmx/vmenter.S
index 766c6b3ef5ed..58516611f31b 100644
--- a/arch/x86/kvm/vmx/vmenter.S
+++ b/arch/x86/kvm/vmx/vmenter.S
@@ -6,6 +6,7 @@
 #include <asm/nospec-branch.h>
 #include <asm/percpu.h>
 #include <asm/segment.h>
+#include <asm/tdx.h>
 #include "kvm-asm-offsets.h"
 #include "run_flags.h"
 
@@ -31,6 +32,12 @@
 #define VCPU_R15	__VCPU_REGS_R15 * WORD_SIZE
 #endif
 
+#ifdef CONFIG_INTEL_TDX_HOST
+#define TDH_VP_ENTER		0
+#define EXIT_REASON_TDCALL	77
+#define seamcall		.byte 0x66,0x0f,0x01,0xcf
+#endif
+
 .section .noinstr.text, "ax"
 
 /**
@@ -352,3 +359,152 @@ SYM_FUNC_START(vmx_do_interrupt_nmi_irqoff)
 	pop %_ASM_BP
 	RET
 SYM_FUNC_END(vmx_do_interrupt_nmi_irqoff)
+
+#ifdef CONFIG_INTEL_TDX_HOST
+
+.pushsection .noinstr.text, "ax"
+
+/**
+ * __tdx_vcpu_run - Call SEAMCALL(TDH_VP_ENTER) to run a TD vcpu
+ * @tdvpr:	physical address of TDVPR
+ * @regs:	void * (to registers of TDVCPU)
+ * @gpr_mask:	non-zero if guest registers need to be loaded prior to TDH_VP_ENTER
+ *
+ * Returns:
+ *	TD-Exit Reason
+ *
+ * Note: KVM doesn't support using XMM in its hypercalls; it's the Hyper-V
+ *	 code's responsibility to save/restore XMM registers on TDVMCALL.
+ */
+SYM_FUNC_START(__tdx_vcpu_run)
+	push %rbp
+	mov  %rsp, %rbp
+
+	push %r15
+	push %r14
+	push %r13
+	push %r12
+	push %rbx
+
+	/* Save @regs, which is needed after TDH_VP_ENTER to capture output. */
+	push %rsi
+
+	/* Load @tdvpr to RCX */
+	mov %rdi, %rcx
+
+	/* No need to load guest GPRs if the last exit wasn't a TDVMCALL. */
+	test %dx, %dx
+	je 1f
+
+	/* Load @regs to RAX, which will be clobbered with $TDH_VP_ENTER anyway. */
+	mov %rsi, %rax
+
+	mov VCPU_RBX(%rax), %rbx
+	mov VCPU_RDX(%rax), %rdx
+	mov VCPU_RBP(%rax), %rbp
+	mov VCPU_RSI(%rax), %rsi
+	mov VCPU_RDI(%rax), %rdi
+
+	mov VCPU_R8 (%rax),  %r8
+	mov VCPU_R9 (%rax),  %r9
+	mov VCPU_R10(%rax), %r10
+	mov VCPU_R11(%rax), %r11
+	mov VCPU_R12(%rax), %r12
+	mov VCPU_R13(%rax), %r13
+	mov VCPU_R14(%rax), %r14
+	mov VCPU_R15(%rax), %r15
+
+	/*  Load TDH_VP_ENTER to RAX.  This kills the @regs pointer! */
+1:	mov $TDH_VP_ENTER, %rax
+
+2:	seamcall
+
+	/*
+	 * Use the same return value convention as tdxcall.S.
+	 * TDX_SEAMCALL_VMFAILINVALID doesn't conflict with any TDX status code.
+	 */
+	jnc 3f
+	mov $TDX_SEAMCALL_VMFAILINVALID, %rax
+	jmp 5f
+3:
+
+	/* Skip to the exit path if TDH_VP_ENTER failed. */
+	bt $TDX_ERROR_BIT, %rax
+	jc 5f
+
+	/* Temporarily save the TD-Exit reason. */
+	push %rax
+
+	/* check if TD-exit due to TDVMCALL */
+	cmp $EXIT_REASON_TDCALL, %ax
+
+	/* Reload @regs to RAX. */
+	mov 8(%rsp), %rax
+
+	/* Jump on non-TDVMCALL */
+	jne 4f
+
+	/* Save all output from SEAMCALL(TDH_VP_ENTER) */
+	mov %rbx, VCPU_RBX(%rax)
+	mov %rbp, VCPU_RBP(%rax)
+	mov %rsi, VCPU_RSI(%rax)
+	mov %rdi, VCPU_RDI(%rax)
+	mov %r10, VCPU_R10(%rax)
+	mov %r11, VCPU_R11(%rax)
+	mov %r12, VCPU_R12(%rax)
+	mov %r13, VCPU_R13(%rax)
+	mov %r14, VCPU_R14(%rax)
+	mov %r15, VCPU_R15(%rax)
+
+4:	mov %rcx, VCPU_RCX(%rax)
+	mov %rdx, VCPU_RDX(%rax)
+	mov %r8,  VCPU_R8 (%rax)
+	mov %r9,  VCPU_R9 (%rax)
+
+	/*
+	 * Clear all general purpose registers except RSP and RAX to prevent
+	 * speculative use of the guest's values.
+	 */
+	xor %rbx, %rbx
+	xor %rcx, %rcx
+	xor %rdx, %rdx
+	xor %rsi, %rsi
+	xor %rdi, %rdi
+	xor %rbp, %rbp
+	xor %r8,  %r8
+	xor %r9,  %r9
+	xor %r10, %r10
+	xor %r11, %r11
+	xor %r12, %r12
+	xor %r13, %r13
+	xor %r14, %r14
+	xor %r15, %r15
+
+	/* Restore the TD-Exit reason to RAX for return. */
+	pop %rax
+
+	/* "POP" @regs. */
+5:	add $8, %rsp
+	pop %rbx
+	pop %r12
+	pop %r13
+	pop %r14
+	pop %r15
+
+	pop %rbp
+	RET
+
+6:	cmpb $0, kvm_rebooting
+	je 1f
+	mov $TDX_SW_ERROR, %r12
+	orq %r12, %rax
+	jmp 5b
+1:	ud2
+	/* Use FAULT version to know what fault happened. */
+	_ASM_EXTABLE_FAULT(2b, 6b)
+
+SYM_FUNC_END(__tdx_vcpu_run)
+
+.popsection
+
+#endif
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 065/113] KVM: TDX: Implement TDX vcpu enter/exit path
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (63 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 064/113] KVM: TDX: Add helper assembly function to TDX vcpu isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 066/113] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
                   ` (47 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement running a TDX vcpu.  Once a vcpu runs on a logical processor
(LP), the TDX vcpu is associated with it.  When the TDX vcpu moves to
another LP, its state on the old LP needs to be flushed.  When destroying a
TDX vcpu, the flush needs to be completed and the CPU memory cache flushed.
Track which LP each TDX vcpu ran on and flush it as necessary.

Do nothing on the sched_in event as TDX doesn't support pause-loop exiting.

TDX vcpu execution requires restoring the PMU debug store after returning
to KVM because the TDX module unconditionally resets the value.  To reuse
the existing code, export perf_restore_debug_store.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 21 +++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 32 ++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     | 33 +++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 arch/x86/kvm/x86.c         |  1 +
 5 files changed, 87 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4319f6d7a4da..ac2dc05961b5 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -100,6 +100,23 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		/* Unconditionally continue to vcpu_run(). */
+		return 1;
+
+	return vmx_vcpu_pre_run(vcpu);
+}
+
+static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_run(vcpu);
+
+	return vmx_vcpu_run(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -241,8 +258,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.flush_tlb_gva = vt_flush_tlb_gva,
 	.flush_tlb_guest = vt_flush_tlb_guest,
 
-	.vcpu_pre_run = vmx_vcpu_pre_run,
-	.vcpu_run = vmx_vcpu_run,
+	.vcpu_pre_run = vt_vcpu_pre_run,
+	.vcpu_run = vt_vcpu_run,
 	.handle_exit = vmx_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d04e2729d2a3..53a8c6fcc263 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -10,6 +10,9 @@
 #include "x86.h"
 #include "mmu.h"
 
+#include <trace/events/kvm.h>
+#include "trace.h"
+
 #undef pr_fmt
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
@@ -419,6 +422,35 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
+
+static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
+					struct vcpu_tdx *tdx)
+{
+	guest_enter_irqoff();
+	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr_pa, vcpu->arch.regs, 0);
+	guest_exit_irqoff();
+}
+
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (unlikely(vcpu->kvm->vm_bugged)) {
+		tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
+		return EXIT_FASTPATH_NONE;
+	}
+
+	trace_kvm_entry(vcpu);
+
+	tdx_vcpu_enter_exit(vcpu, tdx);
+
+	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
+	trace_kvm_exit(vcpu, KVM_ISA_VMX);
+
+	return EXIT_FASTPATH_NONE;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index d8b4102c67fc..4912fbeed1c4 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -25,12 +25,45 @@ struct kvm_tdx {
 	u64 tsc_offset;
 };
 
+union tdx_exit_reason {
+	struct {
+		/* 31:0 mirror the VMX Exit Reason format */
+		u64 basic		: 16;
+		u64 reserved16		: 1;
+		u64 reserved17		: 1;
+		u64 reserved18		: 1;
+		u64 reserved19		: 1;
+		u64 reserved20		: 1;
+		u64 reserved21		: 1;
+		u64 reserved22		: 1;
+		u64 reserved23		: 1;
+		u64 reserved24		: 1;
+		u64 reserved25		: 1;
+		u64 bus_lock_detected	: 1;
+		u64 enclave_mode	: 1;
+		u64 smi_pending_mtf	: 1;
+		u64 smi_from_vmx_root	: 1;
+		u64 reserved30		: 1;
+		u64 failed_vmentry	: 1;
+
+		/* 63:32 are TDX specific */
+		u64 details_l1		: 8;
+		u64 class		: 8;
+		u64 reserved61_48	: 14;
+		u64 non_recoverable	: 1;
+		u64 error		: 1;
+	};
+	u64 full;
+};
+
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
 	unsigned long tdvpr_pa;
 	unsigned long *tdvpx_pa;
 
+	union tdx_exit_reason exit_reason;
+
 	bool vcpu_initialized;
 
 	/*
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index d903e0f606d3..b9b2d4fd99e5 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -154,6 +154,7 @@ void tdx_vm_free(struct kvm *kvm);
 int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
+fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
@@ -177,6 +178,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
 static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
+static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c4579e696d39..7785225f03ec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -305,6 +305,7 @@ const struct kvm_stats_header kvm_vcpu_stats_header = {
 };
 
 u64 __read_mostly host_xcr0;
+EXPORT_SYMBOL_GPL(host_xcr0);
 
 static struct kmem_cache *x86_emulator_cache;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 066/113] KVM: TDX: vcpu_run: save/restore host state(host kernel gs)
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (64 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 065/113] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 067/113] KVM: TDX: restore host xsave state when exit from the guest TD isaku.yamahata
                   ` (46 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

On entering/exiting a TDX vcpu, the preserved or clobbered CPU state
differs from the VMX case.  Add TDX hooks to save/restore host/guest CPU
state.  Save/restore the kernel GS base MSR.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/main.c    | 28 ++++++++++++++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 40 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |  4 ++++
 arch/x86/kvm/vmx/x86_ops.h |  4 ++++
 4 files changed, 74 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index ac2dc05961b5..f4b20974199f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -100,6 +100,30 @@ static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	return vmx_vcpu_reset(vcpu, init_event);
 }
 
+static void vt_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * All host state is saved/restored across SEAMCALL/SEAMRET, and the
+	 * guest state of a TD is obviously off limits.  Deferring MSRs and DRs
+	 * is pointless because the TDX module needs to load *something* so as
+	 * not to expose guest state.
+	 */
+	if (is_td_vcpu(vcpu)) {
+		tdx_prepare_switch_to_guest(vcpu);
+		return;
+	}
+
+	vmx_prepare_switch_to_guest(vcpu);
+}
+
+static void vt_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_put(vcpu);
+
+	return vmx_vcpu_put(vcpu);
+}
+
 static int vt_vcpu_pre_run(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -223,9 +247,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_free = vt_vcpu_free,
 	.vcpu_reset = vt_vcpu_reset,
 
-	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
+	.prepare_switch_to_guest = vt_prepare_switch_to_guest,
 	.vcpu_load = vmx_vcpu_load,
-	.vcpu_put = vmx_vcpu_put,
+	.vcpu_put = vt_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
 	.get_msr_feature = vmx_get_msr_feature,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 53a8c6fcc263..854aa4af4937 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1,5 +1,6 @@
 // SPDX-License-Identifier: GPL-2.0
 #include <linux/cpu.h>
+#include <linux/mmu_context.h>
 
 #include <asm/tdx.h>
 
@@ -333,6 +334,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 {
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
 	struct kvm_cpuid_entry2 *e;
 
 	/*
@@ -372,9 +374,45 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_state_protected =
 		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
 
+	tdx->host_state_need_save = true;
+	tdx->host_state_need_restore = false;
+
 	return 0;
 }
 
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (!tdx->host_state_need_save)
+		return;
+
+	if (likely(is_64bit_mm(current->mm)))
+		tdx->msr_host_kernel_gs_base = current->thread.gsbase;
+	else
+		tdx->msr_host_kernel_gs_base = read_msr(MSR_KERNEL_GS_BASE);
+
+	tdx->host_state_need_save = false;
+}
+
+static void tdx_prepare_switch_to_host(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	tdx->host_state_need_save = true;
+	if (!tdx->host_state_need_restore)
+		return;
+
+	wrmsrl(MSR_KERNEL_GS_BASE, tdx->msr_host_kernel_gs_base);
+	tdx->host_state_need_restore = false;
+}
+
+void tdx_vcpu_put(struct kvm_vcpu *vcpu)
+{
+	vmx_vcpu_pi_put(vcpu);
+	tdx_prepare_switch_to_host(vcpu);
+}
+
 void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -445,6 +483,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
+	tdx->host_state_need_restore = true;
+
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 4912fbeed1c4..63916388fdcf 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -66,6 +66,10 @@ struct vcpu_tdx {
 
 	bool vcpu_initialized;
 
+	bool host_state_need_save;
+	bool host_state_need_restore;
+	u64 msr_host_kernel_gs_base;
+
 	/*
 	 * Dummy to make pmu_intel not corrupt memory.
 	 * TODO: Support PMU for TDX.  Future work.
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index b9b2d4fd99e5..f5ee5efd7cf6 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -155,6 +155,8 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu);
 void tdx_vcpu_free(struct kvm_vcpu *vcpu);
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
+void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
+void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
@@ -179,6 +181,8 @@ static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
 static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
+static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 067/113] KVM: TDX: restore host xsave state when exit from the guest TD
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (65 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 066/113] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 068/113] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
                   ` (45 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

On exiting from the guest TD, xsave state is clobbered.  Restore xsave
state on TD exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 854aa4af4937..7bd47a76c96c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2,6 +2,7 @@
 #include <linux/cpu.h>
 #include <linux/mmu_context.h>
 
+#include <asm/fpu/xcr.h>
 #include <asm/tdx.h>
 
 #include "capabilities.h"
@@ -460,6 +461,22 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
+
+	if (static_cpu_has(X86_FEATURE_XSAVE) &&
+	    host_xcr0 != (kvm_tdx->xfam & kvm_caps.supported_xcr0))
+		xsetbv(XCR_XFEATURE_ENABLED_MASK, host_xcr0);
+	if (static_cpu_has(X86_FEATURE_XSAVES) &&
+	    /* PT can be exposed to TD guest regardless of KVM's XSS support */
+	    host_xss != (kvm_tdx->xfam & (kvm_caps.supported_xss | XFEATURE_MASK_PT)))
+		wrmsrl(MSR_IA32_XSS, host_xss);
+	if (static_cpu_has(X86_FEATURE_PKU) &&
+	    (kvm_tdx->xfam & XFEATURE_MASK_PKRU))
+		write_pkru(vcpu->arch.host_pkru);
+}
+
 u64 __tdx_vcpu_run(hpa_t tdvpr, void *regs, u32 regs_mask);
 
 static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
@@ -483,6 +500,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
+	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 068/113] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (66 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 067/113] KVM: TDX: restore host xsave state when exit from the guest TD isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 069/113] KVM: TDX: restore user ret MSRs isaku.yamahata
                   ` (44 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Chao Gao

From: Chao Gao <chao.gao@intel.com>

Several MSRs are constant and only used in userspace (ring 3), but VMs may
have different values.  KVM uses kvm_set_user_return_msr() to switch to the
guest's values and leverages the user return notifier to restore them when
the kernel is about to return to userspace.  To eliminate unnecessary
wrmsr, KVM also caches the value it last wrote to an MSR.

The TDX module unconditionally resets some of these MSRs to architectural
INIT state on TD exit.  This makes the cached values in kvm_user_return_msrs
inconsistent with the values in hardware.  The inconsistency needs to be
fixed; otherwise it may mislead kvm_on_user_return() into skipping the
restoration of some MSRs to the host's values.  kvm_set_user_return_msr()
can correct this case, but it is not optimal as it always does a wrmsr.
So, introduce a variation of kvm_set_user_return_msr() that only updates
the cached value and skips the wrmsr.
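
For example, after a TD exit the hardware already holds the architectural
reset value of such an MSR, so the caller only needs to refresh KVM's
cached value instead of doing a wrmsr; 'slot' and 'init_val' below are
illustrative placeholders:

	/* hardware already holds init_val after TD exit; no wrmsr needed */
	kvm_user_return_update_cache(slot, init_val);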

Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/asm/kvm_host.h |  1 +
 arch/x86/kvm/x86.c              | 25 ++++++++++++++++++++-----
 2 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 487ff9f4fe1a..91093622f7ba 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2154,6 +2154,7 @@ int kvm_pv_send_ipi(struct kvm *kvm, unsigned long ipi_bitmap_low,
 int kvm_add_user_return_msr(u32 msr);
 int kvm_find_user_return_msr(u32 msr);
 int kvm_set_user_return_msr(unsigned index, u64 val, u64 mask);
+void kvm_user_return_update_cache(unsigned int index, u64 val);
 
 static inline bool kvm_is_supported_user_return_msr(u32 msr)
 {
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7785225f03ec..a3da2526dfda 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -430,6 +430,15 @@ static void kvm_user_return_msr_cpu_online(void)
 	}
 }
 
+static void kvm_user_return_register_notifier(struct kvm_user_return_msrs *msrs)
+{
+	if (!msrs->registered) {
+		msrs->urn.on_user_return = kvm_on_user_return;
+		user_return_notifier_register(&msrs->urn);
+		msrs->registered = true;
+	}
+}
+
 int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 {
 	unsigned int cpu = smp_processor_id();
@@ -444,15 +453,21 @@ int kvm_set_user_return_msr(unsigned slot, u64 value, u64 mask)
 		return 1;
 
 	msrs->values[slot].curr = value;
-	if (!msrs->registered) {
-		msrs->urn.on_user_return = kvm_on_user_return;
-		user_return_notifier_register(&msrs->urn);
-		msrs->registered = true;
-	}
+	kvm_user_return_register_notifier(msrs);
 	return 0;
 }
 EXPORT_SYMBOL_GPL(kvm_set_user_return_msr);
 
+/* Update the cache, "curr", and register the notifier */
+void kvm_user_return_update_cache(unsigned int slot, u64 value)
+{
+	struct kvm_user_return_msrs *msrs = this_cpu_ptr(user_return_msrs);
+
+	msrs->values[slot].curr = value;
+	kvm_user_return_register_notifier(msrs);
+}
+EXPORT_SYMBOL_GPL(kvm_user_return_update_cache);
+
 static void drop_user_return_notifiers(void)
 {
 	unsigned int cpu = smp_processor_id();
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 069/113] KVM: TDX: restore user ret MSRs
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (67 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 068/113] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 070/113] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls isaku.yamahata
                   ` (43 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Several user ret MSRs are clobbered on TD exit.  Restore those values on
TD exit and before returning to ring 3.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx.c | 43 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 7bd47a76c96c..4bd651b31172 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -461,6 +461,28 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+struct tdx_uret_msr {
+	u32 msr;
+	unsigned int slot;
+	u64 defval;
+};
+
+static struct tdx_uret_msr tdx_uret_msrs[] = {
+	{.msr = MSR_SYSCALL_MASK,},
+	{.msr = MSR_STAR,},
+	{.msr = MSR_LSTAR,},
+	{.msr = MSR_TSC_AUX,},
+};
+
+static void tdx_user_return_update_cache(void)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++)
+		kvm_user_return_update_cache(tdx_uret_msrs[i].slot,
+					     tdx_uret_msrs[i].defval);
+}
+
 static void tdx_restore_host_xsave_state(struct kvm_vcpu *vcpu)
 {
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
@@ -500,6 +522,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
+	tdx_user_return_update_cache();
 	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
@@ -1581,6 +1604,26 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
 		return -EINVAL;
 	}
 
+	for (i = 0; i < ARRAY_SIZE(tdx_uret_msrs); i++) {
+		/*
+		 * Check if the MSRs (tdx_uret_msrs) can be saved/restored
+		 * before returning to user space.
+		 *
+		 * this_cpu_ptr(user_return_msrs)->registered isn't checked
+		 * because the registration is done at vcpu runtime by
+		 * kvm_set_user_return_msr().  This code path sets up the CPU
+		 * feature before any vcpu runs, so 'registered' is still
+		 * false here.
+		 */
+		tdx_uret_msrs[i].slot = kvm_find_user_return_msr(tdx_uret_msrs[i].msr);
+		if (tdx_uret_msrs[i].slot == -1) {
+			/* If any MSR isn't supported, it is a KVM bug */
+			pr_err("MSR %x isn't included by kvm_find_user_return_msr\n",
+				tdx_uret_msrs[i].msr);
+			return -EIO;
+		}
+	}
+
 	max_pkgs = topology_max_packages();
 	tdx_mng_key_config_lock = kcalloc(max_pkgs, sizeof(*tdx_mng_key_config_lock),
 				   GFP_KERNEL);
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 070/113] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (68 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 069/113] KVM: TDX: restore user ret MSRs isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 071/113] KVM: TDX: complete interrupts after tdexit isaku.yamahata
                   ` (42 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit is to mark the start of patch series of TD vcpu
exits, interrupts, and hypercalls.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/intel-tdx-layer-status.rst | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
index 58bff496abda..010c387ef5cc 100644
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ b/Documentation/virt/kvm/intel-tdx-layer-status.rst
@@ -13,6 +13,7 @@ What qemu can do
 - Qemu can create/destroy vcpu of TDX vm type.
 - Qemu can populate initial guest memory image.
 - Qemu can finalize guest TD.
+- Qemu can start to run vcpus, but a vcpu cannot make progress yet.
 
 Patch Layer status
 ------------------
@@ -23,7 +24,7 @@ Patch Layer status
 * TD vcpu creation/destruction:         Applied
 * TDX EPT violation:                    Applied
 * TD finalization:                      Applied
-* TD vcpu enter/exit:                   Applying
+* TD vcpu enter/exit:                   Applied
 * TD vcpu interrupts/exit/hypercall:    Not yet
 
 * KVM MMU GPA shared bits:              Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 071/113] KVM: TDX: complete interrupts after tdexit
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (69 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 070/113] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 072/113] KVM: TDX: restore debug store when TD exit isaku.yamahata
                   ` (41 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This corresponds to VMX's __vmx_complete_interrupts().  Because TDX
virtualizes the vAPIC, KVM only needs to care about NMI injection.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx.c | 10 ++++++++++
 arch/x86/kvm/vmx/tdx.h |  2 ++
 2 files changed, 12 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4bd651b31172..40546e692222 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -461,6 +461,14 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
 	vcpu->kvm->vm_bugged = true;
 }
 
+static void tdx_complete_interrupts(struct kvm_vcpu *vcpu)
+{
+	/* Avoid costly SEAMCALL if no NMI was injected */
+	if (vcpu->arch.nmi_injected)
+		vcpu->arch.nmi_injected = td_management_read8(to_tdx(vcpu),
+							      TD_VCPU_PEND_NMI);
+}
+
 struct tdx_uret_msr {
 	u32 msr;
 	unsigned int slot;
@@ -529,6 +537,8 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	vcpu->arch.regs_avail &= ~VMX_REGS_LAZY_LOAD_SET;
 	trace_kvm_exit(vcpu, KVM_ISA_VMX);
 
+	tdx_complete_interrupts(vcpu);
+
 	return EXIT_FASTPATH_NONE;
 }
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 63916388fdcf..e97cb19a7a50 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -192,6 +192,8 @@ TDX_BUILD_TDVPS_ACCESSORS(16, VMCS, vmcs);
 TDX_BUILD_TDVPS_ACCESSORS(32, VMCS, vmcs);
 TDX_BUILD_TDVPS_ACCESSORS(64, VMCS, vmcs);
 
+TDX_BUILD_TDVPS_ACCESSORS(8, MANAGEMENT, management);
+
 static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
 {
 	struct tdx_module_output out;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 072/113] KVM: TDX: restore debug store when TD exit
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (70 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 071/113] KVM: TDX: complete interrupts after tdexit isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 073/113] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
                   ` (40 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because debug store is clobbered, restore it on TD exit.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/events/intel/ds.c | 1 +
 arch/x86/kvm/vmx/tdx.c     | 1 +
 2 files changed, 2 insertions(+)

diff --git a/arch/x86/events/intel/ds.c b/arch/x86/events/intel/ds.c
index 88e58b6ee73c..4989b35161b6 100644
--- a/arch/x86/events/intel/ds.c
+++ b/arch/x86/events/intel/ds.c
@@ -2350,3 +2350,4 @@ void perf_restore_debug_store(void)
 
 	wrmsrl(MSR_IA32_DS_AREA, (unsigned long)ds);
 }
+EXPORT_SYMBOL_GPL(perf_restore_debug_store);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 40546e692222..f2b92c8bc081 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -531,6 +531,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
 	tdx_user_return_update_cache();
+	perf_restore_debug_store();
 	tdx_restore_host_xsave_state(vcpu);
 	tdx->host_state_need_restore = true;
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 073/113] KVM: TDX: handle vcpu migration over logical processor
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (71 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 072/113] KVM: TDX: restore debug store when TD exit isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 074/113] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
                   ` (39 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

For vcpu migration, in the VMX case the VMCS is flushed on the source pcpu
and loaded on the target pcpu.  There are corresponding TDX SEAMCALL APIs;
call them on vcpu migration.  The logic is mostly the same as VMX except
that the TDX SEAMCALLs are used.

When shutting down the machine, (VMX or TDX) vcpus need to be shut down on
each pcpu.  Do the same for TDX with the TDX SEAMCALL APIs.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    |  35 ++++++++-
 arch/x86/kvm/vmx/tdx.c     | 152 +++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/tdx.h     |   2 +
 arch/x86/kvm/vmx/x86_ops.h |   7 ++
 4 files changed, 193 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f4b20974199f..7be68c3b345b 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -17,6 +17,13 @@ static bool vt_is_vm_type_supported(unsigned long type)
 		(enable_tdx && tdx_is_vm_type_supported(type));
 }
 
+static void vt_hardware_disable(void)
+{
+	/* Note, TDX *and* VMX need to be disabled if TDX is enabled. */
+	tdx_hardware_disable();
+	vmx_hardware_disable();
+}
+
 static __init int vt_hardware_setup(void)
 {
 	int ret;
@@ -141,6 +148,14 @@ static fastpath_t vt_vcpu_run(struct kvm_vcpu *vcpu)
 	return vmx_vcpu_run(vcpu);
 }
 
+static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_vcpu_load(vcpu, cpu);
+
+	return vmx_vcpu_load(vcpu, cpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -199,6 +214,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
 }
 
+static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_sched_in(vcpu, cpu);
+}
+
 static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	if (is_td_vcpu(vcpu))
@@ -232,7 +255,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.offline_cpu = tdx_offline_cpu,
 
 	.hardware_enable = vmx_hardware_enable,
-	.hardware_disable = vmx_hardware_disable,
+	.hardware_disable = vt_hardware_disable,
 	.has_emulated_msr = vmx_has_emulated_msr,
 
 	.is_vm_type_supported = vt_is_vm_type_supported,
@@ -248,7 +271,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_reset = vt_vcpu_reset,
 
 	.prepare_switch_to_guest = vt_prepare_switch_to_guest,
-	.vcpu_load = vmx_vcpu_load,
+	.vcpu_load = vt_vcpu_load,
 	.vcpu_put = vt_vcpu_put,
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
@@ -336,7 +359,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.request_immediate_exit = vmx_request_immediate_exit,
 
-	.sched_in = vmx_sched_in,
+	.sched_in = vt_sched_in,
 
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
 	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
@@ -404,6 +427,10 @@ static int __init vt_init(void)
 	if (r)
 		goto err_vmx_init;
 
+	r = tdx_init();
+	if (r)
+		goto err_tdx_init;
+
 	/*
 	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
 	 * exposed to userspace!
@@ -426,6 +453,8 @@ static int __init vt_init(void)
 	return 0;
 
 err_kvm_init:
+	/* tdx_exit() is not defined. */
+err_tdx_init:
 	vmx_exit();
 err_vmx_init:
 	kvm_x86_vendor_exit();
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index f2b92c8bc081..0eed2481641b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -51,6 +51,14 @@ static DEFINE_MUTEX(tdx_lock);
 static struct mutex *tdx_mng_key_config_lock;
 static atomic_t nr_configured_hkid;
 
+/*
+ * A per-CPU list of TD vCPUs associated with a given CPU.  Used when a CPU
+ * is brought down to invoke TDH_VP_FLUSH on the appropriate TD vCPUs.
+ * Protected by interrupt masking.  This list is manipulated in the process
+ * context of the vcpu and in the IPI callback.  See tdx_flush_vp_on_cpu().
+ */
+static DEFINE_PER_CPU(struct list_head, associated_tdvcpus);
+
 static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 {
 	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
@@ -82,6 +90,31 @@ static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
 	return kvm_tdx->finalized;
 }
 
+static inline void tdx_disassociate_vp(struct kvm_vcpu *vcpu)
+{
+	list_del(&to_tdx(vcpu)->cpu_list);
+
+	/*
+	 * Ensure tdx->cpu_list is updated before setting vcpu->cpu to -1;
+	 * otherwise, a different CPU can see vcpu->cpu = -1 and add the vCPU
+	 * to its list before it is deleted from this CPU's list.
+	 */
+	smp_wmb();
+
+	vcpu->cpu = -1;
+}
+
+void tdx_hardware_disable(void)
+{
+	int cpu = raw_smp_processor_id();
+	struct list_head *tdvcpus = &per_cpu(associated_tdvcpus, cpu);
+	struct vcpu_tdx *tdx, *tmp;
+
+	/* Safe variant needed as tdx_disassociate_vp() deletes the entry. */
+	list_for_each_entry_safe(tdx, tmp, tdvcpus, cpu_list)
+		tdx_disassociate_vp(&tdx->vcpu);
+}
+
 static void tdx_clear_page(unsigned long page_pa)
 {
 	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
@@ -163,6 +196,68 @@ static void tdx_reclaim_td_page(unsigned long td_page_pa)
 	free_page((unsigned long)__va(td_page_pa));
 }
 
+struct tdx_flush_vp_arg {
+	struct kvm_vcpu *vcpu;
+	u64 err;
+};
+
+static void tdx_flush_vp(void *arg_)
+{
+	struct tdx_flush_vp_arg *arg = arg_;
+	struct kvm_vcpu *vcpu = arg->vcpu;
+	u64 err;
+
+	arg->err = 0;
+	lockdep_assert_irqs_disabled();
+
+	/* Task migration can race with CPU offlining. */
+	if (vcpu->cpu != raw_smp_processor_id())
+		return;
+
+	/*
+	 * No need to do TDH_VP_FLUSH if the vCPU hasn't been initialized.  The
+	 * list tracking still needs to be updated so that it's correct if/when
+	 * the vCPU does get initialized.
+	 */
+	if (is_td_vcpu_created(to_tdx(vcpu))) {
+		/*
+		 * No need to retry.  TDX resources needed for TDH.VP.FLUSH are:
+		 * TDVPR as exclusive, TDR as shared, and TDCS as shared.  This
+		 * vp flush function is called when destructing a vcpu/TD or on
+		 * vcpu migration.  No other thread uses TDVPR in those cases.
+		 */
+		err = tdh_vp_flush(to_tdx(vcpu)->tdvpr_pa);
+		if (unlikely(err && err != TDX_VCPU_NOT_ASSOCIATED)) {
+			/*
+			 * This function is called in IPI context. Do not use
+			 * printk to avoid console semaphore.
+			 * The caller prints out the error message, instead.
+			 */
+			if (err)
+				arg->err = err;
+		}
+	}
+
+	tdx_disassociate_vp(vcpu);
+}
+
+static void tdx_flush_vp_on_cpu(struct kvm_vcpu *vcpu)
+{
+	struct tdx_flush_vp_arg arg = {
+		.vcpu = vcpu,
+	};
+	int cpu = vcpu->cpu;
+
+	if (unlikely(cpu == -1))
+		return;
+
+	smp_call_function_single(cpu, tdx_flush_vp, &arg, 1);
+	if (WARN_ON_ONCE(arg.err)) {
+		pr_err("cpu: %d ", cpu);
+		pr_tdx_error(TDH_VP_FLUSH, arg.err, NULL);
+	}
+}
+
 static int tdx_do_tdh_phymem_cache_wb(void *param)
 {
 	u64 err = 0;
@@ -187,6 +282,8 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
 	cpumask_var_t packages;
 	bool cpumask_allocated;
+	struct kvm_vcpu *vcpu;
+	unsigned long j;
 	u64 err;
 	int ret;
 	int i;
@@ -197,6 +294,19 @@ void tdx_mmu_release_hkid(struct kvm *kvm)
 	if (!is_td_created(kvm_tdx))
 		goto free_hkid;
 
+	kvm_for_each_vcpu(j, vcpu, kvm)
+		tdx_flush_vp_on_cpu(vcpu);
+
+	mutex_lock(&tdx_lock);
+	err = tdh_mng_vpflushdone(kvm_tdx->tdr_pa);
+	mutex_unlock(&tdx_lock);
+	if (WARN_ON_ONCE(err)) {
+		pr_tdx_error(TDH_MNG_VPFLUSHDONE, err, NULL);
+		pr_err("tdh_mng_vpflushdone failed. HKID %d is leaked.\n",
+			kvm_tdx->hkid);
+		return;
+	}
+
 	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
 	cpus_read_lock();
 	for_each_online_cpu(i) {
@@ -381,6 +491,26 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (vcpu->cpu == cpu)
+		return;
+
+	tdx_flush_vp_on_cpu(vcpu);
+
+	local_irq_disable();
+	/*
+	 * Pairs with the smp_wmb() in tdx_disassociate_vp() to ensure
+	 * vcpu->cpu is read before tdx->cpu_list.
+	 */
+	smp_rmb();
+
+	list_add(&tdx->cpu_list, &per_cpu(associated_tdvcpus, cpu));
+	local_irq_enable();
+}
+
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
@@ -431,6 +561,19 @@ void tdx_vcpu_free(struct kvm_vcpu *vcpu)
 	}
 	tdx_reclaim_td_page(tdx->tdvpr_pa);
 	tdx->tdvpr_pa = 0;
+
+	/*
+	 * kvm_free_vcpus()
+	 *   -> kvm_unload_vcpu_mmu()
+	 *
+	 * does vcpu_load() for every vcpu after they have already been
+	 * disassociated from the per-CPU list by tdx_vm_teardown().  So
+	 * disassociate them again here; otherwise the freed vcpu data would
+	 * be accessed later when list_{del,add}() is done on the
+	 * associated_tdvcpus list.
+	 */
+	tdx_flush_vp_on_cpu(vcpu);
+	WARN_ON_ONCE(vcpu->cpu != -1);
 }
 
 void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
@@ -1699,3 +1842,12 @@ int tdx_offline_cpu(void)
 			"Delete all TDs in order to offline all CPUs of a package.\n");
 	return ret;
 }
+
+int __init tdx_init(void)
+{
+	int cpu;
+
+	for_each_possible_cpu(cpu)
+		INIT_LIST_HEAD(&per_cpu(associated_tdvcpus, cpu));
+	return 0;
+}
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e97cb19a7a50..6e021ef6a943 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -62,6 +62,8 @@ struct vcpu_tdx {
 	unsigned long tdvpr_pa;
 	unsigned long *tdvpx_pa;
 
+	struct list_head cpu_list;
+
 	union tdx_exit_reason exit_reason;
 
 	bool vcpu_initialized;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index f5ee5efd7cf6..dbbc27519637 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -141,8 +141,11 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
 void vmx_setup_mce(struct kvm_vcpu *vcpu);
 
 #ifdef CONFIG_INTEL_TDX_HOST
+int __init tdx_init(void);
 int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
 void tdx_hardware_unsetup(void);
+void tdx_hardware_enable(void);
+void tdx_hardware_disable(void);
 bool tdx_is_vm_type_supported(unsigned long type);
 int tdx_dev_ioctl(void __user *argp);
 int tdx_offline_cpu(void);
@@ -157,6 +160,7 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
 fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
+void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
@@ -166,8 +170,10 @@ void tdx_flush_tlb(struct kvm_vcpu *vcpu);
 int tdx_sept_tlb_remote_flush(struct kvm *kvm);
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level);
 #else
+static inline int tdx_init(void) { return 0; };
 static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
 static inline void tdx_hardware_unsetup(void) {}
+static inline void tdx_hardware_disable(void) {}
 static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
 static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
 static inline int tdx_offline_cpu(void) { return 0; }
@@ -183,6 +189,7 @@ static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
 static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTPATH_NONE; }
 static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
+static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 074/113] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (72 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 073/113] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 075/113] KVM: TDX: Add support for find pending IRQ in a protected local APIC isaku.yamahata
                   ` (38 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Xiaoyao Li,
	Sean Christopherson, Chao Gao

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a flag, KVM_DEBUGREG_AUTO_SWITCH, to skip saving/restoring DRs
irrespective of any other flags.  TDX-SEAM unconditionally saves and
restores guest DRs and resets them to the architectural INIT state on TD
exit.  So KVM needs to save host DRs before TD entry without restoring
guest DRs, and restore host DRs after TD exit.

Opportunistically convert the KVM_DEBUGREG_* definitions to use BIT().
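
For readers skimming the diff, a condensed, standalone sketch of the
debug-register decision this ends up with in vcpu_enter_guest() follows.
The flag values mirror the patch, while struct vcpu_model, the stub
set_debugreg() and main() are invented purely for illustration and are not
kernel code.

#include <stdio.h>

#define BIT(n)				(1u << (n))
#define KVM_DEBUGREG_BP_ENABLED		BIT(0)
#define KVM_DEBUGREG_WONT_EXIT		BIT(1)
#define KVM_DEBUGREG_AUTO_SWITCH	BIT(2)	/* DR0-3/DR6 switched by the TDX module */

struct vcpu_model {
	unsigned int switch_db_regs;
	unsigned long eff_db[4];
};

static void set_debugreg(unsigned long val, int reg)
{
	printf("DR%d <- %#lx\n", reg, val);
}

/* Mirrors the vcpu_enter_guest() hunk: load guest DRs only when a flag
 * other than AUTO_SWITCH is set, i.e. skip the manual load for TDX vCPUs. */
static void load_guest_debugregs(struct vcpu_model *v)
{
	int i;

	if (!(v->switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH))
		return;

	set_debugreg(0, 7);
	for (i = 0; i < 4; i++)
		set_debugreg(v->eff_db[i], i);
}

int main(void)
{
	struct vcpu_model tdx = { .switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH };
	struct vcpu_model vmx = { .switch_db_regs = KVM_DEBUGREG_BP_ENABLED,
				  .eff_db = { 0x1000, 0, 0, 0 } };

	load_guest_debugregs(&tdx);	/* prints nothing */
	load_guest_debugregs(&vmx);	/* loads DR7 and DR0-3 */
	return 0;
}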

Reported-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Co-developed-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Chao Gao <chao.gao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h | 10 ++++++++--
 arch/x86/kvm/vmx/tdx.c          |  1 +
 arch/x86/kvm/x86.c              | 11 ++++++++---
 3 files changed, 17 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 91093622f7ba..908911e6f0ba 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -606,8 +606,14 @@ struct kvm_pmu {
 struct kvm_pmu_ops;
 
 enum {
-	KVM_DEBUGREG_BP_ENABLED = 1,
-	KVM_DEBUGREG_WONT_EXIT = 2,
+	KVM_DEBUGREG_BP_ENABLED		= BIT(0),
+	KVM_DEBUGREG_WONT_EXIT		= BIT(1),
+	/*
+	 * Guest debug registers (DR0-3 and DR6) are saved/restored by hardware
+	 * on exit from or entry to the guest. KVM needn't switch them. Because
+	 * DR7 is cleared on exit from the guest, DR7 needs to be saved/restored.
+	 */
+	KVM_DEBUGREG_AUTO_SWITCH	= BIT(2),
 };
 
 struct kvm_mtrr_range {
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 0eed2481641b..9158d266a086 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -477,6 +477,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
 
+	vcpu->arch.switch_db_regs = KVM_DEBUGREG_AUTO_SWITCH;
 	vcpu->arch.cr0_guest_owned_bits = -1ul;
 	vcpu->arch.cr4_guest_owned_bits = -1ul;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index a3da2526dfda..384aa282c68f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -10613,7 +10613,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	if (vcpu->arch.guest_fpu.xfd_err)
 		wrmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
 
-	if (unlikely(vcpu->arch.switch_db_regs)) {
+	if (unlikely(vcpu->arch.switch_db_regs & ~KVM_DEBUGREG_AUTO_SWITCH)) {
 		set_debugreg(0, 7);
 		set_debugreg(vcpu->arch.eff_db[0], 0);
 		set_debugreg(vcpu->arch.eff_db[1], 1);
@@ -10656,6 +10656,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 */
 	if (unlikely(vcpu->arch.switch_db_regs & KVM_DEBUGREG_WONT_EXIT)) {
 		WARN_ON(vcpu->guest_debug & KVM_GUESTDBG_USE_HW_BP);
+		WARN_ON(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH);
 		static_call(kvm_x86_sync_dirty_debug_regs)(vcpu);
 		kvm_update_dr0123(vcpu);
 		kvm_update_dr7(vcpu);
@@ -10668,8 +10669,12 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 	 * care about the messed up debug address registers. But if
 	 * we have some of them active, restore the old state.
 	 */
-	if (hw_breakpoint_active())
-		hw_breakpoint_restore();
+	if (hw_breakpoint_active()) {
+		if (!(vcpu->arch.switch_db_regs & KVM_DEBUGREG_AUTO_SWITCH))
+			hw_breakpoint_restore();
+		else
+			set_debugreg(__this_cpu_read(cpu_dr7), 7);
+	}
 
 	vcpu->arch.last_vmentry_cpu = vcpu->cpu;
 	vcpu->arch.last_guest_tsc = kvm_read_l1_tsc(vcpu, rdtsc());
-- 
2.25.1



* [PATCH v11 075/113] KVM: TDX: Add support for find pending IRQ in a protected local APIC
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (73 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 074/113] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 076/113] KVM: x86: Assume timer IRQ was injected if APIC state is protected isaku.yamahata
                   ` (37 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Sean Christopherson <seanjc@google.com>

Add a flag and hook to KVM's local APIC management to support determining
whether or not a TDX guest has a pending IRQ.  For TDX vCPUs, the virtual
APIC page is owned by the TDX module and cannot be accessed by KVM.  As a
result, registers that are virtualized by the CPU, e.g. PPR, cannot be
read or written by KVM.  To deliver interrupts for TDX guests, KVM must
send an IRQ to the CPU on the posted interrupt notification vector.  And
to determine if a TDX vCPU has a pending interrupt, KVM must check whether
there is an outstanding notification.

Return "no interrupt" in kvm_apic_has_interrupt() if the guest APIC is
protected to short-circuit the various other flows that try to pull an
IRQ out of the vAPIC; the only valid operation is querying _if_ an IRQ is
pending, as KVM can't do anything based on _which_ IRQ is pending.

Intentionally omit sanity checks from other flows, e.g. PPR update, so as
not to degrade non-TDX guests with unnecessary checks.  A well-behaved KVM
and userspace will never reach those flows for TDX guests, but reaching
them is not fatal if something does go awry.

Note, this doesn't handle interrupts that have been delivered to the vCPU
but not yet recognized by the core, i.e. interrupts that are sitting in
vmcs.GUEST_INTR_STATUS.  Querying that state requires a SEAMCALL and will
be supported in a future patch.
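
To make the control flow easier to follow, below is a simplified,
standalone model of the query path this patch wires up.  The function
names echo the patch (kvm_cpu_has_interrupt() in irq.c, the protected-APIC
hook backed by pi_has_pending_interrupt()), but the structs and the
pi_notification_on stand-in for the posted-interrupt notification bit are
illustrative assumptions, not the kernel implementation.

#include <stdbool.h>
#include <stdio.h>

struct lapic_model { bool guest_apic_protected; };
struct vcpu_model  { struct lapic_model apic; bool pi_notification_on; };

/* Stand-in for tdx_protected_apic_has_interrupt()/pi_has_pending_interrupt():
 * the only thing KVM can observe is an outstanding PI notification. */
static bool protected_apic_has_interrupt(const struct vcpu_model *v)
{
	return v->pi_notification_on;
}

/* Stand-in for kvm_apic_has_interrupt(): bail out for a protected APIC,
 * the vIRR/PPR live in the TDX module and KVM's copy is meaningless. */
static int apic_has_interrupt(const struct vcpu_model *v)
{
	if (v->apic.guest_apic_protected)
		return -1;
	/* ... normal PPR update and vIRR scan elided ... */
	return -1;
}

/* Mirrors the kvm_cpu_has_interrupt() hunk in irq.c. */
static bool cpu_has_interrupt(const struct vcpu_model *v)
{
	if (v->apic.guest_apic_protected)
		return protected_apic_has_interrupt(v);
	return apic_has_interrupt(v) != -1;
}

int main(void)
{
	struct vcpu_model td = { .apic = { .guest_apic_protected = true },
				 .pi_notification_on = true };

	printf("pending IRQ: %d\n", cpu_has_interrupt(&td));
	return 0;
}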

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  1 +
 arch/x86/kvm/irq.c                 |  3 +++
 arch/x86/kvm/lapic.c               |  3 +++
 arch/x86/kvm/lapic.h               |  2 ++
 arch/x86/kvm/vmx/main.c            | 11 +++++++++++
 arch/x86/kvm/vmx/tdx.c             |  6 ++++++
 arch/x86/kvm/vmx/x86_ops.h         |  2 ++
 8 files changed, 29 insertions(+)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index 99ac85c3c8aa..aeff96543090 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -117,6 +117,7 @@ KVM_X86_OP_OPTIONAL(pi_update_irte)
 KVM_X86_OP_OPTIONAL(pi_start_assignment)
 KVM_X86_OP_OPTIONAL(apicv_post_state_restore)
 KVM_X86_OP_OPTIONAL_RET0(dy_apicv_has_pending_interrupt)
+KVM_X86_OP_OPTIONAL(protected_apic_has_interrupt)
 KVM_X86_OP_OPTIONAL(set_hv_timer)
 KVM_X86_OP_OPTIONAL(cancel_hv_timer)
 KVM_X86_OP(setup_mce)
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 908911e6f0ba..cead782a48b6 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1738,6 +1738,7 @@ struct kvm_x86_ops {
 	void (*pi_start_assignment)(struct kvm *kvm);
 	void (*apicv_post_state_restore)(struct kvm_vcpu *vcpu);
 	bool (*dy_apicv_has_pending_interrupt)(struct kvm_vcpu *vcpu);
+	bool (*protected_apic_has_interrupt)(struct kvm_vcpu *vcpu);
 
 	int (*set_hv_timer)(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
 			    bool *expired);
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index b2c397dd2bc6..fd6af5530c32 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -100,6 +100,9 @@ int kvm_cpu_has_interrupt(struct kvm_vcpu *v)
 	if (kvm_cpu_has_extint(v))
 		return 1;
 
+	if (lapic_in_kernel(v) && v->arch.apic->guest_apic_protected)
+		return static_call(kvm_x86_protected_apic_has_interrupt)(v);
+
 	return kvm_apic_has_interrupt(v) != -1;	/* LAPIC */
 }
 EXPORT_SYMBOL_GPL(kvm_cpu_has_interrupt);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index cfaf1d8c64ca..9e8d768ce83f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -2623,6 +2623,9 @@ int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	if (!kvm_apic_present(vcpu))
 		return -1;
 
+	if (apic->guest_apic_protected)
+		return -1;
+
 	__apic_update_ppr(apic, &ppr);
 	return apic_has_interrupt_for_ppr(apic, ppr);
 }
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 58c3242fcc7a..79075164e96b 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -66,6 +66,8 @@ struct kvm_lapic {
 	bool sw_enabled;
 	bool irr_pending;
 	bool lvt0_in_nmi_mode;
+	/* Select registers in the vAPIC cannot be read/written. */
+	bool guest_apic_protected;
 	/* Number of bits set in ISR. */
 	s16 isr_count;
 	/* The highest vector set in ISR; if -1 - invalid, must scan ISR. */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7be68c3b345b..91696d5d183f 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -34,6 +34,9 @@ static __init int vt_hardware_setup(void)
 
 	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
 
+	if (!enable_tdx)
+		vt_x86_ops.protected_apic_has_interrupt = NULL;
+
 	if (enable_ept)
 		kvm_mmu_set_ept_masks(enable_ept_ad_bits,
 				      cpu_has_vmx_ept_execute_only());
@@ -156,6 +159,13 @@ static void vt_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	return vmx_vcpu_load(vcpu, cpu);
 }
 
+static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	KVM_BUG_ON(!is_td_vcpu(vcpu), vcpu->kvm);
+
+	return tdx_protected_apic_has_interrupt(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -336,6 +346,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.sync_pir_to_irr = vmx_sync_pir_to_irr,
 	.deliver_interrupt = vmx_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
+	.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,
 
 	.set_tss_addr = vmx_set_tss_addr,
 	.set_identity_map_addr = vmx_set_identity_map_addr,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 9158d266a086..3f444e138f52 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -474,6 +474,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 		return -EINVAL;
 
 	fpstate_set_confidential(&vcpu->arch.guest_fpu);
+	vcpu->arch.apic->guest_apic_protected = true;
 
 	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
 
@@ -512,6 +513,11 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 	local_irq_enable();
 }
 
+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	return pi_has_pending_interrupt(vcpu);
+}
+
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index dbbc27519637..32e045480bdb 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -161,6 +161,7 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu);
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
+bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
@@ -190,6 +191,7 @@ static inline fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu) { return EXIT_FASTP
 static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
+static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1



* [PATCH v11 076/113] KVM: x86: Assume timer IRQ was injected if APIC state is protected
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (74 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 075/113] KVM: TDX: Add support for find pending IRQ in a protected local APIC isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 077/113] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c isaku.yamahata
                   ` (36 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Sean Christopherson <seanjc@google.com>

If APIC state is protected, i.e. the vCPU is a TDX guest, assume a timer
IRQ was injected when deciding whether or not to busy wait in the "timer
advanced" path.  The "real" vIRR is not readable/writable, so trying to
query for a pending timer IRQ will return garbage.

Note, TDX can scour the PIR if it wants to be more precise and skip the
"wait" call entirely.

Signed-off-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
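
As a minimal standalone sketch of the resulting check (the struct below is
a stand-in for illustration, not the real struct kvm_lapic):

#include <stdbool.h>

struct lapic_model {
	bool guest_apic_protected;
	/* LVTT/ISR/IRR state elided */
};

/* Shape of lapic_timer_int_injected() after this patch: for a protected
 * APIC the vIRR copy is bogus, so simply report the timer IRQ as injected
 * and leave any precise pending-IRQ check to the caller. */
static bool lapic_timer_int_injected(const struct lapic_model *apic)
{
	if (apic->guest_apic_protected)
		return true;
	/* ... otherwise check the timer vector against ISR/IRR as before ... */
	return false;
}

int main(void)
{
	struct lapic_model protected_apic = { .guest_apic_protected = true };

	return lapic_timer_int_injected(&protected_apic) ? 0 : 1;
}
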
---
 arch/x86/kvm/lapic.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9e8d768ce83f..9e5ffe355847 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1607,8 +1607,17 @@ static void apic_update_lvtt(struct kvm_lapic *apic)
 static bool lapic_timer_int_injected(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
-	u32 reg = kvm_lapic_get_reg(apic, APIC_LVTT);
+	u32 reg;
 
+	/*
+	 * Assume a timer IRQ was "injected" if the APIC is protected.  KVM's
+	 * copy of the vIRR is bogus, it's the responsibility of the caller to
+	 * precisely check whether or not a timer IRQ is pending.
+	 */
+	if (apic->guest_apic_protected)
+		return true;
+
+	reg  = kvm_lapic_get_reg(apic, APIC_LVTT);
 	if (kvm_apic_hw_enabled(apic)) {
 		int vec = reg & APIC_VECTOR_MASK;
 		void *bitmap = apic->regs + APIC_ISR;
-- 
2.25.1



* [PATCH v11 077/113] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (75 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 076/113] KVM: x86: Assume timer IRQ was injected if APIC state is protected isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 078/113] KVM: TDX: Implement interrupt injection isaku.yamahata
                   ` (35 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

As TDX will use posted_interrupt.c, the use of struct vcpu_vmx is a
blocker.  Because the members struct pi_desc pi_desc and struct
list_head pi_wakeup_list are only used in posted_interrupt.c, introduce
a common structure, struct vcpu_pi, and make vcpu_vmx and vcpu_tdx share
the same layout at the top of the structure.

To minimize the diff size, avoid code conversion like
vmx->pi_desc => vmx->common->pi_desc.  Instead, add a compile-time check
that the layout is as expected.
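
A stripped-down, standalone illustration of the layout trick may help.
The member ordering mirrors the patch; the placeholder types and the extra
trailing members are assumptions for the example only.

#include <assert.h>
#include <stddef.h>

struct kvm_vcpu  { int placeholder; };
struct pi_desc   { unsigned long pir[4]; unsigned int control; };
struct list_head { struct list_head *next, *prev; };

/* Common prefix expected by posted_intr.c. */
struct vcpu_pi {
	struct kvm_vcpu vcpu;
	struct pi_desc pi_desc;
	struct list_head pi_wakeup_list;
};

struct vcpu_vmx {
	struct kvm_vcpu vcpu;
	struct pi_desc pi_desc;
	struct list_head pi_wakeup_list;
	unsigned char fail;		/* VMX-only members follow */
};

struct vcpu_tdx {
	struct kvm_vcpu vcpu;
	struct pi_desc pi_desc;
	struct list_head pi_wakeup_list;
	unsigned long tdvpr_pa;		/* TDX-only members follow */
};

/* The cast in vcpu_to_pi_desc() is only valid because the prefixes match;
 * the asserts catch any future divergence at compile time. */
static_assert(offsetof(struct vcpu_pi, pi_desc) == offsetof(struct vcpu_vmx, pi_desc), "");
static_assert(offsetof(struct vcpu_pi, pi_desc) == offsetof(struct vcpu_tdx, pi_desc), "");
static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) == offsetof(struct vcpu_vmx, pi_wakeup_list), "");
static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) == offsetof(struct vcpu_tdx, pi_wakeup_list), "");

struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
{
	return &((struct vcpu_pi *)vcpu)->pi_desc;
}

int main(void)
{
	struct vcpu_tdx tdx = { .tdvpr_pa = 0 };

	return vcpu_to_pi_desc(&tdx.vcpu) == &tdx.pi_desc ? 0 : 1;
}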

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/posted_intr.c | 41 ++++++++++++++++++++++++++--------
 arch/x86/kvm/vmx/posted_intr.h | 11 +++++++++
 arch/x86/kvm/vmx/tdx.c         |  1 +
 arch/x86/kvm/vmx/tdx.h         |  8 +++++++
 arch/x86/kvm/vmx/vmx.h         | 14 +++++++-----
 5 files changed, 60 insertions(+), 15 deletions(-)

diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 94c38bea60e7..92de016852ca 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -11,6 +11,7 @@
 #include "posted_intr.h"
 #include "trace.h"
 #include "vmx.h"
+#include "tdx.h"
 
 /*
  * Maintain a per-CPU list of vCPUs that need to be awakened by wakeup_handler()
@@ -31,9 +32,29 @@ static DEFINE_PER_CPU(struct list_head, wakeup_vcpus_on_cpu);
  */
 static DEFINE_PER_CPU(raw_spinlock_t, wakeup_vcpus_on_cpu_lock);
 
+/*
+ * The layout of the head of struct vcpu_vmx and struct vcpu_tdx must match with
+ * struct vcpu_pi.
+ */
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+	      offsetof(struct vcpu_vmx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+	      offsetof(struct vcpu_vmx, pi_wakeup_list));
+#ifdef CONFIG_INTEL_TDX_HOST
+static_assert(offsetof(struct vcpu_pi, pi_desc) ==
+	      offsetof(struct vcpu_tdx, pi_desc));
+static_assert(offsetof(struct vcpu_pi, pi_wakeup_list) ==
+	      offsetof(struct vcpu_tdx, pi_wakeup_list));
+#endif
+
+static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
+{
+	return (struct vcpu_pi *)vcpu;
+}
+
 static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
-	return &(to_vmx(vcpu)->pi_desc);
+	return &vcpu_to_pi(vcpu)->pi_desc;
 }
 
 static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
@@ -52,8 +73,8 @@ static int pi_try_set_control(struct pi_desc *pi_desc, u64 *pold, u64 new)
 
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 {
-	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+	struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
 	struct pi_desc old, new;
 	unsigned long flags;
 	unsigned int dest;
@@ -90,7 +111,7 @@ void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu)
 	 */
 	if (pi_desc->nv == POSTED_INTR_WAKEUP_VECTOR) {
 		raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
-		list_del(&vmx->pi_wakeup_list);
+		list_del(&vcpu_pi->pi_wakeup_list);
 		raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 	}
 
@@ -145,15 +166,15 @@ static bool vmx_can_use_vtd_pi(struct kvm *kvm)
  */
 static void pi_enable_wakeup_handler(struct kvm_vcpu *vcpu)
 {
-	struct pi_desc *pi_desc = vcpu_to_pi_desc(vcpu);
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
+	struct vcpu_pi *vcpu_pi = vcpu_to_pi(vcpu);
+	struct pi_desc *pi_desc = &vcpu_pi->pi_desc;
 	struct pi_desc old, new;
 	unsigned long flags;
 
 	local_irq_save(flags);
 
 	raw_spin_lock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
-	list_add_tail(&vmx->pi_wakeup_list,
+	list_add_tail(&vcpu_pi->pi_wakeup_list,
 		      &per_cpu(wakeup_vcpus_on_cpu, vcpu->cpu));
 	raw_spin_unlock(&per_cpu(wakeup_vcpus_on_cpu_lock, vcpu->cpu));
 
@@ -190,7 +211,8 @@ static bool vmx_needs_pi_wakeup(struct kvm_vcpu *vcpu)
 	 * notification vector is switched to the one that calls
 	 * back to the pi_wakeup_handler() function.
 	 */
-	return vmx_can_use_ipiv(vcpu) || vmx_can_use_vtd_pi(vcpu->kvm);
+	return (vmx_can_use_ipiv(vcpu) && !is_td_vcpu(vcpu)) ||
+		vmx_can_use_vtd_pi(vcpu->kvm);
 }
 
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
@@ -200,7 +222,8 @@ void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu)
 	if (!vmx_needs_pi_wakeup(vcpu))
 		return;
 
-	if (kvm_vcpu_is_blocking(vcpu) && !vmx_interrupt_blocked(vcpu))
+	if (kvm_vcpu_is_blocking(vcpu) &&
+	    (is_td_vcpu(vcpu) || !vmx_interrupt_blocked(vcpu)))
 		pi_enable_wakeup_handler(vcpu);
 
 	/*
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 26992076552e..2fe8222308b2 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -94,6 +94,17 @@ static inline bool pi_test_sn(struct pi_desc *pi_desc)
 			(unsigned long *)&pi_desc->control);
 }
 
+struct vcpu_pi {
+	struct kvm_vcpu	vcpu;
+
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
+	/* Used if this vCPU is waiting for PI notification wakeup. */
+	struct list_head pi_wakeup_list;
+	/* Until here common layout between vcpu_vmx and vcpu_tdx. */
+};
+
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
 void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3f444e138f52..b6a893d90893 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -475,6 +475,7 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	fpstate_set_confidential(&vcpu->arch.guest_fpu);
 	vcpu->arch.apic->guest_apic_protected = true;
+	INIT_LIST_HEAD(&tdx->pi_wakeup_list);
 
 	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
 
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 6e021ef6a943..e37db607d6d9 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -4,6 +4,7 @@
 
 #ifdef CONFIG_INTEL_TDX_HOST
 
+#include "posted_intr.h"
 #include "pmu_intel.h"
 #include "tdx_ops.h"
 
@@ -59,6 +60,13 @@ union tdx_exit_reason {
 struct vcpu_tdx {
 	struct kvm_vcpu	vcpu;
 
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
+	/* Used if this vCPU is waiting for PI notification wakeup. */
+	struct list_head pi_wakeup_list;
+	/* Until here same layout as struct vcpu_pi. */
+
 	unsigned long tdvpr_pa;
 	unsigned long *tdvpx_pa;
 
diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h
index 1813caeb24d8..0a7ab0a7d604 100644
--- a/arch/x86/kvm/vmx/vmx.h
+++ b/arch/x86/kvm/vmx/vmx.h
@@ -245,6 +245,14 @@ struct nested_vmx {
 
 struct vcpu_vmx {
 	struct kvm_vcpu       vcpu;
+
+	/* Posted interrupt descriptor */
+	struct pi_desc pi_desc;
+
+	/* Used if this vCPU is waiting for PI notification wakeup. */
+	struct list_head pi_wakeup_list;
+	/* Until here same layout as struct vcpu_pi. */
+
 	u8                    fail;
 	u8		      x2apic_msr_bitmap_mode;
 
@@ -314,12 +322,6 @@ struct vcpu_vmx {
 
 	union vmx_exit_reason exit_reason;
 
-	/* Posted interrupt descriptor */
-	struct pi_desc pi_desc;
-
-	/* Used if this vCPU is waiting for PI notification wakeup. */
-	struct list_head pi_wakeup_list;
-
 	/* Support for a guest hypervisor (nested VMX) */
 	struct nested_vmx nested;
 
-- 
2.25.1



* [PATCH v11 078/113] KVM: TDX: Implement interrupt injection
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (76 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 077/113] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 079/113] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
                   ` (34 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX supports interrupt injection into a vcpu with posted interrupts.  Wire
up the corresponding kvm x86 operations to posted interrupts.  Move
kvm_vcpu_trigger_posted_interrupt() from vmx.c to common.h to share the
code.

VMX can inject an interrupt by setting the interrupt information field,
VM_ENTRY_INTR_INFO_FIELD, of the VMCS.  TDX supports interrupt injection
only by posted interrupt.  Ignore the execution paths that access
VM_ENTRY_INTR_INFO_FIELD.

As the cpu state is protected and apicv is enabled for the TDX guest, the
VMM can inject an interrupt by updating the posted interrupt descriptor.
Treat interrupts as always injectable.
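
The delivery path boils down to two atomic test-and-set operations
followed by a notification.  Below is a condensed, standalone model of
__vmx_deliver_posted_interrupt(); C11 atomics stand in for the kernel's
locked bitops and the notify callback stands in for the self-IPI/wakeup,
so treat it as an illustration rather than the actual implementation.

#include <stdatomic.h>
#include <stdbool.h>

#define BITS_PER_LONG	(8 * (int)sizeof(unsigned long))

struct pi_desc_model {
	atomic_ulong pir[4];	/* 256-bit posted-interrupt request bitmap */
	atomic_uint on;		/* outstanding-notification bit */
};

static bool test_and_set_bit_model(atomic_ulong *word, unsigned int bit)
{
	unsigned long mask = 1ul << bit;

	return atomic_fetch_or(word, mask) & mask;
}

/* Record the vector in the PIR, set ON, and only then notify the target:
 * an IPI on the posted-interrupt vector if it is running, a wakeup if not. */
static void deliver_posted_interrupt(struct pi_desc_model *pi, int vector,
				     void (*notify)(void))
{
	if (test_and_set_bit_model(&pi->pir[vector / BITS_PER_LONG],
				   vector % BITS_PER_LONG))
		return;		/* vector already pending */

	if (atomic_fetch_or(&pi->on, 1u) & 1u)
		return;		/* a previous notification already sent the IPI */

	notify();
}

static void notify_stub(void)
{
}

int main(void)
{
	static struct pi_desc_model pi;

	deliver_posted_interrupt(&pi, 32, notify_stub);
	return 0;
}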

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/common.h      | 71 ++++++++++++++++++++++++++
 arch/x86/kvm/vmx/main.c        | 93 ++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/posted_intr.c |  2 +-
 arch/x86/kvm/vmx/posted_intr.h |  2 +
 arch/x86/kvm/vmx/tdx.c         | 25 +++++++++
 arch/x86/kvm/vmx/vmx.c         | 67 +-----------------------
 arch/x86/kvm/vmx/x86_ops.h     |  7 ++-
 7 files changed, 190 insertions(+), 77 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 235908f3e044..747f993cf7de 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,6 +4,7 @@
 
 #include <linux/kvm_host.h>
 
+#include "posted_intr.h"
 #include "mmu.h"
 
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
@@ -30,4 +31,74 @@ static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 	return kvm_mmu_page_fault(vcpu, gpa, error_code, NULL, 0);
 }
 
+static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
+						     int pi_vec)
+{
+#ifdef CONFIG_SMP
+	if (vcpu->mode == IN_GUEST_MODE) {
+		/*
+		 * The vector of the virtual has already been set in the PIR.
+		 * Send a notification event to deliver the virtual interrupt
+		 * unless the vCPU is the currently running vCPU, i.e. the
+		 * event is being sent from a fastpath VM-Exit handler, in
+		 * which case the PIR will be synced to the vIRR before
+		 * re-entering the guest.
+		 *
+		 * When the target is not the running vCPU, the following
+		 * possibilities emerge:
+		 *
+		 * Case 1: vCPU stays in non-root mode. Sending a notification
+		 * event posts the interrupt to the vCPU.
+		 *
+		 * Case 2: vCPU exits to root mode and is still runnable. The
+		 * PIR will be synced to the vIRR before re-entering the guest.
+		 * Sending a notification event is ok as the host IRQ handler
+		 * will ignore the spurious event.
+		 *
+		 * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
+		 * has already synced PIR to vIRR and never blocks the vCPU if
+		 * the vIRR is not empty. Therefore, a blocked vCPU here does
+		 * not wait for any requested interrupts in PIR, and sending a
+		 * notification event also results in a benign, spurious event.
+		 */
+
+		if (vcpu != kvm_get_running_vcpu())
+			apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
+		return;
+	}
+#endif
+	/*
+	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
+	 * otherwise do nothing as KVM will grab the highest priority pending
+	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
+	 */
+	kvm_vcpu_wake_up(vcpu);
+}
+
+/*
+ * Send interrupt to vcpu via posted interrupt way.
+ * 1. If target vcpu is running(non-root mode), send posted interrupt
+ * notification to vcpu and hardware will sync PIR to vIRR atomically.
+ * 2. If target vcpu isn't running(root mode), kick it to pick up the
+ * interrupt from PIR in next vmentry.
+ */
+static inline void __vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu,
+						  struct pi_desc *pi_desc, int vector)
+{
+	if (pi_test_and_set_pir(vector, pi_desc))
+		return;
+
+	/* If a previous notification has sent the IPI, nothing to do.  */
+	if (pi_test_and_set_on(pi_desc))
+		return;
+
+	/*
+	 * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
+	 * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
+	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
+	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
+	 */
+	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+}
+
 #endif /* __KVM_X86_VMX_COMMON_H */
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 91696d5d183f..d5a838e87fb5 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -166,6 +166,34 @@ static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	return tdx_protected_apic_has_interrupt(vcpu);
 }
 
+static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
+{
+	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
+
+	pi_clear_on(pi);
+	memset(pi->pir, 0, sizeof(pi->pir));
+}
+
+static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return -1;
+
+	return vmx_sync_pir_to_irr(vcpu);
+}
+
+static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector)
+{
+	if (is_td_vcpu(apic->vcpu)) {
+		tdx_deliver_interrupt(apic, delivery_mode, trig_mode,
+					     vector);
+		return;
+	}
+
+	vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -232,6 +260,53 @@ static void vt_sched_in(struct kvm_vcpu *vcpu, int cpu)
 	vmx_sched_in(vcpu, cpu);
 }
 
+static void vt_set_interrupt_shadow(struct kvm_vcpu *vcpu, int mask)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+	vmx_set_interrupt_shadow(vcpu, mask);
+}
+
+static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_get_interrupt_shadow(vcpu);
+}
+
+static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_inject_irq(vcpu, reinjected);
+}
+
+static void vt_cancel_injection(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_cancel_injection(vcpu);
+}
+
+static int vt_interrupt_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_interrupt_allowed(vcpu, for_injection);
+}
+
+static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_enable_irq_window(vcpu);
+}
+
 static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	if (is_td_vcpu(vcpu))
@@ -320,31 +395,31 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.handle_exit = vmx_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
-	.set_interrupt_shadow = vmx_set_interrupt_shadow,
-	.get_interrupt_shadow = vmx_get_interrupt_shadow,
+	.set_interrupt_shadow = vt_set_interrupt_shadow,
+	.get_interrupt_shadow = vt_get_interrupt_shadow,
 	.patch_hypercall = vmx_patch_hypercall,
-	.inject_irq = vmx_inject_irq,
+	.inject_irq = vt_inject_irq,
 	.inject_nmi = vmx_inject_nmi,
 	.inject_exception = vmx_inject_exception,
-	.cancel_injection = vmx_cancel_injection,
-	.interrupt_allowed = vmx_interrupt_allowed,
+	.cancel_injection = vt_cancel_injection,
+	.interrupt_allowed = vt_interrupt_allowed,
 	.nmi_allowed = vmx_nmi_allowed,
 	.get_nmi_mask = vmx_get_nmi_mask,
 	.set_nmi_mask = vmx_set_nmi_mask,
 	.enable_nmi_window = vmx_enable_nmi_window,
-	.enable_irq_window = vmx_enable_irq_window,
+	.enable_irq_window = vt_enable_irq_window,
 	.update_cr8_intercept = vmx_update_cr8_intercept,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
 	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
 	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
 	.load_eoi_exitmap = vmx_load_eoi_exitmap,
-	.apicv_post_state_restore = vmx_apicv_post_state_restore,
+	.apicv_post_state_restore = vt_apicv_post_state_restore,
 	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
 	.hwapic_irr_update = vmx_hwapic_irr_update,
 	.hwapic_isr_update = vmx_hwapic_isr_update,
 	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
-	.sync_pir_to_irr = vmx_sync_pir_to_irr,
-	.deliver_interrupt = vmx_deliver_interrupt,
+	.sync_pir_to_irr = vt_sync_pir_to_irr,
+	.deliver_interrupt = vt_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
 	.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,
 
diff --git a/arch/x86/kvm/vmx/posted_intr.c b/arch/x86/kvm/vmx/posted_intr.c
index 92de016852ca..2b2da6c18504 100644
--- a/arch/x86/kvm/vmx/posted_intr.c
+++ b/arch/x86/kvm/vmx/posted_intr.c
@@ -52,7 +52,7 @@ static inline struct vcpu_pi *vcpu_to_pi(struct kvm_vcpu *vcpu)
 	return (struct vcpu_pi *)vcpu;
 }
 
-static inline struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu)
 {
 	return &vcpu_to_pi(vcpu)->pi_desc;
 }
diff --git a/arch/x86/kvm/vmx/posted_intr.h b/arch/x86/kvm/vmx/posted_intr.h
index 2fe8222308b2..0f9983b6910b 100644
--- a/arch/x86/kvm/vmx/posted_intr.h
+++ b/arch/x86/kvm/vmx/posted_intr.h
@@ -105,6 +105,8 @@ struct vcpu_pi {
 	/* Until here common layout between vcpu_vmx and vcpu_tdx. */
 };
 
+struct pi_desc *vcpu_to_pi_desc(struct kvm_vcpu *vcpu);
+
 void vmx_vcpu_pi_load(struct kvm_vcpu *vcpu, int cpu);
 void vmx_vcpu_pi_put(struct kvm_vcpu *vcpu);
 void pi_wakeup_handler(void);
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index b6a893d90893..742f0747d4d0 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -7,6 +7,7 @@
 
 #include "capabilities.h"
 #include "x86_ops.h"
+#include "common.h"
 #include "tdx.h"
 #include "vmx.h"
 #include "x86.h"
@@ -488,6 +489,9 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_state_protected =
 		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
 
+	tdx->pi_desc.nv = POSTED_INTR_VECTOR;
+	tdx->pi_desc.sn = 1;
+
 	tdx->host_state_need_save = true;
 	tdx->host_state_need_restore = false;
 
@@ -498,6 +502,7 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 {
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 
+	vmx_vcpu_pi_load(vcpu, cpu);
 	if (vcpu->cpu == cpu)
 		return;
 
@@ -679,6 +684,12 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	trace_kvm_entry(vcpu);
 
+	if (pi_test_on(&tdx->pi_desc)) {
+		apic->send_IPI_self(POSTED_INTR_VECTOR);
+
+		kvm_wait_lapic_expire(vcpu);
+	}
+
 	tdx_vcpu_enter_exit(vcpu, tdx);
 
 	tdx_user_return_update_cache();
@@ -1010,6 +1021,16 @@ static int tdx_sept_remove_private_spte(struct kvm *kvm, gfn_t gfn,
 	return tdx_sept_drop_private_spte(kvm, gfn, level, pfn);
 }
 
+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector)
+{
+	struct kvm_vcpu *vcpu = apic->vcpu;
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	/* TDX supports only posted interrupt.  No lapic emulation. */
+	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
@@ -1705,6 +1726,10 @@ int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
 	if (ret)
 		return ret;
 
+	td_vmcs_write16(tdx, POSTED_INTR_NV, POSTED_INTR_VECTOR);
+	td_vmcs_write64(tdx, POSTED_INTR_DESC_ADDR, __pa(&tdx->pi_desc));
+	td_vmcs_setbit32(tdx, PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_POSTED_INTR);
+
 	tdx->vcpu_initialized = true;
 	return 0;
 }
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index d67877a7dcc6..051952544375 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -4202,50 +4202,6 @@ void vmx_msr_filter_changed(struct kvm_vcpu *vcpu)
 		pt_update_intercept_for_msr(vcpu);
 }
 
-static inline void kvm_vcpu_trigger_posted_interrupt(struct kvm_vcpu *vcpu,
-						     int pi_vec)
-{
-#ifdef CONFIG_SMP
-	if (vcpu->mode == IN_GUEST_MODE) {
-		/*
-		 * The vector of the virtual has already been set in the PIR.
-		 * Send a notification event to deliver the virtual interrupt
-		 * unless the vCPU is the currently running vCPU, i.e. the
-		 * event is being sent from a fastpath VM-Exit handler, in
-		 * which case the PIR will be synced to the vIRR before
-		 * re-entering the guest.
-		 *
-		 * When the target is not the running vCPU, the following
-		 * possibilities emerge:
-		 *
-		 * Case 1: vCPU stays in non-root mode. Sending a notification
-		 * event posts the interrupt to the vCPU.
-		 *
-		 * Case 2: vCPU exits to root mode and is still runnable. The
-		 * PIR will be synced to the vIRR before re-entering the guest.
-		 * Sending a notification event is ok as the host IRQ handler
-		 * will ignore the spurious event.
-		 *
-		 * Case 3: vCPU exits to root mode and is blocked. vcpu_block()
-		 * has already synced PIR to vIRR and never blocks the vCPU if
-		 * the vIRR is not empty. Therefore, a blocked vCPU here does
-		 * not wait for any requested interrupts in PIR, and sending a
-		 * notification event also results in a benign, spurious event.
-		 */
-
-		if (vcpu != kvm_get_running_vcpu())
-			apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), pi_vec);
-		return;
-	}
-#endif
-	/*
-	 * The vCPU isn't in the guest; wake the vCPU in case it is blocking,
-	 * otherwise do nothing as KVM will grab the highest priority pending
-	 * IRQ via ->sync_pir_to_irr() in vcpu_enter_guest().
-	 */
-	kvm_vcpu_wake_up(vcpu);
-}
-
 static int vmx_deliver_nested_posted_interrupt(struct kvm_vcpu *vcpu,
 						int vector)
 {
@@ -4298,20 +4254,7 @@ static int vmx_deliver_posted_interrupt(struct kvm_vcpu *vcpu, int vector)
 	if (!vcpu->arch.apic->apicv_active)
 		return -1;
 
-	if (pi_test_and_set_pir(vector, &vmx->pi_desc))
-		return 0;
-
-	/* If a previous notification has sent the IPI, nothing to do.  */
-	if (pi_test_and_set_on(&vmx->pi_desc))
-		return 0;
-
-	/*
-	 * The implied barrier in pi_test_and_set_on() pairs with the smp_mb_*()
-	 * after setting vcpu->mode in vcpu_enter_guest(), thus the vCPU is
-	 * guaranteed to see PID.ON=1 and sync the PIR to IRR if triggering a
-	 * posted interrupt "fails" because vcpu->mode != IN_GUEST_MODE.
-	 */
-	kvm_vcpu_trigger_posted_interrupt(vcpu, POSTED_INTR_VECTOR);
+	__vmx_deliver_posted_interrupt(vcpu, &vmx->pi_desc, vector);
 	return 0;
 }
 
@@ -6944,14 +6887,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 	vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
 }
 
-void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu)
-{
-	struct vcpu_vmx *vmx = to_vmx(vcpu);
-
-	pi_clear_on(&vmx->pi_desc);
-	memset(vmx->pi_desc.pir, 0, sizeof(vmx->pi_desc.pir));
-}
-
 void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
 
 static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 32e045480bdb..fa7a431c45da 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -62,7 +62,6 @@ int vmx_check_intercept(struct kvm_vcpu *vcpu,
 bool vmx_apic_init_signal_blocked(struct kvm_vcpu *vcpu);
 void vmx_migrate_timers(struct kvm_vcpu *vcpu);
 void vmx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
-void vmx_apicv_post_state_restore(struct kvm_vcpu *vcpu);
 bool vmx_check_apicv_inhibit_reasons(enum kvm_apicv_inhibit reason);
 void vmx_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr);
 void vmx_hwapic_isr_update(int max_isr);
@@ -164,6 +163,9 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
+void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+			   int trig_mode, int vector);
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -194,6 +196,9 @@ static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
+static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
+					 int trig_mode, int vector) {}
+
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.25.1



* [PATCH v11 079/113] KVM: TDX: Implements vcpu request_immediate_exit
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (77 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 078/113] KVM: TDX: Implement interrupt injection isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 080/113] KVM: TDX: Implement methods to inject NMI isaku.yamahata
                   ` (33 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Now that we are able to inject interrupts into a TDX vcpu, it's ready to
block a TDX vcpu.  Wire up the kvm x86 methods for blocking/unblocking a
vcpu for TDX.  To unblock on pending events, the request_immediate_exit
method is also needed.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/main.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index d5a838e87fb5..7962f03e222d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -307,6 +307,14 @@ static void vt_enable_irq_window(struct kvm_vcpu *vcpu)
 	vmx_enable_irq_window(vcpu);
 }
 
+static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return __kvm_request_immediate_exit(vcpu);
+
+	vmx_request_immediate_exit(vcpu);
+}
+
 static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	if (is_td_vcpu(vcpu))
@@ -443,7 +451,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.check_intercept = vmx_check_intercept,
 	.handle_exit_irqoff = vmx_handle_exit_irqoff,
 
-	.request_immediate_exit = vmx_request_immediate_exit,
+	.request_immediate_exit = vt_request_immediate_exit,
 
 	.sched_in = vt_sched_in,
 
-- 
2.25.1



* [PATCH v11 080/113] KVM: TDX: Implement methods to inject NMI
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (78 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 079/113] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 081/113] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
                   ` (32 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX vcpu control structure defines one bit for a pending NMI so that
the VMM can inject an NMI by setting the bit without knowing the TDX
vcpu's NMI state.  Because the vcpu state is protected, the VMM can't know
the NMI state of a TDX vcpu.  The TDX module handles the actual injection
and the NMI state transitions.

Add the NMI methods and treat NMIs as always injectable.
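
In effect the whole NMI path for a TDX vcpu reduces to setting one bit and
reporting NMIs as always injectable and never masked.  A simplified
standalone model of the vt_* dispatch follows; the struct and the pend_nmi
field are stand-ins for the real vcpu and the TD_VCPU_PEND_NMI management
field, not kernel code.

#include <stdbool.h>

struct vcpu_model {
	bool is_td;
	bool pend_nmi;		/* stand-in for TD_VCPU_PEND_NMI */
	bool vmx_nmi_masked;
};

static void inject_nmi(struct vcpu_model *v)
{
	if (v->is_td) {
		v->pend_nmi = true;	/* td_management_write8(..., 1) in the patch */
		return;
	}
	/* ... VMX: program VM_ENTRY_INTR_INFO_FIELD ... */
}

static bool nmi_allowed(const struct vcpu_model *v)
{
	if (v->is_td)
		return true;		/* the TDX module manages windows/reinjection */
	return !v->vmx_nmi_masked;
}

static bool get_nmi_mask(const struct vcpu_model *v)
{
	if (v->is_td)
		return false;		/* assume unmasked; querying would need a SEAMCALL */
	return v->vmx_nmi_masked;
}

int main(void)
{
	struct vcpu_model td = { .is_td = true };

	inject_nmi(&td);
	return nmi_allowed(&td) && !get_nmi_mask(&td) ? 0 : 1;
}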

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/main.c    | 62 +++++++++++++++++++++++++++++++++++---
 arch/x86/kvm/vmx/tdx.c     |  5 +++
 arch/x86/kvm/vmx/x86_ops.h |  2 ++
 3 files changed, 64 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 7962f03e222d..2b31ae11f46d 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -243,6 +243,58 @@ static void vt_flush_tlb_guest(struct kvm_vcpu *vcpu)
 	vmx_flush_tlb_guest(vcpu);
 }
 
+static void vt_inject_nmi(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_inject_nmi(vcpu);
+
+	vmx_inject_nmi(vcpu);
+}
+
+static int vt_nmi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	/*
+	 * The TDX module manages NMI windows and NMI reinjection, and hides NMI
+	 * blocking, all KVM can do is throw an NMI over the wall.
+	 */
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_nmi_allowed(vcpu, for_injection);
+}
+
+static bool vt_get_nmi_mask(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Assume NMIs are always unmasked.  KVM could query PEND_NMI and treat
+	 * NMIs as masked if a previous NMI is still pending, but SEAMCALLs are
+	 * expensive and the end result is unchanged as the only relevant usage
+	 * of get_nmi_mask() is to limit the number of pending NMIs, i.e. it
+	 * only changes whether KVM or the TDX module drops an NMI.
+	 */
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_get_nmi_mask(vcpu);
+}
+
+static void vt_set_nmi_mask(struct kvm_vcpu *vcpu, bool masked)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_nmi_mask(vcpu, masked);
+}
+
+static void vt_enable_nmi_window(struct kvm_vcpu *vcpu)
+{
+	/* Refer the comment in vt_get_nmi_mask(). */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_enable_nmi_window(vcpu);
+}
+
 static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
 			int pgd_level)
 {
@@ -407,14 +459,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.get_interrupt_shadow = vt_get_interrupt_shadow,
 	.patch_hypercall = vmx_patch_hypercall,
 	.inject_irq = vt_inject_irq,
-	.inject_nmi = vmx_inject_nmi,
+	.inject_nmi = vt_inject_nmi,
 	.inject_exception = vmx_inject_exception,
 	.cancel_injection = vt_cancel_injection,
 	.interrupt_allowed = vt_interrupt_allowed,
-	.nmi_allowed = vmx_nmi_allowed,
-	.get_nmi_mask = vmx_get_nmi_mask,
-	.set_nmi_mask = vmx_set_nmi_mask,
-	.enable_nmi_window = vmx_enable_nmi_window,
+	.nmi_allowed = vt_nmi_allowed,
+	.get_nmi_mask = vt_get_nmi_mask,
+	.set_nmi_mask = vt_set_nmi_mask,
+	.enable_nmi_window = vt_enable_nmi_window,
 	.enable_irq_window = vt_enable_irq_window,
 	.update_cr8_intercept = vmx_update_cr8_intercept,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 742f0747d4d0..95c9a1906b62 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -705,6 +705,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 	return EXIT_FASTPATH_NONE;
 }
 
+void tdx_inject_nmi(struct kvm_vcpu *vcpu)
+{
+	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index fa7a431c45da..a05ae400f1ae 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -165,6 +165,7 @@ u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector);
+void tdx_inject_nmi(struct kvm_vcpu *vcpu);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -198,6 +199,7 @@ static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 
 static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 					 int trig_mode, int vector) {}
+static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1



* [PATCH v11 081/113] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (79 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 080/113] KVM: TDX: Implement methods to inject NMI isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 082/113] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
                   ` (31 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX uses a different ABI to get information about a VM exit.  Pass
intr_info to the NMI and INTR handlers instead of pulling it from
vcpu_vmx, in preparation for sharing the bulk of the handlers with TDX.

When the guest TD exits to the VMM, RAX holds the status and exit reason,
RCX holds the exit qualification, etc., rather than the VMCS fields,
because the VMM doesn't have access to the VMCS.  The eventual code will
be:

VMX:
  - get exit reason, intr_info, exit_qualification, and etc from VMCS
  - call NMI/INTR handlers (common code)

TDX:
  - get exit reason, intr_info, exit_qualification, and etc from guest
    registers
  - call NMI/INTR handlers (common code)
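
A schematic of that end state, as standalone C: the common handler takes
intr_info explicitly, and each caller supplies it from its own source.
The VMCS and guest-register readers below are illustrative stubs with
made-up values, not the real accessors.

#include <stdint.h>
#include <stdio.h>

/* Shared handler: works from an explicit intr_info value instead of
 * reaching back into the VMCS, so a TDX exit path can reuse it. */
static void handle_exception_nmi_irqoff(uint32_t intr_info)
{
	/* ... dispatch on #PF / #NM / #MC / NMI as in vmx.c ... */
	printf("intr_info = %#x\n", intr_info);
}

/* VMX: the value comes from a VMCS field read. */
static uint32_t read_intr_info_from_vmcs(void)
{
	return 0x80000b0e;	/* made-up example value */
}

/* TDX: the value comes from the guest registers saved at TD exit,
 * since the VMM has no access to the VMCS. */
static uint32_t read_intr_info_from_td_exit(const uint64_t *exit_regs)
{
	return (uint32_t)exit_regs[0];
}

int main(void)
{
	uint64_t exit_regs[1] = { 0x80000002 };

	handle_exception_nmi_irqoff(read_intr_info_from_vmcs());
	handle_exception_nmi_irqoff(read_intr_info_from_td_exit(exit_regs));
	return 0;
}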

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/vmx.c | 17 ++++++++---------
 1 file changed, 8 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 051952544375..63cff4d02211 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -6919,28 +6919,27 @@ static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
 		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
 }
 
-static void handle_exception_nmi_irqoff(struct vcpu_vmx *vmx)
+static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
 {
 	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-	u32 intr_info = vmx_get_intr_info(&vmx->vcpu);
 
 	/* if exit due to PF check for async PF */
 	if (is_page_fault(intr_info))
-		vmx->vcpu.arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
 	/* if exit due to NM, handle before interrupts are enabled */
 	else if (is_nm_fault(intr_info))
-		handle_nm_fault_irqoff(&vmx->vcpu);
+		handle_nm_fault_irqoff(vcpu);
 	/* Handle machine checks before interrupts are enabled */
 	else if (is_machine_check(intr_info))
 		kvm_machine_check();
 	/* We need to handle NMIs before interrupts are enabled */
 	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(&vmx->vcpu, nmi_entry);
+		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
 }
 
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu)
+static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+					     u32 intr_info)
 {
-	u32 intr_info = vmx_get_intr_info(vcpu);
 	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
 	gate_desc *desc = (gate_desc *)host_idt_base + vector;
 
@@ -6960,9 +6959,9 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
-		handle_external_interrupt_irqoff(vcpu);
+		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
 	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
-		handle_exception_nmi_irqoff(vmx);
+		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
 /*
-- 
2.25.1



* [PATCH v11 082/113] KVM: VMX: Move NMI/exception handler to common helper
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (80 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 081/113] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 083/113] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
                   ` (30 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX handles NMI/exception exits mostly the same as the VMX case.  The
difference is how the exit qualification is retrieved.  To share the code
with TDX, move the NMI/exception handlers to a common header, common.h.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/common.h | 70 ++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/vmx.c    | 79 ++++-----------------------------------
 2 files changed, 78 insertions(+), 71 deletions(-)

diff --git a/arch/x86/kvm/vmx/common.h b/arch/x86/kvm/vmx/common.h
index 747f993cf7de..65abda49debe 100644
--- a/arch/x86/kvm/vmx/common.h
+++ b/arch/x86/kvm/vmx/common.h
@@ -4,8 +4,78 @@
 
 #include <linux/kvm_host.h>
 
+#include <asm/traps.h>
+
 #include "posted_intr.h"
 #include "mmu.h"
+#include "vmcs.h"
+#include "x86.h"
+
+extern unsigned long vmx_host_idt_base;
+void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
+
+static inline void vmx_handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
+						   unsigned long entry)
+{
+	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
+
+	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
+	vmx_do_interrupt_nmi_irqoff(entry);
+	kvm_after_interrupt(vcpu);
+}
+
+static inline void vmx_handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * Save xfd_err to guest_fpu before interrupt is enabled, so the
+	 * MSR value is not clobbered by the host activity before the guest
+	 * has chance to consume it.
+	 *
+	 * Do not blindly read xfd_err here, since this exception might
+	 * be caused by L1 interception on a platform which doesn't
+	 * support xfd at all.
+	 *
+	 * Do it conditionally upon guest_fpu::xfd. xfd_err matters
+	 * only when xfd contains a non-zero value.
+	 *
+	 * Queuing exception is done in vmx_handle_exit. See comment there.
+	 */
+	if (vcpu->arch.guest_fpu.fpstate->xfd)
+		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
+}
+
+static inline void vmx_handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu,
+						   u32 intr_info)
+{
+	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
+
+	/* if exit due to PF check for async PF */
+	if (is_page_fault(intr_info))
+		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
+	/* if exit due to NM, handle before interrupts are enabled */
+	else if (is_nm_fault(intr_info))
+		vmx_handle_nm_fault_irqoff(vcpu);
+	/* Handle machine checks before interrupts are enabled */
+	else if (is_machine_check(intr_info))
+		kvm_machine_check();
+	/* We need to handle NMIs before interrupts are enabled */
+	else if (is_nmi(intr_info))
+		vmx_handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
+}
+
+static inline void vmx_handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
+							u32 intr_info)
+{
+	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
+	gate_desc *desc = (gate_desc *)vmx_host_idt_base + vector;
+
+	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
+	    "unexpected VM-Exit interrupt info: 0x%x", intr_info))
+		return;
+
+	vmx_handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
+	vcpu->arch.at_instruction_boundary = true;
+}
 
 static inline int __vmx_handle_ept_violation(struct kvm_vcpu *vcpu, gpa_t gpa,
 					     unsigned long exit_qualification)
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 63cff4d02211..7c8522628dd3 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -526,7 +526,7 @@ static inline void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
 	vmx->segment_cache.bitmask = 0;
 }
 
-static unsigned long host_idt_base;
+unsigned long vmx_host_idt_base;
 
 #if IS_ENABLED(CONFIG_HYPERV)
 static bool __read_mostly enlightened_vmcs = true;
@@ -4318,7 +4318,7 @@ void vmx_set_constant_host_state(struct vcpu_vmx *vmx)
 	vmcs_write16(HOST_SS_SELECTOR, __KERNEL_DS);  /* 22.2.4 */
 	vmcs_write16(HOST_TR_SELECTOR, GDT_ENTRY_TSS*8);  /* 22.2.4 */
 
-	vmcs_writel(HOST_IDTR_BASE, host_idt_base);   /* 22.2.4 */
+	vmcs_writel(HOST_IDTR_BASE, vmx_host_idt_base);   /* 22.2.4 */
 
 	vmcs_writel(HOST_RIP, (unsigned long)vmx_vmexit); /* 22.2.5 */
 
@@ -5209,10 +5209,10 @@ static int handle_exception_nmi(struct kvm_vcpu *vcpu)
 	intr_info = vmx_get_intr_info(vcpu);
 
 	if (is_machine_check(intr_info) || is_nmi(intr_info))
-		return 1; /* handled by handle_exception_nmi_irqoff() */
+		return 1; /* handled by vmx_handle_exception_nmi_irqoff() */
 
 	/*
-	 * Queue the exception here instead of in handle_nm_fault_irqoff().
+	 * Queue the exception here instead of in vmx_handle_nm_fault_irqoff().
 	 * This ensures the nested_vmx check is not skipped so vmexit can
 	 * be reflected to L1 (when it intercepts #NM) before reaching this
 	 * point.
@@ -6887,70 +6887,6 @@ void vmx_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 	vmcs_write64(EOI_EXIT_BITMAP3, eoi_exit_bitmap[3]);
 }
 
-void vmx_do_interrupt_nmi_irqoff(unsigned long entry);
-
-static void handle_interrupt_nmi_irqoff(struct kvm_vcpu *vcpu,
-					unsigned long entry)
-{
-	bool is_nmi = entry == (unsigned long)asm_exc_nmi_noist;
-
-	kvm_before_interrupt(vcpu, is_nmi ? KVM_HANDLING_NMI : KVM_HANDLING_IRQ);
-	vmx_do_interrupt_nmi_irqoff(entry);
-	kvm_after_interrupt(vcpu);
-}
-
-static void handle_nm_fault_irqoff(struct kvm_vcpu *vcpu)
-{
-	/*
-	 * Save xfd_err to guest_fpu before interrupt is enabled, so the
-	 * MSR value is not clobbered by the host activity before the guest
-	 * has chance to consume it.
-	 *
-	 * Do not blindly read xfd_err here, since this exception might
-	 * be caused by L1 interception on a platform which doesn't
-	 * support xfd at all.
-	 *
-	 * Do it conditionally upon guest_fpu::xfd. xfd_err matters
-	 * only when xfd contains a non-zero value.
-	 *
-	 * Queuing exception is done in vmx_handle_exit. See comment there.
-	 */
-	if (vcpu->arch.guest_fpu.fpstate->xfd)
-		rdmsrl(MSR_IA32_XFD_ERR, vcpu->arch.guest_fpu.xfd_err);
-}
-
-static void handle_exception_nmi_irqoff(struct kvm_vcpu *vcpu, u32 intr_info)
-{
-	const unsigned long nmi_entry = (unsigned long)asm_exc_nmi_noist;
-
-	/* if exit due to PF check for async PF */
-	if (is_page_fault(intr_info))
-		vcpu->arch.apf.host_apf_flags = kvm_read_and_reset_apf_flags();
-	/* if exit due to NM, handle before interrupts are enabled */
-	else if (is_nm_fault(intr_info))
-		handle_nm_fault_irqoff(vcpu);
-	/* Handle machine checks before interrupts are enabled */
-	else if (is_machine_check(intr_info))
-		kvm_machine_check();
-	/* We need to handle NMIs before interrupts are enabled */
-	else if (is_nmi(intr_info))
-		handle_interrupt_nmi_irqoff(vcpu, nmi_entry);
-}
-
-static void handle_external_interrupt_irqoff(struct kvm_vcpu *vcpu,
-					     u32 intr_info)
-{
-	unsigned int vector = intr_info & INTR_INFO_VECTOR_MASK;
-	gate_desc *desc = (gate_desc *)host_idt_base + vector;
-
-	if (KVM_BUG(!is_external_intr(intr_info), vcpu->kvm,
-	    "unexpected VM-Exit interrupt info: 0x%x", intr_info))
-		return;
-
-	handle_interrupt_nmi_irqoff(vcpu, gate_offset(desc));
-	vcpu->arch.at_instruction_boundary = true;
-}
-
 void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 {
 	struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -6959,9 +6895,10 @@ void vmx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 		return;
 
 	if (vmx->exit_reason.basic == EXIT_REASON_EXTERNAL_INTERRUPT)
-		handle_external_interrupt_irqoff(vcpu, vmx_get_intr_info(vcpu));
+		vmx_handle_external_interrupt_irqoff(vcpu,
+						     vmx_get_intr_info(vcpu));
 	else if (vmx->exit_reason.basic == EXIT_REASON_EXCEPTION_NMI)
-		handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
+		vmx_handle_exception_nmi_irqoff(vcpu, vmx_get_intr_info(vcpu));
 }
 
 /*
@@ -8253,7 +8190,7 @@ __init int vmx_hardware_setup(void)
 	int r;
 
 	store_idt(&dt);
-	host_idt_base = dt.address;
+	vmx_host_idt_base = dt.address;
 
 	vmx_setup_user_return_msrs();
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 083/113] KVM: x86: Split core of hypercall emulation to helper function
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (81 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 082/113] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 084/113] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
                   ` (29 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

By necessity, TDX will use a different register ABI for hypercalls.
Break out the core functionality so that it may be reused for TDX.
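
For illustration only, a minimal sketch of the intended reuse: a TDX-style
caller gathers the hypercall number and arguments from its own register ABI
and feeds them to the new helper.  The wrapper name below is hypothetical;
the real caller, tdx_emulate_vmcall(), is added later in this series:

/* Hypothetical sketch, not part of this patch.  TDVMCALL passes the KVM
 * hypercall number in R10 and arguments in R11-R14; TDX guests are 64-bit,
 * hence op_64_bit = true.
 */
static int tdx_hypercall_sketch(struct kvm_vcpu *vcpu)
{
	unsigned long nr = kvm_r10_read(vcpu);
	unsigned long a0 = kvm_r11_read(vcpu);
	unsigned long a1 = kvm_r12_read(vcpu);
	unsigned long a2 = kvm_r13_read(vcpu);
	unsigned long a3 = kvm_r14_read(vcpu);
	unsigned long ret;

	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true);
	kvm_r10_write(vcpu, ret);	/* return value goes back in R10 */
	return 1;			/* resume the guest */
}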

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm_host.h |  4 +++
 arch/x86/kvm/x86.c              | 54 ++++++++++++++++++++-------------
 2 files changed, 37 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cead782a48b6..f93d271aba67 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -2074,6 +2074,10 @@ static inline void kvm_clear_apicv_inhibit(struct kvm *kvm,
 	kvm_set_or_clear_apicv_inhibit(kvm, reason, false);
 }
 
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+				      unsigned long a0, unsigned long a1,
+				      unsigned long a2, unsigned long a3,
+				      int op_64_bit);
 int kvm_emulate_hypercall(struct kvm_vcpu *vcpu);
 
 int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa, u64 error_code,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 384aa282c68f..3d1c854b1604 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -9723,26 +9723,15 @@ static int complete_hypercall_exit(struct kvm_vcpu *vcpu)
 	return kvm_skip_emulated_instruction(vcpu);
 }
 
-int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+unsigned long __kvm_emulate_hypercall(struct kvm_vcpu *vcpu, unsigned long nr,
+				      unsigned long a0, unsigned long a1,
+				      unsigned long a2, unsigned long a3,
+				      int op_64_bit)
 {
-	unsigned long nr, a0, a1, a2, a3, ret;
-	int op_64_bit;
-
-	if (kvm_xen_hypercall_enabled(vcpu->kvm))
-		return kvm_xen_hypercall(vcpu);
-
-	if (kvm_hv_hypercall_enabled(vcpu))
-		return kvm_hv_hypercall(vcpu);
-
-	nr = kvm_rax_read(vcpu);
-	a0 = kvm_rbx_read(vcpu);
-	a1 = kvm_rcx_read(vcpu);
-	a2 = kvm_rdx_read(vcpu);
-	a3 = kvm_rsi_read(vcpu);
+	unsigned long ret;
 
 	trace_kvm_hypercall(nr, a0, a1, a2, a3);
 
-	op_64_bit = is_64_bit_hypercall(vcpu);
 	if (!op_64_bit) {
 		nr &= 0xFFFFFFFF;
 		a0 &= 0xFFFFFFFF;
@@ -9751,11 +9740,6 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		a3 &= 0xFFFFFFFF;
 	}
 
-	if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
-		ret = -KVM_EPERM;
-		goto out;
-	}
-
 	ret = -KVM_ENOSYS;
 
 	switch (nr) {
@@ -9814,6 +9798,34 @@ int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
 		ret = -KVM_ENOSYS;
 		break;
 	}
+	return ret;
+}
+EXPORT_SYMBOL_GPL(__kvm_emulate_hypercall);
+
+int kvm_emulate_hypercall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3, ret;
+	int op_64_bit;
+
+	if (kvm_xen_hypercall_enabled(vcpu->kvm))
+		return kvm_xen_hypercall(vcpu);
+
+	if (kvm_hv_hypercall_enabled(vcpu))
+		return kvm_hv_hypercall(vcpu);
+
+	nr = kvm_rax_read(vcpu);
+	a0 = kvm_rbx_read(vcpu);
+	a1 = kvm_rcx_read(vcpu);
+	a2 = kvm_rdx_read(vcpu);
+	a3 = kvm_rsi_read(vcpu);
+	op_64_bit = is_64_bit_hypercall(vcpu);
+
+	if (static_call(kvm_x86_get_cpl)(vcpu) != 0) {
+		ret = -KVM_EPERM;
+		goto out;
+	}
+
+	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, op_64_bit);
 out:
 	if (!op_64_bit)
 		ret = (u32)ret;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 084/113] KVM: TDX: Add a place holder to handle TDX VM exit
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (82 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 083/113] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 085/113] KVM: TDX: Handle vmentry failure for INTEL TD guest isaku.yamahata
                   ` (28 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up the handle_exit and handle_exit_irqoff methods and add a
placeholder to handle TDX VM exits.  Add helper functions to get the exit
info, exit qualification, etc.
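
For reference, a short summary of the register convention that the new
tdexit_*() accessors encode (derived from the diff below, shown here only
as an illustrative comment):

/*
 * After TDH.VP.ENTER returns to KVM, the TDX module reports exit
 * information in saved guest GPRs rather than in VMCS fields:
 *   RCX = exit qualification            (tdexit_exit_qual())
 *   RDX = extended exit qualification   (tdexit_ext_exit_qual())
 *   R8  = faulting guest physical address (tdexit_gpa())
 *   R9  = interrupt information         (tdexit_intr_info())
 */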

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/main.c    | 33 ++++++++++++--
 arch/x86/kvm/vmx/tdx.c     | 88 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h | 10 +++++
 3 files changed, 128 insertions(+), 3 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2b31ae11f46d..f9339d8f95eb 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -166,6 +166,23 @@ static bool vt_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
 	return tdx_protected_apic_has_interrupt(vcpu);
 }
 
+static int vt_handle_exit(struct kvm_vcpu *vcpu,
+			     enum exit_fastpath_completion fastpath)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit(vcpu, fastpath);
+
+	return vmx_handle_exit(vcpu, fastpath);
+}
+
+static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_handle_exit_irqoff(vcpu);
+
+	vmx_handle_exit_irqoff(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -367,6 +384,16 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
 	vmx_request_immediate_exit(vcpu);
 }
 
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+					 error_code);
+
+	return vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
 static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 {
 	if (is_td_vcpu(vcpu))
@@ -452,7 +479,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.vcpu_pre_run = vt_vcpu_pre_run,
 	.vcpu_run = vt_vcpu_run,
-	.handle_exit = vmx_handle_exit,
+	.handle_exit = vt_handle_exit,
 	.skip_emulated_instruction = vmx_skip_emulated_instruction,
 	.update_emulated_instruction = vmx_update_emulated_instruction,
 	.set_interrupt_shadow = vt_set_interrupt_shadow,
@@ -487,7 +514,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_identity_map_addr = vmx_set_identity_map_addr,
 	.get_mt_mask = vt_get_mt_mask,
 
-	.get_exit_info = vmx_get_exit_info,
+	.get_exit_info = vt_get_exit_info,
 
 	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
 
@@ -501,7 +528,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.load_mmu_pgd = vt_load_mmu_pgd,
 
 	.check_intercept = vmx_check_intercept,
-	.handle_exit_irqoff = vmx_handle_exit_irqoff,
+	.handle_exit_irqoff = vt_handle_exit_irqoff,
 
 	.request_immediate_exit = vt_request_immediate_exit,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 95c9a1906b62..964154f7bc60 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -65,6 +65,26 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
 	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
 }
 
+static __always_inline unsigned long tdexit_exit_qual(struct kvm_vcpu *vcpu)
+{
+	return kvm_rcx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_ext_exit_qual(struct kvm_vcpu *vcpu)
+{
+	return kvm_rdx_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_gpa(struct kvm_vcpu *vcpu)
+{
+	return kvm_r8_read(vcpu);
+}
+
+static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
+{
+	return kvm_r9_read(vcpu);
+}
+
 static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
 {
 	return tdx->tdvpr_pa;
@@ -710,6 +730,25 @@ void tdx_inject_nmi(struct kvm_vcpu *vcpu)
 	td_management_write8(to_tdx(vcpu), TD_VCPU_PEND_NMI, 1);
 }
 
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+	u16 exit_reason = tdx->exit_reason.basic;
+
+	if (exit_reason == EXIT_REASON_EXCEPTION_NMI)
+		vmx_handle_exception_nmi_irqoff(vcpu, tdexit_intr_info(vcpu));
+	else if (exit_reason == EXIT_REASON_EXTERNAL_INTERRUPT)
+		vmx_handle_external_interrupt_irqoff(vcpu,
+						     tdexit_intr_info(vcpu));
+}
+
+static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
+{
+	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
+	vcpu->mmio_needed = 0;
+	return 0;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1036,6 +1075,55 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
 }
 
+int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
+{
+	union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
+
+	/* See the comment of tdh_sept_seamcall(). */
+	if (unlikely(exit_reason.full == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT)))
+		return 1;
+
+	if (unlikely(exit_reason.non_recoverable || exit_reason.error)) {
+		if (exit_reason.basic == EXIT_REASON_TRIPLE_FAULT)
+			return tdx_handle_triple_fault(vcpu);
+
+		kvm_pr_unimpl("TD exit 0x%llx, %d hkid 0x%x hkid pa 0x%llx\n",
+			      exit_reason.full, exit_reason.basic,
+			      to_kvm_tdx(vcpu->kvm)->hkid,
+			      set_hkid_to_hpa(0, to_kvm_tdx(vcpu->kvm)->hkid));
+		goto unhandled_exit;
+	}
+
+	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
+
+	switch (exit_reason.basic) {
+	default:
+		break;
+	}
+
+unhandled_exit:
+	vcpu->run->exit_reason = KVM_EXIT_INTERNAL_ERROR;
+	vcpu->run->internal.suberror = KVM_INTERNAL_ERROR_UNEXPECTED_EXIT_REASON;
+	vcpu->run->internal.ndata = 2;
+	vcpu->run->internal.data[0] = exit_reason.full;
+	vcpu->run->internal.data[1] = vcpu->arch.last_vmentry_cpu;
+	return 0;
+}
+
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	*reason = tdx->exit_reason.full;
+
+	*info1 = tdexit_exit_qual(vcpu);
+	*info2 = tdexit_ext_exit_qual(vcpu);
+
+	*intr_info = tdexit_intr_info(vcpu);
+	*error_code = 0;
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index a05ae400f1ae..38fd5c3eee2f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -161,11 +161,16 @@ void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu);
 void tdx_vcpu_put(struct kvm_vcpu *vcpu);
 void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu);
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu);
+void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu);
+int tdx_handle_exit(struct kvm_vcpu *vcpu,
+		enum exit_fastpath_completion fastpath);
 u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
 
 void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 			   int trig_mode, int vector);
 void tdx_inject_nmi(struct kvm_vcpu *vcpu);
+void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -195,11 +200,16 @@ static inline void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_put(struct kvm_vcpu *vcpu) {}
 static inline void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu) {}
 static inline bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu) { return false; }
+static inline void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu) {}
+static inline int tdx_handle_exit(struct kvm_vcpu *vcpu,
+		enum exit_fastpath_completion fastpath) { return 0; }
 static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
 
 static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 					 int trig_mode, int vector) {}
 static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
+static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
+				     u64 *info2, u32 *intr_info, u32 *error_code) {}
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 085/113] KVM: TDX: Handle vmentry failure for INTEL TD guest
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (83 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 084/113] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 086/113] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
                   ` (27 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Yao Yuan

From: Yao Yuan <yuan.yao@intel.com>

The TDX module passes control back to the VMM if it fails VM entry for a
TD.  Use the same exit reason to notify user space, aligning with VMX.
If the VMM corrupts the TD VMCS, a machine check during entry can happen;
the VM exit reason will then be EXIT_REASON_MCE_DURING_VMENTRY.  If the
VMM corrupts the TD VMCS of a debug TD via TDH.VP.WR, the exit reason
would be EXIT_REASON_INVALID_STATE or EXIT_REASON_MSR_LOAD_FAIL.
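
As a hedged illustration of the consumer side (hypothetical user-space
code, not part of the patch), a VMM can report the failure using the
existing KVM_EXIT_FAIL_ENTRY fields that this patch fills in for TDs:

#include <stdio.h>
#include <linux/kvm.h>

/* Hypothetical helper in a user-space VMM. */
static void report_td_entry_failure(const struct kvm_run *run)
{
	if (run->exit_reason != KVM_EXIT_FAIL_ENTRY)
		return;

	fprintf(stderr, "TD entry failed: reason=0x%llx on cpu %u\n",
		(unsigned long long)run->fail_entry.hardware_entry_failure_reason,
		run->fail_entry.cpu);
}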

Signed-off-by: Yao Yuan <yuan.yao@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 964154f7bc60..47ea12f23471 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1094,6 +1094,28 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 		goto unhandled_exit;
 	}
 
+	/*
+	 * When TDX module saw VMEXIT_REASON_FAILED_VMENTER_MC etc, TDH.VP.ENTER
+	 * returns with TDX_SUCCESS | exit_reason with failed_vmentry = 1.
+	 * Because TDX module maintains TD VMCS correctness, usually vmentry
+	 * failure shouldn't happen.  In some corner cases it can happen.  For
+	 * example
+	 * - machine check during entry: EXIT_REASON_MCE_DURING_VMENTRY
+	 * - TDH.VP.WR with debug TD.  VMM can corrupt TD VMCS
+	 *   - EXIT_REASON_INVALID_STATE
+	 *   - EXIT_REASON_MSR_LOAD_FAIL
+	 */
+	if (unlikely(exit_reason.failed_vmentry)) {
+		pr_err("TDExit: exit_reason 0x%016llx qualification=%016lx ext_qualification=%016lx\n",
+		       exit_reason.full, tdexit_exit_qual(vcpu), tdexit_ext_exit_qual(vcpu));
+		vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
+		vcpu->run->fail_entry.hardware_entry_failure_reason
+			= exit_reason.full;
+		vcpu->run->fail_entry.cpu = vcpu->arch.last_vmentry_cpu;
+
+		return 0;
+	}
+
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 086/113] KVM: TDX: handle EXIT_REASON_OTHER_SMI
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (84 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 085/113] KVM: TDX: Handle vmentry failure for INTEL TD guest isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 087/113] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
                   ` (26 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

If control reaches EXIT_REASON_OTHER_SMI, the #SMI was delivered and
handled right after returning from the TDX module to KVM; nothing needs to
be done in KVM.  Continue TDX vcpu execution.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/include/uapi/asm/vmx.h | 1 +
 arch/x86/kvm/vmx/tdx.c          | 7 +++++++
 2 files changed, 8 insertions(+)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index a5faf6d88f1b..b3a30ef3efdd 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -34,6 +34,7 @@
 #define EXIT_REASON_TRIPLE_FAULT        2
 #define EXIT_REASON_INIT_SIGNAL			3
 #define EXIT_REASON_SIPI_SIGNAL         4
+#define EXIT_REASON_OTHER_SMI           6
 
 #define EXIT_REASON_INTERRUPT_WINDOW    7
 #define EXIT_REASON_NMI_WINDOW          8
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 47ea12f23471..83df5673d5f2 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1119,6 +1119,13 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
+	case EXIT_REASON_OTHER_SMI:
+		/*
+		 * If reach here, it's not a Machine Check System Management
+		 * Interrupt(MSMI).  #SMI is delivered and handled right after
+		 * SEAMRET, nothing needs to be done in KVM.
+		 */
+		return 1;
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 087/113] KVM: TDX: handle ept violation/misconfig exit
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (85 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 086/113] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 088/113] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
                   ` (25 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

On EPT violation, call a common function, __vmx_handle_ept_violation(), to
trigger the x86 MMU code.  On EPT misconfiguration, exit to ring 3 with
KVM_EXIT_UNKNOWN, because EPT misconfiguration can't happen: MMIO is
triggered by TDG.VP.VMCALL, so there is no point in setting up a
misconfiguration value for the fast path.
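
A small worked example of the private/shared GPA distinction relied on
here (illustrative only; the shared-bit position depends on the TD's GPA
width, bit 47 with 4-level EPT and bit 51 with 5-level):

/*
 * Illustrative only, assuming the shared bit is GPA bit 47 (GPAW = 48):
 *
 *   GPA 0x0000000000001000 - shared bit clear -> private GPA.  The
 *       Secure-EPT violation is treated as a write fault and the
 *       EXIT_QUALIFICATION reported by the TDX module is ignored.
 *   GPA 0x0000800000001000 - shared bit set -> shared GPA.  The reported
 *       EXIT_QUALIFICATION is used as-is, and an instruction-fetch
 *       violation is punted to user space as an exception exit.
 */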

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 46 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 46 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 83df5673d5f2..2bea7c96536b 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1075,6 +1075,48 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
 }
 
+static int tdx_handle_ept_violation(struct kvm_vcpu *vcpu)
+{
+	unsigned long exit_qual;
+
+	if (kvm_is_private_gpa(vcpu->kvm, tdexit_gpa(vcpu))) {
+		/*
+		 * Always treat SEPT violations as write faults.  Ignore the
+		 * EXIT_QUALIFICATION reported by TDX-SEAM for SEPT violations.
+		 * TD private pages are always RWX in the SEPT tables,
+		 * i.e. they're always mapped writable.  Just as importantly,
+		 * treating SEPT violations as write faults is necessary to
+		 * avoid COW allocations, which will cause TDAUGPAGE failures
+		 * due to aliasing a single HPA to multiple GPAs.
+		 */
+#define TDX_SEPT_VIOLATION_EXIT_QUAL	EPT_VIOLATION_ACC_WRITE
+		exit_qual = TDX_SEPT_VIOLATION_EXIT_QUAL;
+	} else {
+		exit_qual = tdexit_exit_qual(vcpu);
+		if (exit_qual & EPT_VIOLATION_ACC_INSTR) {
+			pr_warn("kvm: TDX instr fetch to shared GPA = 0x%lx @ RIP = 0x%lx\n",
+				tdexit_gpa(vcpu), kvm_rip_read(vcpu));
+			vcpu->run->exit_reason = KVM_EXIT_EXCEPTION;
+			vcpu->run->ex.exception = PF_VECTOR;
+			vcpu->run->ex.error_code = exit_qual;
+			return 0;
+		}
+	}
+
+	trace_kvm_page_fault(vcpu, tdexit_gpa(vcpu), exit_qual);
+	return __vmx_handle_ept_violation(vcpu, tdexit_gpa(vcpu), exit_qual);
+}
+
+static int tdx_handle_ept_misconfig(struct kvm_vcpu *vcpu)
+{
+	WARN_ON_ONCE(1);
+
+	vcpu->run->exit_reason = KVM_EXIT_UNKNOWN;
+	vcpu->run->hw.hardware_exit_reason = EXIT_REASON_EPT_MISCONFIG;
+
+	return 0;
+}
+
 int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 {
 	union tdx_exit_reason exit_reason = to_tdx(vcpu)->exit_reason;
@@ -1119,6 +1161,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
+	case EXIT_REASON_EPT_VIOLATION:
+		return tdx_handle_ept_violation(vcpu);
+	case EXIT_REASON_EPT_MISCONFIG:
+		return tdx_handle_ept_misconfig(vcpu);
 	case EXIT_REASON_OTHER_SMI:
 		/*
 		 * If reach here, it's not a Machine Check System Management
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 088/113] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (86 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 087/113] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 089/113] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
                   ` (24 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because guest TD state is protected, exceptions in guest TDs can't be
intercepted, so the TDX VMM doesn't need to handle exceptions.
tdx_handle_exit_irqoff() handles NMI and machine check; ignore both and
continue guest TD execution.

For external interrupts, increment the stats the same way as in the VMX
case.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx.c | 23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2bea7c96536b..d3181e191adf 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -742,6 +742,25 @@ void tdx_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 						     tdexit_intr_info(vcpu));
 }
 
+static int tdx_handle_exception(struct kvm_vcpu *vcpu)
+{
+	u32 intr_info = tdexit_intr_info(vcpu);
+
+	if (is_nmi(intr_info) || is_machine_check(intr_info))
+		return 1;
+
+	kvm_pr_unimpl("unexpected exception 0x%x(exit_reason 0x%llx qual 0x%lx)\n",
+		intr_info,
+		to_tdx(vcpu)->exit_reason.full, tdexit_exit_qual(vcpu));
+	return -EFAULT;
+}
+
+static int tdx_handle_external_interrupt(struct kvm_vcpu *vcpu)
+{
+	++vcpu->stat.irq_exits;
+	return 1;
+}
+
 static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
 {
 	vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -1161,6 +1180,10 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 	WARN_ON_ONCE(fastpath != EXIT_FASTPATH_NONE);
 
 	switch (exit_reason.basic) {
+	case EXIT_REASON_EXCEPTION_NMI:
+		return tdx_handle_exception(vcpu);
+	case EXIT_REASON_EXTERNAL_INTERRUPT:
+		return tdx_handle_external_interrupt(vcpu);
 	case EXIT_REASON_EPT_VIOLATION:
 		return tdx_handle_ept_violation(vcpu);
 	case EXIT_REASON_EPT_MISCONFIG:
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 089/113] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL)
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (87 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 088/113] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 090/113] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL isaku.yamahata
                   ` (23 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Xiaoyao Li,
	Sean Christopherson

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX module specification defines the TDG.VP.VMCALL API (TDVMCALL for
short) for the guest TD to issue hypercalls to the VMM.  When the guest TD
issues a TDG.VP.VMCALL, it exits to the VMM with a new exit reason of
TDVMCALL.  The arguments from the guest TD and the values returned by the
VMM are passed in the guest registers.  The guest RCX register indicates
which registers are used.  Define helper functions to access those
registers per the ABI.

Define the TDVMCALL exit reason, which is carved out from the VMX exit
reason namespace, as the TDVMCALL exit from a TDX guest to TDX-SEAM is
really just a VM-Exit.  Add a placeholder to handle the TDVMCALL exit.
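
As a rough sketch of how the host side sees those registers through the
new accessors (hypothetical debug helper, not part of the patch):

/* Hypothetical debug helper built on the accessors added below. */
static void tdvmcall_trace_sketch(struct kvm_vcpu *vcpu)
{
	unsigned long type = tdvmcall_exit_type(vcpu);	/* R10: 0 = GHCI-defined */
	unsigned long leaf = tdvmcall_leaf(vcpu);	/* R11: subfunction */
	unsigned long a0 = tdvmcall_a0_read(vcpu);	/* R12: first argument */
	unsigned long a1 = tdvmcall_a1_read(vcpu);	/* R13: second argument */

	pr_debug("TDVMCALL: type=%lu leaf=%lu a0=0x%lx a1=0x%lx\n",
		 type, leaf, a0, a1);
}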

Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/uapi/asm/vmx.h |  4 ++-
 arch/x86/kvm/vmx/tdx.c          | 56 ++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h          | 13 ++++++++
 3 files changed, 71 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/uapi/asm/vmx.h b/arch/x86/include/uapi/asm/vmx.h
index b3a30ef3efdd..f0f4a4cf84a7 100644
--- a/arch/x86/include/uapi/asm/vmx.h
+++ b/arch/x86/include/uapi/asm/vmx.h
@@ -93,6 +93,7 @@
 #define EXIT_REASON_TPAUSE              68
 #define EXIT_REASON_BUS_LOCK            74
 #define EXIT_REASON_NOTIFY              75
+#define EXIT_REASON_TDCALL              77
 
 #define VMX_EXIT_REASONS \
 	{ EXIT_REASON_EXCEPTION_NMI,         "EXCEPTION_NMI" }, \
@@ -156,7 +157,8 @@
 	{ EXIT_REASON_UMWAIT,                "UMWAIT" }, \
 	{ EXIT_REASON_TPAUSE,                "TPAUSE" }, \
 	{ EXIT_REASON_BUS_LOCK,              "BUS_LOCK" }, \
-	{ EXIT_REASON_NOTIFY,                "NOTIFY" }
+	{ EXIT_REASON_NOTIFY,                "NOTIFY" }, \
+	{ EXIT_REASON_TDCALL,                "TDCALL" }
 
 #define VMX_EXIT_REASON_FLAGS \
 	{ VMX_EXIT_REASONS_FAILED_VMENTRY,	"FAILED_VMENTRY" }
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index d3181e191adf..836e1c294394 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -85,6 +85,41 @@ static __always_inline unsigned long tdexit_intr_info(struct kvm_vcpu *vcpu)
 	return kvm_r9_read(vcpu);
 }
 
+#define BUILD_TDVMCALL_ACCESSORS(param, gpr)				\
+static __always_inline							\
+unsigned long tdvmcall_##param##_read(struct kvm_vcpu *vcpu)		\
+{									\
+	return kvm_##gpr##_read(vcpu);					\
+}									\
+static __always_inline void tdvmcall_##param##_write(struct kvm_vcpu *vcpu, \
+						     unsigned long val)	\
+{									\
+	kvm_##gpr##_write(vcpu, val);					\
+}
+BUILD_TDVMCALL_ACCESSORS(a0, r12);
+BUILD_TDVMCALL_ACCESSORS(a1, r13);
+BUILD_TDVMCALL_ACCESSORS(a2, r14);
+BUILD_TDVMCALL_ACCESSORS(a3, r15);
+
+static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
+{
+	return kvm_r10_read(vcpu);
+}
+static __always_inline unsigned long tdvmcall_leaf(struct kvm_vcpu *vcpu)
+{
+	return kvm_r11_read(vcpu);
+}
+static __always_inline void tdvmcall_set_return_code(struct kvm_vcpu *vcpu,
+						     long val)
+{
+	kvm_r10_write(vcpu, val);
+}
+static __always_inline void tdvmcall_set_return_val(struct kvm_vcpu *vcpu,
+						    unsigned long val)
+{
+	kvm_r11_write(vcpu, val);
+}
+
 static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
 {
 	return tdx->tdvpr_pa;
@@ -689,7 +724,8 @@ static noinstr void tdx_vcpu_enter_exit(struct kvm_vcpu *vcpu,
 					struct vcpu_tdx *tdx)
 {
 	guest_enter_irqoff();
-	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr_pa, vcpu->arch.regs, 0);
+	tdx->exit_reason.full = __tdx_vcpu_run(tdx->tdvpr_pa, vcpu->arch.regs,
+					tdx->tdvmcall.regs_mask);
 	guest_exit_irqoff();
 }
 
@@ -722,6 +758,11 @@ fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
 
 	tdx_complete_interrupts(vcpu);
 
+	if (tdx->exit_reason.basic == EXIT_REASON_TDCALL)
+		tdx->tdvmcall.rcx = vcpu->arch.regs[VCPU_REGS_RCX];
+	else
+		tdx->tdvmcall.rcx = 0;
+
 	return EXIT_FASTPATH_NONE;
 }
 
@@ -768,6 +809,17 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int handle_tdvmcall(struct kvm_vcpu *vcpu)
+{
+	switch (tdvmcall_leaf(vcpu)) {
+	default:
+		break;
+	}
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	return 1;
+}
+
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
 {
 	td_vmcs_write64(to_tdx(vcpu), SHARED_EPT_POINTER, root_hpa & PAGE_MASK);
@@ -1184,6 +1236,8 @@ int tdx_handle_exit(struct kvm_vcpu *vcpu, fastpath_t fastpath)
 		return tdx_handle_exception(vcpu);
 	case EXIT_REASON_EXTERNAL_INTERRUPT:
 		return tdx_handle_external_interrupt(vcpu);
+	case EXIT_REASON_TDCALL:
+		return handle_tdvmcall(vcpu);
 	case EXIT_REASON_EPT_VIOLATION:
 		return tdx_handle_ept_violation(vcpu);
 	case EXIT_REASON_EPT_MISCONFIG:
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index e37db607d6d9..272980d9605c 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -72,6 +72,19 @@ struct vcpu_tdx {
 
 	struct list_head cpu_list;
 
+	union {
+		struct {
+			union {
+				struct {
+					u16 gpr_mask;
+					u16 xmm_mask;
+				};
+				u32 regs_mask;
+			};
+			u32 reserved;
+		};
+		u64 rcx;
+	} tdvmcall;
 	union tdx_exit_reason exit_reason;
 
 	bool vcpu_initialized;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 090/113] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (88 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 089/113] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 091/113] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL isaku.yamahata
                   ` (22 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX Guest-Host Communication Interface (GHCI) specification defines
the ABI for the guest TD to issue hypercalls.  It reserves vendor-specific
arguments for VMM-specific use.  Use that vendor-specific range for the
KVM hypercall and handle it.
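
For context, a hedged guest-side sketch.  At the time of this series the
TDX guest kernel routes kvm_hypercall*() through tdx_kvm_hypercall()
(arch/x86/coco/tdx), which puts the hypercall number in R10 and the
arguments in R11-R14 exactly as decoded here; that guest-side plumbing is
an assumption about the guest kernel and is not part of this patch:

#include <linux/kvm_para.h>

/* Guest-side sketch: assumes CONFIG_INTEL_TDX_GUEST and the existing
 * tdx_kvm_hypercall() routing behind the kvm_hypercall*() wrappers. */
static void kick_vcpu_sketch(int apic_id)
{
	/* Becomes a TDG.VP.VMCALL with R10 = KVM_HC_KICK_CPU. */
	kvm_hypercall2(KVM_HC_KICK_CPU, 0, apic_id);
}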

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 836e1c294394..120825310611 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -809,8 +809,39 @@ static int tdx_handle_triple_fault(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
+{
+	unsigned long nr, a0, a1, a2, a3, ret;
+
+	/*
+	 * ABI for KVM tdvmcall argument:
+	 * In Guest-Hypervisor Communication Interface(GHCI) specification,
+	 * Non-zero leaf number (R10 != 0) is defined to indicate
+	 * vendor-specific.  KVM uses this for KVM hypercall.  NOTE: KVM
+	 * hypercall number starts from one.  Zero isn't used for KVM hypercall
+	 * number.
+	 *
+	 * R10: KVM hypercall number
+	 * arguments: R11, R12, R13, R14.
+	 */
+	nr = kvm_r10_read(vcpu);
+	a0 = kvm_r11_read(vcpu);
+	a1 = kvm_r12_read(vcpu);
+	a2 = kvm_r13_read(vcpu);
+	a3 = kvm_r14_read(vcpu);
+
+	ret = __kvm_emulate_hypercall(vcpu, nr, a0, a1, a2, a3, true);
+
+	tdvmcall_set_return_code(vcpu, ret);
+
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
+	if (tdvmcall_exit_type(vcpu))
+		return tdx_emulate_vmcall(vcpu);
+
 	switch (tdvmcall_leaf(vcpu)) {
 	default:
 		break;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 091/113] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (89 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 090/113] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 092/113] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
                   ` (21 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Some TDG.VP.VMCALLs require the device model, for example qemu, to handle
them on behalf of the kvm kernel module.

Introduce a new KVM exit, KVM_EXIT_TDX, and functions to set it up.
TDG_VP_VMCALL_INVALID_OPERAND is set as the default return value to avoid
returning a random value.  The device model should update R10 if
necessary.
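
A hedged sketch of the user-space side (hypothetical VMM code; struct and
field names come from the uapi additions below):

#include <stdio.h>
#include <linux/kvm.h>

/* Handle a TDVMCALL that KVM forwarded to user space.  Leaving status_code
 * untouched keeps the default error code KVM pre-loads; a VMM implementing
 * a subfunction would fill in the out_* fields and set status_code to 0
 * (success per the GHCI). */
static void handle_tdx_exit_sketch(struct kvm_run *run)
{
	struct kvm_tdx_vmcall *vmcall;

	if (run->exit_reason != KVM_EXIT_TDX ||
	    run->tdx.type != KVM_EXIT_TDX_VMCALL)
		return;

	vmcall = &run->tdx.u.vmcall;
	fprintf(stderr, "unhandled TDVMCALL subfunction 0x%llx\n",
		(unsigned long long)vmcall->subfunction);
}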

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c   | 93 +++++++++++++++++++++++++++++++++++++++-
 include/uapi/linux/kvm.h | 57 ++++++++++++++++++++++++
 2 files changed, 148 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 120825310611..89693405892d 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -101,6 +101,18 @@ BUILD_TDVMCALL_ACCESSORS(a1, r13);
 BUILD_TDVMCALL_ACCESSORS(a2, r14);
 BUILD_TDVMCALL_ACCESSORS(a3, r15);
 
+#define TDX_VMCALL_REG_MASK_RBX	BIT_ULL(2)
+#define TDX_VMCALL_REG_MASK_RDX	BIT_ULL(3)
+#define TDX_VMCALL_REG_MASK_RBP	BIT_ULL(5)
+#define TDX_VMCALL_REG_MASK_RSI	BIT_ULL(6)
+#define TDX_VMCALL_REG_MASK_RDI	BIT_ULL(7)
+#define TDX_VMCALL_REG_MASK_R8	BIT_ULL(8)
+#define TDX_VMCALL_REG_MASK_R9	BIT_ULL(9)
+#define TDX_VMCALL_REG_MASK_R12	BIT_ULL(12)
+#define TDX_VMCALL_REG_MASK_R13	BIT_ULL(13)
+#define TDX_VMCALL_REG_MASK_R14	BIT_ULL(14)
+#define TDX_VMCALL_REG_MASK_R15	BIT_ULL(15)
+
 static __always_inline unsigned long tdvmcall_exit_type(struct kvm_vcpu *vcpu)
 {
 	return kvm_r10_read(vcpu);
@@ -837,6 +849,80 @@ static int tdx_emulate_vmcall(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_complete_vp_vmcall(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+	__u64 reg_mask;
+
+	tdvmcall_set_return_code(vcpu, tdx_vmcall->status_code);
+	tdvmcall_set_return_val(vcpu, tdx_vmcall->out_r11);
+
+	reg_mask = kvm_rcx_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R12)
+		kvm_r12_write(vcpu, tdx_vmcall->out_r12);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R13)
+		kvm_r13_write(vcpu, tdx_vmcall->out_r13);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R14)
+		kvm_r14_write(vcpu, tdx_vmcall->out_r14);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R15)
+		kvm_r15_write(vcpu, tdx_vmcall->out_r15);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RBX)
+		kvm_rbx_write(vcpu, tdx_vmcall->out_rbx);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDI)
+		kvm_rdi_write(vcpu, tdx_vmcall->out_rdi);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RSI)
+		kvm_rsi_write(vcpu, tdx_vmcall->out_rsi);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R8)
+		kvm_r8_write(vcpu, tdx_vmcall->out_r8);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R9)
+		kvm_r9_write(vcpu, tdx_vmcall->out_r9);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDX)
+		kvm_rdx_write(vcpu, tdx_vmcall->out_rdx);
+
+	return 1;
+}
+
+static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
+{
+	struct kvm_tdx_vmcall *tdx_vmcall = &vcpu->run->tdx.u.vmcall;
+	__u64 reg_mask;
+
+	vcpu->arch.complete_userspace_io = tdx_complete_vp_vmcall;
+	memset(tdx_vmcall, 0, sizeof(*tdx_vmcall));
+
+	vcpu->run->exit_reason = KVM_EXIT_TDX;
+	vcpu->run->tdx.type = KVM_EXIT_TDX_VMCALL;
+	tdx_vmcall->type = tdvmcall_exit_type(vcpu);
+	tdx_vmcall->subfunction = tdvmcall_leaf(vcpu);
+	tdx_vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+	reg_mask = kvm_rcx_read(vcpu);
+	tdx_vmcall->reg_mask = reg_mask;
+	if (reg_mask & TDX_VMCALL_REG_MASK_R12)
+		tdx_vmcall->in_r12 = kvm_r12_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R13)
+		tdx_vmcall->in_r13 = kvm_r13_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R14)
+		tdx_vmcall->in_r14 = kvm_r14_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R15)
+		tdx_vmcall->in_r15 = kvm_r15_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RBX)
+		tdx_vmcall->in_rbx = kvm_rbx_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDI)
+		tdx_vmcall->in_rdi = kvm_rdi_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RSI)
+		tdx_vmcall->in_rsi = kvm_rsi_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R8)
+		tdx_vmcall->in_r8 = kvm_r8_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_R9)
+		tdx_vmcall->in_r9 = kvm_r9_read(vcpu);
+	if (reg_mask & TDX_VMCALL_REG_MASK_RDX)
+		tdx_vmcall->in_rdx = kvm_rdx_read(vcpu);
+
+	/* notify userspace to handle the request */
+	return 0;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -847,8 +933,11 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		break;
 	}
 
-	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
-	return 1;
+	/*
+	 * Unknown VMCALL.  Toss the request to the user space as it may know
+	 * how to handle.
+	 */
+	return tdx_vp_vmcall_to_user(vcpu);
 }
 
 void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int pgd_level)
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2a47fd0e51fd..24d899f66242 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -251,6 +251,60 @@ struct kvm_xen_exit {
 	} u;
 };
 
+struct kvm_tdx_exit {
+#define KVM_EXIT_TDX_VMCALL	1
+	__u32 type;
+	__u32 pad;
+
+	union {
+		struct kvm_tdx_vmcall {
+			/*
+			 * Guest-Host-Communication Interface for TDX spec
+			 * defines the ABI for TDG.VP.VMCALL.
+			 */
+
+			/* Input parameters: guest -> VMM */
+			__u64 type;		/* r10 */
+			__u64 subfunction;	/* r11 */
+			__u64 reg_mask;		/* rcx */
+			/*
+			 * Subfunction specific.
+			 * Registers are used in this order to pass input
+			 * arguments.  r12=arg0, r13=arg1, etc.
+			 */
+			__u64 in_r12;
+			__u64 in_r13;
+			__u64 in_r14;
+			__u64 in_r15;
+			__u64 in_rbx;
+			__u64 in_rdi;
+			__u64 in_rsi;
+			__u64 in_r8;
+			__u64 in_r9;
+			__u64 in_rdx;
+
+			/* Output parameters: VMM -> guest */
+			__u64 status_code;	/* r10 */
+			/*
+			 * Subfunction specific.
+			 * Registers are used in this order to output return
+			 * values.  r11=ret0, r12=ret1, etc.
+			 */
+			__u64 out_r11;
+			__u64 out_r12;
+			__u64 out_r13;
+			__u64 out_r14;
+			__u64 out_r15;
+			__u64 out_rbx;
+			__u64 out_rdi;
+			__u64 out_rsi;
+			__u64 out_r8;
+			__u64 out_r9;
+			__u64 out_rdx;
+		} vmcall;
+	} u;
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -293,6 +347,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_MEMORY_FAULT     38
+#define KVM_EXIT_TDX              39
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -541,6 +596,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory;
+		/* KVM_EXIT_TDX_VMCALL */
+		struct kvm_tdx_exit tdx;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 092/113] KVM: TDX: Handle TDX PV CPUID hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (90 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 091/113] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 093/113] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
                   ` (20 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV CPUID hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 89693405892d..6419cb741996 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -923,12 +923,34 @@ static int tdx_vp_vmcall_to_user(struct kvm_vcpu *vcpu)
 	return 0;
 }
 
+static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
+{
+	u32 eax, ebx, ecx, edx;
+
+	/* EAX and ECX for cpuid is stored in R12 and R13. */
+	eax = tdvmcall_a0_read(vcpu);
+	ecx = tdvmcall_a1_read(vcpu);
+
+	kvm_cpuid(vcpu, &eax, &ebx, &ecx, &edx, false);
+
+	tdvmcall_a0_write(vcpu, eax);
+	tdvmcall_a1_write(vcpu, ebx);
+	tdvmcall_a2_write(vcpu, ecx);
+	tdvmcall_a3_write(vcpu, edx);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
 		return tdx_emulate_vmcall(vcpu);
 
 	switch (tdvmcall_leaf(vcpu)) {
+	case EXIT_REASON_CPUID:
+		return tdx_emulate_cpuid(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 093/113] KVM: TDX: Handle TDX PV HLT hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (91 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 092/113] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 094/113] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
                   ` (19 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV HLT hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 42 +++++++++++++++++++++++++++++++++++++++++-
 arch/x86/kvm/vmx/tdx.h |  3 +++
 2 files changed, 44 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6419cb741996..3dcfbdd37579 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -588,7 +588,32 @@ void tdx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 
 bool tdx_protected_apic_has_interrupt(struct kvm_vcpu *vcpu)
 {
-	return pi_has_pending_interrupt(vcpu);
+	bool ret = pi_has_pending_interrupt(vcpu);
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	if (ret || vcpu->arch.mp_state != KVM_MP_STATE_HALTED)
+		return true;
+
+	if (tdx->interrupt_disabled_hlt)
+		return false;
+
+	/*
+	 * This is for the case where the virtual interrupt is recognized,
+	 * i.e. set in vmcs.RVI, between the STI and "HLT".  KVM doesn't have
+	 * access to RVI and the interrupt is no longer in the PID (because it
+	 * was "recognized".  It doesn't get delivered in the guest because the
+	 * TDCALL completes before interrupts are enabled.
+	 *
+	 * TDX modules sets RVI while in an STI interrupt shadow.
+	 * - TDExit(typically TDG.VP.VMCALL<HLT>) from the guest to TDX module.
+	 *   The interrupt shadow at this point is gone.
+	 * - It knows that there is an interrupt that can be delivered
+	 *   (RVI > PPR && EFLAGS.IF=1, the other conditions of 29.2.2 don't
+	 *    matter)
+	 * - It forwards the TDExit nevertheless, to a clueless hypervisor that
+	 *   has no way to glean either RVI or PPR.
+	 */
+	return !!xchg(&tdx->buggy_hlt_workaround, 0);
 }
 
 void tdx_prepare_switch_to_guest(struct kvm_vcpu *vcpu)
@@ -943,6 +968,17 @@ static int tdx_emulate_cpuid(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
+{
+	struct vcpu_tdx *tdx = to_tdx(vcpu);
+
+	/* See tdx_protected_apic_has_interrupt() to avoid heavy seamcall */
+	tdx->interrupt_disabled_hlt = tdvmcall_a0_read(vcpu);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return kvm_emulate_halt_noskip(vcpu);
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -951,6 +987,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 	switch (tdvmcall_leaf(vcpu)) {
 	case EXIT_REASON_CPUID:
 		return tdx_emulate_cpuid(vcpu);
+	case EXIT_REASON_HLT:
+		return tdx_emulate_hlt(vcpu);
 	default:
 		break;
 	}
@@ -1284,6 +1322,8 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	struct kvm_vcpu *vcpu = apic->vcpu;
 	struct vcpu_tdx *tdx = to_tdx(vcpu);
 
+	/* See comment in tdx_protected_apic_has_interrupt(). */
+	tdx->buggy_hlt_workaround = 1;
 	/* TDX supports only posted interrupt.  No lapic emulation. */
 	__vmx_deliver_posted_interrupt(vcpu, &tdx->pi_desc, vector);
 }
diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
index 272980d9605c..01e97d6886d5 100644
--- a/arch/x86/kvm/vmx/tdx.h
+++ b/arch/x86/kvm/vmx/tdx.h
@@ -93,6 +93,9 @@ struct vcpu_tdx {
 	bool host_state_need_restore;
 	u64 msr_host_kernel_gs_base;
 
+	bool interrupt_disabled_hlt;
+	unsigned int buggy_hlt_workaround;
+
 	/*
 	 * Dummy to make pmu_intel not corrupt memory.
 	 * TODO: Support PMU for TDX.  Future work.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 094/113] KVM: TDX: Handle TDX PV port io hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (92 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 093/113] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 095/113] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
                   ` (18 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV port IO hypercall to the KVM backend function.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx.c | 57 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 57 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 3dcfbdd37579..88318a80e6de 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -979,6 +979,61 @@ static int tdx_emulate_hlt(struct kvm_vcpu *vcpu)
 	return kvm_emulate_halt_noskip(vcpu);
 }
 
+static int tdx_complete_pio_in(struct kvm_vcpu *vcpu)
+{
+	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	unsigned long val = 0;
+	int ret;
+
+	WARN_ON_ONCE(vcpu->arch.pio.count != 1);
+
+	ret = ctxt->ops->pio_in_emulated(ctxt, vcpu->arch.pio.size,
+					 vcpu->arch.pio.port, &val, 1);
+	WARN_ON_ONCE(!ret);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	tdvmcall_set_return_val(vcpu, val);
+
+	return 1;
+}
+
+static int tdx_emulate_io(struct kvm_vcpu *vcpu)
+{
+	struct x86_emulate_ctxt *ctxt = vcpu->arch.emulate_ctxt;
+	unsigned long val = 0;
+	unsigned int port;
+	int size, ret;
+	bool write;
+
+	++vcpu->stat.io_exits;
+
+	size = tdvmcall_a0_read(vcpu);
+	write = tdvmcall_a1_read(vcpu);
+	port = tdvmcall_a2_read(vcpu);
+
+	if (size != 1 && size != 2 && size != 4) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	if (write) {
+		val = tdvmcall_a3_read(vcpu);
+		ret = ctxt->ops->pio_out_emulated(ctxt, size, port, &val, 1);
+
+		/* No need for a complete_userspace_io callback. */
+		vcpu->arch.pio.count = 0;
+	} else {
+		ret = ctxt->ops->pio_in_emulated(ctxt, size, port, &val, 1);
+		if (!ret)
+			vcpu->arch.complete_userspace_io = tdx_complete_pio_in;
+		else
+			tdvmcall_set_return_val(vcpu, val);
+	}
+	if (ret)
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return ret;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -989,6 +1044,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_cpuid(vcpu);
 	case EXIT_REASON_HLT:
 		return tdx_emulate_hlt(vcpu);
+	case EXIT_REASON_IO_INSTRUCTION:
+		return tdx_emulate_io(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 095/113] KVM: TDX: Handle TDX PV MMIO hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (93 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 094/113] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 096/113] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
                   ` (17 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

Export kvm_io_bus_read() and the kvm_mmio tracepoint and wire up the TDX
PV MMIO hypercall to the KVM backend functions.

kvm_io_bus_read/write() searches for the in-kernel KVM device that
emulates the given MMIO address and emulates the MMIO.  As TDX PV MMIO
also needs this, export kvm_io_bus_read(); kvm_io_bus_write() is already
exported.  TDX PV MMIO emulates some of the MMIO itself.  To add the
tracepoint consistently with x86 KVM, export the kvm_mmio tracepoint.
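
One arithmetic detail in the handler below that may not be obvious: the
((gpa + size - 1) ^ gpa) & PAGE_MASK check rejects accesses that straddle
a page boundary.  A quick worked example with illustrative values:

/*
 * Illustrative only: XORing the first and last byte addresses exposes any
 * differing page-frame bits, which PAGE_MASK then isolates.
 *
 *   gpa = 0x1ff8, size = 8: (0x1fff ^ 0x1ff8) & PAGE_MASK = 0
 *                           -> same page, emulation proceeds
 *   gpa = 0x1ffd, size = 8: (0x2004 ^ 0x1ffd) & PAGE_MASK = 0x3000
 *                           -> crosses a page boundary, rejected as invalid
 */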

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx.c | 114 +++++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/x86.c     |   1 +
 virt/kvm/kvm_main.c    |   2 +
 3 files changed, 117 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 88318a80e6de..8fab8e641070 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1034,6 +1034,118 @@ static int tdx_emulate_io(struct kvm_vcpu *vcpu)
 	return ret;
 }
 
+static int tdx_complete_mmio(struct kvm_vcpu *vcpu)
+{
+	unsigned long val = 0;
+	gpa_t gpa;
+	int size;
+
+	KVM_BUG_ON(vcpu->mmio_needed != 1, vcpu->kvm);
+	vcpu->mmio_needed = 0;
+
+	if (!vcpu->mmio_is_write) {
+		gpa = vcpu->mmio_fragments[0].gpa;
+		size = vcpu->mmio_fragments[0].len;
+
+		memcpy(&val, vcpu->run->mmio.data, size);
+		tdvmcall_set_return_val(vcpu, val);
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	}
+	return 1;
+}
+
+static inline int tdx_mmio_write(struct kvm_vcpu *vcpu, gpa_t gpa, int size,
+				 unsigned long val)
+{
+	if (kvm_iodevice_write(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+	    kvm_io_bus_write(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	trace_kvm_mmio(KVM_TRACE_MMIO_WRITE, size, gpa, &val);
+	return 0;
+}
+
+static inline int tdx_mmio_read(struct kvm_vcpu *vcpu, gpa_t gpa, int size)
+{
+	unsigned long val;
+
+	if (kvm_iodevice_read(vcpu, &vcpu->arch.apic->dev, gpa, size, &val) &&
+	    kvm_io_bus_read(vcpu, KVM_MMIO_BUS, gpa, size, &val))
+		return -EOPNOTSUPP;
+
+	tdvmcall_set_return_val(vcpu, val);
+	trace_kvm_mmio(KVM_TRACE_MMIO_READ, size, gpa, &val);
+	return 0;
+}
+
+static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
+{
+	struct kvm_memory_slot *slot;
+	int size, write, r;
+	unsigned long val;
+	gpa_t gpa;
+
+	KVM_BUG_ON(vcpu->mmio_needed, vcpu->kvm);
+
+	size = tdvmcall_a0_read(vcpu);
+	write = tdvmcall_a1_read(vcpu);
+	gpa = tdvmcall_a2_read(vcpu);
+	val = write ? tdvmcall_a3_read(vcpu) : 0;
+
+	if (size != 1 && size != 2 && size != 4 && size != 8)
+		goto error;
+	if (write != 0 && write != 1)
+		goto error;
+
+	/* Strip the shared bit, allow MMIO with and without it set. */
+	gpa = gpa & ~gfn_to_gpa(kvm_gfn_shared_mask(vcpu->kvm));
+
+	if (size > 8u || ((gpa + size - 1) ^ gpa) & PAGE_MASK)
+		goto error;
+
+	slot = kvm_vcpu_gfn_to_memslot(vcpu, gpa_to_gfn(gpa));
+	if (slot && !(slot->flags & KVM_MEMSLOT_INVALID))
+		goto error;
+
+	if (!kvm_io_bus_write(vcpu, KVM_FAST_MMIO_BUS, gpa, 0, NULL)) {
+		trace_kvm_fast_mmio(gpa);
+		return 1;
+	}
+
+	if (write)
+		r = tdx_mmio_write(vcpu, gpa, size, val);
+	else
+		r = tdx_mmio_read(vcpu, gpa, size);
+	if (!r) {
+		/* Kernel completed device emulation. */
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+		return 1;
+	}
+
+	/* Request the device emulation to userspace device model. */
+	vcpu->mmio_needed = 1;
+	vcpu->mmio_is_write = write;
+	vcpu->arch.complete_userspace_io = tdx_complete_mmio;
+
+	vcpu->run->mmio.phys_addr = gpa;
+	vcpu->run->mmio.len = size;
+	vcpu->run->mmio.is_write = write;
+	vcpu->run->exit_reason = KVM_EXIT_MMIO;
+
+	if (write) {
+		memcpy(vcpu->run->mmio.data, &val, size);
+	} else {
+		vcpu->mmio_fragments[0].gpa = gpa;
+		vcpu->mmio_fragments[0].len = size;
+		trace_kvm_mmio(KVM_TRACE_MMIO_READ_UNSATISFIED, size, gpa, NULL);
+	}
+	return 0;
+
+error:
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -1046,6 +1158,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_hlt(vcpu);
 	case EXIT_REASON_IO_INSTRUCTION:
 		return tdx_emulate_io(vcpu);
+	case EXIT_REASON_EPT_VIOLATION:
+		return tdx_emulate_mmio(vcpu);
 	default:
 		break;
 	}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 3d1c854b1604..ad8735874f1b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -13546,6 +13546,7 @@ bool kvm_arch_has_private_mem(struct kvm *kvm)
 
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_entry);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_exit);
+EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_fast_mmio);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_inj_virq);
 EXPORT_TRACEPOINT_SYMBOL_GPL(kvm_page_fault);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 251bb7c59c88..fe464fe0d9af 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2642,6 +2642,7 @@ struct kvm_memory_slot *kvm_vcpu_gfn_to_memslot(struct kvm_vcpu *vcpu, gfn_t gfn
 
 	return NULL;
 }
+EXPORT_SYMBOL_GPL(kvm_vcpu_gfn_to_memslot);
 
 bool kvm_is_visible_gfn(struct kvm *kvm, gfn_t gfn)
 {
@@ -5834,6 +5835,7 @@ int kvm_io_bus_read(struct kvm_vcpu *vcpu, enum kvm_bus bus_idx, gpa_t addr,
 	r = __kvm_io_bus_read(vcpu, bus, &range, val);
 	return r < 0 ? r : 0;
 }
+EXPORT_SYMBOL_GPL(kvm_io_bus_read);
 
 /* Caller must hold slots_lock. */
 int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 096/113] KVM: TDX: Implement callbacks for MSR operations for TDX
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (94 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 095/113] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 097/113] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall isaku.yamahata
                   ` (16 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the set_msr/get_msr/has_emulated_msr methods for TDX to handle
hypercalls from the guest TD for paravirtualized rdmsr and wrmsr.  The TDX
module virtualizes MSRs.  For some MSRs, it injects #VE into the guest TD
upon RDMSR or WRMSR.  The exact list of such MSRs is defined in the spec.

Upon #VE, the guest TD may execute hypercalls,
TDG.VP.VMCALL<INSTRUCTION.RDMSR> and TDG.VP.VMCALL<INSTRUCTION.WRMSR>,
which are defined in GHCI (Guest-Host Communication Interface) so that the
host VMM (e.g. KVM) can virtualize the MSRs.

There are three classes of MSR virtualization:
- non-configurable: The TDX module directly virtualizes the MSR.  The VMM
  can't configure it; the value set by KVM_SET_MSR_INDEX_LIST is ignored.
- configurable: The TDX module directly virtualizes the MSR.  The VMM can
  configure it at VM creation time; the value set by KVM_SET_MSR_INDEX_LIST
  is used.
- #VE case: The guest TD issues TDG.VP.VMCALL<INSTRUCTION.{WRMSR,RDMSR}>
  and the VMM handles the MSR hypercall.  The value set by
  KVM_SET_MSR_INDEX_LIST is used.  A guest-side sketch follows this list.
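
A minimal guest-side sketch of the #VE case, again using hypothetical
tdvmcall_regs/tdvmcall() stand-ins for the guest's TDCALL plumbing (the
KVM-side handler, tdx_emulate_rdmsr(), arrives in the next patch):

/* Illustrative only: the #VE handler turns RDMSR into a hypercall. */
static u64 ve_emulate_rdmsr(u32 msr)
{
	struct tdvmcall_regs regs = {
		.r10 = 0,			/* standard TDVMCALL */
		.r11 = EXIT_REASON_MSR_READ,	/* sub-function */
		.r12 = msr,			/* MSR index */
	};

	tdvmcall(&regs);
	return regs.r10 ? 0 : regs.r11;		/* MSR value from the VMM */
}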

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
Changes v10 -> v11
- added .msr_filter_changed()
---
 arch/x86/kvm/vmx/main.c    | 44 ++++++++++++++++++++---
 arch/x86/kvm/vmx/tdx.c     | 74 ++++++++++++++++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h |  6 ++++
 arch/x86/kvm/x86.c         |  1 -
 arch/x86/kvm/x86.h         |  2 ++
 5 files changed, 122 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index f9339d8f95eb..cb79d64a2058 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -183,6 +183,42 @@ static void vt_handle_exit_irqoff(struct kvm_vcpu *vcpu)
 	vmx_handle_exit_irqoff(vcpu);
 }
 
+static int vt_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_set_msr(vcpu, msr_info);
+
+	return vmx_set_msr(vcpu, msr_info);
+}
+
+/*
+ * The kvm parameter can be NULL (module initialization, or invocation before
+ * VM creation). Be sure to check the kvm parameter before using it.
+ */
+static bool vt_has_emulated_msr(struct kvm *kvm, u32 index)
+{
+	if (kvm && is_td(kvm))
+		return tdx_has_emulated_msr(index, true);
+
+	return vmx_has_emulated_msr(kvm, index);
+}
+
+static int vt_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_get_msr(vcpu, msr_info);
+
+	return vmx_get_msr(vcpu, msr_info);
+}
+
+static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_msr_filter_changed(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -428,7 +464,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.hardware_enable = vmx_hardware_enable,
 	.hardware_disable = vt_hardware_disable,
-	.has_emulated_msr = vmx_has_emulated_msr,
+	.has_emulated_msr = vt_has_emulated_msr,
 
 	.is_vm_type_supported = vt_is_vm_type_supported,
 	.vm_size = sizeof(struct kvm_vmx),
@@ -448,8 +484,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.update_exception_bitmap = vmx_update_exception_bitmap,
 	.get_msr_feature = vmx_get_msr_feature,
-	.get_msr = vmx_get_msr,
-	.set_msr = vmx_set_msr,
+	.get_msr = vt_get_msr,
+	.set_msr = vt_set_msr,
 	.get_segment_base = vmx_get_segment_base,
 	.get_segment = vmx_get_segment,
 	.set_segment = vmx_set_segment,
@@ -560,7 +596,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
 	.migrate_timers = vmx_migrate_timers,
 
-	.msr_filter_changed = vmx_msr_filter_changed,
+	.msr_filter_changed = vt_msr_filter_changed,
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
 	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 8fab8e641070..5d8894899055 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1629,6 +1629,80 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 	*error_code = 0;
 }
 
+static bool tdx_is_emulated_kvm_msr(u32 index, bool write)
+{
+	switch (index) {
+	case MSR_KVM_POLL_CONTROL:
+		return true;
+	default:
+		return false;
+	}
+}
+
+bool tdx_has_emulated_msr(u32 index, bool write)
+{
+	switch (index) {
+	case MSR_IA32_UCODE_REV:
+	case MSR_IA32_ARCH_CAPABILITIES:
+	case MSR_IA32_POWER_CTL:
+	case MSR_MTRRcap:
+	case 0x200 ... 0x26f:
+		/* IA32_MTRR_PHYS{BASE, MASK}, IA32_MTRR_FIX*_* */
+	case MSR_IA32_CR_PAT:
+	case MSR_MTRRdefType:
+	case MSR_IA32_TSC_DEADLINE:
+	case MSR_IA32_MISC_ENABLE:
+	case MSR_PLATFORM_INFO:
+	case MSR_MISC_FEATURES_ENABLES:
+	case MSR_IA32_MCG_CAP:
+	case MSR_IA32_MCG_STATUS:
+	case MSR_IA32_MCG_CTL:
+	case MSR_IA32_MCG_EXT_CTL:
+	case MSR_IA32_MC0_CTL ... MSR_IA32_MCx_CTL(KVM_MAX_MCE_BANKS) - 1:
+	case MSR_IA32_MC0_CTL2 ... MSR_IA32_MCx_CTL2(KVM_MAX_MCE_BANKS) - 1:
+		/* MSR_IA32_MCx_{CTL, STATUS, ADDR, MISC, CTL2} */
+		return true;
+	case APIC_BASE_MSR ... APIC_BASE_MSR + 0xff:
+		/*
+		 * x2APIC registers that are virtualized by the CPU can't be
+		 * emulated, KVM doesn't have access to the virtual APIC page.
+		 */
+		switch (index) {
+		case X2APIC_MSR(APIC_TASKPRI):
+		case X2APIC_MSR(APIC_PROCPRI):
+		case X2APIC_MSR(APIC_EOI):
+		case X2APIC_MSR(APIC_ISR) ... X2APIC_MSR(APIC_ISR + APIC_ISR_NR):
+		case X2APIC_MSR(APIC_TMR) ... X2APIC_MSR(APIC_TMR + APIC_ISR_NR):
+		case X2APIC_MSR(APIC_IRR) ... X2APIC_MSR(APIC_IRR + APIC_ISR_NR):
+			return false;
+		default:
+			return true;
+		}
+	case MSR_IA32_APICBASE:
+	case MSR_EFER:
+		return !write;
+	case 0x4b564d00 ... 0x4b564dff:
+		/* KVM custom MSRs */
+		return tdx_is_emulated_kvm_msr(index, write);
+	default:
+		return false;
+	}
+}
+
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	if (tdx_has_emulated_msr(msr->index, false))
+		return kvm_get_msr_common(vcpu, msr);
+	return 1;
+}
+
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
+{
+	if (tdx_has_emulated_msr(msr->index, true))
+		return kvm_set_msr_common(vcpu, msr);
+	return 1;
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 38fd5c3eee2f..3b747fb5bc20 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -171,6 +171,9 @@ void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 void tdx_inject_nmi(struct kvm_vcpu *vcpu);
 void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 		u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code);
+bool tdx_has_emulated_msr(u32 index, bool write);
+int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
@@ -210,6 +213,9 @@ static inline void tdx_deliver_interrupt(struct kvm_lapic *apic, int delivery_mo
 static inline void tdx_inject_nmi(struct kvm_vcpu *vcpu) {}
 static inline void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason, u64 *info1,
 				     u64 *info2, u32 *intr_info, u32 *error_code) {}
+static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
+static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
+static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ad8735874f1b..fe5bd1ab0eec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -87,7 +87,6 @@
 #include "trace.h"
 
 #define MAX_IO_MSRS 256
-#define KVM_MAX_MCE_BANKS 32
 
 struct kvm_caps kvm_caps __read_mostly = {
 	.supported_mce_cap = MCG_CTL_P | MCG_SER_P,
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 9de72586f406..028bf3eaa43b 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -8,6 +8,8 @@
 #include "kvm_cache_regs.h"
 #include "kvm_emulate.h"
 
+#define KVM_MAX_MCE_BANKS 32
+
 struct kvm_caps {
 	/* control of guest tsc rate supported? */
 	bool has_tsc_control;
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 097/113] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (95 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 096/113] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 098/113] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
                   ` (15 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up the TDX PV rdmsr/wrmsr hypercalls to the KVM backend functions.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
---
 arch/x86/kvm/vmx/tdx.c | 39 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 39 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5d8894899055..5dc7dae55c57 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1146,6 +1146,41 @@ static int tdx_emulate_mmio(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_emulate_rdmsr(struct kvm_vcpu *vcpu)
+{
+	u32 index = tdvmcall_a0_read(vcpu);
+	u64 data;
+
+	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_READ) ||
+	    kvm_get_msr(vcpu, index, &data)) {
+		trace_kvm_msr_read_ex(index);
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+	trace_kvm_msr_read(index, data);
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	tdvmcall_set_return_val(vcpu, data);
+	return 1;
+}
+
+static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
+{
+	u32 index = tdvmcall_a0_read(vcpu);
+	u64 data = tdvmcall_a1_read(vcpu);
+
+	if (!kvm_msr_allowed(vcpu, index, KVM_MSR_FILTER_WRITE) ||
+	    kvm_set_msr(vcpu, index, data)) {
+		trace_kvm_msr_write_ex(index, data);
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	trace_kvm_msr_write(index, data);
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -1160,6 +1195,10 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_io(vcpu);
 	case EXIT_REASON_EPT_VIOLATION:
 		return tdx_emulate_mmio(vcpu);
+	case EXIT_REASON_MSR_READ:
+		return tdx_emulate_rdmsr(vcpu);
+	case EXIT_REASON_MSR_WRITE:
+		return tdx_emulate_wrmsr(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 098/113] KVM: TDX: Handle TDX PV report fatal error hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (96 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 097/113] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
                   ` (14 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up the TDX PV report-fatal-error hypercall to exit to the device model
so that it can gracefully handle the error.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 5dc7dae55c57..4bbde58510a4 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1199,6 +1199,13 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_rdmsr(vcpu);
 	case EXIT_REASON_MSR_WRITE:
 		return tdx_emulate_wrmsr(vcpu);
+	case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+		/*
+		 * Exit to userspace device model for tear down.
+		 * Because guest TD is already panicking, returning an error to
+		 * guest TD doesn't make sense.  No argument check is done.
+		 */
+		return tdx_vp_vmcall_to_user(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (97 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 098/113] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-31  1:30   ` Yuan Yao
  2023-01-12 16:32 ` [PATCH v11 100/113] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall isaku.yamahata
                   ` (13 subsequent siblings)
  112 siblings, 1 reply; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Wire up TDX PV map_gpa hypercall to the kvm/mmu backend.
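
For context, a sketch of the guest-side request that the handling below
either resolves against the memslots or punts to the userspace VMM.  The
helpers are hypothetical stand-ins; KVM only sees the GPA in R12 and the
size in R13, with the shared bit encoded in the GPA:

/* Illustrative only: ask the VMM to convert a range to shared. */
static int td_map_gpa_shared(u64 gpa, u64 size, u64 shared_bit)
{
	struct tdvmcall_regs regs = {
		.r10 = 0,			/* standard TDVMCALL */
		.r11 = TDG_VP_VMCALL_MAP_GPA,
		.r12 = gpa | shared_bit,	/* 4KB-aligned GPA */
		.r13 = size,			/* 4KB-aligned size */
	};

	tdvmcall(&regs);
	return regs.r10 == TDG_VP_VMCALL_SUCCESS ? 0 : -EIO;
}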

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 53 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 4bbde58510a4..486d0f0c6dd1 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1181,6 +1181,57 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_map_gpa(struct kvm_vcpu *vcpu)
+{
+	struct kvm *kvm = vcpu->kvm;
+	gpa_t gpa = tdvmcall_a0_read(vcpu);
+	gpa_t size = tdvmcall_a1_read(vcpu);
+	gpa_t end = gpa + size;
+	gfn_t s = gpa_to_gfn(gpa) & ~kvm_gfn_shared_mask(kvm);
+	gfn_t e = gpa_to_gfn(end) & ~kvm_gfn_shared_mask(kvm);
+	int i;
+
+	if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
+	    end < gpa ||
+	    end > kvm_gfn_shared_mask(kvm) << (PAGE_SHIFT + 1) ||
+	    kvm_is_private_gpa(kvm, gpa) != kvm_is_private_gpa(kvm, end)) {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+		return 1;
+	}
+
+	/*
+	 * Check how the requested region overlaps with the KVM memory slots.
+	 * For simplicity, require that it must be contained within a memslot or
+	 * it must not overlap with any memslots (MMIO).
+	 */
+	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
+		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
+		struct kvm_memslot_iter iter;
+
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, s, e) {
+			struct kvm_memory_slot *slot = iter.slot;
+			gfn_t slot_s = slot->base_gfn;
+			gfn_t slot_e = slot->base_gfn + slot->npages;
+
+			/* no overlap */
+			if (e < slot_s || s >= slot_e)
+				continue;
+
+			/* contained in slot */
+			if (slot_s <= s && e <= slot_e) {
+				if (kvm_slot_can_be_private(slot))
+					return tdx_vp_vmcall_to_user(vcpu);
+				continue;
+			}
+
+			break;
+		}
+	}
+
+	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	return 1;
+}
+
 static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 {
 	if (tdvmcall_exit_type(vcpu))
@@ -1206,6 +1257,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		 * guest TD doesn't make sense.  No argument check is done.
 		 */
 		return tdx_vp_vmcall_to_user(vcpu);
+	case TDG_VP_VMCALL_MAP_GPA:
+		return tdx_map_gpa(vcpu);
 	default:
 		break;
 	}
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 100/113] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (98 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 101/113] KVM: TDX: Silently discard SMI request isaku.yamahata
                   ` (12 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Implement the TDG.VP.VMCALL<GetTdVmCallInfo> hypercall.  If the input value
is zero, return the success code and zero in the output registers.

TDG.VP.VMCALL<GetTdVmCallInfo> is a sub-leaf of TDG.VP.VMCALL to enumerate
which TDG.VP.VMCALL sub-leaves are supported.  This hypercall exists for
future enhancement of the Guest-Host-Communication Interface (GHCI)
specification.  GHCI version 344426-001US defines it to require input R12
to be zero and to return zero in the output registers R11, R12, R13, and
R14, so that the guest TD enumerates no enhancements.
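
As an illustration (hypothetical guest-side helpers again, not a real API),
the enumeration handled below looks like this from the guest:

/* Illustrative only: probe sub-leaf 0 of GetTdVmCallInfo. */
static bool td_vm_call_info(void)
{
	struct tdvmcall_regs regs = {
		.r10 = 0,			/* standard TDVMCALL */
		.r11 = TDG_VP_VMCALL_GET_TD_VM_CALL_INFO,
		.r12 = 0,			/* only leaf 0 is defined */
	};

	tdvmcall(&regs);
	/* On success, R11..R14 all come back zero: no extra sub-leaves. */
	return regs.r10 == TDG_VP_VMCALL_SUCCESS;
}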

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/tdx.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 486d0f0c6dd1..2f3206551c48 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1181,6 +1181,20 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
 	return 1;
 }
 
+static int tdx_get_td_vm_call_info(struct kvm_vcpu *vcpu)
+{
+	if (tdvmcall_a0_read(vcpu))
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
+	else {
+		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_SUCCESS);
+		kvm_r11_write(vcpu, 0);
+		tdvmcall_a0_write(vcpu, 0);
+		tdvmcall_a1_write(vcpu, 0);
+		tdvmcall_a2_write(vcpu, 0);
+	}
+	return 1;
+}
+
 static int tdx_map_gpa(struct kvm_vcpu *vcpu)
 {
 	struct kvm *kvm = vcpu->kvm;
@@ -1250,6 +1264,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
 		return tdx_emulate_rdmsr(vcpu);
 	case EXIT_REASON_MSR_WRITE:
 		return tdx_emulate_wrmsr(vcpu);
+	case TDG_VP_VMCALL_GET_TD_VM_CALL_INFO:
+		return tdx_get_td_vm_call_info(vcpu);
 	case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
 		/*
 		 * Exit to userspace device model for tear down.
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 101/113] KVM: TDX: Silently discard SMI request
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (99 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 100/113] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 102/113] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
                   ` (11 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX doesn't support system-management mode (SMM) or the system-management
interrupt (SMI) in guest TDs.  Because guest state (vcpu state, memory
state) is protected, changing it requires the TDX module APIs; injecting an
SMI and switching the vcpu into SMM would be such changes.  The TDX module
provides no way for the VMM to inject an SMI into a guest TD, nor a way for
the VMM to switch the guest vcpu mode into SMM.

There are two options for KVM when the guest TD or the device model
(e.g. QEMU) requests SMM or an SMI: 1) silently ignore the request or
2) return a meaningful error.

For simplicity, option 1) is implemented.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/smm.h         |  7 +++++-
 arch/x86/kvm/vmx/main.c    | 45 ++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.c     | 29 ++++++++++++++++++++++++
 arch/x86/kvm/vmx/x86_ops.h | 12 ++++++++++
 4 files changed, 88 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kvm/smm.h b/arch/x86/kvm/smm.h
index a1cf2ac5bd78..bc77902f5c18 100644
--- a/arch/x86/kvm/smm.h
+++ b/arch/x86/kvm/smm.h
@@ -142,7 +142,12 @@ union kvm_smram {
 
 static inline int kvm_inject_smi(struct kvm_vcpu *vcpu)
 {
-	kvm_make_request(KVM_REQ_SMI, vcpu);
+	/*
+	 * If SMM isn't supported (e.g. TDX), silently discard SMI request.
+	 * Assume that SMM supported = MSR_IA32_SMBASE supported.
+	 */
+	if (static_call(kvm_x86_has_emulated_msr)(vcpu->kvm, MSR_IA32_SMBASE))
+		kvm_make_request(KVM_REQ_SMI, vcpu);
 	return 0;
 }
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index cb79d64a2058..3651c32e6cad 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -219,6 +219,43 @@ static void vt_msr_filter_changed(struct kvm_vcpu *vcpu)
 	vmx_msr_filter_changed(vcpu);
 }
 
+#ifdef CONFIG_KVM_SMM
+static int vt_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_smi_allowed(vcpu, for_injection);
+
+	return vmx_smi_allowed(vcpu, for_injection);
+}
+
+static int vt_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_enter_smm(vcpu, smram);
+
+	return vmx_enter_smm(vcpu, smram);
+}
+
+static int vt_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+	if (unlikely(is_td_vcpu(vcpu)))
+		return tdx_leave_smm(vcpu, smram);
+
+	return vmx_leave_smm(vcpu, smram);
+}
+
+static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		tdx_enable_smi_window(vcpu);
+		return;
+	}
+
+	/* RSM will cause a vmexit anyway.  */
+	vmx_enable_smi_window(vcpu);
+}
+#endif
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -586,10 +623,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.setup_mce = vmx_setup_mce,
 
 #ifdef CONFIG_KVM_SMM
-	.smi_allowed = vmx_smi_allowed,
-	.enter_smm = vmx_enter_smm,
-	.leave_smm = vmx_leave_smm,
-	.enable_smi_window = vmx_enable_smi_window,
+	.smi_allowed = vt_smi_allowed,
+	.enter_smm = vt_enter_smm,
+	.leave_smm = vt_leave_smm,
+	.enable_smi_window = vt_enable_smi_window,
 #endif
 
 	.can_emulate_instruction = vmx_can_emulate_instruction,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 2f3206551c48..778d170b7549 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1818,6 +1818,35 @@ int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 	return 1;
 }
 
+#ifdef CONFIG_KVM_SMM
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection)
+{
+	/* SMI isn't supported for TDX. */
+	WARN_ON_ONCE(1);
+	return false;
+}
+
+int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram)
+{
+	/* smi_allowed() is always false for TDX as above. */
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram)
+{
+	WARN_ON_ONCE(1);
+	return 0;
+}
+
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
+{
+	/* SMI isn't supported for TDX.  Silently discard SMI request. */
+	WARN_ON_ONCE(1);
+	vcpu->arch.smi_pending = false;
+}
+#endif
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 3b747fb5bc20..d6c592d06baa 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -225,4 +225,16 @@ static inline int tdx_sept_tlb_remote_flush(struct kvm *kvm) { return 0; }
 static inline void tdx_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa, int root_level) {}
 #endif
 
+#if defined(CONFIG_INTEL_TDX_HOST) && defined(CONFIG_KVM_SMM)
+int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection);
+int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram);
+int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram);
+void tdx_enable_smi_window(struct kvm_vcpu *vcpu);
+#else
+static inline int tdx_smi_allowed(struct kvm_vcpu *vcpu, bool for_injection) { return false; }
+static inline int tdx_enter_smm(struct kvm_vcpu *vcpu, union kvm_smram *smram) { return 0; }
+static inline int tdx_leave_smm(struct kvm_vcpu *vcpu, const union kvm_smram *smram) { return 0; }
+static inline void tdx_enable_smi_window(struct kvm_vcpu *vcpu) {}
+#endif
+
 #endif /* __KVM_X86_VMX_X86_OPS_H */
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 102/113] KVM: TDX: Silently ignore INIT/SIPI
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (100 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 101/113] KVM: TDX: Silently discard SMI request isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 103/113] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
                   ` (10 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TDX module API doesn't provide a way for the VMM to inject an INIT IPI
or SIPI.  Instead it defines its own protocols to boot application
processors.  Ignore INIT and SIPI events for the TDX guest.

There are two options: 1) (silently) ignore the INIT/SIPI request or
2) return an error to the guest TD somehow.  Given that the TDX guest is
paravirtualized to boot APs, option 1 is chosen for simplicity.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/include/asm/kvm-x86-ops.h |  1 +
 arch/x86/include/asm/kvm_host.h    |  2 ++
 arch/x86/kvm/lapic.c               | 19 +++++++++++-------
 arch/x86/kvm/svm/svm.c             |  1 +
 arch/x86/kvm/vmx/main.c            | 32 ++++++++++++++++++++++++++++--
 5 files changed, 46 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
index aeff96543090..0cf928d12067 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -143,6 +143,7 @@ KVM_X86_OP_OPTIONAL(migrate_timers)
 KVM_X86_OP(msr_filter_changed)
 KVM_X86_OP(complete_emulated_msr)
 KVM_X86_OP(vcpu_deliver_sipi_vector)
+KVM_X86_OP(vcpu_deliver_init)
 KVM_X86_OP_OPTIONAL_RET0(vcpu_get_apicv_inhibit_reasons);
 
 #undef KVM_X86_OP
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f93d271aba67..75e53b2bb4af 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1775,6 +1775,7 @@ struct kvm_x86_ops {
 	int (*complete_emulated_msr)(struct kvm_vcpu *vcpu, int err);
 
 	void (*vcpu_deliver_sipi_vector)(struct kvm_vcpu *vcpu, u8 vector);
+	void (*vcpu_deliver_init)(struct kvm_vcpu *vcpu);
 
 	/*
 	 * Returns vCPU specific APICv inhibit reasons
@@ -1986,6 +1987,7 @@ void kvm_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
 void kvm_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
 int kvm_load_segment_descriptor(struct kvm_vcpu *vcpu, u16 selector, int seg);
 void kvm_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector);
+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu);
 
 int kvm_task_switch(struct kvm_vcpu *vcpu, u16 tss_selector, int idt_index,
 		    int reason, bool has_error_code, u32 error_code);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 9e5ffe355847..68a683fe8a2f 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -3032,6 +3032,16 @@ int kvm_lapic_set_pv_eoi(struct kvm_vcpu *vcpu, u64 data, unsigned long len)
 	return 0;
 }
 
+void kvm_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+	kvm_vcpu_reset(vcpu, true);
+	if (kvm_vcpu_is_bsp(vcpu))
+		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+	else
+		vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
+}
+EXPORT_SYMBOL_GPL(kvm_vcpu_deliver_init);
+
 int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 {
 	struct kvm_lapic *apic = vcpu->arch.apic;
@@ -3063,13 +3073,8 @@ int kvm_apic_accept_events(struct kvm_vcpu *vcpu)
 		return 0;
 	}
 
-	if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events)) {
-		kvm_vcpu_reset(vcpu, true);
-		if (kvm_vcpu_is_bsp(apic->vcpu))
-			vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
-		else
-			vcpu->arch.mp_state = KVM_MP_STATE_INIT_RECEIVED;
-	}
+	if (test_and_clear_bit(KVM_APIC_INIT, &apic->pending_events))
+		static_call(kvm_x86_vcpu_deliver_init)(vcpu);
 	if (test_and_clear_bit(KVM_APIC_SIPI, &apic->pending_events)) {
 		if (vcpu->arch.mp_state == KVM_MP_STATE_INIT_RECEIVED) {
 			/* evaluate pending_events before reading the vector */
diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 55f2e0a9b0f6..f9191d924212 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -4829,6 +4829,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
 	.complete_emulated_msr = svm_complete_emulated_msr,
 
 	.vcpu_deliver_sipi_vector = svm_vcpu_deliver_sipi_vector,
+	.vcpu_deliver_init = kvm_vcpu_deliver_init,
 	.vcpu_get_apicv_inhibit_reasons = avic_vcpu_get_apicv_inhibit_reasons,
 };
 
diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 3651c32e6cad..4be7b27bf579 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -256,6 +256,14 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
 }
 #endif
 
+static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return true;
+
+	return vmx_apic_init_signal_blocked(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -284,6 +292,25 @@ static void vt_deliver_interrupt(struct kvm_lapic *apic, int delivery_mode,
 	vmx_deliver_interrupt(apic, delivery_mode, trig_mode, vector);
 }
 
+static void vt_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	kvm_vcpu_deliver_sipi_vector(vcpu, vector);
+}
+
+static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu)) {
+		/* TDX doesn't support INIT.  Ignore INIT event */
+		vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
+		return;
+	}
+
+	kvm_vcpu_deliver_init(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -630,13 +657,14 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 #endif
 
 	.can_emulate_instruction = vmx_can_emulate_instruction,
-	.apic_init_signal_blocked = vmx_apic_init_signal_blocked,
+	.apic_init_signal_blocked = vt_apic_init_signal_blocked,
 	.migrate_timers = vmx_migrate_timers,
 
 	.msr_filter_changed = vt_msr_filter_changed,
 	.complete_emulated_msr = kvm_complete_insn_gp,
 
-	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
+	.vcpu_deliver_sipi_vector = vt_vcpu_deliver_sipi_vector,
+	.vcpu_deliver_init = vt_vcpu_deliver_init,
 
 	.dev_mem_enc_ioctl = tdx_dev_ioctl,
 	.mem_enc_ioctl = vt_mem_enc_ioctl,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 103/113] KVM: TDX: Add methods to ignore accesses to CPU state
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (101 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 102/113] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 104/113] KVM: TDX: Add methods to ignore guest instruction emulation isaku.yamahata
                   ` (9 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

From: Sean Christopherson <sean.j.christopherson@intel.com>

TDX protects TDX guest state from the VMM.  Implement the access methods
for TDX guest state so that they ignore accesses or return zero.  Because
those methods can be called by KVM ioctls to set/get CPU registers, they
don't have KVM_BUG_ON, with one exception.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 274 +++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.c     |  49 ++++++-
 arch/x86/kvm/vmx/x86_ops.h |  13 ++
 3 files changed, 304 insertions(+), 32 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 4be7b27bf579..c9d7d8fbd2d7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -311,6 +311,180 @@ static void vt_vcpu_deliver_init(struct kvm_vcpu *vcpu)
 	kvm_vcpu_deliver_init(vcpu);
 }
 
+static void vt_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_vcpu_after_set_cpuid(vcpu);
+}
+
+static void vt_update_exception_bitmap(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_update_exception_bitmap(vcpu);
+}
+
+static u64 vt_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_segment_base(vcpu, seg);
+
+	return vmx_get_segment_base(vcpu, seg);
+}
+
+static void vt_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			      int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_segment(vcpu, var, seg);
+
+	vmx_get_segment(vcpu, var, seg);
+}
+
+static void vt_set_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+			      int seg)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_segment(vcpu, var, seg);
+}
+
+static int vt_get_cpl(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_cpl(vcpu);
+
+	return vmx_get_cpl(vcpu);
+}
+
+static void vt_get_cs_db_l_bits(struct kvm_vcpu *vcpu, int *db, int *l)
+{
+	if (is_td_vcpu(vcpu)) {
+		*db = 0;
+		*l = 0;
+		return;
+	}
+
+	vmx_get_cs_db_l_bits(vcpu, db, l);
+}
+
+static void vt_set_cr0(struct kvm_vcpu *vcpu, unsigned long cr0)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_cr0(vcpu, cr0);
+}
+
+static void vt_set_cr4(struct kvm_vcpu *vcpu, unsigned long cr4)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_cr4(vcpu, cr4);
+}
+
+static int vt_set_efer(struct kvm_vcpu *vcpu, u64 efer)
+{
+	if (is_td_vcpu(vcpu))
+		return 0;
+
+	return vmx_set_efer(vcpu, efer);
+}
+
+static void vt_get_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(dt, 0, sizeof(*dt));
+		return;
+	}
+
+	vmx_get_idt(vcpu, dt);
+}
+
+static void vt_set_idt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_idt(vcpu, dt);
+}
+
+static void vt_get_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu)) {
+		memset(dt, 0, sizeof(*dt));
+		return;
+	}
+
+	vmx_get_gdt(vcpu, dt);
+}
+
+static void vt_set_gdt(struct kvm_vcpu *vcpu, struct desc_ptr *dt)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_gdt(vcpu, dt);
+}
+
+static void vt_set_dr7(struct kvm_vcpu *vcpu, unsigned long val)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_dr7(vcpu, val);
+}
+
+static void vt_sync_dirty_debug_regs(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * MOV-DR exiting is always cleared for TD guest, even in debug mode.
+	 * Thus KVM_DEBUGREG_WONT_EXIT can never be set and it should never
+	 * reach here for TD vcpu.
+	 */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_sync_dirty_debug_regs(vcpu);
+}
+
+static void vt_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_cache_reg(vcpu, reg);
+
+	return vmx_cache_reg(vcpu, reg);
+}
+
+static unsigned long vt_get_rflags(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_rflags(vcpu);
+
+	return vmx_get_rflags(vcpu);
+}
+
+static void vt_set_rflags(struct kvm_vcpu *vcpu, unsigned long rflags)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_rflags(vcpu, rflags);
+}
+
+static bool vt_get_if_flag(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_get_if_flag(vcpu);
+}
+
 static void vt_flush_tlb_all(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -452,6 +626,14 @@ static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 	vmx_inject_irq(vcpu, reinjected);
 }
 
+static void vt_inject_exception(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_inject_exception(vcpu);
+}
+
 static void vt_cancel_injection(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -484,14 +666,36 @@ static void vt_request_immediate_exit(struct kvm_vcpu *vcpu)
 	vmx_request_immediate_exit(vcpu);
 }
 
-static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
-			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 {
 	if (is_td_vcpu(vcpu))
-		return tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
-					 error_code);
+		return;
 
-	return vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+	vmx_update_cr8_intercept(vcpu, tpr, irr);
+}
+
+static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_load_eoi_exitmap(vcpu, eoi_exit_bitmap);
+}
+
+static int vt_set_tss_addr(struct kvm *kvm, unsigned int addr)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_set_tss_addr(kvm, addr);
+}
+
+static int vt_set_identity_map_addr(struct kvm *kvm, u64 ident_addr)
+{
+	if (is_td(kvm))
+		return 0;
+
+	return vmx_set_identity_map_addr(kvm, ident_addr);
 }
 
 static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
@@ -502,6 +706,16 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
 }
 
+static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
+			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_get_exit_info(vcpu, reason, info1, info2, intr_info,
+					 error_code);
+
+	return vmx_get_exit_info(vcpu, reason, info1, info2, intr_info, error_code);
+}
+
 static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
 {
 	if (!is_td(kvm))
@@ -546,29 +760,29 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.vcpu_load = vt_vcpu_load,
 	.vcpu_put = vt_vcpu_put,
 
-	.update_exception_bitmap = vmx_update_exception_bitmap,
+	.update_exception_bitmap = vt_update_exception_bitmap,
 	.get_msr_feature = vmx_get_msr_feature,
 	.get_msr = vt_get_msr,
 	.set_msr = vt_set_msr,
-	.get_segment_base = vmx_get_segment_base,
-	.get_segment = vmx_get_segment,
-	.set_segment = vmx_set_segment,
-	.get_cpl = vmx_get_cpl,
-	.get_cs_db_l_bits = vmx_get_cs_db_l_bits,
-	.set_cr0 = vmx_set_cr0,
+	.get_segment_base = vt_get_segment_base,
+	.get_segment = vt_get_segment,
+	.set_segment = vt_set_segment,
+	.get_cpl = vt_get_cpl,
+	.get_cs_db_l_bits = vt_get_cs_db_l_bits,
+	.set_cr0 = vt_set_cr0,
 	.is_valid_cr4 = vmx_is_valid_cr4,
-	.set_cr4 = vmx_set_cr4,
-	.set_efer = vmx_set_efer,
-	.get_idt = vmx_get_idt,
-	.set_idt = vmx_set_idt,
-	.get_gdt = vmx_get_gdt,
-	.set_gdt = vmx_set_gdt,
-	.set_dr7 = vmx_set_dr7,
-	.sync_dirty_debug_regs = vmx_sync_dirty_debug_regs,
-	.cache_reg = vmx_cache_reg,
-	.get_rflags = vmx_get_rflags,
-	.set_rflags = vmx_set_rflags,
-	.get_if_flag = vmx_get_if_flag,
+	.set_cr4 = vt_set_cr4,
+	.set_efer = vt_set_efer,
+	.get_idt = vt_get_idt,
+	.set_idt = vt_set_idt,
+	.get_gdt = vt_get_gdt,
+	.set_gdt = vt_set_gdt,
+	.set_dr7 = vt_set_dr7,
+	.sync_dirty_debug_regs = vt_sync_dirty_debug_regs,
+	.cache_reg = vt_cache_reg,
+	.get_rflags = vt_get_rflags,
+	.set_rflags = vt_set_rflags,
+	.get_if_flag = vt_get_if_flag,
 
 	.flush_tlb_all = vt_flush_tlb_all,
 	.flush_tlb_current = vt_flush_tlb_current,
@@ -587,7 +801,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.patch_hypercall = vmx_patch_hypercall,
 	.inject_irq = vt_inject_irq,
 	.inject_nmi = vt_inject_nmi,
-	.inject_exception = vmx_inject_exception,
+	.inject_exception = vt_inject_exception,
 	.cancel_injection = vt_cancel_injection,
 	.interrupt_allowed = vt_interrupt_allowed,
 	.nmi_allowed = vt_nmi_allowed,
@@ -595,11 +809,11 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.set_nmi_mask = vt_set_nmi_mask,
 	.enable_nmi_window = vt_enable_nmi_window,
 	.enable_irq_window = vt_enable_irq_window,
-	.update_cr8_intercept = vmx_update_cr8_intercept,
+	.update_cr8_intercept = vt_update_cr8_intercept,
 	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
 	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
 	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
-	.load_eoi_exitmap = vmx_load_eoi_exitmap,
+	.load_eoi_exitmap = vt_load_eoi_exitmap,
 	.apicv_post_state_restore = vt_apicv_post_state_restore,
 	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
 	.hwapic_irr_update = vmx_hwapic_irr_update,
@@ -610,13 +824,13 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
 	.protected_apic_has_interrupt = vt_protected_apic_has_interrupt,
 
-	.set_tss_addr = vmx_set_tss_addr,
-	.set_identity_map_addr = vmx_set_identity_map_addr,
+	.set_tss_addr = vt_set_tss_addr,
+	.set_identity_map_addr = vt_set_identity_map_addr,
 	.get_mt_mask = vt_get_mt_mask,
 
 	.get_exit_info = vt_get_exit_info,
 
-	.vcpu_after_set_cpuid = vmx_vcpu_after_set_cpuid,
+	.vcpu_after_set_cpuid = vt_vcpu_after_set_cpuid,
 
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
 
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 778d170b7549..6de0676cd509 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -3,6 +3,7 @@
 #include <linux/mmu_context.h>
 
 #include <asm/fpu/xcr.h>
+#include <asm/virtext.h>
 #include <asm/tdx.h>
 
 #include "capabilities.h"
@@ -553,8 +554,15 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
 
 	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
 	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
-	vcpu->arch.guest_state_protected =
-		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+	/*
+	 * TODO: support off-TD debug.  If TD DEBUG is enabled, guest state
+	 * can be accessed. guest_state_protected = false. and kvm ioctl to
+	 * access CPU states should be usable for user space VMM (e.g. qemu).
+	 *
+	 * vcpu->arch.guest_state_protected =
+	 *	!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
+	 */
+	vcpu->arch.guest_state_protected = true;
 
 	tdx->pi_desc.nv = POSTED_INTR_VECTOR;
 	tdx->pi_desc.sn = 1;
@@ -1847,6 +1855,43 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
 }
 #endif
 
+int tdx_get_cpl(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg)
+{
+	kvm_register_mark_available(vcpu, reg);
+	switch (reg) {
+	case VCPU_REGS_RSP:
+	case VCPU_REGS_RIP:
+	case VCPU_EXREG_PDPTR:
+	case VCPU_EXREG_CR0:
+	case VCPU_EXREG_CR3:
+	case VCPU_EXREG_CR4:
+		break;
+	default:
+		KVM_BUG_ON(1, vcpu->kvm);
+		break;
+	}
+}
+
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu)
+{
+	return 0;
+}
+
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg)
+{
+	return 0;
+}
+
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg)
+{
+	memset(var, 0, sizeof(*var));
+}
+
 int tdx_dev_ioctl(void __user *argp)
 {
 	struct kvm_tdx_capabilities __user *user_caps;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index d6c592d06baa..74182190b43f 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -175,6 +175,12 @@ bool tdx_has_emulated_msr(u32 index, bool write);
 int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 
+int tdx_get_cpl(struct kvm_vcpu *vcpu);
+void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
+unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu);
+u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg);
+void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var, int seg);
+
 int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
 int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
 
@@ -217,6 +223,13 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
 static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 
+static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
+static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
+static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
+static inline u64 tdx_get_segment_base(struct kvm_vcpu *vcpu, int seg) { return 0; }
+static inline void tdx_get_segment(struct kvm_vcpu *vcpu, struct kvm_segment *var,
+				   int seg) {}
+
 static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
 static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 104/113] KVM: TDX: Add methods to ignore guest instruction emulation
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (102 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 103/113] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 105/113] KVM: TDX: Add a method to ignore dirty logging isaku.yamahata
                   ` (8 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because TDX protects TDX guest state from the VMM, instructions in guest
memory cannot be emulated.  Implement methods that ignore the guest
instruction emulator.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 28 ++++++++++++++++++++++++++--
 1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index c9d7d8fbd2d7..47c2b6e1e484 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -256,6 +256,30 @@ static void vt_enable_smi_window(struct kvm_vcpu *vcpu)
 }
 #endif
 
+static bool vt_can_emulate_instruction(struct kvm_vcpu *vcpu, int emul_type,
+				       void *insn, int insn_len)
+{
+	if (is_td_vcpu(vcpu))
+		return false;
+
+	return vmx_can_emulate_instruction(vcpu, emul_type, insn, insn_len);
+}
+
+static int vt_check_intercept(struct kvm_vcpu *vcpu,
+				 struct x86_instruction_info *info,
+				 enum x86_intercept_stage stage,
+				 struct x86_exception *exception)
+{
+	/*
+	 * This call back is triggered by the x86 instruction emulator. TDX
+	 * doesn't allow guest memory inspection.
+	 */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return X86EMUL_UNHANDLEABLE;
+
+	return vmx_check_intercept(vcpu, info, stage, exception);
+}
+
 static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -841,7 +865,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.load_mmu_pgd = vt_load_mmu_pgd,
 
-	.check_intercept = vmx_check_intercept,
+	.check_intercept = vt_check_intercept,
 	.handle_exit_irqoff = vt_handle_exit_irqoff,
 
 	.request_immediate_exit = vt_request_immediate_exit,
@@ -870,7 +894,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.enable_smi_window = vt_enable_smi_window,
 #endif
 
-	.can_emulate_instruction = vmx_can_emulate_instruction,
+	.can_emulate_instruction = vt_can_emulate_instruction,
 	.apic_init_signal_blocked = vt_apic_init_signal_blocked,
 	.migrate_timers = vmx_migrate_timers,
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 105/113] KVM: TDX: Add a method to ignore dirty logging
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (103 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 104/113] KVM: TDX: Add methods to ignore guest instruction emulation isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 106/113] KVM: TDX: Add methods to ignore VMX preemption timer isaku.yamahata
                   ` (7 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Currently TDX KVM doesn't support tracking dirty pages.  Implement a method
to ignore it.  Because the kvm memory slot flag that enables dirty logging
isn't accepted for TDX, warn if the method is called for a TDX guest.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 47c2b6e1e484..cbac63170cea 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -730,6 +730,14 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
 }
 
+static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
+{
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_update_cpu_dirty_logging(vcpu);
+}
+
 static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
@@ -873,7 +881,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.sched_in = vt_sched_in,
 
 	.cpu_dirty_log_size = PML_ENTITY_NUM,
-	.update_cpu_dirty_logging = vmx_update_cpu_dirty_logging,
+	.update_cpu_dirty_logging = vt_update_cpu_dirty_logging,
 
 	.nested_ops = &vmx_nested_ops,
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 106/113] KVM: TDX: Add methods to ignore VMX preemption timer
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (104 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 105/113] KVM: TDX: Add a method to ignore dirty logging isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 107/113] KVM: TDX: Add methods to ignore accesses to TSC isaku.yamahata
                   ` (6 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX doesn't support the VMX preemption timer.  Implement access methods
that let the VMM ignore the VMX preemption timer.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 25 +++++++++++++++++++++++--
 1 file changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index cbac63170cea..2d2738e8c0b1 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -738,6 +738,27 @@ static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 	vmx_update_cpu_dirty_logging(vcpu);
 }
 
+#ifdef CONFIG_X86_64
+static int vt_set_hv_timer(struct kvm_vcpu *vcpu, u64 guest_deadline_tsc,
+			      bool *expired)
+{
+	/* VMX-preemption timer isn't available for TDX. */
+	if (is_td_vcpu(vcpu))
+		return -EINVAL;
+
+	return vmx_set_hv_timer(vcpu, guest_deadline_tsc, expired);
+}
+
+static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
+{
+	/* VMX-preemption timer can't be set.  See vt_set_hv_timer(). */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_cancel_hv_timer(vcpu);
+}
+#endif
+
 static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
@@ -889,8 +910,8 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.pi_start_assignment = vmx_pi_start_assignment,
 
 #ifdef CONFIG_X86_64
-	.set_hv_timer = vmx_set_hv_timer,
-	.cancel_hv_timer = vmx_cancel_hv_timer,
+	.set_hv_timer = vt_set_hv_timer,
+	.cancel_hv_timer = vt_cancel_hv_timer,
 #endif
 
 	.setup_mce = vmx_setup_mce,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 107/113] KVM: TDX: Add methods to ignore accesses to TSC
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (105 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 106/113] KVM: TDX: Add methods to ignore VMX preemption timer isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 108/113] KVM: TDX: Ignore setting up mce isaku.yamahata
                   ` (5 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX protects the guest TSC state from the VMM.  Implement access methods
that ignore accesses to the guest TSC.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 44 +++++++++++++++++++++++++++++++++++++----
 1 file changed, 40 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 2d2738e8c0b1..bb9fac604ea7 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -730,6 +730,42 @@ static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
 	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
 }
 
+static u64 vt_get_l2_tsc_offset(struct kvm_vcpu *vcpu)
+{
+	/* TDX doesn't support L2 guest at the moment. */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_get_l2_tsc_offset(vcpu);
+}
+
+static u64 vt_get_l2_tsc_multiplier(struct kvm_vcpu *vcpu)
+{
+	/* TDX doesn't support L2 guest at the moment. */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return 0;
+
+	return vmx_get_l2_tsc_multiplier(vcpu);
+}
+
+static void vt_write_tsc_offset(struct kvm_vcpu *vcpu, u64 offset)
+{
+	/* In TDX, tsc offset can't be changed. */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_write_tsc_offset(vcpu, offset);
+}
+
+static void vt_write_tsc_multiplier(struct kvm_vcpu *vcpu, u64 multiplier)
+{
+	/* In TDX, tsc multiplier can't be changed. */
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_write_tsc_multiplier(vcpu, multiplier);
+}
+
 static void vt_update_cpu_dirty_logging(struct kvm_vcpu *vcpu)
 {
 	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
@@ -887,10 +923,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 
 	.has_wbinvd_exit = cpu_has_vmx_wbinvd_exit,
 
-	.get_l2_tsc_offset = vmx_get_l2_tsc_offset,
-	.get_l2_tsc_multiplier = vmx_get_l2_tsc_multiplier,
-	.write_tsc_offset = vmx_write_tsc_offset,
-	.write_tsc_multiplier = vmx_write_tsc_multiplier,
+	.get_l2_tsc_offset = vt_get_l2_tsc_offset,
+	.get_l2_tsc_multiplier = vt_get_l2_tsc_multiplier,
+	.write_tsc_offset = vt_write_tsc_offset,
+	.write_tsc_multiplier = vt_write_tsc_multiplier,
 
 	.load_mmu_pgd = vt_load_mmu_pgd,
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 108/113] KVM: TDX: Ignore setting up mce
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (106 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 107/113] KVM: TDX: Add methods to ignore accesses to TSC isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 109/113] KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch isaku.yamahata
                   ` (4 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

The vmx_setup_mce() function is VMX specific and cannot be used for TDX.
Add a vt stub that ignores MCE setup for TDX.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index bb9fac604ea7..b4d1cc3736a6 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -795,6 +795,14 @@ static void vt_cancel_hv_timer(struct kvm_vcpu *vcpu)
 }
 #endif
 
+static void vt_setup_mce(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_setup_mce(vcpu);
+}
+
 static void vt_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 			u64 *info1, u64 *info2, u32 *intr_info, u32 *error_code)
 {
@@ -950,7 +958,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.cancel_hv_timer = vt_cancel_hv_timer,
 #endif
 
-	.setup_mce = vmx_setup_mce,
+	.setup_mce = vt_setup_mce,
 
 #ifdef CONFIG_KVM_SMM
 	.smi_allowed = vt_smi_allowed,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 109/113] KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (107 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 108/113] KVM: TDX: Ignore setting up mce isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 110/113] KVM: TDX: Add methods to ignore virtual apic related operation isaku.yamahata
                   ` (3 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because guest TD memory is protected, the VMM can't patch the guest binary
to install the hypercall instruction.  Add a method that ignores hypercall
patching with a warning.  Note: the guest TD kernel needs to be modified to
use TDG.VP.VMCALL for hypercalls.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index b4d1cc3736a6..994ead1b6788 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -642,6 +642,19 @@ static u32 vt_get_interrupt_shadow(struct kvm_vcpu *vcpu)
 	return vmx_get_interrupt_shadow(vcpu);
 }
 
+static void vt_patch_hypercall(struct kvm_vcpu *vcpu,
+				  unsigned char *hypercall)
+{
+	/*
+	 * Because guest memory is protected, the guest can't be patched. The TD
+	 * kernel is modified to use TDG.VP.VMCALL for hypercalls.
+	 */
+	if (KVM_BUG_ON(is_td_vcpu(vcpu), vcpu->kvm))
+		return;
+
+	vmx_patch_hypercall(vcpu, hypercall);
+}
+
 static void vt_inject_irq(struct kvm_vcpu *vcpu, bool reinjected)
 {
 	if (is_td_vcpu(vcpu))
@@ -895,7 +908,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.update_emulated_instruction = vmx_update_emulated_instruction,
 	.set_interrupt_shadow = vt_set_interrupt_shadow,
 	.get_interrupt_shadow = vt_get_interrupt_shadow,
-	.patch_hypercall = vmx_patch_hypercall,
+	.patch_hypercall = vt_patch_hypercall,
 	.inject_irq = vt_inject_irq,
 	.inject_nmi = vt_inject_nmi,
 	.inject_exception = vt_inject_exception,
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 110/113] KVM: TDX: Add methods to ignore virtual apic related operation
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (108 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 109/113] KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:32 ` [PATCH v11 111/113] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) isaku.yamahata
                   ` (2 subsequent siblings)
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX protects the guest APIC state from the VMM.  Implement the vAPIC access
methods to either ignore the access or return zero for TDX guests.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 arch/x86/kvm/vmx/main.c    | 61 ++++++++++++++++++++++++++++++++++----
 arch/x86/kvm/vmx/tdx.c     |  6 ++++
 arch/x86/kvm/vmx/x86_ops.h |  3 ++
 3 files changed, 64 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
index 994ead1b6788..cb7f4799eb70 100644
--- a/arch/x86/kvm/vmx/main.c
+++ b/arch/x86/kvm/vmx/main.c
@@ -288,6 +288,14 @@ static bool vt_apic_init_signal_blocked(struct kvm_vcpu *vcpu)
 	return vmx_apic_init_signal_blocked(vcpu);
 }
 
+static void vt_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return tdx_set_virtual_apic_mode(vcpu);
+
+	return vmx_set_virtual_apic_mode(vcpu);
+}
+
 static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 {
 	struct pi_desc *pi = vcpu_to_pi_desc(vcpu);
@@ -296,6 +304,31 @@ static void vt_apicv_post_state_restore(struct kvm_vcpu *vcpu)
 	memset(pi->pir, 0, sizeof(pi->pir));
 }
 
+static void vt_hwapic_irr_update(struct kvm_vcpu *vcpu, int max_irr)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	return vmx_hwapic_irr_update(vcpu, max_irr);
+}
+
+static void vt_hwapic_isr_update(int max_isr)
+{
+	if (is_td_vcpu(kvm_get_running_vcpu()))
+		return;
+
+	return vmx_hwapic_isr_update(max_isr);
+}
+
+static bool vt_guest_apic_has_interrupt(struct kvm_vcpu *vcpu)
+{
+	/* TDX doesn't support L2 at the moment. */
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return false;
+
+	return vmx_guest_apic_has_interrupt(vcpu);
+}
+
 static int vt_sync_pir_to_irr(struct kvm_vcpu *vcpu)
 {
 	if (is_td_vcpu(vcpu))
@@ -711,6 +744,22 @@ static void vt_update_cr8_intercept(struct kvm_vcpu *vcpu, int tpr, int irr)
 	vmx_update_cr8_intercept(vcpu, tpr, irr);
 }
 
+static void vt_set_apic_access_page_addr(struct kvm_vcpu *vcpu)
+{
+	if (is_td_vcpu(vcpu))
+		return;
+
+	vmx_set_apic_access_page_addr(vcpu);
+}
+
+static void vt_refresh_apicv_exec_ctrl(struct kvm_vcpu *vcpu)
+{
+	if (WARN_ON_ONCE(is_td_vcpu(vcpu)))
+		return;
+
+	vmx_refresh_apicv_exec_ctrl(vcpu);
+}
+
 static void vt_load_eoi_exitmap(struct kvm_vcpu *vcpu, u64 *eoi_exit_bitmap)
 {
 	if (is_td_vcpu(vcpu))
@@ -920,15 +969,15 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
 	.enable_nmi_window = vt_enable_nmi_window,
 	.enable_irq_window = vt_enable_irq_window,
 	.update_cr8_intercept = vt_update_cr8_intercept,
-	.set_virtual_apic_mode = vmx_set_virtual_apic_mode,
-	.set_apic_access_page_addr = vmx_set_apic_access_page_addr,
-	.refresh_apicv_exec_ctrl = vmx_refresh_apicv_exec_ctrl,
+	.set_virtual_apic_mode = vt_set_virtual_apic_mode,
+	.set_apic_access_page_addr = vt_set_apic_access_page_addr,
+	.refresh_apicv_exec_ctrl = vt_refresh_apicv_exec_ctrl,
 	.load_eoi_exitmap = vt_load_eoi_exitmap,
 	.apicv_post_state_restore = vt_apicv_post_state_restore,
 	.check_apicv_inhibit_reasons = vmx_check_apicv_inhibit_reasons,
-	.hwapic_irr_update = vmx_hwapic_irr_update,
-	.hwapic_isr_update = vmx_hwapic_isr_update,
-	.guest_apic_has_interrupt = vmx_guest_apic_has_interrupt,
+	.hwapic_irr_update = vt_hwapic_irr_update,
+	.hwapic_isr_update = vt_hwapic_isr_update,
+	.guest_apic_has_interrupt = vt_guest_apic_has_interrupt,
 	.sync_pir_to_irr = vt_sync_pir_to_irr,
 	.deliver_interrupt = vt_deliver_interrupt,
 	.dy_apicv_has_pending_interrupt = pi_has_pending_interrupt,
diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index 6de0676cd509..487ba90a0b7c 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -1855,6 +1855,12 @@ void tdx_enable_smi_window(struct kvm_vcpu *vcpu)
 }
 #endif
 
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu)
+{
+	/* Only x2APIC mode is supported for TD. */
+	WARN_ON_ONCE(kvm_get_apic_mode(vcpu) != LAPIC_MODE_X2APIC);
+}
+
 int tdx_get_cpl(struct kvm_vcpu *vcpu)
 {
 	return 0;
diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
index 74182190b43f..c690a0182e6b 100644
--- a/arch/x86/kvm/vmx/x86_ops.h
+++ b/arch/x86/kvm/vmx/x86_ops.h
@@ -174,6 +174,7 @@ void tdx_get_exit_info(struct kvm_vcpu *vcpu, u32 *reason,
 bool tdx_has_emulated_msr(u32 index, bool write);
 int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr);
+void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu);
 
 int tdx_get_cpl(struct kvm_vcpu *vcpu);
 void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg);
@@ -223,6 +224,8 @@ static inline bool tdx_has_emulated_msr(u32 index, bool write) { return false; }
 static inline int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 static inline int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr) { return 1; }
 
+static inline void tdx_set_virtual_apic_mode(struct kvm_vcpu *vcpu) {}
+
 static inline int tdx_get_cpl(struct kvm_vcpu *vcpu) { return 0; }
 static inline void tdx_cache_reg(struct kvm_vcpu *vcpu, enum kvm_reg reg) {}
 static inline unsigned long tdx_get_rflags(struct kvm_vcpu *vcpu) { return 0; }
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 111/113] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX)
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (109 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 110/113] KVM: TDX: Add methods to ignore virtual apic related operation isaku.yamahata
@ 2023-01-12 16:32 ` isaku.yamahata
  2023-01-12 16:33 ` [PATCH v11 112/113] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU isaku.yamahata
  2023-01-12 16:33 ` [PATCH v11 113/113] [MARKER] the end of (the first phase of) TDX KVM patch series isaku.yamahata
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:32 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add documentation for Intel Trust Domain Extensions (TDX) support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 Documentation/virt/kvm/api.rst       |   9 +-
 Documentation/virt/kvm/index.rst     |   2 +
 Documentation/virt/kvm/intel-tdx.rst | 347 +++++++++++++++++++++++++++
 3 files changed, 357 insertions(+), 1 deletion(-)
 create mode 100644 Documentation/virt/kvm/intel-tdx.rst

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index d2baa05f7c04..0b5a64f3e335 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1401,6 +1401,9 @@ the memory region are automatically reflected into the guest.  For example, an
 mmap() that affects the region will be made visible immediately.  Another
 example is madvise(MADV_DROP).
 
+For a TDX guest, deleting or moving a memory region loses the guest memory
+contents.  Read-only regions aren't supported.  Only as-id 0 is supported.
+
 
 4.36 KVM_SET_TSS_ADDR
 ---------------------
@@ -4682,7 +4685,7 @@ H_GET_CPU_CHARACTERISTICS hypercall.
 
 :Capability: basic
 :Architectures: x86
-:Type: vm
+:Type: vm ioctl, vcpu ioctl
 :Parameters: an opaque platform specific structure (in/out)
 :Returns: 0 on success; -1 on error
 
@@ -4694,6 +4697,10 @@ Currently, this ioctl is used for issuing Secure Encrypted Virtualization
 (SEV) commands on AMD Processors. The SEV commands are defined in
 Documentation/virt/kvm/x86/amd-memory-encryption.rst.
 
+This ioctl is also used for issuing Trust Domain Extensions
+(TDX) commands on Intel processors.  The TDX commands are defined in
+Documentation/virt/kvm/intel-tdx.rst.
+
 4.111 KVM_MEMORY_ENCRYPT_REG_REGION
 -----------------------------------
 
diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index ad13ec55ddfe..20a2ab8fc78c 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -19,3 +19,5 @@ KVM
    vcpu-requests
    halt-polling
    review-checklist
+
+   intel-tdx
diff --git a/Documentation/virt/kvm/intel-tdx.rst b/Documentation/virt/kvm/intel-tdx.rst
new file mode 100644
index 000000000000..40e92aa2efea
--- /dev/null
+++ b/Documentation/virt/kvm/intel-tdx.rst
@@ -0,0 +1,347 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+===================================
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Overview
+========
+TDX stands for Trust Domain Extensions, which isolates VMs from
+the virtual-machine manager (VMM)/hypervisor and any other software on
+the platform. For details, see the specifications [1]_, whitepaper [2]_,
+architectural extensions specification [3]_, module documentation [4]_,
+loader interface specification [5]_, guest-hypervisor communication
+interface [6]_, virtual firmware design guide [7]_, and other resources
+([8]_, [9]_, [10]_, [11]_, and [12]_).
+
+
+API description
+===============
+
+KVM_MEMORY_ENCRYPT_OP
+---------------------
+:Type: vm ioctl, vcpu ioctl
+
+For TDX operations, KVM_MEMORY_ENCRYPT_OP is re-purposed to be a generic
+ioctl with TDX-specific sub-ioctl commands.
+
+::
+
+  /* Trust Domain eXtension sub-ioctl() commands. */
+  enum kvm_tdx_cmd_id {
+          KVM_TDX_CAPABILITIES = 0,
+          KVM_TDX_INIT_VM,
+          KVM_TDX_INIT_VCPU,
+          KVM_TDX_INIT_MEM_REGION,
+          KVM_TDX_FINALIZE_VM,
+
+          KVM_TDX_CMD_NR_MAX,
+  };
+
+  struct kvm_tdx_cmd {
+        /* enum kvm_tdx_cmd_id */
+        __u32 id;
+        /* flags for sub-command.  If the sub-command doesn't use this, set zero. */
+        __u32 flags;
+        /*
+         * data for each sub-command. An immediate or a pointer to the actual
+         * data in process virtual address.  If sub-command doesn't use it,
+         * set zero.
+         */
+        __u64 data;
+        /*
+         * Auxiliary error code.  The sub-command may return TDX SEAMCALL
+         * status code in addition to -Exxx.
+         * Defined for consistency with struct kvm_sev_cmd.
+         */
+        __u64 error;
+        /* Reserved: Defined for consistency with struct kvm_sev_cmd. */
+        __u64 unused;
+  };
+
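+All sub-commands funnel through this single ioctl.  The following is a minimal,
+illustrative userspace sketch (not part of KVM) of issuing a TDX sub-command;
+the helper name, headers, and error handling are assumptions for the example
+only:
+
+::
+
+  #include <stdio.h>
+  #include <sys/ioctl.h>
+  #include <linux/kvm.h>
+
+  /* 'fd' is a VM or vCPU file descriptor, depending on the sub-command. */
+  static int tdx_ioctl(int fd, __u32 id, __u32 flags, void *data)
+  {
+          struct kvm_tdx_cmd cmd = {
+                  .id = id,
+                  .flags = flags,
+                  .data = (__u64)(unsigned long)data,
+          };
+          int ret = ioctl(fd, KVM_MEMORY_ENCRYPT_OP, &cmd);
+
+          if (ret)
+                  fprintf(stderr, "TDX cmd %u failed: ret %d, error 0x%llx\n",
+                          id, ret, (unsigned long long)cmd.error);
+          return ret;
+  }
+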
+KVM_TDX_CAPABILITIES
+--------------------
+:Type: vm ioctl
+
+A subset of TDSYSINFO_STRUCT retrieved by the TDH.SYS.INFO TDX SEAMCALL is
+returned, which describes the Intel TDX module.
+
+- id: KVM_TDX_CAPABILITIES
+- flags: must be 0
+- data: pointer to struct kvm_tdx_capabilities
+- error: must be 0
+- unused: must be 0
+
+::
+
+  struct kvm_tdx_cpuid_config {
+          __u32 leaf;
+          __u32 sub_leaf;
+          __u32 eax;
+          __u32 ebx;
+          __u32 ecx;
+          __u32 edx;
+  };
+
+  struct kvm_tdx_capabilities {
+          __u64 attrs_fixed0;
+          __u64 attrs_fixed1;
+          __u64 xfam_fixed0;
+          __u64 xfam_fixed1;
+
+          __u32 nr_cpuid_configs;
+          struct kvm_tdx_cpuid_config cpuid_configs[0];
+  };
+
+
+KVM_TDX_INIT_VM
+---------------
+:Type: vm ioctl
+
+Does additional VM initialization specific to TDX, which corresponds to the
+TDH.MNG.INIT TDX SEAMCALL.
+
+- id: KVM_TDX_INIT_VM
+- flags: must be 0
+- data: pointer to struct kvm_tdx_init_vm
+- error: must be 0
+- unused: must be 0
+
+::
+
+  struct kvm_tdx_init_vm {
+          __u32 max_vcpus;
+          __u32 reserved;
+          __u64 attributes;
+          __u64 cpuid;  /* pointer to struct kvm_cpuid2 */
+          __u64 mrconfigid[6];          /* sha384 digest */
+          __u64 mrowner[6];             /* sha384 digest */
+          __u64 mrownerconfig[6];       /* sha384 digest */
+          __u64 reserved[43];           /* must be zero for future extensibility */
+  };
+
+
+KVM_TDX_INIT_VCPU
+-----------------
+:Type: vcpu ioctl
+
+Does additional VCPU initialization specific to TDX, which corresponds to the
+TDH.VP.INIT TDX SEAMCALL.
+
+- id: KVM_TDX_INIT_VCPU
+- flags: must be 0
+- data: initial value of the guest TD VCPU RCX
+- error: must be 0
+- unused: must be 0
+
+KVM_TDX_INIT_MEM_REGION
+-----------------------
+:Type: vm ioctl
+
+Encrypt a contiguous memory region, which corresponds to the TDH.MEM.PAGE.ADD
+TDX SEAMCALL.
+If the KVM_TDX_MEASURE_MEMORY_REGION flag is specified, it also extends the
+measurement, which corresponds to the TDH.MR.EXTEND TDX SEAMCALL.
+
+- id: KVM_TDX_INIT_MEM_REGION
+- flags: currently only KVM_TDX_MEASURE_MEMORY_REGION is defined
+- data: pointer to struct kvm_tdx_init_mem_region
+- error: must be 0
+- unused: must be 0
+
+::
+
+  #define KVM_TDX_MEASURE_MEMORY_REGION   (1UL << 0)
+
+  struct kvm_tdx_init_mem_region {
+          __u64 source_addr;
+          __u64 gpa;
+          __u64 nr_pages;
+  };
+
+
+KVM_TDX_FINALIZE_VM
+-------------------
+:Type: vm ioctl
+
+Complete measurement of the initial TD contents and mark it ready to run,
+which corresponds to the TDH.MR.FINALIZE TDX SEAMCALL.
+
+- id: KVM_TDX_FINALIZE_VM
+- flags: must be 0
+- data: must be 0
+- error: must be 0
+- unused: must be 0
+
+KVM TDX creation flow
+=====================
+In addition to the normal KVM flow, new TDX ioctls need to be called.  The control
+flow looks like the following (a userspace sketch follows the list).
+
+#. system-wide capability check
+
+   * KVM_CAP_VM_TYPES: check if the VM type capability is supported and if
+     TDX_VM_TYPE is supported.
+
+#. creating VM
+
+   * KVM_CREATE_VM
+   * KVM_TDX_CAPABILITIES: query if TDX is supported on the platform.
+   * KVM_TDX_INIT_VM: pass TDX specific VM parameters.
+
+#. creating VCPU
+
+   * KVM_CREATE_VCPU
+   * KVM_TDX_INIT_VCPU: pass TDX specific VCPU parameters.
+
+#. initializing guest memory
+
+   * allocate guest memory and initialize pages the same as in the normal KVM case.
+     In the TDX case, additionally parse and load TDVF into guest memory.
+   * KVM_TDX_INIT_MEM_REGION to add and measure guest pages.
+     If pages have initial contents as above, those pages need to be added.
+     Otherwise the contents will be lost and the guest sees zero pages.
+   * KVM_TDX_FINALIZE_VM: finalize the VM and its measurement.
+     This must come after KVM_TDX_INIT_MEM_REGION.
+
+#. run vcpu
+
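+A condensed userspace sketch of this flow is shown below.  It is illustrative
+only: it reuses the hypothetical tdx_ioctl() helper from the API section, the
+VM type constant name and the memslot/TDVF setup are assumed, and error
+handling is omitted.
+
+::
+
+  vm_fd = ioctl(kvm_fd, KVM_CREATE_VM, KVM_X86_TDX_VM);     /* type name assumed */
+
+  tdx_ioctl(vm_fd, KVM_TDX_CAPABILITIES, 0, caps);          /* query TDX module */
+  tdx_ioctl(vm_fd, KVM_TDX_INIT_VM, 0, &init_vm);           /* attributes, CPUID */
+
+  vcpu_fd = ioctl(vm_fd, KVM_CREATE_VCPU, 0);
+  tdx_ioctl(vcpu_fd, KVM_TDX_INIT_VCPU, 0, (void *)initial_rcx); /* immediate in .data */
+
+  /* create memslots and load TDVF into the source buffer, then: */
+  tdx_ioctl(vm_fd, KVM_TDX_INIT_MEM_REGION, KVM_TDX_MEASURE_MEMORY_REGION,
+            &mem_region);
+  tdx_ioctl(vm_fd, KVM_TDX_FINALIZE_VM, 0, NULL);
+
+  ioctl(vcpu_fd, KVM_RUN, 0);
+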
+Design discussion
+=================
+
+Coexistence of normal(VMX) VM and TD VM
+---------------------------------------
+It's required to allow both legacy (normal VMX) VMs and new TD VMs to
+coexist.  Otherwise the benefits of VM flexibility would be eliminated.
+The main issue is that the logic of the kvm_x86_ops callbacks for TDX
+differs from VMX, while kvm_x86_ops is a single global variable, not
+per-VM or per-vcpu.
+
+Several points to be considered:
+
+  * No or minimal overhead when TDX is disabled (CONFIG_INTEL_TDX_HOST=n).
+  * Avoid the overhead of indirect calls via function pointers.
+  * Contain the changes under the arch/x86/kvm/vmx directory and share logic
+    with VMX for maintenance.
+    Even though the way to operate on a VM (VMX instructions vs. TDX
+    SEAMCALLs) differs, the basic idea remains the same, so much of the
+    logic can be shared.
+  * Future maintenance
+    A huge change of kvm_x86_ops isn't expected in the (near) future, so
+    a centralized file is acceptable.
+
+- Wrapping kvm x86_ops: the current choice
+
+  Introduce a dedicated file, arch/x86/kvm/vmx/main.c (the name main.c is
+  just chosen to show the main entry points for callbacks), and wrapper
+  functions around all the callbacks with
+  "if (is-tdx) tdx-callback() else vmx-callback()".  A minimal sketch of
+  this pattern follows the pros and cons below.
+
+  Pros:
+
+  - No major change in common x86 KVM code. The change is (mostly)
+    contained under arch/x86/kvm/vmx/.
+  - When TDX is disabled (CONFIG_INTEL_TDX_HOST=n), the overhead is
+    optimized out.
+  - Micro-optimization by avoiding function pointers.
+
+  Cons:
+
+  - A lot of boilerplate in arch/x86/kvm/vmx/main.c.
+
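+  As a minimal sketch of this pattern (mirroring wrappers added by this series,
+  such as vt_setup_mce(); the "some_callback" names are placeholders):
+
+  ::
+
+    static void vt_some_callback(struct kvm_vcpu *vcpu)
+    {
+            /* TDX either has its own handler or the operation is a nop. */
+            if (is_td_vcpu(vcpu))
+                    return tdx_some_callback(vcpu);
+
+            vmx_some_callback(vcpu);
+    }
+
+    /* and in vt_x86_ops:  .some_callback = vt_some_callback, */
+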
+KVM MMU Changes
+---------------
+KVM MMU needs to be enhanced to handle Secure/Shared-EPT.  The
+high-level execution flow is mostly the same as the normal EPT case:
+EPT violation/misconfiguration -> invoke TDP fault handler ->
+resolve TDP fault -> resume execution (or emulate MMIO).
+The difference is that the S-EPT is operated on (read/write) via TDX
+SEAMCALLs, which is expensive, instead of directly reading/writing the
+EPT entry.  One GPA bit (bit 51 or 47) is repurposed to mean shared
+with the host (if set to 1) or private to the TD (if cleared to 0).
+
+- The current implementation
+
+  * Reuse the existing MMU code with minimal updates, because the
+    execution flow is mostly the same.  An additional operation, the TDX
+    call for the S-EPT, is needed, so add hooks for it to kvm_x86_ops.
+  * For performance, minimize the TDX SEAMCALLs to operate on the S-EPT.
+    When getting the corresponding S-EPT pages/entry from a faulting GPA,
+    don't use a TDX SEAMCALL to read the S-EPT entry.  Instead, create a
+    shadow copy in host memory.
+    Repurpose the existing kvm_mmu_page as the shadow copy of the S-EPT
+    and associate the S-EPT with it.
+  * Treat the shared bit as a GPA attribute.  Mask/unmask the bit where
+    necessary to keep the existing traversing code working.
+    Introduce kvm.arch.gfn_shared_mask and use "if (gfn_shared_mask)"
+    for the special case (see the sketch at the end of this section).
+
+    * 0 for the non-TDX case
+    * bit 51 or 47 set for the TDX case.
+
+  Pros:
+
+  - Large code reuse with minimal new hooks.
+  - Execution path is same.
+
+  Cons:
+
+  - Complicates the existing code.
+  - Repurposing kvm_mmu_page as a shadow of the Secure EPT can be confusing.
+
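+A minimal sketch of the shared-bit handling (kvm_gfn_shared_mask() exists in
+this series; the conversion helpers here are illustrative):
+
+::
+
+  static inline gfn_t kvm_gfn_shared_mask(const struct kvm *kvm)
+  {
+          /* 0 for non-TDX VMs; for TDs, GPA bit 51 or 47 expressed as a gfn mask. */
+          return kvm->arch.gfn_shared_mask;
+  }
+
+  static inline gfn_t kvm_gfn_to_shared(const struct kvm *kvm, gfn_t gfn)
+  {
+          return gfn | kvm_gfn_shared_mask(kvm);
+  }
+
+  static inline gfn_t kvm_gfn_to_private(const struct kvm *kvm, gfn_t gfn)
+  {
+          return gfn & ~kvm_gfn_shared_mask(kvm);
+  }
+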
+New KVM API, ioctl (sub)command, to manage TD VMs
+-------------------------------------------------
+Additional KVM APIs are needed to control TD VMs. The operations on TD
+VMs are specific to TDX.
+
+- Piggyback and repurpose KVM_MEMORY_ENCRYPT_OP
+
+  Although operations for TD VMs aren't necessarily related to memory
+  encryption, define sub-operations of KVM_MEMORY_ENCRYPT_OP for TDX-specific
+  ioctls.
+
+  Pros:
+
+  - No major change in common x86 KVM code.
+  - Follows the SEV case.
+
+  Cons:
+
+  - The sub-operations of KVM_MEMORY_ENCRYPT_OP aren't necessarily about memory
+    encryption, but are operations on TD VMs.
+
+References
+==========
+
+.. [1] TDX specification
+   https://software.intel.com/content/www/us/en/develop/articles/intel-trust-domain-extensions.html
+.. [2] Intel Trust Domain Extensions (Intel TDX)
+   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-whitepaper-final9-17.pdf
+.. [3] Intel CPU Architectural Extensions Specification
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-cpu-architectural-specification.pdf
+.. [4] Intel TDX Module 1.0 EAS
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-module-1eas.pdf
+.. [5] Intel TDX Loader Interface Specification
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-seamldr-interface-specification.pdf
+.. [6] Intel TDX Guest-Hypervisor Communication Interface
+   https://software.intel.com/content/dam/develop/external/us/en/documents/intel-tdx-guest-hypervisor-communication-interface.pdf
+.. [7] Intel TDX Virtual Firmware Design Guide
+   https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.
+.. [8] intel public github
+
+   * kvm TDX branch: https://github.com/intel/tdx/tree/kvm
+   * TDX guest branch: https://github.com/intel/tdx/tree/guest
+
+.. [9] tdvf
+    https://github.com/tianocore/edk2-staging/tree/TDVF
+.. [10] KVM forum 2020: Intel Virtualization Technology Extensions to
+     Enable Hardware Isolated VMs
+     https://osseu2020.sched.com/event/eDzm/intel-virtualization-technology-extensions-to-enable-hardware-isolated-vms-sean-christopherson-intel
+.. [11] Linux Security Summit EU 2020:
+     Architectural Extensions for Hardware Virtual Machine Isolation
+     to Advance Confidential Computing in Public Clouds - Ravi Sahita
+     & Jun Nakajima, Intel Corporation
+     https://osseu2020.sched.com/event/eDOx/architectural-extensions-for-hardware-virtual-machine-isolation-to-advance-confidential-computing-in-public-clouds-ravi-sahita-jun-nakajima-intel-corporation
+.. [12] [RFCv2,00/16] KVM protected memory extension
+     https://lkml.org/lkml/2020/10/20/66
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 112/113] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (110 preceding siblings ...)
  2023-01-12 16:32 ` [PATCH v11 111/113] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) isaku.yamahata
@ 2023-01-12 16:33 ` isaku.yamahata
  2023-01-12 16:33 ` [PATCH v11 113/113] [MARKER] the end of (the first phase of) TDX KVM patch series isaku.yamahata
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:33 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Bagas Sanjaya

From: Isaku Yamahata <isaku.yamahata@intel.com>

Add a high-level design document on the TDX changes to the TDP MMU.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Bagas Sanjaya <bagasdotme@gmail.com>
Signed-off-by: Bagas Sanjaya <bagasdotme@gmail.com>
---
 Documentation/virt/kvm/index.rst       |   1 +
 Documentation/virt/kvm/tdx-tdp-mmu.rst | 417 +++++++++++++++++++++++++
 2 files changed, 418 insertions(+)
 create mode 100644 Documentation/virt/kvm/tdx-tdp-mmu.rst

diff --git a/Documentation/virt/kvm/index.rst b/Documentation/virt/kvm/index.rst
index 20a2ab8fc78c..eafacbff1f4e 100644
--- a/Documentation/virt/kvm/index.rst
+++ b/Documentation/virt/kvm/index.rst
@@ -21,3 +21,4 @@ KVM
    review-checklist
 
    intel-tdx
+   tdx-tdp-mmu
diff --git a/Documentation/virt/kvm/tdx-tdp-mmu.rst b/Documentation/virt/kvm/tdx-tdp-mmu.rst
new file mode 100644
index 000000000000..2d91c94e6d8f
--- /dev/null
+++ b/Documentation/virt/kvm/tdx-tdp-mmu.rst
@@ -0,0 +1,417 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+Design of TDP MMU for TDX support
+=================================
+This document describes a (high-level) design for TDX support in the KVM TDP MMU
+of x86 KVM.
+
+In this document, we use "TD" or "guest TD" to differentiate it from the current
+"VM" (Virtual Machine), which is supported by KVM today.
+
+
+Background of TDX
+=================
+TD private memory is designed to hold TD private content, encrypted by the CPU
+using the TD ephemeral key.  An encryption engine holds a table of encryption
+keys, and an encryption key is selected for each memory transaction based on a
+Host Key Identifier (HKID).  By design, the host VMM does not have access to the
+encryption keys.
+
+In the first generation of MKTME, HKID is "stolen" from the physical address by
+allocating a configurable number of bits from the top of the physical address.
+The HKID space is partitioned into shared HKIDs for legacy MKTME accesses and
+private HKIDs for SEAM-mode-only accesses.  We use 0 for the shared HKID on the
+host so that MKTME can be opaque or bypassed on the host.
+
+During TDX non-root operation (i.e. guest TD), memory accesses can be qualified
+as either shared or private, based on the value of a new SHARED bit in the Guest
+Physical Address (GPA).  The CPU translates shared GPAs using the usual VMX EPT
+(Extended Page Table) or "Shared EPT" (in this document), which resides in the
+host VMM memory.  The Shared EPT is directly managed by the host VMM - the same
+as with the current VMX.  Since guest TDs usually require I/O and the data
+exchange needs to be done via shared memory, KVM needs to use the current
+EPT functionality even for TDs.
+
+The CPU translates private GPAs using a separate Secure EPT.  The Secure EPT
+pages are encrypted and integrity-protected with the TD's ephemeral private key.
+Secure EPT can be managed _indirectly_ by the host VMM, using the TDX interface
+functions (SEAMCALLs), and thus conceptually Secure EPT is a subset of EPT
+because not all functionalities are available.
+
+Since the execution of such interface functions takes a much longer time than
+accessing memory directly, in KVM we use the existing TDP code to mirror the
+Secure EPT for the TD.  There are at least two options today in terms of the
+timing for executing such SEAMCALLs:
+
+1. synchronous, i.e. while walking the TDP page tables, or
+2. post-walk, i.e. record what needs to be done to the real Secure EPT during
+   the walk, and execute SEAMCALLs later.
+
+Option 1 seems to be more intuitive and simpler, but the Secure EPT concurrency
+rules are different from those of the TDP or EPT.  For example, MEM.SEPT.RD
+acquires shared access to the whole Secure EPT tree of the target TD.
+
+Secure EPT(SEPT) operations
+---------------------------
+Secure EPT is an Extended Page Table for GPA-to-HPA translation of TD private
+memory.  A Secure EPT is designed to be encrypted with the TD's ephemeral private
+key. SEPT pages are allocated by the host VMM via Intel TDX functions, but their
+content is intended to be hidden and is not architectural.
+
+Unlike the conventional EPT, the CPU can't directly read/write its entry.
+Instead, TDX SEAMCALL API is used.  Several SEAMCALLs correspond to operation on
+the EPT entry.
+
+* TDH.MEM.SEPT.ADD():
+
+  Add a secure EPT page to the secure EPT tree.  This corresponds to updating
+  a non-leaf EPT entry with the present bit set.
+
+* TDH.MEM.SEPT.REMOVE():
+
+  Remove the secure page from the secure EPT tree.  There is no corresponding
+  EPT operation.
+
+* TDH.MEM.SEPT.RD():
+
+  Read the secure EPT entry.  This corresponds to reading the EPT entry as
+  memory.  Note that this is much slower than reading memory directly.
+
+* TDH.MEM.PAGE.ADD() and TDH.MEM.PAGE.AUG():
+
+  Add a private page to the secure EPT tree.  This corresponds to updating the
+  leaf EPT entry with the present bit set.
+
+* TDH.MEM.PAGE.REMOVE():
+
+  Remove a private page from the secure EPT tree.  There is no corresponding
+  EPT operation.
+
+* TDH.MEM.RANGE.BLOCK():
+
+  This (mostly) corresponds to clearing the present bit of the leaf EPT entry.
+  Note that the private page is still linked in the secure EPT.  To remove it
+  from the secure EPT, TDH.MEM.SEPT.REMOVE() and TDH.MEM.PAGE.REMOVE() need to
+  be called.
+
+* TDH.MEM.TRACK():
+
+  Increment the TLB epoch counter. This (mostly) corresponds to EPT TLB flush.
+  Note that the private page is still linked in the secure EPT.  To remove it
+  from the secure EPT, tdh_mem_page_remove() needs to be called.
+
+
+Adding private page
+-------------------
+The procedure for populating a private page looks like the following.
+
+1. TDH.MEM.SEPT.ADD(512G level)
+2. TDH.MEM.SEPT.ADD(1G level)
+3. TDH.MEM.SEPT.ADD(2M level)
+4. TDH.MEM.PAGE.AUG(4K level)
+
+Those operations correspond to updating the EPT entries.
+
+Dropping private page and TLB shootdown
+---------------------------------------
+The procedure for dropping a private page looks like the following.
+
+1. TDH.MEM.RANGE.BLOCK(4K level)
+
+   This mostly corresponds to clearing the present bit in the EPT entry.  This
+   prevents (or blocks) a TLB entry from being created in the future.  Note that
+   the private page is still linked in the secure EPT tree and the existing
+   cache entry in the TLB isn't flushed.
+
+2. TDH.MEM.TRACK(range) and TLB shootdown
+
+   This mostly corresponds to the EPT TLB shootdown.  Because all vcpus share
+   the same Secure EPT, all vcpus need to flush their TLBs.
+
+   * TDH.MEM.TRACK(range) by one vcpu.  It increments the global internal TLB
+     epoch counter.
+
+   * send IPI to remote vcpus
+   * Other vcpus exit from the guest TD to the VMM and then re-enter via
+     TDH.VP.ENTER().
+   * TDH.VP.ENTER() checks the TLB epoch counter and, if its TLB is old, flushes
+     the TLB.
+
+   Note that only a single vcpu issues tdh_mem_track().
+
+   Note that the private page is still linked in the secure EPT tree, unlike the
+   conventional EPT.
+
+3. TDH.MEM.PAGE.PROMOTE(), TDH.MEM.PAGE.DEMOTE(), TDH.MEM.PAGE.RELOCATE(), or
+   TDH.MEM.PAGE.REMOVE()
+
+   There is no corresponding operation in the conventional EPT.
+
+   * When changing the page size (e.g. 4K <-> 2M), TDH.MEM.PAGE.PROMOTE() or
+     TDH.MEM.PAGE.DEMOTE() is used.  During those operations, the guest page is
+     kept referenced in the Secure EPT.
+
+   * When migrating a page, TDH.MEM.PAGE.RELOCATE() is used.  This requires both
+     a source page and a destination page.
+   * When destroying a TD, TDH.MEM.PAGE.REMOVE() removes the private page from
+     the secure EPT tree.  In this case a TLB shootdown is not needed because
+     vcpus don't run any more.
+
+The basic idea for TDX support
+==============================
+Because the shared EPT is the same as the existing EPT, use the existing logic
+for the shared EPT.  On the other hand, the secure EPT requires additional
+operations instead of directly reading/writing the EPT entry.
+
+On an EPT violation, the KVM MMU walks down the EPT tree from the root,
+determines the EPT entry to operate on, and updates the entry.  If necessary, a
+TLB shootdown is done.  Because it's very slow to directly walk the secure EPT
+by the TDX SEAMCALL TDH.MEM.SEPT.RD(), a mirror of the secure EPT is created and
+maintained.  Add hooks to the KVM MMU to reuse the existing code.
+
+EPT violation on shared GPA
+---------------------------
+(1) EPT violation on shared GPA or zapping shared GPA
+    ::
+
+        walk down shared EPT tree (the existing code)
+                |
+                |
+                V
+        shared EPT tree (CPU refers.)
+
+(2) update the EPT entry. (the existing code)
+
+    TLB shootdown in the case of zapping.
+
+
+EPT violation on private GPA
+----------------------------
+(1) EPT violation on private GPA or zapping private GPA
+    ::
+
+        walk down the mirror of secure EPT tree (mostly same as the existing code)
+            |
+            |
+            V
+        mirror of secure EPT tree (KVM MMU software only. reuse of the existing code)
+
+(2) update the (mirrored) EPT entry. (mostly same as the existing code)
+
+(3) call the hooks with what EPT entry is changed
+    ::
+
+           |
+        NEW: hooks in KVM MMU
+           |
+           V
+        secure EPT root(CPU refers)
+
+(4) the TDX backend calls necessary TDX SEAMCALLs to update real secure EPT.
+
+The major modification is to add hooks for the TDX backend for additional
+operations, to pass down which EPT (shared EPT or private EPT) is used, and
+to twist the behavior if we're operating on the private EPT; a sketch of such
+hooks is shown below.
+
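+A sketch of the kind of hooks involved (names and signatures are illustrative,
+not the exact kvm_x86_ops members added by the series):
+
+::
+
+  struct kvm_x86_ops {
+          /* ... */
+          /* allocate/link a Secure EPT page for a non-leaf mirrored entry */
+          int (*link_private_spt)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+                                  void *private_spt);
+          /* install a private guest page for a leaf mirrored entry */
+          int (*set_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+                                  kvm_pfn_t pfn);
+          /* remove a private guest page when the mirrored entry is zapped */
+          int (*remove_private_spte)(struct kvm *kvm, gfn_t gfn, enum pg_level level,
+                                     kvm_pfn_t pfn);
+  };
+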
+The following depicts the relationship.
+::
+
+                    KVM                             |       TDX module
+                     |                              |           |
+        -------------+----------                    |           |
+        |                      |                    |           |
+        V                      V                    |           |
+     shared GPA           private GPA               |           |
+  CPU shared EPT pointer  KVM private EPT pointer   |  CPU secure EPT pointer
+        |                      |                    |           |
+        |                      |                    |           |
+        V                      V                    |           V
+  shared EPT                private EPT<-------mirror----->Secure EPT
+        |                      |                    |           |
+        |                      \--------------------+------\    |
+        |                                           |      |    |
+        V                                           |      V    V
+  shared guest page                                 |    private guest page
+                                                    |
+                                                    |
+                              non-encrypted memory  |    encrypted memory
+                                                    |
+
+shared EPT: CPU and KVM walk with shared GPA
+            Maintained by the existing code
+private EPT: KVM walks with private GPA
+             Maintained by the twisted existing code
+secure EPT: CPU walks with private GPA.
+            Maintained by TDX module with TDX SEAMCALLs via hooks
+
+
+Tracking private EPT page
+=========================
+Shared EPT pages are managed by struct kvm_mmu_page.  They are linked in a list
+structure.  When necessary, the list is traversed to operate on them.  Private EPT
+pages have different characteristics.  For example, private pages can't be
+swapped out.  When shrinking memory, we'd like to traverse only shared EPT pages
+and skip private EPT pages.  Likewise, page migration isn't supported for
+private pages (yet).  Introduce an additional list to track shared EPT pages and
+track private EPT pages independently.
+
+At the beginning of an EPT violation, the fault handler knows the faulting GPA,
+thus it knows which EPT to operate on, private or shared.  If it's the private
+EPT, an additional task is done, something like "if (private) { callback a hook }".
+Since the fault handler has deep function calls, it's cumbersome to hold the
+information of which EPT it is operating on.  Options to mitigate this are
+
+1. Pass the information as an argument for the function call.
+2. Record the information in struct kvm_mmu_page somehow.
+3. Record the information in vcpu structure.
+
+Option 2 was chosen because option 1 requires modifying all the functions and
+would badly affect the normal case.  Option 3 doesn't work well because in
+some cases we need to walk both the private and shared EPT.
+
+The role of the EPT page can be utilized and one bit can be carved out from
+the unused bits in struct kvm_mmu_page_role.  When allocating the EPT page,
+initialize the information.  Mostly struct kvm_mmu_page is available because
+we're operating on EPT pages.
+
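+A minimal sketch of the idea (the bit name is illustrative, not the exact
+layout of struct kvm_mmu_page_role):
+
+::
+
+  union kvm_mmu_page_role {
+          u32 word;
+          struct {
+                  /* ... existing fields ... */
+                  unsigned int is_private:1;   /* carved out of an unused bit */
+          };
+  };
+
+  static inline bool is_private_sp(const struct kvm_mmu_page *sp)
+  {
+          return sp->role.is_private;
+  }
+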
+
+The conversion of private GPA and shared GPA
+============================================
+A page at a given GPA can be assigned as either private or shared, but not both,
+at one time.  The GPA can't be accessed simultaneously via both the private GPA
+and the shared GPA.  On guest startup, all the GPAs are assigned as private.  The
+guest converts a range of GPAs from private (or shared) to shared (or private)
+via the MapGPA hypercall.  The MapGPA hypercall takes the start GPA and the size
+of the region.  If the given start GPA is shared, the VMM converts the region
+into shared (if it's already shared, nop).  If the start GPA is private, the VMM
+converts the region into private.  This implies the guest won't access the
+unmapped region, i.e. the private (or shared) region after it has been converted
+to shared (or private).
+
+If the guest TD triggers an EPT violation on the already converted region, the
+access won't be allowed (loop in EPT violation) until another vcpu converts the
+region back.
+
+The KVM MMU records, via an xarray, whether each GPA is allowed to be accessed
+as private or shared (a minimal sketch follows).
+
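+A minimal sketch of the xarray bookkeeping (the field and constant names are
+illustrative; the series tracks the allowed attribute per GFN in an xarray
+hanging off struct kvm):
+
+::
+
+  /* value stored per gfn: whether the gfn is currently private or shared */
+  static int mem_attr_set(struct kvm *kvm, gfn_t gfn, bool private)
+  {
+          void *entry = xa_mk_value(private ? KVM_MEM_ATTR_PRIVATE
+                                            : KVM_MEM_ATTR_SHARED);
+
+          return xa_err(xa_store(&kvm->mem_attr_array, gfn, entry, GFP_KERNEL));
+  }
+
+  static bool mem_attr_is_private(struct kvm *kvm, gfn_t gfn)
+  {
+          void *entry = xa_load(&kvm->mem_attr_array, gfn);
+
+          return entry && xa_to_value(entry) == KVM_MEM_ATTR_PRIVATE;
+  }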
+
+The original TDP MMU and race condition
+=======================================
+Because vcpus share the EPT, once an EPT entry is zapped, we need to shoot down
+the TLB.  Send an IPI to remote vcpus.  Remote vcpus flush their own TLBs.  Until
+the TLB shootdown is done, vcpus may reference the zapped guest page.
+
+The TDP MMU uses the read lock of mmu_lock to mitigate vcpu contention.  When the
+read lock is held, it relies on atomic updates of the EPT entry.  (On the other
+hand, the legacy MMU uses the write lock.)  When a vcpu is populating/zapping an
+EPT entry with the read lock held, another vcpu may be populating or zapping the
+same EPT entry at the same time.
+
+To avoid the race condition, the entry is frozen.  That means the EPT entry is
+set to the special value REMOVED_SPTE, which clears the present bit.  Then, after
+the TLB shootdown, the EPT entry is updated to the final value.  A minimal sketch
+of the freezing step follows.
+
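+A minimal sketch of the freezing step (simplified from the TDP MMU's atomic
+SPTE update; the helper name is illustrative):
+
+::
+
+  static bool try_freeze_spte(u64 *sptep, u64 old_spte)
+  {
+          /*
+           * Atomically replace the old value with REMOVED_SPTE.  Failure means
+           * another vcpu raced and modified the entry first; the caller
+           * restarts the page fault.
+           */
+          return try_cmpxchg64(sptep, &old_spte, REMOVED_SPTE);
+  }
+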
+Concurrent zapping
+------------------
+1. read lock
+2. freeze the EPT entry (atomically set the value to REMOVED_SPTE)
+   If another vcpu froze the entry, restart the page fault.
+3. TLB shootdown
+
+   * send IPI to remote vcpus
+   * TLB flush (local and remote)
+
+   For each entry update, TLB shootdown is needed because of the
+   concurrency.
+4. atomically set the EPT entry to the final value
+5. read unlock
+
+Concurrent populating
+---------------------
+In the case of populating the non-present EPT entry, atomically update the EPT
+entry.
+
+1. read lock
+
+2. atomically update the EPT entry
+   If another vcpu froze or updated the entry, restart the page fault.
+
+3. read unlock
+
+In the case of updating the present EPT entry (e.g. page migration), the
+operation is split into two: zapping the entry and populating the entry.
+
+1. read lock
+2. zap the EPT entry, following the concurrent zapping case.
+3. populate the non-present EPT entry.
+4. read unlock
+
+Non-concurrent batched zapping
+------------------------------
+In some cases, zapping the ranges is done exclusively with a write lock held.
+In this case, the TLB shootdown is batched into one.
+
+1. write lock
+2. zap the EPT entries by traversing them
+3. TLB shootdown
+4. write unlock
+
+For Secure EPT, TDX SEAMCALLs are needed in addition to updating the mirrored
+EPT entry.
+
+TDX concurrent zapping
+----------------------
+Add a hook for TDX SEAMCALLs at the step of the TLB shootdown.
+
+1. read lock
+2. freeze the EPT entry (set the value to REMOVED_SPTE)
+3. TLB shootdown via a hook
+
+   * TDH.MEM.RANGE.BLOCK()
+   * TDH.MEM.TRACK()
+   * send IPI to remote vcpus
+
+4. set the EPT entry to the final value
+5. read unlock
+
+TDX concurrent populating
+-------------------------
+TDX SEAMCALLs are required in addition to operating on the mirrored EPT entry.  The
+frozen entry is utilized by following the zapping case to avoid the race
+condition.  A hook can be added.
+
+1. read lock
+2. freeze the EPT entry
+3. hook
+
+   * TDH_MEM_SEPT_ADD() for non-leaf or TDH_MEM_PAGE_AUG() for leaf.
+
+4. set the EPT entry to the final value
+5. read unlock
+
+Without freezing the entry, the following race can happen.  Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
+
+
+TDX non-concurrent batched zapping
+----------------------------------
+For simplicity, the procedure of concurrent populating is utilized.  The
+procedure can be optimized later.
+
+
+Co-existing with unmapping guest private memory
+===============================================
+TODO.  This needs to be addressed.
+
+
+Restrictions or future work
+===========================
+The following features aren't supported yet.
+
+* optimizing non-concurrent zap
+* Large page
+* Page migration
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* [PATCH v11 113/113] [MARKER] the end of (the first phase of) TDX KVM patch series
  2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
                   ` (111 preceding siblings ...)
  2023-01-12 16:33 ` [PATCH v11 112/113] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU isaku.yamahata
@ 2023-01-12 16:33 ` isaku.yamahata
  112 siblings, 0 replies; 221+ messages in thread
From: isaku.yamahata @ 2023-01-12 16:33 UTC (permalink / raw)
  To: kvm, linux-kernel
  Cc: isaku.yamahata, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

From: Isaku Yamahata <isaku.yamahata@intel.com>

This empty commit marks the end of (the first phase of) the patch series
for TDX KVM support.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 .../virt/kvm/intel-tdx-layer-status.rst       | 32 -------------------
 1 file changed, 32 deletions(-)
 delete mode 100644 Documentation/virt/kvm/intel-tdx-layer-status.rst

diff --git a/Documentation/virt/kvm/intel-tdx-layer-status.rst b/Documentation/virt/kvm/intel-tdx-layer-status.rst
deleted file mode 100644
index 010c387ef5cc..000000000000
--- a/Documentation/virt/kvm/intel-tdx-layer-status.rst
+++ /dev/null
@@ -1,32 +0,0 @@
-.. SPDX-License-Identifier: GPL-2.0
-
-===================================
-Intel Trust Dodmain Extensions(TDX)
-===================================
-
-Layer status
-============
-What qemu can do
-----------------
-- TDX VM TYPE is exposed to Qemu.
-- Qemu can create/destroy guest of TDX vm type.
-- Qemu can create/destroy vcpu of TDX vm type.
-- Qemu can populate initial guest memory image.
-- Qemu can finalize guest TD.
-- Qemu can start to run vcpu. But vcpu can not make progress yet.
-
-Patch Layer status
-------------------
-  Patch layer                          Status
-* TDX, VMX coexistence:                 Applied
-* TDX architectural definitions:        Applied
-* TD VM creation/destruction:           Applied
-* TD vcpu creation/destruction:         Applied
-* TDX EPT violation:                    Applied
-* TD finalization:                      Applied
-* TD vcpu enter/exit:                   Applied
-* TD vcpu interrupts/exit/hypercall:    Not yet
-
-* KVM MMU GPA shared bits:              Applied
-* KVM TDP refactoring for TDX:          Applied
-* KVM TDP MMU hooks:                    Applied
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module
  2023-01-12 16:31 ` [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module isaku.yamahata
@ 2023-01-13 12:31   ` Zhi Wang
  2023-01-17 16:03     ` Isaku Yamahata
  2023-01-16  3:48   ` Huang, Kai
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-13 12:31 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, 12 Jan 2023 08:31:12 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX requires several initialization steps for KVM to create guest TDs.
> Detect CPU feature, enable VMX (TDX is based on VMX), detect the TDX
> module availability, and initialize it.  This patch implements those
> steps.
> 
> There are several options on when to initialize the TDX module.  A.)
> kernel module loading time, B.) the first guest TD creation time.  A.)
> was chosen. With B.), a user may hit an error of the TDX initialization
> when trying to create the first guest TD.  The machine that fails to
> initialize the TDX module can't boot any guest TD further.  Such failure
> is undesirable and a surprise because the user expects that the machine
> can accommodate guest TD, but actually not.  So A.) is better than B.).
> 
> Introduce a module parameter, enable_tdx, to explicitly enable TDX KVM
> support.  It's off by default to keep same behavior for those who don't
> use TDX.  Implement hardware_setup method to detect TDX feature of CPU.
> Because TDX requires all present CPUs to enable VMX (VMXON).  The x86
> specific kvm_arch_post_hardware_enable_setup overrides the existing weak
> symbol of kvm_arch_post_hardware_enable_setup which is called at the KVM
> module initialization.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/Makefile      |  1 +
>  arch/x86/kvm/vmx/main.c    | 33 +++++++++++++++++++++++-----
>  arch/x86/kvm/vmx/tdx.c     | 44 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c     | 39 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h | 10 +++++++++
>  5 files changed, 122 insertions(+), 5 deletions(-)
>  create mode 100644 arch/x86/kvm/vmx/tdx.c
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 0e894ae23cbc..4b01ab842ab7 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -25,6 +25,7 @@ kvm-$(CONFIG_KVM_SMM)	+= smm.o
>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o
> vmx/vmcs12.o \ vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>  kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
>  
>  kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o
> svm/nested.o svm/avic.o \ svm/sev.o svm/hyperv.o
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 18f659d1d456..f5d1166d2718 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -7,6 +7,22 @@
>  #include "pmu.h"
>  #include "tdx.h"
>  
> +static bool enable_tdx __ro_after_init = IS_ENABLED(CONFIG_INTEL_TDX_HOST);
> +module_param_named(tdx, enable_tdx, bool, 0444);
> +

The comment says "TDX is off by default".  It seems the default on/off is
controlled by the kernel configuration here.

> +static __init int vt_hardware_setup(void)
> +{
> +	int ret;
> +
> +	ret = vmx_hardware_setup();
> +	if (ret)
> +		return ret;
> +
> +	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> +
> +	return 0;
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = KBUILD_MODNAME,
>  
> @@ -149,7 +165,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  };
>  
>  struct kvm_x86_init_ops vt_init_ops __initdata = {
> -	.hardware_setup = vmx_hardware_setup,
> +	.hardware_setup = vt_hardware_setup,
>  	.handle_intel_pt_intr = NULL,
>  
>  	.runtime_ops = &vt_x86_ops,
> @@ -182,10 +198,17 @@ static int __init vt_init(void)
>  	 * Common KVM initialization _must_ come last, after this,
> /dev/kvm is
>  	 * exposed to userspace!
>  	 */
> -	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct kvm_tdx));
> -	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
> -	vcpu_align = max(__alignof__(struct vcpu_vmx),
> -			 __alignof__(struct vcpu_tdx));
> +	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
> +	vcpu_size = sizeof(struct vcpu_vmx);
> +	vcpu_align = __alignof__(struct vcpu_vmx);
> +	if (enable_tdx) {
> +		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> +					   sizeof(struct kvm_tdx));
> +		vcpu_size = max_t(unsigned int, vcpu_size,
> +				  sizeof(struct vcpu_tdx));
> +		vcpu_align = max_t(unsigned int, vcpu_align,
> +				   __alignof__(struct vcpu_tdx));
> +	}
>  	r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
>  	if (r)
>  		goto err_kvm_init;
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> new file mode 100644
> index 000000000000..d7a276118940
> --- /dev/null
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -0,0 +1,44 @@
> +// SPDX-License-Identifier: GPL-2.0
> +#include <linux/cpu.h>
> +
> +#include <asm/tdx.h>
> +
> +#include "capabilities.h"
> +#include "x86_ops.h"
> +#include "tdx.h"
> +#include "x86.h"
> +
> +#undef pr_fmt
> +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> +
> +static int __init tdx_module_setup(void)
> +{
> +	int ret;
> +

Better to mention that tdx_enable() is implemented in another patch?  But I guess
we need a wrapper here so that the compilation succeeds.

> +	ret = tdx_enable();
> +	if (ret) {
> +		pr_info("Failed to initialize TDX module.\n");
> +		return ret;
> +	}
> +
> +	pr_info("TDX is supported.\n");
> +	return 0;
> +}
> +
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> +{
> +	int r;
> +
> +	if (!enable_ept) {
> +		pr_warn("Cannot enable TDX with EPT disabled\n");
> +		return -EINVAL;
> +	}
> +
> +	/* TDX requires VMX. */
> +	r = vmxon_all();
> +	if (!r)
> +		r = tdx_module_setup();
> +	vmxoff_all();
> +
> +	return r;
> +}
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5de1792c9902..5dc7687dcf16 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -8147,6 +8147,45 @@ static unsigned int vmx_handle_intel_pt_intr(void)
>  	return 1;
>  }
>  
> +static __init void vmxon(void *arg)
> +{
> +	int cpu = raw_smp_processor_id();
> +	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
> +	atomic_t *failed = arg;
> +	int r;
> +
> +	if (cr4_read_shadow() & X86_CR4_VMXE) {
> +		r = -EBUSY;
> +		goto out;
> +	}
> +
> +	r = kvm_cpu_vmxon(phys_addr);
> +out:
> +	if (r)
> +		atomic_inc(failed);
> +}
> +
> +__init int vmxon_all(void)
> +{
> +	atomic_t failed = ATOMIC_INIT(0);
> +
> +	on_each_cpu(vmxon, &failed, 1);
> +
> +	if (atomic_read(&failed))
> +		return -EBUSY;
> +	return 0;
> +}
> +
> +static __init void vmxoff(void *junk)
> +{
> +	cpu_vmxoff();
> +}
> +
> +__init void vmxoff_all(void)
> +{
> +	on_each_cpu(vmxoff, NULL, 1);
> +}
> +
>  static __init void vmx_setup_user_return_msrs(void)
>  {
>  
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 051b5c4b5c2f..fbc57fcbdd21 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -20,6 +20,10 @@ bool kvm_is_vmx_supported(void);
>  int __init vmx_init(void);
>  void vmx_exit(void);
>  
> +__init int vmxon_all(void);
> +__init void vmxoff_all(void);
> +__init int vmx_hardware_setup(void);
> +
>  extern struct kvm_x86_ops vt_x86_ops __initdata;
>  extern struct kvm_x86_init_ops vt_init_ops __initdata;
>  
> @@ -133,4 +137,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
>  #endif
>  void vmx_setup_mce(struct kvm_vcpu *vcpu);
>  
> +#ifdef CONFIG_INTEL_TDX_HOST
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +#else
> +static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
> +#endif
> +
>  #endif /* __KVM_X86_VMX_X86_OPS_H */


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id
  2023-01-12 16:31 ` [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id isaku.yamahata
@ 2023-01-13 12:47   ` Zhi Wang
  2023-01-13 15:21     ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-13 12:47 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, 12 Jan 2023 08:31:21 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX private host key id (HKID) is assigned to guest TD.  The memory
> controller encrypts guest TD memory with the assigned TDX HKID.  Add
> helper functions to allocate/free TDX private HKID so that TDX KVM can
> manage it.
> 
> Also export the global TDX private HKID that is used to encrypt the TDX
> module, its memory and some dynamic data (TDR).  When the VMM releases an
> encrypted page to reuse it, the page needs to be flushed with the HKID it
> was used with.  The VMM needs the global TDX private HKID to flush such pages.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/tdx.h  | 12 ++++++++++++
>  arch/x86/virt/vmx/tdx/tdx.c | 35 +++++++++++++++++++++++++++++++++++
>  2 files changed, 47 insertions(+)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 0f71d3856ede..ed9cf61ff8b4 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -107,11 +107,23 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  #ifdef CONFIG_INTEL_TDX_HOST
>  bool platform_tdx_enabled(void);
>  int tdx_enable(void);
> +/*
> + * Key id globally used by TDX module: TDX module maps TDR with this TDX global
> + * key id.  TDR includes key id assigned to the TD.  Then TDX module maps other
> + * TD-related pages with the assigned key id.  TDR requires this TDX global key
> + * id for cache flush unlike other TD-related pages.
> + */
> +extern u32 tdx_global_keyid __read_mostly;
> +int tdx_keyid_alloc(void);
> +void tdx_keyid_free(int keyid);
> +
>  u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	       struct tdx_module_output *out);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_enable(void)  { return -EINVAL; }
> +static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
> +static inline void tdx_keyid_free(int keyid) { }
>  #endif	/* CONFIG_INTEL_TDX_HOST */
>  
>  #endif /* !__ASSEMBLY__ */
> diff --git a/arch/x86/virt/vmx/tdx/tdx.c b/arch/x86/virt/vmx/tdx/tdx.c
> index eba7e62cebec..d18ab5c4d447 100644
> --- a/arch/x86/virt/vmx/tdx/tdx.c
> +++ b/arch/x86/virt/vmx/tdx/tdx.c
> @@ -51,6 +51,10 @@ static DEFINE_MUTEX(tdx_module_lock);
>  /* All TDX-usable memory regions */
>  static LIST_HEAD(tdx_memlist);
>  
> +/* TDX module global KeyID.  Used in TDH.SYS.CONFIG ABI. */
> +u32 tdx_global_keyid __read_mostly;
> +EXPORT_SYMBOL_GPL(tdx_global_keyid);
> +
>  /*
>   * tdx_keyid_start and nr_tdx_keyids indicate that TDX is uninitialized.
>   * This is used in TDX initialization error paths to take it from
> @@ -132,6 +136,31 @@ static struct notifier_block tdx_memory_nb = {
>  	.notifier_call = tdx_memory_notifier,
>  };
>  
> +/* TDX KeyID pool */
> +static DEFINE_IDA(tdx_keyid_pool);
> +
> +int tdx_keyid_alloc(void)
> +{
> +	if (WARN_ON_ONCE(!tdx_keyid_start || !nr_tdx_keyids))
> +		return -EINVAL;

Better mention that tdx_keyid_start and nr_tdx_keyids are defined in
other patches.

> +
> +	/* The first keyID is reserved for the global key. */
> +	return ida_alloc_range(&tdx_keyid_pool, tdx_keyid_start + 1,
> +			       tdx_keyid_start + nr_tdx_keyids - 1,
> +			       GFP_KERNEL);
> +}
> +EXPORT_SYMBOL_GPL(tdx_keyid_alloc);
> +
> +void tdx_keyid_free(int keyid)
> +{
> +	/* keyid = 0 is reserved. */
> +	if (WARN_ON_ONCE(keyid <= 0))
> +		return;
> +
> +	ida_free(&tdx_keyid_pool, keyid);
> +}
> +EXPORT_SYMBOL_GPL(tdx_keyid_free);
> +
>  static int __init tdx_init(void)
>  {
>  	int err;
> @@ -1161,6 +1190,12 @@ static int init_tdx_module(void)
>  	if (ret)
>  		goto out_free_pamts;
>  
> +	/*
> +	 * Reserve the first TDX KeyID as global KeyID to protect
> +	 * TDX module metadata.
> +	 */
> +	tdx_global_keyid = tdx_keyid_start;
> +
>  	/* Initialize TDMRs to complete the TDX module initialization */
>  	ret = init_tdmrs(&tdmr_list);
>  	if (ret)


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-01-12 16:31 ` [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP isaku.yamahata
@ 2023-01-13 12:55   ` Zhi Wang
  2023-02-27 21:28     ` Isaku Yamahata
  2023-01-16  4:44   ` Huang, Kai
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-13 12:55 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, 12 Jan 2023 08:31:25 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX attestation includes the maximum number of vcpu that the guest can
> accommodate.  For that, the maximum number of vcpu needs to be specified
> instead of constant, KVM_MAX_VCPUS.  Make KVM_ENABLE_CAP support
> KVM_CAP_MAX_VCPUS.
> 
> Suggested-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  virt/kvm/kvm_main.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a235b628b32f..1cfa7da92ad0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4945,7 +4945,27 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		}
>  
>  		mutex_unlock(&kvm->slots_lock);
> +		return r;
> +	}
> +	case KVM_CAP_MAX_VCPUS: {

Better mention in the comments which patch the KVM_CAP_MAX_VCPUS handling is defined in.

> +		int r;
>  
> +		if (cap->flags || cap->args[0] == 0)
> +			return -EINVAL;
> +		if (cap->args[0] >  kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MAX_VCPUS))
> +			return -E2BIG;
> +
> +		mutex_lock(&kvm->lock);
> +		/* Only decreasing is allowed. */
> +		if (cap->args[0] > kvm->max_vcpus)
> +			r = -E2BIG;
> +		else if (kvm->created_vcpus)
> +			r = -EBUSY;
> +		else {
> +			kvm->max_vcpus = cap->args[0];
> +			r = 0;
> +		}
> +		mutex_unlock(&kvm->lock);
>  		return r;
>  	}
>  	default:


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-12 16:31 ` [PATCH v11 018/113] KVM: TDX: create/destroy VM structure isaku.yamahata
@ 2023-01-13 13:12   ` Zhi Wang
  2023-01-13 15:16     ` Sean Christopherson
       [not found]   ` <080e0a246e927545718b6f427dfdcdde505a8859.camel@intel.com>
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-13 13:12 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson, Kai Huang

On Thu, 12 Jan 2023 08:31:26 -0800
isaku.yamahata@intel.com wrote:

> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> As the first step toward creating a TDX guest, create/destroy the VM
> structure.  Assign a TDX private Host Key ID (HKID) to the TDX guest for
> memory encryption and allocate extra pages for the TDX guest.  On
> destruction, free the allocated pages and the HKID.
> 
> Before tearing down private page tables, TDX requires some resources of
> the guest TD to be destroyed (i.e. HKID must have been reclaimed, etc).
> Add flush_shadow_all_private callback before tearing down private page
> tables for it.
> 
> Add vm_free() of kvm_x86_ops hook at the end of kvm_arch_destroy_vm()
> because some per-VM TDX resources, e.g. TDR, need to be freed after other
> TDX resources, e.g. HKID, were freed.
> 
> Co-developed-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Kai Huang <kai.huang@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> ---
> Changes v10 -> v11:
> - Fix double free in tdx_vm_free() by setting NULL.
> - replace struct tdx_td_page tdr and tdcs from struct kvm_tdx with
>   unsigned long
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h |   2 +
>  arch/x86/include/asm/kvm_host.h    |   2 +
>  arch/x86/kvm/vmx/main.c            |  34 ++-
>  arch/x86/kvm/vmx/tdx.c             | 415 +++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.h             |   6 +-
>  arch/x86/kvm/vmx/x86_ops.h         |   9 +
>  arch/x86/kvm/x86.c                 |   8 +
>  7 files changed, 472 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index e6708bb3f4f6..552de893af75 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -22,7 +22,9 @@ KVM_X86_OP(has_emulated_msr)
>  KVM_X86_OP(vcpu_after_set_cpuid)
>  KVM_X86_OP(is_vm_type_supported)
>  KVM_X86_OP(vm_init)
> +KVM_X86_OP_OPTIONAL(flush_shadow_all_private)
>  KVM_X86_OP_OPTIONAL(vm_destroy)
> +KVM_X86_OP_OPTIONAL(vm_free)
>  KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
>  KVM_X86_OP(vcpu_create)
>  KVM_X86_OP(vcpu_free)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 234c28c8e6ee..e199ddf0bb00 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1540,7 +1540,9 @@ struct kvm_x86_ops {
>  	bool (*is_vm_type_supported)(unsigned long vm_type);
>  	unsigned int vm_size;
>  	int (*vm_init)(struct kvm *kvm);
> +	void (*flush_shadow_all_private)(struct kvm *kvm);
>  	void (*vm_destroy)(struct kvm *kvm);
> +	void (*vm_free)(struct kvm *kvm);
>  
>  	/* Create, but do not attach this VCPU */
>  	int (*vcpu_precreate)(struct kvm *kvm);
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 781fbc896120..c5f2515026e9 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -29,14 +29,40 @@ static __init int vt_hardware_setup(void)
>  	return 0;
>  }
>  
> +static void vt_hardware_unsetup(void)
> +{
> +	tdx_hardware_unsetup();
> +	vmx_hardware_unsetup();
> +}
> +
>  static int vt_vm_init(struct kvm *kvm)
>  {
>  	if (is_td(kvm))
> -		return -EOPNOTSUPP;	/* Not ready to create guest
> TD yet. */
> +		return tdx_vm_init(kvm);
>  
>  	return vmx_vm_init(kvm);
>  }
>  
> +static void vt_flush_shadow_all_private(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		tdx_mmu_release_hkid(kvm);
> +}
> +
> +static void vt_vm_destroy(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return;
> +
> +	vmx_vm_destroy(kvm);
> +}
> +
> +static void vt_vm_free(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		tdx_vm_free(kvm);
> +}
> +
>  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	if (!is_td(kvm))
> @@ -50,7 +76,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  
>  	.check_processor_compatibility = vmx_check_processor_compat,
>  
> -	.hardware_unsetup = vmx_hardware_unsetup,
> +	.hardware_unsetup = vt_hardware_unsetup,
>  
>  	.hardware_enable = vmx_hardware_enable,
>  	.hardware_disable = vmx_hardware_disable,
> @@ -59,7 +85,9 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.is_vm_type_supported = vt_is_vm_type_supported,
>  	.vm_size = sizeof(struct kvm_vmx),
>  	.vm_init = vt_vm_init,
> -	.vm_destroy = vmx_vm_destroy,
> +	.flush_shadow_all_private = vt_flush_shadow_all_private,
> +	.vm_destroy = vt_vm_destroy,
> +	.vm_free = vt_vm_free,
>  
>  	.vcpu_precreate = vmx_vcpu_precreate,
>  	.vcpu_create = vmx_vcpu_create,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 2bd1cc37abab..d11950d18226 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,6 +6,7 @@
>  #include "capabilities.h"
>  #include "x86_ops.h"
>  #include "tdx.h"
> +#include "tdx_ops.h"
>  #include "x86.h"
>  
>  #undef pr_fmt
> @@ -32,6 +33,250 @@ struct tdx_capabilities {
>  /* Capabilities of KVM + the TDX module. */
>  static struct tdx_capabilities tdx_caps;
>  
> +/*
> + * Some TDX SEAMCALLs (TDH.MNG.CREATE, TDH.PHYMEM.CACHE.WB,
> + * TDH.MNG.KEY.RECLAIMID, TDH.MNG.KEY.FREEID etc) tries to acquire a global lock
> + * internally in TDX module.  If failed, TDX_OPERAND_BUSY is returned without
> + * spinning or waiting due to a constraint on execution time.  It's caller's
> + * responsibility to avoid race (or retry on TDX_OPERAND_BUSY).  Use this mutex
> + * to avoid race in TDX module because the kernel knows better about scheduling.
> + */
> +static DEFINE_MUTEX(tdx_lock);
> +static struct mutex *tdx_mng_key_config_lock;
> +
> +static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> +{
> +	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> +}
> +
> +static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->tdr_pa;
> +}
> +
> +static inline void tdx_hkid_free(struct kvm_tdx *kvm_tdx)
> +{
> +	tdx_keyid_free(kvm_tdx->hkid);
> +	kvm_tdx->hkid = 0;
> +}
> +
> +static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->hkid > 0;
> +}
> +
> +static void tdx_clear_page(unsigned long page_pa)
> +{
> +	const void *zero_page = (const void *)
> __va(page_to_phys(ZERO_PAGE(0)));
> +	void *page = __va(page_pa);
> +	unsigned long i;
> +
> +	if (!static_cpu_has(X86_FEATURE_MOVDIR64B)) {
> +		clear_page(page);
> +		return;
> +	}
> +
> +	/*
> +	 * Zeroing the page is only necessary for systems with MKTME-i:
> +	 * when re-assign one page from old keyid to a new keyid,
> MOVDIR64B is
> +	 * required to clear/write the page with new keyid to prevent
> integrity
> +	 * error when read on the page with new keyid.
> +	 *
> +	 * clflush doesn't flush cache with HKID set.
> +	 * The cache line could be poisoned (even without MKTME-i),
> clear the
> +	 * poison bit.
> +	 */
> +	for (i = 0; i < PAGE_SIZE; i += 64)
> +		movdir64b(page + i, zero_page);
> +	/*
> +	 * MOVDIR64B store uses WC buffer.  Prevent following memory
> reads
> +	 * from seeing potentially poisoned cache.
> +	 */
> +	__mb();
> +}
> +
> +static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
> +{
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	do {
> +		err = tdh_phymem_page_reclaim(pa, &out);
> +		/*
> +		 * TDH.PHYMEM.PAGE.RECLAIM is allowed only when TD is
> shutdown.
> +		 * state.  i.e. destructing TD.
> +		 * TDH.PHYMEM.PAGE.RECLAIM  requires TDR and target
> page.
> +		 * Because we're destructing TD, it's rare to contend
> with TDR.
> +		 */
> +	} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> +		return -EIO;
> +	}
> +
> +	if (do_wb) {
> +		/*
> +		 * Only TDR page gets into this path.  No contention is
> expected
> +		 * because of the last page of TD.
> +		 */
> +		err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> +			return -EIO;
> +		}
> +	}
> +
> +	tdx_clear_page(pa);
> +	return 0;
> +}
> +
> +static void tdx_reclaim_td_page(unsigned long td_page_pa)
> +{
> +	if (!td_page_pa)
> +		return;
> +	/*
> +	 * TDCX are being reclaimed.  TDX module maps TDCX with HKID
> +	 * assigned to the TD.  Here the cache associated to the TD
> +	 * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> +	 * cache doesn't need to be flushed again.
> +	 */
> +	if (WARN_ON(tdx_reclaim_page(td_page_pa, false, 0)))
> +		/* If reclaim failed, leak the page. */

Better add a FIXME: here as this has to be fixed later.

> +		return;
> +	free_page((unsigned long)__va(td_page_pa));
> +}
> +
> +static int tdx_do_tdh_phymem_cache_wb(void *param)
> +{
> +	u64 err = 0;
> +
> +	do {
> +		err = tdh_phymem_cache_wb(!!err);
> +	} while (err == TDX_INTERRUPTED_RESUMABLE);
> +
> +	/* Other thread may have done for us. */
> +	if (err == TDX_NO_HKID_READY_TO_WBCACHE)
> +		err = TDX_SUCCESS;
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_PHYMEM_CACHE_WB, err, NULL);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +void tdx_mmu_release_hkid(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages;
> +	bool cpumask_allocated;
> +	u64 err;
> +	int ret;
> +	int i;
> +
> +	if (!is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	if (!is_td_created(kvm_tdx))
> +		goto free_hkid;
> +
> +	cpumask_allocated = zalloc_cpumask_var(&packages, GFP_KERNEL);
> +	cpus_read_lock();
> +	for_each_online_cpu(i) {
> +		if (cpumask_allocated &&
> +
> cpumask_test_and_set_cpu(topology_physical_package_id(i),
> +						packages))
> +			continue;
> +
> +		/*
> +		 * We can destroy multiple the guest TDs simultaneously.
> +		 * Prevent tdh_phymem_cache_wb from returning TDX_BUSY
> by
> +		 * serialization.
> +		 */
> +		mutex_lock(&tdx_lock);
> +		ret = smp_call_on_cpu(i, tdx_do_tdh_phymem_cache_wb,
> NULL, 1);
> +		mutex_unlock(&tdx_lock);
> +		if (ret)
> +			break;
> +	}
> +	cpus_read_unlock();
> +	free_cpumask_var(packages);
> +
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_key_freeid(kvm_tdx->tdr_pa);
> +	mutex_unlock(&tdx_lock);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_FREEID, err, NULL);
> +		pr_err("tdh_mng_key_freeid failed. HKID %d is
> leaked.\n",
> +			kvm_tdx->hkid);
> +		return;
> +	}
> +
> +free_hkid:
> +	tdx_hkid_free(kvm_tdx);
> +}
> +
> +void tdx_vm_free(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	int i;
> +
> +	/* Can't reclaim or free TD pages if teardown failed. */
> +	if (is_hkid_assigned(kvm_tdx))
> +		return;
> +
> +	if (kvm_tdx->tdcs_pa) {
> +		for (i = 0; i < tdx_caps.tdcs_nr_pages; i++)
> +			tdx_reclaim_td_page(kvm_tdx->tdcs_pa[i]);
> +		kfree(kvm_tdx->tdcs_pa);
> +		kvm_tdx->tdcs_pa = NULL;
> +	}
> +
> +	if (!kvm_tdx->tdr_pa)
> +		return;
> +	/*
> +	 * TDX module maps TDR with TDX global HKID.  TDX module may
> access TDR
> +	 * while operating on TD (Especially reclaiming TDCS).  Cache
> flush with
> +	 * TDX global HKID is needed.
> +	 */
> +	if (tdx_reclaim_page(kvm_tdx->tdr_pa, true, tdx_global_keyid))
> +		return;
> +
> +	free_page((unsigned long)__va(kvm_tdx->tdr_pa));
> +	kvm_tdx->tdr_pa = 0;
> +}
> +
> +static int tdx_do_tdh_mng_key_config(void *param)
> +{
> +	hpa_t *tdr_p = param;
> +	u64 err;
> +
> +	do {
> +		err = tdh_mng_key_config(*tdr_p);
> +
> +		/*
> +		 * If it failed to generate a random key, retry it
> because this
> +		 * is typically caused by an entropy error of the CPU's
> random
> +		 * number generator.
> +		 */
> +	} while (err == TDX_KEY_GENERATION_FAILED);
> +
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_KEY_CONFIG, err, NULL);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +
> +static int __tdx_td_init(struct kvm *kvm);
> +
> +int tdx_vm_init(struct kvm *kvm)
> +{
> +	/* Place holder for now. */
> +	return __tdx_td_init(kvm);
> +}
> +
>  int tdx_dev_ioctl(void __user *argp)
>  {
>  	struct kvm_tdx_capabilities __user *user_caps;
> @@ -78,6 +323,160 @@ int tdx_dev_ioctl(void __user *argp)
>  	return 0;
>  }
>  
> +static int __tdx_td_init(struct kvm *kvm)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	cpumask_var_t packages;
> +	unsigned long *tdcs_pa = NULL;
> +	unsigned long tdr_pa = 0;
> +	unsigned long va;
> +	int ret, i;
> +	u64 err;
> +
> +	ret = tdx_keyid_alloc();
> +	if (ret < 0)
> +		return ret;
> +	kvm_tdx->hkid = ret;
> +
> +	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!va)
> +		goto free_hkid;
> +	tdr_pa = __pa(va);
> +
> +	tdcs_pa = kcalloc(tdx_caps.tdcs_nr_pages,
> sizeof(*kvm_tdx->tdcs_pa),
> +			  GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +	if (!tdcs_pa)
> +		goto free_tdr;
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +		if (!va)
> +			goto free_tdcs;
> +		tdcs_pa[i] = __pa(va);
> +	}
> +
> +	if (!zalloc_cpumask_var(&packages, GFP_KERNEL)) {
> +		ret = -ENOMEM;
> +		goto free_tdcs;
> +	}
> +	cpus_read_lock();
> +	/*
> +	 * Need at least one CPU of the package to be online in order to
> +	 * program all packages for host key id.  Check it.
> +	 */
> +	for_each_present_cpu(i)
> +		cpumask_set_cpu(topology_physical_package_id(i),
> packages);
> +	for_each_online_cpu(i)
> +		cpumask_clear_cpu(topology_physical_package_id(i),
> packages);
> +	if (!cpumask_empty(packages)) {
> +		ret = -EIO;
> +		/*
> +		 * Because it's hard for human operator to figure out
> the
> +		 * reason, warn it.
> +		 */
> +		pr_warn("All packages need to have online CPU to create
> TD. Online CPU and retry.\n");
> +		goto free_packages;
> +	}
> +
> +	/*
> +	 * Acquire global lock to avoid TDX_OPERAND_BUSY:
> +	 * TDH.MNG.CREATE and other APIs try to lock the global Key
> Owner
> +	 * Table (KOT) to track the assigned TDX private HKID.  It
> doesn't spin
> +	 * to acquire the lock, returns TDX_OPERAND_BUSY instead, and
> let the
> +	 * caller to handle the contention.  This is because of time
> limitation
> +	 * usable inside the TDX module and OS/VMM knows better about
> process
> +	 * scheduling.
> +	 *
> +	 * APIs to acquire the lock of KOT:
> +	 * TDH.MNG.CREATE, TDH.MNG.KEY.FREEID, TDH.MNG.VPFLUSHDONE, and
> +	 * TDH.PHYMEM.CACHE.WB.
> +	 */
> +	mutex_lock(&tdx_lock);
> +	err = tdh_mng_create(tdr_pa, kvm_tdx->hkid);
> +	mutex_unlock(&tdx_lock);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_CREATE, err, NULL);
> +		ret = -EIO;
> +		goto free_packages;
> +	}
> +	kvm_tdx->tdr_pa = tdr_pa;
> +
> +	for_each_online_cpu(i) {
> +		int pkg = topology_physical_package_id(i);
> +
> +		if (cpumask_test_and_set_cpu(pkg, packages))
> +			continue;
> +
> +		/*
> +		 * Program the memory controller in the package with an
> +		 * encryption key associated to a TDX private host key
> id
> +		 * assigned to this TDR.  Concurrent operations on same
> memory
> +		 * controller results in TDX_OPERAND_BUSY.  Avoid this
> race by
> +		 * mutex.
> +		 */
> +		mutex_lock(&tdx_mng_key_config_lock[pkg]);
> +		ret = smp_call_on_cpu(i, tdx_do_tdh_mng_key_config,
> +				      &kvm_tdx->tdr_pa, true);
> +		mutex_unlock(&tdx_mng_key_config_lock[pkg]);
> +		if (ret)
> +			break;
> +	}
> +	cpus_read_unlock();
> +	free_cpumask_var(packages);
> +	if (ret)
> +		goto teardown;
> +
> +	kvm_tdx->tdcs_pa = tdcs_pa;
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		err = tdh_mng_addcx(kvm_tdx->tdr_pa, tdcs_pa[i]);
> +		if (WARN_ON_ONCE(err)) {
> +			pr_tdx_error(TDH_MNG_ADDCX, err, NULL);
> +			for (i++; i < tdx_caps.tdcs_nr_pages; i++) {
> +				free_page((unsigned
> long)__va(tdcs_pa[i]));
> +				tdcs_pa[i] = 0;
> +			}
> +			ret = -EIO;
> +			goto teardown;
> +		}
> +	}
> +
> +	/*
> +	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT
> requires a dedicated
> +	 * ioctl() to define the configure CPUID values for the TD.
> +	 */
> +	return 0;
> +
> +	/*
> +	 * The sequence for freeing resources from a partially
> initialized TD
> +	 * varies based on where in the initialization flow failure
> occurred.
> +	 * Simply use the full teardown and destroy, which naturally
> play nice
> +	 * with partial initialization.
> +	 */
> +teardown:
> +	tdx_mmu_release_hkid(kvm);
> +	tdx_vm_free(kvm);
> +	return ret;
> +
> +free_packages:
> +	cpus_read_unlock();
> +	free_cpumask_var(packages);
> +free_tdcs:
> +	for (i = 0; i < tdx_caps.tdcs_nr_pages; i++) {
> +		if (tdcs_pa[i])
> +			free_page((unsigned long)__va(tdcs_pa[i]));
> +	}
> +	kfree(tdcs_pa);
> +	kvm_tdx->tdcs_pa = NULL;
> +
> +free_tdr:
> +	if (tdr_pa)
> +		free_page((unsigned long)__va(tdr_pa));
> +	kvm_tdx->tdr_pa = 0;
> +free_hkid:
> +	if (is_hkid_assigned(kvm_tdx))
> +		tdx_hkid_free(kvm_tdx);
> +	return ret;
> +}
> +
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_tdx_cmd tdx_cmd;
> @@ -152,6 +551,8 @@ bool tdx_is_vm_type_supported(unsigned long type)
>  
>  int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>  {
> +	int max_pkgs;
> +	int i;
>  	int r;
>  
>  	if (!enable_ept) {
> @@ -159,6 +560,14 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
>  		return -EINVAL;
>  	}
>  
> +	max_pkgs = topology_max_packages();
> +	tdx_mng_key_config_lock = kcalloc(max_pkgs,
> sizeof(*tdx_mng_key_config_lock),
> +				   GFP_KERNEL);
> +	if (!tdx_mng_key_config_lock)
> +		return -ENOMEM;
> +	for (i = 0; i < max_pkgs; i++)
> +		mutex_init(&tdx_mng_key_config_lock[i]);
> +
>  	/* TDX requires VMX. */
>  	r = vmxon_all();
>  	if (!r)
> @@ -167,3 +576,9 @@ int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> 
>  	return r;
>  }
> +
> +void tdx_hardware_unsetup(void)
> +{
> +	/* kfree accepts NULL. */
> +	kfree(tdx_mng_key_config_lock);
> +}
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 473013265bd8..e78d72cf4c3a 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -5,7 +5,11 @@
>  #ifdef CONFIG_INTEL_TDX_HOST
>  struct kvm_tdx {
>  	struct kvm kvm;
> -	/* TDX specific members follow. */
> +
> +	unsigned long tdr_pa;
> +	unsigned long *tdcs_pa;
> +
> +	int hkid;
>  };
>  
>  struct vcpu_tdx {
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index bb4090dbae37..3d0f519727c6 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -139,15 +139,24 @@ void vmx_setup_mce(struct kvm_vcpu *vcpu);
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
>  int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +void tdx_hardware_unsetup(void);
>  bool tdx_is_vm_type_supported(unsigned long type);
>  int tdx_dev_ioctl(void __user *argp);
>  
> +int tdx_vm_init(struct kvm *kvm);
> +void tdx_mmu_release_hkid(struct kvm *kvm);
> +void tdx_vm_free(struct kvm *kvm);
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  #else
>  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
> +static inline void tdx_hardware_unsetup(void) {}
>  static inline bool tdx_is_vm_type_supported(unsigned long type) { return false; }
>  static inline int tdx_dev_ioctl(void __user *argp) { return -EOPNOTSUPP; };
> 
> +static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> +static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
> +static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
> +static inline void tdx_vm_free(struct kvm *kvm) {}
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  #endif
>  
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1152a9dc6d84..0fa91a9708aa 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12337,6 +12337,7 @@ void kvm_arch_destroy_vm(struct kvm *kvm)
>  	kvm_page_track_cleanup(kvm);
>  	kvm_xen_destroy_vm(kvm);
>  	kvm_hv_destroy_vm(kvm);
> +	static_call_cond(kvm_x86_vm_free)(kvm);
>  }
>  
>  static void memslot_rmap_free(struct kvm_memory_slot *slot)
> @@ -12647,6 +12648,13 @@ void kvm_arch_commit_memory_region(struct kvm
> *kvm, 
>  void kvm_arch_flush_shadow_all(struct kvm *kvm)
>  {
> +	/*
> +	 * kvm_mmu_zap_all() zaps both private and shared page tables.
> Before
> +	 * tearing down private page tables, TDX requires some TD
> resources to
> +	 * be destroyed (i.e. keyID must have been reclaimed, etc).
> Invoke
> +	 * kvm_x86_flush_shadow_all_private() for this.
> +	 */
> +	static_call_cond(kvm_x86_flush_shadow_all_private)(kvm);
>  	kvm_mmu_zap_all(kvm);
>  }
>  


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters
  2023-01-12 16:31 ` [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
@ 2023-01-13 14:58   ` Zhi Wang
  2023-01-16 10:04   ` Huang, Kai
  2023-01-17 12:19   ` Huang, Kai
  2 siblings, 0 replies; 221+ messages in thread
From: Zhi Wang @ 2023-01-13 14:58 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack, Xiaoyao Li

On Thu, 12 Jan 2023 08:31:27 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX requires additional parameters for a TDX VM for confidential execution,
> to protect the confidentiality of its memory contents and its CPU state from
> any other software, including the VMM.  When creating a guest TD VM, before
> creating any vcpu, the following need to be specified: the number of vcpus,
> the TSC frequency (which is the same for all vcpus and cannot be changed),
> the CPUIDs that are emulated by the TDX module (meaning the guest can trust
> those CPUIDs), and sha384 values for measurement.
> 
> Add a new subcommand, KVM_TDX_INIT_VM, to pass parameters for the TDX guest.
> It assigns an encryption key to the TDX guest for memory encryption; TDX
> encrypts memory on a per-guest basis.  Through it, the device model passes
> per-VM parameters for the TDX guest: the maximum number of vcpus, the TSC
> frequency (the TDX guest has a fixed VM-wide TSC frequency, not per-vcpu,
> which the guest cannot change), attributes (production or debug), available
> extended features (reflected into guest XCR0 and the IA32_XSS MSR), cpuids,
> sha384 measurements, and so on.
> 
> This subcommand is called before vcpu creation and KVM_SET_CPUID2, i.e. the
> CPUID configurations aren't available yet, so the CPUID configuration values
> need to be passed in struct kvm_tdx_init_vm.  It is the device model's
> responsibility to keep the CPUID configuration consistent between
> KVM_TDX_INIT_VM and KVM_SET_CPUID2.
> 

I find this one takes much more time to review than the others. Mostly what I
noticed is:

It takes quite a lot of time to switch between the code and the TDX module
spec when verifying the configuration of the TD params struct. I would suggest
organizing the TD params setup better:

a. Describe the steps, like step 1: xxx and step 2: xxx. Make it more
informative and purpose-oriented if necessary.

b. Separate them into small inline functions and add comments there, so that
setup_tdparams() will look much cleaner to read and follow.
For example:

+	max_pa = 36;
+	entry = tdx_find_cpuid_entry(cpuid, 0x80000008, 0);
+	if (entry)
+		max_pa = entry->eax & 0xff;
+
+	td_params->eptp_controls = VMX_EPTP_MT_WB;
+	/*
+	 * No CPU supports 4-level && max_pa > 48.
+	 * "5-level paging and 5-level EPT" section 4.1 4-level EPT
+	 * "4-level EPT is limited to translating 48-bit guest-physical
+	 *  addresses."
+	 * cpu_has_vmx_ept_5levels() check is just in case.
+	 */
+	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
+		td_params->eptp_controls |= VMX_EPTP_PWL_5;
+		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
+	} else {
+		td_params->eptp_controls |= VMX_EPTP_PWL_4;
+	}
+

The above snippet can be wrapped in a function such as setup_eptp_controls(),
with its purpose explained in comments inside that function (see the sketch
after this list). It would also be easier to define macros for the magic
numbers inside those functions.

c. Organize the CPUID virtualization in dedicated functions and summarize
the details there. The CPUID virtualization details in the TDX module ABI
spec are huge (it took me quite some time to go through them). A brief
summary would help people understand it without having to go through all
the related details.
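
As a rough illustration of suggestion b above (a sketch only; the helper name
and comments are mine, not from the posted patch):

    /*
     * Hypothetical helper carved out of setup_tdparams(): choose the EPT
     * page-walk level and GPAW from the guest's reported MAXPHYADDR.
     */
    static void setup_eptp_controls(int max_pa, struct td_params *td_params)
    {
    	td_params->eptp_controls = VMX_EPTP_MT_WB;
    	/*
    	 * No CPU supports 4-level EPT && max_pa > 48: 4-level EPT is limited
    	 * to translating 48-bit guest-physical addresses.  The
    	 * cpu_has_vmx_ept_5levels() check is just in case.
    	 */
    	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
    		td_params->eptp_controls |= VMX_EPTP_PWL_5;
    		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
    	} else {
    		td_params->eptp_controls |= VMX_EPTP_PWL_4;
    	}
    }

The caller would then read as a sequence of such purpose-named steps instead
of one long block.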

> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/tdx.h            |   3 +
>  arch/x86/include/uapi/asm/kvm.h       |  31 ++++
>  arch/x86/kvm/vmx/tdx.c                | 229 ++++++++++++++++++++++++--
>  arch/x86/kvm/vmx/tdx.h                |  20 +++
>  tools/arch/x86/include/uapi/asm/kvm.h |  33 ++++
>  5 files changed, 306 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index 2ca6e8ce1e43..d7ce2217279f 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -105,6 +105,9 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
> 
>  #ifdef CONFIG_INTEL_TDX_HOST
> +
> +/* -1 indicates CPUID leaf with no sub-leaves. */
> +#define TDX_CPUID_NO_SUBLEAF	((u32)-1)
>  struct tdx_cpuid_config {
>  	u32	leaf;
>  	u32	sub_leaf;
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index 610b1de9eb2b..b8f28d86d4fd 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -535,6 +535,7 @@ struct kvm_pmu_event_filter {
>  /* Trust Domain eXtension sub-ioctl() commands. */
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
> +	KVM_TDX_INIT_VM,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };
> @@ -580,4 +581,34 @@ struct kvm_tdx_capabilities {
>  	struct kvm_tdx_cpuid_config cpuid_configs[0];
>  };
>  
> +struct kvm_tdx_init_vm {
> +	__u64 attributes;
> +	__u64 mrconfigid[6];	/* sha384 digest */
> +	__u64 mrowner[6];	/* sha384 digest */
> +	__u64 mrownerconfig[6];	/* sha348 digest */
> +	union {
> +		/*
> +		 * KVM_TDX_INIT_VM is called before vcpu creation, thus
> before
> +		 * KVM_SET_CPUID2.  CPUID configurations needs to be
> passed.
> +		 *
> +		 * This configuration supersedes KVM_SET_CPUID{,2}.
> +		 * The user space VMM, e.g. qemu, should make them
> consistent
> +		 * with this values.
> +		 * sizeof(struct kvm_cpuid_entry2) *
> KVM_MAX_CPUID_ENTRIES(256)
> +		 * = 8KB.
> +		 */
> +		struct {
> +			struct kvm_cpuid2 cpuid;
> +			/* 8KB with KVM_MAX_CPUID_ENTRIES. */
> +			struct kvm_cpuid_entry2 entries[];
> +		};
> +		/*
> +		 * For future extensibility.
> +		 * The size(struct kvm_tdx_init_vm) = 16KB.
> +		 * This should be enough given sizeof(TD_PARAMS) = 1024
> +		 */
> +		__u64 reserved[2029];
> +	};
> +};
> +
>  #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index d11950d18226..0b309bbfe4e5 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -6,7 +6,6 @@
>  #include "capabilities.h"
>  #include "x86_ops.h"
>  #include "tdx.h"
> -#include "tdx_ops.h"
>  #include "x86.h"
>  
>  #undef pr_fmt
> @@ -269,12 +268,15 @@ static int tdx_do_tdh_mng_key_config(void *param)
>  	return 0;
>  }
>  
> -static int __tdx_td_init(struct kvm *kvm);
> -
>  int tdx_vm_init(struct kvm *kvm)
>  {
> -	/* Place holder for now. */
> -	return __tdx_td_init(kvm);
> +	/*
> +	 * This function initializes only KVM software construct.  It
> doesn't
> +	 * initialize TDX stuff, e.g. TDCS, TDR, TDCX, HKID etc.
> +	 * It is handled by KVM_TDX_INIT_VM, __tdx_td_init().
> +	 */
> +
> +	return 0;
>  }
>  
>  int tdx_dev_ioctl(void __user *argp)
> @@ -323,9 +325,147 @@ int tdx_dev_ioctl(void __user *argp)
>  	return 0;
>  }
>  
> -static int __tdx_td_init(struct kvm *kvm)
> +/*
> + * cpuid entry lookup in TDX cpuid config way.
> + * The difference is how to specify index(subleaves).
> + * Specify index to TDX_CPUID_NO_SUBLEAF for CPUID leaf with
> no-subleaves.
> + */
> +static const struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(const struct kvm_cpuid2 *cpuid,
> +							   u32 function, u32 index)
> +{
> +	int i;
> +
> +	/* In TDX CPU CONFIG, TDX_CPUID_NO_SUBLEAF means index = 0. */
> +	if (index == TDX_CPUID_NO_SUBLEAF)
> +		index = 0;
> +
> +	for (i = 0; i < cpuid->nent; i++) {
> +		const struct kvm_cpuid_entry2 *e = &cpuid->entries[i];
> +
> +		if (e->function == function &&
> +		    (e->index == index ||
> +		     !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
> +			return e;
> +	}
> +	return NULL;
> +}
> +
> +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> +			struct kvm_tdx_init_vm *init_vm)
> +{
> +	const struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
> +	const struct kvm_cpuid_entry2 *entry;
> +	u64 guest_supported_xcr0;
> +	u64 guest_supported_xss;
> +	int max_pa;
> +	int i;
> +
> +	if (kvm->created_vcpus)
> +		return -EBUSY;
> +	td_params->max_vcpus = kvm->max_vcpus;
> +	td_params->attributes = init_vm->attributes;
> +	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> +		/*
> +		 * TODO: save/restore PMU related registers around
> TDENTER.
> +		 * Once it's done, remove this guard.
> +		 */
> +		pr_warn("TD doesn't support perfmon yet. KVM needs to
> save/restore "
> +			"host perf registers properly.\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
> +		const struct tdx_cpuid_config *config =
> &tdx_caps.cpuid_configs[i];
> +		const struct kvm_cpuid_entry2 *entry =
> +			tdx_find_cpuid_entry(cpuid, config->leaf,
> config->sub_leaf);
> +		struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
> +
> +		if (!entry)
> +			continue;
> +
> +		value->eax = entry->eax & config->eax;
> +		value->ebx = entry->ebx & config->ebx;
> +		value->ecx = entry->ecx & config->ecx;
> +		value->edx = entry->edx & config->edx;
> +	}
> +
> +	max_pa = 36;
> +	entry = tdx_find_cpuid_entry(cpuid, 0x80000008, 0);
> +	if (entry)
> +		max_pa = entry->eax & 0xff;
> +
> +	td_params->eptp_controls = VMX_EPTP_MT_WB;
> +	/*
> +	 * No CPU supports 4-level && max_pa > 48.
> +	 * "5-level paging and 5-level EPT" section 4.1 4-level EPT
> +	 * "4-level EPT is limited to translating 48-bit guest-physical
> +	 *  addresses."
> +	 * cpu_has_vmx_ept_5levels() check is just in case.
> +	 */
> +	if (cpu_has_vmx_ept_5levels() && max_pa > 48) {
> +		td_params->eptp_controls |= VMX_EPTP_PWL_5;
> +		td_params->exec_controls |= TDX_EXEC_CONTROL_MAX_GPAW;
> +	} else {
> +		td_params->eptp_controls |= VMX_EPTP_PWL_4;
> +	}
> +
> +	/* Setup td_params.xfam */
> +	entry = tdx_find_cpuid_entry(cpuid, 0xd, 0);
> +	if (entry)
> +		guest_supported_xcr0 = (entry->eax | ((u64)entry->edx
> << 32));
> +	else
> +		guest_supported_xcr0 = 0;
> +	guest_supported_xcr0 &= kvm_caps.supported_xcr0;
> +
> +	entry = tdx_find_cpuid_entry(cpuid, 0xd, 1);
> +	if (entry)
> +		guest_supported_xss = (entry->ecx | ((u64)entry->edx <<
> 32));
> +	else
> +		guest_supported_xss = 0;
> +	/* PT can be exposed to TD guest regardless of KVM's XSS support */
> +	guest_supported_xss &= (kvm_caps.supported_xss | XFEATURE_MASK_PT);
> +
> +	td_params->xfam = guest_supported_xcr0 | guest_supported_xss;
> +	if (td_params->xfam & XFEATURE_MASK_LBR) {
> +		/*
> +		 * TODO: once KVM supports LBR(save/restore LBR related
> +		 * registers around TDENTER), remove this guard.
> +		 */
> +		pr_warn("TD doesn't support LBR yet. KVM needs to
> save/restore "
> +			"IA32_LBR_DEPTH properly.\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	if (td_params->xfam & XFEATURE_MASK_XTILE) {
> +		/*
> +		 * TODO: once KVM supports AMX(save/restore AMX related
> +		 * registers around TDENTER), remove this guard.
> +		 */
> +		pr_warn("TD doesn't support AMX yet. KVM needs to
> save/restore "
> +			"IA32_XFD, IA32_XFD_ERR properly.\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	td_params->tsc_frequency =
> +		TDX_TSC_KHZ_TO_25MHZ(kvm->arch.default_tsc_khz);
> +
> +#define MEMCPY_SAME_SIZE(dst, src)				\
> +	do {							\
> +		BUILD_BUG_ON(sizeof(dst) != sizeof(src));	\
> +		memcpy((dst), (src), sizeof(dst));		\
> +	} while (0)
> +
> +	MEMCPY_SAME_SIZE(td_params->mrconfigid, init_vm->mrconfigid);
> +	MEMCPY_SAME_SIZE(td_params->mrowner, init_vm->mrowner);
> +	MEMCPY_SAME_SIZE(td_params->mrownerconfig, init_vm->mrownerconfig);
> +
> +	return 0;
> +}
> +
> +static int __tdx_td_init(struct kvm *kvm, struct td_params *td_params)
>  {
>  	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct tdx_module_output out;
>  	cpumask_var_t packages;
>  	unsigned long *tdcs_pa = NULL;
>  	unsigned long tdr_pa = 0;
> @@ -439,10 +579,13 @@ static int __tdx_td_init(struct kvm *kvm)
>  		}
>  	}
>  
> -	/*
> -	 * Note, TDH_MNG_INIT cannot be invoked here.  TDH_MNG_INIT
> requires a dedicated
> -	 * ioctl() to define the configure CPUID values for the TD.
> -	 */
> +	err = tdh_mng_init(kvm_tdx->tdr_pa, __pa(td_params), &out);
> +	if (WARN_ON_ONCE(err)) {
> +		pr_tdx_error(TDH_MNG_INIT, err, &out);
> +		ret = -EIO;
> +		goto teardown;
> +	}
> +
>  	return 0;
>  
>  	/*
> @@ -477,6 +620,69 @@ static int __tdx_td_init(struct kvm *kvm)
>  	return ret;
>  }
>  
> +static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(kvm);
> +	struct kvm_tdx_init_vm *init_vm = NULL;
> +	struct td_params *td_params = NULL;
> +	void *entries_end;
> +	int ret;
> +
> +	BUILD_BUG_ON(sizeof(*init_vm) != 16 * 1024);
> +	BUILD_BUG_ON((sizeof(*init_vm) - offsetof(typeof(*init_vm),
> entries)) /
> +		     sizeof(init_vm->entries[0]) <
> KVM_MAX_CPUID_ENTRIES);
> +	BUILD_BUG_ON(sizeof(struct td_params) != 1024);
> +
> +	if (is_hkid_assigned(kvm_tdx))
> +		return -EINVAL;
> +
> +	if (cmd->flags)
> +		return -EINVAL;
> +
> +	init_vm = kzalloc(sizeof(*init_vm), GFP_KERNEL);
> +	if (!init_vm)
> +		return -ENOMEM;
> +	if (copy_from_user(init_vm, (void __user *)cmd->data,
> sizeof(*init_vm))) {
> +		ret = -EFAULT;
> +		goto out;
> +	}
> +
> +	ret = -EINVAL;
> +	if (init_vm->cpuid.padding)
> +		goto out;
> +	/* init_vm->entries shouldn't overrun. */
> +	entries_end = init_vm->entries + init_vm->cpuid.nent;
> +	if (entries_end > (void *)(init_vm + 1))
> +		goto out;
> +	/* Unused part must be zero. */
> +	if (memchr_inv(entries_end, 0, (void *)(init_vm + 1) -
> entries_end))
> +		goto out;
> +
> +	td_params = kzalloc(sizeof(struct td_params), GFP_KERNEL);
> +	if (!td_params) {
> +		ret = -ENOMEM;
> +		goto out;
> +	}
> +
> +	ret = setup_tdparams(kvm, td_params, init_vm);
> +	if (ret)
> +		goto out;
> +
> +	ret = __tdx_td_init(kvm, td_params);
> +	if (ret)
> +		goto out;
> +
> +	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx,
> TD_TDCS_EXEC_TSC_OFFSET);
> +	kvm_tdx->attributes = td_params->attributes;
> +	kvm_tdx->xfam = td_params->xfam;
> +
> +out:
> +	/* kfree() accepts NULL. */
> +	kfree(init_vm);
> +	kfree(td_params);
> +	return ret;
> +}
> +
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	struct kvm_tdx_cmd tdx_cmd;
> @@ -490,6 +696,9 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  	mutex_lock(&kvm->lock);
>  
>  	switch (tdx_cmd.id) {
> +	case KVM_TDX_INIT_VM:
> +		r = tdx_td_init(kvm, &tdx_cmd);
> +		break;
>  	default:
>  		r = -EINVAL;
>  		goto out;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index e78d72cf4c3a..1b950f98242e 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -2,6 +2,8 @@
>  #ifndef __KVM_X86_TDX_H
>  #define __KVM_X86_TDX_H
>  
> +#include "tdx_ops.h"
> +
>  #ifdef CONFIG_INTEL_TDX_HOST
>  struct kvm_tdx {
>  	struct kvm kvm;
> @@ -9,7 +11,11 @@ struct kvm_tdx {
>  	unsigned long tdr_pa;
>  	unsigned long *tdcs_pa;
>  
> +	u64 attributes;
> +	u64 xfam;
>  	int hkid;
> +
> +	u64 tsc_offset;
>  };
>  
>  struct vcpu_tdx {
> @@ -36,6 +42,20 @@ static inline struct vcpu_tdx *to_tdx(struct kvm_vcpu
> *vcpu) {
>  	return container_of(vcpu, struct vcpu_tdx, vcpu);
>  }
> +
> +static __always_inline u64 td_tdcs_exec_read64(struct kvm_tdx *kvm_tdx, u32 field)
> +{
> +	struct tdx_module_output out;
> +	u64 err;
> +
> +	err = tdh_mng_rd(kvm_tdx->tdr_pa, TDCS_EXEC(field), &out);
> +	if (unlikely(err)) {
> +		pr_err("TDH_MNG_RD[EXEC.0x%x] failed: 0x%llx\n", field,
> err);
> +		return 0;
> +	}
> +	return out.r8;
> +}
> +
>  #else
>  struct kvm_tdx {
>  	struct kvm kvm;
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index 04562740691b..eb800965b589 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -530,6 +530,7 @@ struct kvm_pmu_event_filter {
>  /* Trust Domain eXtension sub-ioctl() commands. */
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
> +	KVM_TDX_INIT_VM,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };
> @@ -575,4 +576,36 @@ struct kvm_tdx_capabilities {
>  	struct kvm_tdx_cpuid_config cpuid_configs[0];
>  };
>  
> +struct kvm_tdx_init_vm {
> +	__u64 attributes;
> +	__u32 max_vcpus;
> +	__u32 padding;
> +	__u64 mrconfigid[6];    /* sha384 digest */
> +	__u64 mrowner[6];       /* sha384 digest */
> +	__u64 mrownerconfig[6]; /* sha348 digest */
> +	union {
> +		/*
> +		 * KVM_TDX_INIT_VM is called before vcpu creation, thus
> before
> +		 * KVM_SET_CPUID2.  CPUID configurations needs to be
> passed.
> +		 *
> +		 * This configuration supersedes KVM_SET_CPUID{,2}.
> +		 * The user space VMM, e.g. qemu, should make them
> consistent
> +		 * with this values.
> +		 * sizeof(struct kvm_cpuid_entry2) *
> KVM_MAX_CPUID_ENTRIES(256)
> +		 * = 8KB.
> +		 */
> +		struct {
> +			struct kvm_cpuid2 cpuid;
> +			/* 8KB with KVM_MAX_CPUID_ENTRIES. */
> +			struct kvm_cpuid_entry2 entries[];
> +		};
> +		/*
> +		 * For future extensibility.
> +		 * The size(struct kvm_tdx_init_vm) = 16KB.
> +		 * This should be enough given sizeof(TD_PARAMS) = 1024
> +		 */
> +		__u64 reserved[2028];
> +	};
> +};
> +
>  #endif /* _ASM_X86_KVM_H */


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-13 13:12   ` Zhi Wang
@ 2023-01-13 15:16     ` Sean Christopherson
  2023-01-14  9:16       ` Zhi Wang
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-13 15:16 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Fri, Jan 13, 2023, Zhi Wang wrote:
> On Thu, 12 Jan 2023 08:31:26 -0800 isaku.yamahata@intel.com wrote:
> > +static void tdx_reclaim_td_page(unsigned long td_page_pa)
> > +{
> > +	if (!td_page_pa)
> > +		return;
> > +	/*
> > +	 * TDCX are being reclaimed.  TDX module maps TDCX with HKID
> > +	 * assigned to the TD.  Here the cache associated to the TD
> > +	 * was already flushed by TDH.PHYMEM.CACHE.WB before here, So
> > +	 * cache doesn't need to be flushed again.
> > +	 */
> > +	if (WARN_ON(tdx_reclaim_page(td_page_pa, false, 0)))

The WARN_ON() can go, tdx_reclaim_page() has WARN_ON_ONCE() + pr_tdx_error() in
all error paths.

> > +		/* If reclaim failed, leak the page. */
> 
> Better add a FIXME: here as this has to be fixed later.

No, leaking the page is all KVM can reasonably do here.  An improved comment would
be helpful, but no code change is required.  tdx_reclaim_page() returns an error
if and only if there's an unexpected, fatal error, e.g. a SEAMCALL with bad params,
incorrect concurrency in KVM, a TDX Module bug, etc.  Retrying at a later point is
highly unlikely to be successful.
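
Something along these lines would do (a sketch of the comment only, not a code
change in the posted patch):

    	/*
    	 * tdx_reclaim_page() fails only on a fatal, unexpected error (bad
    	 * SEAMCALL parameters, broken concurrency in KVM, a TDX module bug,
    	 * ...) and already warns.  Retrying later won't help, so deliberately
    	 * leak the page instead of handing memory the TDX module may still
    	 * own back to the allocator.
    	 */
    	if (tdx_reclaim_page(td_page_pa, false, 0))
    		return;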

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id
  2023-01-13 12:47   ` Zhi Wang
@ 2023-01-13 15:21     ` Sean Christopherson
  2023-01-14  9:38       ` Zhi Wang
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-13 15:21 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack

On Fri, Jan 13, 2023, Zhi Wang wrote:
> On Thu, 12 Jan 2023 08:31:21 -0800 isaku.yamahata@intel.com wrote:
> > @@ -132,6 +136,31 @@ static struct notifier_block tdx_memory_nb = {
> >  	.notifier_call = tdx_memory_notifier,
> >  };
> >  
> > +/* TDX KeyID pool */
> > +static DEFINE_IDA(tdx_keyid_pool);
> > +
> > +int tdx_keyid_alloc(void)
> > +{
> > +	if (WARN_ON_ONCE(!tdx_keyid_start || !nr_tdx_keyids))
> > +		return -EINVAL;
> 
> Better mention that tdx_keyid_start and nr_tdx_keyids are defined in
> another patches.

Eh, no need.  That sort of information doesn't belong in the changelog because
when this code is merged it will be a natural sequence.  The cover letter
explicitly calls out that this needs the kernel patches[*].  A footnote could be
added, but asking Isaku and co. to document every external dependency is asking
too much IMO.

[*] https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-13 15:16     ` Sean Christopherson
@ 2023-01-14  9:16       ` Zhi Wang
  2023-01-17 15:55         ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-14  9:16 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Fri, 13 Jan 2023 15:16:08 +0000
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, Jan 13, 2023, Zhi Wang wrote:
> > On Thu, 12 Jan 2023 08:31:26 -0800 isaku.yamahata@intel.com wrote:
> > > +static void tdx_reclaim_td_page(unsigned long td_page_pa)
> > > +{
> > > +	if (!td_page_pa)
> > > +		return;
> > > +	/*
> > > +	 * TDCX are being reclaimed.  TDX module maps TDCX with HKID
> > > +	 * assigned to the TD.  Here the cache associated to the TD
> > > +	 * was already flushed by TDH.PHYMEM.CACHE.WB before here,
> > > So
> > > +	 * cache doesn't need to be flushed again.
> > > +	 */
> > > +	if (WARN_ON(tdx_reclaim_page(td_page_pa, false, 0)))
> 
> The WARN_ON() can go, tdx_reclaim_page() has WARN_ON_ONCE() +
> pr_tdx_error() in all error paths.
> 
> > > +		/* If reclaim failed, leak the page. */
> > 
> > Better add a FIXME: here as this has to be fixed later.
> 
> No, leaking the page is all KVM can reasonably do here.  An improved
> comment would be helpful, but no code change is required.
> tdx_reclaim_page() returns an error if and only if there's an
> unexpected, fatal error, e.g. a SEAMCALL with bad params, incorrect
> concurrency in KVM, a TDX Module bug, etc.  Retrying at a later point is
> highly unlikely to be successful.

Hi:

The word "leaking" sounds like a situation that is left unhandled only temporarily.

When reviewing this patch, I checked the source code of the TDX module[1] for
the possible reasons for failure:

tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_reclaim.c
tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_wbinvd.c

a. Invalid parameters. For example, page is not aligned, PA HKID is not zero...

For invalid parameters, a WARN_ON_ONCE() + return value is good enough, as
that is how the kernel handles similar situations. The caller takes
responsibility.
 
b. Locks have been taken in the TDX module. The TDR page has been locked by
another SEAMCALL, or another SEAMCALL is doing a PAMT walk and holding the PAMT lock...

This needs to be improved later, either by retrying or by taking tdx_lock so
that the TDX module does not fail on this; see the sketch below.

[1] https://www.intel.com/content/www/us/en/download/738875/738876/intel-trust-domain-extension-intel-tdx-module.html
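
Roughly the kind of serialization I mean for (b), assuming tdx_lock from the
earlier patch were used here (just a sketch, not what the posted patch does):

    	/*
    	 * Hypothetical: hold tdx_lock across TDH.PHYMEM.PAGE.RECLAIM so it
    	 * cannot contend with other SEAMCALLs that take locks inside the
    	 * TDX module, instead of treating such TDX_OPERAND_BUSY results as
    	 * fatal.
    	 */
    	mutex_lock(&tdx_lock);
    	do {
    		err = tdh_phymem_page_reclaim(pa, &out);
    	} while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
    	mutex_unlock(&tdx_lock);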

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id
  2023-01-13 15:21     ` Sean Christopherson
@ 2023-01-14  9:38       ` Zhi Wang
  0 siblings, 0 replies; 221+ messages in thread
From: Zhi Wang @ 2023-01-14  9:38 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack

On Fri, 13 Jan 2023 15:21:54 +0000
Sean Christopherson <seanjc@google.com> wrote:

> On Fri, Jan 13, 2023, Zhi Wang wrote:
> > On Thu, 12 Jan 2023 08:31:21 -0800 isaku.yamahata@intel.com wrote:
> > > @@ -132,6 +136,31 @@ static struct notifier_block tdx_memory_nb = {
> > >  	.notifier_call = tdx_memory_notifier,
> > >  };
> > >  
> > > +/* TDX KeyID pool */
> > > +static DEFINE_IDA(tdx_keyid_pool);
> > > +
> > > +int tdx_keyid_alloc(void)
> > > +{
> > > +	if (WARN_ON_ONCE(!tdx_keyid_start || !nr_tdx_keyids))
> > > +		return -EINVAL;
> > 
> > Better mention that tdx_keyid_start and nr_tdx_keyids are defined in
> > another patches.
> 
> Eh, no need.  That sort of information doesn't belong in the changelog
> because when this code is merged it will be a natural sequence.  The
> cover letter explicitly calls out that this needs the kernel patches[*].
>  A footnote could be added, but asking Isaku and co. to document every
> external dependency is asking too much IMO.
> 
> [*] https://lore.kernel.org/lkml/cover.1670566861.git.kai.huang@intel.com

Hi:

Thanks. I raised this concern from a reviewer's perspective. For example, you
find something is missing, grep for it, find nothing, and jump to another
window to grep again.

Only then can you be sure whether the missing tdx_keyid_start in the patch is
a mistake or a dependency. Then the same happens with nr_tdx_keyids.

It would be nice to just say in a comment that tdx_keyid_start and
nr_tdx_keyids require an external patch, or just mention in a comment that
this patch depends on an external patch. It would save quite some effort.
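
For example, a comment roughly like this (just a sketch of the idea, not text
from any posted patch):

    	/*
    	 * tdx_keyid_start and nr_tdx_keyids are set up by the separate TDX
    	 * host series that the cover letter lists as a prerequisite; they
    	 * stay 0 until that code has detected the TDX KeyID range.
    	 */
    	if (WARN_ON_ONCE(!tdx_keyid_start || !nr_tdx_keyids))
    		return -EINVAL;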

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module
  2023-01-12 16:31 ` [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module isaku.yamahata
  2023-01-13 12:31   ` Zhi Wang
@ 2023-01-16  3:48   ` Huang, Kai
  1 sibling, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-16  3:48 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX requires several initialization steps for KVM to create guest TDs.
> Detect CPU feature, enable VMX (TDX is based on VMX), detect the TDX module
> availability, and initialize it.  This patch implements those steps.

"detect the TDX module" is not needed anymore.

Btw, I guess you should get rid of "This patch ...".  Please see below quoted
txt from https://docs.kernel.org/process/submitting-patches.html :

"
Describe your changes in imperative mood, e.g. “make xyzzy do frotz” instead of
“[This patch] makes xyzzy do frotz” or “[I] changed xyzzy to do frotz”, as if
you are giving orders to the codebase to change its behaviour.
"

> 
> There are several options on when to initialize the TDX module.  A.) kernel
> module loading time, B.) the first guest TD creation time.  A.) was chosen.
> With B.), a user may hit an error of the TDX initialization when trying to
> create the first guest TD.  The machine that fails to initialize the TDX
> module can't boot any guest TD further.  Such failure is undesirable and a
> surprise because the user expects that the machine can accommodate guest
> TD, but actually not.  So A.) is better than B.).
> 
> Introduce a module parameter, enable_tdx, to explicitly enable TDX KVM
> support.  It's off by default to keep same behavior for those who don't use
> TDX.  Implement hardware_setup method to detect TDX feature of CPU.
> Because TDX requires all present CPUs to enable VMX (VMXON).  The x86

"Because TDX ... , the x86 specific ...".

> specific kvm_arch_post_hardware_enable_setup overrides the existing weak
> symbol of kvm_arch_post_hardware_enable_setup which is called at the KVM
> module initialization.
> 
> Suggested-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/Makefile      |  1 +
>  arch/x86/kvm/vmx/main.c    | 33 +++++++++++++++++++++++-----
>  arch/x86/kvm/vmx/tdx.c     | 44 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/vmx.c     | 39 +++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h | 10 +++++++++
>  5 files changed, 122 insertions(+), 5 deletions(-)
>  create mode 100644 arch/x86/kvm/vmx/tdx.c
> 
> diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> index 0e894ae23cbc..4b01ab842ab7 100644
> --- a/arch/x86/kvm/Makefile
> +++ b/arch/x86/kvm/Makefile
> @@ -25,6 +25,7 @@ kvm-$(CONFIG_KVM_SMM)	+= smm.o
>  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o vmx/vmcs12.o \
>  			   vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
>  kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
>  
>  kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o svm/nested.o svm/avic.o \
>  			   svm/sev.o svm/hyperv.o
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 18f659d1d456..f5d1166d2718 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -7,6 +7,22 @@
>  #include "pmu.h"
>  #include "tdx.h"
>  
> +static bool enable_tdx __ro_after_init = IS_ENABLED(CONFIG_INTEL_TDX_HOST);

The changelog says it's off by default, but here it defaults to on whenever
CONFIG_INTEL_TDX_HOST is enabled, so the two are not consistent.
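
If off-by-default is really the intent, one way to make the code match the
changelog would be (just a sketch; alternatively, fix the changelog to say the
default follows CONFIG_INTEL_TDX_HOST):

	static bool enable_tdx __ro_after_init;
	module_param_named(tdx, enable_tdx, bool, 0444);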

> +module_param_named(tdx, enable_tdx, bool, 0444);
> +
> +static __init int vt_hardware_setup(void)
> +{
> +	int ret;
> +
> +	ret = vmx_hardware_setup();
> +	if (ret)
> +		return ret;
> +
> +	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> +
> +	return 0;
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = KBUILD_MODNAME,
>  
> @@ -149,7 +165,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  };
>  
>  struct kvm_x86_init_ops vt_init_ops __initdata = {
> -	.hardware_setup = vmx_hardware_setup,
> +	.hardware_setup = vt_hardware_setup,
>  	.handle_intel_pt_intr = NULL,
>  
>  	.runtime_ops = &vt_x86_ops,
> @@ -182,10 +198,17 @@ static int __init vt_init(void)
>  	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
>  	 * exposed to userspace!
>  	 */
> -	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct kvm_tdx));
> -	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
> -	vcpu_align = max(__alignof__(struct vcpu_vmx),
> -			 __alignof__(struct vcpu_tdx));
> +	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
> +	vcpu_size = sizeof(struct vcpu_vmx);
> +	vcpu_align = __alignof__(struct vcpu_vmx);
> +	if (enable_tdx) {
> +		vt_x86_ops.vm_size = max_t(unsigned int, vt_x86_ops.vm_size,
> +					   sizeof(struct kvm_tdx));
> +		vcpu_size = max_t(unsigned int, vcpu_size,
> +				  sizeof(struct vcpu_tdx));
> +		vcpu_align = max_t(unsigned int, vcpu_align,
> +				   __alignof__(struct vcpu_tdx));
> +	}

You have below code in the previous patch:

@@ -181,9 +182,10 @@ static int __init vt_init(void)
 	 * Common KVM initialization _must_ come last, after this, /dev/kvm is
 	 * exposed to userspace!
 	 */
-	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
-	vcpu_size = sizeof(struct vcpu_vmx);
-	vcpu_align = __alignof__(struct vcpu_vmx);
+	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct
kvm_tdx));
+	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct vcpu_tdx));
+	vcpu_align = max(__alignof__(struct vcpu_vmx),
+			 __alignof__(struct vcpu_tdx));

The chunk here in this patch is not related to how to initialize the TDX
module; it belongs to "how KVM handles TDX at VM level and vcpu level", which
is introduced in the previous patch -- and that previous patch doesn't have to
come before this one.

IMHO a better way is to move the previous patch after this one, and just put 
this chunk to that patch.  And if you do so, this chunk of change can just
appear once but not twice in two patches.
 

[snip]

>  
> +#ifdef CONFIG_INTEL_TDX_HOST
> +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> +#else
> +static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }

Why do you return 0, which is a success IIUC?
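
If the !CONFIG_INTEL_TDX_HOST stub is never meant to report success, something
like the below seems more natural (just a sketch, matching how the other TDX
stubs return errors):

	static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
	{
		return -EOPNOTSUPP;
	}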




^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module
  2023-01-12 16:31 ` [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
@ 2023-01-16  4:19   ` Huang, Kai
  2023-02-27 21:20     ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16  4:19 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX KVM needs system-wide information about the TDX module, struct
> tdsysinfo_struct.  Add a helper function tdx_get_sysinfo() to return it
> instead of KVM getting it with various error checks.  Make KVM call the
> function and stash the info.  Move out the struct definition about it to
> common place arch/x86/include/asm/tdx.h.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/tdx.h  | 54 +++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/tdx.c      | 49 ++++++++++++++++++++++++++++++++-
>  arch/x86/virt/vmx/tdx/tdx.c | 21 ++++++++++++---
>  arch/x86/virt/vmx/tdx/tdx.h | 51 -----------------------------------
>  4 files changed, 119 insertions(+), 56 deletions(-)
> 
> diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> index ed9cf61ff8b4..2ca6e8ce1e43 100644
> --- a/arch/x86/include/asm/tdx.h
> +++ b/arch/x86/include/asm/tdx.h
> @@ -105,6 +105,58 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
>  #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
>  
>  #ifdef CONFIG_INTEL_TDX_HOST
> +struct tdx_cpuid_config {
> +	u32	leaf;
> +	u32	sub_leaf;
> +	u32	eax;
> +	u32	ebx;
> +	u32	ecx;
> +	u32	edx;
> +} __packed;
> +
> +#define TDSYSINFO_STRUCT_SIZE		1024
> +#define TDSYSINFO_STRUCT_ALIGNMENT	1024
> +
> +/*
> + * The size of this structure itself is flexible.  The actual structure
> + * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be
> + * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT().
> + */
> +struct tdsysinfo_struct {
> +	/* TDX-SEAM Module Info */
> +	u32	attributes;
> +	u32	vendor_id;
> +	u32	build_date;
> +	u16	build_num;
> +	u16	minor_version;
> +	u16	major_version;
> +	u8	reserved0[14];
> +	/* Memory Info */
> +	u16	max_tdmrs;
> +	u16	max_reserved_per_tdmr;
> +	u16	pamt_entry_size;
> +	u8	reserved1[10];
> +	/* Control Struct Info */
> +	u16	tdcs_base_size;
> +	u8	reserved2[2];
> +	u16	tdvps_base_size;
> +	u8	tdvps_xfam_dependent_size;
> +	u8	reserved3[9];
> +	/* TD Capabilities */
> +	u64	attributes_fixed0;
> +	u64	attributes_fixed1;
> +	u64	xfam_fixed0;
> +	u64	xfam_fixed1;
> +	u8	reserved4[32];
> +	u32	num_cpuid_config;
> +	/*
> +	 * The actual number of CPUID_CONFIG depends on above
> +	 * 'num_cpuid_config'.
> +	 */
> +	DECLARE_FLEX_ARRAY(struct tdx_cpuid_config, cpuid_configs);
> +} __packed;
> +
> +const struct tdsysinfo_struct *tdx_get_sysinfo(void);
>  bool platform_tdx_enabled(void);
>  int tdx_enable(void);
>  /*
> @@ -120,6 +172,8 @@ void tdx_keyid_free(int keyid);
>  u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
>  	       struct tdx_module_output *out);
>  #else	/* !CONFIG_INTEL_TDX_HOST */
> +struct tdsysinfo_struct;
> +static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
>  static inline bool platform_tdx_enabled(void) { return false; }
>  static inline int tdx_enable(void)  { return -EINVAL; }
>  static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 6c7d9ec53046..2adf5551ab26 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -11,9 +11,34 @@
>  #undef pr_fmt
>  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
>  
> +#define TDX_MAX_NR_CPUID_CONFIGS					\
> +	((TDSYSINFO_STRUCT_SIZE -					\
> +		offsetof(struct tdsysinfo_struct, cpuid_configs))	\
> +		/ sizeof(struct tdx_cpuid_config))
> +
> +struct tdx_capabilities {
> +	u8 tdcs_nr_pages;
> +	u8 tdvpx_nr_pages;
> +
> +	u64 attrs_fixed0;
> +	u64 attrs_fixed1;
> +	u64 xfam_fixed0;
> +	u64 xfam_fixed1;
> +
> +	u32 nr_cpuid_configs;
> +	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
> +};
> +
> +/* Capabilities of KVM + the TDX module. */
> +static struct tdx_capabilities tdx_caps;
> +
>  static int __init tdx_module_setup(void)
>  {
> -	int ret;
> +	const struct tdsysinfo_struct *tdsysinfo;
> +	int ret = 0;
> +
> +	BUILD_BUG_ON(sizeof(*tdsysinfo) > TDSYSINFO_STRUCT_SIZE);
> +	BUILD_BUG_ON(TDX_MAX_NR_CPUID_CONFIGS != 37);
>  
>  	ret = tdx_enable();
>  	if (ret) {
> @@ -21,6 +46,28 @@ static int __init tdx_module_setup(void)
>  		return ret;
>  	}
>  
> +	tdsysinfo = tdx_get_sysinfo();
> +	if (tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS)
> +		return -EIO;

This check basically means the TDX module is buggy (or the kernel has a bug).
Do we really need this check?  Is the TDX module that buggy?

IMHO you don't need it at all, or use WARN() if you want to catch a kernel bug?
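
E.g. something like this (untested, just to show the WARN() idea):

	tdsysinfo = tdx_get_sysinfo();
	if (WARN_ON_ONCE(tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS))
		return -EIO;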

> +
> +	tdx_caps = (struct tdx_capabilities) {
> +		.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
> +		/*
> +		 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
> +		 * -1 for TDVPR.
> +		 */
> +		.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
> +		.attrs_fixed0 = tdsysinfo->attributes_fixed0,
> +		.attrs_fixed1 = tdsysinfo->attributes_fixed1,
> +		.xfam_fixed0 =	tdsysinfo->xfam_fixed0,
> +		.xfam_fixed1 = tdsysinfo->xfam_fixed1,
> +		.nr_cpuid_configs = tdsysinfo->num_cpuid_config,
> +	};
> +	if (!memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
> +			tdsysinfo->num_cpuid_config *
> +			sizeof(struct tdx_cpuid_config)))
> +		return -EIO;

Why introduce 'struct tdx_capabilities' and the code above here in this patch?

It's entirely unclear why the new structure is needed -- nothing is mentioned
in the changelog, nor is there any comment.  Please explain in the changelog or
move this chunk to where it is needed.

On the technical side, is 'struct tdx_capabilities' really needed, or is it
just for convenience?

[snip]


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-01-12 16:31 ` [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP isaku.yamahata
  2023-01-13 12:55   ` Zhi Wang
@ 2023-01-16  4:44   ` Huang, Kai
  2023-02-27 21:26     ` Isaku Yamahata
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16  4:44 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX attestation includes the maximum number of vcpu that the guest can
> accommodate.  
> 

I don't understand why "attestation" is the reason here.  If TDX were used w/o
attestation, this patch still couldn't be discarded, could it?

IMHO the true reason is that TDX has its own control of the maximum number of
vcpus, i.e. it asks you to specify the value when creating the TD.  Therefore,
the constant KVM_MAX_VCPUS doesn't work for a TDX guest anymore.

 
> For that, the maximum number of vcpu needs to be specified
> instead of constant, KVM_MAX_VCPUS.  Make KVM_ENABLE_CAP support
> KVM_CAP_MAX_VCPUS.
> 
> Suggested-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  virt/kvm/kvm_main.c | 20 ++++++++++++++++++++
>  1 file changed, 20 insertions(+)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index a235b628b32f..1cfa7da92ad0 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4945,7 +4945,27 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
>  		}
>  
>  		mutex_unlock(&kvm->slots_lock);
> +		return r;
> +	}
> +	case KVM_CAP_MAX_VCPUS: {
> +		int r;
>  
> +		if (cap->flags || cap->args[0] == 0)
> +			return -EINVAL;
> +		if (cap->args[0] >  kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MAX_VCPUS))
> +			return -E2BIG;
> +
> +		mutex_lock(&kvm->lock);
> +		/* Only decreasing is allowed. */

Why?

> +		if (cap->args[0] > kvm->max_vcpus)
> +			r = -E2BIG;
> +		else if (kvm->created_vcpus)
> +			r = -EBUSY;
> +		else {
> +			kvm->max_vcpus = cap->args[0];
> +			r = 0;
> +		}
> +		mutex_unlock(&kvm->lock);
>  		return r;
>  	}
>  	default:

Also, IIUC this change is made to the generic kvm_main.c, which means other
archs are affected too.  Is this OK for other archs?  Why can't such a change
be TDX-specific (or, at least, x86- or VMX-specific)?
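
For example, a rough sketch of keeping it x86-only by handling the cap in
kvm_vm_ioctl_enable_cap() in arch/x86/kvm/x86.c instead (untested, error
handling simplified):

	case KVM_CAP_MAX_VCPUS:
		r = -EINVAL;
		if (cap->flags || !cap->args[0])
			break;
		mutex_lock(&kvm->lock);
		/* Only allow shrinking, and only before any vCPU is created. */
		if (!kvm->created_vcpus && cap->args[0] <= kvm->max_vcpus) {
			kvm->max_vcpus = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		break;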

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters
  2023-01-12 16:31 ` [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
  2023-01-13 14:58   ` Zhi Wang
@ 2023-01-16 10:04   ` Huang, Kai
  2023-02-27 21:32     ` Isaku Yamahata
  2023-01-17 12:19   ` Huang, Kai
  2 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 10:04 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: Li, Xiaoyao, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> +struct kvm_tdx_init_vm {
> +	__u64 attributes;
> +	__u64 mrconfigid[6];	/* sha384 digest */
> +	__u64 mrowner[6];	/* sha384 digest */
> +	__u64 mrownerconfig[6];	/* sha348 digest */
> +	union {
> +		/*
> +		 * KVM_TDX_INIT_VM is called before vcpu creation, thus before
> +		 * KVM_SET_CPUID2.  CPUID configurations needs to be passed.
> +		 *
> +		 * This configuration supersedes KVM_SET_CPUID{,2}.

What does "{,2}" mean?

> +		 * The user space VMM, e.g. qemu, should make them consistent
> +		 * with this values.

You are already using 'struct kvm_cpuid2' below.  Isn't it enough to imply that
userspace should organize data in the format of 'struct kvm_cpuid2'?

> +		 * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(256)
> +		 * = 8KB.
> +		 */

What does this comment try to imply?

> +		struct {
> +			struct kvm_cpuid2 cpuid;
> +			/* 8KB with KVM_MAX_CPUID_ENTRIES. */
> +			struct kvm_cpuid_entry2 entries[];

I don't understand the purpose of the second field.

Shouldn't the 'struct kvm_cpuid2' already have all the CPUID entries?

> +		};
> +		/*
> +		 * For future extensibility.
> +		 * The size(struct kvm_tdx_init_vm) = 16KB.
> +		 * This should be enough given sizeof(TD_PARAMS) = 1024
> +		 */
> +		__u64 reserved[2029];

I think this is just wrong.  How can you extend something after a dynamically
sized CPUID array?

If you want extensibility, you need to put the reserved space before the
flexible array.
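
I.e. roughly (sizes are illustrative only):

	struct kvm_tdx_init_vm {
		__u64 attributes;
		__u64 mrconfigid[6];	/* sha384 digest */
		__u64 mrowner[6];	/* sha384 digest */
		__u64 mrownerconfig[6];	/* sha384 digest */
		__u64 reserved[1004];	/* fixed-size room for future fields */

		/* The flexible CPUID configuration must stay last. */
		struct kvm_cpuid2 cpuid;
	};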

> +	};
> +};



^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package
  2023-01-12 16:31 ` [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package isaku.yamahata
@ 2023-01-16 10:23   ` Huang, Kai
  2023-02-27 21:48     ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 10:23 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> In order to reclaim TDX HKID, (i.e. when deleting guest TD), needs to call
> TDH.PHYMEM.PAGE.WBINVD on all packages.  If we have used TDX HKID, refuse
> to offline the last online cpu. Add arch callback for cpu offline.

I think it is worth talking about the suspend stuff, i.e. why we only refuse
to offline the last cpu of a package when there is an active TD, rather than
refusing it whenever TDX is enabled in KVM.  People may not immediately
understand the reason behind this design.

Btw, I certainly don't want to speak for Sean, but it seems this was suggested
by Sean?  If so, add a 'Suggested-by' tag?

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
> 

[snip]

> +
> +int tdx_offline_cpu(void)
> +{
> +	int curr_cpu = smp_processor_id();
> +	cpumask_var_t packages;
> +	int ret = 0;
> +	int i;
> +
> +	if (!atomic_read(&nr_configured_hkid))
> +		return 0;

As mentioned above, I think it's also worth adding a comment here.  When people
are trying to understand some code, they will mostly just look at the code
itself; they won't use 'git blame' to dig out the entire changelog to
understand it.
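
Something like the below, for example (wording is up to you):

	/*
	 * Offlining is only a problem if a TD's HKID may need to be reclaimed
	 * later: TDH.PHYMEM.PAGE.WBINVD must then run on every package.  With
	 * no HKID configured, any CPU can be offlined.
	 */
	if (!atomic_read(&nr_configured_hkid))
		return 0;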

> +
> +	/*
> +	 * To reclaim hkid, need to call TDH.PHYMEM.PAGE.WBINVD on all packages.
> +	 * If this is the last online cpu on the package, refuse offline.
> +	 */
> +	if (!zalloc_cpumask_var(&packages, GFP_KERNEL))
> +		return -ENOMEM;
> +
> +	for_each_online_cpu(i) {
> +		if (i != curr_cpu)
> +			cpumask_set_cpu(topology_physical_package_id(i), packages);
> +	}
> +	if (!cpumask_test_cpu(topology_physical_package_id(curr_cpu), packages))
> +		ret = -EBUSY;
> +	free_cpumask_var(packages);
> +	if (ret)
> +		/*
> +		 * Because it's hard for human operator to understand the
> +		 * reason, warn it.
> +		 */
> +		pr_warn("TDX requires all packages to have an online CPU.  "
> +			"Delete all TDs in order to offline all CPUs of a package.\n");
> +	return ret;
> +}
> 

[snip]

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 025/113] KVM: TDX: Use private memory for TDX
  2023-01-12 16:31 ` [PATCH v11 025/113] KVM: TDX: Use private memory for TDX isaku.yamahata
@ 2023-01-16 10:45   ` Huang, Kai
  2023-01-17 16:40     ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 10:45 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: chao.p.peng, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Override kvm_arch_has_private_mem() to use fd-based private memory.
> Return true when a VM has a type of KVM_X86_TDX_VM.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/x86.c | 5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index d548d3af6428..a8b555935fd8 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -13498,6 +13498,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
>  }
>  EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
>  
> +bool kvm_arch_has_private_mem(struct kvm *kvm)
> +{
> +	return kvm->arch.vm_type == KVM_X86_TDX_VM;
> +}
> +

AMD's series has a different solution:

https://lore.kernel.org/lkml/20221214194056.161492-3-michael.roth@amd.com/

I think somehow this needs to get aligned.

Simply checking whether the VM is a TD certainly won't work for AMD, so at
first glance we should use something similar to AMD's solution.  But I haven't
thought through whether that is absolutely needed; for instance, perhaps we
can use 'gfn_shared_mask', etc.


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-01-12 16:31 ` [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
@ 2023-01-16 10:46   ` Zhi Wang
  2023-02-27 23:49     ` Isaku Yamahata
  2023-01-19  0:45   ` Huang, Kai
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-16 10:46 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, 12 Jan 2023 08:31:31 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> The next step of TDX guest creation is to create vcpu.  Allocate TDX vcpu
> structures, partially initialize it.  Allocate pages of TDX vcpu for the
> TDX module.  Actual donation TDX vcpu pages to the TDX module is not done
> yet.
> 
> In the case of the conventional case, cpuid is empty at the initialization.
> and cpuid is configured after the vcpu initialization.  Because TDX
> supports only X2APIC mode, cpuid is forcibly initialized to support X2APIC
> on the vcpu initialization.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
> Changes v10 -> v11:
> - NULL check of kvmalloc_array() in tdx_vcpu_reset. Move it to
>   tdx_vcpu_create()
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 40 ++++++++++++++++++--
>  arch/x86/kvm/vmx/tdx.c     | 75 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h | 10 +++++
>  arch/x86/kvm/x86.c         |  2 +
>  4 files changed, 123 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index ddf0742f1f67..59813ca05f36 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -63,6 +63,38 @@ static void vt_vm_free(struct kvm *kvm)
>  		tdx_vm_free(kvm);
>  }
>  
> +static int vt_vcpu_precreate(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return 0;
> +
> +	return vmx_vcpu_precreate(kvm);
> +}
> +
> +static int vt_vcpu_create(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_create(vcpu);
> +
> +	return vmx_vcpu_create(vcpu);
> +}
> +

-----
> +static void vt_vcpu_free(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_free(vcpu);
> +
> +	return vmx_vcpu_free(vcpu);
> +}
> +
> +static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_reset(vcpu, init_event);
> +
> +	return vmx_vcpu_reset(vcpu, init_event);
> +}
> +
----

It seems a little strange to use return in this style. Would it be better like:

-----
if (xxx) {
	tdx_vcpu_reset(xxx);
	return; 
}

vmx_vcpu_reset(xxx);
----

?

>  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	if (!is_td(kvm))
> @@ -90,10 +122,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.vm_destroy = vt_vm_destroy,
>  	.vm_free = vt_vm_free,
>  
> -	.vcpu_precreate = vmx_vcpu_precreate,
> -	.vcpu_create = vmx_vcpu_create,
> -	.vcpu_free = vmx_vcpu_free,
> -	.vcpu_reset = vmx_vcpu_reset,
> +	.vcpu_precreate = vt_vcpu_precreate,
> +	.vcpu_create = vt_vcpu_create,
> +	.vcpu_free = vt_vcpu_free,
> +	.vcpu_reset = vt_vcpu_reset,
>  
>  	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
>  	.vcpu_load = vmx_vcpu_load,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 557a609c5147..099f0737a5aa 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -281,6 +281,81 @@ int tdx_vm_init(struct kvm *kvm)
>  	return 0;
>  }
>  
> +int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *e;
> +
> +	/*
> +	 * On cpu creation, cpuid entry is blank.  Forcibly enable
> +	 * X2APIC feature to allow X2APIC.
> +	 * Because vcpu_reset() can't return error, allocation is done here.
> +	 */
> +	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> +	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> +	e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);
> +	if (!e)
> +		return -ENOMEM;
> +	*e  = (struct kvm_cpuid_entry2) {
> +		.function = 1,	/* Features for X2APIC */
> +		.index = 0,
> +		.eax = 0,
> +		.ebx = 0,
> +		.ecx = 1ULL << 21,	/* X2APIC */
> +		.edx = 0,
> +	};
> +	vcpu->arch.cpuid_entries = e;
> +	vcpu->arch.cpuid_nent = 1;
> +
> +	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> +	if (!vcpu->arch.apic)
> +		return -EINVAL;
> +
> +	fpstate_set_confidential(&vcpu->arch.guest_fpu);
> +
> +	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> +
> +	vcpu->arch.cr0_guest_owned_bits = -1ul;
> +	vcpu->arch.cr4_guest_owned_bits = -1ul;
> +
> +	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
> +	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> +	vcpu->arch.guest_state_protected =
> +		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
> +
> +	return 0;
> +}
> +
> +void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> +{
> +	/* This is stub for now.  More logic will come. */
> +}
> +
> +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +{
> +	struct msr_data apic_base_msr;
> +
> +	/* TDX doesn't support INIT event. */
> +	if (WARN_ON_ONCE(init_event))
> +		goto td_bugged;
> +
> +	/* TDX rquires X2APIC. */
                ^
               requires
> +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> +	if (kvm_vcpu_is_reset_bsp(vcpu))
> +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> +	apic_base_msr.host_initiated = true;
> +	if (WARN_ON_ONCE(kvm_set_apic_base(vcpu, &apic_base_msr)))
> +		goto td_bugged;
> +
> +	/*
> +	 * Don't update mp_state to runnable because more initialization
> +	 * is needed by TDX_VCPU_INIT.
> +	 */
> +	return;
> +
> +td_bugged:
> +	vcpu->kvm->vm_bugged = true;
> +}
> +

1) Using vm_bugged to terminate the VM in the creation path feels off. When it
is set during creation, the actual termination still only happens in
xxx_vcpu_run().

Thus, even if something goes wrong at some point in the creation path, the VM
creation still continues; the VM is only really terminated once xxx_vcpu_run()
is reached.

Why not just fail in the creation path?

2) Move 

> +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> +	if (kvm_vcpu_is_reset_bsp(vcpu))
> +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> +	apic_base_msr.host_initiated = true;

to:

void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
{
        struct kvm_lapic *apic = vcpu->arch.apic;
        u64 msr_val;
        int i;

        if (!init_event) {
                msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;

		/* here */
		if (is_td_vcpu(vcpu)) 
			msr_val = xxxx;
                if (kvm_vcpu_is_reset_bsp(vcpu))
                        msr_val |= MSR_IA32_APICBASE_BSP;
                kvm_lapic_set_base(vcpu, msr_val);
        }

PS: Is there any reason that APIC MSR in TDX doesn't need
MSR_IA32_APICBASE_ENABLE?

3) Change the following:

> +
> +	/* TDX doesn't support INIT event. */
> +	if (WARN_ON_ONCE(init_event))
> +		goto td_bugged;
> +

to 
	WARN_ON_ONCE(init_event);

kvm_cpu_deliver_init() will trigger a kvm_vcpu_reset(xxx, init_event=true),
but you have already avoided this in vt_vcpu_deliver_init(). A warn
is good enough to remind people.

With these changes, tdx_vcpu_reset() will only contain the CPUID configuration,
and the use of vm_bugged to terminate the VM in tdx_vcpu_reset() can be
removed.

>  int tdx_dev_ioctl(void __user *argp)
>  {
>  	struct kvm_tdx_capabilities __user *user_caps;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 6c40dda1cc2f..37ab2cfd35bc 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -147,7 +147,12 @@ int tdx_offline_cpu(void);
>  int tdx_vm_init(struct kvm *kvm);
>  void tdx_mmu_release_hkid(struct kvm *kvm);
>  void tdx_vm_free(struct kvm *kvm);
> +
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> +
> +int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
>  #else
>  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
>  static inline void tdx_hardware_unsetup(void) {}
> @@ -159,7 +164,12 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
>  static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
>  static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
>  static inline void tdx_vm_free(struct kvm *kvm) {}
> +
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> +
> +static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> +static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> +static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
>  #endif
>  
>  #endif /* __KVM_X86_VMX_X86_OPS_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1fb135e0c98f..e8bc66031a1d 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -492,6 +492,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
>  	kvm_recalculate_apic_map(vcpu->kvm);
>  	return 0;
>  }
> +EXPORT_SYMBOL_GPL(kvm_set_apic_base);
>  
>  /*
>   * Handle a fault on a hardware virtualization (VMX or SVM) instruction.
> @@ -12109,6 +12110,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
>  {
>  	return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
>  }
> +EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);
>

The symbols don't need to be exported with the changes mentioned above.
  
>  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
>  {


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE
  2023-01-12 16:31 ` [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE isaku.yamahata
@ 2023-01-16 10:54   ` Huang, Kai
  2023-02-27 21:53     ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 10:54 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: sean.j.christopherson, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
> is not able to access the private memory of TD guest and do the emulation.
> Instead, TD guest expects to receive #VE when it accesses the MMIO and then
> it can explicitly make hypercall to KVM to get the expected information.
> 
> To achieve this, the TDX module always enables "EPT-violation #VE" in the
> VMCS control.  And accordingly, for the MMIO spte for the shared GPA,
> 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
> violation happens on TD accessing MMIO range.  2. On EPT violation, KVM
> sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
> the #VE instead of EPT misconfigration unlike VMX case.  For the shared GPA
> that is not populated yet, EPT violation need to be triggered when TD guest
> accesses such shared GPA.  The non-present SPTE value for shared GPA should
> set "suppress #VE" bit.
> 
> Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
> REMOVED_SPTE.  Unconditionally set the "suppress #VE" bit (which is bit 63)
> for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
> present bit is off; 2) for normal VMX guest, KVM never enables the
> "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
> hardware.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/vmx.h |  1 +
>  arch/x86/kvm/mmu/spte.h    | 15 ++++++++++++++-
>  arch/x86/kvm/mmu/tdp_mmu.c |  8 ++++++++
>  3 files changed, 23 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> index 498dc600bd5c..cdbf12c1a83c 100644
> --- a/arch/x86/include/asm/vmx.h
> +++ b/arch/x86/include/asm/vmx.h
> @@ -511,6 +511,7 @@ enum vmcs_field {
>  #define VMX_EPT_IPAT_BIT    			(1ull << 6)
>  #define VMX_EPT_ACCESS_BIT			(1ull << 8)
>  #define VMX_EPT_DIRTY_BIT			(1ull << 9)
> +#define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)

I don't know whether you should introduce this macro, since it's not used in
this patch.

>  #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
>  						 VMX_EPT_WRITABLE_MASK |       \
>  						 VMX_EPT_EXECUTABLE_MASK)
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index f190eaf6b2b5..471378ee9071 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -148,7 +148,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>  
>  #define MMIO_SPTE_GEN_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
>  
> +/*
> + * Non-present SPTE value for both VMX and SVM for TDP MMU.
> + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> + * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
> + *              bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
> + * For TDX:
> + *   TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
> + */
> +#ifdef CONFIG_X86_64
> +#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)
> +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> +#else
>  #define SHADOW_NONPRESENT_VALUE	0ULL
> +#endif
>  
>  extern u64 __read_mostly shadow_host_writable_mask;
>  extern u64 __read_mostly shadow_mmu_writable_mask;
> @@ -195,7 +208,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
>   *
>   * Only used by the TDP MMU.
>   */
> -#define REMOVED_SPTE	0x5a0ULL
> +#define REMOVED_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)

This chunk belongs to the previous patch.

>  
>  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
>  static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 9cf5844dd34a..6111e3e9266d 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -700,6 +700,14 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  	 * overwrite the special removed SPTE value. No bookkeeping is needed
>  	 * here since the SPTE is going from non-present to non-present.  Use
>  	 * the raw write helper to avoid an unnecessary check on volatile bits.
> +	 *
> +	 * Set non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
> +	 * It is because when TDX is enabled, TDX module always
> +	 * enables "EPT-violation #VE", so KVM needs to set
> +	 * "suppress #VE" bit in EPT table entries, in order to get
> +	 * real EPT violation, rather than TDVMCALL.  KVM sets
> +	 * SHADOW_NONPRESENT_VALUE (which sets "suppress #VE" bit) so it
> +	 * can be set when EPT table entries are zapped.
>  	 */
>  	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
>  

I don't think this is the right place to explain the "suppress #VE" bit stuff.
There are other places that set non-present SPTEs too.  Perhaps putting the
comment around the macro 'SHADOW_NONPRESENT_VALUE' is better.
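
E.g., the existing comment above the macro could absorb it, roughly:

	/*
	 * Non-present SPTE value for both VMX and SVM for TDP MMU.
	 * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
	 * For VMX EPT, bit 63 is ignored if #VE is disabled (EPT_VIOLATION_VE=0),
	 *              and is the "suppress #VE" bit if #VE is enabled.
	 * For TDX, the TDX module always enables EPT_VIOLATION_VE, so KVM must
	 * keep "suppress #VE" set in non-present SPTEs (including when zapping)
	 * so that a guest access causes a real EPT violation rather than a #VE;
	 * only MMIO SPTEs deliberately clear the bit.
	 */
	#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)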

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 032/113] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask
  2023-01-12 16:31 ` [PATCH v11 032/113] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask isaku.yamahata
@ 2023-01-16 10:59   ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 10:59 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> To make use of the same value of shadow_mmio_mask for TDX and VMX, add
> Suppress-VE bit to shadow_mmio_mask so that shadow_mmio_mask can be common
> for both VMX and TDX.
> 
> TDX will need shadow_mmio_mask to be VMX_SUPPRESS_VE | RWX and
> shadow_mmio_value to be 0 so that EPT violation is triggered.  For VMX,
> VMX_SUPPRESS_VE doesn't matter because the spte value is required to cause
> EPT misconfig.  the additional bit doesn't affect VMX logic to add the bit
> to shadow_mmio_{value, mask}.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/spte.c | 6 ++++--
>  1 file changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index fce6f047399f..cc0bc058fb25 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -431,7 +431,9 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
>  	shadow_dirty_mask	= has_ad_bits ? VMX_EPT_DIRTY_BIT : 0ull;
>  	shadow_nx_mask		= 0ull;
>  	shadow_x_mask		= VMX_EPT_EXECUTABLE_MASK;
> -	shadow_present_mask	= has_exec_only ? 0ull : VMX_EPT_READABLE_MASK;
> +	/* VMX_EPT_SUPPRESS_VE_BIT is needed for W or X violation. */
> +	shadow_present_mask	=
> +		(has_exec_only ? 0ull : VMX_EPT_READABLE_MASK) | VMX_EPT_SUPPRESS_VE_BIT;

This chunk has nothing to do with what this patch claims to do.

>  	/*
>  	 * EPT overrides the host MTRRs, and so KVM must program the desired
>  	 * memtype directly into the SPTEs.  Note, this mask is just the mask
> @@ -448,7 +450,7 @@ void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only)
>  	 * of an EPT paging-structure entry is 110b (write/execute).
>  	 */
>  	kvm_mmu_set_mmio_spte_mask(VMX_EPT_MISCONFIG_WX_VALUE,
> -				   VMX_EPT_RWX_MASK, 0);
> +				   VMX_EPT_RWX_MASK | VMX_EPT_SUPPRESS_VE_BIT, 0);
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_ept_masks);
>  


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
  2023-01-12 16:31 ` [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis isaku.yamahata
@ 2023-01-16 11:16   ` Huang, Kai
  2023-02-27 21:58     ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 11:16 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: sean.j.christopherson, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX will use a different shadow PTE entry value for MMIO from VMX.  Add
> members to kvm_arch and track value for MMIO per-VM instead of global
> variables.  By using the per-VM EPT entry value for MMIO, the existing VMX
> logic is kept working.  Introduce a separate setter function so that guest
> TD can override later.

The guest TD itself cannot override.  It is KVM who overrides it for a TDX
guest.
 
> 
> Also require mmio spte cachcing for TDX.  Actually this is true case
			 ^
			 spell check.

> because TDX require EPT and KVM EPT allows mmio spte caching.

The second sentence doesn't make sense.  IIUC "TDX requires EPT + EPT _allows_
MMIO caching = TDX _allows_ MMIO caching", which is different from "TDX
_requires_ MMIO caching".

> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm_host.h |  2 ++
>  arch/x86/kvm/mmu.h              |  1 +
>  arch/x86/kvm/mmu/mmu.c          |  7 ++++---
>  arch/x86/kvm/mmu/spte.c         | 10 ++++++++--
>  arch/x86/kvm/mmu/spte.h         |  4 ++--
>  arch/x86/kvm/mmu/tdp_mmu.c      | 14 +++++++++++---
>  6 files changed, 28 insertions(+), 10 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 73c987b3d2b6..807da4b95aba 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1243,6 +1243,8 @@ struct kvm_arch {
>  	 */
>  	spinlock_t mmu_unsync_pages_lock;
>  
> +	u64 shadow_mmio_value;
> +
>  	struct list_head assigned_dev_head;
>  	struct iommu_domain *iommu_domain;
>  	bool iommu_noncoherent;
> diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
> index a45f7a96b821..50d240d52697 100644
> --- a/arch/x86/kvm/mmu.h
> +++ b/arch/x86/kvm/mmu.h
> @@ -101,6 +101,7 @@ static inline u8 kvm_get_shadow_phys_bits(void)
>  }
>  
>  void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
> +void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value);
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask);
>  void kvm_mmu_set_ept_masks(bool has_ad_bits, bool has_exec_only);
>  
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 59befdfeec23..8d3d7deebdd0 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -2450,7 +2450,7 @@ static int mmu_page_zap_pte(struct kvm *kvm, struct kvm_mmu_page *sp,
>  				return kvm_mmu_prepare_zap_page(kvm, child,
>  								invalid_list);
>  		}
> -	} else if (is_mmio_spte(pte)) {
> +	} else if (is_mmio_spte(kvm, pte)) {
>  		mmu_spte_clear_no_track(spte);
>  	}
>  	return 0;
> @@ -4119,7 +4119,7 @@ static int handle_mmio_page_fault(struct kvm_vcpu *vcpu, u64 addr, bool direct)
>  	if (WARN_ON(reserved))
>  		return -EINVAL;
>  
> -	if (is_mmio_spte(spte)) {
> +	if (is_mmio_spte(vcpu->kvm, spte)) {
>  		gfn_t gfn = get_mmio_spte_gfn(spte);
>  		unsigned int access = get_mmio_spte_access(spte);
>  
> @@ -4628,7 +4628,7 @@ static unsigned long get_cr3(struct kvm_vcpu *vcpu)
>  static bool sync_mmio_spte(struct kvm_vcpu *vcpu, u64 *sptep, gfn_t gfn,
>  			   unsigned int access)
>  {
> -	if (unlikely(is_mmio_spte(*sptep))) {
> +	if (unlikely(is_mmio_spte(vcpu->kvm, *sptep))) {
>  		if (gfn != get_mmio_spte_gfn(*sptep)) {
>  			mmu_spte_clear_no_track(sptep);
>  			return true;
> @@ -6111,6 +6111,7 @@ int kvm_mmu_init_vm(struct kvm *kvm)
>  	struct kvm_page_track_notifier_node *node = &kvm->arch.mmu_sp_tracker;
>  	int r;
>  
> +	kvm->arch.shadow_mmio_value = shadow_mmio_value;
>  	INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
>  	INIT_LIST_HEAD(&kvm->arch.zapped_obsolete_pages);
>  	INIT_LIST_HEAD(&kvm->arch.possible_nx_huge_pages);
> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index cc0bc058fb25..a23e9205fc42 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -74,10 +74,10 @@ u64 make_mmio_spte(struct kvm_vcpu *vcpu, u64 gfn, unsigned int access)
>  	u64 spte = generation_mmio_spte_mask(gen);
>  	u64 gpa = gfn << PAGE_SHIFT;
>  
> -	WARN_ON_ONCE(!shadow_mmio_value);
> +	WARN_ON_ONCE(!vcpu->kvm->arch.shadow_mmio_value);
>  
>  	access &= shadow_mmio_access_mask;
> -	spte |= shadow_mmio_value | access;
> +	spte |= vcpu->kvm->arch.shadow_mmio_value | access;
>  	spte |= gpa | shadow_nonpresent_or_rsvd_mask;
>  	spte |= (gpa & shadow_nonpresent_or_rsvd_mask)
>  		<< SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
> @@ -413,6 +413,12 @@ void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask)
>  }
>  EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_mask);
>  
> +void kvm_mmu_set_mmio_spte_value(struct kvm *kvm, u64 mmio_value)
> +{
> +	kvm->arch.shadow_mmio_value = mmio_value;
> +}
> +EXPORT_SYMBOL_GPL(kvm_mmu_set_mmio_spte_value);
> +
>  void kvm_mmu_set_me_spte_mask(u64 me_value, u64 me_mask)
>  {
>  	/* shadow_me_value must be a subset of shadow_me_mask */
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 471378ee9071..256395eb593f 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -251,9 +251,9 @@ static inline struct kvm_mmu_page *sptep_to_sp(u64 *sptep)
>  	return to_shadow_page(__pa(sptep));
>  }
>  
> -static inline bool is_mmio_spte(u64 spte)
> +static inline bool is_mmio_spte(struct kvm *kvm, u64 spte)
>  {
> -	return (spte & shadow_mmio_mask) == shadow_mmio_value &&
> +	return (spte & shadow_mmio_mask) == kvm->arch.shadow_mmio_value &&
>  	       likely(enable_mmio_caching);
>  }
>  
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 6111e3e9266d..dffacb7eb15a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -19,6 +19,14 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  {
>  	struct workqueue_struct *wq;
>  
> +	/*
> +	 * TDs require mmio_caching to clear suppress_ve bit of SPTE for GPA
> +	 * of MMIO so that TD can convert #VE triggered by MMIO into
> +	 * TDG.VP.VMCALL<MMIO>.
> +	 */
> +	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !enable_mmio_caching)
> +		return -EOPNOTSUPP;

SEV-ES does the check in hardware_setup:

void __init sev_hardware_setup(void)
{
	...
	/*
         * SEV-ES requires MMIO caching as KVM doesn't have access to the guest
         * instruction stream, i.e. can't emulate in response to a #NPF and
         * instead relies on #NPF(RSVD) being reflected into the guest as #VC
         * (the guest can then do a #VMGEXIT to request MMIO emulation).
         */
        if (!enable_mmio_caching)
                goto out;

	...
}

TDX should be done in the same way.
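
E.g. something along these lines in tdx_hardware_setup() (sketch only, and
assuming enable_mmio_caching is visible there the same way it is to SEV-ES):

	/*
	 * TDX relies on MMIO caching: the "suppress #VE" bit is cleared only
	 * in MMIO SPTEs so the guest gets #VE and converts the access into
	 * TDG.VP.VMCALL<MMIO>.
	 */
	if (!enable_mmio_caching)
		return -EOPNOTSUPP;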

And IMO this chunk really doesn't belong to this patch -- I interpret this
patch as an "infrastructure patch to track the shadow MMIO value on a per-VM
basis" (which IMHO shouldn't even have any functional change), but this chunk
is clearly doing more than that.

> +
>  	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
>  		return 0;
>  
> @@ -587,8 +595,8 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  		 * impact the guest since both the former and current SPTEs
>  		 * are nonpresent.
>  		 */
> -		if (WARN_ON(!is_mmio_spte(old_spte) &&
> -			    !is_mmio_spte(new_spte) &&
> +		if (WARN_ON(!is_mmio_spte(kvm, old_spte) &&
> +			    !is_mmio_spte(kvm, new_spte) &&
>  			    !is_removed_spte(new_spte)))
>  			pr_err("Unexpected SPTE change! Nonpresent SPTEs\n"
>  			       "should not be replaced with another,\n"
> @@ -1114,7 +1122,7 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
>  	}
>  
>  	/* If a MMIO SPTE is installed, the MMIO will need to be emulated. */
> -	if (unlikely(is_mmio_spte(new_spte))) {
> +	if (unlikely(is_mmio_spte(vcpu->kvm, new_spte))) {
>  		vcpu->stat.pf_mmio_spte_created++;
>  		trace_mark_mmio_spte(rcu_dereference(iter->sptep), iter->gfn,
>  				     new_spte);


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 035/113] KVM: x86/mmu: Allow per-VM override of the TDP max page level
  2023-01-12 16:31 ` [PATCH v11 035/113] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
@ 2023-01-16 11:29   ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-16 11:29 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: sean.j.christopherson, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
> 
> TDX requires special handling to support large private page.  For
> simplicity, only support 4K page for TD guest for now.  Add per-VM maximum
> page level support to support different maximum page sizes for TD guest and
> conventional VMX guest.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>

Acked-by: Kai Huang <kai.huang@intel.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization
  2023-01-12 16:31 ` [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
@ 2023-01-16 16:07   ` Zhi Wang
  2023-02-28 11:17     ` Isaku Yamahata
  2023-01-19 10:37   ` Huang, Kai
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-16 16:07 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

On Thu, 12 Jan 2023 08:31:32 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TD guest vcpu need to be configured before ready to run which requests
> addtional information from Device model (e.g. qemu), one 64bit value is
> passed to vcpu's RCX as an initial value.  Repurpose KVM_MEMORY_ENCRYPT_OP
> to vcpu-scope and add new sub-commands KVM_TDX_INIT_VCPU under it for such
> additional vcpu configuration.
> 

Better to add more details about this mystery value to save review effort.

For example, refining the above part as:

----

TD hands-off block(HOB) is used to pass the information from VMM to
TD virtual firmware(TDVF). Before KVM calls Intel TDX module to launch
TDVF, the address of HOB must be placed in the guest RCX.

Extend KVM_MEMORY_ENCRYPT_OP to vcpu-scope and add new... so that
TDH.VP.INIT can take the address of HOB from QEMU and place it in the
guest RCX when initializing a TDX vCPU.

----

The paragraph below seems to repeat the end of the first paragraph. I guess it
can be refined or removed.


> Add callback for kvm vCPU-scoped operations of KVM_MEMORY_ENCRYPT_OP and
> add a new subcommand, KVM_TDX_INIT_VCPU, for further vcpu initialization.
>


PS: I am curious whether the guest RCX value will be configured differently for
each vCPU? (It seems the values are the same according to the tdx-qemu code.)

If they are indeed the same, then the per-vCPU ioctl is just a vehicle for
passing the value (even if it goes through TDH.VP.XXX), and it should be
configured at the VM level in KVM. The TDX vCPU creation and initialization
could then be moved into tdx_vcpu_create(), and TDH.VP.INIT could take the
value from a per-VM data structure.
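
Roughly (untested sketch; 'hob_gpa' is a made-up field name in struct kvm_tdx):

	/* Set once at VM scope, e.g. via KVM_TDX_INIT_VM. */
	err = tdh_vp_init(tdx->tdvpr_pa, to_kvm_tdx(vcpu->kvm)->hob_gpa);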
 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h    |   1 +
>  arch/x86/include/asm/kvm_host.h       |   1 +
>  arch/x86/include/uapi/asm/kvm.h       |   1 +
>  arch/x86/kvm/vmx/main.c               |   9 ++
>  arch/x86/kvm/vmx/tdx.c                | 147 +++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/tdx.h                |   7 ++
>  arch/x86/kvm/vmx/x86_ops.h            |  10 +-
>  arch/x86/kvm/x86.c                    |   6 ++
>  tools/arch/x86/include/uapi/asm/kvm.h |   1 +
>  9 files changed, 178 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 1a27f3aee982..e3e9b1c2599b 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -123,6 +123,7 @@ KVM_X86_OP(enable_smi_window)
>  #endif
>  KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
>  KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> +KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
>  KVM_X86_OP_OPTIONAL(mem_enc_register_region)
>  KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
>  KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 30f4ddb18548..35773f925cc5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1698,6 +1698,7 @@ struct kvm_x86_ops {
>  
>  	int (*dev_mem_enc_ioctl)(void __user *argp);
>  	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
> +	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
>  	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
>  	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
>  	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index b8f28d86d4fd..9236c1699c48 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -536,6 +536,7 @@ struct kvm_pmu_event_filter {
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
>  	KVM_TDX_INIT_VM,
> +	KVM_TDX_INIT_VCPU,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 59813ca05f36..23b3ffc3fe23 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -103,6 +103,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  	return tdx_vm_ioctl(kvm, argp);
>  }
>  
> +static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> +{
> +	if (!is_td_vcpu(vcpu))
> +		return -EINVAL;
> +
> +	return tdx_vcpu_ioctl(vcpu, argp);
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = KBUILD_MODNAME,
>  
> @@ -249,6 +257,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  
>  	.dev_mem_enc_ioctl = tdx_dev_ioctl,
>  	.mem_enc_ioctl = vt_mem_enc_ioctl,
> +	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
>  };
>  
>  struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 099f0737a5aa..e2f5a07ad4e5 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -49,6 +49,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
>  	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
>  }
>  
> +static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> +{
> +	return tdx->tdvpr_pa;
> +}
> +
>  static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
>  {
>  	return kvm_tdx->tdr_pa;
> @@ -65,6 +70,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
>  	return kvm_tdx->hkid > 0;
>  }
>  
> +static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->finalized;
> +}
> +
>  static void tdx_clear_page(unsigned long page_pa)
>  {
>  	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -327,7 +337,21 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>  {
> -	/* This is stub for now.  More logic will come. */
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	int i;
> +
> +	/* Can't reclaim or free pages if teardown failed. */
> +	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
> +		return;
> +

Should we have a WARN_ON_ONCE here?

> +	if (tdx->tdvpx_pa) {
> +		for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> +			tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
> +		kfree(tdx->tdvpx_pa);
> +		tdx->tdvpx_pa = NULL;
> +	}
> +	tdx_reclaim_td_page(tdx->tdvpr_pa);
> +	tdx->tdvpr_pa = 0;
>  }
>  
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> @@ -337,6 +361,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	/* TDX doesn't support INIT event. */
>  	if (WARN_ON_ONCE(init_event))
>  		goto td_bugged;
> +	if (WARN_ON_ONCE(is_td_vcpu_created(to_tdx(vcpu))))
> +		goto td_bugged;
>  
>  	/* TDX rquires X2APIC. */
>  	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> @@ -791,6 +817,125 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  	return r;
>  }
>  
> +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	unsigned long *tdvpx_pa = NULL;
> +	unsigned long tdvpr_pa;
> +	unsigned long va;
> +	int ret, i;
> +	u64 err;
> +
> +	if (is_td_vcpu_created(tdx))
> +		return -EINVAL;
> +
> +	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!va)
> +		return -ENOMEM;
> +	tdvpr_pa = __pa(va);
> +
> +	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
> +			   GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> +	if (!tdvpx_pa) {
> +		ret = -ENOMEM;
> +		goto free_tdvpr;
> +	}
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> +		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +		if (!va)
> +			goto free_tdvpx;
> +		tdvpx_pa[i] = __pa(va);
> +	}
> +
> +	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> +	if (WARN_ON_ONCE(err)) {
> +		ret = -EIO;
> +		pr_tdx_error(TDH_VP_CREATE, err, NULL);
> +		goto td_bugged_free_tdvpx;
> +	}
> +	tdx->tdvpr_pa = tdvpr_pa;
> +
> +	tdx->tdvpx_pa = tdvpx_pa;
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> +		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> +		if (WARN_ON_ONCE(err)) {
> +			ret = -EIO;
> +			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> +			for (; i < tdx_caps.tdvpx_nr_pages; i++) {
> +				free_page((unsigned long)__va(tdvpx_pa[i]));
> +				tdvpx_pa[i] = 0;
> +			}
> +			goto td_bugged;
> +		}
> +	}
> +
> +	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> +	if (WARN_ON_ONCE(err)) {
> +		ret = -EIO;
> +		pr_tdx_error(TDH_VP_INIT, err, NULL);
> +		goto td_bugged;
> +	}
> +
> +	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +
> +	return 0;
> +
> +td_bugged_free_tdvpx:
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> +		free_page((unsigned long)__va(tdvpx_pa[i]));
> +		tdvpx_pa[i] = 0;
> +	}
> +	kfree(tdvpx_pa);
> +td_bugged:
> +	vcpu->kvm->vm_bugged = true;
> +	return ret;
> +
> +free_tdvpx:
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> +		if (tdvpx_pa[i])
> +			free_page((unsigned long)__va(tdvpx_pa[i]));
> +	kfree(tdvpx_pa);
> +	tdx->tdvpx_pa = NULL;
> +free_tdvpr:
> +	if (tdvpr_pa)
> +		free_page((unsigned long)__va(tdvpr_pa));
> +	tdx->tdvpr_pa = 0;
> +
> +	return ret;
> +}

Same comments as for the use of vm_bugged in the previous patch.

> +
> +int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	struct kvm_tdx_cmd cmd;
> +	int ret;
> +
> +	if (tdx->vcpu_initialized)
> +		return -EINVAL;
> +
> +	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> +		return -EINVAL;
> +
> +	if (copy_from_user(&cmd, argp, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	if (cmd.error || cmd.unused)
> +		return -EINVAL;
> +
> +	/* Currently only KVM_TDX_INIT_VCPU is defined for vcpu operation. */
> +	if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
> +		return -EINVAL;
> +
> +	ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
> +	if (ret)
> +		return ret;
> +
> +	tdx->vcpu_initialized = true;
> +	return 0;
> +}
> +
>  static int __init tdx_module_setup(void)
>  {
>  	const struct tdsysinfo_struct *tdsysinfo;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index af7fdc1516d5..e909883d60fa 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -17,12 +17,19 @@ struct kvm_tdx {
>  	u64 xfam;
>  	int hkid;
>  
> +	bool finalized;
> +
>  	u64 tsc_offset;
>  };
>  
>  struct vcpu_tdx {
>  	struct kvm_vcpu	vcpu;
>  
> +	unsigned long tdvpr_pa;
> +	unsigned long *tdvpx_pa;
> +
> +	bool vcpu_initialized;
> +
>  	/*
>  	 * Dummy to make pmu_intel not corrupt memory.
>  	 * TODO: Support PMU for TDX.  Future work.
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 37ab2cfd35bc..fba8d0800597 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -148,11 +148,12 @@ int tdx_vm_init(struct kvm *kvm);
>  void tdx_mmu_release_hkid(struct kvm *kvm);
>  void tdx_vm_free(struct kvm *kvm);
>  
> -int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> -
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +
> +int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> +int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
>  #else
>  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
>  static inline void tdx_hardware_unsetup(void) {}
> @@ -165,11 +166,12 @@ static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
>  static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
>  static inline void tdx_vm_free(struct kvm *kvm) {}
>  
> -static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> -
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +
> +static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> +static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>  #endif
>  
>  #endif /* __KVM_X86_VMX_X86_OPS_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e8bc66031a1d..d548d3af6428 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5976,6 +5976,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  	case KVM_SET_DEVICE_ATTR:
>  		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
>  		break;
> +	case KVM_MEMORY_ENCRYPT_OP:
> +		r = -ENOTTY;
> +		if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
> +			goto out;
> +		r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
> +		break;
>  	default:
>  		r = -EINVAL;
>  	}
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index eb800965b589..6971f1288043 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
>  	KVM_TDX_INIT_VM,
> +	KVM_TDX_INIT_VCPU,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range
  2023-01-12 16:32 ` [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range isaku.yamahata
@ 2023-01-17  2:06   ` Huang, Kai
  2023-01-17 16:53     ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-17  2:06 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:32 -0800, isaku.yamahata@intel.com wrote:
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -244,7 +244,7 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
>  {
>  	int ret = -ENOTSUPP;
>  
> -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> +	if (range && kvm_available_flush_tlb_with_range())
>  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);

Again, IMHO this code change doesn't make the code any clearer.  With the new
code, I need to go into kvm_available_flush_tlb_with_range() to see what's going
on, but with the old code I don't.

That being said, I think kvm_available_flush_tlb_with_range() is sort of
redundant but I can also understand callers don't want to just check whether the
callback is valid.

Btw, I have some memory of commenting on this before in some old version
(hence the 'Again' in my reply), but I failed to dig it out -- partially because
in some old versions (<= v7) I had no clue which patch to look at just by
looking at the patch title.


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2023-01-12 16:31 ` [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
@ 2023-01-17  2:40   ` Huang, Kai
  2023-02-27 22:00     ` Isaku Yamahata
  2023-02-17  8:27   ` Zhi Wang
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-17  2:40 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Some KVM MMU operations (dirty page logging, page migration, aging page)
> aren't supported for private GFNs (yet) with the first generation of TDX.
> Silently return on unsupported TDX KVM MMU operations.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>


You already have previous patches to do similar things:

[PATCH v11 034/113] KVM: x86/mmu: Disallow fast page fault on private GPA
[PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported
cases
[PATCH v11 048/113] KVM: x86/mmu: Disallow dirty logging for x86 TDX
[PATCH v11 049/113] KVM: x86/mmu: TDX: Do not enable page track for TD guest

Now you have this patch:

[PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on
private GFNs

They are very confusing to me.  Those previous patches are all "unsupported
operations", correct? 

For instance, this patch says "dirty page logging isn't supported for private
GFNs" (and why there's a 'yet' after it?), so based on the patch title my
understanding is you are going to _ignore_ "dirty page logging".  But you
already have a previous patch to "Disallow dirty logging for x86 TDX".  

Shouldn't the two be in the same patch?  Or were you trying to highlight the
difference between "x86/mmu" and "x86/tdp_mmu"?

Please try to make the whole thing more clear.  My first thought is that, if it
were me, I would probably have _ONE_ dedicated patch for _EACH_ unsupported
operation, and make it very clear in the patch title.  But you may have your own
way to make things clearer.

[snip]

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2023-01-12 16:32 ` [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX isaku.yamahata
@ 2023-01-17  3:11   ` Huang, Kai
  2023-02-27 23:30     ` Isaku Yamahata
  2023-02-03  6:55   ` Yuan Yao
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-17  3:11 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:32 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Although TDX supports only WB for private GPA, MTRR/PAT for shared GPA
> should be supported. Implement get_mt_mask() following vmx case.

So far this is the first patch to handle MTRR/PAT, and absolutely no
background has been explained.

So what about MTRR/PAT related MSRs handling?  No code needed to handle?

I was expecting there should be at least some words here to explain how TDX
handles them, and if no handling is required in KVM, why.

W/o those, I don't think this patch is reviewable.  

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 10 +++++++++-
>  arch/x86/kvm/vmx/tdx.c     | 19 +++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h |  2 ++
>  3 files changed, 30 insertions(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 770d1b29d1c3..4319f6d7a4da 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -158,6 +158,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
>  }
>  
> +static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_mt_mask(vcpu, gfn, is_mmio);
> +
> +	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
> +}
> +
>  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	if (!is_td(kvm))
> @@ -267,7 +275,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.set_identity_map_addr = vmx_set_identity_map_addr,
> -	.get_mt_mask = vmx_get_mt_mask,
> +	.get_mt_mask = vt_get_mt_mask,
>  
>  	.get_exit_info = vmx_get_exit_info,
>  
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e68816999387..c4c5a8f786c1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -309,6 +309,25 @@ int tdx_vm_init(struct kvm *kvm)
>  	return 0;
>  }
>  
> +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	/* TDX private GPA is always WB. */
> +	if (gfn & kvm_gfn_shared_mask(vcpu->kvm)) {

First of all, a private GPA doesn't have the 'shared bit' set, so the comment
doesn't reflect the code.

Secondly (and again), IIUC the shared bit of the gfn was stripped out a long
time ago, so this is incorrect.

Please don't silently ignore other people's comments:

https://lore.kernel.org/lkml/Y19NzlQcwhV%2F2wl3@debian.me/T/#mf319d5b718519709362f9f094bfc5b53fd870241



> +		WARN_ON_ONCE(is_mmio);
> +		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;

Shouldn't you also include VMX_EPT_IPAT_BIT?  I lost your logic..

> +	}
> +
> +	if (is_mmio)
> +		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> +
> +	/*
> +	 * Device assignment without VT-d snooping capability with shared-GPA
> +	 * is dubious.
> +	 */
> +	WARN_ON_ONCE(kvm_arch_has_noncoherent_dma(vcpu->kvm));

Is there any code to reject such a case at the beginning, for instance, to not
allow assigning a device in that case?

> +	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +}
> +
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_cpuid_entry2 *e;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 8ae689929347..d903e0f606d3 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -154,6 +154,7 @@ void tdx_vm_free(struct kvm *kvm);
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>  
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -176,6 +177,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
>  
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 005/113] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs
  2023-01-12 16:31 ` [PATCH v11 005/113] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
@ 2023-01-17  3:31   ` Binbin Wu
  0 siblings, 0 replies; 221+ messages in thread
From: Binbin Wu @ 2023-01-17  3:31 UTC (permalink / raw)
  To: isaku.yamahata, kvm, linux-kernel
  Cc: isaku.yamahata, Paolo Bonzini, erdemaktas, Sean Christopherson,
	Sagi Shahar, David Matlack, Sean Christopherson, Xiaoyao Li


On 1/13/2023 12:31 AM, isaku.yamahata@intel.com wrote:
> From: Sean Christopherson <sean.j.christopherson@intel.com>
>
> Unlike default VMs, confidential VMs (Intel TDX and AMD SEV-ES) don't allow
> some operations (e.g., memory read/write, register state access, etc).
>
> Introduce vm_type to track the type of the VM to x86 KVM.  Other arch KVMs
> already use vm_type, KVM_INIT_VM accepts vm_type, and x86 KVM callback
> vm_init accepts vm_type.  So follow them.  Further, a different policy can
> be made based on vm_type.  Define KVM_X86_DEFAULT_VM for default VM as
> default and define KVM_X86_TDX_VM for Intel TDX VM.  The wrapper function
> will be defined as "bool is_td(kvm) { return vm_type == VM_TYPE_TDX; }"
>
> Add a capability KVM_CAP_VM_TYPES to effectively allow device model,
> e.g. qemu, to query what VM types are supported by KVM.  This (introduce a
> new capability and add vm_type) is chosen to align with other arch KVMs
> that have VM types already.  Other arch KVMs uses

uses -> use


> different name
name -> names
>   to query
> supported vm types and there is no common name for it, so new name was
> chosen.
>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com>
> ---
>   Documentation/virt/kvm/api.rst        | 21 +++++++++++++++++++++
>   arch/x86/include/asm/kvm-x86-ops.h    |  1 +
>   arch/x86/include/asm/kvm_host.h       |  2 ++
>   arch/x86/include/uapi/asm/kvm.h       |  3 +++
>   arch/x86/kvm/svm/svm.c                |  6 ++++++
>   arch/x86/kvm/vmx/main.c               |  1 +
>   arch/x86/kvm/vmx/tdx.h                |  6 +-----
>   arch/x86/kvm/vmx/vmx.c                |  5 +++++
>   arch/x86/kvm/vmx/x86_ops.h            |  1 +
>   arch/x86/kvm/x86.c                    |  9 ++++++++-
>   include/uapi/linux/kvm.h              |  1 +
>   tools/arch/x86/include/uapi/asm/kvm.h |  3 +++
>   tools/include/uapi/linux/kvm.h        |  1 +
>   13 files changed, 54 insertions(+), 6 deletions(-)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 98459999273c..d2baa05f7c04 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -147,10 +147,31 @@ described as 'basic' will be available.
>   The new VM has no virtual cpus and no memory.
>   You probably want to use 0 as machine type.
>   
> +X86:
> +^^^^
> +
> +Supported vm type can be queried from KVM_CAP_VM_TYPES, which returns the
> +bitmap of supported vm types. The 1-setting of bit @n means vm type with
> +value @n is supported.
> +
> +S390:
> +^^^^^
> +
>   In order to create user controlled virtual machines on S390, check
>   KVM_CAP_S390_UCONTROL and use the flag KVM_VM_S390_UCONTROL as
>   privileged user (CAP_SYS_ADMIN).
>   
> +MIPS:
> +^^^^^
> +
> +To use hardware assisted virtualization on MIPS (VZ ASE) rather than
> +the default trap & emulate implementation (which changes the virtual
> +memory layout to fit in user mode), check KVM_CAP_MIPS_VZ and use the
> +flag KVM_VM_MIPS_VZ.
> +
> +ARM64:
> +^^^^^^
> +
>   On arm64, the physical address size for a VM (IPA Size limit) is limited
>   to 40bits by default. The limit can be configured if the host supports the
>   extension KVM_CAP_ARM_VM_IPA_SIZE. When supported, use
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index dba2909e5ae2..59181b12ad70 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -20,6 +20,7 @@ KVM_X86_OP(hardware_disable)
>   KVM_X86_OP(hardware_unsetup)
>   KVM_X86_OP(has_emulated_msr)
>   KVM_X86_OP(vcpu_after_set_cpuid)
> +KVM_X86_OP(is_vm_type_supported)
>   KVM_X86_OP(vm_init)
>   KVM_X86_OP_OPTIONAL(vm_destroy)
>   KVM_X86_OP_OPTIONAL_RET0(vcpu_precreate)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 60dc8f1631de..c6ccfce7dc9e 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1212,6 +1212,7 @@ enum kvm_apicv_inhibit {
>   };
>   
>   struct kvm_arch {
> +	unsigned long vm_type;
>   	unsigned long n_used_mmu_pages;
>   	unsigned long n_requested_mmu_pages;
>   	unsigned long n_max_mmu_pages;
> @@ -1536,6 +1537,7 @@ struct kvm_x86_ops {
>   	bool (*has_emulated_msr)(struct kvm *kvm, u32 index);
>   	void (*vcpu_after_set_cpuid)(struct kvm_vcpu *vcpu);
>   
> +	bool (*is_vm_type_supported)(unsigned long vm_type);
>   	unsigned int vm_size;
>   	int (*vm_init)(struct kvm *kvm);
>   	void (*vm_destroy)(struct kvm *kvm);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index e48deab8901d..a4cca6bc6b06 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -529,4 +529,7 @@ struct kvm_pmu_event_filter {
>   #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
>   #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
>   
> +#define KVM_X86_DEFAULT_VM	0
> +#define KVM_X86_TDX_VM		1
> +
>   #endif /* _ASM_X86_KVM_H */
> diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
> index 799b24801d31..55f2e0a9b0f6 100644
> --- a/arch/x86/kvm/svm/svm.c
> +++ b/arch/x86/kvm/svm/svm.c
> @@ -4682,6 +4682,11 @@ static void svm_vm_destroy(struct kvm *kvm)
>   	sev_vm_destroy(kvm);
>   }
>   
> +static bool svm_is_vm_type_supported(unsigned long type)
> +{
> +	return type == KVM_X86_DEFAULT_VM;
> +}
> +
>   static int svm_vm_init(struct kvm *kvm)
>   {
>   	if (!pause_filter_count || !pause_filter_thresh)
> @@ -4710,6 +4715,7 @@ static struct kvm_x86_ops svm_x86_ops __initdata = {
>   	.vcpu_free = svm_vcpu_free,
>   	.vcpu_reset = svm_vcpu_reset,
>   
> +	.is_vm_type_supported = svm_is_vm_type_supported,
>   	.vm_size = sizeof(struct kvm_svm),
>   	.vm_init = svm_vm_init,
>   	.vm_destroy = svm_vm_destroy,
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index f5d1166d2718..3b24e32077d6 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -34,6 +34,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>   	.hardware_disable = vmx_hardware_disable,
>   	.has_emulated_msr = vmx_has_emulated_msr,
>   
> +	.is_vm_type_supported = vmx_is_vm_type_supported,
>   	.vm_size = sizeof(struct kvm_vmx),
>   	.vm_init = vmx_vm_init,
>   	.vm_destroy = vmx_vm_destroy,
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index 060bf48ec3d6..473013265bd8 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -15,11 +15,7 @@ struct vcpu_tdx {
>   
>   static inline bool is_td(struct kvm *kvm)
>   {
> -	/*
> -	 * TDX VM type isn't defined yet.
> -	 * return kvm->arch.vm_type == KVM_X86_TDX_VM;
> -	 */
> -	return false;
> +	return kvm->arch.vm_type == KVM_X86_TDX_VM;
>   }
>   
>   static inline bool is_td_vcpu(struct kvm_vcpu *vcpu)
> diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> index 5dc7687dcf16..f1dea386d6c2 100644
> --- a/arch/x86/kvm/vmx/vmx.c
> +++ b/arch/x86/kvm/vmx/vmx.c
> @@ -7501,6 +7501,11 @@ int vmx_vcpu_create(struct kvm_vcpu *vcpu)
>   	return err;
>   }
>   
> +bool vmx_is_vm_type_supported(unsigned long type)
> +{
> +	return type == KVM_X86_DEFAULT_VM;
> +}
> +
>   #define L1TF_MSG_SMT "L1TF CPU bug present and SMT on, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>   #define L1TF_MSG_L1D "L1TF CPU bug present and virtualization mitigation disabled, data leak possible. See CVE-2018-3646 and https://www.kernel.org/doc/html/latest/admin-guide/hw-vuln/l1tf.html for details.\n"
>   
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index fbc57fcbdd21..6980126bc32a 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -32,6 +32,7 @@ void vmx_hardware_unsetup(void);
>   int vmx_check_processor_compat(void);
>   int vmx_hardware_enable(void);
>   void vmx_hardware_disable(void);
> +bool vmx_is_vm_type_supported(unsigned long type);
>   int vmx_vm_init(struct kvm *kvm);
>   void vmx_vm_destroy(struct kvm *kvm);
>   int vmx_vcpu_precreate(struct kvm *kvm);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 07e8ab791e37..68bff699096a 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -4535,6 +4535,11 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>   	case KVM_CAP_X86_NOTIFY_VMEXIT:
>   		r = kvm_caps.has_notify_vmexit;
>   		break;
> +	case KVM_CAP_VM_TYPES:
> +		r = BIT(KVM_X86_DEFAULT_VM);
> +		if (static_call(kvm_x86_is_vm_type_supported)(KVM_X86_TDX_VM))
> +			r |= BIT(KVM_X86_TDX_VM);
> +		break;
>   	default:
>   		break;
>   	}
> @@ -12126,9 +12131,11 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
>   	int ret;
>   	unsigned long flags;
>   
> -	if (type)
> +	if (!static_call(kvm_x86_is_vm_type_supported)(type))
>   		return -EINVAL;
>   
> +	kvm->arch.vm_type = type;
> +
>   	ret = kvm_page_track_init(kvm);
>   	if (ret)
>   		goto out;
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 679d293ece0f..2a47fd0e51fd 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1212,6 +1212,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>   #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
>   #define KVM_CAP_MEMORY_ATTRIBUTES 226
> +#define KVM_CAP_VM_TYPES 227
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index 649e50a8f9dd..b67d2d59eb6c 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -524,4 +524,7 @@ struct kvm_pmu_event_filter {
>   #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
>   #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
>   
> +#define KVM_X86_DEFAULT_VM	0
> +#define KVM_X86_TDX_VM		1
> +
>   #endif /* _ASM_X86_KVM_H */
> diff --git a/tools/include/uapi/linux/kvm.h b/tools/include/uapi/linux/kvm.h
> index 20522d4ba1e0..792a4889d1f4 100644
> --- a/tools/include/uapi/linux/kvm.h
> +++ b/tools/include/uapi/linux/kvm.h
> @@ -1175,6 +1175,7 @@ struct kvm_ppc_resize_hpt {
>   #define KVM_CAP_DIRTY_LOG_RING_ACQ_REL 223
>   #define KVM_CAP_S390_PROTECTED_ASYNC_DISABLE 224
>   #define KVM_CAP_DIRTY_LOG_RING_WITH_BITMAP 225
> +#define KVM_CAP_VM_TYPES 227
>   
>   #ifdef KVM_CAP_IRQ_ROUTING
>   

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters
  2023-01-12 16:31 ` [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
  2023-01-13 14:58   ` Zhi Wang
  2023-01-16 10:04   ` Huang, Kai
@ 2023-01-17 12:19   ` Huang, Kai
  2023-02-27 21:44     ` Isaku Yamahata
  2 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-17 12:19 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: Li, Xiaoyao, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean


[snip]

> +/*
> + * cpuid entry lookup in TDX cpuid config way.
> + * The difference is how to specify index(subleaves).

AFAICT you only have one caller here.  If this is the only difference, will it
be simpler to ask the caller to simply convert TDX_CPUID_NO_SUBLEAF to 0, so this
function can perhaps be removed?
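
A hedged sketch of what the call site could then look like (the plain lookup
helper below is a hypothetical name, only to illustrate the idea):

	u32 index = config->sub_leaf == TDX_CPUID_NO_SUBLEAF ? 0 : config->sub_leaf;
	const struct kvm_cpuid_entry2 *entry =
		find_cpuid_entry(cpuid, config->leaf, index);	/* hypothetical helper */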

> + * Specify index to TDX_CPUID_NO_SUBLEAF for CPUID leaf with no-subleaves.
> + */
> +static const struct kvm_cpuid_entry2 *tdx_find_cpuid_entry(const struct kvm_cpuid2 *cpuid,
> +							   u32 function, u32 index)
> +{
> +	int i;
> +
> +	/* In TDX CPU CONFIG, TDX_CPUID_NO_SUBLEAF means index = 0. */
		  ^
		  CPUID_CONFIG please.

> +	if (index == TDX_CPUID_NO_SUBLEAF)
> +		index = 0;
> +
> +	for (i = 0; i < cpuid->nent; i++) {
> +		const struct kvm_cpuid_entry2 *e = &cpuid->entries[i];
> +
> +		if (e->function == function &&
> +		    (e->index == index ||
> +		     !(e->flags & KVM_CPUID_FLAG_SIGNIFCANT_INDEX)))
> +			return e;
> +	}
> +	return NULL;
> +}
> +
> +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> +			struct kvm_tdx_init_vm *init_vm)
> +{
> +	const struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
> +	const struct kvm_cpuid_entry2 *entry;
> +	u64 guest_supported_xcr0;
> +	u64 guest_supported_xss;
> +	int max_pa;
> +	int i;
> +
> +	if (kvm->created_vcpus)
> +		return -EBUSY;
> +	td_params->max_vcpus = kvm->max_vcpus;
> +	td_params->attributes = init_vm->attributes;
> +	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> +		/*
> +		 * TODO: save/restore PMU related registers around TDENTER.
> +		 * Once it's done, remove this guard.
> +		 */
> +		pr_warn("TD doesn't support perfmon yet. KVM needs to save/restore "
> +			"host perf registers properly.\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
> +		const struct tdx_cpuid_config *config = &tdx_caps.cpuid_configs[i];
> +		const struct kvm_cpuid_entry2 *entry =
> +			tdx_find_cpuid_entry(cpuid, config->leaf, config->sub_leaf);
> +		struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
> +
> +		if (!entry)
> +			continue;
> +
> +		value->eax = entry->eax & config->eax;
> +		value->ebx = entry->ebx & config->ebx;
> +		value->ecx = entry->ecx & config->ecx;
> +		value->edx = entry->edx & config->edx;
> +	}

A comment to explain the above would be helpful, i.e. TDX requires the entries
in TD_PARAMS's cpuid_values[] to be in the same number and order as TDSYSINFO's
CPUID_CONFIG array.

Also, this code depends on @td_params already being zeroed.  Perhaps also point
it out.
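
A sketch of the kind of comment being asked for (wording is illustrative, not
the author's):

	/*
	 * TDX expects cpuid_values[] to have the same number and order of
	 * entries as the TDSYSINFO CPUID_CONFIG array.  Leaves that userspace
	 * didn't provide are left as zero, which relies on @td_params having
	 * been zero-allocated by the caller.
	 */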
 

[snip]

> +static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> +{
> 
[snip]

> +
> +	ret = setup_tdparams(kvm, td_params, init_vm);
> +	if (ret)
> +		goto out;
> +
> +	ret = __tdx_td_init(kvm, td_params);
> +	if (ret)
> +		goto out;
> +
> +	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
> +	kvm_tdx->attributes = td_params->attributes;
> +	kvm_tdx->xfam = td_params->xfam;
> +
> +out:
> +	/* kfree() accepts NULL. */
> +	kfree(init_vm);
> +	kfree(td_params);

So it looks like KVM doesn't save the CPUID configurations that are passed to the TDX module.

IIUC, KVM still depends on userspace to later use KVM_SET_CPUID2 to fill the
_same_ CPUID entries for each vcpu?  If so, what if userspace didn't provide
consistent CPUIDs in KVM_SET_CPUID2?  Should we verify in KVM_SET_CPUID2 that
CPUIDs are consistent?

I am thinking that if some #VE handling requires CPUID to make a decision, then
inconsistent CPUIDs will cause trouble, but I don't have an example right now.
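
If such a check were added, a minimal sketch could look like the below (all of
the helper and field names are hypothetical; KVM would also need to keep a copy
of the configured CPUID entries, which it currently frees as noted above):

	static int tdx_check_cpuid_consistency(struct kvm_vcpu *vcpu,
					       const struct kvm_cpuid_entry2 *e2,
					       int nent)
	{
		struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
		int i;

		for (i = 0; i < nent; i++) {
			/* td_cpuid_lookup() is a hypothetical lookup of the saved copy. */
			const struct kvm_cpuid_entry2 *td =
				td_cpuid_lookup(kvm_tdx, e2[i].function, e2[i].index);

			if (td && memcmp(&td->eax, &e2[i].eax, 4 * sizeof(u32)))
				return -EINVAL;
		}
		return 0;
	}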

> +	return ret;
> +}
> +
> 

[snip]

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-14  9:16       ` Zhi Wang
@ 2023-01-17 15:55         ` Sean Christopherson
  2023-01-17 19:44           ` Zhi Wang
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-17 15:55 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Sat, Jan 14, 2023, Zhi Wang wrote:
> On Fri, 13 Jan 2023 15:16:08 +0000 > Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Fri, Jan 13, 2023, Zhi Wang wrote:
> > > Better add a FIXME: here as this has to be fixed later.
> > 
> > No, leaking the page is all KVM can reasonably do here.  An improved
> > comment would be helpful, but no code change is required.
> > tdx_reclaim_page() returns an error if and only if there's an
> > unexpected, fatal error, e.g. a SEAMCALL with bad params, incorrect
> > concurrency in KVM, a TDX Module bug, etc.  Retrying at a later point is
> > highly unlikely to be successful.
> 
> Hi:
> 
> The word "leaking" sounds like a situation left unhandled temporarily.
> 
> I checked the source code of the TDX module[1] for the possible reason to
> fail when reviewing this patch:
> 
> tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_reclaim.c
> tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_wbinvd.c
> 
> a. Invalid parameters. For example, page is not aligned, PA HKID is not zero...
> 
> For invalid parameters, a WARN_ON_ONCE() + return value is good enough as
> that is how kernel handles similar situations. The caller takes the
> responsibility.
>  
> b. Locks has been taken in TDX module. TDR page has been locked due to another
> SEAMCALL, another SEAMCALL is doing PAMT walk and holding PAMT lock... 
> 
> This needs to be improved later either by retry or taking tdx_lock to avoid
> TDX module fails on this.

No, tdx_reclaim_page() already retries TDH.PHYMEM.PAGE.RECLAIM if the target page
is contended (though I'd question the validity of even that), and TDH.PHYMEM.PAGE.WBINVD
is performed only when reclaiming the TDR.  If there's contention when reclaiming
the TDR, then KVM effectively has a use-after-free bug, i.e. leaking the page is
the least of our worries.


On Thu, Jan 12, 2023 at 8:34 AM <isaku.yamahata@intel.com> wrote:
> +static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
> +{
> +       struct tdx_module_output out;
> +       u64 err;
> +
> +       do {
> +               err = tdh_phymem_page_reclaim(pa, &out);
> +               /*
> +                * TDH.PHYMEM.PAGE.RECLAIM is allowed only when TD is shutdown.
> +                * state.  i.e. destructing TD.
> +                * TDH.PHYMEM.PAGE.RECLAIM  requires TDR and target page.
> +                * Because we're destructing TD, it's rare to contend with TDR.
> +                */
> +       } while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
> +       if (WARN_ON_ONCE(err)) {
> +               pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> +               return -EIO;
> +       }
> +
> +       if (do_wb) {
> +               /*
> +                * Only TDR page gets into this path.  No contention is expected
> +                * because of the last page of TD.
> +                */
> +               err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
> +               if (WARN_ON_ONCE(err)) {
> +                       pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> +                       return -EIO;
> +               }
> +       }
> +
> +       tdx_clear_page(pa);
> +       return 0;
> +}

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module
  2023-01-13 12:31   ` Zhi Wang
@ 2023-01-17 16:03     ` Isaku Yamahata
  2023-01-17 21:41       ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-01-17 16:03 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Fri, Jan 13, 2023 at 02:31:58PM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Thu, 12 Jan 2023 08:31:12 -0800
> isaku.yamahata@intel.com wrote:
> 
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TDX requires several initialization steps for KVM to create guest TDs.
> > Detect CPU feature, enable VMX (TDX is based on VMX), detect the TDX
> > module availability, and initialize it.  This patch implements those
> > steps.
> > 
> > There are several options on when to initialize the TDX module.  A.)
> > kernel module loading time, B.) the first guest TD creation time.  A.)
> > was chosen. With B.), a user may hit an error of the TDX initialization
> > when trying to create the first guest TD.  The machine that fails to
> > initialize the TDX module can't boot any guest TD further.  Such failure
> > is undesirable and a surprise because the user expects that the machine
> > can accommodate guest TD, but actually not.  So A.) is better than B.).
> > 
> > Introduce a module parameter, enable_tdx, to explicitly enable TDX KVM
> > support.  It's off by default to keep same behavior for those who don't
> > use TDX.  Implement hardware_setup method to detect TDX feature of CPU.
> > Because TDX requires all present CPUs to enable VMX (VMXON).  The x86
> > specific kvm_arch_post_hardware_enable_setup overrides the existing weak
> > symbol of kvm_arch_post_hardware_enable_setup which is called at the KVM
> > module initialization.
> > 
> > Suggested-by: Sean Christopherson <seanjc@google.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/Makefile      |  1 +
> >  arch/x86/kvm/vmx/main.c    | 33 +++++++++++++++++++++++-----
> >  arch/x86/kvm/vmx/tdx.c     | 44 ++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/vmx.c     | 39 +++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/x86_ops.h | 10 +++++++++
> >  5 files changed, 122 insertions(+), 5 deletions(-)
> >  create mode 100644 arch/x86/kvm/vmx/tdx.c
> > 
> > diff --git a/arch/x86/kvm/Makefile b/arch/x86/kvm/Makefile
> > index 0e894ae23cbc..4b01ab842ab7 100644
> > --- a/arch/x86/kvm/Makefile
> > +++ b/arch/x86/kvm/Makefile
> > @@ -25,6 +25,7 @@ kvm-$(CONFIG_KVM_SMM)	+= smm.o
> >  kvm-intel-y		+= vmx/vmx.o vmx/vmenter.o vmx/pmu_intel.o
> > vmx/vmcs12.o \ vmx/hyperv.o vmx/nested.o vmx/posted_intr.o vmx/main.o
> >  kvm-intel-$(CONFIG_X86_SGX_KVM)	+= vmx/sgx.o
> > +kvm-intel-$(CONFIG_INTEL_TDX_HOST)	+= vmx/tdx.o
> >  
> >  kvm-amd-y		+= svm/svm.o svm/vmenter.o svm/pmu.o
> > svm/nested.o svm/avic.o \ svm/sev.o svm/hyperv.o
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 18f659d1d456..f5d1166d2718 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -7,6 +7,22 @@
> >  #include "pmu.h"
> >  #include "tdx.h"
> >  
> > +static bool enable_tdx __ro_after_init =
> > IS_ENABLED(CONFIG_INTEL_TDX_HOST); +module_param_named(tdx, enable_tdx,
> > bool, 0444); +
> 
> The comments says "TDX is off by default". It seems default on/off is controlled
> by the kernel configuration here.

I'll update the comments.

> > +static __init int vt_hardware_setup(void)
> > +{
> > +	int ret;
> > +
> > +	ret = vmx_hardware_setup();
> > +	if (ret)
> > +		return ret;
> > +
> > +	enable_tdx = enable_tdx && !tdx_hardware_setup(&vt_x86_ops);
> > +
> > +	return 0;
> > +}
> > +
> >  struct kvm_x86_ops vt_x86_ops __initdata = {
> >  	.name = KBUILD_MODNAME,
> >  
> > @@ -149,7 +165,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> >  };
> >  
> >  struct kvm_x86_init_ops vt_init_ops __initdata = {
> > -	.hardware_setup = vmx_hardware_setup,
> > +	.hardware_setup = vt_hardware_setup,
> >  	.handle_intel_pt_intr = NULL,
> >  
> >  	.runtime_ops = &vt_x86_ops,
> > @@ -182,10 +198,17 @@ static int __init vt_init(void)
> >  	 * Common KVM initialization _must_ come last, after this,
> > /dev/kvm is
> >  	 * exposed to userspace!
> >  	 */
> > -	vt_x86_ops.vm_size = max(sizeof(struct kvm_vmx), sizeof(struct
> > kvm_tdx));
> > -	vcpu_size = max(sizeof(struct vcpu_vmx), sizeof(struct
> > vcpu_tdx));
> > -	vcpu_align = max(__alignof__(struct vcpu_vmx),
> > -			 __alignof__(struct vcpu_tdx));
> > +	vt_x86_ops.vm_size = sizeof(struct kvm_vmx);
> > +	vcpu_size = sizeof(struct vcpu_vmx);
> > +	vcpu_align = __alignof__(struct vcpu_vmx);
> > +	if (enable_tdx) {
> > +		vt_x86_ops.vm_size = max_t(unsigned int,
> > vt_x86_ops.vm_size,
> > +					   sizeof(struct kvm_tdx));
> > +		vcpu_size = max_t(unsigned int, vcpu_size,
> > +				  sizeof(struct vcpu_tdx));
> > +		vcpu_align = max_t(unsigned int, vcpu_align,
> > +				   __alignof__(struct vcpu_tdx));
> > +	}
> >  	r = kvm_init(vcpu_size, vcpu_align, THIS_MODULE);
> >  	if (r)
> >  		goto err_kvm_init;
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > new file mode 100644
> > index 000000000000..d7a276118940
> > --- /dev/null
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -0,0 +1,44 @@
> > +// SPDX-License-Identifier: GPL-2.0
> > +#include <linux/cpu.h>
> > +
> > +#include <asm/tdx.h>
> > +
> > +#include "capabilities.h"
> > +#include "x86_ops.h"
> > +#include "tdx.h"
> > +#include "x86.h"
> > +
> > +#undef pr_fmt
> > +#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> > +
> > +static int __init tdx_module_setup(void)
> > +{
> > +	int ret;
> > +
> 
> Better mention the tdx_enable() is implemented in another patch? But I guess
> we need a wrapper here so that the compilation would succeed.

The cover letter mentions it. I'll make the commit message of this patch
mention it anyway.

> > +	ret = tdx_enable();
> > +	if (ret) {
> > +		pr_info("Failed to initialize TDX module.\n");
> > +		return ret;
> > +	}
> > +
> > +	pr_info("TDX is supported.\n");
> > +	return 0;
> > +}
> > +
> > +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
> > +{
> > +	int r;
> > +
> > +	if (!enable_ept) {
> > +		pr_warn("Cannot enable TDX with EPT disabled\n");
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* TDX requires VMX. */
> > +	r = vmxon_all();
> > +	if (!r)
> > +		r = tdx_module_setup();
> > +	vmxoff_all();
> > +
> > +	return r;
> > +}
> > diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
> > index 5de1792c9902..5dc7687dcf16 100644
> > --- a/arch/x86/kvm/vmx/vmx.c
> > +++ b/arch/x86/kvm/vmx/vmx.c
> > @@ -8147,6 +8147,45 @@ static unsigned int vmx_handle_intel_pt_intr(void)
> >  	return 1;
> >  }
> >  
> > +static __init void vmxon(void *arg)
> > +{
> > +	int cpu = raw_smp_processor_id();
> > +	u64 phys_addr = __pa(per_cpu(vmxarea, cpu));
> > +	atomic_t *failed = arg;
> > +	int r;
> > +
> > +	if (cr4_read_shadow() & X86_CR4_VMXE) {
> > +		r = -EBUSY;
> > +		goto out;
> > +	}
> > +
> > +	r = kvm_cpu_vmxon(phys_addr);
> > +out:
> > +	if (r)
> > +		atomic_inc(failed);
> > +}
> > +
> > +__init int vmxon_all(void)
> > +{
> > +	atomic_t failed = ATOMIC_INIT(0);
> > +
> > +	on_each_cpu(vmxon, &failed, 1);
> > +
> > +	if (atomic_read(&failed))
> > +		return -EBUSY;
> > +	return 0;
> > +}
> > +
> > +static __init void vmxoff(void *junk)
> > +{
> > +	cpu_vmxoff();
> > +}
> > +
> > +__init void vmxoff_all(void)
> > +{
> > +	on_each_cpu(vmxoff, NULL, 1);
> > +}
> > +
> >  static __init void vmx_setup_user_return_msrs(void)
> >  {
> >  
> > diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> > index 051b5c4b5c2f..fbc57fcbdd21 100644
> > --- a/arch/x86/kvm/vmx/x86_ops.h
> > +++ b/arch/x86/kvm/vmx/x86_ops.h
> > @@ -20,6 +20,10 @@ bool kvm_is_vmx_supported(void);
> >  int __init vmx_init(void);
> >  void vmx_exit(void);
> >  
> > +__init int vmxon_all(void);
> > +__init void vmxoff_all(void);
> > +__init int vmx_hardware_setup(void);
> > +
> >  extern struct kvm_x86_ops vt_x86_ops __initdata;
> >  extern struct kvm_x86_init_ops vt_init_ops __initdata;
> >  
> > @@ -133,4 +137,10 @@ void vmx_cancel_hv_timer(struct kvm_vcpu *vcpu);
> >  #endif
> >  void vmx_setup_mce(struct kvm_vcpu *vcpu);
> >  
> > +#ifdef CONFIG_INTEL_TDX_HOST
> > +int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops);
> > +#else
> > +static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) {
> > return 0; } +#endif
> > +
> >  #endif /* __KVM_X86_VMX_X86_OPS_H */
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 025/113] KVM: TDX: Use private memory for TDX
  2023-01-16 10:45   ` Huang, Kai
@ 2023-01-17 16:40     ` Sean Christopherson
  2023-01-17 22:52       ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-17 16:40 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, chao.p.peng, pbonzini,
	Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack

On Mon, Jan 16, 2023, Huang, Kai wrote:
> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Chao Peng <chao.p.peng@linux.intel.com>
> > 
> > Override kvm_arch_has_private_mem() to use fd-based private memory.
> > Return true when a VM has a type of KVM_X86_TDX_VM.
> > 
> > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/x86.c | 5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index d548d3af6428..a8b555935fd8 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -13498,6 +13498,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> >  }
> >  EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
> >  
> > +bool kvm_arch_has_private_mem(struct kvm *kvm)
> > +{
> > +	return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > +}
> > +
> 
> AMD's series has a different solution:
> 
> https://lore.kernel.org/lkml/20221214194056.161492-3-michael.roth@amd.com/
> 
> I think somehow this needs to get aligned.

Ya.  My thought is

 bool kvm_arch_has_private_mem(struct kvm *kvm)
 {
	return kvm->arch.vm_type != KVM_X86_DEFAULT_VM;
 }

where the VM types end up being:

 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_PROTECTED_VM	1
 #define KVM_X86_TDX_VM		2
 #define KVM_X86_SNP_VM		3

Don't spend too much time reworking the TDX series at this point, I'm going to do
a trial run of combining UPM+TDX+SNP sometime in the next few weeks to see how all
the pieces fit together, this is one of the common touchpoints that I'll make sure
to look at.  Though if you have ideas on, by all means post them.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range
  2023-01-17  2:06   ` Huang, Kai
@ 2023-01-17 16:53     ` Sean Christopherson
  2023-02-27 22:03       ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-17 16:53 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack

On Tue, Jan 17, 2023, Huang, Kai wrote:
> On Thu, 2023-01-12 at 08:32 -0800, isaku.yamahata@intel.com wrote:
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -244,7 +244,7 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> >  {
> >  	int ret = -ENOTSUPP;
> >  
> > -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> > +	if (range && kvm_available_flush_tlb_with_range())
> >  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> 
> Again, IMHO this code change doesn't make code any clearer.  With the new code,
> I need to go into the kvm_available_flush_tlb_with_range() to see what's going
> on, but with the old code I don't.

Agreed.  Though I think this patch as a whole can be replaced with a more
straightforward solution.

hv_remote_flush_tlb() is used when KVM is running as a Hyper-V guest, whereas
TDX requires running KVM on bare metal.  KVM should simply disallow TDX when a
hypervisor is detected, then there's no need for vmx_tlb_remote_flush_with_range().
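
A hedged sketch of that check (placement is illustrative, e.g. it could sit in
tdx_hardware_setup() alongside the existing enable_ept check):

	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
		pr_warn("Cannot enable TDX when KVM itself runs as a guest\n");
		return -EINVAL;
	}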

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-17 15:55         ` Sean Christopherson
@ 2023-01-17 19:44           ` Zhi Wang
  2023-01-17 20:56             ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-17 19:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Tue, 17 Jan 2023 15:55:53 +0000
Sean Christopherson <seanjc@google.com> wrote:

> On Sat, Jan 14, 2023, Zhi Wang wrote:
> > On Fri, 13 Jan 2023 15:16:08 +0000 > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Fri, Jan 13, 2023, Zhi Wang wrote:
> > > > Better add a FIXME: here as this has to be fixed later.
> > > 
> > > No, leaking the page is all KVM can reasonably do here.  An improved
> > > comment would be helpful, but no code change is required.
> > > tdx_reclaim_page() returns an error if and only if there's an
> > > unexpected, fatal error, e.g. a SEAMCALL with bad params, incorrect
> > > concurrency in KVM, a TDX Module bug, etc.  Retrying at a later point is
> > > highly unlikely to be successful.
> > 
> > Hi:
> > 
> > The word "leaking" sounds like a situation left unhandled temporarily.
> > 
> > I checked the source code of the TDX module[1] for the possible reason to
> > fail when reviewing this patch:
> > 
> > tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_reclaim.c
> > tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_wbinvd.c
> > 
> > a. Invalid parameters. For example, page is not aligned, PA HKID is not zero...
> > 
> > For invalid parameters, a WARN_ON_ONCE() + return value is good enough as
> > that is how kernel handles similar situations. The caller takes the
> > responsibility.
> >  
> > b. Locks has been taken in TDX module. TDR page has been locked due to another
> > SEAMCALL, another SEAMCALL is doing PAMT walk and holding PAMT lock... 
> > 
> > This needs to be improved later either by retry or taking tdx_lock to avoid
> > TDX module fails on this.
> 
> No, tdx_reclaim_page() already retries TDH.PHYMEM.PAGE.RECLAIM if the target page
> is contended (though I'd question the validity of even that), and TDH.PHYMEM.PAGE.WBINVD
> is performed only when reclaiming the TDR.  If there's contention when reclaiming
> the TDR, then KVM effectively has a use-after-free bug, i.e. leaking the page is
> the least of our worries.
> 

Hi:

Thanks for the reply. "Leaking" is the consequence of failing even after retrying;
I agree with this. But I was questioning whether "retry" is really the correct and
only solution when encountering lock contention in the TDX module, as I saw that
quite a few magic numbers are going to be introduced because of "retry", and there
were discussions about whether the retry count should be 3 or 1000 in the TDX
guest on Hyper-V patches. It doesn't sound right.

Compared to a typical *kernel lock* case, an execution path can wait on a
waitqueue and later be woken up. We usually do contention-wait-and-retry, and
we rarely just hit contention and retry X times. In the TDX case, I understand
that it is hard for the TDX module to provide a similar solution, as an execution
path can't stay long in the TDX module.

1) We can always take tdx_lock (linux kernel lock) when calling a SEAMCALL
that touches the TDX internal locks. But the downside is we might lose some
concurrency.
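
A minimal sketch of option 1), assuming the global tdx_lock mutex and the
__seamcall() helper exported by this series (the wrapper itself is hypothetical):

	static u64 tdx_seamcall_serialized(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
					   struct tdx_module_output *out)
	{
		u64 err;

		mutex_lock(&tdx_lock);
		err = __seamcall(op, rcx, rdx, r8, r9, out);
		mutex_unlock(&tdx_lock);
		return err;
	}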

2) As TDX module doesn't provide contention-and-wait, I guess the following
approach might have been discussed when designing this "retry".

KERNEL                          TDX MODULE

SEAMCALL A   ->                 PATH A: Taking locks

SEAMCALL B   ->                 PATH B: Contention on a lock

             <-                 Return "operand busy"

SEAMCALL B   -|
              |  <- Wait on a kernel waitqueue
SEAMCALL B  <-|

SEAMCALL A   <-                 PATH A: Return

SEAMCALL A   -|
              |  <- Wake up the waitqueue
SEAMCALL A  <-| 

SEAMCALL B  ->                  PATH B: Taking the locks
...

Why wasn't this scheme chosen?
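
A rough sketch of that scheme, for illustration only (tdx_busy_wq, tdx_busy_gen
and seamcall_is_busy() are hypothetical names; the point is the
contention-wait-wakeup flow rather than blind retries):

	static DECLARE_WAIT_QUEUE_HEAD(tdx_busy_wq);
	static atomic_t tdx_busy_gen = ATOMIC_INIT(0);

	static u64 tdx_seamcall_wait_busy(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
					  struct tdx_module_output *out)
	{
		u64 err;
		int gen;

		for (;;) {
			/* Snapshot the generation before the call to avoid lost wakeups. */
			gen = atomic_read(&tdx_busy_gen);
			err = __seamcall(op, rcx, rdx, r8, r9, out);
			if (!seamcall_is_busy(err))
				break;
			/* PATH B: contention inside the TDX module; sleep until PATH A returns. */
			wait_event(tdx_busy_wq, atomic_read(&tdx_busy_gen) != gen);
		}
		/* PATH A: TDX-internal locks are released; let any contenders retry. */
		atomic_inc(&tdx_busy_gen);
		wake_up_all(&tdx_busy_wq);
		return err;
	}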

> 
> On Thu, Jan 12, 2023 at 8:34 AM <isaku.yamahata@intel.com> wrote:
> > +static int tdx_reclaim_page(hpa_t pa, bool do_wb, u16 hkid)
> > +{
> > +       struct tdx_module_output out;
> > +       u64 err;
> > +
> > +       do {
> > +               err = tdh_phymem_page_reclaim(pa, &out);
> > +               /*
> > +                * TDH.PHYMEM.PAGE.RECLAIM is allowed only when TD is shutdown.
> > +                * state.  i.e. destructing TD.
> > +                * TDH.PHYMEM.PAGE.RECLAIM  requires TDR and target page.
> > +                * Because we're destructing TD, it's rare to contend with TDR.
> > +                */
> > +       } while (err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_RCX));
> > +       if (WARN_ON_ONCE(err)) {
> > +               pr_tdx_error(TDH_PHYMEM_PAGE_RECLAIM, err, &out);
> > +               return -EIO;
> > +       }
> > +
> > +       if (do_wb) {
> > +               /*
> > +                * Only TDR page gets into this path.  No contention is expected
> > +                * because of the last page of TD.
> > +                */
> > +               err = tdh_phymem_page_wbinvd(set_hkid_to_hpa(pa, hkid));
> > +               if (WARN_ON_ONCE(err)) {
> > +                       pr_tdx_error(TDH_PHYMEM_PAGE_WBINVD, err, NULL);
> > +                       return -EIO;
> > +               }
> > +       }
> > +
> > +       tdx_clear_page(pa);
> > +       return 0;
> > +}


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-17 19:44           ` Zhi Wang
@ 2023-01-17 20:56             ` Sean Christopherson
  2023-01-17 21:01               ` Sean Christopherson
  2023-01-19 22:45               ` Zhi Wang
  0 siblings, 2 replies; 221+ messages in thread
From: Sean Christopherson @ 2023-01-17 20:56 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Tue, Jan 17, 2023, Zhi Wang wrote:
> On Tue, 17 Jan 2023 15:55:53 +0000
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Sat, Jan 14, 2023, Zhi Wang wrote:
> > > On Fri, 13 Jan 2023 15:16:08 +0000 > Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > > On Fri, Jan 13, 2023, Zhi Wang wrote:
> > > > > Better add a FIXME: here as this has to be fixed later.
> > > > 
> > > > No, leaking the page is all KVM can reasonably do here.  An improved
> > > > comment would be helpful, but no code change is required.
> > > > tdx_reclaim_page() returns an error if and only if there's an
> > > > unexpected, fatal error, e.g. a SEAMCALL with bad params, incorrect
> > > > concurrency in KVM, a TDX Module bug, etc.  Retrying at a later point is
> > > > highly unlikely to be successful.
> > > 
> > > Hi:
> > > 
> > > The word "leaking" sounds like a situation left unhandled temporarily.
> > > 
> > > I checked the source code of the TDX module[1] for the possible reason to
> > > fail when reviewing this patch:
> > > 
> > > tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_reclaim.c
> > > tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_wbinvd.c
> > > 
> > > a. Invalid parameters. For example, page is not aligned, PA HKID is not zero...
> > > 
> > > For invalid parameters, a WARN_ON_ONCE() + return value is good enough as
> > > that is how kernel handles similar situations. The caller takes the
> > > responsibility.
> > >  
> > > b. Locks has been taken in TDX module. TDR page has been locked due to another
> > > SEAMCALL, another SEAMCALL is doing PAMT walk and holding PAMT lock... 
> > > 
> > > This needs to be improved later either by retry or taking tdx_lock to avoid
> > > TDX module fails on this.
> > 
> > No, tdx_reclaim_page() already retries TDH.PHYMEM.PAGE.RECLAIM if the target page
> > is contended (though I'd question the validity of even that), and TDH.PHYMEM.PAGE.WBINVD
> > is performed only when reclaiming the TDR.  If there's contention when reclaiming
> > the TDR, then KVM effectively has a use-after-free bug, i.e. leaking the page is
> > the least of our worries.
> > 
> 
> Hi:
> 
> Thanks for the reply. "Leaking" is the consequence of failing even after retrying;
> I agree with this. But I was questioning whether "retry" is really the correct and
> only solution when encountering lock contention in the TDX module, as I saw that
> quite a few magic numbers are going to be introduced because of "retry", and there
> were discussions about whether the retry count should be 3 or 1000 in the TDX
> guest on Hyper-V patches. It doesn't sound right.

Ah, yeah, I'm speaking only with respect to leaking pages on failure in this
specific scenario.

> Compared to a typical *kernel lock* case, an execution path can wait on a
> waitqueue and later be woken up. We usually do contention-wait-and-retry, and
> we rarely just hit contention and retry X times. In the TDX case, I understand
> that it is hard for the TDX module to provide a similar solution, as an execution
> path can't stay long in the TDX module.

Let me preface the below comments by saying that this is the first time that I've
seen the "Single-Step and Zero-Step Attacks Mitigation Mechanisms" behavior, i.e.
the first time I've been made aware that the TDX Module can apparently decide
to take S-EPT locks in the VM-Enter path.

> 
> 1) We can always take tdx_lock (linux kernel lock) when calling a SEAMCALL
> that touches the TDX internal locks. But the downside is we might lose some
> concurrency.

This isn't really feasible in practice.  Even if tdx_lock were made a spinlock
(it's currently a mutex) so that it could be taken inside kvm->mmu_lock,
acquiring a per-VM lock, let alone a global lock, in KVM's page fault handling
path is not an option.  KVM has a hard requirement these days of being able to
handle multiple page faults in parallel.

> 2) As TDX module doesn't provide contention-and-wait, I guess the following
> approach might have been discussed when designing this "retry".
> 
> KERNEL                          TDX MODULE
> 
> SEAMCALL A   ->                 PATH A: Taking locks
> 
> SEAMCALL B   ->                 PATH B: Contention on a lock
> 
>              <-                 Return "operand busy"
> 
> SEAMCALL B   -|
>               |  <- Wait on a kernel waitqueue
> SEAMCALL B  <-|
> 
> SEAMCALL A   <-                 PATH A: Return
> 
> SEAMCALL A   -|
>               |  <- Wake up the waitqueue
> SEAMCALL A  <-| 
> 
> SEAMCALL B  ->                  PATH B: Taking the locks
> ...
> 
> Why wasn't this scheme chosen?

AFAIK, I don't think a waitqueue approach has ever been discussed publicly.  Intel
may have considered the idea internally, but I don't recall anything being proposed
publicly (though it's entirely possible I just missed the discussion).

Anyways, I don't think a waitqueue would be a good fit, at least not for S-EPT
management, which AFAICT is the only scenario where KVM does the arbitrary "retry
X times and hope things work".  If the contention occurs due to the TDX Module
taking an S-EPT lock in VM-Enter, then KVM won't get a chance to do the "Wake up
the waitqueue" action until the next VM-Exit, which IIUC is well after the TDX
Module drops the S-EPT lock.  In other words, immediately retrying and then punting
the problem further up the stack in KVM does seem to be the least awful "solution"
if there's contention.

That said, I 100% agree that the arbitrary retry stuff is awful.  The zero-step
interaction in particular isn't acceptable.

Intel folks, encountering "TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT" on VM-Enter
needs to be treated as a KVM bug, even if it means teaching KVM to kill the VM
if a vCPU is on the cusp of triggering the "pre-determined number" of EPT faults
mentioned in this snippet:

  After a pre-determined number of such EPT violations occur on the same instruction,
  the TDX module starts tracking the GPAs that caused Secure EPT faults and fails
  further host VMM attempts to enter the TD VCPU unless previously faulting private
  GPAs are properly mapped in the Secure EPT.

If the "pre-determined number" is too low to avoid false positives, e.g. if it can
be tripped by a vCPU spinning while a different vCPU finishes handling a fault,
then either the behavior of the TDX Module needs to be revisited, or KVM needs to
stall vCPUs that are approaching the threshold until all outstanding S-EPT faults
have been serviced.

KVM shouldn't be spuriously zapping private S-EPT entries since the backing memory
is pinned, which means the only way for a vCPU to truly get stuck faulting on a
single instruction is if userspace is broken/malicious, in which case userspace
gets to keep the pieces.
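
For illustration only, such handling might look roughly like the below.  The
function name and the way the error code reaches it are made up for this
sketch; only KVM_BUG_ON() and the TDX_OPERAND_* status bits quoted above are
real:

	/*
	 * Hypothetical sketch: S-EPT contention reported at VM-Enter is a
	 * KVM bug, so kill the VM instead of blindly retrying VM-Enter.
	 */
	static int tdx_handle_sept_busy_on_entry(struct kvm_vcpu *vcpu, u64 err)
	{
		if (KVM_BUG_ON(err == (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT),
			       vcpu->kvm))
			return -EIO;

		return 1;	/* resume the guest */
	}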

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-17 20:56             ` Sean Christopherson
@ 2023-01-17 21:01               ` Sean Christopherson
  2023-01-19 11:31                 ` Huang, Kai
  2023-01-19 22:45               ` Zhi Wang
  1 sibling, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-17 21:01 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Tue, Jan 17, 2023, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Zhi Wang wrote:
> > 2) As TDX module doesn't provide contention-and-wait, I guess the following
> > approach might have been discussed when designing this "retry".
> > 
> > KERNEL                          TDX MODULE
> > 
> > SEAMCALL A   ->                 PATH A: Taking locks
> > 
> > SEAMCALL B   ->                 PATH B: Contention on a lock
> > 
> >              <-                 Return "operand busy"
> > 
> > SEAMCALL B   -|
> >               |  <- Wait on a kernel waitqueue
> > SEAMCALL B  <-|
> > 
> > SEAMCALL A   <-                 PATH A: Return
> > 
> > SEAMCALL A   -|
> >               |  <- Wake up the waitqueue
> > SEAMCALL A  <-| 
> > 
> > SEAMCALL B  ->                  PATH B: Taking the locks
> > ...
> > 
> > Why wasn't this scheme chosen?
> 
> AFAIK, I don't think a waitqueue approach has ever been discussed publicly.  Intel
> may have considered the idea internally, but I don't recall anything being proposed
> publicly (though it's entirely possible I just missed the discussion).
> 
> Anyway, I don't think a waitqueue would be a good fit, at least not for S-EPT
> management, which AFAICT is the only scenario where KVM does the arbitrary "retry
> X times and hope things work".  If the contention occurs due to the TDX Module
> taking an S-EPT lock in VM-Enter, then KVM won't get a chance to do the "Wake up
> the waitqueue" action until the next VM-Exit, which IIUC is well after the TDX
> Module drops the S-EPT lock.  In other words, immediately retrying and then punting
> the problem further up the stack in KVM does seem to be the least awful "solution"
> if there's contention.

Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
(conditionally) disallow sleeping.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module
  2023-01-17 16:03     ` Isaku Yamahata
@ 2023-01-17 21:41       ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-17 21:41 UTC (permalink / raw)
  To: zhi.wang.linux, isaku.yamahata
  Cc: Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, dmatlack, kvm,
	Yamahata, Isaku, pbonzini

On Tue, 2023-01-17 at 08:03 -0800, Isaku Yamahata wrote:
> > Better mention that tdx_enable() is implemented in another patch? But I guess
> > we need a wrapper here so that the compilation would succeed.
> 
The cover letter mentions it. I'll make the commit message of this patch
> mention it anyway.

When this series is merged, the TDX host series must have already been merged.

You obviously don't need to say something like "tdx_enable() is implemented in
another patch/series" in _changelog_.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 025/113] KVM: TDX: Use private memory for TDX
  2023-01-17 16:40     ` Sean Christopherson
@ 2023-01-17 22:52       ` Huang, Kai
  2023-01-18  1:16         ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-17 22:52 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: dmatlack, Shahar, Sagi, chao.p.peng, isaku.yamahata, Aktas,
	Erdem, kvm, pbonzini, Yamahata, Isaku, linux-kernel

On Tue, 2023-01-17 at 16:40 +0000, Sean Christopherson wrote:
> On Mon, Jan 16, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > > From: Chao Peng <chao.p.peng@linux.intel.com>
> > > 
> > > Override kvm_arch_has_private_mem() to use fd-based private memory.
> > > Return true when a VM has a type of KVM_X86_TDX_VM.
> > > 
> > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > ---
> > >  arch/x86/kvm/x86.c | 5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index d548d3af6428..a8b555935fd8 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -13498,6 +13498,11 @@ int kvm_sev_es_string_io(struct kvm_vcpu *vcpu, unsigned int size,
> > >  }
> > >  EXPORT_SYMBOL_GPL(kvm_sev_es_string_io);
> > >  
> > > +bool kvm_arch_has_private_mem(struct kvm *kvm)
> > > +{
> > > +	return kvm->arch.vm_type == KVM_X86_TDX_VM;
> > > +}
> > > +
> > 
> > AMD's series has a different solution:
> > 
> > https://lore.kernel.org/lkml/20221214194056.161492-3-michael.roth@amd.com/
> > 
> > I think somehow this needs to get aligned.
> 
> Ya.  My thought is
> 
>  bool kvm_arch_has_private_mem(struct kvm *kvm)
>  {
> 	return kvm->arch.vm_type != KVM_X86_DEFAULT_VM;
>  }
> 
> where the VM types end up being:
> 
>  #define KVM_X86_DEFAULT_VM	0
>  #define KVM_X86_PROTECTED_VM	1
>  #define KVM_X86_TDX_VM		2
>  #define KVM_X86_SNP_VM		3

What's the difference between PROTECTED_VM and TDX_VM/SNP_VM, may I ask?

> 
> Don't spend too much time reworking the TDX series at this point, I'm going to do
> a trial run of combining UPM+TDX+SNP sometime in the next few weeks to see how all
> the pieces fit together, this is one of the common touchpoints that I'll make sure
> to look at.  Though if you have ideas on, by all means post them.

Glad to know.  Thanks for your time.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 025/113] KVM: TDX: Use private memory for TDX
  2023-01-17 22:52       ` Huang, Kai
@ 2023-01-18  1:16         ` Sean Christopherson
  0 siblings, 0 replies; 221+ messages in thread
From: Sean Christopherson @ 2023-01-18  1:16 UTC (permalink / raw)
  To: Huang, Kai
  Cc: dmatlack, Shahar, Sagi, chao.p.peng, isaku.yamahata, Aktas,
	Erdem, kvm, pbonzini, Yamahata, Isaku, linux-kernel

On Tue, Jan 17, 2023, Huang, Kai wrote:
> On Tue, 2023-01-17 at 16:40 +0000, Sean Christopherson wrote:
> > On Mon, Jan 16, 2023, Huang, Kai wrote:
> > > AMD's series has a different solution:
> > > 
> > > https://lore.kernel.org/lkml/20221214194056.161492-3-michael.roth@amd.com/
> > > 
> > > I think somehow this needs to get aligned.
> > 
> > Ya.  My thought is
> > 
> >  bool kvm_arch_has_private_mem(struct kvm *kvm)
> >  {
> > 	return kvm->arch.vm_type != KVM_X86_DEFAULT_VM;
> >  }
> > 
> > where the VM types end up being:
> > 
> >  #define KVM_X86_DEFAULT_VM	0
> >  #define KVM_X86_PROTECTED_VM	1
> >  #define KVM_X86_TDX_VM		2
> >  #define KVM_X86_SNP_VM		3
> 
> What's the difference between PROTECTED_VM and TDX_VM/SNP_VM, may I ask?

Not sure :-)  

At this point, PROTECTED_VM is a handwavy "guest that can access restricted memory".
I think/hope it can be used in combination with existing SEV/SEV-ES APIs to allow
backing such guests with restrictedmem.

As for TDX_VM and SNP_VM, I'm speculating that KVM will need to differentiate between
SNP, TDX, and everything else.  E.g. if KVM needs to uniquely identify TDX VMs at
creation time, then adding dedicated VM types seems like the obvious choice.  Ditto
for SNP VMs.

That's partly why I want to smush everything together, to see if we actually need
multiple VM types.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-01-12 16:31 ` [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
  2023-01-16 10:46   ` Zhi Wang
@ 2023-01-19  0:45   ` Huang, Kai
  2023-02-28 11:06     ` Isaku Yamahata
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19  0:45 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> The next step of TDX guest creation is to create vcpu.  Allocate TDX vcpu
> structures, partially initialize it.  
> 

Why partially initialize it?  Shouldn't a better way be either: 1) not
initializing it at all, or 2) fully initializing it?

Can you put more _why_ here?


> Allocate pages of TDX vcpu for the
> TDX module.  Actual donation TDX vcpu pages to the TDX module is not done
> yet.

Also, can you explain _why_ it is not done here?

> 
> In the case of the conventional case, cpuid is empty at the initialization.
> and cpuid is configured after the vcpu initialization.  Because TDX
> supports only X2APIC mode, cpuid is forcibly initialized to support X2APIC
> on the vcpu initialization.

Don't quite understand here.  As you said CPUID entries are configured later in
KVM_SET_CPUID2, so what's the point of initializing CPUID to support x2apic
here?

Are you suggesting KVM_SET_CPUID2 will be somehow rejected for TDX guest, or
there will be special handling to make sure the CPUID initialized here won't be
overwritten later?

Please explain clearly here.

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
> Changes v10 -> v11:
> - NULL check of kvmalloc_array() in tdx_vcpu_reset. Move it to
>   tdx_vcpu_create()
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 40 ++++++++++++++++++--
>  arch/x86/kvm/vmx/tdx.c     | 75 ++++++++++++++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h | 10 +++++
>  arch/x86/kvm/x86.c         |  2 +
>  4 files changed, 123 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index ddf0742f1f67..59813ca05f36 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -63,6 +63,38 @@ static void vt_vm_free(struct kvm *kvm)
>  		tdx_vm_free(kvm);
>  }
>  
> +static int vt_vcpu_precreate(struct kvm *kvm)
> +{
> +	if (is_td(kvm))
> +		return 0;
> +
> +	return vmx_vcpu_precreate(kvm);
> +}
> +
> +static int vt_vcpu_create(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_create(vcpu);
> +
> +	return vmx_vcpu_create(vcpu);
> +}
> +
> +static void vt_vcpu_free(struct kvm_vcpu *vcpu)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_free(vcpu);
> +
> +	return vmx_vcpu_free(vcpu);
> +}
> +
> +static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_vcpu_reset(vcpu, init_event);
> +
> +	return vmx_vcpu_reset(vcpu, init_event);
> +}
> +
>  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	if (!is_td(kvm))
> @@ -90,10 +122,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.vm_destroy = vt_vm_destroy,
>  	.vm_free = vt_vm_free,
>  
> -	.vcpu_precreate = vmx_vcpu_precreate,
> -	.vcpu_create = vmx_vcpu_create,
> -	.vcpu_free = vmx_vcpu_free,
> -	.vcpu_reset = vmx_vcpu_reset,
> +	.vcpu_precreate = vt_vcpu_precreate,
> +	.vcpu_create = vt_vcpu_create,
> +	.vcpu_free = vt_vcpu_free,
> +	.vcpu_reset = vt_vcpu_reset,
>  
>  	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
>  	.vcpu_load = vmx_vcpu_load,
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 557a609c5147..099f0737a5aa 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -281,6 +281,81 @@ int tdx_vm_init(struct kvm *kvm)
>  	return 0;
>  }
>  
> +int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_cpuid_entry2 *e;
> +
> +	/*
> +	 * On cpu creation, cpuid entry is blank.  Forcibly enable
> +	 * X2APIC feature to allow X2APIC.
> +	 * Because vcpu_reset() can't return error, allocation is done here.
> +	 */
> +	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> +	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> +	e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);

You don't need to use kvmalloc_array() when only allocating one entry.
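
E.g. a plain single-entry allocation would do (untested):

	e = kzalloc(sizeof(*e), GFP_KERNEL_ACCOUNT);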

> +	if (!e)
> +		return -ENOMEM;
> +	*e  = (struct kvm_cpuid_entry2) {
> +		.function = 1,	/* Features for X2APIC */
> +		.index = 0,
> +		.eax = 0,
> +		.ebx = 0,
> +		.ecx = 1ULL << 21,	/* X2APIC */
> +		.edx = 0,
> +	};
> +	vcpu->arch.cpuid_entries = e;
> +	vcpu->arch.cpuid_nent = 1;

As mentioned above, why do it here? Won't this be overwritten later in
KVM_SET_CPUID2?

> +
> +	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> +	if (!vcpu->arch.apic)
> +		return -EINVAL;

If this is hit, what happens to the CPUID entry allocated above?  It's
absolutely not clear here in this patch.

> +
> +	fpstate_set_confidential(&vcpu->arch.guest_fpu);
> +
> +	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> +
> +	vcpu->arch.cr0_guest_owned_bits = -1ul;
> +	vcpu->arch.cr4_guest_owned_bits = -1ul;
> +
> +	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
> +	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> +	vcpu->arch.guest_state_protected =
> +		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
> +
> +	return 0;
> +}
> +
> +void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> +{
> +	/* This is stub for now.  More logic will come. */
> +}
> +
> +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> +{
> +	struct msr_data apic_base_msr;
> +
> +	/* TDX doesn't support INIT event. */
> +	if (WARN_ON_ONCE(init_event))
> +		goto td_bugged;

Should we use KVM_BUG_ON()?
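
E.g. (untested):

	if (KVM_BUG_ON(init_event, vcpu->kvm))
		return;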

Again, it appears this depends on how KVM handles INIT, which is done in a later
patch far away:

[PATCH v11 102/113] KVM: TDX: Silently ignore INIT/SIPI

And there's no material explaining how it is handled in either changelog or
comment, so to me it's not reviewable.

> +
> +	/* TDX rquires X2APIC. */
> +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> +	if (kvm_vcpu_is_reset_bsp(vcpu))
> +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> +	apic_base_msr.host_initiated = true;
> +	if (WARN_ON_ONCE(kvm_set_apic_base(vcpu, &apic_base_msr)))
> +		goto td_bugged;

I think we have KVM_BUG_ON()?

TDX requires a lot more stuff than just x2apic, so why is only x2apic done here,
particularly in _this_ patch?

> +
> +	/*
> +	 * Don't update mp_state to runnable because more initialization
> +	 * is needed by TDX_VCPU_INIT.
> +	 */
> +	return;
> +
> +td_bugged:
> +	vcpu->kvm->vm_bugged = true;
> +}
> +
> 

[snip]

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2023-01-12 16:31 ` [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
@ 2023-01-19  2:40   ` Huang, Kai
  2023-02-27 21:22     ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19  2:40 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Add a place holder function for TDX specific VM-scoped ioctl as mem_enc_op.
> TDX specific sub-commands will be added to retrieve/pass TDX specific
> parameters.
> 
> KVM_MEMORY_ENCRYPT_OP was introduced for VM-scoped operations specific for
> guest state-protected VM.  It defined subcommands for technology-specific
> operations under KVM_MEMORY_ENCRYPT_OP.  Despite its name, the subcommands
> are not limited to memory encryption, but various technology-specific
> operations are defined.  It's natural to repurpose KVM_MEMORY_ENCRYPT_OP
> for TDX specific operations and define subcommands.
> 
> TDX requires VM-scoped TDX-specific operations for device model, for
> example, qemu.  Getting system-wide parameters, TDX-specific VM
> initialization.

There's a grammar issue in the last paragraph.  Please use a grammar checker.

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    |  9 +++++++++
>  arch/x86/kvm/vmx/tdx.c     | 26 ++++++++++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h |  4 ++++
>  3 files changed, 39 insertions(+)
> 
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 16053ec3e0ae..781fbc896120 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -37,6 +37,14 @@ static int vt_vm_init(struct kvm *kvm)
>  	return vmx_vm_init(kvm);
>  }
>  
> +static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> +{
> +	if (!is_td(kvm))
> +		return -ENOTTY;
> +
> +	return tdx_vm_ioctl(kvm, argp);
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = KBUILD_MODNAME,
>  
> @@ -179,6 +187,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
>  
>  	.dev_mem_enc_ioctl = tdx_dev_ioctl,
> +	.mem_enc_ioctl = vt_mem_enc_ioctl,
>  };

IIUC, now both AMD and Intel have mem_enc_ioctl() callback implemented, so the
KVM_X86_OP_OPTIONAL() of it can be changed to KVM_X86_OP(), and the function
pointer check can be removed in the IOCTL:

diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
ops.h
index 8dc345cc6318..a59852fb5e2a 100644
--- a/arch/x86/include/asm/kvm-x86-ops.h
+++ b/arch/x86/include/asm/kvm-x86-ops.h
@@ -116,7 +116,7 @@ KVM_X86_OP(enter_smm)
 KVM_X86_OP(leave_smm)
 KVM_X86_OP(enable_smi_window)
 #endif
-KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
+KVM_X86_OP(mem_enc_ioctl)
 KVM_X86_OP_OPTIONAL(mem_enc_register_region)
 KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
 KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index c936f8d28a53..dfa279e35478 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -6937,10 +6937,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
                goto out;
        }
        case KVM_MEMORY_ENCRYPT_OP: {
-               r = -ENOTTY;
-               if (!kvm_x86_ops.mem_enc_ioctl)
-                       goto out;
-
                r = static_call(kvm_x86_mem_enc_ioctl)(kvm, argp);
                break;
        }

[snip]


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization
  2023-01-12 16:31 ` [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
  2023-01-16 16:07   ` Zhi Wang
@ 2023-01-19 10:37   ` Huang, Kai
  2023-02-28 11:27     ` Isaku Yamahata
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 10:37 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: sean.j.christopherson, pbonzini, Shahar, Sagi, Aktas, Erdem,
	isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TD guest vcpu need to be configured before ready to run which requests
> addtional information from Device model (e.g. qemu), one 64bit value is
> passed to vcpu's RCX as an initial value.  
> 

The first half of the sentence doesn't parse for me.  It also has a grammar issue.

Also, the second half only talks about TDH.VP.INIT, but there's more regarding
creating/initializing a TDX guest vcpu.  IMHO it would be better if you could
briefly describe the whole sequence here so people can get some idea about your
code below.

Btw, I don't understand the point of pointing out "64bit value passed to
vcpu's RCX ...".  You can add this to a comment instead.  If it is important,
then please add more to explain it so people can understand.

> Repurpose KVM_MEMORY_ENCRYPT_OP
> to vcpu-scope and add new sub-commands KVM_TDX_INIT_VCPU under it for such
> additional vcpu configuration.

I am not sure using the same command for both per-VM and per-vcpu ioctls is a
good idea.  Is there any existing example that does this?

> 
> Add callback for kvm vCPU-scoped operations of KVM_MEMORY_ENCRYPT_OP and
> add a new subcommand, KVM_TDX_INIT_VCPU, for further vcpu initialization.

Personally I prefer KVM_TDX_VCPU_CREATE (instead of INIT) but will leave to
maintainers.

> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/include/asm/kvm-x86-ops.h    |   1 +
>  arch/x86/include/asm/kvm_host.h       |   1 +
>  arch/x86/include/uapi/asm/kvm.h       |   1 +
>  arch/x86/kvm/vmx/main.c               |   9 ++
>  arch/x86/kvm/vmx/tdx.c                | 147 +++++++++++++++++++++++++-
>  arch/x86/kvm/vmx/tdx.h                |   7 ++
>  arch/x86/kvm/vmx/x86_ops.h            |  10 +-
>  arch/x86/kvm/x86.c                    |   6 ++
>  tools/arch/x86/include/uapi/asm/kvm.h |   1 +
>  9 files changed, 178 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> index 1a27f3aee982..e3e9b1c2599b 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -123,6 +123,7 @@ KVM_X86_OP(enable_smi_window)
>  #endif
>  KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
>  KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> +KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
>  KVM_X86_OP_OPTIONAL(mem_enc_register_region)
>  KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
>  KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index 30f4ddb18548..35773f925cc5 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1698,6 +1698,7 @@ struct kvm_x86_ops {
>  
>  	int (*dev_mem_enc_ioctl)(void __user *argp);
>  	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
> +	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
>  	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
>  	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
>  	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
> diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> index b8f28d86d4fd..9236c1699c48 100644
> --- a/arch/x86/include/uapi/asm/kvm.h
> +++ b/arch/x86/include/uapi/asm/kvm.h
> @@ -536,6 +536,7 @@ struct kvm_pmu_event_filter {
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
>  	KVM_TDX_INIT_VM,
> +	KVM_TDX_INIT_VCPU,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 59813ca05f36..23b3ffc3fe23 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -103,6 +103,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  	return tdx_vm_ioctl(kvm, argp);
>  }
>  
> +static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> +{
> +	if (!is_td_vcpu(vcpu))
> +		return -EINVAL;
> +
> +	return tdx_vcpu_ioctl(vcpu, argp);
> +}
> +
>  struct kvm_x86_ops vt_x86_ops __initdata = {
>  	.name = KBUILD_MODNAME,
>  
> @@ -249,6 +257,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>  
>  	.dev_mem_enc_ioctl = tdx_dev_ioctl,
>  	.mem_enc_ioctl = vt_mem_enc_ioctl,
> +	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
>  };
>  
>  struct kvm_x86_init_ops vt_init_ops __initdata = {
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 099f0737a5aa..e2f5a07ad4e5 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -49,6 +49,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
>  	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
>  }
>  
> +static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> +{
> +	return tdx->tdvpr_pa;
> +}
> +
>  static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
>  {
>  	return kvm_tdx->tdr_pa;
> @@ -65,6 +70,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
>  	return kvm_tdx->hkid > 0;
>  }
>  
> +static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> +{
> +	return kvm_tdx->finalized;
> +}
> +
>  static void tdx_clear_page(unsigned long page_pa)
>  {
>  	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> @@ -327,7 +337,21 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
>  {
> -	/* This is stub for now.  More logic will come. */
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	int i;
> +
> +	/* Can't reclaim or free pages if teardown failed. */
> +	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
> +		return;

You may want to WARN() if it's a kernel bug you want to catch.
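
E.g. (untested):

	if (WARN_ON_ONCE(is_hkid_assigned(to_kvm_tdx(vcpu->kvm))))
		return;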
> +
> +	if (tdx->tdvpx_pa) {
> +		for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> +			tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
> +		kfree(tdx->tdvpx_pa);
> +		tdx->tdvpx_pa = NULL;
> +	}
> +	tdx_reclaim_td_page(tdx->tdvpr_pa);
> +	tdx->tdvpr_pa = 0;
>  }
>  
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> @@ -337,6 +361,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
>  	/* TDX doesn't support INIT event. */
>  	if (WARN_ON_ONCE(init_event))
>  		goto td_bugged;
> +	if (WARN_ON_ONCE(is_td_vcpu_created(to_tdx(vcpu))))
> +		goto td_bugged;

Again, not sure whether we can use KVM_BUG_ON()?

>  
>  	/* TDX rquires X2APIC. */
>  	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> @@ -791,6 +817,125 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
>  	return r;
>  }
>  
> +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	unsigned long *tdvpx_pa = NULL;
> +	unsigned long tdvpr_pa;
> +	unsigned long va;
> +	int ret, i;
> +	u64 err;
> +
> +	if (is_td_vcpu_created(tdx))
> +		return -EINVAL;

Ditto.  WARN()?

> +
> +	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +	if (!va)
> +		return -ENOMEM;
> +	tdvpr_pa = __pa(va);
> +
> +	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
> +			   GFP_KERNEL_ACCOUNT | __GFP_ZERO);

kcalloc() uses __GFP_ZERO internally.
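
I.e. the flag can simply be dropped:

	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
			   GFP_KERNEL_ACCOUNT);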

> +	if (!tdvpx_pa) {
> +		ret = -ENOMEM;
> +		goto free_tdvpr;
> +	}
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> +		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> +		if (!va)
> +			goto free_tdvpx;
> +		tdvpx_pa[i] = __pa(va);
> +	}
> +
> +	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> +	if (WARN_ON_ONCE(err)) {
> +		ret = -EIO;
> +		pr_tdx_error(TDH_VP_CREATE, err, NULL);
> +		goto td_bugged_free_tdvpx;
> +	}
> +	tdx->tdvpr_pa = tdvpr_pa;
> +
> +	tdx->tdvpx_pa = tdvpx_pa;
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> +		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> +		if (WARN_ON_ONCE(err)) {
> +			ret = -EIO;
> +			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> +			for (; i < tdx_caps.tdvpx_nr_pages; i++) {
> +				free_page((unsigned long)__va(tdvpx_pa[i]));
> +				tdvpx_pa[i] = 0;
> +			}
> +			goto td_bugged;
> +		}
> +	}
> +
> +	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> +	if (WARN_ON_ONCE(err)) {
> +		ret = -EIO;
> +		pr_tdx_error(TDH_VP_INIT, err, NULL);
> +		goto td_bugged;
> +	}
> +
> +	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> +
> +	return 0;
> +
> +td_bugged_free_tdvpx:
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> +		free_page((unsigned long)__va(tdvpx_pa[i]));
> +		tdvpx_pa[i] = 0;
> +	}
> +	kfree(tdvpx_pa);
> +td_bugged:
> +	vcpu->kvm->vm_bugged = true;
> +	return ret;
> +
> +free_tdvpx:
> +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> +		if (tdvpx_pa[i])
> +			free_page((unsigned long)__va(tdvpx_pa[i]));
> +	kfree(tdvpx_pa);

This piece of code appears 3 times in this function (and there are 3 'return
ret;').  I am sure it can be done in one place instead. Can you reorganize?
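
E.g. something along these lines could be called from the error paths
(hypothetical sketch, name made up; the per-path differences would still need
care):

	static void tdx_free_vp_pages(unsigned long tdvpr_pa, unsigned long *tdvpx_pa)
	{
		int i;

		if (tdvpx_pa) {
			for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
				if (tdvpx_pa[i])
					free_page((unsigned long)__va(tdvpx_pa[i]));
			}
			kfree(tdvpx_pa);
		}

		if (tdvpr_pa)
			free_page((unsigned long)__va(tdvpr_pa));
	}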

> +	tdx->tdvpx_pa = NULL;
> +free_tdvpr:
> +	if (tdvpr_pa)
> +		free_page((unsigned long)__va(tdvpr_pa));
> +	tdx->tdvpr_pa = 0;
> +
> +	return ret;
> +}
> +
> +int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> +{
> +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> +	struct kvm_tdx_cmd cmd;
> +	int ret;
> +
> +	if (tdx->vcpu_initialized)
> +		return -EINVAL;
> +
> +	if (!is_hkid_assigned(kvm_tdx) || is_td_finalized(kvm_tdx))
> +		return -EINVAL;
> +
> +	if (copy_from_user(&cmd, argp, sizeof(cmd)))
> +		return -EFAULT;
> +
> +	if (cmd.error || cmd.unused)
> +		return -EINVAL;
> +
> +	/* Currently only KVM_TDX_INTI_VCPU is defined for vcpu operation. */
> +	if (cmd.flags || cmd.id != KVM_TDX_INIT_VCPU)
> +		return -EINVAL;
> +
> +	ret = tdx_td_vcpu_init(vcpu, (u64)cmd.data);
> +	if (ret)
> +		return ret;
> +
> +	tdx->vcpu_initialized = true;
> +	return 0;
> +}
> +
>  static int __init tdx_module_setup(void)
>  {
>  	const struct tdsysinfo_struct *tdsysinfo;
> diff --git a/arch/x86/kvm/vmx/tdx.h b/arch/x86/kvm/vmx/tdx.h
> index af7fdc1516d5..e909883d60fa 100644
> --- a/arch/x86/kvm/vmx/tdx.h
> +++ b/arch/x86/kvm/vmx/tdx.h
> @@ -17,12 +17,19 @@ struct kvm_tdx {
>  	u64 xfam;
>  	int hkid;
>  
> +	bool finalized;
> +
>  	u64 tsc_offset;
>  };
>  
>  struct vcpu_tdx {
>  	struct kvm_vcpu	vcpu;
>  
> +	unsigned long tdvpr_pa;
> +	unsigned long *tdvpx_pa;
> +
> +	bool vcpu_initialized;

The 'vcpu_' prefix is kinda redundant.

> +
>  	/*
>  	 * Dummy to make pmu_intel not corrupt memory.
>  	 * TODO: Support PMU for TDX.  Future work.
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 37ab2cfd35bc..fba8d0800597 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -148,11 +148,12 @@ int tdx_vm_init(struct kvm *kvm);
>  void tdx_mmu_release_hkid(struct kvm *kvm);
>  void tdx_vm_free(struct kvm *kvm);
>  
> -int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> -
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +
> +int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> +int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);

Why bother moving the tdx_vm_ioctl() declaration?

>  #else
>  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
>  static inline void tdx_hardware_unsetup(void) {}
> @@ -165,11 +166,12 @@ static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
>  static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
>  static inline void tdx_vm_free(struct kvm *kvm) {}
>  
> -static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> -
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +
> +static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> +static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
>  #endif
>  
>  #endif /* __KVM_X86_VMX_X86_OPS_H */
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index e8bc66031a1d..d548d3af6428 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5976,6 +5976,12 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>  	case KVM_SET_DEVICE_ATTR:
>  		r = kvm_vcpu_ioctl_device_attr(vcpu, ioctl, argp);
>  		break;
> +	case KVM_MEMORY_ENCRYPT_OP:
> +		r = -ENOTTY;
> +		if (!kvm_x86_ops.vcpu_mem_enc_ioctl)
> +			goto out;
> +		r = kvm_x86_ops.vcpu_mem_enc_ioctl(vcpu, argp);
> +		break;
>  	default:
>  		r = -EINVAL;
>  	}
> diff --git a/tools/arch/x86/include/uapi/asm/kvm.h b/tools/arch/x86/include/uapi/asm/kvm.h
> index eb800965b589..6971f1288043 100644
> --- a/tools/arch/x86/include/uapi/asm/kvm.h
> +++ b/tools/arch/x86/include/uapi/asm/kvm.h
> @@ -531,6 +531,7 @@ struct kvm_pmu_event_filter {
>  enum kvm_tdx_cmd_id {
>  	KVM_TDX_CAPABILITIES = 0,
>  	KVM_TDX_INIT_VM,
> +	KVM_TDX_INIT_VCPU,
>  
>  	KVM_TDX_CMD_NR_MAX,
>  };



^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-17 21:01               ` Sean Christopherson
@ 2023-01-19 11:31                 ` Huang, Kai
  2023-01-19 15:37                   ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 11:31 UTC (permalink / raw)
  To: zhi.wang.linux, Christopherson,, Sean
  Cc: sean.j.christopherson, linux-kernel, Shahar, Sagi,
	isaku.yamahata, Aktas, Erdem, dmatlack, kvm, Yamahata, Isaku,
	pbonzini

On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > 2) As TDX module doesn't provide contention-and-wait, I guess the following
> > > approach might have been discussed when designing this "retry".
> > > 
> > > KERNEL                          TDX MODULE
> > > 
> > > SEAMCALL A   ->                 PATH A: Taking locks
> > > 
> > > SEAMCALL B   ->                 PATH B: Contention on a lock
> > > 
> > >              <-                 Return "operand busy"
> > > 
> > > SEAMCALL B   -|
> > >               |  <- Wait on a kernel waitqueue
> > > SEAMCALL B  <-|
> > > 
> > > SEAMCALL A   <-                 PATH A: Return
> > > 
> > > SEAMCALL A   -|
> > >               |  <- Wake up the waitqueue
> > > SEAMCALL A  <-| 
> > > 
> > > SEAMCALL B  ->                  PATH B: Taking the locks
> > > ...
> > > 
> > > Why wasn't this scheme chosen?
> > 
> > AFAIK, I don't think a waitqueue approach has ever been discussed publicly.  Intel
> > may have considered the idea internally, but I don't recall anything being proposed
> > publicly (though it's entirely possible I just missed the discussion).
> > 
> > Anyway, I don't think a waitqueue would be a good fit, at least not for S-EPT
> > management, which AFAICT is the only scenario where KVM does the arbitrary "retry
> > X times and hope things work".  If the contention occurs due to the TDX Module
> > taking an S-EPT lock in VM-Enter, then KVM won't get a chance to do the "Wake up
> > the waitqueue" action until the next VM-Exit, which IIUC is well after the TDX
> > Module drops the S-EPT lock.  In other words, immediately retrying and then punting
> > the problem further up the stack in KVM does seem to be the least awful "solution"
> > if there's contention.
> 
> Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> (conditionally) disallow sleeping.

Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
retrying "X times",  at least at those paths that can release mmu_lock()? 
Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
rwlock_needbreak().  I haven't thought about details though.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 039/113] KVM: x86/mmu: Require TDP MMU for TDX
  2023-01-12 16:31 ` [PATCH v11 039/113] KVM: x86/mmu: Require TDP MMU for TDX isaku.yamahata
@ 2023-01-19 11:37   ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 11:37 UTC (permalink / raw)
  To: kvm, linux-kernel, Yamahata, Isaku
  Cc: pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Require the TDP MMU for guest TDs, the so called "shadow" MMU does not
> support mapping guest private memory, i.e. does not support Secure-EPT.

It's not accurate.  The legacy MMU using EPT can also support it.  We just
choose not to support it.  Please revisit.

> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index fdcff390ebc2..6c3ce4121a46 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -27,6 +27,13 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
>  	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !enable_mmio_caching)
>  		return -EOPNOTSUPP;
>  
> +	/*
> +	 * Because only the TDP MMU supports TDX, require the TDP MMU for guest
> +	 * TDs.
> +	 */
> +	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !tdp_enabled)
> +		return -EOPNOTSUPP;
> +
>  	if (!tdp_enabled || !READ_ONCE(tdp_mmu_enabled))
>  		return 0;
>  


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
       [not found]   ` <080e0a246e927545718b6f427dfdcdde505a8859.camel@intel.com>
@ 2023-01-19 15:29     ` Sean Christopherson
  2023-01-19 20:40       ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-19 15:29 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, sean.j.christopherson,
	pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack

On Thu, Jan 19, 2023, Huang, Kai wrote:
> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > +static void tdx_clear_page(unsigned long page_pa)
> > +{
> > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > +	void *page = __va(page_pa);
> > +	unsigned long i;
> > +
> > +	if (!static_cpu_has(X86_FEATURE_MOVDIR64B)) {
> > +		clear_page(page);
> > +		return;
> > +	}
> 
> There might be below issues here:
> 
> 1) The kernel says static_cpu_has() should only be used in fast paths where each
> cycle is counted, otherwise use boot_cpu_has().  I don't know whether you should
> use static_cpu_has() here.

That documentation is stale[*], go ahead and use cpu_feature_enabled().

https://lore.kernel.org/all/20221107211505.8572-1-bp@alien8.de
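
I.e. (untested):

	if (!cpu_feature_enabled(X86_FEATURE_MOVDIR64B)) {
		clear_page(page);
		return;
	}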

> 2) IIUC a CPU feature bit can be cleared by 'clearcpuid=xxx' kernel command

As you note below, using clearcpuid taints the kernel, i.e. any breakage due to
clearcpuid is user error.

> line, so it looks like you should use CPUID directly, otherwise the MOVDIR64B
> below can be unintentionally skipped.  In practice, not doing MOVDIR64B is fine
> since KeyID 0 doesn't have integrity enabled, but for the purpose you want to
> achieve, checking the real CPUID should be better.
> 
> But maybe you don't want to do a CPUID check here each time when reclaiming a
> page.  In that case you can do CPUID during module initialization and cache
> whether MOVDIR64B is truly present.  A static_key is a good fit for this purpose
> too, I think.
> 
> But I am also seeing below in the kernel documentation:
> 
>         clearcpuid=X[,X...] [X86]
> 			......
>                         Note that using this option will taint your kernel.
>                         Also note that user programs calling CPUID directly
>                         or using the feature without checking anything
>                         will still see it. This just prevents it from
>                         being used by the kernel or shown in /proc/cpuinfo.
>                         Also note the kernel might malfunction if you disable
>                         some critical bits.
> 
> So the kernel is claiming that using this will taint the kernel and it can even
> malfunction.  So maybe it's OK to use static_cpu_has()/boot_cpu_has().

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 11:31                 ` Huang, Kai
@ 2023-01-19 15:37                   ` Sean Christopherson
  2023-01-19 20:39                     ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-19 15:37 UTC (permalink / raw)
  To: Huang, Kai
  Cc: zhi.wang.linux, sean.j.christopherson, linux-kernel, Shahar,
	Sagi, isaku.yamahata, Aktas, Erdem, dmatlack, kvm, Yamahata,
	Isaku, pbonzini

On Thu, Jan 19, 2023, Huang, Kai wrote:
> On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > 2) As TDX module doesn't provide contention-and-wait, I guess the following
> > > > approach might have been discussed when designing this "retry".
> > > > 
> > > > KERNEL                          TDX MODULE
> > > > 
> > > > SEAMCALL A   ->                 PATH A: Taking locks
> > > > 
> > > > SEAMCALL B   ->                 PATH B: Contention on a lock
> > > > 
> > > >              <-                 Return "operand busy"
> > > > 
> > > > SEAMCALL B   -|
> > > >               |  <- Wait on a kernel waitqueue
> > > > SEAMCALL B  <-|
> > > > 
> > > > SEAMCALL A   <-                 PATH A: Return
> > > > 
> > > > SEAMCALL A   -|
> > > >               |  <- Wake up the waitqueue
> > > > SEAMCALL A  <-| 
> > > > 
> > > > SEAMCALL B  ->                  PATH B: Taking the locks
> > > > ...
> > > > 
> > > > Why wasn't this scheme chosen?
> > > 
> > > AFAIK, I don't think a waitqueue approach has ever been discussed publicly.  Intel
> > > may have considered the idea internally, but I don't recall anything being proposed
> > > publicly (though it's entirely possible I just missed the discussion).
> > > 
> > > Anyway, I don't think a waitqueue would be a good fit, at least not for S-EPT
> > > management, which AFAICT is the only scenario where KVM does the arbitrary "retry
> > > X times and hope things work".  If the contention occurs due to the TDX Module
> > > taking an S-EPT lock in VM-Enter, then KVM won't get a chance to do the "Wake up
> > > the waitqueue" action until the next VM-Exit, which IIUC is well after the TDX
> > > Module drops the S-EPT lock.  In other words, immediately retrying and then punting
> > > the problem further up the stack in KVM does seem to be the least awful "solution"
> > > if there's contention.
> > 
> > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > (conditionally) disallow sleeping.
> 
> Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> retrying "X times",  at least at those paths that can release mmu_lock()?

That's effectively what happens by unwinding up the stack with an error code.
Eventually the page fault handler will get the error and retry the guest.

> Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> rwlock_needbreak().  I haven't thought about details though.

I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
with a bunch of retry logic and error handling just because the TDX module is
ultra paranoid and hostile to hypervisors.

The problematic scenario of faulting indefinitely on a single instruction should
never happen under normal circumstances, and so KVM should treat such scenarios
as attacks/breakage and pass the buck to userspace.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 15:37                   ` Sean Christopherson
@ 2023-01-19 20:39                     ` Huang, Kai
  2023-01-19 21:36                       ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 20:39 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: pbonzini, isaku.yamahata, Shahar, Sagi, linux-kernel, Aktas,
	Erdem, kvm, dmatlack, zhi.wang.linux, sean.j.christopherson,
	Yamahata, Isaku

On Thu, 2023-01-19 at 15:37 +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Huang, Kai wrote:
> > On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > > 2) As TDX module doesn't provide contention-and-wait, I guess the following
> > > > > approach might have been discussed when designing this "retry".
> > > > > 
> > > > > KERNEL                          TDX MODULE
> > > > > 
> > > > > SEAMCALL A   ->                 PATH A: Taking locks
> > > > > 
> > > > > SEAMCALL B   ->                 PATH B: Contention on a lock
> > > > > 
> > > > >              <-                 Return "operand busy"
> > > > > 
> > > > > SEAMCALL B   -|
> > > > >               |  <- Wait on a kernel waitqueue
> > > > > SEAMCALL B  <-|
> > > > > 
> > > > > SEAMCALL A   <-                 PATH A: Return
> > > > > 
> > > > > SEAMCALL A   -|
> > > > >               |  <- Wake up the waitqueue
> > > > > SEAMCALL A  <-| 
> > > > > 
> > > > > SEAMCALL B  ->                  PATH B: Taking the locks
> > > > > ...
> > > > > 
> > > > > Why wasn't this scheme chosen?
> > > > 
> > > > AFAIK, I don't think a waitqueue approach has ever been discussed publicly.  Intel
> > > > may have considered the idea internally, but I don't recall anything being proposed
> > > > publicly (though it's entirely possible I just missed the discussion).
> > > > 
> > > > Anyway, I don't think a waitqueue would be a good fit, at least not for S-EPT
> > > > management, which AFAICT is the only scenario where KVM does the arbitrary "retry
> > > > X times and hope things work".  If the contention occurs due to the TDX Module
> > > > taking an S-EPT lock in VM-Enter, then KVM won't get a chance to do the "Wake up
> > > > the waitqueue" action until the next VM-Exit, which IIUC is well after the TDX
> > > > Module drops the S-EPT lock.  In other words, immediately retrying and then punting
> > > > the problem further up the stack in KVM does seem to be the least awful "solution"
> > > > if there's contention.
> > > 
> > > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > > (conditionally) disallow sleeping.
> > 
> > Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> > retrying "X times",  at least at those paths that can release mmu_lock()?
> 
> That's effectively what happens by unwinding up the stack with an error code.
> Eventually the page fault handler will get the error and retry the guest.
> 
> > Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> > rwlock_needbreak().  I haven't thought about details though.
> 
> I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
> with a bunch of retry logic and error handling just because the TDX module is
> ultra paranoid and hostile to hypervisors.

Right.  But IIUC there are legitimate cases where an SEPT SEAMCALL can return
BUSY due to multiple threads trying to read/modify the SEPT simultaneously with
the TDP MMU.  For instance, parallel page faults on different vcpus on private
pages.  I believe this is the main reason to retry.  We previously used a
spinlock around the SEAMCALLs to avoid this, but it looks like that is not
preferred.

> 
> The problematic scenario of faulting indefinitely on a single instruction should
> never happen under normal circumstances, and so KVM should treat such scenarios
> as attacks/breakage and pass the buck to userspace.

Totally agree that the zero-step attack case can be treated as a KVM bug.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 15:29     ` Sean Christopherson
@ 2023-01-19 20:40       ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 20:40 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: dmatlack, linux-kernel, Shahar, Sagi, isaku.yamahata, Aktas,
	Erdem, kvm, pbonzini, Yamahata, Isaku, sean.j.christopherson

On Thu, 2023-01-19 at 15:29 +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > > +static void tdx_clear_page(unsigned long page_pa)
> > > +{
> > > +	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > +	void *page = __va(page_pa);
> > > +	unsigned long i;
> > > +
> > > +	if (!static_cpu_has(X86_FEATURE_MOVDIR64B)) {
> > > +		clear_page(page);
> > > +		return;
> > > +	}
> > 
> > There might be below issues here:
> > 
> > 1) The kernel says static_cpu_has() should only be used in fast paths where each
> > cycle is counted, otherwise use boot_cpu_has().  I don't know whether you should
> > use static_cpu_has() here.
> 
> That documentation is stale[*], go ahead and use cpu_feature_enabled().
> 
> https://lore.kernel.org/all/20221107211505.8572-1-bp@alien8.de

Thanks for the info :)

> 
> > 2) IIUC a CPU feature bit can be cleared by 'clearcpuid=xxx' kernel command
> 
> As you note below, using clearcpuid taints the kernel, i.e. any breakage due to
> clearcpuid is user error.

Agreed.



^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 20:39                     ` Huang, Kai
@ 2023-01-19 21:36                       ` Sean Christopherson
  2023-01-19 23:08                         ` Huang, Kai
  2023-01-19 23:55                         ` Huang, Kai
  0 siblings, 2 replies; 221+ messages in thread
From: Sean Christopherson @ 2023-01-19 21:36 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini, isaku.yamahata, Shahar, Sagi, linux-kernel, Aktas,
	Erdem, kvm, dmatlack, zhi.wang.linux, sean.j.christopherson,
	Yamahata, Isaku

On Thu, Jan 19, 2023, Huang, Kai wrote:
> On Thu, 2023-01-19 at 15:37 +0000, Sean Christopherson wrote:
> > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > > > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > > > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > > > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > > > (conditionally) disallow sleeping.
> > > 
> > > Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> > > retrying "X times",  at least at those paths that can release mmu_lock()?
> > 
> > That's effectively what happens by unwinding up the stack with an error code.
> > Eventually the page fault handler will get the error and retry the guest.
> > 
> > > Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> > > rwlock_needbreak().  I haven't thought about details though.
> > 
> > I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
> > with a bunch of retry logic and error handling just because the TDX module is
> > ultra paranoid and hostile to hypervisors.
> 
> Right.  But IIUC there are legitimate cases where an SEPT SEAMCALL can return
> BUSY due to multiple threads trying to read/modify the SEPT simultaneously with
> the TDP MMU.  For instance, parallel page faults on different vcpus on private
> pages.  I believe this is the main reason to retry.

Um, crud.  I think there's a bigger issue.  KVM always operates on its copy of the
S-EPT tables and assumes that the real S-EPT tables will always be synchronized with
KVM's mirror.  That assumption doesn't hold true without serializing SEAMCALLs in
some way.  E.g. if a SPTE is zapped and mapped at the same time, we can end up with:

  vCPU0                      vCPU1
  =====                      =====
  mirror[x] = xyz
                             old_spte = mirror[x]
                             mirror[x] = REMOVED_SPTE
                             sept[x] = REMOVED_SPTE
  sept[x] = xyz

In other words, when mmu_lock is held for read, KVM relies on atomic SPTE updates.
With the mirror=>s-ept scheme, updates are no longer atomic and everything falls
apart.
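
For reference, the lock-free TDP MMU path boils down to a single atomic
compare-and-exchange of the mirror SPTE, roughly (simplified from the existing
tdp_mmu_set_spte_atomic() for illustration):

	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
		return -EBUSY;

There's no way to stretch that one cmpxchg to also cover the separate SEAMCALL
that updates the real S-EPT entry.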

Gracefully retrying only papers over the visible failures; the really problematic
scenarios are where multiple updates race and _don't_ trigger conflicts in the TDX
module.

> We previously used a spinlock around the SEAMCALLs to avoid this, but it looks
> like that is not preferred.

That doesn't address the race above either.  And even if it did, serializing all
S-EPT SEAMCALLs for a VM is not an option, at least not in the long term.

The least invasive idea I have is to expand the TDP MMU's concept of "frozen" SPTEs
and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
update completes.

The other idea is to scrap the mirror concept entirely, though I gotta imagine
that would provide pretty awful performance.

diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 0d8deefee66c..bcb398e71475 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -198,9 +198,9 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
 /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
 static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
 
-static inline bool is_removed_spte(u64 spte)
+static inline bool is_frozen_spte(u64 spte)
 {
-       return spte == REMOVED_SPTE;
+       return spte == REMOVED_SPTE || spte & FROZEN_SPTE;
 }
 
 /* Get an SPTE's index into its parent's page table (and the spt array). */
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index bba33aea0fb0..7f34eccadf98 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -651,6 +651,9 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
 
        lockdep_assert_held_read(&kvm->mmu_lock);
 
+       if (<is TDX> && new_spte != REMOVED_SPTE)
+               new_spte |= FROZEN_SPTE;
+
        /*
         * Note, fast_pf_fix_direct_spte() can also modify TDP MMU SPTEs and
         * does not hold the mmu_lock.
@@ -662,6 +665,9 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
                              new_spte, iter->level, true);
        handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
 
+       if (<is TDX> && new_spte != REMOVED_SPTE)
+               __kvm_tdp_mmu_write_spte(iter->sptep, new_spte);
+
        return 0;
 }
 


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-17 20:56             ` Sean Christopherson
  2023-01-17 21:01               ` Sean Christopherson
@ 2023-01-19 22:45               ` Zhi Wang
  2023-01-19 22:51                 ` Sean Christopherson
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-19 22:45 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Tue, 17 Jan 2023 20:56:17 +0000
Sean Christopherson <seanjc@google.com> wrote:

> On Tue, Jan 17, 2023, Zhi Wang wrote:
> > On Tue, 17 Jan 2023 15:55:53 +0000
> > Sean Christopherson <seanjc@google.com> wrote:
> > 
> > > On Sat, Jan 14, 2023, Zhi Wang wrote:
> > > > On Fri, 13 Jan 2023 15:16:08 +0000 > Sean Christopherson <seanjc@google.com> wrote:
> > > > 
> > > > > On Fri, Jan 13, 2023, Zhi Wang wrote:
> > > > > > Better add a FIXME: here as this has to be fixed later.
> > > > > 
> > > > > No, leaking the page is all KVM can reasonably do here.  An improved
> > > > > comment would be helpful, but no code change is required.
> > > > > tdx_reclaim_page() returns an error if and only if there's an
> > > > > unexpected, fatal error, e.g. a SEAMCALL with bad params, incorrect
> > > > > concurrency in KVM, a TDX Module bug, etc.  Retrying at a later point is
> > > > > highly unlikely to be successful.
> > > > 
> > > > Hi:
> > > > 
> > > > The word "leaking" sounds like a situation left unhandled temporarily.
> > > > 
> > > > I checked the source code of the TDX module[1] for the possible reason to
> > > > fail when reviewing this patch:
> > > > 
> > > > tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_reclaim.c
> > > > tdx-module-v1.0.01.01.zip\src\vmm_dispatcher\api_calls\tdh_phymem_page_wbinvd.c
> > > > 
> > > > a. Invalid parameters. For example, page is not aligned, PA HKID is not zero...
> > > > 
> > > > For invalid parameters, a WARN_ON_ONCE() + return value is good enough as
> > > > that is how kernel handles similar situations. The caller takes the
> > > > responsibility.
> > > >  
> > > > b. Locks has been taken in TDX module. TDR page has been locked due to another
> > > > SEAMCALL, another SEAMCALL is doing PAMT walk and holding PAMT lock... 
> > > > 
> > > > This needs to be improved later either by retry or taking tdx_lock to avoid
> > > > TDX module fails on this.
> > > 
> > > No, tdx_reclaim_page() already retries TDH.PHYMEM.PAGE.RECLAIM if the target page
> > > is contended (though I'd question the validity of even that), and TDH.PHYMEM.PAGE.WBINVD
> > > is performed only when reclaiming the TDR.  If there's contention when reclaiming
> > > the TDR, then KVM effectively has a use-after-free bug, i.e. leaking the page is
> > > the least of our worries.
> > > 
> > 
> > Hi:
> > 
> > Thanks for the reply. "Leaking" is the consquence of even failing in retry. I
> > agree with this. But I was questioning if "retry" is really a correct and only
> > solution when encountering lock contention in the TDX module as I saw that there
> > are quite some magic numbers are going to be introduced because of "retry" and
> > there were discussions about times of retry should be 3 or 1000 in TDX guest
> > on hyper-V patches. It doesn't sound right.
> 
> Ah, yeah, I'm speaking only with respect to leaking pages on failure in this
> specific scenario.
> 
> > Compare to an typical *kernel lock* case, an execution path can wait on a
> > waitqueue and later will be woken up. We usually do contention-wait-and-retry
> > and we rarely just do contention and retry X times. In TDX case, I understand
> > that it is hard for the TDX module to provide similar solutions as an execution
> > path can't stay long in the TDX module.
> 
> Let me preface the below comments by saying that this is the first time that I've
> seen the "Single-Step and Zero-Step Attacks Mitigation Mechanisms" behavior, i.e.
> the first time I've been made aware that the TDX Module can apparently decide
> to take S-EPT locks in the VM-Enter path.
> 
> > 
> > 1) We can always take tdx_lock (linux kernel lock) when calling a SEAMCALL
> > that touch the TDX internal locks. But the downside is we might lose some
> > concurrency.
> 
> This isn't really feasible in practice.  Even if tdx_lock were made a spinlock
> (it's currently a mutex) so that it could it could be taken inside kvm->mmu_lock,
> acquiring a per-VM lock, let alone a global lock, in KVM's page fault handling
> path is not an option.  KVM has a hard requirement these days of being able to
> handle multiple page faults in parallel.
> 
> > 2) As TDX module doesn't provide contention-and-wait, I guess the following
> > approach might have been discussed when designing this "retry".
> > 
> > KERNEL                          TDX MODULE
> > 
> > SEAMCALL A   ->                 PATH A: Taking locks
> > 
> > SEAMCALL B   ->                 PATH B: Contention on a lock
> > 
> >              <-                 Return "operand busy"
> > 
> > SEAMCALL B   -|
> >               |  <- Wait on a kernel waitqueue
> > SEAMCALL B  <-|
> > 
> > SEAMCALL A   <-                 PATH A: Return
> > 
> > SEAMCALL A   -|
> >               |  <- Wake up the waitqueue
> > SEMACALL A  <-| 
> > 
> > SEAMCALL B  ->                  PATH B: Taking the locks
> > ...
> > 
> > Why not this scheme wasn't chosen?
> 
> AFAIK, I don't think a waitqueue approach as ever been discussed publicly.  Intel
> may have considered the idea internally, but I don't recall anything being proposed
> publically (though it's entirely possible I just missed the discussion).
> 
> Anways, I don't think a waitqueue would be a good fit, at least not for S-EPT
> management, which AFAICT is the only scenario where KVM does the arbitrary "retry
> X times and hope things work".  If the contention occurs due to the TDX Module
> taking an S-EPT lock in VM-Enter, then KVM won't get a chance to do the "Wake up
> the waitqueue" action until the next VM-Exit, which IIUC is well after the TDX
> Module drops the S-EPT lock.  In other words, immediately retrying and then punting
> the problem further up the stack in KVM does seem to be the least awful "solution"
> if there's contention.
>

I should have put both "waitqueue or spin" in the chart above to prevent misunderstanding.

I agree that S-EPT should be thought of and treated differently from the others, as
any lock change on that path can affect parallelization and performance. There are
many internal locks protecting private data structures inside the security
firmware, like locks for TDR, TDCS, TDVPS, and PAMT (the last being mostly SEPT-related).
Contention on any of these might result in a BUSY.

What I would like to clarify is: it would be better to have a mechanism that makes sure a
SEAMCALL can succeed at a certain point (or at least keeps trying to succeed), as the
magic number would work like a lottery in reality. E.g. the CPU doing the retry might be
running at a higher frequency than the CPU executing the path that took the TDX internal
locks, due to different P-states, so the retry loop can end much earlier than expected.
That's before even considering the same linux kernel binary, with the same pre-determined
retry count, running on different CPU SKUs with different frequencies.
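
For reference, the pattern I am questioning is essentially a fixed-iteration retry
around the SEAMCALL.  A rough sketch (the retry count and do_sept_seamcall() below
are made-up placeholders, only to illustrate the shape of the loop):

/*
 * The loop bound is a fixed "magic number" with no relation to how long
 * another CPU actually holds the TDX-internal lock.
 */
#define TDX_SEPT_RETRIES	5	/* arbitrary */

static u64 tdx_sept_seamcall_retry(struct kvm *kvm, gpa_t gpa)
{
	u64 err;
	int i;

	for (i = 0; i < TDX_SEPT_RETRIES; i++) {
		err = do_sept_seamcall(kvm, gpa);
		if (err != (TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT))
			break;
		cpu_relax();	/* spin and hope the contender is done */
	}
	return err;
}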

We sometimes retry in a driver while waiting for HW state to become ready because the
clock of the device is fixed or predictable. But I doubt that a retry loop works in the
case mentioned above.

It still looks like a temporary workaround to me, and I think we should be quite
careful with it.

> That said, I 100% agree that the arbitrary retry stuff is awful.  The zero-step
> interaction in particular isn't acceptable.
> 
> Intel folks, encountering "TDX_OPERAND_BUSY | TDX_OPERAND_ID_SEPT" on VM-Enter
> needs to be treated as a KVM bug, even if it means teaching KVM to kill the VM
> if a vCPU is on the cusp of triggerring the "pre-determined number" of EPT faults
> mentioned in this snippet:
> 
>   After a pre-determined number of such EPT violations occur on the same instruction,
>   the TDX module starts tracking the GPAs that caused Secure EPT faults and fails
>   further host VMM attempts to enter the TD VCPU unless previously faulting private
>   GPAs are properly mapped in the Secure EPT.
> 
> If the "pre-determined number" is too low to avoid false positives, e.g. if it can
> be tripped by a vCPU spinning while a different vCPU finishes handling a fault,
> then either the behavior of the TDX Module needs to be revisited, or KVM needs to
> stall vCPUs that are approaching the threshold until all outstanding S-EPT faults
> have been serviced.
>

That sharply hits the point. Let's keep this in mind and start from here.

I also saw that the AMD SEV-SNP host patches use a retry count like (2 * cpu cores),
but I haven't seen any explanation in the code or comments yet of how that is
supposed to work in reality. I will ask this question in that patch review. I am
starting to feel this is a common problem that needs to be sorted out.

> KVM shouldn't be spuriously zapping private S-EPT entries since the backing memory
> is pinned, which means the only way for a vCPU to truly get stuck faulting on a
> single instruction is if userspace is broken/malicious, in which case userspace
> gets to keep the pieces.


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 22:45               ` Zhi Wang
@ 2023-01-19 22:51                 ` Sean Christopherson
  0 siblings, 0 replies; 221+ messages in thread
From: Sean Christopherson @ 2023-01-19 22:51 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	Kai Huang

On Fri, Jan 20, 2023, Zhi Wang wrote:
> On Tue, 17 Jan 2023 20:56:17 +0000 Sean Christopherson <seanjc@google.com> wrote:
> What I would like to clarify is: it would be better to have a mechanism to make sure a
> SEAMCALL can succeed at a certain point (or at least trying to succeed) as the 
> magic number would work like a lottery in reality.

Yep, you and I are in agreement on this point.  Ideally, KVM would be able to
guarantee success and treat TDX_OPERAND_BUSY errors as KVM bugs.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 21:36                       ` Sean Christopherson
@ 2023-01-19 23:08                         ` Huang, Kai
  2023-01-19 23:11                           ` Sean Christopherson
  2023-01-19 23:55                         ` Huang, Kai
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 23:08 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: dmatlack, sean.j.christopherson, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-19 at 15:37 +0000, Sean Christopherson wrote:
> > > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > > On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > > > > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > > > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > > > > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > > > > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > > > > (conditionally) disallow sleeping.
> > > > 
> > > > Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> > > > retrying "X times",  at least at those paths that can release mmu_lock()?
> > > 
> > > That's effectively what happens by unwinding up the stak with an error code.
> > > Eventually the page fault handler will get the error and retry the guest.
> > > 
> > > > Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> > > > rwlock_needbreak().  I haven't thought about details though.
> > > 
> > > I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
> > > with a bunch of retry logic and error handling just because the TDX module is
> > > ultra paranoid and hostile to hypervisors.
> > 
> > Right.  But IIUC there's legal cases that SEPT SEAMCALL can return BUSY due to
> > multiple threads trying to read/modify SEPT simultaneously in case of TDP MMU. 
> > For instance, parallel page faults on different vcpus on private pages.  I
> > believe this is the main reason to retry.
> 
> Um, crud.  I think there's a bigger issue.  KVM always operates on its copy of the
> S-EPT tables and assumes the the real S-EPT tables will always be synchronized with
> KVM's mirror.  That assumption doesn't hold true without serializing SEAMCALLs in
> some way.  E.g. if a SPTE is zapped and mapped at the same time, we can end up with:
> 
>   vCPU0                      vCPU1
>   =====                      =====
>   mirror[x] = xyz
>                              old_spte = mirror[x]
>                              mirror[x] = REMOVED_SPTE
>                              sept[x] = REMOVED_SPTE
>   sept[x] = xyz

IIUC this case cannot happen, as the two steps on vcpu0 are done under the read
lock, which excludes vcpu1, since vcpu1 holds the write lock while zapping the SPTE.




^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 23:08                         ` Huang, Kai
@ 2023-01-19 23:11                           ` Sean Christopherson
  2023-01-19 23:24                             ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-19 23:11 UTC (permalink / raw)
  To: Huang, Kai
  Cc: dmatlack, sean.j.christopherson, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, Jan 19, 2023, Huang, Kai wrote:
> On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > On Thu, 2023-01-19 at 15:37 +0000, Sean Christopherson wrote:
> > > > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > > > On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > > > > > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > > > > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > > > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > > > > > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > > > > > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > > > > > (conditionally) disallow sleeping.
> > > > > 
> > > > > Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> > > > > retrying "X times",  at least at those paths that can release mmu_lock()?
> > > > 
> > > > That's effectively what happens by unwinding up the stak with an error code.
> > > > Eventually the page fault handler will get the error and retry the guest.
> > > > 
> > > > > Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> > > > > rwlock_needbreak().  I haven't thought about details though.
> > > > 
> > > > I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
> > > > with a bunch of retry logic and error handling just because the TDX module is
> > > > ultra paranoid and hostile to hypervisors.
> > > 
> > > Right.  But IIUC there's legal cases that SEPT SEAMCALL can return BUSY due to
> > > multiple threads trying to read/modify SEPT simultaneously in case of TDP MMU. 
> > > For instance, parallel page faults on different vcpus on private pages.  I
> > > believe this is the main reason to retry.
> > 
> > Um, crud.  I think there's a bigger issue.  KVM always operates on its copy of the
> > S-EPT tables and assumes the the real S-EPT tables will always be synchronized with
> > KVM's mirror.  That assumption doesn't hold true without serializing SEAMCALLs in
> > some way.  E.g. if a SPTE is zapped and mapped at the same time, we can end up with:
> > 
> >   vCPU0                      vCPU1
> >   =====                      =====
> >   mirror[x] = xyz
> >                              old_spte = mirror[x]
> >                              mirror[x] = REMOVED_SPTE
> >                              sept[x] = REMOVED_SPTE
> >   sept[x] = xyz
> 
> IIUC this case cannot happen, as the two steps in the vcpu0 are within read
> lock, which prevents from vcpu1, which holds the write lock during zapping SPTE.

Zapping SPTEs can happen while holding mmu_lock for read, e.g. see the bug fixed
by commit 21a36ac6b6c7 ("KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage
is disallowed").

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 23:11                           ` Sean Christopherson
@ 2023-01-19 23:24                             ` Huang, Kai
  2023-01-19 23:25                               ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 23:24 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: dmatlack, linux-kernel, Shahar, Sagi, isaku.yamahata, Aktas,
	Erdem, kvm, sean.j.christopherson, zhi.wang.linux, pbonzini,
	Yamahata, Isaku

On Thu, 2023-01-19 at 23:11 +0000, Sean Christopherson wrote:
> On Thu, Jan 19, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > > On Thu, 2023-01-19 at 15:37 +0000, Sean Christopherson wrote:
> > > > > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > > > > On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > > > > > > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > > > > > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > > > > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > > > > > > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > > > > > > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > > > > > > (conditionally) disallow sleeping.
> > > > > > 
> > > > > > Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> > > > > > retrying "X times",  at least at those paths that can release mmu_lock()?
> > > > > 
> > > > > That's effectively what happens by unwinding up the stak with an error code.
> > > > > Eventually the page fault handler will get the error and retry the guest.
> > > > > 
> > > > > > Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> > > > > > rwlock_needbreak().  I haven't thought about details though.
> > > > > 
> > > > > I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
> > > > > with a bunch of retry logic and error handling just because the TDX module is
> > > > > ultra paranoid and hostile to hypervisors.
> > > > 
> > > > Right.  But IIUC there's legal cases that SEPT SEAMCALL can return BUSY due to
> > > > multiple threads trying to read/modify SEPT simultaneously in case of TDP MMU. 
> > > > For instance, parallel page faults on different vcpus on private pages.  I
> > > > believe this is the main reason to retry.
> > > 
> > > Um, crud.  I think there's a bigger issue.  KVM always operates on its copy of the
> > > S-EPT tables and assumes the the real S-EPT tables will always be synchronized with
> > > KVM's mirror.  That assumption doesn't hold true without serializing SEAMCALLs in
> > > some way.  E.g. if a SPTE is zapped and mapped at the same time, we can end up with:
> > > 
> > >   vCPU0                      vCPU1
> > >   =====                      =====
> > >   mirror[x] = xyz
> > >                              old_spte = mirror[x]
> > >                              mirror[x] = REMOVED_SPTE
> > >                              sept[x] = REMOVED_SPTE
> > >   sept[x] = xyz
> > 
> > IIUC this case cannot happen, as the two steps in the vcpu0 are within read
> > lock, which prevents from vcpu1, which holds the write lock during zapping SPTE.
> 
> Zapping SPTEs can happen while holding mmu_lock for read, e.g. see the bug fixed
> by commit 21a36ac6b6c7 ("KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage
> is disallowed").

Sorry, I now recall that I once had a patch to change the write path to hold
write_lock() for TDX guests to avoid such a problem.  I thought that was the upstream
behaviour :)

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 23:24                             ` Huang, Kai
@ 2023-01-19 23:25                               ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 23:25 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: pbonzini, isaku.yamahata, Shahar, Sagi, linux-kernel, Aktas,
	Erdem, kvm, sean.j.christopherson, zhi.wang.linux, dmatlack,
	Yamahata, Isaku

On Thu, 2023-01-19 at 23:24 +0000, Huang, Kai wrote:
> On Thu, 2023-01-19 at 23:11 +0000, Sean Christopherson wrote:
> > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > > > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > > > On Thu, 2023-01-19 at 15:37 +0000, Sean Christopherson wrote:
> > > > > > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > > > > > On Tue, 2023-01-17 at 21:01 +0000, Sean Christopherson wrote:
> > > > > > > > On Tue, Jan 17, 2023, Sean Christopherson wrote:
> > > > > > > > > On Tue, Jan 17, 2023, Zhi Wang wrote:
> > > > > > > > Oh, the other important piece I forgot to mention is that dropping mmu_lock deep
> > > > > > > > in KVM's MMU in order to wait isn't always an option.  Most flows would play nice
> > > > > > > > with dropping mmu_lock and sleeping, but some paths, e.g. from the mmu_notifier,
> > > > > > > > (conditionally) disallow sleeping.
> > > > > > > 
> > > > > > > Could we do something similar to tdp_mmu_iter_cond_resched() but not simple busy
> > > > > > > retrying "X times",  at least at those paths that can release mmu_lock()?
> > > > > > 
> > > > > > That's effectively what happens by unwinding up the stak with an error code.
> > > > > > Eventually the page fault handler will get the error and retry the guest.
> > > > > > 
> > > > > > > Basically we treat TDX_OPERAND_BUSY as seamcall_needbreak(), similar to
> > > > > > > rwlock_needbreak().  I haven't thought about details though.
> > > > > > 
> > > > > > I am strongly opposed to that approach.  I do not want to pollute KVM's MMU code
> > > > > > with a bunch of retry logic and error handling just because the TDX module is
> > > > > > ultra paranoid and hostile to hypervisors.
> > > > > 
> > > > > Right.  But IIUC there's legal cases that SEPT SEAMCALL can return BUSY due to
> > > > > multiple threads trying to read/modify SEPT simultaneously in case of TDP MMU. 
> > > > > For instance, parallel page faults on different vcpus on private pages.  I
> > > > > believe this is the main reason to retry.
> > > > 
> > > > Um, crud.  I think there's a bigger issue.  KVM always operates on its copy of the
> > > > S-EPT tables and assumes the the real S-EPT tables will always be synchronized with
> > > > KVM's mirror.  That assumption doesn't hold true without serializing SEAMCALLs in
> > > > some way.  E.g. if a SPTE is zapped and mapped at the same time, we can end up with:
> > > > 
> > > >   vCPU0                      vCPU1
> > > >   =====                      =====
> > > >   mirror[x] = xyz
> > > >                              old_spte = mirror[x]
> > > >                              mirror[x] = REMOVED_SPTE
> > > >                              sept[x] = REMOVED_SPTE
> > > >   sept[x] = xyz
> > > 
> > > IIUC this case cannot happen, as the two steps in the vcpu0 are within read
> > > lock, which prevents from vcpu1, which holds the write lock during zapping SPTE.
> > 
> > Zapping SPTEs can happen while holding mmu_lock for read, e.g. see the bug fixed
> > by commit 21a36ac6b6c7 ("KVM: x86/mmu: Re-check under lock that TDP MMU SP hugepage
> > is disallowed").
> 
> Sorry, I now recall that I once had patch to change write path to hold
> write_lock() for TDX guest to avoid such problem.  I thought it was upstream
> behaviour :)

I meant changing the zap PTE path to hold the write lock. :)

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 21:36                       ` Sean Christopherson
  2023-01-19 23:08                         ` Huang, Kai
@ 2023-01-19 23:55                         ` Huang, Kai
  2023-01-20  0:16                           ` Sean Christopherson
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-19 23:55 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: dmatlack, sean.j.christopherson, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> The least invasive idea I have is expand the TDP MMU's concept of "frozen" SPTEs
> and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
> update completes.

This will introduce another "having to wait while the SPTE is frozen" problem, I
think, which IIUC means (one way is) you have to do some loop-and-retry, perhaps
similar to the yield_safe logic.

> 
> The other idea is to scrap the mirror concept entirely, though I gotta imagine
> that would provide pretty awful performance.

Right.  I don't think this is a good option.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-19 23:55                         ` Huang, Kai
@ 2023-01-20  0:16                           ` Sean Christopherson
  2023-01-20 22:21                             ` David Matlack
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-20  0:16 UTC (permalink / raw)
  To: Huang, Kai
  Cc: dmatlack, sean.j.christopherson, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, Jan 19, 2023, Huang, Kai wrote:
> On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > The least invasive idea I have is expand the TDP MMU's concept of "frozen" SPTEs
> > and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
> > update completes.
> 
> This will introduce another "having-to-wait while SPTE is frozen" problem I
> think, which IIUC means (one way is) you have to do some loop and retry, perhaps
> similar to yield_safe.

Yes, but because the TDP MMU already freezes SPTEs (just for a shorter duration),
I'm 99% sure all of the affected flows already know how to yield/bail when necessary.

The problem with the zero-step mitigation is that it could (theoretically) cause
a "busy" error on literally any accesses, which makes it infeasible for KVM to have
sane behavior.  E.g. freezing SPTEs to avoid the ordering issues isn't necessary
when holding mmu_lock for write, whereas the zero-step madness brings everything
into play.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-20  0:16                           ` Sean Christopherson
@ 2023-01-20 22:21                             ` David Matlack
  2023-01-21  0:12                               ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: David Matlack @ 2023-01-20 22:21 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Huang, Kai, sean.j.christopherson, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, Jan 19, 2023 at 4:16 PM Sean Christopherson <seanjc@google.com> wrote:
>
> On Thu, Jan 19, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > > The least invasive idea I have is expand the TDP MMU's concept of "frozen" SPTEs
> > > and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
> > > update completes.
> >
> > This will introduce another "having-to-wait while SPTE is frozen" problem I
> > think, which IIUC means (one way is) you have to do some loop and retry, perhaps
> > similar to yield_safe.
>
> Yes, but because the TDP MMU already freezes SPTEs (just for a shorter duration),
> I'm 99% sure all of the affected flows already know how to yield/bail when necessary.
>
> The problem with the zero-step mitigation is that it could (theoretically) cause
> a "busy" error on literally any accesses, which makes it infeasible for KVM to have
> sane behavior.  E.g. freezing SPTEs to avoid the ordering issues isn't necessary
> when holding mmu_lock for write, whereas the zero-step madness brings everything
> into play.

(I'm still ramping up on TDX so apologies in advance if the following
is totally off base.)

The complexity, and to a lesser extent the memory overhead, of
mirroring Secure EPT tables with the TDP MMU makes me wonder if it is
really worth it. Especially since the initial TDX support has so many
constraints that would seem to allow a simpler implementation: all
private memory is pinned, no live migration support, no test/clear
young notifiers, etc.

For the initial version of KVM TDX support, what if we implemented the
Secure EPT management entirely off to the side? i.e. Not on top of the
TDP MMU. For example, write TDX-specific routines for:

 - Fully populating the Secure EPT tree some time during VM creation.
 - Tearing down the Secure EPT tree during VM destruction.
 - Support for unmapping/mapping specific regions of the Secure EPT
tree for private<->shared conversions.

With that in place, KVM never would need to handle a fault on a Secure
EPT mapping. Any fault (e.g. due to an in-progress private<->shared
conversion) can just return back to the guest to retry the memory
access until the operation is complete.

If we start with only supporting 4K pages in the Secure EPT, the
Secure EPT routines described above would be almost trivial to
implement. Huge Pages would add some complexity, but I don't think it
would be terrible. Concurrency can be handled with a single lock since
we don't have to worry about concurrent faulting.

This would avoid having TDX add a bunch of complexity to the TDP MMU
(which would only be used for shared mappings). If and when we want to
have more complicated memory management for TDX private mappings, we
could revisit TDP MMU integration. But I think this design could even
get us to the point of supporting Dirty Logging (where the only fault
KVM would have to handle for TDX private mappings would be
write-protection faults). I'm not sure it would work for Demand-Paging
(at least the performance would not be great behind a single lock),
but we can cross that bridge when we get there.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-20 22:21                             ` David Matlack
@ 2023-01-21  0:12                               ` Sean Christopherson
  2023-01-23  1:51                                 ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-21  0:12 UTC (permalink / raw)
  To: David Matlack
  Cc: Huang, Kai, sean.j.christopherson, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Fri, Jan 20, 2023, David Matlack wrote:
> On Thu, Jan 19, 2023 at 4:16 PM Sean Christopherson <seanjc@google.com> wrote:
> >
> > On Thu, Jan 19, 2023, Huang, Kai wrote:
> > > On Thu, 2023-01-19 at 21:36 +0000, Sean Christopherson wrote:
> > > > The least invasive idea I have is expand the TDP MMU's concept of "frozen" SPTEs
> > > > and freeze (a.k.a. lock) the SPTE (KVM's mirror) until the corresponding S-EPT
> > > > update completes.

Doh, the proposed patches already do a freeze+unfreeze.  Sorry, I obviously didn't
read the most recent changelog and haven't looked at the code in a few versions.

> > > This will introduce another "having-to-wait while SPTE is frozen" problem I
> > > think, which IIUC means (one way is) you have to do some loop and retry, perhaps
> > > similar to yield_safe.
> >
> > Yes, but because the TDP MMU already freezes SPTEs (just for a shorter duration),
> > I'm 99% sure all of the affected flows already know how to yield/bail when necessary.
> >
> > The problem with the zero-step mitigation is that it could (theoretically) cause
> > a "busy" error on literally any accesses, which makes it infeasible for KVM to have
> > sane behavior.  E.g. freezing SPTEs to avoid the ordering issues isn't necessary
> > when holding mmu_lock for write, whereas the zero-step madness brings everything
> > into play.
> 
> (I'm still ramping up on TDX so apologies in advance if the following
> is totally off base.)
> 
> The complexity, and to a lesser extent the memory overhead, of
> mirroring Secure EPT tables with the TDP MMU makes me wonder if it is
> really worth it. Especially since the initial TDX support has so many
> constraints that would seem to allow a simpler implementation: all
> private memory is pinned, no live migration support, no test/clear
> young notifiers, etc.
> 
> For the initial version of KVM TDX support, what if we implemented the
> Secure EPT management entirely off to the side? i.e. Not on top of the
> TDP MMU. For example, write TDX-specific routines for:
> 
>  - Fully populating the Secure EPT tree some time during VM creation.

Fully populating S-EPT is likely not a viable option for performance reasons.  The
number of SEAMCALLs needed for a large VM would be quite enormous, and to avoid
faults entirely the backing memory would also need to be pre-allocated.

>  - Tearing down the Secure EPT tree during VM destruction.
>  - Support for unmapping/mapping specific regions of the Secure EPT
> tree for private<->shared conversions.
> 
> With that in place, KVM never would need to handle a fault on a Secure
> EPT mapping. Any fault (e.g. due to an in-progress private<->shared
> conversion) can just return back to the guest to retry the memory
> access until the operation is complete.

I don't think this will work.  Or at least, it's not that simple.  TDX and SNP
both require the host to support implicit conversions from shared => private,
i.e. KVM needs to be able to handle faults on private memory at any time.  KVM
could rely on the attributes xarray to know if KVM should resume the guest or
exit to userspace, but at some point KVM needs to know whether or an entry has
been installed.  We could work around that by adding a hooking to "set attributes"
to populate S-EPT entries, but I'm not sure that would yield a simpler implementation
without sacrificing performance.

Thinking more about this, I do believe there is a simpler approach than freezing
S-EPT entries.  If we tweak the TDP MMU rules for TDX to require mmu_lock be held
for write when overwriting a present SPTE (with a live VM), then the problematic
cases go away.  As you point out, all of the restrictions on TDX private memory
mean we're most of the way there.
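
Roughly, the rule could be enforced with something like the below sketch, where
is_private_sptep() is a made-up placeholder for however KVM ends up identifying
S-EPT mirror pages:

/*
 * Sketch only: a live (shadow-present) private SPTE must never be
 * overwritten while holding mmu_lock for read.
 */
static void tdx_assert_spte_overwrite_ok(struct kvm *kvm, struct tdp_iter *iter)
{
	if (is_shadow_present_pte(iter->old_spte) && is_private_sptep(iter->sptep))
		lockdep_assert_held_write(&kvm->mmu_lock);
}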

Explicit unmap flows already take mmu_lock for write, zapping/splitting for dirty
logging isn't a thing (yet), and the S-EPT root is fixed, i.e. kvm_tdp_mmu_put_root()
should never free S-EPT tables until the VM is dead.

In other words, the only problematic flow is kvm_tdp_mmu_map().  Specifically,
KVM will overwrite a present SPTE only to demote/promote an existing entry, i.e.
to split a hugepage or collapse into a hugepage.  And I believe it's not too
difficult to ensure KVM never does in-place promotion/demotion:

  - Private memory doesn't rely on host userspace page tables, i.e. KVM won't
    try to promote to a hugepage because a hugepage was created in the host.

  - Private memory doesn't (yet) support page migration, so again KVM won't try
    to promote because a hugepage is suddenly possible.

  - Shadow paging isn't compatible, i.e. disallow_lpage won't change dynamically
    except for when there are mixed attributes, and toggling private/shared in
    the attributes requires zapping with mmu_lock held for write.

  - The idiotic non-coherent DMA + guest MTRR interaction can be disallowed (which
    I assume is already the case for TDX).

  - Disallow toggling the NX hugepage mitigation when TDX is allowed.

Races to install leaf SPTEs will yield a "loser" either on the spurious check in
tdp_mmu_map_handle_target_level():

	if (new_spte == iter->old_spte)
		ret = RET_PF_SPURIOUS;

or on the CMPXCHG in tdp_mmu_set_spte_atomic().

So I _think_ the race I described can't actually happen in practice, and if it
does happen, then we have a bug somewhere else.  Though that raises the question
of why the proposed patches do the freeze+unfreeze.  If it's just the VM destroy
case, then that should be easy enough to special case.


Intel folks,

Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
code?  Is there something I'm forgetting/missing, or is it possible we can go
with a simpler implementation?

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-21  0:12                               ` Sean Christopherson
@ 2023-01-23  1:51                                 ` Huang, Kai
  2023-01-23 17:41                                   ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-23  1:51 UTC (permalink / raw)
  To: Christopherson,, Sean, dmatlack
  Cc: sean.j.christopherson, Shahar, Sagi, linux-kernel, Aktas, Erdem,
	kvm, pbonzini, zhi.wang.linux, isaku.yamahata, Yamahata, Isaku

> 
> 
> Intel folks,
> 
> Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
> code?  Is there something I'm forgetting/missing, or is it possible we can go
> with a simpler implementation?

It's documented in the "TDX TDP MMU design doc" patch:

+TDX concurrent populating
+-------------------------
......
+
+Without freezing the entry, the following race can happen.  Suppose two vcpus
+are faulting on the same GPA and the 2M and 4K level entries aren't populated
+yet.
+
+* vcpu 1: update 2M level EPT entry
+* vcpu 2: update 4K level EPT entry
+* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
+* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
+

(I guess such material would be more useful as a comment.  And perhaps we can
get rid of the "TDX TDP MMU design doc" patch in this series, at least for now, as
probably nobody will look at it :-) )

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-23  1:51                                 ` Huang, Kai
@ 2023-01-23 17:41                                   ` Sean Christopherson
  2023-01-26 10:54                                     ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-23 17:41 UTC (permalink / raw)
  To: Huang, Kai
  Cc: dmatlack, sean.j.christopherson, Shahar, Sagi, linux-kernel,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, isaku.yamahata,
	Yamahata, Isaku

On Mon, Jan 23, 2023, Huang, Kai wrote:
> > 
> > 
> > Intel folks,
> > 
> > Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
> > code?  Is there something I'm forgetting/missing, or is it possible we can go
> > with a simpler implementation?
> 
> It's documented in the "TDX TDP MMU design doc" patch:
> 
> +TDX concurrent populating
> +-------------------------
> ......
> +
> +Without freezing the entry, the following race can happen.  Suppose two vcpus
> +are faulting on the same GPA and the 2M and 4K level entries aren't populated
> +yet.
> +
> +* vcpu 1: update 2M level EPT entry
> +* vcpu 2: update 4K level EPT entry
> +* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
> +* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry

Ooh, the problem isn't that two SEAMCALLs to the same entry get out of order, it's
that SEAMCALLs operating on child entries can race ahead of the parent.  Hrm.

TDX also has the annoying-but-understandable restriction that leaves need to be
removed before parents.  A not-yet-relevant complication on that front is that the
TDP MMU's behavior of recursively removing children means we also have to worry
about PRESENT => !PRESENT transitions, e.g. zapping a child because the parent is
being removed can race with a different vCPU trying to populate the child (because
the vCPU handling a page fault could have seen the PRESENT parent).

I think there's an opportunity and motivation to improve the TDP MMU as a whole on
this front though.  Rather than recursively zap children in handle_removed_pt(),
we can use the RCU callback to queue the page table for removal.  Setting the parent
(target page table) !PRESENT and flushing the TLBs ensures that all children are
unreachable, i.e. KVM doesn't need to immediately set children !PRESENT.  Unlike
the shadow MMU, which maintains a hash table of shadow pages, once a parent page
table is removed from the TDP MMU, its children are unreachable.

The RCU callback must run in near-constant time, but that's easy to solve as we
already have a workqueue for zapping page tables, i.e. the RCU callback can simply
add the target page to the zap workqueue.  That would also allow for a (very minor)
simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't need to zap in
two passes since zapping children of the top-level SPTEs would be deferred to the
workqueue.
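
A very rough sketch of what that could look like; the helper names below are made
up, and the real plumbing would reuse KVM's existing zap workqueue:

/*
 * Sketch only: rather than recursively zapping children in
 * handle_removed_pt(), queue the removed page table for deferred zapping
 * once an RCU grace period guarantees no vCPU can still be walking it.
 */
static void tdp_mmu_defer_zap_rcu_callback(struct rcu_head *head)
{
	struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page, rcu_head);

	/* Must run in near-constant time: just hand the page to the workqueue. */
	tdp_mmu_schedule_zap_sp(sp);
}

static void tdp_mmu_remove_pt(struct kvm_mmu_page *sp)
{
	/* The parent SPTE is already !PRESENT and TLBs have been flushed. */
	call_rcu(&sp->rcu_head, tdp_mmu_defer_zap_rcu_callback);
}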

Back to TDX, to play nice with the restriction that parents are removed only after
children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
!PRESENT.  That will effectively prune the S-EPT entry and all its children, and
the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
before KVM actually tries to zap the children.

And if we rework zapping page tables, I suspect we can also address David's concern
(and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
necessary only for S-EPT (freezing SPTEs before populating them).  Rather than update
S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
hook before try_cmpxchg64() (or maybe instead of?).  That way KVM TDX can freeze the
to-be-installed SPTE without common TDP MMU needing to be aware of the change.

> ( I guess such material will be more useful in the comment.  And perhaps we can
> get rid of the "TDX TDP MMU design doc" patch in this series at least for now as
> probably nobody will look at it :-) )

Please keep the design doc, I'll definitely read it.  I'm just chasing too many
things at the moment and haven't given the TDX series a proper review, i.e. haven't
even glanced through all of the patches or even the shortlog.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-12 16:31 ` [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE isaku.yamahata
@ 2023-01-25  9:24   ` Zhi Wang
  2023-01-25 17:22     ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-25  9:24 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

On Thu, 12 Jan 2023 08:31:38 -0800
isaku.yamahata@intel.com wrote:

This refactor patch is quite hacky.

Why not change the purpose of vcpu->arch.mmu_shadow_page_cache.gfp_zero and let the
callers respect that the initial value of an SPTE can be configurable? That would be
generic and not TDX-specific: kvm_init_shadow_page() would not be required, and
mmu_topup_shadow_page_cache() could be left untouched, as the refactor would cover
other architectures.

1) Let it store the expected nonpresent value and rename it to nonpresent_spte.

2) Let mmu_spte_clear_track_bits(), mmu_spte_clear_no_track() and all
the other places that assume 0 as the initial value respect nonpresent_spte.

3) Let kvm_mmu_topup_memory_cache() respect nonpresent_spte: a. use __GFP_ZERO
if nonpresent_spte is zero; b. memset the page if nonpresent_spte is *not*
zero.

Now that the initial value is configurable, the next patch can configure nonpresent_spte
in the TDX initialization path before the first topup.

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> The TDX support will need the "suppress #VE" bit (bit 63) set as the
> initial value for SPTE.  To reduce code change size, introduce a new macro
> SHADOW_NONPRESENT_VALUE for the initial value for the shadow page table
> entry (SPTE) and replace hard-coded value 0 for it.  Initialize shadow page
> tables with their value.
> 
> The plan is to unconditionally set the "suppress #VE" bit for both AMD and
> Intel as: 1) AMD hardware uses the bit 63 as NX for present SPTE and
> ignored for non-present SPTE; 2) for conventional VMX guests, KVM never
> enables the "EPT-violation #VE" in VMCS control and "suppress #VE" bit is
> ignored by hardware.
> 
> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c         | 50 ++++++++++++++++++++++++++++++----
>  arch/x86/kvm/mmu/paging_tmpl.h |  3 +-
>  arch/x86/kvm/mmu/spte.h        |  2 ++
>  arch/x86/kvm/mmu/tdp_mmu.c     | 15 +++++-----
>  4 files changed, 56 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 15d0e8f11d53..59befdfeec23 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -540,9 +540,9 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>  
>  	if (!is_shadow_present_pte(old_spte) ||
>  	    !spte_has_volatile_bits(old_spte))
> -		__update_clear_spte_fast(sptep, 0ull);
> +		__update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
>  	else
> -		old_spte = __update_clear_spte_slow(sptep, 0ull);
> +		old_spte = __update_clear_spte_slow(sptep, SHADOW_NONPRESENT_VALUE);
>  
>  	if (!is_shadow_present_pte(old_spte))
>  		return old_spte;
> @@ -576,7 +576,7 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
>   */
>  static void mmu_spte_clear_no_track(u64 *sptep)
>  {
> -	__update_clear_spte_fast(sptep, 0ull);
> +	__update_clear_spte_fast(sptep, SHADOW_NONPRESENT_VALUE);
>  }
>  
>  static u64 mmu_spte_get_lockless(u64 *sptep)
> @@ -644,6 +644,39 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
>  	}
>  }
>  
> +#ifdef CONFIG_X86_64
> +static inline void kvm_init_shadow_page(void *page)
> +{
> +	memset64(page, SHADOW_NONPRESENT_VALUE, 4096 / 8);
> +}
> +
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm_mmu_memory_cache *mc = &vcpu->arch.mmu_shadow_page_cache;
> +	int start, end, i, r;
> +
> +	start = kvm_mmu_memory_cache_nr_free_objects(mc);
> +	r = kvm_mmu_topup_memory_cache(mc, PT64_ROOT_MAX_LEVEL);
> +
> +	/*
> +	 * Note, topup may have allocated objects even if it failed to allocate
> +	 * the minimum number of objects required to make forward progress _at
> +	 * this time_.  Initialize newly allocated objects even on failure, as
> +	 * userspace can free memory and rerun the vCPU in response to -ENOMEM.
> +	 */
> +	end = kvm_mmu_memory_cache_nr_free_objects(mc);
> +	for (i = start; i < end; i++)
> +		kvm_init_shadow_page(mc->objects[i]);
> +	return r;
> +}
> +#else
> +static int mmu_topup_shadow_page_cache(struct kvm_vcpu *vcpu)
> +{
> +	return kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> +					  PT64_ROOT_MAX_LEVEL);
> +}
> +#endif /* CONFIG_X86_64 */
> +
>  static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  {
>  	int r;
> @@ -653,8 +686,7 @@ static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu, bool maybe_indirect)
>  				       1 + PT64_ROOT_MAX_LEVEL + PTE_PREFETCH_NUM);
>  	if (r)
>  		return r;
> -	r = kvm_mmu_topup_memory_cache(&vcpu->arch.mmu_shadow_page_cache,
> -				       PT64_ROOT_MAX_LEVEL);
> +	r = mmu_topup_shadow_page_cache(vcpu);
>  	if (r)
>  		return r;
>  	if (maybe_indirect) {
> @@ -5920,7 +5952,13 @@ int kvm_mmu_create(struct kvm_vcpu *vcpu)
>  	vcpu->arch.mmu_page_header_cache.kmem_cache = mmu_page_header_cache;
>  	vcpu->arch.mmu_page_header_cache.gfp_zero = __GFP_ZERO;
>  
> -	vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
> +	/*
> +	 * When X86_64, initial SEPT entries are initialized with
> +	 * SHADOW_NONPRESENT_VALUE.  Otherwise zeroed.  See
> +	 * mmu_topup_shadow_page_cache().
> +	 */
> +	if (!IS_ENABLED(CONFIG_X86_64))
> +		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;
>  
>  	vcpu->arch.mmu = &vcpu->arch.root_mmu;
>  	vcpu->arch.walk_mmu = &vcpu->arch.root_mmu;
> diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
> index 0f6455072055..42d7106c7350 100644
> --- a/arch/x86/kvm/mmu/paging_tmpl.h
> +++ b/arch/x86/kvm/mmu/paging_tmpl.h
> @@ -1036,7 +1036,8 @@ static int FNAME(sync_page)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp)
>  		gpa_t pte_gpa;
>  		gfn_t gfn;
>  
> -		if (!sp->spt[i])
> +		/* spt[i] has initial value of shadow page table allocation */
> +		if (sp->spt[i] == SHADOW_NONPRESENT_VALUE)
>  			continue;
>  
>  		pte_gpa = first_pte_gpa + i * sizeof(pt_element_t);
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index 0d8deefee66c..f190eaf6b2b5 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -148,6 +148,8 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
>  
>  #define MMIO_SPTE_GEN_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
>  
> +#define SHADOW_NONPRESENT_VALUE	0ULL
> +
>  extern u64 __read_mostly shadow_host_writable_mask;
>  extern u64 __read_mostly shadow_mmu_writable_mask;
>  extern u64 __read_mostly shadow_nx_mask;
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 12e430a4ebc3..9cf5844dd34a 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -701,7 +701,7 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
>  	 * here since the SPTE is going from non-present to non-present.  Use
>  	 * the raw write helper to avoid an unnecessary check on volatile bits.
>  	 */
> -	__kvm_tdp_mmu_write_spte(iter->sptep, 0);
> +	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
>  
>  	return 0;
>  }
> @@ -878,8 +878,8 @@ static void __tdp_mmu_zap_root(struct kvm *kvm, struct kvm_mmu_page *root,
>  			continue;
>  
>  		if (!shared)
> -			tdp_mmu_set_spte(kvm, &iter, 0);
> -		else if (tdp_mmu_set_spte_atomic(kvm, &iter, 0))
> +			tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
> +		else if (tdp_mmu_set_spte_atomic(kvm, &iter, SHADOW_NONPRESENT_VALUE))
>  			goto retry;
>  	}
>  }
> @@ -935,8 +935,9 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct kvm_mmu_page *sp)
>  	if (WARN_ON_ONCE(!is_shadow_present_pte(old_spte)))
>  		return false;
>  
> -	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte, 0,
> -			   sp->gfn, sp->role.level + 1, true, true);
> +	__tdp_mmu_set_spte(kvm, kvm_mmu_page_as_id(sp), sp->ptep, old_spte,
> +			   SHADOW_NONPRESENT_VALUE, sp->gfn, sp->role.level + 1,
> +			   true, true);
>  
>  	return true;
>  }
> @@ -970,7 +971,7 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page *root,
>  		    !is_last_spte(iter.old_spte, iter.level))
>  			continue;
>  
> -		tdp_mmu_set_spte(kvm, &iter, 0);
> +		tdp_mmu_set_spte(kvm, &iter, SHADOW_NONPRESENT_VALUE);
>  		flush = true;
>  	}
>  
> @@ -1339,7 +1340,7 @@ static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
>  	 * invariant that the PFN of a present * leaf SPTE can never change.
>  	 * See __handle_changed_spte().
>  	 */
> -	tdp_mmu_set_spte(kvm, iter, 0);
> +	tdp_mmu_set_spte(kvm, iter, SHADOW_NONPRESENT_VALUE);
>  
>  	if (!pte_write(range->pte)) {
>  		new_spte = kvm_mmu_changed_pte_notifier_make_spte(iter->old_spte,


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-25  9:24   ` Zhi Wang
@ 2023-01-25 17:22     ` Sean Christopherson
  2023-01-26 21:37       ` Huang, Kai
  2023-01-27 21:36       ` Zhi Wang
  0 siblings, 2 replies; 221+ messages in thread
From: Sean Christopherson @ 2023-01-25 17:22 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson

On Wed, Jan 25, 2023, Zhi Wang wrote:
> On Thu, 12 Jan 2023 08:31:38 -0800
> isaku.yamahata@intel.com wrote:
> 
> This refactor patch is quite hacky.
> 
> Why not change the purpose of vcpu->arch.mmu_shadow_page.gfp_zero and let the
> callers respect that the initial value of spte can be configurable? It will be
> generic and not TDX-specific, then kvm_init_shadow_page() is not required,
> mmu_topup_shadow_page_cache() can be left un-touched as the refactor can cover
> other architectures.
> 
> 1) Let it store the expected nonpresent value and rename it to nonpresent_spte.


I agree that handling this in the common code would be cleaner, but repurposing
gfp_zero gets kludgy because it would require a magic value to say "don't initialize
the data", e.g. x86's mmu_shadowed_info_cache isn't pre-filled.

And supporting a custom 64-bit init value for kmem_cache-backed caches would require
restricting such caches to be a multiple of 8 bytes in size.

How about this?  Lightly tested.

From: Sean Christopherson <seanjc@google.com>
Date: Wed, 25 Jan 2023 16:55:01 +0000
Subject: [PATCH] KVM: Allow page-sized MMU caches to be initialized with
 custom 64-bit values

Add support to MMU caches for initializing a page with a custom 64-bit
value, e.g. to pre-fill an entire page table with non-zero PTE values.
The functionality will be used by x86 to support Intel's TDX, which needs
to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
faults from getting reflected into the guest (Intel's EPT Violation #VE
architecture made the less than brilliant decision of having the per-PTE
behavior be opt-out instead of opt-in).

Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 include/linux/kvm_types.h |  1 +
 virt/kvm/kvm_main.c       | 16 ++++++++++++++--
 2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
index 76de36e56cdf..67972db17b55 100644
--- a/include/linux/kvm_types.h
+++ b/include/linux/kvm_types.h
@@ -94,6 +94,7 @@ struct kvm_mmu_memory_cache {
 	int nobjs;
 	gfp_t gfp_zero;
 	gfp_t gfp_custom;
+	u64 init_value;
 	struct kmem_cache *kmem_cache;
 	int capacity;
 	void **objects;
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index d255964ec331..78f1e49179a7 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -380,12 +380,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
 static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
 					       gfp_t gfp_flags)
 {
+	void *page;
+
 	gfp_flags |= mc->gfp_zero;
 
 	if (mc->kmem_cache)
 		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
-	else
-		return (void *)__get_free_page(gfp_flags);
+
+	page = (void *)__get_free_page(gfp_flags);
+	if (page && mc->init_value)
+		memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
+	return page;
 }
 
 int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
@@ -400,6 +405,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
 		if (WARN_ON_ONCE(!capacity))
 			return -EIO;
 
+		/*
+		 * Custom init values can be used only for page allocations,
+		 * and obviously conflict with __GFP_ZERO.
+		 */
+		if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
+			return -EIO;
+
 		mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
 		if (!mc->objects)
 			return -ENOMEM;

base-commit: 503f0315c97739d3f8e645c500d81757dfbf76be
-- 


^ permalink raw reply related	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-23 17:41                                   ` Sean Christopherson
@ 2023-01-26 10:54                                     ` Huang, Kai
  2023-01-26 17:28                                       ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-26 10:54 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: dmatlack, isaku.yamahata, Shahar, Sagi, linux-kernel, Aktas,
	Erdem, kvm, sean.j.christopherson, zhi.wang.linux, pbonzini,
	Yamahata, Isaku

On Mon, 2023-01-23 at 17:41 +0000, Sean Christopherson wrote:
> On Mon, Jan 23, 2023, Huang, Kai wrote:
> > > 
> > > 
> > > Intel folks,
> > > 
> > > Do you happen to know exactly what scenario prompted adding the freeze+unfreeze
> > > code?  Is there something I'm forgetting/missing, or is it possible we can go
> > > with a simpler implementation?
> > 
> > It's documented in the "TDX TDP MMU design doc" patch:
> > 
> > +TDX concurrent populating
> > +-------------------------
> > ......
> > +
> > +Without freezing the entry, the following race can happen.  Suppose two vcpus
> > +are faulting on the same GPA and the 2M and 4K level entries aren't populated
> > +yet.
> > +
> > +* vcpu 1: update 2M level EPT entry
> > +* vcpu 2: update 4K level EPT entry
> > +* vcpu 2: TDX SEAMCALL to update 4K secure EPT entry => error
> > +* vcpu 1: TDX SEAMCALL to update 2M secure EPT entry
> 
> Ooh, the problem isn't that two SEAMCALLs to the same entry get out of order, it's
> that SEAMCALLs operating on child entries can race ahead of the parent.  Hrm.
> 
> TDX also has the annoying-but-understandable restriction that leafs need to be
> removed before parents.  A not-yet-relevant complication on that front is that the
> TDP MMU's behavior of recursively removing children means we also have to worry
> about PRESENT => !PRESENT transitions, e.g. zapping a child because the parent is
> being removed can race with a different vCPU trying to populate the child (because
> the vCPU handling a page fault could have seen the PRESENT parent).

Yes.

> 
> I think there's an opportunity and motivation to improve the TDP MMU as a whole on
> this front though.  Rather than recursively zap children in handle_removed_pt(),
> we can use the RCU callback to queue the page table for removal.  Setting the parent
> (target page table) !PRESENT and flushing the TLBs ensures that all children are
> unreachable, i.e. KVM doesn't need to immediately set children !PRESENT.  Unlike
> the shadow MMU, which maintains a hash table of shadow pages, once a parent page
> table is removed from the TDP MMU, its children are unreachable.

Do you mean something like (pseudo):

	rcu_callback(&removed_sp->rcu_head, handle_removed_pt);

> 
> The RCU callback must run in near-constant time, but that's easy to solve as we
> already have a workqueue for zapping page tables, i.e. the RCU callback can simply
> add the target page to the zap workqueue.  That would also allow for a (very minor)
> simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't need to zap in
> two passes since zapping children of the top-level SPTEs would be deferred to the
> workqueue.

Do you mean zapping the entire page table (from root) doesn't need to be in RCU
read-critical section, but can/should be done after grace period?  I think this
makes sense since zapping entire root must happen when root is already invalid,
which cannot be used anymore when the new faults come in?

> 
> Back to TDX, to play nice with the restriction that parents are removed only after
> children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
> !PRESENT.  That will effectively prune the S-EPT entry and all its children, and
> the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
> before KVM actually tries to zap the children.

Reading the spec, it seems TDH.MEM.RANGE.BLOCK only sets the Secure EPT entry
which points to the entire range as "blocked", but doesn't walk down to the
leaf level to mark all EPT entries as "blocked", which makes sense anyway.

But it seems TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE both only check
whether the target EPT entry is "blocked", and don't check whether any parent
has been marked as "blocked".  Not sure whether this will be a problem.  But
anyway, if this is a problem, we can perhaps get the TDX module fixed.

> 
> And if we rework zapping page tables, I suspect we can also address David's concern
> (and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
> necessary only for S-EPT (freezing SPTEs before populating them).  Rather than update
> S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
> hook before try_cmpxchg64() (or maybe instead of?).  That way KVM TDX can freeze the
> to-be-installed SPTE without common TDP MMU needing to be aware of the change.

I don't quite understand how putting SEAMCALL before the try_cmpxchg64() can
work.  Let's say one thread is populating a mapping and another is zapping it. 
The populating thread makes SEAMCALL successfully but then try_cmpxchg64()
fails, in this case how to proceed?

> 
> > ( I guess such material will be more useful in the comment.  And perhaps we can
> > get rid of the "TDX TDP MMU design doc" patch in this series at least for now as
> > probably nobody will look at it :-) )
> 
> Please keep the design doc, I'll definitely read it.  I'm just chasing too many
> things at the moment and haven't given the TDX series a proper review, i.e. haven't
> even glanced through all of the patches or even the shortlog.



^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-26 10:54                                     ` Huang, Kai
@ 2023-01-26 17:28                                       ` Sean Christopherson
  2023-01-26 21:18                                         ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-26 17:28 UTC (permalink / raw)
  To: Huang, Kai
  Cc: dmatlack, isaku.yamahata, Shahar, Sagi, linux-kernel, Aktas,
	Erdem, kvm, sean.j.christopherson, zhi.wang.linux, pbonzini,
	Yamahata, Isaku

On Thu, Jan 26, 2023, Huang, Kai wrote:
> On Mon, 2023-01-23 at 17:41 +0000, Sean Christopherson wrote:
> > I think there's an opportunity and motivation to improve the TDP MMU as a whole on
> > this front though.  Rather than recursively zap children in handle_removed_pt(),
> > we can use the RCU callback to queue the page table for removal.  Setting the parent
> > (target page table) !PRESENT and flushing the TLBs ensures that all children are
> > unreachable, i.e. KVM doesn't need to immediately set children !PRESENT.  Unlike
> > the shadow MMU, which maintains a hash table of shadow pages, once a parent page
> > table is removed from the TDP MMU, its children are unreachable.
> 
> Do you mean something like (pseudo):
> 
> 	rcu_callback(&removed_sp->rcu_head, handle_removed_pt);

Yep.
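
Something along these lines, as a very rough sketch (the helper and its name
are made up; only call_rcu() and the rcu_head already embedded in struct
kvm_mmu_page are real):

	static void tdp_mmu_prune_subtree_rcu_cb(struct rcu_head *head)
	{
		struct kvm_mmu_page *sp = container_of(head, struct kvm_mmu_page,
						       rcu_head);

		/* Near-constant time: hand the subtree to the zap workqueue. */
		tdp_mmu_queue_pruned_subtree(sp);	/* hypothetical helper */
	}

	/*
	 * ... and handle_removed_pt() stops recursing into the children; the
	 * parent is already !PRESENT and TLBs are flushed, so it just does:
	 */
	call_rcu(&sp->rcu_head, tdp_mmu_prune_subtree_rcu_cb);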

> > The RCU callback must run in near-constant time, but that's easy to solve as we
> > already have a workqueue for zapping page tables, i.e. the RCU callback can simply
> > add the target page to the zap workqueue.  That would also allow for a (very minor)
> > simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't need to zap in
> > two passes since zapping children of the top-level SPTEs would be deferred to the
> > workqueue.
> 
> Do you mean zapping the entire page table (from root) doesn't need to be in RCU
> read-critical section, but can/should be done after grace period?  I think this
> makes sense since zapping entire root must happen when root is already invalid,
> which cannot be used anymore when the new faults come in?

Yes, minus the "from root" restriction.  When a page table (call it "branch" to
continue the analogy) PTE has been zapped/blocked ("pruned"), KVM just needs to
wait for all potential readers to go away.  That guarantee is provided by RCU;
software walkers, i.e. KVM itself, are required to hold RCU, and hardware walkers,
i.e. vCPUs running in the guest, are protected by proxy as the zapper ("arborist"?)
is required to hold RCU until all running vCPUs have been kicked.

In other words, once the PTE is zapped/blocked (branch is pruned), it's completely
removed from the paging tree and no other tasks can access the branch (page table
and its children).  I.e. the only remaining reference to the branch is the pointer
handed to the RCU callback.  That means the RCU callback has exclusive access to the
branch, i.e. can operate as if it were holding mmu_lock for write.  Furthermore, the
RCU callback also doesn't need to flush TLBs because that was again done when
pruning the branch.

It's the same idea that KVM already uses for root SPs, the only difference is how
KVM determines that there is exactly one entity that holds a reference to the SP.

> > Back to TDX, to play nice with the restriction that parents are removed only after
> > children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
> > !PRESENT.  That will effectively prune the S-EPT entry and all its children, and
> > the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
> > before KVM actually tries to zap the children.
> 
> Reading the spec, it seems TDH.MEM.RANGE.BLOCK only sets the Secure EPT entry
> which points to the entire range as "blocked", but doesn't walk down to the
> leaf level to mark all EPT entries as "blocked", which makes sense anyway.
> 
> But it seems TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE both only check
> whether the target EPT entry is "blocked", and don't check whether any parent
> has been marked as "blocked".  Not sure whether this will be a problem.  But
> anyway, if this is a problem, we can perhaps get the TDX module fixed.

Oh, I didn't mean to suggest KVM skip TDH.MEM.RANGE.BLOCK for children, I simply
forgot that all S-EPT entries need to be blocked before they can be removed.
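
So for pruning a parent S-EPT page the rough order would be something like the
pseudo sequence below (my summary of the above; the TDH.MEM.TRACK/kick step is
an assumption about how the "flush TLBs" part is done for TDX, it isn't spelled
out in this thread):

	TDH.MEM.RANGE.BLOCK(parent)       /* prune: subtree unreachable to the guest */
	TDH.MEM.TRACK + kick all vCPUs    /* flush guest-visible translations */
	/* RCU grace period: in-flight SEAMCALLs on the subtree complete */
	for each child, leaves first:
		TDH.MEM.RANGE.BLOCK(child)
		TDH.MEM.PAGE.REMOVE / TDH.MEM.SEPT.REMOVE(child)
	TDH.MEM.SEPT.REMOVE(parent)       /* parents only after all children */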

> > And if we rework zapping page tables, I suspect we can also address David's concern
> > (and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
> > necessary only for S-EPT (freezing SPTEs before populating them).  Rather than update
> > S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
> > hook before try_cmpxchg64() (or maybe instead of?).  That way KVM TDX can freeze the
> > to-be-installed SPTE without common TDP MMU needing to be aware of the change.
> 
> I don't quite understand how putting SEAMCALL before the try_cmpxchg64() can
> work.  Let's say one thread is populating a mapping and another is zapping it. 
> The populating thread makes SEAMCALL successfully but then try_cmpxchg64()
> fails, in this case how to proceed?

Ah, sorry, that was unclear.  By "invoke the KVM TDX hook" I didn't mean "do the
SEAMCALL", I meant KVM TDX could do its own manipulation of the KVM-managed SPTEs
before the common/standard flow.  E.g. something like:

	if (kvm_x86_ops.set_private_spte && private)
		r = static_call(kvm_x86_set_private_spte(...)
	else
		r = try_cmpxchg64(...) ? 0 : -EBUSY;

so that the common code doesn't need to do, or even be aware of, the freezing.
Then I think we just need another hook in handle_removed_pt(), or maybe in what
is currently __kvm_tdp_mmu_write_spte()?

I.e. fully replace the "write" operations in the TDP MMU instead of trying to
smush S-EPT's requirements into the common path.
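
For the removal/write side, the idea would be roughly the sketch below (purely
illustrative; the hook name and plumbing are hypothetical, not part of the
posted series):

	static u64 tdp_mmu_write_spte_dispatch(tdp_ptep_t sptep, u64 old_spte,
					       u64 new_spte, int level)
	{
		/*
		 * Hypothetical hook: hand private SPTE updates to the backend
		 * (TDX) so common code never sees the freeze/unfreeze dance.
		 */
		if (is_private_sptep(sptep))
			return static_call(kvm_x86_write_private_spte)(sptep, old_spte,
								       new_spte, level);

		/* Existing shared-SPTE write path, unchanged. */
		return kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
	}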

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-26 17:28                                       ` Sean Christopherson
@ 2023-01-26 21:18                                         ` Huang, Kai
  2023-01-26 21:59                                           ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-26 21:18 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: pbonzini, Shahar, Sagi, isaku.yamahata, linux-kernel, Aktas,
	Erdem, kvm, sean.j.christopherson, zhi.wang.linux, dmatlack,
	Yamahata, Isaku

On Thu, 2023-01-26 at 17:28 +0000, Sean Christopherson wrote:
> On Thu, Jan 26, 2023, Huang, Kai wrote:
> > On Mon, 2023-01-23 at 17:41 +0000, Sean Christopherson wrote:
> > > I think there's an opportunity and motivation to improve the TDP MMU as a whole on
> > > this front though.  Rather than recursively zap children in handle_removed_pt(),
> > > we can use the RCU callback to queue the page table for removal.  Setting the parent
> > > (target page table) !PRESENT and flushing the TLBs ensures that all children are
> > > unreachable, i.e. KVM doesn't need to immediately set children !PRESENT.  Unlike
> > > the shadow MMU, which maintains a hash table of shadow pages, once a parent page
> > > table is removed from the TDP MMU, its children are unreachable.
> > 
> > Do you mean something like (pseudo):
> > 
> > 	rcu_callback(&removed_sp->rcu_head, handle_removed_pt);
> 
> Yep.
> 
> > > The RCU callback must run in near-constant time, but that's easy to solve as we
> > > already have a workqueue for zapping page tables, i.e. the RCU callback can simply
> > > add the target page to the zap workqueue.  That would also allow for a (very minor)
> > > simplification of other TDP MMU code: tdp_mmu_zap_root() wouldn't need to zap in
> > > two passes since zapping children of the top-level SPTEs would be deferred to the
> > > workqueue.
> > 
> > Do you mean zapping the entire page table (from root) doesn't need to be in RCU
> > read-critical section, but can/should be done after grace period?  I think this
> > makes sense since zapping entire root must happen when root is already invalid,
> > which cannot be used anymore when the new faults come in?
> 
> Yes, minus the "from root" restriction.  When a page table (call it "branch" to
> continue the analogy) PTE has been zapped/blocked ("pruned"), KVM just needs to
> wait for all potential readers to go away.  That guarantee is provided by RCU;
> software walkers, i.e. KVM itself, are required to hold RCU, and hardware walkers,
> i.e. vCPUs running in the guest, are protected by proxy as the zapper ("arborist"?)
> is required to hold RCU until all running vCPUs have been kicked.

Yes.

> 
> In other words, once the PTE is zapped/blocked (branch is pruned), it's completely
> removed from the paging tree and no other tasks can access the branch (page table
> and its children).  I.e. the only remaining reference to the branch is the pointer
> handed to the RCU callback.  That means the RCU callback has exclusive access to the
> branch, i.e. can operate as if it were holding mmu_lock for write.  Furthermore, the
> RCU callback also doesn't need to flush TLBs because that was again done when
> pruning the branch.
> 
> It's the same idea that KVM already uses for root SPs, the only difference is how
> KVM determines that there is exactly one entity that holds a reference to the SP.

Right.  This works fine for normal non-TDX case.  However for TDX unfortunately
the access to the removed branch (or the removed sub-page-table) isn't that
"exclusive" as the SEAMCALL to truly zap that branch still needs to hold the
write lock of the entire Secure EPT tree, so it can still conflict with other
threads handling new faults.

This is a problem we need to deal with anyway.

> 
> > > Back to TDX, to play nice with the restriction that parents are removed only after
> > > children are removed, I believe KVM can use TDH.MEM.RANGE.BLOCK to make the parent
> > > !PRESENT.  That will effectively prune the S-EPT entry and all its children, and
> > > the RCU callback will again ensure all in-flight SEAMCALLs for the children complete
> > > before KVM actually tries to zap the children.
> > 
> > Reading the spec, it seems TDH.MEM.RANGE.BLOCK only sets the Secure EPT entry
> > which points to the entire range as "blocked", but doesn't walk down to the
> > leaf level to mark all EPT entries as "blocked", which makes sense anyway.
> > 
> > But it seems TDH.MEM.PAGE.REMOVE and TDH.MEM.SEPT.REMOVE both only check
> > whether the target EPT entry is "blocked", and don't check whether any parent
> > has been marked as "blocked".  Not sure whether this will be a problem.  But
> > anyway, if this is a problem, we can perhaps get the TDX module fixed.
> 
> Oh, I didn't mean to suggest KVM skip TDH.MEM.RANGE.BLOCK for children, I simply
> forgot that all S-EPT entries need to be blocked before they can be removed.
> 
> > > And if we rework zapping page tables, I suspect we can also address David's concern
> > > (and my not-yet-voiced concern) about polluting the TDP MMU code with logic that is
> > > necessary only for S-EPT (freezing SPTEs before populating them).  Rather than update
> > > S-EPT _after_ the TDP MMU SPTE, do the S-EPT update first, i.e. invoke the KVM TDX
> > > hook before try_cmpxchg64() (or maybe instead of?).  That way KVM TDX can freeze the
> > > to-be-installed SPTE without common TDP MMU needing to be aware of the change.
> > 
> > I don't quite understand how putting SEAMCALL before the try_cmpxchg64() can
> > work.  Let's say one thread is populating a mapping and another is zapping it. 
> > The populating thread makes SEAMCALL successfully but then try_cmpxchg64()
> > fails, in this case how to proceed?
> 
> Ah, sorry, that was unclear.  By "invoke the KVM TDX hook" I didn't mean "do the
> SEAMCALL", I meant KVM TDX could do its own manipulation of the KVM-managed SPTEs
> before the common/standard flow.  E.g. something like:
> 
> 	if (kvm_x86_ops.set_private_spte && private)
> 		r = static_call(kvm_x86_set_private_spte(...)
> 	else
> 		r = try_cmpxchg64(...) ? 0 : -EBUSY;
> 
> so that the common code doesn't need to do, or even be aware of, the freezing.
> Then I think we just need another hook in handle_removed_pt(), or maybe in what
> is currently __kvm_tdp_mmu_write_spte()?
> 
> I.e. fully replace the "write" operations in the TDP MMU instead of trying to
> smush S-EPT's requirements into the common path.

I have no strong preference, but this looks better to me given the benefits you
mentioned above. :)

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-25 17:22     ` Sean Christopherson
@ 2023-01-26 21:37       ` Huang, Kai
  2023-01-26 22:01         ` Sean Christopherson
  2023-01-27 21:36       ` Zhi Wang
  1 sibling, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-26 21:37 UTC (permalink / raw)
  To: zhi.wang.linux, Christopherson,, Sean
  Cc: sean.j.christopherson, linux-kernel, Shahar, Sagi,
	isaku.yamahata, Aktas, Erdem, dmatlack, kvm, Yamahata, Isaku,
	pbonzini

On Wed, 2023-01-25 at 17:22 +0000, Sean Christopherson wrote:
> On Wed, Jan 25, 2023, Zhi Wang wrote:
> > On Thu, 12 Jan 2023 08:31:38 -0800
> > isaku.yamahata@intel.com wrote:
> > 
> > This refactor patch is quite hacky.
> > 
> > Why not change the purpose of vcpu->arch.mmu_shadow_page.gfp_zero and let the
> > callers respect that the initial value of spte can be configurable? It will be
> > generic and not TDX-specific, then kvm_init_shadow_page() is not required,
> > mmu_topup_shadow_page_cache() can be left un-touched as the refactor can cover
> > other architectures.
> > 
> > 1) Let it store the expected nonpresent value and rename it to nonpresent_spte.
> 
> 
> I agree that handling this in the common code would be cleaner, but repurposing
> gfp_zero gets kludgy because it would require a magic value to say "don't initialize
> the data", e.g. x86's mmu_shadowed_info_cache isn't pre-filled.
> 
> And supporting a custom 64-bit init value for kmem_cache-backed caches would require
> restricting such caches to be a multiple of 8 bytes in size.
> 
> How about this?  Lightly tested.
> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Wed, 25 Jan 2023 16:55:01 +0000
> Subject: [PATCH] KVM: Allow page-sized MMU caches to be initialized with
>  custom 64-bit values
> 
> Add support to MMU caches for initializing a page with a custom 64-bit
> value, e.g. to pre-fill an entire page table with non-zero PTE values.
> The functionality will be used by x86 to support Intel's TDX, which needs
> to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
> faults from getting reflected into the guest (Intel's EPT Violation #VE
> architecture made the less than brilliant decision of having the per-PTE
> behavior be opt-out instead of opt-in).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_types.h |  1 +
>  virt/kvm/kvm_main.c       | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 76de36e56cdf..67972db17b55 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -94,6 +94,7 @@ struct kvm_mmu_memory_cache {
>  	int nobjs;
>  	gfp_t gfp_zero;
>  	gfp_t gfp_custom;
> +	u64 init_value;
>  	struct kmem_cache *kmem_cache;
>  	int capacity;
>  	void **objects;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331..78f1e49179a7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -380,12 +380,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
>  static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>  					       gfp_t gfp_flags)
>  {
> +	void *page;
> +
>  	gfp_flags |= mc->gfp_zero;
>  
>  	if (mc->kmem_cache)
>  		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> -	else
> -		return (void *)__get_free_page(gfp_flags);
> +
> +	page = (void *)__get_free_page(gfp_flags);
> +	if (page && mc->init_value)
> +		memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
> +	return page;
>  }
>  
>  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> @@ -400,6 +405,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
>  		if (WARN_ON_ONCE(!capacity))
>  			return -EIO;
>  
> +		/*
> +		 * Custom init values can be used only for page allocations,
> +		 * and obviously conflict with __GFP_ZERO.
> +		 */
> +		if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
> +			return -EIO;
> +
>  		mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
>  		if (!mc->objects)
>  			return -ENOMEM;
> 
> base-commit: 503f0315c97739d3f8e645c500d81757dfbf76be

init_value and gfp_zero are kinda redundant.  How about removing gfp_zero
completely?

	mmu_memory_cache_alloc_obj(...)
	{
		...
		if (!mc->init_value)
			gfp_flags |= __GFP_ZERO;
		...
	}

And in kvm_mmu_create() you initialize all caches' init_value explicitly.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-26 21:18                                         ` Huang, Kai
@ 2023-01-26 21:59                                           ` Sean Christopherson
  2023-01-26 22:27                                             ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-26 21:59 UTC (permalink / raw)
  To: Huang, Kai
  Cc: pbonzini, Shahar, Sagi, isaku.yamahata, linux-kernel, Aktas,
	Erdem, kvm, sean.j.christopherson, zhi.wang.linux, dmatlack,
	Yamahata, Isaku

On Thu, Jan 26, 2023, Huang, Kai wrote:
> On Thu, 2023-01-26 at 17:28 +0000, Sean Christopherson wrote:
> > In other words, once the PTE is zapped/blocked (branch is pruned), it's completely
> > removed from the paging tree and no other tasks can access the branch (page table
> > and its children).  I.e. the only remaining reference to the branch is the pointer
> > handed to the RCU callback.  That means the RCU callback has exclusive access to the
> > branch, i.e. can operate as if it were holding mmu_lock for write.  Furthermore, the
> > RCU callback also doesn't need to flush TLBs because that was again done when
> > pruning the branch.
> > 
> > It's the same idea that KVM already uses for root SPs, the only difference is how
> > KVM determines that there is exactly one entity that holds a reference to the SP.
> 
> Right.  This works fine for normal non-TDX case.  However for TDX unfortunately
> the access to the removed branch (or the removed sub-page-table) isn't that
> "exclusive" as the SEAMCALL to truly zap that branch still needs to hold the
> write lock of the entire Secure EPT tree, so it can still conflict with other
> threads handling new faults.

I thought TDX was smart enough to read-lock only the part of the tree that it's
actually consuming, and write-lock only the part of the tree that it's actually
modifying?

Hrm, but even if TDX takes a read-lock, there's still the problem of it needing
to walk the upper levels, i.e. KVM needs to keep mid-level page tables reachable
until they're fully removed.  Blech.  That should be a non-issue at this time
though, as I don't think KVM will ever REMOVE a page table of a live guest.  I
need to look at the PROMOTE/DEMOTE flows...

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-26 21:37       ` Huang, Kai
@ 2023-01-26 22:01         ` Sean Christopherson
  2023-02-27 21:52           ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Sean Christopherson @ 2023-01-26 22:01 UTC (permalink / raw)
  To: Huang, Kai
  Cc: zhi.wang.linux, sean.j.christopherson, linux-kernel, Shahar,
	Sagi, isaku.yamahata, Aktas, Erdem, dmatlack, kvm, Yamahata,
	Isaku, pbonzini

On Thu, Jan 26, 2023, Huang, Kai wrote:
> On Wed, 2023-01-25 at 17:22 +0000, Sean Christopherson wrote:
> > I agree that handling this in the common code would be cleaner, but repurposing
> > gfp_zero gets kludgy because it would require a magic value to say "don't initialize
> > the data", e.g. x86's mmu_shadowed_info_cache isn't pre-filled.

...

> > @@ -400,6 +405,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
> >  		if (WARN_ON_ONCE(!capacity))
> >  			return -EIO;
> >  
> > +		/*
> > +		 * Custom init values can be used only for page allocations,
> > +		 * and obviously conflict with __GFP_ZERO.
> > +		 */
> > +		if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
> > +			return -EIO;
> > +
> >  		mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> >  		if (!mc->objects)
> >  			return -ENOMEM;
> > 
> > base-commit: 503f0315c97739d3f8e645c500d81757dfbf76be
> 
> init_value and gfp_zero are kinda redundant.  How about removing gfp_zero
> completely?
> 
> 	mmu_memory_cache_alloc_obj(...)
> 	{
> 		...
> 		if (!mc->init_value)
> 			gfp_flags |= __GFP_ZERO;
> 		...
> 	}
> 
> And in kvm_mmu_create() you initialize all caches' init_value explicitly.

No, as mentioned above there's also a "don't initialize the data" case.  Leaving
init_value=0 means those users would see unnecessary zeroing, and again I don't
want to use a magic value to say "don't initialize".
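
Concretely (illustrative only, not actual code), the cases that need to keep
working side by side are:

	/* zero-filled pages, e.g. the regular shadow page cache today: */
	mc->gfp_zero = __GFP_ZERO;

	/* pre-filled pages, e.g. TDX filling PTEs with SHADOW_NONPRESENT_VALUE: */
	mc->init_value = SHADOW_NONPRESENT_VALUE;

	/*
	 * deliberately uninitialized, e.g. mmu_shadowed_info_cache:
	 * leave both gfp_zero and init_value at 0.
	 */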

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-26 21:59                                           ` Sean Christopherson
@ 2023-01-26 22:27                                             ` Huang, Kai
  2023-01-30 19:15                                               ` Sean Christopherson
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-01-26 22:27 UTC (permalink / raw)
  To: Christopherson,, Sean
  Cc: sean.j.christopherson, dmatlack, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, 2023-01-26 at 21:59 +0000, Sean Christopherson wrote:
> On Thu, Jan 26, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-26 at 17:28 +0000, Sean Christopherson wrote:
> > > In other words, once the PTE is zapped/blocked (branch is pruned), it's completely
> > > removed from the paging tree and no other tasks can access the branch (page table
> > > and its children).  I.e. the only remaining reference to the branch is the pointer
> > > handed to the RCU callback.  That means the RCU callback has exclusive access to the
> > > branch, i.e. can operate as if it were holding mmu_lock for write.  Furthermore, the
> > > RCU callback also doesn't need to flush TLBs because that was again done when
> > > pruning the branch.
> > > 
> > > It's the same idea that KVM already uses for root SPs, the only difference is how
> > > KVM determines that there is exactly one entity that holds a reference to the SP.
> > 
> > Right.  This works fine for normal non-TDX case.  However for TDX unfortunately
> > the access to the removed branch (or the removed sub-page-table) isn't that
> > "exclusive" as the SEAMCALL to truly zap that branch still needs to hold the
> > write lock of the entire Secure EPT tree, so it can still conflict with other
> > threads handling new faults.
> 
> I thought TDX was smart enough to read-lock only the part of the tree that it's
> actually consuming, and write-lock only the part of the tree that it's actually
> modifying?

The spec says there's only exclusive/shared access to the "whole Secure EPT
tree":

8.6. Secure EPT Concurrency

Secure EPT concurrency rules are designed to support the expected usage and yet
be as simple as possible.

Host-Side (SEAMCALL) Functions:
• Functions that manage Secure EPT acquire exclusive access to the whole Secure
EPT tree of the target TD.
• In specific cases where a Secure EPT entry update may collide with a
concurrent update done by the guest TD, such host-side functions update the
Secure EPT entry as a transaction, using atomic compare and exchange operations.
• TDH.MEM.SEPT.RD acquire shared access to the whole Secure EPT tree of the
target TD to help prevent changes to the tree while they execute.
• Other functions that only read Secure EPT for GPA-to-HPA translation (e.g.,
TDH.MR.EXTEND) acquire shared access to the whole Secure EPT tree of the target
TD to help prevent changes to the tree while they execute.

> 
> Hrm, but even if TDX takes a read-lock, there's still the problem of it needing
> to walk the upper levels, i.e. KVM needs to keep mid-level page tables reachable
> until they're fully removed.  Blech.  That should be a non-issue at this time
> though, as I don't think KVM will ever REMOVE a page table of a live guest.  I
> need to look at the PROMOTE/DEMOTE flows...

In this series, if I read correctly, when slot is removed/moved, private
mappings are zapped too.  It's kinda buried in:

[PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported
cases

(it's not easy to find -- I had to use 'git blame' in the actual repo to find
the commit, sigh.)



^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-25 17:22     ` Sean Christopherson
  2023-01-26 21:37       ` Huang, Kai
@ 2023-01-27 21:36       ` Zhi Wang
  2023-02-27 21:50         ` Isaku Yamahata
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-01-27 21:36 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sagi Shahar, David Matlack, Sean Christopherson,
	kai.huang

On Wed, 25 Jan 2023 17:22:08 +0000
Sean Christopherson <seanjc@google.com> wrote:

> On Wed, Jan 25, 2023, Zhi Wang wrote:
> > On Thu, 12 Jan 2023 08:31:38 -0800
> > isaku.yamahata@intel.com wrote:
> > 
> > This refactor patch is quite hacky.
> > 
> > Why not change the purpose of vcpu->arch.mmu_shadow_page.gfp_zero and let the
> > callers respect that the initial value of spte can be configurable? It will be
> > generic and not TDX-specific, then kvm_init_shadow_page() is not required,
> > mmu_topup_shadow_page_cache() can be left un-touched as the refactor can cover
> > other architectures.
> > 
> > 1) Let it store the expected nonpresent value and rename it to nonpresent_spte.
> 
> 
> I agree that handling this in the common code would be cleaner, but repurposing
> gfp_zero gets kludgy because it would require a magic value to say "don't initialize
> the data", e.g. x86's mmu_shadowed_info_cache isn't pre-filled.
> 
> And supporting a custom 64-bit init value for kmem_cache-backed caches would require
> restricting such caches to be a multiple of 8 bytes in size.
> 
> How about this?  Lightly tested.
> 
> From: Sean Christopherson <seanjc@google.com>
> Date: Wed, 25 Jan 2023 16:55:01 +0000
> Subject: [PATCH] KVM: Allow page-sized MMU caches to be initialized with
>  custom 64-bit values
>

It looks good enough so far, although it only supports a 64-bit init value.  But
it can be extended in the future.

Just want to make sure people are thinking the same:

1) Keep the changes of SHADOW_NONPRESENT_VALUE and REMOVED_SPTE in TDX patch.
init_value stays as a generic feature in the kvm mmu cache layer. It is *not*
going to replace SHADOW_NONPRESENT_VALUE.

2) TDX kvm_x86_vcpu_create sets the SHADOW_NONPRESENT value into init_value.

3) The MMU cache top-up function initializes the page according to init_value
with Sean's patch, e.g. as sketched below.
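
A minimal sketch of how 2) and 3) would fit together (illustrative only; the
exact function where the assignment lives is an assumption, the field and
macro names follow this series):

	/* At vCPU creation time, only for a TD: */
	if (kvm_gfn_shared_mask(vcpu->kvm))
		/* init_value conflicts with __GFP_ZERO, so gfp_zero stays clear. */
		vcpu->arch.mmu_shadow_page_cache.init_value = SHADOW_NONPRESENT_VALUE;
	else
		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

With that, topping up the cache pre-fills each shadow page with
SHADOW_NONPRESENT_VALUE via the memset64() in Sean's patch above.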

> Add support to MMU caches for initializing a page with a custom 64-bit
> value, e.g. to pre-fill an entire page table with non-zero PTE values.
> The functionality will be used by x86 to support Intel's TDX, which needs
> to set bit 63 in all non-present PTEs in order to prevent !PRESENT page
> faults from getting reflected into the guest (Intel's EPT Violation #VE
> architecture made the less than brilliant decision of having the per-PTE
> behavior be opt-out instead of opt-in).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  include/linux/kvm_types.h |  1 +
>  virt/kvm/kvm_main.c       | 16 ++++++++++++++--
>  2 files changed, 15 insertions(+), 2 deletions(-)
> 
> diff --git a/include/linux/kvm_types.h b/include/linux/kvm_types.h
> index 76de36e56cdf..67972db17b55 100644
> --- a/include/linux/kvm_types.h
> +++ b/include/linux/kvm_types.h
> @@ -94,6 +94,7 @@ struct kvm_mmu_memory_cache {
>  	int nobjs;
>  	gfp_t gfp_zero;
>  	gfp_t gfp_custom;
> +	u64 init_value;
>  	struct kmem_cache *kmem_cache;
>  	int capacity;
>  	void **objects;
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index d255964ec331..78f1e49179a7 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -380,12 +380,17 @@ static void kvm_flush_shadow_all(struct kvm *kvm)
>  static inline void *mmu_memory_cache_alloc_obj(struct kvm_mmu_memory_cache *mc,
>  					       gfp_t gfp_flags)
>  {
> +	void *page;
> +
>  	gfp_flags |= mc->gfp_zero;
>  
>  	if (mc->kmem_cache)
>  		return kmem_cache_alloc(mc->kmem_cache, gfp_flags);
> -	else
> -		return (void *)__get_free_page(gfp_flags);
> +
> +	page = (void *)__get_free_page(gfp_flags);
> +	if (page && mc->init_value)
> +		memset64(page, mc->init_value, PAGE_SIZE / sizeof(mc->init_value));
> +	return page;
>  }
>  
>  int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity, int min)
> @@ -400,6 +405,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
>  		if (WARN_ON_ONCE(!capacity))
>  			return -EIO;
>  
> +		/*
> +		 * Custom init values can be used only for page allocations,
> +		 * and obviously conflict with __GFP_ZERO.
> +		 */
> +		if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
> +			return -EIO;
> +
>  		mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
>  		if (!mc->objects)
>  			return -ENOMEM;
> 
> base-commit: 503f0315c97739d3f8e645c500d81757dfbf76be


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 018/113] KVM: TDX: create/destroy VM structure
  2023-01-26 22:27                                             ` Huang, Kai
@ 2023-01-30 19:15                                               ` Sean Christopherson
  0 siblings, 0 replies; 221+ messages in thread
From: Sean Christopherson @ 2023-01-30 19:15 UTC (permalink / raw)
  To: Huang, Kai
  Cc: sean.j.christopherson, dmatlack, Shahar, Sagi, isaku.yamahata,
	Aktas, Erdem, kvm, pbonzini, zhi.wang.linux, linux-kernel,
	Yamahata, Isaku

On Thu, Jan 26, 2023, Huang, Kai wrote:
> On Thu, 2023-01-26 at 21:59 +0000, Sean Christopherson wrote:
> > Hrm, but even if TDX takes a read-lock, there's still the problem of it needing
> > to walk the upper levels, i.e. KVM needs to keep mid-level page tables reachable
> > until they're fully removed.  Blech.  That should be a non-issue at this time
> > though, as I don't think KVM will ever REMOVE a page table of a live guest.  I
> > need to look at the PROMOTE/DEMOTE flows...
> 
> In this series, if I read correctly, when slot is removed/moved, private
> mappings are zapped too.  It's kinda buried in:
> 
> [PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported
> cases

That should be ok, only leaf SPTEs will be zapped in that case.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall
  2023-01-12 16:32 ` [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
@ 2023-01-31  1:30   ` Yuan Yao
  2023-02-27 22:12     ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Yuan Yao @ 2023-01-31  1:30 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, Jan 12, 2023 at 08:32:47AM -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Wire up TDX PV map_gpa hypercall to the kvm/mmu backend.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/tdx.c | 53 ++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 53 insertions(+)
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index 4bbde58510a4..486d0f0c6dd1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -1181,6 +1181,57 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
>  	return 1;
>  }
>
> +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
> +{
> +	struct kvm *kvm = vcpu->kvm;
> +	gpa_t gpa = tdvmcall_a0_read(vcpu);
> +	gpa_t size = tdvmcall_a1_read(vcpu);
> +	gpa_t end = gpa + size;
> +	gfn_t s = gpa_to_gfn(gpa) & ~kvm_gfn_shared_mask(kvm);
> +	gfn_t e = gpa_to_gfn(end) & ~kvm_gfn_shared_mask(kvm);
> +	int i;
> +
> +	if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
> +	    end < gpa ||
> +	    end > kvm_gfn_shared_mask(kvm) << (PAGE_SHIFT + 1) ||
> +	    kvm_is_private_gpa(kvm, gpa) != kvm_is_private_gpa(kvm, end)) {
> +		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> +		return 1;
> +	}
> +
> +	/*
> +	 * Check how the requested region overlaps with the KVM memory slots.
> +	 * For simplicity, require that it must be contained within a memslot or
> +	 * it must not overlap with any memslots (MMIO).
> +	 */
> +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> +		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> +		struct kvm_memslot_iter iter;
> +
> +		kvm_for_each_memslot_in_gfn_range(&iter, slots, s, e) {
> +			struct kvm_memory_slot *slot = iter.slot;
> +			gfn_t slot_s = slot->base_gfn;
> +			gfn_t slot_e = slot->base_gfn + slot->npages;
> +
> +			/* no overlap */
> +			if (e < slot_s || s >= slot_e)
> +				continue;
> +
> +			/* contained in slot */
> +			if (slot_s <= s && e <= slot_e) {
> +				if (kvm_slot_can_be_private(slot))
> +					return tdx_vp_vmcall_to_user(vcpu);
> +				continue;
> +			}
> +
> +			break;
> +		}
> +	}
> +
> +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);

This returns TDG_VP_VMCALL_INVALID_OPERAND if the TD is running with
non-private slots, which looks incorrect from the caller's perspective in the
TD guest (because the operands are actually correct).  Can we just refuse to
create the TD if private memory slots are the only supported slot type for it?

> +	return 1;
> +}
> +
>  static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>  {
>  	if (tdvmcall_exit_type(vcpu))
> @@ -1206,6 +1257,8 @@ static int handle_tdvmcall(struct kvm_vcpu *vcpu)
>  		 * guest TD doesn't make sense.  No argument check is done.
>  		 */
>  		return tdx_vp_vmcall_to_user(vcpu);
> +	case TDG_VP_VMCALL_MAP_GPA:
> +		return tdx_map_gpa(vcpu);
>  	default:
>  		break;
>  	}
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2023-01-12 16:32 ` [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX isaku.yamahata
  2023-01-17  3:11   ` Huang, Kai
@ 2023-02-03  6:55   ` Yuan Yao
  2023-02-27 22:15     ` Isaku Yamahata
  1 sibling, 1 reply; 221+ messages in thread
From: Yuan Yao @ 2023-02-03  6:55 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, Jan 12, 2023 at 08:32:06AM -0800, isaku.yamahata@intel.com wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Although TDX supports only WB for private GPA, MTRR/PAT for shared GPA
> should be supported. Implement get_mt_mask() following vmx case.
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/vmx/main.c    | 10 +++++++++-
>  arch/x86/kvm/vmx/tdx.c     | 19 +++++++++++++++++++
>  arch/x86/kvm/vmx/x86_ops.h |  2 ++
>  3 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> index 770d1b29d1c3..4319f6d7a4da 100644
> --- a/arch/x86/kvm/vmx/main.c
> +++ b/arch/x86/kvm/vmx/main.c
> @@ -158,6 +158,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
>  	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
>  }
>
> +static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	if (is_td_vcpu(vcpu))
> +		return tdx_get_mt_mask(vcpu, gfn, is_mmio);
> +
> +	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
> +}
> +
>  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
>  {
>  	if (!is_td(kvm))
> @@ -267,7 +275,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
>
>  	.set_tss_addr = vmx_set_tss_addr,
>  	.set_identity_map_addr = vmx_set_identity_map_addr,
> -	.get_mt_mask = vmx_get_mt_mask,
> +	.get_mt_mask = vt_get_mt_mask,
>
>  	.get_exit_info = vmx_get_exit_info,
>
> diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> index e68816999387..c4c5a8f786c1 100644
> --- a/arch/x86/kvm/vmx/tdx.c
> +++ b/arch/x86/kvm/vmx/tdx.c
> @@ -309,6 +309,25 @@ int tdx_vm_init(struct kvm *kvm)
>  	return 0;
>  }
>
> +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> +{
> +	/* TDX private GPA is always WB. */
> +	if (gfn & kvm_gfn_shared_mask(vcpu->kvm)) {
> +		WARN_ON_ONCE(is_mmio);
> +		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> +	}

This isn't clear enough: the comment talks about private GPAs, but the code
returns WB when the shared bit is set in the GFN.  Please align them.
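
One way to align them, assuming the comment reflects the intent (illustrative
suggestion, not a tested fix):

	/* TDX private GPA (shared bit clear) is always mapped WB. */
	if (!(gfn & kvm_gfn_shared_mask(vcpu->kvm))) {
		WARN_ON_ONCE(is_mmio);
		return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
	}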

> +
> +	if (is_mmio)
> +		return MTRR_TYPE_UNCACHABLE << VMX_EPT_MT_EPTE_SHIFT;
> +
> +	/*
> +	 * Device assignemnt without VT-d snooping capability with shared-GPA
> +	 * is dubious.
> +	 */
> +	WARN_ON_ONCE(kvm_arch_has_noncoherent_dma(vcpu->kvm));
> +	return (MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT) | VMX_EPT_IPAT_BIT;
> +}
> +
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu)
>  {
>  	struct kvm_cpuid_entry2 *e;
> diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> index 8ae689929347..d903e0f606d3 100644
> --- a/arch/x86/kvm/vmx/x86_ops.h
> +++ b/arch/x86/kvm/vmx/x86_ops.h
> @@ -154,6 +154,7 @@ void tdx_vm_free(struct kvm *kvm);
>  int tdx_vcpu_create(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_free(struct kvm_vcpu *vcpu);
>  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio);
>
>  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
>  int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp);
> @@ -176,6 +177,7 @@ static inline void tdx_vm_free(struct kvm *kvm) {}
>  static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
>  static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
>  static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> +static inline u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio) { return 0; }
>
>  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
>  static inline int tdx_vcpu_ioctl(struct kvm_vcpu *vcpu, void __user *argp) { return -EOPNOTSUPP; }
> --
> 2.25.1
>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 044/113] KVM: x86/tdp_mmu: Make handle_changed_spte() return value
  2023-01-12 16:31 ` [PATCH v11 044/113] KVM: x86/tdp_mmu: Make handle_changed_spte() return value isaku.yamahata
@ 2023-02-16 16:39   ` Zhi Wang
  0 siblings, 0 replies; 221+ messages in thread
From: Zhi Wang @ 2023-02-16 16:39 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, 12 Jan 2023 08:31:52 -0800
isaku.yamahata@intel.com wrote:

Some typos below:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> TDX operation can fail with TDX_OPERAND_BUSY when multiple vcpu try to
                                                             vcpus
> operation on same TDX resource like Secure EPT.  It doesn't spin and returns
  operate

> busy error to VMM so that VMM has to take action, e.g. retry or whatever.
> 
> Because TDP MMU uses read spin lock for scalability, spinlock around seam
> call busts TDP MMU effort.  The other option is to let SEAMCALL fail and
> page fault handler should retry.  Make handle_changed_spte() and its caller
> return values so that kvm page fault handler can return on such cases. This
> patch makes it return only zero.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/tdp_mmu.c | 72 +++++++++++++++++++++++++-------------
>  1 file changed, 47 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7ab498b80214..4fb07f91e5d6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -349,9 +349,9 @@ hpa_t kvm_tdp_mmu_get_vcpu_root_hpa(struct kvm_vcpu *vcpu)
>  	return __pa(root->spt);
>  }
>  
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level,
> -				bool shared);
> +static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> +					    u64 old_spte, u64 new_spte, int level,
> +					    bool shared);
>  
>  static void handle_changed_spte_acc_track(u64 old_spte, u64 new_spte, int level)
>  {
> @@ -445,6 +445,7 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  	struct kvm_mmu_page *sp = sptep_to_sp(rcu_dereference(pt));
>  	int level = sp->role.level;
>  	gfn_t base_gfn = sp->gfn;
> +	int ret;
>  	int i;
>  
>  	trace_kvm_mmu_prepare_zap_page(sp);
> @@ -516,8 +517,14 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>  			old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte,
>  							  REMOVED_SPTE, level);
>  		}
> -		handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> -				    old_spte, REMOVED_SPTE, level, shared);
> +		ret = handle_changed_spte(kvm, kvm_mmu_page_as_id(sp), gfn,
> +					  old_spte, REMOVED_SPTE, level, shared);
> +		/*
> +		 * We are removing page tables.  Because in TDX case we don't
> +		 * zap private page tables except tearing down VM.  It means
> +		 * no race condition.
> +		 */
> +		WARN_ON_ONCE(ret);
>  	}
>  
>  	call_rcu(&sp->rcu_head, tdp_mmu_free_sp_rcu_callback);
> @@ -538,9 +545,9 @@ static void handle_removed_pt(struct kvm *kvm, tdp_ptep_t pt, bool shared)
>   * Handle bookkeeping that might result from the modification of a SPTE.
>   * This function must be called for all TDP SPTE modifications.
>   */
> -static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				  u64 old_spte, u64 new_spte, int level,
> -				  bool shared)
> +static int __must_check __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> +					      u64 old_spte, u64 new_spte, int level,
> +					      bool shared)
>  {
>  	bool was_present = is_shadow_present_pte(old_spte);
>  	bool is_present = is_shadow_present_pte(new_spte);
> @@ -576,7 +583,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	}
>  
>  	if (old_spte == new_spte)
> -		return;
> +		return 0;
>  
>  	trace_kvm_tdp_mmu_spte_changed(as_id, gfn, level, old_spte, new_spte);
>  
> @@ -605,7 +612,7 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  			       "a temporary removed SPTE.\n"
>  			       "as_id: %d gfn: %llx old_spte: %llx new_spte: %llx level: %d",
>  			       as_id, gfn, old_spte, new_spte, level);
> -		return;
> +		return 0;
>  	}
>  
>  	if (is_leaf != was_leaf)
> @@ -624,17 +631,25 @@ static void __handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>  	if (was_present && !was_leaf &&
>  	    (is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
>  		handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);
> +
> +	return 0;
>  }
>  
> -static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> -				u64 old_spte, u64 new_spte, int level,
> -				bool shared)
> +static int __must_check handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
> +					    u64 old_spte, u64 new_spte, int level,
> +					    bool shared)
>  {
> -	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> -			      shared);
> +	int ret;
> +
> +	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level,
> +				    shared);
> +	if (ret)
> +		return ret;
> +
>  	handle_changed_spte_acc_track(old_spte, new_spte, level);
>  	handle_changed_spte_dirty_log(kvm, as_id, gfn, old_spte,
>  				      new_spte, level);
> +	return 0;
>  }
>  
>  /*
> @@ -653,12 +668,14 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
>   * * -EBUSY - If the SPTE cannot be set. In this case this function will have
>   *            no side-effects other than setting iter->old_spte to the last
>   *            known value of the spte.
> + * * -EAGAIN - Same to -EBUSY. But the source is from callbacks for private spt
>   */
> -static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
> -					  struct tdp_iter *iter,
> -					  u64 new_spte)
> +static inline int __must_check tdp_mmu_set_spte_atomic(struct kvm *kvm,
> +						       struct tdp_iter *iter,
> +						       u64 new_spte)
>  {
>  	u64 *sptep = rcu_dereference(iter->sptep);
> +	int ret;
>  
>  	/*
>  	 * The caller is responsible for ensuring the old SPTE is not a REMOVED
> @@ -677,15 +694,16 @@ static inline int tdp_mmu_set_spte_atomic(struct kvm *kvm,
>  	if (!try_cmpxchg64(sptep, &iter->old_spte, new_spte))
>  		return -EBUSY;
>  
> -	__handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> -			      new_spte, iter->level, true);
> -	handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
> +	ret = __handle_changed_spte(kvm, iter->as_id, iter->gfn, iter->old_spte,
> +				    new_spte, iter->level, true);
> +	if (!ret)
> +		handle_changed_spte_acc_track(iter->old_spte, new_spte, iter->level);
>  
> -	return 0;
> +	return ret;
>  }
>  

----

> -static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> -					  struct tdp_iter *iter)
> +static inline int __must_check tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> +						       struct tdp_iter *iter)
>  {
>  	int ret;
>  
----

The above part doesn't belong to this patch.

> @@ -750,6 +768,8 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  			      u64 old_spte, u64 new_spte, gfn_t gfn, int level,
>  			      bool record_acc_track, bool record_dirty_log)
>  {
> +	int ret;
> +
>  	lockdep_assert_held_write(&kvm->mmu_lock);
>  
>  	/*
> @@ -763,7 +783,9 @@ static u64 __tdp_mmu_set_spte(struct kvm *kvm, int as_id, tdp_ptep_t sptep,
>  
>  	old_spte = kvm_tdp_mmu_write_spte(sptep, old_spte, new_spte, level);
>  
> -	__handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> +	ret = __handle_changed_spte(kvm, as_id, gfn, old_spte, new_spte, level, false);
> +	/* Because write spin lock is held, no race.  It should success. */
> +	WARN_ON_ONCE(ret);
>  
>  	if (record_acc_track)
>  		handle_changed_spte_acc_track(old_spte, new_spte, level);


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2023-01-12 16:31 ` [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
  2023-01-17  2:40   ` Huang, Kai
@ 2023-02-17  8:27   ` Zhi Wang
  2023-02-27 22:02     ` Isaku Yamahata
  1 sibling, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-02-17  8:27 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: kvm, linux-kernel, isaku.yamahata, Paolo Bonzini, erdemaktas,
	Sean Christopherson, Sagi Shahar, David Matlack

On Thu, 12 Jan 2023 08:31:58 -0800
isaku.yamahata@intel.com wrote:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Some KVM MMU operations (dirty page logging, page migration, aging page)
> aren't supported for private GFNs (yet) with the first generation of TDX.
> Silently return on unsupported TDX KVM MMU operations.
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     |  3 +++
>  arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++++++++++++++++++++----
>  arch/x86/kvm/x86.c         |  3 +++
>  3 files changed, 51 insertions(+), 5 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 484e615196aa..ad0482a101a3 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -6635,6 +6635,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>  	for_each_rmap_spte(rmap_head, &iter, sptep) {
>  		sp = sptep_to_sp(sptep);
>  
> +		/* Private page dirty logging is not supported yet. */
> +		KVM_BUG_ON(is_private_sptep(sptep), kvm);
> +
>  		/*
>  		 * We cannot do huge page mapping for indirect shadow pages,
>  		 * which are found on the last rmap (level = 1) when not using
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 5ce0328c71df..69e202bd1897 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1478,7 +1478,8 @@ typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
>  
>  static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
>  						   struct kvm_gfn_range *range,
> -						   tdp_handler_t handler)
> +						   tdp_handler_t handler,
> +						   bool only_shared)

What's the purpose of having only_shared when all the callers set it to
true?

>  {
>  	struct kvm_mmu_page *root;
>  	struct tdp_iter iter;
> @@ -1489,9 +1490,23 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
>  	 * into this helper allow blocking; it'd be dead, wasteful code.
>  	 */
>  	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> +		gfn_t start;
> +		gfn_t end;
> +
> +		if (only_shared && is_private_sp(root))
> +			continue;
> +
>  		rcu_read_lock();
>  
> -		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> +		/*
> +		 * For TDX shared mapping, set GFN shared bit to the range,
> +		 * so the handler() doesn't need to set it, to avoid duplicated
> +		 * code in multiple handler()s.
> +		 */
> +		start = kvm_gfn_for_root(kvm, root, range->start);
> +		end = kvm_gfn_for_root(kvm, root, range->end);
> +

The coco implementations tend to treat the SHARED bit / C bit as a page_prot,
an attribute, not a part of the GFN.  From that perspective, the caller needs to
be aware of whether it is operating on private memory or shared memory, as does
the handler.  The page table walker should treat the SHARED bit as an attribute.

I don't think it is a good idea to have two different understandings, which
will cause conversion and confusion.

> +		tdp_root_for_each_leaf_pte(iter, root, start, end)
>  			ret |= handler(kvm, &iter, range);
>  
>  		rcu_read_unlock();
> @@ -1535,7 +1550,12 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
>  
>  bool kvm_tdp_mmu_age_gfn_range(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
> -	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range);
> +	/*
> +	 * First TDX generation doesn't support clearing A bit for private
> +	 * mapping, since there's no secure EPT API to support it.  However
> +	 * it's a legitimate request for TDX guest.
> +	 */
> +	return kvm_tdp_mmu_handle_gfn(kvm, range, age_gfn_range, true);
>  }
>  
>  static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
> @@ -1546,7 +1566,8 @@ static bool test_age_gfn(struct kvm *kvm, struct tdp_iter *iter,
>  
>  bool kvm_tdp_mmu_test_age_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  {
> -	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn);
> +	/* The first TDX generation doesn't support A bit. */
> +	return kvm_tdp_mmu_handle_gfn(kvm, range, test_age_gfn, true);
>  }
>  
>  static bool set_spte_gfn(struct kvm *kvm, struct tdp_iter *iter,
> @@ -1591,8 +1612,11 @@ bool kvm_tdp_mmu_set_spte_gfn(struct kvm *kvm, struct kvm_gfn_range *range)
>  	 * No need to handle the remote TLB flush under RCU protection, the
>  	 * target SPTE _must_ be a leaf SPTE, i.e. cannot result in freeing a
>  	 * shadow page.  See the WARN on pfn_changed in __handle_changed_spte().
> +	 *
> +	 * .change_pte() callback should not happen for private page, because
> +	 * for now TDX private pages are pinned during VM's life time.
>  	 */
> -	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn);
> +	return kvm_tdp_mmu_handle_gfn(kvm, range, set_spte_gfn, true);
>  }
>
If the mmu notifier callbacks will never operate on a private page, having a
WARN_ON() is better than silently ignoring it.

>  /*
> @@ -1974,6 +1998,13 @@ void kvm_tdp_mmu_clear_dirty_pt_masked(struct kvm *kvm,
>  	struct kvm_mmu_page *root;
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
> +	/*
> +	 * First TDX generation doesn't support clearing dirty bit,
> +	 * since there's no secure EPT API to support it.  For now silently
> +	 * ignore KVM_CLEAR_DIRTY_LOG.
> +	 */
> +	if (!kvm_arch_dirty_log_supported(kvm))
> +		return;
>  	for_each_tdp_mmu_root(kvm, root, slot->as_id)
>  		clear_dirty_pt_masked(kvm, root, gfn, mask, wrprot);
>  }
> @@ -2093,6 +2124,15 @@ bool kvm_tdp_mmu_write_protect_gfn(struct kvm *kvm,
>  	bool spte_set = false;
>  
>  	lockdep_assert_held_write(&kvm->mmu_lock);
> +
> +	/*
> +	 * First TDX generation doesn't support write protecting private
> +	 * mappings, silently ignore the request.  KVM_GET_DIRTY_LOG etc
> +	 * can reach here, no warning.
> +	 */
> +	if (!kvm_arch_dirty_log_supported(kvm))
> +		return false;
> +
>  	for_each_tdp_mmu_root(kvm, root, slot->as_id)
>  		spte_set |= write_protect_gfn(kvm, root, gfn, min_level);
> 

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 5b4d5f8128a5..c4579e696d39 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12526,6 +12526,9 @@ static void kvm_mmu_slot_apply_flags(struct kvm *kvm,
>  	u32 new_flags = new ? new->flags : 0;
>  	bool log_dirty_pages = new_flags & KVM_MEM_LOG_DIRTY_PAGES;
>  
> +	if (!kvm_arch_dirty_log_supported(kvm) && log_dirty_pages)
> +		return;
> +
>  	/*
>  	 * Update CPU dirty logging if dirty logging is being toggled.  This
>  	 * applies to all operations.


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module
  2023-01-16  4:19   ` Huang, Kai
@ 2023-02-27 21:20     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:20 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Mon, Jan 16, 2023 at 04:19:47AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TDX KVM needs system-wide information about the TDX module, struct
> > tdsysinfo_struct.  Add a helper function tdx_get_sysinfo() to return it
> > instead of KVM getting it with various error checks.  Make KVM call the
> > function and stash the info.  Move out the struct definition about it to
> > common place arch/x86/include/asm/tdx.h.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/tdx.h  | 54 +++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/tdx.c      | 49 ++++++++++++++++++++++++++++++++-
> >  arch/x86/virt/vmx/tdx/tdx.c | 21 ++++++++++++---
> >  arch/x86/virt/vmx/tdx/tdx.h | 51 -----------------------------------
> >  4 files changed, 119 insertions(+), 56 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/tdx.h b/arch/x86/include/asm/tdx.h
> > index ed9cf61ff8b4..2ca6e8ce1e43 100644
> > --- a/arch/x86/include/asm/tdx.h
> > +++ b/arch/x86/include/asm/tdx.h
> > @@ -105,6 +105,58 @@ static inline long tdx_kvm_hypercall(unsigned int nr, unsigned long p1,
> >  #endif /* CONFIG_INTEL_TDX_GUEST && CONFIG_KVM_GUEST */
> >  
> >  #ifdef CONFIG_INTEL_TDX_HOST
> > +struct tdx_cpuid_config {
> > +	u32	leaf;
> > +	u32	sub_leaf;
> > +	u32	eax;
> > +	u32	ebx;
> > +	u32	ecx;
> > +	u32	edx;
> > +} __packed;
> > +
> > +#define TDSYSINFO_STRUCT_SIZE		1024
> > +#define TDSYSINFO_STRUCT_ALIGNMENT	1024
> > +
> > +/*
> > + * The size of this structure itself is flexible.  The actual structure
> > + * passed to TDH.SYS.INFO must be padded to TDSYSINFO_STRUCT_SIZE and be
> > + * aligned to TDSYSINFO_STRUCT_ALIGNMENT using DECLARE_PADDED_STRUCT().
> > + */
> > +struct tdsysinfo_struct {
> > +	/* TDX-SEAM Module Info */
> > +	u32	attributes;
> > +	u32	vendor_id;
> > +	u32	build_date;
> > +	u16	build_num;
> > +	u16	minor_version;
> > +	u16	major_version;
> > +	u8	reserved0[14];
> > +	/* Memory Info */
> > +	u16	max_tdmrs;
> > +	u16	max_reserved_per_tdmr;
> > +	u16	pamt_entry_size;
> > +	u8	reserved1[10];
> > +	/* Control Struct Info */
> > +	u16	tdcs_base_size;
> > +	u8	reserved2[2];
> > +	u16	tdvps_base_size;
> > +	u8	tdvps_xfam_dependent_size;
> > +	u8	reserved3[9];
> > +	/* TD Capabilities */
> > +	u64	attributes_fixed0;
> > +	u64	attributes_fixed1;
> > +	u64	xfam_fixed0;
> > +	u64	xfam_fixed1;
> > +	u8	reserved4[32];
> > +	u32	num_cpuid_config;
> > +	/*
> > +	 * The actual number of CPUID_CONFIG depends on above
> > +	 * 'num_cpuid_config'.
> > +	 */
> > +	DECLARE_FLEX_ARRAY(struct tdx_cpuid_config, cpuid_configs);
> > +} __packed;
> > +
> > +const struct tdsysinfo_struct *tdx_get_sysinfo(void);
> >  bool platform_tdx_enabled(void);
> >  int tdx_enable(void);
> >  /*
> > @@ -120,6 +172,8 @@ void tdx_keyid_free(int keyid);
> >  u64 __seamcall(u64 op, u64 rcx, u64 rdx, u64 r8, u64 r9,
> >  	       struct tdx_module_output *out);
> >  #else	/* !CONFIG_INTEL_TDX_HOST */
> > +struct tdsysinfo_struct;
> > +static inline const struct tdsysinfo_struct *tdx_get_sysinfo(void) { return NULL; }
> >  static inline bool platform_tdx_enabled(void) { return false; }
> >  static inline int tdx_enable(void)  { return -EINVAL; }
> >  static inline int tdx_keyid_alloc(void) { return -EOPNOTSUPP; }
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 6c7d9ec53046..2adf5551ab26 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -11,9 +11,34 @@
> >  #undef pr_fmt
> >  #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
> >  
> > +#define TDX_MAX_NR_CPUID_CONFIGS					\
> > +	((TDSYSINFO_STRUCT_SIZE -					\
> > +		offsetof(struct tdsysinfo_struct, cpuid_configs))	\
> > +		/ sizeof(struct tdx_cpuid_config))
> > +
> > +struct tdx_capabilities {
> > +	u8 tdcs_nr_pages;
> > +	u8 tdvpx_nr_pages;
> > +
> > +	u64 attrs_fixed0;
> > +	u64 attrs_fixed1;
> > +	u64 xfam_fixed0;
> > +	u64 xfam_fixed1;
> > +
> > +	u32 nr_cpuid_configs;
> > +	struct tdx_cpuid_config cpuid_configs[TDX_MAX_NR_CPUID_CONFIGS];
> > +};
> > +
> > +/* Capabilities of KVM + the TDX module. */
> > +static struct tdx_capabilities tdx_caps;
> > +
> >  static int __init tdx_module_setup(void)
> >  {
> > -	int ret;
> > +	const struct tdsysinfo_struct *tdsysinfo;
> > +	int ret = 0;
> > +
> > +	BUILD_BUG_ON(sizeof(*tdsysinfo) > TDSYSINFO_STRUCT_SIZE);
> > +	BUILD_BUG_ON(TDX_MAX_NR_CPUID_CONFIGS != 37);
> >  
> >  	ret = tdx_enable();
> >  	if (ret) {
> > @@ -21,6 +46,28 @@ static int __init tdx_module_setup(void)
> >  		return ret;
> >  	}
> >  
> > +	tdsysinfo = tdx_get_sysinfo();
> > +	if (tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS)
> > +		return -EIO;
> 
> This check basically means TDX module is buggy (or kernel has bug).  Do we
> really need this check?  Is TDX module that buggy?
> 
> IMHO you don't need this one, or use WARN() if you want to catch kernel bug?

Ok, I made it WARN_ON().
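
Roughly like the following (sketch of the revised check; the exact placement
in v12 may differ):

	tdsysinfo = tdx_get_sysinfo();
	if (WARN_ON(tdsysinfo->num_cpuid_config > TDX_MAX_NR_CPUID_CONFIGS))
		return -EIO;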

> > +
> > +	tdx_caps = (struct tdx_capabilities) {
> > +		.tdcs_nr_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
> > +		/*
> > +		 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
> > +		 * -1 for TDVPR.
> > +		 */
> > +		.tdvpx_nr_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
> > +		.attrs_fixed0 = tdsysinfo->attributes_fixed0,
> > +		.attrs_fixed1 = tdsysinfo->attributes_fixed1,
> > +		.xfam_fixed0 =	tdsysinfo->xfam_fixed0,
> > +		.xfam_fixed1 = tdsysinfo->xfam_fixed1,
> > +		.nr_cpuid_configs = tdsysinfo->num_cpuid_config,
> > +	};
> > +	if (!memcpy(tdx_caps.cpuid_configs, tdsysinfo->cpuid_configs,
> > +			tdsysinfo->num_cpuid_config *
> > +			sizeof(struct tdx_cpuid_config)))
> > +		return -EIO;
> 
> Why introducing 'struct tdx_capabilities' and above code here in this patch?
> 
> It's entirely not clear why the new structure is needed -- nothing mentioned in
> changelog, nor there's any comment.  Please explain in the changelog or move
> this chunk to where it is needed.
> 
> Technical side, is 'struct tdx_capabilities' really needed? Or is it just for
> convenience?

I made use of tdsysinfo where possible.
Because it's convenient to cache the number of tdcs_pages and tdvps_pages, I
renamed tdx_capabilities to tdx_info to keep those two, and moved it to the
patch that uses the values.
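
Roughly the following shape (sketch; the final struct and field names in v12
may differ):

	/* Info about the TDX module needed for TD/vCPU creation. */
	struct tdx_info {
		u8 nr_tdcs_pages;
		u8 nr_tdvpx_pages;
	};
	static struct tdx_info tdx_info;

	static int __init tdx_module_setup(void)
	{
		const struct tdsysinfo_struct *tdsysinfo;
		...
		tdsysinfo = tdx_get_sysinfo();
		tdx_info = (struct tdx_info) {
			.nr_tdcs_pages = tdsysinfo->tdcs_base_size / PAGE_SIZE,
			/*
			 * TDVPS = TDVPR(4K page) + TDVPX(multiple 4K pages).
			 * -1 for TDVPR.
			 */
			.nr_tdvpx_pages = tdsysinfo->tdvps_base_size / PAGE_SIZE - 1,
		};
		...
	}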

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl
  2023-01-19  2:40   ` Huang, Kai
@ 2023-02-27 21:22     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:22 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, Jan 19, 2023 at 02:40:17AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 16053ec3e0ae..781fbc896120 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -37,6 +37,14 @@ static int vt_vm_init(struct kvm *kvm)
> >  	return vmx_vm_init(kvm);
> >  }
> >  
> > +static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> > +{
> > +	if (!is_td(kvm))
> > +		return -ENOTTY;
> > +
> > +	return tdx_vm_ioctl(kvm, argp);
> > +}
> > +
> >  struct kvm_x86_ops vt_x86_ops __initdata = {
> >  	.name = KBUILD_MODNAME,
> >  
> > @@ -179,6 +187,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> >  	.vcpu_deliver_sipi_vector = kvm_vcpu_deliver_sipi_vector,
> >  
> >  	.dev_mem_enc_ioctl = tdx_dev_ioctl,
> > +	.mem_enc_ioctl = vt_mem_enc_ioctl,
> >  };
> 
> IIUC, now both AMD and Intel have mem_enc_ioctl() callback implemented, so the
> KVM_X86_OP_OPTIONAL() of it can be changed to KVM_X86_OP(), and the function
> pointer check can be removed in the IOCTL:

Makes sense. I merged the following change into the patch. Thanks for it.


> 
> diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-
> ops.h
> index 8dc345cc6318..a59852fb5e2a 100644
> --- a/arch/x86/include/asm/kvm-x86-ops.h
> +++ b/arch/x86/include/asm/kvm-x86-ops.h
> @@ -116,7 +116,7 @@ KVM_X86_OP(enter_smm)
>  KVM_X86_OP(leave_smm)
>  KVM_X86_OP(enable_smi_window)
>  #endif
> -KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> +KVM_X86_OP(mem_enc_ioctl)
>  KVM_X86_OP_OPTIONAL(mem_enc_register_region)
>  KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
>  KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c936f8d28a53..dfa279e35478 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -6937,10 +6937,6 @@ long kvm_arch_vm_ioctl(struct file *filp,
>                 goto out;
>         }
>         case KVM_MEMORY_ENCRYPT_OP: {
> -               r = -ENOTTY;
> -               if (!kvm_x86_ops.mem_enc_ioctl)
> -                       goto out;
> -
>                 r = static_call(kvm_x86_mem_enc_ioctl)(kvm, argp);
>                 break;
>         }
> 
> [snip]
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-01-16  4:44   ` Huang, Kai
@ 2023-02-27 21:26     ` Isaku Yamahata
  2023-02-28 21:57       ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:26 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Mon, Jan 16, 2023 at 04:44:21AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TDX attestation includes the maximum number of vcpu that the guest can
> > accommodate.  
> > 
> 
> I don't understand why "attestation" is the reason here.  Let's say TDX is used
> w/o attestation, I don't think this patch can be discarded?
>
> IMHO the true reason is TDX has it's own control of maximum number of vcpus,
> i.e. asking you to specify the value when creating the TD.  Therefore, the
> constant KVM_MAX_VCPUS doesn't work for TDX guest anymore.

Without TDX attestation, this can be discarded.  The TD is created with
max_vcpus=KVM_MAX_VCPUS by default.


> 
>  
> > For that, the maximum number of vcpu needs to be specified
> > instead of constant, KVM_MAX_VCPUS.  Make KVM_ENABLE_CAP support
> > KVM_CAP_MAX_VCPUS.
> > 
> > Suggested-by: Sagi Shahar <sagis@google.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  virt/kvm/kvm_main.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index a235b628b32f..1cfa7da92ad0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4945,7 +4945,27 @@ static int kvm_vm_ioctl_enable_cap_generic(struct kvm *kvm,
> >  		}
> >  
> >  		mutex_unlock(&kvm->slots_lock);
> > +		return r;
> > +	}
> > +	case KVM_CAP_MAX_VCPUS: {
> > +		int r;
> >  
> > +		if (cap->flags || cap->args[0] == 0)
> > +			return -EINVAL;
> > +		if (cap->args[0] >  kvm_vm_ioctl_check_extension(kvm, KVM_CAP_MAX_VCPUS))
> > +			return -E2BIG;
> > +
> > +		mutex_lock(&kvm->lock);
> > +		/* Only decreasing is allowed. */
> 
> Why?

I'll make it x86 specific and drop this check.
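
Roughly like this in kvm_vm_ioctl_enable_cap() in arch/x86/kvm/x86.c (untested
sketch; the "only decreasing" restriction is gone):

	case KVM_CAP_MAX_VCPUS: {
		if (cap->flags || cap->args[0] == 0 ||
		    cap->args[0] > KVM_MAX_VCPUS)
			return -EINVAL;

		mutex_lock(&kvm->lock);
		if (kvm->created_vcpus) {
			r = -EBUSY;
		} else {
			kvm->max_vcpus = cap->args[0];
			r = 0;
		}
		mutex_unlock(&kvm->lock);
		return r;
	}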


> > +		if (cap->args[0] > kvm->max_vcpus)
> > +			r = -E2BIG;
> > +		else if (kvm->created_vcpus)
> > +			r = -EBUSY;
> > +		else {
> > +			kvm->max_vcpus = cap->args[0];
> > +			r = 0;
> > +		}
> > +		mutex_unlock(&kvm->lock);
> >  		return r;
> >  	}
> >  	default:
> 
> Also, IIUC this change is made to the generic kvm_main.c, which means other
> archs are  affected too.  Is this OK to other archs?  Why such change cannot
> TDX-specific (or, at least x86, or vmx specific)? 

Ok, I made it x86 specific. 
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-01-13 12:55   ` Zhi Wang
@ 2023-02-27 21:28     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:28 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Fri, Jan 13, 2023 at 02:55:07PM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Thu, 12 Jan 2023 08:31:25 -0800
> isaku.yamahata@intel.com wrote:
> 
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TDX attestation includes the maximum number of vcpu that the guest can
> > accommodate.  For that, the maximum number of vcpu needs to be specified
> > instead of constant, KVM_MAX_VCPUS.  Make KVM_ENABLE_CAP support
> > KVM_CAP_MAX_VCPUS.
> > 
> > Suggested-by: Sagi Shahar <sagis@google.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  virt/kvm/kvm_main.c | 20 ++++++++++++++++++++
> >  1 file changed, 20 insertions(+)
> > 
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index a235b628b32f..1cfa7da92ad0 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -4945,7 +4945,27 @@ static int kvm_vm_ioctl_enable_cap_generic(struct
> > kvm *kvm, }
> >  
> >  		mutex_unlock(&kvm->slots_lock);
> > +		return r;
> > +	}
> > +	case KVM_CAP_MAX_VCPUS: {
> 
> Better mention the KVM_CAP_MAX_VCPUS defined in XXX patch in the comments.

This already exists as a KVM API, documented in Documentation/virt/kvm/api.rst,
to get the possible maximum number of vcpus.
This patch makes it settable.
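
For reference, a hypothetical user space (e.g. qemu) usage would look like the
following (sketch; nr_vcpus is whatever value the VMM wants reflected in
attestation, and it must be set before creating any vCPU):

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MAX_VCPUS,
		.args[0] = nr_vcpus,
	};

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap))
		err(1, "KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS)");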
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters
  2023-01-16 10:04   ` Huang, Kai
@ 2023-02-27 21:32     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:32 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, Li, Xiaoyao, pbonzini,
	Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Mon, Jan 16, 2023 at 10:04:19AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > +struct kvm_tdx_init_vm {
> > +	__u64 attributes;
> > +	__u64 mrconfigid[6];	/* sha384 digest */
> > +	__u64 mrowner[6];	/* sha384 digest */
> > +	__u64 mrownerconfig[6];	/* sha348 digest */
> > +	union {
> > +		/*
> > +		 * KVM_TDX_INIT_VM is called before vcpu creation, thus before
> > +		 * KVM_SET_CPUID2.  CPUID configurations needs to be passed.
> > +		 *
> > +		 * This configuration supersedes KVM_SET_CPUID{,2}.
> 
> What does "{,2}" mean?

Both KVM_SET_CPUID and KVM_SET_CPUID2.  For the comment's purpose,
KVM_SET_CPUID2 alone suffices. I'll drop "{,2}".

> 
> > +		 * The user space VMM, e.g. qemu, should make them consistent
> > +		 * with this values.
> 
> You are already using 'struct kvm_cpuid2' below.  Isn't it enough to imply that
> userspace should organize data in the format of 'struct kvm_cpuid2'?
> 
> > +		 * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(256)
> > +		 * = 8KB.
> > +		 */
> 
> What does this comment try to imply?
> 
> > +		struct {
> > +			struct kvm_cpuid2 cpuid;
> > +			/* 8KB with KVM_MAX_CPUID_ENTRIES. */
> > +			struct kvm_cpuid_entry2 entries[];
> 
> I don't understand what's the purpose of the second field?
> 
> Shouldn't the 'struct kvm_cpuid2' already have all the CPUID entries?
> 
> > +		};
> > +		/*
> > +		 * For future extensibility.
> > +		 * The size(struct kvm_tdx_init_vm) = 16KB.
> > +		 * This should be enough given sizeof(TD_PARAMS) = 1024
> > +		 */
> > +		__u64 reserved[2029];
> 
> I think this is just wrong.  How can you extend something after a dynamic size
> CPUID array?
> 
> If you want extensibility, you need to put the space before the flexible array.
> 
> > +	};
> > +};

I changed the struct as follows.

struct kvm_tdx_init_vm {
        __u64 attributes;
        __u64 mrconfigid[6];    /* sha384 digest */
        __u64 mrowner[6];       /* sha384 digest */
        __u64 mrownerconfig[6]; /* sha348 digest */
        /*
         * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
         * This should be enough given sizeof(TD_PARAMS) = 1024.
         * 8KB was chosen given because
         * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
         */
        __u64 reserved[1004];

        /*
         * KVM_TDX_INIT_VM is called before vcpu creation, thus before
         * KVM_SET_CPUID2.
         * This configuration supersedes KVM_SET_CPUID2s for VCPUs. The user
         * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with this
         * values.
         */
        struct kvm_cpuid2 cpuid;
};
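
For what it's worth, a hypothetical user space sketch for filling it (assuming
the entries come from KVM_GET_SUPPORTED_CPUID and 256 entries is the upper
bound mentioned in the comment):

	struct kvm_tdx_init_vm *init_vm;
	size_t size = sizeof(*init_vm) + 256 * sizeof(struct kvm_cpuid_entry2);

	init_vm = calloc(1, size);
	init_vm->attributes = attributes;
	init_vm->cpuid.nent = nent;
	memcpy(init_vm->cpuid.entries, entries, nent * sizeof(entries[0]));
	/* then pass it via the KVM_MEMORY_ENCRYPT_OP ioctl with id KVM_TDX_INIT_VM */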

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters
  2023-01-17 12:19   ` Huang, Kai
@ 2023-02-27 21:44     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:44 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, Li, Xiaoyao, pbonzini,
	Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Tue, Jan 17, 2023 at 12:19:15PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> > +/*
> > + * cpuid entry lookup in TDX cpuid config way.
> > + * The difference is how to specify index(subleaves).
> 
> AFAICT you only have one caller here.  If this is the only difference, will it
> be simpler to ask caller to simply convert TDX_CPUID_NO_SUBLEAF to 0, so this
> function can perhaps be removed?

I removed the function, as it turned out that only one caller needs it after
the revision.


> > +static int setup_tdparams(struct kvm *kvm, struct td_params *td_params,
> > +			struct kvm_tdx_init_vm *init_vm)
> > +{
> > +	const struct kvm_cpuid2 *cpuid = &init_vm->cpuid;
> > +	const struct kvm_cpuid_entry2 *entry;
> > +	u64 guest_supported_xcr0;
> > +	u64 guest_supported_xss;
> > +	int max_pa;
> > +	int i;
> > +
> > +	if (kvm->created_vcpus)
> > +		return -EBUSY;
> > +	td_params->max_vcpus = kvm->max_vcpus;
> > +	td_params->attributes = init_vm->attributes;
> > +	if (td_params->attributes & TDX_TD_ATTRIBUTE_PERFMON) {
> > +		/*
> > +		 * TODO: save/restore PMU related registers around TDENTER.
> > +		 * Once it's done, remove this guard.
> > +		 */
> > +		pr_warn("TD doesn't support perfmon yet. KVM needs to save/restore "
> > +			"host perf registers properly.\n");
> > +		return -EOPNOTSUPP;
> > +	}
> > +
> > +	for (i = 0; i < tdx_caps.nr_cpuid_configs; i++) {
> > +		const struct tdx_cpuid_config *config = &tdx_caps.cpuid_configs[i];
> > +		const struct kvm_cpuid_entry2 *entry =
> > +			tdx_find_cpuid_entry(cpuid, config->leaf, config->sub_leaf);
> > +		struct tdx_cpuid_value *value = &td_params->cpuid_values[i];
> > +
> > +		if (!entry)
> > +			continue;
> > +
> > +		value->eax = entry->eax & config->eax;
> > +		value->ebx = entry->ebx & config->ebx;
> > +		value->ecx = entry->ecx & config->ecx;
> > +		value->edx = entry->edx & config->edx;
> > +	}
> 
> A comment to explain above would be helpful, i.e TDX requires the number and the
> order of those entries in TD_PARAMS's cpuid_values[] must be in the same number
> and order with TDSYSINFO's CPUID_CONFIG.
> 
> Also, this code depends on @td_params already being zeroed.  Perhaps also point
> it out.

Ok, I'll add a comment.


> [snip]
> 
> > +static int tdx_td_init(struct kvm *kvm, struct kvm_tdx_cmd *cmd)
> > +{
> > 
> [snip]
> 
> > +
> > +	ret = setup_tdparams(kvm, td_params, init_vm);
> > +	if (ret)
> > +		goto out;
> > +
> > +	ret = __tdx_td_init(kvm, td_params);
> > +	if (ret)
> > +		goto out;
> > +
> > +	kvm_tdx->tsc_offset = td_tdcs_exec_read64(kvm_tdx, TD_TDCS_EXEC_TSC_OFFSET);
> > +	kvm_tdx->attributes = td_params->attributes;
> > +	kvm_tdx->xfam = td_params->xfam;
> > +
> > +out:
> > +	/* kfree() accepts NULL. */
> > +	kfree(init_vm);
> > +	kfree(td_params);
> 
> So looks KVM doesn't CPUID configurations that are passed to the TDX module.  
> 
> IIUC, KVM still depends on userspace to later use KVM_SET_CPUID2 to fill the
> _same_ CPUID entries for each vcpu?  If so, what if userspace didn't provide
> consistent CPUIDs in KVM_SET_CPUID2?  Should we verify in KVM_SET_CPUID2 that
> CPUIDs are consistent?

Yes. The guest might be confused, but KVM is fine with it. I don't think the
check is necessary.

User space is already required to configure consistent CPUIDs (and MSRs), and
KVM doesn't do such a consistency check for them, except the minimum checks
needed for KVM to work correctly, e.g. the highest leaf number. This patch
doesn't make it worse.


> I am thinking if some #VE handling requires CPUID to make some decision, then
> inconsistent CPUIDs will cause trouble, but I don't have an example now.

Do you mean TDG.VP.VMCALL<CPUID>?  Because the guest TD doesn't trust the VMM,
the #VE handler in the guest should sanitize the result and should be robust
against an attack that returns a broken value for TDG.VP.VMCALL<CPUID> (or any
other TDG.VP.VMCALL).
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package
  2023-01-16 10:23   ` Huang, Kai
@ 2023-02-27 21:48     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:48 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Mon, Jan 16, 2023 at 10:23:16AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > In order to reclaim TDX HKID, (i.e. when deleting guest TD), needs to call
> > TDH.PHYMEM.PAGE.WBINVD on all packages.  If we have used TDX HKID, refuse
> > to offline the last online cpu. Add arch callback for cpu offline.
> 
> I think it is worth to talk about suspend staff, i.e. why we only refuse to
> offline the last cpu when there's active TD, but not choose to offline the last
> cpu when TDX is enabled in KVM.  People may not be able to understand
> immediately the reason behind this design.

Updated the comment.

> Btw, I certainly don't want to speak for Sean, but it seems this was suggested
> by Sean?  If so, add a 'Suggested-by' tag?

Added suggested-by.

> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> > 
> 
> [snip]
> 
> > +
> > +int tdx_offline_cpu(void)
> > +{
> > +	int curr_cpu = smp_processor_id();
> > +	cpumask_var_t packages;
> > +	int ret = 0;
> > +	int i;
> > +
> > +	if (!atomic_read(&nr_configured_hkid))
> > +		return 0;
> 
> As mentioned above, I think it also worth to add some comment here.  When people
> are trying to understand some code, I think mostly they are just going to look
> at the code itself, but won't use 'git blame' to dig out the entire changelog to
> understand some code.

Makes sense. Added a comment.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-27 21:36       ` Zhi Wang
@ 2023-02-27 21:50         ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:50 UTC (permalink / raw)
  To: Zhi Wang
  Cc: Sean Christopherson, isaku.yamahata, kvm, linux-kernel,
	isaku.yamahata, Paolo Bonzini, erdemaktas, Sagi Shahar,
	David Matlack, Sean Christopherson, kai.huang

On Fri, Jan 27, 2023 at 11:36:52PM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Wed, 25 Jan 2023 17:22:08 +0000
> Sean Christopherson <seanjc@google.com> wrote:
> 
> > On Wed, Jan 25, 2023, Zhi Wang wrote:
> > > On Thu, 12 Jan 2023 08:31:38 -0800
> > > isaku.yamahata@intel.com wrote:
> > > 
> > > This refactor patch is quite hacky.
> > > 
> > > Why not change the purpose of vcpu->arch.mmu_shadow_page.gfp_zero and let the
> > > callers respect that the initial value of spte can be configurable? It will be
> > > generic and not TDX-specific, then kvm_init_shadow_page() is not required,
> > > mmu_topup_shadow_page_cache() can be left un-touched as the refactor can cover
> > > other architectures.
> > > 
> > > 1) Let it store the expected nonpresent value and rename it to nonpresent_spte.
> > 
> > 
> > I agree that handling this in the common code would be cleaner, but repurposing
> > gfp_zero gets kludgy because it would require a magic value to say "don't initialize
> > the data", e.g. x86's mmu_shadowed_info_cache isn't pre-filled.
> > 
> > And supporting a custom 64-bit init value for kmem_cache-backed caches would require
> > restricting such caches to be a multiple of 8 bytes in size.
> > 
> > How about this?  Lightly tested.
> > 
> > From: Sean Christopherson <seanjc@google.com>
> > Date: Wed, 25 Jan 2023 16:55:01 +0000
> > Subject: [PATCH] KVM: Allow page-sized MMU caches to be initialized with
> >  custom 64-bit values
> >
> 
> It looks good enough so far although it only supports 64bit init value. But
> it can be extended in the future. 
> 
> Just want to make sure people are thinking the same:
> 
> 1) Keep the changes of SHADOW_NONPRESENT_VALUE and REMOVED_SPTE in TDX patch.
> init_value stays as a generic feature in the kvm mmu cache layer. It is *not*
> going to replace SHADOW_NONPRESENT_VALUE.
> 
> 2) TDX kvm_x86_vcpu_create sets the SHADOW_NONPRESENT value into init_value.
> 
> 3) mmu cache topping up function initializes the page according to init_value
> with Sean's patch. 

The patch worked for me. I included it in the v12 patch series.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE
  2023-01-26 22:01         ` Sean Christopherson
@ 2023-02-27 21:52           ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Huang, Kai, zhi.wang.linux, sean.j.christopherson, linux-kernel,
	Shahar, Sagi, isaku.yamahata, Aktas, Erdem, dmatlack, kvm,
	Yamahata, Isaku, pbonzini

On Thu, Jan 26, 2023 at 10:01:41PM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Thu, Jan 26, 2023, Huang, Kai wrote:
> > On Wed, 2023-01-25 at 17:22 +0000, Sean Christopherson wrote:
> > > I agree that handling this in the common code would be cleaner, but repurposing
> > > gfp_zero gets kludgy because it would require a magic value to say "don't initialize
> > > the data", e.g. x86's mmu_shadowed_info_cache isn't pre-filled.
> 
> ...
> 
> > > @@ -400,6 +405,13 @@ int __kvm_mmu_topup_memory_cache(struct kvm_mmu_memory_cache *mc, int capacity,
> > >  		if (WARN_ON_ONCE(!capacity))
> > >  			return -EIO;
> > >  
> > > +		/*
> > > +		 * Custom init values can be used only for page allocations,
> > > +		 * and obviously conflict with __GFP_ZERO.
> > > +		 */
> > > +		if (WARN_ON_ONCE(mc->init_value && (mc->kmem_cache || mc->gfp_zero)))
> > > +			return -EIO;
> > > +
> > >  		mc->objects = kvmalloc_array(sizeof(void *), capacity, gfp);
> > >  		if (!mc->objects)
> > >  			return -ENOMEM;
> > > 
> > > base-commit: 503f0315c97739d3f8e645c500d81757dfbf76be
> > 
> > init_value and gfp_zone is kinda redundant.  How about removing gfp_zero
> > completely?
> > 
> > 	mmu_memory_cache_alloc_obj(...)
> > 	{
> > 		...
> > 		if (!mc->init_value)
> > 			gfp_flags |= __GFP_ZERO;
> > 		...
> > 	}
> > 
> > And in kvm_mmu_create() you initialize all caches' init_value explicitly.
> 
> No, as mentioned above there's also a "don't initialize the data" case.  Leaving
> init_value=0 means those users would see unnecessary zeroing, and again I don't
> want to use a magic value to say "don't initialize".

Abusing gfp_flags for zeroing doesn't work for the Secure-EPT page table: it
doesn't need zeroing and has no initial value.  So I used the patch as is.
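
Concretely, the arch side then only needs something like this (sketch; the
init_value field follows Sean's patch above and SHADOW_NONPRESENT_VALUE comes
from the earlier patches in this series):

	/* in kvm_mmu_create() */
	vcpu->arch.mmu_shadow_page_cache.init_value = SHADOW_NONPRESENT_VALUE;
	if (!vcpu->arch.mmu_shadow_page_cache.init_value)
		vcpu->arch.mmu_shadow_page_cache.gfp_zero = __GFP_ZERO;

while the cache backing the Secure-EPT (private) page tables leaves init_value
at 0 and doesn't set gfp_zero, so its pages are neither pre-filled nor zeroed.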
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE
  2023-01-16 10:54   ` Huang, Kai
@ 2023-02-27 21:53     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:53 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, sean.j.christopherson,
	pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Mon, Jan 16, 2023 at 10:54:35AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Sean Christopherson <sean.j.christopherson@intel.com>
> > 
> > For TD guest, the current way to emulate MMIO doesn't work any more, as KVM
> > is not able to access the private memory of TD guest and do the emulation.
> > Instead, TD guest expects to receive #VE when it accesses the MMIO and then
> > it can explicitly make hypercall to KVM to get the expected information.
> > 
> > To achieve this, the TDX module always enables "EPT-violation #VE" in the
> > VMCS control.  And accordingly, for the MMIO spte for the shared GPA,
> > 1. KVM needs to set "suppress #VE" bit for the non-present SPTE so that EPT
> > violation happens on TD accessing MMIO range.  2. On EPT violation, KVM
> > sets the MMIO spte to clear "suppress #VE" bit so the TD guest can receive
> > the #VE instead of EPT misconfigration unlike VMX case.  For the shared GPA
> > that is not populated yet, EPT violation need to be triggered when TD guest
> > accesses such shared GPA.  The non-present SPTE value for shared GPA should
> > set "suppress #VE" bit.
> > 
> > Add "suppress #VE" bit (bit 63) to SHADOW_NONPRESENT_VALUE and
> > REMOVED_SPTE.  Unconditionally set the "suppress #VE" bit (which is bit 63)
> > for both AMD and Intel as: 1) AMD hardware doesn't use this bit when
> > present bit is off; 2) for normal VMX guest, KVM never enables the
> > "EPT-violation #VE" in VMCS control and "suppress #VE" bit is ignored by
> > hardware.
> > 
> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/vmx.h |  1 +
> >  arch/x86/kvm/mmu/spte.h    | 15 ++++++++++++++-
> >  arch/x86/kvm/mmu/tdp_mmu.c |  8 ++++++++
> >  3 files changed, 23 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
> > index 498dc600bd5c..cdbf12c1a83c 100644
> > --- a/arch/x86/include/asm/vmx.h
> > +++ b/arch/x86/include/asm/vmx.h
> > @@ -511,6 +511,7 @@ enum vmcs_field {
> >  #define VMX_EPT_IPAT_BIT    			(1ull << 6)
> >  #define VMX_EPT_ACCESS_BIT			(1ull << 8)
> >  #define VMX_EPT_DIRTY_BIT			(1ull << 9)
> > +#define VMX_EPT_SUPPRESS_VE_BIT			(1ull << 63)
> 
> I don't know whether you should introduce this macro, since it's not used in
> this patch.
> 
> >  #define VMX_EPT_RWX_MASK                        (VMX_EPT_READABLE_MASK |       \
> >  						 VMX_EPT_WRITABLE_MASK |       \
> >  						 VMX_EPT_EXECUTABLE_MASK)
> > diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> > index f190eaf6b2b5..471378ee9071 100644
> > --- a/arch/x86/kvm/mmu/spte.h
> > +++ b/arch/x86/kvm/mmu/spte.h
> > @@ -148,7 +148,20 @@ static_assert(MMIO_SPTE_GEN_LOW_BITS == 8 && MMIO_SPTE_GEN_HIGH_BITS == 11);
> >  
> >  #define MMIO_SPTE_GEN_MASK		GENMASK_ULL(MMIO_SPTE_GEN_LOW_BITS + MMIO_SPTE_GEN_HIGH_BITS - 1, 0)
> >  
> > +/*
> > + * Non-present SPTE value for both VMX and SVM for TDP MMU.
> > + * For SVM NPT, for non-present spte (bit 0 = 0), other bits are ignored.
> > + * For VMX EPT, bit 63 is ignored if #VE is disabled. (EPT_VIOLATION_VE=0)
> > + *              bit 63 is #VE suppress if #VE is enabled. (EPT_VIOLATION_VE=1)
> > + * For TDX:
> > + *   TDX module sets EPT_VIOLATION_VE for Secure-EPT and conventional EPT
> > + */
> > +#ifdef CONFIG_X86_64
> > +#define SHADOW_NONPRESENT_VALUE	BIT_ULL(63)
> > +static_assert(!(SHADOW_NONPRESENT_VALUE & SPTE_MMU_PRESENT_MASK));
> > +#else
> >  #define SHADOW_NONPRESENT_VALUE	0ULL
> > +#endif
> >  
> >  extern u64 __read_mostly shadow_host_writable_mask;
> >  extern u64 __read_mostly shadow_mmu_writable_mask;
> > @@ -195,7 +208,7 @@ extern u64 __read_mostly shadow_nonpresent_or_rsvd_mask;
> >   *
> >   * Only used by the TDP MMU.
> >   */
> > -#define REMOVED_SPTE	0x5a0ULL
> > +#define REMOVED_SPTE	(SHADOW_NONPRESENT_VALUE | 0x5a0ULL)
> 
> This chunk belongs to the previous patch.

Yes, I moved this hunk to the previous patch together with
VMX_EPT_SUPPRESS_VE_BIT.

> >  /* Removed SPTEs must not be misconstrued as shadow present PTEs. */
> >  static_assert(!(REMOVED_SPTE & SPTE_MMU_PRESENT_MASK));
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 9cf5844dd34a..6111e3e9266d 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -700,6 +700,14 @@ static inline int tdp_mmu_zap_spte_atomic(struct kvm *kvm,
> >  	 * overwrite the special removed SPTE value. No bookkeeping is needed
> >  	 * here since the SPTE is going from non-present to non-present.  Use
> >  	 * the raw write helper to avoid an unnecessary check on volatile bits.
> > +	 *
> > +	 * Set non-present value to SHADOW_NONPRESENT_VALUE, rather than 0.
> > +	 * It is because when TDX is enabled, TDX module always
> > +	 * enables "EPT-violation #VE", so KVM needs to set
> > +	 * "suppress #VE" bit in EPT table entries, in order to get
> > +	 * real EPT violation, rather than TDVMCALL.  KVM sets
> > +	 * SHADOW_NONPRESENT_VALUE (which sets "suppress #VE" bit) so it
> > +	 * can be set when EPT table entries are zapped.
> >  	 */
> >  	__kvm_tdp_mmu_write_spte(iter->sptep, SHADOW_NONPRESENT_VALUE);
> >  
> 
> I don't quite think this place is the good place to explain "suppress  #VE" bit
> staff.  There are other places that sets non-present SPTE too.  Perhaps putting
> the comment around the macro 'SHADOW_NONPRESENT_VALUE' is better.

Dropped this comment from this patch.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis
  2023-01-16 11:16   ` Huang, Kai
@ 2023-02-27 21:58     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 21:58 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, sean.j.christopherson,
	pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Mon, Jan 16, 2023 at 11:16:04AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 6111e3e9266d..dffacb7eb15a 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -19,6 +19,14 @@ int kvm_mmu_init_tdp_mmu(struct kvm *kvm)
> >  {
> >  	struct workqueue_struct *wq;
> >  
> > +	/*
> > +	 * TDs require mmio_caching to clear suppress_ve bit of SPTE for GPA
> > +	 * of MMIO so that TD can convert #VE triggered by MMIO into
> > +	 * TDG.VP.VMCALL<MMIO>.
> > +	 */
> > +	if (kvm->arch.vm_type == KVM_X86_TDX_VM && !enable_mmio_caching)
> > +		return -EOPNOTSUPP;
> 
> SEV-ES does the check in hardware_setup:
> 
> void __init sev_hardware_setup(void)
> {
> 	...
> 	/*
>          * SEV-ES requires MMIO caching as KVM doesn't have access to the guest
>          * instruction stream, i.e. can't emulate in response to a #NPF and
>          * instead relies on #NPF(RSVD) being reflected into the guest as #VC
>          * (the guest can then do a #VMGEXIT to request MMIO emulation).
>          */
>         if (!enable_mmio_caching)
>                 goto out;
> 
> 	...
> }
> 
> TDX should be done in the same way.
> 
> And IMO this chunk really doesn't belong to this patch -- I interpret this patch
> as a "infrastructure patch to track shadow MMIO value on per-VM basis" (which
> even should have no functional change IMHO), but this chunk is clearly doing
> more than that.

It's cleaner to do this in hardware_setup().  So I moved the logic into
hardware_setup(), as an independent patch.
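
I.e. roughly the following, mirroring the SEV-ES handling quoted above (sketch;
the function name on the TDX side is an assumption):

	static int __init tdx_hardware_setup(struct kvm_x86_ops *x86_ops)
	{
		...
		/*
		 * TDX requires mmio_caching: the "suppress #VE" bit must be
		 * cleared for MMIO GPAs so that the TD converts the resulting
		 * #VE into TDG.VP.VMCALL<MMIO>.
		 */
		if (!enable_mmio_caching)
			return -EOPNOTSUPP;
		...
	}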
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2023-01-17  2:40   ` Huang, Kai
@ 2023-02-27 22:00     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 22:00 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Tue, Jan 17, 2023 at 02:40:46AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Some KVM MMU operations (dirty page logging, page migration, aging page)
> > aren't supported for private GFNs (yet) with the first generation of TDX.
> > Silently return on unsupported TDX KVM MMU operations.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> 
> You already have previous patches to do similar things:
> 
> [PATCH v11 034/113] KVM: x86/mmu: Disallow fast page fault on private GPA
> [PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported
> cases
> [PATCH v11 048/113] KVM: x86/mmu: Disallow dirty logging for x86 TDX
> [PATCH v11 049/113] KVM: x86/mmu: TDX: Do not enable page track for TD guest
> 
> Now you have this patch:
> 
> [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on
> private GFNs
> 
> They are very confusing to me.  Those previous patches are all "unsupported
> operations", correct? 
> 
> For instance, this patch says "dirty page logging isn't supported for private
> GFNs" (and why there's a 'yet' after it?), so based on the patch title my
> understanding is you are going to _ignore_ "dirty page logging".  But you
> already have a previous patch to "Disallow dirty logging for x86 TDX".  
> 
> Shouldn't the two be in the same patch?  Or you were trying to highlight the
> different between "x86/mmu" and "x86/tdp_mmu"?
> 
> Please try to make the whole thing more clear.  My first glance is, if it was
> me, I would probably have _ONE_ dedicated patch for _EACH_ unsupported
> operation, and make it very clear in the patch title.  But you may have your own
> way to make things more clearer.

Agreed, merged this patch into [PATCH v11 048/113] KVM: x86/mmu: Disallow dirty
logging for x86 TDX.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs
  2023-02-17  8:27   ` Zhi Wang
@ 2023-02-27 22:02     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 22:02 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Fri, Feb 17, 2023 at 10:27:47AM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Thu, 12 Jan 2023 08:31:58 -0800
> isaku.yamahata@intel.com wrote:
> 
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Some KVM MMU operations (dirty page logging, page migration, aging page)
> > aren't supported for private GFNs (yet) with the first generation of TDX.
> > Silently return on unsupported TDX KVM MMU operations.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/mmu/mmu.c     |  3 +++
> >  arch/x86/kvm/mmu/tdp_mmu.c | 50 ++++++++++++++++++++++++++++++++++----
> >  arch/x86/kvm/x86.c         |  3 +++
> >  3 files changed, 51 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index 484e615196aa..ad0482a101a3 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
> > @@ -6635,6 +6635,9 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
> >  	for_each_rmap_spte(rmap_head, &iter, sptep) {
> >  		sp = sptep_to_sp(sptep);
> >  
> > +		/* Private page dirty logging is not supported yet. */
> > +		KVM_BUG_ON(is_private_sptep(sptep), kvm);
> > +
> >  		/*
> >  		 * We cannot do huge page mapping for indirect shadow pages,
> >  		 * which are found on the last rmap (level = 1) when not using
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index 5ce0328c71df..69e202bd1897 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1478,7 +1478,8 @@ typedef bool (*tdp_handler_t)(struct kvm *kvm, struct tdp_iter *iter,
> >  
> >  static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> >  						   struct kvm_gfn_range *range,
> > -						   tdp_handler_t handler)
> > +						   tdp_handler_t handler,
> > +						   bool only_shared)
> 
> What's the purpose of having only_shared while all the callers will set it as
> true?

I dropped the only_shared argument.


> >  {
> >  	struct kvm_mmu_page *root;
> >  	struct tdp_iter iter;
> > @@ -1489,9 +1490,23 @@ static __always_inline bool kvm_tdp_mmu_handle_gfn(struct kvm *kvm,
> >  	 * into this helper allow blocking; it'd be dead, wasteful code.
> >  	 */
> >  	for_each_tdp_mmu_root(kvm, root, range->slot->as_id) {
> > +		gfn_t start;
> > +		gfn_t end;
> > +
> > +		if (only_shared && is_private_sp(root))
> > +			continue;
> > +
> >  		rcu_read_lock();
> >  
> > -		tdp_root_for_each_leaf_pte(iter, root, range->start, range->end)
> > +		/*
> > +		 * For TDX shared mapping, set GFN shared bit to the range,
> > +		 * so the handler() doesn't need to set it, to avoid duplicated
> > +		 * code in multiple handler()s.
> > +		 */
> > +		start = kvm_gfn_for_root(kvm, root, range->start);
> > +		end = kvm_gfn_for_root(kvm, root, range->end);
> > +
> 
> The coco implementation tends to treat the SHARED bit / C bit as a page_prot,
> an attribute, not a part of the GFN. From that prospective, the caller needs to
> be aware if it is operating on the private memory or shared memory, so does
> the handler. The page table walker should know the SHARED bit as a attribute.
> 
> I don't think it is a good idea to have two different understandings, which
> will cause conversion and confusion.

I think you're mixing up how the guest observes it (guest page table) with how
the host/VMM manages it (EPT).
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range
  2023-01-17 16:53     ` Sean Christopherson
@ 2023-02-27 22:03       ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 22:03 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Huang, Kai, kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar,
	Sagi, Aktas, Erdem, isaku.yamahata, dmatlack

On Tue, Jan 17, 2023 at 04:53:57PM +0000,
Sean Christopherson <seanjc@google.com> wrote:

> On Tue, Jan 17, 2023, Huang, Kai wrote:
> > On Thu, 2023-01-12 at 08:32 -0800, isaku.yamahata@intel.com wrote:
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > @@ -244,7 +244,7 @@ static void kvm_flush_remote_tlbs_with_range(struct kvm *kvm,
> > >  {
> > >  	int ret = -ENOTSUPP;
> > >  
> > > -	if (range && kvm_x86_ops.tlb_remote_flush_with_range)
> > > +	if (range && kvm_available_flush_tlb_with_range())
> > >  		ret = static_call(kvm_x86_tlb_remote_flush_with_range)(kvm, range);
> > 
> > Again, IMHO this code change doesn't make code any clearer.  With the new code,
> > I need to go into the kvm_available_flush_tlb_with_range() to see what's going
> > on, but with the old code I don't.
> 
> Agreed.  Though I think this patch as a whole can be replaced with a more
> straightforward solution.
> 
> hv_remote_flush_tlb() is used when KVM is running as a Hyper-V guest, whereas
> TDX requires running KVM on bare metal.  KVM should simply disallow TDX when a
> hypervisor is detected, then there's no need for vmx_tlb_remote_flush_with_range().

Ok, I dropped this patch and added the check to hardware_setup(): if Hyper-V
is enabled, TDX is disabled.
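
I.e. something like this (sketch; it checks for any hypervisor per Sean's
bare-metal point, and the exact condition and message in v12 may differ):

	/* TDX requires KVM to run on bare metal. */
	if (boot_cpu_has(X86_FEATURE_HYPERVISOR)) {
		pr_warn("Cannot enable TDX when running as a guest\n");
		return -EOPNOTSUPP;
	}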
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall
  2023-01-31  1:30   ` Yuan Yao
@ 2023-02-27 22:12     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 22:12 UTC (permalink / raw)
  To: Yuan Yao
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Tue, Jan 31, 2023 at 09:30:29AM +0800,
Yuan Yao <yuan.yao@linux.intel.com> wrote:

> On Thu, Jan 12, 2023 at 08:32:47AM -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> >
> > Wire up TDX PV map_gpa hypercall to the kvm/mmu backend.
> >
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/vmx/tdx.c | 53 ++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 53 insertions(+)
> >
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 4bbde58510a4..486d0f0c6dd1 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -1181,6 +1181,57 @@ static int tdx_emulate_wrmsr(struct kvm_vcpu *vcpu)
> >  	return 1;
> >  }
> >
> > +static int tdx_map_gpa(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm *kvm = vcpu->kvm;
> > +	gpa_t gpa = tdvmcall_a0_read(vcpu);
> > +	gpa_t size = tdvmcall_a1_read(vcpu);
> > +	gpa_t end = gpa + size;
> > +	gfn_t s = gpa_to_gfn(gpa) & ~kvm_gfn_shared_mask(kvm);
> > +	gfn_t e = gpa_to_gfn(end) & ~kvm_gfn_shared_mask(kvm);
> > +	int i;
> > +
> > +	if (!IS_ALIGNED(gpa, 4096) || !IS_ALIGNED(size, 4096) ||
> > +	    end < gpa ||
> > +	    end > kvm_gfn_shared_mask(kvm) << (PAGE_SHIFT + 1) ||
> > +	    kvm_is_private_gpa(kvm, gpa) != kvm_is_private_gpa(kvm, end)) {
> > +		tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> > +		return 1;
> > +	}
> > +
> > +	/*
> > +	 * Check how the requested region overlaps with the KVM memory slots.
> > +	 * For simplicity, require that it must be contained within a memslot or
> > +	 * it must not overlap with any memslots (MMIO).
> > +	 */
> > +	for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++) {
> > +		struct kvm_memslots *slots = __kvm_memslots(kvm, i);
> > +		struct kvm_memslot_iter iter;
> > +
> > +		kvm_for_each_memslot_in_gfn_range(&iter, slots, s, e) {
> > +			struct kvm_memory_slot *slot = iter.slot;
> > +			gfn_t slot_s = slot->base_gfn;
> > +			gfn_t slot_e = slot->base_gfn + slot->npages;
> > +
> > +			/* no overlap */
> > +			if (e < slot_s || s >= slot_e)
> > +				continue;
> > +
> > +			/* contained in slot */
> > +			if (slot_s <= s && e <= slot_e) {
> > +				if (kvm_slot_can_be_private(slot))
> > +					return tdx_vp_vmcall_to_user(vcpu);
> > +				continue;
> > +			}
> > +
> > +			break;
> > +		}
> > +	}
> > +
> > +	tdvmcall_set_return_code(vcpu, TDG_VP_VMCALL_INVALID_OPERAND);
> 
> This returns TDG_VP_VMCALL_INVALID_OPERAND if the TD is running with
> non-private slots, which looks incorrect to the caller in TD guest(because
> the operands are correct). Can we just refuse to create TD if private memory
> slot is the only supported slot type for it ?

A TD needs to support both private and non-private memory slots for device
assignment via shared pages.

So the question should be: what's the expected behavior for a non-private
memory slot on a memory attribute change or a TDX MAP_GPA?

- setting memory attributes: KVM_SET_MEMORY_ATTRIBUTES
  - option 1. return error
  - option 2. silently ignore as nop
              Following get-attributes returns non-private.
  - option 3. Allow setting attributes value. but no effect.
              Following get-attributes returns private. but kvm page fault
              doesn't have any effect.

- TDX MAP GPA hypercall
  Let's assume the region is contained in a single memory slot.
  - option 1. return error
  - option 2. silently ignore as nop

For the MAP_GPA hypercall, it's better to return an error, because the guest
should know which regions are for private memory and which are for device
assignment.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2023-02-03  6:55   ` Yuan Yao
@ 2023-02-27 22:15     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 22:15 UTC (permalink / raw)
  To: Yuan Yao
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Fri, Feb 03, 2023 at 02:55:45PM +0800,
Yuan Yao <yuan.yao@linux.intel.com> wrote:
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index e68816999387..c4c5a8f786c1 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -309,6 +309,25 @@ int tdx_vm_init(struct kvm *kvm)
> >  	return 0;
> >  }
> >
> > +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > +{
> > +	/* TDX private GPA is always WB. */
> > +	if (gfn & kvm_gfn_shared_mask(vcpu->kvm)) {
> > +		WARN_ON_ONCE(is_mmio);
> > +		return  MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
> > +	}
> 
> This looks not clear enough, the comment says things about private GPA
> but the code returns WB for shared GPA, please align them.

It should be !(gfn & shared mask). Thanks for the catch.
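
I.e. the check becomes something like (sketch of the immediate fix):

	u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
	{
		/* TDX private GPA is always WB. */
		if (!(gfn & kvm_gfn_shared_mask(vcpu->kvm))) {
			WARN_ON_ONCE(is_mmio);
			return MTRR_TYPE_WRBACK << VMX_EPT_MT_EPTE_SHIFT;
		}
		...
	}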
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX
  2023-01-17  3:11   ` Huang, Kai
@ 2023-02-27 23:30     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 23:30 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Tue, Jan 17, 2023 at 03:11:46AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:32 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > Although TDX supports only WB for private GPA, MTRR/PAT for shared GPA
> > should be supported. Implement get_mt_mask() following vmx case.
> 
> By far this is the first patch to handle MTRR/PAT.  There's absolutely no
> background have been explained.
> 
> So what about MTRR/PAT related MSRs handling?  No code needed to handle?
> 
> I was expecting there should be at least some words here to explain how TDX
> handles them, and if no handling is required in KVM, why.
> 
> W/o those, I don't think this patch is reviewable.  

I've updated the commit message.


> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/vmx/main.c    | 10 +++++++++-
> >  arch/x86/kvm/vmx/tdx.c     | 19 +++++++++++++++++++
> >  arch/x86/kvm/vmx/x86_ops.h |  2 ++
> >  3 files changed, 30 insertions(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 770d1b29d1c3..4319f6d7a4da 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -158,6 +158,14 @@ static void vt_load_mmu_pgd(struct kvm_vcpu *vcpu, hpa_t root_hpa,
> >  	vmx_load_mmu_pgd(vcpu, root_hpa, pgd_level);
> >  }
> >  
> > +static u8 vt_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > +{
> > +	if (is_td_vcpu(vcpu))
> > +		return tdx_get_mt_mask(vcpu, gfn, is_mmio);
> > +
> > +	return vmx_get_mt_mask(vcpu, gfn, is_mmio);
> > +}
> > +
> >  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> >  {
> >  	if (!is_td(kvm))
> > @@ -267,7 +275,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> >  
> >  	.set_tss_addr = vmx_set_tss_addr,
> >  	.set_identity_map_addr = vmx_set_identity_map_addr,
> > -	.get_mt_mask = vmx_get_mt_mask,
> > +	.get_mt_mask = vt_get_mt_mask,
> >  
> >  	.get_exit_info = vmx_get_exit_info,
> >  
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index e68816999387..c4c5a8f786c1 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -309,6 +309,25 @@ int tdx_vm_init(struct kvm *kvm)
> >  	return 0;
> >  }
> >  
> > +u8 tdx_get_mt_mask(struct kvm_vcpu *vcpu, gfn_t gfn, bool is_mmio)
> > +{
> > +	/* TDX private GPA is always WB. */
> > +	if (gfn & kvm_gfn_shared_mask(vcpu->kvm)) {
> 
> First of all, private GPA doesn't have 'shared bit' set, so comment  doesn't
> reflect code.
> 
> Secondly (and again), IIUC the shared bit of the gfn has been stripped out long
> time ago, so this is incorrect.
> 
> Please don't sliently ignore other people's comment:
> 
> https://lore.kernel.org/lkml/Y19NzlQcwhV%2F2wl3@debian.me/T/#mf319d5b718519709362f9f094bfc5b53fd870241

I made the logic common for VMX and TDX; the difference is the CR0.CD check.

Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-01-16 10:46   ` Zhi Wang
@ 2023-02-27 23:49     ` Isaku Yamahata
  2023-02-28 17:55       ` Zhi Wang
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-27 23:49 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Mon, Jan 16, 2023 at 12:46:06PM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Thu, 12 Jan 2023 08:31:31 -0800
> isaku.yamahata@intel.com wrote:
> 
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > The next step of TDX guest creation is to create vcpu.  Allocate TDX vcpu
> > structures, partially initialize it.  Allocate pages of TDX vcpu for the
> > TDX module.  Actual donation TDX vcpu pages to the TDX module is not done
> > yet.
> > 
> > In the case of the conventional case, cpuid is empty at the initialization.
> > and cpuid is configured after the vcpu initialization.  Because TDX
> > supports only X2APIC mode, cpuid is forcibly initialized to support X2APIC
> > on the vcpu initialization.
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> > Changes v10 -> v11:
> > - NULL check of kvmalloc_array() in tdx_vcpu_reset. Move it to
> >   tdx_vcpu_create()
> > 
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/kvm/vmx/main.c    | 40 ++++++++++++++++++--
> >  arch/x86/kvm/vmx/tdx.c     | 75 ++++++++++++++++++++++++++++++++++++++
> >  arch/x86/kvm/vmx/x86_ops.h | 10 +++++
> >  arch/x86/kvm/x86.c         |  2 +
> >  4 files changed, 123 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index ddf0742f1f67..59813ca05f36 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -63,6 +63,38 @@ static void vt_vm_free(struct kvm *kvm)
> >  		tdx_vm_free(kvm);
> >  }
> >  
> > +static int vt_vcpu_precreate(struct kvm *kvm)
> > +{
> > +	if (is_td(kvm))
> > +		return 0;
> > +
> > +	return vmx_vcpu_precreate(kvm);
> > +}
> > +
> > +static int vt_vcpu_create(struct kvm_vcpu *vcpu)
> > +{
> > +	if (is_td_vcpu(vcpu))
> > +		return tdx_vcpu_create(vcpu);
> > +
> > +	return vmx_vcpu_create(vcpu);
> > +}
> > +
> 
> -----
> > +static void vt_vcpu_free(struct kvm_vcpu *vcpu)
> > +{
> > +	if (is_td_vcpu(vcpu))
> > +		return tdx_vcpu_free(vcpu);
> > +
> > +	return vmx_vcpu_free(vcpu);
> > +}
> > +
> > +static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > +{
> > +	if (is_td_vcpu(vcpu))
> > +		return tdx_vcpu_reset(vcpu, init_event);
> > +
> > +	return vmx_vcpu_reset(vcpu, init_event);
> > +}
> > +
> ----
> 
> It seems a little strange to use return in this style. Would it be better like:
> 
> -----
> if (xxx) {
> 	tdx_vcpu_reset(xxx);
> 	return; 
> }
> 
> vmx_vcpu_reset(xxx);
> ----
> 
> ?

It's C11.  I updated the code to not use the feature.
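
I.e. roughly this style now (a sketch of the update):

static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
	if (is_td_vcpu(vcpu)) {
		tdx_vcpu_reset(vcpu, init_event);
		return;
	}

	vmx_vcpu_reset(vcpu, init_event);
}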


> >  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> >  {
> >  	if (!is_td(kvm))
> > @@ -90,10 +122,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> >  	.vm_destroy = vt_vm_destroy,
> >  	.vm_free = vt_vm_free,
> >  
> > -	.vcpu_precreate = vmx_vcpu_precreate,
> > -	.vcpu_create = vmx_vcpu_create,
> > -	.vcpu_free = vmx_vcpu_free,
> > -	.vcpu_reset = vmx_vcpu_reset,
> > +	.vcpu_precreate = vt_vcpu_precreate,
> > +	.vcpu_create = vt_vcpu_create,
> > +	.vcpu_free = vt_vcpu_free,
> > +	.vcpu_reset = vt_vcpu_reset,
> >  
> >  	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
> >  	.vcpu_load = vmx_vcpu_load,
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 557a609c5147..099f0737a5aa 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -281,6 +281,81 @@ int tdx_vm_init(struct kvm *kvm)
> >  	return 0;
> >  }
> >  
> > +int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_cpuid_entry2 *e;
> > +
> > +	/*
> > +	 * On cpu creation, cpuid entry is blank.  Forcibly enable
> > +	 * X2APIC feature to allow X2APIC.
> > +	 * Because vcpu_reset() can't return error, allocation is done here.
> > +	 */
> > +	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> > +	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> > +	e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);
> > +	if (!e)
> > +		return -ENOMEM;
> > +	*e  = (struct kvm_cpuid_entry2) {
> > +		.function = 1,	/* Features for X2APIC */
> > +		.index = 0,
> > +		.eax = 0,
> > +		.ebx = 0,
> > +		.ecx = 1ULL << 21,	/* X2APIC */
> > +		.edx = 0,
> > +	};
> > +	vcpu->arch.cpuid_entries = e;
> > +	vcpu->arch.cpuid_nent = 1;
> > +
> > +	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> > +	if (!vcpu->arch.apic)
> > +		return -EINVAL;
> > +
> > +	fpstate_set_confidential(&vcpu->arch.guest_fpu);
> > +
> > +	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> > +
> > +	vcpu->arch.cr0_guest_owned_bits = -1ul;
> > +	vcpu->arch.cr4_guest_owned_bits = -1ul;
> > +
> > +	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
> > +	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> > +	vcpu->arch.guest_state_protected =
> > +		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
> > +
> > +	return 0;
> > +}
> > +
> > +void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> > +{
> > +	/* This is stub for now.  More logic will come. */
> > +}
> > +
> > +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > +{
> > +	struct msr_data apic_base_msr;
> > +
> > +	/* TDX doesn't support INIT event. */
> > +	if (WARN_ON_ONCE(init_event))
> > +		goto td_bugged;
> > +
> > +	/* TDX rquires X2APIC. */
>                 ^
>                requires
> > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > +	apic_base_msr.host_initiated = true;
> > +	if (WARN_ON_ONCE(kvm_set_apic_base(vcpu, &apic_base_msr)))
> > +		goto td_bugged;
> > +
> > +	/*
> > +	 * Don't update mp_state to runnable because more initialization
> > +	 * is needed by TDX_VCPU_INIT.
> > +	 */
> > +	return;
> > +
> > +td_bugged:
> > +	vcpu->kvm->vm_bugged = true;
> > +}
> > +
> 
> 1) Using vm_bugged to terminate the VM creation feels off. When
> using it in creation path, the termination still happens in xx_vcpu_run().
> 
> Thus, even something wrong happens at a certain point of the creation path,
> the VM creation still continues. Until the xxx_vcpu_run(), the VM termination
> finally happens.
> 
> Why not just fail in the creation path?

I converted the vm_bugged usage to KVM_BUG_ON().  Because the td_bugged cases
should never happen for TDX, KVM_BUG_ON() is worthwhile here.
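
For example (a sketch of the conversion, not necessarily the exact code in the
next version):

void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
{
	struct msr_data apic_base_msr;

	/* vcpu_reset() can't fail, so flag the VM as bugged instead. */
	if (KVM_BUG_ON(init_event, vcpu->kvm))
		return;

	/* TDX requires X2APIC. */
	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
	if (kvm_vcpu_is_reset_bsp(vcpu))
		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
	apic_base_msr.host_initiated = true;
	if (KVM_BUG_ON(kvm_set_apic_base(vcpu, &apic_base_msr), vcpu->kvm))
		return;
}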


> 2) Move 
> 
> > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > +	apic_base_msr.host_initiated = true;
> 
> to:
> 
> void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> {
>         struct kvm_lapic *apic = vcpu->arch.apic;
>         u64 msr_val;
>         int i;
> 
>         if (!init_event) {
>                 msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
> 
> 		/* here */
> 		if (is_td_vcpu(vcpu)) 
> 			msr_val = xxxx;
>                 if (kvm_vcpu_is_reset_bsp(vcpu))
>                         msr_val |= MSR_IA32_APICBASE_BSP;
>                 kvm_lapic_set_base(vcpu, msr_val);
>         }

No, because I'm trying to keep is_td()/is_td_vcpu() contained in VMX-specific
code and not use them in common x86 code.


> PS: Is there any reason that APIC MSR in TDX doesn't need
> MSR_IA32_APICBASE_ENABLE?

Because LAPIC_MODE_X2APIC already includes MSR_IA32_APICBASE_ENABLE.  In lapic.h:
        LAPIC_MODE_X2APIC = MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE,



> 3) Change the following:
> 
> > +
> > +	/* TDX doesn't support INIT event. */
> > +	if (WARN_ON_ONCE(init_event))
> > +		goto td_bugged;
> > +
> 
> to 
> 	WARN_ON_ONCE(init_event);
> 
> kvm_cpu_deliver_init() will trigger a kvm_vcpu_reset(xxx, init_event=true),
> but you have already avoided this in vt_vcpu_deliver_init(). A warn
> is good enough to remind people.

I converted it into KVM_BUG_ON().


> With these changes, tdx_vcpu_reset() will only contain the CPUID configuration
> , using the vm_bugged to terminate the VM in tdx_vcpu_reset() can be removed.
> 
> >  int tdx_dev_ioctl(void __user *argp)
> >  {
> >  	struct kvm_tdx_capabilities __user *user_caps;
> > diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> > index 6c40dda1cc2f..37ab2cfd35bc 100644
> > --- a/arch/x86/kvm/vmx/x86_ops.h
> > +++ b/arch/x86/kvm/vmx/x86_ops.h
> > @@ -147,7 +147,12 @@ int tdx_offline_cpu(void);
> >  int tdx_vm_init(struct kvm *kvm);
> >  void tdx_mmu_release_hkid(struct kvm *kvm);
> >  void tdx_vm_free(struct kvm *kvm);
> > +
> >  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> > +
> > +int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> > +void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> > +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> >  #else
> >  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
> >  static inline void tdx_hardware_unsetup(void) {}
> > @@ -159,7 +164,12 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> >  static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
> >  static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
> >  static inline void tdx_vm_free(struct kvm *kvm) {}
> > +
> >  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> > +
> > +static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> > +static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> > +static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> >  #endif
> >  
> >  #endif /* __KVM_X86_VMX_X86_OPS_H */
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 1fb135e0c98f..e8bc66031a1d 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -492,6 +492,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> >  	kvm_recalculate_apic_map(vcpu->kvm);
> >  	return 0;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_set_apic_base);
> >  
> >  /*
> >   * Handle a fault on a hardware virtualization (VMX or SVM) instruction.
> > @@ -12109,6 +12110,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
> >  {
> >  	return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
> >  }
> > +EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);
> >
> 
> The symbols don't need to be exported with the changes mentioned above.
>   
> >  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
> >  {
> 

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-01-19  0:45   ` Huang, Kai
@ 2023-02-28 11:06     ` Isaku Yamahata
  2023-02-28 11:52       ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-28 11:06 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, pbonzini, Shahar, Sagi,
	Aktas, Erdem, isaku.yamahata, dmatlack, Christopherson,,
	Sean

On Thu, Jan 19, 2023 at 12:45:09AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > The next step of TDX guest creation is to create vcpu.  Allocate TDX vcpu
> > structures, partially initialize it.  
> > 
> 
> Why partially initialize it?  Shouldn't a better way be either: 1) not
> initialize at all, or; 2) fully initialize? 
> 
> Can you put more _why_ here?
> 
> 
> > Allocate pages of TDX vcpu for the
> > TDX module.  Actual donation TDX vcpu pages to the TDX module is not done
> > yet.
> 
> Also, can you explain _why_ it is not done here?
> 
> > 
> > In the case of the conventional case, cpuid is empty at the initialization.
> > and cpuid is configured after the vcpu initialization.  Because TDX
> > supports only X2APIC mode, cpuid is forcibly initialized to support X2APIC
> > on the vcpu initialization.
> 
> Don't quite understand here.  As you said CPUID entries are configured later in
> KVM_SET_CPUID2, so what's the point of initializing CPUID to support x2apic?
> Are you suggesting KVM_SET_CPUID2 will be somehow rejected for TDX guest, or
> there will be special handling to make sure the CPUID initialized here won't be
> overwritten later?
> 
> Please explain clearly here.

Here is the updated one.

    The next step of TDX guest creation is to create vcpu.  Allocate TDX vcpu
    structures and initialize the parts that don't require TDX SEAMCALLs.  The
    TDX specific vcpu initialization will be implemented as an independent
    KVM_TDX_INIT_VCPU ioctl so that, when an error occurs, it's easy to
    determine which component has the issue, KVM or TDX.

    In the conventional case, cpuid is empty at initialization and is
    configured after the vcpu initialization.  Because TDX supports only
    X2APIC mode, cpuid is forcibly set to support X2APIC and the APIC BASE MSR
    is forcibly set to X2APIC mode.  The MSR will be read-only for the guest
    TD.  Because kvm_arch_vcpu_create() also initializes the KVM MMU, which
    depends on the local APIC settings, x2apic needs to be initialized to
    X2APIC mode by the vcpu_reset method before the KVM MMU initialization.



> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 557a609c5147..099f0737a5aa 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -281,6 +281,81 @@ int tdx_vm_init(struct kvm *kvm)
> >  	return 0;
> >  }
> >  
> > +int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > +{
> > +	struct kvm_cpuid_entry2 *e;
> > +
> > +	/*
> > +	 * On cpu creation, cpuid entry is blank.  Forcibly enable
> > +	 * X2APIC feature to allow X2APIC.
> > +	 * Because vcpu_reset() can't return error, allocation is done here.
> > +	 */
> > +	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> > +	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> > +	e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);
> 
> You don't need to use kvmalloc_array() when only allocating one entry.

I'll make it kvmalloc() and add a comment explaining why kvmalloc(), not
kmalloc(), is used.
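
Something like the following (a sketch only; the comment wording is just an
illustration, on the assumption that the common x86 code frees
vcpu->arch.cpuid_entries with kvfree()):

	/*
	 * Use kvmalloc() rather than kmalloc() because the common x86 code
	 * frees vcpu->arch.cpuid_entries with kvfree().
	 */
	e = kvmalloc(sizeof(*e), GFP_KERNEL_ACCOUNT);
	if (!e)
		return -ENOMEM;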


> > +	if (!e)
> > +		return -ENOMEM;
> > +	*e  = (struct kvm_cpuid_entry2) {
> > +		.function = 1,	/* Features for X2APIC */
> > +		.index = 0,
> > +		.eax = 0,
> > +		.ebx = 0,
> > +		.ecx = 1ULL << 21,	/* X2APIC */
> > +		.edx = 0,
> > +	};
> > +	vcpu->arch.cpuid_entries = e;
> > +	vcpu->arch.cpuid_nent = 1;
> 
> As mentioned above, why doing it here? Won't be this be overwritten later in
> KVM_SET_CPUID2?

Yes, the user space VMM can overwrite cpuid[0x1] and the APIC base MSR.  But
it doesn't matter because that would be a bug of the user space VMM; the user
space VMM has to keep cpuid and the MSRs consistent.
Because the TDX module virtualizes cpuid[0x1].x2apic as fixed to 1, the KVM
value doesn't matter after vcpu creation.
Because KVM virtualizes the APIC base as read-only to the guest,
cpuid[0x1].x2apic doesn't matter after vcpu creation as long as the user space
VMM keeps the KVM APIC BASE value.

I'll add a comment.


> > +
> > +	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> > +	if (!vcpu->arch.apic)
> > +		return -EINVAL;
> 
> If this is hit, what happens to the CPUID entry allocated above?  It's
> absolutely not clear here in this patch.

It's a memory leak.  I'll move the check before the memory allocation.
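
I.e. roughly (a sketch of the reordering):

int tdx_vcpu_create(struct kvm_vcpu *vcpu)
{
	struct kvm_cpuid_entry2 *e;

	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
	if (!vcpu->arch.apic)
		return -EINVAL;

	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
	e = kvmalloc(sizeof(*e), GFP_KERNEL_ACCOUNT);
	if (!e)
		return -ENOMEM;
	/* ... the rest as before ... */
}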


> > +
> > +	fpstate_set_confidential(&vcpu->arch.guest_fpu);
> > +
> > +	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> > +
> > +	vcpu->arch.cr0_guest_owned_bits = -1ul;
> > +	vcpu->arch.cr4_guest_owned_bits = -1ul;
> > +
> > +	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
> > +	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> > +	vcpu->arch.guest_state_protected =
> > +		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
> > +
> > +	return 0;
> > +}
> > +
> > +void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> > +{
> > +	/* This is stub for now.  More logic will come. */
> > +}
> > +
> > +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > +{
> > +	struct msr_data apic_base_msr;
> > +
> > +	/* TDX doesn't support INIT event. */
> > +	if (WARN_ON_ONCE(init_event))
> > +		goto td_bugged;
> 
> Should we use KVM_BUG_ON()?
>
> Again, it appears this depends on how KVM handles INIT, which is done in a later
> patch far way:
> 
> [PATCH v11 102/113] KVM: TDX: Silently ignore INIT/SIPI
> 
> And there's no material explaining how it is handled in either changelog or
> comment, so to me it's not reviewable.

I'll convert them to KVM_BUG_ON().  With this patch I'll remove the
WARN_ON_ONCE(), and add KVM_BUG_ON() in the later patch.


> > +
> > +	/* TDX rquires X2APIC. */
> > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > +	apic_base_msr.host_initiated = true;
> > +	if (WARN_ON_ONCE(kvm_set_apic_base(vcpu, &apic_base_msr)))
> > +		goto td_bugged;
> 
> I think we have KVM_BUG_ON()?
> 
> TDX requires a lot more staff then just x2apic, why only x2apic is done here,
> particularly in _this_ patch?

After this callback, the KVM MMU initialization follows, and it depends on the
APIC setting.  X2APIC is the only deviation from the common logic in
kvm_lapic_reset().
I'll add a comment.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization
  2023-01-16 16:07   ` Zhi Wang
@ 2023-02-28 11:17     ` Isaku Yamahata
  2023-02-28 18:21       ` Zhi Wang
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-28 11:17 UTC (permalink / raw)
  To: Zhi Wang
  Cc: isaku.yamahata, kvm, linux-kernel, isaku.yamahata, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

On Mon, Jan 16, 2023 at 06:07:19PM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Thu, 12 Jan 2023 08:31:32 -0800
> isaku.yamahata@intel.com wrote:
> 
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TD guest vcpu need to be configured before ready to run which requests
> > addtional information from Device model (e.g. qemu), one 64bit value is
> > passed to vcpu's RCX as an initial value.  Repurpose KVM_MEMORY_ENCRYPT_OP
> > to vcpu-scope and add new sub-commands KVM_TDX_INIT_VCPU under it for such
> > additional vcpu configuration.
> > 
> 
> Better add more details for this mystic value to save the review efforts.
> 
> For exmaple, refining the above part as:
> 
> ----
> 
> TD hands-off block(HOB) is used to pass the information from VMM to
> TD virtual firmware(TDVF). Before KVM calls Intel TDX module to launch
> TDVF, the address of HOB must be placed in the guest RCX.
> 
> Extend KVM_MEMORY_ENCRYPT_OP to vcpu-scope and add new... so that
> TDH.VP.INIT can take the address of HOB from QEMU and place it in the
> guest RCX when initializing a TDX vCPU.
> 
> ----
> 
> The below paragraph seems repeating the end of the first paragraph. Guess
> it can be refined or removed.
> 
> 
> > Add callback for kvm vCPU-scoped operations of KVM_MEMORY_ENCRYPT_OP and
> > add a new subcommand, KVM_TDX_INIT_VCPU, for further vcpu initialization.
> >

I don't think it's a good idea to introduce the new terminology HOB and TDVF
here.  We can just say that the VMM can pass one parameter.
Here is the updated one.

    TD guest vcpu needs TDX specific initialization before running.  Repurpose
    KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
    KVM_TDX_INIT_VCPU, and implement the callback for it.


> PS: I am curious if the value of guest RCX on each VCPU will be configured
> differently? (It seems they are the same according to the code of tdx-qemu)
> 
> If yes, then it is just an approach to configure the value (even it is
> through TDH.VP.XXX). It should be configured in the domain level in KVM. The
> TDX vCPU creation and initialization can be moved into tdx_vcpu_create()
> and TDH.VP.INIT can take the value from a per-vm data structure.

RCX can be set for each vcpu; the initial value is part of the ABI (the TDX
SEAMCALL API) between the VMM and the vcpu.  It's a convention between the
user space VMM (qemu) and the guest firmware (TDVF) to pass the same RCX value
for all vcpus.  So KVM shouldn't enforce the same RCX value for all vcpus; KVM
should allow the user space VMM to set the value for each vcpu, as illustrated
below.
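
Just as an illustration (not part of the series; the variable names here are
made up, and this assumes the sub-command's data field carries the per-vcpu
value), userspace would do something like this for each vcpu:

	struct kvm_tdx_cmd cmd = {
		.id = KVM_TDX_INIT_VCPU,
		/* Per-vcpu initial RCX value, e.g. the address qemu chose. */
		.data = init_rcx,
	};

	ioctl(vcpu_fd, KVM_MEMORY_ENCRYPT_OP, &cmd);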


> > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > ---
> >  arch/x86/include/asm/kvm-x86-ops.h    |   1 +
> >  arch/x86/include/asm/kvm_host.h       |   1 +
> >  arch/x86/include/uapi/asm/kvm.h       |   1 +
> >  arch/x86/kvm/vmx/main.c               |   9 ++
> >  arch/x86/kvm/vmx/tdx.c                | 147 +++++++++++++++++++++++++-
> >  arch/x86/kvm/vmx/tdx.h                |   7 ++
> >  arch/x86/kvm/vmx/x86_ops.h            |  10 +-
> >  arch/x86/kvm/x86.c                    |   6 ++
> >  tools/arch/x86/include/uapi/asm/kvm.h |   1 +
> >  9 files changed, 178 insertions(+), 5 deletions(-)
> > 
> > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > index 1a27f3aee982..e3e9b1c2599b 100644
> > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > @@ -123,6 +123,7 @@ KVM_X86_OP(enable_smi_window)
> >  #endif
> >  KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
> >  KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> > +KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
> >  KVM_X86_OP_OPTIONAL(mem_enc_register_region)
> >  KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
> >  KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
> > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > index 30f4ddb18548..35773f925cc5 100644
> > --- a/arch/x86/include/asm/kvm_host.h
> > +++ b/arch/x86/include/asm/kvm_host.h
> > @@ -1698,6 +1698,7 @@ struct kvm_x86_ops {
> >  
> >  	int (*dev_mem_enc_ioctl)(void __user *argp);
> >  	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
> > +	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
> >  	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> >  	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> >  	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
> > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > index b8f28d86d4fd..9236c1699c48 100644
> > --- a/arch/x86/include/uapi/asm/kvm.h
> > +++ b/arch/x86/include/uapi/asm/kvm.h
> > @@ -536,6 +536,7 @@ struct kvm_pmu_event_filter {
> >  enum kvm_tdx_cmd_id {
> >  	KVM_TDX_CAPABILITIES = 0,
> >  	KVM_TDX_INIT_VM,
> > +	KVM_TDX_INIT_VCPU,
> >  
> >  	KVM_TDX_CMD_NR_MAX,
> >  };
> > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > index 59813ca05f36..23b3ffc3fe23 100644
> > --- a/arch/x86/kvm/vmx/main.c
> > +++ b/arch/x86/kvm/vmx/main.c
> > @@ -103,6 +103,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> >  	return tdx_vm_ioctl(kvm, argp);
> >  }
> >  
> > +static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> > +{
> > +	if (!is_td_vcpu(vcpu))
> > +		return -EINVAL;
> > +
> > +	return tdx_vcpu_ioctl(vcpu, argp);
> > +}
> > +
> >  struct kvm_x86_ops vt_x86_ops __initdata = {
> >  	.name = KBUILD_MODNAME,
> >  
> > @@ -249,6 +257,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> >  
> >  	.dev_mem_enc_ioctl = tdx_dev_ioctl,
> >  	.mem_enc_ioctl = vt_mem_enc_ioctl,
> > +	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
> >  };
> >  
> >  struct kvm_x86_init_ops vt_init_ops __initdata = {
> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 099f0737a5aa..e2f5a07ad4e5 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -49,6 +49,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> >  	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> >  }
> >  
> > +static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> > +{
> > +	return tdx->tdvpr_pa;
> > +}
> > +
> >  static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> >  {
> >  	return kvm_tdx->tdr_pa;
> > @@ -65,6 +70,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> >  	return kvm_tdx->hkid > 0;
> >  }
> >  
> > +static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> > +{
> > +	return kvm_tdx->finalized;
> > +}
> > +
> >  static void tdx_clear_page(unsigned long page_pa)
> >  {
> >  	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > @@ -327,7 +337,21 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> >  
> >  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> >  {
> > -	/* This is stub for now.  More logic will come. */
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +	int i;
> > +
> > +	/* Can't reclaim or free pages if teardown failed. */
> > +	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
> > +		return;
> > +
> 
> Should we have an WARN_ON_ONCE here?

No.  In the normal case, this path can be reached with the hkid already
reclaimed.


> > +	if (tdx->tdvpx_pa) {
> > +		for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> > +			tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
> > +		kfree(tdx->tdvpx_pa);
> > +		tdx->tdvpx_pa = NULL;
> > +	}
> > +	tdx_reclaim_td_page(tdx->tdvpr_pa);
> > +	tdx->tdvpr_pa = 0;
> >  }
> >  
> >  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > @@ -337,6 +361,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >  	/* TDX doesn't support INIT event. */
> >  	if (WARN_ON_ONCE(init_event))
> >  		goto td_bugged;
> > +	if (WARN_ON_ONCE(is_td_vcpu_created(to_tdx(vcpu))))
> > +		goto td_bugged;
> >  
> >  	/* TDX rquires X2APIC. */
> >  	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > @@ -791,6 +817,125 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> >  	return r;
> >  }
> >  
> > +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +	unsigned long *tdvpx_pa = NULL;
> > +	unsigned long tdvpr_pa;
> > +	unsigned long va;
> > +	int ret, i;
> > +	u64 err;
> > +
> > +	if (is_td_vcpu_created(tdx))
> > +		return -EINVAL;
> > +
> > +	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +	if (!va)
> > +		return -ENOMEM;
> > +	tdvpr_pa = __pa(va);
> > +
> > +	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
> > +			   GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > +	if (!tdvpx_pa) {
> > +		ret = -ENOMEM;
> > +		goto free_tdvpr;
> > +	}
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +		if (!va)
> > +			goto free_tdvpx;
> > +		tdvpx_pa[i] = __pa(va);
> > +	}
> > +
> > +	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> > +	if (WARN_ON_ONCE(err)) {
> > +		ret = -EIO;
> > +		pr_tdx_error(TDH_VP_CREATE, err, NULL);
> > +		goto td_bugged_free_tdvpx;
> > +	}
> > +	tdx->tdvpr_pa = tdvpr_pa;
> > +
> > +	tdx->tdvpx_pa = tdvpx_pa;
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> > +		if (WARN_ON_ONCE(err)) {
> > +			ret = -EIO;
> > +			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> > +			for (; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +				free_page((unsigned long)__va(tdvpx_pa[i]));
> > +				tdvpx_pa[i] = 0;
> > +			}
> > +			goto td_bugged;
> > +		}
> > +	}
> > +
> > +	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> > +	if (WARN_ON_ONCE(err)) {
> > +		ret = -EIO;
> > +		pr_tdx_error(TDH_VP_INIT, err, NULL);
> > +		goto td_bugged;
> > +	}
> > +
> > +	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > +
> > +	return 0;
> > +
> > +td_bugged_free_tdvpx:
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +		free_page((unsigned long)__va(tdvpx_pa[i]));
> > +		tdvpx_pa[i] = 0;
> > +	}
> > +	kfree(tdvpx_pa);
> > +td_bugged:
> > +	vcpu->kvm->vm_bugged = true;
> > +	return ret;
> > +
> > +free_tdvpx:
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> > +		if (tdvpx_pa[i])
> > +			free_page((unsigned long)__va(tdvpx_pa[i]));
> > +	kfree(tdvpx_pa);
> > +	tdx->tdvpx_pa = NULL;
> > +free_tdvpr:
> > +	if (tdvpr_pa)
> > +		free_page((unsigned long)__va(tdvpr_pa));
> > +	tdx->tdvpr_pa = 0;
> > +
> > +	return ret;
> > +}
> 
> Same comments with using vm_bugged in the previous patch.

I converted it to KVM_BUG_ON().
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization
  2023-01-19 10:37   ` Huang, Kai
@ 2023-02-28 11:27     ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-28 11:27 UTC (permalink / raw)
  To: Huang, Kai
  Cc: kvm, linux-kernel, Yamahata, Isaku, sean.j.christopherson,
	pbonzini, Shahar, Sagi, Aktas, Erdem, isaku.yamahata, dmatlack,
	Christopherson,,
	Sean

On Thu, Jan 19, 2023 at 10:37:43AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Thu, 2023-01-12 at 08:31 -0800, isaku.yamahata@intel.com wrote:
> > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > 
> > TD guest vcpu need to be configured before ready to run which requests
> > addtional information from Device model (e.g. qemu), one 64bit value is
> > passed to vcpu's RCX as an initial value.  
> > 
> 
> The first half sentence doesn't parse to me.  It also has grammar issue.
> 
> Also, the second half only talks about TDH.VP.INIT, but there's more regarding
> to creating/initializing a TDX guest vcpu.  IMHO It would be better if you can
> briefly describe the whole sequence here so people can get some idea about your
> code below.
> 
> Btw, I don't understand what's the point of pointing out "64bit value passed to
> vcpu's RCX ...".  You can add this to the comment instead.  If it is important,
> then please add more to explain it so people can understand more. 
> 

RCX and the 64-bit value don't make much sense in the commit message.  I
dropped those sentences.

Here is the updated commit message.

    KVM: TDX: Do TDX specific vcpu initialization
    
    TD guest vcpu needs TDX specific initialization before running.  Repurpose
    KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
    KVM_TDX_INIT_VCPU, and implement the callback for it.

> > Repurpose KVM_MEMORY_ENCRYPT_OP
> > to vcpu-scope and add new sub-commands KVM_TDX_INIT_VCPU under it for such
> > additional vcpu configuration.
> 
> I am not sure using the same command for both per-VM and per-vcpu ioctls is a
> good idea.  Is there any existing example does this?

There are some.  Please grep Documentation/virt/kvm/api.rst for "Type:";
you can see that some ioctls support multiple types.

  Type: system ioctl, vm ioctl
  Type: vcpu ioctl / vm ioctl
  Type: device ioctl, vm ioctl, vcpu ioctl


> > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > index 099f0737a5aa..e2f5a07ad4e5 100644
> > --- a/arch/x86/kvm/vmx/tdx.c
> > +++ b/arch/x86/kvm/vmx/tdx.c
> > @@ -49,6 +49,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> >  	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> >  }
> >  
> > +static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> > +{
> > +	return tdx->tdvpr_pa;
> > +}
> > +
> >  static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> >  {
> >  	return kvm_tdx->tdr_pa;
> > @@ -65,6 +70,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> >  	return kvm_tdx->hkid > 0;
> >  }
> >  
> > +static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> > +{
> > +	return kvm_tdx->finalized;
> > +}
> > +
> >  static void tdx_clear_page(unsigned long page_pa)
> >  {
> >  	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > @@ -327,7 +337,21 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> >  
> >  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> >  {
> > -	/* This is stub for now.  More logic will come. */
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +	int i;
> > +
> > +	/* Can't reclaim or free pages if teardown failed. */
> > +	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
> > +		return;
> 
> You may want to WARN() if it's a kernel bug you want to catch.

No, it's not a bug to come here with the hkid already freed, because the
vcpu_free method can be called after VM destruction.


> > +
> > +	if (tdx->tdvpx_pa) {
> > +		for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> > +			tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
> > +		kfree(tdx->tdvpx_pa);
> > +		tdx->tdvpx_pa = NULL;
> > +	}
> > +	tdx_reclaim_td_page(tdx->tdvpr_pa);
> > +	tdx->tdvpr_pa = 0;
> >  }
> >  
> >  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > @@ -337,6 +361,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> >  	/* TDX doesn't support INIT event. */
> >  	if (WARN_ON_ONCE(init_event))
> >  		goto td_bugged;
> > +	if (WARN_ON_ONCE(is_td_vcpu_created(to_tdx(vcpu))))
> > +		goto td_bugged;
> 
> Again, not sure can we use KVM_BUG_ON()?

I converted it into KVM_BUG_ON()


> >  	/* TDX rquires X2APIC. */
> >  	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > @@ -791,6 +817,125 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> >  	return r;
> >  }
> >  
> > +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > +{
> > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +	unsigned long *tdvpx_pa = NULL;
> > +	unsigned long tdvpr_pa;
> > +	unsigned long va;
> > +	int ret, i;
> > +	u64 err;
> > +
> > +	if (is_td_vcpu_created(tdx))
> > +		return -EINVAL;
> 
> Ditto.  WARN()?

No. KVM_TDX_INIT_VCPU can be called multiple times.  It's not a kernel bug,
but a misuse of the ioctl.


> > +
> > +	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +	if (!va)
> > +		return -ENOMEM;
> > +	tdvpr_pa = __pa(va);
> > +
> > +	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
> > +			   GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> 
> kcalloc() uses __GFP_ZERO internally.
> 
> > +	if (!tdvpx_pa) {
> > +		ret = -ENOMEM;
> > +		goto free_tdvpr;
> > +	}
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > +		if (!va)
> > +			goto free_tdvpx;
> > +		tdvpx_pa[i] = __pa(va);
> > +	}
> > +
> > +	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> > +	if (WARN_ON_ONCE(err)) {
> > +		ret = -EIO;
> > +		pr_tdx_error(TDH_VP_CREATE, err, NULL);
> > +		goto td_bugged_free_tdvpx;
> > +	}
> > +	tdx->tdvpr_pa = tdvpr_pa;
> > +
> > +	tdx->tdvpx_pa = tdvpx_pa;
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> > +		if (WARN_ON_ONCE(err)) {
> > +			ret = -EIO;
> > +			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> > +			for (; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +				free_page((unsigned long)__va(tdvpx_pa[i]));
> > +				tdvpx_pa[i] = 0;
> > +			}
> > +			goto td_bugged;
> > +		}
> > +	}
> > +
> > +	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> > +	if (WARN_ON_ONCE(err)) {
> > +		ret = -EIO;
> > +		pr_tdx_error(TDH_VP_INIT, err, NULL);
> > +		goto td_bugged;
> > +	}
> > +
> > +	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > +
> > +	return 0;
> > +
> > +td_bugged_free_tdvpx:
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > +		free_page((unsigned long)__va(tdvpx_pa[i]));
> > +		tdvpx_pa[i] = 0;
> > +	}
> > +	kfree(tdvpx_pa);
> > +td_bugged:
> > +	vcpu->kvm->vm_bugged = true;
> > +	return ret;
> > +
> > +free_tdvpx:
> > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> > +		if (tdvpx_pa[i])
> > +			free_page((unsigned long)__va(tdvpx_pa[i]));
> > +	kfree(tdvpx_pa);
> 
> This piece of code appears 3 times in this function (and there are 3 'return
> ret;').  I am sure it can be done in one place instead. Can you reorganize?

I reworked the logic to remove the duplication; a rough sketch of the idea
follows.
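
For example, the repeated free loops could be folded into a small helper that
only frees pages not yet handed to the TDX module (just a sketch of the idea,
not necessarily the actual code in the next version):

/* Free vcpu pages that have NOT been added to the TDX module yet. */
static void tdx_free_unadded_pages(unsigned long *pa, int nr)
{
	int i;

	for (i = 0; i < nr; i++) {
		if (pa[i])
			free_page((unsigned long)__va(pa[i]));
	}
}

Each error path then calls it with only the range the TDX module hasn't taken
ownership of, e.g. tdx_free_unadded_pages(&tdvpx_pa[i],
tdx_caps.tdvpx_nr_pages - i) after a TDH.VP.ADDCX failure.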
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-28 11:06     ` Isaku Yamahata
@ 2023-02-28 11:52       ` Huang, Kai
  2023-02-28 20:18         ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-02-28 11:52 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, pbonzini, kvm,
	Yamahata, Isaku, dmatlack

On Tue, 2023-02-28 at 03:06 -0800, Isaku Yamahata wrote:
> > > +	if (!e)
> > > +		return -ENOMEM;
> > > +	*e  = (struct kvm_cpuid_entry2) {
> > > +		.function = 1,	/* Features for X2APIC */
> > > +		.index = 0,
> > > +		.eax = 0,
> > > +		.ebx = 0,
> > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > +		.edx = 0,
> > > +	};
> > > +	vcpu->arch.cpuid_entries = e;
> > > +	vcpu->arch.cpuid_nent = 1;
> > 
> > As mentioned above, why doing it here? Won't be this be overwritten later in
> > KVM_SET_CPUID2?
> 
> Yes, user space VMM can overwrite cpuid[0x1] and APIC base MSR.  But it
> doesn't
> matter because it's a bug of user space VMM. user space VMM has to keep the
> consistency of cpuid and MSRs.
> Because TDX module virtualizes cpuid[0x1].x2apic to fixed 1, KVM value doesn't
> matter after vcpu creation.
> Because KVM virtualizes APIC base as read only to guest, cpuid[0x1].x2apic
> doesn't matter after vcpu creation as long as user space VMM keeps KVM APIC
> BASE
> value.
> 

On the contrary, can we depend on the userspace VMM to set x2APIC in CPUID and
not do this in KVM?  If userspace doesn't do it, we treat it as userspace's bug.

Plus, userspace needs to set x2APIC in CPUID anyway, regardless of whether you
have done the above here, correct?

I don't see the point of doing the above in KVM, because you are neither
enforcing anything in KVM nor reducing the effort of userspace.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-27 23:49     ` Isaku Yamahata
@ 2023-02-28 17:55       ` Zhi Wang
  2023-02-28 20:20         ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Zhi Wang @ 2023-02-28 17:55 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Zhi Wang, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Mon, 27 Feb 2023 15:49:14 -0800
Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> On Mon, Jan 16, 2023 at 12:46:06PM +0200,
> Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> 
> > On Thu, 12 Jan 2023 08:31:31 -0800
> > isaku.yamahata@intel.com wrote:
> > 
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > The next step of TDX guest creation is to create vcpu.  Allocate TDX vcpu
> > > structures, partially initialize it.  Allocate pages of TDX vcpu for the
> > > TDX module.  Actual donation TDX vcpu pages to the TDX module is not done
> > > yet.
> > > 
> > > In the case of the conventional case, cpuid is empty at the initialization.
> > > and cpuid is configured after the vcpu initialization.  Because TDX
> > > supports only X2APIC mode, cpuid is forcibly initialized to support X2APIC
> > > on the vcpu initialization.
> > > 
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > ---
> > > Changes v10 -> v11:
> > > - NULL check of kvmalloc_array() in tdx_vcpu_reset. Move it to
> > >   tdx_vcpu_create()
> > > 
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > ---
> > >  arch/x86/kvm/vmx/main.c    | 40 ++++++++++++++++++--
> > >  arch/x86/kvm/vmx/tdx.c     | 75 ++++++++++++++++++++++++++++++++++++++
> > >  arch/x86/kvm/vmx/x86_ops.h | 10 +++++
> > >  arch/x86/kvm/x86.c         |  2 +
> > >  4 files changed, 123 insertions(+), 4 deletions(-)
> > > 
> > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > > index ddf0742f1f67..59813ca05f36 100644
> > > --- a/arch/x86/kvm/vmx/main.c
> > > +++ b/arch/x86/kvm/vmx/main.c
> > > @@ -63,6 +63,38 @@ static void vt_vm_free(struct kvm *kvm)
> > >  		tdx_vm_free(kvm);
> > >  }
> > >  
> > > +static int vt_vcpu_precreate(struct kvm *kvm)
> > > +{
> > > +	if (is_td(kvm))
> > > +		return 0;
> > > +
> > > +	return vmx_vcpu_precreate(kvm);
> > > +}
> > > +
> > > +static int vt_vcpu_create(struct kvm_vcpu *vcpu)
> > > +{
> > > +	if (is_td_vcpu(vcpu))
> > > +		return tdx_vcpu_create(vcpu);
> > > +
> > > +	return vmx_vcpu_create(vcpu);
> > > +}
> > > +
> > 
> > -----
> > > +static void vt_vcpu_free(struct kvm_vcpu *vcpu)
> > > +{
> > > +	if (is_td_vcpu(vcpu))
> > > +		return tdx_vcpu_free(vcpu);
> > > +
> > > +	return vmx_vcpu_free(vcpu);
> > > +}
> > > +
> > > +static void vt_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > +{
> > > +	if (is_td_vcpu(vcpu))
> > > +		return tdx_vcpu_reset(vcpu, init_event);
> > > +
> > > +	return vmx_vcpu_reset(vcpu, init_event);
> > > +}
> > > +
> > ----
> > 
> > It seems a little strange to use return in this style. Would it be better like:
> > 
> > -----
> > if (xxx) {
> > 	tdx_vcpu_reset(xxx);
> > 	return; 
> > }
> > 
> > vmx_vcpu_reset(xxx);
> > ----
> > 
> > ?
> 
> It's C11.  I updated the code to not use the feature.
> 
> 
> > >  static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> > >  {
> > >  	if (!is_td(kvm))
> > > @@ -90,10 +122,10 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > >  	.vm_destroy = vt_vm_destroy,
> > >  	.vm_free = vt_vm_free,
> > >  
> > > -	.vcpu_precreate = vmx_vcpu_precreate,
> > > -	.vcpu_create = vmx_vcpu_create,
> > > -	.vcpu_free = vmx_vcpu_free,
> > > -	.vcpu_reset = vmx_vcpu_reset,
> > > +	.vcpu_precreate = vt_vcpu_precreate,
> > > +	.vcpu_create = vt_vcpu_create,
> > > +	.vcpu_free = vt_vcpu_free,
> > > +	.vcpu_reset = vt_vcpu_reset,
> > >  
> > >  	.prepare_switch_to_guest = vmx_prepare_switch_to_guest,
> > >  	.vcpu_load = vmx_vcpu_load,
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 557a609c5147..099f0737a5aa 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -281,6 +281,81 @@ int tdx_vm_init(struct kvm *kvm)
> > >  	return 0;
> > >  }
> > >  
> > > +int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > > +{
> > > +	struct kvm_cpuid_entry2 *e;
> > > +
> > > +	/*
> > > +	 * On cpu creation, cpuid entry is blank.  Forcibly enable
> > > +	 * X2APIC feature to allow X2APIC.
> > > +	 * Because vcpu_reset() can't return error, allocation is done here.
> > > +	 */
> > > +	WARN_ON_ONCE(vcpu->arch.cpuid_entries);
> > > +	WARN_ON_ONCE(vcpu->arch.cpuid_nent);
> > > +	e = kvmalloc_array(1, sizeof(*e), GFP_KERNEL_ACCOUNT);
> > > +	if (!e)
> > > +		return -ENOMEM;
> > > +	*e  = (struct kvm_cpuid_entry2) {
> > > +		.function = 1,	/* Features for X2APIC */
> > > +		.index = 0,
> > > +		.eax = 0,
> > > +		.ebx = 0,
> > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > +		.edx = 0,
> > > +	};
> > > +	vcpu->arch.cpuid_entries = e;
> > > +	vcpu->arch.cpuid_nent = 1;
> > > +
> > > +	/* TDX only supports x2APIC, which requires an in-kernel local APIC. */
> > > +	if (!vcpu->arch.apic)
> > > +		return -EINVAL;
> > > +
> > > +	fpstate_set_confidential(&vcpu->arch.guest_fpu);
> > > +
> > > +	vcpu->arch.efer = EFER_SCE | EFER_LME | EFER_LMA | EFER_NX;
> > > +
> > > +	vcpu->arch.cr0_guest_owned_bits = -1ul;
> > > +	vcpu->arch.cr4_guest_owned_bits = -1ul;
> > > +
> > > +	vcpu->arch.tsc_offset = to_kvm_tdx(vcpu->kvm)->tsc_offset;
> > > +	vcpu->arch.l1_tsc_offset = vcpu->arch.tsc_offset;
> > > +	vcpu->arch.guest_state_protected =
> > > +		!(to_kvm_tdx(vcpu->kvm)->attributes & TDX_TD_ATTRIBUTE_DEBUG);
> > > +
> > > +	return 0;
> > > +}
> > > +
> > > +void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> > > +{
> > > +	/* This is stub for now.  More logic will come. */
> > > +}
> > > +
> > > +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > +{
> > > +	struct msr_data apic_base_msr;
> > > +
> > > +	/* TDX doesn't support INIT event. */
> > > +	if (WARN_ON_ONCE(init_event))
> > > +		goto td_bugged;
> > > +
> > > +	/* TDX rquires X2APIC. */
> >                 ^
> >                requires
> > > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > > +	apic_base_msr.host_initiated = true;
> > > +	if (WARN_ON_ONCE(kvm_set_apic_base(vcpu, &apic_base_msr)))
> > > +		goto td_bugged;
> > > +
> > > +	/*
> > > +	 * Don't update mp_state to runnable because more initialization
> > > +	 * is needed by TDX_VCPU_INIT.
> > > +	 */
> > > +	return;
> > > +
> > > +td_bugged:
> > > +	vcpu->kvm->vm_bugged = true;
> > > +}
> > > +
> > 
> > 1) Using vm_bugged to terminate the VM creation feels off. When
> > using it in creation path, the termination still happens in xx_vcpu_run().
> > 
> > Thus, even something wrong happens at a certain point of the creation path,
> > the VM creation still continues. Until the xxx_vcpu_run(), the VM termination
> > finally happens.
> > 
> > Why not just fail in the creation path?
> 
> I converted vm_bugged to KVM_BUG_ON.  Because the td_bugged case shouldn't
> happen for TDX case, it's worthwhile for KVM_BUG_ON()
> 
> 
> > 2) Move 
> > 
> > > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > > +	apic_base_msr.host_initiated = true;
> > 
> > to:
> > 
> > void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> > {
> >         struct kvm_lapic *apic = vcpu->arch.apic;
> >         u64 msr_val;
> >         int i;
> > 
> >         if (!init_event) {
> >                 msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
> > 
> > 		/* here */
> > 		if (is_td_vcpu(vcpu)) 
> > 			msr_val = xxxx;
> >                 if (kvm_vcpu_is_reset_bsp(vcpu))
> >                         msr_val |= MSR_IA32_APICBASE_BSP;
> >                 kvm_lapic_set_base(vcpu, msr_val);
> >         }
> 
> No. Because I'm trying to contain is_td/is_td_vcpu in vmx specific and not use
> in common x86 code.
>

I guess so.  Centralizing the initialization would be nice and would greatly
improve the readability of the code.  Maybe add a new callback in kvm_x86_ops
like .get_default_msr_val instead; a hypothetical sketch follows.
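
Purely as a hypothetical sketch of the suggestion (the callback name and
plumbing are made up here, not part of the series):

	/* In struct kvm_x86_ops (hypothetical): */
	u64 (*get_default_msr_val)(struct kvm_vcpu *vcpu, u32 msr);

	/* In kvm_lapic_reset() (hypothetical): */
	if (!init_event) {
		msr_val = static_call(kvm_x86_get_default_msr_val)(vcpu,
							MSR_IA32_APICBASE);
		if (kvm_vcpu_is_reset_bsp(vcpu))
			msr_val |= MSR_IA32_APICBASE_BSP;
		kvm_lapic_set_base(vcpu, msr_val);
	}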

> 
> > PS: Is there any reason that APIC MSR in TDX doesn't need
> > MSR_IA32_APICBASE_ENABLE?
> 
> because LAPIC_MODE_X2APIC includes MSR_IA32_APICBASE_ENABLE.
> In lapic.h
>         LAPIC_MODE_X2APIC = MSR_IA32_APICBASE_ENABLE | X2APIC_ENABLE,
> 
> 
> 
> > 3) Change the following:
> > 
> > > +
> > > +	/* TDX doesn't support INIT event. */
> > > +	if (WARN_ON_ONCE(init_event))
> > > +		goto td_bugged;
> > > +
> > 
> > to 
> > 	WARN_ON_ONCE(init_event);
> > 
> > kvm_cpu_deliver_init() will trigger a kvm_vcpu_reset(xxx, init_event=true),
> > but you have already avoided this in vt_vcpu_deliver_init(). A warn
> > is good enough to remind people.
> 
> I converted it into KVM_BUG_ON().
> 
> 
> > With these changes, tdx_vcpu_reset() will only contain the CPUID configuration
> > , using the vm_bugged to terminate the VM in tdx_vcpu_reset() can be removed.
> > 
> > >  int tdx_dev_ioctl(void __user *argp)
> > >  {
> > >  	struct kvm_tdx_capabilities __user *user_caps;
> > > diff --git a/arch/x86/kvm/vmx/x86_ops.h b/arch/x86/kvm/vmx/x86_ops.h
> > > index 6c40dda1cc2f..37ab2cfd35bc 100644
> > > --- a/arch/x86/kvm/vmx/x86_ops.h
> > > +++ b/arch/x86/kvm/vmx/x86_ops.h
> > > @@ -147,7 +147,12 @@ int tdx_offline_cpu(void);
> > >  int tdx_vm_init(struct kvm *kvm);
> > >  void tdx_mmu_release_hkid(struct kvm *kvm);
> > >  void tdx_vm_free(struct kvm *kvm);
> > > +
> > >  int tdx_vm_ioctl(struct kvm *kvm, void __user *argp);
> > > +
> > > +int tdx_vcpu_create(struct kvm_vcpu *vcpu);
> > > +void tdx_vcpu_free(struct kvm_vcpu *vcpu);
> > > +void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event);
> > >  #else
> > >  static inline int tdx_hardware_setup(struct kvm_x86_ops *x86_ops) { return 0; }
> > >  static inline void tdx_hardware_unsetup(void) {}
> > > @@ -159,7 +164,12 @@ static inline int tdx_vm_init(struct kvm *kvm) { return -EOPNOTSUPP; }
> > >  static inline void tdx_mmu_release_hkid(struct kvm *kvm) {}
> > >  static inline void tdx_flush_shadow_all_private(struct kvm *kvm) {}
> > >  static inline void tdx_vm_free(struct kvm *kvm) {}
> > > +
> > >  static inline int tdx_vm_ioctl(struct kvm *kvm, void __user *argp) { return -EOPNOTSUPP; }
> > > +
> > > +static inline int tdx_vcpu_create(struct kvm_vcpu *vcpu) { return -EOPNOTSUPP; }
> > > +static inline void tdx_vcpu_free(struct kvm_vcpu *vcpu) {}
> > > +static inline void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event) {}
> > >  #endif
> > >  
> > >  #endif /* __KVM_X86_VMX_X86_OPS_H */
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 1fb135e0c98f..e8bc66031a1d 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -492,6 +492,7 @@ int kvm_set_apic_base(struct kvm_vcpu *vcpu, struct msr_data *msr_info)
> > >  	kvm_recalculate_apic_map(vcpu->kvm);
> > >  	return 0;
> > >  }
> > > +EXPORT_SYMBOL_GPL(kvm_set_apic_base);
> > >  
> > >  /*
> > >   * Handle a fault on a hardware virtualization (VMX or SVM) instruction.
> > > @@ -12109,6 +12110,7 @@ bool kvm_vcpu_is_reset_bsp(struct kvm_vcpu *vcpu)
> > >  {
> > >  	return vcpu->kvm->arch.bsp_vcpu_id == vcpu->vcpu_id;
> > >  }
> > > +EXPORT_SYMBOL_GPL(kvm_vcpu_is_reset_bsp);
> > >
> > 
> > The symbols don't need to be exported with the changes mentioned above.
> >   
> > >  bool kvm_vcpu_is_bsp(struct kvm_vcpu *vcpu)
> > >  {
> > 
> 


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization
  2023-02-28 11:17     ` Isaku Yamahata
@ 2023-02-28 18:21       ` Zhi Wang
  0 siblings, 0 replies; 221+ messages in thread
From: Zhi Wang @ 2023-02-28 18:21 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Zhi Wang, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack,
	Sean Christopherson

On Tue, 28 Feb 2023 03:17:52 -0800
Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> On Mon, Jan 16, 2023 at 06:07:19PM +0200,
> Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> 
> > On Thu, 12 Jan 2023 08:31:32 -0800
> > isaku.yamahata@intel.com wrote:
> > 
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > TD guest vcpu need to be configured before ready to run which requests
> > > addtional information from Device model (e.g. qemu), one 64bit value is
> > > passed to vcpu's RCX as an initial value.  Repurpose KVM_MEMORY_ENCRYPT_OP
> > > to vcpu-scope and add new sub-commands KVM_TDX_INIT_VCPU under it for such
> > > additional vcpu configuration.
> > > 
> > 
> > Better add more details for this mystic value to save the review efforts.
> > 
> > For exmaple, refining the above part as:
> > 
> > ----
> > 
> > TD hands-off block(HOB) is used to pass the information from VMM to
> > TD virtual firmware(TDVF). Before KVM calls Intel TDX module to launch
> > TDVF, the address of HOB must be placed in the guest RCX.
> > 
> > Extend KVM_MEMORY_ENCRYPT_OP to vcpu-scope and add new... so that
> > TDH.VP.INIT can take the address of HOB from QEMU and place it in the
> > guest RCX when initializing a TDX vCPU.
> > 
> > ----
> > 
> > The below paragraph seems repeating the end of the first paragraph. Guess
> > it can be refined or removed.
> > 
> > 
> > > Add callback for kvm vCPU-scoped operations of KVM_MEMORY_ENCRYPT_OP and
> > > add a new subcommand, KVM_TDX_INIT_VCPU, for further vcpu initialization.
> > >
> 
> I don't think it's good idea to mention about new terminology HOB and TDVF.
> We can say, VMM can pass one parameter.
> Here is the updated one.
> 
>     TD guest vcpu needs TDX specific initialization before running.  Repurpose
>     KVM_MEMORY_ENCRYPT_OP to vcpu-scope, add a new sub-command
>     KVM_TDX_INIT_VCPU, and implement the callback for it.
>

Based on the experience of reviewing this patch, I think it depends on:

1) Does the reviewer need to understand the meaning of this parameter to
review this patch? If yes, a brief description of what it is (even one
sentence) is much better than "one parameter". If no, better to mention a
pointer, like "Refer to the XXX spec for more details of xxx". (No need to
mention the chapter, according to Sean's maintainer book.)

Direct and informative comments are always helpful for reviewing.

2) Does describing this new terminology help with reviewing the following
patches? Do the following patches modify logic around this? If yes, take this
opportunity to educate the reviewer with brief descriptions. If no, a pointer
is good enough.

> 
> > PS: I am curious if the value of guest RCX on each VCPU will be configured
> > differently? (It seems they are the same according to the code of tdx-qemu)
> > 
> > If yes, then it is just an approach to configure the value (even it is
> > through TDH.VP.XXX). It should be configured in the domain level in KVM. The
> > TDX vCPU creation and initialization can be moved into tdx_vcpu_create()
> > and TDH.VP.INIT can take the value from a per-vm data structure.
> 
> RCX can be set for each VCPUs as ABI (or TDX SEAMCALL API) between VMM and vcpu
> initial value.  It's convention between user space VMM(qemu) and guest
> firmware(TDVF) to pass same RCX value for all vcpu.  So KVM shouldn't enforce
> same RCX value for all vcpus.  KVM should allow user space VMM to set the value
> for each vcpus.
> 

I see. That makes sense then.

> 
> > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > ---
> > >  arch/x86/include/asm/kvm-x86-ops.h    |   1 +
> > >  arch/x86/include/asm/kvm_host.h       |   1 +
> > >  arch/x86/include/uapi/asm/kvm.h       |   1 +
> > >  arch/x86/kvm/vmx/main.c               |   9 ++
> > >  arch/x86/kvm/vmx/tdx.c                | 147 +++++++++++++++++++++++++-
> > >  arch/x86/kvm/vmx/tdx.h                |   7 ++
> > >  arch/x86/kvm/vmx/x86_ops.h            |  10 +-
> > >  arch/x86/kvm/x86.c                    |   6 ++
> > >  tools/arch/x86/include/uapi/asm/kvm.h |   1 +
> > >  9 files changed, 178 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h
> > > index 1a27f3aee982..e3e9b1c2599b 100644
> > > --- a/arch/x86/include/asm/kvm-x86-ops.h
> > > +++ b/arch/x86/include/asm/kvm-x86-ops.h
> > > @@ -123,6 +123,7 @@ KVM_X86_OP(enable_smi_window)
> > >  #endif
> > >  KVM_X86_OP_OPTIONAL(dev_mem_enc_ioctl)
> > >  KVM_X86_OP_OPTIONAL(mem_enc_ioctl)
> > > +KVM_X86_OP_OPTIONAL(vcpu_mem_enc_ioctl)
> > >  KVM_X86_OP_OPTIONAL(mem_enc_register_region)
> > >  KVM_X86_OP_OPTIONAL(mem_enc_unregister_region)
> > >  KVM_X86_OP_OPTIONAL(vm_copy_enc_context_from)
> > > diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> > > index 30f4ddb18548..35773f925cc5 100644
> > > --- a/arch/x86/include/asm/kvm_host.h
> > > +++ b/arch/x86/include/asm/kvm_host.h
> > > @@ -1698,6 +1698,7 @@ struct kvm_x86_ops {
> > >  
> > >  	int (*dev_mem_enc_ioctl)(void __user *argp);
> > >  	int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp);
> > > +	int (*vcpu_mem_enc_ioctl)(struct kvm_vcpu *vcpu, void __user *argp);
> > >  	int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> > >  	int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp);
> > >  	int (*vm_copy_enc_context_from)(struct kvm *kvm, unsigned int source_fd);
> > > diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
> > > index b8f28d86d4fd..9236c1699c48 100644
> > > --- a/arch/x86/include/uapi/asm/kvm.h
> > > +++ b/arch/x86/include/uapi/asm/kvm.h
> > > @@ -536,6 +536,7 @@ struct kvm_pmu_event_filter {
> > >  enum kvm_tdx_cmd_id {
> > >  	KVM_TDX_CAPABILITIES = 0,
> > >  	KVM_TDX_INIT_VM,
> > > +	KVM_TDX_INIT_VCPU,
> > >  
> > >  	KVM_TDX_CMD_NR_MAX,
> > >  };
> > > diff --git a/arch/x86/kvm/vmx/main.c b/arch/x86/kvm/vmx/main.c
> > > index 59813ca05f36..23b3ffc3fe23 100644
> > > --- a/arch/x86/kvm/vmx/main.c
> > > +++ b/arch/x86/kvm/vmx/main.c
> > > @@ -103,6 +103,14 @@ static int vt_mem_enc_ioctl(struct kvm *kvm, void __user *argp)
> > >  	return tdx_vm_ioctl(kvm, argp);
> > >  }
> > >  
> > > +static int vt_vcpu_mem_enc_ioctl(struct kvm_vcpu *vcpu, void __user *argp)
> > > +{
> > > +	if (!is_td_vcpu(vcpu))
> > > +		return -EINVAL;
> > > +
> > > +	return tdx_vcpu_ioctl(vcpu, argp);
> > > +}
> > > +
> > >  struct kvm_x86_ops vt_x86_ops __initdata = {
> > >  	.name = KBUILD_MODNAME,
> > >  
> > > @@ -249,6 +257,7 @@ struct kvm_x86_ops vt_x86_ops __initdata = {
> > >  
> > >  	.dev_mem_enc_ioctl = tdx_dev_ioctl,
> > >  	.mem_enc_ioctl = vt_mem_enc_ioctl,
> > > +	.vcpu_mem_enc_ioctl = vt_vcpu_mem_enc_ioctl,
> > >  };
> > >  
> > >  struct kvm_x86_init_ops vt_init_ops __initdata = {
> > > diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
> > > index 099f0737a5aa..e2f5a07ad4e5 100644
> > > --- a/arch/x86/kvm/vmx/tdx.c
> > > +++ b/arch/x86/kvm/vmx/tdx.c
> > > @@ -49,6 +49,11 @@ static __always_inline hpa_t set_hkid_to_hpa(hpa_t pa, u16 hkid)
> > >  	return pa | ((hpa_t)hkid << boot_cpu_data.x86_phys_bits);
> > >  }
> > >  
> > > +static inline bool is_td_vcpu_created(struct vcpu_tdx *tdx)
> > > +{
> > > +	return tdx->tdvpr_pa;
> > > +}
> > > +
> > >  static inline bool is_td_created(struct kvm_tdx *kvm_tdx)
> > >  {
> > >  	return kvm_tdx->tdr_pa;
> > > @@ -65,6 +70,11 @@ static inline bool is_hkid_assigned(struct kvm_tdx *kvm_tdx)
> > >  	return kvm_tdx->hkid > 0;
> > >  }
> > >  
> > > +static inline bool is_td_finalized(struct kvm_tdx *kvm_tdx)
> > > +{
> > > +	return kvm_tdx->finalized;
> > > +}
> > > +
> > >  static void tdx_clear_page(unsigned long page_pa)
> > >  {
> > >  	const void *zero_page = (const void *) __va(page_to_phys(ZERO_PAGE(0)));
> > > @@ -327,7 +337,21 @@ int tdx_vcpu_create(struct kvm_vcpu *vcpu)
> > >  
> > >  void tdx_vcpu_free(struct kvm_vcpu *vcpu)
> > >  {
> > > -	/* This is stub for now.  More logic will come. */
> > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +	int i;
> > > +
> > > +	/* Can't reclaim or free pages if teardown failed. */
> > > +	if (is_hkid_assigned(to_kvm_tdx(vcpu->kvm)))
> > > +		return;
> > > +
> > 
> > Should we have a WARN_ON_ONCE here?
> 
> No.  In the normal case, it is reached with the hkid already reclaimed.
> 
> 
> > > +	if (tdx->tdvpx_pa) {
> > > +		for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> > > +			tdx_reclaim_td_page(tdx->tdvpx_pa[i]);
> > > +		kfree(tdx->tdvpx_pa);
> > > +		tdx->tdvpx_pa = NULL;
> > > +	}
> > > +	tdx_reclaim_td_page(tdx->tdvpr_pa);
> > > +	tdx->tdvpr_pa = 0;
> > >  }
> > >  
> > >  void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > @@ -337,6 +361,8 @@ void tdx_vcpu_reset(struct kvm_vcpu *vcpu, bool init_event)
> > >  	/* TDX doesn't support INIT event. */
> > >  	if (WARN_ON_ONCE(init_event))
> > >  		goto td_bugged;
> > > +	if (WARN_ON_ONCE(is_td_vcpu_created(to_tdx(vcpu))))
> > > +		goto td_bugged;
> > >  
> > >  	/* TDX requires X2APIC. */
> > >  	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > > @@ -791,6 +817,125 @@ int tdx_vm_ioctl(struct kvm *kvm, void __user *argp)
> > >  	return r;
> > >  }
> > >  
> > > +static int tdx_td_vcpu_init(struct kvm_vcpu *vcpu, u64 vcpu_rcx)
> > > +{
> > > +	struct kvm_tdx *kvm_tdx = to_kvm_tdx(vcpu->kvm);
> > > +	struct vcpu_tdx *tdx = to_tdx(vcpu);
> > > +	unsigned long *tdvpx_pa = NULL;
> > > +	unsigned long tdvpr_pa;
> > > +	unsigned long va;
> > > +	int ret, i;
> > > +	u64 err;
> > > +
> > > +	if (is_td_vcpu_created(tdx))
> > > +		return -EINVAL;
> > > +
> > > +	va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > > +	if (!va)
> > > +		return -ENOMEM;
> > > +	tdvpr_pa = __pa(va);
> > > +
> > > +	tdvpx_pa = kcalloc(tdx_caps.tdvpx_nr_pages, sizeof(*tdx->tdvpx_pa),
> > > +			   GFP_KERNEL_ACCOUNT | __GFP_ZERO);
> > > +	if (!tdvpx_pa) {
> > > +		ret = -ENOMEM;
> > > +		goto free_tdvpr;
> > > +	}
> > > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > > +		va = __get_free_page(GFP_KERNEL_ACCOUNT);
> > > +		if (!va)
> > > +			goto free_tdvpx;
> > > +		tdvpx_pa[i] = __pa(va);
> > > +	}
> > > +
> > > +	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
> > > +	if (WARN_ON_ONCE(err)) {
> > > +		ret = -EIO;
> > > +		pr_tdx_error(TDH_VP_CREATE, err, NULL);
> > > +		goto td_bugged_free_tdvpx;
> > > +	}
> > > +	tdx->tdvpr_pa = tdvpr_pa;
> > > +
> > > +	tdx->tdvpx_pa = tdvpx_pa;
> > > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > > +		err = tdh_vp_addcx(tdx->tdvpr_pa, tdvpx_pa[i]);
> > > +		if (WARN_ON_ONCE(err)) {
> > > +			ret = -EIO;
> > > +			pr_tdx_error(TDH_VP_ADDCX, err, NULL);
> > > +			for (; i < tdx_caps.tdvpx_nr_pages; i++) {
> > > +				free_page((unsigned long)__va(tdvpx_pa[i]));
> > > +				tdvpx_pa[i] = 0;
> > > +			}
> > > +			goto td_bugged;
> > > +		}
> > > +	}
> > > +
> > > +	err = tdh_vp_init(tdx->tdvpr_pa, vcpu_rcx);
> > > +	if (WARN_ON_ONCE(err)) {
> > > +		ret = -EIO;
> > > +		pr_tdx_error(TDH_VP_INIT, err, NULL);
> > > +		goto td_bugged;
> > > +	}
> > > +
> > > +	vcpu->arch.mp_state = KVM_MP_STATE_RUNNABLE;
> > > +
> > > +	return 0;
> > > +
> > > +td_bugged_free_tdvpx:
> > > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++) {
> > > +		free_page((unsigned long)__va(tdvpx_pa[i]));
> > > +		tdvpx_pa[i] = 0;
> > > +	}
> > > +	kfree(tdvpx_pa);
> > > +td_bugged:
> > > +	vcpu->kvm->vm_bugged = true;
> > > +	return ret;
> > > +
> > > +free_tdvpx:
> > > +	for (i = 0; i < tdx_caps.tdvpx_nr_pages; i++)
> > > +		if (tdvpx_pa[i])
> > > +			free_page((unsigned long)__va(tdvpx_pa[i]));
> > > +	kfree(tdvpx_pa);
> > > +	tdx->tdvpx_pa = NULL;
> > > +free_tdvpr:
> > > +	if (tdvpr_pa)
> > > +		free_page((unsigned long)__va(tdvpr_pa));
> > > +	tdx->tdvpr_pa = 0;
> > > +
> > > +	return ret;
> > > +}
> > 
> > Same comments with using vm_bugged in the previous patch.
> 
> I converted it to KVM_BUG_ON().
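
For reference, a minimal sketch of that conversion for one of the error paths
above (illustrative only, not the exact follow-up patch): KVM_BUG_ON() both
warns and marks the VM as bugged, so the explicit vcpu->kvm->vm_bugged
assignment and the td_bugged label go away.

	err = tdh_vp_create(kvm_tdx->tdr_pa, tdvpr_pa);
	if (KVM_BUG_ON(err, vcpu->kvm)) {
		pr_tdx_error(TDH_VP_CREATE, err, NULL);
		ret = -EIO;
		goto free_tdvpx;	/* page cleanup as in the original error paths */
	}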


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-28 11:52       ` Huang, Kai
@ 2023-02-28 20:18         ` Isaku Yamahata
  2023-02-28 21:49           ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-28 20:18 UTC (permalink / raw)
  To: Huang, Kai
  Cc: isaku.yamahata, Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, pbonzini, kvm,
	Yamahata, Isaku, dmatlack

On Tue, Feb 28, 2023 at 11:52:59AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Tue, 2023-02-28 at 03:06 -0800, Isaku Yamahata wrote:
> > > > +	if (!e)
> > > > +		return -ENOMEM;
> > > > +	*e  = (struct kvm_cpuid_entry2) {
> > > > +		.function = 1,	/* Features for X2APIC */
> > > > +		.index = 0,
> > > > +		.eax = 0,
> > > > +		.ebx = 0,
> > > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > > +		.edx = 0,
> > > > +	};
> > > > +	vcpu->arch.cpuid_entries = e;
> > > > +	vcpu->arch.cpuid_nent = 1;
> > > 
> > > As mentioned above, why doing it here? Won't be this be overwritten later in
> > > KVM_SET_CPUID2?
> > 
> > Yes, user space VMM can overwrite cpuid[0x1] and APIC base MSR.  But it
> > doesn't
> > matter because it's a bug of user space VMM. user space VMM has to keep the
> > consistency of cpuid and MSRs.
> > Because TDX module virtualizes cpuid[0x1].x2apic to fixed 1, KVM value doesn't
> > matter after vcpu creation.
> > Because KVM virtualizes APIC base as read only to guest, cpuid[0x1].x2apic
> > doesn't matter after vcpu creation as long as user space VMM keeps KVM APIC
> > BASE
> > value.
> > 
> 
> Contrary, can we depend on userspace VMM to set x2APIC in CPUID, but not do this
> in KVM?  If userspace doesn't do it, we treat it as userspace's bug.
> 
> Plus, userspace anyway needs to set x2APIC in CPUID regardless whether you have
> done above here, correct?
> 
> I don't see the point of doing above in KVM because you are neither enforcing
> anything in KVM, nor you are reducing effort of userspace.

Good idea. I can drop the cpuid part from tdx_vcpu_create() and the apic base part
from tdx_vcpu_reset(). It requires modifying tdx_has_emulated_msr() to allow the
user space VMM to update the APIC BASE MSR.

diff --git a/arch/x86/kvm/vmx/tdx.c b/arch/x86/kvm/vmx/tdx.c
index e045a8132639..46c82ce3ef46 100644
--- a/arch/x86/kvm/vmx/tdx.c
+++ b/arch/x86/kvm/vmx/tdx.c
@@ -2624,7 +2624,14 @@ int tdx_get_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 
 int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
 {
-       if (tdx_has_emulated_msr(msr->index, true))
+       if (tdx_has_emulated_msr(msr->index, true) ||
+           /*
+            * The user space VMM is expected to explicitly set the initial
+            * value to X2APIC mode, which deviates from the conventional case.
+            */
+           (msr->host_initiated && msr->index == MSR_IA32_APICBASE &&
+            (msr->data & ~MSR_IA32_APICBASE_BSP) ==
+            (APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC)))
                return kvm_set_msr_common(vcpu, msr);
        return 1;
 }


Just FYI, qemu needs the following change.

--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -274,6 +274,11 @@ bool is_tdx_vm(void)
     return !!tdx_guest;
 }
 
+bool is_tdx_vm_finalized(void)
+{
+    return !!tdx_guest && tdx_guest->parent_obj.ready;
+}
+
 static inline uint32_t host_cpuid_reg(uint32_t function,
                                       uint32_t index, int reg)
 {
@@ -875,10 +880,20 @@ static void tdx_post_init_vcpus(void)
     TdxFirmwareEntry *hob;
     CPUState *cpu;
     int r;
+    uint64_t apic_base;
 
     hob = tdx_get_hob_entry(tdx_guest);
     CPU_FOREACH(cpu) {
-        apic_force_x2apic(X86_CPU(cpu)->apic_state);
+        X86CPU *x86_cpu = X86_CPU(cpu);
+
+        apic_force_x2apic(x86_cpu->apic_state);
+
+        apic_base = APIC_DEFAULT_ADDRESS | MSR_IA32_APICBASE_ENABLE |
+            MSR_IA32_APICBASE_EXTD;
+        if (cpu_is_bsp(x86_cpu))
+            apic_base |= MSR_IA32_APICBASE_BSP;
+        cpu_set_apic_base(x86_cpu->apic_state, apic_base);
+        kvm_put_apicbase(x86_cpu, apic_base);
 
         r = tdx_vcpu_ioctl(cpu, KVM_TDX_INIT_VCPU, 0, (void *)hob->address);
         if (r < 0) {
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 5cc0f730afa6..f77689464738 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -59,8 +59,10 @@ typedef struct TdxGuest {
 
 #ifdef CONFIG_TDX
 bool is_tdx_vm(void);
+bool is_tdx_vm_finalized(void);
 #else
 #define is_tdx_vm() 0
+#define is_tdx_vm_finalized() false
 #endif /* CONFIG_TDX */
 
 int tdx_kvm_init(MachineState *ms, Error **errp);

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply related	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-28 17:55       ` Zhi Wang
@ 2023-02-28 20:20         ` Isaku Yamahata
  2023-03-01  4:58           ` Zhi Wang
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-02-28 20:20 UTC (permalink / raw)
  To: Zhi Wang
  Cc: Isaku Yamahata, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Tue, Feb 28, 2023 at 07:55:09PM +0200,
Zhi Wang <zhi.wang.linux@gmail.com> wrote:

> On Mon, 27 Feb 2023 15:49:14 -0800
> Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> > > 2) Move 
> > > 
> > > > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > > > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > > > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > > > +	apic_base_msr.host_initiated = true;
> > > 
> > > to:
> > > 
> > > void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > {
> > >         struct kvm_lapic *apic = vcpu->arch.apic;
> > >         u64 msr_val;
> > >         int i;
> > > 
> > >         if (!init_event) {
> > >                 msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
> > > 
> > > 		/* here */
> > > 		if (is_td_vcpu(vcpu)) 
> > > 			msr_val = xxxx;
> > >                 if (kvm_vcpu_is_reset_bsp(vcpu))
> > >                         msr_val |= MSR_IA32_APICBASE_BSP;
> > >                 kvm_lapic_set_base(vcpu, msr_val);
> > >         }
> > 
> > No. Because I'm trying to contain is_td/is_td_vcpu in vmx specific and not use
> > in common x86 code.
> >
> 
> I guess so. Centralizing the initialization would be nice and greatly
> improve the readability of the code. Maybe add a new callback in kvm x86_ops
> like .get_default_msr_val instead.

Finally I can eliminate the cpuid/APIC BASE MSR handling and move it to the user space VMM, qemu.
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-28 20:18         ` Isaku Yamahata
@ 2023-02-28 21:49           ` Huang, Kai
  2023-03-01  0:35             ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-02-28 21:49 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, dmatlack, kvm,
	Yamahata, Isaku, pbonzini

On Tue, 2023-02-28 at 12:18 -0800, Isaku Yamahata wrote:
> On Tue, Feb 28, 2023 at 11:52:59AM +0000,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
> > On Tue, 2023-02-28 at 03:06 -0800, Isaku Yamahata wrote:
> > > > > +	if (!e)
> > > > > +		return -ENOMEM;
> > > > > +	*e  = (struct kvm_cpuid_entry2) {
> > > > > +		.function = 1,	/* Features for X2APIC */
> > > > > +		.index = 0,
> > > > > +		.eax = 0,
> > > > > +		.ebx = 0,
> > > > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > > > +		.edx = 0,
> > > > > +	};
> > > > > +	vcpu->arch.cpuid_entries = e;
> > > > > +	vcpu->arch.cpuid_nent = 1;
> > > > 
> > > > As mentioned above, why doing it here? Won't be this be overwritten later in
> > > > KVM_SET_CPUID2?
> > > 
> > > Yes, user space VMM can overwrite cpuid[0x1] and APIC base MSR.  But it
> > > doesn't
> > > matter because it's a bug of user space VMM. user space VMM has to keep the
> > > consistency of cpuid and MSRs.
> > > Because TDX module virtualizes cpuid[0x1].x2apic to fixed 1, KVM value doesn't
> > > matter after vcpu creation.
> > > Because KVM virtualizes APIC base as read only to guest, cpuid[0x1].x2apic
> > > doesn't matter after vcpu creation as long as user space VMM keeps KVM APIC
> > > BASE
> > > value.
> > > 
> > 
> > Contrary, can we depend on userspace VMM to set x2APIC in CPUID, but not do this
> > in KVM?  If userspace doesn't do it, we treat it as userspace's bug.
> > 
> > Plus, userspace anyway needs to set x2APIC in CPUID regardless whether you have
> > done above here, correct?
> > 
> > I don't see the point of doing above in KVM because you are neither enforcing
> > anything in KVM, nor you are reducing effort of userspace.
> 
> Good idea. I can drop cpuid part from tdx_vcpu_create() and apic base part from
> tdx_vcpu_reset(). It needs to modify tdx_has_emulated_msr() to allow user space
> VMM to update APIC BASE MSR.

My personal preference would be:

1) In KVM_SET_CPUID2, do a sanity check of the CPUIDs provided by userspace, and
return an error if the requirements aren't met (i.e. X2APIC isn't advertised).  We
already have cases where KVM_SET_CPUID2 can fail, so extending it to do a
TDX-specific check seems reasonable to me too.
2) For APIC_BASE, you can just initialize the MSR in tdx_vcpu_reset() and ignore
any update (+pr_warn()?) to MSR_IA32_APIC_BASE.


But Sean may have different opinion especially for the CPUID part.
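
For illustration, a rough sketch of the "ignore any update" half of option 2),
assuming the tdx_set_msr()/tdx_has_emulated_msr() shape used elsewhere in this
series; this is a sketch of the idea, not an actual patch.

int tdx_set_msr(struct kvm_vcpu *vcpu, struct msr_data *msr)
{
	/* Hypothetical: APIC_BASE is fixed at vcpu reset; drop later writes. */
	if (msr->index == MSR_IA32_APICBASE) {
		pr_warn_ratelimited("TDX: ignoring write to MSR_IA32_APICBASE\n");
		return 0;
	}

	if (tdx_has_emulated_msr(msr->index, true))
		return kvm_set_msr_common(vcpu, msr);
	return 1;
}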

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-02-27 21:26     ` Isaku Yamahata
@ 2023-02-28 21:57       ` Huang, Kai
  2023-03-01  0:40         ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-02-28 21:57 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, pbonzini, kvm,
	Yamahata, Isaku, dmatlack

On Mon, 2023-02-27 at 13:26 -0800, Isaku Yamahata wrote:
> > > TDX attestation includes the maximum number of vcpu that the guest can
> > > accommodate.  
> > > 
> > 
> > I don't understand why "attestation" is the reason here.  Let's say TDX is
> > used
> > w/o attestation, I don't think this patch can be discarded?
> > 
> > IMHO the true reason is TDX has it's own control of maximum number of vcpus,
> > i.e. asking you to specify the value when creating the TD.  Therefore, the
> > constant KVM_MAX_VCPUS doesn't work for TDX guest anymore.
> 
> Without TDX attestation, this can be discarded.  The TD is created with
> max_vcpus=KVM_MAX_VCPUS by default.

This parses like: 

If we have attestation, the TD can be created with a user-specified non-default
value.  Otherwise, the TD is always created with the default value.

It doesn't make sense, right?

Because architecturally whether a TD can be created with a user-specified value
doesn't depend on attestation at all.

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-28 21:49           ` Huang, Kai
@ 2023-03-01  0:35             ` Isaku Yamahata
  2023-03-01  0:49               ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-03-01  0:35 UTC (permalink / raw)
  To: Huang, Kai
  Cc: isaku.yamahata, Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, dmatlack, kvm,
	Yamahata, Isaku, pbonzini

On Tue, Feb 28, 2023 at 09:49:10PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Tue, 2023-02-28 at 12:18 -0800, Isaku Yamahata wrote:
> > On Tue, Feb 28, 2023 at 11:52:59AM +0000,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > On Tue, 2023-02-28 at 03:06 -0800, Isaku Yamahata wrote:
> > > > > > +	if (!e)
> > > > > > +		return -ENOMEM;
> > > > > > +	*e  = (struct kvm_cpuid_entry2) {
> > > > > > +		.function = 1,	/* Features for X2APIC */
> > > > > > +		.index = 0,
> > > > > > +		.eax = 0,
> > > > > > +		.ebx = 0,
> > > > > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > > > > +		.edx = 0,
> > > > > > +	};
> > > > > > +	vcpu->arch.cpuid_entries = e;
> > > > > > +	vcpu->arch.cpuid_nent = 1;
> > > > > 
> > > > > As mentioned above, why doing it here? Won't be this be overwritten later in
> > > > > KVM_SET_CPUID2?
> > > > 
> > > > Yes, user space VMM can overwrite cpuid[0x1] and APIC base MSR.  But it
> > > > doesn't
> > > > matter because it's a bug of user space VMM. user space VMM has to keep the
> > > > consistency of cpuid and MSRs.
> > > > Because TDX module virtualizes cpuid[0x1].x2apic to fixed 1, KVM value doesn't
> > > > matter after vcpu creation.
> > > > Because KVM virtualizes APIC base as read only to guest, cpuid[0x1].x2apic
> > > > doesn't matter after vcpu creation as long as user space VMM keeps KVM APIC
> > > > BASE
> > > > value.
> > > > 
> > > 
> > > Contrary, can we depend on userspace VMM to set x2APIC in CPUID, but not do this
> > > in KVM?  If userspace doesn't do it, we treat it as userspace's bug.
> > > 
> > > Plus, userspace anyway needs to set x2APIC in CPUID regardless whether you have
> > > done above here, correct?
> > > 
> > > I don't see the point of doing above in KVM because you are neither enforcing
> > > anything in KVM, nor you are reducing effort of userspace.
> > 
> > Good idea. I can drop cpuid part from tdx_vcpu_create() and apic base part from
> > tdx_vcpu_reset(). It needs to modify tdx_has_emulated_msr() to allow user space
> > VMM to update APIC BASE MSR.
> 
> My personal preference would be:
> 
> 1) In KVM_SET_CPUID2, we do sanity check of CPUIDs provided by userspace, and
> return error if not met (i.e X2APIC isn't advertised).  We already have cases
> that KVM_SET_CPUID2 can fail, so extending to do TDX-specific check seems
> reasonable to me too.

This is moot. The current check only verifies the maxphysaddr bit size and that the
specified xfeatures are supported by the host.  That's the bare minimum for KVM to
work; it doesn't try to check consistency.


> 2) For APIC_BASE, you can just initialize the MSR in tdx_vcpu_reset() and ignore
> any update (+pr_warn()?) to MSR_IA32_APIC_BASE.

The x86 common code for KVM_CREATE_VCPU, kvm_arch_vcpu_create(), calls vcpu_create(),
creates the lapic, and calls vcpu_reset().

Setting the APIC BASE MSR with X2APIC enabled checks whether the cpuid x2apic bit is
set.  Please notice guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) in kvm_set_apic_base().
To work around it, one way is to set cpuid artificially in the create method, as this
patch does.  Another way would be to introduce a variant of kvm_set_apic_base(),
dedicated to this purpose, that doesn't check cpuid.  The third option is to make it
the user space responsibility to set the initial reset value of the APIC BASE MSR.

Which option do you prefer?

Thanks,

> 
> 
> But Sean may have different opinion especially for the CPUID part.

-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-02-28 21:57       ` Huang, Kai
@ 2023-03-01  0:40         ` Isaku Yamahata
  2023-03-01  0:54           ` Huang, Kai
  0 siblings, 1 reply; 221+ messages in thread
From: Isaku Yamahata @ 2023-03-01  0:40 UTC (permalink / raw)
  To: Huang, Kai
  Cc: isaku.yamahata, Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, pbonzini, kvm,
	Yamahata, Isaku, dmatlack

On Tue, Feb 28, 2023 at 09:57:50PM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Mon, 2023-02-27 at 13:26 -0800, Isaku Yamahata wrote:
> > > > TDX attestation includes the maximum number of vcpu that the guest can
> > > > accommodate.  
> > > > 
> > > 
> > > I don't understand why "attestation" is the reason here.  Let's say TDX is
> > > used
> > > w/o attestation, I don't think this patch can be discarded?
> > > 
> > > IMHO the true reason is TDX has it's own control of maximum number of vcpus,
> > > i.e. asking you to specify the value when creating the TD.  Therefore, the
> > > constant KVM_MAX_VCPUS doesn't work for TDX guest anymore.
> > 
> > Without TDX attestation, this can be discarded.  The TD is created with
> > max_vcpus=KVM_MAX_VCPUS by default.
> 
> This parses like: 
> 
> If we have attestation, the TD can be created with a user-specified non-default
> value.  Otherwise, the TD is always created with default value.
> 
> It doesn't make sense, right?
> 
> Because architecturally whether TD can be created with a user specified value
> doesn't depend on attestation at all.

I'm not sure if I got your point.
Even without attestation, it's allowed to specify max vcpus.   Not "always".
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-03-01  0:35             ` Isaku Yamahata
@ 2023-03-01  0:49               ` Huang, Kai
  2023-03-03  0:43                 ` Isaku Yamahata
  0 siblings, 1 reply; 221+ messages in thread
From: Huang, Kai @ 2023-03-01  0:49 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, kvm, pbonzini,
	Yamahata, Isaku, dmatlack

On Tue, 2023-02-28 at 16:35 -0800, Isaku Yamahata wrote:
> On Tue, Feb 28, 2023 at 09:49:10PM +0000,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
> > On Tue, 2023-02-28 at 12:18 -0800, Isaku Yamahata wrote:
> > > On Tue, Feb 28, 2023 at 11:52:59AM +0000,
> > > "Huang, Kai" <kai.huang@intel.com> wrote:
> > > 
> > > > On Tue, 2023-02-28 at 03:06 -0800, Isaku Yamahata wrote:
> > > > > > > +	if (!e)
> > > > > > > +		return -ENOMEM;
> > > > > > > +	*e  = (struct kvm_cpuid_entry2) {
> > > > > > > +		.function = 1,	/* Features for X2APIC */
> > > > > > > +		.index = 0,
> > > > > > > +		.eax = 0,
> > > > > > > +		.ebx = 0,
> > > > > > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > > > > > +		.edx = 0,
> > > > > > > +	};
> > > > > > > +	vcpu->arch.cpuid_entries = e;
> > > > > > > +	vcpu->arch.cpuid_nent = 1;
> > > > > > 
> > > > > > As mentioned above, why doing it here? Won't be this be overwritten later in
> > > > > > KVM_SET_CPUID2?
> > > > > 
> > > > > Yes, user space VMM can overwrite cpuid[0x1] and APIC base MSR.  But it
> > > > > doesn't
> > > > > matter because it's a bug of user space VMM. user space VMM has to keep the
> > > > > consistency of cpuid and MSRs.
> > > > > Because TDX module virtualizes cpuid[0x1].x2apic to fixed 1, KVM value doesn't
> > > > > matter after vcpu creation.
> > > > > Because KVM virtualizes APIC base as read only to guest, cpuid[0x1].x2apic
> > > > > doesn't matter after vcpu creation as long as user space VMM keeps KVM APIC
> > > > > BASE
> > > > > value.
> > > > > 
> > > > 
> > > > Contrary, can we depend on userspace VMM to set x2APIC in CPUID, but not do this
> > > > in KVM?  If userspace doesn't do it, we treat it as userspace's bug.
> > > > 
> > > > Plus, userspace anyway needs to set x2APIC in CPUID regardless whether you have
> > > > done above here, correct?
> > > > 
> > > > I don't see the point of doing above in KVM because you are neither enforcing
> > > > anything in KVM, nor you are reducing effort of userspace.
> > > 
> > > Good idea. I can drop cpuid part from tdx_vcpu_create() and apic base part from
> > > tdx_vcpu_reset(). It needs to modify tdx_has_emulated_msr() to allow user space
> > > VMM to update APIC BASE MSR.
> > 
> > My personal preference would be:
> > 
> > 1) In KVM_SET_CPUID2, we do sanity check of CPUIDs provided by userspace, and
> > return error if not met (i.e X2APIC isn't advertised).  We already have cases
> > that KVM_SET_CPUID2 can fail, so extending to do TDX-specific check seems
> > reasonable to me too.
> 
> This is moot. The current check does only check maxphys address bit size and
> specified xfeatures are supported by host.  It's bare minimum for kvm to work.
> It doesn't try to check consistency.
> 
> 
> > 2) For APIC_BASE, you can just initialize the MSR in tdx_vcpu_reset() and ignore
> > any update (+pr_warn()?) to MSR_IA32_APIC_BASE.
> 
> The x86 common code for KVM_CREATE_VCPU, kvm_arch_vcpu_create(), calls vcpu_create,
> creates lapic, and calls vcpu_reset(). 
> 
> Setting APIC BASE MSR with X2APIC enabled, checks if cpuid x2apic bit is set.
> Please notice guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) in kvm_set_apic_base().
> To work around it, one way is set cpuid artificially in create method as this
> patch does.  Other way would be to introduce another version of
> kvm_set_apic_base() that doesn't check cpuid dedicated for this purpose.
> The third option is to make it user space responsibility to set initial reset
> value of APIC BASE MSR.
> 
> Which option do you prefer?
> 

I just recalled that you have already set all CPUIDs via tdx_td_init().  I would do
the following:

1) keep all CPUIDs in tdx_td_init(), and make vcpu->cpuid point to that.
2) Ignore KVM_SET_CPUID2 for TDX guest (+ pr_warn(), etc).
3) Set TDX-fixed CPU registers/msrs, etc in reset_vcpu().
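
A rough sketch of item 2); vcpu_is_protected() is a placeholder for whatever
vm_type/TDX check turns out to be usable from common x86 code (which is exactly
the open question in this thread), so this is illustrative only.

int kvm_vcpu_ioctl_set_cpuid2(struct kvm_vcpu *vcpu, struct kvm_cpuid2 *cpuid,
			      struct kvm_cpuid_entry2 __user *entries)
{
	/* Hypothetical: CPUID is fixed by tdx_td_init(), so ignore userspace updates. */
	if (vcpu_is_protected(vcpu)) {
		pr_warn_ratelimited("KVM_SET_CPUID2 ignored for TDX guest\n");
		return 0;
	}

	/* ... existing KVM_SET_CPUID2 handling continues here ... */
}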



^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP
  2023-03-01  0:40         ` Isaku Yamahata
@ 2023-03-01  0:54           ` Huang, Kai
  0 siblings, 0 replies; 221+ messages in thread
From: Huang, Kai @ 2023-03-01  0:54 UTC (permalink / raw)
  To: isaku.yamahata
  Cc: Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, dmatlack, kvm,
	Yamahata, Isaku, pbonzini

On Tue, 2023-02-28 at 16:40 -0800, Isaku Yamahata wrote:
> On Tue, Feb 28, 2023 at 09:57:50PM +0000,
> "Huang, Kai" <kai.huang@intel.com> wrote:
> 
> > On Mon, 2023-02-27 at 13:26 -0800, Isaku Yamahata wrote:
> > > > > TDX attestation includes the maximum number of vcpu that the guest can
> > > > > accommodate.  
> > > > > 
> > > > 
> > > > I don't understand why "attestation" is the reason here.  Let's say TDX is
> > > > used
> > > > w/o attestation, I don't think this patch can be discarded?
> > > > 
> > > > IMHO the true reason is TDX has it's own control of maximum number of vcpus,
> > > > i.e. asking you to specify the value when creating the TD.  Therefore, the
> > > > constant KVM_MAX_VCPUS doesn't work for TDX guest anymore.
> > > 
> > > Without TDX attestation, this can be discarded.  The TD is created with
> > > max_vcpus=KVM_MAX_VCPUS by default.
> > 
> > This parses like: 
> > 
> > If we have attestation, the TD can be created with a user-specified non-default
> > value.  Otherwise, the TD is always created with default value.
> > 
> > It doesn't make sense, right?
> > 
> > Because architecturally whether TD can be created with a user specified value
> > doesn't depend on attestation at all.
> 
> I'm not sure if I got your point.
> Even without attestation, it's allowed to specify max vcpus.   Not "always".
> 

Exactly.

So "allow to specify max vcpus" isn't due to attestation, as you also said.  

Then why this patch is due to "TDX attestation includes the maximum number of
vcpu that the guest can accommodate"?  Shouldn't the reason be "TDX
architecturally has it's own control of maximum number of vcpus" as I mentioned
in my first reply?
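
(For context, the mechanism being discussed is letting userspace lower the vCPU
limit before creating vCPUs.  A minimal userspace sketch, assuming the
KVM_ENABLE_CAP-based interface named in the patch subject, the standard
<linux/kvm.h> definitions, and that args[0] carries the requested limit; the
exact uapi is whatever the patch defines, so treat the args[0] meaning as an
assumption:

	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_MAX_VCPUS,
		.args = { 4 },		/* requested maximum number of vCPUs for this TD */
	};

	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0)
		perror("KVM_ENABLE_CAP(KVM_CAP_MAX_VCPUS)");

where vm_fd is the VM file descriptor.)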

^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-02-28 20:20         ` Isaku Yamahata
@ 2023-03-01  4:58           ` Zhi Wang
  0 siblings, 0 replies; 221+ messages in thread
From: Zhi Wang @ 2023-03-01  4:58 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Zhi Wang, isaku.yamahata, kvm, linux-kernel, Paolo Bonzini,
	erdemaktas, Sean Christopherson, Sagi Shahar, David Matlack

On Tue, 28 Feb 2023 12:20:31 -0800
Isaku Yamahata <isaku.yamahata@gmail.com> wrote:

> On Tue, Feb 28, 2023 at 07:55:09PM +0200,
> Zhi Wang <zhi.wang.linux@gmail.com> wrote:
> 
> > On Mon, 27 Feb 2023 15:49:14 -0800
> > Isaku Yamahata <isaku.yamahata@gmail.com> wrote:
> 
> > > > 2) Move 
> > > > 
> > > > > +	apic_base_msr.data = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;
> > > > > +	if (kvm_vcpu_is_reset_bsp(vcpu))
> > > > > +		apic_base_msr.data |= MSR_IA32_APICBASE_BSP;
> > > > > +	apic_base_msr.host_initiated = true;
> > > > 
> > > > to:
> > > > 
> > > > void kvm_lapic_reset(struct kvm_vcpu *vcpu, bool init_event)
> > > > {
> > > >         struct kvm_lapic *apic = vcpu->arch.apic;
> > > >         u64 msr_val;
> > > >         int i;
> > > > 
> > > >         if (!init_event) {
> > > >                 msr_val = APIC_DEFAULT_PHYS_BASE | MSR_IA32_APICBASE_ENABLE;
> > > > 
> > > > 		/* here */
> > > > 		if (is_td_vcpu(vcpu)) 
> > > > 			msr_val = xxxx;
> > > >                 if (kvm_vcpu_is_reset_bsp(vcpu))
> > > >                         msr_val |= MSR_IA32_APICBASE_BSP;
> > > >                 kvm_lapic_set_base(vcpu, msr_val);
> > > >         }
> > > 
> > > No. Because I'm trying to contain is_td/is_td_vcpu in vmx specific and not use
> > > in common x86 code.
> > >
> > 
> > I guess so. Centralizing the initialization would be nice and greatly
> > improve the readability of the code. Maybe add a new callback in kvm x86_ops
> > like .get_default_msr_val instead.
> 
> Finally I can eliminate the cpuid/APIC BASE MSR handling and move it to the user space VMM, qemu.

Great to hear.


^ permalink raw reply	[flat|nested] 221+ messages in thread

* Re: [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure
  2023-03-01  0:49               ` Huang, Kai
@ 2023-03-03  0:43                 ` Isaku Yamahata
  0 siblings, 0 replies; 221+ messages in thread
From: Isaku Yamahata @ 2023-03-03  0:43 UTC (permalink / raw)
  To: Huang, Kai
  Cc: isaku.yamahata, Christopherson,,
	Sean, Shahar, Sagi, linux-kernel, Aktas, Erdem, kvm, pbonzini,
	Yamahata, Isaku, dmatlack

On Wed, Mar 01, 2023 at 12:49:06AM +0000,
"Huang, Kai" <kai.huang@intel.com> wrote:

> On Tue, 2023-02-28 at 16:35 -0800, Isaku Yamahata wrote:
> > On Tue, Feb 28, 2023 at 09:49:10PM +0000,
> > "Huang, Kai" <kai.huang@intel.com> wrote:
> > 
> > > On Tue, 2023-02-28 at 12:18 -0800, Isaku Yamahata wrote:
> > > > On Tue, Feb 28, 2023 at 11:52:59AM +0000,
> > > > "Huang, Kai" <kai.huang@intel.com> wrote:
> > > > 
> > > > > On Tue, 2023-02-28 at 03:06 -0800, Isaku Yamahata wrote:
> > > > > > > > +	if (!e)
> > > > > > > > +		return -ENOMEM;
> > > > > > > > +	*e  = (struct kvm_cpuid_entry2) {
> > > > > > > > +		.function = 1,	/* Features for X2APIC */
> > > > > > > > +		.index = 0,
> > > > > > > > +		.eax = 0,
> > > > > > > > +		.ebx = 0,
> > > > > > > > +		.ecx = 1ULL << 21,	/* X2APIC */
> > > > > > > > +		.edx = 0,
> > > > > > > > +	};
> > > > > > > > +	vcpu->arch.cpuid_entries = e;
> > > > > > > > +	vcpu->arch.cpuid_nent = 1;
> > > > > > > 
> > > > > > > As mentioned above, why doing it here? Won't be this be overwritten later in
> > > > > > > KVM_SET_CPUID2?
> > > > > > 
> > > > > > Yes, user space VMM can overwrite cpuid[0x1] and APIC base MSR.  But it
> > > > > > doesn't
> > > > > > matter because it's a bug of user space VMM. user space VMM has to keep the
> > > > > > consistency of cpuid and MSRs.
> > > > > > Because TDX module virtualizes cpuid[0x1].x2apic to fixed 1, KVM value doesn't
> > > > > > matter after vcpu creation.
> > > > > > Because KVM virtualizes APIC base as read only to guest, cpuid[0x1].x2apic
> > > > > > doesn't matter after vcpu creation as long as user space VMM keeps KVM APIC
> > > > > > BASE
> > > > > > value.
> > > > > > 
> > > > > 
> > > > > Contrary, can we depend on userspace VMM to set x2APIC in CPUID, but not do this
> > > > > in KVM?  If userspace doesn't do it, we treat it as userspace's bug.
> > > > > 
> > > > > Plus, userspace anyway needs to set x2APIC in CPUID regardless whether you have
> > > > > done above here, correct?
> > > > > 
> > > > > I don't see the point of doing above in KVM because you are neither enforcing
> > > > > anything in KVM, nor you are reducing effort of userspace.
> > > > 
> > > > Good idea. I can drop cpuid part from tdx_vcpu_create() and apic base part from
> > > > tdx_vcpu_reset(). It needs to modify tdx_has_emulated_msr() to allow user space
> > > > VMM to update APIC BASE MSR.
> > > 
> > > My personal preference would be:
> > > 
> > > 1) In KVM_SET_CPUID2, we do sanity check of CPUIDs provided by userspace, and
> > > return error if not met (i.e X2APIC isn't advertised).  We already have cases
> > > that KVM_SET_CPUID2 can fail, so extending to do TDX-specific check seems
> > > reasonable to me too.
> > 
> > This is moot. The current check does only check maxphys address bit size and
> > specified xfeatures are supported by host.  It's bare minimum for kvm to work.
> > It doesn't try to check consistency.
> > 
> > 
> > > 2) For APIC_BASE, you can just initialize the MSR in tdx_vcpu_reset() and ignore
> > > any update (+pr_warn()?) to MSR_IA32_APIC_BASE.
> > 
> > The x86 common code for KVM_CREATE_VCPU, kvm_arch_vcpu_create(), calls vcpu_create,
> > creates lapic, and calls vcpu_reset(). 
> > 
> > Setting APIC BASE MSR with X2APIC enabled, checks if cpuid x2apic bit is set.
> > Please notice guest_cpuid_has(vcpu, X86_FEATURE_X2APIC) in kvm_set_apic_base().
> > To work around it, one way is set cpuid artificially in create method as this
> > patch does.  Other way would be to introduce another version of
> > kvm_set_apic_base() that doesn't check cpuid dedicated for this purpose.
> > The third option is to make it user space responsibility to set initial reset
> > value of APIC BASE MSR.
> > 
> > Which option do you prefer?
> > 
> 
> I just recall you have already set all CPUIDs via tdx_td_init().  I would do
> below:
> 
> 1) keep all CPUIDs in tdx_td_init(), and make vcpu->cpuid point to that.
> 2) Ignore KVM_SET_CPUID2 for TDX guest (+ pr_warn(), etc).
> 3) Set TDX-fixed CPU registers/msrs, etc in reset_vcpu().

Finally I came up with the following flow.

- KVM_CREATE_VCPU
  - tdx_vcpu_create()
    no cpuid, no msr operation
- KVM_SET_CPUID2
  user space has to set cpuid...x2apic=1
- KVM_TDX_INIT_VCPU
  tdx_vcpu_ioctl() sets the APIC_BASE MSR to x2apic-enabled.
  Here, if the user space VMM doesn't set cpuid properly, it results in an error.

APIC_BASE MSR is read only for both the user space VMM and the guest TD.
cpuid:
After KVM_TDX_INIT_VCPU, user space VMM can update it by KVM_SET_CPUID2.
KVM doesn't care.  The guest TD sees the cpuid.x2apic value virtualized by the TDX
module, while KVM internally doesn't use the value because APIC_BASE won't change.
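
A minimal sketch of the KVM_TDX_INIT_VCPU step in that flow (assumed shape, not
the actual patch): fail if userspace didn't advertise x2APIC via KVM_SET_CPUID2,
then pin APIC_BASE to x2APIC mode.  The function name is hypothetical and the
direct field write stands in for the proper lapic helpers.

static int tdx_vcpu_init_apic_base(struct kvm_vcpu *vcpu)
{
	u64 val = APIC_DEFAULT_PHYS_BASE | LAPIC_MODE_X2APIC;

	/* Userspace bug if KVM_SET_CPUID2 didn't advertise x2APIC. */
	if (!guest_cpuid_has(vcpu, X86_FEATURE_X2APIC))
		return -EINVAL;

	if (kvm_vcpu_is_reset_bsp(vcpu))
		val |= MSR_IA32_APICBASE_BSP;
	/* Simplified: real code would go through the lapic/MSR helpers. */
	vcpu->arch.apic_base = val;
	return 0;
}

After this point, later KVM_SET_CPUID2 calls no longer matter to KVM, as
described above.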


Thanks,
-- 
Isaku Yamahata <isaku.yamahata@gmail.com>

^ permalink raw reply	[flat|nested] 221+ messages in thread

end of thread, other threads:[~2023-03-03  0:43 UTC | newest]

Thread overview: 221+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-01-12 16:31 [PATCH v11 000/113] KVM TDX basic feature support isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 001/113] KVM: VMX: Move out vmx_x86_ops to 'main.c' to wrap VMX and TDX isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 002/113] KVM: x86/vmx: Refactor KVM VMX module init/exit functions isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 003/113] KVM: TDX: Add placeholders for TDX VM/vcpu structure isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 004/113] KVM: TDX: Initialize the TDX module when loading the KVM intel kernel module isaku.yamahata
2023-01-13 12:31   ` Zhi Wang
2023-01-17 16:03     ` Isaku Yamahata
2023-01-17 21:41       ` Huang, Kai
2023-01-16  3:48   ` Huang, Kai
2023-01-12 16:31 ` [PATCH v11 005/113] KVM: x86: Introduce vm_type to differentiate default VMs from confidential VMs isaku.yamahata
2023-01-17  3:31   ` Binbin Wu
2023-01-12 16:31 ` [PATCH v11 006/113] KVM: TDX: Make TDX VM type supported isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 007/113] [MARKER] The start of TDX KVM patch series: TDX architectural definitions isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 008/113] KVM: TDX: Define " isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 009/113] KVM: TDX: Add TDX "architectural" error codes isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 010/113] KVM: TDX: Add C wrapper functions for SEAMCALLs to the TDX module isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 011/113] KVM: TDX: Add helper functions to print TDX SEAMCALL error isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 012/113] [MARKER] The start of TDX KVM patch series: TD VM creation/destruction isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 013/113] x86/cpu: Add helper functions to allocate/free TDX private host key id isaku.yamahata
2023-01-13 12:47   ` Zhi Wang
2023-01-13 15:21     ` Sean Christopherson
2023-01-14  9:38       ` Zhi Wang
2023-01-12 16:31 ` [PATCH v11 014/113] x86/virt/tdx: Add a helper function to return system wide info about TDX module isaku.yamahata
2023-01-16  4:19   ` Huang, Kai
2023-02-27 21:20     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 015/113] KVM: TDX: x86: Add ioctl to get TDX systemwide parameters isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 016/113] KVM: TDX: Add place holder for TDX VM specific mem_enc_op ioctl isaku.yamahata
2023-01-19  2:40   ` Huang, Kai
2023-02-27 21:22     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 017/113] KVM: Support KVM_CAP_MAX_VCPUS for KVM_ENABLE_CAP isaku.yamahata
2023-01-13 12:55   ` Zhi Wang
2023-02-27 21:28     ` Isaku Yamahata
2023-01-16  4:44   ` Huang, Kai
2023-02-27 21:26     ` Isaku Yamahata
2023-02-28 21:57       ` Huang, Kai
2023-03-01  0:40         ` Isaku Yamahata
2023-03-01  0:54           ` Huang, Kai
2023-01-12 16:31 ` [PATCH v11 018/113] KVM: TDX: create/destroy VM structure isaku.yamahata
2023-01-13 13:12   ` Zhi Wang
2023-01-13 15:16     ` Sean Christopherson
2023-01-14  9:16       ` Zhi Wang
2023-01-17 15:55         ` Sean Christopherson
2023-01-17 19:44           ` Zhi Wang
2023-01-17 20:56             ` Sean Christopherson
2023-01-17 21:01               ` Sean Christopherson
2023-01-19 11:31                 ` Huang, Kai
2023-01-19 15:37                   ` Sean Christopherson
2023-01-19 20:39                     ` Huang, Kai
2023-01-19 21:36                       ` Sean Christopherson
2023-01-19 23:08                         ` Huang, Kai
2023-01-19 23:11                           ` Sean Christopherson
2023-01-19 23:24                             ` Huang, Kai
2023-01-19 23:25                               ` Huang, Kai
2023-01-19 23:55                         ` Huang, Kai
2023-01-20  0:16                           ` Sean Christopherson
2023-01-20 22:21                             ` David Matlack
2023-01-21  0:12                               ` Sean Christopherson
2023-01-23  1:51                                 ` Huang, Kai
2023-01-23 17:41                                   ` Sean Christopherson
2023-01-26 10:54                                     ` Huang, Kai
2023-01-26 17:28                                       ` Sean Christopherson
2023-01-26 21:18                                         ` Huang, Kai
2023-01-26 21:59                                           ` Sean Christopherson
2023-01-26 22:27                                             ` Huang, Kai
2023-01-30 19:15                                               ` Sean Christopherson
2023-01-19 22:45               ` Zhi Wang
2023-01-19 22:51                 ` Sean Christopherson
     [not found]   ` <080e0a246e927545718b6f427dfdcdde505a8859.camel@intel.com>
2023-01-19 15:29     ` Sean Christopherson
2023-01-19 20:40       ` Huang, Kai
2023-01-12 16:31 ` [PATCH v11 019/113] KVM: TDX: initialize VM with TDX specific parameters isaku.yamahata
2023-01-13 14:58   ` Zhi Wang
2023-01-16 10:04   ` Huang, Kai
2023-02-27 21:32     ` Isaku Yamahata
2023-01-17 12:19   ` Huang, Kai
2023-02-27 21:44     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 020/113] KVM: TDX: Make pmu_intel.c ignore guest TD case isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 021/113] KVM: TDX: Refuse to unplug the last cpu on the package isaku.yamahata
2023-01-16 10:23   ` Huang, Kai
2023-02-27 21:48     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 022/113] [MARKER] The start of TDX KVM patch series: TD vcpu creation/destruction isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 023/113] KVM: TDX: allocate/free TDX vcpu structure isaku.yamahata
2023-01-16 10:46   ` Zhi Wang
2023-02-27 23:49     ` Isaku Yamahata
2023-02-28 17:55       ` Zhi Wang
2023-02-28 20:20         ` Isaku Yamahata
2023-03-01  4:58           ` Zhi Wang
2023-01-19  0:45   ` Huang, Kai
2023-02-28 11:06     ` Isaku Yamahata
2023-02-28 11:52       ` Huang, Kai
2023-02-28 20:18         ` Isaku Yamahata
2023-02-28 21:49           ` Huang, Kai
2023-03-01  0:35             ` Isaku Yamahata
2023-03-01  0:49               ` Huang, Kai
2023-03-03  0:43                 ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 024/113] KVM: TDX: Do TDX specific vcpu initialization isaku.yamahata
2023-01-16 16:07   ` Zhi Wang
2023-02-28 11:17     ` Isaku Yamahata
2023-02-28 18:21       ` Zhi Wang
2023-01-19 10:37   ` Huang, Kai
2023-02-28 11:27     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 025/113] KVM: TDX: Use private memory for TDX isaku.yamahata
2023-01-16 10:45   ` Huang, Kai
2023-01-17 16:40     ` Sean Christopherson
2023-01-17 22:52       ` Huang, Kai
2023-01-18  1:16         ` Sean Christopherson
2023-01-12 16:31 ` [PATCH v11 026/113] [MARKER] The start of TDX KVM patch series: KVM MMU GPA shared bits isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 027/113] KVM: x86/mmu: introduce config for PRIVATE KVM MMU isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 028/113] KVM: x86/mmu: Add address conversion functions for TDX shared bit of GPA isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 029/113] [MARKER] The start of TDX KVM patch series: KVM TDP refactoring for TDX isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 030/113] KVM: x86/mmu: Replace hardcoded value 0 for the initial value for SPTE isaku.yamahata
2023-01-25  9:24   ` Zhi Wang
2023-01-25 17:22     ` Sean Christopherson
2023-01-26 21:37       ` Huang, Kai
2023-01-26 22:01         ` Sean Christopherson
2023-02-27 21:52           ` Isaku Yamahata
2023-01-27 21:36       ` Zhi Wang
2023-02-27 21:50         ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 031/113] KVM: x86/mmu: Allow non-zero value for non-present SPTE and removed SPTE isaku.yamahata
2023-01-16 10:54   ` Huang, Kai
2023-02-27 21:53     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 032/113] KVM: x86/mmu: Add Suppress VE bit to shadow_mmio_mask isaku.yamahata
2023-01-16 10:59   ` Huang, Kai
2023-01-12 16:31 ` [PATCH v11 033/113] KVM: x86/mmu: Track shadow MMIO value on a per-VM basis isaku.yamahata
2023-01-16 11:16   ` Huang, Kai
2023-02-27 21:58     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 034/113] KVM: x86/mmu: Disallow fast page fault on private GPA isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 035/113] KVM: x86/mmu: Allow per-VM override of the TDP max page level isaku.yamahata
2023-01-16 11:29   ` Huang, Kai
2023-01-12 16:31 ` [PATCH v11 036/113] KVM: VMX: Introduce test mode related to EPT violation VE isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 037/113] [MARKER] The start of TDX KVM patch series: KVM TDP MMU hooks isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 038/113] KVM: x86/tdp_mmu: Init role member of struct kvm_mmu_page at allocation isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 039/113] KVM: x86/mmu: Require TDP MMU for TDX isaku.yamahata
2023-01-19 11:37   ` Huang, Kai
2023-01-12 16:31 ` [PATCH v11 040/113] KVM: x86/mmu: Add a new is_private member for union kvm_mmu_page_role isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 041/113] KVM: x86/mmu: Add a private pointer to struct kvm_mmu_page isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 042/113] KVM: Add flags to struct kvm_gfn_range isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 043/113] KVM: x86/tdp_mmu: Don't zap private pages for unsupported cases isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 044/113] KVM: x86/tdp_mmu: Make handle_changed_spte() return value isaku.yamahata
2023-02-16 16:39   ` Zhi Wang
2023-01-12 16:31 ` [PATCH v11 045/113] KVM: x86/mmu: Make make_spte() aware of shared GPA for MTRR isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 046/113] KVM: x86/tdp_mmu: Support TDX private mapping for TDP MMU isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 047/113] [MARKER] The start of TDX KVM patch series: TDX EPT violation isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 048/113] KVM: x86/mmu: Disallow dirty logging for x86 TDX isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 049/113] KVM: x86/mmu: TDX: Do not enable page track for TD guest isaku.yamahata
2023-01-12 16:31 ` [PATCH v11 050/113] KVM: x86/tdp_mmu: Ignore unsupported mmu operation on private GFNs isaku.yamahata
2023-01-17  2:40   ` Huang, Kai
2023-02-27 22:00     ` Isaku Yamahata
2023-02-17  8:27   ` Zhi Wang
2023-02-27 22:02     ` Isaku Yamahata
2023-01-12 16:31 ` [PATCH v11 051/113] KVM: VMX: Split out guts of EPT violation to common/exposed function isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 052/113] KVM: VMX: Move setting of EPT MMU masks to common VT-x code isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 053/113] KVM: TDX: Add accessors VMX VMCS helpers isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 054/113] KVM: TDX: Add load_mmu_pgd method for TDX isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 055/113] KVM: x86/VMX: introduce vmx tlb_remote_flush and tlb_remote_flush_with_range isaku.yamahata
2023-01-17  2:06   ` Huang, Kai
2023-01-17 16:53     ` Sean Christopherson
2023-02-27 22:03       ` Isaku Yamahata
2023-01-12 16:32 ` [PATCH v11 056/113] KVM: TDX: Retry seamcall when TDX_OPERAND_BUSY with operand SEPT isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 057/113] KVM: TDX: TDP MMU TDX support isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 058/113] KVM: TDX: MTRR: implement get_mt_mask() for TDX isaku.yamahata
2023-01-17  3:11   ` Huang, Kai
2023-02-27 23:30     ` Isaku Yamahata
2023-02-03  6:55   ` Yuan Yao
2023-02-27 22:15     ` Isaku Yamahata
2023-01-12 16:32 ` [PATCH v11 059/113] [MARKER] The start of TDX KVM patch series: TD finalization isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 060/113] KVM: x86/mmu: Introduce kvm_mmu_map_tdp_page() for use by TDX isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 061/113] KVM: TDX: Create initial guest memory isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 062/113] KVM: TDX: Finalize VM initialization isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 063/113] [MARKER] The start of TDX KVM patch series: TD vcpu enter/exit isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 064/113] KVM: TDX: Add helper assembly function to TDX vcpu isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 065/113] KVM: TDX: Implement TDX vcpu enter/exit path isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 066/113] KVM: TDX: vcpu_run: save/restore host state(host kernel gs) isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 067/113] KVM: TDX: restore host xsave state when exit from the guest TD isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 068/113] KVM: x86: Allow to update cached values in kvm_user_return_msrs w/o wrmsr isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 069/113] KVM: TDX: restore user ret MSRs isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 070/113] [MARKER] The start of TDX KVM patch series: TD vcpu exits/interrupts/hypercalls isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 071/113] KVM: TDX: complete interrupts after tdexit isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 072/113] KVM: TDX: restore debug store when TD exit isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 073/113] KVM: TDX: handle vcpu migration over logical processor isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 074/113] KVM: x86: Add a switch_db_regs flag to handle TDX's auto-switched behavior isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 075/113] KVM: TDX: Add support for find pending IRQ in a protected local APIC isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 076/113] KVM: x86: Assume timer IRQ was injected if APIC state is proteced isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 077/113] KVM: TDX: remove use of struct vcpu_vmx from posted_interrupt.c isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 078/113] KVM: TDX: Implement interrupt injection isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 079/113] KVM: TDX: Implements vcpu request_immediate_exit isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 080/113] KVM: TDX: Implement methods to inject NMI isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 081/113] KVM: VMX: Modify NMI and INTR handlers to take intr_info as function argument isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 082/113] KVM: VMX: Move NMI/exception handler to common helper isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 083/113] KVM: x86: Split core of hypercall emulation to helper function isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 084/113] KVM: TDX: Add a place holder to handle TDX VM exit isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 085/113] KVM: TDX: Handle vmentry failure for INTEL TD guest isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 086/113] KVM: TDX: handle EXIT_REASON_OTHER_SMI isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 087/113] KVM: TDX: handle ept violation/misconfig exit isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 088/113] KVM: TDX: handle EXCEPTION_NMI and EXTERNAL_INTERRUPT isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 089/113] KVM: TDX: Add a place holder for handler of TDX hypercalls (TDG.VP.VMCALL) isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 090/113] KVM: TDX: handle KVM hypercall with TDG.VP.VMCALL isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 091/113] KVM: TDX: Add KVM Exit for TDX TDG.VP.VMCALL isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 092/113] KVM: TDX: Handle TDX PV CPUID hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 093/113] KVM: TDX: Handle TDX PV HLT hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 094/113] KVM: TDX: Handle TDX PV port io hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 095/113] KVM: TDX: Handle TDX PV MMIO hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 096/113] KVM: TDX: Implement callbacks for MSR operations for TDX isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 097/113] KVM: TDX: Handle TDX PV rdmsr/wrmsr hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 098/113] KVM: TDX: Handle TDX PV report fatal error hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 099/113] KVM: TDX: Handle TDX PV map_gpa hypercall isaku.yamahata
2023-01-31  1:30   ` Yuan Yao
2023-02-27 22:12     ` Isaku Yamahata
2023-01-12 16:32 ` [PATCH v11 100/113] KVM: TDX: Handle TDG.VP.VMCALL<GetTdVmCallInfo> hypercall isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 101/113] KVM: TDX: Silently discard SMI request isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 102/113] KVM: TDX: Silently ignore INIT/SIPI isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 103/113] KVM: TDX: Add methods to ignore accesses to CPU state isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 104/113] KVM: TDX: Add methods to ignore guest instruction emulation isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 105/113] KVM: TDX: Add a method to ignore dirty logging isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 106/113] KVM: TDX: Add methods to ignore VMX preemption timer isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 107/113] KVM: TDX: Add methods to ignore accesses to TSC isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 108/113] KVM: TDX: Ignore setting up mce isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 109/113] KVM: TDX: Add a method to ignore for TDX to ignore hypercall patch isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 110/113] KVM: TDX: Add methods to ignore virtual apic related operation isaku.yamahata
2023-01-12 16:32 ` [PATCH v11 111/113] Documentation/virt/kvm: Document on Trust Domain Extensions(TDX) isaku.yamahata
2023-01-12 16:33 ` [PATCH v11 112/113] KVM: x86: design documentation on TDX support of x86 KVM TDP MMU isaku.yamahata
2023-01-12 16:33 ` [PATCH v11 113/113] [MARKER] the end of (the first phase of) TDX KVM patch series isaku.yamahata

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).