* [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support
@ 2023-11-15  7:14 Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 01/70] *** HACK *** linux-headers: Update headers to pull in gmem APIs Xiaoyao Li
                   ` (69 more replies)
  0 siblings, 70 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

This v3 series combines the previous QEMU gmem series[1] and TDX QEMU series[2].
Because TDX is going to be the first user of gmem (guest memfd) in QEMU,
combining them gives a full picture of how they work together.

KVM provides guest memfd, which cannot be mapped, read, or written by
userspace. It is designed to serve as private memory for confidential
VMs, such as Intel TDX and AMD SEV-SNP guests.

Patches 1 - 11 add guest memfd support to QEMU, associating it with
RAMBlock. For the VM types that require private memory (see
tdx_kvm_init() in patch 17), QEMU will automatically create a guest memfd
for each RAMBlock.
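
To make the RAMBlock association concrete, here is a minimal sketch of how a
block that carries both a host mapping and a guest memfd might be described
to KVM. It is not the code from this series. The struct layout and the
KVM_MEM_PRIVATE value are copied from the header update in patch 01, while
ram_block_sketch, memslot2_sketch and build_memslot() are hypothetical names
used only for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Local mirror of struct kvm_userspace_memory_region2 (see patch 01). */
struct memslot2_sketch {
    uint32_t slot;
    uint32_t flags;
    uint64_t guest_phys_addr;
    uint64_t memory_size;
    uint64_t userspace_addr;
    uint64_t guest_memfd_offset;
    uint32_t guest_memfd;
    uint32_t pad1;
    uint64_t pad2[14];
};

/* Value of KVM_MEM_PRIVATE in this series' headers (patch 01). */
#define SKETCH_KVM_MEM_PRIVATE (1UL << 2)

/* Hypothetical, trimmed-down stand-in for QEMU's RAMBlock. */
struct ram_block_sketch {
    void *host;           /* shared memory: ordinary host mapping */
    uint64_t max_length;  /* size of the block in bytes */
    int guest_memfd;      /* private memory: -1 when absent */
};

/* Describe one block as a memslot; the private part is optional. */
struct memslot2_sketch build_memslot(uint32_t slot, uint64_t gpa,
                                     const struct ram_block_sketch *rb)
{
    struct memslot2_sketch m;

    assert(rb != NULL);
    memset(&m, 0, sizeof(m));
    m.slot = slot;
    m.guest_phys_addr = gpa;
    m.memory_size = rb->max_length;
    m.userspace_addr = (uint64_t)(uintptr_t)rb->host;

    if (rb->guest_memfd >= 0) {
        /* Only blocks with a valid guest memfd get the private flag. */
        m.flags |= SKETCH_KVM_MEM_PRIVATE;
        m.guest_memfd = (uint32_t)rb->guest_memfd;
        m.guest_memfd_offset = 0;
    }
    return m;
}
```

In real code the filled structure would be passed to the
KVM_SET_USER_MEMORY_REGION2 ioctl; the sketch stops before that step so it
can be compiled without a KVM-capable host.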

The following patches, 12 to 70, enable TDX support so that QEMU can create
and boot a TD (TDX VM).

The whole series needs to work with KVM guest memfd series[3] and KVM v17
series[4].

This series is also available on GitHub:
https://github.com/intel/qemu-tdx/tree/tdx-qemu-upstream-v3

It's based on several patches that have not been merged yet:
https://lore.kernel.org/qemu-devel/20231007065819.27498-1-xiaoyao.li@intel.com/
https://lore.kernel.org/qemu-devel/20230613131929.720453-1-xiaoyao.li@intel.com/
https://lore.kernel.org/all/20220310122215.804233-1-xiaoyao.li@intel.com/

Luckily, their absence neither blocks applying this series nor
affects its functionality.

Note: the new QAPI introduced in this series is marked 'since 8.2'
because I don't know whether the next version will be 8.3 or 9.0.


[1] https://lore.kernel.org/qemu-devel/20230914035117.3285885-1-xiaoyao.li@intel.com/
[2] https://lore.kernel.org/qemu-devel/20230818095041.1973309-1-xiaoyao.li@intel.com/
[3] https://lore.kernel.org/all/20231105163040.14904-1-pbonzini@redhat.com/
[4] https://lore.kernel.org/all/cover.1699368322.git.isaku.yamahata@intel.com/ 

== Limitation and future work ==
- Readonly memslot

  TDX only supports read-only (write-protected) memslots for shared memory,
  not for private memory. For simplicity, read-only memslots are marked as
  unsupported entirely for TDX.

- CPU model

  We cannot create a TD with an arbitrary CPU model as we can for non-TDX VMs,
  because only a subset of features can be configured for a TD.

  - It's recommended to use '-cpu host' to create TD;
  - '+feature/-feature' might not work as expected;

  Future work: introduce specific CPU models for TDs and enhance ±features
               for TDs.

- gdb support

  Using gdb to debug a TD in off-debug mode is future work.

===
Main changes in v3:
gmem memfd part: 
 - Since KVM side renamed gmem to guest_memfd in the uapi, this version
   renames it accordingly;
 - Drop the 'private' property of memory backend. (see comment[5])
   Now QEMU decides whether to create guest memfd based on the specific
   VM type (or specific VM implementation, please see patch *X* and *Y*);
 - Drop sw_protected_vm implementation;

TDX part:
 - improve the error report in various patches by utilizing 'errp';
 - drop the vm-type interface;
 - rename __tdx_ioctl() to tdx_ioctl_internal();
 - refine the description of 'sept-ve-disable' in qom.json;
 - use base64 for mrconfigid/mrowner/mrownerconfig instead of hex-string;
 - use type SocketAddress for quote-generation-service;

[5] https://lore.kernel.org/qemu-devel/a1e34896-c46d-c87c-0fda-971bbf3dcfbd@redhat.com/

Chao Peng (3):
  kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot
  kvm: handle KVM_EXIT_MEMORY_FAULT
  i386/tdx: register TDVF as private memory

Chenyi Qiang (1):
  i386/tdx: setup a timer for the qio channel

Isaku Yamahata (15):
  trace/kvm: Add trace for page convertion between shared and private
  i386/tdx: Make sept_ve_disable set by default
  i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  kvm/tdx: Don't complain when converting vMMIO region to shared
  kvm/tdx: Ignore memory conversion to shared of unassigned region
  i386/tdvf: Introduce function to parse TDVF metadata
  i386/tdx: Add TDVF memory via KVM_TDX_INIT_MEM_REGION
  i386/tdx: handle TDG.VP.VMCALL<SetupEventNotifyInterrupt>
  i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  i386/tdx: handle TDG.VP.VMCALL<MapGPA> hypercall
  i386/tdx: Limit the range size for MapGPA
  pci-host/q35: Move PAM initialization above SMRAM initialization
  q35: Introduce smm_ranges property for q35-pci-host
  hw/i386: add option to forcibly report edge trigger in acpi tables
  i386/tdx: Don't synchronize guest tsc for TDs

Sean Christopherson (2):
  i386/kvm: Move architectural CPUID leaf generation to separate helper
  i386/tdx: Don't get/put guest state for TDX VMs

Xiaoyao Li (49):
  *** HACK *** linux-headers: Update headers to pull in gmem APIs
  RAMBlock: Add support of KVM private guest memfd
  RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  HostMem: Add mechanism to opt in kvm guest memfd via MachineState
  kvm: Introduce support for memory_attributes
  physmem: Relax the alignment check of host_startaddr in
    ram_block_discard_range()
  physmem: replace function name with __func__ in
    ram_block_discard_range()
  physmem: Introduce ram_block_convert_range() for page conversion
  *** HACK *** linux-headers: Update headers to pull in TDX API changes
  i386: Introduce tdx-guest object
  target/i386: Implement mc->kvm_type() to get VM type
  target/i386: Parse TDX vm type
  target/i386: Introduce kvm_confidential_guest_init()
  i386/tdx: Implement tdx_kvm_init() to initialize TDX VM context
  i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES
  i386/tdx: Introduce is_tdx_vm() helper and cache tdx_guest object
  i386/tdx: Adjust the supported CPUID based on TDX restrictions
  i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by
    tdx_caps.cpuid_config[]
  i386/tdx: Integrate tdx_caps->xfam_fixed0/1 into tdx_cpuid_lookup
  i386/tdx: Integrate tdx_caps->attrs_fixed0/1 to tdx_cpuid_lookup
  kvm: Introduce kvm_arch_pre_create_vcpu()
  i386/tdx: Initialize TDX before creating TD vcpus
  i386/tdx: Add property sept-ve-disable for tdx-guest object
  i386/tdx: Wire CPU features up with attributes of TD guest
  i386/tdx: Validate TD attributes
  i386/tdx: Implement user specified tsc frequency
  i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM
  kvm/memory: Introduce the infrastructure to set the default
    shared/private value
  i386/tdx: Make memory type private by default
  i386/tdx: Parse TDVF metadata for TDX VM
  i386/tdx: Skip BIOS shadowing setup
  i386/tdx: Don't initialize pc.rom for TDX VMs
  i386/tdx: Track mem_ptr for each firmware entry of TDVF
  i386/tdx: Track RAM entries for TDX VM
  headers: Add definitions from UEFI spec for volumes, resources, etc...
  i386/tdx: Setup the TD HOB list
  memory: Introduce memory_region_init_ram_guest_memfd()
  i386/tdx: Call KVM_TDX_INIT_VCPU to initialize TDX vcpu
  i386/tdx: Finalize TDX VM
  i386/tdx: Handle TDG.VP.VMCALL<REPORT_FATAL_ERROR>
  i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
  i386/tdx: Disable SMM for TDX VMs
  i386/tdx: Disable PIC for TDX VMs
  i386/tdx: Don't allow system reset for TDX VMs
  i386/tdx: LMCE is not supported for TDX
  hw/i386: add eoi_intercept_unsupported member to X86MachineState
  i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() for TDs
  i386/tdx: Skip kvm_put_apicbase() for TDs
  docs: Add TDX documentation

 accel/kvm/kvm-all.c                        |  255 +++-
 accel/kvm/trace-events                     |    4 +-
 backends/hostmem-file.c                    |    1 +
 backends/hostmem-memfd.c                   |    1 +
 backends/hostmem-ram.c                     |    1 +
 backends/hostmem.c                         |    1 +
 configs/devices/i386-softmmu/default.mak   |    1 +
 docs/system/confidential-guest-support.rst |    1 +
 docs/system/i386/tdx.rst                   |  113 ++
 docs/system/target-i386.rst                |    1 +
 hw/core/machine.c                          |    5 +
 hw/i386/Kconfig                            |    6 +
 hw/i386/acpi-build.c                       |   99 +-
 hw/i386/acpi-common.c                      |   50 +-
 hw/i386/meson.build                        |    1 +
 hw/i386/pc.c                               |   21 +-
 hw/i386/pc_q35.c                           |    2 +
 hw/i386/pc_sysfw.c                         |    7 +
 hw/i386/tdvf-hob.c                         |  147 ++
 hw/i386/tdvf-hob.h                         |   24 +
 hw/i386/tdvf.c                             |  200 +++
 hw/i386/x86.c                              |   51 +-
 hw/pci-host/q35.c                          |   61 +-
 include/exec/cpu-common.h                  |    2 +
 include/exec/memory.h                      |   26 +
 include/exec/ramblock.h                    |    1 +
 include/hw/boards.h                        |    2 +
 include/hw/i386/pc.h                       |    1 +
 include/hw/i386/tdvf.h                     |   58 +
 include/hw/i386/x86.h                      |    2 +
 include/hw/pci-host/q35.h                  |    1 +
 include/standard-headers/uefi/uefi.h       |  198 +++
 include/sysemu/hostmem.h                   |    1 +
 include/sysemu/kvm.h                       |    8 +
 include/sysemu/kvm_int.h                   |    2 +
 linux-headers/asm-x86/kvm.h                |   94 ++
 linux-headers/linux/kvm.h                  |  140 ++
 qapi/qom.json                              |   29 +
 qapi/run-state.json                        |   27 +-
 system/memory.c                            |   45 +
 system/physmem.c                           |  151 +-
 system/runstate.c                          |   54 +
 target/i386/cpu-internal.h                 |    9 +
 target/i386/cpu.c                          |   12 -
 target/i386/cpu.h                          |   21 +
 target/i386/kvm/kvm-cpu.c                  |    5 +
 target/i386/kvm/kvm.c                      |  608 ++++----
 target/i386/kvm/kvm_i386.h                 |    6 +
 target/i386/kvm/meson.build                |    2 +
 target/i386/kvm/tdx-stub.c                 |   23 +
 target/i386/kvm/tdx.c                      | 1612 ++++++++++++++++++++
 target/i386/kvm/tdx.h                      |   72 +
 target/i386/sev.c                          |    1 -
 target/i386/sev.h                          |    2 +
 54 files changed, 3861 insertions(+), 407 deletions(-)
 create mode 100644 docs/system/i386/tdx.rst
 create mode 100644 hw/i386/tdvf-hob.c
 create mode 100644 hw/i386/tdvf-hob.h
 create mode 100644 hw/i386/tdvf.c
 create mode 100644 include/hw/i386/tdvf.h
 create mode 100644 include/standard-headers/uefi/uefi.h
 create mode 100644 target/i386/kvm/tdx-stub.c
 create mode 100644 target/i386/kvm/tdx.c
 create mode 100644 target/i386/kvm/tdx.h

-- 
2.34.1



* [PATCH v3 01/70] *** HACK *** linux-headers: Update headers to pull in gmem APIs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd Xiaoyao Li
                   ` (68 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

This patch needs to be updated by script

	scripts/update-linux-headers.sh

once gmem fd support is upstreamed in Linux kernel.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 linux-headers/asm-x86/kvm.h |  3 +++
 linux-headers/linux/kvm.h   | 51 +++++++++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 2b3a8f7bd2c0..003fb745347c 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -560,4 +560,7 @@ struct kvm_pmu_event_filter {
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	BIT(0)
 
+#define KVM_X86_DEFAULT_VM	0
+#define KVM_X86_SW_PROTECTED_VM	1
+
 #endif /* _ASM_X86_KVM_H */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 0d74ee999aa9..3fc87a845ee3 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -95,6 +95,19 @@ struct kvm_userspace_memory_region {
 	__u64 userspace_addr; /* start of the userspace allocated memory */
 };
 
+/* for KVM_SET_USER_MEMORY_REGION2 */
+struct kvm_userspace_memory_region2 {
+	__u32 slot;
+	__u32 flags;
+	__u64 guest_phys_addr;
+	__u64 memory_size;
+	__u64 userspace_addr;
+	__u64 guest_memfd_offset;
+	__u32 guest_memfd;
+	__u32 pad1;
+	__u64 pad2[14];
+};
+
 /*
  * The bit 0 ~ bit 15 of kvm_userspace_memory_region::flags are visible for
  * userspace, other bits are reserved for kvm internal use which are defined
@@ -102,6 +115,7 @@ struct kvm_userspace_memory_region {
  */
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
+#define KVM_MEM_PRIVATE		(1UL << 2)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -264,6 +278,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_SBI        35
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
+#define KVM_EXIT_MEMORY_FAULT     39
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -506,6 +521,13 @@ struct kvm_run {
 #define KVM_NOTIFY_CONTEXT_INVALID	(1 << 0)
 			__u32 flags;
 		} notify;
+		/* KVM_EXIT_MEMORY_FAULT */
+		struct {
+#define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
+			__u64 flags;
+			__u64 gpa;
+			__u64 size;
+		} memory_fault;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
@@ -1188,6 +1210,11 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COUNTER_OFFSET 227
 #define KVM_CAP_ARM_EAGER_SPLIT_CHUNK_SIZE 228
 #define KVM_CAP_ARM_SUPPORTED_BLOCK_SIZES 229
+#define KVM_CAP_USER_MEMORY2 231
+#define KVM_CAP_MEMORY_FAULT_INFO 232
+#define KVM_CAP_MEMORY_ATTRIBUTES 233
+#define KVM_CAP_GUEST_MEMFD 234
+#define KVM_CAP_VM_TYPES 235
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1469,6 +1496,8 @@ struct kvm_vfio_spapr_tce {
 					struct kvm_userspace_memory_region)
 #define KVM_SET_TSS_ADDR          _IO(KVMIO,   0x47)
 #define KVM_SET_IDENTITY_MAP_ADDR _IOW(KVMIO,  0x48, __u64)
+#define KVM_SET_USER_MEMORY_REGION2 _IOW(KVMIO, 0x49, \
+					 struct kvm_userspace_memory_region2)
 
 /* enable ucontrol for s390 */
 struct kvm_s390_ucas_mapping {
@@ -2252,4 +2281,26 @@ struct kvm_s390_zpci_op {
 /* flags for kvm_s390_zpci_op->u.reg_aen.flags */
 #define KVM_S390_ZPCIOP_REGAEN_HOST    (1 << 0)
 
+/* Available with KVM_CAP_MEMORY_ATTRIBUTES */
+#define KVM_SET_MEMORY_ATTRIBUTES              _IOW(KVMIO,  0xd2, struct kvm_memory_attributes)
+
+struct kvm_memory_attributes {
+	__u64 address;
+	__u64 size;
+	__u64 attributes;
+	__u64 flags;
+};
+
+#define KVM_MEMORY_ATTRIBUTE_PRIVATE           (1ULL << 3)
+
+#define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
+
+struct kvm_create_guest_memfd {
+	__u64 size;
+	__u64 flags;
+	__u64 reserved[6];
+};
+
+#define KVM_GUEST_MEMFD_ALLOW_HUGEPAGE		(1ULL << 0)
+
 #endif /* __LINUX_KVM_H */
-- 
2.34.1



* [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 01/70] *** HACK *** linux-headers: Update headers to pull in gmem APIs Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 10:20   ` Daniel P. Berrangé
                     ` (3 more replies)
  2023-11-15  7:14 ` [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE Xiaoyao Li
                   ` (67 subsequent siblings)
  69 siblings, 4 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Add KVM guest_memfd support to RAMBlock so that both normal HVA-based memory
and KVM guest_memfd-based private memory can be associated with one RAMBlock.

Introduce a new flag, RAM_GUEST_MEMFD. When it is set, QEMU calls the KVM
ioctl to create a private guest_memfd during RAMBlock setup.

Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
confidential guests, such as TDX VM. How and when to set it for memory
backends will be implemented in the following patches.

Introduce memory_region_has_guest_memfd() to query if the MemoryRegion has
KVM guest_memfd allocated.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
- rename gmem to guest_memfd;
- close(guest_memfd) when RAMBlock is released; (Daniel P. Berrangé)
- Squash the patch that introduces memory_region_has_guest_memfd().
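
The descriptor lifecycle described above (initialise to -1, create on demand
when the flag is set, close when the block is released) can be sketched in
isolation. This is only an illustration under stated assumptions:
block_sketch and every sketch_* or block_* function below are hypothetical
names, and sketch_create_guest_memfd() is a fake that returns a fixed number
instead of calling kvm_create_guest_memfd().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RAM_GUEST_MEMFD (1u << 12)   /* RAM_GUEST_MEMFD in this patch */

/* Hypothetical, trimmed-down stand-in for QEMU's RAMBlock. */
struct block_sketch {
    uint32_t flags;
    uint64_t max_length;
    int guest_memfd;      /* -1 means "no guest_memfd allocated" */
};

int sketch_closed_count;  /* how many descriptors the fake close() has seen */

/* Fake replacement for kvm_create_guest_memfd(): no ioctl is issued. */
int sketch_create_guest_memfd(uint64_t size, uint64_t flags)
{
    (void)size;
    (void)flags;
    return 42;
}

/* Fake replacement for close(): it only counts calls. */
void sketch_close(int fd)
{
    (void)fd;
    sketch_closed_count++;
}

void block_init(struct block_sketch *b, uint32_t flags, uint64_t len)
{
    assert(b != NULL);
    b->flags = flags;
    b->max_length = len;
    b->guest_memfd = -1;  /* must not default to 0, which is a valid fd */
}

/* Mirrors the new branch in ram_block_add(): create only when requested. */
int block_add(struct block_sketch *b)
{
    if ((b->flags & SKETCH_RAM_GUEST_MEMFD) && b->guest_memfd < 0) {
        b->guest_memfd = sketch_create_guest_memfd(b->max_length, 0);
        if (b->guest_memfd < 0) {
            return -1;
        }
    }
    return 0;
}

/* Same test as memory_region_has_guest_memfd(), on the sketch type. */
bool block_has_guest_memfd(const struct block_sketch *b)
{
    return b != NULL && b->guest_memfd >= 0;
}

/* Mirrors the new branch in reclaim_ramblock(). */
void block_reclaim(struct block_sketch *b)
{
    if (b->guest_memfd >= 0) {
        sketch_close(b->guest_memfd);
        b->guest_memfd = -1;
    }
}
```

Using -1 as the "absent" value is what lets the query helper and the release
path share one check (guest_memfd >= 0) without a separate boolean.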
---
 accel/kvm/kvm-all.c     | 24 ++++++++++++++++++++++++
 include/exec/memory.h   | 13 +++++++++++++
 include/exec/ramblock.h |  1 +
 include/sysemu/kvm.h    |  2 ++
 system/memory.c         |  5 +++++
 system/physmem.c        | 27 ++++++++++++++++++++++++---
 6 files changed, 69 insertions(+), 3 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index c1b40e873531..9f751d4971f8 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -101,6 +101,7 @@ bool kvm_msi_use_devid;
 bool kvm_has_guest_debug;
 static int kvm_sstep_flags;
 static bool kvm_immediate_exit;
+static bool kvm_guest_memfd_supported;
 static hwaddr kvm_max_slot_size = ~0;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
@@ -2397,6 +2398,8 @@ static int kvm_init(MachineState *ms)
     }
     s->as = g_new0(struct KVMAs, s->nr_as);
 
+    kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
+
     if (object_property_find(OBJECT(current_machine), "kvm-type")) {
         g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
                                                             "kvm-type",
@@ -4078,3 +4081,24 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
         query_stats_schema_vcpu(first_cpu, &stats_args);
     }
 }
+
+int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
+{
+    int fd;
+    struct kvm_create_guest_memfd guest_memfd = {
+        .size = size,
+        .flags = flags,
+    };
+
+    if (!kvm_guest_memfd_supported) {
+        error_setg(errp, "KVM doesn't support guest memfd");
+        return -EOPNOTSUPP;
+    }
+
+    fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
+    if (fd < 0) {
+        error_setg_errno(errp, errno, "%s: error creating kvm guest memfd", __func__);
+    }
+
+    return fd;
+}
diff --git a/include/exec/memory.h b/include/exec/memory.h
index 831f7c996d9d..f780367ab1bd 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -243,6 +243,9 @@ typedef struct IOMMUTLBEvent {
 /* RAM FD is opened read-only */
 #define RAM_READONLY_FD (1 << 11)
 
+/* RAM can be private, backed by a KVM guest_memfd */
+#define RAM_GUEST_MEMFD   (1 << 12)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -1702,6 +1705,16 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
  */
 bool memory_region_is_protected(MemoryRegion *mr);
 
+/**
+ * memory_region_has_guest_memfd: check whether a memory region has guest_memfd
+ *     associated
+ *
+ * Returns %true if a memory region's ram_block has valid guest_memfd assigned.
+ *
+ * @mr: the memory region being queried
+ */
+bool memory_region_has_guest_memfd(MemoryRegion *mr);
+
 /**
  * memory_region_get_iommu: check whether a memory region is an iommu
  *
diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
index 69c6a5390293..0a17ba882729 100644
--- a/include/exec/ramblock.h
+++ b/include/exec/ramblock.h
@@ -41,6 +41,7 @@ struct RAMBlock {
     QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
     int fd;
     uint64_t fd_offset;
+    int guest_memfd;
     size_t page_size;
     /* dirty bitmap used during migration */
     unsigned long *bmap;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index d61487816421..fedc28c7d17f 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -538,4 +538,6 @@ bool kvm_arch_cpu_check_are_resettable(void);
 bool kvm_dirty_ring_enabled(void);
 
 uint32_t kvm_dirty_ring_size(void);
+
+int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
 #endif
diff --git a/system/memory.c b/system/memory.c
index 304fa843ea12..69741d91bbb7 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1862,6 +1862,11 @@ bool memory_region_is_protected(MemoryRegion *mr)
     return mr->ram && (mr->ram_block->flags & RAM_PROTECTED);
 }
 
+bool memory_region_has_guest_memfd(MemoryRegion *mr)
+{
+    return mr->ram_block && mr->ram_block->guest_memfd >= 0;
+}
+
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
diff --git a/system/physmem.c b/system/physmem.c
index fc2b0fee0188..0af2213cbd9c 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
         }
     }
 
+#ifdef CONFIG_KVM
+    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
+        new_block->guest_memfd < 0) {
+        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
+        uint64_t flags = 0;
+        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
+                                                        flags, errp);
+        if (new_block->guest_memfd < 0) {
+            qemu_mutex_unlock_ramlist();
+            return;
+        }
+    }
+#endif
+
     new_ram_size = MAX(old_ram_size,
               (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
     if (new_ram_size > old_ram_size) {
@@ -1903,7 +1917,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     /* Just support these ram flags by now. */
     assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
                           RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
-                          RAM_READONLY_FD)) == 0);
+                          RAM_READONLY_FD | RAM_GUEST_MEMFD)) == 0);
 
     if (xen_enabled()) {
         error_setg(errp, "-mem-path not supported with Xen");
@@ -1938,6 +1952,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
     new_block->used_length = size;
     new_block->max_length = size;
     new_block->flags = ram_flags;
+    new_block->guest_memfd = -1;
     new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
                                      errp);
     if (!new_block->host) {
@@ -2016,7 +2031,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     Error *local_err = NULL;
 
     assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
-                          RAM_NORESERVE)) == 0);
+                          RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
     assert(!host ^ (ram_flags & RAM_PREALLOC));
 
     size = HOST_PAGE_ALIGN(size);
@@ -2028,6 +2043,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
     new_block->max_length = max_size;
     assert(max_size >= size);
     new_block->fd = -1;
+    new_block->guest_memfd = -1;
     new_block->page_size = qemu_real_host_page_size();
     new_block->host = host;
     new_block->flags = ram_flags;
@@ -2050,7 +2066,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
 RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
                          MemoryRegion *mr, Error **errp)
 {
-    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
+    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
     return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
 }
 
@@ -2078,6 +2094,11 @@ static void reclaim_ramblock(RAMBlock *block)
     } else {
         qemu_anon_ram_free(block->host, block->max_length);
     }
+
+    if (block->guest_memfd >= 0) {
+        close(block->guest_memfd);
+    }
+
     g_free(block);
 }
 
-- 
2.34.1



* [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 01/70] *** HACK *** linux-headers: Update headers to pull in gmem APIs Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 18:10   ` David Hildenbrand
  2023-11-15  7:14 ` [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState Xiaoyao Li
                   ` (66 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

KVM allows KVM_GUEST_MEMFD_ALLOW_HUGEPAGE for guest memfd. When the
flag is set, KVM first tries to allocate memory with transparent hugepages
and falls back to non-hugepage allocation on failure.

However, KVM imposes one restriction: the size must be aligned to the
hugepage size when KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
v3:
 - New one in v3.
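
The alignment rule above reduces to a small decision, sketched here without
any sysfs or KVM access. The names are hypothetical and chosen for this
example only: sanitize_thp_size() models the fallback to 2 MiB that
get_thp_size() applies to an unusable value, and guest_memfd_flags() models
the QEMU_IS_ALIGNED() test in ram_block_add().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* KVM_GUEST_MEMFD_ALLOW_HUGEPAGE as defined in patch 01 of this series. */
#define SKETCH_ALLOW_HUGEPAGE   (1ULL << 0)
/* Same default as DEFAULT_PMD_SIZE in this patch: 2 MiB. */
#define SKETCH_DEFAULT_PMD_SIZE (1ULL << 21)

bool sketch_is_pow2(uint64_t v)
{
    return v != 0 && (v & (v - 1)) == 0;
}

/* Fall back to 2 MiB when the reported size is zero or not a power of two. */
uint64_t sanitize_thp_size(uint64_t reported)
{
    return sketch_is_pow2(reported) ? reported : SKETCH_DEFAULT_PMD_SIZE;
}

/* Request hugepages only when the size is a multiple of the THP size. */
uint64_t guest_memfd_flags(uint64_t size, uint64_t thp_size)
{
    assert(sketch_is_pow2(thp_size));
    return (size & (thp_size - 1)) == 0 ? SKETCH_ALLOW_HUGEPAGE : 0;
}
```

For example, with a 2 MiB THP size a 4 MiB block gets the flag, while a block
of 4 MiB plus one 4 KiB page does not and is created with flags of 0.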
---
 system/physmem.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 0af2213cbd9c..c56b17e44df6 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -1803,6 +1803,40 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
     }
 }
 
+#ifdef CONFIG_KVM
+#define HPAGE_PMD_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
+#define DEFAULT_PMD_SIZE (1ul << 21)
+
+static uint32_t get_thp_size(void)
+{
+    gchar *content = NULL;
+    const char *endptr;
+    static uint64_t thp_size = 0;
+    uint64_t tmp;
+
+    if (thp_size != 0) {
+        return thp_size;
+    }
+
+    if (g_file_get_contents(HPAGE_PMD_SIZE_PATH, &content, NULL, NULL) &&
+        !qemu_strtou64(content, &endptr, 0, &tmp) &&
+        (!endptr || *endptr == '\n')) {
+        /* Sanity-check the value and fall back to something reasonable. */
+        if (!tmp || !is_power_of_2(tmp)) {
+            warn_report("Read unsupported THP size: %" PRIx64, tmp);
+        } else {
+            thp_size = tmp;
+        }
+    }
+
+    if (!thp_size) {
+        thp_size = DEFAULT_PMD_SIZE;
+    }
+
+    return thp_size;
+}
+#endif
+
 static void ram_block_add(RAMBlock *new_block, Error **errp)
 {
     const bool noreserve = qemu_ram_is_noreserve(new_block);
@@ -1844,8 +1878,8 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
 #ifdef CONFIG_KVM
     if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
         new_block->guest_memfd < 0) {
-        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
-        uint64_t flags = 0;
+        uint64_t flags = QEMU_IS_ALIGNED(new_block->max_length, get_thp_size()) ?
+                         KVM_GUEST_MEMFD_ALLOW_HUGEPAGE : 0;
         new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
                                                         flags, errp);
         if (new_block->guest_memfd < 0) {
-- 
2.34.1



* [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (2 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 18:14   ` David Hildenbrand
  2023-11-15  7:14 ` [PATCH v3 05/70] kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot Xiaoyao Li
                   ` (65 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Add a new member "require_guest_memfd" to memory backends. When it is set
to true, it enables RAM_GUEST_MEMFD in ram_flags, so a private KVM
guest_memfd will be allocated during RAMBlock allocation.

The memory backend's @require_guest_memfd is wired to the
@require_guest_memfd field of MachineState.
MachineState::require_guest_memfd is supposed to be set by any VM that
requires KVM guest memfd as private memory, e.g., a TDX VM.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
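The wiring described in the commit message (machine setting copied into each
backend at init time, then translated into a RAM flag at allocation time) can
be sketched as follows. This is a simplified model, not QEMU code: the
machine_sketch and backend_sketch types are hypothetical, and only the
RAM_GUEST_MEMFD bit value is taken from this series; the other two bit values
are placeholders picked for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RAM_SHARED      (1u << 1)    /* placeholder value */
#define SKETCH_RAM_NORESERVE   (1u << 7)    /* placeholder value */
#define SKETCH_RAM_GUEST_MEMFD (1u << 12)   /* RAM_GUEST_MEMFD in patch 02 */

/* Hypothetical stand-ins for MachineState and HostMemoryBackend. */
struct machine_sketch {
    bool require_guest_memfd;
};

struct backend_sketch {
    bool share;
    bool reserve;
    bool require_guest_memfd;
};

/* Models host_memory_backend_init(): inherit the machine-wide setting. */
void backend_init(struct backend_sketch *b, const struct machine_sketch *m)
{
    assert(b != NULL && m != NULL);
    b->share = false;
    b->reserve = true;
    b->require_guest_memfd = m->require_guest_memfd;
}

/* Models the ram_flags computation shared by the *_backend_memory_alloc(). */
uint32_t backend_ram_flags(const struct backend_sketch *b)
{
    uint32_t ram_flags = b->share ? SKETCH_RAM_SHARED : 0;

    ram_flags |= b->reserve ? 0 : SKETCH_RAM_NORESERVE;
    ram_flags |= b->require_guest_memfd ? SKETCH_RAM_GUEST_MEMFD : 0;
    return ram_flags;
}
```

Because the backend copies the value once at init time, a machine that needs
private memory has to set its field before any memory backend object is
created.
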
 backends/hostmem-file.c  | 1 +
 backends/hostmem-memfd.c | 1 +
 backends/hostmem-ram.c   | 1 +
 backends/hostmem.c       | 1 +
 hw/core/machine.c        | 5 +++++
 include/hw/boards.h      | 2 ++
 include/sysemu/hostmem.h | 1 +
 7 files changed, 12 insertions(+)

diff --git a/backends/hostmem-file.c b/backends/hostmem-file.c
index 361d4a8103ef..d5ea2879f321 100644
--- a/backends/hostmem-file.c
+++ b/backends/hostmem-file.c
@@ -84,6 +84,7 @@ file_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     ram_flags |= fb->readonly ? RAM_READONLY_FD : 0;
     ram_flags |= fb->rom == ON_OFF_AUTO_ON ? RAM_READONLY : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= backend->require_guest_memfd ? RAM_GUEST_MEMFD : 0;
     ram_flags |= fb->is_pmem ? RAM_PMEM : 0;
     ram_flags |= RAM_NAMED_FILE;
     memory_region_init_ram_from_file(&backend->mr, OBJECT(backend), name,
diff --git a/backends/hostmem-memfd.c b/backends/hostmem-memfd.c
index 3fc85c3db81b..011e2311f088 100644
--- a/backends/hostmem-memfd.c
+++ b/backends/hostmem-memfd.c
@@ -55,6 +55,7 @@ memfd_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= backend->require_guest_memfd ? RAM_GUEST_MEMFD : 0;
     memory_region_init_ram_from_fd(&backend->mr, OBJECT(backend), name,
                                    backend->size, ram_flags, fd, 0, errp);
     g_free(name);
diff --git a/backends/hostmem-ram.c b/backends/hostmem-ram.c
index b8e55cdbd0f8..7d2e1327f8c8 100644
--- a/backends/hostmem-ram.c
+++ b/backends/hostmem-ram.c
@@ -30,6 +30,7 @@ ram_backend_memory_alloc(HostMemoryBackend *backend, Error **errp)
     name = host_memory_backend_get_name(backend);
     ram_flags = backend->share ? RAM_SHARED : 0;
     ram_flags |= backend->reserve ? 0 : RAM_NORESERVE;
+    ram_flags |= backend->require_guest_memfd ? RAM_GUEST_MEMFD : 0;
     memory_region_init_ram_flags_nomigrate(&backend->mr, OBJECT(backend), name,
                                            backend->size, ram_flags, errp);
     g_free(name);
diff --git a/backends/hostmem.c b/backends/hostmem.c
index 747e7838c031..2deb2b78bcb8 100644
--- a/backends/hostmem.c
+++ b/backends/hostmem.c
@@ -279,6 +279,7 @@ static void host_memory_backend_init(Object *obj)
     /* TODO: convert access to globals to compat properties */
     backend->merge = machine_mem_merge(machine);
     backend->dump = machine_dump_guest_core(machine);
+    backend->require_guest_memfd = machine_require_guest_memfd(machine);
     backend->reserve = true;
     backend->prealloc_threads = machine->smp.cpus;
 }
diff --git a/hw/core/machine.c b/hw/core/machine.c
index 0c1739814124..b1b0a46ea52f 100644
--- a/hw/core/machine.c
+++ b/hw/core/machine.c
@@ -1189,6 +1189,11 @@ bool machine_mem_merge(MachineState *machine)
     return machine->mem_merge;
 }
 
+bool machine_require_guest_memfd(MachineState *machine)
+{
+    return machine->require_guest_memfd;
+}
+
 static char *cpu_slot_to_string(const CPUArchId *cpu)
 {
     GString *s = g_string_new(NULL);
diff --git a/include/hw/boards.h b/include/hw/boards.h
index a7359992980a..227aded07209 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -30,6 +30,7 @@ bool machine_usb(MachineState *machine);
 int machine_phandle_start(MachineState *machine);
 bool machine_dump_guest_core(MachineState *machine);
 bool machine_mem_merge(MachineState *machine);
+bool machine_require_guest_memfd(MachineState *machine);
 HotpluggableCPUList *machine_query_hotpluggable_cpus(MachineState *machine);
 void machine_set_cpu_numa_node(MachineState *machine,
                                const CpuInstanceProperties *props,
@@ -364,6 +365,7 @@ struct MachineState {
     char *dt_compatible;
     bool dump_guest_core;
     bool mem_merge;
+    bool require_guest_memfd;
     bool usb;
     bool usb_disabled;
     char *firmware;
diff --git a/include/sysemu/hostmem.h b/include/sysemu/hostmem.h
index 39326f1d4f9c..92f7fd469639 100644
--- a/include/sysemu/hostmem.h
+++ b/include/sysemu/hostmem.h
@@ -66,6 +66,7 @@ struct HostMemoryBackend {
     uint64_t size;
     bool merge, dump, use_canonical_path;
     bool prealloc, is_mapped, share, reserve;
+    bool require_guest_memfd;
     uint32_t prealloc_threads;
     ThreadContext *prealloc_context;
     DECLARE_BITMAP(host_nodes, MAX_NODES + 1);
-- 
2.34.1


* [PATCH v3 05/70] kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (3 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-17 20:50   ` Isaku Yamahata
  2023-11-15  7:14 ` [PATCH v3 06/70] kvm: Introduce support for memory_attributes Xiaoyao Li
                   ` (64 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Chao Peng <chao.p.peng@linux.intel.com>

Switch to KVM_SET_USER_MEMORY_REGION2 when supported by KVM.

With KVM_SET_USER_MEMORY_REGION2, QEMU can set up a memory region that
is backed by both hva-based shared memory and guest memfd based private
memory.
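
The selection logic can be sketched as below. This is only an
illustration: the sketch_* names are hypothetical, and the enum values
stand in for the real ioctl requests.

```c
#include <stdbool.h>

enum sketch_request {
    SKETCH_SET_REGION,   /* stands in for KVM_SET_USER_MEMORY_REGION */
    SKETCH_SET_REGION2,  /* stands in for KVM_SET_USER_MEMORY_REGION2 */
    SKETCH_UNSUPPORTED,  /* guest memfd requested without REGION2 support */
};

/*
 * Prefer the new ioctl whenever KVM advertises KVM_CAP_USER_MEMORY2.
 * A slot that carries a guest memfd (fd >= 0) cannot fall back to the
 * old ioctl, because the old struct has no field for the fd.
 */
static enum sketch_request sketch_pick_request(bool cap_user_memory2,
                                               int guest_memfd)
{
    if (!cap_user_memory2 && guest_memfd >= 0) {
        return SKETCH_UNSUPPORTED;
    }
    return cap_user_memory2 ? SKETCH_SET_REGION2 : SKETCH_SET_REGION;
}
```

In the patch itself the unsupported case is fatal (error_report()
followed by exit(1)) rather than a return value.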

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c      | 56 ++++++++++++++++++++++++++++++++++------
 accel/kvm/trace-events   |  2 +-
 include/sysemu/kvm_int.h |  2 ++
 3 files changed, 51 insertions(+), 9 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 9f751d4971f8..69afeb47c9c0 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -293,35 +293,69 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
 static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
 {
     KVMState *s = kvm_state;
-    struct kvm_userspace_memory_region mem;
+    struct kvm_userspace_memory_region2 mem;
+    static int cap_user_memory2 = -1;
     int ret;
 
+    if (cap_user_memory2 == -1) {
+        cap_user_memory2 = kvm_check_extension(s, KVM_CAP_USER_MEMORY2);
+    }
+
+    if (!cap_user_memory2 && slot->guest_memfd >= 0) {
+        error_report("%s, KVM doesn't support KVM_CAP_USER_MEMORY2,"
+                     " which is required by guest memfd!", __func__);
+        exit(1);
+    }
+
     mem.slot = slot->slot | (kml->as_id << 16);
     mem.guest_phys_addr = slot->start_addr;
     mem.userspace_addr = (unsigned long)slot->ram;
     mem.flags = slot->flags;
+    mem.guest_memfd = slot->guest_memfd;
+    mem.guest_memfd_offset = slot->guest_memfd_offset;
 
     if (slot->memory_size && !new && (mem.flags ^ slot->old_flags) & KVM_MEM_READONLY) {
         /* Set the slot size to 0 before setting the slot to the desired
          * value. This is needed based on KVM commit 75d61fbc. */
         mem.memory_size = 0;
-        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
+
+        if (cap_user_memory2) {
+            ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION2, &mem);
+        } else {
+            ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
+        }
         if (ret < 0) {
             goto err;
         }
     }
     mem.memory_size = slot->memory_size;
-    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
+    if (cap_user_memory2) {
+        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION2, &mem);
+    } else {
+        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
+    }
     slot->old_flags = mem.flags;
 err:
     trace_kvm_set_user_memory(mem.slot >> 16, (uint16_t)mem.slot, mem.flags,
                               mem.guest_phys_addr, mem.memory_size,
-                              mem.userspace_addr, ret);
+                              mem.userspace_addr, mem.guest_memfd,
+                              mem.guest_memfd_offset, ret);
     if (ret < 0) {
-        error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
-                     " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
-                     __func__, mem.slot, slot->start_addr,
-                     (uint64_t)mem.memory_size, strerror(errno));
+        if (cap_user_memory2) {
+                error_report("%s: KVM_SET_USER_MEMORY_REGION2 failed, slot=%d,"
+                        " start=0x%" PRIx64 ", size=0x%" PRIx64 ","
+                        " flags=0x%" PRIx32 ", guest_memfd=%" PRId32 ","
+                        " guest_memfd_offset=0x%" PRIx64 ": %s",
+                        __func__, mem.slot, slot->start_addr,
+                        (uint64_t)mem.memory_size, mem.flags,
+                        mem.guest_memfd, (uint64_t)mem.guest_memfd_offset,
+                        strerror(errno));
+        } else {
+                error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
+                            " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
+                            __func__, mem.slot, slot->start_addr,
+                            (uint64_t)mem.memory_size, strerror(errno));
+        }
     }
     return ret;
 }
@@ -477,6 +511,9 @@ static int kvm_mem_flags(MemoryRegion *mr)
     if (readonly && kvm_readonly_mem_allowed) {
         flags |= KVM_MEM_READONLY;
     }
+    if (memory_region_has_guest_memfd(mr)) {
+        flags |= KVM_MEM_PRIVATE;
+    }
     return flags;
 }
 
@@ -1364,6 +1401,9 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
         mem->ram_start_offset = ram_start_offset;
         mem->ram = ram;
         mem->flags = kvm_mem_flags(mr);
+        mem->guest_memfd = mr->ram_block->guest_memfd;
+        mem->guest_memfd_offset = (uint8_t*)ram - mr->ram_block->host;
+
         kvm_slot_init_dirty_bitmap(mem);
         err = kvm_set_user_memory_region(kml, mem, true);
         if (err) {
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index 14ebfa1b991c..e6ec2cda6efa 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -15,7 +15,7 @@ kvm_irqchip_update_msi_route(int virq) "Updating MSI route virq=%d"
 kvm_irqchip_release_virq(int virq) "virq %d"
 kvm_set_ioeventfd_mmio(int fd, uint64_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%" PRIx64 " val=0x%x assign: %d size: %d match: %d"
 kvm_set_ioeventfd_pio(int fd, uint16_t addr, uint32_t val, bool assign, uint32_t size, bool datamatch) "fd: %d @0x%x val=0x%x assign: %d size: %d match: %d"
-kvm_set_user_memory(uint16_t as, uint16_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, int ret) "AddrSpace#%d Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " ret=%d"
+kvm_set_user_memory(uint16_t as, uint16_t slot, uint32_t flags, uint64_t guest_phys_addr, uint64_t memory_size, uint64_t userspace_addr, uint32_t fd, uint64_t fd_offset, int ret) "AddrSpace#%d Slot#%d flags=0x%x gpa=0x%"PRIx64 " size=0x%"PRIx64 " ua=0x%"PRIx64 " guest_memfd=%d" " guest_memfd_offset=0x%" PRIx64 " ret=%d"
 kvm_clear_dirty_log(uint32_t slot, uint64_t start, uint32_t size) "slot#%"PRId32" start 0x%"PRIx64" size 0x%"PRIx32
 kvm_resample_fd_notify(int gsi) "gsi %d"
 kvm_dirty_ring_full(int id) "vcpu %d"
diff --git a/include/sysemu/kvm_int.h b/include/sysemu/kvm_int.h
index fd846394be10..58b7aa3fc786 100644
--- a/include/sysemu/kvm_int.h
+++ b/include/sysemu/kvm_int.h
@@ -30,6 +30,8 @@ typedef struct KVMSlot
     int as_id;
     /* Cache of the offset in ram address space */
     ram_addr_t ram_start_offset;
+    int guest_memfd;
+    hwaddr guest_memfd_offset;
 } KVMSlot;
 
 typedef struct KVMMemoryUpdate {
-- 
2.34.1


* [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (4 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 05/70] kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 10:38   ` Daniel P. Berrangé
  2023-12-12 13:56   ` Wang, Wei W
  2023-11-15  7:14 ` [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range() Xiaoyao Li
                   ` (63 subsequent siblings)
  69 siblings, 2 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce the helper functions to set the attributes of a range of
memory to private or shared.

This is necessary to notify KVM of the private/shared attribute of each
GPA range. KVM needs this information to decide whether a GPA is mapped
to hva-based shared memory or to guest_memfd based private memory.
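
A minimal sketch of what the two helpers hand to KVM follows. The
sketch_* names are hypothetical, the struct only imitates the layout of
struct kvm_memory_attributes, and SKETCH_ATTR_PRIVATE is a stand-in for
KVM_MEMORY_ATTRIBUTE_PRIVATE from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for KVM_MEMORY_ATTRIBUTE_PRIVATE; take the real value from
 * the kernel headers. */
#define SKETCH_ATTR_PRIVATE (1ULL << 3)

/* Imitates struct kvm_memory_attributes for illustration only. */
struct sketch_memory_attributes {
    uint64_t address;
    uint64_t size;
    uint64_t attributes;
    uint64_t flags;
};

/*
 * Fill the request for one GPA range. Both directions require that KVM
 * reported the PRIVATE attribute as supported; "shared" is expressed as
 * the absence of the PRIVATE bit, not as a separate attribute.
 */
static int sketch_fill_attrs(uint64_t supported, uint64_t start,
                             uint64_t size, bool to_private,
                             struct sketch_memory_attributes *out)
{
    if (!(supported & SKETCH_ATTR_PRIVATE)) {
        return -1;
    }
    out->address = start;
    out->size = size;
    out->attributes = to_private ? SKETCH_ATTR_PRIVATE : 0;
    out->flags = 0;
    return 0;
}
```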

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
 include/sysemu/kvm.h |  3 +++
 2 files changed, 45 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 69afeb47c9c0..76e2404d54d2 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
 static int kvm_sstep_flags;
 static bool kvm_immediate_exit;
 static bool kvm_guest_memfd_supported;
+static uint64_t kvm_supported_memory_attributes;
 static hwaddr kvm_max_slot_size = ~0;
 
 static const KVMCapabilityInfo kvm_required_capabilites[] = {
@@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
     kvm_max_slot_size = max_slot_size;
 }
 
+static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
+{
+    struct kvm_memory_attributes attrs;
+    int r;
+
+    attrs.attributes = attr;
+    attrs.address = start;
+    attrs.size = size;
+    attrs.flags = 0;
+
+    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
+    if (r) {
+        warn_report("%s: failed to set memory (0x%" HWADDR_PRIx "+0x%" HWADDR_PRIx ") with attr 0x%" PRIx64
+                    " error '%s'", __func__, start, size, attr, strerror(errno));
+    }
+    return r;
+}
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
+{
+    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+        error_report("KVM doesn't support PRIVATE memory attribute");
+        return -EINVAL;
+    }
+
+    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
+}
+
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
+{
+    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
+        error_report("KVM doesn't support PRIVATE memory attribute");
+        return -EINVAL;
+    }
+
+    return kvm_set_memory_attributes(start, size, 0);
+}
+
 /* Called with KVMMemoryListener.slots_lock held */
 static void kvm_set_phys_mem(KVMMemoryListener *kml,
                              MemoryRegionSection *section, bool add)
@@ -2440,6 +2479,9 @@ static int kvm_init(MachineState *ms)
 
     kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
 
+    ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
+    kvm_supported_memory_attributes = ret > 0 ? ret : 0;
+
     if (object_property_find(OBJECT(current_machine), "kvm-type")) {
         g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
                                                             "kvm-type",
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index fedc28c7d17f..0e88958190a4 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -540,4 +540,7 @@ bool kvm_dirty_ring_enabled(void);
 uint32_t kvm_dirty_ring_size(void);
 
 int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
+
+int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
+int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
 #endif
-- 
2.34.1


* [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (5 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 06/70] kvm: Introduce support for memory_attributes Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 18:20   ` David Hildenbrand
  2023-11-15  7:14 ` [PATCH v3 08/70] physmem: replace function name with __func__ " Xiaoyao Li
                   ` (62 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
ram_block_discard_range(), which takes some code from
ram_discard_range(). However, during the code movement, it changed the
alignment check of host_startaddr from qemu_host_page_size to
rb->page_size.

When a ramblock is backed by hugepages, this requires the start address
to be huge page size aligned, which is overkill. For example, TDX's
private-shared page conversion is done at 4KB granularity. A shared page
is discarded when it gets converted to private, and when the shared page
is backed by a hugepage the discard is going to fail on this check.

So change the alignment check back to qemu_host_page_size.
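
As a worked example, consider a block backed by 2MB hugepages and a 4KB
conversion at offset 0x1000. The helper below is a hypothetical
standalone check, not the QEMU macro, and the addresses in the usage
note are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if value is a multiple of align; align must be a power of two. */
static bool sketch_is_aligned(uint64_t value, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (value & (align - 1)) == 0;
}
```

With a host mapping at 0x7f0000200000 the start address is
0x7f0000201000. It is not a multiple of 0x200000 (the hugepage size), so
the old check rejects the discard, but it is a multiple of 0x1000 (the
host page size), so the relaxed check accepts it.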

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
 - Newly added in v3;
---
 system/physmem.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/system/physmem.c b/system/physmem.c
index c56b17e44df6..8a4e42c7cf60 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
 
     uint8_t *host_startaddr = rb->host + start;
 
-    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
+    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
         error_report("ram_block_discard_range: Unaligned start address: %p",
                      host_startaddr);
         goto err;
-- 
2.34.1


* [PATCH v3 08/70] physmem: replace function name with __func__ in ram_block_discard_range()
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (6 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range() Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 18:21   ` David Hildenbrand
  2023-11-15  7:14 ` [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion Xiaoyao Li
                   ` (61 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Use __func__ to avoid hard-coded function name.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 system/physmem.c | 38 +++++++++++++++++---------------------
 1 file changed, 17 insertions(+), 21 deletions(-)

diff --git a/system/physmem.c b/system/physmem.c
index 8a4e42c7cf60..ddfecddefcd6 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3533,16 +3533,15 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
     uint8_t *host_startaddr = rb->host + start;
 
     if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
-        error_report("ram_block_discard_range: Unaligned start address: %p",
-                     host_startaddr);
+        error_report("%s: Unaligned start address: %p",
+                     __func__, host_startaddr);
         goto err;
     }
 
     if ((start + length) <= rb->max_length) {
         bool need_madvise, need_fallocate;
         if (!QEMU_IS_ALIGNED(length, rb->page_size)) {
-            error_report("ram_block_discard_range: Unaligned length: %zx",
-                         length);
+            error_report("%s: Unaligned length: %zx", __func__, length);
             goto err;
         }
 
@@ -3566,8 +3565,8 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
              * proper error message.
              */
             if (rb->flags & RAM_READONLY_FD) {
-                error_report("ram_block_discard_range: Discarding RAM"
-                             " with readonly files is not supported");
+                error_report("%s: Discarding RAM with readonly files is not"
+                             " supported", __func__);
                 goto err;
 
             }
@@ -3582,27 +3581,26 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
              * file.
              */
             if (!qemu_ram_is_shared(rb)) {
-                warn_report_once("ram_block_discard_range: Discarding RAM"
+                warn_report_once("%s: Discarding RAM"
                                  " in private file mappings is possibly"
                                  " dangerous, because it will modify the"
                                  " underlying file and will affect other"
-                                 " users of the file");
+                                 " users of the file", __func__);
             }
 
             ret = fallocate(rb->fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             start, length);
             if (ret) {
                 ret = -errno;
-                error_report("ram_block_discard_range: Failed to fallocate "
-                             "%s:%" PRIx64 " +%zx (%d)",
-                             rb->idstr, start, length, ret);
+                error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
+                             __func__, rb->idstr, start, length, ret);
                 goto err;
             }
 #else
             ret = -ENOSYS;
-            error_report("ram_block_discard_range: fallocate not available/file"
+            error_report("%s: fallocate not available/file"
                          "%s:%" PRIx64 " +%zx (%d)",
-                         rb->idstr, start, length, ret);
+                         __func__, rb->idstr, start, length, ret);
             goto err;
 #endif
         }
@@ -3620,25 +3618,23 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
             }
             if (ret) {
                 ret = -errno;
-                error_report("ram_block_discard_range: Failed to discard range "
+                error_report("%s: Failed to discard range "
                              "%s:%" PRIx64 " +%zx (%d)",
-                             rb->idstr, start, length, ret);
+                             __func__, rb->idstr, start, length, ret);
                 goto err;
             }
 #else
             ret = -ENOSYS;
-            error_report("ram_block_discard_range: MADVISE not available"
-                         "%s:%" PRIx64 " +%zx (%d)",
-                         rb->idstr, start, length, ret);
+            error_report("%s: MADVISE not available %s:%" PRIx64 " +%zx (%d)",
+                         __func__, rb->idstr, start, length, ret);
             goto err;
 #endif
         }
         trace_ram_block_discard_range(rb->idstr, host_startaddr, length,
                                       need_madvise, need_fallocate, ret);
     } else {
-        error_report("ram_block_discard_range: Overrun block '%s' (%" PRIu64
-                     "/%zx/" RAM_ADDR_FMT")",
-                     rb->idstr, start, length, rb->max_length);
+        error_report("%s: Overrun block '%s' (%" PRIu64 "/%zx/" RAM_ADDR_FMT")",
+                     __func__, rb->idstr, start, length, rb->max_length);
     }
 
 err:
-- 
2.34.1


* [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (7 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 08/70] physmem: replace function name with __func__ " Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-17 21:03   ` Isaku Yamahata
  2023-11-15  7:14 ` [PATCH v3 10/70] kvm: handle KVM_EXIT_MEMORY_FAULT Xiaoyao Li
                   ` (60 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

ram_block_convert_range() is used to discard the opposite memory after a
memory conversion, for confidential guests.

When a page is converted from shared to private, the original shared
memory can be discarded via ram_block_discard_range().

When a page is converted from private to shared, the original private
memory is backed by guest_memfd. Introduce
ram_block_discard_guest_memfd_range() for discarding memory in
guest_memfd.
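
The validation and dispatch can be sketched as below. The sketch_* names
are hypothetical and the function only reports which backing store would
be discarded; it performs no madvise() or fallocate() itself.

```c
#include <stdbool.h>
#include <stdint.h>

enum sketch_discard {
    SKETCH_DISCARD_NONE,        /* request rejected */
    SKETCH_DISCARD_SHARED,      /* drop the hva-based shared pages */
    SKETCH_DISCARD_GUEST_MEMFD, /* punch a hole in the guest memfd */
};

/*
 * Same checks as ram_block_convert_range(): the block must have a guest
 * memfd, the range must be page aligned, non-empty and inside the block.
 * After a shared-to-private conversion the shared copy is stale; after a
 * private-to-shared conversion the guest memfd copy is stale.
 */
static enum sketch_discard sketch_convert_target(int guest_memfd,
                                                 uint64_t start,
                                                 uint64_t length,
                                                 uint64_t max_length,
                                                 uint64_t page_size,
                                                 bool shared_to_private)
{
    if (guest_memfd < 0) {
        return SKETCH_DISCARD_NONE;
    }
    if ((start & (page_size - 1)) || (length & (page_size - 1))) {
        return SKETCH_DISCARD_NONE;
    }
    if (!length || start + length > max_length) {
        return SKETCH_DISCARD_NONE;
    }
    return shared_to_private ? SKETCH_DISCARD_SHARED
                             : SKETCH_DISCARD_GUEST_MEMFD;
}
```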

Originally-from: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 include/exec/cpu-common.h |  2 ++
 system/physmem.c          | 50 +++++++++++++++++++++++++++++++++++++++
 2 files changed, 52 insertions(+)

diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 41115d891940..de728a18eef2 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -175,6 +175,8 @@ typedef int (RAMBlockIterFunc)(RAMBlock *rb, void *opaque);
 
 int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
 int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
+int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
+                            bool shared_to_private);
 
 #endif
 
diff --git a/system/physmem.c b/system/physmem.c
index ddfecddefcd6..cd6008fa09ad 100644
--- a/system/physmem.c
+++ b/system/physmem.c
@@ -3641,6 +3641,29 @@ err:
     return ret;
 }
 
+static int ram_block_discard_guest_memfd_range(RAMBlock *rb, uint64_t start,
+                                               size_t length)
+{
+    int ret = -1;
+
+#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
+    ret = fallocate(rb->guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
+                    start, length);
+
+    if (ret) {
+        ret = -errno;
+        error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
+                     __func__, rb->idstr, start, length, ret);
+    }
+#else
+    ret = -ENOSYS;
+    error_report("%s: fallocate not available %s:%" PRIx64 " +%zx (%d)",
+                 __func__, rb->idstr, start, length, ret);
+#endif
+
+    return ret;
+}
+
 bool ramblock_is_pmem(RAMBlock *rb)
 {
     return rb->flags & RAM_PMEM;
@@ -3828,3 +3851,30 @@ bool ram_block_discard_is_required(void)
     return qatomic_read(&ram_block_discard_required_cnt) ||
            qatomic_read(&ram_block_coordinated_discard_required_cnt);
 }
+
+int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
+                            bool shared_to_private)
+{
+    if (!rb || rb->guest_memfd < 0) {
+        return -1;
+    }
+
+    if (!QEMU_IS_ALIGNED(start, qemu_host_page_size) ||
+        !QEMU_IS_ALIGNED(length, qemu_host_page_size)) {
+        return -1;
+    }
+
+    if (!length) {
+        return -1;
+    }
+
+    if (start + length > rb->max_length) {
+        return -1;
+    }
+
+    if (shared_to_private) {
+        return ram_block_discard_range(rb, start, length);
+    } else {
+        return ram_block_discard_guest_memfd_range(rb, start, length);
+    }
+}
-- 
2.34.1


* [PATCH v3 10/70] kvm: handle KVM_EXIT_MEMORY_FAULT
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (8 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 10:42   ` Daniel P. Berrangé
  2023-11-15  7:14 ` [PATCH v3 11/70] trace/kvm: Add trace for page convertion between shared and private Xiaoyao Li
                   ` (59 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Chao Peng <chao.p.peng@linux.intel.com>

Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE is valid in flags when
KVM_EXIT_MEMORY_FAULT happens. It indicates that userspace needs to
convert the memory on the RAMBlock to the desired attribute, i.e.,
private or shared.

Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has a
guest_memfd memory backend.

Note, KVM_RUN returns -EFAULT when it exits with KVM_EXIT_MEMORY_FAULT,
so special handling is added.
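
The flag handling can be sketched as below. The sketch_* names are
hypothetical, and SKETCH_EXIT_FLAG_PRIVATE is a stand-in for
KVM_MEMORY_EXIT_FLAG_PRIVATE from the kernel headers.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for KVM_MEMORY_EXIT_FLAG_PRIVATE; take the real value from
 * the kernel headers. */
#define SKETCH_EXIT_FLAG_PRIVATE (1ULL << 3)

/*
 * Decode memory_fault.flags. Any bit other than PRIVATE is unknown and
 * rejected. Otherwise the PRIVATE bit gives the conversion direction:
 * set means convert to private, clear means convert to shared.
 */
static int sketch_decode_fault(uint64_t flags, bool *to_private)
{
    if (flags & ~SKETCH_EXIT_FLAG_PRIVATE) {
        return -1;
    }
    *to_private = (flags & SKETCH_EXIT_FLAG_PRIVATE) != 0;
    return 0;
}
```

The decoded direction, together with memory_fault.gpa and
memory_fault.size, is what kvm_convert_memory() receives in the hunk
below.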

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c | 76 +++++++++++++++++++++++++++++++++++++++------
 1 file changed, 66 insertions(+), 10 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 76e2404d54d2..58abbcb6926e 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2902,6 +2902,50 @@ static void kvm_eat_signals(CPUState *cpu)
     } while (sigismember(&chkset, SIG_IPI));
 }
 
+static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+{
+    MemoryRegionSection section;
+    ram_addr_t offset;
+    RAMBlock *rb;
+    void *addr;
+    int ret = -1;
+
+    section = memory_region_find(get_system_memory(), start, size);
+    if (!section.mr) {
+        return ret;
+    }
+
+    if (memory_region_has_guest_memfd(section.mr)) {
+        if (to_private) {
+            ret = kvm_set_memory_attributes_private(start, size);
+        } else {
+            ret = kvm_set_memory_attributes_shared(start, size);
+        }
+
+        if (ret) {
+            memory_region_unref(section.mr);
+            return ret;
+        }
+
+        addr = memory_region_get_ram_ptr(section.mr) +
+               section.offset_within_region;
+        rb = qemu_ram_block_from_host(addr, false, &offset);
+        /*
+         * With KVM_SET_MEMORY_ATTRIBUTES by kvm_set_memory_attributes(),
+         * operation on underlying file descriptor is only for releasing
+         * unnecessary pages.
+         */
+        ram_block_convert_range(rb, offset, size, to_private);
+    } else {
+        warn_report("Convert non guest_memfd backed memory region "
+                    "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
+                    start, size, to_private ? "private" : "shared");
+    }
+
+    memory_region_unref(section.mr);
+    return ret;
+}
+
 int kvm_cpu_exec(CPUState *cpu)
 {
     struct kvm_run *run = cpu->kvm_run;
@@ -2969,18 +3013,20 @@ int kvm_cpu_exec(CPUState *cpu)
                 ret = EXCP_INTERRUPT;
                 break;
             }
-            fprintf(stderr, "error: kvm run failed %s\n",
-                    strerror(-run_ret));
+            if (!(run_ret == -EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)) {
+                fprintf(stderr, "error: kvm run failed %s\n",
+                        strerror(-run_ret));
 #ifdef TARGET_PPC
-            if (run_ret == -EBUSY) {
-                fprintf(stderr,
-                        "This is probably because your SMT is enabled.\n"
-                        "VCPU can only run on primary threads with all "
-                        "secondary threads offline.\n");
-            }
+                if (run_ret == -EBUSY) {
+                    fprintf(stderr,
+                            "This is probably because your SMT is enabled.\n"
+                            "VCPU can only run on primary threads with all "
+                            "secondary threads offline.\n");
+                }
 #endif
-            ret = -1;
-            break;
+                ret = -1;
+                break;
+            }
         }
 
         trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
@@ -3067,6 +3113,16 @@ int kvm_cpu_exec(CPUState *cpu)
                 break;
             }
             break;
+        case KVM_EXIT_MEMORY_FAULT:
+            if (run->memory_fault.flags & ~KVM_MEMORY_EXIT_FLAG_PRIVATE) {
+                error_report("KVM_EXIT_MEMORY_FAULT: Unknown flag 0x%" PRIx64,
+                             (uint64_t)run->memory_fault.flags);
+                ret = -1;
+                break;
+            }
+            ret = kvm_convert_memory(run->memory_fault.gpa, run->memory_fault.size,
+                                     run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE);
+            break;
         default:
             DPRINTF("kvm_arch_handle_exit\n");
             ret = kvm_arch_handle_exit(cpu, run);
-- 
2.34.1



* [PATCH v3 11/70] trace/kvm: Add trace for page conversion between shared and private
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (9 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 10/70] kvm: handle KVM_EXIT_MEMORY_FAULT Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 12/70] *** HACK *** linux-headers: Update headers to pull in TDX API changes Xiaoyao Li
                   ` (58 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c    | 1 +
 accel/kvm/trace-events | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 58abbcb6926e..082f31446c97 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2910,6 +2910,7 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     void *addr;
     int ret = -1;
 
+    trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
     section = memory_region_find(get_system_memory(), start, size);
     if (!section.mr) {
         return ret;
diff --git a/accel/kvm/trace-events b/accel/kvm/trace-events
index e6ec2cda6efa..bca51f877b12 100644
--- a/accel/kvm/trace-events
+++ b/accel/kvm/trace-events
@@ -25,4 +25,4 @@ kvm_dirty_ring_reaper(const char *s) "%s"
 kvm_dirty_ring_reap(uint64_t count, int64_t t) "reaped %"PRIu64" pages (took %"PRIi64" us)"
 kvm_dirty_ring_reaper_kick(const char *reason) "%s"
 kvm_dirty_ring_flush(int finished) "%d"
-
+kvm_convert_memory(uint64_t start, uint64_t size, const char *msg) "start 0x%" PRIx64 " size 0x%" PRIx64 " %s"
-- 
2.34.1



* [PATCH v3 12/70] *** HACK *** linux-headers: Update headers to pull in TDX API changes
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (10 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 11/70] trace/kvm: Add trace for page conversion between shared and private Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 13/70] i386: Introduce tdx-guest object Xiaoyao Li
                   ` (57 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Pull in recent TDX updates, which are not backwards compatible.

This is only to make the series runnable. The headers will be regenerated
with the script

	scripts/update-linux-headers.sh

once TDX support is upstreamed in the Linux kernel.
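
As a quick sanity check on the layout this patch adds, the standalone
sketch below mirrors struct kvm_tdx_init_vm and verifies that its fixed
part comes to 8KB, as the comment in the header says. The names
tdx_init_vm_mirror, cpuid2_header and tdx_init_vm_fixed_size() are
hypothetical and exist only for this illustration. The sketch assumes that
the fixed part of struct kvm_cpuid2 is two 32-bit fields (nent and
padding), followed by a flexible array of entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed fixed part of struct kvm_cpuid2; the entries[] array follows it. */
struct cpuid2_header {
    uint32_t nent;
    uint32_t padding;
};

/* Hypothetical mirror of struct kvm_tdx_init_vm from this patch. */
struct tdx_init_vm_mirror {
    uint64_t attributes;
    uint64_t mrconfigid[6];     /* sha384 digest: 48 bytes */
    uint64_t mrowner[6];        /* sha384 digest: 48 bytes */
    uint64_t mrownerconfig[6];  /* sha384 digest: 48 bytes */
    uint64_t reserved[1004];
    struct cpuid2_header cpuid;
};

/* Return the size of the fixed part, checking where cpuid starts. */
static size_t tdx_init_vm_fixed_size(void)
{
    /* 8 + 3 * 48 + 1004 * 8 = 8184 bytes precede the cpuid member. */
    assert(offsetof(struct tdx_init_vm_mirror, cpuid) == 8184);
    return sizeof(struct tdx_init_vm_mirror);
}
```

The remaining 8 bytes are the nent/padding pair, so the CPUID entries
themselves start exactly at offset 8192.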

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 linux-headers/asm-x86/kvm.h | 91 +++++++++++++++++++++++++++++++++++++
 linux-headers/linux/kvm.h   | 89 ++++++++++++++++++++++++++++++++++++
 2 files changed, 180 insertions(+)

diff --git a/linux-headers/asm-x86/kvm.h b/linux-headers/asm-x86/kvm.h
index 003fb745347c..cf708ea9472e 100644
--- a/linux-headers/asm-x86/kvm.h
+++ b/linux-headers/asm-x86/kvm.h
@@ -562,5 +562,96 @@ struct kvm_pmu_event_filter {
 
 #define KVM_X86_DEFAULT_VM	0
 #define KVM_X86_SW_PROTECTED_VM	1
+#define KVM_X86_TDX_VM		2
+#define KVM_X86_SNP_VM		3
+
+/* Trust Domain eXtension sub-ioctl() commands. */
+enum kvm_tdx_cmd_id {
+	KVM_TDX_CAPABILITIES = 0,
+	KVM_TDX_INIT_VM,
+	KVM_TDX_INIT_VCPU,
+	KVM_TDX_INIT_MEM_REGION,
+	KVM_TDX_FINALIZE_VM,
+	KVM_TDX_RELEASE_VM,
+
+	KVM_TDX_CMD_NR_MAX,
+};
+
+struct kvm_tdx_cmd {
+	/* enum kvm_tdx_cmd_id */
+	__u32 id;
+	/* flags for sub-command. If sub-command doesn't use this, set zero. */
+	__u32 flags;
+	/*
+	 * data for each sub-command. An immediate or a pointer to the actual
+	 * data in process virtual address.  If sub-command doesn't use it,
+	 * set zero.
+	 */
+	__u64 data;
+	/*
+	 * Auxiliary error code.  The sub-command may return TDX SEAMCALL
+	 * status code in addition to -Exxx.
+	 * Defined for consistency with struct kvm_sev_cmd.
+	 */
+	__u64 error;
+};
+
+struct kvm_tdx_cpuid_config {
+	__u32 leaf;
+	__u32 sub_leaf;
+	__u32 eax;
+	__u32 ebx;
+	__u32 ecx;
+	__u32 edx;
+};
+
+struct kvm_tdx_capabilities {
+	__u64 attrs_fixed0;
+	__u64 attrs_fixed1;
+	__u64 xfam_fixed0;
+	__u64 xfam_fixed1;
+#define TDX_CAP_GPAW_48	(1 << 0)
+#define TDX_CAP_GPAW_52	(1 << 1)
+	__u32 supported_gpaw;
+	__u32 padding;
+	__u64 reserved[251];
+
+	__u32 nr_cpuid_configs;
+	struct kvm_tdx_cpuid_config cpuid_configs[];
+};
+
+struct kvm_tdx_init_vm {
+	__u64 attributes;
+	__u64 mrconfigid[6];	/* sha384 digest */
+	__u64 mrowner[6];	/* sha384 digest */
+	__u64 mrownerconfig[6];	/* sha384 digest */
+	/*
+	 * For future extensibility to make sizeof(struct kvm_tdx_init_vm) = 8KB.
+	 * This should be enough given sizeof(TD_PARAMS) = 1024.
+	 * 8KB was chosen because
+	 * sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES(=256) = 8KB.
+	 */
+	__u64 reserved[1004];
+
+	/*
+	 * Call KVM_TDX_INIT_VM before vcpu creation, thus before
+	 * KVM_SET_CPUID2.
+	 * This configuration supersedes KVM_SET_CPUID2s for VCPUs because the
+	 * TDX module directly virtualizes those CPUIDs without VMM.  The user
+	 * space VMM, e.g. qemu, should make KVM_SET_CPUID2 consistent with
+	 * those values.  If it doesn't, KVM may have wrong idea of vCPUIDs of
+	 * the guest, and KVM may wrongly emulate CPUIDs or MSRs that the TDX
+	 * module doesn't virtualize.
+	 */
+	struct kvm_cpuid2 cpuid;
+};
+
+#define KVM_TDX_MEASURE_MEMORY_REGION	(1UL << 0)
+
+struct kvm_tdx_init_mem_region {
+	__u64 source_addr;
+	__u64 gpa;
+	__u64 nr_pages;
+};
 
 #endif /* _ASM_X86_KVM_H */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 3fc87a845ee3..9ecd6336ba1c 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -237,6 +237,92 @@ struct kvm_xen_exit {
 	} u;
 };
 
+/* masks for reg_mask to indicate which registers are passed. */
+#define TDX_VMCALL_REG_MASK_RBX	BIT_ULL(2)
+#define TDX_VMCALL_REG_MASK_RDX	BIT_ULL(3)
+#define TDX_VMCALL_REG_MASK_RSI	BIT_ULL(6)
+#define TDX_VMCALL_REG_MASK_RDI	BIT_ULL(7)
+#define TDX_VMCALL_REG_MASK_R8	BIT_ULL(8)
+#define TDX_VMCALL_REG_MASK_R9	BIT_ULL(9)
+#define TDX_VMCALL_REG_MASK_R10	BIT_ULL(10)
+#define TDX_VMCALL_REG_MASK_R11	BIT_ULL(11)
+#define TDX_VMCALL_REG_MASK_R12	BIT_ULL(12)
+#define TDX_VMCALL_REG_MASK_R13	BIT_ULL(13)
+#define TDX_VMCALL_REG_MASK_R14	BIT_ULL(14)
+#define TDX_VMCALL_REG_MASK_R15	BIT_ULL(15)
+
+struct kvm_tdx_exit {
+#define KVM_EXIT_TDX_VMCALL	1
+	__u32 type;
+	__u32 pad;
+
+	union {
+		struct kvm_tdx_vmcall {
+			/*
+			 * RAX(bit 0), RCX(bit 1) and RSP(bit 4) are reserved.
+			 * RAX(bit 0): TDG.VP.VMCALL status code.
+			 * RCX(bit 1): bitmap for used registers.
+			 * RSP(bit 4): the caller stack.
+			 */
+			union {
+				__u64 in_rcx;
+				__u64 reg_mask;
+			};
+
+			/*
+			 * Guest-Host-Communication Interface for TDX spec
+			 * defines the ABI for TDG.VP.VMCALL.
+			 */
+			/* Input parameters: guest -> VMM */
+			union {
+				__u64 in_r10;
+				__u64 type;
+			};
+			union {
+				__u64 in_r11;
+				__u64 subfunction;
+			};
+			/*
+			 * Subfunction specific.
+			 * Registers are used in this order to pass input
+			 * arguments.  r12=arg0, r13=arg1, etc.
+			 */
+			__u64 in_r12;
+			__u64 in_r13;
+			__u64 in_r14;
+			__u64 in_r15;
+			__u64 in_rbx;
+			__u64 in_rdi;
+			__u64 in_rsi;
+			__u64 in_r8;
+			__u64 in_r9;
+			__u64 in_rdx;
+
+			/* Output parameters: VMM -> guest */
+			union {
+				__u64 out_r10;
+				__u64 status_code;
+			};
+			/*
+			 * Subfunction specific.
+			 * Registers are used in this order to output return
+			 * values.  r11=ret0, r12=ret1, etc.
+			 */
+			__u64 out_r11;
+			__u64 out_r12;
+			__u64 out_r13;
+			__u64 out_r14;
+			__u64 out_r15;
+			__u64 out_rbx;
+			__u64 out_rdi;
+			__u64 out_rsi;
+			__u64 out_r8;
+			__u64 out_r9;
+			__u64 out_rdx;
+		} vmcall;
+	} u;
+};
+
 #define KVM_S390_GET_SKEYS_NONE   1
 #define KVM_S390_SKEYS_MAX        1048576
 
@@ -279,6 +365,7 @@ struct kvm_xen_exit {
 #define KVM_EXIT_RISCV_CSR        36
 #define KVM_EXIT_NOTIFY           37
 #define KVM_EXIT_MEMORY_FAULT     39
+#define KVM_EXIT_TDX              40
 
 /* For KVM_EXIT_INTERNAL_ERROR */
 /* Emulate instruction failed. */
@@ -528,6 +615,8 @@ struct kvm_run {
 			__u64 gpa;
 			__u64 size;
 		} memory_fault;
+		/* KVM_EXIT_TDX */
+		struct kvm_tdx_exit tdx;
 		/* Fix the size of the union. */
 		char padding[256];
 	};
-- 
2.34.1



* [PATCH v3 13/70] i386: Introduce tdx-guest object
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (11 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 12/70] *** HACK *** linux-headers: Update headers to pull in TDX API changes Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-12-01 10:52   ` Markus Armbruster
  2023-11-15  7:14 ` [PATCH v3 14/70] target/i386: Implement mc->kvm_type() to get VM type Xiaoyao Li
                   ` (56 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce tdx-guest object which implements the interface of
CONFIDENTIAL_GUEST_SUPPORT, and will be used to create TDX VMs (TDs) by

  qemu -machine ...,confidential-guest-support=tdx0	\
       -object tdx-guest,id=tdx0

It has only one member, 'attributes', which is fixed to 0 and is not
configurable so far.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Acked-by: Markus Armbruster <armbru@redhat.com>
---
changes in v1
- make @attributes not user-settable
---
 configs/devices/i386-softmmu/default.mak |  1 +
 hw/i386/Kconfig                          |  5 +++
 qapi/qom.json                            | 12 +++++++
 target/i386/kvm/meson.build              |  2 ++
 target/i386/kvm/tdx.c                    | 40 ++++++++++++++++++++++++
 target/i386/kvm/tdx.h                    | 19 +++++++++++
 6 files changed, 79 insertions(+)
 create mode 100644 target/i386/kvm/tdx.c
 create mode 100644 target/i386/kvm/tdx.h

diff --git a/configs/devices/i386-softmmu/default.mak b/configs/devices/i386-softmmu/default.mak
index 598c6646dfc0..9b5ec59d65b0 100644
--- a/configs/devices/i386-softmmu/default.mak
+++ b/configs/devices/i386-softmmu/default.mak
@@ -18,6 +18,7 @@
 #CONFIG_QXL=n
 #CONFIG_SEV=n
 #CONFIG_SGA=n
+#CONFIG_TDX=n
 #CONFIG_TEST_DEVICES=n
 #CONFIG_TPM_CRB=n
 #CONFIG_TPM_TIS_ISA=n
diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index 55850791df41..cea78dcb2822 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -10,6 +10,10 @@ config SGX
     bool
     depends on KVM
 
+config TDX
+    bool
+    depends on KVM
+
 config PC
     bool
     imply APPLESMC
@@ -26,6 +30,7 @@ config PC
     imply QXL
     imply SEV
     imply SGX
+    imply TDX
     imply TEST_DEVICES
     imply TPM_CRB
     imply TPM_TIS_ISA
diff --git a/qapi/qom.json b/qapi/qom.json
index c53ef978ff7e..8e08257dac2f 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -878,6 +878,16 @@
             'reduced-phys-bits': 'uint32',
             '*kernel-hashes': 'bool' } }
 
+##
+# @TdxGuestProperties:
+#
+# Properties for tdx-guest objects.
+#
+# Since: 8.2
+##
+{ 'struct': 'TdxGuestProperties',
+  'data': { }}
+
 ##
 # @ThreadContextProperties:
 #
@@ -956,6 +966,7 @@
     'sev-guest',
     'thread-context',
     's390-pv-guest',
+    'tdx-guest',
     'throttle-group',
     'tls-creds-anon',
     'tls-creds-psk',
@@ -1022,6 +1033,7 @@
       'secret_keyring':             { 'type': 'SecretKeyringProperties',
                                       'if': 'CONFIG_SECRET_KEYRING' },
       'sev-guest':                  'SevGuestProperties',
+      'tdx-guest':                  'TdxGuestProperties',
       'thread-context':             'ThreadContextProperties',
       'throttle-group':             'ThrottleGroupProperties',
       'tls-creds-anon':             'TlsCredsAnonProperties',
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 84d9143e6029..6ea0ce27b757 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -9,6 +9,8 @@ i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
 
 i386_kvm_ss.add(when: 'CONFIG_SEV', if_false: files('sev-stub.c'))
 
+i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'))
+
 i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'), if_false: files('hyperv-stub.c'))
 
 i386_system_ss.add_all(when: 'CONFIG_KVM', if_true: i386_kvm_ss)
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
new file mode 100644
index 000000000000..d3792d4a3d56
--- /dev/null
+++ b/target/i386/kvm/tdx.c
@@ -0,0 +1,40 @@
+/*
+ * QEMU TDX support
+ *
+ * Copyright Intel
+ *
+ * Author:
+ *      Xiaoyao Li <xiaoyao.li@intel.com>
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2 or later.
+ * See the COPYING file in the top-level directory
+ *
+ */
+
+#include "qemu/osdep.h"
+#include "qom/object_interfaces.h"
+
+#include "tdx.h"
+
+/* tdx guest */
+OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
+                                   tdx_guest,
+                                   TDX_GUEST,
+                                   CONFIDENTIAL_GUEST_SUPPORT,
+                                   { TYPE_USER_CREATABLE },
+                                   { NULL })
+
+static void tdx_guest_init(Object *obj)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    tdx->attributes = 0;
+}
+
+static void tdx_guest_finalize(Object *obj)
+{
+}
+
+static void tdx_guest_class_init(ObjectClass *oc, void *data)
+{
+}
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
new file mode 100644
index 000000000000..415aeb5af746
--- /dev/null
+++ b/target/i386/kvm/tdx.h
@@ -0,0 +1,19 @@
+#ifndef QEMU_I386_TDX_H
+#define QEMU_I386_TDX_H
+
+#include "exec/confidential-guest-support.h"
+
+#define TYPE_TDX_GUEST "tdx-guest"
+#define TDX_GUEST(obj)  OBJECT_CHECK(TdxGuest, (obj), TYPE_TDX_GUEST)
+
+typedef struct TdxGuestClass {
+    ConfidentialGuestSupportClass parent_class;
+} TdxGuestClass;
+
+typedef struct TdxGuest {
+    ConfidentialGuestSupport parent_obj;
+
+    uint64_t attributes;    /* TD attributes */
+} TdxGuest;
+
+#endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 14/70] target/i386: Implement mc->kvm_type() to get VM type
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (12 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 13/70] i386: Introduce tdx-guest object Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 10:49   ` Daniel P. Berrangé
  2023-11-15  7:14 ` [PATCH v3 15/70] target/i386: Parse TDX vm type Xiaoyao Li
                   ` (55 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Implement mc->kvm_type() for i386 machines. It provides a way for the
user to create a SW_PROTECTED_VM.

Also store the vm_type in X86MachineState so that other code can query
what the VM type is.
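
For reference, the support check in kvm_get_vm_type() boils down to a bit
test on the value returned for KVM_CAP_VM_TYPES. The sketch below isolates
that test as a pure function. vm_type_supported() and the VM_TYPE_*
constants are hypothetical names used only here; the constants mirror the
KVM_X86_* values in linux-headers/asm-x86/kvm.h, and supported_mask stands
in for the result of kvm_check_extension().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mirrors of KVM_X86_DEFAULT_VM, _SW_PROTECTED_VM, _TDX_VM. */
enum {
    VM_TYPE_DEFAULT      = 0,
    VM_TYPE_SW_PROTECTED = 1,
    VM_TYPE_TDX          = 2,
};

/*
 * supported_mask stands in for the KVM_CAP_VM_TYPES result: bit N set
 * means VM type N can be created.
 */
static bool vm_type_supported(uint32_t supported_mask, unsigned int vm_type)
{
    assert(vm_type < 32);

    /* Old KVM reports no VM types at all, but the default always works. */
    if (vm_type == VM_TYPE_DEFAULT) {
        return true;
    }
    return (supported_mask >> vm_type) & 1u;
}
```

Treating the default type as always supported keeps QEMU working on
kernels that predate KVM_CAP_VM_TYPES and report 0 for it.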

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 hw/i386/x86.c              | 12 ++++++++++++
 include/hw/i386/x86.h      |  1 +
 target/i386/kvm/kvm.c      | 25 +++++++++++++++++++++++++
 target/i386/kvm/kvm_i386.h |  1 +
 4 files changed, 39 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index b3d054889bba..55678279bf3b 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1377,6 +1377,17 @@ static void machine_set_sgx_epc(Object *obj, Visitor *v, const char *name,
     qapi_free_SgxEPCList(list);
 }
 
+static int x86_kvm_type(MachineState *ms, const char *vm_type)
+{
+    X86MachineState *x86ms = X86_MACHINE(ms);
+    int kvm_type;
+
+    kvm_type = kvm_get_vm_type(ms, vm_type);
+    x86ms->vm_type = kvm_type;
+
+    return kvm_type;
+}
+
 static void x86_machine_initfn(Object *obj)
 {
     X86MachineState *x86ms = X86_MACHINE(obj);
@@ -1401,6 +1412,7 @@ static void x86_machine_class_init(ObjectClass *oc, void *data)
     mc->cpu_index_to_instance_props = x86_cpu_index_to_props;
     mc->get_default_cpu_node_id = x86_get_default_cpu_node_id;
     mc->possible_cpu_arch_ids = x86_possible_cpu_arch_ids;
+    mc->kvm_type = x86_kvm_type;
     x86mc->save_tsc_khz = true;
     x86mc->fwcfg_dma_enabled = true;
     nc->nmi_monitor_handler = x86_nmi;
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index da19ae15463a..ab1d38569019 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -41,6 +41,7 @@ struct X86MachineState {
     MachineState parent;
 
     /*< public >*/
+    unsigned int vm_type;
 
     /* Pointers to devices and objects: */
     ISADevice *rtc;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index b4b9ce89842f..2e47fda25f95 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -161,6 +161,31 @@ static KVMMSRHandlers msr_handlers[KVM_MSR_FILTER_MAX_RANGES];
 static RateLimit bus_lock_ratelimit_ctrl;
 static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
 
+static const char* vm_type_name[] = {
+    [KVM_X86_DEFAULT_VM] = "default",
+    [KVM_X86_SW_PROTECTED_VM] = "sw-protected-vm",
+};
+
+int kvm_get_vm_type(MachineState *ms, const char *vm_type)
+{
+    int kvm_type = KVM_X86_DEFAULT_VM;
+
+    /*
+     * old KVM doesn't support KVM_CAP_VM_TYPES and KVM_X86_DEFAULT_VM
+     * is always supported
+     */
+    if (kvm_type == KVM_X86_DEFAULT_VM) {
+        return kvm_type;
+    }
+
+    if (!(kvm_check_extension(KVM_STATE(ms->accelerator), KVM_CAP_VM_TYPES) & BIT(kvm_type))) {
+        error_report("vm-type %s not supported by KVM", vm_type_name[kvm_type]);
+        exit(1);
+    }
+
+    return kvm_type;
+}
+
 bool kvm_has_smm(void)
 {
     return kvm_vm_check_extension(kvm_state, KVM_CAP_X86_SMM);
diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
index 30fedcffea3e..55fb25fa8e2e 100644
--- a/target/i386/kvm/kvm_i386.h
+++ b/target/i386/kvm/kvm_i386.h
@@ -37,6 +37,7 @@ bool kvm_hv_vpindex_settable(void);
 bool kvm_enable_sgx_provisioning(KVMState *s);
 bool kvm_hyperv_expand_features(X86CPU *cpu, Error **errp);
 
+int kvm_get_vm_type(MachineState *ms, const char *vm_type);
 void kvm_arch_reset_vcpu(X86CPU *cs);
 void kvm_arch_after_reset_vcpu(X86CPU *cpu);
 void kvm_arch_do_init_vcpu(X86CPU *cs);
-- 
2.34.1



* [PATCH v3 15/70] target/i386: Parse TDX vm type
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (13 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 14/70] target/i386: Implement mc->kvm_type() to get VM type Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 16/70] target/i386: Introduce kvm_confidential_guest_init() Xiaoyao Li
                   ` (54 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

A TDX VM requires the VM type KVM_X86_TDX_VM to be passed to
kvm_ioctl(KVM_CREATE_VM).

If a tdx-guest object is specified for confidential-guest-support, e.g.

  qemu -machine ...,confidential-guest-support=tdx0 \
       -object tdx-guest,id=tdx0,...

the VM type is set to KVM_X86_TDX_VM.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/kvm/kvm.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 2e47fda25f95..c4050cbf998e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -32,6 +32,7 @@
 #include "sysemu/runstate.h"
 #include "kvm_i386.h"
 #include "sev.h"
+#include "tdx.h"
 #include "xen-emu.h"
 #include "hyperv.h"
 #include "hyperv-proto.h"
@@ -164,12 +165,17 @@ static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
 static const char* vm_type_name[] = {
     [KVM_X86_DEFAULT_VM] = "default",
     [KVM_X86_SW_PROTECTED_VM] = "sw-protected-vm",
+    [KVM_X86_TDX_VM] = "tdx",
 };
 
 int kvm_get_vm_type(MachineState *ms, const char *vm_type)
 {
     int kvm_type = KVM_X86_DEFAULT_VM;
 
+    if (ms->cgs && object_dynamic_cast(OBJECT(ms->cgs), TYPE_TDX_GUEST)) {
+        kvm_type = KVM_X86_TDX_VM;
+    }
+
     /*
      * old KVM doesn't support KVM_CAP_VM_TYPES and KVM_X86_DEFAULT_VM
      * is always supported
-- 
2.34.1



* [PATCH v3 16/70] target/i386: Introduce kvm_confidential_guest_init()
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (14 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 15/70] target/i386: Parse TDX vm type Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 17/70] i386/tdx: Implement tdx_kvm_init() to initialize TDX VM context Xiaoyao Li
                   ` (53 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce a separate function kvm_confidential_guest_init(), which
dispatches to the specific confidential guest initialization function
according to the type of ms->cgs.
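
The dispatch pattern can be sketched in isolation as follows. This is an
illustration only: the real code uses object_dynamic_cast() on ms->cgs,
whereas the sketch compares QOM type names as strings. The handler table,
sev_init(), tdx_init(), last_initialized and confidential_guest_init() are
hypothetical names, and the "tdx-guest" entry anticipates the TDX hook
that a later patch in this series adds.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct cgs_handler {
    const char *type_name;  /* QOM type name of the cgs object */
    int (*init)(void);      /* backend-specific initialization */
};

/* Records which backend ran; only here so the sketch is observable. */
static const char *last_initialized;

static int sev_init(void) { last_initialized = "sev-guest"; return 0; }
static int tdx_init(void) { last_initialized = "tdx-guest"; return 0; }

static const struct cgs_handler handlers[] = {
    { "sev-guest", sev_init },
    { "tdx-guest", tdx_init },
};

/* cgs_type == NULL models ms->cgs == NULL (no confidential guest). */
static int confidential_guest_init(const char *cgs_type)
{
    size_t i;

    if (cgs_type == NULL) {
        return 0;
    }
    for (i = 0; i < sizeof(handlers) / sizeof(handlers[0]); i++) {
        if (strcmp(cgs_type, handlers[i].type_name) == 0) {
            assert(handlers[i].init != NULL);
            return handlers[i].init();
        }
    }
    /* Some other mechanism: nothing to initialize here. */
    return 0;
}
```

As in the patch, an absent or unrecognized mechanism is a no-op that
returns 0 rather than an error.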

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
Reviewed-by: Philippe Mathieu-Daudé <philmd@linaro.org>
---
 target/i386/kvm/kvm.c | 11 ++++++++++-
 target/i386/sev.c     |  1 -
 target/i386/sev.h     |  2 ++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index c4050cbf998e..dc69f4b7b196 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2542,6 +2542,15 @@ int kvm_arch_get_default_type(MachineState *ms)
     return 0;
 }
 
+static int kvm_confidential_guest_init(MachineState *ms, Error **errp)
+{
+    if (object_dynamic_cast(OBJECT(ms->cgs), TYPE_SEV_GUEST)) {
+        return sev_kvm_init(ms->cgs, errp);
+    }
+
+    return 0;
+}
+
 int kvm_arch_init(MachineState *ms, KVMState *s)
 {
     uint64_t identity_base = 0xfffbc000;
@@ -2562,7 +2571,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
      * mechanisms are supported in future (e.g. TDX), they'll need
      * their own initialization either here or elsewhere.
      */
-    ret = sev_kvm_init(ms->cgs, &local_err);
+    ret = kvm_confidential_guest_init(ms, &local_err);
     if (ret < 0) {
         error_report_err(local_err);
         return ret;
diff --git a/target/i386/sev.c b/target/i386/sev.c
index 9a7124668258..0dd45956bb00 100644
--- a/target/i386/sev.c
+++ b/target/i386/sev.c
@@ -39,7 +39,6 @@
 #include "hw/i386/pc.h"
 #include "exec/address-spaces.h"
 
-#define TYPE_SEV_GUEST "sev-guest"
 OBJECT_DECLARE_SIMPLE_TYPE(SevGuestState, SEV_GUEST)
 
 
diff --git a/target/i386/sev.h b/target/i386/sev.h
index e7499c95b1e8..1fe25d096dc4 100644
--- a/target/i386/sev.h
+++ b/target/i386/sev.h
@@ -20,6 +20,8 @@
 
 #include "exec/confidential-guest-support.h"
 
+#define TYPE_SEV_GUEST "sev-guest"
+
 #define SEV_POLICY_NODBG        0x1
 #define SEV_POLICY_NOKS         0x2
 #define SEV_POLICY_ES           0x4
-- 
2.34.1



* [PATCH v3 17/70] i386/tdx: Implement tdx_kvm_init() to initialize TDX VM context
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (15 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 16/70] target/i386: Introduce kvm_confidential_guest_init() Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES Xiaoyao Li
                   ` (52 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce tdx_kvm_init() and invoke it in kvm_confidential_guest_init()
if it's a TDX VM.

Set ms->require_guest_memfd to require KVM guest memfd allocation for any
memory backend. More TDX-specific initialization will be added later.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/kvm.c       | 15 ++++++---------
 target/i386/kvm/meson.build |  2 +-
 target/i386/kvm/tdx-stub.c  |  8 ++++++++
 target/i386/kvm/tdx.c       |  9 +++++++++
 target/i386/kvm/tdx.h       |  2 ++
 5 files changed, 26 insertions(+), 10 deletions(-)
 create mode 100644 target/i386/kvm/tdx-stub.c

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index dc69f4b7b196..7abcdebb1452 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -62,6 +62,7 @@
 #include "migration/blocker.h"
 #include "exec/memattrs.h"
 #include "trace.h"
+#include "tdx.h"
 
 #include CONFIG_DEVICES
 
@@ -2546,6 +2547,8 @@ static int kvm_confidential_guest_init(MachineState *ms, Error **errp)
 {
     if (object_dynamic_cast(OBJECT(ms->cgs), TYPE_SEV_GUEST)) {
         return sev_kvm_init(ms->cgs, errp);
+    } else if (object_dynamic_cast(OBJECT(ms->cgs), TYPE_TDX_GUEST)) {
+        return tdx_kvm_init(ms, errp);
     }
 
     return 0;
@@ -2560,16 +2563,10 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
     Error *local_err = NULL;
 
     /*
-     * Initialize SEV context, if required
+     * Initialize confidential guest (SEV/TDX) context, if required
      *
-     * If no memory encryption is requested (ms->cgs == NULL) this is
-     * a no-op.
-     *
-     * It's also a no-op if a non-SEV confidential guest support
-     * mechanism is selected.  SEV is the only mechanism available to
-     * select on x86 at present, so this doesn't arise, but if new
-     * mechanisms are supported in future (e.g. TDX), they'll need
-     * their own initialization either here or elsewhere.
+     * It's a no-op if no confidential guest support is requested
+     * (ms->cgs == NULL) or if a non-SEV/non-TDX mechanism is selected.
      */
     ret = kvm_confidential_guest_init(ms, &local_err);
     if (ret < 0) {
diff --git a/target/i386/kvm/meson.build b/target/i386/kvm/meson.build
index 6ea0ce27b757..30a90b4d371d 100644
--- a/target/i386/kvm/meson.build
+++ b/target/i386/kvm/meson.build
@@ -9,7 +9,7 @@ i386_kvm_ss.add(when: 'CONFIG_XEN_EMU', if_true: files('xen-emu.c'))
 
 i386_kvm_ss.add(when: 'CONFIG_SEV', if_false: files('sev-stub.c'))
 
-i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'))
+i386_kvm_ss.add(when: 'CONFIG_TDX', if_true: files('tdx.c'), if_false: files('tdx-stub.c'))
 
 i386_system_ss.add(when: 'CONFIG_HYPERV', if_true: files('hyperv.c'), if_false: files('hyperv-stub.c'))
 
diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
new file mode 100644
index 000000000000..1d866d5496bf
--- /dev/null
+++ b/target/i386/kvm/tdx-stub.c
@@ -0,0 +1,8 @@
+#include "qemu/osdep.h"
+
+#include "tdx.h"
+
+int tdx_kvm_init(MachineState *ms, Error **errp)
+{
+    return -EINVAL;
+}
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index d3792d4a3d56..621a05beeb4e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -12,10 +12,19 @@
  */
 
 #include "qemu/osdep.h"
+#include "qapi/error.h"
 #include "qom/object_interfaces.h"
 
+#include "hw/i386/x86.h"
 #include "tdx.h"
 
+int tdx_kvm_init(MachineState *ms, Error **errp)
+{
+    ms->require_guest_memfd = true;
+
+    return 0;
+}
+
 /* tdx guest */
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
                                    tdx_guest,
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 415aeb5af746..c8a23d95258d 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -16,4 +16,6 @@ typedef struct TdxGuest {
     uint64_t attributes;    /* TD attributes */
 } TdxGuest;
 
+int tdx_kvm_init(MachineState *ms, Error **errp);
+
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (16 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 17/70] i386/tdx: Implement tdx_kvm_init() to initialize TDX VM context Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 10:54   ` Daniel P. Berrangé
  2023-11-17 21:18   ` Isaku Yamahata
  2023-11-15  7:14 ` [PATCH v3 19/70] i386/tdx: Introduce is_tdx_vm() helper and cache tdx_guest object Xiaoyao Li
                   ` (51 subsequent siblings)
  69 siblings, 2 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

KVM provides TDX capabilities via the KVM_TDX_CAPABILITIES sub-command of
ioctl(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing the
TDX context. They will be used later to validate the user's settings.

Since there is no interface reporting how many CPUID configs
KVM_TDX_CAPABILITIES contains, QEMU starts with a known number, doubles it
whenever KVM returns -E2BIG, and gives up once it exceeds
KVM_MAX_CPUID_ENTRIES.
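
The grow-and-retry pattern described above can be sketched in isolation as
follows. This is a minimal illustration, not the code in this patch:
struct caps, mock_query() and query_caps() are hypothetical names, and
mock_query() merely imitates a kernel that rejects too-small buffers with
-E2BIG.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ENTRIES 100   /* stands in for KVM_MAX_CPUID_ENTRIES */

/* Hypothetical, simplified stand-in for struct kvm_tdx_capabilities. */
struct caps {
    uint32_t nr_entries;
    uint32_t entries[];   /* flexible array member */
};

/*
 * Mock of the kernel side (an assumption for illustration): fail with
 * -E2BIG while the caller's buffer is too small, otherwise fill it in.
 */
static int mock_query(struct caps *c, uint32_t kernel_entries)
{
    uint32_t i;

    if (c->nr_entries < kernel_entries) {
        return -E2BIG;
    }
    c->nr_entries = kernel_entries;
    for (i = 0; i < kernel_entries; i++) {
        c->entries[i] = i;
    }
    return 0;
}

/* Start from a known count and double it until the query fits. */
static int query_caps(uint32_t kernel_entries, struct caps **out)
{
    uint32_t nr = 6;

    assert(out != NULL);
    for (;;) {
        struct caps *c = calloc(1, sizeof(*c) + nr * sizeof(c->entries[0]));
        int r;

        if (!c) {
            return -ENOMEM;
        }
        c->nr_entries = nr;
        r = mock_query(c, kernel_entries);
        if (r == 0) {
            *out = c;             /* caller owns the buffer */
            return 0;
        }
        free(c);
        if (r != -E2BIG) {
            return r;             /* any other error is final */
        }
        nr *= 2;
        if (nr > MAX_ENTRIES) {
            return -E2BIG;        /* give up past the upper bound */
        }
    }
}
```

With doubling from 6, the sizes tried in this sketch are 6, 12, 24, 48 and
96; the next candidate (192) is already past the limit of 100.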

In addition, introduce helpers to invoke TDX "ioctls" at the different
scopes (KVM, VM and vCPU) in preparation for later patches.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
- rename __tdx_ioctl() to tdx_ioctl_internal()
- Pass errp in get_tdx_capabilities();

changes in v2:
  - Make the error message more clear;

changes in v1:
  - start from nr_cpuid_configs = 6 for the loop;
  - stop the loop when nr_cpuid_configs exceeds KVM_MAX_CPUID_ENTRIES;
---
 target/i386/kvm/kvm.c      |   2 -
 target/i386/kvm/kvm_i386.h |   2 +
 target/i386/kvm/tdx.c      | 102 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 103 insertions(+), 3 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 7abcdebb1452..28e60c5ea4a7 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1687,8 +1687,6 @@ static int hyperv_init_vcpu(X86CPU *cpu)
 
 static Error *invtsc_mig_blocker;
 
-#define KVM_MAX_CPUID_ENTRIES  100
-
 static void kvm_init_xsave(CPUX86State *env)
 {
     if (has_xsave2) {
diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
index 55fb25fa8e2e..c3ef46a97a7b 100644
--- a/target/i386/kvm/kvm_i386.h
+++ b/target/i386/kvm/kvm_i386.h
@@ -13,6 +13,8 @@
 
 #include "sysemu/kvm.h"
 
+#define KVM_MAX_CPUID_ENTRIES  100
+
 #ifdef CONFIG_KVM
 
 #define kvm_pit_in_kernel() \
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 621a05beeb4e..cb0040187b27 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -12,17 +12,117 @@
  */
 
 #include "qemu/osdep.h"
+#include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
+#include "sysemu/kvm.h"
 
 #include "hw/i386/x86.h"
+#include "kvm_i386.h"
 #include "tdx.h"
 
+static struct kvm_tdx_capabilities *tdx_caps;
+
+enum tdx_ioctl_level{
+    TDX_PLATFORM_IOCTL,
+    TDX_VM_IOCTL,
+    TDX_VCPU_IOCTL,
+};
+
+static int tdx_ioctl_internal(void *state, enum tdx_ioctl_level level, int cmd_id,
+                        __u32 flags, void *data)
+{
+    struct kvm_tdx_cmd tdx_cmd;
+    int r;
+
+    memset(&tdx_cmd, 0x0, sizeof(tdx_cmd));
+
+    tdx_cmd.id = cmd_id;
+    tdx_cmd.flags = flags;
+    tdx_cmd.data = (__u64)(unsigned long)data;
+
+    switch (level) {
+    case TDX_PLATFORM_IOCTL:
+        r = kvm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
+        break;
+    case TDX_VM_IOCTL:
+        r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
+        break;
+    case TDX_VCPU_IOCTL:
+        r = kvm_vcpu_ioctl(state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
+        break;
+    default:
+        error_report("Invalid tdx_ioctl_level %d", level);
+        exit(1);
+    }
+
+    return r;
+}
+
+static inline int tdx_platform_ioctl(int cmd_id, __u32 flags, void *data)
+{
+    return tdx_ioctl_internal(NULL, TDX_PLATFORM_IOCTL, cmd_id, flags, data);
+}
+
+static inline int tdx_vm_ioctl(int cmd_id, __u32 flags, void *data)
+{
+    return tdx_ioctl_internal(NULL, TDX_VM_IOCTL, cmd_id, flags, data);
+}
+
+static inline int tdx_vcpu_ioctl(void *vcpu_fd, int cmd_id, __u32 flags,
+                                 void *data)
+{
+    return  tdx_ioctl_internal(vcpu_fd, TDX_VCPU_IOCTL, cmd_id, flags, data);
+}
+
+static int get_tdx_capabilities(Error **errp)
+{
+    struct kvm_tdx_capabilities *caps;
+    /* 1st generation of TDX reports 6 cpuid configs */
+    int nr_cpuid_configs = 6;
+    size_t size;
+    int r;
+
+    do {
+        size = sizeof(struct kvm_tdx_capabilities) +
+               nr_cpuid_configs * sizeof(struct kvm_tdx_cpuid_config);
+        caps = g_malloc0(size);
+        caps->nr_cpuid_configs = nr_cpuid_configs;
+
+        r = tdx_vm_ioctl(KVM_TDX_CAPABILITIES, 0, caps);
+        if (r == -E2BIG) {
+            g_free(caps);
+            nr_cpuid_configs *= 2;
+            if (nr_cpuid_configs > KVM_MAX_CPUID_ENTRIES) {
+                error_setg(errp, "%s: KVM TDX seems broken that number of CPUID "
+                           "entries in kvm_tdx_capabilities exceeds limit %d",
+                           __func__, KVM_MAX_CPUID_ENTRIES);
+                return r;
+            }
+        } else if (r < 0) {
+            g_free(caps);
+            error_setg_errno(errp, -r, "%s: KVM_TDX_CAPABILITIES failed", __func__);
+            return r;
+        }
+    }
+    while (r == -E2BIG);
+
+    tdx_caps = caps;
+
+    return 0;
+}
+
 int tdx_kvm_init(MachineState *ms, Error **errp)
 {
+    int r = 0;
+
     ms->require_guest_memfd = true;
 
-    return 0;
+    if (!tdx_caps) {
+        r = get_tdx_capabilities(errp);
+    }
+
+    return r;
 }
 
 /* tdx guest */
-- 
2.34.1



* [PATCH v3 19/70] i386/tdx: Introduce is_tdx_vm() helper and cache tdx_guest object
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (17 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-17 21:20   ` Isaku Yamahata
  2023-11-15  7:14 ` [PATCH v3 20/70] i386/tdx: Adjust the supported CPUID based on TDX restrictions Xiaoyao Li
                   ` (50 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

TDX VMs need special handling in many places throughout QEMU.
Introduce the is_tdx_vm() helper to query whether the VM is a TDX VM.

Cache the tdx_guest object so there is no need to cast from ms->cgs every
time.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
changes in v3:
- replace object_dynamic_cast with TDX_GUEST();
---
 target/i386/kvm/tdx.c | 15 ++++++++++++++-
 target/i386/kvm/tdx.h | 10 ++++++++++
 2 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index cb0040187b27..cf8889f0a8f9 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -21,8 +21,16 @@
 #include "kvm_i386.h"
 #include "tdx.h"
 
+static TdxGuest *tdx_guest;
+
 static struct kvm_tdx_capabilities *tdx_caps;
 
+/* Valid only after kvm_confidential_guest_init()->tdx_kvm_init() */
+bool is_tdx_vm(void)
+{
+    return !!tdx_guest;
+}
+
 enum tdx_ioctl_level{
     TDX_PLATFORM_IOCTL,
     TDX_VM_IOCTL,
@@ -114,15 +122,20 @@ static int get_tdx_capabilities(Error **errp)
 
 int tdx_kvm_init(MachineState *ms, Error **errp)
 {
+    TdxGuest *tdx = TDX_GUEST(OBJECT(ms->cgs));
     int r = 0;
 
     ms->require_guest_memfd = true;
 
     if (!tdx_caps) {
         r = get_tdx_capabilities(errp);
+        if (r) {
+            return r;
+        }
     }
 
-    return r;
+    tdx_guest = tdx;
+    return 0;
 }
 
 /* tdx guest */
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index c8a23d95258d..4036ca2f3f99 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -1,6 +1,10 @@
 #ifndef QEMU_I386_TDX_H
 #define QEMU_I386_TDX_H
 
+#ifndef CONFIG_USER_ONLY
+#include CONFIG_DEVICES /* CONFIG_TDX */
+#endif
+
 #include "exec/confidential-guest-support.h"
 
 #define TYPE_TDX_GUEST "tdx-guest"
@@ -16,6 +20,12 @@ typedef struct TdxGuest {
     uint64_t attributes;    /* TD attributes */
 } TdxGuest;
 
+#ifdef CONFIG_TDX
+bool is_tdx_vm(void);
+#else
+#define is_tdx_vm() 0
+#endif /* CONFIG_TDX */
+
 int tdx_kvm_init(MachineState *ms, Error **errp);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 20/70] i386/tdx: Adjust the supported CPUID based on TDX restrictions
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (18 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 19/70] i386/tdx: Introduce is_tdx_vm() helper and cache tdx_guest object Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 21/70] i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by tdx_caps.cpuid_config[] Xiaoyao Li
                   ` (49 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

According to the "CPUID Virtualization" chapter of the TDX module spec, the
CPUID bits of a TD can be classified into 6 types:

------------------------------------------------------------------------
1 | As configured | configurable by VMM, independent of native value
------------------------------------------------------------------------
2 | As configured | configurable by VMM if the bit is supported natively;
  | (if native)   | otherwise it equals the native value (0)
------------------------------------------------------------------------
3 | Fixed         | fixed to 0/1
------------------------------------------------------------------------
4 | Native        | reflects the native value
------------------------------------------------------------------------
5 | Calculated    | calculated by the TDX module
------------------------------------------------------------------------
6 | Inducing #VE  | the TD gets a #VE exception
------------------------------------------------------------------------

Note:
1. All the configurable XFAM-related and TD-attributes-related features
   fall into type #2. The fixed0/1 bits of XFAM and TD attributes fall
   into type #3.

2. For CPUID leaves not listed in the "CPUID virtualization Overview" table
   of the TDX module spec, the TDX module injects a #VE into the TD when
   they are queried. In this case, the TD can request CPUID emulation from
   the VMM via TDVMCALL, and the values are fully controlled by the VMM.

Because the TDX module has its own virtualization policy for CPUID bits,
what KVM_GET_SUPPORTED_CPUID reports diverges from the CPUID bits that are
actually supported for TDs. To keep the CPUID configuration consistent
between the VMM and TDs, adjust the supported CPUID for TDs based on TDX
restrictions.

Currently this only covers the CPUID leaves recognized by QEMU's
feature_word_info[], which are indexed by FeatureWord.

Introduce a TDX CPUID lookup table, which maintains one entry for each
FeatureWord. Each entry has the following fields:

 - tdx_fixed0/1: The bits that are fixed to 0/1;

 - vmm_fixup:   The bits that are configurable from the view of the TDX
                module but require VMM emulation when they are configured
                as enabled. They are not supported if the VMM doesn't
                report them as supported, so they need to be fixed up by
                checking whether the VMM supports them.

 - inducing_ve: The TD gets a #VE when querying this CPUID leaf. The result
                is fully configurable by the VMM.

 - supported_on_ve: Valid only when @inducing_ve is true. It represents
                    the maximum feature set that can be emulated for TDs.

By applying the TDX CPUID lookup table and the TDX capabilities reported by
the TDX module, the supported CPUID for TDs is obtained with the following
steps:

- get the base VMM-supported feature set;

- if the leaf is not a FeatureWord, just return the VMM's value without
  modification;

- if the leaf is of the inducing_ve type, apply the supported_on_ve mask
  and return;

- include all native bits; this covers types #2 and #4 and part of type #1
  (it also includes some unsupported bits, which the following step
  corrects);

- apply fixed0/1 (this covers type #3 and rectifies the previous step);

- add the configurable bits (this covers the other part of type #1);

- fix up the bits in vmm_fixup;

- filter the ones that have a valid .supported field;

(The Calculated type is ignored since it's determined at runtime.)

Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/cpu.h     |  16 +++
 target/i386/kvm/kvm.c |   4 +
 target/i386/kvm/tdx.c | 254 ++++++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.h |   2 +
 4 files changed, 276 insertions(+)

diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index cd2e295bd655..bd9151d3bcaa 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -780,6 +780,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 
 /* Support RDFSBASE/RDGSBASE/WRFSBASE/WRGSBASE */
 #define CPUID_7_0_EBX_FSGSBASE          (1U << 0)
+/* Support for TSC adjustment MSR 0x3B */
+#define CPUID_7_0_EBX_TSC_ADJUST        (1U << 1)
 /* Support SGX */
 #define CPUID_7_0_EBX_SGX               (1U << 2)
 /* 1st Group of Advanced Bit Manipulation Extensions */
@@ -798,8 +800,12 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_EBX_INVPCID           (1U << 10)
 /* Restricted Transactional Memory */
 #define CPUID_7_0_EBX_RTM               (1U << 11)
+/* Cache QoS Monitoring */
+#define CPUID_7_0_EBX_PQM               (1U << 12)
 /* Memory Protection Extension */
 #define CPUID_7_0_EBX_MPX               (1U << 14)
+/* Resource Director Technology Allocation */
+#define CPUID_7_0_EBX_RDT_A             (1U << 15)
 /* AVX-512 Foundation */
 #define CPUID_7_0_EBX_AVX512F           (1U << 16)
 /* AVX-512 Doubleword & Quadword Instruction */
@@ -855,10 +861,16 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_ECX_AVX512VNNI        (1U << 11)
 /* Support for VPOPCNT[B,W] and VPSHUFBITQMB */
 #define CPUID_7_0_ECX_AVX512BITALG      (1U << 12)
+/* Intel Total Memory Encryption */
+#define CPUID_7_0_ECX_TME               (1U << 13)
 /* POPCNT for vectors of DW/QW */
 #define CPUID_7_0_ECX_AVX512_VPOPCNTDQ  (1U << 14)
+/* Placeholder for bit 15 */
+#define CPUID_7_0_ECX_FZM               (1U << 15)
 /* 5-level Page Tables */
 #define CPUID_7_0_ECX_LA57              (1U << 16)
+/* MAWAU for MPX */
+#define CPUID_7_0_ECX_MAWAU             (31U << 17)
 /* Read Processor ID */
 #define CPUID_7_0_ECX_RDPID             (1U << 22)
 /* Bus Lock Debug Exception */
@@ -869,6 +881,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_ECX_MOVDIRI           (1U << 27)
 /* Move 64 Bytes as Direct Store Instruction */
 #define CPUID_7_0_ECX_MOVDIR64B         (1U << 28)
+/* ENQCMD and ENQCMDS instructions */
+#define CPUID_7_0_ECX_ENQCMD            (1U << 29)
 /* Support SGX Launch Control */
 #define CPUID_7_0_ECX_SGX_LC            (1U << 30)
 /* Protection Keys for Supervisor-mode Pages */
@@ -886,6 +900,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_EDX_SERIALIZE         (1U << 14)
 /* TSX Suspend Load Address Tracking instruction */
 #define CPUID_7_0_EDX_TSX_LDTRK         (1U << 16)
+/* PCONFIG instruction */
+#define CPUID_7_0_EDX_PCONFIG           (1U << 18)
 /* Architectural LBRs */
 #define CPUID_7_0_EDX_ARCH_LBR          (1U << 19)
 /* AMX_BF16 instruction */
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 28e60c5ea4a7..f2627dd61d2b 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -522,6 +522,10 @@ uint32_t kvm_arch_get_supported_cpuid(KVMState *s, uint32_t function,
         ret |= 1U << KVM_HINTS_REALTIME;
     }
 
+    if (is_tdx_vm()) {
+        tdx_get_supported_cpuid(function, index, reg, &ret);
+    }
+
     return ret;
 }
 
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index cf8889f0a8f9..eda6e695a884 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -15,11 +15,129 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
+#include "standard-headers/asm-x86/kvm_para.h"
 #include "sysemu/kvm.h"
+#include "sysemu/sysemu.h"
 
 #include "hw/i386/x86.h"
 #include "kvm_i386.h"
 #include "tdx.h"
+#include "../cpu-internal.h"
+
+#define TDX_SUPPORTED_KVM_FEATURES  ((1U << KVM_FEATURE_NOP_IO_DELAY) | \
+                                     (1U << KVM_FEATURE_PV_UNHALT) | \
+                                     (1U << KVM_FEATURE_PV_TLB_FLUSH) | \
+                                     (1U << KVM_FEATURE_PV_SEND_IPI) | \
+                                     (1U << KVM_FEATURE_POLL_CONTROL) | \
+                                     (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
+                                     (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
+
+typedef struct KvmTdxCpuidLookup {
+    uint32_t tdx_fixed0;
+    uint32_t tdx_fixed1;
+
+    /*
+     * The CPUID bits that are configurable from the view of the TDX module
+     * but require VMM emulation if the VMM configures them as enabled.
+     *
+     * These bits cannot actually be enabled if the VMM (KVM/QEMU) cannot
+     * virtualize them.
+     */
+    uint32_t vmm_fixup;
+
+    bool inducing_ve;
+    /*
+     * The maximum supported feature set for given inducing-#VE leaf.
+     * It's valid only when .inducing_ve is true.
+     */
+    uint32_t supported_on_ve;
+} KvmTdxCpuidLookup;
+
+/*
+ * QEMU-maintained TDX CPUID lookup table, which reflects how CPUIDs are
+ * virtualized for guest TDs based on "CPUID virtualization" of TDX spec.
+ *
+ * Note:
+ *
+ * This table is updated at runtime from the tdx_caps reported by the
+ * platform.
+ */
+static KvmTdxCpuidLookup tdx_cpuid_lookup[FEATURE_WORDS] = {
+    [FEAT_1_EDX] = {
+        .tdx_fixed0 =
+            BIT(10) /* Reserved */ | BIT(20) /* Reserved */ | CPUID_IA64,
+        .tdx_fixed1 =
+            CPUID_MSR | CPUID_PAE | CPUID_MCE | CPUID_APIC |
+            CPUID_MTRR | CPUID_MCA | CPUID_CLFLUSH | CPUID_DTS,
+        .vmm_fixup =
+            CPUID_ACPI | CPUID_PBE,
+    },
+    [FEAT_1_ECX] = {
+        .tdx_fixed0 =
+            CPUID_EXT_VMX | CPUID_EXT_SMX | BIT(16) /* Reserved */,
+        .tdx_fixed1 =
+            CPUID_EXT_CX16 | CPUID_EXT_PDCM | CPUID_EXT_X2APIC |
+            CPUID_EXT_AES | CPUID_EXT_XSAVE | CPUID_EXT_RDRAND |
+            CPUID_EXT_HYPERVISOR,
+        .vmm_fixup =
+            CPUID_EXT_EST | CPUID_EXT_TM2 | CPUID_EXT_XTPR | CPUID_EXT_DCA,
+    },
+    [FEAT_8000_0001_EDX] = {
+        .tdx_fixed1 =
+            CPUID_EXT2_NX | CPUID_EXT2_PDPE1GB | CPUID_EXT2_RDTSCP |
+            CPUID_EXT2_LM,
+    },
+    [FEAT_7_0_EBX] = {
+        .tdx_fixed0 =
+            CPUID_7_0_EBX_TSC_ADJUST | CPUID_7_0_EBX_SGX | CPUID_7_0_EBX_MPX,
+        .tdx_fixed1 =
+            CPUID_7_0_EBX_FSGSBASE | CPUID_7_0_EBX_RTM |
+            CPUID_7_0_EBX_RDSEED | CPUID_7_0_EBX_SMAP |
+            CPUID_7_0_EBX_CLFLUSHOPT | CPUID_7_0_EBX_CLWB |
+            CPUID_7_0_EBX_SHA_NI,
+        .vmm_fixup =
+            CPUID_7_0_EBX_PQM | CPUID_7_0_EBX_RDT_A,
+    },
+    [FEAT_7_0_ECX] = {
+        .tdx_fixed0 =
+            CPUID_7_0_ECX_FZM | CPUID_7_0_ECX_MAWAU |
+            CPUID_7_0_ECX_ENQCMD | CPUID_7_0_ECX_SGX_LC,
+        .tdx_fixed1 =
+            CPUID_7_0_ECX_MOVDIR64B | CPUID_7_0_ECX_BUS_LOCK_DETECT,
+        .vmm_fixup =
+            CPUID_7_0_ECX_TME,
+    },
+    [FEAT_7_0_EDX] = {
+        .tdx_fixed1 =
+            CPUID_7_0_EDX_SPEC_CTRL | CPUID_7_0_EDX_ARCH_CAPABILITIES |
+            CPUID_7_0_EDX_CORE_CAPABILITY | CPUID_7_0_EDX_SPEC_CTRL_SSBD,
+        .vmm_fixup =
+            CPUID_7_0_EDX_PCONFIG,
+    },
+    [FEAT_8000_0008_EBX] = {
+        .tdx_fixed0 =
+            ~CPUID_8000_0008_EBX_WBNOINVD,
+        .tdx_fixed1 =
+            CPUID_8000_0008_EBX_WBNOINVD,
+    },
+    [FEAT_XSAVE] = {
+        .tdx_fixed1 =
+            CPUID_XSAVE_XSAVEOPT | CPUID_XSAVE_XSAVEC |
+            CPUID_XSAVE_XSAVES,
+    },
+    [FEAT_6_EAX] = {
+        .inducing_ve = true,
+        .supported_on_ve = CPUID_6_EAX_ARAT,
+    },
+    [FEAT_8000_0007_EDX] = {
+        .inducing_ve = true,
+        .supported_on_ve = -1U,
+    },
+    [FEAT_KVM] = {
+        .inducing_ve = true,
+        .supported_on_ve = TDX_SUPPORTED_KVM_FEATURES,
+    },
+};
 
 static TdxGuest *tdx_guest;
 
@@ -31,6 +149,142 @@ bool is_tdx_vm(void)
     return !!tdx_guest;
 }
 
+static inline uint32_t host_cpuid_reg(uint32_t function,
+                                      uint32_t index, int reg)
+{
+    uint32_t eax, ebx, ecx, edx;
+    uint32_t ret = 0;
+
+    host_cpuid(function, index, &eax, &ebx, &ecx, &edx);
+
+    switch (reg) {
+    case R_EAX:
+        ret = eax;
+        break;
+    case R_EBX:
+        ret = ebx;
+        break;
+    case R_ECX:
+        ret = ecx;
+        break;
+    case R_EDX:
+        ret = edx;
+        break;
+    }
+    return ret;
+}
+
+static inline uint32_t tdx_cap_cpuid_config(uint32_t function,
+                                            uint32_t index, int reg)
+{
+    struct kvm_tdx_cpuid_config *cpuid_c;
+    int ret = 0;
+    int i;
+
+    if (tdx_caps->nr_cpuid_configs <= 0) {
+        return ret;
+    }
+
+    for (i = 0; i < tdx_caps->nr_cpuid_configs; i++) {
+        cpuid_c = &tdx_caps->cpuid_configs[i];
+        /* 0xffffffff in sub_leaf means the leaf doesn't require a subleaf */
+        if (cpuid_c->leaf == function &&
+            (cpuid_c->sub_leaf == 0xffffffff || cpuid_c->sub_leaf == index)) {
+            switch (reg) {
+            case R_EAX:
+                ret = cpuid_c->eax;
+                break;
+            case R_EBX:
+                ret = cpuid_c->ebx;
+                break;
+            case R_ECX:
+                ret = cpuid_c->ecx;
+                break;
+            case R_EDX:
+                ret = cpuid_c->edx;
+                break;
+            default:
+                return 0;
+            }
+        }
+    }
+    return ret;
+}
+
+static FeatureWord get_cpuid_featureword_index(uint32_t function,
+                                               uint32_t index, int reg)
+{
+    FeatureWord w;
+
+    for (w = 0; w < FEATURE_WORDS; w++) {
+        FeatureWordInfo *f = &feature_word_info[w];
+
+        if (f->type == MSR_FEATURE_WORD || f->cpuid.eax != function ||
+            f->cpuid.reg != reg ||
+            (f->cpuid.needs_ecx && f->cpuid.ecx != index)) {
+            continue;
+        }
+
+        return w;
+    }
+
+    return w;
+}
+
+/*
+ * TDX supported CPUID varies from what KVM reports. Adjust the result by
+ * applying the TDX restrictions.
+ */
+void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
+                             uint32_t *ret)
+{
+    uint32_t vmm_cap = *ret;
+    FeatureWord w;
+
+    /* Only handle feature leaves that are recognized by feature_word_info[] */
+    w = get_cpuid_featureword_index(function, index, reg);
+    if (w == FEATURE_WORDS) {
+        return;
+    }
+
+    if (tdx_cpuid_lookup[w].inducing_ve) {
+        *ret &= tdx_cpuid_lookup[w].supported_on_ve;
+        return;
+    }
+
+    /*
+     * Include all the native bits as the first step. This covers the types
+     * - As configured (if native)
+     * - Native
+     * - XFAM related and Attributes related
+     *
+     * As a side effect it also enables unsupported bits, e.g., bits of
+     * the "fixed0" type that are present natively. That is safe because
+     * the unsupported bits will be masked off by .tdx_fixed0 later.
+     */
+    *ret |= host_cpuid_reg(function, index, reg);
+
+    /* Adjust according to "fixed" type in tdx_cpuid_lookup. */
+    *ret |= tdx_cpuid_lookup[w].tdx_fixed1;
+    *ret &= ~tdx_cpuid_lookup[w].tdx_fixed0;
+
+    /*
+     * Configurable CPUID bits are supported unconditionally. This mainly
+     * picks up the bits that are configurable regardless of native support.
+     */
+    *ret |= tdx_cap_cpuid_config(function, index, reg);
+
+    /*
+     * Clear the configurable bits that require VMM emulation when the VMM
+     * doesn't report support for them.
+     */
+    *ret &= ~(~vmm_cap & tdx_cpuid_lookup[w].vmm_fixup);
+
+    /* special handling */
+    if (function == 1 && reg == R_ECX && !enable_cpu_pm)
+        *ret &= ~CPUID_EXT_MONITOR;
+}
+
 enum tdx_ioctl_level{
     TDX_PLATFORM_IOCTL,
     TDX_VM_IOCTL,
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 4036ca2f3f99..06599b65b827 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -27,5 +27,7 @@ bool is_tdx_vm(void);
 #endif /* CONFIG_TDX */
 
 int tdx_kvm_init(MachineState *ms, Error **errp);
+void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
+                             uint32_t *ret);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 21/70] i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by tdx_caps.cpuid_config[]
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (19 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 20/70] i386/tdx: Adjust the supported CPUID based on TDX restrictions Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 22/70] i386/tdx: Integrate tdx_caps->xfam_fixed0/1 into tdx_cpuid_lookup Xiaoyao Li
                   ` (48 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

tdx_cpuid_lookup[].tdx_fixed0/1 is QEMU-maintained data that reflects
TDX restrictions regarding how some CPUIDs are virtualized by TDX.

It is derived from the TDX spec. However, TDX may change some fixed
fields to configurable in the future. Update the
tdx_cpuid_lookup[].tdx_fixed0/1 fields by removing the bits that the
TDX module reports as configurable. This lets QEMU adapt to an updated
TDX module automatically.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/kvm/tdx.c | 30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index eda6e695a884..7fa86858de58 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -374,6 +374,34 @@ static int get_tdx_capabilities(Error **errp)
     return 0;
 }
 
+static void update_tdx_cpuid_lookup_by_tdx_caps(void)
+{
+    KvmTdxCpuidLookup *entry;
+    FeatureWordInfo *fi;
+    uint32_t config;
+    FeatureWord w;
+
+    /*
+     * Patch tdx_fixed0/1 using tdx_caps: whatever the TDX module reports
+     * as configurable is not fixed.
+     */
+    for (w = 0; w < FEATURE_WORDS; w++) {
+        fi = &feature_word_info[w];
+        entry = &tdx_cpuid_lookup[w];
+
+        if (fi->type != CPUID_FEATURE_WORD) {
+            continue;
+        }
+
+        config = tdx_cap_cpuid_config(fi->cpuid.eax,
+                                      fi->cpuid.needs_ecx ? fi->cpuid.ecx : ~0u,
+                                      fi->cpuid.reg);
+
+        entry->tdx_fixed0 &= ~config;
+        entry->tdx_fixed1 &= ~config;
+    }
+}
+
 int tdx_kvm_init(MachineState *ms, Error **errp)
 {
     TdxGuest *tdx = TDX_GUEST(OBJECT(ms->cgs));
@@ -388,6 +416,8 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
         }
     }
 
+    update_tdx_cpuid_lookup_by_tdx_caps();
+
     tdx_guest = tdx;
     return 0;
 }
-- 
2.34.1



* [PATCH v3 22/70] i386/tdx: Integrate tdx_caps->xfam_fixed0/1 into tdx_cpuid_lookup
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (20 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 21/70] i386/tdx: Update tdx_cpuid_lookup[].tdx_fixed0/1 by tdx_caps.cpuid_config[] Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 23/70] i386/tdx: Integrate tdx_caps->attrs_fixed0/1 to tdx_cpuid_lookup Xiaoyao Li
                   ` (47 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

KVM requires userspace to pass the XFAM configuration via the CPUID 0xD
leaves.

Convert tdx_caps->xfam_fixed0/1 into the corresponding
tdx_cpuid_lookup[].tdx_fixed0/1 fields of the CPUID 0xD leaves, so that the
requirement is applied naturally.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/cpu.c     |  3 ---
 target/i386/cpu.h     |  3 +++
 target/i386/kvm/tdx.c | 24 ++++++++++++++++++++++++
 3 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 358d9c0a655a..128b01054ff3 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1575,9 +1575,6 @@ static const X86RegisterInfo32 x86_reg_info_32[CPU_NB_REGS32] = {
 };
 #undef REGISTER
 
-/* CPUID feature bits available in XSS */
-#define CPUID_XSTATE_XSS_MASK    (XSTATE_ARCH_LBR_MASK)
-
 ExtSaveArea x86_ext_save_areas[XSAVE_STATE_AREA_COUNT] = {
     [XSTATE_FP_BIT] = {
         /* x87 FP state component is always enabled if XSAVE is supported */
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index bd9151d3bcaa..d0b7ba5d113e 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -588,6 +588,9 @@ typedef enum X86Seg {
                                  XSTATE_Hi16_ZMM_MASK | XSTATE_PKRU_MASK | \
                                  XSTATE_XTILE_CFG_MASK | XSTATE_XTILE_DATA_MASK)
 
+/* CPUID feature bits available in XSS */
+#define CPUID_XSTATE_XSS_MASK    (XSTATE_ARCH_LBR_MASK)
+
 /* CPUID feature words */
 typedef enum FeatureWord {
     FEAT_1_EDX,         /* CPUID[1].EDX */
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7fa86858de58..be7771bd97d7 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -400,6 +400,30 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
         entry->tdx_fixed0 &= ~config;
         entry->tdx_fixed1 &= ~config;
     }
+
+    /*
+     * Because KVM gets XFAM settings via CPUID leaves 0xD,  map
+     * tdx_caps->xfam_fixed{0, 1} into tdx_cpuid_lookup[].tdx_fixed{0, 1}.
+     *
+     * Then the enforcement applies in tdx_get_configurable_cpuid() naturally.
+     */
+    tdx_cpuid_lookup[FEAT_XSAVE_XCR0_LO].tdx_fixed0 =
+            (uint32_t)~tdx_caps->xfam_fixed0 & CPUID_XSTATE_XCR0_MASK;
+    tdx_cpuid_lookup[FEAT_XSAVE_XCR0_LO].tdx_fixed1 =
+            (uint32_t)tdx_caps->xfam_fixed1 & CPUID_XSTATE_XCR0_MASK;
+    tdx_cpuid_lookup[FEAT_XSAVE_XCR0_HI].tdx_fixed0 =
+            (~tdx_caps->xfam_fixed0 & CPUID_XSTATE_XCR0_MASK) >> 32;
+    tdx_cpuid_lookup[FEAT_XSAVE_XCR0_HI].tdx_fixed1 =
+            (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XCR0_MASK) >> 32;
+
+    tdx_cpuid_lookup[FEAT_XSAVE_XSS_LO].tdx_fixed0 =
+            (uint32_t)~tdx_caps->xfam_fixed0 & CPUID_XSTATE_XSS_MASK;
+    tdx_cpuid_lookup[FEAT_XSAVE_XSS_LO].tdx_fixed1 =
+            (uint32_t)tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK;
+    tdx_cpuid_lookup[FEAT_XSAVE_XSS_HI].tdx_fixed0 =
+            (~tdx_caps->xfam_fixed0 & CPUID_XSTATE_XSS_MASK) >> 32;
+    tdx_cpuid_lookup[FEAT_XSAVE_XSS_HI].tdx_fixed1 =
+            (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32;
 }
 
 int tdx_kvm_init(MachineState *ms, Error **errp)
-- 
2.34.1



* [PATCH v3 23/70] i386/tdx: Integrate tdx_caps->attrs_fixed0/1 to tdx_cpuid_lookup
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (21 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 22/70] i386/tdx: Integrate tdx_caps->xfam_fixed0/1 into tdx_cpuid_lookup Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 24/70] i386/kvm: Move architectural CPUID leaf generation to separate helper Xiaoyao Li
                   ` (46 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Some bits in TD attributes have corresponding CPUID feature bits. Reflect
the fixed0/1 restrictions on those TD attributes in the corresponding
CPUID bits of tdx_cpuid_lookup[] as well.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
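A minimal sketch of the table-driven mapping used here, reduced to a
standalone form. AttrMap, Lookup, attr_map and apply_attr_fixed() are
hypothetical names, and the word indices and mask values are illustrative
only. As in the patch, a set bit in attrs_fixed0/1 sets the mapped CPUID
mask in fixed0/1.

```c
#include <stddef.h>
#include <stdint.h>

#define ATTR_MAX_BITS 64

/* Hypothetical feature-word indices; the patch uses FeatureWord. */
enum { WORD_7_0_ECX, WORD_OTHER, WORD_COUNT };

typedef struct AttrMap {
    int word;        /* which feature word the attribute bit maps to */
    uint32_t mask;   /* CPUID bit(s) in that word; 0 means "unmapped" */
} AttrMap;

typedef struct Lookup {
    uint32_t fixed0;
    uint32_t fixed1;
} Lookup;

/* Sparse table indexed by attribute bit number (illustrative values). */
static const AttrMap attr_map[ATTR_MAX_BITS] = {
    [30] = { WORD_7_0_ECX, 1U << 31 },
    [31] = { WORD_7_0_ECX, 1U << 23 },
};

static void apply_attr_fixed(uint64_t attrs_fixed0, uint64_t attrs_fixed1,
                             Lookup *lookup)
{
    for (size_t i = 0; i < ATTR_MAX_BITS; i++) {
        const AttrMap *m = &attr_map[i];

        if (m->mask == 0) {
            continue;   /* attribute bit has no CPUID counterpart */
        }
        if (attrs_fixed0 & (1ULL << i)) {
            lookup[m->word].fixed0 |= m->mask;
        }
        if (attrs_fixed1 & (1ULL << i)) {
            lookup[m->word].fixed1 |= m->mask;
        }
    }
}
```

The patch does not skip unmapped slots explicitly; zero-initialized
entries carry a zero mask there, so OR-ing them in changes nothing.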
 target/i386/cpu-internal.h |  9 +++++++++
 target/i386/cpu.c          |  9 ---------
 target/i386/cpu.h          |  2 ++
 target/i386/kvm/tdx.c      | 21 +++++++++++++++++++++
 4 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/target/i386/cpu-internal.h b/target/i386/cpu-internal.h
index 9baac5c0b450..e980f6e3147f 100644
--- a/target/i386/cpu-internal.h
+++ b/target/i386/cpu-internal.h
@@ -20,6 +20,15 @@
 #ifndef I386_CPU_INTERNAL_H
 #define I386_CPU_INTERNAL_H
 
+typedef struct FeatureMask {
+    FeatureWord index;
+    uint64_t mask;
+} FeatureMask;
+
+typedef struct FeatureDep {
+    FeatureMask from, to;
+} FeatureDep;
+
 typedef enum FeatureWordType {
    CPUID_FEATURE_WORD,
    MSR_FEATURE_WORD,
diff --git a/target/i386/cpu.c b/target/i386/cpu.c
index 128b01054ff3..e66b7a8b7b8d 100644
--- a/target/i386/cpu.c
+++ b/target/i386/cpu.c
@@ -1442,15 +1442,6 @@ FeatureWordInfo feature_word_info[FEATURE_WORDS] = {
     },
 };
 
-typedef struct FeatureMask {
-    FeatureWord index;
-    uint64_t mask;
-} FeatureMask;
-
-typedef struct FeatureDep {
-    FeatureMask from, to;
-} FeatureDep;
-
 static FeatureDep feature_dependencies[] = {
     {
         .from = { FEAT_7_0_EDX,             CPUID_7_0_EDX_ARCH_CAPABILITIES },
diff --git a/target/i386/cpu.h b/target/i386/cpu.h
index d0b7ba5d113e..23265d890074 100644
--- a/target/i386/cpu.h
+++ b/target/i386/cpu.h
@@ -876,6 +876,8 @@ uint64_t x86_cpu_get_supported_feature_word(FeatureWord w,
 #define CPUID_7_0_ECX_MAWAU             (31U << 17)
 /* Read Processor ID */
 #define CPUID_7_0_ECX_RDPID             (1U << 22)
+/* KeyLocker */
+#define CPUID_7_0_ECX_KeyLocker         (1U << 23)
 /* Bus Lock Debug Exception */
 #define CPUID_7_0_ECX_BUS_LOCK_DETECT   (1U << 24)
 /* Cache Line Demote Instruction */
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index be7771bd97d7..1f5d8117d1a9 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -32,6 +32,13 @@
                                      (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
                                      (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_ATTRIBUTES_MAX_BITS      64
+
+static FeatureMask tdx_attrs_ctrl_fields[TDX_ATTRIBUTES_MAX_BITS] = {
+    [30] = { .index = FEAT_7_0_ECX, .mask = CPUID_7_0_ECX_PKS },
+    [31] = { .index = FEAT_7_0_ECX, .mask = CPUID_7_0_ECX_KeyLocker},
+};
+
 typedef struct KvmTdxCpuidLookup {
     uint32_t tdx_fixed0;
     uint32_t tdx_fixed1;
@@ -380,6 +387,8 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
     FeatureWordInfo *fi;
     uint32_t config;
     FeatureWord w;
+    FeatureMask *fm;
+    int i;
 
     /*
      * Patch tdx_fixed0/1 by tdx_caps that what TDX module reports as
@@ -401,6 +410,18 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
         entry->tdx_fixed1 &= ~config;
     }
 
+    for (i = 0; i < ARRAY_SIZE(tdx_attrs_ctrl_fields); i++) {
+        fm = &tdx_attrs_ctrl_fields[i];
+
+        if (tdx_caps->attrs_fixed0 & (1ULL << i)) {
+            tdx_cpuid_lookup[fm->index].tdx_fixed0 |= fm->mask;
+        }
+
+        if (tdx_caps->attrs_fixed1 & (1ULL << i)) {
+            tdx_cpuid_lookup[fm->index].tdx_fixed1 |= fm->mask;
+        }
+    }
+
     /*
      * Because KVM gets XFAM settings via CPUID leaves 0xD,  map
      * tdx_caps->xfam_fixed{0, 1} into tdx_cpuid_lookup[].tdx_fixed{0, 1}.
-- 
2.34.1



* [PATCH v3 24/70] i386/kvm: Move architectural CPUID leaf generation to separate helper
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (22 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 23/70] i386/tdx: Integrate tdx_caps->attrs_fixed0/1 to tdx_cpuid_lookup Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 25/70] kvm: Introduce kvm_arch_pre_create_vcpu() Xiaoyao Li
                   ` (45 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Sean Christopherson <sean.j.christopherson@intel.com>

Move the architectural (for lack of a better term) CPUID leaf generation
to a separate helper so that the generation code can be reused by TDX,
which needs to generate a canonical VM-scoped configuration.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
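A minimal sketch of the calling convention the new helper follows: the
caller passes the entry array plus the current fill index, and gets the
updated index back. Entry, fake_leaf_value() and append_leaves() are
hypothetical stand-ins, not QEMU code. Unlike the real helper, the sketch
stops at the capacity limit instead of calling abort().

```c
#include <stdint.h>

#define MAX_ENTRIES 8

/* Hypothetical, much reduced stand-in for struct kvm_cpuid_entry2. */
typedef struct Entry {
    uint32_t function;
    uint32_t eax;
} Entry;

/* Hypothetical stand-in for cpu_x86_cpuid(): leaf 2 reads as all zeroes. */
static uint32_t fake_leaf_value(uint32_t function)
{
    return function == 2 ? 0 : function + 0x10;
}

/*
 * Append leaves 0..limit to entries[], starting at index idx.
 * Returns the index one past the last entry written.
 */
static uint32_t append_leaves(Entry *entries, uint32_t idx, uint32_t limit)
{
    for (uint32_t fn = 0; fn <= limit; fn++) {
        uint32_t eax = fake_leaf_value(fn);

        if (eax == 0) {
            continue;   /* omit all-zero leaves to save slots */
        }
        if (idx == MAX_ENTRIES) {
            return idx; /* array full; the real helper aborts here */
        }
        entries[idx].function = fn;
        entries[idx].eax = eax;
        idx++;
    }
    return idx;
}
```

A per-vCPU caller that has already filled some entries would pass its
current index, while a VM-scoped caller building a fresh list would pass 0.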
 target/i386/kvm/kvm.c      | 454 +++++++++++++++++++------------------
 target/i386/kvm/kvm_i386.h |   3 +
 2 files changed, 235 insertions(+), 222 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index f2627dd61d2b..dafe4d262977 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -1733,6 +1733,236 @@ static void kvm_init_nested_state(CPUX86State *env)
     }
 }
 
+uint32_t kvm_x86_arch_cpuid(CPUX86State *env, struct kvm_cpuid_entry2 *entries,
+                            uint32_t cpuid_i)
+{
+    uint32_t limit, i, j;
+    uint32_t unused;
+    struct kvm_cpuid_entry2 *c;
+
+    cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
+
+    for (i = 0; i <= limit; i++) {
+        if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+            fprintf(stderr, "unsupported level value: 0x%x\n", limit);
+            abort();
+        }
+        c = &entries[cpuid_i++];
+
+        switch (i) {
+        case 2: {
+            /* Keep reading function 2 till all the input is received */
+            int times;
+
+            c->function = i;
+            c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
+                       KVM_CPUID_FLAG_STATE_READ_NEXT;
+            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+            times = c->eax & 0xff;
+
+            for (j = 1; j < times; ++j) {
+                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+                    fprintf(stderr, "cpuid_data is full, no space for "
+                            "cpuid(eax:2):eax & 0xf = 0x%x\n", times);
+                    abort();
+                }
+                c = &entries[cpuid_i++];
+                c->function = i;
+                c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
+                cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+            }
+            break;
+        }
+        case 0x1f:
+            if (env->nr_dies < 2) {
+                cpuid_i--;
+                break;
+            }
+            /* fallthrough */
+        case 4:
+        case 0xb:
+        case 0xd:
+            for (j = 0; ; j++) {
+                if (i == 0xd && j == 64) {
+                    break;
+                }
+
+                c->function = i;
+                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+                c->index = j;
+                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+
+                if (i == 4 && c->eax == 0) {
+                    break;
+                }
+                if (i == 0xb && !(c->ecx & 0xff00)) {
+                    break;
+                }
+                if (i == 0x1f && !(c->ecx & 0xff00)) {
+                    break;
+                }
+                if (i == 0xd && c->eax == 0) {
+                    continue;
+                }
+                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+                    fprintf(stderr, "cpuid_data is full, no space for "
+                            "cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
+                    abort();
+                }
+                c = &entries[cpuid_i++];
+            }
+            break;
+        case 0x7:
+        case 0x12:
+            for (j = 0; ; j++) {
+                c->function = i;
+                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+                c->index = j;
+                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+
+                if (j > 1 && (c->eax & 0xf) != 1) {
+                    break;
+                }
+
+                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+                    fprintf(stderr, "cpuid_data is full, no space for "
+                                "cpuid(eax:0x12,ecx:0x%x)\n", j);
+                    abort();
+                }
+                c = &entries[cpuid_i++];
+            }
+            break;
+        case 0x14:
+        case 0x1d:
+        case 0x1e: {
+            uint32_t times;
+
+            c->function = i;
+            c->index = 0;
+            c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+            times = c->eax;
+
+            for (j = 1; j <= times; ++j) {
+                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+                    fprintf(stderr, "cpuid_data is full, no space for "
+                                "cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
+                    abort();
+                }
+                c = &entries[cpuid_i++];
+                c->function = i;
+                c->index = j;
+                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+            }
+            break;
+        }
+        default:
+            c->function = i;
+            c->flags = 0;
+            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+            if (!c->eax && !c->ebx && !c->ecx && !c->edx) {
+                /*
+                 * KVM already returns all zeroes if a CPUID entry is missing,
+                 * so we can omit it and avoid hitting KVM's 80-entry limit.
+                 */
+                cpuid_i--;
+            }
+            break;
+        }
+    }
+
+    if (limit >= 0x0a) {
+        uint32_t eax, edx;
+
+        cpu_x86_cpuid(env, 0x0a, 0, &eax, &unused, &unused, &edx);
+
+        has_architectural_pmu_version = eax & 0xff;
+        if (has_architectural_pmu_version > 0) {
+            num_architectural_pmu_gp_counters = (eax & 0xff00) >> 8;
+
+            /* Shouldn't be more than 32, since that's the number of bits
+             * available in EBX to tell us _which_ counters are available.
+             * Play it safe.
+             */
+            if (num_architectural_pmu_gp_counters > MAX_GP_COUNTERS) {
+                num_architectural_pmu_gp_counters = MAX_GP_COUNTERS;
+            }
+
+            if (has_architectural_pmu_version > 1) {
+                num_architectural_pmu_fixed_counters = edx & 0x1f;
+
+                if (num_architectural_pmu_fixed_counters > MAX_FIXED_COUNTERS) {
+                    num_architectural_pmu_fixed_counters = MAX_FIXED_COUNTERS;
+                }
+            }
+        }
+    }
+
+    cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
+
+    for (i = 0x80000000; i <= limit; i++) {
+        if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+            fprintf(stderr, "unsupported xlevel value: 0x%x\n", limit);
+            abort();
+        }
+        c = &entries[cpuid_i++];
+
+        switch (i) {
+        case 0x8000001d:
+            /* Query for all AMD cache information leaves */
+            for (j = 0; ; j++) {
+                c->function = i;
+                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
+                c->index = j;
+                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
+
+                if (c->eax == 0) {
+                    break;
+                }
+                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+                    fprintf(stderr, "cpuid_data is full, no space for "
+                            "cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
+                    abort();
+                }
+                c = &entries[cpuid_i++];
+            }
+            break;
+        default:
+            c->function = i;
+            c->flags = 0;
+            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+            if (!c->eax && !c->ebx && !c->ecx && !c->edx) {
+                /*
+                 * KVM already returns all zeroes if a CPUID entry is missing,
+                 * so we can omit it and avoid hitting KVM's 80-entry limit.
+                 */
+                cpuid_i--;
+            }
+            break;
+        }
+    }
+
+    /* Call Centaur's CPUID instructions they are supported. */
+    if (env->cpuid_xlevel2 > 0) {
+        cpu_x86_cpuid(env, 0xC0000000, 0, &limit, &unused, &unused, &unused);
+
+        for (i = 0xC0000000; i <= limit; i++) {
+            if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
+                fprintf(stderr, "unsupported xlevel2 value: 0x%x\n", limit);
+                abort();
+            }
+            c = &entries[cpuid_i++];
+
+            c->function = i;
+            c->flags = 0;
+            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
+        }
+    }
+
+    return cpuid_i;
+}
+
 int kvm_arch_init_vcpu(CPUState *cs)
 {
     struct {
@@ -1749,8 +1979,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
 
     X86CPU *cpu = X86_CPU(cs);
     CPUX86State *env = &cpu->env;
-    uint32_t limit, i, j, cpuid_i;
-    uint32_t unused;
+    uint32_t cpuid_i;
     struct kvm_cpuid_entry2 *c;
     uint32_t signature[3];
     int kvm_base = KVM_CPUID_SIGNATURE;
@@ -1903,8 +2132,6 @@ int kvm_arch_init_vcpu(CPUState *cs)
         c->edx = env->features[FEAT_KVM_HINTS];
     }
 
-    cpu_x86_cpuid(env, 0, 0, &limit, &unused, &unused, &unused);
-
     if (cpu->kvm_pv_enforce_cpuid) {
         r = kvm_vcpu_enable_cap(cs, KVM_CAP_ENFORCE_PV_FEATURE_CPUID, 0, 1);
         if (r < 0) {
@@ -1915,224 +2142,7 @@ int kvm_arch_init_vcpu(CPUState *cs)
         }
     }
 
-    for (i = 0; i <= limit; i++) {
-        if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-            fprintf(stderr, "unsupported level value: 0x%x\n", limit);
-            abort();
-        }
-        c = &cpuid_data.entries[cpuid_i++];
-
-        switch (i) {
-        case 2: {
-            /* Keep reading function 2 till all the input is received */
-            int times;
-
-            c->function = i;
-            c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC |
-                       KVM_CPUID_FLAG_STATE_READ_NEXT;
-            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
-            times = c->eax & 0xff;
-
-            for (j = 1; j < times; ++j) {
-                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-                    fprintf(stderr, "cpuid_data is full, no space for "
-                            "cpuid(eax:2):eax & 0xf = 0x%x\n", times);
-                    abort();
-                }
-                c = &cpuid_data.entries[cpuid_i++];
-                c->function = i;
-                c->flags = KVM_CPUID_FLAG_STATEFUL_FUNC;
-                cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
-            }
-            break;
-        }
-        case 0x1f:
-            if (env->nr_dies < 2) {
-                cpuid_i--;
-                break;
-            }
-            /* fallthrough */
-        case 4:
-        case 0xb:
-        case 0xd:
-            for (j = 0; ; j++) {
-                if (i == 0xd && j == 64) {
-                    break;
-                }
-
-                c->function = i;
-                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-                c->index = j;
-                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
-
-                if (i == 4 && c->eax == 0) {
-                    break;
-                }
-                if (i == 0xb && !(c->ecx & 0xff00)) {
-                    break;
-                }
-                if (i == 0x1f && !(c->ecx & 0xff00)) {
-                    break;
-                }
-                if (i == 0xd && c->eax == 0) {
-                    continue;
-                }
-                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-                    fprintf(stderr, "cpuid_data is full, no space for "
-                            "cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
-                    abort();
-                }
-                c = &cpuid_data.entries[cpuid_i++];
-            }
-            break;
-        case 0x7:
-        case 0x12:
-            for (j = 0; ; j++) {
-                c->function = i;
-                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-                c->index = j;
-                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
-
-                if (j > 1 && (c->eax & 0xf) != 1) {
-                    break;
-                }
-
-                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-                    fprintf(stderr, "cpuid_data is full, no space for "
-                                "cpuid(eax:0x12,ecx:0x%x)\n", j);
-                    abort();
-                }
-                c = &cpuid_data.entries[cpuid_i++];
-            }
-            break;
-        case 0x14:
-        case 0x1d:
-        case 0x1e: {
-            uint32_t times;
-
-            c->function = i;
-            c->index = 0;
-            c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
-            times = c->eax;
-
-            for (j = 1; j <= times; ++j) {
-                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-                    fprintf(stderr, "cpuid_data is full, no space for "
-                                "cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
-                    abort();
-                }
-                c = &cpuid_data.entries[cpuid_i++];
-                c->function = i;
-                c->index = j;
-                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
-            }
-            break;
-        }
-        default:
-            c->function = i;
-            c->flags = 0;
-            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
-            if (!c->eax && !c->ebx && !c->ecx && !c->edx) {
-                /*
-                 * KVM already returns all zeroes if a CPUID entry is missing,
-                 * so we can omit it and avoid hitting KVM's 80-entry limit.
-                 */
-                cpuid_i--;
-            }
-            break;
-        }
-    }
-
-    if (limit >= 0x0a) {
-        uint32_t eax, edx;
-
-        cpu_x86_cpuid(env, 0x0a, 0, &eax, &unused, &unused, &edx);
-
-        has_architectural_pmu_version = eax & 0xff;
-        if (has_architectural_pmu_version > 0) {
-            num_architectural_pmu_gp_counters = (eax & 0xff00) >> 8;
-
-            /* Shouldn't be more than 32, since that's the number of bits
-             * available in EBX to tell us _which_ counters are available.
-             * Play it safe.
-             */
-            if (num_architectural_pmu_gp_counters > MAX_GP_COUNTERS) {
-                num_architectural_pmu_gp_counters = MAX_GP_COUNTERS;
-            }
-
-            if (has_architectural_pmu_version > 1) {
-                num_architectural_pmu_fixed_counters = edx & 0x1f;
-
-                if (num_architectural_pmu_fixed_counters > MAX_FIXED_COUNTERS) {
-                    num_architectural_pmu_fixed_counters = MAX_FIXED_COUNTERS;
-                }
-            }
-        }
-    }
-
-    cpu_x86_cpuid(env, 0x80000000, 0, &limit, &unused, &unused, &unused);
-
-    for (i = 0x80000000; i <= limit; i++) {
-        if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-            fprintf(stderr, "unsupported xlevel value: 0x%x\n", limit);
-            abort();
-        }
-        c = &cpuid_data.entries[cpuid_i++];
-
-        switch (i) {
-        case 0x8000001d:
-            /* Query for all AMD cache information leaves */
-            for (j = 0; ; j++) {
-                c->function = i;
-                c->flags = KVM_CPUID_FLAG_SIGNIFCANT_INDEX;
-                c->index = j;
-                cpu_x86_cpuid(env, i, j, &c->eax, &c->ebx, &c->ecx, &c->edx);
-
-                if (c->eax == 0) {
-                    break;
-                }
-                if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-                    fprintf(stderr, "cpuid_data is full, no space for "
-                            "cpuid(eax:0x%x,ecx:0x%x)\n", i, j);
-                    abort();
-                }
-                c = &cpuid_data.entries[cpuid_i++];
-            }
-            break;
-        default:
-            c->function = i;
-            c->flags = 0;
-            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
-            if (!c->eax && !c->ebx && !c->ecx && !c->edx) {
-                /*
-                 * KVM already returns all zeroes if a CPUID entry is missing,
-                 * so we can omit it and avoid hitting KVM's 80-entry limit.
-                 */
-                cpuid_i--;
-            }
-            break;
-        }
-    }
-
-    /* Call Centaur's CPUID instructions they are supported. */
-    if (env->cpuid_xlevel2 > 0) {
-        cpu_x86_cpuid(env, 0xC0000000, 0, &limit, &unused, &unused, &unused);
-
-        for (i = 0xC0000000; i <= limit; i++) {
-            if (cpuid_i == KVM_MAX_CPUID_ENTRIES) {
-                fprintf(stderr, "unsupported xlevel2 value: 0x%x\n", limit);
-                abort();
-            }
-            c = &cpuid_data.entries[cpuid_i++];
-
-            c->function = i;
-            c->flags = 0;
-            cpu_x86_cpuid(env, i, 0, &c->eax, &c->ebx, &c->ecx, &c->edx);
-        }
-    }
-
+    cpuid_i = kvm_x86_arch_cpuid(env, cpuid_data.entries, cpuid_i);
     cpuid_data.cpuid.nent = cpuid_i;
 
     if (((env->cpuid_version >> 8)&0xF) >= 6
diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
index c3ef46a97a7b..cbf52c1c6d17 100644
--- a/target/i386/kvm/kvm_i386.h
+++ b/target/i386/kvm/kvm_i386.h
@@ -24,6 +24,9 @@
 #define kvm_ioapic_in_kernel() \
     (kvm_irqchip_in_kernel() && !kvm_irqchip_is_split())
 
+uint32_t kvm_x86_arch_cpuid(CPUX86State *env, struct kvm_cpuid_entry2 *entries,
+                            uint32_t cpuid_i);
+
 #else
 
 #define kvm_pit_in_kernel()      0
-- 
2.34.1



* [PATCH v3 25/70] kvm: Introduce kvm_arch_pre_create_vcpu()
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (23 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 24/70] i386/kvm: Move architectural CPUID leaf generation to separate helper Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 26/70] i386/tdx: Initialize TDX before creating TD vcpus Xiaoyao Li
                   ` (44 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce kvm_arch_pre_create_vcpu() to perform arch-dependent work
before any vCPU is created. i386 TDX needs this hook because it must
call KVM_TDX_INIT_VM before creating any vCPU.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v3:
- pass @errp to kvm_arch_pre_create_vcpu(); (Per Daniel)
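
A minimal sketch of the weak-default hook pattern used here, assuming a
GCC- or Clang-compatible toolchain for __attribute__((weak)).
arch_pre_create_vcpu(), init_vcpu() and vcpus_created are hypothetical
names, and the real hook also takes an Error **errp.

```c
/* Weak default: architectures with no pre-create work get a no-op. */
int __attribute__((weak)) arch_pre_create_vcpu(int vcpu_id)
{
    (void)vcpu_id;
    return 0;
}

static int vcpus_created;

/* Generic path: run the arch hook first, create the vCPU only on success. */
int init_vcpu(int vcpu_id)
{
    int ret = arch_pre_create_vcpu(vcpu_id);

    if (ret < 0) {
        return ret;     /* hook failed: no vCPU is created */
    }
    vcpus_created++;    /* stands in for the KVM_CREATE_VCPU step */
    return 0;
}
```

An architecture that needs the hook provides a non-weak definition of the
same symbol in its own file, and the linker picks that one.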
---
 accel/kvm/kvm-all.c  | 10 ++++++++++
 include/sysemu/kvm.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 082f31446c97..6b5f4d62f961 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -428,6 +428,11 @@ static int kvm_get_vcpu(KVMState *s, unsigned long vcpu_id)
     return kvm_vm_ioctl(s, KVM_CREATE_VCPU, (void *)vcpu_id);
 }
 
+int __attribute__ ((weak)) kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+    return 0;
+}
+
 int kvm_init_vcpu(CPUState *cpu, Error **errp)
 {
     KVMState *s = kvm_state;
@@ -436,6 +441,11 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 
     trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
+    ret = kvm_arch_pre_create_vcpu(cpu, errp);
+    if (ret < 0) {
+        goto err;
+    }
+
     ret = kvm_get_vcpu(s, kvm_arch_vcpu_id(cpu));
     if (ret < 0) {
         error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 0e88958190a4..2f6592859ac6 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -341,6 +341,7 @@ int kvm_arch_get_default_type(MachineState *ms);
 
 int kvm_arch_init(MachineState *ms, KVMState *s);
 
+int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp);
 int kvm_arch_init_vcpu(CPUState *cpu);
 int kvm_arch_destroy_vcpu(CPUState *cpu);
 
-- 
2.34.1



* [PATCH v3 26/70] i386/tdx: Initialize TDX before creating TD vcpus
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (24 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 25/70] kvm: Introduce kvm_arch_pre_create_vcpu() Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 11:01   ` Daniel P. Berrangé
  2023-11-15  7:14 ` [PATCH v3 27/70] i386/tdx: Add property sept-ve-disable for tdx-guest object Xiaoyao Li
                   ` (43 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Invoke KVM_TDX_INIT_VM in kvm_arch_pre_create_vcpu(). KVM_TDX_INIT_VM
configures global TD settings, e.g. the canonical CPUID config, and must
be executed before any vCPU is created.

Use kvm_x86_arch_cpuid() to set up the CPUID settings for the TDX VM.

Note, this doesn't address the fact that QEMU may change the CPUID
configuration when creating vCPUs, i.e. punts on refactoring QEMU to
provide a stable CPUID config prior to kvm_arch_init().

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v3:
- Pass @errp in tdx_pre_create_vcpu() and pass error info to it. (Daniel)
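
A minimal sketch of the once-only, VM-scoped initialization that every
vCPU's pre-create hook funnels into, including the retry on -EAGAIN.
It uses a pthread mutex in place of QemuMutex. VmScope, fake_init_vm()
and pre_create_vcpu() are hypothetical names, and fake_init_vm() only
imitates an ioctl that is busy twice before succeeding.

```c
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

typedef struct VmScope {
    pthread_mutex_t lock;
    bool initialized;
} VmScope;

static int fake_init_calls;

/* Hypothetical stand-in for the VM-scoped ioctl: busy twice, then OK. */
static int fake_init_vm(void)
{
    fake_init_calls++;
    return fake_init_calls < 3 ? -EAGAIN : 0;
}

/* Called once per vCPU; only the first successful call does the work. */
static int pre_create_vcpu(VmScope *vm)
{
    int r = 0;

    pthread_mutex_lock(&vm->lock);
    if (!vm->initialized) {
        do {
            r = fake_init_vm();
        } while (r == -EAGAIN);

        if (r == 0) {
            vm->initialized = true;
        }
    }
    pthread_mutex_unlock(&vm->lock);
    return r;
}
```

The flag is set only after a successful call, so a failed attempt leaves
the VM uninitialized and a later vCPU would attempt the initialization again.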
---
 accel/kvm/kvm-all.c        |  9 +++++++-
 target/i386/kvm/kvm.c      |  9 ++++++++
 target/i386/kvm/tdx-stub.c |  5 +++++
 target/i386/kvm/tdx.c      | 45 ++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.h      |  4 ++++
 5 files changed, 71 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 6b5f4d62f961..a92fff471b58 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -441,8 +441,15 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
 
     trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
 
+    /*
+     * tdx_pre_create_vcpu() may call cpu_x86_cpuid(). It in turn may call
+     * kvm_vm_ioctl(). Set cpu->kvm_state in advance to avoid NULL pointer
+     * dereference.
+     */
+    cpu->kvm_state = s;
     ret = kvm_arch_pre_create_vcpu(cpu, errp);
     if (ret < 0) {
+        cpu->kvm_state = NULL;
         goto err;
     }
 
@@ -450,11 +457,11 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
     if (ret < 0) {
         error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
                          kvm_arch_vcpu_id(cpu));
+        cpu->kvm_state = NULL;
         goto err;
     }
 
     cpu->kvm_fd = ret;
-    cpu->kvm_state = s;
     cpu->vcpu_dirty = true;
     cpu->dirty_pages = 0;
     cpu->throttle_us_per_full = 0;
diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index dafe4d262977..fc840653ceb6 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -2268,6 +2268,15 @@ int kvm_arch_init_vcpu(CPUState *cs)
     return r;
 }
 
+int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+    if (is_tdx_vm()) {
+        return tdx_pre_create_vcpu(cpu, errp);
+    }
+
+    return 0;
+}
+
 int kvm_arch_destroy_vcpu(CPUState *cs)
 {
     X86CPU *cpu = X86_CPU(cs);
diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
index 1d866d5496bf..3877d432a397 100644
--- a/target/i386/kvm/tdx-stub.c
+++ b/target/i386/kvm/tdx-stub.c
@@ -6,3 +6,8 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
 {
     return -EINVAL;
 }
+
+int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+    return -EINVAL;
+}
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 1f5d8117d1a9..122a37c93de3 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -467,6 +467,49 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
     return 0;
 }
 
+int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    X86CPU *x86cpu = X86_CPU(cpu);
+    CPUX86State *env = &x86cpu->env;
+    struct kvm_tdx_init_vm *init_vm;
+    int r = 0;
+
+    qemu_mutex_lock(&tdx_guest->lock);
+    if (tdx_guest->initialized) {
+        goto out;
+    }
+
+    init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
+                        sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES);
+
+    r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
+    if (r < 0) {
+        error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
+        goto out_free;
+    }
+
+    init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
+
+    init_vm->attributes = tdx_guest->attributes;
+
+    do {
+        r = tdx_vm_ioctl(KVM_TDX_INIT_VM, 0, init_vm);
+    } while (r == -EAGAIN);
+    if (r < 0) {
+        error_setg_errno(errp, -r, "KVM_TDX_INIT_VM failed");
+        goto out_free;
+    }
+
+    tdx_guest->initialized = true;
+
+out_free:
+    g_free(init_vm);
+out:
+    qemu_mutex_unlock(&tdx_guest->lock);
+    return r;
+}
+
 /* tdx guest */
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
                                    tdx_guest,
@@ -479,6 +522,8 @@ static void tdx_guest_init(Object *obj)
 {
     TdxGuest *tdx = TDX_GUEST(obj);
 
+    qemu_mutex_init(&tdx->lock);
+
     tdx->attributes = 0;
 }
 
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 06599b65b827..432077723ac5 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -17,6 +17,9 @@ typedef struct TdxGuestClass {
 typedef struct TdxGuest {
     ConfidentialGuestSupport parent_obj;
 
+    QemuMutex lock;
+
+    bool initialized;
     uint64_t attributes;    /* TD attributes */
 } TdxGuest;
 
@@ -29,5 +32,6 @@ bool is_tdx_vm(void);
 int tdx_kvm_init(MachineState *ms, Error **errp);
 void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
                              uint32_t *ret);
+int tdx_pre_create_vcpu(CPUState *cpu, Error **errp);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 27/70] i386/tdx: Add property sept-ve-disable for tdx-guest object
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (25 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 26/70] i386/tdx: Initialize TDX before creating TD vcpus Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-12-01 10:53   ` Markus Armbruster
  2023-11-15  7:14 ` [PATCH v3 28/70] i386/tdx: Make sept_ve_disable set by default Xiaoyao Li
                   ` (42 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Bit 28 of the TD attributes is named SEPT_VE_DISABLE. When set to 1, it
disables EPT violation conversion to #VE on guest TD access of PENDING
pages.

Some guest OSes (e.g., Linux TD guests) require this bit to be 1 and
refuse to boot otherwise.

Add a sept-ve-disable property to the tdx-guest object so that the user
can configure this bit.
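
As a rough illustration of what the property accessors amount to, the
sketch below sets, clears and queries bit 28 of a 64-bit attributes word.
The helper names are made up for this example and are not part of the
patch; only the bit position comes from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 28 of the TD attributes, as described above. */
#define SEPT_VE_DISABLE_BIT (UINT64_C(1) << 28)

/* Hypothetical helper: return attrs with SEPT_VE_DISABLE set or cleared. */
uint64_t attrs_set_sept_ve_disable(uint64_t attrs, bool value)
{
    return value ? (attrs | SEPT_VE_DISABLE_BIT)
                 : (attrs & ~SEPT_VE_DISABLE_BIT);
}

/* Hypothetical helper: report whether SEPT_VE_DISABLE is set in attrs. */
bool attrs_get_sept_ve_disable(uint64_t attrs)
{
    return (attrs & SEPT_VE_DISABLE_BIT) != 0;
}
```

All other attribute bits pass through unchanged, which matches the intent
of a per-bit boolean property.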

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v3:
- update the comment of property @sept-ve-disable to make it more
  descriptive and use new format. (Daniel and Markus)
---
 qapi/qom.json         |  7 ++++++-
 target/i386/kvm/tdx.c | 24 ++++++++++++++++++++++++
 2 files changed, 30 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 8e08257dac2f..3a29659e0155 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -883,10 +883,15 @@
 #
 # Properties for tdx-guest objects.
 #
+# @sept-ve-disable: toggle bit 28 of TD attributes to control disabling
+#     of EPT violation conversion to #VE on guest TD access of PENDING
+#     pages.  Some guest OS (e.g., Linux TD guest) may require this to
+#     be set, otherwise they refuse to boot.
+#
 # Since: 8.2
 ##
 { 'struct': 'TdxGuestProperties',
-  'data': { }}
+  'data': { '*sept-ve-disable': 'bool' } }
 
 ##
 # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 122a37c93de3..6b9dca03ded5 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -32,6 +32,8 @@
                                      (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
                                      (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
+
 #define TDX_ATTRIBUTES_MAX_BITS      64
 
 static FeatureMask tdx_attrs_ctrl_fields[TDX_ATTRIBUTES_MAX_BITS] = {
@@ -510,6 +512,24 @@ out:
     return r;
 }
 
+static bool tdx_guest_get_sept_ve_disable(Object *obj, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    return !!(tdx->attributes & TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE);
+}
+
+static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    if (value) {
+        tdx->attributes |= TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+    } else {
+        tdx->attributes &= ~TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
+    }
+}
+
 /* tdx guest */
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
                                    tdx_guest,
@@ -525,6 +545,10 @@ static void tdx_guest_init(Object *obj)
     qemu_mutex_init(&tdx->lock);
 
     tdx->attributes = 0;
+
+    object_property_add_bool(obj, "sept-ve-disable",
+                             tdx_guest_get_sept_ve_disable,
+                             tdx_guest_set_sept_ve_disable);
 }
 
 static void tdx_guest_finalize(Object *obj)
-- 
2.34.1



* [PATCH v3 28/70] i386/tdx: Make sept_ve_disable set by default
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (26 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 27/70] i386/tdx: Add property sept-ve-disable for tdx-guest object Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 29/70] i386/tdx: Wire CPU features up with attributes of TD guest Xiaoyao Li
                   ` (41 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

For the TDX KVM use case, Linux is the most common guest.  It requires
sept_ve_disable to be set, so set it by default.  For other use cases,
it can be enabled or disabled via the QEMU command line.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
---
 target/i386/kvm/tdx.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 6b9dca03ded5..7d2b1da85951 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -544,7 +544,7 @@ static void tdx_guest_init(Object *obj)
 
     qemu_mutex_init(&tdx->lock);
 
-    tdx->attributes = 0;
+    tdx->attributes = TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE;
 
     object_property_add_bool(obj, "sept-ve-disable",
                              tdx_guest_get_sept_ve_disable,
-- 
2.34.1



* [PATCH v3 29/70] i386/tdx: Wire CPU features up with attributes of TD guest
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (27 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 28/70] i386/tdx: Make sept_ve_disable set by default Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 30/70] i386/tdx: Validate TD attributes Xiaoyao Li
                   ` (40 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

For QEMU VMs, PKS is configured via CPUID_7_0_ECX_PKS and PMU is
configured by x86cpu->enable_pmu. Reuse the existing configuration
interface for TDX VMs.
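
A minimal sketch of the mapping, assuming the attribute bit positions used
by the defines in this patch (PKS is bit 30, PERFMON is bit 63) and the
architectural position of PKS in CPUID.(EAX=7,ECX=0):ECX (bit 31). The
function name and its flat parameters are hypothetical; the real code reads
these values from the X86CPU object.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_CPUID_7_0_ECX_PKS   (UINT32_C(1) << 31)
#define SKETCH_TD_ATTR_PKS         (UINT64_C(1) << 30)
#define SKETCH_TD_ATTR_PERFMON     (UINT64_C(1) << 63)

/*
 * Hypothetical helper: derive the PKS and PERFMON attribute bits from the
 * CPU configuration and OR them into the attributes passed in.
 */
uint64_t td_attrs_from_cpu(uint64_t attrs, uint32_t feat_7_0_ecx,
                           bool enable_pmu)
{
    if (feat_7_0_ecx & SKETCH_CPUID_7_0_ECX_PKS) {
        attrs |= SKETCH_TD_ATTR_PKS;
    }
    if (enable_pmu) {
        attrs |= SKETCH_TD_ATTR_PERFMON;
    }
    return attrs;
}
```

Bits are only ever added here, so attributes configured through other
properties are preserved.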

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/tdx.c | 14 +++++++++++++-
 1 file changed, 13 insertions(+), 1 deletion(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 7d2b1da85951..bb10331e2a88 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -33,6 +33,8 @@
                                      (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
+#define TDX_TD_ATTRIBUTES_PKS               BIT_ULL(30)
+#define TDX_TD_ATTRIBUTES_PERFMON           BIT_ULL(63)
 
 #define TDX_ATTRIBUTES_MAX_BITS      64
 
@@ -469,6 +471,15 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
     return 0;
 }
 
+static void setup_td_guest_attributes(X86CPU *x86cpu)
+{
+    CPUX86State *env = &x86cpu->env;
+
+    tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) ?
+                             TDX_TD_ATTRIBUTES_PKS : 0;
+    tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 0;
+}
+
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 {
     MachineState *ms = MACHINE(qdev_get_machine());
@@ -491,8 +502,9 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
         goto out_free;
     }
 
+    setup_td_guest_attributes(x86cpu);
+
     init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
-
     init_vm->attributes = tdx_guest->attributes;
 
     do {
-- 
2.34.1



* [PATCH v3 30/70] i386/tdx: Validate TD attributes
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (28 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 29/70] i386/tdx: Wire CPU features up with attributes of TD guest Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM Xiaoyao Li
                   ` (39 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Validate TD attributes against tdx_caps: fixed-0 bits must be zero and
fixed-1 bits must be set.

Besides, reject the attribute bits that QEMU does not support yet, e.g.,
the debug bit. It will be allowed in the future when debug TD support
lands in QEMU.
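
The fixed-bit check can be sketched as a single expression. This assumes
the mask convention implied by the code in this patch: a 0 bit in fixed0
means the attribute bit must be 0, and a 1 bit in fixed1 means the
attribute bit must be 1. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: attrs is acceptable when clearing the bits that
 * fixed0 forbids and forcing the bits that fixed1 requires leaves the
 * value unchanged.
 */
bool td_attrs_fixed_bits_ok(uint64_t attrs, uint64_t fixed0, uint64_t fixed1)
{
    return ((attrs & fixed0) | fixed1) == attrs;
}
```

For example, with fixed0 = 0xF0 and fixed1 = 0x10, the value 0x30 passes,
0x20 fails because a required bit is missing, and 0x11 fails because bit 0
is not allowed.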

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v3:
- using error_setg() for error report; (Daniel)
---
 target/i386/kvm/tdx.c | 29 +++++++++++++++++++++++++++--
 1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index bb10331e2a88..28b3c2765c86 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -32,6 +32,7 @@
                                      (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
                                      (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_TD_ATTRIBUTES_DEBUG             BIT_ULL(0)
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
 #define TDX_TD_ATTRIBUTES_PKS               BIT_ULL(30)
 #define TDX_TD_ATTRIBUTES_PERFMON           BIT_ULL(63)
@@ -471,13 +472,34 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
     return 0;
 }
 
-static void setup_td_guest_attributes(X86CPU *x86cpu)
+static int tdx_validate_attributes(TdxGuest *tdx, Error **errp)
+{
+    if (((tdx->attributes & tdx_caps->attrs_fixed0) | tdx_caps->attrs_fixed1) !=
+        tdx->attributes) {
+        error_setg(errp, "Invalid attributes 0x%" PRIx64 " for TDX VM "
+                   "(fixed0 0x%llx, fixed1 0x%llx)",
+                   tdx->attributes, tdx_caps->attrs_fixed0,
+                   tdx_caps->attrs_fixed1);
+        return -1;
+    }
+
+    if (tdx->attributes & TDX_TD_ATTRIBUTES_DEBUG) {
+        error_setg(errp, "Current QEMU doesn't support attributes.debug[bit 0] for TDX VM");
+        return -1;
+    }
+
+    return 0;
+}
+
+static int setup_td_guest_attributes(X86CPU *x86cpu, Error **errp)
 {
     CPUX86State *env = &x86cpu->env;
 
     tdx_guest->attributes |= (env->features[FEAT_7_0_ECX] & CPUID_7_0_ECX_PKS) ?
                              TDX_TD_ATTRIBUTES_PKS : 0;
     tdx_guest->attributes |= x86cpu->enable_pmu ? TDX_TD_ATTRIBUTES_PERFMON : 0;
+
+    return tdx_validate_attributes(tdx_guest, errp);
 }
 
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
@@ -502,7 +524,10 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
         goto out_free;
     }
 
-    setup_td_guest_attributes(x86cpu);
+    r = setup_td_guest_attributes(x86cpu, errp);
+    if (r) {
+        goto out;
+    }
 
     init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
     init_vm->attributes = tdx_guest->attributes;
-- 
2.34.1



* [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (29 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 30/70] i386/tdx: Validate TD attributes Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15 17:32   ` Daniel P. Berrangé
  2023-12-01 11:00   ` Markus Armbruster
  2023-11-15  7:14 ` [PATCH v3 32/70] i386/tdx: Implement user specified tsc frequency Xiaoyao Li
                   ` (38 subsequent siblings)
  69 siblings, 2 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Three SHA384 hash values of a TD, mrconfigid, mrowner and mrownerconfig,
can be provided for TDX attestation.

So far they were hard-coded as 0. Now allow the user to specify those
values via the properties mrconfigid, mrowner and mrownerconfig. They
are all in base64 format.

Example:
-object tdx-guest,\
  mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
  mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v
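
A SHA384 digest is 48 bytes, which base64 encodes to exactly 64 characters
without padding. The standalone sketch below decodes such a string and
rejects anything else. It is only an illustration with hypothetical names;
the patch itself relies on QEMU's qbase64_decode() rather than a
hand-written decoder.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SKETCH_SHA384_DIGEST_SIZE 48
#define SKETCH_SHA384_BASE64_LEN  64

/* Map one base64 character to its 6-bit value, or -1 if it is invalid. */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') {
        return c - 'A';
    }
    if (c >= 'a' && c <= 'z') {
        return c - 'a' + 26;
    }
    if (c >= '0' && c <= '9') {
        return c - '0' + 52;
    }
    if (c == '+') {
        return 62;
    }
    if (c == '/') {
        return 63;
    }
    return -1;
}

/* Decode a 64-character base64 string into 48 bytes; 0 on success. */
int decode_sha384_b64(const char *in, uint8_t out[SKETCH_SHA384_DIGEST_SIZE])
{
    size_t i, j;

    if (strlen(in) != SKETCH_SHA384_BASE64_LEN) {
        return -1;
    }
    for (i = 0; i < SKETCH_SHA384_BASE64_LEN; i += 4) {
        uint32_t group = 0;

        /* Four characters carry 24 bits, i.e. three output bytes. */
        for (j = 0; j < 4; j++) {
            int v = b64_value(in[i + j]);

            if (v < 0) {
                return -1;
            }
            group = (group << 6) | (uint32_t)v;
        }
        out[i / 4 * 3 + 0] = (group >> 16) & 0xff;
        out[i / 4 * 3 + 1] = (group >> 8) & 0xff;
        out[i / 4 * 3 + 2] = group & 0xff;
    }
    return 0;
}
```

Applied to the example value above, the decoded digest is the byte pattern
01 23 45 67 89 ab cd ef repeated six times.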

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
 - use base64 encoding instead of hex-string;
---
 qapi/qom.json         | 11 +++++-
 target/i386/kvm/tdx.c | 85 +++++++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.h |  3 ++
 3 files changed, 98 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index 3a29659e0155..fd99aa1ff8cc 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -888,10 +888,19 @@
 #     pages.  Some guest OS (e.g., Linux TD guest) may require this to
 #     be set, otherwise they refuse to boot.
 #
+# @mrconfigid: base64 encoded MRCONFIGID SHA384 digest
+#
+# @mrowner: base64 encoded MROWNER SHA384 digest
+#
+# @mrownerconfig: base64 encoded MROWNERCONFIG SHA384 digest
+#
 # Since: 8.2
 ##
 { 'struct': 'TdxGuestProperties',
-  'data': { '*sept-ve-disable': 'bool' } }
+  'data': { '*sept-ve-disable': 'bool',
+            '*mrconfigid': 'str',
+            '*mrowner': 'str',
+            '*mrownerconfig': 'str' } }
 
 ##
 # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 28b3c2765c86..b70efbcab738 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -13,6 +13,7 @@
 
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
+#include "qemu/base64.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
@@ -508,6 +509,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
     X86CPU *x86cpu = X86_CPU(cpu);
     CPUX86State *env = &x86cpu->env;
     struct kvm_tdx_init_vm *init_vm;
+    uint8_t *data;
+    size_t data_len;
     int r = 0;
 
     qemu_mutex_lock(&tdx_guest->lock);
@@ -518,6 +521,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
     init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
                         sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES);
 
+#define SHA384_DIGEST_SIZE  48
+
+    if (tdx_guest->mrconfigid) {
+        data = qbase64_decode(tdx_guest->mrconfigid, -1, &data_len, NULL);
+        if (!data || data_len != SHA384_DIGEST_SIZE) {
+            error_setg(errp, "TDX: failed to decode mrconfigid");
+            r = -1;
+            goto out_free;
+        }
+        memcpy(init_vm->mrconfigid, data, data_len);
+    }
+
+    if (tdx_guest->mrowner) {
+        data = qbase64_decode(tdx_guest->mrowner, -1, &data_len, NULL);
+        if (!data || data_len != SHA384_DIGEST_SIZE) {
+            error_setg(errp, "TDX: failed to decode mrowner");
+            r = -1;
+            goto out_free;
+        }
+        memcpy(init_vm->mrowner, data, data_len);
+    }
+
+    if (tdx_guest->mrownerconfig) {
+        data = qbase64_decode(tdx_guest->mrownerconfig, -1, &data_len, NULL);
+        if (!data || data_len != SHA384_DIGEST_SIZE) {
+            error_setg(errp, "TDX: failed to decode mrownerconfig");
+            r = -1;
+            goto out_free;
+        }
+        memcpy(init_vm->mrownerconfig, data, data_len);
+    }
+
     r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
     if (r < 0) {
         error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
@@ -567,6 +602,48 @@ static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error **errp)
     }
 }
 
+static char * tdx_guest_get_mrconfigid(Object *obj, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    return g_strdup(tdx->mrconfigid);
+}
+
+static void tdx_guest_set_mrconfigid(Object *obj, const char *value, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    tdx->mrconfigid = g_strdup(value);
+}
+
+static char * tdx_guest_get_mrowner(Object *obj, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    return g_strdup(tdx->mrowner);
+}
+
+static void tdx_guest_set_mrowner(Object *obj, const char *value, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    tdx->mrowner = g_strdup(value);
+}
+
+static char * tdx_guest_get_mrownerconfig(Object *obj, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    return g_strdup(tdx->mrownerconfig);
+}
+
+static void tdx_guest_set_mrownerconfig(Object *obj, const char *value, Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    tdx->mrownerconfig = g_strdup(value);
+}
+
 /* tdx guest */
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
                                    tdx_guest,
@@ -586,6 +663,14 @@ static void tdx_guest_init(Object *obj)
     object_property_add_bool(obj, "sept-ve-disable",
                              tdx_guest_get_sept_ve_disable,
                              tdx_guest_set_sept_ve_disable);
+    object_property_add_str(obj, "mrconfigid",
+                            tdx_guest_get_mrconfigid,
+                            tdx_guest_set_mrconfigid);
+    object_property_add_str(obj, "mrowner",
+                            tdx_guest_get_mrowner, tdx_guest_set_mrowner);
+    object_property_add_str(obj, "mrownerconfig",
+                            tdx_guest_get_mrownerconfig,
+                            tdx_guest_set_mrownerconfig);
 }
 
 static void tdx_guest_finalize(Object *obj)
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 432077723ac5..6e39ef3bac13 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -21,6 +21,9 @@ typedef struct TdxGuest {
 
     bool initialized;
     uint64_t attributes;    /* TD attributes */
+    char *mrconfigid;       /* base64 encoded sha384 digest */
+    char *mrowner;          /* base64 encoded sha384 digest */
+    char *mrownerconfig;    /* base64 encoded sha384 digest */
 } TdxGuest;
 
 #ifdef CONFIG_TDX
-- 
2.34.1



* [PATCH v3 32/70] i386/tdx: Implement user specified tsc frequency
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (30 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 33/70] i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM Xiaoyao Li
                   ` (37 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Reuse "-cpu ...,tsc-frequency=" to get the TSC frequency wanted by the
user, and call the VM-scope KVM_SET_TSC_KHZ to set the TSC frequency of
the TD before KVM_TDX_INIT_VM.

Besides, sanity check that the TSC frequency is within the legal range
and has the legal granularity (required by the TDX module).
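
The range and granularity rules can be sketched as a small predicate. The
limits (100 MHz to 10 GHz, in steps of 25 MHz) are the ones used by the
code in this patch; the function and macro names are hypothetical. A value
of 0 means no frequency was requested, in which case the host TSC frequency
is used.

```c
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_TDX_MIN_TSC_KHZ   (100 * 1000)
#define SKETCH_TDX_MAX_TSC_KHZ   (10 * 1000 * 1000)
#define SKETCH_TDX_TSC_STEP_KHZ  (25 * 1000)

/* Hypothetical helper: true when tsc_khz is acceptable for a TD. */
bool tdx_tsc_khz_acceptable(int64_t tsc_khz)
{
    if (tsc_khz == 0) {
        return true;    /* nothing requested, host frequency applies */
    }
    if (tsc_khz < SKETCH_TDX_MIN_TSC_KHZ ||
        tsc_khz > SKETCH_TDX_MAX_TSC_KHZ) {
        return false;
    }
    return tsc_khz % SKETCH_TDX_TSC_STEP_KHZ == 0;
}
```

For instance, 2500000 kHz (2.5 GHz) is accepted, while 2510000 kHz is
rejected because it is not a multiple of 25 MHz.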

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v3:
- use @errp to report error info; (Daniel)

Changes in v1:
- Use VM-scope KVM_SET_TSC_KHZ to set the TSC frequency of the TD since
  the KVM side dropped the @tsc_khz field from struct kvm_tdx_init_vm
---
 target/i386/kvm/kvm.c |  9 +++++++++
 target/i386/kvm/tdx.c | 26 ++++++++++++++++++++++++++
 2 files changed, 35 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index fc840653ceb6..d09d9f4eee94 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -795,6 +795,15 @@ static int kvm_arch_set_tsc_khz(CPUState *cs)
     int r, cur_freq;
     bool set_ioctl = false;
 
+    /*
+     * TSC of TD vcpu is immutable, it cannot be set/changed via vcpu scope
+     * KVM_SET_TSC_KHZ, but only be initialized via VM scope KVM_SET_TSC_KHZ
+     * before ioctl KVM_TDX_INIT_VM in tdx_pre_create_vcpu()
+     */
+    if (is_tdx_vm()) {
+        return 0;
+    }
+
     if (!env->tsc_khz) {
         return 0;
     }
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index b70efbcab738..05ca841d0b66 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -33,6 +33,9 @@
                                      (1U << KVM_FEATURE_PV_SCHED_YIELD) | \
                                      (1U << KVM_FEATURE_MSI_EXT_DEST_ID))
 
+#define TDX_MIN_TSC_FREQUENCY_KHZ   (100 * 1000)
+#define TDX_MAX_TSC_FREQUENCY_KHZ   (10 * 1000 * 1000)
+
 #define TDX_TD_ATTRIBUTES_DEBUG             BIT_ULL(0)
 #define TDX_TD_ATTRIBUTES_SEPT_VE_DISABLE   BIT_ULL(28)
 #define TDX_TD_ATTRIBUTES_PKS               BIT_ULL(30)
@@ -559,6 +562,29 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
         goto out_free;
     }
 
+    r = -EINVAL;
+    if (env->tsc_khz && (env->tsc_khz < TDX_MIN_TSC_FREQUENCY_KHZ ||
+                         env->tsc_khz > TDX_MAX_TSC_FREQUENCY_KHZ)) {
+        error_setg(errp, "Invalid TSC %" PRId64 " kHz, tsc-frequency must be in [%d, %d] kHz",
+                   env->tsc_khz, TDX_MIN_TSC_FREQUENCY_KHZ,
+                   TDX_MAX_TSC_FREQUENCY_KHZ);
+        goto out_free;
+    }
+
+    if (env->tsc_khz % (25 * 1000)) {
+        error_setg(errp, "Invalid TSC %" PRId64 " kHz, it must be a multiple of 25 MHz",
+                   env->tsc_khz);
+        goto out_free;
+    }
+
+    /* It's safe even if env->tsc_khz is 0; KVM uses host's tsc_khz in that case */
+    r = kvm_vm_ioctl(kvm_state, KVM_SET_TSC_KHZ, env->tsc_khz);
+    if (r < 0) {
+        error_setg_errno(errp, -r, "Unable to set TSC frequency to %" PRId64 " kHz",
+                         env->tsc_khz);
+        goto out_free;
+    }
+
     r = setup_td_guest_attributes(x86cpu, errp);
     if (r) {
         goto out;
-- 
2.34.1



* [PATCH v3 33/70] i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (31 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 32/70] i386/tdx: Implement user specified tsc frequency Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 34/70] kvm/memory: Introduce the infrastructure to set the default shared/private value Xiaoyao Li
                   ` (36 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

TDX only supports readonly for shared memory but not for private memory.

From QEMU's point of view, there is no way to tell whether a memslot is
used as shared or private memory. Thus just set kvm_readonly_mem_allowed
to false for TDX VMs for simplicity.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/tdx.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 05ca841d0b66..50e68f9c1a41 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -472,6 +472,15 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
 
     update_tdx_cpuid_lookup_by_tdx_caps();
 
+    /*
+     * Set kvm_readonly_mem_allowed to false, because TDX only supports readonly
+     * memory for shared memory but not for private memory. Besides, whether a
+     * memslot is private or shared is not determined by QEMU.
+     *
+     * Thus, just mark readonly memory not supported for simplicity.
+     */
+    kvm_readonly_mem_allowed = false;
+
     tdx_guest = tdx;
     return 0;
 }
-- 
2.34.1



* [PATCH v3 34/70] kvm/memory: Introduce the infrastructure to set the default shared/private value
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (32 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 33/70] i386/tdx: Set kvm_readonly_mem_enabled to false for TDX VM Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 35/70] i386/tdx: Make memory type private by default Xiaoyao Li
                   ` (35 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce a new flag RAM_DEFAULT_PRIVATE for RAMBlock. It indicates
whether the default memory attribute of the RAM is private or not.

Set the RAM range to private explicitly when it is private by default.
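
The intended semantics can be sketched with a simplified stand-in for
RAMBlock: the flag is only honoured for blocks that have a guest memfd
attached. The struct and helper names below are hypothetical and carry only
the fields needed for the illustration; the flag value mirrors the define
added by this patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_RAM_DEFAULT_PRIVATE (1u << 13)

/* Simplified stand-in for RAMBlock; not the real QEMU structure. */
struct ram_block_sketch {
    uint32_t flags;
    int guest_memfd;    /* negative when no guest memfd is attached */
};

static bool has_guest_memfd(const struct ram_block_sketch *rb)
{
    return rb != NULL && rb->guest_memfd >= 0;
}

/* Mark the block as private by default, but only if it can be private. */
void set_default_private(struct ram_block_sketch *rb)
{
    if (has_guest_memfd(rb)) {
        rb->flags |= SKETCH_RAM_DEFAULT_PRIVATE;
    }
}

/* A block is default-private only with both a guest memfd and the flag. */
bool is_default_private(const struct ram_block_sketch *rb)
{
    return has_guest_memfd(rb) &&
           (rb->flags & SKETCH_RAM_DEFAULT_PRIVATE) != 0;
}
```

Requesting the flag on a block without a guest memfd is silently ignored,
so callers do not need to check for the backend first.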

Originated-from: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c   | 10 ++++++++++
 include/exec/memory.h |  6 ++++++
 system/memory.c       | 13 +++++++++++++
 3 files changed, 29 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index a92fff471b58..316690d113d0 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1467,6 +1467,16 @@ static void kvm_set_phys_mem(KVMMemoryListener *kml,
                     strerror(-err));
             abort();
         }
+
+        if (memory_region_is_default_private(mr)) {
+            err = kvm_set_memory_attributes_private(start_addr, slot_size);
+            if (err) {
+                error_report("%s: failed to set memory attribute private: %s",
+                             __func__, strerror(-err));
+                exit(1);
+            }
+        }
+
         start_addr += slot_size;
         ram_start_offset += slot_size;
         ram += slot_size;
diff --git a/include/exec/memory.h b/include/exec/memory.h
index f780367ab1bd..bdc4b98efe70 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -246,6 +246,9 @@ typedef struct IOMMUTLBEvent {
 /* RAM can be private that has kvm gmem backend */
 #define RAM_GUEST_MEMFD   (1 << 12)
 
+/* RAM is default private */
+#define RAM_DEFAULT_PRIVATE     (1 << 13)
+
 static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
                                        IOMMUNotifierFlag flags,
                                        hwaddr start, hwaddr end,
@@ -1715,6 +1718,9 @@ bool memory_region_is_protected(MemoryRegion *mr);
  */
 bool memory_region_has_guest_memfd(MemoryRegion *mr);
 
+void memory_region_set_default_private(MemoryRegion *mr);
+bool memory_region_is_default_private(MemoryRegion *mr);
+
 /**
  * memory_region_get_iommu: check whether a memory region is an iommu
  *
diff --git a/system/memory.c b/system/memory.c
index 69741d91bbb7..b0c58232b6f7 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -1867,6 +1867,19 @@ bool memory_region_has_guest_memfd(MemoryRegion *mr)
     return mr->ram_block && mr->ram_block->guest_memfd >= 0;
 }
 
+bool memory_region_is_default_private(MemoryRegion *mr)
+{
+    return memory_region_has_guest_memfd(mr) &&
+           (mr->ram_block->flags & RAM_DEFAULT_PRIVATE);
+}
+
+void memory_region_set_default_private(MemoryRegion *mr)
+{
+    if (memory_region_has_guest_memfd(mr)) {
+        mr->ram_block->flags |= RAM_DEFAULT_PRIVATE;
+    }
+}
+
 uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
 {
     uint8_t mask = mr->dirty_log_mask;
-- 
2.34.1



* [PATCH v3 35/70] i386/tdx: Make memory type private by default
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (33 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 34/70] kvm/memory: Introduce the infrastructure to set the default shared/private value Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 36/70] kvm/tdx: Don't complain when converting vMMIO region to shared Xiaoyao Li
                   ` (34 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

By default (due to the recent UPM change), the restricted memory attribute
is shared.  Convert the memory region from shared to private at memory slot
creation time.

Add a KVM region registering function to check the flag and convert the
region, and add a memory listener to the TDX guest code to set the flag on
the applicable memory regions.

Without this patch
- Secure-EPT violation on private area
- KVM_MEMORY_FAULT EXIT (kvm -> qemu)
- qemu converts the 4K page from shared to private
- Resume VCPU execution
- Secure-EPT violation again
- KVM resolves EPT Violation
This also prevents huge pages, because page conversion is done at 4K
granularity.  Although it's possible to merge 4K private mappings into a
2M large page, doing so slows guest boot.

With this patch
- After memory slot creation, convert the region from shared to private
- Secure-EPT violation on private area.
- KVM resolves EPT Violation

Originated-from: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 include/exec/memory.h |  1 +
 target/i386/kvm/tdx.c | 20 ++++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index bdc4b98efe70..c8b0385b19ad 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -850,6 +850,7 @@ struct IOMMUMemoryRegion {
 #define MEMORY_LISTENER_PRIORITY_MIN            0
 #define MEMORY_LISTENER_PRIORITY_ACCEL          10
 #define MEMORY_LISTENER_PRIORITY_DEV_BACKEND    10
+#define MEMORY_LISTENER_PRIORITY_ACCEL_HIGH     20
 
 /**
  * struct MemoryListener: callbacks structure for updates to the physical memory map
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 50e68f9c1a41..82a1b010746a 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -19,6 +19,7 @@
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
+#include "exec/address-spaces.h"
 
 #include "hw/i386/x86.h"
 #include "kvm_i386.h"
@@ -619,6 +620,19 @@ out:
     return r;
 }
 
+static void tdx_guest_region_add(MemoryListener *listener,
+                                 MemoryRegionSection *section)
+{
+    memory_region_set_default_private(section->mr);
+}
+
+static MemoryListener tdx_memory_listener = {
+    .name = TYPE_TDX_GUEST,
+    .region_add = tdx_guest_region_add,
+    /* Higher than KVM memory listener = 10. */
+    .priority = MEMORY_LISTENER_PRIORITY_ACCEL_HIGH,
+};
+
 static bool tdx_guest_get_sept_ve_disable(Object *obj, Error **errp)
 {
     TdxGuest *tdx = TDX_GUEST(obj);
@@ -690,6 +704,12 @@ OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
 static void tdx_guest_init(Object *obj)
 {
     TdxGuest *tdx = TDX_GUEST(obj);
+    static bool memory_listener_registered = false;
+
+    if (!memory_listener_registered) {
+        memory_listener_register(&tdx_memory_listener, &address_space_memory);
+        memory_listener_registered = true;
+    }
 
     qemu_mutex_init(&tdx->lock);
 
-- 
2.34.1



* [PATCH v3 36/70] kvm/tdx: Don't complain when converting vMMIO region to shared
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (34 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 35/70] i386/tdx: Make memory type private by default Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 37/70] kvm/tdx: Ignore memory conversion to shared of unassigned region Xiaoyao Li
                   ` (33 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

Because a vMMIO region needs to be a shared region, a guest TD may
explicitly convert such a region from private to shared.  Don't complain
about such a conversion.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 316690d113d0..5e862db4af41 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2933,17 +2933,19 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
 {
     MemoryRegionSection section;
     ram_addr_t offset;
+    MemoryRegion *mr;
     RAMBlock *rb;
     void *addr;
     int ret = -1;
 
     trace_kvm_convert_memory(start, size, to_private ? "shared_to_private" : "private_to_shared");
     section = memory_region_find(get_system_memory(), start, size);
-    if (!section.mr) {
+    mr = section.mr;
+    if (!mr) {
         return ret;
     }
 
-    if (memory_region_has_guest_memfd(section.mr)) {
+    if (memory_region_has_guest_memfd(mr)) {
         if (to_private) {
             ret = kvm_set_memory_attributes_private(start, size);
         } else {
@@ -2965,9 +2967,22 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
          */
         ram_block_convert_range(rb, offset, size, to_private);
     } else {
-        warn_report("Convert non guest_memfd backed memory region "
-                    "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
-                    start, size, to_private ? "private" : "shared");
+        /*
+         * Because vMMIO region must be shared, guest TD may convert vMMIO
+         * region to shared explicitly.  Don't complain in such a case.  See
+         * memory_region_type() for checking if the region is MMIO region.
+         */
+        if (!to_private &&
+            !memory_region_is_ram(mr) &&
+            !memory_region_is_ram_device(mr) &&
+            !memory_region_is_rom(mr) &&
+            !memory_region_is_romd(mr)) {
+            ret = 0;
+        } else {
+            warn_report("Convert non guest_memfd backed memory region "
+                        "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
+                        start, size, to_private ? "private" : "shared");
+        }
     }
 
     memory_region_unref(section.mr);
-- 
2.34.1



* [PATCH v3 37/70] kvm/tdx: Ignore memory conversion to shared of unassigned region
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (35 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 36/70] kvm/tdx: Don't complain when converting vMMIO region to shared Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 38/70] i386/tdvf: Introduce function to parse TDVF metadata Xiaoyao Li
                   ` (32 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDX requires vMMIO regions to be shared.  For KVM, an MMIO region is a
region to which no kvm memslot is assigned (except for in-kernel emulation).
qemu has the memory region for vMMIO at each device level.

While OVMF issues MapGPA(to-shared) conservatively on the 32bit PCI MMIO
region, qemu doesn't find the corresponding vMMIO region because it's before
PCI device allocation and memory_region_find() finds the device region, not
the PCI bus region.  It's safe to ignore MapGPA(to-shared) because when the
guest accesses those regions it uses a GPA with the shared bit set for vMMIO.
Ignore memory conversion requests of non-assigned regions to shared and
return success.  Otherwise OVMF is confused and panics there.
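
Combined with the vMMIO rule from the previous patch, the outcome of a
conversion request can be summarized by the following model.  This is a
sketch only: enum region_kind and model_convert() are hypothetical names,
not QEMU APIs, and the guest memfd case assumes that the attribute update
in KVM succeeds.

```c
#include <stdbool.h>

/* Hypothetical classification of what memory_region_find() returned. */
enum region_kind {
    REGION_NONE,       /* no MemoryRegion covers the range */
    REGION_MMIO,       /* not RAM, RAM device, ROM or ROMD */
    REGION_RAM_PLAIN,  /* RAM (or ROM) without guest memfd */
    REGION_RAM_GMEM    /* RAM backed by guest memfd */
};

/* Returns 0 if the request is accepted, -1 if it is rejected. */
static int model_convert(enum region_kind kind, bool to_private)
{
    switch (kind) {
    case REGION_NONE:
    case REGION_MMIO:
        /* Converting to shared is silently accepted; to private is not. */
        return to_private ? -1 : 0;
    case REGION_RAM_GMEM:
        /* Assumes the memory attribute update itself succeeds. */
        return 0;
    case REGION_RAM_PLAIN:
    default:
        /* Rejected in both directions (QEMU also prints a warning). */
        return -1;
    }
}
```

The MapGPA(to-shared) request that OVMF issues on the not yet populated
PCI MMIO window corresponds to model_convert(REGION_NONE, false) here,
which is accepted.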

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 5e862db4af41..89e7183a2738 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2942,6 +2942,18 @@ static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
     section = memory_region_find(get_system_memory(), start, size);
     mr = section.mr;
     if (!mr) {
+        /*
+         * Ignore converting non-assigned region to shared.
+         *
+         * TDX requires vMMIO region to be shared to inject #VE to guest.
+         * OVMF issues conservatively MapGPA(shared) on 32bit PCI MMIO region,
+         * and vIO-APIC 0xFEC00000 4K page.
+         * OVMF assigns 32bit PCI MMIO region to
+         * [top of low memory: typically 2GB=0xC000000,  0xFC00000)
+         */
+        if (!to_private) {
+            ret = 0;
+        }
         return ret;
     }
 
-- 
2.34.1



* [PATCH v3 38/70] i386/tdvf: Introduce function to parse TDVF metadata
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (36 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 37/70] kvm/tdx: Ignore memory conversion to shared of unassigned region Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 39/70] i386/tdx: Parse TDVF metadata for TDX VM Xiaoyao Li
                   ` (31 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

A TDX VM needs to boot with its specialized firmware, Trusted Domain
Virtual Firmware (TDVF). QEMU needs to parse TDVF and map it in TD
guest memory prior to running the TDX VM.

The TDVF metadata in a TDVF image describes the structure of the firmware.
QEMU refers to it to set up memory for TDVF. Introduce the function
tdvf_parse_metadata() to parse the metadata from a TDVF image and store
the info of each TDVF section.

TDX metadata is located by a TDX metadata offset block, which is a
GUID-ed structure. The data portion of the GUID structure contains
only a 4-byte field that is the offset of the TDX metadata from the end
of the firmware file.
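
As an illustration of how the offset block is used, the sketch below
turns the 4-byte "offset from the end" into an offset from the start of
the image and checks the signature.  It is not the code from this patch:
tdvf_metadata_start() and read_le32() are hypothetical helpers, the
16-byte header size is the four uint32_t fields of TdvfMetadata, and the
check that the offset does not exceed the image size is an extra
assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

#define TDVF_SIGNATURE   0x46564454u  /* "TDVF" read as little endian */
#define TDVF_HEADER_SIZE 16u          /* Signature, Length, Version, count */

/* Read a 32-bit little-endian value independent of host endianness. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * offset_from_end is the value stored in the GUID-ed offset block.
 * Returns the metadata offset from the start of the image, or -1 if the
 * header does not fit inside the image or the signature does not match.
 */
static int64_t tdvf_metadata_start(const uint8_t *image, size_t image_size,
                                   uint32_t offset_from_end)
{
    size_t start;

    if (offset_from_end > image_size || offset_from_end < TDVF_HEADER_SIZE) {
        return -1;
    }
    start = image_size - offset_from_end;
    if (read_le32(image + start) != TDVF_SIGNATURE) {
        return -1;
    }
    return (int64_t)start;
}
```

For a 64-byte image whose metadata begins 32 bytes before the end, the
helper returns 32, provided the bytes at that position spell "TDVF".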

Select X86_FW_OVMF when TDX is enabled to leverage existing functions
to parse and search OVMF's GUID-ed structures.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>

---
Changes in v1:
 - rename tdvf_parse_section_entry() to
   tdvf_parse_and_check_section_entry()
Changes in RFC v4:
 - rename TDX_METADATA_GUID to TDX_METADATA_OFFSET_GUID
---
 hw/i386/Kconfig        |   1 +
 hw/i386/meson.build    |   1 +
 hw/i386/tdvf.c         | 199 +++++++++++++++++++++++++++++++++++++++++
 include/hw/i386/tdvf.h |  51 +++++++++++
 4 files changed, 252 insertions(+)
 create mode 100644 hw/i386/tdvf.c
 create mode 100644 include/hw/i386/tdvf.h

diff --git a/hw/i386/Kconfig b/hw/i386/Kconfig
index cea78dcb2822..e61d58425ff2 100644
--- a/hw/i386/Kconfig
+++ b/hw/i386/Kconfig
@@ -12,6 +12,7 @@ config SGX
 
 config TDX
     bool
+    select X86_FW_OVMF
     depends on KVM
 
 config PC
diff --git a/hw/i386/meson.build b/hw/i386/meson.build
index 369c6bf823bb..6808bd4e3032 100644
--- a/hw/i386/meson.build
+++ b/hw/i386/meson.build
@@ -27,6 +27,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files(
   'port92.c'))
 i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'),
                                         if_false: files('pc_sysfw_ovmf-stubs.c'))
+i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c'))
 
 subdir('kvm')
 subdir('xen')
diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c
new file mode 100644
index 000000000000..ff51f40088f0
--- /dev/null
+++ b/hw/i386/tdvf.c
@@ -0,0 +1,199 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+
+ * Copyright (c) 2020 Intel Corporation
+ * Author: Isaku Yamahata <isaku.yamahata at gmail.com>
+ *                        <isaku.yamahata at intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/error-report.h"
+
+#include "hw/i386/pc.h"
+#include "hw/i386/tdvf.h"
+#include "sysemu/kvm.h"
+
+#define TDX_METADATA_OFFSET_GUID    "e47a6535-984a-4798-865e-4685a7bf8ec2"
+#define TDX_METADATA_VERSION        1
+#define TDVF_SIGNATURE              0x46564454 /* TDVF as little endian */
+
+typedef struct {
+    uint32_t DataOffset;
+    uint32_t RawDataSize;
+    uint64_t MemoryAddress;
+    uint64_t MemoryDataSize;
+    uint32_t Type;
+    uint32_t Attributes;
+} TdvfSectionEntry;
+
+typedef struct {
+    uint32_t Signature;
+    uint32_t Length;
+    uint32_t Version;
+    uint32_t NumberOfSectionEntries;
+    TdvfSectionEntry SectionEntries[];
+} TdvfMetadata;
+
+struct tdx_metadata_offset {
+    uint32_t offset;
+};
+
+static TdvfMetadata *tdvf_get_metadata(void *flash_ptr, int size)
+{
+    TdvfMetadata *metadata;
+    uint32_t offset = 0;
+    uint8_t *data;
+
+    if ((uint32_t) size != size) {
+        return NULL;
+    }
+
+    if (pc_system_ovmf_table_find(TDX_METADATA_OFFSET_GUID, &data, NULL)) {
+        offset = size - le32_to_cpu(((struct tdx_metadata_offset *)data)->offset);
+
+        if (offset + sizeof(*metadata) > size) {
+            return NULL;
+        }
+    } else {
+        error_report("Cannot find TDX_METADATA_OFFSET_GUID");
+        return NULL;
+    }
+
+    metadata = flash_ptr + offset;
+
+    /* Finally, verify the signature to determine if this is a TDVF image. */
+    metadata->Signature = le32_to_cpu(metadata->Signature);
+    if (metadata->Signature != TDVF_SIGNATURE) {
+        error_report("Invalid TDVF signature in metadata!");
+        return NULL;
+    }
+
+    /* Sanity check that the TDVF doesn't overlap its own metadata. */
+    metadata->Length = le32_to_cpu(metadata->Length);
+    if (offset + metadata->Length > size) {
+        return NULL;
+    }
+
+    /* Only version 1 is supported/defined. */
+    metadata->Version = le32_to_cpu(metadata->Version);
+    if (metadata->Version != TDX_METADATA_VERSION) {
+        return NULL;
+    }
+
+    return metadata;
+}
+
+static int tdvf_parse_and_check_section_entry(const TdvfSectionEntry *src,
+                                              TdxFirmwareEntry *entry)
+{
+    entry->data_offset = le32_to_cpu(src->DataOffset);
+    entry->data_len = le32_to_cpu(src->RawDataSize);
+    entry->address = le64_to_cpu(src->MemoryAddress);
+    entry->size = le64_to_cpu(src->MemoryDataSize);
+    entry->type = le32_to_cpu(src->Type);
+    entry->attributes = le32_to_cpu(src->Attributes);
+
+    /* sanity check */
+    if (entry->size < entry->data_len) {
+        error_report("Broken metadata RawDataSize 0x%x MemoryDataSize 0x%lx",
+                     entry->data_len, entry->size);
+        return -1;
+    }
+    if (!QEMU_IS_ALIGNED(entry->address, TARGET_PAGE_SIZE)) {
+        error_report("MemoryAddress 0x%lx not page aligned", entry->address);
+        return -1;
+    }
+    if (!QEMU_IS_ALIGNED(entry->size, TARGET_PAGE_SIZE)) {
+        error_report("MemoryDataSize 0x%lx not page aligned", entry->size);
+        return -1;
+    }
+
+    switch (entry->type) {
+    case TDVF_SECTION_TYPE_BFV:
+    case TDVF_SECTION_TYPE_CFV:
+        /* The sections that must be copied from firmware image to TD memory */
+        if (entry->data_len == 0) {
+            error_report("%d section with RawDataSize == 0", entry->type);
+            return -1;
+        }
+        break;
+    case TDVF_SECTION_TYPE_TD_HOB:
+    case TDVF_SECTION_TYPE_TEMP_MEM:
+        /* The sections that need not be copied from firmware image */
+        if (entry->data_len != 0) {
+            error_report("%d section with RawDataSize 0x%x != 0",
+                         entry->type, entry->data_len);
+            return -1;
+        }
+        break;
+    default:
+        error_report("TDVF contains unsupported section type %d", entry->type);
+        return -1;
+    }
+
+    return 0;
+}
+
+int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size)
+{
+    TdvfSectionEntry *sections;
+    TdvfMetadata *metadata;
+    ssize_t entries_size;
+    uint32_t len, i;
+
+    metadata = tdvf_get_metadata(flash_ptr, size);
+    if (!metadata) {
+        return -EINVAL;
+    }
+
+    /* load and parse metadata entries */
+    fw->nr_entries = le32_to_cpu(metadata->NumberOfSectionEntries);
+    if (fw->nr_entries < 2) {
+        error_report("Invalid number of fw entries (%u) in TDVF", fw->nr_entries);
+        return -EINVAL;
+    }
+
+    len = le32_to_cpu(metadata->Length);
+    entries_size = fw->nr_entries * sizeof(TdvfSectionEntry);
+    if (len != sizeof(*metadata) + entries_size) {
+        error_report("TDVF metadata len (0x%x) mismatch, expected (0x%x)",
+                     len, (uint32_t)(sizeof(*metadata) + entries_size));
+        return -EINVAL;
+    }
+
+    fw->entries = g_new(TdxFirmwareEntry, fw->nr_entries);
+    sections = g_new(TdvfSectionEntry, fw->nr_entries);
+
+    if (!memcpy(sections, (void *)metadata + sizeof(*metadata), entries_size))  {
+        error_report("Failed to read TDVF section entries");
+        goto err;
+    }
+
+    for (i = 0; i < fw->nr_entries; i++) {
+        if (tdvf_parse_and_check_section_entry(&sections[i], &fw->entries[i])) {
+            goto err;
+        }
+    }
+    g_free(sections);
+
+    return 0;
+
+err:
+    g_free(sections);
+    g_free(fw->entries);
+    fw->entries = NULL;
+    return -EINVAL;
+}
diff --git a/include/hw/i386/tdvf.h b/include/hw/i386/tdvf.h
new file mode 100644
index 000000000000..593341eb2e93
--- /dev/null
+++ b/include/hw/i386/tdvf.h
@@ -0,0 +1,51 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+
+ * Copyright (c) 2020 Intel Corporation
+ * Author: Isaku Yamahata <isaku.yamahata at gmail.com>
+ *                        <isaku.yamahata at intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#ifndef HW_I386_TDVF_H
+#define HW_I386_TDVF_H
+
+#include "qemu/osdep.h"
+
+#define TDVF_SECTION_TYPE_BFV               0
+#define TDVF_SECTION_TYPE_CFV               1
+#define TDVF_SECTION_TYPE_TD_HOB            2
+#define TDVF_SECTION_TYPE_TEMP_MEM          3
+
+#define TDVF_SECTION_ATTRIBUTES_MR_EXTEND   (1U << 0)
+#define TDVF_SECTION_ATTRIBUTES_PAGE_AUG    (1U << 1)
+
+typedef struct TdxFirmwareEntry {
+    uint32_t data_offset;
+    uint32_t data_len;
+    uint64_t address;
+    uint64_t size;
+    uint32_t type;
+    uint32_t attributes;
+} TdxFirmwareEntry;
+
+typedef struct TdxFirmware {
+    uint32_t nr_entries;
+    TdxFirmwareEntry *entries;
+} TdxFirmware;
+
+int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size);
+
+#endif /* HW_I386_TDVF_H */
-- 
2.34.1



* [PATCH v3 39/70] i386/tdx: Parse TDVF metadata for TDX VM
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (37 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 38/70] i386/tdvf: Introduce function to parse TDVF metadata Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 40/70] i386/tdx: Skip BIOS shadowing setup Xiaoyao Li
                   ` (30 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

TDX cannot support the pflash device since it supports neither read-only
memslots nor emulation. Load TDVF (OVMF) with the -bios option for TDs.

When booting a TD, besides loading TDVF to an address below 4G, QEMU needs
to parse the TDVF metadata.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 hw/i386/pc_sysfw.c         | 7 +++++++
 hw/i386/x86.c              | 3 ++-
 target/i386/kvm/tdx-stub.c | 5 +++++
 target/i386/kvm/tdx.c      | 5 +++++
 target/i386/kvm/tdx.h      | 4 ++++
 5 files changed, 23 insertions(+), 1 deletion(-)

diff --git a/hw/i386/pc_sysfw.c b/hw/i386/pc_sysfw.c
index c8d9e71b889b..cf63434ba89d 100644
--- a/hw/i386/pc_sysfw.c
+++ b/hw/i386/pc_sysfw.c
@@ -37,6 +37,7 @@
 #include "hw/block/flash.h"
 #include "sysemu/kvm.h"
 #include "sev.h"
+#include "kvm/tdx.h"
 
 #define FLASH_SECTOR_SIZE 4096
 
@@ -265,5 +266,11 @@ void x86_firmware_configure(void *ptr, int size)
         }
 
         sev_encrypt_flash(ptr, size, &error_fatal);
+    } else if (is_tdx_vm()) {
+        ret = tdx_parse_tdvf(ptr, size);
+        if (ret) {
+            error_report("failed to parse TDVF for TDX VM");
+            exit(1);
+        }
     }
 }
diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 55678279bf3b..fde5467c4750 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -47,6 +47,7 @@
 #include "hw/intc/i8259.h"
 #include "hw/rtc/mc146818rtc.h"
 #include "target/i386/sev.h"
+#include "kvm/tdx.h"
 
 #include "hw/acpi/cpu_hotplug.h"
 #include "hw/irq.h"
@@ -1147,7 +1148,7 @@ void x86_bios_rom_init(MachineState *ms, const char *default_firmware,
     }
     bios = g_malloc(sizeof(*bios));
     memory_region_init_ram(bios, NULL, "pc.bios", bios_size, &error_fatal);
-    if (sev_enabled()) {
+    if (sev_enabled() || is_tdx_vm()) {
         /*
          * The concept of a "reset" simply doesn't exist for
          * confidential computing guests, we have to destroy and
diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
index 3877d432a397..587dbeeed196 100644
--- a/target/i386/kvm/tdx-stub.c
+++ b/target/i386/kvm/tdx-stub.c
@@ -11,3 +11,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
 {
     return -EINVAL;
 }
+
+int tdx_parse_tdvf(void *flash_ptr, int size)
+{
+    return -EINVAL;
+}
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 82a1b010746a..cfe623fdd4e6 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -620,6 +620,11 @@ out:
     return r;
 }
 
+int tdx_parse_tdvf(void *flash_ptr, int size)
+{
+    return tdvf_parse_metadata(&tdx_guest->tdvf, flash_ptr, size);
+}
+
 static void tdx_guest_region_add(MemoryListener *listener,
                                  MemoryRegionSection *section)
 {
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 6e39ef3bac13..a46af433135f 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -6,6 +6,7 @@
 #endif
 
 #include "exec/confidential-guest-support.h"
+#include "hw/i386/tdvf.h"
 
 #define TYPE_TDX_GUEST "tdx-guest"
 #define TDX_GUEST(obj)  OBJECT_CHECK(TdxGuest, (obj), TYPE_TDX_GUEST)
@@ -24,6 +25,8 @@ typedef struct TdxGuest {
     char *mrconfigid;       /* base64 encoded sha348 digest */
     char *mrowner;          /* base64 encoded sha348 digest */
     char *mrownerconfig;    /* base64 encoded sha348 digest */
+
+    TdxFirmware tdvf;
 } TdxGuest;
 
 #ifdef CONFIG_TDX
@@ -36,5 +39,6 @@ int tdx_kvm_init(MachineState *ms, Error **errp);
 void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
                              uint32_t *ret);
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp);
+int tdx_parse_tdvf(void *flash_ptr, int size);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 40/70] i386/tdx: Skip BIOS shadowing setup
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (38 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 39/70] i386/tdx: Parse TDVF metadata for TDX VM Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 41/70] i386/tdx: Don't initialize pc.rom for TDX VMs Xiaoyao Li
                   ` (29 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

TDX doesn't support mapping different GPAs to the same private memory. Thus,
aliasing the top 128KB of BIOS as isa-bios is not supported.

On the other hand, a TDX guest cannot enter real mode, so it works fine
without isa-bios.
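
To make the aliasing concrete, the sketch below reproduces the address
arithmetic of the isa-bios alias that is being skipped.  The names
isa_bios_layout() and struct isa_bios_alias are hypothetical, and the
high_gpa field assumes the full BIOS image is mapped so that it ends at
4GiB, which is how x86_bios_rom_init() places it for non-TDX guests.

```c
#include <stdint.h>

#define KiB          1024ULL
#define ISA_BIOS_MAX (128 * KiB)
#define ISA_TOP      0x100000ULL      /* 1 MiB, end of the ISA space */
#define TOP_4G       0x100000000ULL   /* assumed end of the full BIOS map */

struct isa_bios_alias {
    uint64_t size;         /* bytes of the BIOS visible through the alias */
    uint64_t bios_offset;  /* offset of those bytes inside the BIOS image */
    uint64_t low_gpa;      /* GPA of the alias below 1 MiB */
    uint64_t high_gpa;     /* GPA of the same bytes below 4 GiB */
};

/* Compute where the last (up to) 128KB of the BIOS would be aliased. */
static struct isa_bios_alias isa_bios_layout(uint64_t bios_size)
{
    struct isa_bios_alias a;

    a.size = bios_size < ISA_BIOS_MAX ? bios_size : ISA_BIOS_MAX;
    a.bios_offset = bios_size - a.size;
    a.low_gpa = ISA_TOP - a.size;
    a.high_gpa = TOP_4G - a.size;
    return a;
}
```

For a 4MiB firmware image this gives an alias of 0x20000 bytes at GPA
0xE0000 that refers to the same bytes as GPA 0xFFFE0000.  These are the
two GPAs backed by one piece of memory that TDX cannot support for
private memory.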

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v1:
 - update commit message and comment to clarify
---
 hw/i386/x86.c | 25 ++++++++++++++-----------
 1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index fde5467c4750..2f299355a5e3 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1170,17 +1170,20 @@ void x86_bios_rom_init(MachineState *ms, const char *default_firmware,
     }
     g_free(filename);
 
-    /* map the last 128KB of the BIOS in ISA space */
-    isa_bios_size = MIN(bios_size, 128 * KiB);
-    isa_bios = g_malloc(sizeof(*isa_bios));
-    memory_region_init_alias(isa_bios, NULL, "isa-bios", bios,
-                             bios_size - isa_bios_size, isa_bios_size);
-    memory_region_add_subregion_overlap(rom_memory,
-                                        0x100000 - isa_bios_size,
-                                        isa_bios,
-                                        1);
-    if (!isapc_ram_fw) {
-        memory_region_set_readonly(isa_bios, true);
+    /* TDX doesn't support aliasing different GPAs to the same private memory */
+    if (!is_tdx_vm()) {
+        /* map the last 128KB of the BIOS in ISA space */
+        isa_bios_size = MIN(bios_size, 128 * KiB);
+        isa_bios = g_malloc(sizeof(*isa_bios));
+        memory_region_init_alias(isa_bios, NULL, "isa-bios", bios,
+                                bios_size - isa_bios_size, isa_bios_size);
+        memory_region_add_subregion_overlap(rom_memory,
+                                            0x100000 - isa_bios_size,
+                                            isa_bios,
+                                            1);
+        if (!isapc_ram_fw) {
+            memory_region_set_readonly(isa_bios, true);
+        }
     }
 
     /* map all the bios at the top of memory */
-- 
2.34.1



* [PATCH v3 41/70] i386/tdx: Don't initialize pc.rom for TDX VMs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (39 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 40/70] i386/tdx: Skip BIOS shadowing setup Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 42/70] i386/tdx: Track mem_ptr for each firmware entry of TDVF Xiaoyao Li
                   ` (28 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

For TDX, the addresses below 1MB are entirely general RAM. There is no need
to initialize the pc.rom memory region for TDs.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
This is more of a workaround for the issue that, for the q35 machine type,
the real memslot update (which requires memslot deletion) for pc.rom happens
after tdx_init_memory_region. As a result, the private memory ADD'ed before
it gets lost. I haven't worked out a good solution to the ordering issue, so
just skip the pc.rom setup to avoid the memslot deletion.
---
 hw/i386/pc.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/hw/i386/pc.c b/hw/i386/pc.c
index cf0afc15a558..91d8243e1dd6 100644
--- a/hw/i386/pc.c
+++ b/hw/i386/pc.c
@@ -43,6 +43,7 @@
 #include "sysemu/xen.h"
 #include "sysemu/reset.h"
 #include "kvm/kvm_i386.h"
+#include "kvm/tdx.h"
 #include "hw/xen/xen.h"
 #include "qapi/qmp/qlist.h"
 #include "qemu/error-report.h"
@@ -1036,16 +1037,18 @@ void pc_memory_init(PCMachineState *pcms,
     /* Initialize PC system firmware */
     pc_system_firmware_init(pcms, rom_memory);
 
-    option_rom_mr = g_malloc(sizeof(*option_rom_mr));
-    memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
-                           &error_fatal);
-    if (pcmc->pci_enabled) {
-        memory_region_set_readonly(option_rom_mr, true);
+    if (!is_tdx_vm()) {
+        option_rom_mr = g_malloc(sizeof(*option_rom_mr));
+        memory_region_init_ram(option_rom_mr, NULL, "pc.rom", PC_ROM_SIZE,
+                               &error_fatal);
+        if (pcmc->pci_enabled) {
+            memory_region_set_readonly(option_rom_mr, true);
+        }
+        memory_region_add_subregion_overlap(rom_memory,
+                                            PC_ROM_MIN_VGA,
+                                            option_rom_mr,
+                                            1);
     }
-    memory_region_add_subregion_overlap(rom_memory,
-                                        PC_ROM_MIN_VGA,
-                                        option_rom_mr,
-                                        1);
 
     fw_cfg = fw_cfg_arch_create(machine,
                                 x86ms->boot_cpus, x86ms->apic_id_limit);
-- 
2.34.1



* [PATCH v3 42/70] i386/tdx: Track mem_ptr for each firmware entry of TDVF
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (40 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 41/70] i386/tdx: Don't initialize pc.rom for TDX VMs Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 43/70] i386/tdx: Track RAM entries for TDX VM Xiaoyao Li
                   ` (27 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

For each TDVF section, QEMU needs to copy the content to guest
private memory via the KVM API (KVM_TDX_INIT_MEM_REGION).

Introduce a field @mem_ptr in TdxFirmwareEntry to track the memory
pointer of each TDVF section, so that QEMU can add/copy them to guest
private memory later.

TDVF sections can be classified into two groups:
 - Firmware itself, e.g., TDVF BFV and CFV, which is located separately
   from guest RAM. Its memory pointer is the bios pointer.

 - Sections located in guest RAM, e.g., TEMP_MEM and TD_HOB.
   mmap a new memory range for them.

Register a machine_init_done callback to set up @mem_ptr for each section.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 hw/i386/tdvf.c         |  1 +
 include/hw/i386/tdvf.h |  7 +++++++
 target/i386/kvm/tdx.c  | 31 +++++++++++++++++++++++++++++++
 3 files changed, 39 insertions(+)

diff --git a/hw/i386/tdvf.c b/hw/i386/tdvf.c
index ff51f40088f0..0a6445705160 100644
--- a/hw/i386/tdvf.c
+++ b/hw/i386/tdvf.c
@@ -189,6 +189,7 @@ int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size)
     }
     g_free(sections);
 
+    fw->mem_ptr = flash_ptr;
     return 0;
 
 err:
diff --git a/include/hw/i386/tdvf.h b/include/hw/i386/tdvf.h
index 593341eb2e93..d880af245a73 100644
--- a/include/hw/i386/tdvf.h
+++ b/include/hw/i386/tdvf.h
@@ -39,13 +39,20 @@ typedef struct TdxFirmwareEntry {
     uint64_t size;
     uint32_t type;
     uint32_t attributes;
+
+    void *mem_ptr;
 } TdxFirmwareEntry;
 
 typedef struct TdxFirmware {
+    void *mem_ptr;
+
     uint32_t nr_entries;
     TdxFirmwareEntry *entries;
 } TdxFirmware;
 
+#define for_each_tdx_fw_entry(fw, e)    \
+    for (e = (fw)->entries; e != (fw)->entries + (fw)->nr_entries; e++)
+
 int tdvf_parse_metadata(TdxFirmware *fw, void *flash_ptr, int size);
 
 #endif /* HW_I386_TDVF_H */
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index cfe623fdd4e6..03c7f7ab2720 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -14,6 +14,7 @@
 #include "qemu/osdep.h"
 #include "qemu/error-report.h"
 #include "qemu/base64.h"
+#include "qemu/mmap-alloc.h"
 #include "qapi/error.h"
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
@@ -22,6 +23,7 @@
 #include "exec/address-spaces.h"
 
 #include "hw/i386/x86.h"
+#include "hw/i386/tdvf.h"
 #include "kvm_i386.h"
 #include "tdx.h"
 #include "../cpu-internal.h"
@@ -457,6 +459,33 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
             (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32;
 }
 
+static void tdx_finalize_vm(Notifier *notifier, void *unused)
+{
+    TdxFirmware *tdvf = &tdx_guest->tdvf;
+    TdxFirmwareEntry *entry;
+
+    for_each_tdx_fw_entry(tdvf, entry) {
+        switch (entry->type) {
+        case TDVF_SECTION_TYPE_BFV:
+        case TDVF_SECTION_TYPE_CFV:
+            entry->mem_ptr = tdvf->mem_ptr + entry->data_offset;
+            break;
+        case TDVF_SECTION_TYPE_TD_HOB:
+        case TDVF_SECTION_TYPE_TEMP_MEM:
+            entry->mem_ptr = qemu_ram_mmap(-1, entry->size,
+                                           qemu_real_host_page_size(), 0, 0);
+            break;
+        default:
+            error_report("Unsupported TDVF section %d", entry->type);
+            exit(1);
+        }
+    }
+}
+
+static Notifier tdx_machine_done_notify = {
+    .notify = tdx_finalize_vm,
+};
+
 int tdx_kvm_init(MachineState *ms, Error **errp)
 {
     TdxGuest *tdx = TDX_GUEST(OBJECT(ms->cgs));
@@ -482,6 +511,8 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
      */
     kvm_readonly_mem_allowed = false;
 
+    qemu_add_machine_init_done_notifier(&tdx_machine_done_notify);
+
     tdx_guest = tdx;
     return 0;
 }
-- 
2.34.1



* [PATCH v3 43/70] i386/tdx: Track RAM entries for TDX VM
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (41 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 42/70] i386/tdx: Track mem_ptr for each firmware entry of TDVF Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 44/70] headers: Add definitions from UEFI spec for volumes, resources, etc Xiaoyao Li
                   ` (26 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

The RAM of a TDX VM can be classified into two types:

 - TDX_RAM_UNACCEPTED: the default type of TDX memory, which needs to be
   accepted by the TDX guest before it can be used and will be all-zeros
   after being accepted.

 - TDX_RAM_ADDED: RAM that is ADD'ed to the TD guest before it runs and
   can be used directly, e.g., the TD HOB and TEMP MEM needed by TDVF.

Maintain TdxRamEntries[], which grabs the initial RAM info from the e820
table and marks each RAM range as the default type TDX_RAM_UNACCEPTED.

Then turn the ranges of TD HOB and TEMP MEM into TDX_RAM_ADDED, since
these ranges will be ADD'ed before the TD runs and need not be accepted
at runtime.

TdxRamEntries[] is later used to set up the TD's memory resource HOBs,
which pass memory info from QEMU to TDVF.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
Changes in v3:
- use enum TdxRamType in struct TdxRamEntry; (Isaku)
- Fix the indentation; (Daniel)

Changes in v1:
  - simplify the algorithm of tdx_accept_ram_range() (Suggested-by: Gerd Hoffmann)
    (1) Change the existing entry to cover the accepted ram range.
    (2) If there is room before the accepted ram range, add a
        TDX_RAM_UNACCEPTED entry for that.
    (3) If there is room after the accepted ram range, add a
        TDX_RAM_UNACCEPTED entry for that.
---
 target/i386/kvm/tdx.c | 111 ++++++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.h |  14 ++++++
 2 files changed, 125 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 03c7f7ab2720..8b60d1c65a7d 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -22,6 +22,7 @@
 #include "sysemu/sysemu.h"
 #include "exec/address-spaces.h"
 
+#include "hw/i386/e820_memory_layout.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/tdvf.h"
 #include "kvm_i386.h"
@@ -459,11 +460,117 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
             (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32;
 }
 
+static void tdx_add_ram_entry(uint64_t address, uint64_t length,
+                              enum TdxRamType type)
+{
+    uint32_t nr_entries = tdx_guest->nr_ram_entries;
+    tdx_guest->ram_entries = g_renew(TdxRamEntry, tdx_guest->ram_entries,
+                                     nr_entries + 1);
+
+    tdx_guest->ram_entries[nr_entries].address = address;
+    tdx_guest->ram_entries[nr_entries].length = length;
+    tdx_guest->ram_entries[nr_entries].type = type;
+    tdx_guest->nr_ram_entries++;
+}
+
+static int tdx_accept_ram_range(uint64_t address, uint64_t length)
+{
+    uint64_t head_start, tail_start, head_length, tail_length;
+    uint64_t tmp_address, tmp_length;
+    TdxRamEntry *e;
+    int i;
+
+    for (i = 0; i < tdx_guest->nr_ram_entries; i++) {
+        e = &tdx_guest->ram_entries[i];
+
+        if (address + length <= e->address ||
+            e->address + e->length <= address) {
+            continue;
+        }
+
+        /*
+         * The to-be-accepted ram range must be fully contained by one
+         * RAM entry.
+         */
+        if (e->address > address ||
+            e->address + e->length < address + length) {
+            return -EINVAL;
+        }
+
+        if (e->type == TDX_RAM_ADDED) {
+            return -EINVAL;
+        }
+
+        break;
+    }
+
+    if (i == tdx_guest->nr_ram_entries) {
+        return -1;
+    }
+
+    tmp_address = e->address;
+    tmp_length = e->length;
+
+    e->address = address;
+    e->length = length;
+    e->type = TDX_RAM_ADDED;
+
+    head_length = address - tmp_address;
+    if (head_length > 0) {
+        head_start = tmp_address;
+        tdx_add_ram_entry(head_start, head_length, TDX_RAM_UNACCEPTED);
+    }
+
+    tail_start = address + length;
+    if (tail_start < tmp_address + tmp_length) {
+        tail_length = tmp_address + tmp_length - tail_start;
+        tdx_add_ram_entry(tail_start, tail_length, TDX_RAM_UNACCEPTED);
+    }
+
+    return 0;
+}
+
+static int tdx_ram_entry_compare(const void *lhs_, const void *rhs_)
+{
+    const TdxRamEntry *lhs = lhs_;
+    const TdxRamEntry *rhs = rhs_;
+
+    if (lhs->address == rhs->address) {
+        return 0;
+    }
+    if (lhs->address > rhs->address) {
+        return 1;
+    }
+    return -1;
+}
+
+static void tdx_init_ram_entries(void)
+{
+    unsigned i, j, nr_e820_entries;
+
+    nr_e820_entries = e820_get_num_entries();
+    tdx_guest->ram_entries = g_new(TdxRamEntry, nr_e820_entries);
+
+    for (i = 0, j = 0; i < nr_e820_entries; i++) {
+        uint64_t addr, len;
+
+        if (e820_get_entry(i, E820_RAM, &addr, &len)) {
+            tdx_guest->ram_entries[j].address = addr;
+            tdx_guest->ram_entries[j].length = len;
+            tdx_guest->ram_entries[j].type = TDX_RAM_UNACCEPTED;
+            j++;
+        }
+    }
+    tdx_guest->nr_ram_entries = j;
+}
+
 static void tdx_finalize_vm(Notifier *notifier, void *unused)
 {
     TdxFirmware *tdvf = &tdx_guest->tdvf;
     TdxFirmwareEntry *entry;
 
+    tdx_init_ram_entries();
+
     for_each_tdx_fw_entry(tdvf, entry) {
         switch (entry->type) {
         case TDVF_SECTION_TYPE_BFV:
@@ -474,12 +581,16 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
         case TDVF_SECTION_TYPE_TEMP_MEM:
             entry->mem_ptr = qemu_ram_mmap(-1, entry->size,
                                            qemu_real_host_page_size(), 0, 0);
+            tdx_accept_ram_range(entry->address, entry->size);
             break;
         default:
             error_report("Unsupported TDVF section %d", entry->type);
             exit(1);
         }
     }
+
+    qsort(tdx_guest->ram_entries, tdx_guest->nr_ram_entries,
+          sizeof(TdxRamEntry), &tdx_ram_entry_compare);
 }
 
 static Notifier tdx_machine_done_notify = {
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index a46af433135f..3a35a2bc0900 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -15,6 +15,17 @@ typedef struct TdxGuestClass {
     ConfidentialGuestSupportClass parent_class;
 } TdxGuestClass;
 
+enum TdxRamType {
+    TDX_RAM_UNACCEPTED,
+    TDX_RAM_ADDED,
+};
+
+typedef struct TdxRamEntry {
+    uint64_t address;
+    uint64_t length;
+    enum TdxRamType type;
+} TdxRamEntry;
+
 typedef struct TdxGuest {
     ConfidentialGuestSupport parent_obj;
 
@@ -27,6 +38,9 @@ typedef struct TdxGuest {
     char *mrownerconfig;    /* base64 encoded sha348 digest */
 
     TdxFirmware tdvf;
+
+    uint32_t nr_ram_entries;
+    TdxRamEntry *ram_entries;
 } TdxGuest;
 
 #ifdef CONFIG_TDX
-- 
2.34.1



* [PATCH v3 44/70] headers: Add definitions from UEFI spec for volumes, resources, etc...
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (42 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 43/70] i386/tdx: Track RAM entries for TDX VM Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 45/70] i386/tdx: Setup the TD HOB list Xiaoyao Li
                   ` (25 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Add UEFI definitions for literals, enums, structs, GUIDs, etc... that
will be used by TDX to build the UEFI Hand-Off Block (HOB) that is passed
to the Trusted Domain Virtual Firmware (TDVF).

All values come from the UEFI specification [1], the PI spec [2] and the
TDVF design guide [3].

[1] UEFI Specification v2.10 https://uefi.org/sites/default/files/resources/UEFI_Spec_2_10_Aug29.pdf
[2] UEFI PI spec v1.8 https://uefi.org/sites/default/files/resources/UEFI_PI_Spec_1_8_March3.pdf
[3] https://software.intel.com/content/dam/develop/external/us/en/documents/tdx-virtual-firmware-design-guide-rev-1.pdf

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 include/standard-headers/uefi/uefi.h | 198 +++++++++++++++++++++++++++
 1 file changed, 198 insertions(+)
 create mode 100644 include/standard-headers/uefi/uefi.h

diff --git a/include/standard-headers/uefi/uefi.h b/include/standard-headers/uefi/uefi.h
new file mode 100644
index 000000000000..b15aba796156
--- /dev/null
+++ b/include/standard-headers/uefi/uefi.h
@@ -0,0 +1,198 @@
+/*
+ * Copyright (C) 2020 Intel Corporation
+ *
+ * Author: Isaku Yamahata <isaku.yamahata at gmail.com>
+ *                        <isaku.yamahata at intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ *
+ */
+
+#ifndef HW_I386_UEFI_H
+#define HW_I386_UEFI_H
+
+/***************************************************************************/
+/*
+ * basic EFI definitions
+ * supplemented with UEFI Specification Version 2.8 (Errata A)
+ * released February 2020
+ */
+/* UEFI integer is little endian */
+
+typedef struct {
+    uint32_t Data1;
+    uint16_t Data2;
+    uint16_t Data3;
+    uint8_t Data4[8];
+} EFI_GUID;
+
+typedef enum {
+    EfiReservedMemoryType,
+    EfiLoaderCode,
+    EfiLoaderData,
+    EfiBootServicesCode,
+    EfiBootServicesData,
+    EfiRuntimeServicesCode,
+    EfiRuntimeServicesData,
+    EfiConventionalMemory,
+    EfiUnusableMemory,
+    EfiACPIReclaimMemory,
+    EfiACPIMemoryNVS,
+    EfiMemoryMappedIO,
+    EfiMemoryMappedIOPortSpace,
+    EfiPalCode,
+    EfiPersistentMemory,
+    EfiUnacceptedMemoryType,
+    EfiMaxMemoryType
+} EFI_MEMORY_TYPE;
+
+#define EFI_HOB_HANDOFF_TABLE_VERSION 0x0009
+
+#define EFI_HOB_TYPE_HANDOFF              0x0001
+#define EFI_HOB_TYPE_MEMORY_ALLOCATION    0x0002
+#define EFI_HOB_TYPE_RESOURCE_DESCRIPTOR  0x0003
+#define EFI_HOB_TYPE_GUID_EXTENSION       0x0004
+#define EFI_HOB_TYPE_FV                   0x0005
+#define EFI_HOB_TYPE_CPU                  0x0006
+#define EFI_HOB_TYPE_MEMORY_POOL          0x0007
+#define EFI_HOB_TYPE_FV2                  0x0009
+#define EFI_HOB_TYPE_LOAD_PEIM_UNUSED     0x000A
+#define EFI_HOB_TYPE_UEFI_CAPSULE         0x000B
+#define EFI_HOB_TYPE_FV3                  0x000C
+#define EFI_HOB_TYPE_UNUSED               0xFFFE
+#define EFI_HOB_TYPE_END_OF_HOB_LIST      0xFFFF
+
+typedef struct {
+    uint16_t HobType;
+    uint16_t HobLength;
+    uint32_t Reserved;
+} EFI_HOB_GENERIC_HEADER;
+
+typedef uint64_t EFI_PHYSICAL_ADDRESS;
+typedef uint32_t EFI_BOOT_MODE;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    uint32_t Version;
+    EFI_BOOT_MODE BootMode;
+    EFI_PHYSICAL_ADDRESS EfiMemoryTop;
+    EFI_PHYSICAL_ADDRESS EfiMemoryBottom;
+    EFI_PHYSICAL_ADDRESS EfiFreeMemoryTop;
+    EFI_PHYSICAL_ADDRESS EfiFreeMemoryBottom;
+    EFI_PHYSICAL_ADDRESS EfiEndOfHobList;
+} EFI_HOB_HANDOFF_INFO_TABLE;
+
+#define EFI_RESOURCE_SYSTEM_MEMORY          0x00000000
+#define EFI_RESOURCE_MEMORY_MAPPED_IO       0x00000001
+#define EFI_RESOURCE_IO                     0x00000002
+#define EFI_RESOURCE_FIRMWARE_DEVICE        0x00000003
+#define EFI_RESOURCE_MEMORY_MAPPED_IO_PORT  0x00000004
+#define EFI_RESOURCE_MEMORY_RESERVED        0x00000005
+#define EFI_RESOURCE_IO_RESERVED            0x00000006
+#define EFI_RESOURCE_MEMORY_UNACCEPTED      0x00000007
+#define EFI_RESOURCE_MAX_MEMORY_TYPE        0x00000008
+
+#define EFI_RESOURCE_ATTRIBUTE_PRESENT                  0x00000001
+#define EFI_RESOURCE_ATTRIBUTE_INITIALIZED              0x00000002
+#define EFI_RESOURCE_ATTRIBUTE_TESTED                   0x00000004
+#define EFI_RESOURCE_ATTRIBUTE_SINGLE_BIT_ECC           0x00000008
+#define EFI_RESOURCE_ATTRIBUTE_MULTIPLE_BIT_ECC         0x00000010
+#define EFI_RESOURCE_ATTRIBUTE_ECC_RESERVED_1           0x00000020
+#define EFI_RESOURCE_ATTRIBUTE_ECC_RESERVED_2           0x00000040
+#define EFI_RESOURCE_ATTRIBUTE_READ_PROTECTED           0x00000080
+#define EFI_RESOURCE_ATTRIBUTE_WRITE_PROTECTED          0x00000100
+#define EFI_RESOURCE_ATTRIBUTE_EXECUTION_PROTECTED      0x00000200
+#define EFI_RESOURCE_ATTRIBUTE_UNCACHEABLE              0x00000400
+#define EFI_RESOURCE_ATTRIBUTE_WRITE_COMBINEABLE        0x00000800
+#define EFI_RESOURCE_ATTRIBUTE_WRITE_THROUGH_CACHEABLE  0x00001000
+#define EFI_RESOURCE_ATTRIBUTE_WRITE_BACK_CACHEABLE     0x00002000
+#define EFI_RESOURCE_ATTRIBUTE_16_BIT_IO                0x00004000
+#define EFI_RESOURCE_ATTRIBUTE_32_BIT_IO                0x00008000
+#define EFI_RESOURCE_ATTRIBUTE_64_BIT_IO                0x00010000
+#define EFI_RESOURCE_ATTRIBUTE_UNCACHED_EXPORTED        0x00020000
+#define EFI_RESOURCE_ATTRIBUTE_READ_ONLY_PROTECTED      0x00040000
+#define EFI_RESOURCE_ATTRIBUTE_READ_ONLY_PROTECTABLE    0x00080000
+#define EFI_RESOURCE_ATTRIBUTE_READ_PROTECTABLE         0x00100000
+#define EFI_RESOURCE_ATTRIBUTE_WRITE_PROTECTABLE        0x00200000
+#define EFI_RESOURCE_ATTRIBUTE_EXECUTION_PROTECTABLE    0x00400000
+#define EFI_RESOURCE_ATTRIBUTE_PERSISTENT               0x00800000
+#define EFI_RESOURCE_ATTRIBUTE_PERSISTABLE              0x01000000
+#define EFI_RESOURCE_ATTRIBUTE_MORE_RELIABLE            0x02000000
+
+typedef uint32_t EFI_RESOURCE_TYPE;
+typedef uint32_t EFI_RESOURCE_ATTRIBUTE_TYPE;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    EFI_GUID Owner;
+    EFI_RESOURCE_TYPE ResourceType;
+    EFI_RESOURCE_ATTRIBUTE_TYPE ResourceAttribute;
+    EFI_PHYSICAL_ADDRESS PhysicalStart;
+    uint64_t ResourceLength;
+} EFI_HOB_RESOURCE_DESCRIPTOR;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    EFI_GUID Name;
+
+    /* guid specific data follows */
+} EFI_HOB_GUID_TYPE;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    EFI_PHYSICAL_ADDRESS BaseAddress;
+    uint64_t Length;
+} EFI_HOB_FIRMWARE_VOLUME;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    EFI_PHYSICAL_ADDRESS BaseAddress;
+    uint64_t Length;
+    EFI_GUID FvName;
+    EFI_GUID FileName;
+} EFI_HOB_FIRMWARE_VOLUME2;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    EFI_PHYSICAL_ADDRESS BaseAddress;
+    uint64_t Length;
+    uint32_t AuthenticationStatus;
+    bool ExtractedFv;
+    EFI_GUID FvName;
+    EFI_GUID FileName;
+} EFI_HOB_FIRMWARE_VOLUME3;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+    uint8_t SizeOfMemorySpace;
+    uint8_t SizeOfIoSpace;
+    uint8_t Reserved[6];
+} EFI_HOB_CPU;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+} EFI_HOB_MEMORY_POOL;
+
+typedef struct {
+    EFI_HOB_GENERIC_HEADER Header;
+
+    EFI_PHYSICAL_ADDRESS BaseAddress;
+    uint64_t Length;
+} EFI_HOB_UEFI_CAPSULE;
+
+#define EFI_HOB_OWNER_ZERO                                      \
+    ((EFI_GUID){ 0x00000000, 0x0000, 0x0000,                    \
+        { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 } })
+
+#endif
-- 
2.34.1



* [PATCH v3 45/70] i386/tdx: Setup the TD HOB list
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (43 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 44/70] headers: Add definitions from UEFI spec for volumes, resources, etc Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 46/70] i386/tdx: Add TDVF memory via KVM_TDX_INIT_MEM_REGION Xiaoyao Li
                   ` (24 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

The TD HOB list is used to pass information from the VMM to TDVF. The TD
HOB list must include the PHIT HOB and Resource Descriptor HOBs. More
details can be found in the TDVF specification and the PI specification.

Build the TD HOB in TDX's machine_init_done callback.

Co-developed-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>

---
Changes in v1:
  - drop the code that adds MMIO resources, since OVMF prepares all the
    MMIO HOBs itself.
---
 hw/i386/meson.build   |   2 +-
 hw/i386/tdvf-hob.c    | 147 ++++++++++++++++++++++++++++++++++++++++++
 hw/i386/tdvf-hob.h    |  24 +++++++
 target/i386/kvm/tdx.c |  16 +++++
 4 files changed, 188 insertions(+), 1 deletion(-)
 create mode 100644 hw/i386/tdvf-hob.c
 create mode 100644 hw/i386/tdvf-hob.h

diff --git a/hw/i386/meson.build b/hw/i386/meson.build
index 6808bd4e3032..118fd9dae610 100644
--- a/hw/i386/meson.build
+++ b/hw/i386/meson.build
@@ -27,7 +27,7 @@ i386_ss.add(when: 'CONFIG_PC', if_true: files(
   'port92.c'))
 i386_ss.add(when: 'CONFIG_X86_FW_OVMF', if_true: files('pc_sysfw_ovmf.c'),
                                         if_false: files('pc_sysfw_ovmf-stubs.c'))
-i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c'))
+i386_ss.add(when: 'CONFIG_TDX', if_true: files('tdvf.c', 'tdvf-hob.c'))
 
 subdir('kvm')
 subdir('xen')
diff --git a/hw/i386/tdvf-hob.c b/hw/i386/tdvf-hob.c
new file mode 100644
index 000000000000..0da6ff2df576
--- /dev/null
+++ b/hw/i386/tdvf-hob.c
@@ -0,0 +1,147 @@
+/*
+ * SPDX-License-Identifier: GPL-2.0-or-later
+
+ * Copyright (c) 2020 Intel Corporation
+ * Author: Isaku Yamahata <isaku.yamahata at gmail.com>
+ *                        <isaku.yamahata at intel.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+
+ * You should have received a copy of the GNU General Public License along
+ * with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qemu/error-report.h"
+#include "e820_memory_layout.h"
+#include "hw/i386/pc.h"
+#include "hw/i386/x86.h"
+#include "hw/pci/pcie_host.h"
+#include "sysemu/kvm.h"
+#include "standard-headers/uefi/uefi.h"
+#include "tdvf-hob.h"
+
+typedef struct TdvfHob {
+    hwaddr hob_addr;
+    void *ptr;
+    int size;
+
+    /* working area */
+    void *current;
+    void *end;
+} TdvfHob;
+
+static uint64_t tdvf_current_guest_addr(const TdvfHob *hob)
+{
+    return hob->hob_addr + (hob->current - hob->ptr);
+}
+
+static void tdvf_align(TdvfHob *hob, size_t align)
+{
+    hob->current = QEMU_ALIGN_PTR_UP(hob->current, align);
+}
+
+static void *tdvf_get_area(TdvfHob *hob, uint64_t size)
+{
+    void *ret;
+
+    if (hob->current + size > hob->end) {
+        error_report("TD_HOB overrun, size = 0x%" PRIx64, size);
+        exit(1);
+    }
+
+    ret = hob->current;
+    hob->current += size;
+    tdvf_align(hob, 8);
+    return ret;
+}
+
+static void tdvf_hob_add_memory_resources(TdxGuest *tdx, TdvfHob *hob)
+{
+    EFI_HOB_RESOURCE_DESCRIPTOR *region;
+    EFI_RESOURCE_ATTRIBUTE_TYPE attr;
+    EFI_RESOURCE_TYPE resource_type;
+
+    TdxRamEntry *e;
+    int i;
+
+    for (i = 0; i < tdx->nr_ram_entries; i++) {
+        e = &tdx->ram_entries[i];
+
+        if (e->type == TDX_RAM_UNACCEPTED) {
+            resource_type = EFI_RESOURCE_MEMORY_UNACCEPTED;
+            attr = EFI_RESOURCE_ATTRIBUTE_TDVF_UNACCEPTED;
+        } else if (e->type == TDX_RAM_ADDED) {
+            resource_type = EFI_RESOURCE_SYSTEM_MEMORY;
+            attr = EFI_RESOURCE_ATTRIBUTE_TDVF_PRIVATE;
+        } else {
+            error_report("unknown TDX_RAM_ENTRY type %d", e->type);
+            exit(1);
+        }
+
+        region = tdvf_get_area(hob, sizeof(*region));
+        *region = (EFI_HOB_RESOURCE_DESCRIPTOR) {
+            .Header = {
+                .HobType = EFI_HOB_TYPE_RESOURCE_DESCRIPTOR,
+                .HobLength = cpu_to_le16(sizeof(*region)),
+                .Reserved = cpu_to_le32(0),
+            },
+            .Owner = EFI_HOB_OWNER_ZERO,
+            .ResourceType = cpu_to_le32(resource_type),
+            .ResourceAttribute = cpu_to_le32(attr),
+            .PhysicalStart = cpu_to_le64(e->address),
+            .ResourceLength = cpu_to_le64(e->length),
+        };
+    }
+}
+
+void tdvf_hob_create(TdxGuest *tdx, TdxFirmwareEntry *td_hob)
+{
+    TdvfHob hob = {
+        .hob_addr = td_hob->address,
+        .size = td_hob->size,
+        .ptr = td_hob->mem_ptr,
+
+        .current = td_hob->mem_ptr,
+        .end = td_hob->mem_ptr + td_hob->size,
+    };
+
+    EFI_HOB_GENERIC_HEADER *last_hob;
+    EFI_HOB_HANDOFF_INFO_TABLE *hit;
+
+    /* Note, Efi{Free}Memory{Bottom,Top} are ignored, leave 'em zeroed. */
+    hit = tdvf_get_area(&hob, sizeof(*hit));
+    *hit = (EFI_HOB_HANDOFF_INFO_TABLE) {
+        .Header = {
+            .HobType = EFI_HOB_TYPE_HANDOFF,
+            .HobLength = cpu_to_le16(sizeof(*hit)),
+            .Reserved = cpu_to_le32(0),
+        },
+        .Version = cpu_to_le32(EFI_HOB_HANDOFF_TABLE_VERSION),
+        .BootMode = cpu_to_le32(0),
+        .EfiMemoryTop = cpu_to_le64(0),
+        .EfiMemoryBottom = cpu_to_le64(0),
+        .EfiFreeMemoryTop = cpu_to_le64(0),
+        .EfiFreeMemoryBottom = cpu_to_le64(0),
+        .EfiEndOfHobList = cpu_to_le64(0), /* initialized later */
+    };
+
+    tdvf_hob_add_memory_resources(tdx, &hob);
+
+    last_hob = tdvf_get_area(&hob, sizeof(*last_hob));
+    *last_hob = (EFI_HOB_GENERIC_HEADER) {
+        .HobType = EFI_HOB_TYPE_END_OF_HOB_LIST,
+        .HobLength = cpu_to_le16(sizeof(*last_hob)),
+        .Reserved = cpu_to_le32(0),
+    };
+    hit->EfiEndOfHobList = cpu_to_le64(tdvf_current_guest_addr(&hob));
+}
diff --git a/hw/i386/tdvf-hob.h b/hw/i386/tdvf-hob.h
new file mode 100644
index 000000000000..1b737e946a8d
--- /dev/null
+++ b/hw/i386/tdvf-hob.h
@@ -0,0 +1,24 @@
+#ifndef HW_I386_TD_HOB_H
+#define HW_I386_TD_HOB_H
+
+#include "hw/i386/tdvf.h"
+#include "target/i386/kvm/tdx.h"
+
+void tdvf_hob_create(TdxGuest *tdx, TdxFirmwareEntry *td_hob);
+
+#define EFI_RESOURCE_ATTRIBUTE_TDVF_PRIVATE     \
+    (EFI_RESOURCE_ATTRIBUTE_PRESENT |           \
+     EFI_RESOURCE_ATTRIBUTE_INITIALIZED |       \
+     EFI_RESOURCE_ATTRIBUTE_TESTED)
+
+#define EFI_RESOURCE_ATTRIBUTE_TDVF_UNACCEPTED  \
+    (EFI_RESOURCE_ATTRIBUTE_PRESENT |           \
+     EFI_RESOURCE_ATTRIBUTE_INITIALIZED |       \
+     EFI_RESOURCE_ATTRIBUTE_TESTED)
+
+#define EFI_RESOURCE_ATTRIBUTE_TDVF_MMIO        \
+    (EFI_RESOURCE_ATTRIBUTE_PRESENT     |       \
+     EFI_RESOURCE_ATTRIBUTE_INITIALIZED |       \
+     EFI_RESOURCE_ATTRIBUTE_UNCACHEABLE)
+
+#endif
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 8b60d1c65a7d..2e286087b232 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -25,6 +25,7 @@
 #include "hw/i386/e820_memory_layout.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/tdvf.h"
+#include "hw/i386/tdvf-hob.h"
 #include "kvm_i386.h"
 #include "tdx.h"
 #include "../cpu-internal.h"
@@ -460,6 +461,19 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
             (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32;
 }
 
+static TdxFirmwareEntry *tdx_get_hob_entry(TdxGuest *tdx)
+{
+    TdxFirmwareEntry *entry;
+
+    for_each_tdx_fw_entry(&tdx->tdvf, entry) {
+        if (entry->type == TDVF_SECTION_TYPE_TD_HOB) {
+            return entry;
+        }
+    }
+    error_report("TDVF metadata doesn't specify TD_HOB location.");
+    exit(1);
+}
+
 static void tdx_add_ram_entry(uint64_t address, uint64_t length,
                               enum TdxRamType type)
 {
@@ -591,6 +605,8 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
 
     qsort(tdx_guest->ram_entries, tdx_guest->nr_ram_entries,
           sizeof(TdxRamEntry), &tdx_ram_entry_compare);
+
+    tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest));
 }
 
 static Notifier tdx_machine_done_notify = {
-- 
2.34.1



* [PATCH v3 46/70] i386/tdx: Add TDVF memory via KVM_TDX_INIT_MEM_REGION
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (44 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 45/70] i386/tdx: Setup the TD HOB list Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 47/70] memory: Introduce memory_region_init_ram_guest_memfd() Xiaoyao Li
                   ` (23 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

TDVF firmware (CODE and VARS) must be copied into the TD's private
memory via KVM_TDX_INIT_MEM_REGION, as must the TD HOB and TEMP memory.
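
For illustration only, the per-section arguments could be derived roughly as
in the sketch below. The struct, the helper names, and the SKETCH_* constants
are hypothetical stand-ins and not the definitions from hw/i386/tdvf.h or the
KVM headers. The alignment check is an assumption of the sketch; the patch
itself simply divides the section size by 4096.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the real TDVF/KVM definitions. */
#define SKETCH_PAGE_SIZE        4096u
#define SKETCH_ATTR_MR_EXTEND   (1u << 0)
#define SKETCH_MEASURE_MEMORY   (1u << 0)

/* Minimal model of one TDVF section (not the real TdxFirmwareEntry). */
struct sketch_section {
    uint64_t gpa;
    uint64_t size;
    uint32_t attributes;
};

/* Number of 4 KiB pages to add; 0 if the size is not page aligned. */
static uint64_t sketch_nr_pages(const struct sketch_section *s)
{
    if (s->size % SKETCH_PAGE_SIZE != 0) {
        return 0;
    }
    return s->size / SKETCH_PAGE_SIZE;
}

/* Request measurement only for sections marked MR_EXTEND. */
static uint32_t sketch_flags(const struct sketch_section *s)
{
    return (s->attributes & SKETCH_ATTR_MR_EXTEND) ? SKETCH_MEASURE_MEMORY : 0;
}
```

A 16 KiB section marked MR_EXTEND would yield four pages with the measure
flag set, while a section without that attribute is added unmeasured.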

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>

---
Changes in v1:
  - rename variable @metadata to @flags
---
 target/i386/kvm/tdx.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 2e286087b232..6bb3249fa610 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -582,6 +582,7 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
 {
     TdxFirmware *tdvf = &tdx_guest->tdvf;
     TdxFirmwareEntry *entry;
+    int r;
 
     tdx_init_ram_entries();
 
@@ -607,6 +608,29 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
           sizeof(TdxRamEntry), &tdx_ram_entry_compare);
 
     tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest));
+
+    for_each_tdx_fw_entry(tdvf, entry) {
+        struct kvm_tdx_init_mem_region mem_region = {
+            .source_addr = (__u64)entry->mem_ptr,
+            .gpa = entry->address,
+            .nr_pages = entry->size / 4096,
+        };
+
+        __u32 flags = entry->attributes & TDVF_SECTION_ATTRIBUTES_MR_EXTEND ?
+                      KVM_TDX_MEASURE_MEMORY_REGION : 0;
+
+        r = tdx_vm_ioctl(KVM_TDX_INIT_MEM_REGION, flags, &mem_region);
+        if (r < 0) {
+            error_report("KVM_TDX_INIT_MEM_REGION failed %s", strerror(-r));
+            exit(1);
+        }
+
+        if (entry->type == TDVF_SECTION_TYPE_TD_HOB ||
+            entry->type == TDVF_SECTION_TYPE_TEMP_MEM) {
+            qemu_ram_munmap(-1, entry->mem_ptr, entry->size);
+            entry->mem_ptr = NULL;
+        }
+    }
 }
 
 static Notifier tdx_machine_done_notify = {
-- 
2.34.1



* [PATCH v3 47/70] memory: Introduce memory_region_init_ram_guest_memfd()
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (45 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 46/70] i386/tdx: Add TDVF memory via KVM_TDX_INIT_MEM_REGION Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 48/70] i386/tdx: register TDVF as private memory Xiaoyao Li
                   ` (22 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Introduce memory_region_init_ram_guest_memfd() to allocate private
guest memfd when a MemoryRegion is initialized. It is intended for TDVF,
which must be private in the TDX case.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 include/exec/memory.h |  6 ++++++
 system/memory.c       | 27 +++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/include/exec/memory.h b/include/exec/memory.h
index c8b0385b19ad..ca23a1a6b336 100644
--- a/include/exec/memory.h
+++ b/include/exec/memory.h
@@ -1590,6 +1590,12 @@ void memory_region_init_ram(MemoryRegion *mr,
                             uint64_t size,
                             Error **errp);
 
+void memory_region_init_ram_guest_memfd(MemoryRegion *mr,
+                                        Object *owner,
+                                        const char *name,
+                                        uint64_t size,
+                                        Error **errp);
+
 /**
  * memory_region_init_rom: Initialize a ROM memory region.
  *
diff --git a/system/memory.c b/system/memory.c
index b0c58232b6f7..166eb9fd6f7d 100644
--- a/system/memory.c
+++ b/system/memory.c
@@ -3632,6 +3632,33 @@ void memory_region_init_ram(MemoryRegion *mr,
     vmstate_register_ram(mr, owner_dev);
 }
 
+void memory_region_init_ram_guest_memfd(MemoryRegion *mr,
+                                        Object *owner,
+                                        const char *name,
+                                        uint64_t size,
+                                        Error **errp)
+{
+    DeviceState *owner_dev;
+    Error *err = NULL;
+
+    memory_region_init_ram_flags_nomigrate(mr, owner, name, size,
+                                           RAM_GUEST_MEMFD, &err);
+    if (err) {
+        error_propagate(errp, err);
+        return;
+    }
+    memory_region_set_default_private(mr);
+
+    /* This will assert if owner is neither NULL nor a DeviceState.
+     * We only want the owner here for the purposes of defining a
+     * unique name for migration. TODO: Ideally we should implement
+     * a naming scheme for Objects which are not DeviceStates, in
+     * which case we can relax this restriction.
+     */
+    owner_dev = DEVICE(owner);
+    vmstate_register_ram(mr, owner_dev);
+}
+
 void memory_region_init_rom(MemoryRegion *mr,
                             Object *owner,
                             const char *name,
-- 
2.34.1



* [PATCH v3 48/70] i386/tdx: register TDVF as private memory
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (46 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 47/70] memory: Introduce memory_region_init_ram_guest_memfd() Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 49/70] i386/tdx: Call KVM_TDX_INIT_VCPU to initialize TDX vcpu Xiaoyao Li
                   ` (21 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Chao Peng <chao.p.peng@linux.intel.com>

Allocate private guest memfd memory for the BIOS if it's a TD VM.

Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 hw/i386/x86.c         | 10 +++++++++-
 target/i386/kvm/tdx.c | 18 ++++++++++++++++++
 target/i386/kvm/tdx.h |  2 ++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 2f299355a5e3..0f69b55c5219 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1146,8 +1146,16 @@ void x86_bios_rom_init(MachineState *ms, const char *default_firmware,
         (bios_size % 65536) != 0) {
         goto bios_error;
     }
+
     bios = g_malloc(sizeof(*bios));
-    memory_region_init_ram(bios, NULL, "pc.bios", bios_size, &error_fatal);
+    if (is_tdx_vm()) {
+        memory_region_init_ram_guest_memfd(bios, NULL, "pc.bios", bios_size,
+                                           &error_fatal);
+        tdx_set_tdvf_region(bios);
+    } else {
+        memory_region_init_ram(bios, NULL, "pc.bios", bios_size, &error_fatal);
+    }
+
     if (sev_enabled() || is_tdx_vm()) {
         /*
          * The concept of a "reset" simply doesn't exist for
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 6bb3249fa610..4b8c13890b11 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -21,6 +21,7 @@
 #include "sysemu/kvm.h"
 #include "sysemu/sysemu.h"
 #include "exec/address-spaces.h"
+#include "exec/ramblock.h"
 
 #include "hw/i386/e820_memory_layout.h"
 #include "hw/i386/x86.h"
@@ -461,6 +462,12 @@ static void update_tdx_cpuid_lookup_by_tdx_caps(void)
             (tdx_caps->xfam_fixed1 & CPUID_XSTATE_XSS_MASK) >> 32;
 }
 
+void tdx_set_tdvf_region(MemoryRegion *tdvf_region)
+{
+    assert(!tdx_guest->tdvf_region);
+    tdx_guest->tdvf_region = tdvf_region;
+}
+
 static TdxFirmwareEntry *tdx_get_hob_entry(TdxGuest *tdx)
 {
     TdxFirmwareEntry *entry;
@@ -582,6 +589,7 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
 {
     TdxFirmware *tdvf = &tdx_guest->tdvf;
     TdxFirmwareEntry *entry;
+    RAMBlock *ram_block;
     int r;
 
     tdx_init_ram_entries();
@@ -616,6 +624,12 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
             .nr_pages = entry->size / 4096,
         };
 
+        r = kvm_set_memory_attributes_private(entry->address, entry->size);
+        if (r < 0) {
+            error_report("Reserve initial private memory failed %s", strerror(-r));
+            exit(1);
+        }
+
         __u32 flags = entry->attributes & TDVF_SECTION_ATTRIBUTES_MR_EXTEND ?
                       KVM_TDX_MEASURE_MEMORY_REGION : 0;
 
@@ -631,6 +645,10 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
             entry->mem_ptr = NULL;
         }
     }
+
+    /* Tdvf image was copied into private region above. It becomes unnecessary. */
+    ram_block = tdx_guest->tdvf_region->ram_block;
+    ram_block_discard_range(ram_block, 0, ram_block->max_length);
 }
 
 static Notifier tdx_machine_done_notify = {
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 3a35a2bc0900..5fb20a5f06bb 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -38,6 +38,7 @@ typedef struct TdxGuest {
     char *mrownerconfig;    /* base64 encoded sha348 digest */
 
     TdxFirmware tdvf;
+    MemoryRegion *tdvf_region;
 
     uint32_t nr_ram_entries;
     TdxRamEntry *ram_entries;
@@ -53,6 +54,7 @@ int tdx_kvm_init(MachineState *ms, Error **errp);
 void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
                              uint32_t *ret);
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp);
+void tdx_set_tdvf_region(MemoryRegion *tdvf_region);
 int tdx_parse_tdvf(void *flash_ptr, int size);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 49/70] i386/tdx: Call KVM_TDX_INIT_VCPU to initialize TDX vcpu
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (47 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 48/70] i386/tdx: register TDVF as private memory Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:14 ` [PATCH v3 50/70] i386/tdx: Finalize TDX VM Xiaoyao Li
                   ` (20 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

A TDX vCPU needs to be initialized by SEAMCALL(TDH.VP.INIT), and KVM
provides the vCPU-level ioctl KVM_TDX_INIT_VCPU for it.

KVM_TDX_INIT_VCPU takes the address of the HOB as input. Invoke it for
each vCPU after the HOB list is created.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/tdx.c | 18 ++++++++++++++++++
 1 file changed, 18 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 4b8c13890b11..e55c1190c27e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -585,6 +585,22 @@ static void tdx_init_ram_entries(void)
     tdx_guest->nr_ram_entries = j;
 }
 
+static void tdx_post_init_vcpus(void)
+{
+    TdxFirmwareEntry *hob;
+    CPUState *cpu;
+    int r;
+
+    hob = tdx_get_hob_entry(tdx_guest);
+    CPU_FOREACH(cpu) {
+        r = tdx_vcpu_ioctl(cpu, KVM_TDX_INIT_VCPU, 0, (void *)hob->address);
+        if (r < 0) {
+            error_report("KVM_TDX_INIT_VCPU failed %s", strerror(-r));
+            exit(1);
+        }
+    }
+}
+
 static void tdx_finalize_vm(Notifier *notifier, void *unused)
 {
     TdxFirmware *tdvf = &tdx_guest->tdvf;
@@ -617,6 +633,8 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
 
     tdvf_hob_create(tdx_guest, tdx_get_hob_entry(tdx_guest));
 
+    tdx_post_init_vcpus();
+
     for_each_tdx_fw_entry(tdvf, entry) {
         struct kvm_tdx_init_mem_region mem_region = {
             .source_addr = (__u64)entry->mem_ptr,
-- 
2.34.1



* [PATCH v3 50/70] i386/tdx: Finalize TDX VM
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (48 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 49/70] i386/tdx: Call KVM_TDX_INIT_VCPU to initialize TDX vcpu Xiaoyao Li
@ 2023-11-15  7:14 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 51/70] i386/tdx: handle TDG.VP.VMCALL<SetupEventNotifyInterrupt> Xiaoyao Li
                   ` (19 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:14 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Invoke KVM_TDX_FINALIZE_VM to finalize the TD's measurement and make
the TD vCPUs runnable once machine initialization is complete.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/tdx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index e55c1190c27e..fc71038d7808 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -667,6 +667,13 @@ static void tdx_finalize_vm(Notifier *notifier, void *unused)
     /* Tdvf image was copied into private region above. It becomes unnecessary. */
     ram_block = tdx_guest->tdvf_region->ram_block;
     ram_block_discard_range(ram_block, 0, ram_block->max_length);
+
+    r = tdx_vm_ioctl(KVM_TDX_FINALIZE_VM, 0, NULL);
+    if (r < 0) {
+        error_report("KVM_TDX_FINALIZE_VM failed %s", strerror(-r));
+        exit(1);
+    }
+    tdx_guest->parent_obj.ready = true;
 }
 
 static Notifier tdx_machine_done_notify = {
-- 
2.34.1



* [PATCH v3 51/70] i386/tdx: handle TDG.VP.VMCALL<SetupEventNotifyInterrupt>
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (49 preceding siblings ...)
  2023-11-15  7:14 ` [PATCH v3 50/70] i386/tdx: Finalize TDX VM Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote> Xiaoyao Li
                   ` (18 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

For SetupEventNotifyInterrupt, record the interrupt vector and the APIC ID
of the vCPU that received this TDVMCALL.

Later, QEMU can inject an interrupt with the given vector into the
specific vCPU that received SetupEventNotifyInterrupt.
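
As a rough sketch of the bookkeeping described above (not the code in this
patch), the vector check and state update could look like the following. The
struct and function names are hypothetical. The sketch validates the full
64-bit register value before narrowing it, which is a choice made here for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the notification state kept per TD. */
struct sketch_notify_state {
    int vector;         /* -1 until the guest registers a vector */
    uint32_t apic_id;   /* APIC ID of the vCPU that registered it */
};

/*
 * Accept only vectors in the range 32..255 (0..31 are reserved for
 * exceptions on x86). Returns true and records the state on success;
 * leaves the state untouched otherwise.
 */
static bool sketch_setup_event_notify(struct sketch_notify_state *st,
                                      uint64_t r12, uint32_t apic_id)
{
    if (r12 < 32 || r12 > 255) {
        return false;
    }
    st->vector = (int)r12;
    st->apic_id = apic_id;
    return true;
}
```

A rejected request leaves any previously registered vector in place, so a
later notification still targets the last valid registration.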

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/kvm/kvm.c      |  9 ++++++
 target/i386/kvm/tdx-stub.c |  5 ++++
 target/i386/kvm/tdx.c      | 61 ++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.h      |  6 ++++
 4 files changed, 81 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index d09d9f4eee94..f1c4dd759b3e 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5414,6 +5414,15 @@ int kvm_arch_handle_exit(CPUState *cs, struct kvm_run *run)
         ret = kvm_xen_handle_exit(cpu, &run->xen);
         break;
 #endif
+    case KVM_EXIT_TDX:
+        if (!is_tdx_vm()) {
+            error_report("KVM: got KVM_EXIT_TDX for a non-TDX VM.");
+            ret = -1;
+            break;
+        }
+        tdx_handle_exit(cpu, &run->tdx);
+        ret = 0;
+        break;
     default:
         fprintf(stderr, "KVM: unknown exit reason %d\n", run->exit_reason);
         ret = -1;
diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
index 587dbeeed196..14c11a2338fc 100644
--- a/target/i386/kvm/tdx-stub.c
+++ b/target/i386/kvm/tdx-stub.c
@@ -16,3 +16,8 @@ int tdx_parse_tdvf(void *flash_ptr, int size)
 {
     return -EINVAL;
 }
+
+void tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit)
+{
+    abort();
+}
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index fc71038d7808..5fc5d857fb6f 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -956,6 +956,9 @@ static void tdx_guest_init(Object *obj)
     object_property_add_str(obj, "mrownerconfig",
                             tdx_guest_get_mrownerconfig,
                             tdx_guest_set_mrownerconfig);
+
+    tdx->event_notify_interrupt = -1;
+    tdx->event_notify_apic_id = -1;
 }
 
 static void tdx_guest_finalize(Object *obj)
@@ -965,3 +968,61 @@ static void tdx_guest_finalize(Object *obj)
 static void tdx_guest_class_init(ObjectClass *oc, void *data)
 {
 }
+
+#define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT      0x10004ULL
+
+#define TDG_VP_VMCALL_SUCCESS           0x0000000000000000ULL
+#define TDG_VP_VMCALL_RETRY             0x0000000000000001ULL
+#define TDG_VP_VMCALL_INVALID_OPERAND   0x8000000000000000ULL
+#define TDG_VP_VMCALL_GPA_INUSE         0x8000000000000001ULL
+#define TDG_VP_VMCALL_ALIGN_ERROR       0x8000000000000002ULL
+
+static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
+                                                    struct kvm_tdx_vmcall *vmcall)
+{
+    MachineState *ms = MACHINE(qdev_get_machine());
+    TdxGuest *tdx = TDX_GUEST(ms->cgs);
+    int event_notify_interrupt = vmcall->in_r12;
+
+    if (32 <= event_notify_interrupt && event_notify_interrupt <= 255) {
+        qemu_mutex_lock(&tdx->lock);
+        tdx->event_notify_interrupt = event_notify_interrupt;
+        tdx->event_notify_apic_id = cpu->apic_id;
+        qemu_mutex_unlock(&tdx->lock);
+        vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+    }
+}
+
+static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
+{
+    vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+    /* For now handle only TDG.VP.VMCALL. */
+    if (vmcall->type != 0) {
+        warn_report("unknown tdg.vp.vmcall type 0x%llx subfunction 0x%llx",
+                    vmcall->type, vmcall->subfunction);
+        return;
+    }
+
+    switch (vmcall->subfunction) {
+    case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
+        tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
+        break;
+    default:
+        warn_report("unknown tdg.vp.vmcall type 0x%llx subfunction 0x%llx",
+                    vmcall->type, vmcall->subfunction);
+        break;
+    }
+}
+
+void tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit)
+{
+    switch (tdx_exit->type) {
+    case KVM_EXIT_TDX_VMCALL:
+        tdx_handle_vmcall(cpu, &tdx_exit->u.vmcall);
+        break;
+    default:
+        warn_report("unknown tdx exit type 0x%x", tdx_exit->type);
+        break;
+    }
+}
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 5fb20a5f06bb..4a8d67cc9fdb 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -7,6 +7,7 @@
 
 #include "exec/confidential-guest-support.h"
 #include "hw/i386/tdvf.h"
+#include "sysemu/kvm.h"
 
 #define TYPE_TDX_GUEST "tdx-guest"
 #define TDX_GUEST(obj)  OBJECT_CHECK(TdxGuest, (obj), TYPE_TDX_GUEST)
@@ -42,6 +43,10 @@ typedef struct TdxGuest {
 
     uint32_t nr_ram_entries;
     TdxRamEntry *ram_entries;
+
+    /* runtime state */
+    int event_notify_interrupt;
+    uint32_t event_notify_apic_id;
 } TdxGuest;
 
 #ifdef CONFIG_TDX
@@ -56,5 +61,6 @@ void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
 int tdx_pre_create_vcpu(CPUState *cpu, Error **errp);
 void tdx_set_tdvf_region(MemoryRegion *tdvf_region);
 int tdx_parse_tdvf(void *flash_ptr, int size);
+void tdx_handle_exit(X86CPU *cpu, struct kvm_tdx_exit *tdx_exit);
 
 #endif /* QEMU_I386_TDX_H */
-- 
2.34.1



* [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (50 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 51/70] i386/tdx: handle TDG.VP.VMCALL<SetupEventNotifyInterrupt> Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15 17:51   ` Daniel P. Berrangé
                     ` (3 more replies)
  2023-11-15  7:15 ` [PATCH v3 53/70] i386/tdx: setup a timer for the qio channel Xiaoyao Li
                   ` (17 subsequent siblings)
  69 siblings, 4 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

For GetQuote, delegate the request to the Quote Generation Service (QGS).
Add the property "quote-generation-socket" to tdx-guest, which is a
property of type SocketAddress that specifies the address of the QGS.

On request, connect to the QGS, read the request buffer from shared guest
memory, send the request buffer to the server, store the response in
shared guest memory, and notify the TD guest by interrupt.

command line example:
  qemu-system-x86_64 \
    -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
    -machine confidential-guest-support=tdx0
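
For illustration, decoding and sanity-checking the header at the start of
the request buffer shared by the guest could look roughly like the sketch
below. The field layout follows struct tdx_get_quote_header from this patch
(two 64-bit fields followed by two 32-bit fields, all little endian). The
helper names and the exact set of checks are assumptions of the sketch, not
the validation performed by the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_HDR_SIZE 24u   /* 8 + 8 + 4 + 4 bytes */

/* Host-endian copy of the shared header fields. */
struct sketch_quote_hdr {
    uint64_t structure_version;
    uint64_t error_code;
    uint32_t in_len;
    uint32_t out_len;
};

/* Read an n-byte little-endian integer (n <= 8). */
static uint64_t sketch_le(const uint8_t *p, size_t n)
{
    uint64_t v = 0;
    for (size_t i = 0; i < n; i++) {
        v |= (uint64_t)p[i] << (8 * i);
    }
    return v;
}

/*
 * Decode the header and check that the version is 1, that error_code
 * and out_len are zero on request, and that the in-message fits in
 * the buffer after the header.
 */
static bool sketch_parse_hdr(const uint8_t *buf, uint64_t buf_len,
                             struct sketch_quote_hdr *h)
{
    if (buf_len < SKETCH_HDR_SIZE) {
        return false;
    }
    h->structure_version = sketch_le(buf, 8);
    h->error_code = sketch_le(buf + 8, 8);
    h->in_len = (uint32_t)sketch_le(buf + 16, 4);
    h->out_len = (uint32_t)sketch_le(buf + 20, 4);

    return h->structure_version == 1 &&
           h->error_code == 0 &&
           h->out_len == 0 &&
           h->in_len <= buf_len - SKETCH_HDR_SIZE;
}
```

Decoding byte by byte keeps the sketch independent of host endianness and of
the alignment of the shared buffer.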

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Co-developed-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
- rename property "quote-generation-service" to "quote-generation-socket";
- change the type of "quote-generation-socket" from str to
  SocketAddress;
- squash next patch into this one;
---
 qapi/qom.json         |   5 +-
 target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.h |   6 +
 3 files changed, 440 insertions(+), 1 deletion(-)

diff --git a/qapi/qom.json b/qapi/qom.json
index fd99aa1ff8cc..cf36a1832ddd 100644
--- a/qapi/qom.json
+++ b/qapi/qom.json
@@ -894,13 +894,16 @@
 #
 # @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
 #
+# @quote-generation-socket: socket address for Quote Generation Service (QGS)
+#
 # Since: 8.2
 ##
 { 'struct': 'TdxGuestProperties',
   'data': { '*sept-ve-disable': 'bool',
             '*mrconfigid': 'str',
             '*mrowner': 'str',
-            '*mrownerconfig': 'str' } }
+            '*mrownerconfig': 'str',
+            '*quote-generation-socket': 'SocketAddress' } }
 
 ##
 # @ThreadContextProperties:
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 5fc5d857fb6f..54b38c031fb3 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -16,6 +16,7 @@
 #include "qemu/base64.h"
 #include "qemu/mmap-alloc.h"
 #include "qapi/error.h"
+#include "qapi/qapi-visit-sockets.h"
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "sysemu/kvm.h"
@@ -23,6 +24,8 @@
 #include "exec/address-spaces.h"
 #include "exec/ramblock.h"
 
+#include "exec/address-spaces.h"
+#include "hw/i386/apic_internal.h"
 #include "hw/i386/e820_memory_layout.h"
 #include "hw/i386/x86.h"
 #include "hw/i386/tdvf.h"
@@ -923,6 +926,29 @@ static void tdx_guest_set_mrownerconfig(Object *obj, const char *value, Error **
     tdx->mrconfigid = g_strdup(value);
 }
 
+static void tdx_guest_get_quote_generation(Object *obj, Visitor *v,
+                                           const char *name, void *opaque,
+                                           Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+
+    visit_type_SocketAddress(v, name, &tdx->quote_generation, errp);
+}
+
+static void tdx_guest_set_quote_generation(Object *obj, Visitor *v,
+                                           const char *name, void *opaque,
+                                           Error **errp)
+{
+    TdxGuest *tdx = TDX_GUEST(obj);
+    SocketAddress *sock = NULL;
+
+    if (!visit_type_SocketAddress(v, name, &sock, errp)) {
+        return;
+    }
+
+    tdx->quote_generation = sock;
+}
+
 /* tdx guest */
 OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
                                    tdx_guest,
@@ -957,6 +983,12 @@ static void tdx_guest_init(Object *obj)
                             tdx_guest_get_mrownerconfig,
                             tdx_guest_set_mrownerconfig);
 
+    tdx->quote_generation = NULL;
+    object_property_add(obj, "quote-generation-socket", "SocketAddress",
+                        tdx_guest_get_quote_generation,
+                        tdx_guest_set_quote_generation,
+                        NULL, NULL);
+
     tdx->event_notify_interrupt = -1;
     tdx->event_notify_apic_id = -1;
 }
@@ -969,6 +1001,7 @@ static void tdx_guest_class_init(ObjectClass *oc, void *data)
 {
 }
 
+#define TDG_VP_VMCALL_GET_QUOTE                         0x10002ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT      0x10004ULL
 
 #define TDG_VP_VMCALL_SUCCESS           0x0000000000000000ULL
@@ -977,6 +1010,400 @@ static void tdx_guest_class_init(ObjectClass *oc, void *data)
 #define TDG_VP_VMCALL_GPA_INUSE         0x8000000000000001ULL
 #define TDG_VP_VMCALL_ALIGN_ERROR       0x8000000000000002ULL
 
+#define TDX_GET_QUOTE_STRUCTURE_VERSION 1ULL
+
+#define TDX_VP_GET_QUOTE_SUCCESS                0ULL
+#define TDX_VP_GET_QUOTE_IN_FLIGHT              (-1ULL)
+#define TDX_VP_GET_QUOTE_ERROR                  0x8000000000000000ULL
+#define TDX_VP_GET_QUOTE_QGS_UNAVAILABLE        0x8000000000000001ULL
+
+/* Limit to avoid resource starvation. */
+#define TDX_GET_QUOTE_MAX_BUF_LEN       (128 * 1024)
+#define TDX_MAX_GET_QUOTE_REQUEST       16
+
+/* Format of pages shared with guest. */
+struct tdx_get_quote_header {
+    /* Format version: must be 1 in little endian. */
+    uint64_t structure_version;
+
+    /*
+     * GetQuote status code in little endian:
+     *   Guest must set error_code to 0 to avoid information leak.
+     *   QEMU sets this before interrupting the guest.
+     */
+    uint64_t error_code;
+
+    /*
+     * in-message size in little endian: The message will follow this header.
+     * The in-message will be sent to QGS.
+     */
+    uint32_t in_len;
+
+    /*
+     * out-message size in little endian:
+     * On request, out_len must be zero to avoid information leak.
+     * On return, message size from QGS. QEMU overwrites this field.
+     * The message follows this header.  The in-message is overwritten.
+     */
+    uint32_t out_len;
+
+    /*
+     * Message buffer follows.
+     * Guest sets message that will be sent to QGS.  If out_len > in_len, guest
+     * should zero remaining buffer to avoid information leak.
+     * QEMU overwrites this buffer with a message returned from QGS.
+     */
+};
+
+static hwaddr tdx_shared_bit(X86CPU *cpu)
+{
+    return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47);
+}
+
+struct tdx_get_quote_task {
+    uint32_t apic_id;
+    hwaddr gpa;
+    uint64_t buf_len;
+    char *out_data;
+    uint64_t out_len;
+    struct tdx_get_quote_header hdr;
+    int event_notify_interrupt;
+    QIOChannelSocket *ioc;
+};
+
+struct x86_msi {
+    union {
+        struct {
+            uint32_t    reserved_0              : 2,
+                        dest_mode_logical       : 1,
+                        redirect_hint           : 1,
+                        reserved_1              : 1,
+                        virt_destid_8_14        : 7,
+                        destid_0_7              : 8,
+                        base_address            : 12;
+        } QEMU_PACKED x86_address_lo;
+        uint32_t address_lo;
+    };
+    union {
+        struct {
+            uint32_t    reserved        : 8,
+                        destid_8_31     : 24;
+        } QEMU_PACKED x86_address_hi;
+        uint32_t address_hi;
+    };
+    union {
+        struct {
+            uint32_t    vector                  : 8,
+                        delivery_mode           : 3,
+                        dest_mode_logical       : 1,
+                        reserved                : 2,
+                        active_low              : 1,
+                        is_level                : 1;
+        } QEMU_PACKED x86_data;
+        uint32_t data;
+    };
+};
+
+static void tdx_td_notify(struct tdx_get_quote_task *t)
+{
+    struct x86_msi x86_msi;
+    struct kvm_msi msi;
+    int ret;
+
+    /* It is optional for host VMM to interrupt TD. */
+    if (!(32 <= t->event_notify_interrupt && t->event_notify_interrupt <= 255))
+        return;
+
+    x86_msi = (struct x86_msi) {
+        .x86_address_lo  = {
+            .reserved_0 = 0,
+            .dest_mode_logical = 0,
+            .redirect_hint = 0,
+            .reserved_1 = 0,
+            .virt_destid_8_14 = 0,
+            .destid_0_7 = t->apic_id & 0xff,
+        },
+        .x86_address_hi = {
+            .reserved = 0,
+            .destid_8_31 = t->apic_id >> 8,
+        },
+        .x86_data = {
+            .vector = t->event_notify_interrupt,
+            .delivery_mode = APIC_DM_FIXED,
+            .dest_mode_logical = 0,
+            .reserved = 0,
+            .active_low = 0,
+            .is_level = 0,
+        },
+    };
+    msi = (struct kvm_msi) {
+        .address_lo = x86_msi.address_lo,
+        .address_hi = x86_msi.address_hi,
+        .data = x86_msi.data,
+        .flags = 0,
+        .devid = 0,
+    };
+    ret = kvm_vm_ioctl(kvm_state, KVM_SIGNAL_MSI, &msi);
+    if (ret < 0) {
+        /* There is no better way to report this to the guest; just log it. */
+        error_report("TDX: injection %d failed, interrupt lost (%s).",
+                     t->event_notify_interrupt, strerror(-ret));
+    }
+}
+
+static void tdx_get_quote_read(void *opaque)
+{
+    struct tdx_get_quote_task *t = opaque;
+    ssize_t size = 0;
+    Error *err = NULL;
+    MachineState *ms;
+    TdxGuest *tdx;
+
+    while (true) {
+        char *buf;
+        size_t buf_size;
+
+        if (t->out_len < t->buf_len) {
+            buf = t->out_data + t->out_len;
+            buf_size = t->buf_len - t->out_len;
+        } else {
+            /*
+             * The received data is too large to fit in the shared GPA.
+             * Discard the received data and try to know the data size.
+             */
+            buf = t->out_data;
+            buf_size = t->buf_len;
+        }
+
+        size = qio_channel_read(QIO_CHANNEL(t->ioc), buf, buf_size, &err);
+        if (!size) {
+            break;
+        }
+
+        if (size < 0) {
+            if (size == QIO_CHANNEL_ERR_BLOCK) {
+                return;
+            } else {
+                break;
+            }
+        }
+        t->out_len += size;
+    }
+    /*
+     * If partial read successfully but return error at last, also treat it
+     * as failure.
+     */
+    if (size < 0) {
+        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
+        goto error;
+    }
+    if (t->out_len > 0 && t->out_len > t->buf_len) {
+        /*
+         * There is no specific error code defined for this case(E2BIG) at the
+         * moment.
+         * TODO: Once an error code for this case is defined in GHCI spec ,
+         * update the error code.
+         */
+        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
+        t->hdr.out_len = cpu_to_le32(t->out_len);
+        goto error_hdr;
+    }
+
+    if (address_space_write(
+            &address_space_memory, t->gpa + sizeof(t->hdr),
+            MEMTXATTRS_UNSPECIFIED, t->out_data, t->out_len) != MEMTX_OK) {
+        goto error;
+    }
+    /*
+     * Even if out_len == 0, it's a success.  It's up to the QGS-client contract
+     * how to interpret the zero-sized message as return message.
+     */
+    t->hdr.out_len = cpu_to_le32(t->out_len);
+    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS);
+
+error:
+    if (t->hdr.error_code != cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS)) {
+        t->hdr.out_len = cpu_to_le32(0);
+    }
+error_hdr:
+    if (address_space_write(
+            &address_space_memory, t->gpa,
+            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
+        error_report("TDX: failed to update GetQuote header.");
+    }
+    tdx_td_notify(t);
+
+    qemu_set_fd_handler(t->ioc->fd, NULL, NULL, NULL);
+    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
+    object_unref(OBJECT(t->ioc));
+    g_free(t->out_data);
+    g_free(t);
+
+    /* Maintain the number of in-flight requests. */
+    ms = MACHINE(qdev_get_machine());
+    tdx = TDX_GUEST(ms->cgs);
+    qemu_mutex_lock(&tdx->lock);
+    tdx->quote_generation_num--;
+    qemu_mutex_unlock(&tdx->lock);
+}
+
+/*
+ * TODO: If QGS doesn't reply for long time, make it an error and interrupt
+ * guest.
+ */
+static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
+{
+    struct tdx_get_quote_task *t = opaque;
+    Error *err = NULL;
+    char *in_data = NULL;
+    MachineState *ms;
+    TdxGuest *tdx;
+
+    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
+    if (qio_task_propagate_error(task, NULL)) {
+        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
+        goto error;
+    }
+
+    in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
+    if (!in_data) {
+        goto error;
+    }
+
+    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
+                           MEMTXATTRS_UNSPECIFIED, in_data,
+                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
+        goto error;
+    }
+
+    qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
+
+    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
+                              le32_to_cpu(t->hdr.in_len), &err) ||
+        err) {
+        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
+        goto error;
+    }
+
+    g_free(in_data);
+    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
+
+    return;
+error:
+    t->hdr.out_len = cpu_to_le32(0);
+
+    if (address_space_write(
+            &address_space_memory, t->gpa,
+            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
+        error_report("TDX: failed to update GetQuote header.\n");
+    }
+    tdx_td_notify(t);
+
+    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
+    object_unref(OBJECT(t->ioc));
+    g_free(t);
+    g_free(in_data);
+
+    /* Maintain the number of in-flight requests. */
+    ms = MACHINE(qdev_get_machine());
+    tdx = TDX_GUEST(ms->cgs);
+    qemu_mutex_lock(&tdx->lock);
+    tdx->quote_generation_num--;
+    qemu_mutex_unlock(&tdx->lock);
+    return;
+}
+
+static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
+{
+    hwaddr gpa = vmcall->in_r12;
+    uint64_t buf_len = vmcall->in_r13;
+    struct tdx_get_quote_header hdr;
+    MachineState *ms;
+    TdxGuest *tdx;
+    QIOChannelSocket *ioc;
+    struct tdx_get_quote_task *t;
+
+    vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+    /* GPA must be shared. */
+    if (!(gpa & tdx_shared_bit(cpu))) {
+        return;
+    }
+    gpa &= ~tdx_shared_bit(cpu);
+
+    if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(buf_len, 4096)) {
+        vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
+        return;
+    }
+    if (buf_len == 0) {
+        return;
+    }
+
+    if (address_space_read(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
+                           &hdr, sizeof(hdr)) != MEMTX_OK) {
+        return;
+    }
+    if (le64_to_cpu(hdr.structure_version) != TDX_GET_QUOTE_STRUCTURE_VERSION) {
+        return;
+    }
+    /*
+     * Paranoid: Guest should clear error_code and out_len to avoid information
+     * leak.  Enforce it.  The initial value of them doesn't matter for qemu to
+     * process the request.
+     */
+    if (le64_to_cpu(hdr.error_code) != TDX_VP_GET_QUOTE_SUCCESS ||
+        le32_to_cpu(hdr.out_len) != 0) {
+        return;
+    }
+
+    /* Only safe-guard check to avoid too large buffer size. */
+    if (buf_len > TDX_GET_QUOTE_MAX_BUF_LEN ||
+        le32_to_cpu(hdr.in_len) > TDX_GET_QUOTE_MAX_BUF_LEN ||
+        le32_to_cpu(hdr.in_len) > buf_len) {
+        return;
+    }
+
+    /* Mark the buffer in-flight. */
+    hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_IN_FLIGHT);
+    if (address_space_write(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
+                            &hdr, sizeof(hdr)) != MEMTX_OK) {
+        return;
+    }
+
+    ms = MACHINE(qdev_get_machine());
+    tdx = TDX_GUEST(ms->cgs);
+    ioc = qio_channel_socket_new();
+
+    t = g_malloc(sizeof(*t));
+    t->apic_id = tdx->event_notify_apic_id;
+    t->gpa = gpa;
+    t->buf_len = buf_len;
+    t->out_data = g_malloc(t->buf_len);
+    t->out_len = 0;
+    t->hdr = hdr;
+    t->ioc = ioc;
+
+    qemu_mutex_lock(&tdx->lock);
+    if (!tdx->quote_generation ||
+        /* Prevent too many in-flight get-quote request. */
+        tdx->quote_generation_num >= TDX_MAX_GET_QUOTE_REQUEST) {
+        qemu_mutex_unlock(&tdx->lock);
+        vmcall->status_code = TDG_VP_VMCALL_RETRY;
+        object_unref(OBJECT(ioc));
+        g_free(t->out_data);
+        g_free(t);
+        return;
+    }
+    tdx->quote_generation_num++;
+    t->event_notify_interrupt = tdx->event_notify_interrupt;
+    qio_channel_socket_connect_async(
+        ioc, tdx->quote_generation, tdx_handle_get_quote_connected, t, NULL,
+        NULL);
+    qemu_mutex_unlock(&tdx->lock);
+
+    vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+}
+
 static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
                                                     struct kvm_tdx_vmcall *vmcall)
 {
@@ -1005,6 +1432,9 @@ static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
     }
 
     switch (vmcall->subfunction) {
+    case TDG_VP_VMCALL_GET_QUOTE:
+        tdx_handle_get_quote(cpu, vmcall);
+        break;
     case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
         tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
         break;
diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
index 4a8d67cc9fdb..4a989805493e 100644
--- a/target/i386/kvm/tdx.h
+++ b/target/i386/kvm/tdx.h
@@ -5,8 +5,10 @@
 #include CONFIG_DEVICES /* CONFIG_TDX */
 #endif
 
+#include <linux/kvm.h>
 #include "exec/confidential-guest-support.h"
 #include "hw/i386/tdvf.h"
+#include "io/channel-socket.h"
 #include "sysemu/kvm.h"
 
 #define TYPE_TDX_GUEST "tdx-guest"
@@ -47,6 +49,10 @@ typedef struct TdxGuest {
     /* runtime state */
     int event_notify_interrupt;
     uint32_t event_notify_apic_id;
+
+    /* GetQuote */
+    int quote_generation_num;
+    SocketAddress *quote_generation;
 } TdxGuest;
 
 #ifdef CONFIG_TDX
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v3 53/70] i386/tdx: setup a timer for the qio channel
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (51 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote> Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15 18:02   ` Daniel P. Berrangé
  2023-11-15  7:15 ` [PATCH v3 54/70] i386/tdx: handle TDG.VP.VMCALL<MapGPA> hypercall Xiaoyao Li
                   ` (16 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Chenyi Qiang <chenyi.qiang@intel.com>

To guard against a QGS server that never responds, set up a timer for the
transaction. On timeout, treat the request as failed and interrupt the guest.
The timeout is set to 30s for now; it can be changed later if that value
proves inappropriate.

Extract the common cleanup code to make it clearer.
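
The intent can be modelled outside QEMU roughly as follows. This is only a
sketch under stated assumptions, not the code in this patch: the names
quote_txn, txn_start, txn_cleanup, txn_on_reply and txn_on_tick are
hypothetical, and a plain millisecond counter stands in for QEMUTimer. It
shows the reply path and the timeout path sharing one cleanup routine that
runs at most once per transaction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TXN_TIMEOUT_MS 30000 /* 30s, the threshold chosen in this patch */

/* Hypothetical stand-in for the in-flight GetQuote task. */
struct quote_txn {
    int64_t deadline_ms; /* absolute time at which the transaction expires */
    bool    done;        /* set once cleanup has run */
    bool    timed_out;   /* true if cleanup was triggered by the timer */
    int     cleanups;    /* how many times cleanup actually ran */
};

static void txn_start(struct quote_txn *t, int64_t now_ms)
{
    assert(t != NULL);
    t->deadline_ms = now_ms + TXN_TIMEOUT_MS;
    t->done = false;
    t->timed_out = false;
    t->cleanups = 0;
}

/* Single cleanup path shared by the reply handler and the timer. */
static void txn_cleanup(struct quote_txn *t, bool timed_out)
{
    if (t->done) {
        return; /* already completed by the other path */
    }
    t->done = true;
    t->timed_out = timed_out;
    t->cleanups++;
}

/* Called when the server's reply has been fully read. */
static void txn_on_reply(struct quote_txn *t)
{
    txn_cleanup(t, false);
}

/* Called periodically; fires the timeout once the deadline has passed. */
static void txn_on_tick(struct quote_txn *t, int64_t now_ms)
{
    if (!t->done && now_ms >= t->deadline_ms) {
        txn_cleanup(t, true);
    }
}
```

In the sketch the "done" flag provides the run-once guarantee; the patch
itself relies on both callbacks running in the main loop thread and on the
task being freed in the cleanup function.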

Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes in v3:
 - Use t->timer_armed to track if t->timer is initialized;
---
 target/i386/kvm/tdx.c | 155 ++++++++++++++++++++++++------------------
 1 file changed, 89 insertions(+), 66 deletions(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 54b38c031fb3..3b87c36c485e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1069,6 +1069,8 @@ struct tdx_get_quote_task {
     struct tdx_get_quote_header hdr;
     int event_notify_interrupt;
     QIOChannelSocket *ioc;
+    QEMUTimer timer;
+    bool timer_armed;
 };
 
 struct x86_msi {
@@ -1151,13 +1153,49 @@ static void tdx_td_notify(struct tdx_get_quote_task *t)
     }
 }
 
+static void tdx_getquote_task_cleanup(struct tdx_get_quote_task *t, bool outlen_overflow)
+{
+    MachineState *ms;
+    TdxGuest *tdx;
+
+    if (t->hdr.error_code != cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS) && !outlen_overflow) {
+        t->hdr.out_len = cpu_to_le32(0);
+    }
+
+    /* Publish the response contents before marking this request completed. */
+    smp_wmb();
+    if (address_space_write(
+            &address_space_memory, t->gpa,
+            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
+        error_report("TDX: failed to update GetQuote header.");
+    }
+    tdx_td_notify(t);
+
+    if (t->ioc->fd > 0) {
+        qemu_set_fd_handler(t->ioc->fd, NULL, NULL, NULL);
+    }
+    qio_channel_close(QIO_CHANNEL(t->ioc), NULL);
+    object_unref(OBJECT(t->ioc));
+    if (t->timer_armed)
+        timer_del(&t->timer);
+    g_free(t->out_data);
+    g_free(t);
+
+    /* Maintain the number of in-flight requests. */
+    ms = MACHINE(qdev_get_machine());
+    tdx = TDX_GUEST(ms->cgs);
+    qemu_mutex_lock(&tdx->lock);
+    tdx->quote_generation_num--;
+    qemu_mutex_unlock(&tdx->lock);
+}
+
+
 static void tdx_get_quote_read(void *opaque)
 {
     struct tdx_get_quote_task *t = opaque;
     ssize_t size = 0;
     Error *err = NULL;
-    MachineState *ms;
-    TdxGuest *tdx;
+    bool outlen_overflow = false;
 
     while (true) {
         char *buf;
@@ -1202,11 +1240,12 @@ static void tdx_get_quote_read(void *opaque)
          * There is no specific error code defined for this case(E2BIG) at the
          * moment.
          * TODO: Once an error code for this case is defined in GHCI spec ,
-         * update the error code.
+         * update the error code and the tdx_getquote_task_cleanup() argument.
          */
         t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
         t->hdr.out_len = cpu_to_le32(t->out_len);
-        goto error_hdr;
+        outlen_overflow = true;
+        goto error;
     }
 
     if (address_space_write(
@@ -1222,94 +1261,77 @@ static void tdx_get_quote_read(void *opaque)
     t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS);
 
 error:
-    if (t->hdr.error_code != cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS)) {
-        t->hdr.out_len = cpu_to_le32(0);
-    }
-error_hdr:
-    if (address_space_write(
-            &address_space_memory, t->gpa,
-            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
-        error_report("TDX: failed to update GetQuote header.");
-    }
-    tdx_td_notify(t);
+    tdx_getquote_task_cleanup(t, outlen_overflow);
+}
+
+#define TRANSACTION_TIMEOUT 30000
+
+static void getquote_timer_expired(void *opaque)
+{
+    struct tdx_get_quote_task *t = opaque;
+
+    tdx_getquote_task_cleanup(t, false);
+}
 
-    qemu_set_fd_handler(t->ioc->fd, NULL, NULL, NULL);
-    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
-    object_unref(OBJECT(t->ioc));
-    g_free(t->out_data);
-    g_free(t);
+static void tdx_transaction_start(struct tdx_get_quote_task *t)
+{
+    int64_t time;
 
-    /* Maintain the number of in-flight requests. */
-    ms = MACHINE(qdev_get_machine());
-    tdx = TDX_GUEST(ms->cgs);
-    qemu_mutex_lock(&tdx->lock);
-    tdx->quote_generation_num--;
-    qemu_mutex_unlock(&tdx->lock);
+    time = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
+    /*
+     * Timeout callback and fd callback both run in main loop thread,
+     * thus no need to worry about race condition.
+     */
+    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
+    timer_init_ms(&t->timer, QEMU_CLOCK_VIRTUAL, getquote_timer_expired, t);
+    timer_mod(&t->timer, time + TRANSACTION_TIMEOUT);
+    t->timer_armed = true;
 }
 
-/*
- * TODO: If QGS doesn't reply for long time, make it an error and interrupt
- * guest.
- */
 static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
 {
     struct tdx_get_quote_task *t = opaque;
     Error *err = NULL;
     char *in_data = NULL;
-    MachineState *ms;
-    TdxGuest *tdx;
+    int ret = 0;
 
     t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
-    if (qio_task_propagate_error(task, NULL)) {
+    ret = qio_task_propagate_error(task, NULL);
+    if (ret) {
         t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
-        goto error;
+        goto out;
     }
 
     in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
     if (!in_data) {
-        goto error;
+        ret = -1;
+        goto out;
     }
 
-    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
-                           MEMTXATTRS_UNSPECIFIED, in_data,
-                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
-        goto error;
+    ret = address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
+                             MEMTXATTRS_UNSPECIFIED, in_data,
+                             le32_to_cpu(t->hdr.in_len));
+    if (ret) {
+        g_free(in_data);
+        goto out;
     }
 
     qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
 
-    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
-                              le32_to_cpu(t->hdr.in_len), &err) ||
-        err) {
+    ret = qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
+                              le32_to_cpu(t->hdr.in_len), &err);
+    if (ret) {
         t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
-        goto error;
+        g_free(in_data);
+        goto out;
     }
 
-    g_free(in_data);
-    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
-
-    return;
-error:
-    t->hdr.out_len = cpu_to_le32(0);
-
-    if (address_space_write(
-            &address_space_memory, t->gpa,
-            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
-        error_report("TDX: failed to update GetQuote header.\n");
+out:
+    if (ret) {
+        tdx_getquote_task_cleanup(t, false);
+    } else {
+        tdx_transaction_start(t);
     }
-    tdx_td_notify(t);
-
-    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
-    object_unref(OBJECT(t->ioc));
-    g_free(t);
-    g_free(in_data);
-
-    /* Maintain the number of in-flight requests. */
-    ms = MACHINE(qdev_get_machine());
-    tdx = TDX_GUEST(ms->cgs);
-    qemu_mutex_lock(&tdx->lock);
-    tdx->quote_generation_num--;
-    qemu_mutex_unlock(&tdx->lock);
     return;
 }
 
@@ -1382,6 +1404,7 @@ static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
     t->out_len = 0;
     t->hdr = hdr;
     t->ioc = ioc;
+    t->timer_armed = false;
 
     qemu_mutex_lock(&tdx->lock);
     if (!tdx->quote_generation ||
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v3 54/70] i386/tdx: handle TDG.VP.VMCALL<MapGPA> hypercall
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (52 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 53/70] i386/tdx: setup a timer for the qio channel Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 55/70] i386/tdx: Limit the range size for MapGPA Xiaoyao Li
                   ` (15 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

MapGPA is a hypercall that converts a GPA range between private and shared.
As the conversion is already implemented by kvm_convert_memory(), wire it
to the TDX hypercall exit.
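
The operand checks the handler performs before converting can be sketched
standalone as below. This is an illustration only: map_gpa_decode,
map_gpa_req, shared_bit_for and the enum values are hypothetical names for
this sketch and are not the GHCI status codes; the shared-bit rule and the
bounds check mirror what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define PAGE_SIZE_4K 4096ULL

/* Hypothetical result codes for this sketch; not the GHCI status values. */
enum map_gpa_check {
    MAP_GPA_OK,
    MAP_GPA_ALIGN_ERROR,
    MAP_GPA_INVALID_OPERAND,
};

struct map_gpa_req {
    uint64_t gpa;        /* start address with the shared bit stripped */
    uint64_t size;       /* length in bytes */
    bool     to_private; /* true when the shared bit was clear */
};

/* Same rule as tdx_shared_bit(): bit 51 if phys_bits > 48, else bit 47. */
static uint64_t shared_bit_for(unsigned phys_bits)
{
    return (phys_bits > 48) ? (1ULL << 51) : (1ULL << 47);
}

static enum map_gpa_check map_gpa_decode(uint64_t r12, uint64_t r13,
                                         unsigned phys_bits,
                                         struct map_gpa_req *req)
{
    uint64_t shared_bit = shared_bit_for(phys_bits);
    uint64_t limit;

    assert(req != NULL);
    assert(phys_bits > 0 && phys_bits < 64);
    limit = 1ULL << phys_bits;

    req->gpa = r12 & ~shared_bit;
    req->size = r13;
    req->to_private = !(r12 & shared_bit);

    if ((req->gpa % PAGE_SIZE_4K) || (req->size % PAGE_SIZE_4K)) {
        return MAP_GPA_ALIGN_ERROR;
    }
    /* Reject wrap-around of gpa + size. */
    if (req->gpa + req->size < req->gpa) {
        return MAP_GPA_INVALID_OPERAND;
    }
    /* Mirror the patch's bounds check against the physical address width. */
    if (req->gpa >= limit || req->gpa + req->size >= limit) {
        return MAP_GPA_INVALID_OPERAND;
    }
    return MAP_GPA_OK;
}
```

Only when the decode succeeds and the size is non-zero would the request be
handed to the conversion function.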

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 accel/kvm/kvm-all.c   |  2 +-
 include/sysemu/kvm.h  |  2 ++
 target/i386/kvm/tdx.c | 37 +++++++++++++++++++++++++++++++++++++
 3 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 89e7183a2738..65bc92265369 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -2929,7 +2929,7 @@ static void kvm_eat_signals(CPUState *cpu)
     } while (sigismember(&chkset, SIG_IPI));
 }
 
-static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
 {
     MemoryRegionSection section;
     ram_addr_t offset;
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 2f6592859ac6..e0061848b053 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -544,4 +544,6 @@ int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
 
 int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
 int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
+
+int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private);
 #endif
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 3b87c36c485e..b17258f17fd0 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1001,6 +1001,7 @@ static void tdx_guest_class_init(ObjectClass *oc, void *data)
 {
 }
 
+#define TDG_VP_VMCALL_MAP_GPA                           0x10001ULL
 #define TDG_VP_VMCALL_GET_QUOTE                         0x10002ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT      0x10004ULL
 
@@ -1060,6 +1061,39 @@ static hwaddr tdx_shared_bit(X86CPU *cpu)
     return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47);
 }
 
+static void tdx_handle_map_gpa(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
+{
+    hwaddr shared_bit = tdx_shared_bit(cpu);
+    hwaddr gpa = vmcall->in_r12 & ~shared_bit;
+    bool private = !(vmcall->in_r12 & shared_bit);
+    hwaddr size = vmcall->in_r13;
+    int ret = 0;
+
+    vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
+
+    if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(size, 4096)) {
+        vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
+        return;
+    }
+
+    /* Overflow case. */
+    if (gpa + size < gpa) {
+        return;
+    }
+    if (gpa >= (1ULL << cpu->phys_bits) ||
+        gpa + size >= (1ULL << cpu->phys_bits)) {
+        return;
+    }
+
+    if (size > 0) {
+        ret = kvm_convert_memory(gpa, size, private);
+    }
+
+    if (!ret) {
+        vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+    }
+}
+
 struct tdx_get_quote_task {
     uint32_t apic_id;
     hwaddr gpa;
@@ -1455,6 +1489,9 @@ static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
     }
 
     switch (vmcall->subfunction) {
+    case TDG_VP_VMCALL_MAP_GPA:
+        tdx_handle_map_gpa(cpu, vmcall);
+        break;
     case TDG_VP_VMCALL_GET_QUOTE:
         tdx_handle_get_quote(cpu, vmcall);
         break;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v3 55/70] i386/tdx: Limit the range size for MapGPA
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (53 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 54/70] i386/tdx: handle TDG.VP.VMCALL<MapGPA> hypercall Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 56/70] i386/tdx: Handle TDG.VP.VMCALL<REPORT_FATAL_ERROR> Xiaoyao Li
                   ` (14 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

If the range for TDG.VP.VMCALL<MapGPA> is too large, process only a limited
size and return a retry error.  It's bad for the VMM to block vcpu execution
for too long, e.g. on the order of seconds, because the guest then misses
too many timer interrupts.
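
The clamping and the resume address handed back in r11 can be sketched
standalone as below. This is a sketch under stated assumptions, not the
patch code: map_gpa_step, map_gpa_clamp and map_gpa_calls_needed are
hypothetical names, and the guest-side retry loop is an assumption about
how a guest would consume the retry status.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define MAP_GPA_MAX_LEN (64ULL * 1024 * 1024) /* 64MB, as in this patch */

struct map_gpa_step {
    uint64_t chunk;    /* bytes to convert in this call */
    bool     retry;    /* true if the guest must call again */
    uint64_t next_r11; /* resume address reported to the guest on retry */
};

static struct map_gpa_step map_gpa_clamp(uint64_t gpa, uint64_t size,
                                         bool to_private, uint64_t shared_bit)
{
    struct map_gpa_step s = { .chunk = size, .retry = false, .next_r11 = 0 };

    assert((gpa & shared_bit) == 0); /* caller strips the shared bit first */

    if (size > MAP_GPA_MAX_LEN) {
        s.chunk = MAP_GPA_MAX_LEN;
        s.retry = true;
        s.next_r11 = gpa + s.chunk;
        if (!to_private) {
            /* Hand back a shared GPA so the retried request keeps its type. */
            s.next_r11 |= shared_bit;
        }
    }
    return s;
}

/* Hypothetical guest-side loop: counts the calls needed to cover a range. */
static unsigned map_gpa_calls_needed(uint64_t gpa, uint64_t size,
                                     bool to_private, uint64_t shared_bit)
{
    uint64_t end = gpa + size;
    unsigned calls = 0;

    for (;;) {
        struct map_gpa_step s = map_gpa_clamp(gpa, end - gpa, to_private,
                                              shared_bit);
        calls++;
        if (!s.retry) {
            return calls;
        }
        gpa = s.next_r11 & ~shared_bit;
    }
}
```

With a 64MB cap, a 200MB request is served in four calls (64 + 64 + 64 + 8),
so no single exit keeps the vcpu blocked for the whole range.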

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/kvm/tdx.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index b17258f17fd0..96a10b0bb190 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1061,12 +1061,16 @@ static hwaddr tdx_shared_bit(X86CPU *cpu)
     return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47);
 }
 
+/* 64MB at most in one call. What value is appropriate? */
+#define TDX_MAP_GPA_MAX_LEN     (64 * 1024 * 1024)
+
 static void tdx_handle_map_gpa(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
 {
     hwaddr shared_bit = tdx_shared_bit(cpu);
     hwaddr gpa = vmcall->in_r12 & ~shared_bit;
     bool private = !(vmcall->in_r12 & shared_bit);
     hwaddr size = vmcall->in_r13;
+    bool retry = false;
     int ret = 0;
 
     vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
@@ -1085,12 +1089,25 @@ static void tdx_handle_map_gpa(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
         return;
     }
 
+    if (size > TDX_MAP_GPA_MAX_LEN) {
+        retry = true;
+        size = TDX_MAP_GPA_MAX_LEN;
+    }
+
     if (size > 0) {
         ret = kvm_convert_memory(gpa, size, private);
     }
 
     if (!ret) {
-        vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+        if (retry) {
+            vmcall->status_code = TDG_VP_VMCALL_RETRY;
+            vmcall->out_r11 = gpa + size;
+            if (!private) {
+                vmcall->out_r11 |= shared_bit;
+            }
+        } else {
+            vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
+        }
     }
 }
 
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v3 56/70] i386/tdx: Handle TDG.VP.VMCALL<REPORT_FATAL_ERROR>
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (54 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 55/70] i386/tdx: Limit the range size for MapGPA Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 57/70] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility Xiaoyao Li
                   ` (13 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

A TD guest can use TDG.VP.VMCALL<REPORT_FATAL_ERROR> to request termination,
with an error message encoded in GPRs.

Parse and print the error message, and terminate the TD guest in the
handler.
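
How the eight message registers become a 64-byte string can be sketched
standalone as below. This is an illustration, not the patch code:
tdx_pack_message and the TDX_MSG_* macros are hypothetical names, and the
sketch stores bytes one at a time instead of using cpu_to_le64() so that it
builds without QEMU headers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TDX_MSG_REGS 8
#define TDX_MSG_MAX  (TDX_MSG_REGS * 8) /* 64 bytes of message payload */

/*
 * regs[] holds the eight message registers in the order the handler reads
 * them: r14, r15, rbx, rdi, rsi, r8, r9, rdx.
 */
static void tdx_pack_message(const uint64_t regs[TDX_MSG_REGS],
                             char out[TDX_MSG_MAX + 1])
{
    size_t i, b;

    assert(regs != NULL && out != NULL);

    for (i = 0; i < TDX_MSG_REGS; i++) {
        for (b = 0; b < 8; b++) {
            /* Least significant byte first, independent of host endianness. */
            out[i * 8 + b] = (char)((regs[i] >> (8 * b)) & 0xff);
        }
    }
    out[TDX_MSG_MAX] = '\0'; /* guarantee termination even for a full buffer */
}
```

A guest that puts 0x6f6c6c6568 in r14 and zeroes in the remaining registers
would thus produce the string "hello".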

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/kvm/tdx.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 96a10b0bb190..a42b5cea36c5 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -1003,6 +1003,7 @@ static void tdx_guest_class_init(ObjectClass *oc, void *data)
 
 #define TDG_VP_VMCALL_MAP_GPA                           0x10001ULL
 #define TDG_VP_VMCALL_GET_QUOTE                         0x10002ULL
+#define TDG_VP_VMCALL_REPORT_FATAL_ERROR                0x10003ULL
 #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT      0x10004ULL
 
 #define TDG_VP_VMCALL_SUCCESS           0x0000000000000000ULL
@@ -1478,6 +1479,42 @@ static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
     vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
 }
 
+static void tdx_handle_report_fatal_error(X86CPU *cpu,
+                                          struct kvm_tdx_vmcall *vmcall)
+{
+    uint64_t error_code = vmcall->in_r12;
+    char *message = NULL;
+
+    if (error_code & 0xffff) {
+        error_report("invalid error code of TDG.VP.VMCALL<REPORT_FATAL_ERROR>\n");
+        exit(1);
+    }
+
+    /* it has optional message */
+    if (vmcall->in_r14) {
+        uint64_t * tmp;
+
+#define GUEST_PANIC_INFO_TDX_MESSAGE_MAX        64
+        message = g_malloc0(GUEST_PANIC_INFO_TDX_MESSAGE_MAX + 1);
+
+        tmp = (uint64_t *)message;
+        /* The order is defined in TDX GHCI spec */
+        *(tmp++) = cpu_to_le64(vmcall->in_r14);
+        *(tmp++) = cpu_to_le64(vmcall->in_r15);
+        *(tmp++) = cpu_to_le64(vmcall->in_rbx);
+        *(tmp++) = cpu_to_le64(vmcall->in_rdi);
+        *(tmp++) = cpu_to_le64(vmcall->in_rsi);
+        *(tmp++) = cpu_to_le64(vmcall->in_r8);
+        *(tmp++) = cpu_to_le64(vmcall->in_r9);
+        *(tmp++) = cpu_to_le64(vmcall->in_rdx);
+        message[GUEST_PANIC_INFO_TDX_MESSAGE_MAX] = '\0';
+        assert((char *)tmp == message + GUEST_PANIC_INFO_TDX_MESSAGE_MAX);
+    }
+
+    error_report("TD guest reports fatal error. %s\n", message ? : "");
+    exit(1);
+}
+
 static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
                                                     struct kvm_tdx_vmcall *vmcall)
 {
@@ -1512,6 +1549,9 @@ static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
     case TDG_VP_VMCALL_GET_QUOTE:
         tdx_handle_get_quote(cpu, vmcall);
         break;
+    case TDG_VP_VMCALL_REPORT_FATAL_ERROR:
+        tdx_handle_report_fatal_error(cpu, vmcall);
+        break;
     case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
         tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
         break;
-- 
2.34.1


^ permalink raw reply related	[flat|nested] 161+ messages in thread

* [PATCH v3 57/70] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (55 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 56/70] i386/tdx: Handle TDG.VP.VMCALL<REPORT_FATAL_ERROR> Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-12-01 11:11   ` Markus Armbruster
  2023-11-15  7:15 ` [PATCH v3 58/70] pci-host/q35: Move PAM initialization above SMRAM initialization Xiaoyao Li
                   ` (12 subsequent siblings)
  69 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Integrate TDX's TDX_REPORT_FATAL_ERROR into the QEMU GuestPanic facility.
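
Because the message comes from the guest, it is reported as-is only when it
is printable ASCII and is hex-dumped otherwise. A standalone sketch of that
idea follows. It is not the code in this patch: msg_is_printable and
msg_sanitize are hypothetical names, and malloc()/free() stand in for the
GLib allocators so the sketch builds without QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* True if msg is non-empty and consists only of printable ASCII. */
static bool msg_is_printable(const char *msg)
{
    size_t i, len = strlen(msg);

    for (i = 0; i < len; i++) {
        unsigned char c = (unsigned char)msg[i];
        if (c < 0x20 || c > 0x7e) {
            return false;
        }
    }
    return len > 0;
}

/*
 * Returns a newly allocated string the caller must free(): a copy of msg if
 * it is empty or printable, otherwise its bytes as space-separated hex.
 */
static char *msg_sanitize(const char *msg)
{
    size_t i, len;
    char *buf;

    assert(msg != NULL);
    len = strlen(msg);

    if (len == 0 || msg_is_printable(msg)) {
        buf = malloc(len + 1);
        if (buf) {
            memcpy(buf, msg, len + 1);
        }
        return buf;
    }

    /* Each byte needs "xx " (3 chars); +1 for the NUL snprintf appends. */
    buf = malloc(len * 3 + 1);
    if (!buf) {
        return NULL;
    }
    for (i = 0; i < len; i++) {
        snprintf(buf + 3 * i, 4, "%02x ", (unsigned)(unsigned char)msg[i]);
    }
    buf[len * 3 - 1] = '\0'; /* drop the trailing space */
    return buf;
}
```

Converting through unsigned char keeps bytes above 0x7f at two hex digits,
and the extra byte in the allocation leaves room for the terminator written
after the last "xx " group.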

Originated-from: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
Changes from v2:
- Add documentation of the new type and struct (Daniel)
- Refine the error message handling (Daniel)
---
 qapi/run-state.json   | 27 ++++++++++++++++++++--
 system/runstate.c     | 54 +++++++++++++++++++++++++++++++++++++++++++
 target/i386/kvm/tdx.c | 24 +++++++++++++++++--
 3 files changed, 101 insertions(+), 4 deletions(-)

diff --git a/qapi/run-state.json b/qapi/run-state.json
index f216ba54ec4c..e18f62eaef77 100644
--- a/qapi/run-state.json
+++ b/qapi/run-state.json
@@ -496,10 +496,12 @@
 #
 # @s390: s390 guest panic information type (Since: 2.12)
 #
+# @tdx: tdx guest panic information type (Since: 8.2)
+#
 # Since: 2.9
 ##
 { 'enum': 'GuestPanicInformationType',
-  'data': [ 'hyper-v', 's390' ] }
+  'data': [ 'hyper-v', 's390', 'tdx' ] }
 
 ##
 # @GuestPanicInformation:
@@ -514,7 +516,8 @@
  'base': {'type': 'GuestPanicInformationType'},
  'discriminator': 'type',
  'data': {'hyper-v': 'GuestPanicInformationHyperV',
-          's390': 'GuestPanicInformationS390'}}
+          's390': 'GuestPanicInformationS390',
+          'tdx' : 'GuestPanicInformationTdx'}}
 
 ##
 # @GuestPanicInformationHyperV:
@@ -577,6 +580,26 @@
           'psw-addr': 'uint64',
           'reason': 'S390CrashReason'}}
 
+##
+# @GuestPanicInformationTdx:
+#
+# TDX GHCI TDG.VP.VMCALL<ReportFatalError> specific guest panic information
+#
+# @error-code: TD-specific error code
+#
+# @gpa: 4KB-aligned guest physical address of the page containing
+#     additional error data
+#
+# @message: TD guest provided message string.  (It is not trustworthy
+#     and cannot be assumed to be well formed, as it comes from the guest)
+#
+# Since: 8.2
+##
+{'struct': 'GuestPanicInformationTdx',
+ 'data': {'error-code': 'uint64',
+          'gpa': 'uint64',
+          'message': 'str'}}
+
 ##
 # @MEMORY_FAILURE:
 #
diff --git a/system/runstate.c b/system/runstate.c
index ea9d6c2a32a4..9275e2f265f3 100644
--- a/system/runstate.c
+++ b/system/runstate.c
@@ -518,6 +518,52 @@ static void qemu_system_wakeup(void)
     }
 }
 
+static char *tdx_parse_panic_message(char *message)
+{
+    bool printable = false;
+    char *buf = NULL;
+    int len = 0, i;
+
+    /*
+     * Although message is defined as a json string, we shouldn't
+     * unconditionally treat it as is because the guest generated it and
+     * it's not necessarily trustable.
+     */
+    if (message) {
+        /* The caller guarantees the NUL-terminated string. */
+        len = strlen(message);
+
+        printable = len > 0;
+        for (i = 0; i < len; i++) {
+            if (!(0x20 <= message[i] && message[i] <= 0x7e)) {
+                printable = false;
+                break;
+            }
+        }
+    }
+
+    if (!printable && len) {
+        /* 3 = length of "%02x "; +1 for the NUL that sprintf() appends */
+        buf = g_malloc(len * 3 + 1);
+        for (i = 0; i < len; i++) {
+            if (message[i] == '\0') {
+                break;
+            } else {
+                sprintf(buf + 3 * i, "%02x ", (unsigned char)message[i]);
+            }
+        }
+        if (i > 0)
+            /* replace the last ' '(space) to NUL */
+            buf[i * 3 - 1] = '\0';
+        else
+            buf[0] = '\0';
+
+        return buf;
+    }
+
+    return message;
+}
+
 void qemu_system_guest_panicked(GuestPanicInformation *info)
 {
     qemu_log_mask(LOG_GUEST_ERROR, "Guest crashed");
@@ -559,7 +605,15 @@ void qemu_system_guest_panicked(GuestPanicInformation *info)
                           S390CrashReason_str(info->u.s390.reason),
                           info->u.s390.psw_mask,
                           info->u.s390.psw_addr);
+        } else if (info->type == GUEST_PANIC_INFORMATION_TYPE_TDX) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          " TDX guest reports fatal error:\"%s\""
+                          " error code: 0x%016" PRIx64 " gpa page: 0x%016" PRIx64 "\n",
+                          tdx_parse_panic_message(info->u.tdx.message),
+                          info->u.tdx.error_code,
+                          info->u.tdx.gpa);
         }
+
         qapi_free_GuestPanicInformation(info);
     }
 }
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index a42b5cea36c5..23504ba3b05e 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -20,6 +20,7 @@
 #include "qom/object_interfaces.h"
 #include "standard-headers/asm-x86/kvm_para.h"
 #include "sysemu/kvm.h"
+#include "sysemu/runstate.h"
 #include "sysemu/sysemu.h"
 #include "exec/address-spaces.h"
 #include "exec/ramblock.h"
@@ -1479,11 +1480,26 @@ static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
     vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
 }
 
+static void tdx_panicked_on_fatal_error(X86CPU *cpu, uint64_t error_code,
+                                        uint64_t gpa, char *message)
+{
+    GuestPanicInformation *panic_info;
+
+    panic_info = g_new0(GuestPanicInformation, 1);
+    panic_info->type = GUEST_PANIC_INFORMATION_TYPE_TDX;
+    panic_info->u.tdx.error_code = error_code;
+    panic_info->u.tdx.gpa = gpa;
+    panic_info->u.tdx.message = message;
+
+    qemu_system_guest_panicked(panic_info);
+}
+
 static void tdx_handle_report_fatal_error(X86CPU *cpu,
                                           struct kvm_tdx_vmcall *vmcall)
 {
     uint64_t error_code = vmcall->in_r12;
     char *message = NULL;
+    uint64_t gpa = -1ull;
 
     if (error_code & 0xffff) {
         error_report("invalid error code of TDG.VP.VMCALL<REPORT_FATAL_ERROR>\n");
@@ -1511,8 +1527,12 @@ static void tdx_handle_report_fatal_error(X86CPU *cpu,
         assert((char *)tmp == message + GUEST_PANIC_INFO_TDX_MESSAGE_MAX);
     }
 
-    error_report("TD guest reports fatal error. %s\n", message ? : "");
-    exit(1);
+#define TDX_REPORT_FATAL_ERROR_GPA_VALID    BIT_ULL(63)
+    if (error_code & TDX_REPORT_FATAL_ERROR_GPA_VALID) {
+        gpa = vmcall->in_r13;
+    }
+
+    tdx_panicked_on_fatal_error(cpu, error_code, gpa, message);
 }
 
 static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
-- 
2.34.1


* [PATCH v3 58/70] pci-host/q35: Move PAM initialization above SMRAM initialization
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (56 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 57/70] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 59/70] q35: Introduce smm_ranges property for q35-pci-host Xiaoyao Li
                   ` (11 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

In mch_realize(), process PAM initialization before SMRAM initialization so
that a later patch can skip all the SMRAM-related setup with a single check.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 hw/pci-host/q35.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 08534bc7cc09..4ac44975c75d 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -568,6 +568,16 @@ static void mch_realize(PCIDevice *d, Error **errp)
     /* setup pci memory mapping */
     pc_pci_as_mapping_init(mch->system_memory, mch->pci_address_space);
 
+    /* PAM */
+    init_pam(&mch->pam_regions[0], OBJECT(mch), mch->ram_memory,
+             mch->system_memory, mch->pci_address_space,
+             PAM_BIOS_BASE, PAM_BIOS_SIZE);
+    for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) {
+        init_pam(&mch->pam_regions[i + 1], OBJECT(mch), mch->ram_memory,
+                 mch->system_memory, mch->pci_address_space,
+                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
+    }
+
     /* if *disabled* show SMRAM to all CPUs */
     memory_region_init_alias(&mch->smram_region, OBJECT(mch), "smram-region",
                              mch->pci_address_space, MCH_HOST_BRIDGE_SMRAM_C_BASE,
@@ -634,15 +644,6 @@ static void mch_realize(PCIDevice *d, Error **errp)
 
     object_property_add_const_link(qdev_get_machine(), "smram",
                                    OBJECT(&mch->smram));
-
-    init_pam(&mch->pam_regions[0], OBJECT(mch), mch->ram_memory,
-             mch->system_memory, mch->pci_address_space,
-             PAM_BIOS_BASE, PAM_BIOS_SIZE);
-    for (i = 0; i < ARRAY_SIZE(mch->pam_regions) - 1; ++i) {
-        init_pam(&mch->pam_regions[i + 1], OBJECT(mch), mch->ram_memory,
-                 mch->system_memory, mch->pci_address_space,
-                 PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
-    }
 }
 
 uint64_t mch_mcfg_base(void)
-- 
2.34.1


* [PATCH v3 59/70] q35: Introduce smm_ranges property for q35-pci-host
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (57 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 58/70] pci-host/q35: Move PAM initialization above SMRAM initialization Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 60/70] i386/tdx: Disable SMM for TDX VMs Xiaoyao Li
                   ` (10 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@linux.intel.com>

Add a q35 property to indicate whether SMM ranges, e.g. SMRAM, TSEG,
etc., exist for the target platform.  TDX doesn't support SMM and doesn't
play nice with QEMU modifying related guest memory ranges.

Signed-off-by: Isaku Yamahata <isaku.yamahata@linux.intel.com>
Co-developed-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 hw/i386/pc_q35.c          |  2 ++
 hw/pci-host/q35.c         | 42 +++++++++++++++++++++++++++------------
 include/hw/i386/pc.h      |  1 +
 include/hw/pci-host/q35.h |  1 +
 4 files changed, 33 insertions(+), 13 deletions(-)

diff --git a/hw/i386/pc_q35.c b/hw/i386/pc_q35.c
index 4f3e5412f6b8..3392b0c110f2 100644
--- a/hw/i386/pc_q35.c
+++ b/hw/i386/pc_q35.c
@@ -236,6 +236,8 @@ static void pc_q35_init(MachineState *machine)
                             x86ms->above_4g_mem_size, NULL);
     object_property_set_bool(phb, PCI_HOST_BYPASS_IOMMU,
                              pcms->default_bus_bypass_iommu, NULL);
+    object_property_set_bool(phb, PCI_HOST_PROP_SMM_RANGES,
+                             x86_machine_is_smm_enabled(x86ms), NULL);
     sysbus_realize_and_unref(SYS_BUS_DEVICE(phb), &error_fatal);
 
     /* pci */
diff --git a/hw/pci-host/q35.c b/hw/pci-host/q35.c
index 4ac44975c75d..8facd8b63f76 100644
--- a/hw/pci-host/q35.c
+++ b/hw/pci-host/q35.c
@@ -179,6 +179,8 @@ static Property q35_host_props[] = {
                      mch.below_4g_mem_size, 0),
     DEFINE_PROP_SIZE(PCI_HOST_ABOVE_4G_MEM_SIZE, Q35PCIHost,
                      mch.above_4g_mem_size, 0),
+    DEFINE_PROP_BOOL(PCI_HOST_PROP_SMM_RANGES, Q35PCIHost,
+                     mch.has_smm_ranges, true),
     DEFINE_PROP_BOOL("x-pci-hole64-fix", Q35PCIHost, pci_hole64_fix, true),
     DEFINE_PROP_END_OF_LIST(),
 };
@@ -214,6 +216,7 @@ static void q35_host_initfn(Object *obj)
     /* mch's object_initialize resets the default value, set it again */
     qdev_prop_set_uint64(DEVICE(s), PCI_HOST_PROP_PCI_HOLE64_SIZE,
                          Q35_PCI_HOST_HOLE64_SIZE_DEFAULT);
+
     object_property_add(obj, PCI_HOST_PROP_PCI_HOLE_START, "uint32",
                         q35_host_get_pci_hole_start,
                         NULL, NULL, NULL);
@@ -476,6 +479,10 @@ static void mch_write_config(PCIDevice *d,
         mch_update_pciexbar(mch);
     }
 
+    if (!mch->has_smm_ranges) {
+        return;
+    }
+
     if (ranges_overlap(address, len, MCH_HOST_BRIDGE_SMRAM,
                        MCH_HOST_BRIDGE_SMRAM_SIZE)) {
         mch_update_smram(mch);
@@ -494,10 +501,13 @@ static void mch_write_config(PCIDevice *d,
 static void mch_update(MCHPCIState *mch)
 {
     mch_update_pciexbar(mch);
+
     mch_update_pam(mch);
-    mch_update_smram(mch);
-    mch_update_ext_tseg_mbytes(mch);
-    mch_update_smbase_smram(mch);
+    if (mch->has_smm_ranges) {
+        mch_update_smram(mch);
+        mch_update_ext_tseg_mbytes(mch);
+        mch_update_smbase_smram(mch);
+    }
 
     /*
      * pci hole goes from end-of-low-ram to io-apic.
@@ -538,19 +548,21 @@ static void mch_reset(DeviceState *qdev)
     pci_set_quad(d->config + MCH_HOST_BRIDGE_PCIEXBAR,
                  MCH_HOST_BRIDGE_PCIEXBAR_DEFAULT);
 
-    d->config[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_DEFAULT;
-    d->config[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_DEFAULT;
-    d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK;
-    d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK;
+    if (mch->has_smm_ranges) {
+        d->config[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_DEFAULT;
+        d->config[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_DEFAULT;
+        d->wmask[MCH_HOST_BRIDGE_SMRAM] = MCH_HOST_BRIDGE_SMRAM_WMASK;
+        d->wmask[MCH_HOST_BRIDGE_ESMRAMC] = MCH_HOST_BRIDGE_ESMRAMC_WMASK;
 
-    if (mch->ext_tseg_mbytes > 0) {
-        pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES,
-                     MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY);
+        if (mch->ext_tseg_mbytes > 0) {
+            pci_set_word(d->config + MCH_HOST_BRIDGE_EXT_TSEG_MBYTES,
+                        MCH_HOST_BRIDGE_EXT_TSEG_MBYTES_QUERY);
+        }
+
+        d->config[MCH_HOST_BRIDGE_F_SMBASE] = 0;
+        d->wmask[MCH_HOST_BRIDGE_F_SMBASE] = 0xff;
     }
 
-    d->config[MCH_HOST_BRIDGE_F_SMBASE] = 0;
-    d->wmask[MCH_HOST_BRIDGE_F_SMBASE] = 0xff;
-
     mch_update(mch);
 }
 
@@ -578,6 +590,10 @@ static void mch_realize(PCIDevice *d, Error **errp)
                  PAM_EXPAN_BASE + i * PAM_EXPAN_SIZE, PAM_EXPAN_SIZE);
     }
 
+    if (!mch->has_smm_ranges) {
+        return;
+    }
+
     /* if *disabled* show SMRAM to all CPUs */
     memory_region_init_alias(&mch->smram_region, OBJECT(mch), "smram-region",
                              mch->pci_address_space, MCH_HOST_BRIDGE_SMRAM_C_BASE,
diff --git a/include/hw/i386/pc.h b/include/hw/i386/pc.h
index c2ba63beb9e6..9846edebac10 100644
--- a/include/hw/i386/pc.h
+++ b/include/hw/i386/pc.h
@@ -165,6 +165,7 @@ void pc_guest_info_init(PCMachineState *pcms);
 #define PCI_HOST_PROP_PCI_HOLE64_SIZE  "pci-hole64-size"
 #define PCI_HOST_BELOW_4G_MEM_SIZE     "below-4g-mem-size"
 #define PCI_HOST_ABOVE_4G_MEM_SIZE     "above-4g-mem-size"
+#define PCI_HOST_PROP_SMM_RANGES       "smm-ranges"
 
 
 void pc_pci_as_mapping_init(MemoryRegion *system_memory,
diff --git a/include/hw/pci-host/q35.h b/include/hw/pci-host/q35.h
index bafcbe675214..22fadfa3ed76 100644
--- a/include/hw/pci-host/q35.h
+++ b/include/hw/pci-host/q35.h
@@ -50,6 +50,7 @@ struct MCHPCIState {
     MemoryRegion tseg_blackhole, tseg_window;
     MemoryRegion smbase_blackhole, smbase_window;
     bool has_smram_at_smbase;
+    bool has_smm_ranges;
     Range pci_hole;
     uint64_t below_4g_mem_size;
     uint64_t above_4g_mem_size;
-- 
2.34.1


* [PATCH v3 60/70] i386/tdx: Disable SMM for TDX VMs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (58 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 59/70] q35: Introduce smm_ranges property for q35-pci-host Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 61/70] i386/tdx: Disable PIC " Xiaoyao Li
                   ` (9 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

TDX doesn't support SMM, and the VMM cannot emulate SMM for TDX VMs because
it cannot manipulate a TDX VM's memory.

Disable SMM for TDX VMs and error out if the user requests to enable SMM.
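
The on/off/auto handling amounts to: "auto" quietly becomes "off", an
explicit "on" is an error, and "off" is left alone.  A self-contained
sketch of that rule, using a hypothetical enum and helper rather than
QEMU's OnOffAuto type and error_setg():

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical tri-state, standing in for QEMU's on/off/auto setting. */
enum toy_on_off_auto { TOY_AUTO, TOY_ON, TOY_OFF };

/*
 * Force an unsupported feature off.  Returns 0 on success, or -EINVAL if
 * the user explicitly asked for the feature.
 */
static int toy_force_off(enum toy_on_off_auto *setting, const char *name)
{
    assert(setting != NULL && name != NULL);

    switch (*setting) {
    case TOY_AUTO:
        *setting = TOY_OFF;     /* pick the only workable default */
        return 0;
    case TOY_ON:
        fprintf(stderr, "TDX VM doesn't support %s\n", name);
        return -EINVAL;
    case TOY_OFF:
        return 0;
    }
    return 0;
}
```

Failing on an explicit "on" rather than silently overriding it means the
user is told why the requested configuration cannot work.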

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/tdx.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 23504ba3b05e..45b587ee07c2 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -686,11 +686,19 @@ static Notifier tdx_machine_done_notify = {
 
 int tdx_kvm_init(MachineState *ms, Error **errp)
 {
+    X86MachineState *x86ms = X86_MACHINE(ms);
     TdxGuest *tdx = TDX_GUEST(OBJECT(ms->cgs));
     int r = 0;
 
     ms->require_guest_memfd = true;
 
+    if (x86ms->smm == ON_OFF_AUTO_AUTO) {
+        x86ms->smm = ON_OFF_AUTO_OFF;
+    } else if (x86ms->smm == ON_OFF_AUTO_ON) {
+        error_setg(errp, "TDX VM doesn't support SMM");
+        return -EINVAL;
+    }
+
     if (!tdx_caps) {
         r = get_tdx_capabilities(errp);
         if (r) {
-- 
2.34.1


* [PATCH v3 61/70] i386/tdx: Disable PIC for TDX VMs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (59 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 60/70] i386/tdx: Disable SMM for TDX VMs Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 62/70] i386/tdx: Don't allow system reset " Xiaoyao Li
                   ` (8 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Legacy PIC (8259) cannot be supported for TDX VMs since the TDX module
doesn't allow direct interrupt injection.  Using posted interrupts
for the PIC is not a viable option as the guest BIOS/kernel will not
do EOI for PIC IRQs, i.e. will leave the vIRR bit set.

Hence disable PIC for TDX VMs and error out if the user requests PIC.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/tdx.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 45b587ee07c2..208fe3572839 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -699,6 +699,13 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
         return -EINVAL;
     }
 
+    if (x86ms->pic == ON_OFF_AUTO_AUTO) {
+        x86ms->pic = ON_OFF_AUTO_OFF;
+    } else if (x86ms->pic == ON_OFF_AUTO_ON) {
+        error_setg(errp, "TDX VM doesn't support PIC");
+        return -EINVAL;
+    }
+
     if (!tdx_caps) {
         r = get_tdx_capabilities(errp);
         if (r) {
-- 
2.34.1


* [PATCH v3 62/70] i386/tdx: Don't allow system reset for TDX VMs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (60 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 61/70] i386/tdx: Disable PIC " Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 63/70] i386/tdx: LMCE is not supported for TDX Xiaoyao Li
                   ` (7 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

TDX CPU state is protected and thus vCPU state can't be reset by the VMM.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index f1c4dd759b3e..a74a0d8e0891 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -5675,7 +5675,7 @@ bool kvm_has_waitpkg(void)
 
 bool kvm_arch_cpu_check_are_resettable(void)
 {
-    return !sev_es_enabled();
+    return !sev_es_enabled() && !is_tdx_vm();
 }
 
 #define ARCH_REQ_XCOMP_GUEST_PERM       0x1025
-- 
2.34.1


* [PATCH v3 63/70] i386/tdx: LMCE is not supported for TDX
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (61 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 62/70] i386/tdx: Don't allow system reset " Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 64/70] hw/i386: add eoi_intercept_unsupported member to X86MachineState Xiaoyao Li
                   ` (6 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

LMCE is not supported for TDX since KVM doesn't provide emulation for
MSR_IA32_FEAT_CTL.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
---
 target/i386/kvm/kvm-cpu.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/kvm-cpu.c b/target/i386/kvm/kvm-cpu.c
index 9c791b7b0520..8c618869533c 100644
--- a/target/i386/kvm/kvm-cpu.c
+++ b/target/i386/kvm/kvm-cpu.c
@@ -15,6 +15,7 @@
 #include "sysemu/sysemu.h"
 #include "hw/boards.h"
 
+#include "tdx.h"
 #include "kvm_i386.h"
 #include "hw/core/accel-cpu.h"
 
@@ -60,6 +61,10 @@ static bool lmce_supported(void)
     if (kvm_ioctl(kvm_state, KVM_X86_GET_MCE_CAP_SUPPORTED, &mce_cap) < 0) {
         return false;
     }
+
+    if (is_tdx_vm())
+        return false;
+
     return !!(mce_cap & MCG_LMCE_P);
 }
 
-- 
2.34.1


* [PATCH v3 64/70] hw/i386: add eoi_intercept_unsupported member to X86MachineState
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (62 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 63/70] i386/tdx: LMCE is not supported for TDX Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 65/70] hw/i386: add option to forcibly report edge trigger in acpi tables Xiaoyao Li
                   ` (5 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Add a new bool member, eoi_intercept_unsupported, to X86MachineState
with default value false. Set it to true for TDX VMs.

When EOI cannot be intercepted, a level-triggered interrupt cannot be
re-injected while its level is still active, which affects interrupt
controller emulation.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 hw/i386/x86.c         | 1 +
 include/hw/i386/x86.h | 1 +
 target/i386/kvm/tdx.c | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/hw/i386/x86.c b/hw/i386/x86.c
index 0f69b55c5219..58206396c2da 100644
--- a/hw/i386/x86.c
+++ b/hw/i386/x86.c
@@ -1413,6 +1413,7 @@ static void x86_machine_initfn(Object *obj)
     x86ms->oem_table_id = g_strndup(ACPI_BUILD_APPNAME8, 8);
     x86ms->bus_lock_ratelimit = 0;
     x86ms->above_4g_mem_start = 4 * GiB;
+    x86ms->eoi_intercept_unsupported = false;
 }
 
 static void x86_machine_class_init(ObjectClass *oc, void *data)
diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
index ab1d38569019..b689feb389b3 100644
--- a/include/hw/i386/x86.h
+++ b/include/hw/i386/x86.h
@@ -59,6 +59,7 @@ struct X86MachineState {
 
     /* CPU and apic information: */
     bool apic_xrupt_override;
+    bool eoi_intercept_unsupported;
     unsigned pci_irq_mask;
     unsigned apic_id_limit;
     uint16_t boot_cpus;
diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
index 208fe3572839..1e8875f4af34 100644
--- a/target/i386/kvm/tdx.c
+++ b/target/i386/kvm/tdx.c
@@ -706,6 +706,8 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
         return -EINVAL;
     }
 
+    x86ms->eoi_intercept_unsupported = true;
+
     if (!tdx_caps) {
         r = get_tdx_capabilities(errp);
         if (r) {
-- 
2.34.1


* [PATCH v3 65/70] hw/i386: add option to forcibly report edge trigger in acpi tables
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (63 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 64/70] hw/i386: add eoi_intercept_unsupported member to X86MachineState Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 66/70] i386/tdx: Don't synchronize guest tsc for TDs Xiaoyao Li
                   ` (4 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

When level-triggered interrupts aren't supported on the x86 platform,
forcibly report edge trigger in the ACPI tables.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 hw/i386/acpi-build.c  | 99 ++++++++++++++++++++++++++++---------------
 hw/i386/acpi-common.c | 50 ++++++++++++++++------
 2 files changed, 104 insertions(+), 45 deletions(-)

diff --git a/hw/i386/acpi-build.c b/hw/i386/acpi-build.c
index 80db183b786a..23df7163e5e7 100644
--- a/hw/i386/acpi-build.c
+++ b/hw/i386/acpi-build.c
@@ -975,7 +975,8 @@ static void build_dbg_aml(Aml *table)
     aml_append(table, scope);
 }
 
-static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg)
+static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg,
+                           bool level_trigger_unsupported)
 {
     Aml *dev;
     Aml *crs;
@@ -987,7 +988,10 @@ static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg)
     aml_append(dev, aml_name_decl("_UID", aml_int(uid)));
 
     crs = aml_resource_template();
-    aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
+    aml_append(crs, aml_interrupt(AML_CONSUMER,
+                                  level_trigger_unsupported ?
+                                  AML_EDGE : AML_LEVEL,
+                                  AML_ACTIVE_HIGH,
                                   AML_SHARED, irqs, ARRAY_SIZE(irqs)));
     aml_append(dev, aml_name_decl("_PRS", crs));
 
@@ -1011,7 +1015,8 @@ static Aml *build_link_dev(const char *name, uint8_t uid, Aml *reg)
     return dev;
  }
 
-static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi)
+static Aml *build_gsi_link_dev(const char *name, uint8_t uid,
+                               uint8_t gsi, bool level_trigger_unsupported)
 {
     Aml *dev;
     Aml *crs;
@@ -1024,7 +1029,10 @@ static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi)
 
     crs = aml_resource_template();
     irqs = gsi;
-    aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL, AML_ACTIVE_HIGH,
+    aml_append(crs, aml_interrupt(AML_CONSUMER,
+                                  level_trigger_unsupported ?
+                                  AML_EDGE : AML_LEVEL,
+                                  AML_ACTIVE_HIGH,
                                   AML_SHARED, &irqs, 1));
     aml_append(dev, aml_name_decl("_PRS", crs));
 
@@ -1043,7 +1051,7 @@ static Aml *build_gsi_link_dev(const char *name, uint8_t uid, uint8_t gsi)
 }
 
 /* _CRS method - get current settings */
-static Aml *build_iqcr_method(bool is_piix4)
+static Aml *build_iqcr_method(bool is_piix4, bool level_trigger_unsupported)
 {
     Aml *if_ctx;
     uint32_t irqs;
@@ -1051,7 +1059,9 @@ static Aml *build_iqcr_method(bool is_piix4)
     Aml *crs = aml_resource_template();
 
     irqs = 0;
-    aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL,
+    aml_append(crs, aml_interrupt(AML_CONSUMER,
+                                  level_trigger_unsupported ?
+                                  AML_EDGE : AML_LEVEL,
                                   AML_ACTIVE_HIGH, AML_SHARED, &irqs, 1));
     aml_append(method, aml_name_decl("PRR0", crs));
 
@@ -1085,7 +1095,7 @@ static Aml *build_irq_status_method(void)
     return method;
 }
 
-static void build_piix4_pci0_int(Aml *table)
+static void build_piix4_pci0_int(Aml *table, bool level_trigger_unsupported)
 {
     Aml *dev;
     Aml *crs;
@@ -1098,12 +1108,16 @@ static void build_piix4_pci0_int(Aml *table)
     aml_append(sb_scope, pci0_scope);
 
     aml_append(sb_scope, build_irq_status_method());
-    aml_append(sb_scope, build_iqcr_method(true));
+    aml_append(sb_scope, build_iqcr_method(true, level_trigger_unsupported));
 
-    aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQ0")));
-    aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQ1")));
-    aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQ2")));
-    aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQ3")));
+    aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQ0"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQ1"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQ2"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQ3"),
+                                        level_trigger_unsupported));
 
     dev = aml_device("LNKS");
     {
@@ -1112,7 +1126,9 @@ static void build_piix4_pci0_int(Aml *table)
 
         crs = aml_resource_template();
         irqs = 9;
-        aml_append(crs, aml_interrupt(AML_CONSUMER, AML_LEVEL,
+        aml_append(crs, aml_interrupt(AML_CONSUMER,
+                                      level_trigger_unsupported ?
+                                      AML_EDGE : AML_LEVEL,
                                       AML_ACTIVE_HIGH, AML_SHARED,
                                       &irqs, 1));
         aml_append(dev, aml_name_decl("_PRS", crs));
@@ -1198,7 +1214,7 @@ static Aml *build_q35_routing_table(const char *str)
     return pkg;
 }
 
-static void build_q35_pci0_int(Aml *table)
+static void build_q35_pci0_int(Aml *table, bool level_trigger_unsupported)
 {
     Aml *method;
     Aml *sb_scope = aml_scope("_SB");
@@ -1237,25 +1253,41 @@ static void build_q35_pci0_int(Aml *table)
     aml_append(sb_scope, pci0_scope);
 
     aml_append(sb_scope, build_irq_status_method());
-    aml_append(sb_scope, build_iqcr_method(false));
+    aml_append(sb_scope, build_iqcr_method(false, level_trigger_unsupported));
 
-    aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQA")));
-    aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQB")));
-    aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQC")));
-    aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQD")));
-    aml_append(sb_scope, build_link_dev("LNKE", 4, aml_name("PRQE")));
-    aml_append(sb_scope, build_link_dev("LNKF", 5, aml_name("PRQF")));
-    aml_append(sb_scope, build_link_dev("LNKG", 6, aml_name("PRQG")));
-    aml_append(sb_scope, build_link_dev("LNKH", 7, aml_name("PRQH")));
+    aml_append(sb_scope, build_link_dev("LNKA", 0, aml_name("PRQA"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKB", 1, aml_name("PRQB"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKC", 2, aml_name("PRQC"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKD", 3, aml_name("PRQD"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKE", 4, aml_name("PRQE"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKF", 5, aml_name("PRQF"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKG", 6, aml_name("PRQG"),
+                                        level_trigger_unsupported));
+    aml_append(sb_scope, build_link_dev("LNKH", 7, aml_name("PRQH"),
+                                        level_trigger_unsupported));
 
-    aml_append(sb_scope, build_gsi_link_dev("GSIA", 0x10, 0x10));
-    aml_append(sb_scope, build_gsi_link_dev("GSIB", 0x11, 0x11));
-    aml_append(sb_scope, build_gsi_link_dev("GSIC", 0x12, 0x12));
-    aml_append(sb_scope, build_gsi_link_dev("GSID", 0x13, 0x13));
-    aml_append(sb_scope, build_gsi_link_dev("GSIE", 0x14, 0x14));
-    aml_append(sb_scope, build_gsi_link_dev("GSIF", 0x15, 0x15));
-    aml_append(sb_scope, build_gsi_link_dev("GSIG", 0x16, 0x16));
-    aml_append(sb_scope, build_gsi_link_dev("GSIH", 0x17, 0x17));
+    aml_append(sb_scope, build_gsi_link_dev("GSIA", 0x10, 0x10,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSIB", 0x11, 0x11,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSIC", 0x12, 0x12,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSID", 0x13, 0x13,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSIE", 0x14, 0x14,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSIF", 0x15, 0x15,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSIG", 0x16, 0x16,
+                                            level_trigger_unsupported));
+    aml_append(sb_scope, build_gsi_link_dev("GSIH", 0x17, 0x17,
+                                            level_trigger_unsupported));
 
     aml_append(table, sb_scope);
 }
@@ -1436,6 +1468,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
     PCMachineState *pcms = PC_MACHINE(machine);
     PCMachineClass *pcmc = PC_MACHINE_GET_CLASS(machine);
     X86MachineState *x86ms = X86_MACHINE(machine);
+    bool level_trigger_unsupported = x86ms->eoi_intercept_unsupported;
     AcpiMcfgInfo mcfg;
     bool mcfg_valid = !!acpi_get_mcfg(&mcfg);
     uint32_t nr_mem = machine->ram_slots;
@@ -1468,7 +1501,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
         if (pm->pcihp_bridge_en || pm->pcihp_root_en) {
             build_x86_acpi_pci_hotplug(dsdt, pm->pcihp_io_base);
         }
-        build_piix4_pci0_int(dsdt);
+        build_piix4_pci0_int(dsdt, level_trigger_unsupported);
     } else if (q35) {
         sb_scope = aml_scope("_SB");
         dev = aml_device("PCI0");
@@ -1512,7 +1545,7 @@ build_dsdt(GArray *table_data, BIOSLinker *linker,
         if (pm->pcihp_bridge_en) {
             build_x86_acpi_pci_hotplug(dsdt, pm->pcihp_io_base);
         }
-        build_q35_pci0_int(dsdt);
+        build_q35_pci0_int(dsdt, level_trigger_unsupported);
     }
 
     if (misc->has_hpet) {
diff --git a/hw/i386/acpi-common.c b/hw/i386/acpi-common.c
index 43dc23f7e06f..26ff3c738e78 100644
--- a/hw/i386/acpi-common.c
+++ b/hw/i386/acpi-common.c
@@ -103,6 +103,7 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
     const CPUArchIdList *apic_ids = mc->possible_cpu_arch_ids(MACHINE(x86ms));
     AcpiTable table = { .sig = "APIC", .rev = 3, .oem_id = oem_id,
                         .oem_table_id = oem_table_id };
+    bool level_trigger_unsupported = x86ms->eoi_intercept_unsupported;
 
     acpi_table_begin(&table, table_data);
     /* Local APIC Address */
@@ -122,18 +123,43 @@ void acpi_build_madt(GArray *table_data, BIOSLinker *linker,
                      IO_APIC_SECONDARY_ADDRESS, IO_APIC_SECONDARY_IRQBASE);
     }
 
-    if (x86ms->apic_xrupt_override) {
-        build_xrupt_override(table_data, 0, 2,
-            0 /* Flags: Conforms to the specifications of the bus */);
-    }
-
-    for (i = 1; i < 16; i++) {
-        if (!(x86ms->pci_irq_mask & (1 << i))) {
-            /* No need for a INT source override structure. */
-            continue;
-        }
-        build_xrupt_override(table_data, i, i,
-            0xd /* Flags: Active high, Level Triggered */);
+    if (level_trigger_unsupported) {
+        /* Force edge trigger */
+        if (x86ms->apic_xrupt_override) {
+            build_xrupt_override(table_data, 0, 2,
+                                 /* Flags: active high, edge triggered */
+                                 1 | (1 << 2));
+        }
+
+        for (i = x86ms->apic_xrupt_override ? 1 : 0; i < 16; i++) {
+            build_xrupt_override(table_data, i, i,
+                                 /* Flags: active high, edge triggered */
+                                 1 | (1 << 2));
+        }
+
+        if (x86ms->ioapic2) {
+            for (i = 0; i < 16; i++) {
+                build_xrupt_override(table_data, IO_APIC_SECONDARY_IRQBASE + i,
+                                     IO_APIC_SECONDARY_IRQBASE + i,
+                                     /* Flags: active high, edge triggered */
+                                     1 | (1 << 2));
+            }
+        }
+    } else {
+        if (x86ms->apic_xrupt_override) {
+            build_xrupt_override(table_data, 0, 2,
+                                 0 /* Flags: Conforms to the specifications of the bus */);
+        }
+
+        for (i = 1; i < 16; i++) {
+            if (!(x86ms->pci_irq_mask & (1 << i))) {
+                /* No need for a INT source override structure. */
+                continue;
+            }
+            build_xrupt_override(table_data, i, i,
+                                 0xd /* Flags: Active high, Level Triggered */);
+
+        }
     }
 
     if (x2apic_mode) {
-- 
2.34.1
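
For reference, the flag values passed to build_xrupt_override() in this patch
follow the MPS INTI encoding used by MADT Interrupt Source Override entries:
bits 1:0 give the polarity and bits 3:2 give the trigger mode. A minimal
decoding sketch in C follows; the macro and helper names are illustrative
assumptions and are not part of QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* MPS INTI flags (illustrative names, not QEMU identifiers). */
#define INTI_POLARITY_MASK   0x3u                        /* bits 1:0 */
#define INTI_POLARITY_HIGH   0x1u                        /* 01b: active high */
#define INTI_TRIGGER_SHIFT   2                           /* bits 3:2 */
#define INTI_TRIGGER_MASK    (0x3u << INTI_TRIGGER_SHIFT)
#define INTI_TRIGGER_EDGE    (0x1u << INTI_TRIGGER_SHIFT) /* 01b: edge */
#define INTI_TRIGGER_LEVEL   (0x3u << INTI_TRIGGER_SHIFT) /* 11b: level */

/* True if the polarity field says "active high". */
static inline bool inti_is_active_high(uint16_t flags)
{
    return (flags & INTI_POLARITY_MASK) == INTI_POLARITY_HIGH;
}

/* True if the trigger-mode field says "edge triggered". */
static inline bool inti_is_edge(uint16_t flags)
{
    return (flags & INTI_TRIGGER_MASK) == INTI_TRIGGER_EDGE;
}

/* True if the trigger-mode field says "level triggered". */
static inline bool inti_is_level(uint16_t flags)
{
    return (flags & INTI_TRIGGER_MASK) == INTI_TRIGGER_LEVEL;
}
```

Under this encoding, 1 | (1 << 2) is 0x5 (active high, edge triggered), 0xd is
active high, level triggered, and 0 leaves both fields as "conforms to the
specifications of the bus".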



* [PATCH v3 66/70] i386/tdx: Don't synchronize guest tsc for TDs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (64 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 65/70] hw/i386: add option to forcibly report edge trigger in acpi tables Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 67/70] i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() " Xiaoyao Li
                   ` (3 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Isaku Yamahata <isaku.yamahata@intel.com>

The TSC of a TD is not accessible, and KVM doesn't allow access to
MSR_IA32_TSC for TDs. To avoid hitting the assert() in kvm_get_tsc(), make
kvm_synchronize_all_tsc() a no-op for TDs.

Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
Reviewed-by: Connor Kuehl <ckuehl@redhat.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/kvm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index a74a0d8e0891..773e0b042ae9 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -279,7 +279,7 @@ void kvm_synchronize_all_tsc(void)
 {
     CPUState *cpu;
 
-    if (kvm_enabled()) {
+    if (kvm_enabled() && !is_tdx_vm()) {
         CPU_FOREACH(cpu) {
             run_on_cpu(cpu, do_kvm_synchronize_tsc, RUN_ON_CPU_NULL);
         }
-- 
2.34.1



* [PATCH v3 67/70] i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() for TDs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (65 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 66/70] i386/tdx: Don't synchronize guest tsc for TDs Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 68/70] i386/tdx: Skip kvm_put_apicbase() " Xiaoyao Li
                   ` (2 subsequent siblings)
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

For TDs, MSR_IA32_UCODE_REV is the only MSR in kvm_init_msrs() that the
VMM can configure. The features enumerated or controlled by the other
MSRs set in kvm_init_msrs() are not under the VMM's control.

Only configure MSR_IA32_UCODE_REV for TDs.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
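
The resulting selection logic can be modelled in isolation. The sketch below
is an assumption-laden illustration, not QEMU code: the enum, the struct and
select_init_msrs() are hypothetical names, and each boolean stands in for the
full condition checked in kvm_init_msrs() (for example, has_msr_perf_capabs
together with cpu->enable_pmu).

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical labels for the MSR groups handled by kvm_init_msrs(). */
enum init_msr {
    INIT_MSR_ARCH_CAPABILITIES,
    INIT_MSR_CORE_CAPABILITY,
    INIT_MSR_PERF_CAPABILITIES,
    INIT_MSR_VMX,
    INIT_MSR_UCODE_REV,
    INIT_MSR_MAX
};

/* Hypothetical summary of the per-MSR conditions used by the real code. */
struct init_msr_conds {
    bool arch_capabs;
    bool core_capabs;
    bool perf_capabs;
    bool vmx;
    bool ucode_rev;
};

/* Fill 'out' with the MSR groups that would be configured; return the count. */
static size_t select_init_msrs(bool is_tdx, const struct init_msr_conds *c,
                               enum init_msr out[INIT_MSR_MAX])
{
    size_t n = 0;

    if (!is_tdx) {
        /* These are not under VMM control for a TD, so skip them there. */
        if (c->arch_capabs) {
            out[n++] = INIT_MSR_ARCH_CAPABILITIES;
        }
        if (c->core_capabs) {
            out[n++] = INIT_MSR_CORE_CAPABILITY;
        }
        if (c->perf_capabs) {
            out[n++] = INIT_MSR_PERF_CAPABILITIES;
        }
        if (c->vmx) {
            out[n++] = INIT_MSR_VMX;
        }
    }

    /* The microcode revision is configured for both TDs and non-TD guests. */
    if (c->ucode_rev) {
        out[n++] = INIT_MSR_UCODE_REV;
    }

    return n;
}
```

With every condition true, the model selects only the microcode revision for
a TD and all five groups for a non-TD guest.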
 target/i386/kvm/kvm.c | 44 ++++++++++++++++++++++---------------------
 1 file changed, 23 insertions(+), 21 deletions(-)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 773e0b042ae9..12d909d08862 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -3279,32 +3279,34 @@ static void kvm_init_msrs(X86CPU *cpu)
     CPUX86State *env = &cpu->env;
 
     kvm_msr_buf_reset(cpu);
-    if (has_msr_arch_capabs) {
-        kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES,
-                          env->features[FEAT_ARCH_CAPABILITIES]);
-    }
-
-    if (has_msr_core_capabs) {
-        kvm_msr_entry_add(cpu, MSR_IA32_CORE_CAPABILITY,
-                          env->features[FEAT_CORE_CAPABILITY]);
-    }
-
-    if (has_msr_perf_capabs && cpu->enable_pmu) {
-        kvm_msr_entry_add_perf(cpu, env->features);
+
+    if (!is_tdx_vm()) {
+        if (has_msr_arch_capabs) {
+            kvm_msr_entry_add(cpu, MSR_IA32_ARCH_CAPABILITIES,
+                                env->features[FEAT_ARCH_CAPABILITIES]);
+        }
+
+        if (has_msr_core_capabs) {
+            kvm_msr_entry_add(cpu, MSR_IA32_CORE_CAPABILITY,
+                                env->features[FEAT_CORE_CAPABILITY]);
+        }
+
+        if (has_msr_perf_capabs && cpu->enable_pmu) {
+            kvm_msr_entry_add_perf(cpu, env->features);
+        }
+
+        /*
+         * Older kernels do not include VMX MSRs in KVM_GET_MSR_INDEX_LIST, but
+         * all kernels with MSR features should have them.
+         */
+        if (kvm_feature_msrs && cpu_has_vmx(env)) {
+            kvm_msr_entry_add_vmx(cpu, env->features);
+        }
     }
 
     if (has_msr_ucode_rev) {
         kvm_msr_entry_add(cpu, MSR_IA32_UCODE_REV, cpu->ucode_rev);
     }
-
-    /*
-     * Older kernels do not include VMX MSRs in KVM_GET_MSR_INDEX_LIST, but
-     * all kernels with MSR features should have them.
-     */
-    if (kvm_feature_msrs && cpu_has_vmx(env)) {
-        kvm_msr_entry_add_vmx(cpu, env->features);
-    }
-
     assert(kvm_buf_set_msrs(cpu) == 0);
 }
 
-- 
2.34.1



* [PATCH v3 68/70] i386/tdx: Skip kvm_put_apicbase() for TDs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (66 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 67/70] i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() " Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 69/70] i386/tdx: Don't get/put guest state for TDX VMs Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 70/70] docs: Add TDX documentation Xiaoyao Li
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

KVM doesn't allow writing to MSR_IA32_APICBASE for TDs.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
 target/i386/kvm/kvm.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 12d909d08862..5c5400c51cd1 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -3061,6 +3061,11 @@ void kvm_put_apicbase(X86CPU *cpu, uint64_t value)
 {
     int ret;
 
+    /* TODO: Allow accessing guest state for debug TDs. */
+    if (is_tdx_vm()) {
+        return;
+    }
+
     ret = kvm_put_one_msr(cpu, MSR_IA32_APICBASE, value);
     assert(ret == 1);
 }
-- 
2.34.1



* [PATCH v3 69/70] i386/tdx: Don't get/put guest state for TDX VMs
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (67 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 68/70] i386/tdx: Skip kvm_put_apicbase() " Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  2023-11-15  7:15 ` [PATCH v3 70/70] docs: Add TDX documentation Xiaoyao Li
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

From: Sean Christopherson <sean.j.christopherson@intel.com>

Don't get/put state of TDX VMs since accessing/mutating guest state of
production TDs is not supported.

Note, it will be allowed for a debug TD. Corresponding support will be
introduced when debug TD support is implemented in the future.

Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com>
Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
Acked-by: Gerd Hoffmann <kraxel@redhat.com>
---
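
One possible shape for the guard once debug TDs are supported is sketched
below. This is only an assumption about future work: the patch itself skips
state access for every TD, and TD_ATTR_DEBUG, guest_state_accessible() and
put_registers_sketch() are hypothetical names. The only grounded detail is
that bit 0 of the TD attributes is the DEBUG bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit 0 of the TD attributes is the DEBUG bit (hypothetical macro name). */
#define TD_ATTR_DEBUG (UINT64_C(1) << 0)

/* Guest state is accessible for non-TD guests and, by assumption, debug TDs. */
static bool guest_state_accessible(bool is_tdx, uint64_t td_attributes)
{
    if (!is_tdx) {
        return true;
    }
    return (td_attributes & TD_ATTR_DEBUG) != 0;
}

/*
 * Model of a register "put" path: skip silently and report success when the
 * state is protected, otherwise count one write to the guest state.
 */
static int put_registers_sketch(bool is_tdx, uint64_t td_attributes,
                                int *writes)
{
    if (!guest_state_accessible(is_tdx, td_attributes)) {
        return 0;
    }
    (*writes)++;
    return 0;
}
```

Returning 0 for the skipped case mirrors the patch, which treats the skipped
access as success so that callers do not need TDX-specific error handling.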
 target/i386/kvm/kvm.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
index 5c5400c51cd1..56171dc76235 100644
--- a/target/i386/kvm/kvm.c
+++ b/target/i386/kvm/kvm.c
@@ -4621,6 +4621,11 @@ int kvm_arch_put_registers(CPUState *cpu, int level)
 
     assert(cpu_is_stopped(cpu) || qemu_cpu_is_self(cpu));
 
+    /* TODO: Allow accessing guest state for debug TDs. */
+    if (is_tdx_vm()) {
+        return 0;
+    }
+
     /*
      * Put MSR_IA32_FEATURE_CONTROL first, this ensures the VM gets out of VMX
      * root operation upon vCPU reset. kvm_put_msr_feature_control() should also
@@ -4721,6 +4726,12 @@ int kvm_arch_get_registers(CPUState *cs)
     if (ret < 0) {
         goto out;
     }
+
+    /* TODO: Allow accessing guest state for debug TDs. */
+    if (is_tdx_vm()) {
+        return 0;
+    }
+
     ret = kvm_getput_regs(cpu, 0);
     if (ret < 0) {
         goto out;
-- 
2.34.1



* [PATCH v3 70/70] docs: Add TDX documentation
  2023-11-15  7:14 [PATCH v3 00/70] QEMU Guest memfd + QEMU TDX support Xiaoyao Li
                   ` (68 preceding siblings ...)
  2023-11-15  7:15 ` [PATCH v3 69/70] i386/tdx: Don't get/put guest state for TDX VMs Xiaoyao Li
@ 2023-11-15  7:15 ` Xiaoyao Li
  69 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-15  7:15 UTC (permalink / raw)
  To: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, xiaoyao.li, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Add docs/system/i386/tdx.rst for TDX support, and add TDX to
confidential-guest-support.rst.

Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>

---
Changes since v1:
 - Add prerequisite of private gmem;
 - update example command to launch TD;

Changes since RFC v4:
 - add the restriction that kernel-irqchip must be split
---
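
As a small illustration of the "Feature Control" section of the new document,
the attribute bits it lists (DEBUG bit 0, PKS bit 30, PERFMON bit 63) and an
XCR0-format XFAM mask can be composed as below. This is a sketch under
assumptions: all macro and function names are hypothetical, and the single
"supported" mask is a simplification of whatever KVM_TDX_CAPABILITIES actually
reports.

```c
#include <stdbool.h>
#include <stdint.h>

/* TD attribute bits as listed in tdx.rst; the macro names are hypothetical. */
#define TD_ATTR_DEBUG    (UINT64_C(1) << 0)
#define TD_ATTR_PKS      (UINT64_C(1) << 30)
#define TD_ATTR_PERFMON  (UINT64_C(1) << 63)

/* A few XFAM bits, using the architectural XCR0 bit layout. */
#define XFAM_X87  (UINT64_C(1) << 0)
#define XFAM_SSE  (UINT64_C(1) << 1)
#define XFAM_AVX  (UINT64_C(1) << 2)

/* Build the attributes mask for a non-debug TD. */
static uint64_t td_attributes(bool pks, bool perfmon)
{
    uint64_t attrs = 0;

    if (pks) {
        attrs |= TD_ATTR_PKS;
    }
    if (perfmon) {
        attrs |= TD_ATTR_PERFMON;
    }
    return attrs;
}

/* True if every requested XFAM bit lies within the assumed supported mask. */
static bool xfam_is_subset(uint64_t requested, uint64_t supported)
{
    return (requested & ~supported) == 0;
}
```

A caller would build the requested masks first and validate them against the
reported capabilities before issuing KVM_TDX_INIT_VM.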
 docs/system/confidential-guest-support.rst |   1 +
 docs/system/i386/tdx.rst                   | 113 +++++++++++++++++++++
 docs/system/target-i386.rst                |   1 +
 3 files changed, 115 insertions(+)
 create mode 100644 docs/system/i386/tdx.rst

diff --git a/docs/system/confidential-guest-support.rst b/docs/system/confidential-guest-support.rst
index 0c490dbda2b7..66129fbab64c 100644
--- a/docs/system/confidential-guest-support.rst
+++ b/docs/system/confidential-guest-support.rst
@@ -38,6 +38,7 @@ Supported mechanisms
 Currently supported confidential guest mechanisms are:
 
 * AMD Secure Encrypted Virtualization (SEV) (see :doc:`i386/amd-memory-encryption`)
+* Intel Trust Domain Extensions (TDX) (see :doc:`i386/tdx`)
 * POWER Protected Execution Facility (PEF) (see :ref:`power-papr-protected-execution-facility-pef`)
 * s390x Protected Virtualization (PV) (see :doc:`s390x/protvirt`)
 
diff --git a/docs/system/i386/tdx.rst b/docs/system/i386/tdx.rst
new file mode 100644
index 000000000000..1872e4f5a8be
--- /dev/null
+++ b/docs/system/i386/tdx.rst
@@ -0,0 +1,113 @@
+Intel Trust Domain Extensions (TDX)
+===================================
+
+Intel Trust Domain Extensions (TDX) refers to an Intel technology that extends
+Virtual Machine Extensions (VMX) and Multi-Key Total Memory Encryption (MKTME)
+with a new kind of virtual machine guest called a Trust Domain (TD). A TD runs
+in a CPU mode that is designed to protect the confidentiality of its memory
+contents and its CPU state from any other software, including the hosting
+Virtual Machine Monitor (VMM), unless explicitly shared by the TD itself.
+
+Prerequisites
+-------------
+
+To run a TD, the physical machine needs to have the TDX module loaded and
+initialized, and the KVM hypervisor needs to have TDX support enabled. If those
+requirements are met, ``KVM_CAP_VM_TYPES`` reports support for ``KVM_X86_TDX_VM``.
+
+Trust Domain Virtual Firmware (TDVF)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+Trust Domain Virtual Firmware (TDVF) is required to provide TD services to boot
+the TD guest OS. TDVF needs to be copied to guest private memory and measured
+before a TD boots.
+
+The VM-scope ``MEMORY_ENCRYPT_OP`` ioctl provides the ``KVM_TDX_INIT_MEM_REGION``
+command to copy the TDVF image to the TD's private memory space.
+
+Since TDX doesn't support read-only memslots, TDVF cannot be mapped as a pflash
+device; it actually works as RAM. The ``-bios`` option is used to load TDVF.
+
+OVMF is the open-source firmware that implements TDVF support. Thus the
+command line to specify and load TDVF is ``-bios OVMF.fd``.
+
+KVM private gmem
+~~~~~~~~~~~~~~~~
+
+A TD's memory (RAM) needs to be convertible between private and shared, and its
+BIOS (OVMF/TDVF) needs to be mapped as private. Thus QEMU needs to allocate
+private gmem for them via KVM's ``KVM_CREATE_GUEST_MEMFD`` ioctl, which requires
+a KVM that is new enough to report ``KVM_CAP_GUEST_MEMFD``.
+
+Feature Control
+---------------
+
+Unlike a non-TDX VM, the CPU features (enumerated by CPUID or MSR) of a TD are
+not under full control of the VMM. The VMM can only configure some features of a
+TD, via the ``KVM_TDX_INIT_VM`` command of the VM-scope ``MEMORY_ENCRYPT_OP`` ioctl.
+
+There are three types of configurable features:
+
+- Attributes:
+  PKS (bit 30) controls whether Supervisor Protection Keys is exposed to the
+  TD, which determines the related CPUID bit and CR4 bit;
+  PERFMON (bit 63) controls whether the PMU is exposed to the TD.
+
+- XSAVE related features (XFAM):
+  XFAM is a 64-bit mask, which has the same format as XCR0 or IA32_XSS MSR. It
+  determines the set of extended features available for use by the guest TD.
+
+- CPUID features:
+  Only some bits of some CPUID leaves are directly configurable by VMM.
+
+Which features can be configured is reported via the TDX capabilities.
+
+TDX capabilities
+~~~~~~~~~~~~~~~~
+
+The VM-scope ``MEMORY_ENCRYPT_OP`` ioctl provides the ``KVM_TDX_CAPABILITIES``
+command to get the TDX capabilities from KVM. It returns a
+``struct kvm_tdx_capabilities``, which describes the supported configuration of
+attributes, XFAM and CPUIDs.
+
+Launching a TD (TDX VM)
+-----------------------
+
+To launch a TDX guest, the following options are newly added and required:
+
+.. parsed-literal::
+
+    |qemu_system_x86| \\
+        -object tdx-guest,id=tdx0 \\
+        -machine ...,kernel-irqchip=split,confidential-guest-support=tdx0 \\
+        -bios OVMF.fd \\
+
+Debugging
+---------
+
+Bit 0 of the TD attributes is the DEBUG bit, which decides whether the TD runs
+in off-TD debug mode. In off-TD debug mode, the TD's VCPU state and private
+memory are accessible via given SEAMCALLs. This requires KVM to expose APIs to
+invoke those SEAMCALLs, along with corresponding QEMU changes.
+
+This is targeted as future work.
+
+Restrictions
+------------
+
+ - kernel-irqchip must be split;
+
+ - No readonly support for private memory;
+
+ - No SMM support: SMM support requires manipulating the guest register state,
+   which is not allowed;
+
+Live Migration
+--------------
+
+TODO
+
+References
+----------
+
+- `TDX Homepage <https://www.intel.com/content/www/us/en/developer/articles/technical/intel-trust-domain-extensions.html>`__
diff --git a/docs/system/target-i386.rst b/docs/system/target-i386.rst
index 1b8a1f248abb..4d58cdbc4e06 100644
--- a/docs/system/target-i386.rst
+++ b/docs/system/target-i386.rst
@@ -29,6 +29,7 @@ Architectural features
    i386/kvm-pv
    i386/sgx
    i386/amd-memory-encryption
+   i386/tdx
 
 OS requirements
 ~~~~~~~~~~~~~~~
-- 
2.34.1



* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15  7:14 ` [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd Xiaoyao Li
@ 2023-11-15 10:20   ` Daniel P. Berrangé
  2023-11-16  3:34     ` Xiaoyao Li
  2023-11-15 17:54   ` David Hildenbrand
                     ` (2 subsequent siblings)
  3 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 10:20 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:11AM -0500, Xiaoyao Li wrote:
> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
> and kvm guest memfd based private memory can be associated in one RAMBlock.
> 
> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
> create private guest_memfd during RAMBlock setup.
> 
> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
> confidential guests, such as TDX VM. How and when to set it for memory
> backends will be implemented in the following patches.
> 
> Introduce memory_region_has_guest_memfd() to query if the MemoryRegion has
> KVM guest_memfd allocated.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename gmem to guest_memfd;
> - close(guest_memfd) when RAMBlock is released; (Daniel P. Berrangé)
> - Squash the patch that introduces memory_region_has_guest_memfd().
> ---
>  accel/kvm/kvm-all.c     | 24 ++++++++++++++++++++++++
>  include/exec/memory.h   | 13 +++++++++++++
>  include/exec/ramblock.h |  1 +
>  include/sysemu/kvm.h    |  2 ++
>  system/memory.c         |  5 +++++
>  system/physmem.c        | 27 ++++++++++++++++++++++++---
>  6 files changed, 69 insertions(+), 3 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index c1b40e873531..9f751d4971f8 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -101,6 +101,7 @@ bool kvm_msi_use_devid;
>  bool kvm_has_guest_debug;
>  static int kvm_sstep_flags;
>  static bool kvm_immediate_exit;
> +static bool kvm_guest_memfd_supported;
>  static hwaddr kvm_max_slot_size = ~0;
>  
>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
> @@ -2397,6 +2398,8 @@ static int kvm_init(MachineState *ms)
>      }
>      s->as = g_new0(struct KVMAs, s->nr_as);
>  
> +    kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
> +
>      if (object_property_find(OBJECT(current_machine), "kvm-type")) {
>          g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
>                                                              "kvm-type",
> @@ -4078,3 +4081,24 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
>          query_stats_schema_vcpu(first_cpu, &stats_args);
>      }
>  }
> +
> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
> +{
> +    int fd;
> +    struct kvm_create_guest_memfd guest_memfd = {
> +        .size = size,
> +        .flags = flags,
> +    };
> +
> +    if (!kvm_guest_memfd_supported) {
> +        error_setg(errp, "KVM doesn't support guest memfd\n");
> +        return -EOPNOTSUPP;

Returning an errno value is unusual when we have an 'Error **errp' parameter
for reporting, and the following codepath merely returns -1, so this is
inconsistent. Just return -1 here too.

> +    }
> +
> +    fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
> +    if (fd < 0) {
> +        error_setg_errno(errp, errno, "%s: error creating kvm guest memfd\n", __func__);

I'd prefer an explicit 'return -1' here, even though 'fd' is technically going
to be -1 already.

Also including __func__ in the error message is not really needed IMHO

> +    }
> +
> +    return fd;
> +}
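
The convention being asked for here (report details through errp, return a
plain -1 on every failure path) can be sketched in a self-contained form. The
Error type and set_error() below are simplified stand-ins for QEMU's Error
API, written only for illustration, and the 'supported' and 'ioctl_result'
parameters are hypothetical substitutes for the capability check and the
KVM_CREATE_GUEST_MEMFD ioctl.

```c
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>

/* Simplified stand-in for QEMU's Error object (illustration only). */
typedef struct Error {
    char msg[128];
} Error;

/* Stand-in for error_setg(): store a message if the caller wants one. */
static void set_error(Error **errp, const char *msg)
{
    if (errp && !*errp) {
        Error *err = calloc(1, sizeof(*err));
        if (err) {
            snprintf(err->msg, sizeof(err->msg), "%s", msg);
            *errp = err;
        }
    }
}

/*
 * Every failure path sets errp and returns -1, so callers only need to test
 * for a negative return value; success returns the new file descriptor.
 */
static int create_guest_memfd_sketch(bool supported, int ioctl_result,
                                     Error **errp)
{
    if (!supported) {
        set_error(errp, "KVM doesn't support guest memfd");
        return -1;
    }

    if (ioctl_result < 0) {
        set_error(errp, "error creating kvm guest memfd");
        return -1;
    }

    return ioctl_result;
}
```

Keeping a single failure value avoids mixing two error channels: the errno
detail, when wanted, travels inside the error object rather than in the
return code.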
> diff --git a/include/exec/memory.h b/include/exec/memory.h
> index 831f7c996d9d..f780367ab1bd 100644
> --- a/include/exec/memory.h
> +++ b/include/exec/memory.h
> @@ -243,6 +243,9 @@ typedef struct IOMMUTLBEvent {
>  /* RAM FD is opened read-only */
>  #define RAM_READONLY_FD (1 << 11)
>  
> +/* RAM can be private that has kvm gmem backend */
> +#define RAM_GUEST_MEMFD   (1 << 12)
> +
>  static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>                                         IOMMUNotifierFlag flags,
>                                         hwaddr start, hwaddr end,
> @@ -1702,6 +1705,16 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>   */
>  bool memory_region_is_protected(MemoryRegion *mr);
>  
> +/**
> + * memory_region_has_guest_memfd: check whether a memory region has guest_memfd
> + *     associated
> + *
> + * Returns %true if a memory region's ram_block has valid guest_memfd assigned.
> + *
> + * @mr: the memory region being queried
> + */
> +bool memory_region_has_guest_memfd(MemoryRegion *mr);
> +
>  /**
>   * memory_region_get_iommu: check whether a memory region is an iommu
>   *
> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
> index 69c6a5390293..0a17ba882729 100644
> --- a/include/exec/ramblock.h
> +++ b/include/exec/ramblock.h
> @@ -41,6 +41,7 @@ struct RAMBlock {
>      QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
>      int fd;
>      uint64_t fd_offset;
> +    int guest_memfd;
>      size_t page_size;
>      /* dirty bitmap used during migration */
>      unsigned long *bmap;
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index d61487816421..fedc28c7d17f 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -538,4 +538,6 @@ bool kvm_arch_cpu_check_are_resettable(void);
>  bool kvm_dirty_ring_enabled(void);
>  
>  uint32_t kvm_dirty_ring_size(void);
> +
> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
>  #endif
> diff --git a/system/memory.c b/system/memory.c
> index 304fa843ea12..69741d91bbb7 100644
> --- a/system/memory.c
> +++ b/system/memory.c
> @@ -1862,6 +1862,11 @@ bool memory_region_is_protected(MemoryRegion *mr)
>      return mr->ram && (mr->ram_block->flags & RAM_PROTECTED);
>  }
>  
> +bool memory_region_has_guest_memfd(MemoryRegion *mr)
> +{
> +    return mr->ram_block && mr->ram_block->guest_memfd >= 0;
> +}
> +
>  uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>  {
>      uint8_t mask = mr->dirty_log_mask;
> diff --git a/system/physmem.c b/system/physmem.c
> index fc2b0fee0188..0af2213cbd9c 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>          }
>      }
>  
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
> +        new_block->guest_memfd < 0) {
> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
> +        uint64_t flags = 0;
> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
> +                                                        flags, errp);
> +        if (new_block->guest_memfd < 0) {
> +            qemu_mutex_unlock_ramlist();
> +            return;
> +        }
> +    }
> +#endif
> +
>      new_ram_size = MAX(old_ram_size,
>                (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
>      if (new_ram_size > old_ram_size) {
> @@ -1903,7 +1917,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      /* Just support these ram flags by now. */
>      assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
>                            RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
> -                          RAM_READONLY_FD)) == 0);
> +                          RAM_READONLY_FD | RAM_GUEST_MEMFD)) == 0);
>  
>      if (xen_enabled()) {
>          error_setg(errp, "-mem-path not supported with Xen");
> @@ -1938,6 +1952,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>      new_block->used_length = size;
>      new_block->max_length = size;
>      new_block->flags = ram_flags;
> +    new_block->guest_memfd = -1;
>      new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
>                                       errp);
>      if (!new_block->host) {
> @@ -2016,7 +2031,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>      Error *local_err = NULL;
>  
>      assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
> -                          RAM_NORESERVE)) == 0);
> +                          RAM_NORESERVE| RAM_GUEST_MEMFD)) == 0);
>      assert(!host ^ (ram_flags & RAM_PREALLOC));
>  
>      size = HOST_PAGE_ALIGN(size);
> @@ -2028,6 +2043,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>      new_block->max_length = max_size;
>      assert(max_size >= size);
>      new_block->fd = -1;
> +    new_block->guest_memfd = -1;
>      new_block->page_size = qemu_real_host_page_size();
>      new_block->host = host;
>      new_block->flags = ram_flags;
> @@ -2050,7 +2066,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>  RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
>                           MemoryRegion *mr, Error **errp)
>  {
> -    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
> +    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
>      return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
>  }
>  
> @@ -2078,6 +2094,11 @@ static void reclaim_ramblock(RAMBlock *block)
>      } else {
>          qemu_anon_ram_free(block->host, block->max_length);
>      }
> +
> +    if (block->guest_memfd >= 0) {
> +        close(block->guest_memfd);
> +    }
> +
>      g_free(block);
>  }
>  
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-11-15  7:14 ` [PATCH v3 06/70] kvm: Introduce support for memory_attributes Xiaoyao Li
@ 2023-11-15 10:38   ` Daniel P. Berrangé
  2023-11-16  3:40     ` Xiaoyao Li
  2023-12-12 13:56   ` Wang, Wei W
  1 sibling, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 10:38 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:15AM -0500, Xiaoyao Li wrote:
> Introduce the helper functions to set the attributes of a range of
> memory to private or shared.
> 
> This is necessary to notify KVM the private/shared attribute of each gpa
> range. KVM needs the information to decide the GPA needs to be mapped at
> hva-based shared memory or guest_memfd based private memory.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
>  accel/kvm/kvm-all.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
>  include/sysemu/kvm.h |  3 +++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 69afeb47c9c0..76e2404d54d2 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
>  static int kvm_sstep_flags;
>  static bool kvm_immediate_exit;
>  static bool kvm_guest_memfd_supported;
> +static uint64_t kvm_supported_memory_attributes;
>  static hwaddr kvm_max_slot_size = ~0;
>  
>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
> @@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
>      kvm_max_slot_size = max_slot_size;
>  }
>  
> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> +{
> +    struct kvm_memory_attributes attrs;
> +    int r;
> +
> +    attrs.attributes = attr;
> +    attrs.address = start;
> +    attrs.size = size;
> +    attrs.flags = 0;
> +
> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> +    if (r) {
> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'",
> +                     __func__, start, size, attr, strerror(errno));

This is an error condition rather than a warning condition.

Also, again, I think __func__ is generally not required in an error
message if the error message text is suitably descriptive. This applies
to other patches in this series too.
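
To illustrate the point about descriptive messages, here is a minimal,
self-contained sketch. It is not the QEMU code: demo_hwaddr and
demo_format_attr_error() are hypothetical names, and in QEMU the text
would be passed to error_report() rather than formatted into a buffer.
The message names the operation, the range and the errno string, so it
stays understandable without __func__. The fixed-width format macros
are used because the address type is 64 bits wide.

```c
#include <errno.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for QEMU's hwaddr, which is a 64-bit type. */
typedef uint64_t demo_hwaddr;

/*
 * Hypothetical helper: format the failure text into buf.  The wording
 * names the operation, the range and the errno string, so no __func__
 * prefix is needed to make sense of it.
 */
static int demo_format_attr_error(char *buf, size_t len, demo_hwaddr start,
                                  demo_hwaddr size, uint64_t attr, int err)
{
    return snprintf(buf, len,
                    "failed to set memory attributes 0x%" PRIx64
                    " on range 0x%" PRIx64 "+0x%" PRIx64 ": %s",
                    attr, start, size, strerror(err));
}
```

A caller would pair such a message with a negative return value, so
that the severity of the report matches what the function returns.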

> +    }
> +    return r;
> +}
> +
> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
> +{
> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> +        return -EINVAL;
> +    }
> +
> +    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
> +{
> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> +        return -EINVAL;
> +    }
> +
> +    return kvm_set_memory_attributes(start, size, 0);
> +}
> +
>  /* Called with KVMMemoryListener.slots_lock held */
>  static void kvm_set_phys_mem(KVMMemoryListener *kml,
>                               MemoryRegionSection *section, bool add)
> @@ -2440,6 +2479,9 @@ static int kvm_init(MachineState *ms)
>  
>      kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
>  
> +    ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
> +    kvm_supported_memory_attributes = ret > 0 ? ret : 0;
> +
>      if (object_property_find(OBJECT(current_machine), "kvm-type")) {
>          g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
>                                                              "kvm-type",
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index fedc28c7d17f..0e88958190a4 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -540,4 +540,7 @@ bool kvm_dirty_ring_enabled(void);
>  uint32_t kvm_dirty_ring_size(void);
>  
>  int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
> +
> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
>  #endif
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 10/70] kvm: handle KVM_EXIT_MEMORY_FAULT
  2023-11-15  7:14 ` [PATCH v3 10/70] kvm: handle KVM_EXIT_MEMORY_FAULT Xiaoyao Li
@ 2023-11-15 10:42   ` Daniel P. Berrangé
  2023-11-16  5:16     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 10:42 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:19AM -0500, Xiaoyao Li wrote:
> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
> KVM_EXIT_MEMORY_FAULT happens. It indicates userspace needs to do
> the memory conversion on the RAMBlock to turn the memory into desired
> attribute, i.e., private/shared.
> 
> Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
> guest_memfd memory backend.
> 
> Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
> added.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
>  accel/kvm/kvm-all.c | 76 +++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 66 insertions(+), 10 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 76e2404d54d2..58abbcb6926e 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -2902,6 +2902,50 @@ static void kvm_eat_signals(CPUState *cpu)
>      } while (sigismember(&chkset, SIG_IPI));
>  }
>  
> +static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
> +{
> +    MemoryRegionSection section;
> +    ram_addr_t offset;
> +    RAMBlock *rb;
> +    void *addr;
> +    int ret = -1;
> +
> +    section = memory_region_find(get_system_memory(), start, size);
> +    if (!section.mr) {
> +        return ret;
> +    }
> +
> +    if (memory_region_has_guest_memfd(section.mr)) {
> +        if (to_private) {
> +            ret = kvm_set_memory_attributes_private(start, size);
> +        } else {
> +            ret = kvm_set_memory_attributes_shared(start, size);
> +        }
> +
> +        if (ret) {
> +            memory_region_unref(section.mr);
> +            return ret;
> +        }
> +
> +        addr = memory_region_get_ram_ptr(section.mr) +
> +               section.offset_within_region;
> +        rb = qemu_ram_block_from_host(addr, false, &offset);
> +        /*
> +         * With KVM_SET_MEMORY_ATTRIBUTES by kvm_set_memory_attributes(),
> +         * operation on underlying file descriptor is only for releasing
> +         * unnecessary pages.
> +         */
> +        ram_block_convert_range(rb, offset, size, to_private);
> +    } else {
> +        warn_report("Convert non guest_memfd backed memory region "
> +                    "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
> +                    start, size, to_private ? "private" : "shared");

Again, if you're returning '-1' to indicate an error, then
using warn_report is wrong; it should be error_report.

warn_report is for when you return success, indicating
that the problem was non-fatal.
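
The convention can be sketched in a few lines of standalone C. This is
only an illustration under stated assumptions: demo_error_report() and
demo_warn_report() are hypothetical stand-ins for QEMU's error_report()
and warn_report(), and the two callers are invented for the example.

```c
#include <stdio.h>

/* Hypothetical stand-ins for QEMU's error_report() and warn_report(). */
static int demo_errors;
static int demo_warnings;

static void demo_error_report(const char *msg)
{
    demo_errors++;
    fprintf(stderr, "error: %s\n", msg);
}

static void demo_warn_report(const char *msg)
{
    demo_warnings++;
    fprintf(stderr, "warning: %s\n", msg);
}

/* Fatal for the caller: report an error and return failure. */
static int demo_convert(int has_guest_memfd)
{
    if (!has_guest_memfd) {
        demo_error_report("cannot convert memory that is not "
                          "guest_memfd backed");
        return -1;
    }
    return 0;
}

/* Non-fatal: report a warning and still return success. */
static int demo_discard_hint(int discard_supported)
{
    if (!discard_supported) {
        demo_warn_report("discard not supported, pages stay allocated");
    }
    return 0;
}
```

The pairing is what matters: a negative return goes with an error
report, and a warning goes with a successful return.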

> +    }
> +
> +    memory_region_unref(section.mr);
> +    return ret;
> +}
> +
>  int kvm_cpu_exec(CPUState *cpu)
>  {
>      struct kvm_run *run = cpu->kvm_run;
> @@ -2969,18 +3013,20 @@ int kvm_cpu_exec(CPUState *cpu)
>                  ret = EXCP_INTERRUPT;
>                  break;
>              }
> -            fprintf(stderr, "error: kvm run failed %s\n",
> -                    strerror(-run_ret));
> +            if (!(run_ret == -EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)) {
> +                fprintf(stderr, "error: kvm run failed %s\n",
> +                        strerror(-run_ret));
>  #ifdef TARGET_PPC
> -            if (run_ret == -EBUSY) {
> -                fprintf(stderr,
> -                        "This is probably because your SMT is enabled.\n"
> -                        "VCPU can only run on primary threads with all "
> -                        "secondary threads offline.\n");
> -            }
> +                if (run_ret == -EBUSY) {
> +                    fprintf(stderr,
> +                            "This is probably because your SMT is enabled.\n"
> +                            "VCPU can only run on primary threads with all "
> +                            "secondary threads offline.\n");
> +                }
>  #endif
> -            ret = -1;
> -            break;
> +                ret = -1;
> +                break;
> +            }
>          }
>  
>          trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
> @@ -3067,6 +3113,16 @@ int kvm_cpu_exec(CPUState *cpu)
>                  break;
>              }
>              break;
> +        case KVM_EXIT_MEMORY_FAULT:
> +            if (run->memory_fault.flags & ~KVM_MEMORY_EXIT_FLAG_PRIVATE) {
> +                error_report("KVM_EXIT_MEMORY_FAULT: Unknown flag 0x%" PRIx64,
> +                             (uint64_t)run->memory_fault.flags);
> +                ret = -1;
> +                break;
> +            }
> +            ret = kvm_convert_memory(run->memory_fault.gpa, run->memory_fault.size,
> +                                     run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE);
> +            break;
>          default:
>              DPRINTF("kvm_arch_handle_exit\n");
>              ret = kvm_arch_handle_exit(cpu, run);
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 14/70] target/i386: Implement mc->kvm_type() to get VM type
  2023-11-15  7:14 ` [PATCH v3 14/70] target/i386: Implement mc->kvm_type() to get VM type Xiaoyao Li
@ 2023-11-15 10:49   ` Daniel P. Berrangé
  2023-11-16  6:22     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 10:49 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:23AM -0500, Xiaoyao Li wrote:
> Implement mc->kvm_type() for i386 machines. It provides a way for user
> to create SW_PROTECTE_VM.

Small typo there: the final 'D' is missing in 'PROTECTED'.

> 
> Also store the vm_type in machinestate to other code to query what the
> VM type is.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
>  hw/i386/x86.c              | 12 ++++++++++++
>  include/hw/i386/x86.h      |  1 +
>  target/i386/kvm/kvm.c      | 25 +++++++++++++++++++++++++
>  target/i386/kvm/kvm_i386.h |  1 +
>  4 files changed, 39 insertions(+)
> 
> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
> index b3d054889bba..55678279bf3b 100644
> --- a/hw/i386/x86.c
> +++ b/hw/i386/x86.c
> @@ -1377,6 +1377,17 @@ static void machine_set_sgx_epc(Object *obj, Visitor *v, const char *name,
>      qapi_free_SgxEPCList(list);
>  }
>  
> +static int x86_kvm_type(MachineState *ms, const char *vm_type)
> +{
> +    X86MachineState *x86ms = X86_MACHINE(ms);
> +    int kvm_type;
> +
> +    kvm_type = kvm_get_vm_type(ms, vm_type);
> +    x86ms->vm_type = kvm_type;
> +
> +    return kvm_type;
> +}
> +
>  static void x86_machine_initfn(Object *obj)
>  {
>      X86MachineState *x86ms = X86_MACHINE(obj);
> @@ -1401,6 +1412,7 @@ static void x86_machine_class_init(ObjectClass *oc, void *data)
>      mc->cpu_index_to_instance_props = x86_cpu_index_to_props;
>      mc->get_default_cpu_node_id = x86_get_default_cpu_node_id;
>      mc->possible_cpu_arch_ids = x86_possible_cpu_arch_ids;
> +    mc->kvm_type = x86_kvm_type;
>      x86mc->save_tsc_khz = true;
>      x86mc->fwcfg_dma_enabled = true;
>      nc->nmi_monitor_handler = x86_nmi;
> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
> index da19ae15463a..ab1d38569019 100644
> --- a/include/hw/i386/x86.h
> +++ b/include/hw/i386/x86.h
> @@ -41,6 +41,7 @@ struct X86MachineState {
>      MachineState parent;
>  
>      /*< public >*/
> +    unsigned int vm_type;
>  
>      /* Pointers to devices and objects: */
>      ISADevice *rtc;
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index b4b9ce89842f..2e47fda25f95 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -161,6 +161,31 @@ static KVMMSRHandlers msr_handlers[KVM_MSR_FILTER_MAX_RANGES];
>  static RateLimit bus_lock_ratelimit_ctrl;
>  static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
>  
> +static const char* vm_type_name[] = {

Nitpick: 'char *vm_type_name[]' is the normal style.
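
For reference, a small standalone sketch of the table written in that
style. The enum values are hypothetical stand-ins for the KVM_X86_*_VM
constants, and demo_vm_type_from_name() is an invented helper that is
not part of the patch; it only shows how such a table might be used to
map a name back to its index.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the KVM_X86_*_VM constants in the patch. */
enum demo_vm_type {
    DEMO_DEFAULT_VM = 0,
    DEMO_SW_PROTECTED_VM = 1,
};

/* The '*' sits next to the variable name rather than the type. */
static const char *demo_vm_type_name[] = {
    [DEMO_DEFAULT_VM] = "default",
    [DEMO_SW_PROTECTED_VM] = "sw-protected-vm",
};

/* Map a name back to its index; returns -1 when the name is unknown. */
static int demo_vm_type_from_name(const char *name)
{
    size_t n = sizeof(demo_vm_type_name) / sizeof(demo_vm_type_name[0]);
    size_t i;

    for (i = 0; i < n; i++) {
        /* Designated initializers can leave gaps, so skip NULL slots. */
        if (demo_vm_type_name[i] &&
            strcmp(demo_vm_type_name[i], name) == 0) {
            return (int)i;
        }
    }
    return -1;
}
```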

> +    [KVM_X86_DEFAULT_VM] = "default",
> +    [KVM_X86_SW_PROTECTED_VM] = "sw-protected-vm",
> +};
> +
> +int kvm_get_vm_type(MachineState *ms, const char *vm_type)
> +{
> +    int kvm_type = KVM_X86_DEFAULT_VM;
> +
> +    /*
> +     * old KVM doesn't support KVM_CAP_VM_TYPES and KVM_X86_DEFAULT_VM
> +     * is always supported
> +     */
> +    if (kvm_type == KVM_X86_DEFAULT_VM) {
> +        return kvm_type;
> +    }
> +
> +    if (!(kvm_check_extension(KVM_STATE(ms->accelerator), KVM_CAP_VM_TYPES) & BIT(kvm_type))) {
> +        error_report("vm-type %s not supported by KVM", vm_type_name[kvm_type]);
> +        exit(1);
> +    }
> +
> +    return kvm_type;
> +}
> +
>  bool kvm_has_smm(void)
>  {
>      return kvm_vm_check_extension(kvm_state, KVM_CAP_X86_SMM);
> diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
> index 30fedcffea3e..55fb25fa8e2e 100644
> --- a/target/i386/kvm/kvm_i386.h
> +++ b/target/i386/kvm/kvm_i386.h
> @@ -37,6 +37,7 @@ bool kvm_hv_vpindex_settable(void);
>  bool kvm_enable_sgx_provisioning(KVMState *s);
>  bool kvm_hyperv_expand_features(X86CPU *cpu, Error **errp);
>  
> +int kvm_get_vm_type(MachineState *ms, const char *vm_type);
>  void kvm_arch_reset_vcpu(X86CPU *cs);
>  void kvm_arch_after_reset_vcpu(X86CPU *cpu);
>  void kvm_arch_do_init_vcpu(X86CPU *cs);
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES
  2023-11-15  7:14 ` [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES Xiaoyao Li
@ 2023-11-15 10:54   ` Daniel P. Berrangé
  2023-12-07  7:18     ` Xiaoyao Li
  2023-11-17 21:18   ` Isaku Yamahata
  1 sibling, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 10:54 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:27AM -0500, Xiaoyao Li wrote:
> KVM provides TDX capabilities via sub command KVM_TDX_CAPABILITIES of
> IOCTL(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing
> TDX context. It will be used to validate user's setting later.
> 
> Since there is no interface reporting how many cpuid configs contains in
> KVM_TDX_CAPABILITIES, QEMU chooses to try starting with a known number
> and abort when it exceeds KVM_MAX_CPUID_ENTRIES.
> 
> Besides, introduce the interfaces to invoke TDX "ioctls" at different
> scope (KVM, VM and VCPU) in preparation.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename __tdx_ioctl() to tdx_ioctl_internal()
> - Pass errp in get_tdx_capabilities();
> 
> changes in v2:
>   - Make the error message more clear;
> 
> changes in v1:
>   - start from nr_cpuid_configs = 6 for the loop;
>   - stop the loop when nr_cpuid_configs exceeds KVM_MAX_CPUID_ENTRIES;
> ---
>  target/i386/kvm/kvm.c      |   2 -
>  target/i386/kvm/kvm_i386.h |   2 +
>  target/i386/kvm/tdx.c      | 102 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 103 insertions(+), 3 deletions(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 7abcdebb1452..28e60c5ea4a7 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -1687,8 +1687,6 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>  
>  static Error *invtsc_mig_blocker;
>  
> -#define KVM_MAX_CPUID_ENTRIES  100
> -
>  static void kvm_init_xsave(CPUX86State *env)
>  {
>      if (has_xsave2) {
> diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
> index 55fb25fa8e2e..c3ef46a97a7b 100644
> --- a/target/i386/kvm/kvm_i386.h
> +++ b/target/i386/kvm/kvm_i386.h
> @@ -13,6 +13,8 @@
>  
>  #include "sysemu/kvm.h"
>  
> +#define KVM_MAX_CPUID_ENTRIES  100
> +
>  #ifdef CONFIG_KVM
>  
>  #define kvm_pit_in_kernel() \
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index 621a05beeb4e..cb0040187b27 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -12,17 +12,117 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  #include "qapi/error.h"
>  #include "qom/object_interfaces.h"
> +#include "sysemu/kvm.h"
>  
>  #include "hw/i386/x86.h"
> +#include "kvm_i386.h"
>  #include "tdx.h"
>  
> +static struct kvm_tdx_capabilities *tdx_caps;
> +
> +enum tdx_ioctl_level{
> +    TDX_PLATFORM_IOCTL,
> +    TDX_VM_IOCTL,
> +    TDX_VCPU_IOCTL,
> +};
> +
> +static int tdx_ioctl_internal(void *state, enum tdx_ioctl_level level, int cmd_id,
> +                        __u32 flags, void *data)
> +{
> +    struct kvm_tdx_cmd tdx_cmd;

Add   ' = {}'  to initialize to all-zeros, avoiding the explicit
memset call
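
A minimal sketch of that suggestion, assuming a made-up structure:
struct demo_tdx_cmd is a hypothetical stand-in and its layout is not
taken from the KVM headers. Empty braces are accepted by GCC and Clang
and are standard in C23; '= {0}' is the strictly ISO C spelling for
older standards.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct kvm_tdx_cmd; real layout may differ. */
struct demo_tdx_cmd {
    uint32_t id;
    uint32_t flags;
    uint64_t data;
    uint64_t error;
};

static struct demo_tdx_cmd demo_make_cmd(uint32_t id, uint32_t flags,
                                         void *data)
{
    /* Empty braces zero every member, so no memset() call is needed. */
    struct demo_tdx_cmd cmd = {};

    cmd.id = id;
    cmd.flags = flags;
    cmd.data = (uint64_t)(uintptr_t)data;
    return cmd;
}
```

Members that are never assigned, such as 'error' here, stay zero.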

> +    int r;
> +
> +    memset(&tdx_cmd, 0x0, sizeof(tdx_cmd));
> +
> +    tdx_cmd.id = cmd_id;
> +    tdx_cmd.flags = flags;
> +    tdx_cmd.data = (__u64)(unsigned long)data;
> +
> +    switch (level) {
> +    case TDX_PLATFORM_IOCTL:
> +        r = kvm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
> +        break;
> +    case TDX_VM_IOCTL:
> +        r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
> +        break;
> +    case TDX_VCPU_IOCTL:
> +        r = kvm_vcpu_ioctl(state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
> +        break;
> +    default:
> +        error_report("Invalid tdx_ioctl_level %d", level);
> +        exit(1);
> +    }
> +
> +    return r;
> +}
> +
> +static inline int tdx_platform_ioctl(int cmd_id, __u32 flags, void *data)
> +{
> +    return tdx_ioctl_internal(NULL, TDX_PLATFORM_IOCTL, cmd_id, flags, data);
> +}
> +
> +static inline int tdx_vm_ioctl(int cmd_id, __u32 flags, void *data)
> +{
> +    return tdx_ioctl_internal(NULL, TDX_VM_IOCTL, cmd_id, flags, data);
> +}
> +
> +static inline int tdx_vcpu_ioctl(void *vcpu_fd, int cmd_id, __u32 flags,
> +                                 void *data)
> +{
> +    return  tdx_ioctl_internal(vcpu_fd, TDX_VCPU_IOCTL, cmd_id, flags, data);
> +}
> +
> +static int get_tdx_capabilities(Error **errp)
> +{
> +    struct kvm_tdx_capabilities *caps;
> +    /* 1st generation of TDX reports 6 cpuid configs */
> +    int nr_cpuid_configs = 6;
> +    size_t size;
> +    int r;
> +
> +    do {
> +        size = sizeof(struct kvm_tdx_capabilities) +
> +               nr_cpuid_configs * sizeof(struct kvm_tdx_cpuid_config);
> +        caps = g_malloc0(size);
> +        caps->nr_cpuid_configs = nr_cpuid_configs;
> +
> +        r = tdx_vm_ioctl(KVM_TDX_CAPABILITIES, 0, caps);
> +        if (r == -E2BIG) {
> +            g_free(caps);
> +            nr_cpuid_configs *= 2;
> +            if (nr_cpuid_configs > KVM_MAX_CPUID_ENTRIES) {
> +                error_setg(errp, "%s: KVM TDX seems broken that number of CPUID "
> +                           "entries in kvm_tdx_capabilities exceeds limit %d",
> +                           __func__, KVM_MAX_CPUID_ENTRIES);
> +                return r;
> +            }
> +        } else if (r < 0) {
> +            g_free(caps);
> +            error_setg_errno(errp, -r, "%s: KVM_TDX_CAPABILITIES failed", __func__);
> +            return r;
> +        }
> +    }
> +    while (r == -E2BIG);
> +
> +    tdx_caps = caps;
> +
> +    return 0;
> +}
> +
>  int tdx_kvm_init(MachineState *ms, Error **errp)
>  {
> +    int r = 0;
> +
>      ms->require_guest_memfd = true;
>  
> -    return 0;
> +    if (!tdx_caps) {
> +        r = get_tdx_capabilities(errp);
> +    }
> +
> +    return r;
>  }
>  
>  /* tdx guest */
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 26/70] i386/tdx: Initialize TDX before creating TD vcpus
  2023-11-15  7:14 ` [PATCH v3 26/70] i386/tdx: Initialize TDX before creating TD vcpus Xiaoyao Li
@ 2023-11-15 11:01   ` Daniel P. Berrangé
  2023-12-04  8:28     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 11:01 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:35AM -0500, Xiaoyao Li wrote:
> Invoke KVM_TDX_INIT in kvm_arch_pre_create_vcpu() that KVM_TDX_INIT
> configures global TD configurations, e.g. the canonical CPUID config,
> and must be executed prior to creating vCPUs.
> 
> Use kvm_x86_arch_cpuid() to setup the CPUID settings for TDX VM.
> 
> Note, this doesn't address the fact that QEMU may change the CPUID
> configuration when creating vCPUs, i.e. punts on refactoring QEMU to
> provide a stable CPUID config prior to kvm_arch_init().
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Acked-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
> Changes in v3:
> - Pass @errp in tdx_pre_create_vcpu() and pass error info to it. (Daniel)
> ---
>  accel/kvm/kvm-all.c        |  9 +++++++-
>  target/i386/kvm/kvm.c      |  9 ++++++++
>  target/i386/kvm/tdx-stub.c |  5 +++++
>  target/i386/kvm/tdx.c      | 45 ++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.h      |  4 ++++
>  5 files changed, 71 insertions(+), 1 deletion(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 6b5f4d62f961..a92fff471b58 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -441,8 +441,15 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>  
>      trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
>  
> +    /*
> +     * tdx_pre_create_vcpu() may call cpu_x86_cpuid(). It in turn may call
> +     * kvm_vm_ioctl(). Set cpu->kvm_state in advance to avoid NULL pointer
> +     * dereference.
> +     */
> +    cpu->kvm_state = s;
>      ret = kvm_arch_pre_create_vcpu(cpu, errp);
>      if (ret < 0) {
> +        cpu->kvm_state = NULL;
>          goto err;
>      }
>  
> @@ -450,11 +457,11 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>      if (ret < 0) {
>          error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
>                           kvm_arch_vcpu_id(cpu));
> +        cpu->kvm_state = NULL;
>          goto err;
>      }
>  
>      cpu->kvm_fd = ret;
> -    cpu->kvm_state = s;
>      cpu->vcpu_dirty = true;
>      cpu->dirty_pages = 0;
>      cpu->throttle_us_per_full = 0;
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index dafe4d262977..fc840653ceb6 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -2268,6 +2268,15 @@ int kvm_arch_init_vcpu(CPUState *cs)
>      return r;
>  }
>  
> +int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
> +{
> +    if (is_tdx_vm()) {
> +        return tdx_pre_create_vcpu(cpu, errp);
> +    }
> +
> +    return 0;
> +}
> +
>  int kvm_arch_destroy_vcpu(CPUState *cs)
>  {
>      X86CPU *cpu = X86_CPU(cs);
> diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
> index 1d866d5496bf..3877d432a397 100644
> --- a/target/i386/kvm/tdx-stub.c
> +++ b/target/i386/kvm/tdx-stub.c
> @@ -6,3 +6,8 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
>  {
>      return -EINVAL;
>  }
> +
> +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
> +{
> +    return -EINVAL;
> +}
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index 1f5d8117d1a9..122a37c93de3 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -467,6 +467,49 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
>      return 0;
>  }
>  
> +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
> +{
> +    MachineState *ms = MACHINE(qdev_get_machine());
> +    X86CPU *x86cpu = X86_CPU(cpu);
> +    CPUX86State *env = &x86cpu->env;
> +    struct kvm_tdx_init_vm *init_vm;

Mark this as auto-free to avoid the g_free() requirement

  g_autofree  struct kvm_tdx_init_vm *init_vm = NULL;

> +    int r = 0;
> +
> +    qemu_mutex_lock(&tdx_guest->lock);

   QEMU_LOCK_GUARD(&tdx_guest->lock);

to eliminate the mutex_unlock requirement, thus eliminating all
'goto' jumps and label targets, in favour of a plain 'return -1'
everywhere.
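
Both helpers rely on the compiler's cleanup attribute, so the shape of
the resulting function can be sketched without QEMU or GLib. Everything
below is an assumption made for illustration: struct demo_lock is a
counter standing in for the mutex, and DEMO_AUTOFREE / DEMO_LOCK_GUARD
are hypothetical macros that only mimic what g_autofree and
QEMU_LOCK_GUARD do.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a mutex: counts how often it is held. */
struct demo_lock {
    int held;
};

/* Runs automatically when the guard variable goes out of scope. */
static void demo_unlock_cleanup(struct demo_lock **l)
{
    if (*l) {
        (*l)->held--;
    }
}

/* Runs automatically when the annotated pointer goes out of scope. */
static void demo_free_cleanup(void *p)
{
    free(*(void **)p);
}

#define DEMO_AUTOFREE __attribute__((cleanup(demo_free_cleanup)))
#define DEMO_LOCK_GUARD(l)                                    \
    __attribute__((cleanup(demo_unlock_cleanup), unused))     \
    struct demo_lock *demo_guard_ = ((l)->held++, (l))

static int demo_init_once(struct demo_lock *lock, int *initialized,
                          int fail)
{
    /* Must start as NULL: the cleanup runs on every exit path. */
    DEMO_AUTOFREE char *buf = NULL;
    DEMO_LOCK_GUARD(lock);

    if (*initialized) {
        return 0;
    }

    buf = calloc(1, 64);
    if (!buf || fail) {
        return -1;      /* no goto: unlock and free happen implicitly */
    }

    *initialized = 1;
    return 0;
}
```

Every return path releases the lock and the buffer, which is why the
labels and goto jumps are no longer needed.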

> +    if (tdx_guest->initialized) {
> +        goto out;
> +    }
> +
> +    init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
> +                        sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES);
> +
> +    r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
> +    if (r < 0) {
> +        error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
> +        goto out_free;
> +    }
> +
> +    init_vm->cpuid.nent = kvm_x86_arch_cpuid(env, init_vm->cpuid.entries, 0);
> +
> +    init_vm->attributes = tdx_guest->attributes;
> +
> +    do {
> +        r = tdx_vm_ioctl(KVM_TDX_INIT_VM, 0, init_vm);
> +    } while (r == -EAGAIN);
> +    if (r < 0) {
> +        error_setg_errno(errp, -r, "KVM_TDX_INIT_VM failed");
> +        goto out_free;
> +    }
> +
> +    tdx_guest->initialized = true;
> +
> +out_free:
> +    g_free(init_vm);
> +out:
> +    qemu_mutex_unlock(&tdx_guest->lock);
> +    return r;
> +}
> +
>  /* tdx guest */
>  OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
>                                     tdx_guest,
> @@ -479,6 +522,8 @@ static void tdx_guest_init(Object *obj)
>  {
>      TdxGuest *tdx = TDX_GUEST(obj);
>  
> +    qemu_mutex_init(&tdx->lock);
> +
>      tdx->attributes = 0;
>  }
>  
> diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
> index 06599b65b827..432077723ac5 100644
> --- a/target/i386/kvm/tdx.h
> +++ b/target/i386/kvm/tdx.h
> @@ -17,6 +17,9 @@ typedef struct TdxGuestClass {
>  typedef struct TdxGuest {
>      ConfidentialGuestSupport parent_obj;
>  
> +    QemuMutex lock;
> +
> +    bool initialized;
>      uint64_t attributes;    /* TD attributes */
>  } TdxGuest;
>  
> @@ -29,5 +32,6 @@ bool is_tdx_vm(void);
>  int tdx_kvm_init(MachineState *ms, Error **errp);
> void tdx_get_supported_cpuid(uint32_t function, uint32_t index, int reg,
>                               uint32_t *ret);
> +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp);
>  
>  #endif /* QEMU_I386_TDX_H */
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  2023-11-15  7:14 ` [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM Xiaoyao Li
@ 2023-11-15 17:32   ` Daniel P. Berrangé
  2023-12-01 11:00   ` Markus Armbruster
  1 sibling, 0 replies; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 17:32 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:14:40AM -0500, Xiaoyao Li wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
> can be provided for TDX attestation.
> 
> So far they were hard coded as 0. Now allow user to specify those values
> via property mrconfigid, mrowner and mrownerconfig. They are all in
> base64 format.
> 
> example
> -object tdx-guest, \
>   mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>   mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>   mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
>  - use base64 encoding instead of hex-string;
> ---
>  qapi/qom.json         | 11 +++++-
>  target/i386/kvm/tdx.c | 85 +++++++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.h |  3 ++
>  3 files changed, 98 insertions(+), 1 deletion(-)
> 
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 3a29659e0155..fd99aa1ff8cc 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -888,10 +888,19 @@
>  #     pages.  Some guest OS (e.g., Linux TD guest) may require this to
>  #     be set, otherwise they refuse to boot.
>  #
> +# @mrconfigid: base64 encoded MRCONFIGID SHA384 digest
> +#
> +# @mrowner: base64 encoded MROWNER SHA384 digest
> +#
> +# @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
> +#
>  # Since: 8.2
>  ##
>  { 'struct': 'TdxGuestProperties',
> -  'data': { '*sept-ve-disable': 'bool' } }
> +  'data': { '*sept-ve-disable': 'bool',
> +            '*mrconfigid': 'str',
> +            '*mrowner': 'str',
> +            '*mrownerconfig': 'str' } }
>  
>  ##
>  # @ThreadContextProperties:
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index 28b3c2765c86..b70efbcab738 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -13,6 +13,7 @@
>  
>  #include "qemu/osdep.h"
>  #include "qemu/error-report.h"
> +#include "qemu/base64.h"
>  #include "qapi/error.h"
>  #include "qom/object_interfaces.h"
>  #include "standard-headers/asm-x86/kvm_para.h"
> @@ -508,6 +509,8 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
>      X86CPU *x86cpu = X86_CPU(cpu);
>      CPUX86State *env = &x86cpu->env;
>      struct kvm_tdx_init_vm *init_vm;
> +    uint8_t *data;
> +    size_t data_len;

Don't declare these here.

>      int r = 0;
>  
>      qemu_mutex_lock(&tdx_guest->lock);
> @@ -518,6 +521,38 @@ int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
>      init_vm = g_malloc0(sizeof(struct kvm_tdx_init_vm) +
>                          sizeof(struct kvm_cpuid_entry2) * KVM_MAX_CPUID_ENTRIES);
>  
> +#define SHA384_DIGEST_SIZE  48
> +
> +    if (tdx_guest->mrconfigid) {

> +        data = qbase64_decode(tdx_guest->mrconfigid,
> +                              strlen(tdx_guest->mrconfigid), &data_len, errp);

Declare it here:

    g_autofree uint8_t *data = qbase64_decode(...)


so we avoid the memory leak of 'data' in each if()... block
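
For illustration only, a self-contained sketch of the scoped-cleanup
pattern. In QEMU the declaration would use GLib's g_autofree; here a
hypothetical AUTOFREE macro built on the GCC/Clang cleanup attribute
stands in for it, and decode_stub() is a made-up placeholder for
qbase64_decode(), so nothing below is QEMU API:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DIGEST_SIZE 48          /* SHA-384 digest length in bytes */

static int free_count;          /* instrumentation for this sketch only */

/* Cleanup handler: receives a pointer to the annotated variable. */
static void autofree_cb(void *p)
{
    void **pp = p;

    if (*pp) {
        free_count++;
    }
    free(*pp);
}

/* Hypothetical stand-in for g_autofree (GCC/Clang extension). */
#define AUTOFREE __attribute__((cleanup(autofree_cb)))

/* Placeholder decoder: copies the input into a freshly allocated buffer. */
static uint8_t *decode_stub(const char *in, size_t *out_len)
{
    uint8_t *buf;

    *out_len = strlen(in);
    buf = malloc(*out_len ? *out_len : 1);
    if (buf) {
        memcpy(buf, in, *out_len);
    }
    return buf;
}

/* Returns 0 on success, -1 on failure; the buffer is freed on every path. */
static int load_digest(const char *encoded, uint8_t dst[DIGEST_SIZE])
{
    size_t len;
    AUTOFREE uint8_t *data = decode_stub(encoded, &len);

    if (!data || len != DIGEST_SIZE) {
        return -1;              /* early return, no explicit free needed */
    }
    memcpy(dst, data, len);
    return 0;
}
```

Because the variable is scoped to the function (or to each if block),
the early "return -1" no longer needs a matching free call.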


> +        if (!data || data_len != SHA384_DIGEST_SIZE) {
> +            error_setg(errp, "TDX: failed to decode mrconfigid");
> +            return -1;
> +        }
> +        memcpy(init_vm->mrconfigid, data, data_len);
> +    }
> +
> +    if (tdx_guest->mrowner) {
> +        data = qbase64_decode(tdx_guest->mrowner,
> +                              strlen(tdx_guest->mrowner), &data_len, errp);
> +        if (!data || data_len != SHA384_DIGEST_SIZE) {
> +            error_setg(errp, "TDX: failed to decode mrowner");
> +            return -1;
> +        }
> +        memcpy(init_vm->mrowner, data, data_len);
> +    }
> +
> +    if (tdx_guest->mrownerconfig) {
> +        data = qbase64_decode(tdx_guest->mrownerconfig,
> +                              strlen(tdx_guest->mrownerconfig), &data_len, errp);
> +        if (!data || data_len != SHA384_DIGEST_SIZE) {
> +            error_setg(errp, "TDX: failed to decode mrownerconfig");
> +            return -1;
> +        }
> +        memcpy(init_vm->mrownerconfig, data, data_len);
> +    }
> +
>      r = kvm_vm_enable_cap(kvm_state, KVM_CAP_MAX_VCPUS, 0, ms->smp.cpus);
>      if (r < 0) {
>          error_setg(errp, "Unable to set MAX VCPUS to %d", ms->smp.cpus);
> @@ -567,6 +602,48 @@ static void tdx_guest_set_sept_ve_disable(Object *obj, bool value, Error **errp)
>      }
>  }
> +static void tdx_guest_set_mrconfigid(Object *obj, const char *value, Error **errp)
> +{
> +    TdxGuest *tdx = TDX_GUEST(obj);
> +
> +    tdx->mrconfigid = g_strdup(value);
> +}

g_free(tdx->mrconfigid) first to be sure we don't leak if
the value is set twice.
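
A minimal sketch of a setter that does not leak on a second call,
written in plain C so it stands alone. struct guest_props and
dup_string() are hypothetical stand-ins for TdxGuest and g_strdup();
in QEMU the calls would be g_free() and g_strdup():

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical reduced object holding one string property. */
struct guest_props {
    char *mrconfigid;
};

/* Stand-in for g_strdup(): returns NULL for NULL input. */
static char *dup_string(const char *s)
{
    size_t n;
    char *copy;

    if (!s) {
        return NULL;
    }
    n = strlen(s) + 1;
    copy = malloc(n);
    if (copy) {
        memcpy(copy, s, n);
    }
    return copy;
}

static void set_mrconfigid(struct guest_props *p, const char *value)
{
    /* Copy first so that passing the current value back in stays safe. */
    char *copy = dup_string(value);

    free(p->mrconfigid);        /* free(NULL) is a no-op on the first set */
    p->mrconfigid = copy;
}
```

The same shape would apply to the mrowner and mrownerconfig setters,
each freeing and assigning its own field.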

> +
> +static char * tdx_guest_get_mrowner(Object *obj, Error **errp)
> +{
> +    TdxGuest *tdx = TDX_GUEST(obj);
> +
> +    return g_strdup(tdx->mrowner);
> +}
> +
> +static void tdx_guest_set_mrowner(Object *obj, const char *value, Error **errp)
> +{
> +    TdxGuest *tdx = TDX_GUEST(obj);
> +
> +    tdx->mrconfigid = g_strdup(value);
> +}
> +
> +static char * tdx_guest_get_mrownerconfig(Object *obj, Error **errp)
> +{
> +    TdxGuest *tdx = TDX_GUEST(obj);
> +
> +    return g_strdup(tdx->mrownerconfig);
> +}
> +
> +static void tdx_guest_set_mrownerconfig(Object *obj, const char *value, Error **errp)
> +{
> +    TdxGuest *tdx = TDX_GUEST(obj);
> +
> +    tdx->mrconfigid = g_strdup(value);
> +}
> +
>  /* tdx guest */
>  OBJECT_DEFINE_TYPE_WITH_INTERFACES(TdxGuest,
>                                     tdx_guest,
> @@ -586,6 +663,14 @@ static void tdx_guest_init(Object *obj)
>      object_property_add_bool(obj, "sept-ve-disable",
>                               tdx_guest_get_sept_ve_disable,
>                               tdx_guest_set_sept_ve_disable);
> +    object_property_add_str(obj, "mrconfigid",
> +                            tdx_guest_get_mrconfigid,
> +                            tdx_guest_set_mrconfigid);
> +    object_property_add_str(obj, "mrowner",
> +                            tdx_guest_get_mrowner, tdx_guest_set_mrowner);
> +    object_property_add_str(obj, "mrownerconfig",
> +                            tdx_guest_get_mrownerconfig,
> +                            tdx_guest_set_mrownerconfig);
>  }
>  
>  static void tdx_guest_finalize(Object *obj)
> diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
> index 432077723ac5..6e39ef3bac13 100644
> --- a/target/i386/kvm/tdx.h
> +++ b/target/i386/kvm/tdx.h
> @@ -21,6 +21,9 @@ typedef struct TdxGuest {
>  
>      bool initialized;
>      uint64_t attributes;    /* TD attributes */
> +    char *mrconfigid;       /* base64 encoded sha384 digest */
> +    char *mrowner;          /* base64 encoded sha384 digest */
> +    char *mrownerconfig;    /* base64 encoded sha384 digest */
>  } TdxGuest;
>  
>  #ifdef CONFIG_TDX
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-11-15  7:15 ` [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote> Xiaoyao Li
@ 2023-11-15 17:51   ` Daniel P. Berrangé
  2023-11-15 17:58   ` Daniel P. Berrangé
                     ` (2 subsequent siblings)
  3 siblings, 0 replies; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 17:51 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For GetQuote, delegate a request to Quote Generation Service.
> Add property "quote-generation-socket" to tdx-guest, which is a property
> of type SocketAddress to specify Quote Generation Service(QGS).
> 
> On request, connect to the QGS, read request buffer from shared guest
> memory, send the request buffer to the server and store the response
> into shared guest memory and notify TD guest by interrupt.
> 
> command line example:
>   qemu-system-x86_64 \
>     -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>     -machine confidential-guest-support=tdx0
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename property "quote-generation-service" to "quote-generation-socket";
> - change the type of "quote-generation-socket" from str to
>   SocketAddress;
> - squash next patch into this one;
> ---
>  qapi/qom.json         |   5 +-
>  target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.h |   6 +
>  3 files changed, 440 insertions(+), 1 deletion(-)

> @@ -969,6 +1001,7 @@ static void tdx_guest_class_init(ObjectClass *oc, void *data)
>  {
>  }
>  
> +#define TDG_VP_VMCALL_GET_QUOTE                         0x10002ULL
>  #define TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT      0x10004ULL
>  
>  #define TDG_VP_VMCALL_SUCCESS           0x0000000000000000ULL
> @@ -977,6 +1010,400 @@ static void tdx_guest_class_init(ObjectClass *oc, void *data)
>  #define TDG_VP_VMCALL_GPA_INUSE         0x8000000000000001ULL
>  #define TDG_VP_VMCALL_ALIGN_ERROR       0x8000000000000002ULL
>  
> +#define TDX_GET_QUOTE_STRUCTURE_VERSION 1ULL
> +
> +#define TDX_VP_GET_QUOTE_SUCCESS                0ULL
> +#define TDX_VP_GET_QUOTE_IN_FLIGHT              (-1ULL)
> +#define TDX_VP_GET_QUOTE_ERROR                  0x8000000000000000ULL
> +#define TDX_VP_GET_QUOTE_QGS_UNAVAILABLE        0x8000000000000001ULL
> +
> +/* Limit to avoid resource starvation. */
> +#define TDX_GET_QUOTE_MAX_BUF_LEN       (128 * 1024)
> +#define TDX_MAX_GET_QUOTE_REQUEST       16
> +
> +/* Format of pages shared with guest. */
> +struct tdx_get_quote_header {
> +    /* Format version: must be 1 in little endian. */
> +    uint64_t structure_version;
> +
> +    /*
> +     * GetQuote status code in little endian:
> +     *   Guest must set error_code to 0 to avoid information leak.
> +     *   Qemu sets this before interrupting guest.
> +     */
> +    uint64_t error_code;
> +
> +    /*
> +     * in-message size in little endian: The message will follow this header.
> +     * The in-message will be send to QGS.
> +     */
> +    uint32_t in_len;
> +
> +    /*
> +     * out-message size in little endian:
> +     * On request, out_len must be zero to avoid information leak.
> +     * On return, message size from QGS. Qemu overwrites this field.
> +     * The message will follows this header.  The in-message is overwritten.
> +     */
> +    uint32_t out_len;
> +
> +    /*
> +     * Message buffer follows.
> +     * Guest sets message that will be send to QGS.  If out_len > in_len, guest
> +     * should zero remaining buffer to avoid information leak.
> +     * Qemu overwrites this buffer with a message returned from QGS.
> +     */
> +};
> +
> +static hwaddr tdx_shared_bit(X86CPU *cpu)
> +{
> +    return (cpu->phys_bits > 48) ? BIT_ULL(51) : BIT_ULL(47);
> +}
> +
> +struct tdx_get_quote_task {
> +    uint32_t apic_id;
> +    hwaddr gpa;
> +    uint64_t buf_len;
> +    char *out_data;
> +    uint64_t out_len;
> +    struct tdx_get_quote_header hdr;
> +    int event_notify_interrupt;
> +    QIOChannelSocket *ioc;
> +};
> +
> +struct x86_msi {
> +    union {
> +        struct {
> +            uint32_t    reserved_0              : 2,
> +                        dest_mode_logical       : 1,
> +                        redirect_hint           : 1,
> +                        reserved_1              : 1,
> +                        virt_destid_8_14        : 7,
> +                        destid_0_7              : 8,
> +                        base_address            : 12;
> +        } QEMU_PACKED x86_address_lo;
> +        uint32_t address_lo;
> +    };
> +    union {
> +        struct {
> +            uint32_t    reserved        : 8,
> +                        destid_8_31     : 24;
> +        } QEMU_PACKED x86_address_hi;
> +        uint32_t address_hi;
> +    };
> +    union {
> +        struct {
> +            uint32_t    vector                  : 8,
> +                        delivery_mode           : 3,
> +                        dest_mode_logical       : 1,
> +                        reserved                : 2,
> +                        active_low              : 1,
> +                        is_level                : 1;
> +        } QEMU_PACKED x86_data;
> +        uint32_t data;
> +    };
> +};
> +
> +static void tdx_td_notify(struct tdx_get_quote_task *t)
> +{
> +    struct x86_msi x86_msi;
> +    struct kvm_msi msi;
> +    int ret;
> +
> +    /* It is optional for host VMM to interrupt TD. */
> +    if(!(32 <= t->event_notify_interrupt && t->event_notify_interrupt <= 255))
> +        return;
> +
> +    x86_msi = (struct x86_msi) {
> +        .x86_address_lo  = {
> +            .reserved_0 = 0,
> +            .dest_mode_logical = 0,
> +            .redirect_hint = 0,
> +            .reserved_1 = 0,
> +            .virt_destid_8_14 = 0,
> +            .destid_0_7 = t->apic_id & 0xff,
> +        },
> +        .x86_address_hi = {
> +            .reserved = 0,
> +            .destid_8_31 = t->apic_id >> 8,
> +        },
> +        .x86_data = {
> +            .vector = t->event_notify_interrupt,
> +            .delivery_mode = APIC_DM_FIXED,
> +            .dest_mode_logical = 0,
> +            .reserved = 0,
> +            .active_low = 0,
> +            .is_level = 0,
> +        },
> +    };
> +    msi = (struct kvm_msi) {
> +        .address_lo = x86_msi.address_lo,
> +        .address_hi = x86_msi.address_hi,
> +        .data = x86_msi.data,
> +        .flags = 0,
> +        .devid = 0,
> +    };
> +    ret = kvm_vm_ioctl(kvm_state, KVM_SIGNAL_MSI, &msi);
> +    if (ret < 0) {
> +        /* In this case, no better way to tell it to guest.  Log it. */
> +        error_report("TDX: injection %d failed, interrupt lost (%s).\n",
> +                     t->event_notify_interrupt, strerror(-ret));
> +    }
> +}
> +
> +static void tdx_get_quote_read(void *opaque)
> +{
> +    struct tdx_get_quote_task *t = opaque;
> +    ssize_t size = 0;
> +    Error *err = NULL;

This error is set, but never read and more importantly
never freed.  If you're not going to use it just pass
NULL to the methods, otherwise use error_report_err to
print and free it.

> +    MachineState *ms;
> +    TdxGuest *tdx;
> +
> +    while (true) {
> +        char *buf;
> +        size_t buf_size;
> +
> +        if (t->out_len < t->buf_len) {
> +            buf = t->out_data + t->out_len;
> +            buf_size = t->buf_len - t->out_len;
> +        } else {
> +            /*
> +             * The received data is too large to fit in the shared GPA.
> +             * Discard the received data and try to know the data size.
> +             */
> +            buf = t->out_data;
> +            buf_size = t->buf_len;
> +        }
> +
> +        size = qio_channel_read(QIO_CHANNEL(t->ioc), buf, buf_size, &err);
> +        if (!size) {
> +            break;
> +        }
> +
> +        if (size < 0) {
> +            if (size == QIO_CHANNEL_ERR_BLOCK) {
> +                return;
> +            } else {
> +                break;
> +            }
> +        }
> +        t->out_len += size;
> +    }
> +    /*
> +     * If partial read successfully but return error at last, also treat it
> +     * as failure.
> +     */
> +    if (size < 0) {
> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> +        goto error;
> +    }
> +    if (t->out_len > 0 && t->out_len > t->buf_len) {
> +        /*
> +         * There is no specific error code defined for this case(E2BIG) at the
> +         * moment.
> +         * TODO: Once an error code for this case is defined in GHCI spec ,
> +         * update the error code.
> +         */
> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
> +        t->hdr.out_len = cpu_to_le32(t->out_len);
> +        goto error_hdr;
> +    }
> +
> +    if (address_space_write(
> +            &address_space_memory, t->gpa + sizeof(t->hdr),
> +            MEMTXATTRS_UNSPECIFIED, t->out_data, t->out_len) != MEMTX_OK) {
> +        goto error;
> +    }
> +    /*
> +     * Even if out_len == 0, it's a success.  It's up to the QGS-client contract
> +     * how to interpret the zero-sized message as return message.
> +     */
> +    t->hdr.out_len = cpu_to_le32(t->out_len);
> +    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS);
> +
> +error:
> +    if (t->hdr.error_code != cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS)) {
> +        t->hdr.out_len = cpu_to_le32(0);
> +    }
> +error_hdr:
> +    if (address_space_write(
> +            &address_space_memory, t->gpa,
> +            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
> +        error_report("TDX: failed to update GetQuote header.");
> +    }
> +    tdx_td_notify(t);
> +
> +    qemu_set_fd_handler(t->ioc->fd, NULL, NULL, NULL);
> +    qio_channel_close(QIO_CHANNEL(t->ioc), &err);

Likely overwriting a previously set 'err'

> +    object_unref(OBJECT(t->ioc));
> +    g_free(t->out_data);
> +    g_free(t);
> +
> +    /* Maintain the number of in-flight requests. */
> +    ms = MACHINE(qdev_get_machine());
> +    tdx = TDX_GUEST(ms->cgs);
> +    qemu_mutex_lock(&tdx->lock);
> +    tdx->quote_generation_num--;
> +    qemu_mutex_unlock(&tdx->lock);
> +}
> +
> +/*
> + * TODO: If QGS doesn't reply for long time, make it an error and interrupt
> + * guest.
> + */
> +static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
> +{
> +    struct tdx_get_quote_task *t = opaque;
> +    Error *err = NULL;

Same leak problem in this method

> +    char *in_data = NULL;

g_autofree for simpler cleanup

> +    MachineState *ms;
> +    TdxGuest *tdx;
> +
> +    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
> +    if (qio_task_propagate_error(task, NULL)) {
> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> +        goto error;
> +    }
> +
> +    in_data = g_malloc(le32_to_cpu(t->hdr.in_len));

If 't->hdr.in_len' comes from the guest then it needs bounds
checking, otherwise it's a trivial denial of service to make
QEMU allocate all of host RAM.
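
As a rough sketch of the kind of check meant here, assuming the 128 KiB
cap from TDX_GET_QUOTE_MAX_BUF_LEN in this patch and a 24-byte header
(8 + 8 + 4 + 4). The function name and the decision to subtract the
header size from the buffer length are assumptions of this sketch, not
something the patch does:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define QUOTE_MAX_BUF_LEN (128 * 1024)  /* same cap as the quoted patch */
#define QUOTE_HDR_SIZE    24u           /* 8 + 8 + 4 + 4 bytes of header */

/*
 * Validate guest-supplied lengths before any allocation is sized from
 * them.  buf_len is the whole shared buffer, in_len the request payload
 * that follows the header.
 */
static bool quote_in_len_ok(uint64_t buf_len, uint32_t in_len)
{
    if (buf_len < QUOTE_HDR_SIZE || buf_len > QUOTE_MAX_BUF_LEN) {
        return false;
    }
    /* The payload must fit in what is left after the header. */
    return in_len <= buf_len - QUOTE_HDR_SIZE;
}
```

Doing the check right next to the allocation keeps the bound visible
even if the caller's validation is later moved or changed.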

> +    if (!in_data) {
> +        goto error;
> +    }
> +
> +    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
> +                           MEMTXATTRS_UNSPECIFIED, in_data,
> +                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
> +        goto error;
> +    }
> +
> +    qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
> +
> +    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
> +                              le32_to_cpu(t->hdr.in_len), &err) ||
> +        err) {
> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> +        goto error;
> +    }
> +
> +    g_free(in_data);
> +    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);

Don't use qemu_set_fd_handler() with QIOChannel objects.
qio_channel_add_watch() is the API for dealing with event
callbacks.

> +
> +    return;
> +error:
> +    t->hdr.out_len = cpu_to_le32(0);
> +
> +    if (address_space_write(
> +            &address_space_memory, t->gpa,
> +            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
> +        error_report("TDX: failed to update GetQuote header.\n");
> +    }
> +    tdx_td_notify(t);
> +
> +    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
> +    object_unref(OBJECT(t->ioc));
> +    g_free(t);
> +    g_free(in_data);
> +
> +    /* Maintain the number of in-flight requests. */
> +    ms = MACHINE(qdev_get_machine());
> +    tdx = TDX_GUEST(ms->cgs);
> +    qemu_mutex_lock(&tdx->lock);
> +    tdx->quote_generation_num--;
> +    qemu_mutex_unlock(&tdx->lock);
> +    return;
> +}
> +
> +static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
> +{
> +    hwaddr gpa = vmcall->in_r12;
> +    uint64_t buf_len = vmcall->in_r13;
> +    struct tdx_get_quote_header hdr;
> +    MachineState *ms;
> +    TdxGuest *tdx;
> +    QIOChannelSocket *ioc;
> +    struct tdx_get_quote_task *t;
> +
> +    vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
> +
> +    /* GPA must be shared. */
> +    if (!(gpa & tdx_shared_bit(cpu))) {
> +        return;
> +    }
> +    gpa &= ~tdx_shared_bit(cpu);
> +
> +    if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(buf_len, 4096)) {
> +        vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
> +        return;
> +    }
> +    if (buf_len == 0) {
> +        return;
> +    }
> +
> +    if (address_space_read(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
> +                           &hdr, sizeof(hdr)) != MEMTX_OK) {
> +        return;
> +    }
> +    if (le64_to_cpu(hdr.structure_version) != TDX_GET_QUOTE_STRUCTURE_VERSION) {
> +        return;
> +    }
> +    /*
> +     * Paranoid: Guest should clear error_code and out_len to avoid information
> +     * leak.  Enforce it.  The initial value of them doesn't matter for qemu to
> +     * process the request.
> +     */
> +    if (le64_to_cpu(hdr.error_code) != TDX_VP_GET_QUOTE_SUCCESS ||
> +        le32_to_cpu(hdr.out_len) != 0) {
> +        return;
> +    }
> +
> +    /* Only safe-guard check to avoid too large buffer size. */
> +    if (buf_len > TDX_GET_QUOTE_MAX_BUF_LEN ||
> +        le32_to_cpu(hdr.in_len) > TDX_GET_QUOTE_MAX_BUF_LEN ||
> +        le32_to_cpu(hdr.in_len) > buf_len) {
> +        return;
> +    }
> +
> +    /* Mark the buffer in-flight. */
> +    hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_IN_FLIGHT);
> +    if (address_space_write(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
> +                            &hdr, sizeof(hdr)) != MEMTX_OK) {
> +        return;
> +    }
> +
> +    ms = MACHINE(qdev_get_machine());
> +    tdx = TDX_GUEST(ms->cgs);
> +    ioc = qio_channel_socket_new();
> +
> +    t = g_malloc(sizeof(*t));
> +    t->apic_id = tdx->event_notify_apic_id;
> +    t->gpa = gpa;
> +    t->buf_len = buf_len;
> +    t->out_data = g_malloc(t->buf_len);
> +    t->out_len = 0;
> +    t->hdr = hdr;
> +    t->ioc = ioc;
> +
> +    qemu_mutex_lock(&tdx->lock);
> +    if (!tdx->quote_generation ||
> +        /* Prevent too many in-flight get-quote request. */
> +        tdx->quote_generation_num >= TDX_MAX_GET_QUOTE_REQUEST) {
> +        qemu_mutex_unlock(&tdx->lock);
> +        vmcall->status_code = TDG_VP_VMCALL_RETRY;
> +        object_unref(OBJECT(ioc));
> +        g_free(t->out_data);
> +        g_free(t);
> +        return;
> +    }
> +    tdx->quote_generation_num++;
> +    t->event_notify_interrupt = tdx->event_notify_interrupt;
> +    qio_channel_socket_connect_async(
> +        ioc, tdx->quote_generation, tdx_handle_get_quote_connected, t, NULL,
> +        NULL);
> +    qemu_mutex_unlock(&tdx->lock);
> +
> +    vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
> +}
> +
>  static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
>                                                      struct kvm_tdx_vmcall *vmcall)
>  {
> @@ -1005,6 +1432,9 @@ static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
>      }
>  
>      switch (vmcall->subfunction) {
> +    case TDG_VP_VMCALL_GET_QUOTE:
> +        tdx_handle_get_quote(cpu, vmcall);
> +        break;
>      case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
>          tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
>          break;
> diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
> index 4a8d67cc9fdb..4a989805493e 100644
> --- a/target/i386/kvm/tdx.h
> +++ b/target/i386/kvm/tdx.h
> @@ -5,8 +5,10 @@
>  #include CONFIG_DEVICES /* CONFIG_TDX */
>  #endif
>  
> +#include <linux/kvm.h>
>  #include "exec/confidential-guest-support.h"
>  #include "hw/i386/tdvf.h"
> +#include "io/channel-socket.h"
>  #include "sysemu/kvm.h"
>  
>  #define TYPE_TDX_GUEST "tdx-guest"
> @@ -47,6 +49,10 @@ typedef struct TdxGuest {
>      /* runtime state */
>      int event_notify_interrupt;
>      uint32_t event_notify_apic_id;
> +
> +    /* GetQuote */
> +    int quote_generation_num;
> +    SocketAddress *quote_generation;
>  } TdxGuest;

IMHO all the quote generation logic would benefit from being split
out into completely separate, self-contained files,

eg 'tdx-quote-generation.{c,h}'

This should define an object "TdxQuoteGenerator" which holds the
two fields quote_generation_num and quote_generation, and exposes
a high level API for each command taking inputs & outputs,
and doing serialization to/from the socket.  This API should do
verification of all command inputs, eg the length field, to prevent
guest denial of service.

The tdx_handle_get_quote() method could then call into this API.

This will give us clean separation between interaction with guest
memory, and interaction with the socket.
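
A very reduced sketch of what the in-flight accounting half of such an
object might look like. The names quote_generator,
quote_generator_begin() and quote_generator_end() are hypothetical, the
limit mirrors TDX_MAX_GET_QUOTE_REQUEST from the patch, and locking is
left out (real code would still need to hold a mutex around the
counter):

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_IN_FLIGHT 16   /* mirrors TDX_MAX_GET_QUOTE_REQUEST */

/* Hypothetical object owning the request accounting. */
struct quote_generator {
    int in_flight;
};

/* Reserve a slot; returns false when the caller should ask for a retry. */
static bool quote_generator_begin(struct quote_generator *g)
{
    if (g->in_flight >= MAX_IN_FLIGHT) {
        return false;
    }
    g->in_flight++;
    return true;
}

/* Release a slot once the reply was delivered or the request failed. */
static void quote_generator_end(struct quote_generator *g)
{
    assert(g->in_flight > 0);
    g->in_flight--;
}
```

With this shape tdx_handle_get_quote() would only map a false return
onto TDG_VP_VMCALL_RETRY, and the two completion paths would share one
release call instead of each open-coding the decrement.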

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15  7:14 ` [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd Xiaoyao Li
  2023-11-15 10:20   ` Daniel P. Berrangé
@ 2023-11-15 17:54   ` David Hildenbrand
  2023-11-16  2:45     ` Xiaoyao Li
  2023-11-17 20:35   ` Isaku Yamahata
  2023-11-20  9:24   ` David Hildenbrand
  3 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-15 17:54 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 15.11.23 08:14, Xiaoyao Li wrote:
> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
> and kvm guest memfd based private memory can be associated in one RAMBlock.
> 
> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
> create private guest_memfd during RAMBlock setup.
> 
> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
> confidential guests, such as TDX VM. How and when to set it for memory
> backends will be implemented in the following patches.

Can you elaborate (and add to the patch description if there is good 
reason) why we need that flag and why we cannot simply rely on the VM 
type instead to decide whether to allocate a guest_memfd or not?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-11-15  7:15 ` [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote> Xiaoyao Li
  2023-11-15 17:51   ` Daniel P. Berrangé
@ 2023-11-15 17:58   ` Daniel P. Berrangé
  2023-12-29  2:30     ` Xiaoyao Li
  2023-12-01 11:02   ` Markus Armbruster
  2023-12-21 11:05   ` Daniel P. Berrangé
  3 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 17:58 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For GetQuote, delegate a request to Quote Generation Service.
> Add property "quote-generation-socket" to tdx-guest, which is a property
> of type SocketAddress to specify Quote Generation Service(QGS).
> 
> On request, connect to the QGS, read request buffer from shared guest
> memory, send the request buffer to the server and store the response
> into shared guest memory and notify TD guest by interrupt.
> 
> command line example:
>   qemu-system-x86_64 \
>     -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>     -machine confidential-guest-support=tdx0
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename property "quote-generation-service" to "quote-generation-socket";
> - change the type of "quote-generation-socket" from str to
>   SocketAddress;
> - squash next patch into this one;
> ---
>  qapi/qom.json         |   5 +-
>  target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.h |   6 +
>  3 files changed, 440 insertions(+), 1 deletion(-)
> 
> +static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
> +{
> +    struct tdx_get_quote_task *t = opaque;
> +    Error *err = NULL;
> +    char *in_data = NULL;
> +    MachineState *ms;
> +    TdxGuest *tdx;
> +
> +    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
> +    if (qio_task_propagate_error(task, NULL)) {
> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> +        goto error;
> +    }
> +
> +    in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
> +    if (!in_data) {
> +        goto error;
> +    }
> +
> +    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
> +                           MEMTXATTRS_UNSPECIFIED, in_data,
> +                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
> +        goto error;
> +    }
> +
> +    qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);

You've set the channel to non-blocking, but....

> +
> +    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
> +                              le32_to_cpu(t->hdr.in_len), &err) ||
> +        err) {

...this method will block execution of this thread, by either
sleeping in poll() or doing a coroutine yield.

I don't think this is in coroutine context, so presumably this
is just blocking.  So what was the point in marking the channel
non-blocking?

You are setting up a background watch to wait for the reply
so we don't block this thread, so you seem to want non-blocking
behaviour.

Given this, you should not be using qio_channel_write_all()
most likely. I think you need to be using qio_channel_add_watch
to get notified when it is *writable*, to send 'in_data'
incrementally & non-blocking. When that is finished then create
another watch to wait for the reply.
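
To make the suggestion concrete, a standalone sketch of the incremental
send loop. It does not use the QIOChannel API: write_fn, WOULD_BLOCK and
the mock sink are all hypothetical, chosen so the logic can be read and
run on its own. In QEMU, send_step() would presumably be driven from the
callback registered via qio_channel_add_watch() for the writable
condition:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

enum send_status { SEND_DONE, SEND_AGAIN, SEND_FAILED };

#define WOULD_BLOCK (-2)   /* hypothetical "try again later" return value */

/* Hypothetical transport hook: bytes written, WOULD_BLOCK, or an error. */
typedef ssize_t (*write_fn)(void *ctx, const char *buf, size_t len);

/* Progress of one outgoing message, kept across writable events. */
struct send_state {
    const char *buf;
    size_t len;
    size_t off;
};

/* Call once per "writable" notification; never waits. */
static enum send_status send_step(struct send_state *s, write_fn wr, void *ctx)
{
    while (s->off < s->len) {
        ssize_t n = wr(ctx, s->buf + s->off, s->len - s->off);

        if (n == WOULD_BLOCK) {
            return SEND_AGAIN;      /* keep the watch, resume later */
        }
        if (n <= 0) {
            return SEND_FAILED;
        }
        s->off += (size_t)n;
    }
    return SEND_DONE;               /* now switch to waiting for the reply */
}

/* Mock transport: takes at most `chunk` bytes, then blocks once. */
struct mock_sink {
    char data[64];
    size_t used;
    size_t chunk;
    int ready;
};

static ssize_t mock_write(void *ctx, const char *buf, size_t len)
{
    struct mock_sink *m = ctx;
    size_t n = len < m->chunk ? len : m->chunk;

    if (!m->ready) {
        m->ready = 1;
        return WOULD_BLOCK;
    }
    if (m->used + n > sizeof(m->data)) {
        return -1;
    }
    memcpy(m->data + m->used, buf, n);
    m->used += n;
    m->ready = 0;
    return (ssize_t)n;
}
```

On SEND_AGAIN the writable watch stays installed; on SEND_DONE it is
removed and a readable watch takes over, so the thread never sleeps
inside the write path.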


> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> +        goto error;
> +    }
> +
> +    g_free(in_data);
> +    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
> +
> +    return;
> +error:
> +    t->hdr.out_len = cpu_to_le32(0);
> +
> +    if (address_space_write(
> +            &address_space_memory, t->gpa,
> +            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
> +        error_report("TDX: failed to update GetQuote header.\n");
> +    }
> +    tdx_td_notify(t);
> +
> +    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
> +    object_unref(OBJECT(t->ioc));
> +    g_free(t);
> +    g_free(in_data);
> +
> +    /* Maintain the number of in-flight requests. */
> +    ms = MACHINE(qdev_get_machine());
> +    tdx = TDX_GUEST(ms->cgs);
> +    qemu_mutex_lock(&tdx->lock);
> +    tdx->quote_generation_num--;
> +    qemu_mutex_unlock(&tdx->lock);
> +    return;
> +}
> +
> +static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
> +{
> +    hwaddr gpa = vmcall->in_r12;
> +    uint64_t buf_len = vmcall->in_r13;
> +    struct tdx_get_quote_header hdr;
> +    MachineState *ms;
> +    TdxGuest *tdx;
> +    QIOChannelSocket *ioc;
> +    struct tdx_get_quote_task *t;
> +
> +    vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
> +
> +    /* GPA must be shared. */
> +    if (!(gpa & tdx_shared_bit(cpu))) {
> +        return;
> +    }
> +    gpa &= ~tdx_shared_bit(cpu);
> +
> +    if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(buf_len, 4096)) {
> +        vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
> +        return;
> +    }
> +    if (buf_len == 0) {
> +        return;
> +    }
> +
> +    if (address_space_read(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
> +                           &hdr, sizeof(hdr)) != MEMTX_OK) {
> +        return;
> +    }
> +    if (le64_to_cpu(hdr.structure_version) != TDX_GET_QUOTE_STRUCTURE_VERSION) {
> +        return;
> +    }
> +    /*
> +     * Paranoid: Guest should clear error_code and out_len to avoid information
> +     * leak.  Enforce it.  The initial value of them doesn't matter for qemu to
> +     * process the request.
> +     */
> +    if (le64_to_cpu(hdr.error_code) != TDX_VP_GET_QUOTE_SUCCESS ||
> +        le32_to_cpu(hdr.out_len) != 0) {
> +        return;
> +    }
> +
> +    /* Only safe-guard check to avoid too large buffer size. */
> +    if (buf_len > TDX_GET_QUOTE_MAX_BUF_LEN ||
> +        le32_to_cpu(hdr.in_len) > TDX_GET_QUOTE_MAX_BUF_LEN ||
> +        le32_to_cpu(hdr.in_len) > buf_len) {
> +        return;
> +    }
> +
> +    /* Mark the buffer in-flight. */
> +    hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_IN_FLIGHT);
> +    if (address_space_write(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
> +                            &hdr, sizeof(hdr)) != MEMTX_OK) {
> +        return;
> +    }
> +
> +    ms = MACHINE(qdev_get_machine());
> +    tdx = TDX_GUEST(ms->cgs);
> +    ioc = qio_channel_socket_new();
> +
> +    t = g_malloc(sizeof(*t));
> +    t->apic_id = tdx->event_notify_apic_id;
> +    t->gpa = gpa;
> +    t->buf_len = buf_len;
> +    t->out_data = g_malloc(t->buf_len);
> +    t->out_len = 0;
> +    t->hdr = hdr;
> +    t->ioc = ioc;
> +
> +    qemu_mutex_lock(&tdx->lock);
> +    if (!tdx->quote_generation ||
> +        /* Prevent too many in-flight get-quote request. */
> +        tdx->quote_generation_num >= TDX_MAX_GET_QUOTE_REQUEST) {
> +        qemu_mutex_unlock(&tdx->lock);
> +        vmcall->status_code = TDG_VP_VMCALL_RETRY;
> +        object_unref(OBJECT(ioc));
> +        g_free(t->out_data);
> +        g_free(t);
> +        return;
> +    }
> +    tdx->quote_generation_num++;
> +    t->event_notify_interrupt = tdx->event_notify_interrupt;
> +    qio_channel_socket_connect_async(
> +        ioc, tdx->quote_generation, tdx_handle_get_quote_connected, t, NULL,
> +        NULL);
> +    qemu_mutex_unlock(&tdx->lock);
> +
> +    vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
> +}
> +
>  static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
>                                                      struct kvm_tdx_vmcall *vmcall)
>  {
> @@ -1005,6 +1432,9 @@ static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
>      }
>  
>      switch (vmcall->subfunction) {
> +    case TDG_VP_VMCALL_GET_QUOTE:
> +        tdx_handle_get_quote(cpu, vmcall);
> +        break;
>      case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
>          tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
>          break;
> diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
> index 4a8d67cc9fdb..4a989805493e 100644
> --- a/target/i386/kvm/tdx.h
> +++ b/target/i386/kvm/tdx.h
> @@ -5,8 +5,10 @@
>  #include CONFIG_DEVICES /* CONFIG_TDX */
>  #endif
>  
> +#include <linux/kvm.h>
>  #include "exec/confidential-guest-support.h"
>  #include "hw/i386/tdvf.h"
> +#include "io/channel-socket.h"
>  #include "sysemu/kvm.h"
>  
>  #define TYPE_TDX_GUEST "tdx-guest"
> @@ -47,6 +49,10 @@ typedef struct TdxGuest {
>      /* runtime state */
>      int event_notify_interrupt;
>      uint32_t event_notify_apic_id;
> +
> +    /* GetQuote */
> +    int quote_generation_num;
> +    SocketAddress *quote_generation;
>  } TdxGuest;
>  
>  #ifdef CONFIG_TDX
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 53/70] i386/tdx: setup a timer for the qio channel
  2023-11-15  7:15 ` [PATCH v3 53/70] i386/tdx: setup a timer for the qio channel Xiaoyao Li
@ 2023-11-15 18:02   ` Daniel P. Berrangé
  0 siblings, 0 replies; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-15 18:02 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:15:02AM -0500, Xiaoyao Li wrote:
> From: Chenyi Qiang <chenyi.qiang@intel.com>
> 
> To avoid waiting forever when the QGS server does not respond, set up a
> timer for the transaction. On timeout, treat it as an error and interrupt
> the guest. The timeout is 30s for now and can be changed if that value
> turns out to be inappropriate.
> 
> Extract the common cleanup code to make it clearer.
> 
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
>  - Use t->timer_armed to track if t->timer is initialized;
> ---
>  target/i386/kvm/tdx.c | 155 ++++++++++++++++++++++++------------------
>  1 file changed, 89 insertions(+), 66 deletions(-)
> 
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index 54b38c031fb3..3b87c36c485e 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -1069,6 +1069,8 @@ struct tdx_get_quote_task {
>      struct tdx_get_quote_header hdr;
>      int event_notify_interrupt;
>      QIOChannelSocket *ioc;
> +    QEMUTimer timer;
> +    bool timer_armed;
>  };
>  
>  struct x86_msi {
> @@ -1151,13 +1153,49 @@ static void tdx_td_notify(struct tdx_get_quote_task *t)
>      }
>  }
>  
> +static void tdx_getquote_task_cleanup(struct tdx_get_quote_task *t, bool outlen_overflow)
> +{
> +    MachineState *ms;
> +    TdxGuest *tdx;
> +
> +    if (t->hdr.error_code != cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS) && !outlen_overflow) {
> +        t->hdr.out_len = cpu_to_le32(0);
> +    }
> +
> +    /* Publish the response contents before marking this request completed. */
> +    smp_wmb();
> +    if (address_space_write(
> +            &address_space_memory, t->gpa,
> +            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
> +        error_report("TDX: failed to update GetQuote header.");
> +    }
> +    tdx_td_notify(t);
> +
> +    if (t->ioc->fd > 0) {
> +        qemu_set_fd_handler(t->ioc->fd, NULL, NULL, NULL);
> +    }
> +    qio_channel_close(QIO_CHANNEL(t->ioc), NULL);
> +    object_unref(OBJECT(t->ioc));
> +    if (t->timer_armed)
> +        timer_del(&t->timer);
> +    g_free(t->out_data);
> +    g_free(t);
> +
> +    /* Maintain the number of in-flight requests. */
> +    ms = MACHINE(qdev_get_machine());
> +    tdx = TDX_GUEST(ms->cgs);
> +    qemu_mutex_lock(&tdx->lock);
> +    tdx->quote_generation_num--;
> +    qemu_mutex_unlock(&tdx->lock);
> +}
> +
> +
>  static void tdx_get_quote_read(void *opaque)
>  {
>      struct tdx_get_quote_task *t = opaque;
>      ssize_t size = 0;
>      Error *err = NULL;
> -    MachineState *ms;
> -    TdxGuest *tdx;
> +    bool outlen_overflow = false;
>  
>      while (true) {
>          char *buf;
> @@ -1202,11 +1240,12 @@ static void tdx_get_quote_read(void *opaque)
>           * There is no specific error code defined for this case(E2BIG) at the
>           * moment.
>           * TODO: Once an error code for this case is defined in GHCI spec ,
> -         * update the error code.
> +         * update the error code and the tdx_getquote_task_cleanup() argument.
>           */
>          t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
>          t->hdr.out_len = cpu_to_le32(t->out_len);
> -        goto error_hdr;
> +        outlen_overflow = true;
> +        goto error;
>      }
>  
>      if (address_space_write(
> @@ -1222,94 +1261,77 @@ static void tdx_get_quote_read(void *opaque)
>      t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS);
>  
>  error:
> -    if (t->hdr.error_code != cpu_to_le64(TDX_VP_GET_QUOTE_SUCCESS)) {
> -        t->hdr.out_len = cpu_to_le32(0);
> -    }
> -error_hdr:
> -    if (address_space_write(
> -            &address_space_memory, t->gpa,
> -            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
> -        error_report("TDX: failed to update GetQuote header.");
> -    }
> -    tdx_td_notify(t);
> +    tdx_getquote_task_cleanup(t, outlen_overflow);
> +}
> +
> +#define TRANSACTION_TIMEOUT 30000
> +
> +static void getquote_timer_expired(void *opaque)
> +{
> +    struct tdx_get_quote_task *t = opaque;
> +
> +    tdx_getquote_task_cleanup(t, false);
> +}
>  
> -    qemu_set_fd_handler(t->ioc->fd, NULL, NULL, NULL);
> -    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
> -    object_unref(OBJECT(t->ioc));
> -    g_free(t->out_data);
> -    g_free(t);
> +static void tdx_transaction_start(struct tdx_get_quote_task *t)
> +{
> +    int64_t time;
>  
> -    /* Maintain the number of in-flight requests. */
> -    ms = MACHINE(qdev_get_machine());
> -    tdx = TDX_GUEST(ms->cgs);
> -    qemu_mutex_lock(&tdx->lock);
> -    tdx->quote_generation_num--;
> -    qemu_mutex_unlock(&tdx->lock);
> +    time = qemu_clock_get_ms(QEMU_CLOCK_VIRTUAL);
> +    /*
> +     * Timeout callback and fd callback both run in main loop thread,
> +     * thus no need to worry about race condition.
> +     */
> +    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
> +    timer_init_ms(&t->timer, QEMU_CLOCK_VIRTUAL, getquote_timer_expired, t);
> +    timer_mod(&t->timer, time + TRANSACTION_TIMEOUT);
> +    t->timer_armed = true;
>  }
>  
> -/*
> - * TODO: If QGS doesn't reply for long time, make it an error and interrupt
> - * guest.
> - */
>  static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
>  {
>      struct tdx_get_quote_task *t = opaque;
>      Error *err = NULL;
>      char *in_data = NULL;
> -    MachineState *ms;
> -    TdxGuest *tdx;
> +    int ret = 0;
>  
>      t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
> -    if (qio_task_propagate_error(task, NULL)) {
> +    ret = qio_task_propagate_error(task, NULL);
> +    if (ret) {
>          t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> -        goto error;
> +        goto out;
>      }
>  
>      in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
>      if (!in_data) {
> -        goto error;
> +        ret = -1;
> +        goto out;
>      }
>  
> -    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
> -                           MEMTXATTRS_UNSPECIFIED, in_data,
> -                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
> -        goto error;
> +    ret = address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
> +                             MEMTXATTRS_UNSPECIFIED, in_data,
> +                             le32_to_cpu(t->hdr.in_len));
> +    if (ret) {
> +        g_free(in_data);
> +        goto out;
>      }
>  
>      qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
>  
> -    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
> -                              le32_to_cpu(t->hdr.in_len), &err) ||
> -        err) {
> +    ret = qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
> +                              le32_to_cpu(t->hdr.in_len), &err);
> +    if (ret) {
>          t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> -        goto error;
> +        g_free(in_data);
> +        goto out;
>      }
>  
> -    g_free(in_data);

Most of the diff in this method is just arbitrary style
changes. Just do it right the first time in your previous
patch so we don't have lots of style changes in this patch.

> -    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
> -
> -    return;
> -error:
> -    t->hdr.out_len = cpu_to_le32(0);
> -
> -    if (address_space_write(
> -            &address_space_memory, t->gpa,
> -            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
> -        error_report("TDX: failed to update GetQuote header.\n");
> +out:
> +    if (ret) {
> +        tdx_getquote_task_cleanup(t, false);
> +    } else {
> +        tdx_transaction_start(t);
>      }
> -    tdx_td_notify(t);
> -
> -    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
> -    object_unref(OBJECT(t->ioc));
> -    g_free(t);
> -    g_free(in_data);
> -
> -    /* Maintain the number of in-flight requests. */
> -    ms = MACHINE(qdev_get_machine());
> -    tdx = TDX_GUEST(ms->cgs);
> -    qemu_mutex_lock(&tdx->lock);
> -    tdx->quote_generation_num--;
> -    qemu_mutex_unlock(&tdx->lock);
>      return;
>  }
>  
> @@ -1382,6 +1404,7 @@ static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
>      t->out_len = 0;
>      t->hdr = hdr;
>      t->ioc = ioc;
> +    t->timer_armed = false;
>  
>      qemu_mutex_lock(&tdx->lock);
>      if (!tdx->quote_generation ||
> -- 
> 2.34.1
> 

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-15  7:14 ` [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE Xiaoyao Li
@ 2023-11-15 18:10   ` David Hildenbrand
  2023-11-16  2:47     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-15 18:10 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 15.11.23 08:14, Xiaoyao Li wrote:
> KVM allows KVM_GUEST_MEMFD_ALLOW_HUGEPAGE for guest memfd. When the
> flag is set, KVM tries to allocate memory with transparent hugepage at
> first and falls back to non-hugepage on failure.
> 
> However, KVM defines one restriction that size must be hugepage size
> aligned when KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> v3:
>   - New one in v3.
> ---
>   system/physmem.c | 38 ++++++++++++++++++++++++++++++++++++--
>   1 file changed, 36 insertions(+), 2 deletions(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index 0af2213cbd9c..c56b17e44df6 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1803,6 +1803,40 @@ static void dirty_memory_extend(ram_addr_t old_ram_size,
>       }
>   }
>   
> +#ifdef CONFIG_KVM
> +#define HPAGE_PMD_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
> +#define DEFAULT_PMD_SIZE (1ul << 21)
> +
> +static uint32_t get_thp_size(void)
> +{
> +    gchar *content = NULL;
> +    const char *endptr;
> +    static uint64_t thp_size = 0;
> +    uint64_t tmp;
> +
> +    if (thp_size != 0) {
> +        return thp_size;
> +    }
> +
> +    if (g_file_get_contents(HPAGE_PMD_SIZE_PATH, &content, NULL, NULL) &&
> +        !qemu_strtou64(content, &endptr, 0, &tmp) &&
> +        (!endptr || *endptr == '\n')) {
> +        /* Sanity-check the value and fallback to something reasonable. */
> +        if (!tmp || !is_power_of_2(tmp)) {
> +            warn_report("Read unsupported THP size: %" PRIx64, tmp);
> +        } else {
> +            thp_size = tmp;
> +        }
> +    }
> +
> +    if (!thp_size) {
> +        thp_size = DEFAULT_PMD_SIZE;
> +    }

... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)

This should be factored out into a common helper.
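
A rough sketch of the parsing half of such a shared helper, written
with only the C standard library so it stands alone. The name
thp_size_from_string() is hypothetical; it takes the text read from
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size and applies the same
sanity checks and 2 MiB fallback as the code quoted above. Reading the
file and caching the result are left to the caller.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define DEFAULT_PMD_SIZE (UINT64_C(1) << 21)

/*
 * Hypothetical shared helper: turn the contents of hpage_pmd_size into
 * a THP size.  Falls back to DEFAULT_PMD_SIZE when the text is not a
 * number, has trailing junk, is zero, or is not a power of two.
 */
static uint64_t thp_size_from_string(const char *content)
{
    char *end = NULL;
    unsigned long long val;

    assert(content != NULL);

    errno = 0;
    val = strtoull(content, &end, 0);
    if (errno || end == content || (*end != '\0' && *end != '\n')) {
        return DEFAULT_PMD_SIZE;          /* unparsable input */
    }
    if (val == 0 || (val & (val - 1)) != 0) {
        return DEFAULT_PMD_SIZE;          /* not a usable page size */
    }
    return (uint64_t)val;
}
```

Both physmem.c and virtio-mem could then call one function instead of
each carrying its own copy of the parsing and fallback logic.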

-- 
Cheers,

David / dhildenb



* Re: [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState
  2023-11-15  7:14 ` [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState Xiaoyao Li
@ 2023-11-15 18:14   ` David Hildenbrand
  2023-11-16  2:53     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-15 18:14 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 15.11.23 08:14, Xiaoyao Li wrote:
> Add a new member "require_guest_memfd" to memory backends. When it's set
> to true, it enables RAM_GUEST_MEMFD in ram_flags, thus private kvm
> guest_memfd will be allocated during RAMBlock allocation.
> 
> Memory backend's @require_guest_memfd is wired with @require_guest_memfd
> field of MachineState. MachineState::require_guest_memfd is supposed to
> be set by any VM that requires KVM guest memfd as private memory, e.g.,
> TDX VM.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>

I'm confused, why do we need this if it's going to be the same for all 
memory backends right now?

-- 
Cheers,

David / dhildenb



* Re: [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-11-15  7:14 ` [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range() Xiaoyao Li
@ 2023-11-15 18:20   ` David Hildenbrand
  2023-11-16  2:56     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-15 18:20 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 15.11.23 08:14, Xiaoyao Li wrote:
> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
> ram_block_discard_range() which grabs some code from
> ram_discard_range(). However, during code movement, it changed alignment
> check of host_startaddr from qemu_host_page_size to rb->page_size.
> 
> When a ramblock is backed by hugepages, this requires startaddr to be
> huge page size aligned, which is overkill. E.g., TDX's private-shared
> page conversion is done at 4KB granularity. A shared page is discarded
> when it gets converted to private, and when the shared page is backed by
> a hugepage, the discard fails this check.
> 
> So change the alignment check back to qemu_host_page_size.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
>   - Newly added in v3;
> ---
>   system/physmem.c | 2 +-
>   1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/system/physmem.c b/system/physmem.c
> index c56b17e44df6..8a4e42c7cf60 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length)
>   
>       uint8_t *host_startaddr = rb->host + start;
>   
> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {

For your use cases, rb->page_size should always match qemu_host_page_size.

IIRC, we only set rb->page_size to different values for hugetlb. And 
guest_memfd does not support hugetlb.

Even if QEMU is using THP, rb->page_size should be 4k.

Please elaborate how you can actually trigger that. From what I recall, 
guest_memfd is not compatible with hugetlb.

And the check here makes perfect sense for existing callers of 
ram_block_discard_range(): you cannot partially zap a hugetlb page.
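
To make the arithmetic concrete, here is a small standalone sketch of
the alignment test in question. is_aligned(), SZ_4K and SZ_2M are
illustrative names standing in for QEMU_PTR_IS_ALIGNED() and the two
page sizes; the address used below is made up for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SZ_4K (UINT64_C(1) << 12)
#define SZ_2M (UINT64_C(1) << 21)

/* True if addr is a multiple of align; align must be a power of two. */
static bool is_aligned(uint64_t addr, uint64_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}
```

With a 2 MiB rb->page_size, a start address such as 0x40201000 is 4K
aligned but sits 4 KiB into a huge page, so is_aligned(addr, SZ_2M) is
false and the discard is rejected; checking against SZ_4K instead would
let a request through that covers only part of the huge page.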

-- 
Cheers,

David / dhildenb



* Re: [PATCH v3 08/70] physmem: replace function name with __func__ in ram_block_discard_range()
  2023-11-15  7:14 ` [PATCH v3 08/70] physmem: replace function name with __func__ " Xiaoyao Li
@ 2023-11-15 18:21   ` David Hildenbrand
  2023-12-04  7:40     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-15 18:21 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 15.11.23 08:14, Xiaoyao Li wrote:
> Use __func__ to avoid hard-coded function name.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---

That can be queued independently.

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 
Cheers,

David / dhildenb



* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15 17:54   ` David Hildenbrand
@ 2023-11-16  2:45     ` Xiaoyao Li
  2023-11-20  9:19       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  2:45 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/16/2023 1:54 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>> and kvm guest memfd based private memory can be associated in one 
>> RAMBlock.
>>
>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>> create private guest_memfd during RAMBlock setup.
>>
>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>> confidential guests, such as TDX VM. How and when to set it for memory
>> backends will be implemented in the following patches.
> 
> Can you elaborate (and add to the patch description if there is good 
> reason) why we need that flag and why we cannot simply rely on the VM 
> type instead to decide whether to allocate a guest_memfd or not?
> 

The reason is that relying on the VM type is something of a hack: we 
would need to get the MachineState instance and retrieve the VM type 
from it. I think it's better not to couple them.

More importantly, it is neither flexible nor extensible for a future 
case where not all memory needs guest memfd.


* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-15 18:10   ` David Hildenbrand
@ 2023-11-16  2:47     ` Xiaoyao Li
  2023-11-20  9:26       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  2:47 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/16/2023 2:10 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> KVM allows KVM_GUEST_MEMFD_ALLOW_HUGEPAGE for guest memfd. When the
>> flag is set, KVM tries to allocate memory with transparent hugepage at
>> first and falls back to non-hugepage on failure.
>>
>> However, KVM defines one restriction that size must be hugepage size
>> aligned when KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is set.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> v3:
>>   - New one in v3.
>> ---
>>   system/physmem.c | 38 ++++++++++++++++++++++++++++++++++++--
>>   1 file changed, 36 insertions(+), 2 deletions(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index 0af2213cbd9c..c56b17e44df6 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1803,6 +1803,40 @@ static void dirty_memory_extend(ram_addr_t 
>> old_ram_size,
>>       }
>>   }
>> +#ifdef CONFIG_KVM
>> +#define HPAGE_PMD_SIZE_PATH 
>> "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
>> +#define DEFAULT_PMD_SIZE (1ul << 21)
>> +
>> +static uint32_t get_thp_size(void)
>> +{
>> +    gchar *content = NULL;
>> +    const char *endptr;
>> +    static uint64_t thp_size = 0;
>> +    uint64_t tmp;
>> +
>> +    if (thp_size != 0) {
>> +        return thp_size;
>> +    }
>> +
>> +    if (g_file_get_contents(HPAGE_PMD_SIZE_PATH, &content, NULL, 
>> NULL) &&
>> +        !qemu_strtou64(content, &endptr, 0, &tmp) &&
>> +        (!endptr || *endptr == '\n')) {
>> +        /* Sanity-check the value and fallback to something 
>> reasonable. */
>> +        if (!tmp || !is_power_of_2(tmp)) {
>> +            warn_report("Read unsupported THP size: %" PRIx64, tmp);
>> +        } else {
>> +            thp_size = tmp;
>> +        }
>> +    }
>> +
>> +    if (!thp_size) {
>> +        thp_size = DEFAULT_PMD_SIZE;
>> +    }
> 
> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)

Got caught.

> This should be factored out into a common helper.

Sure, will do it in the next version.



* Re: [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState
  2023-11-15 18:14   ` David Hildenbrand
@ 2023-11-16  2:53     ` Xiaoyao Li
  2023-11-20  9:30       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  2:53 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/16/2023 2:14 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> Add a new member "require_guest_memfd" to memory backends. When it's set
>> to true, it enables RAM_GUEST_MEMFD in ram_flags, thus private kvm
>> guest_memfd will be allocated during RAMBlock allocation.
>>
>> Memory backend's @require_guest_memfd is wired with @require_guest_memfd
>> field of MachineState. MachineState::require_guest_memfd is supposed to
>> be set by any VM that requires KVM guest memfd as private memory, e.g.,
>> TDX VM.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> 
> I'm confused, why do we need this if it's going to be the same for all 
> memory backends right now?
> 

I want to provide an elegant (in my view) way to configure "the need for 
guest memfd" instead of checking x86machinestate->vm_type in physmem.c.


* Re: [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-11-15 18:20   ` David Hildenbrand
@ 2023-11-16  2:56     ` Xiaoyao Li
  2023-11-20  9:56       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  2:56 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/16/2023 2:20 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>> ram_block_discard_range() which grabs some code from
>> ram_discard_range(). However, during code movement, it changed alignment
>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>
>> When a ramblock is backed by hugepages, this requires startaddr to be
>> huge page size aligned, which is overkill. E.g., TDX's private-shared
>> page conversion is done at 4KB granularity. A shared page is discarded
>> when it gets converted to private, and when the shared page is backed by
>> a hugepage, the discard fails this check.
>>
>> So change the alignment check back to qemu_host_page_size.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>>   - Newly added in v3;
>> ---
>>   system/physmem.c | 2 +-
>>   1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/system/physmem.c b/system/physmem.c
>> index c56b17e44df6..8a4e42c7cf60 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb, 
>> uint64_t start, size_t length)
>>       uint8_t *host_startaddr = rb->host + start;
>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
> 
> For your use cases, rb->page_size should always match qemu_host_page_size.
> 
> IIRC, we only set rb->page_size to different values for hugetlb. And 
> guest_memfd does not support hugetlb.
> 
> Even if QEMU is using THP, rb->page_size should be 4k.
> 
> Please elaborate how you can actually trigger that. From what I recall, 
> guest_memfd is not compatible with hugetlb.

It's the shared memory that can be backed by hugetlb.

Patch 9, later in this series, introduces ram_block_convert_page(), which 
discards shared memory when it gets converted to private. A TD guest can 
request conversion of a 4K page to private while that page is currently 
backed by hugetlb as part of a 2M shared page.

> And the check here makes perfect sense for existing callers of 
> ram_block_discard_range(): you cannot partially zap a hugetlb page.
> 



* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15 10:20   ` Daniel P. Berrangé
@ 2023-11-16  3:34     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  3:34 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/15/2023 6:20 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:14:11AM -0500, Xiaoyao Li wrote:
>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>> and kvm guest memfd based private memory can be associated in one RAMBlock.
>>
>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>> create private guest_memfd during RAMBlock setup.
>>
>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>> confidential guests, such as TDX VM. How and when to set it for memory
>> backends will be implemented in the following patches.
>>
>> Introduce memory_region_has_guest_memfd() to query if the MemoryRegion has
>> KVM guest_memfd allocated.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>> - rename gmem to guest_memfd;
>> - close(guest_memfd) when RAMBlock is released; (Daniel P. Berrangé)
>> - Suqash the patch that introduces memory_region_has_guest_memfd().
>> ---
>>   accel/kvm/kvm-all.c     | 24 ++++++++++++++++++++++++
>>   include/exec/memory.h   | 13 +++++++++++++
>>   include/exec/ramblock.h |  1 +
>>   include/sysemu/kvm.h    |  2 ++
>>   system/memory.c         |  5 +++++
>>   system/physmem.c        | 27 ++++++++++++++++++++++++---
>>   6 files changed, 69 insertions(+), 3 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index c1b40e873531..9f751d4971f8 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -101,6 +101,7 @@ bool kvm_msi_use_devid;
>>   bool kvm_has_guest_debug;
>>   static int kvm_sstep_flags;
>>   static bool kvm_immediate_exit;
>> +static bool kvm_guest_memfd_supported;
>>   static hwaddr kvm_max_slot_size = ~0;
>>   
>>   static const KVMCapabilityInfo kvm_required_capabilites[] = {
>> @@ -2397,6 +2398,8 @@ static int kvm_init(MachineState *ms)
>>       }
>>       s->as = g_new0(struct KVMAs, s->nr_as);
>>   
>> +    kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
>> +
>>       if (object_property_find(OBJECT(current_machine), "kvm-type")) {
>>           g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
>>                                                               "kvm-type",
>> @@ -4078,3 +4081,24 @@ void query_stats_schemas_cb(StatsSchemaList **result, Error **errp)
>>           query_stats_schema_vcpu(first_cpu, &stats_args);
>>       }
>>   }
>> +
>> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
>> +{
>> +    int fd;
>> +    struct kvm_create_guest_memfd guest_memfd = {
>> +        .size = size,
>> +        .flags = flags,
>> +    };
>> +
>> +    if (!kvm_guest_memfd_supported) {
>> +        error_setg(errp, "KVM doesn't support guest memfd\n");
>> +        return -EOPNOTSUPP;
> 
> Returning an errno value is unusual when we have an 'Error **errp' parameter
> for reporting, and the following codepath merely returns -1, so this is
> inconsistent. Just return -1 here too.

OK.

>> +    }
>> +
>> +    fd = kvm_vm_ioctl(kvm_state, KVM_CREATE_GUEST_MEMFD, &guest_memfd);
>> +    if (fd < 0) {
>> +        error_setg_errno(errp, errno, "%s: error creating kvm guest memfd\n", __func__);
> 
> I'd prefer an explicit 'return -1' here, even though 'fd' is technically going
> to be -1 already.
> 
> Also including __func__ in the error message is not really needed IMHO

OK

>> +    }
>> +
>> +    return fd;
>> +}
>> diff --git a/include/exec/memory.h b/include/exec/memory.h
>> index 831f7c996d9d..f780367ab1bd 100644
>> --- a/include/exec/memory.h
>> +++ b/include/exec/memory.h
>> @@ -243,6 +243,9 @@ typedef struct IOMMUTLBEvent {
>>   /* RAM FD is opened read-only */
>>   #define RAM_READONLY_FD (1 << 11)
>>   
>> +/* RAM can be private that has kvm gmem backend */
>> +#define RAM_GUEST_MEMFD   (1 << 12)
>> +
>>   static inline void iommu_notifier_init(IOMMUNotifier *n, IOMMUNotify fn,
>>                                          IOMMUNotifierFlag flags,
>>                                          hwaddr start, hwaddr end,
>> @@ -1702,6 +1705,16 @@ static inline bool memory_region_is_romd(MemoryRegion *mr)
>>    */
>>   bool memory_region_is_protected(MemoryRegion *mr);
>>   
>> +/**
>> + * memory_region_has_guest_memfd: check whether a memory region has guest_memfd
>> + *     associated
>> + *
>> + * Returns %true if a memory region's ram_block has valid guest_memfd assigned.
>> + *
>> + * @mr: the memory region being queried
>> + */
>> +bool memory_region_has_guest_memfd(MemoryRegion *mr);
>> +
>>   /**
>>    * memory_region_get_iommu: check whether a memory region is an iommu
>>    *
>> diff --git a/include/exec/ramblock.h b/include/exec/ramblock.h
>> index 69c6a5390293..0a17ba882729 100644
>> --- a/include/exec/ramblock.h
>> +++ b/include/exec/ramblock.h
>> @@ -41,6 +41,7 @@ struct RAMBlock {
>>       QLIST_HEAD(, RAMBlockNotifier) ramblock_notifiers;
>>       int fd;
>>       uint64_t fd_offset;
>> +    int guest_memfd;
>>       size_t page_size;
>>       /* dirty bitmap used during migration */
>>       unsigned long *bmap;
>> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
>> index d61487816421..fedc28c7d17f 100644
>> --- a/include/sysemu/kvm.h
>> +++ b/include/sysemu/kvm.h
>> @@ -538,4 +538,6 @@ bool kvm_arch_cpu_check_are_resettable(void);
>>   bool kvm_dirty_ring_enabled(void);
>>   
>>   uint32_t kvm_dirty_ring_size(void);
>> +
>> +int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
>>   #endif
>> diff --git a/system/memory.c b/system/memory.c
>> index 304fa843ea12..69741d91bbb7 100644
>> --- a/system/memory.c
>> +++ b/system/memory.c
>> @@ -1862,6 +1862,11 @@ bool memory_region_is_protected(MemoryRegion *mr)
>>       return mr->ram && (mr->ram_block->flags & RAM_PROTECTED);
>>   }
>>   
>> +bool memory_region_has_guest_memfd(MemoryRegion *mr)
>> +{
>> +    return mr->ram_block && mr->ram_block->guest_memfd >= 0;
>> +}
>> +
>>   uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>>   {
>>       uint8_t mask = mr->dirty_log_mask;
>> diff --git a/system/physmem.c b/system/physmem.c
>> index fc2b0fee0188..0af2213cbd9c 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>           }
>>       }
>>   
>> +#ifdef CONFIG_KVM
>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
>> +        new_block->guest_memfd < 0) {
>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
>> +        uint64_t flags = 0;
>> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
>> +                                                        flags, errp);
>> +        if (new_block->guest_memfd < 0) {
>> +            qemu_mutex_unlock_ramlist();
>> +            return;
>> +        }
>> +    }
>> +#endif
>> +
>>       new_ram_size = MAX(old_ram_size,
>>                 (new_block->offset + new_block->max_length) >> TARGET_PAGE_BITS);
>>       if (new_ram_size > old_ram_size) {
>> @@ -1903,7 +1917,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>       /* Just support these ram flags by now. */
>>       assert((ram_flags & ~(RAM_SHARED | RAM_PMEM | RAM_NORESERVE |
>>                             RAM_PROTECTED | RAM_NAMED_FILE | RAM_READONLY |
>> -                          RAM_READONLY_FD)) == 0);
>> +                          RAM_READONLY_FD | RAM_GUEST_MEMFD)) == 0);
>>   
>>       if (xen_enabled()) {
>>           error_setg(errp, "-mem-path not supported with Xen");
>> @@ -1938,6 +1952,7 @@ RAMBlock *qemu_ram_alloc_from_fd(ram_addr_t size, MemoryRegion *mr,
>>       new_block->used_length = size;
>>       new_block->max_length = size;
>>       new_block->flags = ram_flags;
>> +    new_block->guest_memfd = -1;
>>       new_block->host = file_ram_alloc(new_block, size, fd, !file_size, offset,
>>                                        errp);
>>       if (!new_block->host) {
>> @@ -2016,7 +2031,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>>       Error *local_err = NULL;
>>   
>>       assert((ram_flags & ~(RAM_SHARED | RAM_RESIZEABLE | RAM_PREALLOC |
>> -                          RAM_NORESERVE)) == 0);
>> +                          RAM_NORESERVE| RAM_GUEST_MEMFD)) == 0);
>>       assert(!host ^ (ram_flags & RAM_PREALLOC));
>>   
>>       size = HOST_PAGE_ALIGN(size);
>> @@ -2028,6 +2043,7 @@ RAMBlock *qemu_ram_alloc_internal(ram_addr_t size, ram_addr_t max_size,
>>       new_block->max_length = max_size;
>>       assert(max_size >= size);
>>       new_block->fd = -1;
>> +    new_block->guest_memfd = -1;
>>       new_block->page_size = qemu_real_host_page_size();
>>       new_block->host = host;
>>       new_block->flags = ram_flags;
>> @@ -2050,7 +2066,7 @@ RAMBlock *qemu_ram_alloc_from_ptr(ram_addr_t size, void *host,
>>   RAMBlock *qemu_ram_alloc(ram_addr_t size, uint32_t ram_flags,
>>                            MemoryRegion *mr, Error **errp)
>>   {
>> -    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE)) == 0);
>> +    assert((ram_flags & ~(RAM_SHARED | RAM_NORESERVE | RAM_GUEST_MEMFD)) == 0);
>>       return qemu_ram_alloc_internal(size, size, NULL, NULL, ram_flags, mr, errp);
>>   }
>>   
>> @@ -2078,6 +2094,11 @@ static void reclaim_ramblock(RAMBlock *block)
>>       } else {
>>           qemu_anon_ram_free(block->host, block->max_length);
>>       }
>> +
>> +    if (block->guest_memfd >= 0) {
>> +        close(block->guest_memfd);
>> +    }
>> +
>>       g_free(block);
>>   }
>>   
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel



* Re: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-11-15 10:38   ` Daniel P. Berrangé
@ 2023-11-16  3:40     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  3:40 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/15/2023 6:38 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:14:15AM -0500, Xiaoyao Li wrote:
>> Introduce the helper functions to set the attributes of a range of
>> memory to private or shared.
>>
>> This is necessary to notify KVM the private/shared attribute of each gpa
>> range. KVM needs the information to decide the GPA needs to be mapped at
>> hva-based shared memory or guest_memfd based private memory.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>>   accel/kvm/kvm-all.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
>>   include/sysemu/kvm.h |  3 +++
>>   2 files changed, 45 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 69afeb47c9c0..76e2404d54d2 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
>>   static int kvm_sstep_flags;
>>   static bool kvm_immediate_exit;
>>   static bool kvm_guest_memfd_supported;
>> +static uint64_t kvm_supported_memory_attributes;
>>   static hwaddr kvm_max_slot_size = ~0;
>>   
>>   static const KVMCapabilityInfo kvm_required_capabilites[] = {
>> @@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
>>       kvm_max_slot_size = max_slot_size;
>>   }
>>   
>> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
>> +{
>> +    struct kvm_memory_attributes attrs;
>> +    int r;
>> +
>> +    attrs.attributes = attr;
>> +    attrs.address = start;
>> +    attrs.size = size;
>> +    attrs.flags = 0;
>> +
>> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
>> +    if (r) {
>> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'",
>> +                     __func__, start, size, attr, strerror(errno));
> 
> This is an error condition rather than an warning condition.
> 
> Also again I think __func__ is generally not required in an error message,
> if the error message text is suitably descriptive - applies to other
> patches in this series too.

Got it.

>> +    }
>> +    return r;
>> +}
>> +
>> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
>> +{
>> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
>> +        return -EINVAL;
>> +    }
>> +
>> +    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
>> +}
>> +
>> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
>> +{
>> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
>> +        return -EINVAL;
>> +    }
>> +
>> +    return kvm_set_memory_attributes(start, size, 0);
>> +}
>> +
>>   /* Called with KVMMemoryListener.slots_lock held */
>>   static void kvm_set_phys_mem(KVMMemoryListener *kml,
>>                                MemoryRegionSection *section, bool add)
>> @@ -2440,6 +2479,9 @@ static int kvm_init(MachineState *ms)
>>   
>>       kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD);
>>   
>> +    ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES);
>> +    kvm_supported_memory_attributes = ret > 0 ? ret : 0;
>> +
>>       if (object_property_find(OBJECT(current_machine), "kvm-type")) {
>>           g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine),
>>                                                               "kvm-type",
>> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
>> index fedc28c7d17f..0e88958190a4 100644
>> --- a/include/sysemu/kvm.h
>> +++ b/include/sysemu/kvm.h
>> @@ -540,4 +540,7 @@ bool kvm_dirty_ring_enabled(void);
>>   uint32_t kvm_dirty_ring_size(void);
>>   
>>   int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp);
>> +
>> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size);
>> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size);
>>   #endif
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel



* Re: [PATCH v3 10/70] kvm: handle KVM_EXIT_MEMORY_FAULT
  2023-11-15 10:42   ` Daniel P. Berrangé
@ 2023-11-16  5:16     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  5:16 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/15/2023 6:42 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:14:19AM -0500, Xiaoyao Li wrote:
>> From: Chao Peng <chao.p.peng@linux.intel.com>
>>
>> Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when
>> KVM_EXIT_MEMORY_FAULT happens. It indicates userspace needs to do
>> the memory conversion on the RAMBlock to turn the memory into desired
>> attribute, i.e., private/shared.
>>
>> Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has
>> guest_memfd memory backend.
>>
>> Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is
>> added.
>>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>>   accel/kvm/kvm-all.c | 76 +++++++++++++++++++++++++++++++++++++++------
>>   1 file changed, 66 insertions(+), 10 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 76e2404d54d2..58abbcb6926e 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -2902,6 +2902,50 @@ static void kvm_eat_signals(CPUState *cpu)
>>       } while (sigismember(&chkset, SIG_IPI));
>>   }
>>   
>> +static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private)
>> +{
>> +    MemoryRegionSection section;
>> +    ram_addr_t offset;
>> +    RAMBlock *rb;
>> +    void *addr;
>> +    int ret = -1;
>> +
>> +    section = memory_region_find(get_system_memory(), start, size);
>> +    if (!section.mr) {
>> +        return ret;
>> +    }
>> +
>> +    if (memory_region_has_guest_memfd(section.mr)) {
>> +        if (to_private) {
>> +            ret = kvm_set_memory_attributes_private(start, size);
>> +        } else {
>> +            ret = kvm_set_memory_attributes_shared(start, size);
>> +        }
>> +
>> +        if (ret) {
>> +            memory_region_unref(section.mr);
>> +            return ret;
>> +        }
>> +
>> +        addr = memory_region_get_ram_ptr(section.mr) +
>> +               section.offset_within_region;
>> +        rb = qemu_ram_block_from_host(addr, false, &offset);
>> +        /*
>> +         * With KVM_SET_MEMORY_ATTRIBUTES by kvm_set_memory_attributes(),
>> +         * operation on underlying file descriptor is only for releasing
>> +         * unnecessary pages.
>> +         */
>> +        ram_block_convert_range(rb, offset, size, to_private);
>> +    } else {
>> +        warn_report("Convert non guest_memfd backed memory region "
>> +                    "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s",
>> +                    start, size, to_private ? "private" : "shared");
> 
> Again, if you're returning '-1' to indicate error, then
> using warn_report is wrong, it should be error_report.
> 
> warn_report is for when you return success, indicating
> the problem was non-fatal.

Learned.

Thanks!

>> +    }
>> +
>> +    memory_region_unref(section.mr);
>> +    return ret;
>> +}
>> +
>>   int kvm_cpu_exec(CPUState *cpu)
>>   {
>>       struct kvm_run *run = cpu->kvm_run;
>> @@ -2969,18 +3013,20 @@ int kvm_cpu_exec(CPUState *cpu)
>>                   ret = EXCP_INTERRUPT;
>>                   break;
>>               }
>> -            fprintf(stderr, "error: kvm run failed %s\n",
>> -                    strerror(-run_ret));
>> +            if (!(run_ret == -EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)) {
>> +                fprintf(stderr, "error: kvm run failed %s\n",
>> +                        strerror(-run_ret));
>>   #ifdef TARGET_PPC
>> -            if (run_ret == -EBUSY) {
>> -                fprintf(stderr,
>> -                        "This is probably because your SMT is enabled.\n"
>> -                        "VCPU can only run on primary threads with all "
>> -                        "secondary threads offline.\n");
>> -            }
>> +                if (run_ret == -EBUSY) {
>> +                    fprintf(stderr,
>> +                            "This is probably because your SMT is enabled.\n"
>> +                            "VCPU can only run on primary threads with all "
>> +                            "secondary threads offline.\n");
>> +                }
>>   #endif
>> -            ret = -1;
>> -            break;
>> +                ret = -1;
>> +                break;
>> +            }
>>           }
>>   
>>           trace_kvm_run_exit(cpu->cpu_index, run->exit_reason);
>> @@ -3067,6 +3113,16 @@ int kvm_cpu_exec(CPUState *cpu)
>>                   break;
>>               }
>>               break;
>> +        case KVM_EXIT_MEMORY_FAULT:
>> +            if (run->memory_fault.flags & ~KVM_MEMORY_EXIT_FLAG_PRIVATE) {
>> +                error_report("KVM_EXIT_MEMORY_FAULT: Unknown flag 0x%" PRIx64,
>> +                             (uint64_t)run->memory_fault.flags);
>> +                ret = -1;
>> +                break;
>> +            }
>> +            ret = kvm_convert_memory(run->memory_fault.gpa, run->memory_fault.size,
>> +                                     run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE);
>> +            break;
>>           default:
>>               DPRINTF("kvm_arch_handle_exit\n");
>>               ret = kvm_arch_handle_exit(cpu, run);
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel



* Re: [PATCH v3 14/70] target/i386: Implement mc->kvm_type() to get VM type
  2023-11-15 10:49   ` Daniel P. Berrangé
@ 2023-11-16  6:22     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-16  6:22 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/15/2023 6:49 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:14:23AM -0500, Xiaoyao Li wrote:
>> Implement mc->kvm_type() for i386 machines. It provides a way for user
>> to create SW_PROTECTE_VM.
> 
> Small typo there missing final 'D' in 'PROTECTED'

Thanks for catching it.

I find that the "PROTECTED_VM" part is a leftover from the previous 
series. Since this version drops the "protected-vm" part, this patch 
should fall back to the earlier version, like 
https://lore.kernel.org/qemu-devel/20220802074750.2581308-4-xiaoyao.li@intel.com/

I will merge the next patch into this one in the next version.

>>
>> Also store the vm_type in machinestate to other code to query what the
>> VM type is.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>>   hw/i386/x86.c              | 12 ++++++++++++
>>   include/hw/i386/x86.h      |  1 +
>>   target/i386/kvm/kvm.c      | 25 +++++++++++++++++++++++++
>>   target/i386/kvm/kvm_i386.h |  1 +
>>   4 files changed, 39 insertions(+)
>>
>> diff --git a/hw/i386/x86.c b/hw/i386/x86.c
>> index b3d054889bba..55678279bf3b 100644
>> --- a/hw/i386/x86.c
>> +++ b/hw/i386/x86.c
>> @@ -1377,6 +1377,17 @@ static void machine_set_sgx_epc(Object *obj, Visitor *v, const char *name,
>>       qapi_free_SgxEPCList(list);
>>   }
>>   
>> +static int x86_kvm_type(MachineState *ms, const char *vm_type)
>> +{
>> +    X86MachineState *x86ms = X86_MACHINE(ms);
>> +    int kvm_type;
>> +
>> +    kvm_type = kvm_get_vm_type(ms, vm_type);
>> +    x86ms->vm_type = kvm_type;
>> +
>> +    return kvm_type;
>> +}
>> +
>>   static void x86_machine_initfn(Object *obj)
>>   {
>>       X86MachineState *x86ms = X86_MACHINE(obj);
>> @@ -1401,6 +1412,7 @@ static void x86_machine_class_init(ObjectClass *oc, void *data)
>>       mc->cpu_index_to_instance_props = x86_cpu_index_to_props;
>>       mc->get_default_cpu_node_id = x86_get_default_cpu_node_id;
>>       mc->possible_cpu_arch_ids = x86_possible_cpu_arch_ids;
>> +    mc->kvm_type = x86_kvm_type;
>>       x86mc->save_tsc_khz = true;
>>       x86mc->fwcfg_dma_enabled = true;
>>       nc->nmi_monitor_handler = x86_nmi;
>> diff --git a/include/hw/i386/x86.h b/include/hw/i386/x86.h
>> index da19ae15463a..ab1d38569019 100644
>> --- a/include/hw/i386/x86.h
>> +++ b/include/hw/i386/x86.h
>> @@ -41,6 +41,7 @@ struct X86MachineState {
>>       MachineState parent;
>>   
>>       /*< public >*/
>> +    unsigned int vm_type;
>>   
>>       /* Pointers to devices and objects: */
>>       ISADevice *rtc;
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index b4b9ce89842f..2e47fda25f95 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -161,6 +161,31 @@ static KVMMSRHandlers msr_handlers[KVM_MSR_FILTER_MAX_RANGES];
>>   static RateLimit bus_lock_ratelimit_ctrl;
>>   static int kvm_get_one_msr(X86CPU *cpu, int index, uint64_t *value);
>>   
>> +static const char* vm_type_name[] = {
> 
> nitpick   'char *vm_type_name[]', is normal style

will fix it. Thanks!

>> +    [KVM_X86_DEFAULT_VM] = "default",
>> +    [KVM_X86_SW_PROTECTED_VM] = "sw-protected-vm",
>> +};
>> +
>> +int kvm_get_vm_type(MachineState *ms, const char *vm_type)
>> +{
>> +    int kvm_type = KVM_X86_DEFAULT_VM;
>> +
>> +    /*
>> +     * old KVM doesn't support KVM_CAP_VM_TYPES and KVM_X86_DEFAULT_VM
>> +     * is always supported
>> +     */
>> +    if (kvm_type == KVM_X86_DEFAULT_VM) {
>> +        return kvm_type;
>> +    }
>> +
>> +    if (!(kvm_check_extension(KVM_STATE(ms->accelerator), KVM_CAP_VM_TYPES) & BIT(kvm_type))) {
>> +        error_report("vm-type %s not supported by KVM", vm_type_name[kvm_type]);
>> +        exit(1);
>> +    }
>> +
>> +    return kvm_type;
>> +}
>> +
>>   bool kvm_has_smm(void)
>>   {
>>       return kvm_vm_check_extension(kvm_state, KVM_CAP_X86_SMM);
>> diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
>> index 30fedcffea3e..55fb25fa8e2e 100644
>> --- a/target/i386/kvm/kvm_i386.h
>> +++ b/target/i386/kvm/kvm_i386.h
>> @@ -37,6 +37,7 @@ bool kvm_hv_vpindex_settable(void);
>>   bool kvm_enable_sgx_provisioning(KVMState *s);
>>   bool kvm_hyperv_expand_features(X86CPU *cpu, Error **errp);
>>   
>> +int kvm_get_vm_type(MachineState *ms, const char *vm_type);
>>   void kvm_arch_reset_vcpu(X86CPU *cs);
>>   void kvm_arch_after_reset_vcpu(X86CPU *cpu);
>>   void kvm_arch_do_init_vcpu(X86CPU *cs);
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel



* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15  7:14 ` [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd Xiaoyao Li
  2023-11-15 10:20   ` Daniel P. Berrangé
  2023-11-15 17:54   ` David Hildenbrand
@ 2023-11-17 20:35   ` Isaku Yamahata
  2023-11-30  8:31     ` Xiaoyao Li
  2023-11-20  9:24   ` David Hildenbrand
  3 siblings, 1 reply; 161+ messages in thread
From: Isaku Yamahata @ 2023-11-17 20:35 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata,
	isaku.yamahata

On Wed, Nov 15, 2023 at 02:14:11AM -0500,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> diff --git a/system/physmem.c b/system/physmem.c
> index fc2b0fee0188..0af2213cbd9c 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>          }
>      }
>  
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
> +        new_block->guest_memfd < 0) {
> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
> +        uint64_t flags = 0;
> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
> +                                                        flags, errp);
> +        if (new_block->guest_memfd < 0) {
> +            qemu_mutex_unlock_ramlist();
> +            return;
> +        }
> +    }
> +#endif
> +

We should define a kvm_create_guest_memfd() stub in accel/stubs/kvm-stub.c.
Then we can remove this #ifdef.
-- 
Isaku Yamahata <isaku.yamahata@linux.intel.com>


* Re: [PATCH v3 05/70] kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot
  2023-11-15  7:14 ` [PATCH v3 05/70] kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot Xiaoyao Li
@ 2023-11-17 20:50   ` Isaku Yamahata
  2023-12-04  6:48     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Isaku Yamahata @ 2023-11-17 20:50 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata,
	isaku.yamahata

On Wed, Nov 15, 2023 at 02:14:14AM -0500,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> From: Chao Peng <chao.p.peng@linux.intel.com>
> 
> Switch to KVM_SET_USER_MEMORY_REGION2 when supported by KVM.
> 
> With KVM_SET_USER_MEMORY_REGION2, QEMU can set up memory region that
> backend'ed both by hva-based shared memory and guest memfd based private
> memory.
> 
> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
>  accel/kvm/kvm-all.c      | 56 ++++++++++++++++++++++++++++++++++------
>  accel/kvm/trace-events   |  2 +-
>  include/sysemu/kvm_int.h |  2 ++
>  3 files changed, 51 insertions(+), 9 deletions(-)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 9f751d4971f8..69afeb47c9c0 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -293,35 +293,69 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
>  static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
>  {
>      KVMState *s = kvm_state;
> -    struct kvm_userspace_memory_region mem;
> +    struct kvm_userspace_memory_region2 mem;
> +    static int cap_user_memory2 = -1;
>      int ret;
>  
> +    if (cap_user_memory2 == -1) {
> +        cap_user_memory2 = kvm_check_extension(s, KVM_CAP_USER_MEMORY2);
> +    }
> +
> +    if (!cap_user_memory2 && slot->guest_memfd >= 0) {
> +        error_report("%s, KVM doesn't support KVM_CAP_USER_MEMORY2,"
> +                     " which is required by guest memfd!", __func__);
> +        exit(1);
> +    }
> +
>      mem.slot = slot->slot | (kml->as_id << 16);
>      mem.guest_phys_addr = slot->start_addr;
>      mem.userspace_addr = (unsigned long)slot->ram;
>      mem.flags = slot->flags;
> +    mem.guest_memfd = slot->guest_memfd;
> +    mem.guest_memfd_offset = slot->guest_memfd_offset;
>  
>      if (slot->memory_size && !new && (mem.flags ^ slot->old_flags) & KVM_MEM_READONLY) {
>          /* Set the slot size to 0 before setting the slot to the desired
>           * value. This is needed based on KVM commit 75d61fbc. */
>          mem.memory_size = 0;
> -        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
> +
> +        if (cap_user_memory2) {
> +            ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION2, &mem);
> +        } else {
> +            ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
> +	    }
>          if (ret < 0) {
>              goto err;
>          }
>      }
>      mem.memory_size = slot->memory_size;
> -    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
> +    if (cap_user_memory2) {
> +        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION2, &mem);
> +    } else {
> +        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
> +    }
>      slot->old_flags = mem.flags;
>  err:
>      trace_kvm_set_user_memory(mem.slot >> 16, (uint16_t)mem.slot, mem.flags,
>                                mem.guest_phys_addr, mem.memory_size,
> -                              mem.userspace_addr, ret);
> +                              mem.userspace_addr, mem.guest_memfd,
> +                              mem.guest_memfd_offset, ret);
>      if (ret < 0) {
> -        error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
> -                     " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
> -                     __func__, mem.slot, slot->start_addr,
> -                     (uint64_t)mem.memory_size, strerror(errno));
> +        if (cap_user_memory2) {
> +                error_report("%s: KVM_SET_USER_MEMORY_REGION2 failed, slot=%d,"
> +                        " start=0x%" PRIx64 ", size=0x%" PRIx64 ","
> +                        " flags=0x%" PRIx32 ", guest_memfd=%" PRId32 ","
> +                        " guest_memfd_offset=0x%" PRIx64 ": %s",
> +                        __func__, mem.slot, slot->start_addr,
> +                        (uint64_t)mem.memory_size, mem.flags,
> +                        mem.guest_memfd, (uint64_t)mem.guest_memfd_offset,
> +                        strerror(errno));
> +        } else {
> +                error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
> +                            " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
> +                            __func__, mem.slot, slot->start_addr,
> +                            (uint64_t)mem.memory_size, strerror(errno));
> +        }
>      }
>      return ret;
>  }
> @@ -477,6 +511,9 @@ static int kvm_mem_flags(MemoryRegion *mr)
>      if (readonly && kvm_readonly_mem_allowed) {
>          flags |= KVM_MEM_READONLY;
>      }
> +    if (memory_region_has_guest_memfd(mr)) {
> +        flags |= KVM_MEM_PRIVATE;
> +    }

Nitpick: it was renamed to KVM_MEM_GUEST_MEMFD.
As long as both names are defined to the same value, it doesn't matter, though.
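
To illustrate the point about the two names, here is a minimal sketch (not
taken from the patch). The value (1UL << 2) and the helper
example_mem_flags() are assumptions for illustration only; the authoritative
definition lives in the kernel's UAPI header.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed value, for illustration only. The real definition comes from the
 * kernel's <linux/kvm.h>; check that header rather than trusting this one.
 */
#ifndef KVM_MEM_GUEST_MEMFD
#define KVM_MEM_GUEST_MEMFD (1UL << 2)
#endif

/* Keep the older spelling as an alias so existing callers still build. */
#ifndef KVM_MEM_PRIVATE
#define KVM_MEM_PRIVATE KVM_MEM_GUEST_MEMFD
#endif

/* Fail the build if the two names ever stop meaning the same bit. */
_Static_assert(KVM_MEM_PRIVATE == KVM_MEM_GUEST_MEMFD,
               "KVM_MEM_PRIVATE and KVM_MEM_GUEST_MEMFD must match");

/* Hypothetical stand-in for the flag computation in kvm_mem_flags(). */
static uint32_t example_mem_flags(bool has_guest_memfd)
{
    uint32_t flags = 0;

    if (has_guest_memfd) {
        flags |= KVM_MEM_GUEST_MEMFD;
    }
    return flags;
}
```

With a compile-time check like this, a later header update that changed one
of the two values would break the build instead of silently changing
behavior.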
-- 
Isaku Yamahata <isaku.yamahata@linux.intel.com>

* Re: [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion
  2023-11-15  7:14 ` [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion Xiaoyao Li
@ 2023-11-17 21:03   ` Isaku Yamahata
  2023-12-08  7:59     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Isaku Yamahata @ 2023-11-17 21:03 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata,
	isaku.yamahata

On Wed, Nov 15, 2023 at 02:14:18AM -0500,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> It's used for discarding opposite memory after memory conversion, for
> confidential guest.
> 
> When page is converted from shared to private, the original shared
> memory can be discarded via ram_block_discard_range();
> 
> When page is converted from private to shared, the original private
> memory is back'ed by guest_memfd. Introduce
> ram_block_discard_guest_memfd_range() for discarding memory in
> guest_memfd.
> 
> Originally-from: Isaku Yamahata <isaku.yamahata@intel.com>
> Codeveloped-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
>  include/exec/cpu-common.h |  2 ++
>  system/physmem.c          | 50 +++++++++++++++++++++++++++++++++++++++
>  2 files changed, 52 insertions(+)
> 
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 41115d891940..de728a18eef2 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -175,6 +175,8 @@ typedef int (RAMBlockIterFunc)(RAMBlock *rb, void *opaque);
>  
>  int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
>  int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
> +                            bool shared_to_private);
>  
>  #endif
>  
> diff --git a/system/physmem.c b/system/physmem.c
> index ddfecddefcd6..cd6008fa09ad 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -3641,6 +3641,29 @@ err:
>      return ret;
>  }
>  
> +static int ram_block_discard_guest_memfd_range(RAMBlock *rb, uint64_t start,
> +                                               size_t length)
> +{
> +    int ret = -1;
> +
> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
> +    ret = fallocate(rb->guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
> +                    start, length);
> +
> +    if (ret) {
> +        ret = -errno;
> +        error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
> +                     __func__, rb->idstr, start, length, ret);
> +    }
> +#else
> +    ret = -ENOSYS;
> +    error_report("%s: fallocate not available %s:%" PRIx64 " +%zx (%d)",
> +                 __func__, rb->idstr, start, length, ret);
> +#endif
> +
> +    return ret;
> +}
> +
>  bool ramblock_is_pmem(RAMBlock *rb)
>  {
>      return rb->flags & RAM_PMEM;
> @@ -3828,3 +3851,30 @@ bool ram_block_discard_is_required(void)
>      return qatomic_read(&ram_block_discard_required_cnt) ||
>             qatomic_read(&ram_block_coordinated_discard_required_cnt);
>  }
> +
> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
> +                            bool shared_to_private)
> +{
> +    if (!rb || rb->guest_memfd < 0) {
> +        return -1;
> +    }
> +
> +    if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
> +        !QEMU_PTR_IS_ALIGNED(length, qemu_host_page_size)) {
> +        return -1;
> +    }
> +
> +    if (!length) {
> +        return -1;
> +    }
> +
> +    if (start + length > rb->max_length) {
> +        return -1;
> +    }
> +
> +    if (shared_to_private) {
> +        return ram_block_discard_range(rb, start, length);
> +    } else {
> +        return ram_block_discard_guest_memfd_range(rb, start, length);
> +    }
> +}

Originally this function issued KVM_SET_MEMORY_ATTRIBUTES, so the function
name made sense. But now it doesn't, as it only issues a punch hole. We should
rename it to represent what it actually does. discard_range?
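
To make the naming argument concrete, here is a small sketch of what the
function does after this change: it only picks which backing to discard.
Everything in it (struct discard_ops, the mock_* callbacks,
discard_stale_range()) is hypothetical and exists just to make the dispatch
observable; it is not the QEMU code.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical callback type standing in for the two real discard helpers. */
typedef int (*discard_fn)(uint64_t start, size_t length);

struct discard_ops {
    discard_fn discard_shared;   /* drop the hva-backed (shared) pages */
    discard_fn discard_private;  /* punch a hole in the guest_memfd */
};

/* Mock backends that only count calls, so the dispatch can be observed. */
static int shared_calls;
static int private_calls;

static int mock_discard_shared(uint64_t start, size_t length)
{
    (void)start;
    (void)length;
    shared_calls++;
    return 0;
}

static int mock_discard_private(uint64_t start, size_t length)
{
    (void)start;
    (void)length;
    private_calls++;
    return 0;
}

/*
 * Discard the copy that a range no longer uses: after shared->private the
 * shared pages are stale, after private->shared the guest_memfd pages are.
 * No memory attribute is changed here, which is why a "discard" name fits
 * better than a "convert" name.
 */
static int discard_stale_range(const struct discard_ops *ops, uint64_t start,
                               size_t length, int now_private)
{
    if (!ops || length == 0) {
        return -1;
    }
    return now_private ? ops->discard_shared(start, length)
                       : ops->discard_private(start, length);
}
```

Note the inversion that the current name hides: a shared-to-private
conversion discards the shared side, and the other direction discards the
private side.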
-- 
Isaku Yamahata <isaku.yamahata@linux.intel.com>

* Re: [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES
  2023-11-15  7:14 ` [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES Xiaoyao Li
  2023-11-15 10:54   ` Daniel P. Berrangé
@ 2023-11-17 21:18   ` Isaku Yamahata
  2023-12-07  7:16     ` Xiaoyao Li
  1 sibling, 1 reply; 161+ messages in thread
From: Isaku Yamahata @ 2023-11-17 21:18 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata,
	isaku.yamahata

On Wed, Nov 15, 2023 at 02:14:27AM -0500,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> KVM provides TDX capabilities via sub command KVM_TDX_CAPABILITIES of
> IOCTL(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing
> TDX context. It will be used to validate user's setting later.
> 
> Since there is no interface reporting how many cpuid configs contains in
> KVM_TDX_CAPABILITIES, QEMU chooses to try starting with a known number
> and abort when it exceeds KVM_MAX_CPUID_ENTRIES.
> 
> Besides, introduce the interfaces to invoke TDX "ioctls" at different
> scope (KVM, VM and VCPU) in preparation.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename __tdx_ioctl() to tdx_ioctl_internal()
> - Pass errp in get_tdx_capabilities();
> 
> changes in v2:
>   - Make the error message more clear;
> 
> changes in v1:
>   - start from nr_cpuid_configs = 6 for the loop;
>   - stop the loop when nr_cpuid_configs exceeds KVM_MAX_CPUID_ENTRIES;
> ---
>  target/i386/kvm/kvm.c      |   2 -
>  target/i386/kvm/kvm_i386.h |   2 +
>  target/i386/kvm/tdx.c      | 102 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 103 insertions(+), 3 deletions(-)
> 
> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
> index 7abcdebb1452..28e60c5ea4a7 100644
> --- a/target/i386/kvm/kvm.c
> +++ b/target/i386/kvm/kvm.c
> @@ -1687,8 +1687,6 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>  
>  static Error *invtsc_mig_blocker;
>  
> -#define KVM_MAX_CPUID_ENTRIES  100
> -
>  static void kvm_init_xsave(CPUX86State *env)
>  {
>      if (has_xsave2) {
> diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
> index 55fb25fa8e2e..c3ef46a97a7b 100644
> --- a/target/i386/kvm/kvm_i386.h
> +++ b/target/i386/kvm/kvm_i386.h
> @@ -13,6 +13,8 @@
>  
>  #include "sysemu/kvm.h"
>  
> +#define KVM_MAX_CPUID_ENTRIES  100
> +
>  #ifdef CONFIG_KVM
>  
>  #define kvm_pit_in_kernel() \
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index 621a05beeb4e..cb0040187b27 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -12,17 +12,117 @@
>   */
>  
>  #include "qemu/osdep.h"
> +#include "qemu/error-report.h"
>  #include "qapi/error.h"
>  #include "qom/object_interfaces.h"
> +#include "sysemu/kvm.h"
>  
>  #include "hw/i386/x86.h"
> +#include "kvm_i386.h"
>  #include "tdx.h"
>  
> +static struct kvm_tdx_capabilities *tdx_caps;
> +
> +enum tdx_ioctl_level{
> +    TDX_PLATFORM_IOCTL,
> +    TDX_VM_IOCTL,
> +    TDX_VCPU_IOCTL,
> +};
> +
> +static int tdx_ioctl_internal(void *state, enum tdx_ioctl_level level, int cmd_id,
> +                        __u32 flags, void *data)
> +{
> +    struct kvm_tdx_cmd tdx_cmd;
> +    int r;
> +
> +    memset(&tdx_cmd, 0x0, sizeof(tdx_cmd));
> +
> +    tdx_cmd.id = cmd_id;
> +    tdx_cmd.flags = flags;
> +    tdx_cmd.data = (__u64)(unsigned long)data;
> +
> +    switch (level) {
> +    case TDX_PLATFORM_IOCTL:
> +        r = kvm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
> +        break;
> +    case TDX_VM_IOCTL:
> +        r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
> +        break;
> +    case TDX_VCPU_IOCTL:
> +        r = kvm_vcpu_ioctl(state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
> +        break;
> +    default:
> +        error_report("Invalid tdx_ioctl_level %d", level);
> +        exit(1);
> +    }
> +
> +    return r;
> +}
> +
> +static inline int tdx_platform_ioctl(int cmd_id, __u32 flags, void *data)
> +{
> +    return tdx_ioctl_internal(NULL, TDX_PLATFORM_IOCTL, cmd_id, flags, data);
> +}
> +
> +static inline int tdx_vm_ioctl(int cmd_id, __u32 flags, void *data)
> +{
> +    return tdx_ioctl_internal(NULL, TDX_VM_IOCTL, cmd_id, flags, data);
> +}
> +
> +static inline int tdx_vcpu_ioctl(void *vcpu_fd, int cmd_id, __u32 flags,
> +                                 void *data)
> +{
> +    return  tdx_ioctl_internal(vcpu_fd, TDX_VCPU_IOCTL, cmd_id, flags, data);
> +}

Since not all of the ioctl variants are used yet, we can split them out into
an independent patch that defines the ioctl functions.


> +
> +static int get_tdx_capabilities(Error **errp)
> +{
> +    struct kvm_tdx_capabilities *caps;
> +    /* 1st generation of TDX reports 6 cpuid configs */
> +    int nr_cpuid_configs = 6;
> +    size_t size;
> +    int r;
> +
> +    do {
> +        size = sizeof(struct kvm_tdx_capabilities) +
> +               nr_cpuid_configs * sizeof(struct kvm_tdx_cpuid_config);
> +        caps = g_malloc0(size);
> +        caps->nr_cpuid_configs = nr_cpuid_configs;
> +
> +        r = tdx_vm_ioctl(KVM_TDX_CAPABILITIES, 0, caps);
> +        if (r == -E2BIG) {
> +            g_free(caps);
> +            nr_cpuid_configs *= 2;

g_realloc()?  Maybe a matter of preference.

Other than this, it looks good to me.
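
For reference, a grow-and-retry loop using realloc() could look roughly like
the sketch below. It is self-contained and uses plain libc rather than
g_realloc(); struct caps, mock_query() and MAX_ENTRIES are stand-ins I made
up for the real structure, the real ioctl and KVM_MAX_CPUID_ENTRIES. Checking
the limit right after doubling is also an assumption about the intended
behavior.

```c
#include <errno.h>
#include <stdlib.h>
#include <string.h>

#define MAX_ENTRIES 100 /* plays the role of KVM_MAX_CPUID_ENTRIES */

struct entry {
    unsigned int leaf;
    unsigned int value;
};

struct caps {
    unsigned int nr_entries;
    struct entry entries[]; /* flexible array member */
};

/* Mock query: fails with -E2BIG until the caller offers enough room. */
static int mock_query(struct caps *caps, unsigned int available)
{
    unsigned int i;

    if (caps->nr_entries < available) {
        return -E2BIG;
    }
    caps->nr_entries = available;
    for (i = 0; i < available; i++) {
        caps->entries[i].leaf = i;
    }
    return 0;
}

/* Returns a buffer the caller must free(), or NULL with *err set. */
static struct caps *query_caps(unsigned int available, int *err)
{
    unsigned int nr = 6; /* starting guess, as in the patch */
    struct caps *caps = NULL;

    for (;;) {
        size_t size = sizeof(*caps) + nr * sizeof(struct entry);
        struct caps *tmp = realloc(caps, size);
        int r;

        if (!tmp) {
            free(caps);
            *err = -ENOMEM;
            return NULL;
        }
        caps = tmp;
        memset(caps, 0, size); /* realloc() does not zero the buffer */
        caps->nr_entries = nr;

        r = mock_query(caps, available);
        if (r == 0) {
            *err = 0;
            return caps;
        }
        if (r != -E2BIG) {
            free(caps);
            *err = r;
            return NULL;
        }

        nr *= 2;
        if (nr > MAX_ENTRIES) {
            free(caps);
            *err = -E2BIG;
            return NULL;
        }
    }
}
```

Since the buffer is cleared on every attempt, realloc() saves a free/alloc
pair but may copy old contents that are then overwritten, which is probably
why it comes down to preference.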
-- 
Isaku Yamahata <isaku.yamahata@linux.intel.com>

* Re: [PATCH v3 19/70] i386/tdx: Introduce is_tdx_vm() helper and cache tdx_guest object
  2023-11-15  7:14 ` [PATCH v3 19/70] i386/tdx: Introduce is_tdx_vm() helper and cache tdx_guest object Xiaoyao Li
@ 2023-11-17 21:20   ` Isaku Yamahata
  0 siblings, 0 replies; 161+ messages in thread
From: Isaku Yamahata @ 2023-11-17 21:20 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata,
	isaku.yamahata

On Wed, Nov 15, 2023 at 02:14:28AM -0500,
Xiaoyao Li <xiaoyao.li@intel.com> wrote:

> It will need special handling for TDX VMs all around the QEMU.
> Introduce is_tdx_vm() helper to query if it's a TDX VM.
> 
> Cache tdx_guest object thus no need to cast from ms->cgs every time.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Acked-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
> changes in v3:
> - replace object_dynamic_cast with TDX_GUEST();
> ---
>  target/i386/kvm/tdx.c | 15 ++++++++++++++-
>  target/i386/kvm/tdx.h | 10 ++++++++++
>  2 files changed, 24 insertions(+), 1 deletion(-)
> 
> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
> index cb0040187b27..cf8889f0a8f9 100644
> --- a/target/i386/kvm/tdx.c
> +++ b/target/i386/kvm/tdx.c
> @@ -21,8 +21,16 @@
>  #include "kvm_i386.h"
>  #include "tdx.h"
>  
> +static TdxGuest *tdx_guest;
> +
>  static struct kvm_tdx_capabilities *tdx_caps;
>  
> +/* It's valid after kvm_confidential_guest_init()->kvm_tdx_init() */
> +bool is_tdx_vm(void)
> +{
> +    return !!tdx_guest;
> +}
> +
>  enum tdx_ioctl_level{
>      TDX_PLATFORM_IOCTL,
>      TDX_VM_IOCTL,
> @@ -114,15 +122,20 @@ static int get_tdx_capabilities(Error **errp)
>  
>  int tdx_kvm_init(MachineState *ms, Error **errp)
>  {
> +    TdxGuest *tdx = TDX_GUEST(OBJECT(ms->cgs));
>      int r = 0;
>  
>      ms->require_guest_memfd = true;
>  
>      if (!tdx_caps) {
>          r = get_tdx_capabilities(errp);
> +        if (r) {
> +            return r;
> +        }
>      }
>  
> -    return r;
> +    tdx_guest = tdx;
> +    return 0;
>  }
>  
>  /* tdx guest */
> diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
> index c8a23d95258d..4036ca2f3f99 100644
> --- a/target/i386/kvm/tdx.h
> +++ b/target/i386/kvm/tdx.h
> @@ -1,6 +1,10 @@
>  #ifndef QEMU_I386_TDX_H
>  #define QEMU_I386_TDX_H
>  
> +#ifndef CONFIG_USER_ONLY
> +#include CONFIG_DEVICES /* CONFIG_TDX */
> +#endif
> +
>  #include "exec/confidential-guest-support.h"
>  
>  #define TYPE_TDX_GUEST "tdx-guest"
> @@ -16,6 +20,12 @@ typedef struct TdxGuest {
>      uint64_t attributes;    /* TD attributes */
>  } TdxGuest;
>  
> +#ifdef CONFIG_TDX
> +bool is_tdx_vm(void);
> +#else
> +#define is_tdx_vm() 0
> +#endif /* CONFIG_TDX */
> +
>  int tdx_kvm_init(MachineState *ms, Error **errp);
>  
>  #endif /* QEMU_I386_TDX_H */
> -- 
> 2.34.1
> 
> 

Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com>
-- 
Isaku Yamahata <isaku.yamahata@linux.intel.com>

* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-16  2:45     ` Xiaoyao Li
@ 2023-11-20  9:19       ` David Hildenbrand
  2023-11-30  7:35         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-20  9:19 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 16.11.23 03:45, Xiaoyao Li wrote:
> On 11/16/2023 1:54 AM, David Hildenbrand wrote:
>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>>> and kvm guest memfd based private memory can be associated in one
>>> RAMBlock.
>>>
>>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM ioctl to
>>> create private guest_memfd during RAMBlock setup.
>>>
>>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>>> confidential guests, such as TDX VM. How and when to set it for memory
>>> backends will be implemented in the following patches.
>>
>> Can you elaborate (and add to the patch description if there is good
>> reason) why we need that flag and why we cannot simply rely on the VM
>> type instead to decide whether to allocate a guest_memfd or not?
>>
> 
> The reason is, relying on the VM type is sort of hack that we need to
> get the MachineState instance and retrieve the vm type info. I think
> it's better not to couple them.
> 
> More importantly, it's not flexible and extensible for future case that
> not all the memory need guest memfd.
> 

Okay. In that case, please update the documentation of all functions 
where we are allowed to pass in RAM_GUEST_MEMFD. There are a couple of 
them in include/exec/memory.h

I'll note that the name/terminology of "RAM_GUEST_MEMFD" is extremely 
Linux+kvm specific. But I cannot really come up with something better 
right now.

-- 
Cheers,

David / dhildenb


* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-15  7:14 ` [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd Xiaoyao Li
                     ` (2 preceding siblings ...)
  2023-11-17 20:35   ` Isaku Yamahata
@ 2023-11-20  9:24   ` David Hildenbrand
  2023-11-30  7:37     ` Xiaoyao Li
  3 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-20  9:24 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

>   uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>   {
>       uint8_t mask = mr->dirty_log_mask;
> diff --git a/system/physmem.c b/system/physmem.c
> index fc2b0fee0188..0af2213cbd9c 100644
> --- a/system/physmem.c
> +++ b/system/physmem.c
> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>           }
>       }
>   
> +#ifdef CONFIG_KVM
> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&


I recall that we prefer to write this as

	if (kvm_enabled() && (new_block->flags & RAM_GUEST_MEMFD) &&

> +        new_block->guest_memfd < 0) {
> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
> +        uint64_t flags = 0;
> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
> +                                                        flags, errp);

Get rid of "flags" and just pass 0. Whatever code wants to pass flags
later can decide how to do that.
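
On the parentheses: in C, '&' binds tighter than '&&', so both spellings
evaluate the same way and the request is purely about readability. A small
sketch to show that, where EXAMPLE_RAM_GUEST_MEMFD and both helper names are
made up for illustration (the real flag value is not taken from the patch):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flag bit, chosen only for this example. */
#define EXAMPLE_RAM_GUEST_MEMFD (1u << 3)

/* Preferred spelling: the mask test is visibly one operand of '&&'. */
static bool needs_guest_memfd(bool kvm_on, uint32_t flags, int fd)
{
    return kvm_on && (flags & EXAMPLE_RAM_GUEST_MEMFD) && fd < 0;
}

/*
 * Same result without the parentheses, because '&' has higher precedence
 * than '&&'. The reader has to recall the precedence table to see it.
 */
static bool needs_guest_memfd_terse(bool kvm_on, uint32_t flags, int fd)
{
    return kvm_on && flags & EXAMPLE_RAM_GUEST_MEMFD && fd < 0;
}
```

The explicit form also guards against a later edit such as
"flags & MASK == 0", where '==' binds tighter than '&' and the result is no
longer what was intended.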

-- 
Cheers,

David / dhildenb


* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-16  2:47     ` Xiaoyao Li
@ 2023-11-20  9:26       ` David Hildenbrand
  2023-11-30  7:32         ` Xiaoyao Li
  2023-11-30  8:00         ` Xiaoyao Li
  0 siblings, 2 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-11-20  9:26 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang


>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
> 
> Get caught.
> 
>> This should be factored out into a common helper.
> 
> Sure, will do it in next version.

Factor it out in a separate patch. Then this patch gets so small that
you can just squash it into #2.

And my comment regarding "flags = 0" to patch #2 no longer applies :)

-- 
Cheers,

David / dhildenb


* Re: [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState
  2023-11-16  2:53     ` Xiaoyao Li
@ 2023-11-20  9:30       ` David Hildenbrand
  2023-11-30  7:38         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-20  9:30 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 16.11.23 03:53, Xiaoyao Li wrote:
> On 11/16/2023 2:14 AM, David Hildenbrand wrote:
>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>> Add a new member "require_guest_memfd" to memory backends. When it's set
>>> to true, it enables RAM_GUEST_MEMFD in ram_flags, thus private kvm
>>> guest_memfd will be allocated during RAMBlock allocation.
>>>
>>> Memory backend's @require_guest_memfd is wired with @require_guest_memfd
>>> field of MachineState. MachineState::require_guest_memfd is supposed to
>>> be set by any VMs that requires KVM guest memfd as private memory, e.g.,
>>> TDX VM.
>>>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>
>> I'm confused, why do we need this if it's going to be the same for all
>> memory backends right now?
>>
> 
> I want to provide a elegant (in my sense) way to configure "the need of
> guest memfd" instead of checking x86machinestate->vm_type in physmem.c
> 

It's suboptimal right now, but I guess you want to avoid looking up the 
machine e.g., in ram_backend_memory_alloc().

I'd suggest s/require_guest_memfd/guest_memfd/gc in "struct 
HostMemoryBackend".

Apart from that LGTM.

-- 
Cheers,

David / dhildenb


* Re: [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-11-16  2:56     ` Xiaoyao Li
@ 2023-11-20  9:56       ` David Hildenbrand
  2023-12-04  7:35         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-20  9:56 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 16.11.23 03:56, Xiaoyao Li wrote:
> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>> ram_block_discard_range() which grabs some code from
>>> ram_discard_range(). However, during code movement, it changed alignment
>>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>>
>>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>>> huge page size aligned, which is a overkill. e.g., TDX's private-shared
>>> page conversion is done at 4KB granularity. Shared page is discarded
>>> when it gets converts to private and when shared page back'ed by
>>> hugepage it is going to fail on this check.
>>>
>>> So change to alignment check back to qemu_host_page_size.
>>>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> ---
>>> Changes in v3:
>>>    - Newly added in v3;
>>> ---
>>>    system/physmem.c | 2 +-
>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index c56b17e44df6..8a4e42c7cf60 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>> uint64_t start, size_t length)
>>>        uint8_t *host_startaddr = rb->host + start;
>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>
>> For your use cases, rb->page_size should always match qemu_host_page_size.
>>
>> IIRC, we only set rb->page_size to different values for hugetlb. And
>> guest_memfd does not support hugetlb.
>>
>> Even if QEMU is using THP, rb->page_size should 4k.
>>
>> Please elaborate how you can actually trigger that. From what I recall,
>> guest_memfd is not compatible with hugetlb.
> 
> It's the shared memory that can be back'ed by hugetlb.

Serious question: does that configuration make any sense to support at 
this point? I claim: no.

> 
> Later patch 9 introduces ram_block_convert_page(), which will discard
> shared memory when it gets converted to private. TD guest can request
> convert a 4K to private while the page is previously back'ed by hugetlb
> as 2M shared page.

So you can call ram_block_discard_guest_memfd_range() on a subpage basis,
but not ram_block_discard_range().

ram_block_convert_range() would have to be taught that, with that
(questionable) combination of hugetlb for shmem and ordinary pages for
guest_memfd, it cannot discard shared memory.

And it probably shouldn't either way. There are other problems when not 
using hugetlb along with preallocation.

The check in ram_block_discard_range() is correct; whoever ends up
calling it has to stop calling it.
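
To spell out why the 4K case fails the check, here is a simplified model of
the alignment test. is_aligned() and can_discard() are hypothetical helpers
written for this mail, not the QEMU macros, and they assume the page size is
a power of two.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when value is a multiple of align; align must be a power of two. */
static bool is_aligned(uint64_t value, uint64_t align)
{
    return (value & (align - 1)) == 0;
}

/*
 * Simplified model: a range can only be discarded when both its start and
 * its length are multiples of the page size that backs the block.
 */
static bool can_discard(uint64_t start, uint64_t length,
                        uint64_t backing_page_size)
{
    return length != 0 &&
           is_aligned(start, backing_page_size) &&
           is_aligned(length, backing_page_size);
}
```

With a 2M hugetlb backing, can_discard(0x201000, 0x1000, 0x200000) is false,
while the same 4K range against a 4K backing,
can_discard(0x201000, 0x1000, 0x1000), is true. A 4K conversion request
therefore cannot be turned into a discard of part of a 2M hugetlb page.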

-- 
Cheers,

David / dhildenb


* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-20  9:26       ` David Hildenbrand
@ 2023-11-30  7:32         ` Xiaoyao Li
  2023-11-30 10:59           ` David Hildenbrand
  2023-11-30  8:00         ` Xiaoyao Li
  1 sibling, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-30  7:32 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/20/2023 5:26 PM, David Hildenbrand wrote:
> 
>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>
>> Get caught.
>>
>>> This should be factored out into a common helper.
>>
>> Sure, will do it in next version.
> 
> Factor it out in a separate patch. Then, this patch is get small that 
> you can just squash it into #2.
> 
> And my comment regarding "flags = 0" to patch #2 does no longer apply :)
> 

I see.

But it depends on whether KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will land together
with the initial guest memfd support in Linux (hopefully 6.8):
https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/

If, as Paolo committed, there is no KVM_GUEST_MEMFD_ALLOW_HUGEPAGE in the
initial merge, I will simplify patch #2. Otherwise I will factor out a
common function to get the hugepage size, as you suggested.

Thanks!

* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-20  9:19       ` David Hildenbrand
@ 2023-11-30  7:35         ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-30  7:35 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/20/2023 5:19 PM, David Hildenbrand wrote:
> On 16.11.23 03:45, Xiaoyao Li wrote:
>> On 11/16/2023 1:54 AM, David Hildenbrand wrote:
>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>> Add KVM guest_memfd support to RAMBlock so both normal hva based memory
>>>> and kvm guest memfd based private memory can be associated in one
>>>> RAMBlock.
>>>>
>>>> Introduce new flag RAM_GUEST_MEMFD. When it's set, it calls KVM 
>>>> ioctl to
>>>> create private guest_memfd during RAMBlock setup.
>>>>
>>>> Note, RAM_GUEST_MEMFD is supposed to be set for memory backends of
>>>> confidential guests, such as TDX VM. How and when to set it for memory
>>>> backends will be implemented in the following patches.
>>>
>>> Can you elaborate (and add to the patch description if there is good
>>> reason) why we need that flag and why we cannot simply rely on the VM
>>> type instead to decide whether to allocate a guest_memfd or not?
>>>
>>
>> The reason is, relying on the VM type is sort of hack that we need to
>> get the MachineState instance and retrieve the vm type info. I think
>> it's better not to couple them.
>>
>> More importantly, it's not flexible and extensible for future case that
>> not all the memory need guest memfd.
>>
> 
> Okay. In that case, please update the documentation of all functions 
> where we are allowed to pass in RAM_GUEST_MEMFD. There are a couple of 
> them in include/exec/memory.h

sure, thanks!

> I'll note that the name/terminology of "RAM_GUEST_MEMFD" is extremely 
> Linux+kvm specific. But I cannot really come up with something better 
> right now.
> 


* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-20  9:24   ` David Hildenbrand
@ 2023-11-30  7:37     ` Xiaoyao Li
  2023-11-30 11:01       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-30  7:37 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/20/2023 5:24 PM, David Hildenbrand wrote:
>>   uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>>   {
>>       uint8_t mask = mr->dirty_log_mask;
>> diff --git a/system/physmem.c b/system/physmem.c
>> index fc2b0fee0188..0af2213cbd9c 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, 
>> Error **errp)
>>           }
>>       }
>> +#ifdef CONFIG_KVM
>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
> 
> 
> I recall that we prefer to write this as
> 
>      if (kvm_enabled() && (new_block->flags & RAM_GUEST_MEMFD) &&

Got it.

Thanks!

>> +        new_block->guest_memfd < 0) {
>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is 
>> supported */
>> +        uint64_t flags = 0;
>> +        new_block->guest_memfd = 
>> kvm_create_guest_memfd(new_block->max_length,
>> +                                                        flags, errp);
> 
> Get rid of "flags" and just pass 0". Whatever code wants to pass flags 
> later can decide how to do that.


For how to handle it, please see my reply to patch 3.


* Re: [PATCH v3 04/70] HostMem: Add mechanism to opt in kvm guest memfd via MachineState
  2023-11-20  9:30       ` David Hildenbrand
@ 2023-11-30  7:38         ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-30  7:38 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/20/2023 5:30 PM, David Hildenbrand wrote:
> On 16.11.23 03:53, Xiaoyao Li wrote:
>> On 11/16/2023 2:14 AM, David Hildenbrand wrote:
>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>> Add a new member "require_guest_memfd" to memory backends. When it's 
>>>> set
>>>> to true, it enables RAM_GUEST_MEMFD in ram_flags, thus private kvm
>>>> guest_memfd will be allocated during RAMBlock allocation.
>>>>
>>>> Memory backend's @require_guest_memfd is wired with 
>>>> @require_guest_memfd
>>>> field of MachineState. MachineState::require_guest_memfd is supposed to
>>>> be set by any VMs that requires KVM guest memfd as private memory, 
>>>> e.g.,
>>>> TDX VM.
>>>>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>
>>> I'm confused, why do we need this if it's going to be the same for all
>>> memory backends right now?
>>>
>>
>> I want to provide a elegant (in my sense) way to configure "the need of
>> guest memfd" instead of checking x86machinestate->vm_type in physmem.c
>>
> 
> It's suboptimal right now, but I guess you want to avoid looking up the 
> machine e.g., in ram_backend_memory_alloc().
> 
> I'd suggest s/require_guest_memfd/guest_memfd/gc in "struct 
> HostMemoryBackend".

sure!

> Apart from that LGTM.
> 


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-20  9:26       ` David Hildenbrand
  2023-11-30  7:32         ` Xiaoyao Li
@ 2023-11-30  8:00         ` Xiaoyao Li
  2023-12-01 11:00           ` David Hildenbrand
  1 sibling, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-30  8:00 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/20/2023 5:26 PM, David Hildenbrand wrote:
> 
>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>
>> Get caught.
>>
>>> This should be factored out into a common helper.
>>
>> Sure, will do it in next version.
> 
> Factor it out in a separate patch. Then this patch gets so small that
> you can just squash it into #2.

A silly question. What file should the factored function be put in?

> And my comment regarding "flags = 0" to patch #2 does no longer apply :)
> 


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-17 20:35   ` Isaku Yamahata
@ 2023-11-30  8:31     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-11-30  8:31 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata

On 11/18/2023 4:35 AM, Isaku Yamahata wrote:
> On Wed, Nov 15, 2023 at 02:14:11AM -0500,
> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> 
>> diff --git a/system/physmem.c b/system/physmem.c
>> index fc2b0fee0188..0af2213cbd9c 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block, Error **errp)
>>           }
>>       }
>>   
>> +#ifdef CONFIG_KVM
>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
>> +        new_block->guest_memfd < 0) {
>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is supported */
>> +        uint64_t flags = 0;
>> +        new_block->guest_memfd = kvm_create_guest_memfd(new_block->max_length,
>> +                                                        flags, errp);
>> +        if (new_block->guest_memfd < 0) {
>> +            qemu_mutex_unlock_ramlist();
>> +            return;
>> +        }
>> +    }
>> +#endif
>> +
> 
> We should define a kvm_create_guest_memfd() stub in accel/stubs/kvm-stub.c.
> We can remove this #ifdef.

Nice suggestion! Will use stub.
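
For illustration, the stub approach could look roughly like the sketch
below. This is not the actual patch: the (size, flags, errp) arguments
follow the call in the hunk above, while the int return type, the -ENOSYS
return value, the opaque Error typedef, the kvm_enabled() stand-in, the
RAM_GUEST_MEMFD value and the RAMBlockSketch struct are all assumptions
made so the example compiles on its own.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for QEMU's Error type (assumption, kept opaque here). */
typedef struct Error Error;

/* Placeholder flag value; the real RAM_GUEST_MEMFD bit may differ. */
#define RAM_GUEST_MEMFD (1u << 0)

/* Stand-in for kvm_enabled(): a build without KVM always reports false. */
bool kvm_enabled(void)
{
    return false;
}

/*
 * Hypothetical stub, compiled in place of the real implementation when
 * KVM support is not built. It only has to exist so callers can link.
 */
int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp)
{
    (void)size;
    (void)flags;
    (void)errp;
    return -ENOSYS;
}

/* Reduced stand-in for the RAMBlock fields used by the hunk. */
struct RAMBlockSketch {
    uint32_t flags;
    uint64_t max_length;
    int guest_memfd;
};

/* The caller no longer needs #ifdef CONFIG_KVM around this block. */
int ram_block_setup_guest_memfd(struct RAMBlockSketch *block, Error **errp)
{
    if (kvm_enabled() && (block->flags & RAM_GUEST_MEMFD) &&
        block->guest_memfd < 0) {
        block->guest_memfd = kvm_create_guest_memfd(block->max_length,
                                                    0, errp);
        if (block->guest_memfd < 0) {
            return -1;
        }
    }
    return 0;
}
```

With the stub always linked in, the kvm_enabled() check keeps it from
being reached in a non-KVM build, so the #ifdef becomes unnecessary.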

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30  7:32         ` Xiaoyao Li
@ 2023-11-30 10:59           ` David Hildenbrand
  2023-11-30 16:01             ` Sean Christopherson
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-30 10:59 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 30.11.23 08:32, Xiaoyao Li wrote:
> On 11/20/2023 5:26 PM, David Hildenbrand wrote:
>>
>>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>>
>>> Get caught.
>>>
>>>> This should be factored out into a common helper.
>>>
>>> Sure, will do it in next version.
>>
>> Factor it out in a separate patch. Then this patch gets so small that
>> you can just squash it into #2.
>>
>> And my comment regarding "flags = 0" to patch #2 does no longer apply :)
>>
> 
> I see.
> 
> But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
> with initial guest memfd in linux (hopefully 6.8)
> https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/
> 

Doesn't seem to be in -next if I am looking at the right tree:

https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 02/70] RAMBlock: Add support of KVM private guest memfd
  2023-11-30  7:37     ` Xiaoyao Li
@ 2023-11-30 11:01       ` David Hildenbrand
  0 siblings, 0 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-11-30 11:01 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 30.11.23 08:37, Xiaoyao Li wrote:
> On 11/20/2023 5:24 PM, David Hildenbrand wrote:
>>>    uint8_t memory_region_get_dirty_log_mask(MemoryRegion *mr)
>>>    {
>>>        uint8_t mask = mr->dirty_log_mask;
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index fc2b0fee0188..0af2213cbd9c 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -1841,6 +1841,20 @@ static void ram_block_add(RAMBlock *new_block,
>>> Error **errp)
>>>            }
>>>        }
>>> +#ifdef CONFIG_KVM
>>> +    if (kvm_enabled() && new_block->flags & RAM_GUEST_MEMFD &&
>>
>>
>> I recall that we prefer to write this as
>>
>>       if (kvm_enabled() && (new_block->flags & RAM_GUEST_MEMFD) &&
> 
> get it.
> 
> Thanks!
> 
>>> +        new_block->guest_memfd < 0) {
>>> +        /* TODO: to decide if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE is
>>> supported */
>>> +        uint64_t flags = 0;
>>> +        new_block->guest_memfd =
>>> kvm_create_guest_memfd(new_block->max_length,
>>> +                                                        flags, errp);
>>
>> Get rid of "flags" and just pass 0. Whatever code wants to pass flags
>> later can decide how to do that.
> 
> 
> How to handle it please see the reply to patch 3.

If patch #3 cannot go in now and has to be deferred, then please clean
this up here. Otherwise, as suggested, squash it with #3.

Depending on KVM_GUEST_MEMFD_ALLOW_HUGEPAGE support :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 10:59           ` David Hildenbrand
@ 2023-11-30 16:01             ` Sean Christopherson
  2023-11-30 16:54               ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Sean Christopherson @ 2023-11-30 16:01 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Claudio Fontana, Gerd Hoffmann, Isaku Yamahata,
	Chenyi Qiang

On Thu, Nov 30, 2023, David Hildenbrand wrote:
> On 30.11.23 08:32, Xiaoyao Li wrote:
> > On 11/20/2023 5:26 PM, David Hildenbrand wrote:
> > > 
> > > > > ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
> > > > 
> > > > Get caught.
> > > > 
> > > > > This should be factored out into a common helper.
> > > > 
> > > > Sure, will do it in next version.
> > > 
> > > Factor it out in a separate patch. Then this patch gets so small that
> > > you can just squash it into #2.
> > > 
> > > And my comment regarding "flags = 0" to patch #2 does no longer apply :)
> > > 
> > 
> > I see.
> > 
> > But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
> > with initial guest memfd in linux (hopefully 6.8)
> > https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/
> > 
> 
> Doesn't seem to be in -next if I am looking at the right tree:
> 
> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next

Yeah, we punted on adding hugepage support for the initial guest_memfd merge so
as not to rush in kludgy uABI.  The internal KVM code isn't problematic, we just
haven't figured out exactly what the ABI should look like, e.g. should hugepages
be dependent on THP being enabled, and if not, how does userspace discover the
supported hugepage sizes?

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 16:01             ` Sean Christopherson
@ 2023-11-30 16:54               ` David Hildenbrand
  2023-11-30 17:46                 ` Peter Xu
  2023-11-30 17:51                 ` Daniel P. Berrangé
  0 siblings, 2 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-11-30 16:54 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Claudio Fontana, Gerd Hoffmann, Isaku Yamahata,
	Chenyi Qiang

On 30.11.23 17:01, Sean Christopherson wrote:
> On Thu, Nov 30, 2023, David Hildenbrand wrote:
>> On 30.11.23 08:32, Xiaoyao Li wrote:
>>> On 11/20/2023 5:26 PM, David Hildenbrand wrote:
>>>>
>>>>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>>>>
>>>>> Get caught.
>>>>>
>>>>>> This should be factored out into a common helper.
>>>>>
>>>>> Sure, will do it in next version.
>>>>
>>>> Factor it out in a separate patch. Then this patch gets so small that
>>>> you can just squash it into #2.
>>>>
>>>> And my comment regarding "flags = 0" to patch #2 does no longer apply :)
>>>>
>>>
>>> I see.
>>>
>>> But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
>>> with initial guest memfd in linux (hopefully 6.8)
>>> https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/
>>>
>>
>> Doesn't seem to be in -next if I am looking at the right tree:
>>
>> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
> 
> Yeah, we punted on adding hugepage support for the initial guest_memfd merge so
> as not to rush in kludgy uABI.  The internal KVM code isn't problematic, we just
> haven't figured out exactly what the ABI should look like, e.g. should hugepages
> be dependent on THP being enabled, and if not, how does userspace discover the
> supported hugepage sizes?

Are we talking about THP or hugetlb? They are two different things, and 
"KVM_GUEST_MEMFD_ALLOW_HUGEPAGE" doesn't make it clearer what we are 
talking about.

This patch here "get_thp_size()" indicates that we care about THP, not 
hugetlb.


THP lives in:
	/sys/kernel/mm/transparent_hugepage/
and hugetlb in:
	/sys/kernel/mm/hugepages/

THP for shmem+anon currently really only supports PMD-sized THP; that
size can be observed via:
	/sys/kernel/mm/transparent_hugepage/hpage_pmd_size

hugetlb sizes can be detected simply by looking at the folders inside
/sys/kernel/mm/hugepages/. "tools/testing/selftests/mm/vm_util.c" in the 
kernel has a function "detect_hugetlb_page_sizes()" that uses that 
interface to detect the sizes.


But likely we want THP support here. Because for hugetlb, one would 
actually have to instruct the kernel which size to use, like we do for 
memfd with hugetlb.


Anon support for smaller sizes than PMDs is in the works, and once 
upstream, it can then be detected via 
/sys/kernel/mm/transparent_hugepage/ as well.

shmem support for smaller sizes is partially in the works: only on the 
write() path. Likely, we'll make it configurable/observable in 
/sys/kernel/mm/transparent_hugepage/ as well.


So if we are talking about THP for shmem, there really only is 
/sys/kernel/mm/transparent_hugepage/hpage_pmd_size.
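
As a rough sketch of the two discovery paths described above, the
following reads the PMD-sized THP size from hpage_pmd_size and lists the
hugetlb sizes by parsing the "hugepages-<N>kB" directory names. The sysfs
paths are the ones named above; the helper names (read_u64_file(),
parse_hugetlb_dirname(), print_hugepage_sizes()) are made up for this
example and are not QEMU or kernel APIs.

```c
#include <dirent.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

#define THP_PMD_SIZE_PATH "/sys/kernel/mm/transparent_hugepage/hpage_pmd_size"
#define HUGETLB_SYSFS_DIR "/sys/kernel/mm/hugepages"

/* Read one decimal value from a sysfs file. Returns 0 on success. */
int read_u64_file(const char *path, uint64_t *val)
{
    FILE *f = fopen(path, "r");
    int ok;

    if (!f) {
        return -1;
    }
    ok = fscanf(f, "%" SCNu64, val) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Parse a directory name of the form "hugepages-<N>kB" into bytes. */
int parse_hugetlb_dirname(const char *name, uint64_t *bytes)
{
    uint64_t kb;
    int end = 0;

    /* %n is only stored if the trailing "kB" literal matched. */
    if (sscanf(name, "hugepages-%" SCNu64 "kB%n", &kb, &end) != 1 ||
        end == 0 || name[end] != '\0') {
        return -1;
    }
    *bytes = kb * 1024;
    return 0;
}

/* Print the PMD-sized THP size and every hugetlb size found in sysfs. */
void print_hugepage_sizes(void)
{
    uint64_t size;
    struct dirent *entry;
    DIR *dir;

    if (read_u64_file(THP_PMD_SIZE_PATH, &size) == 0) {
        printf("THP (PMD-sized): %" PRIu64 " bytes\n", size);
    }

    dir = opendir(HUGETLB_SYSFS_DIR);
    if (!dir) {
        return;
    }
    while ((entry = readdir(dir)) != NULL) {
        if (parse_hugetlb_dirname(entry->d_name, &size) == 0) {
            printf("hugetlb: %" PRIu64 " bytes\n", size);
        }
    }
    closedir(dir);
}
```

Calling print_hugepage_sizes() prints nothing on a system where neither
interface is present, which is why both lookups treat a missing file or
directory as "not supported" rather than as an error.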

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 16:54               ` David Hildenbrand
@ 2023-11-30 17:46                 ` Peter Xu
  2023-11-30 17:57                   ` David Hildenbrand
  2023-11-30 17:51                 ` Daniel P. Berrangé
  1 sibling, 1 reply; 161+ messages in thread
From: Peter Xu @ 2023-11-30 17:46 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Xiaoyao Li, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Claudio Fontana, Gerd Hoffmann, Isaku Yamahata,
	Chenyi Qiang

On Thu, Nov 30, 2023 at 05:54:26PM +0100, David Hildenbrand wrote:
> But likely we want THP support here. Because for hugetlb, one would actually
> have to instruct the kernel which size to use, like we do for memfd with
> hugetlb.

I doubt it, as a VM can still leverage larger sizes if possible?

IIUC one of the major challenges of gmem hugepage is how to support
security features while reusing existing mm infrastructures as much as
possible.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 16:54               ` David Hildenbrand
  2023-11-30 17:46                 ` Peter Xu
@ 2023-11-30 17:51                 ` Daniel P. Berrangé
  2023-11-30 18:22                   ` David Hildenbrand
  2023-12-01 11:22                   ` Claudio Fontana
  1 sibling, 2 replies; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-11-30 17:51 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Sean Christopherson, Xiaoyao Li, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On Thu, Nov 30, 2023 at 05:54:26PM +0100, David Hildenbrand wrote:
> On 30.11.23 17:01, Sean Christopherson wrote:
> > On Thu, Nov 30, 2023, David Hildenbrand wrote:
> > > On 30.11.23 08:32, Xiaoyao Li wrote:
> > > > On 11/20/2023 5:26 PM, David Hildenbrand wrote:
> > > > > 
> > > > > > > ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
> > > > > > 
> > > > > > Get caught.
> > > > > > 
> > > > > > > This should be factored out into a common helper.
> > > > > > 
> > > > > > Sure, will do it in next version.
> > > > > 
> > > > > Factor it out in a separate patch. Then this patch gets so small that
> > > > > you can just squash it into #2.
> > > > > 
> > > > > And my comment regarding "flags = 0" to patch #2 does no longer apply :)
> > > > > 
> > > > 
> > > > I see.
> > > > 
> > > > But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
> > > > with initial guest memfd in linux (hopefully 6.8)
> > > > https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/
> > > > 
> > > 
> > > Doesn't seem to be in -next if I am looking at the right tree:
> > > 
> > > https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
> > 
> > Yeah, we punted on adding hugepage support for the initial guest_memfd merge so
> > as not to rush in kludgy uABI.  The internal KVM code isn't problematic, we just
> > haven't figured out exactly what the ABI should look like, e.g. should hugepages
> > be dependent on THP being enabled, and if not, how does userspace discover the
> > supported hugepage sizes?
> 
> Are we talking about THP or hugetlb? They are two different things, and
> "KVM_GUEST_MEMFD_ALLOW_HUGEPAGE" doesn't make it clearer what we are talking
> about.
> 
> This patch here "get_thp_size()" indicates that we care about THP, not
> hugetlb.
> 
> 
> THP lives in:
> 	/sys/kernel/mm/transparent_hugepage/
> and hugetlb in:
> 	/sys/kernel/mm/hugepages/
> 
> THP for shmem+anon currently really only supports PMD-sized THP, that size
> can be observed via:
> 	/sys/kernel/mm/transparent_hugepage/hpage_pmd_size
> 
> hugetlb sizes can be detected simply by looking at the folders inside
> /sys/kernel/mm/hugepages/. "tools/testing/selftests/mm/vm_util.c" in the
> kernel has a function "detect_hugetlb_page_sizes()" that uses that interface
> to detect the sizes.
> 
> 
> But likely we want THP support here. Because for hugetlb, one would actually
> have to instruct the kernel which size to use, like we do for memfd with
> hugetlb.

Would we not want both ultimately ?

THP is good because it increases performance vs non-HP out of the box
without the user or mgmt app having to make any decisions.

It does not give you deterministic performance though, because it has
to opportunistically assign huge pages based on what is available and
that may differ each time a VM is launched.  Explicit admin/mgmt app
controlled huge page usage gives determinism, at the cost of increased
mgmt overhead.

Both are valid use cases depending on the tradeoff a deployment and/or
mgmt app wants to make.
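
For comparison, the explicit route already exists for ordinary memfd with
hugetlb: the caller picks the page size and encodes log2(size) into the
memfd_create(2) flags. Below is a minimal sketch of that encoding,
assuming the MFD_HUGETLB and MFD_HUGE_SHIFT definitions from
<linux/memfd.h>. memfd_hugetlb_flags() is a hypothetical helper name, and
nothing here implies that guest_memfd accepts such flags.

```c
#include <linux/memfd.h>
#include <stdint.h>

/*
 * Build memfd_create() flags that request hugetlb pages of the given
 * size. Returns 0 if page_size is not a power of two.
 */
unsigned int memfd_hugetlb_flags(uint64_t page_size)
{
    unsigned int shift = 0;

    if (page_size == 0 || (page_size & (page_size - 1)) != 0) {
        return 0;
    }
    /* Compute log2(page_size); the kernel expects it above MFD_HUGE_SHIFT. */
    while ((page_size >> shift) != 1) {
        shift++;
    }
    return MFD_HUGETLB | (shift << MFD_HUGE_SHIFT);
}
```

The result would be OR-ed into the flags argument of memfd_create(2);
whether the allocation later succeeds still depends on the admin having
reserved pages of that size.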


With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 17:46                 ` Peter Xu
@ 2023-11-30 17:57                   ` David Hildenbrand
  2023-11-30 18:09                     ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-11-30 17:57 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Xiaoyao Li, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Claudio Fontana, Gerd Hoffmann, Isaku Yamahata,
	Chenyi Qiang

On 30.11.23 18:46, Peter Xu wrote:
> On Thu, Nov 30, 2023 at 05:54:26PM +0100, David Hildenbrand wrote:
>> But likely we want THP support here. Because for hugetlb, one would actually
>> have to instruct the kernel which size to use, like we do for memfd with
>> hugetlb.
> 
> I doubt it, as VM can still leverage larger sizes if possible?

What do you doubt? I am talking about the current implementation and 
expected semantics of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 17:57                   ` David Hildenbrand
@ 2023-11-30 18:09                     ` David Hildenbrand
  0 siblings, 0 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-11-30 18:09 UTC (permalink / raw)
  To: Peter Xu
  Cc: Sean Christopherson, Xiaoyao Li, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Claudio Fontana, Gerd Hoffmann, Isaku Yamahata,
	Chenyi Qiang

On 30.11.23 18:57, David Hildenbrand wrote:
> On 30.11.23 18:46, Peter Xu wrote:
>> On Thu, Nov 30, 2023 at 05:54:26PM +0100, David Hildenbrand wrote:
>>> But likely we want THP support here. Because for hugetlb, one would actually
>>> have to instruct the kernel which size to use, like we do for memfd with
>>> hugetlb.
>>
>> I doubt it, as VM can still leverage larger sizes if possible?
> 
> What do you doubt? I am talking about the current implementation and
> expected semantics of KVM_GUEST_MEMFD_ALLOW_HUGEPAGE.
> 

I looked at the kernel implementation, and it simply allocates a 
PMD-sized folio and puts it into the pagecache. So hugetlb is not involved.

That raises various questions:

1) What are the semantics if we ever allow migrating/compacting such
    folios? Would we allow splitting them into smaller pages when
    required (or compacting into larger)? What happens when we would
    partially zap them (fallocate?) right now? IOW, do they behave like
    THP, and do we want them to behave like THP?

2) If they behave like THP, how would we be able to compact them into
    bigger pages? khugepaged only works on VMAs IIRC.

3) How would you allocate gigantic pages if not with the help of hugetlb
    and reserved pools? At least as of today, runtime allocation of
    gigantic pages is extremely unreliable and compaction into gigantic
    pages does not work. So gigantic pages would be something for the
    far distant future.

4) cont-pte-sized folios?

Maybe it's all clarified already, in that case I'd appreciate a pointer.

Looking at the current code, it looks like it behaves like shmem thp, 
just without any way to collapse afterwards (unless I am missing something).

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 17:51                 ` Daniel P. Berrangé
@ 2023-11-30 18:22                   ` David Hildenbrand
  2023-12-01 11:22                   ` Claudio Fontana
  1 sibling, 0 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-11-30 18:22 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Sean Christopherson, Xiaoyao Li, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On 30.11.23 18:51, Daniel P. Berrangé wrote:
> On Thu, Nov 30, 2023 at 05:54:26PM +0100, David Hildenbrand wrote:
>> On 30.11.23 17:01, Sean Christopherson wrote:
>>> On Thu, Nov 30, 2023, David Hildenbrand wrote:
>>>> On 30.11.23 08:32, Xiaoyao Li wrote:
>>>>> On 11/20/2023 5:26 PM, David Hildenbrand wrote:
>>>>>>
>>>>>>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>>>>>>
>>>>>>> Get caught.
>>>>>>>
>>>>>>>> This should be factored out into a common helper.
>>>>>>>
>>>>>>> Sure, will do it in next version.
>>>>>>
>>>>>> Factor it out in a separate patch. Then this patch gets so small that
>>>>>> you can just squash it into #2.
>>>>>>
>>>>>> And my comment regarding "flags = 0" to patch #2 does no longer apply :)
>>>>>>
>>>>>
>>>>> I see.
>>>>>
>>>>> But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
>>>>> with initial guest memfd in linux (hopefully 6.8)
>>>>> https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/
>>>>>
>>>>
>>>> Doesn't seem to be in -next if I am looking at the right tree:
>>>>
>>>> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
>>>
>>> Yeah, we punted on adding hugepage support for the initial guest_memfd merge so
>>> as not to rush in kludgy uABI.  The internal KVM code isn't problematic, we just
>>> haven't figured out exactly what the ABI should look like, e.g. should hugepages
>>> be dependent on THP being enabled, and if not, how does userspace discover the
>>> supported hugepage sizes?
>>
>> Are we talking about THP or hugetlb? They are two different things, and
>> "KVM_GUEST_MEMFD_ALLOW_HUGEPAGE" doesn't make it clearer what we are talking
>> about.
>>
>> This patch here "get_thp_size()" indicates that we care about THP, not
>> hugetlb.
>>
>>
>> THP lives in:
>> 	/sys/kernel/mm/transparent_hugepage/
>> and hugetlb in:
>> 	/sys/kernel/mm/hugepages/
>>
>> THP for shmem+anon currently really only supports PMD-sized THP, that size
>> can be observed via:
>> 	/sys/kernel/mm/transparent_hugepage/hpage_pmd_size
>>
>> hugetlb sizes can be detected simply by looking at the folders inside
>> /sys/kernel/mm/hugepages/. "tools/testing/selftests/mm/vm_util.c" in the
>> kernel has a function "detect_hugetlb_page_sizes()" that uses that interface
>> to detect the sizes.
>>
>>
>> But likely we want THP support here. Because for hugetlb, one would actually
>> have to instruct the kernel which size to use, like we do for memfd with
>> hugetlb.
> 
> Would we not want both ultimately ?

Likely we want both somehow, although I am not sure how to obtain either 
cleanly and fully.

My question is targeted at what the current interface/implementation 
promises, and how it relates to both, THP and hugetlb.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 13/70] i386: Introduce tdx-guest object
  2023-11-15  7:14 ` [PATCH v3 13/70] i386: Introduce tdx-guest object Xiaoyao Li
@ 2023-12-01 10:52   ` Markus Armbruster
  2023-12-04  7:59     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Markus Armbruster @ 2023-12-01 10:52 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> Introduce tdx-guest object which implements the interface of
> CONFIDENTIAL_GUEST_SUPPORT, and will be used to create TDX VMs (TDs) by
>
>   qemu -machine ...,confidential-guest-support=tdx0	\
>        -object tdx-guest,id=tdx0
>
> It has only one member 'attributes' with fixed value 0 and not
> configurable so far.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Acked-by: Gerd Hoffmann <kraxel@redhat.com>
> Acked-by: Markus Armbruster <armbru@redhat.com>

[...]

> diff --git a/qapi/qom.json b/qapi/qom.json
> index c53ef978ff7e..8e08257dac2f 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -878,6 +878,16 @@
>              'reduced-phys-bits': 'uint32',
>              '*kernel-hashes': 'bool' } }
>  
> +##
> +# @TdxGuestProperties:
> +#
> +# Properties for tdx-guest objects.
> +#
> +# Since: 8.2

Going to be 9.0.

> +##
> +{ 'struct': 'TdxGuestProperties',
> +  'data': { }}
> +
>  ##
>  # @ThreadContextProperties:
>  #
> @@ -956,6 +966,7 @@
>      'sev-guest',
>      'thread-context',
>      's390-pv-guest',
> +    'tdx-guest',
>      'throttle-group',
>      'tls-creds-anon',
>      'tls-creds-psk',
> @@ -1022,6 +1033,7 @@
>        'secret_keyring':             { 'type': 'SecretKeyringProperties',
>                                        'if': 'CONFIG_SECRET_KEYRING' },
>        'sev-guest':                  'SevGuestProperties',
> +      'tdx-guest':                  'TdxGuestProperties',
>        'thread-context':             'ThreadContextProperties',
>        'throttle-group':             'ThrottleGroupProperties',
>        'tls-creds-anon':             'TlsCredsAnonProperties',

[...]


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 27/70] i386/tdx: Add property sept-ve-disable for tdx-guest object
  2023-11-15  7:14 ` [PATCH v3 27/70] i386/tdx: Add property sept-ve-disable for tdx-guest object Xiaoyao Li
@ 2023-12-01 10:53   ` Markus Armbruster
  0 siblings, 0 replies; 161+ messages in thread
From: Markus Armbruster @ 2023-12-01 10:53 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> Bit 28 of TD attribute, named SEPT_VE_DISABLE. When set to 1, it disables
> EPT violation conversion to #VE on guest TD access of PENDING pages.
>
> Some guest OS (e.g., Linux TD guest) may require this bit as 1.
> Otherwise refuse to boot.
>
> Add sept-ve-disable property for tdx-guest object, for user to configure
> this bit.
>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Acked-by: Gerd Hoffmann <kraxel@redhat.com>
> ---
> Changes in v3:
> - update the comment of property @sept-ve-disable to make it more
>   descriptive and use new format. (Daniel and Markus)
> ---
>  qapi/qom.json         |  7 ++++++-
>  target/i386/kvm/tdx.c | 24 ++++++++++++++++++++++++
>  2 files changed, 30 insertions(+), 1 deletion(-)
>
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 8e08257dac2f..3a29659e0155 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -883,10 +883,15 @@
>  #
>  # Properties for tdx-guest objects.
>  #
> +# @sept-ve-disable: toggle bit 28 of TD attributes to control disabling
> +#     of EPT violation conversion to #VE on guest TD access of PENDING
> +#     pages.  Some guest OS (e.g., Linux TD guest) may require this to
> +#     be set, otherwise they refuse to boot.
> +#
>  # Since: 8.2
>  ##
>  { 'struct': 'TdxGuestProperties',
> -  'data': { }}
> +  'data': { '*sept-ve-disable': 'bool' } }
>  
>  ##
>  # @ThreadContextProperties:

Acked-by: Markus Armbruster <armbru@redhat.com>

[...]


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30  8:00         ` Xiaoyao Li
@ 2023-12-01 11:00           ` David Hildenbrand
  0 siblings, 0 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-12-01 11:00 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 30.11.23 09:00, Xiaoyao Li wrote:
> On 11/20/2023 5:26 PM, David Hildenbrand wrote:
>>
>>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>>
>>> Get caught.
>>>
>>>> This should be factored out into a common helper.
>>>
>>> Sure, will do it in next version.
>>
>> Factor it out in a separate patch. Then this patch gets so small that
>> you can just squash it into #2.
> 
> A silly question. What file should the factored function be put in?

Good question :) it's highly Linux specific, probably util/oslib-posix.c ?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  2023-11-15  7:14 ` [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM Xiaoyao Li
  2023-11-15 17:32   ` Daniel P. Berrangé
@ 2023-12-01 11:00   ` Markus Armbruster
  2023-12-14  3:07     ` Xiaoyao Li
  1 sibling, 1 reply; 161+ messages in thread
From: Markus Armbruster @ 2023-12-01 11:00 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
> can be provided for TDX attestation.
>
> So far they were hard coded as 0. Now allow user to specify those values
> via property mrconfigid, mrowner and mrownerconfig. They are all in
> base64 format.
>
> example
> -object tdx-guest, \
>   mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>   mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>   mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
>  - use base64 encoding instead of hex-string;
> ---
>  qapi/qom.json         | 11 +++++-
>  target/i386/kvm/tdx.c | 85 +++++++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.h |  3 ++
>  3 files changed, 98 insertions(+), 1 deletion(-)
>
> diff --git a/qapi/qom.json b/qapi/qom.json
> index 3a29659e0155..fd99aa1ff8cc 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -888,10 +888,19 @@
>  #     pages.  Some guest OS (e.g., Linux TD guest) may require this to
>  #     be set, otherwise they refuse to boot.
>  #
> +# @mrconfigid: base64 encoded MRCONFIGID SHA384 digest
> +#
> +# @mrowner: base64 encoded MROWNER SHA384 digest
> +#
> +# @mrownerconfig: base64 MROWNERCONFIG SHA384 digest

Can we come up with a description that tells the user a bit more clearly
what we're talking about?  Perhaps starting with this question could
lead us there: what's an MRCONFIGID, and why should I care?

> +#
>  # Since: 8.2
>  ##
>  { 'struct': 'TdxGuestProperties',
> -  'data': { '*sept-ve-disable': 'bool' } }
> +  'data': { '*sept-ve-disable': 'bool',
> +            '*mrconfigid': 'str',
> +            '*mrowner': 'str',
> +            '*mrownerconfig': 'str' } }
>  
>  ##
>  # @ThreadContextProperties:

[...]
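
A minimal sketch, not part of the patch, of what the base64 format implies
for these three properties: a SHA384 digest is 48 bytes, so each value must be
exactly 64 base64 characters with no '=' padding. The function names below are
hypothetical, and the decoder is hand-rolled only to keep the example
self-contained; it is not taken from the series.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA384_DIGEST_SIZE 48
#define SHA384_BASE64_LEN  64   /* 48 bytes * 4 / 3, no padding needed */

/* Map one base64 character to its 6-bit value, or -1 if it is invalid. */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Hypothetical helper: decode a base64 string that must encode exactly one
 * SHA384 digest. Returns 0 on success, -1 on wrong length or bad character.
 */
int decode_sha384_base64(const char *s, uint8_t out[SHA384_DIGEST_SIZE])
{
    size_t i, j;

    if (strlen(s) != SHA384_BASE64_LEN) {
        return -1;
    }
    for (i = 0, j = 0; i < SHA384_BASE64_LEN; i += 4) {
        int a = b64_value(s[i]), b = b64_value(s[i + 1]);
        int c = b64_value(s[i + 2]), d = b64_value(s[i + 3]);

        if (a < 0 || b < 0 || c < 0 || d < 0) {
            return -1;
        }
        /* Four 6-bit groups become three bytes. */
        out[j++] = (uint8_t)((a << 2) | (b >> 4));
        out[j++] = (uint8_t)(((b & 0xf) << 4) | (c >> 2));
        out[j++] = (uint8_t)(((c & 0x3) << 6) | d);
    }
    return 0;
}
```

With the example value from the commit message, the decoded digest starts with
the bytes 01 23 45 67 89 ab cd ef.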



* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-11-15  7:15 ` [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote> Xiaoyao Li
  2023-11-15 17:51   ` Daniel P. Berrangé
  2023-11-15 17:58   ` Daniel P. Berrangé
@ 2023-12-01 11:02   ` Markus Armbruster
  2023-12-07  7:38     ` Xiaoyao Li
  2023-12-21 11:05   ` Daniel P. Berrangé
  3 siblings, 1 reply; 161+ messages in thread
From: Markus Armbruster @ 2023-12-01 11:02 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> From: Isaku Yamahata <isaku.yamahata@intel.com>
>
> For GetQuote, delegate a request to Quote Generation Service.
> Add property "quote-generation-socket" to tdx-guest, which is a property
> of type SocketAddress to specify Quote Generation Service(QGS).
>
> On request, connect to the QGS, read request buffer from shared guest
> memory, send the request buffer to the server and store the response
> into shared guest memory and notify TD guest by interrupt.
>
> command line example:
>   qemu-system-x86_64 \
>     -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>     -machine confidential-guest-support=tdx0
>
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename property "quote-generation-service" to "quote-generation-socket";
> - change the type of "quote-generation-socket" from str to
>   SocketAddress;
> - squash next patch into this one;
> ---
>  qapi/qom.json         |   5 +-
>  target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.h |   6 +
>  3 files changed, 440 insertions(+), 1 deletion(-)
>
> diff --git a/qapi/qom.json b/qapi/qom.json
> index fd99aa1ff8cc..cf36a1832ddd 100644
> --- a/qapi/qom.json
> +++ b/qapi/qom.json
> @@ -894,13 +894,16 @@
>  #
>  # @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
>  #
> +# @quote-generation-socket: socket address for Quote Generation Service(QGS)
> +#

Long line.  Better:

   # @quote-generation-socket: socket address for Quote Generation
   #     Service(QGS)

>  # Since: 8.2
>  ##
>  { 'struct': 'TdxGuestProperties',
>    'data': { '*sept-ve-disable': 'bool',
>              '*mrconfigid': 'str',
>              '*mrowner': 'str',
> -            '*mrownerconfig': 'str' } }
> +            '*mrownerconfig': 'str',
> +            '*quote-generation-socket': 'SocketAddress' } }
>  
>  ##
>  # @ThreadContextProperties:



* Re: [PATCH v3 57/70] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
  2023-11-15  7:15 ` [PATCH v3 57/70] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility Xiaoyao Li
@ 2023-12-01 11:11   ` Markus Armbruster
  2023-12-07  8:11     ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Markus Armbruster @ 2023-12-01 11:11 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility
>
> Originated-from: Isaku Yamahata <isaku.yamahata@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes from v2:
> - Add documentation of new type and struct (Daniel)
> - refine the error message handling (Daniel)
> ---
>  qapi/run-state.json   | 27 ++++++++++++++++++++--
>  system/runstate.c     | 54 +++++++++++++++++++++++++++++++++++++++++++
>  target/i386/kvm/tdx.c | 24 +++++++++++++++++--
>  3 files changed, 101 insertions(+), 4 deletions(-)
>
> diff --git a/qapi/run-state.json b/qapi/run-state.json
> index f216ba54ec4c..e18f62eaef77 100644
> --- a/qapi/run-state.json
> +++ b/qapi/run-state.json
> @@ -496,10 +496,12 @@
>  #
>  # @s390: s390 guest panic information type (Since: 2.12)
>  #
> +# @tdx: tdx guest panic information type (Since: 8.2)
> +#
>  # Since: 2.9
>  ##
>  { 'enum': 'GuestPanicInformationType',
> -  'data': [ 'hyper-v', 's390' ] }
> +  'data': [ 'hyper-v', 's390', 'tdx' ] }
>  
>  ##
>  # @GuestPanicInformation:
> @@ -514,7 +516,8 @@
>   'base': {'type': 'GuestPanicInformationType'},
>   'discriminator': 'type',
>   'data': {'hyper-v': 'GuestPanicInformationHyperV',
> -          's390': 'GuestPanicInformationS390'}}
> +          's390': 'GuestPanicInformationS390',
> +          'tdx' : 'GuestPanicInformationTdx'}}
>  
>  ##
>  # @GuestPanicInformationHyperV:
> @@ -577,6 +580,26 @@
>            'psw-addr': 'uint64',
>            'reason': 'S390CrashReason'}}
>  
> +##
> +# @GuestPanicInformationTdx:
> +#
> +# TDX GHCI TDG.VP.VMCALL<ReportFatalError> specific guest panic information

Long line.  Suggest

   # Guest panic information specific to TDX GHCI
   # TDG.VP.VMCALL<ReportFatalError>.

> +#
> +# @error-code: TD-specific error code
> +#
> +# @gpa: 4KB-aligned guest physical address of the page that containing
> +#     additional error data

"address of a page" implies the address is page-aligned.  4KB-aligned
feels redundant.  What about

   # @gpa: guest-physical address of a page that contains additional
   #     error data.

But in what format is the "additional error data"?

> +#
> +# @message: TD guest provided message string.  (It's not so trustable
> +#     and cannot be assumed to be well formed because it comes from guest)

guest-provided

For "well-formed" to make sense, we'd need an idea of the form / syntax.

If it's a human-readable error message, we could go with

   # @message: Human-readable error message provided by the guest.  Not
   #     to be trusted.

> +#
> +# Since: 8.2
> +##
> +{'struct': 'GuestPanicInformationTdx',
> + 'data': {'error-code': 'uint64',
> +          'gpa': 'uint64',
> +          'message': 'str'}}
> +
>  ##
>  # @MEMORY_FAILURE:
>  #

[...]



* Re: [PATCH v3 03/70] RAMBlock/guest_memfd: Enable KVM_GUEST_MEMFD_ALLOW_HUGEPAGE
  2023-11-30 17:51                 ` Daniel P. Berrangé
  2023-11-30 18:22                   ` David Hildenbrand
@ 2023-12-01 11:22                   ` Claudio Fontana
  1 sibling, 0 replies; 161+ messages in thread
From: Claudio Fontana @ 2023-12-01 11:22 UTC (permalink / raw)
  To: Daniel P. Berrangé, David Hildenbrand
  Cc: Sean Christopherson, Xiaoyao Li, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Gerd Hoffmann, Isaku Yamahata,
	Chenyi Qiang

On 11/30/23 18:51, Daniel P. Berrangé wrote:
> On Thu, Nov 30, 2023 at 05:54:26PM +0100, David Hildenbrand wrote:
>> On 30.11.23 17:01, Sean Christopherson wrote:
>>> On Thu, Nov 30, 2023, David Hildenbrand wrote:
>>>> On 30.11.23 08:32, Xiaoyao Li wrote:
>>>>> On 11/20/2023 5:26 PM, David Hildenbrand wrote:
>>>>>>
>>>>>>>> ... did you shamelessly copy that from hw/virtio/virtio-mem.c ? ;)
>>>>>>>
>>>>>>> Get caught.
>>>>>>>
>>>>>>>> This should be factored out into a common helper.
>>>>>>>
>>>>>>> Sure, will do it in next version.
>>>>>>
>>>>>> Factor it out in a separate patch. Then this patch gets so small that
>>>>>> you can just squash it into #2.
>>>>>>
>>>>>> And my comment regarding "flags = 0" to patch #2 no longer applies :)
>>>>>>
>>>>>
>>>>> I see.
>>>>>
>>>>> But it depends on if KVM_GUEST_MEMFD_ALLOW_HUGEPAGE will appear together
>>>>> with initial guest memfd in linux (hopefully 6.8)
>>>>> https://lore.kernel.org/all/CABgObfa=DH7FySBviF63OS9sVog_wt-AqYgtUAGKqnY5Bizivw@mail.gmail.com/
>>>>>
>>>>
>>>> Doesn't seem to be in -next if I am looking at the right tree:
>>>>
>>>> https://git.kernel.org/pub/scm/virt/kvm/kvm.git/log/?h=next
>>>
>>> Yeah, we punted on adding hugepage support for the initial guest_memfd merge so
>>> as not to rush in kludgy uABI.  The internal KVM code isn't problematic, we just
>>> haven't figured out exactly what the ABI should look like, e.g. should hugepages
>>> be dependent on THP being enabled, and if not, how does userspace discover the
>>> supported hugepage sizes?
>>
>> Are we talking about THP or hugetlb? They are two different things, and
>> "KVM_GUEST_MEMFD_ALLOW_HUGEPAGE" doesn't make it clearer what we are talking
>> about.
>>
>> This patch here "get_thp_size()" indicates that we care about THP, not
>> hugetlb.
>>
>>
>> THP lives in:
>> 	/sys/kernel/mm/transparent_hugepage/
>> and hugetlb in:
>> 	/sys/kernel/mm/hugepages/
>>
>> THP for shmem+anon currently really only supports PMD-sized THP, that size
>> can be observed via:
>> 	/sys/kernel/mm/transparent_hugepage/hpage_pmd_size
>>
>> hugetlb sizes can be detected simply by looking at the folders inside
>> /sys/kernel/mm/hugepages/. "tools/testing/selftests/mm/vm_util.c" in the
>> kernel has a function "detect_hugetlb_page_sizes()" that uses that interface
>> to detect the sizes.
>>
>>
>> But likely we want THP support here. Because for hugetlb, one would actually
>> have to instruct the kernel which size to use, like we do for memfd with
>> hugetlb.
> 
> Would we not want both ultimately ?
> 
> THP is good because it increases performance vs non-HP out of the box
> without the user or mgmt app having to make any decisions.
> 
> It does not give you deterministic performance though, because it has
> to opportunistically assign huge pages based on what is available and
> that may differ each time a VM is launched.  Explicit admin/mgmt app
> controlled huge page usage gives determinism, at the cost of increased
> mgmt overhead.
> 
> Both are valid use cases depending on the tradeoff a deployment and/or
> mgmt app wants to make.

Absolutely, it really depends on the definition of "performance" for the specific goal the user is trying to achieve.
There are very prominent use cases where THP is a big no-no due to the latency introduced.

C
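
For reference, a minimal sketch of the two discovery mechanisms David
describes above: the PMD-sized THP size is read from a single sysfs file, while
hugetlb sizes are encoded in the directory names under
/sys/kernel/mm/hugepages/ (e.g. "hugepages-2048kB"). The function names are
hypothetical and this is not the kernel selftest code.

```c
#include <stdio.h>

/*
 * Hypothetical helper: parse a directory name such as "hugepages-2048kB"
 * and return the hugetlb page size in bytes, or 0 if the name does not match.
 */
unsigned long long hugetlb_dir_to_bytes(const char *name)
{
    unsigned long long kb = 0;
    int consumed = 0;

    /* %n is only stored if the trailing "kB" literal matched as well. */
    if (sscanf(name, "hugepages-%llukB%n", &kb, &consumed) != 1 ||
        consumed == 0 || name[consumed] != '\0') {
        return 0;
    }
    return kb * 1024;
}

/*
 * Hypothetical helper: read the PMD-sized THP size in bytes.
 * Returns 0 if the file is missing (e.g. THP not built in) or unreadable.
 */
unsigned long long read_thp_pmd_size(void)
{
    unsigned long long size = 0;
    FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/hpage_pmd_size", "r");

    if (f) {
        if (fscanf(f, "%llu", &size) != 1) {
            size = 0;
        }
        fclose(f);
    }
    return size;
}
```

A caller would iterate over the entries of /sys/kernel/mm/hugepages/ with
readdir() and pass each name to hugetlb_dir_to_bytes(), skipping entries that
return 0.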


* Re: [PATCH v3 05/70] kvm: Enable KVM_SET_USER_MEMORY_REGION2 for memslot
  2023-11-17 20:50   ` Isaku Yamahata
@ 2023-12-04  6:48     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-04  6:48 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata

On 11/18/2023 4:50 AM, Isaku Yamahata wrote:
> On Wed, Nov 15, 2023 at 02:14:14AM -0500,
> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> 
>> From: Chao Peng <chao.p.peng@linux.intel.com>
>>
>> Switch to KVM_SET_USER_MEMORY_REGION2 when supported by KVM.
>>
>> With KVM_SET_USER_MEMORY_REGION2, QEMU can set up a memory region that is
>> backed both by hva-based shared memory and guest memfd based private
>> memory.
>>
>> Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com>
>> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>>   accel/kvm/kvm-all.c      | 56 ++++++++++++++++++++++++++++++++++------
>>   accel/kvm/trace-events   |  2 +-
>>   include/sysemu/kvm_int.h |  2 ++
>>   3 files changed, 51 insertions(+), 9 deletions(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 9f751d4971f8..69afeb47c9c0 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -293,35 +293,69 @@ int kvm_physical_memory_addr_from_host(KVMState *s, void *ram,
>>   static int kvm_set_user_memory_region(KVMMemoryListener *kml, KVMSlot *slot, bool new)
>>   {
>>       KVMState *s = kvm_state;
>> -    struct kvm_userspace_memory_region mem;
>> +    struct kvm_userspace_memory_region2 mem;
>> +    static int cap_user_memory2 = -1;
>>       int ret;
>>   
>> +    if (cap_user_memory2 == -1) {
>> +        cap_user_memory2 = kvm_check_extension(s, KVM_CAP_USER_MEMORY2);
>> +    }
>> +
>> +    if (!cap_user_memory2 && slot->guest_memfd >= 0) {
>> +        error_report("%s, KVM doesn't support KVM_CAP_USER_MEMORY2,"
>> +                     " which is required by guest memfd!", __func__);
>> +        exit(1);
>> +    }
>> +
>>       mem.slot = slot->slot | (kml->as_id << 16);
>>       mem.guest_phys_addr = slot->start_addr;
>>       mem.userspace_addr = (unsigned long)slot->ram;
>>       mem.flags = slot->flags;
>> +    mem.guest_memfd = slot->guest_memfd;
>> +    mem.guest_memfd_offset = slot->guest_memfd_offset;
>>   
>>       if (slot->memory_size && !new && (mem.flags ^ slot->old_flags) & KVM_MEM_READONLY) {
>>           /* Set the slot size to 0 before setting the slot to the desired
>>            * value. This is needed based on KVM commit 75d61fbc. */
>>           mem.memory_size = 0;
>> -        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
>> +
>> +        if (cap_user_memory2) {
>> +            ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION2, &mem);
>> +        } else {
>> +            ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
>> +	    }
>>           if (ret < 0) {
>>               goto err;
>>           }
>>       }
>>       mem.memory_size = slot->memory_size;
>> -    ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
>> +    if (cap_user_memory2) {
>> +        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION2, &mem);
>> +    } else {
>> +        ret = kvm_vm_ioctl(s, KVM_SET_USER_MEMORY_REGION, &mem);
>> +    }
>>       slot->old_flags = mem.flags;
>>   err:
>>       trace_kvm_set_user_memory(mem.slot >> 16, (uint16_t)mem.slot, mem.flags,
>>                                 mem.guest_phys_addr, mem.memory_size,
>> -                              mem.userspace_addr, ret);
>> +                              mem.userspace_addr, mem.guest_memfd,
>> +                              mem.guest_memfd_offset, ret);
>>       if (ret < 0) {
>> -        error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
>> -                     " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
>> -                     __func__, mem.slot, slot->start_addr,
>> -                     (uint64_t)mem.memory_size, strerror(errno));
>> +        if (cap_user_memory2) {
>> +                error_report("%s: KVM_SET_USER_MEMORY_REGION2 failed, slot=%d,"
>> +                        " start=0x%" PRIx64 ", size=0x%" PRIx64 ","
>> +                        " flags=0x%" PRIx32 ", guest_memfd=%" PRId32 ","
>> +                        " guest_memfd_offset=0x%" PRIx64 ": %s",
>> +                        __func__, mem.slot, slot->start_addr,
>> +                        (uint64_t)mem.memory_size, mem.flags,
>> +                        mem.guest_memfd, (uint64_t)mem.guest_memfd_offset,
>> +                        strerror(errno));
>> +        } else {
>> +                error_report("%s: KVM_SET_USER_MEMORY_REGION failed, slot=%d,"
>> +                            " start=0x%" PRIx64 ", size=0x%" PRIx64 ": %s",
>> +                            __func__, mem.slot, slot->start_addr,
>> +                            (uint64_t)mem.memory_size, strerror(errno));
>> +        }
>>       }
>>       return ret;
>>   }
>> @@ -477,6 +511,9 @@ static int kvm_mem_flags(MemoryRegion *mr)
>>       if (readonly && kvm_readonly_mem_allowed) {
>>           flags |= KVM_MEM_READONLY;
>>       }
>> +    if (memory_region_has_guest_memfd(mr)) {
>> +        flags |= KVM_MEM_PRIVATE;
>> +    }
> 
> Nitpick: it was renamed to KVM_MEM_GUEST_MEMFD.
> As long as it is defined to the same value, it doesn't matter, though.

Thanks for the reminder!

Will update the headers and switch to KVM_MEM_GUEST_MEMFD.
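
As a side note on the quoted hunk, a minimal sketch of how the slot field is
packed there: the address-space id goes into the upper 16 bits and the slot
index into the lower 16 bits ("slot->slot | (kml->as_id << 16)"), and the
trace call splits them again with "mem.slot >> 16" and "(uint16_t)mem.slot".
The helper names below are hypothetical and do not exist in QEMU.

```c
#include <stdint.h>

/* Pack an address-space id and a slot index into one 32-bit slot value. */
uint32_t kvm_pack_slot(uint16_t as_id, uint16_t slot)
{
    return (uint32_t)slot | ((uint32_t)as_id << 16);
}

/* Recover the address-space id (upper 16 bits). */
uint16_t kvm_slot_as_id(uint32_t packed)
{
    return (uint16_t)(packed >> 16);
}

/* Recover the slot index (lower 16 bits). */
uint16_t kvm_slot_index(uint32_t packed)
{
    return (uint16_t)packed;
}
```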


* Re: [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-11-20  9:56       ` David Hildenbrand
@ 2023-12-04  7:35         ` Xiaoyao Li
  2023-12-04  7:53           ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-04  7:35 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/20/2023 5:56 PM, David Hildenbrand wrote:
> On 16.11.23 03:56, Xiaoyao Li wrote:
>> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>>> ram_block_discard_range() which grabs some code from
>>>> ram_discard_range(). However, during code movement, it changed the
>>>> alignment check of host_startaddr from qemu_host_page_size to
>>>> rb->page_size.
>>>>
>>>> When ramblock is backed by hugepage, it requires the startaddr to be
>>>> huge page size aligned, which is overkill. E.g., TDX's private-shared
>>>> page conversion is done at 4KB granularity. A shared page is discarded
>>>> when it gets converted to private, and when the shared page is backed by
>>>> hugepage it is going to fail on this check.
>>>>
>>>> So change the alignment check back to qemu_host_page_size.
>>>>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> ---
>>>> Changes in v3:
>>>>    - Newly added in v3;
>>>> ---
>>>>    system/physmem.c | 2 +-
>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index c56b17e44df6..8a4e42c7cf60 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>>> uint64_t start, size_t length)
>>>>        uint8_t *host_startaddr = rb->host + start;
>>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>>
>>> For your use cases, rb->page_size should always match 
>>> qemu_host_page_size.
>>>
>>> IIRC, we only set rb->page_size to different values for hugetlb. And
>>> guest_memfd does not support hugetlb.
>>>
>>> Even if QEMU is using THP, rb->page_size should be 4k.
>>>
>>> Please elaborate how you can actually trigger that. From what I recall,
>>> guest_memfd is not compatible with hugetlb.
>>
>> It's the shared memory that can be backed by hugetlb.
> 
> Serious question: does that configuration make any sense to support at 
> this point? I claim: no.
> 
>>
>> Later patch 9 introduces ram_block_convert_page(), which will discard
>> shared memory when it gets converted to private. A TD guest can request
>> converting a 4K page to private while the page was previously backed by
>> hugetlb as a 2M shared page.
> 
> So you can call ram_block_discard_guest_memfd_range() on subpage basis, 
> but not ram_block_discard_range().
> 
> ram_block_convert_range() would have to be taught that that (questionable)
> combination of hugetlb for shmem and ordinary pages for guest_memfd
> cannot discard shared memory.
> 
> And it probably shouldn't either way. There are other problems when not 
> using hugetlb along with preallocation.

If I understand correctly, preallocation needs to be enabled for
hugetlb. And in the preallocation case, memory doesn't need to be
discarded. Is that correct?

> The check in ram_block_discard_range() is correct, whoever ends up 
> calling it has to stop calling it.
> 

So, do I need to add logic to ram_block_discard_page() so that, if the size
of shared memory indicates hugepage, the discarding is skipped?
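
To make the failing case concrete, a minimal sketch of the alignment test
under discussion, assuming a 4 KiB host page size and a 2 MiB hugetlb page
size. The helper name is hypothetical; it stands in for the power-of-two
alignment check that QEMU_PTR_IS_ALIGNED() performs in the quoted hunk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define HOST_PAGE_SIZE  ((size_t)4096)             /* assumed 4 KiB */
#define HUGETLB_2M_SIZE ((size_t)2 * 1024 * 1024)  /* assumed 2 MiB */

/* True if addr is a multiple of align; align must be a power of two. */
bool addr_is_aligned(uintptr_t addr, size_t align)
{
    return (addr & (align - 1)) == 0;
}

/*
 * A 4 KiB conversion request at an offset of one page into a 2 MiB
 * hugetlb-backed block: it passes a host-page-size check but fails an
 * rb->page_size (2 MiB) check.
 */
bool subpage_request_passes(uintptr_t block_base, size_t check_align)
{
    return addr_is_aligned(block_base + HOST_PAGE_SIZE, check_align);
}
```

This is why the discard path rejects 4K conversions on hugetlb-backed shared
memory: the request is page-aligned, but not aligned to the block's page size.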



* Re: [PATCH v3 08/70] physmem: replace function name with __func__ in ram_block_discard_range()
  2023-11-15 18:21   ` David Hildenbrand
@ 2023-12-04  7:40     ` Xiaoyao Li
  2023-12-04  9:49       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-04  7:40 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/16/2023 2:21 AM, David Hildenbrand wrote:
> On 15.11.23 08:14, Xiaoyao Li wrote:
>> Use __func__ to avoid hard-coded function name.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
> 
> That can be queued independently.

Will you queue it for 9.0, or will someone else?

Do I need to send it separately?

> Reviewed-by: David Hildenbrand <david@redhat.com>
> 



* Re: [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-12-04  7:35         ` Xiaoyao Li
@ 2023-12-04  7:53           ` Xiaoyao Li
  2023-12-04  9:52             ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-04  7:53 UTC (permalink / raw)
  To: David Hildenbrand, Paolo Bonzini, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 12/4/2023 3:35 PM, Xiaoyao Li wrote:
> On 11/20/2023 5:56 PM, David Hildenbrand wrote:
>> On 16.11.23 03:56, Xiaoyao Li wrote:
>>> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>>>> ram_block_discard_range() which grabs some code from
>>>>> ram_discard_range(). However, during code movement, it changed the
>>>>> alignment check of host_startaddr from qemu_host_page_size to
>>>>> rb->page_size.
>>>>>
>>>>> When ramblock is backed by hugepage, it requires the startaddr to be
>>>>> huge page size aligned, which is overkill. E.g., TDX's private-shared
>>>>> page conversion is done at 4KB granularity. A shared page is discarded
>>>>> when it gets converted to private, and when the shared page is backed by
>>>>> hugepage it is going to fail on this check.
>>>>>
>>>>> So change the alignment check back to qemu_host_page_size.
>>>>>
>>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>>> ---
>>>>> Changes in v3:
>>>>>    - Newly added in v3;
>>>>> ---
>>>>>    system/physmem.c | 2 +-
>>>>>    1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>> index c56b17e44df6..8a4e42c7cf60 100644
>>>>> --- a/system/physmem.c
>>>>> +++ b/system/physmem.c
>>>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>>>> uint64_t start, size_t length)
>>>>>        uint8_t *host_startaddr = rb->host + start;
>>>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>>>
>>>> For your use cases, rb->page_size should always match 
>>>> qemu_host_page_size.
>>>>
>>>> IIRC, we only set rb->page_size to different values for hugetlb. And
>>>> guest_memfd does not support hugetlb.
>>>>
>>>> Even if QEMU is using THP, rb->page_size should be 4k.
>>>>
>>>> Please elaborate how you can actually trigger that. From what I recall,
>>>> guest_memfd is not compatible with hugetlb.
>>>
>>> It's the shared memory that can be backed by hugetlb.
>>
>> Serious question: does that configuration make any sense to support at 
>> this point? I claim: no.
>>
>>>
>>> Later patch 9 introduces ram_block_convert_page(), which will discard
>>> shared memory when it gets converted to private. A TD guest can request
>>> converting a 4K page to private while the page was previously backed by
>>> hugetlb as a 2M shared page.
>>
>> So you can call ram_block_discard_guest_memfd_range() on subpage 
>> basis, but not ram_block_discard_range().
>>
>> ram_block_convert_range() would have to be taught that that
>> (questionable) combination of hugetlb for shmem and ordinary pages for
>> guest_memfd cannot discard shared memory.
>>
>> And it probably shouldn't either way. There are other problems when 
>> not using hugetlb along with preallocation.
> 
> If I understand correctly, preallocation needs to be enabled for 
> hugetlb. And in preallocation case, it doesn't need to discard memory. 
> Is it correct?
> 
>> The check in ram_block_discard_range() is correct, whoever ends up 
>> calling it has to stop calling it.
>>
>  > So, I need add logic to ram_block_discard_page() that if the size of

Sorry, I made a typo.

To correct myself: s/ram_block_discard_page()/ram_block_convert_range()/

> shared memory indicates hugepage, skip the discarding?
> 
> 



* Re: [PATCH v3 13/70] i386: Introduce tdx-guest object
  2023-12-01 10:52   ` Markus Armbruster
@ 2023-12-04  7:59     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-04  7:59 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On 12/1/2023 6:52 PM, Markus Armbruster wrote:
> Xiaoyao Li <xiaoyao.li@intel.com> writes:
> 
>> Introduce tdx-guest object which implements the interface of
>> CONFIDENTIAL_GUEST_SUPPORT, and will be used to create TDX VMs (TDs) by
>>
>>    qemu -machine ...,confidential-guest-support=tdx0	\
>>         -object tdx-guest,id=tdx0
>>
>> It has only one member 'attributes' with fixed value 0 and not
>> configurable so far.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Acked-by: Gerd Hoffmann <kraxel@redhat.com>
>> Acked-by: Markus Armbruster <armbru@redhat.com>
> 
> [...]
> 
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index c53ef978ff7e..8e08257dac2f 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -878,6 +878,16 @@
>>               'reduced-phys-bits': 'uint32',
>>               '*kernel-hashes': 'bool' } }
>>   
>> +##
>> +# @TdxGuestProperties:
>> +#
>> +# Properties for tdx-guest objects.
>> +#
>> +# Since: 8.2
> 
> Going to be 9.0.

Will update it and all others.

(I left it as 8.2 because I was not sure whether the next version would be
8.3 or 9.0.)

>> +##
>> +{ 'struct': 'TdxGuestProperties',
>> +  'data': { }}
>> +
>>   ##
>>   # @ThreadContextProperties:
>>   #
>> @@ -956,6 +966,7 @@
>>       'sev-guest',
>>       'thread-context',
>>       's390-pv-guest',
>> +    'tdx-guest',
>>       'throttle-group',
>>       'tls-creds-anon',
>>       'tls-creds-psk',
>> @@ -1022,6 +1033,7 @@
>>         'secret_keyring':             { 'type': 'SecretKeyringProperties',
>>                                         'if': 'CONFIG_SECRET_KEYRING' },
>>         'sev-guest':                  'SevGuestProperties',
>> +      'tdx-guest':                  'TdxGuestProperties',
>>         'thread-context':             'ThreadContextProperties',
>>         'throttle-group':             'ThrottleGroupProperties',
>>         'tls-creds-anon':             'TlsCredsAnonProperties',
> 
> [...]
> 
> 



* Re: [PATCH v3 26/70] i386/tdx: Initialize TDX before creating TD vcpus
  2023-11-15 11:01   ` Daniel P. Berrangé
@ 2023-12-04  8:28     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-04  8:28 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/15/2023 7:01 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:14:35AM -0500, Xiaoyao Li wrote:
>> Invoke KVM_TDX_INIT in kvm_arch_pre_create_vcpu(), because KVM_TDX_INIT
>> configures global TD configurations, e.g. the canonical CPUID config,
>> and must be executed prior to creating vCPUs.
>>
>> Use kvm_x86_arch_cpuid() to set up the CPUID settings for TDX VM.
>>
>> Note, this doesn't address the fact that QEMU may change the CPUID
>> configuration when creating vCPUs, i.e. punts on refactoring QEMU to
>> provide a stable CPUID config prior to kvm_arch_init().
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Acked-by: Gerd Hoffmann <kraxel@redhat.com>
>> ---
>> Changes in v3:
>> - Pass @errp in tdx_pre_create_vcpu() and pass error info to it. (Daniel)
>> ---
>>   accel/kvm/kvm-all.c        |  9 +++++++-
>>   target/i386/kvm/kvm.c      |  9 ++++++++
>>   target/i386/kvm/tdx-stub.c |  5 +++++
>>   target/i386/kvm/tdx.c      | 45 ++++++++++++++++++++++++++++++++++++++
>>   target/i386/kvm/tdx.h      |  4 ++++
>>   5 files changed, 71 insertions(+), 1 deletion(-)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 6b5f4d62f961..a92fff471b58 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -441,8 +441,15 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>>   
>>       trace_kvm_init_vcpu(cpu->cpu_index, kvm_arch_vcpu_id(cpu));
>>   
>> +    /*
>> +     * tdx_pre_create_vcpu() may call cpu_x86_cpuid(). It in turn may call
>> +     * kvm_vm_ioctl(). Set cpu->kvm_state in advance to avoid NULL pointer
>> +     * dereference.
>> +     */
>> +    cpu->kvm_state = s;
>>       ret = kvm_arch_pre_create_vcpu(cpu, errp);
>>       if (ret < 0) {
>> +        cpu->kvm_state = NULL;
>>           goto err;
>>       }
>>   
>> @@ -450,11 +457,11 @@ int kvm_init_vcpu(CPUState *cpu, Error **errp)
>>       if (ret < 0) {
>>           error_setg_errno(errp, -ret, "kvm_init_vcpu: kvm_get_vcpu failed (%lu)",
>>                            kvm_arch_vcpu_id(cpu));
>> +        cpu->kvm_state = NULL;
>>           goto err;
>>       }
>>   
>>       cpu->kvm_fd = ret;
>> -    cpu->kvm_state = s;
>>       cpu->vcpu_dirty = true;
>>       cpu->dirty_pages = 0;
>>       cpu->throttle_us_per_full = 0;
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index dafe4d262977..fc840653ceb6 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -2268,6 +2268,15 @@ int kvm_arch_init_vcpu(CPUState *cs)
>>       return r;
>>   }
>>   
>> +int kvm_arch_pre_create_vcpu(CPUState *cpu, Error **errp)
>> +{
>> +    if (is_tdx_vm()) {
>> +        return tdx_pre_create_vcpu(cpu, errp);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>   int kvm_arch_destroy_vcpu(CPUState *cs)
>>   {
>>       X86CPU *cpu = X86_CPU(cs);
>> diff --git a/target/i386/kvm/tdx-stub.c b/target/i386/kvm/tdx-stub.c
>> index 1d866d5496bf..3877d432a397 100644
>> --- a/target/i386/kvm/tdx-stub.c
>> +++ b/target/i386/kvm/tdx-stub.c
>> @@ -6,3 +6,8 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
>>   {
>>       return -EINVAL;
>>   }
>> +
>> +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
>> +{
>> +    return -EINVAL;
>> +}
>> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
>> index 1f5d8117d1a9..122a37c93de3 100644
>> --- a/target/i386/kvm/tdx.c
>> +++ b/target/i386/kvm/tdx.c
>> @@ -467,6 +467,49 @@ int tdx_kvm_init(MachineState *ms, Error **errp)
>>       return 0;
>>   }
>>   
>> +int tdx_pre_create_vcpu(CPUState *cpu, Error **errp)
>> +{
>> +    MachineState *ms = MACHINE(qdev_get_machine());
>> +    X86CPU *x86cpu = X86_CPU(cpu);
>> +    CPUX86State *env = &x86cpu->env;
>> +    struct kvm_tdx_init_vm *init_vm;
> 
> Mark this as auto-free to avoid the g_free() requirement
> 
>    g_autofree  struct kvm_tdx_init_vm *init_vm = NULL;
> 
>> +    int r = 0;
>> +
>> +    qemu_mutex_lock(&tdx_guest->lock);
> 
>     QEMU_LOCK_GUARD(&tdx_guest->lock);
> 
> to eliminate the mutex_unlock requirement, thus eliminating all
> 'goto' jumps and label targets, in favour of a plain 'return -1'
> everywhere.
> 

Learned!

thanks!
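
For readers unfamiliar with the two helpers suggested above: both
g_autofree and QEMU_LOCK_GUARD() build on the compiler's cleanup
attribute, which runs a handler when a variable leaves scope. Below is
a minimal standalone sketch of that mechanism using plain pthreads and
malloc() instead of the GLib/QEMU macros. The names autofree,
lock_guard() and guarded_work() are hypothetical and exist only for
this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>

/* Cleanup handlers receive a pointer to the variable leaving scope. */
static void free_ptr(void *p)
{
    free(*(void **)p);
}

static void unlock_mutex(pthread_mutex_t **m)
{
    if (*m) {
        pthread_mutex_unlock(*m);
    }
}

/* Hypothetical stand-ins for g_autofree and QEMU_LOCK_GUARD(). */
#define autofree __attribute__((cleanup(free_ptr)))
#define lock_guard(m) \
    pthread_mutex_t *guard_ __attribute__((cleanup(unlock_mutex), unused)) = \
        (pthread_mutex_lock(m), (m))

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Every return path releases the lock and frees the buffer
 * automatically, so no 'goto out' label is needed.
 */
static int guarded_work(size_t size, int fail)
{
    autofree char *buf = NULL;

    lock_guard(&demo_lock);

    buf = malloc(size);
    if (!buf) {
        return -1;
    }
    memset(buf, 0, size);

    if (fail) {
        return -1;      /* early exit: both cleanups still run */
    }
    return 0;
}
```

Cleanups run in reverse declaration order, so in this sketch the mutex
is released before the buffer is freed.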



* Re: [PATCH v3 08/70] physmem: replace function name with __func__ in ram_block_discard_range()
  2023-12-04  7:40     ` Xiaoyao Li
@ 2023-12-04  9:49       ` David Hildenbrand
  0 siblings, 0 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-12-04  9:49 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 04.12.23 08:40, Xiaoyao Li wrote:
> On 11/16/2023 2:21 AM, David Hildenbrand wrote:
>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>> Use __func__ to avoid hard-coded function name.
>>>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> ---
>>
>> That can be queued independently.
> 
> Will you queue it for 9.0? for someone else?
> 
> Do I need to send it separately?

Probably best to just send it as a separate cleanup. Likely, Paolo will 
queue it. If not, I can do it.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v3 07/70] physmem: Relax the alignment check of host_startaddr in ram_block_discard_range()
  2023-12-04  7:53           ` Xiaoyao Li
@ 2023-12-04  9:52             ` David Hildenbrand
  0 siblings, 0 replies; 161+ messages in thread
From: David Hildenbrand @ 2023-12-04  9:52 UTC (permalink / raw)
  To: Xiaoyao Li, Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 04.12.23 08:53, Xiaoyao Li wrote:
> On 12/4/2023 3:35 PM, Xiaoyao Li wrote:
>> On 11/20/2023 5:56 PM, David Hildenbrand wrote:
>>> On 16.11.23 03:56, Xiaoyao Li wrote:
>>>> On 11/16/2023 2:20 AM, David Hildenbrand wrote:
>>>>> On 15.11.23 08:14, Xiaoyao Li wrote:
>>>>>> Commit d3a5038c461 ("exec: ram_block_discard_range") introduced
>>>>>> ram_block_discard_range() which grabs some code from
>>>>>> ram_discard_range(). However, during code movement, it changed
>>>>>> alignment
>>>>>> check of host_startaddr from qemu_host_page_size to rb->page_size.
>>>>>>
>>>>>> When ramblock is back'ed by hugepage, it requires the startaddr to be
>>>>>> huge page size aligned, which is a overkill. e.g., TDX's
>>>>>> private-shared
>>>>>> page conversion is done at 4KB granularity. Shared page is discarded
>>>>>> when it gets converts to private and when shared page back'ed by
>>>>>> hugepage it is going to fail on this check.
>>>>>>
>>>>>> So change to alignment check back to qemu_host_page_size.
>>>>>>
>>>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>>>> ---
>>>>>> Changes in v3:
>>>>>>     - Newly added in v3;
>>>>>> ---
>>>>>>     system/physmem.c | 2 +-
>>>>>>     1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>>>> index c56b17e44df6..8a4e42c7cf60 100644
>>>>>> --- a/system/physmem.c
>>>>>> +++ b/system/physmem.c
>>>>>> @@ -3532,7 +3532,7 @@ int ram_block_discard_range(RAMBlock *rb,
>>>>>> uint64_t start, size_t length)
>>>>>>         uint8_t *host_startaddr = rb->host + start;
>>>>>> -    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, rb->page_size)) {
>>>>>> +    if (!QEMU_PTR_IS_ALIGNED(host_startaddr, qemu_host_page_size)) {
>>>>>
>>>>> For your use cases, rb->page_size should always match
>>>>> qemu_host_page_size.
>>>>>
>>>>> IIRC, we only set rb->page_size to different values for hugetlb. And
>>>>> guest_memfd does not support hugetlb.
>>>>>
>>>>> Even if QEMU is using THP, rb->page_size should 4k.
>>>>>
>>>>> Please elaborate how you can actually trigger that. From what I recall,
>>>>> guest_memfd is not compatible with hugetlb.
>>>>
>>>> It's the shared memory that can be back'ed by hugetlb.
>>>
>>> Serious question: does that configuration make any sense to support at
>>> this point? I claim: no.
>>>
>>>>
>>>> Later patch 9 introduces ram_block_convert_page(), which will discard
>>>> shared memory when it gets converted to private. TD guest can request
>>>> convert a 4K to private while the page is previously back'ed by hugetlb
>>>> as 2M shared page.
>>>
>>> So you can call ram_block_discard_guest_memfd_range() on subpage
>>> basis, but not ram_block_discard_range().
>>>
>>> ram_block_convert_range() would have to thought that that
>>> (questionable) combination of hugetlb for shmem and ordinary pages for
>>> guest_memfd cannot discard shared memory.
>>>
>>> And it probably shouldn't either way. There are other problems when
>>> not using hugetlb along with preallocation.
>>
>> If I understand correctly, preallocation needs to be enabled for
>> hugetlb. And in preallocation case, it doesn't need to discard memory.
>> Is it correct?

Yes. The downside is that we'll end up with double memory consumption.
But if and how to optimize that in this case can be left for future work.

>>
>>> The check in ram_block_discard_range() is correct, whoever ends up
>>> calling it has to stop calling it.
>>>
>>   > So, I need add logic to ram_block_discard_page() that if the size of
> 
> Sorry, I made a typo.
> 
> Correct myself, s/ram_block_discard_page()/ram_block_convert_range()

Yes, just leave any shared memory backend that uses hugetlb alone.
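
To make that conclusion concrete, here is a rough sketch of the
decision ram_block_convert_range() would have to make: discard the
shared side only when the backend uses ordinary host pages, and leave
it alone otherwise. The struct and helper names (fake_ramblock,
shared_discard_allowed()) are made up for illustration and are not
QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the RAMBlock field that matters. */
struct fake_ramblock {
    size_t page_size;   /* backing page size, e.g. 2 MiB for hugetlb */
};

/*
 * Shared memory may be discarded on conversion to private only when the
 * backend uses ordinary host pages. A hugetlb backend is left untouched,
 * which costs memory (shared and private copies both stay allocated) but
 * keeps the alignment check in ram_block_discard_range() intact.
 */
static bool shared_discard_allowed(const struct fake_ramblock *rb,
                                   size_t host_page_size)
{
    return rb->page_size == host_page_size;
}
```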

-- 
Cheers,

David / dhildenb



* Re: [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES
  2023-11-17 21:18   ` Isaku Yamahata
@ 2023-12-07  7:16     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-07  7:16 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata

On 11/18/2023 5:18 AM, Isaku Yamahata wrote:
> On Wed, Nov 15, 2023 at 02:14:27AM -0500,
> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> 
>> KVM provides TDX capabilities via sub command KVM_TDX_CAPABILITIES of
>> IOCTL(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing
>> TDX context. It will be used to validate user's setting later.
>>
>> Since there is no interface reporting how many cpuid configs contains in
>> KVM_TDX_CAPABILITIES, QEMU chooses to try starting with a known number
>> and abort when it exceeds KVM_MAX_CPUID_ENTRIES.
>>
>> Besides, introduce the interfaces to invoke TDX "ioctls" at different
>> scope (KVM, VM and VCPU) in preparation.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>> - rename __tdx_ioctl() to tdx_ioctl_internal()
>> - Pass errp in get_tdx_capabilities();
>>
>> changes in v2:
>>    - Make the error message more clear;
>>
>> changes in v1:
>>    - start from nr_cpuid_configs = 6 for the loop;
>>    - stop the loop when nr_cpuid_configs exceeds KVM_MAX_CPUID_ENTRIES;
>> ---
>>   target/i386/kvm/kvm.c      |   2 -
>>   target/i386/kvm/kvm_i386.h |   2 +
>>   target/i386/kvm/tdx.c      | 102 ++++++++++++++++++++++++++++++++++++-
>>   3 files changed, 103 insertions(+), 3 deletions(-)
>>
>> diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c
>> index 7abcdebb1452..28e60c5ea4a7 100644
>> --- a/target/i386/kvm/kvm.c
>> +++ b/target/i386/kvm/kvm.c
>> @@ -1687,8 +1687,6 @@ static int hyperv_init_vcpu(X86CPU *cpu)
>>   
>>   static Error *invtsc_mig_blocker;
>>   
>> -#define KVM_MAX_CPUID_ENTRIES  100
>> -
>>   static void kvm_init_xsave(CPUX86State *env)
>>   {
>>       if (has_xsave2) {
>> diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h
>> index 55fb25fa8e2e..c3ef46a97a7b 100644
>> --- a/target/i386/kvm/kvm_i386.h
>> +++ b/target/i386/kvm/kvm_i386.h
>> @@ -13,6 +13,8 @@
>>   
>>   #include "sysemu/kvm.h"
>>   
>> +#define KVM_MAX_CPUID_ENTRIES  100
>> +
>>   #ifdef CONFIG_KVM
>>   
>>   #define kvm_pit_in_kernel() \
>> diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c
>> index 621a05beeb4e..cb0040187b27 100644
>> --- a/target/i386/kvm/tdx.c
>> +++ b/target/i386/kvm/tdx.c
>> @@ -12,17 +12,117 @@
>>    */
>>   
>>   #include "qemu/osdep.h"
>> +#include "qemu/error-report.h"
>>   #include "qapi/error.h"
>>   #include "qom/object_interfaces.h"
>> +#include "sysemu/kvm.h"
>>   
>>   #include "hw/i386/x86.h"
>> +#include "kvm_i386.h"
>>   #include "tdx.h"
>>   
>> +static struct kvm_tdx_capabilities *tdx_caps;
>> +
>> +enum tdx_ioctl_level{
>> +    TDX_PLATFORM_IOCTL,
>> +    TDX_VM_IOCTL,
>> +    TDX_VCPU_IOCTL,
>> +};
>> +
>> +static int tdx_ioctl_internal(void *state, enum tdx_ioctl_level level, int cmd_id,
>> +                        __u32 flags, void *data)
>> +{
>> +    struct kvm_tdx_cmd tdx_cmd;
>> +    int r;
>> +
>> +    memset(&tdx_cmd, 0x0, sizeof(tdx_cmd));
>> +
>> +    tdx_cmd.id = cmd_id;
>> +    tdx_cmd.flags = flags;
>> +    tdx_cmd.data = (__u64)(unsigned long)data;
>> +
>> +    switch (level) {
>> +    case TDX_PLATFORM_IOCTL:
>> +        r = kvm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
>> +        break;
>> +    case TDX_VM_IOCTL:
>> +        r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
>> +        break;
>> +    case TDX_VCPU_IOCTL:
>> +        r = kvm_vcpu_ioctl(state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd);
>> +        break;
>> +    default:
>> +        error_report("Invalid tdx_ioctl_level %d", level);
>> +        exit(1);
>> +    }
>> +
>> +    return r;
>> +}
>> +
>> +static inline int tdx_platform_ioctl(int cmd_id, __u32 flags, void *data)
>> +{
>> +    return tdx_ioctl_internal(NULL, TDX_PLATFORM_IOCTL, cmd_id, flags, data);
>> +}
>> +
>> +static inline int tdx_vm_ioctl(int cmd_id, __u32 flags, void *data)
>> +{
>> +    return tdx_ioctl_internal(NULL, TDX_VM_IOCTL, cmd_id, flags, data);
>> +}
>> +
>> +static inline int tdx_vcpu_ioctl(void *vcpu_fd, int cmd_id, __u32 flags,
>> +                                 void *data)
>> +{
>> +    return  tdx_ioctl_internal(vcpu_fd, TDX_VCPU_IOCTL, cmd_id, flags, data);
>> +}
> 
> As all of ioctl variants aren't used yet, we can split out them. 

No. tdx_vm_ioctl() is used right below.

I can remove the tdx_platform_ioctl() because its sole user, 
KVM_TDX_CAPABILITIES, changed to vm scope.

> An independent
> patch to define ioctl functions.
> 
> 
>> +
>> +static int get_tdx_capabilities(Error **errp)
>> +{
>> +    struct kvm_tdx_capabilities *caps;
>> +    /* 1st generation of TDX reports 6 cpuid configs */
>> +    int nr_cpuid_configs = 6;
>> +    size_t size;
>> +    int r;
>> +
>> +    do {
>> +        size = sizeof(struct kvm_tdx_capabilities) +
>> +               nr_cpuid_configs * sizeof(struct kvm_tdx_cpuid_config);
>> +        caps = g_malloc0(size);
>> +        caps->nr_cpuid_configs = nr_cpuid_configs;
>> +
>> +        r = tdx_vm_ioctl(KVM_TDX_CAPABILITIES, 0, caps);
>> +        if (r == -E2BIG) {
>> +            g_free(caps);
>> +            nr_cpuid_configs *= 2;
> 
> g_realloc()?  Maybe a matter of preference.

I would like to keep the current code unless there is a strong objection
from the maintainers.
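
For reference, the grow-and-retry pattern under discussion looks
roughly like the sketch below. It is standalone: fake_query() is a
hypothetical stand-in for the KVM_TDX_CAPABILITIES ioctl, and struct
caps is a simplified placeholder for struct kvm_tdx_capabilities. Only
the starting value of 6 and the bound of 100 (KVM_MAX_CPUID_ENTRIES)
are taken from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_ENTRIES 100   /* mirrors KVM_MAX_CPUID_ENTRIES in the patch */

struct caps {
    uint32_t nr_entries;
    uint64_t entries[];   /* flexible array, sized by the caller */
};

/* Hypothetical kernel side: needs room for 'required' entries. */
static int fake_query(struct caps *c, uint32_t required)
{
    if (c->nr_entries < required) {
        return -E2BIG;
    }
    c->nr_entries = required;
    return 0;
}

/* Returns a filled buffer owned by the caller, or NULL on failure. */
static struct caps *get_caps(uint32_t required)
{
    uint32_t nr = 6;      /* first guess, as in the patch */

    for (;;) {
        struct caps *c = calloc(1, sizeof(*c) + nr * sizeof(c->entries[0]));
        int r;

        if (!c) {
            return NULL;
        }
        c->nr_entries = nr;

        r = fake_query(c, required);
        if (r == 0) {
            return c;
        }
        free(c);
        if (r != -E2BIG) {
            return NULL;  /* any other error is fatal */
        }
        nr *= 2;
        if (nr > MAX_ENTRIES) {
            return NULL;  /* give up rather than grow without bound */
        }
    }
}
```

Freeing and allocating again, as here, means each attempt starts from a
zeroed buffer, which a plain realloc would not guarantee.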

> Other than this, it looks good to me.



* Re: [PATCH v3 18/70] i386/tdx: Get tdx_capabilities via KVM_TDX_CAPABILITIES
  2023-11-15 10:54   ` Daniel P. Berrangé
@ 2023-12-07  7:18     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-07  7:18 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/15/2023 6:54 PM, Daniel P. Berrangé wrote:
>> +static int tdx_ioctl_internal(void *state, enum tdx_ioctl_level level, int cmd_id,
>> +                        __u32 flags, void *data)
>> +{
>> +    struct kvm_tdx_cmd tdx_cmd;
> Add   ' = {}'  to initialize to all-zeros, avoiding the explicit
> memset call

Thanks for the suggestion. I will do it in the next version.
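
As a small illustration of the suggestion, the sketch below puts the
two forms side by side on a hypothetical struct fake_cmd (a
placeholder, not the real struct kvm_tdx_cmd layout). Note that the
empty-brace initializer is a GNU C extension before C23; strictly
standard C would spell it '= {0}'.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical command struct, for illustration only. */
struct fake_cmd {
    uint32_t id;
    uint32_t flags;
    uint64_t data;
};

/* Explicit memset, as in the v3 patch. */
static struct fake_cmd make_cmd_memset(uint32_t id)
{
    struct fake_cmd cmd;

    memset(&cmd, 0, sizeof(cmd));
    cmd.id = id;
    return cmd;
}

/* Initializer form: all members start out as zero. */
static struct fake_cmd make_cmd_init(uint32_t id)
{
    struct fake_cmd cmd = {};

    cmd.id = id;
    return cmd;
}
```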


* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-12-01 11:02   ` Markus Armbruster
@ 2023-12-07  7:38     ` Xiaoyao Li
  2023-12-07  9:20       ` Markus Armbruster
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-07  7:38 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On 12/1/2023 7:02 PM, Markus Armbruster wrote:
> Xiaoyao Li <xiaoyao.li@intel.com> writes:
> 
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> For GetQuote, delegate a request to Quote Generation Service.
>> Add property "quote-generation-socket" to tdx-guest, whihc is a property
>> of type SocketAddress to specify Quote Generation Service(QGS).
>>
>> On request, connect to the QGS, read request buffer from shared guest
>> memory, send the request buffer to the server and store the response
>> into shared guest memory and notify TD guest by interrupt.
>>
>> command line example:
>>    qemu-system-x86_64 \
>>      -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>>      -machine confidential-guest-support=tdx0
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>> - rename property "quote-generation-service" to "quote-generation-socket";
>> - change the type of "quote-generation-socket" from str to
>>    SocketAddress;
>> - squash next patch into this one;
>> ---
>>   qapi/qom.json         |   5 +-
>>   target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>>   target/i386/kvm/tdx.h |   6 +
>>   3 files changed, 440 insertions(+), 1 deletion(-)
>>
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index fd99aa1ff8cc..cf36a1832ddd 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -894,13 +894,16 @@
>>   #
>>   # @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
>>   #
>> +# @quote-generation-socket: socket address for Quote Generation Service(QGS)
>> +#
> 
> Long line.  Better:
> 
>     # @quote-generation-socket: socket address for Quote Generation
>     #     Service(QGS)

May I ask what the line-length limit is for qom.json, if the 80-column
limit doesn't apply to it?

>>   # Since: 8.2
>>   ##
>>   { 'struct': 'TdxGuestProperties',
>>     'data': { '*sept-ve-disable': 'bool',
>>               '*mrconfigid': 'str',
>>               '*mrowner': 'str',
>> -            '*mrownerconfig': 'str' } }
>> +            '*mrownerconfig': 'str',
>> +            '*quote-generation-socket': 'SocketAddress' } }
>>   
>>   ##
>>   # @ThreadContextProperties:
> 



* Re: [PATCH v3 57/70] i386/tdx: Wire TDX_REPORT_FATAL_ERROR with GuestPanic facility
  2023-12-01 11:11   ` Markus Armbruster
@ 2023-12-07  8:11     ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-07  8:11 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On 12/1/2023 7:11 PM, Markus Armbruster wrote:
> Xiaoyao Li <xiaoyao.li@intel.com> writes:
> 
>> Integrate TDX's TDX_REPORT_FATAL_ERROR into QEMU GuestPanic facility
>>
>> Originated-from: Isaku Yamahata <isaku.yamahata@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes from v2:
>> - Add docmentation of new type and struct (Daniel)
>> - refine the error message handling (Daniel)
>> ---
>>   qapi/run-state.json   | 27 ++++++++++++++++++++--
>>   system/runstate.c     | 54 +++++++++++++++++++++++++++++++++++++++++++
>>   target/i386/kvm/tdx.c | 24 +++++++++++++++++--
>>   3 files changed, 101 insertions(+), 4 deletions(-)
>>
>> diff --git a/qapi/run-state.json b/qapi/run-state.json
>> index f216ba54ec4c..e18f62eaef77 100644
>> --- a/qapi/run-state.json
>> +++ b/qapi/run-state.json
>> @@ -496,10 +496,12 @@
>>   #
>>   # @s390: s390 guest panic information type (Since: 2.12)
>>   #
>> +# @tdx: tdx guest panic information type (Since: 8.2)
>> +#
>>   # Since: 2.9
>>   ##
>>   { 'enum': 'GuestPanicInformationType',
>> -  'data': [ 'hyper-v', 's390' ] }
>> +  'data': [ 'hyper-v', 's390', 'tdx' ] }
>>   
>>   ##
>>   # @GuestPanicInformation:
>> @@ -514,7 +516,8 @@
>>    'base': {'type': 'GuestPanicInformationType'},
>>    'discriminator': 'type',
>>    'data': {'hyper-v': 'GuestPanicInformationHyperV',
>> -          's390': 'GuestPanicInformationS390'}}
>> +          's390': 'GuestPanicInformationS390',
>> +          'tdx' : 'GuestPanicInformationTdx'}}
>>   
>>   ##
>>   # @GuestPanicInformationHyperV:
>> @@ -577,6 +580,26 @@
>>             'psw-addr': 'uint64',
>>             'reason': 'S390CrashReason'}}
>>   
>> +##
>> +# @GuestPanicInformationTdx:
>> +#
>> +# TDX GHCI TDG.VP.VMCALL<ReportFatalError> specific guest panic information
> 
> Long line.  Suggest
> 
>     # Guest panic information specific to TDX GHCI
>     # TDG.VP.VMCALL<ReportFatalError>.

As I asked on patch #52, what is the length limit for a line?

>> +#
>> +# @error-code: TD-specific error code
>> +#
>> +# @gpa: 4KB-aligned guest physical address of the page that containing
>> +#     additional error data
> 
> "address of a page" implies the address is page-aligned.  4KB-aligned
> feels redundant.  What about
> 
>     # @qpa: guest-physical address of a page that contains additional
>     #     error data.
> 
> But in what format is the "additional error data"?

It's expected to hold a zero-terminated string.

>> +#
>> +# @message: TD guest provided message string.  (It's not so trustable
>> +#     and cannot be assumed to be well formed because it comes from guest)
> 
> guest-provided
> 
> For "well-formed" to make sense, we'd need an idea of the form / syntax.
> 
> If it's a human-readable error message, we could go with
> 
>     # @message: Human-readable error message provided by the guest.  Not
>     #     to be trusted.
>

Looks good. I will use your version.
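
Since the message comes from the guest, a consumer would want to bound
and sanitize it before logging it. A minimal sketch of one way to do
that follows. copy_guest_message() and its replacement policy are an
illustration only, not what the patch implements.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy at most dst_size - 1 bytes from an untrusted buffer of src_len
 * bytes, stop at the first NUL, replace non-printable bytes with '?',
 * and always NUL-terminate the result. Returns the copied length.
 */
static size_t copy_guest_message(char *dst, size_t dst_size,
                                 const char *src, size_t src_len)
{
    size_t i;

    if (dst_size == 0) {
        return 0;
    }

    for (i = 0; i < src_len && i < dst_size - 1 && src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];

        dst[i] = isprint(c) ? (char)c : '?';
    }
    dst[i] = '\0';
    return i;
}
```

The explicit src_len bound matters because the guest is not obliged to
terminate the string at all.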

>> +#
>> +# Since: 8.2
>> +##
>> +{'struct': 'GuestPanicInformationTdx',
>> + 'data': {'error-code': 'uint64',
>> +          'gpa': 'uint64',
>> +          'message': 'str'}}
>> +
>>   ##
>>   # @MEMORY_FAILURE:
>>   #
> 
> [...]
> 



* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-12-07  7:38     ` Xiaoyao Li
@ 2023-12-07  9:20       ` Markus Armbruster
  0 siblings, 0 replies; 161+ messages in thread
From: Markus Armbruster @ 2023-12-07  9:20 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Markus Armbruster, Paolo Bonzini, David Hildenbrand,
	Igor Mammedov, Michael S . Tsirkin, Marcel Apfelbaum,
	Richard Henderson, Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> On 12/1/2023 7:02 PM, Markus Armbruster wrote:
>> Xiaoyao Li <xiaoyao.li@intel.com> writes:
>> 
>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>
>>> For GetQuote, delegate a request to Quote Generation Service.
>>> Add property "quote-generation-socket" to tdx-guest, whihc is a property
>>> of type SocketAddress to specify Quote Generation Service(QGS).
>>>
>>> On request, connect to the QGS, read request buffer from shared guest
>>> memory, send the request buffer to the server and store the response
>>> into shared guest memory and notify TD guest by interrupt.
>>>
>>> command line example:
>>>    qemu-system-x86_64 \
>>>      -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>>>      -machine confidential-guest-support=tdx0
>>>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> ---
>>> Changes in v3:
>>> - rename property "quote-generation-service" to "quote-generation-socket";
>>> - change the type of "quote-generation-socket" from str to
>>>    SocketAddress;
>>> - squash next patch into this one;
>>> ---
>>>   qapi/qom.json         |   5 +-
>>>   target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>>>   target/i386/kvm/tdx.h |   6 +
>>>   3 files changed, 440 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>> index fd99aa1ff8cc..cf36a1832ddd 100644
>>> --- a/qapi/qom.json
>>> +++ b/qapi/qom.json
>>> @@ -894,13 +894,16 @@
>>>   #
>>>   # @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
>>>   #
>>> +# @quote-generation-socket: socket address for Quote Generation Service(QGS)
>>> +#
>> Long line.  Better:
>>     # @quote-generation-socket: socket address for Quote Generation
>>     #     Service(QGS)
>
> May I ask what's the limitation for qom.json? if 80 columns limitation doesn't apply to it.

docs/devel/qapi-code-gen.rst section "Documentation markup":

    For legibility, wrap text paragraphs so every line is at most 70
    characters long.

Why is this not 80?  Humans tend to have trouble following long lines
with their eyes (I sure do).  Typographic manuals suggest to limit
columns to roughly 60 characters for exactly that reason[*].

For code, four levels of indentation plus 60 characters of actual text
yields 76.  However, code lines can be awkward to break, and going over
80 can be less bad than an awkward line break.  Use your judgement.

Documentation text, however, tends to be indented much less: 6-10
characters of indentation plus 60 of actual text yields 66-70.  When I
reflowed the entire QAPI schema documentation to stay within that limit
(commit a937b6aa739), not a single line break was awkward.
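
A crude way to spot offenders is to scan for lines longer than the
limit. The helper below is only an illustration: count_long_lines() is
hypothetical and not part of the QAPI tooling, and it counts bytes, so
non-ASCII text would be over-counted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define QAPI_DOC_MAX_COLS 70   /* limit quoted from qapi-code-gen.rst */

/* Count lines in a NUL-terminated text that exceed max_cols bytes. */
static size_t count_long_lines(const char *text, size_t max_cols)
{
    size_t count = 0;

    while (*text) {
        size_t len = strcspn(text, "\n");   /* length up to end of line */

        if (len > max_cols) {
            count++;
        }
        text += len;
        if (*text == '\n') {
            text++;
        }
    }
    return count;
}
```

Run over the doc comment from the patch, this would flag the
@quote-generation-socket line and accept the wrapped form.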

>>>   # Since: 8.2
>>>   ##
>>>   { 'struct': 'TdxGuestProperties',
>>>     'data': { '*sept-ve-disable': 'bool',
>>>               '*mrconfigid': 'str',
>>>               '*mrowner': 'str',
>>> -            '*mrownerconfig': 'str' } }
>>> +            '*mrownerconfig': 'str',
>>> +            '*quote-generation-socket': 'SocketAddress' } }
>>>     ##
>>>   # @ThreadContextProperties:
>> 

[*] https://en.wikipedia.org/wiki/Column_(typography)#Typographic_style



* Re: [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion
  2023-11-17 21:03   ` Isaku Yamahata
@ 2023-12-08  7:59     ` Xiaoyao Li
  2023-12-08 11:52       ` David Hildenbrand
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-08  7:59 UTC (permalink / raw)
  To: Isaku Yamahata
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata

On 11/18/2023 5:03 AM, Isaku Yamahata wrote:
> On Wed, Nov 15, 2023 at 02:14:18AM -0500,
> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
> 
>> It's used for discarding opposite memory after memory conversion, for
>> confidential guest.
>>
>> When page is converted from shared to private, the original shared
>> memory can be discarded via ram_block_discard_range();
>>
>> When page is converted from private to shared, the original private
>> memory is back'ed by guest_memfd. Introduce
>> ram_block_discard_guest_memfd_range() for discarding memory in
>> guest_memfd.
>>
>> Originally-from: Isaku Yamahata <isaku.yamahata@intel.com>
>> Codeveloped-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>>   include/exec/cpu-common.h |  2 ++
>>   system/physmem.c          | 50 +++++++++++++++++++++++++++++++++++++++
>>   2 files changed, 52 insertions(+)
>>
>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>> index 41115d891940..de728a18eef2 100644
>> --- a/include/exec/cpu-common.h
>> +++ b/include/exec/cpu-common.h
>> @@ -175,6 +175,8 @@ typedef int (RAMBlockIterFunc)(RAMBlock *rb, void *opaque);
>>   
>>   int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
>>   int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
>> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
>> +                            bool shared_to_private);
>>   
>>   #endif
>>   
>> diff --git a/system/physmem.c b/system/physmem.c
>> index ddfecddefcd6..cd6008fa09ad 100644
>> --- a/system/physmem.c
>> +++ b/system/physmem.c
>> @@ -3641,6 +3641,29 @@ err:
>>       return ret;
>>   }
>>   
>> +static int ram_block_discard_guest_memfd_range(RAMBlock *rb, uint64_t start,
>> +                                               size_t length)
>> +{
>> +    int ret = -1;
>> +
>> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>> +    ret = fallocate(rb->guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>> +                    start, length);
>> +
>> +    if (ret) {
>> +        ret = -errno;
>> +        error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
>> +                     __func__, rb->idstr, start, length, ret);
>> +    }
>> +#else
>> +    ret = -ENOSYS;
>> +    error_report("%s: fallocate not available %s:%" PRIx64 " +%zx (%d)",
>> +                 __func__, rb->idstr, start, length, ret);
>> +#endif
>> +
>> +    return ret;
>> +}
>> +
>>   bool ramblock_is_pmem(RAMBlock *rb)
>>   {
>>       return rb->flags & RAM_PMEM;
>> @@ -3828,3 +3851,30 @@ bool ram_block_discard_is_required(void)
>>       return qatomic_read(&ram_block_discard_required_cnt) ||
>>              qatomic_read(&ram_block_coordinated_discard_required_cnt);
>>   }
>> +
>> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
>> +                            bool shared_to_private)
>> +{
>> +    if (!rb || rb->guest_memfd < 0) {
>> +        return -1;
>> +    }
>> +
>> +    if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
>> +        !QEMU_PTR_IS_ALIGNED(length, qemu_host_page_size)) {
>> +        return -1;
>> +    }
>> +
>> +    if (!length) {
>> +        return -1;
>> +    }
>> +
>> +    if (start + length > rb->max_length) {
>> +        return -1;
>> +    }
>> +
>> +    if (shared_to_private) {
>> +        return ram_block_discard_range(rb, start, length);
>> +    } else {
>> +        return ram_block_discard_guest_memfd_range(rb, start, length);
>> +    }
>> +}
> 
> Originally this function issued KVM_SET_MEMORY_ATTRIBUTES, the function name
> mad sense. But now it doesn't, and it issues only punch hole. We should rename
> it to represent what it actually does. discard_range?

ram_block_discard_range() already exists for non-guest-memfd memory discard.

I cannot come up with a proper name. One candidate is
ram_block_discard_opposite_range(), but *opposite* seems unclear.

Do you have any better idea?
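
Whatever the final name, the argument checks the function performs
before discarding can be sketched in isolation. The helper below is
hypothetical (range_is_valid() is not a QEMU function) and assumes a
power-of-two page size. It also writes the bounds check so that
start + length cannot wrap around, which the posted
'start + length > rb->max_length' comparison does not guard against by
itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if v is a multiple of align; align must be a power of two. */
static bool is_aligned(uint64_t v, uint64_t align)
{
    return (v & (align - 1)) == 0;
}

/*
 * Mirrors the checks in the posted ram_block_convert_range():
 * page-aligned start and length, non-zero length, and a range that
 * lies inside the block.
 */
static bool range_is_valid(uint64_t start, uint64_t length,
                           uint64_t max_length, uint64_t page_size)
{
    if (!is_aligned(start, page_size) || !is_aligned(length, page_size)) {
        return false;
    }
    if (length == 0) {
        return false;
    }
    /* Overflow-safe form of: start + length > max_length */
    if (start > max_length || length > max_length - start) {
        return false;
    }
    return true;
}
```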


* Re: [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion
  2023-12-08  7:59     ` Xiaoyao Li
@ 2023-12-08 11:52       ` David Hildenbrand
  2023-12-21  6:18         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: David Hildenbrand @ 2023-12-08 11:52 UTC (permalink / raw)
  To: Xiaoyao Li, Isaku Yamahata
  Cc: Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata

On 08.12.23 08:59, Xiaoyao Li wrote:
> On 11/18/2023 5:03 AM, Isaku Yamahata wrote:
>> On Wed, Nov 15, 2023 at 02:14:18AM -0500,
>> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>
>>> It's used for discarding opposite memory after memory conversion, for
>>> confidential guest.
>>>
>>> When page is converted from shared to private, the original shared
>>> memory can be discarded via ram_block_discard_range();
>>>
>>> When page is converted from private to shared, the original private
>>> memory is backed by guest_memfd. Introduce
>>> ram_block_discard_guest_memfd_range() for discarding memory in
>>> guest_memfd.
>>>
>>> Originally-from: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Codeveloped-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> ---
>>>    include/exec/cpu-common.h |  2 ++
>>>    system/physmem.c          | 50 +++++++++++++++++++++++++++++++++++++++
>>>    2 files changed, 52 insertions(+)
>>>
>>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>>> index 41115d891940..de728a18eef2 100644
>>> --- a/include/exec/cpu-common.h
>>> +++ b/include/exec/cpu-common.h
>>> @@ -175,6 +175,8 @@ typedef int (RAMBlockIterFunc)(RAMBlock *rb, void *opaque);
>>>    
>>>    int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
>>>    int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
>>> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
>>> +                            bool shared_to_private);
>>>    
>>>    #endif
>>>    
>>> diff --git a/system/physmem.c b/system/physmem.c
>>> index ddfecddefcd6..cd6008fa09ad 100644
>>> --- a/system/physmem.c
>>> +++ b/system/physmem.c
>>> @@ -3641,6 +3641,29 @@ err:
>>>        return ret;
>>>    }
>>>    
>>> +static int ram_block_discard_guest_memfd_range(RAMBlock *rb, uint64_t start,
>>> +                                               size_t length)
>>> +{
>>> +    int ret = -1;
>>> +
>>> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>> +    ret = fallocate(rb->guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>> +                    start, length);
>>> +
>>> +    if (ret) {
>>> +        ret = -errno;
>>> +        error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
>>> +                     __func__, rb->idstr, start, length, ret);
>>> +    }
>>> +#else
>>> +    ret = -ENOSYS;
>>> +    error_report("%s: fallocate not available %s:%" PRIx64 " +%zx (%d)",
>>> +                 __func__, rb->idstr, start, length, ret);
>>> +#endif
>>> +
>>> +    return ret;
>>> +}
>>> +
>>>    bool ramblock_is_pmem(RAMBlock *rb)
>>>    {
>>>        return rb->flags & RAM_PMEM;
>>> @@ -3828,3 +3851,30 @@ bool ram_block_discard_is_required(void)
>>>        return qatomic_read(&ram_block_discard_required_cnt) ||
>>>               qatomic_read(&ram_block_coordinated_discard_required_cnt);
>>>    }
>>> +
>>> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
>>> +                            bool shared_to_private)
>>> +{
>>> +    if (!rb || rb->guest_memfd < 0) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
>>> +        !QEMU_PTR_IS_ALIGNED(length, qemu_host_page_size)) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    if (!length) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    if (start + length > rb->max_length) {
>>> +        return -1;
>>> +    }
>>> +
>>> +    if (shared_to_private) {
>>> +        return ram_block_discard_range(rb, start, length);
>>> +    } else {
>>> +        return ram_block_discard_guest_memfd_range(rb, start, length);
>>> +    }
>>> +}
>>
>> Originally this function issued KVM_SET_MEMORY_ATTRIBUTES, so the function
>> name made sense. But now it doesn't, and it only punches a hole. We should
>> rename it to represent what it actually does. discard_range?
> 
> ram_block_discard_range() already exists for non-guest-memfd memory discard.
> 
> I cannot come up with a proper name. e.g.,
> ram_block_discard_opposite_range() while *opposite* seems unclear.
> 
> Do you have any better idea?

Having some indication that this is about switching/converting
"guest_memfd" back and forth would make sense. But I'm also not able to
come up with a better name.

Maybe have two functions:

ram_block_activate_guest_memfd_range
ram_block_deactivate_guest_memfd_range

-- 
Cheers,

David / dhildenb


* RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-11-15  7:14 ` [PATCH v3 06/70] kvm: Introduce support for memory_attributes Xiaoyao Li
  2023-11-15 10:38   ` Daniel P. Berrangé
@ 2023-12-12 13:56   ` Wang, Wei W
  2023-12-21  6:11     ` Xiaoyao Li
  1 sibling, 1 reply; 161+ messages in thread
From: Wang, Wei W @ 2023-12-12 13:56 UTC (permalink / raw)
  To: Li, Xiaoyao, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Li, Xiaoyao, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On Wednesday, November 15, 2023 3:14 PM, Xiaoyao Li wrote:
> Introduce the helper functions to set the attributes of a range of memory to
> private or shared.
> 
> This is necessary to notify KVM the private/shared attribute of each gpa range.
> KVM needs the information to decide the GPA needs to be mapped at hva-
> based shared memory or guest_memfd based private memory.
> 
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
>  accel/kvm/kvm-all.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
>  include/sysemu/kvm.h |  3 +++
>  2 files changed, 45 insertions(+)
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 69afeb47c9c0..76e2404d54d2 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
>  static int kvm_sstep_flags;
>  static bool kvm_immediate_exit;
>  static bool kvm_guest_memfd_supported;
> +static uint64_t kvm_supported_memory_attributes;
>  static hwaddr kvm_max_slot_size = ~0;
>
>  static const KVMCapabilityInfo kvm_required_capabilites[] = {
> @@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
>      kvm_max_slot_size = max_slot_size;
>  }
>
> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> +{
> +    struct kvm_memory_attributes attrs;
> +    int r;
> +
> +    attrs.attributes = attr;
> +    attrs.address = start;
> +    attrs.size = size;
> +    attrs.flags = 0;
> +
> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> +    if (r) {
> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'",
> +                     __func__, start, size, attr, strerror(errno));
> +    }
> +    return r;
> +}
> +
> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
> +{
> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> +        return -EINVAL;
> +    }
> +
> +    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
> +}
> +
> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
> +{
> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> +        return -EINVAL;
> +    }

Duplicate code in kvm_set_memory_attributes_shared/private.
Why not move the check into kvm_set_memory_attributes?

* Re: [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  2023-12-01 11:00   ` Markus Armbruster
@ 2023-12-14  3:07     ` Xiaoyao Li
  2023-12-18 13:46       ` Markus Armbruster
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-14  3:07 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On 12/1/2023 7:00 PM, Markus Armbruster wrote:
> Xiaoyao Li <xiaoyao.li@intel.com> writes:
> 
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
>> can be provided for TDX attestation.
>>
>> So far they were hard coded as 0. Now allow user to specify those values
>> via property mrconfigid, mrowner and mrownerconfig. They are all in
>> base64 format.
>>
>> example
>> -object tdx-guest, \
>>    mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>>    mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>>    mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>>   - use base64 encoding instead of hex-string;
>> ---
>>   qapi/qom.json         | 11 +++++-
>>   target/i386/kvm/tdx.c | 85 +++++++++++++++++++++++++++++++++++++++++++
>>   target/i386/kvm/tdx.h |  3 ++
>>   3 files changed, 98 insertions(+), 1 deletion(-)
>>
>> diff --git a/qapi/qom.json b/qapi/qom.json
>> index 3a29659e0155..fd99aa1ff8cc 100644
>> --- a/qapi/qom.json
>> +++ b/qapi/qom.json
>> @@ -888,10 +888,19 @@
>>   #     pages.  Some guest OS (e.g., Linux TD guest) may require this to
>>   #     be set, otherwise they refuse to boot.
>>   #
>> +# @mrconfigid: base64 encoded MRCONFIGID SHA384 digest
>> +#
>> +# @mrowner: base64 encoded MROWNER SHA384 digest
>> +#
>> +# @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
> 
> Can we come up with a description that tells the user a bit more clearly
> what we're talking about?  Perhaps starting with this question could
> lead us there: what's an MRCONFIGID, and why should I care?

Below are the definitions from the TDX spec:

MRCONFIGID: Software-defined ID for non-owner-defined configuration of 
the guest TD – e.g., run-time or OS configuration.

MROWNER: Software-defined ID for the guest TD’s owner

MROWNERCONFIG: Software-defined ID for owner-defined configuration of 
the guest TD – e.g., specific to the workload rather than the run-time or OS


They are all attestation related and are supplied by the user who
launches the TD. Software inside the TD can retrieve them with TDREPORT
and verify that they hold the expected values.

MROWNER identifies the owner of the TD, and MROWNERCONFIG carries the
owner's configuration. MRCONFIGID contains configuration specific to the
OS level rather than to the owner.
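
As a side note on the encoding: a SHA384 digest is 48 bytes, so each of
the three base64 properties should be exactly 64 characters with no '='
padding. The following is only a minimal standalone sketch of that check,
not the code from the patch; decode_sha384_base64() and b64_value() are
hypothetical names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SHA384_DIGEST_SIZE 48   /* 384 bits = 48 bytes */

/* Map one base64 character to its 6-bit value, or -1 if it is invalid. */
static int b64_value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

/*
 * Hypothetical helper (illustration only): decode a base64 string that
 * must represent exactly one SHA384 digest.
 * Returns 0 on success, -1 if the length or the alphabet is wrong.
 */
static int decode_sha384_base64(const char *s, uint8_t out[SHA384_DIGEST_SIZE])
{
    size_t len = strlen(s);
    size_t i, o = 0;

    /* 48 is a multiple of 3, so the encoding is 64 chars without padding. */
    if (len != SHA384_DIGEST_SIZE / 3 * 4) {
        return -1;
    }

    for (i = 0; i < len; i += 4) {
        int v[4];
        int j;

        for (j = 0; j < 4; j++) {
            v[j] = b64_value(s[i + j]);
            if (v[j] < 0) {
                return -1;
            }
        }
        /* Four 6-bit values become three bytes. */
        out[o++] = (uint8_t)((v[0] << 2) | (v[1] >> 4));
        out[o++] = (uint8_t)(((v[1] & 0xf) << 4) | (v[2] >> 2));
        out[o++] = (uint8_t)(((v[2] & 0x3) << 6) | v[3]);
    }

    assert(o == SHA384_DIGEST_SIZE);
    return 0;
}
```

With the example value from the commit message, the decoded digest
starts with the bytes 0x01 0x23 0x45 0x67.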

Below is an explanation from inside Intel; I hope it makes things clearer:

"These are primarily intended for general purpose, configurable software 
in a minimal TD. So, not a legacy VM image cloud customer wanting to 
move their VM out into the cloud. Also it’s not necessarily the case 
that any workload will use them all.

MROWNER is for declaring the owner of the TD. An example use case would 
be an vHSM TD. HSMs need to know who their administrative contact is. 
You could customize the HSM image and measurements, but then people 
can’t recognize that this is the vHSM product from XYZ. So you put the 
unmodified vHSM stack in the TD, which will include MRTD/RTMRs that 
reflect the vHSM, and the owner’s public key in MROWNER. Now, when the 
vHSM starts up, to determine who is authorized to send commands, it does 
a TDREPORT, and looks at MROWNER.

Extending this model, there could be important configuration information 
from the owner. In that case, MROWNERCONFIG is set to the hash of the 
config file that the vHSM should accept.

This results in an attestable environment that explicitly indicates that 
it’s a well recognized vHSM TD, being administered by MROWNER and 
loading the configuration information that matches MROWNERCONFIG.

Extending this idea of configuration of generally recognized software, 
it could be that there is a shim OS under the vHSM that itself is 
configurable. So MRCONFIGID, which isn’t a great name, can include 
configuration information intended for the OS level. The ID is 
confusing, but MRCONFIGID was the name we used for this register for 
SGX, so we kept the name."

>> +#
>>   # Since: 8.2
>>   ##
>>   { 'struct': 'TdxGuestProperties',
>> -  'data': { '*sept-ve-disable': 'bool' } }
>> +  'data': { '*sept-ve-disable': 'bool',
>> +            '*mrconfigid': 'str',
>> +            '*mrowner': 'str',
>> +            '*mrownerconfig': 'str' } }
>>   
>>   ##
>>   # @ThreadContextProperties:
> 
> [...]
> 


* Re: [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  2023-12-14  3:07     ` Xiaoyao Li
@ 2023-12-18 13:46       ` Markus Armbruster
  2023-12-19  8:27         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Markus Armbruster @ 2023-12-18 13:46 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

Xiaoyao Li <xiaoyao.li@intel.com> writes:

> On 12/1/2023 7:00 PM, Markus Armbruster wrote:
>> Xiaoyao Li <xiaoyao.li@intel.com> writes:
>> 
>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>
>>> Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
>>> can be provided for TDX attestation.
>>>
>>> So far they were hard coded as 0. Now allow user to specify those values
>>> via property mrconfigid, mrowner and mrownerconfig. They are all in
>>> base64 format.
>>>
>>> example
>>> -object tdx-guest, \
>>>    mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>>>    mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>>>    mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v
>>>
>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>> ---
>>> Changes in v3:
>>>   - use base64 encoding instead of hex-string;
>>> ---
>>>   qapi/qom.json         | 11 +++++-
>>>   target/i386/kvm/tdx.c | 85 +++++++++++++++++++++++++++++++++++++++++++
>>>   target/i386/kvm/tdx.h |  3 ++
>>>   3 files changed, 98 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>> index 3a29659e0155..fd99aa1ff8cc 100644
>>> --- a/qapi/qom.json
>>> +++ b/qapi/qom.json
>>> @@ -888,10 +888,19 @@
>>>  #     pages.  Some guest OS (e.g., Linux TD guest) may require this to
>>>  #     be set, otherwise they refuse to boot.
>>>  #
>>> +# @mrconfigid: base64 encoded MRCONFIGID SHA384 digest
>>> +#
>>> +# @mrowner: base64 encoded MROWNER SHA384 digest
>>> +#
>>> +# @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
>>
>> Can we come up with a description that tells the user a bit more clearly
>> what we're talking about?  Perhaps starting with this question could
>> lead us there: what's an MRCONFIGID, and why should I care?
>
> Below are the definition from TDX spec:
>
> MRCONFIGID: Software-defined ID for non-owner-defined configuration of the guest TD – e.g., run-time or OS configuration.
>
> MROWNER: Software-defined ID for the guest TD’s owner
>
> MROWNERCONFIG: Software-defined ID for owner-defined configuration of the guest TD – e.g., specific to the workload rather than the run-time or OS

Have you considered using this for the doc comments?  I'd omit
"software-defined" in this context.

> They are all attestation related, and input by users who launches the TD . Software inside TD can retrieve them with TDREPORT and verify if it is the expected value.
>
> MROWNER is to identify the owner of the TD, MROWNERCONFIG is to pass OWNER's configuration. And MRCONFIGID contains configuration specific to OS level instead of OWNER.
>
> Below is the explanation from Intel inside, hope it can get you more clear:
>
> "These are primarily intended for general purpose, configurable software in a minimal TD. So, not a legacy VM image cloud customer wanting to move their VM out into the cloud. Also it’s not necessarily the case that any workload will use them all.
>
> MROWNER is for declaring the owner of the TD. An example use case would be an vHSM TD. HSMs need to know who their administrative contact is. You could customize the HSM image and measurements, but then people can’t recognize that this is the vHSM product from XYZ. So you put the unmodified vHSM stack in the TD, which will include MRTD/RTMRs that reflect the vHSM, and the owner’s public key in MROWNER. Now, when the vHSM starts up, to determine who is authorized to send commands, it does a TDREPORT, and looks at MROWNER.
>
> Extending this model, there could be important configuration information from the owner. In that case, MROWNERCONFIG is set to the hash of the config file that the vHSM should accept.
>
> This results in an attestable environment that explicitly indicates that it’s a well recognized vHSM TD, being administered by MROWNER and loading the configuration information that matches MROWNERCONFIG.
>
> Extending this idea of configuration of generally recognized software, it could be that there is a shim OS under the vHSM that itself is configurable. So MRCONFIGID, which isn’t a great name, can include configuration information intended for the OS level. The ID is confusing, but MRCONFIGID was the name we used for this register for SGX, so we kept the name."

Include a reference to this document?

>>> +#
>>>  # Since: 8.2
>>>  ##
>>>  { 'struct': 'TdxGuestProperties',
>>> -  'data': { '*sept-ve-disable': 'bool' } }
>>> +  'data': { '*sept-ve-disable': 'bool',
>>> +            '*mrconfigid': 'str',
>>> +            '*mrowner': 'str',
>>> +            '*mrownerconfig': 'str' } }
>>>   ##
>>>   # @ThreadContextProperties:
>> [...]
>> 


* Re: [PATCH v3 31/70] i386/tdx: Allows mrconfigid/mrowner/mrownerconfig for TDX_INIT_VM
  2023-12-18 13:46       ` Markus Armbruster
@ 2023-12-19  8:27         ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-19  8:27 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P.Berrangé,
	Eric Blake, Marcelo Tosatti, qemu-devel, kvm, Michael Roth,
	Sean Christopherson, Claudio Fontana, Gerd Hoffmann,
	Isaku Yamahata, Chenyi Qiang

On 12/18/2023 9:46 PM, Markus Armbruster wrote:
> Xiaoyao Li <xiaoyao.li@intel.com> writes:
> 
>> On 12/1/2023 7:00 PM, Markus Armbruster wrote:
>>> Xiaoyao Li <xiaoyao.li@intel.com> writes:
>>>
>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>
>>>> Three sha384 hash values, mrconfigid, mrowner and mrownerconfig, of a TD
>>>> can be provided for TDX attestation.
>>>>
>>>> So far they were hard coded as 0. Now allow user to specify those values
>>>> via property mrconfigid, mrowner and mrownerconfig. They are all in
>>>> base64 format.
>>>>
>>>> example
>>>> -object tdx-guest, \
>>>>     mrconfigid=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>>>>     mrowner=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v,\
>>>>     mrownerconfig=ASNFZ4mrze8BI0VniavN7wEjRWeJq83vASNFZ4mrze8BI0VniavN7wEjRWeJq83v
>>>>
>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>>> Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> ---
>>>> Changes in v3:
>>>>    - use base64 encoding instead of hex-string;
>>>> ---
>>>>    qapi/qom.json         | 11 +++++-
>>>>    target/i386/kvm/tdx.c | 85 +++++++++++++++++++++++++++++++++++++++++++
>>>>    target/i386/kvm/tdx.h |  3 ++
>>>>    3 files changed, 98 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/qapi/qom.json b/qapi/qom.json
>>>> index 3a29659e0155..fd99aa1ff8cc 100644
>>>> --- a/qapi/qom.json
>>>> +++ b/qapi/qom.json
>>>> @@ -888,10 +888,19 @@
>>>>   #     pages.  Some guest OS (e.g., Linux TD guest) may require this to
>>>>   #     be set, otherwise they refuse to boot.
>>>>   #
>>>> +# @mrconfigid: base64 encoded MRCONFIGID SHA384 digest
>>>> +#
>>>> +# @mrowner: base64 encoded MROWNER SHA384 digest
>>>> +#
>>>> +# @mrownerconfig: base64 MROWNERCONFIG SHA384 digest
>>>
>>> Can we come up with a description that tells the user a bit more clearly
>>> what we're talking about?  Perhaps starting with this question could
>>> lead us there: what's an MRCONFIGID, and why should I care?
>>
>> Below are the definition from TDX spec:
>>
>> MRCONFIGID: Software-defined ID for non-owner-defined configuration of the guest TD – e.g., run-time or OS configuration.
>>
>> MROWNER: Software-defined ID for the guest TD’s owner
>>
>> MROWNERCONFIG: Software-defined ID for owner-defined configuration of the guest TD – e.g., specific to the workload rather than the run-time or OS
> 
> Have you considered using this for the doc comments?  I'd omit
> "software-defined" in this context.

Sure. I will use them in the next version.

>> They are all attestation related, and input by users who launches the TD . Software inside TD can retrieve them with TDREPORT and verify if it is the expected value.
>>
>> MROWNER is to identify the owner of the TD, MROWNERCONFIG is to pass OWNER's configuration. And MRCONFIGID contains configuration specific to OS level instead of OWNER.
>>
>> Below is the explanation from Intel inside, hope it can get you more clear:
>>
>> "These are primarily intended for general purpose, configurable software in a minimal TD. So, not a legacy VM image cloud customer wanting to move their VM out into the cloud. Also it’s not necessarily the case that any workload will use them all.
>>
>> MROWNER is for declaring the owner of the TD. An example use case would be an vHSM TD. HSMs need to know who their administrative contact is. You could customize the HSM image and measurements, but then people can’t recognize that this is the vHSM product from XYZ. So you put the unmodified vHSM stack in the TD, which will include MRTD/RTMRs that reflect the vHSM, and the owner’s public key in MROWNER. Now, when the vHSM starts up, to determine who is authorized to send commands, it does a TDREPORT, and looks at MROWNER.
>>
>> Extending this model, there could be important configuration information from the owner. In that case, MROWNERCONFIG is set to the hash of the config file that the vHSM should accept.
>>
>> This results in an attestable environment that explicitly indicates that it’s a well recognized vHSM TD, being administered by MROWNER and loading the configuration information that matches MROWNERCONFIG.
>>
>> Extending this idea of configuration of generally recognized software, it could be that there is a shim OS under the vHSM that itself is configurable. So MRCONFIGID, which isn’t a great name, can include configuration information intended for the OS level. The ID is confusing, but MRCONFIGID was the name we used for this register for SGX, so we kept the name."
> 
> Include a reference to this document?

That was an email reply from internal attestation folks.

But I can add a link to this mail in the next version.

>>>> +#
>>>>   # Since: 8.2
>>>>   ##
>>>>   { 'struct': 'TdxGuestProperties',
>>>> -  'data': { '*sept-ve-disable': 'bool' } }
>>>> +  'data': { '*sept-ve-disable': 'bool',
>>>> +            '*mrconfigid': 'str',
>>>> +            '*mrowner': 'str',
>>>> +            '*mrownerconfig': 'str' } }
>>>>    ##
>>>>    # @ThreadContextProperties:
>>> [...]
>>>
> 


* Re: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-12-12 13:56   ` Wang, Wei W
@ 2023-12-21  6:11     ` Xiaoyao Li
  2023-12-21 10:36       ` Wang, Wei W
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-21  6:11 UTC (permalink / raw)
  To: Wang, Wei W, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On 12/12/2023 9:56 PM, Wang, Wei W wrote:
> On Wednesday, November 15, 2023 3:14 PM, Xiaoyao Li wrote:
>> Introduce the helper functions to set the attributes of a range of memory to
>> private or shared.
>>
>> This is necessary to notify KVM the private/shared attribute of each gpa range.
>> KVM needs the information to decide the GPA needs to be mapped at hva-
>> based shared memory or guest_memfd based private memory.
>>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>>   accel/kvm/kvm-all.c  | 42 ++++++++++++++++++++++++++++++++++++++++++
>>   include/sysemu/kvm.h |  3 +++
>>   2 files changed, 45 insertions(+)
>>
>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
>> index 69afeb47c9c0..76e2404d54d2 100644
>> --- a/accel/kvm/kvm-all.c
>> +++ b/accel/kvm/kvm-all.c
>> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;
>>   static int kvm_sstep_flags;
>>   static bool kvm_immediate_exit;
>>   static bool kvm_guest_memfd_supported;
>> +static uint64_t kvm_supported_memory_attributes;
>>   static hwaddr kvm_max_slot_size = ~0;
>>
>>   static const KVMCapabilityInfo kvm_required_capabilites[] = {
>> @@ -1305,6 +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
>>       kvm_max_slot_size = max_slot_size;
>>   }
>>
>> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
>> +{
>> +    struct kvm_memory_attributes attrs;
>> +    int r;
>> +
>> +    attrs.attributes = attr;
>> +    attrs.address = start;
>> +    attrs.size = size;
>> +    attrs.flags = 0;
>> +
>> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
>> +    if (r) {
>> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'",
>> +                     __func__, start, size, attr, strerror(errno));
>> +    }
>> +    return r;
>> +}
>> +
>> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
>> +{
>> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
>> +        return -EINVAL;
>> +    }
>> +
>> +    return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
>> +}
>> +
>> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
>> +{
>> +    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
>> +        return -EINVAL;
>> +    }
> 
> Duplicate code in kvm_set_memory_attributes_shared/private.
> Why not move the check into kvm_set_memory_attributes?

Because it's not easy to put the check there.

Both setting and clearing a bit require the capability check. If the
check moved into kvm_set_memory_attributes(), the check for
KVM_MEMORY_ATTRIBUTE_PRIVATE would have to become unconditional, which
does not fit the function name, because the name is not restricted to
the shared/private attribute.
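
To make the asymmetry concrete, here is a rough sketch of what a check in
the common helper could look like if it validated the requested bits
against the supported mask. It is only an illustration under stated
assumptions, not the patch: all demo_* names are hypothetical,
DEMO_ATTR_PRIVATE is a demo value standing in for
KVM_MEMORY_ATTRIBUTE_PRIVATE, and the ioctl is replaced by a stub so that
the example runs without KVM.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Demo stand-in for KVM_MEMORY_ATTRIBUTE_PRIVATE (assumption, see above). */
#define DEMO_ATTR_PRIVATE (1ULL << 3)

/* Hypothetical stand-in for the mask that would be queried from KVM. */
static uint64_t demo_supported_attrs = DEMO_ATTR_PRIVATE;

/* Stub for the ioctl: it only records the last requested attributes. */
static uint64_t demo_last_attr;

static int demo_ioctl_set_attrs(uint64_t start, uint64_t size, uint64_t attr)
{
    (void)start;
    (void)size;
    demo_last_attr = attr;
    return 0;
}

/* Common helper: reject any attribute bit that was not advertised. */
static int demo_set_memory_attributes(uint64_t start, uint64_t size,
                                      uint64_t attr)
{
    assert(start + size >= start);   /* the range must not wrap around */

    if ((attr & demo_supported_attrs) != attr) {
        return -EINVAL;
    }
    return demo_ioctl_set_attrs(start, size, attr);
}

static int demo_set_private(uint64_t start, uint64_t size)
{
    return demo_set_memory_attributes(start, size, DEMO_ATTR_PRIVATE);
}

static int demo_set_shared(uint64_t start, uint64_t size)
{
    /* Shared means clearing PRIVATE, so the requested mask is 0. */
    return demo_set_memory_attributes(start, size, 0);
}
```

Note that converting to shared passes attr == 0, so a mask check in the
common helper lets it through even when PRIVATE is unsupported. Keeping
that case gated would still need an explicit check in the shared wrapper.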

* Re: [PATCH v3 09/70] physmem: Introduce ram_block_convert_range() for page conversion
  2023-12-08 11:52       ` David Hildenbrand
@ 2023-12-21  6:18         ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-21  6:18 UTC (permalink / raw)
  To: David Hildenbrand, Isaku Yamahata
  Cc: Paolo Bonzini, Igor Mammedov, Michael S . Tsirkin,
	Marcel Apfelbaum, Richard Henderson, Peter Xu,
	Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P. Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti, qemu-devel, kvm,
	Michael Roth, Sean Christopherson, Claudio Fontana,
	Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang, isaku.yamahata

On 12/8/2023 7:52 PM, David Hildenbrand wrote:
> On 08.12.23 08:59, Xiaoyao Li wrote:
>> On 11/18/2023 5:03 AM, Isaku Yamahata wrote:
>>> On Wed, Nov 15, 2023 at 02:14:18AM -0500,
>>> Xiaoyao Li <xiaoyao.li@intel.com> wrote:
>>>
>>>> It's used for discarding opposite memory after memory conversion, for
>>>> confidential guest.
>>>>
>>>> When page is converted from shared to private, the original shared
>>>> memory can be discarded via ram_block_discard_range();
>>>>
>>>> When page is converted from private to shared, the original private
>>>> memory is backed by guest_memfd. Introduce
>>>> ram_block_discard_guest_memfd_range() for discarding memory in
>>>> guest_memfd.
>>>>
>>>> Originally-from: Isaku Yamahata <isaku.yamahata@intel.com>
>>>> Codeveloped-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> ---
>>>>    include/exec/cpu-common.h |  2 ++
>>>>    system/physmem.c          | 50 +++++++++++++++++++++++++++++++++++++++
>>>>    2 files changed, 52 insertions(+)
>>>>
>>>> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
>>>> index 41115d891940..de728a18eef2 100644
>>>> --- a/include/exec/cpu-common.h
>>>> +++ b/include/exec/cpu-common.h
>>>> @@ -175,6 +175,8 @@ typedef int (RAMBlockIterFunc)(RAMBlock *rb, void *opaque);
>>>>    int qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
>>>>    int ram_block_discard_range(RAMBlock *rb, uint64_t start, size_t length);
>>>> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t length,
>>>> +                            bool shared_to_private);
>>>>    #endif
>>>> diff --git a/system/physmem.c b/system/physmem.c
>>>> index ddfecddefcd6..cd6008fa09ad 100644
>>>> --- a/system/physmem.c
>>>> +++ b/system/physmem.c
>>>> @@ -3641,6 +3641,29 @@ err:
>>>>        return ret;
>>>>    }
>>>> +static int ram_block_discard_guest_memfd_range(RAMBlock *rb, uint64_t start,
>>>> +                                               size_t length)
>>>> +{
>>>> +    int ret = -1;
>>>> +
>>>> +#ifdef CONFIG_FALLOCATE_PUNCH_HOLE
>>>> +    ret = fallocate(rb->guest_memfd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
>>>> +                    start, length);
>>>> +
>>>> +    if (ret) {
>>>> +        ret = -errno;
>>>> +        error_report("%s: Failed to fallocate %s:%" PRIx64 " +%zx (%d)",
>>>> +                     __func__, rb->idstr, start, length, ret);
>>>> +    }
>>>> +#else
>>>> +    ret = -ENOSYS;
>>>> +    error_report("%s: fallocate not available %s:%" PRIx64 " +%zx (%d)",
>>>> +                 __func__, rb->idstr, start, length, ret);
>>>> +#endif
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>>    bool ramblock_is_pmem(RAMBlock *rb)
>>>>    {
>>>>        return rb->flags & RAM_PMEM;
>>>> @@ -3828,3 +3851,30 @@ bool ram_block_discard_is_required(void)
>>>>        return qatomic_read(&ram_block_discard_required_cnt) ||
>>>>               
>>>> qatomic_read(&ram_block_coordinated_discard_required_cnt);
>>>>    }
>>>> +
>>>> +int ram_block_convert_range(RAMBlock *rb, uint64_t start, size_t 
>>>> length,
>>>> +                            bool shared_to_private)
>>>> +{
>>>> +    if (!rb || rb->guest_memfd < 0) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) ||
>>>> +        !QEMU_PTR_IS_ALIGNED(length, qemu_host_page_size)) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (!length) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (start + length > rb->max_length) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (shared_to_private) {
>>>> +        return ram_block_discard_range(rb, start, length);
>>>> +    } else {
>>>> +        return ram_block_discard_guest_memfd_range(rb, start, length);
>>>> +    }
>>>> +}
>>>
>>> Originally this function issued KVM_SET_MEMORY_ATTRIBUTES, so the
>>> function name made sense. But now it doesn't, and it only punches a
>>> hole. We should rename it to represent what it actually does.
>>> discard_range?
>>
>> ram_block_discard_range() already exists for non-guest-memfd memory 
>> discard.
>>
>> I cannot come up with a proper name. e.g.,
>> ram_block_discard_opposite_range() while *opposite* seems unclear.
>>
>> Do you have any better idea?
> 
> Having some indication that this is about "guest_memfd" back and forth 
> switching/conversion will make sense. But I'm also not able to come up 
> with a better name.
> 
> Maybe have two functions:
> 
> ram_block_activate_guest_memfd_range
> ram_block_deactivate_guest_memfd_range
> 

In the end, I decided to drop this function and expose 
ram_block_discard_guest_memfd_range() instead, so the caller can call the 
appropriate ram_block_discard_*() function on its own.
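
A minimal sketch of what this looks like from the caller's side. It is not
the actual follow-up patch: FakeRAMBlock, discard_shared_stub(),
discard_guest_memfd_stub() and convert_range_caller() are hypothetical
stand-ins for RAMBlock, ram_block_discard_range(),
ram_block_discard_guest_memfd_range() and whatever the caller ends up being.
The direction of each discard follows the quoted patch: converting to private
discards the shared side, converting to shared discards the guest_memfd side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's RAMBlock; only what the sketch needs. */
typedef struct FakeRAMBlock {
    int guest_memfd;        /* < 0 means no guest_memfd is attached */
    int shared_discards;    /* number of calls to the shared-side stub */
    int private_discards;   /* number of calls to the guest_memfd stub */
} FakeRAMBlock;

/* Stub standing in for ram_block_discard_range() (shared memory). */
int discard_shared_stub(FakeRAMBlock *rb, uint64_t start, size_t length)
{
    assert(rb != NULL);
    (void)start;
    (void)length;
    rb->shared_discards++;
    return 0;
}

/* Stub standing in for ram_block_discard_guest_memfd_range() (private memory). */
int discard_guest_memfd_stub(FakeRAMBlock *rb, uint64_t start, size_t length)
{
    assert(rb != NULL);
    (void)start;
    (void)length;
    rb->private_discards++;
    return 0;
}

/*
 * Caller-side conversion: once a range becomes private, the shared copy is
 * discarded; once it becomes shared, the guest_memfd copy is discarded.
 */
int convert_range_caller(FakeRAMBlock *rb, uint64_t start, size_t length,
                         bool to_private)
{
    if (!rb || rb->guest_memfd < 0 || length == 0) {
        return -1;
    }
    return to_private ? discard_shared_stub(rb, start, length)
                      : discard_guest_memfd_stub(rb, start, length);
}
```

With this shape no wrapper needs a name of its own; the choice between the
two discard helpers is visible at the call site.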

^ permalink raw reply	[flat|nested] 161+ messages in thread

* RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-12-21  6:11     ` Xiaoyao Li
@ 2023-12-21 10:36       ` Wang, Wei W
  2023-12-21 11:53         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Wang, Wei W @ 2023-12-21 10:36 UTC (permalink / raw)
  To: Li, Xiaoyao, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On Thursday, December 21, 2023 2:11 PM, Li, Xiaoyao wrote:
> On 12/12/2023 9:56 PM, Wang, Wei W wrote:
> > On Wednesday, November 15, 2023 3:14 PM, Xiaoyao Li wrote:
> >> Introduce the helper functions to set the attributes of a range of
> >> memory to private or shared.
> >>
> >> This is necessary to notify KVM the private/shared attribute of each gpa
> range.
> >> KVM needs the information to decide the GPA needs to be mapped at
> >> hva- based shared memory or guest_memfd based private memory.
> >>
> >> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> >> ---
> >>   accel/kvm/kvm-all.c  | 42
> ++++++++++++++++++++++++++++++++++++++++++
> >>   include/sysemu/kvm.h |  3 +++
> >>   2 files changed, 45 insertions(+)
> >>
> >> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
> >> 69afeb47c9c0..76e2404d54d2 100644
> >> --- a/accel/kvm/kvm-all.c
> >> +++ b/accel/kvm/kvm-all.c
> >> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;  static int
> >> kvm_sstep_flags; static bool kvm_immediate_exit;  static bool
> >> kvm_guest_memfd_supported;
> >> +static uint64_t kvm_supported_memory_attributes;
> >>   static hwaddr kvm_max_slot_size = ~0;
> >>
> >>   static const KVMCapabilityInfo kvm_required_capabilites[] = { @@
> >> -1305,6
> >> +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
> >>       kvm_max_slot_size = max_slot_size;
> >>   }
> >>
> >> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size,
> >> +uint64_t attr) {
> >> +    struct kvm_memory_attributes attrs;
> >> +    int r;
> >> +
> >> +    attrs.attributes = attr;
> >> +    attrs.address = start;
> >> +    attrs.size = size;
> >> +    attrs.flags = 0;
> >> +
> >> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
> >> +    if (r) {
> >> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr
> >> + 0x%lx
> >> error '%s'",
> >> +                     __func__, start, size, attr, strerror(errno));
> >> +    }
> >> +    return r;
> >> +}
> >> +
> >> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size) {
> >> +    if (!(kvm_supported_memory_attributes &
> >> KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> >> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> >> +        return -EINVAL;
> >> +    }
> >> +
> >> +    return kvm_set_memory_attributes(start, size,
> >> +KVM_MEMORY_ATTRIBUTE_PRIVATE); }
> >> +
> >> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size) {
> >> +    if (!(kvm_supported_memory_attributes &
> >> KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> >> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
> >> +        return -EINVAL;
> >> +    }
> >
> > Duplicate code in kvm_set_memory_attributes_shared/private.
> > Why not move the check into kvm_set_memory_attributes?
> 
> Because it's not easy to put the check there.
> 
> Both setting and clearing a bit require the capability check. If the check
> moves into kvm_set_memory_attributes(), the check for
> KVM_MEMORY_ATTRIBUTE_PRIVATE has to become unconditional, which doesn't
> match the function name, because the name is not restricted to the
> shared/private attribute.

No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE there.
I'm suggesting below:

diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
index 2d9a2455de..63ba74b221 100644
--- a/accel/kvm/kvm-all.c
+++ b/accel/kvm/kvm-all.c
@@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
     struct kvm_memory_attributes attrs;
     int r;

+    if ((attr & kvm_supported_memory_attributes) != attr) {
+        error_report("KVM doesn't support memory attr %lx\n", attr);
+        return -EINVAL;
+    }
+
     attrs.attributes = attr;
     attrs.address = start;
     attrs.size = size;
@@ -1390,21 +1395,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)

 int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
 {
-    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
-        error_report("KVM doesn't support PRIVATE memory attribute\n");
-        return -EINVAL;
-    }
-
     return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
 }

 int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
 {
-    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
-        error_report("KVM doesn't support PRIVATE memory attribute\n");
-        return -EINVAL;
-    }
-
     return kvm_set_memory_attributes(start, size, 0);
 }

Maybe you don't even need the kvm_set_memory_attributes_shared/private wrappers.

^ permalink raw reply related	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-11-15  7:15 ` [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote> Xiaoyao Li
                     ` (2 preceding siblings ...)
  2023-12-01 11:02   ` Markus Armbruster
@ 2023-12-21 11:05   ` Daniel P. Berrangé
  2023-12-22  3:14     ` Xiaoyao Li
  3 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-12-21 11:05 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
> From: Isaku Yamahata <isaku.yamahata@intel.com>
> 
> For GetQuote, delegate a request to Quote Generation Service.
> Add property "quote-generation-socket" to tdx-guest, which is a property
> of type SocketAddress to specify Quote Generation Service(QGS).
> 
> On request, connect to the QGS, read request buffer from shared guest
> memory, send the request buffer to the server and store the response
> into shared guest memory and notify TD guest by interrupt.
> 
> command line example:
>   qemu-system-x86_64 \
>     -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \

Here you're illustrating a VSOCK address.  IIUC, both the 'qgs'
daemon and QEMU will be running in the host. Why would they need
to be using VSOCK, as opposed to a regular UNIX socket connection ?

>     -machine confidential-guest-support=tdx0
> 
> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> ---
> Changes in v3:
> - rename property "quote-generation-service" to "quote-generation-socket";
> - change the type of "quote-generation-socket" from str to
>   SocketAddress;

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-12-21 10:36       ` Wang, Wei W
@ 2023-12-21 11:53         ` Xiaoyao Li
  2023-12-21 13:47           ` Wang, Wei W
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-21 11:53 UTC (permalink / raw)
  To: Wang, Wei W, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> On Thursday, December 21, 2023 2:11 PM, Li, Xiaoyao wrote:
>> On 12/12/2023 9:56 PM, Wang, Wei W wrote:
>>> On Wednesday, November 15, 2023 3:14 PM, Xiaoyao Li wrote:
>>>> Introduce the helper functions to set the attributes of a range of
>>>> memory to private or shared.
>>>>
>>>> This is necessary to notify KVM the private/shared attribute of each gpa
>> range.
>>>> KVM needs the information to decide the GPA needs to be mapped at
>>>> hva- based shared memory or guest_memfd based private memory.
>>>>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> ---
>>>>    accel/kvm/kvm-all.c  | 42
>> ++++++++++++++++++++++++++++++++++++++++++
>>>>    include/sysemu/kvm.h |  3 +++
>>>>    2 files changed, 45 insertions(+)
>>>>
>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
>>>> 69afeb47c9c0..76e2404d54d2 100644
>>>> --- a/accel/kvm/kvm-all.c
>>>> +++ b/accel/kvm/kvm-all.c
>>>> @@ -102,6 +102,7 @@ bool kvm_has_guest_debug;  static int
>>>> kvm_sstep_flags; static bool kvm_immediate_exit;  static bool
>>>> kvm_guest_memfd_supported;
>>>> +static uint64_t kvm_supported_memory_attributes;
>>>>    static hwaddr kvm_max_slot_size = ~0;
>>>>
>>>>    static const KVMCapabilityInfo kvm_required_capabilites[] = { @@
>>>> -1305,6
>>>> +1306,44 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size)
>>>>        kvm_max_slot_size = max_slot_size;
>>>>    }
>>>>
>>>> +static int kvm_set_memory_attributes(hwaddr start, hwaddr size,
>>>> +uint64_t attr) {
>>>> +    struct kvm_memory_attributes attrs;
>>>> +    int r;
>>>> +
>>>> +    attrs.attributes = attr;
>>>> +    attrs.address = start;
>>>> +    attrs.size = size;
>>>> +    attrs.flags = 0;
>>>> +
>>>> +    r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs);
>>>> +    if (r) {
>>>> +        warn_report("%s: failed to set memory (0x%lx+%#zx) with attr
>>>> + 0x%lx
>>>> error '%s'",
>>>> +                     __func__, start, size, attr, strerror(errno));
>>>> +    }
>>>> +    return r;
>>>> +}
>>>> +
>>>> +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size) {
>>>> +    if (!(kvm_supported_memory_attributes &
>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>>>> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
>>>> +        return -EINVAL;
>>>> +    }
>>>> +
>>>> +    return kvm_set_memory_attributes(start, size,
>>>> +KVM_MEMORY_ATTRIBUTE_PRIVATE); }
>>>> +
>>>> +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size) {
>>>> +    if (!(kvm_supported_memory_attributes &
>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
>>>> +        error_report("KVM doesn't support PRIVATE memory attribute\n");
>>>> +        return -EINVAL;
>>>> +    }
>>>
>>> Duplicate code in kvm_set_memory_attributes_shared/private.
>>> Why not move the check into kvm_set_memory_attributes?
>>
>> Because it's not easy to put the check there.
>>
>> Both setting and clearing a bit require the capability check. If the check
>> moves into kvm_set_memory_attributes(), the check for
>> KVM_MEMORY_ATTRIBUTE_PRIVATE has to become unconditional, which doesn't
>> match the function name, because the name is not restricted to the
>> shared/private attribute.
> 
> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE there.
> I'm suggesting below:
> 
> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c
> index 2d9a2455de..63ba74b221 100644
> --- a/accel/kvm/kvm-all.c
> +++ b/accel/kvm/kvm-all.c
> @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
>       struct kvm_memory_attributes attrs;
>       int r;
> 
> +    if ((attr & kvm_supported_memory_attributes) != attr) {
> +        error_report("KVM doesn't support memory attr %lx\n", attr);
> +        return -EINVAL;
> +    }

In the case of setting a range of memory to shared while KVM doesn't 
support private memory, the above check doesn't work, and the following 
ioctl fails.

>       attrs.attributes = attr;
>       attrs.address = start;
>       attrs.size = size;
> @@ -1390,21 +1395,11 @@ static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr)
> 
>   int kvm_set_memory_attributes_private(hwaddr start, hwaddr size)
>   {
> -    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> -        error_report("KVM doesn't support PRIVATE memory attribute\n");
> -        return -EINVAL;
> -    }
> -
>       return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE);
>   }
> 
>   int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size)
>   {
> -    if (!(kvm_supported_memory_attributes & KVM_MEMORY_ATTRIBUTE_PRIVATE)) {
> -        error_report("KVM doesn't support PRIVATE memory attribute\n");
> -        return -EINVAL;
> -    }
> -
>       return kvm_set_memory_attributes(start, size, 0);
>   }
> 
> Maybe you don't even need the kvm_set_memory_attributes_shared/private wrappers.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-12-21 11:53         ` Xiaoyao Li
@ 2023-12-21 13:47           ` Wang, Wei W
  2024-01-09  5:47             ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Wang, Wei W @ 2023-12-21 13:47 UTC (permalink / raw)
  To: Li, Xiaoyao, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> > No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE there.
> > I'm suggesting below:
> >
> > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
> > 2d9a2455de..63ba74b221 100644
> > --- a/accel/kvm/kvm-all.c
> > +++ b/accel/kvm/kvm-all.c
> > @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr
> start, hwaddr size, uint64_t attr)
> >       struct kvm_memory_attributes attrs;
> >       int r;
> >
> > +    if ((attr & kvm_supported_memory_attributes) != attr) {
> > +        error_report("KVM doesn't support memory attr %lx\n", attr);
> > +        return -EINVAL;
> > +    }
> 
> In the case of setting a range of memory to shared while KVM doesn't support
> private memory, the above check doesn't work, and the following ioctl fails.

The SHARED attribute uses the value 0, which means it is always supported, no?
As for the implementation, can you point to where on the KVM side the ioctl
would fail in that case?

static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
                                           struct kvm_memory_attributes *attrs)
{
        gfn_t start, end;

        /* flags is currently not used. */
        if (attrs->flags)
                return -EINVAL;
        if (attrs->attributes & ~kvm_supported_mem_attributes(kvm)) ==> 0 here
                return -EINVAL;
        if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
                return -EINVAL;
        if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
                return -EINVAL;
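
To make the argument concrete, here is a small standalone sketch of the two
mask checks. It is not QEMU or kernel code: attrs_supported() mirrors the
userspace check proposed earlier in this thread, kernel_rejects() mirrors the
attributes test in the excerpt above, and ATTR_PRIVATE is a local name that
assumes KVM_MEMORY_ATTRIBUTE_PRIVATE is bit 3, as in the KVM uAPI header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: same value as KVM_MEMORY_ATTRIBUTE_PRIVATE in linux/kvm.h. */
#define ATTR_PRIVATE (1ULL << 3)

/* Userspace-style check: every requested bit must be in the supported mask. */
bool attrs_supported(uint64_t attr, uint64_t supported)
{
    return (attr & supported) == attr;
}

/* Kernel-style check: reject if any requested bit is outside the mask. */
bool kernel_rejects(uint64_t attr, uint64_t supported)
{
    return (attr & ~supported) != 0;
}

/* The cases discussed in this thread, written out as assertions. */
void attrs_selfcheck(void)
{
    /* Shared (attr == 0) passes both checks even with an empty mask. */
    assert(attrs_supported(0, 0));
    assert(!kernel_rejects(0, 0));

    /* Private is refused by both checks when the mask lacks the bit. */
    assert(!attrs_supported(ATTR_PRIVATE, 0));
    assert(kernel_rejects(ATTR_PRIVATE, 0));

    /* Private is accepted by both checks when the mask has the bit. */
    assert(attrs_supported(ATTR_PRIVATE, ATTR_PRIVATE));
    assert(!kernel_rejects(ATTR_PRIVATE, ATTR_PRIVATE));
}
```

The two predicates are logical complements, so a request that passes the
userspace check is never refused by this particular kernel test. Whether the
ioctl can fail earlier for other reasons (for example, if it is not available
at all) is outside what this sketch shows.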

^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-12-21 11:05   ` Daniel P. Berrangé
@ 2023-12-22  3:14     ` Xiaoyao Li
  2023-12-22 13:14       ` Daniel P. Berrangé
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-22  3:14 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 12/21/2023 7:05 PM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> For GetQuote, delegate a request to Quote Generation Service.
>> Add property "quote-generation-socket" to tdx-guest, which is a property
>> of type SocketAddress to specify Quote Generation Service(QGS).
>>
>> On request, connect to the QGS, read request buffer from shared guest
>> memory, send the request buffer to the server and store the response
>> into shared guest memory and notify TD guest by interrupt.
>>
>> command line example:
>>    qemu-system-x86_64 \
>>      -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
> 
> Here you're illustrating a VSOCK address.  IIUC, both the 'qgs'
> daemon and QEMU will be running in the host. Why would they need
> to be using VSOCK, as opposed to a regular UNIX socket connection ?
> 

We use vsock here because the QGS server we used for testing exposes a 
vsock socket.

I will add more examples in the next version to show that any socket type is 
supported.


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-12-22  3:14     ` Xiaoyao Li
@ 2023-12-22 13:14       ` Daniel P. Berrangé
  2023-12-25 12:34         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2023-12-22 13:14 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Fri, Dec 22, 2023 at 11:14:12AM +0800, Xiaoyao Li wrote:
> On 12/21/2023 7:05 PM, Daniel P. Berrangé wrote:
> > On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > For GetQuote, delegate a request to Quote Generation Service.
> > > Add property "quote-generation-socket" to tdx-guest, which is a property
> > > of type SocketAddress to specify Quote Generation Service(QGS).
> > > 
> > > On request, connect to the QGS, read request buffer from shared guest
> > > memory, send the request buffer to the server and store the response
> > > into shared guest memory and notify TD guest by interrupt.
> > > 
> > > command line example:
> > >    qemu-system-x86_64 \
> > >      -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
> > 
> > Here you're illustrating a VSOCK address.  IIUC, both the 'qgs'
> > daemon and QEMU will be running in the host. Why would they need
> > to be using VSOCK, as opposed to a regular UNIX socket connection ?
> > 
> 
> We use vsock here because the QGS server we used for testing exposes the
> vsock socket.

Is this the server impl you test with:

  https://github.com/intel/SGXDataCenterAttestationPrimitives/tree/master/QuoteGeneration/quote_wrapper/qgs

or is there another impl ?

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-12-22 13:14       ` Daniel P. Berrangé
@ 2023-12-25 12:34         ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-25 12:34 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 12/22/2023 9:14 PM, Daniel P. Berrangé wrote:
> On Fri, Dec 22, 2023 at 11:14:12AM +0800, Xiaoyao Li wrote:
>> On 12/21/2023 7:05 PM, Daniel P. Berrangé wrote:
>>> On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>
>>>> For GetQuote, delegate a request to Quote Generation Service.
>>>> Add property "quote-generation-socket" to tdx-guest, which is a property
>>>> of type SocketAddress to specify Quote Generation Service(QGS).
>>>>
>>>> On request, connect to the QGS, read request buffer from shared guest
>>>> memory, send the request buffer to the server and store the response
>>>> into shared guest memory and notify TD guest by interrupt.
>>>>
>>>> command line example:
>>>>     qemu-system-x86_64 \
>>>>       -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>>>
>>> Here you're illustrating a VSOCK address.  IIUC, both the 'qgs'
>>> daemon and QEMU will be running in the host. Why would they need
>>> to be using VSOCK, as opposed to a regular UNIX socket connection ?
>>>
>>
>> We use vsock here because the QGS server we used for testing exposes the
>> vsock socket.
> 
> Is this the server impl you test with:
> 
>    https://github.com/intel/SGXDataCenterAttestationPrimitives/tree/master/QuoteGeneration/quote_wrapper/qgs

I think it should be.

I used applications/services bundled by internal teams.

> or is there another impl ?
> 
> With regards,
> Daniel


^ permalink raw reply	[flat|nested] 161+ messages in thread

* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-11-15 17:58   ` Daniel P. Berrangé
@ 2023-12-29  2:30     ` Xiaoyao Li
  2024-01-08 14:44       ` Daniel P. Berrangé
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2023-12-29  2:30 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 11/16/2023 1:58 AM, Daniel P. Berrangé wrote:
> On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>
>> For GetQuote, delegate a request to Quote Generation Service.
>> Add property "quote-generation-socket" to tdx-guest, which is a property
>> of type SocketAddress to specify Quote Generation Service(QGS).
>>
>> On request, connect to the QGS, read request buffer from shared guest
>> memory, send the request buffer to the server and store the response
>> into shared guest memory and notify TD guest by interrupt.
>>
>> command line example:
>>    qemu-system-x86_64 \
>>      -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>>      -machine confidential-guest-support=tdx0
>>
>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>> ---
>> Changes in v3:
>> - rename property "quote-generation-service" to "quote-generation-socket";
>> - change the type of "quote-generation-socket" from str to
>>    SocketAddress;
>> - squash next patch into this one;
>> ---
>>   qapi/qom.json         |   5 +-
>>   target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>>   target/i386/kvm/tdx.h |   6 +
>>   3 files changed, 440 insertions(+), 1 deletion(-)
>>
>> +static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
>> +{
>> +    struct tdx_get_quote_task *t = opaque;
>> +    Error *err = NULL;
>> +    char *in_data = NULL;
>> +    MachineState *ms;
>> +    TdxGuest *tdx;
>> +
>> +    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
>> +    if (qio_task_propagate_error(task, NULL)) {
>> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
>> +        goto error;
>> +    }
>> +
>> +    in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
>> +    if (!in_data) {
>> +        goto error;
>> +    }
>> +
>> +    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
>> +                           MEMTXATTRS_UNSPECIFIED, in_data,
>> +                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
>> +        goto error;
>> +    }
>> +
>> +    qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
> 
> You've set the channel to non-blocking, but....
> 
>> +
>> +    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
>> +                              le32_to_cpu(t->hdr.in_len), &err) ||
>> +        err) {
> 
> ...this method will block execution of this thread, by either
> sleeping in poll() or doing a coroutine yield.
> 
> I don't think this is in coroutine context, so presumably this
> is just blocking.  So what was the point in marking the channel
> non-blocking ?

Hi Daniel,

First of all, I'm not familiar with sockets or QIO channels, so please 
correct me where I'm wrong.

I'm not the author of this patch. My understanding is that the channel is 
set to non-blocking so that qio_channel_write_all() can proceed immediately?

If setting it to non-blocking is not needed, I can remove it.

> You are setting up a background watch to wait for the reply
> so we don't block this thread, so you seem to want non-blocking
> behaviour.

Both sending and receiving are in a new thread created by 
qio_channel_socket_connect_async(). So I think both of them can be 
blocking and don't need to be in another background thread.

What's your suggestion here? Should both sending and receiving be blocking, 
or non-blocking?

> Given this, you should not be using qio_channel_write_all()
> most likely. I think you need to be using qio_channel_add_watch
> to get notified when it is *writable*, to send 'in_data'
> incrementally & non-blocking. When that is finished then create
> another watch to wait for the reply.
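
For readers following along, the incremental non-blocking send described
above can be sketched with plain POSIX descriptors. This is an illustration
only, not the QEMU implementation: SendState, set_nonblocking() and
send_step() are hypothetical names, and in QEMU the "call me again when
writable" step would presumably be a qio_channel_add_watch() on G_IO_OUT
rather than anything shown here.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical progress record for one outgoing buffer. */
typedef struct SendState {
    const char *buf;    /* data to send */
    size_t len;         /* total number of bytes in buf */
    size_t off;         /* number of bytes already written */
} SendState;

enum { SEND_DONE = 0, SEND_AGAIN = 1, SEND_ERROR = -1 };

/* Put a descriptor into non-blocking mode; returns 0 on success. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * Call this each time the descriptor is reported writable. It writes as
 * much as the kernel accepts without blocking and records the progress.
 * SEND_AGAIN means: wait for the next "writable" event and call again.
 */
int send_step(int fd, SendState *s)
{
    assert(s != NULL && s->off <= s->len);

    while (s->off < s->len) {
        ssize_t n = write(fd, s->buf + s->off, s->len - s->off);
        if (n > 0) {
            s->off += (size_t)n;
        } else if (n < 0 && errno == EINTR) {
            continue;               /* interrupted, simply retry */
        } else if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
            return SEND_AGAIN;      /* kernel buffer full, do not wait here */
        } else {
            return SEND_ERROR;
        }
    }
    return SEND_DONE;
}
```

The point of the design is that send_step() never sleeps: the calling thread
returns to its event loop on SEND_AGAIN, and the reply watch is only set up
after SEND_DONE.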
> 
> 
>> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
>> +        goto error;
>> +    }
>> +
>> +    g_free(in_data);
>> +    qemu_set_fd_handler(t->ioc->fd, tdx_get_quote_read, NULL, t);
>> +
>> +    return;
>> +error:
>> +    t->hdr.out_len = cpu_to_le32(0);
>> +
>> +    if (address_space_write(
>> +            &address_space_memory, t->gpa,
>> +            MEMTXATTRS_UNSPECIFIED, &t->hdr, sizeof(t->hdr)) != MEMTX_OK) {
>> +        error_report("TDX: failed to update GetQuote header.\n");
>> +    }
>> +    tdx_td_notify(t);
>> +
>> +    qio_channel_close(QIO_CHANNEL(t->ioc), &err);
>> +    object_unref(OBJECT(t->ioc));
>> +    g_free(t);
>> +    g_free(in_data);
>> +
>> +    /* Maintain the number of in-flight requests. */
>> +    ms = MACHINE(qdev_get_machine());
>> +    tdx = TDX_GUEST(ms->cgs);
>> +    qemu_mutex_lock(&tdx->lock);
>> +    tdx->quote_generation_num--;
>> +    qemu_mutex_unlock(&tdx->lock);
>> +    return;
>> +}
>> +
>> +static void tdx_handle_get_quote(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
>> +{
>> +    hwaddr gpa = vmcall->in_r12;
>> +    uint64_t buf_len = vmcall->in_r13;
>> +    struct tdx_get_quote_header hdr;
>> +    MachineState *ms;
>> +    TdxGuest *tdx;
>> +    QIOChannelSocket *ioc;
>> +    struct tdx_get_quote_task *t;
>> +
>> +    vmcall->status_code = TDG_VP_VMCALL_INVALID_OPERAND;
>> +
>> +    /* GPA must be shared. */
>> +    if (!(gpa & tdx_shared_bit(cpu))) {
>> +        return;
>> +    }
>> +    gpa &= ~tdx_shared_bit(cpu);
>> +
>> +    if (!QEMU_IS_ALIGNED(gpa, 4096) || !QEMU_IS_ALIGNED(buf_len, 4096)) {
>> +        vmcall->status_code = TDG_VP_VMCALL_ALIGN_ERROR;
>> +        return;
>> +    }
>> +    if (buf_len == 0) {
>> +        return;
>> +    }
>> +
>> +    if (address_space_read(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
>> +                           &hdr, sizeof(hdr)) != MEMTX_OK) {
>> +        return;
>> +    }
>> +    if (le64_to_cpu(hdr.structure_version) != TDX_GET_QUOTE_STRUCTURE_VERSION) {
>> +        return;
>> +    }
>> +    /*
>> +     * Paranoid: Guest should clear error_code and out_len to avoid information
>> +     * leak.  Enforce it.  The initial value of them doesn't matter for qemu to
>> +     * process the request.
>> +     */
>> +    if (le64_to_cpu(hdr.error_code) != TDX_VP_GET_QUOTE_SUCCESS ||
>> +        le32_to_cpu(hdr.out_len) != 0) {
>> +        return;
>> +    }
>> +
>> +    /* Only safe-guard check to avoid too large buffer size. */
>> +    if (buf_len > TDX_GET_QUOTE_MAX_BUF_LEN ||
>> +        le32_to_cpu(hdr.in_len) > TDX_GET_QUOTE_MAX_BUF_LEN ||
>> +        le32_to_cpu(hdr.in_len) > buf_len) {
>> +        return;
>> +    }
>> +
>> +    /* Mark the buffer in-flight. */
>> +    hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_IN_FLIGHT);
>> +    if (address_space_write(&address_space_memory, gpa, MEMTXATTRS_UNSPECIFIED,
>> +                            &hdr, sizeof(hdr)) != MEMTX_OK) {
>> +        return;
>> +    }
>> +
>> +    ms = MACHINE(qdev_get_machine());
>> +    tdx = TDX_GUEST(ms->cgs);
>> +    ioc = qio_channel_socket_new();
>> +
>> +    t = g_malloc(sizeof(*t));
>> +    t->apic_id = tdx->event_notify_apic_id;
>> +    t->gpa = gpa;
>> +    t->buf_len = buf_len;
>> +    t->out_data = g_malloc(t->buf_len);
>> +    t->out_len = 0;
>> +    t->hdr = hdr;
>> +    t->ioc = ioc;
>> +
>> +    qemu_mutex_lock(&tdx->lock);
>> +    if (!tdx->quote_generation ||
>> +        /* Prevent too many in-flight get-quote request. */
>> +        tdx->quote_generation_num >= TDX_MAX_GET_QUOTE_REQUEST) {
>> +        qemu_mutex_unlock(&tdx->lock);
>> +        vmcall->status_code = TDG_VP_VMCALL_RETRY;
>> +        object_unref(OBJECT(ioc));
>> +        g_free(t->out_data);
>> +        g_free(t);
>> +        return;
>> +    }
>> +    tdx->quote_generation_num++;
>> +    t->event_notify_interrupt = tdx->event_notify_interrupt;
>> +    qio_channel_socket_connect_async(
>> +        ioc, tdx->quote_generation, tdx_handle_get_quote_connected, t, NULL,
>> +        NULL);
>> +    qemu_mutex_unlock(&tdx->lock);
>> +
>> +    vmcall->status_code = TDG_VP_VMCALL_SUCCESS;
>> +}
>> +
>>   static void tdx_handle_setup_event_notify_interrupt(X86CPU *cpu,
>>                                                       struct kvm_tdx_vmcall *vmcall)
>>   {
>> @@ -1005,6 +1432,9 @@ static void tdx_handle_vmcall(X86CPU *cpu, struct kvm_tdx_vmcall *vmcall)
>>       }
>>   
>>       switch (vmcall->subfunction) {
>> +    case TDG_VP_VMCALL_GET_QUOTE:
>> +        tdx_handle_get_quote(cpu, vmcall);
>> +        break;
>>       case TDG_VP_VMCALL_SETUP_EVENT_NOTIFY_INTERRUPT:
>>           tdx_handle_setup_event_notify_interrupt(cpu, vmcall);
>>           break;
>> diff --git a/target/i386/kvm/tdx.h b/target/i386/kvm/tdx.h
>> index 4a8d67cc9fdb..4a989805493e 100644
>> --- a/target/i386/kvm/tdx.h
>> +++ b/target/i386/kvm/tdx.h
>> @@ -5,8 +5,10 @@
>>   #include CONFIG_DEVICES /* CONFIG_TDX */
>>   #endif
>>   
>> +#include <linux/kvm.h>
>>   #include "exec/confidential-guest-support.h"
>>   #include "hw/i386/tdvf.h"
>> +#include "io/channel-socket.h"
>>   #include "sysemu/kvm.h"
>>   
>>   #define TYPE_TDX_GUEST "tdx-guest"
>> @@ -47,6 +49,10 @@ typedef struct TdxGuest {
>>       /* runtime state */
>>       int event_notify_interrupt;
>>       uint32_t event_notify_apic_id;
>> +
>> +    /* GetQuote */
>> +    int quote_generation_num;
>> +    SocketAddress *quote_generation;
>>   } TdxGuest;
>>   
>>   #ifdef CONFIG_TDX
>> -- 
>> 2.34.1
>>
> 
> With regards,
> Daniel



* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2023-12-29  2:30     ` Xiaoyao Li
@ 2024-01-08 14:44       ` Daniel P. Berrangé
  2024-01-09  5:38         ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Daniel P. Berrangé @ 2024-01-08 14:44 UTC (permalink / raw)
  To: Xiaoyao Li
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On Fri, Dec 29, 2023 at 10:30:15AM +0800, Xiaoyao Li wrote:
> On 11/16/2023 1:58 AM, Daniel P. Berrangé wrote:
> > On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
> > > From: Isaku Yamahata <isaku.yamahata@intel.com>
> > > 
> > > For GetQuote, delegate a request to Quote Generation Service.
> > > Add property "quote-generation-socket" to tdx-guest, whihc is a property
> > > of type SocketAddress to specify Quote Generation Service(QGS).
> > > 
> > > On request, connect to the QGS, read request buffer from shared guest
> > > memory, send the request buffer to the server and store the response
> > > into shared guest memory and notify TD guest by interrupt.
> > > 
> > > command line example:
> > >    qemu-system-x86_64 \
> > >      -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
> > >      -machine confidential-guest-support=tdx0
> > > 
> > > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
> > > Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
> > > Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
> > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
> > > ---
> > > Changes in v3:
> > > - rename property "quote-generation-service" to "quote-generation-socket";
> > > - change the type of "quote-generation-socket" from str to
> > >    SocketAddress;
> > > - squash next patch into this one;
> > > ---
> > >   qapi/qom.json         |   5 +-
> > >   target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
> > >   target/i386/kvm/tdx.h |   6 +
> > >   3 files changed, 440 insertions(+), 1 deletion(-)
> > > 
> > > +static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
> > > +{
> > > +    struct tdx_get_quote_task *t = opaque;
> > > +    Error *err = NULL;
> > > +    char *in_data = NULL;
> > > +    MachineState *ms;
> > > +    TdxGuest *tdx;
> > > +
> > > +    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
> > > +    if (qio_task_propagate_error(task, NULL)) {
> > > +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
> > > +        goto error;
> > > +    }
> > > +
> > > +    in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
> > > +    if (!in_data) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
> > > +                           MEMTXATTRS_UNSPECIFIED, in_data,
> > > +                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
> > > +        goto error;
> > > +    }
> > > +
> > > +    qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
> > 
> > You've set the channel to non-blocking, but....
> > 
> > > +
> > > +    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
> > > +                              le32_to_cpu(t->hdr.in_len), &err) ||
> > > +        err) {
> > 
> > ...this method will block execution of this thread, by either
> > sleeping in poll() or doing a coroutine yield.
> > 
> > I don't think this is in coroutine context, so presumably this
> > is just blocking.  So what was the point in marking the channel
> > non-blocking ?
> 
> Hi Dainel,
> 
> First of all, I'm not good at socket or qio channel thing. Please correct me
> and teach me when I'm wrong.
> 
> I'm not the author of this patch. My understanding is that, set it to
> non-blocking is for the qio_channel_write_all() to proceed immediately?

The '_all' suffixed methods are implemented such that they will
sleep in poll(), or a coroutine yield when seeing EAGAIN. 

> If set non-blocking is not needed, I can remove it.
> 
> > You are setting up a background watch to wait for the reply
> > so we don't block this thread, so you seem to want non-blocking
> > behaviour.
> 
> Both sending and receiving are in a new thread created by
> qio_channel_socket_connect_async(). So I think both of then can be blocking
> and don't need to be in another background thread.
> 
> what's your suggestion on it? Make both sending and receiving blocking or
> non-blocking?

I think the code /should/ be non-blocking, which would mean
using qio_channel_write instead of qio_channel_write_all,
and using a .
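
For illustration only, the difference can be sketched with plain POSIX
calls rather than QEMU's QIOChannel API. The names write_some(),
write_all(), set_nonblocking() and WRITE_WOULD_BLOCK are made up for this
sketch and are not QEMU functions. write_some() makes a single attempt and
reports "would block" instead of sleeping; write_all() loops and sleeps in
poll() whenever it sees EAGAIN, which is the behaviour described above for
the '_all' suffixed methods.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical sentinel: the write could not proceed without blocking. */
#define WRITE_WOULD_BLOCK (-2)

/* Put the descriptor into non-blocking mode. Returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0) {
        return -1;
    }
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/*
 * One write attempt. Returns the number of bytes written,
 * WRITE_WOULD_BLOCK if the descriptor is not writable right now,
 * or -1 on any other error. Never sleeps.
 */
static ssize_t write_some(int fd, const void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    do {
        n = write(fd, buf, len);
    } while (n < 0 && errno == EINTR);

    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        return WRITE_WOULD_BLOCK;
    }
    return n;
}

/*
 * Write the whole buffer. Even on a non-blocking descriptor this
 * blocks the calling thread: it sleeps in poll() until the
 * descriptor becomes writable again. Returns 0 on success, -1 on error.
 */
static int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write_some(fd, buf, len);

        if (n == WRITE_WOULD_BLOCK) {
            struct pollfd pfd = { .fd = fd, .events = POLLOUT };

            if (poll(&pfd, 1, -1) < 0 && errno != EINTR) {
                return -1;
            }
            continue;
        }
        if (n < 0) {
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}
```

A caller that wants to stay non-blocking would call write_some(), and on
WRITE_WOULD_BLOCK would return to its event loop and retry once the
descriptor is reported writable, instead of calling write_all().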

With regards,
Daniel
-- 
|: https://berrange.com      -o-    https://www.flickr.com/photos/dberrange :|
|: https://libvirt.org         -o-            https://fstop138.berrange.com :|
|: https://entangle-photo.org    -o-    https://www.instagram.com/dberrange :|



* Re: [PATCH v3 52/70] i386/tdx: handle TDG.VP.VMCALL<GetQuote>
  2024-01-08 14:44       ` Daniel P. Berrangé
@ 2024-01-09  5:38         ` Xiaoyao Li
  0 siblings, 0 replies; 161+ messages in thread
From: Xiaoyao Li @ 2024-01-09  5:38 UTC (permalink / raw)
  To: Daniel P. Berrangé
  Cc: Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Eric Blake, Markus Armbruster, Marcelo Tosatti,
	qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Chenyi Qiang

On 1/8/2024 10:44 PM, Daniel P. Berrangé wrote:
> On Fri, Dec 29, 2023 at 10:30:15AM +0800, Xiaoyao Li wrote:
>> On 11/16/2023 1:58 AM, Daniel P. Berrangé wrote:
>>> On Wed, Nov 15, 2023 at 02:15:01AM -0500, Xiaoyao Li wrote:
>>>> From: Isaku Yamahata <isaku.yamahata@intel.com>
>>>>
>>>> For GetQuote, delegate a request to Quote Generation Service.
>>>> Add property "quote-generation-socket" to tdx-guest, whihc is a property
>>>> of type SocketAddress to specify Quote Generation Service(QGS).
>>>>
>>>> On request, connect to the QGS, read request buffer from shared guest
>>>> memory, send the request buffer to the server and store the response
>>>> into shared guest memory and notify TD guest by interrupt.
>>>>
>>>> command line example:
>>>>     qemu-system-x86_64 \
>>>>       -object '{"qom-type":"tdx-guest","id":"tdx0","quote-generation-socket":{"type": "vsock", "cid":"2","port":"1234"}}' \
>>>>       -machine confidential-guest-support=tdx0
>>>>
>>>> Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com>
>>>> Codeveloped-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> Signed-off-by: Chenyi Qiang <chenyi.qiang@intel.com>
>>>> Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com>
>>>> ---
>>>> Changes in v3:
>>>> - rename property "quote-generation-service" to "quote-generation-socket";
>>>> - change the type of "quote-generation-socket" from str to
>>>>     SocketAddress;
>>>> - squash next patch into this one;
>>>> ---
>>>>    qapi/qom.json         |   5 +-
>>>>    target/i386/kvm/tdx.c | 430 ++++++++++++++++++++++++++++++++++++++++++
>>>>    target/i386/kvm/tdx.h |   6 +
>>>>    3 files changed, 440 insertions(+), 1 deletion(-)
>>>>
>>>> +static void tdx_handle_get_quote_connected(QIOTask *task, gpointer opaque)
>>>> +{
>>>> +    struct tdx_get_quote_task *t = opaque;
>>>> +    Error *err = NULL;
>>>> +    char *in_data = NULL;
>>>> +    MachineState *ms;
>>>> +    TdxGuest *tdx;
>>>> +
>>>> +    t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_ERROR);
>>>> +    if (qio_task_propagate_error(task, NULL)) {
>>>> +        t->hdr.error_code = cpu_to_le64(TDX_VP_GET_QUOTE_QGS_UNAVAILABLE);
>>>> +        goto error;
>>>> +    }
>>>> +
>>>> +    in_data = g_malloc(le32_to_cpu(t->hdr.in_len));
>>>> +    if (!in_data) {
>>>> +        goto error;
>>>> +    }
>>>> +
>>>> +    if (address_space_read(&address_space_memory, t->gpa + sizeof(t->hdr),
>>>> +                           MEMTXATTRS_UNSPECIFIED, in_data,
>>>> +                           le32_to_cpu(t->hdr.in_len)) != MEMTX_OK) {
>>>> +        goto error;
>>>> +    }
>>>> +
>>>> +    qio_channel_set_blocking(QIO_CHANNEL(t->ioc), false, NULL);
>>>
>>> You've set the channel to non-blocking, but....
>>>
>>>> +
>>>> +    if (qio_channel_write_all(QIO_CHANNEL(t->ioc), in_data,
>>>> +                              le32_to_cpu(t->hdr.in_len), &err) ||
>>>> +        err) {
>>>
>>> ...this method will block execution of this thread, by either
>>> sleeping in poll() or doing a coroutine yield.
>>>
>>> I don't think this is in coroutine context, so presumably this
>>> is just blocking.  So what was the point in marking the channel
>>> non-blocking ?
>>
>> Hi Dainel,
>>
>> First of all, I'm not good at socket or qio channel thing. Please correct me
>> and teach me when I'm wrong.
>>
>> I'm not the author of this patch. My understanding is that, set it to
>> non-blocking is for the qio_channel_write_all() to proceed immediately?
> 
> The '_all' suffixed methods are implemented such that they will
> sleep in poll(), or a coroutine yield when seeing EAGAIN.
> 
>> If set non-blocking is not needed, I can remove it.
>>
>>> You are setting up a background watch to wait for the reply
>>> so we don't block this thread, so you seem to want non-blocking
>>> behaviour.
>>
>> Both sending and receiving are in a new thread created by
>> qio_channel_socket_connect_async(). So I think both of then can be blocking
>> and don't need to be in another background thread.
>>
>> what's your suggestion on it? Make both sending and receiving blocking or
>> non-blocking?
> 
> I think the code /should/ be non-blocking, which would mean
> using   qio_channel_write, instead of qio_channel_write_all,
> and using a .

I see. I will implement this in the next version.

> With regards,
> Daniel



* Re: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2023-12-21 13:47           ` Wang, Wei W
@ 2024-01-09  5:47             ` Xiaoyao Li
  2024-01-09 14:53               ` Wang, Wei W
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2024-01-09  5:47 UTC (permalink / raw)
  To: Wang, Wei W, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On 12/21/2023 9:47 PM, Wang, Wei W wrote:
> On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
>> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
>>> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE there.
>>> I'm suggesting below:
>>>
>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
>>> 2d9a2455de..63ba74b221 100644
>>> --- a/accel/kvm/kvm-all.c
>>> +++ b/accel/kvm/kvm-all.c
>>> @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr
>> start, hwaddr size, uint64_t attr)
>>>        struct kvm_memory_attributes attrs;
>>>        int r;
>>>
>>> +    if ((attr & kvm_supported_memory_attributes) != attr) {
>>> +        error_report("KVM doesn't support memory attr %lx\n", attr);
>>> +        return -EINVAL;
>>> +    }
>>
>> In the case of setting a range of memory to shared while KVM doesn't support
>> private memory. Above check doesn't work. and following IOCTL fails.
> 
> SHARED attribute uses the value 0, which indicates it's always supported, no?
> For the implementation, can you find in the KVM side where the ioctl
> would get failed in that case?

I'm worried about a future case in which KVM supports memory attributes
other than shared/private. For example, KVM might support the RWX bits
(bits 0 - 2) but not the shared/private bit.

This patch designs kvm_set_memory_attributes() to be common to all
attribute bits, including future ones, so it leaves the support check to
each caller.

If you think that's unnecessary, I can rename
kvm_set_memory_attributes() to kvm_set_memory_shared_private() so that
it handles only the shared/private bit; the check can then be moved into it.

> static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
>                                             struct kvm_memory_attributes *attrs)
> {
>          gfn_t start, end;
> 
>          /* flags is currently not used. */
>          if (attrs->flags)
>                  return -EINVAL;
>          if (attrs->attributes & ~kvm_supported_mem_attributes(kvm)) ==> 0 here
>                  return -EINVAL;
>          if (attrs->size == 0 || attrs->address + attrs->size < attrs->address)
>                  return -EINVAL;
>          if (!PAGE_ALIGNED(attrs->address) || !PAGE_ALIGNED(attrs->size))
>                  return -EINVAL;



* RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2024-01-09  5:47             ` Xiaoyao Li
@ 2024-01-09 14:53               ` Wang, Wei W
  2024-01-09 16:32                 ` Xiaoyao Li
  0 siblings, 1 reply; 161+ messages in thread
From: Wang, Wei W @ 2024-01-09 14:53 UTC (permalink / raw)
  To: Li, Xiaoyao, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On Tuesday, January 9, 2024 1:47 PM, Li, Xiaoyao wrote:
> On 12/21/2023 9:47 PM, Wang, Wei W wrote:
> > On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
> >> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> >>> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE
> there.
> >>> I'm suggesting below:
> >>>
> >>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
> >>> 2d9a2455de..63ba74b221 100644
> >>> --- a/accel/kvm/kvm-all.c
> >>> +++ b/accel/kvm/kvm-all.c
> >>> @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr
> >> start, hwaddr size, uint64_t attr)
> >>>        struct kvm_memory_attributes attrs;
> >>>        int r;
> >>>
> >>> +    if ((attr & kvm_supported_memory_attributes) != attr) {
> >>> +        error_report("KVM doesn't support memory attr %lx\n", attr);
> >>> +        return -EINVAL;
> >>> +    }
> >>
> >> In the case of setting a range of memory to shared while KVM doesn't
> >> support private memory. Above check doesn't work. and following IOCTL
> fails.
> >
> > SHARED attribute uses the value 0, which indicates it's always supported, no?
> > For the implementation, can you find in the KVM side where the ioctl
> > would get failed in that case?
> 
> I'm worrying about the future case, that KVM supports other memory attribute
> than shared/private. For example, KVM supports RWX bits (bit 0
> - 2) but not shared/private bit.

What's the exact issue?
+#define KVM_MEMORY_ATTRIBUTE_READ               (1ULL << 2)
+#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
+#define KVM_MEMORY_ATTRIBUTE_EXE                  (1ULL << 0)

They are checked via
"if ((attr & kvm_supported_memory_attributes) != attr)" shared above in
kvm_set_memory_attributes.
In the case you described, kvm_supported_memory_attributes will be 0x7.
Anything unexpected?
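
As a minimal standalone sketch of that check (not the QEMU code itself):
the EX_ATTR_* macros below reuse the hypothetical RWX layout from the
example above, EX_ATTR_PRIVATE at bit 3 is an assumption made for this
illustration, and attrs_supported() is a made-up helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout for illustration; not a real KVM ABI. */
#define EX_ATTR_EXE     (1ULL << 0)
#define EX_ATTR_WRITE   (1ULL << 1)
#define EX_ATTR_READ    (1ULL << 2)
#define EX_ATTR_PRIVATE (1ULL << 3)   /* assumed position */

/* True when every bit set in attr is also set in supported. */
static bool attrs_supported(uint64_t attr, uint64_t supported)
{
    return (attr & supported) == attr;
}

/* Usage: a KVM that supports RWX only, i.e. a supported mask of 0x7. */
void attrs_supported_demo(void)
{
    const uint64_t supported = EX_ATTR_READ | EX_ATTR_WRITE | EX_ATTR_EXE;

    /* Any subset of RWX is accepted. */
    assert(attrs_supported(EX_ATTR_READ | EX_ATTR_WRITE, supported));
    /* The private bit is outside the mask, so it is rejected. */
    assert(!attrs_supported(EX_ATTR_PRIVATE, supported));
    /* A shared request is attr == 0 and has no bits to reject. */
    assert(attrs_supported(0, supported));
}
```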


* Re: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2024-01-09 14:53               ` Wang, Wei W
@ 2024-01-09 16:32                 ` Xiaoyao Li
  2024-01-10  1:53                   ` Wang, Wei W
  0 siblings, 1 reply; 161+ messages in thread
From: Xiaoyao Li @ 2024-01-09 16:32 UTC (permalink / raw)
  To: Wang, Wei W, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On 1/9/2024 10:53 PM, Wang, Wei W wrote:
> On Tuesday, January 9, 2024 1:47 PM, Li, Xiaoyao wrote:
>> On 12/21/2023 9:47 PM, Wang, Wei W wrote:
>>> On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
>>>> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
>>>>> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE
>> there.
>>>>> I'm suggesting below:
>>>>>
>>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
>>>>> 2d9a2455de..63ba74b221 100644
>>>>> --- a/accel/kvm/kvm-all.c
>>>>> +++ b/accel/kvm/kvm-all.c
>>>>> @@ -1375,6 +1375,11 @@ static int kvm_set_memory_attributes(hwaddr
>>>> start, hwaddr size, uint64_t attr)
>>>>>         struct kvm_memory_attributes attrs;
>>>>>         int r;
>>>>>
>>>>> +    if ((attr & kvm_supported_memory_attributes) != attr) {
>>>>> +        error_report("KVM doesn't support memory attr %lx\n", attr);
>>>>> +        return -EINVAL;
>>>>> +    }
>>>>
>>>> In the case of setting a range of memory to shared while KVM doesn't
>>>> support private memory. Above check doesn't work. and following IOCTL
>> fails.
>>>
>>> SHARED attribute uses the value 0, which indicates it's always supported, no?
>>> For the implementation, can you find in the KVM side where the ioctl
>>> would get failed in that case?
>>
>> I'm worrying about the future case, that KVM supports other memory attribute
>> than shared/private. For example, KVM supports RWX bits (bit 0
>> - 2) but not shared/private bit.
> 
> What's the exact issue?
> +#define KVM_MEMORY_ATTRIBUTE_READ               (1ULL << 2)
> +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> +#define KVM_MEMORY_ATTRIBUTE_EXE                  (1ULL << 0)
> 
> They are checked via
> "if ((attr & kvm_supported_memory_attributes) != attr)" shared above in
> kvm_set_memory_attributes.
> In the case you described, kvm_supported_memory_attributes will be 0x7.
> Anything unexpected?

Sorry, I was thinking of the wrong case.

The check doesn't work when KVM doesn't support memory attributes at all,
e.g., an old KVM. In that case 'kvm_supported_memory_attributes' is 0
and 'attr' is 0 as well, so the check passes.
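
A small standalone sketch of this corner case, under the assumption that a
supported mask of 0 means KVM offers no memory-attribute interface at all.
The enum, mask_only_check() and guarded_check() are hypothetical names for
illustration; guarded_check() is one possible way to tell the two
situations apart, not what the patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

enum attr_check_result {
    ATTR_CHECK_OK,
    ATTR_CHECK_UNSUPPORTED_BITS,
    ATTR_CHECK_NO_INTERFACE,
};

/* The mask-only check: true when attr has no bit outside supported. */
static bool mask_only_check(uint64_t attr, uint64_t supported)
{
    return (attr & supported) == attr;
}

/*
 * Hypothetical variant: treat an empty supported mask as "no attribute
 * interface" before looking at individual bits.
 */
static enum attr_check_result guarded_check(uint64_t attr, uint64_t supported)
{
    if (supported == 0) {
        return ATTR_CHECK_NO_INTERFACE;
    }
    if (!mask_only_check(attr, supported)) {
        return ATTR_CHECK_UNSUPPORTED_BITS;
    }
    return ATTR_CHECK_OK;
}

void guarded_check_demo(void)
{
    /* Old KVM: supported == 0 and a shared request (attr == 0). */
    assert(mask_only_check(0, 0));   /* passes, nothing is rejected */
    assert(guarded_check(0, 0) == ATTR_CHECK_NO_INTERFACE);

    /* With a non-empty mask, a shared request is accepted. */
    assert(guarded_check(0, 1ULL << 3) == ATTR_CHECK_OK);
}
```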


* RE: [PATCH v3 06/70] kvm: Introduce support for memory_attributes
  2024-01-09 16:32                 ` Xiaoyao Li
@ 2024-01-10  1:53                   ` Wang, Wei W
  0 siblings, 0 replies; 161+ messages in thread
From: Wang, Wei W @ 2024-01-10  1:53 UTC (permalink / raw)
  To: Li, Xiaoyao, Paolo Bonzini, David Hildenbrand, Igor Mammedov,
	Michael S . Tsirkin, Marcel Apfelbaum, Richard Henderson,
	Peter Xu, Philippe Mathieu-Daudé,
	Cornelia Huck, Daniel P . Berrangé,
	Eric Blake, Markus Armbruster, Marcelo Tosatti
  Cc: qemu-devel, kvm, Michael Roth, Sean Christopherson,
	Claudio Fontana, Gerd Hoffmann, Isaku Yamahata, Qiang, Chenyi

On Wednesday, January 10, 2024 12:32 AM, Li, Xiaoyao wrote:
> On 1/9/2024 10:53 PM, Wang, Wei W wrote:
> > On Tuesday, January 9, 2024 1:47 PM, Li, Xiaoyao wrote:
> >> On 12/21/2023 9:47 PM, Wang, Wei W wrote:
> >>> On Thursday, December 21, 2023 7:54 PM, Li, Xiaoyao wrote:
> >>>> On 12/21/2023 6:36 PM, Wang, Wei W wrote:
> >>>>> No need to specifically check for KVM_MEMORY_ATTRIBUTE_PRIVATE
> >> there.
> >>>>> I'm suggesting below:
> >>>>>
> >>>>> diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c index
> >>>>> 2d9a2455de..63ba74b221 100644
> >>>>> --- a/accel/kvm/kvm-all.c
> >>>>> +++ b/accel/kvm/kvm-all.c
> >>>>> @@ -1375,6 +1375,11 @@ static int
> kvm_set_memory_attributes(hwaddr
> >>>> start, hwaddr size, uint64_t attr)
> >>>>>         struct kvm_memory_attributes attrs;
> >>>>>         int r;
> >>>>>
> >>>>> +    if ((attr & kvm_supported_memory_attributes) != attr) {
> >>>>> +        error_report("KVM doesn't support memory attr %lx\n", attr);
> >>>>> +        return -EINVAL;
> >>>>> +    }
> >>>>
> >>>> In the case of setting a range of memory to shared while KVM
> >>>> doesn't support private memory. Above check doesn't work. and
> >>>> following IOCTL
> >> fails.
> >>>
> >>> SHARED attribute uses the value 0, which indicates it's always supported,
> no?
> >>> For the implementation, can you find in the KVM side where the ioctl
> >>> would get failed in that case?
> >>
> >> I'm worrying about the future case, that KVM supports other memory
> >> attribute than shared/private. For example, KVM supports RWX bits
> >> (bit 0
> >> - 2) but not shared/private bit.
> >
> > What's the exact issue?
> > +#define KVM_MEMORY_ATTRIBUTE_READ               (1ULL << 2)
> > +#define KVM_MEMORY_ATTRIBUTE_WRITE             (1ULL << 1)
> > +#define KVM_MEMORY_ATTRIBUTE_EXE                  (1ULL << 0)
> >
> > They are checked via
> > "if ((attr & kvm_supported_memory_attributes) != attr)" shared above
> > in kvm_set_memory_attributes.
> > In the case you described, kvm_supported_memory_attributes will be 0x7.
> > Anything unexpected?
> 
> Sorry that I thought for wrong case.
> 
> It doesn't work on the case that KVM doesn't support memory_attribute, e.g.,
> an old KVM. In this case, 'kvm_supported_memory_attributes' is 0, and 'attr' is
> 0 as well.

How is this different from your existing implementation?

The official way of defining a feature is to take a bit (as the first feature,
*_PRIVATE, does). Using 0 as an attr is a bit magic, and it passes every "&"-based check.
But using it for *_SHARED looks fine to me, since semantically memory can always be shared,
and the ioctl will return -ENOTTY anyway in the case you mentioned.


2023-11-15  7:15 ` [PATCH v3 60/70] i386/tdx: Disable SMM for TDX VMs Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 61/70] i386/tdx: Disable PIC " Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 62/70] i386/tdx: Don't allow system reset " Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 63/70] i386/tdx: LMCE is not supported for TDX Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 64/70] hw/i386: add eoi_intercept_unsupported member to X86MachineState Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 65/70] hw/i386: add option to forcibly report edge trigger in acpi tables Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 66/70] i386/tdx: Don't synchronize guest tsc for TDs Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 67/70] i386/tdx: Only configure MSR_IA32_UCODE_REV in kvm_init_msrs() " Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 68/70] i386/tdx: Skip kvm_put_apicbase() " Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 69/70] i386/tdx: Don't get/put guest state for TDX VMs Xiaoyao Li
2023-11-15  7:15 ` [PATCH v3 70/70] docs: Add TDX documentation Xiaoyao Li
