On Tue, Mar 12, 2024 at 10:10:09AM +0800, Xuan Zhuo wrote: > This pathset is splited from the > > http://lore.kernel.org/all/20240229072044.77388-1-xuanzhuo@linux.alibaba.com > > That may needs some cycles to discuss. But that notifies too many people. > > But just the four commits need to notify so many people. > And four commits are independent. So I split that patch set, > let us review these first. > > The patch set try to refactor the params of find_vqs(). > Then we can just change the structure, when introducing new > features. > > Thanks. > > v3: > 1. fix the bug: "assignment of read-only location '*cfg.names'" > > v2: > 1. add kerneldoc for "struct vq_transport_config" @ilpo.jarvinen > > v1: > 1. fix some comments from ilpo.jarvinen@linux.intel.com > As this came in after merge window was open I'm deferring this to the next merge window. Jason, can you pls try to complete the review meanwhile? > > Xuan Zhuo (4): > virtio: find_vqs: pass struct instead of multi parameters > virtio: vring_create_virtqueue: pass struct instead of multi > parameters > virtio: vring_new_virtqueue(): pass struct instead of multi parameters > virtio_ring: simplify the parameters of the funcs related to > vring_create/new_virtqueue() > > arch/um/drivers/virtio_uml.c | 31 ++-- > drivers/platform/mellanox/mlxbf-tmfifo.c | 24 ++-- > drivers/remoteproc/remoteproc_virtio.c | 31 ++-- > drivers/s390/virtio/virtio_ccw.c | 33 ++--- > drivers/virtio/virtio_mmio.c | 30 ++-- > drivers/virtio/virtio_pci_common.c | 60 ++++---- > drivers/virtio/virtio_pci_common.h | 9 +- > drivers/virtio/virtio_pci_legacy.c | 16 ++- > drivers/virtio/virtio_pci_modern.c | 38 +++-- > drivers/virtio/virtio_ring.c | 173 ++++++++--------------- > drivers/virtio/virtio_vdpa.c | 45 +++--- > include/linux/virtio_config.h | 85 ++++++++--- > include/linux/virtio_ring.h | 93 +++++++----- > tools/virtio/virtio_test.c | 4 +- > tools/virtio/vringh_test.c | 28 ++-- > 15 files changed, 363 insertions(+), 337 deletions(-) > > -- > 2.32.0.3.g01195cf9f
On Mon, Mar 18, 2024 at 01:59:52PM +0800, Xuan Zhuo wrote: > On Mon, 18 Mar 2024 12:18:23 +0800, Jason Wang <jasowang@redhat.com> wrote: > > On Fri, Mar 15, 2024 at 3:26 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote: > > > > > > On Fri, 15 Mar 2024 11:51:48 +0800, Jason Wang <jasowang@redhat.com> wrote: > > > > On Thu, Mar 14, 2024 at 2:00 PM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote: > > > > > > > > > > On Thu, 14 Mar 2024 11:12:24 +0800, Jason Wang <jasowang@redhat.com> wrote: > > > > > > On Tue, Mar 12, 2024 at 10:10 AM Xuan Zhuo <xuanzhuo@linux.alibaba.com> wrote: > > > > > > > > > > > > > > Now, we pass multi parameters to find_vqs. These parameters > > > > > > > may work for transport or work for vring. > > > > > > > > > > > > > > And find_vqs has multi implements in many places: > > > > > > > > > > > > > > arch/um/drivers/virtio_uml.c > > > > > > > drivers/platform/mellanox/mlxbf-tmfifo.c > > > > > > > drivers/remoteproc/remoteproc_virtio.c > > > > > > > drivers/s390/virtio/virtio_ccw.c > > > > > > > drivers/virtio/virtio_mmio.c > > > > > > > drivers/virtio/virtio_pci_legacy.c > > > > > > > drivers/virtio/virtio_pci_modern.c > > > > > > > drivers/virtio/virtio_vdpa.c > > > > > > > > > > > > > > Every time, we try to add a new parameter, that is difficult. > > > > > > > We must change every find_vqs implement. > > > > > > > > > > > > > > One the other side, if we want to pass a parameter to vring, > > > > > > > we must change the call path from transport to vring. > > > > > > > Too many functions need to be changed. > > > > > > > > > > > > > > So it is time to refactor the find_vqs. We pass a structure > > > > > > > cfg to find_vqs(), that will be passed to vring by transport. > > > > > > > > > > > > > > Because the vp_modern_create_avq() use the "const char *names[]", > > > > > > > and the virtio_uml.c changes the name in the subsequent commit, so > > > > > > > change the "names" inside the virtio_vq_config from "const char *const > > > > > > > *names" to "const char **names". > > > > > > > > > > > > > > Signed-off-by: Xuan Zhuo <xuanzhuo@linux.alibaba.com> > > > > > > > Acked-by: Johannes Berg <johannes@sipsolutions.net> > > > > > > > Reviewed-by: Ilpo J=E4rvinen <ilpo.jarvinen@linux.intel.com> > > > > > > > > > > > > The name seems broken here. > > > > > > > > > > Email APP bug. > > > > > > > > > > I will fix. > > > > > > > > > > > > > > > > > > > > > > [...] > > > > > > > > > > > > > > > > > > > > typedef void vq_callback_t(struct virtqueue *); > > > > > > > > > > > > > > +/** > > > > > > > + * struct virtio_vq_config - configure for find_vqs() > > > > > > > + * @cfg_idx: Used by virtio core. The drivers should set this to 0. > > > > > > > + * During the initialization of each vq(vring setup), we need to know which > > > > > > > + * item in the array should be used at that time. But since the item in > > > > > > > + * names can be null, which causes some item of array to be skipped, we > > > > > > > + * cannot use vq.index as the current id. So add a cfg_idx to let vring > > > > > > > + * know how to get the current configuration from the array when > > > > > > > + * initializing vq. > > > > > > > > > > > > So this design is not good. If it is not something that the driver > > > > > > needs to care about, the core needs to hide it from the API. > > > > > > > > > > The driver just ignore it. That will be beneficial to the virtio core. > > > > > Otherwise, we must pass one more parameter everywhere. > > > > > > > > I don't get here, it's an internal logic and we've already done that. > > > > > > > > > ## Then these must add one param "cfg_idx"; > > > > > > struct virtqueue *vring_create_virtqueue(struct virtio_device *vdev, > > > unsigned int index, > > > struct vq_transport_config *tp_cfg, > > > struct virtio_vq_config *cfg, > > > --> uint cfg_idx); > > > > > > struct virtqueue *vring_new_virtqueue(struct virtio_device *vdev, > > > unsigned int index, > > > void *pages, > > > struct vq_transport_config *tp_cfg, > > > struct virtio_vq_config *cfg, > > > --> uint cfg_idx); > > > > > > > > > ## The functions inside virtio_ring also need to add a new param, such as: > > > > > > static struct virtqueue *vring_create_virtqueue_split(struct virtio_device *vdev, > > > unsigned int index, > > > struct vq_transport_config *tp_cfg, > > > struct virtio_vq_config, > > > --> uint cfg_idx); > > > > > > > > > > > > > I guess what I'm missing is when could the index differ from cfg_idx? > > > @cfg_idx: Used by virtio core. The drivers should set this to 0. > During the initialization of each vq(vring setup), we need to know which > item in the array should be used at that time. But since the item in > names can be null, which causes some item of array to be skipped, we > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > cannot use vq.index as the current id. So add a cfg_idx to let vring > know how to get the current configuration from the array when > initializing vq. > > > static int vp_find_vqs_msix(struct virtio_device *vdev, unsigned int nvqs, > > ................ > > for (i = 0; i < nvqs; ++i) { > if (!names[i]) { > vqs[i] = NULL; > continue; > } > > if (!callbacks[i]) > msix_vec = VIRTIO_MSI_NO_VECTOR; > else if (vp_dev->per_vq_vectors) > msix_vec = allocated_vectors++; > else > msix_vec = VP_MSIX_VQ_VECTOR; > vqs[i] = vp_setup_vq(vdev, queue_idx++, callbacks[i], names[i], > ctx ? ctx[i] : false, > msix_vec); > > > Thanks. Jason what do you think is the way to resolve this? > > > > Thanks > > > > > Thanks. > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > Thanks. > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > >
When VM runs in kvm mode, system will not exit to host mode if executing general software breakpoint instruction such as INSN_BREAK, trap exception happens in guest mode rather than host mode. In order to debug guest kernel on host side, one mechanism should be used to let vm exit to host mode. Here hypercall instruction with special code is used for software breakpoint usage, vm exists to host mode and kvm hypervisor identifies the special hypercall code and sets exit_reason with KVM_EXIT_DEBUG, and then let qemu handle it. Idea comes from ppc kvm, one api KVM_REG_LOONGARCH_DEBUG_INST is added to get hypercall code. VMM needs get sw breakpoint instruction with this api and set the corresponding sw break point for guest kernel. Since it needs hypercall instruction emulation handling, and it is dependent on this patchset: https://lore.kernel.org/all/20240315080710.2812974-1-maobibo@loongson.cn/ Signed-off-by: Bibo Mao <maobibo@loongson.cn> --- v3 -- v4: 1. Rebase on the latest kernel, remove adding macro __KVM_HAVE_GUEST_DEBUG in arch specific header files since it is put in common kvm header files. 2. Rebase on pv ipi patch, rename KVM_HC_SWDBG with KVM_HCALL_SWDBG. v2 -- v3: 1. Add api KVM_REG_LOONGARCH_DEBUG_INST to get sw breakpoint instruction for vmm. 2. Check vcpu::guest_debug with value KVM_GUESTDBG_USE_SW_BP only, since another value KVM_GUESTDBG_ENABLE will be set if it is not zero. v1 -- v2: 1. Add checking for hypercall code KVM_HC_SWDBG, it is effective only if KVM_GUESTDBG_USE_SW_BP and KVM_GUESTDBG_ENABLE is set. --- arch/loongarch/include/asm/inst.h | 1 + arch/loongarch/include/asm/kvm_host.h | 4 ++++ arch/loongarch/include/asm/kvm_para.h | 2 ++ arch/loongarch/include/uapi/asm/kvm.h | 3 +++ arch/loongarch/kvm/exit.c | 16 ++++++++++++++-- arch/loongarch/kvm/vcpu.c | 13 ++++++++++++- arch/loongarch/kvm/vm.c | 1 + 7 files changed, 37 insertions(+), 3 deletions(-) diff --git a/arch/loongarch/include/asm/inst.h b/arch/loongarch/include/asm/inst.h index ad120f924905..c3993fd88aba 100644 --- a/arch/loongarch/include/asm/inst.h +++ b/arch/loongarch/include/asm/inst.h @@ -12,6 +12,7 @@ #define INSN_NOP 0x03400000 #define INSN_BREAK 0x002a0000 +#define INSN_HVCL 0x002b8000 #define ADDR_IMMMASK_LU52ID 0xFFF0000000000000 #define ADDR_IMMMASK_LU32ID 0x000FFFFF00000000 diff --git a/arch/loongarch/include/asm/kvm_host.h b/arch/loongarch/include/asm/kvm_host.h index 0b96c6303cf7..c53946f8ef9f 100644 --- a/arch/loongarch/include/asm/kvm_host.h +++ b/arch/loongarch/include/asm/kvm_host.h @@ -31,6 +31,10 @@ #define KVM_HALT_POLL_NS_DEFAULT 500000 +#define KVM_GUESTDBG_VALID_MASK (KVM_GUESTDBG_ENABLE | \ + KVM_GUESTDBG_USE_SW_BP | KVM_GUESTDBG_SINGLESTEP) +#define KVM_GUESTDBG_SW_BP_MASK (KVM_GUESTDBG_ENABLE | \ + KVM_GUESTDBG_USE_SW_BP) struct kvm_vm_stat { struct kvm_vm_stat_generic generic; u64 pages; diff --git a/arch/loongarch/include/asm/kvm_para.h b/arch/loongarch/include/asm/kvm_para.h index a5809a854bae..56775554402a 100644 --- a/arch/loongarch/include/asm/kvm_para.h +++ b/arch/loongarch/include/asm/kvm_para.h @@ -9,8 +9,10 @@ #define HYPERVISOR_VENDOR_SHIFT 8 #define HYPERCALL_CODE(vendor, code) ((vendor << HYPERVISOR_VENDOR_SHIFT) + code) #define KVM_HCALL_CODE_PV_SERVICE 0 +#define KVM_HCALL_CODE_SWDBG 1 #define KVM_HCALL_PV_SERVICE HYPERCALL_CODE(HYPERVISOR_KVM, KVM_HCALL_CODE_PV_SERVICE) #define KVM_HCALL_FUNC_PV_IPI 1 +#define KVM_HCALL_SWDBG HYPERCALL_CODE(HYPERVISOR_KVM, KVM_HCALL_CODE_SWDBG) /* * LoongArch hypercall return code diff --git a/arch/loongarch/include/uapi/asm/kvm.h b/arch/loongarch/include/uapi/asm/kvm.h index 109785922cf9..8f78b23672ac 100644 --- a/arch/loongarch/include/uapi/asm/kvm.h +++ b/arch/loongarch/include/uapi/asm/kvm.h @@ -17,6 +17,7 @@ #define KVM_COALESCED_MMIO_PAGE_OFFSET 1 #define KVM_DIRTY_LOG_PAGE_OFFSET 64 +#define KVM_GUESTDBG_USE_SW_BP 0x00010000 /* * for KVM_GET_REGS and KVM_SET_REGS */ @@ -72,6 +73,8 @@ struct kvm_fpu { #define KVM_REG_LOONGARCH_COUNTER (KVM_REG_LOONGARCH_KVM | KVM_REG_SIZE_U64 | 1) #define KVM_REG_LOONGARCH_VCPU_RESET (KVM_REG_LOONGARCH_KVM | KVM_REG_SIZE_U64 | 2) +/* Debugging: Special instruction for software breakpoint */ +#define KVM_REG_LOONGARCH_DEBUG_INST (KVM_REG_LOONGARCH_KVM | KVM_REG_SIZE_U64 | 3) #define LOONGARCH_REG_SHIFT 3 #define LOONGARCH_REG_64(TYPE, REG) (TYPE | KVM_REG_SIZE_U64 | (REG << LOONGARCH_REG_SHIFT)) diff --git a/arch/loongarch/kvm/exit.c b/arch/loongarch/kvm/exit.c index 78857147bc14..d71172e2568e 100644 --- a/arch/loongarch/kvm/exit.c +++ b/arch/loongarch/kvm/exit.c @@ -774,23 +774,35 @@ static int kvm_handle_hypercall(struct kvm_vcpu *vcpu) { larch_inst inst; unsigned int code; + int ret; inst.word = vcpu->arch.badi; code = inst.reg0i15_format.immediate; - update_pc(&vcpu->arch); + ret = RESUME_GUEST; switch (code) { case KVM_HCALL_PV_SERVICE: vcpu->stat.hypercall_exits++; kvm_handle_pv_service(vcpu); break; + case KVM_HCALL_SWDBG: + /* KVM_HC_SWDBG only in effective when SW_BP is enabled */ + if (vcpu->guest_debug & KVM_GUESTDBG_SW_BP_MASK) { + vcpu->run->exit_reason = KVM_EXIT_DEBUG; + ret = RESUME_HOST; + break; + } + fallthrough; default: /* Treat it as noop intruction, only set return value */ kvm_write_reg(vcpu, LOONGARCH_GPR_A0, KVM_HCALL_INVALID_CODE); break; } - return RESUME_GUEST; + if (ret == RESUME_GUEST) + update_pc(&vcpu->arch); + + return ret; } /* diff --git a/arch/loongarch/kvm/vcpu.c b/arch/loongarch/kvm/vcpu.c index 76f2086ab68b..f22d10228cd2 100644 --- a/arch/loongarch/kvm/vcpu.c +++ b/arch/loongarch/kvm/vcpu.c @@ -248,7 +248,15 @@ int kvm_arch_vcpu_ioctl_set_mpstate(struct kvm_vcpu *vcpu, int kvm_arch_vcpu_ioctl_set_guest_debug(struct kvm_vcpu *vcpu, struct kvm_guest_debug *dbg) { - return -EINVAL; + if (dbg->control & ~KVM_GUESTDBG_VALID_MASK) + return -EINVAL; + + if (dbg->control & KVM_GUESTDBG_ENABLE) + vcpu->guest_debug = dbg->control; + else + vcpu->guest_debug = 0; + + return 0; } static int _kvm_getcsr(struct kvm_vcpu *vcpu, unsigned int id, u64 *val) @@ -500,6 +508,9 @@ static int kvm_get_one_reg(struct kvm_vcpu *vcpu, case KVM_REG_LOONGARCH_COUNTER: *v = drdtime() + vcpu->kvm->arch.time_offset; break; + case KVM_REG_LOONGARCH_DEBUG_INST: + *v = INSN_HVCL + KVM_HCALL_SWDBG; + break; default: ret = -EINVAL; break; diff --git a/arch/loongarch/kvm/vm.c b/arch/loongarch/kvm/vm.c index 6006a28653ad..06fd746b03b6 100644 --- a/arch/loongarch/kvm/vm.c +++ b/arch/loongarch/kvm/vm.c @@ -77,6 +77,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) case KVM_CAP_IMMEDIATE_EXIT: case KVM_CAP_IOEVENTFD: case KVM_CAP_MP_STATE: + case KVM_CAP_SET_GUEST_DEBUG: r = 1; break; case KVM_CAP_NR_VCPUS: base-commit: 0a7b0acecea273c8816f4f5b0e189989470404cf -- 2.39.3
Intel MKTME repurposes several high bits of physical address as 'keyID', so boot_cpu_data.x86_phys_bits doesn't hold physical address bits reported by CPUID anymore. If guest.MAXPHYADDR < host.MAXPHYADDR, the bit field of ‘keyID’ belongs to reserved bits in guest’s view, so intercepting #PF to fix error code is necessary, just replace boot_cpu_data.x86_phys_bits with kvm_get_shadow_phys_bits() to fix. Signed-off-by: Tao Su <tao1.su@linux.intel.com> --- arch/x86/kvm/vmx/vmx.h | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/arch/x86/kvm/vmx/vmx.h b/arch/x86/kvm/vmx/vmx.h index 65786dbe7d60..79b1757df74a 100644 --- a/arch/x86/kvm/vmx/vmx.h +++ b/arch/x86/kvm/vmx/vmx.h @@ -15,6 +15,7 @@ #include "vmx_ops.h" #include "../cpuid.h" #include "run_flags.h" +#include "../mmu.h" #define MSR_TYPE_R 1 #define MSR_TYPE_W 2 @@ -719,7 +720,8 @@ static inline bool vmx_need_pf_intercept(struct kvm_vcpu *vcpu) if (!enable_ept) return true; - return allow_smaller_maxphyaddr && cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits; + return allow_smaller_maxphyaddr && + cpuid_maxphyaddr(vcpu) < kvm_get_shadow_phys_bits(); } static inline bool is_unrestricted_guest(struct kvm_vcpu *vcpu) base-commit: b3603fcb79b1036acae10602bffc4855a4b9af80 -- 2.34.1
On Wed, 2024-03-13 at 10:14 -0700, Isaku Yamahata wrote:
> > IMO, an enum will be clearer than the two flags.
> >
> > enum {
> > PROCESS_PRIVATE_AND_SHARED,
> > PROCESS_ONLY_PRIVATE,
> > PROCESS_ONLY_SHARED,
> > };
>
> The code will be ugly like
> "if (== PRIVATE || == PRIVATE_AND_SHARED)" or
> "if (== SHARED || == PRIVATE_AND_SHARED)"
>
> two boolean (or two flags) is less error-prone.
Yes the enum would be awkward to handle. But I also thought the way
this is specified in struct kvm_gfn_range is a little strange.
It is ambiguous what it should mean if you set:
.only_private=true;
.only_shared=true;
...as happens later in the series (although it may be a mistake).
Reading the original conversation, it seems Sean suggested this
specifically. But it wasn't clear to me from the discussion what the
intention of the "only" semantics was. Like why not?
bool private;
bool shared;
On 2/29/2024 14:36, Xiaoyao Li wrote:> KVM provides TDX capabilities via sub command KVM_TDX_CAPABILITIES of > IOCTL(KVM_MEMORY_ENCRYPT_OP). Get the capabilities when initializing > TDX context. It will be used to validate user's setting later. > > Since there is no interface reporting how many cpuid configs contains in > KVM_TDX_CAPABILITIES, QEMU chooses to try starting with a known number > and abort when it exceeds KVM_MAX_CPUID_ENTRIES. > > Besides, introduce the interfaces to invoke TDX "ioctls" at different > scope (KVM, VM and VCPU) in preparation. tdx_platform_ioctl() is dropped because no user so suggest rephrasing this statement because no KVM scope ioctl interface is introduced in this patch. > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> > --- > Changes in v4: > - use {} to initialize struct kvm_tdx_cmd, to avoid memset(); > - remove tdx_platform_ioctl() because no user; > > Changes in v3: > - rename __tdx_ioctl() to tdx_ioctl_internal() > - Pass errp in get_tdx_capabilities(); > > changes in v2: > - Make the error message more clear; > > changes in v1: > - start from nr_cpuid_configs = 6 for the loop; > - stop the loop when nr_cpuid_configs exceeds KVM_MAX_CPUID_ENTRIES; > --- > target/i386/kvm/kvm.c | 2 - > target/i386/kvm/kvm_i386.h | 2 + > target/i386/kvm/tdx.c | 91 +++++++++++++++++++++++++++++++++++++- > 3 files changed, 92 insertions(+), 3 deletions(-) > > diff --git a/target/i386/kvm/kvm.c b/target/i386/kvm/kvm.c > index 52d99d30bdc8..0e68e80f4291 100644 > --- a/target/i386/kvm/kvm.c > +++ b/target/i386/kvm/kvm.c > @@ -1685,8 +1685,6 @@ static int hyperv_init_vcpu(X86CPU *cpu) > > static Error *invtsc_mig_blocker; > > -#define KVM_MAX_CPUID_ENTRIES 100 > - > static void kvm_init_xsave(CPUX86State *env) > { > if (has_xsave2) { > diff --git a/target/i386/kvm/kvm_i386.h b/target/i386/kvm/kvm_i386.h > index 55fb25fa8e2e..c3ef46a97a7b 100644 > --- a/target/i386/kvm/kvm_i386.h > +++ b/target/i386/kvm/kvm_i386.h > @@ -13,6 +13,8 @@ > > #include "sysemu/kvm.h" > > +#define KVM_MAX_CPUID_ENTRIES 100 > + > #ifdef CONFIG_KVM > > #define kvm_pit_in_kernel() \ > diff --git a/target/i386/kvm/tdx.c b/target/i386/kvm/tdx.c > index d9a1dd46dc69..2b956450a083 100644 > --- a/target/i386/kvm/tdx.c > +++ b/target/i386/kvm/tdx.c > @@ -12,18 +12,107 @@ > */ > > #include "qemu/osdep.h" > +#include "qemu/error-report.h" > +#include "qapi/error.h" > #include "qom/object_interfaces.h" > +#include "sysemu/kvm.h" > > #include "hw/i386/x86.h" > +#include "kvm_i386.h" > #include "tdx.h" > > +static struct kvm_tdx_capabilities *tdx_caps; > + > +enum tdx_ioctl_level{ > + TDX_VM_IOCTL, > + TDX_VCPU_IOCTL, > +}; > + > +static int tdx_ioctl_internal(void *state, enum tdx_ioctl_level level, int cmd_id, > + __u32 flags, void *data) > +{ > + struct kvm_tdx_cmd tdx_cmd = {}; > + int r; > + > + tdx_cmd.id = cmd_id; > + tdx_cmd.flags = flags; > + tdx_cmd.data = (__u64)(unsigned long)data; > + > + switch (level) { > + case TDX_VM_IOCTL: > + r = kvm_vm_ioctl(kvm_state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd); > + break; > + case TDX_VCPU_IOCTL: > + r = kvm_vcpu_ioctl(state, KVM_MEMORY_ENCRYPT_OP, &tdx_cmd); > + break; > + default: > + error_report("Invalid tdx_ioctl_level %d", level); > + exit(1); > + } > + > + return r; > +} > + > +static inline int tdx_vm_ioctl(int cmd_id, __u32 flags, void *data) > +{ > + return tdx_ioctl_internal(NULL, TDX_VM_IOCTL, cmd_id, flags, data); > +} > + > +static inline int tdx_vcpu_ioctl(void *vcpu_fd, int cmd_id, __u32 flags, > + void *data) > +{ > + return tdx_ioctl_internal(vcpu_fd, TDX_VCPU_IOCTL, cmd_id, flags, data); > +} > + > +static int get_tdx_capabilities(Error **errp) > +{ > + struct kvm_tdx_capabilities *caps; > + /* 1st generation of TDX reports 6 cpuid configs */ > + int nr_cpuid_configs = 6; > + size_t size; > + int r; > + > + do { > + size = sizeof(struct kvm_tdx_capabilities) + > + nr_cpuid_configs * sizeof(struct kvm_tdx_cpuid_config); > + caps = g_malloc0(size); > + caps->nr_cpuid_configs = nr_cpuid_configs; > + > + r = tdx_vm_ioctl(KVM_TDX_CAPABILITIES, 0, caps); > + if (r == -E2BIG) { > + g_free(caps); > + nr_cpuid_configs *= 2; > + if (nr_cpuid_configs > KVM_MAX_CPUID_ENTRIES) { > + error_setg(errp, "%s: KVM TDX seems broken that number of CPUID " > + "entries in kvm_tdx_capabilities exceeds limit %d", > + __func__, KVM_MAX_CPUID_ENTRIES); > + return r; > + } > + } else if (r < 0) { > + g_free(caps); > + error_setg_errno(errp, -r, "%s: KVM_TDX_CAPABILITIES failed", __func__); > + return r; > + } > + } > + while (r == -E2BIG); > + > + tdx_caps = caps; > + > + return 0; > +} > + > static int tdx_kvm_init(ConfidentialGuestSupport *cgs, Error **errp) > { > MachineState *ms = MACHINE(qdev_get_machine()); > + int r = 0; > > ms->require_guest_memfd = true; > > - return 0; > + if (!tdx_caps) { > + r = get_tdx_capabilities(errp); > + } > + > + return r; > } > > /* tdx guest */
On Mon, Mar 18, 2024 at 07:33:37PM -0400, Paolo Bonzini wrote:
>
> The idea is that SEV SNP will only ever support KVM_SEV_INIT2. I have
> placed patches for QEMU to support this new API at branch sevinit2 of
> https://gitlab.com/bonzini/qemu.
I don't see any references to KVM_SEV_INIT2 in the sevinit2 branch. Has
everything been pushed already?
-Mike
On 2/29/2024 14:36, Xiaoyao Li wrote: > From: Chao Peng <chao.p.peng@linux.intel.com> > > When geeting KVM_EXIT_MEMORY_FAULT exit, it indicates userspace needs to > do the memory conversion on the RAMBlock to turn the memory into desired > attribute, i.e., private/shared. > > Currently only KVM_MEMORY_EXIT_FLAG_PRIVATE in flags is valid when > KVM_EXIT_MEMORY_FAULT happens. > > Note, KVM_EXIT_MEMORY_FAULT makes sense only when the RAMBlock has > guest_memfd memory backend. > > Note, KVM_EXIT_MEMORY_FAULT returns with -EFAULT, so special handling is > added. > > When page is converted from shared to private, the original shared > memory can be discarded via ram_block_discard_range(). Note, shared > memory can be discarded only when it's not back'ed by hugetlb because > hugetlb is supposed to be pre-allocated and no need for discarding. > > Signed-off-by: Chao Peng <chao.p.peng@linux.intel.com> > Co-developed-by: Xiaoyao Li <xiaoyao.li@intel.com> > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> > > --- > Changes in v4: > - open-coded ram_block_discard logic; > - change warn_report() to error_report(); (Daniel) > --- > accel/kvm/kvm-all.c | 94 ++++++++++++++++++++++++++++++++++++++++----- > 1 file changed, 84 insertions(+), 10 deletions(-) > > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c > index 70d482a2c936..87e4275932a7 100644 > --- a/accel/kvm/kvm-all.c > +++ b/accel/kvm/kvm-all.c > @@ -2903,6 +2903,68 @@ static void kvm_eat_signals(CPUState *cpu) > } while (sigismember(&chkset, SIG_IPI)); > } > > +static int kvm_convert_memory(hwaddr start, hwaddr size, bool to_private) > +{ > + MemoryRegionSection section; > + ram_addr_t offset; > + MemoryRegion *mr; > + RAMBlock *rb; > + void *addr; > + int ret = -1; > + > + if (!QEMU_PTR_IS_ALIGNED(start, qemu_host_page_size) || > + !QEMU_PTR_IS_ALIGNED(size, qemu_host_page_size)) { > + return -1; > + } > + > + if (!size) { > + return -1; > + } > + > + section = memory_region_find(get_system_memory(), start, size); > + mr = section.mr; > + if (!mr) { > + return -1; > + } > + > + if (memory_region_has_guest_memfd(mr)) { > + if (to_private) { > + ret = kvm_set_memory_attributes_private(start, size); > + } else { > + ret = kvm_set_memory_attributes_shared(start, size); > + } > + > + if (ret) { > + memory_region_unref(section.mr); > + return ret; > + } > + > + addr = memory_region_get_ram_ptr(mr) + section.offset_within_region; > + rb = qemu_ram_block_from_host(addr, false, &offset); > + > + if (to_private) { > + if (rb->page_size != qemu_host_page_size) { > + /* > + * shared memory is back'ed by hugetlb, which is supposed to be > + * pre-allocated and doesn't need to be discarded > + */ Nit: comment indentation is broken here. > + return 0; > + } else { > + ret = ram_block_discard_range(rb, offset, size); > + } > + } else { > + ret = ram_block_discard_guest_memfd_range(rb, offset, size); > + } > + } else { > + error_report("Convert non guest_memfd backed memory region " > + "(0x%"HWADDR_PRIx" ,+ 0x%"HWADDR_PRIx") to %s", Same as above. > + start, size, to_private ? "private" : "shared"); > + } > + > + memory_region_unref(section.mr); > + return ret; > +} > + > int kvm_cpu_exec(CPUState *cpu) > { > struct kvm_run *run = cpu->kvm_run; > @@ -2970,18 +3032,20 @@ int kvm_cpu_exec(CPUState *cpu) > ret = EXCP_INTERRUPT; > break; > } > - fprintf(stderr, "error: kvm run failed %s\n", > - strerror(-run_ret)); > + if (!(run_ret == -EFAULT && run->exit_reason == KVM_EXIT_MEMORY_FAULT)) { > + fprintf(stderr, "error: kvm run failed %s\n", > + strerror(-run_ret)); > #ifdef TARGET_PPC > - if (run_ret == -EBUSY) { > - fprintf(stderr, > - "This is probably because your SMT is enabled.\n" > - "VCPU can only run on primary threads with all " > - "secondary threads offline.\n"); > - } > + if (run_ret == -EBUSY) { > + fprintf(stderr, > + "This is probably because your SMT is enabled.\n" > + "VCPU can only run on primary threads with all " > + "secondary threads offline.\n"); > + } > #endif > - ret = -1; > - break; > + ret = -1; > + break; > + } > } > > trace_kvm_run_exit(cpu->cpu_index, run->exit_reason); > @@ -3064,6 +3128,16 @@ int kvm_cpu_exec(CPUState *cpu) > break; > } > break; > + case KVM_EXIT_MEMORY_FAULT: > + if (run->memory_fault.flags & ~KVM_MEMORY_EXIT_FLAG_PRIVATE) { > + error_report("KVM_EXIT_MEMORY_FAULT: Unknown flag 0x%" PRIx64, > + (uint64_t)run->memory_fault.flags); > + ret = -1; > + break; > + } > + ret = kvm_convert_memory(run->memory_fault.gpa, run->memory_fault.size, > + run->memory_fault.flags & KVM_MEMORY_EXIT_FLAG_PRIVATE); > + break; > default: > ret = kvm_arch_handle_exit(cpu, run); > break;
On 2/29/2024 14:36, Xiaoyao Li wrote:> Introduce the helper functions to set the attributes of a range of > memory to private or shared. > > This is necessary to notify KVM the private/shared attribute of each gpa > range. KVM needs the information to decide the GPA needs to be mapped at > hva-based shared memory or guest_memfd based private memory. > > Signed-off-by: Xiaoyao Li <xiaoyao.li@intel.com> > --- > Changes in v4: > - move the check of kvm_supported_memory_attributes to the common > kvm_set_memory_attributes(); (Wang Wei) > - change warn_report() to error_report() in kvm_set_memory_attributes() > and drop the __func__; (Daniel) > --- > accel/kvm/kvm-all.c | 44 ++++++++++++++++++++++++++++++++++++++++++++ > include/sysemu/kvm.h | 3 +++ > 2 files changed, 47 insertions(+) > > diff --git a/accel/kvm/kvm-all.c b/accel/kvm/kvm-all.c > index cd0aa7545a1f..70d482a2c936 100644 > --- a/accel/kvm/kvm-all.c > +++ b/accel/kvm/kvm-all.c > @@ -92,6 +92,7 @@ static bool kvm_has_guest_debug; > static int kvm_sstep_flags; > static bool kvm_immediate_exit; > static bool kvm_guest_memfd_supported; > +static uint64_t kvm_supported_memory_attributes; > static hwaddr kvm_max_slot_size = ~0; > > static const KVMCapabilityInfo kvm_required_capabilites[] = { > @@ -1304,6 +1305,46 @@ void kvm_set_max_memslot_size(hwaddr max_slot_size) > kvm_max_slot_size = max_slot_size; > } > > +static int kvm_set_memory_attributes(hwaddr start, hwaddr size, uint64_t attr) > +{ > + struct kvm_memory_attributes attrs; > + int r; > + > + if (kvm_supported_memory_attributes == 0) { > + error_report("No memory attribute supported by KVM\n"); > + return -EINVAL; > + } > + > + if ((attr & kvm_supported_memory_attributes) != attr) { > + error_report("memory attribute 0x%lx not supported by KVM," > + " supported bits are 0x%lx\n", > + attr, kvm_supported_memory_attributes); > + return -EINVAL; > + } > + > + attrs.attributes = attr; > + attrs.address = start; > + attrs.size = size; > + attrs.flags = 0; > + > + r = kvm_vm_ioctl(kvm_state, KVM_SET_MEMORY_ATTRIBUTES, &attrs); > + if (r) { > + error_report("failed to set memory (0x%lx+%#zx) with attr 0x%lx error '%s'", > + start, size, attr, strerror(errno)); > + } > + return r; > +} > + > +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size) > +{ > + return kvm_set_memory_attributes(start, size, KVM_MEMORY_ATTRIBUTE_PRIVATE); > +} > + > +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size) > +{ > + return kvm_set_memory_attributes(start, size, 0); > +} > + > /* Called with KVMMemoryListener.slots_lock held */ > static void kvm_set_phys_mem(KVMMemoryListener *kml, > MemoryRegionSection *section, bool add) > @@ -2439,6 +2480,9 @@ static int kvm_init(MachineState *ms) > > kvm_guest_memfd_supported = kvm_check_extension(s, KVM_CAP_GUEST_MEMFD); > > + ret = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES); > + kvm_supported_memory_attributes = ret > 0 ? ret : 0; kvm_check_extension() only returns non-negative value, so we can just kvm_supported_memory_attributes = kvm_check_extension(s, KVM_CAP_MEMORY_ATTRIBUTES); > + > if (object_property_find(OBJECT(current_machine), "kvm-type")) { > g_autofree char *kvm_type = object_property_get_str(OBJECT(current_machine), > "kvm-type", > diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h > index 6cdf82de8372..8e83adfbbd19 100644 > --- a/include/sysemu/kvm.h > +++ b/include/sysemu/kvm.h > @@ -546,4 +546,7 @@ uint32_t kvm_dirty_ring_size(void); > bool kvm_hwpoisoned_mem(void); > > int kvm_create_guest_memfd(uint64_t size, uint64_t flags, Error **errp); > + > +int kvm_set_memory_attributes_private(hwaddr start, hwaddr size); > +int kvm_set_memory_attributes_shared(hwaddr start, hwaddr size); > #endif
On Mon, Mar 18, 2024, Vishal Annapurve wrote: > On Mon, Mar 18, 2024 at 3:02 PM David Hildenbrand <david@redhat.com> wrote: > > Second, we should find better ways to let an IOMMU map these pages, > > *not* using GUP. There were already discussions on providing a similar > > fd+offset-style interface instead. GUP really sounds like the wrong > > approach here. Maybe we should look into passing not only guest_memfd, > > but also "ordinary" memfds. +1. I am not completely opposed to letting SNP and TDX effectively convert pages between private and shared, but I also completely agree that letting anything gup() guest_memfd memory is likely to end in tears. > I need to dig into past discussions around this, but agree that > passing guest memfds to VFIO drivers in addition to HVAs seems worth > exploring. This may be required anyways for devices supporting TDX > connect [1]. > > If we are talking about the same file catering to both private and > shared memory, there has to be some way to keep track of references on > the shared memory from both host userspace and IOMMU. > > > > > Third, I don't think we should be using huge pages where huge pages > > don't make any sense. Using a 1 GiB page so the VM will convert some > > pieces to map it using PTEs will destroy the whole purpose of using 1 > > GiB pages. It doesn't make any sense. I don't disagree, but the fundamental problem is that we have no guarantees as to what that guest will or will not do. We can certainly make very educated guesses, and probably be right 99.99% of the time, but being wrong 0.01% of the time probably means a lot of broken VMs, and a lot of unhappy customers. > I had started a discussion for this [2] using an RFC series. David is talking about the host side of things, AFAICT you're talking about the guest side... > challenge here remain: > 1) Unifying all the conversions under one layer > 2) Ensuring shared memory allocations are huge page aligned at boot > time and runtime. > > Using any kind of unified shared memory allocator (today this part is > played by SWIOTLB) will need to support huge page aligned dynamic > increments, which can be only guaranteed by carving out enough memory > at boot time for CMA area and using CMA area for allocation at > runtime. > - Since it's hard to come up with a maximum amount of shared memory > needed by VM, especially with GPUs/TPUs around, it's difficult to come > up with CMA area size at boot time. ...which is very relevant as carving out memory in the guest is nigh impossible, but carving out memory in the host for systems whose sole purpose is to run VMs is very doable. > I think it's arguable that even if a VM converts 10 % of its memory to > shared using 4k granularity, we still have fewer page table walks on > the rest of the memory when using 1G/2M pages, which is a significant > portion. Performance is a secondary concern. If this were _just_ about guest performance, I would unequivocally side with David: the guest gets to keep the pieces if it fragments a 1GiB page. The main problem we're trying to solve is that we want to provision a host such that the host can serve 1GiB pages for non-CoCo VMs, and can also simultaneously run CoCo VMs, with 100% fungibility. I.e. a host could run 100% non-CoCo VMs, 100% CoCo VMs, or more likely, some sliding mix of the two. Ideally, CoCo VMs would also get the benefits of 1GiB mappings, that's not the driving motiviation for this discussion. As HugeTLB exists today, supporting that use case isn't really feasible because there's no sane way to convert/free just a sliver of a 1GiB page (and reconstitute the 1GiB when the sliver is converted/freed back). Peeking ahead at my next comment, I don't think that solving this in the guest is a realistic option, i.e. IMO, we need to figure out a way to handle this in the host, without relying on the guest to cooperate. Luckily, we haven't added hugepage support of any kind to guest_memfd, i.e. we have a fairly blank slate to work with. The other big advantage that we should lean into is that we can make assumptions about guest_memfd usage that would never fly for a general purpose backing stores, e.g. creating a dedicated memory pool for guest_memfd is acceptable, if not desirable, for (almost?) all of the CoCo use cases. I don't have any concrete ideas at this time, but my gut feeling is that this won't be _that_ crazy hard to solve if commit hard to guest_memfd _not_ being general purposes, and if we we account for conversion scenarios when designing hugepage support for guest_memfd. > > For example, one could create a GPA layout where some regions are backed > > by gigantic pages that cannot be converted/can only be converted as a > > whole, and some are backed by 4k pages that can be converted back and > > forth. We'd use multiple guest_memfds for that. I recall that physically > > restricting such conversions/locations (e.g., for bounce buffers) in > > Linux was already discussed somewhere, but I don't recall the details. > > > > It's all not trivial and not easy to get "clean". > > Yeah, agree with this point, it's difficult to get a clean solution > here, but the host side solution might be easier to deploy (not > necessarily easier to implement) and possibly cleaner than attempts to > regulate the guest side. I think we missed the opportunity to regulate the guest side by several years. To be able to rely on such a scheme, e.g. to deploy at scale and not DoS customer VMs, KVM would need to be able to _enforce_ the scheme. And while I am more than willing to put my foot down on things where the guest is being blatantly ridiculous, wanting to convert an arbitrary 4KiB chunk of memory between private and shared isn't ridiculous (likely inefficient, but not ridiculous). I.e. I'm not willing to have KVM refuse conversions that are legal according to the SNP and TDX specs (and presumably the CCA spec, too). That's why I think we're years too late; this sort of restriction needs to go in the "hardware" spec, and that ship has sailed.
On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote: > From: Sean Christopherson <sean.j.christopherson@intel.com> > > TDX supports only write-back(WB) memory type for private memory > architecturally so that (virtualized) memory type change doesn't make > sense > for private memory. Also currently, page migration isn't supported > for TDX > yet. (TDX architecturally supports page migration. it's KVM and > kernel > implementation issue.) > > Regarding memory type change (mtrr virtualization and lapic page > mapping > change), pages are zapped by kvm_zap_gfn_range(). On the next KVM > page > fault, the SPTE entry with a new memory type for the page is > populated. > Regarding page migration, pages are zapped by the mmu notifier. On > the next > KVM page fault, the new migrated page is populated. Don't zap > private > pages on unmapping for those two cases. Is the migration case relevant to TDX? > > When deleting/moving a KVM memory slot, zap private pages. Typically > tearing down VM. Don't invalidate private page tables. i.e. zap only > leaf > SPTEs for KVM mmu that has a shared bit mask. The existing > kvm_tdp_mmu_invalidate_all_roots() depends on role.invalid with read- > lock > of mmu_lock so that other vcpu can operate on KVM mmu concurrently. > It > marks the root page table invalid and zaps SPTEs of the root page > tables. The TDX module doesn't allow to unlink a protected root page > table > from the hardware and then allocate a new one for it. i.e. replacing > a > protected root page table. Instead, zap only leaf SPTEs for KVM mmu > with a > shared bit mask set. I get the part about only zapping leafs and not the root and mid-level PTEs. But why the MTRR, lapic page and migration part? Why should those not be zapped? Why is migration a consideration when it is not supported? > > Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> > Signed-off-by: Isaku Yamahata <isaku.yamahata@intel.com> > --- > arch/x86/kvm/mmu/mmu.c | 61 > ++++++++++++++++++++++++++++++++++++-- > arch/x86/kvm/mmu/tdp_mmu.c | 37 +++++++++++++++++++---- > arch/x86/kvm/mmu/tdp_mmu.h | 5 ++-- > 3 files changed, 92 insertions(+), 11 deletions(-) > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c > index 0d6d4506ec97..30c86e858ae4 100644 > --- a/arch/x86/kvm/mmu/mmu.c > +++ b/arch/x86/kvm/mmu/mmu.c > @@ -6339,7 +6339,7 @@ static void kvm_mmu_zap_all_fast(struct kvm > *kvm) > * e.g. before kvm_zap_obsolete_pages() could drop mmu_lock > and yield. > */ > if (tdp_mmu_enabled) > - kvm_tdp_mmu_invalidate_all_roots(kvm); > + kvm_tdp_mmu_invalidate_all_roots(kvm, true); > > /* > * Notify all vcpus to reload its shadow page table and flush > TLB. > @@ -6459,7 +6459,16 @@ void kvm_zap_gfn_range(struct kvm *kvm, gfn_t > gfn_start, gfn_t gfn_end) > flush = kvm_rmap_zap_gfn_range(kvm, gfn_start, gfn_end); > > if (tdp_mmu_enabled) > - flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, > gfn_end, flush); > + /* > + * zap_private = false. Zap only shared pages. > + * > + * kvm_zap_gfn_range() is used when MTRR or PAT > memory > + * type was changed. Later on the next kvm page > fault, > + * populate it with updated spte entry. > + * Because only WB is supported for private pages, > don't > + * care of private pages. > + */ > + flush = kvm_tdp_mmu_zap_leafs(kvm, gfn_start, > gfn_end, flush, false); > > if (flush) > kvm_flush_remote_tlbs_range(kvm, gfn_start, gfn_end - > gfn_start); > @@ -6905,10 +6914,56 @@ void kvm_arch_flush_shadow_all(struct kvm > *kvm) > kvm_mmu_zap_all(kvm); > } > > +static void kvm_mmu_zap_memslot(struct kvm *kvm, struct > kvm_memory_slot *slot) What about kvm_mmu_zap_memslot_leafs() instead? > +{ > + bool flush = false; It doesn't need to be initialized if it passes false directly into kvm_tdp_mmu_unmap_gfn_range(). It would make the code easier to understand. > + > + write_lock(&kvm->mmu_lock); > + > + /* > + * Zapping non-leaf SPTEs, a.k.a. not-last SPTEs, isn't > required, worst > + * case scenario we'll have unused shadow pages lying around > until they > + * are recycled due to age or when the VM is destroyed. > + */ > + if (tdp_mmu_enabled) { > + struct kvm_gfn_range range = { > + .slot = slot, > + .start = slot->base_gfn, > + .end = slot->base_gfn + slot->npages, > + .may_block = true, > + > + /* > + * This handles both private gfn and shared > gfn. > + * All private page should be zapped on memslot > deletion. > + */ > + .only_private = true, > + .only_shared = true, only_private and only_shared are both true? Shouldn't they both be false? (or just unset) > + }; > + > + flush = kvm_tdp_mmu_unmap_gfn_range(kvm, &range, > flush); > + } else { > + /* TDX supports only TDP-MMU case. */ > + WARN_ON_ONCE(1); How about a KVM_BUG_ON() instead? If somehow this is reached, we don't want the caller thinking the pages are zapped, then enter the guest with pages mapped that have gone elsewhere. > + flush = true; Why flush? > + } > + if (flush) > + kvm_flush_remote_tlbs(kvm); > + > + write_unlock(&kvm->mmu_lock); > +} > + > void kvm_arch_flush_shadow_memslot(struct kvm *kvm, > struct kvm_memory_slot *slot) > { > - kvm_mmu_zap_all_fast(kvm); > + if (kvm_gfn_shared_mask(kvm)) There seems to be an attempt to abstract away the existence of Secure- EPT in mmu.c, that is not fully successful. In this case the code checks kvm_gfn_shared_mask() to see if it needs to handle the zapping in a way specific needed by S-EPT. It ends up being a little confusing because the actual check is about whether there is a shared bit. It only works because only S-EPT is the only thing that has a kvm_gfn_shared_mask(). Doing something like (kvm->arch.vm_type == KVM_X86_TDX_VM) looks wrong, but is more honest about what we are getting up to here. I'm not sure though, what do you think? > + /* > + * Secure-EPT requires to release PTs from the leaf. > The > + * optimization to zap root PT first with child PT > doesn't > + * work. > + */ > + kvm_mmu_zap_memslot(kvm, slot); > + else > + kvm_mmu_zap_all_fast(kvm); > } > > void kvm_mmu_invalidate_mmio_sptes(struct kvm *kvm, u64 gen) > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c > index d47f0daf1b03..e7514a807134 100644 > --- a/arch/x86/kvm/mmu/tdp_mmu.c > +++ b/arch/x86/kvm/mmu/tdp_mmu.c > @@ -37,7 +37,7 @@ void kvm_mmu_uninit_tdp_mmu(struct kvm *kvm) > * for zapping and thus puts the TDP MMU's reference to each > root, i.e. > * ultimately frees all roots. > */ > - kvm_tdp_mmu_invalidate_all_roots(kvm); > + kvm_tdp_mmu_invalidate_all_roots(kvm, false); > kvm_tdp_mmu_zap_invalidated_roots(kvm); > > WARN_ON(atomic64_read(&kvm->arch.tdp_mmu_pages)); > @@ -771,7 +771,8 @@ bool kvm_tdp_mmu_zap_sp(struct kvm *kvm, struct > kvm_mmu_page *sp) > * operation can cause a soft lockup. > */ > static bool tdp_mmu_zap_leafs(struct kvm *kvm, struct kvm_mmu_page > *root, > - gfn_t start, gfn_t end, bool can_yield, > bool flush) > + gfn_t start, gfn_t end, bool can_yield, > bool flush, > + bool zap_private) > { > struct tdp_iter iter; > > @@ -779,6 +780,10 @@ static bool tdp_mmu_zap_leafs(struct kvm *kvm, > struct kvm_mmu_page *root, > > lockdep_assert_held_write(&kvm->mmu_lock); > > + WARN_ON_ONCE(zap_private && !is_private_sp(root)); All the callers have zap_private as zap_private && is_private_sp(root). What badness is it trying to uncover? > + if (!zap_private && is_private_sp(root)) > + return false; > +
On Mon, Mar 18, 2024 at 09:01:05PM +0000,
"Edgecombe, Rick P" <rick.p.edgecombe@intel.com> wrote:
> On Mon, 2024-02-26 at 00:26 -0800, isaku.yamahata@intel.com wrote:
> > +fastpath_t tdx_vcpu_run(struct kvm_vcpu *vcpu)
> > +{
> > + struct vcpu_tdx *tdx = to_tdx(vcpu);
> > +
> > + if (unlikely(!tdx->initialized))
> > + return -EINVAL;
> > + if (unlikely(vcpu->kvm->vm_bugged)) {
> > + tdx->exit_reason.full = TDX_NON_RECOVERABLE_VCPU;
> > + return EXIT_FASTPATH_NONE;
> > + }
> > +
>
> Isaku, can you elaborate on why this needs special handling? There is a
> check in vcpu_enter_guest() like:
> if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
> r = -EIO;
> goto out;
> }
>
> Instead it returns a SEAM error code for something actuated by KVM. But
> can it even be reached because of the other check? Not sure if there is
> a problem, just sticks out to me and wondering whats going on.
The original intention is to get out the inner loop. As Sean pointed it out,
the current code does poor job to check error of
__seamcall_saved_ret(TDH_VP_ENTER). So it fails to call KVM_BUG_ON() when it
returns unexcepted error.
The right fix is to properly check an error from TDH_VP_ENTER and call
KVM_BUG_ON(). Then the check you pointed out should go away.
for // out loop
if (kvm_check_request(KVM_REQ_VM_DEAD, vcpu)) {
for // inner loop
vcpu_run()
kvm_vcpu_exit_request(vcpu).
--
Isaku Yamahata <isaku.yamahata@intel.com>
[Dave: there is a small arch/x86/kernel/fpu change in patch 9; I am CCing you in the cover letter just for context. - Paolo] The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. The first source of variability that was encountered is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. This series adds all the APIs that are needed to customize the features, with room for future enhancements: - a new /dev/kvm device attribute to retrieve the set of supported features (right now, only debug swap) - a new sub-operation for KVM_MEM_ENCRYPT_OP that can take a struct, replacing the existing KVM_SEV_INIT and KVM_SEV_ES_INIT It then puts the new op to work by including the VMSA features as a field of the The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility; but I am considering also making them use zero as the feature mask, and will gladly adjust the patches if so requested. In order to avoid creating *two* new KVM_MEM_ENCRYPT_OPs, I decided that I could as well make SEV and SEV-ES use VM types. And then, why not make a SEV-ES VM, when created with the new VM type instead of KVM_SEV_ES_INIT, reject KVM_GET_REGS/KVM_SET_REGS and friends on the vCPU file descriptor once the VMSA has been encrypted... Which is how the API should have always behaved. Note: despite having the same number of patches as v3, #9 and #15 are new! The series is structured as follows: - patches 1 and 2 change sev.c so that it is compiled only if SEV is enabled in kconfig - patches 3 to 6 introduce the new device attribute to retrieve supported VMSA features - patches 7 and 8 introduce new infrastructure for VM types, replacing the similar code in the TDX patches - patch 9 allows setting the FPU and AVX state prior to encryption of the VMSA - patches 10 to 12 introduce the new VM types for SEV and SEV-ES, and KVM_SEV_INIT2 as a new sub-operation for KVM_MEM_ENCRYPT_OP. - patch 13 reenables DebugSwap, now that there is an API that allows doing so without breaking backwards compatibility - patches 14 and 15 test the new ioctl. The idea is that SEV SNP will only ever support KVM_SEV_INIT2. I have placed patches for QEMU to support this new API at branch sevinit2 of https://gitlab.com/bonzini/qemu. I haven't fully tested patch 9 and it really deserves a selftest, it is a bit tricky though without ucall infrastructure for SEV. I will look at it tomorrow. The series is at branch kvm-coco-queue of kvm.git, and I would like to include it in kvm/next as soon as possible after the release of -rc1. Thanks, Paolo v3->v4: - moved patches 1 and 5 to separate "fixes" series for 6.9 - do not conditionalize prototypes for functions that are called by common SVM code - rebased on top of SEV selftest infrastructure from 6.9 merge window; include new patch to drop the "subtype" concept - rebased on top of SEV-SNP patches from 6.9 merge window - rebased on top of patch to disable DebugSwap from 6.8 rc; drop "warn" once SEV_ES_INIT stops enabling VMSA features and finally re-enable DebugSwap - simplified logic to return -EINVAL from ioctls - also block KVM_GET/SET_FPU for protected-state guests - move logic to set kvm_>arch.has_protected_state to svm_vm_init - fix "struct struct" in documentation Paolo Bonzini (14): KVM: SVM: Compile sev.c if and only if CONFIG_KVM_AMD_SEV=y KVM: x86: use u64_to_user_addr() KVM: introduce new vendor op for KVM_GET_DEVICE_ATTR KVM: SEV: publish supported VMSA features KVM: SEV: store VMSA features in kvm_sev_info KVM: x86: add fields to struct kvm_arch for CoCo features KVM: x86: Add supported_vm_types to kvm_caps KVM: SEV: sync FPU and AVX state at LAUNCH_UPDATE_VMSA time KVM: SEV: introduce to_kvm_sev_info KVM: SEV: define VM types for SEV and SEV-ES KVM: SEV: introduce KVM_SEV_INIT2 operation KVM: SEV: allow SEV-ES DebugSwap again selftests: kvm: add tests for KVM_SEV_INIT2 selftests: kvm: switch to using KVM_X86_*_VM Sean Christopherson (1): KVM: SVM: Invert handling of SEV and SEV_ES feature flags Documentation/virt/kvm/api.rst | 2 + .../virt/kvm/x86/amd-memory-encryption.rst | 52 +++++- arch/x86/include/asm/fpu/api.h | 3 + arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h | 8 +- arch/x86/include/uapi/asm/kvm.h | 12 ++ arch/x86/kernel/fpu/xstate.h | 2 - arch/x86/kvm/Makefile | 7 +- arch/x86/kvm/cpuid.c | 2 +- arch/x86/kvm/svm/sev.c | 172 ++++++++++++++---- arch/x86/kvm/svm/svm.c | 27 ++- arch/x86/kvm/svm/svm.h | 43 ++++- arch/x86/kvm/x86.c | 170 +++++++++++------ arch/x86/kvm/x86.h | 2 + tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/include/kvm_util_base.h | 11 +- .../selftests/kvm/include/x86_64/processor.h | 6 - .../selftests/kvm/include/x86_64/sev.h | 16 +- tools/testing/selftests/kvm/lib/kvm_util.c | 1 - .../selftests/kvm/lib/x86_64/processor.c | 14 +- tools/testing/selftests/kvm/lib/x86_64/sev.c | 30 ++- .../selftests/kvm/set_memory_region_test.c | 8 +- .../selftests/kvm/x86_64/sev_init2_tests.c | 149 +++++++++++++++ 23 files changed, 569 insertions(+), 170 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/sev_init2_tests.c -- 2.43.0
There is no danger to the kernel if 32-bit userspace provides a 64-bit value that has the high bits set, but for whatever reason happens to resolve to an address that has something mapped there. KVM uses the checked version of get_user() and put_user(), so any faults are caught properly. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kvm/x86.c | 24 +++--------------------- 1 file changed, 3 insertions(+), 21 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 47d9f03b7778..3d2029402513 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4842,25 +4842,13 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) return r; } -static inline void __user *kvm_get_attr_addr(struct kvm_device_attr *attr) -{ - void __user *uaddr = (void __user*)(unsigned long)attr->addr; - - if ((u64)(unsigned long)uaddr != attr->addr) - return ERR_PTR_USR(-EFAULT); - return uaddr; -} - static int kvm_x86_dev_get_attr(struct kvm_device_attr *attr) { - u64 __user *uaddr = kvm_get_attr_addr(attr); + u64 __user *uaddr = u64_to_user_ptr(attr->addr); if (attr->group) return -ENXIO; - if (IS_ERR(uaddr)) - return PTR_ERR(uaddr); - switch (attr->attr) { case KVM_X86_XCOMP_GUEST_SUPP: if (put_user(kvm_caps.supported_xcr0, uaddr)) @@ -5712,12 +5700,9 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu, static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) { - u64 __user *uaddr = kvm_get_attr_addr(attr); + u64 __user *uaddr = u64_to_user_ptr(attr->addr); int r; - if (IS_ERR(uaddr)) - return PTR_ERR(uaddr); - switch (attr->attr) { case KVM_VCPU_TSC_OFFSET: r = -EFAULT; @@ -5735,13 +5720,10 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu, static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu, struct kvm_device_attr *attr) { - u64 __user *uaddr = kvm_get_attr_addr(attr); + u64 __user *uaddr = u64_to_user_ptr(attr->addr); struct kvm *kvm = vcpu->kvm; int r; - if (IS_ERR(uaddr)) - return PTR_ERR(uaddr); - switch (attr->attr) { case KVM_VCPU_TSC_OFFSET: { u64 offset, tsc, ns; -- 2.43.0
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- tools/testing/selftests/kvm/Makefile | 1 + .../selftests/kvm/include/kvm_util_base.h | 6 +- .../selftests/kvm/set_memory_region_test.c | 8 +- .../selftests/kvm/x86_64/sev_init2_tests.c | 149 ++++++++++++++++++ 4 files changed, 156 insertions(+), 8 deletions(-) create mode 100644 tools/testing/selftests/kvm/x86_64/sev_init2_tests.c diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile index 741c7dc16afc..871e2de3eb05 100644 --- a/tools/testing/selftests/kvm/Makefile +++ b/tools/testing/selftests/kvm/Makefile @@ -120,6 +120,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/tsc_msrs_test TEST_GEN_PROGS_x86_64 += x86_64/vmx_pmu_caps_test TEST_GEN_PROGS_x86_64 += x86_64/xen_shinfo_test TEST_GEN_PROGS_x86_64 += x86_64/xen_vmcall_test +TEST_GEN_PROGS_x86_64 += x86_64/sev_init2_tests TEST_GEN_PROGS_x86_64 += x86_64/sev_migrate_tests TEST_GEN_PROGS_x86_64 += x86_64/sev_smoke_test TEST_GEN_PROGS_x86_64 += x86_64/amx_test diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h index 3e0db283a46a..7c06ceb36643 100644 --- a/tools/testing/selftests/kvm/include/kvm_util_base.h +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h @@ -890,17 +890,15 @@ static inline struct kvm_vm *vm_create_barebones(void) return ____vm_create(VM_SHAPE_DEFAULT); } -#ifdef __x86_64__ -static inline struct kvm_vm *vm_create_barebones_protected_vm(void) +static inline struct kvm_vm *vm_create_barebones_type(unsigned long type) { const struct vm_shape shape = { .mode = VM_MODE_DEFAULT, - .type = KVM_X86_SW_PROTECTED_VM, + .type = type, }; return ____vm_create(shape); } -#endif static inline struct kvm_vm *vm_create(uint32_t nr_runnable_vcpus) { diff --git a/tools/testing/selftests/kvm/set_memory_region_test.c b/tools/testing/selftests/kvm/set_memory_region_test.c index 06b43ed23580..904d58793fc6 100644 --- a/tools/testing/selftests/kvm/set_memory_region_test.c +++ b/tools/testing/selftests/kvm/set_memory_region_test.c @@ -339,7 +339,7 @@ static void test_invalid_memory_region_flags(void) #ifdef __x86_64__ if (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM)) - vm = vm_create_barebones_protected_vm(); + vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM); else #endif vm = vm_create_barebones(); @@ -462,7 +462,7 @@ static void test_add_private_memory_region(void) pr_info("Testing ADD of KVM_MEM_GUEST_MEMFD memory regions\n"); - vm = vm_create_barebones_protected_vm(); + vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM); test_invalid_guest_memfd(vm, vm->kvm_fd, 0, "KVM fd should fail"); test_invalid_guest_memfd(vm, vm->fd, 0, "VM's fd should fail"); @@ -471,7 +471,7 @@ static void test_add_private_memory_region(void) test_invalid_guest_memfd(vm, memfd, 0, "Regular memfd() should fail"); close(memfd); - vm2 = vm_create_barebones_protected_vm(); + vm2 = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM); memfd = vm_create_guest_memfd(vm2, MEM_REGION_SIZE, 0); test_invalid_guest_memfd(vm, memfd, 0, "Other VM's guest_memfd() should fail"); @@ -499,7 +499,7 @@ static void test_add_overlapping_private_memory_regions(void) pr_info("Testing ADD of overlapping KVM_MEM_GUEST_MEMFD memory regions\n"); - vm = vm_create_barebones_protected_vm(); + vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM); memfd = vm_create_guest_memfd(vm, MEM_REGION_SIZE * 4, 0); diff --git a/tools/testing/selftests/kvm/x86_64/sev_init2_tests.c b/tools/testing/selftests/kvm/x86_64/sev_init2_tests.c new file mode 100644 index 000000000000..fe55aa5a1b04 --- /dev/null +++ b/tools/testing/selftests/kvm/x86_64/sev_init2_tests.c @@ -0,0 +1,149 @@ +// SPDX-License-Identifier: GPL-2.0-only +#include <linux/kvm.h> +#include <linux/psp-sev.h> +#include <stdio.h> +#include <sys/ioctl.h> +#include <stdlib.h> +#include <errno.h> +#include <pthread.h> + +#include "test_util.h" +#include "kvm_util.h" +#include "processor.h" +#include "svm_util.h" +#include "kselftest.h" + +#define SVM_SEV_FEAT_DEBUG_SWAP 32u + +/* + * Some features may have hidden dependencies, or may only work + * for certain VM types. Err on the side of safety and don't + * expect that all supported features can be passed one by one + * to KVM_SEV_INIT2. + * + * (Well, right now there's only one...) + */ +#define KNOWN_FEATURES SVM_SEV_FEAT_DEBUG_SWAP + +int kvm_fd; +u64 supported_vmsa_features; +bool have_sev_es; + +static int __sev_ioctl(int vm_fd, int cmd_id, void *data) +{ + struct kvm_sev_cmd cmd = { + .id = cmd_id, + .data = (uint64_t)data, + .sev_fd = open_sev_dev_path_or_exit(), + }; + int ret; + + ret = ioctl(vm_fd, KVM_MEMORY_ENCRYPT_OP, &cmd); + TEST_ASSERT(ret < 0 || cmd.error == SEV_RET_SUCCESS, + "%d failed: fw error: %d\n", + cmd_id, cmd.error); + + return ret; +} + +static void test_init2(unsigned long vm_type, struct kvm_sev_init *init) +{ + struct kvm_vm *vm; + int ret; + + vm = vm_create_barebones_type(vm_type); + ret = __sev_ioctl(vm->fd, KVM_SEV_INIT2, init); + TEST_ASSERT(ret == 0, + "KVM_SEV_INIT2 return code is %d (expected 0), errno: %d", + ret, errno); + kvm_vm_free(vm); +} + +static void test_init2_invalid(unsigned long vm_type, struct kvm_sev_init *init, const char *msg) +{ + struct kvm_vm *vm; + int ret; + + vm = vm_create_barebones_type(vm_type); + ret = __sev_ioctl(vm->fd, KVM_SEV_INIT2, init); + TEST_ASSERT(ret == -1 && errno == EINVAL, + "KVM_SEV_INIT2 should fail, %s.", + msg); + kvm_vm_free(vm); +} + +void test_vm_types(void) +{ + test_init2(KVM_X86_SEV_VM, &(struct kvm_sev_init){}); + + /* + * TODO: check that unsupported types cannot be created. Probably + * a separate selftest. + */ + if (have_sev_es) + test_init2(KVM_X86_SEV_ES_VM, &(struct kvm_sev_init){}); + + test_init2_invalid(0, &(struct kvm_sev_init){}, + "VM type is KVM_X86_DEFAULT_VM"); + if (kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM)) + test_init2_invalid(KVM_X86_SW_PROTECTED_VM, &(struct kvm_sev_init){}, + "VM type is KVM_X86_SW_PROTECTED_VM"); +} + +void test_flags(uint32_t vm_type) +{ + int i; + + for (i = 0; i < 32; i++) + test_init2_invalid(vm_type, + &(struct kvm_sev_init){ .flags = BIT(i) }, + "invalid flag"); +} + +void test_features(uint32_t vm_type, uint64_t supported_features) +{ + int i; + + for (i = 0; i < 64; i++) { + if (!(supported_features & (1u << i))) + test_init2_invalid(vm_type, + &(struct kvm_sev_init){ .vmsa_features = BIT_ULL(i) }, + "unknown feature"); + else if (KNOWN_FEATURES & (1u << i)) + test_init2(vm_type, + &(struct kvm_sev_init){ .vmsa_features = BIT_ULL(i) }); + } +} + +int main(int argc, char *argv[]) +{ + int kvm_fd = open_kvm_dev_path_or_exit(); + bool have_sev; + + TEST_REQUIRE(__kvm_has_device_attr(kvm_fd, 0, KVM_X86_SEV_VMSA_FEATURES) == 0); + kvm_device_attr_get(kvm_fd, 0, KVM_X86_SEV_VMSA_FEATURES, &supported_vmsa_features); + + have_sev = kvm_cpu_has(X86_FEATURE_SEV); + TEST_ASSERT(have_sev == !!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SEV_VM)), + "sev: KVM_CAP_VM_TYPES (%x) does not match cpuid (checking %x)", + kvm_check_cap(KVM_CAP_VM_TYPES), 1 << KVM_X86_SEV_VM); + + TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SEV_VM)); + have_sev_es = kvm_cpu_has(X86_FEATURE_SEV_ES); + + TEST_ASSERT(have_sev_es == !!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SEV_ES_VM)), + "sev-es: KVM_CAP_VM_TYPES (%x) does not match cpuid (checking %x)", + kvm_check_cap(KVM_CAP_VM_TYPES), 1 << KVM_X86_SEV_ES_VM); + + test_vm_types(); + + test_flags(KVM_X86_SEV_VM); + if (have_sev_es) + test_flags(KVM_X86_SEV_ES_VM); + + test_features(KVM_X86_SEV_VM, 0); + if (have_sev_es) + test_features(KVM_X86_SEV_ES_VM, supported_vmsa_features); + + return 0; +} -- 2.43.0
This removes the concept of "subtypes", instead letting the tests use proper VM types that were recently added. While the sev_init_vm() and sev_es_init_vm() are still able to operate with the legacy KVM_SEV_INIT and KVM_SEV_ES_INIT ioctls, this is limited to VMs that are created manually with vm_create_barebones(). Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- .../selftests/kvm/include/kvm_util_base.h | 5 ++-- .../selftests/kvm/include/x86_64/processor.h | 6 ---- .../selftests/kvm/include/x86_64/sev.h | 16 ++-------- tools/testing/selftests/kvm/lib/kvm_util.c | 1 - .../selftests/kvm/lib/x86_64/processor.c | 14 +++++---- tools/testing/selftests/kvm/lib/x86_64/sev.c | 30 +++++++++++++++++-- 6 files changed, 40 insertions(+), 32 deletions(-) diff --git a/tools/testing/selftests/kvm/include/kvm_util_base.h b/tools/testing/selftests/kvm/include/kvm_util_base.h index 7c06ceb36643..8acca8237687 100644 --- a/tools/testing/selftests/kvm/include/kvm_util_base.h +++ b/tools/testing/selftests/kvm/include/kvm_util_base.h @@ -93,7 +93,6 @@ enum kvm_mem_region_type { struct kvm_vm { int mode; unsigned long type; - uint8_t subtype; int kvm_fd; int fd; unsigned int pgtable_levels; @@ -200,8 +199,8 @@ enum vm_guest_mode { struct vm_shape { uint32_t type; uint8_t mode; - uint8_t subtype; - uint16_t padding; + uint8_t pad0; + uint16_t pad1; }; kvm_static_assert(sizeof(struct vm_shape) == sizeof(uint64_t)); diff --git a/tools/testing/selftests/kvm/include/x86_64/processor.h b/tools/testing/selftests/kvm/include/x86_64/processor.h index 3bd03b088dda..3c8dfd8180b1 100644 --- a/tools/testing/selftests/kvm/include/x86_64/processor.h +++ b/tools/testing/selftests/kvm/include/x86_64/processor.h @@ -23,12 +23,6 @@ extern bool host_cpu_is_intel; extern bool host_cpu_is_amd; -enum vm_guest_x86_subtype { - VM_SUBTYPE_NONE = 0, - VM_SUBTYPE_SEV, - VM_SUBTYPE_SEV_ES, -}; - /* Forced emulation prefix, used to invoke the emulator unconditionally. */ #define KVM_FEP "ud2; .byte 'k', 'v', 'm';" diff --git a/tools/testing/selftests/kvm/include/x86_64/sev.h b/tools/testing/selftests/kvm/include/x86_64/sev.h index 8a1bf88474c9..0719f083351a 100644 --- a/tools/testing/selftests/kvm/include/x86_64/sev.h +++ b/tools/testing/selftests/kvm/include/x86_64/sev.h @@ -67,20 +67,8 @@ kvm_static_assert(SEV_RET_SUCCESS == 0); __TEST_ASSERT_VM_VCPU_IOCTL(!ret, #cmd, ret, vm); \ }) -static inline void sev_vm_init(struct kvm_vm *vm) -{ - vm->arch.sev_fd = open_sev_dev_path_or_exit(); - - vm_sev_ioctl(vm, KVM_SEV_INIT, NULL); -} - - -static inline void sev_es_vm_init(struct kvm_vm *vm) -{ - vm->arch.sev_fd = open_sev_dev_path_or_exit(); - - vm_sev_ioctl(vm, KVM_SEV_ES_INIT, NULL); -} +void sev_vm_init(struct kvm_vm *vm); +void sev_es_vm_init(struct kvm_vm *vm); static inline void sev_register_encrypted_memory(struct kvm_vm *vm, struct userspace_mem_region *region) diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c index b2262b5fad9e..9da388100f3a 100644 --- a/tools/testing/selftests/kvm/lib/kvm_util.c +++ b/tools/testing/selftests/kvm/lib/kvm_util.c @@ -276,7 +276,6 @@ struct kvm_vm *____vm_create(struct vm_shape shape) vm->mode = shape.mode; vm->type = shape.type; - vm->subtype = shape.subtype; vm->pa_bits = vm_guest_mode_params[vm->mode].pa_bits; vm->va_bits = vm_guest_mode_params[vm->mode].va_bits; diff --git a/tools/testing/selftests/kvm/lib/x86_64/processor.c b/tools/testing/selftests/kvm/lib/x86_64/processor.c index 74a4c736c9ae..9f87ca8b7ab6 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/processor.c +++ b/tools/testing/selftests/kvm/lib/x86_64/processor.c @@ -578,10 +578,11 @@ void kvm_arch_vm_post_create(struct kvm_vm *vm) sync_global_to_guest(vm, host_cpu_is_intel); sync_global_to_guest(vm, host_cpu_is_amd); - if (vm->subtype == VM_SUBTYPE_SEV) - sev_vm_init(vm); - else if (vm->subtype == VM_SUBTYPE_SEV_ES) - sev_es_vm_init(vm); + if (vm->type == KVM_X86_SEV_VM || vm->type == KVM_X86_SEV_ES_VM) { + struct kvm_sev_init init = { 0 }; + + vm_sev_ioctl(vm, KVM_SEV_INIT2, &init); + } } void vcpu_arch_set_entry_point(struct kvm_vcpu *vcpu, void *guest_code) @@ -1081,9 +1082,12 @@ void kvm_get_cpu_address_width(unsigned int *pa_bits, unsigned int *va_bits) void kvm_init_vm_address_properties(struct kvm_vm *vm) { - if (vm->subtype == VM_SUBTYPE_SEV || vm->subtype == VM_SUBTYPE_SEV_ES) { + if (vm->type == KVM_X86_SEV_VM || vm->type == KVM_X86_SEV_ES_VM) { + vm->arch.sev_fd = open_sev_dev_path_or_exit(); vm->arch.c_bit = BIT_ULL(this_cpu_property(X86_PROPERTY_SEV_C_BIT)); vm->gpa_tag_mask = vm->arch.c_bit; + } else { + vm->arch.sev_fd = -1; } } diff --git a/tools/testing/selftests/kvm/lib/x86_64/sev.c b/tools/testing/selftests/kvm/lib/x86_64/sev.c index e248d3364b9c..597994fa4f41 100644 --- a/tools/testing/selftests/kvm/lib/x86_64/sev.c +++ b/tools/testing/selftests/kvm/lib/x86_64/sev.c @@ -35,6 +35,32 @@ static void encrypt_region(struct kvm_vm *vm, struct userspace_mem_region *regio } } +void sev_vm_init(struct kvm_vm *vm) +{ + if (vm->type == KVM_X86_DEFAULT_VM) { + assert(vm->arch.sev_fd == -1); + vm->arch.sev_fd = open_sev_dev_path_or_exit(); + vm_sev_ioctl(vm, KVM_SEV_INIT, NULL); + } else { + struct kvm_sev_init init = { 0 }; + assert(vm->type == KVM_X86_SEV_VM); + vm_sev_ioctl(vm, KVM_SEV_INIT2, &init); + } +} + +void sev_es_vm_init(struct kvm_vm *vm) +{ + if (vm->type == KVM_X86_DEFAULT_VM) { + assert(vm->arch.sev_fd == -1); + vm->arch.sev_fd = open_sev_dev_path_or_exit(); + vm_sev_ioctl(vm, KVM_SEV_ES_INIT, NULL); + } else { + struct kvm_sev_init init = { 0 }; + assert(vm->type == KVM_X86_SEV_ES_VM); + vm_sev_ioctl(vm, KVM_SEV_INIT2, &init); + } +} + void sev_vm_launch(struct kvm_vm *vm, uint32_t policy) { struct kvm_sev_launch_start launch_start = { @@ -91,10 +117,8 @@ struct kvm_vm *vm_sev_create_with_one_vcpu(uint32_t policy, void *guest_code, struct kvm_vcpu **cpu) { struct vm_shape shape = { - .type = VM_TYPE_DEFAULT, .mode = VM_MODE_DEFAULT, - .subtype = policy & SEV_POLICY_ES ? VM_SUBTYPE_SEV_ES : - VM_SUBTYPE_SEV, + .type = policy & SEV_POLICY_ES ? KVM_X86_SEV_ES_VM : KVM_X86_SEV_VM, }; struct kvm_vm *vm; struct kvm_vcpu *cpus[1]; -- 2.43.0 -- 2.43.0
This simplifies the implementation of KVM_CHECK_EXTENSION(KVM_CAP_VM_TYPES), and also allows the vendor module to specify which VM types are supported. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kvm/x86.c | 12 ++++++------ arch/x86/kvm/x86.h | 2 ++ 2 files changed, 8 insertions(+), 6 deletions(-) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 98b7979b4698..8c56bcf3feb7 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -94,6 +94,7 @@ struct kvm_caps kvm_caps __read_mostly = { .supported_mce_cap = MCG_CTL_P | MCG_SER_P, + .supported_vm_types = BIT(KVM_X86_DEFAULT_VM), }; EXPORT_SYMBOL_GPL(kvm_caps); @@ -4629,9 +4630,7 @@ static int kvm_ioctl_get_supported_hv_cpuid(struct kvm_vcpu *vcpu, static bool kvm_is_vm_type_supported(unsigned long type) { - return type == KVM_X86_DEFAULT_VM || - (type == KVM_X86_SW_PROTECTED_VM && - IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_mmu_enabled); + return type < 32 && (kvm_caps.supported_vm_types & BIT(type)); } int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) @@ -4832,9 +4831,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) r = kvm_caps.has_notify_vmexit; break; case KVM_CAP_VM_TYPES: - r = BIT(KVM_X86_DEFAULT_VM); - if (kvm_is_vm_type_supported(KVM_X86_SW_PROTECTED_VM)) - r |= BIT(KVM_X86_SW_PROTECTED_VM); + r = kvm_caps.supported_vm_types; break; default: break; @@ -9829,6 +9826,9 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops) kvm_register_perf_callbacks(ops->handle_intel_pt_intr); + if (IS_ENABLED(CONFIG_KVM_SW_PROTECTED_VM) && tdp_mmu_enabled) + kvm_caps.supported_vm_types |= BIT(KVM_X86_SW_PROTECTED_VM); + if (!kvm_cpu_cap_has(X86_FEATURE_XSAVES)) kvm_caps.supported_xss = 0; diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h index a8b71803777b..d80a4c6b5a38 100644 --- a/arch/x86/kvm/x86.h +++ b/arch/x86/kvm/x86.h @@ -24,6 +24,8 @@ struct kvm_caps { bool has_bus_lock_exit; /* notify VM exit supported? */ bool has_notify_vmexit; + /* bit mask of VM types */ + u32 supported_vm_types; u64 supported_mce_cap; u64 supported_xcr0; -- 2.43.0
The idea that no parameter would ever be necessary when enabling SEV or SEV-ES for a VM was decidedly optimistic. In fact, in some sense it's already a parameter whether SEV or SEV-ES is desired. Another possible source of variability is the desired set of VMSA features, as that affects the measurement of the VM's initial state and cannot be changed arbitrarily by the hypervisor. Create a new sub-operation for KVM_MEMORY_ENCRYPT_OP that can take a struct, and put the new op to work by including the VMSA features as a field of the struct. The existing KVM_SEV_INIT and KVM_SEV_ES_INIT use the full set of supported VMSA features for backwards compatibility. The struct also includes the usual bells and whistles for future extensibility: a flags field that must be zero for now, and some padding at the end. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- .../virt/kvm/x86/amd-memory-encryption.rst | 40 ++++++++++++-- arch/x86/include/uapi/asm/kvm.h | 9 ++++ arch/x86/kvm/svm/sev.c | 53 ++++++++++++++++--- 3 files changed, 92 insertions(+), 10 deletions(-) diff --git a/Documentation/virt/kvm/x86/amd-memory-encryption.rst b/Documentation/virt/kvm/x86/amd-memory-encryption.rst index fb41470c0310..f7c007d34114 100644 --- a/Documentation/virt/kvm/x86/amd-memory-encryption.rst +++ b/Documentation/virt/kvm/x86/amd-memory-encryption.rst @@ -76,15 +76,49 @@ are defined in ``<linux/psp-dev.h>``. KVM implements the following commands to support common lifecycle events of SEV guests, such as launching, running, snapshotting, migrating and decommissioning. -1. KVM_SEV_INIT ---------------- +1. KVM_SEV_INIT2 +---------------- -The KVM_SEV_INIT command is used by the hypervisor to initialize the SEV platform +The KVM_SEV_INIT2 command is used by the hypervisor to initialize the SEV platform context. In a typical workflow, this command should be the first command issued. +For this command to be accepted, either KVM_X86_SEV_VM or KVM_X86_SEV_ES_VM +must have been passed to the KVM_CREATE_VM ioctl. A virtual machine created +with those machine types in turn cannot be run until KVM_SEV_INIT2 is invoked. + +Parameters: struct kvm_sev_init (in) Returns: 0 on success, -negative on error +:: + + struct kvm_sev_init { + __u64 vmsa_features; /* initial value of features field in VMSA */ + __u32 flags; /* must be 0 */ + __u32 pad[9]; + }; + +It is an error if the hypervisor does not support any of the bits that +are set in ``flags`` or ``vmsa_features``. ``vmsa_features`` must be +0 for SEV virtual machines, as they do not have a VMSA. + +This command replaces the deprecated KVM_SEV_INIT and KVM_SEV_ES_INIT commands. +The commands did not have any parameters (the ```data``` field was unused) and +only work for the KVM_X86_DEFAULT_VM machine type (0). + +They behave as if: + +* the VM type is KVM_X86_SEV_VM for KVM_SEV_INIT, or KVM_X86_SEV_ES_VM for + KVM_SEV_ES_INIT + +* the ``flags`` and ``vmsa_features`` fields of ``struct kvm_sev_init`` are + set to zero + +If the ``KVM_X86_SEV_VMSA_FEATURES`` attribute does not exist, the hypervisor only +supports KVM_SEV_INIT and KVM_SEV_ES_INIT. In that case, note that KVM_SEV_ES_INIT +might set the debug swap VMSA feature (bit 5) depending on the value of the +``debug_swap`` parameter of ``kvm-amd.ko``. + 2. KVM_SEV_LAUNCH_START ----------------------- diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index 9d950b0b64c9..51b13080ed4b 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -690,6 +690,9 @@ enum sev_cmd_id { /* Guest Migration Extension */ KVM_SEV_SEND_CANCEL, + /* Second time is the charm; improved versions of the above ioctls. */ + KVM_SEV_INIT2, + KVM_SEV_NR_MAX, }; @@ -701,6 +704,12 @@ struct kvm_sev_cmd { __u32 sev_fd; }; +struct kvm_sev_init { + __u64 vmsa_features; + __u32 flags; + __u32 pad[9]; +}; + struct kvm_sev_launch_start { __u32 handle; __u32 policy; diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 7c2b0471b92c..dc22b31faebd 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -240,27 +240,31 @@ static void sev_unbind_asid(struct kvm *kvm, unsigned int handle) sev_decommission(handle); } -static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) +static int __sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp, + struct kvm_sev_init *data, + unsigned long vm_type) { struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; struct sev_platform_init_args init_args = {0}; + bool es_active = vm_type != KVM_X86_SEV_VM; + u64 valid_vmsa_features = es_active ? sev_supported_vmsa_features : 0; int ret; if (kvm->created_vcpus) return -EINVAL; - if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM) + if (data->flags) + return -EINVAL; + + if (data->vmsa_features & ~valid_vmsa_features) return -EINVAL; if (unlikely(sev->active)) return -EINVAL; sev->active = true; - sev->es_active = argp->id == KVM_SEV_ES_INIT; - sev->vmsa_features = sev_supported_vmsa_features; - if (sev_supported_vmsa_features) - pr_warn_once("Enabling DebugSwap with KVM_SEV_ES_INIT. " - "This will not work starting with Linux 6.10\n"); + sev->es_active = es_active; + sev->vmsa_features = data->vmsa_features; ret = sev_asid_new(sev); if (ret) @@ -290,6 +294,38 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) return ret; } +static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_init data = { + .vmsa_features = 0, + }; + unsigned long vm_type; + + if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM) + return -EINVAL; + + vm_type = (argp->id == KVM_SEV_INIT ? KVM_X86_SEV_VM : KVM_X86_SEV_ES_VM); + return __sev_guest_init(kvm, argp, &data, vm_type); +} + +static int sev_guest_init2(struct kvm *kvm, struct kvm_sev_cmd *argp) +{ + struct kvm_sev_info *sev = &to_kvm_svm(kvm)->sev_info; + struct kvm_sev_init data; + + if (!sev->need_init) + return -EINVAL; + + if (kvm->arch.vm_type != KVM_X86_SEV_VM && + kvm->arch.vm_type != KVM_X86_SEV_ES_VM) + return -EINVAL; + + if (copy_from_user(&data, u64_to_user_ptr(argp->data), sizeof(data))) + return -EFAULT; + + return __sev_guest_init(kvm, argp, &data, kvm->arch.vm_type); +} + static int sev_bind_asid(struct kvm *kvm, unsigned int handle, int *error) { unsigned int asid = sev_get_asid(kvm); @@ -1940,6 +1976,9 @@ int sev_mem_enc_ioctl(struct kvm *kvm, void __user *argp) case KVM_SEV_INIT: r = sev_guest_init(kvm, &sev_cmd); break; + case KVM_SEV_INIT2: + r = sev_guest_init2(kvm, &sev_cmd); + break; case KVM_SEV_LAUNCH_START: r = sev_launch_start(kvm, &sev_cmd); break; -- 2.43.0
Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kvm/svm/sev.c | 4 ++-- arch/x86/kvm/svm/svm.h | 5 +++++ 2 files changed, 7 insertions(+), 2 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 800e836a69fb..704cd42b4f1b 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -94,7 +94,7 @@ static int sev_flush_asids(unsigned int min_asid, unsigned int max_asid) static inline bool is_mirroring_enc_context(struct kvm *kvm) { - return !!to_kvm_svm(kvm)->sev_info.enc_context_owner; + return !!to_kvm_sev_info(kvm)->enc_context_owner; } static bool sev_vcpu_has_debug_swap(struct vcpu_svm *svm) @@ -679,7 +679,7 @@ static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu, clflush_cache_range(svm->sev_es.vmsa, PAGE_SIZE); vmsa.reserved = 0; - vmsa.handle = to_kvm_svm(kvm)->sev_info.handle; + vmsa.handle = to_kvm_sev_info(kvm)->handle; vmsa.address = __sme_pa(svm->sev_es.vmsa); vmsa.len = PAGE_SIZE; ret = sev_issue_cmd(kvm, SEV_CMD_LAUNCH_UPDATE_VMSA, &vmsa, error); diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index b7707514d042..6313679d464b 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -319,6 +319,11 @@ static __always_inline struct kvm_svm *to_kvm_svm(struct kvm *kvm) return container_of(kvm, struct kvm_svm, kvm); } +static __always_inline struct kvm_sev_info *to_kvm_sev_info(struct kvm *kvm) +{ + return &to_kvm_svm(kvm)->sev_info; +} + static __always_inline bool sev_guest(struct kvm *kvm) { #ifdef CONFIG_KVM_AMD_SEV -- 2.43.0
Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- Documentation/virt/kvm/api.rst | 2 ++ arch/x86/include/uapi/asm/kvm.h | 2 ++ arch/x86/kvm/svm/sev.c | 16 +++++++++++++--- arch/x86/kvm/svm/svm.c | 11 +++++++++++ arch/x86/kvm/svm/svm.h | 1 + 5 files changed, 29 insertions(+), 3 deletions(-) diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst index 0b5a33ee71ee..f0b76ff5030d 100644 --- a/Documentation/virt/kvm/api.rst +++ b/Documentation/virt/kvm/api.rst @@ -8819,6 +8819,8 @@ means the VM type with value @n is supported. Possible values of @n are:: #define KVM_X86_DEFAULT_VM 0 #define KVM_X86_SW_PROTECTED_VM 1 + #define KVM_X86_SEV_VM 2 + #define KVM_X86_SEV_ES_VM 3 Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing. Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h index d0c1b459f7e9..9d950b0b64c9 100644 --- a/arch/x86/include/uapi/asm/kvm.h +++ b/arch/x86/include/uapi/asm/kvm.h @@ -857,5 +857,7 @@ struct kvm_hyperv_eventfd { #define KVM_X86_DEFAULT_VM 0 #define KVM_X86_SW_PROTECTED_VM 1 +#define KVM_X86_SEV_VM 2 +#define KVM_X86_SEV_ES_VM 3 #endif /* _ASM_X86_KVM_H */ diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 704cd42b4f1b..7c2b0471b92c 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -249,6 +249,9 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) if (kvm->created_vcpus) return -EINVAL; + if (kvm->arch.vm_type != KVM_X86_DEFAULT_VM) + return -EINVAL; + if (unlikely(sev->active)) return -EINVAL; @@ -270,6 +273,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) INIT_LIST_HEAD(&sev->regions_list); INIT_LIST_HEAD(&sev->mirror_vms); + sev->need_init = false; kvm_set_apicv_inhibit(kvm, APICV_INHIBIT_REASON_SEV); @@ -1841,7 +1845,8 @@ int sev_vm_move_enc_context_from(struct kvm *kvm, unsigned int source_fd) if (ret) goto out_fput; - if (sev_guest(kvm) || !sev_guest(source_kvm)) { + if (kvm->arch.vm_type != source_kvm->arch.vm_type || + sev_guest(kvm) || !sev_guest(source_kvm)) { ret = -EINVAL; goto out_unlock; } @@ -2162,6 +2167,7 @@ int sev_vm_copy_enc_context_from(struct kvm *kvm, unsigned int source_fd) mirror_sev->asid = source_sev->asid; mirror_sev->fd = source_sev->fd; mirror_sev->es_active = source_sev->es_active; + mirror_sev->need_init = false; mirror_sev->handle = source_sev->handle; INIT_LIST_HEAD(&mirror_sev->regions_list); INIT_LIST_HEAD(&mirror_sev->mirror_vms); @@ -2227,10 +2233,14 @@ void sev_vm_destroy(struct kvm *kvm) void __init sev_set_cpu_caps(void) { - if (sev_enabled) + if (sev_enabled) { kvm_cpu_cap_set(X86_FEATURE_SEV); - if (sev_es_enabled) + kvm_caps.supported_vm_types |= BIT(KVM_X86_SEV_VM); + } + if (sev_es_enabled) { kvm_cpu_cap_set(X86_FEATURE_SEV_ES); + kvm_caps.supported_vm_types |= BIT(KVM_X86_SEV_ES_VM); + } } void __init sev_hardware_setup(void) diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 03108055a7b0..0f3b59da0d4a 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -4078,6 +4078,9 @@ static void svm_cancel_injection(struct kvm_vcpu *vcpu) static int svm_vcpu_pre_run(struct kvm_vcpu *vcpu) { + if (to_kvm_sev_info(vcpu->kvm)->need_init) + return -EINVAL; + return 1; } @@ -4883,6 +4886,14 @@ static void svm_vm_destroy(struct kvm *kvm) static int svm_vm_init(struct kvm *kvm) { + int type = kvm->arch.vm_type; + + if (type != KVM_X86_DEFAULT_VM && + type != KVM_X86_SW_PROTECTED_VM) { + kvm->arch.has_protected_state = (type == KVM_X86_SEV_ES_VM); + to_kvm_sev_info(kvm)->need_init = true; + } + if (!pause_filter_count || !pause_filter_thresh) kvm->arch.pause_in_guest = true; diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 6313679d464b..c08eb6095e80 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -79,6 +79,7 @@ enum { struct kvm_sev_info { bool active; /* SEV enabled guest */ bool es_active; /* SEV-ES enabled guest */ + bool need_init; /* waiting for SEV_INIT2 */ unsigned int asid; /* ASID used for this guest */ unsigned int handle; /* SEV firmware handle */ int fd; /* SEV device fd */ -- 2.43.0
Allow vendor modules to provide their own attributes on /dev/kvm. To avoid proliferation of vendor ops, implement KVM_HAS_DEVICE_ATTR and KVM_GET_DEVICE_ATTR in terms of the same function. You're not supposed to use KVM_GET_DEVICE_ATTR to do complicated computations, especially on /dev/kvm. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/include/asm/kvm-x86-ops.h | 1 + arch/x86/include/asm/kvm_host.h | 1 + arch/x86/kvm/x86.c | 43 ++++++++++++++++++++---------- 3 files changed, 31 insertions(+), 14 deletions(-) diff --git a/arch/x86/include/asm/kvm-x86-ops.h b/arch/x86/include/asm/kvm-x86-ops.h index 110d7f29ca9a..5187fcf4b610 100644 --- a/arch/x86/include/asm/kvm-x86-ops.h +++ b/arch/x86/include/asm/kvm-x86-ops.h @@ -121,6 +121,7 @@ KVM_X86_OP(enter_smm) KVM_X86_OP(leave_smm) KVM_X86_OP(enable_smi_window) #endif +KVM_X86_OP_OPTIONAL(dev_get_attr) KVM_X86_OP_OPTIONAL(mem_enc_ioctl) KVM_X86_OP_OPTIONAL(mem_enc_register_region) KVM_X86_OP_OPTIONAL(mem_enc_unregister_region) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index 16e07a2eee19..f6cc7bfb5462 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1778,6 +1778,7 @@ struct kvm_x86_ops { void (*enable_smi_window)(struct kvm_vcpu *vcpu); #endif + int (*dev_get_attr)(u64 attr, u64 *val); int (*mem_enc_ioctl)(struct kvm *kvm, void __user *argp); int (*mem_enc_register_region)(struct kvm *kvm, struct kvm_enc_region *argp); int (*mem_enc_unregister_region)(struct kvm *kvm, struct kvm_enc_region *argp); diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 3d2029402513..e8253aa8ef5e 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -4842,34 +4842,49 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext) return r; } -static int kvm_x86_dev_get_attr(struct kvm_device_attr *attr) +static int __kvm_x86_dev_get_attr(struct kvm_device_attr *attr, u64 *val) { - u64 __user *uaddr = u64_to_user_ptr(attr->addr); + int r; if (attr->group) return -ENXIO; switch (attr->attr) { case KVM_X86_XCOMP_GUEST_SUPP: - if (put_user(kvm_caps.supported_xcr0, uaddr)) - return -EFAULT; - return 0; + r = 0; + *val = kvm_caps.supported_xcr0; + break; default: - return -ENXIO; + r = -ENXIO; + if (kvm_x86_ops.dev_get_attr) + r = static_call(kvm_x86_dev_get_attr)(attr->attr, val); + break; } + + return r; +} + +static int kvm_x86_dev_get_attr(struct kvm_device_attr *attr) +{ + u64 __user *uaddr = u64_to_user_ptr(attr->addr); + int r; + u64 val; + + r = __kvm_x86_dev_get_attr(attr, &val); + if (r < 0) + return r; + + if (put_user(val, uaddr)) + return -EFAULT; + + return 0; } static int kvm_x86_dev_has_attr(struct kvm_device_attr *attr) { - if (attr->group) - return -ENXIO; + u64 val; - switch (attr->attr) { - case KVM_X86_XCOMP_GUEST_SUPP: - return 0; - default: - return -ENXIO; - } + return __kvm_x86_dev_get_attr(attr, &val); } long kvm_arch_dev_ioctl(struct file *filp, -- 2.43.0
The DebugSwap feature of SEV-ES provides a way for confidential guests to use data breakpoints. Its status is record in VMSA, and therefore attestation signatures depend on whether it is enabled or not. In order to avoid invalidating the signatures depending on the host machine, it was disabled by default (see commit 5abf6dceb066, "SEV: disable SEV-ES DebugSwap by default", 2024-03-09). However, we now have a new API to create SEV VMs that allows enabling DebugSwap based on what the user tells KVM to do, and we also changed the legacy KVM_SEV_ES_INIT API to never enable DebugSwap. It is therefore possible to re-enable the feature without breaking compatibility with kernels that pre-date the introduction of DebugSwap, so go ahead. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kvm/svm/sev.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index dc22b31faebd..1a11840facfb 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -42,7 +42,7 @@ static bool sev_es_enabled = true; module_param_named(sev_es, sev_es_enabled, bool, 0444); /* enable/disable SEV-ES DebugSwap support */ -static bool sev_es_debug_swap_enabled = false; +static bool sev_es_debug_swap_enabled = true; module_param_named(debug_swap, sev_es_debug_swap_enabled, bool, 0444); static u64 sev_supported_vmsa_features;
SEV-ES allows passing custom contents for x87, SSE and AVX state into the VMSA. Allow userspace to do that with the usual KVM_SET_XSAVE API and only mark FPU contents as confidential after it has been copied and encrypted into the VMSA. Since the XSAVE state for AVX is the first, it does not need the compacted-state handling of get_xsave_addr(). However, there are other parts of XSAVE state in the VMSA that currently are not handled, and the validation logic of get_xsave_addr() is pointless to duplicate in KVM, so move get_xsave_addr() to public FPU API; it is really just a facility to operate on XSAVE state and does not expose any internal details of arch/x86/kernel/fpu. Cc: Dave Hansen <dave.hansen@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/include/asm/fpu/api.h | 3 +++ arch/x86/kernel/fpu/xstate.h | 2 -- arch/x86/kvm/svm/sev.c | 36 ++++++++++++++++++++++++++++++++++ arch/x86/kvm/svm/svm.c | 8 -------- 4 files changed, 39 insertions(+), 10 deletions(-) diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h index a2be3aefff9f..f86ad3335529 100644 --- a/arch/x86/include/asm/fpu/api.h +++ b/arch/x86/include/asm/fpu/api.h @@ -143,6 +143,9 @@ extern void fpstate_clear_xstate_component(struct fpstate *fps, unsigned int xfe extern u64 xstate_get_guest_group_perm(void); +extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr); + + /* KVM specific functions */ extern bool fpu_alloc_guest_fpstate(struct fpu_guest *gfpu); extern void fpu_free_guest_fpstate(struct fpu_guest *gfpu); diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h index 3518fb26d06b..4ff910545451 100644 --- a/arch/x86/kernel/fpu/xstate.h +++ b/arch/x86/kernel/fpu/xstate.h @@ -54,8 +54,6 @@ extern int copy_sigframe_from_user_to_xstate(struct task_struct *tsk, const void extern void fpu__init_cpu_xstate(void); extern void fpu__init_system_xstate(unsigned int legacy_size); -extern void *get_xsave_addr(struct xregs_state *xsave, int xfeature_nr); - static inline u64 xfeatures_mask_supervisor(void) { return fpu_kernel_cfg.max_features & XFEATURE_MASK_SUPERVISOR_SUPPORTED; diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index a8300646a280..800e836a69fb 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -23,6 +23,7 @@ #include <asm/pkru.h> #include <asm/trapnr.h> #include <asm/fpu/xcr.h> +#include <asm/fpu/xstate.h> #include <asm/debugreg.h> #include "mmu.h" @@ -577,6 +578,10 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm) struct kvm_vcpu *vcpu = &svm->vcpu; struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info; struct sev_es_save_area *save = svm->sev_es.vmsa; + struct xregs_state *xsave; + const u8 *s; + u8 *d; + int i; /* Check some debug related fields before encrypting the VMSA */ if (svm->vcpu.guest_debug || (svm->vmcb->save.dr7 & ~DR7_FIXED_1)) @@ -619,6 +624,30 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm) save->sev_features = sev->vmsa_features; + xsave = &vcpu->arch.guest_fpu.fpstate->regs.xsave; + save->x87_dp = xsave->i387.rdp; + save->mxcsr = xsave->i387.mxcsr; + save->x87_ftw = xsave->i387.twd; + save->x87_fsw = xsave->i387.swd; + save->x87_fcw = xsave->i387.cwd; + save->x87_fop = xsave->i387.fop; + save->x87_ds = 0; + save->x87_cs = 0; + save->x87_rip = xsave->i387.rip; + + for (i = 0; i < 8; i++) { + d = save->fpreg_x87 + i * 10; + s = ((u8 *)xsave->i387.st_space) + i * 16; + memcpy(d, s, 10); + } + memcpy(save->fpreg_xmm, xsave->i387.xmm_space, 256); + + s = get_xsave_addr(xsave, XFEATURE_YMM); + if (s) + memcpy(save->fpreg_ymm, s, 256); + else + memset(save->fpreg_ymm, 0, 256); + pr_debug("Virtual Machine Save Area (VMSA):\n"); print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false); @@ -657,6 +686,13 @@ static int __sev_launch_update_vmsa(struct kvm *kvm, struct kvm_vcpu *vcpu, if (ret) return ret; + /* + * SEV-ES guests maintain an encrypted version of their FPU + * state which is restored and saved on VMRUN and VMEXIT. + * Mark vcpu->arch.guest_fpu->fpstate as scratch so it won't + * do xsave/xrstor on it. + */ + fpstate_set_confidential(&vcpu->arch.guest_fpu); vcpu->arch.guest_state_protected = true; return 0; } diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index c22e87ebf0de..03108055a7b0 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1433,14 +1433,6 @@ static int svm_vcpu_create(struct kvm_vcpu *vcpu) vmsa_page = snp_safe_alloc_page(vcpu); if (!vmsa_page) goto error_free_vmcb_page; - - /* - * SEV-ES guests maintain an encrypted version of their FPU - * state which is restored and saved on VMRUN and VMEXIT. - * Mark vcpu->arch.guest_fpu->fpstate as scratch so it won't - * do xsave/xrstor on it. - */ - fpstate_set_confidential(&vcpu->arch.guest_fpu); } err = avic_init_vcpu(svm); -- 2.43.0
Some VM types have characteristics in common; in fact, the only use of VM types right now is kvm_arch_has_private_mem and it assumes that _all_ nonzero VM types have private memory. We will soon introduce a VM type for SEV and SEV-ES VMs, and at that point we will have two special characteristics of confidential VMs that depend on the VM type: not just if memory is private, but also whether guest state is protected. For the latter we have kvm->arch.guest_state_protected, which is only set on a fully initialized VM. For VM types with protected guest state, we can actually fix a problem in the SEV-ES implementation, where ioctls to set registers do not cause an error even if the VM has been initialized and the guest state encrypted. Make sure that when using VM types that will become an error. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-7-pbonzini@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/include/asm/kvm_host.h | 7 ++- arch/x86/kvm/x86.c | 93 ++++++++++++++++++++++++++------- 2 files changed, 79 insertions(+), 21 deletions(-) diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h index f6cc7bfb5462..7380877bc9b5 100644 --- a/arch/x86/include/asm/kvm_host.h +++ b/arch/x86/include/asm/kvm_host.h @@ -1279,12 +1279,14 @@ enum kvm_apicv_inhibit { }; struct kvm_arch { - unsigned long vm_type; unsigned long n_used_mmu_pages; unsigned long n_requested_mmu_pages; unsigned long n_max_mmu_pages; unsigned int indirect_shadow_pages; u8 mmu_valid_gen; + u8 vm_type; + bool has_private_mem; + bool has_protected_state; struct hlist_head mmu_page_hash[KVM_NUM_MMU_PAGES]; struct list_head active_mmu_pages; struct list_head zapped_obsolete_pages; @@ -2153,8 +2155,9 @@ void kvm_mmu_new_pgd(struct kvm_vcpu *vcpu, gpa_t new_pgd); void kvm_configure_mmu(bool enable_tdp, int tdp_forced_root_level, int tdp_max_root_level, int tdp_huge_page_level); + #ifdef CONFIG_KVM_PRIVATE_MEM -#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.vm_type != KVM_X86_DEFAULT_VM) +#define kvm_arch_has_private_mem(kvm) ((kvm)->arch.has_private_mem) #else #define kvm_arch_has_private_mem(kvm) false #endif diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index e8253aa8ef5e..98b7979b4698 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -5560,11 +5560,15 @@ static int kvm_vcpu_ioctl_x86_set_vcpu_events(struct kvm_vcpu *vcpu, return 0; } -static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, - struct kvm_debugregs *dbgregs) +static int kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, + struct kvm_debugregs *dbgregs) { unsigned int i; + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + memset(dbgregs, 0, sizeof(*dbgregs)); BUILD_BUG_ON(ARRAY_SIZE(vcpu->arch.db) != ARRAY_SIZE(dbgregs->db)); @@ -5573,6 +5577,7 @@ static void kvm_vcpu_ioctl_x86_get_debugregs(struct kvm_vcpu *vcpu, dbgregs->dr6 = vcpu->arch.dr6; dbgregs->dr7 = vcpu->arch.dr7; + return 0; } static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, @@ -5580,6 +5585,10 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, { unsigned int i; + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + if (dbgregs->flags) return -EINVAL; @@ -5600,8 +5609,8 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu, } -static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu, - u8 *state, unsigned int size) +static int kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu, + u8 *state, unsigned int size) { /* * Only copy state for features that are enabled for the guest. The @@ -5619,24 +5628,25 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu, XFEATURE_MASK_FPSSE; if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) - return; + return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size, supported_xcr0, vcpu->arch.pkru); + return 0; } -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, - struct kvm_xsave *guest_xsave) +static int kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu, + struct kvm_xsave *guest_xsave) { - kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region, - sizeof(guest_xsave->region)); + return kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region, + sizeof(guest_xsave->region)); } static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, struct kvm_xsave *guest_xsave) { if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) - return 0; + return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, guest_xsave->region, @@ -5644,18 +5654,23 @@ static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu, &vcpu->arch.pkru); } -static void kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu, - struct kvm_xcrs *guest_xcrs) +static int kvm_vcpu_ioctl_x86_get_xcrs(struct kvm_vcpu *vcpu, + struct kvm_xcrs *guest_xcrs) { + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + if (!boot_cpu_has(X86_FEATURE_XSAVE)) { guest_xcrs->nr_xcrs = 0; - return; + return 0; } guest_xcrs->nr_xcrs = 1; guest_xcrs->flags = 0; guest_xcrs->xcrs[0].xcr = XCR_XFEATURE_ENABLED_MASK; guest_xcrs->xcrs[0].value = vcpu->arch.xcr0; + return 0; } static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu, @@ -5663,6 +5678,10 @@ static int kvm_vcpu_ioctl_x86_set_xcrs(struct kvm_vcpu *vcpu, { int i, r = 0; + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + if (!boot_cpu_has(X86_FEATURE_XSAVE)) return -EINVAL; @@ -6045,7 +6064,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp, case KVM_GET_DEBUGREGS: { struct kvm_debugregs dbgregs; - kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs); + r = kvm_vcpu_ioctl_x86_get_debugregs(vcpu, &dbgregs); + if (r < 0) + break; r = -EFAULT; if (copy_to_user(argp, &dbgregs, @@ -6075,7 +6096,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp, if (!u.xsave) break; - kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave); + r = kvm_vcpu_ioctl_x86_get_xsave(vcpu, u.xsave); + if (r < 0) + break; r = -EFAULT; if (copy_to_user(argp, u.xsave, sizeof(struct kvm_xsave))) @@ -6104,7 +6127,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp, if (!u.xsave) break; - kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, size); + r = kvm_vcpu_ioctl_x86_get_xsave2(vcpu, u.buffer, size); + if (r < 0) + break; r = -EFAULT; if (copy_to_user(argp, u.xsave, size)) @@ -6120,7 +6145,9 @@ long kvm_arch_vcpu_ioctl(struct file *filp, if (!u.xcrs) break; - kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs); + r = kvm_vcpu_ioctl_x86_get_xcrs(vcpu, u.xcrs); + if (r < 0) + break; r = -EFAULT; if (copy_to_user(argp, u.xcrs, @@ -6264,6 +6291,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp, } #endif case KVM_GET_SREGS2: { + r = -EINVAL; + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + goto out; + u.sregs2 = kzalloc(sizeof(struct kvm_sregs2), GFP_KERNEL); r = -ENOMEM; if (!u.sregs2) @@ -6276,6 +6308,11 @@ long kvm_arch_vcpu_ioctl(struct file *filp, break; } case KVM_SET_SREGS2: { + r = -EINVAL; + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + goto out; + u.sregs2 = memdup_user(argp, sizeof(struct kvm_sregs2)); if (IS_ERR(u.sregs2)) { r = PTR_ERR(u.sregs2); @@ -11483,6 +11520,10 @@ static void __get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) int kvm_arch_vcpu_ioctl_get_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) { + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + vcpu_load(vcpu); __get_regs(vcpu, regs); vcpu_put(vcpu); @@ -11524,6 +11565,10 @@ static void __set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) int kvm_arch_vcpu_ioctl_set_regs(struct kvm_vcpu *vcpu, struct kvm_regs *regs) { + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + vcpu_load(vcpu); __set_regs(vcpu, regs); vcpu_put(vcpu); @@ -11596,6 +11641,10 @@ static void __get_sregs2(struct kvm_vcpu *vcpu, struct kvm_sregs2 *sregs2) int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu, struct kvm_sregs *sregs) { + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + vcpu_load(vcpu); __get_sregs(vcpu, sregs); vcpu_put(vcpu); @@ -11863,6 +11912,10 @@ int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu, { int ret; + if (vcpu->kvm->arch.has_protected_state && + vcpu->arch.guest_state_protected) + return -EINVAL; + vcpu_load(vcpu); ret = __set_sregs(vcpu, sregs); vcpu_put(vcpu); @@ -11980,7 +12033,7 @@ int kvm_arch_vcpu_ioctl_get_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) struct fxregs_state *fxsave; if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) - return 0; + return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; vcpu_load(vcpu); @@ -12003,7 +12056,7 @@ int kvm_arch_vcpu_ioctl_set_fpu(struct kvm_vcpu *vcpu, struct kvm_fpu *fpu) struct fxregs_state *fxsave; if (fpstate_is_confidential(&vcpu->arch.guest_fpu)) - return 0; + return vcpu->kvm->arch.has_protected_state ? -EINVAL : 0; vcpu_load(vcpu); @@ -12529,6 +12582,8 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type) return -EINVAL; kvm->arch.vm_type = type; + kvm->arch.has_private_mem = + (type == KVM_X86_SW_PROTECTED_VM); ret = kvm_page_track_init(kvm); if (ret) -- 2.43.0
Right now, the set of features that are stored in the VMSA upon initialization is fixed and depends on the module parameters for kvm-amd.ko. However, the hypervisor cannot really change it at will because the feature word has to match between the hypervisor and whatever computes a measurement of the VMSA for attestation purposes. Add a field to kvm_sev_info that holds the set of features to be stored in the VMSA; and query it instead of referring to the module parameters. Because KVM_SEV_INIT and KVM_SEV_ES_INIT accept no parameters, this does not yet introduce any functional change, but it paves the way for an API that allows customization of the features per-VM. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Message-Id: <20240209183743.22030-6-pbonzini@redhat.com> Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> --- arch/x86/kvm/svm/sev.c | 29 +++++++++++++++++++++-------- arch/x86/kvm/svm/svm.c | 2 +- arch/x86/kvm/svm/svm.h | 3 ++- 3 files changed, 24 insertions(+), 10 deletions(-) diff --git a/arch/x86/kvm/svm/sev.c b/arch/x86/kvm/svm/sev.c index 22c35a39c25f..a8300646a280 100644 --- a/arch/x86/kvm/svm/sev.c +++ b/arch/x86/kvm/svm/sev.c @@ -96,6 +96,14 @@ static inline bool is_mirroring_enc_context(struct kvm *kvm) return !!to_kvm_svm(kvm)->sev_info.enc_context_owner; } +static bool sev_vcpu_has_debug_swap(struct vcpu_svm *svm) +{ + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info; + + return sev->vmsa_features & SVM_SEV_FEAT_DEBUG_SWAP; +} + /* Must be called with the sev_bitmap_lock held */ static bool __sev_recycle_asids(unsigned int min_asid, unsigned int max_asid) { @@ -245,6 +253,11 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) sev->active = true; sev->es_active = argp->id == KVM_SEV_ES_INIT; + sev->vmsa_features = sev_supported_vmsa_features; + if (sev_supported_vmsa_features) + pr_warn_once("Enabling DebugSwap with KVM_SEV_ES_INIT. " + "This will not work starting with Linux 6.10\n"); + ret = sev_asid_new(sev); if (ret) goto e_no_asid; @@ -266,6 +279,7 @@ static int sev_guest_init(struct kvm *kvm, struct kvm_sev_cmd *argp) sev_asid_free(sev); sev->asid = 0; e_no_asid: + sev->vmsa_features = 0; sev->es_active = false; sev->active = false; return ret; @@ -560,6 +574,8 @@ static int sev_launch_update_data(struct kvm *kvm, struct kvm_sev_cmd *argp) static int sev_es_sync_vmsa(struct vcpu_svm *svm) { + struct kvm_vcpu *vcpu = &svm->vcpu; + struct kvm_sev_info *sev = &to_kvm_svm(vcpu->kvm)->sev_info; struct sev_es_save_area *save = svm->sev_es.vmsa; /* Check some debug related fields before encrypting the VMSA */ @@ -601,11 +617,7 @@ static int sev_es_sync_vmsa(struct vcpu_svm *svm) save->xss = svm->vcpu.arch.ia32_xss; save->dr6 = svm->vcpu.arch.dr6; - if (sev_supported_vmsa_features) { - save->sev_features = sev_supported_vmsa_features; - pr_warn_once("Enabling DebugSwap with KVM_SEV_ES_INIT. " - "This will not work starting with Linux 6.10\n"); - } + save->sev_features = sev->vmsa_features; pr_debug("Virtual Machine Save Area (VMSA):\n"); print_hex_dump_debug("", DUMP_PREFIX_NONE, 16, 1, save, sizeof(*save), false); @@ -1685,6 +1697,7 @@ static void sev_migrate_from(struct kvm *dst_kvm, struct kvm *src_kvm) dst->pages_locked = src->pages_locked; dst->enc_context_owner = src->enc_context_owner; dst->es_active = src->es_active; + dst->vmsa_features = src->vmsa_features; src->asid = 0; src->active = false; @@ -3057,7 +3070,7 @@ static void sev_es_init_vmcb(struct vcpu_svm *svm) svm_set_intercept(svm, TRAP_CR8_WRITE); vmcb->control.intercepts[INTERCEPT_DR] = 0; - if (!sev_es_debug_swap_enabled) { + if (!sev_vcpu_has_debug_swap(svm)) { vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_READ); vmcb_set_intercept(&vmcb->control, INTERCEPT_DR7_WRITE); recalc_intercepts(svm); @@ -3112,7 +3125,7 @@ void sev_es_vcpu_reset(struct vcpu_svm *svm) sev_enc_bit)); } -void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa) +void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa) { /* * All host state for SEV-ES guests is categorized into three swap types @@ -3140,7 +3153,7 @@ void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa) * the CPU (Type-B). If DebugSwap is disabled/unsupported, the CPU both * saves and loads debug registers (Type-A). */ - if (sev_es_debug_swap_enabled) { + if (sev_vcpu_has_debug_swap(svm)) { hostsa->dr0 = native_get_debugreg(0); hostsa->dr1 = native_get_debugreg(1); hostsa->dr2 = native_get_debugreg(2); diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c index 450535d6757f..c22e87ebf0de 100644 --- a/arch/x86/kvm/svm/svm.c +++ b/arch/x86/kvm/svm/svm.c @@ -1523,7 +1523,7 @@ static void svm_prepare_switch_to_guest(struct kvm_vcpu *vcpu) struct sev_es_save_area *hostsa; hostsa = (struct sev_es_save_area *)(page_address(sd->save_area) + 0x400); - sev_es_prepare_switch_to_guest(hostsa); + sev_es_prepare_switch_to_guest(svm, hostsa); } if (tsc_scaling) diff --git a/arch/x86/kvm/svm/svm.h b/arch/x86/kvm/svm/svm.h index 864fac367424..b7707514d042 100644 --- a/arch/x86/kvm/svm/svm.h +++ b/arch/x86/kvm/svm/svm.h @@ -85,6 +85,7 @@ struct kvm_sev_info { unsigned long pages_locked; /* Number of pages locked */ struct list_head regions_list; /* List of registered regions */ u64 ap_jump_table; /* SEV-ES AP Jump Table address */ + u64 vmsa_features; struct kvm *enc_context_owner; /* Owner of copied encryption context */ struct list_head mirror_vms; /* List of VMs mirroring */ struct list_head mirror_entry; /* Use as a list entry of mirrors */ @@ -684,7 +685,7 @@ int sev_handle_vmgexit(struct kvm_vcpu *vcpu); int sev_es_string_io(struct vcpu_svm *svm, int size, unsigned int port, int in); void sev_es_vcpu_reset(struct vcpu_svm *svm); void sev_vcpu_deliver_sipi_vector(struct kvm_vcpu *vcpu, u8 vector); -void sev_es_prepare_switch_to_guest(struct sev_es_save_area *hostsa); +void sev_es_prepare_switch_to_guest(struct vcpu_svm *svm, struct sev_es_save_area *hostsa); void sev_es_unmap_ghcb(struct vcpu_svm *svm); /* These symbols are used in common code and are stubbed below. */ -- 2.43.0