Re: [RFC PATCH 1/2] spapr: Report correct GTSE support via ov5

From: Daniel Henrique Barboza <danielhb413@gmail.com>
To: Fabiano Rosas <farosas@linux.ibm.com>, qemu-devel@nongnu.org
Cc: aneesh.kumar@linux.ibm.com, qemu-ppc@nongnu.org, clg@kaod.org,
	npiggin@gmail.com, david@gibson.dropbear.id.au
Subject: Re: [RFC PATCH 1/2] spapr: Report correct GTSE support via ov5
Date: Wed, 9 Mar 2022 16:09:04 -0300	[thread overview]
Message-ID: <d9cb6b05-9c06-8081-ee7f-a44f3a85848b@gmail.com> (raw)
In-Reply-To: <20220309012400.2527157-1-farosas@linux.ibm.com>

On 3/8/22 22:23, Fabiano Rosas wrote:
> QEMU reports MMU support to the guest via the ibm,architecture-vec-5
> property of the /chosen node. Byte number 26 specifies Radix Table
> Expansions, currently only GTSE (Guest Translation Shootdown
> Enable). This feature determines whether the tlbie instruction (and
> others) are HV privileged.
> 
> Up until now, we always reported GTSE=1 to guests. Even after the
> support for GTSE=0 was added. As part of that support, a kernel
> command line radix_hcall_invalidate=on was introduced that overrides
> the GTSE value received via CAS. So a guest can run with GTSE=0 and
> use the H_RPT_INVALIDATE hcall instead of tlbie.
> 
> In this scenario, having GTSE always set to 1 by QEMU leads to a crash
> when running nested KVM guests because KVM does not allow a nested
> hypervisor to set GTSE support for its nested guests. So a nested
> guest always uses the same value for LPCR_GTSE as its HV. Since the
> nested HV disabled GTSE, but the L2 QEMU always reports GTSE=1, we run
> into a crash when:
> 
> L1 LPCR_GTSE=0
> L2 LPCR_GTSE=0
> L2 CAS GTSE=1
> 
> The nested guest will run 'tlbie' and crash because the HW looks at
> LPCR_GTSE, which is clear.
> 
> Having GTSE disabled in the L1 and enabled in the L2 is not an option
> because the whole purpose of GTSE is to disallow access to tlbie and
> we cannot allow L1 to spawn L2s that can access features that L1
> itself cannot.
> 
> We also cannot have the guest check the LPCR bit, because LPCR is
> HV-privileged.
> 
> So this patch goes through the most intuitive route which is to have
> QEMU ask KVM about GTSE support and advertise the correct value to the
> guest. A new KVM_CAP_PPC_GTSE capability is being added.
> 
> TCG continues to always enable GTSE.
> 
> Signed-off-by: Fabiano Rosas <farosas@linux.ibm.com>
> ---

I'm not sure if I fully understood the situation, so let me recap. Once upon a time,
QEMU advertised GTSE=1 and the host would never advertise other value, and everyone
was happy.

The host started to support GTSE=0, but QEMU kept advertising GTSE=1 regardless, and no
KVM GTSE cap was added to reflect the host support. I'll then assume that:

- all guests would break if running in a GTSE=0 host prior to rpt_invalidate support (which
is necessary to allow the guest to run in GTSE=0)

- apparently no one ever tried to run a KVM guest in a GTSE=0 host, so no bugs were opened

After commit 82123b756a1a2f1 (target/ppc: Support for H_RPT_INVALIDATE hcall) we added
cap-rpt-invalidate. I didn't follow the discussions of this cap but it seems like, as with
almost every other cap we have, there would be a migration problem for a guest that was in
a rpt_invalidate aware host to migrate to another where this wouldn't be true, and the cap
solves that.

What I'm not following is why, even after having cap-rpt-invalidate, we are still "lying"
about the GTSE=1 regardless of what the host supports. We could've added the GTSE KVM cap
at the same time rpt_invalidate was introduced, and guests that want to ignore this setting
can use the cap to bypass it.

In the end this patch is a needed fix IMHO. My confusion is why we're doing this just now.

The patch itself LGTM.

Thanks,

Daniel

>   hw/ppc/spapr.c       | 38 +++++++++++++++++++++++++++++++-------
>   target/ppc/kvm.c     |  8 ++++++++
>   target/ppc/kvm_ppc.h |  6 ++++++
>   3 files changed, 45 insertions(+), 7 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 4cc204f90d..3e95a1831f 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -971,7 +971,7 @@ static void spapr_dt_ov5_platform_support(SpaprMachineState *spapr, void *fdt,
>           23, 0x00, /* XICS / XIVE mode */
>           24, 0x00, /* Hash/Radix, filled in below. */
>           25, 0x00, /* Hash options: Segment Tables == no, GTSE == no. */
> -        26, 0x40, /* Radix options: GTSE == yes. */
> +        26, 0x00, /* Radix options, filled in below. */
>       };
>   
>       if (spapr->irq->xics && spapr->irq->xive) {
> @@ -1000,10 +1000,16 @@ static void spapr_dt_ov5_platform_support(SpaprMachineState *spapr, void *fdt,
>           } else {
>               val[3] = 0x00; /* Hash */
>           }
> +
> +        if (kvmppc_has_cap_gtse()) {
> +            val[7] = 0x40 /* OV5_MMU_RADIX_GTSE */;
> +        }
>       } else {
>           /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
>           val[3] = 0xC0;
> +        val[7] = 0x40 /* OV5_MMU_RADIX_GTSE */;
>       }
> +
>       _FDT(fdt_setprop(fdt, chosen, "ibm,arch-vec-5-platform-support",
>                        val, sizeof(val)));
>   }
> @@ -2824,14 +2830,32 @@ static void spapr_machine_init(MachineState *machine)
>       /* Init numa_assoc_array */
>       spapr_numa_associativity_init(spapr, machine);
>   
> -    if ((!kvm_enabled() || kvmppc_has_cap_mmu_radix()) &&
> -        ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
> +    if (ppc_type_check_compat(machine->cpu_type, CPU_POWERPC_LOGICAL_3_00, 0,
>                                 spapr->max_compat_pvr)) {
> -        spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_300);
> -        /* KVM and TCG always allow GTSE with radix... */
> -        spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_GTSE);
> +
> +        /* TCG always supports Radix w/ GTSE */
> +        if (!kvm_enabled()) {
> +            spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_300);
> +            spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_GTSE);
> +        } else {
> +            if (kvmppc_has_cap_mmu_radix()) {
> +                spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_300);
> +            }
> +
> +            /*
> +             * Only disable Guest Translation Shootdown if KVM
> +             * supports the H_RPT_INVALIDATE hypercall, otherwise we'd
> +             * leave the guest with no way to make TLB invalidations.
> +             */
> +            if (kvmppc_has_cap_rpt_invalidate()) {
> +                if (kvmppc_has_cap_gtse()) {
> +                    spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_GTSE);
> +                }
> +            } else {
> +                spapr_ovec_set(spapr->ov5, OV5_MMU_RADIX_GTSE);
> +            }
> +        }
>       }
> -    /* ... but not with hash (currently). */
>   
>       if (kvm_enabled()) {
>           /* Enable H_LOGICAL_CI_* so SLOF can talk to in-kernel devices */
> diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
> index dc93b99189..91582c4b15 100644
> --- a/target/ppc/kvm.c
> +++ b/target/ppc/kvm.c
> @@ -90,6 +90,7 @@ static int cap_ppc_nested_kvm_hv;
>   static int cap_large_decr;
>   static int cap_fwnmi;
>   static int cap_rpt_invalidate;
> +static int cap_gtse;
>   
>   static uint32_t debug_inst_opcode;
>   
> @@ -154,6 +155,8 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>       }
>   
>       cap_rpt_invalidate = kvm_vm_check_extension(s, KVM_CAP_PPC_RPT_INVALIDATE);
> +    cap_gtse = kvm_vm_check_extension(s, KVM_CAP_PPC_GTSE);
> +
>       kvm_ppc_register_host_cpu_type();
>   
>       return 0;
> @@ -2397,6 +2400,11 @@ bool kvmppc_has_cap_mmu_hash_v3(void)
>       return cap_mmu_hash_v3;
>   }
>   
> +bool kvmppc_has_cap_gtse(void)
> +{
> +    return cap_gtse;
> +}
> +
>   static bool kvmppc_power8_host(void)
>   {
>       bool ret = false;
> diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
> index ee9325bf9a..7d6980edb7 100644
> --- a/target/ppc/kvm_ppc.h
> +++ b/target/ppc/kvm_ppc.h
> @@ -63,6 +63,7 @@ bool kvmppc_has_cap_fixup_hcalls(void);
>   bool kvmppc_has_cap_htm(void);
>   bool kvmppc_has_cap_mmu_radix(void);
>   bool kvmppc_has_cap_mmu_hash_v3(void);
> +bool kvmppc_has_cap_gtse(void);
>   bool kvmppc_has_cap_xive(void);
>   int kvmppc_get_cap_safe_cache(void);
>   int kvmppc_get_cap_safe_bounds_check(void);
> @@ -343,6 +344,11 @@ static inline bool kvmppc_has_cap_mmu_hash_v3(void)
>       return false;
>   }
>   
> +static inline bool kvmppc_has_cap_gtse(void)
> +{
> +    return false;
> +}
> +
>   static inline bool kvmppc_has_cap_xive(void)
>   {
>       return false;