Re: [for-5.0 2/4] spapr: Don't attempt to clamp RMA to VRMA constraint

From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>, groug@kaod.org, clg@kaod.org
Cc: lvivier@redhat.com, qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [for-5.0 2/4] spapr: Don't attempt to clamp RMA to VRMA constraint
Date: Tue, 3 Dec 2019 14:33:30 +1100	[thread overview]
Message-ID: <d2784a5f-5410-5eee-14f8-d9157946b8b0@ozlabs.ru> (raw)
In-Reply-To: <20191129013504.145455-3-david@gibson.dropbear.id.au>


On 29/11/2019 12:35, David Gibson wrote:
> The Real Mode Area (RMA) is the part of memory which a guest can access
> when in real (MMU off) mode.  Of course, for a guest under KVM, the MMU
> isn't really turned off, it's just in a special translation mode - Virtual
> Real Mode Area (VRMA) - which looks like real mode in guest mode.
> 
> The mechanics of how this works when in Hashed Page Table (HPT) mode, put
> a constraint on the size of the RMA, which depends on the size of the HPT.
> So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA we
> advertise to the guest based on this VRMA limit.
> 
> There are several things wrong with this:
>  1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
>     of Node 0 memory size and the VRMA limit.  That will *often* work the
>     same as clamping, but there can be other constraints on RMA size which
>     supersede Node 0 memory size.  We have real bugs caused by this
>     (currently worked around in the guest kernel)
>  2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
>     we're past the point that we can actually advertise an RMA limit to the
>     guest
>  3) But most fundamentally, the VRMA limit depends on host configuration
>     (page size) which shouldn't be visible to the guest, but this partially
>     exposes it.  This can cause problems with migration in certain edge
>     cases, although we will mostly get away with it.
> 
> In practice, this clamping is almost never applied anyway.  With 64kiB
> pages and the normal rules for sizing of the HPT, the theoretical VRMA
> limit will be 4x(guest memory size) and so never hit.  It will hit with
> 4kiB pages, where it will be (guest memory size)/4.  However all mainstream
> distro kernels for POWER have used a 64kiB page size for at least 10 years.
> 
> So, simply replace this logic with a check that the RMA we've calculated
> based only on guest visible configuration will fit within the host implied
> VRMA limit.  This can break if running HPT guests on a host kernel with
> 4kiB page size.  As noted that's very rare.  There also exist several
> possible workarounds:
>   * Change the host kernel to use 64kiB pages
>   * Use radix MMU (RPT) guests instead of HPT
>   * Use 64kiB hugepages on the host to back guest memory
>   * Increase the guest memory size so that the RMA hits one of the fixed
>     limits before the RMA limit.  This is relatively easy on POWER8 which
>     has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
>   * Decrease guest memory size so that it's below the lower bound on VRMA
>     limit (minimum HPT size is 256kiB, giving a minimum VRAM of 8MiB).
>     Difficult in practice since modern guests tend to want 1-2GiB.
>   * Use a guest NUMA configuration which artificially constrains the RMA
>     within the VRMA limit (the RMA must always fit within Node 0).
> 
> Previously, on KVM, we also temporarily reduced the rma_size to 256M so
> that the we'd load the kernel and initrd safely, regardless of the VRMA
> limit.  This was a) confusing, b) could significantly limit the size of
> images we could load and c) introduced a behavioural difference between
> KVM and TCG.  So we remove that as well.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
>  hw/ppc/spapr.c         | 28 ++++++++++------------------
>  hw/ppc/spapr_hcall.c   |  4 ++--
>  include/hw/ppc/spapr.h |  3 +--
>  3 files changed, 13 insertions(+), 22 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 069bd04a8d..52c39daa99 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1618,7 +1618,7 @@ void spapr_reallocate_hpt(SpaprMachineState *spapr, int shift,
>      spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
>  }
>  
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> +void spapr_setup_hpt(SpaprMachineState *spapr)
>  {
>      int hpt_shift;
>  
> @@ -1634,10 +1634,16 @@ void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
>      }
>      spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
>  
> -    if (spapr->vrma_adjust) {
> +    if (kvm_enabled()) {
>          hwaddr vrma_limit = kvmppc_vrma_limit(spapr->htab_shift);
>  
> -        spapr->rma_size = MIN(spapr_node0_size(MACHINE(spapr)), vrma_limit);
> +        /* Check our RMA fits in the possible VRMA */
> +        if (vrma_limit < spapr->rma_size) {
> +            error_report("Unable to create %" HWADDR_PRIu
> +                         "MiB RMA (VRMA only allows %" HWADDR_PRIu "MiB",
> +                         spapr->rma_size / MiB, vrma_limit / MiB);
> +            exit(EXIT_FAILURE);
> +        }
>      }
>  }
>  
> @@ -1676,7 +1682,7 @@ static void spapr_machine_reset(MachineState *machine)
>          spapr->patb_entry = PATE1_GR;
>          spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT);
>      } else {
> -        spapr_setup_hpt_and_vrma(spapr);
> +        spapr_setup_hpt(spapr);
>      }
>  
>      qemu_devices_reset();
> @@ -2711,20 +2717,6 @@ static void spapr_machine_init(MachineState *machine)
>  
>      spapr->rma_size = node0_size;
>  
> -    /* With KVM, we don't actually know whether KVM supports an
> -     * unbounded RMA (PR KVM) or is limited by the hash table size
> -     * (HV KVM using VRMA), so we always assume the latter
> -     *
> -     * In that case, we also limit the initial allocations for RTAS
> -     * etc... to 256M since we have no way to know what the VRMA size
> -     * is going to be as it depends on the size of the hash table
> -     * which isn't determined yet.
> -     */
> -    if (kvm_enabled()) {
> -        spapr->vrma_adjust = 1;
> -        spapr->rma_size = MIN(spapr->rma_size, 0x10000000);
> -    }
> -
>      /* Actually we don't support unbounded RMA anymore since we added
>       * proper emulation of HV mode. The max we can get is 16G which
>       * also happens to be what we configure for PAPR mode so make sure
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 140f05c1c6..372ba8bd1c 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1456,7 +1456,7 @@ static void spapr_check_setup_free_hpt(SpaprMachineState *spapr,
>          spapr_free_hpt(spapr);
>      } else if (!(patbe_new & PATE1_GR)) {
>          /* RADIX->HASH || NOTHING->HASH : Allocate HPT */
> -        spapr_setup_hpt_and_vrma(spapr);
> +        spapr_setup_hpt(spapr);
>      }
>      return;
>  }
> @@ -1772,7 +1772,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
>           * (because the guest isn't going to use radix) then set it up here. */
>          if ((spapr->patb_entry & PATE1_GR) && !guest_radix) {
>              /* legacy hash or new hash: */
> -            spapr_setup_hpt_and_vrma(spapr);
> +            spapr_setup_hpt(spapr);
>          }
>          spapr->cas_reboot =
>              (spapr_h_cas_compose_response(spapr, args[1], args[2],
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index d5ab5ea7b2..85c33560c3 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -154,7 +154,6 @@ struct SpaprMachineState {
>      SpaprPendingHpt *pending_hpt; /* in-progress resize */
>  
>      hwaddr rma_size;
> -    int vrma_adjust;
>      uint32_t fdt_size;
>      uint32_t fdt_initial_size;
>      void *fdt_blob;
> @@ -772,7 +771,7 @@ int spapr_h_cas_compose_response(SpaprMachineState *sm,
>                                   target_ulong addr, target_ulong size,
>                                   SpaprOptionVector *ov5_updates);
>  void close_htab_fd(SpaprMachineState *spapr);
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> +void spapr_setup_hpt(SpaprMachineState *spapr);
>  void spapr_free_hpt(SpaprMachineState *spapr);
>  SpaprTceTable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
>  void spapr_tce_table_enable(SpaprTceTable *tcet,
> 

-- 
Alexey