From: Alexey Kardashevskiy <aik@ozlabs.ru>
To: David Gibson <david@gibson.dropbear.id.au>, groug@kaod.org, clg@kaod.org
Cc: lvivier@redhat.com, qemu-ppc@nongnu.org, qemu-devel@nongnu.org
Subject: Re: [for-5.0 2/4] spapr: Don't attempt to clamp RMA to VRMA constraint
Date: Tue, 3 Dec 2019 14:33:30 +1100 [thread overview]
Message-ID: <d2784a5f-5410-5eee-14f8-d9157946b8b0@ozlabs.ru> (raw)
In-Reply-To: <20191129013504.145455-3-david@gibson.dropbear.id.au>
On 29/11/2019 12:35, David Gibson wrote:
> The Real Mode Area (RMA) is the part of memory which a guest can access
> when in real (MMU off) mode. Of course, for a guest under KVM, the MMU
> isn't really turned off, it's just in a special translation mode - Virtual
> Real Mode Area (VRMA) - which looks like real mode in guest mode.
>
> The mechanics of how this works when in Hashed Page Table (HPT) mode, put
> a constraint on the size of the RMA, which depends on the size of the HPT.
> So, the latter part of spapr_setup_hpt_and_vrma() clamps the RMA we
> advertise to the guest based on this VRMA limit.
>
> There are several things wrong with this:
> 1) spapr_setup_hpt_and_vrma() doesn't actually clamp, it takes the minimum
> of Node 0 memory size and the VRMA limit. That will *often* work the
> same as clamping, but there can be other constraints on RMA size which
> supersede Node 0 memory size. We have real bugs caused by this
> (currently worked around in the guest kernel)
> 2) Some callers of spapr_setup_hpt_and_vrma() are in a situation where
> we're past the point that we can actually advertise an RMA limit to the
> guest
> 3) But most fundamentally, the VRMA limit depends on host configuration
> (page size) which shouldn't be visible to the guest, but this partially
> exposes it. This can cause problems with migration in certain edge
> cases, although we will mostly get away with it.
>
> In practice, this clamping is almost never applied anyway. With 64kiB
> pages and the normal rules for sizing of the HPT, the theoretical VRMA
> limit will be 4x(guest memory size) and so never hit. It will hit with
> 4kiB pages, where it will be (guest memory size)/4. However all mainstream
> distro kernels for POWER have used a 64kiB page size for at least 10 years.
>
> So, simply replace this logic with a check that the RMA we've calculated
> based only on guest visible configuration will fit within the host implied
> VRMA limit. This can break if running HPT guests on a host kernel with
> 4kiB page size. As noted that's very rare. There also exist several
> possible workarounds:
> * Change the host kernel to use 64kiB pages
> * Use radix MMU (RPT) guests instead of HPT
> * Use 64kiB hugepages on the host to back guest memory
> * Increase the guest memory size so that the RMA hits one of the fixed
> limits before the RMA limit. This is relatively easy on POWER8 which
> has a 16GiB limit, harder on POWER9 which has a 1TiB limit.
> * Decrease guest memory size so that it's below the lower bound on VRMA
> limit (minimum HPT size is 256kiB, giving a minimum VRAM of 8MiB).
> Difficult in practice since modern guests tend to want 1-2GiB.
> * Use a guest NUMA configuration which artificially constrains the RMA
> within the VRMA limit (the RMA must always fit within Node 0).
>
> Previously, on KVM, we also temporarily reduced the rma_size to 256M so
> that the we'd load the kernel and initrd safely, regardless of the VRMA
> limit. This was a) confusing, b) could significantly limit the size of
> images we could load and c) introduced a behavioural difference between
> KVM and TCG. So we remove that as well.
>
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---
> hw/ppc/spapr.c | 28 ++++++++++------------------
> hw/ppc/spapr_hcall.c | 4 ++--
> include/hw/ppc/spapr.h | 3 +--
> 3 files changed, 13 insertions(+), 22 deletions(-)
>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 069bd04a8d..52c39daa99 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1618,7 +1618,7 @@ void spapr_reallocate_hpt(SpaprMachineState *spapr, int shift,
> spapr_set_all_lpcrs(0, LPCR_HR | LPCR_UPRT);
> }
>
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> +void spapr_setup_hpt(SpaprMachineState *spapr)
> {
> int hpt_shift;
>
> @@ -1634,10 +1634,16 @@ void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr)
> }
> spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
>
> - if (spapr->vrma_adjust) {
> + if (kvm_enabled()) {
> hwaddr vrma_limit = kvmppc_vrma_limit(spapr->htab_shift);
>
> - spapr->rma_size = MIN(spapr_node0_size(MACHINE(spapr)), vrma_limit);
> + /* Check our RMA fits in the possible VRMA */
> + if (vrma_limit < spapr->rma_size) {
> + error_report("Unable to create %" HWADDR_PRIu
> + "MiB RMA (VRMA only allows %" HWADDR_PRIu "MiB",
> + spapr->rma_size / MiB, vrma_limit / MiB);
> + exit(EXIT_FAILURE);
> + }
> }
> }
>
> @@ -1676,7 +1682,7 @@ static void spapr_machine_reset(MachineState *machine)
> spapr->patb_entry = PATE1_GR;
> spapr_set_all_lpcrs(LPCR_HR | LPCR_UPRT, LPCR_HR | LPCR_UPRT);
> } else {
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
>
> qemu_devices_reset();
> @@ -2711,20 +2717,6 @@ static void spapr_machine_init(MachineState *machine)
>
> spapr->rma_size = node0_size;
>
> - /* With KVM, we don't actually know whether KVM supports an
> - * unbounded RMA (PR KVM) or is limited by the hash table size
> - * (HV KVM using VRMA), so we always assume the latter
> - *
> - * In that case, we also limit the initial allocations for RTAS
> - * etc... to 256M since we have no way to know what the VRMA size
> - * is going to be as it depends on the size of the hash table
> - * which isn't determined yet.
> - */
> - if (kvm_enabled()) {
> - spapr->vrma_adjust = 1;
> - spapr->rma_size = MIN(spapr->rma_size, 0x10000000);
> - }
> -
> /* Actually we don't support unbounded RMA anymore since we added
> * proper emulation of HV mode. The max we can get is 16G which
> * also happens to be what we configure for PAPR mode so make sure
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 140f05c1c6..372ba8bd1c 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1456,7 +1456,7 @@ static void spapr_check_setup_free_hpt(SpaprMachineState *spapr,
> spapr_free_hpt(spapr);
> } else if (!(patbe_new & PATE1_GR)) {
> /* RADIX->HASH || NOTHING->HASH : Allocate HPT */
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
> return;
> }
> @@ -1772,7 +1772,7 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
> * (because the guest isn't going to use radix) then set it up here. */
> if ((spapr->patb_entry & PATE1_GR) && !guest_radix) {
> /* legacy hash or new hash: */
> - spapr_setup_hpt_and_vrma(spapr);
> + spapr_setup_hpt(spapr);
> }
> spapr->cas_reboot =
> (spapr_h_cas_compose_response(spapr, args[1], args[2],
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index d5ab5ea7b2..85c33560c3 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -154,7 +154,6 @@ struct SpaprMachineState {
> SpaprPendingHpt *pending_hpt; /* in-progress resize */
>
> hwaddr rma_size;
> - int vrma_adjust;
> uint32_t fdt_size;
> uint32_t fdt_initial_size;
> void *fdt_blob;
> @@ -772,7 +771,7 @@ int spapr_h_cas_compose_response(SpaprMachineState *sm,
> target_ulong addr, target_ulong size,
> SpaprOptionVector *ov5_updates);
> void close_htab_fd(SpaprMachineState *spapr);
> -void spapr_setup_hpt_and_vrma(SpaprMachineState *spapr);
> +void spapr_setup_hpt(SpaprMachineState *spapr);
> void spapr_free_hpt(SpaprMachineState *spapr);
> SpaprTceTable *spapr_tce_new_table(DeviceState *owner, uint32_t liobn);
> void spapr_tce_table_enable(SpaprTceTable *tcet,
>
--
Alexey
next prev parent reply other threads:[~2019-12-03 3:34 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2019-11-29 1:35 [for-5.0 0/4] Fixes for RMA size calculation David Gibson
2019-11-29 1:35 ` [for-5.0 1/4] spapr,ppc: Simplify signature of kvmppc_rma_size() David Gibson
2019-12-02 7:45 ` Cédric Le Goater
2019-12-02 8:55 ` Greg Kurz
2019-12-02 18:18 ` Cédric Le Goater
2019-12-10 3:58 ` David Gibson
2019-12-03 3:32 ` Alexey Kardashevskiy
2019-11-29 1:35 ` [for-5.0 2/4] spapr: Don't attempt to clamp RMA to VRMA constraint David Gibson
2019-12-03 3:33 ` Alexey Kardashevskiy [this message]
2019-11-29 1:35 ` [for-5.0 3/4] spapr: Clean up RMA size calculation David Gibson
2019-12-03 3:44 ` Alexey Kardashevskiy
2019-12-03 5:06 ` Alexey Kardashevskiy
2019-12-10 5:44 ` David Gibson
2019-12-10 5:29 ` David Gibson
2019-11-29 1:35 ` [for-5.0 4/4] spapr: Correct clamping of RMA to Node 0 size David Gibson
2019-12-03 4:18 ` Alexey Kardashevskiy
2019-12-10 5:21 ` David Gibson
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=d2784a5f-5410-5eee-14f8-d9157946b8b0@ozlabs.ru \
--to=aik@ozlabs.ru \
--cc=clg@kaod.org \
--cc=david@gibson.dropbear.id.au \
--cc=groug@kaod.org \
--cc=lvivier@redhat.com \
--cc=qemu-devel@nongnu.org \
--cc=qemu-ppc@nongnu.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).