* [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
@ 2018-04-19  6:29 David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property David Gibson
                   ` (7 more replies)
  0 siblings, 8 replies; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

Currently the "pseries" machine type will (usually) advertise
different pagesizes to the guest when running under KVM and TCG, which
is not how things are supposed to work.

This comes from poor handling of hardware limitations which mean that
under KVM HV the guest is unable to use pagesizes larger than those
backing the guest's RAM on the host side.

The new scheme turns things around by having an explicit machine
parameter controlling the largest page size that the guest is allowed
to use.  This limitation applies regardless of accelerator.  When
we're running on KVM HV we ensure that our backing pages are adequate
to supply the requested guest page sizes, rather than adjusting the
guest page sizes based on what KVM can supply.

This means that in order to use hugepages in a PAPR guest it's
necessary to add a "cap-hpt-mps=24" machine parameter as well as
setting the mem-path correctly.  This is a bit more work on the user
and/or management side, but results in consistent behaviour so I think
it's worth it.
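
For example, something like this (sketch only; the hugepage mount point
and memory size are placeholders, the cap name is the one added in
patch 1):

  qemu-system-ppc64 -machine pseries,cap-hpt-mps=24 \
      -m 2G -mem-path /dev/hugepages ...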

Longer term, we can also use this parameter to control IOMMU page
sizes.  However, the restrictions there are even more complicated,
depending on the intersection of guest, host kernel and hardware
capabilities.

This applies on top of my recent series cleaning up the PAPR mode
initialization, which in turn applies on top of my ppc-for-2.13 tree.

David Gibson (7):
  spapr: Maximum (HPT) pagesize property
  spapr: Use maximum page size capability to simplify memory backend
    checking
  target/ppc: Add ppc_hash64_filter_pagesizes()
  spapr: Add cpu_apply hook to capabilities
  spapr: Limit available pagesizes to provide a consistent guest
    environment
  spapr: Don't rewrite mmu capabilities in KVM mode
  spapr_pci: Remove unhelpful pagesize warning

 hw/ppc/spapr.c          |  18 +++---
 hw/ppc/spapr_caps.c     | 125 ++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_cpu_core.c |   4 ++
 hw/ppc/spapr_pci.c      |   7 ---
 include/hw/ppc/spapr.h  |   8 ++-
 target/ppc/kvm.c        | 149 ++++++++++++++++++++++++------------------------
 target/ppc/kvm_ppc.h    |  11 +++-
 target/ppc/mmu-hash64.c |  59 +++++++++++++++++++
 target/ppc/mmu-hash64.h |   3 +
 9 files changed, 287 insertions(+), 97 deletions(-)

-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-05-02 21:06   ` Murilo Opsfelder Araujo
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 2/7] spapr: Use maximum page size capability to simplify memory backend checking David Gibson
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

The way the POWER Hash Page Table (HPT) MMU is virtualized by KVM HV means
that every page that the guest puts in the pagetables must be truly
physically contiguous, not just GPA-contiguous.  In effect this means that
an HPT guest can't use any pagesizes greater than the host page size used
to back its memory.

At present we handle this by changing what we advertise to the guest based
on the backing pagesizes.  This is pretty bad, because it means the guest
sees a different environment depending on what should be host configuration
details.

As a start on fixing this, we add a new capability parameter to the pseries
machine type which gives the maximum allowed pagesize for an HPT guest (as
a shift).  For now we just create and validate the parameter without making
it do anything.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c         |  1 +
 hw/ppc/spapr_caps.c    | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |  4 +++-
 3 files changed, 57 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index bdf72e1e89..36e41aff71 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3876,6 +3876,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
     smc->default_caps.caps[SPAPR_CAP_CFPC] = SPAPR_CAP_BROKEN;
     smc->default_caps.caps[SPAPR_CAP_SBBC] = SPAPR_CAP_BROKEN;
     smc->default_caps.caps[SPAPR_CAP_IBS] = SPAPR_CAP_BROKEN;
+    smc->default_caps.caps[SPAPR_CAP_HPT_MPS] = 16; /* Allow 64kiB pages */
     spapr_caps_add_properties(smc, &error_abort);
 }
 
diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
index 531e145114..cbc41f5b20 100644
--- a/hw/ppc/spapr_caps.c
+++ b/hw/ppc/spapr_caps.c
@@ -27,6 +27,7 @@
 #include "qapi/visitor.h"
 #include "sysemu/hw_accel.h"
 #include "target/ppc/cpu.h"
+#include "target/ppc/mmu-hash64.h"
 #include "cpu-models.h"
 #include "kvm_ppc.h"
 
@@ -142,6 +143,39 @@ out:
     g_free(val);
 }
 
+static void spapr_cap_get_int(Object *obj, Visitor *v, const char *name,
+                               void *opaque, Error **errp)
+{
+    sPAPRCapabilityInfo *cap = opaque;
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+    int64_t value = spapr_get_cap(spapr, cap->index);
+
+    visit_type_int(v, name, &value, errp);
+}
+
+static void spapr_cap_set_int(Object *obj, Visitor *v, const char *name,
+                               void *opaque, Error **errp)
+{
+    sPAPRCapabilityInfo *cap = opaque;
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+    int64_t value;
+    Error *local_err = NULL;
+
+    visit_type_int(v, name, &value, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    if ((value < 0) || (value > 255)) {
+        error_setg(errp, "Value for %s out of range (0..255)", name);
+        return;
+    }
+
+    spapr->cmd_line_caps[cap->index] = true;
+    spapr->eff.caps[cap->index] = value;
+}
+
 static void cap_htm_apply(sPAPRMachineState *spapr, uint8_t val, Error **errp)
 {
     if (!val) {
@@ -265,6 +299,16 @@ static void cap_safe_indirect_branch_apply(sPAPRMachineState *spapr,
 
 #define VALUE_DESC_TRISTATE     " (broken, workaround, fixed)"
 
+static void cap_hpt_mps_apply(sPAPRMachineState *spapr,
+                              uint8_t val, Error **errp)
+{
+    if (val < 12) {
+        error_setg(errp, "Require at least 4kiB pages (cap-hpt-mps >= 12)");
+    } else if (val < 16) {
+        warn_report("Many PAPR guests require 64kiB pages (cap-hpt-mps >= 16)");
+    }
+}
+
 sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
     [SPAPR_CAP_HTM] = {
         .name = "htm",
@@ -324,6 +368,15 @@ sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
         .possible = &cap_ibs_possible,
         .apply = cap_safe_indirect_branch_apply,
     },
+    [SPAPR_CAP_HPT_MPS] = {
+        .name = "hpt-mps",
+        .description = "Maximum page shift for Hash Page Table guests (12, 16, 24, 34)",
+        .index = SPAPR_CAP_HPT_MPS,
+        .get = spapr_cap_get_int,
+        .set = spapr_cap_set_int,
+        .type = "int",
+        .apply = cap_hpt_mps_apply,
+    },
 };
 
 static sPAPRCapabilities default_caps_with_cpu(sPAPRMachineState *spapr,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d60b7c6d7a..60ed3a5657 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -66,8 +66,10 @@ typedef enum {
 #define SPAPR_CAP_SBBC                  0x04
 /* Indirect Branch Serialisation */
 #define SPAPR_CAP_IBS                   0x05
+/* HPT Maximum Page Shift */
+#define SPAPR_CAP_HPT_MPS               0x06
 /* Num Caps */
-#define SPAPR_CAP_NUM                   (SPAPR_CAP_IBS + 1)
+#define SPAPR_CAP_NUM                   (SPAPR_CAP_HPT_MPS + 1)
 
 /*
  * Capability Values
-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 2/7] spapr: Use maximum page size capability to simplify memory backend checking
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes() David Gibson
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

The way we used to handle KVM allowable guest pagesizes for PAPR guests
required some convoluted checking of memory attached to the guest.

The allowable pagesizes advertised to the guest cpus depended on the memory
which was attached at boot, but then we needed to ensure that any memory
later hotplugged didn't change which pagesizes were allowed.

Now that we have an explicit machine option to control the allowable
maximum pagesize we can simplify this.  We just check all memory backends
against that declared pagesize.  We check base and cold-plugged memory at
reset time, and hotplugged memory at pre_plug() time.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c         | 17 +++++++----------
 hw/ppc/spapr_caps.c    | 20 ++++++++++++++++++++
 include/hw/ppc/spapr.h |  3 +++
 target/ppc/kvm.c       | 19 ++++++++++---------
 target/ppc/kvm_ppc.h   |  6 +++---
 5 files changed, 43 insertions(+), 22 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 36e41aff71..0a0fec4140 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3035,11 +3035,13 @@ out:
 static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
                                   Error **errp)
 {
+    sPAPRMachineState *spapr = SPAPR_MACHINE(hotplug_dev);
     PCDIMMDevice *dimm = PC_DIMM(dev);
     PCDIMMDeviceClass *ddc = PC_DIMM_GET_CLASS(dimm);
     MemoryRegion *mr;
     uint64_t size;
-    char *mem_dev;
+    Object *memdev;
+    hwaddr pagesize;
 
     mr = ddc->get_memory_region(dimm, errp);
     if (!mr) {
@@ -3053,15 +3055,10 @@ static void spapr_memory_pre_plug(HotplugHandler *hotplug_dev, DeviceState *dev,
         return;
     }
 
-    mem_dev = object_property_get_str(OBJECT(dimm), PC_DIMM_MEMDEV_PROP, NULL);
-    if (mem_dev && !kvmppc_is_mem_backend_page_size_ok(mem_dev)) {
-        error_setg(errp, "Memory backend has bad page size. "
-                   "Use 'memory-backend-file' with correct mem-path.");
-        goto out;
-    }
-
-out:
-    g_free(mem_dev);
+    memdev = object_property_get_link(OBJECT(dimm), PC_DIMM_MEMDEV_PROP,
+                                      &error_abort);
+    pagesize = host_memory_backend_pagesize(MEMORY_BACKEND(memdev));
+    spapr_check_pagesize(spapr, pagesize, errp);
 }
 
 struct sPAPRDIMMState {
diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
index cbc41f5b20..5762b88689 100644
--- a/hw/ppc/spapr_caps.c
+++ b/hw/ppc/spapr_caps.c
@@ -26,6 +26,7 @@
 #include "qapi/error.h"
 #include "qapi/visitor.h"
 #include "sysemu/hw_accel.h"
+#include "exec/ram_addr.h"
 #include "target/ppc/cpu.h"
 #include "target/ppc/mmu-hash64.h"
 #include "cpu-models.h"
@@ -299,6 +300,23 @@ static void cap_safe_indirect_branch_apply(sPAPRMachineState *spapr,
 
 #define VALUE_DESC_TRISTATE     " (broken, workaround, fixed)"
 
+void spapr_check_pagesize(sPAPRMachineState *spapr, hwaddr pagesize,
+                          Error **errp)
+{
+    hwaddr maxpagesize = (1ULL << spapr->eff.caps[SPAPR_CAP_HPT_MPS]);
+
+    if (!kvmppc_hpt_needs_host_contiguous_pages()) {
+        return;
+    }
+
+    if (maxpagesize > pagesize) {
+        error_setg(errp,
+                   "Can't support %"HWADDR_PRIu" kiB guest pages with %"
+                   HWADDR_PRIu" kiB host pages with this KVM implementation",
+                   maxpagesize >> 10, pagesize >> 10);
+    }
+}
+
 static void cap_hpt_mps_apply(sPAPRMachineState *spapr,
                               uint8_t val, Error **errp)
 {
@@ -307,6 +325,8 @@ static void cap_hpt_mps_apply(sPAPRMachineState *spapr,
     } else if (val < 16) {
         warn_report("Many PAPR guests require 64kiB pages (cap-hpt-mps >= 16)");
     }
+
+    spapr_check_pagesize(spapr, qemu_getrampagesize(), errp);
 }
 
 sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 60ed3a5657..a3d616c612 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -805,4 +805,7 @@ void spapr_caps_reset(sPAPRMachineState *spapr);
 void spapr_caps_add_properties(sPAPRMachineClass *smc, Error **errp);
 int spapr_caps_post_migration(sPAPRMachineState *spapr);
 
+void spapr_check_pagesize(sPAPRMachineState *spapr, hwaddr pagesize,
+                          Error **errp);
+
 #endif /* HW_SPAPR_H */
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index e4341d6ff7..defbb08931 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -490,12 +490,18 @@ static void kvm_fixup_page_sizes(PowerPCCPU *cpu)
     }
 }
 
-bool kvmppc_is_mem_backend_page_size_ok(const char *obj_path)
+bool kvmppc_hpt_needs_host_contiguous_pages(void)
 {
-    Object *mem_obj = object_resolve_path(obj_path, NULL);
-    long pagesize = host_memory_backend_pagesize(MEMORY_BACKEND(mem_obj));
+    PowerPCCPU *cpu = POWERPC_CPU(first_cpu);
+    static struct kvm_ppc_smmu_info smmu_info;
+
+    if (!kvm_enabled()) {
+        return false;
+    }
 
-    return pagesize >= max_cpu_page_size;
+    kvm_get_smmu_info(cpu, &smmu_info);
+
+    return !!(smmu_info.flags & KVM_PPC_PAGE_SIZES_REAL);
 }
 
 #else /* defined (TARGET_PPC64) */
@@ -504,11 +510,6 @@ static inline void kvm_fixup_page_sizes(PowerPCCPU *cpu)
 {
 }
 
-bool kvmppc_is_mem_backend_page_size_ok(const char *obj_path)
-{
-    return true;
-}
-
 #endif /* !defined (TARGET_PPC64) */
 
 unsigned long kvm_arch_vcpu_id(CPUState *cpu)
diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
index e2840e1d33..443fca0a4e 100644
--- a/target/ppc/kvm_ppc.h
+++ b/target/ppc/kvm_ppc.h
@@ -70,7 +70,7 @@ int kvmppc_resize_hpt_prepare(PowerPCCPU *cpu, target_ulong flags, int shift);
 int kvmppc_resize_hpt_commit(PowerPCCPU *cpu, target_ulong flags, int shift);
 bool kvmppc_pvr_workaround_required(PowerPCCPU *cpu);
 
-bool kvmppc_is_mem_backend_page_size_ok(const char *obj_path);
+bool kvmppc_hpt_needs_host_contiguous_pages(void);
 
 #else
 
@@ -222,9 +222,9 @@ static inline uint64_t kvmppc_rma_size(uint64_t current_size,
     return ram_size;
 }
 
-static inline bool kvmppc_is_mem_backend_page_size_ok(const char *obj_path)
+static inline bool kvmppc_hpt_needs_host_contiguous_pages(void)
 {
-    return true;
+    return false;
 }
 
 static inline bool kvmppc_has_cap_spapr_vfio(void)
-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes()
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 2/7] spapr: Use maximum page size capability to simplify memory backend checking David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-05-03 15:57   ` Murilo Opsfelder Araujo
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 4/7] spapr: Add cpu_apply hook to capabilities David Gibson
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

The paravirtualized PAPR platform sometimes needs to restrict the guest to
using only some of the page sizes actually supported by the host's MMU.
At the moment this is handled in KVM specific code, but for consistency we
want to apply the same limitations to all accelerators.

This patch makes a start on that by providing a helper function in the
cpu code to allow platform code to remove some of the cpu's page size
definitions via a caller-supplied callback.
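
To illustrate the intended usage, a hypothetical caller might look
something like this (sketch only -- the real spapr callback is added
later in this series):

  /* Hypothetical callback: keep only pagesizes up to a given shift */
  static bool limit_pagesize_cb(void *opaque, uint32_t seg_pshift,
                                uint32_t pshift)
  {
      unsigned maxshift = *(unsigned *)opaque;

      /* Returning true keeps this (segment shift, page shift) entry */
      return pshift <= maxshift;
  }

  /* ...then, from platform code, with maxshift = 16 for 64kiB: */
  ppc_hash64_filter_pagesizes(cpu, limit_pagesize_cb, &maxshift);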

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 target/ppc/mmu-hash64.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
 target/ppc/mmu-hash64.h |  3 +++
 2 files changed, 62 insertions(+)

diff --git a/target/ppc/mmu-hash64.c b/target/ppc/mmu-hash64.c
index a1db20e3a8..b6e62864fd 100644
--- a/target/ppc/mmu-hash64.c
+++ b/target/ppc/mmu-hash64.c
@@ -1165,3 +1165,62 @@ const PPCHash64Options ppc_hash64_opts_POWER7 = {
         },
     }
 };
+
+void ppc_hash64_filter_pagesizes(PowerPCCPU *cpu,
+                                 bool (*cb)(void *, uint32_t, uint32_t),
+                                 void *opaque)
+{
+    PPCHash64Options *opts = cpu->hash64_opts;
+    int i;
+    int n = 0;
+    bool ci_largepage = false;
+
+    assert(opts);
+
+    n = 0;
+    for (i = 0; i < ARRAY_SIZE(opts->sps); i++) {
+        PPCHash64SegmentPageSizes *sps = &opts->sps[i];
+        int j;
+        int m = 0;
+
+        assert(n <= i);
+
+        if (!sps->page_shift) {
+            break;
+        }
+
+        for (j = 0; j < ARRAY_SIZE(sps->enc); j++) {
+            PPCHash64PageSize *ps = &sps->enc[j];
+
+            assert(m <= j);
+            if (!ps->page_shift) {
+                break;
+            }
+
+            if (cb(opaque, sps->page_shift, ps->page_shift)) {
+                if (ps->page_shift == 16) {
+                    ci_largepage = true;
+                }
+                sps->enc[m++] = *ps;
+            }
+        }
+
+        /* Clear rest of the row */
+        for (j = m; j < ARRAY_SIZE(sps->enc); j++) {
+            memset(&sps->enc[j], 0, sizeof(sps->enc[j]));
+        }
+
+        if (m) {
+            n++;
+        }
+    }
+
+    /* Clear the rest of the table */
+    for (i = n; i < ARRAY_SIZE(opts->sps); i++) {
+        memset(&opts->sps[i], 0, sizeof(opts->sps[i]));
+    }
+
+    if (!ci_largepage) {
+        opts->flags &= ~PPC_HASH64_CI_LARGEPAGE;
+    }
+}
diff --git a/target/ppc/mmu-hash64.h b/target/ppc/mmu-hash64.h
index f23b78d787..1aa2453497 100644
--- a/target/ppc/mmu-hash64.h
+++ b/target/ppc/mmu-hash64.h
@@ -20,6 +20,9 @@ unsigned ppc_hash64_hpte_page_shift_noslb(PowerPCCPU *cpu,
 void ppc_store_lpcr(PowerPCCPU *cpu, target_ulong val);
 void ppc_hash64_init(PowerPCCPU *cpu);
 void ppc_hash64_finalize(PowerPCCPU *cpu);
+void ppc_hash64_filter_pagesizes(PowerPCCPU *cpu,
+                                 bool (*cb)(void *, uint32_t, uint32_t),
+                                 void *opaque);
 #endif
 
 /*
-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 4/7] spapr: Add cpu_apply hook to capabilities
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
                   ` (2 preceding siblings ...)
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes() David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 5/7] spapr: Limit available pagesizes to provide a consistent guest environment David Gibson
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

spapr capabilities have an apply hook to actually activate (or deactivate)
the feature in the system at reset time.  However, a number of capabilities
affect the setup of cpus, and need to be applied to each of them -
including hotplugged cpus for extra complication.  To make this simpler,
add an optional cpu_apply hook that is called from spapr_cpu_reset().

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_caps.c     | 19 +++++++++++++++++++
 hw/ppc/spapr_cpu_core.c |  2 ++
 include/hw/ppc/spapr.h  |  1 +
 3 files changed, 22 insertions(+)

diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
index 5762b88689..e71de14ffb 100644
--- a/hw/ppc/spapr_caps.c
+++ b/hw/ppc/spapr_caps.c
@@ -61,6 +61,8 @@ typedef struct sPAPRCapabilityInfo {
     sPAPRCapPossible *possible;
     /* Make sure the virtual hardware can support this capability */
     void (*apply)(sPAPRMachineState *spapr, uint8_t val, Error **errp);
+    void (*cpu_apply)(sPAPRMachineState *spapr, PowerPCCPU *cpu,
+                      uint8_t val, Error **errp);
 } sPAPRCapabilityInfo;
 
 static void spapr_cap_get_bool(Object *obj, Visitor *v, const char *name,
@@ -547,6 +549,23 @@ void spapr_caps_reset(sPAPRMachineState *spapr)
     }
 }
 
+void spapr_caps_cpu_reset(sPAPRMachineState *spapr, PowerPCCPU *cpu)
+{
+    int i;
+
+    for (i = 0; i < SPAPR_CAP_NUM; i++) {
+        sPAPRCapabilityInfo *info = &capability_table[i];
+
+        /*
+         * If the cpu_apply hook can't set the desired level and considers
+         * that fatal, it should report the error via &error_fatal below.
+         */
+        if (info->cpu_apply) {
+            info->cpu_apply(spapr, cpu, spapr->eff.caps[i], &error_fatal);
+        }
+    }
+}
+
 void spapr_caps_add_properties(sPAPRMachineClass *smc, Error **errp)
 {
     Error *local_err = NULL;
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 0a668b8954..f041182593 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -67,6 +67,8 @@ static void spapr_cpu_reset(void *opaque)
 
     /* Set a full AMOR so guest can use the AMR as it sees fit */
     env->spr[SPR_AMOR] = 0xffffffffffffffffull;
+
+    spapr_caps_cpu_reset(SPAPR_MACHINE(qdev_get_machine()), cpu);
 }
 
 void spapr_cpu_set_entry_state(PowerPCCPU *cpu, target_ulong nip, target_ulong r3)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index a3d616c612..ced3246d0e 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -802,6 +802,7 @@ static inline uint8_t spapr_get_cap(sPAPRMachineState *spapr, int cap)
 }
 
 void spapr_caps_reset(sPAPRMachineState *spapr);
+void spapr_caps_cpu_reset(sPAPRMachineState *spapr, PowerPCCPU *cpu);
 void spapr_caps_add_properties(sPAPRMachineClass *smc, Error **errp);
 int spapr_caps_post_migration(sPAPRMachineState *spapr);
 
-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 5/7] spapr: Limit available pagesizes to provide a consistent guest environment
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
                   ` (3 preceding siblings ...)
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 4/7] spapr: Add cpu_apply hook to capabilities David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 6/7] spapr: Don't rewrite mmu capabilities in KVM mode David Gibson
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

KVM HV has some limitations (deriving from the hardware) that mean not all
host-cpu supported pagesizes may be usable in the guest.  At present this
means that KVM guests and TCG guests may see different available page sizes
even if they notionally have the same vcpu model.  This is confusing and
also prevents migration between TCG and KVM.

This patch makes the environment consistent by always allowing the same set
of pagesizes.  Since we can't remove the KVM limitations, we do this by
always applying the same limitations it has, even to TCG guests.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_caps.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
index e71de14ffb..5a247f190f 100644
--- a/hw/ppc/spapr_caps.c
+++ b/hw/ppc/spapr_caps.c
@@ -331,6 +331,38 @@ static void cap_hpt_mps_apply(sPAPRMachineState *spapr,
     spapr_check_pagesize(spapr, qemu_getrampagesize(), errp);
 }
 
+static bool spapr_pagesize_cb(void *opaque, uint32_t seg_pshift, uint32_t pshift)
+{
+    unsigned maxshift = *((unsigned *)opaque);
+
+    assert(pshift >= seg_pshift);
+
+    /* Don't allow the guest to use pages bigger than the configured
+     * maximum size */
+    if (pshift > maxshift) {
+        return false;
+    }
+
+    /* For whatever reason, KVM doesn't allow multiple pagesizes
+     * within a segment, *except* for the case of 16M pages in a 4k or
+     * 64k segment.  Always exclude other cases, so that TCG and KVM
+     * guests see a consistent environment */
+    if ((pshift != seg_pshift) && (pshift != 24)) {
+        return false;
+    }
+
+    return true;
+}
+
+static void cap_hpt_mps_cpu_apply(sPAPRMachineState *spapr,
+                                   PowerPCCPU *cpu,
+                                   uint8_t val, Error **errp)
+{
+    unsigned maxshift = val;
+
+    ppc_hash64_filter_pagesizes(cpu, spapr_pagesize_cb, &maxshift);
+}
+
 sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
     [SPAPR_CAP_HTM] = {
         .name = "htm",
@@ -398,6 +430,7 @@ sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
         .set = spapr_cap_set_int,
         .type = "int",
         .apply = cap_hpt_mps_apply,
+        .cpu_apply = cap_hpt_mps_cpu_apply,
     },
 };
 
-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 6/7] spapr: Don't rewrite mmu capabilities in KVM mode
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
                   ` (4 preceding siblings ...)
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 5/7] spapr: Limit available pagesizes to provide a consistent guest environment David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 7/7] spapr_pci: Remove unhelpful pagesize warning David Gibson
  2018-04-19 15:30 ` [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling Andrea Bolognani
  7 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

Currently during KVM initialization on POWER, kvm_fixup_page_sizes()
rewrites a bunch of information in the cpu state to reflect the
capabilities of the host MMU and KVM.  This overwrites the information
that's already there reflecting how the TCG implementation of the MMU will
operate.

This means that we can get guest-visibly different behaviour between KVM
and TCG (and between different KVM implementations).  That's bad.  It also
prevents migration between KVM and TCG.

The pseries machine type now has filtering of the pagesizes it allows the
guest to use which means it can present a consistent model of the MMU
across all accelerators.

So, we can now replace kvm_fixup_page_sizes() with kvm_check_mmu() which
merely verifies that the expected cpu model can be faithfully handled by
KVM, rather than updating the cpu model to match KVM.

We call kvm_check_mmu() from the spapr cpu reset code.  This is a hack:
conceptually it makes more sense where fixup_page_sizes() was - in the KVM
cpu init path.  However, doing that would require moving the platform's
pagesize filtering much earlier, which would require a lot of work making
further adjustments.  There wouldn't be a lot of concrete point to doing
that, since the only KVM implementation which has the awkward MMU
restrictions is KVM HV, which can only work with an spapr guest anyway.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_cpu_core.c |   2 +
 target/ppc/kvm.c        | 134 +++++++++++++++++++++++-------------------------
 target/ppc/kvm_ppc.h    |   5 ++
 3 files changed, 72 insertions(+), 69 deletions(-)

diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index f041182593..f00180a443 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -69,6 +69,8 @@ static void spapr_cpu_reset(void *opaque)
     env->spr[SPR_AMOR] = 0xffffffffffffffffull;
 
     spapr_caps_cpu_reset(SPAPR_MACHINE(qdev_get_machine()), cpu);
+
+    kvm_check_mmu(cpu, &error_fatal);
 }
 
 void spapr_cpu_set_entry_state(PowerPCCPU *cpu, target_ulong nip, target_ulong r3)
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index defbb08931..fa7e1b77b3 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -408,85 +408,91 @@ target_ulong kvmppc_configure_v3_mmu(PowerPCCPU *cpu,
     }
 }
 
-static bool kvm_valid_page_size(uint32_t flags, long rampgsize, uint32_t shift)
+void kvm_check_mmu(PowerPCCPU *cpu, Error **errp)
 {
-    if (!(flags & KVM_PPC_PAGE_SIZES_REAL)) {
-        return true;
-    }
-
-    return (1ul << shift) <= rampgsize;
-}
-
-static long max_cpu_page_size;
-
-static void kvm_fixup_page_sizes(PowerPCCPU *cpu)
-{
-    static struct kvm_ppc_smmu_info smmu_info;
-    static bool has_smmu_info;
-    CPUPPCState *env = &cpu->env;
+    struct kvm_ppc_smmu_info smmu_info;
     int iq, ik, jq, jk;
 
-    /* We only handle page sizes for 64-bit server guests for now */
-    if (!(env->mmu_model & POWERPC_MMU_64)) {
+    /* For now, we only have anything to check on hash64 MMUs */
+    if (!cpu->hash64_opts || !kvm_enabled()) {
         return;
     }
 
-    /* Collect MMU info from kernel if not already */
-    if (!has_smmu_info) {
-        kvm_get_smmu_info(cpu, &smmu_info);
-        has_smmu_info = true;
-    }
+    kvm_get_smmu_info(cpu, &smmu_info);
 
-    if (!max_cpu_page_size) {
-        max_cpu_page_size = qemu_getrampagesize();
+    if (ppc_hash64_has(cpu, PPC_HASH64_1TSEG)
+        && !(smmu_info.flags & KVM_PPC_1T_SEGMENTS)) {
+        error_setg(errp,
+                   "KVM does not support 1TiB segments which guest expects");
+        return;
     }
 
-    /* Convert to QEMU form */
-    memset(cpu->hash64_opts->sps, 0, sizeof(*cpu->hash64_opts->sps));
-
-    /* If we have HV KVM, we need to forbid CI large pages if our
-     * host page size is smaller than 64K.
-     */
-    if (smmu_info.flags & KVM_PPC_PAGE_SIZES_REAL) {
-        if (getpagesize() >= 0x10000) {
-            cpu->hash64_opts->flags |= PPC_HASH64_CI_LARGEPAGE;
-        } else {
-            cpu->hash64_opts->flags &= ~PPC_HASH64_CI_LARGEPAGE;
-        }
+    if (smmu_info.slb_size < cpu->hash64_opts->slb_size) {
+        error_setg(errp, "KVM only supports %u SLB entries, but guest needs %u",
+                   smmu_info.slb_size, cpu->hash64_opts->slb_size);
+        return;
     }
 
     /*
-     * XXX This loop should be an entry wide AND of the capabilities that
-     *     the selected CPU has with the capabilities that KVM supports.
+     * Verify that every pagesize supported by the cpu model is
+     * supported by KVM with the same encodings
      */
-    for (ik = iq = 0; ik < KVM_PPC_PAGE_SIZES_MAX_SZ; ik++) {
+    for (iq = 0; iq < ARRAY_SIZE(cpu->hash64_opts->sps); iq++) {
         PPCHash64SegmentPageSizes *qsps = &cpu->hash64_opts->sps[iq];
-        struct kvm_ppc_one_seg_page_size *ksps = &smmu_info.sps[ik];
+        struct kvm_ppc_one_seg_page_size *ksps;
 
-        if (!kvm_valid_page_size(smmu_info.flags, max_cpu_page_size,
-                                 ksps->page_shift)) {
-            continue;
-        }
-        qsps->page_shift = ksps->page_shift;
-        qsps->slb_enc = ksps->slb_enc;
-        for (jk = jq = 0; jk < KVM_PPC_PAGE_SIZES_MAX_SZ; jk++) {
-            if (!kvm_valid_page_size(smmu_info.flags, max_cpu_page_size,
-                                     ksps->enc[jk].page_shift)) {
-                continue;
-            }
-            qsps->enc[jq].page_shift = ksps->enc[jk].page_shift;
-            qsps->enc[jq].pte_enc = ksps->enc[jk].pte_enc;
-            if (++jq >= PPC_PAGE_SIZES_MAX_SZ) {
+        for (ik = 0; ik < ARRAY_SIZE(smmu_info.sps); ik++) {
+            if (qsps->page_shift == smmu_info.sps[ik].page_shift) {
                 break;
             }
         }
-        if (++iq >= PPC_PAGE_SIZES_MAX_SZ) {
-            break;
+        if (ik >= ARRAY_SIZE(smmu_info.sps)) {
+            error_setg(errp, "KVM doesn't support for base page shift %u",
+                       qsps->page_shift);
+            return;
+        }
+
+        ksps = &smmu_info.sps[ik];
+        if (ksps->slb_enc != qsps->slb_enc) {
+            error_setg(errp,
+"KVM uses SLB encoding 0x%x for page shift %u, but guest expects 0x%x",
+                       ksps->slb_enc, ksps->page_shift, qsps->slb_enc);
+            return;
+        }
+
+        for (jq = 0; jq < ARRAY_SIZE(qsps->enc); jq++) {
+            for (jk = 0; jk < ARRAY_SIZE(ksps->enc); jk++) {
+                if (qsps->enc[jq].page_shift == ksps->enc[jk].page_shift) {
+                    break;
+                }
+            }
+
+            if (jk >= ARRAY_SIZE(ksps->enc)) {
+                error_setg(errp, "KVM doesn't support page shift %u/%u",
+                           qsps->enc[jq].page_shift, qsps->page_shift);
+                return;
+            }
+            if (qsps->enc[jq].pte_enc != ksps->enc[jk].pte_enc) {
+                error_setg(errp,
+"KVM uses PTE encoding 0x%x for page shift %u/%u, but guest expects 0x%x",
+                           ksps->enc[jk].pte_enc, qsps->enc[jq].page_shift,
+                           qsps->page_shift, qsps->enc[jq].pte_enc);
+                return;
+            }
         }
     }
-    cpu->hash64_opts->slb_size = smmu_info.slb_size;
-    if (!(smmu_info.flags & KVM_PPC_1T_SEGMENTS)) {
-        cpu->hash64_opts->flags &= ~PPC_HASH64_1TSEG;
+
+    if (ppc_hash64_has(cpu, PPC_HASH64_CI_LARGEPAGE)) {
+        /* Mostly what guest pagesizes we can use are related to the
+         * host pages used to map guest RAM, which is handled in the
+         * platform code. Cache-Inhibited largepages (64k) however are
+         * used for I/O, so if they're mapped to the host at all it
+         * will be a normal mapping, not a special hugepage one used
+         * for RAM. */
+        if (getpagesize() < 0x10000) {
+            error_setg(errp,
+"KVM can't supply 64kiB CI pages, which guest expects\n");
+        }
     }
 }
 
@@ -503,13 +509,6 @@ bool kvmppc_hpt_needs_host_contiguous_pages(void)
 
     return !!(smmu_info.flags & KVM_PPC_PAGE_SIZES_REAL);
 }
-
-#else /* defined (TARGET_PPC64) */
-
-static inline void kvm_fixup_page_sizes(PowerPCCPU *cpu)
-{
-}
-
 #endif /* !defined (TARGET_PPC64) */
 
 unsigned long kvm_arch_vcpu_id(CPUState *cpu)
@@ -555,9 +554,6 @@ int kvm_arch_init_vcpu(CPUState *cs)
     CPUPPCState *cenv = &cpu->env;
     int ret;
 
-    /* Gather server mmu info from KVM and update the CPU state */
-    kvm_fixup_page_sizes(cpu);
-
     /* Synchronize sregs with kvm */
     ret = kvm_arch_sync_sregs(cpu);
     if (ret) {
diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
index 443fca0a4e..657582bb32 100644
--- a/target/ppc/kvm_ppc.h
+++ b/target/ppc/kvm_ppc.h
@@ -71,6 +71,7 @@ int kvmppc_resize_hpt_commit(PowerPCCPU *cpu, target_ulong flags, int shift);
 bool kvmppc_pvr_workaround_required(PowerPCCPU *cpu);
 
 bool kvmppc_hpt_needs_host_contiguous_pages(void);
+void kvm_check_mmu(PowerPCCPU *cpu, Error **errp);
 
 #else
 
@@ -227,6 +228,10 @@ static inline bool kvmppc_hpt_needs_host_contiguous_pages(void)
     return false;
 }
 
+static inline void kvm_check_mmu(PowerPCCPU *cpu, Error **errp)
+{
+}
+
 static inline bool kvmppc_has_cap_spapr_vfio(void)
 {
     return false;
-- 
2.14.3

* [Qemu-devel] [RFC for-2.13 7/7] spapr_pci: Remove unhelpful pagesize warning
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
                   ` (5 preceding siblings ...)
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 6/7] spapr: Don't rewrite mmu capabilities in KVM mode David Gibson
@ 2018-04-19  6:29 ` David Gibson
  2018-04-19 15:30 ` [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling Andrea Bolognani
  7 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-19  6:29 UTC (permalink / raw)
  To: groug, abologna; +Cc: aik, qemu-ppc, qemu-devel, clg, David Gibson

By default, the IOMMU model built into the spapr virtual PCI host bridge
supports 4kiB and 64kiB IOMMU page sizes.  However this can be overridden
which may be desirable to allow larger IOMMU page sizes when running a
guest with hugepage backing and passthrough devices.  For that reason a
warning was printed when the device wasn't configured to allow the pagesize
with which guest RAM is backed.

Experience has proven, however, that this message is more confusing than
useful.  Worse it sometimes makes little sense when the host-available page
sizes don't match those available on the guest, which can happen with
a POWER8 guest running on a POWER9 KVM host.

Long term we do want better handling to allow large IOMMU page sizes to be
used, but for now this parameter and warning don't really accomplish it.
So, remove the message, pending a better solution.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr_pci.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/hw/ppc/spapr_pci.c b/hw/ppc/spapr_pci.c
index 39a14980d3..f936ce63ef 100644
--- a/hw/ppc/spapr_pci.c
+++ b/hw/ppc/spapr_pci.c
@@ -1717,13 +1717,6 @@ static void spapr_phb_realize(DeviceState *dev, Error **errp)
     }
 
     /* DMA setup */
-    if (((sphb->page_size_mask & qemu_getrampagesize()) == 0)
-        && kvm_enabled()) {
-        warn_report("System page size 0x%lx is not enabled in page_size_mask "
-                    "(0x%"PRIx64"). Performance may be slow",
-                    qemu_getrampagesize(), sphb->page_size_mask);
-    }
-
     for (i = 0; i < windows_supported; ++i) {
         tcet = spapr_tce_new_table(DEVICE(sphb), sphb->dma_liobn[i]);
         if (!tcet) {
-- 
2.14.3

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
                   ` (6 preceding siblings ...)
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 7/7] spapr_pci: Remove unhelpful pagesize warning David Gibson
@ 2018-04-19 15:30 ` Andrea Bolognani
  2018-04-20  2:35   ` David Gibson
  7 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-19 15:30 UTC (permalink / raw)
  To: David Gibson, groug; +Cc: aik, qemu-ppc, qemu-devel, clg

On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> Currently the "pseries" machine type will (usually) advertise
> different pagesizes to the guest when running under KVM and TCG, which
> is not how things are supposed to work.
> 
> This comes from poor handling of hardware limitations which mean that
> under KVM HV the guest is unable to use pagesizes larger than those
> backing the guest's RAM on the host side.
> 
> The new scheme turns things around by having an explicit machine
> parameter controlling the largest page size that the guest is allowed
> to use.  This limitation applies regardless of accelerator.  When
> we're running on KVM HV we ensure that our backing pages are adequate
> to supply the requested guest page sizes, rather than adjusting the
> guest page sizes based on what KVM can supply.
> 
> This means that in order to use hugepages in a PAPR guest it's
> necessary to add a "cap-hpt-mps=24" machine parameter as well as
> setting the mem-path correctly.  This is a bit more work on the user
> and/or management side, but results in consistent behaviour so I think
> it's worth it.

libvirt guests already need to explicitly opt-in to hugepages, so
adding this new option automagically based on that shouldn't be too
difficult.

A couple of questions:

  * I see the option accepts values 12, 16, 24 and 34, with 16
    being the default. I guess 34 corresponds to 1 GiB hugepages?
    Also, in what scenario would 12 be used?

  * The name of the property suggests this setting is only relevant
    for HPT guests. libvirt doesn't really have the notion of HPT
    and RPT, and I'm not really itching to introduce it. Can we
    safely use this option for all guests, even RPT ones?

Thanks.

-- 
Andrea Bolognani / Red Hat / Virtualization

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-19 15:30 ` [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling Andrea Bolognani
@ 2018-04-20  2:35   ` David Gibson
  2018-04-20  9:31     ` Andrea Bolognani
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2018-04-20  2:35 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Thu, Apr 19, 2018 at 05:30:04PM +0200, Andrea Bolognani wrote:
> On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> > Currently the "pseries" machine type will (usually) advertise
> > different pagesizes to the guest when running under KVM and TCG, which
> > is not how things are supposed to work.
> > 
> > This comes from poor handling of hardware limitations which mean that
> > under KVM HV the guest is unable to use pagesizes larger than those
> > backing the guest's RAM on the host side.
> > 
> > The new scheme turns things around by having an explicit machine
> > parameter controlling the largest page size that the guest is allowed
> > to use.  This limitation applies regardless of accelerator.  When
> > we're running on KVM HV we ensure that our backing pages are adequate
> > to supply the requested guest page sizes, rather than adjusting the
> > guest page sizes based on what KVM can supply.
> > 
> > This means that in order to use hugepages in a PAPR guest it's
> > necessary to add a "cap-hpt-mps=24" machine parameter as well as
> > setting the mem-path correctly.  This is a bit more work on the user
> > and/or management side, but results in consistent behaviour so I think
> > it's worth it.
> 
> libvirt guests already need to explicitly opt-in to hugepages, so
> adding this new option automagically based on that shouldn't be too
> difficult.

Right.  We have to be a bit careful with automagic though, because
treating hugepage as a boolean is one of the problems that this
parameter is there to address.

If libvirt were to set the parameter based on the pagesize of the
hugepage mount, then it might not be consistent across a migration
(e.g. p8 to p9).  Now the new code would at least catch that and
safely fail the migration, but that might be confusing to users.

> A couple of questions:
> 
>   * I see the option accepts values 12, 16, 24 and 34, with 16
>     being the default.

In fact it should accept any value >= 12, though the ones that you
list are the interesting ones.  This does mean, for example, that if
it was just set to the hugepage size on a p9, 21 (2MiB) things should
work correctly (in practice it would act identically to setting it to
16).

> I guess 34 corresponds to 1 GiB hugepages?

No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
practice).

>     Also, in what scenario would 12 be used?

So RHEL, at least, generally configures ppc64 kernels to use 64kiB
pages, but 4kiB pages are still supported upstream (not sure if there
are any distros that still use that mode).  If your host uses 4kiB
pages you wouldn't be able to start a (KVM HV) guest without setting
this to 12 (or using a 64kiB hugepage mount).

>   * The name of the property suggests this setting is only relevant
>     for HPT guests. libvirt doesn't really have the notion of HPT
>     and RPT, and I'm not really itching to introduce it. Can we
>     safely use this option for all guests, even RPT ones?

Yes.  The "hpt" in the main is meant to imply that its restriction
only applies when the guest is in HPT mode, but it can be safely set
in any mode.  In RPT mode guest and host pagesizes are independent of
each other, so we don't have to deal with this mess.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-20  2:35   ` David Gibson
@ 2018-04-20  9:31     ` Andrea Bolognani
  2018-04-20 10:21       ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-20  9:31 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, 2018-04-20 at 12:35 +1000, David Gibson wrote:
> On Thu, Apr 19, 2018 at 05:30:04PM +0200, Andrea Bolognani wrote:
> > On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> > > This means that in order to use hugepages in a PAPR guest it's
> > > necessary to add a "cap-hpt-mps=24" machine parameter as well as
> > > setting the mem-path correctly.  This is a bit more work on the user
> > > and/or management side, but results in consistent behaviour so I think
> > > it's worth it.
> > 
> > libvirt guests already need to explicitly opt-in to hugepages, so
> > adding this new option automagically based on that shouldn't be too
> > difficult.
> 
> Right.  We have to be a bit careful with automagic though, because
> treating hugepage as a boolean is one of the problems that this
> parameter is there to address.
> 
> If libvirt were to set the parameter based on the pagesize of the
> hugepage mount, then it might not be consistent across a migration
> (e.g. p8 to p9).  Now the new code would at least catch that and
> safely fail the migration, but that might be confusing to users.

Good point.

I'll have to look into it to be sure, but I think it should be
possible for libvirt to convert a generic

  <memoryBacking>
    <hugepages/>
  </memoryBacking>

to a more specific

  <memoryBacking>
    <hugepages>
      <page size="16384" unit="KiB"/>
    </hugepages>
  </memoryBacking>

by figuring out the page size for the default hugepage mount,
which actually sounds like a good idea regardless. Of course users
would still be able to provide the page size themselves in the
first place.

Is the 16 MiB page size available for both POWER8 and POWER9?

> > A couple of questions:
> > 
> >   * I see the option accepts values 12, 16, 24 and 34, with 16
> >     being the default.
> 
> In fact it should accept any value >= 12, though the ones that you
> list are the interesting ones.

Well, I copied them from the QEMU help text, and I kinda assumed
that you wouldn't just list completely random values there O:-)

> This does mean, for example, that if
> it was just set to the hugepage size on a p9, 21 (2MiB) things should
> work correctly (in practice it would act identically to setting it to
> 16).

Wouldn't that lead to different behavior depending on whether you
start the guest on a POWER9 or POWER8 machine? The former would be
able to use 2 MiB hugepages, while the latter would be stuck using
regular 64 KiB pages. Migration of such a guest from POWER9 to
POWER8 wouldn't work because the hugepage allocation couldn't be
fulfilled, but the other way around would probably work and lead to
different page sizes being available inside the guest after a power
cycle, no?

> > I guess 34 corresponds to 1 GiB hugepages?
> 
> No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> practice).

Isn't 1 GiB hugepages support at least being worked on[1]?

> >     Also, in what scenario would 12 be used?
> 
> So RHEL, at least, generally configures ppc64 kernels to use 64kiB
> pages, but 4kiB pages are still supported upstream (not sure if there
> are any distros that still use that mode).  If your host uses 4kiB
> pages you wouldn't be able to start a (KVM HV) guest without setting
> this to 12 (or using a 64kiB hugepage mount).

Mh, that's annoying, as needing to support 4 KiB pages would most
likely mean we'd have to turn this into a stand-alone configuration
knob rather than deriving it entirely from existing ones, which I'd
prefer as it's clearly much more user-friendly.

I'll check out what other distros are doing: if all the major ones
are defaulting to 64 KiB pages these days, it might be reasonable
to do the same and pretend smaller page sizes don't exist at all in
order to avoid the pain of having to tweak yet another knob, even
if that means leaving people compiling their own custom kernels
with 4 KiB page size in the dust.

> >   * The name of the property suggests this setting is only relevant
> >     for HPT guests. libvirt doesn't really have the notion of HPT
> >     and RPT, and I'm not really itching to introduce it. Can we
> >     safely use this option for all guests, even RPT ones?
> 
> Yes.  The "hpt" in the main is meant to imply that its restriction
> only applies when the guest is in HPT mode, but it can be safely set
> in any mode.  In RPT mode guest and host pagesizes are independent of
> each other, so we don't have to deal with this mess.

Good :)


[1] https://patchwork.kernel.org/patch/9729991/
-- 
Andrea Bolognani / Red Hat / Virtualization

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-20  9:31     ` Andrea Bolognani
@ 2018-04-20 10:21       ` David Gibson
  2018-04-23  8:31         ` Andrea Bolognani
                           ` (2 more replies)
  0 siblings, 3 replies; 28+ messages in thread
From: David Gibson @ 2018-04-20 10:21 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 12:35 +1000, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 05:30:04PM +0200, Andrea Bolognani wrote:
> > > On Thu, 2018-04-19 at 16:29 +1000, David Gibson wrote:
> > > > This means that in order to use hugepages in a PAPR guest it's
> > > > necessary to add a "cap-hpt-mps=24" machine parameter as well as
> > > > setting the mem-path correctly.  This is a bit more work on the user
> > > > and/or management side, but results in consistent behaviour so I think
> > > > it's worth it.
> > > 
> > > libvirt guests already need to explicitly opt-in to hugepages, so
> > > adding this new option automagically based on that shouldn't be too
> > > difficult.
> > 
> > Right.  We have to be a bit careful with automagic though, because
> > treating hugepage as a boolean is one of the problems that this
> > parameter is there to address.
> > 
> > If libvirt were to set the parameter based on the pagesize of the
> > hugepage mount, then it might not be consistent across a migration
> > (e.g. p8 to p9).  Now the new code would at least catch that and
> > safely fail the migration, but that might be confusing to users.
> 
> Good point.
> 
> I'll have to look into it to be sure, but I think it should be
> possible for libvirt to convert a generic
> 
>   <memoryBacking>
>     <hugepages/>
>   </memoryBacking>
> 
> to a more specific
> 
>   <memoryBacking>
>     <hugepages>
>       <page size="16384" unit="KiB"/>
>     </hugepages>
>   </memoryBacking>
> 
> by figuring out the page size for the default hugepage mount,
> which actually sounds like a good idea regardless. Of course users
> would still be able to provide the page size themselves in the
> first place.

Sounds like a good approach.

> Is the 16 MiB page size available for both POWER8 and POWER9?

No.  That's a big part of what makes this such a mess.  HPT has 16MiB
and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
technically Power9 does have 16MiB pages - but only in hash mode, which
the host won't be).

I've been looking into whether it's feasible to make a 16MiB hugepage
pool for POWER9 RPT.  The hardware can't actually use that as a
pagesize, but we could still allocate them physically contiguous, map
them using a bunch of 2MiB PTEs in RPT mode and allow them to be
mapped by guests in HPT mode.

I *think* it won't be too hard, but I haven't looked close enough to
rule out horrible gotchas yet.

> > > A couple of questions:
> > > 
> > >   * I see the option accepts values 12, 16, 24 and 34, with 16
> > >     being the default.
> > 
> > In fact it should accept any value >= 12, though the ones that you
> > list are the interesting ones.
> 
> Well, I copied them from the QEMU help text, and I kinda assumed
> that you wouldn't just list completely random values there O:-)

Ah, right, of course.

> > This does mean, for example, that if
> > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > work correctly (in practice it would act identically to setting it to
> > 16).
> 
> Wouldn't that lead to different behavior depending on whether you
> start the guest on a POWER9 or POWER8 machine? The former would be
> able to use 2 MiB hugepages, while the latter would be stuck using
> regular 64 KiB pages.

Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
mode it'd be able to use 2MiB hugepages either way, because the
limitations only apply to HPT mode.

> Migration of such a guest from POWER9 to
> POWER8 wouldn't work because the hugepage allocation couldn't be
> fulfilled,

Sort of, you couldn't even get as far as starting the incoming qemu
with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
backing).

> but the other way around would probably work and lead to
> different page sizes being available inside the guest after a power
> cycle, no?

Well.. there are a few cases here.  If you migrated p8 -> p8 with
hpt-mps=21 on both ends, you couldn't actually start the guest on the
source without giving it hugepage backing.  In which case it'll be
fine on the p9 with hugepage mapping.

If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
well, you don't get to count on anything because you changed the VM
definition.  In fact it would work in this case, and you wouldn't even
get new page sizes after restart because HPT mode doesn't support any
pagesizes between 64kiB and 16MiB.

> > > I guess 34 corresponds to 1 GiB hugepages?
> > 
> > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> > be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> > practice).
> 
> Isn't 1 GiB hugepages support at least being worked on[1]?

That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.

> > >     Also, in what scenario would 12 be used?
> > 
> > So RHEL, at least, generally configures ppc64 kernels to use 64kiB
> > pages, but 4kiB pages are still supported upstream (not sure if there
> > are any distros that still use that mode).  If your host uses 4kiB
> > pages you wouldn't be able to start a (KVM HV) guest without setting
> > this to 12 (or using a 64kiB hugepage mount).
> 
> Mh, that's annoying, as needing to support 4 KiB pages would most
> likely mean we'd have to turn this into a stand-alone configuration
> knob rather than deriving it entirely from existing ones, which I'd
> prefer as it's clearly much more user-friendly.

Yeah, there's really no way around it though.  Well other than always
restricting to 4kiB pages by default, which would suck for performance
with guests that want to use 64kiB pages.

> I'll check out what other distros are doing: if all the major ones
> are defaulting to 64 KiB pages these days, it might be reasonable
> to do the same and pretend smaller page sizes don't exist at all in
> order to avoid the pain of having to tweak yet another knob, even
> if that means leaving people compiling their own custom kernels
> with 4 KiB page size in the dust.

That's my guess.

> > >   * The name of the property suggests this setting is only relevant
> > >     for HPT guests. libvirt doesn't really have the notion of HPT
> > >     and RPT, and I'm not really itching to introduce it. Can we
> > >     safely use this option for all guests, even RPT ones?
> > 
> > Yes.  The "hpt" in the name is meant to imply that its restriction
> > only applies when the guest is in HPT mode, but it can be safely set
> > in any mode.  In RPT mode guest and host pagesizes are independent of
> > each other, so we don't have to deal with this mess.
> 
> Good :)
> 
> 
> [1] https://patchwork.kernel.org/patch/9729991/

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-20 10:21       ` David Gibson
@ 2018-04-23  8:31         ` Andrea Bolognani
  2018-04-24  1:26           ` David Gibson
  2018-04-24 15:35         ` Andrea Bolognani
  2018-04-25 16:09         ` Andrea Bolognani
  2 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-23  8:31 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > I'll check out what other distros are doing: if all the major ones
> > are defaulting to 64 KiB pages these days, it might be reasonable
> > to do the same and pretend smaller page sizes don't exist at all in
> > order to avoid the pain of having to tweak yet another knob, even
> > if that means leaving people compiling their own custom kernels
> > with 4 KiB page size in the dust.
> 
> That's my guess.

I just checked RHEL 7, Fedora 27, OpenSUSE Leap 42.3, Debian 9 and
Ubuntu 16.04: they all use 64 KiB pages, so I'd conclude leaving
out 4 KiB pages support is basically a non-issue.

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-23  8:31         ` Andrea Bolognani
@ 2018-04-24  1:26           ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-24  1:26 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

[-- Attachment #1: Type: text/plain, Size: 1060 bytes --]

On Mon, Apr 23, 2018 at 10:31:39AM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > I'll check out what other distros are doing: if all the major ones
> > > are defaulting to 64 KiB pages these days, it might be reasonable
> > > to do the same and pretend smaller page sizes don't exist at all in
> > > order to avoid the pain of having to tweak yet another knob, even
> > > if that means leaving people compiling their own custom kernels
> > > with 4 KiB page size in the dust.
> > 
> > That's my guess.
> 
> I just checked RHEL 7, Fedora 27, OpenSUSE Leap 42.3, Debian 9 and
> Ubuntu 16.04: they all use 64 KiB pages, so I'd conclude leaving
> out 4 KiB pages support is basically a non-issue.

Good to hear, thanks for checking.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-20 10:21       ` David Gibson
  2018-04-23  8:31         ` Andrea Bolognani
@ 2018-04-24 15:35         ` Andrea Bolognani
  2018-04-25  6:32           ` David Gibson
  2018-04-25 16:09         ` Andrea Bolognani
  2 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-24 15:35 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > I'll have to look into it to be sure, but I think it should be
> > possible for libvirt to convert a generic
> > 
> >   <memoryBacking>
> >     <hugepages/>
> >   </memoryBacking>
> > 
> > to a more specific
> > 
> >   <memoryBacking>
> >     <hugepages>
> >       <page size="16384" unit="KiB"/>
> >     </hugepages>
> >   </memoryBacking>
> > 
> > by figuring out the page size for the default hugepage mount,
> > which actually sounds like a good idea regardless. Of course users
> > would still be able to provide the page size themselves in the
> > first place.
> 
> Sounds like a good approach.

Unfortunately it seems like this is not going to be feasible, as
POWER8 is apparently the only platform that enforces a strict
relationship between host page size and guest page size: x86,
aarch64 (and I have to assume POWER9 as well?) can reportedly all
deal gracefully with guests migrating between hosts that have
different hugepage mounts configured.

I need to spend some more time digesting the rest of the
information you provided, but as it stands right now I'm starting
to think this might actually need to be its own, explicit opt-in
knob at the libvirt level too after all :(

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-24 15:35         ` Andrea Bolognani
@ 2018-04-25  6:32           ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-04-25  6:32 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

[-- Attachment #1: Type: text/plain, Size: 1972 bytes --]

On Tue, Apr 24, 2018 at 05:35:59PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > I'll have to look into it to be sure, but I think it should be
> > > possible for libvirt to convert a generic
> > > 
> > >   <memoryBacking>
> > >     <hugepages/>
> > >   </memoryBacking>
> > > 
> > > to a more specific
> > > 
> > >   <memoryBacking>
> > >     <hugepages>
> > >       <page size="16384" unit="KiB"/>
> > >     </hugepages>
> > >   </memoryBacking>
> > > 
> > > by figuring out the page size for the default hugepage mount,
> > > which actually sounds like a good idea regardless. Of course users
> > > would still be able to provide the page size themselves in the
> > > first place.
> > 
> > Sounds like a good approach.
> 
> Unfortunately it seems like this is not going to be feasible, as
> POWER8 is apparently the only platform that enforces a strict
> relationship between host page size and guest page size: x86,
> aarch64 (and I have to assume POWER9 as well?) can reportedly all
> deal gracefully with guests migrating between hosts that have
> different hugepage mounts configured.

Yes, that's right.  As you guess POWER9 will also allow this.. at
least as long as the guest is in radix mode.  If the guest is in hash
mode (which includes but isn't limited to POWER8 compat mode guests)
then it will suffer from the same pagesize limitations as POWER8.

> I need to spend some more time digesting the rest of the
> information you provided, but as it stands right now I'm starting
> to think this might actually need to be its own, explicit opt-in
> knob at the libvirt level too after all :(

Poo.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-20 10:21       ` David Gibson
  2018-04-23  8:31         ` Andrea Bolognani
  2018-04-24 15:35         ` Andrea Bolognani
@ 2018-04-25 16:09         ` Andrea Bolognani
  2018-04-26  0:55           ` David Gibson
  2 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-25 16:09 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > Is the 16 MiB page size available for both POWER8 and POWER9?
> 
> No.  That's a big part of what makes this such a mess.  HPT has 16MiB
> and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
> technically Power9 does have 16MiB pages - but only in hash mode, which
> the host won't be).
> 
[...]
> > > This does mean, for example, that if
> > > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > > work correctly (in practice it would act identically to setting it to
> > > 16).
> > 
> > Wouldn't that lead to different behavior depending on whether you
> > start the guest on a POWER9 or POWER8 machine? The former would be
> > able to use 2 MiB hugepages, while the latter would be stuck using
> > regular 64 KiB pages.
> 
> Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
> mode it'd be able to use 2MiB hugepages either way, because the
> limitations only apply to HPT mode.
> 
> > Migration of such a guest from POWER9 to
> > POWER8 wouldn't work because the hugepage allocation couldn't be
> > fulfilled,
> 
> > Sort of, you couldn't even get as far as starting the incoming qemu
> with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
> backing).
> 
> > but the other way around would probably work and lead to
> > different page sizes being available inside the guest after a power
> > cycle, no?
> 
> Well.. there are a few cases here.  If you migrated p8 -> p8 with
> hpt-mps=21 on both ends, you couldn't actually start the guest on the
> source without giving it hugepage backing.  In which case it'll be
> fine on the p9 with hugepage mapping.
> 
> If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
> well, you don't get to count on anything because you changed the VM
> definition.  In fact it would work in this case, and you wouldn't even
> get new page sizes after restart because HPT mode doesn't support any
> pagesizes between 64kiB and 16MiB.
> 
> > > > I guess 34 corresponds to 1 GiB hugepages?
> > > 
> > > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > > machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> > > be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> > > practice).
> > 
> > Isn't 1 GiB hugepages support at least being worked on[1]?
> 
> That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.

So, I've spent some more time trying to wrap my head around the
whole ordeal. I'm still unclear about some of the details, though;
hopefully you'll be willing to answer a few more questions.

Basically the only page sizes you can have for HPT guests are
4 KiB, 64 KiB, 16 MiB and 16 GiB; in each case, for KVM, you need
the guest memory to be backed by host pages which are at least as
big, or it won't work. The same limitation doesn't apply to either
RPT or TCG guests.

The new parameter would make it possible to make sure you will
actually be able to use the page size you're interested in inside
the guest, by preventing it from starting at all if the host didn't
provide big enough backing pages; it would also ensure the guest
gets access to different page sizes when running using TCG as an
accelerator instead of KVM.

For a KVM guest running on a POWER8 host, the matrix would look
like

    b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
   64 KiB | 64 KiB | 64 KiB |        |        |        |
  -------- -------- -------- -------- -------- --------
   16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
  -------- -------- -------- -------- -------- --------
   16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
  -------- -------- -------- -------- -------- --------

with backing page sizes from top to bottom, requested max page
sizes from left to right, actual max page sizes in the cells and
empty cells meaning the guest won't be able to start; on a POWER9
machine, the matrix would look like

    b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
   64 KiB | 64 KiB | 64 KiB |        |        |        |
  -------- -------- -------- -------- -------- --------
    2 MiB | 64 KiB | 64 KiB |        |        |        |
  -------- -------- -------- -------- -------- --------
    1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
  -------- -------- -------- -------- -------- --------

instead, and finally on TCG the backing page size wouldn't matter
and you would simply have

    b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
  -------- -------- -------- -------- -------- --------
          | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
  -------- -------- -------- -------- -------- --------

Does everything up until here make sense?

While trying to figure out this, one of the things I attempted to
do was run a guest in POWER8 compatibility mode on a POWER9 host
and use hugepages for backing, but that didn't seem to work at
all, possibly hinting at the fact that not all of the above is
actually accurate and I need you to correct me :)

This is the command line I used:

  /usr/libexec/qemu-kvm \
  -machine pseries,accel=kvm \
  -cpu host,compat=power8 \
  -m 2048 \
  -mem-prealloc \
  -mem-path /dev/hugepages \
  -smp 8,sockets=8,cores=1,threads=1 \
  -display none \
  -no-user-config \
  -nodefaults \
  -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=vda \
  -drive file=/var/lib/libvirt/images/huge.qcow2,format=qcow2,if=none,id=vda \
  -serial mon:stdio

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-25 16:09         ` Andrea Bolognani
@ 2018-04-26  0:55           ` David Gibson
  2018-04-26  8:45             ` Andrea Bolognani
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2018-04-26  0:55 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

[-- Attachment #1: Type: text/plain, Size: 7803 bytes --]

On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-20 at 20:21 +1000, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 11:31:10AM +0200, Andrea Bolognani wrote:
> > > Is the 16 MiB page size available for both POWER8 and POWER9?
> > 
> > No.  That's a big part of what makes this such a mess.  HPT has 16MiB
> > and 16GiB hugepages, RPT has 2MiB and 1GiB hugepages.  (Well, I guess
> > technically Power9 does have 16MiB pages - but only in hash mode, which
> > the host won't be).
> > 
> [...]
> > > > This does mean, for example, that if
> > > > it was just set to the hugepage size on a p9, 21 (2MiB) things should
> > > > work correctly (in practice it would act identically to setting it to
> > > > 16).
> > > 
> > > Wouldn't that lead to different behavior depending on whether you
> > > start the guest on a POWER9 or POWER8 machine? The former would be
> > > able to use 2 MiB hugepages, while the latter would be stuck using
> > > regular 64 KiB pages.
> > 
> > Well, no, because 2MiB hugepages aren't a thing in HPT mode.  In RPT
> > mode it'd be able to use 2MiB hugepages either way, because the
> > limitations only apply to HPT mode.
> > 
> > > Migration of such a guest from POWER9 to
> > > POWER8 wouldn't work because the hugepage allocation couldn't be
> > > fulfilled,
> > 
> > Sort of, you couldn't even get as far as starting the incoming qemu
> > with hpt-mps=21 on the POWER8 (unless you gave it 16MiB hugepages for
> > backing).
> > 
> > > but the other way around would probably work and lead to
> > > different page sizes being available inside the guest after a power
> > > cycle, no?
> > 
> > Well.. there are a few cases here.  If you migrated p8 -> p8 with
> > hpt-mps=21 on both ends, you couldn't actually start the guest on the
> > source without giving it hugepage backing.  In which case it'll be
> > fine on the p9 with hugepage mapping.
> > 
> > If you had hpt-mps=16 on the source and hpt-mps=21 on the other end,
> > well, you don't get to count on anything because you changed the VM
> > definition.  In fact it would work in this case, and you wouldn't even
> > get new page sizes after restart because HPT mode doesn't support any
> > pagesizes between 64kiB and 16MiB.
> > 
> > > > > I guess 34 corresponds to 1 GiB hugepages?
> > > > 
> > > > No, 16GiB hugepages, which is the "colossal page" size on HPT POWER
> > > > machines.  It's a simple shift, (1 << 34) == 16 GiB, 1GiB pages would
> > > > be 30 (but wouldn't let the guest do any more than 24 ~ 16 MiB in
> > > > practice).
> > > 
> > > Isn't 1 GiB hugepages support at least being worked on[1]?
> > 
> > That's for radix mode.  Hash mode has 16MiB and 16GiB, no 1GiB.
> 
> So, I've spent some more time trying to wrap my head around the
> whole ordeal. I'm still unclear about some of the details, though;
> hopefully you'll be willing to answer a few more questions.
> 
> Basically the only page sizes you can have for HPT guests are
> 4 KiB, 64 KiB, 16 MiB and 16 GiB; in each case, for KVM, you need
> the guest memory to be backed by host pages which are at least as
> big, or it won't work. The same limitation doesn't apply to either
> RPT or TCG guests.

That's right.  The limitation also doesn't apply to KVM PR, just KVM
HV.

[If you're interested, the reason for the limitation is that unlike
 x86 or POWER9 there aren't separate sets of gva->gpa and gpa->hpa
 pagetables. Instead there's just a single gva->hpa (hash) pagetable
 that's managed by the _host_.  When the guest wants to create a new
 mapping it uses an hcall to insert a PTE, and the hcall
 implementation translates the gpa into an hpa before inserting it
 into the HPT.  The full contents of the real HPT aren't visible to
 the guest, but the precise slot numbers within it are, so the
 assumption that there's an exact 1:1 correspondence between guest
 PTEs and host PTEs is pretty much baked into the PAPR interface.  So,
 if a hugepage is to be inserted into the guest HPT, then it's also
 being inserted into the host HPT, and needs to be really, truly host
 contiguous]
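
To make the consequence concrete, here is a rough C sketch of the check the
hcall implementation effectively has to make (purely illustrative, names
invented, not the actual KVM code):

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative: one host backing page covering a chunk of guest RAM */
    struct backing_page {
        uint64_t hpa_base;   /* host physical address of the backing page */
        unsigned host_shift; /* log2 of the host backing page size */
    };

    /*
     * Because guest PTEs go straight into the host's hash table, a guest
     * page of 2^guest_shift bytes can only be mapped if it fits entirely
     * within one physically contiguous host backing page.
     */
    static bool hpte_insertable(unsigned guest_shift,
                                const struct backing_page *bp)
    {
        return guest_shift <= bp->host_shift;
    }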

> The new parameter would make it possible to make sure you will
> actually be able to use the page size you're interested in inside
> the guest, by preventing it from starting at all if the host didn't
> provide big enough backing pages;

That's right

> it would also ensure the guest
> gets access to different page sizes when running using TCG as an
> accelerator instead of KVM.

Uh.. it would ensure the guest *doesn't* get access to different page
sizes in TCG vs. KVM.  Is that what you meant to say?

> For a KVM guest running on a POWER8 host, the matrix would look
> like
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>    64 KiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
>   -------- -------- -------- -------- -------- --------
>    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
> 
> with backing page sizes from top to bottom, requested max page
> sizes from left to right, actual max page sizes in the cells and
> empty cells meaning the guest won't be able to start; on a POWER9
> machine, the matrix would look like
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>    64 KiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>     2 MiB | 64 KiB | 64 KiB |        |        |        |
>   -------- -------- -------- -------- -------- --------
>     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
>   -------- -------- -------- -------- -------- --------
> 
> instead, and finally on TCG the backing page size wouldn't matter
> and you would simply have
> 
>     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
>           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
>   -------- -------- -------- -------- -------- --------
> 
> Does everything up until here make sense?

Yes, that all looks right.
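
Put another way, the rule behind those tables is roughly the following (an
illustrative C sketch, not code from the patches; the helper name and the
kvm_hv flag are made up):

    #include <stdbool.h>

    /*
     * Effective maximum guest page shift for an HPT guest, given the
     * requested cap and the host backing page size (both as shifts).
     * HPT mode only has 4kiB/64kiB/16MiB/16GiB pages, so the cap is first
     * rounded down to one of those; under KVM HV the backing pages must
     * then be at least that big, otherwise the guest can't start (-1).
     */
    static int hpt_effective_max_shift(int cap_shift, int backing_shift,
                                       bool kvm_hv)
    {
        static const int hpt_shifts[] = { 34, 24, 16, 12 };
        int i, eff = -1;

        for (i = 0; i < 4; i++) {
            if (hpt_shifts[i] <= cap_shift) {
                eff = hpt_shifts[i]; /* largest HPT pagesize within the cap */
                break;
            }
        }
        if (eff < 0) {
            return -1;               /* cap below 4kiB: can't start at all */
        }
        if (kvm_hv && backing_shift < eff) {
            return -1;               /* backing pages too small: can't start */
        }
        return eff;                  /* TCG (and KVM PR) ignore the backing size */
    }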

> While trying to figure out this, one of the things I attempted to
> do was run a guest in POWER8 compatibility mode on a POWER9 host
> and use hugepages for backing, but that didn't seem to work at
> all, possibly hinting at the fact that not all of the above is
> actually accurate and I need you to correct me :)
> 
> This is the command line I used:
> 
>   /usr/libexec/qemu-kvm \
>   -machine pseries,accel=kvm \
>   -cpu host,compat=power8 \
>   -m 2048 \
>   -mem-prealloc \
>   -mem-path /dev/hugepages \
>   -smp 8,sockets=8,cores=1,threads=1 \
>   -display none \
>   -no-user-config \
>   -nodefaults \
>   -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x2,drive=vda \
>   -drive file=/var/lib/libvirt/images/huge.qcow2,format=qcow2,if=none,id=vda \
>   -serial mon:stdio

Ok, so note that the scheme I'm talking about here is *not* merged as
yet.  The above command line will run the guest with 2MiB backing.

With the existing code that should work, but the guest will only be
able to use 64kiB pages.  If it didn't work at all.. there was a bug
fixed relatively recently that broke all hugepage backing, so you
could try updating to a more recent host kernel.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-26  0:55           ` David Gibson
@ 2018-04-26  8:45             ` Andrea Bolognani
  2018-04-27  2:14               ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-26  8:45 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Thu, 2018-04-26 at 10:55 +1000, David Gibson wrote:
> On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> > The new parameter would make it possible to make sure you will
> > actually be able to use the page size you're interested in inside
> > the guest, by preventing it from starting at all if the host didn't
> > provide big enough backing pages;
> 
> That's right
> 
> > it would also ensure the guest
> > gets access to different page sizes when running using TCG as an
> > accelerator instead of KVM.
> 
> Uh.. it would ensure the guest *doesn't* get access to different page
> sizes in TCG vs. KVM.  Is that what you meant to say?

Oops, looks like I accidentally a word there. Of course you got it
right and I meant exactly the opposite of what I actually wrote :/

> > For a KVM guest running on a POWER8 host, the matrix would look
> > like
> > 
> >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> >   -------- -------- -------- -------- -------- --------
> >    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> >   -------- -------- -------- -------- -------- --------
> >    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> > 
> > with backing page sizes from top to bottom, requested max page
> > sizes from left to right, actual max page sizes in the cells and
> > empty cells meaning the guest won't be able to start; on a POWER9
> > machine, the matrix would look like
> > 
> >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> >   -------- -------- -------- -------- -------- --------
> >     2 MiB | 64 KiB | 64 KiB |        |        |        |
> >   -------- -------- -------- -------- -------- --------
> >     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> >   -------- -------- -------- -------- -------- --------
> > 
> > instead, and finally on TCG the backing page size wouldn't matter
> > and you would simply have
> > 
> >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> >           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> >   -------- -------- -------- -------- -------- --------
> > 
> > Does everything up until here make sense?
> 
> Yes, that all looks right.

Cool.

Unfortunately, that pretty much seals the deal on libvirt *not* being
able to infer the value from other guest settings :(

The only reasonable candidate would be the size of host pages used for
backing guest memory; however

  * TCG, RPT and KVM PR guests can't infer anything from it, as they
    are not tied to it. Having different behaviors for TCG and KVM
    would be easy, but differentiating between HPT KVM HV guests and
    all other kinds is something we can't do at the moment, and that
    we have actively resisted doing in the past;

  * the user might want to limit things further, eg. preventing an
    HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
    using hugepages.

With the second use case in mind: would it make sense, or even be
possible, to make it so the capability works for RPT guests too?

Thinking even further, what about other architectures? Is this
something they might want to do too? The scenario I have in mind is
guests backed by regular pages being prevented from using hugepages
with the rationale that they wouldn't have the same performance
characteristics as if they were backed by hugepages; on the opposite
side of the spectrum, you might want to ensure the pages used to
back guest memory are as big as the biggest page you plan to use in
the guest, in order to guarantee the performance characteristics
fully match expectations.

> > While trying to figure out this, one of the things I attempted to
> > do was run a guest in POWER8 compatibility mode on a POWER9 host
> > and use hugepages for backing, but that didn't seem to work at
> > all, possibly hinting at the fact that not all of the above is
> > actually accurate and I need you to correct me :)
> > [...]
> 
> Ok, so note that the scheme I'm talking about here is *not* merged as
> yet.  The above command line will run the guest with 2MiB backing.
> 
> With the existing code that should work, but the guest will only be
> able to use 64kiB pages.

Understood: even without the ability to limit it further, the max
guest page size is obviously still capped by the backing page size.

> If it didn't work at all.. there was a bug
> fixed relatively recently that broke all hugepage backing, so you
> could try updating to a more recent host kernel.

That was probably it then!

I'll see whether I can get a newer kernel running on the host, but
my primary concern was not having gotten the command line (or the
concepts above) completely wrong :)

-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-26  8:45             ` Andrea Bolognani
@ 2018-04-27  2:14               ` David Gibson
  2018-04-27  8:31                 ` Andrea Bolognani
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2018-04-27  2:14 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

[-- Attachment #1: Type: text/plain, Size: 7053 bytes --]

On Thu, Apr 26, 2018 at 10:45:40AM +0200, Andrea Bolognani wrote:
> On Thu, 2018-04-26 at 10:55 +1000, David Gibson wrote:
> > On Wed, Apr 25, 2018 at 06:09:26PM +0200, Andrea Bolognani wrote:
> > > The new parameter would make it possible to make sure you will
> > > actually be able to use the page size you're interested in inside
> > > the guest, by preventing it from starting at all if the host didn't
> > > provide big enough backing pages;
> > 
> > That's right
> > 
> > > it would also ensure the guest
> > > gets access to different page sizes when running using TCG as an
> > > accelerator instead of KVM.
> > 
> > Uh.. it would ensure the guest *doesn't* get access to different page
> > sizes in TCG vs. KVM.  Is that what you meant to say?
> 
> Oops, looks like I accidentally a word there. Of course you got it
> right and I meant exactly the opposite of what I actually wrote :/

:)

> > > For a KVM guest running on a POWER8 host, the matrix would look
> > > like
> > > 
> > >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> > >   -------- -------- -------- -------- -------- --------
> > >    16 MiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> > >   -------- -------- -------- -------- -------- --------
> > >    16 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > > 
> > > with backing page sizes from top to bottom, requested max page
> > > sizes from left to right, actual max page sizes in the cells and
> > > empty cells meaning the guest won't be able to start; on a POWER9
> > > machine, the matrix would look like
> > > 
> > >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > >    64 KiB | 64 KiB | 64 KiB |        |        |        |
> > >   -------- -------- -------- -------- -------- --------
> > >     2 MiB | 64 KiB | 64 KiB |        |        |        |
> > >   -------- -------- -------- -------- -------- --------
> > >     1 GiB | 64 KiB | 64 KiB | 16 MiB | 16 MiB |        |
> > >   -------- -------- -------- -------- -------- --------
> > > 
> > > instead, and finally on TCG the backing page size wouldn't matter
> > > and you would simply have
> > > 
> > >     b \ m | 64 KiB |  2 MiB | 16 MiB |  1 GiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > >           | 64 KiB | 64 KiB | 16 MiB | 16 MiB | 16 GiB |
> > >   -------- -------- -------- -------- -------- --------
> > > 
> > > Does everything up until here make sense?
> > 
> > Yes, that all looks right.
> 
> Cool.
> 
> Unfortunately, that pretty much seals the deal on libvirt *not* being
> able to infer the value from other guest settings :(
> 
> The only reasonable candidate would be the size of host pages used for
> backing guest memory; however

Right.

>   * TCG, RPT and KVM PR guests can't infer anything from it, as they
>     are not tied to it. Having different behaviors for TCG and KVM
>     would be easy, but differentiating between HPT KVM HV guests and
>     all other kinds is something we can't do at the moment, and that
>     we have actively resisted doing in the past;

Yeah, I certainly wouldn't recommend that.  It's basically what we're
doing in qemu now, and I want to change, because it's a bad idea.

It still would be possible to key off the host side hugepage size, but
apply the limit to all VMs - in a sense crippling TCG guests to give
them matching behaviour to KVM guests.

>   * the user might want to limit things further, eg. preventing an
>     HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
>     using hugepages.

Right.. note that with the draft qemu patches a TCG guest will be
prevented from using hugepages *by default* (the default value of the
capability is 16).  You have to explicitly change it to allow
hugepages to be used in a TCG guest (but you don't have to supply
hugepage backing).
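
So, assuming the property name from the draft patches, something along the
lines of

  qemu-system-ppc64 -machine pseries,accel=tcg,cap-hpt-mps=24 ...

would be needed before a TCG guest sees 16MiB pages, whereas a plain
-machine pseries,accel=tcg guest stays limited to 64kiB pages.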

> With the second use case in mind: would it make sense, or even be
> possible, to make it so the capability works for RPT guests too?

Possible, maybe.. I think there's another property where RPT pagesizes
are advertised.  But I think it's a bad idea.  In order to have the
normal HPT case work consistently we need to set the default cap value
to 16 (64kiB page max).  If that applied to RPT guests as well, we'd
be unnecessarily crippling nearly all RPT guests.

> Thinking even further, what about other architectures? Is this
> something they might want to do too? The scenario I have in mind is
> guests backed by regular pages being prevented from using hugepages
> with the rationale that they wouldn't have the same performance
> characteristics as if they were backed by hugepages; on the opposite
> side of the spectrum, you might want to ensure the pages used to
> back guest memory are as big as the biggest page you plan to use in
> the guest, in order to guarantee the performance characteristics
> fully match expectations.

Hm, well, you'd have to ask other arch people if they see a use for
that.  It doesn't look very useful to me.  I don't think libvirt can
or should ensure identical performance characteristics for a guest
across all possible migrations.  But for HPT guests, it's not a matter
of performance characteristics: if it tries to use a large page size
and KVM doesn't have large enough backing pages, the guest will
quickly just freeze on a page fault that can never be satisfied.

> > > While trying to figure out this, one of the things I attempted to
> > > do was run a guest in POWER8 compatibility mode on a POWER9 host
> > > and use hugepages for backing, but that didn't seem to work at
> > > all, possibly hinting at the fact that not all of the above is
> > > actually accurate and I need you to correct me :)
> > > [...]
> > 
> > Ok, so note that the scheme I'm talking about here is *not* merged as
> > yet.  The above command line will run the guest with 2MiB backing.
> > 
> > With the existing code that should work, but the guest will only be
> > able to use 64kiB pages.
> 
> Understood: even without the ability to limit it further, the max
> guest page size is obviously still capped by the backing page size.
> 
> > If it didn't work at all.. there was a bug
> > fixed relatively recently that broke all hugepage backing, so you
> > could try updating to a more recent host kernel.
> 
> That was probably it then!
> 
> I'll see whether I can get a newer kernel running on the host, but
> my primary concern was not having gotten the command line (or the
> concepts above) completely wrong :)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-27  2:14               ` David Gibson
@ 2018-04-27  8:31                 ` Andrea Bolognani
  2018-04-27 12:17                   ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-04-27  8:31 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> On Thu, Apr 26, 2018 at 10:45:40AM +0200, Andrea Bolognani wrote:
> > Unfortunately, that pretty much seals the deal on libvirt *not* being
> > able to infer the value from other guest settings :(
> > 
> > The only reasonable candidate would be the size of host pages used for
> > backing guest memory; however
> 
> Right.
> 
> >   * TCG, RPT and KVM PR guests can't infer anything from it, as they
> >     are not tied to it. Having different behaviors for TCG and KVM
> >     would be easy, but differentiating between HPT KVM HV guests and
> >     all other kinds is something we can't do at the moment, and that
> >     we have actively resisted doing in the past;
> 
> Yeah, I certainly wouldn't recommend that.  It's basically what we're
> doing in qemu now, and I want to change, because it's a bad idea.
> 
> It still would be possible to key off the host side hugepage size, but
> apply the limit to all VMs - in a sense crippling TCG guests to give
> them matching behaviour to KVM guests.

As you yourself mention later...

> >   * the user might want to limit things further, eg. preventing an
> >     HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
> >     using hugepages.
> 
> Right.. note that with the draft qemu patches a TCG guest will be
> prevented from using hugepages *by default* (the default value of the
> capability is 16).  You have to explicitly change it to allow
> hugepages to be used in a TCG guest (but you don't have to supply
> hugepage backing).

... this will already happen. That's okay[1], we can't really
avoid it if we want to ensure consistent behavior between KVM and
TCG.

> > With the second use case in mind: would it make sense, or even be
> > possible, to make it so the capability works for RPT guests too?
> 
> Possible, maybe.. I think there's another property where RPT pagesizes
> are advertised.  But I think it's a bad idea.  In order to have the
> normal HPT case work consistently we need to set the default cap value
> to 16 (64kiB page max).  If that applied to RPT guests as well, we'd
> be unnecessarily crippling nearly all RPT guests.
> 
> > Thinking even further, what about other architectures? Is this
> > something they might want to do too? The scenario I have in mind is
> > guests backed by regular pages being prevented from using hugepages
> > with the rationale that they wouldn't have the same performance
> > characteristics as if they were backed by hugepages; on the opposite
> > side of the spectrum, you might want to ensure the pages used to
> > back guest memory are as big as the biggest page you plan to use in
> > the guest, in order to guarantee the performance characteristics
> > fully match expectations.
> 
> Hm, well, you'd have to ask other arch people if they see a use for
> that.  It doesn't look very useful to me.  I don't think libvirt can
> or should ensure identical performance characteristics for a guest
> across all possible migrations.  But for HPT guests, it's not a matter
> of performance characteristics: if it tries to use a large page size
> and KVM doesn't have large enough backing pages, the guest will
> quickly just freeze on a page fault that can never be satisfied.

I realize only HPT guests *need* this, but I was trying to figure
out whether giving the host administrator more control over the
guest page size could be a useful feature in other cases as well,
as it sounds to me like it's more generally applicable.

Users already need to opt-in to using hugepages in the host; asking
to opt-in to guest hugepages support as well doesn't seem too
outlandish to me.

Even if the specific flags required vary between architectures, we
could expose this in a unified fashion in libvirt. However, if this
is not something people would consider useful, we can just have a
pSeries-specific setting instead.


[1] That's of course assuming you have made sure the restriction
    only applies to the 2.13 machine type forward, and existing
    guests are not affected by the change.
-- 
Andrea Bolognani / Red Hat / Virtualization

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-27  8:31                 ` Andrea Bolognani
@ 2018-04-27 12:17                   ` David Gibson
  2018-05-07 13:48                     ` Andrea Bolognani
  0 siblings, 1 reply; 28+ messages in thread
From: David Gibson @ 2018-04-27 12:17 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

[-- Attachment #1: Type: text/plain, Size: 5643 bytes --]

On Fri, Apr 27, 2018 at 10:31:10AM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 10:45:40AM +0200, Andrea Bolognani wrote:
> > > Unfortunately, that pretty much seals the deal on libvirt *not* being
> > > able to infer the value from other guest settings :(
> > > 
> > > The only reasonable candidate would be the size of host pages used for
> > > backing guest memory; however
> > 
> > Right.
> > 
> > >   * TCG, RPT and KVM PR guests can't infer anything from it, as they
> > >     are not tied to it. Having different behaviors for TCG and KVM
> > >     would be easy, but differentiating between HPT KVM HV guests and
> > >     all other kinds is something we can't do at the moment, and that
> > >     we have actively resisted doing in the past;
> > 
> > Yeah, I certainly wouldn't recommend that.  It's basically what we're
> > doing in qemu now, and I want to change, because it's a bad idea.
> > 
> > It still would be possible to key off the host side hugepage size, but
> > apply the limit to all VMs - in a sense crippling TCG guests to give
> > them matching behaviour to KVM guests.
> 
> As you yourself mention later...

Right, I basically already made the decision to cripple TCG for KVM
compatibility at the qemu level (by default, at least).

> > >   * the user might want to limit things further, eg. preventing an
> > >     HPT KVM HV guest backed by 16 MiB pages or an HPT TCG guest from
> > >     using hugepages.
> > 
> > Right.. note that with the draft qemu patches a TCG guest will be
> > prevented from using hugepages *by default* (the default value of the
> > capability is 16).  You have to explicitly change it to allow
> > hugepages to be used in a TCG guest (but you don't have to supply
> > hugepage backing).
> 
> ... this will already happen. That's okay[1], we can't really
> avoid it if we want to ensure consistent behavior between KVM and
> TCG.

So.. regarding [1].  The draft patches *do* change the behaviour on
older machine types.  I'll consider revisiting that, but I'd need to
be convinced.  Basically we have to choose between consistency between
accelerators and consistency between versions.  I think the former is
the better choice; at least I think it is given that we *can* get both
for the overwhelmingly common case in production (KVM HV).

> > > With the second use case in mind: would it make sense, or even be
> > > possible, to make it so the capability works for RPT guests too?
> > 
> > Possible, maybe.. I think there's another property where RPT pagesizes
> > are advertised.  But I think it's a bad idea.  In order to have the
> > normal HPT case work consistently we need to set the default cap value
> > to 16 (64kiB page max).  If that applied to RPT guests as well, we'd
> > be unnecessarily crippling nearly all RPT guests.
> > 
> > > Thinking even further, what about other architectures? Is this
> > > something they might want to do too? The scenario I have in mind is
> > > guests backed by regular pages being prevented from using hugepages
> > > with the rationale that they wouldn't have the same performance
> > > characteristics as if they were backed by hugepages; on the opposite
> > > side of the spectrum, you might want to ensure the pages used to
> > > back guest memory are as big as the biggest page you plan to use in
> > > the guest, in order to guarantee the performance characteristics
> > > fully match expectations.
> > 
> > Hm, well, you'd have to ask other arch people if they see a use for
> > that.  It doesn't look very useful to me.  I don't think libvirt can
> > or should ensure identical performance characteristics for a guest
> > across all possible migrations.  But for HPT guests, it's not a matter
> > of performance characteristics: if it tries to use a large page size
> > and KVM doesn't have large enough backing pages, the guest will
> > quickly just freeze on a page fault that can never be satisfied.
> 
> I realize only HPT guests *need* this, but I was trying to figure
> out whether giving the host administrator more control over the
> guest page size could be a useful feature in other cases as well,
> as it sounds to me like it's more generally applicable.

Perhaps, but I don't see a strong case for it.

> Users already need to opt-in to using hugepages in the host; asking
> to opt-in to guest hugepages support as well doesn't seem too
> outlandish to me.
> 
> Even if the specific flags required vary between architectures, we
> could expose this in a unified fashion in libvirt. However, if this
> is not something people would consider useful, we can just have a
> pSeries-specific setting instead.

The trouble is, if we made a generic knob for limiting guest
pagesizes, then I think we'd need *another* pseries specific knob so
that we can set the HPT and RPT limits differently:

It seems the common case is going to be to allow unlimited RPT
pagesizes (because there's really no downside to that), but limited
HPT pagesizes, so that we can migrate without requiring hugepage
backing everywhere.  A single knob that applied in all cases wouldn't
let us do that.

> [1] That's of course assuming you have made sure the restriction
>     only applies to the 2.13 machine type forward, and existing
>     guests are not affected by the change.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property David Gibson
@ 2018-05-02 21:06   ` Murilo Opsfelder Araujo
  2018-05-03  1:34     ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Murilo Opsfelder Araujo @ 2018-05-02 21:06 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, abologna, aik, qemu-ppc, qemu-devel, clg

On Thu, Apr 19, 2018 at 04:29:11PM +1000, David Gibson wrote:
> The way the POWER Hash Page Table (HPT) MMU is virtualized by KVM HV means
> that every page that the guest puts in the pagetables must be truly
> physically contiguous, not just GPA-contiguous.  In effect this means that
> an HPT guest can't use any pagesizes greater than the host page size used
> to back its memory.
> 
> At present we handle this by changing what we advertise to the guest based
> on the backing pagesizes.  This is pretty bad, because it means the guest
> sees a different environment depending on what should be host configuration
> details.
> 
> As a start on fixing this, we add a new capability parameter to the pseries
> machine type which gives the maximum allowed pagesizes for an HPT guest (as
> a shift).  For now we just create and validate the parameter without making
> it do anything.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c         |  1 +
>  hw/ppc/spapr_caps.c    | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h |  4 +++-
>  3 files changed, 57 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index bdf72e1e89..36e41aff71 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -3876,6 +3876,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
>      smc->default_caps.caps[SPAPR_CAP_CFPC] = SPAPR_CAP_BROKEN;
>      smc->default_caps.caps[SPAPR_CAP_SBBC] = SPAPR_CAP_BROKEN;
>      smc->default_caps.caps[SPAPR_CAP_IBS] = SPAPR_CAP_BROKEN;
> +    smc->default_caps.caps[SPAPR_CAP_HPT_MPS] = 16; /* Allow 64kiB pages */
>      spapr_caps_add_properties(smc, &error_abort);
>  }
>  
> diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
> index 531e145114..cbc41f5b20 100644
> --- a/hw/ppc/spapr_caps.c
> +++ b/hw/ppc/spapr_caps.c
> @@ -27,6 +27,7 @@
>  #include "qapi/visitor.h"
>  #include "sysemu/hw_accel.h"
>  #include "target/ppc/cpu.h"
> +#include "target/ppc/mmu-hash64.h"
>  #include "cpu-models.h"
>  #include "kvm_ppc.h"
>  
> @@ -142,6 +143,39 @@ out:
>      g_free(val);
>  }
>  
> +static void spapr_cap_get_int(Object *obj, Visitor *v, const char *name,
> +                               void *opaque, Error **errp)
> +{
> +    sPAPRCapabilityInfo *cap = opaque;
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> +    int64_t value = spapr_get_cap(spapr, cap->index);
> +
> +    visit_type_int(v, name, &value, errp);
> +}
> +
> +static void spapr_cap_set_int(Object *obj, Visitor *v, const char *name,
> +                               void *opaque, Error **errp)
> +{
> +    sPAPRCapabilityInfo *cap = opaque;
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> +    int64_t value;
> +    Error *local_err = NULL;
> +
> +    visit_type_int(v, name, &value, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    if ((value < 0) || (value > 255)) {
> +        error_setg(errp, "Value for %s out of range (0..255)", name);
> +        return;
> +    }
> +
> +    spapr->cmd_line_caps[cap->index] = true;
> +    spapr->eff.caps[cap->index] = value;
> +}
> +

Hi, David.

Do you think uint8_t would fit better for spapr_[gs]et_int() functions?

Perhaps renaming them to spapr_[gs]et_int8() and calling
visit_type_int8() instead.

Cheers
Murilo

>  static void cap_htm_apply(sPAPRMachineState *spapr, uint8_t val, Error **errp)
>  {
>      if (!val) {
> @@ -265,6 +299,16 @@ static void cap_safe_indirect_branch_apply(sPAPRMachineState *spapr,
>  
>  #define VALUE_DESC_TRISTATE     " (broken, workaround, fixed)"
>  
> +static void cap_hpt_mps_apply(sPAPRMachineState *spapr,
> +                              uint8_t val, Error **errp)
> +{
> +    if (val < 12) {
> +        error_setg(errp, "Require at least 4kiB pages (cap-hpt-mps >= 12)");
> +    } else if (val < 16) {
> +        warn_report("Many PAPR guests require 64kiB pages (cap-hpt-mps >= 16)");
> +    }
> +}
> +
>  sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
>      [SPAPR_CAP_HTM] = {
>          .name = "htm",
> @@ -324,6 +368,15 @@ sPAPRCapabilityInfo capability_table[SPAPR_CAP_NUM] = {
>          .possible = &cap_ibs_possible,
>          .apply = cap_safe_indirect_branch_apply,
>      },
> +    [SPAPR_CAP_HPT_MPS] = {
> +        .name = "hpt-mps",
> +        .description = "Maximum page shift for Hash Page Table guests (12, 16, 24, 34)",
> +        .index = SPAPR_CAP_HPT_MPS,
> +        .get = spapr_cap_get_int,
> +        .set = spapr_cap_set_int,
> +        .type = "int",
> +        .apply = cap_hpt_mps_apply,
> +    },
>  };
>  
>  static sPAPRCapabilities default_caps_with_cpu(sPAPRMachineState *spapr,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index d60b7c6d7a..60ed3a5657 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -66,8 +66,10 @@ typedef enum {
>  #define SPAPR_CAP_SBBC                  0x04
>  /* Indirect Branch Serialisation */
>  #define SPAPR_CAP_IBS                   0x05
> +/* HPT Maximum Page Shift */
> +#define SPAPR_CAP_HPT_MPS               0x06
>  /* Num Caps */
> -#define SPAPR_CAP_NUM                   (SPAPR_CAP_IBS + 1)
> +#define SPAPR_CAP_NUM                   (SPAPR_CAP_HPT_MPS + 1)
>  
>  /*
>   * Capability Values
> -- 
> 2.14.3
> 
> 

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property
  2018-05-02 21:06   ` Murilo Opsfelder Araujo
@ 2018-05-03  1:34     ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-05-03  1:34 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo; +Cc: groug, abologna, aik, qemu-ppc, qemu-devel, clg

[-- Attachment #1: Type: text/plain, Size: 4007 bytes --]

On Wed, May 02, 2018 at 06:06:38PM -0300, Murilo Opsfelder Araujo wrote:
> On Thu, Apr 19, 2018 at 04:29:11PM +1000, David Gibson wrote:
> > The way the POWER Hash Page Table (HPT) MMU is virtualized by KVM HV means
> > that every page that the guest puts in the pagetables must be truly
> > physically contiguous, not just GPA-contiguous.  In effect this means that
> > an HPT guest can't use any pagesizes greater than the host page size used
> > to back its memory.
> > 
> > At present we handle this by changing what we advertise to the guest based
> > on the backing pagesizes.  This is pretty bad, because it means the guest
> > sees a different environment depending on what should be host configuration
> > details.
> > 
> > As a start on fixing this, we add a new capability parameter to the pseries
> > machine type which gives the maximum allowed pagesizes for an HPT guest (as
> > a shift).  For now we just create and validate the parameter without making
> > it do anything.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c         |  1 +
> >  hw/ppc/spapr_caps.c    | 53 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  include/hw/ppc/spapr.h |  4 +++-
> >  3 files changed, 57 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index bdf72e1e89..36e41aff71 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -3876,6 +3876,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >      smc->default_caps.caps[SPAPR_CAP_CFPC] = SPAPR_CAP_BROKEN;
> >      smc->default_caps.caps[SPAPR_CAP_SBBC] = SPAPR_CAP_BROKEN;
> >      smc->default_caps.caps[SPAPR_CAP_IBS] = SPAPR_CAP_BROKEN;
> > +    smc->default_caps.caps[SPAPR_CAP_HPT_MPS] = 16; /* Allow 64kiB pages */
> >      spapr_caps_add_properties(smc, &error_abort);
> >  }
> >  
> > diff --git a/hw/ppc/spapr_caps.c b/hw/ppc/spapr_caps.c
> > index 531e145114..cbc41f5b20 100644
> > --- a/hw/ppc/spapr_caps.c
> > +++ b/hw/ppc/spapr_caps.c
> > @@ -27,6 +27,7 @@
> >  #include "qapi/visitor.h"
> >  #include "sysemu/hw_accel.h"
> >  #include "target/ppc/cpu.h"
> > +#include "target/ppc/mmu-hash64.h"
> >  #include "cpu-models.h"
> >  #include "kvm_ppc.h"
> >  
> > @@ -142,6 +143,39 @@ out:
> >      g_free(val);
> >  }
> >  
> > +static void spapr_cap_get_int(Object *obj, Visitor *v, const char *name,
> > +                               void *opaque, Error **errp)
> > +{
> > +    sPAPRCapabilityInfo *cap = opaque;
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > +    int64_t value = spapr_get_cap(spapr, cap->index);
> > +
> > +    visit_type_int(v, name, &value, errp);
> > +}
> > +
> > +static void spapr_cap_set_int(Object *obj, Visitor *v, const char *name,
> > +                               void *opaque, Error **errp)
> > +{
> > +    sPAPRCapabilityInfo *cap = opaque;
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > +    int64_t value;
> > +    Error *local_err = NULL;
> > +
> > +    visit_type_int(v, name, &value, &local_err);
> > +    if (local_err) {
> > +        error_propagate(errp, local_err);
> > +        return;
> > +    }
> > +
> > +    if ((value < 0) || (value > 255)) {
> > +        error_setg(errp, "Value for %s out of range (0..255)", name);
> > +        return;
> > +    }
> > +
> > +    spapr->cmd_line_caps[cap->index] = true;
> > +    spapr->eff.caps[cap->index] = value;
> > +}
> > +
> 
> Hi, David.
> 
> Do you think uint8_t would fit better for spapr_[gs]et_int() functions?
> 
> Perhaps renaming them to spapr_[gs]et_int8() and calling
> visit_type_int8() instead.

Yeah, that's a good idea.  Using visit_type_uint8 means we don't need
to implement our own clamping.
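
Roughly, the setter would become something like this (an untested sketch,
just transcribing the existing function to uint8_t / visit_type_uint8(),
with the getter and the cap's .type string adjusted to match):

    static void spapr_cap_set_uint8(Object *obj, Visitor *v, const char *name,
                                    void *opaque, Error **errp)
    {
        sPAPRCapabilityInfo *cap = opaque;
        sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
        uint8_t value;
        Error *local_err = NULL;

        /* visit_type_uint8() already rejects out-of-range values */
        visit_type_uint8(v, name, &value, &local_err);
        if (local_err) {
            error_propagate(errp, local_err);
            return;
        }

        spapr->cmd_line_caps[cap->index] = true;
        spapr->eff.caps[cap->index] = value;
    }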

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 28+ messages in thread

* Re: [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes()
  2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes() David Gibson
@ 2018-05-03 15:57   ` Murilo Opsfelder Araujo
  2018-05-04  6:30     ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Murilo Opsfelder Araujo @ 2018-05-03 15:57 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, abologna, aik, qemu-ppc, qemu-devel, clg

On Thu, Apr 19, 2018 at 04:29:13PM +1000, David Gibson wrote:
> The paravirtualized PAPR platform sometimes needs to restrict the guest to
> using only some of the page sizes actually supported by the host's MMU.
> At the moment this is handled in KVM specific code, but for consistency we
> want to apply the same limitations to all accelerators.
> 
> This makes a start on this by providing a helper function in the cpu code
> to allow platform code to remove some of the cpu's page size definitions
> via a caller supplied callback.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  target/ppc/mmu-hash64.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
>  target/ppc/mmu-hash64.h |  3 +++
>  2 files changed, 62 insertions(+)
> 
> diff --git a/target/ppc/mmu-hash64.c b/target/ppc/mmu-hash64.c
> index a1db20e3a8..b6e62864fd 100644
> --- a/target/ppc/mmu-hash64.c
> +++ b/target/ppc/mmu-hash64.c
> @@ -1165,3 +1165,62 @@ const PPCHash64Options ppc_hash64_opts_POWER7 = {
>          },
>      }
>  };
> +
> +void ppc_hash64_filter_pagesizes(PowerPCCPU *cpu,
> +                                 bool (*cb)(void *, uint32_t, uint32_t),
> +                                 void *opaque)
> +{
> +    PPCHash64Options *opts = cpu->hash64_opts;
> +    int i;
> +    int n = 0;
> +    bool ci_largepage = false;
> +
> +    assert(opts);
> +
> +    n = 0;
> +    for (i = 0; i < ARRAY_SIZE(opts->sps); i++) {
> +        PPCHash64SegmentPageSizes *sps = &opts->sps[i];
> +        int j;
> +        int m = 0;
> +
> +        assert(n <= i);
> +
> +        if (!sps->page_shift) {
> +            break;
> +        }
> +
> +        for (j = 0; j < ARRAY_SIZE(sps->enc); j++) {
> +            PPCHash64PageSize *ps = &sps->enc[j];
> +
> +            assert(m <= j);
> +            if (!ps->page_shift) {
> +                break;
> +            }
> +
> +            if (cb(opaque, sps->page_shift, ps->page_shift)) {
> +                if (ps->page_shift == 16) {
> +                    ci_largepage = true;
> +                }
> +                sps->enc[m++] = *ps;
> +            }

Hi, David.

Is it possible for both sps->page_shift and ps->page_shift to have the
value 24?  That seems to be the case for spapr_pagesize_cb(), if I'm
reading it correctly.

Shouldn't ci_largepage also be set to true when ps->page_shift == 24?

Cheers
Murilo

> +        }
> +
> +        /* Clear rest of the row */
> +        for (j = m; j < ARRAY_SIZE(sps->enc); j++) {
> +            memset(&sps->enc[j], 0, sizeof(sps->enc[j]));
> +        }
> +
> +        if (m) {
> +            n++;
> +        }
> +    }
> +
> +    /* Clear the rest of the table */
> +    for (i = n; i < ARRAY_SIZE(opts->sps); i++) {
> +        memset(&opts->sps[i], 0, sizeof(opts->sps[i]));
> +    }
> +
> +    if (!ci_largepage) {
> +        opts->flags &= ~PPC_HASH64_CI_LARGEPAGE;
> +    }
> +}
> diff --git a/target/ppc/mmu-hash64.h b/target/ppc/mmu-hash64.h
> index f23b78d787..1aa2453497 100644
> --- a/target/ppc/mmu-hash64.h
> +++ b/target/ppc/mmu-hash64.h
> @@ -20,6 +20,9 @@ unsigned ppc_hash64_hpte_page_shift_noslb(PowerPCCPU *cpu,
>  void ppc_store_lpcr(PowerPCCPU *cpu, target_ulong val);
>  void ppc_hash64_init(PowerPCCPU *cpu);
>  void ppc_hash64_finalize(PowerPCCPU *cpu);
> +void ppc_hash64_filter_pagesizes(PowerPCCPU *cpu,
> +                                 bool (*cb)(void *, uint32_t, uint32_t),
> +                                 void *opaque);
>  #endif
>  
>  /*
> -- 
> 2.14.3
> 
> 


* Re: [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes()
  2018-05-03 15:57   ` Murilo Opsfelder Araujo
@ 2018-05-04  6:30     ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-05-04  6:30 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo; +Cc: groug, abologna, aik, qemu-ppc, qemu-devel, clg

On Thu, May 03, 2018 at 12:57:09PM -0300, Murilo Opsfelder Araujo wrote:
> On Thu, Apr 19, 2018 at 04:29:13PM +1000, David Gibson wrote:
> > The paravirtualized PAPR platform sometimes needs to restrict the guest to
> > using only some of the page sizes actually supported by the host's MMU.
> > At the moment this is handled in KVM specific code, but for consistency we
> > want to apply the same limitations to all accelerators.
> > 
> > This makes a start on this by providing a helper function in the cpu code
> > to allow platform code to remove some of the cpu's page size definitions
> > via a caller supplied callback.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  target/ppc/mmu-hash64.c | 59 +++++++++++++++++++++++++++++++++++++++++++++++++
> >  target/ppc/mmu-hash64.h |  3 +++
> >  2 files changed, 62 insertions(+)
> > 
> > diff --git a/target/ppc/mmu-hash64.c b/target/ppc/mmu-hash64.c
> > index a1db20e3a8..b6e62864fd 100644
> > --- a/target/ppc/mmu-hash64.c
> > +++ b/target/ppc/mmu-hash64.c
> > @@ -1165,3 +1165,62 @@ const PPCHash64Options ppc_hash64_opts_POWER7 = {
> >          },
> >      }
> >  };
> > +
> > +void ppc_hash64_filter_pagesizes(PowerPCCPU *cpu,
> > +                                 bool (*cb)(void *, uint32_t, uint32_t),
> > +                                 void *opaque)
> > +{
> > +    PPCHash64Options *opts = cpu->hash64_opts;
> > +    int i;
> > +    int n = 0;
> > +    bool ci_largepage = false;
> > +
> > +    assert(opts);
> > +
> > +    n = 0;
> > +    for (i = 0; i < ARRAY_SIZE(opts->sps); i++) {
> > +        PPCHash64SegmentPageSizes *sps = &opts->sps[i];
> > +        int j;
> > +        int m = 0;
> > +
> > +        assert(n <= i);
> > +
> > +        if (!sps->page_shift) {
> > +            break;
> > +        }
> > +
> > +        for (j = 0; j < ARRAY_SIZE(sps->enc); j++) {
> > +            PPCHash64PageSize *ps = &sps->enc[j];
> > +
> > +            assert(m <= j);
> > +            if (!ps->page_shift) {
> > +                break;
> > +            }
> > +
> > +            if (cb(opaque, sps->page_shift, ps->page_shift)) {
> > +                if (ps->page_shift == 16) {
> > +                    ci_largepage = true;
> > +                }
> > +                sps->enc[m++] = *ps;
> > +            }
> 
> Hi, David.
> 
> Is it possible for both sps->page_shift and ps->page_shift to have the
> value 24?  That seems to be the case for spapr_pagesize_cb(), if I'm
> reading it correctly.

Yes, that's possible; in fact, typical.

> Shouldn't ci_largepage also be set to true when ps->page_shift ==
> 24?

Ah.. yeah.. in order to match the existing logic this should be >=
16.  I've changed it in my draft.

Of course, I doubt that any guest would attempt to use 16MiB IO pages,
so I'm not sure it really matters in practice, but it's better to keep
the logic consistent with what we had before.
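
For reference, the adjustment in my draft amounts to the hunk below
(quoting from memory, so treat it as a sketch rather than the final
patch):

            if (cb(opaque, sps->page_shift, ps->page_shift)) {
                /* Keep the CI large-page flag as long as some large
                 * page (64kiB or bigger) survives the filter. */
                if (ps->page_shift >= 16) {
                    ci_largepage = true;
                }
                sps->enc[m++] = *ps;
            }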

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-04-27 12:17                   ` David Gibson
@ 2018-05-07 13:48                     ` Andrea Bolognani
  2018-06-14  1:52                       ` David Gibson
  0 siblings, 1 reply; 28+ messages in thread
From: Andrea Bolognani @ 2018-05-07 13:48 UTC (permalink / raw)
  To: David Gibson; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Fri, 2018-04-27 at 22:17 +1000, David Gibson wrote:
> On Fri, Apr 27, 2018 at 10:31:10AM +0200, Andrea Bolognani wrote:
> > On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> > > Right.. note that with the draft qemu patches a TCG guest will be
> > > prevented from using hugepages *by default* (the default value of the
> > > capability is 16).  You have to explicitly change it to allow
> > > hugepages to be used in a TCG guest (but you don't have to supply
> > > hugepage backing).
> > 
> > ... this will already happen. That's okay[1], we can't really
> > avoid it if we want to ensure consistent behavior between KVM and
> > TCG.
> 
> So.. regarding [1].  The draft patches *do* change the behaviour on
> older machine types.  I'll consider revisiting that, but I'd need to
> be convinced.  Basically we have to choose between consistency between
> accelerator and consistency between versions.  I think the former is
> the better choice; at least I think it is given that we *can* get both
> for the overwhelmingly common case in production (KVM HV).

Forgot to answer this point.

I agree that consistency between accelerators is the sane option
going forward, but changing the behavior for old machine types will
cause existing guests which have been using hugepages to lose the
ability to do so after being restarted on a newer QEMU.

Isn't that exactly the kind of scenario versioned machine types are
supposed to prevent?

-- 
Andrea Bolognani / Red Hat / Virtualization


* Re: [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling
  2018-05-07 13:48                     ` Andrea Bolognani
@ 2018-06-14  1:52                       ` David Gibson
  0 siblings, 0 replies; 28+ messages in thread
From: David Gibson @ 2018-06-14  1:52 UTC (permalink / raw)
  To: Andrea Bolognani; +Cc: groug, aik, qemu-ppc, qemu-devel, clg

On Mon, May 07, 2018 at 03:48:54PM +0200, Andrea Bolognani wrote:
> On Fri, 2018-04-27 at 22:17 +1000, David Gibson wrote:
> > On Fri, Apr 27, 2018 at 10:31:10AM +0200, Andrea Bolognani wrote:
> > > On Fri, 2018-04-27 at 12:14 +1000, David Gibson wrote:
> > > > Right.. note that with the draft qemu patches a TCG guest will be
> > > > prevented from using hugepages *by default* (the default value of the
> > > > capability is 16).  You have to explicitly change it to allow
> > > > hugepages to be used in a TCG guest (but you don't have to supply
> > > > hugepage backing).
> > > 
> > > ... this will already happen. That's okay[1], we can't really
> > > avoid it if we want to ensure consistent behavior between KVM and
> > > TCG.
> > 
> > So.. regarding [1].  The draft patches *do* change the behaviour on
> > older machine types.  I'll consider revisiting that, but I'd need to
> > be convinced.  Basically we have to choose between consistency between
> > accelerator and consistency between versions.  I think the former is
> > the better choice; at least I think it is given that we *can* get both
> > for the overwhelmingly common case in production (KVM HV).
> 
> Forgot to answer this point.
> 
> I agree that consistency between accelerators is the sane option
> going forward, but changing the behavior for old machine types will
> cause existing guests which have been using hugepages to lose the
> ability to do so after being restarted on a newer QEMU.
> 
> Isn't that exactly the kind of scenario versioned machine types are
> supposed to prevent?

Yeah, it is.  What I was questioning was whether it was important
enough for the case of TCG and PR guests (which have never been as
well supported) to justify keeping the other inconsistency.

On reflection, I think it probably is important enough.  I also think
I have a way to preserve the old behaviour for old machine types
without having to keep around masses of the old code, so I'll adjust
that for the next spin.
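
Roughly, the idea is to relax the default only on the old versioned
machine types, along these lines (purely illustrative -- the cap name
is the one from this series, and the exact value and compat macros may
change by the next spin):

static void spapr_machine_2_12_class_options(MachineClass *mc)
{
    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);

    spapr_machine_2_13_class_options(mc);
    SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_12);

    /* Old machine types keep the old behaviour: don't constrain guest
     * pagesizes beyond what the backing memory allows, i.e. leave the
     * cap effectively unlimited by default. */
    smc->default_caps.caps[SPAPR_CAP_HPT_MPS] = 64;
}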

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


end of thread

Thread overview: 28+ messages
2018-04-19  6:29 [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 1/7] spapr: Maximum (HPT) pagesize property David Gibson
2018-05-02 21:06   ` Murilo Opsfelder Araujo
2018-05-03  1:34     ` David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 2/7] spapr: Use maximum page size capability to simplify memory backend checking David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 3/7] target/ppc: Add ppc_hash64_filter_pagesizes() David Gibson
2018-05-03 15:57   ` Murilo Opsfelder Araujo
2018-05-04  6:30     ` David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 4/7] spapr: Add cpu_apply hook to capabilities David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 5/7] spapr: Limit available pagesizes to provide a consistent guest environment David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 6/7] spapr: Don't rewrite mmu capabilities in KVM mode David Gibson
2018-04-19  6:29 ` [Qemu-devel] [RFC for-2.13 7/7] spapr_pci: Remove unhelpful pagesize warning David Gibson
2018-04-19 15:30 ` [Qemu-devel] [RFC for-2.13 0/7] spapr: Clean up pagesize handling Andrea Bolognani
2018-04-20  2:35   ` David Gibson
2018-04-20  9:31     ` Andrea Bolognani
2018-04-20 10:21       ` David Gibson
2018-04-23  8:31         ` Andrea Bolognani
2018-04-24  1:26           ` David Gibson
2018-04-24 15:35         ` Andrea Bolognani
2018-04-25  6:32           ` David Gibson
2018-04-25 16:09         ` Andrea Bolognani
2018-04-26  0:55           ` David Gibson
2018-04-26  8:45             ` Andrea Bolognani
2018-04-27  2:14               ` David Gibson
2018-04-27  8:31                 ` Andrea Bolognani
2018-04-27 12:17                   ` David Gibson
2018-05-07 13:48                     ` Andrea Bolognani
2018-06-14  1:52                       ` David Gibson
