* [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests
@ 2016-12-12  4:05 David Gibson
  2016-12-12  4:05 ` [Qemu-devel] [PATCHv3 1/5] pseries: Add pseries-2.9 machine type David Gibson
                   ` (4 more replies)
  0 siblings, 5 replies; 20+ messages in thread
From: David Gibson @ 2016-12-12  4:05 UTC (permalink / raw)
  To: paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel, David Gibson

This series implements the host (qemu) side of the PAPR ACR
(Architecture Change Request) that allows runtime resizing of the Hash
Page Table (HPT) for pseries guests.  Exercising this feature requires
a guest OS which is also aware of it.  Patches implementing the guest
side in Linux have just been submitted upstream:

https://lists.ozlabs.org/pipermail/linuxppc-dev/2016-December/152164.html

Availability of the feature is controlled by a new 'resize-hpt'
machine option, which can be set to "disabled", "enabled" or
"required".  "required" means that qemu will refuse to boot a guest
which is not aware of the HPT resizing feature (it will instead quit
during feature negotiation).  This is potentially useful because
guests which don't support resizing will need an HPT sized for their
maximum possible RAM, which can be very wasteful of host resources.
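
For example (an illustrative command line, not taken from this
series), the option is given as a machine property:

    qemu-system-ppc64 -machine pseries-2.9,resize-hpt=required ...

With "required", a guest kernel that doesn't advertise HPT resize
support is refused at feature negotiation time.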

So far this is only tested with TCG guests.  PR KVM guests should work
in theory, but I haven't been able to test them yet due to (probably
unrelated) problems booting a suitable guest kernel.  HV KVM guests
will require substantial extra work in the host kernel; development is
in progress.

Changes since v2:
 * Some clearer comments based on review
 * Some minor cleanups based on review 

David Gibson (5):
  pseries: Add pseries-2.9 machine type
  pseries: Stubs for HPT resizing
  pseries: Implement HPT resizing
  pseries: Enable HPT resizing for 2.9
  pseries: Use smaller default hash page tables when guest can resize

 hw/ppc/spapr.c              | 126 ++++++++++++--
 hw/ppc/spapr_hcall.c        | 407 ++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/trace-events         |   2 +
 include/hw/compat.h         |   3 +
 include/hw/ppc/spapr.h      |  19 +++
 include/hw/ppc/spapr_ovec.h |   1 +
 target-ppc/kvm.c            |  25 +++
 target-ppc/kvm_ppc.h        |   5 +
 target-ppc/mmu-hash64.h     |   4 +
 9 files changed, 582 insertions(+), 10 deletions(-)

-- 
2.9.3

^ permalink raw reply	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCHv3 1/5] pseries: Add pseries-2.9 machine type
  2016-12-12  4:05 [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests David Gibson
@ 2016-12-12  4:05 ` David Gibson
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing David Gibson
                   ` (3 subsequent siblings)
  4 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-12-12  4:05 UTC (permalink / raw)
  To: paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel, David Gibson

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Thomas Huth <thuth@redhat.com>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
---
 hw/ppc/spapr.c      | 23 +++++++++++++++++++++--
 include/hw/compat.h |  3 +++
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index cfadc46..0f25e83 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2761,18 +2761,37 @@ static const TypeInfo spapr_machine_info = {
     type_init(spapr_machine_register_##suffix)
 
 /*
+ * pseries-2.9
+ */
+static void spapr_machine_2_9_instance_options(MachineState *machine)
+{
+}
+
+static void spapr_machine_2_9_class_options(MachineClass *mc)
+{
+    /* Defaults for the latest behaviour inherited from the base class */
+}
+
+DEFINE_SPAPR_MACHINE(2_9, "2.9", true);
+
+/*
  * pseries-2.8
  */
+#define SPAPR_COMPAT_2_8                            \
+    HW_COMPAT_2_8
+
 static void spapr_machine_2_8_instance_options(MachineState *machine)
 {
+    spapr_machine_2_9_instance_options(machine);
 }
 
 static void spapr_machine_2_8_class_options(MachineClass *mc)
 {
-    /* Defaults for the latest behaviour inherited from the base class */
+    spapr_machine_2_9_class_options(mc);
+    SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_8);
 }
 
-DEFINE_SPAPR_MACHINE(2_8, "2.8", true);
+DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
 
 /*
  * pseries-2.7
diff --git a/include/hw/compat.h b/include/hw/compat.h
index 0f06e11..f8b8354 100644
--- a/include/hw/compat.h
+++ b/include/hw/compat.h
@@ -1,6 +1,9 @@
 #ifndef HW_COMPAT_H
 #define HW_COMPAT_H
 
+#define HW_COMPAT_2_8 \
+    /* empty */
+
 #define HW_COMPAT_2_7 \
     {\
         .driver   = "virtio-pci",\
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing
  2016-12-12  4:05 [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests David Gibson
  2016-12-12  4:05 ` [Qemu-devel] [PATCHv3 1/5] pseries: Add pseries-2.9 machine type David Gibson
@ 2016-12-12  4:06 ` David Gibson
  2016-12-13 14:29   ` Laurent Vivier
  2016-12-14  5:22   ` Suraj Jitindar Singh
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 3/5] pseries: Implement " David Gibson
                   ` (2 subsequent siblings)
  4 siblings, 2 replies; 20+ messages in thread
From: David Gibson @ 2016-12-12  4:06 UTC (permalink / raw)
  To: paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel, David Gibson

This introduces stub implementations of the H_RESIZE_HPT_PREPARE and
H_RESIZE_HPT_COMMIT hypercalls, which we hope to add in a PAPR
extension to allow runtime resizing of a guest's hash page table.  It
also adds a new machine property controlling whether this new facility
is available, and logic to check the requested setting against what is
available with KVM (only supported with KVM PR for now).

Finally, it adds a new string to the hypertas property in the device
tree, advertising to the guest the availability of the HPT resizing
hypercalls.  This is a tentative suggested value, and would need to be
standardized by PAPR before being merged.
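
To sketch the effect (the surrounding strings are examples only), the
guest would then see the new string in the rtas node of the device
tree:

    ibm,hypertas-functions = ..., "hcall-multi-tce", "hcall-hpt-resize", ...;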

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c         | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_hcall.c   | 36 ++++++++++++++++++++++++
 hw/ppc/trace-events    |  2 ++
 include/hw/ppc/spapr.h | 11 ++++++++
 target-ppc/kvm.c       | 25 +++++++++++++++++
 target-ppc/kvm_ppc.h   |  5 ++++
 6 files changed, 154 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 0f25e83..846ce51 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -760,6 +760,11 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
     if (!kvm_enabled() || kvmppc_spapr_use_multitce()) {
         add_str(hypertas, "hcall-multi-tce");
     }
+
+    if (spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) {
+        add_str(hypertas, "hcall-hpt-resize");
+    }
+
     _FDT(fdt_setprop(fdt, rtas, "ibm,hypertas-functions",
                      hypertas->str, hypertas->len));
     g_string_free(hypertas, TRUE);
@@ -1839,11 +1844,40 @@ static void ppc_spapr_init(MachineState *machine)
     long load_limit, fw_size;
     char *filename;
     int smt = kvmppc_smt_threads();
+    Error *resize_hpt_err = NULL;
 
     msi_nonbroken = true;
 
     QLIST_INIT(&spapr->phbs);
 
+    /* Check HPT resizing availability */
+    kvmppc_check_papr_resize_hpt(&resize_hpt_err);
+    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DEFAULT) {
+        /*
+         * If the user explicitly requested a mode we should either
+         * supply it, or fail completely (which we do below).  But if
+         * it's not set explicitly, we reset our mode to something
+         * that works
+         */
+        if (resize_hpt_err) {
+            spapr->resize_hpt = SPAPR_RESIZE_HPT_DISABLED;
+            error_free(resize_hpt_err);
+            resize_hpt_err = NULL;
+        } else {
+            spapr->resize_hpt = smc->resize_hpt_default;
+        }
+    }
+
+    assert(spapr->resize_hpt != SPAPR_RESIZE_HPT_DEFAULT);
+
+    if ((spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) && resize_hpt_err) {
+        /*
+         * User requested HPT resize, but this host can't supply it.  Bail out
+         */
+        error_report_err(resize_hpt_err);
+        exit(1);
+    }
+
     /* Allocate RMA if necessary */
     rma_alloc_size = kvmppc_alloc_rma(&rma);
 
@@ -2236,6 +2270,40 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
     spapr->use_hotplug_event_source = value;
 }
 
+static char *spapr_get_resize_hpt(Object *obj, Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+
+    switch (spapr->resize_hpt) {
+    case SPAPR_RESIZE_HPT_DEFAULT:
+        return g_strdup("default");
+    case SPAPR_RESIZE_HPT_DISABLED:
+        return g_strdup("disabled");
+    case SPAPR_RESIZE_HPT_ENABLED:
+        return g_strdup("enabled");
+    case SPAPR_RESIZE_HPT_REQUIRED:
+        return g_strdup("required");
+    }
+    assert(0);
+}
+
+static void spapr_set_resize_hpt(Object *obj, const char *value, Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+
+    if (strcmp(value, "default") == 0) {
+        spapr->resize_hpt = SPAPR_RESIZE_HPT_DEFAULT;
+    } else if (strcmp(value, "disabled") == 0) {
+        spapr->resize_hpt = SPAPR_RESIZE_HPT_DISABLED;
+    } else if (strcmp(value, "enabled") == 0) {
+        spapr->resize_hpt = SPAPR_RESIZE_HPT_ENABLED;
+    } else if (strcmp(value, "required") == 0) {
+        spapr->resize_hpt = SPAPR_RESIZE_HPT_REQUIRED;
+    } else {
+        error_setg(errp, "Bad value for \"resize-hpt\" property");
+    }
+}
+
 static void spapr_machine_initfn(Object *obj)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
@@ -2256,6 +2324,12 @@ static void spapr_machine_initfn(Object *obj)
                                     " place of standard EPOW events when possible"
                                     " (required for memory hot-unplug support)",
                                     NULL);
+
+    object_property_add_str(obj, "resize-hpt",
+                            spapr_get_resize_hpt, spapr_set_resize_hpt, NULL);
+    object_property_set_description(obj, "resize-hpt",
+                                    "Resizing of the Hash Page Table (enabled, disabled, required)",
+                                    NULL);
 }
 
 static void spapr_machine_finalizefn(Object *obj)
@@ -2707,6 +2781,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
 
     smc->dr_lmb_enabled = true;
     smc->tcg_default_cpu = "POWER8";
+    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
     mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
     fwc->get_dev_path = spapr_get_fw_dev_path;
     nc->nmi_monitor_handler = spapr_nmi;
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index fd9f1d4..72a9c4d 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -355,6 +355,38 @@ static target_ulong h_read(PowerPCCPU *cpu, sPAPRMachineState *spapr,
     return H_SUCCESS;
 }
 
+static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         target_ulong opcode,
+                                         target_ulong *args)
+{
+    target_ulong flags = args[0];
+    target_ulong shift = args[1];
+
+    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
+        return H_AUTHORITY;
+    }
+
+    trace_spapr_h_resize_hpt_prepare(flags, shift);
+    return H_HARDWARE;
+}
+
+static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
+                                        sPAPRMachineState *spapr,
+                                        target_ulong opcode,
+                                        target_ulong *args)
+{
+    target_ulong flags = args[0];
+    target_ulong shift = args[1];
+
+    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
+        return H_AUTHORITY;
+    }
+
+    trace_spapr_h_resize_hpt_commit(flags, shift);
+    return H_HARDWARE;
+}
+
 static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState *spapr,
                                 target_ulong opcode, target_ulong *args)
 {
@@ -1134,6 +1166,10 @@ static void hypercall_register_types(void)
     /* hcall-bulk */
     spapr_register_hypercall(H_BULK_REMOVE, h_bulk_remove);
 
+    /* hcall-hpt-resize */
+    spapr_register_hypercall(H_RESIZE_HPT_PREPARE, h_resize_hpt_prepare);
+    spapr_register_hypercall(H_RESIZE_HPT_COMMIT, h_resize_hpt_commit);
+
     /* hcall-splpar */
     spapr_register_hypercall(H_REGISTER_VPA, h_register_vpa);
     spapr_register_hypercall(H_CEDE, h_cede);
diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
index 2297ead..bf59a8f 100644
--- a/hw/ppc/trace-events
+++ b/hw/ppc/trace-events
@@ -16,6 +16,8 @@ spapr_cas_continue(unsigned long n) "Copy changes to the guest: %ld bytes"
 # hw/ppc/spapr_hcall.c
 spapr_cas_pvr_try(uint32_t pvr) "%x"
 spapr_cas_pvr(uint32_t cur_pvr, bool cpu_match, uint32_t new_pvr, uint64_t pcr) "current=%x, cpu_match=%u, new=%x, compat flags=%"PRIx64
+spapr_h_resize_hpt_prepare(uint64_t flags, uint64_t shift) "flags=0x%"PRIx64", shift=%"PRIu64
+spapr_h_resize_hpt_commit(uint64_t flags, uint64_t shift) "flags=0x%"PRIx64", shift=%"PRIu64
 
 # hw/ppc/spapr_iommu.c
 spapr_iommu_put(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tce=0x%"PRIx64" ret=%"PRId64
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index a2d8964..d2c758b 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -31,6 +31,13 @@ typedef struct sPAPRMachineState sPAPRMachineState;
 #define SPAPR_MACHINE_CLASS(klass) \
     OBJECT_CLASS_CHECK(sPAPRMachineClass, klass, TYPE_SPAPR_MACHINE)
 
+typedef enum {
+    SPAPR_RESIZE_HPT_DEFAULT = 0,
+    SPAPR_RESIZE_HPT_DISABLED,
+    SPAPR_RESIZE_HPT_ENABLED,
+    SPAPR_RESIZE_HPT_REQUIRED,
+} sPAPRResizeHPT;
+
 /**
  * sPAPRMachineClass:
  */
@@ -46,6 +53,7 @@ struct sPAPRMachineClass {
                           uint64_t *buid, hwaddr *pio, 
                           hwaddr *mmio32, hwaddr *mmio64,
                           unsigned n_dma, uint32_t *liobns, Error **errp);
+    sPAPRResizeHPT resize_hpt_default;
 };
 
 /**
@@ -61,6 +69,7 @@ struct sPAPRMachineState {
     XICSState *xics;
     DeviceState *rtc;
 
+    sPAPRResizeHPT resize_hpt;
     void *htab;
     uint32_t htab_shift;
     hwaddr rma_size;
@@ -347,6 +356,8 @@ struct sPAPRMachineState {
 #define H_XIRR_X                0x2FC
 #define H_RANDOM                0x300
 #define H_SET_MODE              0x31C
+#define H_RESIZE_HPT_PREPARE    0x36C
+#define H_RESIZE_HPT_COMMIT     0x370
 #define H_SIGNAL_SYS_RESET      0x380
 #define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
 
diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
index 15e12f3..39e5753 100644
--- a/target-ppc/kvm.c
+++ b/target-ppc/kvm.c
@@ -22,6 +22,7 @@
 #include <linux/kvm.h>
 
 #include "qemu-common.h"
+#include "qapi/error.h"
 #include "qemu/error-report.h"
 #include "cpu.h"
 #include "qemu/timer.h"
@@ -2672,3 +2673,27 @@ int kvmppc_enable_hwrng(void)
 
     return kvmppc_enable_hcall(kvm_state, H_RANDOM);
 }
+
+void kvmppc_check_papr_resize_hpt(Error **errp)
+{
+    if (!kvm_enabled()) {
+        return;
+    }
+
+    /* TODO: Check specific capabilities for HPT resize aware host kernels */
+
+    /*
+     * It's tempting to try to check directly if the HPT is under our
+     * control or KVM's, which is what's really relevant here.
+     * Unfortunately, in order to correctly size the HPT, we need to
+     * know if we can do resizing, _before_ we attempt to allocate it
+     * with KVM.  Before that point, we don't officially know whether
+     * we'll control the HPT or not.  So we have to use a fallback
+     * test for PR vs HV KVM to predict that.
+     */
+    if (kvmppc_is_pr(kvm_state)) {
+        return;
+    }
+
+    error_setg(errp, "Hash page table resizing not available with this KVM version");
+}
diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
index 841a29b..3e852ba 100644
--- a/target-ppc/kvm_ppc.h
+++ b/target-ppc/kvm_ppc.h
@@ -59,6 +59,7 @@ bool kvmppc_has_cap_htm(void);
 int kvmppc_enable_hwrng(void);
 int kvmppc_put_books_sregs(PowerPCCPU *cpu);
 PowerPCCPUClass *kvm_ppc_get_host_cpu_class(void);
+void kvmppc_check_papr_resize_hpt(Error **errp);
 
 #else
 
@@ -270,6 +271,10 @@ static inline PowerPCCPUClass *kvm_ppc_get_host_cpu_class(void)
     return NULL;
 }
 
+static inline void kvmppc_check_papr_resize_hpt(Error **errp)
+{
+    return;
+}
 #endif
 
 #ifndef CONFIG_KVM
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-12  4:05 [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests David Gibson
  2016-12-12  4:05 ` [Qemu-devel] [PATCHv3 1/5] pseries: Add pseries-2.9 machine type David Gibson
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing David Gibson
@ 2016-12-12  4:06 ` David Gibson
  2016-12-14  5:30   ` Suraj Jitindar Singh
  2016-12-14  5:35   ` Suraj Jitindar Singh
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9 David Gibson
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 5/5] pseries: Use smaller default hash page tables when guest can resize David Gibson
  4 siblings, 2 replies; 20+ messages in thread
From: David Gibson @ 2016-12-12  4:06 UTC (permalink / raw)
  To: paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel, David Gibson

This patch implements hypercalls allowing a PAPR guest to resize its own
hash page table.  This will eventually allow for more flexible memory
hotplug.

The implementation is partially asynchronous: preparation of the new
HPT is handled in a special thread running the hpt_prepare_thread()
function.  The state of a pending resize is stored in
sPAPRMachineState::pending_hpt.

The H_RESIZE_HPT_PREPARE hypercall will kick off creation of a new HPT, or,
if one is already in progress, monitor it for completion.  If there is an
existing HPT resize in progress that doesn't match the size specified in
the call, it will cancel it, replacing it with a new one matching the
given size.

The H_RESIZE_HPT_COMMIT hypercall completes the transition to a resized
HPT, and can only succeed once H_RESIZE_HPT_PREPARE has completed
initialization of the new HPT.  The guest must ensure that there
are no concurrent accesses to the existing HPT while this is called (this
effectively means stop_machine() for Linux guests).

For now H_RESIZE_HPT_COMMIT goes through the whole old HPT, rehashing each
HPTE into the new HPT.  This can have quite high latency, but it seems to
be of the order of typical migration downtime latencies for HPTs of size
up to ~2GiB (which would be used in a 256GiB guest).

In future we probably want to move more of the rehashing to the "prepare"
phase, by having H_ENTER and other hcalls update both current and
pending HPTs.  That's a project for another day, but should be possible
without any changes to the guest interface.
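
For reference, the guest-side flow this is designed for looks roughly
like the sketch below (the hcall() helper and the exact retry handling
are illustrative only, not part of this series):

    long rc;

    /* Kick off, and then poll, preparation of the new HPT */
    do {
        rc = hcall(H_RESIZE_HPT_PREPARE, 0 /* flags */, new_shift);
    } while (rc == H_LONG_BUSY_ORDER_100_MSEC);

    if (rc == H_SUCCESS) {
        /* Quiesce all other HPT users (stop_machine() on Linux) ... */
        rc = hcall(H_RESIZE_HPT_COMMIT, 0 /* flags */, new_shift);
        /* ... then resume; on success the guest now uses the new HPT */
    }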

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c          |   4 +-
 hw/ppc/spapr_hcall.c    | 345 +++++++++++++++++++++++++++++++++++++++++++++++-
 include/hw/ppc/spapr.h  |   6 +
 target-ppc/mmu-hash64.h |   4 +
 4 files changed, 353 insertions(+), 6 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 846ce51..d057031 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -93,8 +93,6 @@
 
 #define PHANDLE_XICP            0x00001111
 
-#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
-
 static XICSState *try_create_xics(const char *type, int nr_servers,
                                   int nr_irqs, Error **errp)
 {
@@ -1055,7 +1053,7 @@ static void close_htab_fd(sPAPRMachineState *spapr)
     spapr->htab_fd = -1;
 }
 
-static int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
+int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
 {
     int shift;
 
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 72a9c4d..1e89061 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -2,6 +2,7 @@
 #include "qapi/error.h"
 #include "sysemu/sysemu.h"
 #include "qemu/log.h"
+#include "qemu/error-report.h"
 #include "cpu.h"
 #include "exec/exec-all.h"
 #include "helper_regs.h"
@@ -355,20 +356,319 @@ static target_ulong h_read(PowerPCCPU *cpu, sPAPRMachineState *spapr,
     return H_SUCCESS;
 }
 
+struct sPAPRPendingHPT {
+    /* These fields are read-only after initialization */
+    int shift;
+    QemuThread thread;
+
+    /* These fields are protected by the BQL */
+    bool complete;
+
+    /* These fields are private to the preparation thread if
+     * !complete, otherwise protected by the BQL */
+    int ret;
+    void *hpt;
+};
+
+static void free_pending_hpt(sPAPRPendingHPT *pending)
+{
+    if (pending->hpt) {
+        qemu_vfree(pending->hpt);
+    }
+
+    g_free(pending);
+}
+
+static void *hpt_prepare_thread(void *opaque)
+{
+    sPAPRPendingHPT *pending = opaque;
+    size_t size = 1ULL << pending->shift;
+
+    pending->hpt = qemu_memalign(size, size);
+    if (pending->hpt) {
+        memset(pending->hpt, 0, size);
+        pending->ret = H_SUCCESS;
+    } else {
+        pending->ret = H_NO_MEM;
+    }
+
+    qemu_mutex_lock_iothread();
+
+    if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt == pending) {
+        /* Ready to go */
+        pending->complete = true;
+    } else {
+        /* We've been cancelled, clean ourselves up */
+        free_pending_hpt(pending);
+    }
+
+    qemu_mutex_unlock_iothread();
+    return NULL;
+}
+
+/* Must be called with BQL held */
+static void cancel_hpt_prepare(sPAPRMachineState *spapr)
+{
+    sPAPRPendingHPT *pending = spapr->pending_hpt;
+
+    /* Let the thread know it's cancelled */
+    spapr->pending_hpt = NULL;
+
+    if (!pending) {
+        /* Nothing to do */
+        return;
+    }
+
+    if (!pending->complete) {
+        /* thread will clean itself up */
+        return;
+    }
+
+    free_pending_hpt(pending);
+}
+
+static int build_dimm_list(Object *obj, void *opaque)
+{
+    GSList **list = opaque;
+
+    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
+        DeviceState *dev = DEVICE(obj);
+        if (dev->realized) { /* only realized DIMMs matter */
+            *list = g_slist_prepend(*list, dev);
+        }
+    }
+
+    object_child_foreach(obj, build_dimm_list, opaque);
+    return 0;
+}
+
+static ram_addr_t get_current_ram_size(void)
+{
+    GSList *list = NULL, *item;
+    ram_addr_t size = ram_size;
+
+    build_dimm_list(qdev_get_machine(), &list);
+    for (item = list; item; item = g_slist_next(item)) {
+        Object *obj = OBJECT(item->data);
+        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
+            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
+                                            &error_abort);
+        }
+    }
+    g_slist_free(list);
+
+    return size;
+}
+
 static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
                                          sPAPRMachineState *spapr,
                                          target_ulong opcode,
                                          target_ulong *args)
 {
     target_ulong flags = args[0];
-    target_ulong shift = args[1];
+    int shift = args[1];
+    sPAPRPendingHPT *pending = spapr->pending_hpt;
 
     if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
         return H_AUTHORITY;
     }
 
     trace_spapr_h_resize_hpt_prepare(flags, shift);
-    return H_HARDWARE;
+
+    if (flags != 0) {
+        return H_PARAMETER;
+    }
+
+    if (shift && ((shift < 18) || (shift > 46))) {
+        return H_PARAMETER;
+    }
+
+    if (pending) {
+        /* something already in progress */
+        if (pending->shift == shift) {
+            /* and it's suitable */
+            if (pending->complete) {
+                return pending->ret;
+            } else {
+                return H_LONG_BUSY_ORDER_100_MSEC;
+            }
+        }
+
+        /* not suitable, cancel and replace */
+        cancel_hpt_prepare(spapr);
+    }
+
+    if (!shift) {
+        /* nothing to do */
+        return H_SUCCESS;
+    }
+
+    /* start new prepare */
+
+    /* We only allow the guest to allocate an HPT one order above what
+     * we'd normally give them (to stop a small guest claiming a huge
+     * chunk of resources in the HPT) */
+    if (shift > (spapr_hpt_shift_for_ramsize(get_current_ram_size()) + 1)) {
+        return H_RESOURCE;
+    }
+
+    pending = g_new0(sPAPRPendingHPT, 1);
+    pending->shift = shift;
+    pending->ret = H_HARDWARE;
+
+    qemu_thread_create(&pending->thread, "sPAPR HPT prepare",
+                       hpt_prepare_thread, pending, QEMU_THREAD_DETACHED);
+
+    spapr->pending_hpt = pending;
+
+    /* In theory we could estimate the time more accurately based on
+     * the new size, but there's not much point */
+    return H_LONG_BUSY_ORDER_100_MSEC;
+}
+
+static uint64_t new_hpte_load0(void *htab, uint64_t pteg, int slot)
+{
+    uint8_t *addr = htab;
+
+    addr += pteg * HASH_PTEG_SIZE_64;
+    addr += slot * HASH_PTE_SIZE_64;
+    return  ldq_p(addr);
+}
+
+static void new_hpte_store(void *htab, uint64_t pteg, int slot,
+                           uint64_t pte0, uint64_t pte1)
+{
+    uint8_t *addr = htab;
+
+    addr += pteg * HASH_PTEG_SIZE_64;
+    addr += slot * HASH_PTE_SIZE_64;
+
+    stq_p(addr, pte0);
+    stq_p(addr + HASH_PTE_SIZE_64/2, pte1);
+}
+
+static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
+                       void *old, uint64_t oldsize,
+                       void *new, uint64_t newsize,
+                       uint64_t pteg, int slot)
+{
+    uint64_t old_hash_mask = (oldsize >> 7) - 1;
+    uint64_t new_hash_mask = (newsize >> 7) - 1;
+    target_ulong pte0 = ppc_hash64_load_hpte0(cpu, token, slot);
+    target_ulong pte1;
+    uint64_t avpn;
+    unsigned shift;
+    uint64_t hash, new_pteg, replace_pte0;
+
+    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
+        return H_SUCCESS;
+    }
+
+    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
+
+    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
+    assert(shift); /* H_ENTER should never have allowed a bad encoding */
+    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
+
+    if (pte0 & HPTE64_V_SECONDARY) {
+        pteg = ~pteg;
+    }
+
+    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
+        uint64_t offset, vsid;
+
+        /* We only have 28 - 23 bits of offset in avpn */
+        offset = (avpn & 0x1f) << 23;
+        vsid = avpn >> 5;
+        /* We can find more bits from the pteg value */
+        if (shift < 23) {
+            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
+        }
+
+        hash = vsid ^ (offset >> shift);
+    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
+        uint64_t offset, vsid;
+
+        /* We only have 40 - 23 bits of seg_off in avpn */
+        offset = (avpn & 0x1ffff) << 23;
+        vsid = avpn >> 17;
+        if (shift < 23) {
+            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask) << shift;
+        }
+
+        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
+    } else {
+        error_report("rehash_pte: Bad segment size in HPTE");
+        return H_HARDWARE;
+    }
+
+    new_pteg = hash & new_hash_mask;
+    if (pte0 & HPTE64_V_SECONDARY) {
+        assert(~pteg == (hash & old_hash_mask));
+        new_pteg = ~new_pteg;
+    } else {
+        assert(pteg == (hash & old_hash_mask));
+    }
+    assert((oldsize != newsize) || (pteg == new_pteg));
+    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
+    if (replace_pte0 & HPTE64_V_VALID) {
+        assert(newsize < oldsize);
+        if (replace_pte0 & HPTE64_V_BOLTED) {
+            if (pte0 & HPTE64_V_BOLTED) {
+                /* Bolted collision, nothing we can do */
+                return H_PTEG_FULL;
+            } else {
+                /* Discard this hpte */
+                return H_SUCCESS;
+            }
+        }
+    }
+
+    new_hpte_store(new, new_pteg, slot, pte0, pte1);
+    return H_SUCCESS;
+}
+
+static int rehash_hpt(PowerPCCPU *cpu,
+                      void *old, uint64_t oldsize,
+                      void *new, uint64_t newsize)
+{
+    CPUPPCState *env = &cpu->env;
+    uint64_t n_ptegs = oldsize >> 7;
+    uint64_t pteg;
+    int slot;
+    int rc;
+
+    assert(env->external_htab == old);
+
+    for (pteg = 0; pteg < n_ptegs; pteg++) {
+        uint64_t token = ppc_hash64_start_access(cpu, pteg * HPTES_PER_GROUP);
+
+        if (!token) {
+            return H_HARDWARE;
+        }
+
+        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
+            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
+                             pteg, slot);
+            if (rc != H_SUCCESS) {
+                ppc_hash64_stop_access(cpu, token);
+                return rc;
+            }
+        }
+        ppc_hash64_stop_access(cpu, token);
+    }
+
+    return H_SUCCESS;
+}
+
+static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
+    PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+    cpu_synchronize_state(cs);
+    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
+                                &error_fatal);
 }
 
 static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
@@ -378,13 +678,52 @@ static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
 {
     target_ulong flags = args[0];
     target_ulong shift = args[1];
+    sPAPRPendingHPT *pending = spapr->pending_hpt;
+    int rc;
+    size_t newsize;
 
     if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
         return H_AUTHORITY;
     }
 
     trace_spapr_h_resize_hpt_commit(flags, shift);
-    return H_HARDWARE;
+
+    if (flags != 0) {
+        return H_PARAMETER;
+    }
+
+    if (!pending || (pending->shift != shift)) {
+        /* no matching prepare */
+        return H_CLOSED;
+    }
+
+    if (!pending->complete) {
+        /* prepare has not completed */
+        return H_BUSY;
+    }
+
+    newsize = 1ULL << pending->shift;
+    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
+                    pending->hpt, newsize);
+    if (rc == H_SUCCESS) {
+        CPUState *cs;
+
+        qemu_vfree(spapr->htab);
+        spapr->htab = pending->hpt;
+        spapr->htab_shift = pending->shift;
+
+        CPU_FOREACH(cs) {
+            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
+        }
+
+        pending->hpt = NULL; /* so it's not free()d */
+    }
+
+    /* Clean up */
+    spapr->pending_hpt = NULL;
+    free_pending_hpt(pending);
+
+    return rc;
 }
 
 static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState *spapr,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d2c758b..954bada 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -14,6 +14,7 @@ struct sPAPRNVRAM;
 typedef struct sPAPRConfigureConnectorState sPAPRConfigureConnectorState;
 typedef struct sPAPREventLogEntry sPAPREventLogEntry;
 typedef struct sPAPREventSource sPAPREventSource;
+typedef struct sPAPRPendingHPT sPAPRPendingHPT;
 
 #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
 #define SPAPR_ENTRY_POINT       0x100
@@ -72,6 +73,8 @@ struct sPAPRMachineState {
     sPAPRResizeHPT resize_hpt;
     void *htab;
     uint32_t htab_shift;
+    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
+
     hwaddr rma_size;
     int vrma_adjust;
     ssize_t rtas_size;
@@ -627,6 +630,7 @@ void spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType drc_type,
                                                uint32_t count, uint32_t index);
 void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
                                     sPAPRMachineState *spapr);
+int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
 
 /* rtas-configure-connector state */
 struct sPAPRConfigureConnectorState {
@@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
 
 void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data arg);
 
+#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
+
 #endif /* HW_SPAPR_H */
diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
index ab5d347..4a1b781 100644
--- a/target-ppc/mmu-hash64.h
+++ b/target-ppc/mmu-hash64.h
@@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState *env);
 #define HASH_PTE_SIZE_64        16
 #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)
 
+#define HPTE64_V_SSIZE          SLB_VSID_B
+#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
+#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
 #define HPTE64_V_SSIZE_SHIFT    62
 #define HPTE64_V_AVPN_SHIFT     7
 #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
 #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >> HPTE64_V_AVPN_SHIFT)
 #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) & 0xffffffffffffff83ULL))
+#define HPTE64_V_BOLTED         0x0000000000000010ULL
 #define HPTE64_V_LARGE          0x0000000000000004ULL
 #define HPTE64_V_SECONDARY      0x0000000000000002ULL
 #define HPTE64_V_VALID          0x0000000000000001ULL
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9
  2016-12-12  4:05 [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests David Gibson
                   ` (2 preceding siblings ...)
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 3/5] pseries: Implement " David Gibson
@ 2016-12-12  4:06 ` David Gibson
  2016-12-14  5:32   ` Suraj Jitindar Singh
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 5/5] pseries: Use smaller default hash page tables when guest can resize David Gibson
  4 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2016-12-12  4:06 UTC (permalink / raw)
  To: paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel, David Gibson

We've now implemented a PAPR extension which allows PAPR guests (i.e.
the "pseries" machine type) to resize their hash page table at runtime.

However, that extension is only enabled if explicitly chosen on the
command line.  This patch enables it by default for pseries-2.9, but
leaves it disabled (by default) for older machine types.
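
Illustratively, with this patch applied the defaults and overrides
behave as follows (machine types as named in this series):

    -machine pseries-2.9                       resize-hpt defaults to "enabled"
    -machine pseries-2.8                       resize-hpt defaults to "disabled"
    -machine pseries-2.8,resize-hpt=enabled    explicit setting overrides the default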

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
Reviewed-by: Laurent Vivier <lvivier@redhat.com>
---
 hw/ppc/spapr.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d057031..f05d0e5 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2779,7 +2779,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
 
     smc->dr_lmb_enabled = true;
     smc->tcg_default_cpu = "POWER8";
-    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
+    smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
     mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
     fwc->get_dev_path = spapr_get_fw_dev_path;
     nc->nmi_monitor_handler = spapr_nmi;
@@ -2860,8 +2860,11 @@ static void spapr_machine_2_8_instance_options(MachineState *machine)
 
 static void spapr_machine_2_8_class_options(MachineClass *mc)
 {
+    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
     spapr_machine_2_9_class_options(mc);
     SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_8);
+    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
 }
 
 DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* [Qemu-devel] [PATCHv3 5/5] pseries: Use smaller default hash page tables when guest can resize
  2016-12-12  4:05 [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests David Gibson
                   ` (3 preceding siblings ...)
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9 David Gibson
@ 2016-12-12  4:06 ` David Gibson
  2016-12-14  5:56   ` Suraj Jitindar Singh
  4 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2016-12-12  4:06 UTC (permalink / raw)
  To: paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel, David Gibson

We've now implemented a PAPR extension allowing PAPR guests to resize
their hash page table (HPT) at runtime.

This patch makes use of that facility to allocate smaller HPTs by
default.  Specifically, when a guest is aware of the HPT resize
facility, qemu sizes the HPT for the initial memory size rather than
the maximum memory size, on the assumption that the guest will resize
its HPT itself if necessary for hot-plugged memory.

When the initial memory size is much smaller than the maximum memory
size (a common configuration with e.g. oVirt / RHEV), this can save
significant memory on the HPT.
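
As a rough worked example (using the ~1/128-of-RAM HPT sizing implied
by the figures in patch 3/5): a guest with 2GiB of initial memory and
256GiB of maximum memory previously got a ~2GiB HPT up front; with this
change a resize-aware guest starts with a ~16MiB HPT and only grows it
if memory is actually hot plugged.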

If the guest does *not* advertise HPT resize awareness when it makes the
ibm,client-architecture-support call, qemu resizes the HPT for the
maximum memory size (unless it's been configured not to allow such
guests at all).

For now we make that reallocation assuming the guest has not yet used the
HPT at all.  That's true in practice, but not, strictly, an architectural
or PAPR requirement.  If we need to in future we can fix this by having
the client-architecture-support call reboot the guest with the revised
HPT size (the client-architecture-support call is explicitly permitted to
trigger a reboot in this way).

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 hw/ppc/spapr.c              | 21 ++++++++++++++++-----
 hw/ppc/spapr_hcall.c        | 32 ++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h      |  2 ++
 include/hw/ppc/spapr_ovec.h |  1 +
 4 files changed, 51 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index f05d0e5..51b189d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1066,8 +1066,8 @@ int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
     return shift;
 }
 
-static void spapr_reallocate_hpt(sPAPRMachineState *spapr, int shift,
-                                 Error **errp)
+void spapr_reallocate_hpt(sPAPRMachineState *spapr, int shift,
+                          Error **errp)
 {
     long rc;
 
@@ -1140,14 +1140,20 @@ static void ppc_spapr_reset(void)
     hwaddr rtas_addr, fdt_addr;
     void *fdt;
     int rc;
+    int hpt_shift;
 
     /* Check for unknown sysbus devices */
     foreach_dynamic_sysbus_device(find_unknown_sysbus_device, NULL);
 
     /* Allocate and/or reset the hash page table */
-    spapr_reallocate_hpt(spapr,
-                         spapr_hpt_shift_for_ramsize(machine->maxram_size),
-                         &error_fatal);
+    if ((spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED)
+        || (spapr->cas_reboot
+            && !spapr_ovec_test(spapr->ov5_cas, OV5_HPT_RESIZE))) {
+        hpt_shift = spapr_hpt_shift_for_ramsize(machine->maxram_size);
+    } else {
+        hpt_shift = spapr_hpt_shift_for_ramsize(machine->ram_size);
+    }
+    spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
 
     /* Update the RMA size if necessary */
     if (spapr->vrma_adjust) {
@@ -1941,6 +1947,11 @@ static void ppc_spapr_init(MachineState *machine)
         spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
     }
 
+    /* advertise support for HPT resizing */
+    if (spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) {
+        spapr_ovec_set(spapr->ov5, OV5_HPT_RESIZE);
+    }
+
     /* init CPUs */
     if (machine->cpu_model == NULL) {
         machine->cpu_model = kvm_enabled() ? "host" : smc->tcg_default_cpu;
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 1e89061..5bebfc3 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1415,6 +1415,38 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu_,
 
     ov5_guest = spapr_ovec_parse_vector(ov_table, 5);
 
+    /*
+     * HPT resizing is a bit of a special case, because when enabled
+     * we assume the guest will support it until it says it doesn't,
+     * instead of assuming it won't support it until it says it does.
+     * Strictly speaking that approach could break for guests which
+     * don't make a CAS call, but those are so old we don't care about
+     * them.  Without that assumption we'd have to make at least a
+     * temporary allocation of an HPT sized for max memory, which
+     * could be impossibly difficult under KVM HV if maxram is large.
+     */
+    if (!spapr_ovec_test(ov5_guest, OV5_HPT_RESIZE)) {
+        int maxshift = spapr_hpt_shift_for_ramsize(MACHINE(spapr)->maxram_size);
+
+        if (spapr->resize_hpt == SPAPR_RESIZE_HPT_REQUIRED) {
+            error_report(
+                "h_client_architecture_support: Guest doesn't support HPT resizing, but resize-hpt=required");
+            exit(1);
+        }
+
+        if (spapr->htab_shift < maxshift) {
+            CPUState *cs;
+            /* Guest doesn't know about HPT resizing, so we
+             * pre-emptively resize for the maximum permitted RAM.  At
+             * the point this is called, nothing should have been
+             * entered into the existing HPT */
+            spapr_reallocate_hpt(spapr, maxshift, &error_fatal);
+            CPU_FOREACH(cs) {
+                run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
+            }
+        }
+    }
+
     /* NOTE: there are actually a number of ov5 bits where input from the
      * guest is always zero, and the platform/QEMU enables them independently
      * of guest input. To model these properly we'd want some sort of mask,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 954bada..ff3bf87 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -631,6 +631,8 @@ void spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType drc_type,
 void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
                                     sPAPRMachineState *spapr);
 int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
+void spapr_reallocate_hpt(sPAPRMachineState *spapr, int shift,
+                          Error **errp);
 
 /* rtas-configure-connector state */
 struct sPAPRConfigureConnectorState {
diff --git a/include/hw/ppc/spapr_ovec.h b/include/hw/ppc/spapr_ovec.h
index 355a344..f5fed87 100644
--- a/include/hw/ppc/spapr_ovec.h
+++ b/include/hw/ppc/spapr_ovec.h
@@ -47,6 +47,7 @@ typedef struct sPAPROptionVector sPAPROptionVector;
 #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
 #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
 #define OV5_HP_EVT              OV_BIT(6, 5)
+#define OV5_HPT_RESIZE          OV_BIT(6, 7)
 
 /* interfaces */
 sPAPROptionVector *spapr_ovec_new(void);
-- 
2.9.3

^ permalink raw reply related	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing David Gibson
@ 2016-12-13 14:29   ` Laurent Vivier
  2016-12-14  6:12     ` David Gibson
  2016-12-14  5:22   ` Suraj Jitindar Singh
  1 sibling, 1 reply; 20+ messages in thread
From: Laurent Vivier @ 2016-12-13 14:29 UTC (permalink / raw)
  To: David Gibson, paulus, sjitindarsingh
  Cc: agraf, mdroth, thuth, qemu-ppc, qemu-devel



On 12/12/2016 05:06, David Gibson wrote:
> This introduces stub implementations of the H_RESIZE_HPT_PREPARE and
> H_RESIZE_HPT_COMMIT hypercalls which we hope to add in a PAPR
> extension to allow run time resizing of a guest's hash page table.  It
> also adds a new machine property for controlling whether this new
> facility is available, and logic to check that against availability
> with KVM (only supported with KVM PR for now).
> 
> Finally, it adds a new string to the hypertas property in the device
> tree, advertising to the guest the availability of the HPT resizing
> hypercalls.  This is a tentative suggested value, and would need to be
> standardized by PAPR before being merged.

Could you explain somewhere what the aim of the "flags" parameter is?
It would help to understand why we have it, as it is not used.

Thanks,
Laurent

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing David Gibson
  2016-12-13 14:29   ` Laurent Vivier
@ 2016-12-14  5:22   ` Suraj Jitindar Singh
  2016-12-15  1:07     ` David Gibson
  1 sibling, 1 reply; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14  5:22 UTC (permalink / raw)
  To: David Gibson, paulus; +Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> This introduces stub implementations of the H_RESIZE_HPT_PREPARE and
> H_RESIZE_HPT_COMMIT hypercalls which we hope to add in a PAPR
> extension to allow run time resizing of a guest's hash page
> table.  It
> also adds a new machine property for controlling whether this new
> facility is available, and logic to check that against availability
> with KVM (only supported with KVM PR for now).
> 
> Finally, it adds a new string to the hypertas property in the device
> tree, advertising to the guest the availability of the HPT resizing
> hypercalls.  This is a tentative suggested value, and would need to
> be
> standardized by PAPR before being merged.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c         | 75
> ++++++++++++++++++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_hcall.c   | 36 ++++++++++++++++++++++++
>  hw/ppc/trace-events    |  2 ++
>  include/hw/ppc/spapr.h | 11 ++++++++
>  target-ppc/kvm.c       | 25 +++++++++++++++++
>  target-ppc/kvm_ppc.h   |  5 ++++
>  6 files changed, 154 insertions(+)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 0f25e83..846ce51 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -760,6 +760,11 @@ static void spapr_dt_rtas(sPAPRMachineState
> *spapr, void *fdt)
>      if (!kvm_enabled() || kvmppc_spapr_use_multitce()) {
>          add_str(hypertas, "hcall-multi-tce");
>      }
> +
> +    if (spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) {
> +        add_str(hypertas, "hcall-hpt-resize");
> +    }
> +
>      _FDT(fdt_setprop(fdt, rtas, "ibm,hypertas-functions",
>                       hypertas->str, hypertas->len));
>      g_string_free(hypertas, TRUE);
> @@ -1839,11 +1844,40 @@ static void ppc_spapr_init(MachineState
> *machine)
>      long load_limit, fw_size;
>      char *filename;
>      int smt = kvmppc_smt_threads();
> +    Error *resize_hpt_err = NULL;
>  
>      msi_nonbroken = true;
>  
>      QLIST_INIT(&spapr->phbs);
>  
> +    /* Check HPT resizing availability */
> +    kvmppc_check_papr_resize_hpt(&resize_hpt_err);
> +    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DEFAULT) {
> +        /*
> +         * If the user explicitly requested a mode we should either
> +         * supply it, or fail completely (which we do below).  But
> if
> +         * it's not set explicitly, we reset our mode to something
> +         * that works
> +         */
> +        if (resize_hpt_err) {
> +            spapr->resize_hpt = SPAPR_RESIZE_HPT_DISABLED;
> +            error_free(resize_hpt_err);
> +            resize_hpt_err = NULL;
> +        } else {
> +            spapr->resize_hpt = smc->resize_hpt_default;
> +        }
> +    }
> +
> +    assert(spapr->resize_hpt != SPAPR_RESIZE_HPT_DEFAULT);
> +
> +    if ((spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) &&
> resize_hpt_err) {
> +        /*
> +         * User requested HPT resize, but this host can't supply
> it.  Bail out
> +         */
> +        error_report_err(resize_hpt_err);
> +        exit(1);
> +    }
> +
>      /* Allocate RMA if necessary */
>      rma_alloc_size = kvmppc_alloc_rma(&rma);
>  
> @@ -2236,6 +2270,40 @@ static void
> spapr_set_modern_hotplug_events(Object *obj, bool value,
>      spapr->use_hotplug_event_source = value;
>  }
>  
> +static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> +{
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> +
> +    switch (spapr->resize_hpt) {
> +    case SPAPR_RESIZE_HPT_DEFAULT:
> +        return g_strdup("default");
> +    case SPAPR_RESIZE_HPT_DISABLED:
> +        return g_strdup("disabled");
> +    case SPAPR_RESIZE_HPT_ENABLED:
> +        return g_strdup("enabled");
> +    case SPAPR_RESIZE_HPT_REQUIRED:
> +        return g_strdup("required");
> +    }
> +    assert(0);
> +}
> +
> +static void spapr_set_resize_hpt(Object *obj, const char *value,
> Error **errp)
> +{
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> +
> +    if (strcmp(value, "default") == 0) {
> +        spapr->resize_hpt = SPAPR_RESIZE_HPT_DEFAULT;
> +    } else if (strcmp(value, "disabled") == 0) {
> +        spapr->resize_hpt = SPAPR_RESIZE_HPT_DISABLED;
> +    } else if (strcmp(value, "enabled") == 0) {
> +        spapr->resize_hpt = SPAPR_RESIZE_HPT_ENABLED;
> +    } else if (strcmp(value, "required") == 0) {
> +        spapr->resize_hpt = SPAPR_RESIZE_HPT_REQUIRED;
> +    } else {
> +        error_setg(errp, "Bad value for \"resize-hpt\" property");
> +    }
> +}
> +
>  static void spapr_machine_initfn(Object *obj)
>  {
>      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> @@ -2256,6 +2324,12 @@ static void spapr_machine_initfn(Object *obj)
>                                      " place of standard EPOW events
> when possible"
>                                      " (required for memory hot-
> unplug support)",
>                                      NULL);
> +
> +    object_property_add_str(obj, "resize-hpt",
> +                            spapr_get_resize_hpt,
> spapr_set_resize_hpt, NULL);
> +    object_property_set_description(obj, "resize-hpt",
> +                                    "Resizing of the Hash Page Table
> (enabled, disabled, required)",
> +                                    NULL);
>  }
>  
>  static void spapr_machine_finalizefn(Object *obj)
> @@ -2707,6 +2781,7 @@ static void
> spapr_machine_class_init(ObjectClass *oc, void *data)
>  
>      smc->dr_lmb_enabled = true;
>      smc->tcg_default_cpu = "POWER8";
> +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
>      fwc->get_dev_path = spapr_get_fw_dev_path;
>      nc->nmi_monitor_handler = spapr_nmi;
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index fd9f1d4..72a9c4d 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -355,6 +355,38 @@ static target_ulong h_read(PowerPCCPU *cpu,
> sPAPRMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         target_ulong opcode,
> +                                         target_ulong *args)
> +{
> +    target_ulong flags = args[0];
> +    target_ulong shift = args[1];
> +
> +    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> +        return H_AUTHORITY;
> +    }
> +
> +    trace_spapr_h_resize_hpt_prepare(flags, shift);
> +    return H_HARDWARE;
> +}
> +
> +static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> +                                        sPAPRMachineState *spapr,
> +                                        target_ulong opcode,
> +                                        target_ulong *args)
> +{
> +    target_ulong flags = args[0];
> +    target_ulong shift = args[1];
> +
> +    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> +        return H_AUTHORITY;
> +    }
> +
> +    trace_spapr_h_resize_hpt_commit(flags, shift);
> +    return H_HARDWARE;
> +}
> +
>  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState
> *spapr,
>                                  target_ulong opcode, target_ulong
> *args)
>  {
> @@ -1134,6 +1166,10 @@ static void hypercall_register_types(void)
>      /* hcall-bulk */
>      spapr_register_hypercall(H_BULK_REMOVE, h_bulk_remove);
>  
> +    /* hcall-hpt-resize */
> +    spapr_register_hypercall(H_RESIZE_HPT_PREPARE,
> h_resize_hpt_prepare);
> +    spapr_register_hypercall(H_RESIZE_HPT_COMMIT,
> h_resize_hpt_commit);
> +
>      /* hcall-splpar */
>      spapr_register_hypercall(H_REGISTER_VPA, h_register_vpa);
>      spapr_register_hypercall(H_CEDE, h_cede);
> diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
> index 2297ead..bf59a8f 100644
> --- a/hw/ppc/trace-events
> +++ b/hw/ppc/trace-events
> @@ -16,6 +16,8 @@ spapr_cas_continue(unsigned long n) "Copy changes
> to the guest: %ld bytes"
>  # hw/ppc/spapr_hcall.c
>  spapr_cas_pvr_try(uint32_t pvr) "%x"
>  spapr_cas_pvr(uint32_t cur_pvr, bool cpu_match, uint32_t new_pvr,
> uint64_t pcr) "current=%x, cpu_match=%u, new=%x, compat
> flags=%"PRIx64
> +spapr_h_resize_hpt_prepare(uint64_t flags, uint64_t shift)
> "flags=0x%"PRIx64", shift=%"PRIu64
> +spapr_h_resize_hpt_commit(uint64_t flags, uint64_t shift)
> "flags=0x%"PRIx64", shift=%"PRIu64
>  
>  # hw/ppc/spapr_iommu.c
>  spapr_iommu_put(uint64_t liobn, uint64_t ioba, uint64_t tce,
> uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tce=0x%"PRIx64"
> ret=%"PRId64
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index a2d8964..d2c758b 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -31,6 +31,13 @@ typedef struct sPAPRMachineState
> sPAPRMachineState;
>  #define SPAPR_MACHINE_CLASS(klass) \
>      OBJECT_CLASS_CHECK(sPAPRMachineClass, klass, TYPE_SPAPR_MACHINE)
>  
> +typedef enum {
> +    SPAPR_RESIZE_HPT_DEFAULT = 0,
> +    SPAPR_RESIZE_HPT_DISABLED,
> +    SPAPR_RESIZE_HPT_ENABLED,
> +    SPAPR_RESIZE_HPT_REQUIRED,
> +} sPAPRResizeHPT;
> +
>  /**
>   * sPAPRMachineClass:
>   */
> @@ -46,6 +53,7 @@ struct sPAPRMachineClass {
>                            uint64_t *buid, hwaddr *pio, 
>                            hwaddr *mmio32, hwaddr *mmio64,
>                            unsigned n_dma, uint32_t *liobns, Error
> **errp);
> +    sPAPRResizeHPT resize_hpt_default;
>  };
>  
>  /**
> @@ -61,6 +69,7 @@ struct sPAPRMachineState {
>      XICSState *xics;
>      DeviceState *rtc;
>  
> +    sPAPRResizeHPT resize_hpt;
>      void *htab;
>      uint32_t htab_shift;
>      hwaddr rma_size;
> @@ -347,6 +356,8 @@ struct sPAPRMachineState {
>  #define H_XIRR_X                0x2FC
>  #define H_RANDOM                0x300
>  #define H_SET_MODE              0x31C
> +#define H_RESIZE_HPT_PREPARE    0x36C
> +#define H_RESIZE_HPT_COMMIT     0x370
>  #define H_SIGNAL_SYS_RESET      0x380
>  #define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
>  
> diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> index 15e12f3..39e5753 100644
> --- a/target-ppc/kvm.c
> +++ b/target-ppc/kvm.c
> @@ -22,6 +22,7 @@
>  #include <linux/kvm.h>
>  
>  #include "qemu-common.h"
> +#include "qapi/error.h"
>  #include "qemu/error-report.h"
>  #include "cpu.h"
>  #include "qemu/timer.h"
> @@ -2672,3 +2673,27 @@ int kvmppc_enable_hwrng(void)
>  
>      return kvmppc_enable_hcall(kvm_state, H_RANDOM);
>  }
> +
> +void kvmppc_check_papr_resize_hpt(Error **errp)
> +{
> +    if (!kvm_enabled()) {
> +        return;
> +    }
> +
> +    /* TODO: Check specific capabilities for HPT resize aware host
> kernels */
> +
> +    /*
> +     * It's tempting to try to check directly if the HPT is under
> our
> +     * control or KVM's, which is what's really relevant here.
> +     * Unfortunately, in order to correctly size the HPT, we need to
> +     * know if we can do resizing, _before_ we attempt to allocate
> it
> +     * with KVM.  Before that point, we don't officially know
> whether
> +     * we'll control the HPT or not.  So we have to use a fallback
> +     * test for PR vs HV KVM to predict that.
> +     */
> +    if (kvmppc_is_pr(kvm_state)) {
> +        return;
> +    }
> +
> +    error_setg(errp, "Hash page table resizing not available with
> this KVM version");
> +}
> diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> index 841a29b..3e852ba 100644
> --- a/target-ppc/kvm_ppc.h
> +++ b/target-ppc/kvm_ppc.h
> @@ -59,6 +59,7 @@ bool kvmppc_has_cap_htm(void);
>  int kvmppc_enable_hwrng(void);
>  int kvmppc_put_books_sregs(PowerPCCPU *cpu);
>  PowerPCCPUClass *kvm_ppc_get_host_cpu_class(void);
> +void kvmppc_check_papr_resize_hpt(Error **errp);
>  
>  #else
>  
> @@ -270,6 +271,10 @@ static inline PowerPCCPUClass
> *kvm_ppc_get_host_cpu_class(void)
>      return NULL;
>  }
>  
> +static inline void kvmppc_check_papr_resize_hpt(Error **errp)
> +{
> +    return;
> +}
>  #endif
>  
>  #ifndef CONFIG_KVM
Since you're adding a new machine option, it would be nice if it were
documented in the help message.

Either way it all seems sane:

Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 3/5] pseries: Implement " David Gibson
@ 2016-12-14  5:30   ` Suraj Jitindar Singh
  2016-12-14  6:20     ` David Gibson
  2016-12-15  0:59     ` David Gibson
  2016-12-14  5:35   ` Suraj Jitindar Singh
  1 sibling, 2 replies; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14  5:30 UTC (permalink / raw)
  To: David Gibson, paulus; +Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> This patch implements hypercalls allowing a PAPR guest to resize its
> own
> hash page table.  This will eventually allow for more flexible memory
> hotplug.
> 
> The implementation is partially asynchronous, handled in a special
> thread
> running the hpt_prepare_thread() function.  The state of a pending
> resize
> is stored in SPAPR_MACHINE->pending_hpt.
> 
> The H_RESIZE_HPT_PREPARE hypercall will kick off creation of a new
> HPT, or,
> if one is already in progress, monitor it for completion.  If there
> is an
> existing HPT resize in progress that doesn't match the size specified
> in
> the call, it will cancel it, replacing it with a new one matching the
> given size.
> 
> The H_RESIZE_HPT_COMMIT completes transition to a resized HPT, and
> can only
> be called successfully once H_RESIZE_HPT_PREPARE has successfully
> completed initialization of a new HPT.  The guest must ensure that
> there
> are no concurrent accesses to the existing HPT while this is called
> (this
> effectively means stop_machine() for Linux guests).
> 
> For now H_RESIZE_HPT_COMMIT goes through the whole old HPT, rehashing
> each
> HPTE into the new HPT.  This can have quite high latency, but it
> seems to
> be of the order of typical migration downtime latencies for HPTs of
> size
> up to ~2GiB (which would be used in a 256GiB guest).
> 
> In future we probably want to move more of the rehashing to the
> "prepare"
> phase, by having H_ENTER and other hcalls update both current and
> pending HPTs.  That's a project for another day, but should be
> possible
> without any changes to the guest interface.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c          |   4 +-
>  hw/ppc/spapr_hcall.c    | 345
> +++++++++++++++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h  |   6 +
>  target-ppc/mmu-hash64.h |   4 +
>  4 files changed, 353 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 846ce51..d057031 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -93,8 +93,6 @@
>  
>  #define PHANDLE_XICP            0x00001111
>  
> -#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> -
>  static XICSState *try_create_xics(const char *type, int nr_servers,
>                                    int nr_irqs, Error **errp)
>  {
> @@ -1055,7 +1053,7 @@ static void close_htab_fd(sPAPRMachineState
> *spapr)
>      spapr->htab_fd = -1;
>  }
>  
> -static int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> +int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
>  {
>      int shift;
>  
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 72a9c4d..1e89061 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -2,6 +2,7 @@
>  #include "qapi/error.h"
>  #include "sysemu/sysemu.h"
>  #include "qemu/log.h"
> +#include "qemu/error-report.h"
>  #include "cpu.h"
>  #include "exec/exec-all.h"
>  #include "helper_regs.h"
> @@ -355,20 +356,319 @@ static target_ulong h_read(PowerPCCPU *cpu,
> sPAPRMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +struct sPAPRPendingHPT {
> +    /* These fields are read-only after initialization */
> +    int shift;
> +    QemuThread thread;
> +
> +    /* These fields are protected by the BQL */
> +    bool complete;
> +
> +    /* These fields are private to the preparation thread if
> +     * !complete, otherwise protected by the BQL */
> +    int ret;
> +    void *hpt;
> +};
> +
> +static void free_pending_hpt(sPAPRPendingHPT *pending)
> +{
> +    if (pending->hpt) {
> +        qemu_vfree(pending->hpt);
> +    }
> +
> +    g_free(pending);
> +}
> +
> +static void *hpt_prepare_thread(void *opaque)
> +{
> +    sPAPRPendingHPT *pending = opaque;
> +    size_t size = 1ULL << pending->shift;
> +
> +    pending->hpt = qemu_memalign(size, size);
> +    if (pending->hpt) {
> +        memset(pending->hpt, 0, size);
> +        pending->ret = H_SUCCESS;
> +    } else {
> +        pending->ret = H_NO_MEM;
> +    }
> +
> +    qemu_mutex_lock_iothread();
> +
> +    if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt == pending) {
> +        /* Ready to go */
> +        pending->complete = true;
> +    } else {
> +        /* We've been cancelled, clean ourselves up */
> +        free_pending_hpt(pending);
> +    }
> +
> +    qemu_mutex_unlock_iothread();
> +    return NULL;
> +}
> +
> +/* Must be called with BQL held */
> +static void cancel_hpt_prepare(sPAPRMachineState *spapr)
> +{
> +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> +
> +    /* Let the thread know it's cancelled */
> +    spapr->pending_hpt = NULL;
> +
> +    if (!pending) {
> +        /* Nothing to do */
> +        return;
> +    }
> +
> +    if (!pending->complete) {
> +        /* thread will clean itself up */
> +        return;
> +    }
> +
> +    free_pending_hpt(pending);
> +}
> +
> +static int build_dimm_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> +        DeviceState *dev = DEVICE(obj);
> +        if (dev->realized) { /* only realized DIMMs matter */
> +            *list = g_slist_prepend(*list, dev);
> +        }
> +    }
> +
> +    object_child_foreach(obj, build_dimm_list, opaque);
> +    return 0;
> +}
> +
> +static ram_addr_t get_current_ram_size(void)
> +{
> +    GSList *list = NULL, *item;
> +    ram_addr_t size = ram_size;
> +
> +    build_dimm_list(qdev_get_machine(), &list);
> +    for (item = list; item; item = g_slist_next(item)) {
> +        Object *obj = OBJECT(item->data);
> +        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
> +            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
> +                                            &error_abort);
> +        }
> +    }
> +    g_slist_free(list);
> +
> +    return size;
> +}
> +
>  static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
>                                           sPAPRMachineState *spapr,
>                                           target_ulong opcode,
>                                           target_ulong *args)
>  {
>      target_ulong flags = args[0];
> -    target_ulong shift = args[1];
> +    int shift = args[1];
> +    sPAPRPendingHPT *pending = spapr->pending_hpt;
>  
>      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
>          return H_AUTHORITY;
>      }
>  
>      trace_spapr_h_resize_hpt_prepare(flags, shift);
> -    return H_HARDWARE;
> +
> +    if (flags != 0) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (shift && ((shift < 18) || (shift > 46))) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (pending) {
> +        /* something already in progress */
> +        if (pending->shift == shift) {
> +            /* and it's suitable */
> +            if (pending->complete) {
> +                return pending->ret;
> +            } else {
> +                return H_LONG_BUSY_ORDER_100_MSEC;
> +            }
> +        }
> +
> +        /* not suitable, cancel and replace */
> +        cancel_hpt_prepare(spapr);
> +    }
> +
> +    if (!shift) {
> +        /* nothing to do */
> +        return H_SUCCESS;
> +    }
> +
> +    /* start new prepare */
> +
> +    /* We only allow the guest to allocate an HPT one order above
> what
> +     * we'd normally give them (to stop a small guest claiming a
> huge
> +     * chunk of resources in the HPT */
> +    if (shift > (spapr_hpt_shift_for_ramsize(get_current_ram_size())
> + 1)) {
> +        return H_RESOURCE;
> +    }
> +
> +    pending = g_new0(sPAPRPendingHPT, 1);
> +    pending->shift = shift;
> +    pending->ret = H_HARDWARE;
> +
> +    qemu_thread_create(&pending->thread, "sPAPR HPT prepare",
> +                       hpt_prepare_thread, pending,
> QEMU_THREAD_DETACHED);
> +
> +    spapr->pending_hpt = pending;
> +
> +    /* In theory we could estimate the time more accurately based on
> +     * the new size, but there's not much point */
> +    return H_LONG_BUSY_ORDER_100_MSEC;
> +}
> +
> +static uint64_t new_hpte_load0(void *htab, uint64_t pteg, int slot)
> +{
> +    uint8_t *addr = htab;
> +
> +    addr += pteg * HASH_PTEG_SIZE_64;
> +    addr += slot * HASH_PTE_SIZE_64;
> +    return  ldq_p(addr);
> +}
> +
> +static void new_hpte_store(void *htab, uint64_t pteg, int slot,
> +                           uint64_t pte0, uint64_t pte1)
> +{
> +    uint8_t *addr = htab;
> +
> +    addr += pteg * HASH_PTEG_SIZE_64;
> +    addr += slot * HASH_PTE_SIZE_64;
> +
> +    stq_p(addr, pte0);
> +    stq_p(addr + HASH_PTE_SIZE_64/2, pte1);
> +}
> +
> +static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
> +                       void *old, uint64_t oldsize,
> +                       void *new, uint64_t newsize,
Can we call these old_hpt and new_hpt?
> +                       uint64_t pteg, int slot)
> +{
> +    uint64_t old_hash_mask = (oldsize >> 7) - 1;
> +    uint64_t new_hash_mask = (newsize >> 7) - 1;
> +    target_ulong pte0 = ppc_hash64_load_hpte0(cpu, token, slot);
> +    target_ulong pte1;
> +    uint64_t avpn;
> +    unsigned shift;
Can we call this base_pg_shift, or at least add a comment like
/* Base Virtual Page Shift */ or something?
It took me a while to work out which *shift* this actually was...
> +    uint64_t hash, new_pteg, replace_pte0;
> +
*** (referenced below)
> +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> +        return H_SUCCESS;
> +    }
> +
> +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> +
> +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> +    assert(shift); /* H_ENTER should never have allowed a bad
> encoding */
> +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
> +
> +    if (pte0 & HPTE64_V_SECONDARY) {
> +        pteg = ~pteg;
> +    }
> +
The hash calculation below is pretty hard to follow... That being said
I don't really see a nicer way of structuring it. I guess you just have
to wade through the ISA if you actually want to understand what's
happening here (which I assume most people won't bother with).
> +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> +        uint64_t offset, vsid;
> +
> +        /* We only have 28 - 23 bits of offset in avpn */
> +        offset = (avpn & 0x1f) << 23;
> +        vsid = avpn >> 5;
> +        /* We can find more bits from the pteg value */
> +        if (shift < 23) {
> +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> +        }
> +
> +        hash = vsid ^ (offset >> shift);
> +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> +        uint64_t offset, vsid;
> +
> +        /* We only have 40 - 23 bits of seg_off in avpn */
> +        offset = (avpn & 0x1ffff) << 23;
> +        vsid = avpn >> 17;
> +        if (shift < 23) {
> +            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask)
> << shift;
> +        }
> +
> +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> +    } else {
> +        error_report("rehash_pte: Bad segment size in HPTE");
> +        return H_HARDWARE;
> +    }
> +
> +    new_pteg = hash & new_hash_mask;
> +    if (pte0 & HPTE64_V_SECONDARY) {
> +        assert(~pteg == (hash & old_hash_mask));
> +        new_pteg = ~new_pteg;
> +    } else {
> +        assert(pteg == (hash & old_hash_mask));
> +    }
> +    assert((oldsize != newsize) || (pteg == new_pteg));
> +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> +    if (replace_pte0 & HPTE64_V_VALID) {
> +        assert(newsize < oldsize);
> +        if (replace_pte0 & HPTE64_V_BOLTED) {
Aren't both of these checks for BOLTED redundant? We only rehash PTEs
which are both valid and bolted, otherwise we would have returned at ***
above. So the new HPT will only ever contain invalid entries or valid &&
bolted entries, and we don't need to check whether replace_pte0 is bolted
once we know it's valid. Similarly, we know the entry we're inserting is
bolted, otherwise we wouldn't have bothered rehashing it. So checking
(replace_pte0 & HPTE64_V_VALID) is sufficient; a hit there already
implies a bolted collision.
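For illustration only (not part of the patch), the collision handling
would then collapse to roughly:

    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
    if (replace_pte0 & HPTE64_V_VALID) {
        /* Only valid+bolted entries are ever copied into the new HPT,
         * so any valid entry in the target slot is already a bolted
         * collision. */
        assert(newsize < oldsize);
        return H_PTEG_FULL;
    }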
> +            if (pte0 & HPTE64_V_BOLTED) {
> +                /* Bolted collision, nothing we can do */
> +                return H_PTEG_FULL;
> +            } else {
> +                /* Discard this hpte */
> +                return H_SUCCESS;
> +            }
> +        }
Is there any reason the rehashed pte has to go into the same slot it
came out of? Isn't it valid to search all of the slots of the new pteg
for an empty (invalid) one and put it in the first empty space? Just
because the current slot already contains a bolted entry doesn't mean
we can't put it in another slot, right?
> +    }
> +
> +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> +    return H_SUCCESS;
> +}
> +
> +static int rehash_hpt(PowerPCCPU *cpu,
> +                      void *old, uint64_t oldsize,
> +                      void *new, uint64_t newsize)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    uint64_t n_ptegs = oldsize >> 7;
> +    uint64_t pteg;
> +    int slot;
> +    int rc;
> +
> +    assert(env->external_htab == old);
> +
> +    for (pteg = 0; pteg < n_ptegs; pteg++) {
Can we rename token to pteg_[base_]addr or something? Token is very...
generic
> +        uint64_t token = ppc_hash64_start_access(cpu, pteg *
> HPTES_PER_GROUP);
> +
> +        if (!token) {
> +            return H_HARDWARE;
> +        }
> +
> +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> +            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
> +                             pteg, slot);
> +            if (rc != H_SUCCESS) {
> +                ppc_hash64_stop_access(cpu, token);
> +                return rc;
> +            }
> +        }
> +        ppc_hash64_stop_access(cpu, token);
> +    }
> +
> +    return H_SUCCESS;
> +}
> +
> +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> +{
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +    cpu_synchronize_state(cs);
> +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
> +                                &error_fatal);
>  }
>  
>  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> @@ -378,13 +678,52 @@ static target_ulong
> h_resize_hpt_commit(PowerPCCPU *cpu,
>  {
>      target_ulong flags = args[0];
>      target_ulong shift = args[1];
> +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> +    int rc;
> +    size_t newsize;
>  
>      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
>          return H_AUTHORITY;
>      }
>  
>      trace_spapr_h_resize_hpt_commit(flags, shift);
> -    return H_HARDWARE;
> +
> +    if (flags != 0) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (!pending || (pending->shift != shift)) {
> +        /* no matching prepare */
> +        return H_CLOSED;
> +    }
> +
> +    if (!pending->complete) {
> +        /* prepare has not completed */
> +        return H_BUSY;
> +    }
> +
> +    newsize = 1ULL << pending->shift;
> +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> +                    pending->hpt, newsize);
> +    if (rc == H_SUCCESS) {
> +        CPUState *cs;
> +
> +        qemu_vfree(spapr->htab);
> +        spapr->htab = pending->hpt;
> +        spapr->htab_shift = pending->shift;
> +
> +        CPU_FOREACH(cs) {
> +            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
> +        }
> +
> +        pending->hpt = NULL; /* so it's not free()d */
> +    }
> +
> +    /* Clean up */
> +    spapr->pending_hpt = NULL;
> +    free_pending_hpt(pending);
> +
> +    return rc;
>  }
>  
>  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState
> *spapr,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index d2c758b..954bada 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -14,6 +14,7 @@ struct sPAPRNVRAM;
>  typedef struct sPAPRConfigureConnectorState
> sPAPRConfigureConnectorState;
>  typedef struct sPAPREventLogEntry sPAPREventLogEntry;
>  typedef struct sPAPREventSource sPAPREventSource;
> +typedef struct sPAPRPendingHPT sPAPRPendingHPT;
>  
>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
>  #define SPAPR_ENTRY_POINT       0x100
> @@ -72,6 +73,8 @@ struct sPAPRMachineState {
>      sPAPRResizeHPT resize_hpt;
>      void *htab;
>      uint32_t htab_shift;
> +    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
> +
>      hwaddr rma_size;
>      int vrma_adjust;
>      ssize_t rtas_size;
> @@ -627,6 +630,7 @@ void
> spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType
> drc_type,
>                                                 uint32_t count,
> uint32_t index);
>  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
>                                      sPAPRMachineState *spapr);
> +int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
>  
>  /* rtas-configure-connector state */
>  struct sPAPRConfigureConnectorState {
> @@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
>  
>  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data
> arg);
>  
> +#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> +
>  #endif /* HW_SPAPR_H */
> diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
> index ab5d347..4a1b781 100644
> --- a/target-ppc/mmu-hash64.h
> +++ b/target-ppc/mmu-hash64.h
> @@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState *env);
>  #define HASH_PTE_SIZE_64        16
>  #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)
>  
> +#define HPTE64_V_SSIZE          SLB_VSID_B
> +#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
> +#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
>  #define HPTE64_V_SSIZE_SHIFT    62
>  #define HPTE64_V_AVPN_SHIFT     7
>  #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
>  #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >>
> HPTE64_V_AVPN_SHIFT)
>  #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) &
> 0xffffffffffffff83ULL))
> +#define HPTE64_V_BOLTED         0x0000000000000010ULL
>  #define HPTE64_V_LARGE          0x0000000000000004ULL
>  #define HPTE64_V_SECONDARY      0x0000000000000002ULL
>  #define HPTE64_V_VALID          0x0000000000000001ULL

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9 David Gibson
@ 2016-12-14  5:32   ` Suraj Jitindar Singh
  2016-12-14  6:20     ` David Gibson
  0 siblings, 1 reply; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14  5:32 UTC (permalink / raw)
  To: David Gibson, paulus; +Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> We've now implemented a PAPR extension which allows PAPR guests
> (i.e.
> "pseries" machine type) to resize their hash page table during
> runtime.
> 
> However, that extension is only enabled if explicitly chosen on the
> command line.  This patch enables it by default for spapr-2.9, but
> leaves
> it disabled (by default) for older machine types.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> Reviewed-by: Laurent Vivier <lvivier@redhat.com>
> ---
>  hw/ppc/spapr.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d057031..f05d0e5 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2779,7 +2779,7 @@ static void
> spapr_machine_class_init(ObjectClass *oc, void *data)
>  
>      smc->dr_lmb_enabled = true;
>      smc->tcg_default_cpu = "POWER8";
> -    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
> +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
>      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
>      fwc->get_dev_path = spapr_get_fw_dev_path;
>      nc->nmi_monitor_handler = spapr_nmi;
> @@ -2860,8 +2860,11 @@ static void
> spapr_machine_2_8_instance_options(MachineState *machine)
>  
>  static void spapr_machine_2_8_class_options(MachineClass *mc)
>  {
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
>      spapr_machine_2_9_class_options(mc);
>      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_8);
> +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
>  }
>  
>  DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
If people don't want this by default they should probably just specify
that on the command line.
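For example (just a usage sketch, assuming the option keeps its current
name and values):

    qemu-system-ppc64 -machine pseries-2.9,resize-hpt=disabled ...

would restore the old behaviour on the new machine type.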

Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 3/5] pseries: Implement " David Gibson
  2016-12-14  5:30   ` Suraj Jitindar Singh
@ 2016-12-14  5:35   ` Suraj Jitindar Singh
  2016-12-15  1:03     ` David Gibson
  1 sibling, 1 reply; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14  5:35 UTC (permalink / raw)
  To: David Gibson, paulus; +Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> This patch implements hypercalls allowing a PAPR guest to resize its
> own
> hash page table.  This will eventually allow for more flexible memory
> hotplug.
> 
> The implementation is partially asynchronous, handled in a special
> thread
> running the hpt_prepare_thread() function.  The state of a pending
> resize
> is stored in SPAPR_MACHINE->pending_hpt.
> 
> The H_RESIZE_HPT_PREPARE hypercall will kick off creation of a new
> HPT, or,
> if one is already in progress, monitor it for completion.  If there
> is an
> existing HPT resize in progress that doesn't match the size specified
> in
> the call, it will cancel it, replacing it with a new one matching the
> given size.
> 
> The H_RESIZE_HPT_COMMIT completes transition to a resized HPT, and
> can only
> be called successfully once H_RESIZE_HPT_PREPARE has successfully
> completed initialization of a new HPT.  The guest must ensure that
> there
> are no concurrent accesses to the existing HPT while this is called
> (this
> effectively means stop_machine() for Linux guests).
> 
> For now H_RESIZE_HPT_COMMIT goes through the whole old HPT, rehashing
> each
> HPTE into the new HPT.  This can have quite high latency, but it
> seems to
> be of the order of typical migration downtime latencies for HPTs of
> size
> up to ~2GiB (which would be used in a 256GiB guest).
> 
> In future we probably want to move more of the rehashing to the
> "prepare"
> phase, by having H_ENTER and other hcalls update both current and
> pending HPTs.  That's a project for another day, but should be
> possible
> without any changes to the guest interface.
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c          |   4 +-
>  hw/ppc/spapr_hcall.c    | 345
> +++++++++++++++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/spapr.h  |   6 +
>  target-ppc/mmu-hash64.h |   4 +
>  4 files changed, 353 insertions(+), 6 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 846ce51..d057031 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -93,8 +93,6 @@
>  
>  #define PHANDLE_XICP            0x00001111
>  
> -#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> -
>  static XICSState *try_create_xics(const char *type, int nr_servers,
>                                    int nr_irqs, Error **errp)
>  {
> @@ -1055,7 +1053,7 @@ static void close_htab_fd(sPAPRMachineState
> *spapr)
>      spapr->htab_fd = -1;
>  }
>  
> -static int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> +int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
>  {
>      int shift;
>  
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 72a9c4d..1e89061 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -2,6 +2,7 @@
>  #include "qapi/error.h"
>  #include "sysemu/sysemu.h"
>  #include "qemu/log.h"
> +#include "qemu/error-report.h"
>  #include "cpu.h"
>  #include "exec/exec-all.h"
>  #include "helper_regs.h"
> @@ -355,20 +356,319 @@ static target_ulong h_read(PowerPCCPU *cpu,
> sPAPRMachineState *spapr,
>      return H_SUCCESS;
>  }
>  
> +struct sPAPRPendingHPT {
> +    /* These fields are read-only after initialization */
> +    int shift;
> +    QemuThread thread;
> +
> +    /* These fields are protected by the BQL */
> +    bool complete;
> +
> +    /* These fields are private to the preparation thread if
> +     * !complete, otherwise protected by the BQL */
> +    int ret;
> +    void *hpt;
> +};
> +
> +static void free_pending_hpt(sPAPRPendingHPT *pending)
> +{
> +    if (pending->hpt) {
> +        qemu_vfree(pending->hpt);
> +    }
> +
> +    g_free(pending);
> +}
> +
> +static void *hpt_prepare_thread(void *opaque)
> +{
> +    sPAPRPendingHPT *pending = opaque;
> +    size_t size = 1ULL << pending->shift;
> +
> +    pending->hpt = qemu_memalign(size, size);
> +    if (pending->hpt) {
> +        memset(pending->hpt, 0, size);
> +        pending->ret = H_SUCCESS;
> +    } else {
> +        pending->ret = H_NO_MEM;
> +    }
> +
> +    qemu_mutex_lock_iothread();
> +
> +    if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt == pending) {
> +        /* Ready to go */
> +        pending->complete = true;
> +    } else {
> +        /* We've been cancelled, clean ourselves up */
> +        free_pending_hpt(pending);
> +    }
> +
> +    qemu_mutex_unlock_iothread();
> +    return NULL;
> +}
> +
> +/* Must be called with BQL held */
> +static void cancel_hpt_prepare(sPAPRMachineState *spapr)
> +{
> +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> +
> +    /* Let the thread know it's cancelled */
> +    spapr->pending_hpt = NULL;
> +
> +    if (!pending) {
> +        /* Nothing to do */
> +        return;
> +    }
> +
> +    if (!pending->complete) {
> +        /* thread will clean itself up */
> +        return;
> +    }
> +
> +    free_pending_hpt(pending);
> +}
> +
> +static int build_dimm_list(Object *obj, void *opaque)
> +{
> +    GSList **list = opaque;
> +
> +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> +        DeviceState *dev = DEVICE(obj);
> +        if (dev->realized) { /* only realized DIMMs matter */
> +            *list = g_slist_prepend(*list, dev);
> +        }
> +    }
> +
> +    object_child_foreach(obj, build_dimm_list, opaque);
> +    return 0;
> +}
> +
> +static ram_addr_t get_current_ram_size(void)
> +{
> +    GSList *list = NULL, *item;
> +    ram_addr_t size = ram_size;
> +
> +    build_dimm_list(qdev_get_machine(), &list);
> +    for (item = list; item; item = g_slist_next(item)) {
> +        Object *obj = OBJECT(item->data);
> +        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
> +            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
> +                                            &error_abort);
> +        }
> +    }
> +    g_slist_free(list);
> +
> +    return size;
> +}
> +
>  static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
>                                           sPAPRMachineState *spapr,
>                                           target_ulong opcode,
>                                           target_ulong *args)
>  {
>      target_ulong flags = args[0];
> -    target_ulong shift = args[1];
> +    int shift = args[1];
> +    sPAPRPendingHPT *pending = spapr->pending_hpt;
>  
>      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
>          return H_AUTHORITY;
>      }
>  
>      trace_spapr_h_resize_hpt_prepare(flags, shift);
> -    return H_HARDWARE;
> +
> +    if (flags != 0) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (shift && ((shift < 18) || (shift > 46))) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (pending) {
> +        /* something already in progress */
> +        if (pending->shift == shift) {
> +            /* and it's suitable */
> +            if (pending->complete) {
> +                return pending->ret;
> +            } else {
> +                return H_LONG_BUSY_ORDER_100_MSEC;
> +            }
> +        }
> +
> +        /* not suitable, cancel and replace */
> +        cancel_hpt_prepare(spapr);
> +    }
> +
> +    if (!shift) {
> +        /* nothing to do */
> +        return H_SUCCESS;
> +    }
> +
> +    /* start new prepare */
> +
> +    /* We only allow the guest to allocate an HPT one order above
> what
> +     * we'd normally give them (to stop a small guest claiming a
> huge
> +     * chunk of resources in the HPT */
> +    if (shift > (spapr_hpt_shift_for_ramsize(get_current_ram_size())
> + 1))
Given that by default we already allocate one order lower than the ISA
recommends, should we set this limit to 2 so that a guest can still
allocate one order above the recommended minimum?
> +        return H_RESOURCE;
> +    }
> +
> +    pending = g_new0(sPAPRPendingHPT, 1);
> +    pending->shift = shift;
> +    pending->ret = H_HARDWARE;
> +
> +    qemu_thread_create(&pending->thread, "sPAPR HPT prepare",
> +                       hpt_prepare_thread, pending,
> QEMU_THREAD_DETACHED);
> +
> +    spapr->pending_hpt = pending;
> +
> +    /* In theory we could estimate the time more accurately based on
> +     * the new size, but there's not much point */
> +    return H_LONG_BUSY_ORDER_100_MSEC;
> +}
> +
> +static uint64_t new_hpte_load0(void *htab, uint64_t pteg, int slot)
> +{
> +    uint8_t *addr = htab;
> +
> +    addr += pteg * HASH_PTEG_SIZE_64;
> +    addr += slot * HASH_PTE_SIZE_64;
> +    return  ldq_p(addr);
> +}
> +
> +static void new_hpte_store(void *htab, uint64_t pteg, int slot,
> +                           uint64_t pte0, uint64_t pte1)
> +{
> +    uint8_t *addr = htab;
> +
> +    addr += pteg * HASH_PTEG_SIZE_64;
> +    addr += slot * HASH_PTE_SIZE_64;
> +
> +    stq_p(addr, pte0);
> +    stq_p(addr + HASH_PTE_SIZE_64/2, pte1);
> +}
> +
> +static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
> +                       void *old, uint64_t oldsize,
> +                       void *new, uint64_t newsize,
> +                       uint64_t pteg, int slot)
> +{
> +    uint64_t old_hash_mask = (oldsize >> 7) - 1;
> +    uint64_t new_hash_mask = (newsize >> 7) - 1;
> +    target_ulong pte0 = ppc_hash64_load_hpte0(cpu, token, slot);
> +    target_ulong pte1;
> +    uint64_t avpn;
> +    unsigned shift;
> +    uint64_t hash, new_pteg, replace_pte0;
> +
> +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> +        return H_SUCCESS;
> +    }
> +
> +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> +
> +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> +    assert(shift); /* H_ENTER should never have allowed a bad
> encoding */
> +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
> +
> +    if (pte0 & HPTE64_V_SECONDARY) {
> +        pteg = ~pteg;
> +    }
> +
> +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> +        uint64_t offset, vsid;
> +
> +        /* We only have 28 - 23 bits of offset in avpn */
> +        offset = (avpn & 0x1f) << 23;
> +        vsid = avpn >> 5;
> +        /* We can find more bits from the pteg value */
> +        if (shift < 23) {
> +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> +        }
> +
> +        hash = vsid ^ (offset >> shift);
> +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> +        uint64_t offset, vsid;
> +
> +        /* We only have 40 - 23 bits of seg_off in avpn */
> +        offset = (avpn & 0x1ffff) << 23;
> +        vsid = avpn >> 17;
> +        if (shift < 23) {
> +            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask)
> << shift;
> +        }
> +
> +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> +    } else {
> +        error_report("rehash_pte: Bad segment size in HPTE");
> +        return H_HARDWARE;
> +    }
> +
> +    new_pteg = hash & new_hash_mask;
> +    if (pte0 & HPTE64_V_SECONDARY) {
> +        assert(~pteg == (hash & old_hash_mask));
> +        new_pteg = ~new_pteg;
> +    } else {
> +        assert(pteg == (hash & old_hash_mask));
> +    }
> +    assert((oldsize != newsize) || (pteg == new_pteg));
> +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> +    if (replace_pte0 & HPTE64_V_VALID) {
> +        assert(newsize < oldsize);
> +        if (replace_pte0 & HPTE64_V_BOLTED) {
> +            if (pte0 & HPTE64_V_BOLTED) {
> +                /* Bolted collision, nothing we can do */
> +                return H_PTEG_FULL;
> +            } else {
> +                /* Discard this hpte */
> +                return H_SUCCESS;
> +            }
> +        }
> +    }
> +
> +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> +    return H_SUCCESS;
> +}
> +
> +static int rehash_hpt(PowerPCCPU *cpu,
> +                      void *old, uint64_t oldsize,
> +                      void *new, uint64_t newsize)
> +{
> +    CPUPPCState *env = &cpu->env;
> +    uint64_t n_ptegs = oldsize >> 7;
> +    uint64_t pteg;
> +    int slot;
> +    int rc;
> +
> +    assert(env->external_htab == old);
> +
> +    for (pteg = 0; pteg < n_ptegs; pteg++) {
> +        uint64_t token = ppc_hash64_start_access(cpu, pteg *
> HPTES_PER_GROUP);
> +
> +        if (!token) {
> +            return H_HARDWARE;
> +        }
> +
> +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> +            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
> +                             pteg, slot);
> +            if (rc != H_SUCCESS) {
> +                ppc_hash64_stop_access(cpu, token);
> +                return rc;
> +            }
> +        }
> +        ppc_hash64_stop_access(cpu, token);
> +    }
> +
> +    return H_SUCCESS;
> +}
> +
> +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> +{
> +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +    cpu_synchronize_state(cs);
> +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
> +                                &error_fatal);
>  }
>  
>  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> @@ -378,13 +678,52 @@ static target_ulong
> h_resize_hpt_commit(PowerPCCPU *cpu,
>  {
>      target_ulong flags = args[0];
>      target_ulong shift = args[1];
> +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> +    int rc;
> +    size_t newsize;
>  
>      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
>          return H_AUTHORITY;
>      }
>  
>      trace_spapr_h_resize_hpt_commit(flags, shift);
> -    return H_HARDWARE;
> +
> +    if (flags != 0) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (!pending || (pending->shift != shift)) {
> +        /* no matching prepare */
> +        return H_CLOSED;
> +    }
> +
> +    if (!pending->complete) {
> +        /* prepare has not completed */
> +        return H_BUSY;
> +    }
> +
> +    newsize = 1ULL << pending->shift;
> +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> +                    pending->hpt, newsize);
> +    if (rc == H_SUCCESS) {
> +        CPUState *cs;
> +
> +        qemu_vfree(spapr->htab);
> +        spapr->htab = pending->hpt;
> +        spapr->htab_shift = pending->shift;
> +
> +        CPU_FOREACH(cs) {
> +            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
> +        }
> +
> +        pending->hpt = NULL; /* so it's not free()d */
> +    }
> +
> +    /* Clean up */
> +    spapr->pending_hpt = NULL;
> +    free_pending_hpt(pending);
> +
> +    return rc;
>  }
>  
>  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState
> *spapr,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index d2c758b..954bada 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -14,6 +14,7 @@ struct sPAPRNVRAM;
>  typedef struct sPAPRConfigureConnectorState
> sPAPRConfigureConnectorState;
>  typedef struct sPAPREventLogEntry sPAPREventLogEntry;
>  typedef struct sPAPREventSource sPAPREventSource;
> +typedef struct sPAPRPendingHPT sPAPRPendingHPT;
>  
>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
>  #define SPAPR_ENTRY_POINT       0x100
> @@ -72,6 +73,8 @@ struct sPAPRMachineState {
>      sPAPRResizeHPT resize_hpt;
>      void *htab;
>      uint32_t htab_shift;
> +    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
> +
>      hwaddr rma_size;
>      int vrma_adjust;
>      ssize_t rtas_size;
> @@ -627,6 +630,7 @@ void
> spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType
> drc_type,
>                                                 uint32_t count,
> uint32_t index);
>  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
>                                      sPAPRMachineState *spapr);
> +int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
>  
>  /* rtas-configure-connector state */
>  struct sPAPRConfigureConnectorState {
> @@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
>  
>  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data
> arg);
>  
> +#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> +
>  #endif /* HW_SPAPR_H */
> diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
> index ab5d347..4a1b781 100644
> --- a/target-ppc/mmu-hash64.h
> +++ b/target-ppc/mmu-hash64.h
> @@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState *env);
>  #define HASH_PTE_SIZE_64        16
>  #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)
>  
> +#define HPTE64_V_SSIZE          SLB_VSID_B
> +#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
> +#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
>  #define HPTE64_V_SSIZE_SHIFT    62
>  #define HPTE64_V_AVPN_SHIFT     7
>  #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
>  #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >>
> HPTE64_V_AVPN_SHIFT)
>  #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) &
> 0xffffffffffffff83ULL))
> +#define HPTE64_V_BOLTED         0x0000000000000010ULL
>  #define HPTE64_V_LARGE          0x0000000000000004ULL
>  #define HPTE64_V_SECONDARY      0x0000000000000002ULL
>  #define HPTE64_V_VALID          0x0000000000000001ULL

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 5/5] pseries: Use smaller default hash page tables when guest can resize
  2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 5/5] pseries: Use smaller default hash page tables when guest can resize David Gibson
@ 2016-12-14  5:56   ` Suraj Jitindar Singh
  0 siblings, 0 replies; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14  5:56 UTC (permalink / raw)
  To: David Gibson, paulus; +Cc: agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> We've now implemented a PAPR extension allowing PAPR guests to resize
> their hash page table (HPT) during runtime.
> 
> This patch makes use of that facility to allocate smaller HPTs by
> default.
> Specifically when a guest is aware of the HPT resize facility, qemu
> sizes
> the HPT to the initial memory size, rather than the maximum memory
> size on
> the assumption that the guest will resize its HPT if necessary for
> hot
> plugged memory.
> 
> When the initial memory size is much smaller than the maximum memory
> size
> (a common configuration with e.g. oVirt / RHEV) then this can save
> significant memory on the HPT.
> 
> If the guest does *not* advertise HPT resize awareness when it makes
> the
> ibm,client-architecture-support call, qemu resizes the HPT for
> maximum
> memory size (unless it's been configured not to allow such guests at
> all).
> 
> For now we make that reallocation assuming the guest has not yet used
> the
> HPT at all.  That's true in practice, but not, strictly, an
> architectural
> or PAPR requirement.  If we need to in future we can fix this by
> having
> the client-architecture-support call reboot the guest with the
> revised
> HPT size (the client-architecture-support call is explicitly
> permitted to
> trigger a reboot in this way).
> 
> Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> ---
>  hw/ppc/spapr.c              | 21 ++++++++++++++++-----
>  hw/ppc/spapr_hcall.c        | 32 ++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr.h      |  2 ++
>  include/hw/ppc/spapr_ovec.h |  1 +
>  4 files changed, 51 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index f05d0e5..51b189d 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1066,8 +1066,8 @@ int spapr_hpt_shift_for_ramsize(uint64_t
> ramsize)
>      return shift;
>  }
>  
> -static void spapr_reallocate_hpt(sPAPRMachineState *spapr, int
> shift,
> -                                 Error **errp)
> +void spapr_reallocate_hpt(sPAPRMachineState *spapr, int shift,
> +                          Error **errp)
>  {
>      long rc;
>  
> @@ -1140,14 +1140,20 @@ static void ppc_spapr_reset(void)
>      hwaddr rtas_addr, fdt_addr;
>      void *fdt;
>      int rc;
> +    int hpt_shift;
>  
>      /* Check for unknown sysbus devices */
>      foreach_dynamic_sysbus_device(find_unknown_sysbus_device, NULL);
>  
>      /* Allocate and/or reset the hash page table */
> -    spapr_reallocate_hpt(spapr,
> -                         spapr_hpt_shift_for_ramsize(machine-
> >maxram_size),
> -                         &error_fatal);
> +    if ((spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED)
> +        || (spapr->cas_reboot
> +            && !spapr_ovec_test(spapr->ov5_cas, OV5_HPT_RESIZE))) {
> +        hpt_shift = spapr_hpt_shift_for_ramsize(machine-
> >maxram_size);
> +    } else {
> +        hpt_shift = spapr_hpt_shift_for_ramsize(machine->ram_size);
> +    }
> +    spapr_reallocate_hpt(spapr, hpt_shift, &error_fatal);
>  
>      /* Update the RMA size if necessary */
>      if (spapr->vrma_adjust) {
> @@ -1941,6 +1947,11 @@ static void ppc_spapr_init(MachineState
> *machine)
>          spapr_ovec_set(spapr->ov5, OV5_HP_EVT);
>      }
>  
> +    /* advertise support for HPT resizing */
> +    if (spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) {
> +        spapr_ovec_set(spapr->ov5, OV5_HPT_RESIZE);
> +    }
> +
>      /* init CPUs */
>      if (machine->cpu_model == NULL) {
>          machine->cpu_model = kvm_enabled() ? "host" : smc-
> >tcg_default_cpu;
> diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> index 1e89061..5bebfc3 100644
> --- a/hw/ppc/spapr_hcall.c
> +++ b/hw/ppc/spapr_hcall.c
> @@ -1415,6 +1415,38 @@ static target_ulong
> h_client_architecture_support(PowerPCCPU *cpu_,
>  
>      ov5_guest = spapr_ovec_parse_vector(ov_table, 5);
>  
> +    /*
> +     * HPT resizing is a bit of a special case, because when enabled
> +     * we assume the guest will support it until it says it doesn't,
> +     * instead of assuming it won't support it until it says it
> does.
> +     * Strictly speaking that approach could break for guests which
> +     * don't make a CAS call, but those are so old we don't care
> about
> +     * them.  Without that assumption we'd have to make at least a
> +     * temporary allocation of an HPT sized for max memory, which
> +     * could be impossibly difficult under KVM HV if maxram is
> large.
> +     */
> +    if (!spapr_ovec_test(ov5_guest, OV5_HPT_RESIZE)) {
> +        int maxshift = spapr_hpt_shift_for_ramsize(MACHINE(spapr)-
> >maxram_size);
> +
> +        if (spapr->resize_hpt == SPAPR_RESIZE_HPT_REQUIRED) {
> +            error_report(
> +                "h_client_architecture_support: Guest doesn't
> support HPT resizing, but resize-hpt=required");
> +            exit(1);
> +        }
> +
> +        if (spapr->htab_shift < maxshift) {
> +            CPUState *cs;
> +            /* Guest doesn't know about HPT resizing, so we
> +             * pre-emptively resize for the maximum permitted
> RAM.  At
> +             * the point this is called, nothing should have been
> +             * entered into the existing HPT */
> +            spapr_reallocate_hpt(spapr, maxshift, &error_fatal);
> +            CPU_FOREACH(cs) {
> +                run_on_cpu(cs, pivot_hpt,
> RUN_ON_CPU_HOST_PTR(spapr));
> +            }
> +        }
> +    }
> +
>      /* NOTE: there are actually a number of ov5 bits where input
> from the
>       * guest is always zero, and the platform/QEMU enables them
> independently
>       * of guest input. To model these properly we'd want some sort
> of mask,
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 954bada..ff3bf87 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -631,6 +631,8 @@ void
> spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType
> drc_type,
>  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
>                                      sPAPRMachineState *spapr);
>  int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
> +void spapr_reallocate_hpt(sPAPRMachineState *spapr, int shift,
> +                          Error **errp);
>  
>  /* rtas-configure-connector state */
>  struct sPAPRConfigureConnectorState {
> diff --git a/include/hw/ppc/spapr_ovec.h
> b/include/hw/ppc/spapr_ovec.h
> index 355a344..f5fed87 100644
> --- a/include/hw/ppc/spapr_ovec.h
> +++ b/include/hw/ppc/spapr_ovec.h
> @@ -47,6 +47,7 @@ typedef struct sPAPROptionVector sPAPROptionVector;
>  #define OV5_DRCONF_MEMORY       OV_BIT(2, 2)
>  #define OV5_FORM1_AFFINITY      OV_BIT(5, 0)
>  #define OV5_HP_EVT              OV_BIT(6, 5)
> +#define OV5_HPT_RESIZE          OV_BIT(6, 7)
>  
>  /* interfaces */
>  sPAPROptionVector *spapr_ovec_new(void);

Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing
  2016-12-13 14:29   ` Laurent Vivier
@ 2016-12-14  6:12     ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-12-14  6:12 UTC (permalink / raw)
  To: Laurent Vivier
  Cc: paulus, sjitindarsingh, agraf, mdroth, thuth, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1520 bytes --]

On Tue, Dec 13, 2016 at 03:29:05PM +0100, Laurent Vivier wrote:
> 
> 
> On 12/12/2016 05:06, David Gibson wrote:
> > This introduces stub implementations of the H_RESIZE_HPT_PREPARE and
> > H_RESIZE_HPT_COMMIT hypercalls which we hope to add in a PAPR
> > extension to allow run time resizing of a guest's hash page table.  It
> > also adds a new machine property for controlling whether this new
> > facility is available, and logic to check that against availability
> > with KVM (only supported with KVM PR for now).
> > 
> > Finally, it adds a new string to the hypertas property in the device
> > tree, advertising to the guest the availability of the HPT resizing
> > hypercalls.  This is a tentative suggested value, and would need to be
> > standardized by PAPR before being merged.
> 
> Could you explain somewhere what the aim of the "flags" parameter is?
> It would help to understand why we have it, since it is currently unused.

It's mostly just there on the general principle that having some way of
extending the interface later is a good idea.

As an example of a possible extension, we could have a flag which
caused all the valid HPTEs to be rehashed, instead of just the bolted
ones - we'd need to do tests to see if that was worthwhile (probably a
tradeoff between commit downtime and post-resize performance).
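Purely as an illustration (this flag doesn't exist in the patch or in
PAPR, and the name and value here are made up), such an extension could
be carried through the existing, currently-unused argument along these
lines:

    #include <stdint.h>

    #define H_RESIZE_HPT_FLAG_REHASH_ALL   0x1ULL   /* hypothetical */

    /* Returns 0 and sets *rehash_all on success, or -1 for unknown flag
     * bits (the hcall would then fail with H_PARAMETER, as it does now
     * for any non-zero flags). */
    static int parse_resize_hpt_flags(uint64_t flags, int *rehash_all)
    {
        if (flags & ~H_RESIZE_HPT_FLAG_REHASH_ALL) {
            return -1;
        }
        *rehash_all = !!(flags & H_RESIZE_HPT_FLAG_REHASH_ALL);
        return 0;
    }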

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-14  5:30   ` Suraj Jitindar Singh
@ 2016-12-14  6:20     ` David Gibson
  2016-12-14 23:20       ` Suraj Jitindar Singh
  2016-12-15  0:59     ` David Gibson
  1 sibling, 1 reply; 20+ messages in thread
From: David Gibson @ 2016-12-14  6:20 UTC (permalink / raw)
  To: Suraj Jitindar Singh
  Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 11228 bytes --]

On Wed, Dec 14, 2016 at 04:30:57PM +1100, Suraj Jitindar Singh wrote:
> On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
[snip]
> > +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> > +
> > +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> > +    assert(shift); /* H_ENTER should never have allowed a bad
> > encoding */
> > +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
> > +
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        pteg = ~pteg;
> > +    }
> > +
> The hash calculation below is pretty hard to follow... That being said
> I don't really see a nicer way of structuring it. I guess you just have
> to wade through the ISA if you actually want to understand what's
> happening here (which I assume most people won't bother with).

Yeah, the hash calculation is always hard to follow.  Here it's
particularly bad because we're essentially doing two: one backwards
and one forwards.  Unless someone has a great suggestion, I don't
really see a way to make this obvious.
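For readers following along, the "forwards" half is just the standard
ISA hash, which rehash_hpte() computes once the VSID, segment offset and
base page shift have been recovered from the old entry.  As a sketch
(masking into a PTEG index is left to the caller):

    #include <stdint.h>

    /* Primary hash as computed in rehash_hpte() above; seg_1T selects
     * the 1T-segment variant, otherwise the 256M one is used. */
    static uint64_t hpt_primary_hash(uint64_t vsid, uint64_t seg_off,
                                     unsigned page_shift, int seg_1T)
    {
        if (seg_1T) {
            return vsid ^ (vsid << 25) ^ (seg_off >> page_shift);
        }
        return vsid ^ (seg_off >> page_shift);
    }

    /* Primary PTEG = hash & hash_mask; the secondary PTEG is its
     * complement, matching the new_pteg calculation in the patch. */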

> > +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 28 - 23 bits of offset in avpn */
> > +        offset = (avpn & 0x1f) << 23;
> > +        vsid = avpn >> 5;
> > +        /* We can find more bits from the pteg value */
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> > +        }
> > +
> > +        hash = vsid ^ (offset >> shift);
> > +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 40 - 23 bits of seg_off in avpn */
> > +        offset = (avpn & 0x1ffff) << 23;
> > +        vsid = avpn >> 17;
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask)
> > << shift;
> > +        }
> > +
> > +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> > +    } else {
> > +        error_report("rehash_pte: Bad segment size in HPTE");
> > +        return H_HARDWARE;
> > +    }
> > +
> > +    new_pteg = hash & new_hash_mask;
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        assert(~pteg == (hash & old_hash_mask));
> > +        new_pteg = ~new_pteg;
> > +    } else {
> > +        assert(pteg == (hash & old_hash_mask));
> > +    }
> > +    assert((oldsize != newsize) || (pteg == new_pteg));
> > +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> > +    if (replace_pte0 & HPTE64_V_VALID) {
> > +        assert(newsize < oldsize);
> > +        if (replace_pte0 & HPTE64_V_BOLTED) {
> Aren't both of these checks for BOLTED redundant? We only rehash PTEs
> which are both valid and bolted, otherwise we would have returned at ***
> above. So the new HPT will only ever contain invalid entries or valid &&
> bolted entries, and we don't need to check whether replace_pte0 is bolted
> once we know it's valid. Similarly, we know the entry we're inserting is
> bolted, otherwise we wouldn't have bothered rehashing it. So checking
> (replace_pte0 & HPTE64_V_VALID) is sufficient; a hit there already
> implies a bolted collision.

Ah, so, technically, yes it's redundant.  I'd prefer to leave it as
is, in case we ever implement a version which also rehashes non-bolted
entries.  As it is we could do that simply by changing the test at the
top, without needing a synchronized change here.

> > +            if (pte0 & HPTE64_V_BOLTED) {
> > +                /* Bolted collision, nothing we can do */
> > +                return H_PTEG_FULL;
> > +            } else {
> > +                /* Discard this hpte */
> > +                return H_SUCCESS;
> > +            }
> > +        }
> Is there any reason the rehashed pte has to go into the same slot it
> came out of? Isn't it valid to search all of the slots of the new pteg
> for an empty (invalid) one and put it in the first empty space? Just
> because the current slot already contains a bolted entry doesn't mean
> we can't put it in another slot, right?

Not in practice.  The guest is allowed to keep track of which slot it
loaded an HPTE into and use that for later updates or invalidates -
Linux guests do this in practice.  Because of that the PAPR ACR
explicitly says the resize must preserve hash slots.

In theory we could implement a variant which didn't do this, but I
doubt we'll ever want to, because it would be much, much harder to work
with on the guest side.
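To spell out the guest-side assumption (hypothetical wrapper names, not
QEMU code; just the pattern David describes Linux guests using):

    /* The guest remembers the slot index H_ENTER handed back and passes
     * the same index to later hcalls, so a resize that moved entries to
     * different slots would break those later calls. */
    extern long guest_h_enter(unsigned long flags, unsigned long ptegidx,
                              unsigned long pte0, unsigned long pte1);
    extern long guest_h_remove(unsigned long flags, unsigned long slot,
                               unsigned long avpn);

    static void guest_map_then_unmap(unsigned long ptegidx,
                                     unsigned long pte0,
                                     unsigned long pte1,
                                     unsigned long avpn)
    {
        long slot = guest_h_enter(0, ptegidx, pte0, pte1);

        if (slot >= 0) {
            /* ...possibly much later, after an HPT resize... */
            guest_h_remove(0, slot, avpn);
        }
    }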

> > +    }
> > +
> > +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> > +    return H_SUCCESS;
> > +}
> > +
> > +static int rehash_hpt(PowerPCCPU *cpu,
> > +                      void *old, uint64_t oldsize,
> > +                      void *new, uint64_t newsize)
> > +{
> > +    CPUPPCState *env = &cpu->env;
> > +    uint64_t n_ptegs = oldsize >> 7;
> > +    uint64_t pteg;
> > +    int slot;
> > +    int rc;
> > +
> > +    assert(env->external_htab == old);
> > +
> > +    for (pteg = 0; pteg < n_ptegs; pteg++) {
> Can we rename token to pteg_[base_]addr or something? Token is very...
> generic

So the value returned here is supposed to be opaque, and only passed
back to the accessor helpers.  It's a different thing depending on
whether we have an external HPT or an internal HPT (virtual bare metal
machine), hence the generic name.

> > +        uint64_t token = ppc_hash64_start_access(cpu, pteg *
> > HPTES_PER_GROUP);
> > +
> > +        if (!token) {
> > +            return H_HARDWARE;
> > +        }
> > +
> > +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> > +            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
> > +                             pteg, slot);
> > +            if (rc != H_SUCCESS) {
> > +                ppc_hash64_stop_access(cpu, token);
> > +                return rc;
> > +            }
> > +        }
> > +        ppc_hash64_stop_access(cpu, token);
> > +    }
> > +
> > +    return H_SUCCESS;
> > +}
> > +
> > +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> > +{
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> > +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> > +
> > +    cpu_synchronize_state(cs);
> > +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
> > +                                &error_fatal);
> >  }
> >  
> >  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> > @@ -378,13 +678,52 @@ static target_ulong
> > h_resize_hpt_commit(PowerPCCPU *cpu,
> >  {
> >      target_ulong flags = args[0];
> >      target_ulong shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +    int rc;
> > +    size_t newsize;
> >  
> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_commit(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (!pending || (pending->shift != shift)) {
> > +        /* no matching prepare */
> > +        return H_CLOSED;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* prepare has not completed */
> > +        return H_BUSY;
> > +    }
> > +
> > +    newsize = 1ULL << pending->shift;
> > +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> > +                    pending->hpt, newsize);
> > +    if (rc == H_SUCCESS) {
> > +        CPUState *cs;
> > +
> > +        qemu_vfree(spapr->htab);
> > +        spapr->htab = pending->hpt;
> > +        spapr->htab_shift = pending->shift;
> > +
> > +        CPU_FOREACH(cs) {
> > +            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
> > +        }
> > +
> > +        pending->hpt = NULL; /* so it's not free()d */
> > +    }
> > +
> > +    /* Clean up */
> > +    spapr->pending_hpt = NULL;
> > +    free_pending_hpt(pending);
> > +
> > +    return rc;
> >  }
> >  
> >  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState
> > *spapr,
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index d2c758b..954bada 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -14,6 +14,7 @@ struct sPAPRNVRAM;
> >  typedef struct sPAPRConfigureConnectorState
> > sPAPRConfigureConnectorState;
> >  typedef struct sPAPREventLogEntry sPAPREventLogEntry;
> >  typedef struct sPAPREventSource sPAPREventSource;
> > +typedef struct sPAPRPendingHPT sPAPRPendingHPT;
> >  
> >  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
> >  #define SPAPR_ENTRY_POINT       0x100
> > @@ -72,6 +73,8 @@ struct sPAPRMachineState {
> >      sPAPRResizeHPT resize_hpt;
> >      void *htab;
> >      uint32_t htab_shift;
> > +    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
> > +
> >      hwaddr rma_size;
> >      int vrma_adjust;
> >      ssize_t rtas_size;
> > @@ -627,6 +630,7 @@ void
> > spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType
> > drc_type,
> >                                                 uint32_t count,
> > uint32_t index);
> >  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
> >                                      sPAPRMachineState *spapr);
> > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
> >  
> >  /* rtas-configure-connector state */
> >  struct sPAPRConfigureConnectorState {
> > @@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
> >  
> >  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data
> > arg);
> >  
> > +#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > +
> >  #endif /* HW_SPAPR_H */
> > diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
> > index ab5d347..4a1b781 100644
> > --- a/target-ppc/mmu-hash64.h
> > +++ b/target-ppc/mmu-hash64.h
> > @@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState *env);
> >  #define HASH_PTE_SIZE_64        16
> >  #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)
> >  
> > +#define HPTE64_V_SSIZE          SLB_VSID_B
> > +#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
> > +#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
> >  #define HPTE64_V_SSIZE_SHIFT    62
> >  #define HPTE64_V_AVPN_SHIFT     7
> >  #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
> >  #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >>
> > HPTE64_V_AVPN_SHIFT)
> >  #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) &
> > 0xffffffffffffff83ULL))
> > +#define HPTE64_V_BOLTED         0x0000000000000010ULL
> >  #define HPTE64_V_LARGE          0x0000000000000004ULL
> >  #define HPTE64_V_SECONDARY      0x0000000000000002ULL
> >  #define HPTE64_V_VALID          0x0000000000000001ULL
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9
  2016-12-14  5:32   ` Suraj Jitindar Singh
@ 2016-12-14  6:20     ` David Gibson
  2016-12-14 23:08       ` Suraj Jitindar Singh
  0 siblings, 1 reply; 20+ messages in thread
From: David Gibson @ 2016-12-14  6:20 UTC (permalink / raw)
  To: Suraj Jitindar Singh
  Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2215 bytes --]

On Wed, Dec 14, 2016 at 04:32:26PM +1100, Suraj Jitindar Singh wrote:
> On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> > We've now implemented a PAPR extension which allows PAPR guests
> > (i.e.
> > "pseries" machine type) to resize their hash page table during
> > runtime.
> > 
> > However, that extension is only enabled if explicitly chosen on the
> > command line.  This patch enables it by default for spapr-2.9, but
> > leaves
> > it disabled (by default) for older machine types.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > Reviewed-by: Laurent Vivier <lvivier@redhat.com>
> > ---
> >  hw/ppc/spapr.c | 5 ++++-
> >  1 file changed, 4 insertions(+), 1 deletion(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index d057031..f05d0e5 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2779,7 +2779,7 @@ static void
> > spapr_machine_class_init(ObjectClass *oc, void *data)
> >  
> >      smc->dr_lmb_enabled = true;
> >      smc->tcg_default_cpu = "POWER8";
> > -    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
> > +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
> >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >      fwc->get_dev_path = spapr_get_fw_dev_path;
> >      nc->nmi_monitor_handler = spapr_nmi;
> > @@ -2860,8 +2860,11 @@ static void
> > spapr_machine_2_8_instance_options(MachineState *machine)
> >  
> >  static void spapr_machine_2_8_class_options(MachineClass *mc)
> >  {
> > +    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > +
> >      spapr_machine_2_9_class_options(mc);
> >      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_8);
> > +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
> >  }
> >  
> >  DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
> If people don't want this by default they should probably specify on
> the command line.

Uh.. I don't follow you.

> Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9
  2016-12-14  6:20     ` David Gibson
@ 2016-12-14 23:08       ` Suraj Jitindar Singh
  0 siblings, 0 replies; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14 23:08 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Wed, 2016-12-14 at 17:20 +1100, David Gibson wrote:
> On Wed, Dec 14, 2016 at 04:32:26PM +1100, Suraj Jitindar Singh wrote:
> > 
> > On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> > > 
> > > We've now implemented a PAPR extension which allows PAPR guests
> > > (i.e.
> > > "pseries" machine type) to resize their hash page table during
> > > runtime.
> > > 
> > > However, that extension is only enabled if explicitly chosen on
> > > the
> > > command line.  This patch enables it by default for spapr-2.9,
> > > but
> > > leaves
> > > it disabled (by default) for older machine types.
> > > 
> > > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > > Reviewed-by: Laurent Vivier <lvivier@redhat.com>
> > > ---
> > >  hw/ppc/spapr.c | 5 ++++-
> > >  1 file changed, 4 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index d057031..f05d0e5 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -2779,7 +2779,7 @@ static void
> > > spapr_machine_class_init(ObjectClass *oc, void *data)
> > >  
> > >      smc->dr_lmb_enabled = true;
> > >      smc->tcg_default_cpu = "POWER8";
> > > -    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
> > > +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;
> > >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> > >      fwc->get_dev_path = spapr_get_fw_dev_path;
> > >      nc->nmi_monitor_handler = spapr_nmi;
> > > @@ -2860,8 +2860,11 @@ static void
> > > spapr_machine_2_8_instance_options(MachineState *machine)
> > >  
> > >  static void spapr_machine_2_8_class_options(MachineClass *mc)
> > >  {
> > > +    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> > > +
> > >      spapr_machine_2_9_class_options(mc);
> > >      SET_MACHINE_COMPAT(mc, SPAPR_COMPAT_2_8);
> > > +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
> > >  }
> > >  
> > >  DEFINE_SPAPR_MACHINE(2_8, "2.8", false);
> > If people don't want this by default they should probably specify
> > on
> > the command line.
> Uh.. I don't follow you.
Just agreeing with you setting the default to enabled :)
> 
> > 
> > Reviewed-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com>
> > 

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-14  6:20     ` David Gibson
@ 2016-12-14 23:20       ` Suraj Jitindar Singh
  0 siblings, 0 replies; 20+ messages in thread
From: Suraj Jitindar Singh @ 2016-12-14 23:20 UTC (permalink / raw)
  To: David Gibson; +Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

On Wed, 2016-12-14 at 17:20 +1100, David Gibson wrote:
> On Wed, Dec 14, 2016 at 04:30:57PM +1100, Suraj Jitindar Singh wrote:
> > 
> > On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> [snip]
> > 
> > > 
> > > +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> > > +        return H_SUCCESS;
> > > +    }
> > > +
> > > +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> > > +
> > > +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> > > +    assert(shift); /* H_ENTER should never have allowed a bad
> > > encoding */
> > > +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >>
> > > 23);
> > > +
> > > +    if (pte0 & HPTE64_V_SECONDARY) {
> > > +        pteg = ~pteg;
> > > +    }
> > > +
> > The hash calculation below is pretty hard to follow... That being
> > said
> > I don't really see a nicer way of structuring it. I guess you just
> > have
> > to wade through the ISA if you actually want to understand what's
> > happening here (which I assume most people won't bother with).
> Yeah, the hash calculation is always hard to follow.  Here it's
> particularly bad because we're essentially doing two: one backwards
> and one forwards.  Unless someone has a great suggestion, I don't
> really see a way to make this obvious.
> 
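
For my own reference while reviewing, the forward calculation that is
being inverted here is roughly the following (illustrative only, names
made up; it mirrors the formulas in the hunk below):

/* offset = byte offset of the page within its segment,
 * base_pg_shift = base virtual page shift of the mapping. */
static uint64_t forward_hash(uint64_t vsid, uint64_t offset,
                             unsigned base_pg_shift, bool seg_1T)
{
    if (!seg_1T) {
        /* 256M segment */
        return vsid ^ (offset >> base_pg_shift);
    }
    /* 1T segment */
    return vsid ^ (vsid << 25) ^ (offset >> base_pg_shift);
}

/* primary PTEG index = hash & hash_mask, secondary is its complement.
 * The rehash recovers vsid/offset from the AVPN (plus extra bits taken
 * from the old PTEG index when base_pg_shift < 23), then reapplies the
 * same formula with the new table's hash_mask. */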
> > 
> > > 
> > > +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> > > +        uint64_t offset, vsid;
> > > +
> > > +        /* We only have 28 - 23 bits of offset in avpn */
> > > +        offset = (avpn & 0x1f) << 23;
> > > +        vsid = avpn >> 5;
> > > +        /* We can find more bits from the pteg value */
> > > +        if (shift < 23) {
> > > +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> > > +        }
> > > +
> > > +        hash = vsid ^ (offset >> shift);
> > > +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> > > +        uint64_t offset, vsid;
> > > +
> > > +        /* We only have 40 - 23 bits of seg_off in avpn */
> > > +        offset = (avpn & 0x1ffff) << 23;
> > > +        vsid = avpn >> 17;
> > > +        if (shift < 23) {
> > > +            offset |= ((vsid ^ (vsid << 25) ^ pteg) &
> > > old_hash_mask)
> > > << shift;
> > > +        }
> > > +
> > > +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> > > +    } else {
> > > +        error_report("rehash_pte: Bad segment size in HPTE");
> > > +        return H_HARDWARE;
> > > +    }
> > > +
> > > +    new_pteg = hash & new_hash_mask;
> > > +    if (pte0 & HPTE64_V_SECONDARY) {
> > > +        assert(~pteg == (hash & old_hash_mask));
> > > +        new_pteg = ~new_pteg;
> > > +    } else {
> > > +        assert(pteg == (hash & old_hash_mask));
> > > +    }
> > > +    assert((oldsize != newsize) || (pteg == new_pteg));
> > > +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> > > +    if (replace_pte0 & HPTE64_V_VALID) {
> > > +        assert(newsize < oldsize);
> > > +        if (replace_pte0 & HPTE64_V_BOLTED) {
> > Aren't both of these checks for BOLTED redundant? We know that we
> > are
> > only rehashing ptes which are both valid and bolted, otherwise we
> > would
> > have returned at *** above. Thus the new hpt will only contain
> > invalid
> > entries or valid && bolted entries, we don't need to check if
> > replace_pte is bolted if we know it's valid. Similarly we know the
> > one
> > we are replacing it with is bolted, otherwise we wouldn't have
> > bothered
> > rehashing it. Thus it's pretty much sufficient to check
> > replace_pte0
> > && HPTE64_V_VALID == 1 which implies a bolted collision
> > irrespective.
> Ah, so, technically, yes it's redundant.  I'd prefer to leave it as
> is, in case we ever implement a version which also rehashes non-
> bolted
> entries.  As it is we could do that simply by changing the test at
> the
> top, without needing a synchronized change here.
Sounds good, may as well keep it as is then to make future updates
easier.
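Concretely I guess that future update would just be relaxing the first
test, something like (illustrative only):

    /* current: skip anything that isn't both valid and bolted */
    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
        return H_SUCCESS;
    }

    /* hypothetical future version that rehashes unbolted entries too */
    if (!(pte0 & HPTE64_V_VALID)) {
        return H_SUCCESS;
    }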
> 
> > 
> > > 
> > > +            if (pte0 & HPTE64_V_BOLTED) {
> > > +                /* Bolted collision, nothing we can do */
> > > +                return H_PTEG_FULL;
> > > +            } else {
> > > +                /* Discard this hpte */
> > > +                return H_SUCCESS;
> > > +            }
> > > +        }
> > Is there any reason the rehashed pte has to go into the same slot
> > it
> > came out of? Isn't it valid to search all of the slots of the new
> > pteg
> > for an empty (invalid) one and put it in the first empty space?
> > Just
> > because the current slot already contains a bolted entry doesn't
> > mean
> > we can't put it in another slot, right?
> Not in practice.  The guest is allowed to keep track of which slot it
> loaded an HPTE into and use that for later updates or invalidates -
> Linux guests do this in practice.  Because of that the PAPR ACR
> explicitly says the resize must preserve hash slots.
> 
> In theory we could implement a variant which didn't do this, but I
> doubt we'll ever want to because it will be much, much harder to work
> with on the guest side.
Didn't manage to get hold of the ACR so didn't realise slots had to be
preserved. Seems like if you have a bolted collision you would want to
just use a different slot rather than fail the whole rehash. But I
didn't realise guests tracked the slot either so this is probably
difficult to change.
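For my own notes, the guest-side pattern that forces slots to be
preserved is roughly this (hypothetical names, not the actual Linux
code):

/* The guest remembers which slot H_ENTER picked and hands that same
 * slot back on the later invalidate, so a resize that moved an entry
 * to a different slot would break it. */
extern long hv_enter(unsigned long flags, unsigned long ptex_base,
                     unsigned long pte_v, unsigned long pte_r,
                     unsigned long *slot_ret);   /* wraps H_ENTER */
extern long hv_remove(unsigned long flags, unsigned long ptex,
                      unsigned long avpn);       /* wraps the invalidate hcall */

struct guest_pte_info {
    unsigned long slot;                /* remembered from H_ENTER */
};

static long guest_map(struct guest_pte_info *gp, unsigned long ptex_base,
                      unsigned long v, unsigned long r)
{
    return hv_enter(0, ptex_base, v, r, &gp->slot);
}

static long guest_unmap(struct guest_pte_info *gp, unsigned long avpn)
{
    /* reuses the remembered slot rather than searching the PTEG */
    return hv_remove(0, gp->slot, avpn);
}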
> 
> > 
> > > 
> > > +    }
> > > +
> > > +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> > > +    return H_SUCCESS;
> > > +}
> > > +
> > > +static int rehash_hpt(PowerPCCPU *cpu,
> > > +                      void *old, uint64_t oldsize,
> > > +                      void *new, uint64_t newsize)
> > > +{
> > > +    CPUPPCState *env = &cpu->env;
> > > +    uint64_t n_ptegs = oldsize >> 7;
> > > +    uint64_t pteg;
> > > +    int slot;
> > > +    int rc;
> > > +
> > > +    assert(env->external_htab == old);
> > > +
> > > +    for (pteg = 0; pteg < n_ptegs; pteg++) {
> > Can we rename token to pteg_[base_]addr or something? Token is
> > very...
> > generic
> So the value returned here is supposed to be opaque, and is only
> passed back to the accessor helpers.  It returns different things
> depending on whether we have an external HPT or an internal HPT
> (virtual bare metal machine).  Hence the generic name.
> 
> > 
> > > 
> > > +        uint64_t token = ppc_hash64_start_access(cpu, pteg *
> > > HPTES_PER_GROUP);
> > > +
> > > +        if (!token) {
> > > +            return H_HARDWARE;
> > > +        }
> > > +
> > > +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> > > +            rc = rehash_hpte(cpu, token, old, oldsize, new,
> > > newsize,
> > > +                             pteg, slot);
> > > +            if (rc != H_SUCCESS) {
> > > +                ppc_hash64_stop_access(cpu, token);
> > > +                return rc;
> > > +            }
> > > +        }
> > > +        ppc_hash64_stop_access(cpu, token);
> > > +    }
> > > +
> > > +    return H_SUCCESS;
> > > +}
> > > +
> > > +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> > > +{
> > > +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> > > +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> > > +
> > > +    cpu_synchronize_state(cs);
> > > +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr-
> > > >htab_shift,
> > > +                                &error_fatal);
> > >  }
> > >  
> > >  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> > > @@ -378,13 +678,52 @@ static target_ulong
> > > h_resize_hpt_commit(PowerPCCPU *cpu,
> > >  {
> > >      target_ulong flags = args[0];
> > >      target_ulong shift = args[1];
> > > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > > +    int rc;
> > > +    size_t newsize;
> > >  
> > >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> > >          return H_AUTHORITY;
> > >      }
> > >  
> > >      trace_spapr_h_resize_hpt_commit(flags, shift);
> > > -    return H_HARDWARE;
> > > +
> > > +    if (flags != 0) {
> > > +        return H_PARAMETER;
> > > +    }
> > > +
> > > +    if (!pending || (pending->shift != shift)) {
> > > +        /* no matching prepare */
> > > +        return H_CLOSED;
> > > +    }
> > > +
> > > +    if (!pending->complete) {
> > > +        /* prepare has not completed */
> > > +        return H_BUSY;
> > > +    }
> > > +
> > > +    newsize = 1ULL << pending->shift;
> > > +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> > > +                    pending->hpt, newsize);
> > > +    if (rc == H_SUCCESS) {
> > > +        CPUState *cs;
> > > +
> > > +        qemu_vfree(spapr->htab);
> > > +        spapr->htab = pending->hpt;
> > > +        spapr->htab_shift = pending->shift;
> > > +
> > > +        CPU_FOREACH(cs) {
> > > +            run_on_cpu(cs, pivot_hpt,
> > > RUN_ON_CPU_HOST_PTR(spapr));
> > > +        }
> > > +
> > > +        pending->hpt = NULL; /* so it's not free()d */
> > > +    }
> > > +
> > > +    /* Clean up */
> > > +    spapr->pending_hpt = NULL;
> > > +    free_pending_hpt(pending);
> > > +
> > > +    return rc;
> > >  }
> > >  
> > >  static target_ulong h_set_sprg0(PowerPCCPU *cpu,
> > > sPAPRMachineState
> > > *spapr,
> > > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > > index d2c758b..954bada 100644
> > > --- a/include/hw/ppc/spapr.h
> > > +++ b/include/hw/ppc/spapr.h
> > > @@ -14,6 +14,7 @@ struct sPAPRNVRAM;
> > >  typedef struct sPAPRConfigureConnectorState
> > > sPAPRConfigureConnectorState;
> > >  typedef struct sPAPREventLogEntry sPAPREventLogEntry;
> > >  typedef struct sPAPREventSource sPAPREventSource;
> > > +typedef struct sPAPRPendingHPT sPAPRPendingHPT;
> > >  
> > >  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
> > >  #define SPAPR_ENTRY_POINT       0x100
> > > @@ -72,6 +73,8 @@ struct sPAPRMachineState {
> > >      sPAPRResizeHPT resize_hpt;
> > >      void *htab;
> > >      uint32_t htab_shift;
> > > +    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
> > > +
> > >      hwaddr rma_size;
> > >      int vrma_adjust;
> > >      ssize_t rtas_size;
> > > @@ -627,6 +630,7 @@ void
> > > spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType
> > > drc_type,
> > >                                                 uint32_t count,
> > > uint32_t index);
> > >  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int
> > > *fdt_offset,
> > >                                      sPAPRMachineState *spapr);
> > > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
> > >  
> > >  /* rtas-configure-connector state */
> > >  struct sPAPRConfigureConnectorState {
> > > @@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
> > >  
> > >  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data
> > > arg);
> > >  
> > > +#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > > +
> > >  #endif /* HW_SPAPR_H */
> > > diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
> > > index ab5d347..4a1b781 100644
> > > --- a/target-ppc/mmu-hash64.h
> > > +++ b/target-ppc/mmu-hash64.h
> > > @@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState
> > > *env);
> > >  #define HASH_PTE_SIZE_64        16
> > >  #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 *
> > > HPTES_PER_GROUP)
> > >  
> > > +#define HPTE64_V_SSIZE          SLB_VSID_B
> > > +#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
> > > +#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
> > >  #define HPTE64_V_SSIZE_SHIFT    62
> > >  #define HPTE64_V_AVPN_SHIFT     7
> > >  #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
> > >  #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >>
> > > HPTE64_V_AVPN_SHIFT)
> > >  #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) &
> > > 0xffffffffffffff83ULL))
> > > +#define HPTE64_V_BOLTED         0x0000000000000010ULL
> > >  #define HPTE64_V_LARGE          0x0000000000000004ULL
> > >  #define HPTE64_V_SECONDARY      0x0000000000000002ULL
> > >  #define HPTE64_V_VALID          0x0000000000000001ULL

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-14  5:30   ` Suraj Jitindar Singh
  2016-12-14  6:20     ` David Gibson
@ 2016-12-15  0:59     ` David Gibson
  1 sibling, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-12-15  0:59 UTC (permalink / raw)
  To: Suraj Jitindar Singh
  Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 20421 bytes --]

On Wed, Dec 14, 2016 at 04:30:57PM +1100, Suraj Jitindar Singh wrote:
> On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> > This patch implements hypercalls allowing a PAPR guest to resize its
> > own
> > hash page table.  This will eventually allow for more flexible memory
> > hotplug.
> > 
> > The implementation is partially asynchronous, handled in a special
> > thread
> > running the hpt_prepare_thread() function.  The state of a pending
> > resize
> > is stored in SPAPR_MACHINE->pending_hpt.
> > 
> > The H_RESIZE_HPT_PREPARE hypercall will kick off creation of a new
> > HPT, or,
> > if one is already in progress, monitor it for completion.  If there
> > is an
> > existing HPT resize in progress that doesn't match the size specified
> > in
> > the call, it will cancel it, replacing it with a new one matching the
> > given size.
> > 
> > The H_RESIZE_HPT_COMMIT completes transition to a resized HPT, and
> > can only
> > be called successfully once H_RESIZE_HPT_PREPARE has successfully
> > completed initialization of a new HPT.  The guest must ensure that
> > there
> > are no concurrent accesses to the existing HPT while this is called
> > (this
> > effectively means stop_machine() for Linux guests).
> > 
> > For now H_RESIZE_HPT_COMMIT goes through the whole old HPT, rehashing
> > each
> > HPTE into the new HPT.  This can have quite high latency, but it
> > seems to
> > be of the order of typical migration downtime latencies for HPTs of
> > size
> > up to ~2GiB (which would be used in a 256GiB guest).
> > 
> > In future we probably want to move more of the rehashing to the
> > "prepare"
> > phase, by having H_ENTER and other hcalls update both current and
> > pending HPTs.  That's a project for another day, but should be
> > possible
> > without any changes to the guest interface.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c          |   4 +-
> >  hw/ppc/spapr_hcall.c    | 345
> > +++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/hw/ppc/spapr.h  |   6 +
> >  target-ppc/mmu-hash64.h |   4 +
> >  4 files changed, 353 insertions(+), 6 deletions(-)
> > 
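
As an aside for reviewers, the guest-side sequence this is designed
around looks roughly like this (hypothetical wrapper names, not the
actual Linux implementation):

extern long h_resize_hpt_prepare_hcall(unsigned long flags,
                                       unsigned long shift);
extern long h_resize_hpt_commit_hcall(unsigned long flags,
                                      unsigned long shift);
extern int  rc_is_long_busy(long rc);       /* H_LONG_BUSY_* family */
extern void sleep_for_hint(long rc);        /* honour the busy hint */
extern void quiesce_all_cpus(void);         /* stop_machine() in Linux */
extern void resume_all_cpus(void);

static long guest_resize_hpt(unsigned long new_shift)
{
    long rc;

    /* Phase 1: ask the hypervisor to allocate the new HPT, polling
     * until the preparation thread on the qemu side has finished. */
    do {
        rc = h_resize_hpt_prepare_hcall(0, new_shift);
        if (rc_is_long_busy(rc)) {
            sleep_for_hint(rc);             /* ~100ms for this impl. */
        }
    } while (rc_is_long_busy(rc));
    if (rc != H_SUCCESS) {
        return rc;
    }

    /* Phase 2: nothing may touch the old HPT during the commit, which
     * rehashes the bolted entries into the new table and pivots. */
    quiesce_all_cpus();
    rc = h_resize_hpt_commit_hcall(0, new_shift);
    resume_all_cpus();
    return rc;
}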
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 846ce51..d057031 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -93,8 +93,6 @@
> >  
> >  #define PHANDLE_XICP            0x00001111
> >  
> > -#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > -
> >  static XICSState *try_create_xics(const char *type, int nr_servers,
> >                                    int nr_irqs, Error **errp)
> >  {
> > @@ -1055,7 +1053,7 @@ static void close_htab_fd(sPAPRMachineState
> > *spapr)
> >      spapr->htab_fd = -1;
> >  }
> >  
> > -static int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> >  {
> >      int shift;
> >  
> > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> > index 72a9c4d..1e89061 100644
> > --- a/hw/ppc/spapr_hcall.c
> > +++ b/hw/ppc/spapr_hcall.c
> > @@ -2,6 +2,7 @@
> >  #include "qapi/error.h"
> >  #include "sysemu/sysemu.h"
> >  #include "qemu/log.h"
> > +#include "qemu/error-report.h"
> >  #include "cpu.h"
> >  #include "exec/exec-all.h"
> >  #include "helper_regs.h"
> > @@ -355,20 +356,319 @@ static target_ulong h_read(PowerPCCPU *cpu,
> > sPAPRMachineState *spapr,
> >      return H_SUCCESS;
> >  }
> >  
> > +struct sPAPRPendingHPT {
> > +    /* These fields are read-only after initialization */
> > +    int shift;
> > +    QemuThread thread;
> > +
> > +    /* These fields are protected by the BQL */
> > +    bool complete;
> > +
> > +    /* These fields are private to the preparation thread if
> > +     * !complete, otherwise protected by the BQL */
> > +    int ret;
> > +    void *hpt;
> > +};
> > +
> > +static void free_pending_hpt(sPAPRPendingHPT *pending)
> > +{
> > +    if (pending->hpt) {
> > +        qemu_vfree(pending->hpt);
> > +    }
> > +
> > +    g_free(pending);
> > +}
> > +
> > +static void *hpt_prepare_thread(void *opaque)
> > +{
> > +    sPAPRPendingHPT *pending = opaque;
> > +    size_t size = 1ULL << pending->shift;
> > +
> > +    pending->hpt = qemu_memalign(size, size);
> > +    if (pending->hpt) {
> > +        memset(pending->hpt, 0, size);
> > +        pending->ret = H_SUCCESS;
> > +    } else {
> > +        pending->ret = H_NO_MEM;
> > +    }
> > +
> > +    qemu_mutex_lock_iothread();
> > +
> > +    if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt == pending) {
> > +        /* Ready to go */
> > +        pending->complete = true;
> > +    } else {
> > +        /* We've been cancelled, clean ourselves up */
> > +        free_pending_hpt(pending);
> > +    }
> > +
> > +    qemu_mutex_unlock_iothread();
> > +    return NULL;
> > +}
> > +
> > +/* Must be called with BQL held */
> > +static void cancel_hpt_prepare(sPAPRMachineState *spapr)
> > +{
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +
> > +    /* Let the thread know it's cancelled */
> > +    spapr->pending_hpt = NULL;
> > +
> > +    if (!pending) {
> > +        /* Nothing to do */
> > +        return;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* thread will clean itself up */
> > +        return;
> > +    }
> > +
> > +    free_pending_hpt(pending);
> > +}
> > +
> > +static int build_dimm_list(Object *obj, void *opaque)
> > +{
> > +    GSList **list = opaque;
> > +
> > +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> > +        DeviceState *dev = DEVICE(obj);
> > +        if (dev->realized) { /* only realized DIMMs matter */
> > +            *list = g_slist_prepend(*list, dev);
> > +        }
> > +    }
> > +
> > +    object_child_foreach(obj, build_dimm_list, opaque);
> > +    return 0;
> > +}
> > +
> > +static ram_addr_t get_current_ram_size(void)
> > +{
> > +    GSList *list = NULL, *item;
> > +    ram_addr_t size = ram_size;
> > +
> > +    build_dimm_list(qdev_get_machine(), &list);
> > +    for (item = list; item; item = g_slist_next(item)) {
> > +        Object *obj = OBJECT(item->data);
> > +        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
> > +            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
> > +                                            &error_abort);
> > +        }
> > +    }
> > +    g_slist_free(list);
> > +
> > +    return size;
> > +}
> > +
> >  static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
> >                                           sPAPRMachineState *spapr,
> >                                           target_ulong opcode,
> >                                           target_ulong *args)
> >  {
> >      target_ulong flags = args[0];
> > -    target_ulong shift = args[1];
> > +    int shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> >  
> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_prepare(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (shift && ((shift < 18) || (shift > 46))) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (pending) {
> > +        /* something already in progress */
> > +        if (pending->shift == shift) {
> > +            /* and it's suitable */
> > +            if (pending->complete) {
> > +                return pending->ret;
> > +            } else {
> > +                return H_LONG_BUSY_ORDER_100_MSEC;
> > +            }
> > +        }
> > +
> > +        /* not suitable, cancel and replace */
> > +        cancel_hpt_prepare(spapr);
> > +    }
> > +
> > +    if (!shift) {
> > +        /* nothing to do */
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    /* start new prepare */
> > +
> > +    /* We only allow the guest to allocate an HPT one order above
> > what
> > +     * we'd normally give them (to stop a small guest claiming a
> > huge
> > +     * chunk of resources in the HPT) */
> > +    if (shift > (spapr_hpt_shift_for_ramsize(get_current_ram_size())
> > + 1)) {
> > +        return H_RESOURCE;
> > +    }
> > +
> > +    pending = g_new0(sPAPRPendingHPT, 1);
> > +    pending->shift = shift;
> > +    pending->ret = H_HARDWARE;
> > +
> > +    qemu_thread_create(&pending->thread, "sPAPR HPT prepare",
> > +                       hpt_prepare_thread, pending,
> > QEMU_THREAD_DETACHED);
> > +
> > +    spapr->pending_hpt = pending;
> > +
> > +    /* In theory we could estimate the time more accurately based on
> > +     * the new size, but there's not much point */
> > +    return H_LONG_BUSY_ORDER_100_MSEC;
> > +}
> > +
> > +static uint64_t new_hpte_load0(void *htab, uint64_t pteg, int slot)
> > +{
> > +    uint8_t *addr = htab;
> > +
> > +    addr += pteg * HASH_PTEG_SIZE_64;
> > +    addr += slot * HASH_PTE_SIZE_64;
> > +    return  ldq_p(addr);
> > +}
> > +
> > +static void new_hpte_store(void *htab, uint64_t pteg, int slot,
> > +                           uint64_t pte0, uint64_t pte1)
> > +{
> > +    uint8_t *addr = htab;
> > +
> > +    addr += pteg * HASH_PTEG_SIZE_64;
> > +    addr += slot * HASH_PTE_SIZE_64;
> > +
> > +    stq_p(addr, pte0);
> > +    stq_p(addr + HASH_PTE_SIZE_64/2, pte1);
> > +}
> > +
> > +static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
> > +                       void *old, uint64_t oldsize,
> > +                       void *new, uint64_t newsize,
> Can we call these old_hpt and new_hpt?

Done.
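
i.e. the prototype now reads something like (only the parameter names
change):

static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
                       void *old_hpt, uint64_t oldsize,
                       void *new_hpt, uint64_t newsize,
                       uint64_t pteg, int slot);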

> > +                       uint64_t pteg, int slot)
> > +{
> > +    uint64_t old_hash_mask = (oldsize >> 7) - 1;
> > +    uint64_t new_hash_mask = (newsize >> 7) - 1;
> > +    target_ulong pte0 = ppc_hash64_load_hpte0(cpu, token, slot);
> > +    target_ulong pte1;
> > +    uint64_t avpn;
> > +    unsigned shift;
> Can we call this base_pg_shift or at least add a comment:
> /* Base Virtual Page Shift */ or something?
> Took me a while to work out what *shift* this actually was...

Ah, good point.  Especially since I usually use 'shift' to refer to
the whole HPT size.  Changed.
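
Locally that now looks something like this (final wording may differ):

    unsigned base_pg_shift;    /* base virtual page shift of this HPTE */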

> > +    uint64_t hash, new_pteg, replace_pte0;
> > +
> *** (referenced below)
> > +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> > +
> > +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> > +    assert(shift); /* H_ENTER should never have allowed a bad
> > encoding */
> > +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
> > +
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        pteg = ~pteg;
> > +    }
> > +
> The hash calculation below is pretty hard to follow... That being said
> I don't really see a nicer way of structuring it. I guess you just have
> to wade through the ISA if you actually want to understand what's
> happening here (which I assume most people won't bother with).
> > +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 28 - 23 bits of offset in avpn */
> > +        offset = (avpn & 0x1f) << 23;
> > +        vsid = avpn >> 5;
> > +        /* We can find more bits from the pteg value */
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> > +        }
> > +
> > +        hash = vsid ^ (offset >> shift);
> > +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 40 - 23 bits of seg_off in avpn */
> > +        offset = (avpn & 0x1ffff) << 23;
> > +        vsid = avpn >> 17;
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask)
> > << shift;
> > +        }
> > +
> > +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> > +    } else {
> > +        error_report("rehash_pte: Bad segment size in HPTE");
> > +        return H_HARDWARE;
> > +    }
> > +
> > +    new_pteg = hash & new_hash_mask;
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        assert(~pteg == (hash & old_hash_mask));
> > +        new_pteg = ~new_pteg;
> > +    } else {
> > +        assert(pteg == (hash & old_hash_mask));
> > +    }
> > +    assert((oldsize != newsize) || (pteg == new_pteg));
> > +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> > +    if (replace_pte0 & HPTE64_V_VALID) {
> > +        assert(newsize < oldsize);
> > +        if (replace_pte0 & HPTE64_V_BOLTED) {
> Aren't both of these checks for BOLTED redundant? We know that we are
> only rehashing ptes which are both valid and bolted, otherwise we would
> have returned at *** above. Thus the new hpt will only contain invalid
> entries or valid && bolted entries, we don't need to check if
> replace_pte is bolted if we know it's valid. Similarly we know the one
> we are replacing it with is bolted, otherwise we wouldn't have bothered
> rehashing it. Thus it's pretty much sufficient to check replace_pte0
> && HPTE64_V_VALID == 1 which implies a bolted collision irrespective.

I've added a comment explaining why this code is here.
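Roughly along these lines (exact wording may still change):

    /* The test at the top of this function means we only rehash bolted
     * entries for now, so any valid entry already in the new HPT is a
     * bolted collision.  Keep the full VALID/BOLTED logic here anyway,
     * so that a future version which also rehashes unbolted entries
     * only needs to change that one test. */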

> > +            if (pte0 & HPTE64_V_BOLTED) {
> > +                /* Bolted collision, nothing we can do */
> > +                return H_PTEG_FULL;
> > +            } else {
> > +                /* Discard this hpte */
> > +                return H_SUCCESS;
> > +            }
> > +        }
> Is there any reason the rehashed pte has to go into the same slot it
> came out of? Isn't it valid to search all of the slots of the new pteg
> for an empty (invalid) one and put it in the first empty space? Just
> because the current slot already contains a bolted entry doesn't mean
> we can't put it in another slot, right?
> > +    }
> > +
> > +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> > +    return H_SUCCESS;
> > +}
> > +
> > +static int rehash_hpt(PowerPCCPU *cpu,
> > +                      void *old, uint64_t oldsize,
> > +                      void *new, uint64_t newsize)
> > +{
> > +    CPUPPCState *env = &cpu->env;
> > +    uint64_t n_ptegs = oldsize >> 7;
> > +    uint64_t pteg;
> > +    int slot;
> > +    int rc;
> > +
> > +    assert(env->external_htab == old);
> > +
> > +    for (pteg = 0; pteg < n_ptegs; pteg++) {
> Can we rename token to pteg_[base_]addr or something? Token is very...
> generic
> > +        uint64_t token = ppc_hash64_start_access(cpu, pteg *
> > HPTES_PER_GROUP);
> > +
> > +        if (!token) {
> > +            return H_HARDWARE;
> > +        }
> > +
> > +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> > +            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
> > +                             pteg, slot);
> > +            if (rc != H_SUCCESS) {
> > +                ppc_hash64_stop_access(cpu, token);
> > +                return rc;
> > +            }
> > +        }
> > +        ppc_hash64_stop_access(cpu, token);
> > +    }
> > +
> > +    return H_SUCCESS;
> > +}
> > +
> > +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> > +{
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> > +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> > +
> > +    cpu_synchronize_state(cs);
> > +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
> > +                                &error_fatal);
> >  }
> >  
> >  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> > @@ -378,13 +678,52 @@ static target_ulong
> > h_resize_hpt_commit(PowerPCCPU *cpu,
> >  {
> >      target_ulong flags = args[0];
> >      target_ulong shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +    int rc;
> > +    size_t newsize;
> >  
> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_commit(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (!pending || (pending->shift != shift)) {
> > +        /* no matching prepare */
> > +        return H_CLOSED;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* prepare has not completed */
> > +        return H_BUSY;
> > +    }
> > +
> > +    newsize = 1ULL << pending->shift;
> > +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> > +                    pending->hpt, newsize);
> > +    if (rc == H_SUCCESS) {
> > +        CPUState *cs;
> > +
> > +        qemu_vfree(spapr->htab);
> > +        spapr->htab = pending->hpt;
> > +        spapr->htab_shift = pending->shift;
> > +
> > +        CPU_FOREACH(cs) {
> > +            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
> > +        }
> > +
> > +        pending->hpt = NULL; /* so it's not free()d */
> > +    }
> > +
> > +    /* Clean up */
> > +    spapr->pending_hpt = NULL;
> > +    free_pending_hpt(pending);
> > +
> > +    return rc;
> >  }
> >  
> >  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState
> > *spapr,
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index d2c758b..954bada 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -14,6 +14,7 @@ struct sPAPRNVRAM;
> >  typedef struct sPAPRConfigureConnectorState
> > sPAPRConfigureConnectorState;
> >  typedef struct sPAPREventLogEntry sPAPREventLogEntry;
> >  typedef struct sPAPREventSource sPAPREventSource;
> > +typedef struct sPAPRPendingHPT sPAPRPendingHPT;
> >  
> >  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
> >  #define SPAPR_ENTRY_POINT       0x100
> > @@ -72,6 +73,8 @@ struct sPAPRMachineState {
> >      sPAPRResizeHPT resize_hpt;
> >      void *htab;
> >      uint32_t htab_shift;
> > +    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
> > +
> >      hwaddr rma_size;
> >      int vrma_adjust;
> >      ssize_t rtas_size;
> > @@ -627,6 +630,7 @@ void
> > spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType
> > drc_type,
> >                                                 uint32_t count,
> > uint32_t index);
> >  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
> >                                      sPAPRMachineState *spapr);
> > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
> >  
> >  /* rtas-configure-connector state */
> >  struct sPAPRConfigureConnectorState {
> > @@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
> >  
> >  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data
> > arg);
> >  
> > +#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > +
> >  #endif /* HW_SPAPR_H */
> > diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
> > index ab5d347..4a1b781 100644
> > --- a/target-ppc/mmu-hash64.h
> > +++ b/target-ppc/mmu-hash64.h
> > @@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState *env);
> >  #define HASH_PTE_SIZE_64        16
> >  #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)
> >  
> > +#define HPTE64_V_SSIZE          SLB_VSID_B
> > +#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
> > +#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
> >  #define HPTE64_V_SSIZE_SHIFT    62
> >  #define HPTE64_V_AVPN_SHIFT     7
> >  #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
> >  #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >>
> > HPTE64_V_AVPN_SHIFT)
> >  #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) &
> > 0xffffffffffffff83ULL))
> > +#define HPTE64_V_BOLTED         0x0000000000000010ULL
> >  #define HPTE64_V_LARGE          0x0000000000000004ULL
> >  #define HPTE64_V_SECONDARY      0x0000000000000002ULL
> >  #define HPTE64_V_VALID          0x0000000000000001ULL
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 3/5] pseries: Implement HPT resizing
  2016-12-14  5:35   ` Suraj Jitindar Singh
@ 2016-12-15  1:03     ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-12-15  1:03 UTC (permalink / raw)
  To: Suraj Jitindar Singh
  Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 19118 bytes --]

On Wed, Dec 14, 2016 at 04:35:56PM +1100, Suraj Jitindar Singh wrote:
> On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> > This patch implements hypercalls allowing a PAPR guest to resize its
> > own
> > hash page table.  This will eventually allow for more flexible memory
> > hotplug.
> > 
> > The implementation is partially asynchronous, handled in a special
> > thread
> > running the hpt_prepare_thread() function.  The state of a pending
> > resize
> > is stored in SPAPR_MACHINE->pending_hpt.
> > 
> > The H_RESIZE_HPT_PREPARE hypercall will kick off creation of a new
> > HPT, or,
> > if one is already in progress, monitor it for completion.  If there
> > is an
> > existing HPT resize in progress that doesn't match the size specified
> > in
> > the call, it will cancel it, replacing it with a new one matching the
> > given size.
> > 
> > The H_RESIZE_HPT_COMMIT completes transition to a resized HPT, and
> > can only
> > be called successfully once H_RESIZE_HPT_PREPARE has successfully
> > completed initialization of a new HPT.  The guest must ensure that
> > there
> > are no concurrent accesses to the existing HPT while this is called
> > (this
> > effectively means stop_machine() for Linux guests).
> > 
> > For now H_RESIZE_HPT_COMMIT goes through the whole old HPT, rehashing
> > each
> > HPTE into the new HPT.  This can have quite high latency, but it
> > seems to
> > be of the order of typical migration downtime latencies for HPTs of
> > size
> > up to ~2GiB (which would be used in a 256GiB guest).
> > 
> > In future we probably want to move more of the rehashing to the
> > "prepare"
> > phase, by having H_ENTER and other hcalls update both current and
> > pending HPTs.  That's a project for another day, but should be
> > possible
> > without any changes to the guest interface.
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c          |   4 +-
> >  hw/ppc/spapr_hcall.c    | 345
> > +++++++++++++++++++++++++++++++++++++++++++++++-
> >  include/hw/ppc/spapr.h  |   6 +
> >  target-ppc/mmu-hash64.h |   4 +
> >  4 files changed, 353 insertions(+), 6 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 846ce51..d057031 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -93,8 +93,6 @@
> >  
> >  #define PHANDLE_XICP            0x00001111
> >  
> > -#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > -
> >  static XICSState *try_create_xics(const char *type, int nr_servers,
> >                                    int nr_irqs, Error **errp)
> >  {
> > @@ -1055,7 +1053,7 @@ static void close_htab_fd(sPAPRMachineState
> > *spapr)
> >      spapr->htab_fd = -1;
> >  }
> >  
> > -static int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize)
> >  {
> >      int shift;
> >  
> > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> > index 72a9c4d..1e89061 100644
> > --- a/hw/ppc/spapr_hcall.c
> > +++ b/hw/ppc/spapr_hcall.c
> > @@ -2,6 +2,7 @@
> >  #include "qapi/error.h"
> >  #include "sysemu/sysemu.h"
> >  #include "qemu/log.h"
> > +#include "qemu/error-report.h"
> >  #include "cpu.h"
> >  #include "exec/exec-all.h"
> >  #include "helper_regs.h"
> > @@ -355,20 +356,319 @@ static target_ulong h_read(PowerPCCPU *cpu,
> > sPAPRMachineState *spapr,
> >      return H_SUCCESS;
> >  }
> >  
> > +struct sPAPRPendingHPT {
> > +    /* These fields are read-only after initialization */
> > +    int shift;
> > +    QemuThread thread;
> > +
> > +    /* These fields are protected by the BQL */
> > +    bool complete;
> > +
> > +    /* These fields are private to the preparation thread if
> > +     * !complete, otherwise protected by the BQL */
> > +    int ret;
> > +    void *hpt;
> > +};
> > +
> > +static void free_pending_hpt(sPAPRPendingHPT *pending)
> > +{
> > +    if (pending->hpt) {
> > +        qemu_vfree(pending->hpt);
> > +    }
> > +
> > +    g_free(pending);
> > +}
> > +
> > +static void *hpt_prepare_thread(void *opaque)
> > +{
> > +    sPAPRPendingHPT *pending = opaque;
> > +    size_t size = 1ULL << pending->shift;
> > +
> > +    pending->hpt = qemu_memalign(size, size);
> > +    if (pending->hpt) {
> > +        memset(pending->hpt, 0, size);
> > +        pending->ret = H_SUCCESS;
> > +    } else {
> > +        pending->ret = H_NO_MEM;
> > +    }
> > +
> > +    qemu_mutex_lock_iothread();
> > +
> > +    if (SPAPR_MACHINE(qdev_get_machine())->pending_hpt == pending) {
> > +        /* Ready to go */
> > +        pending->complete = true;
> > +    } else {
> > +        /* We've been cancelled, clean ourselves up */
> > +        free_pending_hpt(pending);
> > +    }
> > +
> > +    qemu_mutex_unlock_iothread();
> > +    return NULL;
> > +}
> > +
> > +/* Must be called with BQL held */
> > +static void cancel_hpt_prepare(sPAPRMachineState *spapr)
> > +{
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +
> > +    /* Let the thread know it's cancelled */
> > +    spapr->pending_hpt = NULL;
> > +
> > +    if (!pending) {
> > +        /* Nothing to do */
> > +        return;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* thread will clean itself up */
> > +        return;
> > +    }
> > +
> > +    free_pending_hpt(pending);
> > +}
> > +
> > +static int build_dimm_list(Object *obj, void *opaque)
> > +{
> > +    GSList **list = opaque;
> > +
> > +    if (object_dynamic_cast(obj, TYPE_PC_DIMM)) {
> > +        DeviceState *dev = DEVICE(obj);
> > +        if (dev->realized) { /* only realized DIMMs matter */
> > +            *list = g_slist_prepend(*list, dev);
> > +        }
> > +    }
> > +
> > +    object_child_foreach(obj, build_dimm_list, opaque);
> > +    return 0;
> > +}
> > +
> > +static ram_addr_t get_current_ram_size(void)
> > +{
> > +    GSList *list = NULL, *item;
> > +    ram_addr_t size = ram_size;
> > +
> > +    build_dimm_list(qdev_get_machine(), &list);
> > +    for (item = list; item; item = g_slist_next(item)) {
> > +        Object *obj = OBJECT(item->data);
> > +        if (!strcmp(object_get_typename(obj), TYPE_PC_DIMM)) {
> > +            size += object_property_get_int(obj, PC_DIMM_SIZE_PROP,
> > +                                            &error_abort);
> > +        }
> > +    }
> > +    g_slist_free(list);
> > +
> > +    return size;
> > +}
> > +
> >  static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
> >                                           sPAPRMachineState *spapr,
> >                                           target_ulong opcode,
> >                                           target_ulong *args)
> >  {
> >      target_ulong flags = args[0];
> > -    target_ulong shift = args[1];
> > +    int shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> >  
> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_prepare(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (shift && ((shift < 18) || (shift > 46))) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (pending) {
> > +        /* something already in progress */
> > +        if (pending->shift == shift) {
> > +            /* and it's suitable */
> > +            if (pending->complete) {
> > +                return pending->ret;
> > +            } else {
> > +                return H_LONG_BUSY_ORDER_100_MSEC;
> > +            }
> > +        }
> > +
> > +        /* not suitable, cancel and replace */
> > +        cancel_hpt_prepare(spapr);
> > +    }
> > +
> > +    if (!shift) {
> > +        /* nothing to do */
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    /* start new prepare */
> > +
> > +    /* We only allow the guest to allocate an HPT one order above
> > what
> > +     * we'd normally give them (to stop a small guest claiming a
> > huge
> > +     * chunk of resources in the HPT) */
> > +    if (shift > (spapr_hpt_shift_for_ramsize(get_current_ram_size())
> > + 1))
> Given we already allocate one lower than recommended in the ISA by
> default, should we set this to 2 to allow an allocation of 1 greater
> than the recommended minimum?

No, I think capping it at the ISA recommended size should be ok.  It's
physically contiguous memory in the host, which means it's a pretty
scarce resource.  If this is really a problem in practice we can
revisit it.
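
To put rough numbers on it, the cap works out like this (illustrative
helper only; the ~1/128 default is an assumption based on your "one
lower than recommended" point, and the real logic is
spapr_hpt_shift_for_ramsize() in spapr.c):

#include <stdint.h>

/* Assumes ramsize > 1.  The 18/46 clamps match the H_PARAMETER check
 * in h_resize_hpt_prepare(). */
static int max_requestable_hpt_shift(uint64_t ramsize)
{
    /* default: ceil(log2(ramsize)) - 7, i.e. an HPT of roughly RAM/128 */
    int default_shift = 64 - __builtin_clzll(ramsize - 1) - 7;
    int max_shift = default_shift + 1;          /* roughly RAM/64 */

    if (max_shift < 18) {
        max_shift = 18;                 /* architected minimum */
    }
    if (max_shift > 46) {
        max_shift = 46;                 /* architected maximum */
    }
    return max_shift;
}

/* e.g. 16 GiB of RAM: default shift 27 (128 MiB HPT), so the guest can
 * request at most shift 28 (256 MiB, about RAM/64). */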

> > +        return H_RESOURCE;
> > +    }
> > +
> > +    pending = g_new0(sPAPRPendingHPT, 1);
> > +    pending->shift = shift;
> > +    pending->ret = H_HARDWARE;
> > +
> > +    qemu_thread_create(&pending->thread, "sPAPR HPT prepare",
> > +                       hpt_prepare_thread, pending,
> > QEMU_THREAD_DETACHED);
> > +
> > +    spapr->pending_hpt = pending;
> > +
> > +    /* In theory we could estimate the time more accurately based on
> > +     * the new size, but there's not much point */
> > +    return H_LONG_BUSY_ORDER_100_MSEC;
> > +}
> > +
> > +static uint64_t new_hpte_load0(void *htab, uint64_t pteg, int slot)
> > +{
> > +    uint8_t *addr = htab;
> > +
> > +    addr += pteg * HASH_PTEG_SIZE_64;
> > +    addr += slot * HASH_PTE_SIZE_64;
> > +    return  ldq_p(addr);
> > +}
> > +
> > +static void new_hpte_store(void *htab, uint64_t pteg, int slot,
> > +                           uint64_t pte0, uint64_t pte1)
> > +{
> > +    uint8_t *addr = htab;
> > +
> > +    addr += pteg * HASH_PTEG_SIZE_64;
> > +    addr += slot * HASH_PTE_SIZE_64;
> > +
> > +    stq_p(addr, pte0);
> > +    stq_p(addr + HASH_PTE_SIZE_64/2, pte1);
> > +}
> > +
> > +static int rehash_hpte(PowerPCCPU *cpu, uint64_t token,
> > +                       void *old, uint64_t oldsize,
> > +                       void *new, uint64_t newsize,
> > +                       uint64_t pteg, int slot)
> > +{
> > +    uint64_t old_hash_mask = (oldsize >> 7) - 1;
> > +    uint64_t new_hash_mask = (newsize >> 7) - 1;
> > +    target_ulong pte0 = ppc_hash64_load_hpte0(cpu, token, slot);
> > +    target_ulong pte1;
> > +    uint64_t avpn;
> > +    unsigned shift;
> > +    uint64_t hash, new_pteg, replace_pte0;
> > +
> > +    if (!(pte0 & HPTE64_V_VALID) || !(pte0 & HPTE64_V_BOLTED)) {
> > +        return H_SUCCESS;
> > +    }
> > +
> > +    pte1 = ppc_hash64_load_hpte1(cpu, token, slot);
> > +
> > +    shift = ppc_hash64_hpte_page_shift_noslb(cpu, pte0, pte1);
> > +    assert(shift); /* H_ENTER should never have allowed a bad
> > encoding */
> > +    avpn = HPTE64_V_AVPN_VAL(pte0) & ~(((1ULL << shift) - 1) >> 23);
> > +
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        pteg = ~pteg;
> > +    }
> > +
> > +    if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_256M) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 28 - 23 bits of offset in avpn */
> > +        offset = (avpn & 0x1f) << 23;
> > +        vsid = avpn >> 5;
> > +        /* We can find more bits from the pteg value */
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ pteg) & old_hash_mask) << shift;
> > +        }
> > +
> > +        hash = vsid ^ (offset >> shift);
> > +    } else if ((pte0 & HPTE64_V_SSIZE) == HPTE64_V_SSIZE_1T) {
> > +        uint64_t offset, vsid;
> > +
> > +        /* We only have 40 - 23 bits of seg_off in avpn */
> > +        offset = (avpn & 0x1ffff) << 23;
> > +        vsid = avpn >> 17;
> > +        if (shift < 23) {
> > +            offset |= ((vsid ^ (vsid << 25) ^ pteg) & old_hash_mask)
> > << shift;
> > +        }
> > +
> > +        hash = vsid ^ (vsid << 25) ^ (offset >> shift);
> > +    } else {
> > +        error_report("rehash_pte: Bad segment size in HPTE");
> > +        return H_HARDWARE;
> > +    }
> > +
> > +    new_pteg = hash & new_hash_mask;
> > +    if (pte0 & HPTE64_V_SECONDARY) {
> > +        assert(~pteg == (hash & old_hash_mask));
> > +        new_pteg = ~new_pteg;
> > +    } else {
> > +        assert(pteg == (hash & old_hash_mask));
> > +    }
> > +    assert((oldsize != newsize) || (pteg == new_pteg));
> > +    replace_pte0 = new_hpte_load0(new, new_pteg, slot);
> > +    if (replace_pte0 & HPTE64_V_VALID) {
> > +        assert(newsize < oldsize);
> > +        if (replace_pte0 & HPTE64_V_BOLTED) {
> > +            if (pte0 & HPTE64_V_BOLTED) {
> > +                /* Bolted collision, nothing we can do */
> > +                return H_PTEG_FULL;
> > +            } else {
> > +                /* Discard this hpte */
> > +                return H_SUCCESS;
> > +            }
> > +        }
> > +    }
> > +
> > +    new_hpte_store(new, new_pteg, slot, pte0, pte1);
> > +    return H_SUCCESS;
> > +}
> > +
> > +static int rehash_hpt(PowerPCCPU *cpu,
> > +                      void *old, uint64_t oldsize,
> > +                      void *new, uint64_t newsize)
> > +{
> > +    CPUPPCState *env = &cpu->env;
> > +    uint64_t n_ptegs = oldsize >> 7;
> > +    uint64_t pteg;
> > +    int slot;
> > +    int rc;
> > +
> > +    assert(env->external_htab == old);
> > +
> > +    for (pteg = 0; pteg < n_ptegs; pteg++) {
> > +        uint64_t token = ppc_hash64_start_access(cpu, pteg *
> > HPTES_PER_GROUP);
> > +
> > +        if (!token) {
> > +            return H_HARDWARE;
> > +        }
> > +
> > +        for (slot = 0; slot < HPTES_PER_GROUP; slot++) {
> > +            rc = rehash_hpte(cpu, token, old, oldsize, new, newsize,
> > +                             pteg, slot);
> > +            if (rc != H_SUCCESS) {
> > +                ppc_hash64_stop_access(cpu, token);
> > +                return rc;
> > +            }
> > +        }
> > +        ppc_hash64_stop_access(cpu, token);
> > +    }
> > +
> > +    return H_SUCCESS;
> > +}
> > +
> > +static void pivot_hpt(CPUState *cs, run_on_cpu_data arg)
> > +{
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(arg.host_ptr);
> > +    PowerPCCPU *cpu = POWERPC_CPU(cs);
> > +
> > +    cpu_synchronize_state(cs);
> > +    ppc_hash64_set_external_hpt(cpu, spapr->htab, spapr->htab_shift,
> > +                                &error_fatal);
> >  }
> >  
> >  static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> > @@ -378,13 +678,52 @@ static target_ulong
> > h_resize_hpt_commit(PowerPCCPU *cpu,
> >  {
> >      target_ulong flags = args[0];
> >      target_ulong shift = args[1];
> > +    sPAPRPendingHPT *pending = spapr->pending_hpt;
> > +    int rc;
> > +    size_t newsize;
> >  
> >      if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> >          return H_AUTHORITY;
> >      }
> >  
> >      trace_spapr_h_resize_hpt_commit(flags, shift);
> > -    return H_HARDWARE;
> > +
> > +    if (flags != 0) {
> > +        return H_PARAMETER;
> > +    }
> > +
> > +    if (!pending || (pending->shift != shift)) {
> > +        /* no matching prepare */
> > +        return H_CLOSED;
> > +    }
> > +
> > +    if (!pending->complete) {
> > +        /* prepare has not completed */
> > +        return H_BUSY;
> > +    }
> > +
> > +    newsize = 1ULL << pending->shift;
> > +    rc = rehash_hpt(cpu, spapr->htab, HTAB_SIZE(spapr),
> > +                    pending->hpt, newsize);
> > +    if (rc == H_SUCCESS) {
> > +        CPUState *cs;
> > +
> > +        qemu_vfree(spapr->htab);
> > +        spapr->htab = pending->hpt;
> > +        spapr->htab_shift = pending->shift;
> > +
> > +        CPU_FOREACH(cs) {
> > +            run_on_cpu(cs, pivot_hpt, RUN_ON_CPU_HOST_PTR(spapr));
> > +        }
> > +
> > +        pending->hpt = NULL; /* so it's not free()d */
> > +    }
> > +
> > +    /* Clean up */
> > +    spapr->pending_hpt = NULL;
> > +    free_pending_hpt(pending);
> > +
> > +    return rc;
> >  }
> >  
> >  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index d2c758b..954bada 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -14,6 +14,7 @@ struct sPAPRNVRAM;
> >  typedef struct sPAPRConfigureConnectorState sPAPRConfigureConnectorState;
> >  typedef struct sPAPREventLogEntry sPAPREventLogEntry;
> >  typedef struct sPAPREventSource sPAPREventSource;
> > +typedef struct sPAPRPendingHPT sPAPRPendingHPT;
> >  
> >  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
> >  #define SPAPR_ENTRY_POINT       0x100
> > @@ -72,6 +73,8 @@ struct sPAPRMachineState {
> >      sPAPRResizeHPT resize_hpt;
> >      void *htab;
> >      uint32_t htab_shift;
> > +    sPAPRPendingHPT *pending_hpt; /* in-progress resize */
> > +
> >      hwaddr rma_size;
> >      int vrma_adjust;
> >      ssize_t rtas_size;
> > @@ -627,6 +630,7 @@ void spapr_hotplug_req_remove_by_count_indexed(sPAPRDRConnectorType drc_type,
> >                                                 uint32_t count, uint32_t index);
> >  void *spapr_populate_hotplug_cpu_dt(CPUState *cs, int *fdt_offset,
> >                                      sPAPRMachineState *spapr);
> > +int spapr_hpt_shift_for_ramsize(uint64_t ramsize);
> >  
> >  /* rtas-configure-connector state */
> >  struct sPAPRConfigureConnectorState {
> > @@ -674,4 +678,6 @@ int spapr_rng_populate_dt(void *fdt);
> >  
> >  void spapr_do_system_reset_on_cpu(CPUState *cs, run_on_cpu_data arg);
> >  
> > +#define HTAB_SIZE(spapr)        (1ULL << ((spapr)->htab_shift))
> > +
> >  #endif /* HW_SPAPR_H */
> > diff --git a/target-ppc/mmu-hash64.h b/target-ppc/mmu-hash64.h
> > index ab5d347..4a1b781 100644
> > --- a/target-ppc/mmu-hash64.h
> > +++ b/target-ppc/mmu-hash64.h
> > @@ -60,11 +60,15 @@ void ppc_hash64_update_rmls(CPUPPCState *env);
> >  #define HASH_PTE_SIZE_64        16
> >  #define HASH_PTEG_SIZE_64       (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)
> >  
> > +#define HPTE64_V_SSIZE          SLB_VSID_B
> > +#define HPTE64_V_SSIZE_256M     SLB_VSID_B_256M
> > +#define HPTE64_V_SSIZE_1T       SLB_VSID_B_1T
> >  #define HPTE64_V_SSIZE_SHIFT    62
> >  #define HPTE64_V_AVPN_SHIFT     7
> >  #define HPTE64_V_AVPN           0x3fffffffffffff80ULL
> >  #define HPTE64_V_AVPN_VAL(x)    (((x) & HPTE64_V_AVPN) >> HPTE64_V_AVPN_SHIFT)
> >  #define HPTE64_V_COMPARE(x, y)  (!(((x) ^ (y)) & 0xffffffffffffff83ULL))
> > +#define HPTE64_V_BOLTED         0x0000000000000010ULL
> >  #define HPTE64_V_LARGE          0x0000000000000004ULL
> >  #define HPTE64_V_SECONDARY      0x0000000000000002ULL
> >  #define HPTE64_V_VALID          0x0000000000000001ULL
> 
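
As an aside (not part of the quoted patch): the size arithmetic the rehash
code above relies on is simply that a hash PTE is 16 bytes and a PTEG groups
8 of them, so an HPT of 1ULL << shift bytes holds (size >> 7) PTEGs and the
PTEG-index mask is n_ptegs - 1.  A minimal standalone sketch, with the helper
name made up purely for illustration:

    #include <assert.h>
    #include <inttypes.h>
    #include <stdio.h>

    #define HASH_PTE_SIZE_64   16    /* bytes per hash PTE */
    #define HPTES_PER_GROUP    8     /* PTEs per PTEG */
    #define HASH_PTEG_SIZE_64  (HASH_PTE_SIZE_64 * HPTES_PER_GROUP)  /* 128 */

    /* How many PTEGs fit in an HPT of 2^shift bytes, and the mask applied
     * to the VA hash to select a PTEG index. */
    static void hpt_geometry(unsigned shift)
    {
        uint64_t size = 1ULL << shift;    /* cf. HTAB_SIZE(spapr) */
        uint64_t n_ptegs = size >> 7;     /* size / HASH_PTEG_SIZE_64 */
        uint64_t hash_mask = n_ptegs - 1;

        assert(size / HASH_PTEG_SIZE_64 == n_ptegs);
        printf("shift=%u: %" PRIu64 " PTEGs, hash mask 0x%" PRIx64 "\n",
               shift, n_ptegs, hash_mask);
    }

    int main(void)
    {
        hpt_geometry(18);    /* 256 KiB HPT: 2048 PTEGs */
        hpt_geometry(24);    /* 16 MiB HPT: 131072 PTEGs */
        return 0;
    }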

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 20+ messages in thread

* Re: [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing
  2016-12-14  5:22   ` Suraj Jitindar Singh
@ 2016-12-15  1:07     ` David Gibson
  0 siblings, 0 replies; 20+ messages in thread
From: David Gibson @ 2016-12-15  1:07 UTC (permalink / raw)
  To: Suraj Jitindar Singh
  Cc: paulus, agraf, mdroth, thuth, lvivier, qemu-ppc, qemu-devel


On Wed, Dec 14, 2016 at 04:22:19PM +1100, Suraj Jitindar Singh wrote:
> On Mon, 2016-12-12 at 15:06 +1100, David Gibson wrote:
> > This introduces stub implementations of the H_RESIZE_HPT_PREPARE and
> > H_RESIZE_HPT_COMMIT hypercalls which we hope to add in a PAPR
> > extension to allow run time resizing of a guest's hash page table.  It
> > also adds a new machine property for controlling whether this new
> > facility is available, and logic to check that against availability
> > with KVM (only supported with KVM PR for now).
> > 
> > Finally, it adds a new string to the hypertas property in the device
> > tree, advertising to the guest the availability of the HPT resizing
> > hypercalls.  This is a tentative suggested value, and would need to be
> > standardized by PAPR before being merged.
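
(Usage sketch, not part of the patch: with this applied, the feature would be
requested via the new machine property, and a Linux guest could confirm that
the advertisement actually landed by looking at the hypertas property in the
device tree.  The guest-side path below is an assumption; only the property
name and the "hcall-hpt-resize" string come from this patch.)

    qemu-system-ppc64 -machine pseries,resize-hpt=enabled ...

    # inside a Linux guest (assumed path for the rtas node):
    strings /proc/device-tree/rtas/ibm,hypertas-functions | grep hcall-hpt-resize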
> > 
> > Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
> > ---
> >  hw/ppc/spapr.c         | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++
> >  hw/ppc/spapr_hcall.c   | 36 ++++++++++++++++++++++++
> >  hw/ppc/trace-events    |  2 ++
> >  include/hw/ppc/spapr.h | 11 ++++++++
> >  target-ppc/kvm.c       | 25 +++++++++++++++++
> >  target-ppc/kvm_ppc.h   |  5 ++++
> >  6 files changed, 154 insertions(+)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 0f25e83..846ce51 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -760,6 +760,11 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
> >      if (!kvm_enabled() || kvmppc_spapr_use_multitce()) {
> >          add_str(hypertas, "hcall-multi-tce");
> >      }
> > +
> > +    if (spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) {
> > +        add_str(hypertas, "hcall-hpt-resize");
> > +    }
> > +
> >      _FDT(fdt_setprop(fdt, rtas, "ibm,hypertas-functions",
> >                       hypertas->str, hypertas->len));
> >      g_string_free(hypertas, TRUE);
> > @@ -1839,11 +1844,40 @@ static void ppc_spapr_init(MachineState *machine)
> >      long load_limit, fw_size;
> >      char *filename;
> >      int smt = kvmppc_smt_threads();
> > +    Error *resize_hpt_err = NULL;
> >  
> >      msi_nonbroken = true;
> >  
> >      QLIST_INIT(&spapr->phbs);
> >  
> > +    /* Check HPT resizing availability */
> > +    kvmppc_check_papr_resize_hpt(&resize_hpt_err);
> > +    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DEFAULT) {
> > +        /*
> > +         * If the user explicitly requested a mode we should either
> > +         * supply it, or fail completely (which we do below).  But if
> > +         * it's not set explicitly, we reset our mode to something
> > +         * that works
> > +         */
> > +        if (resize_hpt_err) {
> > +            spapr->resize_hpt = SPAPR_RESIZE_HPT_DISABLED;
> > +            error_free(resize_hpt_err);
> > +            resize_hpt_err = NULL;
> > +        } else {
> > +            spapr->resize_hpt = smc->resize_hpt_default;
> > +        }
> > +    }
> > +
> > +    assert(spapr->resize_hpt != SPAPR_RESIZE_HPT_DEFAULT);
> > +
> > +    if ((spapr->resize_hpt != SPAPR_RESIZE_HPT_DISABLED) && resize_hpt_err) {
> > +        /*
> > +         * User requested HPT resize, but this host can't supply it.  Bail out
> > +         */
> > +        error_report_err(resize_hpt_err);
> > +        exit(1);
> > +    }
> > +
> >      /* Allocate RMA if necessary */
> >      rma_alloc_size = kvmppc_alloc_rma(&rma);
> >  
> > @@ -2236,6 +2270,40 @@ static void spapr_set_modern_hotplug_events(Object *obj, bool value,
> >      spapr->use_hotplug_event_source = value;
> >  }
> >  
> > +static char *spapr_get_resize_hpt(Object *obj, Error **errp)
> > +{
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > +
> > +    switch (spapr->resize_hpt) {
> > +    case SPAPR_RESIZE_HPT_DEFAULT:
> > +        return g_strdup("default");
> > +    case SPAPR_RESIZE_HPT_DISABLED:
> > +        return g_strdup("disabled");
> > +    case SPAPR_RESIZE_HPT_ENABLED:
> > +        return g_strdup("enabled");
> > +    case SPAPR_RESIZE_HPT_REQUIRED:
> > +        return g_strdup("required");
> > +    }
> > +    assert(0);
> > +}
> > +
> > +static void spapr_set_resize_hpt(Object *obj, const char *value, Error **errp)
> > +{
> > +    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > +
> > +    if (strcmp(value, "default") == 0) {
> > +        spapr->resize_hpt = SPAPR_RESIZE_HPT_DEFAULT;
> > +    } else if (strcmp(value, "disabled") == 0) {
> > +        spapr->resize_hpt = SPAPR_RESIZE_HPT_DISABLED;
> > +    } else if (strcmp(value, "enabled") == 0) {
> > +        spapr->resize_hpt = SPAPR_RESIZE_HPT_ENABLED;
> > +    } else if (strcmp(value, "required") == 0) {
> > +        spapr->resize_hpt = SPAPR_RESIZE_HPT_REQUIRED;
> > +    } else {
> > +        error_setg(errp, "Bad value for \"resize-hpt\" property");
> > +    }
> > +}
> > +
> >  static void spapr_machine_initfn(Object *obj)
> >  {
> >      sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
> > @@ -2256,6 +2324,12 @@ static void spapr_machine_initfn(Object *obj)
> >                                      " place of standard EPOW events when possible"
> >                                      " (required for memory hot-unplug support)",
> >                                      NULL);
> > +
> > +    object_property_add_str(obj, "resize-hpt",
> > +                            spapr_get_resize_hpt, spapr_set_resize_hpt, NULL);
> > +    object_property_set_description(obj, "resize-hpt",
> > +                                    "Resizing of the Hash Page Table (enabled, disabled, required)",
> > +                                    NULL);
> >  }
> >  
> >  static void spapr_machine_finalizefn(Object *obj)
> > @@ -2707,6 +2781,7 @@ static void spapr_machine_class_init(ObjectClass *oc, void *data)
> >  
> >      smc->dr_lmb_enabled = true;
> >      smc->tcg_default_cpu = "POWER8";
> > +    smc->resize_hpt_default = SPAPR_RESIZE_HPT_DISABLED;
> >      mc->query_hotpluggable_cpus = spapr_query_hotpluggable_cpus;
> >      fwc->get_dev_path = spapr_get_fw_dev_path;
> >      nc->nmi_monitor_handler = spapr_nmi;
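
(Side note, not part of this patch: since 4/5 in this series is titled
"pseries: Enable HPT resizing for 2.9", the expectation is that the
pseries-2.9 machine class later overrides this default -- presumably along
the lines of the single line below, with older machine types pinned back to
"disabled" via the usual compat machinery.)

    smc->resize_hpt_default = SPAPR_RESIZE_HPT_ENABLED;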
> > diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
> > index fd9f1d4..72a9c4d 100644
> > --- a/hw/ppc/spapr_hcall.c
> > +++ b/hw/ppc/spapr_hcall.c
> > @@ -355,6 +355,38 @@ static target_ulong h_read(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> >      return H_SUCCESS;
> >  }
> >  
> > +static target_ulong h_resize_hpt_prepare(PowerPCCPU *cpu,
> > +                                         sPAPRMachineState *spapr,
> > +                                         target_ulong opcode,
> > +                                         target_ulong *args)
> > +{
> > +    target_ulong flags = args[0];
> > +    target_ulong shift = args[1];
> > +
> > +    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> > +        return H_AUTHORITY;
> > +    }
> > +
> > +    trace_spapr_h_resize_hpt_prepare(flags, shift);
> > +    return H_HARDWARE;
> > +}
> > +
> > +static target_ulong h_resize_hpt_commit(PowerPCCPU *cpu,
> > +                                        sPAPRMachineState *spapr,
> > +                                        target_ulong opcode,
> > +                                        target_ulong *args)
> > +{
> > +    target_ulong flags = args[0];
> > +    target_ulong shift = args[1];
> > +
> > +    if (spapr->resize_hpt == SPAPR_RESIZE_HPT_DISABLED) {
> > +        return H_AUTHORITY;
> > +    }
> > +
> > +    trace_spapr_h_resize_hpt_commit(flags, shift);
> > +    return H_HARDWARE;
> > +}
> > +
> >  static target_ulong h_set_sprg0(PowerPCCPU *cpu, sPAPRMachineState *spapr,
> >                                  target_ulong opcode, target_ulong *args)
> >  {
> > @@ -1134,6 +1166,10 @@ static void hypercall_register_types(void)
> >      /* hcall-bulk */
> >      spapr_register_hypercall(H_BULK_REMOVE, h_bulk_remove);
> >  
> > +    /* hcall-hpt-resize */
> > +    spapr_register_hypercall(H_RESIZE_HPT_PREPARE, h_resize_hpt_prepare);
> > +    spapr_register_hypercall(H_RESIZE_HPT_COMMIT, h_resize_hpt_commit);
> > +
> >      /* hcall-splpar */
> >      spapr_register_hypercall(H_REGISTER_VPA, h_register_vpa);
> >      spapr_register_hypercall(H_CEDE, h_cede);
> > diff --git a/hw/ppc/trace-events b/hw/ppc/trace-events
> > index 2297ead..bf59a8f 100644
> > --- a/hw/ppc/trace-events
> > +++ b/hw/ppc/trace-events
> > @@ -16,6 +16,8 @@ spapr_cas_continue(unsigned long n) "Copy changes to the guest: %ld bytes"
> >  # hw/ppc/spapr_hcall.c
> >  spapr_cas_pvr_try(uint32_t pvr) "%x"
> >  spapr_cas_pvr(uint32_t cur_pvr, bool cpu_match, uint32_t new_pvr, uint64_t pcr) "current=%x, cpu_match=%u, new=%x, compat flags=%"PRIx64
> > +spapr_h_resize_hpt_prepare(uint64_t flags, uint64_t shift) "flags=0x%"PRIx64", shift=%"PRIu64
> > +spapr_h_resize_hpt_commit(uint64_t flags, uint64_t shift) "flags=0x%"PRIx64", shift=%"PRIu64
> >  
> >  # hw/ppc/spapr_iommu.c
> >  spapr_iommu_put(uint64_t liobn, uint64_t ioba, uint64_t tce, uint64_t ret) "liobn=%"PRIx64" ioba=0x%"PRIx64" tce=0x%"PRIx64" ret=%"PRId64
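
(Not part of the patch: once these trace points exist they can be enabled at
run time.  The exact syntax depends on which trace backend the QEMU binary
was built with, but with the default backends something along these lines is
expected to work.)

    qemu-system-ppc64 -machine pseries,resize-hpt=enabled \
                      -trace 'spapr_h_resize_hpt_*' ...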
> > diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> > index a2d8964..d2c758b 100644
> > --- a/include/hw/ppc/spapr.h
> > +++ b/include/hw/ppc/spapr.h
> > @@ -31,6 +31,13 @@ typedef struct sPAPRMachineState sPAPRMachineState;
> >  #define SPAPR_MACHINE_CLASS(klass) \
> >      OBJECT_CLASS_CHECK(sPAPRMachineClass, klass, TYPE_SPAPR_MACHINE)
> >  
> > +typedef enum {
> > +    SPAPR_RESIZE_HPT_DEFAULT = 0,
> > +    SPAPR_RESIZE_HPT_DISABLED,
> > +    SPAPR_RESIZE_HPT_ENABLED,
> > +    SPAPR_RESIZE_HPT_REQUIRED,
> > +} sPAPRResizeHPT;
> > +
> >  /**
> >   * sPAPRMachineClass:
> >   */
> > @@ -46,6 +53,7 @@ struct sPAPRMachineClass {
> >                            uint64_t *buid, hwaddr *pio, 
> >                            hwaddr *mmio32, hwaddr *mmio64,
> >                            unsigned n_dma, uint32_t *liobns, Error **errp);
> > +    sPAPRResizeHPT resize_hpt_default;
> >  };
> >  
> >  /**
> > @@ -61,6 +69,7 @@ struct sPAPRMachineState {
> >      XICSState *xics;
> >      DeviceState *rtc;
> >  
> > +    sPAPRResizeHPT resize_hpt;
> >      void *htab;
> >      uint32_t htab_shift;
> >      hwaddr rma_size;
> > @@ -347,6 +356,8 @@ struct sPAPRMachineState {
> >  #define H_XIRR_X                0x2FC
> >  #define H_RANDOM                0x300
> >  #define H_SET_MODE              0x31C
> > +#define H_RESIZE_HPT_PREPARE    0x36C
> > +#define H_RESIZE_HPT_COMMIT     0x370
> >  #define H_SIGNAL_SYS_RESET      0x380
> >  #define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
> >  
> > diff --git a/target-ppc/kvm.c b/target-ppc/kvm.c
> > index 15e12f3..39e5753 100644
> > --- a/target-ppc/kvm.c
> > +++ b/target-ppc/kvm.c
> > @@ -22,6 +22,7 @@
> >  #include <linux/kvm.h>
> >  
> >  #include "qemu-common.h"
> > +#include "qapi/error.h"
> >  #include "qemu/error-report.h"
> >  #include "cpu.h"
> >  #include "qemu/timer.h"
> > @@ -2672,3 +2673,27 @@ int kvmppc_enable_hwrng(void)
> >  
> >      return kvmppc_enable_hcall(kvm_state, H_RANDOM);
> >  }
> > +
> > +void kvmppc_check_papr_resize_hpt(Error **errp)
> > +{
> > +    if (!kvm_enabled()) {
> > +        return;
> > +    }
> > +
> > +    /* TODO: Check specific capabilities for HPT resize aware host kernels */
> > +
> > +    /*
> > +     * It's tempting to try to check directly if the HPT is under our
> > +     * control or KVM's, which is what's really relevant here.
> > +     * Unfortunately, in order to correctly size the HPT, we need to
> > +     * know if we can do resizing, _before_ we attempt to allocate it
> > +     * with KVM.  Before that point, we don't officially know whether
> > +     * we'll control the HPT or not.  So we have to use a fallback
> > +     * test for PR vs HV KVM to predict that.
> > +     */
> > +    if (kvmppc_is_pr(kvm_state)) {
> > +        return;
> > +    }
> > +
> > +    error_setg(errp, "Hash page table resizing not available with this KVM version");
> > +}
> > diff --git a/target-ppc/kvm_ppc.h b/target-ppc/kvm_ppc.h
> > index 841a29b..3e852ba 100644
> > --- a/target-ppc/kvm_ppc.h
> > +++ b/target-ppc/kvm_ppc.h
> > @@ -59,6 +59,7 @@ bool kvmppc_has_cap_htm(void);
> >  int kvmppc_enable_hwrng(void);
> >  int kvmppc_put_books_sregs(PowerPCCPU *cpu);
> >  PowerPCCPUClass *kvm_ppc_get_host_cpu_class(void);
> > +void kvmppc_check_papr_resize_hpt(Error **errp);
> >  
> >  #else
> >  
> > @@ -270,6 +271,10 @@ static inline PowerPCCPUClass *kvm_ppc_get_host_cpu_class(void)
> >      return NULL;
> >  }
> >  
> > +static inline void kvmppc_check_papr_resize_hpt(Error **errp)
> > +{
> > +    return;
> > +}
> >  #endif
> >  
> >  #ifndef CONFIG_KVM
> Since you're adding a new machine option, it would be nice if it were
> documented in the help message.

Um.. which help message do you mean exactly?  I've attached a property
description, so we get something in 'qemu-system-ppc64 -M pseries,?'.
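
For reference, a quick sketch of how the property looks from the user side
(the accepted values and the error string come from spapr_set_resize_hpt
above; how the error gets prefixed on the command line is not shown here):

    qemu-system-ppc64 -machine pseries,resize-hpt=required ...
    # an unrecognised value is rejected with:
    #   Bad value for "resize-hpt" property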

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 20+ messages in thread

Thread overview: 20+ messages
2016-12-12  4:05 [Qemu-devel] [PATCHv3 0/5] Hash Page Table resizing for TCG pseries guests David Gibson
2016-12-12  4:05 ` [Qemu-devel] [PATCHv3 1/5] pseries: Add pseries-2.9 machine type David Gibson
2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 2/5] pseries: Stubs for HPT resizing David Gibson
2016-12-13 14:29   ` Laurent Vivier
2016-12-14  6:12     ` David Gibson
2016-12-14  5:22   ` Suraj Jitindar Singh
2016-12-15  1:07     ` David Gibson
2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 3/5] pseries: Implement " David Gibson
2016-12-14  5:30   ` Suraj Jitindar Singh
2016-12-14  6:20     ` David Gibson
2016-12-14 23:20       ` Suraj Jitindar Singh
2016-12-15  0:59     ` David Gibson
2016-12-14  5:35   ` Suraj Jitindar Singh
2016-12-15  1:03     ` David Gibson
2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 4/5] pseries: Enable HPT resizing for 2.9 David Gibson
2016-12-14  5:32   ` Suraj Jitindar Singh
2016-12-14  6:20     ` David Gibson
2016-12-14 23:08       ` Suraj Jitindar Singh
2016-12-12  4:06 ` [Qemu-devel] [PATCHv3 5/5] pseries: Use smaller default hash page tables when guest can resize David Gibson
2016-12-14  5:56   ` Suraj Jitindar Singh
