* [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9)
@ 2017-12-09  8:43 Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers Cédric Le Goater
                   ` (18 more replies)
  0 siblings, 19 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Hello,

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the XICS legacy model, as found on POWER8,
or in XIVE exploitation mode, the newer POWER9 interrupt model. XIVE
is a complex interrupt controller introducing a large number of new
features, for virtualization in particular.

It is composed of three sub-engines :

  - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
    in the main controller for the IPIs and in the PSI host
    bridge. They are configured to feed the IVRE with events.

  - Interrupt Virtualization Routing Engine (IVRE). Its job is to
    match an event source with a Notification Virtualization Target
    (NVT), a priority and an Event Queue (EQ) to determine if a
    Virtual Processor can handle the event.

  - Interrupt Virtualization Presentation Engine (IVPE). It maintains
    the interrupt state of each hardware thread and presents the
    notification as an external exception.

Each of the engines uses a set of internal tables to redirect
exceptions from event sources to CPU threads. Interrupt sources have a
2-bit state machine, the Event State Buffer (ESB), that allows events
to be triggered. If the event is let through, the IVRE looks up the
Event Queue Descriptor configured for the source in the Interrupt
Virtualization Entry (IVE) table. Each Event Queue Descriptor defines
a notification path to a CPU and an in-memory queue in which an event
identifier is recorded for the OS to pull.

The high level ideas of the current design are :

 - introduce a persistent XIVE object under the sPAPR machine for
   newer machines and let the CAS negotiation process decide whether
   it should be used or not. Use the 'ov5_cas' attribute for this
   purpose.

 - introduce a persistent XIVE interrupt presenter under the sPAPR
   core and switch ICP after CAS. Each core now has two ICPs, one
   active through the 'intc' pointer and another one among its
   children ready to be used if the guest requires it.

 - move the XIVE EQs under the cores to simplify the XIVE model

 - allocate the CPU IPIs at the beginning of the IRQ number space to
   be compatible with XICS (which starts at 4096) and also to simplify
   the model. This means that the XIVE model covers the whole IRQ
   number space. There is no offset splitting the IRQ number space,
   as there is in XICS.

The patchset first introduces new models for XIVE :

 - sPAPRXive holding the internal tables and the MMIO regions used by
   the XIVE controller.
   
 - sPAPRXiveNVT object storing the interrupt state of the CPU and
   acting as the XIVE interrupt presenter

then, describes the notification process and the interrupt delivery to
the CPU.

It finishes with the integration of the sPAPRXive object under the
sPAPR machine, the introduction of the new XIVE hcalls, the device
tree layout, and the necessary adjustments to support the CAS
negotiation.

Migration, CPU hotplug, and support for older machines and QEMU
versions are addressed. KVM support is not addressed yet and the guest
needs to be run with kernel_irqchip=off on a POWER9 system.

Code is here:

  https://github.com/legoater/qemu/commits/xive
   
Thanks,

C.

 Changes since v1 :

 - used g_new0 instead of g_malloc0
 - removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC 
 - introduced a device reset handler. The object needs to be parented
   to sysbus when created.
 - renamed spapr_xive_irq_set() to spapr_xive_irq_enable()
 - renamed spapr_xive_irq_unset() to spapr_xive_irq_disable()
 - moved the PPC_BIT macros under target/ppc/cpu.h
 - shrank the file copyright header
 - reworked the event notification logic of the qemu_irq handlers.  
 - introduced XIVE_ESB_STORE_EOI support
 - removed 'esb_shift' field 
 - removed a useless check on the validity of the IVE in the memory
   region handlers.
 - removed the overall ESB memory region. We now have only one region
   for the provisioned sources.
 - improved 'info pic' output
 - improved LSI support
 - renamed 'sPAPRXiveICP' to 'sPAPRXiveNVT'
 - renamed 'tima' field to 'regs' 
 - renamed 'tima_os' field to 'ring_os'
 - removed 'tm_shift' field
 - introduced a memory region to model the User TIMA and another one
   for the OS TIMA. One page size for each.
 - removed useless checks in the memory region handlers
 - removed support for 970 ...
 - removed spapr_xive_eq_for_server() which did the EQ indexing.
 - changed spapr_xive_get_eq() to use a server and a priority parameter
 - introduced a couple of macros for the EQ indexing.
 - replaced dma_memory_write() by stl_be_dma()
 - set initial TM_PIPR to 0xFF in sPAPRXiveNVT
 - made the creation of the sPAPRXive object conditional on the
   xive_exploitation bool, which is false on older pseries machines.
 - parented the sPAPRXive object to sysbus.
 - simplified priority_is_valid() routine (to its minimum)
 - used PPC_BIT() macros to define the hcall flags
 - removed useless casts
 - defined the default characteristic of the single XIVE interrupt
   source to be : *XIVE_SRC_TRIGGER | XIVE_SRC_STORE_EOI*
 - removed EQ_W0_UCOND_NOTIFY when the EQ is reset
 - fixed XIVE_EQ_DEBUG support. Offset for the generation bit was wrong
 - added a unit id to the nodename
 - added properties for the LSIs
 - simplified the array for the "ibm,plat-res-int-priorities"  property
 - renamed spapr_xive_populate() to spapr_dt_xive()
 - moved the mapping of the XIVE memory region and the setting
   of the ICP under the machine reset handler.
 - introduced a spapr_xive_qirq() helper
 - introduced a spapr_xive_nvt_create() helper
 - handled more errors in spapr_post_load() to return EINVAL


Cédric Le Goater (19):
  dma-helpers: add a return value to store helpers
  spapr: introduce a skeleton for the XIVE interrupt controller
  spapr: introduce the XIVE interrupt sources
  spapr: add support for the LSI interrupt sources
  spapr: introduce a XIVE interrupt presenter model
  spapr: introduce the XIVE Event Queues
  spapr: push the XIVE EQ data in OS event queue
  spapr: notify the CPU when the XIVE interrupt priority is more
    privileged
  spapr: add support for the SET_OS_PENDING command (XIVE)
  spapr: introduce a 'xive_exploitation' boolean to enable XIVE
  spapr: add a sPAPRXive object to the machine
  spapr: add hcalls support for the XIVE exploitation interrupt mode
  spapr: add device tree support for the XIVE interrupt mode
  spapr: introduce a helper to map the XIVE memory regions
  spapr: add XIVE support to spapr_qirq()
  spapr: introduce a spapr_icp_create() helper
  spapr: toggle the ICP depending on the selected interrupt mode
  spapr: add support to dump XIVE information
  spapr: advertise XIVE exploitation mode in CAS

 default-configs/ppc64-softmmu.mak |    1 +
 hw/intc/Makefile.objs             |    1 +
 hw/intc/spapr_xive.c              | 1013 +++++++++++++++++++++++++++++++++++++
 hw/intc/spapr_xive_hcall.c        |  923 +++++++++++++++++++++++++++++++++
 hw/intc/xive-internal.h           |  196 +++++++
 hw/ppc/spapr.c                    |  188 ++++++-
 hw/ppc/spapr_cpu_core.c           |   37 +-
 hw/ppc/spapr_hcall.c              |    6 +
 include/hw/ppc/spapr.h            |   20 +-
 include/hw/ppc/spapr_cpu_core.h   |    1 +
 include/hw/ppc/spapr_xive.h       |   72 +++
 include/sysemu/dma.h              |    4 +-
 12 files changed, 2449 insertions(+), 13 deletions(-)
 create mode 100644 hw/intc/spapr_xive.c
 create mode 100644 hw/intc/spapr_xive_hcall.c
 create mode 100644 hw/intc/xive-internal.h
 create mode 100644 include/hw/ppc/spapr_xive.h

-- 
2.13.6


* [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-19  4:46   ` David Gibson
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller Cédric Le Goater
                   ` (17 subsequent siblings)
  18 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/sysemu/dma.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/sysemu/dma.h b/include/sysemu/dma.h
index c228c6651360..74a9558af39c 100644
--- a/include/sysemu/dma.h
+++ b/include/sysemu/dma.h
@@ -153,12 +153,12 @@ static inline void dma_memory_unmap(AddressSpace *as,
         dma_memory_read(as, addr, &val, (_bits) / 8);                   \
         return _end##_bits##_to_cpu(val);                               \
     }                                                                   \
-    static inline void st##_sname##_##_end##_dma(AddressSpace *as,      \
+    static inline int st##_sname##_##_end##_dma(AddressSpace *as,      \
                                                  dma_addr_t addr,       \
                                                  uint##_bits##_t val)   \
     {                                                                   \
         val = cpu_to_##_end##_bits(val);                                \
-        dma_memory_write(as, addr, &val, (_bits) / 8);                  \
+        return dma_memory_write(as, addr, &val, (_bits) / 8);           \
     }
 
 static inline uint8_t ldub_dma(AddressSpace *as, dma_addr_t addr)
-- 
2.13.6


* [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09 14:06   ` Cédric Le Goater
  2017-12-20  5:09   ` David Gibson
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources Cédric Le Goater
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

With the POWER9 processor comes a new interrupt controller called
XIVE. It is composed of three sub-engines :

  - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
    in the main controller for the IPIs and in the PSI host
    bridge. They are configured to feed the IVRE with events.

  - Interrupt Virtualization Routing Engine (IVRE). Its job is to
    match an event source with a Notification Virtualization Target
    (NVT), a priority and an Event Queue (EQ) to determine if a
    Virtual Processor can handle the event.

  - Interrupt Virtualization Presentation Engine (IVPE). It maintains
    the interrupt state of each hardware thread and presents the
    notification as an external exception.

Each of the engines uses a set of internal tables to redirect
exceptions from event sources to CPU threads. The first table we
introduce is the Interrupt Virtualization Entry (IVE) table, part of
the virtualization engine in charge of routing events. It associates
event sources (IRQ numbers) with event queues, which will forward, or
not, the event notification to the presentation controller.

The XIVE model is designed to make use of the full range of the IRQ
number space and does not use an offset like the XICS mode does.
Hence, the IVE table is directly indexed by the IRQ number.
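As an illustration of the direct indexing described above, here is a
minimal standalone sketch, not the code of this patch. It assumes the
IVE layout given in xive-internal.h (valid bit 0, EQ index bits 8-31,
masked bit 32, EQ data bits 33-63, in IBM bit numbering). The names
DemoIVE, demo_get_ive(), field_get() and field_set() are hypothetical
stand-ins for XiveIVE, spapr_xive_get_ive(), GETFIELD() and SETFIELD(),
and the PPC_BIT macros are redefined locally so the example builds
outside QEMU with GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word. */
#define PPC_BIT(bit)        (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

#define IVE_VALID     PPC_BIT(0)
#define IVE_EQ_INDEX  PPC_BITMASK(8, 31)   /* Destination EQ index */
#define IVE_MASKED    PPC_BIT(32)
#define IVE_EQ_DATA   PPC_BITMASK(33, 63)  /* Data written to the EQ */

/* Hypothetical stand-in for XiveIVE: one 64-bit word per source. */
typedef struct {
    uint64_t w;
} DemoIVE;

/* Shift needed to right-align a field: position of the lowest set bit. */
static inline int mask_to_lsh(uint64_t mask)
{
    return __builtin_ffsll((long long)mask) - 1;
}

/* Extract the field selected by 'mask' from 'word'. */
static inline uint64_t field_get(uint64_t mask, uint64_t word)
{
    return (word & mask) >> mask_to_lsh(mask);
}

/* Return 'word' with the field selected by 'mask' replaced by 'val'. */
static inline uint64_t field_set(uint64_t mask, uint64_t word, uint64_t val)
{
    return (word & ~mask) | ((val << mask_to_lsh(mask)) & mask);
}

/* The table is indexed by the IRQ number itself, with no offset. */
static inline DemoIVE *demo_get_ive(DemoIVE *ivt, uint32_t nr_irqs,
                                    uint32_t lisn)
{
    return lisn < nr_irqs ? &ivt[lisn] : NULL;
}
```

An out-of-range IRQ number yields NULL rather than an entry, which is
how the enable and disable helpers in the patch detect a bad LISN.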

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1 :

 - used g_new0 instead of g_malloc0
 - removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC 
 - introduced a device reset handler. The object needs to be parented
   to sysbus when created.
 - renamed spapr_xive_irq_set to spapr_xive_irq_enable
 - renamed spapr_xive_irq_unset to spapr_xive_irq_disable
 - moved the PPC_BIT macros under target/ppc/cpu.h
 - shrank the file copyright header

 default-configs/ppc64-softmmu.mak |   1 +
 hw/intc/Makefile.objs             |   1 +
 hw/intc/spapr_xive.c              | 156 ++++++++++++++++++++++++++++++++++++++
 hw/intc/xive-internal.h           |  41 ++++++++++
 include/hw/ppc/spapr_xive.h       |  35 +++++++++
 5 files changed, 234 insertions(+)
 create mode 100644 hw/intc/spapr_xive.c
 create mode 100644 hw/intc/xive-internal.h
 create mode 100644 include/hw/ppc/spapr_xive.h

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index d1b3a6dd50f8..4a7f6a0696de 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -56,6 +56,7 @@ CONFIG_SM501=y
 CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
+CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
 # For PReP
 CONFIG_SERIAL_ISA=y
 CONFIG_MC146818RTC=y
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index ae358569a155..49e13e7aeeee 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
 obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
new file mode 100644
index 000000000000..e6e8841add17
--- /dev/null
+++ b/hw/intc/spapr_xive.c
@@ -0,0 +1,156 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/dma.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/spapr_xive.h"
+
+#include "xive-internal.h"
+
+/*
+ * Main XIVE object
+ */
+
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
+{
+    int i;
+
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+
+        if (!(ive->w & IVE_VALID)) {
+            continue;
+        }
+
+        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
+                       ive->w & IVE_MASKED ? "M" : " ",
+                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
+                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
+    }
+}
+
+static void spapr_xive_reset(DeviceState *dev)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+    int i;
+
+    /* Mask all valid IVEs in the IRQ number space. */
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+        if (ive->w & IVE_VALID) {
+            ive->w |= IVE_MASKED;
+        }
+    }
+}
+
+static void spapr_xive_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+
+    if (!xive->nr_irqs) {
+        error_setg(errp, "Number of interrupt needs to be greater 0");
+        return;
+    }
+
+    /* Allocate the IVT (Interrupt Virtualization Table) */
+    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
+}
+
+static const VMStateDescription vmstate_spapr_xive_ive = {
+    .name = TYPE_SPAPR_XIVE "/ive",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(w, XiveIVE),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static bool vmstate_spapr_xive_needed(void *opaque)
+{
+    /* TODO check machine XIVE support */
+    return true;
+}
+
+static const VMStateDescription vmstate_spapr_xive = {
+    .name = TYPE_SPAPR_XIVE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = vmstate_spapr_xive_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
+        VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
+                                     vmstate_spapr_xive_ive, XiveIVE),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static Property spapr_xive_properties[] = {
+    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_xive_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = spapr_xive_realize;
+    dc->reset = spapr_xive_reset;
+    dc->props = spapr_xive_properties;
+    dc->desc = "sPAPR XIVE interrupt controller";
+    dc->vmsd = &vmstate_spapr_xive;
+}
+
+static const TypeInfo spapr_xive_info = {
+    .name = TYPE_SPAPR_XIVE,
+    .parent = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(sPAPRXive),
+    .class_init = spapr_xive_class_init,
+};
+
+static void spapr_xive_register_types(void)
+{
+    type_register_static(&spapr_xive_info);
+}
+
+type_init(spapr_xive_register_types)
+
+XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
+{
+    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
+}
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
+{
+    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
+
+    if (!ive) {
+        return false;
+    }
+
+    ive->w |= IVE_VALID;
+    return true;
+}
+
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
+{
+    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
+
+    if (!ive) {
+        return false;
+    }
+
+    ive->w &= ~IVE_VALID;
+    return true;
+}
diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
new file mode 100644
index 000000000000..132b71a6daf0
--- /dev/null
+++ b/hw/intc/xive-internal.h
@@ -0,0 +1,41 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2016-2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef _INTC_XIVE_INTERNAL_H
+#define _INTC_XIVE_INTERNAL_H
+
+/* Utilities to manipulate these (originaly from OPAL) */
+#define MASK_TO_LSH(m)          (__builtin_ffsl(m) - 1)
+#define GETFIELD(m, v)          (((v) & (m)) >> MASK_TO_LSH(m))
+#define SETFIELD(m, v, val)                             \
+        (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
+
+/* IVE/EAS
+ *
+ * One per interrupt source. Targets that interrupt to a given EQ
+ * and provides the corresponding logical interrupt number (EQ data)
+ *
+ * We also map this structure to the escalation descriptor inside
+ * an EQ, though in that case the valid and masked bits are not used.
+ */
+typedef struct XiveIVE {
+        /* Use a single 64-bit definition to make it easier to
+         * perform atomic updates
+         */
+        uint64_t        w;
+#define IVE_VALID       PPC_BIT(0)
+#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
+#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
+#define IVE_MASKED      PPC_BIT(32)              /* Masked */
+#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
+} XiveIVE;
+
+XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
+
+#endif /* _INTC_XIVE_INTERNAL_H */
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
new file mode 100644
index 000000000000..5b1f78e06a1e
--- /dev/null
+++ b/include/hw/ppc/spapr_xive.h
@@ -0,0 +1,35 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_SPAPR_XIVE_H
+#define PPC_SPAPR_XIVE_H
+
+#include <hw/sysbus.h>
+
+typedef struct sPAPRXive sPAPRXive;
+typedef struct XiveIVE XiveIVE;
+
+#define TYPE_SPAPR_XIVE "spapr-xive"
+#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
+
+struct sPAPRXive {
+    SysBusDevice parent;
+
+    /* Properties */
+    uint32_t     nr_irqs;
+
+    /* XIVE internal tables */
+    XiveIVE      *ivt;
+};
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+
+#endif /* PPC_SPAPR_XIVE_H */
-- 
2.13.6


* [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-14 15:24   ` Cédric Le Goater
  2017-12-20  5:22   ` David Gibson
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 04/19] spapr: add support for the LSI " Cédric Le Goater
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Each XIVE interrupt source is associated with a two-bit state machine
called an Event State Buffer (ESB): the first bit, "P", means that an
interrupt is "pending" and waiting for an EOI, and the second bit, "Q"
(queued), means a new interrupt was triggered while another was still
pending.

When an event is triggered, the associated interrupt state bits are
fetched and updated and, if the resulting state allows it, the event
is forwarded to the virtualization engine of the controller, which
does the routing. The state bits can also be controlled by MMIO, to
trigger events or turn off the sources for instance. See the code for
more details on the states and transitions.
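The transitions can be summarized with the following standalone
sketch. It mirrors the switch statements of spapr_xive_pq_trigger()
and spapr_xive_pq_eoi() in this patch, but operates on a single state
value instead of the packed SBE array. The DEMO_ESB_* names and the
two demo_esb_*() functions are hypothetical and exist only for this
illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings: P is bit 1 (0x2), Q is bit 0 (0x1). */
enum {
    DEMO_ESB_RESET   = 0x0,  /* P=0 Q=0: idle, next trigger is forwarded */
    DEMO_ESB_OFF     = 0x1,  /* P=0 Q=1: source turned off */
    DEMO_ESB_PENDING = 0x2,  /* P=1 Q=0: sent, waiting for an EOI */
    DEMO_ESB_QUEUED  = 0x3,  /* P=1 Q=1: triggered again while pending */
};

/* Returns true when the event must be forwarded for routing. */
static bool demo_esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case DEMO_ESB_RESET:
        *pq = DEMO_ESB_PENDING;
        return true;
    case DEMO_ESB_PENDING:
        *pq = DEMO_ESB_QUEUED;   /* coalesce: remember it, do not forward */
        return false;
    default:                     /* QUEUED and OFF are left unchanged */
        return false;
    }
}

/* Returns true when the Q bit was set, i.e. a new notification is due. */
static bool demo_esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case DEMO_ESB_PENDING:
        *pq = DEMO_ESB_RESET;
        return false;
    case DEMO_ESB_QUEUED:
        *pq = DEMO_ESB_PENDING;  /* replay the coalesced event */
        return true;
    default:                     /* RESET and OFF are left unchanged */
        return false;
    }
}
```

Starting from the reset state, two triggers followed by two EOIs walk
through PENDING, QUEUED, PENDING and back to RESET, and only the first
trigger and the first EOI ask for a notification.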

The MMIO space for the ESBs spans 512GB on the bare-metal system
(PowerNV) and the BAR depends on the chip id. In our model for the
sPAPR machine, we choose to only map the sub-region for the
provisioned IRQ numbers and to use the mapping address of chip 0 of a
real system.
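To make the layout of that sub-region concrete, the sketch below shows
how an offset into it maps to a source number and an ESB command. It
assumes one 64k page per source (an ESB shift of 16) and the 0xF00
command mask used by the MMIO handlers of this patch. DemoEsbAccess,
demo_esb_decode() and demo_esb_region_size() are hypothetical names
used only for this example.

```c
#include <stdint.h>

#define DEMO_ESB_SHIFT 16     /* one 64k page per interrupt source */
#define DEMO_ESB_CMD   0xF00  /* bits selecting the ESB command */

typedef struct {
    uint32_t lisn;    /* source number: index of the page in the region */
    uint32_t offset;  /* "magic" offset within the page (0x800 = GET, ...) */
} DemoEsbAccess;

/* Split an offset into the ESB region into source number and command. */
static DemoEsbAccess demo_esb_decode(uint64_t addr)
{
    DemoEsbAccess a = {
        .lisn   = (uint32_t)(addr >> DEMO_ESB_SHIFT),
        .offset = (uint32_t)(addr & DEMO_ESB_CMD),
    };
    return a;
}

/* Size of the region covering only the provisioned IRQ numbers. */
static uint64_t demo_esb_region_size(uint32_t nr_irqs)
{
    return (1ull << DEMO_ESB_SHIFT) * nr_irqs;
}
```

With 4096 provisioned sources, for example, the region is 256MB, a
small fraction of the 512GB window of the real hardware.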

In the real world, each source may have different characteristics
depending on the revision of a controller or the CPU. Early systems
had two different MMIO pages for trigger and for EOI. We choose to use
the same characteristics for all sources to simplify the model. The
minimum CPU level for XIVE exploitation mode will be DD2.X as it has
full support.

The OS will obtain the address of the MMIO page of the ESB entry
associated with a source, and the characteristics of that source,
using the H_INT_GET_SOURCE_INFO hcall. This will be addressed in the
patch introducing the hcalls.

The spapr_xive_irq() routine, in charge of triggering the CPU
interrupt line, will be filled in later on.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - merged in the same patch the qemu_irq handlers
 - reworked the event notification logic of the qemu_irq handlers.  
 - introduced XIVE_ESB_STORE_EOI support
 - removed 'esb_shift' field 
 - removed a useless check on the validity of the IVE in the memory
   region handlers.
 - fixed spapr_xive_pq_trigger() to return true when XIVE_ESB_QUEUED
   is set
 - removed the overall ESB memory region. We now have only one region
   for the provisioned sources.
 - improved 'info pic' output

 hw/intc/spapr_xive.c        | 254 +++++++++++++++++++++++++++++++++++++++++++-
 hw/intc/xive-internal.h     |  10 ++
 include/hw/ppc/spapr_xive.h |   9 ++
 3 files changed, 271 insertions(+), 2 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index e6e8841add17..43df6814619d 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -18,23 +18,252 @@
 
 #include "xive-internal.h"
 
+static void spapr_xive_irq(sPAPRXive *xive, int lisn)
+{
+
+}
+
 /*
- * Main XIVE object
+ * XIVE Interrupt Source
+ */
+
+/*
+ * "magic" Event State Buffer (ESB) MMIO offsets.
+ *
+ * Each interrupt source has a 2-bit state machine called ESB
+ * which can be controlled by MMIO. It's made of 2 bits, P and
+ * Q. P indicates that an interrupt is pending (has been sent
+ * to a queue and is waiting for an EOI). Q indicates that the
+ * interrupt has been triggered while pending.
+ *
+ * This acts as a coalescing mechanism in order to guarantee
+ * that a given interrupt only occurs at most once in a queue.
+ *
+ * When doing an EOI, the Q bit will indicate if the interrupt
+ * needs to be re-triggered.
+ *
+ * The following offsets into the ESB MMIO allow to read or
+ * manipulate the PQ bits. They must be used with an 8-bytes
+ * load instruction. They all return the previous state of the
+ * interrupt (atomically).
+ *
+ * Additionally, some ESB pages support doing an EOI via a
+ * store at 0 and some ESBs support doing a trigger via a
+ * separate trigger page.
+ */
+#define XIVE_ESB_STORE_EOI      0x400 /* Store */
+#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
+#define XIVE_ESB_GET            0x800 /* Load */
+#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
+#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
+#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
+#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
+
+#define XIVE_ESB_VAL_P          0x2
+#define XIVE_ESB_VAL_Q          0x1
+
+#define XIVE_ESB_RESET          0x0
+#define XIVE_ESB_PENDING        XIVE_ESB_VAL_P
+#define XIVE_ESB_QUEUED         (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
+#define XIVE_ESB_OFF            XIVE_ESB_VAL_Q
+
+static uint8_t spapr_xive_pq_get(sPAPRXive *xive, uint32_t lisn)
+{
+    uint32_t byte = lisn / 4;
+    uint32_t bit  = (lisn % 4) * 2;
+
+    assert(byte < xive->sbe_size);
+
+    return (xive->sbe[byte] >> bit) & 0x3;
+}
+
+static uint8_t spapr_xive_pq_set(sPAPRXive *xive, uint32_t lisn, uint8_t pq)
+{
+    uint32_t byte = lisn / 4;
+    uint32_t bit  = (lisn % 4) * 2;
+    uint8_t old, new;
+
+    assert(byte < xive->sbe_size);
+
+    old = xive->sbe[byte];
+
+    new = xive->sbe[byte] & ~(0x3 << bit);
+    new |= (pq & 0x3) << bit;
+
+    xive->sbe[byte] = new;
+
+    return (old >> bit) & 0x3;
+}
+
+static bool spapr_xive_pq_eoi(sPAPRXive *xive, uint32_t lisn)
+{
+    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_RESET);
+        return false;
+    case XIVE_ESB_PENDING:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_RESET);
+        return false;
+    case XIVE_ESB_QUEUED:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
+        return true;
+    case XIVE_ESB_OFF:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_OFF);
+        return false;
+    default:
+         g_assert_not_reached();
+    }
+}
+
+/*
+ * Returns whether the event notification should be forwarded to the
+ * IVE for routing.
  */
+static bool spapr_xive_pq_trigger(sPAPRXive *xive, uint32_t lisn)
+{
+    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
 
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
+        return true;
+    case XIVE_ESB_PENDING:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_QUEUED);
+        return false;
+    case XIVE_ESB_QUEUED:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_QUEUED);
+        return false;
+    case XIVE_ESB_OFF:
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_OFF);
+        return false;
+    default:
+         g_assert_not_reached();
+    }
+}
+
+/*
+ * XIVE Interrupt Source MMIOs
+ */
+
+/*
+ * Some HW use a separate page for trigger. We only support the case
+ * in which the trigger can be done in the same page as the EOI.
+ */
+static uint64_t spapr_xive_esb_read(void *opaque, hwaddr addr, unsigned size)
+{
+    sPAPRXive *xive = SPAPR_XIVE(opaque);
+    uint32_t offset = addr & 0xF00;
+    uint32_t lisn = addr >> ESB_SHIFT;
+    uint64_t ret = -1;
+
+    switch (offset) {
+    case XIVE_ESB_LOAD_EOI:
+        /*
+         * EOI on load is not used anymore as we now advertise
+         * XIVE_ESB_STORE_EOI support for the interrupt sources
+         */
+        ret = spapr_xive_pq_eoi(xive, lisn);
+        break;
+
+    case XIVE_ESB_GET:
+        ret = spapr_xive_pq_get(xive, lisn);
+        break;
+
+    case XIVE_ESB_SET_PQ_00:
+    case XIVE_ESB_SET_PQ_01:
+    case XIVE_ESB_SET_PQ_10:
+    case XIVE_ESB_SET_PQ_11:
+        ret = spapr_xive_pq_set(xive, lisn, (offset >> 8) & 0x3);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
+    }
+
+    return ret;
+}
+
+static void spapr_xive_esb_write(void *opaque, hwaddr addr,
+                                 uint64_t value, unsigned size)
+{
+    sPAPRXive *xive = SPAPR_XIVE(opaque);
+    uint32_t offset = addr & 0xF00;
+    uint32_t lisn = addr >> ESB_SHIFT;
+    bool notify = false;
+
+    switch (offset) {
+    case 0:
+        notify = spapr_xive_pq_trigger(xive, lisn);
+        break;
+    case XIVE_ESB_STORE_EOI:
+        /* If the Q bit is set, we should forward a new source event
+         * notification
+         */
+        notify = spapr_xive_pq_eoi(xive, lisn);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
+                      offset);
+        return;
+    }
+
+    /* Forward the source event notification for routing */
+    if (notify) {
+        spapr_xive_irq(xive, lisn);
+    }
+}
+
+static const MemoryRegionOps spapr_xive_esb_ops = {
+    .read = spapr_xive_esb_read,
+    .write = spapr_xive_esb_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+static void spapr_xive_source_set_irq(void *opaque, int lisn, int val)
+{
+    sPAPRXive *xive = SPAPR_XIVE(opaque);
+    bool notify = false;
+
+    if (val) {
+        notify = spapr_xive_pq_trigger(xive, lisn);
+    }
+
+    /* Forward the source event notification for routing */
+    if (notify) {
+        spapr_xive_irq(xive, lisn);
+    }
+}
+
+/*
+ * Main XIVE object
+ */
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 {
     int i;
 
     for (i = 0; i < xive->nr_irqs; i++) {
         XiveIVE *ive = &xive->ivt[i];
+        uint8_t pq;
 
         if (!(ive->w & IVE_VALID)) {
             continue;
         }
 
-        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
+        pq = spapr_xive_pq_get(xive, i);
+
+        monitor_printf(mon, "  %4x %s %c%c %08x %08x\n", i,
                        ive->w & IVE_MASKED ? "M" : " ",
+                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
+                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
                        (int) GETFIELD(IVE_EQ_INDEX, ive->w),
                        (int) GETFIELD(IVE_EQ_DATA, ive->w));
     }
@@ -52,6 +281,9 @@ static void spapr_xive_reset(DeviceState *dev)
             ive->w |= IVE_MASKED;
         }
     }
+
+    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
+    memset(xive->sbe, 0x55, xive->sbe_size);
 }
 
 static void spapr_xive_realize(DeviceState *dev, Error **errp)
@@ -65,6 +297,23 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
 
     /* Allocate the IVT (Interrupt Virtualization Table) */
     xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
+
+    /* QEMU IRQs */
+    xive->qirqs = qemu_allocate_irqs(spapr_xive_source_set_irq, xive,
+                                     xive->nr_irqs);
+
+    /* Allocate SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
+    xive->sbe_size = DIV_ROUND_UP(xive->nr_irqs, 4);
+    xive->sbe = g_malloc0(xive->sbe_size);
+
+    /* VC BAR. Use address of chip 0 to install the ESB memory region
+     * for *all* interrupt sources */
+    xive->esb_base = (P9_MMIO_BASE | VC_BAR_DEFAULT);
+
+    memory_region_init_io(&xive->esb_iomem, OBJECT(xive),
+                          &spapr_xive_esb_ops, xive, "xive.esb",
+                          (1ull << ESB_SHIFT) * xive->nr_irqs);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->esb_iomem);
 }
 
 static const VMStateDescription vmstate_spapr_xive_ive = {
@@ -92,6 +341,7 @@ static const VMStateDescription vmstate_spapr_xive = {
         VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
         VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
                                      vmstate_spapr_xive_ive, XiveIVE),
+        VMSTATE_VBUFFER_UINT32(sbe, sPAPRXive, 1, NULL, sbe_size),
         VMSTATE_END_OF_LIST()
     },
 };
diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
index 132b71a6daf0..872648dd96a2 100644
--- a/hw/intc/xive-internal.h
+++ b/hw/intc/xive-internal.h
@@ -16,6 +16,16 @@
 #define SETFIELD(m, v, val)                             \
         (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
 
+/*
+ * XIVE MMIO regions
+ */
+#define P9_MMIO_BASE     0x006000000000000ull
+
+/* VC BAR contains set translations for the ESBs and the EQs. */
+#define VC_BAR_DEFAULT   0x10000000000ull
+#define VC_BAR_SIZE      0x08000000000ull
+#define ESB_SHIFT        16 /* One 64k page. OPAL has two */
+
 /* IVE/EAS
  *
  * One per interrupt source. Targets that interrupt to a given EQ
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 5b1f78e06a1e..ecc15d889b74 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -24,8 +24,17 @@ struct sPAPRXive {
     /* Properties */
     uint32_t     nr_irqs;
 
+    /* IRQ */
+    qemu_irq     *qirqs;
+
     /* XIVE internal tables */
     XiveIVE      *ivt;
+    uint8_t      *sbe;
+    uint32_t     sbe_size;
+
+    /* ESB memory region */
+    hwaddr       esb_base;
+    MemoryRegion esb_iomem;
 };
 
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
-- 
2.13.6

* [Qemu-devel] [PATCH v2 04/19] spapr: add support for the LSI interrupt sources
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (2 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 05/19] spapr: introduce a XIVE interrupt presenter model Cédric Le Goater
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The 'sent' status of the LSI interrupt source is modeled with the 'P'
bit of the ESB and the assertion status of the source is maintained in
an array under the main sPAPRXive object. The type of the source is
stored in the same array for practical reasons.

The OS will use the H_INT_GET_SOURCE_INFO hcall to determine the type
of an interrupt source.
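
For illustration only, the level-sensitive behaviour described above can
be sketched outside of QEMU as follows. The LsiSource type, the helper
names and the ESB_* values are assumptions made for this sketch (the P
bit is taken to be the high bit of the 2-bit state), and the EOI path is
simplified compared to spapr_xive_pq_eoi():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed 2-bit ESB encodings: P is the high bit, Q the low bit */
enum { ESB_RESET = 0x0, ESB_OFF = 0x1, ESB_PENDING = 0x2, ESB_QUEUED = 0x3 };

#define STATUS_LSI       0x1
#define STATUS_ASSERTED  0x2

/* Hypothetical stand-in for one entry of the PQ and status arrays */
typedef struct {
    uint8_t pq;      /* 2-bit ESB state */
    uint8_t status;  /* STATUS_* flags */
} LsiSource;

/* Returns true when a notification should be forwarded for routing */
static bool lsi_trigger(LsiSource *s)
{
    assert(s->status & STATUS_LSI);

    if (s->pq == ESB_RESET && (s->status & STATUS_ASSERTED)) {
        s->pq = ESB_PENDING;    /* P set: the event has been 'sent' */
        return true;
    }
    return false;
}

/* The input line changed: record the level, then try to trigger */
static bool lsi_set_level(LsiSource *s, bool level)
{
    if (level) {
        s->status |= STATUS_ASSERTED;
    } else {
        s->status &= (uint8_t)~STATUS_ASSERTED;
    }
    return lsi_trigger(s);
}

/* Simplified EOI: clear P, then re-notify if the line is still high */
static bool lsi_eoi(LsiSource *s)
{
    if (s->pq == ESB_PENDING) {
        s->pq = ESB_RESET;
    }
    return lsi_trigger(s);
}
```

The point of the sketch is that an EOI on a source which is still
asserted immediately produces a new notification, which is what the
changes to the ESB read and write handlers below implement.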

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++----
 include/hw/ppc/spapr_xive.h | 10 ++++++++-
 2 files changed, 59 insertions(+), 5 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 43df6814619d..c772c726667f 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -144,6 +144,21 @@ static bool spapr_xive_pq_trigger(sPAPRXive *xive, uint32_t lisn)
 }
 
 /*
+ * LSI interrupt sources use the P bit and a custom assertion flag
+ */
+static bool spapr_xive_lsi_trigger(sPAPRXive *xive, uint32_t lisn)
+{
+    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
+
+    if (old_pq == XIVE_ESB_RESET &&
+        (xive->status[lisn] & XIVE_STATUS_ASSERTED)) {
+        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
+        return true;
+    }
+    return false;
+}
+
+/*
  * XIVE Interrupt Source MMIOs
  */
 
@@ -165,6 +180,14 @@ static uint64_t spapr_xive_esb_read(void *opaque, hwaddr addr, unsigned size)
          * XIVE_ESB_STORE_EOI support for the interrupt sources
          */
         ret = spapr_xive_pq_eoi(xive, lisn);
+
+        /* If the LSI source is still asserted, forward a new source
+         * event notification */
+        if (spapr_xive_irq_is_lsi(xive, lisn)) {
+            if (spapr_xive_lsi_trigger(xive, lisn)) {
+                spapr_xive_irq(xive, lisn);
+            }
+        }
         break;
 
     case XIVE_ESB_GET:
@@ -201,6 +224,14 @@ static void spapr_xive_esb_write(void *opaque, hwaddr addr,
          * notification
          */
         notify = spapr_xive_pq_eoi(xive, lisn);
+
+        /* LSI sources do not set the Q bit but they can still be
+         * asserted, in which case we should forward a new source
+         * event notification
+         */
+        if (spapr_xive_irq_is_lsi(xive, lisn)) {
+            notify = spapr_xive_lsi_trigger(xive, lisn);
+        }
         break;
     default:
         qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
@@ -233,8 +264,17 @@ static void spapr_xive_source_set_irq(void *opaque, int lisn, int val)
     sPAPRXive *xive = SPAPR_XIVE(opaque);
     bool notify = false;
 
-    if (val) {
-        notify = spapr_xive_pq_trigger(xive, lisn);
+    if (spapr_xive_irq_is_lsi(xive, lisn)) {
+        if (val) {
+            xive->status[lisn] |= XIVE_STATUS_ASSERTED;
+        } else {
+            xive->status[lisn] &= ~XIVE_STATUS_ASSERTED;
+        }
+        notify = spapr_xive_lsi_trigger(xive, lisn);
+    } else {
+        if (val) {
+            notify = spapr_xive_pq_trigger(xive, lisn);
+        }
     }
 
     /* Forward the source event notification for routing */
@@ -260,7 +300,8 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 
         pq = spapr_xive_pq_get(xive, i);
 
-        monitor_printf(mon, "  %4x %s %c%c %08x %08x\n", i,
+        monitor_printf(mon, "  %4x %s %s %c%c %08x %08x\n", i,
+                       spapr_xive_irq_is_lsi(xive, i) ? "LSI" : "MSI",
                        ive->w & IVE_MASKED ? "M" : " ",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
                        pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
@@ -274,6 +315,8 @@ static void spapr_xive_reset(DeviceState *dev)
     sPAPRXive *xive = SPAPR_XIVE(dev);
     int i;
 
+    /* Do not clear IRQ's status */
+
     /* Mask all valid IVEs in the IRQ number space. */
     for (i = 0; i < xive->nr_irqs; i++) {
         XiveIVE *ive = &xive->ivt[i];
@@ -301,6 +344,7 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
     /* QEMU IRQs */
     xive->qirqs = qemu_allocate_irqs(spapr_xive_source_set_irq, xive,
                                      xive->nr_irqs);
+    xive->status = g_malloc0(xive->nr_irqs);
 
     /* Allocate SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
     xive->sbe_size = DIV_ROUND_UP(xive->nr_irqs, 4);
@@ -381,7 +425,7 @@ XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
     return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
 }
 
-bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
 {
     XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
 
@@ -390,6 +434,7 @@ bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
     }
 
     ive->w |= IVE_VALID;
+    xive->status[lisn] |= lsi ? XIVE_STATUS_LSI : 0;
     return true;
 }
 
@@ -402,5 +447,6 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
     }
 
     ive->w &= ~IVE_VALID;
+    xive->status[lisn] = 0;
     return true;
 }
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index ecc15d889b74..a7e59fd601d7 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -23,6 +23,9 @@ struct sPAPRXive {
 
     /* Properties */
     uint32_t     nr_irqs;
+#define XIVE_STATUS_LSI                0x1
+#define XIVE_STATUS_ASSERTED           0x2
+    uint8_t      *status;
 
     /* IRQ */
     qemu_irq     *qirqs;
@@ -37,7 +40,12 @@ struct sPAPRXive {
     MemoryRegion esb_iomem;
 };
 
-bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
+static inline bool spapr_xive_irq_is_lsi(sPAPRXive *xive, int lisn)
+{
+    return xive->status[lisn] & XIVE_STATUS_LSI;
+}
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 
-- 
2.13.6

* [Qemu-devel] [PATCH v2 05/19] spapr: introduce a XIVE interrupt presenter model
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (3 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 04/19] spapr: add support for the LSI " Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 06/19] spapr: introduce the XIVE Event Queues Cédric Le Goater
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Once an event has been routed by the IVRE, it reaches the XIVE
virtualization presentation engine which, simply speaking, raises one
bit in the Interrupt Pending Buffer (IPB) register corresponding to
the priority of the pending interrupt. This indicates there is an
event pending in one of the 8 priority queues and the interrupt can
then be delivered to the Virtual Processor.

The XIVE presenter engine uses a set of registers to handle priority
management and interrupt acknowledgment among other things. The most
important are :

  - Pending Interrupt Priority Register (PIPR)
  - Interrupt Pending Buffer (IPB)
  - Current Processor Priority (CPPR)
  - Notification Source Register (NSR)
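
As a rough sketch of how these registers relate, and not what this patch
implements (spapr_xive_nvt_accept() is still a stub here), the IPB can
be seen as a bitmap of pending priorities from which the PIPR is derived
and compared with the CPPR. The helper names are hypothetical, and the
bit assignment (priority 0 on the most significant bit) and the
"PIPR < CPPR" rule are assumptions of the sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRIORITY_MAX 7

/* Assumption: priority 0 (most favored) is the most significant bit */
static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > PRIORITY_MAX ?
        0 : (uint8_t)(1u << (PRIORITY_MAX - priority));
}

/* Record a pending event for one of the 8 priority queues */
static void ipb_set_pending(uint8_t *ipb, uint8_t priority)
{
    assert(priority <= PRIORITY_MAX);
    *ipb |= priority_to_ipb(priority);
}

/* PIPR: most favored pending priority, 0xff when nothing is pending */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio;

    for (prio = 0; prio <= PRIORITY_MAX; prio++) {
        if (ipb & priority_to_ipb(prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Assumption: an exception is raised only if PIPR is more favored
 * (numerically lower) than the current processor priority */
static bool should_signal(uint8_t ipb, uint8_t cppr)
{
    return ipb_to_pipr(ipb) < cppr;
}
```

With priorities 5 and 2 pending, the sketch yields IPB=0x24 and PIPR=2,
so a CPPR of 0xff lets the interrupt through while a CPPR of 2 does not.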

There is one set of registers per level of privilege, four in all :
HW, HV pool, OS and User. These are called rings. All registers are
accessible through a specific MMIO region called the Thread Interrupt
Management Area (TIMA) but, depending on the privilege level of the
CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
OS privilege and therefore can only access the OS and the User
rings. The others are for hypervisor levels.
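
The filtering can be pictured with the small standalone sketch below.
The TM_* constants mirror the ones added to xive-internal.h by this
patch; the TimaRegs type and tima_os_read8() are hypothetical and only
illustrate the ring check done by the OS view:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10

#define TM_QW0_USER     0x00
#define TM_QW1_OS       0x10
#define TM_CPPR         0x1

/* Each ring is 16 bytes, so the high nibble selects the ring */
#define TM_RING(offset) ((offset) & 0xf0)

/* Hypothetical container for the registers of all four rings */
typedef struct {
    uint8_t regs[TM_RING_COUNT * TM_RING_SIZE];
} TimaRegs;

/* OS view: a byte read succeeds only if it targets the OS ring */
static bool tima_os_read8(const TimaRegs *t, uint32_t offset, uint8_t *val)
{
    assert(offset < sizeof(t->regs));

    if (TM_RING(offset) != TM_QW1_OS) {
        return false;   /* User, HV pool and HW rings are filtered out */
    }
    *val = t->regs[offset];
    return true;
}
```

A read of TM_QW1_OS + TM_CPPR goes through, while the same byte offset
in any other ring is rejected.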

The CPU interrupt state is modeled with a sPAPRXiveNVT object which
stores the values of the different registers. The different TIMA views
are mapped at the same address for each CPU and 'current_cpu' is used
to retrieve the sPAPRXiveNVT holding the ring registers.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - renamed 'sPAPRXiveICP' to 'sPAPRXiveNVT'
 - renamed 'tima' to 'regs' 
 - renamed 'tima_os' to 'ring_os'
 - introduced TM_RING_SIZE 
 - removed 'tm_shift' field
 - introduced a memory region to model the User TIMA and another one
   for the OS TIMA. One page size for each.
 - removed useless checks in the memory region handlers
 - removed support for 970 ...
 
 hw/intc/spapr_xive.c        | 281 ++++++++++++++++++++++++++++++++++++++++++++
 hw/intc/xive-internal.h     |  92 +++++++++++++++
 include/hw/ppc/spapr_xive.h |  11 ++
 3 files changed, 384 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index c772c726667f..53f0e698e135 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -15,9 +15,177 @@
 #include "sysemu/dma.h"
 #include "monitor/monitor.h"
 #include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xics.h"
 
 #include "xive-internal.h"
 
+struct sPAPRXiveNVT {
+    DeviceState parent_obj;
+
+    CPUState  *cs;
+    qemu_irq  output;
+
+    /* Registers for all rings but sPAPR can only access the OS ring */
+    uint8_t   regs[TM_RING_COUNT * TM_RING_SIZE];
+
+    /* Shortcut to the OS ring */
+    uint8_t   *ring_os;
+};
+
+static uint64_t spapr_xive_nvt_accept(sPAPRXiveNVT *nvt)
+{
+    return 0;
+}
+
+static void spapr_xive_nvt_set_cppr(sPAPRXiveNVT *nvt, uint8_t cppr)
+{
+    if (cppr > XIVE_PRIORITY_MAX) {
+        cppr = 0xff;
+    }
+
+    nvt->ring_os[TM_CPPR] = cppr;
+}
+
+/*
+ * Thread Interrupt Management Area MMIO
+ */
+static uint64_t spapr_xive_tm_read_special(sPAPRXiveNVT *nvt, hwaddr offset,
+                                           unsigned size)
+{
+    uint64_t ret = -1;
+
+    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
+        ret = spapr_xive_nvt_accept(nvt);
+    } else {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
+                      HWADDR_PRIx" size %d\n", offset, size);
+    }
+
+    return ret;
+}
+
+#define TM_RING(offset) ((offset) & 0xf0)
+
+static uint64_t spapr_xive_tm_os_read(void *opaque, hwaddr offset,
+                                      unsigned size)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    sPAPRXiveNVT *nvt = SPAPR_XIVE_NVT(cpu->intc);
+    uint64_t ret = -1;
+    int i;
+
+    if (offset >= TM_SPC_ACK_EBB) {
+        return spapr_xive_tm_read_special(nvt, offset, size);
+    }
+
+    if (TM_RING(offset) != TM_QW1_OS) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
+                      HWADDR_PRIx"\n", offset);
+        return ret;
+    }
+
+    ret = 0;
+    for (i = 0; i < size; i++) {
+        ret |= (uint64_t) nvt->regs[offset + i] << (8 * i);
+    }
+
+    return ret;
+}
+
+static bool spapr_xive_tm_is_readonly(uint8_t offset)
+{
+    return offset != TM_QW1_OS + TM_CPPR;
+}
+
+static void spapr_xive_tm_write_special(sPAPRXiveNVT *nvt, hwaddr offset,
+                                        uint64_t value, unsigned size)
+{
+    /* TODO: support TM_SPC_SET_OS_PENDING */
+
+    /* TODO: support TM_SPC_ACK_OS_EL */
+}
+
+static void spapr_xive_tm_os_write(void *opaque, hwaddr offset,
+                                   uint64_t value, unsigned size)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    sPAPRXiveNVT *nvt = SPAPR_XIVE_NVT(cpu->intc);
+    int i;
+
+    if (offset >= TM_SPC_ACK_EBB) {
+        spapr_xive_tm_write_special(nvt, offset, value, size);
+        return;
+    }
+
+    if (TM_RING(offset) != TM_QW1_OS) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
+                      HWADDR_PRIx"\n", offset);
+        return;
+    }
+
+    switch (size) {
+    case 1:
+        if (offset == TM_QW1_OS + TM_CPPR) {
+            spapr_xive_nvt_set_cppr(nvt, value & 0xff);
+        }
+        break;
+    case 4:
+    case 8:
+        for (i = 0; i < size; i++) {
+            if (!spapr_xive_tm_is_readonly(offset + i)) {
+                nvt->regs[offset + i] = (value >> (8 * i)) & 0xff;
+            }
+        }
+        break;
+    default:
+        g_assert_not_reached();
+    }
+}
+
+static const MemoryRegionOps spapr_xive_tm_os_ops = {
+    .read = spapr_xive_tm_os_read,
+    .write = spapr_xive_tm_os_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+static uint64_t spapr_xive_tm_user_read(void *opaque, hwaddr offset,
+                                        unsigned size)
+{
+    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
+                  HWADDR_PRIx"\n", offset);
+    return -1;
+}
+
+static void spapr_xive_tm_user_write(void *opaque, hwaddr offset,
+                                     uint64_t value, unsigned size)
+{
+    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
+                  HWADDR_PRIx"\n", offset);
+}
+
+
+static const MemoryRegionOps spapr_xive_tm_user_ops = {
+    .read = spapr_xive_tm_user_read,
+    .write = spapr_xive_tm_user_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
 static void spapr_xive_irq(sPAPRXive *xive, int lisn)
 {
 
@@ -358,6 +526,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
                           &spapr_xive_esb_ops, xive, "xive.esb",
                           (1ull << ESB_SHIFT) * xive->nr_irqs);
     sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->esb_iomem);
+
+    /* The Thread Interrupt Management Area has the same address for
+     * each chip. On sPAPR, we only need to expose the User and OS
+     * level views of the TIMA.
+     */
+    xive->tm_base = (P9_MMIO_BASE | TM_BAR_DEFAULT);
+
+    memory_region_init_io(&xive->tm_iomem_user, OBJECT(xive),
+                          &spapr_xive_tm_user_ops, xive, "xive.tima.user",
+                          1ull << TM_SHIFT);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_iomem_user);
+
+    memory_region_init_io(&xive->tm_iomem_os, OBJECT(xive),
+                          &spapr_xive_tm_os_ops, xive, "xive.tima.os",
+                          1ull << TM_SHIFT);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_iomem_os);
 }
 
 static const VMStateDescription vmstate_spapr_xive_ive = {
@@ -413,9 +597,106 @@ static const TypeInfo spapr_xive_info = {
     .class_init = spapr_xive_class_init,
 };
 
+void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon)
+{
+    int cpu_index = nvt->cs ? nvt->cs->cpu_index : -1;
+
+    monitor_printf(mon, "CPU %d CPPR=%02x IPB=%02x PIPR=%02x NSR=%02x\n",
+                   cpu_index, nvt->ring_os[TM_CPPR], nvt->ring_os[TM_IPB],
+                   nvt->ring_os[TM_PIPR], nvt->ring_os[TM_NSR]);
+}
+
+static void spapr_xive_nvt_reset(void *dev)
+{
+    sPAPRXiveNVT *nvt = SPAPR_XIVE_NVT(dev);
+
+    memset(nvt->regs, 0, sizeof(nvt->regs));
+}
+
+static void spapr_xive_nvt_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXiveNVT *nvt = SPAPR_XIVE_NVT(dev);
+    PowerPCCPU *cpu;
+    CPUPPCState *env;
+    Object *obj;
+    Error *err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), ICP_PROP_CPU, &err);
+    if (!obj) {
+        error_propagate(errp, err);
+        error_prepend(errp, "required link '" ICP_PROP_CPU "' not found: ");
+        return;
+    }
+
+    cpu = POWERPC_CPU(obj);
+    nvt->cs = CPU(obj);
+
+    env = &cpu->env;
+    switch (PPC_INPUT(env)) {
+    case PPC_FLAGS_INPUT_POWER7:
+        nvt->output = env->irq_inputs[POWER7_INPUT_INT];
+        break;
+
+    default:
+        error_setg(errp, "XIVE interrupt controller does not support "
+                   "this CPU bus model");
+        return;
+    }
+
+    qemu_register_reset(spapr_xive_nvt_reset, dev);
+}
+
+static void spapr_xive_nvt_unrealize(DeviceState *dev, Error **errp)
+{
+    qemu_unregister_reset(spapr_xive_nvt_reset, dev);
+}
+
+static void spapr_xive_nvt_init(Object *obj)
+{
+    sPAPRXiveNVT *nvt = SPAPR_XIVE_NVT(obj);
+
+    nvt->ring_os = &nvt->regs[TM_QW1_OS];
+}
+
+static bool vmstate_spapr_xive_nvt_needed(void *opaque)
+{
+    /* TODO check machine XIVE support */
+    return true;
+}
+
+static const VMStateDescription vmstate_spapr_xive_nvt = {
+    .name = TYPE_SPAPR_XIVE_NVT,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .needed = vmstate_spapr_xive_nvt_needed,
+    .fields = (VMStateField[]) {
+        VMSTATE_BUFFER(regs, sPAPRXiveNVT),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static void spapr_xive_nvt_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = spapr_xive_nvt_realize;
+    dc->unrealize = spapr_xive_nvt_unrealize;
+    dc->desc = "sPAPR XIVE Interrupt Presenter";
+    dc->vmsd = &vmstate_spapr_xive_nvt;
+}
+
+static const TypeInfo xive_nvt_info = {
+    .name          = TYPE_SPAPR_XIVE_NVT,
+    .parent        = TYPE_DEVICE,
+    .instance_size = sizeof(sPAPRXiveNVT),
+    .instance_init = spapr_xive_nvt_init,
+    .class_init    = spapr_xive_nvt_class_init,
+};
+
 static void spapr_xive_register_types(void)
 {
     type_register_static(&spapr_xive_info);
+    type_register_static(&xive_nvt_info);
 }
 
 type_init(spapr_xive_register_types)
diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
index 872648dd96a2..49f4b7c5f393 100644
--- a/hw/intc/xive-internal.h
+++ b/hw/intc/xive-internal.h
@@ -26,6 +26,96 @@
 #define VC_BAR_SIZE      0x08000000000ull
 #define ESB_SHIFT        16 /* One 64k page. OPAL has two */
 
+/* Thread Interrupt Management Area */
+#define TM_BAR_DEFAULT   0x30203180000ull
+#define TM_SHIFT         16
+
+/*
+ * Thread Management (aka "TM") registers
+ */
+#define TM_RING_COUNT           4
+#define TM_RING_SIZE            0x10
+
+/* TM register offsets */
+#define TM_QW0_USER             0x000 /* All rings */
+#define TM_QW1_OS               0x010 /* Ring 0..2 */
+#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
+#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
+
+/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
+#define TM_NSR                  0x0  /*  +   +   -   +  */
+#define TM_CPPR                 0x1  /*  -   +   -   +  */
+#define TM_IPB                  0x2  /*  -   +   +   +  */
+#define TM_LSMFB                0x3  /*  -   +   +   +  */
+#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
+#define TM_INC                  0x5  /*  -   +   -   +  */
+#define TM_AGE                  0x6  /*  -   +   -   +  */
+#define TM_PIPR                 0x7  /*  -   +   -   +  */
+
+#define TM_WORD0                0x0
+#define TM_WORD1                0x4
+
+/*
+ * QW word 2 contains the valid bit at the top and other fields
+ * depending on the QW.
+ */
+#define TM_WORD2                0x8
+#define   TM_QW0W2_VU           PPC_BIT32(0)
+#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
+#define   TM_QW1W2_VO           PPC_BIT32(0)
+#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
+#define   TM_QW2W2_VP           PPC_BIT32(0)
+#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
+#define   TM_QW3W2_VT           PPC_BIT32(0)
+#define   TM_QW3W2_LP           PPC_BIT32(6)
+#define   TM_QW3W2_LE           PPC_BIT32(7)
+#define   TM_QW3W2_T            PPC_BIT32(31)
+
+/*
+ * In addition to normal loads to "peek" and writes (only when invalid)
+ * using 4 and 8 bytes accesses, the above registers support these
+ * "special" byte operations:
+ *
+ *   - Byte load from QW0[NSR] - User level NSR (EBB)
+ *   - Byte store to QW0[NSR] - User level NSR (EBB)
+ *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
+ *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
+ *                                    otherwise VT||0000000
+ *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
+ *
+ * Then we have all these "special" CI ops at these offset that trigger
+ * all sorts of side effects:
+ */
+#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg*/
+#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
+#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
+#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
+                                         * context */
+#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
+#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
+                                         * context to reg */
+#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
+                                         * context to reg*/
+#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
+#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
+                                         * line */
+#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
+#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
+                                         * line */
+#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
+/* XXX more... */
+
+/* NSR fields for the various QW ack types */
+#define TM_QW0_NSR_EB           PPC_BIT8(0)
+#define TM_QW1_NSR_EO           PPC_BIT8(0)
+#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
+#define  TM_QW3_NSR_HE_NONE     0
+#define  TM_QW3_NSR_HE_POOL     1
+#define  TM_QW3_NSR_HE_PHYS     2
+#define  TM_QW3_NSR_HE_LSI      3
+#define TM_QW3_NSR_I            PPC_BIT8(2)
+#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
+
 /* IVE/EAS
  *
  * One per interrupt source. Targets that interrupt to a given EQ
@@ -46,6 +136,8 @@ typedef struct XiveIVE {
 #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
 } XiveIVE;
 
+#define XIVE_PRIORITY_MAX  7
+
 XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
 
 #endif /* _INTC_XIVE_INTERNAL_H */
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index a7e59fd601d7..dcaa69025878 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -14,10 +14,15 @@
 
 typedef struct sPAPRXive sPAPRXive;
 typedef struct XiveIVE XiveIVE;
+typedef struct sPAPRXiveNVT sPAPRXiveNVT;
 
 #define TYPE_SPAPR_XIVE "spapr-xive"
 #define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
 
+#define TYPE_SPAPR_XIVE_NVT "spapr-xive-nvt"
+#define SPAPR_XIVE_NVT(obj) \
+    OBJECT_CHECK(sPAPRXiveNVT, (obj), TYPE_SPAPR_XIVE_NVT)
+
 struct sPAPRXive {
     SysBusDevice parent;
 
@@ -38,6 +43,11 @@ struct sPAPRXive {
     /* ESB memory region */
     hwaddr       esb_base;
     MemoryRegion esb_iomem;
+
+    /* TIMA memory regions */
+    hwaddr       tm_base;
+    MemoryRegion tm_iomem_user;
+    MemoryRegion tm_iomem_os;
 };
 
 static inline bool spapr_xive_irq_is_lsi(sPAPRXive *xive, int lisn)
@@ -48,5 +58,6 @@ static inline bool spapr_xive_irq_is_lsi(sPAPRXive *xive, int lisn)
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon);
 
 #endif /* PPC_SPAPR_XIVE_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v2 06/19] spapr: introduce the XIVE Event Queues
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (4 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 05/19] spapr: introduce a XIVE interrupt presenter model Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 07/19] spapr: push the XIVE EQ data in OS event queue Cédric Le Goater
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The Event Queue Descriptor (EQD) table, also known as Event
Notification Descriptor (END), is an internal table of the XIVE
virtualization routing engine. It specifies on which Event Queue the
event data should be posted when an exception occurs (later on pulled
by the OS) and which Virtual Processor to notify. The Event Queue is a
much more complex structure but we start with a simple model for the
sPAPR machine.

There is one XiveEQ per priority and the model chooses to store them
under the XIVE virtualization presenter model (sPAPRXiveNVT) to save
an extra table. EQs are simply indexed with :

       (server << 3) | (priority & 0x7)

This is not in the XIVE architecture but as the EQ index is never
exposed to the guest, in the hcalls or in the device tree, we are free
to use whatever best fits the current model.
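
A minimal sketch of this indexing, using hypothetical SKETCH_* names
(the macros actually introduced by the patch may be defined
differently), could look like :

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers: the low 3 bits hold the priority, the rest
 * of the index holds the server number */
#define SKETCH_EQ_INDEX(server, prio) \
    (((uint32_t)(server) << 3) | ((prio) & 0x7))
#define SKETCH_EQ_INDEX_SERVER(idx)   ((idx) >> 3)
#define SKETCH_EQ_INDEX_PRIO(idx)     ((idx) & 0x7)

/* Build an EQ index, rejecting priorities outside 0..7 */
static uint32_t eq_index(uint32_t server, uint8_t priority)
{
    assert(priority <= 7);
    return SKETCH_EQ_INDEX(server, priority);
}
```

For instance, server 5 and priority 3 give index 43, which decodes back
to the same server and priority.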

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - removed spapr_xive_eq_for_server() which did the EQ indexing.
 - changed spapr_xive_get_eq() to use a server and a priority parameter
 - introduced a couple of macros for the EQ indexing.
 - improved 'info pic' output

 hw/intc/spapr_xive.c    | 48 ++++++++++++++++++++++++++++++++++++++++++--
 hw/intc/xive-internal.h | 53 +++++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 99 insertions(+), 2 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 53f0e698e135..8e990d58ecf4 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -15,6 +15,7 @@
 #include "sysemu/dma.h"
 #include "monitor/monitor.h"
 #include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/spapr.h"
 #include "hw/ppc/xics.h"
 
 #include "xive-internal.h"
@@ -30,6 +31,8 @@ struct sPAPRXiveNVT {
 
     /* Shortcut to the OS ring */
     uint8_t   *ring_os;
+
+    XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
 };
 
 static uint64_t spapr_xive_nvt_accept(sPAPRXiveNVT *nvt)
@@ -186,6 +189,13 @@ static const MemoryRegionOps spapr_xive_tm_user_ops = {
     },
 };
 
+static sPAPRXiveNVT *spapr_xive_nvt_get(sPAPRXive *xive, int server)
+{
+    PowerPCCPU *cpu = spapr_find_cpu(server);
+
+    return cpu ? SPAPR_XIVE_NVT(cpu->intc) : NULL;
+}
+
 static void spapr_xive_irq(sPAPRXive *xive, int lisn)
 {
 
@@ -461,19 +471,22 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
     for (i = 0; i < xive->nr_irqs; i++) {
         XiveIVE *ive = &xive->ivt[i];
         uint8_t pq;
+        uint32_t eq_idx;
 
         if (!(ive->w & IVE_VALID)) {
             continue;
         }
 
         pq = spapr_xive_pq_get(xive, i);
+        eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
 
-        monitor_printf(mon, "  %4x %s %s %c%c %08x %08x\n", i,
+        monitor_printf(mon, "  %4x %s %s %c%c server:%d prio:%d %08x\n", i,
                        spapr_xive_irq_is_lsi(xive, i) ? "LSI" : "MSI",
                        ive->w & IVE_MASKED ? "M" : " ",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
                        pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
-                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
+                       XIVE_EQ_INDEX_SERVER(eq_idx),
+                       XIVE_EQ_INDEX_PRIO(eq_idx),
                        (int) GETFIELD(IVE_EQ_DATA, ive->w));
     }
 }
@@ -611,6 +624,8 @@ static void spapr_xive_nvt_reset(void *dev)
     sPAPRXiveNVT *nvt = SPAPR_XIVE_NVT(dev);
 
     memset(nvt->regs, 0, sizeof(nvt->regs));
+
+    memset(nvt->eqt, 0, sizeof(nvt->eqt));
 }
 
 static void spapr_xive_nvt_realize(DeviceState *dev, Error **errp)
@@ -658,6 +673,23 @@ static void spapr_xive_nvt_init(Object *obj)
     nvt->ring_os = &nvt->regs[TM_QW1_OS];
 }
 
+static const VMStateDescription vmstate_spapr_xive_nvt_eq = {
+    .name = TYPE_SPAPR_XIVE_NVT "/eq",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT32(w0, XiveEQ),
+        VMSTATE_UINT32(w1, XiveEQ),
+        VMSTATE_UINT32(w2, XiveEQ),
+        VMSTATE_UINT32(w3, XiveEQ),
+        VMSTATE_UINT32(w4, XiveEQ),
+        VMSTATE_UINT32(w5, XiveEQ),
+        VMSTATE_UINT32(w6, XiveEQ),
+        VMSTATE_UINT32(w7, XiveEQ),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static bool vmstate_spapr_xive_nvt_needed(void *opaque)
 {
     /* TODO check machine XIVE support */
@@ -671,6 +703,8 @@ static const VMStateDescription vmstate_spapr_xive_nvt = {
     .needed = vmstate_spapr_xive_nvt_needed,
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(regs, sPAPRXiveNVT),
+        VMSTATE_STRUCT_ARRAY(eqt, sPAPRXiveNVT, (XIVE_PRIORITY_MAX + 1), 1,
+                             vmstate_spapr_xive_nvt_eq, XiveEQ),
         VMSTATE_END_OF_LIST()
     },
 };
@@ -731,3 +765,13 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
     xive->status[lisn] = 0;
     return true;
 }
+
+XiveEQ *spapr_xive_get_eq(sPAPRXive *xive, uint32_t server, uint8_t priority)
+{
+    sPAPRXiveNVT *nvt = spapr_xive_nvt_get(xive, server);
+
+    if (!nvt || priority > XIVE_PRIORITY_MAX) {
+        return NULL;
+    }
+    return &nvt->eqt[priority];
+}
diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
index 49f4b7c5f393..fcd740d276f7 100644
--- a/hw/intc/xive-internal.h
+++ b/hw/intc/xive-internal.h
@@ -136,8 +136,61 @@ typedef struct XiveIVE {
 #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
 } XiveIVE;
 
+/* EQ */
+typedef struct XiveEQ {
+        uint32_t        w0;
+#define EQ_W0_VALID             PPC_BIT32(0)
+#define EQ_W0_ENQUEUE           PPC_BIT32(1)
+#define EQ_W0_UCOND_NOTIFY      PPC_BIT32(2)
+#define EQ_W0_BACKLOG           PPC_BIT32(3)
+#define EQ_W0_PRECL_ESC_CTL     PPC_BIT32(4)
+#define EQ_W0_ESCALATE_CTL      PPC_BIT32(5)
+#define EQ_W0_END_OF_INTR       PPC_BIT32(6)
+#define EQ_W0_QSIZE             PPC_BITMASK32(12, 15)
+#define EQ_W0_SW0               PPC_BIT32(16)
+#define EQ_W0_FIRMWARE          EQ_W0_SW0 /* Owned by FW */
+#define EQ_QSIZE_4K             0
+#define EQ_QSIZE_64K            4
+#define EQ_W0_HWDEP             PPC_BITMASK32(24, 31)
+        uint32_t        w1;
+#define EQ_W1_ESn               PPC_BITMASK32(0, 1)
+#define EQ_W1_ESn_P             PPC_BIT32(0)
+#define EQ_W1_ESn_Q             PPC_BIT32(1)
+#define EQ_W1_ESe               PPC_BITMASK32(2, 3)
+#define EQ_W1_ESe_P             PPC_BIT32(2)
+#define EQ_W1_ESe_Q             PPC_BIT32(3)
+#define EQ_W1_GENERATION        PPC_BIT32(9)
+#define EQ_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
+        uint32_t        w2;
+#define EQ_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
+#define EQ_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
+        uint32_t        w3;
+#define EQ_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
+        uint32_t        w4;
+#define EQ_W4_ESC_EQ_BLOCK      PPC_BITMASK32(4, 7)
+#define EQ_W4_ESC_EQ_INDEX      PPC_BITMASK32(8, 31)
+        uint32_t        w5;
+#define EQ_W5_ESC_EQ_DATA       PPC_BITMASK32(1, 31)
+        uint32_t        w6;
+#define EQ_W6_FORMAT_BIT        PPC_BIT32(8)
+#define EQ_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
+#define EQ_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
+        uint32_t        w7;
+#define EQ_W7_F0_IGNORE         PPC_BIT32(0)
+#define EQ_W7_F0_BLK_GROUPING   PPC_BIT32(1)
+#define EQ_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
+#define EQ_W7_F1_WAKEZ          PPC_BIT32(0)
+#define EQ_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
+} XiveEQ;
+
 #define XIVE_PRIORITY_MAX  7
 
 XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
+XiveEQ *spapr_xive_get_eq(sPAPRXive *xive, uint32_t server, uint8_t priority);
+
+#define XIVE_EQ_INDEX(server, prio) (((server) << 3) | ((prio) & 0x7))
+#define XIVE_EQ_INDEX_SERVER(eq_idx) ((eq_idx) >> 3)
+#define XIVE_EQ_INDEX_PRIO(eq_idx) ((eq_idx) & 0x7)
+
 
 #endif /* _INTC_XIVE_INTERNAL_H */
-- 
2.13.6


* [Qemu-devel] [PATCH v2 07/19] spapr: push the XIVE EQ data in OS event queue
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (5 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 06/19] spapr: introduce the XIVE Event Queues Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 08/19] spapr: notify the CPU when the XIVE interrupt priority is more privileged Cédric Le Goater
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

If a triggered event is let through by the XIVE virtualization routing
engine, the Event Queue data defined in the associated IVE is pushed
to the in-memory event queue. The latter is a circular buffer provided
by the OS using the H_INT_SET_QUEUE_CONFIG hcall, one per server and
priority pair. Each Event Queue entry is 4 bytes long: the first bit
is a 'generation' bit and the following 31 bits are the EQ Data field.

The EQ Data field provides a way to set an invariant logical event
source number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG
hcall when the EISN flag is used.
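
The queue handling described above can be sketched as follows. This is
a standalone illustration, not the code from the patch: DemoEQ,
demo_eq_entry() and demo_eq_push() are hypothetical names, the queue
lives in a plain host array, and the big-endian store to guest memory
is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical standalone model of one OS event queue (not QEMU's XiveEQ) */
typedef struct DemoEQ {
    uint32_t *entries;    /* queue storage, one 32-bit word per entry */
    uint32_t nr_entries;  /* number of entries in the circular buffer */
    uint32_t index;       /* next slot to write */
    uint32_t gen;         /* current generation bit (0 or 1) */
} DemoEQ;

/* Build an entry: generation bit on top, 31 bits of EQ data below */
static uint32_t demo_eq_entry(uint32_t gen, uint32_t data)
{
    return ((gen & 1u) << 31) | (data & 0x7fffffffu);
}

/* Push one event and flip the generation bit when the index wraps */
static void demo_eq_push(DemoEQ *eq, uint32_t data)
{
    assert(eq->nr_entries > 0);

    eq->entries[eq->index] = demo_eq_entry(eq->gen, data);
    eq->index = (eq->index + 1) % eq->nr_entries;
    if (eq->index == 0) {
        eq->gen ^= 1u;
    }
}
```

Flipping the generation bit on every wrap is what lets the consumer
tell a freshly written entry from a stale one left over from the
previous pass through the buffer.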

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - replaced dma_memory_write() by stl_be_dma()
 - improved 'info pic' output

 hw/intc/spapr_xive.c | 98 ++++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 95 insertions(+), 3 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 8e990d58ecf4..629563d01998 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -196,9 +196,87 @@ static sPAPRXiveNVT *spapr_xive_nvt_get(sPAPRXive *xive, int server)
     return cpu ? SPAPR_XIVE_NVT(cpu->intc) : NULL;
 }
 
+static void spapr_xive_pic_print_info_eq(XiveEQ *eq, Monitor *mon)
+{
+    uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
+    uint32_t qindex = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
+    uint32_t qgen = GETFIELD(EQ_W1_GENERATION, eq->w1);
+    uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
+    uint32_t qentries = 1 << (qsize + 10);
+
+    monitor_printf(mon, "eq:@%08" PRIx64"% 6d/%5d ^%d", qaddr_base,
+                   qindex, qentries, qgen);
+}
+
+static void spapr_xive_eq_push(XiveEQ *eq, uint32_t data)
+{
+    uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
+    uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
+    uint32_t qindex = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
+    uint32_t qgen = GETFIELD(EQ_W1_GENERATION, eq->w1);
+
+    uint64_t qaddr = qaddr_base + (qindex << 2);
+    uint32_t qdata = (qgen << 31) | (data & 0x7fffffff);
+    uint32_t qentries = 1 << (qsize + 10);
+
+    if (stl_be_dma(&address_space_memory, qaddr, qdata)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write EQ data @0x%"
+                      HWADDR_PRIx "\n", qaddr);
+        return;
+    }
+
+    qindex = (qindex + 1) % qentries;
+    if (qindex == 0) {
+        qgen ^= 1;
+        eq->w1 = SETFIELD(EQ_W1_GENERATION, eq->w1, qgen);
+    }
+    eq->w1 = SETFIELD(EQ_W1_PAGE_OFF, eq->w1, qindex);
+}
+
 static void spapr_xive_irq(sPAPRXive *xive, int lisn)
 {
+    XiveIVE *ive;
+    XiveEQ *eq;
+    uint32_t eq_idx;
+    uint8_t priority;
+
+    ive = spapr_xive_get_ive(xive, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
+        return;
+    }
 
+    if (ive->w & IVE_MASKED) {
+        return;
+    }
+
+    /* Find our XiveEQ */
+    eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+    eq = spapr_xive_get_eq(xive, XIVE_EQ_INDEX_SERVER(eq_idx),
+                           XIVE_EQ_INDEX_PRIO(eq_idx));
+    if (!eq) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No EQ for LISN %d\n", lisn);
+        return;
+    }
+
+    if (eq->w0 & EQ_W0_ENQUEUE) {
+        spapr_xive_eq_push(eq, GETFIELD(IVE_EQ_DATA, ive->w));
+    }
+
+    if (!(eq->w0 & EQ_W0_UCOND_NOTIFY)) {
+        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
+    }
+
+    if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
+        priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
+
+        /* The EQ is masked. Can this happen ?  */
+        if (priority == 0xff) {
+            g_assert_not_reached();
+        }
+    } else {
+        qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
+    }
 }
 
 /*
@@ -480,14 +558,28 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
         pq = spapr_xive_pq_get(xive, i);
         eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
 
-        monitor_printf(mon, "  %4x %s %s %c%c server:%d prio:%d %08x\n", i,
+        monitor_printf(mon, "  %4x %s %s %c%c server:%d prio:%d ", i,
                        spapr_xive_irq_is_lsi(xive, i) ? "LSI" : "MSI",
                        ive->w & IVE_MASKED ? "M" : " ",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
                        pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
                        XIVE_EQ_INDEX_SERVER(eq_idx),
-                       XIVE_EQ_INDEX_PRIO(eq_idx),
-                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
+                       XIVE_EQ_INDEX_PRIO(eq_idx));
+
+        if (!(ive->w & IVE_MASKED)) {
+            XiveEQ *eq;
+
+            eq = spapr_xive_get_eq(xive, XIVE_EQ_INDEX_SERVER(eq_idx),
+                                   XIVE_EQ_INDEX_PRIO(eq_idx));
+            if (eq) {
+                spapr_xive_pic_print_info_eq(eq, mon);
+                monitor_printf(mon, " data:%08x",
+                               (int) GETFIELD(IVE_EQ_DATA, ive->w));
+            } else {
+                monitor_printf(mon, "no eq ?!");
+            }
+        }
+        monitor_printf(mon, "\n");
     }
 }
 
-- 
2.13.6


* [Qemu-devel] [PATCH v2 08/19] spapr: notify the CPU when the XIVE interrupt priority is more privileged
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (6 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 07/19] spapr: push the XIVE EQ data in OS event queue Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 09/19] spapr: add support for the SET_OS_PENDING command (XIVE) Cédric Le Goater
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Once an event has been routed, the XIVE virtualization presenter
engine raises the bit corresponding to the priority of the pending
interrupt in the Interrupt Pending Buffer (IPB) register. The Pending
Interrupt Priority Register (PIPR) is also updated using the IPB. It
contains the priority of the most favored pending notification.

The PIPR is then compared to the Current Processor Priority
Register (CPPR). If it is more favored (numerically less than), the
CPU interrupt line is raised and the EO bit of the Notification Source
Register (NSR) is updated to notify the presence of an exception for
the O/S. The check needs to be done whenever the PIPR or the CPPR is
changed.

The O/S acknowledges the interrupt with a special load in the Thread
Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
takes the value of PIPR. The bit in the IPB corresponding to the
priority of the pending interrupt is reset, and so is the EO bit of
the NSR.
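
The register interplay described above can be sketched in isolation as
follows. This is only an illustration under assumptions: DemoThread
and the demo_*() helpers are hypothetical names, the registers are
plain struct fields rather than TIMA bytes, and the leftmost-bit
lookup uses a portable loop instead of a count-leading-zeros helper.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PRIORITY_MAX 7

/* Hypothetical per-thread state; fields mirror the registers above */
typedef struct DemoThread {
    uint8_t ipb;   /* one pending bit per priority, priority 0 = MSB */
    uint8_t pipr;  /* most favored pending priority, 0xff if none */
    uint8_t cppr;  /* current processor priority */
    bool eo;       /* EO bit of the NSR */
    bool line;     /* state of the CPU interrupt line */
} DemoThread;

static uint8_t demo_priority_to_ipb(uint8_t priority)
{
    return priority > DEMO_PRIORITY_MAX ?
        0 : (uint8_t)(1u << (DEMO_PRIORITY_MAX - priority));
}

/* Scan from the most favored priority down to the least favored one */
static uint8_t demo_ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= DEMO_PRIORITY_MAX; prio++) {
        if (ipb & demo_priority_to_ipb(prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Raise the line when the pending priority is more favored than CPPR */
static void demo_notify(DemoThread *t)
{
    if (t->pipr < t->cppr) {
        t->eo = true;
        t->line = true;
    }
}

/* Record a new pending event and re-evaluate */
static void demo_post(DemoThread *t, uint8_t priority)
{
    t->ipb |= demo_priority_to_ipb(priority);
    t->pipr = demo_ipb_to_pipr(t->ipb);
    demo_notify(t);
}

/* O/S acknowledge: CPPR takes the PIPR, the IPB bit and EO are cleared */
static uint8_t demo_accept(DemoThread *t)
{
    t->line = false;
    if (t->eo) {
        t->cppr = t->pipr;
        t->ipb &= (uint8_t)~demo_priority_to_ipb(t->cppr);
        t->pipr = demo_ipb_to_pipr(t->ipb);
        t->eo = false;
    }
    return t->cppr;
}
```

With priorities 5 and 2 pending and a CPPR of 0xff, the acknowledge
returns 2 and leaves priority 5 pending; no new exception is signalled
for it until the CPPR is raised above 5 again.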

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - set initial TM_PIPR to 0xFF

 hw/intc/spapr_xive.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 88 insertions(+), 1 deletion(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 629563d01998..a8acfee740d9 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -35,9 +35,68 @@ struct sPAPRXiveNVT {
     XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
 };
 
+/* Convert a priority number to an Interrupt Pending Buffer (IPB)
+ * register, which indicates a pending interrupt at the priority
+ * corresponding to the bit number
+ */
+static uint8_t priority_to_ipb(uint8_t priority)
+{
+    return priority > XIVE_PRIORITY_MAX ?
+        0 : 1 << (XIVE_PRIORITY_MAX - priority);
+}
+
+/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
+ * Interrupt Priority Register (PIPR), which contains the priority of
+ * the most favored pending notification.
+ */
+static uint8_t ipb_to_pipr(uint8_t ibp)
+{
+    return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
+}
+
+/*
+ * TODO:
+ *
+ * Ben says: "
+ *
+ * PIPR is clamped to CPPR. So the value in the PIPR is:
+ *
+ *     v = leftmost_bit_of(ipb) (or 0xff);
+ *     pipr = v < cppr ? v : cppr;
+ *
+ * which means it's never actually 0xff ... surprise !".
+ *
+ * But, the CPPR is set to 0xFF by the OS and so the PIPR will always
+ * be more favored ... I am confused ...
+ */
 static uint64_t spapr_xive_nvt_accept(sPAPRXiveNVT *nvt)
 {
-    return 0;
+    uint8_t nsr = nvt->ring_os[TM_NSR];
+
+    qemu_irq_lower(nvt->output);
+
+    if (nvt->ring_os[TM_NSR] & TM_QW1_NSR_EO) {
+        uint8_t cppr = nvt->ring_os[TM_PIPR];
+
+        nvt->ring_os[TM_CPPR] = cppr;
+
+        /* Reset the pending buffer bit */
+        nvt->ring_os[TM_IPB] &= ~priority_to_ipb(cppr);
+        nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+
+        /* Drop Exception bit for OS */
+        nvt->ring_os[TM_NSR] &= ~TM_QW1_NSR_EO;
+    }
+
+    return (nsr << 8) | nvt->ring_os[TM_CPPR];
+}
+
+static void spapr_xive_nvt_notify(sPAPRXiveNVT *nvt)
+{
+    if (nvt->ring_os[TM_PIPR] < nvt->ring_os[TM_CPPR]) {
+        nvt->ring_os[TM_NSR] |= TM_QW1_NSR_EO;
+        qemu_irq_raise(nvt->output);
+    }
 }
 
 static void spapr_xive_nvt_set_cppr(sPAPRXiveNVT *nvt, uint8_t cppr)
@@ -47,6 +106,10 @@ static void spapr_xive_nvt_set_cppr(sPAPRXiveNVT *nvt, uint8_t cppr)
     }
 
     nvt->ring_os[TM_CPPR] = cppr;
+
+    /* CPPR has changed, check if we need to redistribute a pending
+     * exception */
+    spapr_xive_nvt_notify(nvt);
 }
 
 /*
@@ -239,6 +302,8 @@ static void spapr_xive_irq(sPAPRXive *xive, int lisn)
     XiveEQ *eq;
     uint32_t eq_idx;
     uint8_t priority;
+    uint32_t server;
+    sPAPRXiveNVT *nvt;
 
     ive = spapr_xive_get_ive(xive, lisn);
     if (!ive || !(ive->w & IVE_VALID)) {
@@ -267,6 +332,13 @@ static void spapr_xive_irq(sPAPRXive *xive, int lisn)
         qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
     }
 
+    server = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
+    nvt = spapr_xive_nvt_get(xive, server);
+    if (!nvt) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No NVT for server %d\n", server);
+        return;
+    }
+
     if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
         priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
 
@@ -274,9 +346,18 @@ static void spapr_xive_irq(sPAPRXive *xive, int lisn)
         if (priority == 0xff) {
             g_assert_not_reached();
         }
+
+        /* Update the IPB (Interrupt Pending Buffer) with the priority
+         * of the new notification and inform the NVT, which will
+         * decide to raise the exception, or not, depending on the CPPR.
+         */
+        nvt->ring_os[TM_IPB] |= priority_to_ipb(priority);
+        nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
     } else {
         qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
     }
+
+    spapr_xive_nvt_notify(nvt);
 }
 
 /*
@@ -717,6 +798,12 @@ static void spapr_xive_nvt_reset(void *dev)
 
     memset(nvt->regs, 0, sizeof(nvt->regs));
 
+    /*
+     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
+     * CPPR is first set.
+     */
+    nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+
     memset(nvt->eqt, 0, sizeof(nvt->eqt));
 }
 
-- 
2.13.6


* [Qemu-devel] [PATCH v2 09/19] spapr: add support for the SET_OS_PENDING command (XIVE)
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (7 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 08/19] spapr: notify the CPU when the XIVE interrupt priority is more privileged Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 10/19] spapr: introduce a 'xive_exploitation' boolean to enable XIVE Cédric Le Goater
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

This command offers the possibility for the O/S to adjust the IPB to
allow a CPU to process event queues of other priorities during one
physical interrupt cycle. This is not currently used by the XIVE
support for sPAPR in Linux but it is by the hypervisor.

More from Ben :

  It's a way to avoid the SW replay on EOI.

  IE, assume you have 2 interrupts in the queue. You take the exception,
  ack the first one, process it etc... Then you EOI, the HW won't send
  a second notification. You need to look at the queue and continue
  consuming until it's empty.

  Today Linux checks the queue on EOI and use a SW mechanism to
  synthesize a new pseudo-external interrupt.

  This MMIO command would allow the OS to instead set back the
  corresponding priority bit to 1 in the IPB and cause the HW to
  re-emit the interrupt instead of SW.

  Linux doesn't use this today because DD1 didn't support it for the
  HV level, but other OSes might and we also might use it when we do
  groups, thus allowing redistribution.
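
The replay flow Ben describes could look like the sketch below from
the point of view of a guest. Everything here is hypothetical:
DemoRing, demo_set_os_pending() and demo_eoi() are illustration names,
the 'raised' counter stands in for the HW raising the interrupt line,
and restoring the CPPR to 0xff on EOI is an assumption about the OS
handler, not something this patch defines.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model: pending bits, current priority, and a counter
 * standing in for "the HW raised the interrupt line" */
typedef struct DemoRing {
    uint8_t ipb;
    uint8_t cppr;
    unsigned raised;
} DemoRing;

/* Models a one-byte store of 'priority' to the SET_OS_PENDING offset */
static void demo_set_os_pending(DemoRing *r, uint8_t priority)
{
    uint8_t pipr = 0xff;

    if (priority <= 7) {
        r->ipb |= (uint8_t)(1u << (7 - priority));
    }
    /* Most favored pending priority is the leftmost bit set in the IPB */
    for (uint8_t p = 0; p <= 7; p++) {
        if (r->ipb & (0x80u >> p)) {
            pipr = p;
            break;
        }
    }
    if (pipr < r->cppr) {
        r->raised++;
    }
}

/* OS side of an EOI: instead of replaying in software, ask the
 * controller to re-emit if the queue still holds entries */
static void demo_eoi(DemoRing *r, uint8_t priority, unsigned queued)
{
    r->cppr = 0xff;   /* assumption: the handler restores the CPPR */
    if (queued > 0) {
        demo_set_os_pending(r, priority);
    }
}
```

The point of the command is that the decision to re-signal stays in
the controller model; the OS only reports that work remains at a given
priority.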

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c | 18 ++++++++++++++++--
 1 file changed, 16 insertions(+), 2 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index a8acfee740d9..38e1f569ea82 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -166,9 +166,23 @@ static bool spapr_xive_tm_is_readonly(uint8_t offset)
 static void spapr_xive_tm_write_special(sPAPRXiveNVT *nvt, hwaddr offset,
                                         uint64_t value, unsigned size)
 {
-    /* TODO: support TM_SPC_SET_OS_PENDING */
+    switch (offset) {
+    case TM_SPC_SET_OS_PENDING:
+        if (size == 1) {
+            nvt->ring_os[TM_IPB] |= priority_to_ipb(value & 0xff);
+            nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+            spapr_xive_nvt_notify(nvt);
+        }
+        break;
+    case TM_SPC_ACK_OS_EL:  /* TODO */
+        qemu_log_mask(LOG_UNIMP, "XIVE: no command to acknowledge O/S "
+                      "Interrupt to even O/S reporting line\n");
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA write @%"
+                      HWADDR_PRIx" size %d\n", offset, size);
+    }
 
-    /* TODO: support TM_SPC_ACK_OS_EL */
 }
 
 static void spapr_xive_tm_os_write(void *opaque, hwaddr offset,
-- 
2.13.6


* [Qemu-devel] [PATCH v2 10/19] spapr: introduce a 'xive_exploitation' boolean to enable XIVE
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (8 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 09/19] spapr: add support for the SET_OS_PENDING command (XIVE) Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 11/19] spapr: add a sPAPRXive object to the machine Cédric Le Goater
                   ` (8 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The XIVE exploitation interrupt mode will be enabled for newer
machines and disabled for older ones. Also provide a command line
machine option to switch XIVE off on newer machines if needed.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c   | 10 ++++++----
 hw/ppc/spapr.c         | 35 +++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |  1 +
 3 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 38e1f569ea82..bf30edc87bee 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -756,8 +756,9 @@ static const VMStateDescription vmstate_spapr_xive_ive = {
 
 static bool vmstate_spapr_xive_needed(void *opaque)
 {
-    /* TODO check machine XIVE support */
-    return true;
+    sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    return spapr->xive_exploitation;
 }
 
 static const VMStateDescription vmstate_spapr_xive = {
@@ -885,8 +886,9 @@ static const VMStateDescription vmstate_spapr_xive_nvt_eq = {
 
 static bool vmstate_spapr_xive_nvt_needed(void *opaque)
 {
-    /* TODO check machine XIVE support */
-    return true;
+    sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+
+    return spapr->xive_exploitation;
 }
 
 static const VMStateDescription vmstate_spapr_xive_nvt = {
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 306875e12320..b5b9e7f1b3b6 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2820,6 +2820,29 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
     visit_type_uint32(v, name, (uint32_t *)opaque, errp);
 }
 
+static bool spapr_get_xive_exploitation(Object *obj, Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+
+    return spapr->xive_exploitation;
+}
+
+static void spapr_set_xive_exploitation(Object *obj, bool value,
+                                            Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+
+    if (value) {
+        /* Don't let older machines activate XIVE */
+        if (!spapr->xive_exploitation) {
+            error_setg(errp, "\"xive-exploitation\" option can not be "
+                       "switched on");
+        }
+    } else {
+        spapr->xive_exploitation = false;
+    }
+}
+
 static void spapr_machine_initfn(Object *obj)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
@@ -2855,6 +2878,15 @@ static void spapr_machine_initfn(Object *obj)
     object_property_set_description(obj, "vsmt",
                                     "Virtual SMT: KVM behaves as if this were"
                                     " the host's SMT mode", &error_abort);
+
+    spapr->xive_exploitation = true;
+    object_property_add_bool(obj, "xive-exploitation",
+                            spapr_get_xive_exploitation,
+                            spapr_set_xive_exploitation,
+                            NULL);
+    object_property_set_description(obj, "xive-exploitation",
+                                    "XIVE exploitation mode POWER9",
+                                    NULL);
 }
 
 static void spapr_machine_finalizefn(Object *obj)
@@ -3890,7 +3922,10 @@ DEFINE_SPAPR_MACHINE(2_12, "2.12", true);
 
 static void spapr_machine_2_11_instance_options(MachineState *machine)
 {
+    sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
+
     spapr_machine_2_12_instance_options(machine);
+    spapr->xive_exploitation = false;
 }
 
 static void spapr_machine_2_11_class_options(MachineClass *mc)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 14757b805e84..1d6d2c690d7f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -127,6 +127,7 @@ struct sPAPRMachineState {
     MemoryHotplugState hotplug_memory;
 
     const char *icp_type;
+    bool xive_exploitation;
 };
 
 #define H_SUCCESS         0
-- 
2.13.6


* [Qemu-devel] [PATCH v2 11/19] spapr: add a sPAPRXive object to the machine
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (9 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 10/19] spapr: introduce a 'xive_exploitation' boolean to enable XIVE Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 12/19] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
                   ` (7 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The sPAPRXive object is designed to be always available, so it is
created unconditionally on newer machines. Depending on the
configuration and the guest capabilities, the CAS negotiation process
will decide which interrupt mode to activate: legacy or XIVE
exploitation.

The XIVE model makes use of the full range of the IRQ number space.
The IRQ numbers for the CPU IPIs in XIVE are allocated at the bottom
of this space, below XICS_IRQ_BASE, to preserve compatibility with
XICS which does not use that range.

That leaves us with 4K possible IPIs. This should be enough for
some time given that the maximum number of CPUs is 1024 for the sPAPR
machine under QEMU. For the record, the biggest POWER8 or POWER9
system has a maximum of 1536 HW threads (16 sockets, 192 cores, SMT8).

Also make sure that the allocated IRQ numbers are kept in sync between
XICS and XIVE, when available.
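
The IRQ number layout described above can be sketched as below. The
constants are assumptions taken from the XICS headers as I remember
them (a base of 0x1000, which matches the "4K possible IPIs" figure,
and 1024 XICS sources); the DEMO_* macros and demo_*() helpers are
hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed values, mirroring XICS_IRQ_BASE and XICS_IRQS_SPAPR */
#define DEMO_XICS_IRQ_BASE   0x1000  /* first IRQ number used by XICS */
#define DEMO_XICS_IRQS_SPAPR 1024    /* number of XICS sources */

/* Total size of the XIVE IRQ number space in this layout */
static uint32_t demo_xive_nr_irqs(void)
{
    return DEMO_XICS_IRQ_BASE + DEMO_XICS_IRQS_SPAPR;
}

/* IPIs live below the XICS base, one per server number */
static bool demo_irq_is_ipi(uint32_t irq)
{
    return irq < DEMO_XICS_IRQ_BASE;
}

/* Index of a device source in the XICS source array */
static uint32_t demo_xics_srcno(uint32_t irq)
{
    assert(irq >= DEMO_XICS_IRQ_BASE && irq < demo_xive_nr_irqs());
    return irq - DEMO_XICS_IRQ_BASE;
}
```

Keeping the device sources at the same numbers in both models is what
allows an IRQ allocated through the XICS path to be enabled in XIVE
with the same number.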

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - made the creation of the sPAPRXive object conditional on the
   xive_exploitation bool, which is false on older pseries machines.
 - merged in the IPI allocation patch
 - parented the sPAPRXive object to sysbus.

 hw/ppc/spapr.c         | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |  2 ++
 2 files changed, 52 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b5b9e7f1b3b6..195a48399e4b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -56,6 +56,7 @@
 #include "hw/ppc/spapr_vio.h"
 #include "hw/pci-host/spapr.h"
 #include "hw/ppc/xics.h"
+#include "hw/ppc/spapr_xive.h"
 #include "hw/pci/msi.h"
 
 #include "hw/pci/pci.h"
@@ -204,6 +205,30 @@ static void xics_system_init(MachineState *machine, int nr_irqs, Error **errp)
     }
 }
 
+static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr, int nr_irqs,
+                                    Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(TYPE_SPAPR_XIVE);
+    object_property_add_child(OBJECT(spapr), "xive", obj, &error_abort);
+    object_property_set_int(obj, nr_irqs, "nr-irqs",  &local_err);
+    if (local_err) {
+        goto error;
+    }
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        goto error;
+    }
+
+    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+    return SPAPR_XIVE(obj);
+error:
+    error_propagate(errp, local_err);
+    return NULL;
+}
+
 static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
                                   int smt_threads)
 {
@@ -2390,6 +2415,25 @@ static void ppc_spapr_init(MachineState *machine)
     /* Set up Interrupt Controller before we create the VCPUs */
     xics_system_init(machine, XICS_IRQS_SPAPR, &error_fatal);
 
+    if (spapr->xive_exploitation) {
+        /* We don't have KVM support yet, so check for irqchip=on */
+        if (kvm_enabled() && machine_kernel_irqchip_required(machine)) {
+            error_report("kernel_irqchip requested. no XIVE support");
+            exit(1);
+        } else {
+            /* XIVE uses the full range of IRQ numbers. The CPU IPIs
+             * will use the range below XICS_IRQ_BASE, unused by XICS. */
+            spapr->xive =
+                spapr_xive_create(spapr, XICS_IRQ_BASE + XICS_IRQS_SPAPR,
+                                  &error_fatal);
+
+            /* Allocate the first IRQ numbers for the XIVE IPIs */
+            for (i = 0; i < xics_max_server_number(); ++i) {
+                spapr_xive_irq_enable(spapr->xive, i, false);
+            }
+        }
+    }
+
     /* Set up containers for ibm,client-architecture-support negotiated options
      */
     spapr->ov5 = spapr_ovec_new();
@@ -3647,6 +3691,9 @@ static int ics_find_free_block(ICSState *ics, int num, int alignnum)
 static void spapr_irq_set_lsi(sPAPRMachineState *spapr, int irq, bool lsi)
 {
     ics_set_irq_type(spapr->ics, irq - spapr->ics->offset, lsi);
+    if (spapr->xive_exploitation) {
+        spapr_xive_irq_enable(spapr->xive, irq, lsi);
+    }
 }
 
 int spapr_irq_alloc(sPAPRMachineState *spapr, int irq_hint, bool lsi,
@@ -3737,6 +3784,9 @@ void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num)
             memset(&ics->irqs[i], 0, sizeof(ICSIRQState));
         }
     }
+    if (spapr->xive_exploitation) {
+        spapr_xive_irq_disable(spapr->xive, irq);
+    }
 }
 
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 1d6d2c690d7f..addc31dba497 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -14,6 +14,7 @@ struct sPAPRNVRAM;
 typedef struct sPAPREventLogEntry sPAPREventLogEntry;
 typedef struct sPAPREventSource sPAPREventSource;
 typedef struct sPAPRPendingHPT sPAPRPendingHPT;
+typedef struct sPAPRXive sPAPRXive;
 
 #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
 #define SPAPR_ENTRY_POINT       0x100
@@ -128,6 +129,7 @@ struct sPAPRMachineState {
 
     const char *icp_type;
     bool xive_exploitation;
+    sPAPRXive  *xive;
 };
 
 #define H_SUCCESS         0
-- 
2.13.6


* [Qemu-devel] [PATCH v2 12/19] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (10 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 11/19] spapr: add a sPAPRXive object to the machine Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 13/19] spapr: add device tree support for the XIVE " Cédric Le Goater
                   ` (6 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The different XIVE virtualization engines (sources and event queues)
are configured with a set of Hypervisor calls :

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (PQ bits) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines which "target" and "priority" are assigned to a source.

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification configuration associated
   with the queue; only unconditional notification is supported for
   the moment. A reset is performed with a queue size of 0, in which
   case queueing is disabled.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the guest's internal interrupt structures to their
   initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issues a synchronisation on a source to make sure all notifications
   have reached their queue.

Calls that still need to be addressed :

   H_INT_SET_OS_REPORTING_LINE
   H_INT_GET_OS_REPORTING_LINE

See the code for more documentation on each hcall.
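
For readers who want the queue size handling of H_INT_SET_QUEUE_CONFIG
at a glance, here is a minimal standalone sketch. It mirrors the
switch statement in the patch (sizes are log2 of the queue size in
bytes, 0 means reset) and the "size - 12" encoding stored in the EQ
descriptor. The "example_" helper names are hypothetical and not part
of QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * "qsize" is the log2 of the queue size in bytes. The patch accepts
 * 4K, 64K, 2M and 16M queues, plus 0 to reset the queue.
 */
static bool example_qsize_is_valid(uint64_t qsize)
{
    switch (qsize) {
    case 0:  /* reset, queueing disabled */
    case 12: /* 4K */
    case 16: /* 64K */
    case 21: /* 2M */
    case 24: /* 16M */
        return true;
    default:
        return false;
    }
}

/* The QSIZE field holds log2(size) - 12. Only meaningful when the
 * size is non-zero. */
static uint32_t example_qsize_encode(uint64_t qsize)
{
    return (uint32_t)(qsize - 12);
}

/* Inverse of the encoding above, as used when reading the EQ back. */
static uint64_t example_qsize_decode(uint32_t field)
{
    return (uint64_t)field + 12;
}
```

A 64K queue (qsize 16) is therefore stored as 4 in the descriptor and
reported again as 16 by H_INT_GET_QUEUE_CONFIG.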

All sources are emulated under the main XIVE object and share the same
characteristics :

    XIVE_SRC_TRIGGER | XIVE_SRC_STORE_EOI;
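
To make the flag values concrete, the sketch below rebuilds the
characteristics word returned by H_INT_GET_SOURCE_INFO outside of
QEMU. It assumes the usual IBM bit numbering used by PPC_BIT() (bit 0
is the most significant bit of a 64-bit word); the "EX_" and
"example_" names are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB of a 64-bit word. */
#define EX_PPC_BIT(bit)    (0x8000000000000000ULL >> (bit))

#define EX_SRC_LSI         EX_PPC_BIT(61)
#define EX_SRC_TRIGGER     EX_PPC_BIT(62)
#define EX_SRC_STORE_EOI   EX_PPC_BIT(63)

/*
 * All sources share TRIGGER and STORE_EOI; the LSI bit is added for
 * level-sensitive sources only.
 */
static uint64_t example_source_flags(bool is_lsi)
{
    uint64_t flags = EX_SRC_TRIGGER | EX_SRC_STORE_EOI;

    if (is_lsi) {
        flags |= EX_SRC_LSI;
    }
    return flags;
}
```

Under this numbering an MSI source reports 0x3 and an LSI source
reports 0x7.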

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - simplified priority_is_valid() routine (to its minimum)
 - used PPC_BIT() macros to define the hcall flags
 - removed useless casts
 - defined the default characteristic of the single XIVE interrupt
   source to be : *XIVE_SRC_TRIGGER | XIVE_SRC_STORE_EOI*
 - made use of the new spapr_xive_get_eq() prototype
 - removed EQ_W0_UCOND_NOTIFY when the EQ is reset
 - fixed XIVE_EQ_DEBUG support. Offset for the generation bit was wrong

 hw/intc/Makefile.objs       |   2 +-
 hw/intc/spapr_xive_hcall.c  | 859 ++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr.c              |   2 +
 include/hw/ppc/spapr.h      |  15 +-
 include/hw/ppc/spapr_xive.h |   4 +
 5 files changed, 880 insertions(+), 2 deletions(-)
 create mode 100644 hw/intc/spapr_xive_hcall.c

diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 49e13e7aeeee..122e2ec77e8d 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -35,7 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
 obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
-obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
new file mode 100644
index 000000000000..86dec6c02401
--- /dev/null
+++ b/hw/intc/spapr_xive_hcall.c
@@ -0,0 +1,859 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "cpu.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/fdt.h"
+#include "monitor/monitor.h"
+
+#include "xive-internal.h"
+
+/*
+ * OPAL uses the priority 7 queue to automatically escalate interrupts
+ * for all other queues (DD2.X POWER9). So only priorities [0..6] are
+ * allowed for the guest.
+ */
+static bool priority_is_valid(uint8_t priority)
+{
+    switch (priority) {
+    case 0 ... 6:
+        return true;
+    case 7: /* OPAL escalation queue */
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %d requested\n",
+                      priority);
+        return false;
+    }
+}
+
+/*
+ * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
+ * real address of the MMIO page through which the Event State Buffer
+ * entry associated with the value of the "lisn" parameter is managed.
+ *
+ * Parameters:
+ * Input
+ * - "flags"
+ *       Bits 0-63 reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *       "ibm,xive-lisn-ranges" properties, or as returned by the
+ *       ibm,query-interrupt-source-number RTAS call, or as returned
+ *       by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output
+ * - R4: "flags"
+ *       Bits 0-59: Reserved
+ *       Bit 60: H_INT_ESB must be used for Event State Buffer
+ *               management
+ *       Bit 61: 1 == LSI  0 == MSI
+ *       Bit 62: the full function page supports trigger
+ *       Bit 63: Store EOI Supported
+ * - R5: Logical Real address of full function Event State Buffer
+ *       management page, -1 if ESB hcall flag is set to 1.
+ * - R6: Logical Real Address of trigger only Event State Buffer
+ *       management page or -1.
+ * - R7: Power of 2 page size for the ESB management pages returned in
+ *       R5 and R6.
+ */
+
+#define XIVE_SRC_H_INT_ESB     PPC_BIT(60)
+#define XIVE_SRC_LSI           PPC_BIT(61)
+#define XIVE_SRC_TRIGGER       PPC_BIT(62)
+#define XIVE_SRC_STORE_EOI     PPC_BIT(63)
+
+static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          target_ulong opcode,
+                                          target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveIVE *ive;
+    target_ulong flags  = args[0];
+    target_ulong lisn   = args[1];
+    hwaddr mmio_base;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    ive = spapr_xive_get_ive(spapr->xive, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    mmio_base = xive->esb_base + (1ull << ESB_SHIFT) * lisn;
+
+    /* All sources are emulated under the main XIVE object and share
+     * the same characteristics.
+     */
+    args[0] = XIVE_SRC_TRIGGER | XIVE_SRC_STORE_EOI;
+    if (spapr_xive_irq_is_lsi(xive, lisn)) {
+        args[0] |= XIVE_SRC_LSI;
+    }
+
+    /* Match XIVE_SRC_TRIGGER characteristic */
+    args[1] = mmio_base;
+    args[2] = -1; /* No specific trigger page */
+    args[3] = ESB_SHIFT;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
+ * Interrupt Source to a target. The Logical Interrupt Source is
+ * designated with the "lisn" parameter and the target is designated
+ * with the "target" and "priority" parameters.  Upon return from the
+ * hcall(), no additional interrupts will be directed to the old EQ.
+ *
+ * TODO: The old EQ should be investigated for interrupts that
+ * occurred prior to or during the hcall().
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-61: Reserved
+ *      Bit 62: set the "eisn" in the EA
+ *      Bit 63: masks the interrupt source in the hardware interrupt
+ *      control structure. An interrupt masked by this mechanism will
+ *      be dropped, but its source state bits will still be
+ *      set. There is no race-free way of unmasking and restoring the
+ *      source. Thus this should only be used in interrupts that are
+ *      also masked at the source, and only in cases where the
+ *      interrupt is not meant to be used for a large amount of time
+ *      because no valid target exists for it for example
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as returned by
+ *      the H_ALLOCATE_VAS_WINDOW hcall
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *      "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *      "ibm,plat-res-int-priorities"
+ * - "eisn" is the guest EISN associated with the "lisn"
+ *
+ * Output:
+ * - None
+ */
+
+#define XIVE_SRC_SET_EISN PPC_BIT(62)
+#define XIVE_SRC_MASK     PPC_BIT(63)
+
+static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
+                                            sPAPRMachineState *spapr,
+                                            target_ulong opcode,
+                                            target_ulong *args)
+{
+    XiveIVE *ive;
+    uint64_t new_ive;
+    target_ulong flags    = args[0];
+    target_ulong lisn     = args[1];
+    target_ulong target   = args[2];
+    target_ulong priority = args[3];
+    target_ulong eisn     = args[4];
+    uint32_t eq_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~(XIVE_SRC_SET_EISN | XIVE_SRC_MASK)) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    ive = spapr_xive_get_ive(spapr->xive, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    /* priority 0xff is used to reset the IVE */
+    if (priority == 0xff) {
+        new_ive = IVE_VALID | IVE_MASKED;
+        goto out;
+    }
+
+    if (flags & XIVE_SRC_MASK) {
+        new_ive = ive->w | IVE_MASKED;
+    } else {
+        new_ive = ive->w & ~IVE_MASKED;
+    }
+
+    if (!priority_is_valid(priority)) {
+        return H_P4;
+    }
+
+    /* TODO: If the partition thread count is greater than the
+     * hardware thread count, validate the "target" has a
+     * corresponding hardware thread else return H_NOT_AVAILABLE.
+     */
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    if (!spapr_xive_get_eq(spapr->xive, target, priority)) {
+        return H_P3;
+    }
+
+    eq_idx = XIVE_EQ_INDEX(target, priority);
+
+    new_ive = SETFIELD(IVE_EQ_BLOCK, new_ive, 0ul);
+    new_ive = SETFIELD(IVE_EQ_INDEX, new_ive, eq_idx);
+
+    if (flags & XIVE_SRC_SET_EISN) {
+        new_ive = SETFIELD(IVE_EQ_DATA, new_ive, eisn);
+    }
+
+out:
+    /* TODO: handle syncs ? */
+
+    /* And update */
+    ive->w = new_ive;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
+ * target/priority pair is assigned to the specified Logical Interrupt
+ * Source.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63 Reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output:
+ * - R4: Target to which the specified Logical Interrupt Source is
+ *       assigned
+ * - R5: Priority to which the specified Logical Interrupt Source is
+ *       assigned
+ * - R6: EISN for the specified Logical Interrupt Source (this will be
+ *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
+ */
+static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
+                                            sPAPRMachineState *spapr,
+                                            target_ulong opcode,
+                                            target_ulong *args)
+{
+    target_ulong flags = args[0];
+    target_ulong lisn = args[1];
+    XiveIVE *ive;
+    XiveEQ *eq;
+    uint32_t eq_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    ive = spapr_xive_get_ive(spapr->xive, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+    eq = spapr_xive_get_eq(spapr->xive, XIVE_EQ_INDEX_SERVER(eq_idx),
+                           XIVE_EQ_INDEX_PRIO(eq_idx));
+    if (!eq) {
+        /* Not sure what to return here */
+        return H_HARDWARE;
+    }
+
+    args[0] = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
+
+    if (ive->w & IVE_MASKED) {
+        args[1] = 0xff;
+    } else {
+        args[1] = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
+    }
+
+    args[2] = GETFIELD(IVE_EQ_DATA, ive->w);
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
+ * address of the notification management page associated with the
+ * specified target and priority.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *       Bits 0-63 Reserved
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ *
+ * Output:
+ * - R4: Logical real address of notification page
+ * - R5: Power of 2 page size of the notification page
+ */
+static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         target_ulong opcode,
+                                         target_ulong *args)
+{
+    target_ulong flags    = args[0];
+    target_ulong target   = args[1];
+    target_ulong priority = args[2];
+    XiveEQ *eq;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!priority_is_valid(priority)) {
+        return H_P3;
+    }
+
+    /* TODO: If the partition thread count is greater than the
+     * hardware thread count, validate the "target" has a
+     * corresponding hardware thread else return H_NOT_AVAILABLE.
+     */
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    eq = spapr_xive_get_eq(spapr->xive, target, priority);
+    if (!eq)  {
+        return H_P2;
+    }
+
+    args[0] = -1; /* TODO: return ESn page */
+    if (eq->w0 & EQ_W0_ENQUEUE) {
+        args[1] = GETFIELD(EQ_W0_QSIZE, eq->w0) + 12;
+    } else {
+        args[1] = 0;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for
+ * a given "target" and "priority".  It is also used to set the
+ * notification config associated with the EQ.  An EQ size of 0 is
+ * used to reset the EQ config for a given target and priority. If
+ * resetting the EQ config, the END associated with the given "target"
+ * and "priority" will be changed to disable queueing.
+ *
+ * Upon return from the hcall(), no additional interrupts will be
+ * directed to the old EQ (if one was set). The old EQ (if one was
+ * set) should be investigated for interrupts that occurred prior to
+ * or during the hcall().
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Unconditional Notify (n) per the XIVE spec
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ * - "eventQueue": The logical real address of the start of the EQ
+ * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
+ *
+ * Output:
+ * - None
+ */
+
+#define XIVE_EQ_ALWAYS_NOTIFY PPC_BIT(63)
+
+static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
+                                           sPAPRMachineState *spapr,
+                                           target_ulong opcode,
+                                           target_ulong *args)
+{
+    target_ulong flags    = args[0];
+    target_ulong target   = args[1];
+    target_ulong priority = args[2];
+    target_ulong qpage    = args[3];
+    target_ulong qsize    = args[4];
+    XiveEQ *old_eq;
+    XiveEQ eq;
+    uint32_t qdata;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~XIVE_EQ_ALWAYS_NOTIFY) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!priority_is_valid(priority)) {
+        return H_P3;
+    }
+
+    /* TODO: If the partition thread count is greater than the
+     * hardware thread count, validate the "target" has a
+     * corresponding hardware thread else return H_NOT_AVAILABLE.
+     */
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    old_eq = spapr_xive_get_eq(spapr->xive, target, priority);
+    if (!old_eq)  {
+        return H_P2;
+    }
+
+    eq = *old_eq;
+
+    switch (qsize) {
+    case 12:
+    case 16:
+    case 21:
+    case 24:
+        eq.w3 = ((uint64_t)qpage) & 0xffffffff;
+        eq.w2 = (((uint64_t)qpage)) >> 32 & 0x0fffffff;
+        eq.w0 |= EQ_W0_ENQUEUE;
+        eq.w0 = SETFIELD(EQ_W0_QSIZE, eq.w0, qsize - 12);
+        break;
+    case 0:
+        /* reset queue and disable queueing */
+        eq.w2 = eq.w3 = 0;
+        eq.w0 &= ~EQ_W0_ENQUEUE;
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
+                      qsize);
+        return H_P5;
+    }
+
+    if (qsize) {
+        /*
+         * Let's validate the EQ address with a read of the first EQ
+         * entry. We could also check that the full queue has been
+         * zeroed by the OS.
+         */
+        if (address_space_read(&address_space_memory, qpage,
+                               MEMTXATTRS_UNSPECIFIED,
+                               (uint8_t *) &qdata, sizeof(qdata))) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
+                          HWADDR_PRIx "\n", qpage);
+            return H_P4;
+        }
+    }
+
+    /* Ensure the priority and target are correctly set (they will not
+     * be right after allocation)
+     */
+    eq.w6 = SETFIELD(EQ_W6_NVT_BLOCK, 0ul, 0ul) |
+        SETFIELD(EQ_W6_NVT_INDEX, 0ul, target);
+    eq.w7 = SETFIELD(EQ_W7_F0_PRIORITY, 0ul, priority);
+
+    /* TODO: depends on notification page (ESn) from H_INT_GET_QUEUE_INFO */
+    if (flags & XIVE_EQ_ALWAYS_NOTIFY) {
+        eq.w0 |= EQ_W0_UCOND_NOTIFY;
+    } else {
+        eq.w0 &= ~EQ_W0_UCOND_NOTIFY;
+    }
+
+    /* The generation bit for the EQ starts at 1 and the EQ page
+     * offset counter starts at 0.
+     */
+    eq.w1 = EQ_W1_GENERATION | SETFIELD(EQ_W1_PAGE_OFF, 0ul, 0ul);
+    eq.w0 |= EQ_W0_VALID;
+
+    /* TODO: issue syncs required to ensure all in-flight interrupts
+     * are complete on the old EQ */
+
+    /* Update EQ */
+    *old_eq = eq;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_QUEUE_CONFIG hcall() is used to get an EQ for a given
+ * target and priority.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Debug: Return debug data
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ *
+ * Output:
+ * - R4: "flags":
+ *       Bits 0-61: Reserved
+ *       Bit 62: The value of Event Queue Generation Number (g) per
+ *              the XIVE spec if "Debug" = 1
+ *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
+ * - R5: The logical real address of the start of the EQ
+ * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
+ * - R7: The value of Event Queue Offset Counter per XIVE spec
+ *       if "Debug" = 1, else 0
+ *
+ */
+
+#define XIVE_EQ_DEBUG     PPC_BIT(63)
+
+static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
+                                           sPAPRMachineState *spapr,
+                                           target_ulong opcode,
+                                           target_ulong *args)
+{
+    target_ulong flags    = args[0];
+    target_ulong target   = args[1];
+    target_ulong priority = args[2];
+    XiveEQ *eq;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~XIVE_EQ_DEBUG) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!priority_is_valid(priority)) {
+        return H_P3;
+    }
+
+    /* TODO: If the partition thread count is greater than the
+     * hardware thread count, validate the "target" has a
+     * corresponding hardware thread else return H_NOT_AVAILABLE.
+     */
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    eq = spapr_xive_get_eq(spapr->xive, target, priority);
+    if (!eq)  {
+        return H_P2;
+    }
+
+    args[0] = 0;
+    if (eq->w0 & EQ_W0_UCOND_NOTIFY) {
+        args[0] |= XIVE_EQ_ALWAYS_NOTIFY;
+    }
+
+    if (eq->w0 & EQ_W0_ENQUEUE) {
+        args[1] =
+            (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
+        args[2] = GETFIELD(EQ_W0_QSIZE, eq->w0) + 12;
+    } else {
+        args[1] = 0;
+        args[2] = 0;
+    }
+
+    /* TODO: do we need any locking on the EQ ? */
+    if (flags & XIVE_EQ_DEBUG) {
+        /* Load the event queue generation number into the return flags */
+        args[0] |= (uint64_t)GETFIELD(EQ_W1_GENERATION, eq->w1) << 62;
+
+        /* Load R7 with the event queue offset counter */
+        args[3] = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
+    } else {
+        args[3] = 0;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
+ * reporting cache line pair for the calling thread.  The reporting
+ * cache lines will contain the OS interrupt context when the OS
+ * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
+ * interrupt. The reporting cache lines can be reset by inputting -1
+ * in "reportingLine".  Issuing the CI store byte without reporting
+ * cache lines registered will result in the data not being accessible
+ * to the OS.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "reportingLine": The logical real address of the reporting cache
+ *    line pair
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
+                                                sPAPRMachineState *spapr,
+                                                target_ulong opcode,
+                                                target_ulong *args)
+{
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    /* TODO: H_INT_SET_OS_REPORTING_LINE */
+    return H_FUNCTION;
+}
+
+/*
+ * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
+ * real address of the reporting cache line pair set for the input
+ * "target".  If no reporting cache line pair has been set, -1 is
+ * returned.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "reportingLine": The logical real address of the reporting cache
+ *   line pair
+ *
+ * Output:
+ * - R4: The logical real address of the reporting line if set, else -1
+ */
+static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
+                                                sPAPRMachineState *spapr,
+                                                target_ulong opcode,
+                                                target_ulong *args)
+{
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    /* TODO: H_INT_GET_OS_REPORTING_LINE */
+    return H_FUNCTION;
+}
+
+/*
+ * The H_INT_ESB hcall() is used to issue a load or store to the ESB
+ * page for the input "lisn".  This hcall is only supported for LISNs
+ * that have the ESB hcall flag set to 1 when returned from hcall()
+ * H_INT_GET_SOURCE_INFO.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Store: Store=1, store operation, else load operation
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ * - "esbOffset" is the offset into the ESB page for the load or store operation
+ * - "storeData" is the data to write for a store operation
+ *
+ * Output:
+ * - R4: The value of the load if load operation, else -1
+ */
+
+#define XIVE_ESB_STORE PPC_BIT(63)
+
+static target_ulong h_int_esb(PowerPCCPU *cpu,
+                              sPAPRMachineState *spapr,
+                              target_ulong opcode,
+                              target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveIVE *ive;
+    target_ulong flags   = args[0];
+    target_ulong lisn    = args[1];
+    target_ulong offset  = args[2];
+    target_ulong data    = args[3];
+    hwaddr esb_base;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~XIVE_ESB_STORE) {
+        return H_PARAMETER;
+    }
+
+    ive = spapr_xive_get_ive(xive, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    if (offset >= (1ull << ESB_SHIFT)) {
+        return H_P3;
+    }
+
+    esb_base = xive->esb_base + (1ull << ESB_SHIFT) * lisn;
+    esb_base += offset;
+
+    if (dma_memory_rw(&address_space_memory, esb_base, &data, 8,
+                      (flags & XIVE_ESB_STORE))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
+                      HWADDR_PRIx "\n", esb_base);
+        return H_HARDWARE;
+    }
+    args[0] = (flags & XIVE_ESB_STORE) ? -1 : data;
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SYNC hcall() is used to issue hardware syncs that will
+ * ensure any in flight events for the input lisn are in the event
+ * queue.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_sync(PowerPCCPU *cpu,
+                               sPAPRMachineState *spapr,
+                               target_ulong opcode,
+                               target_ulong *args)
+{
+    XiveIVE *ive;
+    target_ulong flags   = args[0];
+    target_ulong lisn    = args[1];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    ive = spapr_xive_get_ive(spapr->xive, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    /* This is not real hardware. Nothing to be done */
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_RESET hcall() is used to reset all of the partition's
+ * interrupt exploitation structures to their initial state.  This
+ * means losing all previously set interrupt state set via
+ * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_reset(PowerPCCPU *cpu,
+                                sPAPRMachineState *spapr,
+                                target_ulong opcode,
+                                target_ulong *args)
+{
+    target_ulong flags   = args[0];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    device_reset(DEVICE(spapr->xive));
+    return H_SUCCESS;
+}
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr)
+{
+    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
+    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
+    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
+    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
+    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
+    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
+    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
+                             h_int_set_os_reporting_line);
+    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
+                             h_int_get_os_reporting_line);
+    spapr_register_hypercall(H_INT_ESB, h_int_esb);
+    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
+    spapr_register_hypercall(H_INT_RESET, h_int_reset);
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 195a48399e4b..1a71bf613b9e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -222,6 +222,8 @@ static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr, int nr_irqs,
         goto error;
     }
 
+    spapr_xive_hcall_init(spapr);
+
     qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
     return SPAPR_XIVE(obj);
 error:
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index addc31dba497..ad923e668946 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -388,7 +388,20 @@ struct sPAPRMachineState {
 #define H_INVALIDATE_PID        0x378
 #define H_REGISTER_PROC_TBL     0x37C
 #define H_SIGNAL_SYS_RESET      0x380
-#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
+
+#define H_INT_GET_SOURCE_INFO   0x3A8
+#define H_INT_SET_SOURCE_CONFIG 0x3AC
+#define H_INT_GET_SOURCE_CONFIG 0x3B0
+#define H_INT_GET_QUEUE_INFO    0x3B4
+#define H_INT_SET_QUEUE_CONFIG  0x3B8
+#define H_INT_GET_QUEUE_CONFIG  0x3BC
+#define H_INT_SET_OS_REPORTING_LINE 0x3C0
+#define H_INT_GET_OS_REPORTING_LINE 0x3C4
+#define H_INT_ESB               0x3C8
+#define H_INT_SYNC              0x3CC
+#define H_INT_RESET             0x3D0
+
+#define MAX_HCALL_OPCODE        H_INT_RESET
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index dcaa69025878..0385df69b028 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -60,4 +60,8 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon);
 
+typedef struct sPAPRMachineState sPAPRMachineState;
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+
 #endif /* PPC_SPAPR_XIVE_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v2 13/19] spapr: add device tree support for the XIVE interrupt mode
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (11 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 12/19] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 14/19] spapr: introduce a helper to map the XIVE memory regions Cédric Le Goater
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The XIVE interface for the guest is described in the device tree under
the "interrupt-controller" node. A couple of new properties are
specific to XIVE :

 - "reg"

   contains the base address and size of the thread interrupt
   management areas (TIMA), also called rings, for the User level and
   for the Guest OS level. Only the Guest OS level is taken into
   account today.

 - "ibm,xive-eq-sizes"

   the size of the event queues. One cell per size supported, contains
   log2 of size, in ascending order.

 - "ibm,xive-lisn-ranges"

   the interrupt number ranges assigned to the guest for the IPIs.

and also under the root node :

 - "ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use. OPAL uses the priority 7 queue to automatically
   escalate interrupts for all other queues (DD2.X POWER9). So only
   priorities [0..6] are allowed for the guest.

When the XIVE exploitation interrupt mode is activated after the CAS
negotiation, the machine will perform a reboot to rebuild the device
tree.
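
As an illustration of how a guest might consume these properties, here
is a minimal sketch, not taken from the patch. The struct and function
names are hypothetical, and the cells are assumed to have already been
converted from big endian to host order. It checks a priority against
the (start, count) pairs of "ibm,plat-res-int-priorities" and a queue
size against the log2 values of "ibm,xive-eq-sizes".

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mirror of one (start, count) pair read from the
 * "ibm,plat-res-int-priorities" property, in host byte order. */
struct prio_range {
    uint32_t start;
    uint32_t count;
};

/* Return true when 'prio' falls inside any reserved range. */
static bool priority_is_reserved(uint32_t prio,
                                 const struct prio_range *ranges, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (prio >= ranges[i].start &&
            prio - ranges[i].start < ranges[i].count) {
            return true;
        }
    }
    return false;
}

/* Return true when 'shift' (log2 of the queue size in bytes) is one
 * of the cells read from "ibm,xive-eq-sizes". */
static bool eq_shift_is_supported(uint32_t shift,
                                  const uint32_t *shifts, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (shifts[i] == shift) {
            return true;
        }
    }
    return false;
}
```

With the values used in this patch, a single range { 7, 0xf8 } leaves
priorities 0 to 6 to the guest, and the supported queue shifts are 12,
16, 21 and 24.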

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:

 - added a unit id to the nodename
 - added properties for the LSIs
 - simplified the array for the "ibm,plat-res-int-priorities"  property
 - renamed to spapr_dt_xive()

 hw/intc/spapr_xive_hcall.c  | 64 +++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr.c              |  7 ++++-
 hw/ppc/spapr_hcall.c        |  6 +++++
 include/hw/ppc/spapr_xive.h |  2 ++
 4 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
index 86dec6c02401..8aa2fb8a32d1 100644
--- a/hw/intc/spapr_xive_hcall.c
+++ b/hw/intc/spapr_xive_hcall.c
@@ -857,3 +857,67 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
     spapr_register_hypercall(H_INT_SYNC, h_int_sync);
     spapr_register_hypercall(H_INT_RESET, h_int_reset);
 }
+
+void spapr_dt_xive(sPAPRMachineState *spapr, int nr_servers,
+                         void *fdt, uint32_t phandle)
+{
+    sPAPRXive *xive = spapr->xive;
+    int node;
+    uint64_t timas[2 * 2];
+    /* Interrupt number ranges for the IPIs */
+    uint32_t lisn_ranges[] = {
+        cpu_to_be32(0),
+        cpu_to_be32(nr_servers),
+    };
+    uint32_t eq_sizes[] = {
+        cpu_to_be32(12), /* 4K */
+        cpu_to_be32(16), /* 64K */
+        cpu_to_be32(21), /* 2M */
+        cpu_to_be32(24), /* 16M */
+    };
+    /* The following array is in sync with the 'priority_is_valid'
+     * routine above. Linux is expected to choose priority 6.
+     */
+    uint32_t plat_res_int_priorities[] = {
+        cpu_to_be32(7),    /* start */
+        cpu_to_be32(0xf8), /* count */
+    };
+    int i;
+    gchar *nodename;
+
+    /* Thread Interrupt Management Area : User and OS views */
+    for (i = 0; i < 2; i++) {
+        timas[i * 2] = cpu_to_be64(xive->tm_base + i * (1ull << TM_SHIFT));
+        timas[i * 2 + 1] = cpu_to_be64(1ull << TM_SHIFT);
+    }
+
+    nodename = g_strdup_printf("interrupt-controller@%" PRIx64, xive->tm_base);
+    _FDT(node = fdt_add_subnode(fdt, 0, nodename));
+    g_free(nodename);
+
+    _FDT(fdt_setprop_string(fdt, node, "device_type", "power-ivpe"));
+    _FDT(fdt_setprop(fdt, node, "reg", timas, sizeof(timas)));
+
+    _FDT(fdt_setprop_string(fdt, node, "compatible", "ibm,power-ivpe"));
+    _FDT(fdt_setprop(fdt, node, "ibm,xive-eq-sizes", eq_sizes,
+                     sizeof(eq_sizes)));
+    _FDT(fdt_setprop(fdt, node, "ibm,xive-lisn-ranges", lisn_ranges,
+                     sizeof(lisn_ranges)));
+
+    /* For Linux to link the LSIs to the main interrupt controller.
+     * These properties are not in XIVE exploitation mode sPAPR
+     * specs
+     */
+    _FDT(fdt_setprop(fdt, node, "interrupt-controller", NULL, 0));
+    _FDT(fdt_setprop_cell(fdt, node, "#interrupt-cells", 2));
+
+    /* For SLOF */
+    _FDT(fdt_setprop_cell(fdt, node, "linux,phandle", phandle));
+    _FDT(fdt_setprop_cell(fdt, node, "phandle", phandle));
+
+    /* The "ibm,plat-res-int-priorities" property defines the priority
+     * ranges reserved by the hypervisor
+     */
+    _FDT(fdt_setprop(fdt, 0, "ibm,plat-res-int-priorities",
+                     plat_res_int_priorities, sizeof(plat_res_int_priorities)));
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 1a71bf613b9e..2e15ee8a9333 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1155,7 +1155,12 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr,
     _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
 
     /* /interrupt controller */
-    spapr_dt_xics(xics_max_server_number(), fdt, PHANDLE_XICP);
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_dt_xics(xics_max_server_number(), fdt, PHANDLE_XICP);
+    } else {
+        /* Populate device tree for XIVE */
+        spapr_dt_xive(spapr, xics_max_server_number(), fdt, PHANDLE_XICP);
+    }
 
     ret = spapr_populate_memory(spapr, fdt);
     if (ret < 0) {
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index be22a6b2895f..e2a1665beee9 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1646,6 +1646,12 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
             (spapr_h_cas_compose_response(spapr, args[1], args[2],
                                           ov5_updates) != 0);
     }
+
+    /* We need to rebuild the device tree for XIVE, generate a reset */
+    if (!spapr->cas_reboot) {
+        spapr->cas_reboot = spapr_ovec_test(ov5_updates, OV5_XIVE_EXPLOIT);
+    }
+
     spapr_ovec_cleanup(ov5_updates);
 
     if (spapr->cas_reboot) {
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 0385df69b028..8c3b9cb194a9 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -63,5 +63,7 @@ void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon);
 typedef struct sPAPRMachineState sPAPRMachineState;
 
 void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+void spapr_dt_xive(sPAPRMachineState *spapr, int nr_servers, void *fdt,
+                         uint32_t phandle);
 
 #endif /* PPC_SPAPR_XIVE_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v2 14/19] spapr: introduce a helper to map the XIVE memory regions
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (12 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 13/19] spapr: add device tree support for the XIVE " Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 15/19] spapr: add XIVE support to spapr_qirq() Cédric Le Goater
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

When the XIVE exploitation interrupt mode is activated, the machine
needs to expose to the guest the MMIO regions used by the controller :

  - Event State Buffers
  - Thread Interrupt Management Area for the OS and User views

Migration will also need to reflect the current interrupt mode in use.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:
 
 - moved the mapping of the XIVE memory region under the machine reset
   handler.

 hw/intc/spapr_xive.c        | 10 ++++++++++
 hw/ppc/spapr.c              | 10 ++++++++++
 include/hw/ppc/spapr_xive.h |  1 +
 3 files changed, 21 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index bf30edc87bee..fcdadf727f9d 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -970,3 +970,13 @@ XiveEQ *spapr_xive_get_eq(sPAPRXive *xive, uint32_t server, uint8_t priority)
     }
     return &nvt->eqt[priority];
 }
+
+void spapr_xive_mmio_map(sPAPRXive *xive)
+{
+    /* ESBs */
+    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->esb_base);
+
+    /* Thread Interrupt Management Area: User and OS views */
+    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 1, xive->tm_base);
+    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 2, xive->tm_base + (1 << TM_SHIFT));
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 2e15ee8a9333..73df038a9e8b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1537,6 +1537,11 @@ static void ppc_spapr_reset(void)
         ppc_set_compat_all(spapr->max_compat_pvr, &error_fatal);
     }
 
+    /* Setup XIVE resources if required by CAS */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_xive_mmio_map(spapr->xive);
+    }
+
     fdt = spapr_build_fdt(spapr, rtas_addr, spapr->rtas_size);
 
     spapr_load_rtas(spapr, fdt, rtas_addr);
@@ -1644,6 +1649,11 @@ static int spapr_post_load(void *opaque, int version_id)
         }
     }
 
+    /* Restore XIVE resources if required by CAS */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_xive_mmio_map(spapr->xive);
+    }
+
     return err;
 }
 
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 8c3b9cb194a9..5d0c178a4984 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -59,6 +59,7 @@ bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon);
+void spapr_xive_mmio_map(sPAPRXive *xive);
 
 typedef struct sPAPRMachineState sPAPRMachineState;
 
-- 
2.13.6

* [Qemu-devel] [PATCH v2 15/19] spapr: add XIVE support to spapr_qirq()
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (13 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 14/19] spapr: introduce a helper to map the XIVE memory regions Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 16/19] spapr: introduce a spapr_icp_create() helper Cédric Le Goater
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

The XIVE object has its own set of qirqs which is to be used when the
XIVE exploitation interrupt mode is activated.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:
 
 - introduced a spapr_xive_qirq() helper

 hw/intc/spapr_xive.c        | 12 ++++++++++++
 hw/ppc/spapr.c              |  4 ++++
 include/hw/ppc/spapr_xive.h |  1 +
 3 files changed, 17 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index fcdadf727f9d..e650ed69eb70 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -980,3 +980,15 @@ void spapr_xive_mmio_map(sPAPRXive *xive)
     sysbus_mmio_map(SYS_BUS_DEVICE(xive), 1, xive->tm_base);
     sysbus_mmio_map(SYS_BUS_DEVICE(xive), 2, xive->tm_base + (1 << TM_SHIFT));
 }
+
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn)
+{
+    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
+
+    if (!ive || !(ive->w & IVE_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
+        return NULL;
+    }
+
+    return xive->qirqs[lisn];
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 73df038a9e8b..d117fbd5ce9d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3810,6 +3810,10 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
 {
     ICSState *ics = spapr->ics;
 
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return spapr_xive_qirq(spapr->xive, irq);
+    }
+
     if (ics_valid_irq(ics, irq)) {
         return ics->qirqs[irq - ics->offset];
     }
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 5d0c178a4984..8eefb09999de 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -60,6 +60,7 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon);
 void spapr_xive_mmio_map(sPAPRXive *xive);
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn);
 
 typedef struct sPAPRMachineState sPAPRMachineState;
 
-- 
2.13.6

* [Qemu-devel] [PATCH v2 16/19] spapr: introduce a spapr_icp_create() helper
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (14 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 15/19] spapr: add XIVE support to spapr_qirq() Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 17/19] spapr: toggle the ICP depending on the selected interrupt mode Cédric Le Goater
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

On sPAPR, the creation of the interrupt presenter depends on some of
the machine attributes. When the XIVE exploitation interrupt mode is
available, this will get more complex. So provide a machine-level
helper to isolate the process and hide the details from the sPAPR core
realize function.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 hw/ppc/spapr.c          | 14 ++++++++++++++
 hw/ppc/spapr_cpu_core.c |  3 +--
 include/hw/ppc/spapr.h  |  2 ++
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d117fbd5ce9d..65fca10e5b30 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3821,6 +3821,20 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
     return NULL;
 }
 
+Object *spapr_icp_create(sPAPRMachineState *spapr, Object *cpu, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    return obj;
+}
+
 static void spapr_pic_print_info(InterruptStatsProvider *obj,
                                  Monitor *mon)
 {
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 032438b9ce70..1bfe3ff55058 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -121,8 +121,7 @@ static void spapr_cpu_core_realize_child(Object *child,
         goto error;
     }
 
-    cpu->intc = icp_create(child, spapr->icp_type, XICS_FABRIC(spapr),
-                           &local_err);
+    cpu->intc = spapr_icp_create(spapr, child, &local_err);
     if (local_err) {
         goto error;
     }
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index ad923e668946..40bda1a34607 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -740,4 +740,6 @@ int spapr_irq_alloc_block(sPAPRMachineState *spapr, int num, bool lsi,
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
 
+Object *spapr_icp_create(sPAPRMachineState *spapr, Object *cpu, Error **errp);
+
 #endif /* HW_SPAPR_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v2 17/19] spapr: toggle the ICP depending on the selected interrupt mode
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (15 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 16/19] spapr: introduce a spapr_icp_create() helper Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 18/19] spapr: add support to dump XIVE information Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 19/19] spapr: advertise XIVE exploitation mode in CAS Cédric Le Goater
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Each interrupt mode has its own interrupt presenter object, one for
XICS and one for XIVE, and both are stored as children of the CPU
object. The active presenter, corresponding to the current interrupt
mode, is selected with a lookup on the children of the CPU.

Migration and CPU hotplug also need to reflect the current interrupt
mode in use.
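
The selection can be pictured with the following standalone sketch. It
does not use the QOM API; the structs, the type strings and the
function name are hypothetical stand-ins for the CPU object, its
children and the lookup by type.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for an interrupt presenter object. */
struct presenter {
    const char *type;
};

/* Hypothetical stand-in for a CPU holding both presenters as
 * children, plus a pointer to the active one. */
struct cpu {
    struct presenter *children[2];
    size_t nr_children;
    struct presenter *intc;
};

/* Make the child of the given type the active presenter.
 * Returns 0 on success, -1 when no child has that type, in which
 * case the active presenter is left unchanged. */
static int cpu_set_presenter(struct cpu *cpu, const char *type)
{
    for (size_t i = 0; i < cpu->nr_children; i++) {
        if (strcmp(cpu->children[i]->type, type) == 0) {
            cpu->intc = cpu->children[i];
            return 0;
        }
    }
    return -1;
}
```

The point of the design is that both presenters always exist, so a
mode change at reset or after migration only moves a pointer.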

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v1:
 
 - conditioned the creation of the sPAPRXiveNVT object on the
   xive_exploitation bool, which is false on older pseries machines.
 - moved the setting of the ICP under the machine reset handler.
 - introduced a spapr_xive_nvt_create() helper
 - handled errors in spapr_post_load() to return EINVAL

 hw/intc/spapr_xive.c            | 19 +++++++++++++++++++
 hw/ppc/spapr.c                  | 29 +++++++++++++++++++++++++++++
 hw/ppc/spapr_cpu_core.c         | 34 ++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr_cpu_core.h |  1 +
 include/hw/ppc/spapr_xive.h     |  1 +
 5 files changed, 84 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index e650ed69eb70..8e6997bb1deb 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -992,3 +992,22 @@ qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn)
 
     return xive->qirqs[lisn];
 }
+
+Object *spapr_xive_nvt_create(Object *cpu, const char *type, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(type);
+    object_property_add_child(cpu, type, obj, &error_abort);
+    object_unref(obj);
+    object_property_add_const_link(obj, ICP_PROP_CPU, cpu, &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        object_unparent(obj);
+        error_propagate(errp, local_err);
+        obj = NULL;
+    }
+
+    return obj;
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 65fca10e5b30..4d7a3d64e51e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1539,7 +1539,10 @@ static void ppc_spapr_reset(void)
 
     /* Setup XIVE resources if required by CAS */
     if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_cpu_core_set_icp(TYPE_SPAPR_XIVE_NVT, &error_fatal);
         spapr_xive_mmio_map(spapr->xive);
+    } else {
+        spapr_cpu_core_set_icp(spapr->icp_type, &error_fatal);
     }
 
     fdt = spapr_build_fdt(spapr, rtas_addr, spapr->rtas_size);
@@ -1651,7 +1654,13 @@ static int spapr_post_load(void *opaque, int version_id)
 
     /* Restore XIVE resources if required by CAS */
     if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        Error *local_err = NULL;
+
         spapr_xive_mmio_map(spapr->xive);
+        spapr_cpu_core_set_icp(TYPE_SPAPR_XIVE_NVT, &local_err);
+        if (local_err) {
+            return -EINVAL;
+        }
     }
 
     return err;
@@ -3832,6 +3841,26 @@ Object *spapr_icp_create(sPAPRMachineState *spapr, Object *cpu, Error **errp)
         return NULL;
     }
 
+    if (spapr->xive_exploitation) {
+        Object *obj_xive;
+
+        /* Add a XIVE interrupt presenter. The machine will switch
+         * the CPU ICP depending on the interrupt model negotiated
+         * at CAS time.
+         */
+        obj_xive = spapr_xive_nvt_create(cpu, TYPE_SPAPR_XIVE_NVT, &local_err);
+        if (local_err) {
+            object_unparent(obj);
+            error_propagate(errp, local_err);
+            return NULL;
+        }
+
+        /* when hotplugged, the CPU should have the correct ICP */
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+            return obj_xive;
+        }
+    }
+
     return obj;
 }
 
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 1bfe3ff55058..e5c39cdec998 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -253,3 +253,37 @@ static const TypeInfo spapr_cpu_core_type_infos[] = {
 };
 
 DEFINE_TYPES(spapr_cpu_core_type_infos)
+
+typedef struct ForeachFindICPArgs {
+    const char *icp_type;
+    Object *icp;
+} ForeachFindICPArgs;
+
+static int spapr_cpu_core_find_icp(Object *child, void *opaque)
+{
+    ForeachFindICPArgs *args = opaque;
+
+    if (object_dynamic_cast(child, args->icp_type)) {
+        args->icp = child;
+    }
+
+    return args->icp != NULL;
+}
+
+void spapr_cpu_core_set_icp(const char *icp_type, Error **errp)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        ForeachFindICPArgs args = { icp_type, NULL };
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        object_child_foreach(OBJECT(cs), spapr_cpu_core_find_icp, &args);
+        if (!args.icp) {
+            error_setg(errp, "Couldn't find a '%s' icp", icp_type);
+            return;
+        }
+
+        cpu->intc = args.icp;
+    }
+}
diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
index 1129f344aa0c..916aea6137a6 100644
--- a/include/hw/ppc/spapr_cpu_core.h
+++ b/include/hw/ppc/spapr_cpu_core.h
@@ -38,4 +38,5 @@ typedef struct sPAPRCPUCoreClass {
 } sPAPRCPUCoreClass;
 
 const char *spapr_get_cpu_core_type(const char *cpu_type);
+void spapr_cpu_core_set_icp(const char *icp_type, Error **errp);
 #endif
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 8eefb09999de..22eb6e8a4d01 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -59,6 +59,7 @@ bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 void spapr_xive_nvt_pic_print_info(sPAPRXiveNVT *nvt, Monitor *mon);
+Object *spapr_xive_nvt_create(Object *cpu, const char *type, Error **errp);
 void spapr_xive_mmio_map(sPAPRXive *xive);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn);
 
-- 
2.13.6

* [Qemu-devel] [PATCH v2 18/19] spapr: add support to dump XIVE information
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (16 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 17/19] spapr: toggle the ICP depending on the selected interrupt mode Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 19/19] spapr: advertise XIVE exploitation mode in CAS Cédric Le Goater
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Modify the InterruptStatsProvider output to reflect the interrupt mode
currently in use by the machine.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4d7a3d64e51e..867c9d759f3b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3873,10 +3873,18 @@ static void spapr_pic_print_info(InterruptStatsProvider *obj,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        icp_pic_print_info(ICP(cpu->intc), mon);
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+            spapr_xive_nvt_pic_print_info(SPAPR_XIVE_NVT(cpu->intc), mon);
+        } else {
+            icp_pic_print_info(ICP(cpu->intc), mon);
+        }
     }
 
-    ics_pic_print_info(spapr->ics, mon);
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_xive_pic_print_info(spapr->xive, mon);
+    } else {
+        ics_pic_print_info(spapr->ics, mon);
+    }
 }
 
 int spapr_vcpu_id(PowerPCCPU *cpu)
-- 
2.13.6

* [Qemu-devel] [PATCH v2 19/19] spapr: advertise XIVE exploitation mode in CAS
  2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (17 preceding siblings ...)
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 18/19] spapr: add support to dump XIVE information Cédric Le Goater
@ 2017-12-09  8:43 ` Cédric Le Goater
  18 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09  8:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz
  Cc: Cédric Le Goater

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 867c9d759f3b..e52c510812d9 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -989,10 +989,11 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
     spapr_dt_rtas_tokens(fdt, rtas);
 }
 
-/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
- * that the guest may request and thus the valid values for bytes 24..26 of
- * option vector 5: */
-static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
+/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
+ * and the XIVE features that the guest may request and thus the valid
+ * values for bytes 23..26 of option vector 5: */
+static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
+                                          int chosen)
 {
     PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
 
@@ -1015,7 +1016,16 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
         } else {
             val[3] = 0x00; /* Hash */
         }
+        /* TODO: introduce a kvmppc_has_cap_xive() ? Works with
+         * irqchip=off for now
+         */
+        if (spapr->xive_exploitation) {
+            val[1] = 0x80; /* OV5_XIVE_BOTH */
+        }
     } else {
+        if (spapr->xive_exploitation) {
+            val[1] = 0x80; /* OV5_XIVE_BOTH */
+        }
         /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
         val[3] = 0xC0;
     }
@@ -1076,7 +1086,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
         _FDT(fdt_setprop_string(fdt, chosen, "linux,stdout-path", stdout_path));
     }
 
-    spapr_dt_ov5_platform_support(fdt, chosen);
+    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
 
     g_free(stdout_path);
     g_free(bootlist);
@@ -2487,6 +2497,11 @@ static void ppc_spapr_init(MachineState *machine)
         spapr_ovec_set(spapr->ov5, OV5_HPT_RESIZE);
     }
 
+    /* advertise XIVE if not disabled by the user */
+    if (spapr->xive_exploitation) {
+        spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
+    }
+
     /* init CPUs */
     spapr_set_vsmt_mode(spapr, &error_fatal);
 
-- 
2.13.6

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller Cédric Le Goater
@ 2017-12-09 14:06   ` Cédric Le Goater
  2017-12-20  5:09   ` David Gibson
  1 sibling, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-09 14:06 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz

> +static const VMStateDescription vmstate_spapr_xive = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = vmstate_spapr_xive_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> +        VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
> +                                     vmstate_spapr_xive_ive, XiveIVE),

I got it wrong again. This should be :

        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
                                     vmstate_spapr_xive_ive, XiveIVE),

for migration to work.

Cheers,

C. 

* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources Cédric Le Goater
@ 2017-12-14 15:24   ` Cédric Le Goater
  2017-12-18  0:59     ` Benjamin Herrenschmidt
  2017-12-20  5:22   ` David Gibson
  1 sibling, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-14 15:24 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel, David Gibson, Benjamin Herrenschmidt, Greg Kurz

> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 5b1f78e06a1e..ecc15d889b74 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -24,8 +24,17 @@ struct sPAPRXive {
>      /* Properties */
>      uint32_t     nr_irqs;
>  
> +    /* IRQ */
> +    qemu_irq     *qirqs;
> +
>      /* XIVE internal tables */
>      XiveIVE      *ivt;
> +    uint8_t      *sbe;
> +    uint32_t     sbe_size;
> +
> +    /* ESB memory region */
> +    hwaddr       esb_base;
> +    MemoryRegion esb_iomem;
>  };
>  
>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
> 

The addition of the XIVE source fields directly under the sPAPRXive 
object is really a design choice. But I am starting to think that 
having multiple XIVE source objects would be a good idea.

Roughly speaking, a XIVE source is a bunch of PQ bits plus an MMIO
region to manipulate them and, in the QEMU model, a set of associated
qemu_irqs to do the same from the handlers.
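
To make the PQ bits concrete, here is a minimal sketch of the 2-bit
state machine. The state encoding and the transitions follow my
understanding of the usual ESB description; they are not spelled out
in this thread, so treat them and the names below as assumptions.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encoding: P is the high bit, Q the low bit. */
enum {
    ESB_RESET   = 0x0, /* ready to take a new event */
    ESB_OFF     = 0x1, /* source is masked */
    ESB_PENDING = 0x2, /* one event forwarded, EOI not seen yet */
    ESB_QUEUED  = 0x3, /* a further event arrived while pending */
};

/* Apply a trigger to the PQ bits. Returns true when the event
 * should be forwarded to the routing engine. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;
        return false;
    default: /* ESB_OFF: the event is dropped */
        return false;
    }
}

/* Apply an EOI to the PQ bits. Returns true when a queued event
 * must be forwarded now. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    default: /* ESB_OFF: stays masked */
        return false;
    }
}
```

A source object would keep one such PQ pair per interrupt and call the
notification routine whenever one of these helpers returns true.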

In real HW, in the PSI host bridge controller on the P9 for instance,
a register holds all the P bits of the IRQs (no Q bits because the
IRQs are only LSIs) and there is a specific MMIO region for them.
PSIHB also has a register to store the assertion level of each IRQ.
So this is quite similar to what we are adding above and in the
next patch for the LSI support.

The source triggering only depends on the PQ bits (plus the LSI 
level) and the result is a simple forward of the event notification 
to the central XIVE engine : the IVRE, doing the routing. The IVRE
is really our sPAPRXive object. 

The API between the source and the IVRE is extremely simple :

  static void spapr_xive_irq(sPAPRXive *xive, int lisn)

The IVRE then scans its IVT, finds the EQ, and moves on to the 
presenter.
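
A rough sketch of that lookup, with a deliberately simplified table
entry. The struct layout, the field names and the function name are
hypothetical and do not match the XiveIVE definition of the series.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified IVT entry: which queue an interrupt
 * number is routed to. */
struct ive {
    bool valid;
    bool masked;
    uint32_t server;
    uint8_t priority;
};

/* Hypothetical routing result identifying the target event queue. */
struct route {
    uint32_t server;
    uint8_t priority;
};

/* Look up 'lisn' in the table. Returns true and fills 'out' when the
 * event can be routed; returns false for an out-of-range, invalid or
 * masked entry. */
static bool ivre_route(const struct ive *ivt, size_t nr_irqs,
                       uint32_t lisn, struct route *out)
{
    if (lisn >= nr_irqs || !ivt[lisn].valid || ivt[lisn].masked) {
        return false;
    }
    out->server = ivt[lisn].server;
    out->priority = ivt[lisn].priority;
    return true;
}
```

The source side only needs the interrupt number, which is why the
interface between the two can stay this small.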

So, we can keep the IVRE engine (sPAPRXive) attached directly to 
the machine like we have today, this is good, and introduce multiple 
XIVE source objects. The sPAPR machine would have : 

 - one for the IPIs [ 0 - nr_servers ]
 - one generic for the devices [ 4096 -  ]
 - one for each phb ? 

The source address in the overall ESB MMIO region would be calculated 
from the offset of the source IRQ numbers in the IRQ number space. 
The offset could very well be hardcoded for each device. I don't see 
any XICS compatibility problems as we are sharing correctly the IRQ 
number space already.
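
For example, assuming one ESB page per interrupt number (the page
shift and the function name below are assumptions for illustration),
the base address of a source would be derived from its first
interrupt number like this:

```c
#include <stdint.h>

/* Base address of a source in the overall ESB MMIO region, assuming
 * each interrupt number owns one page of (1 << esb_shift) bytes. */
static uint64_t xive_source_esb_addr(uint64_t esb_base,
                                     uint32_t first_lisn,
                                     unsigned int esb_shift)
{
    return esb_base + ((uint64_t)first_lisn << esb_shift);
}
```

With a 64K page per interrupt, a device source starting at interrupt
number 4096 would sit 256M above the base of the region, while the
IPI source starting at 0 would sit at the base itself.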


I am starting this discussion because the support for XIVE in the 
QEMU PowerNV machine will need multiple sources, just like for 
POWER8. PnvXive will be a bit different because the IVRE tables 
(IVT and EQDT) are in the virtual machine memory. Most of the settings 
are done in the VM. The QEMU PowerNV machine will still have to 
implement the triggering and the routing logic using the guest tables. 


Regards,

C. 

* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-14 15:24   ` Cédric Le Goater
@ 2017-12-18  0:59     ` Benjamin Herrenschmidt
  2017-12-19  6:37       ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2017-12-18  0:59 UTC (permalink / raw)
  To: Cédric Le Goater, qemu-ppc, qemu-devel, David Gibson, Greg Kurz

On Thu, 2017-12-14 at 16:24 +0100, Cédric Le Goater wrote:
> The API between the source and the IVRE is extremely simple :
> 
>   static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> 
> The IVRE then scans its IVT, finds the EQ, and moves on to the 
> presenter.

In HW it's an MMIO store between the two units (from the source to the
IVRE notification port). I wonder in the long run if we should model
that the same way...

> So, we can keep the IVRE engine (sPAPRXive) attached directly to 
> the machine like we have today, this is good, and introduce multiple 
> XIVE source objects. The sPAPR machine would have : 
> 
>  - one for the IPIs [ 0 - nr_servers ]
>  - one generic for the devices [ 4096 -  ]
>  - one for each phb ? 
> 
> The source address in the overall ESB MMIO region would be calculated 
> from the offset of the source IRQ numbers in the IRQ number space. 
> The offset could very well be hardcoded for each device. I don't see 
> any XICS compatibility problems as we are sharing correctly the IRQ 
> number space already.
> 
> 
> I am starting this discussion because the support for XIVE in the 
> QEMU PowerNV machine will need multiple sources, just like for 
> POWER8. PnvXive will be a bit different because the IVRE tables 
> (IVT and EQDT) are in the virtual machine memory. Most of the settings 
> are done in the VM. The QEMU PowerNV machine will still have to 
> implement the triggering and the routing logic using the guest tables. 

* Re: [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers Cédric Le Goater
@ 2017-12-19  4:46   ` David Gibson
  2017-12-19  6:43     ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2017-12-19  4:46 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On Sat, Dec 09, 2017 at 09:43:20AM +0100, Cédric Le Goater wrote:
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Hrm.  I know I (indirectly) suggested this, but now that I see it, I'm
thinking adding return values here but not on the read side (which
would be awkward since they return the read values) seems like not a
great idea.

So I'm ok with just open coding the dma_memory_write()s after all.

> ---
>  include/sysemu/dma.h | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/include/sysemu/dma.h b/include/sysemu/dma.h
> index c228c6651360..74a9558af39c 100644
> --- a/include/sysemu/dma.h
> +++ b/include/sysemu/dma.h
> @@ -153,12 +153,12 @@ static inline void dma_memory_unmap(AddressSpace *as,
>          dma_memory_read(as, addr, &val, (_bits) / 8);                   \
>          return _end##_bits##_to_cpu(val);                               \
>      }                                                                   \
> -    static inline void st##_sname##_##_end##_dma(AddressSpace *as,      \
> +    static inline int st##_sname##_##_end##_dma(AddressSpace *as,      \
>                                                   dma_addr_t addr,       \
>                                                   uint##_bits##_t val)   \
>      {                                                                   \
>          val = cpu_to_##_end##_bits(val);                                \
> -        dma_memory_write(as, addr, &val, (_bits) / 8);                  \
> +        return dma_memory_write(as, addr, &val, (_bits) / 8);           \
>      }
>  
>  static inline uint8_t ldub_dma(AddressSpace *as, dma_addr_t addr)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-18  0:59     ` Benjamin Herrenschmidt
@ 2017-12-19  6:37       ` Cédric Le Goater
  2017-12-20  5:13         ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-19  6:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, David Gibson, Greg Kurz

On 12/18/2017 01:59 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2017-12-14 at 16:24 +0100, Cédric Le Goater wrote:
>> The API between the source and the IVRE is extremely simple :
>>
>>   static void spapr_xive_irq(sPAPRXive *xive, int lisn)
>>
>> The IVRE then scans its IVT, finds the EQ, and moves on to the 
>> presenter.
> 
> In HW it's an MMIO store between the two units (from the source to the
> IVRE notification port). I wonder in the long run if we should model
> that the same way...

It's a problem for PowerNV. IVSEs should all have an 'IVT offset' 
register and a 'notify trigger port address' register for 
this purpose. Real HW performs a 4-byte store of the IRQ number 
to forward the notification to the IVRE. It even makes the model 
a little simpler because we don't have to look for the appropriate 
PnvXive object to handle the routing.  
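
Very roughly, and only as a sketch of the idea, a source modeled this
way would look like the code below. The struct, its field names and
source_notify() are invented for the example, and computing the IRQ
number as "IVT offset + local source number" is my assumption of how
the offset register would be used.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an IVSE; field names are illustrative only. */
typedef struct {
    uint32_t ivt_offset;                /* 'IVT offset' register */
    volatile uint32_t *notify_port;     /* 'notify trigger port address' */
} XiveSourceSketch;

/*
 * Forward an event to the IVRE with a single 4-byte store of the IRQ
 * number to the notification port.
 * Assumption: IRQ number = IVT offset + source-local number.
 */
static void source_notify(XiveSourceSketch *src, uint32_t srcno)
{
    assert(src != NULL && src->notify_port != NULL);
    *src->notify_port = src->ivt_offset + srcno;
}
```

The source only needs to know the port address, not which PnvXive
object sits behind it.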

For sPAPR, we don't have such MMIOs, but we could still trigger 
the sPAPRXive object directly, without using the qemu_irq objects
which stand in the middle. XIVE IPIs don't use them at all and
only use MMIOs.    

>> So, we can keep the IVRE engine (sPAPRXive) attached directly to 
>> the machine like we have today, this is good, and introduce multiple 
>> XIVE source objects. The sPAPR machine would have : 
>>
>>  - one for the IPIs [ 0 - nr_servers ]
>>  - one generic for the devices [ 4096 -  ]
>>  - one for each phb ? 
>>
>> The source address in the overall ESB MMIO region would be calculated 
>> from the offset of the source IRQ numbers in the IRQ number space. 
>> The offset could very well be hardcoded for each device. I don't see 
>> any XICS compatibility problems as we are sharing correctly the IRQ 
>> number space already.
>>
>>
>> I am starting this discussion because the support for XIVE in the 
>> QEMU PowerNV machine will need multiple sources, just like for 
>> POWER8. PnvXive will be a bit different because the IVRE tables 
>> (IVT and EQDT) are in the virtual machine memory. Most of the settings 
>> are done in the VM. The QEMU PowerNV machine will still have to 
>> implement the triggering and the routing logic using the guest tables. 

* Re: [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers
  2017-12-19  4:46   ` David Gibson
@ 2017-12-19  6:43     ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-19  6:43 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 12/19/2017 05:46 AM, David Gibson wrote:
> On Sat, Dec 09, 2017 at 09:43:20AM +0100, Cédric Le Goater wrote:
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> Hrm.  I know I (indirectly) suggested this, but now that I see it, I'm
> thinking adding return values here but not on the read side (which
> would be awkward since they return the read values) seems like not a
> great idea.
> 
> So I'm ok with just open coding the dma_memory_write()s after all.

OK. It's not a big change. Maybe I can use :

	ldl_phys(CPU(cpu)->as, ...)
	stl_phys(CPU(cpu)->as, ...)

in some cases, which would give us the correct ordering. 

The Pnv model does a few more peek/poke in RAM to use the XIVE 
structures which are larger, 8 * uint64_t for the XiveEQs and 
16 for the XiveVPs. 

Thanks,

C.

> 
>> ---
>>  include/sysemu/dma.h | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/sysemu/dma.h b/include/sysemu/dma.h
>> index c228c6651360..74a9558af39c 100644
>> --- a/include/sysemu/dma.h
>> +++ b/include/sysemu/dma.h
>> @@ -153,12 +153,12 @@ static inline void dma_memory_unmap(AddressSpace *as,
>>          dma_memory_read(as, addr, &val, (_bits) / 8);                   \
>>          return _end##_bits##_to_cpu(val);                               \
>>      }                                                                   \
>> -    static inline void st##_sname##_##_end##_dma(AddressSpace *as,      \
>> +    static inline int st##_sname##_##_end##_dma(AddressSpace *as,      \
>>                                                   dma_addr_t addr,       \
>>                                                   uint##_bits##_t val)   \
>>      {                                                                   \
>>          val = cpu_to_##_end##_bits(val);                                \
>> -        dma_memory_write(as, addr, &val, (_bits) / 8);                  \
>> +        return dma_memory_write(as, addr, &val, (_bits) / 8);           \
>>      }
>>  
>>  static inline uint8_t ldub_dma(AddressSpace *as, dma_addr_t addr)
> 

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller Cédric Le Goater
  2017-12-09 14:06   ` Cédric Le Goater
@ 2017-12-20  5:09   ` David Gibson
  2017-12-20  7:38     ` Cédric Le Goater
  2017-12-21  0:12     ` Benjamin Herrenschmidt
  1 sibling, 2 replies; 71+ messages in thread
From: David Gibson @ 2017-12-20  5:09 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> With the POWER9 processor comes a new interrupt controller called
> XIVE. It is composed of three sub-engines :
> 
>   - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
>     in the main controller for the IPIS and in the PSI host
>     bridge. They are configured to feed the IVRE with events.
> 
>   - Interrupt Virtualization Routing Engine (IVRE). Their job is to
>     match an event source with a Notification Virtualization Target
>     (NVT), a priority and an Event Queue (EQ) to determine if a
>     Virtual Processor can handle the event.
> 
>   - Interrupt Virtualization Presentation Engine (IVPE). It maintains
>     the interrupt state of each hardware thread and present the
>     notification as an external exception.
> 
> Each of the engines uses a set of internal tables to redirect
> exceptions from event sources to CPU threads. The first table we
> introduce is the Interrupt Virtualization Entry (IVE) table, part of
> the virtualization engine in charge of routing events. It associates
> event sources (IRQ numbers) to event queues which will forward, or
> not, the event notification to the presentation controller.
> 
> The XIVE model is designed to make use of the full range of the IRQ
> number space and does not use an offset like the XICS mode does.
> Hence, the IVE table is directly indexed by the IRQ number.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

As you've suggested yourself, I think we might need to more
explicitly model the different components of the XIVE system.  As part
of that, I think you need to be clearer in this base skeleton about
exactly what component your XIVE object represents.

If the answer is "the overall thing" I suspect that's not what you
want - I had one of those for XICS which proved to be a mistake
(eventually replaced by the XICSFabric interface).

Changing the model later isn't impossible, but doing so without
breaking migration can be a real pain, so I think it's worth a
reasonable effort to try and get it right initially.

> ---
> 
>  Changes since v1 :
> 
>  - used g_new0 instead of g_malloc0
>  - removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC 
>  - introduced a device reset handler. the object needs to be parented
>    to sysbus when created.
>  - renamed spapr_xive_irq_set to spapr_xive_irq_enable
>  - renamed spapr_xive_irq_unset to spapr_xive_irq_disable
>  - moved the PPC_BIT macros under target/ppc/cpu.h
>  - shrinked file copyright header
> 
>  default-configs/ppc64-softmmu.mak |   1 +
>  hw/intc/Makefile.objs             |   1 +
>  hw/intc/spapr_xive.c              | 156 ++++++++++++++++++++++++++++++++++++++
>  hw/intc/xive-internal.h           |  41 ++++++++++
>  include/hw/ppc/spapr_xive.h       |  35 +++++++++
>  5 files changed, 234 insertions(+)
>  create mode 100644 hw/intc/spapr_xive.c
>  create mode 100644 hw/intc/xive-internal.h
>  create mode 100644 include/hw/ppc/spapr_xive.h
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index d1b3a6dd50f8..4a7f6a0696de 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -56,6 +56,7 @@ CONFIG_SM501=y
>  CONFIG_XICS=$(CONFIG_PSERIES)
>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>  # For PReP
>  CONFIG_SERIAL_ISA=y
>  CONFIG_MC146818RTC=y
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index ae358569a155..49e13e7aeeee 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>  obj-$(CONFIG_XICS) += xics.o
>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> new file mode 100644
> index 000000000000..e6e8841add17
> --- /dev/null
> +++ b/hw/intc/spapr_xive.c
> @@ -0,0 +1,156 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/dma.h"
> +#include "monitor/monitor.h"
> +#include "hw/ppc/spapr_xive.h"
> +
> +#include "xive-internal.h"
> +
> +/*
> + * Main XIVE object
> + */
> +
> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> +{
> +    int i;
> +
> +    for (i = 0; i < xive->nr_irqs; i++) {
> +        XiveIVE *ive = &xive->ivt[i];
> +
> +        if (!(ive->w & IVE_VALID)) {
> +            continue;
> +        }
> +
> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
> +                       ive->w & IVE_MASKED ? "M" : " ",
> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
> +    }
> +}
> +
> +static void spapr_xive_reset(DeviceState *dev)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    int i;
> +
> +    /* Mask all valid IVEs in the IRQ number space. */
> +    for (i = 0; i < xive->nr_irqs; i++) {
> +        XiveIVE *ive = &xive->ivt[i];
> +        if (ive->w & IVE_VALID) {
> +            ive->w |= IVE_MASKED;
> +        }
> +    }
> +}
> +
> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +
> +    if (!xive->nr_irqs) {
> +        error_setg(errp, "Number of interrupt needs to be greater 0");
> +        return;
> +    }
> +
> +    /* Allocate the IVT (Interrupt Virtualization Table) */
> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> +}
> +
> +static const VMStateDescription vmstate_spapr_xive_ive = {
> +    .name = TYPE_SPAPR_XIVE "/ive",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField []) {
> +        VMSTATE_UINT64(w, XiveIVE),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static bool vmstate_spapr_xive_needed(void *opaque)
> +{
> +    /* TODO check machine XIVE support */
> +    return true;
> +}
> +
> +static const VMStateDescription vmstate_spapr_xive = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .needed = vmstate_spapr_xive_needed,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> +        VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
> +                                     vmstate_spapr_xive_ive, XiveIVE),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static Property spapr_xive_properties[] = {
> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->realize = spapr_xive_realize;
> +    dc->reset = spapr_xive_reset;
> +    dc->props = spapr_xive_properties;
> +    dc->desc = "sPAPR XIVE interrupt controller";
> +    dc->vmsd = &vmstate_spapr_xive;
> +}
> +
> +static const TypeInfo spapr_xive_info = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .parent = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(sPAPRXive),
> +    .class_init = spapr_xive_class_init,
> +};
> +
> +static void spapr_xive_register_types(void)
> +{
> +    type_register_static(&spapr_xive_info);
> +}
> +
> +type_init(spapr_xive_register_types)
> +
> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
> +{
> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> +}
> +
> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
> +{
> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
> +
> +    if (!ive) {
> +        return false;
> +    }
> +
> +    ive->w |= IVE_VALID;
> +    return true;
> +}
> +
> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> +{
> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
> +
> +    if (!ive) {
> +        return false;
> +    }
> +
> +    ive->w &= ~IVE_VALID;
> +    return true;
> +}
> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> new file mode 100644
> index 000000000000..132b71a6daf0
> --- /dev/null
> +++ b/hw/intc/xive-internal.h
> @@ -0,0 +1,41 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2016-2017, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef _INTC_XIVE_INTERNAL_H
> +#define _INTC_XIVE_INTERNAL_H
> +
> +/* Utilities to manipulate these (originaly from OPAL) */
> +#define MASK_TO_LSH(m)          (__builtin_ffsl(m) - 1)
> +#define GETFIELD(m, v)          (((v) & (m)) >> MASK_TO_LSH(m))
> +#define SETFIELD(m, v, val)                             \
> +        (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
> +
> +/* IVE/EAS
> + *
> + * One per interrupt source. Targets that interrupt to a given EQ
> + * and provides the corresponding logical interrupt number (EQ data)
> + *
> + * We also map this structure to the escalation descriptor inside
> + * an EQ, though in that case the valid and masked bits are not used.
> + */
> +typedef struct XiveIVE {
> +        /* Use a single 64-bit definition to make it easier to
> +         * perform atomic updates
> +         */
> +        uint64_t        w;
> +#define IVE_VALID       PPC_BIT(0)
> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
> +} XiveIVE;
> +
> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
> +
> +#endif /* _INTC_XIVE_INTERNAL_H */
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> new file mode 100644
> index 000000000000..5b1f78e06a1e
> --- /dev/null
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -0,0 +1,35 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_SPAPR_XIVE_H
> +#define PPC_SPAPR_XIVE_H
> +
> +#include <hw/sysbus.h>
> +
> +typedef struct sPAPRXive sPAPRXive;
> +typedef struct XiveIVE XiveIVE;
> +
> +#define TYPE_SPAPR_XIVE "spapr-xive"
> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> +
> +struct sPAPRXive {
> +    SysBusDevice parent;
> +
> +    /* Properties */
> +    uint32_t     nr_irqs;
> +
> +    /* XIVE internal tables */
> +    XiveIVE      *ivt;
> +};
> +
> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> +
> +#endif /* PPC_SPAPR_XIVE_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-19  6:37       ` Cédric Le Goater
@ 2017-12-20  5:13         ` David Gibson
  0 siblings, 0 replies; 71+ messages in thread
From: David Gibson @ 2017-12-20  5:13 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

On Tue, Dec 19, 2017 at 07:37:31AM +0100, Cédric Le Goater wrote:
> On 12/18/2017 01:59 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2017-12-14 at 16:24 +0100, Cédric Le Goater wrote:
> >> The API between the source and the IVRE is extremely simple :
> >>
> >>   static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> >>
> >> The IVRE then scans its IVT, finds the EQ, and moves on to the 
> >> presenter.
> > 
> > In HW it's an MMIO store between the two units (from the source to the
> > IVRE notification port). I wonder in the long run if we should model
> > that the same way...
> 
> It's a problem for PowerNV. IVSEs should all have an 'IVT offset' 
> register and a 'notify trigger port address' address register for 
> this purpose. Real HW performs a 4bytes store of the IRQ number 
> to forward the notification to the IVRE. It even makes the model 
> a little simpler because we don't have to look for the appropriate 
> PnvXive object to handle the routing.  
> 
> For sPAPR, we don't have such MMIOs but still, we could trigger 
> directly the sPAPRXive object without using the qemu_irq objects
> which stand in the middle. XIVE IPIs don't use them at all and
> only use MMIOs.

Yeah, I think we're going to want a model more explicitly close to
what the hardware does.  It's tempting to shortcut it for PAPR, but a)
it'll probably cause us less trouble when we need to implement powernv
and b) I think it's less likely to break as we fill out the various
details we need.

> 
> >> So, we can keep the IVRE engine (sPAPRXive) attached directly to 
> >> the machine like we have today, this is good, and introduce multiple 
> >> XIVE source objects. The sPAPR machine would have : 
> >>
> >>  - one for the IPIs [ 0 - nr_servers ]
> >>  - one generic for the devices [ 4096 -  ]
> >>  - one for each phb ? 
> >>
> >> The source address in the overall ESB MMIO region would be calculated 
> >> from the offset of the source IRQ numbers in the IRQ number space. 
> >> The offset could very well be hardcoded for each device. I don't see 
> >> any XICS compatibility problems as we are sharing correctly the IRQ 
> >> number space already.
> >>
> >>
> >> I am starting this discussion because the support for XIVE in the 
> >> QEMU PowerNV machine will need multiple sources, just like for 
> >> POWER8. PnvXive will be a bit different because the IVRE tables 
> >> (IVT and EQDT) are in the virtual machine memory. Most of the settings 
> >> are done in the VM. The QEMU PowerNV machine will still have to 
> >> implement the triggering and the routing logic using the guest tables. 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources Cédric Le Goater
  2017-12-14 15:24   ` Cédric Le Goater
@ 2017-12-20  5:22   ` David Gibson
  2017-12-20  7:54     ` Cédric Le Goater
  1 sibling, 1 reply; 71+ messages in thread
From: David Gibson @ 2017-12-20  5:22 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On Sat, Dec 09, 2017 at 09:43:22AM +0100, Cédric Le Goater wrote:
> Each XIVE interrupt source is associated with a two bit state machine
> called an Event State Buffer (ESB) : the first bit "P" means that an
> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
> means a new interrupt was triggered while another was still pending.
> 
> When an event is triggered, the associated interrupt state bits are
> fetched and modified and forwarded to the virtualization engine of the
> controller doing the routing. These can also be controlled by MMIO, to
> trigger events or turn off the sources for instance. See code for more
> details on the states and transitions.
> 
> The MMIO space for the ESBs is 512GB large on the bare-metal system
> (PowerNV) and the BAR depends on the chip id. In our model for the
> sPAPR machine, we choose to only map the sub-region for the
> provisioned IRQ numbers and to use the mapping address of chip 0 of a
> real system.

I think we probably want a device property to make the virtualized
base address arbitrary.  It's fine for it to default to the chip 0
base, but that'll make it easier to adapt if we need to later on.

As noted in the followup messages, I think you're going to want to
move this stuff from the current xive object into a "block of sources"
object.

Apart from that this looks pretty sound.

> In the real world, each source may have different characteristics
> depending on the revision of a controller or the CPU. Early systems
> had two different MMIO pages for trigger and for EOI. We choose to use
> the same characteristics for all sources to simplify the model. The
> minimum CPU level for XIVE exploitation mode will be DD2.X as it has
> full support.
> 
> The OS will obtain the address of the MMIO page of the ESB entry
> associated with a source and its characteristic using the
> H_INT_GET_SOURCE_INFO hcall. This will be addressed in the patch
> introducing the hcalls.
> 
> The spapr_xive_irq() routine in charge of triggering the CPU interrupt
> line will be filled later on.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
> 
>  Changes since v1:
> 
>  - merged in the same patch the qemu_irq handlers
>  - reworked the event notification logic of the qemu_irq handlers.  
>  - introduced XIVE_ESB_STORE_EOI support
>  - removed 'esb_shift' field 
>  - removed a useless check on the validity of the IVE in the memory
>    region handlers.
>  - fixed spapr_xive_pq_trigger() to return true when XIVE_ESB_QUEUED
>    is set
>  - removed the overall ESB memory region. We now have only one region
>    for the provisioned sources.
>  - improved 'info pic' output
> 
>  hw/intc/spapr_xive.c        | 254 +++++++++++++++++++++++++++++++++++++++++++-
>  hw/intc/xive-internal.h     |  10 ++
>  include/hw/ppc/spapr_xive.h |   9 ++
>  3 files changed, 271 insertions(+), 2 deletions(-)
> 
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index e6e8841add17..43df6814619d 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -18,23 +18,252 @@
>  
>  #include "xive-internal.h"
>  
> +static void spapr_xive_irq(sPAPRXive *xive, int lisn)
> +{
> +
> +}
> +
>  /*
> - * Main XIVE object
> + * XIVE Interrupt Source
> + */
> +
> +/*
> + * "magic" Event State Buffer (ESB) MMIO offsets.
> + *
> + * Each interrupt source has a 2-bit state machine called ESB
> + * which can be controlled by MMIO. It's made of 2 bits, P and
> + * Q. P indicates that an interrupt is pending (has been sent
> + * to a queue and is waiting for an EOI). Q indicates that the
> + * interrupt has been triggered while pending.
> + *
> + * This acts as a coalescing mechanism in order to guarantee
> + * that a given interrupt only occurs at most once in a queue.
> + *
> + * When doing an EOI, the Q bit will indicate if the interrupt
> + * needs to be re-triggered.
> + *
> + * The following offsets into the ESB MMIO allow to read or
> + * manipulate the PQ bits. They must be used with an 8-bytes
> + * load instruction. They all return the previous state of the
> + * interrupt (atomically).
> + *
> + * Additionally, some ESB pages support doing an EOI via a
> + * store at 0 and some ESBs support doing a trigger via a
> + * separate trigger page.
> + */
> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
> +#define XIVE_ESB_GET            0x800 /* Load */
> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
> +
> +#define XIVE_ESB_VAL_P          0x2
> +#define XIVE_ESB_VAL_Q          0x1
> +
> +#define XIVE_ESB_RESET          0x0
> +#define XIVE_ESB_PENDING        XIVE_ESB_VAL_P
> +#define XIVE_ESB_QUEUED         (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
> +#define XIVE_ESB_OFF            XIVE_ESB_VAL_Q
> +
> +static uint8_t spapr_xive_pq_get(sPAPRXive *xive, uint32_t lisn)
> +{
> +    uint32_t byte = lisn / 4;
> +    uint32_t bit  = (lisn % 4) * 2;
> +
> +    assert(byte < xive->sbe_size);
> +
> +    return (xive->sbe[byte] >> bit) & 0x3;
> +}
> +
> +static uint8_t spapr_xive_pq_set(sPAPRXive *xive, uint32_t lisn, uint8_t pq)
> +{
> +    uint32_t byte = lisn / 4;
> +    uint32_t bit  = (lisn % 4) * 2;
> +    uint8_t old, new;
> +
> +    assert(byte < xive->sbe_size);
> +
> +    old = xive->sbe[byte];
> +
> +    new = xive->sbe[byte] & ~(0x3 << bit);
> +    new |= (pq & 0x3) << bit;
> +
> +    xive->sbe[byte] = new;
> +
> +    return (old >> bit) & 0x3;
> +}
> +
> +static bool spapr_xive_pq_eoi(sPAPRXive *xive, uint32_t lisn)
> +{
> +    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
> +
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_RESET);
> +        return false;
> +    case XIVE_ESB_PENDING:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_RESET);
> +        return false;
> +    case XIVE_ESB_QUEUED:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
> +        return true;
> +    case XIVE_ESB_OFF:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_OFF);
> +        return false;
> +    default:
> +         g_assert_not_reached();
> +    }
> +}
> +
> +/*
> + * Returns whether the event notification should be forwarded to the
> + * IVE for routing.
>   */
> +static bool spapr_xive_pq_trigger(sPAPRXive *xive, uint32_t lisn)
> +{
> +    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
>  
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
> +        return true;
> +    case XIVE_ESB_PENDING:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_QUEUED);
> +        return false;
> +    case XIVE_ESB_QUEUED:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_QUEUED);
> +        return false;
> +    case XIVE_ESB_OFF:
> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_OFF);
> +        return false;
> +    default:
> +         g_assert_not_reached();
> +    }
> +}
> +
> +/*
> + * XIVE Interrupt Source MMIOs
> + */
> +
> +/*
> + * Some HW use a separate page for trigger. We only support the case
> + * in which the trigger can be done in the same page as the EOI.
> + */
> +static uint64_t spapr_xive_esb_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(opaque);
> +    uint32_t offset = addr & 0xF00;
> +    uint32_t lisn = addr >> ESB_SHIFT;
> +    uint64_t ret = -1;
> +
> +    switch (offset) {
> +    case XIVE_ESB_LOAD_EOI:
> +        /*
> +         * EOI on load is not used anymore as we now advertise
> +         * XIVE_ESB_STORE_EOI support for the interrupt sources
> +         */
> +        ret = spapr_xive_pq_eoi(xive, lisn);
> +        break;
> +
> +    case XIVE_ESB_GET:
> +        ret = spapr_xive_pq_get(xive, lisn);
> +        break;
> +
> +    case XIVE_ESB_SET_PQ_00:
> +    case XIVE_ESB_SET_PQ_01:
> +    case XIVE_ESB_SET_PQ_10:
> +    case XIVE_ESB_SET_PQ_11:
> +        ret = spapr_xive_pq_set(xive, lisn, (offset >> 8) & 0x3);
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
> +    }
> +
> +    return ret;
> +}
> +
> +static void spapr_xive_esb_write(void *opaque, hwaddr addr,
> +                                 uint64_t value, unsigned size)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(opaque);
> +    uint32_t offset = addr & 0xF00;
> +    uint32_t lisn = addr >> ESB_SHIFT;
> +    bool notify = false;
> +
> +    switch (offset) {
> +    case 0:
> +        notify = spapr_xive_pq_trigger(xive, lisn);
> +        break;
> +    case XIVE_ESB_STORE_EOI:
> +        /* If the Q bit is set, we should forward a new source event
> +         * notification
> +         */
> +        notify = spapr_xive_pq_eoi(xive, lisn);
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
> +                      offset);
> +        return;
> +    }
> +
> +    /* Forward the source event notification for routing */
> +    if (notify) {
> +        spapr_xive_irq(xive, lisn);
> +    }
> +}
> +
> +static const MemoryRegionOps spapr_xive_esb_ops = {
> +    .read = spapr_xive_esb_read,
> +    .write = spapr_xive_esb_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static void spapr_xive_source_set_irq(void *opaque, int lisn, int val)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(opaque);
> +    bool notify = false;
> +
> +    if (val) {
> +        notify = spapr_xive_pq_trigger(xive, lisn);
> +    }
> +
> +    /* Forward the source event notification for routing */
> +    if (notify) {
> +        spapr_xive_irq(xive, lisn);
> +    }
> +}
> +
> +/*
> + * Main XIVE object
> + */
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>  {
>      int i;
>  
>      for (i = 0; i < xive->nr_irqs; i++) {
>          XiveIVE *ive = &xive->ivt[i];
> +        uint8_t pq;
>  
>          if (!(ive->w & IVE_VALID)) {
>              continue;
>          }
>  
> -        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
> +        pq = spapr_xive_pq_get(xive, i);
> +
> +        monitor_printf(mon, "  %4x %s %c%c %08x %08x\n", i,
>                         ive->w & IVE_MASKED ? "M" : " ",
> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
>                         (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>                         (int) GETFIELD(IVE_EQ_DATA, ive->w));
>      }
> @@ -52,6 +281,9 @@ static void spapr_xive_reset(DeviceState *dev)
>              ive->w |= IVE_MASKED;
>          }
>      }
> +
> +    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
> +    memset(xive->sbe, 0x55, xive->sbe_size);
>  }
>  
>  static void spapr_xive_realize(DeviceState *dev, Error **errp)
> @@ -65,6 +297,23 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>  
>      /* Allocate the IVT (Interrupt Virtualization Table) */
>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> +
> +    /* QEMU IRQs */
> +    xive->qirqs = qemu_allocate_irqs(spapr_xive_source_set_irq, xive,
> +                                     xive->nr_irqs);
> +
> +    /* Allocate SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
> +    xive->sbe_size = DIV_ROUND_UP(xive->nr_irqs, 4);
> +    xive->sbe = g_malloc0(xive->sbe_size);
> +
> +    /* VC BAR. Use address of chip 0 to install the ESB memory region
> +     * for *all* interrupt sources */
> +    xive->esb_base = (P9_MMIO_BASE | VC_BAR_DEFAULT);
> +
> +    memory_region_init_io(&xive->esb_iomem, OBJECT(xive),
> +                          &spapr_xive_esb_ops, xive, "xive.esb",
> +                          (1ull << ESB_SHIFT) * xive->nr_irqs);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->esb_iomem);
>  }
>  
>  static const VMStateDescription vmstate_spapr_xive_ive = {
> @@ -92,6 +341,7 @@ static const VMStateDescription vmstate_spapr_xive = {
>          VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>          VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
>                                       vmstate_spapr_xive_ive, XiveIVE),
> +        VMSTATE_VBUFFER_UINT32(sbe, sPAPRXive, 1, NULL, sbe_size),
>          VMSTATE_END_OF_LIST()
>      },
>  };
> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> index 132b71a6daf0..872648dd96a2 100644
> --- a/hw/intc/xive-internal.h
> +++ b/hw/intc/xive-internal.h
> @@ -16,6 +16,16 @@
>  #define SETFIELD(m, v, val)                             \
>          (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
>  
> +/*
> + * XIVE MMIO regions
> + */
> +#define P9_MMIO_BASE     0x006000000000000ull
> +
> +/* VC BAR contains set translations for the ESBs and the EQs. */
> +#define VC_BAR_DEFAULT   0x10000000000ull
> +#define VC_BAR_SIZE      0x08000000000ull
> +#define ESB_SHIFT        16 /* One 64k page. OPAL has two */
> +
>  /* IVE/EAS
>   *
>   * One per interrupt source. Targets that interrupt to a given EQ
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 5b1f78e06a1e..ecc15d889b74 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -24,8 +24,17 @@ struct sPAPRXive {
>      /* Properties */
>      uint32_t     nr_irqs;
>  
> +    /* IRQ */
> +    qemu_irq     *qirqs;
> +
>      /* XIVE internal tables */
>      XiveIVE      *ivt;
> +    uint8_t      *sbe;
> +    uint32_t     sbe_size;
> +
> +    /* ESB memory region */
> +    hwaddr       esb_base;
> +    MemoryRegion esb_iomem;
>  };
>  
>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-20  5:09   ` David Gibson
@ 2017-12-20  7:38     ` Cédric Le Goater
  2018-04-12  5:07       ` David Gibson
  2017-12-21  0:12     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-20  7:38 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 12/20/2017 06:09 AM, David Gibson wrote:
> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>> With the POWER9 processor comes a new interrupt controller called
>> XIVE. It is composed of three sub-engines :
>>
>>   - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
>>     in the main controller for the IPIs and in the PSI host
>>     bridge. They are configured to feed the IVRE with events.
>>
>>   - Interrupt Virtualization Routing Engine (IVRE). Their job is to
>>     match an event source with a Notification Virtualization Target
>>     (NVT), a priority and an Event Queue (EQ) to determine if a
>>     Virtual Processor can handle the event.
>>
>>   - Interrupt Virtualization Presentation Engine (IVPE). It maintains
>>     the interrupt state of each hardware thread and presents the
>>     notification as an external exception.
>>
>> Each of the engines uses a set of internal tables to redirect
>> exceptions from event sources to CPU threads. The first table we
>> introduce is the Interrupt Virtualization Entry (IVE) table, part of
>> the virtualization engine in charge of routing events. It associates
>> event sources (IRQ numbers) to event queues which will forward, or
>> not, the event notification to the presentation controller.
>>
>> The XIVE model is designed to make use of the full range of the IRQ
>> number space and does not use an offset like the XICS mode does.
>> Hence, the IVE table is directly indexed by the IRQ number.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> As you've suggested yourself, I think we might need to more
> explicitly model the different components of the XIVE system.  As part
> of that, I think you need to be clearer in this base skeleton about
> exactly what component your XIVE object represents.

ok. The base skeleton is the IVRE, the central engine handling 
the routing. 

> If the answer is "the overall thing" 

Yes, it is more or less that currently. 

The sPAPRXive object models the source engine and the routing 
engine in one object.  

I have merged these for simplicity and because the interrupt 
controller has an internal source for the interrupts of the "IPI" 
type, which are used for the CPU IPIs but also for other generic 
interrupts, like the OpenCAPI ones. The XIVE sPAPR interface is 
also much simpler than the baremetal one, all the tables are 
maintained in the hypervisor, so this choice made some sense. 

But since then, I have started the PowerNV model and I am
duplicating a lot of code to handle the triggering and the MMIOs
in the different sources. So I am not convinced anymore.
Nevertheless, the overall routing logic is the same even if some
of the tables are no longer located in QEMU but in the machine
memory.

The sPAPRXiveNVT models some of the CPU presenter engine. It 
holds the virtual CPU interrupt states when not dispatched on 
a real HW thread. Real world is more complex. There are "CAM" 
lines in the HW threads which are compared to find a matching 
candidate. But I don't think we need anything more complex
than what we have today unless we want to support KVM under TCG ...
   
> I suspect that's not what you
> want - I had one of those for XICs which proved to be a mistake
> (eventually replaced by the XICSFabric interface).

The XICSFabric would be the main Xive object. The interface 
between the sources and the routing engine is hidden in sPAPR, 
we can use a simple function call : 

	spapr_xive_irq(pnv->xive, irq);

We could get rid of the qirqs, but they are required for XICS.

PowerNV uses MMIOs to notify an event and it makes the modeling
somewhat easier. Each controller model has a notify port address
register on which an interrupt number is written to forward an
event to the routing engine. So it is a simple store.

I don't know why there is a different notify port address per
source, maybe for extra filtering at the routing engine level.

> Changing the model later isn't impossible, but doing so without
> breaking migration can be a real pain, so I think it's worth a
> reasonable effort to try and get it right initially.

I completely agree. 

This is why I have started the PnvXive model to challenge the 
current PAPR design. I have hacked a bunch of patches for XIVE, 
LPC, PSI, OCC and basic PPC support which boot a PowerNV P9 up to 
petitboot. It would look better with a source object, but the 
location of the PQ bits is a bit problematic. It highly depends 
on the controller. The main controller uses tables in the hypervisor
memory. The PSIHB controller has its own bits. I suppose it is 
the same for PHB4. I need to take a closer look at how we could
have a common source object. 
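
For reference while thinking about a common source object: the part
that does not depend on where the PQ bits live is the 2-bit state
machine itself. Below is a minimal standalone sketch of it, following
the transitions of the patch. The EsbBlock type and the esb_* names
are hypothetical and only used for illustration, they are not part of
QEMU, and the backing store is a plain byte array here whereas a real
controller may keep the bits elsewhere.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* PQ encodings, same values as in the patch */
#define ESB_VAL_P   0x2
#define ESB_VAL_Q   0x1
#define ESB_RESET   0x0
#define ESB_OFF     ESB_VAL_Q
#define ESB_PENDING ESB_VAL_P
#define ESB_QUEUED  (ESB_VAL_P | ESB_VAL_Q)

/* Hypothetical block of sources: 2 bits per source, 4 sources per byte */
typedef struct {
    uint8_t  *sbe;
    uint32_t sbe_size;
} EsbBlock;

static uint8_t esb_get(const EsbBlock *b, uint32_t lisn)
{
    uint32_t byte = lisn / 4;
    uint32_t bit  = (lisn % 4) * 2;

    assert(byte < b->sbe_size);
    return (b->sbe[byte] >> bit) & 0x3;
}

/* Stores the new PQ value and returns the previous one */
static uint8_t esb_set(EsbBlock *b, uint32_t lisn, uint8_t pq)
{
    uint32_t byte = lisn / 4;
    uint32_t bit  = (lisn % 4) * 2;
    uint8_t old;

    assert(byte < b->sbe_size);
    old = (b->sbe[byte] >> bit) & 0x3;
    b->sbe[byte] = (uint8_t)((b->sbe[byte] & ~(0x3u << bit)) |
                             ((pq & 0x3u) << bit));
    return old;
}

/* Returns true when the event has to be forwarded for routing */
static bool esb_trigger(EsbBlock *b, uint32_t lisn)
{
    switch (esb_get(b, lisn)) {
    case ESB_RESET:
        esb_set(b, lisn, ESB_PENDING);
        return true;
    case ESB_PENDING:
        esb_set(b, lisn, ESB_QUEUED);
        return false;
    default: /* QUEUED and OFF are left unchanged */
        return false;
    }
}

/* Returns true when a queued event has to be replayed after the EOI */
static bool esb_eoi(EsbBlock *b, uint32_t lisn)
{
    switch (esb_get(b, lisn)) {
    case ESB_PENDING:
        esb_set(b, lisn, ESB_RESET);
        return false;
    case ESB_QUEUED:
        esb_set(b, lisn, ESB_PENDING);
        return true;
    default: /* RESET and OFF are left unchanged */
        return false;
    }
}
```

A reset that fills the array with 0x55 leaves every source in the
"off" state (PQ=01), so a trigger is ignored until the source is
explicitly set back to PQ=00.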

The most important part is KVM support and how we expose the 
MMIO region. We need to make progress on that topic.

Thanks,

C.  
 

>> ---
>>
>>  Changes since v1 :
>>
>>  - used g_new0 instead of g_malloc0
>>  - removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC 
>>  - introduced a device reset handler. the object needs to be parented
>>    to sysbus when created.
>>  - renamed spapr_xive_irq_set to spapr_xive_irq_enable
>>  - renamed spapr_xive_irq_unset to spapr_xive_irq_disable
>>  - moved the PPC_BIT macros under target/ppc/cpu.h
>>  - shrunk the file copyright header
>>
>>  default-configs/ppc64-softmmu.mak |   1 +
>>  hw/intc/Makefile.objs             |   1 +
>>  hw/intc/spapr_xive.c              | 156 ++++++++++++++++++++++++++++++++++++++
>>  hw/intc/xive-internal.h           |  41 ++++++++++
>>  include/hw/ppc/spapr_xive.h       |  35 +++++++++
>>  5 files changed, 234 insertions(+)
>>  create mode 100644 hw/intc/spapr_xive.c
>>  create mode 100644 hw/intc/xive-internal.h
>>  create mode 100644 include/hw/ppc/spapr_xive.h
>>
>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>> index d1b3a6dd50f8..4a7f6a0696de 100644
>> --- a/default-configs/ppc64-softmmu.mak
>> +++ b/default-configs/ppc64-softmmu.mak
>> @@ -56,6 +56,7 @@ CONFIG_SM501=y
>>  CONFIG_XICS=$(CONFIG_PSERIES)
>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>>  # For PReP
>>  CONFIG_SERIAL_ISA=y
>>  CONFIG_MC146818RTC=y
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index ae358569a155..49e13e7aeeee 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>>  obj-$(CONFIG_XICS) += xics.o
>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> new file mode 100644
>> index 000000000000..e6e8841add17
>> --- /dev/null
>> +++ b/hw/intc/spapr_xive.c
>> @@ -0,0 +1,156 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/dma.h"
>> +#include "monitor/monitor.h"
>> +#include "hw/ppc/spapr_xive.h"
>> +
>> +#include "xive-internal.h"
>> +
>> +/*
>> + * Main XIVE object
>> + */
>> +
>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < xive->nr_irqs; i++) {
>> +        XiveIVE *ive = &xive->ivt[i];
>> +
>> +        if (!(ive->w & IVE_VALID)) {
>> +            continue;
>> +        }
>> +
>> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
>> +                       ive->w & IVE_MASKED ? "M" : " ",
>> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
>> +    }
>> +}
>> +
>> +static void spapr_xive_reset(DeviceState *dev)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    int i;
>> +
>> +    /* Mask all valid IVEs in the IRQ number space. */
>> +    for (i = 0; i < xive->nr_irqs; i++) {
>> +        XiveIVE *ive = &xive->ivt[i];
>> +        if (ive->w & IVE_VALID) {
>> +            ive->w |= IVE_MASKED;
>> +        }
>> +    }
>> +}
>> +
>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +
>> +    if (!xive->nr_irqs) {
>> +        error_setg(errp, "Number of interrupts needs to be greater than 0");
>> +        return;
>> +    }
>> +
>> +    /* Allocate the IVT (Interrupt Virtualization Table) */
>> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>> +}
>> +
>> +static const VMStateDescription vmstate_spapr_xive_ive = {
>> +    .name = TYPE_SPAPR_XIVE "/ive",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField []) {
>> +        VMSTATE_UINT64(w, XiveIVE),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static bool vmstate_spapr_xive_needed(void *opaque)
>> +{
>> +    /* TODO check machine XIVE support */
>> +    return true;
>> +}
>> +
>> +static const VMStateDescription vmstate_spapr_xive = {
>> +    .name = TYPE_SPAPR_XIVE,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .needed = vmstate_spapr_xive_needed,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>> +        VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
>> +                                     vmstate_spapr_xive_ive, XiveIVE),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static Property spapr_xive_properties[] = {
>> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->realize = spapr_xive_realize;
>> +    dc->reset = spapr_xive_reset;
>> +    dc->props = spapr_xive_properties;
>> +    dc->desc = "sPAPR XIVE interrupt controller";
>> +    dc->vmsd = &vmstate_spapr_xive;
>> +}
>> +
>> +static const TypeInfo spapr_xive_info = {
>> +    .name = TYPE_SPAPR_XIVE,
>> +    .parent = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(sPAPRXive),
>> +    .class_init = spapr_xive_class_init,
>> +};
>> +
>> +static void spapr_xive_register_types(void)
>> +{
>> +    type_register_static(&spapr_xive_info);
>> +}
>> +
>> +type_init(spapr_xive_register_types)
>> +
>> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>> +}
>> +
>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
>> +
>> +    if (!ive) {
>> +        return false;
>> +    }
>> +
>> +    ive->w |= IVE_VALID;
>> +    return true;
>> +}
>> +
>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
>> +
>> +    if (!ive) {
>> +        return false;
>> +    }
>> +
>> +    ive->w &= ~IVE_VALID;
>> +    return true;
>> +}
>> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
>> new file mode 100644
>> index 000000000000..132b71a6daf0
>> --- /dev/null
>> +++ b/hw/intc/xive-internal.h
>> @@ -0,0 +1,41 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2016-2017, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef _INTC_XIVE_INTERNAL_H
>> +#define _INTC_XIVE_INTERNAL_H
>> +
>> +/* Utilities to manipulate these (originally from OPAL) */
>> +#define MASK_TO_LSH(m)          (__builtin_ffsl(m) - 1)
>> +#define GETFIELD(m, v)          (((v) & (m)) >> MASK_TO_LSH(m))
>> +#define SETFIELD(m, v, val)                             \
>> +        (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
>> +
>> +/* IVE/EAS
>> + *
>> + * One per interrupt source. Targets that interrupt to a given EQ
>> + * and provides the corresponding logical interrupt number (EQ data)
>> + *
>> + * We also map this structure to the escalation descriptor inside
>> + * an EQ, though in that case the valid and masked bits are not used.
>> + */
>> +typedef struct XiveIVE {
>> +        /* Use a single 64-bit definition to make it easier to
>> +         * perform atomic updates
>> +         */
>> +        uint64_t        w;
>> +#define IVE_VALID       PPC_BIT(0)
>> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
>> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
>> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
>> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>> +} XiveIVE;
>> +
>> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
>> +
>> +#endif /* _INTC_XIVE_INTERNAL_H */
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> new file mode 100644
>> index 000000000000..5b1f78e06a1e
>> --- /dev/null
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -0,0 +1,35 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_SPAPR_XIVE_H
>> +#define PPC_SPAPR_XIVE_H
>> +
>> +#include <hw/sysbus.h>
>> +
>> +typedef struct sPAPRXive sPAPRXive;
>> +typedef struct XiveIVE XiveIVE;
>> +
>> +#define TYPE_SPAPR_XIVE "spapr-xive"
>> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>> +
>> +struct sPAPRXive {
>> +    SysBusDevice parent;
>> +
>> +    /* Properties */
>> +    uint32_t     nr_irqs;
>> +
>> +    /* XIVE internal tables */
>> +    XiveIVE      *ivt;
>> +};
>> +
>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>> +
>> +#endif /* PPC_SPAPR_XIVE_H */
> 


* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-20  5:22   ` David Gibson
@ 2017-12-20  7:54     ` Cédric Le Goater
  2017-12-20 18:08       ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-20  7:54 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 12/20/2017 06:22 AM, David Gibson wrote:
> On Sat, Dec 09, 2017 at 09:43:22AM +0100, Cédric Le Goater wrote:
>> Each XIVE interrupt source is associated with a two bit state machine
>> called an Event State Buffer (ESB) : the first bit "P" means that an
>> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
>> means a new interrupt was triggered while another was still pending.
>>
>> When an event is triggered, the associated interrupt state bits are
>> fetched and modified and forwarded to the virtualization engine of the
>> controller doing the routing. These can also be controlled by MMIO, to
>> trigger events or turn off the sources for instance. See code for more
>> details on the states and transitions.
>>
>> The MMIO space for the ESBs is 512GB large on the bare-metal system
>> (PowerNV) and the BAR depends on the chip id. In our model for the
>> sPAPR machine, we choose to only map the sub-region for the
>> provisioned IRQ numbers and to use the mapping address of chip 0 of a
>> real system.
> 
> I think we probably want a device property to make the virtualized
> base address arbitrary.  It's fine for it to default to the chip 0
> base, but that'll make it easier to adapt if we need to later on.

Yes. We can add a "bar" property for this purpose, as is done for
some of the pnv models.
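
To make the address arithmetic explicit, here is a small standalone
sketch of how the ESB page of a source would be derived from an
arbitrary base, and how the MMIO handlers decode an offset back into
a source number and an operation. ESB_SHIFT and the 0xF00 mask come
from the patch; the helper names and the "bar" parameter are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define ESB_SHIFT 16 /* one 64k page per source, as in the patch */

/* Hypothetical helper: address of the ESB page of source 'lisn' when
 * the region is mapped at 'bar' */
static uint64_t esb_page_addr(uint64_t bar, uint32_t lisn)
{
    return bar + ((uint64_t)lisn << ESB_SHIFT);
}

/* Source number from an offset in the ESB region */
static uint32_t esb_offset_to_lisn(uint64_t offset)
{
    return (uint32_t)(offset >> ESB_SHIFT);
}

/* "Magic" operation offset within the page (0x000, 0x400, 0x800, ...) */
static uint32_t esb_offset_to_op(uint64_t offset)
{
    return (uint32_t)(offset & 0xF00);
}

/* PQ value encoded in the 0xc00..0xf00 "set PQ" operations */
static uint8_t esb_op_to_pq(uint32_t op)
{
    return (uint8_t)((op >> 8) & 0x3);
}
```

With such a helper, the value reported to the guest for a source only
depends on the property, and the handlers keep working on offsets
relative to the start of the region.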

> As noted in the followup messages, I think you're going to want to
> move this stuff from the current xive object into a "block of sources"
> object.

Yes. I now have a new XIVE source model for the POWER9 PSIHB
controller. It should help to find common ground. This is what I
added to support XIVE in the current PSIHB:

  +    /* P9 */
  +    MemoryRegion esb_iomem;
  +    uint8_t sbe[4]; /* enough for 13 P&Q bits */
  +    uint32_t ivt_offset;

The ESB region mapping is handled at the machine level as it depends 
on the chip id.

The 'ivt_offset' is only used to forward the event notification to
the routing engine:

  +static void pnv_psi_notify(PnvPsi *psi, uint32_t lisn)
  +{
  +    uint64_t notif_port =
  +        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
  +    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
  +    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
  +    uint32_t data = cpu_to_be32(psi->ivt_offset | lisn);
  +
  +    if (valid) {
  +        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
  +    }
  +}

So it really depends on the controller type. I think that could be a
class handler.
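
As a rough illustration of what such a handler could look like, here
is a standalone sketch using a plain function pointer rather than a
QOM class. All names (XiveSrc, xive_src_forward, the two notify
routines) are hypothetical, and the "routing engine" is reduced to a
field recording the last value it received: the sPAPR flavour passes
the source number directly, the PSI-like flavour adds its
'ivt_offset' as pnv_psi_notify() does above.

```c
#include <assert.h>
#include <stdint.h>

typedef struct XiveSrc XiveSrc;

/* Hypothetical per-controller hook forwarding an event for routing */
typedef void (*xive_src_notify_fn)(XiveSrc *src, uint32_t lisn);

struct XiveSrc {
    xive_src_notify_fn notify;
    uint32_t ivt_offset;     /* only meaningful for the PSI-like case */
    uint32_t last_notified;  /* stand-in for the routing engine input */
};

/* sPAPR-like source: direct call, the number is used as is */
static void papr_src_notify(XiveSrc *src, uint32_t lisn)
{
    src->last_notified = lisn;
}

/* PSI-like source: the number is combined with the IVT offset before
 * being sent to the notify port */
static void psi_src_notify(XiveSrc *src, uint32_t lisn)
{
    src->last_notified = src->ivt_offset | lisn;
}

/* Common code path, e.g. called once the PQ bits let the event through */
static void xive_src_forward(XiveSrc *src, uint32_t lisn)
{
    src->notify(src, lisn);
}
```

The trigger and EOI logic would then be shared, and only the last
step, the notification itself, would differ between controllers.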

Thanks,

C. 


> Apart from that this looks pretty sound.
> 
>> In the real world, each source may have different characteristics
>> depending on the revision of a controller or the CPU. Early systems
>> had two different MMIO pages for trigger and for EOI. We choose to use
>> the same characteristics for all sources to simplify the model. The
>> minimum CPU level for XIVE exploitation mode will be DD2.X as it has
>> full support.
>>
>> The OS will obtain the address of the MMIO page of the ESB entry
>> associated with a source and its characteristic using the
>> H_INT_GET_SOURCE_INFO hcall. This will be addressed in the patch
>> introducing the hcalls.
>>
>> The spapr_xive_irq() routine in charge of triggering the CPU interrupt
>> line will be filled later on.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>
>>  Changes since v1:
>>
>>  - merged in the same patch the qemu_irq handlers
>>  - reworked the event notification logic of the qemu_irq handlers.  
>>  - introduced XIVE_ESB_STORE_EOI support
>>  - removed 'esb_shift' field 
>>  - removed a useless check on the validity of the IVE in the memory
>>    region handlers.
>>  - fixed spapr_xive_pq_trigger() to return true when XIVE_ESB_QUEUED
>>    is set
>>  - removed the overall ESB memory region. We now have only one region
>>    for the provisioned sources.
>>  - improved 'info pic' output
>>
>>  hw/intc/spapr_xive.c        | 254 +++++++++++++++++++++++++++++++++++++++++++-
>>  hw/intc/xive-internal.h     |  10 ++
>>  include/hw/ppc/spapr_xive.h |   9 ++
>>  3 files changed, 271 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index e6e8841add17..43df6814619d 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -18,23 +18,252 @@
>>  
>>  #include "xive-internal.h"
>>  
>> +static void spapr_xive_irq(sPAPRXive *xive, int lisn)
>> +{
>> +
>> +}
>> +
>>  /*
>> - * Main XIVE object
>> + * XIVE Interrupt Source
>> + */
>> +
>> +/*
>> + * "magic" Event State Buffer (ESB) MMIO offsets.
>> + *
>> + * Each interrupt source has a 2-bit state machine called ESB
>> + * which can be controlled by MMIO. It's made of 2 bits, P and
>> + * Q. P indicates that an interrupt is pending (has been sent
>> + * to a queue and is waiting for an EOI). Q indicates that the
>> + * interrupt has been triggered while pending.
>> + *
>> + * This acts as a coalescing mechanism in order to guarantee
>> + * that a given interrupt only occurs at most once in a queue.
>> + *
>> + * When doing an EOI, the Q bit will indicate if the interrupt
>> + * needs to be re-triggered.
>> + *
>> + * The following offsets into the ESB MMIO allow to read or
>> + * manipulate the PQ bits. They must be used with an 8-bytes
>> + * load instruction. They all return the previous state of the
>> + * interrupt (atomically).
>> + *
>> + * Additionally, some ESB pages support doing an EOI via a
>> + * store at 0 and some ESBs support doing a trigger via a
>> + * separate trigger page.
>> + */
>> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
>> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
>> +#define XIVE_ESB_GET            0x800 /* Load */
>> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
>> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
>> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
>> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
>> +
>> +#define XIVE_ESB_VAL_P          0x2
>> +#define XIVE_ESB_VAL_Q          0x1
>> +
>> +#define XIVE_ESB_RESET          0x0
>> +#define XIVE_ESB_PENDING        XIVE_ESB_VAL_P
>> +#define XIVE_ESB_QUEUED         (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
>> +#define XIVE_ESB_OFF            XIVE_ESB_VAL_Q
>> +
>> +static uint8_t spapr_xive_pq_get(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    uint32_t byte = lisn / 4;
>> +    uint32_t bit  = (lisn % 4) * 2;
>> +
>> +    assert(byte < xive->sbe_size);
>> +
>> +    return (xive->sbe[byte] >> bit) & 0x3;
>> +}
>> +
>> +static uint8_t spapr_xive_pq_set(sPAPRXive *xive, uint32_t lisn, uint8_t pq)
>> +{
>> +    uint32_t byte = lisn / 4;
>> +    uint32_t bit  = (lisn % 4) * 2;
>> +    uint8_t old, new;
>> +
>> +    assert(byte < xive->sbe_size);
>> +
>> +    old = xive->sbe[byte];
>> +
>> +    new = xive->sbe[byte] & ~(0x3 << bit);
>> +    new |= (pq & 0x3) << bit;
>> +
>> +    xive->sbe[byte] = new;
>> +
>> +    return (old >> bit) & 0x3;
>> +}
>> +
>> +static bool spapr_xive_pq_eoi(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
>> +
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_RESET);
>> +        return false;
>> +    case XIVE_ESB_PENDING:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_RESET);
>> +        return false;
>> +    case XIVE_ESB_QUEUED:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
>> +        return true;
>> +    case XIVE_ESB_OFF:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_OFF);
>> +        return false;
>> +    default:
>> +         g_assert_not_reached();
>> +    }
>> +}
>> +
>> +/*
>> + * Returns whether the event notification should be forwarded to the
>> + * IVE for routing.
>>   */
>> +static bool spapr_xive_pq_trigger(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    uint8_t old_pq = spapr_xive_pq_get(xive, lisn);
>>  
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_PENDING);
>> +        return true;
>> +    case XIVE_ESB_PENDING:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_QUEUED);
>> +        return false;
>> +    case XIVE_ESB_QUEUED:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_QUEUED);
>> +        return false;
>> +    case XIVE_ESB_OFF:
>> +        spapr_xive_pq_set(xive, lisn, XIVE_ESB_OFF);
>> +        return false;
>> +    default:
>> +         g_assert_not_reached();
>> +    }
>> +}
>> +
>> +/*
>> + * XIVE Interrupt Source MMIOs
>> + */
>> +
>> +/*
>> + * Some HW use a separate page for trigger. We only support the case
>> + * in which the trigger can be done in the same page as the EOI.
>> + */
>> +static uint64_t spapr_xive_esb_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(opaque);
>> +    uint32_t offset = addr & 0xF00;
>> +    uint32_t lisn = addr >> ESB_SHIFT;
>> +    uint64_t ret = -1;
>> +
>> +    switch (offset) {
>> +    case XIVE_ESB_LOAD_EOI:
>> +        /*
>> +         * EOI on load is not used anymore as we now advertise
>> +         * XIVE_ESB_STORE_EOI support for the interrupt sources
>> +         */
>> +        ret = spapr_xive_pq_eoi(xive, lisn);
>> +        break;
>> +
>> +    case XIVE_ESB_GET:
>> +        ret = spapr_xive_pq_get(xive, lisn);
>> +        break;
>> +
>> +    case XIVE_ESB_SET_PQ_00:
>> +    case XIVE_ESB_SET_PQ_01:
>> +    case XIVE_ESB_SET_PQ_10:
>> +    case XIVE_ESB_SET_PQ_11:
>> +        ret = spapr_xive_pq_set(xive, lisn, (offset >> 8) & 0x3);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static void spapr_xive_esb_write(void *opaque, hwaddr addr,
>> +                                 uint64_t value, unsigned size)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(opaque);
>> +    uint32_t offset = addr & 0xF00;
>> +    uint32_t lisn = addr >> ESB_SHIFT;
>> +    bool notify = false;
>> +
>> +    switch (offset) {
>> +    case 0:
>> +        notify = spapr_xive_pq_trigger(xive, lisn);
>> +        break;
>> +    case XIVE_ESB_STORE_EOI:
>> +        /* If the Q bit is set, we should forward a new source event
>> +         * notification
>> +         */
>> +        notify = spapr_xive_pq_eoi(xive, lisn);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
>> +                      offset);
>> +        return;
>> +    }
>> +
>> +    /* Forward the source event notification for routing */
>> +    if (notify) {
>> +        spapr_xive_irq(xive, lisn);
>> +    }
>> +}
>> +
>> +static const MemoryRegionOps spapr_xive_esb_ops = {
>> +    .read = spapr_xive_esb_read,
>> +    .write = spapr_xive_esb_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static void spapr_xive_source_set_irq(void *opaque, int lisn, int val)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(opaque);
>> +    bool notify = false;
>> +
>> +    if (val) {
>> +        notify = spapr_xive_pq_trigger(xive, lisn);
>> +    }
>> +
>> +    /* Forward the source event notification for routing */
>> +    if (notify) {
>> +        spapr_xive_irq(xive, lisn);
>> +    }
>> +}
>> +
>> +/*
>> + * Main XIVE object
>> + */
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>  {
>>      int i;
>>  
>>      for (i = 0; i < xive->nr_irqs; i++) {
>>          XiveIVE *ive = &xive->ivt[i];
>> +        uint8_t pq;
>>  
>>          if (!(ive->w & IVE_VALID)) {
>>              continue;
>>          }
>>  
>> -        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
>> +        pq = spapr_xive_pq_get(xive, i);
>> +
>> +        monitor_printf(mon, "  %4x %s %c%c %08x %08x\n", i,
>>                         ive->w & IVE_MASKED ? "M" : " ",
>> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
>>                         (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>>                         (int) GETFIELD(IVE_EQ_DATA, ive->w));
>>      }
>> @@ -52,6 +281,9 @@ static void spapr_xive_reset(DeviceState *dev)
>>              ive->w |= IVE_MASKED;
>>          }
>>      }
>> +
>> +    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>> +    memset(xive->sbe, 0x55, xive->sbe_size);
>>  }
>>  
>>  static void spapr_xive_realize(DeviceState *dev, Error **errp)
>> @@ -65,6 +297,23 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>  
>>      /* Allocate the IVT (Interrupt Virtualization Table) */
>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>> +
>> +    /* QEMU IRQs */
>> +    xive->qirqs = qemu_allocate_irqs(spapr_xive_source_set_irq, xive,
>> +                                     xive->nr_irqs);
>> +
>> +    /* Allocate SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>> +    xive->sbe_size = DIV_ROUND_UP(xive->nr_irqs, 4);
>> +    xive->sbe = g_malloc0(xive->sbe_size);
>> +
>> +    /* VC BAR. Use address of chip 0 to install the ESB memory region
>> +     * for *all* interrupt sources */
>> +    xive->esb_base = (P9_MMIO_BASE | VC_BAR_DEFAULT);
>> +
>> +    memory_region_init_io(&xive->esb_iomem, OBJECT(xive),
>> +                          &spapr_xive_esb_ops, xive, "xive.esb",
>> +                          (1ull << ESB_SHIFT) * xive->nr_irqs);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->esb_iomem);
>>  }
>>  
>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>> @@ -92,6 +341,7 @@ static const VMStateDescription vmstate_spapr_xive = {
>>          VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>>          VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
>>                                       vmstate_spapr_xive_ive, XiveIVE),
>> +        VMSTATE_VBUFFER_UINT32(sbe, sPAPRXive, 1, NULL, sbe_size),
>>          VMSTATE_END_OF_LIST()
>>      },
>>  };
>> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
>> index 132b71a6daf0..872648dd96a2 100644
>> --- a/hw/intc/xive-internal.h
>> +++ b/hw/intc/xive-internal.h
>> @@ -16,6 +16,16 @@
>>  #define SETFIELD(m, v, val)                             \
>>          (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
>>  
>> +/*
>> + * XIVE MMIO regions
>> + */
>> +#define P9_MMIO_BASE     0x006000000000000ull
>> +
>> +/* VC BAR contains set translations for the ESBs and the EQs. */
>> +#define VC_BAR_DEFAULT   0x10000000000ull
>> +#define VC_BAR_SIZE      0x08000000000ull
>> +#define ESB_SHIFT        16 /* One 64k page. OPAL has two */
>> +
>>  /* IVE/EAS
>>   *
>>   * One per interrupt source. Targets that interrupt to a given EQ
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 5b1f78e06a1e..ecc15d889b74 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -24,8 +24,17 @@ struct sPAPRXive {
>>      /* Properties */
>>      uint32_t     nr_irqs;
>>  
>> +    /* IRQ */
>> +    qemu_irq     *qirqs;
>> +
>>      /* XIVE internal tables */
>>      XiveIVE      *ivt;
>> +    uint8_t      *sbe;
>> +    uint32_t     sbe_size;
>> +
>> +    /* ESB memory region */
>> +    hwaddr       esb_base;
>> +    MemoryRegion esb_iomem;
>>  };
>>  
>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
> 


* Re: [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources
  2017-12-20  7:54     ` Cédric Le Goater
@ 2017-12-20 18:08       ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-20 18:08 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 12/20/2017 08:54 AM, Cédric Le Goater wrote:
> On 12/20/2017 06:22 AM, David Gibson wrote:
>> On Sat, Dec 09, 2017 at 09:43:22AM +0100, Cédric Le Goater wrote:
>>> Each XIVE interrupt source is associated with a two bit state machine
>>> called an Event State Buffer (ESB) : the first bit "P" means that an
>>> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
>>> means a new interrupt was triggered while another was still pending.
>>>
>>> When an event is triggered, the associated interrupt state bits are
>>> fetched and modified and forwarded to the virtualization engine of the
>>> controller doing the routing. These can also be controlled by MMIO, to
>>> trigger events or turn off the sources for instance. See code for more
>>> details on the states and transitions.
>>>
>>> The MMIO space for the ESBs is 512GB large on the bare-metal system
>>> (PowerNV) and the BAR depends on the chip id. In our model for the
>>> sPAPR machine, we choose to only map the sub-region for the
>>> provisioned IRQ numbers and to use the mapping address of chip 0 of a
>>> real system.
>>
>> I think we probably want a device property to make the virtualized
>> base address arbitrary.  It's fine for it to default to the chip 0
>> base, but that'll make it easier to adapt if we need to later on.
> 
> yes. We can add a "bar" property for this purpose like for some of 
> the pnv models
> 
>> As noted in the followup messages, I think you're going to want to
>> move this stuff from the current xive object into a "block of sources"
>> object.

I have (re)introduced a XiveSource object. There is only a single
instance, placed under the sPAPRXive object (because it is easier to
create). Adding a list of sources should not be too problematic if needed.

So the XiveSource is generic and I hope to be able to do the same for 
the presenter. 

Just like for XICS, I am also adding a :

	typedef struct XiveFabricClass {
	    InterfaceClass parent;
	    void (*notify)(XiveFabric *xive, int lisn);
	} XiveFabricClass;

which we can use for both the pnv and pseries machines. The fabric is
not the machine itself, though: it is the XIVE routing engine, an object
below it.
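
To make the split concrete, here is a rough plain-C sketch without QOM.
The DemoFabric, DemoRouter and DemoSource types and all demo_* names are
hypothetical; only the shape of the notify hook follows the typedef
above. The source knows nothing but the notify hook, and the routing
engine is the object that implements it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the fabric interface: a single notify hook. */
typedef struct DemoFabric DemoFabric;
struct DemoFabric {
    void (*notify)(DemoFabric *fabric, int lisn);
};

/* Hypothetical routing engine. The fabric is embedded as first member. */
typedef struct DemoRouter {
    DemoFabric fabric;
    int last_lisn;      /* last source number forwarded for routing */
    int nr_notified;    /* number of notifications received so far */
} DemoRouter;

static void demo_router_notify(DemoFabric *fabric, int lisn)
{
    /* Valid cast: 'fabric' is the first member of DemoRouter. */
    DemoRouter *router = (DemoRouter *)fabric;

    router->last_lisn = lisn;
    router->nr_notified++;
}

static void demo_router_init(DemoRouter *router)
{
    router->fabric.notify = demo_router_notify;
    router->last_lisn = -1;
    router->nr_notified = 0;
}

/* Hypothetical source: it only holds a pointer to the fabric. */
typedef struct DemoSource {
    DemoFabric *fabric;
} DemoSource;

/* Forward an event notification for 'lisn' to whoever does the routing. */
static void demo_source_trigger(DemoSource *src, int lisn)
{
    if (src->fabric && src->fabric->notify) {
        src->fabric->notify(src->fabric, lisn);
    }
}
```

With this layout the same source code could be attached to a pnv or a
pseries router, since it never looks past the notify hook.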

C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-20  5:09   ` David Gibson
  2017-12-20  7:38     ` Cédric Le Goater
@ 2017-12-21  0:12     ` Benjamin Herrenschmidt
  2017-12-21  9:16       ` Cédric Le Goater
  2018-04-12  5:08       ` David Gibson
  1 sibling, 2 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2017-12-21  0:12 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
> 
> As you've suggested in yourself, I think we might need to more
> explicitly model the different components of the XIVE system.  As part
> of that, I think you need to be clearer in this base skeleton about
> exactly what component your XIVE object represents.
> 
> If the answer is "the overall thing" I suspect that's not what you
> want - I had one of those for XICs which proved to be a mistake
> (eventually replaced by the XICSFabric interface).
> 
> Changing the model later isn't impossible, but doing so without
> breaking migration can be a real pain, so I think it's worth a
> reasonable effort to try and get it right initially.

Note: we do need to speed things up a bit, as having exploitation mode
in KVM will significantly help with IPI performance among other things.

I'm about ready to do the KVM bits. The one thing we need to discuss
and figure out a good design for is how we map all those interrupt
control pages into qemu.

Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
which are used for guest IPIs and for vio/virtio/emulated interrupts)
comes with a "control page" (ESB page) which needs to be mapped into
the guest, and the generic IPIs also come with a trigger page which
needs to be mapped into the guest for guest IPIs or OpenCAPI
interrupts, or just qemu for emulated devices.

Now there can be thousands of these critters. I certainly don't want to
create thousands of VMAs in qemu and even less thousands of memory
regions in KVM.

So we need some kind of mechanism by which a single large VMA gets
mmap'ed into qemu (or maybe a couple of these, but not too many) and
the interrupt pages can be assigned to slots in there and demand
faulted.
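
As a minimal sketch of the slot arithmetic only (not of the fault
handling itself), assuming one 64k page per interrupt as with ESB_SHIFT
in the patch: the DemoEsbWindow type and the demo_* helpers below are
hypothetical names. A slot index maps to a fixed address inside the
single window, and a faulting address maps back to its slot.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT 16  /* one 64k page per slot (assumption) */

/* Hypothetical description of the single large mapping. */
typedef struct DemoEsbWindow {
    uint64_t base;      /* start address of the window */
    uint32_t nr_slots;  /* number of interrupt pages it can hold */
} DemoEsbWindow;

/* Compute the address of a slot's page. Returns false if out of range. */
static bool demo_slot_addr(const DemoEsbWindow *w, uint32_t slot,
                           uint64_t *addr)
{
    if (slot >= w->nr_slots) {
        return false;
    }
    *addr = w->base + ((uint64_t)slot << DEMO_PAGE_SHIFT);
    return true;
}

/* Map a (faulting) address back to its slot, or -1 if outside the window. */
static int64_t demo_addr_to_slot(const DemoEsbWindow *w, uint64_t addr)
{
    uint64_t size = (uint64_t)w->nr_slots << DEMO_PAGE_SHIFT;

    if (addr < w->base || addr - w->base >= size) {
        return -1;
    }
    return (int64_t)((addr - w->base) >> DEMO_PAGE_SHIFT);
}
```

A fault handler would use the reverse lookup to find which interrupt
page to back the faulting address with; that part is left out here.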

For the generic interrupts, this can probably be covered by KVM, adding
some arch ioctls for allocating IPIs and mmap'ing that region etc...

For pass-through, it's trickier, we don't want to mmap each irqfd
individually for the above reason, so we want to "link" them to KVM. We
don't want to allow qemu to take control of any arbitrary interrupt in
the system though, so it has to be related to the ownership of the irqfd
coming from vfio.

OpenCAPI I suspect will be its own can of worms...

Also, have we decided how the process of switching between XICS and
XIVE will work vs. CAS ? And how that will interact with KVM ? I was
thinking the kernel would implement a different KVM device type, ie
the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
KVM_DEV_TYPE_XIVE.

Cheers,
Ben.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-21  0:12     ` Benjamin Herrenschmidt
@ 2017-12-21  9:16       ` Cédric Le Goater
  2017-12-21 10:09         ` Cédric Le Goater
  2017-12-21 22:53         ` Benjamin Herrenschmidt
  2018-04-12  5:08       ` David Gibson
  1 sibling, 2 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-21  9:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>>
>> As you've suggested in yourself, I think we might need to more
>> explicitly model the different components of the XIVE system.  As part
>> of that, I think you need to be clearer in this base skeleton about
>> exactly what component your XIVE object represents.
>>
>> If the answer is "the overall thing" I suspect that's not what you
>> want - I had one of those for XICs which proved to be a mistake
>> (eventually replaced by the XICSFabric interface).
>>
>> Changing the model later isn't impossible, but doing so without
>> breaking migration can be a real pain, so I think it's worth a
>> reasonable effort to try and get it right initially.
> 
> Note: we do need to speed things up a bit, as having exploitation mode
> in KVM will significantly help with IPI performance among other things.
> 
> I'm about ready to do the KVM bits. The one thing we need to discuss
> and figure a good design for is how we map all those interrupt control
> pages into qemu.
> 
> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
> which are used for guest IPIs and for vio/virtio/emulated interrupts)
> comes with a "control page" (ESB page) which needs to be mapped into
> the guest, and the generic IPIs also come with a trigger page which
> needs to be mapped into the guest for guest IPIs or OpenCAPI
> interrupts, or just qemu for emulated devices.

What about the OS TIMA page? Do we trap the accesses in QEMU and
forward them to KVM, or do we use a similar mechanism?

> Now that can be thousands of these critters. I certainly don't want to
> create thousands of VMAs in qemu and even less thousands of memory
> regions in KVM.

We could provision one mapping per kvmppc_xive_src_block, maybe?

> So we need some kind of mechanism by wich a single large VMA gets
> mmap'ed into qemu (or maybe a couple of these, but not too many) and
> the interrupt pages can be assigned to slots in there and demand
> faulted.

Frederic has started to put in place a similar mechanism for OpenCAPI.

> For the generic interrupts, this can probably be covered by KVM, adding
> some arch ioctls for allocating IPIs and mmap'ing that region etc...

The KVM device has an ioctl handler:
   
	struct kvm_device_ops {

		long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
			      unsigned long arg);
	};

So a KVM device for the XIVE interrupt controller can implement a couple
of extra calls for its needs, such as getting the VMA addresses.

> For pass-through, it's trickier, we don't want to mmap each irqfd
> individually for the above reason, so we want to "link" them to KVM. We
> don't want to allow qemu to take control of any arbitrary interrupt in
> the system though, so it has to related to the ownership of the irqfd
> coming from vfio.
> 
> OpenCAPI I suspect will be its own can of worms...
> 
> Also, have we decided how the process of switching between XICS and
> XIVE will work vs. CAS ? 

That's how it is described in the architecture. The current choice is
to create both XICS and XIVE objects and choose at CAS which one to
use. It relies today on the capability of the pseries machine to 
allocate IRQ numbers for both interrupt controller backends. These
patches have been merged in QEMU.

A change of interrupt mode results in a reset. The device tree is 
populated accordingly and the ICPs are switched for the model in 
use. 

> And how that will interact with KVM ? 

I expect we will do the same, which is to create two KVM devices to 
be able to handle both interrupt controller backends depending on the 
mode negotiated by the guest.  


> I was
> thinking the kernel would implement a different KVM device type, ie
> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> KVM_DEV_TYPE_XIVE.

Yes, it makes sense. The new device will have a lot in common with the
KVM_DEV_TYPE_XICS using kvm_xive_ops.

C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-21  9:16       ` Cédric Le Goater
@ 2017-12-21 10:09         ` Cédric Le Goater
  2017-12-21 22:53         ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2017-12-21 10:09 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 12/21/2017 10:16 AM, Cédric Le Goater wrote:
> On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
>> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>>>
>>> As you've suggested in yourself, I think we might need to more
>>> explicitly model the different components of the XIVE system.  As part
>>> of that, I think you need to be clearer in this base skeleton about
>>> exactly what component your XIVE object represents.
>>>
>>> If the answer is "the overall thing" I suspect that's not what you
>>> want - I had one of those for XICs which proved to be a mistake
>>> (eventually replaced by the XICSFabric interface).
>>>
>>> Changing the model later isn't impossible, but doing so without
>>> breaking migration can be a real pain, so I think it's worth a
>>> reasonable effort to try and get it right initially.
>>
>> Note: we do need to speed things up a bit, as having exploitation mode
>> in KVM will significantly help with IPI performance among other things.
>>
>> I'm about ready to do the KVM bits. The one thing we need to discuss
>> and figure a good design for is how we map all those interrupt control
>> pages into qemu.
>>
>> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
>> which are used for guest IPIs and for vio/virtio/emulated interrupts)
>> comes with a "control page" (ESB page) which needs to be mapped into
>> the guest, and the generic IPIs also come with a trigger page which
>> needs to be mapped into the guest for guest IPIs or OpenCAPI
>> interrupts, or just qemu for emulated devices.
> 
> what about the OS TIMA page ? Do we trap the accesses in QEMU and
> forward them to KVM ? or do we use a similar mechanism. 
> 
>> Now that can be thousands of these critters. I certainly don't want to
>> create thousands of VMAs in qemu and even less thousands of memory
>> regions in KVM.
> 
> we can provision one mapping per kvmppc_xive_src_block  maybe ?  
> 
>> So we need some kind of mechanism by wich a single large VMA gets
>> mmap'ed into qemu (or maybe a couple of these, but not too many) and
>> the interrupt pages can be assigned to slots in there and demand
>> faulted.
> 
> Frederic has started to put in place a similar mecanism for OpenCAPI.
> 
>> For the generic interrupts, this can probably be covered by KVM, adding
>> some arch ioctls for allocating IPIs and mmap'ing that region etc...
> 
> The KVM device has a ioctl handler :
>    
> 	struct kvm_device_ops {
> 
> 		long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> 			      unsigned long arg);
> 	};
> 
> So a KVM device for the XIVE interrupt controller can implement a couple 
> of extra calls for its need, like getting the VMA addresses, etc

Or use set/get_attr.

I wonder if it would be possible to add an 'mmap' op to kvm_device_fops
for the KVM_DEV_TYPE_XIVE device.

C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-21  9:16       ` Cédric Le Goater
  2017-12-21 10:09         ` Cédric Le Goater
@ 2017-12-21 22:53         ` Benjamin Herrenschmidt
  2018-01-17  9:18           ` Cédric Le Goater
  1 sibling, 1 reply; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2017-12-21 22:53 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On Thu, 2017-12-21 at 10:16 +0100, Cédric Le Goater wrote:
> On 12/21/2017 01:12 AM, Benjamin Herrenschmidt wrote:
> > On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
> > > 
> > > As you've suggested in yourself, I think we might need to more
> > > explicitly model the different components of the XIVE system.  As part
> > > of that, I think you need to be clearer in this base skeleton about
> > > exactly what component your XIVE object represents.
> > > 
> > > If the answer is "the overall thing" I suspect that's not what you
> > > want - I had one of those for XICs which proved to be a mistake
> > > (eventually replaced by the XICSFabric interface).
> > > 
> > > Changing the model later isn't impossible, but doing so without
> > > breaking migration can be a real pain, so I think it's worth a
> > > reasonable effort to try and get it right initially.
> > 
> > Note: we do need to speed things up a bit, as having exploitation mode
> > in KVM will significantly help with IPI performance among other things.
> > 
> > I'm about ready to do the KVM bits. The one thing we need to discuss
> > and figure a good design for is how we map all those interrupt control
> > pages into qemu.
> > 
> > Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
> > which are used for guest IPIs and for vio/virtio/emulated interrupts)
> > comes with a "control page" (ESB page) which needs to be mapped into
> > the guest, and the generic IPIs also come with a trigger page which
> > needs to be mapped into the guest for guest IPIs or OpenCAPI
> > interrupts, or just qemu for emulated devices.
> 
> what about the OS TIMA page ? Do we trap the accesses in QEMU and
> forward them to KVM ? or do we use a similar mechanism. 

No, no, we'll have an mmap facility for it in kvm but it worries me
less as there's only one of these and there's little damage qemu can do
having access to it :)
> 
> > Now that can be thousands of these critters. I certainly don't want to
> > create thousands of VMAs in qemu and even less thousands of memory
> > regions in KVM.
> 
> we can provision one mapping per kvmppc_xive_src_block  maybe ?  

Maybe. Last I looked, KVM's walk of memory regions was linear though. Mind
you it's not a huge deal if the guest RAM is always in the first
entries.

> > So we need some kind of mechanism by wich a single large VMA gets
> > mmap'ed into qemu (or maybe a couple of these, but not too many) and
> > the interrupt pages can be assigned to slots in there and demand
> > faulted.
> 
> Frederic has started to put in place a similar mecanism for OpenCAPI.

I know, though he made it rather OpenCAPI specific which is going to be
"interesting" when it comes to virtualizing OpenCAPI...

> > For the generic interrupts, this can probably be covered by KVM, adding
> > some arch ioctls for allocating IPIs and mmap'ing that region etc...
> 
> The KVM device has a ioctl handler :
>    
> 	struct kvm_device_ops {
> 
> 		long (*ioctl)(struct kvm_device *dev, unsigned int ioctl,
> 			      unsigned long arg);
> 	};
> 
> So a KVM device for the XIVE interrupt controller can implement a couple 
> of extra calls for its need, like getting the VMA addresses, etc
> 
> > For pass-through, it's trickier, we don't want to mmap each irqfd
> > individually for the above reason, so we want to "link" them to KVM. We
> > don't want to allow qemu to take control of any arbitrary interrupt in
> > the system though, so it has to related to the ownership of the irqfd
> > coming from vfio.
> > 
> > OpenCAPI I suspect will be its own can of worms...
> > 
> > Also, have we decided how the process of switching between XICS and
> > XIVE will work vs. CAS ? 
> 
> That's how it is described in the architecture. The current choice is
> to create both XICS and XIVE objects and choose at CAS which one to
> use. It relies today on the capability of the pseries machine to 
> allocate IRQ numbers for both interrupt controller backends. These
> patches have been merged in QEMU.
> 
> A change of interrupt mode results in a reset. The device tree is 
> populated accordingly and the ICPs are switched for the model in 
> use. 

For KVM we need to only instantiate one of them though.

> > And how that will interact with KVM ? 
> 
> I expect we will do the same, which is to create two KVM devices to 
> be able to handle both interrupt controller backends depending on the 
> mode negotiated by the guest.  

That will be an ungodly mess, I'd rather we only instantiate the right
one.

> > I was
> > thinking the kernel would implement a different KVM device type, ie
> > the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> > KVM_DEV_TYPE_XIVE.
> 
> yes. it makes sense. The new device will have a lot in common with the 
> KVM_DEV_TYPE_XICS using kvm_xive_ops.

Ben.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-21 22:53         ` Benjamin Herrenschmidt
@ 2018-01-17  9:18           ` Cédric Le Goater
  2018-01-17 11:10             ` Benjamin Herrenschmidt
  2018-04-12  5:10             ` David Gibson
  0 siblings, 2 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-01-17  9:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

>>> Also, have we decided how the process of switching between XICS and
>>> XIVE will work vs. CAS ? 
>>
>> That's how it is described in the architecture. The current choice is
>> to create both XICS and XIVE objects and choose at CAS which one to
>> use. It relies today on the capability of the pseries machine to 
>> allocate IRQ numbers for both interrupt controller backends. These
>> patches have been merged in QEMU.
>>
>> A change of interrupt mode results in a reset. The device tree is 
>> populated accordingly and the ICPs are switched for the model in 
>> use. 
> 
> For KVM we need to only instanciate one of them though.

Hmm,

How would we handle a guest rebooting on a kernel without XIVE support ? 
Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
process ? So, the machine would not have any interrupt controller before 
CAS. That seems really late to me. grub uses the console for instance. 

I think it should prepare for both options, start in XIVE legacy mode, 
which is XICS, then possibly switch to XIVE exploitation mode.
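
Roughly, as a sketch of the behaviour I have in mind and not of the
actual machine code (DemoMachine, DemoIrqMode and the demo_* functions
are hypothetical names): the machine always comes up in legacy mode, so
firmware and the bootloader have a working controller, and CAS only
reports whether the negotiated mode differs and a reset is needed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical interrupt modes of the machine. */
typedef enum {
    DEMO_IRQ_MODE_XICS,   /* legacy mode */
    DEMO_IRQ_MODE_XIVE    /* exploitation mode */
} DemoIrqMode;

/* Hypothetical machine state: only the mode currently in use. */
typedef struct DemoMachine {
    DemoIrqMode mode;
} DemoMachine;

/* Cold boot: always start in legacy (XICS) mode. */
static void demo_machine_boot(DemoMachine *m)
{
    m->mode = DEMO_IRQ_MODE_XICS;
}

/*
 * CAS negotiation: select the mode the guest asked for.
 * Returns true when the mode changed, meaning a reset is required.
 */
static bool demo_machine_cas(DemoMachine *m, bool guest_wants_xive)
{
    DemoIrqMode wanted = guest_wants_xive ? DEMO_IRQ_MODE_XIVE
                                          : DEMO_IRQ_MODE_XICS;

    if (wanted == m->mode) {
        return false;
    }
    m->mode = wanted;
    return true;
}
```

A guest rebooting on a kernel without XIVE support would then simply go
through demo_machine_boot() again and never leave legacy mode.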

>>> And how that will interact with KVM ? 
>>
>> I expect we will do the same, which is to create two KVM devices to 
>> be able to handle both interrupt controller backends depending on the 
>> mode negotiated by the guest.  
> 
> That will be an ungodly mess, I'd rather we only instanciate the right
> one.

It's rather transparent currently in the emulated version. There are two 
sets of objects in QEMU, switching is done in CAS. KVM support should not 
change anything in that area. 

I expect the 'xive-kvm' object to get/set states for migration, just like 
for XICS and to set up the ESB+TIMA memory regions, which is new. 

C. 
 
>>> I was
>>> thinking the kernel would implement a different KVM device type, ie
>>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
>>> KVM_DEV_TYPE_XIVE.
>>
>> yes. it makes sense. The new device will have a lot in common with the 
>> KVM_DEV_TYPE_XICS using kvm_xive_ops.
> 
> Ben.
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17  9:18           ` Cédric Le Goater
@ 2018-01-17 11:10             ` Benjamin Herrenschmidt
  2018-01-17 14:39               ` Cédric Le Goater
  2018-04-12  5:10             ` David Gibson
  1 sibling, 1 reply; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2018-01-17 11:10 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
> > > > Also, have we decided how the process of switching between XICS and
> > > > XIVE will work vs. CAS ? 
> > > 
> > > That's how it is described in the architecture. The current choice is
> > > to create both XICS and XIVE objects and choose at CAS which one to
> > > use. It relies today on the capability of the pseries machine to 
> > > allocate IRQ numbers for both interrupt controller backends. These
> > > patches have been merged in QEMU.
> > > 
> > > A change of interrupt mode results in a reset. The device tree is 
> > > populated accordingly and the ICPs are switched for the model in 
> > > use. 
> > 
> > For KVM we need to only instanciate one of them though.
> 
> Hmm,
> 
> How would we handle a guest rebooting on a kernel without XIVE support ? 

It will do CAS again and we can change the devices.

> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
> process ? So, the machine would not have any interrupt controller before 
> CAS. That seems really late to me. grub uses the console for instance. 

We start with XICS by default.

> I think it should prepare for both options, start in XIVE legacy mode, 
> which is XICS, then possibly switch to XIVE exploitation mode.
> 
> > > > And how that will interact with KVM ? 
> > > 
> > > I expect we will do the same, which is to create two KVM devices to 
> > > be able to handle both interrupt controller backends depending on the 
> > > mode negotiated by the guest.  
> > 
> > That will be an ungodly mess, I'd rather we only instanciate the right
> > one.
> 
> It's rather transparent currently in the emulated version. There are two 
> sets of objects in QEMU, switching is done in CAS. KVM support should not 
> change anything in that area. 
> 
> I expect the 'xive-kvm' object to get/set states for migration, just like 
> for XICS and to setup the ESB+TIMA memory regions, which is new. 

But both XICS and XIVE are completely different kernel KVM devices that will
need to "hook" into the same set of internal hooks for things like interrupts
being passed through, RTAS calls etc... 

How does KVM know which one to "activate" ?

I don't think the kernel should have both. 

> > > > I was
> > > > thinking the kernel would implement a different KVM device type, ie
> > > > the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> > > > KVM_DEV_TYPE_XIVE.
> > > 
> > > yes. it makes sense. The new device will have a lot in common with the 
> > > KVM_DEV_TYPE_XICS using kvm_xive_ops.
> > 
> > Ben.
> > 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17 11:10             ` Benjamin Herrenschmidt
@ 2018-01-17 14:39               ` Cédric Le Goater
  2018-01-17 17:57                 ` Cédric Le Goater
                                   ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-01-17 14:39 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
>>>>> Also, have we decided how the process of switching between XICS and
>>>>> XIVE will work vs. CAS ? 
>>>>
>>>> That's how it is described in the architecture. The current choice is
>>>> to create both XICS and XIVE objects and choose at CAS which one to
>>>> use. It relies today on the capability of the pseries machine to 
>>>> allocate IRQ numbers for both interrupt controller backends. These
>>>> patches have been merged in QEMU.
>>>>
>>>> A change of interrupt mode results in a reset. The device tree is 
>>>> populated accordingly and the ICPs are switched for the model in 
>>>> use. 
>>>
>>> For KVM we need to only instantiate one of them though.
>>
>> Hmm,
>>
>> How would we handle a guest rebooting on a kernel without XIVE support ? 
> 
> It will do CAS again and we can change the devices.

So, we would destroy the previous QEMU ICS object and create a new one 
in the CAS hcall. That would probably work. There might be some issues 
in creating and destroying the ICS KVM device, but that can be studied 
without XIVE.

It used to be considered ugly to create a QEMU device at reset time, so 
I wonder if this is still the case, because when the machine reaches CAS, 
we really are beyond reset.   

If this is OK, then the next "issue" is to keep the allocated IRQ 
numbers in sync. The IRQ allocator is now merged at the machine level, 
so the synchronization is straightforward when both backend QEMU objects 
are available. That's the path I took. If both QEMU objects are not 
available, then we need to scan the IRQ number space in the current 
interrupt mode and allocate the same IRQs in the newly negotiated mode. 
That is probably OK. I don't see major problems with the current code. 
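
As a rough illustration of the second option (scanning the IRQ number
space of the current mode and replaying the allocations in the new one),
here is a minimal C sketch. IrqBackend, NR_IRQS and irq_backend_sync()
are hypothetical names made up for this example. They are not QEMU APIs,
and the real allocator lives at the machine level.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical size of the IRQ number space, for illustration only. */
#define NR_IRQS 32

/* Hypothetical per-backend allocation map: one flag per IRQ number. */
typedef struct IrqBackend {
    bool allocated[NR_IRQS];
} IrqBackend;

/*
 * Scan the IRQ number space of the current backend and mirror it into
 * the newly negotiated backend: allocated numbers are claimed, and any
 * stale allocation in the new backend is cleared.
 * Returns the number of IRQs claimed in the new backend.
 */
static size_t irq_backend_sync(const IrqBackend *current, IrqBackend *next)
{
    size_t claimed = 0;

    for (size_t irq = 0; irq < NR_IRQS; irq++) {
        next->allocated[irq] = current->allocated[irq];
        if (next->allocated[irq]) {
            claimed++;
        }
    }
    return claimed;
}
```

With both backend objects always present, this scan would not be needed
at all, since a single machine-level allocator could update both maps at
allocation time.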

Migration is a problem. We will need both backend QEMU objects to be 
available anyhow if we want to migrate. So we are back to the current 
solution creating both QEMU objects but we can try to defer some of the 
KVM inits and create the KVM device on demand at CAS time.

The next problem is the ICP object that currently needs the KVM device 
fd to connect the vcpus ... So, we will need to change that also. 
That is probably the biggest problem today. We need a way to disconnect 
the vcpu from the KVM device and see how we can defer the connection.
I need to make sure this is possible. I think I can check that without
XIVE.

>> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
>> process ? So, the machine would not have any interrupt controller before 
>> CAS. That seems really late to me. grub uses the console for instance. 
> 
> We start with XICS by default.

yes.

>> I think it should prepare for both options, start in XIVE legacy mode, 
>> which is XICS, then possibly switch to XIVE exploitation mode.
>>
>>>>> And how that will interact with KVM ? 
>>>>
>>>> I expect we will do the same, which is to create two KVM devices to 
>>>> be able to handle both interrupt controller backends depending on the 
>>>> mode negotiated by the guest.  
>>>
>>> That will be an ungodly mess, I'd rather we only instantiate the right
>>> one.
>>
>> It's rather transparent currently in the emulated version. There are two 
>> sets of objects in QEMU, switching is done in CAS. KVM support should not 
>> change anything in that area. 
>>
>> I expect the 'xive-kvm' object to get/set states for migration, just like 
>> for XICS and to setup the ESB+TIMA memory regions, which is new. 
> 
> But both XICS and XIVE are completely different kernel KVM devices that will
> need to "hook" into the same set of internal hooks for things like interrupts
> being passed through, RTAS calls etc... 
> 
> How does KVM know which one to "activate" ?

Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? 
I haven't studied all the low level details though.

> I don't think the kernel should have both. 

I hear that. From a QEMU perspective, it is much easier to put everything 
in place for both interrupt modes and let the guest decide what it wants 
to use. 

If we choose not to, we will need to find a solution to defer the KVM inits
and to disconnect/reconnect the vcpus. For the latter, we could add a 
KVM_DISABLE_CAP ioctl or maybe better add a new capability like 
KVM_CAP_IRQ_XIVE to perform the switch.


C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17 14:39               ` Cédric Le Goater
@ 2018-01-17 17:57                 ` Cédric Le Goater
  2018-01-17 21:27                 ` Benjamin Herrenschmidt
  2018-04-12  5:15                 ` David Gibson
  2 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-01-17 17:57 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

> How does KVM know which one to "activate" ?
> 
> Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? 
> I haven't studied all the low level details though.

I don't think connecting a vcpu to two different KVM devices makes sense ...
So we need to destroy/recreate the KVM device and disconnect/reconnect 
the vcpus. I will take a closer look. 

C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17 14:39               ` Cédric Le Goater
  2018-01-17 17:57                 ` Cédric Le Goater
@ 2018-01-17 21:27                 ` Benjamin Herrenschmidt
  2018-01-18 13:27                   ` Cédric Le Goater
  2018-02-11  8:08                   ` David Gibson
  2018-04-12  5:15                 ` David Gibson
  2 siblings, 2 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2018-01-17 21:27 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> Migration is a problem. We will need both backend QEMU objects to be 
> available anyhow if we want to migrate. So we are back to the current 
> solution creating both QEMU objects but we can try to defer some of the 
> KVM inits and create the KVM device on demand at CAS time.

Do we have a way to migrate a piece of info from the machine *first*
that indicates what type of XICS/XIVE to instantiate ?

> The next problem is the ICP object that currently needs the KVM device 
> fd to connect the vcpus ... So, we will need to change that also. 
> That is probably the biggest problem today. We need a way to disconnect 
> the vcpu from the KVM device and see how we can defer the connection.
> I need to make sure this is possible, I can check that without XIVE

Ben.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17 21:27                 ` Benjamin Herrenschmidt
@ 2018-01-18 13:27                   ` Cédric Le Goater
  2018-01-18 21:08                     ` Benjamin Herrenschmidt
  2018-02-11  8:08                   ` David Gibson
  1 sibling, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2018-01-18 13:27 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 01/17/2018 10:27 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>> Migration is a problem. We will need both backend QEMU objects to be 
>> available anyhow if we want to migrate. So we are back to the current 
>> solution creating both QEMU objects but we can try to defer some of the 
>> KVM inits and create the KVM device on demand at CAS time.
> 
> Do we have a way to migrate a piece of info from the machine *first*
> that indicates what type of XICS/XIVE to instantiate ?

The source and the target machines should have the same realized 
objects. I think this is the simplest solution to keep the migration 
framework maintainable. 


I don't think it is a problem to call a xics_fini() routine to 
destroy the XICS KVM device if a new interrupt mode was negotiated
in CAS. We would then call a xive_init() routine to create the new 
XIVE KVM device.

When done, the question boils down to disconnecting and reconnecting 
the vcpus to the KVM device. The QEMU CPU ->intc pointer should also be 
updated, but that's a QEMU-level problem and it is already done.
 
In the QEMU "icp-kvm" object, the connection to the KVM device 
is currently forced in the realize routine but we can add some 
handlers to manage the link. Similar handlers would do the same 
in the QEMU "nvt-kvm" object when XIVE is on.


If we think this is a possible way to address the problem, I can 
check the above thinking on a XICS KVM machine and force the 
init/fini sequence in the CAS negotiation process. I will need 
a KVM ioctl to destroy a device and maybe a KVM VCPU ioctl to 
disable a capability. 

Cheers,

C. 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-18 13:27                   ` Cédric Le Goater
@ 2018-01-18 21:08                     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2018-01-18 21:08 UTC (permalink / raw)
  To: Cédric Le Goater, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On Thu, 2018-01-18 at 14:27 +0100, Cédric Le Goater wrote:
> The source and the target machines should have the same realized 
> objects. I think this is the simplest solution to keep the migration 
> framework maintainable. 

Yeah well, it all boils down to qemu migration being completely brain
dead in relying on an external entity to create the same machine rather
than carrying the configuration in the migration stream... ugh.

> I don't think it is a problem to call a xics_fini() routine to 
> destroy the XICS KVM device if a new interrupt mode was negotiated
> in CAS. We would then call a xive_init() routine to create the new 
> XIVE KVM device.
> 
> When done, the question boils down to disconnect and reconnect the 
> vcpus to the KVM device. The QEMU CPU ->intc pointer should be 
> updated also but that's a QEMU level problem. Already done.

The problem is more the in-kernel hooks.
 
> In the QEMU "icp-kvm" object, the connection to the KVM device 
> is currently forced in the realize routine but we can add some 
> handlers to manage the link. Similar handlers would do the same 
> in the QEMU "nvt-kvm" object when XIVE is on.
> 
> 
> If we think this is a possible way to address the problem, I can 
> check the above thinking on a XICS KVM machine and force the 
> init/fini sequence in the CAS negotiation process. I will need 
> a KVM ioctl to destroy a device and maybe a KVM VCPU ioctl to 
> disable a capability. 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17 21:27                 ` Benjamin Herrenschmidt
  2018-01-18 13:27                   ` Cédric Le Goater
@ 2018-02-11  8:08                   ` David Gibson
  2018-02-11 22:55                     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-02-11  8:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Cédric Le Goater, qemu-ppc, qemu-devel, Greg Kurz

On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> > Migration is a problem. We will need both backend QEMU objects to be 
> > available anyhow if we want to migrate. So we are back to the current 
> > solution creating both QEMU objects but we can try to defer some of the 
> > KVM inits and create the KVM device on demand at CAS time.
> 
> Do we have a way to migrate a piece of info from the machine *first*
> that indicates what type of XICS/XIVE to instantiate ?

Nope.  qemu migration doesn't work like that.  Yes, it should, and
everyone knows it, but changing it is a really long term project.

> 
> > The next problem is the ICP object that currently needs the KVM device 
> > fd to connect the vcpus ... So, we will need to change that also. 
> > That is probably the biggest problem today. We need a way to disconnect 
> > the vcpu from the KVM device and see how we can defer the connection.
> > I need to make sure this is possible, I can check that without XIVE
> 
> Ben.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-11  8:08                   ` David Gibson
@ 2018-02-11 22:55                     ` Benjamin Herrenschmidt
  2018-02-12  2:02                       ` Alexey Kardashevskiy
                                         ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2018-02-11 22:55 UTC (permalink / raw)
  To: David Gibson; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel, Greg Kurz

On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> > > Migration is a problem. We will need both backend QEMU objects to be 
> > > available anyhow if we want to migrate. So we are back to the current 
> > > solution creating both QEMU objects but we can try to defer some of the 
> > > KVM inits and create the KVM device on demand at CAS time.
> > 
> > Do we have a way to migrate a piece of info from the machine *first*
> > that indicates what type of XICS/XIVE to instantiate ?
> 
> Nope.  qemu migration doesn't work like that.  Yes, it should, and
> everyone knows it, but changing it is a really long term project.

Well, we have a problem then. It looks like QEMU's broken migration is
fundamentally incompatible with the PAPR and CAS design...

I know we don't migrate the configuration, that's not exactly what I
had in mind tho... Can we have some piece of *data* from the machine be
migrated first, and use it on the target to reconfigure the interrupt
controller before the stream arrives ?

Otherwise, we indeed have not much choice but the horrible wart of
creating both interrupt controllers with only one "active".

> > 
> > > The next problem is the ICP object that currently needs the KVM device 
> > > fd to connect the vcpus ... So, we will need to change that also. 
> > > That is probably the biggest problem today. We need a way to disconnect 
> > > the vcpu from the KVM device and see how we can defer the connection.
> > > I need to make sure this is possible, I can check that without XIVE
> > 
> > Ben.
> > 
> 
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-11 22:55                     ` Benjamin Herrenschmidt
@ 2018-02-12  2:02                       ` Alexey Kardashevskiy
  2018-02-12 12:20                         ` [Qemu-devel] [Qemu-ppc] " Andrea Bolognani
  2018-02-12  7:10                       ` [Qemu-devel] " Cédric Le Goater
  2018-04-12  5:16                       ` David Gibson
  2 siblings, 1 reply; 71+ messages in thread
From: Alexey Kardashevskiy @ 2018-02-12  2:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson
  Cc: Greg Kurz, qemu-ppc, Cédric Le Goater, qemu-devel

On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>>>> Migration is a problem. We will need both backend QEMU objects to be 
>>>> available anyhow if we want to migrate. So we are back to the current 
>>>> solution creating both QEMU objects but we can try to defer some of the 
>>>> KVM inits and create the KVM device on demand at CAS time.
>>>
>>> Do we have a way to migrate a piece of info from the machine *first*
>>> that indicates what type of XICS/XIVE to instantiate ?
>>
>> Nope.  qemu migration doesn't work like that.  Yes, it should, and
>> everyone knows it, but changing it is a really long term project.
> 
> Well, we have a problem then. It looks like Qemu broken migration is
> fundamentally incompatible with PAPR and CAS design...
> 
> I know we don't migrate the configuration, that's not exactly what I
> had in mind tho... Can we have some piece of *data* from the machine be
> migrated first, and use it on the target to reconfigure the interrupt
> controller before the stream arrives ?


These days this is done via libvirt - it reads properties it needs via QMP,
then sends an XML with everything (the interrupt controller type may be one
of such properties), and starts the destination QEMU with the explicit
interrupt controller (like -machine pseries,intrc=xive).

Hacking QEMU to do all of this is still in a distant TODO...


> Otherwise, we indeed have not much choice but the horrible wart of
> creating both interrupt controllers with only one "active".
> 
>>>
>>>> The next problem is the ICP object that currently needs the KVM device 
>>>> fd to connect the vcpus ... So, we will need to change that also. 
>>>> That is probably the biggest problem today. We need a way to disconnect 
>>>> the vcpu from the KVM device and see how we can defer the connection.
>>>> I need to make sure this is possible, I can check that without XIVE



-- 
Alexey


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-11 22:55                     ` Benjamin Herrenschmidt
  2018-02-12  2:02                       ` Alexey Kardashevskiy
@ 2018-02-12  7:10                       ` Cédric Le Goater
  2018-04-12  5:16                       ` David Gibson
  2 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-02-12  7:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 02/11/2018 11:55 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>>>> Migration is a problem. We will need both backend QEMU objects to be 
>>>> available anyhow if we want to migrate. So we are back to the current 
>>>> solution creating both QEMU objects but we can try to defer some of the 
>>>> KVM inits and create the KVM device on demand at CAS time.
>>>
>>> Do we have a way to migrate a piece of info from the machine *first*
>>> that indicates what type of XICS/XIVE to instantiate ?
>>
>> Nope.  qemu migration doesn't work like that.  Yes, it should, and
>> everyone knows it, but changing it is a really long term project.
> 
> Well, we have a problem then. It looks like Qemu broken migration is
> fundamentally incompatible with PAPR and CAS design...
> 
> I know we don't migrate the configuration, that's not exactly what I
> had in mind tho... Can we have some piece of *data* from the machine be
> migrated first, and use it on the target to reconfigure the interrupt
> controller before the stream arrives ?
> 
> Otherwise, we indeed have not much choice but the horrible wart of
> creating both interrupt controllers with only one "active".

Well, both QEMU model objects would be created, yes, but only one 
associated KVM device. It's a bit ugly from a QEMU point of view because 
the KVM initialization is deferred to reset time but, in practice, it 
results in a couple of calls to : 

  - disconnect the VCPU from the KVM interrupt device
  - destroy the previous KVM interrupt device (new ioctl)
  - create the new KVM interrupt device
  - reconnect the VCPU to the KVM interrupt device

I don't think it will be a major problem.
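
To make the ordering of the four calls above concrete, here is a small
self-contained C model. It only tracks state and calls no KVM ioctl:
the "destroy" step in particular is the new ioctl mentioned in the
list, so nothing here corresponds to an existing kernel interface.
MachineModel, IntcMode and machine_switch_intc() are hypothetical names
used for illustration only.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_VCPUS 8   /* hypothetical limit, for illustration only */

typedef enum IntcMode { INTC_NONE, INTC_XICS, INTC_XIVE } IntcMode;

/* Hypothetical model: one in-kernel device at a time, plus vcpu links. */
typedef struct MachineModel {
    IntcMode kvm_device;            /* the single KVM interrupt device */
    size_t nr_vcpus;
    IntcMode vcpu_link[MAX_VCPUS];  /* device each vcpu is connected to */
} MachineModel;

/*
 * Switch the in-kernel interrupt device to new_mode, following the
 * order given above. Assumes the vcpus are already connected to the
 * current device. Returns false on invalid arguments.
 */
static bool machine_switch_intc(MachineModel *m, IntcMode new_mode)
{
    if (m->nr_vcpus > MAX_VCPUS || new_mode == INTC_NONE) {
        return false;
    }
    if (m->kvm_device == new_mode) {
        return true;                /* same mode negotiated, nothing to do */
    }

    /* 1. disconnect the vcpus from the KVM interrupt device */
    for (size_t i = 0; i < m->nr_vcpus; i++) {
        m->vcpu_link[i] = INTC_NONE;
    }
    /* 2. destroy the previous KVM interrupt device */
    m->kvm_device = INTC_NONE;
    /* 3. create the new KVM interrupt device */
    m->kvm_device = new_mode;
    /* 4. reconnect the vcpus to the KVM interrupt device */
    for (size_t i = 0; i < m->nr_vcpus; i++) {
        m->vcpu_link[i] = m->kvm_device;
    }
    return true;
}
```

The point of the model is that a vcpu is never linked to two devices at
once, and that the device is destroyed only after every vcpu has been
disconnected from it.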

What I am uneasy about currently is how to share the same XIVE objects 
when under KVM and when not. The only difference is in the nature of
the MMIO region and the qemu_irq handler. Work in progress.

And we have four interrupt modes to support : XICS-KVM, XICS, XIVE-KVM, 
XIVE.    

Thanks,

C. 
 

>>>> The next problem is the ICP object that currently needs the KVM device 
>>>> fd to connect the vcpus ... So, we will need to change that also. 
>>>> That is probably the biggest problem today. We need a way to disconnect 
>>>> the vcpu from the KVM device and see how we can defer the connection.
>>>> I need to make sure this is possible, I can check that without XIVE
>>>
>>> Ben.
>>>
>>
>>


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-12  2:02                       ` Alexey Kardashevskiy
@ 2018-02-12 12:20                         ` Andrea Bolognani
  2018-02-12 14:40                           ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 71+ messages in thread
From: Andrea Bolognani @ 2018-02-12 12:20 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Benjamin Herrenschmidt, David Gibson
  Cc: qemu-devel, qemu-ppc, Greg Kurz, Cédric Le Goater

On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
> On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> > Well, we have a problem then. It looks like Qemu broken migration is
> > fundamentally incompatible with PAPR and CAS design...
> > 
> > I know we don't migrate the configuration, that's not exactly what I
> > had in mind tho... Can we have some piece of *data* from the machine be
> > migrated first, and use it on the target to reconfigure the interrupt
> > controller before the stream arrives ?
> 
> These days this is done via libvirt - it reads properties it needs via QMP,
> then sends an XML with everything (the interrupt controller type may be one
> of such properties), and starts the destination QEMU with the explicit
> interrupt controller (like -machine pseries,intrc=xive).

Clarification: libvirt will use the user-defined XML configuration
to generate the QEMU command line both for the source and the target
of the migration, but it will not automagically figure out properties
through QMP. So if you want the controller to explicitly show up on
the QEMU command line, libvirt should be taught about it.

-- 
Andrea Bolognani / Red Hat / Virtualization


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-12 12:20                         ` [Qemu-devel] [Qemu-ppc] " Andrea Bolognani
@ 2018-02-12 14:40                           ` Benjamin Herrenschmidt
  2018-02-13  1:11                             ` Alexey Kardashevskiy
  2018-02-13  7:40                             ` Cédric Le Goater
  0 siblings, 2 replies; 71+ messages in thread
From: Benjamin Herrenschmidt @ 2018-02-12 14:40 UTC (permalink / raw)
  To: Andrea Bolognani, Alexey Kardashevskiy, David Gibson
  Cc: qemu-devel, qemu-ppc, Greg Kurz, Cédric Le Goater

On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote:
> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
> > On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
> > > Well, we have a problem then. It looks like Qemu broken migration is
> > > fundamentally incompatible with PAPR and CAS design...
> > > 
> > > I know we don't migrate the configuration, that's not exactly what I
> > > had in mind tho... Can we have some piece of *data* from the machine be
> > > migrated first, and use it on the target to reconfigure the interrupt
> > > controller before the stream arrives ?
> > 
> > These days this is done via libvirt - it reads properties it needs via QMP,
> > then sends an XML with everything (the interrupt controller type may be one
> > of such properties), and starts the destination QEMU with the explicit
> > interrupt controller (like -machine pseries,intrc=xive).
> 
> Clarification: libvirt will use the user-defined XML configuration
> to generate the QEMU command line both for the source and the target
> of the migration, but it will not automagically figure out properties
> through QMP. So if you want the controller to explicitly show up on
> the QEMU command line, libvirt should be taught about it.

Which can't work because the guest pretty much decides what it will be
early on during the boot process.

So we're back to square 1 having to instantiate both objects in qemu
with some kind of "activation" flag.

Cheers,
Ben.


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-12 14:40                           ` Benjamin Herrenschmidt
@ 2018-02-13  1:11                             ` Alexey Kardashevskiy
  2018-02-13  7:40                             ` Cédric Le Goater
  1 sibling, 0 replies; 71+ messages in thread
From: Alexey Kardashevskiy @ 2018-02-13  1:11 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andrea Bolognani, David Gibson
  Cc: qemu-devel, qemu-ppc, Greg Kurz, Cédric Le Goater

On 13/02/18 01:40, Benjamin Herrenschmidt wrote:
> On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote:
>> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
>>> On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
>>>> Well, we have a problem then. It looks like Qemu broken migration is
>>>> fundamentally incompatible with PAPR and CAS design...
>>>>
>>>> I know we don't migrate the configuration, that's not exactly what I
>>>> had in mind tho... Can we have some piece of *data* from the machine be
>>>> migrated first, and use it on the target to reconfigure the interrupt
>>>> controller before the stream arrives ?
>>>
>>> These days this is done via libvirt - it reads properties it needs via QMP,
>>> then sends an XML with everything (the interrupt controller type may be one
>>> of such properties), and starts the destination QEMU with the explicit
>>> interrupt controller (like -machine pseries,intrc=xive).
>>
>> Clarification: libvirt will use the user-defined XML configuration
>> to generate the QEMU command line both for the source and the target
>> of the migration, but it will not automagically figure out properties
>> through QMP. So if you want the controller to explicitly show up on
>> the QEMU command line, libvirt should be taught about it.
> 
> Which can't work because the guest pretty much decides what it will be
> early on during the boot process.


At the time of migration the guest has told QEMU what intrc it wants (via
cas?) and libvirt can ask QEMU via QMP about that when migrating.


> So we're back to square 1 having to instantiate both objects in qemu
> with some kind of "activation" flag.
> 
> Cheers,
> Ben.
> 


-- 
Alexey


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-12 14:40                           ` Benjamin Herrenschmidt
  2018-02-13  1:11                             ` Alexey Kardashevskiy
@ 2018-02-13  7:40                             ` Cédric Le Goater
  1 sibling, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-02-13  7:40 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Andrea Bolognani, Alexey Kardashevskiy,
	David Gibson
  Cc: qemu-devel, qemu-ppc, Greg Kurz

On 02/12/2018 03:40 PM, Benjamin Herrenschmidt wrote:
> On Mon, 2018-02-12 at 13:20 +0100, Andrea Bolognani wrote:
>> On Mon, 2018-02-12 at 13:02 +1100, Alexey Kardashevskiy wrote:
>>> On 12/02/18 09:55, Benjamin Herrenschmidt wrote:
>>>> Well, we have a problem then. It looks like Qemu broken migration is
>>>> fundamentally incompatible with PAPR and CAS design...
>>>>
>>>> I know we don't migrate the configuration, that's not exactly what I
>>>> had in mind tho... Can we have some piece of *data* from the machine be
>>>> migrated first, and use it on the target to reconfigure the interrupt
>>>> controller before the stream arrives ?
>>>
>>> These days this is done via libvirt - it reads properties it needs via QMP,
>>> then sends an XML with everything (the interrupt controller type may be one
>>> of such properties), and starts the destination QEMU with the explicit
>>> interrupt controller (like -machine pseries,intrc=xive).
>>
>> Clarification: libvirt will use the user-defined XML configuration
>> to generate the QEMU command line both for the source and the target
>> of the migration, but it will not automagically figure out properties
>> through QMP. So if you want the controller to explicitly show up on
>> the QEMU command line, libvirt should be taught about it.
> 
> Which can't work because the guest pretty much decides what it will be
> early on during the boot process.
> 
> So we're back to square 1 having to instantiate both objects in qemu
> with some kind of "activation" flag.

yes and the activation flag is the associated bit in CAS OV5 :

	spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)

if a new interrupt mode is negotiated, a machine reset is required,
a new device tree is populated, new ICPs are installed, etc. There
is a little more to do with KVM and we need to find the right model 
abstraction for it. Anyhow, it is not a big problem to switch from 
one mode to another when both objects are around. It is even easier
to keep the allocated IRQs in sync in fact. 
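
For readers not familiar with the option vectors, the decision can be
pictured with the standalone C sketch below. It does not use the QEMU
spapr_ovec API: OptionVector, ovec_set(), ovec_test() and the bit
number EXAMPLE_OV5_XIVE_EXPLOIT are placeholders chosen for this
example, and the bit layout is an assumption, not the PAPR one.

```c
#include <stdbool.h>
#include <stdint.h>

#define OV_BYTES 32                  /* hypothetical vector size */
#define EXAMPLE_OV5_XIVE_EXPLOIT 7   /* hypothetical bit number */

/* Hypothetical option vector: a plain bitmap, LSB-first in each byte. */
typedef struct OptionVector {
    uint8_t bytes[OV_BYTES];
} OptionVector;

typedef enum IrqMode { IRQ_MODE_XICS, IRQ_MODE_XIVE } IrqMode;

/* Set one bit; out-of-range bit numbers are ignored. */
static void ovec_set(OptionVector *ov, unsigned bitnr)
{
    if (bitnr < OV_BYTES * 8) {
        ov->bytes[bitnr / 8] |= (uint8_t)(1u << (bitnr % 8));
    }
}

/* Test one bit; out-of-range bit numbers read as clear. */
static bool ovec_test(const OptionVector *ov, unsigned bitnr)
{
    if (bitnr >= OV_BYTES * 8) {
        return false;
    }
    return ((ov->bytes[bitnr / 8] >> (bitnr % 8)) & 1u) != 0;
}

/* The "activation flag": which mode did the guest negotiate at CAS ? */
static IrqMode negotiated_irq_mode(const OptionVector *ov5_cas)
{
    return ovec_test(ov5_cas, EXAMPLE_OV5_XIVE_EXPLOIT) ? IRQ_MODE_XIVE
                                                        : IRQ_MODE_XICS;
}

/* A machine reset is needed only when the negotiated mode changes. */
static bool cas_requires_reset(IrqMode current, const OptionVector *ov5_cas)
{
    return negotiated_irq_mode(ov5_cas) != current;
}
```

Since the machine starts in XICS mode, a guest that does not set the
bit never triggers the reset path in this model.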

What problem do you foresee with KVM ? This is already solved for 
irqchip=off.

Cheers, 

C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-20  7:38     ` Cédric Le Goater
@ 2018-04-12  5:07       ` David Gibson
  2018-04-12  8:18         ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-12  5:07 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> On 12/20/2017 06:09 AM, David Gibson wrote:
> > On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> >> With the POWER9 processor comes a new interrupt controller called
> >> XIVE. It is composed of three sub-engines :
> >>
> >>   - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
> >>     in the main controller for the IPIs and in the PSI host
> >>     bridge. They are configured to feed the IVRE with events.
> >>
> >>   - Interrupt Virtualization Routing Engine (IVRE). Their job is to
> >>     match an event source with a Notification Virtualization Target
> >>     (NVT), a priority and an Event Queue (EQ) to determine if a
> >>     Virtual Processor can handle the event.
> >>
> >>   - Interrupt Virtualization Presentation Engine (IVPE). It maintains
> >>     the interrupt state of each hardware thread and presents the
> >>     notification as an external exception.
> >>
> >> Each of the engines uses a set of internal tables to redirect
> >> exceptions from event sources to CPU threads. The first table we
> >> introduce is the Interrupt Virtualization Entry (IVE) table, part of
> >> the virtualization engine in charge of routing events. It associates
> >> event sources (IRQ numbers) to event queues which will forward, or
> >> not, the event notification to the presentation controller.
> >>
> >> The XIVE model is designed to make use of the full range of the IRQ
> >> number space and does not use an offset like the XICS mode does.
> >> Hence, the IVE table is directly indexed by the IRQ number.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > 
> > As you've suggested yourself, I think we might need to more
> > explicitly model the different components of the XIVE system.  As part
> > of that, I think you need to be clearer in this base skeleton about
> > exactly what component your XIVE object represents.

Sorry it's been so long since I looked at these.

> ok. The base skeleton is the IVRE, the central engine handling 
> the routing. 
> 
> > If the answer is "the overall thing" 
> 
> Yes, it is more or less that currently. 
> 
> The sPAPRXive object models the source engine and the routing 
> engine in one object.

Yeah, I suspect we don't want that.  Although it might seem simpler in
the spapr case, at least at first glance, I think it will cause us
problems later.  At the very least, it's likely to make it harder to
share code between the spapr and powernv case.  I think it will also
make for more confusion about exactly what things belong where.

> I have merged these for simplicity and because the interrupt 
> controller has an internal source for the interrupts of the "IPI" 
> type, which are used for the CPU IPIs but also for other generic 
> interrupts, like the OpenCAPI ones. The XIVE sPAPR interface is 
> also much simpler than the baremetal one, all the tables are 
> maintained in the hypervisor, so this choice made some sense. 
> 
> But since then, I have started the PowerNV model and I am duplicating
> a lot of code to handle the triggering and the MMIOs in the
> different sources. So I am not convinced anymore. Nevertheless,
> the overall routing logic is the same even if some of the tables
> are not located in QEMU anymore, but in the machine memory.
> 
> The sPAPRXiveNVT models some of the CPU presenter engine. It 
> holds the virtual CPU interrupt states when not dispatched on 
> a real HW thread. Real world is more complex. There are "CAM" 
> lines in the HW threads which are compared to find a matching 
> candidate. But I don't think we need to do anything more complex
> than today unless we want to support KVM under TCG ...
>    
> > I suspect that's not what you
> > want - I had one of those for XICs which proved to be a mistake
> > (eventually replaced by the XICSFabric interface).
> 
> The XICSFabric would be the main Xive object. The interface 
> between the sources and the routing engine is hidden in sPAPR, 
> we can use a simple function call : 
> 
> 	spapr_xive_irq(pnv->xive, irq);
> 
> we could get rid of the qirqs but they are required for XICS.

I don't quite follow, but this doesn't sound right.

> PowerNV uses MMIOs to notify an event and it makes the modeling
> somewhat easier. Each controller model has a notify port address 
> register on which an interrupt number is written to forward an
> event to the routing engine. So it is a simple store. 
> 
> I don't know why there is a different notify port address per
> source, maybe for extra filtering at the routing engine level.
> 
> > Changing the model later isn't impossible, but doing so without
> > breaking migration can be a real pain, so I think it's worth a
> > reasonable effort to try and get it right initially.
> 
> I completely agree. 
> 
> This is why I have started the PnvXive model to challenge the 
> current PAPR design. I have hacked a bunch of patches for XIVE, 
> LPC, PSI, OCC and basic PPC support which boot a PowerNV P9 up to 
> petitboot. It would look better with a source object, but the 
> location of the PQ bits is a bit problematic. It highly depends 
> on the controller. The main controller uses tables in the hypervisor
> memory. The PSIHB controller has its own bits. I suppose it is 
> the same for PHB4. I need to take a closer look at how we could
> have a common source object.

Ok, sounds like a good idea.

> 
> The most important part is KVM support and how we expose the 
> MMIO region. We need to make progress on that topic.
> 
> Thanks,
> 
> C.  
>  
> 
> >> ---
> >>
> >>  Changes since v1 :
> >>
> >>  - used g_new0 instead of g_malloc0
> >>  - removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC 
> >>  - introduced a device reset handler. the object needs to be parented
> >>    to sysbus when created.
> >>  - renamed spapr_xive_irq_set to spapr_xive_irq_enable
> >>  - renamed spapr_xive_irq_unset to spapr_xive_irq_disable
> >>  - moved the PPC_BIT macros under target/ppc/cpu.h
> >>  - shrank the file copyright header
> >>
> >>  default-configs/ppc64-softmmu.mak |   1 +
> >>  hw/intc/Makefile.objs             |   1 +
> >>  hw/intc/spapr_xive.c              | 156 ++++++++++++++++++++++++++++++++++++++
> >>  hw/intc/xive-internal.h           |  41 ++++++++++
> >>  include/hw/ppc/spapr_xive.h       |  35 +++++++++
> >>  5 files changed, 234 insertions(+)
> >>  create mode 100644 hw/intc/spapr_xive.c
> >>  create mode 100644 hw/intc/xive-internal.h
> >>  create mode 100644 include/hw/ppc/spapr_xive.h
> >>
> >> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> >> index d1b3a6dd50f8..4a7f6a0696de 100644
> >> --- a/default-configs/ppc64-softmmu.mak
> >> +++ b/default-configs/ppc64-softmmu.mak
> >> @@ -56,6 +56,7 @@ CONFIG_SM501=y
> >>  CONFIG_XICS=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> >> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
> >>  # For PReP
> >>  CONFIG_SERIAL_ISA=y
> >>  CONFIG_MC146818RTC=y
> >> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> >> index ae358569a155..49e13e7aeeee 100644
> >> --- a/hw/intc/Makefile.objs
> >> +++ b/hw/intc/Makefile.objs
> >> @@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
> >>  obj-$(CONFIG_XICS) += xics.o
> >>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> >> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
> >>  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> new file mode 100644
> >> index 000000000000..e6e8841add17
> >> --- /dev/null
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -0,0 +1,156 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/log.h"
> >> +#include "qapi/error.h"
> >> +#include "target/ppc/cpu.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "sysemu/dma.h"
> >> +#include "monitor/monitor.h"
> >> +#include "hw/ppc/spapr_xive.h"
> >> +
> >> +#include "xive-internal.h"
> >> +
> >> +/*
> >> + * Main XIVE object
> >> + */
> >> +
> >> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = 0; i < xive->nr_irqs; i++) {
> >> +        XiveIVE *ive = &xive->ivt[i];
> >> +
> >> +        if (!(ive->w & IVE_VALID)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
> >> +                       ive->w & IVE_MASKED ? "M" : " ",
> >> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
> >> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
> >> +    }
> >> +}
> >> +
> >> +static void spapr_xive_reset(DeviceState *dev)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    int i;
> >> +
> >> +    /* Mask all valid IVEs in the IRQ number space. */
> >> +    for (i = 0; i < xive->nr_irqs; i++) {
> >> +        XiveIVE *ive = &xive->ivt[i];
> >> +        if (ive->w & IVE_VALID) {
> >> +            ive->w |= IVE_MASKED;
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +
> >> +    if (!xive->nr_irqs) {
> >> +        error_setg(errp, "Number of interrupts needs to be greater than 0");
> >> +        return;
> >> +    }
> >> +
> >> +    /* Allocate the IVT (Interrupt Virtualization Table) */
> >> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> >> +}
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive_ive = {
> >> +    .name = TYPE_SPAPR_XIVE "/ive",
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField []) {
> >> +        VMSTATE_UINT64(w, XiveIVE),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static bool vmstate_spapr_xive_needed(void *opaque)
> >> +{
> >> +    /* TODO check machine XIVE support */
> >> +    return true;
> >> +}
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive = {
> >> +    .name = TYPE_SPAPR_XIVE,
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .needed = vmstate_spapr_xive_needed,
> >> +    .fields = (VMStateField[]) {
> >> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> >> +        VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
> >> +                                     vmstate_spapr_xive_ive, XiveIVE),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static Property spapr_xive_properties[] = {
> >> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +
> >> +    dc->realize = spapr_xive_realize;
> >> +    dc->reset = spapr_xive_reset;
> >> +    dc->props = spapr_xive_properties;
> >> +    dc->desc = "sPAPR XIVE interrupt controller";
> >> +    dc->vmsd = &vmstate_spapr_xive;
> >> +}
> >> +
> >> +static const TypeInfo spapr_xive_info = {
> >> +    .name = TYPE_SPAPR_XIVE,
> >> +    .parent = TYPE_SYS_BUS_DEVICE,
> >> +    .instance_size = sizeof(sPAPRXive),
> >> +    .class_init = spapr_xive_class_init,
> >> +};
> >> +
> >> +static void spapr_xive_register_types(void)
> >> +{
> >> +    type_register_static(&spapr_xive_info);
> >> +}
> >> +
> >> +type_init(spapr_xive_register_types)
> >> +
> >> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
> >> +{
> >> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> >> +}
> >> +
> >> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
> >> +{
> >> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
> >> +
> >> +    if (!ive) {
> >> +        return false;
> >> +    }
> >> +
> >> +    ive->w |= IVE_VALID;
> >> +    return true;
> >> +}
> >> +
> >> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> >> +{
> >> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
> >> +
> >> +    if (!ive) {
> >> +        return false;
> >> +    }
> >> +
> >> +    ive->w &= ~IVE_VALID;
> >> +    return true;
> >> +}
> >> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
> >> new file mode 100644
> >> index 000000000000..132b71a6daf0
> >> --- /dev/null
> >> +++ b/hw/intc/xive-internal.h
> >> @@ -0,0 +1,41 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2016-2017, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef _INTC_XIVE_INTERNAL_H
> >> +#define _INTC_XIVE_INTERNAL_H
> >> +
> >> +/* Utilities to manipulate these (originally from OPAL) */
> >> +#define MASK_TO_LSH(m)          (__builtin_ffsl(m) - 1)
> >> +#define GETFIELD(m, v)          (((v) & (m)) >> MASK_TO_LSH(m))
> >> +#define SETFIELD(m, v, val)                             \
> >> +        (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
> >> +
> >> +/* IVE/EAS
> >> + *
> >> + * One per interrupt source. Targets that interrupt to a given EQ
> >> + * and provides the corresponding logical interrupt number (EQ data)
> >> + *
> >> + * We also map this structure to the escalation descriptor inside
> >> + * an EQ, though in that case the valid and masked bits are not used.
> >> + */
> >> +typedef struct XiveIVE {
> >> +        /* Use a single 64-bit definition to make it easier to
> >> +         * perform atomic updates
> >> +         */
> >> +        uint64_t        w;
> >> +#define IVE_VALID       PPC_BIT(0)
> >> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
> >> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
> >> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
> >> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
> >> +} XiveIVE;
> >> +
> >> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
> >> +
> >> +#endif /* _INTC_XIVE_INTERNAL_H */
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> new file mode 100644
> >> index 000000000000..5b1f78e06a1e
> >> --- /dev/null
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -0,0 +1,35 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef PPC_SPAPR_XIVE_H
> >> +#define PPC_SPAPR_XIVE_H
> >> +
> >> +#include <hw/sysbus.h>
> >> +
> >> +typedef struct sPAPRXive sPAPRXive;
> >> +typedef struct XiveIVE XiveIVE;
> >> +
> >> +#define TYPE_SPAPR_XIVE "spapr-xive"
> >> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> >> +
> >> +struct sPAPRXive {
> >> +    SysBusDevice parent;
> >> +
> >> +    /* Properties */
> >> +    uint32_t     nr_irqs;
> >> +
> >> +    /* XIVE internal tables */
> >> +    XiveIVE      *ivt;
> >> +};
> >> +
> >> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
> >> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> >> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> >> +
> >> +#endif /* PPC_SPAPR_XIVE_H */
> > 
> 
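
For reference, the IVE bit layout quoted above can be exercised in
isolation. The following is a minimal standalone sketch, not part of
the patch. It assumes the PPC_BIT/PPC_BITMASK definitions match the
ones in target/ppc/cpu.h (copied here so it builds outside the tree),
ive_route() is a hypothetical helper, and __builtin_ffsll replaces the
patch's __builtin_ffsl so the sketch does not depend on the host's
long being 64 bits wide.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word. */
#define PPC_BIT(bit)        (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* Field helpers, as in xive-internal.h but with a 64-bit ffs. */
#define MASK_TO_LSH(m)      (__builtin_ffsll((long long)(m)) - 1)
#define GETFIELD(m, v)      (((v) & (m)) >> MASK_TO_LSH(m))
#define SETFIELD(m, v, val) \
    (((v) & ~(m)) | ((((uint64_t)(val)) << MASK_TO_LSH(m)) & (m)))

#define IVE_VALID       PPC_BIT(0)
#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)    /* Destination EQ block# */
#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)   /* Destination EQ index */
#define IVE_MASKED      PPC_BIT(32)          /* Masked */
#define IVE_EQ_DATA     PPC_BITMASK(33, 63)  /* Data written to the EQ */

/*
 * Hypothetical helper: point an IVE word at an EQ index, set the data
 * pushed to the EQ, and unmask the source. The valid bit is untouched.
 */
static uint64_t ive_route(uint64_t w, uint32_t eq_index, uint32_t eq_data)
{
    assert(eq_index < (1u << 24));   /* EQ index is a 24-bit field */
    assert(eq_data < (1u << 31));    /* EQ data is a 31-bit field */

    w = SETFIELD(IVE_EQ_INDEX, w, eq_index);
    w = SETFIELD(IVE_EQ_DATA, w, eq_data);
    return w & ~IVE_MASKED;
}
```

Starting from a valid, masked entry, ive_route(IVE_VALID | IVE_MASKED,
0x123, 0x45) leaves the valid bit set, clears the mask bit, and
GETFIELD() returns 0x123 and 0x45 for the two fields.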

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2017-12-21  0:12     ` Benjamin Herrenschmidt
  2017-12-21  9:16       ` Cédric Le Goater
@ 2018-04-12  5:08       ` David Gibson
  2018-04-12  8:28         ` Cédric Le Goater
  1 sibling, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-12  5:08 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Cédric Le Goater, qemu-ppc, qemu-devel, Greg Kurz

On Thu, Dec 21, 2017 at 11:12:06AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
> > 
> > As you've suggested yourself, I think we might need to more
> > explicitly model the different components of the XIVE system.  As part
> > of that, I think you need to be clearer in this base skeleton about
> > exactly what component your XIVE object represents.
> > 
> > If the answer is "the overall thing" I suspect that's not what you
> > want - I had one of those for XICs which proved to be a mistake
> > (eventually replaced by the XICSFabric interface).
> > 
> > Changing the model later isn't impossible, but doing so without
> > breaking migration can be a real pain, so I think it's worth a
> > reasonable effort to try and get it right initially.
> 
> Note: we do need to speed things up a bit, as having exploitation mode
> in KVM will significantly help with IPI performance among other things.
> 
> I'm about ready to do the KVM bits. The one thing we need to discuss
> and figure a good design for is how we map all those interrupt control
> pages into qemu.
> 
> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
> which are used for guest IPIs and for vio/virtio/emulated interrupts)
> comes with a "control page" (ESB page) which needs to be mapped into
> the guest, and the generic IPIs also come with a trigger page which
> needs to be mapped into the guest for guest IPIs or OpenCAPI
> interrupts, or just qemu for emulated devices.
> 
> Now that can be thousands of these critters. I certainly don't want to
> create thousands of VMAs in qemu and even less thousands of memory
> regions in KVM.
> 
> So we need some kind of mechanism by which a single large VMA gets
> mmap'ed into qemu (or maybe a couple of these, but not too many) and
> the interrupt pages can be assigned to slots in there and demand
> faulted.

Ok, I see your point.  We'll definitely need to be able to map things
in as a block, rather than one by one.

> For the generic interrupts, this can probably be covered by KVM, adding
> some arch ioctls for allocating IPIs and mmap'ing that region etc...
> 
> For pass-through, it's trickier, we don't want to mmap each irqfd
> individually for the above reason, so we want to "link" them to KVM. We
> don't want to allow qemu to take control of any arbitrary interrupt in
> the system though, so it has to be related to the ownership of the irqfd
> coming from vfio.
> 
> OpenCAPI I suspect will be its own can of worms...
> 
> Also, have we decided how the process of switching between XICS and
> XIVE will work vs. CAS ? And how that will interact with KVM ? I was
> thinking the kernel would implement a different KVM device type, ie
> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> KVM_DEV_TYPE_XIVE.
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17  9:18           ` Cédric Le Goater
  2018-01-17 11:10             ` Benjamin Herrenschmidt
@ 2018-04-12  5:10             ` David Gibson
  2018-04-12  8:41               ` Cédric Le Goater
  1 sibling, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-12  5:10 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

On Wed, Jan 17, 2018 at 10:18:43AM +0100, Cédric Le Goater wrote:
> >>> Also, have we decided how the process of switching between XICS and
> >>> XIVE will work vs. CAS ? 
> >>
> >> That's how it is described in the architecture. The current choice is
> >> to create both XICS and XIVE objects and choose at CAS which one to
> >> use. It relies today on the capability of the pseries machine to 
> >> allocate IRQ numbers for both interrupt controller backends. These
> >> patches have been merged in QEMU.
> >>
> >> A change of interrupt mode results in a reset. The device tree is 
> >> populated accordingly and the ICPs are switched for the model in 
> >> use. 
> > 
> > For KVM we need to only instantiate one of them though.
> 
> Hmm,
> 
> How would we handle a guest rebooting on a kernel without XIVE support ? 
> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
> process ? So, the machine would not have any interrupt controller before 
> CAS. That seems really late to me. grub uses the console for instance. 
> 
> I think it should prepare for both options, start in XIVE legacy mode, 
> which is XICS, then possibly switch to XIVE exploitation mode.

I think for our first draft we should have XIVE and XICS based
platforms as separate machine types (or a machine option, I guess).

We do want to allow this to be autonegotiated, but I feel like
emphasising that at the beginning is causing unnatural design
decisions in the XIVE model itself.
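
As a rough illustration of the "machine option" variant, the selection
could be as small as the sketch below. The option values "xics" and
"xive", the IrqMode enum and irq_mode_parse() are all hypothetical
names chosen for the example. Nothing here is an existing qemu
interface.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical interrupt mode, fixed at machine creation time. */
typedef enum IrqMode {
    IRQ_MODE_XICS,   /* legacy model, the default */
    IRQ_MODE_XIVE,   /* XIVE exploitation mode */
} IrqMode;

/*
 * Parse the value of a hypothetical machine option.
 * A missing value selects XICS. Returns 0 on success and -1 if the
 * value is not recognised, leaving *mode untouched in that case.
 */
static int irq_mode_parse(const char *value, IrqMode *mode)
{
    assert(mode != NULL);

    if (value == NULL || strcmp(value, "xics") == 0) {
        *mode = IRQ_MODE_XICS;
        return 0;
    }
    if (strcmp(value, "xive") == 0) {
        *mode = IRQ_MODE_XIVE;
        return 0;
    }
    return -1;
}
```

With the mode fixed up front like this, the XIVE model never has to
care about CAS, which is the simplification argued for above.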

> 
> >>> And how that will interact with KVM ?
> >>
> >> I expect we will do the same, which is to create two KVM devices to 
> >> be able to handle both interrupt controller backends depending on the 
> >> mode negotiated by the guest.  
> > 
> > That will be an ungodly mess, I'd rather we only instantiate the right
> > one.
> 
> It's rather transparent currently in the emulated version. There are two 
> sets of objects in QEMU, switching is done in CAS. KVM support should not 
> change anything in that area. 
> 
> I expect the 'xive-kvm' object to get/set states for migration, just like 
> for XICS and to setup the ESB+TIMA memory regions, which is new. 
> 
> C. 
>  
> >>> I was
> >>> thinking the kernel would implement a different KVM device type, ie
> >>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
> >>> KVM_DEV_TYPE_XIVE.
> >>
> >> yes. it makes sense. The new device will have a lot in common with the 
> >> KVM_DEV_TYPE_XICS using kvm_xive_ops.
> > 
> > Ben.
> > 
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-01-17 14:39               ` Cédric Le Goater
  2018-01-17 17:57                 ` Cédric Le Goater
  2018-01-17 21:27                 ` Benjamin Herrenschmidt
@ 2018-04-12  5:15                 ` David Gibson
  2018-04-12  8:51                   ` Cédric Le Goater
  2 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-12  5:15 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

On Wed, Jan 17, 2018 at 03:39:46PM +0100, Cédric Le Goater wrote:
> On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote:
> > On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
> >>>>> Also, have we decided how the process of switching between XICS and
> >>>>> XIVE will work vs. CAS ? 
> >>>>
> >>>> That's how it is described in the architecture. The current choice is
> >>>> to create both XICS and XIVE objects and choose at CAS which one to
> >>>> use. It relies today on the capability of the pseries machine to 
> >>>> allocate IRQ numbers for both interrupt controller backends. These
> >>>> patches have been merged in QEMU.
> >>>>
> >>>> A change of interrupt mode results in a reset. The device tree is 
> >>>> populated accordingly and the ICPs are switched for the model in 
> >>>> use. 
> >>>
> >>> For KVM we need to only instantiate one of them though.
> >>
> >> Hmm,
> >>
> >> How would we handle a guest rebooting on a kernel without XIVE support ? 
> > 
> > It will do CAS again and we can change the devices.
> 
> So, we would destroy the previous QEMU ICS object and create a new one 
> in the CAS hcall. That would probably work. There might be some issues 
> in creating and destroying the ICS KVM device, but that can be studied 
> without XIVE.

Adding and removing devices at runtime based on guest requests like
this will get really hairy in qemu.

As I've said before for the first cut, I think we want to select just
one as a machine option to avoid this confusion.

Looking further ahead, I think we'll be better off having both the
XIVE and XICS models always present (at least minimally) in qemu, but
with only one "active" at any given time.

Note that having the inactive one destroy and clean up the
corresponding KVM devices is fine, as is deallocating as much of its
runtime state as we can without changing the notional QOM tree.
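
A minimal sketch of that "both present, one active" arrangement,
assuming hypothetical IrqBackend and IrqMachine types (these are not
qemu types, and a bool stands in for the real KVM device and runtime
state):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical backend: the object always exists, its KVM side may not. */
typedef struct IrqBackend {
    const char *name;
    bool kvm_device_created;   /* stand-in for the KVM device and state */
} IrqBackend;

/* Hypothetical machine holding both models, with exactly one active. */
typedef struct IrqMachine {
    IrqBackend xics;
    IrqBackend xive;
    IrqBackend *active;
} IrqMachine;

static void backend_activate(IrqBackend *b)
{
    b->kvm_device_created = true;    /* would create the KVM device */
}

static void backend_deactivate(IrqBackend *b)
{
    b->kvm_device_created = false;   /* would destroy it and free state */
}

/* Both backends are created up front; the machine starts in XICS mode. */
static void irq_machine_init(IrqMachine *m)
{
    assert(m != NULL);
    m->xics = (IrqBackend){ .name = "xics", .kvm_device_created = false };
    m->xive = (IrqBackend){ .name = "xive", .kvm_device_created = false };
    m->active = &m->xics;
    backend_activate(m->active);
}

/* Switch the active backend without adding or removing any object. */
static void irq_machine_select(IrqMachine *m, bool want_xive)
{
    IrqBackend *next;

    assert(m != NULL);
    next = want_xive ? &m->xive : &m->xics;
    if (next == m->active) {
        return;
    }
    backend_deactivate(m->active);
    backend_activate(next);
    m->active = next;
}
```

The set of objects is the same before and after irq_machine_select(),
so the notional QOM tree does not change. Only the inactive side's
runtime state is torn down.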

> 
> It used to be considered ugly to create a QEMU device at reset time, so 
> I wonder if this is still the case, because when the machine reaches CAS, 
> we really are beyond reset.   
> 
> If this is OK, then the next "issue" is to keep in sync the allocated 
> IRQ numbers. The IRQ allocator is now merged at the machine level, so 
> the synchronization is obvious to do when both backend QEMU objects 
> are available. that's the path I took. If both QEMU objects are not 
> available, then we need to scan the IRQ number space in the current 
> interrupt mode and allocate the same IRQs in the newly negotiated mode. 
> Probably OK. I don't see major problems with the current code. 
> 
> Migration is a problem. We will need both backend QEMU objects to be 
> available anyhow if we want to migrate. So we are back to the current 
> solution creating both QEMU objects but we can try to defer some of the 
> KVM inits and create the KVM device on demand at CAS time.
> 
> The next problem is the ICP object that currently needs the KVM device 
> fd to connect the vcpus ... So, we will need to change that also. 
> That is probably the biggest problem today. We need a way to disconnect 
> the vcpu from the KVM device and see how we can defer the connection.
> I need to make sure this is possible, I can check that without XIVE
> I think.
> 
> >> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
> >> process ? So, the machine would not have any interrupt controller before 
> >> CAS. That seems really late to me. grub uses the console for instance. 
> > 
> > We start with XICS by default.
> 
> yes.
> 
> >> I think it should prepare for both options, start in XIVE legacy mode, 
> >> which is XICS, then possibly switch to XIVE exploitation mode.
> >>
> >>>>> And how that will interact with KVM ? 
> >>>>
> >>>> I expect we will do the same, which is to create two KVM devices to 
> >>>> be able to handle both interrupt controller backends depending on the 
> >>>> mode negotiated by the guest.  
> >>>
> >>> That will be an ungodly mess, I'd rather we only instantiate the right
> >>> one.
> >>
> >> It's rather transparent currently in the emulated version. There are two 
> >> sets of objects in QEMU, switching is done in CAS. KVM support should not 
> >> change anything in that area. 
> >>
> >> I expect the 'xive-kvm' object to get/set states for migration, just like 
> >> for XICS and to setup the ESB+TIMA memory regions, which is new. 
> > 
> > But both XICS and XIVE are completely different kernel KVM devices that will
> > need to "hook" into the same set of internal hooks for things like interrupts
> > being passed through, RTAS calls etc... 
> > 
> > How does KVM know which one to "activate" ?
> 
> Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? 
> I haven't studied all the low level details though.
> 
> > I don't think the kernel should have both. 
> 
> I hear that. From a QEMU perspective, it is much easier to put everything 
> in place for both interrupt modes and let the guest decide what it wants 
> to use. 
> 
> If we choose not to, we will need to find a solution to defer the KVM inits
> and to disconnect/reconnect the vcpus. For the latter, we could add a 
> KVM_DISABLE_CAP ioctl or maybe better add a new capability like 
> KVM_CAP_IRQ_XIVE to perform the switch.
> 
> 
> C.
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-02-11 22:55                     ` Benjamin Herrenschmidt
  2018-02-12  2:02                       ` Alexey Kardashevskiy
  2018-02-12  7:10                       ` [Qemu-devel] " Cédric Le Goater
@ 2018-04-12  5:16                       ` David Gibson
  2018-04-12  8:36                         ` Cédric Le Goater
  2 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-12  5:16 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Cédric Le Goater, qemu-ppc, qemu-devel, Greg Kurz

On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote:
> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
> > On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> > > On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> > > > Migration is a problem. We will need both backend QEMU objects to be 
> > > > available anyhow if we want to migrate. So we are back to the current 
> > > > solution creating both QEMU objects but we can try to defer some of the 
> > > > KVM inits and create the KVM device on demand at CAS time.
> > > 
> > > Do we have a way to migrate a piece of info from the machine *first*
> > > > that indicates what type of XICS/XIVE to instantiate ?
> > 
> > Nope.  qemu migration doesn't work like that.  Yes, it should, and
> > everyone knows it, but changing it is a really long term project.
> 
> Well, we have a problem then. It looks like Qemu broken migration is
> fundamentally incompatible with PAPR and CAS design...

Hrm, the fit is very clunky certainly, but I think we can make it work.

> I know we don't migrate the configuration, that's not exactly what I
> had in mind tho... Can we have some piece of *data* from the machine be
> migrated first, and use it on the target to reconfigure the interrupt
> controller before the stream arrives ?

Sorta.. maybe.. but it would probably get really ugly if we don't
preserve the usual way object lifetimes work.

> Otherwise, we have indeed not much choice but the horrible wart of
> creating both interrupt controllers with only one "active".

I really think this is the way to go, warts and all.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  5:07       ` David Gibson
@ 2018-04-12  8:18         ` Cédric Le Goater
  2018-04-16  4:26           ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-12  8:18 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 04/12/2018 07:07 AM, David Gibson wrote:
> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>>>> With the POWER9 processor comes a new interrupt controller called
>>>> XIVE. It is composed of three sub-engines :
>>>>
>>>>   - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
>>>>     in the main controller for the IPIs and in the PSI host
>>>>     bridge. They are configured to feed the IVRE with events.
>>>>
>>>>   - Interrupt Virtualization Routing Engine (IVRE). Their job is to
>>>>     match an event source with a Notification Virtualization Target
>>>>     (NVT), a priority and an Event Queue (EQ) to determine if a
>>>>     Virtual Processor can handle the event.
>>>>
>>>>   - Interrupt Virtualization Presentation Engine (IVPE). It maintains
>>>>     the interrupt state of each hardware thread and presents the
>>>>     notification as an external exception.
>>>>
>>>> Each of the engines uses a set of internal tables to redirect
>>>> exceptions from event sources to CPU threads. The first table we
>>>> introduce is the Interrupt Virtualization Entry (IVE) table, part of
>>>> the virtualization engine in charge of routing events. It associates
>>>> event sources (IRQ numbers) to event queues which will forward, or
>>>> not, the event notification to the presentation controller.
>>>>
>>>> The XIVE model is designed to make use of the full range of the IRQ
>>>> number space and does not use an offset like the XICS mode does.
>>>> Hence, the IVE table is directly indexed by the IRQ number.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>
>>> As you've suggested in yourself, I think we might need to more
>>> explicitly model the different components of the XIVE system.  As part
>>> of that, I think you need to be clearer in this base skeleton about
>>> exactly what component your XIVE object represents.
> 
> Sorry it's been so long since I looked at these.

That's fine. I have been working on a XIVE device model for the PowerNV
machine and KVM support for the pseries. I have a better understanding
of the overall picture.

The patchset has not changed much so we can still discuss on this
basis without me flooding the mailing list.

>> ok. The base skeleton is the IVRE, the central engine handling 
>> the routing. 
>>
>>> If the answer is "the overall thing" 
>>
>> Yes, it is more or less that currently. 
>>
>> The sPAPRXive object models the source engine and the routing 
>> engine in one object.
> 
> Yeah, I suspect we don't want that.  Although it might seem simpler in
> the spapr case, at least at first glance, I think it will cause us
> problems later.  At the very least, it's likely to make it harder to
> share code between the spapr and powernv case.  I think it will also
> make for more confusion about exactly what things belong where.

I tend to agree. 

We need to clarify (a bit) what is in the XIVE interrupt controller 
silicon, and how XIVE works. The XIVE device models for spapr and 
powernv should be very close as the differences are small. KVM support 
should be built on the spapr model.

There are 3 different sub-engines in the XIVE interrupt controller
device :

* IVSE (XiveSource model)

  interrupt sources, which expose their PQ bits through ESB MMIO pages 
  (there are different levels of support depending on HW revision) 

  The XIVE interrupt controller has a set of internal sources for 
  IPIs and CAPI-like interrupts.

* IVRE (No real model)

  in the middle, doing the routing of source event notification to
  (cpu) targets. It relies on internal tables which are stored in 
  the hypervisor/QEMU/KVM for the spapr machine and in the VM RAM 
  for the powernv machine. 

  Configuration updates of the XIVE tables are done through hcalls 
  on spapr and with MMIOs on the IC regs on powernv. On the latter,
  the changes are flushed back to the VM RAM. 

* IVPE (XiveNVT)

  set of registers for interrupt management at the CPU level. Exposed
  in a specific MMIO region called the TIMA.
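
To make the PQ bits of the IVSE a bit more concrete, here is a minimal
sketch of the 2-bit per-source state machine. It is not code from this
patchset: the names, the encoding (P as bit 1, Q as bit 0) and the exact
transitions are assumptions written down for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding of the 2-bit ESB state: P is bit 1, Q is bit 0. */
enum {
    ESB_RESET   = 0x0, /* P=0 Q=0: idle, the next trigger is forwarded */
    ESB_OFF     = 0x1, /* P=0 Q=1: source disabled, triggers dropped   */
    ESB_PENDING = 0x2, /* P=1 Q=0: event forwarded, waiting for an EOI */
    ESB_QUEUED  = 0x3, /* P=1 Q=1: a further event arrived meanwhile   */
};

/* Returns true when the event must be forwarded to the routing engine. */
static bool esb_trigger(uint8_t *pq)
{
    assert(pq != NULL);

    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;      /* coalesce until the EOI */
        return false;
    default:                   /* ESB_OFF */
        *pq = ESB_OFF;
        return false;
    }
}

/* Returns true when a coalesced event must be replayed after the EOI. */
static bool esb_eoi(uint8_t *pq)
{
    assert(pq != NULL);

    switch (*pq & 0x3) {
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    case ESB_OFF:
        *pq = ESB_OFF;
        return false;
    default:                   /* ESB_RESET, ESB_PENDING */
        *pq = ESB_RESET;
        return false;
    }
}
```

A source model would call esb_trigger() when a device raises an event
and esb_eoi() on the ESB MMIO access done by the OS to end the interrupt.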

The XIVE tables are :

* IVT

  associates an interrupt source number with an event queue. The data
  to be pushed to the queue is also stored there.

* EQDT:

  describes the queues in the OS RAM, also contains a set of flags,
  a virtual target, etc.

* VPDT:

  describes the virtual targets, which can be of different natures:
  an LPAR, a CPU. This is for powernv; spapr does not have this 
  concept.
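
As an illustration of what an IVT entry carries, here is a small sketch
that packs and unpacks an IVE using the field layout from the
xive-internal.h hunk quoted further down (IBM bit numbering, bit 0 being
the most significant bit). The macros are redefined locally so that the
example stands alone, and the ive_* helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word. */
#define PPC_BIT(bit)         (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be)  ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* Field layout as in the patch. */
#define IVE_VALID     PPC_BIT(0)
#define IVE_EQ_BLOCK  PPC_BITMASK(4, 7)    /* Destination EQ block# */
#define IVE_EQ_INDEX  PPC_BITMASK(8, 31)   /* Destination EQ index */
#define IVE_MASKED    PPC_BIT(32)          /* Masked */
#define IVE_EQ_DATA   PPC_BITMASK(33, 63)  /* Data written to the EQ */

/* Shift of the lowest set bit of a non-zero mask (GCC/Clang builtin). */
static inline unsigned ive_mask_to_lsh(uint64_t mask)
{
    assert(mask != 0);
    return (unsigned)__builtin_ctzll(mask);
}

static inline uint64_t ive_getfield(uint64_t mask, uint64_t w)
{
    return (w & mask) >> ive_mask_to_lsh(mask);
}

static inline uint64_t ive_setfield(uint64_t mask, uint64_t w, uint64_t val)
{
    return (w & ~mask) | ((val << ive_mask_to_lsh(mask)) & mask);
}

/* Hypothetical helper: a valid, unmasked IVE targeting one EQ. */
static inline uint64_t ive_make(uint32_t eq_index, uint32_t eq_data)
{
    uint64_t w = IVE_VALID;

    w = ive_setfield(IVE_EQ_INDEX, w, eq_index);
    w = ive_setfield(IVE_EQ_DATA, w, eq_data);
    return w;
}
```

Keeping the whole entry in a single 64-bit word, as the patch does, means
masking a source is a one-bit update of that word.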


So, the idea behind the sPAPRXive object is to model a XIVE interrupt
controller device. It contains today :

 - an internal source block for all interrupts : IPIs and virtual 
   device interrupts. In the IRQ number space, the IPIs are below
   4096 and the device interrupts above, which keeps compatibility 
   with XICS. This is important to be able to change interrupt mode.

   PowerNV has different source blocks, like for P8.

 - a routing engine, which is limited to the IVT. This is a shortcut 
   and it might be better to introduce a specific object. Anyhow, this 
   is a state to capture.

   In the current version I am working on, the XiveFabric interface is
   more complex :

	typedef struct XiveFabricClass {
	    InterfaceClass parent;
	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
	} XiveFabricClass;

   It helps in making the routing algorithm independent of the model. 
   I hope to make powernv converge and use it.

 - a set of MMIOs for the TIMA. They model the presenter engine. 
   current_cpu is used to retrieve the NVT object, which holds the 
   registers for interrupt management.  

The EQs are stored under the NVT. This saves us an unnecessary EQDT 
table. But we could add one under the XIVE device model.
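
To show why the three XiveFabric lookups are enough to keep the routing
independent of the machine, here is a rough, self-contained sketch. The
Sketch* types are simplified stand-ins invented for this example (they
are not the XiveIVE/XiveEQ/XiveNVT structures), and the table-backed
implementation only plays the role of a machine keeping its tables in
QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-ins; every field here is an assumption. */
typedef struct { bool valid, masked; uint32_t eq_idx, eq_data; } SketchIVE;
typedef struct { uint32_t server; uint8_t priority; uint32_t last_data; } SketchEQ;
typedef struct { uint8_t pending; /* one bit per priority */ } SketchNVT;

/* Lookup callbacks, mirroring the shape of XiveFabricClass. */
typedef struct SketchFabric SketchFabric;
struct SketchFabric {
    SketchIVE *(*get_ive)(SketchFabric *xf, uint32_t lisn);
    SketchNVT *(*get_nvt)(SketchFabric *xf, uint32_t server);
    SketchEQ  *(*get_eq)(SketchFabric *xf, uint32_t eq_idx);
    void *opaque;               /* machine-specific table storage */
};

/* Model-independent routing: only the callbacks are used. */
static bool sketch_route(SketchFabric *xf, uint32_t lisn)
{
    assert(xf != NULL);

    SketchIVE *ive = xf->get_ive(xf, lisn);
    if (!ive || !ive->valid || ive->masked) {
        return false;           /* unknown, disabled or masked source */
    }
    SketchEQ *eq = xf->get_eq(xf, ive->eq_idx);
    if (!eq) {
        return false;
    }
    eq->last_data = ive->eq_data;   /* stands in for the queue push */

    SketchNVT *nvt = xf->get_nvt(xf, eq->server);
    if (!nvt) {
        return false;
    }
    nvt->pending |= (uint8_t)(1u << (eq->priority & 7));
    return true;
}

/* One possible backend: fixed-size tables owned by the machine. */
#define SKETCH_NR 8
typedef struct {
    SketchIVE ivt[SKETCH_NR];
    SketchEQ  eqs[SKETCH_NR];
    SketchNVT nvts[SKETCH_NR];
} SketchTables;

static SketchIVE *tbl_get_ive(SketchFabric *xf, uint32_t lisn)
{
    SketchTables *t = xf->opaque;
    return lisn < SKETCH_NR ? &t->ivt[lisn] : NULL;
}

static SketchNVT *tbl_get_nvt(SketchFabric *xf, uint32_t server)
{
    SketchTables *t = xf->opaque;
    return server < SKETCH_NR ? &t->nvts[server] : NULL;
}

static SketchEQ *tbl_get_eq(SketchFabric *xf, uint32_t eq_idx)
{
    SketchTables *t = xf->opaque;
    return eq_idx < SKETCH_NR ? &t->eqs[eq_idx] : NULL;
}

static void sketch_fabric_init(SketchFabric *xf, SketchTables *t)
{
    xf->get_ive = tbl_get_ive;
    xf->get_nvt = tbl_get_nvt;
    xf->get_eq = tbl_get_eq;
    xf->opaque = t;
}
```

A second backend reading its tables from guest RAM would only need to
provide different callbacks; sketch_route() would stay untouched.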


>> I have merged these for simplicity and because the interrupt 
>> controller has an internal source for the interrupts of the "IPI" 
>> type, which are used for the CPU IPIs but also for other generic 
>> interrupts, like the OpenCAPI ones. The XIVE sPAPR interface is 
>> also much simpler than the baremetal one, all the tables are 
>> maintained in the hypervisor, so this choice made some sense. 
>>
>> But since, I have started the PowerNV model and I am duplicating 
>> a lot of code to handle the triggering and the MMIOs in the 
>> different sources. So I am not convinced anymore. Nevertheless, 
>> the overall routing logic is the same even if some the tables 
>> are not located in QEMU anymore, but in the machine memory.
>>
>> The sPAPRXiveNVT models some of the CPU presenter engine. It 
>> holds the virtual CPU interrupt states when not dispatched on 
>> a real HW thread. Real world is more complex. There are "CAM" 
>> lines in the HW threads which are compared to find a matching 
>> candidate. But I don't think we need to anything more complex 
>> than today unless we want to support KVM under TCG ...
>>    
>>> I suspect that's not what you
>>> want - I had one of those for XICs which proved to be a mistake
>>> (eventually replaced by the XICSFabric interface).
>>
>> The XICSFabric would be the main Xive object. The interface 
>> between the sources and the routing engine is hidden in sPAPR, 
>> we can use a simple function call : 
>>
>> 	spapr_xive_irq(pnv->xive, irq);
>>
>> we could get rid of the qirqs but they are required for XICS.
> 
> I don't quite follow, but this doesn't sound right.

I don't remember what I had in mind at that time. Let's forget it.
 
>> PowerNV uses MMIOs to notify an event and it makes the modeling
>> somewhat easier. Each controller model has a notify port address 
>> register on which a interrupt number is written to forward an 
>> event to the routing engine. So it is a simple store. 
>>
>> I don't know why there is a different notify port address per
>> source, may be for extra filtering at the routing engine level.   
>>
>>> Changing the model later isn't impossible, but doing so without
>>> breaking migration can be a real pain, so I think it's worth a
>>> reasonable effort to try and get it right initially.
>>
>> I completely agree. 
>>
>> This is why I have started the PnvXive model to challenge the 
>> current PAPR design. I have hacked a bunch of patches for XIVE, 
>> LPC, PSI, OCC and basic PPC support which boot a PowerNV P9 up to 
>> petitboot. It would look better with a source object, but the 
>> location of the PQ bits is a bit problematic. It highly depends 
>> on the controller. The main controller uses tables in the hypervisor
>> memory. The PSIHB controller has its own bits. I suppose it is 
>> the same for PHB4. I need to take a closer look at how we could
>> have a common source object.
> 
> Ok, sounds like a good idea.

I made progress and the spapr and powernv models are nearly 
reconciled. The only difference in the routing algorithm is the 
privilege level at which the NVT notification is done. powernv works 
at the HV level.

>> The most important part is KVM support and how we expose the 
>> MMIO region. We need to make progress on that topic.

For KVM, a set of *_kvm objects handles the differences with the 
emulated mode. ram_device memory regions are needed for the ESB 
MMIO pages and the TIMA. That's mostly it.  

C.



>> Thanks,
>>
>> C.  
>>  
>>
>>>> ---
>>>>
>>>>  Changes since v1 :
>>>>
>>>>  - used g_new0 instead of g_malloc0
>>>>  - removed VMSTATE_STRUCT_VARRAY_UINT32_ALLOC 
>>>>  - introduced a device reset handler. the object needs to be parented
>>>>    to sysbus when created.
>>>>  - renamed spapr_xive_irq_set to spapr_xive_irq_enable
>>>>  - renamed spapr_xive_irq_unset to spapr_xive_irq_disable
>>>>  - moved the PPC_BIT macros under target/ppc/cpu.h
>>>>  - shrinked file copyright header
>>>>
>>>>  default-configs/ppc64-softmmu.mak |   1 +
>>>>  hw/intc/Makefile.objs             |   1 +
>>>>  hw/intc/spapr_xive.c              | 156 ++++++++++++++++++++++++++++++++++++++
>>>>  hw/intc/xive-internal.h           |  41 ++++++++++
>>>>  include/hw/ppc/spapr_xive.h       |  35 +++++++++
>>>>  5 files changed, 234 insertions(+)
>>>>  create mode 100644 hw/intc/spapr_xive.c
>>>>  create mode 100644 hw/intc/xive-internal.h
>>>>  create mode 100644 include/hw/ppc/spapr_xive.h
>>>>
>>>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>>>> index d1b3a6dd50f8..4a7f6a0696de 100644
>>>> --- a/default-configs/ppc64-softmmu.mak
>>>> +++ b/default-configs/ppc64-softmmu.mak
>>>> @@ -56,6 +56,7 @@ CONFIG_SM501=y
>>>>  CONFIG_XICS=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>>> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>>>>  # For PReP
>>>>  CONFIG_SERIAL_ISA=y
>>>>  CONFIG_MC146818RTC=y
>>>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>>>> index ae358569a155..49e13e7aeeee 100644
>>>> --- a/hw/intc/Makefile.objs
>>>> +++ b/hw/intc/Makefile.objs
>>>> @@ -35,6 +35,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>>>>  obj-$(CONFIG_XICS) += xics.o
>>>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>>>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>> new file mode 100644
>>>> index 000000000000..e6e8841add17
>>>> --- /dev/null
>>>> +++ b/hw/intc/spapr_xive.c
>>>> @@ -0,0 +1,156 @@
>>>> +/*
>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/log.h"
>>>> +#include "qapi/error.h"
>>>> +#include "target/ppc/cpu.h"
>>>> +#include "sysemu/cpus.h"
>>>> +#include "sysemu/dma.h"
>>>> +#include "monitor/monitor.h"
>>>> +#include "hw/ppc/spapr_xive.h"
>>>> +
>>>> +#include "xive-internal.h"
>>>> +
>>>> +/*
>>>> + * Main XIVE object
>>>> + */
>>>> +
>>>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < xive->nr_irqs; i++) {
>>>> +        XiveIVE *ive = &xive->ivt[i];
>>>> +
>>>> +        if (!(ive->w & IVE_VALID)) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
>>>> +                       ive->w & IVE_MASKED ? "M" : " ",
>>>> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>>>> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
>>>> +    }
>>>> +}
>>>> +
>>>> +static void spapr_xive_reset(DeviceState *dev)
>>>> +{
>>>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +    int i;
>>>> +
>>>> +    /* Mask all valid IVEs in the IRQ number space. */
>>>> +    for (i = 0; i < xive->nr_irqs; i++) {
>>>> +        XiveIVE *ive = &xive->ivt[i];
>>>> +        if (ive->w & IVE_VALID) {
>>>> +            ive->w |= IVE_MASKED;
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +
>>>> +    if (!xive->nr_irqs) {
>>>> +        error_setg(errp, "Number of interrupt needs to be greater 0");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Allocate the IVT (Interrupt Virtualization Table) */
>>>> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>> +}
>>>> +
>>>> +static const VMStateDescription vmstate_spapr_xive_ive = {
>>>> +    .name = TYPE_SPAPR_XIVE "/ive",
>>>> +    .version_id = 1,
>>>> +    .minimum_version_id = 1,
>>>> +    .fields = (VMStateField []) {
>>>> +        VMSTATE_UINT64(w, XiveIVE),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    },
>>>> +};
>>>> +
>>>> +static bool vmstate_spapr_xive_needed(void *opaque)
>>>> +{
>>>> +    /* TODO check machine XIVE support */
>>>> +    return true;
>>>> +}
>>>> +
>>>> +static const VMStateDescription vmstate_spapr_xive = {
>>>> +    .name = TYPE_SPAPR_XIVE,
>>>> +    .version_id = 1,
>>>> +    .minimum_version_id = 1,
>>>> +    .needed = vmstate_spapr_xive_needed,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>>>> +        VMSTATE_STRUCT_VARRAY_UINT32(ivt, sPAPRXive, nr_irqs, 1,
>>>> +                                     vmstate_spapr_xive_ive, XiveIVE),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    },
>>>> +};
>>>> +
>>>> +static Property spapr_xive_properties[] = {
>>>> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>>>> +    DEFINE_PROP_END_OF_LIST(),
>>>> +};
>>>> +
>>>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +
>>>> +    dc->realize = spapr_xive_realize;
>>>> +    dc->reset = spapr_xive_reset;
>>>> +    dc->props = spapr_xive_properties;
>>>> +    dc->desc = "sPAPR XIVE interrupt controller";
>>>> +    dc->vmsd = &vmstate_spapr_xive;
>>>> +}
>>>> +
>>>> +static const TypeInfo spapr_xive_info = {
>>>> +    .name = TYPE_SPAPR_XIVE,
>>>> +    .parent = TYPE_SYS_BUS_DEVICE,
>>>> +    .instance_size = sizeof(sPAPRXive),
>>>> +    .class_init = spapr_xive_class_init,
>>>> +};
>>>> +
>>>> +static void spapr_xive_register_types(void)
>>>> +{
>>>> +    type_register_static(&spapr_xive_info);
>>>> +}
>>>> +
>>>> +type_init(spapr_xive_register_types)
>>>> +
>>>> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn)
>>>> +{
>>>> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>> +}
>>>> +
>>>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn)
>>>> +{
>>>> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
>>>> +
>>>> +    if (!ive) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    ive->w |= IVE_VALID;
>>>> +    return true;
>>>> +}
>>>> +
>>>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>>>> +{
>>>> +    XiveIVE *ive = spapr_xive_get_ive(xive, lisn);
>>>> +
>>>> +    if (!ive) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    ive->w &= ~IVE_VALID;
>>>> +    return true;
>>>> +}
>>>> diff --git a/hw/intc/xive-internal.h b/hw/intc/xive-internal.h
>>>> new file mode 100644
>>>> index 000000000000..132b71a6daf0
>>>> --- /dev/null
>>>> +++ b/hw/intc/xive-internal.h
>>>> @@ -0,0 +1,41 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2016-2017, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#ifndef _INTC_XIVE_INTERNAL_H
>>>> +#define _INTC_XIVE_INTERNAL_H
>>>> +
>>>> +/* Utilities to manipulate these (originaly from OPAL) */
>>>> +#define MASK_TO_LSH(m)          (__builtin_ffsl(m) - 1)
>>>> +#define GETFIELD(m, v)          (((v) & (m)) >> MASK_TO_LSH(m))
>>>> +#define SETFIELD(m, v, val)                             \
>>>> +        (((v) & ~(m)) | ((((typeof(v))(val)) << MASK_TO_LSH(m)) & (m)))
>>>> +
>>>> +/* IVE/EAS
>>>> + *
>>>> + * One per interrupt source. Targets that interrupt to a given EQ
>>>> + * and provides the corresponding logical interrupt number (EQ data)
>>>> + *
>>>> + * We also map this structure to the escalation descriptor inside
>>>> + * an EQ, though in that case the valid and masked bits are not used.
>>>> + */
>>>> +typedef struct XiveIVE {
>>>> +        /* Use a single 64-bit definition to make it easier to
>>>> +         * perform atomic updates
>>>> +         */
>>>> +        uint64_t        w;
>>>> +#define IVE_VALID       PPC_BIT(0)
>>>> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
>>>> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
>>>> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
>>>> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>>>> +} XiveIVE;
>>>> +
>>>> +XiveIVE *spapr_xive_get_ive(sPAPRXive *xive, uint32_t lisn);
>>>> +
>>>> +#endif /* _INTC_XIVE_INTERNAL_H */
>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>>> new file mode 100644
>>>> index 000000000000..5b1f78e06a1e
>>>> --- /dev/null
>>>> +++ b/include/hw/ppc/spapr_xive.h
>>>> @@ -0,0 +1,35 @@
>>>> +/*
>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#ifndef PPC_SPAPR_XIVE_H
>>>> +#define PPC_SPAPR_XIVE_H
>>>> +
>>>> +#include <hw/sysbus.h>
>>>> +
>>>> +typedef struct sPAPRXive sPAPRXive;
>>>> +typedef struct XiveIVE XiveIVE;
>>>> +
>>>> +#define TYPE_SPAPR_XIVE "spapr-xive"
>>>> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>>>> +
>>>> +struct sPAPRXive {
>>>> +    SysBusDevice parent;
>>>> +
>>>> +    /* Properties */
>>>> +    uint32_t     nr_irqs;
>>>> +
>>>> +    /* XIVE internal tables */
>>>> +    XiveIVE      *ivt;
>>>> +};
>>>> +
>>>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn);
>>>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>>>> +
>>>> +#endif /* PPC_SPAPR_XIVE_H */
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  5:08       ` David Gibson
@ 2018-04-12  8:28         ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-12  8:28 UTC (permalink / raw)
  To: David Gibson, Benjamin Herrenschmidt; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 04/12/2018 07:08 AM, David Gibson wrote:
> On Thu, Dec 21, 2017 at 11:12:06AM +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2017-12-20 at 16:09 +1100, David Gibson wrote:
>>>
>>> As you've suggested in yourself, I think we might need to more
>>> explicitly model the different components of the XIVE system.  As part
>>> of that, I think you need to be clearer in this base skeleton about
>>> exactly what component your XIVE object represents.
>>>
>>> If the answer is "the overall thing" I suspect that's not what you
>>> want - I had one of those for XICs which proved to be a mistake
>>> (eventually replaced by the XICSFabric interface).
>>>
>>> Changing the model later isn't impossible, but doing so without
>>> breaking migration can be a real pain, so I think it's worth a
>>> reasonable effort to try and get it right initially.
>>
>> Note: we do need to speed things up a bit, as having exploitation mode
>> in KVM will significantly help with IPI performance among other things.
>>
>> I'm about ready to do the KVM bits. The one thing we need to discuss
>> and figure a good design for is how we map all those interrupt control
>> pages into qemu.
>>
>> Each interrupt (either PCIe pass-through or the "generic XIVE IPIs"
>> which are used for guest IPIs and for vio/virtio/emulated interrupts)
>> comes with a "control page" (ESB page) which needs to be mapped into
>> the guest, and the generic IPIs also come with a trigger page which
>> needs to be mapped into the guest for guest IPIs or OpenCAPI
>> interrupts, or just qemu for emulated devices.
>>
>> Now that can be thousands of these critters. I certainly don't want to
>> create thousands of VMAs in qemu and even less thousands of memory
>> regions in KVM.
>>
>> So we need some kind of mechanism by wich a single large VMA gets
>> mmap'ed into qemu (or maybe a couple of these, but not too many) and
>> the interrupt pages can be assigned to slots in there and demand
>> faulted.
> 
> Ok, I see your point.  We'll definitely need to be able to map things
> in as a block, rather than one by one.

So, the approach taken is to expose a single mmap()'ed ram_device 
memory region to the guest. Its size is that of the IRQ number space, 
which is hardcoded to 4096 (IPIs) + 1024 (virtual device interrupts) in 
QEMU. We can change that, but the 4K split is important for XICS 
compatibility. The KVM XIVE device should adapt by itself.
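
A small sketch of the IRQ number space layout described above, and of
how a source could be located inside one large region. The SKETCH_*
names and helpers are hypothetical, and the one-slot-per-LISN layout with
a power-of-two slot size (esb_shift) is an assumption, not something
fixed by this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_NR_IPIS      4096u  /* IPIs live below this boundary     */
#define SKETCH_NR_DEV_IRQS  1024u  /* virtual device interrupts, above  */
#define SKETCH_NR_IRQS      (SKETCH_NR_IPIS + SKETCH_NR_DEV_IRQS)

static inline bool sketch_lisn_is_valid(uint32_t lisn)
{
    return lisn < SKETCH_NR_IRQS;
}

static inline bool sketch_lisn_is_ipi(uint32_t lisn)
{
    return lisn < SKETCH_NR_IPIS;
}

/*
 * Byte offset of the ESB slot of a source inside the single region,
 * assuming one slot of (1 << esb_shift) bytes per LISN.
 */
static inline uint64_t sketch_esb_offset(uint32_t lisn, unsigned esb_shift)
{
    assert(sketch_lisn_is_valid(lisn));
    return (uint64_t)lisn << esb_shift;
}

/* Size of the region covering the whole IRQ number space. */
static inline uint64_t sketch_esb_region_size(unsigned esb_shift)
{
    return (uint64_t)SKETCH_NR_IRQS << esb_shift;
}
```

With such a layout, a single mapping covers every source and no
per-interrupt VMA or memory region is needed.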

C. 


>> For the generic interrupts, this can probably be covered by KVM, adding
>> some arch ioctls for allocating IPIs and mmap'ing that region etc...
>>
>> For pass-through, it's trickier, we don't want to mmap each irqfd
>> individually for the above reason, so we want to "link" them to KVM. We
>> don't want to allow qemu to take control of any arbitrary interrupt in
>> the system though, so it has to related to the ownership of the irqfd
>> coming from vfio.
>>
>> OpenCAPI I suspect will be its own can of worms...
>>
>> Also, have we decided how the process of switching between XICS and
>> XIVE will work vs. CAS ? And how that will interact with KVM ? I was
>> thinking the kernel would implement a different KVM device type, ie
>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
>> KVM_DEV_TYPE_XIVE.
>>
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  5:16                       ` David Gibson
@ 2018-04-12  8:36                         ` Cédric Le Goater
  2018-04-16  4:29                           ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-12  8:36 UTC (permalink / raw)
  To: David Gibson, Benjamin Herrenschmidt; +Cc: qemu-ppc, qemu-devel, Greg Kurz

On 04/12/2018 07:16 AM, David Gibson wrote:
> On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote:
>> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
>>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
>>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>>>>> Migration is a problem. We will need both backend QEMU objects to be 
>>>>> available anyhow if we want to migrate. So we are back to the current 
>>>>> solution creating both QEMU objects but we can try to defer some of the 
>>>>> KVM inits and create the KVM device on demand at CAS time.
>>>>
>>>> Do we have a way to migrate a piece of info from the machine *first*
>>>> that indicate what type of XICS/XIVE to instanciate ?
>>>
>>> Nope.  qemu migration doesn't work like that.  Yes, it should, and
>>> everyone knows it, but changing it is a really long term project.
>>
>> Well, we have a problem then. It looks like Qemu broken migration is
>> fundamentally incompatible with PAPR and CAS design...
> 
> Hrm, the fit is very clunky certainly, but i think we can make it work.
> 
>> I know we don't migrate the configuration, that's not exactly what I
>> had in mind tho... Can we have some piece of *data* from the machine be
>> migrated first, and use it on the target to reconfigure the interrupt
>> controller before the stream arrives ?
> 
> Sorta.. maybe.. but it would probably get really ugly if we don't
> preserve the usual way object lifetimes work.
> 
>> Otherwise, we have indeed no much choice but the horrible wart of
>> creating both interrupt controllers with only one "active".
> 
> I really think this is the way to go, warts and all.
> 

Yes ... KVM makes it a little uglier. 

A KVM_DEVICE_DESTROY ioctl is needed to clean up the VM and a 
DISABLE_CAP to disconnect the vcpu from the current KVM XIVE/XICS 
device. I have used an extra arg on ENABLE_CAP for the moment.    

At the QEMU level, we need to connect/reconnect at reset time to
handle possible changes in CAS, and at post_load.

Destroying the MemoryRegion is a bit problematic. I have not
found a common layout compatible with both the emulated mode 
(std IO regions) and the KVM mode (ram device regions).

C.


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  5:10             ` David Gibson
@ 2018-04-12  8:41               ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-12  8:41 UTC (permalink / raw)
  To: David Gibson; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

On 04/12/2018 07:10 AM, David Gibson wrote:
> On Wed, Jan 17, 2018 at 10:18:43AM +0100, Cédric Le Goater wrote:
>>>>> Also, have we decided how the process of switching between XICS and
>>>>> XIVE will work vs. CAS ? 
>>>>
>>>> That's how it is described in the architecture. The current choice is
>>>> to create both XICS and XIVE objects and choose at CAS which one to
>>>> use. It relies today on the capability of the pseries machine to 
>>>> allocate IRQ numbers for both interrupt controller backends. These
>>>> patches have been merged in QEMU.
>>>>
>>>> A change of interrupt mode results in a reset. The device tree is 
>>>> populated accordingly and the ICPs are switched for the model in 
>>>> use. 
>>>
>>> For KVM we need to only instanciate one of them though.
>>
>> Hmm,
>>
>> How would we handle a guest rebooting on a kernel without XIVE support ? 
>> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
>> process ? So, the machine would not have any interrupt controller before 
>> CAS. That seems really late to me. grub uses the console for instance. 
>>
>> I think it should prepare for both options, start in XIVE legacy mode, 
>> which is XICS, then possibly switch to XIVE exploitation mode.
> 
> I think for our first draft we should have XIVE and XICS based
> platforms as separate machine types (or a machine option, I guess).

OK. This is my current choice for KVM. 

Emulated mode is rather simple to handle, and this is why I have  
kept the reset after CAS if there is a change in the interrupt mode. 
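
The "reset after CAS on a mode change" logic can be sketched as follows,
with both backends assumed to exist and only one being active. All names
are hypothetical and this is not how the machine code is actually
structured; it only illustrates the decision.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { SKETCH_IRQ_XICS, SKETCH_IRQ_XIVE } SketchIrqMode;

typedef struct {
    SketchIrqMode active;   /* backend currently used by the guest */
} SketchMachine;

/* The machine starts in legacy (XICS) mode. */
static void sketch_machine_init(SketchMachine *m)
{
    assert(m != NULL);
    m->active = SKETCH_IRQ_XICS;
}

/*
 * Apply the result of the CAS negotiation. Returns true when the
 * interrupt mode changed, in which case the machine is to be reset so
 * that the device tree and the presenters match the new mode.
 */
static bool sketch_cas_update(SketchMachine *m, bool guest_wants_xive)
{
    assert(m != NULL);

    SketchIrqMode wanted = guest_wants_xive ? SKETCH_IRQ_XIVE
                                            : SKETCH_IRQ_XICS;
    if (wanted == m->active) {
        return false;       /* nothing to do, no reset */
    }
    m->active = wanted;
    return true;            /* caller triggers the reset */
}
```

A guest rebooting on a kernel without XIVE support simply negotiates
XICS again, which flips the mode back and triggers one more reset.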
 
> We do want to allow this to be autonegotiated, but I feel like
> emphasising that at the beginning is causing unnatural design
> decisions in the XIVE model itself.

Yes. This is mostly a KVM problem which also has impacts on XICS 
of course ...

C.  
 
>>
>>>>> And how that will interact with KVM ?
>>>>
>>>> I expect we will do the same, which is to create two KVM devices to 
>>>> be able to handle both interrupt controller backends depending on the 
>>>> mode negotiated by the guest.  
>>>
>>> That will be an ungodly mess, I'd rather we only instanciate the right
>>> one.
>>
>> It's rather transparent currently in the emulated version. There are two 
>> sets of objects in QEMU, switching is done in CAS. KVM support should not 
>> change anything in that area. 
>>
>> I expect the 'xive-kvm' object to get/set states for migration, just like 
>> for XICS and to setup the ESB+TIMA memory regions, which is new. 
>>
>> C. 
>>  
>>>>> I was
>>>>> thinking the kernel would implement a different KVM device type, ie
>>>>> the "emulated XICS" would remain KVM_DEV_TYPE_XICS and XIVE would be
>>>>> KVM_DEV_TYPE_XIVE.
>>>>
>>>> yes. it makes sense. The new device will have a lot in common with the 
>>>> KVM_DEV_TYPE_XICS using kvm_xive_ops.
>>>
>>> Ben.
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  5:15                 ` David Gibson
@ 2018-04-12  8:51                   ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-12  8:51 UTC (permalink / raw)
  To: David Gibson; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

On 04/12/2018 07:15 AM, David Gibson wrote:
> On Wed, Jan 17, 2018 at 03:39:46PM +0100, Cédric Le Goater wrote:
>> On 01/17/2018 12:10 PM, Benjamin Herrenschmidt wrote:
>>> On Wed, 2018-01-17 at 10:18 +0100, Cédric Le Goater wrote:
>>>>>>> Also, have we decided how the process of switching between XICS and
>>>>>>> XIVE will work vs. CAS ? 
>>>>>>
>>>>>> That's how it is described in the architecture. The current choice is
>>>>>> to create both XICS and XIVE objects and choose at CAS which one to
>>>>>> use. It relies today on the capability of the pseries machine to 
>>>>>> allocate IRQ numbers for both interrupt controller backends. These
>>>>>> patches have been merged in QEMU.
>>>>>>
>>>>>> A change of interrupt mode results in a reset. The device tree is 
>>>>>> populated accordingly and the ICPs are switched for the model in 
>>>>>> use. 
>>>>>
>>>>> For KVM we need to only instanciate one of them though.
>>>>
>>>> Hmm,
>>>>
>>>> How would we handle a guest rebooting on a kernel without XIVE support ? 
>>>
>>> It will do CAS again and we can change the devices.
>>
>> So, we would destroy the previous QEMU ICS object and create a new one 
>> in the CAS hcall. That would probably work. There might be some issues 
>> in creating and destroying the ICS KVM device, but that can be studied 
>> without XIVE.
> 
> Adding and removing devices at runtime based on guest requests like
> this will get really hairy in qemu.

I confirm ...

> As I've said before for the first cut, I think we want to select just
> one as a machine option to avoid this confusion.

OK

> Looking further ahead, I think we'll be better off having both the
> XIVE and XICS models always present (at least minimally) in qemu, but
> with only one "active" at any given time.

Under emulation, it is not too complex to support both modes. 
XIVE and XICS objects are both created but spapr->ov5_cas 
filters their usage.

However, syncing the change in KVM is more complex.

> Note that having the inactive one destroy and clean up the
> corresponding KVM devices is fine, as is deallocating as much of its
> runtime state as we can without changing the notional QOM tree.

yes. I will try to send a patchset organized that way : 

 - spapr XIVE emulated mode (both modes supported)
 - XIVE KVM in an exclusive way, the machine will need to be
   restarted from the command line to change interrupt mode.   
 - support of change of interrupt mode under KVM 
 - powernv device model (rough)


C.

>> It used to be considered ugly to create a QEMU device at reset time, so 
>> I wonder if this is still the case, because when the machine reaches CAS, 
>> we really are beyond reset.   
>>
>> If this is OK, then the next "issue" is to keep in sync the allocated 
>> IRQ numbers. The IRQ allocator is now merged at the machine level, so 
>> the synchronization is obvious to do when both backend QEMU objects 
>> are available. that's the path I took. If both QEMU objects are not 
>> available, then we need to scan the IRQ number space in the current 
>> interrupt mode and allocate the same IRQs in the newly negotiated mode. 
>> Probably OK. I don't see major problems with the current code. 
>>
>> Migration is a problem. We will need both backend QEMU objects to be 
>> available anyhow if we want to migrate. So we are back to the current 
>> solution creating both QEMU objects but we can try to defer some of the 
>> KVM inits and create the KVM device on demand at CAS time.
>>
>> The next problem is the ICP object that currently needs the KVM device 
>> fd to connect the vcpus ... So, we will need to change that also. 
>> That is probably the biggest problem today. We need a way to disconnect 
>> the vpcu from the KVM device and see how we can defer the connection.
>> I need to make sure this is possible, I can check that without XIVE
>> I think.
>>
>>>> Are you suggesting to create the XICS or XIVE device in the CAS negotiation 
>>>> process ? So, the machine would not have any interrupt controller before 
>>>> CAS. That seems really late to me. grub uses the console for instance. 
>>>
>>> We start with XICS by default.
>>
>> yes.
>>
>>>> I think it should prepare for both options, start in XIVE legacy mode, 
>>>> which is XICS, then possibly switch to XIVE exploitation mode.
>>>>
>>>>>>> And how that will interact with KVM ? 
>>>>>>
>>>>>> I expect we will do the same, which is to create two KVM devices to 
>>>>>> be able to handle both interrupt controller backends depending on the 
>>>>>> mode negotiated by the guest.  
>>>>>
>>>>> That will be an ungodly mess, I'd rather we only instantiate the right
>>>>> one.
>>>>
>>>> It's rather transparent currently in the emulated version. There are two 
>>>> sets of objects in QEMU, switching is done in CAS. KVM support should not 
>>>> change anything in that area. 
>>>>
>>>> I expect the 'xive-kvm' object to get/set states for migration, just like 
>>>> for XICS and to setup the ESB+TIMA memory regions, which is new. 
>>>
>>> But both XICS and XIVE are completely different kernel KVM devices that will
>>> need to "hook" into the same set of internal hooks for things like interrupts
>>> being passed through, RTAS calls etc... 
>>>
>>> How does KVM know which one to "activate" ?
>>
>> Can't we add an extra IRQ type and use vcpu->arch.irq_type for that ? 
>> I haven't studied all the low level details though.
>>
>>> I don't think the kernel should have both. 
>>
>> I hear that. From a QEMU perspective, it is much easier to put everything 
>> in place for both interrupt modes and let the guest decide what it wants 
>> to use. 
>>
>> If we choose not to, we will need to find a solution to defer the KVM inits
>> and to disconnect/reconnect the vcpus. For the latter, we could add a 
>> KVM_DISABLE_CAP ioctl or maybe better add a new capability like 
>> KVM_CAP_IRQ_XIVE to perform the switch.
>>
>>
>> C.
>>
> 

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  8:18         ` Cédric Le Goater
@ 2018-04-16  4:26           ` David Gibson
  2018-04-19 17:40             ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-16  4:26 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

[-- Attachment #1: Type: text/plain, Size: 8231 bytes --]

On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> On 04/12/2018 07:07 AM, David Gibson wrote:
> > On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> >>>> With the POWER9 processor comes a new interrupt controller called
> >>>> XIVE. It is composed of three sub-engines :
> >>>>
> >>>>   - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
> >>>>     in the main controller for the IPIS and in the PSI host
> >>>>     bridge. They are configured to feed the IVRE with events.
> >>>>
> >>>>   - Interrupt Virtualization Routing Engine (IVRE). Their job is to
> >>>>     match an event source with a Notification Virtualization Target
> >>>>     (NVT), a priority and an Event Queue (EQ) to determine if a
> >>>>     Virtual Processor can handle the event.
> >>>>
> >>>>   - Interrupt Virtualization Presentation Engine (IVPE). It maintains
> >>>>     the interrupt state of each hardware thread and present the
> >>>>     notification as an external exception.
> >>>>
> >>>> Each of the engines uses a set of internal tables to redirect
> >>>> exceptions from event sources to CPU threads. The first table we
> >>>> introduce is the Interrupt Virtualization Entry (IVE) table, part of
> >>>> the virtualization engine in charge of routing events. It associates
> >>>> event sources (IRQ numbers) to event queues which will forward, or
> >>>> not, the event notification to the presentation controller.
> >>>>
> >>>> The XIVE model is designed to make use of the full range of the IRQ
> >>>> number space and does not use an offset like the XICS mode does.
> >>>> Hence, the IVE table is directly indexed by the IRQ number.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>
> >>> As you've suggested yourself, I think we might need to more
> >>> explicitly model the different components of the XIVE system.  As part
> >>> of that, I think you need to be clearer in this base skeleton about
> >>> exactly what component your XIVE object represents.
> > 
> > Sorry it's been so long since I looked at these.
> 
> That's fine. I have been working on a XIVE device model for the PowerNV
> machine and KVM support for the pseries. I have a better understanding
> of the overall picture.
> 
> The patchset has not changed much so we can still discuss on this
> basis without me flooding the mailing list.
> 
> >> ok. The base skeleton is the IVRE, the central engine handling 
> >> the routing. 
> >>
> >>> If the answer is "the overall thing" 
> >>
> >> Yes, it is more or less that currently. 
> >>
> >> The sPAPRXive object models the source engine and the routing 
> >> engine in one object.
> > 
> > Yeah, I suspect we don't want that.  Although it might seem simpler in
> > the spapr case, at least at first glance, I think it will cause us
> > problems later.  At the very least, it's likely to make it harder to
> > share code between the spapr and powernv case.  I think it will also
> > make for more confusion about exactly what things belong where.
> 
> I tend to agree. 
> 
> We need to clarify (a bit) what is in the XIVE interrupt controller 
> silicon, and how XIVE works. The XIVE device models for spapr and 
> powernv should be very close as the differences are small. KVM support 
> should be built on the spapr model.
> 
> There are 3 different sub-engines in the XIVE interrupt controller
> device :
> 
> * IVSE (XiveSource model)
> 
>   interrupt sources, which expose their PQ bits through ESB MMIO pages 
>   (there are different levels of support depending on HW revision) 
> 
>   The XIVE interrupt controller has a set of internal sources for 
>   IPIs and CAPI like interrupts.

Ok.  IIUC in hardware there's one of these in each PHB, plus maybe one
or two others.  Is that right?

> 
> * IVRE (No real model)
> 
>   in the middle, doing the routing of source event notification to
>   (cpu) targets. It relies on internal tables which are stored in 
>   the hypervisor/QEMU/KVM for the spapr machine and in the VM RAM 
>   for the powernv machine.

What does VM RAM mean in the powernv context?

>   Configuration updates of the XIVE tables are done through hcalls 
>   on spapr and with MMIOs on the IC regs on powernv. On the latter,
>   the changes are flushed back to the VM RAM.
> 
> * IVPE (XiveNVT)
> 
>   set of registers for interrupt management at the CPU level. Exposed
>   in a specific MMIO region called the TIMA.

Ok.

> The XIVE tables are :
> 
> * IVT
> 
>   associate an interrupt source number with an event queue. the data
>   to be pushed in the queue is stored there also.

Ok, so there would be one of these tables for each IVRE, with one
entry for each source managed by that IVSE, yes?

Do the XIVE IPIs have entries here, or do they bypass this?

> * EQDT:
> 
>   describes the queues in the OS RAM, also contains a set of flags,
>   a virtual target, etc.

So on real hardware this would be global, yes?  And it would be
consulted by the IVRE?

For guests, we'd expect one table per-guest?  How would those be
integrated with the host table?

> * VPDT:
> 
>   describe the virtual targets, which can have different natures,
>   a lpar, a cpu. This is for powernv, spapr does not have this 
>   concept.

Ok.  On hardware that would also be global and consulted by the IVRE,
yes?

Under PAPR, I'm guessing the concept is missing because it essentially
has a fixed contents: an entry for each vcpu and maybe one for the
lpar as a whole?

> So, the idea behind the sPAPRXive object is to model a XIVE interrupt
> controller device. It contains today :

Yeah, what a "XIVE interrupt controller device" is not really clear to
me.  If it's something that is necessarily global, I think you'll be
better off making it a machine-interface rather than a distinct
object.

> 
>  - an internal source block for all interrupts : IPIs and virtual 
>    device interrupts. In the IRQ number space, the IPIs are below
>    4096 and the device interrupts above, which keeps compatibility 
>    with XICS. This is important to be able to change interrupt mode.
> 
>    PowerNV has different source blocks, like for P8.
> 
>  - a routing engine, which is limited to the IVT. This is a shortcut 
>    and it might be better to introduce a specific object. Anyhow, this 
>    is a state to capture.

Ok.  It sounds like this is roughly the equivalent of the XICSFabric,
and likewise would probably be better handled by an interface on the
machine rather than a distinct object.  But I'm not clear enough to be
certain of that yet.

>    In the current version I am working on, the XiveFabric interface is
>    more complex :
> 
> 	typedef struct XiveFabricClass {
> 	    InterfaceClass parent;
> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);

This does an IVT lookup, I take it?

> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);

This one a VPDT lookup, yes?

> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);

And this one an EQDT lookup?

> 	} XiveFabricClass;
> 
>    It helps in making the routing algorithm independent of the model. 
>    I hope to make powernv converge and use it.
> 
>  - a set of MMIOs for the TIMA. They model the presenter engine. 
>    current_cpu is used to retrieve the NVT object, which holds the 
>    registers for interrupt management.  

Right.  Now the TIMA is local to a target/server not an EQ, right?

I guess we need at least one of these per-vcpu.  Do we also need an
lpar-global, or other special ones?

> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
> table. But we could add one under the XIVE device model.

I'm not sure of the distinction you're drawing between the NVT and the
XIVE device model.

[snip]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-12  8:36                         ` Cédric Le Goater
@ 2018-04-16  4:29                           ` David Gibson
  2018-04-19 13:01                             ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-16  4:29 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

[-- Attachment #1: Type: text/plain, Size: 3087 bytes --]

On Thu, Apr 12, 2018 at 10:36:10AM +0200, Cédric Le Goater wrote:
> On 04/12/2018 07:16 AM, David Gibson wrote:
> > On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote:
> >> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
> >>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
> >>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
> >>>>> Migration is a problem. We will need both backend QEMU objects to be 
> >>>>> available anyhow if we want to migrate. So we are back to the current 
> >>>>> solution creating both QEMU objects but we can try to defer some of the 
> >>>>> KVM inits and create the KVM device on demand at CAS time.
> >>>>
> >>>> Do we have a way to migrate a piece of info from the machine *first*
> >>>> that indicates what type of XICS/XIVE to instantiate ?
> >>>
> >>> Nope.  qemu migration doesn't work like that.  Yes, it should, and
> >>> everyone knows it, but changing it is a really long term project.
> >>
> >> Well, we have a problem then. It looks like Qemu broken migration is
> >> fundamentally incompatible with PAPR and CAS design...
> > 
> > Hrm, the fit is very clunky certainly, but I think we can make it work.
> > 
> >> I know we don't migrate the configuration, that's not exactly what I
> >> had in mind tho... Can we have some piece of *data* from the machine be
> >> migrated first, and use it on the target to reconfigure the interrupt
> >> controller before the stream arrives ?
> > 
> > Sorta.. maybe.. but it would probably get really ugly if we don't
> > preserve the usual way object lifetimes work.
> > 
> >> Otherwise, we have indeed not much choice but the horrible wart of
> >> creating both interrupt controllers with only one "active".
> > 
> > I really think this is the way to go, warts and all.
> > 
> 
> Yes ... KVM makes it a little uglier. 
> 
> A KVM_DEVICE_DESTROY device is needed to clean up the VM and a
> DISABLE_CAP to disconnect the vcpu from the current KVM XIVE/XICS
> device. I have used an extra arg on ENABLE_CAP for the moment.    
> 
> At the QEMU level, we need to connect/reconnect at reset time to
> handle possible changes in CAS, and at post_load.

Right.

> Destroying the MemoryRegion is a bit problematic, I have not
> found a common layout compatible with both the emulated mode 
> (std IO regions) and the KVM mode (ram device regions)

That sounds awkward, I guess we'll discuss the details of this later.


Btw, a secondary advantage of starting off with XIVE only under a
different machine type is that we can declare that one not to be
migration stable until we're ready.  So we can merge something that's
ok to experiment with, but reserve the right to incompatibly change
the migration format until we're confident we're ready and can merge
it into the "stable" machine type.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-16  4:29                           ` David Gibson
@ 2018-04-19 13:01                             ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-19 13:01 UTC (permalink / raw)
  To: David Gibson; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel, Greg Kurz

On 04/16/2018 06:29 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 10:36:10AM +0200, Cédric Le Goater wrote:
>> On 04/12/2018 07:16 AM, David Gibson wrote:
>>> On Mon, Feb 12, 2018 at 09:55:17AM +1100, Benjamin Herrenschmidt wrote:
>>>> On Sun, 2018-02-11 at 19:08 +1100, David Gibson wrote:
>>>>> On Thu, Jan 18, 2018 at 08:27:52AM +1100, Benjamin Herrenschmidt wrote:
>>>>>> On Wed, 2018-01-17 at 15:39 +0100, Cédric Le Goater wrote:
>>>>>>> Migration is a problem. We will need both backend QEMU objects to be 
>>>>>>> available anyhow if we want to migrate. So we are back to the current 
>>>>>>> solution creating both QEMU objects but we can try to defer some of the 
>>>>>>> KVM inits and create the KVM device on demand at CAS time.
>>>>>>
>>>>>> Do we have a way to migrate a piece of info from the machine *first*
>>>>>> that indicates what type of XICS/XIVE to instantiate ?
>>>>>
>>>>> Nope.  qemu migration doesn't work like that.  Yes, it should, and
>>>>> everyone knows it, but changing it is a really long term project.
>>>>
>>>> Well, we have a problem then. It looks like Qemu broken migration is
>>>> fundamentally incompatible with PAPR and CAS design...
>>>
>>> Hrm, the fit is very clunky certainly, but I think we can make it work.
>>>
>>>> I know we don't migrate the configuration, that's not exactly what I
>>>> had in mind tho... Can we have some piece of *data* from the machine be
>>>> migrated first, and use it on the target to reconfigure the interrupt
>>>> controller before the stream arrives ?
>>>
>>> Sorta.. maybe.. but it would probably get really ugly if we don't
>>> preserve the usual way object lifetimes work.
>>>
>>>> Otherwise, we have indeed not much choice but the horrible wart of
>>>> creating both interrupt controllers with only one "active".
>>>
>>> I really think this is the way to go, warts and all.
>>>
>>
>> Yes ... KVM makes it a little uglier. 
>>
>> A KVM_DEVICE_DESTROY device is needed to clean up the VM and a
>> DISABLE_CAP to disconnect the vcpu from the current KVM XIVE/XICS
>> device. I have used an extra arg on ENABLE_CAP for the moment.    
>>
>> At the QEMU level, we need to connect/reconnect at reset time to
>> handle possible changes in CAS, and at post_load.
> 
> Right.

v3 uses the same 'reset' function to set up the interrupts at machine
reset time and at post_load. Keep that in mind. Maybe we should have
distinct routines.
 
> 
>> Destroying the MemoryRegion is a bit problematic, I have not
>> found a common layout compatible with both the emulated mode 
>> (std IO regions) and the KVM mode (ram device regions)
> 
> That sounds awkward, I guess we'll discuss the details of this later.

I have fixed that in v3.

> Btw, a secondary advantage of starting off with XIVE only under a
> different machine type is that we can declare that one not to be
> migration stable until we're ready.  So we can merge something that's
> ok to experiment with, but reserve the right to incompatibly change
> the migration format until we're confident we're ready and can merge
> it into the "stable" machine type.

Resetting KVM devices is not the most complex feature to support but
it seems to have an impact on migration. So we might need the
non-migratable machine type to fix that. Let's see whether v3 is
welcomed or not.
  
 
Thanks,

C.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-16  4:26           ` David Gibson
@ 2018-04-19 17:40             ` Cédric Le Goater
  2018-04-26  5:36               ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-19 17:40 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 04/16/2018 06:26 AM, David Gibson wrote:
> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>>>>>> With the POWER9 processor comes a new interrupt controller called
>>>>>> XIVE. It is composed of three sub-engines :
>>>>>>
>>>>>>   - Interrupt Virtualization Source Engine (IVSE). These are in PHBs,
>>>>>>     in the main controller for the IPIS and in the PSI host
>>>>>>     bridge. They are configured to feed the IVRE with events.
>>>>>>
>>>>>>   - Interrupt Virtualization Routing Engine (IVRE). Their job is to
>>>>>>     match an event source with a Notification Virtualization Target
>>>>>>     (NVT), a priority and an Event Queue (EQ) to determine if a
>>>>>>     Virtual Processor can handle the event.
>>>>>>
>>>>>>   - Interrupt Virtualization Presentation Engine (IVPE). It maintains
>>>>>>     the interrupt state of each hardware thread and present the
>>>>>>     notification as an external exception.
>>>>>>
>>>>>> Each of the engines uses a set of internal tables to redirect
>>>>>> exceptions from event sources to CPU threads. The first table we
>>>>>> introduce is the Interrupt Virtualization Entry (IVE) table, part of
>>>>>> the virtualization engine in charge of routing events. It associates
>>>>>> event sources (IRQ numbers) to event queues which will forward, or
>>>>>> not, the event notification to the presentation controller.
>>>>>>
>>>>>> The XIVE model is designed to make use of the full range of the IRQ
>>>>>> number space and does not use an offset like the XICS mode does.
>>>>>> Hence, the IVE table is directly indexed by the IRQ number.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>
>>>>> As you've suggested yourself, I think we might need to more
>>>>> explicitly model the different components of the XIVE system.  As part
>>>>> of that, I think you need to be clearer in this base skeleton about
>>>>> exactly what component your XIVE object represents.
>>>
>>> Sorry it's been so long since I looked at these.
>>
>> That's fine. I have been working on a XIVE device model for the PowerNV
>> machine and KVM support for the pseries. I have a better understanding
>> of the overall picture.
>>
>> The patchset has not changed much so we can still discuss on this
>> basis without me flooding the mailing list.
>>
>>>> ok. The base skeleton is the IVRE, the central engine handling 
>>>> the routing. 
>>>>
>>>>> If the answer is "the overall thing" 
>>>>
>>>> Yes, it is more or less that currently. 
>>>>
>>>> The sPAPRXive object models the source engine and the routing 
>>>> engine in one object.
>>>
>>> Yeah, I suspect we don't want that.  Although it might seem simpler in
>>> the spapr case, at least at first glance, I think it will cause us
>>> problems later.  At the very least, it's likely to make it harder to
>>> share code between the spapr and powernv case.  I think it will also
>>> make for more confusion about exactly what things belong where.
>>
>> I tend to agree. 
>>
>> We need to clarify (a bit) what is in the XIVE interrupt controller 
>> silicon, and how XIVE works. The XIVE device models for spapr and 
>> powernv should be very close as the differences are small. KVM support 
>> should be built on the spapr model.
>>
>> There are 3 different sub-engines in the XIVE interrupt controller
>> device :
>>
>> * IVSE (XiveSource model)
>>
>>   interrupt sources, which expose their PQ bits through ESB MMIO pages 
>>   (there are different levels of support depending on HW revision) 
>>
>>   The XIVE interrupt controller has a set of internal sources for 
>>   IPIs and CAPI like interrupts.
> 
> Ok.  IIUC in hardware there's one of these in each PHB, 

yes

> plus maybe one or two others.  Is that right?

yes. PSI for instance on PowerNV. I have this device as a first
xive source on PowerNV.
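
For readers following along, the PQ bits of a source behave as a small state
machine. The sketch below is only an illustration: the SK_* names are
hypothetical and the bit encoding (P as bit 1, Q as bit 0) is an assumption,
not something taken from the patchset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encoding of the two ESB bits: P is bit 1, Q is bit 0. */
#define SK_ESB_RESET   0x0u  /* P=0 Q=0: idle, next trigger is forwarded */
#define SK_ESB_OFF     0x1u  /* P=0 Q=1: source masked */
#define SK_ESB_PENDING 0x2u  /* P=1 Q=0: event forwarded, EOI expected */
#define SK_ESB_QUEUED  0x3u  /* P=1 Q=1: another event arrived meanwhile */

/* Returns true when the event must be forwarded to the routing engine. */
static bool sk_esb_trigger(uint8_t *pq)
{
    assert(pq != NULL);
    switch (*pq & 0x3u) {
    case SK_ESB_RESET:
        *pq = SK_ESB_PENDING;
        return true;
    case SK_ESB_PENDING:
    case SK_ESB_QUEUED:
        /* Coalesce: remember that at least one more event occurred. */
        *pq = SK_ESB_QUEUED;
        return false;
    default: /* SK_ESB_OFF: masked, the event is dropped */
        *pq = SK_ESB_OFF;
        return false;
    }
}

/* Returns true when a coalesced event must be forwarded again. */
static bool sk_esb_eoi(uint8_t *pq)
{
    assert(pq != NULL);
    switch (*pq & 0x3u) {
    case SK_ESB_RESET:
    case SK_ESB_PENDING:
        *pq = SK_ESB_RESET;
        return false;
    case SK_ESB_QUEUED:
        *pq = SK_ESB_PENDING;
        return true;
    default: /* SK_ESB_OFF stays masked */
        *pq = SK_ESB_OFF;
        return false;
    }
}
```

The point of the second bit is coalescing: any number of triggers between two
EOIs result in a single replayed notification.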

>>
>> * IVRE (No real model)
>>
>>   in the middle, doing the routing of source event notification to
>>   (cpu) targets. It relies on internal tables which are stored in 
>>   the hypervisor/QEMU/KVM for the spapr machine and in the VM RAM 
>>   for the powernv machine.
> 
> What does VM RAM mean in the powernv context?

The PowerNV is indeed not a VM. So I meant the RAM of the QEMU PowerNV 
machine. skiboot does the allocation and the HW setup using a set of 
IC registers exposed as MMIOs. 

>>   Configuration updates of the XIVE tables are done through hcalls 
>>   on spapr and with MMIOs on the IC regs on powernv. On the latter,
>>   the changes are flushed back to the VM RAM.
>>
>> * IVPE (XiveNVT)
>>
>>   set of registers for interrupt management at the CPU level. Exposed
>>   in a specific MMIO region called the TIMA.
> 
> Ok.
> 
>> The XIVE tables are :
>>
>> * IVT
>>
>>   associate an interrupt source number with an event queue. the data
>>   to be pushed in the queue is stored there also.
> 
> Ok, so there would be one of these tables for each IVRE, 

yes. one for each XIVE interrupt controller. That is one per processor 
or socket.

> with one entry for each source managed by that IVSE, yes?

yes. The table is simply indexed by the interrupt number in the
global IRQ number space of the machine.

> Do the XIVE IPIs have entries here, or do they bypass this?

no. The IPIs have entries also in this table.
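
To make the direct indexing concrete, here is a minimal sketch. All names and
sizes are hypothetical (they are not the types of the patchset); the only
things taken from the discussion are that the table is indexed by the IRQ
number without an offset and that IPIs, which sit below 4096, have entries
like any other source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and sizes, for illustration only. */
#define SKETCH_NR_IRQS   8192u  /* size of the machine IRQ number space (assumed) */
#define SKETCH_IRQ_BASE  4096u  /* IPIs below, device interrupts at or above */

typedef struct SketchIVE {
    bool     valid;    /* entry has been configured */
    bool     masked;   /* events are dropped while set */
    uint32_t eq_idx;   /* event queue the source is routed to */
    uint32_t eq_data;  /* data pushed in the queue on an event */
} SketchIVE;

typedef struct SketchIVT {
    SketchIVE ivt[SKETCH_NR_IRQS];
} SketchIVT;

/* Direct lookup: the IRQ number (LISN) is the table index, no offset. */
static SketchIVE *sketch_get_ive(SketchIVT *t, uint32_t lisn)
{
    assert(t != NULL);
    return lisn < SKETCH_NR_IRQS ? &t->ivt[lisn] : NULL;
}

/* IPIs are ordinary entries; only their position in the space differs. */
static bool sketch_irq_is_ipi(uint32_t lisn)
{
    return lisn < SKETCH_IRQ_BASE;
}
```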

>> * EQDT:
>>
>>   describes the queues in the OS RAM, also contains a set of flags,
>>   a virtual target, etc.
> 
> So on real hardware this would be global, yes?  And it would be
> consulted by the IVRE?

yes. Exactly. The XIVE routing routine :

	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706

gives a good overview of the usage of the tables.

> For guests, we'd expect one table per-guest?  

yes but only in emulation mode. 

> How would those be integrated with the host table?

Under KVM, this is handled by the host table (setup done in skiboot) 
and we are only interested in the state of the EQs for migration. 
This state is set  with the H_INT_SET_QUEUE_CONFIG hcall, followed
by an OPAL call and then a HW update. It defines the EQ page in which
to push event notifications for the server/priority pair.
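
As an illustration of what "pushing in the EQ page" amounts to, here is a
rough sketch of a queue of 32-bit entries with a generation bit. The names are
hypothetical, the entry layout (generation in the top bit, 31 bits of data)
and the initial generation value of 1 are assumptions, and endianness is
ignored for simplicity.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical event queue state; field names are illustrative. */
typedef struct SketchEQ {
    uint32_t *page;      /* queue page provided by the OS */
    uint32_t  nentries;  /* number of 32-bit entries, a power of two */
    uint32_t  index;     /* next entry to write */
    uint32_t  gen;       /* generation bit, flipped on each wrap (assumed to start at 1) */
} SketchEQ;

/* Write one event entry and advance the producer index. */
static void sketch_eq_push(SketchEQ *eq, uint32_t data)
{
    /* The wrap below relies on a power-of-two queue size. */
    assert(eq->nentries && (eq->nentries & (eq->nentries - 1)) == 0);

    /* Top bit carries the generation, the low 31 bits carry the data. */
    eq->page[eq->index] = (eq->gen << 31) | (data & 0x7fffffffu);

    eq->index = (eq->index + 1) & (eq->nentries - 1);
    if (eq->index == 0) {
        eq->gen ^= 1u;   /* lets the consumer tell new entries from stale ones */
    }
}
```

With this layout the index and generation are the EQ state that would have to
be captured for migration, alongside the queue page address and size.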

>> * VPDT:
>>
>>   describe the virtual targets, which can have different natures,
>>   a lpar, a cpu. This is for powernv, spapr does not have this 
>>   concept.
> 
> Ok.  On hardware that would also be global and consulted by the IVRE,
> yes?

yes. 

> Under PAPR, I'm guessing the concept is missing because it essentially
> has a fixed contents: an entry for each vcpu 

yes.

> and maybe one for the lpar as a whole?

That would be more a host concept. But, yes, it exists in XIVE.
 
>> So, the idea behind the sPAPRXive object is to model a XIVE interrupt
>> controller device. It contains today :
> 
> Yeah, what a "XIVE interrupt controller device" is not really clear to
> me.  If it's something that is necessarily global, I think you'll be
> better off making it a machine-interface rather than a distinct
> object.

hmm, OK. We do need a XiveSource object (like in the XICS) and an IVE 
table. reshuffling is not a big problem. But then, we also have the
associated KVM device which is very much like the QEMU emulated device.  

>>  - an internal source block for all interrupts : IPIs and virtual 
>>    device interrupts. In the IRQ number space, the IPIs are below
>>    4096 and the device interrupts above, which keeps compatibility 
>>    with XICS. This is important to be able to change interrupt mode.
>>
>>    PowerNV has different source blocks, like for P8.
>>
>>  - a routing engine, which is limited to the IVT. This is a shortcut 
>>    and it might be better to introduce a specific object. Anyhow, this 
>>    is a state to capture.
> 
> Ok.  It sounds like this is roughly the equivalent of the XICSFabric,
> and likewise would probably be better handled by an interface on the
> machine 

yes indeed. it is the case in v3. PowerNV isn't quite in sync with
this concept but it is getting close. 

> rather than a distinct object.  But I'm not clear enough to be
> certain of that yet.

but we need to put the IVT somewhere.

>>    In the current version I am working on, the XiveFabric interface is
>>    more complex :
>>
>> 	typedef struct XiveFabricClass {
>> 	    InterfaceClass parent;
>> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> 
> This does an IVT lookup, I take it?

yes. It is an interface for the underlying storage, which is different
in sPAPR and PowerNV. The goal is to make the routing generic. 

>> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> 
> This one a VPDT lookup, yes?

yes.

>> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
> 
> And this one an EQDT lookup?

yes.

>> 	} XiveFabricClass;
>>
>>    It helps in making the routing algorithm independent of the model. 
>>    I hope to make powernv converge and use it.
>>
>>  - a set of MMIOs for the TIMA. They model the presenter engine. 
>>    current_cpu is used to retrieve the NVT object, which holds the 
>>    registers for interrupt management.  
> 
> Right.  Now the TIMA is local to a target/server not an EQ, right?

The TIMA is the MMIO giving access to the registers which are per CPU. 
The EQ are for routing. They are under the CPU object because it is 
convenient.
 
> I guess we need at least one of these per-vcpu.  

yes.

> Do we also need an lpar-global, or other special ones?

That would be for the host. AFAICT KVM does not use such special
VPs. 

>> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
>> table. But we could add one under the XIVE device model.
> 
> I'm not sure of the distinction you're drawing between the NVT and the
> XIVE device model.

we could add a new table under the XIVE interrupt device model
sPAPRXive to store the EQs and index them like skiboot does.
But it seems unnecessary to me as we can use the object below
'cpu->intc', which is the XiveNVT object.
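
A rough sketch of how the routing can stay generic when the EQs live under the
per-CPU object. Everything here is hypothetical: the Sk* types, the fixed set
of 8 EQs per CPU, the eq_idx = server * 8 + priority encoding and the
"pending" bitmap are assumptions made for the example. Only the three lookup
callbacks mirror the XiveFabric interface quoted above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NR_PRIO 8u    /* assumed: one EQ per priority and per CPU */
#define SK_NR_IRQS 16u
#define SK_NR_CPUS 2u

typedef struct SkIVE { bool valid, masked; uint32_t eq_idx, eq_data; } SkIVE;
typedef struct SkEQ  { uint32_t last_data, count; } SkEQ;
typedef struct SkNVT { uint8_t pending; SkEQ eqt[SK_NR_PRIO]; } SkNVT;

typedef struct SkFabric SkFabric;
struct SkFabric {
    /* Lookups the machine provides; the router only uses these. */
    SkIVE *(*get_ive)(SkFabric *xf, uint32_t lisn);
    SkNVT *(*get_nvt)(SkFabric *xf, uint32_t server);
    SkEQ  *(*get_eq)(SkFabric *xf, uint32_t eq_idx);
    /* Backing storage of this particular (sPAPR-like) machine. */
    SkIVE ivt[SK_NR_IRQS];
    SkNVT nvt[SK_NR_CPUS];
};

static SkIVE *sk_get_ive(SkFabric *xf, uint32_t lisn)
{
    return lisn < SK_NR_IRQS ? &xf->ivt[lisn] : NULL;
}

static SkNVT *sk_get_nvt(SkFabric *xf, uint32_t server)
{
    return server < SK_NR_CPUS ? &xf->nvt[server] : NULL;
}

/* No separate EQDT: the EQ is found under the NVT of the target CPU. */
static SkEQ *sk_get_eq(SkFabric *xf, uint32_t eq_idx)
{
    SkNVT *nvt = xf->get_nvt(xf, eq_idx / SK_NR_PRIO);
    return nvt ? &nvt->eqt[eq_idx % SK_NR_PRIO] : NULL;
}

static void sk_fabric_init(SkFabric *xf)
{
    assert(xf != NULL);
    xf->get_ive = sk_get_ive;
    xf->get_nvt = sk_get_nvt;
    xf->get_eq  = sk_get_eq;
}

/* Generic routing: source number -> IVE -> EQ -> NVT. */
static bool sk_route(SkFabric *xf, uint32_t lisn)
{
    SkIVE *ive = xf->get_ive(xf, lisn);
    if (!ive || !ive->valid || ive->masked) {
        return false;
    }
    SkEQ  *eq  = xf->get_eq(xf, ive->eq_idx);
    SkNVT *nvt = xf->get_nvt(xf, ive->eq_idx / SK_NR_PRIO);
    if (!eq || !nvt) {
        return false;
    }
    eq->last_data = ive->eq_data;   /* stands in for the queue push */
    eq->count++;
    nvt->pending |= (uint8_t)(1u << (ive->eq_idx % SK_NR_PRIO));
    return true;
}
```

A machine that keeps a real EQDT would only have to replace the get_eq
callback; sk_route itself would not change.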

C.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-19 17:40             ` Cédric Le Goater
@ 2018-04-26  5:36               ` David Gibson
  2018-04-26  8:17                 ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-04-26  5:36 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

[-- Attachment #1: Type: text/plain, Size: 5335 bytes --]

On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
> On 04/16/2018 06:26 AM, David Gibson wrote:
> > On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> >> On 04/12/2018 07:07 AM, David Gibson wrote:
> >>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >>>> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
[snip]
> >> The XIVE tables are :
> >>
> >> * IVT
> >>
> >>   associate an interrupt source number with an event queue. the data
> >>   to be pushed in the queue is stored there also.
> > 
> > Ok, so there would be one of these tables for each IVRE, 
> 
> yes. one for each XIVE interrupt controller. That is one per processor 
> or socket.

Ah.. so there can be more than one in a multi-socket system.

> > with one entry for each source managed by that IVSE, yes?
> 
> yes. The table is simply indexed by the interrupt number in the
> global IRQ number space of the machine.

How does that work on a multi-chip machine?  Does each chip just have
a table for a slice of the global irq number space?

> > Do the XIVE IPIs have entries here, or do they bypass this?
> 
> no. The IPIs have entries also in this table.
> 
> >> * EQDT:
> >>
> >>   describes the queues in the OS RAM, also contains a set of flags,
> >>   a virtual target, etc.
> > 
> > So on real hardware this would be global, yes?  And it would be
> > consulted by the IVRE?
> 
> yes. Exactly. The XIVE routing routine :
> 
> 	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
> 
> gives a good overview of the usage of the tables.
> 
> > For guests, we'd expect one table per-guest?  
> 
> yes but only in emulation mode. 

I'm not sure what you mean by this.

> > How would those be integrated with the host table?
> 
> Under KVM, this is handled by the host table (setup done in skiboot) 
> and we are only interested in the state of the EQs for migration.

This doesn't make sense to me; the guest is able to alter the IVT
entries, so that configuration must be migrated somehow.

> This state is set  with the H_INT_SET_QUEUE_CONFIG hcall,

"This state" here meaning IVT entries?

> followed
> by an OPAL call and then a HW update. It defines the EQ page in which
> to push event notifications for the server/priority pair.
> 
> >> * VPDT:
> >>
> >>   describe the virtual targets, which can have different natures,
> >>   a lpar, a cpu. This is for powernv, spapr does not have this 
> >>   concept.
> > 
> > Ok.  On hardware that would also be global and consulted by the IVRE,
> > yes?
> 
> yes.

Except.. is it actually global, or is there one per-chip/socket?

[snip]
> >>    In the current version I am working on, the XiveFabric interface is
> >>    more complex :
> >>
> >> 	typedef struct XiveFabricClass {
> >> 	    InterfaceClass parent;
> >> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> > 
> > This does an IVT lookup, I take it?
> 
> yes. It is an interface for the underlying storage, which is different
> in sPAPR and PowerNV. The goal is to make the routing generic.

Right.  So, yes, we definitely want a method *somewhere* to do an IVT
lookup.  I'm not entirely sure where it belongs yet.

> 
> >> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> > 
> > This one a VPDT lookup, yes?
> 
> yes.
> 
> >> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
> > 
> > And this one an EQDT lookup?
> 
> yes.
> 
> >> 	} XiveFabricClass;
> >>
> >>    It helps in making the routing algorithm independent of the model. 
> >>    I hope to make powernv converge and use it.
> >>
> >>  - a set of MMIOs for the TIMA. They model the presenter engine. 
> >>    current_cpu is used to retrieve the NVT object, which holds the 
> >>    registers for interrupt management.  
> > 
> > Right.  Now the TIMA is local to a target/server not an EQ, right?
> 
> The TIMA is the MMIO giving access to the registers which are per CPU. 
> The EQ are for routing. They are under the CPU object because it is 
> convenient.
>  
> > I guess we need at least one of these per-vcpu.  
> 
> yes.
> 
> > Do we also need an lpar-global, or other special ones?
> 
> That would be for the host. AFAICT KVM does not use such special
> VPs.

Um.. "does not use".. don't we get to decide that?

> >> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
> >> table. But we could add one under the XIVE device model.
> > 
> > I'm not sure of the distinction you're drawing between the NVT and the
> > XIVE device mode.
> 
> we could add a new table under the XIVE interrupt device model 
> sPAPRXive to store the EQs and indexed them like skiboot does. 
> But it seems unnecessary to me as we can use the object below 
> 'cpu->intc', which is the XiveNVT object.  

So, basically assuming a fixed set of EQs (one per priority?) per CPU
for a PAPR guest?  That makes sense (assuming PAPR doesn't provide
guest interfaces to ask for something else).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-26  5:36               ` David Gibson
@ 2018-04-26  8:17                 ` Cédric Le Goater
  2018-05-03  2:29                   ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2018-04-26  8:17 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 04/26/2018 07:36 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> [snip]
>>>> The XIVE tables are :
>>>>
>>>> * IVT
>>>>
>>>>   associate an interrupt source number with an event queue. the data
>>>>   to be pushed in the queue is stored there also.
>>>
>>> Ok, so there would be one of these tables for each IVRE, 
>>
>> yes. one for each XIVE interrupt controller. That is one per processor 
>> or socket.
> 
> Ah.. so there can be more than one in a multi-socket system.
>
>>> with one entry for each source managed by that IVSE, yes?
>>
>> yes. The table is simply indexed by the interrupt number in the
>> global IRQ number space of the machine.
> 
> How does that work on a multi-chip machine?  Does each chip just have
> a table for a slice of the global irq number space?

yes. IRQ Allocation is done relative to the chip, each chip having 
a range depending on its block id. XIVE has a concept of block,
which is used in skiboot in a one-to-one relationship with the chip.

>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>
>> no. The IPIs have entries also in this table.
>>
>>>> * EQDT:
>>>>
>>>>   describes the queues in the OS RAM, also contains a set of flags,
>>>>   a virtual target, etc.
>>>
>>> So on real hardware this would be global, yes?  And it would be
>>> consulted by the IVRE?
>>
>> yes. Exactly. The XIVE routing routine :
>>
>> 	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>
>> gives a good overview of the usage of the tables.
>>
>>> For guests, we'd expect one table per-guest?  
>>
>> yes but only in emulation mode. 
> 
> I'm not sure what you mean by this.

I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall 
table allocated in OPAL for the system. 

 
>>> How would those be integrated with the host table?
>>
>> Under KVM, this is handled by the host table (setup done in skiboot) 
>> and we are only interested in the state of the EQs for migration.
> 
> This doesn't make sense to me; the guest is able to alter the IVT
> entries, so that configuration must be migrated somehow.

yes. The IVE needs to be migrated. We use get/set KVM ioctls to save 
and restore the value which is cached in the KVM irq state struct 
(server, prio, eq data). no OPAL calls are needed though.
 
>> This state is set  with the H_INT_SET_QUEUE_CONFIG hcall,
> 
> "This state" here meaning IVT entries?

no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a 
server/priority couple. That is where the event queue data is pushed. 

H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
and the eq data to be pushed in case of an event.
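
To make the split between the two hcalls concrete, here is a minimal
sketch of the logical state each one touches. All names (SketchIVE,
SketchEQD, the sketch_* helpers) are hypothetical and the arguments are
simplified; this is neither the PAPR hcall signature nor the QEMU code.
The queue push also ignores details of the real hardware format.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IVT entry: where a source is targeted and what it pushes. */
typedef struct SketchIVE {
    uint32_t server;    /* target server */
    uint8_t  prio;      /* target priority */
    uint32_t eq_data;   /* data pushed in the queue on an event */
} SketchIVE;

/* Hypothetical EQ descriptor for one server/priority couple. */
typedef struct SketchEQD {
    uint32_t *qpage;      /* OS page receiving the event data */
    uint32_t  nr_entries; /* number of 32-bit slots in the page */
    uint32_t  qindex;     /* next slot to write */
} SketchEQD;

/* Logical effect of H_INT_SET_SOURCE_CONFIG: update the IVT entry. */
static void sketch_set_source_config(SketchIVE *ive, uint32_t server,
                                     uint8_t prio, uint32_t eq_data)
{
    ive->server = server;
    ive->prio = prio;
    ive->eq_data = eq_data;
}

/* Logical effect of H_INT_SET_QUEUE_CONFIG: update the EQ descriptor. */
static void sketch_set_queue_config(SketchEQD *eq, uint32_t *qpage,
                                    uint32_t nr_entries)
{
    eq->qpage = qpage;
    eq->nr_entries = nr_entries;
    eq->qindex = 0;
}

/* On an event, the IVE's eq_data lands in the configured queue page. */
static void sketch_eq_push(SketchEQD *eq, uint32_t data)
{
    assert(eq->qpage && eq->nr_entries > 0);
    eq->qpage[eq->qindex] = data;
    eq->qindex = (eq->qindex + 1) % eq->nr_entries;
}
```

Both pieces of state are guest-configurable, so in this model both would
have to be captured for migration.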
 
>> followed
>> by an OPAL call and then a HW update. It defines the EQ page in which
>> to push event notification for the couple server/priority. 
>>
>>>> * VPDT:
>>>>
>>>>   describe the virtual targets, which can have different natures,
>>>>   a lpar, a cpu. This is for powernv, spapr does not have this 
>>>>   concept.
>>>
>>> Ok  On hardware that would also be global and consulted by the IVRE,
>>> yes?
>>
>> yes.
> 
> Except.. is it actually global, or is there one per-chip/socket?

There is a global VP allocator splitting the ids depending on the
block/chip, but, to be honest, I have not dug into the details.

> [snip]
>>>>    In the current version I am working on, the XiveFabric interface is
>>>>    more complex :
>>>>
>>>> 	typedef struct XiveFabricClass {
>>>> 	    InterfaceClass parent;
>>>> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>
>>> This does an IVT lookup, I take it?
>>
>> yes. It is an interface for the underlying storage, which is different
>> in sPAPR and PowerNV. The goal is to make the routing generic.
> 
> Right.  So, yes, we definitely want a method *somewhere* to do an IVT
> lookup.  I'm not entirely sure where it belongs yet.

Me either. I have stuffed the XiveFabric with all the abstraction 
needed for the moment. 

I am starting to think that there should be an interface to forward 
events and another one to route them. The router being a special case 
of the forwarder, the last one. The "simple" devices, like PSI, should 
only be forwarders for the sources they own but the interrupt controllers 
should be forwarders (they have sources) and also routers.

>>>> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>
>>> This one a VPDT lookup, yes?
>>
>> yes.
>>
>>>> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>>
>>> And this one an EQDT lookup?
>>
>> yes.
>>
>>>> 	} XiveFabricClass;
>>>>
>>>>    It helps in making the routing algorithm independent of the model. 
>>>>    I hope to make powernv converge and use it.
>>>>
>>>>  - a set of MMIOs for the TIMA. They model the presenter engine. 
>>>>    current_cpu is used to retrieve the NVT object, which holds the 
>>>>    registers for interrupt management.  
>>>
>>> Right.  Now the TIMA is local to a target/server not an EQ, right?
>>
>> The TIMA is the MMIO giving access to the registers which are per CPU. 
>> The EQ are for routing. They are under the CPU object because it is 
>> convenient.
>>  
>>> I guess we need at least one of these per-vcpu.  
>>
>> yes.
>>
>>> Do we also need an lpar-global, or other special ones?
>>
>> That would be for the host. AFAICT KVM does not use such special
>> VPs.
> 
> Um.. "does not use".. don't we get to decide that?

Well, that part of the specs is still a little obscure to me and
I am not sure it will fit very well in the Linux/KVM model. It should
be hidden from the guest anyway and can come in later.

>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
>>>> table. But we could add one under the XIVE device model.
>>>
>>> I'm not sure of the distinction you're drawing between the NVT and the
>>> XIVE device mode.
>>
>> we could add a new table under the XIVE interrupt device model 
>> sPAPRXive to store the EQs and indexed them like skiboot does. 
>> But it seems unnecessary to me as we can use the object below 
>> 'cpu->intc', which is the XiveNVT object.  
> 
> So, basically assuming a fixed set of EQs (one per priority?)

yes. It's easier to capture the state and dump information from
the monitor.

> per CPU for a PAPR guest?  

yes, that's how it works.

> That makes sense (assuming PAPR doesn't provide guest interfaces to 
> ask for something else).

Yes. All hcalls take prio/server parameters and the reserved prio range 
for the platform is in the device tree. 0xFF is a special case to reset 
targeting. 
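
A rough sketch of that layout, with the EQs kept under the per-CPU
object and addressed by the prio/server pair the hcalls pass. The names
are hypothetical, and the count of 8 priorities is an assumption of this
sketch rather than something stated in this thread.

```c
#include <stddef.h>
#include <stdint.h>

#define SKETCH_NR_PRIORITIES 8   /* assumption: priorities 0..7 */

/* Hypothetical EQ state; only the queue page address is modelled. */
typedef struct SketchEQ {
    uint64_t qpage;
} SketchEQ;

/* Hypothetical per-CPU object holding a fixed set of EQs. */
typedef struct SketchNVT {
    SketchEQ eqs[SKETCH_NR_PRIORITIES];
} SketchNVT;

/*
 * Look up the EQ for a server/priority couple. Returns NULL for an
 * unknown server or an out-of-range priority, which includes the 0xFF
 * value used to reset targeting.
 */
static SketchEQ *sketch_get_eq(SketchNVT *nvts, size_t nr_servers,
                               uint32_t server, uint8_t prio)
{
    if (server >= nr_servers || prio >= SKETCH_NR_PRIORITIES) {
        return NULL;
    }
    return &nvts[server].eqs[prio];
}
```

With a fixed array there is no separate EQ index to allocate, which is
what makes dumping the state from the monitor straightforward.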

Thanks,

C.

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-04-26  8:17                 ` Cédric Le Goater
@ 2018-05-03  2:29                   ` David Gibson
  2018-05-03  8:43                     ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-05-03  2:29 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
> On 04/26/2018 07:36 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
> >> On 04/16/2018 06:26 AM, David Gibson wrote:
> >>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> >>>> On 04/12/2018 07:07 AM, David Gibson wrote:
> >>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> > [snip]
> >>>> The XIVE tables are :
> >>>>
> >>>> * IVT
> >>>>
> >>>>   associate an interrupt source number with an event queue. the data
> >>>>   to be pushed in the queue is stored there also.
> >>>
> >>> Ok, so there would be one of these tables for each IVRE, 
> >>
> >> yes. one for each XIVE interrupt controller. That is one per processor 
> >> or socket.
> > 
> > Ah.. so there can be more than one in a multi-socket system.
> >
> >>> with one entry for each source managed by that IVSE, yes?
> >>
> >> yes. The table is simply indexed by the interrupt number in the
> >> global IRQ number space of the machine.
> > 
> > How does that work on a multi-chip machine?  Does each chip just have
> > a table for a slice of the global irq number space?
> 
> yes. IRQ Allocation is done relative to the chip, each chip having 
> a range depending on its block id. XIVE has a concept of block,
> which is used in skiboot in a one-to-one relationship with the chip.

Ok.  I'm assuming this block id forms the high(ish) bits of the global
irq number, yes?

> >>> Do the XIVE IPIs have entries here, or do they bypass this?
> >>
> >> no. The IPIs have entries also in this table.
> >>
> >>>> * EQDT:
> >>>>
> >>>>   describes the queues in the OS RAM, also contains a set of flags,
> >>>>   a virtual target, etc.
> >>>
> >>> So on real hardware this would be global, yes?  And it would be
> >>> consulted by the IVRE?
> >>
> >> yes. Exactly. The XIVE routing routine :
> >>
> >> 	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
> >>
> >> gives a good overview of the usage of the tables.
> >>
> >>> For guests, we'd expect one table per-guest?  
> >>
> >> yes but only in emulation mode. 
> > 
> > I'm not sure what you mean by this.
> 
> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall 
> table allocated in OPAL for the system. 

Right.. I'm thinking of this from the point of view of the guest
and/or qemu, rather than from the implementation.  Even if the actual
storage of the entries is distributed across the host's global table,
we still logically have a table per guest, right?

> >>> How would those be integrated with the host table?
> >>
> >> Under KVM, this is handled by the host table (setup done in skiboot) 
> >> and we are only interested in the state of the EQs for migration.
> > 
> > This doesn't make sense to me; the guest is able to alter the IVT
> > entries, so that configuration must be migrated somehow.
> 
> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save 
> and restore the value which is cached in the KVM irq state struct 
> (server, prio, eq data). no OPAL calls are needed though.

Right.  Again, at this stage I don't particularly care what the
backend details are - whether the host calls OPAL or whatever.  I'm
more concerned with the logical model.

> >> This state is set  with the H_INT_SET_QUEUE_CONFIG hcall,
> > 
> > "This state" here meaning IVT entries?
> 
> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a 
> server/priority couple. That is where the event queue data is
> pushed.

Ah.  Doesn't that mean the guest *does* effectively have an EQD table,
updated by this call?  We'd need to migrate that data as well, and
it's not part of the IVT, right?

> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
> and the eq data to be pushed in case of an event.

Ok - that's the IVT entries, yes?

>  
> >> followed
> >> by an OPAL call and then a HW update. It defines the EQ page in which
> >> to push event notification for the couple server/priority. 
> >>
> >>>> * VPDT:
> >>>>
> >>>>   describe the virtual targets, which can have different natures,
> >>>>   a lpar, a cpu. This is for powernv, spapr does not have this 
> >>>>   concept.
> >>>
> >>> Ok  On hardware that would also be global and consulted by the IVRE,
> >>> yes?
> >>
> >> yes.
> > 
> > Except.. is it actually global, or is there one per-chip/socket?
> 
> There is a global VP allocator splitting the ids depending on the
> block/chip, but, to be honest, I have not dug in the details
> 
> > [snip]
> >>>>    In the current version I am working on, the XiveFabric interface is
> >>>>    more complex :
> >>>>
> >>>> 	typedef struct XiveFabricClass {
> >>>> 	    InterfaceClass parent;
> >>>> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> >>>
> >>> This does an IVT lookup, I take it?
> >>
> >> yes. It is an interface for the underlying storage, which is different
> >> in sPAPR and PowerNV. The goal is to make the routing generic.
> > 
> > Right.  So, yes, we definitely want a method *somewhere* to do an IVT
> > lookup.  I'm not entirely sure where it belongs yet.
> 
> Me either. I have stuffed the XiveFabric with all the abstraction 
> needed for the moment. 
> 
> I am starting to think that there should be an interface to forward 
> events and another one to route them. The router being a special case 
> of the forwarder, the last one. The "simple" devices, like PSI, should 
> only be forwarders for the sources they own but the interrupt controllers 
> should be forwarders (they have sources) and also routers.

I'm not really clear what you mean by "forward" here.

> 
> >>>> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> >>>
> >>> This one a VPDT lookup, yes?
> >>
> >> yes.
> >>
> >>>> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
> >>>
> >>> And this one an EQDT lookup?
> >>
> >> yes.
> >>
> >>>> 	} XiveFabricClass;
> >>>>
> >>>>    It helps in making the routing algorithm independent of the model. 
> >>>>    I hope to make powernv converge and use it.
> >>>>
> >>>>  - a set of MMIOs for the TIMA. They model the presenter engine. 
> >>>>    current_cpu is used to retrieve the NVT object, which holds the 
> >>>>    registers for interrupt management.  
> >>>
> >>> Right.  Now the TIMA is local to a target/server not an EQ, right?
> >>
> >> The TIMA is the MMIO giving access to the registers which are per CPU. 
> >> The EQ are for routing. They are under the CPU object because it is 
> >> convenient.
> >>  
> >>> I guess we need at least one of these per-vcpu.  
> >>
> >> yes.
> >>
> >>> Do we also need an lpar-global, or other special ones?
> >>
> >> That would be for the host. AFAICT KVM does not use such special
> >> VPs.
> > 
> > Um.. "does not use".. don't we get to decide that?
> 
> Well, that part in the specs is still a little obscure for me and 
> I am not sure it will fit very well in the Linux/KVM model. It should 
> be hidden to the guest anyway and can come in later.
> 
> >>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
> >>>> table. But we could add one under the XIVE device model.
> >>>
> >>> I'm not sure of the distinction you're drawing between the NVT and the
> >>> XIVE device mode.
> >>
> >> we could add a new table under the XIVE interrupt device model 
> >> sPAPRXive to store the EQs and indexed them like skiboot does. 
> >> But it seems unnecessary to me as we can use the object below 
> >> 'cpu->intc', which is the XiveNVT object.  
> > 
> > So, basically assuming a fixed set of EQs (one per priority?)
> 
> yes. It's easier to capture the state and dump information from
> the monitor.
> 
> > per CPU for a PAPR guest?  
> 
> yes, that's how it works.
> 
> > That makes sense (assuming PAPR doesn't provide guest interfaces to 
> > ask for something else).
> 
> Yes. All hcalls take prio/server parameters and the reserved prio range 
> for the platform is in the device tree. 0xFF is a special case to reset 
> targeting. 
> 
> Thanks,
> 
> C.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-05-03  2:29                   ` David Gibson
@ 2018-05-03  8:43                     ` Cédric Le Goater
  2018-05-04  6:35                       ` David Gibson
  0 siblings, 1 reply; 71+ messages in thread
From: Cédric Le Goater @ 2018-05-03  8:43 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 05/03/2018 04:29 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 07:36 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>>>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>>> [snip]
>>>>>> The XIVE tables are :
>>>>>>
>>>>>> * IVT
>>>>>>
>>>>>>   associate an interrupt source number with an event queue. the data
>>>>>>   to be pushed in the queue is stored there also.
>>>>>
>>>>> Ok, so there would be one of these tables for each IVRE, 
>>>>
>>>> yes. one for each XIVE interrupt controller. That is one per processor 
>>>> or socket.
>>>
>>> Ah.. so there can be more than one in a multi-socket system.
>>>
>>>>> with one entry for each source managed by that IVSE, yes?
>>>>
>>>> yes. The table is simply indexed by the interrupt number in the
>>>> global IRQ number space of the machine.
>>>
>>> How does that work on a multi-chip machine?  Does each chip just have
>>> a table for a slice of the global irq number space?
>>
>> yes. IRQ Allocation is done relative to the chip, each chip having 
>> a range depending on its block id. XIVE has a concept of block,
>> which is used in skiboot in a one-to-one relationship with the chip.
> 
> Ok.  I'm assuming this block id forms the high(ish) bits of the global
> irq number, yes?

yes. the 8 top bits are reserved, the next 4 bits are for the 
block id, 16 blocks for 16 socket/chips, and the 20 lower bits 
are for the ISN.
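
As an illustration of that layout, a small sketch of packing and
unpacking a 32-bit global IRQ number. The macro and function names are
made up for this example; only the bit widths come from the description
above.

```c
#include <stdint.h>

/* Hypothetical names; widths follow the 8/4/20 split described above. */
#define SKETCH_ISN_BITS    20
#define SKETCH_BLOCK_BITS  4
#define SKETCH_ISN_MASK    ((1u << SKETCH_ISN_BITS) - 1)
#define SKETCH_BLOCK_MASK  ((1u << SKETCH_BLOCK_BITS) - 1)

/* Build a global IRQ number; the 8 top (reserved) bits are left at zero. */
static inline uint32_t sketch_girq(uint32_t block, uint32_t isn)
{
    return ((block & SKETCH_BLOCK_MASK) << SKETCH_ISN_BITS) |
           (isn & SKETCH_ISN_MASK);
}

/* Extract the block id, i.e. the chip in skiboot's one-to-one mapping. */
static inline uint32_t sketch_girq_block(uint32_t girq)
{
    return (girq >> SKETCH_ISN_BITS) & SKETCH_BLOCK_MASK;
}

/* Extract the interrupt source number within the block. */
static inline uint32_t sketch_girq_isn(uint32_t girq)
{
    return girq & SKETCH_ISN_MASK;
}
```

So each chip's table covers the slice of the global space that carries
its block id in bits 20-23.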

>>>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>>>
>>>> no. The IPIs have entries also in this table.
>>>>
>>>>>> * EQDT:
>>>>>>
>>>>>>   describes the queues in the OS RAM, also contains a set of flags,
>>>>>>   a virtual target, etc.
>>>>>
>>>>> So on real hardware this would be global, yes?  And it would be
>>>>> consulted by the IVRE?
>>>>
>>>> yes. Exactly. The XIVE routing routine :
>>>>
>>>> 	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>>>
>>>> gives a good overview of the usage of the tables.
>>>>
>>>>> For guests, we'd expect one table per-guest?  
>>>>
>>>> yes but only in emulation mode. 
>>>
>>> I'm not sure what you mean by this.
>>
>> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall 
>> table allocated in OPAL for the system. 
> 
> Right.. I'm thinking of this from the point of view of the guest
> and/or qemu, rather than from the implementation.  Even if the actual
> storage of the entries is distributed across the host's global table,
> we still logically have a table per guest, right?

Yes. (the XiveSource object would be the table-per-guest and its 
counterpart in KVM: the source block)  

>>>>> How would those be integrated with the host table?
>>>>
>>>> Under KVM, this is handled by the host table (setup done in skiboot) 
>>>> and we are only interested in the state of the EQs for migration.
>>>
>>> This doesn't make sense to me; the guest is able to alter the IVT
>>> entries, so that configuration must be migrated somehow.
>>
>> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save 
>> and restore the value which is cached in the KVM irq state struct 
>> (server, prio, eq data). no OPAL calls are needed though.
> 
> Right.  Again, at this stage I don't particularly care what the
> backend details are - whether the host calls OPAL or whatever.  I'm
> more concerned with the logical model.

ok.

> 
>>>> This state is set  with the H_INT_SET_QUEUE_CONFIG hcall,
>>>
>>> "This state" here meaning IVT entries?
>>
>> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a 
>> server/priority couple. That is where the event queue data is
>> pushed.
> 
> Ah.  Doesn't that mean the guest *does* effectively have an EQD table,

Well, yes, it's under the hood, but the guest does not know anything
about the XIVE controller's internal structures (IVE, EQD, VPD) and tables.
Only OPAL does, in fact.

> updated by this call?  

it is indeed the purpose of H_INT_SET_QUEUE_CONFIG

> We'd need to migrate that data as well, 

yes we do and some fields require OPAL support.

> and it's not part of the IVT, right?

yes. The IVT only contains the EQ index, the server/priority tuple used 
for routing.

>> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
>> and the eq data to be pushed in case of an event.
> 
> Ok - that's the IVT entries, yes?

yes.


>>>> followed
>>>> by an OPAL call and then a HW update. It defines the EQ page in which
>>>> to push event notification for the couple server/priority. 
>>>>
>>>>>> * VPDT:
>>>>>>
>>>>>>   describe the virtual targets, which can have different natures,
>>>>>>   a lpar, a cpu. This is for powernv, spapr does not have this 
>>>>>>   concept.
>>>>>
>>>>> Ok  On hardware that would also be global and consulted by the IVRE,
>>>>> yes?
>>>>
>>>> yes.
>>>
>>> Except.. is it actually global, or is there one per-chip/socket?
>>
>> There is a global VP allocator splitting the ids depending on the
>> block/chip, but, to be honest, I have not dug in the details
>>
>>> [snip]
>>>>>>    In the current version I am working on, the XiveFabric interface is
>>>>>>    more complex :
>>>>>>
>>>>>> 	typedef struct XiveFabricClass {
>>>>>> 	    InterfaceClass parent;
>>>>>> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>>>
>>>>> This does an IVT lookup, I take it?
>>>>
>>>> yes. It is an interface for the underlying storage, which is different
>>>> in sPAPR and PowerNV. The goal is to make the routing generic.
>>>
>>> Right.  So, yes, we definitely want a method *somewhere* to do an IVT
>>> lookup.  I'm not entirely sure where it belongs yet.
>>
>> Me either. I have stuffed the XiveFabric with all the abstraction 
>> needed for the moment. 
>>
>> I am starting to think that there should be an interface to forward 
>> events and another one to route them. The router being a special case 
>> of the forwarder, the last one. The "simple" devices, like PSI, should 
>> only be forwarders for the sources they own but the interrupt controllers 
>> should be forwarders (they have sources) and also routers.
> 
> I'm not really clear what you mean by "forward" here.

When an interrupt source is triggered, a notification event can
be generated and forwarded to the XIVE router if the transition
algo (depending on the PQ bits) lets it through. A forward is
a simple load of the IRQ number at a specific MMIO address defined
by the main IC.

For QEMU sPAPR, it's a function call but for QEMU powernv, it's a
load.
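
A minimal sketch of the "lets it through" decision on a trigger, using
the 2-bit PQ state of the source. The state names, their encoding and
the transition table are assumptions of this sketch, not taken from
this thread, and the function name is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed 2-bit PQ encoding, with P as the high bit. */
enum {
    SKETCH_ESB_RESET   = 0x0, /* idle: the next trigger is forwarded */
    SKETCH_ESB_OFF     = 0x1, /* source masked: triggers are dropped */
    SKETCH_ESB_PENDING = 0x2, /* one event forwarded, not yet EOI'd */
    SKETCH_ESB_QUEUED  = 0x3, /* a further trigger latched while pending */
};

/*
 * Apply a trigger to the PQ bits. Returns true when the event should be
 * forwarded to the router, false when it is absorbed by the state machine.
 */
static bool sketch_esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case SKETCH_ESB_RESET:
        *pq = SKETCH_ESB_PENDING;
        return true;
    case SKETCH_ESB_PENDING:
    case SKETCH_ESB_QUEUED:
        *pq = SKETCH_ESB_QUEUED;
        return false;
    default: /* SKETCH_ESB_OFF */
        return false;
    }
}
```

In this picture a forwarder only needs the source state and a way to
hand the IRQ number on; the table lookups stay in the router.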

C.


>>>>>> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>>>
>>>>> This one a VPDT lookup, yes?
>>>>
>>>> yes.
>>>>
>>>>>> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>>>>
>>>>> And this one an EQDT lookup?
>>>>
>>>> yes.
>>>>
>>>>>> 	} XiveFabricClass;
>>>>>>
>>>>>>    It helps in making the routing algorithm independent of the model. 
>>>>>>    I hope to make powernv converge and use it.
>>>>>>
>>>>>>  - a set of MMIOs for the TIMA. They model the presenter engine. 
>>>>>>    current_cpu is used to retrieve the NVT object, which holds the 
>>>>>>    registers for interrupt management.  
>>>>>
>>>>> Right.  Now the TIMA is local to a target/server not an EQ, right?
>>>>
>>>> The TIMA is the MMIO giving access to the registers which are per CPU. 
>>>> The EQ are for routing. They are under the CPU object because it is 
>>>> convenient.
>>>>  
>>>>> I guess we need at least one of these per-vcpu.  
>>>>
>>>> yes.
>>>>
>>>>> Do we also need an lpar-global, or other special ones?
>>>>
>>>> That would be for the host. AFAICT KVM does not use such special
>>>> VPs.
>>>
>>> Um.. "does not use".. don't we get to decide that?
>>
>> Well, that part in the specs is still a little obscure for me and 
>> I am not sure it will fit very well in the Linux/KVM model. It should 
>> be hidden to the guest anyway and can come in later.
>>
>>>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
>>>>>> table. But we could add one under the XIVE device model.
>>>>>
>>>>> I'm not sure of the distinction you're drawing between the NVT and the
>>>>> XIVE device mode.
>>>>
>>>> we could add a new table under the XIVE interrupt device model 
>>>> sPAPRXive to store the EQs and indexed them like skiboot does. 
>>>> But it seems unnecessary to me as we can use the object below 
>>>> 'cpu->intc', which is the XiveNVT object.  
>>>
>>> So, basically assuming a fixed set of EQs (one per priority?)
>>
>> yes. It's easier to capture the state and dump information from
>> the monitor.
>>
>>> per CPU for a PAPR guest?  
>>
>> yes, that's how it works.
>>
>>> That makes sense (assuming PAPR doesn't provide guest interfaces to 
>>> ask for something else).
>>
>> Yes. All hcalls take prio/server parameters and the reserved prio range 
>> for the platform is in the device tree. 0xFF is a special case to reset 
>> targeting. 
>>
>> Thanks,
>>
>> C.
>>
> 

* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-05-03  8:43                     ` Cédric Le Goater
@ 2018-05-04  6:35                       ` David Gibson
  2018-05-04 15:35                         ` Cédric Le Goater
  0 siblings, 1 reply; 71+ messages in thread
From: David Gibson @ 2018-05-04  6:35 UTC (permalink / raw)
  To: Cédric Le Goater
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On Thu, May 03, 2018 at 10:43:47AM +0200, Cédric Le Goater wrote:
> On 05/03/2018 04:29 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 07:36 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
> >>>> On 04/16/2018 06:26 AM, David Gibson wrote:
> >>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
> >>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
> >>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
> >>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
> >>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
> >>> [snip]
> >>>>>> The XIVE tables are :
> >>>>>>
> >>>>>> * IVT
> >>>>>>
> >>>>>>   associate an interrupt source number with an event queue. the data
> >>>>>>   to be pushed in the queue is stored there also.
> >>>>>
> >>>>> Ok, so there would be one of these tables for each IVRE, 
> >>>>
> >>>> yes. one for each XIVE interrupt controller. That is one per processor 
> >>>> or socket.
> >>>
> >>> Ah.. so there can be more than one in a multi-socket system.
> >>>
> >>>>> with one entry for each source managed by that IVSE, yes?
> >>>>
> >>>> yes. The table is simply indexed by the interrupt number in the
> >>>> global IRQ number space of the machine.
> >>>
> >>> How does that work on a multi-chip machine?  Does each chip just have
> >>> a table for a slice of the global irq number space?
> >>
> >> yes. IRQ Allocation is done relative to the chip, each chip having 
> >> a range depending on its block id. XIVE has a concept of block,
> >> which is used in skiboot in a one-to-one relationship with the chip.
> > 
> > Ok.  I'm assuming this block id forms the high(ish) bits of the global
> > irq number, yes?
> 
> yes. the 8 top bits are reserved, the next 4 bits are for the 
> block id, 16 blocks for 16 socket/chips, and the 20 lower bits 
> are for the ISN.

Ok.

> >>>>> Do the XIVE IPIs have entries here, or do they bypass this?
> >>>>
> >>>> no. The IPIs have entries also in this table.
> >>>>
> >>>>>> * EQDT:
> >>>>>>
> >>>>>>   describes the queues in the OS RAM, also contains a set of flags,
> >>>>>>   a virtual target, etc.
> >>>>>
> >>>>> So on real hardware this would be global, yes?  And it would be
> >>>>> consulted by the IVRE?
> >>>>
> >>>> yes. Exactly. The XIVE routing routine :
> >>>>
> >>>> 	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
> >>>>
> >>>> gives a good overview of the usage of the tables.
> >>>>
> >>>>> For guests, we'd expect one table per-guest?  
> >>>>
> >>>> yes but only in emulation mode. 
> >>>
> >>> I'm not sure what you mean by this.
> >>
> >> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall 
> >> table allocated in OPAL for the system. 
> > 
> > Right.. I'm thinking of this from the point of view of the guest
> > and/or qemu, rather than from the implementation.  Even if the actual
> > storage of the entries is distributed across the host's global table,
> > we still logically have a table per guest, right?
> 
> Yes. (the XiveSource object would be the table-per-guest and its 
> counterpart in KVM: the source block)

Uh.. I'm talking about the IVT (or a slice of it) here, so this would
be a XiveRouter, not a XiveSource owning it.

> 
> >>>>> How would those be integrated with the host table?
> >>>>
> >>>> Under KVM, this is handled by the host table (setup done in skiboot) 
> >>>> and we are only interested in the state of the EQs for migration.
> >>>
> >>> This doesn't make sense to me; the guest is able to alter the IVT
> >>> entries, so that configuration must be migrated somehow.
> >>
> >> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save 
> >> and restore the value which is cached in the KVM irq state struct 
> >> (server, prio, eq data). no OPAL calls are needed though.
> > 
> > Right.  Again, at this stage I don't particularly care what the
> > backend details are - whether the host calls OPAL or whatever.  I'm
> > more concerned with the logical model.
> 
> ok.
> 
> > 
> >>>> This state is set  with the H_INT_SET_QUEUE_CONFIG hcall,
> >>>
> >>> "This state" here meaning IVT entries?
> >>
> >> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a 
> >> server/priority couple. That is where the event queue data is
> >> pushed.
> > 
> > Ah.  Doesn't that mean the guest *does* effectively have an EQD table,
> 
> > well, yes, it's under the hood. but the guest does not know anything 
> about the Xive controller internal structures, IVE, EQD, VPD and tables. 
> Only OPAL does in fact.

Right, it's under the hood.  But then so is the IVT (and the TCE
tables and the HPT for that matter).  So we're probably going to have
a (*get_eqd) method somewhere that looks up in guest RAM or in an
external table depending.
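
To make that concrete, here is a rough sketch of what such a method
could look like. All the type and function names below are hypothetical
(they are not from the patchset), and the descriptor is reduced to a
few fields. It shows only the "external table" flavour; a
guest-RAM-backed implementation would plug into the same hook.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified EQ descriptor. */
typedef struct XiveEQD {
    uint64_t qpage;     /* address of the OS event queue page */
    uint32_t qsize;     /* queue size */
    uint32_t server;    /* target */
    uint8_t  priority;
} XiveEQD;

typedef struct XiveRouter XiveRouter;

/* The lookup hook: the router does not care where the EQDs live. */
typedef struct XiveRouterOps {
    XiveEQD *(*get_eqd)(XiveRouter *xrtr, uint32_t eq_idx);
} XiveRouterOps;

struct XiveRouter {
    const XiveRouterOps *ops;
    XiveEQD *eqdt;      /* backing table for the table-based backend */
    uint32_t nr_eqs;
};

/* Table-based backend: bounds-checked index into an array. */
static XiveEQD *table_get_eqd(XiveRouter *xrtr, uint32_t eq_idx)
{
    return eq_idx < xrtr->nr_eqs ? &xrtr->eqdt[eq_idx] : NULL;
}

static const XiveRouterOps table_ops = { .get_eqd = table_get_eqd };

/* What the generic routing code would call. */
static XiveEQD *xive_router_get_eqd(XiveRouter *xrtr, uint32_t eq_idx)
{
    return xrtr->ops->get_eqd(xrtr, eq_idx);
}
```

The routing code only ever goes through xive_router_get_eqd(), so the
storage question can be settled per machine type.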

> > updated by this call?  
> 
> it is indeed the purpose of H_INT_SET_QUEUE_CONFIG
> 
> > We'd need to migrate that data as well, 
> 
> yes we do and some fields require OPAL support.
> 
> > and it's not part of the IVT, right?
> 
> yes. The IVT only contains the EQ index, the server/priority tuple used 
> for routing.
> 
> >> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
> >> and the eq data to be pushed in case of an event.
> > 
> > Ok - that's the IVT entries, yes?
> 
> yes.
> 
> 
> >>>> followed
> >>>> by an OPAL call and then a HW update. It defines the EQ page in which
> >>>> to push event notification for the couple server/priority. 
> >>>>
> >>>>>> * VPDT:
> >>>>>>
> >>>>>>   describes the virtual targets, which can have different natures,
> >>>>>>   a lpar, a cpu. This is for powernv, spapr does not have this 
> >>>>>>   concept.
> >>>>>
> >>>>> Ok.  On hardware that would also be global and consulted by the IVRE,
> >>>>> yes?
> >>>>
> >>>> yes.
> >>>
> >>> Except.. is it actually global, or is there one per-chip/socket?
> >>
> >> There is a global VP allocator splitting the ids depending on the
> >> block/chip, but, to be honest, I have not dug into the details.
> >>
> >>> [snip]
> >>>>>>    In the current version I am working on, the XiveFabric interface is
> >>>>>>    more complex :
> >>>>>>
> >>>>>> 	typedef struct XiveFabricClass {
> >>>>>> 	    InterfaceClass parent;
> >>>>>> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> >>>>>
> >>>>> This does an IVT lookup, I take it?
> >>>>
> >>>> yes. It is an interface for the underlying storage, which is different
> >>>> in sPAPR and PowerNV. The goal is to make the routing generic.
> >>>
> >>> Right.  So, yes, we definitely want a method *somewhere* to do an IVT
> >>> lookup.  I'm not entirely sure where it belongs yet.
> >>
> >> Me either. I have stuffed the XiveFabric with all the abstraction 
> >> needed for the moment. 
> >>
> >> I am starting to think that there should be an interface to forward 
> >> events and another one to route them. The router being a special case 
> >> of the forwarder, the last one. The "simple" devices, like PSI, should 
> >> only be forwarders for the sources they own but the interrupt controllers 
> >> should be forwarders (they have sources) and also routers.
> > 
> > I'm not really clear what you mean by "forward" here.
> 
> When an interrupt source is triggered, a notification event can
> be generated and forwarded to the XIVE router if the transition 
> algo (depending on the PQ bit) lets it through. A forward is 
> a simple load of the IRQ number at a specific MMIO address defined
> by the main IC.
> 
> For QEMU sPAPR, it's a function call but for QEMU powernv, it's a
> load.
> 
> C.
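
The "transition algo" mentioned above can be sketched as a small state
machine on the two PQ bits. The state encodings and the function names
here are assumptions on my side (they are not spelled out in this
thread); the point is just that a trigger is only forwarded when the
source was idle, and that an EOI replays a trigger that was coalesced
in the meantime.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding of the 2-bit ESB state (P is the high bit). */
enum {
    ESB_RESET   = 0x0,  /* idle, next trigger is forwarded */
    ESB_OFF     = 0x1,  /* source masked */
    ESB_PENDING = 0x2,  /* an event was forwarded, waiting for EOI */
    ESB_QUEUED  = 0x3,  /* another trigger arrived while pending */
};

/* Returns true if the event should be forwarded to the router. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce, nothing is forwarded */
        return false;
    default:                /* ESB_OFF: masked, drop the trigger */
        return false;
    }
}

/* Returns true if a coalesced event must be forwarded again. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;  /* replay the queued trigger */
        return true;
    default:                /* ESB_OFF: stays masked */
        return false;
    }
}
```

Whether the forward itself is a function call (sPAPR) or an MMIO load
(powernv) is then a separate question from this state handling.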
> 
> 
> >>>>>> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> >>>>>
> >>>>> This one a VPDT lookup, yes?
> >>>>
> >>>> yes.
> >>>>
> >>>>>> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
> >>>>>
> >>>>> And this one an EQDT lookup?
> >>>>
> >>>> yes.
> >>>>
> >>>>>> 	} XiveFabricClass;
> >>>>>>
> >>>>>>    It helps in making the routing algorithm independent of the model. 
> >>>>>>    I hope to make powernv converge and use it.
> >>>>>>
> >>>>>>  - a set of MMIOs for the TIMA. They model the presenter engine. 
> >>>>>>    current_cpu is used to retrieve the NVT object, which holds the 
> >>>>>>    registers for interrupt management.  
> >>>>>
> >>>>> Right.  Now the TIMA is local to a target/server not an EQ, right?
> >>>>
> >>>> The TIMA is the MMIO giving access to the registers which are per CPU. 
> >>>> The EQ are for routing. They are under the CPU object because it is 
> >>>> convenient.
> >>>>  
> >>>>> I guess we need at least one of these per-vcpu.  
> >>>>
> >>>> yes.
> >>>>
> >>>>> Do we also need an lpar-global, or other special ones?
> >>>>
> >>>> That would be for the host. AFAICT KVM does not use such special
> >>>> VPs.
> >>>
> >>> Um.. "does not use".. don't we get to decide that?
> >>
> >> Well, that part in the specs is still a little obscure for me and 
> >> I am not sure it will fit very well in the Linux/KVM model. It should 
> >> be hidden to the guest anyway and can come in later.
> >>
> >>>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
> >>>>>> table. But we could add one under the XIVE device model.
> >>>>>
> >>>>> I'm not sure of the distinction you're drawing between the NVT and the
> >>>>> XIVE device model.
> >>>>
> >>>> we could add a new table under the XIVE interrupt device model 
> >>>> sPAPRXive to store the EQs and index them like skiboot does. 
> >>>> But it seems unnecessary to me as we can use the object below 
> >>>> 'cpu->intc', which is the XiveNVT object.  
> >>>
> >>> So, basically assuming a fixed set of EQs (one per priority?)
> >>
> >> yes. It's easier to capture the state and dump information from
> >> the monitor.
> >>
> >>> per CPU for a PAPR guest?  
> >>
> >> yes, that's how it works.
> >>
> >>> That makes sense (assuming PAPR doesn't provide guest interfaces to 
> >>> ask for something else).
> >>
> >> Yes. All hcalls take prio/server parameters and the reserved prio range 
> >> for the platform is in the device tree. 0xFF is a special case to reset 
> >> targeting. 
> >>
> >> Thanks,
> >>
> >> C.
> >>
> > 
> 
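
As an illustration of the "fixed set of EQs per CPU" model discussed
above, a minimal sketch follows. The struct layouts and names are
hypothetical, and the count of 8 priorities is my assumption; the
thread only says there is one EQ per priority and that 0xFF is the
special value used to reset targeting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XIVE_NR_PRIORITIES 8     /* assumption: priorities 0..7 */
#define XIVE_PRIO_RESET    0xFF  /* from the thread: resets targeting */

/* Hypothetical, simplified event queue state. */
typedef struct XiveEQ {
    uint64_t qpage;   /* OS page set by H_INT_SET_QUEUE_CONFIG */
    uint32_t qsize;
} XiveEQ;

/* One NVT per vCPU, holding a fixed array of EQs, one per priority. */
typedef struct XiveNVT {
    XiveEQ eqt[XIVE_NR_PRIORITIES];
} XiveNVT;

/* Look up the EQ for a server/priority couple.  Returns NULL for an
 * unknown server or an out-of-range priority, which includes the
 * XIVE_PRIO_RESET value: the caller has to handle that case itself. */
static XiveEQ *nvt_get_eq(XiveNVT *nvts, uint32_t nr_servers,
                          uint32_t server, uint8_t prio)
{
    if (server >= nr_servers || prio >= XIVE_NR_PRIORITIES) {
        return NULL;
    }
    return &nvts[server].eqt[prio];
}
```

With this layout no separate EQDT is needed: the server/priority
parameters of the hcalls are enough to find the queue.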

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller
  2018-05-04  6:35                       ` David Gibson
@ 2018-05-04 15:35                         ` Cédric Le Goater
  0 siblings, 0 replies; 71+ messages in thread
From: Cédric Le Goater @ 2018-05-04 15:35 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Greg Kurz

On 05/04/2018 08:35 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 10:43:47AM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 04:29 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 10:17:13AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 07:36 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 07:40:09PM +0200, Cédric Le Goater wrote:
>>>>>> On 04/16/2018 06:26 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 12, 2018 at 10:18:11AM +0200, Cédric Le Goater wrote:
>>>>>>>> On 04/12/2018 07:07 AM, David Gibson wrote:
>>>>>>>>> On Wed, Dec 20, 2017 at 08:38:41AM +0100, Cédric Le Goater wrote:
>>>>>>>>>> On 12/20/2017 06:09 AM, David Gibson wrote:
>>>>>>>>>>> On Sat, Dec 09, 2017 at 09:43:21AM +0100, Cédric Le Goater wrote:
>>>>> [snip]
>>>>>>>> The XIVE tables are :
>>>>>>>>
>>>>>>>> * IVT
>>>>>>>>
>>>>>>>>   associate an interrupt source number with an event queue. the data
>>>>>>>>   to be pushed in the queue is stored there also.
>>>>>>>
>>>>>>> Ok, so there would be one of these tables for each IVRE, 
>>>>>>
>>>>>> yes. one for each XIVE interrupt controller. That is one per processor 
>>>>>> or socket.
>>>>>
>>>>> Ah.. so there can be more than one in a multi-socket system.
>>>>>>> with one entry for each source managed by that IVSE, yes?
>>>>>>
>>>>>> yes. The table is simply indexed by the interrupt number in the
>>>>>> global IRQ number space of the machine.
>>>>>
>>>>> How does that work on a multi-chip machine?  Does each chip just have
>>>>> a table for a slice of the global irq number space?
>>>>
>>>> yes. IRQ Allocation is done relative to the chip, each chip having 
>>>> a range depending on its block id. XIVE has a concept of block,
>>>> which is used in skiboot in a one-to-one relationship with the chip.
>>>
>>> Ok.  I'm assuming this block id forms the high(ish) bits of the global
>>> irq number, yes?
>>
>> yes. the 8 top bits are reserved, the next 4 bits are for the 
>> block id, 16 blocks for 16 socket/chips, and the 20 lower bits 
>> are for the ISN.
> 
> Ok.
> 
>>>>>>> Do the XIVE IPIs have entries here, or do they bypass this?
>>>>>>
>>>>>> no. The IPIs have entries also in this table.
>>>>>>
>>>>>>>> * EQDT:
>>>>>>>>
>>>>>>>>   describes the queues in the OS RAM, also contains a set of flags,
>>>>>>>>   a virtual target, etc.
>>>>>>>
>>>>>>> So on real hardware this would be global, yes?  And it would be
>>>>>>> consulted by the IVRE?
>>>>>>
>>>>>> yes. Exactly. The XIVE routing routine :
>>>>>>
>>>>>> 	https://github.com/legoater/qemu/blob/xive/hw/intc/xive.c#L706
>>>>>>
>>>>>> gives a good overview of the usage of the tables.
>>>>>>
>>>>>>> For guests, we'd expect one table per-guest?  
>>>>>>
>>>>>> yes but only in emulation mode. 
>>>>>
>>>>> I'm not sure what you mean by this.
>>>>
>>>> I meant the sPAPR QEMU emulation mode. Linux/KVM relies on the overall 
>>>> table allocated in OPAL for the system. 
>>>
>>> Right.. I'm thinking of this from the point of view of the guest
>>> and/or qemu, rather than from the implementation.  Even if the actual
>>> storage of the entries is distributed across the host's global table,
>>> we still logically have a table per guest, right?
>>
>> Yes. (the XiveSource object would be the table-per-guest and its 
>> counterpart in KVM: the source block)
> 
> Uh.. I'm talking about the IVT (or a slice of it) here, so this would
> be a XiveRouter, not a XiveSource owning it.

yes. Sorry. sPAPR has a unique XiveSource and a corresponding IVT.

>>>>>>> How would those be integrated with the host table?
>>>>>>
>>>>>> Under KVM, this is handled by the host table (setup done in skiboot) 
>>>>>> and we are only interested in the state of the EQs for migration.
>>>>>
>>>>> This doesn't make sense to me; the guest is able to alter the IVT
>>>>> entries, so that configuration must be migrated somehow.
>>>>
>>>> yes. The IVE needs to be migrated. We use get/set KVM ioctls to save 
>>>> and restore the value which is cached in the KVM irq state struct 
>>>> (server, prio, eq data). no OPAL calls are needed though.
>>>
>>> Right.  Again, at this stage I don't particularly care what the
>>> backend details are - whether the host calls OPAL or whatever.  I'm
>>> more concerned with the logical model.
>>
>> ok.
>>
>>>
>>>>>> This state is set  with the H_INT_SET_QUEUE_CONFIG hcall,
>>>>>
>>>>> "This state" here meaning IVT entries?
>>>>
>>>> no. The H_INT_SET_QUEUE_CONFIG sets the event queue OS page for a 
>>>> server/priority couple. That is where the event queue data is
>>>> pushed.
>>>
>>> Ah.  Doesn't that mean the guest *does* effectively have an EQD table,
>>
>> well, yes, it's under the hood. but the guest does not know anything 
>> about the Xive controller internal structures, IVE, EQD, VPD and tables. 
>> Only OPAL does in fact.
> 
> Right, it's under the hood.  But then so is the IVT (and the TCE
> tables and the HPT for that matter).  So we're probably going to have
> a (*get_eqd) method somewhere that looks up in guest RAM or in an
> external table depending.

yes. definitely. 

C.

>>> updated by this call?  
>>
>> it is indeed the purpose of H_INT_SET_QUEUE_CONFIG
>>
>>> We'd need to migrate that data as well, 
>>
>> yes we do and some fields require OPAL support.
>>
>>> and it's not part of the IVT, right?
>>
>> yes. The IVT only contains the EQ index, the server/priority tuple used 
>> for routing.
>>
>>>> H_INT_SET_SOURCE_CONFIG does the targeting : irq, server, priority,
>>>> and the eq data to be pushed in case of an event.
>>>
>>> Ok - that's the IVT entries, yes?
>>
>> yes.
>>
>>
>>>>>> followed
>>>>>> by an OPAL call and then a HW update. It defines the EQ page in which
>>>>>> to push event notification for the couple server/priority. 
>>>>>>
>>>>>>>> * VPDT:
>>>>>>>>
>>>>>>>>   describes the virtual targets, which can have different natures,
>>>>>>>>   a lpar, a cpu. This is for powernv, spapr does not have this 
>>>>>>>>   concept.
>>>>>>>
>>>>>>> Ok.  On hardware that would also be global and consulted by the IVRE,
>>>>>>> yes?
>>>>>>
>>>>>> yes.
>>>>>
>>>>> Except.. is it actually global, or is there one per-chip/socket?
>>>>
>>>> There is a global VP allocator splitting the ids depending on the
>>>> block/chip, but, to be honest, I have not dug into the details.
>>>>
>>>>> [snip]
>>>>>>>>    In the current version I am working on, the XiveFabric interface is
>>>>>>>>    more complex :
>>>>>>>>
>>>>>>>> 	typedef struct XiveFabricClass {
>>>>>>>> 	    InterfaceClass parent;
>>>>>>>> 	    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>>>>>
>>>>>>> This does an IVT lookup, I take it?
>>>>>>
>>>>>> yes. It is an interface for the underlying storage, which is different
>>>>>> in sPAPR and PowerNV. The goal is to make the routing generic.
>>>>>
>>>>> Right.  So, yes, we definitely want a method *somewhere* to do an IVT
>>>>> lookup.  I'm not entirely sure where it belongs yet.
>>>>
>>>> Me either. I have stuffed the XiveFabric with all the abstraction 
>>>> needed for the moment. 
>>>>
>>>> I am starting to think that there should be an interface to forward 
>>>> events and another one to route them. The router being a special case 
>>>> of the forwarder, the last one. The "simple" devices, like PSI, should 
>>>> only be forwarders for the sources they own but the interrupt controllers 
>>>> should be forwarders (they have sources) and also routers.
>>>
>>> I'm not really clear what you mean by "forward" here.
>>
>> When an interrupt source is triggered, a notification event can
>> be generated and forwarded to the XIVE router if the transition 
>> algo (depending on the PQ bit) lets it through. A forward is 
>> a simple load of the IRQ number at a specific MMIO address defined
>> by the main IC.
>>
>> For QEMU sPAPR, it's a function call but for QEMU powernv, it's a
>> load.
>>
>> C.
>>
>>
>>>>>>>> 	    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>>>>>
>>>>>>> This one a VPDT lookup, yes?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>> 	    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>>>>>>
>>>>>>> And this one an EQDT lookup?
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>>> 	} XiveFabricClass;
>>>>>>>>
>>>>>>>>    It helps in making the routing algorithm independent of the model. 
>>>>>>>>    I hope to make powernv converge and use it.
>>>>>>>>
>>>>>>>>  - a set of MMIOs for the TIMA. They model the presenter engine. 
>>>>>>>>    current_cpu is used to retrieve the NVT object, which holds the 
>>>>>>>>    registers for interrupt management.  
>>>>>>>
>>>>>>> Right.  Now the TIMA is local to a target/server not an EQ, right?
>>>>>>
>>>>>> The TIMA is the MMIO giving access to the registers which are per CPU. 
>>>>>> The EQ are for routing. They are under the CPU object because it is 
>>>>>> convenient.
>>>>>>  
>>>>>>> I guess we need at least one of these per-vcpu.  
>>>>>>
>>>>>> yes.
>>>>>>
>>>>>>> Do we also need an lpar-global, or other special ones?
>>>>>>
>>>>>> That would be for the host. AFAICT KVM does not use such special
>>>>>> VPs.
>>>>>
>>>>> Um.. "does not use".. don't we get to decide that?
>>>>
>>>> Well, that part in the specs is still a little obscure for me and 
>>>> I am not sure it will fit very well in the Linux/KVM model. It should 
>>>> be hidden to the guest anyway and can come in later.
>>>>
>>>>>>>> The EQs are stored under the NVT. This saves us an unnecessary EQDT 
>>>>>>>> table. But we could add one under the XIVE device model.
>>>>>>>
>>>>>>> I'm not sure of the distinction you're drawing between the NVT and the
>>>>>>> XIVE device model.
>>>>>>
>>>>>> we could add a new table under the XIVE interrupt device model 
>>>>>> sPAPRXive to store the EQs and index them like skiboot does. 
>>>>>> But it seems unnecessary to me as we can use the object below 
>>>>>> 'cpu->intc', which is the XiveNVT object.  
>>>>>
>>>>> So, basically assuming a fixed set of EQs (one per priority?)
>>>>
>>>> yes. It's easier to capture the state and dump information from
>>>> the monitor.
>>>>
>>>>> per CPU for a PAPR guest?  
>>>>
>>>> yes, that's how it works.
>>>>
>>>>> That makes sense (assuming PAPR doesn't provide guest interfaces to 
>>>>> ask for something else).
>>>>
>>>> Yes. All hcalls take prio/server parameters and the reserved prio range 
>>>> for the platform is in the device tree. 0xFF is a special case to reset 
>>>> targeting. 
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>>
>>>
>>
> 


end of thread, other threads:[~2018-05-04 15:35 UTC | newest]

Thread overview: 71+ messages
2017-12-09  8:43 [Qemu-devel] [PATCH v2 00/19] spapr: Guest exploitation of the XIVE interrupt controller (POWER9) Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 01/19] dma-helpers: add a return value to store helpers Cédric Le Goater
2017-12-19  4:46   ` David Gibson
2017-12-19  6:43     ` Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 02/19] spapr: introduce a skeleton for the XIVE interrupt controller Cédric Le Goater
2017-12-09 14:06   ` Cédric Le Goater
2017-12-20  5:09   ` David Gibson
2017-12-20  7:38     ` Cédric Le Goater
2018-04-12  5:07       ` David Gibson
2018-04-12  8:18         ` Cédric Le Goater
2018-04-16  4:26           ` David Gibson
2018-04-19 17:40             ` Cédric Le Goater
2018-04-26  5:36               ` David Gibson
2018-04-26  8:17                 ` Cédric Le Goater
2018-05-03  2:29                   ` David Gibson
2018-05-03  8:43                     ` Cédric Le Goater
2018-05-04  6:35                       ` David Gibson
2018-05-04 15:35                         ` Cédric Le Goater
2017-12-21  0:12     ` Benjamin Herrenschmidt
2017-12-21  9:16       ` Cédric Le Goater
2017-12-21 10:09         ` Cédric Le Goater
2017-12-21 22:53         ` Benjamin Herrenschmidt
2018-01-17  9:18           ` Cédric Le Goater
2018-01-17 11:10             ` Benjamin Herrenschmidt
2018-01-17 14:39               ` Cédric Le Goater
2018-01-17 17:57                 ` Cédric Le Goater
2018-01-17 21:27                 ` Benjamin Herrenschmidt
2018-01-18 13:27                   ` Cédric Le Goater
2018-01-18 21:08                     ` Benjamin Herrenschmidt
2018-02-11  8:08                   ` David Gibson
2018-02-11 22:55                     ` Benjamin Herrenschmidt
2018-02-12  2:02                       ` Alexey Kardashevskiy
2018-02-12 12:20                         ` [Qemu-devel] [Qemu-ppc] " Andrea Bolognani
2018-02-12 14:40                           ` Benjamin Herrenschmidt
2018-02-13  1:11                             ` Alexey Kardashevskiy
2018-02-13  7:40                             ` Cédric Le Goater
2018-02-12  7:10                       ` [Qemu-devel] " Cédric Le Goater
2018-04-12  5:16                       ` David Gibson
2018-04-12  8:36                         ` Cédric Le Goater
2018-04-16  4:29                           ` David Gibson
2018-04-19 13:01                             ` Cédric Le Goater
2018-04-12  5:15                 ` David Gibson
2018-04-12  8:51                   ` Cédric Le Goater
2018-04-12  5:10             ` David Gibson
2018-04-12  8:41               ` Cédric Le Goater
2018-04-12  5:08       ` David Gibson
2018-04-12  8:28         ` Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 03/19] spapr: introduce the XIVE interrupt sources Cédric Le Goater
2017-12-14 15:24   ` Cédric Le Goater
2017-12-18  0:59     ` Benjamin Herrenschmidt
2017-12-19  6:37       ` Cédric Le Goater
2017-12-20  5:13         ` David Gibson
2017-12-20  5:22   ` David Gibson
2017-12-20  7:54     ` Cédric Le Goater
2017-12-20 18:08       ` Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 04/19] spapr: add support for the LSI " Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 05/19] spapr: introduce a XIVE interrupt presenter model Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 06/19] spapr: introduce the XIVE Event Queues Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 07/19] spapr: push the XIVE EQ data in OS event queue Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 08/19] spapr: notify the CPU when the XIVE interrupt priority is more privileged Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 09/19] spapr: add support for the SET_OS_PENDING command (XIVE) Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 10/19] spapr: introduce a 'xive_exploitation' boolean to enable XIVE Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 11/19] spapr: add a sPAPRXive object to the machine Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 12/19] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 13/19] spapr: add device tree support for the XIVE " Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 14/19] spapr: introduce a helper to map the XIVE memory regions Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 15/19] spapr: add XIVE support to spapr_qirq() Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 16/19] spapr: introduce a spapr_icp_create() helper Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 17/19] spapr: toggle the ICP depending on the selected interrupt mode Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 18/19] spapr: add support to dump XIVE information Cédric Le Goater
2017-12-09  8:43 ` [Qemu-devel] [PATCH v2 19/19] spapr: advertise XIVE exploitation mode in CAS Cédric Le Goater
