* [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9)
@ 2018-04-19 12:42 Cédric Le Goater
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
                   ` (35 more replies)
  0 siblings, 36 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:42 UTC
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Hello,

The POWER9 processor comes with a new interrupt controller, called
XIVE, which introduces a large number of new features, for
virtualization in particular.

* XIVE interrupt controller

It is composed of three sub-engines:

  - Interrupt Virtualization Source Engine (IVSE). These engines are
    found in the PHBs, in the main controller for the IPIs and in the
    PSI host bridge. They are configured to feed the IVRE with events.

  - Interrupt Virtualization Routing Engine (IVRE). Its job is to
    match an event source with a Notification Virtualization Target
    (NVT), a priority and an Event Queue (EQ) to determine if a
    Virtual Processor can handle the event.

  - Interrupt Virtualization Presentation Engine (IVPE). It maintains
    the interrupt state of each hardware thread and presents the
    notification as an external exception.

Each of the engines uses a set of internal tables to redirect
exceptions from event sources to CPU threads. Interrupt sources have a
2-bit state machine, the Event State Buffer (ESB), that controls
whether a triggered event is let through. If it is, the IVRE looks up
the Event Queue Descriptor configured for the source in the Interrupt
Virtualization Entry (IVE) table. Each Event Queue Descriptor defines
a notification path to a CPU and an in-memory queue in which an event
identifier is recorded for the OS to pull.
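To make that data path concrete, here is a minimal C sketch of the flow from an event source to an event queue. The table layouts (IVE, EQD, XiveSketch) and the helpers esb_trigger() and xive_route() are hypothetical simplifications for illustration, not the hardware format nor the QEMU model: the NVT and priority lookup is omitted, and a full queue simply drops the event.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified tables: not the hardware or QEMU layout. */
#define NR_SOURCES 8
#define NR_EQS     2
#define EQ_DEPTH   4

typedef struct {
    bool     valid;
    uint32_t eq_index;   /* Event Queue Descriptor used by this source */
    uint32_t eq_data;    /* event identifier recorded for the OS */
} IVE;

typedef struct {
    uint32_t entries[EQ_DEPTH];  /* in-memory queue the OS pulls from */
    uint32_t count;
    uint32_t target_cpu;         /* notification path to a CPU thread */
} EQD;

typedef struct {
    uint8_t pq[NR_SOURCES];      /* ESB state: P = 0x2, Q = 0x1 */
    IVE     ive[NR_SOURCES];
    EQD     eqd[NR_EQS];
} XiveSketch;

/* ESB step: returns true if the event is let through. */
static bool esb_trigger(XiveSketch *x, uint32_t src)
{
    switch (x->pq[src]) {
    case 0x0:                    /* reset -> pending: forward */
        x->pq[src] = 0x2;
        return true;
    case 0x2:                    /* pending -> queued: coalesce */
        x->pq[src] = 0x3;
        return false;
    default:                     /* queued or off: unchanged, dropped */
        return false;
    }
}

/* IVRE step: look up the IVE, then record the event data in the EQ.
 * Returns the CPU to notify, or -1 when nothing is delivered. */
static int xive_route(XiveSketch *x, uint32_t src)
{
    assert(src < NR_SOURCES);
    if (!esb_trigger(x, src) || !x->ive[src].valid) {
        return -1;
    }
    assert(x->ive[src].eq_index < NR_EQS);
    EQD *eq = &x->eqd[x->ive[src].eq_index];
    if (eq->count == EQ_DEPTH) {
        return -1;               /* simplification: full queue drops it */
    }
    eq->entries[eq->count++] = x->ive[src].eq_data;
    return (int)eq->target_cpu;
}
```

A second trigger before the OS EOIs the first one only sets Q: nothing new lands in the queue, which is the coalescing role of the ESB.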

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with an
interrupt controller using the legacy XICS model, as found on POWER8,
or in XIVE exploitation mode. On a POWER9 PowerNV machine, the XIVE
interrupt controller is mandatory.


* XIVE for sPAPR

Here are the high-level ideas of the current design to add support for
XIVE:

 - introduce a persistent sPAPRXive object under the sPAPR machine for
   newer machines and let the CAS negotiation process decide whether
   it should be used or not. Use the 'ov5_cas' attribute for this
   purpose.

 - introduce a persistent XIVE interrupt presenter under the sPAPR
   core and switch ICPs after CAS. Each core now has two ICPs, one
   active through the 'intc' pointer and another one among its
   children, ready to be used if the guest requires it.

 - move the XIVE EQs under the cores to simplify the XIVE model

 - allocate the CPU IPIs at the beginning of the IRQ number space to
   be compatible with XICS (whose interrupt numbers start at 4096) and
   also to simplify the model. This means that the XIVE model covers
   the whole IRQ number space. Unlike XICS, there is no offset
   splitting the IRQ number space.
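A minimal sketch of the resulting IRQ number layout, assuming device interrupts keep their XICS numbers from 4096 upwards and that at most 4096 IPIs are needed. The helpers ipi_irq(), device_irq(), xics_srcno() and xive_srcno() are hypothetical, not functions from this series.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: XICS device interrupts are numbered
 * from 4096 (0x1000), and all vCPU IPIs fit below that. */
#define XICS_IRQ_BASE 4096u

/* Hypothetical helper: the IPI of a vCPU sits at the start of the
 * IRQ number space, below every XICS-compatible device number. */
static uint32_t ipi_irq(uint32_t vcpu_id)
{
    assert(vcpu_id < XICS_IRQ_BASE);
    return vcpu_id;
}

/* Hypothetical helper: device interrupt n keeps its XICS number. */
static uint32_t device_irq(uint32_t n)
{
    return XICS_IRQ_BASE + n;
}

/* XICS: the source index is the IRQ number minus the ICS offset. */
static uint32_t xics_srcno(uint32_t irq)
{
    assert(irq >= XICS_IRQ_BASE);
    return irq - XICS_IRQ_BASE;
}

/* XIVE: one source block covers the whole space, so no offset. */
static uint32_t xive_srcno(uint32_t irq)
{
    return irq;
}
```

The point of the layout is that the same device number is valid in both modes; only the XICS side needs to subtract an offset to find its source index.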


* sPAPR patchset layout 

It first defines new models for XIVE, which will be shared between the
machines or with KVM for sPAPR:

 - XiveSource holding the PQ bits and the ESB MMIO region used to
   control them.

 - XiveNVT holding the CPU interrupt state and the EQ state. It models
   the XIVE interrupt presenter engine.

 - sPAPRXive modeling the XIVE interrupt controller for sPAPR
   machines, holding the internal routing table, a single XIVE source
   for the IPIs and other interrupts and the TIMA MMIO regions used by
   the XiveNVT to do interrupt management. 

We do not model the IVRE, but it could be introduced if needed,
possibly for migration. To be discussed.
   
Then, the notification process and the interrupt delivery to the CPU
are described. Support for sPAPR is completed with the integration of
the sPAPRXive object in the machine, the definition of the new XIVE
hcalls, the device tree layout, and the necessary adjustments to
support the CAS negotiation.

Support for KVM follows, with a set of specific XIVE models, very
much like XICS does. However, the interrupt mode is still chosen at
machine initialization and a reset does not change the KVM interrupt
device. A couple of patches try to fix this limitation with a proposal
to support resets of KVM devices. Some issues with MMU migration
still need to be addressed.


* PowerNV extension

It seemed interesting to include the models for PowerNV as a way to
validate the concepts.

The patchset finishes with RFCs of models for the XIVE interrupt
controller and for the PSI bridge device for the POWER9 PowerNV. PSI
provides a good example of the usage of the notify() handler of the
XiveFabric interface, linking the PSI XiveSource to its owning device.
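The notify() path can be pictured with plain C function pointers. This is only a sketch: the real XiveFabric is a QOM interface, and the Fabric, Source and RecordingOwner types below are hypothetical stand-ins showing how a source forwards events to the device that owns it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for a fabric interface: the owner of the
 * source (a PSI bridge or an interrupt controller, for instance)
 * supplies a notify() handler, and the source calls it. */
typedef struct Fabric Fabric;
struct Fabric {
    void (*notify)(Fabric *f, uint32_t lisn);
};

typedef struct {
    Fabric  *fabric;   /* owning device */
    uint32_t offset;   /* first interrupt number of this source block */
} Source;

/* The source only knows its owner through the interface. */
static void source_notify(Source *s, uint32_t srcno)
{
    s->fabric->notify(s->fabric, s->offset + srcno);
}

/* Example owner that records the last notification it received. */
typedef struct {
    Fabric   parent;   /* first member, so the cast below is valid */
    uint32_t last_lisn;
    int      count;
} RecordingOwner;

static void recording_notify(Fabric *f, uint32_t lisn)
{
    RecordingOwner *o = (RecordingOwner *)f;

    o->last_lisn = lisn;
    o->count++;
}
```

Decoupling the source from its owner this way lets the same source model be reused by sPAPR, PowerNV or a device such as PSI, each routing events differently.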


* Coverage

At this stage, XIVE support in QEMU covers:

 - TCG & KVM kernel_irqchip=off/on
 - CPU hotplug
 - support for older machines
 - migration under TCG
 - migration under KVM, including kernel_irqchip=off <-> kernel_irqchip=on


* Caveats

Migration still needs some care to make sure all HW states are
captured correctly. Extra quiescence points may be needed to turn
the XIVE configuration off and on under KVM.

KVM device reset works well enough but affects MMU migration,
probably because of an ordering problem.


* Github
 
QEMU:

  https://github.com/legoater/qemu/commits/xive

Linux/KVM (to be sent later on):

  https://github.com/legoater/linux/commits/xive

Thanks,

C.

 Changes since v2:

 - added support for Store EOI
 - added support for two-page ESB MMIO settings, as on KVM
 - introduced the XiveFabric interface
 - introduced spapr_xive_mmio_unmap()
 - KVM support

Cédric Le Goater (35):
  ppc/xive: introduce a XIVE interrupt source model
  ppc/xive: add support for the LSI interrupt sources
  ppc/xive: introduce the XiveFabric interface
  spapr/xive: introduce a XIVE interrupt controller for sPAPR
  spapr/xive: add a single source block to the sPAPR XIVE model
  spapr/xive: introduce a XIVE interrupt presenter model
  spapr/xive: introduce the XIVE Event Queues
  spapr: push the XIVE EQ data in OS event queue
  spapr: notify the CPU when the XIVE interrupt priority is more
    privileged
  spapr: add support for the SET_OS_PENDING command (XIVE)
  spapr: introduce a 'xive_exploitation' option to enable XIVE
  spapr: add a sPAPRXive object to the machine
  spapr: add hcalls support for the XIVE exploitation interrupt mode
  spapr: add device tree support for the XIVE exploitation mode
  sysbus: add a sysbus_mmio_unmap() helper
  spapr: introduce a helper to map the XIVE memory regions
  spapr: add XIVE support to spapr_qirq()
  spapr: introduce a spapr_icp_create() helper
  spapr: toggle the ICP depending on the selected interrupt mode
  spapr: add support to dump XIVE information
  spapr: advertise XIVE exploitation mode in CAS
  spapr: add classes for the XIVE models
  target/ppc/kvm: add Linux KVM definitions for XIVE
  spapr/xive: add common realize routine for KVM
  spapr/xive: add KVM support
  spapr/xive: add a XIVE KVM device to the machine
  migration: discard non-migratable RAMBlocks
  intc: introduce a CPUIntc interface
  spapr/xive,xics: use the CPU_INTC handlers to reset KVM
  spapr/xive,xics: reset KVM at machine reset
  spapr/xive: raise migration priority of the machine
  ppc/pnv: introduce a pnv_icp_create() helper
  ppc: externalize ppc_get_vcpu_by_pir()
  ppc/pnv: add XIVE support
  ppc/pnv: add a PSI bridge model for POWER9 processor

 default-configs/ppc64-softmmu.mak |    3 +
 exec.c                            |   10 +
 hw/core/sysbus.c                  |   10 +
 hw/intc/Makefile.objs             |    5 +-
 hw/intc/intc.c                    |   26 +
 hw/intc/pnv_xive.c                | 1234 +++++++++++++++++++++++++++++++++++++
 hw/intc/pnv_xive_regs.h           |  314 ++++++++++
 hw/intc/spapr_xive.c              |  324 ++++++++++
 hw/intc/spapr_xive_hcall.c        |  923 +++++++++++++++++++++++++++
 hw/intc/spapr_xive_kvm.c          |  655 ++++++++++++++++++++
 hw/intc/xics.c                    |    4 +
 hw/intc/xics_kvm.c                |  108 +++-
 hw/intc/xive.c                    | 1200 ++++++++++++++++++++++++++++++++++++
 hw/ppc/pnv.c                      |   93 +--
 hw/ppc/pnv_core.c                 |    2 +-
 hw/ppc/pnv_psi.c                  |  399 +++++++++++-
 hw/ppc/ppc.c                      |   16 +
 hw/ppc/spapr.c                    |  264 +++++++-
 hw/ppc/spapr_cpu_core.c           |   55 +-
 hw/ppc/spapr_hcall.c              |    6 +
 hw/ppc/spapr_rtas.c               |    2 -
 include/exec/cpu-common.h         |    1 +
 include/hw/intc/intc.h            |   21 +
 include/hw/ppc/pnv.h              |   37 +-
 include/hw/ppc/pnv_psi.h          |   50 +-
 include/hw/ppc/pnv_xive.h         |   89 +++
 include/hw/ppc/pnv_xscom.h        |    5 +
 include/hw/ppc/ppc.h              |    1 +
 include/hw/ppc/spapr.h            |   21 +-
 include/hw/ppc/spapr_cpu_core.h   |    2 +
 include/hw/ppc/spapr_xive.h       |   93 +++
 include/hw/ppc/xics.h             |    1 +
 include/hw/ppc/xive.h             |  269 ++++++++
 include/hw/ppc/xive_regs.h        |  187 ++++++
 include/hw/sysbus.h               |    1 +
 include/migration/vmstate.h       |    2 +
 linux-headers/asm-powerpc/kvm.h   |   18 +
 linux-headers/linux/kvm.h         |    5 +
 migration/ram.c                   |   42 +-
 target/ppc/kvm.c                  |    7 +
 target/ppc/kvm_ppc.h              |    6 +
 41 files changed, 6414 insertions(+), 97 deletions(-)
 create mode 100644 hw/intc/pnv_xive.c
 create mode 100644 hw/intc/pnv_xive_regs.h
 create mode 100644 hw/intc/spapr_xive.c
 create mode 100644 hw/intc/spapr_xive_hcall.c
 create mode 100644 hw/intc/spapr_xive_kvm.c
 create mode 100644 hw/intc/xive.c
 create mode 100644 include/hw/ppc/pnv_xive.h
 create mode 100644 include/hw/ppc/spapr_xive.h
 create mode 100644 include/hw/ppc/xive.h
 create mode 100644 include/hw/ppc/xive_regs.h

-- 
2.13.6


* [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
@ 2018-04-19 12:42 ` Cédric Le Goater
  2018-04-20  7:10   ` David Gibson
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
                   ` (34 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:42 UTC
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Each XIVE interrupt source is associated with a two-bit state machine
called an Event State Buffer (ESB): the first bit "P" means that an
interrupt is "pending" and waiting for an EOI, and the bit "Q" (queued)
means a new interrupt was triggered while another was still pending.

When an event is triggered, the associated interrupt state bits are
fetched and modified and, if needed, the event is forwarded to the
virtualization engine of the controller doing the routing. These bits
can also be controlled by MMIO, for instance to trigger events or to
turn off the sources. See the code for more details on the states and
transitions.
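As an illustration of those transitions, here is a standalone C sketch of the trigger and EOI steps on a single PQ pair. It uses the same encodings as the patch (P = 0x2, Q = 0x1, OFF = 0b01), but the names pq_trigger() and pq_eoi() are hypothetical and the packed SBE storage is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings, with P = 0x2 and Q = 0x1 as in the patch. */
enum { PQ_RESET = 0x0, PQ_OFF = 0x1, PQ_PENDING = 0x2, PQ_QUEUED = 0x3 };

/* Trigger: returns true if the event must be forwarded for routing. */
static bool pq_trigger(uint8_t *pq)
{
    switch (*pq) {
    case PQ_RESET:
        *pq = PQ_PENDING;
        return true;
    case PQ_PENDING:
        *pq = PQ_QUEUED;     /* coalesce with the event in flight */
        return false;
    default:
        return false;        /* QUEUED stays QUEUED, OFF stays OFF */
    }
}

/* EOI: returns true if a coalesced event (Q set) must be re-forwarded. */
static bool pq_eoi(uint8_t *pq)
{
    switch (*pq) {
    case PQ_PENDING:
        *pq = PQ_RESET;
        return false;
    case PQ_QUEUED:
        *pq = PQ_PENDING;
        return true;
    default:
        return false;        /* RESET and OFF are unchanged */
    }
}
```

Any number of triggers while P is set collapse into a single Q bit, so at most one extra notification follows the EOI.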

On a sPAPR machine, the OS will obtain the address of the MMIO page of
the ESB entry associated with a source and its characteristics using
the H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is
used.
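The page arithmetic behind those MMIO addresses can be sketched as follows, assuming the 64K page settings defined in this patch. esb_trigger_addr(), esb_mgmt_addr() and esb_decode() are hypothetical names that mirror xive_source_esb_base(), xive_source_esb_mgmt() and the offset decoding done in the MMIO handlers; the hcall that returns these addresses is not modeled, and the base address in the test is arbitrary.

```c
#include <assert.h>
#include <stdint.h>

/* Shift values as in the patch: 64K pages, one or two per source. */
#define ESB_64K        16
#define ESB_64K_2PAGE  17

/* Trigger page of source srcno: always the first (even) page. */
static uint64_t esb_trigger_addr(uint64_t base, unsigned shift,
                                 uint32_t srcno)
{
    return base + ((uint64_t)1 << shift) * srcno;
}

/* Management page: the odd page in the two-page setting, otherwise
 * the single page also used for triggers. */
static uint64_t esb_mgmt_addr(uint64_t base, unsigned shift,
                              uint32_t srcno)
{
    uint64_t addr = esb_trigger_addr(base, shift, srcno);

    if (shift == ESB_64K_2PAGE) {
        addr += (uint64_t)1 << (shift - 1);
    }
    return addr;
}

/* Decode an offset within the ESB region into the source number and
 * the "magic" operation offset, as the MMIO handlers do. */
static void esb_decode(uint64_t addr, unsigned shift,
                       uint32_t *srcno, uint32_t *op)
{
    *srcno = (uint32_t)(addr >> shift);
    *op = (uint32_t)(addr & 0xF00);
}
```

For the SET_PQ offsets (0xc00 to 0xf00), the new PQ value is then (op >> 8) & 0x3, which is how the patch's read handler extracts it.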

The xive_source_notify() routine is in charge of forwarding the source
event notification to the routing engine. It will be filled in later.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 Changes since v2:

 - added support for Store EOI
 - added support for two-page MMIO settings, as on KVM

 default-configs/ppc64-softmmu.mak |   1 +
 hw/intc/Makefile.objs             |   1 +
 hw/intc/xive.c                    | 335 ++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/xive.h             | 130 +++++++++++++++
 4 files changed, 467 insertions(+)
 create mode 100644 hw/intc/xive.c
 create mode 100644 include/hw/ppc/xive.h

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index b94af6c7c62a..c6d13e757977 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -16,4 +16,5 @@ CONFIG_VIRTIO_VGA=y
 CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
+CONFIG_XIVE=$(CONFIG_PSERIES)
 CONFIG_MEM_HOTPLUG=y
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 0e9963f5eecc..72a46ed91c31 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
 obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
+obj-$(CONFIG_XIVE) += xive.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
new file mode 100644
index 000000000000..c70578759d02
--- /dev/null
+++ b/hw/intc/xive.c
@@ -0,0 +1,335 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/dma.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/xive.h"
+
+/*
+ * XIVE Interrupt Source
+ */
+
+uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno)
+{
+    uint32_t byte = srcno / 4;
+    uint32_t bit  = (srcno % 4) * 2;
+
+    assert(byte < xsrc->sbe_size);
+
+    return (xsrc->sbe[byte] >> bit) & 0x3;
+}
+
+uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
+{
+    uint32_t byte = srcno / 4;
+    uint32_t bit  = (srcno % 4) * 2;
+    uint8_t old, new;
+
+    assert(byte < xsrc->sbe_size);
+
+    old = xsrc->sbe[byte];
+
+    new = xsrc->sbe[byte] & ~(0x3 << bit);
+    new |= (pq & 0x3) << bit;
+
+    xsrc->sbe[byte] = new;
+
+    return (old >> bit) & 0x3;
+}
+
+static bool xive_source_pq_eoi(XiveSource *xsrc, uint32_t srcno)
+{
+    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
+        return false;
+    case XIVE_ESB_PENDING:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
+        return false;
+    case XIVE_ESB_QUEUED:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
+        return true;
+    case XIVE_ESB_OFF:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
+        return false;
+    default:
+         g_assert_not_reached();
+    }
+}
+
+/*
+ * Returns whether the event notification should be forwarded.
+ */
+static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
+{
+    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
+        return true;
+    case XIVE_ESB_PENDING:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
+        return false;
+    case XIVE_ESB_QUEUED:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
+        return false;
+    case XIVE_ESB_OFF:
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
+        return false;
+    default:
+         g_assert_not_reached();
+    }
+}
+
+/*
+ * Forward the source event notification to the associated XiveFabric,
+ * the device owning the sources.
+ */
+static void xive_source_notify(XiveSource *xsrc, int srcno)
+{
+
+}
+
+/* In a two pages ESB MMIO setting, even page is the trigger page, odd
+ * page is for management */
+static inline bool xive_source_is_trigger_page(hwaddr addr)
+{
+    return !((addr >> 16) & 1);
+}
+
+static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
+{
+    XiveSource *xsrc = XIVE_SOURCE(opaque);
+    uint32_t offset = addr & 0xF00;
+    uint32_t srcno = addr >> xsrc->esb_shift;
+    uint64_t ret = -1;
+
+    if (xive_source_esb_2page(xsrc) && xive_source_is_trigger_page(addr)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "XIVE: invalid load on IRQ %d trigger page at "
+                      "0x%"HWADDR_PRIx"\n", srcno, addr);
+        return -1;
+    }
+
+    switch (offset) {
+    case XIVE_ESB_LOAD_EOI:
+        /*
+         * Load EOI is not the default source setting under QEMU, but
+         * this is what HW uses currently.
+         */
+        ret = xive_source_pq_eoi(xsrc, srcno);
+
+        break;
+
+    case XIVE_ESB_GET:
+        ret = xive_source_pq_get(xsrc, srcno);
+        break;
+
+    case XIVE_ESB_SET_PQ_00:
+    case XIVE_ESB_SET_PQ_01:
+    case XIVE_ESB_SET_PQ_10:
+    case XIVE_ESB_SET_PQ_11:
+        ret = xive_source_pq_set(xsrc, srcno, (offset >> 8) & 0x3);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
+    }
+
+    return ret;
+}
+
+static void xive_source_esb_write(void *opaque, hwaddr addr,
+                                 uint64_t value, unsigned size)
+{
+    XiveSource *xsrc = XIVE_SOURCE(opaque);
+    uint32_t offset = addr & 0xF00;
+    uint32_t srcno = addr >> xsrc->esb_shift;
+    bool notify = false;
+
+    switch (offset) {
+    case 0:
+        notify = xive_source_pq_trigger(xsrc, srcno);
+        break;
+
+    case XIVE_ESB_STORE_EOI:
+        if (xive_source_is_trigger_page(addr)) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "XIVE: invalid store on IRQ %d trigger page at "
+                          "0x%"HWADDR_PRIx"\n", srcno, addr);
+            return;
+        }
+
+        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
+            return;
+        }
+
+        /* If the Q bit is set, we should forward a new source event
+         * notification
+         */
+        notify = xive_source_pq_eoi(xsrc, srcno);
+        break;
+
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
+                      offset);
+        return;
+    }
+
+    /* Forward the source event notification for routing */
+    if (notify) {
+        xive_source_notify(xsrc, srcno);
+    }
+}
+
+static const MemoryRegionOps xive_source_esb_ops = {
+    .read = xive_source_esb_read,
+    .write = xive_source_esb_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+static void xive_source_set_irq(void *opaque, int srcno, int val)
+{
+    XiveSource *xsrc = XIVE_SOURCE(opaque);
+    bool notify = false;
+
+    if (val) {
+        notify = xive_source_pq_trigger(xsrc, srcno);
+    }
+
+    /* Forward the source event notification for routing */
+    if (notify) {
+        xive_source_notify(xsrc, srcno);
+    }
+}
+
+void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
+{
+    int i;
+
+    monitor_printf(mon, "XIVE Source %6x ..%6x\n",
+                   xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        uint8_t pq = xive_source_pq_get(xsrc, i);
+        uint32_t lisn = i  + xsrc->offset;
+
+        if (pq == XIVE_ESB_OFF) {
+            continue;
+        }
+
+        monitor_printf(mon, "  %4x %c%c\n", lisn,
+                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
+                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
+    }
+}
+
+static void xive_source_reset(DeviceState *dev)
+{
+    XiveSource *xsrc = XIVE_SOURCE(dev);
+
+    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
+    memset(xsrc->sbe, 0x55, xsrc->sbe_size);
+}
+
+static void xive_source_realize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE(dev);
+
+    if (!xsrc->nr_irqs) {
+        error_setg(errp, "Number of interrupts needs to be greater than 0");
+        return;
+    }
+
+    if (xsrc->esb_shift != XIVE_ESB_4K &&
+        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
+        xsrc->esb_shift != XIVE_ESB_64K &&
+        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
+        error_setg(errp, "Invalid ESB shift setting");
+        return;
+    }
+
+    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
+                                     xsrc->nr_irqs);
+
+    /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
+    xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
+    xsrc->sbe = g_malloc0(xsrc->sbe_size);
+
+    /* TODO: H_INT_ESB support, which removes the ESB MMIOs */
+
+    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
+                          &xive_source_esb_ops, xsrc, "xive.esb",
+                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
+}
+
+static const VMStateDescription vmstate_xive_source = {
+    .name = TYPE_XIVE_SOURCE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
+        VMSTATE_VBUFFER_UINT32(sbe, XiveSource, 1, NULL, sbe_size),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * The default XIVE interrupt source setting for ESB MMIO is two 64k
+ * pages without Store EOI. This is in sync with KVM.
+ */
+static Property xive_source_properties[] = {
+    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
+    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
+    DEFINE_PROP_UINT64("bar", XiveSource, esb_base, 0),
+    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void xive_source_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = xive_source_realize;
+    dc->reset = xive_source_reset;
+    dc->props = xive_source_properties;
+    dc->desc = "XIVE interrupt source";
+    dc->vmsd = &vmstate_xive_source;
+}
+
+static const TypeInfo xive_source_info = {
+    .name          = TYPE_XIVE_SOURCE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(XiveSource),
+    .class_init    = xive_source_class_init,
+};
+
+static void xive_register_types(void)
+{
+    type_register_static(&xive_source_info);
+}
+
+type_init(xive_register_types)
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
new file mode 100644
index 000000000000..d92a50519edf
--- /dev/null
+++ b/include/hw/ppc/xive.h
@@ -0,0 +1,130 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_XIVE_H
+#define PPC_XIVE_H
+
+#include "hw/sysbus.h"
+
+/*
+ * XIVE Interrupt Source
+ */
+
+#define TYPE_XIVE_SOURCE "xive-source"
+#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
+
+/*
+ * XIVE Source Interrupt source characteristics, which define how the
+ * ESB are controlled.
+ */
+#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
+#define XIVE_SRC_STORE_EOI     0x4 /* Store EOI supported */
+
+typedef struct XiveSource {
+    SysBusDevice parent;
+
+    /* IRQs */
+    uint32_t     nr_irqs;
+    uint32_t     offset;
+    qemu_irq     *qirqs;
+
+    /* PQ bits */
+    uint8_t      *sbe;
+    uint32_t     sbe_size;
+
+    /* ESB memory region */
+    uint64_t     esb_flags;
+    hwaddr       esb_base;
+    uint32_t     esb_shift;
+    MemoryRegion esb_mmio;
+} XiveSource;
+
+/*
+ * ESB MMIO setting. Can be one page, for both source triggering and
+ * source management, or two different pages. See below for magic
+ * values.
+ */
+#define XIVE_ESB_4K          12 /* PSI HB */
+#define XIVE_ESB_4K_2PAGE    13
+#define XIVE_ESB_64K         16
+#define XIVE_ESB_64K_2PAGE   17
+
+static inline bool xive_source_esb_2page(XiveSource *xsrc)
+{
+    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE;
+}
+
+static inline hwaddr xive_source_esb_base(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return xsrc->esb_base + (1ull << xsrc->esb_shift) * srcno;
+}
+
+/* The trigger page is always the first/even page */
+#define xive_source_esb_trigger xive_source_esb_base
+
+/* In a two pages ESB MMIO setting, the odd page is for management */
+static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
+{
+    hwaddr addr = xive_source_esb_base(xsrc, srcno);
+
+    if (xive_source_esb_2page(xsrc)) {
+        addr += (1 << (xsrc->esb_shift - 1));
+    }
+
+    return addr;
+}
+
+/*
+ * Each interrupt source has a 2-bit state machine called ESB which
+ * can be controlled by MMIO. It's made of 2 bits, P and Q. P
+ * indicates that an interrupt is pending (has been sent to a queue
+ * and is waiting for an EOI). Q indicates that the interrupt has been
+ * triggered while pending.
+ *
+ * This acts as a coalescing mechanism in order to guarantee
+ * that a given interrupt only occurs at most once in a queue.
+ *
+ * When doing an EOI, the Q bit will indicate if the interrupt
+ * needs to be re-triggered.
+ */
+#define XIVE_ESB_VAL_P        0x2
+#define XIVE_ESB_VAL_Q        0x1
+
+#define XIVE_ESB_RESET        0x0
+#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
+#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
+#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
+
+/*
+ * "magic" Event State Buffer (ESB) MMIO offsets.
+ *
+ * The following offsets into the ESB MMIO can be used to read or
+ * manipulate the PQ bits. They must be used with an 8-byte
+ * load instruction. They all return the previous state of the
+ * interrupt (atomically).
+ *
+ * Additionally, some ESB pages support doing an EOI via a
+ * store at 0 and some ESBs support doing a trigger via a
+ * separate trigger page.
+ */
+#define XIVE_ESB_STORE_EOI      0x400 /* Store */
+#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
+#define XIVE_ESB_GET            0x800 /* Load */
+#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
+#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
+#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
+#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
+
+uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno);
+uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
+
+void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
+
+#endif /* PPC_XIVE_H */
-- 
2.13.6


* [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
@ 2018-04-19 12:42 ` Cédric Le Goater
  2018-04-23  6:44   ` David Gibson
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
                   ` (33 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:42 UTC
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The 'sent' status of the LSI interrupt source is modeled with the 'P'
bit of the ESB, and the assertion status of the source is maintained
in an array under the XiveSource object. The type of the source is
stored in the same array for practical reasons.
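A minimal C sketch of that LSI behavior, with the assertion level kept next to the P bit. LsiSketch, lsi_set_irq() and lsi_eoi() are hypothetical names, and the OFF state and the Q bit are left out for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PQ_RESET    0x0
#define PQ_PENDING  0x2   /* P bit: the LSI has been "sent" */

typedef struct {
    uint8_t pq;
    bool    asserted;     /* level of the input line */
} LsiSketch;

/* Forward only if the line is asserted and nothing is in flight. */
static bool lsi_trigger(LsiSketch *s)
{
    if (s->pq == PQ_RESET && s->asserted) {
        s->pq = PQ_PENDING;
        return true;
    }
    return false;
}

/* Level change on the input line. */
static bool lsi_set_irq(LsiSketch *s, int level)
{
    s->asserted = level != 0;
    return lsi_trigger(s);
}

/* EOI clears P; a line that is still asserted fires again at once. */
static bool lsi_eoi(LsiSketch *s)
{
    if (s->pq == PQ_PENDING) {
        s->pq = PQ_RESET;
    }
    return lsi_trigger(s);
}
```

The Q bit is never needed for an LSI: the level itself remembers that the source still wants service, which is why the EOI paths in this patch re-check the assertion status.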

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
 include/hw/ppc/xive.h | 16 +++++++++++++++
 2 files changed, 66 insertions(+), 4 deletions(-)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index c70578759d02..060976077dd7 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
 
 }
 
+/*
+ * LSI interrupt sources use the P bit and a custom assertion flag
+ */
+static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
+{
+    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
+
+    if (old_pq == XIVE_ESB_RESET &&
+        xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
+        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
+        return true;
+    }
+    return false;
+}
+
 /* In a two pages ESB MMIO setting, even page is the trigger page, odd
  * page is for management */
 static inline bool xive_source_is_trigger_page(hwaddr addr)
@@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
          */
         ret = xive_source_pq_eoi(xsrc, srcno);
 
+        /* If the LSI source is still asserted, forward a new source
+         * event notification */
+        if (xive_source_irq_is_lsi(xsrc, srcno)) {
+            if (xive_source_lsi_trigger(xsrc, srcno)) {
+                xive_source_notify(xsrc, srcno);
+            }
+        }
         break;
 
     case XIVE_ESB_GET:
@@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
          * notification
          */
         notify = xive_source_pq_eoi(xsrc, srcno);
+
+        /* LSI sources do not set the Q bit but they can still be
+         * asserted, in which case we should forward a new source
+         * event notification
+         */
+        if (xive_source_irq_is_lsi(xsrc, srcno)) {
+            notify = xive_source_lsi_trigger(xsrc, srcno);
+        }
         break;
 
     default:
@@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
     XiveSource *xsrc = XIVE_SOURCE(opaque);
     bool notify = false;
 
-    if (val) {
-        notify = xive_source_pq_trigger(xsrc, srcno);
+    if (xive_source_irq_is_lsi(xsrc, srcno)) {
+        if (val) {
+            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
+        } else {
+            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
+        }
+        notify = xive_source_lsi_trigger(xsrc, srcno);
+    } else {
+        if (val) {
+            notify = xive_source_pq_trigger(xsrc, srcno);
+        }
     }
 
     /* Forward the source event notification for routing */
@@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
                    xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
     for (i = 0; i < xsrc->nr_irqs; i++) {
         uint8_t pq = xive_source_pq_get(xsrc, i);
-        uint32_t lisn = i  + xsrc->offset;
 
         if (pq == XIVE_ESB_OFF) {
             continue;
         }
 
-        monitor_printf(mon, "  %4x %c%c\n", lisn,
+        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
+                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
                        pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
     }
@@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
 static void xive_source_reset(DeviceState *dev)
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
+    int i;
+
+    /* Keep the IRQ type */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
+    }
 
     /* SBEs are initialized to 0b01 which corresponds to "ints off" */
     memset(xsrc->sbe, 0x55, xsrc->sbe_size);
@@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
 
     xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
                                      xsrc->nr_irqs);
+    xsrc->status = g_malloc0(xsrc->nr_irqs);
 
     /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
     xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index d92a50519edf..0b76dd278d9b 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -33,6 +33,9 @@ typedef struct XiveSource {
     uint32_t     nr_irqs;
     uint32_t     offset;
     qemu_irq     *qirqs;
+#define XIVE_STATUS_LSI         0x1
+#define XIVE_STATUS_ASSERTED    0x2
+    uint8_t      *status;
 
     /* PQ bits */
     uint8_t      *sbe;
@@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
 
 void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
 
+static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return xsrc->status[srcno] & XIVE_STATUS_LSI;
+}
+
+static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
+                                       bool lsi)
+{
+    assert(srcno < xsrc->nr_irqs);
+    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
+}
+
 #endif /* PPC_XIVE_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
@ 2018-04-19 12:42 ` Cédric Le Goater
  2018-04-23  6:46   ` David Gibson
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR Cédric Le Goater
                   ` (32 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:42 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XiveFabric offers a simple interface between the XiveSource
object and the device model owning the interrupt sources. It forwards
an event notification to the XIVE interrupt controller of the machine
or, if the owner is the controller itself, calls the routing
sub-engine directly.
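
The dispatch described above can be sketched in plain C, assuming a
class structure with an optional notify() hook. The names Fabric,
FabricClass, fabric_route() and forward_notify() are hypothetical
stand-ins for the QOM interface, not QEMU API. The sketch only shows
the choice between forwarding and routing.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the XiveFabric QOM interface. */
typedef struct Fabric Fabric;

typedef struct FabricClass {
    /* NULL when the owner of the sources is the controller itself */
    void (*notify)(Fabric *xf, uint32_t lisn);
} FabricClass;

struct Fabric {
    const FabricClass *klass;
    uint32_t last_lisn;   /* last event seen, for illustration only */
    int routed;           /* 1 if the default routing path was taken */
};

/* Default path: the controller routes the event itself. */
static void fabric_route(Fabric *xf, uint32_t lisn)
{
    xf->last_lisn = lisn;
    xf->routed = 1;
}

/* Example notify() hook of a device that forwards events elsewhere. */
static void forward_notify(Fabric *xf, uint32_t lisn)
{
    xf->last_lisn = lisn;
    xf->routed = 0;
}

/* Source side: convert the source number to a global LISN, then
 * either forward it to the owner or route it directly. */
static void source_notify(Fabric *xf, uint32_t srcno, uint32_t offset)
{
    uint32_t lisn = srcno + offset;

    if (xf->klass->notify) {
        xf->klass->notify(xf, lisn);
    } else {
        fabric_route(xf, lisn);
    }
}
```

In the patch, the same choice is made in xive_source_notify() through
XIVE_FABRIC_GET_CLASS(): a device model that leaves notify unset gets
the routing path.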

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
 include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
 2 files changed, 61 insertions(+), 1 deletion(-)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 060976077dd7..b4c3d06c1219 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -17,6 +17,21 @@
 #include "hw/ppc/xive.h"
 
 /*
+ * XIVE Fabric
+ */
+
+static void xive_fabric_route(XiveFabric *xf, int lisn)
+{
+
+}
+
+static const TypeInfo xive_fabric_info = {
+    .name = TYPE_XIVE_FABRIC,
+    .parent = TYPE_INTERFACE,
+    .class_size = sizeof(XiveFabricClass),
+};
+
+/*
  * XIVE Interrupt Source
  */
 
@@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
 
 /*
  * Forward the source event notification to the associated XiveFabric,
- * the device owning the sources.
+ * the device owning the sources, or perform the routing if the device
+ * is the interrupt controller.
  */
 static void xive_source_notify(XiveSource *xsrc, int srcno)
 {
 
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
+
+    if (xfc->notify) {
+        xfc->notify(xsrc->xive, srcno + xsrc->offset);
+    } else {
+        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
+    }
 }
 
 /*
@@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
 static void xive_source_realize(DeviceState *dev, Error **errp)
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
+    Object *obj;
+    Error *local_err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'xive' not found: ");
+        return;
+    }
+
+    xsrc->xive = XIVE_FABRIC(obj);
 
     if (!xsrc->nr_irqs) {
         error_setg(errp, "Number of interrupt needs to be greater than 0");
@@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
 static void xive_register_types(void)
 {
     type_register_static(&xive_source_info);
+    type_register_static(&xive_fabric_info);
 }
 
 type_init(xive_register_types)
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 0b76dd278d9b..4fcae2c763e6 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -12,6 +12,8 @@
 
 #include "hw/sysbus.h"
 
+typedef struct XiveFabric XiveFabric;
+
 /*
  * XIVE Interrupt Source
  */
@@ -46,6 +48,8 @@ typedef struct XiveSource {
     hwaddr       esb_base;
     uint32_t     esb_shift;
     MemoryRegion esb_mmio;
+
+    XiveFabric   *xive;
 } XiveSource;
 
 /*
@@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
     xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
 }
 
+/*
+ * XIVE Fabric
+ */
+
+typedef struct XiveFabric {
+    Object parent;
+} XiveFabric;
+
+#define TYPE_XIVE_FABRIC "xive-fabric"
+#define XIVE_FABRIC(obj)                                     \
+    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
+#define XIVE_FABRIC_CLASS(klass)                                     \
+    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
+#define XIVE_FABRIC_GET_CLASS(obj)                                   \
+    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
+
+typedef struct XiveFabricClass {
+    InterfaceClass parent;
+    void (*notify)(XiveFabric *xf, uint32_t lisn);
+} XiveFabricClass;
+
 #endif /* PPC_XIVE_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (2 preceding siblings ...)
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-24  6:51   ` David Gibson
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model Cédric Le Goater
                   ` (31 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

sPAPRXive is a model for the XIVE interrupt controller device of the
sPAPR machine. It holds the XIVE routing table, the Interrupt
Virtualization Entry (IVE) table (IVT), which associates interrupt
source numbers with targets.

Also extend the XiveFabric interface with an accessor to the IVT,
which will be needed by the routing algorithm.
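
The IVE fields use IBM bit numbering, where bit 0 is the most
significant bit of the 64-bit word. As a minimal sketch, assuming the
usual definitions of PPC_BIT() and PPC_BITMASK() (re-declared locally
here) and the GCC/Clang builtin __builtin_ctzll(), the fields can be
packed and extracted as below. getfield() and setfield() are
hypothetical helpers in the spirit of GETFIELD.

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB. Local copies for this sketch. */
#define PPC_BIT(bit)          (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be)   ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

#define IVE_VALID       PPC_BIT(0)
#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)     /* Destination EQ block# */
#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)    /* Destination EQ index */
#define IVE_MASKED      PPC_BIT(32)           /* Masked */
#define IVE_EQ_DATA     PPC_BITMASK(33, 63)   /* Data written to the EQ */

/* Extract a field: keep the masked bits, shift them down to bit 0
 * using the position of the lowest set bit of the mask. */
static uint64_t getfield(uint64_t mask, uint64_t word)
{
    return (word & mask) >> __builtin_ctzll(mask);
}

/* Insert a field value, leaving the other bits of the word intact. */
static uint64_t setfield(uint64_t mask, uint64_t word, uint64_t value)
{
    return (word & ~mask) | ((value << __builtin_ctzll(mask)) & mask);
}
```

Masking an entry only sets IVE_MASKED and leaves the target fields
untouched, which is what spapr_xive_reset() relies on when it masks
all valid IVEs.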

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Maybe we should introduce a XiveRouter model to hold the IVT. To be
 discussed.

 Changes since v2 :

 - introduced the XiveFabric interface

 default-configs/ppc64-softmmu.mak |   1 +
 hw/intc/Makefile.objs             |   1 +
 hw/intc/spapr_xive.c              | 159 ++++++++++++++++++++++++++++++++++++++
 hw/intc/xive.c                    |   7 ++
 include/hw/ppc/spapr_xive.h       |  31 ++++++++
 include/hw/ppc/xive.h             |   5 ++
 include/hw/ppc/xive_regs.h        |  33 ++++++++
 7 files changed, 237 insertions(+)
 create mode 100644 hw/intc/spapr_xive.c
 create mode 100644 include/hw/ppc/spapr_xive.h
 create mode 100644 include/hw/ppc/xive_regs.h

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index c6d13e757977..f8d34722931d 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -17,4 +17,5 @@ CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_XIVE=$(CONFIG_PSERIES)
+CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
 CONFIG_MEM_HOTPLUG=y
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 72a46ed91c31..301a8e972d91 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
new file mode 100644
index 000000000000..020444e2665a
--- /dev/null
+++ b/hw/intc/spapr_xive.c
@@ -0,0 +1,159 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive_regs.h"
+
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
+{
+    int i;
+
+    monitor_printf(mon, "IVE Table\n");
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+
+        if (!(ive->w & IVE_VALID)) {
+            continue;
+        }
+
+        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
+                       ive->w & IVE_MASKED ? "M" : " ",
+                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
+                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
+    }
+}
+
+static void spapr_xive_reset(DeviceState *dev)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+    int i;
+
+    /* Mask all valid IVEs in the IRQ number space. */
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+        if (ive->w & IVE_VALID) {
+            ive->w |= IVE_MASKED;
+        }
+    }
+}
+
+static void spapr_xive_init(Object *obj)
+{
+
+}
+
+static void spapr_xive_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+
+    if (!xive->nr_irqs) {
+        error_setg(errp, "Number of interrupt needs to be greater 0");
+        return;
+    }
+
+    /* Allocate the Interrupt Virtualization Table */
+    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
+}
+
+static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
+{
+    sPAPRXive *xive = SPAPR_XIVE(xf);
+
+    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
+}
+
+static const VMStateDescription vmstate_spapr_xive_ive = {
+    .name = TYPE_SPAPR_XIVE "/ive",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(w, XiveIVE),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static const VMStateDescription vmstate_spapr_xive = {
+    .name = TYPE_SPAPR_XIVE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
+        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
+                                     vmstate_spapr_xive_ive, XiveIVE),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static Property spapr_xive_properties[] = {
+    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_xive_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
+
+    dc->realize = spapr_xive_realize;
+    dc->reset = spapr_xive_reset;
+    dc->props = spapr_xive_properties;
+    dc->desc = "sPAPR XIVE interrupt controller";
+    dc->vmsd = &vmstate_spapr_xive;
+
+    xfc->get_ive = spapr_xive_get_ive;
+}
+
+static const TypeInfo spapr_xive_info = {
+    .name = TYPE_SPAPR_XIVE,
+    .parent = TYPE_SYS_BUS_DEVICE,
+    .instance_init = spapr_xive_init,
+    .instance_size = sizeof(sPAPRXive),
+    .class_init = spapr_xive_class_init,
+    .interfaces = (InterfaceInfo[]) {
+            { TYPE_XIVE_FABRIC },
+            { },
+    },
+};
+
+static void spapr_xive_register_types(void)
+{
+    type_register_static(&spapr_xive_info);
+}
+
+type_init(spapr_xive_register_types)
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
+{
+    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
+
+    if (!ive) {
+        return false;
+    }
+
+    ive->w |= IVE_VALID;
+    return true;
+}
+
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
+{
+    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
+
+    if (!ive) {
+        return false;
+    }
+
+    ive->w &= ~IVE_VALID;
+    return true;
+}
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index b4c3d06c1219..dccad0318834 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -20,6 +20,13 @@
  * XIVE Fabric
  */
 
+XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
+{
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
+
+    return xfc->get_ive(xf, lisn);
+}
+
 static void xive_fabric_route(XiveFabric *xf, int lisn)
 {
 
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
new file mode 100644
index 000000000000..1d966b5d3a96
--- /dev/null
+++ b/include/hw/ppc/spapr_xive.h
@@ -0,0 +1,31 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_SPAPR_XIVE_H
+#define PPC_SPAPR_XIVE_H
+
+#include "hw/sysbus.h"
+#include "hw/ppc/xive.h"
+
+#define TYPE_SPAPR_XIVE "spapr-xive"
+#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
+
+typedef struct sPAPRXive {
+    SysBusDevice parent;
+
+    /* Routing table */
+    XiveIVE      *ivt;
+    uint32_t     nr_irqs;
+} sPAPRXive;
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+
+#endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 4fcae2c763e6..5b145816acdc 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -11,6 +11,7 @@
 #define PPC_XIVE_H
 
 #include "hw/sysbus.h"
+#include "hw/ppc/xive_regs.h"
 
 typedef struct XiveFabric XiveFabric;
 
@@ -166,6 +167,10 @@ typedef struct XiveFabric {
 typedef struct XiveFabricClass {
     InterfaceClass parent;
     void (*notify)(XiveFabric *xf, uint32_t lisn);
+
+    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
 } XiveFabricClass;
 
+XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
new file mode 100644
index 000000000000..5903f29eb789
--- /dev/null
+++ b/include/hw/ppc/xive_regs.h
@@ -0,0 +1,33 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2016-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef _PPC_XIVE_REGS_H
+#define _PPC_XIVE_REGS_H
+
+/* IVE/EAS
+ *
+ * One per interrupt source. Targets that interrupt to a given EQ
+ * and provides the corresponding logical interrupt number (EQ data)
+ *
+ * We also map this structure to the escalation descriptor inside
+ * an EQ, though in that case the valid and masked bits are not used.
+ */
+typedef struct XiveIVE {
+        /* Use a single 64-bit definition to make it easier to
+         * perform atomic updates
+         */
+        uint64_t        w;
+#define IVE_VALID       PPC_BIT(0)
+#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
+#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
+#define IVE_MASKED      PPC_BIT(32)              /* Masked */
+#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
+} XiveIVE;
+
+#endif /* _INTC_XIVE_INTERNAL_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (3 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-24  6:58   ` David Gibson
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model Cédric Le Goater
                   ` (30 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Bare-metal systems (PowerNV) have multiple interrupt sources. The
XIVE interrupt controller has an internal source for IPIs and generic
IPIs, the PSI host bridge (PSIHB) has one, and so do the PHBs. For
simplicity, the sPAPR machine uses a single XiveSource object for all
the IPIs and virtual device interrupts of the VM.

The ESB MMIO region used to control the sources is mapped at the
address of chip 0 of a real system, and only the provisioned IRQ
numbers are covered.
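
A rough sketch of how a single source block can serve the whole LISN
space, assuming each source owns 1 << esb_shift bytes of ESB space,
laid out contiguously from the base address. SourceBlock,
lisn_to_srcno(), esb_addr() and esb_window_size() are hypothetical;
the base value in the usage below reuses XIVE_VC_BASE from the patch
only for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of a single source block. */
typedef struct SourceBlock {
    uint64_t esb_base;   /* start of the ESB MMIO window */
    uint32_t esb_shift;  /* log2 of the ESB space per source (assumed) */
    uint32_t offset;     /* first LISN covered by this block */
    uint32_t nr_irqs;    /* number of provisioned sources */
} SourceBlock;

/* Convert a global LISN into a source number; false if not covered. */
static bool lisn_to_srcno(const SourceBlock *sb, uint32_t lisn,
                          uint32_t *srcno)
{
    if (lisn < sb->offset || lisn - sb->offset >= sb->nr_irqs) {
        return false;
    }
    *srcno = lisn - sb->offset;
    return true;
}

/* Address of the ESB space of a given source number. */
static uint64_t esb_addr(const SourceBlock *sb, uint32_t srcno)
{
    return sb->esb_base + ((uint64_t)srcno << sb->esb_shift);
}

/* Size of the whole window: only the provisioned IRQs are covered. */
static uint64_t esb_window_size(const SourceBlock *sb)
{
    return (uint64_t)sb->nr_irqs << sb->esb_shift;
}
```

For example, with a base of 0x0006010000000000, esb_shift = 16 and
4096 provisioned IRQs, source 2 lands at 0x0006010000020000. The patch
computes lisn - xsrc->offset directly in spapr_xive_irq_enable() and
relies on spapr_xive_get_ive() for the upper bound; the lower-bound
test here is an extra precaution of the sketch.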

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c        | 34 ++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr_xive.h |  3 +++
 include/hw/ppc/xive.h       |  6 ++++++
 3 files changed, 43 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 020444e2665a..90cde8a4082d 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -14,12 +14,15 @@
 #include "sysemu/cpus.h"
 #include "monitor/monitor.h"
 #include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive.h"
 #include "hw/ppc/xive_regs.h"
 
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 {
     int i;
 
+    xive_source_pic_print_info(&xive->source, mon);
+
     monitor_printf(mon, "IVE Table\n");
     for (i = 0; i < xive->nr_irqs; i++) {
         XiveIVE *ive = &xive->ivt[i];
@@ -40,6 +43,9 @@ static void spapr_xive_reset(DeviceState *dev)
     sPAPRXive *xive = SPAPR_XIVE(dev);
     int i;
 
+    /* Xive Source reset is done through SysBus, it should put all
+     * IRQs to OFF (!P|Q) */
+
     /* Mask all valid IVEs in the IRQ number space. */
     for (i = 0; i < xive->nr_irqs; i++) {
         XiveIVE *ive = &xive->ivt[i];
@@ -51,18 +57,42 @@ static void spapr_xive_reset(DeviceState *dev)
 
 static void spapr_xive_init(Object *obj)
 {
+    sPAPRXive *xive = SPAPR_XIVE(obj);
 
+    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
 }
 
 static void spapr_xive_realize(DeviceState *dev, Error **errp)
 {
     sPAPRXive *xive = SPAPR_XIVE(dev);
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
 
     if (!xive->nr_irqs) {
         error_setg(errp, "Number of interrupt needs to be greater 0");
         return;
     }
 
+    /* The XIVE interrupt controller has an internal source for IPIs
+     * and generic IPIs, the PSIHB has one and also the PHBs. For
+     * simplicity, we use a unique XIVE source object for *all*
+     * interrupts on sPAPR. The ESBs pages are mapped at the address
+     * of chip 0 of a real system.
+     */
+    object_property_set_int(OBJECT(xsrc), XIVE_VC_BASE, "bar",
+                            &error_fatal);
+    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
+
     /* Allocate the Interrupt Virtualization Table */
     xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
 }
@@ -137,23 +167,27 @@ type_init(spapr_xive_register_types)
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
 {
     XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
+    XiveSource *xsrc = &xive->source;
 
     if (!ive) {
         return false;
     }
 
     ive->w |= IVE_VALID;
+    xive_source_irq_set(xsrc, lisn - xsrc->offset, lsi);
     return true;
 }
 
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
 {
     XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
+    XiveSource *xsrc = &xive->source;
 
     if (!ive) {
         return false;
     }
 
     ive->w &= ~IVE_VALID;
+    xive_source_irq_set(xsrc, lisn - xsrc->offset, false);
     return true;
 }
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 1d966b5d3a96..4538c622b60a 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -19,6 +19,9 @@
 typedef struct sPAPRXive {
     SysBusDevice parent;
 
+    /* Internal interrupt source for IPIs and virtual devices */
+    XiveSource   source;
+
     /* Routing table */
     XiveIVE      *ivt;
     uint32_t     nr_irqs;
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 5b145816acdc..57295715a4a5 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -16,6 +16,12 @@
 typedef struct XiveFabric XiveFabric;
 
 /*
+ * XIVE MMIO regions
+ */
+
+#define XIVE_VC_BASE   0x0006010000000000ull
+
+/*
  * XIVE Interrupt Source
  */
 
-- 
2.13.6

* [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (4 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-26  7:11   ` David Gibson
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues Cédric Le Goater
                   ` (29 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE presenter engine uses a set of registers to handle priority
management and interrupt acknowledgment, among other things. The most
important ones are:

  - Pending Interrupt Priority Register (PIPR)
  - Interrupt Pending Buffer (IPB)
  - Current Processor Priority (CPPR)
  - Notification Source Register (NSR)
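
How the IPB and PIPR relate can be illustrated with a small sketch.
It assumes eight priorities with 0 the most favored (XIVE_PRIORITY_MAX
= 7, the limit xive_nvt_set_cppr() uses in this patch), one IPB bit per
priority with priority 0 in the most significant bit, and that an
exception is signalled when the PIPR is numerically lower than the
CPPR. The helper names are hypothetical.

```c
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7   /* assumed: 8 priorities, 0 most favored */

/* One IPB bit per priority; priority 0 maps to the MSB (0x80). */
static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > XIVE_PRIORITY_MAX ? 0 : (uint8_t)(0x80 >> priority);
}

/* PIPR: the most favored pending priority, 0xff if nothing is pending. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio;

    if (!ipb) {
        return 0xff;
    }
    for (prio = 0; !(ipb & (0x80 >> prio)); prio++) {
        /* scan from the most favored priority downwards */
    }
    return prio;
}

/* Signal the CPU only if the pending priority beats the CPPR. */
static int nvt_should_signal(uint8_t ipb, uint8_t cppr)
{
    return ipb_to_pipr(ipb) < cppr;
}
```

With priorities 3 and 5 pending, the IPB is 0x14 and the PIPR is 3, so
a CPPR of 0xff (all priorities accepted) lets the interrupt through
while a CPPR of 3 holds it back.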

There is one set of registers per privilege level, four in all:
HW, HV pool, OS and User. These are called rings. All registers are
accessible through a specific MMIO region called the Thread Interrupt
Management Area (TIMA) but, depending on the privilege level of the
CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
OS privilege level and therefore can only access the OS and User
rings. The other rings are reserved for the hypervisor levels.

The CPU interrupt state is modeled with a XiveNVT object which stores
the values of the different registers. The different TIMA views are
mapped at the same address for each CPU and 'current_cpu' is used to
retrieve the XiveNVT holding the ring registers.
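
The OS ring accesses performed by xive_tm_os_read() and
xive_nvt_set_cppr() can be sketched outside QEMU as below. The register
offsets (TM_QW1_OS = 0x10, TM_CPPR = 0x1, and so on) and the layout of
four 16-byte rings are assumptions about xive_regs.h, and the NVT is
passed explicitly instead of being looked up through current_cpu.
Unlike the patch, the sketch also rejects accesses that would run past
the end of the OS ring.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed layout: four 16-byte rings (User, OS, HV pool, HW). */
#define TM_RING_SIZE      0x10
#define TM_QW0_USER       0x00
#define TM_QW1_OS         0x10
#define TM_NSR            0x0
#define TM_CPPR           0x1
#define TM_IPB            0x2
#define TM_LSMFB          0x3
#define XIVE_PRIORITY_MAX 7

#define TM_RING(offset) ((offset) & 0xf0)

typedef struct NVT {
    uint8_t regs[4 * TM_RING_SIZE];
} NVT;

/* Big-endian read of 'size' bytes from the OS ring. Returns false if
 * the access does not fall entirely inside the OS ring. */
static bool tm_os_read(const NVT *nvt, uint32_t offset, unsigned size,
                       uint64_t *val)
{
    uint64_t ret = 0;
    unsigned i;

    if (size == 0 || size > 8 || TM_RING(offset) != TM_QW1_OS ||
        TM_RING(offset + size - 1) != TM_QW1_OS) {
        return false;
    }
    for (i = 0; i < size; i++) {
        ret |= (uint64_t)nvt->regs[offset + i] << (8 * (size - i - 1));
    }
    *val = ret;
    return true;
}

/* Out-of-range priorities are treated as "accept everything" (0xff). */
static void tm_set_cppr(NVT *nvt, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    nvt->regs[TM_QW1_OS + TM_CPPR] = cppr;
}
```

A 4-byte read at TM_QW1_OS returns NSR, CPPR, IPB and LSMFB packed
most significant byte first, which matches DEVICE_BIG_ENDIAN in the
MemoryRegionOps of the patch.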

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v2 :

 - introduced the XiveFabric interface

 hw/intc/spapr_xive.c        |  25 ++++
 hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr_xive.h |   5 +
 include/hw/ppc/xive.h       |  31 +++++
 include/hw/ppc/xive_regs.h  |  84 +++++++++++++
 5 files changed, 424 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 90cde8a4082d..f07832bf0a00 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -13,6 +13,7 @@
 #include "target/ppc/cpu.h"
 #include "sysemu/cpus.h"
 #include "monitor/monitor.h"
+#include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_xive.h"
 #include "hw/ppc/xive.h"
 #include "hw/ppc/xive_regs.h"
@@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
 
     /* Allocate the Interrupt Virtualization Table */
     xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
+
+    /* The Thread Interrupt Management Area has the same address for
+     * each chip. On sPAPR, we only need to expose the User and OS
+     * level views of the TIMA.
+     */
+    xive->tm_base = XIVE_TM_BASE;
+
+    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
+                          &xive_tm_user_ops, xive, "xive.tima.user",
+                          1ull << TM_SHIFT);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
+
+    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
+                          &xive_tm_os_ops, xive, "xive.tima.os",
+                          1ull << TM_SHIFT);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
 }
 
 static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
@@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
     return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
 }
 
+static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
+{
+    PowerPCCPU *cpu = spapr_find_cpu(server);
+
+    return cpu ? XIVE_NVT(cpu->intc) : NULL;
+}
+
 static const VMStateDescription vmstate_spapr_xive_ive = {
     .name = TYPE_SPAPR_XIVE "/ive",
     .version_id = 1,
@@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
     dc->vmsd = &vmstate_spapr_xive;
 
     xfc->get_ive = spapr_xive_get_ive;
+    xfc->get_nvt = spapr_xive_get_nvt;
 }
 
 static const TypeInfo spapr_xive_info = {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index dccad0318834..5691bb9474e4 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -14,7 +14,278 @@
 #include "sysemu/cpus.h"
 #include "sysemu/dma.h"
 #include "monitor/monitor.h"
+#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
 #include "hw/ppc/xive.h"
+#include "hw/ppc/xive_regs.h"
+
+/*
+ * XIVE Interrupt Presenter
+ */
+
+static uint64_t xive_nvt_accept(XiveNVT *nvt)
+{
+    return 0;
+}
+
+static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
+{
+    if (cppr > XIVE_PRIORITY_MAX) {
+        cppr = 0xff;
+    }
+
+    nvt->ring_os[TM_CPPR] = cppr;
+}
+
+/*
+ * OS Thread Interrupt Management Area MMIO
+ */
+static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
+                                           unsigned size)
+{
+    uint64_t ret = -1;
+
+    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
+        ret = xive_nvt_accept(nvt);
+    } else {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
+                      HWADDR_PRIx" size %d\n", offset, size);
+    }
+
+    return ret;
+}
+
+#define TM_RING(offset) ((offset) & 0xf0)
+
+static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
+                                      unsigned size)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    XiveNVT *nvt = XIVE_NVT(cpu->intc);
+    uint64_t ret = -1;
+    int i;
+
+    if (offset >= TM_SPC_ACK_EBB) {
+        return xive_tm_read_special(nvt, offset, size);
+    }
+
+    if (TM_RING(offset) != TM_QW1_OS) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
+                      HWADDR_PRIx"\n", offset);
+        return ret;
+    }
+
+    ret = 0;
+    for (i = 0; i < size; i++) {
+        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
+    }
+
+    return ret;
+}
+
+static bool xive_tm_is_readonly(uint8_t offset)
+{
+    return offset != TM_QW1_OS + TM_CPPR;
+}
+
+static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
+                                        uint64_t value, unsigned size)
+{
+    /* TODO: support TM_SPC_SET_OS_PENDING */
+
+    /* TODO: support TM_SPC_ACK_OS_EL */
+}
+
+static void xive_tm_os_write(void *opaque, hwaddr offset,
+                                   uint64_t value, unsigned size)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    XiveNVT *nvt = XIVE_NVT(cpu->intc);
+    int i;
+
+    if (offset >= TM_SPC_ACK_EBB) {
+        xive_tm_write_special(nvt, offset, value, size);
+        return;
+    }
+
+    if (TM_RING(offset) != TM_QW1_OS) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
+                      HWADDR_PRIx"\n", offset);
+        return;
+    }
+
+    switch (size) {
+    case 1:
+        if (offset == TM_QW1_OS + TM_CPPR) {
+            xive_nvt_set_cppr(nvt, value & 0xff);
+        }
+        break;
+    case 4:
+    case 8:
+        for (i = 0; i < size; i++) {
+            if (!xive_tm_is_readonly(offset + i)) {
+                nvt->regs[offset + i] = (value >> (8 * (size - i - 1))) & 0xff;
+            }
+        }
+        break;
+    default:
+        g_assert_not_reached();
+    }
+}
+
+const MemoryRegionOps xive_tm_os_ops = {
+    .read = xive_tm_os_read,
+    .write = xive_tm_os_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * User Thread Interrupt Management Area MMIO
+ */
+
+static uint64_t xive_tm_user_read(void *opaque, hwaddr offset,
+                                        unsigned size)
+{
+    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
+                  HWADDR_PRIx"\n", offset);
+    return -1;
+}
+
+static void xive_tm_user_write(void *opaque, hwaddr offset,
+                                     uint64_t value, unsigned size)
+{
+    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
+                  HWADDR_PRIx"\n", offset);
+}
+
+
+const MemoryRegionOps xive_tm_user_ops = {
+    .read = xive_tm_user_read,
+    .write = xive_tm_user_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+static char *xive_nvt_ring_print(uint8_t *ring)
+{
+    uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
+
+    return g_strdup_printf("%02x  %02x   %02x  %02x    %02x   "
+                   "%02x  %02x  %02x   %08x",
+                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
+                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
+                   w2);
+}
+
+void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
+{
+    int cpu_index = nvt->cs ? nvt->cs->cpu_index : -1;
+    char *s;
+
+    monitor_printf(mon, "CPU[%04x]: QW    NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
+                   " W2\n", cpu_index);
+
+    s = xive_nvt_ring_print(&nvt->regs[TM_QW1_OS]);
+    monitor_printf(mon, "CPU[%04x]: OS    %s\n", cpu_index, s);
+    g_free(s);
+    s = xive_nvt_ring_print(&nvt->regs[TM_QW0_USER]);
+    monitor_printf(mon, "CPU[%04x]: USER  %s\n", cpu_index, s);
+    g_free(s);
+}
+
+static void xive_nvt_reset(void *dev)
+{
+    XiveNVT *nvt = XIVE_NVT(dev);
+
+    memset(nvt->regs, 0, sizeof(nvt->regs));
+}
+
+static void xive_nvt_realize(DeviceState *dev, Error **errp)
+{
+    XiveNVT *nvt = XIVE_NVT(dev);
+    PowerPCCPU *cpu;
+    CPUPPCState *env;
+    Object *obj;
+    Error *err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), ICP_PROP_CPU, &err);
+    if (!obj) {
+        error_propagate(errp, err);
+        error_prepend(errp, "required link '" ICP_PROP_CPU "' not found: ");
+        return;
+    }
+
+    cpu = POWERPC_CPU(obj);
+    nvt->cs = CPU(obj);
+
+    env = &cpu->env;
+    switch (PPC_INPUT(env)) {
+    case PPC_FLAGS_INPUT_POWER7:
+        nvt->output = env->irq_inputs[POWER7_INPUT_INT];
+        break;
+
+    default:
+        error_setg(errp, "XIVE interrupt controller does not support "
+                   "this CPU bus model");
+        return;
+    }
+
+    qemu_register_reset(xive_nvt_reset, dev);
+}
+
+static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
+{
+    qemu_unregister_reset(xive_nvt_reset, dev);
+}
+
+static void xive_nvt_init(Object *obj)
+{
+    XiveNVT *nvt = XIVE_NVT(obj);
+
+    nvt->ring_os = &nvt->regs[TM_QW1_OS];
+}
+
+static const VMStateDescription vmstate_xive_nvt = {
+    .name = TYPE_XIVE_NVT,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_BUFFER(regs, XiveNVT),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static void xive_nvt_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = xive_nvt_realize;
+    dc->unrealize = xive_nvt_unrealize;
+    dc->desc = "XIVE Interrupt Presenter";
+    dc->vmsd = &vmstate_xive_nvt;
+}
+
+static const TypeInfo xive_nvt_info = {
+    .name          = TYPE_XIVE_NVT,
+    .parent        = TYPE_DEVICE,
+    .instance_size = sizeof(XiveNVT),
+    .instance_init = xive_nvt_init,
+    .class_init    = xive_nvt_class_init,
+};
 
 /*
  * XIVE Fabric
@@ -27,6 +298,13 @@ XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
     return xfc->get_ive(xf, lisn);
 }
 
+XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
+{
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
+
+    return xfc->get_nvt(xf, server);
+}
+
 static void xive_fabric_route(XiveFabric *xf, int lisn)
 {
 
@@ -418,6 +696,7 @@ static void xive_register_types(void)
 {
     type_register_static(&xive_source_info);
     type_register_static(&xive_fabric_info);
+    type_register_static(&xive_nvt_info);
 }
 
 type_init(xive_register_types)
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 4538c622b60a..25d78eec884d 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -25,6 +25,11 @@ typedef struct sPAPRXive {
     /* Routing table */
     XiveIVE      *ivt;
     uint32_t     nr_irqs;
+
+    /* TIMA memory regions */
+    hwaddr       tm_base;
+    MemoryRegion tm_mmio_user;
+    MemoryRegion tm_mmio_os;
 } sPAPRXive;
 
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 57295715a4a5..1a2da610d91c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -20,6 +20,7 @@ typedef struct XiveFabric XiveFabric;
  */
 
 #define XIVE_VC_BASE   0x0006010000000000ull
+#define XIVE_TM_BASE   0x0006030203180000ull
 
 /*
  * XIVE Interrupt Source
@@ -155,6 +156,34 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
 }
 
 /*
+ * XIVE Interrupt Presenter
+ */
+
+#define TYPE_XIVE_NVT "xive-nvt"
+#define XIVE_NVT(obj) OBJECT_CHECK(XiveNVT, (obj), TYPE_XIVE_NVT)
+
+#define TM_RING_COUNT           4
+#define TM_RING_SIZE            0x10
+
+typedef struct XiveNVT {
+    DeviceState parent_obj;
+
+    CPUState  *cs;
+    qemu_irq  output;
+
+    /* Thread interrupt Management (TM) registers */
+    uint8_t   regs[TM_RING_COUNT * TM_RING_SIZE];
+
+    /* Shortcuts to rings */
+    uint8_t   *ring_os;
+} XiveNVT;
+
+extern const MemoryRegionOps xive_tm_user_ops;
+extern const MemoryRegionOps xive_tm_os_ops;
+
+void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
+
+/*
  * XIVE Fabric
  */
 
@@ -175,8 +204,10 @@ typedef struct XiveFabricClass {
     void (*notify)(XiveFabric *xf, uint32_t lisn);
 
     XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
+    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
 } XiveFabricClass;
 
 XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
+XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
 
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index 5903f29eb789..f2e2a1ac8f6e 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -10,6 +10,88 @@
 #ifndef _PPC_XIVE_REGS_H
 #define _PPC_XIVE_REGS_H
 
+#define TM_SHIFT                16
+
+/* TM register offsets */
+#define TM_QW0_USER             0x000 /* All rings */
+#define TM_QW1_OS               0x010 /* Ring 0..2 */
+#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
+#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
+
+/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
+#define TM_NSR                  0x0  /*  +   +   -   +  */
+#define TM_CPPR                 0x1  /*  -   +   -   +  */
+#define TM_IPB                  0x2  /*  -   +   +   +  */
+#define TM_LSMFB                0x3  /*  -   +   +   +  */
+#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
+#define TM_INC                  0x5  /*  -   +   -   +  */
+#define TM_AGE                  0x6  /*  -   +   -   +  */
+#define TM_PIPR                 0x7  /*  -   +   -   +  */
+
+#define TM_WORD0                0x0
+#define TM_WORD1                0x4
+
+/*
+ * QW word 2 contains the valid bit at the top and other fields
+ * depending on the QW.
+ */
+#define TM_WORD2                0x8
+#define   TM_QW0W2_VU           PPC_BIT32(0)
+#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
+#define   TM_QW1W2_VO           PPC_BIT32(0)
+#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
+#define   TM_QW2W2_VP           PPC_BIT32(0)
+#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
+#define   TM_QW3W2_VT           PPC_BIT32(0)
+#define   TM_QW3W2_LP           PPC_BIT32(6)
+#define   TM_QW3W2_LE           PPC_BIT32(7)
+#define   TM_QW3W2_T            PPC_BIT32(31)
+
+/*
+ * In addition to normal loads to "peek" and writes (only when invalid)
+ * using 4- and 8-byte accesses, the above registers support these
+ * "special" byte operations:
+ *
+ *   - Byte load from QW0[NSR] - User level NSR (EBB)
+ *   - Byte store to QW0[NSR] - User level NSR (EBB)
+ *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
+ *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
+ *                                    otherwise VT||0000000
+ *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
+ *
+ * Then we have all these "special" CI ops at these offsets that trigger
+ * all sorts of side effects:
+ */
+#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg */
+#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
+#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
+#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
+                                         * context */
+#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
+#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
+                                         * context to reg */
+#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
+                                         * context to reg*/
+#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
+#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
+                                         * line */
+#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
+#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
+                                         * line */
+#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
+/* XXX more... */
+
+/* NSR fields for the various QW ack types */
+#define TM_QW0_NSR_EB           PPC_BIT8(0)
+#define TM_QW1_NSR_EO           PPC_BIT8(0)
+#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
+#define  TM_QW3_NSR_HE_NONE     0
+#define  TM_QW3_NSR_HE_POOL     1
+#define  TM_QW3_NSR_HE_PHYS     2
+#define  TM_QW3_NSR_HE_LSI      3
+#define TM_QW3_NSR_I            PPC_BIT8(2)
+#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
+
 /* IVE/EAS
  *
  * One per interrupt source. Targets that interrupt to a given EQ
@@ -30,4 +112,6 @@ typedef struct XiveIVE {
 #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
 } XiveIVE;
 
+#define XIVE_PRIORITY_MAX  7
+
 #endif /* _INTC_XIVE_INTERNAL_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (5 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-26  7:25   ` David Gibson
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 08/35] spapr: push the XIVE EQ data in OS event queue Cédric Le Goater
                   ` (28 subsequent siblings)
  35 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The Event Queue Descriptor (EQD) table is an internal table of the
XIVE routing sub-engine. It specifies the Event Queue on which the
event data should be posted when an exception occurs (to be pulled
later by the OS) and which Virtual Processor to notify. The Event
Queue is a much more complex structure, but we start with a simple
model for the sPAPR machine.
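
As an illustration only (not part of this patch), the sketch below
decodes a few EQ descriptor fields the way the model does. It assumes
the IBM bit numbering of xive_regs.h, where bit 0 is the most
significant bit of a 32-bit word, and re-creates PPC_BIT32,
PPC_BITMASK32 and GETFIELD under hypothetical local names;
struct eq_desc is a simplified stand-in for XiveEQ.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word */
#define BIT32(b)          (0x80000000u >> (b))
#define BITMASK32(bs, be) ((BIT32(bs) - BIT32(be)) | BIT32(bs))

/* Local copies of a few XiveEQ field definitions from xive_regs.h */
#define EQ_W0_VALID       BIT32(0)
#define EQ_W0_QSIZE       BITMASK32(12, 15)
#define EQ_W6_NVT_INDEX   BITMASK32(13, 31)
#define EQ_W7_F0_PRIORITY BITMASK32(8, 15)

/* Extract the field selected by 'mask' from 'word' (like GETFIELD) */
static uint32_t getfield32(uint32_t mask, uint32_t word)
{
    uint32_t v = word & mask;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        v >>= 1;
    }
    return v;
}

/* Simplified stand-in for XiveEQ: eight 32-bit words w0..w7 */
struct eq_desc {
    uint32_t w[8];
};

/* Guest address of the queue: low 28 bits of w2 (EQ_W2_OP_DESC_HI), then w3 */
static uint64_t eq_qaddr(const struct eq_desc *eq)
{
    return ((uint64_t)(eq->w[2] & 0x0fffffffu) << 32) | eq->w[3];
}

/* Number of 4-byte entries: QSIZE 0 is a 4K page, i.e. 1024 entries */
static uint32_t eq_qentries(const struct eq_desc *eq)
{
    return 1u << (getfield32(EQ_W0_QSIZE, eq->w[0]) + 10);
}
```

A QSIZE of 4 corresponds to EQ_QSIZE_64K, i.e. 16384 entries, which
matches the 1 << (qsize + 10) computation in xive_eq_pic_print_info().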

There is one XiveEQ per priority, and these are stored under the
XIVE interrupt presenter model (XiveNVT). EQs are simply indexed with:

       (server << 3) | (priority & 0x7)

This encoding is not part of the XIVE architecture, but since the EQ
index is never exposed to the guest, neither in the hcalls nor in the
device tree, we are free to use whatever fits the current model best.
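
A minimal standalone sketch of that encoding, mirroring the
SPAPR_XIVE_EQ_INDEX/SERVER/PRIO macros added below (the helper names
are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helpers mirroring SPAPR_XIVE_EQ_INDEX/_SERVER/_PRIO */
static uint32_t eq_index(uint32_t server, uint8_t prio)
{
    return (server << 3) | (prio & 0x7);
}

static uint32_t eq_server(uint32_t eq_idx)
{
    return eq_idx >> 3;
}

static uint8_t eq_prio(uint32_t eq_idx)
{
    return eq_idx & 0x7;
}

/* The 3-bit mask silently aliases priorities above 7, so validate first */
static uint32_t eq_index_checked(uint32_t server, uint8_t prio)
{
    assert(prio <= 7);   /* XIVE_PRIORITY_MAX */
    return eq_index(server, prio);
}
```

Because the priority is masked to 3 bits, an out-of-range priority
would map onto another queue of the same server, so callers are
expected to check it against XIVE_PRIORITY_MAX before encoding.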

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v2 :

 - introduced the XiveFabric interface

 hw/intc/spapr_xive.c        | 31 +++++++++++++++++---
 hw/intc/xive.c              | 71 +++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr_xive.h |  7 +++++
 include/hw/ppc/xive.h       |  8 +++++
 include/hw/ppc/xive_regs.h  | 48 ++++++++++++++++++++++++++++++
 5 files changed, 161 insertions(+), 4 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index f07832bf0a00..d0d5a7d7f969 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -27,15 +27,30 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
     monitor_printf(mon, "IVE Table\n");
     for (i = 0; i < xive->nr_irqs; i++) {
         XiveIVE *ive = &xive->ivt[i];
+        uint32_t eq_idx;
 
         if (!(ive->w & IVE_VALID)) {
             continue;
         }
 
-        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
-                       ive->w & IVE_MASKED ? "M" : " ",
-                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
-                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
+        eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+
+        monitor_printf(mon, "  %6x %s eqidx:%03d ", i,
+                       ive->w & IVE_MASKED ? "M" : " ", eq_idx);
+
+        if (!(ive->w & IVE_MASKED)) {
+            XiveEQ *eq;
+
+            eq = xive_fabric_get_eq(XIVE_FABRIC(xive), eq_idx);
+            if (eq && (eq->w0 & EQ_W0_VALID)) {
+                xive_eq_pic_print_info(eq, mon);
+                monitor_printf(mon, " data:%08x",
+                               (int) GETFIELD(IVE_EQ_DATA, ive->w));
+            } else {
+                monitor_printf(mon, "no eq ?!");
+            }
+        }
+        monitor_printf(mon, "\n");
     }
 }
 
@@ -128,6 +143,13 @@ static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
     return cpu ? XIVE_NVT(cpu->intc) : NULL;
 }
 
+static XiveEQ *spapr_xive_get_eq(XiveFabric *xf, uint32_t eq_idx)
+{
+    XiveNVT *nvt = xive_fabric_get_nvt(xf, SPAPR_XIVE_EQ_SERVER(eq_idx));
+
+    return xive_nvt_eq_get(nvt, SPAPR_XIVE_EQ_PRIO(eq_idx));
+}
+
 static const VMStateDescription vmstate_spapr_xive_ive = {
     .name = TYPE_SPAPR_XIVE "/ive",
     .version_id = 1,
@@ -168,6 +190,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
 
     xfc->get_ive = spapr_xive_get_ive;
     xfc->get_nvt = spapr_xive_get_nvt;
+    xfc->get_eq = spapr_xive_get_eq;
 }
 
 static const TypeInfo spapr_xive_info = {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 5691bb9474e4..2ab37fde80e8 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -19,6 +19,47 @@
 #include "hw/ppc/xive_regs.h"
 
 /*
+ * XiveEQ helpers
+ */
+
+XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority)
+{
+    if (!nvt || priority > XIVE_PRIORITY_MAX) {
+        return NULL;
+    }
+    return &nvt->eqt[priority];
+}
+
+void xive_eq_reset(XiveEQ *eq)
+{
+    memset(eq, 0, sizeof(*eq));
+
+    /* switch off the escalation and notification ESBs */
+    eq->w1 = EQ_W1_ESe_Q | EQ_W1_ESn_Q;
+}
+
+void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon)
+{
+    uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
+    uint32_t qindex = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
+    uint32_t qgen = GETFIELD(EQ_W1_GENERATION, eq->w1);
+    uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
+    uint32_t qentries = 1 << (qsize + 10);
+
+    uint32_t server = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
+    uint8_t priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
+
+    monitor_printf(mon, "%c%c%c%c%c prio:%d server:%03d eq:@%08"PRIx64
+                   "% 6d/%5d ^%d",
+                   eq->w0 & EQ_W0_VALID ? 'v' : '-',
+                   eq->w0 & EQ_W0_ENQUEUE ? 'q' : '-',
+                   eq->w0 & EQ_W0_UCOND_NOTIFY ? 'n' : '-',
+                   eq->w0 & EQ_W0_BACKLOG ? 'b' : '-',
+                   eq->w0 & EQ_W0_ESCALATE_CTL ? 'e' : '-',
+                   priority, server, qaddr_base, qindex, qentries, qgen);
+}
+
+/*
  * XIVE Interrupt Presenter
  */
 
@@ -210,8 +251,12 @@ void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
 static void xive_nvt_reset(void *dev)
 {
     XiveNVT *nvt = XIVE_NVT(dev);
+    int i;
 
     memset(nvt->regs, 0, sizeof(nvt->regs));
+    for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
+        xive_eq_reset(&nvt->eqt[i]);
+    }
 }
 
 static void xive_nvt_realize(DeviceState *dev, Error **errp)
@@ -259,12 +304,31 @@ static void xive_nvt_init(Object *obj)
     nvt->ring_os = &nvt->regs[TM_QW1_OS];
 }
 
+static const VMStateDescription vmstate_xive_nvt_eq = {
+    .name = TYPE_XIVE_NVT "/eq",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32(w0, XiveEQ),
+        VMSTATE_UINT32(w1, XiveEQ),
+        VMSTATE_UINT32(w2, XiveEQ),
+        VMSTATE_UINT32(w3, XiveEQ),
+        VMSTATE_UINT32(w4, XiveEQ),
+        VMSTATE_UINT32(w5, XiveEQ),
+        VMSTATE_UINT32(w6, XiveEQ),
+        VMSTATE_UINT32(w7, XiveEQ),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
 static const VMStateDescription vmstate_xive_nvt = {
     .name = TYPE_XIVE_NVT,
     .version_id = 1,
     .minimum_version_id = 1,
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(regs, XiveNVT),
+        VMSTATE_STRUCT_ARRAY(eqt, XiveNVT, (XIVE_PRIORITY_MAX + 1), 1,
+                             vmstate_xive_nvt_eq, XiveEQ),
         VMSTATE_END_OF_LIST()
     },
 };
@@ -305,6 +369,13 @@ XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
     return xfc->get_nvt(xf, server);
 }
 
+XiveEQ *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx)
+{
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
+
+    return xfc->get_eq(xf, eq_idx);
+}
+
 static void xive_fabric_route(XiveFabric *xf, int lisn)
 {
 
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 25d78eec884d..7cb3561aa3d3 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -36,4 +36,11 @@ bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 
+/*
+ * sPAPR encoding of EQ indexes
+ */
+#define SPAPR_XIVE_EQ_INDEX(server, prio)  (((server) << 3) | ((prio) & 0x7))
+#define SPAPR_XIVE_EQ_SERVER(eq_idx) ((eq_idx) >> 3)
+#define SPAPR_XIVE_EQ_PRIO(eq_idx)   ((eq_idx) & 0x7)
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 1a2da610d91c..6cc02638c677 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -176,12 +176,18 @@ typedef struct XiveNVT {
 
     /* Shortcuts to rings */
     uint8_t   *ring_os;
+
+    XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
 } XiveNVT;
 
 extern const MemoryRegionOps xive_tm_user_ops;
 extern const MemoryRegionOps xive_tm_os_ops;
 
 void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
+XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority);
+
+void xive_eq_reset(XiveEQ *eq);
+void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon);
 
 /*
  * XIVE Fabric
@@ -205,9 +211,11 @@ typedef struct XiveFabricClass {
 
     XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
     XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
+    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
 } XiveFabricClass;
 
 XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
 XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
+XiveEQ  *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx);
 
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index f2e2a1ac8f6e..bcc44e766db9 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -112,6 +112,54 @@ typedef struct XiveIVE {
 #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
 } XiveIVE;
 
+/* EQ */
+typedef struct XiveEQ {
+        uint32_t        w0;
+#define EQ_W0_VALID             PPC_BIT32(0) /* "v" bit */
+#define EQ_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
+#define EQ_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
+#define EQ_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
+#define EQ_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
+#define EQ_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
+#define EQ_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
+#define EQ_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
+#define EQ_W0_QSIZE             PPC_BITMASK32(12, 15)
+#define EQ_W0_SW0               PPC_BIT32(16)
+#define EQ_W0_FIRMWARE          EQ_W0_SW0 /* Owned by FW */
+#define EQ_QSIZE_4K             0
+#define EQ_QSIZE_64K            4
+#define EQ_W0_HWDEP             PPC_BITMASK32(24, 31)
+        uint32_t        w1;
+#define EQ_W1_ESn               PPC_BITMASK32(0, 1)
+#define EQ_W1_ESn_P             PPC_BIT32(0)
+#define EQ_W1_ESn_Q             PPC_BIT32(1)
+#define EQ_W1_ESe               PPC_BITMASK32(2, 3)
+#define EQ_W1_ESe_P             PPC_BIT32(2)
+#define EQ_W1_ESe_Q             PPC_BIT32(3)
+#define EQ_W1_GENERATION        PPC_BIT32(9)
+#define EQ_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
+        uint32_t        w2;
+#define EQ_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
+#define EQ_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
+        uint32_t        w3;
+#define EQ_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
+        uint32_t        w4;
+#define EQ_W4_ESC_EQ_BLOCK      PPC_BITMASK32(4, 7)
+#define EQ_W4_ESC_EQ_INDEX      PPC_BITMASK32(8, 31)
+        uint32_t        w5;
+#define EQ_W5_ESC_EQ_DATA       PPC_BITMASK32(1, 31)
+        uint32_t        w6;
+#define EQ_W6_FORMAT_BIT        PPC_BIT32(8)
+#define EQ_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
+#define EQ_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
+        uint32_t        w7;
+#define EQ_W7_F0_IGNORE         PPC_BIT32(0)
+#define EQ_W7_F0_BLK_GROUPING   PPC_BIT32(1)
+#define EQ_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
+#define EQ_W7_F1_WAKEZ          PPC_BIT32(0)
+#define EQ_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
+} XiveEQ;
+
 #define XIVE_PRIORITY_MAX  7
 
 #endif /* _INTC_XIVE_INTERNAL_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 08/35] spapr: push the XIVE EQ data in OS event queue
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (6 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 09/35] spapr: notify the CPU when the XIVE interrupt priority is more privileged Cédric Le Goater
                   ` (27 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

When a notification is let through by the routing engine, the Event
Queue data defined in the associated IVE is pushed into the in-memory
event queue. The latter is a circular buffer provided by the OS, one
per (server, priority) pair. Each Event Queue entry is 4 bytes long:
the first (most significant) bit is a 'generation' bit and the
following 31 bits hold the EQ Data field.
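
For illustration, a host-side sketch of the producer and consumer of
such a queue (not the patch code; all names are hypothetical). The
producer mirrors xive_eq_push() below. The consumer convention and the
initial generation value of 1 are assumptions: they depend on how the
guest and the queue configuration hcall set things up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified event queue; entries are kept in host order here, whereas
 * the patch writes them big-endian into guest memory with DMA. */
struct evq {
    uint32_t *ring;
    uint32_t entries;   /* 1 << (qsize + 10) in the model */
    uint32_t index;     /* next slot to fill (EQ_W1_PAGE_OFF) */
    uint32_t gen;       /* generation bit (EQ_W1_GENERATION) */
};

/* Producer side, as in xive_eq_push() */
static void evq_push(struct evq *q, uint32_t data)
{
    assert(q->entries != 0);
    q->ring[q->index] = (q->gen << 31) | (data & 0x7fffffff);
    q->index = (q->index + 1) % q->entries;
    if (q->index == 0) {
        q->gen ^= 1;    /* wrapped: flip the generation bit */
    }
}

/* Consumer side (assumed guest convention): a slot holds a new entry
 * when its top bit differs from 'toggle', which starts at 0 and flips
 * on each wrap. */
struct evq_reader {
    uint32_t index;
    uint32_t toggle;
};

static bool evq_pop(const struct evq *q, struct evq_reader *r, uint32_t *data)
{
    uint32_t e = q->ring[r->index];

    if ((e >> 31) == r->toggle) {
        return false;   /* nothing new in this slot */
    }
    *data = e & 0x7fffffff;
    r->index = (r->index + 1) % q->entries;
    if (r->index == 0) {
        r->toggle ^= 1;
    }
    return true;
}
```

Because the generation bit flips on every wrap, the consumer never has
to clear consumed slots: stale entries from the previous pass still
carry the old generation and are ignored.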

The EQ Data field is a way to set an invariant logical event source
number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG hcall
when the EISN flag is used.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v2 :

 - used dma_memory_write() to push EQ data
 - introduced the XiveFabric interface, to generalize the routing algo.

 hw/intc/xive.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 65 insertions(+)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 2ab37fde80e8..420cc6703b88 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -59,6 +59,31 @@ void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon)
                    priority, server, qaddr_base, qindex, qentries, qgen);
 }
 
+static void xive_eq_push(XiveEQ *eq, uint32_t data)
+{
+    uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
+    uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
+    uint32_t qindex = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
+    uint32_t qgen = GETFIELD(EQ_W1_GENERATION, eq->w1);
+
+    uint64_t qaddr = qaddr_base + (qindex << 2);
+    uint32_t qdata = cpu_to_be32((qgen << 31) | (data & 0x7fffffff));
+    uint32_t qentries = 1 << (qsize + 10);
+
+    if (dma_memory_write(&address_space_memory, qaddr, &qdata, sizeof(qdata))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write EQ data @0x%"
+                      HWADDR_PRIx "\n", qaddr);
+        return;
+    }
+
+    qindex = (qindex + 1) % qentries;
+    if (qindex == 0) {
+        qgen ^= 1;
+        eq->w1 = SETFIELD(EQ_W1_GENERATION, eq->w1, qgen);
+    }
+    eq->w1 = SETFIELD(EQ_W1_PAGE_OFF, eq->w1, qindex);
+}
+
 /*
  * XIVE Interrupt Presenter
  */
@@ -378,7 +403,47 @@ XiveEQ *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx)
 
 static void xive_fabric_route(XiveFabric *xf, int lisn)
 {
+    XiveIVE *ive;
+    XiveEQ *eq;
+    uint32_t eq_idx;
+    uint8_t priority;
 
+    ive = xive_fabric_get_ive(xf, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
+        return;
+    }
+
+    if (ive->w & IVE_MASKED) {
+        return;
+    }
+
+    /* Find our XiveEQ */
+    eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+    eq = xive_fabric_get_eq(xf, eq_idx);
+    if (!eq || !(eq->w0 & EQ_W0_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No EQ for LISN %d\n", lisn);
+        return;
+    }
+
+    if (eq->w0 & EQ_W0_ENQUEUE) {
+        xive_eq_push(eq, GETFIELD(IVE_EQ_DATA, ive->w));
+    }
+
+    if (!(eq->w0 & EQ_W0_UCOND_NOTIFY)) {
+        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
+    }
+
+    if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
+        priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
+
+        /* The EQ is masked. Can this happen ?  */
+        if (priority == 0xff) {
+            g_assert_not_reached();
+        }
+    } else {
+        qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
+    }
 }
 
 static const TypeInfo xive_fabric_info = {
-- 
2.13.6

* [Qemu-devel] [PATCH v3 09/35] spapr: notify the CPU when the XIVE interrupt priority is more privileged
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (7 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 08/35] spapr: push the XIVE EQ data in OS event queue Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 10/35] spapr: add support for the SET_OS_PENDING command (XIVE) Cédric Le Goater
                   ` (26 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

After the event is pushed into the XIVE EQ, the presenter engine sets
the bit corresponding to the priority of the pending interrupt in the
IPB (Interrupt Pending Buffer) register to indicate that an event is
pending in one of the 8 priority queues. The Pending Interrupt
Priority Register (PIPR) is also updated from the IPB. This register
represents the priority of the most favored pending notification.
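
The two conversions can be sketched standalone as follows (not the
patch code itself; the names are hypothetical). A priority p, where 0
is the most favored, sets IPB bit 0x80 >> p, and the PIPR is the
smallest pending priority, or 0xff when the IPB is empty. The loop is
a portable equivalent of the clz32((uint32_t)ipb << 24) expression
used in ipb_to_pipr() below.

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_MAX 7   /* XIVE_PRIORITY_MAX */

/* Priority p (0 = most favored) -> IPB bit 0x80 >> p; 0 if out of range */
static uint8_t prio_to_ipb(uint8_t prio)
{
    return prio > PRIO_MAX ? 0 : (uint8_t)(1u << (PRIO_MAX - prio));
}

/* IPB -> PIPR: most favored pending priority, 0xff when nothing is pending */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t p = 0; p <= PRIO_MAX; p++) {
        if (ipb & prio_to_ipb(p)) {
            return p;
        }
    }
    return 0xff;
}
```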

The PIPR is then compared to the Current Processor Priority Register
(CPPR). If it is more favored (numerically lower), the CPU interrupt
line is raised and the EO bit of the Notification Source Register
(NSR) is set to signal an exception to the O/S. This check needs to
be done whenever the PIPR or the CPPR changes.

The O/S acknowledges the interrupt with a special load in the Thread
Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
takes the value of the PIPR. The IPB bit corresponding to the priority
of the pending interrupt is then reset, and so is the EO bit of the
NSR.
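
The notify/acknowledge cycle described above can be modelled with a
small standalone state machine. This is a sketch under simplifying
assumptions (hypothetical names, a single OS ring, and a boolean in
place of the CPU interrupt input), loosely following
xive_nvt_notify() and xive_nvt_accept() below.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PRIO_MAX 7      /* XIVE_PRIORITY_MAX */
#define NSR_EO   0x80   /* TM_QW1_NSR_EO, i.e. PPC_BIT8(0) */

/* Simplified OS ring (QW1) of a thread context; 'irq_line' stands in
 * for the CPU external interrupt input. */
struct os_ring {
    uint8_t nsr;
    uint8_t cppr;
    uint8_t ipb;
    uint8_t pipr;
    bool irq_line;
};

static uint8_t prio_to_ipb(uint8_t prio)
{
    return prio > PRIO_MAX ? 0 : (uint8_t)(1u << (PRIO_MAX - prio));
}

static uint8_t ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t p = 0; p <= PRIO_MAX; p++) {
        if (ipb & prio_to_ipb(p)) {
            return p;
        }
    }
    return 0xff;
}

/* Raise the line if the most favored pending priority beats the CPPR */
static void ring_notify(struct os_ring *r)
{
    if (r->pipr < r->cppr) {
        r->nsr |= NSR_EO;
        r->irq_line = true;
    }
}

/* Routing side: an event was queued at 'prio' */
static void ring_post(struct os_ring *r, uint8_t prio)
{
    r->ipb |= prio_to_ipb(prio);
    r->pipr = ipb_to_pipr(r->ipb);
    ring_notify(r);
}

/* O/S acknowledge (special TIMA load): returns NSR << 8 | CPPR */
static uint16_t ring_accept(struct os_ring *r)
{
    uint8_t nsr = r->nsr;

    r->irq_line = false;
    if (r->nsr & NSR_EO) {
        r->cppr = r->pipr;
        r->ipb &= (uint8_t)~prio_to_ipb(r->cppr);
        r->pipr = ipb_to_pipr(r->ipb);
        r->nsr &= (uint8_t)~NSR_EO;
    }
    return (uint16_t)((nsr << 8) | r->cppr);
}

/* O/S CPPR store (e.g. on EOI): re-run the check */
static void ring_set_cppr(struct os_ring *r, uint8_t cppr)
{
    r->cppr = cppr;
    ring_notify(r);
}
```

With a CPPR of 5 in effect, an event at priority 6 stays pending in
the IPB until the CPPR is relaxed, which is why the patch calls
xive_nvt_notify() from xive_nvt_set_cppr() as well as from the
routing path.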

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v2 :

 - introduced the XiveFabric interface

 hw/intc/xive.c | 78 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 77 insertions(+), 1 deletion(-)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 420cc6703b88..8d0e77cac12a 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -88,9 +88,63 @@ static void xive_eq_push(XiveEQ *eq, uint32_t data)
  * XIVE Interrupt Presenter
  */
 
+/* Convert a priority number to an Interrupt Pending Buffer (IPB)
+ * register, which indicates a pending interrupt at the priority
+ * corresponding to the bit number
+ */
+static uint8_t priority_to_ipb(uint8_t priority)
+{
+    return priority > XIVE_PRIORITY_MAX ?
+        0 : 1 << (XIVE_PRIORITY_MAX - priority);
+}
+
+/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
+ * Interrupt Priority Register (PIPR), which contains the priority of
+ * the most favored pending notification.
+ */
+static uint8_t ipb_to_pipr(uint8_t ipb)
+{
+    return ipb ? clz32((uint32_t)ipb << 24) : 0xff;
+}
+
+/* Update the IPB (Interrupt Pending Buffer) with the priority
+ * of the new notification and recompute the PIPR. The NVT then
+ * decides whether to raise the exception, depending on the CPPR.
+ */
+static void xive_nvt_ipb_update(XiveNVT *nvt, uint8_t priority)
+{
+    nvt->ring_os[TM_IPB] |= priority_to_ipb(priority);
+    nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+}
+
 static uint64_t xive_nvt_accept(XiveNVT *nvt)
 {
-    return 0;
+    uint8_t nsr = nvt->ring_os[TM_NSR];
+
+    qemu_irq_lower(nvt->output);
+
+    if (nvt->ring_os[TM_NSR] & TM_QW1_NSR_EO) {
+        uint8_t cppr = nvt->ring_os[TM_PIPR];
+
+        nvt->ring_os[TM_CPPR] = cppr;
+
+        /* Reset the pending buffer bit */
+        nvt->ring_os[TM_IPB] &= ~priority_to_ipb(cppr);
+        nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+
+        /* Drop Exception bit for OS */
+        nvt->ring_os[TM_NSR] &= ~TM_QW1_NSR_EO;
+    }
+
+    return (nsr << 8) | nvt->ring_os[TM_CPPR];
+}
+
+static void xive_nvt_notify(XiveNVT *nvt)
+{
+    if (nvt->ring_os[TM_PIPR] < nvt->ring_os[TM_CPPR]) {
+        nvt->ring_os[TM_NSR] |= TM_QW1_NSR_EO;
+        qemu_irq_raise(nvt->output);
+    }
 }
 
 static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
@@ -100,6 +154,10 @@ static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
     }
 
     nvt->ring_os[TM_CPPR] = cppr;
+
+    /* CPPR has changed, check if we need to redistribute a pending
+     * exception */
+    xive_nvt_notify(nvt);
 }
 
 /*
@@ -279,6 +337,12 @@ static void xive_nvt_reset(void *dev)
     int i;
 
     memset(nvt->regs, 0, sizeof(nvt->regs));
+    /*
+     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
+     * CPPR is first set.
+     */
+    nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+
     for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
         xive_eq_reset(&nvt->eqt[i]);
     }
@@ -407,6 +471,8 @@ static void xive_fabric_route(XiveFabric *xf, int lisn)
     XiveEQ *eq;
     uint32_t eq_idx;
     uint8_t priority;
+    XiveNVT *nvt;
+    uint32_t nvt_idx;
 
     ive = xive_fabric_get_ive(xf, lisn);
     if (!ive || !(ive->w & IVE_VALID)) {
@@ -434,6 +500,13 @@ static void xive_fabric_route(XiveFabric *xf, int lisn)
         qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
     }
 
+    nvt_idx = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
+    nvt = xive_fabric_get_nvt(xf, nvt_idx);
+    if (!nvt) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No NVT for idx %d\n", nvt_idx);
+        return;
+    }
+
     if (GETFIELD(EQ_W6_FORMAT_BIT, eq->w6) == 0) {
         priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
 
@@ -441,9 +514,12 @@ static void xive_fabric_route(XiveFabric *xf, int lisn)
         if (priority == 0xff) {
             g_assert_not_reached();
         }
+        xive_nvt_ipb_update(nvt, priority);
     } else {
         qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
     }
+
+    xive_nvt_notify(nvt);
 }
 
 static const TypeInfo xive_fabric_info = {
-- 
2.13.6

* [Qemu-devel] [PATCH v3 10/35] spapr: add support for the SET_OS_PENDING command (XIVE)
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (8 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 09/35] spapr: notify the CPU when the XIVE interrupt priority is more privileged Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 11/35] spapr: introduce a 'xive_exploitation' option to enable XIVE Cédric Le Goater
                   ` (25 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

This command allows the O/S to adjust the IPB so that a CPU can
process event queues of other priorities during one physical interrupt
cycle. It is not currently used by the XIVE native exploitation driver
for sPAPR in Linux, but it is used by the hypervisor.

More from Ben :

  It's a way to avoid the SW replay on EOI.

  I.e., assume you have 2 interrupts in the queue. You take the
  exception, ack the first one, process it, etc. Then you EOI; the HW
  won't send a second notification. You need to look at the queue and
  continue consuming until it's empty.

  Today Linux checks the queue on EOI and uses a SW mechanism to
  synthesize a new pseudo-external interrupt.

  This MMIO command would instead allow the OS to set the
  corresponding priority bit back to 1 in the IPB and have the HW
  re-emit the interrupt, rather than doing it in SW.

  Linux doesn't use this today because DD1 didn't support it at the
  HV level, but other OSes might, and we might also use it when we do
  groups, thus allowing redistribution.
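To make the mechanism Ben describes concrete, here is a minimal, self-contained C sketch of the O/S ring of a thread context. It is not QEMU code: the `OsRing` struct and the helpers `priority_to_ipb()`, `ipb_to_pipr()`, `tm_ack()`, `tm_set_cppr()` and `tm_set_os_pending()` are hypothetical. The model assumes the usual XIVE conventions that priority 0 is the most favored, that the IPB bit for priority p is `1 << (7 - p)`, and that an exception is raised when the pending priority (PIPR) is more favored than the CPPR. The SET_OS_PENDING path roughly mirrors what the patch does with `xive_nvt_ipb_update()` followed by `xive_nvt_notify()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7   /* assumed: priorities 0 (most favored) to 7 */

/* Hypothetical, simplified view of the O/S ring of a thread context */
typedef struct {
    uint8_t ipb;      /* Interrupt Pending Buffer, one bit per priority */
    uint8_t cppr;     /* Current Processor Priority Register */
    uint8_t pipr;     /* Pending Interrupt Priority Register */
    bool exception;   /* external exception raised to the CPU */
} OsRing;

static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > XIVE_PRIORITY_MAX ?
        0 : (uint8_t)(1u << (XIVE_PRIORITY_MAX - priority));
}

/* Most favored pending priority, or 0xff when nothing is pending */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= XIVE_PRIORITY_MAX; prio++) {
        if (ipb & priority_to_ipb(prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Recompute PIPR and raise the exception if it beats the CPPR */
static void tm_notify(OsRing *r)
{
    r->pipr = ipb_to_pipr(r->ipb);
    r->exception = r->pipr < r->cppr;
}

/* HW notification, or a 1-byte store to TM_SPC_SET_OS_PENDING */
static void tm_set_os_pending(OsRing *r, uint8_t priority)
{
    r->ipb |= priority_to_ipb(priority);
    tm_notify(r);
}

/* Ack: accept the pending priority and clear its IPB bit */
static uint8_t tm_ack(OsRing *r)
{
    uint8_t prio = r->pipr;

    assert(prio <= XIVE_PRIORITY_MAX);  /* only ack when something is pending */
    r->cppr = prio;
    r->ipb &= (uint8_t)~priority_to_ipb(prio);
    r->pipr = ipb_to_pipr(r->ipb);
    r->exception = false;  /* remaining bits are less favored than CPPR */
    return prio;
}

/* In this sketch, EOI is modeled as the O/S restoring its CPPR */
static void tm_set_cppr(OsRing *r, uint8_t cppr)
{
    r->cppr = cppr;
    tm_notify(r);
}
```

Replaying Ben's example in this model: a priority-5 notification sets IPB bit 0x04 and raises the exception; the ack moves the CPPR to 5 and clears the bit; a SET_OS_PENDING store for priority 5 sets the bit again without raising anything while the CPPR is still 5; and restoring the CPPR to 0xff makes the model raise the exception once more, with no software replay.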

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xive.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 8d0e77cac12a..20e216f03c5b 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -214,9 +214,22 @@ static bool xive_tm_is_readonly(uint8_t offset)
 static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
                                         uint64_t value, unsigned size)
 {
-    /* TODO: support TM_SPC_SET_OS_PENDING */
+    switch (offset) {
+    case TM_SPC_SET_OS_PENDING:
+        if (size == 1) {
+            xive_nvt_ipb_update(nvt, value & 0xff);
+            xive_nvt_notify(nvt);
+        }
+        break;
+    case TM_SPC_ACK_OS_EL:  /* TODO */
+        qemu_log_mask(LOG_UNIMP, "XIVE: no command to acknowledge O/S "
+                      "Interrupt to even O/S reporting line\n");
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA write @%"
+                      HWADDR_PRIx" size %d\n", offset, size);
+    }
 
-    /* TODO: support TM_SPC_ACK_OS_EL */
 }
 
 static void xive_tm_os_write(void *opaque, hwaddr offset,
-- 
2.13.6

* [Qemu-devel] [PATCH v3 11/35] spapr: introduce a 'xive_exploitation' option to enable XIVE
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Also provide a 'both' option to activate both interrupt modes: XIVE
exploitation and legacy (XICS).
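The getter and setter in this patch map the three strings onto the byte values 0x80, 0x40 and 0 by hand, in two separate places. A table-driven variant keeps both directions in one place. The sketch below is plain C with hypothetical names (`XiveModeName`, `xive_mode_parse()`, `xive_mode_name()`); it is not QEMU's QOM property API and only mirrors the values used in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Same byte values as spapr_{get,set}_xive_exploitation() in the patch */
typedef struct {
    const char *name;
    uint8_t value;
} XiveModeName;

static const XiveModeName xive_modes[] = {
    { "both", 0x80 },   /* XIVE exploitation and legacy XICS */
    { "on",   0x40 },   /* XIVE exploitation only */
    { "off",  0x00 },   /* legacy XICS only */
};

#define N_XIVE_MODES (sizeof(xive_modes) / sizeof(xive_modes[0]))

/* Parse a property string; returns false on an unknown value */
static bool xive_mode_parse(const char *str, uint8_t *value)
{
    assert(str != NULL && value != NULL);
    for (size_t i = 0; i < N_XIVE_MODES; i++) {
        if (strcmp(str, xive_modes[i].name) == 0) {
            *value = xive_modes[i].value;
            return true;
        }
    }
    return false;
}

/* Reverse lookup; NULL means the stored byte is not a valid mode */
static const char *xive_mode_name(uint8_t value)
{
    for (size_t i = 0; i < N_XIVE_MODES; i++) {
        if (xive_modes[i].value == value) {
            return xive_modes[i].name;
        }
    }
    return NULL;
}
```

In QEMU, a setter built on such a helper would still report an unknown string through `error_setg()`, as the patch already does.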

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 Changes since v2:

 - changed the option to a string: "both|off|on"
 - the option is no longer enabled by default. To be discussed.

 hw/ppc/spapr.c         | 41 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |  1 +
 2 files changed, 42 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a81570e7c8b1..b459c0076792 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2907,6 +2907,39 @@ static void spapr_set_vsmt(Object *obj, Visitor *v, const char *name,
     visit_type_uint32(v, name, (uint32_t *)opaque, errp);
 }
 
+static char *spapr_get_xive_exploitation(Object *obj, Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+
+    switch (spapr->xive_exploitation) {
+    case 0x80:
+        return g_strdup("both");
+    case 0x40:
+        return g_strdup("on");
+    case 0x0:
+        return g_strdup("off");
+    }
+    g_assert_not_reached();
+}
+
+static void spapr_set_xive_exploitation(Object *obj, const char *value,
+                                        Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
+
+    /* TODO: Don't let older machines activate XIVE */
+
+    if (strcmp(value, "both") == 0) {
+        spapr->xive_exploitation = 0x80;
+    } else if (strcmp(value, "on") == 0) {
+        spapr->xive_exploitation = 0x40;
+    } else if (strcmp(value, "off") == 0) {
+        spapr->xive_exploitation = 0;
+    } else {
+        error_setg(errp, "Bad value for \"xive-exploitation\" property");
+    }
+}
+
 static void spapr_instance_init(Object *obj)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(obj);
@@ -2944,6 +2977,14 @@ static void spapr_instance_init(Object *obj)
                                     " the host's SMT mode", &error_abort);
     object_property_add_bool(obj, "vfio-no-msix-emulation",
                              spapr_get_msix_emulation, NULL, NULL);
+    spapr->xive_exploitation = false;
+    object_property_add_str(obj, "xive-exploitation",
+                            spapr_get_xive_exploitation,
+                            spapr_set_xive_exploitation,
+                            NULL);
+    object_property_set_description(obj, "xive-exploitation",
+                                    "XIVE exploitation mode POWER9",
+                                    NULL);
 }
 
 static void spapr_machine_finalizefn(Object *obj)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d60b7c6d7a8b..3f8980310492 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -165,6 +165,7 @@ struct sPAPRMachineState {
     MemoryHotplugState hotplug_memory;
 
     const char *icp_type;
+    uint8_t xive_exploitation;
 
     bool cmd_line_caps[SPAPR_CAP_NUM];
     sPAPRCapabilities def, eff, mig;
-- 
2.13.6

* [Qemu-devel] [PATCH v3 12/35] spapr: add a sPAPRXive object to the machine
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The sPAPRXive object is designed to be always available, so it is
created unconditionally on newer machines. Depending on the
configuration and the guest capabilities, the CAS negotiation process
will decide which interrupt mode to activate: legacy or XIVE
exploitation.

The XIVE model makes use of the full range of the IRQ number space.
The IRQ numbers for the CPU IPIs in XIVE are allocated at the bottom
of this space, below XICS_IRQ_BASE, to preserve compatibility with
XICS which does not use that range.

That leaves us with 4K possible IPIs. This should be enough for some
time, given that the maximum number of CPUs is 1024 for the sPAPR
machine under QEMU. For the record, the biggest POWER8 or POWER9
system has a maximum of 1536 HW threads (16 sockets, 192 cores, SMT8).

Also make sure that the allocated IRQ numbers are kept in sync between
XICS and XIVE, when XIVE is available.
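The IRQ number layout described above can be checked with a little arithmetic. The sketch assumes `XICS_IRQ_BASE` is 4096 (consistent with the "4K possible IPIs" above) and `XICS_IRQS_SPAPR` is 1024; both values are stated here as assumptions rather than taken from this patch. The helpers `xive_irq_is_ipi()` and `xive_irq_is_xics_range()` are hypothetical. As in `spapr_machine_init()`, the IPI of server number i is IRQ number i, and XIVE is sized with `XICS_IRQ_BASE + XICS_IRQS_SPAPR` sources.

```c
#include <assert.h>
#include <stdbool.h>

#define XICS_IRQ_BASE   4096   /* assumed: first IRQ number used by XICS */
#define XICS_IRQS_SPAPR 1024   /* assumed: number of XICS sources */

/* XIVE covers the full range, including the IPI area below XICS_IRQ_BASE */
#define XIVE_NR_IRQS    (XICS_IRQ_BASE + XICS_IRQS_SPAPR)

/* Hypothetical helper: IPIs sit at the bottom of the IRQ number space */
static bool xive_irq_is_ipi(int irq, int max_servers)
{
    assert(max_servers <= XICS_IRQ_BASE);  /* the IPI area must fit */
    return irq >= 0 && irq < max_servers;
}

/* Hypothetical helper: XICS-compatible numbers start at XICS_IRQ_BASE */
static bool xive_irq_is_xics_range(int irq)
{
    return irq >= XICS_IRQ_BASE && irq < XIVE_NR_IRQS;
}
```

With these values, XIVE exposes 5120 sources, and even the 1536 HW threads of the largest system fit in the 4096-entry IPI area without overlapping the XICS range.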

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 Changes since v2:

 - introduced the xive_system_init() routine
 - handled vsmt by moving the allocation of the IPIs after the CPUs
   are initialized

 hw/ppc/spapr.c         | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h |  2 ++
 2 files changed, 65 insertions(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b459c0076792..8bbd2a677935 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -56,6 +56,7 @@
 #include "hw/ppc/spapr_vio.h"
 #include "hw/pci-host/spapr.h"
 #include "hw/ppc/xics.h"
+#include "hw/ppc/spapr_xive.h"
 #include "hw/pci/msi.h"
 
 #include "hw/pci/pci.h"
@@ -209,6 +210,48 @@ static void xics_system_init(MachineState *machine, int nr_irqs, Error **errp)
     }
 }
 
+static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
+                                    const char *type_xive, int nr_irqs,
+                                    Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(type_xive);
+    object_property_add_child(OBJECT(spapr), "xive", obj, &error_abort);
+    object_property_set_int(obj, nr_irqs, "nr-irqs",  &local_err);
+    if (local_err) {
+        goto error;
+    }
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        goto error;
+    }
+
+    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+    return SPAPR_XIVE(obj);
+error:
+    error_propagate(errp, local_err);
+    return NULL;
+}
+
+static void xive_system_init(MachineState *machine, int nr_irqs, Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
+
+    /* We don't have KVM support yet, so check for irqchip=on */
+    if (kvm_enabled() && machine_kernel_irqchip_required(machine)) {
+        error_report("kernel_irqchip requested. no XIVE support");
+        exit(1);
+    }
+
+    if (spapr->xive) {
+        return;
+    }
+
+    spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, errp);
+}
+
 static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
                                   int smt_threads)
 {
@@ -2473,6 +2516,12 @@ static void spapr_machine_init(MachineState *machine)
     /* Set up Interrupt Controller before we create the VCPUs */
     xics_system_init(machine, XICS_IRQS_SPAPR, &error_fatal);
 
+    if (spapr->xive_exploitation) {
+        /* XIVE uses the full range of IRQ numbers. */
+        xive_system_init(machine, XICS_IRQ_BASE + XICS_IRQS_SPAPR,
+                         &error_fatal);
+    }
+
     /* Set up containers for ibm,client-architecture-support negotiated options
      */
     spapr->ov5 = spapr_ovec_new();
@@ -2503,6 +2552,14 @@ static void spapr_machine_init(MachineState *machine)
     /* init CPUs */
     spapr_init_cpus(spapr);
 
+    /* Allocate the first IRQ numbers for the CPU IPIs, below
+     * XICS_IRQ_BASE, which is unused by XICS. */
+    if (spapr->xive_exploitation) {
+        for (i = 0; i < xics_max_server_number(spapr); ++i) {
+            spapr_xive_irq_enable(spapr->xive, i, false);
+        }
+    }
+
     if (kvm_enabled()) {
         /* Enable H_LOGICAL_CI_* so SLOF can talk to in-kernel devices */
         kvmppc_enable_logical_ci_hcalls();
@@ -3752,6 +3809,9 @@ static int ics_find_free_block(ICSState *ics, int num, int alignnum)
 static void spapr_irq_set_lsi(sPAPRMachineState *spapr, int irq, bool lsi)
 {
     ics_set_irq_type(spapr->ics, irq - spapr->ics->offset, lsi);
+    if (spapr->xive_exploitation) {
+        spapr_xive_irq_enable(spapr->xive, irq, lsi);
+    }
 }
 
 int spapr_irq_alloc(sPAPRMachineState *spapr, int irq_hint, bool lsi,
@@ -3842,6 +3902,9 @@ void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num)
             memset(&ics->irqs[i], 0, sizeof(ICSIRQState));
         }
     }
+    if (spapr->xive_exploitation) {
+        spapr_xive_irq_disable(spapr->xive, irq);
+    }
 }
 
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 3f8980310492..875f658973a1 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -14,6 +14,7 @@ struct sPAPRNVRAM;
 typedef struct sPAPREventLogEntry sPAPREventLogEntry;
 typedef struct sPAPREventSource sPAPREventSource;
 typedef struct sPAPRPendingHPT sPAPRPendingHPT;
+typedef struct sPAPRXive sPAPRXive;
 
 #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
 #define SPAPR_ENTRY_POINT       0x100
@@ -166,6 +167,7 @@ struct sPAPRMachineState {
 
     const char *icp_type;
     uint8_t xive_exploitation;
+    sPAPRXive  *xive;
 
     bool cmd_line_caps[SPAPR_CAP_NUM];
     sPAPRCapabilities def, eff, mig;
-- 
2.13.6

* [Qemu-devel] [PATCH v3 13/35] spapr: add hcalls support for the XIVE exploitation interrupt mode
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE virtualization engines (sources and event queues) are
configured with a set of hypervisor calls:

 - H_INT_GET_SOURCE_INFO

   obtains the address of the MMIO page of the Event State Buffer
   (PQ bits) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines which "target" and "priority" are assigned to a source.

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and
   "priority". It is also used to set the notification configuration
   associated with the queue; only unconditional notification is
   supported for the moment. A reset is performed with a queue size of
   0, which also disables queueing.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the guest's internal interrupt structures to their
   initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issues a synchronization on a source to make sure all notifications
   have reached their queue.

Calls that still need to be addressed:

   H_INT_SET_OS_REPORTING_LINE
   H_INT_GET_OS_REPORTING_LINE

See the code for more documentation on each hcall.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 Changes since v2:

 - introduced the XiveFabric interface
 - reworked h_int_get_source_info() to better support all ESB MMIO settings

 hw/intc/Makefile.objs       |   2 +-
 hw/intc/spapr_xive_hcall.c  | 859 ++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr.c              |   3 +
 include/hw/ppc/spapr.h      |  15 +-
 include/hw/ppc/spapr_xive.h |   4 +
 5 files changed, 881 insertions(+), 2 deletions(-)
 create mode 100644 hw/intc/spapr_xive_hcall.c

diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 301a8e972d91..eacd26836ebf 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -38,7 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
-obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
new file mode 100644
index 000000000000..9f3d579bba2c
--- /dev/null
+++ b/hw/intc/spapr_xive_hcall.c
@@ -0,0 +1,859 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "cpu.h"
+#include "hw/ppc/fdt.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive_regs.h"
+#include "monitor/monitor.h"
+
+
+/*
+ * OPAL uses the priority 7 queue to automatically escalate interrupts
+ * for all other queues (DD2.X POWER9). So only priorities [0..6] are
+ * allowed for the guest.
+ */
+static bool priority_is_valid(uint8_t priority)
+{
+    switch (priority) {
+    case 0 ... 6:
+        return true;
+    case 7: /* OPAL escalation queue */
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %d requested\n",
+                      priority);
+        return false;
+    }
+}
+
+/*
+ * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
+ * real address of the MMIO page through which the Event State Buffer
+ * entry associated with the value of the "lisn" parameter is managed.
+ *
+ * Parameters:
+ * Input
+ * - "flags"
+ *       Bits 0-63 reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *       "ibm,xive-lisn-ranges" properties, or as returned by the
+ *       ibm,query-interrupt-source-number RTAS call, or as returned
+ *       by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output
+ * - R4: "flags"
+ *       Bits 0-59: Reserved
+ *       Bit 60: H_INT_ESB must be used for Event State Buffer
+ *               management
+ *       Bit 61: 1 == LSI  0 == MSI
+ *       Bit 62: the full function page supports trigger
+ *       Bit 63: Store EOI Supported
+ * - R5: Logical Real address of full function Event State Buffer
+ *       management page, -1 if ESB hcall flag is set to 1.
+ * - R6: Logical Real Address of trigger only Event State Buffer
+ *       management page or -1.
+ * - R7: Power of 2 page size for the ESB management pages returned in
+ *       R5 and R6.
+ */
+
+#define SPAPR_XIVE_SRC_H_INT_ESB     PPC_BIT(60) /* ESB managed with H_INT_ESB */
+#define SPAPR_XIVE_SRC_LSI           PPC_BIT(61) /* Virtual LSI type */
+#define SPAPR_XIVE_SRC_TRIGGER       PPC_BIT(62) /* Trigger and management
+                                                    on same page */
+#define SPAPR_XIVE_SRC_STORE_EOI     PPC_BIT(63) /* Store EOI support */
+
+static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          target_ulong opcode,
+                                          target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveIVE *ive;
+    target_ulong flags  = args[0];
+    target_ulong lisn   = args[1];
+    XiveSource *xsrc = &xive->source;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    ive = xive_fabric_get_ive(XIVE_FABRIC(spapr->xive), lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    /* All sources are emulated under the main XIVE object and share
+     * the same characteristics.
+     */
+    args[0] = 0;
+    if (!xive_source_esb_2page(xsrc)) {
+        args[0] |= SPAPR_XIVE_SRC_TRIGGER;
+    }
+    if (xsrc->esb_flags & XIVE_SRC_STORE_EOI) {
+        args[0] |= SPAPR_XIVE_SRC_STORE_EOI;
+    }
+    if (xsrc->esb_flags & XIVE_SRC_H_INT_ESB) {
+        args[0] |= SPAPR_XIVE_SRC_H_INT_ESB;
+    }
+
+    if (xive_source_irq_is_lsi(xsrc, lisn - xsrc->offset)) {
+        args[0] |= SPAPR_XIVE_SRC_LSI;
+    }
+
+    if (!(xsrc->esb_flags & XIVE_SRC_H_INT_ESB)) {
+        args[1] = xive_source_esb_mgmt(xsrc, lisn - xsrc->offset);
+    } else {
+        args[1] = -1;
+    }
+
+    if (xive_source_esb_2page(xsrc)) {
+        args[2] = xive_source_esb_trigger(xsrc, lisn - xsrc->offset);
+    } else {
+        args[2] = -1;
+    }
+
+    args[3] = xsrc->esb_shift;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
+ * Interrupt Source to a target. The Logical Interrupt Source is
+ * designated with the "lisn" parameter and the target is designated
+ * with the "target" and "priority" parameters.  Upon return from the
+ * hcall(), no additional interrupts will be directed to the old EQ.
+ *
+ * TODO: The old EQ should be investigated for interrupts that
+ * occurred prior to or during the hcall().
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-61: Reserved
+ *      Bit 62: set the "eisn" in the EA
+ *      Bit 63: masks the interrupt source in the hardware interrupt
+ *      control structure. An interrupt masked by this mechanism will
+ *      be dropped, but its source state bits will still be
+ *      set. There is no race-free way of unmasking and restoring the
+ *      source. Thus this should only be used on interrupts that are
+ *      also masked at the source, and only in cases where the
+ *      interrupt is not meant to be used for a long time, for
+ *      example because no valid target exists for it.
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as returned by
+ *      the H_ALLOCATE_VAS_WINDOW hcall
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *      "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *      "ibm,plat-res-int-priorities"
+ * - "eisn" is the guest EISN associated with the "lisn"
+ *
+ * Output:
+ * - None
+ */
+
+#define SPAPR_XIVE_SRC_SET_EISN PPC_BIT(62)
+#define SPAPR_XIVE_SRC_MASK     PPC_BIT(63)
+
+static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
+                                            sPAPRMachineState *spapr,
+                                            target_ulong opcode,
+                                            target_ulong *args)
+{
+    XiveIVE *ive;
+    uint64_t new_ive;
+    target_ulong flags    = args[0];
+    target_ulong lisn     = args[1];
+    target_ulong target   = args[2];
+    target_ulong priority = args[3];
+    target_ulong eisn     = args[4];
+    uint32_t eq_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~(SPAPR_XIVE_SRC_SET_EISN | SPAPR_XIVE_SRC_MASK)) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    ive = xive_fabric_get_ive(XIVE_FABRIC(spapr->xive), lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    /* priority 0xff is used to reset the IVE */
+    if (priority == 0xff) {
+        new_ive = IVE_VALID | IVE_MASKED;
+        goto out;
+    }
+
+    if (flags & SPAPR_XIVE_SRC_MASK) {
+        new_ive = ive->w | IVE_MASKED;
+    } else {
+        new_ive = ive->w & ~IVE_MASKED;
+    }
+
+    if (!priority_is_valid(priority)) {
+        return H_P4;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    eq_idx = SPAPR_XIVE_EQ_INDEX(target, priority);
+    if (!xive_fabric_get_eq(XIVE_FABRIC(spapr->xive), eq_idx)) {
+        return H_P3;
+    }
+
+    new_ive = SETFIELD(IVE_EQ_BLOCK, new_ive, 0ul);
+    new_ive = SETFIELD(IVE_EQ_INDEX, new_ive, eq_idx);
+
+    if (flags & SPAPR_XIVE_SRC_SET_EISN) {
+        new_ive = SETFIELD(IVE_EQ_DATA, new_ive, eisn);
+    }
+
+out:
+    /* And update */
+    ive->w = new_ive;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
+ * target/priority pair is assigned to the specified Logical Interrupt
+ * Source.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63 Reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output:
+ * - R4: Target to which the specified Logical Interrupt Source is
+ *       assigned
+ * - R5: Priority to which the specified Logical Interrupt Source is
+ *       assigned
+ * - R6: EISN for the specified Logical Interrupt Source (this will be
+ *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
+ */
+static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
+                                            sPAPRMachineState *spapr,
+                                            target_ulong opcode,
+                                            target_ulong *args)
+{
+    target_ulong flags = args[0];
+    target_ulong lisn = args[1];
+    XiveIVE *ive;
+    XiveEQ *eq;
+    uint32_t eq_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    ive = xive_fabric_get_ive(XIVE_FABRIC(spapr->xive), lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+    eq = xive_fabric_get_eq(XIVE_FABRIC(spapr->xive), eq_idx);
+    if (!eq) {
+        /* Not sure what to return here */
+        return H_HARDWARE;
+    }
+
+    args[0] = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
+
+    if (ive->w & IVE_MASKED) {
+        args[1] = 0xff;
+    } else {
+        args[1] = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
+    }
+
+    args[2] = GETFIELD(IVE_EQ_DATA, ive->w);
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
+ * address of the notification management page associated with the
+ * specified target and priority.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *       Bits 0-63 Reserved
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ *
+ * Output:
+ * - R4: Logical real address of notification page
+ * - R5: Power of 2 page size of the notification page
+ */
+static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         target_ulong opcode,
+                                         target_ulong *args)
+{
+    target_ulong flags    = args[0];
+    target_ulong target   = args[1];
+    target_ulong priority = args[2];
+    XiveEQ *eq;
+    uint32_t eq_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!priority_is_valid(priority)) {
+        return H_P3;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    eq_idx = SPAPR_XIVE_EQ_INDEX(target, priority);
+    eq = xive_fabric_get_eq(XIVE_FABRIC(spapr->xive), eq_idx);
+    if (!eq)  {
+        return H_P2;
+    }
+
+    args[0] = -1; /* TODO: return ESn page */
+    if (eq->w0 & EQ_W0_ENQUEUE) {
+        args[1] = GETFIELD(EQ_W0_QSIZE, eq->w0) + 12;
+    } else {
+        args[1] = 0;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for
+ * a given "target" and "priority".  It is also used to set the
+ * notification config associated with the EQ.  An EQ size of 0 is
+ * used to reset the EQ config for a given target and priority. If
+ * resetting the EQ config, the END associated with the given "target"
+ * and "priority" will be changed to disable queueing.
+ *
+ * Upon return from the hcall(), no additional interrupts will be
+ * directed to the old EQ (if one was set). The old EQ (if one was
+ * set) should be investigated for interrupts that occurred prior to
+ * or during the hcall().
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Unconditional Notify (n) per the XIVE spec
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ * - "eventQueue": The logical real address of the start of the EQ
+ * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
+ *
+ * Output:
+ * - None
+ */
+
+#define SPAPR_XIVE_EQ_ALWAYS_NOTIFY PPC_BIT(63)
+
+static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
+                                           sPAPRMachineState *spapr,
+                                           target_ulong opcode,
+                                           target_ulong *args)
+{
+    target_ulong flags    = args[0];
+    target_ulong target   = args[1];
+    target_ulong priority = args[2];
+    target_ulong qpage    = args[3];
+    target_ulong qsize    = args[4];
+    XiveEQ *old_eq;
+    XiveEQ eq;
+    uint32_t eq_idx;
+    uint32_t qdata;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~SPAPR_XIVE_EQ_ALWAYS_NOTIFY) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!priority_is_valid(priority)) {
+        return H_P3;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    eq_idx = SPAPR_XIVE_EQ_INDEX(target, priority);
+    old_eq = xive_fabric_get_eq(XIVE_FABRIC(spapr->xive), eq_idx);
+    if (!old_eq)  {
+        return H_P2;
+    }
+
+    eq = *old_eq;
+
+    switch (qsize) {
+    case 12:
+    case 16:
+    case 21:
+    case 24:
+        eq.w3 = ((uint64_t)qpage) & 0xffffffff;
+        eq.w2 = (((uint64_t)qpage)) >> 32 & 0x0fffffff;
+        eq.w0 |= EQ_W0_ENQUEUE;
+        eq.w0 = SETFIELD(EQ_W0_QSIZE, eq.w0, qsize - 12);
+        break;
+    case 0:
+        /* reset queue and disable queueing */
+        xive_eq_reset(&eq);
+        goto out;
+
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
+                      qsize);
+        return H_P5;
+    }
+
+    if (qsize) {
+        /*
+         * Let's validate the EQ address with a read of the first EQ
+         * entry. We could also check that the full queue has been
+         * zeroed by the OS.
+         */
+        if (address_space_read(&address_space_memory, qpage,
+                               MEMTXATTRS_UNSPECIFIED,
+                               (uint8_t *) &qdata, sizeof(qdata))) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
+                          HWADDR_PRIx "\n", qpage);
+            return H_P4;
+        }
+    }
+
+    /* Ensure the priority and target are correctly set (they will not
+     * be right after allocation)
+     */
+    eq.w6 = SETFIELD(EQ_W6_NVT_BLOCK, 0ul, 0ul) |
+        SETFIELD(EQ_W6_NVT_INDEX, 0ul, target);
+    eq.w7 = SETFIELD(EQ_W7_F0_PRIORITY, 0ul, priority);
+
+    /* TODO: depends on notification page (ESn) from H_INT_GET_QUEUE_INFO */
+    if (flags & SPAPR_XIVE_EQ_ALWAYS_NOTIFY) {
+        eq.w0 |= EQ_W0_UCOND_NOTIFY;
+    } else {
+        eq.w0 &= ~EQ_W0_UCOND_NOTIFY;
+    }
+
+    /* The generation bit for the EQ starts at 1 and the EQ page
+     * offset counter starts at 0.
+     */
+    eq.w1 = EQ_W1_GENERATION | SETFIELD(EQ_W1_PAGE_OFF, 0ul, 0ul);
+    eq.w0 |= EQ_W0_VALID;
+
+    /* TODO: issue syncs required to ensure all in-flight interrupts
+     * are complete on the old EQ */
+out:
+    /* Update EQ */
+    *old_eq = eq;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_QUEUE_CONFIG hcall() is used to get an EQ for a given
+ * target and priority.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Debug: Return debug data
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ *
+ * Output:
+ * - R4: "flags":
+ *       Bits 0-61: Reserved
+ *       Bit 62: The value of Event Queue Generation Number (g) per
+ *              the XIVE spec if "Debug" = 1
+ *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
+ * - R5: The logical real address of the start of the EQ
+ * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
+ * - R7: The value of Event Queue Offset Counter per XIVE spec
+ *       if "Debug" = 1, else 0
+ *
+ */
+
+#define SPAPR_XIVE_EQ_DEBUG     PPC_BIT(63)
+
+static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
+                                           sPAPRMachineState *spapr,
+                                           target_ulong opcode,
+                                           target_ulong *args)
+{
+    target_ulong flags    = args[0];
+    target_ulong target   = args[1];
+    target_ulong priority = args[2];
+    XiveEQ *eq;
+    uint32_t eq_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~SPAPR_XIVE_EQ_DEBUG) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if an H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU.
+     */
+
+    if (!priority_is_valid(priority)) {
+        return H_P3;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the EQ corresponding to the
+     * target.
+     */
+    eq_idx = SPAPR_XIVE_EQ_INDEX(target, priority);
+    eq = xive_fabric_get_eq(XIVE_FABRIC(spapr->xive), eq_idx);
+    if (!eq)  {
+        return H_P2;
+    }
+
+    args[0] = 0;
+    if (eq->w0 & EQ_W0_UCOND_NOTIFY) {
+        args[0] |= SPAPR_XIVE_EQ_ALWAYS_NOTIFY;
+    }
+
+    if (eq->w0 & EQ_W0_ENQUEUE) {
+        args[1] =
+            (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
+        args[2] = GETFIELD(EQ_W0_QSIZE, eq->w0) + 12;
+    } else {
+        args[1] = 0;
+        args[2] = 0;
+    }
+
+    /* TODO: do we need any locking on the EQ? */
+    if (flags & SPAPR_XIVE_EQ_DEBUG) {
+        /* Load the event queue generation number into the return flags */
+        args[0] |= (uint64_t)GETFIELD(EQ_W1_GENERATION, eq->w1) << 62;
+
+        /* Load R7 with the event queue offset counter */
+        args[3] = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
+    } else {
+        args[3] = 0;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
+ * reporting cache line pair for the calling thread.  The reporting
+ * cache lines will contain the OS interrupt context when the OS
+ * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
+ * interrupt. The reporting cache lines can be reset by inputting -1
+ * in "reportingLine".  Issuing the CI store byte without reporting
+ * cache lines registered will result in the data not being accessible
+ * to the OS.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "reportingLine": The logical real address of the reporting cache
+ *    line pair
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
+                                                sPAPRMachineState *spapr,
+                                                target_ulong opcode,
+                                                target_ulong *args)
+{
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    /*
+     * H_STATE should be returned if an H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU.
+     */
+
+    /* TODO: H_INT_SET_OS_REPORTING_LINE */
+    return H_FUNCTION;
+}
+
+/*
+ * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
+ * real address of the reporting cache line pair set for the input
+ * "target".  If no reporting cache line pair has been set, -1 is
+ * returned.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "reportingLine": The logical real address of the reporting cache
+ *   line pair
+ *
+ * Output:
+ * - R4: The logical real address of the reporting line if set, else -1
+ */
+static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
+                                                sPAPRMachineState *spapr,
+                                                target_ulong opcode,
+                                                target_ulong *args)
+{
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    /*
+     * H_STATE should be returned if an H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU.
+     */
+
+    /* TODO: H_INT_GET_OS_REPORTING_LINE */
+    return H_FUNCTION;
+}
+
+/*
+ * The H_INT_ESB hcall() is used to issue a load or store to the ESB
+ * page for the input "lisn".  This hcall is only supported for LISNs
+ * that have the ESB hcall flag set to 1 when returned from hcall()
+ * H_INT_GET_SOURCE_INFO.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Store: Store=1, store operation, else load operation
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ * - "esbOffset" is the offset into the ESB page for the load or store operation
+ * - "storeData" is the data to write for a store operation
+ *
+ * Output:
+ * - R4: The value of the load if load operation, else -1
+ */
+
+#define SPAPR_XIVE_ESB_STORE PPC_BIT(63)
+
+static target_ulong h_int_esb(PowerPCCPU *cpu,
+                              sPAPRMachineState *spapr,
+                              target_ulong opcode,
+                              target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveIVE *ive;
+    target_ulong flags   = args[0];
+    target_ulong lisn    = args[1];
+    target_ulong offset  = args[2];
+    target_ulong data    = args[3];
+    hwaddr mmio_addr;
+    XiveSource *xsrc = &xive->source;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~SPAPR_XIVE_ESB_STORE) {
+        return H_PARAMETER;
+    }
+
+    ive = xive_fabric_get_ive(XIVE_FABRIC(xive), lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    if (offset > (1ull << xsrc->esb_shift)) {
+        return H_P3;
+    }
+
+    mmio_addr = xive_source_esb_base(xsrc, lisn - xsrc->offset) + offset;
+
+    if (dma_memory_rw(&address_space_memory, mmio_addr, &data, 8,
+                      (flags & SPAPR_XIVE_ESB_STORE))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
+                      HWADDR_PRIx "\n", mmio_addr);
+        return H_HARDWARE;
+    }
+    args[0] = (flags & SPAPR_XIVE_ESB_STORE) ? -1 : data;
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SYNC hcall() is used to issue hardware syncs that will
+ * ensure any in-flight events for the input lisn are in the event
+ * queue.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_sync(PowerPCCPU *cpu,
+                               sPAPRMachineState *spapr,
+                               target_ulong opcode,
+                               target_ulong *args)
+{
+    XiveIVE *ive;
+    target_ulong flags   = args[0];
+    target_ulong lisn    = args[1];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    ive = xive_fabric_get_ive(XIVE_FABRIC(spapr->xive), lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        return H_P2;
+    }
+
+    /*
+     * H_STATE should be returned if an H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU.
+     */
+
+    /* This is not real hardware. Nothing to be done */
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_RESET hcall() is used to reset all of the partition's
+ * interrupt exploitation structures to their initial state.  This
+ * means losing all interrupt state previously set via
+ * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_reset(PowerPCCPU *cpu,
+                                sPAPRMachineState *spapr,
+                                target_ulong opcode,
+                                target_ulong *args)
+{
+    target_ulong flags   = args[0];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    device_reset(DEVICE(spapr->xive));
+    return H_SUCCESS;
+}
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr)
+{
+    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
+    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
+    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
+    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
+    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
+    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
+    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
+                             h_int_set_os_reporting_line);
+    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
+                             h_int_get_os_reporting_line);
+    spapr_register_hypercall(H_INT_ESB, h_int_esb);
+    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
+    spapr_register_hypercall(H_INT_RESET, h_int_reset);
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 8bbd2a677935..7a65dcde3ff7 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -250,6 +250,9 @@ static void xive_system_init(MachineState *machine, int nr_irqs, Error **errp)
     }
 
     spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, errp);
+    if (spapr->xive) {
+        spapr_xive_hcall_init(spapr);
+    }
 }
 
 static int spapr_fixup_cpu_smt_dt(void *fdt, int offset, PowerPCCPU *cpu,
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 875f658973a1..6b6496d9c343 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -443,7 +443,20 @@ struct sPAPRMachineState {
 #define H_INVALIDATE_PID        0x378
 #define H_REGISTER_PROC_TBL     0x37C
 #define H_SIGNAL_SYS_RESET      0x380
-#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
+
+#define H_INT_GET_SOURCE_INFO   0x3A8
+#define H_INT_SET_SOURCE_CONFIG 0x3AC
+#define H_INT_GET_SOURCE_CONFIG 0x3B0
+#define H_INT_GET_QUEUE_INFO    0x3B4
+#define H_INT_SET_QUEUE_CONFIG  0x3B8
+#define H_INT_GET_QUEUE_CONFIG  0x3BC
+#define H_INT_SET_OS_REPORTING_LINE 0x3C0
+#define H_INT_GET_OS_REPORTING_LINE 0x3C4
+#define H_INT_ESB               0x3C8
+#define H_INT_SYNC              0x3CC
+#define H_INT_RESET             0x3D0
+
+#define MAX_HCALL_OPCODE        H_INT_RESET
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 7cb3561aa3d3..49c78b8e33c6 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -43,4 +43,8 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 #define SPAPR_XIVE_EQ_SERVER(eq_idx) ((eq_idx) >> 3)
 #define SPAPR_XIVE_EQ_PRIO(eq_idx)   ((eq_idx) & 0x7)
 
+typedef struct sPAPRMachineState sPAPRMachineState;
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+
 #endif /* PPC_SPAPR_XIVE_H */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 100+ messages in thread

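For reference, the argument handling in the queue hcalls above can be
sketched in isolation. This is a minimal sketch, not QEMU code: the
Sketch* names are hypothetical, SKETCH_PPC_BIT() is assumed to match
QEMU's PPC_BIT() (IBM MSB-0 numbering, so bit 63 is the least
significant bit), and sketch_eq_index() assumes SPAPR_XIVE_EQ_INDEX()
is the inverse of the SPAPR_XIVE_EQ_SERVER()/SPAPR_XIVE_EQ_PRIO()
macros from spapr_xive.h. It shows the reserved-flags check, the EQ
index encoding, and how the queue page address is split across w2
(28 bits) and w3 (32 bits), with the size stored as log2 minus 12.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM (MSB-0) bit numbering, assumed to match QEMU's PPC_BIT() */
#define SKETCH_PPC_BIT(bit)   (0x8000000000000000ULL >> (bit))
#define SKETCH_EQ_DEBUG       SKETCH_PPC_BIT(63)   /* == 1 */

/* Any flag outside the allowed mask is rejected (H_PARAMETER in the patch) */
static bool sketch_flags_ok(uint64_t flags, uint64_t allowed)
{
    return (flags & ~allowed) == 0;
}

/* Assumed encoding: 8 EQs (one per priority) per server */
static uint32_t sketch_eq_index(uint32_t server, uint32_t prio)
{
    assert(prio < 8);
    return (server << 3) | prio;
}

/* Hypothetical subset of an EQ: only what the queue hcalls touch */
typedef struct {
    uint32_t w2;        /* queue page address bits 59:32 (28 bits) */
    uint32_t w3;        /* queue page address bits 31:0 */
    uint32_t qsize_fld; /* log2(size) - 12, like EQ_W0_QSIZE */
    bool     enqueue;   /* like EQ_W0_ENQUEUE */
} SketchEQ;

/* Same accepted sizes as the switch in h_int_set_queue_config() */
static bool sketch_eq_size_ok(uint64_t qsize)
{
    return qsize == 12 || qsize == 16 || qsize == 21 || qsize == 24;
}

/* H_INT_SET_QUEUE_CONFIG direction */
static void sketch_eq_set(SketchEQ *eq, uint64_t qpage, uint64_t qsize)
{
    assert(sketch_eq_size_ok(qsize));
    eq->w3 = (uint32_t)(qpage & 0xffffffff);
    eq->w2 = (uint32_t)((qpage >> 32) & 0x0fffffff);
    eq->qsize_fld = (uint32_t)(qsize - 12);
    eq->enqueue = true;
}

/* H_INT_GET_QUEUE_CONFIG direction (values returned in R5 and R6) */
static void sketch_eq_get(const SketchEQ *eq, uint64_t *qpage, uint64_t *qsize)
{
    if (eq->enqueue) {
        *qpage = ((uint64_t)(eq->w2 & 0x0fffffff) << 32) | eq->w3;
        *qsize = (uint64_t)eq->qsize_fld + 12;
    } else {
        *qpage = 0;
        *qsize = 0;
    }
}
```

One consequence of the split is visible here: only the low 60 address
bits survive the round trip, so a queue page address with any of its
top 4 bits set would be silently truncated by the code as written.
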
* [Qemu-devel] [PATCH v3 14/35] spapr: add device tree support for the XIVE exploitation mode
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (12 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 13/35] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 15/35] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE interface for the guest is described in the device tree under
the "interrupt-controller" node. A couple of new properties are
specific to XIVE:

 - "reg"

   contains the base addresses and sizes of the Thread Interrupt
   Management Areas (TIMA), also called rings, for the User level and
   for the Guest OS level. Only the Guest OS level is taken into
   account today.

 - "ibm,xive-eq-sizes"

   the supported event queue sizes: one cell per supported size, each
   containing the log2 of the size, in ascending order.

 - "ibm,xive-lisn-ranges"

   the interrupt number ranges assigned to the guest for the IPIs.

and also under the root node:

 - "ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use. On POWER9 DD2.x, OPAL uses the priority 7 queue to
   automatically escalate interrupts for all other queues, so only
   priorities 0 to 6 are available to the guest.
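
As an illustration of the property encoding, here is a minimal sketch
of how a guest could test a priority against
"ibm,plat-res-int-priorities", assuming the property is a list of
big-endian (start, count) cell pairs as built by spapr_dt_xive() below.
The function names are hypothetical. With the values emitted by this
patch (start 7, count 0xf8), priorities 7 through 0xfe are reserved and
0 to 6 remain usable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one big-endian 32-bit device tree cell (no libfdt needed) */
static uint32_t sketch_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/*
 * Returns true if "prio" falls in none of the (start, count) ranges of
 * a raw "ibm,plat-res-int-priorities" property of "len" bytes.
 */
static bool sketch_prio_usable(const uint8_t *prop, size_t len, uint32_t prio)
{
    size_t i;

    assert(prop != NULL || len == 0);
    for (i = 0; i + 8 <= len; i += 8) {
        uint32_t start = sketch_be32(prop + i);
        uint32_t count = sketch_be32(prop + i + 4);

        if (prio >= start && prio - start < count) {
            return false;   /* reserved by the hypervisor */
        }
    }
    return true;
}
```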

When the XIVE exploitation interrupt mode is activated after the CAS
negotiation, the machine will perform a reboot to rebuild the device
tree.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive_hcall.c  | 64 +++++++++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr.c              |  7 ++++-
 hw/ppc/spapr_hcall.c        |  6 +++++
 include/hw/ppc/spapr_xive.h |  2 ++
 4 files changed, 78 insertions(+), 1 deletion(-)

diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
index 9f3d579bba2c..7dcb1f90ae5b 100644
--- a/hw/intc/spapr_xive_hcall.c
+++ b/hw/intc/spapr_xive_hcall.c
@@ -857,3 +857,67 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
     spapr_register_hypercall(H_INT_SYNC, h_int_sync);
     spapr_register_hypercall(H_INT_RESET, h_int_reset);
 }
+
+void spapr_dt_xive(sPAPRMachineState *spapr, int nr_servers,
+                   void *fdt, uint32_t phandle)
+{
+    sPAPRXive *xive = spapr->xive;
+    int node;
+    uint64_t timas[2 * 2];
+    /* Interrupt number ranges for the IPIs */
+    uint32_t lisn_ranges[] = {
+        cpu_to_be32(0),
+        cpu_to_be32(nr_servers),
+    };
+    uint32_t eq_sizes[] = {
+        cpu_to_be32(12), /* 4K */
+        cpu_to_be32(16), /* 64K */
+        cpu_to_be32(21), /* 2M */
+        cpu_to_be32(24), /* 16M */
+    };
+    /* The following array is in sync with the 'priority_is_valid'
+     * routine above. Linux is expected to choose priority 6.
+     */
+    uint32_t plat_res_int_priorities[] = {
+        cpu_to_be32(7),    /* start */
+        cpu_to_be32(0xf8), /* count */
+    };
+    int i;
+    gchar *nodename;
+
+    /* Thread Interrupt Management Area : User and OS views */
+    for (i = 0; i < 2; i++) {
+        timas[i * 2] = cpu_to_be64(xive->tm_base + i * (1ull << TM_SHIFT));
+        timas[i * 2 + 1] = cpu_to_be64(1ull << TM_SHIFT);
+    }
+
+    nodename = g_strdup_printf("interrupt-controller@%" PRIx64, xive->tm_base);
+    _FDT(node = fdt_add_subnode(fdt, 0, nodename));
+    g_free(nodename);
+
+    _FDT(fdt_setprop_string(fdt, node, "device_type", "power-ivpe"));
+    _FDT(fdt_setprop(fdt, node, "reg", timas, sizeof(timas)));
+
+    _FDT(fdt_setprop_string(fdt, node, "compatible", "ibm,power-ivpe"));
+    _FDT(fdt_setprop(fdt, node, "ibm,xive-eq-sizes", eq_sizes,
+                     sizeof(eq_sizes)));
+    _FDT(fdt_setprop(fdt, node, "ibm,xive-lisn-ranges", lisn_ranges,
+                     sizeof(lisn_ranges)));
+
+    /* For Linux to link the LSIs to the main interrupt controller.
+     * These properties are not part of the sPAPR specification for
+     * the XIVE exploitation mode.
+     */
+    _FDT(fdt_setprop(fdt, node, "interrupt-controller", NULL, 0));
+    _FDT(fdt_setprop_cell(fdt, node, "#interrupt-cells", 2));
+
+    /* For SLOF */
+    _FDT(fdt_setprop_cell(fdt, node, "linux,phandle", phandle));
+    _FDT(fdt_setprop_cell(fdt, node, "phandle", phandle));
+
+    /* The "ibm,plat-res-int-priorities" property defines the priority
+     * ranges reserved by the hypervisor
+     */
+    _FDT(fdt_setprop(fdt, 0, "ibm,plat-res-int-priorities",
+                     plat_res_int_priorities, sizeof(plat_res_int_priorities)));
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7a65dcde3ff7..d4bc6f56c9d4 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1188,7 +1188,12 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr,
     _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
 
     /* /interrupt controller */
-    spapr_dt_xics(xics_max_server_number(spapr), fdt, PHANDLE_XICP);
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_dt_xics(xics_max_server_number(spapr), fdt, PHANDLE_XICP);
+    } else {
+        /* Populate device tree for XIVE */
+        spapr_dt_xive(spapr, xics_max_server_number(spapr), fdt, PHANDLE_XICP);
+    }
 
     ret = spapr_populate_memory(spapr, fdt);
     if (ret < 0) {
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index 16bccdd5c012..3215c3b4aec3 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1655,6 +1655,12 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
             (spapr_h_cas_compose_response(spapr, args[1], args[2],
                                           ov5_updates) != 0);
     }
+
+    /* The device tree must be rebuilt for XIVE: generate a reset */
+    if (!spapr->cas_reboot) {
+        spapr->cas_reboot = spapr_ovec_test(ov5_updates, OV5_XIVE_EXPLOIT);
+    }
+
     spapr_ovec_cleanup(ov5_updates);
 
     if (spapr->cas_reboot) {
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 49c78b8e33c6..416f51404ce2 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -46,5 +46,7 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 typedef struct sPAPRMachineState sPAPRMachineState;
 
 void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+void spapr_dt_xive(sPAPRMachineState *spapr, int nr_servers, void *fdt,
+                   uint32_t phandle);
 
 #endif /* PPC_SPAPR_XIVE_H */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 100+ messages in thread

* [Qemu-devel] [PATCH v3 15/35] sysbus: add a sysbus_mmio_unmap() helper
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (13 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 14/35] spapr: add device tree support for the XIVE exploitation mode Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 16/35] spapr: introduce a helper to map the XIVE memory regions Cédric Le Goater
                   ` (20 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

This will be used to remove the MMIO regions of the POWER9 XIVE
interrupt controller when the sPAPR machine is reset.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/core/sysbus.c    | 10 ++++++++++
 include/hw/sysbus.h |  1 +
 2 files changed, 11 insertions(+)

diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
index 5d0887f499de..e31f8e6be2f1 100644
--- a/hw/core/sysbus.c
+++ b/hw/core/sysbus.c
@@ -152,6 +152,16 @@ static void sysbus_mmio_map_common(SysBusDevice *dev, int n, hwaddr addr,
     }
 }
 
+void sysbus_mmio_unmap(SysBusDevice *dev, int n)
+{
+    assert(n >= 0 && n < dev->num_mmio);
+
+    if (dev->mmio[n].addr != (hwaddr)-1) {
+        memory_region_del_subregion(get_system_memory(), dev->mmio[n].memory);
+        dev->mmio[n].addr = (hwaddr)-1;
+    }
+}
+
 void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
 {
     sysbus_mmio_map_common(dev, n, addr, false, 0);
diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
index e88bb6dae0c1..2446983de167 100644
--- a/include/hw/sysbus.h
+++ b/include/hw/sysbus.h
@@ -92,6 +92,7 @@ qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
 void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
 void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
                              int priority);
+void sysbus_mmio_unmap(SysBusDevice *dev, int n);
 void sysbus_add_io(SysBusDevice *dev, hwaddr addr,
                    MemoryRegion *mem);
 MemoryRegion *sysbus_address_space(SysBusDevice *dev);
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 100+ messages in thread

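The sysbus_mmio_unmap() helper relies on (hwaddr)-1 as an "unmapped"
sentinel, which makes unmapping idempotent. A minimal toy model of
that bookkeeping follows. SketchDev is hypothetical, mapped_count
stands in for the subregions of system memory, and the remap behaviour
of sketch_map() is an assumed reading of sysbus_mmio_map_common(), not
a copy of it.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_UNMAPPED ((uint64_t)-1)   /* plays the role of (hwaddr)-1 */
#define SKETCH_MAX_MMIO 4

/* Toy model of the per-device MMIO slots, not QEMU's SysBusDevice */
typedef struct {
    int num_mmio;
    uint64_t addr[SKETCH_MAX_MMIO];
    int mapped_count;   /* number of regions currently in "system memory" */
} SketchDev;

static void sketch_init(SketchDev *d, int num_mmio)
{
    assert(num_mmio >= 0 && num_mmio <= SKETCH_MAX_MMIO);
    d->num_mmio = num_mmio;
    d->mapped_count = 0;
    for (int i = 0; i < num_mmio; i++) {
        d->addr[i] = SKETCH_UNMAPPED;
    }
}

static void sketch_map(SketchDev *d, int n, uint64_t addr)
{
    assert(n >= 0 && n < d->num_mmio);
    if (d->addr[n] == addr) {
        return;                     /* already mapped at this address */
    }
    if (d->addr[n] != SKETCH_UNMAPPED) {
        d->mapped_count--;          /* remap: drop the old mapping first */
    }
    d->addr[n] = addr;
    d->mapped_count++;
}

static void sketch_unmap(SketchDev *d, int n)
{
    assert(n >= 0 && n < d->num_mmio);
    if (d->addr[n] != SKETCH_UNMAPPED) {
        d->mapped_count--;          /* memory_region_del_subregion() */
        d->addr[n] = SKETCH_UNMAPPED;
    }
}
```

Calling sketch_unmap() twice on the same slot is harmless, which is
what allows a reset path to unmap regions without tracking whether a
previous boot actually mapped them.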
* [Qemu-devel] [PATCH v3 16/35] spapr: introduce a helper to map the XIVE memory regions
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (14 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 15/35] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 17/35] spapr: add XIVE support to spapr_qirq() Cédric Le Goater
                   ` (19 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

When the XIVE exploitation interrupt mode is activated, the machine
needs to expose to the guest the MMIO regions used by the controller:

  - Event State Buffers
  - Thread Interrupt Management Area for the OS and User views

Migration will also need to reflect the current interrupt mode in use.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 Changes since v2:
 
 - introduced spapr_xive_mmio_unmap()
 - introduced spapr_machine_reset() for reset and post_load
 
 hw/intc/spapr_xive.c        | 24 ++++++++++++++++++++++++
 hw/ppc/spapr.c              | 22 ++++++++++++++++++++++
 include/hw/ppc/spapr_xive.h |  2 ++
 3 files changed, 48 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index d0d5a7d7f969..7aba6e571a93 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -239,3 +239,27 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
     xive_source_irq_set(xsrc, lisn - xsrc->offset, false);
     return true;
 }
+
+void spapr_xive_mmio_map(sPAPRXive *xive)
+{
+    XiveSource *xsrc = &xive->source;
+
+    /* ESBs */
+    sysbus_mmio_map(SYS_BUS_DEVICE(xsrc), 0, xsrc->esb_base);
+
+    /* Thread Interrupt Management Area: User and OS views */
+    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
+    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 1, xive->tm_base + (1 << TM_SHIFT));
+}
+
+void spapr_xive_mmio_unmap(sPAPRXive *xive)
+{
+    XiveSource *xsrc = &xive->source;
+
+    /* ESBs */
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xsrc), 0);
+
+    /* Thread Interrupt Management Area: User and OS views */
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 0);
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 1);
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d4bc6f56c9d4..a9770f8f0a6e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1519,6 +1519,19 @@ static int spapr_reset_drcs(Object *child, void *opaque)
     return 0;
 }
 
+/* Set up XIVE exploitation or legacy mode as required by CAS */
+static void spapr_reset_interrupt(sPAPRMachineState *spapr, Error **errp)
+{
+    /* Reset XIVE if enabled */
+    if (spapr->xive_exploitation) {
+        spapr_xive_mmio_unmap(spapr->xive);
+    }
+
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_xive_mmio_map(spapr->xive);
+    }
+}
+
 static void spapr_machine_reset(void)
 {
     MachineState *machine = MACHINE(qdev_get_machine());
@@ -1555,6 +1568,8 @@ static void spapr_machine_reset(void)
         ppc_set_compat(first_ppc_cpu, spapr->max_compat_pvr, &error_fatal);
     }
 
+    spapr_reset_interrupt(spapr, &error_fatal);
+
     qemu_devices_reset();
 
     /* DRC reset may cause a device to be unplugged. This will cause troubles
@@ -1664,6 +1679,7 @@ static int spapr_post_load(void *opaque, int version_id)
 {
     sPAPRMachineState *spapr = (sPAPRMachineState *)opaque;
     int err = 0;
+    Error *local_err = NULL;
 
     err = spapr_caps_post_migration(spapr);
     if (err) {
@@ -1698,6 +1714,12 @@ static int spapr_post_load(void *opaque, int version_id)
         }
     }
 
+    spapr_reset_interrupt(spapr, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return -EINVAL;
+    }
+
     return err;
 }
 
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 416f51404ce2..0373b1c995bc 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -35,6 +35,8 @@ typedef struct sPAPRXive {
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+void spapr_xive_mmio_map(sPAPRXive *xive);
+void spapr_xive_mmio_unmap(sPAPRXive *xive);
 
 /*
  * sPAPR encoding of EQ indexes
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 100+ messages in thread

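The ordering in spapr_reset_interrupt() can be modelled with three
booleans: tear down the XIVE MMIO regions if the previous boot used
exploitation mode, then map them again only if CAS negotiated
OV5_XIVE_EXPLOIT. The sketch below is hypothetical; in particular, this
patch does not show where spapr->xive_exploitation is updated, so
refreshing it at the end of the reset is an assumption.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical machine state: only the flags spapr_reset_interrupt() uses */
typedef struct {
    bool xive_was_active;   /* role of spapr->xive_exploitation */
    bool cas_chose_xive;    /* role of OV5_XIVE_EXPLOIT in spapr->ov5_cas */
    bool xive_mmio_mapped;  /* ESB and TIMA regions visible to the guest */
} SketchMachine;

static void sketch_reset_interrupt(SketchMachine *m)
{
    /* Drop whatever the previous boot exposed */
    if (m->xive_was_active) {
        m->xive_mmio_mapped = false;
    }

    /* Expose the XIVE MMIO regions only if CAS selected XIVE */
    if (m->cas_chose_xive) {
        m->xive_mmio_mapped = true;
    }

    /* Assumption: the mode flag is refreshed once the regions are set up */
    m->xive_was_active = m->cas_chose_xive;

    /* Holds as long as xive_was_active tracked the previous mapping */
    assert(m->xive_mmio_mapped == m->cas_chose_xive);
}
```

Because sysbus_mmio_unmap() is idempotent, the unmap step could also
be made unconditional without changing the outcome.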
* [Qemu-devel] [PATCH v3 17/35] spapr: add XIVE support to spapr_qirq()
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (15 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 16/35] spapr: introduce a helper to map the XIVE memory regions Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 18/35] spapr: introduce a spapr_icp_create() helper Cédric Le Goater
                   ` (18 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE object has its own set of qirqs, which are to be used when the
XIVE exploitation interrupt mode is activated.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c        | 13 +++++++++++++
 hw/ppc/spapr.c              |  4 ++++
 include/hw/ppc/spapr_xive.h |  1 +
 include/hw/ppc/xive.h       |  6 ++++++
 4 files changed, 24 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 7aba6e571a93..98e067bfc90c 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -263,3 +263,16 @@ void spapr_xive_mmio_unmap(sPAPRXive *xive)
     sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 0);
     sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 1);
 }
+
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn)
+{
+    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
+    XiveSource *xsrc = &xive->source;
+
+    if (!ive || !(ive->w & IVE_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
+        return NULL;
+    }
+
+    return xive_source_qirq(xsrc, lisn - xsrc->offset);
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index a9770f8f0a6e..4fcc942ccfa3 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3941,6 +3941,10 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
 {
     ICSState *ics = spapr->ics;
 
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return spapr_xive_qirq(spapr->xive, irq);
+    }
+
     if (ics_valid_irq(ics, irq)) {
         return ics->qirqs[irq - ics->offset];
     }
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 0373b1c995bc..df87e68b3d05 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -37,6 +37,7 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 void spapr_xive_mmio_map(sPAPRXive *xive);
 void spapr_xive_mmio_unmap(sPAPRXive *xive);
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn);
 
 /*
  * sPAPR encoding of EQ indexes
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 6cc02638c677..328b093eb9c3 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -155,6 +155,12 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
     xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
 }
 
+static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return xsrc->qirqs[srcno];
+}
+
 /*
  * XIVE Interrupt Presenter
  */
-- 
2.13.6

^ permalink raw reply related	[flat|nested] 100+ messages in thread

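The dispatch added to spapr_qirq() boils down to choosing a backend
and converting the global IRQ number into a source-relative index by
subtracting the source offset. In this minimal sketch, strings stand
in for qemu_irq handles and a plain range check replaces the IVE_VALID
test done by spapr_xive_qirq(); all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interrupt source covering [offset, offset + nr_irqs) */
typedef struct {
    uint32_t offset;        /* first global IRQ number of the source */
    uint32_t nr_irqs;
    const char **qirqs;     /* stand-ins for qemu_irq handles */
} SketchSource;

static bool sketch_irq_valid(const SketchSource *s, uint32_t irq)
{
    return irq >= s->offset && irq - s->offset < s->nr_irqs;
}

/* Pick the backend from the negotiated mode, then index the source */
static const char *sketch_qirq(bool xive_exploit, const SketchSource *xive,
                               const SketchSource *ics, uint32_t irq)
{
    const SketchSource *s = xive_exploit ? xive : ics;

    assert(s != NULL);
    if (!sketch_irq_valid(s, irq)) {
        return NULL;        /* both paths of spapr_qirq() return NULL */
    }
    return s->qirqs[irq - s->offset];
}
```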
* [Qemu-devel] [PATCH v3 18/35] spapr: introduce a spapr_icp_create() helper
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (16 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 17/35] spapr: add XIVE support to spapr_qirq() Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 19/35] spapr: toggle the ICP depending on the selected interrupt mode Cédric Le Goater
                   ` (17 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

On sPAPR, the creation of the interrupt presenter depends on some of
the machine attributes, and it will get more complex once the XIVE
exploitation interrupt mode is available. So provide a machine-level
helper to isolate the process and hide the details from the sPAPR
core realize function.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
Reviewed-by: Greg Kurz <groug@kaod.org>
---
 hw/ppc/spapr.c          | 14 ++++++++++++++
 hw/ppc/spapr_cpu_core.c |  3 +--
 include/hw/ppc/spapr.h  |  2 ++
 3 files changed, 17 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 4fcc942ccfa3..32c7801b249e 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3952,6 +3952,20 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
     return NULL;
 }
 
+Object *spapr_icp_create(sPAPRMachineState *spapr, Object *cpu, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    return obj;
+}
+
 static void spapr_pic_print_info(InterruptStatsProvider *obj,
                                  Monitor *mon)
 {
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 94afeb399e99..76bff4cc372d 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -129,8 +129,7 @@ static void spapr_cpu_core_realize_child(Object *child,
         goto error;
     }
 
-    cpu->intc = icp_create(child, spapr->icp_type, XICS_FABRIC(spapr),
-                           &local_err);
+    cpu->intc = spapr_icp_create(spapr, child, &local_err);
     if (local_err) {
         goto error;
     }
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 6b6496d9c343..d5e168f5ad4e 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -819,4 +819,6 @@ void spapr_caps_reset(sPAPRMachineState *spapr);
 void spapr_caps_add_properties(sPAPRMachineClass *smc, Error **errp);
 int spapr_caps_post_migration(sPAPRMachineState *spapr);
 
+Object *spapr_icp_create(sPAPRMachineState *spapr, Object *cpu, Error **errp);
+
 #endif /* HW_SPAPR_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 19/35] spapr: toggle the ICP depending on the selected interrupt mode
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Each interrupt mode has its own interrupt presenter object, one for
XICS and one for XIVE, and both are stored under the CPU object. The
active presenter, corresponding to the current interrupt mode, is
selected with a simple lookup on the children of the CPU.

Migration and CPU hotplug also need to reflect the interrupt mode
currently in use.
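
The reset-then-select logic of spapr_cpu_core_reset_icp() and
spapr_cpu_core_set_icp() can be modelled outside QEMU as below. This is
a sketch under stated assumptions: ToyCPU, ToyPresenter and the type
strings "icp" and "xive-nvt" are hypothetical, and exact string matching
stands in for the QOM object_dynamic_cast() lookup over the CPU's
children.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4

/* Hypothetical model: each CPU owns one presenter per interrupt mode,
 * stored as typed children; 'intc' points at the active one. */
typedef struct ToyPresenter {
    const char *type;          /* e.g. "icp" or "xive-nvt" (made-up names) */
} ToyPresenter;

typedef struct ToyCPU {
    ToyPresenter children[TOY_MAX_CHILDREN];
    int nr_children;
    ToyPresenter *intc;        /* active presenter, NULL after reset */
} ToyCPU;

/* Like spapr_cpu_core_reset_icp(): forget the active presenter. */
static void toy_reset_icp(ToyCPU *cpus, int nr_cpus)
{
    for (int i = 0; i < nr_cpus; i++) {
        cpus[i].intc = NULL;
    }
}

/* Like spapr_cpu_core_set_icp(): select the child matching 'type' on
 * every CPU. Returns 0 on success, -1 as soon as a CPU lacks such a
 * presenter (CPUs already visited keep their new selection). */
static int toy_set_icp(ToyCPU *cpus, int nr_cpus, const char *type)
{
    for (int i = 0; i < nr_cpus; i++) {
        ToyPresenter *found = NULL;

        for (int j = 0; j < cpus[i].nr_children && !found; j++) {
            if (strcmp(cpus[i].children[j].type, type) == 0) {
                found = &cpus[i].children[j];
            }
        }
        if (!found) {
            return -1;
        }
        cpus[i].intc = found;
    }
    return 0;
}
```

Keeping both presenters alive and only swapping the pointer means a
change of mode at CAS time does not need to create or destroy devices.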

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xive.c                  | 19 ++++++++++++++++++
 hw/ppc/spapr.c                  | 40 +++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_cpu_core.c         | 44 +++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr.h          |  1 +
 include/hw/ppc/spapr_cpu_core.h |  2 ++
 include/hw/ppc/xive.h           |  1 +
 6 files changed, 107 insertions(+)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 20e216f03c5b..2daa36f77a6b 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -88,6 +88,25 @@ static void xive_eq_push(XiveEQ *eq, uint32_t data)
  * XIVE Interrupt Presenter
  */
 
+Object *xive_nvt_create(Object *cpu, const char *type, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(type);
+    object_property_add_child(cpu, type, obj, &error_abort);
+    object_unref(obj);
+    object_property_add_const_link(obj, ICP_PROP_CPU, cpu, &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        object_unparent(obj);
+        error_propagate(errp, local_err);
+        obj = NULL;
+    }
+
+    return obj;
+}
+
 /* Convert a priority number to an Interrupt Pending Buffer (IPB)
  * register, which indicates a pending interrupt at the priority
  * corresponding to the bit number
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 32c7801b249e..0c59816bf3d6 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -251,6 +251,7 @@ static void xive_system_init(MachineState *machine, int nr_irqs, Error **errp)
 
     spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, errp);
     if (spapr->xive) {
+        spapr->nvt_type = TYPE_XIVE_NVT;
         spapr_xive_hcall_init(spapr);
     }
 }
@@ -1522,13 +1523,32 @@ static int spapr_reset_drcs(Object *child, void *opaque)
 /* Setup XIVE exploitation or legacy mode as required by CAS */
 static void spapr_reset_interrupt(sPAPRMachineState *spapr, Error **errp)
 {
+    Error *local_err = NULL;
+    const char *intc_type;
+
     /* Reset XIVE if enabled */
     if (spapr->xive_exploitation) {
         spapr_xive_mmio_unmap(spapr->xive);
     }
 
+    /* Reset CPU ICPs */
+    spapr_cpu_core_reset_icp(&local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
     if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
         spapr_xive_mmio_map(spapr->xive);
+        intc_type = spapr->nvt_type;
+    } else {
+        intc_type = spapr->icp_type;
+    }
+
+    spapr_cpu_core_set_icp(intc_type, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
     }
 }
 
@@ -3963,6 +3983,26 @@ Object *spapr_icp_create(sPAPRMachineState *spapr, Object *cpu, Error **errp)
         return NULL;
     }
 
+    if (spapr->xive_exploitation) {
+        Object *obj_xive;
+
+        /* Add a XIVE interrupt presenter. The machine will switch
+         * the CPU ICP depending on the interrupt model negotiated
+         * at CAS time.
+         */
+        obj_xive = xive_nvt_create(cpu, spapr->nvt_type, &local_err);
+        if (local_err) {
+            object_unparent(obj);
+            error_propagate(errp, local_err);
+            return NULL;
+        }
+
+        /* when hotplugged, the CPU should have the correct ICP */
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+            return obj_xive;
+        }
+    }
+
     return obj;
 }
 
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 76bff4cc372d..3df2bda53f50 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -256,3 +256,47 @@ static const TypeInfo spapr_cpu_core_type_infos[] = {
 };
 
 DEFINE_TYPES(spapr_cpu_core_type_infos)
+
+void spapr_cpu_core_reset_icp(Error **errp)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        cpu->intc = NULL;
+    }
+}
+
+typedef struct ForeachFindICPArgs {
+    const char *icp_type;
+    Object *icp;
+} ForeachFindICPArgs;
+
+static int spapr_cpu_core_find_icp(Object *child, void *opaque)
+{
+    ForeachFindICPArgs *args = opaque;
+
+    if (object_dynamic_cast(child, args->icp_type)) {
+        args->icp = child;
+    }
+
+    return args->icp != NULL;
+}
+
+void spapr_cpu_core_set_icp(const char *icp_type, Error **errp)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        ForeachFindICPArgs args = { icp_type, NULL };
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        object_child_foreach(OBJECT(cs), spapr_cpu_core_find_icp, &args);
+        if (!args.icp) {
+            error_setg(errp, "Couldn't find a '%s' icp", icp_type);
+            return;
+        }
+
+        cpu->intc = args.icp;
+    }
+}
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d5e168f5ad4e..43ef5f743974 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -168,6 +168,7 @@ struct sPAPRMachineState {
     const char *icp_type;
     uint8_t xive_exploitation;
     sPAPRXive  *xive;
+    const char *nvt_type;
 
     bool cmd_line_caps[SPAPR_CAP_NUM];
     sPAPRCapabilities def, eff, mig;
diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
index 1129f344aa0c..c7ccc99d22e7 100644
--- a/include/hw/ppc/spapr_cpu_core.h
+++ b/include/hw/ppc/spapr_cpu_core.h
@@ -38,4 +38,6 @@ typedef struct sPAPRCPUCoreClass {
 } sPAPRCPUCoreClass;
 
 const char *spapr_get_cpu_core_type(const char *cpu_type);
+void spapr_cpu_core_set_icp(const char *icp_type, Error **errp);
+void spapr_cpu_core_reset_icp(Error **errp);
 #endif
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 328b093eb9c3..24ce58812a7c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -191,6 +191,7 @@ extern const MemoryRegionOps xive_tm_os_ops;
 
 void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
 XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority);
+Object *xive_nvt_create(Object *cpu, const char *type, Error **errp);
 
 void xive_eq_reset(XiveEQ *eq);
 void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon);
-- 
2.13.6

* [Qemu-devel] [PATCH v3 20/35] spapr: add support to dump XIVE information
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Modify the InterruptStatsProvider output to reflect the interrupt mode
currently in use by the machine.
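
The key property of this change is that a single negotiated bit selects
both the per-CPU presenter dump and the controller dump, so the two
never describe different modes. A minimal standalone model follows; the
function name and the "NVT"/"ICP"/"XIVE"/"ICS" labels are hypothetical
placeholders, not the real monitor output.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical model of spapr_pic_print_info(): 'xive_exploit' plays the
 * role of the OV5_XIVE_EXPLOIT test. Returns the number of bytes written,
 * or -1 if 'buf' is too small. */
static int toy_pic_print_info(char *buf, size_t len, int nr_cpus,
                              bool xive_exploit)
{
    const char *presenter = xive_exploit ? "NVT" : "ICP";
    const char *controller = xive_exploit ? "XIVE" : "ICS";
    size_t n = 0;
    int ret;

    /* One line per CPU, using the presenter of the active mode */
    for (int i = 0; i < nr_cpus; i++) {
        ret = snprintf(buf + n, len - n, "CPU[%04x]: %s\n", i, presenter);
        if (ret < 0 || (size_t)ret >= len - n) {
            return -1;
        }
        n += (size_t)ret;
    }

    /* Then the interrupt controller of the same mode */
    ret = snprintf(buf + n, len - n, "%s\n", controller);
    if (ret < 0 || (size_t)ret >= len - n) {
        return -1;
    }
    return (int)(n + (size_t)ret);
}
```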

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr.c | 12 ++++++++++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 0c59816bf3d6..e3567543e6e6 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -4015,10 +4015,18 @@ static void spapr_pic_print_info(InterruptStatsProvider *obj,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        icp_pic_print_info(ICP(cpu->intc), mon);
+        if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+            xive_nvt_pic_print_info(XIVE_NVT(cpu->intc), mon);
+        } else {
+            icp_pic_print_info(ICP(cpu->intc), mon);
+        }
     }
 
-    ics_pic_print_info(spapr->ics, mon);
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_xive_pic_print_info(spapr->xive, mon);
+    } else {
+        ics_pic_print_info(spapr->ics, mon);
+    }
 }
 
 int spapr_get_vcpu_id(PowerPCCPU *cpu)
-- 
2.13.6

* [Qemu-devel] [PATCH v3 21/35] spapr: advertise XIVE exploitation mode in CAS
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

Both the XIVE and XICS interrupt modes are advertised for the moment.
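
The ibm,arch-vec-5-platform-support values can be sketched as follows.
The array initializer is not visible in the hunks, so the layout here is
an assumption: (option vector 5 byte index, value) pairs, where val[1]
is the byte 23 (interrupt controller) value and val[3] the byte 24 (MMU)
value, which matches the indices the patch writes. 0x80 and 0xC0 are the
values used in the hunk; 0x00 for byte 23 when XIVE is disabled is an
assumed initial value. Only bytes 23 and 24 are modelled, and the
function name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_OV5_XIVE_BOTH 0x80  /* from the hunk: val[1] = 0x80 */
#define TOY_OV5_MMU_BOTH  0xC0  /* from the hunk: TCG hash + radix */

/* Fills the first two (byte index, value) pairs. Byte 23 is set outside
 * any KVM/TCG branch, since both branches in the hunk assign the same
 * value when XIVE exploitation is enabled. */
static void toy_ov5_platform_support(uint8_t val[4], bool xive_exploitation)
{
    val[0] = 23;
    val[1] = xive_exploitation ? TOY_OV5_XIVE_BOTH : 0x00;
    val[2] = 24;
    val[3] = TOY_OV5_MMU_BOTH;   /* TCG case only in this sketch */
}
```

The guest then picks one mode at CAS time, and the machine reads the
choice back through the OV5_XIVE_EXPLOIT bit of ov5_cas.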

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr.c | 25 ++++++++++++++++++++-----
 1 file changed, 20 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e3567543e6e6..d05c83cdb322 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1016,10 +1016,11 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
     spapr_dt_rtas_tokens(fdt, rtas);
 }
 
-/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
- * that the guest may request and thus the valid values for bytes 24..26 of
- * option vector 5: */
-static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
+/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
+ * and the XIVE features that the guest may request and thus the valid
+ * values for bytes 23..26 of option vector 5: */
+static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
+                                          int chosen)
 {
     PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
 
@@ -1042,7 +1043,16 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
         } else {
             val[3] = 0x00; /* Hash */
         }
+        /* TODO: introduce a kvmppc_has_cap_xive() ? Works with
+         * irqchip=off for now
+         */
+        if (spapr->xive_exploitation) {
+            val[1] = 0x80; /* OV5_XIVE_BOTH */
+        }
     } else {
+        if (spapr->xive_exploitation) {
+            val[1] = 0x80; /* OV5_XIVE_BOTH */
+        }
         /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
         val[3] = 0xC0;
     }
@@ -1110,7 +1120,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
         _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
     }
 
-    spapr_dt_ov5_platform_support(fdt, chosen);
+    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
 
     g_free(stdout_path);
     g_free(bootlist);
@@ -2599,6 +2609,11 @@ static void spapr_machine_init(MachineState *machine)
         spapr_ovec_set(spapr->ov5, OV5_HPT_RESIZE);
     }
 
+    /* advertise XIVE if not disabled by the user */
+    if (spapr->xive_exploitation) {
+        spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
+    }
+
     /* init CPUs */
     spapr_init_cpus(spapr);
 
-- 
2.13.6

* [Qemu-devel] [PATCH v3 22/35] spapr: add classes for the XIVE models
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE models for the emulated and the KVM modes will have a lot in
common. Introduce some classes to handle the differences, mostly to
synchronize the state with KVM for the monitor and migration. This is
very similar to what XICS does.
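
The optional-hook pattern used by these classes can be shown without
QOM. In this hypothetical sketch, ToySource and ToySourceClass stand in
for XiveSource and XiveSourceClass: a NULL hook means the emulated model
has nothing to do, while a KVM subclass fills the hook in to pull state
from the kernel. One deliberate difference: this toy_post_load()
returns the hook's result, whereas the wrappers in the hunks always
return 0.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ToySource ToySource;

/* Hypothetical class with optional hooks, NULL when not needed. */
typedef struct ToySourceClass {
    void (*synchronize_state)(ToySource *src);
    void (*pre_save)(ToySource *src);
    int (*post_load)(ToySource *src, int version_id);
} ToySourceClass;

struct ToySource {
    const ToySourceClass *klass;
    int syncs;                 /* counts hook invocations for the demo */
};

/* Generic wrappers, in the style of vmstate_xive_source_pre_save():
 * call the hook if the subclass provides one. */
static int toy_pre_save(ToySource *src)
{
    if (src->klass->pre_save) {
        src->klass->pre_save(src);
    }
    return 0;
}

static int toy_post_load(ToySource *src, int version_id)
{
    if (src->klass->post_load) {
        return src->klass->post_load(src, version_id);
    }
    return 0;
}

/* Stands in for reading the source state back from KVM. */
static void toy_kvm_pre_save(ToySource *src)
{
    src->syncs++;
}

static const ToySourceClass toy_emulated_class = { NULL, NULL, NULL };
static const ToySourceClass toy_kvm_class = { NULL, toy_kvm_pre_save, NULL };
```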

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c        | 32 ++++++++++++++++++
 hw/intc/xive.c              | 79 +++++++++++++++++++++++++++++++++++++++++++++
 include/hw/ppc/spapr_xive.h | 13 ++++++++
 include/hw/ppc/xive.h       | 30 +++++++++++++++++
 4 files changed, 154 insertions(+)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 98e067bfc90c..f0c2fe52b3c6 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -20,8 +20,13 @@
 
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 {
+    sPAPRXiveClass *sxc = SPAPR_XIVE_GET_CLASS(xive);
     int i;
 
+    if (sxc->synchronize_state) {
+        sxc->synchronize_state(xive);
+    }
+
     xive_source_pic_print_info(&xive->source, mon);
 
     monitor_printf(mon, "IVE Table\n");
@@ -150,6 +155,30 @@ static XiveEQ *spapr_xive_get_eq(XiveFabric *xf, uint32_t eq_idx)
     return xive_nvt_eq_get(nvt, SPAPR_XIVE_EQ_PRIO(eq_idx));
 }
 
+static int vmstate_spapr_xive_pre_save(void *opaque)
+{
+    sPAPRXive *xive = opaque;
+    sPAPRXiveClass *sxc = SPAPR_XIVE_GET_CLASS(xive);
+
+    if (sxc->pre_save) {
+        sxc->pre_save(xive);
+    }
+
+    return 0;
+}
+
+static int vmstate_spapr_xive_post_load(void *opaque, int version_id)
+{
+    sPAPRXive *xive = opaque;
+    sPAPRXiveClass *sxc = SPAPR_XIVE_GET_CLASS(xive);
+
+    if (sxc->post_load) {
+        sxc->post_load(xive, version_id);
+    }
+
+    return 0;
+}
+
 static const VMStateDescription vmstate_spapr_xive_ive = {
     .name = TYPE_SPAPR_XIVE "/ive",
     .version_id = 1,
@@ -164,6 +193,8 @@ static const VMStateDescription vmstate_spapr_xive = {
     .name = TYPE_SPAPR_XIVE,
     .version_id = 1,
     .minimum_version_id = 1,
+    .pre_save = vmstate_spapr_xive_pre_save,
+    .post_load = vmstate_spapr_xive_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
         VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
@@ -199,6 +230,7 @@ static const TypeInfo spapr_xive_info = {
     .instance_init = spapr_xive_init,
     .instance_size = sizeof(sPAPRXive),
     .class_init = spapr_xive_class_init,
+    .class_size = sizeof(sPAPRXiveClass),
     .interfaces = (InterfaceInfo[]) {
             { TYPE_XIVE_FABRIC },
             { },
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 2daa36f77a6b..11af3bf1184a 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -349,9 +349,14 @@ static char *xive_nvt_ring_print(uint8_t *ring)
 
 void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
 {
+    XiveNVTClass *xnc = XIVE_NVT_GET_CLASS(nvt);
     int cpu_index = nvt->cs ? nvt->cs->cpu_index : -1;
     char *s;
 
+    if (xnc->synchronize_state) {
+        xnc->synchronize_state(nvt);
+    }
+
     monitor_printf(mon, "CPU[%04x]: QW    NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
                    " W2\n", cpu_index);
 
@@ -366,6 +371,7 @@ void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
 static void xive_nvt_reset(void *dev)
 {
     XiveNVT *nvt = XIVE_NVT(dev);
+    XiveNVTClass *xnc = XIVE_NVT_GET_CLASS(nvt);
     int i;
 
     memset(nvt->regs, 0, sizeof(nvt->regs));
@@ -378,11 +384,16 @@ static void xive_nvt_reset(void *dev)
     for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
         xive_eq_reset(&nvt->eqt[i]);
     }
+
+    if (xnc->reset) {
+        xnc->reset(nvt);
+    }
 }
 
 static void xive_nvt_realize(DeviceState *dev, Error **errp)
 {
     XiveNVT *nvt = XIVE_NVT(dev);
+    XiveNVTClass *xnc = XIVE_NVT_GET_CLASS(nvt);
     PowerPCCPU *cpu;
     CPUPPCState *env;
     Object *obj;
@@ -410,6 +421,10 @@ static void xive_nvt_realize(DeviceState *dev, Error **errp)
         return;
     }
 
+    if (xnc->realize) {
+        xnc->realize(nvt, errp);
+    }
+
     qemu_register_reset(xive_nvt_reset, dev);
 }
 
@@ -442,10 +457,36 @@ static const VMStateDescription vmstate_xive_nvt_eq = {
     },
 };
 
+static int vmstate_xive_nvt_pre_save(void *opaque)
+{
+    XiveNVT *nvt = opaque;
+    XiveNVTClass *xnc = XIVE_NVT_GET_CLASS(nvt);
+
+    if (xnc->pre_save) {
+        xnc->pre_save(nvt);
+    }
+
+    return 0;
+}
+
+static int vmstate_xive_nvt_post_load(void *opaque, int version_id)
+{
+    XiveNVT *nvt = opaque;
+    XiveNVTClass *xnc = XIVE_NVT_GET_CLASS(nvt);
+
+    if (xnc->post_load) {
+        xnc->post_load(nvt, version_id);
+    }
+
+    return 0;
+}
+
 static const VMStateDescription vmstate_xive_nvt = {
     .name = TYPE_XIVE_NVT,
     .version_id = 1,
     .minimum_version_id = 1,
+    .pre_save = vmstate_xive_nvt_pre_save,
+    .post_load = vmstate_xive_nvt_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(regs, XiveNVT),
         VMSTATE_STRUCT_ARRAY(eqt, XiveNVT, (XIVE_PRIORITY_MAX + 1), 1,
@@ -470,6 +511,7 @@ static const TypeInfo xive_nvt_info = {
     .instance_size = sizeof(XiveNVT),
     .instance_init = xive_nvt_init,
     .class_init    = xive_nvt_class_init,
+    .class_size    = sizeof(XiveNVTClass),
 };
 
 /*
@@ -819,8 +861,13 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
 
 void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
 {
+    XiveSourceClass *xsc = XIVE_SOURCE_GET_CLASS(xsrc);
     int i;
 
+    if (xsc->synchronize_state) {
+        xsc->synchronize_state(xsrc);
+    }
+
     monitor_printf(mon, "XIVE Source %6x ..%6x\n",
                    xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
     for (i = 0; i < xsrc->nr_irqs; i++) {
@@ -840,6 +887,7 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
 static void xive_source_reset(DeviceState *dev)
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
+    XiveSourceClass *xsc = XIVE_SOURCE_GET_CLASS(xsrc);
     int i;
 
     /* Keep the IRQ type */
@@ -849,6 +897,10 @@ static void xive_source_reset(DeviceState *dev)
 
     /* SBEs are initialized to 0b01 which corresponds to "ints off" */
     memset(xsrc->sbe, 0x55, xsrc->sbe_size);
+
+    if (xsc->reset) {
+        xsc->reset(xsrc);
+    }
 }
 
 static void xive_source_realize(DeviceState *dev, Error **errp)
@@ -895,10 +947,36 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
     sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
 }
 
+static int vmstate_xive_source_pre_save(void *opaque)
+{
+    XiveSource *xsrc = opaque;
+    XiveSourceClass *xsc = XIVE_SOURCE_GET_CLASS(xsrc);
+
+    if (xsc->pre_save) {
+        xsc->pre_save(xsrc);
+    }
+
+    return 0;
+}
+
+static int vmstate_xive_source_post_load(void *opaque, int version_id)
+{
+    XiveSource *xsrc = opaque;
+    XiveSourceClass *xsc = XIVE_SOURCE_GET_CLASS(xsrc);
+
+    if (xsc->post_load) {
+        xsc->post_load(xsrc, version_id);
+    }
+
+    return 0;
+}
+
 static const VMStateDescription vmstate_xive_source = {
     .name = TYPE_XIVE_SOURCE,
     .version_id = 1,
     .minimum_version_id = 1,
+    .pre_save = vmstate_xive_source_pre_save,
+    .post_load = vmstate_xive_source_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
         VMSTATE_VBUFFER_UINT32(sbe, XiveSource, 1, NULL, sbe_size),
@@ -934,6 +1012,7 @@ static const TypeInfo xive_source_info = {
     .parent        = TYPE_SYS_BUS_DEVICE,
     .instance_size = sizeof(XiveSource),
     .class_init    = xive_source_class_init,
+    .class_size    = sizeof(XiveSourceClass),
 };
 
 static void xive_register_types(void)
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index df87e68b3d05..41e2784403b2 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -32,6 +32,19 @@ typedef struct sPAPRXive {
     MemoryRegion tm_mmio_os;
 } sPAPRXive;
 
+#define SPAPR_XIVE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(sPAPRXiveClass, (klass), TYPE_SPAPR_XIVE)
+#define SPAPR_XIVE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(sPAPRXiveClass, (obj), TYPE_SPAPR_XIVE)
+
+typedef struct sPAPRXiveClass {
+    SysBusDeviceClass parent_class;
+
+    void (*synchronize_state)(sPAPRXive *xive);
+    void (*pre_save)(sPAPRXive *xive);
+    int (*post_load)(sPAPRXive *xive, int version_id);
+} sPAPRXiveClass;
+
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 24ce58812a7c..36de10af0109 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -60,6 +60,20 @@ typedef struct XiveSource {
     XiveFabric   *xive;
 } XiveSource;
 
+#define XIVE_SOURCE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(XiveSourceClass, (klass), TYPE_XIVE_SOURCE)
+#define XIVE_SOURCE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(XiveSourceClass, (obj), TYPE_XIVE_SOURCE)
+
+typedef struct XiveSourceClass {
+    SysBusDeviceClass parent_class;
+
+    void (*synchronize_state)(XiveSource *xsrc);
+    void (*reset)(XiveSource *xsrc);
+    void (*pre_save)(XiveSource *xsrc);
+    int (*post_load)(XiveSource *xsrc, int version_id);
+} XiveSourceClass;
+
 /*
  * ESB MMIO setting. Can be one page, for both source triggering and
  * source management, or two different pages. See below for magic
@@ -186,6 +200,22 @@ typedef struct XiveNVT {
     XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
 } XiveNVT;
 
+
+#define XIVE_NVT_CLASS(klass) \
+     OBJECT_CLASS_CHECK(XiveNVTClass, (klass), TYPE_XIVE_NVT)
+#define XIVE_NVT_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(XiveNVTClass, (obj), TYPE_XIVE_NVT)
+
+typedef struct XiveNVTClass {
+    DeviceClass parent_class;
+
+    void (*realize)(XiveNVT *nvt, Error **errp);
+    void (*synchronize_state)(XiveNVT *nvt);
+    void (*reset)(XiveNVT *nvt);
+    void (*pre_save)(XiveNVT *nvt);
+    int (*post_load)(XiveNVT *nvt, int version_id);
+} XiveNVTClass;
+
 extern const MemoryRegionOps xive_tm_user_ops;
 extern const MemoryRegionOps xive_tm_os_ops;
 
-- 
2.13.6

* [Qemu-devel] [PATCH v3 23/35] target/ppc/kvm: add Linux KVM definitions for XIVE
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

These define a new capability and a new KVM device for the XIVE native
exploitation interrupt mode. New ioctls are also introduced to
initialize the KVM device and handle VM migration.
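
The new KVM_REG_PPC_VP_* identifiers follow the generic ONE_REG
encoding, in which the register size is carried in bits 52..55 of the
id. The sketch below decodes that size and maps an EQ priority to its
register id. The TOY_* constants are local copies of the Linux uapi
values (KVM_REG_PPC, KVM_REG_SIZE_SHIFT, KVM_REG_SIZE_MASK,
KVM_REG_SIZE_U128/U256), redefined so it builds without kernel headers;
the helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the uapi ONE_REG encoding constants */
#define TOY_KVM_REG_PPC        0x1000000000000000ULL
#define TOY_KVM_REG_SIZE_SHIFT 52
#define TOY_KVM_REG_SIZE_MASK  0x00f0000000000000ULL
#define TOY_KVM_REG_SIZE_U128  0x0040000000000000ULL
#define TOY_KVM_REG_SIZE_U256  0x0050000000000000ULL

/* Same composition as KVM_REG_PPC_VP_STATE and KVM_REG_PPC_VP_EQ0 */
#define TOY_REG_VP_STATE (TOY_KVM_REG_PPC | TOY_KVM_REG_SIZE_U128 | 0x8d)
#define TOY_REG_VP_EQ0   (TOY_KVM_REG_PPC | TOY_KVM_REG_SIZE_U256 | 0x8e)

/* Size in bytes of a ONE_REG register, decoded from its id. */
static uint64_t toy_reg_size(uint64_t id)
{
    return 1ULL << ((id & TOY_KVM_REG_SIZE_MASK) >> TOY_KVM_REG_SIZE_SHIFT);
}

/* The eight EQ registers are consecutive (0x8e..0x95), one per
 * priority 0..7, so priority p maps to EQ0 + p. */
static uint64_t toy_vp_eq_reg(unsigned prio)
{
    assert(prio <= 7);
    return TOY_REG_VP_EQ0 + prio;
}
```

With this layout the per-vCPU presenter state is a 16-byte register and
each event queue descriptor a 32-byte one, which userspace would
transfer with KVM_GET_ONE_REG/KVM_SET_ONE_REG.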

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 linux-headers/asm-powerpc/kvm.h | 18 ++++++++++++++++++
 linux-headers/linux/kvm.h       |  3 +++
 target/ppc/kvm.c                |  7 +++++++
 target/ppc/kvm_ppc.h            |  6 ++++++
 4 files changed, 34 insertions(+)

diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
index 833ed9a16adf..530c068fd850 100644
--- a/linux-headers/asm-powerpc/kvm.h
+++ b/linux-headers/asm-powerpc/kvm.h
@@ -480,6 +480,16 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
 
+#define KVM_REG_PPC_VP_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U128 | 0x8d)
+#define KVM_REG_PPC_VP_EQ0	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8e)
+#define KVM_REG_PPC_VP_EQ1	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8f)
+#define KVM_REG_PPC_VP_EQ2	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x90)
+#define KVM_REG_PPC_VP_EQ3	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x91)
+#define KVM_REG_PPC_VP_EQ4	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x92)
+#define KVM_REG_PPC_VP_EQ5	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x93)
+#define KVM_REG_PPC_VP_EQ6	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x94)
+#define KVM_REG_PPC_VP_EQ7	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x95)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC		1
 #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
@@ -673,4 +683,12 @@ struct kvm_ppc_cpu_char {
 #define  KVM_XICS_PRESENTED		(1ULL << 43)
 #define  KVM_XICS_QUEUED		(1ULL << 44)
 
+/* POWER9 XIVE Interrupt Controller */
+#define KVM_DEV_XIVE_GRP_SOURCES	1	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_CTRL		2
+#define   KVM_DEV_XIVE_GET_ESB_FD	1
+#define   KVM_DEV_XIVE_GET_TIMA_FD	2
+#define   KVM_DEV_XIVE_VC_BASE		3
+#define KVM_DEV_XIVE_GRP_IVE		3	/* 64-bit source attributes */
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index a167be89d1ec..2c20d34f194b 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -936,6 +936,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_PPC_GET_CPU_CHAR 151
 #define KVM_CAP_S390_BPB 152
 #define KVM_CAP_GET_MSR_FEATURES 153
+#define KVM_CAP_PPC_IRQ_XIVE 154
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1159,6 +1160,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
 	KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
+	KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index 79a436a38457..f5d770a77651 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -87,6 +87,7 @@ static int cap_fixup_hcalls;
 static int cap_htm;             /* Hardware transactional memory support */
 static int cap_mmu_radix;
 static int cap_mmu_hash_v3;
+static int cap_xive;
 static int cap_resize_hpt;
 static int cap_ppc_pvr_compat;
 static int cap_ppc_safe_cache;
@@ -150,6 +151,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
     cap_htm = kvm_vm_check_extension(s, KVM_CAP_PPC_HTM);
     cap_mmu_radix = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_RADIX);
     cap_mmu_hash_v3 = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_HASH_V3);
+    cap_xive = kvm_vm_check_extension(s, KVM_CAP_PPC_IRQ_XIVE);
     cap_resize_hpt = kvm_vm_check_extension(s, KVM_CAP_SPAPR_RESIZE_HPT);
     kvmppc_get_cpu_characteristics(s);
     /*
@@ -2461,6 +2463,11 @@ bool kvmppc_has_cap_mmu_hash_v3(void)
     return cap_mmu_hash_v3;
 }
 
+bool kvmppc_has_cap_xive(void)
+{
+    return cap_xive;
+}
+
 static void kvmppc_get_cpu_characteristics(KVMState *s)
 {
     struct kvm_ppc_cpu_char c;
diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
index 4d2789eef6ef..fef6b5d9ce9f 100644
--- a/target/ppc/kvm_ppc.h
+++ b/target/ppc/kvm_ppc.h
@@ -60,6 +60,7 @@ bool kvmppc_has_cap_fixup_hcalls(void);
 bool kvmppc_has_cap_htm(void);
 bool kvmppc_has_cap_mmu_radix(void);
 bool kvmppc_has_cap_mmu_hash_v3(void);
+bool kvmppc_has_cap_xive(void);
 int kvmppc_get_cap_safe_cache(void);
 int kvmppc_get_cap_safe_bounds_check(void);
 int kvmppc_get_cap_safe_indirect_branch(void);
@@ -299,6 +300,11 @@ static inline bool kvmppc_has_cap_mmu_hash_v3(void)
     return false;
 }
 
+static inline bool kvmppc_has_cap_xive(void)
+{
+    return false;
+}
+
 static inline int kvmppc_get_cap_safe_cache(void)
 {
     return 0;
-- 
2.13.6

* [Qemu-devel] [PATCH v3 24/35] spapr/xive: add common realize routine for KVM
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XiveSource and sPAPRXive device models will be shared between the
emulated and the KVM modes. The difference will reside in the way the
memory regions are initialized and in the qemu_irq handler. Introduce
common realize routines to share some code.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive.c        | 17 +++++++++++++++--
 hw/intc/xive.c              | 21 ++++++++++++++++-----
 include/hw/ppc/spapr_xive.h |  1 +
 include/hw/ppc/xive.h       |  3 +++
 4 files changed, 35 insertions(+), 7 deletions(-)

diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index f0c2fe52b3c6..bd604089ad49 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -84,9 +84,8 @@ static void spapr_xive_init(Object *obj)
     object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
 }
 
-static void spapr_xive_realize(DeviceState *dev, Error **errp)
+void spapr_xive_common_realize(sPAPRXive *xive, int esb_shift, Error **errp)
 {
-    sPAPRXive *xive = SPAPR_XIVE(dev);
     XiveSource *xsrc = &xive->source;
     Error *local_err = NULL;
 
@@ -105,6 +104,8 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
                             &error_fatal);
     object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
                             &error_fatal);
+    object_property_set_int(OBJECT(xsrc), esb_shift, "shift",
+                            &error_fatal);
     object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
                                    &error_fatal);
     object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
@@ -122,6 +123,18 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
      * level views of the TIMA.
      */
     xive->tm_base = XIVE_TM_BASE;
+}
+
+static void spapr_xive_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+    Error *local_err = NULL;
+
+    spapr_xive_common_realize(xive, XIVE_ESB_64K_2PAGE, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
 
     memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
                           &xive_tm_user_ops, xive, "xive.tima.user",
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 11af3bf1184a..520b532dbf09 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -903,13 +903,13 @@ static void xive_source_reset(DeviceState *dev)
     }
 }
 
-static void xive_source_realize(DeviceState *dev, Error **errp)
+void xive_source_common_realize(XiveSource *xsrc, qemu_irq_handler handler,
+                                Error **errp)
 {
-    XiveSource *xsrc = XIVE_SOURCE(dev);
     Object *obj;
     Error *local_err = NULL;
 
-    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
+    obj = object_property_get_link(OBJECT(xsrc), "xive", &local_err);
     if (!obj) {
         error_propagate(errp, local_err);
         error_prepend(errp, "required link 'xive' not found: ");
@@ -931,13 +931,24 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
-                                     xsrc->nr_irqs);
+    xsrc->qirqs = qemu_allocate_irqs(handler, xsrc, xsrc->nr_irqs);
     xsrc->status = g_malloc0(xsrc->nr_irqs);
 
     /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
     xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
     xsrc->sbe = g_malloc0(xsrc->sbe_size);
+}
+
+static void xive_source_realize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE(dev);
+    Error *local_err = NULL;
+
+    xive_source_common_realize(xsrc, xive_source_set_irq, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
 
     /* TODO: H_INT_ESB support, which removing the ESB MMIOs */
 
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 41e2784403b2..f3ac084a71be 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -51,6 +51,7 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 void spapr_xive_mmio_map(sPAPRXive *xive);
 void spapr_xive_mmio_unmap(sPAPRXive *xive);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, int lisn);
+void spapr_xive_common_realize(sPAPRXive *xive, int esb_shift, Error **errp);
 
 /*
  * sPAPR encoding of EQ indexes
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 36de10af0109..b040cf580fc9 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -175,6 +175,9 @@ static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
     return xsrc->qirqs[srcno];
 }
 
+void xive_source_common_realize(XiveSource *xsrc, qemu_irq_handler handler,
+                                Error **errp);
+
 /*
  * XIVE Interrupt Presenter
  */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 25/35] spapr/xive: add KVM support
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

This introduces a set of XIVE models specific to KVM. They handle the
synchronization of the state with KVM, for monitor usage and for
migration.

The TIMA and ESB MMIO regions are initialized differently under KVM.
Similarly to VFIO, 'ram device' memory mappings are exposed to the
guest, and the associated VMAs on the host are populated dynamically
with the appropriate pages by a fault handler.

For migration, the main sPAPRXive interrupt controller model saves and
restores the IVE table needed for routing. Each VCPU presenter model
(XiveNVT) does the same for the interrupt management registers and for
the XIVE EQs. The XIVE EQ pages, which reside in the guest RAM, are
marked dirty when the EQs are captured. These get/set operations rely
on their KVM counterparts in the host kernel, which acts as a proxy for
OPAL, the host firmware.
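
The per-VCPU capture of the EQs can be sketched as below, following the
loop in xive_nvt_kvm_get_state(): priorities 0 to 6 are considered,
priority 7 (the OPAL escalation queue) is skipped, and only EQs with the
valid bit set are kept. The two-word EQ layout and the EQ_VALID value are
simplified assumptions standing in for XiveEQ and EQ_W0_VALID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_PRIORITIES 8            /* priorities 0..7, as in the patch */
#define EQ_VALID      0x80000000u  /* assumed stand-in for EQ_W0_VALID */

/* Simplified stand-in for XiveEQ: only the first two words. */
typedef struct {
    uint32_t w0;
    uint32_t w1;
} EQ;

/* Priority 7 is the OPAL escalation queue and is never captured. */
static bool queue_is_valid(int prio)
{
    return prio >= 0 && prio <= 6;
}

/*
 * Copy the usable EQs from 'src' (e.g. what the kernel returned) into
 * 'dst' (the state to migrate). Skipped entries keep whatever 'dst'
 * already held. Returns the number of EQs copied.
 */
static int capture_eqs(EQ dst[NR_PRIORITIES], const EQ src[NR_PRIORITIES])
{
    int copied = 0;

    assert(dst != NULL && src != NULL);
    for (int i = 0; i < NR_PRIORITIES; i++) {
        if (!queue_is_valid(i) || !(src[i].w0 & EQ_VALID)) {
            continue;
        }
        memcpy(&dst[i], &src[i], sizeof(dst[i]));
        copied++;
    }
    return copied;
}
```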

The XiveSource model can save and restore the PQ bits directly in the
ESB MMIO region and does not need KVM support.
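
Saving and restoring the PQ bits relies only on special loads in the ESB
management page. The sketch below computes the offsets used by
xive_esb_read() and xive_source_kvm_set_state(). The constant values are
assumptions believed to match include/hw/ppc/xive.h and should be checked
against it before reuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed to match include/hw/ppc/xive.h; verify before reuse. */
#define ESB_64K_2PAGE 17     /* two 64K pages per source */
#define ESB_GET       0x800  /* load: return PQ without side effect */
#define ESB_SET_PQ_00 0xc00  /* load: return PQ, then set PQ=00 */

/*
 * Offset of operation 'op' for source 'srcno' in the ESB MMIO region.
 * With two pages per source, the odd (second) page is for management.
 */
static uint64_t esb_mgmt_offset(unsigned shift, bool two_page,
                                uint32_t srcno, uint32_t op)
{
    assert(shift > 0 && shift < 64);
    uint64_t off = ((uint64_t)1 << shift) * srcno;

    if (two_page) {
        off += (uint64_t)1 << (shift - 1);
    }
    return off + op;
}

/* Load offset that restores a given PQ value: SET_PQ_00 + (pq << 8). */
static uint32_t esb_set_pq_op(uint8_t pq)
{
    assert(pq <= 3);
    return ESB_SET_PQ_00 + ((uint32_t)pq << 8);
}
```

For source 1 with two 64K pages, the GET load lands at 0x20000 (start of
source 1) + 0x10000 (management page) + 0x800.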

The XIVE migration sequence is currently ordered with priorities, but
this might not be enough to correctly capture all HW state. Extra
quiescence points may be needed. Work in progress.
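
The intended ordering can be modelled as follows, assuming (as QEMU's
savevm code does) that handlers with a higher MigrationPriority value are
saved and loaded first. The enum is a simplified, hypothetical mirror of
the one in include/migration/vmstate.h, and QEMU keeps its handler list
ordered at registration time rather than calling qsort().

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified mirror of MigrationPriority after this patch. */
typedef enum {
    PRI_DEFAULT = 0,  /* e.g. the XIVE source */
    PRI_XIVE_IVE,     /* must happen before XIVE source */
    PRI_XIVE_NVT,     /* must happen before XIVE IVE */
} Priority;

typedef struct {
    const char *name;
    Priority pri;
} Handler;

/* Sort in descending priority: higher values are handled first. */
static int by_priority_desc(const void *a, const void *b)
{
    const Handler *ha = a;
    const Handler *hb = b;

    return (int)hb->pri - (int)ha->pri;
}

static void order_handlers(Handler *h, size_t n)
{
    assert(h != NULL || n == 0);
    qsort(h, n, sizeof(*h), by_priority_desc);
}
```

With this ordering, the NVT state reaches KVM before the IVE table, and
the IVE table before the source PQ bits.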

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 default-configs/ppc64-softmmu.mak |   1 +
 hw/intc/Makefile.objs             |   1 +
 hw/intc/spapr_xive.c              |   1 +
 hw/intc/spapr_xive_kvm.c          | 565 ++++++++++++++++++++++++++++++++++++++
 hw/intc/xive.c                    |   1 +
 include/hw/ppc/spapr_xive.h       |  23 ++
 include/hw/ppc/xive.h             |   3 +
 include/migration/vmstate.h       |   2 +
 8 files changed, 597 insertions(+)
 create mode 100644 hw/intc/spapr_xive_kvm.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index f8d34722931d..ac7e3af2473c 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -18,4 +18,5 @@ CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_XIVE=$(CONFIG_PSERIES)
 CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
+CONFIG_XIVE_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_MEM_HOTPLUG=y
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index eacd26836ebf..dd4d69db2bdd 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -39,6 +39,7 @@ obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
 obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
+obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index bd604089ad49..b8496033a8b9 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -208,6 +208,7 @@ static const VMStateDescription vmstate_spapr_xive = {
     .minimum_version_id = 1,
     .pre_save = vmstate_spapr_xive_pre_save,
     .post_load = vmstate_spapr_xive_post_load,
+    .priority = MIG_PRI_XIVE_IVE,
     .fields = (VMStateField[]) {
         VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
         VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
new file mode 100644
index 000000000000..ec5613fc2804
--- /dev/null
+++ b/hw/intc/spapr_xive_kvm.c
@@ -0,0 +1,565 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/kvm.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive.h"
+#include "hw/ppc/xive_regs.h"
+#include "kvm_ppc.h"
+
+#include <sys/ioctl.h>
+
+/* TODO: kernel_xive_fd is used as a global switch for XIVE */
+static int kernel_xive_fd = -1;
+
+typedef struct KVMEnabledCPU {
+    unsigned long vcpu_id;
+    QLIST_ENTRY(KVMEnabledCPU) node;
+} KVMEnabledCPU;
+
+static QLIST_HEAD(, KVMEnabledCPU)
+    kvm_enabled_cpus = QLIST_HEAD_INITIALIZER(&kvm_enabled_cpus);
+
+static bool xive_nvt_kvm_cpu_is_enabled(CPUState *cs)
+{
+    KVMEnabledCPU *enabled_cpu;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+
+    QLIST_FOREACH(enabled_cpu, &kvm_enabled_cpus, node) {
+        if (enabled_cpu->vcpu_id == vcpu_id) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static void xive_nvt_kvm_cpu_enable(CPUState *cs)
+{
+    KVMEnabledCPU *enabled_cpu;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+
+    enabled_cpu = g_malloc(sizeof(*enabled_cpu));
+    enabled_cpu->vcpu_id = vcpu_id;
+    QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
+}
+
+static inline bool xive_queue_is_valid(int priority)
+{
+    switch (priority) {
+    case 0 ... 6:
+        return true;
+    case 7: /* OPAL escalation queue */
+    default:
+        return false;
+    }
+}
+
+static void xive_nvt_kvm_get_state(XiveNVT *nvt)
+{
+    uint64_t state[2] = { 0 };
+    int ret;
+    int i;
+
+    ret = kvm_get_one_reg(nvt->cs, KVM_REG_PPC_VP_STATE, state);
+    if (ret != 0) {
+        error_report("Unable to retrieve KVM XIVE interrupt controller state"
+                " for CPU %ld: %s", kvm_arch_vcpu_id(nvt->cs), strerror(errno));
+        return;
+    }
+
+    /* First quad should be a backup of word0 and word1 of the OS
+     * ring. Second quad is the OPAL internal state which holds word4
+     * of the VP structure. We are only interested by the IPB in there
+     * but we should consider it as opaque.
+     *
+     * As we won't use the registers of the HV ring on sPAPR, let's
+     * hijack them to store the 'OPAL' state
+     */
+    *((uint64_t *) nvt->ring_os) = state[0];
+    *((uint64_t *) &nvt->regs[TM_QW2_HV_POOL]) = state[1];
+
+    /* Now dump all the queue internals  */
+    for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
+        XiveEQ eq = { 0 };
+
+        if (!xive_queue_is_valid(i)) {
+            continue;
+        }
+
+        ret = kvm_get_one_reg(nvt->cs, KVM_REG_PPC_VP_EQ0 + i, &eq);
+        if (ret != 0) {
+            error_report("Unable to retrieve KVM XIVE interrupt controller"
+                         " state for CPU %ld priority %d: %s",
+                         kvm_arch_vcpu_id(nvt->cs), i, strerror(errno));
+            return;
+        }
+
+        if (eq.w0 & EQ_W0_VALID) {
+            memcpy(&nvt->eqt[i], &eq, sizeof(nvt->eqt[i]));
+        }
+    }
+}
+
+static void xive_nvt_kvm_do_synchronize_state(CPUState *cpu,
+                                              run_on_cpu_data arg)
+{
+    xive_nvt_kvm_get_state(arg.host_ptr);
+}
+
+static void xive_nvt_kvm_synchronize_state(XiveNVT *nvt)
+{
+    if (nvt->cs) {
+        run_on_cpu(nvt->cs, xive_nvt_kvm_do_synchronize_state,
+                   RUN_ON_CPU_HOST_PTR(nvt));
+    }
+}
+
+static int xive_nvt_kvm_set_state(XiveNVT *nvt, int version_id)
+{
+    uint64_t state[2];
+    int ret;
+    int i;
+
+    /* NVT for this CPU thread is not in use, exiting */
+    if (!nvt->cs) {
+        return 0;
+    }
+
+    state[0] = *((uint64_t *) nvt->ring_os);
+    state[1] = *((uint64_t *) &nvt->regs[TM_QW2_HV_POOL]);
+
+    ret = kvm_set_one_reg(nvt->cs, KVM_REG_PPC_VP_STATE, state);
+    if (ret != 0) {
+        error_report("Unable to restore KVM XIVE interrupt controller state"
+                     " for CPU %ld: %s", kvm_arch_vcpu_id(nvt->cs),
+                     strerror(errno));
+        return ret;
+    }
+
+    for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
+        XiveEQ *eq = &nvt->eqt[i];
+
+        if (!xive_queue_is_valid(i)) {
+            continue;
+        }
+
+        if (!(eq->w0 & EQ_W0_VALID)) {
+            continue;
+        }
+
+        ret = kvm_set_one_reg(nvt->cs, KVM_REG_PPC_VP_EQ0 + i, eq);
+        if (ret != 0) {
+            error_report("Unable to restore KVM XIVE interrupt controller"
+                         " state for CPU %ld priority %d: %s",
+                         kvm_arch_vcpu_id(nvt->cs), i, strerror(errno));
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void xive_nvt_kvm_reset(XiveNVT *nvt)
+{
+    /* XIVE is not enabled at first machine reset, only after CAS. */
+    if (kernel_xive_fd == -1) {
+        return;
+    }
+
+    xive_nvt_kvm_set_state(nvt, 1);
+}
+
+static void xive_nvt_kvm_realize(XiveNVT *nvt, Error **errp)
+{
+    CPUState *cs = nvt->cs;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+    int ret;
+
+    /* Check if CPU was hot unplugged and replugged. */
+    if (xive_nvt_kvm_cpu_is_enabled(cs)) {
+        return;
+    }
+
+    ret = kvm_vcpu_enable_cap(cs, KVM_CAP_PPC_IRQ_XIVE, 0, kernel_xive_fd,
+                              vcpu_id, 0);
+    if (ret < 0) {
+        error_setg(errp, "Unable to connect CPU%ld to KVM XIVE device: %s",
+                   vcpu_id, strerror(errno));
+        return;
+    }
+
+    xive_nvt_kvm_cpu_enable(cs);
+}
+
+static void xive_nvt_kvm_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveNVTClass *xnc = XIVE_NVT_CLASS(klass);
+
+    dc->desc = "XIVE KVM Interrupt Presenter";
+
+    xnc->realize = xive_nvt_kvm_realize;
+    xnc->synchronize_state = xive_nvt_kvm_synchronize_state;
+    xnc->reset = xive_nvt_kvm_reset;
+    xnc->pre_save = xive_nvt_kvm_get_state;
+    xnc->post_load = xive_nvt_kvm_set_state;
+}
+
+static const TypeInfo xive_nvt_kvm_info = {
+    .name          = TYPE_XIVE_NVT_KVM,
+    .parent        = TYPE_XIVE_NVT,
+    .instance_size = sizeof(XiveNVT),
+    .class_init    = xive_nvt_kvm_class_init,
+    .class_size    = sizeof(XiveNVTClass),
+};
+
+static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
+{
+    XiveSource *xsrc = opaque;
+    struct kvm_irq_level args;
+    int rc;
+
+    args.irq = srcno + xsrc->offset;
+    if (!xive_source_irq_is_lsi(xsrc, srcno)) {
+        if (!val) {
+            return;
+        }
+        args.level = KVM_INTERRUPT_SET;
+    } else {
+        args.level = val ? KVM_INTERRUPT_SET_LEVEL : KVM_INTERRUPT_UNSET;
+    }
+    rc = kvm_vm_ioctl(kvm_state, KVM_IRQ_LINE, &args);
+    if (rc < 0) {
+        error_report("kvm_irq_line() failed : %s", strerror(errno));
+    }
+}
+
+static void xive_source_kvm_reset(XiveSource *xsrc)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
+    int i;
+
+    /* XIVE is not enabled at first machine reset, only after CAS. */
+    if (xive->fd == -1) {
+        return;
+    }
+
+    /*
+     * At reset, interrupt sources are simply created and MASKED, we
+     * only need to inform KVM about their type: LSI or MSI.
+     */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        Error *err = NULL;
+        uint64_t state = 0;
+
+        if (xive_source_irq_is_lsi(xsrc, i)) {
+            state |= KVM_XICS_LEVEL_SENSITIVE;
+        }
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SOURCES,
+                          i + xsrc->offset, &state, true, &err);
+        if (err) {
+            error_report_err(err);
+            return;
+        }
+    }
+}
+
+/*
+ * This is used to perform the magic loads from an ESB described in
+ * xive.h.
+ */
+static uint8_t xive_esb_read(XiveSource *xsrc, int srcno, uint32_t offset)
+{
+    unsigned long addr = (unsigned long) xsrc->esb_mmap +
+        (1ull << xsrc->esb_shift) * srcno;
+
+    /* In a two pages ESB MMIO setting, the odd page is for management */
+    if (xive_source_esb_2page(xsrc)) {
+        addr += (1 << (xsrc->esb_shift - 1));
+    }
+    addr += offset;
+
+    return *((uint8_t *) addr);
+}
+
+static void xive_source_kvm_get_state(XiveSource *xsrc)
+{
+    int i;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        /* Perform a load without side effect to retrieve the PQ bits */
+        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_GET);
+
+        /* and save PQ locally */
+        xive_source_pq_set(xsrc, i, pq);
+    }
+}
+
+static int xive_source_kvm_set_state(XiveSource *xsrc, int version_id)
+{
+    int i;
+    int unused = 0;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        uint8_t pq = xive_source_pq_get(xsrc, i);
+
+        /* TODO: prevent the compiler from optimizing away the load */
+        unused |= xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8));
+    }
+
+    return unused;
+}
+
+static void xive_source_kvm_synchronize_state(XiveSource *xsrc)
+{
+    xive_source_kvm_get_state(xsrc);
+}
+
+static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
+
+    xive_source_common_realize(xsrc, xive_source_kvm_set_irq, errp);
+
+    /* The memory regions for the ESB MMIOs will be initialized after
+     * KVM is, but we need to declare them on SysBus for the first
+     * munmap to work in spapr_reset_interrupt() */
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
+}
+
+static void xive_source_kvm_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveSourceClass *xsc = XIVE_SOURCE_CLASS(klass);
+
+    dc->realize = xive_source_kvm_realize;
+    dc->desc = "XIVE KVM interrupt source";
+
+    xsc->synchronize_state = xive_source_kvm_synchronize_state;
+    xsc->reset = xive_source_kvm_reset;
+    xsc->pre_save = xive_source_kvm_get_state;
+    xsc->post_load = xive_source_kvm_set_state;
+}
+
+static const TypeInfo xive_source_kvm_info = {
+    .name = TYPE_XIVE_SOURCE_KVM,
+    .parent = TYPE_XIVE_SOURCE,
+    .instance_size = sizeof(XiveSource),
+    .class_init    = xive_source_kvm_class_init,
+    .class_size    = sizeof(XiveSourceClass),
+};
+
+static void spapr_xive_kvm_get_state(sPAPRXive *xive)
+{
+    XiveSource *xsrc = &xive->source;
+    int i;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+
+        if (!(ive->w & IVE_VALID)) {
+            continue;
+        }
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_IVE,
+                          i + xsrc->offset, ive, false, &error_abort);
+    }
+}
+
+static int spapr_xive_kvm_set_state(sPAPRXive *xive, int version_id)
+{
+    XiveSource *xsrc = &xive->source;
+    int i;
+
+    /* First initialize the KVM device sources, as XIVE is not enabled
+     * at machine reset */
+    xive_source_kvm_reset(xsrc);
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+        Error *err = NULL;
+
+        if (!(ive->w & IVE_VALID) || ive->w & IVE_MASKED) {
+            continue;
+        }
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_IVE,
+                          i + xsrc->offset, ive, true, &err);
+        if (err) {
+            error_report_err(err);
+            return -1;
+        }
+    }
+    return 0;
+}
+
+static void spapr_xive_kvm_synchronize_state(sPAPRXive *xive)
+{
+    spapr_xive_kvm_get_state(xive);
+}
+
+static void spapr_xive_kvm_init(Object *obj)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(obj);
+
+    /* We need a KVM flavored source */
+    object_initialize(&xive->source, sizeof(xive->source),
+                      TYPE_XIVE_SOURCE_KVM);
+    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
+}
+
+static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
+    Error *local_err = NULL;
+
+    spapr_xive_common_realize(xive, XIVE_ESB_64K_2PAGE, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    xive->fd = -1;
+
+    /* The memory regions for the TIMA MMIOs will be initialized after
+     * KVM is, but we need to declare them on SysBus for the first
+     * munmap to work in spapr_reset_interrupt() */
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
+}
+
+static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRXiveClass *sxc = SPAPR_XIVE_CLASS(klass);
+
+    dc->realize = spapr_xive_kvm_realize;
+    /* spapr_xive_reset() is used for reset. */
+    dc->desc = "sPAPR XIVE KVM interrupt controller";
+
+    sxc->synchronize_state = spapr_xive_kvm_synchronize_state;
+    sxc->pre_save = spapr_xive_kvm_get_state;
+    sxc->post_load = spapr_xive_kvm_set_state;
+}
+
+static const TypeInfo spapr_xive_kvm_info = {
+    .name = TYPE_SPAPR_XIVE_KVM,
+    .parent = TYPE_SPAPR_XIVE,
+    .instance_init = spapr_xive_kvm_init,
+    .instance_size = sizeof(sPAPRXive),
+    .class_init = spapr_xive_kvm_class_init,
+    .class_size = sizeof(sPAPRXiveClass),
+};
+
+static void xive_kvm_register_types(void)
+{
+    type_register_static(&spapr_xive_kvm_info);
+    type_register_static(&xive_source_kvm_info);
+    type_register_static(&xive_nvt_kvm_info);
+}
+
+type_init(xive_kvm_register_types)
+
+static void *spapr_xive_kvm_mmap(sPAPRXive *xive, int ctrl, size_t len,
+                                 Error **errp)
+{
+    Error *local_err = NULL;
+    void *addr;
+    int fd;
+
+    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, ctrl, &fd, false,
+                      &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    addr = mmap(NULL, len, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
+    close(fd);
+    if (addr == MAP_FAILED) {
+        error_setg_errno(errp, errno, "Unable to set up XIVE mmapping");
+        return NULL;
+    }
+
+    return addr;
+}
+
+void xive_kvm_init(sPAPRXive *xive, Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
+    size_t esb_len, tima_len;
+
+    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
+        error_setg(errp,
+                   "IRQ_XIVE capability must be present for KVM XIVE device");
+        return;
+    }
+
+    xive->fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_XIVE, false);
+    if (xive->fd < 0) {
+        error_setg_errno(errp, -xive->fd, "error creating KVM XIVE device");
+        return;
+    }
+
+    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, KVM_DEV_XIVE_VC_BASE,
+                      &xsrc->esb_base, true, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* ESB MMIO region */
+    esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
+    xsrc->esb_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_ESB_FD,
+                                         esb_len, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    memory_region_init_ram_device_ptr(&xsrc->esb_mmio, OBJECT(xsrc),
+                                      "xive.esb", esb_len, xsrc->esb_mmap);
+
+    /* TIMA USER MMIO region */
+    tima_len = 1ull << TM_SHIFT;
+    xive->tm_mmap_user = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_TIMA_FD,
+                                             tima_len, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    memory_region_init_ram_device_ptr(&xive->tm_mmio_user, OBJECT(xive),
+                                      "xive.tima.user", tima_len,
+                                      xive->tm_mmap_user);
+
+    /* TIMA OS MMIO region */
+    xive->tm_mmap_os = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_TIMA_FD,
+                                           tima_len, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    memory_region_init_ram_device_ptr(&xive->tm_mmio_os, OBJECT(xive),
+                                      "xive.tima.os", tima_len,
+                                      xive->tm_mmap_os);
+
+    kernel_xive_fd = xive->fd;
+    kvm_kernel_irqchip = true;
+    kvm_msi_via_irqfd_allowed = true;
+    kvm_gsi_direct_mapping = true;
+}
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 520b532dbf09..d96732cfe6be 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -487,6 +487,7 @@ static const VMStateDescription vmstate_xive_nvt = {
     .minimum_version_id = 1,
     .pre_save = vmstate_xive_nvt_pre_save,
     .post_load = vmstate_xive_nvt_post_load,
+    .priority = MIG_PRI_XIVE_NVT,
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(regs, XiveNVT),
         VMSTATE_STRUCT_ARRAY(eqt, XiveNVT, (XIVE_PRIORITY_MAX + 1), 1,
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index f3ac084a71be..75b790cc9730 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -30,6 +30,11 @@ typedef struct sPAPRXive {
     hwaddr       tm_base;
     MemoryRegion tm_mmio_user;
     MemoryRegion tm_mmio_os;
+
+    /* KVM support */
+    int          fd;
+    void         *tm_mmap_user;
+    void         *tm_mmap_os;
 } sPAPRXive;
 
 #define SPAPR_XIVE_CLASS(klass) \
@@ -66,4 +71,22 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr);
 void spapr_dt_xive(sPAPRMachineState *spapr, int nr_servers, void *fdt,
                    uint32_t phandle);
 
+/*
+ * XIVE KVM device
+ */
+
+#define TYPE_SPAPR_XIVE_KVM "spapr-xive-kvm"
+#define SPAPR_XIVE_KVM(obj) \
+    OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_KVM)
+
+#define TYPE_XIVE_SOURCE_KVM "xive-source-kvm"
+#define XIVE_SOURCE_KVM(obj) \
+    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_KVM)
+
+#define TYPE_XIVE_NVT_KVM "xive-nvt-kvm"
+#define XIVE_NVT_KVM(obj) \
+    OBJECT_CHECK(XiveNVT, (obj), TYPE_XIVE_NVT_KVM)
+
+void xive_kvm_init(sPAPRXive *xive, Error **errp);
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index b040cf580fc9..e99cd874ef3c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -58,6 +58,9 @@ typedef struct XiveSource {
     MemoryRegion esb_mmio;
 
     XiveFabric   *xive;
+
+    /* KVM support */
+    void         *esb_mmap;
 } XiveSource;
 
 #define XIVE_SOURCE_CLASS(klass) \
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index df463fd33d69..482bdf365297 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -151,6 +151,8 @@ typedef enum {
     MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
     MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
     MIG_PRI_GICV3,              /* Must happen before the ITS */
+    MIG_PRI_XIVE_IVE,           /* Must happen before XIVE source */
+    MIG_PRI_XIVE_NVT,           /* Must happen before XIVE IVE */
     MIG_PRI_MAX,
 } MigrationPriority;
 
-- 
2.13.6

* [Qemu-devel] [PATCH v3 26/35] spapr/xive: add a XIVE KVM device to the machine
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

As the VM is connected to the KVM interrupt device at machine init,
KVM support can be enabled for XICS or for XIVE, but not for both at
the same time. This should change later on, once KVM resets are
supported.
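
The resulting choice of in-kernel interrupt device can be summarized as
a small decision function. The names are hypothetical, and use_xive
stands for whichever condition selects xive_system_init() in
spapr_machine_init(); that condition lies outside the quoted hunk.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for the possible interrupt controller setups. */
typedef enum {
    IRQCHIP_EMULATED,  /* no in-kernel device, QEMU emulates the controller */
    IRQCHIP_KVM_XICS,
    IRQCHIP_KVM_XIVE,
} IrqchipChoice;

/*
 * At most one in-kernel device is created: XIVE when the machine uses
 * XIVE, XICS otherwise, and only if KVM is on and kernel_irqchip allows it.
 */
static IrqchipChoice select_irqchip(bool kvm_enabled, bool irqchip_allowed,
                                    bool use_xive)
{
    if (!kvm_enabled || !irqchip_allowed) {
        return IRQCHIP_EMULATED;
    }
    return use_xive ? IRQCHIP_KVM_XIVE : IRQCHIP_KVM_XICS;
}
```

Because the choice is made once at machine init, the interrupt mode
negotiated later at CAS presumably cannot switch from one in-kernel
device to the other, which is the limitation described above.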

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xics_kvm.c | 22 +++++++++++++++++++++-
 hw/ppc/spapr.c     | 34 +++++++++++++++++++++++-----------
 2 files changed, 44 insertions(+), 12 deletions(-)

diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index 89fb20e2c55c..e727397c4a4d 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -62,6 +62,11 @@ static void icp_get_kvm_state(ICPState *icp)
     };
     int ret;
 
+    /* A change of the interrupt mode can disable XICS */
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
     /* ICP for this CPU thread is not in use, exiting */
     if (!icp->cs) {
         return;
@@ -102,6 +107,11 @@ static int icp_set_kvm_state(ICPState *icp, int version_id)
     };
     int ret;
 
+    /* Protect resets. A change of the interrupt mode can disable XICS */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
     /* ICP for this CPU thread is not in use, exiting */
     if (!icp->cs) {
         return 0;
@@ -135,7 +145,7 @@ static void icp_kvm_realize(ICPState *icp, Error **errp)
     int ret;
 
     if (kernel_xics_fd == -1) {
-        abort();
+        return;
     }
 
     /*
@@ -192,6 +202,11 @@ static void ics_get_kvm_state(ICSState *ics)
     };
     int i;
 
+    /* A change of the interrupt mode can disable XICS */
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
     for (i = 0; i < ics->nr_irqs; i++) {
         ICSIRQState *irq = &ics->irqs[i];
         int ret;
@@ -262,6 +277,11 @@ static int ics_set_kvm_state(ICSState *ics, int version_id)
     };
     int i;
 
+    /* Protect resets. A change of the interrupt mode can disable XICS */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
     for (i = 0; i < ics->nr_irqs; i++) {
         ICSIRQState *irq = &ics->irqs[i];
         int ret;
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d05c83cdb322..c98ceeed9d6f 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -189,8 +189,7 @@ static void xics_system_init(MachineState *machine, int nr_irqs, Error **errp)
     sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
 
     if (kvm_enabled()) {
-        if (machine_kernel_irqchip_allowed(machine) &&
-            !xics_kvm_init(spapr, errp)) {
+        if (machine_kernel_irqchip_allowed(machine)) {
             spapr->icp_type = TYPE_KVM_ICP;
             spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs, errp);
         }
@@ -239,10 +238,16 @@ static void xive_system_init(MachineState *machine, int nr_irqs, Error **errp)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(machine);
 
-    /* We don't have KVM support yet, so check for irqchip=on */
-    if (kvm_enabled() && machine_kernel_irqchip_required(machine)) {
-        error_report("kernel_irqchip requested. no XIVE support");
-        exit(1);
+    if (kvm_enabled()) {
+        if (machine_kernel_irqchip_allowed(machine)) {
+            spapr->nvt_type = TYPE_XIVE_NVT_KVM;
+            spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM,
+                                            nr_irqs, errp);
+        }
+        if (machine_kernel_irqchip_required(machine) && !spapr->xive) {
+            error_prepend(errp, "kernel_irqchip requested but unavailable: ");
+            return;
+        }
     }
 
     if (spapr->xive) {
@@ -1043,11 +1048,9 @@ static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
         } else {
             val[3] = 0x00; /* Hash */
         }
-        /* TODO: introduce a kvmppc_has_cap_xive() ? Works with
-         * irqchip=off for now
-         */
-        if (spapr->xive_exploitation) {
-            val[1] = 0x80; /* OV5_XIVE_BOTH */
+        /* TODO: when under KVM, only advertise XIVE and not both modes */
+        if (spapr->xive_exploitation && kvmppc_has_cap_xive()) {
+            val[1] = 0x40; /* OV5_XIVE_EXPLOIT */
         }
     } else {
         if (spapr->xive_exploitation) {
@@ -2580,6 +2583,15 @@ static void spapr_machine_init(MachineState *machine)
         /* XIVE uses the full range of IRQ numbers. */
         xive_system_init(machine, XICS_IRQ_BASE + XICS_IRQS_SPAPR,
                          &error_fatal);
+
+        /* TODO: initialize KVM for XIVE or for XICS but not for both */
+        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+            xive_kvm_init(spapr->xive, &error_fatal);
+        }
+    } else {
+        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+            xics_kvm_init(spapr, &error_fatal);
+        }
     }
 
     /* Set up containers for ibm,client-architecture-support negotiated options
-- 
2.13.6


* [Qemu-devel] [PATCH v3 27/35] migration: discard non-migratable RAMBlocks
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (25 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 26/35] spapr/xive: add a XIVE KVM device to the machine Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 28/35] intc: introduce a CPUIntc interface Cédric Le Goater
                   ` (8 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

On the POWER9 processor, the XIVE interrupt controller can control
interrupt sources using MMIO to trigger events, to EOI or to turn off
the sources. Priority management and interrupt acknowledgment are also
controlled by MMIO in the presenter sub-engine.

These MMIO regions are exposed to guests in QEMU with a set of 'ram
device' memory mappings, similarly to VFIO, and the VMAs are populated
dynamically with the appropriate pages using a fault handler.

However, these regions are a problem for migration. The associated
RAMBlocks need to be discarded from the RAM state on the source VM,
and the destination VM has to rebuild the memory mappings on the new
host in the post_load() operation, just before resuming the system.

To achieve this, this patch introduces a new RAMBlock flag,
RAM_MIGRATABLE, which is set and cleared by the vmstate_register_ram()
and vmstate_unregister_ram() routines (through qemu_ram_set_idstr()
and qemu_ram_unset_idstr()). The migration code then uses this flag to
identify the RAMBlocks to discard on the source. Some checks are also
performed on the destination to make sure no invalid block was sent.
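For illustration, here is a minimal standalone model of the source-side
filtering, assuming a simplified block structure. RAMBlockSketch,
BLOCK_FOREACH and migratable_bytes_total() are hypothetical names, not
QEMU code. The sketch reuses the same `if (!cond) {} else` trick as
RAMBLOCK_FOREACH_MIGRATABLE, so the caller's loop body binds to the
else branch and unflagged blocks are skipped:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag value, mirroring RAM_MIGRATABLE in exec.c. */
#define RAM_MIGRATABLE (1 << 4)

/* Hypothetical, stripped-down stand-in for QEMU's RAMBlock. */
typedef struct RAMBlockSketch {
    const char *idstr;
    uint64_t used_length;
    unsigned int flags;
    struct RAMBlockSketch *next;
} RAMBlockSketch;

static bool ram_is_migratable(const RAMBlockSketch *rb)
{
    return rb->flags & RAM_MIGRATABLE;
}

/* Walk every block in the list. */
#define BLOCK_FOREACH(head, b) \
    for ((b) = (head); (b) != NULL; (b) = (b)->next)

/*
 * Walk only migratable blocks: the caller's loop body becomes the
 * "else" branch, so blocks without the flag are silently skipped.
 */
#define BLOCK_FOREACH_MIGRATABLE(head, b) \
    BLOCK_FOREACH(head, b)                \
        if (!ram_is_migratable(b)) {} else

/* Like ram_bytes_total(), but restricted to migratable blocks. */
static uint64_t migratable_bytes_total(RAMBlockSketch *head)
{
    RAMBlockSketch *b;
    uint64_t total = 0;

    BLOCK_FOREACH_MIGRATABLE(head, b) {
        total += b->used_length;
    }
    return total;
}
```

With a 1 GiB block marked migratable and an unmarked 64 KiB mapping in
the list, migratable_bytes_total() only counts the former.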

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 exec.c                    | 10 ++++++++++
 include/exec/cpu-common.h |  1 +
 migration/ram.c           | 42 ++++++++++++++++++++++++++++++++----------
 3 files changed, 43 insertions(+), 10 deletions(-)

diff --git a/exec.c b/exec.c
index 02b1efebb7c3..e9432c06294e 100644
--- a/exec.c
+++ b/exec.c
@@ -104,6 +104,9 @@ static MemoryRegion io_mem_unassigned;
  * (Set during postcopy)
  */
 #define RAM_UF_ZEROPAGE (1 << 3)
+
+/* RAM can be migrated */
+#define RAM_MIGRATABLE (1 << 4)
 #endif
 
 #ifdef TARGET_PAGE_BITS_VARY
@@ -1807,6 +1810,11 @@ void qemu_ram_set_uf_zeroable(RAMBlock *rb)
     rb->flags |= RAM_UF_ZEROPAGE;
 }
 
+bool qemu_ram_is_migratable(RAMBlock *rb)
+{
+    return rb->flags & RAM_MIGRATABLE;
+}
+
 /* Called with iothread lock held.  */
 void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
 {
@@ -1823,6 +1831,7 @@ void qemu_ram_set_idstr(RAMBlock *new_block, const char *name, DeviceState *dev)
         }
     }
     pstrcat(new_block->idstr, sizeof(new_block->idstr), name);
+    new_block->flags |= RAM_MIGRATABLE;
 
     rcu_read_lock();
     RAMBLOCK_FOREACH(block) {
@@ -1845,6 +1854,7 @@ void qemu_ram_unset_idstr(RAMBlock *block)
      */
     if (block) {
         memset(block->idstr, 0, sizeof(block->idstr));
+        block->flags &= ~RAM_MIGRATABLE;
     }
 }
 
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 24d335f95d45..96854519b514 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -75,6 +75,7 @@ const char *qemu_ram_get_idstr(RAMBlock *rb);
 bool qemu_ram_is_shared(RAMBlock *rb);
 bool qemu_ram_is_uf_zeroable(RAMBlock *rb);
 void qemu_ram_set_uf_zeroable(RAMBlock *rb);
+bool qemu_ram_is_migratable(RAMBlock *rb);
 
 size_t qemu_ram_pagesize(RAMBlock *block);
 size_t qemu_ram_pagesize_largest(void);
diff --git a/migration/ram.c b/migration/ram.c
index 0e90efa09236..89c3accc4e26 100644
--- a/migration/ram.c
+++ b/migration/ram.c
@@ -187,6 +187,11 @@ void ramblock_recv_bitmap_set_range(RAMBlock *rb, void *host_addr,
                       nr);
 }
 
+/* Should be holding either ram_list.mutex, or the RCU lock. */
+#define RAMBLOCK_FOREACH_MIGRATABLE(block)             \
+    RAMBLOCK_FOREACH(block)                            \
+        if (!qemu_ram_is_migratable(block)) {} else
+
 /*
  * An outstanding page request, on the source, having been received
  * and queued
@@ -780,6 +785,10 @@ unsigned long migration_bitmap_find_dirty(RAMState *rs, RAMBlock *rb,
     unsigned long *bitmap = rb->bmap;
     unsigned long next;
 
+    if (!qemu_ram_is_migratable(rb)) {
+        return size;
+    }
+
     if (rs->ram_bulk_stage && start > 0) {
         next = start + 1;
     } else {
@@ -825,7 +834,7 @@ uint64_t ram_pagesize_summary(void)
     RAMBlock *block;
     uint64_t summary = 0;
 
-    RAMBLOCK_FOREACH(block) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         summary |= block->page_size;
     }
 
@@ -849,7 +858,7 @@ static void migration_bitmap_sync(RAMState *rs)
 
     qemu_mutex_lock(&rs->bitmap_mutex);
     rcu_read_lock();
-    RAMBLOCK_FOREACH(block) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         migration_bitmap_sync_range(rs, block, 0, block->used_length);
     }
     rcu_read_unlock();
@@ -1499,6 +1508,10 @@ static int ram_save_host_page(RAMState *rs, PageSearchStatus *pss,
     size_t pagesize_bits =
         qemu_ram_pagesize(pss->block) >> TARGET_PAGE_BITS;
 
+    if (!qemu_ram_is_migratable(pss->block)) {
+        return 0;
+    }
+
     do {
         tmppages = ram_save_target_page(rs, pss, last_stage);
         if (tmppages < 0) {
@@ -1587,7 +1600,7 @@ uint64_t ram_bytes_total(void)
     uint64_t total = 0;
 
     rcu_read_lock();
-    RAMBLOCK_FOREACH(block) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         total += block->used_length;
     }
     rcu_read_unlock();
@@ -1642,7 +1655,7 @@ static void ram_save_cleanup(void *opaque)
      */
     memory_global_dirty_log_stop();
 
-    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         g_free(block->bmap);
         block->bmap = NULL;
         g_free(block->unsentmap);
@@ -1705,7 +1718,7 @@ void ram_postcopy_migrated_memory_release(MigrationState *ms)
 {
     struct RAMBlock *block;
 
-    RAMBLOCK_FOREACH(block) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         unsigned long *bitmap = block->bmap;
         unsigned long range = block->used_length >> TARGET_PAGE_BITS;
         unsigned long run_start = find_next_zero_bit(bitmap, range, 0);
@@ -1783,7 +1796,7 @@ static int postcopy_each_ram_send_discard(MigrationState *ms)
     struct RAMBlock *block;
     int ret;
 
-    RAMBLOCK_FOREACH(block) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         PostcopyDiscardState *pds =
             postcopy_discard_send_init(ms, block->idstr);
 
@@ -1991,7 +2004,7 @@ int ram_postcopy_send_discard_bitmap(MigrationState *ms)
     rs->last_sent_block = NULL;
     rs->last_page = 0;
 
-    QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         unsigned long pages = block->used_length >> TARGET_PAGE_BITS;
         unsigned long *bitmap = block->bmap;
         unsigned long *unsentmap = block->unsentmap;
@@ -2150,7 +2163,7 @@ static void ram_list_init_bitmaps(void)
 
     /* Skip setting bitmap if there is no RAM */
     if (ram_bytes_total()) {
-        QLIST_FOREACH_RCU(block, &ram_list.blocks, next) {
+        RAMBLOCK_FOREACH_MIGRATABLE(block) {
             pages = block->max_length >> TARGET_PAGE_BITS;
             block->bmap = bitmap_new(pages);
             bitmap_set(block->bmap, 0, pages);
@@ -2226,7 +2239,7 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
 
     qemu_put_be64(f, ram_bytes_total() | RAM_SAVE_FLAG_MEM_SIZE);
 
-    RAMBLOCK_FOREACH(block) {
+    RAMBLOCK_FOREACH_MIGRATABLE(block) {
         qemu_put_byte(f, strlen(block->idstr));
         qemu_put_buffer(f, (uint8_t *)block->idstr, strlen(block->idstr));
         qemu_put_be64(f, block->used_length);
@@ -2471,6 +2484,11 @@ static inline RAMBlock *ram_block_from_stream(QEMUFile *f, int flags)
         return NULL;
     }
 
+    if (!qemu_ram_is_migratable(block)) {
+        error_report("block %s should not be migrated !", id);
+        return NULL;
+    }
+
     return block;
 }
 
@@ -2921,7 +2939,11 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                 length = qemu_get_be64(f);
 
                 block = qemu_ram_block_by_name(id);
-                if (block) {
+                if (block && !qemu_ram_is_migratable(block)) {
+                    error_report("block %s should not be migrated !", id);
+                    ret = -EINVAL;
+
+                } else if (block) {
                     if (length != block->used_length) {
                         Error *local_err = NULL;
 
-- 
2.13.6


* [Qemu-devel] [PATCH v3 28/35] intc: introduce a CPUIntc interface
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (26 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 27/35] migration: discard non-migratable RAMBlocks Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 29/35] spapr/xive, xics: use the CPU_INTC handlers to reset KVM Cédric Le Goater
                   ` (7 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

This interface will be used by the sPAPR machine to reset the
interrupt controller of each CPU at the KVM level.
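As a rough sketch of the dispatch pattern, using plain C function
pointers rather than QOM (CPUIntcSketch, CPUIntcOps and the int return
codes are assumptions made for illustration), both hooks are optional
and a missing one is simply a no-op, as in cpu_intc_connect() and
cpu_intc_disconnect():

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the QOM class/instance pair. */
typedef struct CPUIntcSketch CPUIntcSketch;

typedef struct CPUIntcOps {
    /* Both hooks are optional: NULL means "nothing to do". */
    int (*connect)(CPUIntcSketch *intc);
    int (*disconnect)(CPUIntcSketch *intc);
} CPUIntcOps;

struct CPUIntcSketch {
    const CPUIntcOps *ops;
    int vcpu_id;
    int connected;
};

/* Dispatchers modelled on cpu_intc_connect()/cpu_intc_disconnect(). */
static int intc_connect(CPUIntcSketch *intc)
{
    return intc->ops->connect ? intc->ops->connect(intc) : 0;
}

static int intc_disconnect(CPUIntcSketch *intc)
{
    return intc->ops->disconnect ? intc->ops->disconnect(intc) : 0;
}

/* A "KVM" presenter that tracks its connection state. */
static int kvm_connect(CPUIntcSketch *intc)
{
    intc->connected = 1;
    return 0;
}

static int kvm_disconnect(CPUIntcSketch *intc)
{
    intc->connected = 0;
    return 0;
}

static const CPUIntcOps kvm_ops = { kvm_connect, kvm_disconnect };
/* An emulated presenter implements neither hook. */
static const CPUIntcOps emulated_ops = { NULL, NULL };
```

Keeping the hooks optional lets the emulated ICP and NVT types declare
the interface while only their KVM subclasses provide handlers.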

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/intc.c         | 26 ++++++++++++++++++++++++++
 include/hw/intc/intc.h | 21 +++++++++++++++++++++
 2 files changed, 47 insertions(+)

diff --git a/hw/intc/intc.c b/hw/intc/intc.c
index 2e1e29e753a1..bb08ff51596a 100644
--- a/hw/intc/intc.c
+++ b/hw/intc/intc.c
@@ -32,10 +32,36 @@ static const TypeInfo intctrl_info = {
     .class_size = sizeof(InterruptStatsProviderClass),
 };
 
+static const TypeInfo cpu_intc_info = {
+    .name = TYPE_CPU_INTC,
+    .parent = TYPE_INTERFACE,
+    .class_size = sizeof(CPUIntcClass),
+};
+
 static void intc_register_types(void)
 {
+    type_register_static(&cpu_intc_info);
     type_register_static(&intctrl_info);
 }
 
 type_init(intc_register_types)
 
+void cpu_intc_disconnect(CPUIntc *intc, Error **errp)
+{
+    CPUIntcClass *cic;
+
+    cic = CPU_INTC_GET_CLASS(intc);
+    if (cic->disconnect) {
+        cic->disconnect(intc, errp);
+    }
+}
+
+void cpu_intc_connect(CPUIntc *intc, Error **errp)
+{
+    CPUIntcClass *cic;
+
+    cic = CPU_INTC_GET_CLASS(intc);
+    if (cic->connect) {
+        cic->connect(intc, errp);
+    }
+}
diff --git a/include/hw/intc/intc.h b/include/hw/intc/intc.h
index 27d9828943ae..3536e7d57ffd 100644
--- a/include/hw/intc/intc.h
+++ b/include/hw/intc/intc.h
@@ -30,4 +30,25 @@ typedef struct InterruptStatsProviderClass {
     void (*print_info)(InterruptStatsProvider *obj, Monitor *mon);
 } InterruptStatsProviderClass;
 
+#define TYPE_CPU_INTC "cpu-intc"
+#define CPU_INTC(obj)                                     \
+    OBJECT_CHECK(CPUIntc, (obj), TYPE_CPU_INTC)
+#define CPU_INTC_CLASS(klass)                                     \
+    OBJECT_CLASS_CHECK(CPUIntcClass, (klass), TYPE_CPU_INTC)
+#define CPU_INTC_GET_CLASS(obj)                                   \
+    OBJECT_GET_CLASS(CPUIntcClass, (obj), TYPE_CPU_INTC)
+
+typedef struct CPUIntc {
+    Object parent;
+} CPUIntc;
+
+typedef struct CPUIntcClass {
+    InterfaceClass parent;
+    void (*connect)(CPUIntc *icp, Error **errp);
+    void (*disconnect)(CPUIntc *icp, Error **errp);
+} CPUIntcClass;
+
+void cpu_intc_disconnect(CPUIntc *intc, Error **errp);
+void cpu_intc_connect(CPUIntc *intc, Error **errp);
+
 #endif
-- 
2.13.6


* [Qemu-devel] [PATCH v3 29/35] spapr/xive, xics: use the CPU_INTC handlers to reset KVM
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (27 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 28/35] intc: introduce a CPUIntc interface Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 30/35] spapr/xive, xics: reset KVM at machine reset Cédric Le Goater
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The vCPUs are disconnected from the KVM device by passing 'disable=1'
as the last argument of the KVM_ENABLE_CAP ioctl. This is a bit hacky;
a dedicated KVM_DISABLE_CAP ioctl should probably be introduced.
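Both disconnect handlers then drop the vCPU from a list of enabled
CPUs. When the search loop finds no match, its cursor ends up NULL, so
the result has to be checked before it is dereferenced. Below is a
standalone sketch of that find-and-remove step, using a hypothetical
hand-rolled list (EnabledCPU, cpu_enable(), cpu_disable()) instead of
QEMU's QLIST macros:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the KVMEnabledCPU list node. */
typedef struct EnabledCPU {
    unsigned long vcpu_id;
    struct EnabledCPU *next;
} EnabledCPU;

static EnabledCPU *enabled_cpus;

static void cpu_enable(unsigned long vcpu_id)
{
    EnabledCPU *e = malloc(sizeof(*e));

    if (!e) {
        abort();
    }
    e->vcpu_id = vcpu_id;
    e->next = enabled_cpus;
    enabled_cpus = e;
}

/*
 * Remove vcpu_id from the list. If nothing matches, the search stops
 * on a NULL link, so test the pointer instead of dereferencing it.
 */
static bool cpu_disable(unsigned long vcpu_id)
{
    EnabledCPU **link = &enabled_cpus;

    while (*link && (*link)->vcpu_id != vcpu_id) {
        link = &(*link)->next;
    }
    if (!*link) {
        return false;   /* caller reports "Cannot find enabled CPU" */
    }
    EnabledCPU *found = *link;
    *link = found->next;
    free(found);
    return true;
}

static bool cpu_is_enabled(unsigned long vcpu_id)
{
    for (EnabledCPU *e = enabled_cpus; e; e = e->next) {
        if (e->vcpu_id == vcpu_id) {
            return true;
        }
    }
    return false;
}
```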

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive_kvm.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++--
 hw/intc/xics.c           |  4 ++++
 hw/intc/xics_kvm.c       | 48 ++++++++++++++++++++++++++++++++++++++++--
 hw/intc/xive.c           |  5 +++++
 hw/ppc/spapr_cpu_core.c  |  8 +++++++
 5 files changed, 116 insertions(+), 4 deletions(-)

diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index ec5613fc2804..e3851991653e 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -15,6 +15,7 @@
 #include "sysemu/cpus.h"
 #include "sysemu/kvm.h"
 #include "monitor/monitor.h"
+#include "hw/intc/intc.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_xive.h"
 #include "hw/ppc/xive.h"
@@ -47,6 +48,25 @@ static bool xive_nvt_kvm_cpu_is_enabled(CPUState *cs)
     return false;
 }
 
+static void xive_nvt_kvm_cpu_disable(CPUState *cs, Error **errp)
+{
+    KVMEnabledCPU *enabled_cpu;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+
+    QLIST_FOREACH(enabled_cpu, &kvm_enabled_cpus, node) {
+        if (enabled_cpu->vcpu_id == vcpu_id) {
+            break;
+        }
+    }
+
+    if (enabled_cpu) {
+        QLIST_REMOVE(enabled_cpu, node);
+        g_free(enabled_cpu);
+    } else {
+        error_setg(errp, "Cannot find enabled CPU%ld", vcpu_id);
+    }
+}
+
 static void xive_nvt_kvm_cpu_enable(CPUState *cs)
 {
     KVMEnabledCPU *enabled_cpu;
@@ -183,8 +203,36 @@ static void xive_nvt_kvm_reset(XiveNVT *nvt)
     xive_nvt_kvm_set_state(nvt, 1);
 }
 
-static void xive_nvt_kvm_realize(XiveNVT *nvt, Error **errp)
+static void xive_nvt_kvm_disconnect(CPUIntc *intc, Error **errp)
 {
+    XiveNVT *nvt = XIVE_NVT_KVM(intc);
+    CPUState *cs = nvt->cs;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+    int ret;
+
+    if (kernel_xive_fd == -1) {
+        return;
+    }
+
+    /* Disable IRQ capability with a 'disable=1' as last argument.
+     *
+     * This is a bit hacky; we should introduce a KVM_DISABLE_CAP
+     * ioctl.
+     */
+    ret = kvm_vcpu_enable_cap(cs, KVM_CAP_PPC_IRQ_XIVE, 0, kernel_xive_fd,
+                              vcpu_id, 1);
+    if (ret < 0) {
+        error_setg(errp, "Unable to disconnect CPU%ld from KVM XIVE device: %s",
+                   vcpu_id, strerror(errno));
+        return;
+    }
+
+    xive_nvt_kvm_cpu_disable(cs, errp);
+}
+
+static void xive_nvt_kvm_connect(CPUIntc *intc, Error **errp)
+{
+    XiveNVT *nvt = XIVE_NVT_KVM(intc);
     CPUState *cs = nvt->cs;
     unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
     int ret;
@@ -209,14 +257,17 @@ static void xive_nvt_kvm_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     XiveNVTClass *xnc = XIVE_NVT_CLASS(klass);
+    CPUIntcClass *cic = CPU_INTC_CLASS(klass);
 
     dc->desc = "XIVE KVM Interrupt Presenter";
 
-    xnc->realize = xive_nvt_kvm_realize;
     xnc->synchronize_state = xive_nvt_kvm_synchronize_state;
     xnc->reset = xive_nvt_kvm_reset;
     xnc->pre_save = xive_nvt_kvm_get_state;
     xnc->post_load = xive_nvt_kvm_set_state;
+
+    cic->connect = xive_nvt_kvm_connect;
+    cic->disconnect = xive_nvt_kvm_disconnect;
 }
 
 static const TypeInfo xive_nvt_kvm_info = {
diff --git a/hw/intc/xics.c b/hw/intc/xics.c
index e73e623e3b53..48fed2731fd2 100644
--- a/hw/intc/xics.c
+++ b/hw/intc/xics.c
@@ -381,6 +381,10 @@ static const TypeInfo icp_info = {
     .instance_size = sizeof(ICPState),
     .class_init = icp_class_init,
     .class_size = sizeof(ICPStateClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_CPU_INTC },
+        { }
+    }
 };
 
 Object *icp_create(Object *cpu, const char *type, XICSFabric *xi, Error **errp)
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index e727397c4a4d..62ea4ea150f2 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -32,6 +32,7 @@
 #include "hw/hw.h"
 #include "trace.h"
 #include "sysemu/kvm.h"
+#include "hw/intc/intc.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/xics.h"
 #include "kvm_ppc.h"
@@ -137,8 +138,48 @@ static void icp_kvm_reset(ICPState *icp)
     icp_set_kvm_state(icp, 1);
 }
 
-static void icp_kvm_realize(ICPState *icp, Error **errp)
+static void icp_kvm_disconnect(CPUIntc *intc, Error **errp)
 {
+    ICPState *icp = ICP(intc);
+    CPUState *cs = icp->cs;
+    KVMEnabledICP *enabled_icp;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+    int ret;
+
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
+    /* Disable IRQ capability with a 'disable=1' as last argument.
+     *
+     * This is a bit hacky; we should introduce a KVM_DISABLE_CAP
+     * ioctl.
+     */
+    ret = kvm_vcpu_enable_cap(cs, KVM_CAP_IRQ_XICS, 0, kernel_xics_fd,
+                              vcpu_id, 1);
+    if (ret < 0) {
+        error_setg(errp, "Unable to disconnect CPU%ld from kernel XICS: %s",
+                   vcpu_id, strerror(errno));
+        return;
+    }
+
+    QLIST_FOREACH(enabled_icp, &kvm_enabled_icps, node) {
+        if (enabled_icp->vcpu_id == vcpu_id) {
+            break;
+        }
+    }
+
+    if (enabled_icp) {
+        QLIST_REMOVE(enabled_icp, node);
+        g_free(enabled_icp);
+    } else {
+        error_setg(errp, "Cannot find enabled CPU%ld", vcpu_id);
+    }
+}
+
+static void icp_kvm_connect(CPUIntc *intc, Error **errp)
+{
+    ICPState *icp = ICP(intc);
     CPUState *cs = icp->cs;
     KVMEnabledICP *enabled_icp;
     unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
@@ -173,12 +214,15 @@ static void icp_kvm_realize(ICPState *icp, Error **errp)
 static void icp_kvm_class_init(ObjectClass *klass, void *data)
 {
     ICPStateClass *icpc = ICP_CLASS(klass);
+    CPUIntcClass *cic = CPU_INTC_CLASS(klass);
 
     icpc->pre_save = icp_get_kvm_state;
     icpc->post_load = icp_set_kvm_state;
-    icpc->realize = icp_kvm_realize;
     icpc->reset = icp_kvm_reset;
     icpc->synchronize_state = icp_synchronize_state;
+
+    cic->connect = icp_kvm_connect;
+    cic->disconnect = icp_kvm_disconnect;
 }
 
 static const TypeInfo icp_kvm_info = {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index d96732cfe6be..4a9b09e3d819 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -14,6 +14,7 @@
 #include "sysemu/cpus.h"
 #include "sysemu/dma.h"
 #include "monitor/monitor.h"
+#include "hw/intc/intc.h"
 #include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
 #include "hw/ppc/xive.h"
 #include "hw/ppc/xive_regs.h"
@@ -513,6 +514,10 @@ static const TypeInfo xive_nvt_info = {
     .instance_init = xive_nvt_init,
     .class_init    = xive_nvt_class_init,
     .class_size    = sizeof(XiveNVTClass),
+    .interfaces = (InterfaceInfo[]) {
+        { TYPE_CPU_INTC },
+        { }
+    }
 };
 
 /*
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 3df2bda53f50..aa612cb1c9f6 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -16,6 +16,7 @@
 #include "sysemu/cpus.h"
 #include "sysemu/kvm.h"
 #include "target/ppc/kvm_ppc.h"
+#include "hw/intc/intc.h"
 #include "hw/ppc/ppc.h"
 #include "target/ppc/mmu-hash64.h"
 #include "sysemu/numa.h"
@@ -105,6 +106,7 @@ static void spapr_cpu_core_unrealizefn(DeviceState *dev, Error **errp)
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
         spapr_cpu_destroy(cpu);
+        cpu_intc_disconnect(CPU_INTC(cpu->intc), NULL);
         object_unparent(cpu->intc);
         cpu_remove_sync(cs);
         object_unparent(obj);
@@ -134,6 +136,10 @@ static void spapr_cpu_core_realize_child(Object *child,
         goto error;
     }
 
+    cpu_intc_connect(CPU_INTC(cpu->intc), &local_err);
+    if (local_err) {
+        goto error;
+    }
     return;
 
 error:
@@ -263,6 +269,7 @@ void spapr_cpu_core_reset_icp(Error **errp)
 
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
+        cpu_intc_disconnect(CPU_INTC(cpu->intc), errp);
         cpu->intc = NULL;
     }
 }
@@ -298,5 +305,6 @@ void spapr_cpu_core_set_icp(const char *icp_type, Error **errp)
         }
 
         cpu->intc = args.icp;
+        cpu_intc_connect(CPU_INTC(cpu->intc), errp);
     }
 }
-- 
2.13.6


* [Qemu-devel] [PATCH v3 30/35] spapr/xive, xics: reset KVM at machine reset
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (28 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 29/35] spapr/xive, xics: use the CPU_INTC handlers to reset KVM Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 31/35] spapr/xive: raise migration priority of the machine Cédric Le Goater
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The interrupt controller KVM device needs to be destroyed and then
recreated according to the interrupt mode negotiated at CAS time. A
new KVM_DESTROY_DEVICE ioctl is required for this purpose, along with
the necessary support in Linux/KVM.

This won't work unless the vCPUs are first disconnected from the KVM
device.
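The required ordering can be modelled as follows. This is a
hypothetical sketch (KvmIntcState, destroy_device() and
reset_interrupt_mode() are illustrative names, not QEMU or KVM APIs)
that only tracks which device exists and how many vCPUs are attached.
The comments point at the functions from this series that would
perform each step:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the KVM interrupt device lifecycle. */
typedef enum { INTC_NONE, INTC_XICS, INTC_XIVE } IntcMode;

typedef struct {
    IntcMode device;       /* KVM device currently created */
    int connected_vcpus;   /* vCPUs attached to that device */
} KvmIntcState;

static bool destroy_device(KvmIntcState *s)
{
    /* Destroying is refused while vCPUs are still connected. */
    if (s->connected_vcpus > 0) {
        return false;
    }
    s->device = INTC_NONE;
    return true;
}

/*
 * Reset sequence: disconnect every vCPU, destroy the current device,
 * create the one matching the CAS-negotiated mode, reconnect vCPUs.
 */
static bool reset_interrupt_mode(KvmIntcState *s, IntcMode negotiated,
                                 int nr_vcpus)
{
    s->connected_vcpus = 0;          /* cpu_intc_disconnect() on each */
    if (!destroy_device(s)) {        /* xics_kvm_fini()/xive_kvm_fini() */
        return false;
    }
    s->device = negotiated;          /* xics_kvm_init()/xive_kvm_init() */
    s->connected_vcpus = nr_vcpus;   /* cpu_intc_connect() on each */
    return true;
}
```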

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive_kvm.c    | 39 +++++++++++++++++++++++++++++++++++++++
 hw/intc/xics_kvm.c          | 38 ++++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr.c              | 38 +++++++++++++++++++++++++++-----------
 hw/ppc/spapr_rtas.c         |  2 --
 include/hw/ppc/spapr_xive.h |  1 +
 include/hw/ppc/xics.h       |  1 +
 linux-headers/linux/kvm.h   |  2 ++
 7 files changed, 108 insertions(+), 13 deletions(-)

diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index e3851991653e..be7c9d1fe0aa 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -614,3 +614,42 @@ void xive_kvm_init(sPAPRXive *xive, Error **errp)
     kvm_msi_via_irqfd_allowed = true;
     kvm_gsi_direct_mapping = true;
 }
+
+int xive_kvm_fini(sPAPRXive *xive, Error **errp)
+{
+    int rc;
+    XiveSource *xsrc = &xive->source;
+    struct kvm_create_device xive_destroy_device = {
+        .fd = kernel_xive_fd,
+        .type = KVM_DEV_TYPE_XIVE,
+        .flags = 0,
+    };
+
+    if (kernel_xive_fd == -1) {
+        return 0;
+    }
+
+    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
+        error_setg(errp,
+                   "KVM and IRQ_XIVE capability must be present for KVM XIVE device");
+        return -1;
+    }
+
+    munmap(xsrc->esb_mmap, (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
+    munmap(xive->tm_mmap_user, 1ull << TM_SHIFT);
+    munmap(xive->tm_mmap_os, 1ull << TM_SHIFT);
+
+    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xive_destroy_device);
+    if (rc < 0) {
+        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XIVE");
+    }
+    close(xive->fd);
+    xive->fd = -1;
+
+    kernel_xive_fd = -1;
+    kvm_kernel_irqchip = false;
+    kvm_msi_via_irqfd_allowed = false;
+    kvm_gsi_direct_mapping = false;
+
+    return 0;
+}
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index 62ea4ea150f2..0a1b0a703451 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -518,6 +518,44 @@ fail:
     return -1;
 }
 
+int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp)
+{
+    int rc;
+    struct kvm_create_device xics_create_device = {
+        .fd = kernel_xics_fd,
+        .type = KVM_DEV_TYPE_XICS,
+        .flags = 0,
+    };
+
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
+    if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
+        error_setg(errp,
+                   "KVM and IRQ_XICS capability must be present for KVM XICS device");
+        return -1;
+    }
+
+    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xics_create_device);
+    if (rc < 0) {
+        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XICS");
+    }
+    close(kernel_xics_fd);
+    kernel_xics_fd = -1;
+
+    kvmppc_define_rtas_kernel_token(0, "ibm,set-xive");
+    kvmppc_define_rtas_kernel_token(0, "ibm,get-xive");
+    kvmppc_define_rtas_kernel_token(0, "ibm,int-on");
+    kvmppc_define_rtas_kernel_token(0, "ibm,int-off");
+
+    kvm_kernel_irqchip = false;
+    kvm_msi_via_irqfd_allowed = false;
+    kvm_gsi_direct_mapping = false;
+
+    return rc;
+}
+
 static void xics_kvm_register_types(void)
 {
     type_register_static(&ics_kvm_info);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index c98ceeed9d6f..dea636f9befe 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1048,9 +1048,8 @@ static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
         } else {
             val[3] = 0x00; /* Hash */
         }
-        /* TODO: when under KVM, only advertise XIVE but not both mode */
         if (spapr->xive_exploitation && kvmppc_has_cap_xive()) {
-            val[1] = 0x40; /* OV5_XIVE_EXPLOIT */
+            val[1] = 0x80; /* OV5_XIVE_BOTH */
         }
     } else {
         if (spapr->xive_exploitation) {
@@ -1536,6 +1535,7 @@ static int spapr_reset_drcs(Object *child, void *opaque)
 /* Setup XIVE exploitation or legacy mode as required by CAS */
 static void spapr_reset_interrupt(sPAPRMachineState *spapr, Error **errp)
 {
+    MachineState *machine = MACHINE(spapr);
     Error *local_err = NULL;
     const char *intc_type;
 
@@ -1551,13 +1551,38 @@ static void spapr_reset_interrupt(sPAPRMachineState *spapr, Error **errp)
         return;
     }
 
+    /* Destroy KVM device */
+    if (kvm_enabled()) {
+        xics_kvm_fini(spapr, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+        xive_kvm_fini(spapr->xive, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+
     if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+            xive_kvm_init(spapr->xive, &local_err);
+        }
         spapr_xive_mmio_map(spapr->xive);
         intc_type = spapr->nvt_type;
     } else {
+        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+            xics_kvm_init(spapr, &local_err);
+        }
         intc_type = spapr->icp_type;
     }
 
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
     spapr_cpu_core_set_icp(intc_type, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
@@ -2583,15 +2608,6 @@ static void spapr_machine_init(MachineState *machine)
         /* XIVE uses the full range of IRQ numbers. */
         xive_system_init(machine, XICS_IRQ_BASE + XICS_IRQS_SPAPR,
                          &error_fatal);
-
-        /* TODO: initialize KVM for XIVE or for XICS but not for both */
-        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
-            xive_kvm_init(spapr->xive, &error_fatal);
-        }
-    } else {
-        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
-            xics_kvm_init(spapr, &error_fatal);
-        }
     }
 
     /* Set up containers for ibm,client-architecture-support negotiated options
diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index 0ec5fa4cfe43..9a3d42486e50 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -404,8 +404,6 @@ void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn)
 
     token -= RTAS_TOKEN_BASE;
 
-    assert(!rtas_table[token].name);
-
     rtas_table[token].name = name;
     rtas_table[token].fn = fn;
 }
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 75b790cc9730..1a28ab0de46d 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -88,5 +88,6 @@ void spapr_dt_xive(sPAPRMachineState *spapr, int nr_servers, void *fdt,
     OBJECT_CHECK(XiveNVT, (obj), TYPE_XIVE_NVT_KVM)
 
 void xive_kvm_init(sPAPRXive *xive, Error **errp);
+int xive_kvm_fini(sPAPRXive *xive, Error **errp);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
index 6cebff47a7d4..dc2b1bf7ac44 100644
--- a/include/hw/ppc/xics.h
+++ b/include/hw/ppc/xics.h
@@ -204,6 +204,7 @@ void icp_resend(ICPState *ss);
 
 typedef struct sPAPRMachineState sPAPRMachineState;
 
+int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp);
 int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
 void xics_spapr_init(sPAPRMachineState *spapr);
 
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 2c20d34f194b..8864b855c08b 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1279,6 +1279,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DESTROY_DEVICE	  _IOWR(KVMIO,  0xf0, struct kvm_create_device)
+
 /*
  * ioctls for vcpu fds
  */
-- 
2.13.6


* [Qemu-devel] [PATCH v3 31/35] spapr/xive: raise migration priority of the machine
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (29 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 30/35] spapr/xive, xics: reset KVM at machine reset Cédric Le Goater
@ 2018-04-19 12:43 ` Cédric Le Goater
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 32/35] ppc/pnv: introduce a pnv_icp_create() helper Cédric Le Goater
                   ` (4 subsequent siblings)
  35 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE MMIO regions must be mapped on the destination before the
XIVE sources are restored. This is currently handled at the machine
level because it depends on the KVM initialization being done before
anything else.
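One way to picture the intended effect is a load order sorted by
priority, with the machine section raised above the XIVE sources so
that the MMIO mappings exist before the sources are restored. The
sketch below is a hypothetical model (Section, sort_for_load() and the
section names are made up for illustration), not QEMU's savevm code.
It assumes higher priority values are handled first, as with QEMU's
MigrationPriority levels, and that equal priorities keep their
registration order:

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified view of registered vmstate sections. */
typedef struct {
    const char *name;
    int priority;   /* higher values are loaded earlier */
    int order;      /* registration order, used to break ties */
} Section;

static int cmp_section(const void *a, const void *b)
{
    const Section *x = a;
    const Section *y = b;

    if (x->priority != y->priority) {
        return y->priority - x->priority;   /* descending priority */
    }
    return x->order - y->order;             /* stable on ties */
}

static void sort_for_load(Section *s, size_t n)
{
    qsort(s, n, sizeof(*s), cmp_section);
}
```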

However, handling this at the machine level has unpleasant
consequences for the MMU, which appears to be broken after migration:

  Oops: Exception in kernel mode, sig: 4 [#1]
  LE SMP NR_CPUS=2048 NUMA pSeries
  Modules linked in: ipmi_devintf ipmi_msghandler vmx_crypto crct10dif_vpmsum ...
  CPU: 3 PID: 1 Comm: systemd Not tainted 4.16.0+ #2
  NIP:  c000000000079810 LR: c00000000033f720 CTR: 0000000000000000
  REGS: c00000007a803880 TRAP: 0700   Not tainted  (4.16.0+)
  MSR:  8000000002049033 <SF,VEC,EE,ME,IR,DR,RI,LE>  CR: 24048884  XER: 20040000
  CFAR: c000000000079ae4 SOFTE: 0
  GPR00: c00000000033f720 c00000007a803b00 c0000000015f8a00 c00000007bb1d800
  GPR04: 00000000000000a0 c0000000017a2598 c00000007a803ba0 0000000000000002
  GPR08: 8403bb74000000c0 0000000000000004 00000000000000c0 0000000000000060
  GPR12: 0000000044048888 c000000007d80f00 00000594336eeaa0 0000000000000003
  GPR16: 00007ffff732c410 00007ffff732c420 00000594336ec090 fffffffffffffffd
  GPR20: 0000000000000000 c00000007bb1d800 0000059444690000 0000059444680000
  GPR24: 0000059444680000 8603146e00000080 c00000007bb1d800 0000000000000001
  GPR28: c0000000017a24e8 0000059444680000 0000000200000000 00000594446800a0
  NIP [c000000000079810] radix__flush_tlb_page_psize+0x60/0x300
  LR [c00000000033f720] ptep_clear_flush+0xe0/0x1e0
  Call Trace:
  [c00000007a803b00] [c00000007a803b80] 0xc00000007a803b80 (unreliable)
  [c00000007a803b40] [c00000007a803b80] 0xc00000007a803b80
  [c00000007a803b80] [c000000000325cc4] wp_page_copy+0x314/0x9a0
  [c00000007a803c10] [c0000000003298b4] do_wp_page+0x1e4/0x860
  [c00000007a803c60] [c00000000032f58c] __handle_mm_fault+0x10fc/0x1b10
  [c00000007a803d40] [c0000000003300d8] handle_mm_fault+0x138/0x250
  [c00000007a803d80] [c000000000069a24] __do_page_fault+0x224/0xa50
  [c00000007a803e30] [c00000000000a534] handle_page_fault+0x18/0x38

Work in progress.
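
For reference, QEMU handles savevm sections with a higher .priority
before those with a lower one, and MIG_PRI_MAX is the top of that
range, so the change below makes the destination load the machine
state ahead of the XIVE sources. The standalone sketch that follows
only models that ordering; the section names, the PRI_* values and
order_sections() are hypothetical and are not QEMU's migration API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical priorities, loosely mirroring MIG_PRI_DEFAULT and
 * MIG_PRI_MAX: a higher value means the section is handled earlier. */
enum sketch_priority {
    PRI_DEFAULT = 0,
    PRI_MAX = 5,
};

struct section {
    const char *name;
    int priority;
};

/* qsort() comparator: higher priority first. Ties are broken by name
 * only to keep this sketch deterministic (qsort is not stable). */
static int by_priority_desc(const void *a, const void *b)
{
    const struct section *sa = a;
    const struct section *sb = b;

    if (sa->priority != sb->priority) {
        return sb->priority - sa->priority;
    }
    return strcmp(sa->name, sb->name);
}

/* Put the sections in the order in which the destination loads them. */
static void order_sections(struct section *s, size_t n)
{
    qsort(s, n, sizeof(*s), by_priority_desc);
}
```

With a "spapr" section at PRI_MAX and "cpu" and "xive-source" at
PRI_DEFAULT, order_sections() yields spapr first, which is the effect
this patch relies on.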

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index dea636f9befe..24b3ee2fe13d 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1916,6 +1916,7 @@ static const VMStateDescription vmstate_spapr = {
     .pre_load = spapr_pre_load,
     .post_load = spapr_post_load,
     .pre_save = spapr_pre_save,
+    .priority = MIG_PRI_MAX,
     .fields = (VMStateField[]) {
         /* used to be @next_irq */
         VMSTATE_UNUSED_BUFFER(version_before_3, 0, 4),
-- 
2.13.6

* [Qemu-devel] [PATCH v3 32/35] ppc/pnv: introduce a pnv_icp_create() helper
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The type of the interrupt presenter depends on the processor family:
POWER8 uses XICS and POWER9 uses XIVE. Provide a machine-level helper
to isolate the creation process and hide the details from the pnv core
realize function.
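
In this patch pnv_icp_create() still creates a TYPE_PNV_ICP
unconditionally; the point of the helper is to give the machine a
single place where the presenter model can later be chosen per
processor family. A minimal sketch of that dispatch, assuming a
hypothetical family enum and hypothetical presenter identifiers (the
real machine identifies the chip through its QOM class):

```c
/* Hypothetical processor families; the pnv machine identifies chips
 * through QOM classes rather than an enum like this one. */
enum cpu_family {
    FAMILY_POWER8,
    FAMILY_POWER9,
};

/* Hypothetical presenter models, for illustration only. */
enum presenter_model {
    PRESENTER_NONE,
    PRESENTER_XICS_ICP,   /* POWER8: XICS interrupt presenter (ICP) */
    PRESENTER_XIVE_TCTX,  /* POWER9: XIVE thread interrupt context */
};

/* Single decision point: the core realize code asks for a presenter
 * and never inspects the processor family itself. */
static enum presenter_model presenter_for(enum cpu_family family)
{
    switch (family) {
    case FAMILY_POWER8:
        return PRESENTER_XICS_ICP;
    case FAMILY_POWER9:
        return PRESENTER_XIVE_TCTX;
    default:
        return PRESENTER_NONE;
    }
}
```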

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/pnv.c         | 14 ++++++++++++++
 hw/ppc/pnv_core.c    |  2 +-
 include/hw/ppc/pnv.h |  2 ++
 3 files changed, 17 insertions(+), 1 deletion(-)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 98ee3c607ae7..91452b7eeb01 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -1015,6 +1015,20 @@ static ICPState *pnv_icp_get(XICSFabric *xi, int pir)
     return cpu ? ICP(cpu->intc) : NULL;
 }
 
+Object *pnv_icp_create(PnvMachineState *pnv, Object *cpu, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = icp_create(cpu, TYPE_PNV_ICP, XICS_FABRIC(pnv), &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    return obj;
+}
+
 static void pnv_pic_print_info(InterruptStatsProvider *obj,
                                Monitor *mon)
 {
diff --git a/hw/ppc/pnv_core.c b/hw/ppc/pnv_core.c
index cbb64ad9e7e0..1961dd2a2641 100644
--- a/hw/ppc/pnv_core.c
+++ b/hw/ppc/pnv_core.c
@@ -133,7 +133,7 @@ static void pnv_core_realize_child(Object *child, XICSFabric *xi, Error **errp)
         return;
     }
 
-    cpu->intc = icp_create(child, TYPE_PNV_ICP, xi, &local_err);
+    cpu->intc = pnv_icp_create(PNV_MACHINE(xi), child, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 90759240a7b1..877c3b79b239 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -191,4 +191,6 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
     (0x0003ffe000000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * \
      PNV_PSIHB_FSP_SIZE)
 
+Object *pnv_icp_create(PnvMachineState *pnv, Object *cpu, Error **errp);
+
 #endif /* _PPC_PNV_H */
-- 
2.13.6

* [Qemu-devel] [PATCH v3 33/35] ppc: externalize ppc_get_vcpu_by_pir()
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

We will use it to get the CPU interrupt presenter in XIVE.
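
ppc_get_vcpu_by_pir() is a plain linear scan over all vCPUs that
compares each vCPU's default PIR value. The sketch below reproduces
that lookup outside QEMU on a hypothetical struct vcpu array to show
the contract: the first match is returned, and NULL signals an unknown
PIR, which callers such as the XIVE indirect TIMA setup must check.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PowerPCCPU, keeping only what the lookup
 * needs. */
struct vcpu {
    uint32_t pir;   /* Processor Identification Register value */
    int index;      /* cpu_index-like identifier */
};

/* Linear scan mirroring the CPU_FOREACH loop: return the first vCPU
 * whose PIR matches, or NULL if none does. */
static struct vcpu *vcpu_by_pir(struct vcpu *cpus, size_t n, uint32_t pir)
{
    for (size_t i = 0; i < n; i++) {
        if (cpus[i].pir == pir) {
            return &cpus[i];
        }
    }
    return NULL;
}
```

The cost is proportional to the number of vCPUs, which seems fine for
configuration paths such as the indirect TIMA setup, but would likely
call for a lookup table if it ended up on a hot notification path.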

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/pnv.c         | 16 ----------------
 hw/ppc/ppc.c         | 16 ++++++++++++++++
 include/hw/ppc/ppc.h |  1 +
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 91452b7eeb01..d07a8ce38e99 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -992,22 +992,6 @@ static void pnv_ics_resend(XICSFabric *xi)
     }
 }
 
-static PowerPCCPU *ppc_get_vcpu_by_pir(int pir)
-{
-    CPUState *cs;
-
-    CPU_FOREACH(cs) {
-        PowerPCCPU *cpu = POWERPC_CPU(cs);
-        CPUPPCState *env = &cpu->env;
-
-        if (env->spr_cb[SPR_PIR].default_value == pir) {
-            return cpu;
-        }
-    }
-
-    return NULL;
-}
-
 static ICPState *pnv_icp_get(XICSFabric *xi, int pir)
 {
     PowerPCCPU *cpu = ppc_get_vcpu_by_pir(pir);
diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index ec4be25f4994..9292f986eba7 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -1358,3 +1358,19 @@ void PPC_debug_write (void *opaque, uint32_t addr, uint32_t val)
         break;
     }
 }
+
+PowerPCCPU *ppc_get_vcpu_by_pir(int pir)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        CPUPPCState *env = &cpu->env;
+
+        if (env->spr_cb[SPR_PIR].default_value == pir) {
+            return cpu;
+        }
+    }
+
+    return NULL;
+}
diff --git a/include/hw/ppc/ppc.h b/include/hw/ppc/ppc.h
index ff0ac306be72..c8e54544e563 100644
--- a/include/hw/ppc/ppc.h
+++ b/include/hw/ppc/ppc.h
@@ -4,6 +4,7 @@
 #include "target/ppc/cpu-qom.h"
 
 void ppc_set_irq(PowerPCCPU *cpu, int n_IRQ, int level);
+PowerPCCPU *ppc_get_vcpu_by_pir(int pir);
 
 /* PowerPC hardware exceptions management helpers */
 typedef void (*clk_setup_cb)(void *opaque, uint32_t freq);
-- 
2.13.6

* [Qemu-devel] [PATCH v3 34/35] ppc/pnv: add XIVE support
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

This is a simple model of the POWER9 XIVE interrupt controller for the
PowerNV machine. XIVE for bare metal is a complex controller and the
model only addresses the needs of the skiboot firmware. Support is
provided for:

* virtual structure descriptor tables describing the XIVE
  internal tables stored in the machine RAM:

  - IVT
    associates an interrupt source number with an event queue. The
    data to be pushed into the queue is also stored there.

  - EQDT
    describes the queues in the OS RAM; it also contains a set of
    flags, a virtual target, etc.

  - VPDT
    describes the virtual targets, which can be of different
    natures: an LPAR, a CPU.

* translation sets, splitting the overall ESB MMIO in two:
  IPIs and EQs.

* MMIO regions:

  - Interrupt controller registers
  - ESB MMIO for IPIs and EQs
  - Presenter MMIO (Not used)
  - Thread Interrupt Management Area MMIO, direct and indirect

* internal sources for IPIs and CAPI-like interrupts.

The integration with the generic XiveFabric routing engine is not
complete yet, and the TIMA handlers for the HV privilege level
duplicate a lot of code. Work in progress.
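
To illustrate the indirect table walk done by pnv_xive_eq_addr() and
pnv_xive_vp_addr(), here is a standalone sketch. The first-level table
holds one VSD per 64K page: idx / EQ_PER_PAGE selects the VSD and
idx % EQ_PER_PAGE the slot inside that page. The table size encoded in
a VSD is 2^(TSIZE + 12) bytes. The SK_* constants are assumptions made
for this example (the model's real field definitions are in
pnv_xive_regs.h), and the VSDs are taken to be already converted to
host byte order.

```c
#include <stdint.h>

/* Illustrative VSD field layout; these values are assumptions. */
#define SK_VSD_ADDRESS_MASK  0x0ffffffffffff000ull
#define SK_VSD_INDIRECT      0x80ull   /* IBM bit 56 */
#define SK_VSD_TSIZE_MASK    0x1full   /* IBM bits 59-63 */

#define SK_PAGE_SIZE    0x10000u       /* 64K indirect subpages */
#define SK_EQ_SIZE      32u            /* XiveEQ: eight 32-bit words */
#define SK_EQ_PER_PAGE  (SK_PAGE_SIZE / SK_EQ_SIZE)

/* Table size encoded in a VSD: 2^(TSIZE + 12) bytes. */
static uint64_t vsd_table_size(uint64_t vsd)
{
    return 1ull << ((vsd & SK_VSD_TSIZE_MASK) + 12);
}

/* Resolve entry 'idx' of an indirect table whose first level is
 * 'pages' (one VSD per 64K page). Returns 0 when the index is out of
 * range, the page is not allocated, or the page is itself marked
 * indirect (nested indirect tables are not supported). */
static uint64_t indirect_entry_addr(const uint64_t *pages,
                                    uint32_t nr_pages, uint32_t idx)
{
    uint32_t page = idx / SK_EQ_PER_PAGE;
    uint64_t vsd, base;

    if (page >= nr_pages) {
        return 0;
    }
    vsd = pages[page];
    base = vsd & SK_VSD_ADDRESS_MASK;
    if (!base || (vsd & SK_VSD_INDIRECT)) {
        return 0;
    }
    return base + (uint64_t)(idx % SK_EQ_PER_PAGE) * SK_EQ_SIZE;
}
```

With 32-byte EQs a 64K page holds 2048 entries, so EQ 2048 is the
first entry of the second page and resolves to 0 if that page has not
been allocated yet.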

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/Makefile.objs      |    2 +-
 hw/intc/pnv_xive.c         | 1234 ++++++++++++++++++++++++++++++++++++++++++++
 hw/intc/pnv_xive_regs.h    |  314 +++++++++++
 hw/intc/xive.c             |  160 +++++-
 hw/ppc/pnv.c               |   36 +-
 include/hw/ppc/pnv.h       |   21 +
 include/hw/ppc/pnv_xive.h  |   89 ++++
 include/hw/ppc/pnv_xscom.h |    3 +
 include/hw/ppc/xive.h      |    5 +
 include/hw/ppc/xive_regs.h |   22 +
 10 files changed, 1874 insertions(+), 12 deletions(-)
 create mode 100644 hw/intc/pnv_xive.c
 create mode 100644 hw/intc/pnv_xive_regs.h
 create mode 100644 include/hw/ppc/pnv_xive.h

diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index dd4d69db2bdd..145bfaf44014 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -40,7 +40,7 @@ obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
 obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
 obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
-obj-$(CONFIG_POWERNV) += xics_pnv.o
+obj-$(CONFIG_POWERNV) += xics_pnv.o pnv_xive.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
 obj-$(CONFIG_S390_FLIC_KVM) += s390_flic_kvm.o
diff --git a/hw/intc/pnv_xive.c b/hw/intc/pnv_xive.c
new file mode 100644
index 000000000000..ef521b402567
--- /dev/null
+++ b/hw/intc/pnv_xive.c
@@ -0,0 +1,1234 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/dma.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/fdt.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/ppc/pnv_xive.h"
+#include "hw/ppc/xive_regs.h"
+#include "hw/ppc/ppc.h"
+
+#include <libfdt.h>
+
+#include "pnv_xive_regs.h"
+
+#define EQ_PER_PAGE           (0x10000 / sizeof(XiveEQ))
+#define VP_PER_PAGE           (0x10000 / sizeof(XiveVP))
+
+static uint64_t pnv_xive_eq_addr(PnvXive *xive, uint32_t idx)
+{
+    uint64_t vsd;
+    uint64_t page_addr;
+
+    if (idx >= xive->eqdt_count) {
+        return 0;
+    }
+
+    vsd = be64_to_cpu(xive->eqdt[idx / EQ_PER_PAGE]);
+    page_addr = vsd & VSD_ADDRESS_MASK;
+    if (!page_addr) {
+        return 0;
+    }
+
+    /* We don't support nested indirect tables */
+    if (VSD_INDIRECT & vsd) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: found a nested indirect EQ"
+                      " table at index %d\n", idx);
+        return 0;
+    }
+
+    return page_addr + (idx % EQ_PER_PAGE) * sizeof(XiveEQ);
+}
+
+static int pnv_xive_eq_get(PnvXive *xive, uint32_t idx, XiveEQ *eq)
+{
+    uint64_t eq_addr = pnv_xive_eq_addr(xive, idx);
+
+    if (!eq_addr) {
+        return -1;
+    }
+
+    cpu_physical_memory_read(eq_addr, eq, sizeof(XiveEQ));
+    eq->w0 = be32_to_cpu(eq->w0);
+    eq->w1 = be32_to_cpu(eq->w1);
+    eq->w2 = be32_to_cpu(eq->w2);
+    eq->w3 = be32_to_cpu(eq->w3);
+    eq->w4 = be32_to_cpu(eq->w4);
+    eq->w5 = be32_to_cpu(eq->w5);
+    eq->w6 = be32_to_cpu(eq->w6);
+    eq->w7 = be32_to_cpu(eq->w7);
+
+    return 0;
+}
+
+static int pnv_xive_eq_set(PnvXive *xive, uint32_t idx, XiveEQ *in_eq)
+{
+    XiveEQ eq;
+    uint64_t eq_addr = pnv_xive_eq_addr(xive, idx);
+
+    if (!eq_addr) {
+        return -1;
+    }
+
+    eq.w0 = cpu_to_be32(in_eq->w0);
+    eq.w1 = cpu_to_be32(in_eq->w1);
+    eq.w2 = cpu_to_be32(in_eq->w2);
+    eq.w3 = cpu_to_be32(in_eq->w3);
+    eq.w4 = cpu_to_be32(in_eq->w4);
+    eq.w5 = cpu_to_be32(in_eq->w5);
+    eq.w6 = cpu_to_be32(in_eq->w6);
+    eq.w7 = cpu_to_be32(in_eq->w7);
+    cpu_physical_memory_write(eq_addr, &eq, sizeof(XiveEQ));
+    return 0;
+}
+
+static void pnv_xive_eq_update(PnvXive *xive, uint32_t idx)
+{
+    uint32_t size = 1 << (GETFIELD(VSD_TSIZE, xive->vsds[VST_TSEL_EQDT]) + 12);
+    uint64_t eqdt_addr = xive->vsds[VST_TSEL_EQDT] & VSD_ADDRESS_MASK;
+    uint64_t eq_addr;
+
+    /* Update the EQ indirect table which might have newly allocated
+     * pages. We could use the idx to limit the transfer */
+    cpu_physical_memory_read(eqdt_addr, xive->eqdt, size);
+
+    eq_addr = pnv_xive_eq_addr(xive, idx);
+    if (!eq_addr) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Update failed for EQ %d\n", idx);
+        return;
+    }
+
+    cpu_physical_memory_write(eq_addr, xive->eqc_watch, sizeof(XiveEQ));
+}
+
+static uint64_t pnv_xive_vp_addr(PnvXive *xive, uint32_t idx)
+{
+    uint64_t vsd;
+    uint64_t page_addr;
+
+    if (idx >= xive->vpdt_count) {
+        return 0;
+    }
+
+    vsd = be64_to_cpu(xive->vpdt[idx / VP_PER_PAGE]);
+    page_addr = vsd & VSD_ADDRESS_MASK;
+    if (!page_addr) {
+        return 0;
+    }
+
+    /* We don't support nested indirect tables */
+    if (VSD_INDIRECT & vsd) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: found a nested indirect VP"
+                      " table at index %x\n", idx);
+        return 0;
+    }
+
+    return page_addr + (idx % VP_PER_PAGE) * sizeof(XiveVP);
+}
+
+static int pnv_xive_vp_get(PnvXive *xive, uint32_t idx, XiveVP *vp)
+{
+    uint64_t vp_addr = pnv_xive_vp_addr(xive, idx);
+
+    if (!vp_addr) {
+        return -1;
+    }
+
+    cpu_physical_memory_read(vp_addr, vp, sizeof(XiveVP));
+    vp->w0 = be32_to_cpu(vp->w0);
+    vp->w1 = be32_to_cpu(vp->w1);
+    vp->w2 = be32_to_cpu(vp->w2);
+    vp->w3 = be32_to_cpu(vp->w3);
+    vp->w4 = be32_to_cpu(vp->w4);
+    vp->w5 = be32_to_cpu(vp->w5);
+    vp->w6 = be32_to_cpu(vp->w6);
+    vp->w7 = be32_to_cpu(vp->w7);
+
+    return 0;
+}
+
+static void pnv_xive_vp_update(PnvXive *xive, uint32_t idx)
+{
+    uint32_t size = 1 << (GETFIELD(VSD_TSIZE, xive->vsds[VST_TSEL_VPDT]) + 12);
+    uint64_t vpdt_addr = xive->vsds[VST_TSEL_VPDT] & VSD_ADDRESS_MASK;
+    uint64_t vp_addr;
+
+    /* Update the VP indirect table which might have newly allocated
+     * pages. We could use the idx to limit the transfer */
+    cpu_physical_memory_read(vpdt_addr, xive->vpdt, size);
+
+    vp_addr = pnv_xive_vp_addr(xive, idx);
+    if (!vp_addr) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Update failed for VP %x\n", idx);
+        return;
+    }
+
+    cpu_physical_memory_write(vp_addr, xive->vpc_watch, sizeof(XiveVP));
+}
+
+static void pnv_xive_ive_update(PnvXive *xive, uint32_t idx)
+{
+    uint64_t ivt_addr = xive->vsds[VST_TSEL_IVT] & VSD_ADDRESS_MASK;
+    uint64_t ive_addr = ivt_addr + idx * sizeof(XiveIVE);
+    XiveIVE *ive = &xive->ivt[idx];
+
+    *((uint64_t *) ive) = ldq_be_dma(&address_space_memory, ive_addr);
+}
+
+#define PNV_XIVE_SET_XLATE_SIZE  (8ull << 30)
+
+static uint64_t pnv_xive_set_xlate_edt_size(PnvXive *xive, uint64_t type)
+{
+    uint64_t size = 0;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(xive->set_xlate_edt); i++) {
+        /* This is supposing that the IPIs and EQs set translations
+         * are contiguous */
+        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
+
+        if (edt_type == type) {
+            size += PNV_XIVE_SET_XLATE_SIZE;
+        }
+    }
+
+    return size;
+}
+
+static int pnv_xive_set_xlate_update(PnvXive *xive, uint64_t val)
+{
+    uint8_t index = xive->set_xlate_autoinc ?
+        xive->set_xlate_index++ : xive->set_xlate_index;
+
+    switch (xive->set_xlate) {
+    case CQ_TAR_TSEL_EDT:
+        index %= ARRAY_SIZE(xive->set_xlate_edt);
+        xive->set_xlate_edt[index] = val;
+        break;
+    case CQ_TAR_TSEL_VDT:
+        index %= ARRAY_SIZE(xive->set_xlate_vdt);
+        xive->set_xlate_vdt[index] = val;
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid table set %d\n",
+                      (int) xive->set_xlate);
+        return -1;
+    }
+
+    return 0;
+}
+
+static int pnv_xive_set_xlate_select(PnvXive *xive, uint64_t val)
+{
+    xive->set_xlate_autoinc = val & CQ_TAR_TBL_AUTOINC;
+    xive->set_xlate = val & CQ_TAR_TSEL;
+    xive->set_xlate_index = GETFIELD(CQ_TAR_TSEL_INDEX, val);
+
+    return 0;
+}
+
+static void pnv_xive_source_realize(PnvXive *xive, uint32_t count,
+                                    Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
+    uint64_t esb_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_IPI);
+
+    /* Remap the ESB region for IPIs now that the set translations have
+     * been configured.
+     */
+    memory_region_transaction_begin();
+    memory_region_set_size(&xive->esb_mmio, esb_mmio_size);
+    memory_region_set_enabled(&xive->esb_mmio, true);
+    memory_region_transaction_commit();
+
+    object_property_set_int(OBJECT(xsrc), xive->esb_base, "bar", &error_fatal);
+    object_property_set_int(OBJECT(xsrc), XIVE_ESB_64K_2PAGE, "shift",
+                            &error_fatal);
+    object_property_set_int(OBJECT(xsrc), count, "nr-irqs", &error_fatal);
+    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
+
+    /* Install the ESB MMIO region in the overall region configured
+     * for that purpose in the interrupt controller. */
+    memory_region_add_subregion(&xive->esb_mmio, 0, &xsrc->esb_mmio);
+}
+
+static void pnv_xive_eq_source_realize(PnvXive *xive, uint32_t count,
+                                       Error **errp)
+{
+    XiveSource *eq_xsrc = &xive->eq_source;
+    Error *local_err = NULL;
+    uint64_t esb_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_IPI);
+    uint64_t eq_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_EQ);
+
+    xive->eq_base = xive->vc_base + esb_mmio_size;
+
+    /* Remap the ESB region for EQs now that the set translations have
+     * been configured.
+     */
+    memory_region_transaction_begin();
+    memory_region_set_size(&xive->eq_mmio, eq_mmio_size);
+    memory_region_set_address(&xive->eq_mmio, esb_mmio_size);
+    memory_region_set_enabled(&xive->eq_mmio, true);
+    memory_region_transaction_commit();
+
+    /* check for some skiboot oddity on the table size */
+    if (xive->eq_base + count * (1ull << XIVE_ESB_64K_2PAGE) >
+        xive->vc_base + PNV_XIVE_VC_SIZE) {
+        uint32_t old = count;
+        count = (xive->vc_base + PNV_XIVE_VC_SIZE -
+                 xive->eq_base) >> XIVE_ESB_64K_2PAGE;
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: EQ count %d too large for VC "
+                      "MMIO region. shrinking to %d\n", old, count);
+    }
+
+    object_property_set_int(OBJECT(eq_xsrc), xive->eq_base, "bar",
+                            &error_fatal);
+    object_property_set_int(OBJECT(eq_xsrc), XIVE_ESB_64K_2PAGE, "shift",
+                            &error_fatal);
+    object_property_set_int(OBJECT(eq_xsrc), count, "nr-irqs", &error_fatal);
+    object_property_add_const_link(OBJECT(eq_xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(eq_xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(eq_xsrc), sysbus_get_default());
+
+    /* Install the EQ ESB MMIO region in the overall region configured
+     * for that purpose in the interrupt controller. */
+    memory_region_add_subregion(&xive->eq_mmio, 0, &eq_xsrc->esb_mmio);
+}
+
+static void pnv_xive_table_set_data(PnvXive *xive, uint64_t val, bool pc_engine)
+{
+    uint64_t addr = val & VSD_ADDRESS_MASK;
+    uint32_t size = 1 << (GETFIELD(VSD_TSIZE, val) + 12);
+    bool indirect = VSD_INDIRECT & val;
+    uint8_t mode = GETFIELD(VSD_MODE, val);
+
+    if (mode != VSD_MODE_EXCLUSIVE) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for non-exclusive"
+                      " tables\n");
+        return;
+    }
+
+    switch (xive->vst_tsel) {
+    case VST_TSEL_IVT:
+        if (!xive->ivt) {
+            xive->nr_irqs = size / sizeof(XiveIVE);
+
+            xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
+
+            /* Read initial state from the guest RAM */
+            cpu_physical_memory_read(addr, xive->ivt, size);
+            xive->vsds[xive->vst_tsel] = val;
+        }
+        break;
+
+    case VST_TSEL_SBE:
+        /* We do not use the SBE bits backed in the guest RAM but
+         * instead, we create our own source. The IVT table should
+         * have been created before.
+         */
+        if (!DEVICE(&xive->source)->realized) {
+
+            pnv_xive_source_realize(xive, xive->nr_irqs, &error_fatal);
+            device_reset(DEVICE(&xive->source));
+            xive->vsds[xive->vst_tsel] = val;
+        }
+        break;
+
+    case VST_TSEL_EQDT:
+        if (!xive->eqdt) {
+
+            /* EQDT is expected to be indirect even though skiboot can
+             * be compiled in direct mode */
+            assert(indirect);
+
+            /* FIXME: skiboot sets the EQDT as indirect with 64K
+             * subpages, which is too big for the VC MMIO region.
+             */
+            val &= ~VSD_TSIZE;
+            val |= SETFIELD(VSD_TSIZE, 0ull, 0);
+            size = 0x1000;
+
+            xive->eqdt_count = size * EQ_PER_PAGE / 8;
+
+            xive->eqdt = g_malloc0(size);
+
+            /* Should be all NULL pointers */
+            cpu_physical_memory_read(addr, xive->eqdt, size);
+
+            xive->vsds[xive->vst_tsel] = val;
+
+            /* We do not use the ESn bits of the XiveEQ structure
+             * backed in the guest RAM but instead, we create our own
+             * source.
+             */
+            pnv_xive_eq_source_realize(xive, xive->eqdt_count, &error_fatal);
+        }
+        break;
+
+    case VST_TSEL_VPDT:
+
+        /* There is a hack in skiboot to work around a DD1 issue with
+         * the VPT setting in the VC engine. Skip it, we will get
+         * it from the PC engine anyhow */
+        if (!xive->vpdt && pc_engine) {
+
+            /* VPDT is indirect */
+            assert(indirect);
+
+            /* FIXME: skiboot sets the VPDT as indirect with 64K
+             * subpages.
+             */
+            val &= ~VSD_TSIZE;
+            val |= SETFIELD(VSD_TSIZE, 0ull, 0);
+            size = 0x1000;
+
+            xive->vpdt_count = size * VP_PER_PAGE / 8;
+
+            xive->vpdt = g_malloc0(size);
+
+            /* should be all NULL pointers */
+            cpu_physical_memory_read(addr, xive->vpdt, size);
+
+            xive->vsds[xive->vst_tsel] = val;
+        }
+        break;
+    case VST_TSEL_IRQ:
+        /* TODO */
+        xive->vsds[xive->vst_tsel] = val;
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid table type %d\n",
+                      xive->vst_tsel);
+        return;
+    }
+}
+
+/*
+ * Some accesses to the TIMA are done from another thread's context,
+ * for instance during resets.
+ */
+static void pnv_xive_thread_indirect_set(PnvXive *xive, uint64_t val)
+{
+    int pir = GETFIELD(PC_TCTXT_INDIR_THRDID, xive->regs[PC_TCTXT_INDIR0 >> 3]);
+
+    if (val & PC_TCTXT_INDIR_VALID) {
+        if (xive->cpu_ind) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: indirect access already set"
+                          " for PIR %d\n", pir);
+        }
+
+        pir = GETFIELD(PC_TCTXT_INDIR_THRDID, val) & 0xff;
+        xive->cpu_ind = ppc_get_vcpu_by_pir(pir);
+        if (!xive->cpu_ind) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid PIR %d for"
+                          " indirect access\n", pir);
+        }
+    } else {
+        xive->cpu_ind = NULL;
+    }
+}
+
+/*
+ * Interrupt Controller MMIO
+ */
+static void pnv_xive_ic_reg_write(PnvXive *xive, uint32_t offset, uint64_t val,
+                                  bool mmio)
+{
+    uint32_t reg = offset >> 3;
+
+    switch (offset) {
+    case CQ_CFG_PB_GEN:
+    case CQ_MSGSND:
+    case CQ_PBI_CTL:
+    case CQ_FIRMASK_OR:
+
+    case PC_TCTXT_CFG:
+    case PC_TCTXT_TRACK:
+    case PC_TCTXT_INDIR1:
+    case PC_TCTXT_INDIR2:
+    case PC_TCTXT_INDIR3:
+    case PC_GLOBAL_CONFIG:
+        /* set indirect mode for VSDs */
+
+    case PC_VPC_SCRUB_MASK:
+    case PC_VPC_CWATCH_SPEC:
+    case VC_GLOBAL_CONFIG:
+        /* set indirect mode for VSDs */
+
+    case VC_AIB_TX_ORDER_TAG2:
+
+    case VC_IRQ_CONFIG_IPI:
+    case VC_IRQ_CONFIG_HW:
+    case VC_IRQ_CONFIG_CASCADE1:
+    case VC_IRQ_CONFIG_CASCADE2:
+    case VC_IRQ_CONFIG_REDIST:
+    case VC_IRQ_CONFIG_IPI_CASC:
+
+    case VC_EQC_SCRUB_MASK:
+    case VC_EQC_CWATCH_SPEC:
+    case VC_EQC_CONFIG:
+    case VC_IVC_SCRUB_MASK:
+    case PC_AT_KILL_MASK:
+    case VC_AT_MACRO_KILL_MASK:
+        xive->regs[reg] = val;
+        break;
+
+    /* TODO: we could set the memory region when the BARs are
+     * configured by firmware instead of hardcoding the addr/size
+     * values when the object is realized.
+     */
+    case CQ_IC_BAR: /* IC BAR and page size. 8 * 64k */
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_TM1_BAR: /* TM BAR and page size. 4 * 64k */
+    case CQ_TM2_BAR: /* second TM BAR and page size. For hotplug use */
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_PC_BAR:
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_PC_BARM: /* PC BAR size */
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_VC_BAR:
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_VC_BARM: /* VC BAR size */
+        xive->regs[reg] = val;
+        break;
+
+    case PC_AT_KILL:
+        /* TODO: reload vpdt because pages were cleared */
+        xive->regs[reg] |= val;
+        break;
+
+    case VC_AT_MACRO_KILL:
+        /* TODO: reload eddt because pages were cleared */
+        xive->regs[reg] |= val;
+        break;
+
+    case PC_THREAD_EN_REG0_SET: /* Physical Thread Enable */
+    case PC_THREAD_EN_REG1_SET: /* Physical Thread Enable (fused core) */
+        xive->regs[reg] |= val;
+        break;
+
+    case PC_THREAD_EN_REG0_CLR:
+        xive->regs[PC_THREAD_EN_REG0_SET >> 3] &= ~val;
+        break;
+    case PC_THREAD_EN_REG1_CLR:
+        xive->regs[PC_THREAD_EN_REG1_SET >> 3] &= ~val;
+        break;
+
+    case PC_TCTXT_INDIR0: /* set up CPU for indirect TIMA access */
+        pnv_xive_thread_indirect_set(xive, val);
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_TAR: /* Set Translation Table Address */
+        pnv_xive_set_xlate_select(xive, val);
+        break;
+
+    case CQ_TDR: /* Set Translation Table Data */
+        pnv_xive_set_xlate_update(xive, val);
+        break;
+
+    case VC_IVC_SCRUB_TRIG:
+        pnv_xive_ive_update(xive, GETFIELD(VC_SCRUB_OFFSET, val));
+        break;
+
+    case PC_VPC_CWATCH_DAT0:
+    case PC_VPC_CWATCH_DAT1:
+    case PC_VPC_CWATCH_DAT2:
+    case PC_VPC_CWATCH_DAT3:
+    case PC_VPC_CWATCH_DAT4:
+    case PC_VPC_CWATCH_DAT5:
+    case PC_VPC_CWATCH_DAT6:
+    case PC_VPC_CWATCH_DAT7: /* XiveVP data for update */
+        xive->vpc_watch[(offset - PC_VPC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
+        break;
+
+    case PC_VPC_SCRUB_TRIG:
+        pnv_xive_vp_update(xive, GETFIELD(PC_SCRUB_OFFSET, val));
+        break;
+
+    case VC_EQC_CWATCH_DAT0:
+    case VC_EQC_CWATCH_DAT1:
+    case VC_EQC_CWATCH_DAT2:
+    case VC_EQC_CWATCH_DAT3: /* XiveEQ data for update */
+        xive->eqc_watch[(offset - VC_EQC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
+        break;
+
+    case VC_EQC_SCRUB_TRIG:
+        pnv_xive_eq_update(xive, GETFIELD(VC_SCRUB_OFFSET, val));
+        break;
+
+    case VC_VSD_TABLE_ADDR:
+    case PC_VSD_TABLE_ADDR:
+        xive->vst_tsel = GETFIELD(VST_TABLE_SELECT, val);
+        xive->vst_tidx = GETFIELD(VST_TABLE_OFFSET, val);
+        break;
+
+    case VC_VSD_TABLE_DATA:
+        pnv_xive_table_set_data(xive, val, false);
+        break;
+
+    case PC_VSD_TABLE_DATA:
+        pnv_xive_table_set_data(xive, val, true);
+        break;
+
+    case VC_SBC_CONFIG:
+        xive->regs[reg] = val;
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE/IC: invalid write to reg=0x%08x"
+                      " mmio=%d\n", offset, mmio);
+    }
+}
+
+static uint64_t pnv_xive_ic_reg_read(PnvXive *xive, uint32_t offset, bool mmio)
+{
+    uint64_t val = 0;
+    uint32_t reg = offset >> 3;
+
+    switch (offset) {
+    case CQ_CFG_PB_GEN:
+    case CQ_MSGSND: /* activated cores */
+    case CQ_IC_BAR:
+    case CQ_TM1_BAR:
+    case CQ_TM2_BAR:
+    case CQ_PC_BAR:
+    case CQ_PC_BARM:
+    case CQ_VC_BAR:
+    case CQ_VC_BARM:
+    case CQ_TAR:
+    case CQ_TDR:
+    case CQ_PBI_CTL:
+
+    case PC_TCTXT_CFG:
+    case PC_TCTXT_TRACK:
+    case PC_TCTXT_INDIR0:
+    case PC_TCTXT_INDIR1:
+    case PC_TCTXT_INDIR2:
+    case PC_TCTXT_INDIR3:
+    case PC_GLOBAL_CONFIG:
+
+    case PC_VPC_SCRUB_MASK:
+    case PC_VPC_CWATCH_SPEC:
+    case PC_VPC_CWATCH_DAT0:
+    case PC_VPC_CWATCH_DAT1:
+    case PC_VPC_CWATCH_DAT2:
+    case PC_VPC_CWATCH_DAT3:
+    case PC_VPC_CWATCH_DAT4:
+    case PC_VPC_CWATCH_DAT5:
+    case PC_VPC_CWATCH_DAT6:
+    case PC_VPC_CWATCH_DAT7:
+
+    case VC_GLOBAL_CONFIG:
+    case VC_AIB_TX_ORDER_TAG2:
+
+    case VC_IRQ_CONFIG_IPI:
+    case VC_IRQ_CONFIG_HW:
+    case VC_IRQ_CONFIG_CASCADE1:
+    case VC_IRQ_CONFIG_CASCADE2:
+    case VC_IRQ_CONFIG_REDIST:
+    case VC_IRQ_CONFIG_IPI_CASC:
+
+    case VC_EQC_SCRUB_MASK:
+    case VC_EQC_CWATCH_DAT0:
+    case VC_EQC_CWATCH_DAT1:
+    case VC_EQC_CWATCH_DAT2:
+    case VC_EQC_CWATCH_DAT3:
+
+    case VC_EQC_CWATCH_SPEC:
+    case VC_IVC_SCRUB_MASK:
+    case VC_SBC_CONFIG:
+    case VC_AT_MACRO_KILL_MASK:
+    case VC_VSD_TABLE_ADDR:
+    case PC_VSD_TABLE_ADDR:
+    case VC_VSD_TABLE_DATA:
+    case PC_VSD_TABLE_DATA:
+        val = xive->regs[reg];
+        break;
+    case PC_VPC_SCRUB_TRIG:
+    case VC_IVC_SCRUB_TRIG:
+    case VC_EQC_SCRUB_TRIG:
+        xive->regs[reg] &= ~VC_SCRUB_VALID;
+        val = xive->regs[reg];
+        break;
+    case VC_EQC_CONFIG:
+        val = SYNC_MASK;
+        break;
+    case PC_AT_KILL:
+        xive->regs[reg] &= ~PC_AT_KILL_VALID;
+        val = xive->regs[reg];
+        break;
+    case VC_AT_MACRO_KILL:
+        xive->regs[reg] &= ~VC_KILL_VALID;
+        val = xive->regs[reg];
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE/IC: invalid read reg=0x%08x"
+                      " mmio=%d\n", offset, mmio);
+    }
+
+    return val;
+}
+
+/*
+ * Interrupt Controller MMIO: Notify ports
+ */
+static void pnv_xive_ic_notify_write(PnvXive *xive, hwaddr addr,
+                                     uint64_t val, unsigned size)
+{
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xive);
+
+    xfc->notify(XIVE_FABRIC(xive), val);
+}
+
+/*
+ * Interrupt Controller MMIO: Synchronisation registers
+ */
+#define PNV_XIVE_SYNC_IPI       0x400 /* Sync IPI */
+#define PNV_XIVE_SYNC_HW        0x480 /* Sync HW */
+#define PNV_XIVE_SYNC_OS_ESC    0x500 /* Sync OS escalations */
+#define PNV_XIVE_SYNC_HW_ESC    0x580 /* Sync Hyp escalations */
+#define PNV_XIVE_SYNC_REDIS     0x600 /* Sync Redistribution */
+
+static void pnv_xive_ic_sync_write(void *opaque, hwaddr addr, uint64_t val,
+                                   unsigned size)
+{
+
+    switch (addr) {
+    case PNV_XIVE_SYNC_IPI:
+    case PNV_XIVE_SYNC_HW:
+    case PNV_XIVE_SYNC_OS_ESC:
+    case PNV_XIVE_SYNC_HW_ESC:
+    case PNV_XIVE_SYNC_REDIS:
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE/IC: invalid sync @%"
+                      HWADDR_PRIx"\n", addr);
+    }
+}
+
+/*
+ * Interrupt controller MMIO regions
+ *
+ * 0x00000 - 0x0FFFF : BARs
+ * 0x10000 - 0x107FF : Notify ports
+ * 0x10800 - 0x10FFF : Synchronisation registers
+ * 0x40000 - 0x7FFFF : indirect TIMA
+ */
+static void pnv_xive_ic_write(void *opaque, hwaddr addr,
+                              uint64_t val, unsigned size)
+{
+    switch (addr) {
+    case 0x00000 ... 0x0FFFF:
+        pnv_xive_ic_reg_write(opaque, addr, val, true);
+        break;
+    case 0x10000 ... 0x107FF:
+        pnv_xive_ic_notify_write(opaque, addr - 0x10000, val, size);
+        break;
+    case 0x10800 ... 0x10FFF:
+        pnv_xive_ic_sync_write(opaque, addr - 0x10800, val, size);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE/IC: invalid write @%"
+                      HWADDR_PRIx"\n", addr);
+        break;
+    }
+}
+
+static uint64_t pnv_xive_ic_read(void *opaque, hwaddr addr, unsigned size)
+{
+    uint64_t ret = 0;
+
+    switch (addr) {
+    case 0x00000 ... 0x0FFFF:
+        ret = pnv_xive_ic_reg_read(opaque, addr, true);
+        break;
+    case 0x10000 ... 0x107FF:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE/IC: read IC notify port @%"
+                      HWADDR_PRIx"\n", addr);
+        break;
+    case 0x10800 ... 0x10FFF:
+        /* no reads on synchronisation registers */
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE/IC: invalid read @%"
+                      HWADDR_PRIx"\n", addr);
+        break;
+    }
+
+    return ret;
+}
+
+static const MemoryRegionOps pnv_xive_ic_ops = {
+    .read = pnv_xive_ic_read,
+    .write = pnv_xive_ic_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Interrupt controller XSCOM regions. Accesses can nearly all be
+ * redirected to the MMIO region.
+ */
+static uint64_t pnv_xive_xscom_read(void *opaque, hwaddr addr, unsigned size)
+{
+    switch (addr >> 3) {
+    case X_VC_EQC_CONFIG:
+        /* This is the only XSCOM load done in skiboot. To be checked. */
+        return SYNC_MASK;
+    default:
+        return pnv_xive_ic_reg_read(opaque, addr, false);
+    }
+}
+
+static void pnv_xive_xscom_write(void *opaque, hwaddr addr,
+                                uint64_t val, unsigned size)
+{
+    pnv_xive_ic_reg_write(opaque, addr, val, false);
+}
+
+static const MemoryRegionOps pnv_xive_xscom_ops = {
+    .read = pnv_xive_xscom_read,
+    .write = pnv_xive_xscom_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    }
+};
+
+/* TODO: finish reconciling with the generic XIVE routing routine */
+static void pnv_xive_notify(XiveFabric *xf, uint32_t lisn)
+{
+    PnvXive *xive = PNV_XIVE(xf);
+    XiveIVE *ive;
+    XiveEQ eq;
+    uint32_t eq_idx;
+    uint8_t priority;
+    uint32_t nvt_idx;
+    XiveNVT *nvt;
+
+    ive = xive_fabric_get_ive(xf, lisn);
+    if (!ive || !(ive->w & IVE_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %d\n", lisn);
+        return;
+    }
+
+    if (ive->w & IVE_MASKED) {
+        return;
+    }
+
+    /* Find our XiveEQ */
+    eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+    if (pnv_xive_eq_get(xive, eq_idx, &eq)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No EQ %d\n", eq_idx);
+        return;
+    }
+
+    if (!(eq.w0 & EQ_W0_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No valid EQ for LISN %d\n", lisn);
+        return;
+    }
+
+    if (eq.w0 & EQ_W0_ENQUEUE) {
+        xive_eq_push(&eq, GETFIELD(IVE_EQ_DATA, ive->w));
+        pnv_xive_eq_set(xive, eq_idx, &eq);
+    }
+    if (!(eq.w0 & EQ_W0_UCOND_NOTIFY)) {
+        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
+    }
+
+    nvt_idx = GETFIELD(EQ_W6_NVT_INDEX, eq.w6);
+    nvt = xive_fabric_get_nvt(xf, nvt_idx);
+    if (!nvt) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No NVT for idx %d\n", nvt_idx);
+        return;
+    }
+
+    if (GETFIELD(EQ_W6_FORMAT_BIT, eq.w6) == 0) {
+        priority = GETFIELD(EQ_W7_F0_PRIORITY, eq.w7);
+
+        /* The EQ is masked. Can this happen? */
+        if (priority == 0xff) {
+            return;
+        }
+
+        /* Update the IPB (Interrupt Pending Buffer) with the priority
+         * of the new notification. HW uses MMIOs to update the VP
+         * structures. Something to address later.
+         */
+        xive_nvt_hv_ipb_update(nvt, priority);
+    } else {
+        qemu_log_mask(LOG_UNIMP, "XIVE: w7 format1 not implemented\n");
+    }
+
+    xive_nvt_hv_notify(nvt);
+}
+
+/*
+ * Virtualization Controller MMIO region. It contains the ESB pages for
+ * the IPI interrupts and the ESB pages for the EQs. The split is
+ * defined by the set translation tables.
+ */
+static uint64_t pnv_xive_vc_read(void *opaque, hwaddr offset,
+                                 unsigned size)
+{
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE/VC: invalid read @%"
+                  HWADDR_PRIx"\n", offset);
+
+    /* if out of scope, specs says to return all ones */
+    return -1;
+}
+
+static void pnv_xive_vc_write(void *opaque, hwaddr offset,
+                              uint64_t value, unsigned size)
+{
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE/VC: invalid write @%"
+                  HWADDR_PRIx" val=0x%"PRIx64"\n", offset, value);
+}
+
+static const MemoryRegionOps pnv_xive_vc_ops = {
+    .read = pnv_xive_vc_read,
+    .write = pnv_xive_vc_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Presenter Controller MMIO region. This is used by the
+ * Virtualization Controller to update the IPB and the NVT (XiveVP)
+ * table when required. Not implemented yet.
+ */
+static uint64_t pnv_xive_pc_read(void *opaque, hwaddr addr,
+                                 unsigned size)
+{
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE/PC: invalid read @%"HWADDR_PRIx"\n",
+                  addr);
+    return -1;
+}
+
+static void pnv_xive_pc_write(void *opaque, hwaddr offset,
+                              uint64_t value, unsigned size)
+{
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE/PC: invalid write @%"HWADDR_PRIx
+                  " val=0x%"PRIx64"\n", offset, value);
+}
+
+static const MemoryRegionOps pnv_xive_pc_ops = {
+    .read = pnv_xive_pc_read,
+    .write = pnv_xive_pc_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon)
+{
+    int i;
+
+    monitor_printf(mon, "IVE Table\n");
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+        uint32_t eq_idx;
+
+        if (!(ive->w & IVE_VALID)) {
+            continue;
+        }
+
+        eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
+
+        monitor_printf(mon, " %6x %s eqidx:%d ", i,
+                       ive->w & IVE_MASKED ? "M" : " ",
+                       eq_idx);
+
+
+        if (!(ive->w & IVE_MASKED)) {
+            XiveEQ eq;
+
+            if (!pnv_xive_eq_get(xive, eq_idx, &eq)) {
+                xive_eq_pic_print_info(&eq, mon);
+                monitor_printf(mon, " data:%08x",
+                               (int) GETFIELD(IVE_EQ_DATA, ive->w));
+            } else {
+                monitor_printf(mon, "no eq ?!");
+            }
+        }
+        monitor_printf(mon, "\n");
+    }
+
+    xive_source_pic_print_info(&xive->source, mon);
+}
+
+static void pnv_xive_reset(DeviceState *dev)
+{
+    PnvXive *xive = PNV_XIVE(dev);
+    int i;
+
+    device_reset(DEVICE(&xive->source));
+    device_reset(DEVICE(&xive->eq_source));
+
+    /* Mask all valid IVEs in the IRQ number space. */
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveIVE *ive = &xive->ivt[i];
+        if (ive->w & IVE_VALID) {
+            ive->w |= IVE_MASKED;
+        }
+    }
+}
+
+static void pnv_xive_init(Object *obj)
+{
+    PnvXive *xive = PNV_XIVE(obj);
+
+    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
+
+    object_initialize(&xive->eq_source, sizeof(xive->eq_source),
+                      TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "eq_source", OBJECT(&xive->eq_source), NULL);
+}
+
+static void pnv_xive_realize(DeviceState *dev, Error **errp)
+{
+    PnvXive *xive = PNV_XIVE(dev);
+
+    /* XSCOM region */
+    memory_region_init_io(&xive->xscom_regs, OBJECT(dev), &pnv_xive_xscom_ops,
+                          xive, "xscom-xive", PNV_XSCOM_XIVE_SIZE << 3);
+
+    /* Interrupt controller MMIO region */
+    memory_region_init_io(&xive->ic_mmio, OBJECT(dev), &pnv_xive_ic_ops, xive,
+                          "xive.ic", PNV_XIVE_IC_SIZE);
+
+    /* Overall Virtualization Controller MMIO region.  */
+    memory_region_init_io(&xive->vc_mmio, OBJECT(xive), &pnv_xive_vc_ops, xive,
+                          "xive.vc", PNV_XIVE_VC_SIZE);
+
+    /* Virtualization Controller subregions for IPIs & EQs. Their
+     * sizes and offsets will be configured later when the translation
+     * sets are established.
+     */
+    xive->esb_base = xive->vc_base;
+    memory_region_init_io(&xive->esb_mmio, OBJECT(xive), NULL, xive,
+                          "xive.vc.esb", 0);
+    memory_region_add_subregion(&xive->vc_mmio, 0, &xive->esb_mmio);
+
+    xive->eq_base = xive->vc_base;
+    memory_region_init_io(&xive->eq_mmio, OBJECT(xive), NULL, xive,
+                          "xive.vc.eq", 0);
+    memory_region_add_subregion(&xive->vc_mmio, 0, &xive->eq_mmio);
+
+    /* Thread Interrupt Management Area */
+    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_hv_ops,
+                          &xive->cpu_ind, "xive.tima", PNV_XIVE_TM_SIZE);
+    memory_region_init_alias(&xive->tm_mmio_indirect, OBJECT(xive),
+                             "xive.tima.indirect",
+                             &xive->tm_mmio, 0, PNV_XIVE_TM_SIZE);
+
+    /* Presenter Controller MMIO region */
+    memory_region_init_io(&xive->pc_mmio, OBJECT(xive), &pnv_xive_pc_ops, xive,
+                          "xive.pc", PNV_XIVE_PC_SIZE);
+
+    /* Map all regions from the XIVE model realize routine. This is
+     * simpler than doing it from the machine.
+     */
+    memory_region_add_subregion(get_system_memory(), xive->ic_base,
+                                &xive->ic_mmio);
+    memory_region_add_subregion(get_system_memory(), xive->ic_base + 0x40000,
+                                &xive->tm_mmio_indirect);
+    memory_region_add_subregion(get_system_memory(), xive->vc_base,
+                                &xive->vc_mmio);
+    memory_region_add_subregion(get_system_memory(), xive->pc_base,
+                                &xive->pc_mmio);
+    memory_region_add_subregion(get_system_memory(), xive->tm_base,
+                                &xive->tm_mmio);
+}
+
+static int pnv_xive_dt_xscom(PnvXScomInterface *dev, void *fdt,
+                             int xscom_offset)
+{
+    const char compat[] = "ibm,power9-xive-x";
+    char *name;
+    int offset;
+    uint32_t xive_pcba = PNV_XSCOM_XIVE_BASE;
+    uint32_t reg[] = {
+        cpu_to_be32(xive_pcba),
+        cpu_to_be32(PNV_XSCOM_XIVE_SIZE)
+    };
+
+    name = g_strdup_printf("xive@%x", xive_pcba);
+    offset = fdt_add_subnode(fdt, xscom_offset, name);
+    _FDT(offset);
+    g_free(name);
+
+    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
+    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
+                      sizeof(compat))));
+    return 0;
+}
+
+static Property pnv_xive_properties[] = {
+    DEFINE_PROP_UINT64("ic-bar", PnvXive, ic_base, 0),
+    DEFINE_PROP_UINT64("vc-bar", PnvXive, vc_base, 0),
+    DEFINE_PROP_UINT64("pc-bar", PnvXive, pc_base, 0),
+    DEFINE_PROP_UINT64("tm-bar", PnvXive, tm_base, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static XiveNVT *pnv_xive_get_nvt(XiveFabric *xf, uint32_t nvt_idx)
+{
+    PnvXive *xive = PNV_XIVE(xf);
+    int server;
+    PowerPCCPU *cpu;
+    XiveVP vp;
+
+    /* only use the VP to check the valid bit */
+    if (pnv_xive_vp_get(xive, nvt_idx, &vp)) {
+        return NULL;
+    }
+
+    if (!(vp.w0 & VP_W0_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: VP idx %x is invalid\n", nvt_idx);
+        return NULL;
+    }
+
+    /* TODO: quick and dirty NVT-to-server decoding ... This needs
+     * more care. */
+    server = nvt_idx & 0x7f;
+    cpu = ppc_get_vcpu_by_pir(server);
+
+    return cpu ? XIVE_NVT(cpu->intc) : NULL;
+}
+
+static XiveIVE *pnv_xive_get_ive(XiveFabric *xf, uint32_t lisn)
+{
+    PnvXive *xive = PNV_XIVE(xf);
+
+    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
+}
+
+static void pnv_xive_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvXScomInterfaceClass *xdc = PNV_XSCOM_INTERFACE_CLASS(klass);
+    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
+
+    xdc->dt_xscom = pnv_xive_dt_xscom;
+
+    dc->desc = "PowerNV XIVE Interrupt Controller";
+    dc->realize = pnv_xive_realize;
+    dc->props = pnv_xive_properties;
+    dc->reset = pnv_xive_reset;
+
+    xfc->get_ive = pnv_xive_get_ive;
+    xfc->get_nvt = pnv_xive_get_nvt;
+    /* TODO : xfc->get_eq */
+    xfc->notify = pnv_xive_notify;
+}
+
+static const TypeInfo pnv_xive_info = {
+    .name          = TYPE_PNV_XIVE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_init = pnv_xive_init,
+    .instance_size = sizeof(PnvXive),
+    .class_init    = pnv_xive_class_init,
+    .interfaces    = (InterfaceInfo[]) {
+        { TYPE_PNV_XSCOM_INTERFACE },
+        { TYPE_XIVE_FABRIC },
+        { }
+    }
+};
+
+static void pnv_xive_register_types(void)
+{
+    type_register_static(&pnv_xive_info);
+}
+
+type_init(pnv_xive_register_types)
+
+void pnv_chip_xive_realize(PnvChip *chip, Error **errp)
+{
+    Object *obj;
+    Error *local_err = NULL;
+
+    obj = object_new(TYPE_PNV_XIVE);
+    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+
+    object_property_add_child(OBJECT(chip), "xive", obj, &error_abort);
+    object_property_set_int(obj, PNV_XIVE_IC_BASE(chip), "ic-bar",
+                            &error_fatal);
+    object_property_set_int(obj, PNV_XIVE_VC_BASE(chip), "vc-bar",
+                            &error_fatal);
+    object_property_set_int(obj, PNV_XIVE_PC_BASE(chip), "pc-bar",
+                            &error_fatal);
+    object_property_set_int(obj, PNV_XIVE_TM_BASE(chip), "tm-bar",
+                            &error_fatal);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    chip->xive = PNV_XIVE(obj);
+
+    pnv_xscom_add_subregion(chip, PNV_XSCOM_XIVE_BASE,
+                            &chip->xive->xscom_regs);
+}
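pnv_xive_notify() walks the routing tables in a fixed order: it looks up the IVE for the LISN and rejects invalid or masked entries, fetches the EQ named by IVE_EQ_INDEX, optionally pushes IVE_EQ_DATA into the event queue, finds the NVT from EQ_W6_NVT_INDEX, records the EQ priority in the NVT's IPB, and raises the interrupt when the resulting PIPR beats the CPPR. The standalone sketch below reproduces that chain. It is not QEMU code: IveSketch, EqSketch, NvtSketch and route_event() are hypothetical simplifications; prio_to_ipb() and ipb_to_pipr() are our own versions of the conversions that priority_to_ipb() and ipb_to_pipr() appear to perform, assuming priorities 0 to 7 with 0 most favoured and 0xff meaning masked; and the fixed-size ring skips whatever size and generation handling xive_eq_push() does.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_PRIO_MAX   7     /* assumed: 0 is most favoured, 7 least */
#define SKETCH_PRIO_NONE  0xff  /* assumed: "no priority" / masked */
#define SKETCH_QUEUE_LEN  8

/* Hypothetical, simplified stand-ins for XiveIVE, XiveEQ and XiveNVT. */
typedef struct {
    bool valid, masked;
    uint32_t eq_idx;            /* which EQ receives the event */
    uint32_t eq_data;           /* value pushed into the event queue */
} IveSketch;

typedef struct {
    bool valid, enqueue;
    uint32_t nvt_idx;           /* target of the notification */
    uint8_t prio;               /* SKETCH_PRIO_NONE means the EQ is masked */
    uint32_t queue[SKETCH_QUEUE_LEN]; /* plain ring, no generation bit */
    unsigned qindex;
} EqSketch;

typedef struct {
    uint8_t ipb;                /* Interrupt Pending Buffer, one bit per priority */
    uint8_t pipr;               /* most favoured pending priority */
    uint8_t cppr;               /* current processor priority */
    bool irq_raised;
} NvtSketch;

static uint8_t prio_to_ipb(uint8_t prio)
{
    return prio > SKETCH_PRIO_MAX ? 0 : (uint8_t)(1u << (SKETCH_PRIO_MAX - prio));
}

/* Lowest (most favoured) priority whose IPB bit is set. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= SKETCH_PRIO_MAX; prio++) {
        if (ipb & prio_to_ipb(prio)) {
            return prio;
        }
    }
    return SKETCH_PRIO_NONE;
}

/* LISN -> IVE -> EQ -> NVT. Returns true if the NVT's IPB was updated. */
static bool route_event(const IveSketch *ivt, size_t nr_irqs, uint32_t lisn,
                        EqSketch *eqs, size_t nr_eqs,
                        NvtSketch *nvts, size_t nr_nvts)
{
    if (lisn >= nr_irqs || !ivt[lisn].valid || ivt[lisn].masked) {
        return false;           /* invalid or masked source */
    }
    const IveSketch *ive = &ivt[lisn];

    if (ive->eq_idx >= nr_eqs || !eqs[ive->eq_idx].valid) {
        return false;           /* no valid EQ configured */
    }
    EqSketch *eq = &eqs[ive->eq_idx];

    if (eq->enqueue) {
        eq->queue[eq->qindex % SKETCH_QUEUE_LEN] = ive->eq_data;
        eq->qindex++;
    }

    if (eq->nvt_idx >= nr_nvts) {
        return false;           /* no NVT for this EQ */
    }
    NvtSketch *nvt = &nvts[eq->nvt_idx];

    if (eq->prio == SKETCH_PRIO_NONE) {
        return false;           /* EQ masked */
    }
    nvt->ipb |= prio_to_ipb(eq->prio);
    nvt->pipr = ipb_to_pipr(nvt->ipb);

    /* Present the exception only if it beats the current priority. */
    if (nvt->pipr < nvt->cppr) {
        nvt->irq_raised = true;
    }
    return true;
}
```

For example, routing a LISN to an EQ with priority 5 sets IPB bit 0x04 and PIPR 5, and the interrupt is raised only while the thread's CPPR is above 5.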
diff --git a/hw/intc/pnv_xive_regs.h b/hw/intc/pnv_xive_regs.h
new file mode 100644
index 000000000000..2ea371211bcc
--- /dev/null
+++ b/hw/intc/pnv_xive_regs.h
@@ -0,0 +1,314 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_PNV_XIVE_REGS_H
+#define PPC_PNV_XIVE_REGS_H
+
+/* IC register offsets */
+#define CQ_SWI_CMD_HIST         0x020
+#define CQ_SWI_CMD_POLL         0x028
+#define CQ_SWI_CMD_BCAST        0x030
+#define CQ_SWI_CMD_ASSIGN       0x038
+#define CQ_SWI_CMD_BLK_UPD      0x040
+#define CQ_SWI_RSP              0x048
+#define X_CQ_CFG_PB_GEN         0x0a
+#define CQ_CFG_PB_GEN           0x050
+#define   CQ_INT_ADDR_OPT       PPC_BITMASK(14, 15)
+#define X_CQ_IC_BAR             0x10
+#define X_CQ_MSGSND             0x0b
+#define CQ_MSGSND               0x058
+#define CQ_CNPM_SEL             0x078
+#define CQ_IC_BAR               0x080
+#define   CQ_IC_BAR_VALID       PPC_BIT(0)
+#define   CQ_IC_BAR_64K         PPC_BIT(1)
+#define X_CQ_TM1_BAR            0x12
+#define CQ_TM1_BAR              0x090
+#define X_CQ_TM2_BAR            0x14
+#define CQ_TM2_BAR              0x0a0
+#define   CQ_TM_BAR_VALID       PPC_BIT(0)
+#define   CQ_TM_BAR_64K         PPC_BIT(1)
+#define X_CQ_PC_BAR             0x16
+#define CQ_PC_BAR               0x0b0
+#define  CQ_PC_BAR_VALID        PPC_BIT(0)
+#define X_CQ_PC_BARM            0x17
+#define CQ_PC_BARM              0x0b8
+#define  CQ_PC_BARM_MASK        PPC_BITMASK(26, 38)
+#define X_CQ_VC_BAR             0x18
+#define CQ_VC_BAR               0x0c0
+#define  CQ_VC_BAR_VALID        PPC_BIT(0)
+#define X_CQ_VC_BARM            0x19
+#define CQ_VC_BARM              0x0c8
+#define  CQ_VC_BARM_MASK        PPC_BITMASK(21, 37)
+#define X_CQ_TAR                0x1e
+#define CQ_TAR                  0x0f0
+#define  CQ_TAR_TBL_AUTOINC     PPC_BIT(0)
+#define  CQ_TAR_TSEL            PPC_BITMASK(12, 15)
+#define  CQ_TAR_TSEL_BLK        PPC_BIT(12)
+#define  CQ_TAR_TSEL_MIG        PPC_BIT(13)
+#define  CQ_TAR_TSEL_VDT        PPC_BIT(14)
+#define  CQ_TAR_TSEL_EDT        PPC_BIT(15)
+#define  CQ_TAR_TSEL_INDEX      PPC_BITMASK(26, 31)
+#define X_CQ_TDR                0x1f
+#define CQ_TDR                  0x0f8
+#define  CQ_TDR_VDT_VALID       PPC_BIT(0)
+#define  CQ_TDR_VDT_BLK         PPC_BITMASK(11, 15)
+#define  CQ_TDR_VDT_INDEX       PPC_BITMASK(28, 31)
+#define  CQ_TDR_EDT_TYPE        PPC_BITMASK(0, 1)
+#define  CQ_TDR_EDT_INVALID     0
+#define  CQ_TDR_EDT_IPI         1
+#define  CQ_TDR_EDT_EQ          2
+#define  CQ_TDR_EDT_BLK         PPC_BITMASK(12, 15)
+#define  CQ_TDR_EDT_INDEX       PPC_BITMASK(26, 31)
+#define X_CQ_PBI_CTL            0x20
+#define CQ_PBI_CTL              0x100
+#define  CQ_PBI_PC_64K          PPC_BIT(5)
+#define  CQ_PBI_VC_64K          PPC_BIT(6)
+#define  CQ_PBI_LNX_TRIG        PPC_BIT(7)
+#define  CQ_PBI_FORCE_TM_LOCAL  PPC_BIT(22)
+#define CQ_PBO_CTL              0x108
+#define CQ_AIB_CTL              0x110
+#define X_CQ_RST_CTL            0x23
+#define CQ_RST_CTL              0x118
+#define X_CQ_FIRMASK            0x33
+#define CQ_FIRMASK              0x198
+#define X_CQ_FIRMASK_AND        0x34
+#define CQ_FIRMASK_AND          0x1a0
+#define X_CQ_FIRMASK_OR         0x35
+#define CQ_FIRMASK_OR           0x1a8
+
+/* PC LBS1 register offsets */
+#define X_PC_TCTXT_CFG          0x100
+#define PC_TCTXT_CFG            0x400
+#define  PC_TCTXT_CFG_BLKGRP_EN         PPC_BIT(0)
+#define  PC_TCTXT_CFG_TARGET_EN         PPC_BIT(1)
+#define  PC_TCTXT_CFG_LGS_EN            PPC_BIT(2)
+#define  PC_TCTXT_CFG_STORE_ACK         PPC_BIT(3)
+#define  PC_TCTXT_CFG_HARD_CHIPID_BLK   PPC_BIT(8)
+#define  PC_TCTXT_CHIPID_OVERRIDE       PPC_BIT(9)
+#define  PC_TCTXT_CHIPID                PPC_BITMASK(12, 15)
+#define  PC_TCTXT_INIT_AGE              PPC_BITMASK(30, 31)
+#define X_PC_TCTXT_TRACK        0x101
+#define PC_TCTXT_TRACK          0x408
+#define  PC_TCTXT_TRACK_EN              PPC_BIT(0)
+#define X_PC_TCTXT_INDIR0       0x104
+#define PC_TCTXT_INDIR0         0x420
+#define  PC_TCTXT_INDIR_VALID           PPC_BIT(0)
+#define  PC_TCTXT_INDIR_THRDID          PPC_BITMASK(9, 15)
+#define X_PC_TCTXT_INDIR1       0x105
+#define PC_TCTXT_INDIR1         0x428
+#define X_PC_TCTXT_INDIR2       0x106
+#define PC_TCTXT_INDIR2         0x430
+#define X_PC_TCTXT_INDIR3       0x107
+#define PC_TCTXT_INDIR3         0x438
+#define X_PC_THREAD_EN_REG0     0x108
+#define PC_THREAD_EN_REG0       0x440
+#define X_PC_THREAD_EN_REG0_SET 0x109
+#define PC_THREAD_EN_REG0_SET   0x448
+#define X_PC_THREAD_EN_REG0_CLR 0x10a
+#define PC_THREAD_EN_REG0_CLR   0x450
+#define X_PC_THREAD_EN_REG1     0x10c
+#define PC_THREAD_EN_REG1       0x460
+#define X_PC_THREAD_EN_REG1_SET 0x10d
+#define PC_THREAD_EN_REG1_SET   0x468
+#define X_PC_THREAD_EN_REG1_CLR 0x10e
+#define PC_THREAD_EN_REG1_CLR   0x470
+#define X_PC_GLOBAL_CONFIG      0x110
+#define PC_GLOBAL_CONFIG        0x480
+#define  PC_GCONF_INDIRECT      PPC_BIT(32)
+#define  PC_GCONF_CHIPID_OVR    PPC_BIT(40)
+#define  PC_GCONF_CHIPID        PPC_BITMASK(44, 47)
+#define X_PC_VSD_TABLE_ADDR     0x111
+#define PC_VSD_TABLE_ADDR       0x488
+#define X_PC_VSD_TABLE_DATA     0x112
+#define PC_VSD_TABLE_DATA       0x490
+#define X_PC_AT_KILL            0x116
+#define PC_AT_KILL              0x4b0
+#define  PC_AT_KILL_VALID       PPC_BIT(0)
+#define  PC_AT_KILL_BLOCK_ID    PPC_BITMASK(27, 31)
+#define  PC_AT_KILL_OFFSET      PPC_BITMASK(48, 60)
+#define X_PC_AT_KILL_MASK       0x117
+#define PC_AT_KILL_MASK         0x4b8
+
+/* PC LBS2 register offsets */
+#define X_PC_VPC_CACHE_ENABLE   0x161
+#define PC_VPC_CACHE_ENABLE     0x708
+#define  PC_VPC_CACHE_EN_MASK   PPC_BITMASK(0, 31)
+#define X_PC_VPC_SCRUB_TRIG     0x162
+#define PC_VPC_SCRUB_TRIG       0x710
+#define X_PC_VPC_SCRUB_MASK     0x163
+#define PC_VPC_SCRUB_MASK       0x718
+#define  PC_SCRUB_VALID         PPC_BIT(0)
+#define  PC_SCRUB_WANT_DISABLE  PPC_BIT(1)
+#define  PC_SCRUB_WANT_INVAL    PPC_BIT(2)
+#define  PC_SCRUB_BLOCK_ID      PPC_BITMASK(27, 31)
+#define  PC_SCRUB_OFFSET        PPC_BITMASK(45, 63)
+#define X_PC_VPC_CWATCH_SPEC    0x167
+#define PC_VPC_CWATCH_SPEC      0x738
+#define  PC_VPC_CWATCH_CONFLICT PPC_BIT(0)
+#define  PC_VPC_CWATCH_FULL     PPC_BIT(8)
+#define  PC_VPC_CWATCH_BLOCKID  PPC_BITMASK(27, 31)
+#define  PC_VPC_CWATCH_OFFSET   PPC_BITMASK(45, 63)
+#define X_PC_VPC_CWATCH_DAT0    0x168
+#define PC_VPC_CWATCH_DAT0      0x740
+#define X_PC_VPC_CWATCH_DAT1    0x169
+#define PC_VPC_CWATCH_DAT1      0x748
+#define X_PC_VPC_CWATCH_DAT2    0x16a
+#define PC_VPC_CWATCH_DAT2      0x750
+#define X_PC_VPC_CWATCH_DAT3    0x16b
+#define PC_VPC_CWATCH_DAT3      0x758
+#define X_PC_VPC_CWATCH_DAT4    0x16c
+#define PC_VPC_CWATCH_DAT4      0x760
+#define X_PC_VPC_CWATCH_DAT5    0x16d
+#define PC_VPC_CWATCH_DAT5      0x768
+#define X_PC_VPC_CWATCH_DAT6    0x16e
+#define PC_VPC_CWATCH_DAT6      0x770
+#define X_PC_VPC_CWATCH_DAT7    0x16f
+#define PC_VPC_CWATCH_DAT7      0x778
+
+/* VC0 register offsets */
+#define X_VC_GLOBAL_CONFIG      0x200
+#define VC_GLOBAL_CONFIG        0x800
+#define  VC_GCONF_INDIRECT      PPC_BIT(32)
+#define X_VC_VSD_TABLE_ADDR     0x201
+#define VC_VSD_TABLE_ADDR       0x808
+#define X_VC_VSD_TABLE_DATA     0x202
+#define VC_VSD_TABLE_DATA       0x810
+#define VC_IVE_ISB_BLOCK_MODE   0x818
+#define VC_EQD_BLOCK_MODE       0x820
+#define VC_VPS_BLOCK_MODE       0x828
+#define X_VC_IRQ_CONFIG_IPI     0x208
+#define VC_IRQ_CONFIG_IPI       0x840
+#define  VC_IRQ_CONFIG_MEMB_EN  PPC_BIT(45)
+#define  VC_IRQ_CONFIG_MEMB_SZ  PPC_BITMASK(46, 51)
+#define VC_IRQ_CONFIG_HW        0x848
+#define VC_IRQ_CONFIG_CASCADE1  0x850
+#define VC_IRQ_CONFIG_CASCADE2  0x858
+#define VC_IRQ_CONFIG_REDIST    0x860
+#define VC_IRQ_CONFIG_IPI_CASC  0x868
+#define X_VC_AIB_TX_ORDER_TAG2  0x22d
+#define VC_AIB_TX_ORDER_TAG2    0x890
+#define  VC_AIB_TX_ORDER_TAG2_REL_TF    PPC_BIT(20)
+#define X_VC_AT_MACRO_KILL      0x23e
+#define VC_AT_MACRO_KILL        0x8b0
+#define X_VC_AT_MACRO_KILL_MASK 0x23f
+#define VC_AT_MACRO_KILL_MASK   0x8b8
+#define  VC_KILL_VALID          PPC_BIT(0)
+#define  VC_KILL_TYPE           PPC_BITMASK(14, 15)
+#define   VC_KILL_IRQ   0
+#define   VC_KILL_IVC   1
+#define   VC_KILL_SBC   2
+#define   VC_KILL_EQD   3
+#define  VC_KILL_BLOCK_ID       PPC_BITMASK(27, 31)
+#define  VC_KILL_OFFSET         PPC_BITMASK(48, 60)
+#define X_VC_EQC_CACHE_ENABLE   0x211
+#define VC_EQC_CACHE_ENABLE     0x908
+#define  VC_EQC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
+#define X_VC_EQC_SCRUB_TRIG     0x212
+#define VC_EQC_SCRUB_TRIG       0x910
+#define X_VC_EQC_SCRUB_MASK     0x213
+#define VC_EQC_SCRUB_MASK       0x918
+#define X_VC_EQC_CONFIG         0x214
+#define VC_EQC_CONFIG           0x920
+#define  VC_EQC_CONF_SYNC_IPI           PPC_BIT(32)
+#define  VC_EQC_CONF_SYNC_HW            PPC_BIT(33)
+#define  VC_EQC_CONF_SYNC_ESC1          PPC_BIT(34)
+#define  VC_EQC_CONF_SYNC_ESC2          PPC_BIT(35)
+#define  VC_EQC_CONF_SYNC_REDI          PPC_BIT(36)
+#define  VC_EQC_CONF_EQP_INTERLEAVE     PPC_BIT(38)
+#define  VC_EQC_CONF_ENABLE_END_s_BIT   PPC_BIT(39)
+#define  VC_EQC_CONF_ENABLE_END_u_BIT   PPC_BIT(40)
+#define  VC_EQC_CONF_ENABLE_END_c_BIT   PPC_BIT(41)
+#define  VC_EQC_CONF_ENABLE_MORE_QSZ    PPC_BIT(42)
+#define  VC_EQC_CONF_SKIP_ESCALATE      PPC_BIT(43)
+#define X_VC_EQC_CWATCH_SPEC    0x215
+#define VC_EQC_CWATCH_SPEC      0x928
+#define  VC_EQC_CWATCH_CONFLICT PPC_BIT(0)
+#define  VC_EQC_CWATCH_FULL     PPC_BIT(8)
+#define  VC_EQC_CWATCH_BLOCKID  PPC_BITMASK(28, 31)
+#define  VC_EQC_CWATCH_OFFSET   PPC_BITMASK(40, 63)
+#define X_VC_EQC_CWATCH_DAT0    0x216
+#define VC_EQC_CWATCH_DAT0      0x930
+#define X_VC_EQC_CWATCH_DAT1    0x217
+#define VC_EQC_CWATCH_DAT1      0x938
+#define X_VC_EQC_CWATCH_DAT2    0x218
+#define VC_EQC_CWATCH_DAT2      0x940
+#define X_VC_EQC_CWATCH_DAT3    0x219
+#define VC_EQC_CWATCH_DAT3      0x948
+#define X_VC_IVC_SCRUB_TRIG     0x222
+#define VC_IVC_SCRUB_TRIG       0x990
+#define X_VC_IVC_SCRUB_MASK     0x223
+#define VC_IVC_SCRUB_MASK       0x998
+#define X_VC_SBC_SCRUB_TRIG     0x232
+#define VC_SBC_SCRUB_TRIG       0xa10
+#define X_VC_SBC_SCRUB_MASK     0x233
+#define VC_SBC_SCRUB_MASK       0xa18
+#define  VC_SCRUB_VALID         PPC_BIT(0)
+#define  VC_SCRUB_WANT_DISABLE  PPC_BIT(1)
+#define  VC_SCRUB_WANT_INVAL    PPC_BIT(2) /* EQC and SBC only */
+#define  VC_SCRUB_BLOCK_ID      PPC_BITMASK(28, 31)
+#define  VC_SCRUB_OFFSET        PPC_BITMASK(40, 63)
+#define X_VC_IVC_CACHE_ENABLE   0x221
+#define VC_IVC_CACHE_ENABLE     0x988
+#define  VC_IVC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
+#define X_VC_SBC_CACHE_ENABLE   0x231
+#define VC_SBC_CACHE_ENABLE     0xa08
+#define  VC_SBC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
+#define VC_IVC_CACHE_SCRUB_TRIG 0x990
+#define VC_IVC_CACHE_SCRUB_MASK 0x998
+#define VC_SBC_CACHE_ENABLE     0xa08
+#define VC_SBC_CACHE_SCRUB_TRIG 0xa10
+#define VC_SBC_CACHE_SCRUB_MASK 0xa18
+#define VC_SBC_CONFIG           0xa20
+#define X_VC_SBC_CONFIG         0x234
+#define  VC_SBC_CONF_CPLX_CIST  PPC_BIT(44)
+#define  VC_SBC_CONF_CIST_BOTH  PPC_BIT(45)
+#define  VC_SBC_CONF_NO_UPD_PRF PPC_BIT(59)
+
+/* VC1 register offsets */
+
+/* VSD Table address register definitions (shared) */
+#define VST_ADDR_AUTOINC        PPC_BIT(0)
+#define VST_TABLE_SELECT        PPC_BITMASK(13, 15)
+#define  VST_TSEL_IVT   0
+#define  VST_TSEL_SBE   1
+#define  VST_TSEL_EQDT  2
+#define  VST_TSEL_VPDT  3
+#define  VST_TSEL_IRQ   4       /* VC only */
+#define VST_TABLE_OFFSET        PPC_BITMASK(27, 31)
+
+/* Number of queue overflow pages */
+#define VC_QUEUE_OVF_COUNT      6
+
+/* Bits in a VSD entry.
+ *
+ * Note: the address is naturally aligned, so we don't use a PPC_BITMASK
+ *       but just a mask to apply to the address before OR'ing it in.
+ *
+ * Note: VSD_FIRMWARE is a SW bit! It hijacks an unused bit in the
+ *       VSD and is only meant to be used in indirect mode !
+ */
+#define VSD_MODE                PPC_BITMASK(0, 1)
+#define  VSD_MODE_SHARED        1
+#define  VSD_MODE_EXCLUSIVE     2
+#define  VSD_MODE_FORWARD       3
+#define VSD_ADDRESS_MASK        0x0ffffffffffff000ull
+#define VSD_MIGRATION_REG       PPC_BITMASK(52, 55)
+#define VSD_INDIRECT            PPC_BIT(56)
+#define VSD_TSIZE               PPC_BITMASK(59, 63)
+#define VSD_FIRMWARE            PPC_BIT(2) /* Read warning above */
+
+#define SYNC_MASK                \
+        (VC_EQC_CONF_SYNC_IPI  | \
+         VC_EQC_CONF_SYNC_HW   | \
+         VC_EQC_CONF_SYNC_ESC1 | \
+         VC_EQC_CONF_SYNC_ESC2 | \
+         VC_EQC_CONF_SYNC_REDI)
+
+
+#endif /* PPC_PNV_XIVE_REGS_H */
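The register definitions above use IBM bit numbering, in which bit 0 is the most significant bit of a 64-bit word, and the model extracts fields with GETFIELD(). The standalone sketch below re-implements those helpers under hypothetical names (IBM_BIT, IBM_BITMASK, getfield64, setfield64), written to match how PPC_BIT, PPC_BITMASK, GETFIELD and SETFIELD are used in this patch; it is not QEMU's code. It also encodes one pattern visible in the header: for the CQ registers, the MMIO offset is the XSCOM register number shifted left by 3 (X_CQ_TAR 0x1e and CQ_TAR 0x0f0, for example), which is what lets pnv_xive_xscom_read() pass its byte address straight to pnv_xive_ic_reg_read(). The PC and VC registers do not all follow that pattern: X_PC_TCTXT_CFG (0x100) shifted by 3 is 0x800, whereas PC_TCTXT_CFG is 0x400.

```c
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB of a 64-bit word (hypothetical names). */
#define IBM_BIT(bit)        (0x8000000000000000ULL >> (bit))
#define IBM_BITMASK(bs, be) ((IBM_BIT(bs) - IBM_BIT(be)) | IBM_BIT(bs))

/* Count trailing zeros; mask must be non-zero. */
static unsigned mask_shift(uint64_t mask)
{
    unsigned n = 0;

    while (!(mask & 1)) {
        mask >>= 1;
        n++;
    }
    return n;
}

/* Extract the field selected by mask, right-aligned. */
static uint64_t getfield64(uint64_t mask, uint64_t word)
{
    return (word & mask) >> mask_shift(mask);
}

/* Return word with the field selected by mask replaced by val. */
static uint64_t setfield64(uint64_t mask, uint64_t word, uint64_t val)
{
    return (word & ~mask) | ((val << mask_shift(mask)) & mask);
}

/* CQ registers only: MMIO byte offset = XSCOM register number * 8. */
static uint32_t cq_xscom_to_mmio(uint32_t xscom_reg)
{
    return xscom_reg << 3;
}
```

For instance, setfield64(IBM_BITMASK(0, 1), 0, 2) builds a VSD whose VSD_MODE field is VSD_MODE_EXCLUSIVE, that is 0x8000000000000000.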
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 4a9b09e3d819..782a2f8f5ef2 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -60,7 +60,7 @@ void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon)
                    priority, server, qaddr_base, qindex, qentries, qgen);
 }
 
-static void xive_eq_push(XiveEQ *eq, uint32_t data)
+void xive_eq_push(XiveEQ *eq, uint32_t data)
 {
     uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
     uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
@@ -137,6 +137,12 @@ static void xive_nvt_ipb_update(XiveNVT *nvt, uint8_t priority)
     nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
 }
 
+void xive_nvt_hv_ipb_update(XiveNVT *nvt, uint8_t priority)
+{
+    nvt->ring_hv[TM_IPB] |= priority_to_ipb(priority);
+    nvt->ring_hv[TM_PIPR] = ipb_to_pipr(nvt->ring_hv[TM_IPB]);
+}
+
 static uint64_t xive_nvt_accept(XiveNVT *nvt)
 {
     uint8_t nsr = nvt->ring_os[TM_NSR];
@@ -337,6 +343,150 @@ const MemoryRegionOps xive_tm_user_ops = {
     },
 };
 
+/*
+ * HV Thread Interrupt Management Area MMIO
+ */
+
+static uint64_t xive_nvt_hv_accept(XiveNVT *nvt)
+{
+    uint8_t nsr = nvt->ring_hv[TM_NSR];
+
+    qemu_irq_lower(nvt->output);
+
+    if (nvt->ring_hv[TM_NSR] & TM_QW3_NSR_HE) {
+        uint8_t cppr = nvt->ring_hv[TM_PIPR];
+
+        nvt->ring_hv[TM_CPPR] = cppr;
+
+        /* Reset the pending buffer bit */
+        nvt->ring_hv[TM_IPB] &= ~priority_to_ipb(cppr);
+        nvt->ring_hv[TM_PIPR] = ipb_to_pipr(nvt->ring_hv[TM_IPB]);
+
+        /* Drop Exception bit for HV */
+        nvt->ring_hv[TM_NSR] &= ~TM_QW3_NSR_HE;
+    }
+
+    return (nsr << 8) | nvt->ring_hv[TM_CPPR];
+}
+
+void xive_nvt_hv_notify(XiveNVT *nvt)
+{
+    if (nvt->ring_hv[TM_PIPR] < nvt->ring_hv[TM_CPPR]) {
+        nvt->ring_hv[TM_NSR] =
+            SETFIELD(TM_QW3_NSR_HE, nvt->ring_hv[TM_NSR], TM_QW3_NSR_HE_PHYS);
+        qemu_irq_raise(nvt->output);
+    }
+}
+
+static void xive_nvt_hv_set_cppr(XiveNVT *nvt, uint8_t cppr)
+{
+    if (cppr > XIVE_PRIORITY_MAX) {
+        cppr = 0xff;
+    }
+
+    nvt->ring_hv[TM_CPPR] = cppr;
+
+    /* CPPR has changed, check if we need to redistribute a pending
+     * exception */
+    xive_nvt_hv_notify(nvt);
+}
+
+static uint64_t xive_tm_hv_read_special(XiveNVT *nvt, hwaddr offset,
+                                        unsigned size)
+{
+    uint64_t ret = -1;
+
+    if (offset == TM_SPC_ACK_HV_REG && size == 2) {
+        return xive_nvt_hv_accept(nvt);
+    }
+
+    if (offset == TM_SPC_PULL_POOL_CTX) {
+        ret = nvt->regs[TM_QW2_HV_POOL + TM_WORD2] & TM_QW2W2_POOL_CAM;
+        nvt->regs[TM_QW2_HV_POOL + TM_WORD2] &= ~TM_QW2W2_POOL_CAM;
+        return ret;
+    }
+
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
+                  HWADDR_PRIx" size %u\n", offset, size);
+    return ret;
+}
+
+static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset,
+                                unsigned size)
+{
+    PowerPCCPU **cpuptr = opaque;
+    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
+    XiveNVT *nvt = XIVE_NVT(cpu->intc);
+    uint64_t ret = -1;
+    int i;
+
+    assert(nvt);
+
+    /* Ignore the view: only the low 12 bits of the offset are decoded */
+    offset &= 0xFFF;
+
+    if (offset >= TM_SPC_ACK_EBB) {
+        return xive_tm_hv_read_special(nvt, offset, size);
+    }
+
+    ret = 0;
+    for (i = 0; i < size; i++) {
+        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
+    }
+
+    return ret;
+}
+
+static void xive_tm_hv_write_special(XiveNVT *nvt, hwaddr offset,
+                                     uint64_t value, unsigned size)
+{
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA write @%"
+                  HWADDR_PRIx" size %u\n", offset, size);
+}
+
+static void xive_tm_hv_write(void *opaque, hwaddr offset,
+                             uint64_t value, unsigned size)
+{
+    PowerPCCPU **cpuptr = opaque;
+    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
+    XiveNVT *nvt = XIVE_NVT(cpu->intc);
+    int i;
+
+    /* Ignore the view: only the low 12 bits of the offset are decoded */
+    offset &= 0xFFF;
+
+    if (offset >= TM_SPC_ACK_EBB) {
+        xive_tm_hv_write_special(nvt, offset, value, size);
+        return;
+    }
+
+    switch (offset) {
+    case TM_QW3_HV_PHYS + TM_CPPR:
+        xive_nvt_hv_set_cppr(nvt, value & 0xff);
+        return;
+    default:
+        break;
+    }
+
+    for (i = 0; i < size; i++) {
+        nvt->regs[offset + i] = (value >> (8 * (size - i - 1))) & 0xff;
+    }
+}
+
+const MemoryRegionOps xive_tm_hv_ops = {
+    .read = xive_tm_hv_read,
+    .write = xive_tm_hv_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
 static char *xive_nvt_ring_print(uint8_t *ring)
 {
     uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
@@ -361,6 +511,12 @@ void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
     monitor_printf(mon, "CPU[%04x]: QW    NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
                    " W2\n", cpu_index);
 
+    s = xive_nvt_ring_print(&nvt->regs[TM_QW3_HV_PHYS]);
+    monitor_printf(mon, "CPU[%04x]: HV    %s\n", cpu_index, s);
+    g_free(s);
+    s = xive_nvt_ring_print(&nvt->regs[TM_QW2_HV_POOL]);
+    monitor_printf(mon, "CPU[%04x]: POOL  %s\n", cpu_index, s);
+    g_free(s);
     s = xive_nvt_ring_print(&nvt->regs[TM_QW1_OS]);
     monitor_printf(mon, "CPU[%04x]: OS    %s\n", cpu_index, s);
     g_free(s);
@@ -381,6 +537,7 @@ static void xive_nvt_reset(void *dev)
      * CPPR is first set.
      */
     nvt->ring_os[TM_PIPR] = ipb_to_pipr(nvt->ring_os[TM_IPB]);
+    nvt->ring_hv[TM_PIPR] = ipb_to_pipr(nvt->ring_hv[TM_IPB]);
 
     for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
         xive_eq_reset(&nvt->eqt[i]);
@@ -439,6 +596,7 @@ static void xive_nvt_init(Object *obj)
     XiveNVT *nvt = XIVE_NVT(obj);
 
     nvt->ring_os = &nvt->regs[TM_QW1_OS];
+    nvt->ring_hv = &nvt->regs[TM_QW3_HV_PHYS];
 }
 
 static const VMStateDescription vmstate_xive_nvt_eq = {
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index d07a8ce38e99..4dd84b83e04c 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -300,7 +300,10 @@ static void pnv_dt_chip(PnvChip *chip, void *fdt)
         pnv_dt_core(chip, pnv_core, fdt);
 
         /* Interrupt Control Presenters (ICP). One per core. */
-        pnv_dt_icp(chip, fdt, pnv_core->pir, CPU_CORE(pnv_core)->nr_threads);
+        if (!pnv_chip_is_power9(chip)) {
+            pnv_dt_icp(chip, fdt, pnv_core->pir,
+                       CPU_CORE(pnv_core)->nr_threads);
+        }
     }
 
     if (chip->ram_size) {
@@ -923,9 +926,14 @@ static void pnv_chip_realize(DeviceState *dev, Error **errp)
                              &error_fatal);
     pnv_xscom_add_subregion(chip, PNV_XSCOM_LPC_BASE, &chip->lpc.xscom_regs);
 
-    /* Interrupt Management Area. This is the memory region holding
-     * all the Interrupt Control Presenter (ICP) registers */
-    pnv_chip_icp_realize(chip, &error);
+    if (!pnv_chip_is_power9(chip)) {
+        /* Interrupt Management Area. This is the memory region holding
+         * all the Interrupt Control Presenter (ICP) registers */
+        pnv_chip_icp_realize(chip, &error);
+    } else {
+        /* XIVE Interrupt Controller on P9 */
+        pnv_chip_xive_realize(chip, &error);
+    }
     if (error) {
         error_propagate(errp, error);
         return;
@@ -1004,10 +1012,10 @@ Object *pnv_icp_create(PnvMachineState *pnv, Object *cpu, Error **errp)
     Error *local_err = NULL;
     Object *obj;
 
-    obj = icp_create(cpu, TYPE_PNV_ICP, XICS_FABRIC(pnv), &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return NULL;
+    if (!pnv_is_power9(pnv)) {
+        obj = icp_create(cpu, TYPE_PNV_ICP, XICS_FABRIC(pnv), &local_err);
+    } else {
+        obj = xive_nvt_create(cpu, TYPE_XIVE_NVT, &local_err);
     }
 
     return obj;
@@ -1023,11 +1031,19 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        icp_pic_print_info(ICP(cpu->intc), mon);
+        if (!pnv_is_power9(pnv)) {
+            icp_pic_print_info(ICP(cpu->intc), mon);
+        } else {
+            xive_nvt_pic_print_info(XIVE_NVT(cpu->intc), mon);
+        }
     }
 
     for (i = 0; i < pnv->num_chips; i++) {
-        ics_pic_print_info(&pnv->chips[i]->psi.ics, mon);
+        if (!pnv_is_power9(pnv)) {
+            ics_pic_print_info(&pnv->chips[i]->psi.ics, mon);
+        } else {
+            pnv_xive_pic_print_info(pnv->chips[0]->xive, mon);
+        }
     }
 }
 
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 877c3b79b239..f66fe53c38bb 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -25,6 +25,7 @@
 #include "hw/ppc/pnv_lpc.h"
 #include "hw/ppc/pnv_psi.h"
 #include "hw/ppc/pnv_occ.h"
+#include "hw/ppc/pnv_xive.h"
 
 #define TYPE_PNV_CHIP "pnv-chip"
 #define PNV_CHIP(obj) OBJECT_CHECK(PnvChip, (obj), TYPE_PNV_CHIP)
@@ -62,6 +63,8 @@ typedef struct PnvChip {
     PnvLpcController lpc;
     PnvPsi       psi;
     PnvOCC       occ;
+
+    PnvXive      *xive;
 } PnvChip;
 
 typedef struct PnvChipClass {
@@ -191,6 +194,24 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
     (0x0003ffe000000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * \
      PNV_PSIHB_FSP_SIZE)
 
+/*
+ * POWER9 MMIO base addresses
+ */
+#define PNV_XIVE_VC_SIZE             0x0000008000000000ull
+#define PNV_XIVE_VC_BASE(chip)      (0x0006010000000000ull      \
+    + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_XIVE_VC_SIZE)
+
+#define PNV_XIVE_PC_SIZE             0x0000001000000000ull
+#define PNV_XIVE_PC_BASE(chip)      (0x0006018000000000ull      \
+    + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_XIVE_PC_SIZE)
+
+#define PNV_XIVE_IC_SIZE             0x0000000000080000ull
+#define PNV_XIVE_IC_BASE(chip)      (0x0006030203100000ull \
+     + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_XIVE_IC_SIZE)
+
+#define PNV_XIVE_TM_SIZE             0x0000000000040000ull
+#define PNV_XIVE_TM_BASE(chip)       0x0006030203180000ull
+
 Object *pnv_icp_create(PnvMachineState *spapr, Object *cpu, Error **errp);
 
 #endif /* _PPC_PNV_H */
diff --git a/include/hw/ppc/pnv_xive.h b/include/hw/ppc/pnv_xive.h
new file mode 100644
index 000000000000..723345cc57e2
--- /dev/null
+++ b/include/hw/ppc/pnv_xive.h
@@ -0,0 +1,89 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_PNV_XIVE_H
+#define PPC_PNV_XIVE_H
+
+#include "hw/sysbus.h"
+#include "hw/ppc/xive.h"
+
+typedef struct XiveIVE XiveIVE;
+
+#define TYPE_PNV_XIVE "pnv-xive"
+#define PNV_XIVE(obj) OBJECT_CHECK(PnvXive, (obj), TYPE_PNV_XIVE)
+
+typedef struct PnvXive {
+    SysBusDevice parent_obj;
+
+    /* Interrupt controller regs */
+    uint64_t     regs[0x300];
+    MemoryRegion xscom_regs;
+
+    /* For IPIs and accelerator interrupts */
+    XiveSource   source;
+    XiveSource   eq_source;
+
+    /* Interrupt Virtualization Entry table */
+    XiveIVE      *ivt;
+    uint32_t     nr_irqs;
+
+    /* Event Queue Descriptor table */
+    uint64_t     *eqdt;
+    uint32_t     eqdt_count;
+    uint64_t     eqc_watch[4]; /* EQ cache update */
+
+    /* Virtual Processor Descriptor table */
+    uint64_t     *vpdt;
+    uint32_t     vpdt_count;
+    uint64_t     vpc_watch[8];  /* VP cache update */
+
+    /* Virtual Structure Tables: IVT, SBE, EQDT, VPDT, IRQ */
+    uint8_t      vst_tsel;
+    uint8_t      vst_tidx;
+    uint64_t     vsds[5];
+
+    /* Set Translation tables */
+    bool         set_xlate_autoinc;
+    uint64_t     set_xlate_index;
+    uint64_t     set_xlate;
+    uint64_t     set_xlate_edt[64]; /* IPIs & EQs */
+    uint64_t     set_xlate_vdt[16];
+
+    /* Interrupt controller MMIO */
+    MemoryRegion ic_mmio;
+    hwaddr       ic_base;
+
+    /* VC memory regions */
+    hwaddr       vc_base;
+    MemoryRegion vc_mmio;
+    hwaddr       esb_base;
+    MemoryRegion esb_mmio;
+    hwaddr       eq_base;
+    MemoryRegion eq_mmio;
+
+    /* PC memory regions */
+    hwaddr       pc_base;
+    MemoryRegion pc_mmio;
+
+    /* TIMA memory regions */
+    hwaddr       tm_base;
+    MemoryRegion tm_mmio;
+    MemoryRegion tm_mmio_indirect;
+
+    /* CPU for indirect TIMA access */
+    PowerPCCPU   *cpu_ind;
+} PnvXive;
+
+void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon);
+
+typedef struct PnvChip PnvChip;
+
+void pnv_chip_xive_realize(PnvChip *chip, Error **errp);
+
+#endif /* PPC_PNV_XIVE_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 255b26a5aaf6..f4b1649ffffa 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -73,6 +73,9 @@ typedef struct PnvXScomInterfaceClass {
 #define PNV_XSCOM_OCC_BASE        0x0066000
 #define PNV_XSCOM_OCC_SIZE        0x6000
 
+#define PNV_XSCOM_XIVE_BASE       0x5013000
+#define PNV_XSCOM_XIVE_SIZE       0x300
+
 extern void pnv_xscom_realize(PnvChip *chip, Error **errp);
 extern int pnv_dt_xscom(PnvChip *chip, void *fdt, int offset);
 
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index e99cd874ef3c..6c71a02cc39a 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -202,6 +202,7 @@ typedef struct XiveNVT {
 
     /* Shortcuts to rings */
     uint8_t   *ring_os;
+    uint8_t   *ring_hv;
 
     XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
 } XiveNVT;
@@ -224,13 +225,17 @@ typedef struct XiveNVTClass {
 
 extern const MemoryRegionOps xive_tm_user_ops;
 extern const MemoryRegionOps xive_tm_os_ops;
+extern const MemoryRegionOps xive_tm_hv_ops;
 
 void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
 XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority);
 Object *xive_nvt_create(Object *cpu, const char *type, Error **errp);
+void xive_nvt_hv_notify(XiveNVT *nvt);
+void xive_nvt_hv_ipb_update(XiveNVT *nvt, uint8_t priority);
 
 void xive_eq_reset(XiveEQ *eq);
 void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon);
+void xive_eq_push(XiveEQ *eq, uint32_t data);
 
 /*
  * XIVE Fabric
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index bcc44e766db9..cd2ffd9f6152 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -160,6 +160,28 @@ typedef struct XiveEQ {
 #define EQ_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
 } XiveEQ;
 
+/* VP */
+typedef struct XiveVP {
+        uint32_t        w0;
+#define VP_W0_VALID             PPC_BIT32(0)
+        uint32_t        w1;
+        uint32_t        w2;
+        uint32_t        w3;
+        uint32_t        w4;
+        uint32_t        w5;
+        uint32_t        w6;
+        uint32_t        w7;
+        uint32_t        w8;
+#define VP_W8_GRP_VALID         PPC_BIT32(0)
+        uint32_t        w9;
+        uint32_t        wa;
+        uint32_t        wb;
+        uint32_t        wc;
+        uint32_t        wd;
+        uint32_t        we;
+        uint32_t        wf;
+} XiveVP;
+
 #define XIVE_PRIORITY_MAX  7
 
 #endif /* _INTC_XIVE_INTERNAL_H */
-- 
2.13.6
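
The HV ring handling added to xive.c above (xive_nvt_hv_ipb_update(), xive_nvt_hv_notify() and xive_nvt_hv_accept()) follows the same IPB, PIPR and CPPR bookkeeping as the OS ring. What follows is a minimal standalone sketch of that flow, not QEMU code. The hv_ring struct, the ring_* functions and the boolean he flag, which stands in for the NSR HE field, are all hypothetical. priority_to_ipb() and ipb_to_pipr() are re-implemented here on the assumption that the existing helpers, which this excerpt does not show, map priority 0 to the most significant IPB bit. One further simplification: xive_nvt_hv_accept() returns (NSR << 8) | CPPR, whereas the sketch returns only the CPPR.

```c
/*
 * Hypothetical standalone model of the HV ring bookkeeping (not QEMU code).
 * The helper semantics are assumed to match the ones the patch relies on.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7

typedef struct {
    uint8_t ipb;  /* Interrupt Pending Buffer: one bit per priority */
    uint8_t pipr; /* Pending Interrupt Priority Register (0xff: none) */
    uint8_t cppr; /* Current Processor Priority Register */
    bool he;      /* stands in for the TM_QW3_NSR_HE field of the NSR */
} hv_ring;

/* Priority 0 (most favoured) sets IPB bit 0x80, priority 7 sets 0x01. */
static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > XIVE_PRIORITY_MAX ?
        0 : (uint8_t)(1u << (XIVE_PRIORITY_MAX - priority));
}

/* The most favoured pending priority is the IPB's leading-zero count. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio = 0;

    if (!ipb) {
        return 0xff;
    }
    while (!(ipb & 0x80)) {
        ipb <<= 1;
        prio++;
    }
    return prio;
}

/* Raise the exception only if the pending priority beats the CPPR. */
static void ring_notify(hv_ring *r)
{
    if (r->pipr < r->cppr) {
        r->he = true;
    }
}

/* Record a new pending priority, recompute PIPR, then try to notify. */
static void ring_ipb_update(hv_ring *r, uint8_t priority)
{
    r->ipb |= priority_to_ipb(priority);
    r->pipr = ipb_to_pipr(r->ipb);
    ring_notify(r);
}

/* Acknowledge: CPPR takes PIPR, its IPB bit is cleared, HE is dropped. */
static uint8_t ring_accept(hv_ring *r)
{
    if (r->he) {
        r->cppr = r->pipr;
        r->ipb &= (uint8_t)~priority_to_ipb(r->cppr);
        r->pipr = ipb_to_pipr(r->ipb);
        r->he = false;
    }
    return r->cppr;
}

/* Usage: priorities 5 and 2 become pending; the ack returns 2 first. */
static void hv_ring_demo(void)
{
    hv_ring r = { .ipb = 0, .pipr = 0xff, .cppr = 0xff, .he = false };

    ring_ipb_update(&r, 5);
    ring_ipb_update(&r, 2);
    assert(r.pipr == 2 && r.he);
    assert(ring_accept(&r) == 2);
    assert(r.ipb == 0x04 && r.pipr == 5 && !r.he);
}
```

After the acknowledge, priority 5 is still pending, but the sketch does not raise it again. A subsequent CPPR write is what triggers a new notification, which mirrors xive_nvt_hv_set_cppr() calling xive_nvt_hv_notify().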


* [Qemu-devel] [PATCH v3 35/35] ppc/pnv: add a PSI bridge model for POWER9 processor
From: Cédric Le Goater @ 2018-04-19 12:43 UTC (permalink / raw)
  To: qemu-ppc, qemu-devel
  Cc: David Gibson, Benjamin Herrenschmidt, Cédric Le Goater

The PSI bridge on POWER9 is very similar to the POWER8 one. The BAR is
still set through XSCOM, but the controls are now done entirely through
MMIO. More interrupts are defined and the interrupt controller
interface has changed to XIVE. The POWER9 model provides a first
example of the notify() handler of the XiveFabric interface, which
links the PSI XiveSource to its owning device model.
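
On POWER9, the notify() handler forwards an event from the PSI XiveSource to the interrupt controller. pnv_psi_notify() reads the ESB notification port register, checks its valid bit and then stores the LISN, converted to big-endian, at the port address. The sketch below covers only the decoding and byte-ordering step and leaves out the actual store. psi_notification and psi_encode_notify() are hypothetical names, and ESB_NOTIF_VALID mirrors PSIHB9_ESB_NOTIF_VALID from the patch below.

```c
/*
 * Hypothetical sketch of the pnv_psi_notify() decoding step. It only
 * computes what would be written; QEMU itself uses cpu_to_be32() and
 * cpu_physical_memory_write().
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESB_NOTIF_VALID 1ull /* mirrors PSIHB9_ESB_NOTIF_VALID */

typedef struct {
    bool valid;       /* is the notification port enabled? */
    uint64_t addr;    /* physical address of the notification port */
    uint8_t bytes[4]; /* LISN as stored in memory: big-endian */
} psi_notification;

static psi_notification psi_encode_notify(uint64_t notif_port, uint32_t lisn)
{
    psi_notification n;

    n.valid = (notif_port & ESB_NOTIF_VALID) != 0;
    n.addr = notif_port & ~ESB_NOTIF_VALID;
    /* Most significant byte first, whatever the host endianness */
    n.bytes[0] = (uint8_t)(lisn >> 24);
    n.bytes[1] = (uint8_t)(lisn >> 16);
    n.bytes[2] = (uint8_t)(lisn >> 8);
    n.bytes[3] = (uint8_t)lisn;
    return n;
}

/* Usage: a port with the valid bit set; the caller would store n.bytes. */
static void psi_notify_demo(void)
{
    psi_notification n = psi_encode_notify(0x10001ull, 0x12345678);

    assert(n.valid && n.addr == 0x10000ull);
    assert(n.bytes[0] == 0x12 && n.bytes[3] == 0x78);
    assert(!psi_encode_notify(0x10000ull, 0x12345678).valid);
}
```

If the valid bit is clear, the event is dropped, just as pnv_psi_notify() skips the write.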

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/pnv.c               |  41 ++---
 hw/ppc/pnv_psi.c           | 399 ++++++++++++++++++++++++++++++++++++++++++---
 include/hw/ppc/pnv.h       |  14 +-
 include/hw/ppc/pnv_psi.h   |  50 +++++-
 include/hw/ppc/pnv_xscom.h |   2 +
 5 files changed, 458 insertions(+), 48 deletions(-)
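
The PSIHB9_IRQ_LEVEL and PSIHB9_IRQ_STAT bit definitions in pnv_psi.c below use IBM (MSB-0) bit numbering through PPC_BIT(). Bit 0 is therefore the most significant bit of the 64-bit register. The sketch below illustrates the consequences and assumes PPC_BIT() is defined as in QEMU's target/ppc/cpu.h; it is re-defined here so the example stands alone. irq_level_update() and irq_method_is_lsi() are hypothetical helpers. One consequence is that copying PSIHB9_INTERRUPT_CONTROL into a 32-bit variable silently drops PSIHB9_IRQ_METHOD, which is PPC_BIT(0).

```c
/*
 * Hypothetical sketch of MSB-0 bit handling for the PSIHB9 IRQ registers.
 * PPC_BIT() is assumed to match QEMU's definition.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PPC_BIT(bit)          (0x8000000000000000ULL >> (bit))
#define IRQ_METHOD            PPC_BIT(0) /* mirrors PSIHB9_IRQ_METHOD */
#define IRQ_LEVEL_LPCHC       PPC_BIT(3) /* mirrors PSIHB9_IRQ_LEVEL_LPCHC */

/* Assert or deassert source 'irq' (0..63) in a 64-bit level register */
static uint64_t irq_level_update(uint64_t level, int irq, bool state)
{
    return state ? (level | PPC_BIT(irq)) : (level & ~PPC_BIT(irq));
}

/* The method bit is the MSB, so the register must be handled as 64 bits */
static bool irq_method_is_lsi(uint64_t intr_ctl)
{
    return (intr_ctl & IRQ_METHOD) != 0;
}

/* Usage: MSB-0 bit 3 is 0x1000000000000000; a 32-bit copy loses bit 0 */
static void ppc_bit_demo(void)
{
    uint64_t level = irq_level_update(0, 3, true);

    assert(level == IRQ_LEVEL_LPCHC);
    assert(irq_level_update(level, 3, false) == 0);
    assert(irq_method_is_lsi(IRQ_METHOD));
    assert(!irq_method_is_lsi((uint32_t)IRQ_METHOD));
}
```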

diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 4dd84b83e04c..3ad6c4cd906d 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -806,19 +806,8 @@ static void pnv_chip_init(Object *obj)
     object_initialize(&chip->lpc, sizeof(chip->lpc), TYPE_PNV_LPC);
     object_property_add_child(obj, "lpc", OBJECT(&chip->lpc), NULL);
 
-    object_initialize(&chip->psi, sizeof(chip->psi), TYPE_PNV_PSI);
-    object_property_add_child(obj, "psi", OBJECT(&chip->psi), NULL);
-    object_property_add_const_link(OBJECT(&chip->psi), "xics",
-                                   OBJECT(qdev_get_machine()), &error_abort);
-
     object_initialize(&chip->occ, sizeof(chip->occ), TYPE_PNV_OCC);
     object_property_add_child(obj, "occ", OBJECT(&chip->occ), NULL);
-    object_property_add_const_link(OBJECT(&chip->occ), "psi",
-                                   OBJECT(&chip->psi), &error_abort);
-
-    /* The LPC controller needs PSI to generate interrupts */
-    object_property_add_const_link(OBJECT(&chip->lpc), "psi",
-                                   OBJECT(&chip->psi), &error_abort);
 }
 
 static void pnv_chip_icp_realize(PnvChip *chip, Error **errp)
@@ -921,7 +910,16 @@ static void pnv_chip_realize(DeviceState *dev, Error **errp)
         i++;
     }
 
+    /* Processor Service Interface (PSI) Host Bridge */
+    pnv_chip_psi_realize(chip, &error);
+    if (error) {
+        error_propagate(errp, error);
+        return;
+    }
+
     /* Create LPC controller */
+    object_property_add_const_link(OBJECT(&chip->lpc), "psi",
+                                   OBJECT(chip->psi), &error_abort);
     object_property_set_bool(OBJECT(&chip->lpc), true, "realized",
                              &error_fatal);
     pnv_xscom_add_subregion(chip, PNV_XSCOM_LPC_BASE, &chip->lpc.xscom_regs);
@@ -939,17 +937,9 @@ static void pnv_chip_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    /* Processor Service Interface (PSI) Host Bridge */
-    object_property_set_int(OBJECT(&chip->psi), PNV_PSIHB_BASE(chip),
-                            "bar", &error_fatal);
-    object_property_set_bool(OBJECT(&chip->psi), true, "realized", &error);
-    if (error) {
-        error_propagate(errp, error);
-        return;
-    }
-    pnv_xscom_add_subregion(chip, PNV_XSCOM_PSIHB_BASE, &chip->psi.xscom_regs);
-
     /* Create the simplified OCC model */
+    object_property_add_const_link(OBJECT(&chip->occ), "psi",
+                                   OBJECT(chip->psi), &error_abort);
     object_property_set_bool(OBJECT(&chip->occ), true, "realized", &error);
     if (error) {
         error_propagate(errp, error);
@@ -983,8 +973,8 @@ static ICSState *pnv_ics_get(XICSFabric *xi, int irq)
     int i;
 
     for (i = 0; i < pnv->num_chips; i++) {
-        if (ics_valid_irq(&pnv->chips[i]->psi.ics, irq)) {
-            return &pnv->chips[i]->psi.ics;
+        if (ics_valid_irq(&pnv->chips[i]->psi->ics, irq)) {
+            return &pnv->chips[i]->psi->ics;
         }
     }
     return NULL;
@@ -996,7 +986,7 @@ static void pnv_ics_resend(XICSFabric *xi)
     int i;
 
     for (i = 0; i < pnv->num_chips; i++) {
-        ics_resend(&pnv->chips[i]->psi.ics);
+        ics_resend(&pnv->chips[i]->psi->ics);
     }
 }
 
@@ -1040,9 +1030,10 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
 
     for (i = 0; i < pnv->num_chips; i++) {
         if (!pnv_is_power9(pnv)) {
-            ics_pic_print_info(&pnv->chips[i]->psi.ics, mon);
+            ics_pic_print_info(&pnv->chips[i]->psi->ics, mon);
         } else {
             pnv_xive_pic_print_info(pnv->chips[0]->xive, mon);
+            xive_source_pic_print_info(&pnv->chips[0]->psi->source, mon);
         }
     }
 }
diff --git a/hw/ppc/pnv_psi.c b/hw/ppc/pnv_psi.c
index 5b969127c303..d233e64fd940 100644
--- a/hw/ppc/pnv_psi.c
+++ b/hw/ppc/pnv_psi.c
@@ -114,12 +114,14 @@
 #define PSIHB_BAR_MASK                  0x0003fffffff00000ull
 #define PSIHB_FSPBAR_MASK               0x0003ffff00000000ull
 
+#define PSIHB_REG(addr) (((addr) >> 3) + PSIHB_XSCOM_BAR)
+
 static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
 {
     MemoryRegion *sysmem = get_system_memory();
     uint64_t old = psi->regs[PSIHB_XSCOM_BAR];
 
-    psi->regs[PSIHB_XSCOM_BAR] = bar & (PSIHB_BAR_MASK | PSIHB_BAR_EN);
+    psi->regs[PSIHB_XSCOM_BAR] = bar;
 
     /* Update MR, always remove it first */
     if (old & PSIHB_BAR_EN) {
@@ -128,7 +130,7 @@ static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
 
     /* Then add it back if needed */
     if (bar & PSIHB_BAR_EN) {
-        uint64_t addr = bar & PSIHB_BAR_MASK;
+        uint64_t addr = bar & ~PSIHB_BAR_EN;
         memory_region_add_subregion(sysmem, addr, &psi->regs_mr);
     }
 }
@@ -205,7 +207,12 @@ static const uint64_t stat_bits[] = {
     [PSIHB_IRQ_EXTERNAL]  = PSIHB_IRQ_STAT_EXT,
 };
 
-void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state)
+void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state)
+{
+    PNV_PSI_GET_CLASS(psi)->irq_set(psi, irq, state);
+}
+
+static void pnv_psi_power8_irq_set(PnvPsi *psi, int irq, bool state)
 {
     ICSState *ics = &psi->ics;
     uint32_t xivr_reg;
@@ -324,7 +331,7 @@ static uint64_t pnv_psi_reg_read(PnvPsi *psi, uint32_t offset, bool mmio)
         val = psi->regs[offset];
         break;
     default:
-        qemu_log_mask(LOG_UNIMP, "PSI: read at Ox%" PRIx32 "\n", offset);
+        qemu_log_mask(LOG_UNIMP, "PSI: read at 0x%" PRIx32 "\n", offset);
     }
     return val;
 }
@@ -383,7 +390,7 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
         pnv_psi_set_irsn(psi, val);
         break;
     default:
-        qemu_log_mask(LOG_UNIMP, "PSI: write at Ox%" PRIx32 "\n", offset);
+        qemu_log_mask(LOG_UNIMP, "PSI: write at 0x%" PRIx32 "\n", offset);
     }
 }
 
@@ -393,13 +400,13 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
  */
 static uint64_t pnv_psi_mmio_read(void *opaque, hwaddr addr, unsigned size)
 {
-    return pnv_psi_reg_read(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, true);
+    return pnv_psi_reg_read(opaque, PSIHB_REG(addr), true);
 }
 
 static void pnv_psi_mmio_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
 {
-    pnv_psi_reg_write(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, val, true);
+    pnv_psi_reg_write(opaque, PSIHB_REG(addr), val, true);
 }
 
 static const MemoryRegionOps psi_mmio_ops = {
@@ -458,7 +465,7 @@ static const uint8_t irq_to_xivr[] = {
     PSIHB_XSCOM_XIVR_EXT,
 };
 
-static void pnv_psi_realize(DeviceState *dev, Error **errp)
+static void pnv_psi_power8_realize(DeviceState *dev, Error **errp)
 {
     PnvPsi *psi = PNV_PSI(dev);
     ICSState *ics = &psi->ics;
@@ -510,28 +517,34 @@ static void pnv_psi_realize(DeviceState *dev, Error **errp)
     }
 }
 
+static const char compat_p8[] = "ibm,power8-psihb-x\0ibm,psihb-x";
+static const char compat_p9[] = "ibm,power9-psihb-x\0ibm,psihb-x";
+
 static int pnv_psi_dt_xscom(PnvXScomInterface *dev, void *fdt, int xscom_offset)
 {
-    const char compat[] = "ibm,power8-psihb-x\0ibm,psihb-x";
+    PnvPsiClass *ppc = PNV_PSI_GET_CLASS(dev);
     char *name;
     int offset;
-    uint32_t lpc_pcba = PNV_XSCOM_PSIHB_BASE;
     uint32_t reg[] = {
-        cpu_to_be32(lpc_pcba),
-        cpu_to_be32(PNV_XSCOM_PSIHB_SIZE)
+        cpu_to_be32(ppc->xscom_pcba),
+        cpu_to_be32(ppc->xscom_size)
     };
 
-    name = g_strdup_printf("psihb@%x", lpc_pcba);
+    name = g_strdup_printf("psihb@%x", ppc->xscom_pcba);
     offset = fdt_add_subnode(fdt, xscom_offset, name);
     _FDT(offset);
     g_free(name);
 
-    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
-
-    _FDT((fdt_setprop_cell(fdt, offset, "#address-cells", 2)));
-    _FDT((fdt_setprop_cell(fdt, offset, "#size-cells", 1)));
-    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
-                      sizeof(compat))));
+    _FDT(fdt_setprop(fdt, offset, "reg", reg, sizeof(reg)));
+    _FDT(fdt_setprop_cell(fdt, offset, "#address-cells", 2));
+    _FDT(fdt_setprop_cell(fdt, offset, "#size-cells", 1));
+    if (ppc->chip_type == PNV_CHIP_POWER9) {
+        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p9,
+                         sizeof(compat_p9)));
+    } else {
+        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p8,
+                         sizeof(compat_p8)));
+    }
     return 0;
 }
 
@@ -541,6 +554,309 @@ static Property pnv_psi_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void pnv_psi_power8_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
+
+    dc->realize = pnv_psi_power8_realize;
+
+    ppc->chip_type  = PNV_CHIP_POWER8;
+    ppc->xscom_pcba = PNV_XSCOM_PSIHB_BASE;
+    ppc->xscom_size = PNV_XSCOM_PSIHB_SIZE;
+    ppc->irq_set    = pnv_psi_power8_irq_set;
+}
+
+static const TypeInfo pnv_psi_power8_info = {
+    .name          = TYPE_PNV_PSI_POWER8,
+    .parent        = TYPE_PNV_PSI,
+    .class_init    = pnv_psi_power8_class_init,
+};
+
+/* Common registers */
+
+#define PSIHB9_CR                       0x20
+#define PSIHB9_SEMR                     0x28
+
+/* P9 registers */
+
+#define PSIHB9_INTERRUPT_CONTROL        0x58
+#define   PSIHB9_IRQ_METHOD             PPC_BIT(0)
+#define   PSIHB9_IRQ_RESET              PPC_BIT(1)
+#define PSIHB9_ESB_CI_BASE              0x60
+#define   PSIHB9_ESB_CI_VALID           1
+#define PSIHB9_ESB_NOTIF_ADDR           0x68
+#define   PSIHB9_ESB_NOTIF_VALID        1
+#define PSIHB9_IVT_OFFSET               0x70
+#define   PSIHB9_IVT_OFF_SHIFT          32
+
+#define PSIHB9_IRQ_LEVEL                0x78 /* assertion */
+#define   PSIHB9_IRQ_LEVEL_PSI          PPC_BIT(0)
+#define   PSIHB9_IRQ_LEVEL_OCC          PPC_BIT(1)
+#define   PSIHB9_IRQ_LEVEL_FSI          PPC_BIT(2)
+#define   PSIHB9_IRQ_LEVEL_LPCHC        PPC_BIT(3)
+#define   PSIHB9_IRQ_LEVEL_LOCAL_ERR    PPC_BIT(4)
+#define   PSIHB9_IRQ_LEVEL_GLOBAL_ERR   PPC_BIT(5)
+#define   PSIHB9_IRQ_LEVEL_TPM          PPC_BIT(6)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ1    PPC_BIT(7)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ2    PPC_BIT(8)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ3    PPC_BIT(9)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ4    PPC_BIT(10)
+#define   PSIHB9_IRQ_LEVEL_SBE_I2C      PPC_BIT(11)
+#define   PSIHB9_IRQ_LEVEL_DIO          PPC_BIT(12)
+#define   PSIHB9_IRQ_LEVEL_PSU          PPC_BIT(13)
+#define   PSIHB9_IRQ_LEVEL_I2C_C        PPC_BIT(14)
+#define   PSIHB9_IRQ_LEVEL_I2C_D        PPC_BIT(15)
+#define   PSIHB9_IRQ_LEVEL_I2C_E        PPC_BIT(16)
+#define   PSIHB9_IRQ_LEVEL_SBE          PPC_BIT(19)
+
+#define PSIHB9_IRQ_STAT                 0x80 /* P bit */
+#define   PSIHB9_IRQ_STAT_PSI           PPC_BIT(0)
+#define   PSIHB9_IRQ_STAT_OCC           PPC_BIT(1)
+#define   PSIHB9_IRQ_STAT_FSI           PPC_BIT(2)
+#define   PSIHB9_IRQ_STAT_LPCHC         PPC_BIT(3)
+#define   PSIHB9_IRQ_STAT_LOCAL_ERR     PPC_BIT(4)
+#define   PSIHB9_IRQ_STAT_GLOBAL_ERR    PPC_BIT(5)
+#define   PSIHB9_IRQ_STAT_TPM           PPC_BIT(6)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ1     PPC_BIT(7)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ2     PPC_BIT(8)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ3     PPC_BIT(9)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ4     PPC_BIT(10)
+#define   PSIHB9_IRQ_STAT_SBE_I2C       PPC_BIT(11)
+#define   PSIHB9_IRQ_STAT_DIO           PPC_BIT(12)
+#define   PSIHB9_IRQ_STAT_PSU           PPC_BIT(13)
+
+static void pnv_psi_notify(XiveFabric *xf, uint32_t lisn)
+{
+    PnvPsi *psi = PNV_PSI(xf);
+    uint64_t notif_port =
+        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
+    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
+    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
+    uint32_t data = cpu_to_be32(lisn);
+
+    if (valid) {
+        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
+    }
+}
+
+/*
+ * TODO: move to the parent class
+ */
+static void pnv_psi_power9_reset(DeviceState *dev)
+{
+    PnvPsi *psi = PNV_PSI(dev);
+
+    memset(psi->regs, 0x0, sizeof(psi->regs));
+
+    psi->regs[PSIHB_XSCOM_BAR] = psi->bar | PSIHB_BAR_EN;
+}
+
+static uint64_t pnv_psi_p9_mmio_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PnvPsi *psi = PNV_PSI(opaque);
+    uint32_t reg = PSIHB_REG(addr);
+    uint64_t val = -1;
+
+    switch (addr) {
+    case PSIHB9_CR:
+    case PSIHB9_SEMR:
+        /* FSP stuff */
+    case PSIHB9_INTERRUPT_CONTROL:
+    case PSIHB9_ESB_CI_BASE:
+    case PSIHB9_ESB_NOTIF_ADDR:
+    case PSIHB9_IVT_OFFSET:
+        val = psi->regs[reg];
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: read at 0x%" PRIx64 "\n", addr);
+    }
+
+    return val;
+}
+
+static void pnv_psi_p9_mmio_write(void *opaque, hwaddr addr,
+                                  uint64_t val, unsigned size)
+{
+    PnvPsi *psi = PNV_PSI(opaque);
+    uint32_t reg = PSIHB_REG(addr);
+
+    switch (addr) {
+    case PSIHB9_CR:
+    case PSIHB9_SEMR:
+        /* FSP stuff */
+        break;
+    case PSIHB9_INTERRUPT_CONTROL:
+        if (val & PSIHB9_IRQ_RESET) {
+            device_reset(DEVICE(&psi->source));
+        }
+        psi->regs[reg] = val;
+        break;
+
+    case PSIHB9_ESB_CI_BASE:
+        /* TODO: map the ESB region now? */
+        psi->regs[reg] = val;
+        break;
+
+    case PSIHB9_ESB_NOTIF_ADDR:
+        psi->regs[reg] = val;
+        break;
+    case PSIHB9_IVT_OFFSET:
+        psi->source.offset = (val >> PSIHB9_IVT_OFF_SHIFT);
+        psi->regs[reg] = val;
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: write at 0x%" PRIx64 "\n", addr);
+    }
+}
+
+static const MemoryRegionOps pnv_psi_p9_mmio_ops = {
+    .read = pnv_psi_p9_mmio_read,
+    .write = pnv_psi_p9_mmio_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+static uint64_t pnv_psi_p9_xscom_read(void *opaque, hwaddr addr, unsigned size)
+{
+    /* No reads are expected */
+    qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom read at 0x%" PRIx64 "\n", addr);
+    return -1;
+}
+
+static void pnv_psi_p9_xscom_write(void *opaque, hwaddr addr,
+                                   uint64_t val, unsigned size)
+{
+    PnvPsi *psi = PNV_PSI(opaque);
+
+    /* XSCOM is only used to set the PSIHB MMIO region */
+    switch (addr >> 3) {
+    case PSIHB_XSCOM_BAR:
+        pnv_psi_set_bar(psi, val);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom write at 0x%" PRIx64 "\n",
+                      addr);
+    }
+}
+
+static const MemoryRegionOps pnv_psi_p9_xscom_ops = {
+    .read = pnv_psi_p9_xscom_read,
+    .write = pnv_psi_p9_xscom_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    }
+};
+
+static void pnv_psi_power9_irq_set(PnvPsi *psi, int irq, bool state)
+{
+    uint64_t irq_method = psi->regs[PSIHB_REG(PSIHB9_INTERRUPT_CONTROL)];
+
+    if (irq < 0 || irq >= PSIHB9_NUM_IRQS) {
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: Unsupported irq %d\n", irq);
+        return;
+    }
+
+    if (irq_method & PSIHB9_IRQ_METHOD) {
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: LSI IRQ method not supported\n");
+        return;
+    }
+
+    if (state) {
+        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] |= PPC_BIT(irq);
+    } else {
+        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] &= ~PPC_BIT(irq);
+    }
+
+    qemu_set_irq(xive_source_qirq(&psi->source, irq), state);
+}
+
+static void pnv_psi_power9_init(Object *obj)
+{
+    PnvPsi *psi = PNV_PSI(obj);
+
+    object_initialize(&psi->source, sizeof(psi->source), TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "source", OBJECT(&psi->source), NULL);
+}
+
+static void pnv_psi_power9_realize(DeviceState *dev, Error **errp)
+{
+    PnvPsi *psi = PNV_PSI(dev);
+    XiveSource *xsrc = &psi->source;
+    Error *local_err = NULL;
+    int i;
+
+    object_property_set_int(OBJECT(xsrc), psi->bar, "bar", &error_fatal);
+    object_property_set_int(OBJECT(xsrc), XIVE_ESB_4K, "shift",
+                            &error_fatal);
+    object_property_set_int(OBJECT(xsrc), PSIHB9_NUM_IRQS, "nr-irqs",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(psi),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        xive_source_irq_set(xsrc, i, true);
+    }
+
+    /* XSCOM region for PSI registers */
+    pnv_xscom_region_init(&psi->xscom_regs, OBJECT(dev), &pnv_psi_p9_xscom_ops,
+                          psi, "xscom-psi", PNV_XSCOM_P9_PSIHB_SIZE);
+
+    /* Initialize MMIO region */
+    memory_region_init_io(&psi->regs_mr, OBJECT(dev), &pnv_psi_p9_mmio_ops, psi,
+                          "psihb", PNV_PSIHB9_SIZE);
+
+    /* Default BAR for MMIO region */
+    pnv_psi_set_bar(psi, psi->bar | PSIHB_BAR_EN);
+}
+
+static void pnv_psi_power9_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
+    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
+
+    dc->realize = pnv_psi_power9_realize;
+    dc->reset   = pnv_psi_power9_reset;
+
+    ppc->chip_type  = PNV_CHIP_POWER9;
+    ppc->xscom_pcba = PNV_XSCOM_P9_PSIHB_BASE;
+    ppc->xscom_size = PNV_XSCOM_P9_PSIHB_SIZE;
+    ppc->irq_set    = pnv_psi_power9_irq_set;
+
+    xfc->notify      = pnv_psi_notify;
+}
+
+static const TypeInfo pnv_psi_power9_info = {
+    .name          = TYPE_PNV_PSI_POWER9,
+    .parent        = TYPE_PNV_PSI,
+    .instance_init = pnv_psi_power9_init,
+    .class_init    = pnv_psi_power9_class_init,
+    .interfaces = (InterfaceInfo[]) {
+            { TYPE_XIVE_FABRIC },
+            { },
+    },
+};
+
 static void pnv_psi_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -548,7 +864,7 @@ static void pnv_psi_class_init(ObjectClass *klass, void *data)
 
     xdc->dt_xscom = pnv_psi_dt_xscom;
 
-    dc->realize = pnv_psi_realize;
+    dc->desc = "PowerNV PSI Controller";
     dc->props = pnv_psi_properties;
 }
 
@@ -558,6 +874,8 @@ static const TypeInfo pnv_psi_info = {
     .instance_size = sizeof(PnvPsi),
     .instance_init = pnv_psi_init,
     .class_init    = pnv_psi_class_init,
+    .class_size    = sizeof(PnvPsiClass),
+    .abstract      = true,
     .interfaces    = (InterfaceInfo[]) {
         { TYPE_PNV_XSCOM_INTERFACE },
         { }
@@ -567,6 +885,47 @@ static const TypeInfo pnv_psi_info = {
 static void pnv_psi_register_types(void)
 {
     type_register_static(&pnv_psi_info);
+    type_register_static(&pnv_psi_power8_info);
+    type_register_static(&pnv_psi_power9_info);
 }
 
 type_init(pnv_psi_register_types)
+
+void pnv_chip_psi_realize(PnvChip *chip, Error **errp)
+{
+    Object *obj;
+    Error *local_err = NULL;
+
+    /* Processor Service Interface (PSI) Host Bridge */
+    if (pnv_chip_is_power9(chip)) {
+        obj = object_new(TYPE_PNV_PSI_POWER9);
+        object_property_set_int(obj, PNV_PSIHB9_BASE(chip), "bar",
+                                &error_fatal);
+    } else {
+        obj = object_new(TYPE_PNV_PSI_POWER8);
+        object_property_set_int(obj, PNV_PSIHB_BASE(chip), "bar", &error_fatal);
+        object_property_add_const_link(obj, "xics", qdev_get_machine(),
+                                       &error_abort);
+    }
+    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+
+    object_property_add_child(OBJECT(chip), "psi", obj, &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    chip->psi = PNV_PSI(obj);
+
+    if (pnv_chip_is_power9(chip)) {
+        pnv_xscom_add_subregion(chip, PNV_XSCOM_P9_PSIHB_BASE,
+                                &chip->psi->xscom_regs);
+        memory_region_add_subregion(get_system_memory(),
+                                    PNV_PSIHB9_ESB_BASE(chip),
+                                    &chip->psi->source.esb_mmio);
+    } else {
+        pnv_xscom_add_subregion(chip, PNV_XSCOM_PSIHB_BASE,
+                                &chip->psi->xscom_regs);
+    }
+}
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index f66fe53c38bb..c53ad244ba7d 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -61,7 +61,7 @@ typedef struct PnvChip {
     MemoryRegion icp_mmio;
 
     PnvLpcController lpc;
-    PnvPsi       psi;
+    PnvPsi       *psi;
     PnvOCC       occ;
 
     PnvXive      *xive;
@@ -205,6 +205,10 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
 #define PNV_XIVE_PC_BASE(chip)      (0x0006018000000000ull      \
     + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_XIVE_PC_SIZE)
 
+#define PNV_PSIHB9_SIZE              0x0000000000100000ull
+#define PNV_PSIHB9_BASE(chip)       (0x0006030203000000ull       \
+    + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_PSIHB_SIZE)
+
 #define PNV_XIVE_IC_SIZE             0x0000000000080000ull
 #define PNV_XIVE_IC_BASE(chip)      (0x0006030203100000ull \
      + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_XIVE_IC_SIZE)
@@ -212,6 +216,14 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
 #define PNV_XIVE_TM_SIZE             0x0000000000040000ull
 #define PNV_XIVE_TM_BASE(chip)       0x0006030203180000ull
 
+#define PNV_PSIHB9_ESB_SIZE          0x0000000000010000ull
+#define PNV_PSIHB9_ESB_BASE(chip)   (0x00060302031c0000ull      \
+     + (uint64_t)PNV_CHIP_INDEX(chip) *  PNV_PSIHB9_ESB_SIZE)
+
 Object *pnv_icp_create(PnvMachineState *spapr, Object *cpu, Error **errp);
 
+/*
+ * POWER9 MMIO base addresses
+ */
+
 #endif /* _PPC_PNV_H */
diff --git a/include/hw/ppc/pnv_psi.h b/include/hw/ppc/pnv_psi.h
index f6af5eae1fa8..82d60bc6578f 100644
--- a/include/hw/ppc/pnv_psi.h
+++ b/include/hw/ppc/pnv_psi.h
@@ -21,10 +21,35 @@
 
 #include "hw/sysbus.h"
 #include "hw/ppc/xics.h"
+#include "hw/ppc/xive.h"
 
 #define TYPE_PNV_PSI "pnv-psi"
 #define PNV_PSI(obj) \
      OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI)
+#define PNV_PSI_CLASS(klass) \
+     OBJECT_CLASS_CHECK(PnvPsiClass, (klass), TYPE_PNV_PSI)
+#define PNV_PSI_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(PnvPsiClass, (obj), TYPE_PNV_PSI)
+
+typedef struct PnvPsi PnvPsi;
+typedef struct PnvChip PnvChip;
+typedef struct PnvPsiClass {
+    SysBusDeviceClass parent_class;
+
+    int chip_type;
+    uint32_t xscom_pcba;
+    uint32_t xscom_size;
+
+    void (*irq_set)(PnvPsi *psi, int, bool state);
+} PnvPsiClass;
+
+#define TYPE_PNV_PSI_POWER8 TYPE_PNV_PSI "-POWER8"
+#define PNV_PSI_POWER8(obj) \
+    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER8)
+
+#define TYPE_PNV_PSI_POWER9 TYPE_PNV_PSI "-POWER9"
+#define PNV_PSI_POWER9(obj) \
+    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER9)
 
 #define PSIHB_XSCOM_MAX         0x20
 
@@ -38,9 +63,12 @@ typedef struct PnvPsi {
     /* MemoryRegion fsp_mr; */
     uint64_t fsp_bar;
 
-    /* Interrupt generation */
+    /* P8 Interrupt generation */
     ICSState ics;
 
+    /* P9 Interrupt generation */
+    XiveSource source;
+
     /* Registers */
     uint64_t regs[PSIHB_XSCOM_MAX];
 
@@ -60,6 +88,24 @@ typedef enum PnvPsiIrq {
 
 #define PSI_NUM_INTERRUPTS 6
 
-extern void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state);
+/* P9 PSI Interrupts */
+#define PSIHB9_IRQ_PSI          0
+#define PSIHB9_IRQ_OCC          1
+#define PSIHB9_IRQ_FSI          2
+#define PSIHB9_IRQ_LPCHC        3
+#define PSIHB9_IRQ_LOCAL_ERR    4
+#define PSIHB9_IRQ_GLOBAL_ERR   5
+#define PSIHB9_IRQ_TPM          6
+#define PSIHB9_IRQ_LPC_SIRQ0    7
+#define PSIHB9_IRQ_LPC_SIRQ1    8
+#define PSIHB9_IRQ_LPC_SIRQ2    9
+#define PSIHB9_IRQ_LPC_SIRQ3    10
+#define PSIHB9_IRQ_SBE_I2C      11
+#define PSIHB9_IRQ_DIO          12
+#define PSIHB9_IRQ_PSU          13
+#define PSIHB9_NUM_IRQS         14
+
+void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state);
+void pnv_chip_psi_realize(PnvChip *chip, Error **errp);
 
 #endif /* _PPC_PNV_PSI_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index f4b1649ffffa..60a963e2de29 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -69,6 +69,8 @@ typedef struct PnvXScomInterfaceClass {
 
 #define PNV_XSCOM_PSIHB_BASE      0x2010900
 #define PNV_XSCOM_PSIHB_SIZE      0x20
+#define PNV_XSCOM_P9_PSIHB_BASE   0x5012900
+#define PNV_XSCOM_P9_PSIHB_SIZE   0x100
 
 #define PNV_XSCOM_OCC_BASE        0x0066000
 #define PNV_XSCOM_OCC_SIZE        0x6000
-- 
2.13.6
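
Editorial note: the sketch below, which is not part of the patch, checks the per-chip ESB window arithmetic behind `PNV_PSIHB9_ESB_BASE()`. It uses a plain integer chip index in place of `PNV_CHIP_INDEX(chip)`, and the helper names are hypothetical. It also confirms that the 14 PSI sources (`PSIHB9_NUM_IRQS`) fit inside the 64K `PNV_PSIHB9_ESB_SIZE` window. Each source uses the 4K ESB pages (`XIVE_ESB_4K`, shift 12) that `pnv_psi_power9_realize()` selects.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch (pnv.h, pnv_psi.h, xive.h) */
#define PSIHB9_ESB_BASE0   0x00060302031c0000ull
#define PSIHB9_ESB_SIZE    0x0000000000010000ull
#define PSIHB9_NUM_IRQS    14
#define ESB_SHIFT_4K       12

/* Hypothetical helper: ESB window base for a given chip index */
static uint64_t psihb9_esb_base(uint32_t chip_index)
{
    return PSIHB9_ESB_BASE0 + (uint64_t)chip_index * PSIHB9_ESB_SIZE;
}

/* Size of the ESB MMIO region: one (1 << shift) span per source,
 * as computed in xive_source_realize() */
static uint64_t esb_mmio_size(uint32_t nr_irqs, uint32_t shift)
{
    assert(shift < 64);
    return (1ull << shift) * nr_irqs;
}

/* True if the PSI sources fit inside the per-chip ESB window */
static bool psihb9_esb_fits(void)
{
    return esb_mmio_size(PSIHB9_NUM_IRQS, ESB_SHIFT_4K) <= PSIHB9_ESB_SIZE;
}
```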


* Re: [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9)
  2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (34 preceding siblings ...)
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 35/35] ppc/pnv: add a PSI bridge model for POWER9 processor Cédric Le Goater
@ 2018-04-19 13:28 ` no-reply
  35 siblings, 0 replies; 100+ messages in thread
From: no-reply @ 2018-04-19 13:28 UTC (permalink / raw)
  To: clg; +Cc: famz, qemu-ppc, qemu-devel, david

Hi,

This series seems to have some coding style problems. See output below for
more information:

Type: series
Message-id: 20180419124331.3915-1-clg@kaod.org
Subject: [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9)

=== TEST SCRIPT BEGIN ===
#!/bin/bash

BASE=base
n=1
total=$(git log --oneline $BASE.. | wc -l)
failed=0

git config --local diff.renamelimit 0
git config --local diff.renames True
git config --local diff.algorithm histogram

commits="$(git log --format=%H --reverse $BASE..)"
for c in $commits; do
    echo "Checking PATCH $n/$total: $(git log -n 1 --format=%s $c)..."
    if ! git show $c --format=email | ./scripts/checkpatch.pl --mailback -; then
        failed=1
        echo
    fi
    n=$((n+1))
done

exit $failed
=== TEST SCRIPT END ===

Updating 3c8cf5a9c21ff8782164d1def7f44bd888713384
From https://github.com/patchew-project/qemu
 * [new tag]               patchew/20180419124331.3915-1-clg@kaod.org -> patchew/20180419124331.3915-1-clg@kaod.org
Switched to a new branch 'test'
c4469e7316 ppc/pnv: add a PSI bridge model for POWER9 processor
33f2dbc799 ppc/pnv: add XIVE support
403005364d ppc: externalize ppc_get_vcpu_by_pir()
aaccb0b820 ppc/pnv: introduce a pnv_icp_create() helper
238430eaf9 spapr/xive: raise migration priority of the machine
58147d3114 spapr/xive, xics: reset KVM at machine reset
c10eec0b36 spapr/xive, xics: use the CPU_INTC handlers to reset KVM
ac85306946 intc: introduce a CPUIntc interface
5db0587d7a migration: discard non-migratable RAMBlocks
58b1c5a84d spapr/xive: add a XIVE KVM device to the machine
d04e47a5ce spapr/xive: add KVM support
0dfc28a269 spapr/xive: add common realize routine for KVM
2ecb56c69d target/ppc/kvm: add Linux KVM definitions for XIVE
87b9d2e8b2 spapr: add classes for the XIVE models
e83377df2b spapr: advertise XIVE exploitation mode in CAS
48987801b2 spapr: add support to dump XIVE information
829e2cbf09 spapr: toggle the ICP depending on the selected interrupt mode
3e35956bbf spapr: introduce a spapr_icp_create() helper
519c790144 spapr: add XIVE support to spapr_qirq()
f4f30c31c4 spapr: introduce a helper to map the XIVE memory regions
70574a057b sysbus: add a sysbus_mmio_unmap() helper
54465abc2d spapr: add device tree support for the XIVE exploitation mode
879817b623 spapr: add hcalls support for the XIVE exploitation interrupt mode
4afdd938d5 spapr: add a sPAPRXive object to the machine
f67a9f05b9 spapr: introduce a 'xive_exploitation' option to enable XIVE
11464bf255 spapr: add support for the SET_OS_PENDING command (XIVE)
a619ca5b74 spapr: notify the CPU when the XIVE interrupt priority is more privileged
5fc2664cfe spapr: push the XIVE EQ data in OS event queue
f62e7dbede spapr/xive: introduce the XIVE Event Queues
329205b1dc spapr/xive: introduce a XIVE interrupt presenter model
876c681a62 spapr/xive: add a single source block to the sPAPR XIVE model
5feac5da74 spapr/xive: introduce a XIVE interrupt controller for sPAPR
4d6b2f074b ppc/xive: introduce the XiveFabric interface
59d738daa4 ppc/xive: add support for the LSI interrupt sources
084d850ee3 ppc/xive: introduce a XIVE interrupt source model

=== OUTPUT BEGIN ===
Checking PATCH 1/35: ppc/xive: introduce a XIVE interrupt source model...
Checking PATCH 2/35: ppc/xive: add support for the LSI interrupt sources...
Checking PATCH 3/35: ppc/xive: introduce the XiveFabric interface...
Checking PATCH 4/35: spapr/xive: introduce a XIVE interrupt controller for sPAPR...
Checking PATCH 5/35: spapr/xive: add a single source block to the sPAPR XIVE model...
Checking PATCH 6/35: spapr/xive: introduce a XIVE interrupt presenter model...
Checking PATCH 7/35: spapr/xive: introduce the XIVE Event Queues...
Checking PATCH 8/35: spapr: push the XIVE EQ data in OS event queue...
Checking PATCH 9/35: spapr: notify the CPU when the XIVE interrupt priority is more privileged...
Checking PATCH 10/35: spapr: add support for the SET_OS_PENDING command (XIVE)...
Checking PATCH 11/35: spapr: introduce a 'xive_exploitation' option to enable XIVE...
Checking PATCH 12/35: spapr: add a sPAPRXive object to the machine...
Checking PATCH 13/35: spapr: add hcalls support for the XIVE exploitation interrupt mode...
Checking PATCH 14/35: spapr: add device tree support for the XIVE exploitation mode...
Checking PATCH 15/35: sysbus: add a sysbus_mmio_unmap() helper...
Checking PATCH 16/35: spapr: introduce a helper to map the XIVE memory regions...
Checking PATCH 17/35: spapr: add XIVE support to spapr_qirq()...
Checking PATCH 18/35: spapr: introduce a spapr_icp_create() helper...
Checking PATCH 19/35: spapr: toggle the ICP depending on the selected interrupt mode...
Checking PATCH 20/35: spapr: add support to dump XIVE information...
Checking PATCH 21/35: spapr: advertise XIVE exploitation mode in CAS...
Checking PATCH 22/35: spapr: add classes for the XIVE models...
Checking PATCH 23/35: target/ppc/kvm: add Linux KVM definitions for XIVE...
Checking PATCH 24/35: spapr/xive: add common realize routine for KVM...
Checking PATCH 25/35: spapr/xive: add KVM support...
Checking PATCH 26/35: spapr/xive: add a XIVE KVM device to the machine...
Checking PATCH 27/35: migration: discard non-migratable RAMBlocks...
ERROR: Macros with multiple statements should be enclosed in a do - while loop
#96: FILE: migration/ram.c:191:
+#define RAMBLOCK_FOREACH_MIGRATABLE(block)             \
+    RAMBLOCK_FOREACH(block)                            \
+        if (!qemu_ram_is_migratable(block)) {} else

ERROR: trailing statements should be on next line
#98: FILE: migration/ram.c:193:
+        if (!qemu_ram_is_migratable(block)) {} else

total: 2 errors, 0 warnings, 167 lines checked

Your patch has style problems, please review.  If any of these errors
are false positives report them to the maintainer, see
CHECKPATCH in MAINTAINERS.

Checking PATCH 28/35: intc: introduce a CPUIntc interface...
Checking PATCH 29/35: spapr/xive, xics: use the CPU_INTC handlers to reset KVM...
Checking PATCH 30/35: spapr/xive, xics: reset KVM at machine reset...
Checking PATCH 31/35: spapr/xive: raise migration priority of the machine...
Checking PATCH 32/35: ppc/pnv: introduce a pnv_icp_create() helper...
Checking PATCH 33/35: ppc: externalize ppc_get_vcpu_by_pir()...
Checking PATCH 34/35: ppc/pnv: add XIVE support...
Checking PATCH 35/35: ppc/pnv: add a PSI bridge model for POWER9 processor...
=== OUTPUT END ===

Test command exited with code: 1
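
Editorial note: both checkpatch errors point at the same construct in patch 27. A `do { } while (0)` wrapper cannot work for a loop-header macro, because the caller supplies the loop body. That is probably why the filter is written as `if (!cond) {} else`. The sketch below reproduces the idiom on a plain array. `ARRAY_FOREACH`, `FOREACH_MIGRATABLE` and `is_migratable()` are hypothetical stand-ins for `RAMBLOCK_FOREACH` and `qemu_ram_is_migratable()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for RAMBLOCK_FOREACH(block) */
#define ARRAY_FOREACH(i, n) for ((i) = 0; (i) < (n); (i)++)

/* Hypothetical stand-in for qemu_ram_is_migratable(block) */
static bool is_migratable(const int *flags, int i)
{
    return flags[i] != 0;
}

/*
 * Filtering iterator: the caller's statement becomes the body of the
 * 'else'. Because the inner 'if' already owns an 'else', an 'else'
 * written by the caller after the loop cannot bind to it.
 */
#define FOREACH_MIGRATABLE(i, n, flags)    \
    ARRAY_FOREACH(i, n)                    \
        if (!is_migratable(flags, i)) {} else

static int count_migratable(const int *flags, int n)
{
    int i, count = 0;

    assert(n >= 0);
    FOREACH_MIGRATABLE(i, n, flags) {
        count++;
    }
    return count;
}

/* The dangling-else case the idiom protects against */
static int count_if(bool enabled, const int *flags, int n)
{
    int i, count = 0;

    if (enabled)
        FOREACH_MIGRATABLE(i, n, flags)
            count++;
    else
        count = -1;
    return count;
}
```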


---
Email generated automatically by Patchew [http://patchew.org/].
Please send your feedback to patchew-devel@redhat.com


* Re: [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
@ 2018-04-20  7:10   ` David Gibson
  2018-04-20  8:27     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-20  7:10 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Thu, Apr 19, 2018 at 02:42:57PM +0200, Cédric Le Goater wrote:
> Each XIVE interrupt source is associated with a two-bit state machine
> called an Event State Buffer (ESB): the first bit, "P", means that an
> interrupt is "pending" and waiting for an EOI, and the second bit, "Q"
> (queued), means that a new interrupt was triggered while another was
> still pending.
> 
> When an event is triggered, the associated interrupt state bits are
> fetched, modified and forwarded to the virtualization engine of the
> controller doing the routing. These bits can also be controlled by
> MMIO, for instance to trigger events or to turn off the sources. See
> the code for more details on the states and transitions.
> 
> On a sPAPR machine, the OS obtains the address of the MMIO page of
> the ESB entry associated with a source, and the source characteristics,
> using the H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call
> is used.
> 
> The xive_source_notify() routine is in charge of forwarding the source
> event notification to the routing engine. It will be filled in later on.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  Changes since v2:
> 
>  - added support for Store EOI
>  - added support for the two-page MMIO setting, like on KVM

Looks generally sane to me, though I have a few queries.

> 
>  default-configs/ppc64-softmmu.mak |   1 +
>  hw/intc/Makefile.objs             |   1 +
>  hw/intc/xive.c                    | 335 ++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/xive.h             | 130 +++++++++++++++
>  4 files changed, 467 insertions(+)
>  create mode 100644 hw/intc/xive.c
>  create mode 100644 include/hw/ppc/xive.h
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index b94af6c7c62a..c6d13e757977 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -16,4 +16,5 @@ CONFIG_VIRTIO_VGA=y
>  CONFIG_XICS=$(CONFIG_PSERIES)
>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> +CONFIG_XIVE=$(CONFIG_PSERIES)
>  CONFIG_MEM_HOTPLUG=y
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index 0e9963f5eecc..72a46ed91c31 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>  obj-$(CONFIG_XICS) += xics.o
>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> +obj-$(CONFIG_XIVE) += xive.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> new file mode 100644
> index 000000000000..c70578759d02
> --- /dev/null
> +++ b/hw/intc/xive.c
> @@ -0,0 +1,335 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/dma.h"
> +#include "monitor/monitor.h"
> +#include "hw/ppc/xive.h"
> +
> +/*
> + * XIVE Interrupt Source
> + */
> +
> +uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno)
> +{
> +    uint32_t byte = srcno / 4;
> +    uint32_t bit  = (srcno % 4) * 2;
> +
> +    assert(byte < xsrc->sbe_size);
> +
> +    return (xsrc->sbe[byte] >> bit) & 0x3;
> +}
> +
> +uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
> +{
> +    uint32_t byte = srcno / 4;
> +    uint32_t bit  = (srcno % 4) * 2;
> +    uint8_t old, new;
> +
> +    assert(byte < xsrc->sbe_size);
> +
> +    old = xsrc->sbe[byte];
> +
> +    new = xsrc->sbe[byte] & ~(0x3 << bit);
> +    new |= (pq & 0x3) << bit;
> +
> +    xsrc->sbe[byte] = new;
> +
> +    return (old >> bit) & 0x3;
> +}
> +
> +static bool xive_source_pq_eoi(XiveSource *xsrc, uint32_t srcno)
> +{
> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> +
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
> +        return false;
> +    case XIVE_ESB_PENDING:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
> +        return false;
> +    case XIVE_ESB_QUEUED:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> +        return true;
> +    case XIVE_ESB_OFF:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
> +        return false;
> +    default:
> +         g_assert_not_reached();
> +    }
> +}
> +
> +/*
> + * Returns whether the event notification should be forwarded.
> + */
> +static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
> +{
> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> +
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> +        return true;
> +    case XIVE_ESB_PENDING:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
> +        return false;
> +    case XIVE_ESB_QUEUED:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
> +        return false;
> +    case XIVE_ESB_OFF:
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
> +        return false;
> +    default:
> +         g_assert_not_reached();
> +    }
> +}
> +
> +/*
> + * Forward the source event notification to the associated XiveFabric,
> + * the device owning the sources.
> + */
> +static void xive_source_notify(XiveSource *xsrc, int srcno)
> +{
> +
> +}
> +
> +/* In a two-page ESB MMIO setting, the even page is the trigger page and
> + * the odd page is for management */

Can I understand from this that the result from this function is only
meaningful in 2-page mode?

> +static inline bool xive_source_is_trigger_page(hwaddr addr)
> +{
> +    return !((addr >> 16) & 1);

Later on you seem to have both 4k and 64k variants list, but here you
hardcode 64k.  Is that a problem?
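
Editorial note on the two queries above: the patch's own `xive_source_esb_mgmt()` places the management page at `1 << (esb_shift - 1)`. The hardcoded `>> 16` therefore agrees with it only when `esb_shift` is 17. Below is a shift-independent sketch. It assumes the page-selection bit is always `esb_shift - 1` in two-page mode, and the helper name is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: in a two-page ESB layout each source occupies
 * 1 << esb_shift bytes, split into an even (trigger) page and an odd
 * (management) page of 1 << (esb_shift - 1) bytes each. The bit just
 * below esb_shift selects between them.
 */
static bool esb_is_trigger_page(uint64_t addr, uint32_t esb_shift)
{
    assert(esb_shift > 0 && esb_shift < 64);
    return !((addr >> (esb_shift - 1)) & 1);
}
```

With a shift of 17 this reduces to the patch's `!((addr >> 16) & 1)`.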

> +}
> +
> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> +    uint32_t offset = addr & 0xF00;

You ignore the low bits of the address entirely, so effectively you have
a 256 byte range that's all aliases of the same register.  Is that
intentional?
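
Editorial note: the sketch below makes the aliasing concrete. It repeats the decode used by the ESB read and write handlers, `srcno = addr >> esb_shift` and `offset = addr & 0xF00`. Two addresses in the same source span that differ only in bits outside 0xF00 decode to the same operation. The struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for a decoded ESB MMIO access */
struct esb_access {
    uint32_t srcno;   /* interrupt source number */
    uint32_t offset;  /* "magic" operation offset, e.g. 0x800 = GET */
};

/* Same masking as xive_source_esb_read()/xive_source_esb_write() */
static struct esb_access esb_decode(uint64_t addr, uint32_t esb_shift)
{
    struct esb_access a;

    assert(esb_shift < 64);
    a.srcno = (uint32_t)(addr >> esb_shift);
    a.offset = (uint32_t)(addr & 0xF00);
    return a;
}
```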

> +    uint32_t srcno = addr >> xsrc->esb_shift;
> +    uint64_t ret = -1;
> +
> +    if (xive_source_esb_2page(xsrc) && xive_source_is_trigger_page(addr)) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +                      "XIVE: invalid load on IRQ %d trigger page at "
> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
> +        return -1;
> +    }
> +
> +    switch (offset) {
> +    case XIVE_ESB_LOAD_EOI:
> +        /*
> +         * Load EOI is not the default source setting under QEMU, but
> +         * this is what HW uses currently.
> +         */
> +        ret = xive_source_pq_eoi(xsrc, srcno);

You're implicitly casting a bool return value into a u64 here, is that
intentional?

> +
> +        break;
> +
> +    case XIVE_ESB_GET:
> +        ret = xive_source_pq_get(xsrc, srcno);
> +        break;
> +
> +    case XIVE_ESB_SET_PQ_00:
> +    case XIVE_ESB_SET_PQ_01:
> +    case XIVE_ESB_SET_PQ_10:
> +    case XIVE_ESB_SET_PQ_11:
> +        ret = xive_source_pq_set(xsrc, srcno, (offset >> 8) & 0x3);
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
> +    }
> +
> +    return ret;
> +}
> +
> +static void xive_source_esb_write(void *opaque, hwaddr addr,
> +                                 uint64_t value, unsigned size)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> +    uint32_t offset = addr & 0xF00;
> +    uint32_t srcno = addr >> xsrc->esb_shift;
> +    bool notify = false;
> +
> +    switch (offset) {
> +    case 0:
> +        notify = xive_source_pq_trigger(xsrc, srcno);
> +        break;
> +
> +    case XIVE_ESB_STORE_EOI:
> +        if (xive_source_is_trigger_page(addr)) {
> +            qemu_log_mask(LOG_GUEST_ERROR,
> +                          "XIVE: invalid store on IRQ %d trigger page at "
> +                          "0x%"HWADDR_PRIx"\n", srcno, addr);
> +            return;
> +        }
> +
> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
> +            qemu_log_mask(LOG_GUEST_ERROR,
> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
> +            return;
> +        }
> +
> +        /* If the Q bit is set, we should forward a new source event
> +         * notification
> +         */
> +        notify = xive_source_pq_eoi(xsrc, srcno);
> +        break;
> +
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
> +                      offset);
> +        return;
> +    }
> +
> +    /* Forward the source event notification for routing */
> +    if (notify) {
> +        xive_source_notify(xsrc, srcno);
> +    }

EOI via this path calls notify, but the one via the read path
doesn't.  Is that correct?
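
Editorial note: the sketch below is a standalone model of the PQ transitions in this patch, for reference while answering this. Both `xive_source_pq_trigger()` and `xive_source_pq_eoi()` return true when a notification should be forwarded: on a trigger from RESET, and on an EOI from QUEUED. The Store EOI path forwards that result. The Load EOI path only returns it as the load value. The packing of four 2-bit entries per byte follows `xive_source_pq_get()`/`xive_source_pq_set()`. The function names here are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define ESB_VAL_P    0x2
#define ESB_VAL_Q    0x1
#define ESB_RESET    0x0
#define ESB_PENDING  ESB_VAL_P
#define ESB_QUEUED   (ESB_VAL_P | ESB_VAL_Q)
#define ESB_OFF      ESB_VAL_Q

/* Four 2-bit PQ entries per byte, as in xive_source_pq_get()/set() */
static uint8_t pq_get(const uint8_t *sbe, uint32_t srcno)
{
    return (sbe[srcno / 4] >> ((srcno % 4) * 2)) & 0x3;
}

static void pq_set(uint8_t *sbe, uint32_t srcno, uint8_t pq)
{
    uint32_t bit = (srcno % 4) * 2;

    sbe[srcno / 4] = (uint8_t)((sbe[srcno / 4] & ~(0x3 << bit)) |
                               ((pq & 0x3) << bit));
}

/* Returns true if the event should be forwarded for routing */
static bool pq_trigger(uint8_t *sbe, uint32_t srcno)
{
    switch (pq_get(sbe, srcno)) {
    case ESB_RESET:
        pq_set(sbe, srcno, ESB_PENDING);
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        pq_set(sbe, srcno, ESB_QUEUED);  /* coalesce into one event */
        return false;
    default:                             /* ESB_OFF: source is masked */
        return false;
    }
}

/* Returns true if an event arrived while pending and must be re-forwarded */
static bool pq_eoi(uint8_t *sbe, uint32_t srcno)
{
    switch (pq_get(sbe, srcno)) {
    case ESB_QUEUED:
        pq_set(sbe, srcno, ESB_PENDING);
        return true;
    case ESB_PENDING:
        pq_set(sbe, srcno, ESB_RESET);
        return false;
    default:                             /* ESB_RESET or ESB_OFF: unchanged */
        return false;
    }
}
```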

> +}
> +
> +static const MemoryRegionOps xive_source_esb_ops = {
> +    .read = xive_source_esb_read,
> +    .write = xive_source_esb_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static void xive_source_set_irq(void *opaque, int srcno, int val)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> +    bool notify = false;
> +
> +    if (val) {
> +        notify = xive_source_pq_trigger(xsrc, srcno);
> +    }
> +
> +    /* Forward the source event notification for routing */
> +    if (notify) {
> +        xive_source_notify(xsrc, srcno);
> +    }
> +}
> +
> +void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> +{
> +    int i;
> +
> +    monitor_printf(mon, "XIVE Source %6x ..%6x\n",
> +                   xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        uint8_t pq = xive_source_pq_get(xsrc, i);
> +        uint32_t lisn = i  + xsrc->offset;
> +
> +        if (pq == XIVE_ESB_OFF) {
> +            continue;
> +        }
> +
> +        monitor_printf(mon, "  %4x %c%c\n", lisn,
> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> +    }
> +}
> +
> +static void xive_source_reset(DeviceState *dev)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> +
> +    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
> +    memset(xsrc->sbe, 0x55, xsrc->sbe_size);
> +}
> +
> +static void xive_source_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> +
> +    if (!xsrc->nr_irqs) {
> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
> +        return;
> +    }
> +
> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
> +        xsrc->esb_shift != XIVE_ESB_64K &&
> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
> +        error_setg(errp, "Invalid ESB shift setting");
> +        return;
> +    }
> +
> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> +                                     xsrc->nr_irqs);
> +
> +    /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
> +    xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
> +    xsrc->sbe = g_malloc0(xsrc->sbe_size);
> +
> +    /* TODO: H_INT_ESB support, which removes the ESB MMIOs */
> +
> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> +                          &xive_source_esb_ops, xsrc, "xive.esb",
> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> +}
> +
> +static const VMStateDescription vmstate_xive_source = {
> +    .name = TYPE_XIVE_SOURCE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
> +        VMSTATE_VBUFFER_UINT32(sbe, XiveSource, 1, NULL, sbe_size),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * The default XIVE interrupt source setting for ESB MMIO is two 64k
> + * pages without Store EOI. This is in sync with KVM.
> + */
> +static Property xive_source_properties[] = {
> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
> +    DEFINE_PROP_UINT64("bar", XiveSource, esb_base, 0),

Isn't this redundant with however the base address is handled through
the SysBusDevice stuff (I forget the details)?

> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void xive_source_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->realize = xive_source_realize;
> +    dc->reset = xive_source_reset;
> +    dc->props = xive_source_properties;
> +    dc->desc = "XIVE interrupt source";
> +    dc->vmsd = &vmstate_xive_source;
> +}
> +
> +static const TypeInfo xive_source_info = {
> +    .name          = TYPE_XIVE_SOURCE,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(XiveSource),
> +    .class_init    = xive_source_class_init,
> +};
> +
> +static void xive_register_types(void)
> +{
> +    type_register_static(&xive_source_info);
> +}
> +
> +type_init(xive_register_types)
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> new file mode 100644
> index 000000000000..d92a50519edf
> --- /dev/null
> +++ b/include/hw/ppc/xive.h
> @@ -0,0 +1,130 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_XIVE_H
> +#define PPC_XIVE_H
> +
> +#include "hw/sysbus.h"
> +
> +/*
> + * XIVE Interrupt Source
> + */
> +
> +#define TYPE_XIVE_SOURCE "xive-source"
> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
> +
> +/*
> + * XIVE Source Interrupt source characteristics, which define how the
> + * ESB are controlled.
> + */
> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
> +#define XIVE_SRC_STORE_EOI     0x4 /* Store EOI supported */
> +
> +typedef struct XiveSource {
> +    SysBusDevice parent;
> +
> +    /* IRQs */
> +    uint32_t     nr_irqs;
> +    uint32_t     offset;
> +    qemu_irq     *qirqs;
> +
> +    /* PQ bits */
> +    uint8_t      *sbe;
> +    uint32_t     sbe_size;
> +
> +    /* ESB memory region */
> +    uint64_t     esb_flags;
> +    hwaddr       esb_base;
> +    uint32_t     esb_shift;
> +    MemoryRegion esb_mmio;
> +} XiveSource;
> +
> +/*
> + * ESB MMIO setting. Can be one page, for both source triggering and
> + * source management, or two different pages. See below for magic
> + * values.
> + */
> +#define XIVE_ESB_4K          12 /* PSI HB */
> +#define XIVE_ESB_4K_2PAGE    17

Should this be 13 instead of 17?
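
Editorial note: the arithmetic behind this query is sketched below. The shift is log2 of the per-source ESB span, so a two-page layout adds one to the page shift. This assumes that rule holds for both page sizes, and the helper name is hypothetical. With the value 17, `XIVE_ESB_4K_2PAGE` cannot be told apart from `XIVE_ESB_64K_2PAGE`, either in the realize check or in `xive_source_esb_2page()`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: log2 of the ESB MMIO span of one source */
static uint32_t esb_shift(uint32_t page_shift, bool two_page)
{
    assert(page_shift < 63);
    return page_shift + (two_page ? 1 : 0);
}
```

For example, two 4K pages span 8K = 1 << 13, and two 64K pages span 128K = 1 << 17.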

> +#define XIVE_ESB_64K         16
> +#define XIVE_ESB_64K_2PAGE   17

(Also, who the hell comes up with a brand new PIC and decides to have
*4* different interface variants.  But that's not your problem, I
realise).

> +
> +static inline bool xive_source_esb_2page(XiveSource *xsrc)
> +{
> +    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE;
> +}
> +
> +static inline hwaddr xive_source_esb_base(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    return xsrc->esb_base + (1ull << xsrc->esb_shift) * srcno;
> +}
> +
> +/* The trigger page is always the first/even page */
> +#define xive_source_esb_trigger xive_source_esb_base
> +
> +/* In a two-page ESB MMIO setting, the odd page is for management */
> +static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
> +{
> +    hwaddr addr = xive_source_esb_base(xsrc, srcno);
> +
> +    if (xive_source_esb_2page(xsrc)) {
> +        addr += (1 << (xsrc->esb_shift - 1));
> +    }
> +
> +    return addr;
> +}
> +
> +/*
> + * Each interrupt source has a 2-bit state machine called ESB which
> + * can be controlled by MMIO. It's made of 2 bits, P and Q. P
> + * indicates that an interrupt is pending (has been sent to a queue
> + * and is waiting for an EOI). Q indicates that the interrupt has been
> + * triggered while pending.
> + *
> + * This acts as a coalescing mechanism in order to guarantee
> + * that a given interrupt only occurs at most once in a queue.
> + *
> + * When doing an EOI, the Q bit will indicate if the interrupt
> + * needs to be re-triggered.
> + */
> +#define XIVE_ESB_VAL_P        0x2
> +#define XIVE_ESB_VAL_Q        0x1
> +
> +#define XIVE_ESB_RESET        0x0
> +#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
> +#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
> +#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
> +
> +/*
> + * "magic" Event State Buffer (ESB) MMIO offsets.
> + *
> + * The following offsets into the ESB MMIO allow to read or
> + * manipulate the PQ bits. They must be used with an 8-bytes
> + * load instruction. They all return the previous state of the
> + * interrupt (atomically).
> + *
> + * Additionally, some ESB pages support doing an EOI via a
> + * store at 0 and some ESBs support doing a trigger via a
> + * separate trigger page.
> + */
> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
> +#define XIVE_ESB_GET            0x800 /* Load */
> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
> +
> +uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno);
> +uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
> +
> +void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
> +
> +#endif /* PPC_XIVE_H */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model
  2018-04-20  7:10   ` David Gibson
@ 2018-04-20  8:27     ` Cédric Le Goater
  2018-04-23  3:59       ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-20  8:27 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/20/2018 09:10 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:42:57PM +0200, Cédric Le Goater wrote:
>> Each XIVE interrupt source is associated with a two bit state machine
>> called an Event State Buffer (ESB) : the first bit "P" means that an
>> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
>> means a new interrupt was triggered while another was still pending.
>>
>> When an event is triggered, the associated interrupt state bits are
>> fetched and modified and forwarded to the virtualization engine of the
>> controller doing the routing. These can also be controlled by MMIO, to
>> trigger events or turn off the sources for instance. See code for more
>> details on the states and transitions.
>>
>> On a sPAPR machine, the OS will obtain the address of the MMIO page of
>> the ESB entry associated with a source and its characteristics using
>> the H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is
>> used.
>>
>> The xive_source_notify() routine is in charge of forwarding the source
>> event notification to the routing engine. It will be filled later on.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  Changes since v2:
>>
>>  - added support for Store EOI
>>  - added support for two page MMIO setting like on KVM
> 
> Looks generally sane to me, though I have a few queries.
> 
>>
>>  default-configs/ppc64-softmmu.mak |   1 +
>>  hw/intc/Makefile.objs             |   1 +
>>  hw/intc/xive.c                    | 335 ++++++++++++++++++++++++++++++++++++++
>>  include/hw/ppc/xive.h             | 130 +++++++++++++++
>>  4 files changed, 467 insertions(+)
>>  create mode 100644 hw/intc/xive.c
>>  create mode 100644 include/hw/ppc/xive.h
>>
>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>> index b94af6c7c62a..c6d13e757977 100644
>> --- a/default-configs/ppc64-softmmu.mak
>> +++ b/default-configs/ppc64-softmmu.mak
>> @@ -16,4 +16,5 @@ CONFIG_VIRTIO_VGA=y
>>  CONFIG_XICS=$(CONFIG_PSERIES)
>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>> +CONFIG_XIVE=$(CONFIG_PSERIES)
>>  CONFIG_MEM_HOTPLUG=y
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index 0e9963f5eecc..72a46ed91c31 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>>  obj-$(CONFIG_XICS) += xics.o
>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>> +obj-$(CONFIG_XIVE) += xive.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> new file mode 100644
>> index 000000000000..c70578759d02
>> --- /dev/null
>> +++ b/hw/intc/xive.c
>> @@ -0,0 +1,335 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/dma.h"
>> +#include "monitor/monitor.h"
>> +#include "hw/ppc/xive.h"
>> +
>> +/*
>> + * XIVE Interrupt Source
>> + */
>> +
>> +uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    uint32_t byte = srcno / 4;
>> +    uint32_t bit  = (srcno % 4) * 2;
>> +
>> +    assert(byte < xsrc->sbe_size);
>> +
>> +    return (xsrc->sbe[byte] >> bit) & 0x3;
>> +}
>> +
>> +uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
>> +{
>> +    uint32_t byte = srcno / 4;
>> +    uint32_t bit  = (srcno % 4) * 2;
>> +    uint8_t old, new;
>> +
>> +    assert(byte < xsrc->sbe_size);
>> +
>> +    old = xsrc->sbe[byte];
>> +
>> +    new = xsrc->sbe[byte] & ~(0x3 << bit);
>> +    new |= (pq & 0x3) << bit;
>> +
>> +    xsrc->sbe[byte] = new;
>> +
>> +    return (old >> bit) & 0x3;
>> +}
>> +
>> +static bool xive_source_pq_eoi(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>> +
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
>> +        return false;
>> +    case XIVE_ESB_PENDING:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
>> +        return false;
>> +    case XIVE_ESB_QUEUED:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>> +        return true;
>> +    case XIVE_ESB_OFF:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
>> +        return false;
>> +    default:
>> +         g_assert_not_reached();
>> +    }
>> +}
>> +
>> +/*
>> + * Returns whether the event notification should be forwarded.
>> + */
>> +static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>> +
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>> +        return true;
>> +    case XIVE_ESB_PENDING:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
>> +        return false;
>> +    case XIVE_ESB_QUEUED:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
>> +        return false;
>> +    case XIVE_ESB_OFF:
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
>> +        return false;
>> +    default:
>> +         g_assert_not_reached();
>> +    }
>> +}
>> +
>> +/*
>> + * Forward the source event notification to the associated XiveFabric,
>> + * the device owning the sources.
>> + */
>> +static void xive_source_notify(XiveSource *xsrc, int srcno)
>> +{
>> +
>> +}
>> +
>> +/* In a two pages ESB MMIO setting, even page is the trigger page, odd
>> + * page is for management */
> 
> Can I understand from this that the result from this function is only
> meaningful in 2-pages mode?

Yes. Maybe I should rename it to xive_source_is_even_page()?

>> +static inline bool xive_source_is_trigger_page(hwaddr addr)
>> +{
>> +    return !((addr >> 16) & 1);
> 
> Later on you seem to have both 4k and 64k variants listed, but here you
> hardcode 64k.  Is that a problem?
	
Oops. This should be:

	(addr >> (xsrc->esb_shift - 1))

I tested with the sPAPR guest, which uses 64K ESB MMIO pages.
The check is only significant in a two-page setting.
 
>> +}
>> +
>> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>> +    uint32_t offset = addr & 0xF00;
> 
> You ignore the low bits of the address entirely, so effectively you have
> a 256-byte range that's all aliases of the same register.  Is that
> intentional?

Yes, but it's not entirely correct. The exact ranges are:

			Load			Store

0x000 .. 0x3FF		EOI and return 0|1	Trigger
0x400 .. 0x7FF		EOI and return 0|1	EOI
0x800 .. 0xBFF		return PQ		undefined
0xC00 .. 0xCFF		return PQ and PQ=00	PQ=00
0xD00 .. 0xDFF		return PQ and PQ=01	PQ=01
0xE00 .. 0xEFF		return PQ and PQ=10	PQ=10
0xF00 .. 0xFFF		return PQ and PQ=11	PQ=11

There is room for improvement.

In a two-page ESB MMIO setting, the trigger page only triggers on stores.
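The ranges above could be decoded with range checks instead of exact
offset matches. Below is a minimal standalone sketch, assuming the
offset is taken within a 4K management region (masking with 0xFFF
rather than 0xF00). The EsbOp enum and esb_decode() are hypothetical
names used only for illustration, not QEMU code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operation codes for this sketch only */
typedef enum {
    ESB_OP_INVALID,
    ESB_OP_TRIGGER,   /* store 0x000..0x3FF */
    ESB_OP_EOI,       /* load 0x000..0x7FF, store 0x400..0x7FF */
    ESB_OP_GET,       /* load 0x800..0xBFF */
    ESB_OP_SET_PQ,    /* load or store 0xC00..0xFFF */
} EsbOp;

/*
 * Decode an offset following the Load/Store ranges in the table.
 * 'is_store' selects the Store column. For the SET_PQ ranges, the
 * new PQ value is returned through *pq.
 */
static EsbOp esb_decode(uint32_t offset, bool is_store, uint8_t *pq)
{
    offset &= 0xFFF;                  /* offset within the 4K region */

    if (offset < 0x400) {
        return is_store ? ESB_OP_TRIGGER : ESB_OP_EOI;
    }
    if (offset < 0x800) {
        return ESB_OP_EOI;            /* Load EOI or Store EOI */
    }
    if (offset < 0xC00) {
        return is_store ? ESB_OP_INVALID : ESB_OP_GET;
    }
    /* 0xC00..0xFFF: bits 9:8 of the offset select the new PQ value */
    *pq = (offset >> 8) & 0x3;
    return ESB_OP_SET_PQ;
}
```

Using ranges means any 8-byte access inside, for instance,
0xD00..0xDFF sets PQ=01, which matches the table rather than only the
first address of each 256-byte range.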

 
>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>> +    uint64_t ret = -1;
>> +
>> +    if (xive_source_esb_2page(xsrc) && xive_source_is_trigger_page(addr)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR,
>> +                      "XIVE: invalid load on IRQ %d trigger page at "
>> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
>> +        return -1;
>> +    }
>> +
>> +    switch (offset) {
>> +    case XIVE_ESB_LOAD_EOI:
>> +        /*
>> +         * Load EOI is not the default source setting under QEMU, but
>> +         * this is what HW uses currently.
>> +         */
>> +        ret = xive_source_pq_eoi(xsrc, srcno);
> 
> You're implicitly casting a bool return value into a u64 here, is that
> intentional?

Yes. Is that bad? This is what the load is supposed to return in the transition
algorithm.
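To make the return value concrete, here is a minimal standalone
sketch of the PQ transitions in the patch, with a Load EOI returning
the EOI result widened to 64 bits, so only the LSB (0 or 1) carries
information. The PQ_* names and helper functions are hypothetical
stand-ins for the XIVE_ESB_* values and xive_source_pq_*() routines.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings, mirroring the XIVE_ESB_* values in the patch */
enum { PQ_RESET = 0x0, PQ_OFF = 0x1, PQ_PENDING = 0x2, PQ_QUEUED = 0x3 };

/* EOI: returns true when Q was set, i.e. a new event must be forwarded */
static bool pq_eoi(uint8_t *pq)
{
    switch (*pq) {
    case PQ_PENDING:
        *pq = PQ_RESET;
        return false;
    case PQ_QUEUED:
        *pq = PQ_PENDING;
        return true;
    default:            /* RESET and OFF are left unchanged */
        return false;
    }
}

/* Trigger: returns true when the event should be forwarded */
static bool pq_trigger(uint8_t *pq)
{
    switch (*pq) {
    case PQ_RESET:
        *pq = PQ_PENDING;
        return true;
    case PQ_PENDING:
        *pq = PQ_QUEUED;
        return false;
    default:            /* QUEUED stays QUEUED, OFF stays OFF */
        return false;
    }
}

/* Load EOI value: the bool widened to 64 bits, only the LSB is used */
static uint64_t load_eoi_value(uint8_t *pq)
{
    return pq_eoi(pq);
}
```

A trigger while PENDING coalesces into QUEUED, and the following EOI
returns 1, telling the OS (or the model) that the interrupt must be
re-triggered.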

>> +
>> +        break;
>> +
>> +    case XIVE_ESB_GET:
>> +        ret = xive_source_pq_get(xsrc, srcno);
>> +        break;
>> +
>> +    case XIVE_ESB_SET_PQ_00:
>> +    case XIVE_ESB_SET_PQ_01:
>> +    case XIVE_ESB_SET_PQ_10:
>> +    case XIVE_ESB_SET_PQ_11:
>> +        ret = xive_source_pq_set(xsrc, srcno, (offset >> 8) & 0x3);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static void xive_source_esb_write(void *opaque, hwaddr addr,
>> +                                 uint64_t value, unsigned size)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>> +    uint32_t offset = addr & 0xF00;
>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>> +    bool notify = false;
>> +
>> +    switch (offset) {
>> +    case 0:
>> +        notify = xive_source_pq_trigger(xsrc, srcno);
>> +        break;
>> +
>> +    case XIVE_ESB_STORE_EOI:
>> +        if (xive_source_is_trigger_page(addr)) {
>> +            qemu_log_mask(LOG_GUEST_ERROR,
>> +                          "XIVE: invalid store on IRQ %d trigger page at "
>> +                          "0x%"HWADDR_PRIx"\n", srcno, addr);
>> +            return;
>> +        }
>> +
>> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
>> +            qemu_log_mask(LOG_GUEST_ERROR,
>> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
>> +            return;
>> +        }
>> +
>> +        /* If the Q bit is set, we should forward a new source event
>> +         * notification
>> +         */
>> +        notify = xive_source_pq_eoi(xsrc, srcno);
>> +        break;
>> +
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
>> +                      offset);
>> +        return;
>> +    }
>> +
>> +    /* Forward the source event notification for routing */
>> +    if (notify) {
>> +        xive_source_notify(xsrc, srcno);
>> +    }
> 
> EOI via this path calls notify, but the one via the read path
> doesn't.  Is that correct?

No. I focused on the one-page ESB MMIO setting with Store EOI in
emulated mode and did not pay enough attention to the two-page ESB
MMIO setting, which is a late change made for KVM compatibility. I
will fix it.
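One way to keep both EOI paths consistent is to route them through a
single helper that performs the EOI transition and forwards the event
when Q was set. This is a minimal standalone sketch; SketchSource,
sketch_notify() and the other sketch_* functions are hypothetical, and
the notification is modelled as a counter instead of a call into the
routing engine.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for XiveSource: one PQ value per source and a
 * counter recording forwarded notifications */
typedef struct {
    uint8_t pq[4];
    unsigned notified;
} SketchSource;

static void sketch_notify(SketchSource *s, uint32_t srcno)
{
    (void)srcno;
    s->notified++;      /* would forward the event for routing */
}

/* EOI: PQ=11 -> 10 and re-notify, PQ=10 -> 00, 00 and 01 unchanged */
static bool sketch_pq_eoi(SketchSource *s, uint32_t srcno)
{
    uint8_t old = s->pq[srcno];

    if (old == 0x3) {
        s->pq[srcno] = 0x2;
        return true;
    }
    if (old == 0x2) {
        s->pq[srcno] = 0x0;
    }
    return false;
}

/* Shared by the Load EOI and Store EOI paths so both forward the event */
static bool sketch_eoi_and_notify(SketchSource *s, uint32_t srcno)
{
    bool notify = sketch_pq_eoi(s, srcno);

    if (notify) {
        sketch_notify(s, srcno);
    }
    return notify;
}

static uint64_t sketch_load_eoi(SketchSource *s, uint32_t srcno)
{
    return sketch_eoi_and_notify(s, srcno);
}

static void sketch_store_eoi(SketchSource *s, uint32_t srcno)
{
    sketch_eoi_and_notify(s, srcno);
}
```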

> 
>> +}
>> +
>> +static const MemoryRegionOps xive_source_esb_ops = {
>> +    .read = xive_source_esb_read,
>> +    .write = xive_source_esb_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static void xive_source_set_irq(void *opaque, int srcno, int val)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>> +    bool notify = false;
>> +
>> +    if (val) {
>> +        notify = xive_source_pq_trigger(xsrc, srcno);
>> +    }
>> +
>> +    /* Forward the source event notification for routing */
>> +    if (notify) {
>> +        xive_source_notify(xsrc, srcno);
>> +    }
>> +}
>> +
>> +void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>> +{
>> +    int i;
>> +
>> +    monitor_printf(mon, "XIVE Source %6x ..%6x\n",
>> +                   xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        uint8_t pq = xive_source_pq_get(xsrc, i);
>> +        uint32_t lisn = i  + xsrc->offset;
>> +
>> +        if (pq == XIVE_ESB_OFF) {
>> +            continue;
>> +        }
>> +
>> +        monitor_printf(mon, "  %4x %c%c\n", lisn,
>> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>> +    }
>> +}
>> +
>> +static void xive_source_reset(DeviceState *dev)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +
>> +    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>> +    memset(xsrc->sbe, 0x55, xsrc->sbe_size);
>> +}
>> +
>> +static void xive_source_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +
>> +    if (!xsrc->nr_irqs) {
>> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
>> +        return;
>> +    }
>> +
>> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
>> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
>> +        xsrc->esb_shift != XIVE_ESB_64K &&
>> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
>> +        error_setg(errp, "Invalid ESB shift setting");
>> +        return;
>> +    }
>> +
>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>> +                                     xsrc->nr_irqs);
>> +
>> +    /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>> +    xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
>> +    xsrc->sbe = g_malloc0(xsrc->sbe_size);
>> +
>> +    /* TODO: H_INT_ESB support, which removing the ESB MMIOs */
>> +
>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>> +                          &xive_source_esb_ops, xsrc, "xive.esb",
>> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>> +}
>> +
>> +static const VMStateDescription vmstate_xive_source = {
>> +    .name = TYPE_XIVE_SOURCE,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
>> +        VMSTATE_VBUFFER_UINT32(sbe, XiveSource, 1, NULL, sbe_size),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +/*
>> + * The default XIVE interrupt source setting for ESB MMIO is two 64k
>> + * pages without Store EOI. This is in sync with KVM.
>> + */
>> +static Property xive_source_properties[] = {
>> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>> +    DEFINE_PROP_UINT64("bar", XiveSource, esb_base, 0),
> 
> Isn't this redundant with however the base address is handled through
> the SysBusDevice stuff (I forget the details)?

Storing the ESB MMIO base address in the XiveSource object is
convenient later on for the h_int_get_source_info hcall, which makes
use of these helpers:

	hwaddr xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
	hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)

But it is only used in that place, so we could just store the ESB
MMIO base address in the sPAPRXive controller instead. That also makes
sense in the design, as we have to inform KVM of this address with
a KVM device ioctl.
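Wherever the base address ends up living, the address arithmetic
behind the two helpers stays the same. Here is a minimal standalone
sketch of it, assuming a hypothetical EsbLayout struct that holds the
base, the ESB shift, a two-page flag and the number of sources; hwaddr
is redefined locally as a 64-bit integer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;    /* local stand-in for QEMU's hwaddr */

/* Hypothetical layout parameters, owned by XiveSource or sPAPRXive */
typedef struct {
    hwaddr   esb_base;
    uint32_t esb_shift;     /* log2 of the per-source ESB size */
    bool     two_pages;     /* separate trigger and management pages */
    uint32_t nr_irqs;
} EsbLayout;

/* Each source owns a (1 << esb_shift) byte slot; the trigger page is first */
static hwaddr esb_trigger_addr(const EsbLayout *l, uint32_t srcno)
{
    assert(srcno < l->nr_irqs);
    return l->esb_base + ((hwaddr)1 << l->esb_shift) * srcno;
}

/* In two-page mode the management page is the second half of the slot;
 * otherwise a single page serves both purposes */
static hwaddr esb_mgmt_addr(const EsbLayout *l, uint32_t srcno)
{
    hwaddr addr = esb_trigger_addr(l, srcno);

    if (l->two_pages) {
        addr += (hwaddr)1 << (l->esb_shift - 1);
    }
    return addr;
}
```

With a two-page 64K setting (shift 17), source 2 at base 0x10000000
has its trigger page at 0x10040000 and its management page at
0x10050000.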

>> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void xive_source_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->realize = xive_source_realize;
>> +    dc->reset = xive_source_reset;
>> +    dc->props = xive_source_properties;
>> +    dc->desc = "XIVE interrupt source";
>> +    dc->vmsd = &vmstate_xive_source;
>> +}
>> +
>> +static const TypeInfo xive_source_info = {
>> +    .name          = TYPE_XIVE_SOURCE,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(XiveSource),
>> +    .class_init    = xive_source_class_init,
>> +};
>> +
>> +static void xive_register_types(void)
>> +{
>> +    type_register_static(&xive_source_info);
>> +}
>> +
>> +type_init(xive_register_types)
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> new file mode 100644
>> index 000000000000..d92a50519edf
>> --- /dev/null
>> +++ b/include/hw/ppc/xive.h
>> @@ -0,0 +1,130 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_XIVE_H
>> +#define PPC_XIVE_H
>> +
>> +#include "hw/sysbus.h"
>> +
>> +/*
>> + * XIVE Interrupt Source
>> + */
>> +
>> +#define TYPE_XIVE_SOURCE "xive-source"
>> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>> +
>> +/*
>> + * XIVE Source Interrupt source characteristics, which define how the
>> + * ESB are controlled.
>> + */
>> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
>> +#define XIVE_SRC_STORE_EOI     0x4 /* Store EOI supported */
>> +
>> +typedef struct XiveSource {
>> +    SysBusDevice parent;
>> +
>> +    /* IRQs */
>> +    uint32_t     nr_irqs;
>> +    uint32_t     offset;
>> +    qemu_irq     *qirqs;
>> +
>> +    /* PQ bits */
>> +    uint8_t      *sbe;
>> +    uint32_t     sbe_size;
>> +
>> +    /* ESB memory region */
>> +    uint64_t     esb_flags;
>> +    hwaddr       esb_base;
>> +    uint32_t     esb_shift;
>> +    MemoryRegion esb_mmio;
>> +} XiveSource;
>> +
>> +/*
>> + * ESB MMIO setting. Can be one page, for both source triggering and
>> + * source management, or two different pages. See below for magic
>> + * values.
>> + */
>> +#define XIVE_ESB_4K          12 /* PSI HB */
>> +#define XIVE_ESB_4K_2PAGE    17
> 
> Should this be 13 instead of 17?

Oops. Obviously, this is not used.

>> +#define XIVE_ESB_64K         16
>> +#define XIVE_ESB_64K_2PAGE   17
> 
> (Also, who the hell comes up with a brand new PIC and decides to have
> *4* different interface variants.  But that's not your problem, I
> realise).

These come from HW constraints on the different controllers that
need to expose sources: PSI, PHB4. The internal sources of the XIVE
interrupt controller can be configured to use 4K or 64K pages, but I
doubt 4K will ever be used.
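For reference, a minimal standalone sketch of the four ESB settings
with the 4K two-page value at 13 (two 4K pages), as suggested in the
review, together with a two-page check that covers both page sizes
and the resulting MMIO region size. The SKETCH_* macros and helper
names are hypothetical and do not replace the XIVE_ESB_* definitions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* log2 of the per-source ESB size for each setting */
#define SKETCH_ESB_4K          12   /* PSI HB */
#define SKETCH_ESB_4K_2PAGE    13   /* 2 x 4K */
#define SKETCH_ESB_64K         16
#define SKETCH_ESB_64K_2PAGE   17   /* 2 x 64K */

static bool esb_shift_valid(uint32_t shift)
{
    return shift == SKETCH_ESB_4K || shift == SKETCH_ESB_4K_2PAGE ||
           shift == SKETCH_ESB_64K || shift == SKETCH_ESB_64K_2PAGE;
}

/* Two-page mode must be recognized for both page sizes, not only 64K */
static bool esb_is_2page(uint32_t shift)
{
    return shift == SKETCH_ESB_4K_2PAGE || shift == SKETCH_ESB_64K_2PAGE;
}

/* Total size of the ESB MMIO region for nr_irqs sources */
static uint64_t esb_region_size(uint32_t shift, uint32_t nr_irqs)
{
    return ((uint64_t)1 << shift) * nr_irqs;
}
```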

Thanks,

C.
 
>> +
>> +static inline bool xive_source_esb_2page(XiveSource *xsrc)
>> +{
>> +    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE;
>> +}
>> +
>> +static inline hwaddr xive_source_esb_base(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    return xsrc->esb_base + (1ull << xsrc->esb_shift) * srcno;
>> +}
>> +
>> +/* The trigger page is always the first/even page */
>> +#define xive_source_esb_trigger xive_source_esb_base
>> +
>> +/* In a two pages ESB MMIO setting, the odd page is for management */
>> +static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
>> +{
>> +    hwaddr addr = xive_source_esb_base(xsrc, srcno);
>> +
>> +    if (xive_source_esb_2page(xsrc)) {
>> +        addr += (1 << (xsrc->esb_shift - 1));
>> +    }
>> +
>> +    return addr;
>> +}
>> +
>> +/*
>> + * Each interrupt source has a 2-bit state machine called ESB which
>> + * can be controlled by MMIO. It's made of 2 bits, P and Q. P
>> + * indicates that an interrupt is pending (has been sent to a queue
>> + * and is waiting for an EOI). Q indicates that the interrupt has been
>> + * triggered while pending.
>> + *
>> + * This acts as a coalescing mechanism in order to guarantee
>> + * that a given interrupt only occurs at most once in a queue.
>> + *
>> + * When doing an EOI, the Q bit will indicate if the interrupt
>> + * needs to be re-triggered.
>> + */
>> +#define XIVE_ESB_VAL_P        0x2
>> +#define XIVE_ESB_VAL_Q        0x1
>> +
>> +#define XIVE_ESB_RESET        0x0
>> +#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
>> +#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
>> +#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
>> +
>> +/*
>> + * "magic" Event State Buffer (ESB) MMIO offsets.
>> + *
>> + * The following offsets into the ESB MMIO allow to read or
>> + * manipulate the PQ bits. They must be used with an 8-bytes
>> + * load instruction. They all return the previous state of the
>> + * interrupt (atomically).
>> + *
>> + * Additionally, some ESB pages support doing an EOI via a
>> + * store at 0 and some ESBs support doing a trigger via a
>> + * separate trigger page.
>> + */
>> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
>> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
>> +#define XIVE_ESB_GET            0x800 /* Load */
>> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
>> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
>> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
>> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
>> +
>> +uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno);
>> +uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>> +
>> +void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
>> +
>> +#endif /* PPC_XIVE_H */
> 


* Re: [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model
  2018-04-20  8:27     ` Cédric Le Goater
@ 2018-04-23  3:59       ` David Gibson
  2018-04-23  7:11         ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-23  3:59 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Fri, Apr 20, 2018 at 10:27:21AM +0200, Cédric Le Goater wrote:
> On 04/20/2018 09:10 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:42:57PM +0200, Cédric Le Goater wrote:
> >> Each XIVE interrupt source is associated with a two bit state machine
> >> called an Event State Buffer (ESB) : the first bit "P" means that an
> >> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
> >> means a new interrupt was triggered while another was still pending.
> >>
> >> When an event is triggered, the associated interrupt state bits are
> >> fetched and modified and forwarded to the virtualization engine of the
> >> controller doing the routing. These can also be controlled by MMIO, to
> >> trigger events or turn off the sources for instance. See code for more
> >> details on the states and transitions.
> >>
> >> On a sPAPR machine, the OS will obtain the address of the MMIO page of
> >> the ESB entry associated with a source and its characteristics using
> >> the H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is
> >> used.
> >>
> >> The xive_source_notify() routine is in charge of forwarding the source
> >> event notification to the routing engine. It will be filled later on.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  Changes since v2:
> >>
> >>  - added support for Store EOI
> >>  - added support for two page MMIO setting like on KVM
> > 
> > Looks generally sane to me, though I have a few queries.
> > 
> >>
> >>  default-configs/ppc64-softmmu.mak |   1 +
> >>  hw/intc/Makefile.objs             |   1 +
> >>  hw/intc/xive.c                    | 335 ++++++++++++++++++++++++++++++++++++++
> >>  include/hw/ppc/xive.h             | 130 +++++++++++++++
> >>  4 files changed, 467 insertions(+)
> >>  create mode 100644 hw/intc/xive.c
> >>  create mode 100644 include/hw/ppc/xive.h
> >>
> >> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> >> index b94af6c7c62a..c6d13e757977 100644
> >> --- a/default-configs/ppc64-softmmu.mak
> >> +++ b/default-configs/ppc64-softmmu.mak
> >> @@ -16,4 +16,5 @@ CONFIG_VIRTIO_VGA=y
> >>  CONFIG_XICS=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> >> +CONFIG_XIVE=$(CONFIG_PSERIES)
> >>  CONFIG_MEM_HOTPLUG=y
> >> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> >> index 0e9963f5eecc..72a46ed91c31 100644
> >> --- a/hw/intc/Makefile.objs
> >> +++ b/hw/intc/Makefile.objs
> >> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
> >>  obj-$(CONFIG_XICS) += xics.o
> >>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> >> +obj-$(CONFIG_XIVE) += xive.o
> >>  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> new file mode 100644
> >> index 000000000000..c70578759d02
> >> --- /dev/null
> >> +++ b/hw/intc/xive.c
> >> @@ -0,0 +1,335 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/log.h"
> >> +#include "qapi/error.h"
> >> +#include "target/ppc/cpu.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "sysemu/dma.h"
> >> +#include "monitor/monitor.h"
> >> +#include "hw/ppc/xive.h"
> >> +
> >> +/*
> >> + * XIVE Interrupt Source
> >> + */
> >> +
> >> +uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    uint32_t byte = srcno / 4;
> >> +    uint32_t bit  = (srcno % 4) * 2;
> >> +
> >> +    assert(byte < xsrc->sbe_size);
> >> +
> >> +    return (xsrc->sbe[byte] >> bit) & 0x3;
> >> +}
> >> +
> >> +uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
> >> +{
> >> +    uint32_t byte = srcno / 4;
> >> +    uint32_t bit  = (srcno % 4) * 2;
> >> +    uint8_t old, new;
> >> +
> >> +    assert(byte < xsrc->sbe_size);
> >> +
> >> +    old = xsrc->sbe[byte];
> >> +
> >> +    new = xsrc->sbe[byte] & ~(0x3 << bit);
> >> +    new |= (pq & 0x3) << bit;
> >> +
> >> +    xsrc->sbe[byte] = new;
> >> +
> >> +    return (old >> bit) & 0x3;
> >> +}
> >> +
> >> +static bool xive_source_pq_eoi(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> >> +
> >> +    switch (old_pq) {
> >> +    case XIVE_ESB_RESET:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
> >> +        return false;
> >> +    case XIVE_ESB_PENDING:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
> >> +        return false;
> >> +    case XIVE_ESB_QUEUED:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> >> +        return true;
> >> +    case XIVE_ESB_OFF:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
> >> +        return false;
> >> +    default:
> >> +         g_assert_not_reached();
> >> +    }
> >> +}
> >> +
> >> +/*
> >> + * Returns whether the event notification should be forwarded.
> >> + */
> >> +static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> >> +
> >> +    switch (old_pq) {
> >> +    case XIVE_ESB_RESET:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> >> +        return true;
> >> +    case XIVE_ESB_PENDING:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
> >> +        return false;
> >> +    case XIVE_ESB_QUEUED:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
> >> +        return false;
> >> +    case XIVE_ESB_OFF:
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
> >> +        return false;
> >> +    default:
> >> +         g_assert_not_reached();
> >> +    }
> >> +}
> >> +
> >> +/*
> >> + * Forward the source event notification to the associated XiveFabric,
> >> + * the device owning the sources.
> >> + */
> >> +static void xive_source_notify(XiveSource *xsrc, int srcno)
> >> +{
> >> +
> >> +}
> >> +
> >> +/* In a two pages ESB MMIO setting, even page is the trigger page, odd
> >> + * page is for management */
> > 
> > Can I understand from this that the result from this function is only
> > meaningful in 2-pages mode?
> 
> Yes. Maybe I should rename it to xive_source_is_even_page()?

Seems very long-winded.  Maybe keep the name, but have it check both
whether it's an even page and whether you're in 2-page mode.

> >> +static inline bool xive_source_is_trigger_page(hwaddr addr)
> >> +{
> >> +    return !((addr >> 16) & 1);
> > 
> > Later on you seem to have both 4k and 64k variants listed, but here you
> > hardcode 64k.  Is that a problem?
> 	
> Oops. This should be:
> 
> 	(addr >> (xsrc->esb_shift - 1))
> 
> I tested with the sPAPR guest, which uses 64K ESB MMIO pages.
> The check is only significant in a two-page setting.

Ok.
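Combining both points (derive the page-select bit from esb_shift
rather than hardcoding 64K, and fold the two-page check into the
predicate) could look like the minimal standalone sketch below.
SketchXsrc and the sketch_* functions are hypothetical; the real
helper would take the XiveSource pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;    /* local stand-in for QEMU's hwaddr */

/* Hypothetical subset of XiveSource used by this sketch */
typedef struct {
    uint32_t esb_shift;     /* 12, 13, 16 or 17 */
} SketchXsrc;

static bool sketch_esb_2page(const SketchXsrc *xsrc)
{
    return xsrc->esb_shift == 13 || xsrc->esb_shift == 17;
}

/*
 * True only in two-page mode and only for the even (first) page of a
 * source's ESB slot. The page-select bit is esb_shift - 1, so no page
 * size is hardcoded.
 */
static bool sketch_is_trigger_page(const SketchXsrc *xsrc, hwaddr addr)
{
    return sketch_esb_2page(xsrc) &&
           !((addr >> (xsrc->esb_shift - 1)) & 1);
}
```

Because the predicate already returns false outside two-page mode,
callers would no longer need to pair it with a separate
xive_source_esb_2page() test.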

> >> +}
> >> +
> >> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> >> +    uint32_t offset = addr & 0xF00;
> > 
> > You ignore the low bits of the address entirely, so effectively you have
> > a 256-byte range that's all aliases of the same register.  Is that
> > intentional?
> 
> Yes, but it's not entirely correct. The exact ranges are:
> 
> 			Load			Store
> 
> 0x000 .. 0x3FF		EOI and return 0|1	Trigger
> 0x400 .. 0x7FF		EOI and return 0|1	EOI
> 0x800 .. 0xBFF		return PQ		undefined
> 0xC00 .. 0xCFF		return PQ and PQ=00	PQ=00
> 0xD00 .. 0xDFF		return PQ and PQ=01	PQ=01
> 0xE00 .. 0xEFF		return PQ and PQ=10	PQ=10
> 0xF00 .. 0xFFF		return PQ and PQ=11	PQ=11
> 
> There is room for improvement.
> 
> In a two-page ESB MMIO setting, the trigger page only triggers on stores.

Ok.

> >> +    uint32_t srcno = addr >> xsrc->esb_shift;
> >> +    uint64_t ret = -1;
> >> +
> >> +    if (xive_source_esb_2page(xsrc) && xive_source_is_trigger_page(addr)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR,
> >> +                      "XIVE: invalid load on IRQ %d trigger page at "
> >> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
> >> +        return -1;
> >> +    }
> >> +
> >> +    switch (offset) {
> >> +    case XIVE_ESB_LOAD_EOI:
> >> +        /*
> >> +         * Load EOI is not the default source setting under QEMU, but
> >> +         * this is what HW uses currently.
> >> +         */
> >> +        ret = xive_source_pq_eoi(xsrc, srcno);
> > 
> > You're implicitly casting a bool return value into a u64 here, is that
> > intentional?
> 
> Yes. Is that bad? This is what the load is supposed to return in the transition
> algorithm.

No, that's fine, as long as using just the LSB in your return is
correct.  Just making sure I understood.

> >> +
> >> +        break;
> >> +
> >> +    case XIVE_ESB_GET:
> >> +        ret = xive_source_pq_get(xsrc, srcno);
> >> +        break;
> >> +
> >> +    case XIVE_ESB_SET_PQ_00:
> >> +    case XIVE_ESB_SET_PQ_01:
> >> +    case XIVE_ESB_SET_PQ_10:
> >> +    case XIVE_ESB_SET_PQ_11:
> >> +        ret = xive_source_pq_set(xsrc, srcno, (offset >> 8) & 0x3);
> >> +        break;
> >> +    default:
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static void xive_source_esb_write(void *opaque, hwaddr addr,
> >> +                                 uint64_t value, unsigned size)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> >> +    uint32_t offset = addr & 0xF00;
> >> +    uint32_t srcno = addr >> xsrc->esb_shift;
> >> +    bool notify = false;
> >> +
> >> +    switch (offset) {
> >> +    case 0:
> >> +        notify = xive_source_pq_trigger(xsrc, srcno);
> >> +        break;
> >> +
> >> +    case XIVE_ESB_STORE_EOI:
> >> +        if (xive_source_is_trigger_page(addr)) {
> >> +            qemu_log_mask(LOG_GUEST_ERROR,
> >> +                          "XIVE: invalid store on IRQ %d trigger page at "
> >> +                          "0x%"HWADDR_PRIx"\n", srcno, addr);
> >> +            return;
> >> +        }
> >> +
> >> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
> >> +            qemu_log_mask(LOG_GUEST_ERROR,
> >> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
> >> +            return;
> >> +        }
> >> +
> >> +        /* If the Q bit is set, we should forward a new source event
> >> +         * notification
> >> +         */
> >> +        notify = xive_source_pq_eoi(xsrc, srcno);
> >> +        break;
> >> +
> >> +    default:
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
> >> +                      offset);
> >> +        return;
> >> +    }
> >> +
> >> +    /* Forward the source event notification for routing */
> >> +    if (notify) {
> >> +        xive_source_notify(xsrc, srcno);
> >> +    }
> > 
> > EOI via this path calls notify, but the one via the read path
> > doesn't.  Is that correct?
> 
> No. I paid attention to the one-page ESB MMIO setting with Store EOI
> in emulated mode, but not enough to the two-page ESB MMIO setting.
> This is a late change made for KVM compatibility. I will fix it.

Ok.
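One way to make the two EOI paths consistent is to route both through a single helper that also forwards the re-queued event. This is a hypothetical sketch on a one-source model (`Src`, `src_eoi()` and the other names are illustrative, not the patch's API); it only shows that load EOI and store EOI then behave identically apart from the returned value.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical minimal model: a single source and its PQ bits. */
typedef struct {
    uint8_t  pq;         /* bit 1 = P, bit 0 = Q */
    unsigned notified;   /* number of events forwarded for routing */
} Src;

static void src_notify(Src *s)
{
    s->notified++;
}

/* Common EOI: returns true when a queued event is re-forwarded. */
static bool src_eoi(Src *s)
{
    bool requeue = false;

    assert(s->pq <= 0x3);
    if (s->pq == 0x3) {          /* P|Q: an event was queued */
        s->pq = 0x2;             /* stay pending for the queued event */
        requeue = true;
    } else if (s->pq == 0x2) {
        s->pq = 0x0;             /* back to reset */
    }                            /* reset (00) and off (01) unchanged */

    if (requeue) {
        src_notify(s);           /* same forwarding on every EOI path */
    }
    return requeue;
}

/* Load EOI: the guest reads back 0 or 1. */
static uint64_t src_load_eoi(Src *s)
{
    return src_eoi(s);
}

/* Store EOI: the written value is ignored. */
static void src_store_eoi(Src *s)
{
    src_eoi(s);
}
```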

> > 
> >> +}
> >> +
> >> +static const MemoryRegionOps xive_source_esb_ops = {
> >> +    .read = xive_source_esb_read,
> >> +    .write = xive_source_esb_write,
> >> +    .endianness = DEVICE_BIG_ENDIAN,
> >> +    .valid = {
> >> +        .min_access_size = 8,
> >> +        .max_access_size = 8,
> >> +    },
> >> +    .impl = {
> >> +        .min_access_size = 8,
> >> +        .max_access_size = 8,
> >> +    },
> >> +};
> >> +
> >> +static void xive_source_set_irq(void *opaque, int srcno, int val)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> >> +    bool notify = false;
> >> +
> >> +    if (val) {
> >> +        notify = xive_source_pq_trigger(xsrc, srcno);
> >> +    }
> >> +
> >> +    /* Forward the source event notification for routing */
> >> +    if (notify) {
> >> +        xive_source_notify(xsrc, srcno);
> >> +    }
> >> +}
> >> +
> >> +void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >> +{
> >> +    int i;
> >> +
> >> +    monitor_printf(mon, "XIVE Source %6x ..%6x\n",
> >> +                   xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        uint8_t pq = xive_source_pq_get(xsrc, i);
> >> +        uint32_t lisn = i  + xsrc->offset;
> >> +
> >> +        if (pq == XIVE_ESB_OFF) {
> >> +            continue;
> >> +        }
> >> +
> >> +        monitor_printf(mon, "  %4x %c%c\n", lisn,
> >> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >> +    }
> >> +}
> >> +
> >> +static void xive_source_reset(DeviceState *dev)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +
> >> +    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
> >> +    memset(xsrc->sbe, 0x55, xsrc->sbe_size);
> >> +}
> >> +
> >> +static void xive_source_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +
> >> +    if (!xsrc->nr_irqs) {
> >> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
> >> +        return;
> >> +    }
> >> +
> >> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
> >> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
> >> +        xsrc->esb_shift != XIVE_ESB_64K &&
> >> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
> >> +        error_setg(errp, "Invalid ESB shift setting");
> >> +        return;
> >> +    }
> >> +
> >> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> >> +                                     xsrc->nr_irqs);
> >> +
> >> +    /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
> >> +    xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
> >> +    xsrc->sbe = g_malloc0(xsrc->sbe_size);
> >> +
> >> +    /* TODO: H_INT_ESB support, which removes the ESB MMIOs */
> >> +
> >> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >> +                          &xive_source_esb_ops, xsrc, "xive.esb",
> >> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> >> +}
> >> +
> >> +static const VMStateDescription vmstate_xive_source = {
> >> +    .name = TYPE_XIVE_SOURCE,
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField[]) {
> >> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
> >> +        VMSTATE_VBUFFER_UINT32(sbe, XiveSource, 1, NULL, sbe_size),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +/*
> >> + * The default XIVE interrupt source setting for ESB MMIO is two 64k
> >> + * pages without Store EOI. This is in sync with KVM.
> >> + */
> >> +static Property xive_source_properties[] = {
> >> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
> >> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
> >> +    DEFINE_PROP_UINT64("bar", XiveSource, esb_base, 0),
> > 
> > Isn't this redundant with however the base address is handled through
> > the SysBusDevice stuff (I forget the details)?
> 
> Storing the ESB MMIO base address under the XiveSource object is
> convenient later on in the h_int_get_source_info hcall, which makes
> use of the helpers:
> 
> 	hwaddr xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> 	hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
> 
> But it is only used in that place. So we could just store the ESB 
> MMIO base address under the sPAPRXive controller. This makes some
> sense in the design, as we have to inform KVM of this address with
> a KVM device ioctl.

Well.. I really dislike the idea that you could change the actual MMIO
mapping address in one place, but other bits of code would still think
it was mapped somewhere else.
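A way to keep the two from drifting apart is to store a single base and derive every per-source address from it. The sketch below assumes `esb_base` is the address at which the ESB region is really mapped; `EsbLayout`, `esb_trigger_addr()` and `esb_mgmt_addr()` are hypothetical stand-ins for the `xive_source_esb_trigger()`/`xive_source_esb_mgmt()` helpers mentioned above, not their actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr_t;  /* stand-in for QEMU's hwaddr */

/* Hypothetical descriptor: esb_base must be the one mapped address. */
typedef struct {
    hwaddr_t esb_base;
    uint32_t esb_shift;  /* log2 of the per-source ESB span */
    bool     two_pages;  /* trigger page followed by management page */
    uint32_t nr_irqs;
} EsbLayout;

/* Address a guest stores to in order to trigger srcno. */
static hwaddr_t esb_trigger_addr(const EsbLayout *l, uint32_t srcno)
{
    assert(srcno < l->nr_irqs);
    return l->esb_base + ((hwaddr_t)srcno << l->esb_shift);
}

/* Management page address: the odd page in 2-page mode. */
static hwaddr_t esb_mgmt_addr(const EsbLayout *l, uint32_t srcno)
{
    hwaddr_t addr = esb_trigger_addr(l, srcno);

    return l->two_pages ? addr + ((hwaddr_t)1 << (l->esb_shift - 1)) : addr;
}
```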

> >> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void xive_source_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +
> >> +    dc->realize = xive_source_realize;
> >> +    dc->reset = xive_source_reset;
> >> +    dc->props = xive_source_properties;
> >> +    dc->desc = "XIVE interrupt source";
> >> +    dc->vmsd = &vmstate_xive_source;
> >> +}
> >> +
> >> +static const TypeInfo xive_source_info = {
> >> +    .name          = TYPE_XIVE_SOURCE,
> >> +    .parent        = TYPE_SYS_BUS_DEVICE,
> >> +    .instance_size = sizeof(XiveSource),
> >> +    .class_init    = xive_source_class_init,
> >> +};
> >> +
> >> +static void xive_register_types(void)
> >> +{
> >> +    type_register_static(&xive_source_info);
> >> +}
> >> +
> >> +type_init(xive_register_types)
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> new file mode 100644
> >> index 000000000000..d92a50519edf
> >> --- /dev/null
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -0,0 +1,130 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef PPC_XIVE_H
> >> +#define PPC_XIVE_H
> >> +
> >> +#include "hw/sysbus.h"
> >> +
> >> +/*
> >> + * XIVE Interrupt Source
> >> + */
> >> +
> >> +#define TYPE_XIVE_SOURCE "xive-source"
> >> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
> >> +
> >> +/*
> >> + * XIVE Source Interrupt source characteristics, which define how the
> >> + * ESB are controlled.
> >> + */
> >> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
> >> +#define XIVE_SRC_STORE_EOI     0x4 /* Store EOI supported */
> >> +
> >> +typedef struct XiveSource {
> >> +    SysBusDevice parent;
> >> +
> >> +    /* IRQs */
> >> +    uint32_t     nr_irqs;
> >> +    uint32_t     offset;
> >> +    qemu_irq     *qirqs;
> >> +
> >> +    /* PQ bits */
> >> +    uint8_t      *sbe;
> >> +    uint32_t     sbe_size;
> >> +
> >> +    /* ESB memory region */
> >> +    uint64_t     esb_flags;
> >> +    hwaddr       esb_base;
> >> +    uint32_t     esb_shift;
> >> +    MemoryRegion esb_mmio;
> >> +} XiveSource;
> >> +
> >> +/*
> >> + * ESB MMIO setting. Can be one page, for both source triggering and
> >> + * source management, or two different pages. See below for magic
> >> + * values.
> >> + */
> >> +#define XIVE_ESB_4K          12 /* PSI HB */
> >> +#define XIVE_ESB_4K_2PAGE    17
> > 
> > Should this be 13 instead of 17?
> 
> Oops. Obviously this is not used.
> 
> >> +#define XIVE_ESB_64K         16
> >> +#define XIVE_ESB_64K_2PAGE   17
> > 
> > (Also, who the hell comes up with a brand new PIC and decides to have
> > *4* different interface variants.  But that's not your problem, I
> > realise).
> 
> These are HW constraints of the different controllers which need to
> expose sources: PSI, PHB4. The internal sources of the XIVE interrupt
> controller can be configured to use 4K or 64K pages, but I doubt 4K
> will ever be used.

Sure, the hardware is different, but *why* is it different.  This is a
brand new design, you'd think they could come up with one variant that
works for all the cases.
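The shift values are related: a 2-page setting spans two pages, so its shift is the single-page shift plus one, which gives 13 rather than 17 for the 4K case questioned above. A minimal sketch of validation and page-size helpers under that assumption (the enum and function names are hypothetical):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A 2-page span is twice the page size: shift = page shift + 1. */
enum {
    ESB_4K        = 12,
    ESB_4K_2PAGE  = 13,
    ESB_64K       = 16,
    ESB_64K_2PAGE = 17,
};

static bool esb_shift_valid(uint32_t shift)
{
    return shift == ESB_4K || shift == ESB_4K_2PAGE ||
           shift == ESB_64K || shift == ESB_64K_2PAGE;
}

static bool esb_is_2page(uint32_t shift)
{
    return shift == ESB_4K_2PAGE || shift == ESB_64K_2PAGE;
}

/* Size in bytes of one ESB page (trigger or management). */
static uint64_t esb_page_size(uint32_t shift)
{
    assert(esb_shift_valid(shift));
    return (uint64_t)1 << (esb_is_2page(shift) ? shift - 1 : shift);
}
```

With the original value of 17, the 4K and 64K 2-page settings would be indistinguishable.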

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
@ 2018-04-23  6:44   ` David Gibson
  2018-04-23  7:31     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-23  6:44 UTC
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
> The 'sent' status of the LSI interrupt source is modeled with the 'P'
> bit of the ESB and the assertion status of the source is maintained in
> an array under the main sPAPRXive object. The type of the source is
> stored in the same array for practical reasons.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
>  include/hw/ppc/xive.h | 16 +++++++++++++++
>  2 files changed, 66 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index c70578759d02..060976077dd7 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
>  
>  }
>  
> +/*
> + * LSI interrupt sources use the P bit and a custom assertion flag
> + */
> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
> +{
> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> +
> +    if  (old_pq == XIVE_ESB_RESET &&
> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> +        return true;
> +    }
> +    return false;
> +}
> +
>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
>   * page is for management */
>  static inline bool xive_source_is_trigger_page(hwaddr addr)
> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>           */
>          ret = xive_source_pq_eoi(xsrc, srcno);
>  
> +        /* If the LSI source is still asserted, forward a new source
> +         * event notification */
> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
> +                xive_source_notify(xsrc, srcno);
> +            }
> +        }
>          break;
>  
>      case XIVE_ESB_GET:
> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
>           * notification
>           */
>          notify = xive_source_pq_eoi(xsrc, srcno);
> +
> +        /* LSI sources do not set the Q bit but they can still be
> +         * asserted, in which case we should forward a new source
> +         * event notification
> +         */
> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> +            notify = xive_source_lsi_trigger(xsrc, srcno);
> +        }
>          break;
>  
>      default:
> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>      bool notify = false;
>  
> -    if (val) {
> -        notify = xive_source_pq_trigger(xsrc, srcno);
> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> +        if (val) {
> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> +        } else {
> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> +        }
> +        notify = xive_source_lsi_trigger(xsrc, srcno);
> +    } else {
> +        if (val) {
> +            notify = xive_source_pq_trigger(xsrc, srcno);
> +        }
>      }
>  
>      /* Forward the source event notification for routing */
> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
>      for (i = 0; i < xsrc->nr_irqs; i++) {
>          uint8_t pq = xive_source_pq_get(xsrc, i);
> -        uint32_t lisn = i  + xsrc->offset;
>  
>          if (pq == XIVE_ESB_OFF) {
>              continue;
>          }
>  
> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>      }
> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>  static void xive_source_reset(DeviceState *dev)
>  {
>      XiveSource *xsrc = XIVE_SOURCE(dev);
> +    int i;
> +
> +    /* Keep the IRQ type */
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
> +    }
>  
>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>  
>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>                                       xsrc->nr_irqs);
> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
>  
>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index d92a50519edf..0b76dd278d9b 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -33,6 +33,9 @@ typedef struct XiveSource {
>      uint32_t     nr_irqs;
>      uint32_t     offset;
>      qemu_irq     *qirqs;
> +#define XIVE_STATUS_LSI         0x1
> +#define XIVE_STATUS_ASSERTED    0x2
> +    uint8_t      *status;

I don't love the idea of mixing configuration information (STATUS_LSI)
with runtime state information (ASSERTED) in the same field.  Any
reason not to have these as parallel bitmaps?

Come to that.. is there a compelling reason to allow any individual
irq to be marked LSI or MSI, rather than using separate XiveSource
objects for MSIs and LSIs?
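As an illustration of the parallel-bitmap layout, the configuration (LSI) and the runtime state (asserted) can live in separate bitmaps, so reset clears one without masking the other. This is a standalone sketch with hypothetical names and a fixed `NR_IRQS`; inside QEMU the existing helpers from `qemu/bitmap.h` would presumably be used instead of hand-rolled ones.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define NR_IRQS 64
#define BPW     64   /* bits per bitmap word */
#define NWORDS  ((NR_IRQS + BPW - 1) / BPW)

/* Hypothetical layout: configuration and runtime state kept apart. */
typedef struct {
    uint64_t lsi_map[NWORDS];       /* config: source is an LSI */
    uint64_t asserted_map[NWORDS];  /* runtime: LSI line is high */
} SrcState;

static void bit_assign(uint64_t *map, uint32_t n, bool val)
{
    assert(n < NR_IRQS);
    if (val) {
        map[n / BPW] |= (uint64_t)1 << (n % BPW);
    } else {
        map[n / BPW] &= ~((uint64_t)1 << (n % BPW));
    }
}

static bool bit_test(const uint64_t *map, uint32_t n)
{
    assert(n < NR_IRQS);
    return (map[n / BPW] >> (n % BPW)) & 1;
}

/* Reset clears only runtime state; the LSI/MSI configuration survives
 * without a per-IRQ masking loop. */
static void src_state_reset(SrcState *s)
{
    memset(s->asserted_map, 0, sizeof(s->asserted_map));
}
```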

>      /* PQ bits */
>      uint8_t      *sbe;

.. and come to that is there a reason to keep the ASSERTED bit in a
separate array from sbe?  AFAICT the actual 2-bit-per-irq layout is
never exposed to the guests.

Or, even re-use the Q bit for asserted in LSIs (but report it as
always 0 in the register read/write path).
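A minimal sketch of the Q-bit variant for a single LSI: the internal Q bit records the line level, P records an event in flight, EOI re-notifies while the line is still high, and the guest-visible PQ masks Q to 0. The names are hypothetical, and the sketch deliberately ignores the OFF state (PQ=01), whose encoding collides with "asserted, not pending" under this scheme; a real implementation would need another way to mask the source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VAL_P 0x2
#define VAL_Q 0x1  /* for an LSI: reused as the "asserted" flag */

/* Hypothetical single-LSI model; pq holds the internal 2-bit state. */
typedef struct {
    uint8_t  pq;
    unsigned notified;
} Lsi;

/* Forward an event if the line is high and nothing is in flight. */
static void lsi_try_trigger(Lsi *l)
{
    if ((l->pq & VAL_Q) && !(l->pq & VAL_P)) {
        l->pq |= VAL_P;
        l->notified++;
    }
}

static void lsi_set_irq(Lsi *l, bool level)
{
    l->pq = level ? (l->pq | VAL_Q) : (l->pq & ~VAL_Q);
    lsi_try_trigger(l);
}

/* EOI clears P and re-notifies if the line is still high. */
static void lsi_eoi(Lsi *l)
{
    l->pq &= ~VAL_P;
    lsi_try_trigger(l);
}

/* Guest-visible PQ for an LSI: Q always reads as 0. */
static uint8_t lsi_read_pq(const Lsi *l)
{
    assert(l->pq <= 0x3);
    return l->pq & VAL_P;
}
```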

> @@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>  
>  void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
>  
> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    return xsrc->status[srcno] & XIVE_STATUS_LSI;
> +}
> +
> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> +                                       bool lsi)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
> +}
> +
>  #endif /* PPC_XIVE_H */

* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
@ 2018-04-23  6:46   ` David Gibson
  2018-04-23  7:58     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-23  6:46 UTC
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
> The XiveFabric offers a simple interface, between the XiveSource
> object and the device model owning the interrupt sources, to forward
> an event notification to the XIVE interrupt controller of the machine
> or, if the owner is the controller itself, to call the routing
> sub-engine directly.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
>  2 files changed, 61 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 060976077dd7..b4c3d06c1219 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -17,6 +17,21 @@
>  #include "hw/ppc/xive.h"
>  
>  /*
> + * XIVE Fabric
> + */
> +
> +static void xive_fabric_route(XiveFabric *xf, int lisn)
> +{
> +
> +}
> +
> +static const TypeInfo xive_fabric_info = {
> +    .name = TYPE_XIVE_FABRIC,
> +    .parent = TYPE_INTERFACE,
> +    .class_size = sizeof(XiveFabricClass),
> +};
> +
> +/*
>   * XIVE Interrupt Source
>   */
>  
> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>  
>  /*
>   * Forward the source event notification to the associated XiveFabric,
> - * the device owning the sources.
> + * the device owning the sources, or perform the routing if the device
> + * is the interrupt controller.
>   */
>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>  {
>  
> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
> +
> +    if (xfc->notify) {
> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
> +    } else {
> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
> +    }

Why 2 cases?  Can't the XiveFabric object just make its notify equal
to xive_fabric_route if that's what it wants?
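In QOM terms, this would mean the controller's class_init sets the `notify` hook to its routing function, so the source always calls the hook without a fallback branch. A standalone sketch of that shape, with hypothetical `Fabric`/`FabricClass` types standing in for `XiveFabric`/`XiveFabricClass`:

```c
#include <assert.h>
#include <stdint.h>

typedef struct Fabric Fabric;

/* Hypothetical stand-in for XiveFabricClass: one mandatory hook. */
typedef struct {
    void (*notify)(Fabric *xf, uint32_t lisn);
} FabricClass;

struct Fabric {
    const FabricClass *klass;
    uint32_t last_routed;
    unsigned routed;
};

/* Routing as the interrupt controller itself would do it. */
static void fabric_route(Fabric *xf, uint32_t lisn)
{
    xf->last_routed = lisn;
    xf->routed++;
}

/* The controller installs its router as the notify hook ... */
static const FabricClass controller_class = { .notify = fabric_route };

/* ... so the source needs no special case. */
static void source_notify(Fabric *xf, uint32_t srcno, uint32_t offset)
{
    assert(xf->klass && xf->klass->notify);
    xf->klass->notify(xf, srcno + offset);
}
```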

>  }
>  
>  /*
> @@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
>  static void xive_source_realize(DeviceState *dev, Error **errp)
>  {
>      XiveSource *xsrc = XIVE_SOURCE(dev);
> +    Object *obj;
> +    Error *local_err = NULL;
> +
> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> +    if (!obj) {
> +        error_propagate(errp, local_err);
> +        error_prepend(errp, "required link 'xive' not found: ");
> +        return;
> +    }
> +
> +    xsrc->xive = XIVE_FABRIC(obj);
>  
>      if (!xsrc->nr_irqs) {
>          error_setg(errp, "Number of interrupt needs to be greater than 0");
> @@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
>  static void xive_register_types(void)
>  {
>      type_register_static(&xive_source_info);
> +    type_register_static(&xive_fabric_info);
>  }
>  
>  type_init(xive_register_types)
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 0b76dd278d9b..4fcae2c763e6 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -12,6 +12,8 @@
>  
>  #include "hw/sysbus.h"
>  
> +typedef struct XiveFabric XiveFabric;
> +
>  /*
>   * XIVE Interrupt Source
>   */
> @@ -46,6 +48,8 @@ typedef struct XiveSource {
>      hwaddr       esb_base;
>      uint32_t     esb_shift;
>      MemoryRegion esb_mmio;
> +
> +    XiveFabric   *xive;
>  } XiveSource;
>  
>  /*
> @@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>      xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>  }
>  
> +/*
> + * XIVE Fabric
> + */
> +
> +typedef struct XiveFabric {
> +    Object parent;
> +} XiveFabric;
> +
> +#define TYPE_XIVE_FABRIC "xive-fabric"
> +#define XIVE_FABRIC(obj)                                     \
> +    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
> +#define XIVE_FABRIC_CLASS(klass)                                     \
> +    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
> +#define XIVE_FABRIC_GET_CLASS(obj)                                   \
> +    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
> +
> +typedef struct XiveFabricClass {
> +    InterfaceClass parent;
> +    void (*notify)(XiveFabric *xf, uint32_t lisn);
> +} XiveFabricClass;
> +
>  #endif /* PPC_XIVE_H */

* Re: [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model
  2018-04-23  3:59       ` David Gibson
@ 2018-04-23  7:11         ` Cédric Le Goater
  2018-04-24  1:24           ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-23  7:11 UTC
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/23/2018 05:59 AM, David Gibson wrote:
> On Fri, Apr 20, 2018 at 10:27:21AM +0200, Cédric Le Goater wrote:
>> On 04/20/2018 09:10 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 02:42:57PM +0200, Cédric Le Goater wrote:
>>>> Each XIVE interrupt source is associated with a two-bit state machine
>>>> called an Event State Buffer (ESB): the first bit "P" means that an
>>>> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
>>>> means a new interrupt was triggered while another was still pending.
>>>>
>>>> When an event is triggered, the associated interrupt state bits are
>>>> fetched and modified and forwarded to the virtualization engine of the
>>>> controller doing the routing. These can also be controlled by MMIO, to
>>>> trigger events or turn off the sources for instance. See code for more
>>>> details on the states and transitions.
>>>>
>>>> On a sPAPR machine, the OS will obtain the address of the MMIO page of
>>>> the ESB entry associated with a source and its characteristics using
>>>> the H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is
>>>> used.
>>>>
>>>> The xive_source_notify() routine is in charge of forwarding the source
>>>> event notification to the routing engine. It will be filled in later on.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  Changes since v2:
>>>>
>>>>  - added support for Store EOI
>>>>  - added support for two page MMIO setting like on KVM
>>>
>>> Looks generally sane to me, though I have a few queries.
>>>
>>>>
>>>>  default-configs/ppc64-softmmu.mak |   1 +
>>>>  hw/intc/Makefile.objs             |   1 +
>>>>  hw/intc/xive.c                    | 335 ++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/ppc/xive.h             | 130 +++++++++++++++
>>>>  4 files changed, 467 insertions(+)
>>>>  create mode 100644 hw/intc/xive.c
>>>>  create mode 100644 include/hw/ppc/xive.h
>>>>
>>>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>>>> index b94af6c7c62a..c6d13e757977 100644
>>>> --- a/default-configs/ppc64-softmmu.mak
>>>> +++ b/default-configs/ppc64-softmmu.mak
>>>> @@ -16,4 +16,5 @@ CONFIG_VIRTIO_VGA=y
>>>>  CONFIG_XICS=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>>> +CONFIG_XIVE=$(CONFIG_PSERIES)
>>>>  CONFIG_MEM_HOTPLUG=y
>>>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>>>> index 0e9963f5eecc..72a46ed91c31 100644
>>>> --- a/hw/intc/Makefile.objs
>>>> +++ b/hw/intc/Makefile.objs
>>>> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>>>>  obj-$(CONFIG_XICS) += xics.o
>>>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>>> +obj-$(CONFIG_XIVE) += xive.o
>>>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> new file mode 100644
>>>> index 000000000000..c70578759d02
>>>> --- /dev/null
>>>> +++ b/hw/intc/xive.c
>>>> @@ -0,0 +1,335 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/log.h"
>>>> +#include "qapi/error.h"
>>>> +#include "target/ppc/cpu.h"
>>>> +#include "sysemu/cpus.h"
>>>> +#include "sysemu/dma.h"
>>>> +#include "monitor/monitor.h"
>>>> +#include "hw/ppc/xive.h"
>>>> +
>>>> +/*
>>>> + * XIVE Interrupt Source
>>>> + */
>>>> +
>>>> +uint8_t xive_source_pq_get(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    uint32_t byte = srcno / 4;
>>>> +    uint32_t bit  = (srcno % 4) * 2;
>>>> +
>>>> +    assert(byte < xsrc->sbe_size);
>>>> +
>>>> +    return (xsrc->sbe[byte] >> bit) & 0x3;
>>>> +}
>>>> +
>>>> +uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
>>>> +{
>>>> +    uint32_t byte = srcno / 4;
>>>> +    uint32_t bit  = (srcno % 4) * 2;
>>>> +    uint8_t old, new;
>>>> +
>>>> +    assert(byte < xsrc->sbe_size);
>>>> +
>>>> +    old = xsrc->sbe[byte];
>>>> +
>>>> +    new = xsrc->sbe[byte] & ~(0x3 << bit);
>>>> +    new |= (pq & 0x3) << bit;
>>>> +
>>>> +    xsrc->sbe[byte] = new;
>>>> +
>>>> +    return (old >> bit) & 0x3;
>>>> +}
>>>> +
>>>> +static bool xive_source_pq_eoi(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>>>> +
>>>> +    switch (old_pq) {
>>>> +    case XIVE_ESB_RESET:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
>>>> +        return false;
>>>> +    case XIVE_ESB_PENDING:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_RESET);
>>>> +        return false;
>>>> +    case XIVE_ESB_QUEUED:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>>>> +        return true;
>>>> +    case XIVE_ESB_OFF:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
>>>> +        return false;
>>>> +    default:
>>>> +         g_assert_not_reached();
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * Returns whether the event notification should be forwarded.
>>>> + */
>>>> +static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>>>> +
>>>> +    switch (old_pq) {
>>>> +    case XIVE_ESB_RESET:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>>>> +        return true;
>>>> +    case XIVE_ESB_PENDING:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
>>>> +        return false;
>>>> +    case XIVE_ESB_QUEUED:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_QUEUED);
>>>> +        return false;
>>>> +    case XIVE_ESB_OFF:
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_OFF);
>>>> +        return false;
>>>> +    default:
>>>> +         g_assert_not_reached();
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * Forward the source event notification to the associated XiveFabric,
>>>> + * the device owning the sources.
>>>> + */
>>>> +static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>> +{
>>>> +
>>>> +}
>>>> +
>>>> +/* In a two pages ESB MMIO setting, even page is the trigger page, odd
>>>> + * page is for management */
>>>
>>> Can I understand from this that the result from this function is only
>>> meaningful in 2-pages mode?
>>
>> Yes. Maybe I should rename it to xive_source_is_even_page()?
> 
> Seems very long-winded.  Maybe keep the name but have it check both
> whether it's an even page and also whether you're in 2pages mode.

yes. I had that in mind.
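A sketch of the combined check, assuming the page size is `1 << (esb_shift - 1)` in 2-page mode (shift 13 for 4K pages, 17 for 64K pages), which also avoids hardcoding the 64K page; `EsbSrc` and the helper names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical subset of the XiveSource fields involved. */
typedef struct {
    uint32_t esb_shift;  /* 13 or 17 in 2-page mode, 12 or 16 otherwise */
} EsbSrc;

static bool esb_2page(const EsbSrc *s)
{
    return s->esb_shift == 13 || s->esb_shift == 17;
}

/*
 * True only in a 2-page setting when addr falls in the even (trigger)
 * page of its source.
 */
static bool esb_is_trigger_page(const EsbSrc *s, uint64_t addr)
{
    assert(s->esb_shift > 0);
    return esb_2page(s) && !((addr >> (s->esb_shift - 1)) & 1);
}
```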

> 
>>>> +static inline bool xive_source_is_trigger_page(hwaddr addr)
>>>> +{
>>>> +    return !((addr >> 16) & 1);
>>>
>>> Later on you seem to have both 4k and 64k variants listed, but here you
>>> hardcode 64k.  Is that a problem?
>> 	
>> Oops. This should be:
>>
>> 	(addr >> (xsrc->esb_shift - 1))
>>
>> I did the tests with the spapr guest which uses 64k ESB MMIO pages. 
>> The check is only significant in a 2 pages setting.
> 
> Ok.
> 
>>>> +}
>>>> +
>>>> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>> +    uint32_t offset = addr & 0xF00;
>>>
>>> You ignore the low bits of the address entirely, so effectively you have
>>> a 256 byte range that's all aliases of the same register.  Is that
>>> intentional?
>>
>> yes but it's not entirely correct. The exact ranges are :
>>
>> 			Load			Store
>>
>> 0x000 .. 0x3FF		EOI and return 0|1	Trigger
>> 0x400 .. 0x7FF		EOI and return 0|1	EOI
>> 0x800 .. 0xBFF		return PQ		undefined
>> 0xC00 .. 0xCFF		return PQ and PQ=00	PQ=00
>> 0xD00 .. 0xDFF		return PQ and PQ=01	PQ=01
>> 0xE00 .. 0xEFF		return PQ and PQ=10	PQ=10
>> 0xF00 .. 0xFFF		return PQ and PQ=11	PQ=11
>>
>> There is room for some improvements.
>>
>> The trigger page in a two-page ESB MMIO setting only triggers on stores.
> 
> Ok.

I have fixed the ranges and added the above table as a comment in the code.
It helps in understanding what the MMIOs are.

> 
>>>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>>>> +    uint64_t ret = -1;
>>>> +
>>>> +    if (xive_source_esb_2page(xsrc) && xive_source_is_trigger_page(addr)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR,
>>>> +                      "XIVE: invalid load on IRQ %d trigger page at "
>>>> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    switch (offset) {
>>>> +    case XIVE_ESB_LOAD_EOI:
>>>> +        /*
>>>> +         * Load EOI is not the default source setting under QEMU, but
>>>> +         * this is what HW uses currently.
>>>> +         */
>>>> +        ret = xive_source_pq_eoi(xsrc, srcno);
>>>
>>> You're implicitly casting a bool return value into a u64 here, is that
>>> intentional?
>>
>> Yes. Is that bad? This is what the load is supposed to return in the transition
>> algorithm.
> 
> No, that's fine, as long as just using the LSB in your return is
> correct.  Just making sure I understood.
> 
>>>> +
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_GET:
>>>> +        ret = xive_source_pq_get(xsrc, srcno);
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_SET_PQ_00:
>>>> +    case XIVE_ESB_SET_PQ_01:
>>>> +    case XIVE_ESB_SET_PQ_10:
>>>> +    case XIVE_ESB_SET_PQ_11:
>>>> +        ret = xive_source_pq_set(xsrc, srcno, (offset >> 8) & 0x3);
>>>> +        break;
>>>> +    default:
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB addr %d\n", offset);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static void xive_source_esb_write(void *opaque, hwaddr addr,
>>>> +                                 uint64_t value, unsigned size)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>> +    uint32_t offset = addr & 0xF00;
>>>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>>>> +    bool notify = false;
>>>> +
>>>> +    switch (offset) {
>>>> +    case 0:
>>>> +        notify = xive_source_pq_trigger(xsrc, srcno);
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_STORE_EOI:
>>>> +        if (xive_source_is_trigger_page(addr)) {
>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>> +                          "XIVE: invalid store on IRQ %d trigger page at "
>>>> +                          "0x%"HWADDR_PRIx"\n", srcno, addr);
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        /* If the Q bit is set, we should forward a new source event
>>>> +         * notification
>>>> +         */
>>>> +        notify = xive_source_pq_eoi(xsrc, srcno);
>>>> +        break;
>>>> +
>>>> +    default:
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %d\n",
>>>> +                      offset);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Forward the source event notification for routing */
>>>> +    if (notify) {
>>>> +        xive_source_notify(xsrc, srcno);
>>>> +    }
>>>
>>> EOI via this path calls notify, but the one via the read path
>>> doesn't.  Is that correct?
>>
>> No. I focused on the one-page ESB MMIO setting with Store EOI in
>> emulated mode and not enough on the two-page ESB MMIO setting.
>> This is a late change made for compatibility with KVM. I will fix it.
> 
> Ok.
> 
>>>
>>>> +}
>>>> +
>>>> +static const MemoryRegionOps xive_source_esb_ops = {
>>>> +    .read = xive_source_esb_read,
>>>> +    .write = xive_source_esb_write,
>>>> +    .endianness = DEVICE_BIG_ENDIAN,
>>>> +    .valid = {
>>>> +        .min_access_size = 8,
>>>> +        .max_access_size = 8,
>>>> +    },
>>>> +    .impl = {
>>>> +        .min_access_size = 8,
>>>> +        .max_access_size = 8,
>>>> +    },
>>>> +};
>>>> +
>>>> +static void xive_source_set_irq(void *opaque, int srcno, int val)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>> +    bool notify = false;
>>>> +
>>>> +    if (val) {
>>>> +        notify = xive_source_pq_trigger(xsrc, srcno);
>>>> +    }
>>>> +
>>>> +    /* Forward the source event notification for routing */
>>>> +    if (notify) {
>>>> +        xive_source_notify(xsrc, srcno);
>>>> +    }
>>>> +}
>>>> +
>>>> +void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    monitor_printf(mon, "XIVE Source %6x ..%6x\n",
>>>> +                   xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
>>>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>>>> +        uint8_t pq = xive_source_pq_get(xsrc, i);
>>>> +        uint32_t lisn = i  + xsrc->offset;
>>>> +
>>>> +        if (pq == XIVE_ESB_OFF) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        monitor_printf(mon, "  %4x %c%c\n", lisn,
>>>> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
>>>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>>>> +    }
>>>> +}
>>>> +
>>>> +static void xive_source_reset(DeviceState *dev)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +
>>>> +    /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>>>> +    memset(xsrc->sbe, 0x55, xsrc->sbe_size);
>>>> +}
>>>> +
>>>> +static void xive_source_realize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +
>>>> +    if (!xsrc->nr_irqs) {
>>>> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
>>>> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
>>>> +        xsrc->esb_shift != XIVE_ESB_64K &&
>>>> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
>>>> +        error_setg(errp, "Invalid ESB shift setting");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>>>> +                                     xsrc->nr_irqs);
>>>> +
>>>> +    /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>>>> +    xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
>>>> +    xsrc->sbe = g_malloc0(xsrc->sbe_size);
>>>> +
>>>> +    /* TODO: H_INT_ESB support, which removing the ESB MMIOs */
>>>> +
>>>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>>> +                          &xive_source_esb_ops, xsrc, "xive.esb",
>>>> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>>>> +}
>>>> +
>>>> +static const VMStateDescription vmstate_xive_source = {
>>>> +    .name = TYPE_XIVE_SOURCE,
>>>> +    .version_id = 1,
>>>> +    .minimum_version_id = 1,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
>>>> +        VMSTATE_VBUFFER_UINT32(sbe, XiveSource, 1, NULL, sbe_size),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    },
>>>> +};
>>>> +
>>>> +/*
>>>> + * The default XIVE interrupt source setting for ESB MMIO is two 64k
>>>> + * pages without Store EOI. This is in sync with KVM.
>>>> + */
>>>> +static Property xive_source_properties[] = {
>>>> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>>>> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>>>> +    DEFINE_PROP_UINT64("bar", XiveSource, esb_base, 0),
>>>
>>> Isn't this redundant with however the base address is handled through
>>> the SysBusDevice stuff (I forget the details)?
>>
>> Storing the ESB MMIO base address under the XiveSource object is
>> convenient later on in the h_int_get_source_info hcall, which makes
>> use of the helpers:
>>
>> 	hwaddr xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>> 	hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
>>
>> But it is only used in that place. So we could just store the ESB 
>> MMIO base address under the sPAPRXive controller. This makes some
>> sense in the design, as we have to inform KVM of this address with
>> a KVM device ioctl.
> 
> Well.. I really dislike the idea that you could change the actual MMIO
> mapping address in one place, but other bits of code would still think
> it was mapped somewhere else.

OK. I think my last proposal, removing the ESB MMIO address from the
source and letting the owning device handle the mapping (the sPAPR XIVE
controller in this case, but the same applies to PowerNV or the PSI HB),
goes in the direction you want to take?

It does look better in the overall XIVE model, and the XiveSource
object has no need for this address.
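Here is a minimal sketch of the owner-side address computation. It assumes the owning device (the sPAPR XIVE controller, the PSI HB, and so on) keeps the base address of the region it maps, while the source only provides its shift and page mode. EsbMapping, esb_trigger_addr() and esb_mgmt_addr() are hypothetical counterparts of xive_source_esb_trigger() and xive_source_esb_mgmt(). The base addresses in the tests are arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical owner-side state: the device that maps the ESB MMIO
 * region keeps its base address; the source only knows its shift. */
typedef struct {
    uint64_t esb_base;   /* guest physical address of the ESB region */
    uint32_t esb_shift;  /* log2 of the per-source ESB size */
    bool     two_page;   /* separate trigger and management pages */
} EsbMapping;

static uint64_t esb_trigger_addr(const EsbMapping *m, uint32_t srcno)
{
    return m->esb_base + ((uint64_t)srcno << m->esb_shift);
}

/* The management page is the odd page in a two-page setting; with a
 * single page, trigger and management share the same address. */
static uint64_t esb_mgmt_addr(const EsbMapping *m, uint32_t srcno)
{
    uint64_t addr = esb_trigger_addr(m, srcno);

    if (m->two_page) {
        assert(m->esb_shift > 0);
        addr += 1ull << (m->esb_shift - 1);
    }
    return addr;
}
```

Because the base lives only in the device that performs the mapping, moving the region updates every address derived from it.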
 
> 
>>>> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>>>> +    DEFINE_PROP_END_OF_LIST(),
>>>> +};
>>>> +
>>>> +static void xive_source_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +
>>>> +    dc->realize = xive_source_realize;
>>>> +    dc->reset = xive_source_reset;
>>>> +    dc->props = xive_source_properties;
>>>> +    dc->desc = "XIVE interrupt source";
>>>> +    dc->vmsd = &vmstate_xive_source;
>>>> +}
>>>> +
>>>> +static const TypeInfo xive_source_info = {
>>>> +    .name          = TYPE_XIVE_SOURCE,
>>>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>>>> +    .instance_size = sizeof(XiveSource),
>>>> +    .class_init    = xive_source_class_init,
>>>> +};
>>>> +
>>>> +static void xive_register_types(void)
>>>> +{
>>>> +    type_register_static(&xive_source_info);
>>>> +}
>>>> +
>>>> +type_init(xive_register_types)
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> new file mode 100644
>>>> index 000000000000..d92a50519edf
>>>> --- /dev/null
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -0,0 +1,130 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#ifndef PPC_XIVE_H
>>>> +#define PPC_XIVE_H
>>>> +
>>>> +#include "hw/sysbus.h"
>>>> +
>>>> +/*
>>>> + * XIVE Interrupt Source
>>>> + */
>>>> +
>>>> +#define TYPE_XIVE_SOURCE "xive-source"
>>>> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>>>> +
>>>> +/*
>>>> + * XIVE Source Interrupt source characteristics, which define how the
>>>> + * ESB are controlled.
>>>> + */
>>>> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
>>>> +#define XIVE_SRC_STORE_EOI     0x4 /* Store EOI supported */
>>>> +
>>>> +typedef struct XiveSource {
>>>> +    SysBusDevice parent;
>>>> +
>>>> +    /* IRQs */
>>>> +    uint32_t     nr_irqs;
>>>> +    uint32_t     offset;
>>>> +    qemu_irq     *qirqs;
>>>> +
>>>> +    /* PQ bits */
>>>> +    uint8_t      *sbe;
>>>> +    uint32_t     sbe_size;
>>>> +
>>>> +    /* ESB memory region */
>>>> +    uint64_t     esb_flags;
>>>> +    hwaddr       esb_base;
>>>> +    uint32_t     esb_shift;
>>>> +    MemoryRegion esb_mmio;
>>>> +} XiveSource;
>>>> +
>>>> +/*
>>>> + * ESB MMIO setting. Can be one page, for both source triggering and
>>>> + * source management, or two different pages. See below for magic
>>>> + * values.
>>>> + */
>>>> +#define XIVE_ESB_4K          12 /* PSI HB */
>>>> +#define XIVE_ESB_4K_2PAGE    17
>>>
>>> Should this be 13 instead of 17?
>>
>> Oops. Obviously this is not used.
>>
>>>> +#define XIVE_ESB_64K         16
>>>> +#define XIVE_ESB_64K_2PAGE   17
>>>
>>> (Also, who the hell comes up with a brand new PIC and decides to have
>>> *4* different interface variants.  But that's not your problem, I
>>> realise).
>>
>> These are HW constraints of the different controllers which need to
>> expose sources: PSI, PHB4. The internal sources of the XIVE interrupt
>> controller can be configured to use 4K or 64K pages, but I doubt the
>> 4K setting will ever be used.
> 
> Sure, the hardware is different, but *why* is it different.  This is a
> brand new design, you'd think they could come up with one variant that
> works for all the cases.
> 

Yes, this is interesting. I will ask the HW team.

Thanks.

C.

* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-23  6:44   ` David Gibson
@ 2018-04-23  7:31     ` Cédric Le Goater
  2018-04-24  6:41       ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-23  7:31 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/23/2018 08:44 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
>> The 'sent' status of the LSI interrupt source is modeled with the 'P'
>> bit of the ESB and the assertion status of the source is maintained in
>> an array under the main sPAPRXive object. The type of the source is
>> stored in the same array for practical reasons.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
>>  include/hw/ppc/xive.h | 16 +++++++++++++++
>>  2 files changed, 66 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index c70578759d02..060976077dd7 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
>>  
>>  }
>>  
>> +/*
>> + * LSI interrupt sources use the P bit and a custom assertion flag
>> + */
>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>> +
>> +    if  (old_pq == XIVE_ESB_RESET &&
>> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>> +        return true;
>> +    }
>> +    return false;
>> +}
>> +
>>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
>>   * page is for management */
>>  static inline bool xive_source_is_trigger_page(hwaddr addr)
>> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>>           */
>>          ret = xive_source_pq_eoi(xsrc, srcno);
>>  
>> +        /* If the LSI source is still asserted, forward a new source
>> +         * event notification */
>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
>> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
>> +                xive_source_notify(xsrc, srcno);
>> +            }
>> +        }
>>          break;
>>  
>>      case XIVE_ESB_GET:
>> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
>>           * notification
>>           */
>>          notify = xive_source_pq_eoi(xsrc, srcno);
>> +
>> +        /* LSI sources do not set the Q bit but they can still be
>> +         * asserted, in which case we should forward a new source
>> +         * event notification
>> +         */
>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
>> +            notify = xive_source_lsi_trigger(xsrc, srcno);
>> +        }

FYI, I have moved that common test under xive_source_pq_eoi()
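Here is a simplified sketch of the PQ transitions with the LSI re-trigger folded into the EOI helper. This way, both the load EOI and store EOI paths forward a new event when the line is still asserted. The Source type, its fixed NR_IRQS size and the pq_*/lsi_* names are hypothetical. The OFF encoding (0b01) and the packing of four 2-bit entries per byte match the patch's reset comment and SBE allocation. The other encodings (P as the high bit) and the bit order within a byte are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* PQ encodings assumed here: P = 0b10, Q = 0b01 */
#define ESB_RESET    0x0
#define ESB_OFF      0x1
#define ESB_PENDING  0x2
#define ESB_QUEUED   0x3

#define NR_IRQS      8

/* Hypothetical, simplified source: packed PQ bits (4 IRQs per byte)
 * plus per-IRQ LSI and asserted flags. */
typedef struct {
    uint8_t sbe[(NR_IRQS + 3) / 4];
    bool    lsi[NR_IRQS];
    bool    asserted[NR_IRQS];
} Source;

static uint8_t pq_get(const Source *s, uint32_t srcno)
{
    assert(srcno < NR_IRQS);
    return (s->sbe[srcno / 4] >> ((srcno % 4) * 2)) & 0x3;
}

static void pq_set(Source *s, uint32_t srcno, uint8_t pq)
{
    unsigned shift = (srcno % 4) * 2;

    assert(srcno < NR_IRQS);
    s->sbe[srcno / 4] = (uint8_t)((s->sbe[srcno / 4] & ~(0x3u << shift)) |
                                  ((pq & 0x3u) << shift));
}

/* Reset: 0x55 sets every 2-bit field to 0b01 (OFF); LSI config is kept */
static void source_reset(Source *s)
{
    memset(s->sbe, 0x55, sizeof(s->sbe));
    memset(s->asserted, 0, sizeof(s->asserted));
}

/* MSI trigger. Returns true when the event must be forwarded. */
static bool pq_trigger(Source *s, uint32_t srcno)
{
    switch (pq_get(s, srcno)) {
    case ESB_RESET:
        pq_set(s, srcno, ESB_PENDING);
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        pq_set(s, srcno, ESB_QUEUED);   /* coalesce into Q */
        return false;
    default:                            /* ESB_OFF: masked */
        return false;
    }
}

/* LSI trigger: uses P only, fires while the line is asserted */
static bool lsi_trigger(Source *s, uint32_t srcno)
{
    if (pq_get(s, srcno) == ESB_RESET && s->asserted[srcno]) {
        pq_set(s, srcno, ESB_PENDING);
        return true;
    }
    return false;
}

/* EOI with the LSI re-trigger folded in. Returns true when an event
 * must be forwarded. */
static bool pq_eoi(Source *s, uint32_t srcno)
{
    bool notify = false;

    switch (pq_get(s, srcno)) {
    case ESB_RESET:
    case ESB_PENDING:
        pq_set(s, srcno, ESB_RESET);
        break;
    case ESB_QUEUED:
        pq_set(s, srcno, ESB_PENDING);  /* replay the queued event */
        notify = true;
        break;
    default:                            /* ESB_OFF */
        break;
    }

    if (s->lsi[srcno]) {
        notify = lsi_trigger(s, srcno);
    }
    return notify;
}
```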

>>          break;
>>  
>>      default:
>> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>>      bool notify = false;
>>  
>> -    if (val) {
>> -        notify = xive_source_pq_trigger(xsrc, srcno);
>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>> +        if (val) {
>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
>> +        } else {
>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
>> +        }
>> +        notify = xive_source_lsi_trigger(xsrc, srcno);
>> +    } else {
>> +        if (val) {
>> +            notify = xive_source_pq_trigger(xsrc, srcno);
>> +        }
>>      }
>>  
>>      /* Forward the source event notification for routing */
>> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
>>      for (i = 0; i < xsrc->nr_irqs; i++) {
>>          uint8_t pq = xive_source_pq_get(xsrc, i);
>> -        uint32_t lisn = i  + xsrc->offset;
>>  
>>          if (pq == XIVE_ESB_OFF) {
>>              continue;
>>          }
>>  
>> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
>> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>>      }
>> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>  static void xive_source_reset(DeviceState *dev)
>>  {
>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>> +    int i;
>> +
>> +    /* Keep the IRQ type */
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
>> +    }
>>  
>>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
>> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>>  
>>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>>                                       xsrc->nr_irqs);
>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
>>  
>>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index d92a50519edf..0b76dd278d9b 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -33,6 +33,9 @@ typedef struct XiveSource {
>>      uint32_t     nr_irqs;
>>      uint32_t     offset;
>>      qemu_irq     *qirqs;
>> +#define XIVE_STATUS_LSI         0x1
>> +#define XIVE_STATUS_ASSERTED    0x2
>> +    uint8_t      *status;
> 
> I don't love the idea of mixing configuration information (STATUS_LSI)
> with runtime state information (ASSERTED) in the same field.  Any
> reason not to have these as parallel bitmaps?

None. I can change that.
 
> Come to that.. is there a compelling reason to allow any individual
> irq to be marked LSI or MSI, rather than using separate XiveSource
> objects for MSIs and LSIs?

Yes. I would have preferred two distinct interrupt source objects, but
this is for compatibility with XICS, which uses only one. If we want
to be able to change the interrupt mode, the IRQ number space must be
organized in exactly the same way, or we should change XICS as well.

Also, the change (a bitmap) is really small.
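Here is a sketch of the parallel-bitmap layout suggested in the review. Configuration (LSI or MSI) and runtime state (asserted) live in separate bitmaps, so a reset can clear one without touching the other. SourceBitmaps and the bm_* helpers are hypothetical and written out to keep the example self-contained. Inside QEMU, the helpers from qemu/bitmap.h and qemu/bitops.h (bitmap_new(), set_bit(), test_bit(), ...) would be the natural choice.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

/* Hypothetical layout: one bit per IRQ in each map */
typedef struct {
    uint32_t       nr_irqs;
    unsigned long *lsi_map;       /* configuration, set at machine init */
    unsigned long *asserted_map;  /* runtime state, cleared on reset */
} SourceBitmaps;

static size_t bitmap_words(uint32_t nbits)
{
    return (nbits + BITS_PER_WORD - 1) / BITS_PER_WORD;
}

static bool bm_test(const unsigned long *map, uint32_t bit)
{
    return (map[bit / BITS_PER_WORD] >> (bit % BITS_PER_WORD)) & 1;
}

static void bm_assign(unsigned long *map, uint32_t bit, bool val)
{
    unsigned long mask = 1UL << (bit % BITS_PER_WORD);

    if (val) {
        map[bit / BITS_PER_WORD] |= mask;
    } else {
        map[bit / BITS_PER_WORD] &= ~mask;
    }
}

static bool source_bitmaps_init(SourceBitmaps *sb, uint32_t nr_irqs)
{
    sb->nr_irqs = nr_irqs;
    sb->lsi_map = calloc(bitmap_words(nr_irqs), sizeof(unsigned long));
    sb->asserted_map = calloc(bitmap_words(nr_irqs), sizeof(unsigned long));
    return sb->lsi_map && sb->asserted_map;
}

static bool source_is_lsi(const SourceBitmaps *sb, uint32_t srcno)
{
    assert(srcno < sb->nr_irqs);
    return bm_test(sb->lsi_map, srcno);
}

/* Reset only touches runtime state; the LSI/MSI configuration survives */
static void source_bitmaps_reset(SourceBitmaps *sb)
{
    for (size_t i = 0; i < bitmap_words(sb->nr_irqs); i++) {
        sb->asserted_map[i] = 0;
    }
}
```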

>>      /* PQ bits */
>>      uint8_t      *sbe;
> 
> .. and come to that is there a reason to keep the ASSERTED bit in a
> separate array from sbe?  AFAICT the actual 2-bit-per-irq layout is
> never exposed to the guests.

Indeed. We always use the xive_source_pq_get/set() helpers to
manipulate the PQ bits, so we could add an extra bit for ASSERTED
without too many changes. Could we also put the type there, or would
you still prefer a bitmap?

> Or, even re-use the Q bit for asserted in LSIs (but report it as
> always 0 in the register read/write path).

I would prefer to add extra status bits. It is easier to debug.
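For comparison, here is a sketch of the other option: extra status bits stored next to the PQ bits in a per-IRQ byte. The layout (PQ in bits 0-1, ASSERTED in bit 2, LSI in bit 3) and the st_* names are hypothetical. Because the accessors mask the PQ field, the extra bits never leak into what the guest reads, and a reset can restore OFF while keeping the LSI type. The 0x55 memset from the patch would no longer work with this layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-IRQ status byte: PQ in bits 0-1, extra bits above */
#define ST_PQ_MASK    0x03
#define ST_ASSERTED   0x04   /* runtime: LSI line currently asserted */
#define ST_LSI        0x08   /* config: source is level-sensitive */

#define PQ_OFF        0x1

static uint8_t st_pq_get(const uint8_t *status, uint32_t srcno)
{
    return status[srcno] & ST_PQ_MASK;
}

/* Update PQ without disturbing the extra bits; returns the old PQ */
static uint8_t st_pq_set(uint8_t *status, uint32_t srcno, uint8_t pq)
{
    uint8_t old = status[srcno] & ST_PQ_MASK;

    assert(pq <= ST_PQ_MASK);
    status[srcno] = (uint8_t)((status[srcno] & ~ST_PQ_MASK) | pq);
    return old;
}

/* Reset: PQ = OFF and deassert, but keep the LSI configuration bit */
static void st_reset(uint8_t *status, uint32_t nr_irqs)
{
    for (uint32_t i = 0; i < nr_irqs; i++) {
        status[i] = (uint8_t)((status[i] & ST_LSI) | PQ_OFF);
    }
}
```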

Thanks,

C.

>> @@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>>  
>>  void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
>>  
>> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    return xsrc->status[srcno] & XIVE_STATUS_LSI;
>> +}
>> +
>> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>> +                                       bool lsi)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>> +}
>> +
>>  #endif /* PPC_XIVE_H */
> 

* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-23  6:46   ` David Gibson
@ 2018-04-23  7:58     ` Cédric Le Goater
  2018-04-24  6:46       ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-23  7:58 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/23/2018 08:46 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
>> The XiveFabric offers a simple interface between the XiveSource
>> object and the device model owning the interrupt sources, to forward
>> an event notification to the XIVE interrupt controller of the machine
>> or, if the owner is the controller, to call the routing sub-engine
>> directly.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
>>  2 files changed, 61 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 060976077dd7..b4c3d06c1219 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -17,6 +17,21 @@
>>  #include "hw/ppc/xive.h"
>>  
>>  /*
>> + * XIVE Fabric
>> + */
>> +
>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
>> +{
>> +
>> +}
>> +
>> +static const TypeInfo xive_fabric_info = {
>> +    .name = TYPE_XIVE_FABRIC,
>> +    .parent = TYPE_INTERFACE,
>> +    .class_size = sizeof(XiveFabricClass),
>> +};
>> +
>> +/*
>>   * XIVE Interrupt Source
>>   */
>>  
>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>>  
>>  /*
>>   * Forward the source event notification to the associated XiveFabric,
>> - * the device owning the sources.
>> + * the device owning the sources, or perform the routing if the device
>> + * is the interrupt controller.
>>   */
>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>>  {
>>  
>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
>> +
>> +    if (xfc->notify) {
>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
>> +    } else {
>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
>> +    }
> 
> Why 2 cases?  Can't the XiveFabric object just make its notify equal
> to xive_fabric_route if that's what it wants?

Under sPAPR, all the sources (IPIs and virtual device interrupts)
generate events which are routed directly by xive_fabric_route().
Indeed, there is no need for an extra hop.

Under PowerNV, some sources forward the notification to the routing
engine with an MMIO store to a notify address held in one of the
controller registers. So we need a hop to reach the device model
owning the sources, which performs that store:

	static void pnv_psi_notify(XiveFabric *xf, uint32_t lisn)
	{
	    PnvPsi *psi = PNV_PSI(xf);
	    uint64_t notif_port =
	        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
	    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
	    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
	    uint32_t data = cpu_to_be32(lisn);
	
	    if (valid) {
	        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
	    }
	}

The PnvXive model handles the store and forwards it to the fabric again.

The IPIs under PowerNV do not need the extra notify() hop, so they
reach the routing routine directly.

Ultimately, PowerNV should also use xive_fabric_route(), but there are
some differences in how the NVT registers are updated (HV vs. OS mode)
which are not handled yet, so it uses a notify() handler for now. This
handler should disappear in the near future, and xive_fabric_route()
will then be called directly.


Maybe XiveFabricNotifier would be a better name for this feature?
I am adding a few ops later which are more related to routing.
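Here is a minimal sketch of the two dispatch options discussed here, using hypothetical Fabric/FabricClass stand-ins for XiveFabric/XiveFabricClass. The patch calls the optional notify() hop when present and otherwise routes directly. With the alternative suggested in the review, a controller that routes directly would set notify to its routing function, and the NULL check would disappear. fabric_route() here only records the LISN; it does not model the real routing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Fabric Fabric;

/* Hypothetical stand-in for XiveFabricClass */
typedef struct {
    /* Optional hop for owners that forward the event themselves */
    void (*notify)(Fabric *xf, uint32_t lisn);
} FabricClass;

/* Hypothetical stand-in for XiveFabric, with counters for illustration */
struct Fabric {
    const FabricClass *klass;
    uint32_t last_routed;
    uint32_t nr_hops;
};

/* Direct routing (records the LISN only) */
static void fabric_route(Fabric *xf, uint32_t lisn)
{
    xf->last_routed = lisn;
}

/* Dispatch as in the patch: use the hop if present, else route */
static void source_notify(Fabric *xf, uint32_t offset, uint32_t srcno)
{
    uint32_t lisn = srcno + offset;

    assert(xf && xf->klass);
    if (xf->klass->notify) {
        xf->klass->notify(xf, lisn);
    } else {
        fabric_route(xf, lisn);
    }
}

/* A PowerNV-like hop: counts itself, then routes */
static void hop_notify(Fabric *xf, uint32_t lisn)
{
    xf->nr_hops++;
    fabric_route(xf, lisn);
}
```

In the tests, the sPAPR-like class has no hop, the PowerNV-like class goes through hop_notify(), and the last class installs fabric_route() as its notify().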

Thanks,

C.


> 
>>  }
>>  
>>  /*
>> @@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
>>  static void xive_source_realize(DeviceState *dev, Error **errp)
>>  {
>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>> +    Object *obj;
>> +    Error *local_err = NULL;
>> +
>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>> +    if (!obj) {
>> +        error_propagate(errp, local_err);
>> +        error_prepend(errp, "required link 'xive' not found: ");
>> +        return;
>> +    }
>> +
>> +    xsrc->xive = XIVE_FABRIC(obj);
>>  
>>      if (!xsrc->nr_irqs) {
>>          error_setg(errp, "Number of interrupt needs to be greater than 0");
>> @@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
>>  static void xive_register_types(void)
>>  {
>>      type_register_static(&xive_source_info);
>> +    type_register_static(&xive_fabric_info);
>>  }
>>  
>>  type_init(xive_register_types)
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 0b76dd278d9b..4fcae2c763e6 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -12,6 +12,8 @@
>>  
>>  #include "hw/sysbus.h"
>>  
>> +typedef struct XiveFabric XiveFabric;
>> +
>>  /*
>>   * XIVE Interrupt Source
>>   */
>> @@ -46,6 +48,8 @@ typedef struct XiveSource {
>>      hwaddr       esb_base;
>>      uint32_t     esb_shift;
>>      MemoryRegion esb_mmio;
>> +
>> +    XiveFabric   *xive;
>>  } XiveSource;
>>  
>>  /*
>> @@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>      xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>>  }
>>  
>> +/*
>> + * XIVE Fabric
>> + */
>> +
>> +typedef struct XiveFabric {
>> +    Object parent;
>> +} XiveFabric;
>> +
>> +#define TYPE_XIVE_FABRIC "xive-fabric"
>> +#define XIVE_FABRIC(obj)                                     \
>> +    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
>> +#define XIVE_FABRIC_CLASS(klass)                                     \
>> +    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
>> +#define XIVE_FABRIC_GET_CLASS(obj)                                   \
>> +    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
>> +
>> +typedef struct XiveFabricClass {
>> +    InterfaceClass parent;
>> +    void (*notify)(XiveFabric *xf, uint32_t lisn);
>> +} XiveFabricClass;
>> +
>>  #endif /* PPC_XIVE_H */
> 

* Re: [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model
  2018-04-23  7:11         ` Cédric Le Goater
@ 2018-04-24  1:24           ` David Gibson
  0 siblings, 0 replies; 100+ messages in thread
From: David Gibson @ 2018-04-24  1:24 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Mon, Apr 23, 2018 at 09:11:28AM +0200, Cédric Le Goater wrote:
> On 04/23/2018 05:59 AM, David Gibson wrote:
> > On Fri, Apr 20, 2018 at 10:27:21AM +0200, Cédric Le Goater wrote:
> >> On 04/20/2018 09:10 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 02:42:57PM +0200, Cédric Le Goater wrote:
> >>>> Each XIVE interrupt source is associated with a two bit state machine
> >>>> called an Event State Buffer (ESB): the first bit "P" means that an
> >>>> interrupt is "pending" and waiting for an EOI and the bit "Q" (queued)
> >>>> means a new interrupt was triggered while another was still pending.
> >>>>
> >>>> When an event is triggered, the associated interrupt state bits are
> >>>> fetched and modified and forwarded to the virtualization engine of the
> >>>> controller doing the routing. These can also be controlled by MMIO, to
> >>>> trigger events or turn off the sources for instance. See code for more
> >>>> details on the states and transitions.
> >>>>
> >>>> On a sPAPR machine, the OS will obtain the address of the MMIO page of
> >>>> the ESB entry associated with a source and its characteristics using
> >>>> the H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is
> >>>> used.
> >>>>
> >>>> The xive_source_notify() routine is in charge of forwarding the source
> >>>> event notification to the routing engine. It will be filled in later on.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  Changes since v2:
> >>>>
> >>>>  - added support for Store EOI
> >>>>  - added support for two page MMIO setting like on KVM
> >>>
> >>> Looks generally sane to me, though I have a few queries.
[snip]
> >>>> + * The default XIVE interrupt source setting for ESB MMIO is two 64k
> >>>> + * pages without Store EOI. This is in sync with KVM.
> >>>> + */
> >>>> +static Property xive_source_properties[] = {
> >>>> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
> >>>> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
> >>>> +    DEFINE_PROP_UINT64("bar", XiveSource, esb_base, 0),
> >>>
> >>> Isn't this redundant with however the base address is handled through
> >>> the SysBusDevice stuff (I forget the details)?
> >>
> >> Storing the ESB MMIO base address under the XiveSource object is
> >> convenient later on in the h_int_get_source_info hcall, which makes
> >> use of the helpers:
> >>
> >> 	hwaddr xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> >> 	hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
> >>
> >> But it is only used in that place. So we could just store the ESB 
> >> MMIO base address under the sPAPRXive controller. This makes some
> >> sense in the design, as we have to inform KVM of this address with
> >> a KVM device ioctl.
> > 
> > Well.. I really dislike the idea that you could change the actual MMIO
> > mapping address in one place, but other bits of code would still think
> > it was mapped somewhere else.
> 
> OK. I think my last proposal, removing the ESB MMIO address from the
> source and letting the owning device handle the mapping (the sPAPR XIVE
> controller in this case, but the same applies to PowerNV or the PSI HB),
> goes in the direction you want to take?
> 
> It does look better in the overall XIVE model, and the XiveSource
> object has no need for this address.

Yes, I think that's the way to go.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-23  7:31     ` Cédric Le Goater
@ 2018-04-24  6:41       ` David Gibson
  2018-04-24  8:11         ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-24  6:41 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Mon, Apr 23, 2018 at 09:31:24AM +0200, Cédric Le Goater wrote:
> On 04/23/2018 08:44 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
> >> The 'sent' status of the LSI interrupt source is modeled with the 'P'
> >> bit of the ESB and the assertion status of the source is maintained in
> >> an array under the main sPAPRXive object. The type of the source is
> >> stored in the same array for practical reasons.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
> >>  include/hw/ppc/xive.h | 16 +++++++++++++++
> >>  2 files changed, 66 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index c70578759d02..060976077dd7 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>  
> >>  }
> >>  
> >> +/*
> >> + * LSI interrupt sources use the P bit and a custom assertion flag
> >> + */
> >> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> >> +
> >> +    if  (old_pq == XIVE_ESB_RESET &&
> >> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
> >> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> >> +        return true;
> >> +    }
> >> +    return false;
> >> +}
> >> +
> >>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
> >>   * page is for management */
> >>  static inline bool xive_source_is_trigger_page(hwaddr addr)
> >> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> >>           */
> >>          ret = xive_source_pq_eoi(xsrc, srcno);
> >>  
> >> +        /* If the LSI source is still asserted, forward a new source
> >> +         * event notification */
> >> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
> >> +                xive_source_notify(xsrc, srcno);
> >> +            }
> >> +        }
> >>          break;
> >>  
> >>      case XIVE_ESB_GET:
> >> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
> >>           * notification
> >>           */
> >>          notify = xive_source_pq_eoi(xsrc, srcno);
> >> +
> >> +        /* LSI sources do not set the Q bit but they can still be
> >> +         * asserted, in which case we should forward a new source
> >> +         * event notification
> >> +         */
> >> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >> +            notify = xive_source_lsi_trigger(xsrc, srcno);
> >> +        }
> 
> FYI, I have moved that common test under xive_source_pq_eoi()

Ok.
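For reference, here is a minimal standalone sketch of the two-bit PQ state machine that xive_source_pq_trigger() and xive_source_pq_eoi() are assumed to implement, with the LSI re-trigger test folded into the EOI path as described above. The esb_* names are hypothetical, and the encodings (P = 0x2, Q = 0x1, 0b01 = "ints off") are inferred from XIVE_ESB_VAL_P/Q and the 0x55 reset pattern in this patch. This is not the QEMU code itself.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encodings: P is bit 1, Q is bit 0 */
#define XIVE_ESB_RESET    0x0   /* P=0 Q=0: idle, next trigger notifies */
#define XIVE_ESB_OFF      0x1   /* P=0 Q=1: "ints off" (reset value) */
#define XIVE_ESB_PENDING  0x2   /* P=1 Q=0: notification sent, awaiting EOI */
#define XIVE_ESB_QUEUED   0x3   /* P=1 Q=1: another trigger arrived meanwhile */

/* MSI-style trigger: returns true when a notification must be forwarded */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case XIVE_ESB_RESET:
        *pq = XIVE_ESB_PENDING;
        return true;
    case XIVE_ESB_PENDING:
    case XIVE_ESB_QUEUED:
        *pq = XIVE_ESB_QUEUED;   /* coalesce: remember the event in Q */
        return false;
    default:                     /* XIVE_ESB_OFF: event is dropped */
        return false;
    }
}

/* EOI: returns true when a queued event must be re-notified */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case XIVE_ESB_QUEUED:
        *pq = XIVE_ESB_PENDING;
        return true;
    case XIVE_ESB_RESET:
    case XIVE_ESB_PENDING:
        *pq = XIVE_ESB_RESET;
        return false;
    default:                     /* XIVE_ESB_OFF stays off */
        return false;
    }
}

/* LSI: never sets Q, fires only from PQ=00 while the line is asserted */
static bool esb_lsi_trigger(uint8_t *pq, bool asserted)
{
    if ((*pq & 0x3) == XIVE_ESB_RESET && asserted) {
        *pq = XIVE_ESB_PENDING;
        return true;
    }
    return false;
}

/* EOI with the common LSI test folded in */
static bool esb_eoi_any(uint8_t *pq, bool is_lsi, bool asserted)
{
    bool notify = esb_eoi(pq);

    if (is_lsi) {
        notify = esb_lsi_trigger(pq, asserted);
    }
    return notify;
}
```

An MSI that fires twice before its EOI ends up QUEUED and is re-notified once at EOI; an LSI never uses Q and instead re-notifies at EOI only while its line is still asserted.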

> >>          break;
> >>  
> >>      default:
> >> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
> >>      XiveSource *xsrc = XIVE_SOURCE(opaque);
> >>      bool notify = false;
> >>  
> >> -    if (val) {
> >> -        notify = xive_source_pq_trigger(xsrc, srcno);
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >> +        if (val) {
> >> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> >> +        } else {
> >> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> >> +        }
> >> +        notify = xive_source_lsi_trigger(xsrc, srcno);
> >> +    } else {
> >> +        if (val) {
> >> +            notify = xive_source_pq_trigger(xsrc, srcno);
> >> +        }
> >>      }
> >>  
> >>      /* Forward the source event notification for routing */
> >> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
> >>      for (i = 0; i < xsrc->nr_irqs; i++) {
> >>          uint8_t pq = xive_source_pq_get(xsrc, i);
> >> -        uint32_t lisn = i  + xsrc->offset;
> >>  
> >>          if (pq == XIVE_ESB_OFF) {
> >>              continue;
> >>          }
> >>  
> >> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
> >> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
> >> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
> >>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >>      }
> >> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >>  static void xive_source_reset(DeviceState *dev)
> >>  {
> >>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +    int i;
> >> +
> >> +    /* Keep the IRQ type */
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
> >> +    }
> >>  
> >>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
> >>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
> >> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
> >>  
> >>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> >>                                       xsrc->nr_irqs);
> >> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
> >>  
> >>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
> >>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index d92a50519edf..0b76dd278d9b 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -33,6 +33,9 @@ typedef struct XiveSource {
> >>      uint32_t     nr_irqs;
> >>      uint32_t     offset;
> >>      qemu_irq     *qirqs;
> >> +#define XIVE_STATUS_LSI         0x1
> >> +#define XIVE_STATUS_ASSERTED    0x2
> >> +    uint8_t      *status;
> > 
> > I don't love the idea of mixing configuration information (STATUS_LSI)
> > with runtime state information (ASSERTED) in the same field.  Any
> > reason not to have these as parallel bitmaps?
> 
> None. I can change that.

Ok.

> > Come to that.. is there a compelling reason to allow any individual
> > irq to be marked LSI or MSI, rather than using separate XiveSource
> > objects for MSIs and LSIs?
> 
> Yes. I would have preferred two distinct interrupt source objects, but
> this is for compatibility with XICS, which uses only one. If we want
> to be able to switch interrupt modes, the IRQ number space must be
> organized in exactly the same way. Otherwise, we should change XICS as well.
> 
> Also, the change (a bitmap) is really small.

Hrm, but since XIVE supports thousands of irqs, it could be quite a
large bitmap.

It's not impossible - in fact, not really even that hard - to change
the existing irq layout on xics.  It does need a new machine type
variant, of course.

> >>      /* PQ bits */
> >>      uint8_t      *sbe;
> > 
> > .. and come to that is there a reason to keep the ASSERTED bit in a
> > separate array from sbe?  AFAICT the actual 2-bit-per-irq layout is
> > never exposed to the guests.
> 
> Indeed. We always use the xive_source_pq_get/set() helpers to
> manipulate the PQ bits, so we could add an extra bit for ASSERTED
> without too many changes. Could we also put the type there, or would
> you still prefer a bitmap?

I'd prefer the type (config information) be separate from the P, Q,
ASSERTED bits (state information).
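A sketch of that split, under stated assumptions: the configuration (LSI or MSI) lives in a bitmap, and the runtime state (P, Q and ASSERTED) is packed into one byte per source behind PQ accessors that only touch the low two bits. All names and bit positions here are hypothetical. The type bitmap costs one bit per source (1 KiB for a hypothetical 8192 sources). The setter both sets and clears the type bit; the patch's xive_source_irq_set() only ORs XIVE_STATUS_LSI in, so calling it with lsi == false (as spapr_xive_irq_disable() does later in this series) leaves a previously set LSI flag in place.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define ESB_PQ_MASK        0x3   /* bits 0-1: the PQ pair */
#define ESB_STATE_ASSERTED 0x4   /* bit 2: LSI line level, never exposed */

#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)

typedef struct {
    uint32_t       nr_irqs;
    unsigned long *lsi_map;  /* config: 1 bit per source */
    uint8_t       *state;    /* state: P, Q and ASSERTED per source */
} SourceSketch;

static bool source_init(SourceSketch *s, uint32_t nr_irqs)
{
    size_t words = (nr_irqs + BITS_PER_WORD - 1) / BITS_PER_WORD;

    s->nr_irqs = nr_irqs;
    s->lsi_map = calloc(words ? words : 1, sizeof(unsigned long));
    s->state = calloc(nr_irqs ? nr_irqs : 1, 1);
    return s->lsi_map && s->state;
}

/* Sets *and* clears the type bit, so a disable path can reset it */
static void source_set_lsi(SourceSketch *s, uint32_t srcno, bool lsi)
{
    unsigned long bit = 1UL << (srcno % BITS_PER_WORD);

    if (lsi) {
        s->lsi_map[srcno / BITS_PER_WORD] |= bit;
    } else {
        s->lsi_map[srcno / BITS_PER_WORD] &= ~bit;
    }
}

static bool source_is_lsi(const SourceSketch *s, uint32_t srcno)
{
    return (s->lsi_map[srcno / BITS_PER_WORD] >> (srcno % BITS_PER_WORD)) & 1;
}

/* PQ accessors only touch bits 0-1, leaving ASSERTED alone */
static uint8_t source_pq_get(const SourceSketch *s, uint32_t srcno)
{
    return s->state[srcno] & ESB_PQ_MASK;
}

static void source_pq_set(SourceSketch *s, uint32_t srcno, uint8_t pq)
{
    s->state[srcno] = (s->state[srcno] & ~ESB_PQ_MASK) | (pq & ESB_PQ_MASK);
}

/* Reset clears all runtime state but keeps the configured type */
static void source_reset(SourceSketch *s)
{
    for (uint32_t i = 0; i < s->nr_irqs; i++) {
        s->state[i] = 0x1;   /* PQ = 0b01 "ints off", ASSERTED cleared */
    }
}
```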

> > Or, even re-use the Q bit for asserted in LSIs (but report it as
> > always 0 in the register read/write path).
> 
> I would prefer to add extra status bits. It is easier to debug.
> 
> Thanks,
> 
> C.
> 
> >> @@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
> >>  
> >>  void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
> >>  
> >> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +    return xsrc->status[srcno] & XIVE_STATUS_LSI;
> >> +}
> >> +
> >> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> >> +                                       bool lsi)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
> >> +}
> >> +
> >>  #endif /* PPC_XIVE_H */
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-23  7:58     ` Cédric Le Goater
@ 2018-04-24  6:46       ` David Gibson
  2018-04-24  9:33         ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-24  6:46 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
> On 04/23/2018 08:46 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
> >> The XiveFabric offers a simple interface, between the XiveSource
> >> object and the device model owning the interrupt sources, to forward
> >> an event notification to the XIVE interrupt controller of the machine
> >> and, if the owner is the controller, to call the routing sub-engine
> >> directly.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
> >>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
> >>  2 files changed, 61 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 060976077dd7..b4c3d06c1219 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -17,6 +17,21 @@
> >>  #include "hw/ppc/xive.h"
> >>  
> >>  /*
> >> + * XIVE Fabric
> >> + */
> >> +
> >> +static void xive_fabric_route(XiveFabric *xf, int lisn)
> >> +{
> >> +
> >> +}
> >> +
> >> +static const TypeInfo xive_fabric_info = {
> >> +    .name = TYPE_XIVE_FABRIC,
> >> +    .parent = TYPE_INTERFACE,
> >> +    .class_size = sizeof(XiveFabricClass),
> >> +};
> >> +
> >> +/*
> >>   * XIVE Interrupt Source
> >>   */
> >>  
> >> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
> >>  
> >>  /*
> >>   * Forward the source event notification to the associated XiveFabric,
> >> - * the device owning the sources.
> >> + * the device owning the sources, or perform the routing if the device
> >> + * is the interrupt controller.
> >>   */
> >>  static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>  {
> >>  
> >> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
> >> +
> >> +    if (xfc->notify) {
> >> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
> >> +    } else {
> >> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
> >> +    }
> > 
> > Why 2 cases?  Can't the XiveFabric object just make its notify equal
> > to xive_fabric_route if that's what it wants?
> Under sPAPR, all the sources (IPIs and virtual device interrupts)
> generate events that are routed directly by xive_fabric_route().
> Indeed, there is no need for an extra hop.

Ok.

> Under PowerNV, some sources forward the notification to the routing 
> engine using a specific MMIO load on a notify address which is stored 
> in one of the controller registers. So we need a hop to reach the
> device model that owns the sources, and do that load:

Hm.  So you're saying that in pnv some sources send their notification
to some other unit, that would then (after possible masking) forward
on to the overall xive fabric?

That seems like a property of the source object, rather than a
property of the fabric.  Indeed varying this by source object would
require the objects have a different xive pointer, when I thought the
idea was that the XiveFabric was global.
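One way to read the two points above (a single notify path, and forwarding as a property of the source rather than of the fabric) is sketched below: each source carries a notify hook that defaults to routing directly through the one global fabric, and an owner such as the PSI bridge may replace it. Everything here (Fabric, Source, gated_notify) is hypothetical plain C, not QOM, and the gating hook merely stands in for an owner that forwards conditionally.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Fabric Fabric;
typedef struct Source Source;

/* Hypothetical global fabric: the only object that routes events */
struct Fabric {
    uint32_t last_routed;   /* demo bookkeeping only */
    unsigned routed_count;
};

static void fabric_route(Fabric *f, uint32_t lisn)
{
    f->last_routed = lisn;
    f->routed_count++;
}

/* How a given source delivers its event notifications */
typedef void (*SourceNotifyFn)(Source *src, uint32_t lisn);

struct Source {
    uint32_t       offset;  /* first LISN covered by this source */
    Fabric        *fabric;  /* shared by every source */
    SourceNotifyFn notify;  /* per-source forwarding policy */
    void          *opaque;  /* owner state for custom hooks */
};

/* Default hook: route directly, no extra hop */
static void source_notify_direct(Source *src, uint32_t lisn)
{
    fabric_route(src->fabric, lisn);
}

/* Example owner hook: forward only while the owner enables it */
static void gated_notify(Source *src, uint32_t lisn)
{
    const int *enabled = src->opaque;

    if (*enabled) {
        fabric_route(src->fabric, lisn);
    }
}

static void source_init(Source *src, Fabric *f, uint32_t offset)
{
    src->offset = offset;
    src->fabric = f;
    src->notify = source_notify_direct;
    src->opaque = NULL;
}

/* Single call path: no "if (notify) ... else route" branch */
static void source_forward(Source *src, uint32_t srcno)
{
    src->notify(src, src->offset + srcno);
}
```

With this shape the fabric pointer stays global, and only sources whose owner needs an extra hop override the hook.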

> 	static void pnv_psi_notify(XiveFabric *xf, uint32_t lisn)
> 	{
> 	    PnvPsi *psi = PNV_PSI(xf);
> 	    uint64_t notif_port =
> 	        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
> 	    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
> 	    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
> 	    uint32_t data = cpu_to_be32(lisn);
> 	
> 	    if (valid) {
> 	        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
> 	    }
> 	}
> 
> The PnvXive model handles the load and forwards to the fabric again.  
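The forwarding done by pnv_psi_notify() can be exercised outside QEMU with a small sketch: the valid flag and the address are split out of the notify-port value, the LISN is written as a 4-byte big-endian word, and the receiving side decodes it back. A byte buffer stands in for guest memory, and the position of the valid bit (the least significant bit here) is an assumption, since the PSIHB9_ESB_NOTIF_VALID definition is not part of this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* ASSUMPTION: valid flag in the least significant bit of the port value */
#define NOTIF_VALID_SKETCH  0x1ULL

/* Stand-in for guest physical memory: a small byte buffer */
typedef struct {
    uint64_t base;
    uint8_t  bytes[64];
} FakeMem;

/* Store 'lisn' big-endian at the notify address, like the
 * cpu_to_be32() + cpu_physical_memory_write() pattern above */
static bool psi_notify_sketch(FakeMem *mem, uint64_t notif_port, uint32_t lisn)
{
    bool valid = notif_port & NOTIF_VALID_SKETCH;
    uint64_t addr = notif_port & ~NOTIF_VALID_SKETCH;
    uint8_t be[4] = {
        (uint8_t)(lisn >> 24), (uint8_t)(lisn >> 16),
        (uint8_t)(lisn >> 8),  (uint8_t)lisn,
    };

    if (!valid || addr < mem->base ||
        addr - mem->base + sizeof(be) > sizeof(mem->bytes)) {
        return false;   /* port disabled or address outside the buffer */
    }
    memcpy(&mem->bytes[addr - mem->base], be, sizeof(be));
    return true;
}

/* Receiver side: decode the big-endian payload back into a LISN */
static uint32_t notify_payload_lisn(const uint8_t *p)
{
    return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
           (uint32_t)p[2] << 8 | p[3];
}
```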
> 
> The IPIs under PowerNV do not need an extra hop so they reach the 
> routing routine directly without the extra notify() hop. 
> 
> However, PowerNV should eventually use xive_fabric_route(), but
> there are some differences in how the NVT registers are updated
> (HV vs. OS mode) that are not handled yet, so it uses a notify()
> handler for now. That handler should disappear, and PowerNV should
> call xive_fabric_route() directly in the near future.
> 
> 
> Maybe XiveFabricNotifier would be a better name for this feature?
> I am adding a few ops later that are more related to routing.
> 
> Thanks,
> 
> C.
> 
> 
> > 
> >>  }
> >>  
> >>  /*
> >> @@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
> >>  static void xive_source_realize(DeviceState *dev, Error **errp)
> >>  {
> >>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +    Object *obj;
> >> +    Error *local_err = NULL;
> >> +
> >> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> >> +    if (!obj) {
> >> +        error_propagate(errp, local_err);
> >> +        error_prepend(errp, "required link 'xive' not found: ");
> >> +        return;
> >> +    }
> >> +
> >> +    xsrc->xive = XIVE_FABRIC(obj);
> >>  
> >>      if (!xsrc->nr_irqs) {
> >>          error_setg(errp, "Number of interrupt needs to be greater than 0");
> >> @@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
> >>  static void xive_register_types(void)
> >>  {
> >>      type_register_static(&xive_source_info);
> >> +    type_register_static(&xive_fabric_info);
> >>  }
> >>  
> >>  type_init(xive_register_types)
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 0b76dd278d9b..4fcae2c763e6 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -12,6 +12,8 @@
> >>  
> >>  #include "hw/sysbus.h"
> >>  
> >> +typedef struct XiveFabric XiveFabric;
> >> +
> >>  /*
> >>   * XIVE Interrupt Source
> >>   */
> >> @@ -46,6 +48,8 @@ typedef struct XiveSource {
> >>      hwaddr       esb_base;
> >>      uint32_t     esb_shift;
> >>      MemoryRegion esb_mmio;
> >> +
> >> +    XiveFabric   *xive;
> >>  } XiveSource;
> >>  
> >>  /*
> >> @@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> >>      xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
> >>  }
> >>  
> >> +/*
> >> + * XIVE Fabric
> >> + */
> >> +
> >> +typedef struct XiveFabric {
> >> +    Object parent;
> >> +} XiveFabric;
> >> +
> >> +#define TYPE_XIVE_FABRIC "xive-fabric"
> >> +#define XIVE_FABRIC(obj)                                     \
> >> +    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
> >> +#define XIVE_FABRIC_CLASS(klass)                                     \
> >> +    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
> >> +#define XIVE_FABRIC_GET_CLASS(obj)                                   \
> >> +    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
> >> +
> >> +typedef struct XiveFabricClass {
> >> +    InterfaceClass parent;
> >> +    void (*notify)(XiveFabric *xf, uint32_t lisn);
> >> +} XiveFabricClass;
> >> +
> >>  #endif /* PPC_XIVE_H */
> > 
> 

* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR Cédric Le Goater
@ 2018-04-24  6:51   ` David Gibson
  2018-04-24  9:46     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-24  6:51 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
> sPAPRXive is a model for the XIVE interrupt controller device of the
> sPAPR machine. It holds the routing XIVE table, the Interrupt
> Virtualization Entry (IVE) table which associates interrupt source
> numbers with targets.
> 
> Also extend the XiveFabric with an accessor to the IVT. This will be
> needed by the routing algorithm.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
> 
>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
>  discussed.

Yeah, maybe.  Am I correct in thinking that on pnv there could be more
than one XiveRouter?

If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
interface, possibly its methods could just be class methods of
XiveRouter.
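If the IVT lookup became a class method of a XiveRouter, the generic routing code would only need that one hook. Below is a minimal plain-C sketch with a hand-rolled method table standing in for QOM classes. The type names are hypothetical, the sPAPR lookup mirrors spapr_xive_get_ive() from this patch, and dropping events on an invalid or masked IVE is an assumption about the routing step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

typedef struct { uint64_t w; } IveSketch;

#define IVE_VALID_S   (1ULL << 63)   /* PPC_BIT(0) */
#define IVE_MASKED_S  (1ULL << 31)   /* PPC_BIT(32) */

typedef struct RouterSketch RouterSketch;

/* "Class" carrying the lookup as a method instead of a separate interface */
typedef struct {
    IveSketch *(*get_ive)(RouterSketch *r, uint32_t lisn);
} RouterClassSketch;

struct RouterSketch {
    const RouterClassSketch *klass;
};

/* sPAPR-like subclass: one flat table indexed by LISN */
typedef struct {
    RouterSketch parent;      /* first member, so the downcast below is valid */
    IveSketch   *ivt;
    uint32_t     nr_irqs;
} SpaprRouterSketch;

static IveSketch *spapr_get_ive(RouterSketch *r, uint32_t lisn)
{
    SpaprRouterSketch *s = (SpaprRouterSketch *)r;

    return lisn < s->nr_irqs ? &s->ivt[lisn] : NULL;
}

static const RouterClassSketch spapr_router_class = { .get_ive = spapr_get_ive };

/* Generic routing entry point: true if the event proceeds to the EQ */
static bool router_route(RouterSketch *r, uint32_t lisn)
{
    IveSketch *ive = r->klass->get_ive(r, lisn);

    if (!ive || !(ive->w & IVE_VALID_S)) {
        return false;              /* unknown or unprovisioned source */
    }
    if (ive->w & IVE_MASKED_S) {
        return false;              /* masked: drop the event */
    }
    return true;                   /* next step (EQ lookup) not shown */
}
```

A PowerNV router would provide a different get_ive() (for example one per chip), while router_route() stays shared.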

> 
>  Changes since v2 :
> 
>  - introduced the XiveFabric interface
> 
>  default-configs/ppc64-softmmu.mak |   1 +
>  hw/intc/Makefile.objs             |   1 +
>  hw/intc/spapr_xive.c              | 159 ++++++++++++++++++++++++++++++++++++++
>  hw/intc/xive.c                    |   7 ++
>  include/hw/ppc/spapr_xive.h       |  31 ++++++++
>  include/hw/ppc/xive.h             |   5 ++
>  include/hw/ppc/xive_regs.h        |  33 ++++++++
>  7 files changed, 237 insertions(+)
>  create mode 100644 hw/intc/spapr_xive.c
>  create mode 100644 include/hw/ppc/spapr_xive.h
>  create mode 100644 include/hw/ppc/xive_regs.h
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index c6d13e757977..f8d34722931d 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -17,4 +17,5 @@ CONFIG_XICS=$(CONFIG_PSERIES)
>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>  CONFIG_XIVE=$(CONFIG_PSERIES)
> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_MEM_HOTPLUG=y
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index 72a46ed91c31..301a8e972d91 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>  obj-$(CONFIG_XIVE) += xive.o
> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> new file mode 100644
> index 000000000000..020444e2665a
> --- /dev/null
> +++ b/hw/intc/spapr_xive.c
> @@ -0,0 +1,159 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "monitor/monitor.h"
> +#include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/xive_regs.h"
> +
> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> +{
> +    int i;
> +
> +    monitor_printf(mon, "IVE Table\n");
> +    for (i = 0; i < xive->nr_irqs; i++) {
> +        XiveIVE *ive = &xive->ivt[i];
> +
> +        if (!(ive->w & IVE_VALID)) {
> +            continue;
> +        }
> +
> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
> +                       ive->w & IVE_MASKED ? "M" : " ",
> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
> +    }
> +}
> +
> +static void spapr_xive_reset(DeviceState *dev)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    int i;
> +
> +    /* Mask all valid IVEs in the IRQ number space. */
> +    for (i = 0; i < xive->nr_irqs; i++) {
> +        XiveIVE *ive = &xive->ivt[i];
> +        if (ive->w & IVE_VALID) {
> +            ive->w |= IVE_MASKED;
> +        }
> +    }
> +}
> +
> +static void spapr_xive_init(Object *obj)

I'm trying to standardize on instance_init methods being called
*_instance_init().  It helps to make it obvious that this is indeed an
instance_init() method, rather than one of the various other init
calls that exist in various places.

> +{
> +
> +}
> +
> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +
> +    if (!xive->nr_irqs) {
> +        error_setg(errp, "Number of interrupt needs to be greater 0");
> +        return;
> +    }
> +
> +    /* Allocate the Interrupt Virtualization Table */
> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> +}
> +
> +static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(xf);
> +
> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> +}
> +
> +static const VMStateDescription vmstate_spapr_xive_ive = {
> +    .name = TYPE_SPAPR_XIVE "/ive",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField []) {
> +        VMSTATE_UINT64(w, XiveIVE),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static const VMStateDescription vmstate_spapr_xive = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
> +                                     vmstate_spapr_xive_ive, XiveIVE),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static Property spapr_xive_properties[] = {
> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
> +
> +    dc->realize = spapr_xive_realize;
> +    dc->reset = spapr_xive_reset;
> +    dc->props = spapr_xive_properties;
> +    dc->desc = "sPAPR XIVE interrupt controller";
> +    dc->vmsd = &vmstate_spapr_xive;
> +
> +    xfc->get_ive = spapr_xive_get_ive;
> +}
> +
> +static const TypeInfo spapr_xive_info = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .parent = TYPE_SYS_BUS_DEVICE,
> +    .instance_init = spapr_xive_init,
> +    .instance_size = sizeof(sPAPRXive),
> +    .class_init = spapr_xive_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +            { TYPE_XIVE_FABRIC },
> +            { },
> +    },
> +};
> +
> +static void spapr_xive_register_types(void)
> +{
> +    type_register_static(&spapr_xive_info);
> +}
> +
> +type_init(spapr_xive_register_types)
> +
> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
> +{
> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> +
> +    if (!ive) {
> +        return false;
> +    }
> +
> +    ive->w |= IVE_VALID;
> +    return true;
> +}
> +
> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> +{
> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> +
> +    if (!ive) {
> +        return false;
> +    }
> +
> +    ive->w &= ~IVE_VALID;
> +    return true;
> +}
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index b4c3d06c1219..dccad0318834 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -20,6 +20,13 @@
>   * XIVE Fabric
>   */
>  
> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
> +{
> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
> +
> +    return xfc->get_ive(xf, lisn);
> +}
> +
>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>  {
>  
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> new file mode 100644
> index 000000000000..1d966b5d3a96
> --- /dev/null
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -0,0 +1,31 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_SPAPR_XIVE_H
> +#define PPC_SPAPR_XIVE_H
> +
> +#include "hw/sysbus.h"
> +#include "hw/ppc/xive.h"
> +
> +#define TYPE_SPAPR_XIVE "spapr-xive"
> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> +
> +typedef struct sPAPRXive {
> +    SysBusDevice parent;
> +
> +    /* Routing table */
> +    XiveIVE      *ivt;
> +    uint32_t     nr_irqs;
> +} sPAPRXive;
> +
> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> +
> +#endif /* PPC_SPAPR_XIVE_H */
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 4fcae2c763e6..5b145816acdc 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -11,6 +11,7 @@
>  #define PPC_XIVE_H
>  
>  #include "hw/sysbus.h"
> +#include "hw/ppc/xive_regs.h"
>  
>  typedef struct XiveFabric XiveFabric;
>  
> @@ -166,6 +167,10 @@ typedef struct XiveFabric {
>  typedef struct XiveFabricClass {
>      InterfaceClass parent;
>      void (*notify)(XiveFabric *xf, uint32_t lisn);
> +
> +    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>  } XiveFabricClass;
>  
> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> new file mode 100644
> index 000000000000..5903f29eb789
> --- /dev/null
> +++ b/include/hw/ppc/xive_regs.h
> @@ -0,0 +1,33 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2016-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef _PPC_XIVE_REGS_H
> +#define _PPC_XIVE_REGS_H
> +
> +/* IVE/EAS
> + *
> + * One per interrupt source. Targets that interrupt to a given EQ
> + * and provides the corresponding logical interrupt number (EQ data)
> + *
> + * We also map this structure to the escalation descriptor inside
> + * an EQ, though in that case the valid and masked bits are not used.
> + */
> +typedef struct XiveIVE {
> +        /* Use a single 64-bit definition to make it easier to
> +         * perform atomic updates
> +         */
> +        uint64_t        w;
> +#define IVE_VALID       PPC_BIT(0)
> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
> +} XiveIVE;
> +
> +#endif /* _PPC_XIVE_REGS_H */
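To make the IVE layout concrete, the following sketch packs and unpacks the fields defined in xive_regs.h. It assumes IBM bit numbering (bit 0 is the most significant bit), as PPC_BIT()/PPC_BITMASK() use. The macros are redefined locally so the example is self-contained, GETFIELD/SETFIELD are local re-implementations relying on the GCC/Clang __builtin_ctzll() builtin, and ive_reset() mirrors the masking loop in spapr_xive_reset().

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word */
#define PPC_BIT(bit)            (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be)     ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* Local field helpers: shift by the position of the mask's lowest set bit */
#define GETFIELD(mask, word)       (((word) & (mask)) >> __builtin_ctzll(mask))
#define SETFIELD(mask, word, val)  \
    (((word) & ~(mask)) | (((uint64_t)(val) << __builtin_ctzll(mask)) & (mask)))

#define IVE_VALID       PPC_BIT(0)
#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)
#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)
#define IVE_MASKED      PPC_BIT(32)
#define IVE_EQ_DATA     PPC_BITMASK(33, 63)

/* Build a valid, unmasked IVE targeting EQ (block, index) with 'data' */
static uint64_t ive_encode(uint32_t block, uint32_t index, uint32_t data)
{
    uint64_t w = IVE_VALID;

    w = SETFIELD(IVE_EQ_BLOCK, w, block);
    w = SETFIELD(IVE_EQ_INDEX, w, index);
    w = SETFIELD(IVE_EQ_DATA, w, data);
    return w;
}

/* Per-entry reset: mask only the IVEs that are valid */
static uint64_t ive_reset(uint64_t w)
{
    return (w & IVE_VALID) ? (w | IVE_MASKED) : w;
}
```

With these definitions IVE_EQ_INDEX is 0x00FFFFFF00000000 (24 bits) and IVE_EQ_DATA is 0x7FFFFFFF (31 bits).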

* Re: [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model Cédric Le Goater
@ 2018-04-24  6:58   ` David Gibson
  2018-04-24  8:19     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-24  6:58 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 19, 2018 at 02:43:01PM +0200, Cédric Le Goater wrote:
> Bare-metal systems (PowerNV) have multiple interrupt sources. The
> XIVE interrupt controller has an internal source for IPIs and generic
> IPIs, the PSIHB has one, and so do the PHBs. For simplicity on the
> sPAPR machine, however, we use a single XiveSource object for all
> IPIs and virtual device interrupts of the VM.
> 
> The ESB MMIO region used to control the sources is mapped at the
> address of chip 0 of a real system and only the provisioned IRQ
> numbers are covered.

Is that MMIO address PAPR specified, or arbitrary?

> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/intc/spapr_xive.c        | 34 ++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr_xive.h |  3 +++
>  include/hw/ppc/xive.h       |  6 ++++++
>  3 files changed, 43 insertions(+)
> 
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index 020444e2665a..90cde8a4082d 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -14,12 +14,15 @@
>  #include "sysemu/cpus.h"
>  #include "monitor/monitor.h"
>  #include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/xive.h"
>  #include "hw/ppc/xive_regs.h"
>  
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>  {
>      int i;
>  
> +    xive_source_pic_print_info(&xive->source, mon);
> +
>      monitor_printf(mon, "IVE Table\n");
>      for (i = 0; i < xive->nr_irqs; i++) {
>          XiveIVE *ive = &xive->ivt[i];
> @@ -40,6 +43,9 @@ static void spapr_xive_reset(DeviceState *dev)
>      sPAPRXive *xive = SPAPR_XIVE(dev);
>      int i;
>  
> +    /* Xive Source reset is done through SysBus, it should put all
> +     * IRQs to OFF (!P|Q) */
> +
>      /* Mask all valid IVEs in the IRQ number space. */
>      for (i = 0; i < xive->nr_irqs; i++) {
>          XiveIVE *ive = &xive->ivt[i];
> @@ -51,18 +57,42 @@ static void spapr_xive_reset(DeviceState *dev)
>  
>  static void spapr_xive_init(Object *obj)
>  {
> +    sPAPRXive *xive = SPAPR_XIVE(obj);
>  
> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>  }
>  
>  static void spapr_xive_realize(DeviceState *dev, Error **errp)
>  {
>      sPAPRXive *xive = SPAPR_XIVE(dev);
> +    XiveSource *xsrc = &xive->source;
> +    Error *local_err = NULL;
>  
>      if (!xive->nr_irqs) {
>          error_setg(errp, "Number of interrupt needs to be greater 0");
>          return;
>      }
>  
> +    /* The XIVE interrupt controller has an internal source for IPIs
> +     * and generic IPIs, the PSIHB has one and also the PHBs. For
> +     * simplicity, we use a unique XIVE source object for *all*
> +     * interrupts on sPAPR. The ESBs pages are mapped at the address
> +     * of chip 0 of a real system.
> +     */
> +    object_property_set_int(OBJECT(xsrc), XIVE_VC_BASE, "bar",
> +                            &error_fatal);
> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
> +                            &error_fatal);
> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
> +                                   &error_fatal);
> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
> +
>      /* Allocate the Interrupt Virtualization Table */
>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>  }
> @@ -137,23 +167,27 @@ type_init(spapr_xive_register_types)
>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
>  {
>      XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> +    XiveSource *xsrc = &xive->source;
>  
>      if (!ive) {
>          return false;
>      }
>  
>      ive->w |= IVE_VALID;
> +    xive_source_irq_set(xsrc, lisn - xsrc->offset, lsi);
>      return true;
>  }
>  
>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>  {
>      XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> +    XiveSource *xsrc = &xive->source;
>  
>      if (!ive) {
>          return false;
>      }
>  
>      ive->w &= ~IVE_VALID;
> +    xive_source_irq_set(xsrc, lisn - xsrc->offset, false);
>      return true;
>  }
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 1d966b5d3a96..4538c622b60a 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -19,6 +19,9 @@
>  typedef struct sPAPRXive {
>      SysBusDevice parent;
>  
> +    /* Internal interrupt source for IPIs and virtual devices */
> +    XiveSource   source;
> +
>      /* Routing table */
>      XiveIVE      *ivt;
>      uint32_t     nr_irqs;
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 5b145816acdc..57295715a4a5 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -16,6 +16,12 @@
>  typedef struct XiveFabric XiveFabric;
>  
>  /*
> + * XIVE MMIO regions
> + */
> +
> +#define XIVE_VC_BASE   0x0006010000000000ull
> +
> +/*
>   * XIVE Interrupt Source
>   */
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-24  6:41       ` David Gibson
@ 2018-04-24  8:11         ` Cédric Le Goater
  2018-04-26  3:28           ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-24  8:11 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/24/2018 08:41 AM, David Gibson wrote:
> On Mon, Apr 23, 2018 at 09:31:24AM +0200, Cédric Le Goater wrote:
>> On 04/23/2018 08:44 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
>>>> The 'sent' status of the LSI interrupt source is modeled with the 'P'
>>>> bit of the ESB and the assertion status of the source is maintained in
>>>> an array under the main sPAPRXive object. The type of the source is
>>>> stored in the same array for practical reasons.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
>>>>  include/hw/ppc/xive.h | 16 +++++++++++++++
>>>>  2 files changed, 66 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index c70578759d02..060976077dd7 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>>  
>>>>  }
>>>>  
>>>> +/*
>>>> + * LSI interrupt sources use the P bit and a custom assertion flag
>>>> + */
>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>>>> +
>>>> +    if  (old_pq == XIVE_ESB_RESET &&
>>>> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>>>> +        return true;
>>>> +    }
>>>> +    return false;
>>>> +}
>>>> +
>>>>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
>>>>   * page is for management */
>>>>  static inline bool xive_source_is_trigger_page(hwaddr addr)
>>>> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>>>>           */
>>>>          ret = xive_source_pq_eoi(xsrc, srcno);
>>>>  
>>>> +        /* If the LSI source is still asserted, forward a new source
>>>> +         * event notification */
>>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
>>>> +                xive_source_notify(xsrc, srcno);
>>>> +            }
>>>> +        }
>>>>          break;
>>>>  
>>>>      case XIVE_ESB_GET:
>>>> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
>>>>           * notification
>>>>           */
>>>>          notify = xive_source_pq_eoi(xsrc, srcno);
>>>> +
>>>> +        /* LSI sources do not set the Q bit but they can still be
>>>> +         * asserted, in which case we should forward a new source
>>>> +         * event notification
>>>> +         */
>>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>> +            notify = xive_source_lsi_trigger(xsrc, srcno);
>>>> +        }
>>
>> FYI, I have moved that common test under xive_source_pq_eoi()
> 
> Ok.
> 
>>>>          break;
>>>>  
>>>>      default:
>>>> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
>>>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>>      bool notify = false;
>>>>  
>>>> -    if (val) {
>>>> -        notify = xive_source_pq_trigger(xsrc, srcno);
>>>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>> +        if (val) {
>>>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
>>>> +        } else {
>>>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
>>>> +        }
>>>> +        notify = xive_source_lsi_trigger(xsrc, srcno);
>>>> +    } else {
>>>> +        if (val) {
>>>> +            notify = xive_source_pq_trigger(xsrc, srcno);
>>>> +        }
>>>>      }
>>>>  
>>>>      /* Forward the source event notification for routing */
>>>> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>>>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
>>>>      for (i = 0; i < xsrc->nr_irqs; i++) {
>>>>          uint8_t pq = xive_source_pq_get(xsrc, i);
>>>> -        uint32_t lisn = i  + xsrc->offset;
>>>>  
>>>>          if (pq == XIVE_ESB_OFF) {
>>>>              continue;
>>>>          }
>>>>  
>>>> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
>>>> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
>>>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>>>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>>>>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>>>>      }
>>>> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>>>  static void xive_source_reset(DeviceState *dev)
>>>>  {
>>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +    int i;
>>>> +
>>>> +    /* Keep the IRQ type */
>>>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>>>> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
>>>> +    }
>>>>  
>>>>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>>>>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
>>>> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>>>>  
>>>>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>>>>                                       xsrc->nr_irqs);
>>>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
>>>>  
>>>>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>>>>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index d92a50519edf..0b76dd278d9b 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -33,6 +33,9 @@ typedef struct XiveSource {
>>>>      uint32_t     nr_irqs;
>>>>      uint32_t     offset;
>>>>      qemu_irq     *qirqs;
>>>> +#define XIVE_STATUS_LSI         0x1
>>>> +#define XIVE_STATUS_ASSERTED    0x2
>>>> +    uint8_t      *status;
>>>
>>> I don't love the idea of mixing configuration information (STATUS_LSI)
>>> with runtime state information (ASSERTED) in the same field.  Any
>>> reason not to have these as parallel bitmaps?
>>
>> none. I can change that. 
> 
> Ok.
> 
>>> Come to that.. is there a compelling reason to allow any individual
>>> irq to be marked LSI or MSI, rather than using separate XiveSource
>>> objects for MSIs and LSIs?
>>
>> yes. I would have preferred two distinct interrupt source objects but 
>> this is to be compatible with XICS, which uses only one. If we want
>> to be able to change interrupt mode, the IRQ number space should be
>> organized in the exact same way. Or we should change XICS also.
>>
>> Also, the change (a bitmap) is really small.
> 
> Hrm, but since XIVE supports thousands of irqs, it could be quite a
> large bitmap.

Yes. The change is small, not the bitmap.
 
> It's not impossible - in fact, not really even that hard - to change
> the existing irq layout on xics.  It does need a new machine type
> variant, of course.

I did some work on that topic a while ago:

	https://patchwork.ozlabs.org/cover/836782/

But we stopped exploring the idea. Maybe it was not the right approach.
The PHB LSIs would benefit from such a split, though.

>>>>      /* PQ bits */
>>>>      uint8_t      *sbe;
>>>
>>> .. and come to that is there a reason to keep the ASSERTED bit in a
>>> separate array from sbe?  AFAICT the actual 2-bit-per-irq layout is
>>> never exposed to the guests.
>>
>> Indeed. We always use the xive_source_pq_get/set() helpers to
>> manipulate the PQ bits, so we could add an extra bit for ASSERTED
>> without too many changes. Could we also put the type there, or would
>> you still prefer a bitmap?
> 
> I'd prefer the type (config information) be separate from the P, Q,
> ASSERTED bits (state information).

OK. So I will use the 'uint8_t *status' array for P, Q and ASSERTED,
which leaves 5 bits unused, but I don't think it is worth the pain to
optimize the size further. The sbe array will disappear, and we will
have a separate bitmap for the type.
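A minimal standalone sketch of that layout, in plain C rather than QEMU
code: one status byte per source carrying P, Q and ASSERTED (runtime
state), plus a separate one-bit-per-source LSI bitmap (configuration).
The PQ encodings and transitions follow the trigger/EOI behaviour of
the patch above; the ASSERTED bit position and all demo_* names are
assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* PQ encodings from this thread; the ASSERTED bit position is assumed */
#define DEMO_ESB_RESET        0x0   /* 00 */
#define DEMO_ESB_OFF          0x1   /* 01 */
#define DEMO_ESB_PENDING      0x2   /* 10 */
#define DEMO_ESB_QUEUED       0x3   /* 11 */
#define DEMO_PQ_MASK          0x3
#define DEMO_STATUS_ASSERTED  0x4

typedef struct DemoSource {
    uint32_t nr_irqs;
    uint8_t *status;   /* state: P, Q and ASSERTED, one byte per source */
    uint8_t *lsi_map;  /* config: one bit per source, set for LSIs */
} DemoSource;

static void demo_source_init(DemoSource *s, uint32_t nr_irqs)
{
    s->nr_irqs = nr_irqs;
    s->status = malloc(nr_irqs);
    s->lsi_map = calloc((nr_irqs + 7) / 8, 1);
    assert(s->status && s->lsi_map);
    memset(s->status, DEMO_ESB_OFF, nr_irqs);   /* all sources "off" */
}

static bool demo_is_lsi(const DemoSource *s, uint32_t srcno)
{
    assert(srcno < s->nr_irqs);
    return s->lsi_map[srcno / 8] & (1u << (srcno % 8));
}

static void demo_set_lsi(DemoSource *s, uint32_t srcno)
{
    assert(srcno < s->nr_irqs);
    s->lsi_map[srcno / 8] |= 1u << (srcno % 8);
}

static uint8_t demo_pq(const DemoSource *s, uint32_t srcno)
{
    return s->status[srcno] & DEMO_PQ_MASK;
}

static void demo_set_pq(DemoSource *s, uint32_t srcno, uint8_t pq)
{
    s->status[srcno] = (s->status[srcno] & ~DEMO_PQ_MASK) | pq;
}

/* LSI: fire only from 00, and only while the line is asserted */
static bool demo_lsi_trigger(DemoSource *s, uint32_t srcno)
{
    if (demo_pq(s, srcno) != DEMO_ESB_RESET ||
        !(s->status[srcno] & DEMO_STATUS_ASSERTED)) {
        return false;
    }
    demo_set_pq(s, srcno, DEMO_ESB_PENDING);
    return true;
}

/* MSI: 00 -> 10 forwards, 10/11 -> 11 queues, 01 drops */
static bool demo_msi_trigger(DemoSource *s, uint32_t srcno)
{
    uint8_t pq = demo_pq(s, srcno);

    if (pq == DEMO_ESB_OFF) {
        return false;
    }
    demo_set_pq(s, srcno,
                pq == DEMO_ESB_RESET ? DEMO_ESB_PENDING : DEMO_ESB_QUEUED);
    return pq == DEMO_ESB_RESET;
}

/* Input line change; returns true when the event must be routed */
static bool demo_set_irq(DemoSource *s, uint32_t srcno, bool level)
{
    if (!demo_is_lsi(s, srcno)) {
        return level && demo_msi_trigger(s, srcno);
    }
    s->status[srcno] = level ? (s->status[srcno] | DEMO_STATUS_ASSERTED)
                             : (s->status[srcno] & ~DEMO_STATUS_ASSERTED);
    return demo_lsi_trigger(s, srcno);
}

/* EOI: 10 -> 00, 11 -> 10 (route again); an asserted LSI re-fires */
static bool demo_eoi(DemoSource *s, uint32_t srcno)
{
    uint8_t pq = demo_pq(s, srcno);
    bool notify = (pq == DEMO_ESB_QUEUED);

    if (pq != DEMO_ESB_OFF) {
        demo_set_pq(s, srcno, notify ? DEMO_ESB_PENDING : DEMO_ESB_RESET);
    }
    if (demo_is_lsi(s, srcno)) {
        notify = demo_lsi_trigger(s, srcno);
    }
    return notify;
}
```

With 8192 sources, for example, the status array takes 8 KiB and the
type bitmap 1 KiB. Keeping the type out of the status byte also means
a reset can rewrite the whole state array at once, instead of looping
to preserve the type bits as xive_source_reset() does in the patch.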

Thanks,

C. 

>>> Or, even re-use the Q bit for asserted in LSIs (but report it as
>>> always 0 in the register read/write path).
>>
>> I would prefer to add extra status bits. It is easier to debug.
>>
>> Thanks,
>>
>> C.
>>
>>>> @@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>>>>  
>>>>  void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
>>>>  
>>>> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +    return xsrc->status[srcno] & XIVE_STATUS_LSI;
>>>> +}
>>>> +
>>>> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>>> +                                       bool lsi)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>>>> +}
>>>> +
>>>>  #endif /* PPC_XIVE_H */
>>>
>>
> 

* Re: [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model
  2018-04-24  6:58   ` David Gibson
@ 2018-04-24  8:19     ` Cédric Le Goater
  2018-04-26  4:46       ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-24  8:19 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/24/2018 08:58 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:43:01PM +0200, Cédric Le Goater wrote:
>> Bare-metal systems (PowerNV) have multiple interrupt sources. The
>> XIVE interrupt controller has an internal source for IPIs and generic
>> IPIs, the PSIHB has one and also the PHBs. But, for simplicity on the
>> sPAPR machine, we use a unique XiveSource object for all IPIs and
>> virtual device interrupts of the VM.
>>
>> The ESB MMIO region used to control the sources is mapped at the
>> address of chip 0 of a real system and only the provisioned IRQ
>> numbers are covered.
> 
> Is that MMIO address PAPR specified, or arbitrary?

There is no specified value for the ESB address. It's queried by
the guest using the H_INT_GET_SOURCE_INFO hcall. For KVM, I have
introduced an ioctl to configure the KVM device.

Same for the TIMA, but in this case, the address is exposed to the
guest in the device tree.
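Whatever base is chosen, the hypervisor and the ESB MMIO handlers only
need to agree on one address convention. A sketch, assuming the
two-page-per-source layout used by the XiveSource model (even page for
triggers, odd page for management) and the XIVE_VC_BASE value from
this patch; the helper names and the 64K page shift used in the check
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_VC_BASE   0x0006010000000000ull   /* value used in this patch */

/* Hypothetical layout: each source owns two consecutive pages of
 * (1 << esb_shift) bytes, the trigger page then the management page. */
static uint64_t demo_esb_trigger_addr(uint64_t base, unsigned esb_shift,
                                      uint32_t srcno)
{
    assert(esb_shift >= 12 && esb_shift < 32);  /* sanity check on page size */
    return base + ((uint64_t)srcno << (esb_shift + 1));
}

static uint64_t demo_esb_mgmt_addr(uint64_t base, unsigned esb_shift,
                                   uint32_t srcno)
{
    return demo_esb_trigger_addr(base, esb_shift, srcno) + (1ull << esb_shift);
}

/* Decode an offset within the ESB MMIO region back to a source number */
static uint32_t demo_esb_srcno(uint64_t offset, unsigned esb_shift)
{
    return (uint32_t)(offset >> (esb_shift + 1));
}

/* Even page is the trigger page, odd page is for management */
static bool demo_esb_is_trigger_page(uint64_t offset, unsigned esb_shift)
{
    return !((offset >> esb_shift) & 1);
}
```

With esb_shift = 16, source 3 triggers at XIVE_VC_BASE + 0x60000 and is
managed at XIVE_VC_BASE + 0x70000.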
 
> 
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/intc/spapr_xive.c        | 34 ++++++++++++++++++++++++++++++++++
>>  include/hw/ppc/spapr_xive.h |  3 +++
>>  include/hw/ppc/xive.h       |  6 ++++++
>>  3 files changed, 43 insertions(+)
>>
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index 020444e2665a..90cde8a4082d 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -14,12 +14,15 @@
>>  #include "sysemu/cpus.h"
>>  #include "monitor/monitor.h"
>>  #include "hw/ppc/spapr_xive.h"
>> +#include "hw/ppc/xive.h"
>>  #include "hw/ppc/xive_regs.h"
>>  
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>  {
>>      int i;
>>  
>> +    xive_source_pic_print_info(&xive->source, mon);
>> +
>>      monitor_printf(mon, "IVE Table\n");
>>      for (i = 0; i < xive->nr_irqs; i++) {
>>          XiveIVE *ive = &xive->ivt[i];
>> @@ -40,6 +43,9 @@ static void spapr_xive_reset(DeviceState *dev)
>>      sPAPRXive *xive = SPAPR_XIVE(dev);
>>      int i;
>>  
>> +    /* Xive Source reset is done through SysBus, it should put all
>> +     * IRQs to OFF (!P|Q) */
>> +
>>      /* Mask all valid IVEs in the IRQ number space. */
>>      for (i = 0; i < xive->nr_irqs; i++) {
>>          XiveIVE *ive = &xive->ivt[i];
>> @@ -51,18 +57,42 @@ static void spapr_xive_reset(DeviceState *dev)
>>  
>>  static void spapr_xive_init(Object *obj)
>>  {
>> +    sPAPRXive *xive = SPAPR_XIVE(obj);
>>  
>> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
>> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>>  }
>>  
>>  static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>  {
>>      sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    XiveSource *xsrc = &xive->source;
>> +    Error *local_err = NULL;
>>  
>>      if (!xive->nr_irqs) {
>>          error_setg(errp, "Number of interrupt needs to be greater 0");
>>          return;
>>      }
>>  
>> +    /* The XIVE interrupt controller has an internal source for IPIs
>> +     * and generic IPIs, the PSIHB has one and also the PHBs. For
>> +     * simplicity, we use a unique XIVE source object for *all*
>> +     * interrupts on sPAPR. The ESBs pages are mapped at the address
>> +     * of chip 0 of a real system.
>> +     */
>> +    object_property_set_int(OBJECT(xsrc), XIVE_VC_BASE, "bar",
>> +                            &error_fatal);
>> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
>> +                            &error_fatal);
>> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
>> +                                   &error_fatal);
>> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
>> +
>>      /* Allocate the Interrupt Virtualization Table */
>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>  }
>> @@ -137,23 +167,27 @@ type_init(spapr_xive_register_types)
>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
>>  {
>>      XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
>> +    XiveSource *xsrc = &xive->source;
>>  
>>      if (!ive) {
>>          return false;
>>      }
>>  
>>      ive->w |= IVE_VALID;
>> +    xive_source_irq_set(xsrc, lisn - xsrc->offset, lsi);
>>      return true;
>>  }
>>  
>>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>>  {
>>      XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
>> +    XiveSource *xsrc = &xive->source;
>>  
>>      if (!ive) {
>>          return false;
>>      }
>>  
>>      ive->w &= ~IVE_VALID;
>> +    xive_source_irq_set(xsrc, lisn - xsrc->offset, false);
>>      return true;
>>  }
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 1d966b5d3a96..4538c622b60a 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -19,6 +19,9 @@
>>  typedef struct sPAPRXive {
>>      SysBusDevice parent;
>>  
>> +    /* Internal interrupt source for IPIs and virtual devices */
>> +    XiveSource   source;
>> +
>>      /* Routing table */
>>      XiveIVE      *ivt;
>>      uint32_t     nr_irqs;
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 5b145816acdc..57295715a4a5 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -16,6 +16,12 @@
>>  typedef struct XiveFabric XiveFabric;
>>  
>>  /*
>> + * XIVE MMIO regions
>> + */
>> +
>> +#define XIVE_VC_BASE   0x0006010000000000ull
>> +
>> +/*
>>   * XIVE Interrupt Source
>>   */
>>  
> 

* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-24  6:46       ` David Gibson
@ 2018-04-24  9:33         ` Cédric Le Goater
  2018-04-26  3:54           ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-24  9:33 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/24/2018 08:46 AM, David Gibson wrote:
> On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
>> On 04/23/2018 08:46 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
>>>> The XiveFabric offers a simple interface, between the XiveSource
>>>> object and the device model owning the interrupt sources, to forward
>>>> an event notification to the XIVE interrupt controller of the machine
>>>> and if the owner is the controller, to call directly the routing
>>>> sub-engine.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
>>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
>>>>  2 files changed, 61 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index 060976077dd7..b4c3d06c1219 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -17,6 +17,21 @@
>>>>  #include "hw/ppc/xive.h"
>>>>  
>>>>  /*
>>>> + * XIVE Fabric
>>>> + */
>>>> +
>>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
>>>> +{
>>>> +
>>>> +}
>>>> +
>>>> +static const TypeInfo xive_fabric_info = {
>>>> +    .name = TYPE_XIVE_FABRIC,
>>>> +    .parent = TYPE_INTERFACE,
>>>> +    .class_size = sizeof(XiveFabricClass),
>>>> +};
>>>> +
>>>> +/*
>>>>   * XIVE Interrupt Source
>>>>   */
>>>>  
>>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>>>>  
>>>>  /*
>>>>   * Forward the source event notification to the associated XiveFabric,
>>>> - * the device owning the sources.
>>>> + * the device owning the sources, or perform the routing if the device
>>>> + * is the interrupt controller.
>>>>   */
>>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>>  {
>>>>  
>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
>>>> +
>>>> +    if (xfc->notify) {
>>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
>>>> +    } else {
>>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
>>>> +    }
>>>
>>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
>>> to xive_fabric_route if that's what it wants?
>> Under sPAPR, all the sources, IPIs and virtual device interrupts, 
>> generate events which are directly routed by xive_fabric_route(). 
>> There is no need for an extra hop. Indeed.
> 
> Ok.
> 
>> Under PowerNV, some sources forward the notification to the routing 
>> engine using a specific MMIO load on a notify address which is stored 
>> in one of the controller registers. So we need a hop to reach the 
>> device model, owning the sources, and do that load :
> 
> Hm.  So you're saying that in pnv some sources send their notification
> to some other unit, 

Not to an arbitrary unit/device: to the device owning the sources.

For the XiveSource object under PSI, the XIVEFabric interface is the
PSI device object itself, which knows how to forward the notification
on the XIVE Power "bus". To be more precise, the PSI HB device has
14 interrupt sources, whose notifications are forwarded using an MMIO
load to a configurable address. The load address is configured (by
skiboot) in one of the PSI device registers, and points to an MMIO
region of the main XIVE interrupt controller.

The PHB4 sources should be the same.

For the XiveSource object (all interrupts) under sPAPRXive, the 
XIVEFabric is the main interrupt controller sPAPRXive.

For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
also the main interrupt controller PnvXive.

> that would then (after possible masking) forward on to the overall
> xive fabric?

Yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?

> That seems like a property of the source object, 

The source object is generic. It's a bunch of PQ bits that can be 
controlled by MMIOs. Nothing more.  

> rather than a
> property of the fabric.  Indeed varying this by source object would
> require the objects have a different xive pointer, when I thought the
> idea was that the XiveFabric was global.

When a notification is forwarded, the source needs to call an
interface, generally implemented by the source owner, which is not
necessarily the main IC.
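A standalone sketch of the single-hook arrangement discussed here: each
source block points at its owner, and the owner's notify hook either
routes directly (the sPAPR case, where the hook is just the routing
function, so no second code path is needed) or forwards through a
firmware-configured notify port that the main controller then routes
(the PSI case). The types, the valid bit position and the direct call
standing in for the MMIO access are all hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct DemoFabric DemoFabric;

/* Single forwarding hook, implemented by whichever object owns the sources */
struct DemoFabric {
    void (*notify)(DemoFabric *xf, uint32_t lisn);
};

/* Main interrupt controller: routes events itself */
typedef struct DemoController {
    DemoFabric fabric;        /* first member, so casting back is valid */
    uint32_t last_routed;
    unsigned nr_routed;
} DemoController;

static void demo_route(DemoController *ic, uint32_t lisn)
{
    ic->last_routed = lisn;   /* stand-in for the IVE/EQ/NVT lookup */
    ic->nr_routed++;
}

static void demo_controller_notify(DemoFabric *xf, uint32_t lisn)
{
    demo_route((DemoController *)xf, lisn);
}

/* Hypothetical notify-port register: bit 63 = valid, rest = address */
#define DEMO_NOTIF_VALID (1ull << 63)

typedef struct DemoPsi {
    DemoFabric fabric;        /* first member */
    uint64_t notif_port;      /* configured by firmware */
    DemoController *ic;       /* what the port address decodes to */
} DemoPsi;

static void demo_psi_notify(DemoFabric *xf, uint32_t lisn)
{
    DemoPsi *psi = (DemoPsi *)xf;

    /* Drop the event until firmware has configured the port */
    if (psi->notif_port & DEMO_NOTIF_VALID) {
        demo_route(psi->ic, lisn);   /* models the MMIO access */
    }
}

typedef struct DemoSource {
    DemoFabric *xive;   /* owner of this source block */
    uint32_t offset;    /* first LISN of the block */
} DemoSource;

static void demo_source_notify(DemoSource *xsrc, uint32_t srcno)
{
    assert(xsrc->xive && xsrc->xive->notify);
    xsrc->xive->notify(xsrc->xive, srcno + xsrc->offset);
}
```

The IPI block is owned by the controller itself, so its events are
routed immediately; the PSI block's events are dropped until firmware
sets the valid bit in the port register.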

>> 	static void pnv_psi_notify(XiveFabric *xf, uint32_t lisn)
>> 	{
>> 	    PnvPsi *psi = PNV_PSI(xf);
>> 	    uint64_t notif_port =
>> 	        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
>> 	    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
>> 	    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
>> 	    uint32_t data = cpu_to_be32(lisn);
>> 	
>> 	    if (valid) {
>> 	        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
>> 	    }
>> 	}
>>
>> The PnvXive model handles the load and forwards to the fabric again.  
>>
>> The IPIs under PowerNV do not need an extra hop so they reach the 
>> routing routine directly without the extra notify() hop. 
>>
>> However, PowerNV should eventually use xive_fabric_route(), but
>> there are some differences in how the NVT registers are updated
>> (HV vs. OS mode) which are not handled yet, so it uses a notify()
>> handler. That should disappear, and PowerNV should call
>> xive_fabric_route() directly in the near future.
>>
>>
>> Maybe XiveFabricNotifier would be a better name for this feature?
>> I am adding a few ops later which are more related to routing.
>>
>> Thanks,
>>
>> C.
>>
>>
>>>
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
>>>>  static void xive_source_realize(DeviceState *dev, Error **errp)
>>>>  {
>>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +    Object *obj;
>>>> +    Error *local_err = NULL;
>>>> +
>>>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>>>> +    if (!obj) {
>>>> +        error_propagate(errp, local_err);
>>>> +        error_prepend(errp, "required link 'xive' not found: ");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    xsrc->xive = XIVE_FABRIC(obj);
>>>>  
>>>>      if (!xsrc->nr_irqs) {
>>>>          error_setg(errp, "Number of interrupt needs to be greater than 0");
>>>> @@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
>>>>  static void xive_register_types(void)
>>>>  {
>>>>      type_register_static(&xive_source_info);
>>>> +    type_register_static(&xive_fabric_info);
>>>>  }
>>>>  
>>>>  type_init(xive_register_types)
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index 0b76dd278d9b..4fcae2c763e6 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -12,6 +12,8 @@
>>>>  
>>>>  #include "hw/sysbus.h"
>>>>  
>>>> +typedef struct XiveFabric XiveFabric;
>>>> +
>>>>  /*
>>>>   * XIVE Interrupt Source
>>>>   */
>>>> @@ -46,6 +48,8 @@ typedef struct XiveSource {
>>>>      hwaddr       esb_base;
>>>>      uint32_t     esb_shift;
>>>>      MemoryRegion esb_mmio;
>>>> +
>>>> +    XiveFabric   *xive;
>>>>  } XiveSource;
>>>>  
>>>>  /*
>>>> @@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>>>      xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>>>>  }
>>>>  
>>>> +/*
>>>> + * XIVE Fabric
>>>> + */
>>>> +
>>>> +typedef struct XiveFabric {
>>>> +    Object parent;
>>>> +} XiveFabric;
>>>> +
>>>> +#define TYPE_XIVE_FABRIC "xive-fabric"
>>>> +#define XIVE_FABRIC(obj)                                     \
>>>> +    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
>>>> +#define XIVE_FABRIC_CLASS(klass)                                     \
>>>> +    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
>>>> +#define XIVE_FABRIC_GET_CLASS(obj)                                   \
>>>> +    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
>>>> +
>>>> +typedef struct XiveFabricClass {
>>>> +    InterfaceClass parent;
>>>> +    void (*notify)(XiveFabric *xf, uint32_t lisn);
>>>> +} XiveFabricClass;
>>>> +
>>>>  #endif /* PPC_XIVE_H */
>>>
>>
> 

* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-04-24  6:51   ` David Gibson
@ 2018-04-24  9:46     ` Cédric Le Goater
  2018-04-26  4:20       ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-24  9:46 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/24/2018 08:51 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
>> sPAPRXive is a model for the XIVE interrupt controller device of the
>> sPAPR machine. It holds the routing XIVE table, the Interrupt
>> Virtualization Entry (IVE) table which associates interrupt source
>> numbers with targets.
>>
>> Also extend the XiveFabric with an accessor to the IVT. This will be
>> needed by the routing algorithm.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>
>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
>>  discussed.
> 
> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
> than one XiveRouter?

There is only one, the main IC. 

> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
> interface, possibly its methods could just be class methods of
> XiveRouter.

Yes. We could introduce a XiveRouter to share the IVT between the
sPAPRXive and the PnvXIVE models, the interrupt controllers of the
machines. Methods would provide a way to get the ivt/eq/nvt objects
required for routing. I need to add a set_eq() to push the EQ data.

The XiveRouter would also be a XiveFabric (or some other name) to 
let the internal sources of the interrupt controller forward events.
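One possible shape for such a XiveRouter, as a standalone sketch: the
router exposes a get_ive() method implemented by each controller model,
and routing a LISN starts by fetching its IVE, checking VALID and
MASKED, and extracting the EQ index and data. The IVE bit positions are
assumed (IBM bit numbering, modeled on skiboot's layout) and should be
checked against xive_regs.h; the demo_* names are hypothetical, and the
EQ and NVT stages are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IBM bit numbering (bit 0 is the MSB), as used by the XIVE tables */
#define PPC_BIT(bit)         (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be)  ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* Assumed IVE layout, modeled on skiboot; check against xive_regs.h */
#define IVE_VALID     PPC_BIT(0)
#define IVE_EQ_INDEX  PPC_BITMASK(8, 31)
#define IVE_MASKED    PPC_BIT(32)
#define IVE_EQ_DATA   PPC_BITMASK(33, 63)

static unsigned demo_mask_shift(uint64_t mask)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

static uint64_t demo_getfield(uint64_t mask, uint64_t word)
{
    return (word & mask) >> demo_mask_shift(mask);
}

static uint64_t demo_setfield(uint64_t mask, uint64_t word, uint64_t val)
{
    return (word & ~mask) | ((val << demo_mask_shift(mask)) & mask);
}

typedef struct DemoIVE {
    uint64_t w;
} DemoIVE;

/* Router "class": each controller model provides its own IVT accessor */
typedef struct DemoRouter DemoRouter;
struct DemoRouter {
    DemoIVE *(*get_ive)(DemoRouter *xr, uint32_t lisn);
};

typedef struct DemoSpaprXive {
    DemoRouter router;   /* first member, so the cast below is valid */
    DemoIVE *ivt;
    uint32_t nr_irqs;
} DemoSpaprXive;

static DemoIVE *demo_spapr_get_ive(DemoRouter *xr, uint32_t lisn)
{
    DemoSpaprXive *xive = (DemoSpaprXive *)xr;

    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
}

typedef enum {
    DEMO_ROUTE_INVALID,
    DEMO_ROUTE_MASKED,
    DEMO_ROUTE_OK,
} DemoRouteResult;

/* First routing stage only: LISN -> IVE -> (EQ index, EQ data) */
static DemoRouteResult demo_router_route(DemoRouter *xr, uint32_t lisn,
                                         uint32_t *eq_idx, uint32_t *eq_data)
{
    DemoIVE *ive = xr->get_ive(xr, lisn);

    if (!ive || !(ive->w & IVE_VALID)) {
        return DEMO_ROUTE_INVALID;
    }
    if (ive->w & IVE_MASKED) {
        return DEMO_ROUTE_MASKED;
    }
    *eq_idx = (uint32_t)demo_getfield(IVE_EQ_INDEX, ive->w);
    *eq_data = (uint32_t)demo_getfield(IVE_EQ_DATA, ive->w);
    return DEMO_ROUTE_OK;
}
```

A get_eq()/set_eq() pair for the next stage would follow the same
per-model method pattern.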

>>
>>  Changes since v2 :
>>
>>  - introduced the XiveFabric interface
>>
>>  default-configs/ppc64-softmmu.mak |   1 +
>>  hw/intc/Makefile.objs             |   1 +
>>  hw/intc/spapr_xive.c              | 159 ++++++++++++++++++++++++++++++++++++++
>>  hw/intc/xive.c                    |   7 ++
>>  include/hw/ppc/spapr_xive.h       |  31 ++++++++
>>  include/hw/ppc/xive.h             |   5 ++
>>  include/hw/ppc/xive_regs.h        |  33 ++++++++
>>  7 files changed, 237 insertions(+)
>>  create mode 100644 hw/intc/spapr_xive.c
>>  create mode 100644 include/hw/ppc/spapr_xive.h
>>  create mode 100644 include/hw/ppc/xive_regs.h
>>
>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>> index c6d13e757977..f8d34722931d 100644
>> --- a/default-configs/ppc64-softmmu.mak
>> +++ b/default-configs/ppc64-softmmu.mak
>> @@ -17,4 +17,5 @@ CONFIG_XICS=$(CONFIG_PSERIES)
>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>  CONFIG_XIVE=$(CONFIG_PSERIES)
>> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_MEM_HOTPLUG=y
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index 72a46ed91c31..301a8e972d91 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>  obj-$(CONFIG_XIVE) += xive.o
>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> new file mode 100644
>> index 000000000000..020444e2665a
>> --- /dev/null
>> +++ b/hw/intc/spapr_xive.c
>> @@ -0,0 +1,159 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "monitor/monitor.h"
>> +#include "hw/ppc/spapr_xive.h"
>> +#include "hw/ppc/xive_regs.h"
>> +
>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>> +{
>> +    int i;
>> +
>> +    monitor_printf(mon, "IVE Table\n");
>> +    for (i = 0; i < xive->nr_irqs; i++) {
>> +        XiveIVE *ive = &xive->ivt[i];
>> +
>> +        if (!(ive->w & IVE_VALID)) {
>> +            continue;
>> +        }
>> +
>> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
>> +                       ive->w & IVE_MASKED ? "M" : " ",
>> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
>> +    }
>> +}
>> +
>> +static void spapr_xive_reset(DeviceState *dev)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    int i;
>> +
>> +    /* Mask all valid IVEs in the IRQ number space. */
>> +    for (i = 0; i < xive->nr_irqs; i++) {
>> +        XiveIVE *ive = &xive->ivt[i];
>> +        if (ive->w & IVE_VALID) {
>> +            ive->w |= IVE_MASKED;
>> +        }
>> +    }
>> +}
>> +
>> +static void spapr_xive_init(Object *obj)
> 
> I'm trying to standardize on init_instance methods being called
> *_instance_init().  It helps to make it obvious that this is ineed an
> instance_init() method, rather than one of the various other init
> calls that exist in various places.

OK. This is good practice. I will fix it.

Thanks,

C.

> 
>> +{
>> +
>> +}
>> +
>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +
>> +    if (!xive->nr_irqs) {
>> +        error_setg(errp, "Number of interrupt needs to be greater 0");
>> +        return;
>> +    }
>> +
>> +    /* Allocate the Interrupt Virtualization Table */
>> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>> +}
>> +
>> +static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(xf);
>> +
>> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>> +}
>> +
>> +static const VMStateDescription vmstate_spapr_xive_ive = {
>> +    .name = TYPE_SPAPR_XIVE "/ive",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField []) {
>> +        VMSTATE_UINT64(w, XiveIVE),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static const VMStateDescription vmstate_spapr_xive = {
>> +    .name = TYPE_SPAPR_XIVE,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
>> +                                     vmstate_spapr_xive_ive, XiveIVE),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static Property spapr_xive_properties[] = {
>> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
>> +
>> +    dc->realize = spapr_xive_realize;
>> +    dc->reset = spapr_xive_reset;
>> +    dc->props = spapr_xive_properties;
>> +    dc->desc = "sPAPR XIVE interrupt controller";
>> +    dc->vmsd = &vmstate_spapr_xive;
>> +
>> +    xfc->get_ive = spapr_xive_get_ive;
>> +}
>> +
>> +static const TypeInfo spapr_xive_info = {
>> +    .name = TYPE_SPAPR_XIVE,
>> +    .parent = TYPE_SYS_BUS_DEVICE,
>> +    .instance_init = spapr_xive_init,
>> +    .instance_size = sizeof(sPAPRXive),
>> +    .class_init = spapr_xive_class_init,
>> +    .interfaces = (InterfaceInfo[]) {
>> +            { TYPE_XIVE_FABRIC },
>> +            { },
>> +    },
>> +};
>> +
>> +static void spapr_xive_register_types(void)
>> +{
>> +    type_register_static(&spapr_xive_info);
>> +}
>> +
>> +type_init(spapr_xive_register_types)
>> +
>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
>> +{
>> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
>> +
>> +    if (!ive) {
>> +        return false;
>> +    }
>> +
>> +    ive->w |= IVE_VALID;
>> +    return true;
>> +}
>> +
>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
>> +
>> +    if (!ive) {
>> +        return false;
>> +    }
>> +
>> +    ive->w &= ~IVE_VALID;
>> +    return true;
>> +}
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index b4c3d06c1219..dccad0318834 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -20,6 +20,13 @@
>>   * XIVE Fabric
>>   */
>>  
>> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
>> +{
>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
>> +
>> +    return xfc->get_ive(xf, lisn);
>> +}
>> +
>>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>>  {
>>  
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> new file mode 100644
>> index 000000000000..1d966b5d3a96
>> --- /dev/null
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -0,0 +1,31 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_SPAPR_XIVE_H
>> +#define PPC_SPAPR_XIVE_H
>> +
>> +#include "hw/sysbus.h"
>> +#include "hw/ppc/xive.h"
>> +
>> +#define TYPE_SPAPR_XIVE "spapr-xive"
>> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>> +
>> +typedef struct sPAPRXive {
>> +    SysBusDevice parent;
>> +
>> +    /* Routing table */
>> +    XiveIVE      *ivt;
>> +    uint32_t     nr_irqs;
>> +} sPAPRXive;
>> +
>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>> +
>> +#endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 4fcae2c763e6..5b145816acdc 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -11,6 +11,7 @@
>>  #define PPC_XIVE_H
>>  
>>  #include "hw/sysbus.h"
>> +#include "hw/ppc/xive_regs.h"
>>  
>>  typedef struct XiveFabric XiveFabric;
>>  
>> @@ -166,6 +167,10 @@ typedef struct XiveFabric {
>>  typedef struct XiveFabricClass {
>>      InterfaceClass parent;
>>      void (*notify)(XiveFabric *xf, uint32_t lisn);
>> +
>> +    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>  } XiveFabricClass;
>>  
>> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
>> +
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> new file mode 100644
>> index 000000000000..5903f29eb789
>> --- /dev/null
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -0,0 +1,33 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2016-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef _PPC_XIVE_REGS_H
>> +#define _PPC_XIVE_REGS_H
>> +
>> +/* IVE/EAS
>> + *
>> + * One per interrupt source. Targets that interrupt to a given EQ
>> + * and provides the corresponding logical interrupt number (EQ data)
>> + *
>> + * We also map this structure to the escalation descriptor inside
>> + * an EQ, though in that case the valid and masked bits are not used.
>> + */
>> +typedef struct XiveIVE {
>> +        /* Use a single 64-bit definition to make it easier to
>> +         * perform atomic updates
>> +         */
>> +        uint64_t        w;
>> +#define IVE_VALID       PPC_BIT(0)
>> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
>> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
>> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
>> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>> +} XiveIVE;
>> +
>> +#endif /* _INTC_XIVE_INTERNAL_H */
> 


* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-24  8:11         ` Cédric Le Goater
@ 2018-04-26  3:28           ` David Gibson
  2018-04-26 12:16             ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-26  3:28 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 10362 bytes --]

On Tue, Apr 24, 2018 at 10:11:27AM +0200, Cédric Le Goater wrote:
> On 04/24/2018 08:41 AM, David Gibson wrote:
> > On Mon, Apr 23, 2018 at 09:31:24AM +0200, Cédric Le Goater wrote:
> >> On 04/23/2018 08:44 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
> >>>> The 'sent' status of the LSI interrupt source is modeled with the 'P'
> >>>> bit of the ESB and the assertion status of the source is maintained in
> >>>> an array under the main sPAPRXive object. The type of the source is
> >>>> stored in the same array for practical reasons.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
> >>>>  include/hw/ppc/xive.h | 16 +++++++++++++++
> >>>>  2 files changed, 66 insertions(+), 4 deletions(-)
> >>>>
> >>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>> index c70578759d02..060976077dd7 100644
> >>>> --- a/hw/intc/xive.c
> >>>> +++ b/hw/intc/xive.c
> >>>> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>>>  
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * LSI interrupt sources use the P bit and a custom assertion flag
> >>>> + */
> >>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
> >>>> +{
> >>>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> >>>> +
> >>>> +    if  (old_pq == XIVE_ESB_RESET &&
> >>>> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
> >>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> >>>> +        return true;
> >>>> +    }
> >>>> +    return false;
> >>>> +}
> >>>> +
> >>>>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
> >>>>   * page is for management */
> >>>>  static inline bool xive_source_is_trigger_page(hwaddr addr)
> >>>> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> >>>>           */
> >>>>          ret = xive_source_pq_eoi(xsrc, srcno);
> >>>>  
> >>>> +        /* If the LSI source is still asserted, forward a new source
> >>>> +         * event notification */
> >>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >>>> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
> >>>> +                xive_source_notify(xsrc, srcno);
> >>>> +            }
> >>>> +        }
> >>>>          break;
> >>>>  
> >>>>      case XIVE_ESB_GET:
> >>>> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
> >>>>           * notification
> >>>>           */
> >>>>          notify = xive_source_pq_eoi(xsrc, srcno);
> >>>> +
> >>>> +        /* LSI sources do not set the Q bit but they can still be
> >>>> +         * asserted, in which case we should forward a new source
> >>>> +         * event notification
> >>>> +         */
> >>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >>>> +            notify = xive_source_lsi_trigger(xsrc, srcno);
> >>>> +        }
> >>
> >> FYI, I have moved that common test under xive_source_pq_eoi()
> > 
> > Ok.
> > 
> >>>>          break;
> >>>>  
> >>>>      default:
> >>>> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
> >>>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
> >>>>      bool notify = false;
> >>>>  
> >>>> -    if (val) {
> >>>> -        notify = xive_source_pq_trigger(xsrc, srcno);
> >>>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >>>> +        if (val) {
> >>>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> >>>> +        } else {
> >>>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> >>>> +        }
> >>>> +        notify = xive_source_lsi_trigger(xsrc, srcno);
> >>>> +    } else {
> >>>> +        if (val) {
> >>>> +            notify = xive_source_pq_trigger(xsrc, srcno);
> >>>> +        }
> >>>>      }
> >>>>  
> >>>>      /* Forward the source event notification for routing */
> >>>> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >>>>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
> >>>>      for (i = 0; i < xsrc->nr_irqs; i++) {
> >>>>          uint8_t pq = xive_source_pq_get(xsrc, i);
> >>>> -        uint32_t lisn = i  + xsrc->offset;
> >>>>  
> >>>>          if (pq == XIVE_ESB_OFF) {
> >>>>              continue;
> >>>>          }
> >>>>  
> >>>> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
> >>>> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
> >>>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
> >>>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >>>>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >>>>      }
> >>>> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >>>>  static void xive_source_reset(DeviceState *dev)
> >>>>  {
> >>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >>>> +    int i;
> >>>> +
> >>>> +    /* Keep the IRQ type */
> >>>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >>>> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
> >>>> +    }
> >>>>  
> >>>>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
> >>>>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
> >>>> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
> >>>>  
> >>>>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> >>>>                                       xsrc->nr_irqs);
> >>>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
> >>>>  
> >>>>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
> >>>>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
> >>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >>>> index d92a50519edf..0b76dd278d9b 100644
> >>>> --- a/include/hw/ppc/xive.h
> >>>> +++ b/include/hw/ppc/xive.h
> >>>> @@ -33,6 +33,9 @@ typedef struct XiveSource {
> >>>>      uint32_t     nr_irqs;
> >>>>      uint32_t     offset;
> >>>>      qemu_irq     *qirqs;
> >>>> +#define XIVE_STATUS_LSI         0x1
> >>>> +#define XIVE_STATUS_ASSERTED    0x2
> >>>> +    uint8_t      *status;
> >>>
> >>> I don't love the idea of mixing configuration information (STATUS_LSI)
> >>> with runtime state information (ASSERTED) in the same field.  Any
> >>> reason not to have these as parallel bitmaps.
> >>
> >> None. I can change that.
> > 
> > Ok.
> > 
> >>> Come to that.. is there a compelling reason to allow any individual
> >>> irq to be marked LSI or MSI, rather than using separate XiveSource
> >>> objects for MSIs and LSIs?
> >>
> >> Yes. I would have preferred two distinct interrupt source objects, but
> >> this is for compatibility with XICS, which uses only one. If we want
> >> to be able to switch interrupt modes, the IRQ number space must be
> >> organized in exactly the same way. Otherwise, we should change XICS too.
> >>
> >> Also, the change (a bitmap) is really small.
> > 
> > Hrm, but since XIVE supports thousands of irqs, it could be quite a
> > large bitmap.
> 
> Yes. The change is small, not the bitmap.
>  
> > It's not impossible - in fact, not really even that hard - to change
> > the existing irq layout on xics.  It does need a new machine type
> > variant, of course.
> 
> I did some work on that topic a while ago :
> 
> 	https://patchwork.ozlabs.org/cover/836782/
> 
> But we stopped exploring the idea. Maybe it was not the right approach.
> The PHB LSIs would benefit from such a split, though.

So, no, I don't think that was a good approach, but that doesn't mean
other ways of rearranging the irq numbers aren't ok.  The thing here
is that we don't want to think in terms of an "irq allocator"; there are
some bits like that in there already, but they were always a mistake.

We have lots of irq space (both XICS and XIVE) so instead we should
come up with a static mapping of irqs to devices.
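A minimal sketch of what such a static layout could look like, assuming each device class gets a fixed, non-overlapping range of the IRQ number space. The class names, bases and sizes below are hypothetical and purely illustrative; they are not taken from this thread or from QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical device classes, each owning a fixed IRQ range */
enum irq_class {
    IRQ_IPI, IRQ_EPOW, IRQ_VIO, IRQ_PCI_LSI, IRQ_MSI, IRQ_CLASS_MAX
};

typedef struct {
    uint32_t base;   /* first IRQ number of the range */
    uint32_t count;  /* number of IRQs reserved for the class */
} irq_range;

/* Illustrative values only; ranges are sorted by base */
static const irq_range irq_map[IRQ_CLASS_MAX] = {
    [IRQ_IPI]     = { 0x0000, 0x1000 },
    [IRQ_EPOW]    = { 0x1000, 0x0001 },
    [IRQ_VIO]     = { 0x1100, 0x0100 },
    [IRQ_PCI_LSI] = { 0x1200, 0x0100 },
    [IRQ_MSI]     = { 0x1300, 0x1000 },
};

/*
 * The IRQ number is a pure function of (class, index): there is no
 * allocator state, so a given device always gets the same number.
 * Returns -1 if the index does not fit in the class range.
 */
static int64_t irq_number(enum irq_class cls, uint32_t index)
{
    assert(cls < IRQ_CLASS_MAX);
    if (index >= irq_map[cls].count) {
        return -1;
    }
    return (int64_t)irq_map[cls].base + index;
}

/* Sanity check that no two consecutive ranges overlap */
static bool irq_map_is_consistent(void)
{
    for (int i = 1; i < IRQ_CLASS_MAX; i++) {
        if (irq_map[i - 1].base + irq_map[i - 1].count > irq_map[i].base) {
            return false;
        }
    }
    return true;
}
```

For example, `irq_number(IRQ_VIO, 3)` yields 0x1103 with these made-up ranges, whichever order the devices are created in.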


> >>>>      /* PQ bits */
> >>>>      uint8_t      *sbe;
> >>>
> >>> .. and come to that is there a reason to keep the ASSERTED bit in a
> >>> separate array from sbe?  AFAICT the actual 2-bit-per-irq layout is
> >>> never exposed to the guests.
> >>
> >> Indeed. We always use the xive_source_pq_get/set() helpers to
> >> manipulate the PQ bits, so we could add an extra bit for ASSERTED
> >> without too many changes. Could we also put the type there, or would
> >> you still prefer a bitmap?
> > 
> > I'd prefer the type (config information) be separate from the P, Q,
> > ASSERTED bits (state information).
> 
> ok. So I will use the 'uint8_t *status' for P, Q, ASSERT, which leaves
> 5 bits available, but I don't think it is really worth the pain to 
> optimize the size.

Sure.  I don't really care if it's packed or not.

> The sbe array will disappear and we will have 
> a bitmap for the type.

We may or may not keep the type bitmap based on the discussion above,
but in any case this is a good step forward.
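The layout agreed on above, with P, Q and ASSERTED in one status byte per IRQ and the LSI type kept in a separate bitmap, can be sketched as a small standalone model. This is not the QEMU code. The names (Source, ST_*, PQ_*, set_irq(), eoi()) are hypothetical, the PQ encodings and MSI transitions assume the usual XIVE ESB state machine (00 reset, 01 off, 10 pending, 11 queued), and sources start at PQ=00 rather than at the "ints off" value the patch uses on reset.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Runtime state bits, one status byte per IRQ (hypothetical names) */
#define ST_Q         0x1   /* ESB Q bit */
#define ST_P         0x2   /* ESB P bit */
#define ST_PQ_MASK   (ST_P | ST_Q)
#define ST_ASSERTED  0x4   /* LSI input line currently asserted */

/* ESB PQ states */
#define PQ_RESET     0x0   /* 00 */
#define PQ_OFF       0x1   /* 01, "ints off" */
#define PQ_PENDING   0x2   /* 10 */
#define PQ_QUEUED    0x3   /* 11 */

typedef struct {
    uint32_t  nr_irqs;
    uint8_t  *status;   /* state: P, Q, ASSERTED */
    uint64_t *lsi_map;  /* configuration: bit n set => IRQ n is an LSI */
} Source;

static Source *source_new(uint32_t nr_irqs)
{
    Source *s = calloc(1, sizeof(*s));
    assert(s);
    s->nr_irqs = nr_irqs;
    s->status = calloc(nr_irqs, 1);                 /* all PQ=00 */
    s->lsi_map = calloc((nr_irqs + 63) / 64, sizeof(uint64_t));
    assert(s->status && s->lsi_map);
    return s;
}

static void source_set_lsi(Source *s, uint32_t n)
{
    assert(n < s->nr_irqs);
    s->lsi_map[n / 64] |= 1ULL << (n % 64);
}

static bool source_is_lsi(const Source *s, uint32_t n)
{
    assert(n < s->nr_irqs);
    return (s->lsi_map[n / 64] >> (n % 64)) & 1;
}

static uint8_t pq_get(const Source *s, uint32_t n)
{
    return s->status[n] & ST_PQ_MASK;
}

static void pq_set(Source *s, uint32_t n, uint8_t pq)
{
    s->status[n] = (uint8_t)((s->status[n] & ~ST_PQ_MASK) | pq);
}

/* An LSI only fires from PQ=00 while its line is asserted */
static bool lsi_trigger(Source *s, uint32_t n)
{
    if (pq_get(s, n) == PQ_RESET && (s->status[n] & ST_ASSERTED)) {
        pq_set(s, n, PQ_PENDING);
        return true;
    }
    return false;
}

/* Input line change; returns true if an event must be forwarded */
static bool set_irq(Source *s, uint32_t n, int val)
{
    if (source_is_lsi(s, n)) {
        if (val) {
            s->status[n] |= ST_ASSERTED;
        } else {
            s->status[n] &= (uint8_t)~ST_ASSERTED;
        }
        return lsi_trigger(s, n);
    }
    if (!val) {
        return false;
    }
    switch (pq_get(s, n)) {         /* MSI: edge triggered */
    case PQ_RESET:   pq_set(s, n, PQ_PENDING); return true;
    case PQ_PENDING: pq_set(s, n, PQ_QUEUED);  return false;
    default:         return false;  /* QUEUED stays, OFF drops */
    }
}

/* EOI: 10 -> 00; 11 -> 10 and re-notify; an asserted LSI re-fires */
static bool eoi(Source *s, uint32_t n)
{
    bool notify = false;

    switch (pq_get(s, n)) {
    case PQ_PENDING: pq_set(s, n, PQ_RESET);                  break;
    case PQ_QUEUED:  pq_set(s, n, PQ_PENDING); notify = true; break;
    default:                                                  break;
    }
    if (source_is_lsi(s, n)) {
        notify = lsi_trigger(s, n);
    }
    return notify;
}
```

Keeping the type out of the status byte means reset can simply clear the state bytes without having to preserve the configuration.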

> 
> Thanks,
> 
> C. 
> 
> >>> Or, even re-use the Q bit for asserted in LSIs (but report it as
> >>> always 0 in the register read/write path).
> >>
> >> I would prefer to add extra status bits. It is easier to debug.
> >>
> >> Thanks,
> >>
> >> C.
> >>
> >>>> @@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
> >>>>  
> >>>>  void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
> >>>>  
> >>>> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
> >>>> +{
> >>>> +    assert(srcno < xsrc->nr_irqs);
> >>>> +    return xsrc->status[srcno] & XIVE_STATUS_LSI;
> >>>> +}
> >>>> +
> >>>> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> >>>> +                                       bool lsi)
> >>>> +{
> >>>> +    assert(srcno < xsrc->nr_irqs);
> >>>> +    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
> >>>> +}
> >>>> +
> >>>>  #endif /* PPC_XIVE_H */
> >>>
> >>
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]


* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-24  9:33         ` Cédric Le Goater
@ 2018-04-26  3:54           ` David Gibson
  2018-04-26 10:30             ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-26  3:54 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 9191 bytes --]

On Tue, Apr 24, 2018 at 11:33:11AM +0200, Cédric Le Goater wrote:
> On 04/24/2018 08:46 AM, David Gibson wrote:
> > On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
> >> On 04/23/2018 08:46 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
> >>>> The XiveFabric offers a simple interface, between the XiveSource
> >>>> object and the device model owning the interrupt sources, to forward
> >>>> an event notification to the XIVE interrupt controller of the machine
> >>>> and, if the owner is the controller, to call the routing sub-engine
> >>>> directly.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
> >>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
> >>>>  2 files changed, 61 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>> index 060976077dd7..b4c3d06c1219 100644
> >>>> --- a/hw/intc/xive.c
> >>>> +++ b/hw/intc/xive.c
> >>>> @@ -17,6 +17,21 @@
> >>>>  #include "hw/ppc/xive.h"
> >>>>  
> >>>>  /*
> >>>> + * XIVE Fabric
> >>>> + */
> >>>> +
> >>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
> >>>> +{
> >>>> +
> >>>> +}
> >>>> +
> >>>> +static const TypeInfo xive_fabric_info = {
> >>>> +    .name = TYPE_XIVE_FABRIC,
> >>>> +    .parent = TYPE_INTERFACE,
> >>>> +    .class_size = sizeof(XiveFabricClass),
> >>>> +};
> >>>> +
> >>>> +/*
> >>>>   * XIVE Interrupt Source
> >>>>   */
> >>>>  
> >>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
> >>>>  
> >>>>  /*
> >>>>   * Forward the source event notification to the associated XiveFabric,
> >>>> - * the device owning the sources.
> >>>> + * the device owning the sources, or perform the routing if the device
> >>>> + * is the interrupt controller.
> >>>>   */
> >>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>>>  {
> >>>>  
> >>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
> >>>> +
> >>>> +    if (xfc->notify) {
> >>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
> >>>> +    } else {
> >>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
> >>>> +    }
> >>>
> >>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
> >>> to xive_fabric_route if that's what it wants?
> >> Under sPAPR, all the sources, IPIs and virtual device interrupts,
> >> generate events which are directly routed by xive_fabric_route().
> >> Indeed, there is no need for an extra hop.
> > 
> > Ok.
> > 
> >> Under PowerNV, some sources forward the notification to the routing
> >> engine using a specific MMIO load on a notify address which is stored
> >> in one of the controller registers. So we need a hop to reach the
> >> device model owning the sources, and do that load:
> > 
> > Hm.  So you're saying that in pnv some sources send their notification
> > to some other unit, 
> 
> Not to any unit/device, to the device owning the sources.
> 
> For the XiveSource object under PSI, the XIVEFabric interface is the
> PSI device object itself, which knows how to forward the notification
> on the XIVE Power "bus". To be more precise, the PSI HB device has
> 14 interrupt sources, whose notifications are forwarded using an MMIO
> load to some address. The load address is configured (by skiboot) in
> one of the PSI device registers, and points to an MMIO region of the
> main XIVE interrupt controller.
> 
> The PHB4 sources should be the same.
> 
> For the XiveSource object (all interrupts) under sPAPRXive, the 
> XIVEFabric is the main interrupt controller sPAPRXive.
> 
> For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
> also the main interrupt controller PnvXive.

Hrm.  Apparently I'm missing something, I'm really not getting what
you're trying to explain here.

> > that would then (after possible masking) forward on to the overall
> > xive fabric?
> 
> Yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?

Maybe..?

> > That seems like a property of the source object, 
> 
> The source object is generic. It's a bunch of PQ bits that can be 
> controlled by MMIOs. Nothing more.

Hmm.  Isn't the source object also responsible for forwarding the
interrupt to something up the chain (whatever that is)?
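One way to read the suggestion that forwarding is a property of the source: give each source a pointer to whatever receives its events, and keep a single notify path. In the minimal sketch below, all names are hypothetical and not QEMU's. A Router stands in for a controller that routes events itself (the sPAPR case and the PowerNV IPIs). A Forwarder stands in for a PSI-like owner that adds one hop and forwards only when its notification port is valid. In the real PSI model that hop is a memory access to a firmware-programmed address; here it is reduced to a direct call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical interface: anything that can receive a source event */
typedef struct Notifier Notifier;
struct Notifier {
    void (*notify)(Notifier *n, uint32_t lisn);
};

/* A source only knows its upstream notifier and its LISN offset */
typedef struct {
    Notifier *up;
    uint32_t  offset;
} Source;

/* Single path: no "notify or route" special case in the source */
static void source_notify(Source *s, uint32_t srcno)
{
    s->up->notify(s->up, srcno + s->offset);
}

/* Interrupt controller: routes events itself */
typedef struct {
    Notifier  parent;      /* first member, so the casts below are valid */
    uint32_t  last_lisn;
    unsigned  routed;
} Router;

static void router_notify(Notifier *n, uint32_t lisn)
{
    Router *r = (Router *)n;

    r->last_lisn = lisn;   /* a real router would look up the IVE here */
    r->routed++;
}

/* Source owner with an extra hop (PSI-like) */
typedef struct {
    Notifier  parent;
    bool      port_valid;  /* models the "notify address valid" bit */
    Notifier *next;        /* models the access to the controller */
} Forwarder;

static void forwarder_notify(Notifier *n, uint32_t lisn)
{
    Forwarder *f = (Forwarder *)n;

    if (f->port_valid) {
        f->next->notify(f->next, lisn);
    }
}
```

With this shape, IPI sources point straight at the router and PSI sources point at the PSI object. The source code itself never needs to know which case it is in.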

> 
> > rather than a
> > property of the fabric.  Indeed varying this by source object would
> > require the objects have a different xive pointer, when I thought the
> > idea was that the XiveFabric was global.
> 
> When a notification is forwarded, the source needs to call an
> interface which is generally implemented by the source owner,

I'm not quite sure what you mean by "source owner".

> which is not necessarily the main IC. 
> 
> >> 	static void pnv_psi_notify(XiveFabric *xf, uint32_t lisn)
> >> 	{
> >> 	    PnvPsi *psi = PNV_PSI(xf);
> >> 	    uint64_t notif_port =
> >> 	        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
> >> 	    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
> >> 	    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
> >> 	    uint32_t data = cpu_to_be32(lisn);
> >> 	
> >> 	    if (valid) {
> >> 	        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
> >> 	    }
> >> 	}
> >>
> >> The PnvXive model handles the load and forwards to the fabric again.  
> >>
> >> The IPIs under PowerNV do not need an extra hop, so they reach the
> >> routing routine directly, without the notify() hop.
> >>
> >> However, PowerNV should eventually use xive_fabric_route() too,
> >> but there are some differences in how the NVT registers are
> >> updated (HV vs. OS mode) which are not handled yet, so it uses a
> >> notify() handler. That handler should disappear in the near future,
> >> and PowerNV will then call xive_fabric_route() directly.
> >>
> >>
> >> Maybe XiveFabricNotifier would be a better name for this feature?
> >> I am adding a few ops later which are more related to routing.
> >>
> >> Thanks,
> >>
> >> C.
> >>
> >>
> >>>
> >>>>  }
> >>>>  
> >>>>  /*
> >>>> @@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
> >>>>  static void xive_source_realize(DeviceState *dev, Error **errp)
> >>>>  {
> >>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >>>> +    Object *obj;
> >>>> +    Error *local_err = NULL;
> >>>> +
> >>>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> >>>> +    if (!obj) {
> >>>> +        error_propagate(errp, local_err);
> >>>> +        error_prepend(errp, "required link 'xive' not found: ");
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    xsrc->xive = XIVE_FABRIC(obj);
> >>>>  
> >>>>      if (!xsrc->nr_irqs) {
> >>>>          error_setg(errp, "Number of interrupt needs to be greater than 0");
> >>>> @@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
> >>>>  static void xive_register_types(void)
> >>>>  {
> >>>>      type_register_static(&xive_source_info);
> >>>> +    type_register_static(&xive_fabric_info);
> >>>>  }
> >>>>  
> >>>>  type_init(xive_register_types)
> >>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >>>> index 0b76dd278d9b..4fcae2c763e6 100644
> >>>> --- a/include/hw/ppc/xive.h
> >>>> +++ b/include/hw/ppc/xive.h
> >>>> @@ -12,6 +12,8 @@
> >>>>  
> >>>>  #include "hw/sysbus.h"
> >>>>  
> >>>> +typedef struct XiveFabric XiveFabric;
> >>>> +
> >>>>  /*
> >>>>   * XIVE Interrupt Source
> >>>>   */
> >>>> @@ -46,6 +48,8 @@ typedef struct XiveSource {
> >>>>      hwaddr       esb_base;
> >>>>      uint32_t     esb_shift;
> >>>>      MemoryRegion esb_mmio;
> >>>> +
> >>>> +    XiveFabric   *xive;
> >>>>  } XiveSource;
> >>>>  
> >>>>  /*
> >>>> @@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> >>>>      xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * XIVE Fabric
> >>>> + */
> >>>> +
> >>>> +typedef struct XiveFabric {
> >>>> +    Object parent;
> >>>> +} XiveFabric;
> >>>> +
> >>>> +#define TYPE_XIVE_FABRIC "xive-fabric"
> >>>> +#define XIVE_FABRIC(obj)                                     \
> >>>> +    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
> >>>> +#define XIVE_FABRIC_CLASS(klass)                                     \
> >>>> +    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
> >>>> +#define XIVE_FABRIC_GET_CLASS(obj)                                   \
> >>>> +    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
> >>>> +
> >>>> +typedef struct XiveFabricClass {
> >>>> +    InterfaceClass parent;
> >>>> +    void (*notify)(XiveFabric *xf, uint32_t lisn);
> >>>> +} XiveFabricClass;
> >>>> +
> >>>>  #endif /* PPC_XIVE_H */
> >>>
> >>
> > 
> 


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-04-24  9:46     ` Cédric Le Goater
@ 2018-04-26  4:20       ` David Gibson
  2018-04-26 10:43         ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-26  4:20 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 13561 bytes --]

On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
> On 04/24/2018 08:51 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
> >> sPAPRXive is a model for the XIVE interrupt controller device of the
> >> sPAPR machine. It holds the routing XIVE table, the Interrupt
> >> Virtualization Entry (IVE) table which associates interrupt source
> >> numbers with targets.
> >>
> >> Also extend the XiveFabric with an accessor to the IVT. This will be
> >> needed by the routing algorithm.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>
> >>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
> >>  discussed.
> > 
> > Yeah, maybe.  Am I correct in thinking that on pnv there could be more
> > than one XiveRouter?
> 
> There is only one, the main IC. 

Ok, that's what I thought originally.  In that case some of the stuff
in the patches really doesn't make sense to me.

> > If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
> > interface, possibly its methods could just be class methods of
> > XiveRouter.
> 
> Yes. We could introduce a XiveRouter to share the IVT between
> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
> the machines. Methods would provide a way to get the IVT/EQ/NVT
> objects required for routing. I need to add a set_eq() to push the
> EQ data.

Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
object which owns the IVT.  It may or may not do other stuff as well.

Now IIUC, on pnv the IVT lives in main system memory.  Under PAPR is
the IVT in guest memory, or is it outside (updated by
hypercalls/rtas)?
 
> The XiveRouter would also be a XiveFabric (or some other name) to 
> let the internal sources of the interrupt controller forward events.

The further we go here, the less sure I am that XiveFabric even makes
sense as a concept.
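To make the "XiveRouter owns the IVT" idea concrete, here is a minimal standalone sketch of the first routing step. It looks up the IVE for a LISN, drops the event if the entry is invalid or masked, and otherwise extracts the EQ block, EQ index and EQ data. The IVE field layout and the PPC_BIT()/PPC_BITMASK() MSB-0 helpers follow xive_regs.h and QEMU's conventions. GETFIELD()/SETFIELD() are reimplemented here with the GCC/Clang builtin __builtin_ctzll(). Router, RouteResult and router_route() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* MSB-0 bit helpers (IBM numbering: bit 0 is the most significant) */
#define PPC_BIT(bit)         (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be)  ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))
#define GETFIELD(mask, word) (((word) & (mask)) >> __builtin_ctzll(mask))
#define SETFIELD(mask, word, val) \
    (((word) & ~(mask)) | (((uint64_t)(val) << __builtin_ctzll(mask)) & (mask)))

/* IVE layout, as in include/hw/ppc/xive_regs.h */
#define IVE_VALID     PPC_BIT(0)
#define IVE_EQ_BLOCK  PPC_BITMASK(4, 7)
#define IVE_EQ_INDEX  PPC_BITMASK(8, 31)
#define IVE_MASKED    PPC_BIT(32)
#define IVE_EQ_DATA   PPC_BITMASK(33, 63)

typedef struct XiveIVE {
    uint64_t w;
} XiveIVE;

/* Hypothetical router: the one object that owns the IVT */
typedef struct {
    XiveIVE  *ivt;
    uint32_t  nr_irqs;
} Router;

typedef struct {
    uint32_t eq_blk;
    uint32_t eq_idx;
    uint32_t eq_data;
} RouteResult;

static XiveIVE *router_get_ive(Router *r, uint32_t lisn)
{
    return lisn < r->nr_irqs ? &r->ivt[lisn] : NULL;
}

/* Returns true and fills *res if the event must be pushed to an EQ */
static bool router_route(Router *r, uint32_t lisn, RouteResult *res)
{
    XiveIVE *ive = router_get_ive(r, lisn);

    if (!ive || !(ive->w & IVE_VALID)) {
        return false;               /* unknown or unconfigured LISN */
    }
    if (ive->w & IVE_MASKED) {
        return false;               /* masked: the event is discarded */
    }
    res->eq_blk  = (uint32_t)GETFIELD(IVE_EQ_BLOCK, ive->w);
    res->eq_idx  = (uint32_t)GETFIELD(IVE_EQ_INDEX, ive->w);
    res->eq_data = (uint32_t)GETFIELD(IVE_EQ_DATA, ive->w);
    return true;
}
```

In this shape, the only machine-specific part is the storage behind router_get_ive(). For example, it could be an array held by QEMU or, as described above for pnv, a table in system memory.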

> 
> >>
> >>  Changes since v2 :
> >>
> >>  - introduced the XiveFabric interface
> >>
> >>  default-configs/ppc64-softmmu.mak |   1 +
> >>  hw/intc/Makefile.objs             |   1 +
> >>  hw/intc/spapr_xive.c              | 159 ++++++++++++++++++++++++++++++++++++++
> >>  hw/intc/xive.c                    |   7 ++
> >>  include/hw/ppc/spapr_xive.h       |  31 ++++++++
> >>  include/hw/ppc/xive.h             |   5 ++
> >>  include/hw/ppc/xive_regs.h        |  33 ++++++++
> >>  7 files changed, 237 insertions(+)
> >>  create mode 100644 hw/intc/spapr_xive.c
> >>  create mode 100644 include/hw/ppc/spapr_xive.h
> >>  create mode 100644 include/hw/ppc/xive_regs.h
> >>
> >> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> >> index c6d13e757977..f8d34722931d 100644
> >> --- a/default-configs/ppc64-softmmu.mak
> >> +++ b/default-configs/ppc64-softmmu.mak
> >> @@ -17,4 +17,5 @@ CONFIG_XICS=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> >>  CONFIG_XIVE=$(CONFIG_PSERIES)
> >> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_MEM_HOTPLUG=y
> >> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> >> index 72a46ed91c31..301a8e972d91 100644
> >> --- a/hw/intc/Makefile.objs
> >> +++ b/hw/intc/Makefile.objs
> >> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
> >>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> >>  obj-$(CONFIG_XIVE) += xive.o
> >> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
> >>  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> new file mode 100644
> >> index 000000000000..020444e2665a
> >> --- /dev/null
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -0,0 +1,159 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/log.h"
> >> +#include "qapi/error.h"
> >> +#include "target/ppc/cpu.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "monitor/monitor.h"
> >> +#include "hw/ppc/spapr_xive.h"
> >> +#include "hw/ppc/xive_regs.h"
> >> +
> >> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> >> +{
> >> +    int i;
> >> +
> >> +    monitor_printf(mon, "IVE Table\n");
> >> +    for (i = 0; i < xive->nr_irqs; i++) {
> >> +        XiveIVE *ive = &xive->ivt[i];
> >> +
> >> +        if (!(ive->w & IVE_VALID)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
> >> +                       ive->w & IVE_MASKED ? "M" : " ",
> >> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
> >> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
> >> +    }
> >> +}
> >> +
> >> +static void spapr_xive_reset(DeviceState *dev)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    int i;
> >> +
> >> +    /* Mask all valid IVEs in the IRQ number space. */
> >> +    for (i = 0; i < xive->nr_irqs; i++) {
> >> +        XiveIVE *ive = &xive->ivt[i];
> >> +        if (ive->w & IVE_VALID) {
> >> +            ive->w |= IVE_MASKED;
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void spapr_xive_init(Object *obj)
> > 
> > I'm trying to standardize on instance_init methods being called
> > *_instance_init().  It helps to make it obvious that this is indeed an
> > instance_init() method, rather than one of the various other init
> > calls that exist in various places.
> 
> OK. This is good practice. I will fix it.
> 
> Thanks,
> 
> C.
> 
> > 
> >> +{
> >> +
> >> +}
> >> +
> >> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +
> >> +    if (!xive->nr_irqs) {
> >> +        error_setg(errp, "Number of interrupt needs to be greater 0");
> >> +        return;
> >> +    }
> >> +
> >> +    /* Allocate the Interrupt Virtualization Table */
> >> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> >> +}
> >> +
> >> +static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(xf);
> >> +
> >> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> >> +}
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive_ive = {
> >> +    .name = TYPE_SPAPR_XIVE "/ive",
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField []) {
> >> +        VMSTATE_UINT64(w, XiveIVE),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive = {
> >> +    .name = TYPE_SPAPR_XIVE,
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField[]) {
> >> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> >> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
> >> +                                     vmstate_spapr_xive_ive, XiveIVE),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static Property spapr_xive_properties[] = {
> >> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
> >> +
> >> +    dc->realize = spapr_xive_realize;
> >> +    dc->reset = spapr_xive_reset;
> >> +    dc->props = spapr_xive_properties;
> >> +    dc->desc = "sPAPR XIVE interrupt controller";
> >> +    dc->vmsd = &vmstate_spapr_xive;
> >> +
> >> +    xfc->get_ive = spapr_xive_get_ive;
> >> +}
> >> +
> >> +static const TypeInfo spapr_xive_info = {
> >> +    .name = TYPE_SPAPR_XIVE,
> >> +    .parent = TYPE_SYS_BUS_DEVICE,
> >> +    .instance_init = spapr_xive_init,
> >> +    .instance_size = sizeof(sPAPRXive),
> >> +    .class_init = spapr_xive_class_init,
> >> +    .interfaces = (InterfaceInfo[]) {
> >> +            { TYPE_XIVE_FABRIC },
> >> +            { },
> >> +    },
> >> +};
> >> +
> >> +static void spapr_xive_register_types(void)
> >> +{
> >> +    type_register_static(&spapr_xive_info);
> >> +}
> >> +
> >> +type_init(spapr_xive_register_types)
> >> +
> >> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
> >> +{
> >> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> >> +
> >> +    if (!ive) {
> >> +        return false;
> >> +    }
> >> +
> >> +    ive->w |= IVE_VALID;
> >> +    return true;
> >> +}
> >> +
> >> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> >> +{
> >> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> >> +
> >> +    if (!ive) {
> >> +        return false;
> >> +    }
> >> +
> >> +    ive->w &= ~IVE_VALID;
> >> +    return true;
> >> +}
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index b4c3d06c1219..dccad0318834 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -20,6 +20,13 @@
> >>   * XIVE Fabric
> >>   */
> >>  
> >> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
> >> +{
> >> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
> >> +
> >> +    return xfc->get_ive(xf, lisn);
> >> +}
> >> +
> >>  static void xive_fabric_route(XiveFabric *xf, int lisn)
> >>  {
> >>  
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> new file mode 100644
> >> index 000000000000..1d966b5d3a96
> >> --- /dev/null
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -0,0 +1,31 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef PPC_SPAPR_XIVE_H
> >> +#define PPC_SPAPR_XIVE_H
> >> +
> >> +#include "hw/sysbus.h"
> >> +#include "hw/ppc/xive.h"
> >> +
> >> +#define TYPE_SPAPR_XIVE "spapr-xive"
> >> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> >> +
> >> +typedef struct sPAPRXive {
> >> +    SysBusDevice parent;
> >> +
> >> +    /* Routing table */
> >> +    XiveIVE      *ivt;
> >> +    uint32_t     nr_irqs;
> >> +} sPAPRXive;
> >> +
> >> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> >> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> >> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> >> +
> >> +#endif /* PPC_SPAPR_XIVE_H */
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 4fcae2c763e6..5b145816acdc 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -11,6 +11,7 @@
> >>  #define PPC_XIVE_H
> >>  
> >>  #include "hw/sysbus.h"
> >> +#include "hw/ppc/xive_regs.h"
> >>  
> >>  typedef struct XiveFabric XiveFabric;
> >>  
> >> @@ -166,6 +167,10 @@ typedef struct XiveFabric {
> >>  typedef struct XiveFabricClass {
> >>      InterfaceClass parent;
> >>      void (*notify)(XiveFabric *xf, uint32_t lisn);
> >> +
> >> +    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> >>  } XiveFabricClass;
> >>  
> >> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
> >> +
> >>  #endif /* PPC_XIVE_H */
> >> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> >> new file mode 100644
> >> index 000000000000..5903f29eb789
> >> --- /dev/null
> >> +++ b/include/hw/ppc/xive_regs.h
> >> @@ -0,0 +1,33 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2016-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef _PPC_XIVE_REGS_H
> >> +#define _PPC_XIVE_REGS_H
> >> +
> >> +/* IVE/EAS
> >> + *
> >> + * One per interrupt source. Targets that interrupt to a given EQ
> >> + * and provides the corresponding logical interrupt number (EQ data)
> >> + *
> >> + * We also map this structure to the escalation descriptor inside
> >> + * an EQ, though in that case the valid and masked bits are not used.
> >> + */
> >> +typedef struct XiveIVE {
> >> +        /* Use a single 64-bit definition to make it easier to
> >> +         * perform atomic updates
> >> +         */
> >> +        uint64_t        w;
> >> +#define IVE_VALID       PPC_BIT(0)
> >> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
> >> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
> >> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
> >> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
> >> +} XiveIVE;
> >> +
> >> +#endif /* _PPC_XIVE_REGS_H */
> > 
> 
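The IVE word above packs its fields using IBM (MSB-0) bit numbering, so PPC_BIT(0) is the top bit of the 64-bit word. Here is a minimal standalone sketch of how the fields could be encoded and decoded. It assumes local copies of QEMU's PPC_BIT/PPC_BITMASK macros and two hypothetical helpers, ive_get_field() and ive_set_field(), built on the GCC/Clang __builtin_ctzll() builtin. QEMU code itself would more likely use ctz64().

```c
#include <assert.h>
#include <stdint.h>

/* Local copies of the IBM (MSB-0) bit helpers, assumed to match QEMU's. */
#define PPC_BIT(bit)          (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be)   ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* IVE layout as defined in xive_regs.h by this patch */
#define IVE_VALID       PPC_BIT(0)
#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
#define IVE_MASKED      PPC_BIT(32)              /* Masked */
#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */

/* Hypothetical helper: extract the field selected by 'mask' from 'word'. */
static inline uint64_t ive_get_field(uint64_t mask, uint64_t word)
{
    return (word & mask) >> __builtin_ctzll(mask);
}

/* Hypothetical helper: return 'word' with the field 'mask' set to 'value'. */
static inline uint64_t ive_set_field(uint64_t mask, uint64_t word,
                                     uint64_t value)
{
    unsigned shift = __builtin_ctzll(mask);

    /* Catch values that do not fit in the field. */
    assert(((value << shift) & mask) >> shift == value);
    return (word & ~mask) | ((value << shift) & mask);
}
```

IVE_MASKED (bit 32) sits between IVE_EQ_INDEX and IVE_EQ_DATA. Setting it on reset, as spapr_xive_reset() does for valid entries, therefore leaves the EQ routing fields untouched.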

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model
  2018-04-24  8:19     ` Cédric Le Goater
@ 2018-04-26  4:46       ` David Gibson
  0 siblings, 0 replies; 100+ messages in thread
From: David Gibson @ 2018-04-26  4:46 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Tue, Apr 24, 2018 at 10:19:58AM +0200, Cédric Le Goater wrote:
> On 04/24/2018 08:58 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:43:01PM +0200, Cédric Le Goater wrote:
> >> Bare-metal systems (PowerNV) have multiple interrupt sources. The
> >> XIVE interrupt controller has an internal source for IPIs and generic
> >> IPIs, the PSIHB has one and also the PHBs. But, for simplicity on the
> >> sPAPR machine, we use a unique XiveSource object for all IPIs and
> >> virtual device interrupts of the VM.
> >>
> >> The ESB MMIO region used to control the sources is mapped at the
> >> address of chip 0 of a real system and only the provisioned IRQ
> >> numbers are covered.
> > 
> > Is that MMIO address PAPR specified, or arbitrary?
> 
> There is no specified value for the ESB address. It's queried by
> the guest using the H_INT_GET_SOURCE_INFO hcall. For KVM, I have
> introduced an ioctl to configure the KVM device.

Ok.

> 
> Same for the TIMA, but in this case, the address is exposed to the
> guest in the device tree.
>  
> > 
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  hw/intc/spapr_xive.c        | 34 ++++++++++++++++++++++++++++++++++
> >>  include/hw/ppc/spapr_xive.h |  3 +++
> >>  include/hw/ppc/xive.h       |  6 ++++++
> >>  3 files changed, 43 insertions(+)
> >>
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> index 020444e2665a..90cde8a4082d 100644
> >> --- a/hw/intc/spapr_xive.c
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -14,12 +14,15 @@
> >>  #include "sysemu/cpus.h"
> >>  #include "monitor/monitor.h"
> >>  #include "hw/ppc/spapr_xive.h"
> >> +#include "hw/ppc/xive.h"
> >>  #include "hw/ppc/xive_regs.h"
> >>  
> >>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> >>  {
> >>      int i;
> >>  
> >> +    xive_source_pic_print_info(&xive->source, mon);
> >> +
> >>      monitor_printf(mon, "IVE Table\n");
> >>      for (i = 0; i < xive->nr_irqs; i++) {
> >>          XiveIVE *ive = &xive->ivt[i];
> >> @@ -40,6 +43,9 @@ static void spapr_xive_reset(DeviceState *dev)
> >>      sPAPRXive *xive = SPAPR_XIVE(dev);
> >>      int i;
> >>  
> >> +    /* Xive Source reset is done through SysBus, it should put all
> >> +     * IRQs to OFF (!P|Q) */
> >> +
> >>      /* Mask all valid IVEs in the IRQ number space. */
> >>      for (i = 0; i < xive->nr_irqs; i++) {
> >>          XiveIVE *ive = &xive->ivt[i];
> >> @@ -51,18 +57,42 @@ static void spapr_xive_reset(DeviceState *dev)
> >>  
> >>  static void spapr_xive_init(Object *obj)
> >>  {
> >> +    sPAPRXive *xive = SPAPR_XIVE(obj);
> >>  
> >> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
> >> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> >>  }
> >>  
> >>  static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >>  {
> >>      sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    XiveSource *xsrc = &xive->source;
> >> +    Error *local_err = NULL;
> >>  
> >>      if (!xive->nr_irqs) {
> >>          error_setg(errp, "Number of interrupt needs to be greater 0");
> >>          return;
> >>      }
> >>  
> >> +    /* The XIVE interrupt controller has an internal source for IPIs
> >> +     * and generic IPIs, the PSIHB has one and also the PHBs. For
> >> +     * simplicity, we use a unique XIVE source object for *all*
> >> +     * interrupts on sPAPR. The ESBs pages are mapped at the address
> >> +     * of chip 0 of a real system.
> >> +     */
> >> +    object_property_set_int(OBJECT(xsrc), XIVE_VC_BASE, "bar",
> >> +                            &error_fatal);
> >> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
> >> +                            &error_fatal);
> >> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
> >> +                                   &error_fatal);
> >> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
> >> +
> >>      /* Allocate the Interrupt Virtualization Table */
> >>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> >>  }
> >> @@ -137,23 +167,27 @@ type_init(spapr_xive_register_types)
> >>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
> >>  {
> >>      XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> >> +    XiveSource *xsrc = &xive->source;
> >>  
> >>      if (!ive) {
> >>          return false;
> >>      }
> >>  
> >>      ive->w |= IVE_VALID;
> >> +    xive_source_irq_set(xsrc, lisn - xsrc->offset, lsi);
> >>      return true;
> >>  }
> >>  
> >>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> >>  {
> >>      XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
> >> +    XiveSource *xsrc = &xive->source;
> >>  
> >>      if (!ive) {
> >>          return false;
> >>      }
> >>  
> >>      ive->w &= ~IVE_VALID;
> >> +    xive_source_irq_set(xsrc, lisn - xsrc->offset, false);
> >>      return true;
> >>  }
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> index 1d966b5d3a96..4538c622b60a 100644
> >> --- a/include/hw/ppc/spapr_xive.h
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -19,6 +19,9 @@
> >>  typedef struct sPAPRXive {
> >>      SysBusDevice parent;
> >>  
> >> +    /* Internal interrupt source for IPIs and virtual devices */
> >> +    XiveSource   source;
> >> +
> >>      /* Routing table */
> >>      XiveIVE      *ivt;
> >>      uint32_t     nr_irqs;
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 5b145816acdc..57295715a4a5 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -16,6 +16,12 @@
> >>  typedef struct XiveFabric XiveFabric;
> >>  
> >>  /*
> >> + * XIVE MMIO regions
> >> + */
> >> +
> >> +#define XIVE_VC_BASE   0x0006010000000000ull



> >> +
> >> +/*
> >>   * XIVE Interrupt Source
> >>   */
> >>  
> > 
> 
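The reset comment in this patch relies on the 2-bit ESB state machine, in which a source left OFF (!P|Q) drops its triggers. The following is a hedged sketch of the transitions. It assumes the P = 0x2 / Q = 0x1 encoding and uses hypothetical esb_trigger()/esb_eoi() helpers that return whether the event has to be forwarded to the router.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encoding of an Event State Buffer */
#define ESB_VAL_P    0x2
#define ESB_VAL_Q    0x1
#define ESB_RESET    0x0
#define ESB_OFF      ESB_VAL_Q                 /* !P|Q: source disabled */
#define ESB_PENDING  ESB_VAL_P
#define ESB_QUEUED   (ESB_VAL_P | ESB_VAL_Q)

/* Event from the source. Returns true if it must be notified onwards. */
static bool esb_trigger(uint8_t *pq)
{
    assert(*pq <= ESB_QUEUED);
    switch (*pq) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;     /* coalesce: remember that one more came in */
        return false;
    default:                  /* ESB_OFF: drop the event */
        return false;
    }
}

/* EOI from the OS. Returns true if a queued event must be replayed. */
static bool esb_eoi(uint8_t *pq)
{
    assert(*pq <= ESB_QUEUED);
    switch (*pq) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    default:                  /* ESB_OFF stays OFF */
        return false;
    }
}
```

A trigger forwards an event only from RESET, and an EOI re-notifies only from QUEUED. Repeated triggers while PENDING therefore coalesce into a single replay.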

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model Cédric Le Goater
@ 2018-04-26  7:11   ` David Gibson
  2018-04-26  9:27     ` Cédric Le Goater
  2018-05-02  7:39     ` Cédric Le Goater
  0 siblings, 2 replies; 100+ messages in thread
From: David Gibson @ 2018-04-26  7:11 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
> The XIVE presenter engine uses a set of registers to handle priority
> management and interrupt acknowledgment among other things. The most
> important ones are:
> 
>   - Interrupt Priority Register (PIPR)
>   - Interrupt Pending Buffer (IPB)
>   - Current Processor Priority (CPPR)
>   - Notification Source Register (NSR)
> 
> There is one set of registers per level of privilege, four in all :
> HW, HV pool, OS and User. These are called rings. All registers are
> accessible through a specific MMIO region called the Thread Interrupt
> Management Area (TIMA) but, depending on the privilege level of the
> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
> OS privilege level and therefore can only access the OS and User
> rings. The others are for hypervisor levels.
> 
> The CPU interrupt state is modeled with a XiveNVT object which stores
> the values of the different registers. The different TIMA views are
> mapped at the same address for each CPU and 'current_cpu' is used to
> retrieve the XiveNVT holding the ring registers.
> 
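Each ring is one 16-byte quad word (QW) at the start of the TIMA page. A register address is therefore the ring base plus a byte offset, and TM_RING() recovers the ring from an offset. The standalone sketch below makes this concrete. It reuses the offsets this patch adds to xive_regs.h and mirrors the big-endian assembly loop of xive_tm_os_read(). tm_regs_read() is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

/* Offsets as defined in xive_regs.h by this patch */
#define TM_QW0_USER     0x000
#define TM_QW1_OS       0x010
#define TM_QW2_HV_POOL  0x020
#define TM_QW3_HV_PHYS  0x030
#define TM_NSR          0x0
#define TM_CPPR         0x1

#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10
#define TM_RING(offset) ((offset) & 0xf0)

/* Big-endian load of 'size' bytes starting at 'offset' in the register
 * file: the first byte ends up in the most significant position. */
static uint64_t tm_regs_read(const uint8_t *regs, unsigned offset,
                             unsigned size)
{
    uint64_t ret = 0;

    assert(size >= 1 && size <= 8);
    assert(offset + size <= TM_RING_COUNT * TM_RING_SIZE);
    for (unsigned i = 0; i < size; i++) {
        ret |= (uint64_t)regs[offset + i] << (8 * (size - i - 1));
    }
    return ret;
}
```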
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
> 
>  Changes since v2 :
> 
>  - introduced the XiveFabric interface
> 
>  hw/intc/spapr_xive.c        |  25 ++++
>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr_xive.h |   5 +
>  include/hw/ppc/xive.h       |  31 +++++
>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>  5 files changed, 424 insertions(+)
> 
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index 90cde8a4082d..f07832bf0a00 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -13,6 +13,7 @@
>  #include "target/ppc/cpu.h"
>  #include "sysemu/cpus.h"
>  #include "monitor/monitor.h"
> +#include "hw/ppc/spapr.h"
>  #include "hw/ppc/spapr_xive.h"
>  #include "hw/ppc/xive.h"
>  #include "hw/ppc/xive_regs.h"
> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>  
>      /* Allocate the Interrupt Virtualization Table */
>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> +
> +    /* The Thread Interrupt Management Area has the same address for
> +     * each chip. On sPAPR, we only need to expose the User and OS
> +     * level views of the TIMA.
> +     */
> +    xive->tm_base = XIVE_TM_BASE;

The constant should probably have PAPR in the name somewhere, since
it's just for PAPR machines (same for the ESB mappings, actually).

> +
> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
> +                          &xive_tm_user_ops, xive, "xive.tima.user",
> +                          1ull << TM_SHIFT);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
> +
> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
> +                          &xive_tm_os_ops, xive, "xive.tima.os",
> +                          1ull << TM_SHIFT);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>  }
>  
>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>  }
>  
> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
> +{
> +    PowerPCCPU *cpu = spapr_find_cpu(server);
> +
> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
> +}

So this is a bit of a tangent, but I've been thinking of implementing
a scheme where there's an opaque pointer in the cpu structure for the
use of the machine.  I'm planning for that to replace the intc pointer
(which isn't really used directly by the cpu). That would allow us to
have spapr put a structure there and have both xics and xive pointers
which could be useful later on.

I think we'd need something similar to correctly handle migration of
the VPA state, which is currently horribly broken.
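Here is a rough sketch of that idea, with all names hypothetical. CPUStub stands in for the CPU structure, machine_data for the opaque slot, and SpaprCpuState for the structure sPAPR would hang there. The presenter types are left opaque, which is all the CPU side needs.

```c
#include <assert.h>
#include <stddef.h>

typedef struct ICPState ICPState;   /* XICS presenter, opaque here */
typedef struct XiveNVT  XiveNVT;    /* XIVE presenter, opaque here */

/* Hypothetical stand-in for the CPU structure */
typedef struct CPUStub {
    int   cpu_index;
    void *machine_data;             /* owned by the machine, never used by the CPU */
} CPUStub;

/* Hypothetical per-vCPU machine state: one slot per interrupt model */
typedef struct SpaprCpuState {
    ICPState *icp;
    XiveNVT  *nvt;
} SpaprCpuState;

/* Accessor the machine code would use instead of cpu->intc */
static inline SpaprCpuState *spapr_cpu_state(CPUStub *cpu)
{
    assert(cpu->machine_data);
    return cpu->machine_data;
}
```

The machine would allocate one SpaprCpuState per vCPU and fill in whichever presenters it instantiates. The CPU code itself never dereferences the pointer.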

> +
>  static const VMStateDescription vmstate_spapr_xive_ive = {
>      .name = TYPE_SPAPR_XIVE "/ive",
>      .version_id = 1,
> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>      dc->vmsd = &vmstate_spapr_xive;
>  
>      xfc->get_ive = spapr_xive_get_ive;
> +    xfc->get_nvt = spapr_xive_get_nvt;
>  }
>  
>  static const TypeInfo spapr_xive_info = {
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index dccad0318834..5691bb9474e4 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -14,7 +14,278 @@
>  #include "sysemu/cpus.h"
>  #include "sysemu/dma.h"
>  #include "monitor/monitor.h"
> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>  #include "hw/ppc/xive.h"
> +#include "hw/ppc/xive_regs.h"
> +
> +/*
> + * XIVE Interrupt Presenter
> + */
> +
> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
> +{
> +    return 0;
> +}
> +
> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
> +{
> +    if (cppr > XIVE_PRIORITY_MAX) {
> +        cppr = 0xff;
> +    }
> +
> +    nvt->ring_os[TM_CPPR] = cppr;

Surely this needs to recheck if we should be interrupting the cpu?
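For reference, here is a hedged sketch of what the recheck could look like, modelling only the OS ring bytes involved. It makes the following assumptions:

- The IPB holds one bit per pending priority (MSB = priority 0).
- The PIPR is derived from the IPB (0xff when nothing is pending).
- An exception is due whenever the PIPR is more favored than (numerically lower than) the CPPR.

OsRing and the helper names are hypothetical. irq_out stands in for the qemu_irq output line and is treated as a level.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7

/* Hypothetical reduced OS ring state */
typedef struct OsRing {
    uint8_t cppr;     /* current processor priority, 0 = most favored */
    uint8_t ipb;      /* pending priorities, MSB = priority 0 */
    uint8_t pipr;     /* most favored pending priority, 0xff if none */
    bool    irq_out;  /* stand-in for the output line */
} OsRing;

static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio = 0;

    if (!ipb) {
        return 0xff;
    }
    while (!(ipb & (0x80 >> prio))) {
        prio++;
    }
    assert(prio <= XIVE_PRIORITY_MAX);
    return prio;
}

/* Re-evaluate the output after any change to CPPR or IPB. */
static void os_ring_notify(OsRing *r)
{
    r->pipr = ipb_to_pipr(r->ipb);
    r->irq_out = r->pipr < r->cppr;
}

static void os_ring_set_cppr(OsRing *r, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    r->cppr = cppr;
    os_ring_notify(r);   /* the missing recheck */
}
```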

> +}
> +
> +/*
> + * OS Thread Interrupt Management Area MMIO
> + */
> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
> +                                           unsigned size)
> +{
> +    uint64_t ret = -1;
> +
> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
> +        ret = xive_nvt_accept(nvt);
> +    } else {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
> +                      HWADDR_PRIx" size %d\n", offset, size);
> +    }
> +
> +    return ret;
> +}
> +
> +#define TM_RING(offset) ((offset) & 0xf0)
> +
> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
> +                                      unsigned size)
> +{
> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);

So, as I said on a previous version of this, we can actually correctly
represent different mappings in different cpu spaces, by exploiting
cpu->as and not just having them all point to &address_space_memory.

> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> +    uint64_t ret = -1;
> +    int i;
> +
> +    if (offset >= TM_SPC_ACK_EBB) {
> +        return xive_tm_read_special(nvt, offset, size);
> +    }
> +
> +    if (TM_RING(offset) != TM_QW1_OS) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
> +                      HWADDR_PRIx"\n", offset);
> +        return ret;

Just returning -1 would be clearer here.

> +    }
> +
> +    ret = 0;
> +    for (i = 0; i < size; i++) {
> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
> +    }
> +
> +    return ret;
> +}
> +
> +static bool xive_tm_is_readonly(uint8_t offset)
> +{
> +    return offset != TM_QW1_OS + TM_CPPR;
> +}
> +
> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
> +                                        uint64_t value, unsigned size)
> +{
> +    /* TODO: support TM_SPC_SET_OS_PENDING */
> +
> +    /* TODO: support TM_SPC_ACK_OS_EL */
> +}
> +
> +static void xive_tm_os_write(void *opaque, hwaddr offset,
> +                                   uint64_t value, unsigned size)
> +{
> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> +    int i;
> +
> +    if (offset >= TM_SPC_ACK_EBB) {
> +        xive_tm_write_special(nvt, offset, value, size);
> +        return;
> +    }
> +
> +    if (TM_RING(offset) != TM_QW1_OS) {

Why have this if you have separate OS and user regions as you appear
to do below?

Or to look at it another way, shouldn't it be possible to make the
read/write accessors the same for the OS and user rings?
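Here is one way to do that, sketched under a few assumptions. A per-view descriptor is passed as the MemoryRegion opaque, and a single accessor checks the ring. For the User and OS views, a view can reach its own quad word and those of the less privileged rings at lower offsets, per the "All rings" and "Ring 0..2" annotations in xive_regs.h. TimaView and tima_view_can_access() are hypothetical names. Special operations at 0x800 and above would still be dispatched before this check.

```c
#include <assert.h>
#include <stdbool.h>

#define TM_QW0_USER     0x000
#define TM_QW1_OS       0x010
#define TM_QW2_HV_POOL  0x020
#define TM_RING(offset) ((offset) & 0xf0)

/* Hypothetical per-view descriptor, used as the MemoryRegion opaque */
typedef struct TimaView {
    const char *name;
    unsigned    top_ring;   /* most privileged QW visible in this view */
} TimaView;

static const TimaView tima_user_view = { "user", TM_QW0_USER };
static const TimaView tima_os_view   = { "os",   TM_QW1_OS };

/* One check shared by the read and write accessors of every view */
static bool tima_view_can_access(const TimaView *view, unsigned offset)
{
    assert(offset < 0x40);  /* plain ring registers only */
    return TM_RING(offset) <= view->top_ring;
}
```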

> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
> +                      HWADDR_PRIx"\n", offset);
> +        return;
> +    }
> +
> +    switch (size) {
> +    case 1:
> +        if (offset == TM_QW1_OS + TM_CPPR) {
> +            xive_nvt_set_cppr(nvt, value & 0xff);
> +        }
> +        break;
> +    case 4:
> +    case 8:
> +        for (i = 0; i < size; i++) {
> +            if (!xive_tm_is_readonly(offset + i)) {
> +                nvt->regs[offset + i] = (value >> (8 * (size - i - 1))) & 0xff;
> +            }
> +        }
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +
> +const MemoryRegionOps xive_tm_os_ops = {
> +    .read = xive_tm_os_read,
> +    .write = xive_tm_os_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +/*
> + * User Thread Interrupt Management Area MMIO
> + */
> +
> +static uint64_t xive_tm_user_read(void *opaque, hwaddr offset,
> +                                        unsigned size)
> +{
> +    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
> +                  HWADDR_PRIx"\n", offset);
> +    return -1;
> +}
> +
> +static void xive_tm_user_write(void *opaque, hwaddr offset,
> +                                     uint64_t value, unsigned size)
> +{
> +    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
> +                  HWADDR_PRIx"\n", offset);
> +}
> +
> +
> +const MemoryRegionOps xive_tm_user_ops = {
> +    .read = xive_tm_user_read,
> +    .write = xive_tm_user_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static char *xive_nvt_ring_print(uint8_t *ring)
> +{
> +    uint32_t w2 = ldl_be_p(&ring[TM_WORD2]);
> +
> +    return g_strdup_printf("%02x  %02x   %02x  %02x    %02x   "
> +                   "%02x  %02x  %02x   %08x",
> +                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
> +                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
> +                   w2);
> +}
> +
> +void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
> +{
> +    int cpu_index = nvt->cs ? nvt->cs->cpu_index : -1;
> +    char *s;
> +
> +    monitor_printf(mon, "CPU[%04x]: QW    NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
> +                   " W2\n", cpu_index);
> +
> +    s = xive_nvt_ring_print(&nvt->regs[TM_QW1_OS]);
> +    monitor_printf(mon, "CPU[%04x]: OS    %s\n", cpu_index, s);
> +    g_free(s);
> +    s = xive_nvt_ring_print(&nvt->regs[TM_QW0_USER]);
> +    monitor_printf(mon, "CPU[%04x]: USER  %s\n", cpu_index, s);
> +    g_free(s);
> +}
> +
> +static void xive_nvt_reset(void *dev)
> +{
> +    XiveNVT *nvt = XIVE_NVT(dev);
> +
> +    memset(nvt->regs, 0, sizeof(nvt->regs));
> +}
> +
> +static void xive_nvt_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveNVT *nvt = XIVE_NVT(dev);
> +    PowerPCCPU *cpu;
> +    CPUPPCState *env;
> +    Object *obj;
> +    Error *err = NULL;
> +
> +    obj = object_property_get_link(OBJECT(dev), ICP_PROP_CPU, &err);

Please get rid of the remaining "ICP" naming in the xive code.

> +    if (!obj) {
> +        error_propagate(errp, err);
> +        error_prepend(errp, "required link '" ICP_PROP_CPU "' not found: ");
> +        return;
> +    }
> +
> +    cpu = POWERPC_CPU(obj);
> +    nvt->cs = CPU(obj);
> +
> +    env = &cpu->env;
> +    switch (PPC_INPUT(env)) {
> +    case PPC_FLAGS_INPUT_POWER7:
> +        nvt->output = env->irq_inputs[POWER7_INPUT_INT];
> +        break;
> +
> +    default:
> +        error_setg(errp, "XIVE interrupt controller does not support "
> +                   "this CPU bus model");
> +        return;
> +    }
> +
> +    qemu_register_reset(xive_nvt_reset, dev);

If this is a sysbus device, which I think it is, you shouldn't need to
explicitly register a reset handler.  Instead you can set a device
reset handler which will be called with the reset.

> +}
> +
> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
> +{
> +    qemu_unregister_reset(xive_nvt_reset, dev);
> +}
> +
> +static void xive_nvt_init(Object *obj)
> +{
> +    XiveNVT *nvt = XIVE_NVT(obj);
> +
> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];

The ring_os field is basically pointless, being just an offset into a
structure you already have.  A macro or inline would be a better idea.
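For instance, something along these lines, shown on a reduced stand-in for XiveNVT (NVTStub, nvt_ring() and nvt_ring_os() are hypothetical names):

```c
#include <assert.h>
#include <stdint.h>

#define TM_QW0_USER    0x000
#define TM_QW1_OS      0x010
#define TM_CPPR        0x1
#define TM_RING_COUNT  4
#define TM_RING_SIZE   0x10

/* Reduced XiveNVT: only the register file matters here */
typedef struct NVTStub {
    uint8_t regs[TM_RING_COUNT * TM_RING_SIZE];
} NVTStub;

/* Compute the ring from regs on demand instead of caching a pointer
 * in instance_init. */
static inline uint8_t *nvt_ring(NVTStub *nvt, unsigned qw)
{
    assert(qw % TM_RING_SIZE == 0 && qw < sizeof(nvt->regs));
    return &nvt->regs[qw];
}

#define nvt_ring_os(nvt)   nvt_ring((nvt), TM_QW1_OS)
```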

> +}
> +
> +static const VMStateDescription vmstate_xive_nvt = {
> +    .name = TYPE_XIVE_NVT,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BUFFER(regs, XiveNVT),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static void xive_nvt_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->realize = xive_nvt_realize;
> +    dc->unrealize = xive_nvt_unrealize;
> +    dc->desc = "XIVE Interrupt Presenter";
> +    dc->vmsd = &vmstate_xive_nvt;
> +}
> +
> +static const TypeInfo xive_nvt_info = {
> +    .name          = TYPE_XIVE_NVT,
> +    .parent        = TYPE_DEVICE,
> +    .instance_size = sizeof(XiveNVT),
> +    .instance_init = xive_nvt_init,
> +    .class_init    = xive_nvt_class_init,
> +};
>  
>  /*
>   * XIVE Fabric
> @@ -27,6 +298,13 @@ XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
>      return xfc->get_ive(xf, lisn);
>  }
>  
> +XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
> +{
> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
> +
> +    return xfc->get_nvt(xf, server);
> +}
> +
>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>  {
>  
> @@ -418,6 +696,7 @@ static void xive_register_types(void)
>  {
>      type_register_static(&xive_source_info);
>      type_register_static(&xive_fabric_info);
> +    type_register_static(&xive_nvt_info);
>  }
>  
>  type_init(xive_register_types)
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 4538c622b60a..25d78eec884d 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -25,6 +25,11 @@ typedef struct sPAPRXive {
>      /* Routing table */
>      XiveIVE      *ivt;
>      uint32_t     nr_irqs;
> +
> +    /* TIMA memory regions */
> +    hwaddr       tm_base;
> +    MemoryRegion tm_mmio_user;
> +    MemoryRegion tm_mmio_os;
>  } sPAPRXive;
>  
>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 57295715a4a5..1a2da610d91c 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -20,6 +20,7 @@ typedef struct XiveFabric XiveFabric;
>   */
>  
>  #define XIVE_VC_BASE   0x0006010000000000ull
> +#define XIVE_TM_BASE   0x0006030203180000ull
>  
>  /*
>   * XIVE Interrupt Source
> @@ -155,6 +156,34 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>  }
>  
>  /*
> + * XIVE Interrupt Presenter
> + */
> +
> +#define TYPE_XIVE_NVT "xive-nvt"
> +#define XIVE_NVT(obj) OBJECT_CHECK(XiveNVT, (obj), TYPE_XIVE_NVT)
> +
> +#define TM_RING_COUNT           4
> +#define TM_RING_SIZE            0x10
> +
> +typedef struct XiveNVT {
> +    DeviceState parent_obj;
> +
> +    CPUState  *cs;
> +    qemu_irq  output;
> +
> +    /* Thread interrupt Management (TM) registers */
> +    uint8_t   regs[TM_RING_COUNT * TM_RING_SIZE];
> +
> +    /* Shortcuts to rings */
> +    uint8_t   *ring_os;
> +} XiveNVT;
> +
> +extern const MemoryRegionOps xive_tm_user_ops;
> +extern const MemoryRegionOps xive_tm_os_ops;
> +
> +void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
> +
> +/*
>   * XIVE Fabric
>   */
>  
> @@ -175,8 +204,10 @@ typedef struct XiveFabricClass {
>      void (*notify)(XiveFabric *xf, uint32_t lisn);
>  
>      XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
> +    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>  } XiveFabricClass;
>  
>  XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
> +XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
>  
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index 5903f29eb789..f2e2a1ac8f6e 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -10,6 +10,88 @@
>  #ifndef _PPC_XIVE_REGS_H
>  #define _PPC_XIVE_REGS_H
>  
> +#define TM_SHIFT                16
> +
> +/* TM register offsets */
> +#define TM_QW0_USER             0x000 /* All rings */
> +#define TM_QW1_OS               0x010 /* Ring 0..2 */
> +#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
> +#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
> +
> +/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
> +#define TM_NSR                  0x0  /*  +   +   -   +  */
> +#define TM_CPPR                 0x1  /*  -   +   -   +  */
> +#define TM_IPB                  0x2  /*  -   +   +   +  */
> +#define TM_LSMFB                0x3  /*  -   +   +   +  */
> +#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
> +#define TM_INC                  0x5  /*  -   +   -   +  */
> +#define TM_AGE                  0x6  /*  -   +   -   +  */
> +#define TM_PIPR                 0x7  /*  -   +   -   +  */
> +
> +#define TM_WORD0                0x0
> +#define TM_WORD1                0x4
> +
> +/*
> + * QW word 2 contains the valid bit at the top and other fields
> + * depending on the QW.
> + */
> +#define TM_WORD2                0x8
> +#define   TM_QW0W2_VU           PPC_BIT32(0)
> +#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
> +#define   TM_QW1W2_VO           PPC_BIT32(0)
> +#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
> +#define   TM_QW2W2_VP           PPC_BIT32(0)
> +#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
> +#define   TM_QW3W2_VT           PPC_BIT32(0)
> +#define   TM_QW3W2_LP           PPC_BIT32(6)
> +#define   TM_QW3W2_LE           PPC_BIT32(7)
> +#define   TM_QW3W2_T            PPC_BIT32(31)
> +
> +/*
> + * In addition to normal loads to "peek" and writes (only when invalid)
> + * using 4 and 8 bytes accesses, the above registers support these
> + * "special" byte operations:
> + *
> + *   - Byte load from QW0[NSR] - User level NSR (EBB)
> + *   - Byte store to QW0[NSR] - User level NSR (EBB)
> + *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
> + *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
> + *                                    otherwise VT||0000000
> + *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
> + *
> + * Then we have all these "special" CI ops at these offsets that trigger
> + * all sorts of side effects:
> + */
> +#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg*/
> +#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
> +#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
> +#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
> +                                         * context */
> +#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
> +#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
> +                                         * context to reg */
> +#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
> +                                         * context to reg*/
> +#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
> +#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
> +                                         * line */
> +#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
> +#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
> +                                         * line */
> +#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
> +/* XXX more... */
> +
> +/* NSR fields for the various QW ack types */
> +#define TM_QW0_NSR_EB           PPC_BIT8(0)
> +#define TM_QW1_NSR_EO           PPC_BIT8(0)
> +#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
> +#define  TM_QW3_NSR_HE_NONE     0
> +#define  TM_QW3_NSR_HE_POOL     1
> +#define  TM_QW3_NSR_HE_PHYS     2
> +#define  TM_QW3_NSR_HE_LSI      3
> +#define TM_QW3_NSR_I            PPC_BIT8(2)
> +#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
> +
>  /* IVE/EAS
>   *
>   * One per interrupt source. Targets that interrupt to a given EQ
> @@ -30,4 +112,6 @@ typedef struct XiveIVE {
>  #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>  } XiveIVE;
>  
> +#define XIVE_PRIORITY_MAX  7
> +
>  #endif /* _INTC_XIVE_INTERNAL_H */
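The register definitions in this header use IBM (MSB 0) bit numbering, where bit 0 is the most significant bit, and every TIMA register address is a ring (QW) base plus a byte offset. The standalone sketch below shows both. The PPC_BIT*/PPC_BITMASK* macros are written to match our understanding of QEMU's definitions, and getfield32() is a hypothetical helper standing in for GETFIELD; treat this as an illustration, not the QEMU code.

```c
#include <assert.h>
#include <stdint.h>

/* IBM (MSB 0) bit helpers: bit 0 is the most significant bit. */
#define PPC_BIT32(bit)          (0x80000000u >> (bit))
#define PPC_BIT8(bit)           (0x80u >> (bit))
#define PPC_BITMASK32(bs, be)   ((PPC_BIT32(bs) - PPC_BIT32(be)) | PPC_BIT32(bs))
#define PPC_BITMASK8(bs, be)    ((PPC_BIT8(bs) - PPC_BIT8(be)) | PPC_BIT8(bs))

/* Ring (QW) bases and byte offsets, copied from the header above. */
#define TM_QW1_OS               0x010
#define TM_QW3_HV_PHYS          0x030
#define TM_CPPR                 0x1
#define TM_PIPR                 0x7

#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
#define TM_QW3_NSR_HE_PHYS      2

/* Extract the field selected by 'mask' from 'word' (hypothetical helper). */
static uint32_t getfield32(uint32_t mask, uint32_t word)
{
    uint32_t v = word & mask;

    assert(mask != 0);          /* an empty mask would loop forever */
    while (!(mask & 1u)) {      /* shift both down to the field's LSB */
        mask >>= 1;
        v >>= 1;
    }
    return v;
}
```

With these, a CPPR access through the OS ring lands at TIMA offset 0x11 (TM_QW1_OS + TM_CPPR), and an NSR byte of 0x80 read from QW3 decodes to TM_QW3_NSR_HE_PHYS. A multi-bit field such as TM_QW3_NSR_GRP_LVL needs PPC_BITMASK8(), since PPC_BIT8() takes a single bit number.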

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues Cédric Le Goater
@ 2018-04-26  7:25   ` David Gibson
  2018-04-26  9:48     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-26  7:25 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
> The Event Queue Descriptor (EQD) table is an internal table of the
> XIVE routing sub-engine. It specifies on which Event Queue the event
> data should be posted when an exception occurs (later on pulled by the
> OS) and which Virtual Processor to notify.

Uhhh.. I thought the IVT said which queue and vp to notify, and the
EQD gave metadata for event queues.

> The Event Queue is a much
> more complex structure but we start with a simple model for the sPAPR
> machine.
> 
> There is one XiveEQ per priority and these are stored under the XIVE
> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
> 
>        (server << 3) | (priority & 0x7)
> 
> This is not in the XIVE architecture, but as the EQ index is never
> exposed to the guest, neither in the hcalls nor in the device tree, we
> are free to use whatever best fits the current model.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Is the EQD actually modifiable by a guest?  Or are the settings of the
EQs fixed by PAPR?
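The (server << 3) | (priority & 0x7) encoding from the commit message can be exercised in isolation. A minimal sketch, assuming cut-down stand-ins (SketchEQ, SketchNVT, SketchFabric and sketch_get_nvt(), all hypothetical) for XiveEQ, XiveNVT and the XiveFabric class. The three SPAPR_XIVE_EQ_* macros are copied from the spapr_xive.h hunk of this patch, and sketch_get_eq() follows the same decode-then-index steps as spapr_xive_get_eq() and xive_nvt_eq_get().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX  7

/* sPAPR EQ index encoding, as in the spapr_xive.h hunk of this patch. */
#define SPAPR_XIVE_EQ_INDEX(server, prio)  (((server) << 3) | ((prio) & 0x7))
#define SPAPR_XIVE_EQ_SERVER(eq_idx)       ((eq_idx) >> 3)
#define SPAPR_XIVE_EQ_PRIO(eq_idx)         ((eq_idx) & 0x7)

/* Cut-down stand-ins for XiveEQ / XiveNVT: only what the lookup needs. */
typedef struct { uint32_t w0; } SketchEQ;
typedef struct { SketchEQ eqt[XIVE_PRIORITY_MAX + 1]; } SketchNVT;

/* A "fabric" reduced to one method, like XiveFabricClass::get_nvt. */
typedef struct SketchFabric {
    SketchNVT *(*get_nvt)(struct SketchFabric *xf, uint32_t server);
    SketchNVT *nvts;            /* hypothetical: one NVT per server */
    uint32_t   nr_servers;
} SketchFabric;

static SketchNVT *sketch_get_nvt(SketchFabric *xf, uint32_t server)
{
    return server < xf->nr_servers ? &xf->nvts[server] : NULL;
}

/* Decode the EQ index, find the server's NVT, then index by priority. */
static SketchEQ *sketch_get_eq(SketchFabric *xf, uint32_t eq_idx)
{
    SketchNVT *nvt = xf->get_nvt(xf, SPAPR_XIVE_EQ_SERVER(eq_idx));
    uint8_t prio = SPAPR_XIVE_EQ_PRIO(eq_idx);

    /* prio cannot exceed 7 after the mask; kept to mirror xive_nvt_eq_get() */
    if (!nvt || prio > XIVE_PRIORITY_MAX) {
        return NULL;
    }
    return &nvt->eqt[prio];
}
```

Because only three bits carry the priority, each server gets exactly XIVE_PRIORITY_MAX + 1 = 8 queues, which matches the size of the eqt[] array added to XiveNVT.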

> ---
> 
>  Changes since v2 :
> 
>  - introduced the XiveFabric interface
> 
>  hw/intc/spapr_xive.c        | 31 +++++++++++++++++---
>  hw/intc/xive.c              | 71 +++++++++++++++++++++++++++++++++++++++++++++
>  include/hw/ppc/spapr_xive.h |  7 +++++
>  include/hw/ppc/xive.h       |  8 +++++
>  include/hw/ppc/xive_regs.h  | 48 ++++++++++++++++++++++++++++++
>  5 files changed, 161 insertions(+), 4 deletions(-)
> 
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index f07832bf0a00..d0d5a7d7f969 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -27,15 +27,30 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>      monitor_printf(mon, "IVE Table\n");
>      for (i = 0; i < xive->nr_irqs; i++) {
>          XiveIVE *ive = &xive->ivt[i];
> +        uint32_t eq_idx;
>  
>          if (!(ive->w & IVE_VALID)) {
>              continue;
>          }
>  
> -        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
> -                       ive->w & IVE_MASKED ? "M" : " ",
> -                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
> -                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
> +        eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
> +
> +        monitor_printf(mon, "  %6x %s eqidx:%03d ", i,
> +                       ive->w & IVE_MASKED ? "M" : " ", eq_idx);
> +
> +        if (!(ive->w & IVE_MASKED)) {
> +            XiveEQ *eq;
> +
> +            eq = xive_fabric_get_eq(XIVE_FABRIC(xive), eq_idx);
> +            if (eq && (eq->w0 & EQ_W0_VALID)) {
> +                xive_eq_pic_print_info(eq, mon);
> +                monitor_printf(mon, " data:%08x",
> +                               (int) GETFIELD(IVE_EQ_DATA, ive->w));
> +            } else {
> +                monitor_printf(mon, "no eq ?!");
> +            }
> +        }
> +        monitor_printf(mon, "\n");
>      }
>  }
>  
> @@ -128,6 +143,13 @@ static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>      return cpu ? XIVE_NVT(cpu->intc) : NULL;
>  }
>  
> +static XiveEQ *spapr_xive_get_eq(XiveFabric *xf, uint32_t eq_idx)
> +{
> +    XiveNVT *nvt = xive_fabric_get_nvt(xf, SPAPR_XIVE_EQ_SERVER(eq_idx));
> +
> +    return xive_nvt_eq_get(nvt, SPAPR_XIVE_EQ_PRIO(eq_idx));
> +}
> +
>  static const VMStateDescription vmstate_spapr_xive_ive = {
>      .name = TYPE_SPAPR_XIVE "/ive",
>      .version_id = 1,
> @@ -168,6 +190,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>  
>      xfc->get_ive = spapr_xive_get_ive;
>      xfc->get_nvt = spapr_xive_get_nvt;
> +    xfc->get_eq = spapr_xive_get_eq;
>  }
>  
>  static const TypeInfo spapr_xive_info = {
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 5691bb9474e4..2ab37fde80e8 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -19,6 +19,47 @@
>  #include "hw/ppc/xive_regs.h"
>  
>  /*
> + * XiveEQ helpers
> + */
> +
> +XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority)
> +{
> +    if (!nvt || priority > XIVE_PRIORITY_MAX) {
> +        return NULL;
> +    }
> +    return &nvt->eqt[priority];
> +}
> +
> +void xive_eq_reset(XiveEQ *eq)
> +{
> +    memset(eq, 0, sizeof(*eq));
> +
> +    /* switch off the escalation and notification ESBs */
> +    eq->w1 = EQ_W1_ESe_Q | EQ_W1_ESn_Q;
> +}
> +
> +void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon)
> +{
> +    uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
> +    uint32_t qindex = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
> +    uint32_t qgen = GETFIELD(EQ_W1_GENERATION, eq->w1);
> +    uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
> +    uint32_t qentries = 1 << (qsize + 10);
> +
> +    uint32_t server = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
> +    uint8_t priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
> +
> +    monitor_printf(mon, "%c%c%c%c%c prio:%d server:%03d eq:@%08"PRIx64
> +                   "% 6d/%5d ^%d",
> +                   eq->w0 & EQ_W0_VALID ? 'v' : '-',
> +                   eq->w0 & EQ_W0_ENQUEUE ? 'q' : '-',
> +                   eq->w0 & EQ_W0_UCOND_NOTIFY ? 'n' : '-',
> +                   eq->w0 & EQ_W0_BACKLOG ? 'b' : '-',
> +                   eq->w0 & EQ_W0_ESCALATE_CTL ? 'e' : '-',
> +                   priority, server, qaddr_base, qindex, qentries, qgen);
> +}
> +
> +/*
>   * XIVE Interrupt Presenter
>   */
>  
> @@ -210,8 +251,12 @@ void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
>  static void xive_nvt_reset(void *dev)
>  {
>      XiveNVT *nvt = XIVE_NVT(dev);
> +    int i;
>  
>      memset(nvt->regs, 0, sizeof(nvt->regs));
> +    for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
> +        xive_eq_reset(&nvt->eqt[i]);
> +    }

Hrm.  Having the EQs "owned" by the NVT makes things simple for PAPR.
But won't that break down for the powernv case?
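xive_eq_reset() "switches off" the notification and escalation ESBs by setting only their Q bits (EQ_W1_ESn_Q and EQ_W1_ESe_Q). Below is a small sketch of the 2-bit P/Q state machine those fields hold, assuming the usual XIVE encoding with P as the high bit (00 reset, 10 pending, 11 queued, 01 off). The enum and function names are hypothetical, not the QEMU API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings: P is the high bit, Q the low bit (assumed convention). */
enum {
    ESB_RESET   = 0x0,  /* P=0 Q=0: idle, next trigger is forwarded */
    ESB_OFF     = 0x1,  /* P=0 Q=1: switched off, triggers are dropped */
    ESB_PENDING = 0x2,  /* P=1 Q=0: one event in flight */
    ESB_QUEUED  = 0x3,  /* P=1 Q=1: another event arrived meanwhile */
};

/* Apply a trigger; return true if the event should be forwarded. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;           /* coalesce with the pending event */
        return false;
    case ESB_OFF:
    default:
        return false;               /* state unchanged */
    }
}

/* EOI: clear P; return true if a coalesced event must be re-triggered. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    default:
        return false;               /* RESET and OFF are left untouched */
    }
}
```

Read back as a 2-bit field, EQ_W1_ESn after xive_eq_reset() is therefore 01, i.e. ESB_OFF in this model: triggers are dropped until software moves the state elsewhere.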

>  }
>  
>  static void xive_nvt_realize(DeviceState *dev, Error **errp)
> @@ -259,12 +304,31 @@ static void xive_nvt_init(Object *obj)
>      nvt->ring_os = &nvt->regs[TM_QW1_OS];
>  }
>  
> +static const VMStateDescription vmstate_xive_nvt_eq = {
> +    .name = TYPE_XIVE_NVT "/eq",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField []) {
> +        VMSTATE_UINT32(w0, XiveEQ),
> +        VMSTATE_UINT32(w1, XiveEQ),
> +        VMSTATE_UINT32(w2, XiveEQ),
> +        VMSTATE_UINT32(w3, XiveEQ),
> +        VMSTATE_UINT32(w4, XiveEQ),
> +        VMSTATE_UINT32(w5, XiveEQ),
> +        VMSTATE_UINT32(w6, XiveEQ),
> +        VMSTATE_UINT32(w7, XiveEQ),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
>  static const VMStateDescription vmstate_xive_nvt = {
>      .name = TYPE_XIVE_NVT,
>      .version_id = 1,
>      .minimum_version_id = 1,
>      .fields = (VMStateField[]) {
>          VMSTATE_BUFFER(regs, XiveNVT),
> +        VMSTATE_STRUCT_ARRAY(eqt, XiveNVT, (XIVE_PRIORITY_MAX + 1), 1,
> +                             vmstate_xive_nvt_eq, XiveEQ),
>          VMSTATE_END_OF_LIST()
>      },
>  };
> @@ -305,6 +369,13 @@ XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
>      return xfc->get_nvt(xf, server);
>  }
>  
> +XiveEQ *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx)
> +{
> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
> +
> +    return xfc->get_eq(xf, eq_idx);
> +}
> +
>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>  {
>  
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 25d78eec884d..7cb3561aa3d3 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -36,4 +36,11 @@ bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>  
> +/*
> + * sPAPR encoding of EQ indexes
> + */
> +#define SPAPR_XIVE_EQ_INDEX(server, prio)  (((server) << 3) | ((prio) & 0x7))
> +#define SPAPR_XIVE_EQ_SERVER(eq_idx) ((eq_idx) >> 3)
> +#define SPAPR_XIVE_EQ_PRIO(eq_idx)   ((eq_idx) & 0x7)
> +
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 1a2da610d91c..6cc02638c677 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -176,12 +176,18 @@ typedef struct XiveNVT {
>  
>      /* Shortcuts to rings */
>      uint8_t   *ring_os;
> +
> +    XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
>  } XiveNVT;
>  
>  extern const MemoryRegionOps xive_tm_user_ops;
>  extern const MemoryRegionOps xive_tm_os_ops;
>  
>  void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
> +XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority);
> +
> +void xive_eq_reset(XiveEQ *eq);
> +void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon);
>  
>  /*
>   * XIVE Fabric
> @@ -205,9 +211,11 @@ typedef struct XiveFabricClass {
>  
>      XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>      XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
> +    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>  } XiveFabricClass;
>  
>  XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
>  XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
> +XiveEQ  *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx);
>  
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index f2e2a1ac8f6e..bcc44e766db9 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -112,6 +112,54 @@ typedef struct XiveIVE {
>  #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>  } XiveIVE;
>  
> +/* EQ */
> +typedef struct XiveEQ {
> +        uint32_t        w0;
> +#define EQ_W0_VALID             PPC_BIT32(0) /* "v" bit */
> +#define EQ_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
> +#define EQ_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
> +#define EQ_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
> +#define EQ_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
> +#define EQ_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
> +#define EQ_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
> +#define EQ_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
> +#define EQ_W0_QSIZE             PPC_BITMASK32(12, 15)
> +#define EQ_W0_SW0               PPC_BIT32(16)
> +#define EQ_W0_FIRMWARE          EQ_W0_SW0 /* Owned by FW */
> +#define EQ_QSIZE_4K             0
> +#define EQ_QSIZE_64K            4
> +#define EQ_W0_HWDEP             PPC_BITMASK32(24, 31)
> +        uint32_t        w1;
> +#define EQ_W1_ESn               PPC_BITMASK32(0, 1)
> +#define EQ_W1_ESn_P             PPC_BIT32(0)
> +#define EQ_W1_ESn_Q             PPC_BIT32(1)
> +#define EQ_W1_ESe               PPC_BITMASK32(2, 3)
> +#define EQ_W1_ESe_P             PPC_BIT32(2)
> +#define EQ_W1_ESe_Q             PPC_BIT32(3)
> +#define EQ_W1_GENERATION        PPC_BIT32(9)
> +#define EQ_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
> +        uint32_t        w2;
> +#define EQ_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
> +#define EQ_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
> +        uint32_t        w3;
> +#define EQ_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
> +        uint32_t        w4;
> +#define EQ_W4_ESC_EQ_BLOCK      PPC_BITMASK32(4, 7)
> +#define EQ_W4_ESC_EQ_INDEX      PPC_BITMASK32(8, 31)
> +        uint32_t        w5;
> +#define EQ_W5_ESC_EQ_DATA       PPC_BITMASK32(1, 31)
> +        uint32_t        w6;
> +#define EQ_W6_FORMAT_BIT        PPC_BIT32(8)
> +#define EQ_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
> +#define EQ_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
> +        uint32_t        w7;
> +#define EQ_W7_F0_IGNORE         PPC_BIT32(0)
> +#define EQ_W7_F0_BLK_GROUPING   PPC_BIT32(1)
> +#define EQ_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
> +#define EQ_W7_F1_WAKEZ          PPC_BIT32(0)
> +#define EQ_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
> +} XiveEQ;
> +
>  #define XIVE_PRIORITY_MAX  7
>  
>  #endif /* _INTC_XIVE_INTERNAL_H */
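xive_eq_pic_print_info() derives the queue geometry from the descriptor words: a 60-bit queue address split across w2/w3, and 1 << (qsize + 10) four-byte entries, so EQ_QSIZE_4K (0) gives 1024 entries and EQ_QSIZE_64K (4) gives 16384. The sketch below decodes those fields and adds a hypothetical enqueue step, based on our assumption of how the generation bit is used: each entry carries the current generation in its top bit, and the generation flips when the index wraps. It uses a field-only copy of XiveEQ, simulated queue memory with no byte swapping (real entries would be big-endian in guest memory), and hypothetical getfield32()/setfield32() helpers.

```c
#include <assert.h>
#include <stdint.h>

#define PPC_BIT32(bit)         (0x80000000u >> (bit))
#define PPC_BITMASK32(bs, be)  ((PPC_BIT32(bs) - PPC_BIT32(be)) | PPC_BIT32(bs))

/* Field masks copied from the XiveEQ definition above. */
#define EQ_W0_VALID        PPC_BIT32(0)
#define EQ_W0_ENQUEUE      PPC_BIT32(1)
#define EQ_W0_QSIZE        PPC_BITMASK32(12, 15)
#define EQ_W1_GENERATION   PPC_BIT32(9)
#define EQ_W1_PAGE_OFF     PPC_BITMASK32(10, 31)
#define EQ_W2_OP_DESC_HI   PPC_BITMASK32(4, 31)

typedef struct { uint32_t w0, w1, w2, w3, w4, w5, w6, w7; } XiveEQ;

static uint32_t getfield32(uint32_t mask, uint32_t word)
{
    assert(mask != 0);
    while (!(mask & 1u)) {          /* align the field to bit 0 */
        mask >>= 1;
        word >>= 1;
    }
    return word & mask;
}

static uint32_t setfield32(uint32_t mask, uint32_t word, uint32_t value)
{
    uint32_t shift = 0;

    assert(mask != 0);
    while (!((mask >> shift) & 1u)) {
        shift++;
    }
    return (word & ~mask) | ((value << shift) & mask);
}

/* Queue address and size, computed as in xive_eq_pic_print_info(). */
static uint64_t eq_qaddr(const XiveEQ *eq)
{
    return ((uint64_t)(eq->w2 & EQ_W2_OP_DESC_HI) << 32) | eq->w3;
}

static uint32_t eq_qentries(const XiveEQ *eq)
{
    return 1u << (getfield32(EQ_W0_QSIZE, eq->w0) + 10);
}

/* Hypothetical enqueue into simulated queue memory 'qpage'. */
static void eq_enqueue(XiveEQ *eq, uint32_t *qpage, uint32_t data)
{
    uint32_t qindex = getfield32(EQ_W1_PAGE_OFF, eq->w1);
    uint32_t qgen = getfield32(EQ_W1_GENERATION, eq->w1);

    qpage[qindex] = (qgen << 31) | (data & 0x7fffffff);

    qindex = (qindex + 1) & (eq_qentries(eq) - 1);
    if (qindex == 0) {              /* wrapped: flip the generation */
        eq->w1 = setfield32(EQ_W1_GENERATION, eq->w1, qgen ^ 1);
    }
    eq->w1 = setfield32(EQ_W1_PAGE_OFF, eq->w1, qindex);
}
```

In this model, a consumer that remembers the generation it expects can tell new entries from stale ones by comparing each entry's top bit, without a producer index having to be visible to the OS.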

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-26  7:11   ` David Gibson
@ 2018-04-26  9:27     ` Cédric Le Goater
  2018-04-26 17:15       ` Cédric Le Goater
  2018-05-03  5:35       ` David Gibson
  2018-05-02  7:39     ` Cédric Le Goater
  1 sibling, 2 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-26  9:27 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/26/2018 09:11 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>> The XIVE presenter engine uses a set of registers to handle priority
>> management and interrupt acknowledgment among other things. The most
>> important ones being :
>>
>>   - Interrupt Priority Register (PIPR)
>>   - Interrupt Pending Buffer (IPB)
>>   - Current Processor Priority (CPPR)
>>   - Notification Source Register (NSR)
>>
>> There is one set of registers per level of privilege, four in all :
>> HW, HV pool, OS and User. These are called rings. All registers are
>> accessible through a specific MMIO region called the Thread Interrupt
>> Management Areas (TIMA) but, depending on the privilege level of the
>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>> OS privilege and therefore can only access the OS and the User
>> rings. The others are for hypervisor levels.
>>
>> The CPU interrupt state is modeled with a XiveNVT object which stores
>> the values of the different registers. The different TIMA views are
>> mapped at the same address for each CPU and 'current_cpu' is used to
>> retrieve the XiveNVT holding the ring registers.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>
>>  Changes since v2 :
>>
>>  - introduced the XiveFabric interface
>>
>>  hw/intc/spapr_xive.c        |  25 ++++
>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/ppc/spapr_xive.h |   5 +
>>  include/hw/ppc/xive.h       |  31 +++++
>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>  5 files changed, 424 insertions(+)
>>
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index 90cde8a4082d..f07832bf0a00 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -13,6 +13,7 @@
>>  #include "target/ppc/cpu.h"
>>  #include "sysemu/cpus.h"
>>  #include "monitor/monitor.h"
>> +#include "hw/ppc/spapr.h"
>>  #include "hw/ppc/spapr_xive.h"
>>  #include "hw/ppc/xive.h"
>>  #include "hw/ppc/xive_regs.h"
>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>  
>>      /* Allocate the Interrupt Virtualization Table */
>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>> +
>> +    /* The Thread Interrupt Management Area has the same address for
>> +     * each chip. On sPAPR, we only need to expose the User and OS
>> +     * level views of the TIMA.
>> +     */
>> +    xive->tm_base = XIVE_TM_BASE;
> 
> The constant should probably have PAPR in the name somewhere, since
> it's just for PAPR machines (same for the ESB mappings, actually).

ok. 

I have also made 'tm_base' a property, like 'vc_base' for the ESBs, in
case we want to change the value when the guest is instantiated.
I doubt we will, but this is an address in the global address space,
so I think it is better to let the machine control it.
 
> 
>> +
>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>> +                          1ull << TM_SHIFT);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>> +
>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>> +                          1ull << TM_SHIFT);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>  }
>>  
>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>  }
>>  
>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>> +{
>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>> +
>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>> +}
> 
> So this is a bit of a tangent, but I've been thinking of implementing
> a scheme where there's an opaque pointer in the cpu structure for the
> use of the machine.  I'm planning for that to replace the intc pointer
> (which isn't really used directly by the cpu). That would allow us to
> have spapr put a structure there and have both xics and xive pointers
> which could be useful later on.

ok. That should simplify the patchset at the end, in which we need to 
switch the 'intc' pointer. 
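The opaque, machine-owned pointer idea can be pictured as follows. Everything here is hypothetical (SketchCPU, SketchSpaprCpuState, machine_data, sketch_active_presenter()); it only shows how holding both XICS and XIVE presenter pointers would let the machine switch interrupt modes without rewriting a single intc pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-CPU interrupt state owned by the sPAPR machine. */
typedef struct SketchSpaprCpuState {
    void *icp;      /* XICS presenter for this CPU */
    void *nvt;      /* XIVE presenter (XiveNVT) for this CPU */
} SketchSpaprCpuState;

/* Hypothetical CPU with one opaque slot reserved for the machine. */
typedef struct SketchCPU {
    int   cpu_index;
    void *machine_data;     /* would replace the single 'intc' pointer */
} SketchCPU;

static SketchSpaprCpuState *sketch_cpu_state(SketchCPU *cpu)
{
    assert(cpu->machine_data != NULL);
    return cpu->machine_data;
}

/* Pick the presenter for the interrupt mode negotiated at CAS time. */
static void *sketch_active_presenter(SketchCPU *cpu, int xive_enabled)
{
    SketchSpaprCpuState *s = sketch_cpu_state(cpu);

    return xive_enabled ? s->nvt : s->icp;
}
```

Both presenters stay allocated for the lifetime of the CPU; only the lookup changes with the negotiated mode.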

> I think we'd need something similar to correctly handle migration of
> the VPA state, which is currently horribly broken.
> 
>> +
>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>      .name = TYPE_SPAPR_XIVE "/ive",
>>      .version_id = 1,
>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>      dc->vmsd = &vmstate_spapr_xive;
>>  
>>      xfc->get_ive = spapr_xive_get_ive;
>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>  }
>>  
>>  static const TypeInfo spapr_xive_info = {
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index dccad0318834..5691bb9474e4 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -14,7 +14,278 @@
>>  #include "sysemu/cpus.h"
>>  #include "sysemu/dma.h"
>>  #include "monitor/monitor.h"
>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>>  #include "hw/ppc/xive.h"
>> +#include "hw/ppc/xive_regs.h"
>> +
>> +/*
>> + * XIVE Interrupt Presenter
>> + */
>> +
>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>> +{
>> +    return 0;
>> +}
>> +
>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>> +{
>> +    if (cppr > XIVE_PRIORITY_MAX) {
>> +        cppr = 0xff;
>> +    }
>> +
>> +    nvt->ring_os[TM_CPPR] = cppr;
> 
> Surely this needs to recheck if we should be interrupting the cpu?

yes. In patch 9, when we introduce the nvt notify routine.
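A sketch of the re-check that a CPPR store implies, under these assumptions (ours, not this patch's): the IPB holds one bit per pending priority in MSB-0 order, the PIPR is the most favored (numerically lowest) pending priority, and an interrupt is presented when PIPR < CPPR. The helper names and the bare 16-byte ring array are hypothetical; only the clamping of out-of-range values to 0xff is taken from xive_nvt_set_cppr().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX  7
#define TM_CPPR            0x1
#define TM_IPB             0x2
#define TM_PIPR            0x7

/* IPB bit for a priority, MSB 0: priority 0 -> 0x80, priority 7 -> 0x01. */
static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > XIVE_PRIORITY_MAX ? 0 : 0x80 >> priority;
}

/* Most favored pending priority, or 0xff when nothing is pending. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio = 0;

    if (!ipb) {
        return 0xff;
    }
    while (!(ipb & 0x80)) {
        ipb <<= 1;
        prio++;
    }
    return prio;
}

static bool ring_should_interrupt(const uint8_t *ring)
{
    return ring[TM_PIPR] < ring[TM_CPPR];
}

/* Store a CPPR, refresh the PIPR, and report whether to raise the line. */
static bool ring_set_cppr(uint8_t *ring, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;                /* same clamping as xive_nvt_set_cppr() */
    }
    ring[TM_CPPR] = cppr;
    ring[TM_PIPR] = ipb_to_pipr(ring[TM_IPB]);
    return ring_should_interrupt(ring);
}
```

Under these assumptions a CPPR of 0xff accepts every priority and a CPPR of 0 masks them all, so any CPPR store can expose or hide a pending interrupt, which is why the notify routine has to run after it.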

>> +}
>> +
>> +/*
>> + * OS Thread Interrupt Management Area MMIO
>> + */
>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>> +                                           unsigned size)
>> +{
>> +    uint64_t ret = -1;
>> +
>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>> +        ret = xive_nvt_accept(nvt);
>> +    } else {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>> +                      HWADDR_PRIx" size %d\n", offset, size);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +#define TM_RING(offset) ((offset) & 0xf0)
>> +
>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>> +                                      unsigned size)
>> +{
>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> 
> So, as I said on a previous version of this, we can actually correctly
> represent different mappings in different cpu spaces, by exploiting
> cpu->as and not just having them all point to &address_space_memory.

Yes, you did, and I haven't studied the question yet. I will look at it for the next version.

>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>> +    uint64_t ret = -1;
>> +    int i;
>> +
>> +    if (offset >= TM_SPC_ACK_EBB) {
>> +        return xive_tm_read_special(nvt, offset, size);
>> +    }
>> +
>> +    if (TM_RING(offset) != TM_QW1_OS) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>> +                      HWADDR_PRIx"\n", offset);
>> +        return ret;
> 
> Just returning -1 would be clearer here.

ok.

> 
>> +    }
>> +
>> +    ret = 0;
>> +    for (i = 0; i < size; i++) {
>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +static bool xive_tm_is_readonly(uint8_t offset)
>> +{
>> +    return offset != TM_QW1_OS + TM_CPPR;
>> +}
>> +
>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
>> +                                        uint64_t value, unsigned size)
>> +{
>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
>> +
>> +    /* TODO: support TM_SPC_ACK_OS_EL */
>> +}
>> +
>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>> +                                   uint64_t value, unsigned size)
>> +{
>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>> +    int i;
>> +
>> +    if (offset >= TM_SPC_ACK_EBB) {
>> +        xive_tm_write_special(nvt, offset, value, size);
>> +        return;
>> +    }
>> +
>> +    if (TM_RING(offset) != TM_QW1_OS) {
> 
> Why have this if you have separate OS and user regions as you appear
> to do below?

This is another problem we are trying to solve. 

The registers a CPU can access depend on the TIMA view it is using. 
The OS TIMA view only sees the OS ring registers. The HV view sees all. 

> Or to look at it another way, shouldn't it be possible to make the
> read/write accessors the same for the OS and user rings?

For some parts, yes, but the special load/store addresses are different
for each view, and so are the read-only registers. It seemed easier to duplicate.

I think the problem will become clearer (or worse) with pnv which uses 
the HV mode.
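A sketch of what an OS-view accessor pair boils down to, with hypothetical names operating on a flat register file. Unlike a check on TM_RING(offset) alone, the range test here also rejects offsets such as 0x110 (whose low byte still looks like QW1) and accesses that run past the end of the OS ring. The size-1 CPPR path, which the patch routes through xive_nvt_set_cppr(), and the special offsets are left out.

```c
#include <assert.h>
#include <stdint.h>

#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10
#define TM_QW1_OS       0x010
#define TM_CPPR         0x1

/* Accept only accesses that stay entirely inside the OS ring (QW1). */
static int tm_os_access_ok(uint64_t offset, unsigned size)
{
    return offset >= TM_QW1_OS && offset + size <= TM_QW1_OS + TM_RING_SIZE;
}

static uint64_t tm_os_read(const uint8_t *regs, uint64_t offset, unsigned size)
{
    uint64_t ret = 0;
    unsigned i;

    if (!tm_os_access_ok(offset, size)) {
        return UINT64_MAX;              /* the patch also returns -1 */
    }
    for (i = 0; i < size; i++) {        /* assemble bytes big-endian */
        ret = (ret << 8) | regs[offset + i];
    }
    return ret;
}

/* In the OS view, only the OS CPPR byte is writable. */
static int tm_is_readonly(uint64_t offset)
{
    return offset != TM_QW1_OS + TM_CPPR;
}

static void tm_os_write(uint8_t *regs, uint64_t offset, uint64_t value,
                        unsigned size)
{
    unsigned i;

    if (!tm_os_access_ok(offset, size)) {
        return;
    }
    for (i = 0; i < size; i++) {        /* split big-endian, skip RO bytes */
        if (!tm_is_readonly(offset + i)) {
            regs[offset + i] = (value >> (8 * (size - i - 1))) & 0xff;
        }
    }
}
```

A user-view or HV-view pair would differ in the accepted range, the writable bytes and the special offsets, which is where the per-view duplication comes from.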

>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>> +                      HWADDR_PRIx"\n", offset);
>> +        return;
>> +    }
>> +
>> +    switch (size) {
>> +    case 1:
>> +        if (offset == TM_QW1_OS + TM_CPPR) {
>> +            xive_nvt_set_cppr(nvt, value & 0xff);
>> +        }
>> +        break;
>> +    case 4:
>> +    case 8:
>> +        for (i = 0; i < size; i++) {
>> +            if (!xive_tm_is_readonly(offset + i)) {
>> +                nvt->regs[offset + i] = (value >> (8 * (size - i - 1))) & 0xff;
>> +            }
>> +        }
>> +        break;
>> +    default:
>> +        g_assert_not_reached();
>> +    }
>> +}
>> +
>> +const MemoryRegionOps xive_tm_os_ops = {
>> +    .read = xive_tm_os_read,
>> +    .write = xive_tm_os_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +/*
>> + * User Thread Interrupt Management Area MMIO
>> + */
>> +
>> +static uint64_t xive_tm_user_read(void *opaque, hwaddr offset,
>> +                                        unsigned size)
>> +{
>> +    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
>> +                  HWADDR_PRIx"\n", offset);
>> +    return -1;
>> +}
>> +
>> +static void xive_tm_user_write(void *opaque, hwaddr offset,
>> +                                     uint64_t value, unsigned size)
>> +{
>> +    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
>> +                  HWADDR_PRIx"\n", offset);
>> +}
>> +
>> +
>> +const MemoryRegionOps xive_tm_user_ops = {
>> +    .read = xive_tm_user_read,
>> +    .write = xive_tm_user_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static char *xive_nvt_ring_print(uint8_t *ring)
>> +{
>> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
>> +
>> +    return g_strdup_printf("%02x  %02x   %02x  %02x    %02x   "
>> +                   "%02x  %02x  %02x   %08x",
>> +                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
>> +                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
>> +                   w2);
>> +}
>> +
>> +void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
>> +{
>> +    int cpu_index = nvt->cs ? nvt->cs->cpu_index : -1;
>> +    char *s;
>> +
>> +    monitor_printf(mon, "CPU[%04x]: QW    NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
>> +                   " W2\n", cpu_index);
>> +
>> +    s = xive_nvt_ring_print(&nvt->regs[TM_QW1_OS]);
>> +    monitor_printf(mon, "CPU[%04x]: OS    %s\n", cpu_index, s);
>> +    g_free(s);
>> +    s = xive_nvt_ring_print(&nvt->regs[TM_QW0_USER]);
>> +    monitor_printf(mon, "CPU[%04x]: USER  %s\n", cpu_index, s);
>> +    g_free(s);
>> +}
>> +
>> +static void xive_nvt_reset(void *dev)
>> +{
>> +    XiveNVT *nvt = XIVE_NVT(dev);
>> +
>> +    memset(nvt->regs, 0, sizeof(nvt->regs));
>> +}
>> +
>> +static void xive_nvt_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveNVT *nvt = XIVE_NVT(dev);
>> +    PowerPCCPU *cpu;
>> +    CPUPPCState *env;
>> +    Object *obj;
>> +    Error *err = NULL;
>> +
>> +    obj = object_property_get_link(OBJECT(dev), ICP_PROP_CPU, &err);
> 
> Please get rid of the remaining "ICP" naming in the xive code.

ok.  I will kill the define.

>> +    if (!obj) {
>> +        error_propagate(errp, err);
>> +        error_prepend(errp, "required link '" ICP_PROP_CPU "' not found: ");
>> +        return;
>> +    }
>> +
>> +    cpu = POWERPC_CPU(obj);
>> +    nvt->cs = CPU(obj);
>> +
>> +    env = &cpu->env;
>> +    switch (PPC_INPUT(env)) {
>> +    case PPC_FLAGS_INPUT_POWER7:
>> +        nvt->output = env->irq_inputs[POWER7_INPUT_INT];
>> +        break;
>> +
>> +    default:
>> +        error_setg(errp, "XIVE interrupt controller does not support "
>> +                   "this CPU bus model");
>> +        return;
>> +    }
>> +
>> +    qemu_register_reset(xive_nvt_reset, dev);
> 
> If this is a sysbus device, which I think it is, 

It is not. The TIMA MMIO region is in the sPAPRXive model but that might 
change if we use cpu->as. I agree it would look better to have a memory
region per cpu.

> you shouldn't need to
> explicitly register a reset handler.  Instead you can set a device
> reset handler which will be called with the reset.
> 
>> +}
>> +
>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
>> +{
>> +    qemu_unregister_reset(xive_nvt_reset, dev);
>> +}
>> +
>> +static void xive_nvt_init(Object *obj)
>> +{
>> +    XiveNVT *nvt = XIVE_NVT(obj);
>> +
>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
> 
> The ring_os field is basically pointless, being just an offset into a
> structure you already have.  A macro or inline would be a better idea.

ok. I liked the idea but I agree it's overkill to have an init routine
just for this. I will find something.
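One way to drop the ring_os field along the lines suggested: derive the ring from the register array on demand. A minimal sketch with a field-only stand-in for XiveNVT; xive_nvt_ring() and XIVE_NVT_RING_OS() are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10
#define TM_QW0_USER     0x000
#define TM_QW1_OS       0x010

/* Field-only stand-in for XiveNVT. */
typedef struct {
    uint8_t regs[TM_RING_COUNT * TM_RING_SIZE];
} SketchNVT;

/* Return the start of ring 'qw' (a TM_QW* offset) inside the registers. */
static inline uint8_t *xive_nvt_ring(SketchNVT *nvt, unsigned qw)
{
    assert(qw % TM_RING_SIZE == 0 && qw < sizeof(nvt->regs));
    return &nvt->regs[qw];
}

#define XIVE_NVT_RING_OS(nvt)  xive_nvt_ring((nvt), TM_QW1_OS)
```

Since the pointer is computed rather than stored, there is nothing to set up in instance_init and nothing that can go stale.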

>> +}
>> +
>> +static const VMStateDescription vmstate_xive_nvt = {
>> +    .name = TYPE_XIVE_NVT,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_BUFFER(regs, XiveNVT),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static void xive_nvt_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->realize = xive_nvt_realize;
>> +    dc->unrealize = xive_nvt_unrealize;
>> +    dc->desc = "XIVE Interrupt Presenter";
>> +    dc->vmsd = &vmstate_xive_nvt;
>> +}
>> +
>> +static const TypeInfo xive_nvt_info = {
>> +    .name          = TYPE_XIVE_NVT,
>> +    .parent        = TYPE_DEVICE,
>> +    .instance_size = sizeof(XiveNVT),
>> +    .instance_init = xive_nvt_init,
>> +    .class_init    = xive_nvt_class_init,
>> +};
>>  
>>  /*
>>   * XIVE Fabric
>> @@ -27,6 +298,13 @@ XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
>>      return xfc->get_ive(xf, lisn);
>>  }
>>  
>> +XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
>> +{
>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
>> +
>> +    return xfc->get_nvt(xf, server);
>> +}
>> +
>>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>>  {
>>  
>> @@ -418,6 +696,7 @@ static void xive_register_types(void)
>>  {
>>      type_register_static(&xive_source_info);
>>      type_register_static(&xive_fabric_info);
>> +    type_register_static(&xive_nvt_info);
>>  }
>>  
>>  type_init(xive_register_types)
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 4538c622b60a..25d78eec884d 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -25,6 +25,11 @@ typedef struct sPAPRXive {
>>      /* Routing table */
>>      XiveIVE      *ivt;
>>      uint32_t     nr_irqs;
>> +
>> +    /* TIMA memory regions */
>> +    hwaddr       tm_base;
>> +    MemoryRegion tm_mmio_user;
>> +    MemoryRegion tm_mmio_os;
>>  } sPAPRXive;
>>  
>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 57295715a4a5..1a2da610d91c 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -20,6 +20,7 @@ typedef struct XiveFabric XiveFabric;
>>   */
>>  
>>  #define XIVE_VC_BASE   0x0006010000000000ull
>> +#define XIVE_TM_BASE   0x0006030203180000ull
>>  
>>  /*
>>   * XIVE Interrupt Source
>> @@ -155,6 +156,34 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>  }
>>  
>>  /*
>> + * XIVE Interrupt Presenter
>> + */
>> +
>> +#define TYPE_XIVE_NVT "xive-nvt"
>> +#define XIVE_NVT(obj) OBJECT_CHECK(XiveNVT, (obj), TYPE_XIVE_NVT)
>> +
>> +#define TM_RING_COUNT           4
>> +#define TM_RING_SIZE            0x10
>> +
>> +typedef struct XiveNVT {
>> +    DeviceState parent_obj;
>> +
>> +    CPUState  *cs;
>> +    qemu_irq  output;
>> +
>> +    /* Thread interrupt Management (TM) registers */
>> +    uint8_t   regs[TM_RING_COUNT * TM_RING_SIZE];
>> +
>> +    /* Shortcuts to rings */
>> +    uint8_t   *ring_os;
>> +} XiveNVT;
>> +
>> +extern const MemoryRegionOps xive_tm_user_ops;
>> +extern const MemoryRegionOps xive_tm_os_ops;
>> +
>> +void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
>> +
>> +/*
>>   * XIVE Fabric
>>   */
>>  
>> @@ -175,8 +204,10 @@ typedef struct XiveFabricClass {
>>      void (*notify)(XiveFabric *xf, uint32_t lisn);
>>  
>>      XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>> +    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>  } XiveFabricClass;
>>  
>>  XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
>> +XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
>>  
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> index 5903f29eb789..f2e2a1ac8f6e 100644
>> --- a/include/hw/ppc/xive_regs.h
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -10,6 +10,88 @@
>>  #ifndef _PPC_XIVE_REGS_H
>>  #define _PPC_XIVE_REGS_H
>>  
>> +#define TM_SHIFT                16
>> +
>> +/* TM register offsets */
>> +#define TM_QW0_USER             0x000 /* All rings */
>> +#define TM_QW1_OS               0x010 /* Ring 0..2 */
>> +#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
>> +#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
>> +
>> +/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
>> +#define TM_NSR                  0x0  /*  +   +   -   +  */
>> +#define TM_CPPR                 0x1  /*  -   +   -   +  */
>> +#define TM_IPB                  0x2  /*  -   +   +   +  */
>> +#define TM_LSMFB                0x3  /*  -   +   +   +  */
>> +#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
>> +#define TM_INC                  0x5  /*  -   +   -   +  */
>> +#define TM_AGE                  0x6  /*  -   +   -   +  */
>> +#define TM_PIPR                 0x7  /*  -   +   -   +  */
>> +
>> +#define TM_WORD0                0x0
>> +#define TM_WORD1                0x4
>> +
>> +/*
>> + * QW word 2 contains the valid bit at the top and other fields
>> + * depending on the QW.
>> + */
>> +#define TM_WORD2                0x8
>> +#define   TM_QW0W2_VU           PPC_BIT32(0)
>> +#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
>> +#define   TM_QW1W2_VO           PPC_BIT32(0)
>> +#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
>> +#define   TM_QW2W2_VP           PPC_BIT32(0)
>> +#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
>> +#define   TM_QW3W2_VT           PPC_BIT32(0)
>> +#define   TM_QW3W2_LP           PPC_BIT32(6)
>> +#define   TM_QW3W2_LE           PPC_BIT32(7)
>> +#define   TM_QW3W2_T            PPC_BIT32(31)
>> +
>> +/*
>> + * In addition to normal loads to "peek" and writes (only when invalid)
>> + * using 4 and 8 bytes accesses, the above registers support these
>> + * "special" byte operations:
>> + *
>> + *   - Byte load from QW0[NSR] - User level NSR (EBB)
>> + *   - Byte store to QW0[NSR] - User level NSR (EBB)
>> + *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
>> + *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
>> + *                                    otherwise VT||0000000
>> + *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
>> + *
>> + * Then we have all these "special" CI ops at these offset that trigger
>> + * all sorts of side effects:
>> + */
>> +#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg*/
>> +#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
>> +#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
>> +#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
>> +                                         * context */
>> +#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
>> +#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
>> +                                         * context to reg */
>> +#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
>> +                                         * context to reg*/
>> +#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
>> +#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
>> +                                         * line */
>> +#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
>> +#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
>> +                                         * line */
>> +#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
>> +/* XXX more... */
>> +
>> +/* NSR fields for the various QW ack types */
>> +#define TM_QW0_NSR_EB           PPC_BIT8(0)
>> +#define TM_QW1_NSR_EO           PPC_BIT8(0)
>> +#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
>> +#define  TM_QW3_NSR_HE_NONE     0
>> +#define  TM_QW3_NSR_HE_POOL     1
>> +#define  TM_QW3_NSR_HE_PHYS     2
>> +#define  TM_QW3_NSR_HE_LSI      3
>> +#define TM_QW3_NSR_I            PPC_BIT8(2)
>> +#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
>> +
>>  /* IVE/EAS
>>   *
>>   * One per interrupt source. Targets that interrupt to a given EQ
>> @@ -30,4 +112,6 @@ typedef struct XiveIVE {
>>  #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>>  } XiveIVE;
>>  
>> +#define XIVE_PRIORITY_MAX  7
>> +
>>  #endif /* _INTC_XIVE_INTERNAL_H */
> 
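The TM layout above comes down to four 16-byte rings (QW0 user, QW1 OS, QW2 pool, QW3 physical) stored in the XiveNVT regs[] array. A register is addressed as ring offset plus byte offset. Below is a minimal standalone sketch. The hypothetical nvt_regs struct stands in for XiveNVT, and the PPC_BIT8/PPC_BITMASK8 helpers are re-declared on the assumption that they match QEMU's MSB-0 definitions. The sketch also shows that a multi-bit field such as TM_QW3_NSR_GRP_LVL must be written with PPC_BITMASK8(3, 7), because PPC_BIT8 takes a single bit number.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* IBM (MSB-0) bit numbering: bit 0 is the most significant bit of a byte.
 * Assumed to match QEMU's definitions. */
#define PPC_BIT8(bit)         (0x80 >> (bit))
#define PPC_BITMASK8(bs, be)  ((PPC_BIT8(bs) - PPC_BIT8(be)) | PPC_BIT8(bs))

#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10

/* Ring (quad-word) offsets and byte offsets, as in xive_regs.h */
#define TM_QW0_USER     0x000
#define TM_QW1_OS       0x010
#define TM_QW2_HV_POOL  0x020
#define TM_QW3_HV_PHYS  0x030
#define TM_NSR          0x0
#define TM_CPPR         0x1
#define TM_PIPR         0x7

#define TM_QW1_NSR_EO       PPC_BIT8(0)
#define TM_QW3_NSR_HE       PPC_BITMASK8(0, 1)   /* 2-bit field */
#define TM_QW3_NSR_GRP_LVL  PPC_BITMASK8(3, 7)   /* 5-bit field */

/* Hypothetical, simplified stand-in for the XiveNVT register state. */
typedef struct {
    uint8_t regs[TM_RING_COUNT * TM_RING_SIZE];
    uint8_t *ring_os;   /* shortcut to QW1, like xive_nvt_init() sets up */
} nvt_regs;

static void nvt_regs_init(nvt_regs *nvt)
{
    memset(nvt->regs, 0, sizeof(nvt->regs));
    nvt->ring_os = &nvt->regs[TM_QW1_OS];
}

/* Byte offset of a TM register inside the per-thread register array. */
static unsigned tm_offset(unsigned ring, unsigned reg)
{
    assert(ring % TM_RING_SIZE == 0);
    assert(ring < TM_RING_COUNT * TM_RING_SIZE && reg < TM_RING_SIZE);
    return ring + reg;
}
```

For example, the OS CPPR sits at byte 0x11 of regs[], which is the same location as ring_os[TM_CPPR].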

^ permalink raw reply	[flat|nested] 100+ messages in thread

* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-04-26  7:25   ` David Gibson
@ 2018-04-26  9:48     ` Cédric Le Goater
  2018-05-03  5:45       ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-26  9:48 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/26/2018 09:25 AM, David Gibson wrote:
> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
>> The Event Queue Descriptor (EQD) table is an internal table of the
>> XIVE routing sub-engine. It specifies on which Event Queue the event
>> data should be posted when an exception occurs (later on pulled by the
>> OS) and which Virtual Processor to notify.
> 
> Uhhh.. I thought the IVT said which queue and vp to notify, and the
> EQD gave metadata for event queues.

Yes. The above is poorly written. The Event Queue Descriptor contains the
guest address of the event queue in which the data is written. I will
rephrase.

The IVT contains the IVEs, which indeed define, for each IRQ, which EQ to
notify and what data to push on the queue.
 
>> The Event Queue is a much
>> more complex structure but we start with a simple model for the sPAPR
>> machine.
>>
>> There is one XiveEQ per priority and these are stored under the XIVE
>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with:
>>
>>        (server << 3) | (priority & 0x7)
>>
>> This is not in the XIVE architecture, but as the EQ index is never
>> exposed to the guest, neither in the hcalls nor in the device tree, we
>> are free to use whatever best fits the current model.

This EQ indexing is worth noting because it will also show up in KVM,
where it is used to build the IVE from the KVM irq state.
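For reference, here is a minimal standalone sketch of this encoding. It mirrors the SPAPR_XIVE_EQ_INDEX/SERVER/PRIO macros that the patch adds to spapr_xive.h, although the helper names below are made up for illustration. The low three bits carry the priority, which XIVE_PRIORITY_MAX caps at 7. The remaining bits carry the server number, so the mapping round-trips for any in-range priority.

```c
#include <assert.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7

/* Same layout as SPAPR_XIVE_EQ_INDEX(): server in the upper bits,
 * priority in the low 3 bits. */
static inline uint32_t eq_index(uint32_t server, uint8_t prio)
{
    assert(prio <= XIVE_PRIORITY_MAX);
    return (server << 3) | (prio & 0x7);
}

/* Inverse mappings, as SPAPR_XIVE_EQ_SERVER() / SPAPR_XIVE_EQ_PRIO() */
static inline uint32_t eq_server(uint32_t eq_idx)
{
    return eq_idx >> 3;
}

static inline uint8_t eq_prio(uint32_t eq_idx)
{
    return eq_idx & 0x7;
}
```

For example, server 2 at priority 5 gives EQ index 0x15.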

  
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> Is the EQD actually modifiable by a guest?  Or are the settings of the
> EQs fixed by PAPR?

The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
of the event queue for a given (server, priority) pair.

>> ---
>>
>>  Changes since v2 :
>>
>>  - introduced the XiveFabric interface
>>
>>  hw/intc/spapr_xive.c        | 31 +++++++++++++++++---
>>  hw/intc/xive.c              | 71 +++++++++++++++++++++++++++++++++++++++++++++
>>  include/hw/ppc/spapr_xive.h |  7 +++++
>>  include/hw/ppc/xive.h       |  8 +++++
>>  include/hw/ppc/xive_regs.h  | 48 ++++++++++++++++++++++++++++++
>>  5 files changed, 161 insertions(+), 4 deletions(-)
>>
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index f07832bf0a00..d0d5a7d7f969 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -27,15 +27,30 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>      monitor_printf(mon, "IVE Table\n");
>>      for (i = 0; i < xive->nr_irqs; i++) {
>>          XiveIVE *ive = &xive->ivt[i];
>> +        uint32_t eq_idx;
>>  
>>          if (!(ive->w & IVE_VALID)) {
>>              continue;
>>          }
>>  
>> -        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
>> -                       ive->w & IVE_MASKED ? "M" : " ",
>> -                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>> -                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
>> +        eq_idx = GETFIELD(IVE_EQ_INDEX, ive->w);
>> +
>> +        monitor_printf(mon, "  %6x %s eqidx:%03d ", i,
>> +                       ive->w & IVE_MASKED ? "M" : " ", eq_idx);
>> +
>> +        if (!(ive->w & IVE_MASKED)) {
>> +            XiveEQ *eq;
>> +
>> +            eq = xive_fabric_get_eq(XIVE_FABRIC(xive), eq_idx);
>> +            if (eq && (eq->w0 & EQ_W0_VALID)) {
>> +                xive_eq_pic_print_info(eq, mon);
>> +                monitor_printf(mon, " data:%08x",
>> +                               (int) GETFIELD(IVE_EQ_DATA, ive->w));
>> +            } else {
>> +                monitor_printf(mon, "no eq ?!");
>> +            }
>> +        }
>> +        monitor_printf(mon, "\n");
>>      }
>>  }
>>  
>> @@ -128,6 +143,13 @@ static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>      return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>  }
>>  
>> +static XiveEQ *spapr_xive_get_eq(XiveFabric *xf, uint32_t eq_idx)
>> +{
>> +    XiveNVT *nvt = xive_fabric_get_nvt(xf, SPAPR_XIVE_EQ_SERVER(eq_idx));
>> +
>> +    return xive_nvt_eq_get(nvt, SPAPR_XIVE_EQ_PRIO(eq_idx));
>> +}
>> +
>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>      .name = TYPE_SPAPR_XIVE "/ive",
>>      .version_id = 1,
>> @@ -168,6 +190,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>  
>>      xfc->get_ive = spapr_xive_get_ive;
>>      xfc->get_nvt = spapr_xive_get_nvt;
>> +    xfc->get_eq = spapr_xive_get_eq;
>>  }
>>  
>>  static const TypeInfo spapr_xive_info = {
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 5691bb9474e4..2ab37fde80e8 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -19,6 +19,47 @@
>>  #include "hw/ppc/xive_regs.h"
>>  
>>  /*
>> + * XiveEQ helpers
>> + */
>> +
>> +XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority)
>> +{
>> +    if (!nvt || priority > XIVE_PRIORITY_MAX) {
>> +        return NULL;
>> +    }
>> +    return &nvt->eqt[priority];
>> +}
>> +
>> +void xive_eq_reset(XiveEQ *eq)
>> +{
>> +    memset(eq, 0, sizeof(*eq));
>> +
>> +    /* switch off the escalation and notification ESBs */
>> +    eq->w1 = EQ_W1_ESe_Q | EQ_W1_ESn_Q;
>> +}
>> +
>> +void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon)
>> +{
>> +    uint64_t qaddr_base = (((uint64_t)(eq->w2 & 0x0fffffff)) << 32) | eq->w3;
>> +    uint32_t qindex = GETFIELD(EQ_W1_PAGE_OFF, eq->w1);
>> +    uint32_t qgen = GETFIELD(EQ_W1_GENERATION, eq->w1);
>> +    uint32_t qsize = GETFIELD(EQ_W0_QSIZE, eq->w0);
>> +    uint32_t qentries = 1 << (qsize + 10);
>> +
>> +    uint32_t server = GETFIELD(EQ_W6_NVT_INDEX, eq->w6);
>> +    uint8_t priority = GETFIELD(EQ_W7_F0_PRIORITY, eq->w7);
>> +
>> +    monitor_printf(mon, "%c%c%c%c%c prio:%d server:%03d eq:@%08"PRIx64
>> +                   "% 6d/%5d ^%d",
>> +                   eq->w0 & EQ_W0_VALID ? 'v' : '-',
>> +                   eq->w0 & EQ_W0_ENQUEUE ? 'q' : '-',
>> +                   eq->w0 & EQ_W0_UCOND_NOTIFY ? 'n' : '-',
>> +                   eq->w0 & EQ_W0_BACKLOG ? 'b' : '-',
>> +                   eq->w0 & EQ_W0_ESCALATE_CTL ? 'e' : '-',
>> +                   priority, server, qaddr_base, qindex, qentries, qgen);
>> +}
>> +
>> +/*
>>   * XIVE Interrupt Presenter
>>   */
>>  
>> @@ -210,8 +251,12 @@ void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
>>  static void xive_nvt_reset(void *dev)
>>  {
>>      XiveNVT *nvt = XIVE_NVT(dev);
>> +    int i;
>>  
>>      memset(nvt->regs, 0, sizeof(nvt->regs));
>> +    for (i = 0; i < ARRAY_SIZE(nvt->eqt); i++) {
>> +        xive_eq_reset(&nvt->eqt[i]);
>> +    }
> 
> Hrm.  Having the EQs "owned" by the NVT makes things simple for PAPR.
> But won't that break down for the powernv case?

PowerNV stores the EQs in the machine RAM, where they are maintained
by skiboot using IC registers. To get/set an EQ from the QEMU PowerNV
model, we need to read/write the RAM, and the EQs under the XiveNVT
become useless.

The model does not make much use of the skiboot VP table, though: it
only reads the valid bit and otherwise uses XiveNVT objects. In the
future, we might use the VP table more, to be more precise, but we will
nevertheless need a XiveNVT object to store the interrupt management
registers.

> 
>>  }
>>  
>>  static void xive_nvt_realize(DeviceState *dev, Error **errp)
>> @@ -259,12 +304,31 @@ static void xive_nvt_init(Object *obj)
>>      nvt->ring_os = &nvt->regs[TM_QW1_OS];
>>  }
>>  
>> +static const VMStateDescription vmstate_xive_nvt_eq = {
>> +    .name = TYPE_XIVE_NVT "/eq",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField []) {
>> +        VMSTATE_UINT32(w0, XiveEQ),
>> +        VMSTATE_UINT32(w1, XiveEQ),
>> +        VMSTATE_UINT32(w2, XiveEQ),
>> +        VMSTATE_UINT32(w3, XiveEQ),
>> +        VMSTATE_UINT32(w4, XiveEQ),
>> +        VMSTATE_UINT32(w5, XiveEQ),
>> +        VMSTATE_UINT32(w6, XiveEQ),
>> +        VMSTATE_UINT32(w7, XiveEQ),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>>  static const VMStateDescription vmstate_xive_nvt = {
>>      .name = TYPE_XIVE_NVT,
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_BUFFER(regs, XiveNVT),
>> +        VMSTATE_STRUCT_ARRAY(eqt, XiveNVT, (XIVE_PRIORITY_MAX + 1), 1,
>> +                             vmstate_xive_nvt_eq, XiveEQ),
>>          VMSTATE_END_OF_LIST()
>>      },
>>  };
>> @@ -305,6 +369,13 @@ XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
>>      return xfc->get_nvt(xf, server);
>>  }
>>  
>> +XiveEQ *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx)
>> +{
>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
>> +
>> +    return xfc->get_eq(xf, eq_idx);
>> +}
>> +
>>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>>  {
>>  
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 25d78eec884d..7cb3561aa3d3 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -36,4 +36,11 @@ bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>>  
>> +/*
>> + * sPAPR encoding of EQ indexes
>> + */
>> +#define SPAPR_XIVE_EQ_INDEX(server, prio)  (((server) << 3) | ((prio) & 0x7))
>> +#define SPAPR_XIVE_EQ_SERVER(eq_idx) ((eq_idx) >> 3)
>> +#define SPAPR_XIVE_EQ_PRIO(eq_idx)   ((eq_idx) & 0x7)
>> +
>>  #endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 1a2da610d91c..6cc02638c677 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -176,12 +176,18 @@ typedef struct XiveNVT {
>>  
>>      /* Shortcuts to rings */
>>      uint8_t   *ring_os;
>> +
>> +    XiveEQ    eqt[XIVE_PRIORITY_MAX + 1];
>>  } XiveNVT;
>>  
>>  extern const MemoryRegionOps xive_tm_user_ops;
>>  extern const MemoryRegionOps xive_tm_os_ops;
>>  
>>  void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
>> +XiveEQ *xive_nvt_eq_get(XiveNVT *nvt, uint8_t priority);
>> +
>> +void xive_eq_reset(XiveEQ *eq);
>> +void xive_eq_pic_print_info(XiveEQ *eq, Monitor *mon);
>>  
>>  /*
>>   * XIVE Fabric
>> @@ -205,9 +211,11 @@ typedef struct XiveFabricClass {
>>  
>>      XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>      XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>> +    XiveEQ  *(*get_eq)(XiveFabric *xf, uint32_t eq_idx);
>>  } XiveFabricClass;
>>  
>>  XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
>>  XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
>> +XiveEQ  *xive_fabric_get_eq(XiveFabric *xf, uint32_t eq_idx);
>>  
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> index f2e2a1ac8f6e..bcc44e766db9 100644
>> --- a/include/hw/ppc/xive_regs.h
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -112,6 +112,54 @@ typedef struct XiveIVE {
>>  #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>>  } XiveIVE;
>>  
>> +/* EQ */
>> +typedef struct XiveEQ {
>> +        uint32_t        w0;
>> +#define EQ_W0_VALID             PPC_BIT32(0) /* "v" bit */
>> +#define EQ_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
>> +#define EQ_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
>> +#define EQ_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
>> +#define EQ_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
>> +#define EQ_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
>> +#define EQ_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
>> +#define EQ_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
>> +#define EQ_W0_QSIZE             PPC_BITMASK32(12, 15)
>> +#define EQ_W0_SW0               PPC_BIT32(16)
>> +#define EQ_W0_FIRMWARE          EQ_W0_SW0 /* Owned by FW */
>> +#define EQ_QSIZE_4K             0
>> +#define EQ_QSIZE_64K            4
>> +#define EQ_W0_HWDEP             PPC_BITMASK32(24, 31)
>> +        uint32_t        w1;
>> +#define EQ_W1_ESn               PPC_BITMASK32(0, 1)
>> +#define EQ_W1_ESn_P             PPC_BIT32(0)
>> +#define EQ_W1_ESn_Q             PPC_BIT32(1)
>> +#define EQ_W1_ESe               PPC_BITMASK32(2, 3)
>> +#define EQ_W1_ESe_P             PPC_BIT32(2)
>> +#define EQ_W1_ESe_Q             PPC_BIT32(3)
>> +#define EQ_W1_GENERATION        PPC_BIT32(9)
>> +#define EQ_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
>> +        uint32_t        w2;
>> +#define EQ_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
>> +#define EQ_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
>> +        uint32_t        w3;
>> +#define EQ_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
>> +        uint32_t        w4;
>> +#define EQ_W4_ESC_EQ_BLOCK      PPC_BITMASK32(4, 7)
>> +#define EQ_W4_ESC_EQ_INDEX      PPC_BITMASK32(8, 31)
>> +        uint32_t        w5;
>> +#define EQ_W5_ESC_EQ_DATA       PPC_BITMASK32(1, 31)
>> +        uint32_t        w6;
>> +#define EQ_W6_FORMAT_BIT        PPC_BIT32(8)
>> +#define EQ_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
>> +#define EQ_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
>> +        uint32_t        w7;
>> +#define EQ_W7_F0_IGNORE         PPC_BIT32(0)
>> +#define EQ_W7_F0_BLK_GROUPING   PPC_BIT32(1)
>> +#define EQ_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
>> +#define EQ_W7_F1_WAKEZ          PPC_BIT32(0)
>> +#define EQ_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
>> +} XiveEQ;
>> +
>>  #define XIVE_PRIORITY_MAX  7
>>  
>>  #endif /* _INTC_XIVE_INTERNAL_H */
> 
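The fields printed by xive_eq_pic_print_info() can be decoded with the same MSB-0 conventions. Below is a minimal standalone sketch. It assumes QEMU-style PPC_BIT32/PPC_BITMASK32 definitions and a GETFIELD equivalent (getfield32, a made-up helper) that masks the word and shifts right by the mask's lowest set bit. The queue address is split between the low 28 bits of w2 and all of w3. EQ_W0_QSIZE gives a queue of 4K << qsize bytes holding 4-byte entries, which is where the 1 << (qsize + 10) entry count comes from.

```c
#include <assert.h>
#include <stdint.h>

/* MSB-0 helpers, assumed to match QEMU's definitions */
#define PPC_BIT32(bit)         (0x80000000u >> (bit))
#define PPC_BITMASK32(bs, be)  ((PPC_BIT32(bs) - PPC_BIT32(be)) | PPC_BIT32(bs))

#define EQ_W0_VALID       PPC_BIT32(0)
#define EQ_W0_QSIZE       PPC_BITMASK32(12, 15)
#define EQ_W1_GENERATION  PPC_BIT32(9)
#define EQ_W1_PAGE_OFF    PPC_BITMASK32(10, 31)
#define EQ_W2_OP_DESC_HI  PPC_BITMASK32(4, 31)

typedef struct {
    uint32_t w0, w1, w2, w3, w4, w5, w6, w7;
} XiveEQ;

/* Extract a field: mask it, then shift by the mask's lowest set bit. */
static uint32_t getfield32(uint32_t mask, uint32_t word)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!((mask >> shift) & 1)) {
        shift++;
    }
    return (word & mask) >> shift;
}

/* Queue base address: high 28 bits from w2, low 32 bits from w3. */
static uint64_t eq_qaddr(const XiveEQ *eq)
{
    return ((uint64_t)getfield32(EQ_W2_OP_DESC_HI, eq->w2) << 32) | eq->w3;
}

/* 4K << qsize bytes of 4-byte entries = 1 << (qsize + 10) entries. */
static uint32_t eq_qentries(const XiveEQ *eq)
{
    return 1u << (getfield32(EQ_W0_QSIZE, eq->w0) + 10);
}
```

For example, with EQ_QSIZE_64K (4) the queue holds 16384 entries.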


* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-26  3:54           ` David Gibson
@ 2018-04-26 10:30             ` Cédric Le Goater
  2018-04-27  6:32               ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-26 10:30 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/26/2018 05:54 AM, David Gibson wrote:
> On Tue, Apr 24, 2018 at 11:33:11AM +0200, Cédric Le Goater wrote:
>> On 04/24/2018 08:46 AM, David Gibson wrote:
>>> On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
>>>> On 04/23/2018 08:46 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
>>>>>> The XiveFabric offers a simple interface between the XiveSource
>>>>>> object and the device model owning the interrupt sources. It is used
>>>>>> to forward an event notification to the XIVE interrupt controller of
>>>>>> the machine or, if the owner is the controller, to call the routing
>>>>>> sub-engine directly.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
>>>>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
>>>>>>  2 files changed, 61 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>> index 060976077dd7..b4c3d06c1219 100644
>>>>>> --- a/hw/intc/xive.c
>>>>>> +++ b/hw/intc/xive.c
>>>>>> @@ -17,6 +17,21 @@
>>>>>>  #include "hw/ppc/xive.h"
>>>>>>  
>>>>>>  /*
>>>>>> + * XIVE Fabric
>>>>>> + */
>>>>>> +
>>>>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
>>>>>> +{
>>>>>> +
>>>>>> +}
>>>>>> +
>>>>>> +static const TypeInfo xive_fabric_info = {
>>>>>> +    .name = TYPE_XIVE_FABRIC,
>>>>>> +    .parent = TYPE_INTERFACE,
>>>>>> +    .class_size = sizeof(XiveFabricClass),
>>>>>> +};
>>>>>> +
>>>>>> +/*
>>>>>>   * XIVE Interrupt Source
>>>>>>   */
>>>>>>  
>>>>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>>>>>>  
>>>>>>  /*
>>>>>>   * Forward the source event notification to the associated XiveFabric,
>>>>>> - * the device owning the sources.
>>>>>> + * the device owning the sources, or perform the routing if the device
>>>>>> + * is the interrupt controller.
>>>>>>   */
>>>>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>>>>  {
>>>>>>  
>>>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
>>>>>> +
>>>>>> +    if (xfc->notify) {
>>>>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
>>>>>> +    } else {
>>>>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
>>>>>> +    }
>>>>>
>>>>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
>>>>> to xive_fabric_route if that's what it wants?
>>>> Under sPAPR, all the sources, IPIs and virtual device interrupts,
>>>> generate events which are directly routed by xive_fabric_route().
>>>> Indeed, there is no need for an extra hop.
>>>
>>> Ok.
>>>
>>>> Under PowerNV, some sources forward the notification to the routing
>>>> engine using a specific MMIO load on a notify address which is stored
>>>> in one of the controller registers. So we need a hop to reach the
>>>> device model owning the sources, and do that load:
>>>
>>> Hm.  So you're saying that in pnv some sources send their notification
>>> to some other unit, 
>>
>> Not to any unit/device, to the device owning the sources.
>>
>> For the XiveSource object under PSI, the XIVEFabric interface is the
>> PSI device object itself, which knows how to forward the notification
>> on the XIVE Power "bus". To be more precise, the PSI HB device has
>> 14 interrupt sources, whose notifications are forwarded using an MMIO
>> load to some address. The load address is configured (by skiboot) in
>> one of the PSI device registers, and points to an MMIO region of the
>> main XIVE interrupt controller.
>>
>> The PHB4 sources should be the same.
>>
>> For the XiveSource object (all interrupts) under sPAPRXive, the 
>> XIVEFabric is the main interrupt controller sPAPRXive.
>>
>> For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
>> also the main interrupt controller PnvXive.
> 
> Hrm.  Apparently I'm missing something, I'm really not getting what
> you're trying to explain here.

I see that. Let's try again.

>>> that would then (after possible masking) forward on to the overall
>>> xive fabric?
>>
>> Yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?
> 
> Maybe..?
> 
>>> That seems like a property of the source object, 
>>
>> The source object is generic. It's a bunch of PQ bits that can be 
>> controlled by MMIOs. Nothing more.
> 
> Hmm.  Isn't the source object also responsible for forwarding the
> interrupt to something up the chain (whatever that is)?

Yes, but it cannot forward directly. The XiveSource is generic and
can only call a handler:

	xfc->notify(xsrc->xive, srcno + xsrc->offset);

The owning device model, the parent of the XiveSource object, does
the actual forwarding.

It's very similar to what we have today with XICS:

	- The sPAPR model has an ICSState  
	- The PnvPSI model has an ICSState 
	- The PnvPHB3 model has two ICSStates

and the 'xics' pointer in ICSState points to the 'interrupt unit' of
the machine, to do resends and to grab ICPs. So it is essentially used
for routing.

In XIVE:

	- sPAPRXive model has a XiveSource
	- PnvXive model has a XiveSource
	- PnvPSI model has a XiveSource
	- PnvPHB4 model should also have one.

and the 'xive' pointer in XiveSource points to the parent object,
which will handle the event notification forwarding or routing.
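The split of responsibilities described above can be modeled without QOM. The generic source only holds a pointer to its owner. If the owner's class provides a notify() hook, the source calls it; otherwise it hands the event to the router. The sketch below is simplified: Fabric, FabricClass, Source and the functions are hypothetical stand-ins for XiveFabric, XiveFabricClass, XiveSource, xive_source_notify() and xive_fabric_route().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Fabric Fabric;

/* Hypothetical stand-in for XiveFabricClass: notify() is optional. */
typedef struct {
    void (*notify)(Fabric *xf, uint32_t lisn);
} FabricClass;

struct Fabric {
    const FabricClass *klass;
    uint32_t last_routed;     /* what the routing sub-engine received */
    uint32_t last_forwarded;  /* what the owner's notify() received */
};

/* Hypothetical stand-in for XiveSource: a block of sources owned by xive. */
typedef struct {
    Fabric   *xive;    /* the parent object owning the sources */
    uint32_t offset;   /* first LISN of this block */
} Source;

/* Stand-in for xive_fabric_route(). */
static void fabric_route(Fabric *xf, uint32_t lisn)
{
    xf->last_routed = lisn;
}

/* Mirrors xive_source_notify(): forward through the owner's hook if
 * there is one, otherwise route directly. */
static void source_notify(Source *xsrc, uint32_t srcno)
{
    const FabricClass *xfc;

    assert(xsrc->xive && xsrc->xive->klass);
    xfc = xsrc->xive->klass;
    if (xfc->notify) {
        xfc->notify(xsrc->xive, srcno + xsrc->offset);
    } else {
        fabric_route(xsrc->xive, srcno + xsrc->offset);
    }
}

/* Example owner hook, e.g. a PSI-like device forwarding elsewhere. */
static void psi_like_notify(Fabric *xf, uint32_t lisn)
{
    xf->last_forwarded = lisn;
}
```

If the controller's class instead set notify() to the routing function, as suggested earlier in the thread, the else branch would become unnecessary. The sketch keeps the fallback that the patch uses.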
 
C.

>>> rather than a
>>> property of the fabric.  Indeed varying this by source object would
>>> require the objects have a different xive pointer, when I thought the
>>> idea was that the XiveFabric was global.
>>
>> When a notification is forwarded, the sources need to call an
>> interface which is generally implemented by the source owner,
> 
> I'm not quite sure what you mean by "source owner".

The parent object.
 
>> which is not necessarily the main IC. 
>>
>>>> 	static void pnv_psi_notify(XiveFabric *xf, uint32_t lisn)
>>>> 	{
>>>> 	    PnvPsi *psi = PNV_PSI(xf);
>>>> 	    uint64_t notif_port =
>>>> 	        psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
>>>> 	    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
>>>> 	    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
>>>> 	    uint32_t data = cpu_to_be32(lisn);
>>>> 	
>>>> 	    if (valid) {
>>>> 	        cpu_physical_memory_write(notify_addr, &data, sizeof(data));
>>>> 	    }
>>>> 	}
>>>>
>>>> The PnvXive model handles the load and forwards to the fabric again.  
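Outside QEMU, the PSI forwarding step above amounts to two things: decoding the notification port register, and emitting the LISN as a big-endian 32-bit word. Below is a minimal sketch with hypothetical helper names. It assumes the valid flag (NOTIF_VALID) is the lowest bit of the register, standing in for PSIHB9_ESB_NOTIF_VALID. The actual write to the notify address is left to the caller.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: valid flag in the lowest bit of the notify-port register. */
#define NOTIF_VALID  0x1ull

/* Store v most significant byte first, i.e. the byte layout that
 * cpu_to_be32() followed by a memory write produces. */
static void put_be32(uint8_t out[4], uint32_t v)
{
    out[0] = (uint8_t)(v >> 24);
    out[1] = (uint8_t)(v >> 16);
    out[2] = (uint8_t)(v >> 8);
    out[3] = (uint8_t)v;
}

/* Returns false if the port is not valid. Otherwise fills in the notify
 * address and the big-endian LISN payload to write there. */
static bool psi_notify_decode(uint64_t notif_port, uint32_t lisn,
                              uint64_t *addr, uint8_t data[4])
{
    assert(addr && data);
    if (!(notif_port & NOTIF_VALID)) {
        return false;
    }
    *addr = notif_port & ~NOTIF_VALID;
    put_be32(data, lisn);
    return true;
}
```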
>>>>
>>>> The IPIs under PowerNV do not need an extra hop, so they reach the
>>>> routing routine directly, without the notify() hop.
>>>>
>>>> Ultimately, PowerNV should use xive_fabric_route(), but there are
>>>> some differences in how the NVT registers are updated (HV vs. OS
>>>> mode) which are not handled yet, so it uses a notify() handler. This
>>>> should disappear in the near future, and xive_fabric_route() will
>>>> then be called directly.
>>>>
>>>>
>>>> Maybe XiveFabricNotifier would be a better name for this feature?
>>>> Later in the series, I add a few ops which are more related to routing.
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>>
>>>>
>>>>>
>>>>>>  }
>>>>>>  
>>>>>>  /*
>>>>>> @@ -302,6 +325,17 @@ static void xive_source_reset(DeviceState *dev)
>>>>>>  static void xive_source_realize(DeviceState *dev, Error **errp)
>>>>>>  {
>>>>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>>>>> +    Object *obj;
>>>>>> +    Error *local_err = NULL;
>>>>>> +
>>>>>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>>>>>> +    if (!obj) {
>>>>>> +        error_propagate(errp, local_err);
>>>>>> +        error_prepend(errp, "required link 'xive' not found: ");
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    xsrc->xive = XIVE_FABRIC(obj);
>>>>>>  
>>>>>>      if (!xsrc->nr_irqs) {
>>>>>>          error_setg(errp, "Number of interrupt needs to be greater than 0");
>>>>>> @@ -376,6 +410,7 @@ static const TypeInfo xive_source_info = {
>>>>>>  static void xive_register_types(void)
>>>>>>  {
>>>>>>      type_register_static(&xive_source_info);
>>>>>> +    type_register_static(&xive_fabric_info);
>>>>>>  }
>>>>>>  
>>>>>>  type_init(xive_register_types)
>>>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>>>> index 0b76dd278d9b..4fcae2c763e6 100644
>>>>>> --- a/include/hw/ppc/xive.h
>>>>>> +++ b/include/hw/ppc/xive.h
>>>>>> @@ -12,6 +12,8 @@
>>>>>>  
>>>>>>  #include "hw/sysbus.h"
>>>>>>  
>>>>>> +typedef struct XiveFabric XiveFabric;
>>>>>> +
>>>>>>  /*
>>>>>>   * XIVE Interrupt Source
>>>>>>   */
>>>>>> @@ -46,6 +48,8 @@ typedef struct XiveSource {
>>>>>>      hwaddr       esb_base;
>>>>>>      uint32_t     esb_shift;
>>>>>>      MemoryRegion esb_mmio;
>>>>>> +
>>>>>> +    XiveFabric   *xive;
>>>>>>  } XiveSource;
>>>>>>  
>>>>>>  /*
>>>>>> @@ -143,4 +147,25 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>>>>>      xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>>>>>>  }
>>>>>>  
>>>>>> +/*
>>>>>> + * XIVE Fabric
>>>>>> + */
>>>>>> +
>>>>>> +typedef struct XiveFabric {
>>>>>> +    Object parent;
>>>>>> +} XiveFabric;
>>>>>> +
>>>>>> +#define TYPE_XIVE_FABRIC "xive-fabric"
>>>>>> +#define XIVE_FABRIC(obj)                                     \
>>>>>> +    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
>>>>>> +#define XIVE_FABRIC_CLASS(klass)                                     \
>>>>>> +    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
>>>>>> +#define XIVE_FABRIC_GET_CLASS(obj)                                   \
>>>>>> +    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
>>>>>> +
>>>>>> +typedef struct XiveFabricClass {
>>>>>> +    InterfaceClass parent;
>>>>>> +    void (*notify)(XiveFabric *xf, uint32_t lisn);
>>>>>> +} XiveFabricClass;
>>>>>> +
>>>>>>  #endif /* PPC_XIVE_H */
>>>>>
>>>>
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-04-26  4:20       ` David Gibson
@ 2018-04-26 10:43         ` Cédric Le Goater
  2018-05-03  5:22           ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-26 10:43 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/26/2018 06:20 AM, David Gibson wrote:
> On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
>> On 04/24/2018 08:51 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
>>>> sPAPRXive is a model for the XIVE interrupt controller device of the
>>>> sPAPR machine. It holds the routing XIVE table, the Interrupt
>>>> Virtualization Entry (IVE) table which associates interrupt source
>>>> numbers with targets.
>>>>
>>>> Also extend the XiveFabric with an accessor to the IVT. This will be
>>>> needed by the routing algorithm.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>
>>>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
>>>>  discussed.
>>>
>>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
>>> than one XiveRouter?
>>
>> There is only one, the main IC. 
> 
> Ok, that's what I thought originally.  In that case some of the stuff
> in the patches really doesn't make sense to me.

Well, there is one IC per chip on PowerNV, but we haven't reached that
part yet.

>>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
>>> interface, possibly its methods could just be class methods of
>>> XiveRouter.
>>
>> Yes. We could introduce a XiveRouter to share the IVT between
>> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
>> the machines. Methods would provide a way to get the IVT/EQ/NVT
>> objects required for routing. I need to add a set_eq() to push the
>> EQ data.
> 
> Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
> object which owns the IVT.  

OK. That would be a model with some state, not an interface.

> It may or may not do other stuff as well.

Its only task would be to do the final event routing: get the IVE,
get the EQ, push the EQ data into the OS event queue and notify the CPU.
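A minimal standalone C sketch of those four steps follows. It assumes a greatly simplified model. The SketchRouter and SketchEQ types, the eight-entry queue and the notify_cpu callback are hypothetical and are not QEMU APIs. Only the IVE bit layout is taken from include/hw/ppc/xive_regs.h in patch 04. Real event queue entries also carry a generation bit, which the sketch leaves out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit */
#define PPC_BIT(bit)        (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

/* IVE layout, as in xive_regs.h */
#define IVE_VALID    PPC_BIT(0)
#define IVE_EQ_INDEX PPC_BITMASK(8, 31)
#define IVE_MASKED   PPC_BIT(32)
#define IVE_EQ_DATA  PPC_BITMASK(33, 63)

static uint64_t getfield(uint64_t mask, uint64_t word)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!((mask >> shift) & 1)) {
        shift++;
    }
    return (word & mask) >> shift;
}

/* Hypothetical, simplified event queue: an in-memory ring plus a target */
#define EQ_ENTRIES 8
typedef struct SketchEQ {
    uint32_t entries[EQ_ENTRIES];
    uint32_t head;      /* next slot written by the router */
    uint32_t server;    /* CPU to notify */
} SketchEQ;

typedef struct SketchRouter {
    uint64_t *ivt;      /* one IVE word per interrupt source */
    uint32_t  nr_irqs;
    SketchEQ *eqs;
    uint32_t  nr_eqs;
    void    (*notify_cpu)(uint32_t server);  /* may be NULL */
} SketchRouter;

/* Returns true if the event was queued (and the CPU notified) */
static bool sketch_router_route(SketchRouter *r, uint32_t lisn)
{
    uint64_t ive;
    uint32_t eq_idx;
    SketchEQ *eq;

    /* 1. get the IVE, ignoring invalid or masked sources */
    if (lisn >= r->nr_irqs) {
        return false;
    }
    ive = r->ivt[lisn];
    if (!(ive & IVE_VALID) || (ive & IVE_MASKED)) {
        return false;
    }

    /* 2. get the EQ */
    eq_idx = (uint32_t) getfield(IVE_EQ_INDEX, ive);
    if (eq_idx >= r->nr_eqs) {
        return false;
    }
    eq = &r->eqs[eq_idx];

    /* 3. push the EQ data into the OS event queue */
    eq->entries[eq->head] = (uint32_t) getfield(IVE_EQ_DATA, ive);
    eq->head = (eq->head + 1) % EQ_ENTRIES;

    /* 4. notify the CPU */
    if (r->notify_cpu) {
        r->notify_cpu(eq->server);
    }
    return true;
}
```

With the whole sequence in one object that owns the IVT, sPAPRXive and PnvXIVE could share the routing code. Only the table lookups would differ.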

> Now IIUC, on pnv the IVT lives in main system memory.  

Yes. It is allocated by skiboot in RAM and fed to the HW using some
IC configuration registers. Then, each entry is configured with OPAL
calls and the HW is updated using cache scrub registers.

> Under PAPR is the IVT in guest memory, or is it outside (updated by
> hypercalls/rtas)?

Under sPAPR, the IVT is updated by the H_INT_SET_SOURCE_CONFIG hcall,
which configures the targeting of an IRQ. It is not in guest
memory.

Under the hood, the IVT is still configured by OPAL under KVM and
by QEMU when kernel_irqchip=off.
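For illustration only, the IVE update that such an hcall handler performs might look like this sketch. The helper sketch_set_source_config() and its arguments are hypothetical. The real H_INT_SET_SOURCE_CONFIG argument list is defined by PAPR and is not reproduced here, and the caller is assumed to turn a target and priority into an EQ index. The bit layout follows the XiveIVE definition in patch 04.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit */
#define PPC_BIT(bit)        (0x8000000000000000ULL >> (bit))
#define PPC_BITMASK(bs, be) ((PPC_BIT(bs) - PPC_BIT(be)) | PPC_BIT(bs))

#define IVE_VALID    PPC_BIT(0)
#define IVE_EQ_INDEX PPC_BITMASK(8, 31)
#define IVE_MASKED   PPC_BIT(32)
#define IVE_EQ_DATA  PPC_BITMASK(33, 63)

/* Replace the bits selected by 'mask' in 'word' with 'value' */
static uint64_t setfield(uint64_t mask, uint64_t word, uint64_t value)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!((mask >> shift) & 1)) {
        shift++;
    }
    return (word & ~mask) | ((value << shift) & mask);
}

/* Hypothetical helper: retarget one enabled source to an EQ */
static bool sketch_set_source_config(uint64_t *ivt, uint32_t nr_irqs,
                                     uint32_t lisn, uint32_t eq_index,
                                     uint32_t eq_data, bool masked)
{
    uint64_t w;

    if (lisn >= nr_irqs) {
        return false;
    }
    w = ivt[lisn];
    if (!(w & IVE_VALID)) {
        return false;           /* source was never enabled */
    }
    w = setfield(IVE_EQ_INDEX, w, eq_index);
    w = setfield(IVE_EQ_DATA, w, eq_data);
    w = masked ? (w | IVE_MASKED) : (w & ~IVE_MASKED);

    /* Publish the new entry with a single 64-bit store */
    ivt[lisn] = w;
    return true;
}
```

The word is rebuilt locally and written back once. This is in the spirit of the "single 64-bit definition ... atomic updates" comment in xive_regs.h.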


>> The XiveRouter would also be a XiveFabric (or some other name) to 
>> let the internal sources of the interrupt controller forward events.
> 
> The further we go here, the less sure I am that XiveFabric even makes
> sense as a concept.

See previous email.

C.

>>
>>>>
>>>>  Changes since v2 :
>>>>
>>>>  - introduced the XiveFabric interface
>>>>
>>>>  default-configs/ppc64-softmmu.mak |   1 +
>>>>  hw/intc/Makefile.objs             |   1 +
>>>>  hw/intc/spapr_xive.c              | 159 ++++++++++++++++++++++++++++++++++++++
>>>>  hw/intc/xive.c                    |   7 ++
>>>>  include/hw/ppc/spapr_xive.h       |  31 ++++++++
>>>>  include/hw/ppc/xive.h             |   5 ++
>>>>  include/hw/ppc/xive_regs.h        |  33 ++++++++
>>>>  7 files changed, 237 insertions(+)
>>>>  create mode 100644 hw/intc/spapr_xive.c
>>>>  create mode 100644 include/hw/ppc/spapr_xive.h
>>>>  create mode 100644 include/hw/ppc/xive_regs.h
>>>>
>>>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>>>> index c6d13e757977..f8d34722931d 100644
>>>> --- a/default-configs/ppc64-softmmu.mak
>>>> +++ b/default-configs/ppc64-softmmu.mak
>>>> @@ -17,4 +17,5 @@ CONFIG_XICS=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>>>  CONFIG_XIVE=$(CONFIG_PSERIES)
>>>> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>>>>  CONFIG_MEM_HOTPLUG=y
>>>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>>>> index 72a46ed91c31..301a8e972d91 100644
>>>> --- a/hw/intc/Makefile.objs
>>>> +++ b/hw/intc/Makefile.objs
>>>> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>>>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>>>  obj-$(CONFIG_XIVE) += xive.o
>>>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>>>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>> new file mode 100644
>>>> index 000000000000..020444e2665a
>>>> --- /dev/null
>>>> +++ b/hw/intc/spapr_xive.c
>>>> @@ -0,0 +1,159 @@
>>>> +/*
>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/log.h"
>>>> +#include "qapi/error.h"
>>>> +#include "target/ppc/cpu.h"
>>>> +#include "sysemu/cpus.h"
>>>> +#include "monitor/monitor.h"
>>>> +#include "hw/ppc/spapr_xive.h"
>>>> +#include "hw/ppc/xive_regs.h"
>>>> +
>>>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    monitor_printf(mon, "IVE Table\n");
>>>> +    for (i = 0; i < xive->nr_irqs; i++) {
>>>> +        XiveIVE *ive = &xive->ivt[i];
>>>> +
>>>> +        if (!(ive->w & IVE_VALID)) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        monitor_printf(mon, "  %4x %s %08x %08x\n", i,
>>>> +                       ive->w & IVE_MASKED ? "M" : " ",
>>>> +                       (int) GETFIELD(IVE_EQ_INDEX, ive->w),
>>>> +                       (int) GETFIELD(IVE_EQ_DATA, ive->w));
>>>> +    }
>>>> +}
>>>> +
>>>> +static void spapr_xive_reset(DeviceState *dev)
>>>> +{
>>>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +    int i;
>>>> +
>>>> +    /* Mask all valid IVEs in the IRQ number space. */
>>>> +    for (i = 0; i < xive->nr_irqs; i++) {
>>>> +        XiveIVE *ive = &xive->ivt[i];
>>>> +        if (ive->w & IVE_VALID) {
>>>> +            ive->w |= IVE_MASKED;
>>>> +        }
>>>> +    }
>>>> +}
>>>> +
>>>> +static void spapr_xive_init(Object *obj)
>>>
>>> I'm trying to standardize on instance_init methods being called
>>> *_instance_init().  It helps to make it obvious that this is indeed an
>>> instance_init() method, rather than one of the various other init
>>> calls that exist in various places.
>>
>> OK. This is good practice. I will fix it.
>>
>> Thanks,
>>
>> C.
>>
>>>
>>>> +{
>>>> +
>>>> +}
>>>> +
>>>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +
>>>> +    if (!xive->nr_irqs) {
>>>> +        error_setg(errp, "Number of interrupt needs to be greater 0");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* Allocate the Interrupt Virtualization Table */
>>>> +    xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>> +}
>>>> +
>>>> +static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>> +{
>>>> +    sPAPRXive *xive = SPAPR_XIVE(xf);
>>>> +
>>>> +    return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>> +}
>>>> +
>>>> +static const VMStateDescription vmstate_spapr_xive_ive = {
>>>> +    .name = TYPE_SPAPR_XIVE "/ive",
>>>> +    .version_id = 1,
>>>> +    .minimum_version_id = 1,
>>>> +    .fields = (VMStateField []) {
>>>> +        VMSTATE_UINT64(w, XiveIVE),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    },
>>>> +};
>>>> +
>>>> +static const VMStateDescription vmstate_spapr_xive = {
>>>> +    .name = TYPE_SPAPR_XIVE,
>>>> +    .version_id = 1,
>>>> +    .minimum_version_id = 1,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>>>> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(ivt, sPAPRXive, nr_irqs,
>>>> +                                     vmstate_spapr_xive_ive, XiveIVE),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    },
>>>> +};
>>>> +
>>>> +static Property spapr_xive_properties[] = {
>>>> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>>>> +    DEFINE_PROP_END_OF_LIST(),
>>>> +};
>>>> +
>>>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
>>>> +
>>>> +    dc->realize = spapr_xive_realize;
>>>> +    dc->reset = spapr_xive_reset;
>>>> +    dc->props = spapr_xive_properties;
>>>> +    dc->desc = "sPAPR XIVE interrupt controller";
>>>> +    dc->vmsd = &vmstate_spapr_xive;
>>>> +
>>>> +    xfc->get_ive = spapr_xive_get_ive;
>>>> +}
>>>> +
>>>> +static const TypeInfo spapr_xive_info = {
>>>> +    .name = TYPE_SPAPR_XIVE,
>>>> +    .parent = TYPE_SYS_BUS_DEVICE,
>>>> +    .instance_init = spapr_xive_init,
>>>> +    .instance_size = sizeof(sPAPRXive),
>>>> +    .class_init = spapr_xive_class_init,
>>>> +    .interfaces = (InterfaceInfo[]) {
>>>> +            { TYPE_XIVE_FABRIC },
>>>> +            { },
>>>> +    },
>>>> +};
>>>> +
>>>> +static void spapr_xive_register_types(void)
>>>> +{
>>>> +    type_register_static(&spapr_xive_info);
>>>> +}
>>>> +
>>>> +type_init(spapr_xive_register_types)
>>>> +
>>>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
>>>> +{
>>>> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
>>>> +
>>>> +    if (!ive) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    ive->w |= IVE_VALID;
>>>> +    return true;
>>>> +}
>>>> +
>>>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>>>> +{
>>>> +    XiveIVE *ive = spapr_xive_get_ive(XIVE_FABRIC(xive), lisn);
>>>> +
>>>> +    if (!ive) {
>>>> +        return false;
>>>> +    }
>>>> +
>>>> +    ive->w &= ~IVE_VALID;
>>>> +    return true;
>>>> +}
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index b4c3d06c1219..dccad0318834 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -20,6 +20,13 @@
>>>>   * XIVE Fabric
>>>>   */
>>>>  
>>>> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
>>>> +{
>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
>>>> +
>>>> +    return xfc->get_ive(xf, lisn);
>>>> +}
>>>> +
>>>>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>>>>  {
>>>>  
>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>>> new file mode 100644
>>>> index 000000000000..1d966b5d3a96
>>>> --- /dev/null
>>>> +++ b/include/hw/ppc/spapr_xive.h
>>>> @@ -0,0 +1,31 @@
>>>> +/*
>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#ifndef PPC_SPAPR_XIVE_H
>>>> +#define PPC_SPAPR_XIVE_H
>>>> +
>>>> +#include "hw/sysbus.h"
>>>> +#include "hw/ppc/xive.h"
>>>> +
>>>> +#define TYPE_SPAPR_XIVE "spapr-xive"
>>>> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>>>> +
>>>> +typedef struct sPAPRXive {
>>>> +    SysBusDevice parent;
>>>> +
>>>> +    /* Routing table */
>>>> +    XiveIVE      *ivt;
>>>> +    uint32_t     nr_irqs;
>>>> +} sPAPRXive;
>>>> +
>>>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>>>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>>>> +
>>>> +#endif /* PPC_SPAPR_XIVE_H */
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index 4fcae2c763e6..5b145816acdc 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -11,6 +11,7 @@
>>>>  #define PPC_XIVE_H
>>>>  
>>>>  #include "hw/sysbus.h"
>>>> +#include "hw/ppc/xive_regs.h"
>>>>  
>>>>  typedef struct XiveFabric XiveFabric;
>>>>  
>>>> @@ -166,6 +167,10 @@ typedef struct XiveFabric {
>>>>  typedef struct XiveFabricClass {
>>>>      InterfaceClass parent;
>>>>      void (*notify)(XiveFabric *xf, uint32_t lisn);
>>>> +
>>>> +    XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>>>  } XiveFabricClass;
>>>>  
>>>> +XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
>>>> +
>>>>  #endif /* PPC_XIVE_H */
>>>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>>>> new file mode 100644
>>>> index 000000000000..5903f29eb789
>>>> --- /dev/null
>>>> +++ b/include/hw/ppc/xive_regs.h
>>>> @@ -0,0 +1,33 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2016-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#ifndef _PPC_XIVE_REGS_H
>>>> +#define _PPC_XIVE_REGS_H
>>>> +
>>>> +/* IVE/EAS
>>>> + *
>>>> + * One per interrupt source. Targets that interrupt to a given EQ
>>>> + * and provides the corresponding logical interrupt number (EQ data)
>>>> + *
>>>> + * We also map this structure to the escalation descriptor inside
>>>> + * an EQ, though in that case the valid and masked bits are not used.
>>>> + */
>>>> +typedef struct XiveIVE {
>>>> +        /* Use a single 64-bit definition to make it easier to
>>>> +         * perform atomic updates
>>>> +         */
>>>> +        uint64_t        w;
>>>> +#define IVE_VALID       PPC_BIT(0)
>>>> +#define IVE_EQ_BLOCK    PPC_BITMASK(4, 7)        /* Destination EQ block# */
>>>> +#define IVE_EQ_INDEX    PPC_BITMASK(8, 31)       /* Destination EQ index */
>>>> +#define IVE_MASKED      PPC_BIT(32)              /* Masked */
>>>> +#define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>>>> +} XiveIVE;
>>>> +
>>>> +#endif /* _INTC_XIVE_INTERNAL_H */
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-26  3:28           ` David Gibson
@ 2018-04-26 12:16             ` Cédric Le Goater
  2018-04-27  2:43               ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-26 12:16 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/26/2018 05:28 AM, David Gibson wrote:
> On Tue, Apr 24, 2018 at 10:11:27AM +0200, Cédric Le Goater wrote:
>> On 04/24/2018 08:41 AM, David Gibson wrote:
>>> On Mon, Apr 23, 2018 at 09:31:24AM +0200, Cédric Le Goater wrote:
>>>> On 04/23/2018 08:44 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
>>>>>> The 'sent' status of the LSI interrupt source is modeled with the 'P'
>>>>>> bit of the ESB and the assertion status of the source is maintained in
>>>>>> an array under the main sPAPRXive object. The type of the source is
>>>>>> stored in the same array for practical reasons.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
>>>>>>  include/hw/ppc/xive.h | 16 +++++++++++++++
>>>>>>  2 files changed, 66 insertions(+), 4 deletions(-)
>>>>>>
>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>> index c70578759d02..060976077dd7 100644
>>>>>> --- a/hw/intc/xive.c
>>>>>> +++ b/hw/intc/xive.c
>>>>>> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>>>>  
>>>>>>  }
>>>>>>  
>>>>>> +/*
>>>>>> + * LSI interrupt sources use the P bit and a custom assertion flag
>>>>>> + */
>>>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
>>>>>> +{
>>>>>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
>>>>>> +
>>>>>> +    if  (old_pq == XIVE_ESB_RESET &&
>>>>>> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
>>>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
>>>>>> +        return true;
>>>>>> +    }
>>>>>> +    return false;
>>>>>> +}
>>>>>> +
>>>>>>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
>>>>>>   * page is for management */
>>>>>>  static inline bool xive_source_is_trigger_page(hwaddr addr)
>>>>>> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>>>>>>           */
>>>>>>          ret = xive_source_pq_eoi(xsrc, srcno);
>>>>>>  
>>>>>> +        /* If the LSI source is still asserted, forward a new source
>>>>>> +         * event notification */
>>>>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>>>> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
>>>>>> +                xive_source_notify(xsrc, srcno);
>>>>>> +            }
>>>>>> +        }
>>>>>>          break;
>>>>>>  
>>>>>>      case XIVE_ESB_GET:
>>>>>> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
>>>>>>           * notification
>>>>>>           */
>>>>>>          notify = xive_source_pq_eoi(xsrc, srcno);
>>>>>> +
>>>>>> +        /* LSI sources do not set the Q bit but they can still be
>>>>>> +         * asserted, in which case we should forward a new source
>>>>>> +         * event notification
>>>>>> +         */
>>>>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>>>> +            notify = xive_source_lsi_trigger(xsrc, srcno);
>>>>>> +        }
>>>>
>>>> FYI, I have moved that common test under xive_source_pq_eoi()
>>>
>>> Ok.
>>>
>>>>>>          break;
>>>>>>  
>>>>>>      default:
>>>>>> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
>>>>>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>>>>      bool notify = false;
>>>>>>  
>>>>>> -    if (val) {
>>>>>> -        notify = xive_source_pq_trigger(xsrc, srcno);
>>>>>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>>>> +        if (val) {
>>>>>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
>>>>>> +        } else {
>>>>>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
>>>>>> +        }
>>>>>> +        notify = xive_source_lsi_trigger(xsrc, srcno);
>>>>>> +    } else {
>>>>>> +        if (val) {
>>>>>> +            notify = xive_source_pq_trigger(xsrc, srcno);
>>>>>> +        }
>>>>>>      }
>>>>>>  
>>>>>>      /* Forward the source event notification for routing */
>>>>>> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>>>>>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
>>>>>>      for (i = 0; i < xsrc->nr_irqs; i++) {
>>>>>>          uint8_t pq = xive_source_pq_get(xsrc, i);
>>>>>> -        uint32_t lisn = i  + xsrc->offset;
>>>>>>  
>>>>>>          if (pq == XIVE_ESB_OFF) {
>>>>>>              continue;
>>>>>>          }
>>>>>>  
>>>>>> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
>>>>>> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
>>>>>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>>>>>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>>>>>>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>>>>>>      }
>>>>>> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
>>>>>>  static void xive_source_reset(DeviceState *dev)
>>>>>>  {
>>>>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>>>>> +    int i;
>>>>>> +
>>>>>> +    /* Keep the IRQ type */
>>>>>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>>>>>> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
>>>>>> +    }
>>>>>>  
>>>>>>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
>>>>>>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
>>>>>> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>>>>>>  
>>>>>>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>>>>>>                                       xsrc->nr_irqs);
>>>>>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
>>>>>>  
>>>>>>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
>>>>>>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
>>>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>>>> index d92a50519edf..0b76dd278d9b 100644
>>>>>> --- a/include/hw/ppc/xive.h
>>>>>> +++ b/include/hw/ppc/xive.h
>>>>>> @@ -33,6 +33,9 @@ typedef struct XiveSource {
>>>>>>      uint32_t     nr_irqs;
>>>>>>      uint32_t     offset;
>>>>>>      qemu_irq     *qirqs;
>>>>>> +#define XIVE_STATUS_LSI         0x1
>>>>>> +#define XIVE_STATUS_ASSERTED    0x2
>>>>>> +    uint8_t      *status;
>>>>>
>>>>> I don't love the idea of mixing configuration information (STATUS_LSI)
>>>>> with runtime state information (ASSERTED) in the same field.  Any
>>>>> reason not to have these as parallel bitmaps.
>>>>
>>>> none. I can change that. 
>>>
>>> Ok.
>>>
>>>>> Come to that.. is there a compelling reason to allow any individual
>>>>> irq to be marked LSI or MSI, rather than using separate XiveSource
>>>>> objects for MSIs and LSIs?
>>>>
>>>> yes. I would have preferred two distinct interrupt source objects but 
>>>> this is to be compatible with XICS, which uses only one. If we want
>>>> to be able to change interrupt mode, the IRQ number space should be
>>>> organized in the exact same way. Or we should change XICS also.
>>>>
>>>> Also, the change (a bitmap) is really small.
>>>
>>> Hrm, but since XIVE supports thousands of irqs, it could be quite a
>>> large bitmap.
>>
>> Yes. The change is small, not the bitmap.
>>  
>>> It's not impossible - in fact, not really even that hard - to change
>>> the existing irq layout on xics.  It does need a new machine type
>>> variant, of course.
>>
>> I did some work on that topic a while ago :
>>
>> 	https://patchwork.ozlabs.org/cover/836782/
>>
>> But we stopped exploring the idea. Maybe it was not the right approach.
>> The PHBs LSIs would benefit from such a split though.
> 
> So, no, I don't think that was a good approach, but that doesn't mean
> other ways of rearranging the irq numbers aren't ok.  The thing here
> is that we don't want to think of an "irq allocator" - there are some
> bits like that in there already, but they were always a mistake.
> 
> We have lots of irq space (both XICS and XIVE) so instead we should
> come up with a static mapping of irqs to devices.

Yes, I would prefer that too.

We could change the spapr_irq_alloc() routine to get a block of
IRQs in the range defined for a device family, and use a device
id as the offset within that family's range? Here are some figures:

device family        block size  max devices

EVENT_CLASS_EPOW              1            1
EVENT_CLASS_HOT_PLUG          1            1
VIO_VSCSI                     1           10
VIO_LLAN                      1           10
VIO_VTY                       1            5
PCI/PHB                    1024            5
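This sketch shows one way a static layout could turn the table above into IRQ numbers. It assumes that families are packed back to back in table order, starting at 0. A real machine would add its own base, such as the first IRQ number of its interrupt space. All names are hypothetical, and spapr_irq_alloc() itself is not reproduced.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    FAM_EPOW, FAM_HOTPLUG, FAM_VIO_VSCSI, FAM_VIO_LLAN, FAM_VIO_VTY,
    FAM_PHB, FAM_COUNT
} IrqFamily;

typedef struct IrqFamilyLayout {
    uint32_t block_size;    /* IRQs per device */
    uint32_t max_devices;
} IrqFamilyLayout;

/* Figures from the table above */
static const IrqFamilyLayout irq_layout[FAM_COUNT] = {
    [FAM_EPOW]      = { 1, 1 },
    [FAM_HOTPLUG]   = { 1, 1 },
    [FAM_VIO_VSCSI] = { 1, 10 },
    [FAM_VIO_LLAN]  = { 1, 10 },
    [FAM_VIO_VTY]   = { 1, 5 },
    [FAM_PHB]       = { 1024, 5 },
};

/* First IRQ of a family: sum of the ranges of all preceding families */
static uint32_t irq_family_base(IrqFamily fam)
{
    uint32_t base = 0;

    for (int f = 0; f < (int) fam; f++) {
        base += irq_layout[f].block_size * irq_layout[f].max_devices;
    }
    return base;
}

/* IRQ 'index' of device 'dev_id' in 'fam', or -1 if out of range */
static int64_t irq_static_number(IrqFamily fam, uint32_t dev_id,
                                 uint32_t index)
{
    const IrqFamilyLayout *l;

    assert(fam < FAM_COUNT);
    l = &irq_layout[fam];
    if (dev_id >= l->max_devices || index >= l->block_size) {
        return -1;
    }
    return (int64_t) irq_family_base(fam) +
           (int64_t) dev_id * l->block_size + index;
}
```

With this scheme, a device's IRQ numbers depend only on its family and device id, not on the order in which devices were created. That is the property a static mapping is meant to provide.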

C.


>>>>>>      /* PQ bits */
>>>>>>      uint8_t      *sbe;
>>>>>
>>>>> .. and come to that is there a reason to keep the ASSERTED bit in a
>>>>> separate array from sbe?  AFAICT the actual 2-bit-per-irq layout is
>>>>> never exposed to the guests.
>>>>
>>>> Indeed. We always use the xive_source_pq_get/set() helpers to
>>>> manipulate the PQ bits, so we could add an extra bit for ASSERTED
>>>> without too many changes. Could we also put the type there, or would
>>>> you still prefer a bitmap?
>>>
>>> I'd prefer the type (config information) be separate from the P, Q,
>>> ASSERTED bits (state information).
>>
>> ok. So I will use the 'uint8_t *status' for P, Q, ASSERT, which leaves
>> 5 bits available, but I don't think it is really worth the pain to 
>> optimize the size.
> 
> Sure.  I don't really care if it's packed or not.
> 
>> The sbe array will disappear and we will have 
>> a bitmap for the type.
> 
> We may or may not keep the type bitmap based on the discussion above,
> but in any case this is a good step forward.
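Here is a rough standalone sketch of the layout agreed on above. Each source gets one status byte holding P, Q and ASSERTED (state), plus a separate one-bit-per-source LSI map (configuration). The type and function names are hypothetical. The ESB encodings (reset 00, off 01, pending 10, queued 11) are assumed to match the XIVE_ESB_* values used in the series. The LSI re-trigger is folded into the EOI path, in the spirit of moving that test under xive_source_pq_eoi().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* State bits of the per-source status byte (P and Q keep their ESB places) */
#define ST_Q        0x1
#define ST_P        0x2
#define ST_ASSERTED 0x4
#define ST_PQ_MASK  (ST_P | ST_Q)

/* Assumed ESB encodings: reset 00, off 01, pending 10, queued 11 */
#define PQ_RESET    0x0
#define PQ_OFF      0x1
#define PQ_PENDING  0x2
#define PQ_QUEUED   0x3

typedef struct SketchSource {
    uint32_t nr_irqs;
    uint8_t *status;    /* runtime state: P, Q, ASSERTED */
    uint8_t *lsi_map;   /* configuration: bit n set if source n is an LSI */
} SketchSource;

static bool src_is_lsi(const SketchSource *s, uint32_t n)
{
    assert(n < s->nr_irqs);
    return (s->lsi_map[n / 8] >> (n % 8)) & 1;
}

static uint8_t src_pq(const SketchSource *s, uint32_t n)
{
    return s->status[n] & ST_PQ_MASK;
}

static void src_set_pq(SketchSource *s, uint32_t n, uint8_t pq)
{
    s->status[n] = (uint8_t) ((s->status[n] & ~ST_PQ_MASK) | pq);
}

/* MSI trigger: returns true when the event must be forwarded for routing */
static bool src_msi_trigger(SketchSource *s, uint32_t n)
{
    switch (src_pq(s, n)) {
    case PQ_RESET:
        src_set_pq(s, n, PQ_PENDING);
        return true;
    case PQ_PENDING:
        src_set_pq(s, n, PQ_QUEUED);
        return false;
    default:            /* QUEUED stays queued, OFF stays off */
        return false;
    }
}

/* LSI trigger: fires only from RESET while the line is asserted */
static bool src_lsi_trigger(SketchSource *s, uint32_t n)
{
    if (src_pq(s, n) == PQ_RESET && (s->status[n] & ST_ASSERTED)) {
        src_set_pq(s, n, PQ_PENDING);
        return true;
    }
    return false;
}

/* EOI, with the LSI re-trigger check common to the load and store paths */
static bool src_eoi(SketchSource *s, uint32_t n)
{
    bool notify = false;

    switch (src_pq(s, n)) {
    case PQ_PENDING:
        src_set_pq(s, n, PQ_RESET);
        break;
    case PQ_QUEUED:     /* another event arrived meanwhile: re-notify */
        src_set_pq(s, n, PQ_PENDING);
        notify = true;
        break;
    default:
        break;
    }
    if (src_is_lsi(s, n)) {
        notify = src_lsi_trigger(s, n);
    }
    return notify;
}

/* Input line change, the equivalent of the qemu_irq handler */
static bool src_set_irq(SketchSource *s, uint32_t n, int val)
{
    if (src_is_lsi(s, n)) {
        if (val) {
            s->status[n] |= ST_ASSERTED;
        } else {
            s->status[n] &= (uint8_t) ~ST_ASSERTED;
        }
        return src_lsi_trigger(s, n);
    }
    return val ? src_msi_trigger(s, n) : false;
}
```

The 2-bit packing disappears because only the helpers touch the status byte. The LSI/MSI type never mixes with runtime state.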
> 
>>
>> Thanks,
>>
>> C. 
>>
>>>>> Or, even re-use the Q bit for asserted in LSIs (but report it as
>>>>> always 0 in the register read/write path).
>>>>
>>>> I would prefer to add extra status bits. It is easier to debug.
>>>>
>>>> Thanks,
>>>>
>>>> C.
>>>>
>>>>>> @@ -127,4 +130,17 @@ uint8_t xive_source_pq_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>>>>>>  
>>>>>>  void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon);
>>>>>>  
>>>>>> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
>>>>>> +{
>>>>>> +    assert(srcno < xsrc->nr_irqs);
>>>>>> +    return xsrc->status[srcno] & XIVE_STATUS_LSI;
>>>>>> +}
>>>>>> +
>>>>>> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>>>>> +                                       bool lsi)
>>>>>> +{
>>>>>> +    assert(srcno < xsrc->nr_irqs);
>>>>>> +    xsrc->status[srcno] |= lsi ? XIVE_STATUS_LSI : 0;
>>>>>> +}
>>>>>> +
>>>>>>  #endif /* PPC_XIVE_H */
>>>>>
>>>>
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-26  9:27     ` Cédric Le Goater
@ 2018-04-26 17:15       ` Cédric Le Goater
  2018-05-03  5:39         ` David Gibson
  2018-05-03  5:35       ` David Gibson
  1 sibling, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-04-26 17:15 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/26/2018 11:27 AM, Cédric Le Goater wrote:
> On 04/26/2018 09:11 AM, David Gibson wrote:
>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>> The XIVE presenter engine uses a set of registers to handle priority
>>> management and interrupt acknowledgment among other things. The most
>>> important ones are:
>>>
>>>   - Interrupt Priority Register (PIPR)
>>>   - Interrupt Pending Buffer (IPB)
>>>   - Current Processor Priority (CPPR)
>>>   - Notification Source Register (NSR)
>>>
>>> There is one set of registers per level of privilege, four in all:
>>> HW, HV pool, OS and User. These are called rings. All registers are
>>> accessible through a specific MMIO region called the Thread Interrupt
>>> Management Area (TIMA) but, depending on the privilege level of the
>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>>> OS privilege level and therefore can only access the OS and the User
>>> rings. The others are for hypervisor levels.
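This simplified sketch of a single ring shows how these registers relate. IPB has one bit per pending priority, and PIPR is the most favored pending priority. An exception is due when PIPR is more favored (numerically lower) than CPPR. The following are all assumptions of the sketch:

- eight priorities, with 0 the most favored;
- the 0xff "nothing pending" value;
- all names.

NSR and the exact value returned on acknowledge are left out. Because a CPPR write can unmask a pending event, setting CPPR re-evaluates the exception condition.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PRIO_MAX  7     /* assumed: priorities 0 (most favored) to 7 */
#define SKETCH_PRIO_NONE 0xff

typedef struct SketchRing {
    uint8_t cppr;   /* Current Processor Priority */
    uint8_t ipb;    /* Interrupt Pending Buffer: one bit per priority */
    uint8_t pipr;   /* most favored pending priority */
} SketchRing;

/* Priority 0 maps to the most significant IPB bit */
static uint8_t prio_to_ipb(uint8_t prio)
{
    return prio > SKETCH_PRIO_MAX ? 0 :
           (uint8_t) (1u << (SKETCH_PRIO_MAX - prio));
}

static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio;

    for (prio = 0; prio <= SKETCH_PRIO_MAX; prio++) {
        if (ipb & prio_to_ipb(prio)) {
            return prio;
        }
    }
    return SKETCH_PRIO_NONE;
}

/* An exception is due when a pending priority is more favored than CPPR */
static bool ring_should_signal(const SketchRing *r)
{
    return r->pipr < r->cppr;
}

/* Record an event at 'prio'; returns true if the CPU should be signalled */
static bool ring_set_pending(SketchRing *r, uint8_t prio)
{
    r->ipb |= prio_to_ipb(prio);
    r->pipr = ipb_to_pipr(r->ipb);
    return ring_should_signal(r);
}

/* A CPPR write can unmask a pending event, so re-evaluate the condition */
static bool ring_set_cppr(SketchRing *r, uint8_t cppr)
{
    r->cppr = cppr > SKETCH_PRIO_MAX ? SKETCH_PRIO_NONE : cppr;
    return ring_should_signal(r);
}

/* Acknowledge: consume the most favored pending priority, set CPPR to it */
static uint8_t ring_accept(SketchRing *r)
{
    uint8_t prio = r->pipr;

    if (!ring_should_signal(r)) {
        return SKETCH_PRIO_NONE;
    }
    r->ipb &= (uint8_t) ~prio_to_ipb(prio);
    r->pipr = ipb_to_pipr(r->ipb);
    r->cppr = prio;
    return prio;
}
```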
>>>
>>> The CPU interrupt state is modeled with a XiveNVT object which stores
>>> the values of the different registers. The different TIMA views are
>>> mapped at the same address for each CPU and 'current_cpu' is used to
>>> retrieve the XiveNVT holding the ring registers.
>>>
>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>> ---
>>>
>>>  Changes since v2 :
>>>
>>>  - introduced the XiveFabric interface
>>>
>>>  hw/intc/spapr_xive.c        |  25 ++++
>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>>  include/hw/ppc/spapr_xive.h |   5 +
>>>  include/hw/ppc/xive.h       |  31 +++++
>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>>  5 files changed, 424 insertions(+)
>>>
>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>> index 90cde8a4082d..f07832bf0a00 100644
>>> --- a/hw/intc/spapr_xive.c
>>> +++ b/hw/intc/spapr_xive.c
>>> @@ -13,6 +13,7 @@
>>>  #include "target/ppc/cpu.h"
>>>  #include "sysemu/cpus.h"
>>>  #include "monitor/monitor.h"
>>> +#include "hw/ppc/spapr.h"
>>>  #include "hw/ppc/spapr_xive.h"
>>>  #include "hw/ppc/xive.h"
>>>  #include "hw/ppc/xive_regs.h"
>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>  
>>>      /* Allocate the Interrupt Virtualization Table */
>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>> +
>>> +    /* The Thread Interrupt Management Area has the same address for
>>> +     * each chip. On sPAPR, we only need to expose the User and OS
>>> +     * level views of the TIMA.
>>> +     */
>>> +    xive->tm_base = XIVE_TM_BASE;
>>
>> The constant should probably have PAPR in the name somewhere, since
>> it's just for PAPR machines (same for the ESB mappings, actually).
> 
> ok. 
> 
> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in
> case we want to change the value when the guest is instantiated.
> I doubt it will be needed, but this is an address in the global address
> space, so letting the machine have control is better, I think.
>  
>>
>>> +
>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>>> +                          1ull << TM_SHIFT);
>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>>> +
>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>>> +                          1ull << TM_SHIFT);
>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>>  }
>>>  
>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>  }
>>>  
>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>> +{
>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>> +
>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>> +}
>>
>> So this is a bit of a tangent, but I've been thinking of implementing
>> a scheme where there's an opaque pointer in the cpu structure for the
>> use of the machine.  I'm planning for that to replace the intc pointer
>> (which isn't really used directly by the cpu). That would allow us to
>> have spapr put a structure there and have both xics and xive pointers
>> which could be useful later on.
> 
> ok. That should simplify the patchset at the end, in which we need to 
> switch the 'intc' pointer. 
> 
>> I think we'd need something similar to correctly handle migration of
>> the VPA state, which is currently horribly broken.
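Here is a minimal sketch of that idea, using stand-in types. The CPU carries an opaque machine_data pointer, and sPAPR stores a per-CPU structure there that holds both a XICS and a XIVE presenter. SketchCPU, SketchSpaprCpuState and the accessors are hypothetical. ICPState and XiveNVT are reduced to placeholders and are not the QEMU definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Placeholders for the real presenter types (hypothetical) */
typedef struct ICPState { int server; } ICPState;
typedef struct XiveNVT  { int server; } XiveNVT;

typedef struct SketchCPU {
    int   vcpu_id;
    void *machine_data;     /* opaque, owned by the machine; replaces 'intc' */
} SketchCPU;

/* sPAPR-private per-CPU state: one presenter per interrupt mode */
typedef struct SketchSpaprCpuState {
    ICPState *icp;          /* XICS presenter */
    XiveNVT  *nvt;          /* XIVE presenter */
} SketchSpaprCpuState;

static SketchSpaprCpuState *spapr_cpu_state(SketchCPU *cpu)
{
    assert(cpu->machine_data != NULL);
    return cpu->machine_data;
}

static XiveNVT *sketch_get_nvt(SketchCPU *cpu)
{
    return cpu ? spapr_cpu_state(cpu)->nvt : NULL;
}

static ICPState *sketch_get_icp(SketchCPU *cpu)
{
    return cpu ? spapr_cpu_state(cpu)->icp : NULL;
}
```

Both presenters exist side by side, so switching interrupt mode would select a different field rather than rewrite a shared 'intc' pointer.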
>>
>>> +
>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>>      .name = TYPE_SPAPR_XIVE "/ive",
>>>      .version_id = 1,
>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>      dc->vmsd = &vmstate_spapr_xive;
>>>  
>>>      xfc->get_ive = spapr_xive_get_ive;
>>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>>  }
>>>  
>>>  static const TypeInfo spapr_xive_info = {
>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>> index dccad0318834..5691bb9474e4 100644
>>> --- a/hw/intc/xive.c
>>> +++ b/hw/intc/xive.c
>>> @@ -14,7 +14,278 @@
>>>  #include "sysemu/cpus.h"
>>>  #include "sysemu/dma.h"
>>>  #include "monitor/monitor.h"
>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>>>  #include "hw/ppc/xive.h"
>>> +#include "hw/ppc/xive_regs.h"
>>> +
>>> +/*
>>> + * XIVE Interrupt Presenter
>>> + */
>>> +
>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>>> +{
>>> +    return 0;
>>> +}
>>> +
>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>>> +{
>>> +    if (cppr > XIVE_PRIORITY_MAX) {
>>> +        cppr = 0xff;
>>> +    }
>>> +
>>> +    nvt->ring_os[TM_CPPR] = cppr;
>>
>> Surely this needs to recheck if we should be interrupting the cpu?
> 
> yes. In patch 9, when we introduce the nvt notify routine.
> 
>>> +}
>>> +
>>> +/*
>>> + * OS Thread Interrupt Management Area MMIO
>>> + */
>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>>> +                                           unsigned size)
>>> +{
>>> +    uint64_t ret = -1;
>>> +
>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>>> +        ret = xive_nvt_accept(nvt);
>>> +    } else {
>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>>> +                      HWADDR_PRIx" size %d\n", offset, size);
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +#define TM_RING(offset) ((offset) & 0xf0)
>>> +
>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>>> +                                      unsigned size)
>>> +{
>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>
>> So, as I said on a previous version of this, we can actually correctly
>> represent different mappings in different cpu spaces, by exploiting
>> cpu->as and not just having them all point to &address_space_memory.
> 
> Yes, you did, and I haven't studied the question yet. That's for the next version.
> 
>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>> +    uint64_t ret = -1;
>>> +    int i;
>>> +
>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>> +        return xive_tm_read_special(nvt, offset, size);
>>> +    }
>>> +
>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>>> +                      HWADDR_PRIx"\n", offset);
>>> +        return ret;
>>
>> Just returning -1 would be clearer here.
> 
> ok.
> 
>>
>>> +    }
>>> +
>>> +    ret = 0;
>>> +    for (i = 0; i < size; i++) {
>>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
>>> +    }
>>> +
>>> +    return ret;
>>> +}
>>> +
>>> +static bool xive_tm_is_readonly(uint8_t offset)
>>> +{
>>> +    return offset != TM_QW1_OS + TM_CPPR;
>>> +}
>>> +
>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
>>> +                                        uint64_t value, unsigned size)
>>> +{
>>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
>>> +
>>> +    /* TODO: support TM_SPC_ACK_OS_EL */
>>> +}
>>> +
>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>> +                                   uint64_t value, unsigned size)
>>> +{
>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>> +    int i;
>>> +
>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>> +        xive_tm_write_special(nvt, offset, value, size);
>>> +        return;
>>> +    }
>>> +
>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>
>> Why have this if you have separate OS and user regions as you appear
>> to do below?
> 
> This is another problem we are trying to solve. 
> 
> The registers a CPU can access depend on the TIMA view it is using. 
> The OS TIMA view only sees the OS ring registers. The HV view sees all. 

So, I took a deeper look at the specs and now understand the concepts
behind them in a little more detail. You need to make frequent
round-trips to this document...

These registers are accessible through four aligned pages, each exposing 
a different view of the registers. First page (page address ending 
in 0b00) gives access to the entire context and is reserved for the 
ring 0 security monitor. The second (page address ending in 0b01) 
is for the hypervisor, ring 1. The third (page address ending in 0b10) 
is for the operating system, ring 2. The fourth (page address ending 
in 0b11) is for user level, ring 3.

An sPAPR guest runs at OS privilege and can therefore only access
the OS and User rings, 2 and 3. The other two rings are reserved for
the security monitor and the hypervisor.
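To make the page-to-ring mapping concrete, here is a minimal standalone
sketch. It assumes each view is one page of 1 << TM_SHIFT bytes
(TM_SHIFT being 16 in the xive_regs.h of this patch) and that the offset
is relative to the start of the four-page TIMA window. The enum and
helper names are hypothetical and not part of the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one TIMA view per page of 1 << TM_SHIFT bytes */
#define TM_SHIFT 16

/* Hypothetical names for the four views */
enum tima_ring {
    TIMA_RING_SECURITY = 0, /* page 0b00: entire context, security monitor */
    TIMA_RING_HV       = 1, /* page 0b01: hypervisor */
    TIMA_RING_OS       = 2, /* page 0b10: operating system */
    TIMA_RING_USER     = 3, /* page 0b11: user level */
};

/* The two low bits of the page number select the view */
static inline enum tima_ring tima_ring_of(uint64_t offset)
{
    return (enum tima_ring)((offset >> TM_SHIFT) & 0x3);
}

/* A guest at OS privilege may only use the OS and User views */
static inline bool tima_spapr_accessible(uint64_t offset)
{
    enum tima_ring ring = tima_ring_of(offset);

    return ring == TIMA_RING_OS || ring == TIMA_RING_USER;
}
```

With this decoding, a single region covering the four pages can tell
which view an access came through before dispatching it.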

I will try to come up with a better implementation of the model and
make sure the ring numbers are respected. I am not sure whether we
should have a single memory region or four distinct ones, each with
its own ops, since loads and stores behave somewhat differently in
each view.
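If a single memory region were used, one possible approach (a sketch,
not what the patch does) is to filter plain register accesses with the
ring annotations already present in xive_regs.h: QW0 is visible to all
rings, QW1 to rings 0..2, QW2 and QW3 to rings 0..1. The helper name is
hypothetical, and the special operations at 0x800 and above would still
need per-view handling.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values copied from the patch (xive_regs.h and xive.h) */
#define TM_QW0_USER     0x000 /* All rings */
#define TM_QW1_OS       0x010 /* Ring 0..2 */
#define TM_QW2_HV_POOL  0x020 /* Ring 0..1 */
#define TM_QW3_HV_PHYS  0x030 /* Ring 0..1 */
#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10

/*
 * Hypothetical helper: may a view of ring 'ring' (0 = security
 * monitor, 1 = HV, 2 = OS, 3 = user) access the TM register at
 * 'offset' (offset within the page)?
 */
static bool tima_reg_visible(unsigned ring, uint64_t offset)
{
    if (ring > 3 || offset >= TM_RING_COUNT * TM_RING_SIZE) {
        return false;   /* bad ring, or special ops / unmapped area */
    }

    switch (offset & 0xf0) {    /* same mask as TM_RING() in the patch */
    case TM_QW0_USER:
        return true;
    case TM_QW1_OS:
        return ring <= 2;
    case TM_QW2_HV_POOL:
    case TM_QW3_HV_PHYS:
        return ring <= 1;
    default:
        return false;
    }
}
```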

C.


>> Or to look at it another way, shouldn't it be possible to make the
>> read/write accessors the same for the OS and user rings?
> 
> For some parts yes, but the special load/store addresses are different
> for each view, and so are the read-only registers. It seemed easier to duplicate.
> 
> I think the problem will become clearer (or worse) with pnv which uses 
> the HV mode.
>
> 
>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>>> +                      HWADDR_PRIx"\n", offset);
>>> +        return;
>>> +    }
>>> +
>>> +    switch (size) {
>>> +    case 1:
>>> +        if (offset == TM_QW1_OS + TM_CPPR) {
>>> +            xive_nvt_set_cppr(nvt, value & 0xff);
>>> +        }
>>> +        break;
>>> +    case 4:
>>> +    case 8:
>>> +        for (i = 0; i < size; i++) {
>>> +            if (!xive_tm_is_readonly(offset + i)) {
>>> +                nvt->regs[offset + i] = (value >> (8 * (size - i - 1))) & 0xff;
>>> +            }
>>> +        }
>>> +        break;
>>> +    default:
>>> +        g_assert_not_reached();
>>> +    }
>>> +}
>>> +
>>> +const MemoryRegionOps xive_tm_os_ops = {
>>> +    .read = xive_tm_os_read,
>>> +    .write = xive_tm_os_write,
>>> +    .endianness = DEVICE_BIG_ENDIAN,
>>> +    .valid = {
>>> +        .min_access_size = 1,
>>> +        .max_access_size = 8,
>>> +    },
>>> +    .impl = {
>>> +        .min_access_size = 1,
>>> +        .max_access_size = 8,
>>> +    },
>>> +};
>>> +
>>> +/*
>>> + * User Thread Interrupt Management Area MMIO
>>> + */
>>> +
>>> +static uint64_t xive_tm_user_read(void *opaque, hwaddr offset,
>>> +                                        unsigned size)
>>> +{
>>> +    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
>>> +                  HWADDR_PRIx"\n", offset);
>>> +    return -1;
>>> +}
>>> +
>>> +static void xive_tm_user_write(void *opaque, hwaddr offset,
>>> +                                     uint64_t value, unsigned size)
>>> +{
>>> +    qemu_log_mask(LOG_UNIMP, "XIVE: invalid access to User TIMA @%"
>>> +                  HWADDR_PRIx"\n", offset);
>>> +}
>>> +
>>> +
>>> +const MemoryRegionOps xive_tm_user_ops = {
>>> +    .read = xive_tm_user_read,
>>> +    .write = xive_tm_user_write,
>>> +    .endianness = DEVICE_BIG_ENDIAN,
>>> +    .valid = {
>>> +        .min_access_size = 1,
>>> +        .max_access_size = 8,
>>> +    },
>>> +    .impl = {
>>> +        .min_access_size = 1,
>>> +        .max_access_size = 8,
>>> +    },
>>> +};
>>> +
>>> +static char *xive_nvt_ring_print(uint8_t *ring)
>>> +{
>>> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
>>> +
>>> +    return g_strdup_printf("%02x  %02x   %02x  %02x    %02x   "
>>> +                   "%02x  %02x  %02x   %08x",
>>> +                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
>>> +                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
>>> +                   w2);
>>> +}
>>> +
>>> +void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon)
>>> +{
>>> +    int cpu_index = nvt->cs ? nvt->cs->cpu_index : -1;
>>> +    char *s;
>>> +
>>> +    monitor_printf(mon, "CPU[%04x]: QW    NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
>>> +                   " W2\n", cpu_index);
>>> +
>>> +    s = xive_nvt_ring_print(&nvt->regs[TM_QW1_OS]);
>>> +    monitor_printf(mon, "CPU[%04x]: OS    %s\n", cpu_index, s);
>>> +    g_free(s);
>>> +    s = xive_nvt_ring_print(&nvt->regs[TM_QW0_USER]);
>>> +    monitor_printf(mon, "CPU[%04x]: USER  %s\n", cpu_index, s);
>>> +    g_free(s);
>>> +}
>>> +
>>> +static void xive_nvt_reset(void *dev)
>>> +{
>>> +    XiveNVT *nvt = XIVE_NVT(dev);
>>> +
>>> +    memset(nvt->regs, 0, sizeof(nvt->regs));
>>> +}
>>> +
>>> +static void xive_nvt_realize(DeviceState *dev, Error **errp)
>>> +{
>>> +    XiveNVT *nvt = XIVE_NVT(dev);
>>> +    PowerPCCPU *cpu;
>>> +    CPUPPCState *env;
>>> +    Object *obj;
>>> +    Error *err = NULL;
>>> +
>>> +    obj = object_property_get_link(OBJECT(dev), ICP_PROP_CPU, &err);
>>
>> Please get rid of the remaining "ICP" naming in the xive code.
> 
> ok.  I will kill the define.
> 
>>> +    if (!obj) {
>>> +        error_propagate(errp, err);
>>> +        error_prepend(errp, "required link '" ICP_PROP_CPU "' not found: ");
>>> +        return;
>>> +    }
>>> +
>>> +    cpu = POWERPC_CPU(obj);
>>> +    nvt->cs = CPU(obj);
>>> +
>>> +    env = &cpu->env;
>>> +    switch (PPC_INPUT(env)) {
>>> +    case PPC_FLAGS_INPUT_POWER7:
>>> +        nvt->output = env->irq_inputs[POWER7_INPUT_INT];
>>> +        break;
>>> +
>>> +    default:
>>> +        error_setg(errp, "XIVE interrupt controller does not support "
>>> +                   "this CPU bus model");
>>> +        return;
>>> +    }
>>> +
>>> +    qemu_register_reset(xive_nvt_reset, dev);
>>
>> If this is a sysbus device, which I think it is, 
> 
> It is not. The TIMA MMIO region is in the sPAPRXive model but that might 
> change if we use cpu->as. I agree it would look better to have a memory
> region per cpu.
> 
>> you shouldn't need to
>> explicitly register a reset handler.  Instead you can set a device
>> reset handler which will be called with the reset.
>>
>>> +}
>>> +
>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
>>> +{
>>> +    qemu_unregister_reset(xive_nvt_reset, dev);
>>> +}
>>> +
>>> +static void xive_nvt_init(Object *obj)
>>> +{
>>> +    XiveNVT *nvt = XIVE_NVT(obj);
>>> +
>>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
>>
>> The ring_os field is basically pointless, being just an offset into a
>> structure you already have.  A macro or inline would be a better idea.
> 
> ok. I liked the idea but I agree it's overkill to have an init routine
> just for this. I will find something.
> 
>>> +}
>>> +
>>> +static const VMStateDescription vmstate_xive_nvt = {
>>> +    .name = TYPE_XIVE_NVT,
>>> +    .version_id = 1,
>>> +    .minimum_version_id = 1,
>>> +    .fields = (VMStateField[]) {
>>> +        VMSTATE_BUFFER(regs, XiveNVT),
>>> +        VMSTATE_END_OF_LIST()
>>> +    },
>>> +};
>>> +
>>> +static void xive_nvt_class_init(ObjectClass *klass, void *data)
>>> +{
>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>> +
>>> +    dc->realize = xive_nvt_realize;
>>> +    dc->unrealize = xive_nvt_unrealize;
>>> +    dc->desc = "XIVE Interrupt Presenter";
>>> +    dc->vmsd = &vmstate_xive_nvt;
>>> +}
>>> +
>>> +static const TypeInfo xive_nvt_info = {
>>> +    .name          = TYPE_XIVE_NVT,
>>> +    .parent        = TYPE_DEVICE,
>>> +    .instance_size = sizeof(XiveNVT),
>>> +    .instance_init = xive_nvt_init,
>>> +    .class_init    = xive_nvt_class_init,
>>> +};
>>>  
>>>  /*
>>>   * XIVE Fabric
>>> @@ -27,6 +298,13 @@ XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn)
>>>      return xfc->get_ive(xf, lisn);
>>>  }
>>>  
>>> +XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server)
>>> +{
>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xf);
>>> +
>>> +    return xfc->get_nvt(xf, server);
>>> +}
>>> +
>>>  static void xive_fabric_route(XiveFabric *xf, int lisn)
>>>  {
>>>  
>>> @@ -418,6 +696,7 @@ static void xive_register_types(void)
>>>  {
>>>      type_register_static(&xive_source_info);
>>>      type_register_static(&xive_fabric_info);
>>> +    type_register_static(&xive_nvt_info);
>>>  }
>>>  
>>>  type_init(xive_register_types)
>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>> index 4538c622b60a..25d78eec884d 100644
>>> --- a/include/hw/ppc/spapr_xive.h
>>> +++ b/include/hw/ppc/spapr_xive.h
>>> @@ -25,6 +25,11 @@ typedef struct sPAPRXive {
>>>      /* Routing table */
>>>      XiveIVE      *ivt;
>>>      uint32_t     nr_irqs;
>>> +
>>> +    /* TIMA memory regions */
>>> +    hwaddr       tm_base;
>>> +    MemoryRegion tm_mmio_user;
>>> +    MemoryRegion tm_mmio_os;
>>>  } sPAPRXive;
>>>  
>>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>> index 57295715a4a5..1a2da610d91c 100644
>>> --- a/include/hw/ppc/xive.h
>>> +++ b/include/hw/ppc/xive.h
>>> @@ -20,6 +20,7 @@ typedef struct XiveFabric XiveFabric;
>>>   */
>>>  
>>>  #define XIVE_VC_BASE   0x0006010000000000ull
>>> +#define XIVE_TM_BASE   0x0006030203180000ull
>>>  
>>>  /*
>>>   * XIVE Interrupt Source
>>> @@ -155,6 +156,34 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>>  }
>>>  
>>>  /*
>>> + * XIVE Interrupt Presenter
>>> + */
>>> +
>>> +#define TYPE_XIVE_NVT "xive-nvt"
>>> +#define XIVE_NVT(obj) OBJECT_CHECK(XiveNVT, (obj), TYPE_XIVE_NVT)
>>> +
>>> +#define TM_RING_COUNT           4
>>> +#define TM_RING_SIZE            0x10
>>> +
>>> +typedef struct XiveNVT {
>>> +    DeviceState parent_obj;
>>> +
>>> +    CPUState  *cs;
>>> +    qemu_irq  output;
>>> +
>>> +    /* Thread interrupt Management (TM) registers */
>>> +    uint8_t   regs[TM_RING_COUNT * TM_RING_SIZE];
>>> +
>>> +    /* Shortcuts to rings */
>>> +    uint8_t   *ring_os;
>>> +} XiveNVT;
>>> +
>>> +extern const MemoryRegionOps xive_tm_user_ops;
>>> +extern const MemoryRegionOps xive_tm_os_ops;
>>> +
>>> +void xive_nvt_pic_print_info(XiveNVT *nvt, Monitor *mon);
>>> +
>>> +/*
>>>   * XIVE Fabric
>>>   */
>>>  
>>> @@ -175,8 +204,10 @@ typedef struct XiveFabricClass {
>>>      void (*notify)(XiveFabric *xf, uint32_t lisn);
>>>  
>>>      XiveIVE *(*get_ive)(XiveFabric *xf, uint32_t lisn);
>>> +    XiveNVT *(*get_nvt)(XiveFabric *xf, uint32_t server);
>>>  } XiveFabricClass;
>>>  
>>>  XiveIVE *xive_fabric_get_ive(XiveFabric *xf, uint32_t lisn);
>>> +XiveNVT *xive_fabric_get_nvt(XiveFabric *xf, uint32_t server);
>>>  
>>>  #endif /* PPC_XIVE_H */
>>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>>> index 5903f29eb789..f2e2a1ac8f6e 100644
>>> --- a/include/hw/ppc/xive_regs.h
>>> +++ b/include/hw/ppc/xive_regs.h
>>> @@ -10,6 +10,88 @@
>>>  #ifndef _PPC_XIVE_REGS_H
>>>  #define _PPC_XIVE_REGS_H
>>>  
>>> +#define TM_SHIFT                16
>>> +
>>> +/* TM register offsets */
>>> +#define TM_QW0_USER             0x000 /* All rings */
>>> +#define TM_QW1_OS               0x010 /* Ring 0..2 */
>>> +#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
>>> +#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
>>> +
>>> +/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
>>> +#define TM_NSR                  0x0  /*  +   +   -   +  */
>>> +#define TM_CPPR                 0x1  /*  -   +   -   +  */
>>> +#define TM_IPB                  0x2  /*  -   +   +   +  */
>>> +#define TM_LSMFB                0x3  /*  -   +   +   +  */
>>> +#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
>>> +#define TM_INC                  0x5  /*  -   +   -   +  */
>>> +#define TM_AGE                  0x6  /*  -   +   -   +  */
>>> +#define TM_PIPR                 0x7  /*  -   +   -   +  */
>>> +
>>> +#define TM_WORD0                0x0
>>> +#define TM_WORD1                0x4
>>> +
>>> +/*
>>> + * QW word 2 contains the valid bit at the top and other fields
>>> + * depending on the QW.
>>> + */
>>> +#define TM_WORD2                0x8
>>> +#define   TM_QW0W2_VU           PPC_BIT32(0)
>>> +#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
>>> +#define   TM_QW1W2_VO           PPC_BIT32(0)
>>> +#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
>>> +#define   TM_QW2W2_VP           PPC_BIT32(0)
>>> +#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
>>> +#define   TM_QW3W2_VT           PPC_BIT32(0)
>>> +#define   TM_QW3W2_LP           PPC_BIT32(6)
>>> +#define   TM_QW3W2_LE           PPC_BIT32(7)
>>> +#define   TM_QW3W2_T            PPC_BIT32(31)
>>> +
>>> +/*
>>> + * In addition to normal loads to "peek" and writes (only when invalid)
>>> + * using 4 and 8 bytes accesses, the above registers support these
>>> + * "special" byte operations:
>>> + *
>>> + *   - Byte load from QW0[NSR] - User level NSR (EBB)
>>> + *   - Byte store to QW0[NSR] - User level NSR (EBB)
>>> + *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
>>> + *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
>>> + *                                    otherwise VT||0000000
>>> + *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
>>> + *
>>> + * Then we have all these "special" CI ops at these offset that trigger
>>> + * all sorts of side effects:
>>> + */
>>> +#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg*/
>>> +#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
>>> +#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
>>> +#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
>>> +                                         * context */
>>> +#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
>>> +#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
>>> +                                         * context to reg */
>>> +#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
>>> +                                         * context to reg*/
>>> +#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
>>> +#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
>>> +                                         * line */
>>> +#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
>>> +#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
>>> +                                         * line */
>>> +#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
>>> +/* XXX more... */
>>> +
>>> +/* NSR fields for the various QW ack types */
>>> +#define TM_QW0_NSR_EB           PPC_BIT8(0)
>>> +#define TM_QW1_NSR_EO           PPC_BIT8(0)
>>> +#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
>>> +#define  TM_QW3_NSR_HE_NONE     0
>>> +#define  TM_QW3_NSR_HE_POOL     1
>>> +#define  TM_QW3_NSR_HE_PHYS     2
>>> +#define  TM_QW3_NSR_HE_LSI      3
>>> +#define TM_QW3_NSR_I            PPC_BIT8(2)
>>> +#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
>>> +
>>>  /* IVE/EAS
>>>   *
>>>   * One per interrupt source. Targets that interrupt to a given EQ
>>> @@ -30,4 +112,6 @@ typedef struct XiveIVE {
>>>  #define IVE_EQ_DATA     PPC_BITMASK(33, 63)      /* Data written to the EQ */
>>>  } XiveIVE;
>>>  
>>> +#define XIVE_PRIORITY_MAX  7
>>> +
>>>  #endif /* _INTC_XIVE_INTERNAL_H */
>>
> 

* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-26 12:16             ` Cédric Le Goater
@ 2018-04-27  2:43               ` David Gibson
  2018-05-04 14:25                 ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-27  2:43 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 26, 2018 at 02:16:06PM +0200, Cédric Le Goater wrote:
> On 04/26/2018 05:28 AM, David Gibson wrote:
> > On Tue, Apr 24, 2018 at 10:11:27AM +0200, Cédric Le Goater wrote:
> >> On 04/24/2018 08:41 AM, David Gibson wrote:
> >>> On Mon, Apr 23, 2018 at 09:31:24AM +0200, Cédric Le Goater wrote:
> >>>> On 04/23/2018 08:44 AM, David Gibson wrote:
> >>>>> On Thu, Apr 19, 2018 at 02:42:58PM +0200, Cédric Le Goater wrote:
> >>>>>> The 'sent' status of the LSI interrupt source is modeled with the 'P'
> >>>>>> bit of the ESB and the assertion status of the source is maintained in
> >>>>>> an array under the main sPAPRXive object. The type of the source is
> >>>>>> stored in the same array for practical reasons.
> >>>>>>
> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>> ---
> >>>>>>  hw/intc/xive.c        | 54 +++++++++++++++++++++++++++++++++++++++++++++++----
> >>>>>>  include/hw/ppc/xive.h | 16 +++++++++++++++
> >>>>>>  2 files changed, 66 insertions(+), 4 deletions(-)
> >>>>>>
> >>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>>>> index c70578759d02..060976077dd7 100644
> >>>>>> --- a/hw/intc/xive.c
> >>>>>> +++ b/hw/intc/xive.c
> >>>>>> @@ -104,6 +104,21 @@ static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>>>>>  
> >>>>>>  }
> >>>>>>  
> >>>>>> +/*
> >>>>>> + * LSI interrupt sources use the P bit and a custom assertion flag
> >>>>>> + */
> >>>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
> >>>>>> +{
> >>>>>> +    uint8_t old_pq = xive_source_pq_get(xsrc, srcno);
> >>>>>> +
> >>>>>> +    if  (old_pq == XIVE_ESB_RESET &&
> >>>>>> +         xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
> >>>>>> +        xive_source_pq_set(xsrc, srcno, XIVE_ESB_PENDING);
> >>>>>> +        return true;
> >>>>>> +    }
> >>>>>> +    return false;
> >>>>>> +}
> >>>>>> +
> >>>>>>  /* In a two pages ESB MMIO setting, even page is the trigger page, odd
> >>>>>>   * page is for management */
> >>>>>>  static inline bool xive_source_is_trigger_page(hwaddr addr)
> >>>>>> @@ -133,6 +148,13 @@ static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> >>>>>>           */
> >>>>>>          ret = xive_source_pq_eoi(xsrc, srcno);
> >>>>>>  
> >>>>>> +        /* If the LSI source is still asserted, forward a new source
> >>>>>> +         * event notification */
> >>>>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >>>>>> +            if (xive_source_lsi_trigger(xsrc, srcno)) {
> >>>>>> +                xive_source_notify(xsrc, srcno);
> >>>>>> +            }
> >>>>>> +        }
> >>>>>>          break;
> >>>>>>  
> >>>>>>      case XIVE_ESB_GET:
> >>>>>> @@ -183,6 +205,14 @@ static void xive_source_esb_write(void *opaque, hwaddr addr,
> >>>>>>           * notification
> >>>>>>           */
> >>>>>>          notify = xive_source_pq_eoi(xsrc, srcno);
> >>>>>> +
> >>>>>> +        /* LSI sources do not set the Q bit but they can still be
> >>>>>> +         * asserted, in which case we should forward a new source
> >>>>>> +         * event notification
> >>>>>> +         */
> >>>>>> +        if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >>>>>> +            notify = xive_source_lsi_trigger(xsrc, srcno);
> >>>>>> +        }
> >>>>
> >>>> FYI, I have moved that common test under xive_source_pq_eoi()
> >>>
> >>> Ok.
> >>>
> >>>>>>          break;
> >>>>>>  
> >>>>>>      default:
> >>>>>> @@ -216,8 +246,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
> >>>>>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
> >>>>>>      bool notify = false;
> >>>>>>  
> >>>>>> -    if (val) {
> >>>>>> -        notify = xive_source_pq_trigger(xsrc, srcno);
> >>>>>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >>>>>> +        if (val) {
> >>>>>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> >>>>>> +        } else {
> >>>>>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> >>>>>> +        }
> >>>>>> +        notify = xive_source_lsi_trigger(xsrc, srcno);
> >>>>>> +    } else {
> >>>>>> +        if (val) {
> >>>>>> +            notify = xive_source_pq_trigger(xsrc, srcno);
> >>>>>> +        }
> >>>>>>      }
> >>>>>>  
> >>>>>>      /* Forward the source event notification for routing */
> >>>>>> @@ -234,13 +273,13 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >>>>>>                     xsrc->offset, xsrc->offset + xsrc->nr_irqs - 1);
> >>>>>>      for (i = 0; i < xsrc->nr_irqs; i++) {
> >>>>>>          uint8_t pq = xive_source_pq_get(xsrc, i);
> >>>>>> -        uint32_t lisn = i  + xsrc->offset;
> >>>>>>  
> >>>>>>          if (pq == XIVE_ESB_OFF) {
> >>>>>>              continue;
> >>>>>>          }
> >>>>>>  
> >>>>>> -        monitor_printf(mon, "  %4x %c%c\n", lisn,
> >>>>>> +        monitor_printf(mon, "  %4x %s %c%c\n", i + xsrc->offset,
> >>>>>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
> >>>>>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >>>>>>                         pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >>>>>>      }
> >>>>>> @@ -249,6 +288,12 @@ void xive_source_pic_print_info(XiveSource *xsrc, Monitor *mon)
> >>>>>>  static void xive_source_reset(DeviceState *dev)
> >>>>>>  {
> >>>>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >>>>>> +    int i;
> >>>>>> +
> >>>>>> +    /* Keep the IRQ type */
> >>>>>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >>>>>> +        xsrc->status[i] &= ~XIVE_STATUS_ASSERTED;
> >>>>>> +    }
> >>>>>>  
> >>>>>>      /* SBEs are initialized to 0b01 which corresponds to "ints off" */
> >>>>>>      memset(xsrc->sbe, 0x55, xsrc->sbe_size);
> >>>>>> @@ -273,6 +318,7 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
> >>>>>>  
> >>>>>>      xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> >>>>>>                                       xsrc->nr_irqs);
> >>>>>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
> >>>>>>  
> >>>>>>      /* Allocate the SBEs (State Bit Entry). 2 bits, so 4 entries per byte */
> >>>>>>      xsrc->sbe_size = DIV_ROUND_UP(xsrc->nr_irqs, 4);
> >>>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >>>>>> index d92a50519edf..0b76dd278d9b 100644
> >>>>>> --- a/include/hw/ppc/xive.h
> >>>>>> +++ b/include/hw/ppc/xive.h
> >>>>>> @@ -33,6 +33,9 @@ typedef struct XiveSource {
> >>>>>>      uint32_t     nr_irqs;
> >>>>>>      uint32_t     offset;
> >>>>>>      qemu_irq     *qirqs;
> >>>>>> +#define XIVE_STATUS_LSI         0x1
> >>>>>> +#define XIVE_STATUS_ASSERTED    0x2
> >>>>>> +    uint8_t      *status;
> >>>>>
> >>>>> I don't love the idea of mixing configuration information (STATUS_LSI)
> >>>>> with runtime state information (ASSERTED) in the same field.  Any
> >>>>> reason not to have these as parallel bitmaps.
> >>>>
> >>>> none. I can change that. 
> >>>
> >>> Ok.
> >>>
> >>>>> Come to that.. is there a compelling reason to allow any individual
> >>>>> irq to be marked LSI or MSI, rather than using separate XiveSource
> >>>>> objects for MSIs and LSIs?
> >>>>
> >>>> yes. I would have preferred two distinct interrupt source objects but 
> >>>> this is to be compatible with XICS, which uses only one. If we want
> >>>> to be able to change interrupt mode, the IRQ number space should be
> >>>> organized in the exact same way. Or we should change XICS also.
> >>>>
> >>>> Also, the change (a bitmap) is really small.
> >>>
> >>> Hrm, but since XIVE supports thousands of irqs, it could be quite a
> >>> large bitmap.
> >>
> >> Yes. The change is small, not the bitmap.
> >>  
> >>> It's not impossible - in fact, not really even that hard - to change
> >>> the existing irq layout on xics.  It does need a new machine type
> >>> variant, of course.
> >>
> >> I did some work on that topic a while ago :
> >>
> >> 	https://patchwork.ozlabs.org/cover/836782/
> >>
> >> But we stopped exploring the idea. Maybe it was not the right approach.
> >> The PHB LSIs would benefit from such a split though.
> > 
> > So, no, I don't think that was a good approach, but that doesn't mean
> > other ways of rearranging the irq numbers aren't ok.  The thing here
> > is that we don't want to think of an "irq allocator" - there are some
> > bits like that in there already, but they were always a mistake.
> > 
> > We have lots of irq space (both XICS and XIVE) so instead we should
> > come up with a static mapping of irqs to devices.
> 
> yes. I would prefer that also. 
> 
> We could change the spapr_irq_alloc() routine to get a block of
> IRQs in the range defined for a device family, and use a device
> id as the offset within that family range? Here are some figures:
> 
> device family        block size  max devices  
> 
> EVENT_CLASS_EPOW              1           1  
> EVENT_CLASS_HOT_PLUG          1           1   
> VIO_VSCSI                     1          10  
> VIO_LLAN                      1          10  
> VIO_VTY                       1           5  
>                       
> PCI/PHB                    1024           5  

No, I'm thinking we should eliminate spapr_irq_alloc() entirely.
Well, ok, not entirely, we'll still need it for the old machine
types.  But remove its use for the current machine type completely.

Instead we have an explicit map of ranges for various purposes.  The
one-off things like EPOW and HOTPLUG can have plain constant values.
PCI LSIs will be calculated as something like PCI_IRQ_BASE + <phb
index>*4 + <irq pin>.  The VIO devices we handle as VIO_BASE + <reg
value> or something.

MSIs will still need some sort of allocation, but we can do that
within a range set aside for them.
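A minimal standalone sketch of such a static map. Every base value, size
and name below is a hypothetical placeholder, not an agreed layout. The
VIO helper assumes only the low byte of the reg property is needed to
tell devices apart, and MSIs use a simple first-fit allocator over a
reserved range.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: all values are placeholders */
#define SPAPR_IRQ_EPOW      0x1000  /* one-off: plain constant */
#define SPAPR_IRQ_HOTPLUG   0x1001  /* one-off: plain constant */
#define SPAPR_IRQ_VIO_BASE  0x1100  /* up to 256 VIO devices */
#define SPAPR_IRQ_PCI_BASE  0x1200  /* 4 LSIs (INTA..INTD) per PHB */
#define SPAPR_IRQ_MSI_BASE  0x2000  /* dynamically allocated MSIs */
#define SPAPR_IRQ_MSI_NR    1024

/* PCI LSI: base + <phb index> * 4 + <irq pin>, pin 0..3 = INTA..INTD */
static uint32_t spapr_irq_pci_lsi(uint32_t phb_index, uint32_t pin)
{
    return SPAPR_IRQ_PCI_BASE + phb_index * 4 + pin;
}

/* VIO: base + reg, assuming the low byte of reg is unique per device */
static uint32_t spapr_irq_vio(uint32_t reg)
{
    return SPAPR_IRQ_VIO_BASE + (reg & 0xff);
}

/* MSIs: first-fit allocation of a contiguous block in a reserved range */
static bool msi_used[SPAPR_IRQ_MSI_NR];

static int spapr_irq_msi_alloc(uint32_t num)
{
    if (num == 0 || num > SPAPR_IRQ_MSI_NR) {
        return -1;
    }
    for (uint32_t start = 0; start + num <= SPAPR_IRQ_MSI_NR; start++) {
        uint32_t i = 0;

        while (i < num && !msi_used[start + i]) {
            i++;
        }
        if (i == num) {
            for (i = 0; i < num; i++) {
                msi_used[start + i] = true;
            }
            return SPAPR_IRQ_MSI_BASE + start;
        }
    }
    return -1;  /* range exhausted */
}

static void spapr_irq_msi_free(uint32_t irq, uint32_t num)
{
    for (uint32_t i = 0; i < num; i++) {
        msi_used[irq - SPAPR_IRQ_MSI_BASE + i] = false;
    }
}
```

Only the MSI range keeps any allocator state; every other number is a
pure function of the device's identity, so it is stable across
migration without extra bookkeeping.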

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-26 10:30             ` Cédric Le Goater
@ 2018-04-27  6:32               ` David Gibson
  2018-05-02 15:28                 ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-04-27  6:32 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 26, 2018 at 12:30:42PM +0200, Cédric Le Goater wrote:
> On 04/26/2018 05:54 AM, David Gibson wrote:
> > On Tue, Apr 24, 2018 at 11:33:11AM +0200, Cédric Le Goater wrote:
> >> On 04/24/2018 08:46 AM, David Gibson wrote:
> >>> On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
> >>>> On 04/23/2018 08:46 AM, David Gibson wrote:
> >>>>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
> >>>>>> The XiveFabric offers a simple interface, between the XiveSource
> >>>>>> object and the device model owning the interrupt sources, to forward
> >>>>>> an event notification to the XIVE interrupt controller of the machine
> >>>>>> and, if the owner is the controller, to call the routing
> >>>>>> sub-engine directly.
> >>>>>>
> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>> ---
> >>>>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
> >>>>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
> >>>>>>  2 files changed, 61 insertions(+), 1 deletion(-)
> >>>>>>
> >>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>>>> index 060976077dd7..b4c3d06c1219 100644
> >>>>>> --- a/hw/intc/xive.c
> >>>>>> +++ b/hw/intc/xive.c
> >>>>>> @@ -17,6 +17,21 @@
> >>>>>>  #include "hw/ppc/xive.h"
> >>>>>>  
> >>>>>>  /*
> >>>>>> + * XIVE Fabric
> >>>>>> + */
> >>>>>> +
> >>>>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
> >>>>>> +{
> >>>>>> +
> >>>>>> +}
> >>>>>> +
> >>>>>> +static const TypeInfo xive_fabric_info = {
> >>>>>> +    .name = TYPE_XIVE_FABRIC,
> >>>>>> +    .parent = TYPE_INTERFACE,
> >>>>>> +    .class_size = sizeof(XiveFabricClass),
> >>>>>> +};
> >>>>>> +
> >>>>>> +/*
> >>>>>>   * XIVE Interrupt Source
> >>>>>>   */
> >>>>>>  
> >>>>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
> >>>>>>  
> >>>>>>  /*
> >>>>>>   * Forward the source event notification to the associated XiveFabric,
> >>>>>> - * the device owning the sources.
> >>>>>> + * the device owning the sources, or perform the routing if the device
> >>>>>> + * is the interrupt controller.
> >>>>>>   */
> >>>>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>>>>>  {
> >>>>>>  
> >>>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
> >>>>>> +
> >>>>>> +    if (xfc->notify) {
> >>>>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
> >>>>>> +    } else {
> >>>>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
> >>>>>> +    }
> >>>>>
> >>>>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
> >>>>> to xive_fabric_route if that's what it wants?
> >>>> Under sPAPR, all the sources, IPIs and virtual device interrupts, 
> >>>> generate events which are directly routed by xive_fabric_route(). 
> >>>> There is indeed no need for an extra hop.
> >>>
> >>> Ok.
> >>>
> >>>> Under PowerNV, some sources forward the notification to the routing 
> >>>> engine using a specific MMIO load on a notify address which is stored 
> >>>> in one of the controller registers. So we need a hop to reach the 
> >>>> device model, owning the sources, and do that load :
> >>>
> >>> Hm.  So you're saying that in pnv some sources send their notification
> >>> to some other unit, 
> >>
> >> Not to any unit/device, to the device owning the sources.
> >>
> >> For the XiveSource object under PSI, the XIVEFabric interface is the
> >> PSI device object itself, which knows how to forward the notification
> >> on the XIVE Power "bus". To be more precise, the PSI HB device has
> >> 14 interrupt sources, whose notifications are forwarded using an MMIO
> >> load to some address. The load address is configured (by skiboot) in
> >> one of the PSI device registers, and points to an MMIO region of the
> >> main XIVE interrupt controller.
> >>
> >> The PHB4 sources should be the same.
> >>
> >> For the XiveSource object (all interrupts) under sPAPRXive, the 
> >> XIVEFabric is the main interrupt controller sPAPRXive.
> >>
> >> For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
> >> also the main interrupt controller PnvXive.
> > 
> > Hrm.  Apparently I'm missing something, I'm really not getting what
> > you're trying to explain here.
> 
> I see that. Let's try again.
> 
> >>> that would then (after possible masking) forward on to the overall xive fabric?
> >>
> >> yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?
> > 
> > Maybe..?
> > 
> >>> That seems like a property of the source object, 
> >>
> >> The source object is generic. It's a bunch of PQ bits that can be 
> >> controlled by MMIOs. Nothing more.
> > 
> > Hmm.  Isn't the source object also responsible for forwarding the
> > interrupt to something up the chain (whatever that is)?
> 
> Yes, but it cannot forward directly. The XiveSource is generic and
> can only call a handler:
> 
> 	xfc->notify(xsrc->xive, srcno + xsrc->offset);

But.. your patch doesn't do that always, it's conditional which I
still don't understand.

> The device model owner, the parent of the XiveSource object, would 
> do the real forward.

Why?  I mean the XiveSource basically represents the xive irq related
logic of the PHB or whatever, why would it not represent *all* of
that, rather than just the ESB bits, meaning the owner has to have
some more xive logic for the forwarding.

Note that I don't think the fact that some sources notify via mmio and
some are internal really matters.  It's not like we're modelling the
power bus down to the wire-transaction level.

> It's very similar to what we have today with XICS :
> 
> 	- The sPAPR model has an ICSState  
> 	- The PnvPSI model has an ICSState 
> 	- The PnvPHB3 model has two ICSStates
> 
> and the 'xics' pointer in ICSState points to the 'interrupt unit' of
> the machine to do resends and to grab ICPs. So it is used for routing
> essentially.

Hmm.  I think you and I are looking at XICSFabric kind of
differently.  As I see it, it's not really an active component at
all.  Rather it's basically a global "map" of the xics components so
that they can find each other.

> in Xive 
> 
> 	- sPAPRXive model has a XiveSource
> 	- PnvXive model has a XiveSource
> 	- PnvPSI model has a XiveSource
> 	- PnvPHB4 model should have also.
> 
> and the 'xive' pointer in XiveSource points to the parent object,

Uh.. yeah.. the xics pointer in ICS units doesn't point to the parent
object, except maybe by accident.  It's absolutely intended to be
global, and so points to the machine.

> which will handle the event notification forwarding or routing.

Ok, how about this for a partial model.  We have:

XiveSource objects:
	* Owns an ESB table
	* Knows the mapping of its local irq offsets to global irq
	  numbers
	* Provides the mmio interface for ESB manipulation
	* When necessary, notifies a new interrupt to a XiveRouter

XiveRouter objects:
	* Responsible for a fixed range of global irq numbers
	* Owns an IVT (but what that means can vary, see below)
	* When notified of an irq, routes it to the appropriate EQ
	  (haven't thought about this part yet)
	* Abstract class - needs subclasses to define how to get IVEs

XiveFabric interface:
	* Lets XIVE components locate each other
	* get_router() method: maps a global irq number to XiveRouter
	  object
	* Always global (implemented on the machine)

On pseries we have:

	? XiveSource objects.  We probably only need one, but 1 for
	LSI and one for MSI might be convenient.  More wouldn't break
	the model

	1 sPAPRXiveRouter.  This is a subclass of XiveRouter that
	holds the IVT internally (and migrates it).

	the XiveFabric implementation always returns the single global
	router for get_router()

On powernv we have:
	N XiveSource objects.  Some in PHBs, some extra ones on each
	chip

	(#chips) PowerXiveRouter objects.  This subclass of XiveRouter
	stores the register giving the IVT base address and migrates
	that, but the IVT contents are in RAM

	the XiveFabric get_router() implementation returns the right
	chip's router based on the irq number

Obviously the router->EQ side still needs a bunch of thought.
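A minimal C sketch of the partial model above, assuming plain structs and function pointers in place of QOM objects and interfaces. Every type and function name here is hypothetical, and EQ routing is deliberately left out, as in the proposal. A range-based get_router() covers both the single pseries router and one router per powernv chip.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for the proposed QOM types. */
typedef struct XiveRouter XiveRouter;

typedef struct XiveRouterOps {
    /* Subclass hook: how to fetch the IVE for a global irq number. */
    bool (*get_ive)(XiveRouter *xrtr, uint32_t lisn, uint64_t *ive);
} XiveRouterOps;

struct XiveRouter {
    const XiveRouterOps *ops;
    uint32_t first_irq;   /* fixed range of global irq numbers */
    uint32_t nr_irqs;
    uint64_t *ivt;        /* internal IVT, used by the pseries-like hook */
    uint32_t last_routed; /* for illustration only */
};

typedef struct XiveFabric {
    /* Always global: maps a global irq number to its router. */
    XiveRouter *(*get_router)(struct XiveFabric *xf, uint32_t lisn);
    XiveRouter **routers;
    size_t nr_routers;
} XiveFabric;

typedef struct XiveSource {
    uint32_t nr_irqs;
    uint32_t offset;      /* local srcno + offset = global irq number */
    XiveFabric *fabric;
} XiveSource;

/* One possible subclass: the IVT is held internally (pseries-like). */
static bool internal_get_ive(XiveRouter *xrtr, uint32_t lisn, uint64_t *ive)
{
    assert(lisn - xrtr->first_irq < xrtr->nr_irqs);
    *ive = xrtr->ivt[lisn - xrtr->first_irq];
    return true;
}

static void xive_router_notify(XiveRouter *xrtr, uint32_t lisn)
{
    uint64_t ive;

    if (!xrtr->ops->get_ive(xrtr, lisn, &ive)) {
        return;
    }
    /* Routing to an EQ would happen here; not modelled. */
    xrtr->last_routed = lisn;
}

/* get_router() by irq range: one router on pseries, one per chip on powernv. */
static XiveRouter *fabric_get_router(XiveFabric *xf, uint32_t lisn)
{
    for (size_t i = 0; i < xf->nr_routers; i++) {
        XiveRouter *r = xf->routers[i];
        if (lisn >= r->first_irq && lisn - r->first_irq < r->nr_irqs) {
            return r;
        }
    }
    return NULL;
}

static void xive_source_notify(XiveSource *xsrc, uint32_t srcno)
{
    uint32_t lisn = srcno + xsrc->offset;
    XiveRouter *xrtr = xsrc->fabric->get_router(xsrc->fabric, lisn);

    if (xrtr) {
        xive_router_notify(xrtr, lisn);
    }
}
```

On pseries the `routers` array would simply hold one entry; on powernv it would hold one router per chip.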

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-26  7:11   ` David Gibson
  2018-04-26  9:27     ` Cédric Le Goater
@ 2018-05-02  7:39     ` Cédric Le Goater
  2018-05-03  5:43       ` David Gibson
  1 sibling, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-02  7:39 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

>>  
>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>> +{
>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>> +
>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>> +}
> 
> So this is a bit of a tangent, but I've been thinking of implementing
> a scheme where there's an opaque pointer in the cpu structure for the
> use of the machine.  I'm planning for that to replace the intc pointer
> (which isn't really used directly by the cpu). That would allow us to
> have spapr put a structure there and have both xics and xive pointers
> which could be useful later on.

Here is a quick try of the idea. Tested on pnv and spapr machines.
I lacked inspiration on the name so I called the object {Machine}Link. 
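For comparison, a plain-C sketch of the opaque per-CPU pointer idea, where the machine stores a structure holding both a XICS and a XIVE presenter pointer. The types below are hypothetical stand-ins, not the real QEMU ICPState/XiveNVT, and a void pointer replaces the QOM Object link used in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the real presenter types. */
typedef struct ICPState { int server; } ICPState;
typedef struct XiveNVT { int server; } XiveNVT;

/* Machine-private per-CPU data; the CPU never looks inside it. */
typedef struct MachineCPULink {
    ICPState *icp;   /* XICS presenter, NULL if XICS is not in use */
    XiveNVT *nvt;    /* XIVE presenter, NULL if XIVE is not in use */
} MachineCPULink;

typedef struct PowerPCCPU {
    int vcpu_id;
    void *machine_data;  /* opaque pointer owned by the machine */
} PowerPCCPU;

static void machine_link_attach(PowerPCCPU *cpu, MachineCPULink *link)
{
    assert(cpu->machine_data == NULL);  /* attached once, when realizing */
    cpu->machine_data = link;
}

static ICPState *machine_link_icp(PowerPCCPU *cpu)
{
    MachineCPULink *link = cpu->machine_data;
    return link ? link->icp : NULL;
}

static XiveNVT *machine_link_nvt(PowerPCCPU *cpu)
{
    MachineCPULink *link = cpu->machine_data;
    return link ? link->nvt : NULL;
}
```

The CPU only carries the pointer; whether the machine fills in `icp`, `nvt` or both is a machine decision.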

Thanks,

C.


>From 107808feda62c09b2df9a60aba5b30127ffab976 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?C=C3=A9dric=20Le=20Goater?= <clg@kaod.org>
Date: Wed, 2 May 2018 09:24:37 +0200
Subject: [PATCH] ppc: introduce a link Object between the CPU and the machine
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xics_spapr.c    | 10 +++----
 hw/ppc/pnv.c            | 80 +++++++++++++++++++++++++++++++++++++++++++++++--
 hw/ppc/pnv_core.c       |  2 +-
 hw/ppc/spapr.c          | 77 +++++++++++++++++++++++++++++++++++++++++++++--
 hw/ppc/spapr_cpu_core.c |  5 ++--
 include/hw/ppc/pnv.h    |  3 ++
 include/hw/ppc/spapr.h  |  3 ++
 target/ppc/cpu.h        |  2 +-
 8 files changed, 167 insertions(+), 15 deletions(-)

diff --git a/hw/intc/xics_spapr.c b/hw/intc/xics_spapr.c
index 2e27b92b871a..9cd560bdd093 100644
--- a/hw/intc/xics_spapr.c
+++ b/hw/intc/xics_spapr.c
@@ -44,7 +44,7 @@ static target_ulong h_cppr(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 {
     target_ulong cppr = args[0];
 
-    icp_set_cppr(ICP(cpu->intc), cppr);
+    icp_set_cppr(spapr_link_icp(cpu), cppr);
     return H_SUCCESS;
 }
 
@@ -65,7 +65,7 @@ static target_ulong h_ipi(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 static target_ulong h_xirr(PowerPCCPU *cpu, sPAPRMachineState *spapr,
                            target_ulong opcode, target_ulong *args)
 {
-    uint32_t xirr = icp_accept(ICP(cpu->intc));
+    uint32_t xirr = icp_accept(spapr_link_icp(cpu));
 
     args[0] = xirr;
     return H_SUCCESS;
@@ -74,7 +74,7 @@ static target_ulong h_xirr(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 static target_ulong h_xirr_x(PowerPCCPU *cpu, sPAPRMachineState *spapr,
                              target_ulong opcode, target_ulong *args)
 {
-    uint32_t xirr = icp_accept(ICP(cpu->intc));
+    uint32_t xirr = icp_accept(spapr_link_icp(cpu));
 
     args[0] = xirr;
     args[1] = cpu_get_host_ticks();
@@ -86,7 +86,7 @@ static target_ulong h_eoi(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 {
     target_ulong xirr = args[0];
 
-    icp_eoi(ICP(cpu->intc), xirr);
+    icp_eoi(spapr_link_icp(cpu), xirr);
     return H_SUCCESS;
 }
 
@@ -94,7 +94,7 @@ static target_ulong h_ipoll(PowerPCCPU *cpu, sPAPRMachineState *spapr,
                             target_ulong opcode, target_ulong *args)
 {
     uint32_t mfrr;
-    uint32_t xirr = icp_ipoll(ICP(cpu->intc), &mfrr);
+    uint32_t xirr = icp_ipoll(spapr_link_icp(cpu), &mfrr);
 
     args[0] = xirr;
     args[1] = mfrr;
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 031488131629..64c35dfdf427 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -970,6 +970,75 @@ static void pnv_chip_class_init(ObjectClass *klass, void *data)
     dc->desc = "PowerNV Chip";
 }
 
+#define TYPE_PNV_LINK "pnv-link"
+#define PNV_LINK(obj) OBJECT_CHECK(PnvLink, (obj), TYPE_PNV_LINK)
+
+typedef struct PnvLink {
+    DeviceState parent;
+
+    ICPState *icp;
+} PnvLink;
+
+static void pnv_link_realize(DeviceState *dev, Error **errp)
+{
+    PnvMachineState *pnv = PNV_MACHINE(qdev_get_machine());
+    PnvLink *link = PNV_LINK(dev);
+    Object *cpu;
+    Error *local_err = NULL;
+
+    cpu = object_property_get_link(OBJECT(dev), "cpu", &local_err);
+    if (!cpu) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'cpu' not found: ");
+        return;
+    }
+
+    link->icp = ICP(icp_create(cpu, TYPE_PNV_ICP, XICS_FABRIC(pnv),
+                               &local_err));
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static void pnv_link_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = pnv_link_realize;
+}
+
+static const TypeInfo pnv_link_info = {
+    .name = TYPE_PNV_LINK,
+    .parent = TYPE_DEVICE,
+    .instance_size = sizeof(PnvLink),
+    .class_init = pnv_link_class_init,
+};
+
+Object *pnv_link_create(Object *cpu, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(TYPE_PNV_LINK);
+    object_property_add_child(cpu, TYPE_PNV_LINK, obj, &error_abort);
+    object_unref(obj);
+    object_property_add_const_link(obj, "cpu", cpu, &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        object_unparent(obj);
+        error_propagate(errp, local_err);
+        obj = NULL;
+    }
+
+    return obj;
+}
+
+ICPState *pnv_link_icp(PowerPCCPU *cpu)
+{
+    return PNV_LINK(cpu->link)->icp;
+}
+
 static ICSState *pnv_ics_get(XICSFabric *xi, int irq)
 {
     PnvMachineState *pnv = PNV_MACHINE(xi);
@@ -1013,7 +1082,7 @@ static ICPState *pnv_icp_get(XICSFabric *xi, int pir)
 {
     PowerPCCPU *cpu = ppc_get_vcpu_by_pir(pir);
 
-    return cpu ? ICP(cpu->intc) : NULL;
+    return cpu ? pnv_link_icp(cpu) : NULL;
 }
 
 static void pnv_pic_print_info(InterruptStatsProvider *obj,
@@ -1026,7 +1095,7 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        icp_pic_print_info(ICP(cpu->intc), mon);
+        icp_pic_print_info(pnv_link_icp(cpu), mon);
     }
 
     for (i = 0; i < pnv->num_chips; i++) {
@@ -1142,3 +1211,10 @@ static const TypeInfo types[] = {
 };
 
 DEFINE_TYPES(types)
+
+static void pnv_machine_register_types(void)
+{
+    type_register_static(&pnv_link_info);
+}
+
+type_init(pnv_machine_register_types)
diff --git a/hw/ppc/pnv_core.c b/hw/ppc/pnv_core.c
index cbb64ad9e7e0..96f70ac2df8b 100644
--- a/hw/ppc/pnv_core.c
+++ b/hw/ppc/pnv_core.c
@@ -133,7 +133,7 @@ static void pnv_core_realize_child(Object *child, XICSFabric *xi, Error **errp)
         return;
     }
 
-    cpu->intc = icp_create(child, TYPE_PNV_ICP, xi, &local_err);
+    cpu->link = pnv_link_create(child, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
         return;
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index b35aff5d811c..d151460dc72c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1746,7 +1746,7 @@ static int spapr_post_load(void *opaque, int version_id)
         CPUState *cs;
         CPU_FOREACH(cs) {
             PowerPCCPU *cpu = POWERPC_CPU(cs);
-            icp_resend(ICP(cpu->intc));
+            icp_resend(spapr_link_icp(cpu));
         }
     }
 
@@ -3763,6 +3763,76 @@ static void spapr_phb_placement(sPAPRMachineState *spapr, uint32_t index,
     *mmio64 = SPAPR_PCI_BASE + (index + 1) * SPAPR_PCI_MEM64_WIN_SIZE;
 }
 
+
+#define TYPE_SPAPR_LINK "spapr-link"
+#define SPAPR_LINK(obj) OBJECT_CHECK(sPAPRLink, (obj), TYPE_SPAPR_LINK)
+
+typedef struct sPAPRLink {
+    DeviceState parent;
+
+    ICPState *icp;
+} sPAPRLink;
+
+static void spapr_link_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRMachineState *spapr = SPAPR_MACHINE(qdev_get_machine());
+    sPAPRLink *link = SPAPR_LINK(dev);
+    Object *cpu;
+    Error *local_err = NULL;
+
+    cpu = object_property_get_link(OBJECT(dev), "cpu", &local_err);
+    if (!cpu) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'cpu' not found: ");
+        return;
+    }
+
+    link->icp = ICP(icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr),
+                               &local_err));
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static void spapr_link_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = spapr_link_realize;
+}
+
+static const TypeInfo spapr_link_info = {
+    .name = TYPE_SPAPR_LINK,
+    .parent = TYPE_DEVICE,
+    .instance_size = sizeof(sPAPRLink),
+    .class_init = spapr_link_class_init,
+};
+
+Object *spapr_link_create(Object *cpu, sPAPRMachineState *spapr, Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(TYPE_SPAPR_LINK);
+    object_property_add_child(cpu, TYPE_SPAPR_LINK, obj, &error_abort);
+    object_unref(obj);
+    object_property_add_const_link(obj, "cpu", cpu, &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        object_unparent(obj);
+        error_propagate(errp, local_err);
+        obj = NULL;
+    }
+
+    return obj;
+}
+
+ICPState *spapr_link_icp(PowerPCCPU *cpu)
+{
+    return SPAPR_LINK(cpu->link)->icp;
+}
+
 static ICSState *spapr_ics_get(XICSFabric *dev, int irq)
 {
     sPAPRMachineState *spapr = SPAPR_MACHINE(dev);
@@ -3781,7 +3851,7 @@ static ICPState *spapr_icp_get(XICSFabric *xi, int vcpu_id)
 {
     PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
 
-    return cpu ? ICP(cpu->intc) : NULL;
+    return cpu ? spapr_link_icp(cpu) : NULL;
 }
 
 #define ICS_IRQ_FREE(ics, srcno)   \
@@ -3923,7 +3993,7 @@ static void spapr_pic_print_info(InterruptStatsProvider *obj,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        icp_pic_print_info(ICP(cpu->intc), mon);
+        icp_pic_print_info(spapr_link_icp(cpu), mon);
     }
 
     ics_pic_print_info(spapr->ics, mon);
@@ -4472,6 +4542,7 @@ DEFINE_SPAPR_MACHINE(2_1, "2.1", false);
 static void spapr_machine_register_types(void)
 {
     type_register_static(&spapr_machine_info);
+    type_register_static(&spapr_link_info);
 }
 
 type_init(spapr_machine_register_types)
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 01dbc6942410..f02a41d011e9 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -104,7 +104,7 @@ static void spapr_cpu_core_unrealizefn(DeviceState *dev, Error **errp)
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
         spapr_cpu_destroy(cpu);
-        object_unparent(cpu->intc);
+        object_unparent(cpu->link);
         cpu_remove_sync(cs);
         object_unparent(obj);
     }
@@ -128,8 +128,7 @@ static void spapr_cpu_core_realize_child(Object *child,
         goto error;
     }
 
-    cpu->intc = icp_create(child, spapr->icp_type, XICS_FABRIC(spapr),
-                           &local_err);
+    cpu->link = spapr_link_create(child, spapr, &local_err);
     if (local_err) {
         goto error;
     }
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 90759240a7b1..f76b3aa16ceb 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -137,6 +137,9 @@ typedef struct PnvMachineState {
     Notifier     powerdown_notifier;
 } PnvMachineState;
 
+Object *pnv_link_create(Object *cpu, Error **errp);
+ICPState *pnv_link_icp(PowerPCCPU *cpu);
+
 static inline bool pnv_chip_is_power9(const PnvChip *chip)
 {
     return PNV_CHIP_GET_CLASS(chip)->chip_type == PNV_CHIP_POWER9;
diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index d60b7c6d7a8b..5a160f75ac8f 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -803,4 +803,7 @@ void spapr_caps_reset(sPAPRMachineState *spapr);
 void spapr_caps_add_properties(sPAPRMachineClass *smc, Error **errp);
 int spapr_caps_post_migration(sPAPRMachineState *spapr);
 
+Object *spapr_link_create(Object *cpu, sPAPRMachineState *spapr, Error **errp);
+ICPState *spapr_link_icp(PowerPCCPU *cpu);
+
 #endif /* HW_SPAPR_H */
diff --git a/target/ppc/cpu.h b/target/ppc/cpu.h
index 8c9e03f54d3d..19f43bbe6723 100644
--- a/target/ppc/cpu.h
+++ b/target/ppc/cpu.h
@@ -1204,7 +1204,7 @@ struct PowerPCCPU {
     int vcpu_id;
     uint32_t compat_pvr;
     PPCVirtualHypervisor *vhyp;
-    Object *intc;
+    Object *link;
     int32_t node_id; /* NUMA node this CPU belongs to */
     PPCHash64Options *hash64_opts;
 
-- 
2.13.6


* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-04-27  6:32               ` David Gibson
@ 2018-05-02 15:28                 ` Cédric Le Goater
  2018-05-03  5:13                   ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-02 15:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/27/2018 08:32 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 12:30:42PM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 05:54 AM, David Gibson wrote:
>>> On Tue, Apr 24, 2018 at 11:33:11AM +0200, Cédric Le Goater wrote:
>>>> On 04/24/2018 08:46 AM, David Gibson wrote:
>>>>> On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/23/2018 08:46 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
>>>>>>>> The XiveFabric offers a simple interface, between the XiveSource
>>>>>>>> object and the device model owning the interrupt sources, to forward
>>>>>>>> an event notification to the XIVE interrupt controller of the machine
>>>>>>>> and, if the owner is the controller, to directly call the routing
>>>>>>>> sub-engine.
>>>>>>>>
>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>> ---
>>>>>>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
>>>>>>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
>>>>>>>>  2 files changed, 61 insertions(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>>>> index 060976077dd7..b4c3d06c1219 100644
>>>>>>>> --- a/hw/intc/xive.c
>>>>>>>> +++ b/hw/intc/xive.c
>>>>>>>> @@ -17,6 +17,21 @@
>>>>>>>>  #include "hw/ppc/xive.h"
>>>>>>>>  
>>>>>>>>  /*
>>>>>>>> + * XIVE Fabric
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
>>>>>>>> +{
>>>>>>>> +
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static const TypeInfo xive_fabric_info = {
>>>>>>>> +    .name = TYPE_XIVE_FABRIC,
>>>>>>>> +    .parent = TYPE_INTERFACE,
>>>>>>>> +    .class_size = sizeof(XiveFabricClass),
>>>>>>>> +};
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>>   * XIVE Interrupt Source
>>>>>>>>   */
>>>>>>>>  
>>>>>>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>>>>>>>>  
>>>>>>>>  /*
>>>>>>>>   * Forward the source event notification to the associated XiveFabric,
>>>>>>>> - * the device owning the sources.
>>>>>>>> + * the device owning the sources, or perform the routing if the device
>>>>>>>> + * is the interrupt controller.
>>>>>>>>   */
>>>>>>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>>>>>>  {
>>>>>>>>  
>>>>>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
>>>>>>>> +
>>>>>>>> +    if (xfc->notify) {
>>>>>>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
>>>>>>>> +    } else {
>>>>>>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
>>>>>>>> +    }
>>>>>>>
>>>>>>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
>>>>>>> to xive_fabric_route if that's what it wants?
>>>>>> Under sPAPR, all the sources, IPIs and virtual device interrupts, 
>>>>>> generate events which are directly routed by xive_fabric_route(). 
>>>>>> There is no need of an extra hop. Indeed. 
>>>>>
>>>>> Ok.
>>>>>
>>>>>> Under PowerNV, some sources forward the notification to the routing 
>>>>>> engine using a specific MMIO load on a notify address which is stored 
>>>>>> in one of the controller registers. So we need a hop to reach the 
>>>>>> device model, owning the sources, and do that load :
>>>>>
>>>>> Hm.  So you're saying that in pnv some sources send their notification
>>>>> to some other unit, 
>>>>
>>>> Not to any unit/device, to the device owning the sources.
>>>>
>>>> For the XiveSource object under PSI, the XIVEFabric interface is the 
>>>> PSI device object itself, which knows how to forward the notification
>>>> on the XIVE Power "bus". To be more precise, the PSI HB device has
>>>> 14 interrupt sources, whose notifications are forwarded using an MMIO
>>>> load to some address. The load address is configured (by skiboot) in
>>>> one of the PSI device registers, and points to an MMIO region of the
>>>> main XIVE interrupt controller. 
>>>>
>>>> The PHB4 sources should be the same.
>>>>
>>>> For the XiveSource object (all interrupts) under sPAPRXive, the 
>>>> XIVEFabric is the main interrupt controller sPAPRXive.
>>>>
>>>> For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
>>>> also the main interrupt controller PnvXive.
>>>
>>> Hrm.  Apparently I'm missing something, I'm really not getting what
>>> you're trying to explain here.
>>
>> I see that. Let's try again.
>>
>>>>> that would then (after possible masking) forward on to the overall xive fabric?
>>>>
>>>> yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?
>>>
>>> Maybe..?
>>>
>>>>> That seems like a property of the source object, 
>>>>
>>>> The source object is generic. It's a bunch of PQ bits that can be 
>>>> controlled by MMIOs. Nothing more.
>>>
>>> Hmm.  Isn't the source object also responsible for forwarding the
>>> interrupt to something up the chain (whatever that is)?
>>
>> Yes, but it cannot forward directly. The XiveSource is generic and
>> can only call a handler :
>>
>> 	xfc->notify(xsrc->xive, srcno + xsrc->offset);
> 
> But.. your patch doesn't do that always, it's conditional which I
> still don't understand.

Because at the end of the notify/forward chain, you route.

>> The device model owner, the parent of the XiveSource object, would 
>> do the real forward.
> 
> Why?  

Because, in my idea of the XiveSource concept, it does not have
the logic to do so: the register holding the MMIO address to use,
another one for the IVT offset, etc.

> I mean the XiveSource basically represents the xive irq related
> logic of the PHB or whatever, why would it not represent *all* of
> that, rather than just the ESB bits, meaning the owner has to have
> some more xive logic for the forwarding.

ok. This is where we diverge in the concept. 

The PQ bits, the ESB MMIO region handlers can be easily shared 
between the different device models. They are the common part
of devices with XIVE interrupt sources.  
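A sketch of that shared PQ logic, assuming the usual ESB encoding (P in bit 1, Q in bit 0) and leaving out the MMIO offset decoding: a trigger on a ready source forwards the event and sets P, further triggers only set Q, and an EOI re-forwards the event if Q was set meanwhile. Function and constant names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings assumed here: P is bit 1, Q is bit 0. */
enum {
    ESB_RESET   = 0x0, /* P=0 Q=0: ready */
    ESB_OFF     = 0x1, /* P=0 Q=1: masked */
    ESB_PENDING = 0x2, /* P=1 Q=0: event forwarded, waiting for EOI */
    ESB_QUEUED  = 0x3, /* P=1 Q=1: another event arrived while pending */
};

/* Returns true if the event must be forwarded to the router. */
static bool esb_trigger(uint8_t *pq)
{
    assert(*pq <= ESB_QUEUED);     /* PQ is a 2-bit value */
    switch (*pq) {
    case ESB_RESET:                /* ready: forward and mark pending */
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:              /* already pending: just remember it */
    case ESB_QUEUED:
        *pq = ESB_QUEUED;
        return false;
    default:                       /* ESB_OFF: masked, event dropped */
        return false;
    }
}

/* Returns true if a queued event must be forwarded again after the EOI. */
static bool esb_eoi(uint8_t *pq)
{
    assert(*pq <= ESB_QUEUED);
    switch (*pq) {
    case ESB_RESET:
    case ESB_PENDING:              /* nothing queued: back to ready */
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:               /* an event arrived meanwhile: re-forward */
        *pq = ESB_PENDING;
        return true;
    default:                       /* ESB_OFF stays masked */
        return false;
    }
}
```

Nothing in these two functions depends on the owning device, which is what makes this part easy to share.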

> Note that I don't think the fact that some sources notify via mmio and
> some are internal really matters.  It's not like we're modelling the
> power bus down to the wire-transaction level.

Yes, but the devices are configured differently. pnv devices will
have registers accessible through MMIO or XSCOM and configured by
the firmware, whereas spapr is all set up by QEMU.

>> It's very similar to what we have today with XICS :
>>
>> 	- The sPAPR model has an ICSState  
>> 	- The PnvPSI model has an ICSState 
>> 	- The PnvPHB3 model has two ICSStates
>>
>> and the 'xics' pointer in ICSState points to the 'interrupt unit' of 
>> the machine to do resends and to grab ICPs. So it is used for routing
>> essentially.
> 
> Hmm.  I think you and I are looking at XICSFabric kind of
> differently.  As I see it, it's not really an active component at
> all.  Rather it's basically a global "map" of the xics components so
> that they can find each other.

ok. I am not that far from that view either.
 
>> in Xive 
>>
>> 	- sPAPRXive model has a XiveSource
>> 	- PnvXive model has a XiveSource
>> 	- PnvPSI model has a XiveSource
>> 	- PnvPHB4 model should have also.
>>
>> and the 'xive' pointer in XiveSource points to the parent object,
> 
> Uh.. yeah.. the xics pointer in ICS units doesn't point to the parent
> object, except maybe by accident.  It's absolutely intended to be
> global, and so points to the machine.

yes. I agree. 

XIVE has more layers and visible components due to the internal 
tables used for routing.

>> which will handle the event notification forwarding or routing.
> 
> Ok, how about this for a partial model.  We have:
> 
> XiveSource objects:
> 	* Owns an ESB table
> 	* Knows the mapping of its local irq offsets to global irq
> 	  numbers

That is the 'offset' attribute, I suppose. This is set at runtime
for the powernv devices. For pseries, I think we will need it for
passthrough, but I haven't looked at that part yet.

> 	* Provides the mmio interface for ESB manipulation
> 	* When necessary, notifies a new interrupt to a XiveRouter

ok. I think that what we have today fits the idea then. 

> XiveRouter objects:
> 	* Responsible for a fixed range of global irq numbers
> 	* Owns an IVT (but what that means can vary, see below)

the size of the IVT table is determined at runtime for powernv.

> 	* When notified of an irq, routes it to the appropriate EQ
> 	  (haven't thought about this part yet)

we will need a class handler to get EQs

> 	* Abstract class - needs subclasses to define how to get IVEs

OK. We will need a few other ops. The router needs to :

	- get IVEs
	- get EQ descriptors
	- update OS EQs (write event data in OS RAM)
	- update EQ descriptors (to set EQ descriptor index & toggle)
	- get an NVT/VP (to notify CPUs)

For powernv, we should also consider updating the NVT/VP. It can be 
done later. 
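The router operations listed above could map onto a table of hooks, with the routing walk written once on top of them. This is a sketch only: the IVE and EQ descriptor fields, the 32-bit queue entry format (a generation bit above 31 bits of event data) and every name are assumptions for illustration, not the real XIVE layouts or a QEMU API. A small in-memory backend is included so the walk can be exercised.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified table entries (not the real XIVE bit layouts). */
typedef struct XiveIVE {
    bool valid, masked;
    uint32_t eq_idx;      /* which EQ descriptor to use */
    uint32_t eq_data;     /* event data written in the OS queue */
} XiveIVE;

typedef struct XiveEQD {
    uint32_t nvt;         /* target virtual processor */
    uint8_t prio;
    uint32_t *qpage;      /* OS event queue (guest RAM on real systems) */
    uint32_t qsize;       /* number of 32-bit entries */
    uint32_t qindex;
    uint32_t generation;  /* flipped each time qindex wraps */
} XiveEQD;

/* The router operations, as subclass hooks. */
typedef struct XiveRouterOps {
    bool (*get_ive)(void *opaque, uint32_t lisn, XiveIVE *ive);
    bool (*get_eqd)(void *opaque, uint32_t idx, XiveEQD *eqd);
    void (*set_eqd)(void *opaque, uint32_t idx, const XiveEQD *eqd);
    void (*write_eq)(void *opaque, const XiveEQD *eqd, uint32_t entry);
    void (*notify_nvt)(void *opaque, uint32_t nvt, uint8_t prio);
} XiveRouterOps;

/* Generic routing, written only in terms of the hooks. */
static bool xive_router_route(const XiveRouterOps *ops, void *opaque,
                              uint32_t lisn)
{
    XiveIVE ive;
    XiveEQD eqd;

    if (!ops->get_ive(opaque, lisn, &ive) || !ive.valid || ive.masked) {
        return false;
    }
    if (!ops->get_eqd(opaque, ive.eq_idx, &eqd)) {
        return false;
    }
    assert(eqd.qsize > 0);
    /* Assumed entry format: generation bit on top, event data below. */
    ops->write_eq(opaque, &eqd,
                  (eqd.generation << 31) | (ive.eq_data & 0x7fffffff));
    eqd.qindex = (eqd.qindex + 1) % eqd.qsize;
    if (eqd.qindex == 0) {
        eqd.generation ^= 1;
    }
    ops->set_eqd(opaque, ive.eq_idx, &eqd);   /* save index and toggle */
    ops->notify_nvt(opaque, eqd.nvt, eqd.prio);
    return true;
}

/* Demo backend keeping all tables in memory, as a pseries router might. */
typedef struct DemoTables {
    XiveIVE ivt[8];
    XiveEQD eqdt[2];
    uint32_t last_nvt;
    int notifications;
} DemoTables;

static bool demo_get_ive(void *opaque, uint32_t lisn, XiveIVE *ive)
{
    DemoTables *t = opaque;
    if (lisn >= 8) {
        return false;
    }
    *ive = t->ivt[lisn];
    return true;
}

static bool demo_get_eqd(void *opaque, uint32_t idx, XiveEQD *eqd)
{
    DemoTables *t = opaque;
    if (idx >= 2) {
        return false;
    }
    *eqd = t->eqdt[idx];
    return true;
}

static void demo_set_eqd(void *opaque, uint32_t idx, const XiveEQD *eqd)
{
    ((DemoTables *)opaque)->eqdt[idx] = *eqd;
}

static void demo_write_eq(void *opaque, const XiveEQD *eqd, uint32_t entry)
{
    (void)opaque;
    eqd->qpage[eqd->qindex] = entry;
}

static void demo_notify_nvt(void *opaque, uint32_t nvt, uint8_t prio)
{
    DemoTables *t = opaque;
    (void)prio;
    t->last_nvt = nvt;
    t->notifications++;
}

static const XiveRouterOps demo_ops = {
    demo_get_ive, demo_get_eqd, demo_set_eqd, demo_write_eq, demo_notify_nvt,
};
```

A powernv backend would implement the same hooks by reading and writing the tables in RAM instead.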

> XiveFabric interface:
> 	* Lets XIVE components locate each other

hmm, it should be a chain : 

	source -> router -> presenter -> cpu

So the components should not have to locate each other. The presenter
does not know about the source for instance. Only the sources need
to forward events to the main controller logic doing the routing.
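The chain view can be sketched in C, assuming every stage exposes a single notify hook and only knows the next stage. In this sketch the controller's hook is the routing function itself, so the source never needs a conditional: a PSI-like owner adds one forwarding hop (standing in for the MMIO load to the notify address set by firmware), while IPIs owned by the controller go straight to routing. All names are hypothetical, and the presenter choice (irq modulo number of presenters) is a placeholder, not XIVE routing.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical single hook shared by every stage of the chain. */
typedef struct XiveNotifier XiveNotifier;
struct XiveNotifier {
    void (*notify)(XiveNotifier *xn, uint32_t lisn);
};

typedef struct Cpu { int raised; uint32_t last_lisn; } Cpu;
typedef struct Presenter { Cpu *cpu; } Presenter;

typedef struct Controller {
    XiveNotifier parent;      /* first member, so it can be cast back */
    Presenter *presenters;
    uint32_t nr_presenters;
} Controller;

typedef struct PsiOwner {
    XiveNotifier parent;      /* first member, so it can be cast back */
    XiveNotifier *target;     /* stands in for the firmware-set notify address */
    int hops;
} PsiOwner;

typedef struct Source {
    uint32_t offset;
    XiveNotifier *owner;
} Source;

/* presenter -> cpu */
static void presenter_present(Presenter *p, uint32_t lisn)
{
    p->cpu->raised = 1;
    p->cpu->last_lisn = lisn;
}

/* router -> presenter; placeholder policy, not real XIVE routing */
static void controller_route(XiveNotifier *xn, uint32_t lisn)
{
    Controller *c = (Controller *)xn;

    assert(c->nr_presenters > 0);
    presenter_present(&c->presenters[lisn % c->nr_presenters], lisn);
}

/* extra hop for sources owned by a device such as PSI */
static void psi_forward(XiveNotifier *xn, uint32_t lisn)
{
    PsiOwner *psi = (PsiOwner *)xn;

    psi->hops++;
    psi->target->notify(psi->target, lisn);
}

/* source -> owner: always the same call, no special case */
static void source_notify(Source *s, uint32_t srcno)
{
    s->owner->notify(s->owner, srcno + s->offset);
}
```

The presenter never sees the source here; only the owner's hook decides whether there is an extra hop.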

> 	* get_router() method: maps a global irq number to XiveRouter
> 	  object

Ah, you are thinking about the multichip case under powernv. I need
to look at that more closely. But XIVE has a concept of block, which
skiboot uses to map a chip to a block, and the XIVE tables have
a block field.
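To illustrate how a powernv get_router() could use the block, here is a sketch under two assumptions made only for this example: the block is carried in the top 4 bits of a 32-bit global irq number, and each chip's router is tagged with the block skiboot assigned to it. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed layout for illustration: 4-bit block, 28-bit index. */
#define IRQ_BLOCK_SHIFT 28
#define IRQ_INDEX_MASK  ((1u << IRQ_BLOCK_SHIFT) - 1)

static uint32_t irq_make(uint8_t blk, uint32_t idx)
{
    assert(blk < 16 && idx <= IRQ_INDEX_MASK);
    return ((uint32_t)blk << IRQ_BLOCK_SHIFT) | idx;
}

static uint8_t irq_block(uint32_t lisn)
{
    return (uint8_t)(lisn >> IRQ_BLOCK_SHIFT);
}

static uint32_t irq_index(uint32_t lisn)
{
    return lisn & IRQ_INDEX_MASK;
}

/* Hypothetical per-chip router, tagged with the block firmware assigned. */
typedef struct ChipRouter {
    int chip_id;
    uint8_t block;
} ChipRouter;

static ChipRouter *get_router_by_block(ChipRouter *routers, size_t n,
                                       uint32_t lisn)
{
    for (size_t i = 0; i < n; i++) {
        if (routers[i].block == irq_block(lisn)) {
            return &routers[i];
        }
    }
    return NULL;
}
```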

> 	* Always global (implemented on the machine)

OK. this is more or less the object modeling the main interrupt 
controller ? What I called sPAPRXive in the current patchset.

 
> On pseries we have:
> 
> 	? XiveSource objects.  We probably only need one, but 1 for
> 	LSI and one for MSI might be convenient.  

It's not too ugly for the moment. If we create a source object 
for LSIs only that might be more complex for passthrough devices 
and their associated ESB MMIO region.

side note :

LSIs work under TCG, and used to work under KVM until we removed
the StoreEOI support. Since the EOI is now different, we need 
to find a way to handle EOI for guest virtual LSIs which are 
not LSIs for the host. We can still change the Linux spapr 
backend or have QEMU handle the EOI for virtual LSIs. This is 
on my TODO list.

>       More wouldn't break the model

It should not. Initially the patchset had two source objects:
one for IPIs and one for the virtual device interrupts.

> 	1 sPAPRXiveRouter.  This is a subclass of XiveRouter that
> 	holds the IVT internally (and migrates it).

OK.

> 	the XiveFabric implementation always returns the single global
> 	router for get_router()

we can do that. 

Do we gather all XIVE component objects under an object 'sPAPRXive',
thereby modeling the main interrupt controller of the machine?
It should not have any state, but it will most certainly hold the
TIMA region, and the addresses where to map the ESB and TIMA
regions.

KVM support will bring extra needs.
 
> On powernv we have:
> 	N XiveSource objects.  Some in PHBs, some extra ones on each
> 	chip

yes and PSI to start with.  

> 	(#chips) PowerXiveRouter objects.  This subclass of XiveRouter
> 	stores the register giving the IVT base address and migrates
> 	that, but the IVT contents are in RAM
> 
> 	the XiveFabric get_router() implementation returns the right
> 	chip's router based on the irq number
> 
> Obviously the router->EQ side still needs a bunch of thought.

These are all routing tables : 

	- IVT
	- EQDT
	- VPDT  

Thanks,

C.


* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-05-02 15:28                 ` Cédric Le Goater
@ 2018-05-03  5:13                   ` David Gibson
  2018-05-23 10:12                     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  5:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Wed, May 02, 2018 at 05:28:23PM +0200, Cédric Le Goater wrote:
> On 04/27/2018 08:32 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 12:30:42PM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 05:54 AM, David Gibson wrote:
> >>> On Tue, Apr 24, 2018 at 11:33:11AM +0200, Cédric Le Goater wrote:
> >>>> On 04/24/2018 08:46 AM, David Gibson wrote:
> >>>>> On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
> >>>>>> On 04/23/2018 08:46 AM, David Gibson wrote:
> >>>>>>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
> >>>>>>>> The XiveFabric offers a simple interface, between the XiveSource
> >>>>>>>> object and the device model owning the interrupt sources, to forward
> >>>>>>>> an event notification to the XIVE interrupt controller of the machine
> >>>>>>>> and, if the owner is the controller, to directly call the routing
> >>>>>>>> sub-engine.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>>>> ---
> >>>>>>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
> >>>>>>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
> >>>>>>>>  2 files changed, 61 insertions(+), 1 deletion(-)
> >>>>>>>>
> >>>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>>>>>> index 060976077dd7..b4c3d06c1219 100644
> >>>>>>>> --- a/hw/intc/xive.c
> >>>>>>>> +++ b/hw/intc/xive.c
> >>>>>>>> @@ -17,6 +17,21 @@
> >>>>>>>>  #include "hw/ppc/xive.h"
> >>>>>>>>  
> >>>>>>>>  /*
> >>>>>>>> + * XIVE Fabric
> >>>>>>>> + */
> >>>>>>>> +
> >>>>>>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
> >>>>>>>> +{
> >>>>>>>> +
> >>>>>>>> +}
> >>>>>>>> +
> >>>>>>>> +static const TypeInfo xive_fabric_info = {
> >>>>>>>> +    .name = TYPE_XIVE_FABRIC,
> >>>>>>>> +    .parent = TYPE_INTERFACE,
> >>>>>>>> +    .class_size = sizeof(XiveFabricClass),
> >>>>>>>> +};
> >>>>>>>> +
> >>>>>>>> +/*
> >>>>>>>>   * XIVE Interrupt Source
> >>>>>>>>   */
> >>>>>>>>  
> >>>>>>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
> >>>>>>>>  
> >>>>>>>>  /*
> >>>>>>>>   * Forward the source event notification to the associated XiveFabric,
> >>>>>>>> - * the device owning the sources.
> >>>>>>>> + * the device owning the sources, or perform the routing if the device
> >>>>>>>> + * is the interrupt controller.
> >>>>>>>>   */
> >>>>>>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
> >>>>>>>>  {
> >>>>>>>>  
> >>>>>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
> >>>>>>>> +
> >>>>>>>> +    if (xfc->notify) {
> >>>>>>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
> >>>>>>>> +    } else {
> >>>>>>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
> >>>>>>>> +    }
> >>>>>>>
> >>>>>>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
> >>>>>>> to xive_fabric_route if that's what it wants?
> >>>>>> Under sPAPR, all the sources, IPIs and virtual device interrupts, 
> >>>>>> generate events which are directly routed by xive_fabric_route(). 
> >>>>>> There is no need of an extra hop. Indeed. 
> >>>>>
> >>>>> Ok.
> >>>>>
> >>>>>> Under PowerNV, some sources forward the notification to the routing 
> >>>>>> engine using a specific MMIO load on a notify address which is stored 
> >>>>>> in one of the controller registers. So we need a hop to reach the 
> >>>>>> device model, owning the sources, and do that load :
> >>>>>
> >>>>> Hm.  So you're saying that in pnv some sources send their notification
> >>>>> to some other unit, 
> >>>>
> >>>> Not to any unit/device, to the device owning the sources.
> >>>>
> >>>> For the XiveSource object under PSI, the XIVEFabric interface is the 
> >>>> PSI device object itself, which knows how to forward the notification 
> >>>> on the XIVE Power "bus". To be more precise, the PSI HB device has 
> >>>> 14 interrupt sources, whose notifications are forwarded using an MMIO 
> >>>> load to some address. The load address is configured (by skiboot) in 
> >>>> one of the PSI device registers, and points to an MMIO region of the 
> >>>> main XIVE interrupt controller. 
> >>>>
> >>>> The PHB4 sources should be the same.
> >>>>
> >>>> For the XiveSource object (all interrupts) under sPAPRXive, the 
> >>>> XIVEFabric is the main interrupt controller sPAPRXive.
> >>>>
> >>>> For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
> >>>> also the main interrupt controller PnvXive.
> >>>
> >>> Hrm.  Apparently I'm missing something, I'm really not getting what
> >>> you're trying to explain here.
> >>
> >> I see that. Let's try again.
> >>
> >>>>> that would then (after possible masking) forward on to the overall xive fabric?
> >>>>
> >>>> yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?
> >>>
> >>> Maybe..?
> >>>
> >>>>> That seems like a property of the source object, 
> >>>>
> >>>> The source object is generic. It's a bunch of PQ bits that can be 
> >>>> controlled by MMIOs. Nothing more.
> >>>
> >>> Hmm.  Isn't the source object also responsible for forwarding the
> >>> interrupt to something up the chain (whatever that is)?
> >>
> >> Yes, but it cannot forward directly. The XiveSource is generic and 
> >> can only call a handler :
> >>
> >> 	xfc->notify(xsrc->xive, srcno + xsrc->offset);
> > 
> > But.. your patch doesn't do that always, it's conditional which I
> > still don't understand.
> 
> Because at the end of the notify/forward chain, you route.

Hrm.  I'm really not understanding this notify/forward thing as
distinct from routing.  I mean, from the description here it sounds
kind of like cascaded interrupt controllers, which qemu already has
mechanisms to handle, but I didn't think (most) POWER9 devices worked
like that.

> >> The device model owner, the parent of the XiveSource object, would 
> >> do the real forward.
> > 
> > Why?  
> 
> because, in my idea of the XiveSource concept, it does not have 
> the logic to do so: the register for the MMIO address to use, 
> another one for the IVT offset, etc
> 
> > I mean the XiveSource basically represents the xive irq related
> > logic of the PHB or whatever, why would it not represent *all* of
> > that, rather than just the ESB bits, meaning the owner has to have
> > some more xive logic for the forwarding.
> 
> ok. This is where we diverge in the concept. 
> 
> The PQ bits, the ESB MMIO region handlers can be easily shared 
> between the different device models. They are the common part
> of devices with XIVE interrupt sources.

Sure, but QOM subclasses are a thing, which sounds like a better way
of doing this than having to have extra XIVE logic in all the parent
devices.

> > Note that I don't think the fact that some sources notify via mmio and
> > some are internal really matters.  It's not like we're modelling the
> > power bus down to the wire-transaction level.
> 
> yes but the configuration of the devices is different. pnv devices 
> will have registers accessible through MMIO or XSCOM and configured
> by the firmware. spapr is all set up by QEMU.

Ah.. ok.  Actually modelling the mmio forwards probably makes sense
then.  I still think it makes more sense in a pnv XiveSource subclass
rather than putting it in the containing device.

> >> It's very similar to what we have today with XICS :
> >>
> >> 	- The sPAPR model has an ICSState  
> >> 	- The PnvPSI model has an ICSState 
> >> 	- The PnvPHB3 model has two ICSStates
> >>
> >> and the 'xics' pointer in ICSState points to the 'interrupt unit' of 
> >> the machine to do resends and to grab ICPs. So it is used essentially
> >> for routing.
> > 
> > Hmm.  I think you and I are looking at XICSFabric kind of
> > differently.  As I see it, it's not really an active component at
> > all.  Rather it's basically a global "map" of the xics components so
> > that they can find each other.
> 
> ok. I am not that far either. 
>  
> >> in Xive 
> >>
> >> 	- sPAPRXive model has a XiveSource
> >> 	- PnvXive model has a XiveSource
> >> 	- PnvPSI model has a XiveSource
> >> 	- PnvPHB4 model should have also.
> >>
> >> and the 'xive' pointer in XiveSource points to the parent object,
> > 
> > Uh.. yeah.. the xics pointer in ICS units doesn't point to the parent
> > object, except maybe by accident.  It's absolutely intended to be
> > global, and so points to the machine.
> 
> yes. I agree. 
> 
> XIVE has more layers and visible components due to the internal 
> tables used for routing.

Right.

> >> which will handle the event notification forwarding or routing.
> > 
> > Ok, how about this for a partial model.  We have:
> > 
> > XiveSource objects:
> > 	* Owns an ESB table
> > 	* Knows the mapping of its local irq offsets to global irq
> > 	  numbers
> 
> That is the 'offset' attribute I suppose. This is set at runtime 
> for the powernv devices.

Ok, so that offset is effectively a register (which will have to be
migrated), rather than a device property.

> For pseries, we should need it for 
> passthrough. I think. I haven't looked at that part yet.

Uh.. I really hope not.  AIUI the offsets are decided by the platform
rather than the guest in this case, yes?  In which case if we can't
count on a fixed offset even in passthrough mode, then migration is
basically impossible.

> > 	* Provides the mmio interface for ESB manipulation
> > 	* When necessary, notifies a new interrupt to a XiveRouter
> 
> ok. I think that what we have today fits the idea then. 

Yes, I think so too - just clarifying in the context of the rest of
this proposal.
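The ESB part of that XiveSource is small enough to sketch. Below is a minimal, self-contained model of the 2-bit PQ state machine behind one ESB entry, assuming the conventional XIVE encoding (P in bit 1, Q in bit 0). The macro and function names are hypothetical, not the ones used in the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; P is bit 1, Q is bit 0 (assumed encoding). */
#define ESB_RESET   0x0  /* P=0 Q=0: ready to fire */
#define ESB_OFF     0x1  /* P=0 Q=1: masked */
#define ESB_PENDING 0x2  /* P=1 Q=0: event forwarded, awaiting EOI */
#define ESB_QUEUED  0x3  /* P=1 Q=1: another event arrived while pending */

/* Returns true when the event should be forwarded to the router. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce: remember that something came in */
        return false;
    default:                /* ESB_OFF: masked, the event is dropped */
        return false;
    }
}

/* Returns true when a queued event must be re-forwarded after the EOI. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    case ESB_OFF:
        return false;       /* masked state is left untouched */
    default:                /* ESB_RESET or ESB_PENDING */
        *pq = ESB_RESET;
        return false;
    }
}
```

Only a `true` result from `esb_trigger()` or `esb_eoi()` would make the source notify its router; every other transition stays local to the ESB, which is what lets the PQ logic be shared between device models.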

> > XiveRouter objects:
> > 	* Responsible for a fixed range of global irq numbers
> > 	* Owns an IVT (but what that means can vary, see below)
> 
> the size of the IVT table is determined at runtime for powernv.

That's fine - both the size and base of the IVT will be registers of
the powernv variant of the device.

> > 	* When notified of an irq, routes it to the appropriate EQ
> > 	  (haven't thought about this part yet)
> 
> we will need a class handler to get EQs

Sure.

> > 	* Abstract class - needs subclasses to define how to get IVEs
> 
> OK. We will need a few other ops. The router needs to :
> 
> 	- get IVEs
> 	- get EQ descriptors
> 	- update OS EQs (write event data in OS RAM)

IIUC the actual EQ (as opposed to its descriptor) will be in guest RAM
for both powernv and spapr, so this can be common (utilizing a
subclass hook to locate the EQ base address).

> 	- update EQ descriptors (to set EQ descriptor index & toggle)
> 	- get an NVT/VP (to notify CPUs)

Getting NVT/VP sounds more like it would belong in the XiveFabric than
the router, but I haven't looked at this in detail yet.
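Under that split, the router could be an abstract class whose subclasses only say where IVEs live and how an event lands in an EQ, while the validity and masking checks stay common. A minimal sketch: every name (`SketchRouter`, `get_ive`, `eq_push`) and the simplified IVE struct are hypothetical, not the patch's API. The toy subclass holds a four-entry IVT in the object, roughly like the sPAPR case:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified IVE (the real one is a packed 64-bit word). */
typedef struct {
    bool valid;
    bool masked;
    uint32_t eq_idx;
    uint32_t eq_data;
} SketchIVE;

typedef struct SketchRouter SketchRouter;

/* Subclass hooks: where IVEs live, and how data is pushed into an EQ. */
typedef struct {
    SketchIVE *(*get_ive)(SketchRouter *xrtr, uint32_t lisn);
    void (*eq_push)(SketchRouter *xrtr, uint32_t eq_idx, uint32_t data);
} SketchRouterClass;

struct SketchRouter {
    const SketchRouterClass *klass;
};

/* Common routing logic. Returns true if an EQ was written. */
static bool sketch_router_notify(SketchRouter *xrtr, uint32_t lisn)
{
    SketchIVE *ive = xrtr->klass->get_ive(xrtr, lisn);

    if (!ive || !ive->valid) {
        return false;   /* unknown or unconfigured source */
    }
    if (ive->masked) {
        return false;   /* configured but masked: nothing to deliver */
    }
    xrtr->klass->eq_push(xrtr, ive->eq_idx, ive->eq_data);
    return true;
}

/* Toy subclass: a four-entry IVT held in the object itself. */
typedef struct {
    SketchRouter parent;   /* must stay first so the casts below are valid */
    SketchIVE ivt[4];
    uint32_t last_eq, last_data, pushes;
} ToyRouter;

static SketchIVE *toy_get_ive(SketchRouter *xrtr, uint32_t lisn)
{
    ToyRouter *t = (ToyRouter *)xrtr;
    return lisn < 4 ? &t->ivt[lisn] : NULL;
}

static void toy_eq_push(SketchRouter *xrtr, uint32_t eq_idx, uint32_t data)
{
    ToyRouter *t = (ToyRouter *)xrtr;
    t->last_eq = eq_idx;
    t->last_data = data;
    t->pushes++;
}

static const SketchRouterClass toy_router_class = { toy_get_ive, toy_eq_push };
```

Getting the NVT would then be a further step after `eq_push`, wherever it ends up living.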

> For powernv, we should also consider updating the NVT/VP. It can be 
> done later. 
> 
> > XiveFabric interface:
> > 	* Lets XIVE components locate each other
> 
> hmm, it should be a chain : 
> 
> 	source -> router -> presenter -> cpu
> 
> So the components should not have to locate each other. The presenter
> does not know about the source for instance. Only the sources need
> to forward events to the main controller logic doing the routing.

source -> router

From the above sounds like we can maybe make this just a router
property in the source, rather than a lookup through a fabric.

router -> presenter

Here we might still need a "fabric".  By definition the router can
direct to a bunch of different presenters, so it needs some kind of
map to find them, no?

presenter -> cpu

This one's trivial; just the cpu->intc pointer or similar.

> > 	* get_router() method: maps a global irq number to XiveRouter
> > 	  object
> 
> Ah. you are thinking about the multichip case under powernv. I need
> to look at that more closely. But XIVE has a concept of block which 
> is used by skiboot to map a chip to a block and the XIVE tables have 
> a block field.

Hmm.. I think we need to understand this to make a real model.  I'd
thought the block number would just be the chip number somewhere in
the high bits of the global irq, but sounds like there might be yet
another indirection here.
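One way to keep that extra indirection explicit is to treat a global irq number as a (block, index) pair and have the machine map blocks to chips through a table that firmware programs, rather than assuming block == chip. The bit layout below (4-bit block in the top bits, 28-bit index) and all names are assumptions for illustration only:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout (hypothetical): 4-bit block number on top, 28-bit index below. */
#define SKETCH_BLOCK_SHIFT 28
#define SKETCH_INDEX_MASK  ((1u << SKETCH_BLOCK_SHIFT) - 1)
#define SKETCH_MAX_BLOCKS  16

static inline uint32_t sketch_girq(uint8_t blk, uint32_t idx)
{
    return ((uint32_t)blk << SKETCH_BLOCK_SHIFT) | (idx & SKETCH_INDEX_MASK);
}

static inline uint8_t sketch_girq_block(uint32_t girq)
{
    return girq >> SKETCH_BLOCK_SHIFT;
}

static inline uint32_t sketch_girq_index(uint32_t girq)
{
    return girq & SKETCH_INDEX_MASK;
}

/* Machine-level map: firmware assigns blocks to chips, so keep a table. */
typedef struct {
    int chip_of_block[SKETCH_MAX_BLOCKS];   /* -1 when the block is unassigned */
} SketchFabric;

/* get_router()-style lookup: which chip's router owns this global irq? */
static int sketch_fabric_get_chip(const SketchFabric *f, uint32_t girq)
{
    return f->chip_of_block[sketch_girq_block(girq)];
}
```

If it turns out that skiboot always uses block == chip, the table collapses to an identity map; if not, the table is machine state that would have to be migrated.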

> > 	* Always global (implemented on the machine)
> 
> OK. this is more or less the object modeling the main interrupt 
> controller ? What I called sPAPRXive in the current patchset.

No.  This is *always* global - assuming we need it at all - even on
multichip powernv.

> > On pseries we have:
> > 
> > 	? XiveSource objects.  We probably only need one, but 1 for
> > 	LSI and one for MSI might be convenient.  
> 
> It's not too ugly for the moment. If we create a source object 
> for LSIs only that might be more complex for passthrough devices 
> and their associated ESB MMIO region.
> 
> side note :
> 
> LSIs work under TCG, and used to work under KVM until we removed
> the StoreEOI support. Since the EOI is now different, we need 
> to find a way to handle EOI for guest virtual LSIs which are 
> not LSIs for the host.

I can't quite picture that case - can you give a concrete example?

> We can still change the Linux spapr 
> backend or have QEMU handle the EOI for virtual LSIs. This is 
> on my TODO list.
> 
> >       More wouldn't break the model
> 
> It should not. Initially the patchset had two source objects : 
> one for IPIs and one for the virtual devices interrupts. 
> 
> > 	1 sPAPRXiveRouter.  This is a subclass of XiveRouter that
> > 	holds the IVT internally (and migrates it).
> 
> OK.
> 
> > 	the XiveFabric implementation always returns the single global
> > 	router for get_router()
> 
> we can do that. 

Again, if it's true that each source object always forwards to just
one router, then we don't need a get_router(), we can just use a link
property.

> Do we gather all XIVE components objects under an object 'sPAPRXive' 
> modeling that way the main interrupt controller of the machine ? 

No, I don't think so.

> It should not have any state but it will hold the TIMA region 
> most certainly, and the addresses where to map the ESB and 
> TIMA regions.

Uh.. I'd expect the TIMAs to be held by the NVT objects, which we
haven't covered yet.

> KVM support will bring extra needs.
>  
> > On powernv we have:
> > 	N XiveSource objects.  Some in PHBs, some extra ones on each
> > 	chip
> 
> yes and PSI to start with.  
> 
> > 	(#chips) PowerXiveRouter objects.  This subclass of XiveRouter
> > 	stores the register giving the IVT base address and migrates
> > 	that, but the IVT contents are in RAM
> > 
> > 	the XiveFabric get_router() implementation returns the right
> > 	chip's router based on the irq number
> > 
> > Obviously the router->EQ sides still needs a bunch of thought.
> 
> These are all routing tables : 
> 
> 	- IVT
> 	- EQDT
> 	- VPDT  

Thinking about what you've said, I'm thinking maybe what we need on
the source side is two versions which don't exactly correspond to
spapr vs powernv:

InBandXiveSource: this one's notify behaviour is to issue an mmio read
to a configured address.  It's expected that mmio address belongs to a
XiveRouter, but the source itself doesn't know anything about routers.

OutOfBandXiveSource: this one's notify behaviour is to explicitly poke
a XiveRouter.  A link to the router and offset (which range of router
irqs are used by this source) would be object properties.

powernv would use the first for "external" irq sources and the second
for internal ones.  spapr would use the second for everything.
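A minimal sketch of those two variants. The generic PQ code calls a single class-level `notify` hook, so the call site needs no conditional. Everything here is hypothetical: the type names, the bus-load callback standing in for the system address space, and especially the choice to encode the source number by adding it to the notify address. That encoding is only a placeholder for whatever the hardware actually does:

```c
#include <assert.h>
#include <stdint.h>

typedef struct SketchSource SketchSource;

/* Per-variant behaviour; the generic ESB/PQ code only calls notify(). */
typedef struct {
    void (*notify)(SketchSource *src, uint32_t srcno);
} SketchSourceClass;

struct SketchSource {
    const SketchSourceClass *klass;
};

/* Stand-in router: records the last global irq it was poked with. */
typedef struct {
    uint32_t last_girq;
} RouterStub;

/* Out-of-band: router link and offset are object properties. */
typedef struct {
    SketchSource parent;
    RouterStub *router;     /* "link" property */
    uint32_t offset;        /* first router irq used by this source */
} OutOfBandSource;

static void oob_notify(SketchSource *src, uint32_t srcno)
{
    OutOfBandSource *s = (OutOfBandSource *)src;
    s->router->last_girq = s->offset + srcno;
}

/* In-band: firmware programs a notify address and the source issues a
 * load there. The source never learns which device decodes it. */
typedef uint64_t (*SketchBusLoad)(void *bus, uint64_t addr);

typedef struct {
    SketchSource parent;
    SketchBusLoad load;
    void *bus;
    uint64_t notify_addr;   /* guest-programmed register: migrated state */
} InBandSource;

static void inband_notify(SketchSource *src, uint32_t srcno)
{
    InBandSource *s = (InBandSource *)src;
    /* Hypothetical encoding: source number added to the address. */
    (void)s->load(s->bus, s->notify_addr + srcno);
}

static const SketchSourceClass oob_class = { oob_notify };
static const SketchSourceClass inband_class = { inband_notify };

/* Toy bus: records the address of the last load. */
typedef struct {
    uint64_t last_addr;
} ToyBus;

static uint64_t toy_bus_load(void *bus, uint64_t addr)
{
    ((ToyBus *)bus)->last_addr = addr;
    return 0;
}
```

With this split the "2 cases" question in the original patch goes away. Each source type has exactly one notify behaviour, chosen by its class.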

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-04-26 10:43         ` Cédric Le Goater
@ 2018-05-03  5:22           ` David Gibson
  2018-05-03 16:50             ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  5:22 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 26, 2018 at 12:43:29PM +0200, Cédric Le Goater wrote:
> On 04/26/2018 06:20 AM, David Gibson wrote:
> > On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
> >> On 04/24/2018 08:51 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
> >>>> sPAPRXive is a model for the XIVE interrupt controller device of the
> >>>> sPAPR machine. It holds the routing XIVE table, the Interrupt
> >>>> Virtualization Entry (IVE) table which associates interrupt source
> >>>> numbers with targets.
> >>>>
> >>>> Also extend the XiveFabric with an accessor to the IVT. This will be
> >>>> needed by the routing algorithm.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>
> >>>>  May be should introduce a XiveRouter model to hold the IVT. To be
> >>>>  discussed.
> >>>
> >>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
> >>> than one XiveRouter?
> >>
> >> There is only one, the main IC. 
> > 
> > Ok, that's what I thought originally.  In that case some of the stuff
> > in the patches really doesn't make sense to me.
> 
> well, there is one IC per chip on powernv, but we haven't reached that part
> yet.

Hmm.  There's some things we can delay dealing with, but I don't think
this is one of them.  I think we need to understand how multichip is
going to work in order to come up with a sane architecture.  Otherwise
I fear we'll end up with something that we either need to horribly
bastardize for multichip, or have to rework things dramatically
leading to migration nightmares.

> >>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
> >>> interface, possibly its methods could just be class methods of
> >>> XiveRouter.
> >>
> >> Yes. We could introduce a XiveRouter to share the ivt table between 
> >> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
> >> the machines. Methods would provide a way to get the ivt/eq/nvt
> >> objects required for routing. I need to add a set_eq() to push the
> >> EQ data.
> > 
> > Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
> > object which owns the IVT.  
> 
> OK. that would be a model with some state and not an interface.

Yes.  For the PAPR variant it would have the whole IVT contents as its
state.  For the powernv, just the registers telling it where to find
the IVT in RAM.
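To make the difference in device state concrete, here are the two `get_ive` flavours under those assumptions. The PAPR one indexes an array the device owns. The powernv one only holds base and size registers and reads the entry out of memory. The names, the bounds checks, the flat byte array standing in for the real address space, and the 64-bit big-endian entry format are all illustrative assumptions:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t SketchIVEWord;   /* one IVE, kept opaque here */

/* PAPR variant: the whole IVT is device state (and must be migrated). */
typedef struct {
    SketchIVEWord *ivt;
    uint32_t nr_irqs;
} PaprRouterSketch;

static int papr_get_ive(PaprRouterSketch *r, uint32_t lisn, SketchIVEWord *ive)
{
    if (lisn >= r->nr_irqs) {
        return -1;
    }
    *ive = r->ivt[lisn];
    return 0;
}

/* powernv variant: only the registers are device state; the table lives
 * in RAM allocated by firmware. "ram" stands in for the address space. */
typedef struct {
    const uint8_t *ram;
    uint64_t ivt_base;       /* register programmed by firmware */
    uint32_t ivt_entries;    /* register programmed by firmware */
} PnvRouterSketch;

/* Entries are assumed to be stored big-endian in memory. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

static int pnv_get_ive(PnvRouterSketch *r, uint32_t lisn, SketchIVEWord *ive)
{
    if (lisn >= r->ivt_entries) {
        return -1;
    }
    *ive = load_be64(r->ram + r->ivt_base + (uint64_t)lisn * 8);
    return 0;
}
```

Both could sit behind the same `get_ive` class hook. Only `PaprRouterSketch` would carry the table in its vmstate.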

> > It may or may not do other stuff as well.
> 
> Its only task would be to do the final event routing: get the IVE,
> get the EQ, push the EQ DATA in the OS event queue, notify the CPU.

That seems like a lot of steps.  Up to push the EQ DATA, certainly.
And I guess it'll have to ping an NVT somehow, but I'm not sure it
should know about CPUs as such.
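The steps up to pushing the EQ data can be made concrete. The sketch below assumes the usual XIVE event-queue convention: each 32-bit entry carries a generation bit in its top bit, and that bit flips every time the queue index wraps. This lets the OS tell fresh entries from stale ones without anything clearing the queue. The descriptor fields and names are hypothetical:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical EQ descriptor state touched on each event push. */
typedef struct {
    uint32_t *qpage;     /* stand-in for the OS event queue in guest RAM */
    uint32_t qentries;   /* assumed to be a power of two */
    uint32_t qindex;     /* next slot to write */
    uint32_t qgen;       /* generation bit: starts at 1, flips on wrap */
} SketchEQD;

/* Write one event and advance index/generation. A real model would
 * store the entry big-endian into guest memory. */
static void sketch_eq_push(SketchEQD *eqd, uint32_t data)
{
    eqd->qpage[eqd->qindex] = (eqd->qgen << 31) | (data & 0x7fffffff);
    eqd->qindex = (eqd->qindex + 1) & (eqd->qentries - 1);
    if (eqd->qindex == 0) {
        eqd->qgen ^= 1;   /* new lap: entries from the previous lap become stale */
    }
}
```

Only `qindex` and `qgen` change in the descriptor. That is the "update EQ descriptors" step mentioned above, and it is independent of where the EQD table ends up living.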

I'm not sure at this stage what should own the EQD table.  In the
multichip case is there one EQD table for every IVT?  I'm guessing
not - I figure the EQD table must be effectively global so that any
chip's router can send events to any EQ in the whole system.

> > Now IIUC, on pnv the IVT lives in main system memory.  
> 
> yes. It is allocated by skiboot in RAM and fed to the HW using some 
> IC configuration registers. Then, each entry is configured with OPAL 
> calls and the HW is updated using cache scrub registers. 

Right.  At least for the first pass we should be able to treat the
cache scrub registers as no-ops and just not cache anything in the
qemu implementation.

> > Under PAPR is the IVT in guest memory, or is it outside (updated by
> > hypercalls/rtas)?
> 
> Under sPAPR, the IVT is updated by the H_INT_SET_SOURCE_CONFIG hcall
> which configures the targeting of an IRQ. It's not in the guest 
> memory.

Right.

> Under the hood, the IVT is still configured by OPAL under KVM and 
> by QEMU when kernel_irqchip=off

Sure.  Even with kernel_irqchip=on there's still logically a guest IVT
(or "IVT view" I guess), even if its actual entries are stored
distributed across various places in the host's IVTs.

> >> The XiveRouter would also be a XiveFabric (or some other name) to 
> >> let the internal sources of the interrupt controller forward events.
> > 
> > The further we go here, the less sure I am that XiveFabric even makes
> > sense as a concept.
> 
> See previous email.


* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-26  9:27     ` Cédric Le Goater
  2018-04-26 17:15       ` Cédric Le Goater
@ 2018-05-03  5:35       ` David Gibson
  2018-05-03 16:06         ` Cédric Le Goater
  1 sibling, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  5:35 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
> On 04/26/2018 09:11 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
> >> The XIVE presenter engine uses a set of registers to handle priority
> >> management and interrupt acknowledgment among other things. The most
> >> important ones being :
> >>
> >>   - Interrupt Priority Register (PIPR)
> >>   - Interrupt Pending Buffer (IPB)
> >>   - Current Processor Priority (CPPR)
> >>   - Notification Source Register (NSR)
> >>
> >> There is one set of registers per level of privilege, four in all :
> >> HW, HV pool, OS and User. These are called rings. All registers are
> >> accessible through a specific MMIO region called the Thread Interrupt
> >> Management Areas (TIMA) but, depending on the privilege level of the
> >> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
> >> OS privilege and therefore can only access the OS and the User
> >> rings. The others are for hypervisor levels.
> >>
> >> The CPU interrupt state is modeled with a XiveNVT object which stores
> >> the values of the different registers. The different TIMA views are
> >> mapped at the same address for each CPU and 'current_cpu' is used to
> >> retrieve the XiveNVT holding the ring registers.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>
> >>  Changes since v2 :
> >>
> >>  - introduced the XiveFabric interface
> >>
> >>  hw/intc/spapr_xive.c        |  25 ++++
> >>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
> >>  include/hw/ppc/spapr_xive.h |   5 +
> >>  include/hw/ppc/xive.h       |  31 +++++
> >>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
> >>  5 files changed, 424 insertions(+)
> >>
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> index 90cde8a4082d..f07832bf0a00 100644
> >> --- a/hw/intc/spapr_xive.c
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -13,6 +13,7 @@
> >>  #include "target/ppc/cpu.h"
> >>  #include "sysemu/cpus.h"
> >>  #include "monitor/monitor.h"
> >> +#include "hw/ppc/spapr.h"
> >>  #include "hw/ppc/spapr_xive.h"
> >>  #include "hw/ppc/xive.h"
> >>  #include "hw/ppc/xive_regs.h"
> >> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >>  
> >>      /* Allocate the Interrupt Virtualization Table */
> >>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> >> +
> >> +    /* The Thread Interrupt Management Area has the same address for
> >> +     * each chip. On sPAPR, we only need to expose the User and OS
> >> +     * level views of the TIMA.
> >> +     */
> >> +    xive->tm_base = XIVE_TM_BASE;
> > 
> > The constant should probably have PAPR in the name somewhere, since
> > it's just for PAPR machines (same for the ESB mappings, actually).
> 
> ok. 
> 
> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in 
> case we want to change the value when the guest is instantiated. 
> I doubt it but this is an address in the global address space, so 
> letting the machine have control is better I think.

I agree.

> >> +
> >> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
> >> +                          &xive_tm_user_ops, xive, "xive.tima.user",
> >> +                          1ull << TM_SHIFT);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
> >> +
> >> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
> >> +                          &xive_tm_os_ops, xive, "xive.tima.os",
> >> +                          1ull << TM_SHIFT);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
> >>  }
> >>  
> >>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> >>  }
> >>  
> >> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
> >> +{
> >> +    PowerPCCPU *cpu = spapr_find_cpu(server);
> >> +
> >> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
> >> +}
> > 
> > So this is a bit of a tangent, but I've been thinking of implementing
> > a scheme where there's an opaque pointer in the cpu structure for the
> > use of the machine.  I'm planning for that to replace the intc pointer
> > (which isn't really used directly by the cpu). That would allow us to
> > have spapr put a structure there and have both xics and xive pointers
> > which could be useful later on.
> 
> ok. That should simplify the patchset at the end, in which we need to 
> switch the 'intc' pointer. 
> 
> > I think we'd need something similar to correctly handle migration of
> > the VPA state, which is currently horribly broken.
> > 
> >> +
> >>  static const VMStateDescription vmstate_spapr_xive_ive = {
> >>      .name = TYPE_SPAPR_XIVE "/ive",
> >>      .version_id = 1,
> >> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >>      dc->vmsd = &vmstate_spapr_xive;
> >>  
> >>      xfc->get_ive = spapr_xive_get_ive;
> >> +    xfc->get_nvt = spapr_xive_get_nvt;
> >>  }
> >>  
> >>  static const TypeInfo spapr_xive_info = {
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index dccad0318834..5691bb9474e4 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -14,7 +14,278 @@
> >>  #include "sysemu/cpus.h"
> >>  #include "sysemu/dma.h"
> >>  #include "monitor/monitor.h"
> >> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
> >>  #include "hw/ppc/xive.h"
> >> +#include "hw/ppc/xive_regs.h"
> >> +
> >> +/*
> >> + * XIVE Interrupt Presenter
> >> + */
> >> +
> >> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
> >> +{
> >> +    return 0;
> >> +}
> >> +
> >> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
> >> +{
> >> +    if (cppr > XIVE_PRIORITY_MAX) {
> >> +        cppr = 0xff;
> >> +    }
> >> +
> >> +    nvt->ring_os[TM_CPPR] = cppr;
> > 
> > Surely this needs to recheck if we should be interrupting the cpu?
> 
> yes. In patch 9, when we introduce the nvt notify routine.

Ok.
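The recheck being asked for is small. A minimal sketch, assuming the usual XIVE presenter conventions: lower priority numbers are more favoured, the IPB holds one bit per priority with 0x80 for priority 0, and 0xff means "none". The struct, the `exception_pending` flag standing in for the CPU's external interrupt line, and the function names are hypothetical, not what the series uses:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_PRIORITY_MAX 7

/* Hypothetical per-ring presenter state. */
typedef struct {
    uint8_t ipb;              /* pending bits, 0x80 = priority 0 */
    uint8_t cppr;             /* current processor priority, 0xff = accept all */
    uint8_t pipr;             /* most favoured pending priority, 0xff if none */
    bool exception_pending;   /* stand-in for the CPU interrupt line */
} SketchRing;

static uint8_t ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= SKETCH_PRIORITY_MAX; prio++) {
        if (ipb & (0x80 >> prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Signal the CPU when a pending priority is more favoured than CPPR. */
static void sketch_ring_notify(SketchRing *ring)
{
    ring->pipr = ipb_to_pipr(ring->ipb);
    ring->exception_pending = ring->pipr < ring->cppr;
}

static void sketch_set_cppr(SketchRing *ring, uint8_t cppr)
{
    if (cppr > SKETCH_PRIORITY_MAX) {
        cppr = 0xff;
    }
    ring->cppr = cppr;
    sketch_ring_notify(ring);   /* the recheck after every CPPR change */
}
```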

> >> +}
> >> +
> >> +/*
> >> + * OS Thread Interrupt Management Area MMIO
> >> + */
> >> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
> >> +                                           unsigned size)
> >> +{
> >> +    uint64_t ret = -1;
> >> +
> >> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
> >> +        ret = xive_nvt_accept(nvt);
> >> +    } else {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
> >> +                      HWADDR_PRIx" size %d\n", offset, size);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +#define TM_RING(offset) ((offset) & 0xf0)
> >> +
> >> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
> >> +                                      unsigned size)
> >> +{
> >> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> > 
> > So, as I said on a previous version of this, we can actually correctly
> > represent different mappings in different cpu spaces, by exploiting
> > cpu->as and not just having them all point to &address_space_memory.
> 
> Yes, you did and I haven't studied the question yet. For the next version.

So, it's possible that using the cpu->as thing will be more trouble
than it's worth.  I am a little concerned about using current_cpu
though.  First, will it work with KVM with kernel_irqchip=off?  The
cpus are running truly concurrently, but we still need to work out
who's poking at the TIMA.  Second, are there any cases where we might
need to trigger this "on behalf of" a specific cpu that's not the
current one?

> >> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> >> +    uint64_t ret = -1;
> >> +    int i;
> >> +
> >> +    if (offset >= TM_SPC_ACK_EBB) {
> >> +        return xive_tm_read_special(nvt, offset, size);
> >> +    }
> >> +
> >> +    if (TM_RING(offset) != TM_QW1_OS) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
> >> +                      HWADDR_PRIx"\n", offset);
> >> +        return ret;
> > 
> > Just returning -1 would be clearer here.
> 
> ok.
> 
> > 
> >> +    }
> >> +
> >> +    ret = 0;
> >> +    for (i = 0; i < size; i++) {
> >> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
> >> +    }
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +static bool xive_tm_is_readonly(uint8_t offset)
> >> +{
> >> +    return offset != TM_QW1_OS + TM_CPPR;
> >> +}
> >> +
> >> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
> >> +                                        uint64_t value, unsigned size)
> >> +{
> >> +    /* TODO: support TM_SPC_SET_OS_PENDING */
> >> +
> >> +    /* TODO: support TM_SPC_ACK_OS_EL */
> >> +}
> >> +
> >> +static void xive_tm_os_write(void *opaque, hwaddr offset,
> >> +                                   uint64_t value, unsigned size)
> >> +{
> >> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> >> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> >> +    int i;
> >> +
> >> +    if (offset >= TM_SPC_ACK_EBB) {
> >> +        xive_tm_write_special(nvt, offset, value, size);
> >> +        return;
> >> +    }
> >> +
> >> +    if (TM_RING(offset) != TM_QW1_OS) {
> > 
> > Why have this if you have separate OS and user regions as you appear
> > to do below?
> 
> This is another problem we are trying to solve. 
> 
> The registers a CPU can access depend on the TIMA view it is using. 
> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
> 
> > Or to look at it another way, shouldn't it be possible to make the
> > read/write accessors the same for the OS and user rings?
> 
> For some parts yes, but the special load/store addresses are different
> for each view, the read-only register also. It seemed easier to duplicate.
> 
> I think the problem will become clearer (or worse) with pnv which uses 
> the HV mode.

Oh.  I had the impression that each ring had a basically identical set
of registers and you just had access to the region for your ring and
the ones below.  Are you saying instead it's basically a single block
of registers with various different privilege levels for each of them?

[snip]
> >> +}
> >> +
> >> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
> >> +{
> >> +    qemu_unregister_reset(xive_nvt_reset, dev);
> >> +}
> >> +
> >> +static void xive_nvt_init(Object *obj)
> >> +{
> >> +    XiveNVT *nvt = XIVE_NVT(obj);
> >> +
> >> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
> > 
> > The ring_os field is basically pointless, being just an offset into a
> > structure you already have.  A macro or inline would be a better idea.
> 
> ok. I liked the idea but I agree it's overkill to have an init routine
> just for this. I will find something.

That too, but it's also something that looks like an optimization but
isn't, which is bad practice.  On modern cpus math is cheap (and this
is just a trivial offset), memory accesses are expensive.  You're
essentially caching this offset - raising all the usual invalidation
questions for a cache - when caching it is *more* expensive than just
computing it every time.
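Replacing the cached pointer with a computed accessor is a one-liner. A sketch, assuming the ring layout of QEMU's `xive_regs.h`: four 16-byte rings (QW0 user, QW1 OS, QW2 HV pool, QW3 HV physical), with the OS ring at offset 0x10 and CPPR at byte 1 of a ring. The `Sketch*` names are placeholders:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: four 16-byte rings, OS ring at 0x10, CPPR at byte 1. */
#define SKETCH_TM_QW1_OS   0x10
#define SKETCH_TM_CPPR     0x1
#define SKETCH_TM_RING_SZ  0x10

typedef struct {
    uint8_t regs[4 * SKETCH_TM_RING_SZ];
} SketchNVT;

/* Computed on every use: nothing to initialise, migrate or keep in sync. */
static inline uint8_t *sketch_nvt_ring_os(SketchNVT *nvt)
{
    return &nvt->regs[SKETCH_TM_QW1_OS];
}
```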


* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-04-26 17:15       ` Cédric Le Goater
@ 2018-05-03  5:39         ` David Gibson
  2018-05-03 15:10           ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  5:39 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 26, 2018 at 07:15:29PM +0200, Cédric Le Goater wrote:
> On 04/26/2018 11:27 AM, Cédric Le Goater wrote:
> > On 04/26/2018 09:11 AM, David Gibson wrote:
> >> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
[snip]
> >>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
> >>> +                                   uint64_t value, unsigned size)
> >>> +{
> >>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> >>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> >>> +    int i;
> >>> +
> >>> +    if (offset >= TM_SPC_ACK_EBB) {
> >>> +        xive_tm_write_special(nvt, offset, value, size);
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    if (TM_RING(offset) != TM_QW1_OS) {
> >>
> >> Why have this if you have separate OS and user regions as you appear
> >> to do below?
> > 
> > This is another problem we are trying to solve. 
> > 
> > The registers a CPU can access depend on the TIMA view it is using. 
> > The OS TIMA view only sees the OS ring registers. The HV view sees all. 
> 
> So, I gave a deeper look at the specs and I understood a little more 
> details of the concepts behind. You need to do frequent round-trips 
> to this document ...  
> 
> These registers are accessible through four aligned pages, each exposing 
> a different view of the registers. First page (page address ending 
> in 0b00) gives access to the entire context and is reserved for the 
> ring 0 security monitor. The second (page address ending in 0b01) 
> is for the hypervisor, ring 1. The third (page address ending in 0b10) 
> is for the operating system, ring 2. The fourth (page address ending 
> in 0b11) is for user level, ring 3.
> 
> The sPAPR machine runs at the OS privilege and therefore can only 
> access the OS and the User rings, 2 and 3. The others are for
> hypervisor levels.

Ok, that much is what I thought.  What I'm less clear on is what each
page looks like compared to the others.  Previously I thought each one
had the same registers, just manipulating the corresponding ring.  Are
you saying instead that each ring's page basically has a subset of the
registers in the next most privileged page?

> I will try to come up with a better implementation of the model and
> make sure the ring numbers are respected. I am not sure whether we should
> have only one memory region or four distinct ones with their
> own ops. There are some differences in the load/store of each view.

Right.  I'm not clear at this point if that's for good reasons, or
just because IBM's hardware designers don't seem to have gotten the
hang of Don't Repeat Yourself.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-02  7:39     ` Cédric Le Goater
@ 2018-05-03  5:43       ` David Gibson
  2018-05-03 14:42         ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  5:43 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Wed, May 02, 2018 at 09:39:44AM +0200, Cédric Le Goater wrote:
> >>  
> >> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
> >> +{
> >> +    PowerPCCPU *cpu = spapr_find_cpu(server);
> >> +
> >> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
> >> +}
> > 
> > So this is a bit of a tangent, but I've been thinking of implementing
> > a scheme where there's an opaque pointer in the cpu structure for the
> > use of the machine.  I'm planning for that to replace the intc pointer
> > (which isn't really used directly by the cpu). That would allow us to
> > have spapr put a structure there and have both xics and xive pointers
> > which could be useful later on.
> 
> Here is a quick try of the idea. Tested on pnv and spapr machines.
> I lacked inspiration for the name, so I called the object
> {Machine}Link.

This is a bit overkill compared to what I had in mind.  I don't think
the thing we're pointing to has to be a fully realized QOM object.  I
was just going to replace the Object * with a void * that it's up to
the machine to interpret.

I'm also wondering about restricting this idea to vhyp platforms.  The
idea is that for physical-esque machines the cpu really does (or
should) know how things are connected to it.  It's the abstraction of
the paravirt platform that makes it fuzzy.  In which case I'd see it
as the "opaque" pointer that goes along with the vhyp function
pointers.
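The following is a rough sketch of the opaque-pointer idea. All names are hypothetical: `PowerPCCPUSketch`, `machine_data`, `SpaprCpuStateSketch` and the stub presenter types are illustrations, not QEMU's API. The CPU only carries a `void *`, and the sPAPR machine allocates and interprets a plain struct that holds both the XICS and XIVE presenter pointers.

```c
#include <stdlib.h>

/* Minimal stand-ins for the two presenter models. */
typedef struct IcpStub { int id; } IcpStub;   /* XICS presenter */
typedef struct NvtStub { int id; } NvtStub;   /* XIVE presenter (XiveNVT) */

/* Hypothetical CPU: an opaque pointer replaces the 'intc' Object *. */
typedef struct PowerPCCPUSketch {
    int   vcpu_id;
    void *machine_data;   /* owned and interpreted by the machine only */
} PowerPCCPUSketch;

/* Hypothetical sPAPR per-CPU state: a plain struct, not a QOM object,
 * holding both presenters so the interrupt mode can be switched. */
typedef struct SpaprCpuStateSketch {
    IcpStub *icp;
    NvtStub *nvt;
} SpaprCpuStateSketch;

/* Allocate zeroed per-CPU state; QEMU code would use g_malloc0(). */
static int spapr_cpu_state_init(PowerPCCPUSketch *cpu, IcpStub *icp,
                                NvtStub *nvt)
{
    SpaprCpuStateSketch *st = calloc(1, sizeof(*st));

    if (!st) {
        return -1;
    }
    st->icp = icp;
    st->nvt = nvt;
    cpu->machine_data = st;
    return 0;
}

/* Only the sPAPR machine casts the opaque pointer back. */
static NvtStub *spapr_cpu_nvt(const PowerPCCPUSketch *cpu)
{
    const SpaprCpuStateSketch *st = cpu->machine_data;
    return st ? st->nvt : NULL;
}

static IcpStub *spapr_cpu_icp(const PowerPCCPUSketch *cpu)
{
    const SpaprCpuStateSketch *st = cpu->machine_data;
    return st ? st->icp : NULL;
}

static void spapr_cpu_state_fini(PowerPCCPUSketch *cpu)
{
    free(cpu->machine_data);
    cpu->machine_data = NULL;
}
```

A plain struct like this needs no QOM type registration. If the idea were restricted to vhyp platforms, the pointer would simply travel alongside the vhyp function pointers.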

* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-04-26  9:48     ` Cédric Le Goater
@ 2018-05-03  5:45       ` David Gibson
  2018-05-03  6:07         ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  5:45 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
> On 04/26/2018 09:25 AM, David Gibson wrote:
> > On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
> >> The Event Queue Descriptor (EQD) table is an internal table of the
> >> XIVE routing sub-engine. It specifies on which Event Queue the event
> >> data should be posted when an exception occurs (later on pulled by the
> >> OS) and which Virtual Processor to notify.
> > 
> > Uhhh.. I thought the IVT said which queue and vp to notify, and the
> > EQD gave metadata for event queues.
> 
> Yes, the above is poorly written. The Event Queue Descriptor contains the
> guest address of the event queue in which the data is written. I will
> rephrase.
> 
> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
> and what data to push on the queue. 
>  
> >> The Event Queue is a much
> >> more complex structure but we start with a simple model for the sPAPR
> >> machine.
> >>
> >> There is one XiveEQ per priority and these are stored under the XIVE
> >> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
> >>
> >>        (server << 3) | (priority & 0x7)
> >>
> >> This is not in the XIVE architecture, but as the EQ index is never
> >> exposed to the guest, neither in the hcalls nor in the device tree, we
> >> are free to use what best fits the current model.
> 
> This EQ indexing is important to notice because it will also show up 
> in KVM to build the IVE from the KVM irq state.
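The EQ indexing scheme above can be sketched as a pair of encode/decode helpers. Only the `(server << 3) | (priority & 0x7)` formula comes from the patch. The helper and macro names are hypothetical.

```c
#include <stdint.h>

/* One EQ per priority, 8 priorities per server: 3 bits are enough. */
#define EQ_PRIO_BITS 3
#define EQ_PRIO_MASK ((1u << EQ_PRIO_BITS) - 1)   /* 0x7 */

/* Encode as in the patch: (server << 3) | (priority & 0x7). */
static inline uint32_t spapr_eq_index(uint32_t server, uint8_t priority)
{
    return (server << EQ_PRIO_BITS) | (priority & EQ_PRIO_MASK);
}

/* Recover the server (vCPU) number from an EQ index. */
static inline uint32_t spapr_eq_server(uint32_t eq_idx)
{
    return eq_idx >> EQ_PRIO_BITS;
}

/* Recover the priority from an EQ index. */
static inline uint8_t spapr_eq_priority(uint32_t eq_idx)
{
    return (uint8_t)(eq_idx & EQ_PRIO_MASK);
}
```

With this layout each vCPU owns eight consecutive EQ indices. For example, server 2 at priority 5 maps to EQ index 21, and decoding 21 gives back (2, 5).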

Ok, are you saying that while this combined EQ index will never appear
in guest <-> host interfaces, it might show up in qemu <-> KVM
interfaces?

> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > 
> > Is the EQD actually modifiable by a guest?  Or are the settings of the
> > EQs fixed by PAPR?
> 
> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
> of the event queue for a given (priority, server) pair.

Ok, so the EQD can be modified by the guest.  In which case we need to
work out what object owns it, since it'll need to migrate it.

* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-03  5:45       ` David Gibson
@ 2018-05-03  6:07         ` Cédric Le Goater
  2018-05-03  6:25           ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-03  6:07 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 07:45 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 09:25 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
>>>> The Event Queue Descriptor (EQD) table is an internal table of the
>>>> XIVE routing sub-engine. It specifies on which Event Queue the event
>>>> data should be posted when an exception occurs (later on pulled by the
>>>> OS) and which Virtual Processor to notify.
>>>
>>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
>>> EQD gave metadata for event queues.
>>
>> Yes, the above is poorly written. The Event Queue Descriptor contains the
>> guest address of the event queue in which the data is written. I will
>> rephrase.
>>
>> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
>> and what data to push on the queue. 
>>  
>>>> The Event Queue is a much
>>>> more complex structure but we start with a simple model for the sPAPR
>>>> machine.
>>>>
>>>> There is one XiveEQ per priority and these are stored under the XIVE
>>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
>>>>
>>>>        (server << 3) | (priority & 0x7)
>>>>
>>>> This is not in the XIVE architecture, but as the EQ index is never
>>>> exposed to the guest, neither in the hcalls nor in the device tree, we
>>>> are free to use what best fits the current model.
>>
>> This EQ indexing is important to notice because it will also show up 
>> in KVM to build the IVE from the KVM irq state.
> 
> Ok, are you saying that while this combined EQ index will never appear
> in guest <-> host interfaces, 

Indeed.

> it might show up in qemu <-> KVM interfaces?

Not directly, but it is part of the IVE, as the IVE_EQ_INDEX field. When
the IVE is dumped, this field has to be built in a way that is compatible
with the emulated mode in QEMU.

>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>
>>> Is the EQD actually modifiable by a guest?  Or are the settings of the
>>> EQs fixed by PAPR?
>>
>> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
>> of the event queue for a given (priority, server) pair.
> 
> Ok, so the EQD can be modified by the guest.  In which case we need to
> work out what object owns it, since it'll need to migrate it.

Indeed. The EQDs are CPU-related, as there is one EQD per (cpu,
priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
using per-CPU ioctls. The EQ in the OS RAM is marked dirty at that
stage.

C. 

* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-03  6:07         ` Cédric Le Goater
@ 2018-05-03  6:25           ` David Gibson
  2018-05-03 14:37             ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-03  6:25 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, May 03, 2018 at 08:07:54AM +0200, Cédric Le Goater wrote:
> On 05/03/2018 07:45 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 09:25 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
> >>>> The Event Queue Descriptor (EQD) table is an internal table of the
> >>>> XIVE routing sub-engine. It specifies on which Event Queue the event
> >>>> data should be posted when an exception occurs (later on pulled by the
> >>>> OS) and which Virtual Processor to notify.
> >>>
> >>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
> >>> EQD gave metadata for event queues.
> >>
> >> Yes, the above is poorly written. The Event Queue Descriptor contains the
> >> guest address of the event queue in which the data is written. I will
> >> rephrase.
> >>
> >> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
> >> and what data to push on the queue. 
> >>  
> >>>> The Event Queue is a much
> >>>> more complex structure but we start with a simple model for the sPAPR
> >>>> machine.
> >>>>
> >>>> There is one XiveEQ per priority and these are stored under the XIVE
> >>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
> >>>>
> >>>>        (server << 3) | (priority & 0x7)
> >>>>
> >>>> This is not in the XIVE architecture, but as the EQ index is never
> >>>> exposed to the guest, neither in the hcalls nor in the device tree, we
> >>>> are free to use what best fits the current model.
> >>
> >> This EQ indexing is important to notice because it will also show up 
> >> in KVM to build the IVE from the KVM irq state.
> > 
> > Ok, are you saying that while this combined EQ index will never appear
> > in guest <-> host interfaces, 
> 
> Indeed.
> 
> > it might show up in qemu <-> KVM interfaces?
> 
> Not directly, but it is part of the IVE, as the IVE_EQ_INDEX field. When
> the IVE is dumped, this field has to be built in a way that is compatible
> with the emulated mode in QEMU.

Hrm.  But are the exact IVE contents visible to qemu (for a PAPR
guest)?  I would have thought the qemu <-> KVM interfaces would have
abstracted this the same way the guest <-> KVM interfaces do.  Or is
there a reason not to?

> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>
> >>> Is the EQD actually modifiable by a guest?  Or are the settings of the
> >>> EQs fixed by PAPR?
> >>
> >> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
> >> of the event queue for a given (priority, server) pair.
> > 
> > Ok, so the EQD can be modified by the guest.  In which case we need to
> > work out what object owns it, since it'll need to migrate it.
> 
> Indeed. The EQDs are CPU-related, as there is one EQD per (cpu,
> priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
> using per-CPU ioctls. The EQ in the OS RAM is marked dirty at that
> stage.

To make sure I'm clear: for PAPR there's a strict relationship between
EQD and CPU (one EQD for each (cpu, priority) tuple).  But for powernv
that's not the case, right?  AIUI the mapping of EQs to cpus was
configurable, is that right?


* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-03  6:25           ` David Gibson
@ 2018-05-03 14:37             ` Cédric Le Goater
  2018-05-04  5:19               ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-03 14:37 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 08:25 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 08:07:54AM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 07:45 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 09:25 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
>>>>>> The Event Queue Descriptor (EQD) table is an internal table of the
>>>>>> XIVE routing sub-engine. It specifies on which Event Queue the event
>>>>>> data should be posted when an exception occurs (later on pulled by the
>>>>>> OS) and which Virtual Processor to notify.
>>>>>
>>>>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
>>>>> EQD gave metadata for event queues.
>>>>
>>>> Yes, the above is poorly written. The Event Queue Descriptor contains the
>>>> guest address of the event queue in which the data is written. I will
>>>> rephrase.
>>>>
>>>> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
>>>> and what data to push on the queue. 
>>>>  
>>>>>> The Event Queue is a much
>>>>>> more complex structure but we start with a simple model for the sPAPR
>>>>>> machine.
>>>>>>
>>>>>> There is one XiveEQ per priority and these are stored under the XIVE
>>>>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
>>>>>>
>>>>>>        (server << 3) | (priority & 0x7)
>>>>>>
>>>>>> This is not in the XIVE architecture, but as the EQ index is never
>>>>>> exposed to the guest, neither in the hcalls nor in the device tree, we
>>>>>> are free to use what best fits the current model.
>>>>
>>>> This EQ indexing is important to notice because it will also show up 
>>>> in KVM to build the IVE from the KVM irq state.
>>>
>>> Ok, are you saying that while this combined EQ index will never appear
>>> in guest <-> host interfaces, 
>>
>> Indeed.
>>
>>> it might show up in qemu <-> KVM interfaces?
>>
>> Not directly, but it is part of the IVE, as the IVE_EQ_INDEX field. When
>> the IVE is dumped, this field has to be built in a way that is compatible
>> with the emulated mode in QEMU.
> 
> Hrm.  But are the exact IVE contents visible to qemu (for a PAPR
> guest)?

The guest only uses hcalls whose arguments are:

	- cpu numbers,
	- priority numbers from defined ranges,
	- logical interrupt numbers,
	- the physical address of the EQ.

The parts of the IVE visible to the guest are the 'priority', the 'cpu',
and the 'eisn', which is the effective IRQ number the guest assigns
to the source. The 'eisn' will be pushed in the EQ.

The IVE EQ index is not visible.
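The following deliberately simplified sketch shows which IVE fields come from the guest and which stay internal. In the real design, the IVE is a packed 64-bit word and the EQ lives in guest RAM at the address given through H_INT_SET_QUEUE_CONFIG. This version instead uses plain structs and a small in-memory ring. All type and function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified IVE: the real entry is one packed 64-bit
 * word; only the fields discussed in this thread are kept. */
typedef struct IveSketch {
    bool     valid;
    uint32_t eq_index;   /* internal: never passed through the hcalls */
    uint32_t eisn;       /* effective IRQ number chosen by the guest */
} IveSketch;

/* Hypothetical EQ: a small in-memory ring standing in for the queue
 * the guest allocates in its own RAM. */
#define EQ_ENTRIES 8
typedef struct EqSketch {
    uint32_t entries[EQ_ENTRIES];
    uint32_t idx;        /* next slot to fill */
} EqSketch;

/* Configure a source from guest-visible values only (target cpu,
 * priority, eisn); the EQ index is derived internally with the
 * (server << 3) | (priority & 0x7) scheme from the patch. */
static void ive_configure(IveSketch *ive, uint32_t server,
                          uint8_t priority, uint32_t eisn)
{
    ive->valid = true;
    ive->eq_index = (server << 3) | (priority & 0x7);
    ive->eisn = eisn;
}

/* Route one event: follow the IVE to its EQ and push the eisn. */
static bool ive_route(const IveSketch *ive, EqSketch *eqs, size_t nr_eqs)
{
    EqSketch *eq;

    if (!ive->valid || ive->eq_index >= nr_eqs) {
        return false;
    }
    eq = &eqs[ive->eq_index];
    eq->entries[eq->idx] = ive->eisn;
    eq->idx = (eq->idx + 1) % EQ_ENTRIES;
    return true;
}
```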
 
> I would have thought the qemu <-> KVM interfaces would have
> abstracted this the same way the guest <-> KVM interfaces do.  Or is
> there a reason not to?

It is practical to dump 64-bit IVEs directly from KVM into the QEMU
internal structures, because they fit the emulated mode without any
translation. This might be seen as a shortcut; let me know what you
think when you reach the KVM part.

>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>
>>>>> Is the EQD actually modifiable by a guest?  Or are the settings of the
>>>>> EQs fixed by PAPR?
>>>>
>>>> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
>>>> of the event queue for a given (priority, server) pair.
>>>
>>> Ok, so the EQD can be modified by the guest.  In which case we need to
>>> work out what object owns it, since it'll need to migrate it.
>>
>> Indeed. The EQDs are CPU-related, as there is one EQD per (cpu,
>> priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
>> using per-CPU ioctls. The EQ in the OS RAM is marked dirty at that
>> stage.
> 
> To make sure I'm clear: for PAPR there's a strict relationship between
> EQD and CPU (one EQD for each (cpu, priority) tuple).  

Yes.

> But for powernv that's not the case, right?  

It is.

> AIUI the mapping of EQs to cpus was configurable, is that right?

Each CPU has 8 EQDs. The same is true for virtual CPUs.

I am not sure what you understood before. It is surely something
I wrote; my understanding of XIVE is still making progress.


C.

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-03  5:43       ` David Gibson
@ 2018-05-03 14:42         ` Cédric Le Goater
  0 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-03 14:42 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 07:43 AM, David Gibson wrote:
> On Wed, May 02, 2018 at 09:39:44AM +0200, Cédric Le Goater wrote:
>>>>  
>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>>> +{
>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>>> +
>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>>> +}
>>>
>>> So this is a bit of a tangent, but I've been thinking of implementing
>>> a scheme where there's an opaque pointer in the cpu structure for the
>>> use of the machine.  I'm planning for that to replace the intc pointer
>>> (which isn't really used directly by the cpu). That would allow us to
>>> have spapr put a structure there and have both xics and xive pointers
>>> which could be useful later on.
>>
>> Here is a quick try of the idea. Tested on pnv and spapr machines.
>> I lacked inspiration on the name so I called the object
>> {Machine}Link.
> 
> This is a bit overkill compared to what I had in mind.  I don't think
> the thing we're pointing to has to be a fully realized QOM object. 

Yes, it is quite a bit of code for a simple struct.

> I was just going to replace the Object * with a void *, that it's up to
> the machine to interpret.

So the machine would just g_malloc0 a custom struct for each CPU, filling
it out depending on the configuration/needs?
 
> I'm also wondering about restricting this idea to vhyp platforms.  

OK. 

> The idea is that for physical-esque machines the cpu really does (or
> should) know how things are connected to it.  

yes. P9 does not have a XICS interrupt controller on PowerNV.

> It's the abstraction of
> the paravirt platform that makes it fuzzy.  In which case I'd see it
> as the "opaque" pointer that goes along with the vhyp function
> pointers.

I will take a look at the intc pointer, as XIVE needs an extra one.

C.

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-03  5:39         ` David Gibson
@ 2018-05-03 15:10           ` Cédric Le Goater
  2018-05-04  4:44             ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-03 15:10 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 07:39 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 07:15:29PM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 11:27 AM, Cédric Le Goater wrote:
>>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
> [snip]
>>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>>>> +                                   uint64_t value, unsigned size)
>>>>> +{
>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>> +    int i;
>>>>> +
>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>> +        xive_tm_write_special(nvt, offset, value, size);
>>>>> +        return;
>>>>> +    }
>>>>> +
>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>
>>>> Why have this if you have separate OS and user regions as you appear
>>>> to do below?
>>>
>>> This is another problem we are trying to solve. 
>>>
>>> The registers a CPU can access depend on the TIMA view it is using.
>>> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
>>
>> So, I took a deeper look at the specs and understood the concepts
>> behind them in a little more detail. You need to make frequent
>> round-trips to this document...
>>
>> These registers are accessible through four aligned pages, each exposing
>> a different view of the registers. The first page (page address ending
>> in 0b00) gives access to the entire context and is reserved for the
>> ring 0 security monitor. The second (page address ending in 0b01)
>> is for the hypervisor, ring 1. The third (page address ending in 0b10)
>> is for the operating system, ring 2. The fourth (page address ending
>> in 0b11) is for user level, ring 3.
>>
>> The sPAPR machine runs at the OS privilege level and therefore can only
>> access the OS and the User rings, 2 and 3. The others are for
>> hypervisor levels.
> 
> Ok, that much is what I thought.  What I'm less clear on is what each
> page looks like compared to the others.  Previously I thought each one
> had the same registers, 

yes.

> just manipulating the corresponding ring.  

no. 

> Are you saying instead that each ring's page basically has a subset 
> of the registers in the next most privileged page?

That's the idea. 

The registers are defined as follows:

	QW-0 User      
	QW-1 O/S      
	QW-2 Pool   
	QW-3 Physical 

and the pages :

- 0006030203180000 security monitor 
  can access all registers 

- 0006030203190000 hv
  can access all registers minus the secure regs

- 00060302031a0000 os
  can access some of the OS (QW1) and User (QW0) registers
 
- 00060302031b0000 user
  can only access the NSR of the User (QW0) registers

On sPAPR, we can remap the os/user pages to some other base address 
but we should keep the same page offset.
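Putting the QW layout and the page list together, a coarse model of per-view filtering might look like the following sketch. It makes three assumptions:

- The QW offsets are 0x00/0x10/0x20/0x30, as suggested by `TM_RING(offset) ((offset) & 0xf0)` in the patch.
- The NSR and CPPR byte offsets inside a QW are 0 and 1.
- A view may touch a QW or not, as a whole. The finer per-register rules listed above are not reproduced.

The read loop mirrors the big-endian assembly in `xive_tm_os_read()`. The function and enum names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed quad-word (QW) offsets, following the TM_QW* naming used in
 * the patch: one 16-byte QW per ring at the start of each page. */
#define TM_QW0_USER     0x00
#define TM_QW1_OS       0x10
#define TM_QW2_HV_POOL  0x20
#define TM_QW3_HV_PHYS  0x30
#define TM_RING_SIZE    0x40          /* 4 QWs x 16 bytes */

/* Assumed byte offsets of two registers inside a QW. */
#define TM_NSR   0x0
#define TM_CPPR  0x1

#define TM_RING(offset) ((offset) & 0xf0)   /* as in the patch */

/* View numbers as in the page list: 0 security, 1 HV, 2 OS, 3 user. */
enum { VIEW_SEC = 0, VIEW_HV = 1, VIEW_OS = 2, VIEW_USER = 3 };

/* Coarse rule: the least privileged view allowed to touch each QW.
 * The finer filtering (user view seeing only the NSR, secure registers
 * hidden from the HV view) is deliberately left out. */
static bool tm_view_can_access(unsigned view, uint32_t offset)
{
    static const unsigned least_priv[4] = {
        VIEW_USER,   /* QW-0 User */
        VIEW_OS,     /* QW-1 O/S */
        VIEW_HV,     /* QW-2 Pool */
        VIEW_HV,     /* QW-3 Physical */
    };

    if (offset >= TM_RING_SIZE) {
        return false;                 /* special ops not modeled here */
    }
    return view <= least_priv[TM_RING(offset) >> 4];
}

/* Big-endian multi-byte read, like the loop in xive_tm_os_read();
 * returns all ones when the view may not access the whole range. */
static uint64_t tm_read(const uint8_t regs[TM_RING_SIZE], unsigned view,
                        uint32_t offset, unsigned size)
{
    uint64_t ret = 0;

    if (size == 0 || size > 8 || offset + size > TM_RING_SIZE ||
        !tm_view_can_access(view, offset) ||
        !tm_view_can_access(view, offset + size - 1)) {
        return UINT64_MAX;
    }
    for (unsigned i = 0; i < size; i++) {
        ret |= (uint64_t)regs[offset + i] << (8 * (size - i - 1));
    }
    return ret;
}
```

Keeping a single handler that takes the view as a parameter is one way to answer the "one memory region or four" question. Four regions could each pass a fixed view through their opaque pointer.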


>> I will try to come up with a better implementation of the model and
>> make sure the ring numbers are respected. I am not sure whether we should
>> have only one memory region or four distinct ones with their
>> own ops. There are some differences in the load/store of each view.
> 
> Right.  I'm not clear at this point if that's for good reasons, or
> just because IBM's hardware designers don't seem to have gotten the
> hang of Don't Repeat Yourself.
> 

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-03  5:35       ` David Gibson
@ 2018-05-03 16:06         ` Cédric Le Goater
  2018-05-04  4:51           ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-03 16:06 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 07:35 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>>> The XIVE presenter engine uses a set of registers to handle priority
>>>> management and interrupt acknowledgment, among other things. The most
>>>> important ones are:
>>>>
>>>>   - Interrupt Priority Register (PIPR)
>>>>   - Interrupt Pending Buffer (IPB)
>>>>   - Current Processor Priority (CPPR)
>>>>   - Notification Source Register (NSR)
>>>>
>>>> There is one set of registers per level of privilege, four in all :
>>>> HW, HV pool, OS and User. These are called rings. All registers are
>>>> accessible through a specific MMIO region called the Thread Interrupt
>>>> Management Areas (TIMA) but, depending on the privilege level of the
>>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>>>> OS privilege level and therefore can only access the OS and the User
>>>> rings. The others are for hypervisor levels.
>>>>
>>>> The CPU interrupt state is modeled with a XiveNVT object which stores
>>>> the values of the different registers. The different TIMA views are
>>>> mapped at the same address for each CPU and 'current_cpu' is used to
>>>> retrieve the XiveNVT holding the ring registers.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>
>>>>  Changes since v2 :
>>>>
>>>>  - introduced the XiveFabric interface
>>>>
>>>>  hw/intc/spapr_xive.c        |  25 ++++
>>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>>>  include/hw/ppc/spapr_xive.h |   5 +
>>>>  include/hw/ppc/xive.h       |  31 +++++
>>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>>>  5 files changed, 424 insertions(+)
>>>>
>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>> index 90cde8a4082d..f07832bf0a00 100644
>>>> --- a/hw/intc/spapr_xive.c
>>>> +++ b/hw/intc/spapr_xive.c
>>>> @@ -13,6 +13,7 @@
>>>>  #include "target/ppc/cpu.h"
>>>>  #include "sysemu/cpus.h"
>>>>  #include "monitor/monitor.h"
>>>> +#include "hw/ppc/spapr.h"
>>>>  #include "hw/ppc/spapr_xive.h"
>>>>  #include "hw/ppc/xive.h"
>>>>  #include "hw/ppc/xive_regs.h"
>>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>>  
>>>>      /* Allocate the Interrupt Virtualization Table */
>>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>> +
>>>> +    /* The Thread Interrupt Management Area has the same address for
>>>> +     * each chip. On sPAPR, we only need to expose the User and OS
>>>> +     * level views of the TIMA.
>>>> +     */
>>>> +    xive->tm_base = XIVE_TM_BASE;
>>>
>>> The constant should probably have PAPR in the name somewhere, since
>>> it's just for PAPR machines (same for the ESB mappings, actually).
>>
>> ok. 
>>
>> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in
>> case we want to change the value when the guest is instantiated.
>> I doubt we will, but this is an address in the global address space,
>> so letting the machine have control is better, I think.
> 
> I agree.
> 
>>>> +
>>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>>>> +                          1ull << TM_SHIFT);
>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>>>> +
>>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>>>> +                          1ull << TM_SHIFT);
>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>>>  }
>>>>  
>>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>>  }
>>>>  
>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>>> +{
>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>>> +
>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>>> +}
>>>
>>> So this is a bit of a tangent, but I've been thinking of implementing
>>> a scheme where there's an opaque pointer in the cpu structure for the
>>> use of the machine.  I'm planning for that to replace the intc pointer
>>> (which isn't really used directly by the cpu). That would allow us to
>>> have spapr put a structure there and have both xics and xive pointers
>>> which could be useful later on.
>>
>> ok. That should simplify the patchset at the end, in which we need to 
>> switch the 'intc' pointer. 
>>
>>> I think we'd need something similar to correctly handle migration of
>>> the VPA state, which is currently horribly broken.
>>>
>>>> +
>>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>>>      .name = TYPE_SPAPR_XIVE "/ive",
>>>>      .version_id = 1,
>>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>      dc->vmsd = &vmstate_spapr_xive;
>>>>  
>>>>      xfc->get_ive = spapr_xive_get_ive;
>>>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>>>  }
>>>>  
>>>>  static const TypeInfo spapr_xive_info = {
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index dccad0318834..5691bb9474e4 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -14,7 +14,278 @@
>>>>  #include "sysemu/cpus.h"
>>>>  #include "sysemu/dma.h"
>>>>  #include "monitor/monitor.h"
>>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>>>>  #include "hw/ppc/xive.h"
>>>> +#include "hw/ppc/xive_regs.h"
>>>> +
>>>> +/*
>>>> + * XIVE Interrupt Presenter
>>>> + */
>>>> +
>>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>>>> +{
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>>>> +{
>>>> +    if (cppr > XIVE_PRIORITY_MAX) {
>>>> +        cppr = 0xff;
>>>> +    }
>>>> +
>>>> +    nvt->ring_os[TM_CPPR] = cppr;
>>>
>>> Surely this needs to recheck if we should be interrupting the cpu?
>>
>> yes. In patch 9, when we introduce the nvt notify routine.
> 
> Ok.
> 
>>>> +}
>>>> +
>>>> +/*
>>>> + * OS Thread Interrupt Management Area MMIO
>>>> + */
>>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>>>> +                                           unsigned size)
>>>> +{
>>>> +    uint64_t ret = -1;
>>>> +
>>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>>>> +        ret = xive_nvt_accept(nvt);
>>>> +    } else {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>>>> +                      HWADDR_PRIx" size %d\n", offset, size);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +#define TM_RING(offset) ((offset) & 0xf0)
>>>> +
>>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>>>> +                                      unsigned size)
>>>> +{
>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>
>>> So, as I said on a previous version of this, we can actually correctly
>>> represent different mappings in different cpu spaces, by exploiting
>>> cpu->as and not just having them all point to &address_space_memory.
>>
>> Yes, you did and I haven't studied the question yet. For the next version.
> 
> So, it's possible that using the cpu->as thing will be more trouble
> than it's worth.

One of the troubles is the number of memory regions needed, one per CPU,
and the KVM support. Having a single region is much easier.

> I am a little concerned about using current_cpu though.  
> First, will it work with KVM with kernel_irqchip=off - the
> cpus are running truly concurrently,

FWIW, I haven't seen any issues yet while stress testing.

> but we still need to work out who's poking at the TIMA.  

I understand. The registers are accessed by the current CPU to set the
CPPR and to ack an interrupt. But when we route an event, we also access
and modify the registers. Are you suggesting some locking? I am not sure
how the TIMA region accesses are protected against the routing, which is
necessarily initiated by an ESB MMIO, though.

> Second, are there any cases where we might
> need to trip this "on behalf of" a specific cpu that's not the current
> one.

Ah, yes, sort of :) Only on powernv, when the XIVE is reset (and when
dumping the state for debugging).

The IC has a way to indirectly access the registers of a HW thread.
It first sets the PC_TCTXT_INDIR_THRDID register with the PIR of
the targeted thread; loads from the indirect TIMA can then be done
as if it were the current thread. The indirect TIMA is mapped 4 pages
after the IC BAR.

The resulting memory region op is a little ugly and might need
some rework:

static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset,
                                 unsigned size)
{
    PowerPCCPU **cpuptr = opaque;
    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
    ...
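
A minimal, self-contained sketch of the indirect access described above, assuming a toy IC model: `IcModel`, `ThreadCtx`, `ic_set_indir_thread()` and `ic_indirect_tima_read()` are hypothetical names, not QEMU APIs, and the 64-byte register file is an assumption. The point it illustrates is that the target thread comes from the PIR written to PC_TCTXT_INDIR_THRDID, not from whichever CPU issued the load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_THREADS   4
#define TM_REGS_SIZE 64               /* assumed: 4 rings x 16 bytes */

typedef struct ThreadCtx {
    uint32_t pir;                     /* processor identification register */
    uint8_t regs[TM_REGS_SIZE];       /* ring registers of this HW thread */
} ThreadCtx;

typedef struct IcModel {
    ThreadCtx threads[NR_THREADS];
    bool indir_valid;                 /* a thread has been selected */
    uint32_t indir_pir;               /* value written to PC_TCTXT_INDIR_THRDID */
} IcModel;

static ThreadCtx *ic_find_thread(IcModel *ic, uint32_t pir)
{
    for (size_t i = 0; i < NR_THREADS; i++) {
        if (ic->threads[i].pir == pir) {
            return &ic->threads[i];
        }
    }
    return NULL;
}

/* Store to the indirect thread-ID register: selects the target thread. */
static void ic_set_indir_thread(IcModel *ic, uint32_t pir)
{
    ic->indir_pir = pir;
    ic->indir_valid = true;
}

/* Byte load through the indirect TIMA: reads the selected thread's
 * registers, or returns -1 if no valid thread is selected. */
static int ic_indirect_tima_read(IcModel *ic, uint32_t offset)
{
    ThreadCtx *t;

    if (!ic->indir_valid || offset >= TM_REGS_SIZE) {
        return -1;
    }
    t = ic_find_thread(ic, ic->indir_pir);
    return t ? t->regs[offset] : -1;
}
```

Keeping the selected PIR in the IC state, rather than in a `PowerPCCPU **` opaque, means the "on behalf of" case never has to fall back on `current_cpu`.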


>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>> +    uint64_t ret = -1;
>>>> +    int i;
>>>> +
>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>> +        return xive_tm_read_special(nvt, offset, size);
>>>> +    }
>>>> +
>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>>>> +                      HWADDR_PRIx"\n", offset);
>>>> +        return ret;
>>>
>>> Just return -1 would be clearer here;
>>
>> ok.
>>
>>>
>>>> +    }
>>>> +
>>>> +    ret = 0;
>>>> +    for (i = 0; i < size; i++) {
>>>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +static bool xive_tm_is_readonly(uint8_t offset)
>>>> +{
>>>> +    return offset != TM_QW1_OS + TM_CPPR;
>>>> +}
>>>> +
>>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
>>>> +                                        uint64_t value, unsigned size)
>>>> +{
>>>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
>>>> +
>>>> +    /* TODO: support TM_SPC_ACK_OS_EL */
>>>> +}
>>>> +
>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>>> +                                   uint64_t value, unsigned size)
>>>> +{
>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>> +    int i;
>>>> +
>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>> +        xive_tm_write_special(nvt, offset, value, size);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>
>>> Why have this if you have separate OS and user regions as you appear
>>> to do below?
>>
>> This is another problem we are trying to solve. 
>>
>> The registers a CPU can access depends on the TIMA view it is using. 
>> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
>>
>>> Or to look at it another way, shouldn't it be possible to make the
>>> read/write accessors the same for the OS and user rings?
>>
>> For some parts yes, but the special load/store addresses are different
>> for each view, the read-only register also. It seemed easier to duplicate.
>>
>> I think the problem will become clearer (or worse) with pnv which uses 
>> the HV mode.
> 
> Oh.  I had the impression that each ring had a basically identical set
> of registers and you just had access to the region for your ring and
> the ones below.  Are you saying instead it's basically a single block
> of registers with various different privilege levels for each of them?

yes. I think I answered this question more clearly in a previous email.

> [snip]
>>>> +}
>>>> +
>>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    qemu_unregister_reset(xive_nvt_reset, dev);
>>>> +}
>>>> +
>>>> +static void xive_nvt_init(Object *obj)
>>>> +{
>>>> +    XiveNVT *nvt = XIVE_NVT(obj);
>>>> +
>>>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
>>>
>>> The ring_os field is basically pointless, being just an offset into a
>>> structure you already have.  A macro or inline would be a better idea.
>>
>> ok. I liked the idea but I agree it's overkill to have an init routine
>> just for this. I will find something.
> 
> That too, but it's also something that looks like an optimization but
> isn't, which is bad practice.  On modern cpus math is cheap (and this
> is just a trivial offset), memory accesses are expensive.  You're
> essentially caching this offset - raising all the usual invalidation
> questions for a cache - when caching it is *more* expensive than just
> computing it every time.

ok. removing this offset was a good opportunity to generalize the 
routing algorithm and use a 'ring' parameter in all routines. Same 
for the accept path. 
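
A minimal sketch of the 'ring' parameter approach, assuming a simplified `XiveNVT` that only holds the register bytes. The ring and register offsets mirror the names used in the patch, but their values here are assumptions, and `xive_nvt_ring()` is a hypothetical helper. The ring base is computed on each access instead of being cached in `ring_os`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ring offsets within the thread context (16 bytes per ring). */
#define TM_QW0_USER        0x00
#define TM_QW1_OS          0x10
#define TM_QW2_HV_POOL     0x20
#define TM_QW3_HV_PHYS     0x30
#define TM_CPPR            0x1     /* assumed byte offset of CPPR in a ring */
#define XIVE_PRIORITY_MAX  7

typedef struct XiveNVT {
    uint8_t regs[0x40];            /* simplified: register bytes only */
} XiveNVT;

/* Compute the ring base on demand: a trivial offset, nothing to invalidate. */
static inline uint8_t *xive_nvt_ring(XiveNVT *nvt, uint8_t ring)
{
    assert(ring <= TM_QW3_HV_PHYS && (ring & 0xf) == 0);
    return &nvt->regs[ring];
}

static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t ring, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    xive_nvt_ring(nvt, ring)[TM_CPPR] = cppr;
}
```

The same `ring` argument can then be threaded through the accept path rather than hard-coding the OS ring.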


C.


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-05-03  5:22           ` David Gibson
@ 2018-05-03 16:50             ` Cédric Le Goater
  2018-05-04  3:33               ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-03 16:50 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 07:22 AM, David Gibson wrote:
> On Thu, Apr 26, 2018 at 12:43:29PM +0200, Cédric Le Goater wrote:
>> On 04/26/2018 06:20 AM, David Gibson wrote:
>>> On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
>>>> On 04/24/2018 08:51 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
>>>>>> sPAPRXive is a model for the XIVE interrupt controller device of the
>>>>>> sPAPR machine. It holds the routing XIVE table, the Interrupt
>>>>>> Virtualization Entry (IVE) table which associates interrupt source
>>>>>> numbers with targets.
>>>>>>
>>>>>> Also extend the XiveFabric with an accessor to the IVT. This will be
>>>>>> needed by the routing algorithm.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>
>>>>>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
>>>>>>  discussed.
>>>>>
>>>>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
>>>>> than one XiveRouter?
>>>>
>>>> There is only one, the main IC. 
>>>
>>> Ok, that's what I thought originally.  In that case some of the stuff
>>> in the patches really doesn't make sense to me.
>>
>> Well, there is one IC per chip on powernv, but we haven't reached that part
>> yet.
> 
> Hmm.  There's some things we can delay dealing with, but I don't think
> this is one of them.  I think we need to understand how multichip is
> going to work in order to come up with a sane architecture.  Otherwise
> I fear we'll end up with something that we either need to horribly
> bastardize for multichip, or have to rework things dramatically
> leading to migration nightmares.

It is all controlled by MMIO, so we should be fine on that part.
As for the internal tables, they are all configured by firmware, using
a chip identifier (block). I need to check how the remote XIVEs are
accessed. I think this is by MMIO.

I haven't looked at multichip XIVE support but I am not too worried as 
the framework is already in place for the machine.
 
>>>>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
>>>>> interface, possibly its methods could just be class methods of
>>>>> XiveRouter.
>>>>
>>>> Yes. We could introduce a XiveRouter to share the ivt table between 
>>>> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
>>>> the machines. Methods would provide way to get the ivt/eq/nvt
>>>> objects required for routing. I need to add a set_eq() to push the
>>>> EQ data.
>>>
>>> Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
>>> object which owns the IVT.  
>>
>> OK. that would be a model with some state and not an interface.
> 
> Yes.  For papr variant it would have the whole IVT contents as its
> state.  For the powernv, just the registers telling it where to find
> the IVT in RAM.
> 
>>> It may or may not do other stuff as well.
>>
>> Its only task would be to do the final event routing: get the IVE,
>> get the EQ, push the EQ DATA in the OS event queue, notify the CPU.
> 
> That seems like a lot of steps.  Up to push the EQ DATA, certainly.
> And I guess it'll have to ping an NVT somehow, but I'm not sure it
> should know about CPUs as such.

For PowerNV, the concept could be generalized, yes. An NVT can
contain the interrupt state of a logical server, but the common
case for QEMU is baremetal without guests, and so we have an NVT
per cpu.

PowerNV will have some limitations, but we can make it better than
today for sure. It boots.

We could improve parts of the NVT notification process, possibly the
way NVTs are matched, and maybe support remote engines if the
NVT is not local. I have not looked at the details.
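
The routing steps listed above (get the IVE, get the EQ, push the EQ data, notify) can be sketched as follows. This is a simplified model under stated assumptions: the `IVE`, `EQ`, `NVT` and `Router` layouts are hypothetical, the generation bit in bit 31 of each queue entry and the `0x80 >> priority` IPB encoding are assumptions of the sketch, and the router pings an NVT rather than a CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define EQ_QSIZE          4     /* tiny queue so wrap-around is easy to see */
#define XIVE_PRIORITY_MAX 7

/* Hypothetical, simplified layouts: only the fields routing needs. */
typedef struct IVE {
    bool valid;
    bool masked;
    uint32_t eq_idx;            /* EQ descriptor targeted by the source */
    uint32_t eq_data;           /* value pushed in the OS event queue */
} IVE;

typedef struct EQ {
    uint32_t qpage[EQ_QSIZE];   /* OS event queue (guest RAM in reality) */
    uint32_t qindex;
    uint8_t qgen;               /* generation bit, flipped on each wrap */
    uint32_t nvt_idx;           /* notification target */
    uint8_t priority;
} EQ;

typedef struct NVT {
    uint8_t ipb;                /* pending buffer, one bit per priority */
} NVT;

typedef struct Router {
    IVE *ivt;  size_t nr_ive;
    EQ *eqdt;  size_t nr_eq;
    NVT *nvts; size_t nr_nvt;
} Router;

static void eq_push(EQ *eq, uint32_t data)
{
    eq->qpage[eq->qindex] = ((uint32_t)eq->qgen << 31) | (data & 0x7fffffff);
    eq->qindex = (eq->qindex + 1) % EQ_QSIZE;
    if (eq->qindex == 0) {
        eq->qgen ^= 1;
    }
}

/* Get the IVE, get the EQ, push the EQ data, then notify the NVT. */
static bool router_notify(Router *r, uint32_t lisn)
{
    const IVE *ive;
    EQ *eq;

    if (lisn >= r->nr_ive) {
        return false;
    }
    ive = &r->ivt[lisn];
    if (!ive->valid || ive->masked || ive->eq_idx >= r->nr_eq) {
        return false;
    }
    eq = &r->eqdt[ive->eq_idx];
    if (eq->nvt_idx >= r->nr_nvt || eq->priority > XIVE_PRIORITY_MAX) {
        return false;
    }
    eq_push(eq, ive->eq_data);
    r->nvts[eq->nvt_idx].ipb |= 0x80 >> eq->priority;
    return true;
}
```

Nothing in `router_notify()` knows about CPUs. Mapping an NVT to a thread, or to a logical server for a guest, is left to the presenter side.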

> I'm not sure at this stage what should own the EQD table.

The EQDT is in RAM.

> In the multichip case is there one EQD table for every IVT?

There is one EQDT per chip, same for the IVT. They are in RAM, 
identified with a block ID.
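
A minimal sketch of how per-chip tables, each identified by a block ID, can still form one system-wide EQ namespace that any chip's router can reach. The encoding is an assumption made for illustration: a 4-bit block ID in the top bits of a 32-bit identifier. `System`, `ChipTables`, `eq_id()` and `system_get_eq()` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_CHIPS 4
#define BLK_SHIFT 28                       /* assumed: block ID in the top 4 bits */
#define IDX_MASK  ((1u << BLK_SHIFT) - 1)

typedef struct EQDesc {
    uint32_t data;                         /* placeholder contents */
} EQDesc;

typedef struct ChipTables {
    EQDesc *eqdt;                          /* this chip's EQ descriptor table */
    size_t nr_eq;
} ChipTables;

typedef struct System {
    ChipTables chips[MAX_CHIPS];           /* indexed by block ID */
} System;

static inline uint32_t eq_id(uint8_t blk, uint32_t idx)
{
    return ((uint32_t)blk << BLK_SHIFT) | (idx & IDX_MASK);
}

/* Resolve a (block, index) identifier against the owning chip's table. */
static EQDesc *system_get_eq(System *s, uint32_t id)
{
    uint32_t blk = id >> BLK_SHIFT;
    uint32_t idx = id & IDX_MASK;

    if (blk >= MAX_CHIPS || idx >= s->chips[blk].nr_eq) {
        return NULL;
    }
    return &s->chips[blk].eqdt[idx];
}
```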

>  I'm guessing
> not - I figure the EQD table must be effectively global so that any
> chip's router can send events to any EQ in the whole system.
>>>> Now IIUC, on pnv the IVT lives in main system memory.  
>>
>> yes. It is allocated by skiboot in RAM and fed to the HW using some 
>> IC configuration registers. Then, each entry is configured with OPAL 
>> calls and the HW is updated using cache scrub registers. 
> 
> Right.  At least for the first pass we should be able to treat the
> cache scrub registers as no-ops and just not cache anything in the
> qemu implementation.

The model currently supports the cache scrub registers; we need them
to update some values. It's not too complex.


>>> Under PAPR is the IVT in guest memory, or is it outside (updated by
>>> hypercalls/rtas)?
>>
>> Under sPAPR, the IVT is updated by the H_INT_SET_SOURCE_CONFIG hcall
>> which configures the targeting of an IRQ. It's not in the guest 
>> memory.
> 
> Right.
> 
>> Under the hood, the IVT is still configured by OPAL under KVM and
>> by QEMU when kernel_irqchip=off.
> 
> Sure.  Even with kernel_irqchip=on there's still logically a guest IVT
> (or "IVT view" I guess), even if its actual entries are stored
> distributed across various places in the host's IVTs.

yes. The XIVE KVM device caches the info. This is used to dump the 
state without doing OPAL calls.
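
A simplified sketch of the sPAPR case discussed above, where H_INT_SET_SOURCE_CONFIG updates an IVT held in the device model rather than in guest memory. Assumptions to flag: the handler name, the flat `IVE` layout and the "one EQ per (server, priority) pair" index mapping are hypothetical. Treating priority 0xff as "mask the source" and the return codes are likewise simplifications, and the hcall's flags argument is not modelled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define H_SUCCESS    0
#define H_PARAMETER  (-4)
#define NR_IRQS      8
#define NR_SERVERS   2
#define PRIO_MAX     7

typedef struct IVE {
    bool valid;
    bool masked;
    uint32_t eq_idx;
    uint32_t eq_data;
} IVE;

typedef struct SpaprXiveSketch {
    IVE ivt[NR_IRQS];            /* the whole IVT is device state on sPAPR */
} SpaprXiveSketch;

/* Hypothetical handler body: validate, then rewrite one IVE. */
static long h_int_set_source_config_sketch(SpaprXiveSketch *x, uint32_t lisn,
                                           uint32_t target, uint8_t priority,
                                           uint32_t eisn)
{
    IVE *ive;

    if (lisn >= NR_IRQS || target >= NR_SERVERS) {
        return H_PARAMETER;
    }
    ive = &x->ivt[lisn];
    if (priority == 0xff) {      /* assumed: 0xff masks the source */
        ive->masked = true;
        return H_SUCCESS;
    }
    if (priority > PRIO_MAX) {
        return H_PARAMETER;
    }
    ive->valid = true;
    ive->masked = false;
    ive->eq_idx = target * (PRIO_MAX + 1) + priority;  /* assumed mapping */
    ive->eq_data = eisn;
    return H_SUCCESS;
}
```

With kernel_irqchip=on, the same logical update would be forwarded to the KVM device instead of being written into `ivt[]`.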

C. 


>>>> The XiveRouter would also be a XiveFabric (or some other name) to 
>>>> let the internal sources of the interrupt controller forward events.
>>>
>>> The further we go here, the less sure I am that XiveFabric even makes
>>> sense as a concept.
>>
>> See previous email.
> 


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-05-03 16:50             ` Cédric Le Goater
@ 2018-05-04  3:33               ` David Gibson
  2018-05-04 13:05                 ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-04  3:33 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Thu, May 03, 2018 at 06:50:09PM +0200, Cédric Le Goater wrote:
> On 05/03/2018 07:22 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 12:43:29PM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 06:20 AM, David Gibson wrote:
> >>> On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
> >>>> On 04/24/2018 08:51 AM, David Gibson wrote:
> >>>>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
> >>>>>> sPAPRXive is a model for the XIVE interrupt controller device of the
> >>>>>> sPAPR machine. It holds the routing XIVE table, the Interrupt
> >>>>>> Virtualization Entry (IVE) table which associates interrupt source
> >>>>>> numbers with targets.
> >>>>>>
> >>>>>> Also extend the XiveFabric with an accessor to the IVT. This will be
> >>>>>> needed by the routing algorithm.
> >>>>>>
> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>> ---
> >>>>>>
> >>>>>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
> >>>>>>  discussed.
> >>>>>
> >>>>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
> >>>>> than one XiveRouter?
> >>>>
> >>>> There is only one, the main IC. 
> >>>
> >>> Ok, that's what I thought originally.  In that case some of the stuff
> >>> in the patches really doesn't make sense to me.
> >>
> >> Well, there is one IC per chip on powernv, but we haven't reached that part
> >> yet.
> > 
> > Hmm.  There's some things we can delay dealing with, but I don't think
> > this is one of them.  I think we need to understand how multichip is
> > going to work in order to come up with a sane architecture.  Otherwise
> > I fear we'll end up with something that we either need to horribly
> > bastardize for multichip, or have to rework things dramatically
> > leading to migration nightmares.
> 
> It is all controlled by MMIO, so we should be fine on that part.
> As for the internal tables, they are all configured by firmware, using
> a chip identifier (block). I need to check how the remote XIVEs are
> accessed. I think this is by MMIO.

Right, but for powernv we execute OPAL inside the VM, rather than
emulating its effects.  So we still need to model the actual hardware
interfaces.  OPAL hides the details from the kernel, but not from us
on the other side.

> I haven't looked at multichip XIVE support but I am not too worried as 
> the framework is already in place for the machine.
>  
> >>>>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
> >>>>> interface, possibly its methods could just be class methods of
> >>>>> XiveRouter.
> >>>>
> >>>> Yes. We could introduce a XiveRouter to share the ivt table between 
> >>>> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
> >>>> the machines. Methods would provide way to get the ivt/eq/nvt
> >>>> objects required for routing. I need to add a set_eq() to push the
> >>>> EQ data.
> >>>
> >>> Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
> >>> object which owns the IVT.  
> >>
> >> OK. that would be a model with some state and not an interface.
> > 
> > Yes.  For papr variant it would have the whole IVT contents as its
> > state.  For the powernv, just the registers telling it where to find
> > the IVT in RAM.
> > 
> >>> It may or may not do other stuff as well.
> >>
> >> Its only task would be to do the final event routing: get the IVE,
> >> get the EQ, push the EQ DATA in the OS event queue, notify the CPU.
> > 
> > That seems like a lot of steps.  Up to push the EQ DATA, certainly.
> > And I guess it'll have to ping an NVT somehow, but I'm not sure it
> > should know about CPUs as such.
> 
> For PowerNV, the concept could be generalized, yes. An NVT can
> contain the interrupt state of a logical server, but the common
> case for QEMU is baremetal without guests, and so we have an NVT
> per cpu.

Hmm.  We eventually want to support a kernel running guests under
qemu/powernv though, right?  So even if we don't allow it right now,
we don't want allowing that to require major surgery to our
architecture.

> PowerNV will have some limitations, but we can make it better than
> today for sure. It boots.
> 
> We could improve parts of the NVT notification process, possibly the
> way NVTs are matched, and maybe support remote engines if the
> NVT is not local. I have not looked at the details.
> 
> > I'm not sure at this stage what should own the EQD table.
> 
> The EQDT is in RAM.

Not for spapr, it's not.  And even when it is in RAM, something needs
to own the register that gives its base address.
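
One way to picture "the XiveRouter owns the IVT" for both machines is sketched below, using a struct of function pointers in place of a QOM class method. Everything here is hypothetical and simplified. `XiveRouterSketch`, `SpaprRouter`, `PnvRouter` and the `get_ive` method are illustrative names, the 64-bit IVE layout is not modelled, and a byte array stands in for guest RAM where a real model would do a DMA read. The sPAPR variant keeps the whole table as state, while the PowerNV variant only keeps the base-address register and reads entries from memory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct IVE { uint64_t w; } IVE;      /* layout not modelled */

typedef struct XiveRouterSketch XiveRouterSketch;
struct XiveRouterSketch {
    /* "class method": fetch the IVE of an interrupt source number */
    int (*get_ive)(XiveRouterSketch *xrtr, uint32_t lisn, IVE *ive);
};

/* sPAPR-like: the IVT contents are device state. */
typedef struct SpaprRouter {
    XiveRouterSketch parent;                 /* must be first */
    IVE *ivt;
    uint32_t nr_irqs;
} SpaprRouter;

/* PowerNV-like: only the register giving the IVT base, plus its size. */
typedef struct PnvRouter {
    XiveRouterSketch parent;                 /* must be first */
    const uint8_t *ram;                      /* stands in for guest memory */
    uint64_t ivt_base;                       /* programmed by firmware */
    uint32_t ivt_entries;
} PnvRouter;

static int spapr_get_ive(XiveRouterSketch *xrtr, uint32_t lisn, IVE *ive)
{
    SpaprRouter *s = (SpaprRouter *)xrtr;

    if (lisn >= s->nr_irqs) {
        return -1;
    }
    *ive = s->ivt[lisn];
    return 0;
}

static int pnv_get_ive(XiveRouterSketch *xrtr, uint32_t lisn, IVE *ive)
{
    PnvRouter *p = (PnvRouter *)xrtr;

    if (lisn >= p->ivt_entries) {
        return -1;
    }
    /* a real model would read guest memory here */
    memcpy(ive, p->ram + p->ivt_base + (uint64_t)lisn * sizeof(IVE),
           sizeof(IVE));
    return 0;
}
```

Routing code written against `XiveRouterSketch` then does not care which machine it runs on. The same split would apply to an EQD table accessor.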

> > In the multichip case is there one EQD table for every IVT?
> 
> There is one EQDT per chip, same for the IVT. They are in RAM, 
> identified with a block ID.
> 
> >  I'm guessing
> > not - I figure the EQD table must be effectively global so that any
> > chip's router can send events to any EQ in the whole system.
> >>>> Now IIUC, on pnv the IVT lives in main system memory.  
> >>
> >> yes. It is allocated by skiboot in RAM and fed to the HW using some 
> >> IC configuration registers. Then, each entry is configured with OPAL 
> >> calls and the HW is updated using cache scrub registers. 
> > 
> > Right.  At least for the first pass we should be able to treat the
> > cache scrub registers as no-ops and just not cache anything in the
> > qemu implementation.
> 
> The model currently supports the cache scrub registers; we need them
> to update some values. It's not too complex.

Ok.

> >>> Under PAPR is the IVT in guest memory, or is it outside (updated by
> >>> hypercalls/rtas)?
> >>
> >> Under sPAPR, the IVT is updated by the H_INT_SET_SOURCE_CONFIG hcall
> >> which configures the targeting of an IRQ. It's not in the guest 
> >> memory.
> > 
> > Right.
> > 
> >> Under the hood, the IVT is still configured by OPAL under KVM and
> >> by QEMU when kernel_irqchip=off.
> > 
> > Sure.  Even with kernel_irqchip=on there's still logically a guest IVT
> > (or "IVT view" I guess), even if its actual entries are stored
> > distributed across various places in the host's IVTs.
> 
> yes. The XIVE KVM device caches the info. This is used to dump the 
> state without doing OPAL calls.
> 
> C. 
> 
> 
> >>>> The XiveRouter would also be a XiveFabric (or some other name) to 
> >>>> let the internal sources of the interrupt controller forward events.
> >>>
> >>> The further we go here, the less sure I am that XiveFabric even makes
> >>> sense as a concept.
> >>
> >> See previous email.
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-03 15:10           ` Cédric Le Goater
@ 2018-05-04  4:44             ` David Gibson
  2018-05-04 14:15               ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-04  4:44 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Thu, May 03, 2018 at 05:10:48PM +0200, Cédric Le Goater wrote:
> On 05/03/2018 07:39 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 07:15:29PM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 11:27 AM, Cédric Le Goater wrote:
> >>> On 04/26/2018 09:11 AM, David Gibson wrote:
> >>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
> > [snip]
> >>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
> >>>>> +                                   uint64_t value, unsigned size)
> >>>>> +{
> >>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> >>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> >>>>> +    int i;
> >>>>> +
> >>>>> +    if (offset >= TM_SPC_ACK_EBB) {
> >>>>> +        xive_tm_write_special(nvt, offset, value, size);
> >>>>> +        return;
> >>>>> +    }
> >>>>> +
> >>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
> >>>>
> >>>> Why have this if you have separate OS and user regions as you appear
> >>>> to do below?
> >>>
> >>> This is another problem we are trying to solve. 
> >>>
> >>> The registers a CPU can access depends on the TIMA view it is using. 
> >>> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
> >>
> >> So, I gave a deeper look at the specs and I understood a little more 
> >> details of the concepts behind. You need to do frequent round-trips 
> >> to this document ...  
> >>
> >> These registers are accessible through four aligned pages, each exposing 
> >> a different view of the registers. First page (page address ending 
> >> in 0b00) gives access to the entire context and is reserved for the 
> >> ring 0 security monitor. The second (page address ending in 0b01) 
> >> is for the hypervisor, ring 1. The third (page address ending in 0b10) 
> >> is for the operating system, ring 2. The fourth (page address ending 
> >> in 0b11) is for user level, ring 3.
> >>
> >> The sPAPR machine runs at the OS privilege and therefore can only 
> >> access the OS and the User rings, 2 and 3. The others are for
> >> hypervisor levels.
> > 
> > Ok, that much is what I thought.  What I'm less clear on is what each
> > page looks like compared to the others.  Previously I thought each one
> > had the same registers, 
> 
> yes.
> 
> > just manipulating the corresponding ring.  
> 
> no. 
> 
> > Are you saying instead that each ring's page basically has a subset 
> > of the registers in the next most privileged page?
> 
> That's the idea. 

Ah, ok.

> The registers are defined as follows:
> 
> 	QW-0 User      
> 	QW-1 O/S      
> 	QW-2 Pool   
> 	QW-3 Physical 
> 
> and the pages :
> 
> - 0006030203180000 security monitor 
>   can access all registers 
> 
> - 0006030203190000 hv
>   can access all registers minus the secure regs
> 
> - 00060302031a0000 os
>   can access some of the OS (QW1) and User (QW0) registers
>  
> - 00060302031b0000 user
>   can access NSR reg of User (QW0) registers

I can see two reasonable ways of doing this:

A)

Have a single set of read/write functions.  These implement all the
registers but take a "privilege level" parameter which controls which
will actually work.  Those could then be wired up in one of two ways:

  A1) Single memory region.  The accessor derives the priv level from
  the relevant address bits, before masking it down to a single
  register page.  Then, as above

  A2) Multiple memory regions with the same accessor functions but
  different opaque pointer.  The accessor gets the priv level from
  its opaque pointer, then the address is just within a single ring's
  page.

B)

Separate memory regions with separate accessors.  The ring-0 accessor
implements the ring-0 registers, then calls the ring-1 accessor
function for everything else.  ring-1 calls ring-2 and so forth.
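
Option B can be sketched as a chain of per-view accessors, where each view handles the registers it adds and delegates everything else to the next less privileged one. The split of registers between views below is an assumption made to keep the sketch small. User sees only its NSR, OS adds QW1, HV adds everything else, and the security monitor's secure registers are not modelled, so ring 0 simply forwards. The function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TM_QW0_USER   0x00
#define TM_QW1_OS     0x10
#define TM_NSR        0x0                   /* assumed byte offset of NSR */
#define TM_RING(off)  ((off) & 0xf0)
#define TM_REGS_SIZE  0x40

/* Ring 3 (user view): only the user NSR byte. */
static int tm_ring3_read(const uint8_t *regs, uint32_t off)
{
    return off == TM_QW0_USER + TM_NSR ? regs[off] : -1;
}

/* Ring 2 (OS view): adds the OS ring (QW1), delegates the rest. */
static int tm_ring2_read(const uint8_t *regs, uint32_t off)
{
    if (off < TM_REGS_SIZE && TM_RING(off) == TM_QW1_OS) {
        return regs[off];
    }
    return tm_ring3_read(regs, off);
}

/* Ring 1 (HV view): adds every register outside QW1, delegates QW1. */
static int tm_ring1_read(const uint8_t *regs, uint32_t off)
{
    if (off < TM_REGS_SIZE && TM_RING(off) != TM_QW1_OS) {
        return regs[off];
    }
    return tm_ring2_read(regs, off);
}

/* Ring 0 (security monitor): secure registers would go here. */
static int tm_ring0_read(const uint8_t *regs, uint32_t off)
{
    return tm_ring1_read(regs, off);
}
```

The chain keeps each view's rules local. The cost is that a register's visibility is spread across several functions, which is one argument for A's single privilege-parameterized accessor.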

> On sPAPR, we can remap the os/user pages to some other base address 
> but we should keep the same page offset.

Sure.

> 
> 
> >> I will try to come with a better implementation of the model and
> >> make sure the ring numbers are respected. I am not sure we should 
> >> have only one memory region or four distinct ones with their
> >> own ops. There are some differences in the load/store of each view.
> > 
> > Right.  I'm not clear at this point if that's for good reasons, or
> > just because IBM's hardware designers don't seem to have gotten the
> > hang of Don't Repeat Yourself.
> > 
> 



* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-03 16:06         ` Cédric Le Goater
@ 2018-05-04  4:51           ` David Gibson
  2018-05-04 13:11             ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-04  4:51 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Thu, May 03, 2018 at 06:06:14PM +0200, Cédric Le Goater wrote:
> On 05/03/2018 07:35 AM, David Gibson wrote:
> > On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
> >> On 04/26/2018 09:11 AM, David Gibson wrote:
> >>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
> >>>> The XIVE presenter engine uses a set of registers to handle priority
> >>>> management and interrupt acknowledgment among other things. The most
> >>>> important ones being :
> >>>>
> >>>>   - Interrupt Priority Register (PIPR)
> >>>>   - Interrupt Pending Buffer (IPB)
> >>>>   - Current Processor Priority (CPPR)
> >>>>   - Notification Source Register (NSR)
> >>>>
> >>>> There is one set of registers per level of privilege, four in all :
> >>>> HW, HV pool, OS and User. These are called rings. All registers are
> >>>> accessible through a specific MMIO region called the Thread Interrupt
> >>>> Management Areas (TIMA) but, depending on the privilege level of the
> >>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
> >>>> OS privilege and therefore can only access the OS and the User
> >>>> rings. The others are for hypervisor levels.
> >>>>
> >>>> The CPU interrupt state is modeled with a XiveNVT object which stores
> >>>> the values of the different registers. The different TIMA views are
> >>>> mapped at the same address for each CPU and 'current_cpu' is used to
> >>>> retrieve the XiveNVT holding the ring registers.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>
> >>>>  Changes since v2 :
> >>>>
> >>>>  - introduced the XiveFabric interface
> >>>>
> >>>>  hw/intc/spapr_xive.c        |  25 ++++
> >>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>  include/hw/ppc/spapr_xive.h |   5 +
> >>>>  include/hw/ppc/xive.h       |  31 +++++
> >>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
> >>>>  5 files changed, 424 insertions(+)
> >>>>
> >>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >>>> index 90cde8a4082d..f07832bf0a00 100644
> >>>> --- a/hw/intc/spapr_xive.c
> >>>> +++ b/hw/intc/spapr_xive.c
> >>>> @@ -13,6 +13,7 @@
> >>>>  #include "target/ppc/cpu.h"
> >>>>  #include "sysemu/cpus.h"
> >>>>  #include "monitor/monitor.h"
> >>>> +#include "hw/ppc/spapr.h"
> >>>>  #include "hw/ppc/spapr_xive.h"
> >>>>  #include "hw/ppc/xive.h"
> >>>>  #include "hw/ppc/xive_regs.h"
> >>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >>>>  
> >>>>      /* Allocate the Interrupt Virtualization Table */
> >>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> >>>> +
> >>>> +    /* The Thread Interrupt Management Area has the same address for
> >>>> +     * each chip. On sPAPR, we only need to expose the User and OS
> >>>> +     * level views of the TIMA.
> >>>> +     */
> >>>> +    xive->tm_base = XIVE_TM_BASE;
> >>>
> >>> The constant should probably have PAPR in the name somewhere, since
> >>> it's just for PAPR machines (same for the ESB mappings, actually).
> >>
> >> ok. 
> >>
> >> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in 
> >> case we want to change the value when the guest is instantiated. 
> >> I doubt it but this is an address in the global address space, so 
> >> letting the machine have control is better I think.
> > 
> > I agree.
> > 
> >>>> +
> >>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
> >>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
> >>>> +                          1ull << TM_SHIFT);
> >>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
> >>>> +
> >>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
> >>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
> >>>> +                          1ull << TM_SHIFT);
> >>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
> >>>>  }
> >>>>  
> >>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> >>>>  }
> >>>>  
> >>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
> >>>> +{
> >>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
> >>>> +
> >>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
> >>>> +}
> >>>
> >>> So this is a bit of a tangent, but I've been thinking of implementing
> >>> a scheme where there's an opaque pointer in the cpu structure for the
> >>> use of the machine.  I'm planning for that to replace the intc pointer
> >>> (which isn't really used directly by the cpu). That would allow us to
> >>> have spapr put a structure there and have both xics and xive pointers
> >>> which could be useful later on.
> >>
> >> ok. That should simplify the patchset at the end, in which we need to 
> >> switch the 'intc' pointer. 
> >>
> >>> I think we'd need something similar to correctly handle migration of
> >>> the VPA state, which is currently horribly broken.
> >>>
> >>>> +
> >>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
> >>>>      .name = TYPE_SPAPR_XIVE "/ive",
> >>>>      .version_id = 1,
> >>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >>>>      dc->vmsd = &vmstate_spapr_xive;
> >>>>  
> >>>>      xfc->get_ive = spapr_xive_get_ive;
> >>>> +    xfc->get_nvt = spapr_xive_get_nvt;
> >>>>  }
> >>>>  
> >>>>  static const TypeInfo spapr_xive_info = {
> >>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>> index dccad0318834..5691bb9474e4 100644
> >>>> --- a/hw/intc/xive.c
> >>>> +++ b/hw/intc/xive.c
> >>>> @@ -14,7 +14,278 @@
> >>>>  #include "sysemu/cpus.h"
> >>>>  #include "sysemu/dma.h"
> >>>>  #include "monitor/monitor.h"
> >>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
> >>>>  #include "hw/ppc/xive.h"
> >>>> +#include "hw/ppc/xive_regs.h"
> >>>> +
> >>>> +/*
> >>>> + * XIVE Interrupt Presenter
> >>>> + */
> >>>> +
> >>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
> >>>> +{
> >>>> +    return 0;
> >>>> +}
> >>>> +
> >>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
> >>>> +{
> >>>> +    if (cppr > XIVE_PRIORITY_MAX) {
> >>>> +        cppr = 0xff;
> >>>> +    }
> >>>> +
> >>>> +    nvt->ring_os[TM_CPPR] = cppr;
> >>>
> >>> Surely this needs to recheck if we should be interrupting the cpu?
> >>
> >> yes. In patch 9, when we introduce the nvt notify routine.
> > 
> > Ok.
> > 
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * OS Thread Interrupt Management Area MMIO
> >>>> + */
> >>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
> >>>> +                                           unsigned size)
> >>>> +{
> >>>> +    uint64_t ret = -1;
> >>>> +
> >>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
> >>>> +        ret = xive_nvt_accept(nvt);
> >>>> +    } else {
> >>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
> >>>> +                      HWADDR_PRIx" size %d\n", offset, size);
> >>>> +    }
> >>>> +
> >>>> +    return ret;
> >>>> +}
> >>>> +
> >>>> +#define TM_RING(offset) ((offset) & 0xf0)
> >>>> +
> >>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
> >>>> +                                      unsigned size)
> >>>> +{
> >>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> >>>
> >>> So, as I said on a previous version of this, we can actually correctly
> >>> represent different mappings in different cpu spaces, by exploiting
> >>> cpu->as and not just having them all point to &address_space_memory.
> >>
> >> Yes, you did and I haven't studied the question yet. For the next version.
> > 
> > So, it's possible that using the cpu->as thing will be more trouble
> > than it's worth.
> 
> One of the troubles is the number of memory regions to use, one per cpu,

Well, we're already going to have an NVT object for each cpu, yes?  So
a memory region per-cpu doesn't seem like a big stretch.

> and the KVM support.

And I really don't see how the memory regions impact KVM.

> Having a single region is much easier. 
> 
> > I am a little concerned about using current_cpu though.  
> > First, will it work with KVM with kernel_irqchip=off - the
> > cpus are running truly concurrently,
> 
> FWIW, I haven't seen any issue yet while stress testing.

Ok.

> > but we still need to work out who's poking at the TIMA.  
> 
> I understand. The registers are accessed by the current cpu to set the
> CPPR and to ack an interrupt. But when we route an event, we also access
> and modify the registers. Do you suggest some locking? I am not sure
> how the TIMA region accesses are protected vs. the routing, which is
> necessarily initiated by an ESB MMIO though.

Locking isn't really the issue.  I mean, we do need locking, but the
BQL should provide that.  The issue is what exactly does "current"
mean in the context of multiple concurrently running cpus.  Does it
always mean what we need it to mean in every context we might call
this from.

> > Second, are there any cases where we might
> > need to trip this "on behalf of" a specific cpu that's not the current
> > one.
> 
> ah. yes. sort of :) only in powernv, when the xive is reset (and when
> dumping the state for debug).
> 
> The IC has a way to indirectly access the registers of a HW thread.
> It first sets the PC_TCTXT_INDIR_THRDID register with the PIR of
> the targeted thread; loads on the indirect TIMA can then be done
> as if it were the current thread. The indirect TIMA is mapped 4 pages
> after the IC BAR.
> 
> The resulting memory region op is a little ugly and might need 
> some rework : 
> 
> static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset,
>                                  unsigned size)
> {
>     PowerPCCPU **cpuptr = opaque;
>     PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>     ...
> 
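The indirect access described above boils down to choosing which thread's registers a TIMA access targets. A minimal sketch, assuming a flat array of per-thread contexts looked up by PIR; `XiveIC`, `XiveThreadCtx`, `indir_valid`, `indir_pir` and `tima_select_thread()` are hypothetical names, with `indir_pir` standing in for the value last written to PC_TCTXT_INDIR_THRDID.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NR_THREADS 4 /* hypothetical number of HW threads */

/* Hypothetical per-thread TIMA register state. */
typedef struct XiveThreadCtx {
    uint32_t pir;         /* processor identification of the thread */
    uint8_t  regs[0x40];  /* four 16-byte rings */
} XiveThreadCtx;

/* Hypothetical IC state: indir_pir models PC_TCTXT_INDIR_THRDID. */
typedef struct XiveIC {
    XiveThreadCtx threads[NR_THREADS];
    bool          indir_valid;
    uint32_t      indir_pir;
} XiveIC;

static XiveThreadCtx *xive_find_thread(XiveIC *ic, uint32_t pir)
{
    for (size_t i = 0; i < NR_THREADS; i++) {
        if (ic->threads[i].pir == pir) {
            return &ic->threads[i];
        }
    }
    return NULL;
}

/*
 * Direct TIMA accesses target the thread issuing the load. Accesses
 * through the indirect TIMA target the thread whose PIR was programmed
 * beforehand, or nothing if no thread was selected.
 */
static XiveThreadCtx *tima_select_thread(XiveIC *ic, bool indirect,
                                         uint32_t current_pir)
{
    if (indirect) {
        return ic->indir_valid ? xive_find_thread(ic, ic->indir_pir) : NULL;
    }
    return xive_find_thread(ic, current_pir);
}
```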
> 
> >>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> >>>> +    uint64_t ret = -1;
> >>>> +    int i;
> >>>> +
> >>>> +    if (offset >= TM_SPC_ACK_EBB) {
> >>>> +        return xive_tm_read_special(nvt, offset, size);
> >>>> +    }
> >>>> +
> >>>> +    if (TM_RING(offset) != TM_QW1_OS) {
> >>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
> >>>> +                      HWADDR_PRIx"\n", offset);
> >>>> +        return ret;
> >>>
> >>> Just return -1 would be clearer here;
> >>
> >> ok.
> >>
> >>>
> >>>> +    }
> >>>> +
> >>>> +    ret = 0;
> >>>> +    for (i = 0; i < size; i++) {
> >>>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
> >>>> +    }
> >>>> +
> >>>> +    return ret;
> >>>> +}
> >>>> +
> >>>> +static bool xive_tm_is_readonly(uint8_t offset)
> >>>> +{
> >>>> +    return offset != TM_QW1_OS + TM_CPPR;
> >>>> +}
> >>>> +
> >>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
> >>>> +                                        uint64_t value, unsigned size)
> >>>> +{
> >>>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
> >>>> +
> >>>> +    /* TODO: support TM_SPC_ACK_OS_EL */
> >>>> +}
> >>>> +
> >>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
> >>>> +                                   uint64_t value, unsigned size)
> >>>> +{
> >>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> >>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
> >>>> +    int i;
> >>>> +
> >>>> +    if (offset >= TM_SPC_ACK_EBB) {
> >>>> +        xive_tm_write_special(nvt, offset, value, size);
> >>>> +        return;
> >>>> +    }
> >>>> +
> >>>> +    if (TM_RING(offset) != TM_QW1_OS) {
> >>>
> >>> Why have this if you have separate OS and user regions as you appear
> >>> to do below?
> >>
> >> This is another problem we are trying to solve. 
> >>
> >> The registers a CPU can access depend on the TIMA view it is using.
> >> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
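The view-dependent filtering can be pictured as a single predicate on the ring part of the offset. A minimal sketch, assuming the ring offsets match the TM_QWx_* definitions of the series' xive_regs.h and that special load/store pages are dispatched before this check, as xive_tm_os_read() does; `TimaView` and `tima_view_allows()` are hypothetical, and restricting the user view to the user ring is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Ring offsets inside the TIMA (assumed to match xive_regs.h). */
#define TM_QW0_USER     0x000
#define TM_QW1_OS       0x010
#define TM_QW2_HV_POOL  0x020
#define TM_QW3_HV_PHYS  0x030

#define TM_RING(offset) ((offset) & 0xf0)

/* Hypothetical enumeration of the TIMA views. */
typedef enum {
    TIMA_VIEW_USER,
    TIMA_VIEW_OS,
    TIMA_VIEW_HV,
} TimaView;

/* True if a load/store at 'offset' through 'view' may touch the
 * ring registers at that offset. */
static bool tima_view_allows(TimaView view, uint64_t offset)
{
    switch (view) {
    case TIMA_VIEW_HV:
        return true;                            /* HV view sees all rings */
    case TIMA_VIEW_OS:
        return TM_RING(offset) == TM_QW1_OS;    /* OS ring only */
    case TIMA_VIEW_USER:
        return TM_RING(offset) == TM_QW0_USER;  /* assumption */
    }
    return false;
}
```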
> >>
> >>> Or to look at it another way, shouldn't it be possible to make the
> >>> read/write accessors the same for the OS and user rings?
> >>
> >> For some parts, yes, but the special load/store addresses are different
> >> for each view, and so are the read-only registers. It seemed easier to duplicate.
> >>
> >> I think the problem will become clearer (or worse) with pnv which uses 
> >> the HV mode.
> > 
> > Oh.  I had the impression that each ring had a basically identical set
> > of registers and you just had access to the region for your ring and
> > the ones below.  Are you saying instead it's basically a single block
> > of registers with various different privilege levels for each of them?
> 
> yes. I think I answered this question more clearly in a previous email.
> 
> > [snip]
> >>>> +}
> >>>> +
> >>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
> >>>> +{
> >>>> +    qemu_unregister_reset(xive_nvt_reset, dev);
> >>>> +}
> >>>> +
> >>>> +static void xive_nvt_init(Object *obj)
> >>>> +{
> >>>> +    XiveNVT *nvt = XIVE_NVT(obj);
> >>>> +
> >>>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
> >>>
> >>> The ring_os field is basically pointless, being just an offset into a
> >>> structure you already have.  A macro or inline would be a better idea.
> >>
> >> ok. I liked the idea but I agree it's overkill to have an init routine
> >> just for this. I will find something.
> > 
> > That too, but it's also something that looks like an optimization but
> > isn't, which is bad practice.  On modern cpus math is cheap (and this
> > is just a trivial offset), memory accesses are expensive.  You're
> > essentially caching this offset - raising all the usual invalidation
> > questions for a cache - when caching it is *more* expensive than just
> > computing it every time.
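A minimal sketch of the suggested replacement, assuming a simplified XiveNVT without the cached `ring_os` pointer: the ring base is computed from a `ring` parameter on every access. `xive_nvt_ring()` is a hypothetical helper, and the TM_QW1_OS and TM_CPPR values are assumed to match xive_regs.h.

```c
#include <assert.h>
#include <stdint.h>

#define TM_QW1_OS          0x010 /* OS ring offset (assumed) */
#define TM_CPPR            0x1   /* CPPR offset within a ring (assumed) */
#define TM_RING_SIZE       0x10
#define XIVE_PRIORITY_MAX  7

/* Simplified stand-in for XiveNVT: no cached per-ring pointer. */
typedef struct XiveNVT {
    uint8_t regs[4 * TM_RING_SIZE];
} XiveNVT;

/* Recompute the ring base on each use; 'ring' is a TM_QWx_* offset. */
static inline uint8_t *xive_nvt_ring(XiveNVT *nvt, unsigned ring)
{
    return &nvt->regs[ring];
}

static void xive_nvt_set_cppr(XiveNVT *nvt, unsigned ring, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    xive_nvt_ring(nvt, ring)[TM_CPPR] = cppr;
}
```

Passing the ring explicitly is also what lets the same routines serve the OS ring today and the HV rings later.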
> 
> ok. removing this offset was a good opportunity to generalize the 
> routing algorithm and use a 'ring' parameter in all routines. Same 
> for the accept path. 
> 
> 
> C.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-03 14:37             ` Cédric Le Goater
@ 2018-05-04  5:19               ` David Gibson
  2018-05-04 13:29                 ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-04  5:19 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, May 03, 2018 at 04:37:29PM +0200, Cédric Le Goater wrote:
> On 05/03/2018 08:25 AM, David Gibson wrote:
> > On Thu, May 03, 2018 at 08:07:54AM +0200, Cédric Le Goater wrote:
> >> On 05/03/2018 07:45 AM, David Gibson wrote:
> >>> On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
> >>>> On 04/26/2018 09:25 AM, David Gibson wrote:
> >>>>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
> >>>>>> The Event Queue Descriptor (EQD) table is an internal table of the
> >>>>>> XIVE routing sub-engine. It specifies on which Event Queue the event
> >>>>>> data should be posted when an exception occurs (later on pulled by the
> >>>>>> OS) and which Virtual Processor to notify.
> >>>>>
> >>>>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
> >>>>> EQD gave metadata for event queues.
> >>>>
> >>>> yes. the above is poorly written. The Event Queue Descriptor contains the
> >>>> guest address of the event queue in which the data is written. I will 
> >>>> rephrase.      
> >>>>
> >>>> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
> >>>> and what data to push on the queue. 
> >>>>  
> >>>>>> The Event Queue is a much
> >>>>>> more complex structure but we start with a simple model for the sPAPR
> >>>>>> machine.
> >>>>>>
> >>>>>> There is one XiveEQ per priority and these are stored under the XIVE
> >>>>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
> >>>>>>
> >>>>>>        (server << 3) | (priority & 0x7)
> >>>>>>
> >>>>>> This is not in the XIVE architecture but as the EQ index is never
> >>>>>> exposed to the guest, in the hcalls nor in the device tree, we are
> >>>>>> free to use what best fits the current model.
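The (server, priority) packing above and its inverse, as a minimal sketch; the helper names are hypothetical and only the encoding comes from the patch description.

```c
#include <assert.h>
#include <stdint.h>

/* Model-internal EQ index: (server << 3) | (priority & 0x7). */
static inline uint32_t xive_eq_index(uint32_t server, uint8_t priority)
{
    return (server << 3) | (priority & 0x7);
}

/* Recover the server number from an EQ index. */
static inline uint32_t xive_eq_index_server(uint32_t eq_idx)
{
    return eq_idx >> 3;
}

/* Recover the priority from an EQ index. */
static inline uint8_t xive_eq_index_priority(uint32_t eq_idx)
{
    return eq_idx & 0x7;
}
```

For example, server 5 at priority 3 maps to index 43, and an out-of-range priority such as 9 is silently masked to 1 by the `& 0x7`.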
> >>>>
> >>>> This EQ indexing is important to notice because it will also show up 
> >>>> in KVM to build the IVE from the KVM irq state.
> >>>
> >>> Ok, are you saying that while this combined EQ index will never appear
> >>> in guest <-> host interfaces, 
> >>
> >> Indeed.
> >>
> >>> it might show up in qemu <-> KVM interfaces?
> >>
> >> Not directly, but it is part of the IVE as the IVE_EQ_INDEX field. When
> >> dumped, it has to be built in some way compatible with the emulated
> >> mode in QEMU.
> > 
> > Hrm.  But are the exact IVE contents visible to qemu (for a PAPR
> > guest)?  
> 
> The guest only uses hcalls whose arguments are:
>
> 	- cpu numbers,
> 	- priority numbers from defined ranges,
> 	- logical interrupt numbers,
> 	- the physical address of the EQ.
>
> The parts of the IVE visible to the guest are the 'priority', the 'cpu',
> and the 'eisn', which is the effective IRQ number the guest assigns
> to the source. The 'eisn' will be pushed in the EQ.

Ok.

> The IVE EQ index is not visible.

Good.

> > I would have thought the qemu <-> KVM interfaces would have
> > abstracted this the same way the guest <-> KVM interfaces do.
> > Or is there a reason not to?
> 
> It is practical to dump 64-bit IVEs directly from KVM into the QEMU
> internal structures because it fits the emulated mode without doing 
> any translation ... This might be seen as a shortcut. You will tell 
> me when you reach the KVM part.   

Ugh.. exposing to qemu the raw IVEs sounds like a bad idea to me.
When we migrate, we're going to have to assign the guest (server,
priority) tuples to host EQ indices, and I think it makes more sense
to do that in KVM and hide the raw indices from qemu than to have qemu
mangle them explicitly on migration.
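One way to read the suggestion is that the migrated state carries only what the guest can see, and each side rebuilds its private EQ index on restore. A minimal sketch under that assumption; `XiveSourceConfig`, `HostRoute` and `host_route_from_config()` are hypothetical, and the index packing reuses the model's (server << 3) | (priority & 0x7) scheme.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical guest-visible view of one source's routing: only
 * what the hcalls expose, no raw EQ index. */
typedef struct XiveSourceConfig {
    uint32_t server;    /* target cpu number */
    uint8_t  priority;  /* 0..7 */
    uint32_t eisn;      /* effective IRQ number pushed in the EQ */
    bool     masked;
} XiveSourceConfig;

/* Hypothetical host-side routing entry; eq_index stays private to
 * whichever side owns the EQs. */
typedef struct HostRoute {
    uint32_t eq_index;
    uint32_t eisn;
    bool     masked;
} HostRoute;

static uint32_t host_eq_index(uint32_t server, uint8_t priority)
{
    return (server << 3) | (priority & 0x7);
}

/* Restore path: rebuild the private index from the migrated tuple. */
static HostRoute host_route_from_config(const XiveSourceConfig *cfg)
{
    HostRoute r = {
        .eq_index = host_eq_index(cfg->server, cfg->priority),
        .eisn     = cfg->eisn,
        .masked   = cfg->masked,
    };
    return r;
}
```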

> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>
> >>>>> Is the EQD actually modifiable by a guest?  Or are the settings of the
> >>>>> EQs fixed by PAPR?
> >>>>
> >>>> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
> >>>> of the event queue for a (priority, server) pair.
> >>>
> >>> Ok, so the EQD can be modified by the guest.  In which case we need to
> >>> work out what object owns it, since it'll need to migrate it.
> >>
> >> Indeed. The EQDs are CPU related as there is one EQD per (cpu,
> >> priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
> >> using per-cpu ioctls. The EQ in the OS RAM is marked dirty at that
> >> stage.
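With one EQD per (cpu, priority) pair, the EQDs naturally live in per-cpu state, which is what migration needs to save. A minimal sketch of that ownership, assuming simplified descriptor fields; `EQDesc`, `CpuXiveState` and `cpu_set_queue_config()` are hypothetical, and the latter only models the effect of a queue configuration request, not the real H_INT_SET_QUEUE_CONFIG argument list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_NR_PRIORITIES 8

/* Hypothetical, simplified Event Queue Descriptor. */
typedef struct EQDesc {
    bool     valid;
    uint64_t qaddr;    /* guest address of the event queue */
    uint32_t qshift;   /* log2 of the queue size in bytes (assumed) */
} EQDesc;

/* Per-cpu state: one EQD per priority, migrated with the cpu. */
typedef struct CpuXiveState {
    EQDesc eqd[XIVE_NR_PRIORITIES];
} CpuXiveState;

/* Record a queue configuration for one priority of one cpu.
 * Returns false for an out-of-range priority. */
static bool cpu_set_queue_config(CpuXiveState *s, uint8_t priority,
                                 uint64_t qaddr, uint32_t qshift)
{
    if (priority >= XIVE_NR_PRIORITIES) {
        return false;
    }
    s->eqd[priority].valid = true;
    s->eqd[priority].qaddr = qaddr;
    s->eqd[priority].qshift = qshift;
    return true;
}
```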
> > 
> > To make sure I'm clear: for PAPR there's a strict relationship between
> > EQD and CPU (one EQD for each (cpu, priority) tuple).  
> 
> Yes.
> 
> > But for powernv that's not the case, right?  
> 
> It is.

Uh.. I don't think either of us phrased that well, I'm still not sure
which way you're answering that.

> > AIUI the mapping of EQs to cpus was configurable, is that right?
> 
> Each cpu has 8 EQD. Same for virtual cpus.

Hmm.. but is that 8 EQD per cpu something built into the hardware, or
just a convention of how the host kernel and OPAL operate?

> 
> I am not sure what you understood before? It is surely something
> I wrote; my XIVE understanding is still making progress.
> 
> 
> C.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-05-04  3:33               ` David Gibson
@ 2018-05-04 13:05                 ` Cédric Le Goater
  2018-05-05  4:26                   ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-04 13:05 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/04/2018 05:33 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 06:50:09PM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 07:22 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 12:43:29PM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 06:20 AM, David Gibson wrote:
>>>>> On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/24/2018 08:51 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
>>>>>>>> sPAPRXive is a model for the XIVE interrupt controller device of the
>>>>>>>> sPAPR machine. It holds the XIVE routing table, the Interrupt
>>>>>>>> Virtualization Entry (IVE) table, which associates interrupt source
>>>>>>>> numbers with targets.
>>>>>>>>
>>>>>>>> Also extend the XiveFabric with an accessor to the IVT. This will be
>>>>>>>> needed by the routing algorithm.
>>>>>>>>
>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>> ---
>>>>>>>>
>>>>>>>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
>>>>>>>>  discussed.
>>>>>>>
>>>>>>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
>>>>>>> than one XiveRouter?
>>>>>>
>>>>>> There is only one, the main IC. 
>>>>>
>>>>> Ok, that's what I thought originally.  In that case some of the stuff
>>>>> in the patches really doesn't make sense to me.
>>>>
>>>> well, there is one IC per chip on powernv, but we haven't reached that part
>>>> yet.
>>>
>>> Hmm.  There's some things we can delay dealing with, but I don't think
>>> this is one of them.  I think we need to understand how multichip is
>>> going to work in order to come up with a sane architecture.  Otherwise
>>> I fear we'll end up with something that we either need to horribly
>>> bastardize for multichip, or have to rework things dramatically
>>> leading to migration nightmares.
>>
>> So, it is all controlled by MMIO, so we should be fine on that part. 
>> As for the internal tables, they are all configured by firmware, using
>> a chip identifier (block). I need to check how the remote XIVE are 
>> accessed. I think this is by MMIO. 
> 
> Right, but for powernv we execute OPAL inside the VM, rather than
> emulating its effects.  So we still need to model the actual hardware
> interfaces.  OPAL hides the details from the kernel, but not from us
> on the other side.

Yes. This is the case in the current model. I took a look today and
I have a few fixes for the MMIO layout for P9 chips which I will send.

As for XIVE, the model needs to be a little more complex to support
VSD_MODE_FORWARD tables, which describe how to forward a notification
to another XIVE IC on another chip. They contain an address on which
to perform a load. This is another hop in the notification chain.
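A minimal sketch of the extra hop, assuming one table descriptor per block ID; `VsdEntry`, the mode names and `xive_route_block()` are hypothetical simplifications and do not follow the real table descriptor encoding.

```c
#include <assert.h>
#include <stdint.h>

#define NR_BLOCKS 16   /* hypothetical number of block IDs */

/* Hypothetical decoded descriptor modes. */
typedef enum {
    VSD_INVALID,
    VSD_EXCLUSIVE,   /* table is owned by this chip */
    VSD_FORWARD,     /* notification goes to another chip's IC */
} VsdMode;

typedef struct VsdEntry {
    VsdMode  mode;
    uint64_t addr;   /* table base, or forwarding address */
} VsdEntry;

typedef enum {
    ROUTE_ERROR,
    ROUTE_LOCAL,
    ROUTE_FORWARDED,
} RouteResult;

/* Decide what to do with a notification for 'block'. For a forwarded
 * block, *fwd_addr receives the address to load from on the remote IC. */
static RouteResult xive_route_block(const VsdEntry vsd[NR_BLOCKS],
                                    uint8_t block, uint64_t *fwd_addr)
{
    if (block >= NR_BLOCKS) {
        return ROUTE_ERROR;
    }
    switch (vsd[block].mode) {
    case VSD_EXCLUSIVE:
        return ROUTE_LOCAL;
    case VSD_FORWARD:
        *fwd_addr = vsd[block].addr;
        return ROUTE_FORWARDED;
    default:
        return ROUTE_ERROR;
    }
}
```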

>> I haven't looked at multichip XIVE support but I am not too worried as 
>> the framework is already in place for the machine.
>>  
>>>>>>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
>>>>>>> interface, possibly its methods could just be class methods of
>>>>>>> XiveRouter.
>>>>>>
>>>>>> Yes. We could introduce a XiveRouter to share the ivt table between 
>>>>>> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
>>>>>> the machines. Methods would provide a way to get the ivt/eq/nvt
>>>>>> objects required for routing. I need to add a set_eq() to push the
>>>>>> EQ data.
>>>>>
>>>>> Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
>>>>> object which owns the IVT.  
>>>>
>>>> OK. that would be a model with some state and not an interface.
>>>
>>> Yes.  For papr variant it would have the whole IVT contents as its
>>> state.  For the powernv, just the registers telling it where to find
>>> the IVT in RAM.
>>>
>>>>> It may or may not do other stuff as well.
>>>>
>>>> Its only task would be to do the final event routing: get the IVE,
>>>> get the EQ, push the EQ DATA in the OS event queue, notify the CPU.
>>>
>>> That seems like a lot of steps.  Up to push the EQ DATA, certainly.
>>> And I guess it'll have to ping an NVT somehow, but I'm not sure it
>>> should know about CPUs as such.
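The routing steps listed above, with the router notifying an NVT rather than a CPU, might look like the following. A minimal sketch, assuming class-method-style hooks; `XiveRouter`, `IVE`, `EQ`, `eq_push()` and `xive_router_notify()` are hypothetical, the queue is kept host-endian for brevity, and the generation bit in bit 31 that flips on each wrap is an assumption about the event queue format.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified routing entry. */
typedef struct IVE {
    bool     valid, masked;
    uint32_t eq_index;
    uint32_t eq_data;     /* e.g. the EISN */
} IVE;

/* Hypothetical, simplified event queue. */
typedef struct EQ {
    uint32_t *qpage;      /* queue entries (host-endian here) */
    uint32_t  qsize;      /* number of entries, a power of 2 */
    uint32_t  qindex;     /* next slot to fill */
    uint32_t  generation; /* 1 or 0, flipped on each wrap */
    uint32_t  nvt;        /* target to notify */
} EQ;

/* Class-method-like hooks: the router only deals with NVTs. */
typedef struct XiveRouter {
    IVE *(*get_ive)(struct XiveRouter *xr, uint32_t lisn);
    EQ  *(*get_eq)(struct XiveRouter *xr, uint32_t eq_index);
    void (*notify_nvt)(struct XiveRouter *xr, uint32_t nvt);
} XiveRouter;

static void eq_push(EQ *eq, uint32_t data)
{
    eq->qpage[eq->qindex] = (eq->generation << 31) | (data & 0x7fffffff);
    eq->qindex = (eq->qindex + 1) & (eq->qsize - 1);
    if (eq->qindex == 0) {
        eq->generation ^= 1;
    }
}

/* Get the IVE, get the EQ, push the EQ data, then ping the NVT. */
static bool xive_router_notify(XiveRouter *xr, uint32_t lisn)
{
    IVE *ive = xr->get_ive(xr, lisn);
    if (!ive || !ive->valid || ive->masked) {
        return false;
    }
    EQ *eq = xr->get_eq(xr, ive->eq_index);
    if (!eq) {
        return false;
    }
    eq_push(eq, ive->eq_data);
    xr->notify_nvt(xr, eq->nvt);
    return true;
}
```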
>>
>> For PowerNV, the concept could be generalized, yes. An NVT can 
>> contain the interrupt state of a logical server but the common 
>> case is baremetal without guests for QEMU and so we have an NVT
>> per cpu. 
> 
> Hmm.  We eventually want to support a kernel running guests under
> qemu/powernv though, right?  

arg. an emulated hypervisor ! OK let's say this is a long term goal :) 

> So even if we don't allow it right now,
> we don't want allowing that to require major surgery to our
> architecture.

That I agree on. 

>> PowerNV will have some limitation but we can make it better than 
>> today for sure. It boots.
>>
>> We can improve some of the NVT notification process and, eventually,
>> the way NVTs are matched, and maybe support remote engines if the
>> NVT is not local. I have not looked at the details.
>>
>>> I'm not sure at this stage what should own the EQD table.
>>
>> The EQDT is in RAM.
> 
> Not for spapr, it's not.  

yeah ok. It's in QEMU/KVM.

> And even when it is in RAM, something needs
> to own the register that gives its base address.

It's more complex than registers on powernv. There is a procedure
to define the XIVE tables using XIVE table descriptors, which contain
their characteristics: size, direct vs. indirect, local vs. remote.
OPAL/skiboot defines all these to configure the HW, and the model
necessarily needs to support the same interface. This is the case
for a single chip.  
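The direct vs. indirect distinction mostly changes how an entry address is computed. A minimal sketch, assuming an indirect table is a first-level array of pointers to fixed-size pages that may be left unprovisioned; `TableDesc` and `table_entry()` are hypothetical and do not follow the real descriptor encoding.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoded table descriptor. */
typedef struct TableDesc {
    bool      indirect;
    size_t    entry_size;  /* bytes per entry */
    size_t    table_size;  /* direct: total size in bytes */
    size_t    page_size;   /* indirect: bytes per second-level page */
    uint8_t  *base;        /* direct: the entries */
    uint8_t **pages;       /* indirect: first-level page table */
    size_t    nr_pages;    /* indirect: first-level entries */
} TableDesc;

/* Return a pointer to entry 'idx', or NULL if it is not backed. */
static uint8_t *table_entry(const TableDesc *t, size_t idx)
{
    if (!t->indirect) {
        if (idx >= t->table_size / t->entry_size) {
            return NULL;
        }
        return t->base + idx * t->entry_size;
    }
    size_t per_page = t->page_size / t->entry_size;
    size_t page = idx / per_page;
    if (page >= t->nr_pages || t->pages[page] == NULL) {
        return NULL;   /* second-level page not provisioned */
    }
    return t->pages[page] + (idx % per_page) * t->entry_size;
}
```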

C.

>>> In the multichip case is there one EQD table for every IVT?
>>
>> There is one EQDT per chip, same for the IVT. They are in RAM, 
>> identified with a block ID.
>>
>>>  I'm guessing
>>> not - I figure the EQD table must be effectively global so that any
>>> chip's router can send events to any EQ in the whole system.
>>>>>> Now IIUC, on pnv the IVT lives in main system memory.  
>>>>
>>>> yes. It is allocated by skiboot in RAM and fed to the HW using some 
>>>> IC configuration registers. Then, each entry is configured with OPAL 
>>>> calls and the HW is updated using cache scrub registers. 
>>>
>>> Right.  At least for the first pass we should be able to treat the
>>> cache scrub registers as no-ops and just not cache anything in the
>>> qemu implementation.
>>
>> The model currently supports the cache scrub registers, we need it
>> to update some values. It's not too complex.
> 
> Ok.
> 
>>>>> Under PAPR is the IVT in guest memory, or is it outside (updated by
>>>>> hypercalls/rtas)?
>>>>
>>>> Under sPAPR, the IVT is updated by the H_INT_SET_SOURCE_CONFIG hcall
>>>> which configures the targeting of an IRQ. It's not in the guest 
>>>> memory.
>>>
>>> Right.
>>>
>>>> Under the hood, the IVT is still configured by OPAL under KVM and
>>>> by QEMU when kernel_irqchip=off.
>>>
>>> Sure.  Even with kernel_irqchip=on there's still logically a guest IVT
>>> (or "IVT view" I guess), even if its actual entries are distributed
>>> across various places in the host's IVTs.
>>
>> yes. The XIVE KVM device caches the info. This is used to dump the 
>> state without doing OPAL calls.
>>
>> C. 
>>
>>
>>>>>> The XiveRouter would also be a XiveFabric (or some other name) to 
>>>>>> let the internal sources of the interrupt controller forward events.
>>>>>
>>>>> The further we go here, the less sure I am that XiveFabric even makes
>>>>> sense as a concept.
>>>>
>>>> See previous email.
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-04  4:51           ` David Gibson
@ 2018-05-04 13:11             ` Cédric Le Goater
  2018-05-05  4:27               ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-04 13:11 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/04/2018 06:51 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 06:06:14PM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 07:35 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>>>>> The XIVE presenter engine uses a set of registers to handle priority
>>>>>> management and interrupt acknowledgment among other things. The most
>>>>>> important ones being :
>>>>>>
>>>>>>   - Pending Interrupt Priority Register (PIPR)
>>>>>>   - Interrupt Pending Buffer (IPB)
>>>>>>   - Current Processor Priority (CPPR)
>>>>>>   - Notification Source Register (NSR)
>>>>>>
>>>>>> There is one set of registers per level of privilege, four in all :
>>>>>> HW, HV pool, OS and User. These are called rings. All registers are
>>>>>> accessible through a specific MMIO region called the Thread Interrupt
>>>>>> Management Area (TIMA) but, depending on the privilege level of the
>>>>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>>>>>> OS privilege and therefore can only access the OS and the User
>>>>>> rings. The others are for hypervisor levels.
>>>>>>
>>>>>> The CPU interrupt state is modeled with a XiveNVT object which stores
>>>>>> the values of the different registers. The different TIMA views are
>>>>>> mapped at the same address for each CPU and 'current_cpu' is used to
>>>>>> retrieve the XiveNVT holding the ring registers.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>
>>>>>>  Changes since v2 :
>>>>>>
>>>>>>  - introduced the XiveFabric interface
>>>>>>
>>>>>>  hw/intc/spapr_xive.c        |  25 ++++
>>>>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>  include/hw/ppc/spapr_xive.h |   5 +
>>>>>>  include/hw/ppc/xive.h       |  31 +++++
>>>>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>>>>>  5 files changed, 424 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>>>> index 90cde8a4082d..f07832bf0a00 100644
>>>>>> --- a/hw/intc/spapr_xive.c
>>>>>> +++ b/hw/intc/spapr_xive.c
>>>>>> @@ -13,6 +13,7 @@
>>>>>>  #include "target/ppc/cpu.h"
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>>  #include "hw/ppc/spapr_xive.h"
>>>>>>  #include "hw/ppc/xive.h"
>>>>>>  #include "hw/ppc/xive_regs.h"
>>>>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>>>>  
>>>>>>      /* Allocate the Interrupt Virtualization Table */
>>>>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>>>> +
>>>>>> +    /* The Thread Interrupt Management Area has the same address for
>>>>>> +     * each chip. On sPAPR, we only need to expose the User and OS
>>>>>> +     * level views of the TIMA.
>>>>>> +     */
>>>>>> +    xive->tm_base = XIVE_TM_BASE;
>>>>>
>>>>> The constant should probably have PAPR in the name somewhere, since
>>>>> it's just for PAPR machines (same for the ESB mappings, actually).
>>>>
>>>> ok. 
>>>>
>>>> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in 
>>>> case we want to change the value when the guest is instantiated. 
>>>> I doubt it but this is an address in the global address space, so 
>>>> letting the machine have control is better I think.
>>>
>>> I agree.
>>>
>>>>>> +
>>>>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>>>>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>>>>>> +                          1ull << TM_SHIFT);
>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>>>>>> +
>>>>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>>>>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>>>>>> +                          1ull << TM_SHIFT);
>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>>>>>  }
>>>>>>  
>>>>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>>>>  }
>>>>>>  
>>>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>>>>> +
>>>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>>>>> +}
>>>>>
>>>>> So this is a bit of a tangent, but I've been thinking of implementing
>>>>> a scheme where there's an opaque pointer in the cpu structure for the
>>>>> use of the machine.  I'm planning for that to replace the intc pointer
>>>>> (which isn't really used directly by the cpu). That would allow us to
>>>>> have spapr put a structure there and have both xics and xive pointers
>>>>> which could be useful later on.
>>>>
>>>> ok. That should simplify the patchset at the end, in which we need to 
>>>> switch the 'intc' pointer. 
>>>>
>>>>> I think we'd need something similar to correctly handle migration of
>>>>> the VPA state, which is currently horribly broken.
>>>>>
>>>>>> +
>>>>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>>>>>      .name = TYPE_SPAPR_XIVE "/ive",
>>>>>>      .version_id = 1,
>>>>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>>>      dc->vmsd = &vmstate_spapr_xive;
>>>>>>  
>>>>>>      xfc->get_ive = spapr_xive_get_ive;
>>>>>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>>>>>  }
>>>>>>  
>>>>>>  static const TypeInfo spapr_xive_info = {
>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>> index dccad0318834..5691bb9474e4 100644
>>>>>> --- a/hw/intc/xive.c
>>>>>> +++ b/hw/intc/xive.c
>>>>>> @@ -14,7 +14,278 @@
>>>>>>  #include "sysemu/cpus.h"
>>>>>>  #include "sysemu/dma.h"
>>>>>>  #include "monitor/monitor.h"
>>>>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>>>>>>  #include "hw/ppc/xive.h"
>>>>>> +#include "hw/ppc/xive_regs.h"
>>>>>> +
>>>>>> +/*
>>>>>> + * XIVE Interrupt Presenter
>>>>>> + */
>>>>>> +
>>>>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>>>>>> +{
>>>>>> +    return 0;
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>>>>>> +{
>>>>>> +    if (cppr > XIVE_PRIORITY_MAX) {
>>>>>> +        cppr = 0xff;
>>>>>> +    }
>>>>>> +
>>>>>> +    nvt->ring_os[TM_CPPR] = cppr;
>>>>>
>>>>> Surely this needs to recheck if we should be interrupting the cpu?
>>>>
>>>> yes. In patch 9, when we introduce the nvt notify routine.
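What the recheck amounts to, as a minimal sketch: after CPPR changes, recompute the most favored pending priority (PIPR) from the IPB and signal the cpu if it is more favored than the new CPPR. `Ring`, `ipb_to_pipr()` and `ring_set_cppr()` are hypothetical, and the IPB bit layout (0x80 >> priority) is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7

/* Hypothetical per-ring presenter state. */
typedef struct Ring {
    uint8_t ipb;    /* bit (0x80 >> prio) set per pending priority */
    uint8_t cppr;
    uint8_t pipr;
} Ring;

/* Most favored (numerically lowest) pending priority, 0xff if none. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= XIVE_PRIORITY_MAX; prio++) {
        if (ipb & (0x80 >> prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Set CPPR, then re-evaluate: returns true if the cpu should now be
 * signalled because a pending priority beats the new CPPR. */
static bool ring_set_cppr(Ring *r, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    r->cppr = cppr;
    r->pipr = ipb_to_pipr(r->ipb);
    return r->pipr < r->cppr;
}
```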
>>>
>>> Ok.
>>>
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * OS Thread Interrupt Management Area MMIO
>>>>>> + */
>>>>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>>>>>> +                                           unsigned size)
>>>>>> +{
>>>>>> +    uint64_t ret = -1;
>>>>>> +
>>>>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>>>>>> +        ret = xive_nvt_accept(nvt);
>>>>>> +    } else {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>>>>>> +                      HWADDR_PRIx" size %d\n", offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +#define TM_RING(offset) ((offset) & 0xf0)
>>>>>> +
>>>>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>>>>>> +                                      unsigned size)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>
>>>>> So, as I said on a previous version of this, we can actually correctly
>>>>> represent different mappings in different cpu spaces, by exploiting
>>>>> cpu->as and not just having them all point to &address_space_memory.
>>>>
>>>> Yes, you did and I haven't studied the question yet. For the next version.
>>>
>>> So, it's possible that using the cpu->as thing will be more trouble
>>> than it's worth.
>>
>> One of the troubles is the number of memory regions to use, one per cpu,
> 
> Well, we're already going to have an NVT object for each cpu, yes?  So
> a memory region per-cpu doesn't seem like a big stretch.
> 
>> and the KVM support.
> 
> And I really don't see how the memory regions impact KVM.

The TIMA is set up when the KVM device is initialized, using a specific
ioctl to get an fd on an MMIO region from the host. It is then passed to
the guest as a 'ram_device', as are the ESBs.

This is not a common region.
 
>> Having a single region is much easier. 
>>
>>> I am a little concerned about using current_cpu though.  
>>> First, will it work with KVM with kernel_irqchip=off - the
>>> cpus are running truly concurrently,
>>
>> FWIW, I didn't see any issue yet while stressing. 
> 
> Ok.
> 
>>> but we still need to work out who's poking at the TIMA.  
>>
>> I understand. The registers are accessed by the current cpu to set the
>> CPPR and to ack an interrupt. But when we route an event, we also access
>> and modify the registers. Do you suggest some locking? I am not sure
>> how the TIMA region accesses are protected vs. the routing, which is
>> necessarily initiated by an ESB MMIO though.
> 
> Locking isn't really the issue.  I mean, we do need locking, but the
> BQL should provide that.  The issue is what exactly does "current"
> mean in the context of multiple concurrently running cpus.  Does it
> always mean what we need it to mean in every context we might call
> this from.

I would say so.

C.

>>> Second, are there any cases where we might
>>> need to trip this "on behalf of" a specific cpu that's not the current
>>> one.
>>
>> ah. yes. sort of :) only in powernv, when the xive is reset (and when
>> dumping the state for debug).
>>
>> The IC has a way to indirectly access the registers of a HW thread.
>> It first sets the PC_TCTXT_INDIR_THRDID register with the PIR of
>> the targeted thread; loads on the indirect TIMA can then be done
>> as if it were the current thread. The indirect TIMA is mapped 4 pages
>> after the IC BAR.
>>
>> The resulting memory region op is a little ugly and might need 
>> some rework : 
>>
>> static uint64_t xive_tm_hv_read(void *opaque, hwaddr offset,
>>                                  unsigned size)
>> {
>>     PowerPCCPU **cpuptr = opaque;
>>     PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>>     ...
>>
>>
>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>> +    uint64_t ret = -1;
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>> +        return xive_tm_read_special(nvt, offset, size);
>>>>>> +    }
>>>>>> +
>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid access to non-OS ring @%"
>>>>>> +                      HWADDR_PRIx"\n", offset);
>>>>>> +        return ret;
>>>>>
>>>>> Just return -1 would be clearer here;
>>>>
>>>> ok.
>>>>
>>>>>
>>>>>> +    }
>>>>>> +
>>>>>> +    ret = 0;
>>>>>> +    for (i = 0; i < size; i++) {
>>>>>> +        ret |= (uint64_t) nvt->regs[offset + i] << (8 * (size - i - 1));
>>>>>> +    }
>>>>>> +
>>>>>> +    return ret;
>>>>>> +}
>>>>>> +
>>>>>> +static bool xive_tm_is_readonly(uint8_t offset)
>>>>>> +{
>>>>>> +    return offset != TM_QW1_OS + TM_CPPR;
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_tm_write_special(XiveNVT *nvt, hwaddr offset,
>>>>>> +                                        uint64_t value, unsigned size)
>>>>>> +{
>>>>>> +    /* TODO: support TM_SPC_SET_OS_PENDING */
>>>>>> +
>>>>>> +    /* TODO: support TM_SPC_ACK_OS_EL */
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>>>>> +                                   uint64_t value, unsigned size)
>>>>>> +{
>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>> +    int i;
>>>>>> +
>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>> +        xive_tm_write_special(nvt, offset, value, size);
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>
>>>>> Why have this if you have separate OS and user regions as you appear
>>>>> to do below?
>>>>
>>>> This is another problem we are trying to solve. 
>>>>
>>>> The registers a CPU can access depend on the TIMA view it is using.
>>>> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
>>>>
>>>>> Or to look at it another way, shouldn't it be possible to make the
>>>>> read/write accessors the same for the OS and user rings?
>>>>
>>>> For some parts yes, but the special load/store addresses are different
>>>> for each view, and so are the read-only registers. It seemed easier to duplicate.
>>>>
>>>> I think the problem will become clearer (or worse) with pnv which uses 
>>>> the HV mode.
>>>
>>> Oh.  I had the impression that each ring had a basically identical set
>>> of registers and you just had access to the region for your ring and
>>> the ones below.  Are you saying instead it's basically a single block
>>> of registers with various different privilege levels for each of them?
>>
>> yes. I think I answered this question more clearly in a previous email.
>>
>>> [snip]
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_unrealize(DeviceState *dev, Error **errp)
>>>>>> +{
>>>>>> +    qemu_unregister_reset(xive_nvt_reset, dev);
>>>>>> +}
>>>>>> +
>>>>>> +static void xive_nvt_init(Object *obj)
>>>>>> +{
>>>>>> +    XiveNVT *nvt = XIVE_NVT(obj);
>>>>>> +
>>>>>> +    nvt->ring_os = &nvt->regs[TM_QW1_OS];
>>>>>
>>>>> The ring_os field is basically pointless, being just an offset into a
>>>>> structure you already have.  A macro or inline would be a better idea.
>>>>
>>>> ok. I liked the idea but I agree it's overkill to have an init routine
>>>> just for this. I will find something.
>>>
>>> That too, but it's also something that looks like an optimization but
>>> isn't, which is bad practice.  On modern cpus math is cheap (and this
>>> is just a trivial offset), memory accesses are expensive.  You're
>>> essentially caching this offset - raising all the usual invalidation
>>> questions for a cache - when caching it is *more* expensive than just
>>> computing it every time.
>>
>> ok. removing this offset was a good opportunity to generalize the 
>> routing algorithm and use a 'ring' parameter in all routines. Same 
>> for the accept path. 
>>
>>
>> C.
>>
> 
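
A minimal sketch of the two review points above (computing the ring base instead of caching a ring_os pointer, and the byte-wise register read), assuming a flat per-thread array of four 16-byte rings (QW0 to QW3). The TM_QW* and TM_CPPR offsets are restated locally as assumptions rather than taken from xive_regs.h, and XiveNVTSketch, xive_nvt_ring() and xive_nvt_read_be() are hypothetical names, not QEMU APIs. The read assembles a big-endian value byte by byte, as the quoted xive_tm_os_read() loop does.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed ring layout: QWn starts at n * 0x10 (restated for illustration). */
#define TM_QW0_USER   0x00
#define TM_QW1_OS     0x10
#define TM_RING_SIZE  0x10
#define TM_CPPR       0x1   /* assumed offset of CPPR within a ring */

typedef struct {
    uint8_t regs[4 * TM_RING_SIZE];   /* QW0..QW3 register bytes */
} XiveNVTSketch;

/* Ring base computed on demand: a trivial offset, nothing to keep in sync. */
static inline uint8_t *xive_nvt_ring(XiveNVTSketch *nvt, unsigned ring)
{
    assert(ring + TM_RING_SIZE <= sizeof(nvt->regs));
    return &nvt->regs[ring];
}

/* Read 'size' bytes starting at 'offset' as one big-endian value. */
static uint64_t xive_nvt_read_be(const XiveNVTSketch *nvt, unsigned offset,
                                 unsigned size)
{
    uint64_t ret = 0;

    assert(size >= 1 && size <= 8 && offset + size <= sizeof(nvt->regs));
    for (unsigned i = 0; i < size; i++) {
        ret |= (uint64_t)nvt->regs[offset + i] << (8 * (size - i - 1));
    }
    return ret;
}
```

With such a helper, the routing and accept paths can take a ring parameter (TM_QW0_USER, TM_QW1_OS, ...) and compute the register address each time, which costs one addition.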


* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-04  5:19               ` David Gibson
@ 2018-05-04 13:29                 ` Cédric Le Goater
  2018-05-05  4:29                   ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-04 13:29 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/04/2018 07:19 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 04:37:29PM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 08:25 AM, David Gibson wrote:
>>> On Thu, May 03, 2018 at 08:07:54AM +0200, Cédric Le Goater wrote:
>>>> On 05/03/2018 07:45 AM, David Gibson wrote:
>>>>> On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/26/2018 09:25 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
>>>>>>>> The Event Queue Descriptor (EQD) table is an internal table of the
>>>>>>>> XIVE routing sub-engine. It specifies on which Event Queue the event
>>>>>>>> data should be posted when an exception occurs (later on pulled by the
>>>>>>>> OS) and which Virtual Processor to notify.
>>>>>>>
>>>>>>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
>>>>>>> EQD gave metadata for event queues.
>>>>>>
>>>>>> yes. The above is poorly written. The Event Queue Descriptor contains the
>>>>>> guest address of the event queue in which the data is written. I will 
>>>>>> rephrase.      
>>>>>>
>>>>>> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
>>>>>> and what data to push on the queue. 
>>>>>>  
>>>>>>>> The Event Queue is a much
>>>>>>>> more complex structure but we start with a simple model for the sPAPR
>>>>>>>> machine.
>>>>>>>>
>>>>>>>> There is one XiveEQ per priority and these are stored under the XIVE
>>>>>>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
>>>>>>>>
>>>>>>>>        (server << 3) | (priority & 0x7)
>>>>>>>>
>>>>>>>> This is not in the XIVE architecture but as the EQ index is never
>>>>>>>> exposed to the guest, in the hcalls nor in the device tree, we are
>>>>>>>> free to use what best fits the current model.
>>>>>>
>>>>>> This EQ indexing is important to notice because it will also show up 
>>>>>> in KVM to build the IVE from the KVM irq state.
>>>>>
>>>>> Ok, are you saying that while this combined EQ index will never appear
>>>>> in guest <-> host interfaces, 
>>>>
>>>> Indeed.
>>>>
>>>>> it might show up in qemu <-> KVM interfaces?
>>>>
>>>> Not directly but it is part of the IVE as the IVE_EQ_INDEX field. When
>>>> dumped, it has to be built in some way that is compatible with the emulated 
>>>> mode in QEMU. 
>>>
>>> Hrm.  But is the exact IVE contents visible to qemu (for a PAPR
>>> guest)?  
>>
>> The guest only uses hcalls whose arguments are:
>>
>> 	- cpu numbers,
>> 	- priority numbers from defined ranges,
>> 	- logical interrupt numbers,
>> 	- the physical address of the EQ.
>>
>> The parts of the IVE visible to the guest are the 'priority', the 'cpu'
>> and the 'eisn', which is the effective IRQ number the guest assigns
>> to the source. The 'eisn' is pushed in the EQ.
> 
> Ok.
> 
>> The IVE EQ index is not visible.
> 
> Good.
> 
>>> I would have thought the qemu <-> KVM interfaces would have
>>> abstracted this the same way the guest <-> KVM interfaces do.  Or is
>>> there a reason not to?
>>
>> It is practical to dump 64-bit IVEs directly from KVM into the QEMU 
>> internal structures because it fits the emulated mode without doing 
>> any translation ... This might be seen as a shortcut. You will tell 
>> me when you reach the KVM part.   
> 
> Ugh.. exposing to qemu the raw IVEs sounds like a bad idea to me.

QEMU definitely needs them in emulation mode. The whole routing
relies on them.

> When we migrate, we're going to have to assign the guest (server,
> priority) tuples to host EQ indices, and I think it makes more sense
> to do that in KVM and hide the raw indices from qemu than to have qemu
> mangle them explicitly on migration.

We will need some mangling mechanism for the KVM ioctls saving and
restoring state. This is very similar to XICS. 
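
The EQ index scheme quoted above, (server << 3) | (priority & 0x7), can be sketched as a pair of encode/decode helpers, assuming priorities 0 to 7 (XIVE_PRIORITY_MAX is restated here as an assumption). Because the index is a pure function of the (server, priority) tuple, a save/restore path could rebuild it from the tuple instead of carrying the raw IVE_EQ_INDEX value. The helper names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7   /* assumed: priorities 0..7, eight EQs per server */

/* sPAPR-only EQ index: never exposed to the guest. */
static inline uint32_t spapr_eq_index(uint32_t server, uint8_t priority)
{
    assert(priority <= XIVE_PRIORITY_MAX);
    return (server << 3) | (priority & 0x7);
}

/* Recover the (server, priority) tuple from an EQ index. */
static inline uint32_t spapr_eq_server(uint32_t eq_idx)
{
    return eq_idx >> 3;
}

static inline uint8_t spapr_eq_priority(uint32_t eq_idx)
{
    return eq_idx & 0x7;
}
```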
 
>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>
>>>>>>> Is the EQD actually modifiable by a guest?  Or are the settings of the
>>>>>>> EQs fixed by PAPR?
>>>>>>
>>>>>> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
>>>>>> of the event queue for a (priority, server) pair.
>>>>>
>>>>> Ok, so the EQD can be modified by the guest.  In which case we need to
>>>>> work out what object owns it, since it'll need to migrate it.
>>>>
>>>> Indeed. The EQDs are CPU related as there is one EQD per (cpu,
>>>> priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
>>>> using per-cpu ioctls. The EQ in the OS RAM is marked dirty at that
>>>> stage.
>>>
>>> To make sure I'm clear: for PAPR there's a strict relationship between
>>> EQD and CPU (one EQD for each (cpu, priority) tuple).  
>>
>> Yes.
>>
>>> But for powernv that's not the case, right?  
>>
>> It is.
> 
> Uh.. I don't think either of us phrased that well, I'm still not sure
> which way you're answering that.

there's a strict relationship between EQD and CPU (one EQD for each (cpu, priority) tuple) in spapr and in powernv.

>>> AIUI the mapping of EQs to cpus was configurable, is that right?
>>
>> Each cpu has 8 EQD. Same for virtual cpus.
> 
> Hmm.. but is that 8 EQD per cpu something built into the hardware, or
> just a convention of how the host kernel and OPAL operate?

It's not built into the HW; it is a table the HW uses to route the notification.
The EQD contains the EQ characteristics :

* functional bits :
  - valid bit
  - enqueue bit: whether or not to update the OS EQ in RAM
  - unconditional notification
  - backlog
  - escalation
  - ...
* OS EQ fields 
  - physical address
  - entry index
  - toggle bit
* NVT fields
  - block/chip
  - index
* etc.

It's a big structure : 8 words.
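
One way to picture the EQD described above, as a sketch only: the raw entry is eight 32-bit words, and a decoded view groups the fields from the list. The bit positions of these fields within the words are not given here, so the sketch deliberately stops at the decoded view and does not decode the raw words. All names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Raw EQD: 8 words of 32 bits, as stored in the table handed to the HW. */
typedef struct {
    uint32_t w[8];
} XiveEQDRaw;

static_assert(sizeof(XiveEQDRaw) == 32, "an EQD is 8 x 32-bit words");

/* Decoded view of the fields listed above (not the hardware layout). */
typedef struct {
    /* functional bits */
    bool     valid;
    bool     enqueue;          /* update the OS EQ in RAM or not */
    bool     unconditional;    /* unconditional notification */
    bool     backlog;
    bool     escalation;
    /* OS EQ fields */
    uint64_t qpage;            /* physical address of the OS event queue */
    uint32_t qindex;           /* entry index */
    bool     qtoggle;          /* toggle bit */
    /* NVT fields */
    uint8_t  nvt_block;        /* block / chip */
    uint32_t nvt_index;
} XiveEQDView;

/* In this sketch, an event is written to the OS queue only if the EQD is
 * both valid and has its enqueue bit set. */
static bool xive_eqd_enqueues(const XiveEQDView *eqd)
{
    return eqd->valid && eqd->enqueue;
}
```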

The EQD table is allocated by OPAL/skiboot and fed to the HW for
its use. A powernv OS uses OPAL calls to configure the EQD to its
needs:

int64_t opal_xive_set_queue_info(uint64_t vp, uint32_t prio,
				 uint64_t qpage,
				 uint64_t qsize,
				 uint64_t qflags);


sPAPR uses an hcall :

static long plpar_int_set_queue_config(unsigned long flags,
				       unsigned long target,
				       unsigned long priority,
				       unsigned long qpage,
				       unsigned long qsize)


but KVM translates it into an OPAL call.

C.

 
>  
>>
>> I am not sure what you understood before? It is surely something
>> I wrote; my XIVE understanding is still making progress.
>>
>>
>> C.
>>
> 


* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-04  4:44             ` David Gibson
@ 2018-05-04 14:15               ` Cédric Le Goater
  0 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-04 14:15 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/04/2018 06:44 AM, David Gibson wrote:
> On Thu, May 03, 2018 at 05:10:48PM +0200, Cédric Le Goater wrote:
>> On 05/03/2018 07:39 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 07:15:29PM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 11:27 AM, Cédric Le Goater wrote:
>>>>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>> [snip]
>>>>>>> +static void xive_tm_os_write(void *opaque, hwaddr offset,
>>>>>>> +                                   uint64_t value, unsigned size)
>>>>>>> +{
>>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>>> +    XiveNVT *nvt = XIVE_NVT(cpu->intc);
>>>>>>> +    int i;
>>>>>>> +
>>>>>>> +    if (offset >= TM_SPC_ACK_EBB) {
>>>>>>> +        xive_tm_write_special(nvt, offset, value, size);
>>>>>>> +        return;
>>>>>>> +    }
>>>>>>> +
>>>>>>> +    if (TM_RING(offset) != TM_QW1_OS) {
>>>>>>
>>>>>> Why have this if you have separate OS and user regions as you appear
>>>>>> to do below?
>>>>>
>>>>> This is another problem we are trying to solve. 
>>>>>
>>>>> The registers a CPU can access depends on the TIMA view it is using. 
>>>>> The OS TIMA view only sees the OS ring registers. The HV view sees all. 
>>>>
>>>> So, I took a deeper look at the specs and understood a few more
>>>> details of the concepts behind them. You need to do frequent round-trips
>>>> to this document ...
>>>>
>>>> These registers are accessible through four aligned pages, each exposing 
>>>> a different view of the registers. First page (page address ending 
>>>> in 0b00) gives access to the entire context and is reserved for the 
>>>> ring 0 security monitor. The second (page address ending in 0b01) 
>>>> is for the hypervisor, ring 1. The third (page address ending in 0b10) 
>>>> is for the operating system, ring 2. The fourth (page address ending 
>>>> in 0b11) is for user level, ring 3.
>>>>
>>>> The sPAPR machine runs at the OS privilege and therefore can only 
>>>> access the OS and the User rings, 2 and 3. The others are for
>>>> hypervisor levels.
>>>
>>> Ok, that much is what I thought.  What I'm less clear on is what each
>>> page looks like compared to the others.  Previously I thought each one
>>> had the same registers, 
>>
>> yes.
>>
>>> just manipulating the corresponding ring.  
>>
>> no. 
>>
>>> Are you saying instead that each ring's page basically has a subset 
>>> of the registers in the next most privileged page?
>>
>> That's the idea. 
> 
> Ah, ok.
> 
>> The registers are defined as follows:
>>
>> 	QW-0 User      
>> 	QW-1 O/S      
>> 	QW-2 Pool   
>> 	QW-3 Physical 
>>
>> and the pages :
>>
>> - 0006030203180000 security monitor 
>>   can access all registers 
>>
>> - 0006030203190000 hv
>>   can access all registers minus the secure regs
>>
>> - 00060302031a0000 os
>>   can access some of the OS (QW1) and User (QW0) registers
>>  
>> - 00060302031b0000 user
>>   can access the NSR of the User (QW0) registers
> 
> I can see two reasonable ways of doing this:
> 
> A)
> 
> Have a single set of read/write functions.  These implement all the
> registers but take a "privilege level" parameter which controls which
> will actually work.  Those could then be wired up in one of two ways:
> 
>   A1) Single memory region.  The accessor derives the priv level from
>   the relevant address bits, before masking it down to a single
>   register page.  Then, as above

Yes. That's the goal behind the page ordering :

page address ending in 0b00 : ring 0, security monitor 
page address ending in 0b01 : ring 1, hypervisor 
page address ending in 0b10 : ring 2, operating system  
page address ending in 0b11 : ring 3, user level

I don't know why the registers are ordered the other way around, though.

That would be the direction to take for the emulated mode, I think.
It covers both the PowerNV case (4 pages) and the sPAPR case (2 pages);
in each case, the machine's interrupt controller decides how many pages
to map. The memory region ops do the rest.
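
Option A1 can be sketched as follows under stated assumptions: 64 KiB TIMA pages (TM_SHIFT = 16, consistent with the page addresses listed earlier in the thread), QWn at offset n * 0x10 within a page, and a ring-level check only. The finer rules (HV minus the secure registers, OS seeing only some QW1/QW0 registers, user seeing only NSR) and the special load/store offsets are left out. TimaView and the helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t hwaddr;              /* stand-in for QEMU's hwaddr */

#define TM_SHIFT      16              /* assumed 64 KiB TIMA pages */
#define TM_PAGE_MASK  ((1ull << TM_SHIFT) - 1)
#define TM_NR_QW      4               /* QW0 user, QW1 OS, QW2 pool, QW3 physical */

/* The view is selected by the low two bits of the page number. */
typedef enum {
    TIMA_VIEW_SECMON = 0,   /* ring 0 */
    TIMA_VIEW_HV     = 1,   /* ring 1 */
    TIMA_VIEW_OS     = 2,   /* ring 2 */
    TIMA_VIEW_USER   = 3,   /* ring 3 */
} TimaView;

static TimaView tima_view_of(hwaddr addr)
{
    return (TimaView)((addr >> TM_SHIFT) & 0x3);
}

static hwaddr tima_page_offset(hwaddr addr)
{
    return addr & TM_PAGE_MASK;
}

/* Ring-level check: is the QW holding 'page_offset' visible from 'view'? */
static bool tima_qw_accessible(TimaView view, hwaddr page_offset)
{
    unsigned qw;

    if (page_offset >= TM_NR_QW * 0x10) {
        return false;       /* not a QW register (special ops not modelled) */
    }
    qw = page_offset >> 4;
    switch (view) {
    case TIMA_VIEW_SECMON:
    case TIMA_VIEW_HV:
        return true;        /* all QWs (HV still excludes the secure regs) */
    case TIMA_VIEW_OS:
        return qw <= 1;     /* User and O/S QWs */
    case TIMA_VIEW_USER:
        return qw == 0;     /* User QW only */
    }
    return false;
}
```

In a single MemoryRegion, the read/write callbacks would apply these helpers to the incoming address, assuming the view bits survive in it (i.e. the pages keep their offsets, as noted for sPAPR in this thread), and then pass the view down as the privilege-level parameter.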

For KVM, we need to populate the VMA with the host TIMA page associated 
with ring 2 (OS) and then ring 3 (USER). 

This option looks better overall. I will see how ugly it gets with the implementation.

C.


>   A2) Multiple memory regions with the same accessor functions but
>   different opaque pointer.  The accessor gets the priv level from
>   its opaque pointer, then the address is just within a single ring's
>   page.
>
> B)
> 
> Separate memory regions with separate accessors.  The ring-0 accessor
> implements the ring-0 registers, then calls the ring-1 accessor
> function for everything else.  ring-1 calls ring-2 and so forth.
>
>> On sPAPR, we can remap the os/user pages to some other base address 
>> but we should keep the same page offset.
> 
> Sure.
> 
>>
>>
>>>> I will try to come with a better implementation of the model and
>>>> make sure the ring numbers are respected. I am not sure we should 
>>>> have only one memory region or four distinct ones with their
>>>> own ops. There are some differences in the load/store of each view.
>>>
>>> Right.  I'm not clear at this point if that's for good reasons, or
>>> just because IBM's hardware designers don't seem to have gotten the
>>> hang of Don't Repeat Yourself.
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-04-27  2:43               ` David Gibson
@ 2018-05-04 14:25                 ` Cédric Le Goater
  2018-05-05  4:32                   ` David Gibson
  0 siblings, 1 reply; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-04 14:25 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 04/27/2018 04:43 AM, David Gibson wrote:
>>>> I did some work on that topic a while ago :
>>>>
>>>> 	https://patchwork.ozlabs.org/cover/836782/
>>>>
>>>> But we stopped exploring the idea. Maybe it was not the right approach.
>>>> The PHB LSIs would benefit from such a split though.
>>> So, no, I don't think that was a good approach, but that doesn't mean
>>> other ways of rearranging the irq numbers aren't ok.  The thing here
>>> is that we don't want to think of an "irq allocator" - there are some
>>> bits like that in there already, but they were always a mistake.
>>>
>>> We have lots of irq space (both XICS and XIVE) so instead we should
>>> come up with a static mapping of irqs to devices.
>> yes. I would prefer that also. 
>>
>> We could change the spapr_irq_alloc() routine to get a block of 
>> IRQs in the range defined for a device family, and use a device 
>> id to offset in that family range ? Here are some figures :
>>
>> device family        block size  max devices  
>>
>> EVENT_CLASS_EPOW              1           1  
>> EVENT_CLASS_HOT_PLUG          1           1   
>> VIO_VSCSI                     1          10  
>> VIO_LLAN                      1          10  
>> VIO_VTY                       1           5  
>>                       
>> PCI/PHB                    1024           5  
> No, I'm thinking we should eliminate spapr_irq_alloc() entirely.
> Well, ok, not entirely, we'll still need it for the old machine
> types.  But remove its use for the current machine type completely.
> 
> Instead we have an explicit map of ranges for various purposes.  The
> one-off things like EPOW and HOTPLUG can have plain constant values.
> PCI LSIs will be calculated as something like PCI_IRQ_BASE + <phb
> index>*4 + <irq pin>.  The VIO devices we handle as VIO_BASE + <reg
> value> or something.
> 
> MSIs will still need some sort of allocation, but we can do that
> within a range set aside for them.

Should we address the static mapping of irqs before introducing XIVE ? 

I don't think it changes much of the architecture now that the allocator
is under the machine. However, I wonder what would be the impact of 
PHB hotplug. 

C. 
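
A sketch of the static mapping described above, with hypothetical base values (the real ranges would be a machine design decision): constant numbers for EPOW and hot plug, PCI LSIs at PCI_IRQ_BASE + phb_index * 4 + pin, VIO at a base plus a per-device value (assumed small here), and a first-fit allocator confined to an MSI range. None of these names exist in QEMU as written.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout of the IRQ number space. */
#define SPAPR_IRQ_EPOW       0x1000
#define SPAPR_IRQ_HOTPLUG    0x1001
#define SPAPR_IRQ_VIO_BASE   0x1100   /* + per-device value, assumed < 0x100 */
#define SPAPR_IRQ_PCI_LSI    0x1200   /* + phb_index * 4 + pin */
#define SPAPR_IRQ_MSI_BASE   0x1300
#define SPAPR_IRQ_MSI_NR     64
#define PCI_NUM_PINS         4

static uint32_t spapr_irq_vio(uint32_t vio_index)
{
    assert(vio_index < SPAPR_IRQ_PCI_LSI - SPAPR_IRQ_VIO_BASE);
    return SPAPR_IRQ_VIO_BASE + vio_index;
}

static uint32_t spapr_irq_pci_lsi(uint32_t phb_index, uint32_t pin)
{
    assert(pin < PCI_NUM_PINS);
    assert(phb_index < (SPAPR_IRQ_MSI_BASE - SPAPR_IRQ_PCI_LSI) / PCI_NUM_PINS);
    return SPAPR_IRQ_PCI_LSI + phb_index * PCI_NUM_PINS + pin;
}

/* MSIs: first-fit allocation of 'count' contiguous numbers in their own range. */
static bool spapr_msi_used[SPAPR_IRQ_MSI_NR];

static int spapr_irq_msi_alloc(unsigned count)
{
    assert(count > 0);
    for (unsigned start = 0; start + count <= SPAPR_IRQ_MSI_NR; start++) {
        unsigned i = 0;

        while (i < count && !spapr_msi_used[start + i]) {
            i++;
        }
        if (i == count) {
            for (i = 0; i < count; i++) {
                spapr_msi_used[start + i] = true;
            }
            return SPAPR_IRQ_MSI_BASE + start;
        }
    }
    return -1;   /* MSI range exhausted */
}
```

If PHB indices are fixed, one property of this scheme is that the LSI numbers of a hot-plugged PHB are known in advance; only MSIs go through an allocator.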


* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-05-04 13:05                 ` Cédric Le Goater
@ 2018-05-05  4:26                   ` David Gibson
  2018-05-09  7:23                     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-05  4:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Fri, May 04, 2018 at 03:05:08PM +0200, Cédric Le Goater wrote:
> On 05/04/2018 05:33 AM, David Gibson wrote:
> > On Thu, May 03, 2018 at 06:50:09PM +0200, Cédric Le Goater wrote:
> >> On 05/03/2018 07:22 AM, David Gibson wrote:
> >>> On Thu, Apr 26, 2018 at 12:43:29PM +0200, Cédric Le Goater wrote:
> >>>> On 04/26/2018 06:20 AM, David Gibson wrote:
> >>>>> On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
> >>>>>> On 04/24/2018 08:51 AM, David Gibson wrote:
> >>>>>>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
> >>>>>>>> sPAPRXive is a model for the XIVE interrupt controller device of the
> >>>>>>>> sPAPR machine. It holds the routing XIVE table, the Interrupt
> >>>>>>>> Virtualization Entry (IVE) table which associates interrupt source
> >>>>>>>> numbers with targets.
> >>>>>>>>
> >>>>>>>> Also extend the XiveFabric with an accessor to the IVT. This will be
> >>>>>>>> needed by the routing algorithm.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>>>> ---
> >>>>>>>>
> >>>>>>>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
> >>>>>>>>  discussed.
> >>>>>>>
> >>>>>>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
> >>>>>>> than one XiveRouter?
> >>>>>>
> >>>>>> There is only one, the main IC. 
> >>>>>
> >>>>> Ok, that's what I thought originally.  In that case some of the stuff
> >>>>> in the patches really doesn't make sense to me.
> >>>>
> >>>> well, there is one IC per chip on powernv, but we haven't reached that part
> >>>> yet.
> >>>
> >>> Hmm.  There's some things we can delay dealing with, but I don't think
> >>> this is one of them.  I think we need to understand how multichip is
> >>> going to work in order to come up with a sane architecture.  Otherwise
> >>> I fear we'll end up with something that we either need to horribly
> >>> bastardize for multichip, or have to rework things dramatically
> >>> leading to migration nightmares.
> >>
> >> So, it is all controlled by MMIO, so we should be fine on that part. 
> >> As for the internal tables, they are all configured by firmware, using
> >> a chip identifier (block). I need to check how the remote XIVE are 
> >> accessed. I think this is by MMIO. 
> > 
> > Right, but for powernv we execute OPAL inside the VM, rather than
> > emulating its effects.  So we still need to model the actual hardware
> > interfaces.  OPAL hides the details from the kernel, but not from us
> > on the other side.
> 
> Yes. This is the case in the current model. I took a look today and
> I have a few fixes for the MMIO layout for P9 chips which I will send.
> 
> As for XIVE, the model needs to be a little more complex to support
> VSD_MODE_FORWARD tables which describe how to forward a notification
> to another XIVE IC on another chip. They contain an address on which
> to load. This is another hop in the notification chain.

Ah, ok.  So is that mode and address configured in the (bare metal)
IVT as well?  Or is that a different piece of configuration?

> >> I haven't looked at multichip XIVE support but I am not too worried as 
> >> the framework is already in place for the machine.
> >>  
> >>>>>>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
> >>>>>>> interface, possibly its methods could just be class methods of
> >>>>>>> XiveRouter.
> >>>>>>
> >>>>>> Yes. We could introduce a XiveRouter to share the ivt table between 
> >>>>>> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
> >>>>>> the machines. Methods would provide a way to get the ivt/eq/nvt
> >>>>>> objects required for routing. I need to add a set_eq() to push the
> >>>>>> EQ data.
> >>>>>
> >>>>> Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
> >>>>> object which owns the IVT.  
> >>>>
> >>>> OK. that would be a model with some state and not an interface.
> >>>
> >>> Yes.  For papr variant it would have the whole IVT contents as its
> >>> state.  For the powernv, just the registers telling it where to find
> >>> the IVT in RAM.
> >>>
> >>>>> It may or may not do other stuff as well.
> >>>>
> >>>> Its only task would be to do the final event routing: get the IVE,
> >>>> get the EQ, push the EQ DATA in the OS event queue, notify the CPU.
> >>>
> >>> That seems like a lot of steps.  Up to push the EQ DATA, certainly.
> >>> And I guess it'll have to ping an NVT somehow, but I'm not sure it
> >>> should know about CPUs as such.
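
The routing steps listed above (get the IVE, get the EQ, push the EQ data) can be sketched with simplified, hypothetical structures that do not follow the hardware layouts. The router returns the NVT to notify rather than touching CPUs, in line with the remark above. Queue entries are modelled as 32-bit words carrying a generation bit in bit 31 and the EQ data in the low 31 bits; the generation bit flips when the index wraps and is assumed to start at 1 so that fresh entries differ from zeroed memory. Real queues live in guest RAM and are big-endian, which this sketch ignores.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified, hypothetical structures (not the hardware layouts). */
typedef struct {
    bool     valid;
    uint32_t eq_index;   /* which EQ receives the event */
    uint32_t eq_data;    /* value pushed to the OS queue (the EISN on sPAPR) */
} IVESketch;

typedef struct {
    bool      valid;
    uint32_t *qpage;     /* OS event queue (guest RAM on a real system) */
    uint32_t  qentries;  /* queue size in 32-bit entries */
    uint32_t  qindex;    /* next entry to write */
    uint32_t  qgen;      /* generation bit, flipped on each wrap */
    uint32_t  nvt;       /* target to notify */
} EQSketch;

/*
 * Route one event: look up the IVE, look up its EQ, push the EQ data,
 * and report which NVT to notify. Returns false if the event is dropped.
 */
static bool xive_route_sketch(const IVESketch *ivt, uint32_t nr_irqs,
                              EQSketch *eqt, uint32_t nr_eqs,
                              uint32_t lisn, uint32_t *nvt_out)
{
    const IVESketch *ive;
    EQSketch *eq;

    if (lisn >= nr_irqs || !ivt[lisn].valid) {
        return false;                    /* no routing for this source */
    }
    ive = &ivt[lisn];

    if (ive->eq_index >= nr_eqs || !eqt[ive->eq_index].valid ||
        eqt[ive->eq_index].qentries == 0) {
        return false;                    /* EQ not configured */
    }
    eq = &eqt[ive->eq_index];

    /* Entry = generation bit (bit 31) | 31 bits of EQ data. */
    eq->qpage[eq->qindex] = ((eq->qgen & 1) << 31) | (ive->eq_data & 0x7fffffff);
    if (++eq->qindex == eq->qentries) {
        eq->qindex = 0;
        eq->qgen ^= 1;                   /* consumer tracks wraps with this bit */
    }

    *nvt_out = eq->nvt;                  /* caller decides how to ping the NVT */
    return true;
}
```
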
> >>
> >> For PowerNV, the concept could be generalized, yes. An NVT can 
> >> contain the interrupt state of a logical server but the common 
> >> case is baremetal without guests for QEMU and so we have an NVT
> >> per cpu. 
> > 
> > Hmm.  We eventually want to support a kernel running guests under
> > qemu/powernv though, right?  
> 
> arg. an emulated hypervisor ! OK let's say this is a long term goal :) 
> 
> > So even if we don't allow it right now,
> > we don't want allowing that to require major surgery to our
> > architecture.
> 
> That I agree on. 
> 
> >> PowerNV will have some limitation but we can make it better than 
> >> today for sure. It boots.
> >>
> >> We can improve some of the NVT notification process, the way NVT 
> >> are matched. Eventually, we may support remote engines if the
> >> NVT is not local. I have not looked at the details.
> >>
> >>> I'm not sure at this stage what should own the EQD table.
> >>
> >> The EQDT is in RAM.
> > 
> > Not for spapr, it's not.  
> 
> yeah ok. It's in QEMU/KVM.
> 
> > And even when it is in RAM, something needs
> > to own the register that gives its base address.
> 
> It's more complex than registers on powernv. There is a procedure
> to define the XIVE tables using XIVE table descriptors which contain
> their characteristics, size, direct vs. indirect, local vs. remote.
> OPAL/skiboot defines all these to configure the HW, and the model
> necessarily needs to support the same interface. This is the case
> for a single chip.

Ah, ok.  So there's some sort of IVTD.  Also in RAM?  Eventually there
must be a register giving the base address of the IVTD, yes?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson



* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-04 13:11             ` Cédric Le Goater
@ 2018-05-05  4:27               ` David Gibson
  2018-05-09  7:27                 ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-05  4:27 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


On Fri, May 04, 2018 at 03:11:57PM +0200, Cédric Le Goater wrote:
> On 05/04/2018 06:51 AM, David Gibson wrote:
> > On Thu, May 03, 2018 at 06:06:14PM +0200, Cédric Le Goater wrote:
> >> On 05/03/2018 07:35 AM, David Gibson wrote:
> >>> On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
> >>>> On 04/26/2018 09:11 AM, David Gibson wrote:
> >>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
> >>>>>> The XIVE presenter engine uses a set of registers to handle priority
> >>>>>> management and interrupt acknowledgment among other things. The most
> >>>>>> important ones being :
> >>>>>>
> >>>>>>   - Interrupt Priority Register (PIPR)
> >>>>>>   - Interrupt Pending Buffer (IPB)
> >>>>>>   - Current Processor Priority (CPPR)
> >>>>>>   - Notification Source Register (NSR)
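
How the PIPR, IPB and CPPR interact can be sketched as follows, assuming priorities 0 (most favored) to 7, an 8-bit IPB with one bit per priority starting from the most significant bit, and PIPR = 0xff when nothing is pending. The helper names are hypothetical. This is also the recheck a CPPR write would need: after storing the new CPPR, the thread should be signalled if the most favored pending priority is now more favored than the CPPR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define XIVE_PRIORITY_MAX 7   /* assumed: priorities 0 (most favored) to 7 */

/* IPB bit for a priority: priority 0 maps to the MSB (assumed encoding). */
static uint8_t xive_priority_to_ipb(uint8_t priority)
{
    return priority > XIVE_PRIORITY_MAX ? 0 : (uint8_t)(0x80 >> priority);
}

/* PIPR: most favored pending priority, or 0xff if nothing is pending. */
static uint8_t xive_ipb_to_pipr(uint8_t ipb)
{
    for (uint8_t prio = 0; prio <= XIVE_PRIORITY_MAX; prio++) {
        if (ipb & (0x80 >> prio)) {
            return prio;
        }
    }
    return 0xff;
}

/*
 * CPPR write: values above XIVE_PRIORITY_MAX are stored as 0xff (least
 * favored, so any pending priority can be presented), as in the quoted
 * xive_nvt_set_cppr(). Returns whether the thread should be signalled.
 */
static bool xive_sketch_set_cppr(uint8_t *cppr_reg, uint8_t ipb, uint8_t cppr)
{
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    *cppr_reg = cppr;
    return xive_ipb_to_pipr(ipb) < cppr;   /* lower value = more favored */
}
```
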
> >>>>>>
> >>>>>> There is one set of registers per level of privilege, four in all :
> >>>>>> HW, HV pool, OS and User. These are called rings. All registers are
> >>>>>> accessible through a specific MMIO region called the Thread Interrupt
> >>>>>> Management Areas (TIMA) but, depending on the privilege level of the
> >>>>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
> >>>>>> OS privilege and therefore can only access the OS and the User
> >>>>>> rings. The others are for hypervisor levels.
> >>>>>>
> >>>>>> The CPU interrupt state is modeled with a XiveNVT object which stores
> >>>>>> the values of the different registers. The different TIMA views are
> >>>>>> mapped at the same address for each CPU and 'current_cpu' is used to
> >>>>>> retrieve the XiveNVT holding the ring registers.
> >>>>>>
> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>> ---
> >>>>>>
> >>>>>>  Changes since v2 :
> >>>>>>
> >>>>>>  - introduced the XiveFabric interface
> >>>>>>
> >>>>>>  hw/intc/spapr_xive.c        |  25 ++++
> >>>>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
> >>>>>>  include/hw/ppc/spapr_xive.h |   5 +
> >>>>>>  include/hw/ppc/xive.h       |  31 +++++
> >>>>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
> >>>>>>  5 files changed, 424 insertions(+)
> >>>>>>
> >>>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >>>>>> index 90cde8a4082d..f07832bf0a00 100644
> >>>>>> --- a/hw/intc/spapr_xive.c
> >>>>>> +++ b/hw/intc/spapr_xive.c
> >>>>>> @@ -13,6 +13,7 @@
> >>>>>>  #include "target/ppc/cpu.h"
> >>>>>>  #include "sysemu/cpus.h"
> >>>>>>  #include "monitor/monitor.h"
> >>>>>> +#include "hw/ppc/spapr.h"
> >>>>>>  #include "hw/ppc/spapr_xive.h"
> >>>>>>  #include "hw/ppc/xive.h"
> >>>>>>  #include "hw/ppc/xive_regs.h"
> >>>>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >>>>>>  
> >>>>>>      /* Allocate the Interrupt Virtualization Table */
> >>>>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
> >>>>>> +
> >>>>>> +    /* The Thread Interrupt Management Area has the same address for
> >>>>>> +     * each chip. On sPAPR, we only need to expose the User and OS
> >>>>>> +     * level views of the TIMA.
> >>>>>> +     */
> >>>>>> +    xive->tm_base = XIVE_TM_BASE;
> >>>>>
> >>>>> The constant should probably have PAPR in the name somewhere, since
> >>>>> it's just for PAPR machines (same for the ESB mappings, actually).
> >>>>
> >>>> ok. 
> >>>>
> >>>> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in 
> >>>> case we want to change the value when the guest is instantiated. 
> >>>> I doubt it but this is an address in the global address space, so 
> >>>> letting the machine have control is better I think.
> >>>
> >>> I agree.
> >>>
> >>>>>> +
> >>>>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
> >>>>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
> >>>>>> +                          1ull << TM_SHIFT);
> >>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
> >>>>>> +
> >>>>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
> >>>>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
> >>>>>> +                          1ull << TM_SHIFT);
> >>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
> >>>>>>  }
> >>>>>>  
> >>>>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >>>>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
> >>>>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
> >>>>>>  }
> >>>>>>  
> >>>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
> >>>>>> +{
> >>>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
> >>>>>> +
> >>>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
> >>>>>> +}
> >>>>>
> >>>>> So this is a bit of a tangent, but I've been thinking of implementing
> >>>>> a scheme where there's an opaque pointer in the cpu structure for the
> >>>>> use of the machine.  I'm planning for that to replace the intc pointer
> >>>>> (which isn't really used directly by the cpu). That would allow us to
> >>>>> have spapr put a structure there and have both xics and xive pointers
> >>>>> which could be useful later on.
> >>>>
> >>>> ok. That should simplify the patchset at the end, in which we need to 
> >>>> switch the 'intc' pointer. 
> >>>>
> >>>>> I think we'd need something similar to correctly handle migration of
> >>>>> the VPA state, which is currently horribly broken.
> >>>>>
> >>>>>> +
> >>>>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
> >>>>>>      .name = TYPE_SPAPR_XIVE "/ive",
> >>>>>>      .version_id = 1,
> >>>>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >>>>>>      dc->vmsd = &vmstate_spapr_xive;
> >>>>>>  
> >>>>>>      xfc->get_ive = spapr_xive_get_ive;
> >>>>>> +    xfc->get_nvt = spapr_xive_get_nvt;
> >>>>>>  }
> >>>>>>  
> >>>>>>  static const TypeInfo spapr_xive_info = {
> >>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>>>> index dccad0318834..5691bb9474e4 100644
> >>>>>> --- a/hw/intc/xive.c
> >>>>>> +++ b/hw/intc/xive.c
> >>>>>> @@ -14,7 +14,278 @@
> >>>>>>  #include "sysemu/cpus.h"
> >>>>>>  #include "sysemu/dma.h"
> >>>>>>  #include "monitor/monitor.h"
> >>>>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
> >>>>>>  #include "hw/ppc/xive.h"
> >>>>>> +#include "hw/ppc/xive_regs.h"
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * XIVE Interrupt Presenter
> >>>>>> + */
> >>>>>> +
> >>>>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
> >>>>>> +{
> >>>>>> +    return 0;
> >>>>>> +}
> >>>>>> +
> >>>>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
> >>>>>> +{
> >>>>>> +    if (cppr > XIVE_PRIORITY_MAX) {
> >>>>>> +        cppr = 0xff;
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    nvt->ring_os[TM_CPPR] = cppr;
> >>>>>
> >>>>> Surely this needs to recheck if we should be interrupting the cpu?
> >>>>
> >>>> yes. In patch 9, when we introduce the nvt notify routine.
> >>>
> >>> Ok.
> >>>
> >>>>>> +}
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * OS Thread Interrupt Management Area MMIO
> >>>>>> + */
> >>>>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
> >>>>>> +                                           unsigned size)
> >>>>>> +{
> >>>>>> +    uint64_t ret = -1;
> >>>>>> +
> >>>>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
> >>>>>> +        ret = xive_nvt_accept(nvt);
> >>>>>> +    } else {
> >>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
> >>>>>> +                      HWADDR_PRIx" size %d\n", offset, size);
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    return ret;
> >>>>>> +}
> >>>>>> +
> >>>>>> +#define TM_RING(offset) ((offset) & 0xf0)
> >>>>>> +
> >>>>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
> >>>>>> +                                      unsigned size)
> >>>>>> +{
> >>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> >>>>>
> >>>>> So, as I said on a previous version of this, we can actually correctly
> >>>>> represent different mappings in different cpu spaces, by exploiting
> >>>>> cpu->as and not just having them all point to &address_space_memory.
> >>>>
> >>>> Yes, you did and I haven't studied the question yet. For the next version.
> >>>
> >>> So, it's possible that using the cpu->as thing will be more trouble
> >>> than it's worth.
> >>
> >> One of the problems is the number of memory regions to use, one per cpu,
> > 
> > Well, we're already going to have an NVT object for each cpu, yes?  So
> > a memory region per-cpu doesn't seem like a big stretch.
> > 
> >> and the KVM support.
> > 
> > And I really don't see how the memory regions impacts KVM.
> 
> The TIMA is setup when the KVM device is initialized using some specific 
> ioctl to get an fd on a MMIO region from the host. It is then passed to 
> the guest as a 'ram_device', same for the ESBs. 

Ah, good point.

> This is not a common region.

I'm not sure what you mean by that.

> >> Having a single region is much easier. 
> >>
> >>> I am a little concerned about using current_cpu though.  
> >>> First, will it work with KVM with kernel_irqchip=off - the
> >>> cpus are running truly concurrently,
> >>
> >> FWIW, I didn't see any issue yet while stressing. 
> > 
> > Ok.
> > 
> >>> but we still need to work out who's poking at the TIMA.  
> >>
> >> I understand. The registers are accessed by the current cpu to set the 
> >> CPPR and to ack an interrupt. But when we route an event, we also access 
> >> and modify the registers. Do you suggest some locking? I am not sure
> >> how the TIMA region accesses are protected against the routing, which
> >> is necessarily initiated by an ESB MMIO though.
> > 
> > Locking isn't really the issue.  I mean, we do need locking, but the
> > BQL should provide that.  The issue is what exactly does "current"
> > mean in the context of multiple concurrently running cpus.  Does it
> > always mean what we need it to mean in every context we might call
> > this from.
> 
> I would say so.

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-04 13:29                 ` Cédric Le Goater
@ 2018-05-05  4:29                   ` David Gibson
  2018-05-09  8:01                     ` Cédric Le Goater
  0 siblings, 1 reply; 100+ messages in thread
From: David Gibson @ 2018-05-05  4:29 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, May 04, 2018 at 03:29:02PM +0200, Cédric Le Goater wrote:
> On 05/04/2018 07:19 AM, David Gibson wrote:
> > On Thu, May 03, 2018 at 04:37:29PM +0200, Cédric Le Goater wrote:
> >> On 05/03/2018 08:25 AM, David Gibson wrote:
> >>> On Thu, May 03, 2018 at 08:07:54AM +0200, Cédric Le Goater wrote:
> >>>> On 05/03/2018 07:45 AM, David Gibson wrote:
> >>>>> On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
> >>>>>> On 04/26/2018 09:25 AM, David Gibson wrote:
> >>>>>>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
> >>>>>>>> The Event Queue Descriptor (EQD) table is an internal table of the
> >>>>>>>> XIVE routing sub-engine. It specifies on which Event Queue the event
> >>>>>>>> data should be posted when an exception occurs (later on pulled by the
> >>>>>>>> OS) and which Virtual Processor to notify.
> >>>>>>>
> >>>>>>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
> >>>>>>> EQD gave metadata for event queues.
> >>>>>>
> >>>>>> Yes. The above is poorly written. The Event Queue Descriptor contains the
> >>>>>> guest address of the event queue in which the data is written. I will 
> >>>>>> rephrase.      
> >>>>>>
> >>>>>> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
> >>>>>> and what data to push on the queue. 
> >>>>>>  
> >>>>>>>> The Event Queue is a much
> >>>>>>>> more complex structure but we start with a simple model for the sPAPR
> >>>>>>>> machine.
> >>>>>>>>
> >>>>>>>> There is one XiveEQ per priority and these are stored under the XIVE
> >>>>>>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
> >>>>>>>>
> >>>>>>>>        (server << 3) | (priority & 0x7)
> >>>>>>>>
> >>>>>>>> This is not in the XIVE architecture but as the EQ index is never
> >>>>>>>> exposed to the guest, in the hcalls nor in the device tree, we are
> >>>>>>>> free to use what fits best the current model.
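The EQ index scheme from the commit message can be written as a small encode/decode pair. The helper names are hypothetical. The only assumption beyond the text is that the server number fits in the remaining 29 bits.

```c
#include <assert.h>
#include <stdint.h>

/* Eight priorities per server, as in the sPAPR model (3 bits). */
#define EQ_PRIO_BITS 3
#define EQ_PRIO_MASK ((1u << EQ_PRIO_BITS) - 1)

/* (server << 3) | (priority & 0x7) */
static uint32_t eq_index(uint32_t server, uint8_t priority)
{
    assert(priority <= EQ_PRIO_MASK);
    return (server << EQ_PRIO_BITS) | (priority & EQ_PRIO_MASK);
}

static uint32_t eq_index_server(uint32_t idx)
{
    return idx >> EQ_PRIO_BITS;
}

static uint8_t eq_index_priority(uint32_t idx)
{
    return idx & EQ_PRIO_MASK;
}
```

The index never leaves QEMU/KVM, so a later change to this encoding only has to stay consistent between the emulated model and whatever KVM dumps.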
> >>>>>>
> >>>>>> This EQ indexing is important to notice because it will also show up 
> >>>>>> in KVM to build the IVE from the KVM irq state.
> >>>>>
> >>>>> Ok, are you saying that while this combined EQ index will never appear
> >>>>> in guest <-> host interfaces, 
> >>>>
> >>>> Indeed.
> >>>>
> >>>>> it might show up in qemu <-> KVM interfaces?
> >>>>
> >>>> Not directly, but it is part of the IVE as the IVE_EQ_INDEX field. When
> >>>> dumped, it has to be built in a way that is compatible with the emulated
> >>>> mode in QEMU.
> >>>
> >>> Hrm.  But is the exact IVE contents visible to qemu (for a PAPR
> >>> guest)?  
> >>
> >> The guest only uses hcalls whose arguments are:
> >>
> >> 	- cpu numbers,
> >> 	- priority numbers from defined ranges,
> >> 	- logical interrupt numbers,
> >> 	- the physical address of the EQ.
> >>
> >> The parts of the IVE visible to the guest are the 'priority', the 'cpu',
> >> and the 'eisn', which is the effective IRQ number the guest assigns
> >> to the source. The 'eisn' will be pushed in the EQ.
> > 
> > Ok.
> > 
> >> The IVE EQ index is not visible.
> > 
> > Good.
> > 
> >>> I would have thought the qemu <-> KVM interfaces would have
> >>> abstracted this the same way the guest <-> KVM interfaces do.
> >>> Or is there a reason not to?
> >>
> >> It is practical to dump 64bit IVEs directly from KVM into the QEMU 
> >> internal structures because it fits the emulated mode without doing 
> >> any translation ... This might be seen as a shortcut. You will tell 
> >> me when you reach the KVM part.   
> > 
> > Ugh.. exposing to qemu the raw IVEs sounds like a bad idea to me.
> 
> You definitely need them in QEMU in emulation mode. The whole routing
> relies on it.

I'm not exactly sure what you mean by "emulation mode" here.  Above,
I'm talking specifically about a KVM HV, PAPR guest.

> > When we migrate, we're going to have to assign the guest (server,
> > priority) tuples to host EQ indices, and I think it makes more sense
> > to do that in KVM and hide the raw indices from qemu than to have qemu
> > mangle them explicitly on migration.
> 
> We will need some mangling mechanism for the KVM ioctls saving and
> restoring state. This is very similar to XICS. 
>  
> >>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>>>
> >>>>>>> Is the EQD actually modifiable by a guest?  Or are the settings of the
> >>>>>>> EQs fixed by PAPR?
> >>>>>>
> >>>>>> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
> >>>>>> of the event queue for a (server, priority) pair.
> >>>>>
> >>>>> Ok, so the EQD can be modified by the guest.  In which case we need to
> >>>>> work out what object owns it, since it'll need to migrate it.
> >>>>
> >>>> Indeed. The EQDs are CPU related as there is one EQD per (cpu,
> >>>> priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
> >>>> using per-cpu ioctls. The EQ in the OS RAM is marked dirty at that
> >>>> stage.
> >>>
> >>> To make sure I'm clear: for PAPR there's a strict relationship between
> >>> EQD and CPU (one EQD for each (cpu, priority) tuple).  
> >>
> >> Yes.
> >>
> >>> But for powernv that's not the case, right?  
> >>
> >> It is.
> > 
> > Uh.. I don't think either of us phrased that well, I'm still not sure
> > which way you're answering that.
> 
> there's a strict relationship between EQD and CPU (one EQD for each (cpu, priority) tuple) in spapr and in powernv.

For powernv that seems to be contradicted by what you say below.
AFAICT there might be a strict association at the host kernel or even
the OPAL level, but not at the hardware level.

> >>> AIUI the mapping of EQs to cpus was configurable, is that right?
> >>
> >> Each cpu has 8 EQD. Same for virtual cpus.
> > 
> > Hmm.. but is that 8 EQD per cpu something built into the hardware, or
> > just a convention of how the host kernel and OPAL operate?
> 
> It's not in the HW, it is used by the HW to route the notification. 
> The EQD contains the EQ characteristics :
> 
> * functional bits :
>   - valid bit
>   - enqueue bit, to update OS in RAM EQ or not
>   - unconditional notification
>   - backlog
>   - escalation
>   - ...
> * OS EQ fields 
>   - physical address
>   - entry index
>   - toggle bit
> * NVT fields
>   - block/chip
>   - index
> * etc.
> 
> It's a big structure : 8 words.
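The 'entry index' and 'toggle bit' fields work together, and a simplified producer/consumer model of an OS event queue shows how. The model assumes 32-bit entries, with a generation flag in the top bit and the EISN in the low 31 bits. The producer starts at generation 1 and the OS toggle starts at 0. To my understanding this mirrors the Linux XIVE driver. SketchEQ and its fields are hypothetical, and queue overflow is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SK_EQ_ENTRIES 4   /* tiny power-of-two queue for illustration */

typedef struct {
    uint32_t qpage[SK_EQ_ENTRIES]; /* stands in for the OS queue page in RAM */
    uint32_t pidx;    /* controller side: next entry to write */
    uint32_t pgen;    /* controller side: generation bit, starts at 1 */
    uint32_t cidx;    /* OS side: next entry to read */
    uint32_t ctoggle; /* OS side: toggle, starts at 0 */
} SketchEQ;

static void sk_eq_init(SketchEQ *eq)
{
    for (int i = 0; i < SK_EQ_ENTRIES; i++) {
        eq->qpage[i] = 0;   /* a zeroed page reads as empty */
    }
    eq->pidx = eq->cidx = 0;
    eq->pgen = 1;
    eq->ctoggle = 0;
}

/* Controller side: write the EISN tagged with the generation bit. */
static void sk_eq_push(SketchEQ *eq, uint32_t eisn)
{
    assert((eisn & 0x80000000u) == 0);   /* EISN fits in 31 bits */
    eq->qpage[eq->pidx] = (eq->pgen << 31) | eisn;
    eq->pidx = (eq->pidx + 1) % SK_EQ_ENTRIES;
    if (eq->pidx == 0) {
        eq->pgen ^= 1;   /* wrapped around: flip the generation */
    }
}

/* OS side: an entry is new only if its top bit differs from the toggle. */
static bool sk_eq_pop(SketchEQ *eq, uint32_t *eisn)
{
    uint32_t cur = eq->qpage[eq->cidx];

    if ((cur >> 31) == eq->ctoggle) {
        return false;    /* stale entry: queue is empty */
    }
    *eisn = cur & 0x7fffffffu;
    eq->cidx = (eq->cidx + 1) % SK_EQ_ENTRIES;
    if (eq->cidx == 0) {
        eq->ctoggle ^= 1;
    }
    return true;
}
```

Because the generation flips on every wrap, the OS never needs to clear consumed entries. Stale entries from the previous lap simply carry the old generation.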

Ok.  So yeah, the cpu association of the EQ is there in the NVT
fields, not baked into the hardware.

> The EQD table is allocated by OPAL/skiboot and fed to the HW for
> its use. The powernv OS uses OPAL calls to configure the EQD for its
> needs:
> 
> int64_t opal_xive_set_queue_info(uint64_t vp, uint32_t prio,
> 				 uint64_t qpage,
> 				 uint64_t qsize,
> 				 uint64_t qflags);
> 
> 
> sPAPR uses an hcall :
> 
> static long plpar_int_set_queue_config(unsigned long flags,
> 				       unsigned long target,
> 				       unsigned long priority,
> 				       unsigned long qpage,
> 				       unsigned long qsize)
> 
> 
> but it is translated into an OPAL call in KVM.
> 
> C.
> 
>  
> >  
> >>
> >> I am not sure what you understood before? It is surely something
> >> I wrote; my XIVE understanding is still making progress.
> >>
> >>
> >> C.
> >>
> > 
> 

* Re: [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources
  2018-05-04 14:25                 ` Cédric Le Goater
@ 2018-05-05  4:32                   ` David Gibson
  0 siblings, 0 replies; 100+ messages in thread
From: David Gibson @ 2018-05-05  4:32 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, May 04, 2018 at 04:25:16PM +0200, Cédric Le Goater wrote:
> On 04/27/2018 04:43 AM, David Gibson wrote:
> >>>> I did some work on that topic a while ago :
> >>>>
> >>>> 	https://patchwork.ozlabs.org/cover/836782/
> >>>>
> >>>> But we stopped exploring the idea. Maybe it was not the right approach.
> >>>> The PHBs LSIs would benefit from such a split though.
> >>> So, no, I don't think that was a good approach, but that doesn't mean
> >>> other ways of rearranging the irq numbers aren't ok.  The thing here
> >>> is that we don't want to think of an "irq allocator" - there are some
> >>> bits like that in there already, but they were always a mistake.
> >>>
> >>> We have lots of irq space (both XICS and XIVE) so instead we should
> >>> come up with a static mapping of irqs to devices.
> >> yes. I would prefer that also. 
> >>
> >> We could change the spapr_irq_alloc() routine to get a block of 
> >> IRQs in the range defined for a device family, and use a device 
> >> id to offset in that family range ? Here are some figures :
> >>
> >> device family        block size  max devices  
> >>
> >> EVENT_CLASS_EPOW              1           1  
> >> EVENT_CLASS_HOT_PLUG          1           1   
> >> VIO_VSCSI                     1          10  
> >> VIO_LLAN                      1          10  
> >> VIO_VTY                       1           5  
> >>                       
> >> PCI/PHB                    1024           5  
> > No, I'm thinking we should eliminate spapr_irq_alloc() entirely.
> > Well, ok, not entirely, we'll still need it for the old machine
> > types.  But remove its use for the current machine type completely.
> > 
> > Instead we have an explicit map of ranges for various purposes.  The
> > one-off things like EPOW and HOTPLUG can have plain constant values.
> > PCI LSIs will be calculated as something like PCI_IRQ_BASE + <phb
> > index>*4 + <irq pin>.  The VIO devices we handle as VIO_BASE + <reg
> > value> or something.
> > 
> > MSIs will still need some sort of allocation, but we can do that
> > within a range set aside for them.
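Here is a sketch of the static layout described above. Every base address and size is made up for illustration. Only the PCI_IRQ_BASE + index*4 + pin shape comes from the mail. The VIO helper simplifies by assuming the low 8 bits of the reg property identify the device. The MSI range uses a plain first-fit allocator. A real allocator may also need to align multi-MSI blocks, and this one does not.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout: every value below is made up for illustration. */
#define SK_IRQ_EPOW        0x1000
#define SK_IRQ_HOTPLUG     0x1001
#define SK_IRQ_VIO_BASE    0x1100   /* 256 VIO slots */
#define SK_IRQ_PCI_BASE    0x1200
#define SK_PCI_PINS        4        /* INTA..INTD per PHB */
#define SK_MAX_PHBS        32
#define SK_IRQ_MSI_BASE    0x2000
#define SK_MSI_COUNT       64

/* PCI LSI: PCI_IRQ_BASE + <phb index> * 4 + <irq pin> */
static uint32_t sk_pci_lsi_irq(uint32_t phb_index, uint32_t pin)
{
    assert(phb_index < SK_MAX_PHBS && pin < SK_PCI_PINS);
    return SK_IRQ_PCI_BASE + phb_index * SK_PCI_PINS + pin;
}

/* VIO: base + low bits of the device's reg (an assumption) */
static uint32_t sk_vio_irq(uint32_t reg)
{
    return SK_IRQ_VIO_BASE + (reg & 0xff);
}

/* MSIs: first-fit allocation of a contiguous block inside their range. */
static uint8_t sk_msi_used[SK_MSI_COUNT];

static uint32_t sk_msi_alloc(uint32_t count)   /* returns 0 on failure */
{
    assert(count > 0);
    for (uint32_t start = 0; start + count <= SK_MSI_COUNT; start++) {
        uint32_t i = 0;
        while (i < count && !sk_msi_used[start + i]) {
            i++;
        }
        if (i == count) {
            for (i = 0; i < count; i++) {
                sk_msi_used[start + i] = 1;
            }
            return SK_IRQ_MSI_BASE + start;
        }
    }
    return 0;
}

static void sk_msi_free(uint32_t irq, uint32_t count)
{
    assert(irq >= SK_IRQ_MSI_BASE &&
           irq - SK_IRQ_MSI_BASE + count <= SK_MSI_COUNT);
    for (uint32_t i = 0; i < count; i++) {
        sk_msi_used[irq - SK_IRQ_MSI_BASE + i] = 0;
    }
}
```

With a fixed PHB 'index', each PHB owns its own slice of the PCI LSI range, so hotplugging one does not disturb the numbers of the others.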
> 
> Should we address the static mapping of irqs before introducing XIVE ? 

Yes, I think so.

> I don't think it changes much of the architecture now that the allocator
> is under the machine. However, I wonder what would be the impact of 
> PHB hotplug.

I don't think it should be too bad.  We now require that PHBs have the
'index' parameter set, and that won't change with hotplug.  We can
then set aside a region of irq #s for each index of PHB.

* Re: [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR
  2018-05-05  4:26                   ` David Gibson
@ 2018-05-09  7:23                     ` Cédric Le Goater
  0 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-09  7:23 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/05/2018 06:26 AM, David Gibson wrote:
> On Fri, May 04, 2018 at 03:05:08PM +0200, Cédric Le Goater wrote:
>> On 05/04/2018 05:33 AM, David Gibson wrote:
>>> On Thu, May 03, 2018 at 06:50:09PM +0200, Cédric Le Goater wrote:
>>>> On 05/03/2018 07:22 AM, David Gibson wrote:
>>>>> On Thu, Apr 26, 2018 at 12:43:29PM +0200, Cédric Le Goater wrote:
>>>>>> On 04/26/2018 06:20 AM, David Gibson wrote:
>>>>>>> On Tue, Apr 24, 2018 at 11:46:04AM +0200, Cédric Le Goater wrote:
>>>>>>>> On 04/24/2018 08:51 AM, David Gibson wrote:
>>>>>>>>> On Thu, Apr 19, 2018 at 02:43:00PM +0200, Cédric Le Goater wrote:
>>>>>>>>>> sPAPRXive is a model for the XIVE interrupt controller device of the
>>>>>>>>>> sPAPR machine. It holds the routing XIVE table, the Interrupt
>>>>>>>>>> Virtualization Entry (IVE) table which associates interrupt source
>>>>>>>>>> numbers with targets.
>>>>>>>>>>
>>>>>>>>>> Also extend the XiveFabric with an accessor to the IVT. This will be
>>>>>>>>>> needed by the routing algorithm.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>>>> ---
>>>>>>>>>>
>>>>>>>>>>  Maybe we should introduce a XiveRouter model to hold the IVT. To be
>>>>>>>>>>  discussed.
>>>>>>>>>
>>>>>>>>> Yeah, maybe.  Am I correct in thinking that on pnv there could be more
>>>>>>>>> than one XiveRouter?
>>>>>>>>
>>>>>>>> There is only one, the main IC. 
>>>>>>>
>>>>>>> Ok, that's what I thought originally.  In that case some of the stuff
>>>>>>> in the patches really doesn't make sense to me.
>>>>>>
>>>>>> well, there is one IC per chip on powernv, but we haven't reached that part
>>>>>> yet.
>>>>>
>>>>> Hmm.  There's some things we can delay dealing with, but I don't think
>>>>> this is one of them.  I think we need to understand how multichip is
>>>>> going to work in order to come up with a sane architecture.  Otherwise
>>>>> I fear we'll end up with something that we either need to horribly
>>>>> bastardize for multichip, or have to rework things dramatically
>>>>> leading to migration nightmares.
>>>>
>>>> So, it is all controlled by MMIO, so we should be fine on that part. 
>>>> As for the internal tables, they are all configured by firmware, using
>>>> a chip identifier (block). I need to check how the remote XIVE are 
>>>> accessed. I think this is by MMIO. 
>>>
>>> Right, but for powernv we execute OPAL inside the VM, rather than
>>> emulating its effects.  So we still need to model the actual hardware
>>> interfaces.  OPAL hides the details from the kernel, but not from us
>>> on the other side.
>>
>> Yes. This is the case in the current model. I took a look today and
>> I have a few fixes for the MMIO layout for P9 chips which I will send.
>>
>> As for XIVE, the model needs to be a little more complex to support
>> VSD_MODE_FORWARD tables which describe how to forward a notification
>> to another XIVE IC on another chip. They contain an address on which
>> to load. This is another hop in the notification chain.
> 
> Ah, ok.  So is that mode and address configured in the (bare metal)
> IVT as well?  Or is that a different piece of configuration?

The mode of a virtual structure table is configured by firmware. 
There are 4 main table types:  IVT, SBE, EQD, VPD (and an extra one
for IRQ) for the 16 possible blocks of a machine (I am simplifying 
a bit there). 

Today, tables local to a block/chip are set to EXCLUSIVE and all remote
tables are set to FORWARD.

The address of a table is also configured by FW. In the case of a
FORWARD table, it is set to the remote IC BAR + one page. This page
has two 2K windows: one for HW interrupt triggers and another one
to forward interrupts and for operation synchronization.
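Below is a rough model of the two-register VSD programming and the EXCLUSIVE/FORWARD lookup described above. All names, the 64K page size, and the register semantics are assumptions made for illustration. The extra IRQ table type and indirect tables are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names; the real encoding lives in the PowerNV model. */
enum { VST_IVT, VST_SBE, VST_EQDT, VST_VPDT, VST_COUNT };
#define XIVE_MAX_BLOCKS 16
#define SKETCH_PAGE_SIZE 0x10000ull   /* assumed 64K IC page */

typedef enum { VSD_INVALID, VSD_EXCLUSIVE, VSD_FORWARD } VsdMode;

typedef struct {
    VsdMode mode;
    uint64_t addr;   /* table base (EXCLUSIVE) or remote IC window (FORWARD) */
} SketchVSD;

typedef struct {
    uint32_t sel_type, sel_block;      /* latched by the first register */
    SketchVSD vsd[VST_COUNT][XIVE_MAX_BLOCKS];
} SketchIC;

/* First register write: select which (table type, block) to configure. */
static void ic_select_table(SketchIC *ic, uint32_t type, uint32_t block)
{
    assert(type < VST_COUNT && block < XIVE_MAX_BLOCKS);
    ic->sel_type = type;
    ic->sel_block = block;
}

/* Second register write: load the VSD for the selected entry. */
static void ic_write_vsd(SketchIC *ic, VsdMode mode, uint64_t addr)
{
    ic->vsd[ic->sel_type][ic->sel_block] = (SketchVSD){ mode, addr };
}

/* Firmware-style setup: local block EXCLUSIVE, a remote one FORWARD to
 * the remote IC BAR plus one page. */
static void fw_setup_ivt(SketchIC *ic, uint32_t local_block, uint64_t ivt_base,
                         uint32_t remote_block, uint64_t remote_ic_bar)
{
    ic_select_table(ic, VST_IVT, local_block);
    ic_write_vsd(ic, VSD_EXCLUSIVE, ivt_base);
    ic_select_table(ic, VST_IVT, remote_block);
    ic_write_vsd(ic, VSD_FORWARD, remote_ic_bar + SKETCH_PAGE_SIZE);
}

typedef enum { LOOKUP_LOCAL, LOOKUP_FORWARD, LOOKUP_ERROR } LookupResult;

/* Resolve a table access: use the local table or take another hop. */
static LookupResult ic_lookup(const SketchIC *ic, uint32_t type,
                              uint32_t block, uint64_t *addr)
{
    if (type >= VST_COUNT || block >= XIVE_MAX_BLOCKS) {
        return LOOKUP_ERROR;
    }
    const SketchVSD *vsd = &ic->vsd[type][block];
    switch (vsd->mode) {
    case VSD_EXCLUSIVE:
        *addr = vsd->addr;
        return LOOKUP_LOCAL;
    case VSD_FORWARD:
        *addr = vsd->addr;
        return LOOKUP_FORWARD;
    default:
        return LOOKUP_ERROR;
    }
}
```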
>>>> I haven't looked at multichip XIVE support but I am not too worried as 
>>>> the framework is already in place for the machine.
>>>>  
>>>>>>>>> If we did have a XiveRouter, I'm not sure we'd need the XiveFabric
>>>>>>>>> interface, possibly its methods could just be class methods of
>>>>>>>>> XiveRouter.
>>>>>>>>
>>>>>>>> Yes. We could introduce a XiveRouter to share the ivt table between 
>>>>>>>> the sPAPRXive and the PnvXIVE models, the interrupt controllers of
>>>>>>>> the machines. Methods would provide a way to get the ivt/eq/nvt
>>>>>>>> objects required for routing. I need to add a set_eq() to push the
>>>>>>>> EQ data.
>>>>>>>
>>>>>>> Hrm.  Well, to add some more clarity, let's say the XiveRouter is the
>>>>>>> object which owns the IVT.  
>>>>>>
>>>>>> OK. that would be a model with some state and not an interface.
>>>>>
>>>>> Yes.  For papr variant it would have the whole IVT contents as its
>>>>> state.  For the powernv, just the registers telling it where to find
>>>>> the IVT in RAM.
>>>>>
>>>>>>> It may or may not do other stuff as well.
>>>>>>
>>>>>> Its only task would be to do the final event routing: get the IVE,
>>>>>> get the EQ, push the EQ DATA in the OS event queue, notify the CPU.
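The routing steps listed here (get the IVE, get the EQ, push the EQ data, notify) can be sketched as a single function. Following David's remark, the router in this sketch only calls back with an NVT number and knows nothing about CPUs. The structures are simplified stand-ins, not the real IVE/EQD bit layouts, and all names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SK_NR_IRQS 8
#define SK_NR_EQS  16
#define SK_EQ_LEN  4

typedef struct {          /* simplified IVE, not the real bit layout */
    bool valid, masked;
    uint32_t eq_index;    /* which EQD to use */
    uint32_t eisn;        /* data the OS will find in its queue */
} SketchIVE;

typedef struct {          /* simplified EQD */
    bool valid;
    uint32_t qpage[SK_EQ_LEN];
    uint32_t qindex, qgen;
    uint32_t nvt;         /* NVT to notify */
} SketchEQD;

typedef void (*SketchNotifyFn)(void *opaque, uint32_t nvt);

typedef struct {
    SketchIVE ivt[SK_NR_IRQS];
    SketchEQD eqdt[SK_NR_EQS];
    SketchNotifyFn notify;  /* pings an NVT; knows nothing about CPUs */
    void *opaque;
} SketchRouter;

static bool sketch_route(SketchRouter *r, uint32_t lisn)
{
    /* 1. get the IVE, bounds-checked like spapr_xive_get_ive() */
    SketchIVE *ive = lisn < SK_NR_IRQS ? &r->ivt[lisn] : NULL;
    if (!ive || !ive->valid || ive->masked) {
        return false;
    }

    /* 2. get the EQ */
    assert(ive->eq_index < SK_NR_EQS);
    SketchEQD *eq = &r->eqdt[ive->eq_index];
    if (!eq->valid) {
        return false;
    }

    /* 3. push the EQ data, tagged with the generation bit */
    eq->qpage[eq->qindex] = (eq->qgen << 31) | (ive->eisn & 0x7fffffffu);
    eq->qindex = (eq->qindex + 1) % SK_EQ_LEN;
    if (eq->qindex == 0) {
        eq->qgen ^= 1;
    }

    /* 4. notify the target NVT */
    r->notify(r->opaque, eq->nvt);
    return true;
}

/* Trivial callback that records the last NVT notified. */
static void sketch_record_nvt(void *opaque, uint32_t nvt)
{
    *(uint32_t *)opaque = nvt;
}
```

The callback keeps the router independent of how NVTs map onto CPUs. That matters if powernv later needs NVTs for guests rather than one per physical thread.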
>>>>>
>>>>> That seems like a lot of steps.  Up to push the EQ DATA, certainly.
>>>>> And I guess it'll have to ping an NVT somehow, but I'm not sure it
>>>>> should know about CPUs as such.
>>>>
>>>> For PowerNV, the concept could be generalized, yes. An NVT can 
>>>> contain the interrupt state of a logical server but the common 
>>>> case is baremetal without guests for QEMU and so we have a NVT 
>>>> per cpu. 
>>>
>>> Hmm.  We eventually want to support a kernel running guests under
>>> qemu/powernv though, right?  
>>
>> arg. an emulated hypervisor ! OK let's say this is a long term goal :) 
>>
>>> So even if we don't allow it right now,
>>> we don't want allowing that to require major surgery to our
>>> architecture.
>>
>> That I agree on. 
>>
>>>> PowerNV will have some limitation but we can make it better than 
>>>> today for sure. It boots.
>>>>
>>>> We can improve some of the NVT notification process, possibly the way
>>>> NVTs are matched, and maybe support remote engines if the NVT is not
>>>> local. I have not looked at the details.
>>>>
>>>>> I'm not sure at this stage what should own the EQD table.
>>>>
>>>> The EQDT is in RAM.
>>>
>>> Not for spapr, it's not.  
>>
>> yeah ok. It's in QEMU/KVM.
>>
>>> And even when it is in RAM, something needs
>>> to own the register that gives its base address.
>>
>> It's more complex than registers on powernv. There is a procedure
>> to define the XIVE tables using XIVE table descriptors which contain
>> their characteristics: size, direct vs. indirect, local vs. remote.
>> OPAL/skiboot defines all these to configure the HW, and the model
>> necessarily needs to support the same interface. This is the case
>> for a single chip.
> 
> Ah, ok.  So there's some sort of IVTD. 

These are called Virtual Structure table Descriptors (VSDs). Each
XIVE chip has an array of these.

> Also in RAM?  

Yes. But the VSDs are just temporary structures used to configure the HW.
What matters is the information they hold: the IVT, EQDT, VPDT, etc.

> Eventually there
> must be a register giving the base address of the IVTD, yes?

There are two registers to configure the table. One to set the 
table type and block, and one to set its VSD.

C.
 

* Re: [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model
  2018-05-05  4:27               ` David Gibson
@ 2018-05-09  7:27                 ` Cédric Le Goater
  0 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-09  7:27 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/05/2018 06:27 AM, David Gibson wrote:
> On Fri, May 04, 2018 at 03:11:57PM +0200, Cédric Le Goater wrote:
>> On 05/04/2018 06:51 AM, David Gibson wrote:
>>> On Thu, May 03, 2018 at 06:06:14PM +0200, Cédric Le Goater wrote:
>>>> On 05/03/2018 07:35 AM, David Gibson wrote:
>>>>> On Thu, Apr 26, 2018 at 11:27:21AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/26/2018 09:11 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 19, 2018 at 02:43:02PM +0200, Cédric Le Goater wrote:
>>>>>>>> The XIVE presenter engine uses a set of registers to handle priority
>>>>>>>> management and interrupt acknowledgment among other things. The most
>>>>>>>> important ones being :
>>>>>>>>
>>>>>>>>   - Interrupt Priority Register (PIPR)
>>>>>>>>   - Interrupt Pending Buffer (IPB)
>>>>>>>>   - Current Processor Priority (CPPR)
>>>>>>>>   - Notification Source Register (NSR)
>>>>>>>>
>>>>>>>> There is one set of registers per level of privilege, four in all :
>>>>>>>> HW, HV pool, OS and User. These are called rings. All registers are
>>>>>>>> accessible through a specific MMIO region called the Thread Interrupt
>>>>>>>> Management Areas (TIMA) but, depending on the privilege level of the
>>>>>>>> CPU, the view of the TIMA is filtered. The sPAPR machine runs at the
>>>>>>>> OS privilege and therefore can only access the OS and the User
>>>>>>>> rings. The others are for hypervisor levels.
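A small sketch can show how a TIMA offset splits into a ring and a register, and how a per-privilege view could filter accesses. The QW0..QW3 ring bases are assumed to match xive_regs.h. TM_REG and tima_view_allows() are hypothetical. The special operation offsets, such as TM_SPC_ACK_OS_REG, are not modeled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TM_RING(offset) ((offset) & 0xf0)   /* as in the patch */
#define TM_REG(offset)  ((offset) & 0x0f)   /* hypothetical helper */

/* Ring bases, assumed to follow the QW0..QW3 layout of xive_regs.h */
#define TM_QW0_USER     0x00
#define TM_QW1_OS       0x10
#define TM_QW2_HV_POOL  0x20
#define TM_QW3_HV_PHYS  0x30

/* Only the two views sPAPR exposes are modeled. */
typedef enum { TIMA_VIEW_USER, TIMA_VIEW_OS } TimaView;

/* A view may touch its own ring and the less privileged ones. Only the
 * 64-byte register area is modeled, not the special operations. */
static bool tima_view_allows(TimaView view, uint64_t offset)
{
    uint64_t highest = view == TIMA_VIEW_OS ? TM_QW1_OS : TM_QW0_USER;

    assert(view == TIMA_VIEW_USER || view == TIMA_VIEW_OS);
    return offset < TM_QW3_HV_PHYS + 0x10 && TM_RING(offset) <= highest;
}
```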
>>>>>>>>
>>>>>>>> The CPU interrupt state is modeled with a XiveNVT object which stores
>>>>>>>> the values of the different registers. The different TIMA views are
>>>>>>>> mapped at the same address for each CPU and 'current_cpu' is used to
>>>>>>>> retrieve the XiveNVT holding the ring registers.
>>>>>>>>
>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>> ---
>>>>>>>>
>>>>>>>>  Changes since v2 :
>>>>>>>>
>>>>>>>>  - introduced the XiveFabric interface
>>>>>>>>
>>>>>>>>  hw/intc/spapr_xive.c        |  25 ++++
>>>>>>>>  hw/intc/xive.c              | 279 ++++++++++++++++++++++++++++++++++++++++++++
>>>>>>>>  include/hw/ppc/spapr_xive.h |   5 +
>>>>>>>>  include/hw/ppc/xive.h       |  31 +++++
>>>>>>>>  include/hw/ppc/xive_regs.h  |  84 +++++++++++++
>>>>>>>>  5 files changed, 424 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>>>>>> index 90cde8a4082d..f07832bf0a00 100644
>>>>>>>> --- a/hw/intc/spapr_xive.c
>>>>>>>> +++ b/hw/intc/spapr_xive.c
>>>>>>>> @@ -13,6 +13,7 @@
>>>>>>>>  #include "target/ppc/cpu.h"
>>>>>>>>  #include "sysemu/cpus.h"
>>>>>>>>  #include "monitor/monitor.h"
>>>>>>>> +#include "hw/ppc/spapr.h"
>>>>>>>>  #include "hw/ppc/spapr_xive.h"
>>>>>>>>  #include "hw/ppc/xive.h"
>>>>>>>>  #include "hw/ppc/xive_regs.h"
>>>>>>>> @@ -95,6 +96,22 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>>>>>>  
>>>>>>>>      /* Allocate the Interrupt Virtualization Table */
>>>>>>>>      xive->ivt = g_new0(XiveIVE, xive->nr_irqs);
>>>>>>>> +
>>>>>>>> +    /* The Thread Interrupt Management Area has the same address for
>>>>>>>> +     * each chip. On sPAPR, we only need to expose the User and OS
>>>>>>>> +     * level views of the TIMA.
>>>>>>>> +     */
>>>>>>>> +    xive->tm_base = XIVE_TM_BASE;
>>>>>>>
>>>>>>> The constant should probably have PAPR in the name somewhere, since
>>>>>>> it's just for PAPR machines (same for the ESB mappings, actually).
>>>>>>
>>>>>> ok. 
>>>>>>
>>>>>> I have also made 'tm_base' a property, like 'vc_base' for ESBs, in 
>>>>>> case we want to change the value when the guest is instantiated. 
>>>>>> I doubt it but this is an address in the global address space, so 
>>>>>> letting the machine have control is better I think.
>>>>>
>>>>> I agree.
>>>>>
>>>>>>>> +
>>>>>>>> +    memory_region_init_io(&xive->tm_mmio_user, OBJECT(xive),
>>>>>>>> +                          &xive_tm_user_ops, xive, "xive.tima.user",
>>>>>>>> +                          1ull << TM_SHIFT);
>>>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_user);
>>>>>>>> +
>>>>>>>> +    memory_region_init_io(&xive->tm_mmio_os, OBJECT(xive),
>>>>>>>> +                          &xive_tm_os_ops, xive, "xive.tima.os",
>>>>>>>> +                          1ull << TM_SHIFT);
>>>>>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio_os);
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>>>> @@ -104,6 +121,13 @@ static XiveIVE *spapr_xive_get_ive(XiveFabric *xf, uint32_t lisn)
>>>>>>>>      return lisn < xive->nr_irqs ? &xive->ivt[lisn] : NULL;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>> +static XiveNVT *spapr_xive_get_nvt(XiveFabric *xf, uint32_t server)
>>>>>>>> +{
>>>>>>>> +    PowerPCCPU *cpu = spapr_find_cpu(server);
>>>>>>>> +
>>>>>>>> +    return cpu ? XIVE_NVT(cpu->intc) : NULL;
>>>>>>>> +}
>>>>>>>
>>>>>>> So this is a bit of a tangent, but I've been thinking of implementing
>>>>>>> a scheme where there's an opaque pointer in the cpu structure for the
>>>>>>> use of the machine.  I'm planning for that to replace the intc pointer
>>>>>>> (which isn't really used directly by the cpu). That would allow us to
>>>>>>> have spapr put a structure there and have both xics and xive pointers
>>>>>>> which could be useful later on.
>>>>>>
>>>>>> ok. That should simplify the patchset at the end, in which we need to 
>>>>>> switch the 'intc' pointer. 
>>>>>>
>>>>>>> I think we'd need something similar to correctly handle migration of
>>>>>>> the VPA state, which is currently horribly broken.
>>>>>>>
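The per-cpu opaque pointer idea can be sketched with hypothetical type names. The cpu only carries a void pointer, and spapr stores a structure holding both the XICS and XIVE presenters behind it. With this layout, switching the interrupt mode would not require swapping cpu->intc.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-ins for the real presenter objects. */
typedef struct { int server; } SketchXicsICP;  /* XICS presenter */
typedef struct { int server; } SketchXiveNVT;  /* XIVE presenter */

/* Hypothetical per-cpu machine state that spapr would hang off the cpu. */
typedef struct {
    SketchXicsICP *icp;
    SketchXiveNVT *nvt;
} SketchSpaprCpuState;

typedef struct {
    int index;
    void *machine_data;   /* opaque to the cpu, owned by the machine */
} SketchCPU;

static SketchSpaprCpuState *sketch_spapr_cpu_state(SketchCPU *cpu)
{
    assert(cpu != NULL);
    return cpu->machine_data;
}

/* Both presenters exist side by side; no pointer swapping on mode change. */
static SketchXiveNVT *sketch_cpu_nvt(SketchCPU *cpu)
{
    SketchSpaprCpuState *st = sketch_spapr_cpu_state(cpu);
    return st ? st->nvt : NULL;
}

static SketchXicsICP *sketch_cpu_icp(SketchCPU *cpu)
{
    SketchSpaprCpuState *st = sketch_spapr_cpu_state(cpu);
    return st ? st->icp : NULL;
}
```

The same per-cpu structure would be a natural home for other machine-owned state, such as the VPA information mentioned above.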
>>>>>>>> +
>>>>>>>>  static const VMStateDescription vmstate_spapr_xive_ive = {
>>>>>>>>      .name = TYPE_SPAPR_XIVE "/ive",
>>>>>>>>      .version_id = 1,
>>>>>>>> @@ -143,6 +167,7 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>>>>>      dc->vmsd = &vmstate_spapr_xive;
>>>>>>>>  
>>>>>>>>      xfc->get_ive = spapr_xive_get_ive;
>>>>>>>> +    xfc->get_nvt = spapr_xive_get_nvt;
>>>>>>>>  }
>>>>>>>>  
>>>>>>>>  static const TypeInfo spapr_xive_info = {
>>>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>>>> index dccad0318834..5691bb9474e4 100644
>>>>>>>> --- a/hw/intc/xive.c
>>>>>>>> +++ b/hw/intc/xive.c
>>>>>>>> @@ -14,7 +14,278 @@
>>>>>>>>  #include "sysemu/cpus.h"
>>>>>>>>  #include "sysemu/dma.h"
>>>>>>>>  #include "monitor/monitor.h"
>>>>>>>> +#include "hw/ppc/xics.h" /* for ICP_PROP_CPU */
>>>>>>>>  #include "hw/ppc/xive.h"
>>>>>>>> +#include "hw/ppc/xive_regs.h"
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * XIVE Interrupt Presenter
>>>>>>>> + */
>>>>>>>> +
>>>>>>>> +static uint64_t xive_nvt_accept(XiveNVT *nvt)
>>>>>>>> +{
>>>>>>>> +    return 0;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +static void xive_nvt_set_cppr(XiveNVT *nvt, uint8_t cppr)
>>>>>>>> +{
>>>>>>>> +    if (cppr > XIVE_PRIORITY_MAX) {
>>>>>>>> +        cppr = 0xff;
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    nvt->ring_os[TM_CPPR] = cppr;
>>>>>>>
>>>>>>> Surely this needs to recheck if we should be interrupting the cpu?
>>>>>>
>>>>>> yes. In patch 9, when we introduce the nvt notify routine.
>>>>>
>>>>> Ok.
>>>>>
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +/*
>>>>>>>> + * OS Thread Interrupt Management Area MMIO
>>>>>>>> + */
>>>>>>>> +static uint64_t xive_tm_read_special(XiveNVT *nvt, hwaddr offset,
>>>>>>>> +                                           unsigned size)
>>>>>>>> +{
>>>>>>>> +    uint64_t ret = -1;
>>>>>>>> +
>>>>>>>> +    if (offset == TM_SPC_ACK_OS_REG && size == 2) {
>>>>>>>> +        ret = xive_nvt_accept(nvt);
>>>>>>>> +    } else {
>>>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid TIMA read @%"
>>>>>>>> +                      HWADDR_PRIx" size %d\n", offset, size);
>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    return ret;
>>>>>>>> +}
>>>>>>>> +
>>>>>>>> +#define TM_RING(offset) ((offset) & 0xf0)
>>>>>>>> +
>>>>>>>> +static uint64_t xive_tm_os_read(void *opaque, hwaddr offset,
>>>>>>>> +                                      unsigned size)
>>>>>>>> +{
>>>>>>>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>>>>>>>
>>>>>>> So, as I said on a previous version of this, we can actually correctly
>>>>>>> represent different mappings in different cpu spaces, by exploiting
>>>>>>> cpu->as and not just having them all point to &address_space_memory.
>>>>>>
>>>>>> Yes, you did and I haven't studied the question yet. For the next version.
>>>>>
>>>>> So, it's possible that using the cpu->as thing will be more trouble
>>>>> than it's worth.
>>>>
>>>> One of the problems is the number of memory regions to use, one per cpu,
>>>
>>> Well, we're already going to have an NVT object for each cpu, yes?  So
>>> a memory region per-cpu doesn't seem like a big stretch.
>>>
>>>> and the KVM support.
>>>
>>> And I really don't see how the memory regions impacts KVM.
>>
>> The TIMA is setup when the KVM device is initialized using some specific 
>> ioctl to get an fd on a MMIO region from the host. It is then passed to 
>> the guest as a 'ram_device', same for the ESBs. 
> 
> Ah, good point.
> 
>> This is not a common region.
> 
> I'm not sure what you mean by that.

By that I meant 'out of the ordinary' or 'unusual', especially under KVM.

C.

 
>>>> Having a single region is much easier. 
>>>>
>>>>> I am a little concerned about using current_cpu though.  
>>>>> First, will it work with KVM with kernel_irqchip=off - the
>>>>> cpus are running truly concurrently,
>>>>
>>>> FWIW, I didn't see any issue yet while stressing. 
>>>
>>> Ok.
>>>
>>>>> but we still need to work out who's poking at the TIMA.  
>>>>
>>>> I understand. The registers are accessed by the current cpu to set the 
>>>> CPPR and to ack an interrupt. But when we route an event, we also access 
>>>> and modify the registers. Do you suggest some locking? I am not sure
>>>> how the TIMA region accesses are protected against the routing, which
>>>> is necessarily initiated by an ESB MMIO though.
>>>
>>> Locking isn't really the issue.  I mean, we do need locking, but the
>>> BQL should provide that.  The issue is what exactly does "current"
>>> mean in the context of multiple concurrently running cpus.  Does it
>>> always mean what we need it to mean in every context we might call
>>> this from.
>>
>> I would say so.
> 
> Ok.
> 


* Re: [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues
  2018-05-05  4:29                   ` David Gibson
@ 2018-05-09  8:01                     ` Cédric Le Goater
  0 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-09  8:01 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/05/2018 06:29 AM, David Gibson wrote:
> On Fri, May 04, 2018 at 03:29:02PM +0200, Cédric Le Goater wrote:
>> On 05/04/2018 07:19 AM, David Gibson wrote:
>>> On Thu, May 03, 2018 at 04:37:29PM +0200, Cédric Le Goater wrote:
>>>> On 05/03/2018 08:25 AM, David Gibson wrote:
>>>>> On Thu, May 03, 2018 at 08:07:54AM +0200, Cédric Le Goater wrote:
>>>>>> On 05/03/2018 07:45 AM, David Gibson wrote:
>>>>>>> On Thu, Apr 26, 2018 at 11:48:06AM +0200, Cédric Le Goater wrote:
>>>>>>>> On 04/26/2018 09:25 AM, David Gibson wrote:
>>>>>>>>> On Thu, Apr 19, 2018 at 02:43:03PM +0200, Cédric Le Goater wrote:
>>>>>>>>>> The Event Queue Descriptor (EQD) table is an internal table of the
>>>>>>>>>> XIVE routing sub-engine. It specifies on which Event Queue the event
>>>>>>>>>> data should be posted when an exception occurs (later on pulled by the
>>>>>>>>>> OS) and which Virtual Processor to notify.
>>>>>>>>>
>>>>>>>>> Uhhh.. I thought the IVT said which queue and vp to notify, and the
>>>>>>>>> EQD gave metadata for event queues.
>>>>>>>>
>>>>>>>> Yes, the above is poorly written. The Event Queue Descriptor contains the
>>>>>>>> guest address of the event queue in which the data is written. I will 
>>>>>>>> rephrase.      
>>>>>>>>
>>>>>>>> The IVT contains IVEs which indeed define for an IRQ which EQ to notify 
>>>>>>>> and what data to push on the queue. 
>>>>>>>>  
>>>>>>>>>> The Event Queue is a much
>>>>>>>>>> more complex structure but we start with a simple model for the sPAPR
>>>>>>>>>> machine.
>>>>>>>>>>
>>>>>>>>>> There is one XiveEQ per priority and these are stored under the XIVE
>>>>>>>>>> virtualization presenter (sPAPRXiveNVT). EQs are simply indexed with :
>>>>>>>>>>
>>>>>>>>>>        (server << 3) | (priority & 0x7)
>>>>>>>>>>
>>>>>>>>>> This is not in the XIVE architecture but as the EQ index is never
>>>>>>>>>> exposed to the guest, in the hcalls nor in the device tree, we are
>>>>>>>>>> free to use what fits best the current model.
>>>>>>>>
>>>>>>>> This EQ indexing is important to notice because it will also show up 
>>>>>>>> in KVM to build the IVE from the KVM irq state.
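A minimal C sketch of the sPAPR EQ indexing convention quoted above. The helper names are hypothetical, not taken from the patchset; they only show that a server number and a 3-bit priority can be packed into one index and recovered from it, for instance when rebuilding IVEs from saved state:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helpers for the sPAPR-only EQ index convention:
 * bits 0-2 hold the priority (0-7), the upper bits hold the server.
 * The layout is private to QEMU since the guest never sees it.
 */
static inline uint32_t spapr_eq_index(uint32_t server, uint8_t priority)
{
    assert(priority <= 7);
    return (server << 3) | (priority & 0x7);
}

static inline uint32_t spapr_eq_server(uint32_t eq_idx)
{
    return eq_idx >> 3;
}

static inline uint8_t spapr_eq_priority(uint32_t eq_idx)
{
    return (uint8_t)(eq_idx & 0x7);
}
```

For example, server 2 at priority 5 maps to EQ index 21, and decoding 21 gives back server 2 and priority 5.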
>>>>>>>
>>>>>>> Ok, are you saying that while this combined EQ index will never appear
>>>>>>> in guest <-> host interfaces, 
>>>>>>
>>>>>> Indeed.
>>>>>>
>>>>>>> it might show up in qemu <-> KVM interfaces?
>>>>>>
>>>>>> Not directly, but it is part of the IVE as the IVE_EQ_INDEX field. When
>>>>>> dumped, it has to be built in a way that is compatible with the emulated
>>>>>> mode in QEMU.
>>>>>
>>>>> Hrm.  But is the exact IVE contents visible to qemu (for a PAPR
>>>>> guest)?  
>>>>
>>>> The guest only uses hcalls whose arguments are:
>>>>
>>>> 	- cpu numbers,
>>>> 	- priority numbers from defined ranges,
>>>> 	- logical interrupt numbers,
>>>> 	- the physical address of the EQ.
>>>>
>>>> The parts of the IVE visible to the guest are the 'priority', the 'cpu',
>>>> and the 'eisn', which is the effective IRQ number the guest assigns
>>>> to the source. The 'eisn' will be pushed in the EQ.
>>>
>>> Ok.
>>>
>>>> The IVE EQ index is not visible.
>>>
>>> Good.
>>>
>>>>> I would have thought the qemu <-> KVM interfaces would have
>>>>> abstracted this the same way the guest <-> KVM interfaces do.
>>>>> Or is there a reason not to?
>>>>
>>>> It is practical to dump 64-bit IVEs directly from KVM into the QEMU
>>>> internal structures because it fits the emulated mode without doing 
>>>> any translation ... This might be seen as a shortcut. You will tell 
>>>> me when you reach the KVM part.   
>>>
>>> Ugh.. exposing to qemu the raw IVEs sounds like a bad idea to me.
>>
>> You definitely need to in QEMU in emulation mode. The whole routing 
>> relies on it. 
> 
> I'm not exactly sure what you mean by "emulation mode" here.  Above,
> I'm talking specifically about a KVM HV, PAPR guest.

ah ok. I understand. 

KVM does not manipulate raw IVEs. Only OPAL manipulates the raw 
XIVE structures. But as the emulation mode under QEMU needs to 
also manipulate these structures, it seemed practical to use raw 
XIVE structures to transfer the state from KVM to QEMU. 

But it might not be such a great idea. I suppose we should define
a QEMU/KVM format for the exchanges with KVM and then, inside QEMU,
translate from that QEMU/KVM format to the XIVE format, the XIVE
format being the one used for migration.
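A possible shape for such a translation layer, sketched in C under assumptions: KvmSourceState and SketchIve are hypothetical names, not existing QEMU or KVM structures. The exchange format carries only what the guest configured through hcalls, and the raw EQ index is derived on the QEMU side using the (server << 3) | priority convention from this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-source record exchanged with KVM: only what the
 * guest configured through hcalls, never a raw EQ index. */
typedef struct {
    uint32_t server;
    uint8_t  priority;
    uint32_t eisn;
    bool     masked;
} KvmSourceState;

/* Hypothetical QEMU-side IVE, i.e. the format that would be migrated. */
typedef struct {
    bool     valid;
    bool     masked;
    uint32_t eq_index;   /* (server << 3) | priority, sPAPR convention */
    uint32_t eisn;
} SketchIve;

static SketchIve kvm_to_ive(const KvmSourceState *s)
{
    assert(s->priority <= 7);
    SketchIve ive = {
        .valid    = true,
        .masked   = s->masked,
        .eq_index = (s->server << 3) | s->priority,
        .eisn     = s->eisn,
    };
    return ive;
}

static KvmSourceState ive_to_kvm(const SketchIve *ive)
{
    KvmSourceState s = {
        .server   = ive->eq_index >> 3,
        .priority = (uint8_t)(ive->eq_index & 0x7),
        .eisn     = ive->eisn,
        .masked   = ive->masked,
    };
    return s;
}
```

With this split, saving state would go KVM -> kvm_to_ive() -> migrated IVE, and restoring would use ive_to_kvm() on the destination, so the raw index never crosses the QEMU/KVM boundary.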


>>> When we migrate, we're going to have to assign the guest (server,
>>> priority) tuples to host EQ indices, and I think it makes more sense
>>> to do that in KVM and hide the raw indices from qemu than to have qemu
>>> mangle them explicitly on migration.
>>
>> We will need some mangling mechanism for the KVM ioctls saving and
>> restoring state. This is very similar to XICS. 
>>  
>>>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>>>
>>>>>>>>> Is the EQD actually modifiable by a guest?  Or are the settings of the
>>>>>>>>> EQs fixed by PAPR?
>>>>>>>>
>>>>>>>> The guest uses the H_INT_SET_QUEUE_CONFIG hcall to define the address
>>>>>>>> of the event queue for a (priority, server) pair.
>>>>>>>
>>>>>>> Ok, so the EQD can be modified by the guest.  In which case we need to
>>>>>>> work out what object owns it, since it'll need to migrate it.
>>>>>>
>>>>>> Indeed. The EQDs are CPU related, as there is one EQD per (cpu,
>>>>>> priority) pair. The KVM patchset dumps/restores the eight XiveEQ structs
>>>>>> using per-cpu ioctls. The EQ in the OS RAM is marked dirty at that
>>>>>> stage.
>>>>>
>>>>> To make sure I'm clear: for PAPR there's a strict relationship between
>>>>> EQD and CPU (one EQD for each (cpu, priority) tuple).  
>>>>
>>>> Yes.
>>>>
>>>>> But for powernv that's not the case, right?  
>>>>
>>>> It is.
>>>
>>> Uh.. I don't think either of us phrased that well, I'm still not sure
>>> which way you're answering that.
>>
>> there's a strict relationship between EQD and CPU (one EQD for each (cpu, priority) tuple) in spapr and in powernv.
> 
> For powernv that seems to be contradicted by what you say below.

ok. I see what you mean. There is a difference for the hypervisor when
guests are running. As QEMU PowerNV does not support guests (yet),
we can start the model with a strict relationship between EQD and
CPU.

But it's not the case when guests are running, because the EQD refers
to an NVT/VP which can be a virtual processor or a group of them.

The current model takes a shortcut: the CPU list should be scanned
to find matching CAM lines (W2 in the TIMA). I need to take a closer
look for powernv, even if it is not strictly needed for a model
without guests.

> AFAICT there might be a strict association at the host kernel or even
> the OPAL level, but not at the hardware level.
> 
>>>>> AIUI the mapping of EQs to cpus was configurable, is that right?
>>>>
>>>> Each cpu has 8 EQD. Same for virtual cpus.
>>>
>>> Hmm.. but is that 8 EQD per cpu something built into the hardware, or
>>> just a convention of how the host kernel and OPAL operate?
>>
>> It's not built into the HW; the HW uses it to route the notification.
>> The EQD contains the EQ characteristics :
>>
>> * functional bits :
>>   - valid bit
>>   - enqueue bit, whether to update the OS EQ in RAM or not
>>   - unconditional notification
>>   - backlog
>>   - escalation
>>   - ...
>> * OS EQ fields 
>>   - physical address
>>   - entry index
>>   - toggle bit
>> * NVT fields
>>   - block/chip
>>   - index
>> * etc.
>>
>> It's a big structure : 8 words.
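To make the OS EQ fields above concrete, here is a C sketch of the enqueue step. The structure is a hypothetical decoded view (the real EQD packs these fields into the 8 words, with bit positions not modelled here), the queue is a host array standing in for guest RAM, and byte ordering is ignored. It assumes each entry carries the toggle (generation) bit in bit 31 and the event data in the low 31 bits, with the toggle flipping every time the entry index wraps:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded EQD: a subset of the fields listed above. */
typedef struct {
    bool      valid;
    bool      enqueue;    /* write event data into the OS EQ */
    uint32_t *qpage;      /* stands in for the EQ guest address */
    uint32_t  qentries;   /* number of 4-byte entries, power of two */
    uint32_t  qindex;     /* entry index: next slot to write */
    uint32_t  qgen;       /* toggle (generation) bit, 0 or 1 */
    uint8_t   nvt_block;  /* NVT fields: the VP to notify */
    uint32_t  nvt_index;
} SketchEqd;

/* Write one entry (generation in bit 31, data in bits 0-30), advance
 * the index and flip the generation bit each time the queue wraps. */
static void sketch_eq_push(SketchEqd *eqd, uint32_t data)
{
    if (!eqd->valid || !eqd->enqueue) {
        return;
    }
    assert(eqd->qentries && !(eqd->qentries & (eqd->qentries - 1)));
    eqd->qpage[eqd->qindex] = (eqd->qgen << 31) | (data & 0x7fffffff);
    eqd->qindex = (eqd->qindex + 1) & (eqd->qentries - 1);
    if (eqd->qindex == 0) {
        eqd->qgen ^= 1;
    }
}
```

Because the reader tracks which generation it expects, it can tell freshly written entries from stale ones without anyone clearing the queue.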
> 
> Ok.  So yeah, the cpu association of the EQ is there in the NVT
> fields, not baked into the hardware.

yes.

C. 

>> The EQD table is allocated by OPAL/skiboot and fed to the HW for
>> its use. The powernv OS uses OPAL calls to configure the EQD for its
>> needs:
>>
>> int64_t opal_xive_set_queue_info(uint64_t vp, uint32_t prio,
>> 				 uint64_t qpage,
>> 				 uint64_t qsize,
>> 				 uint64_t qflags);
>>
>>
>> sPAPR uses an hcall :
>>
>> static long plpar_int_set_queue_config(unsigned long flags,
>> 				       unsigned long target,
>> 				       unsigned long priority,
>> 				       unsigned long qpage,
>> 				       unsigned long qsize)
>>
>>
>> but it is translated into an OPAL call in KVM.
>>
>> C.
>>
>>  
>>>  
>>>>
>>>> I am not sure what you understood before? It is surely something
>>>> I wrote; my XIVE understanding is still making progress.
>>>>
>>>>
>>>> C.
>>>>
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface
  2018-05-03  5:13                   ` David Gibson
@ 2018-05-23 10:12                     ` Cédric Le Goater
  0 siblings, 0 replies; 100+ messages in thread
From: Cédric Le Goater @ 2018-05-23 10:12 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 05/03/2018 07:13 AM, David Gibson wrote:
> On Wed, May 02, 2018 at 05:28:23PM +0200, Cédric Le Goater wrote:
>> On 04/27/2018 08:32 AM, David Gibson wrote:
>>> On Thu, Apr 26, 2018 at 12:30:42PM +0200, Cédric Le Goater wrote:
>>>> On 04/26/2018 05:54 AM, David Gibson wrote:
>>>>> On Tue, Apr 24, 2018 at 11:33:11AM +0200, Cédric Le Goater wrote:
>>>>>> On 04/24/2018 08:46 AM, David Gibson wrote:
>>>>>>> On Mon, Apr 23, 2018 at 09:58:43AM +0200, Cédric Le Goater wrote:
>>>>>>>> On 04/23/2018 08:46 AM, David Gibson wrote:
>>>>>>>>> On Thu, Apr 19, 2018 at 02:42:59PM +0200, Cédric Le Goater wrote:
>>>>>>>>>> The XiveFabric offers a simple interface, between the XiveSource
>>>>>>>>>> object and the device model owning the interrupt sources, to forward
>>>>>>>>>> an event notification to the XIVE interrupt controller of the machine
>>>>>>>>>> and, if the owner is the controller, to call the routing sub-engine
>>>>>>>>>> directly.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>>>> ---
>>>>>>>>>>  hw/intc/xive.c        | 37 ++++++++++++++++++++++++++++++++++++-
>>>>>>>>>>  include/hw/ppc/xive.h | 25 +++++++++++++++++++++++++
>>>>>>>>>>  2 files changed, 61 insertions(+), 1 deletion(-)
>>>>>>>>>>
>>>>>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>>>>>> index 060976077dd7..b4c3d06c1219 100644
>>>>>>>>>> --- a/hw/intc/xive.c
>>>>>>>>>> +++ b/hw/intc/xive.c
>>>>>>>>>> @@ -17,6 +17,21 @@
>>>>>>>>>>  #include "hw/ppc/xive.h"
>>>>>>>>>>  
>>>>>>>>>>  /*
>>>>>>>>>> + * XIVE Fabric
>>>>>>>>>> + */
>>>>>>>>>> +
>>>>>>>>>> +static void xive_fabric_route(XiveFabric *xf, int lisn)
>>>>>>>>>> +{
>>>>>>>>>> +
>>>>>>>>>> +}
>>>>>>>>>> +
>>>>>>>>>> +static const TypeInfo xive_fabric_info = {
>>>>>>>>>> +    .name = TYPE_XIVE_FABRIC,
>>>>>>>>>> +    .parent = TYPE_INTERFACE,
>>>>>>>>>> +    .class_size = sizeof(XiveFabricClass),
>>>>>>>>>> +};
>>>>>>>>>> +
>>>>>>>>>> +/*
>>>>>>>>>>   * XIVE Interrupt Source
>>>>>>>>>>   */
>>>>>>>>>>  
>>>>>>>>>> @@ -97,11 +112,19 @@ static bool xive_source_pq_trigger(XiveSource *xsrc, uint32_t srcno)
>>>>>>>>>>  
>>>>>>>>>>  /*
>>>>>>>>>>   * Forward the source event notification to the associated XiveFabric,
>>>>>>>>>> - * the device owning the sources.
>>>>>>>>>> + * the device owning the sources, or perform the routing if the device
>>>>>>>>>> + * is the interrupt controller.
>>>>>>>>>>   */
>>>>>>>>>>  static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>>>>>>>>  {
>>>>>>>>>>  
>>>>>>>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
>>>>>>>>>> +
>>>>>>>>>> +    if (xfc->notify) {
>>>>>>>>>> +        xfc->notify(xsrc->xive, srcno + xsrc->offset);
>>>>>>>>>> +    } else {
>>>>>>>>>> +        xive_fabric_route(xsrc->xive, srcno + xsrc->offset);
>>>>>>>>>> +    }
>>>>>>>>>
>>>>>>>>> Why 2 cases?  Can't the XiveFabric object just make its notify equal
>>>>>>>>> to xive_fabric_route if that's what it wants?
>>>>>>>> Under sPAPR, all the sources, IPIs and virtual device interrupts, 
>>>>>>>> generate events which are directly routed by xive_fabric_route(). 
>>>>>>>> There is no need for an extra hop, indeed.
>>>>>>>
>>>>>>> Ok.
>>>>>>>
>>>>>>>> Under PowerNV, some sources forward the notification to the routing 
>>>>>>>> engine using a specific MMIO load on a notify address which is stored 
>>>>>>>> in one of the controller registers. So we need a hop to reach the 
>>>>>>>> device model, owning the sources, and do that load :
>>>>>>>
>>>>>>> Hm.  So you're saying that in pnv some sources send their notification
>>>>>>> to some other unit, 
>>>>>>
>>>>>> Not to any unit/device, to the device owning the sources.
>>>>>>
>>>>>> For the XiveSource object under PSI, the XIVEFabric interface is the
>>>>>> PSI device object itself, which knows how to forward the notification
>>>>>> on the XIVE Power "bus". To be more precise, the PSI HB device has
>>>>>> 14 interrupt sources, whose notifications are forwarded using an MMIO
>>>>>> load to some address. The load address is configured (by skiboot) in
>>>>>> one of the PSI device registers, and points to an MMIO region of the
>>>>>> main XIVE interrupt controller.
>>>>>>
>>>>>> The PHB4 sources should be the same.
>>>>>>
>>>>>> For the XiveSource object (all interrupts) under sPAPRXive, the 
>>>>>> XIVEFabric is the main interrupt controller sPAPRXive.
>>>>>>
>>>>>> For the XiveSource object (IPIs) under PnvXive, the XIVEFabric is 
>>>>>> also the main interrupt controller PnvXive.
>>>>>
>>>>> Hrm.  Apparently I'm missing something, I'm really not getting what
>>>>> you're trying to explain here.
>>>>
>>>> I see that. Let's try again.
>>>>
>>>>>>> that would then (after possible masking) forward on to the overall
>>>>>>> xive fabric?
>>>>>>
>>>>>> yes. Maybe XIVEFabric is a confusing name. What about XIVEForwarder?
>>>>>
>>>>> Maybe..?
>>>>>
>>>>>>> That seems like a property of the source object, 
>>>>>>
>>>>>> The source object is generic. It's a bunch of PQ bits that can be 
>>>>>> controlled by MMIOs. Nothing more.
>>>>>
>>>>> Hmm.  Isn't the source object also responsible for forwarding the
>>>>> interrupt to something up the chain (whatever that is)?
>>>>
>>>> Yes, but it cannot forward directly. The XiveSource is generic and
>>>> can only call a handler:
>>>>
>>>> 	xfc->notify(xsrc->xive, srcno + xsrc->offset);
>>>
>>> But.. your patch doesn't do that always, it's conditional which I
>>> still don't understand.
>>
>> Because at the end of the notify/forward chain, you route.
> 
> Hrm.  I'm really not understanding this notify/forward thing as
> distinct from routing.  I mean, from the description here it sounds
> kind of like cascaded interrupt controllers, which qemu already has
> mechanisms to handle, but I didn't think (most) POWER9 devices worked
> like that.

they don't. 

The concept I have in mind for the XIVEFabric QOM interface is
something close to the XIVE logic unit in HW, which links the
device interrupt sources and the XIVE interrupt controller of the
chip to the PowerBus. The interrupt controller also has a special
CQ (common queue) unit acting as a proxy for the PowerBus for the
virtualization/router and presenter units.

It's a convenient model for QEMU because the XiveSource model only
has to call an interface handler when a notification event is let
through. It's like doing an MMIO to send the ISN of the source that
just triggered on the PowerBus.

>>>> The device model owner, the parent of the XiveSource object, would 
>>>> do the real forward.
>>>
>>> Why?  
>>
>> because, in my idea of the XiveSource concept, it does not have 
>> the logic to do so: the register for the MMIO address to use, 
>> another one for the IVT offset, etc
>>
>>> I mean the XiveSource basically represents the xive irq related
>>> logic of the PHB or whatever, why would it not represent *all* of
>>> that, rather than just the ESB bits, meaning the owner has to have
>>> some more xive logic for the forwarding.
>>
>> ok. This is where we diverge in the concept. 
>>
>> The PQ bits, the ESB MMIO region handlers can be easily shared 
>> between the different device models. They are the common part
>> of devices with XIVE interrupt sources.
> 
> Sure, but QOM subclasses are a thing, which sounds like a better way
> of doing this than having to have extra XIVE logic in all the parent
> devices.

I agree it can also be done with subclasses, but QOM interfaces are
quick to define and stateless.

On sPAPR, the extra XIVE logic would only be the sPAPRXive model 
acting as the machine interrupt controller model. It owns all the 
guest interrupt sources.

In PowerNV, we would have the PnvXive interrupt controller model
for the chip, which also owns the IPI sources, and the PSIHB model 
and the PHB4 model. 

This is very much like XICS. Each relevant device has one or more
ICSState objects. ICSState has a pointer to the XICSFabric.


>>> Note that I don't think the fact that some sources notify via mmio and
>>> some are internal really matters.  It's not like we're modelling the
>>> power bus down to the wire-transaction level.
>>
>> yes, but the configuration of the devices is different. pnv devices
>> will have registers accessible through MMIO or XSCOM and configured
>> by the firmware. spapr is all set up by QEMU.
> 
> Ah.. ok.  Actually modelling the mmio forwards probably makes sense
> then.  I still think it makes more sense in a pnv XiveSource subclass
> rather than putting it in the containing device.

Hmm, we will then need to define classes for the sPAPRXiveSource, 
PnvXiveSource, PSIXiveSource, which works fine for sure. But with 
a QOM interface, we could use directly the XiveSource class. 

I will take a look. 

>>>> It's very similar to what we have today with XICS :
>>>>
>>>> 	- The sPAPR model has an ICSState  
>>>> 	- The PnvPSI model has an ICSState 
>>>> 	- The PnvPHB3 model has two ICSStates
>>>>
>>>> and the 'xics' pointer in ICSState points to the 'interrupt unit' of
>>>> the machine to do resends and to grab ICPs. So it is used essentially
>>>> for routing.
>>>
>>> Hmm.  I think you and I are looking at XICSFabric kind of
>>> differently.  As I see it, it's not really an active component at
>>> all.  Rather it's basically a global "map" of the xics components so
>>> that they can find each other.
>>
>> ok. I am not that far either. 
>>  
>>>> in Xive 
>>>>
>>>> 	- sPAPRXive model has a XiveSource
>>>> 	- PnvXive model has a XiveSource
>>>> 	- PnvPSI model has a XiveSource
>>>> 	- PnvPHB4 model should have also.
>>>>
>>>> and the 'xive' pointer in XiveSource points to the parent object,
>>>
>>> Uh.. yeah.. the xics pointer in ICS units doesn't point to the parent
>>> object, except maybe by accident.  It's absolutely intended to be
>>> global, and so points to the machine.
>>
>> yes. I agree. 
>>
>> XIVE has more layers and visible components due to the internal 
>> tables used for routing.
> 
> Right.
> 
>>>> which will handle the event notification forwarding or routing.
>>>
>>> Ok, how about this for a partial model.  We have:
>>>
>>> XiveSource objects:
>>> 	* Owns an ESB table
>>> 	* Knows the mapping of its local irq offsets to global irq
>>> 	  numbers
>>
>> That is the 'offset' attribute I suppose. This is set at runtime 
>> for the powernv devices.
> 
> Ok, so that offset is effectively a register (which will have to be
> migrated), rather than a device property.

powernv migration is not on my radar yet :)

>> For pseries, we should need it for 
>> passthrough. I think. I haven't looked at that part yet.
> 
> Uh.. I really hope not.  AIUI the offsets are decided by the platform
> rather than the guest in this case, yes?  

A mapping is done in the guest. I still need to check how the ESB 
pages are populated in the guest ESB MMIO region. We should be fine 
I think.

> In which case if we can't
> count on a fixed offset even in passthrough mode, then migration is
> basically impossible.
>
>>> 	* Provides the mmio interface for ESB manipulation
>>> 	* When necessary, notifies a new interrupt to a XiveRouter
>>
>> ok. I think that what we have today fits the idea then. 
> 
> Yes, I think so too - just clarifying in the context of the rest of
> this proposal.
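A C sketch of the XiveSource responsibilities listed above, under stated assumptions: all names are hypothetical, the source has 16 interrupts, plain functions stand in for the ESB MMIO handlers, and the ESB is reduced to the 2-bit PQ state machine (P in bit 1, Q in bit 0). A trigger in the reset state sets P and forwards the event with its global number (local number plus offset); further triggers only set Q, and an EOI with Q set replays one event:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESB PQ states: P is bit 1, Q is bit 0. */
enum { ESB_RESET = 0x0, ESB_OFF = 0x1, ESB_PENDING = 0x2, ESB_QUEUED = 0x3 };

/* Hypothetical router stand-in that only records notifications. */
typedef struct {
    uint32_t last_irq;
    unsigned count;
} SketchRouter;

static void sketch_router_notify(SketchRouter *xrtr, uint32_t global_irq)
{
    xrtr->last_irq = global_irq;
    xrtr->count++;
}

/* Hypothetical source: one PQ pair per interrupt, plus the global
 * number of its first interrupt. */
typedef struct {
    uint8_t       pq[16];
    uint32_t      offset;
    SketchRouter *router;
} SketchSource;

/* Returns true when the event must be forwarded. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce until the EOI */
        return false;
    default:                /* ESB_OFF: masked, event dropped */
        return false;
    }
}

/* Returns true when a queued event must be replayed. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq) {
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    case ESB_OFF:
        return false;
    default:                /* ESB_RESET or ESB_PENDING */
        *pq = ESB_RESET;
        return false;
    }
}

static void sketch_source_trigger(SketchSource *xsrc, uint32_t srcno)
{
    assert(srcno < sizeof(xsrc->pq));
    if (esb_trigger(&xsrc->pq[srcno])) {
        sketch_router_notify(xsrc->router, srcno + xsrc->offset);
    }
}

static void sketch_source_eoi(SketchSource *xsrc, uint32_t srcno)
{
    assert(srcno < sizeof(xsrc->pq));
    if (esb_eoi(&xsrc->pq[srcno])) {
        sketch_router_notify(xsrc->router, srcno + xsrc->offset);
    }
}
```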

yes. I think we should be adding a XiveRouter class like you propose 
below.
 
>>> XiveRouter objects:
>>> 	* Responsible for a fixed range of global irq numbers
>>> 	* Owns an IVT (but what that means can vary, see below)
>>
>> the size of the IVT table is determined at runtime for powernv.
> 
> That's fine - both the size and base of the IVT will be registers of
> the powernv variant of the device.
> 
>>> 	* When notified of an irq, routes it to the appropriate EQ
>>> 	  (haven't thought about this part yet)
>>
>> we will need a class handler to get EQs
> 
> Sure.
> 
>>> 	* Abstract class - needs subclasses to define how to get IVEs
>>
>> OK. We will need a few other ops. The router needs to :
>>
>> 	- get IVEs
>> 	- get EQ descriptors
>> 	- update OS EQs (write event data in OS RAM)
> 
> IIUC the actual EQ (as opposed to its descriptor) will be in guest RAM
> for both powernv and spapr,

yes. 

> so this can be common (utilizing a
> subclass hook to locate the EQ base address).

yes.
 
>> 	- update EQ descriptors (to set EQ descriptor index & toggle)
>> 	- get an NVT/VP (to notify CPUs)
> 
> Getting NVT/VP sounds more like it would belong in the XiveFabric than
> the router, but I haven't looked at this in detail yet.

The VP table is owned by the presenter unit. It is notified by the router
unit when an event is let through. The presenter then looks for the VP to
notify among the VPs dispatched on the HW threads. If none is found, 
the event is escalated. 

The router unit owns the IVE and EQ tables, but can still do some
remote accesses to the VPDT, to update the backlog and IBP I think.
I don't think we need to introduce a presenter model; accessors for
the VPs should be enough.
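The router operations listed in this exchange could be expressed as a small ops table, with each machine supplying its own table accessors. This is a hypothetical C sketch, not the patchset's API: the structures keep only the fields needed for the lookup chain IVE -> EQD -> event data push -> VP notification, and the table-backed subclass exists only to make the example runnable:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, decoded table entries. */
typedef struct {
    bool     valid, masked;
    uint8_t  eq_block;
    uint32_t eq_index;
    uint32_t eisn;        /* data pushed in the OS EQ */
} SketchIve;

typedef struct {
    bool     valid;
    uint8_t  vp_block;
    uint32_t vp_index;
    uint8_t  priority;
} SketchEqd;

typedef struct SketchRouter SketchRouter;

/* Accessors a router subclass would provide (sPAPR: tables held in
 * QEMU; PowerNV: tables in RAM located through base registers). */
typedef struct {
    SketchIve *(*get_ive)(SketchRouter *xrtr, uint32_t lisn);
    SketchEqd *(*get_eqd)(SketchRouter *xrtr, uint8_t blk, uint32_t idx);
    void (*eq_push)(SketchRouter *xrtr, SketchEqd *eqd, uint32_t data);
    bool (*notify_vp)(SketchRouter *xrtr, uint8_t blk, uint32_t idx,
                      uint8_t priority);
} SketchRouterOps;

struct SketchRouter {
    const SketchRouterOps *ops;
};

/* Returns false when the event is dropped or would need escalation. */
static bool sketch_router_route(SketchRouter *xrtr, uint32_t lisn)
{
    SketchIve *ive = xrtr->ops->get_ive(xrtr, lisn);
    if (!ive || !ive->valid || ive->masked) {
        return false;
    }
    SketchEqd *eqd = xrtr->ops->get_eqd(xrtr, ive->eq_block, ive->eq_index);
    if (!eqd || !eqd->valid) {
        return false;
    }
    xrtr->ops->eq_push(xrtr, eqd, ive->eisn);
    return xrtr->ops->notify_vp(xrtr, eqd->vp_block, eqd->vp_index,
                                eqd->priority);
}

/* Minimal table-backed subclass, only to make the sketch runnable. */
typedef struct {
    SketchRouter parent;  /* must stay first */
    SketchIve    ive[8];
    SketchEqd    eqd[8];
    uint32_t     last_data;
    uint32_t     last_vp;
} TableRouter;

static SketchIve *table_get_ive(SketchRouter *xrtr, uint32_t lisn)
{
    TableRouter *t = (TableRouter *)xrtr;
    return lisn < 8 ? &t->ive[lisn] : NULL;
}

static SketchEqd *table_get_eqd(SketchRouter *xrtr, uint8_t blk, uint32_t idx)
{
    TableRouter *t = (TableRouter *)xrtr;
    return (blk == 0 && idx < 8) ? &t->eqd[idx] : NULL;
}

static void table_eq_push(SketchRouter *xrtr, SketchEqd *eqd, uint32_t data)
{
    (void)eqd;
    ((TableRouter *)xrtr)->last_data = data;
}

static bool table_notify_vp(SketchRouter *xrtr, uint8_t blk, uint32_t idx,
                            uint8_t priority)
{
    (void)blk; (void)priority;
    ((TableRouter *)xrtr)->last_vp = idx;
    return true;
}

static const SketchRouterOps table_ops = {
    .get_ive = table_get_ive, .get_eqd = table_get_eqd,
    .eq_push = table_eq_push, .notify_vp = table_notify_vp,
};
```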

>> For powernv, we should also consider updating the NVT/VP. It can be 
>> done later. 
>>
>>> XiveFabric interface:
>>> 	* Lets XIVE components locate each other
>>
>> hmm, it should be a chain : 
>>
>> 	source -> router -> presenter -> cpu
>>
>> So the components should not have to locate each other. The presenter
>> does not know about the source for instance. Only the sources need
>> to forward events to the main controller logic doing the routing.
> 
> source -> router
> 
> From the above sounds like we can maybe make this just a router
> property in the source, rather than a lookup through a fabric.

If we use a direct link from the source to the router, how do we handle
the powernv case, which does event notification with an MMIO load?
I think a call through an interface (the XiveFabric) is better.

> router -> presenter
> 
> Here we might still need a "fabric".  By definition the router can
> direct to a bunch of different presenters, so it needs some kind of
> map to find them, no?

yes. The vp_block identifies the chip. There are different possible 
scenarios for the block configuration on a system but OPAL uses a 
simple one : one chip <-> one block. We can use that to start with 
on PowerNV.

And on pseries, just use block 0 for all; we don't really care.
The concept of block does not appear in the spapr specs.

As for the presenter itself, it loops over the CPUs to check the CAM
line register of the thread interrupt management context to see if 
one matches vp_block+vp_index. This is one routine. I don't think 
we need a model for that.
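The routine described above could look roughly like this C sketch. It is hypothetical: the thread context is reduced to a CAM line (the VP dispatched on the thread) and an interrupt pending buffer, assuming the encoding where priority p sets bit (0x80 >> p). A NULL match means no thread has the VP dispatched, the case where the event would be escalated:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread interrupt context: CAM line plus pending bits. */
typedef struct {
    bool     cam_valid;   /* a VP is dispatched on this thread */
    uint8_t  cam_block;
    uint32_t cam_index;
    uint8_t  ipb;         /* assumed: priority p sets bit (0x80 >> p) */
} SketchTctx;

/* Scan all hardware threads for the one running the target VP. */
static SketchTctx *sketch_presenter_match(SketchTctx *tctx, size_t nr,
                                          uint8_t vp_block, uint32_t vp_index)
{
    for (size_t i = 0; i < nr; i++) {
        if (tctx[i].cam_valid && tctx[i].cam_block == vp_block &&
            tctx[i].cam_index == vp_index) {
            return &tctx[i];
        }
    }
    return NULL;
}

/* Returns false when no thread matches: the event would be escalated. */
static bool sketch_presenter_notify(SketchTctx *tctx, size_t nr,
                                    uint8_t vp_block, uint32_t vp_index,
                                    uint8_t priority)
{
    assert(priority <= 7);
    SketchTctx *match = sketch_presenter_match(tctx, nr, vp_block, vp_index);
    if (!match) {
        return false;
    }
    match->ipb |= (uint8_t)(0x80 >> priority);
    return true;
}
```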

> presenter -> cpu
> 
> This one's trivial; just the cpu->intc pointer or similar.

yes. We store the thread interrupt management context under the cpu;
it is a set of registers that the OS accesses through the TIMA, of
which there is one per machine.

>>> 	* get_router() method: maps a global irq number to XiveRouter
>>> 	  object
>>
>> Ah. you are thinking about the multichip case under powernv. I need
>> to look at that more closely. But XIVE has a concept of block which 
>> is used by skiboot to map a chip to a block and the XIVE tables have 
>> a block field.
> 
> Hmm.. I think we need to understand this to make a real model.  I'd
> thought the block number would just be the chip number somewhere in
> the high bits of the global irq, but sounds like there might be yet
> another indirection here.

We have a one-to-one mapping today. I doubt it will change. We can
start with that.

>>> 	* Always global (implemented on the machine)
>>
>> OK. this is more or less the object modeling the main interrupt 
>> controller ? What I called sPAPRXive in the current patchset.
> 
> No.  This is *always* global - assuming we need it at all - even on
> multichip powernv.
>
>>> On pseries we have:
>>>
>>> 	? XiveSource objects.  We probably only need one, but 1 for
>>> 	LSI and one for MSI might be convenient.  
>>
>> It's not too ugly for the moment. If we create a source object 
>> for LSIs only that might be more complex for passthrough devices 
>> and their associated ESB MMIO region.
>>
>> side note :
>>
>> LSIs work under TCG, and used to work under KVM until we removed
>> the StoreEOI support. Since the EOI is now different, we need 
>> to find a way to handle EOI for guest virtual LSIs which are 
>> not LSIs for the host.
> 
> I can't quite picture that case - can you give a concrete example?

A guest won't be able to use a rtl8139 nic under KVM.

As it is the guest that does the EOI, it should use the appropriate 
sequence depending on the interrupt type, LSI vs. MSI. The store EOI
was used for both types until we removed the support because of ordering
issues. So now, the guest uses a specific EOI sequence for LSI, but
the hypervisor implements virtual LSIs on top of IPI interrupts, which
are MSIs ... we are doomed. We need to fix the guest or to find a 
way to reroute such interrupts to QEMU.  


> 
>> We can still change the Linux spapr 
>> backend or have QEMU handle the EOI for virtual LSIs. This is 
>> on my TODO list.
>>
>>>       More wouldn't break the model
>>
>> It should not. Initially the patchset had two source objects:
>> one for IPIs and one for the virtual devices interrupts. 
>>
>>> 	1 sPAPRXiveRouter.  This is a subclass of XiveRouter that
>>> 	holds the IVT internally (and migrates it).
>>
>> OK.
>>
>>> 	the XiveFabric implementation always returns the single global
>>> 	router for get_router()
>>
>> we can do that. 
> 
> Again, if it's true that each source object always forwards to just
> one router, then we don't need a get_router(), we can just use a link
> property.

that is the goal of the architecture, yes. powernv sources use MMIOs 
to forward events to the router and source event notifications are 
usually forwarded to the local chip. 

There are some cases in which the IVE lookups are remote. We can
use the block to scan the system chips for that; we don't have to
implement the whole protocol.

> 
>> Do we gather all XIVE components objects under an object 'sPAPRXive' 
>> modeling that way the main interrupt controller of the machine ? 
> 
> No, I don't think so.

Hmm,

sPAPRXive would inherit from the XiveRouter and define the IVT
and the EQDT. It is also a practical place to set up the TIMA memory
region, which is unique per machine. KVM support will need a custom
version of this model.

>> It should not have any state but it will hold the TIMA region 
>> most certainly, and the addresses where to map the ESB and 
>> TIMA regions.
> 
> Uh.. I'd expect the TIMAs to be held by the NVT objects, which we
> haven't covered yet.

There is a set of registers per CPU (thread interrupt management context),
but there is one TIMA (thread interrupt management area) memory region
for the whole machine.
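A rough C sketch of that split: one machine-wide handler, with the accessing CPU selecting the per-CPU context. Everything here is hypothetical. The explicit cpu_index argument stands in for however QEMU identifies the accessing vCPU (the current_cpu question raised earlier in this thread), the real TIMA register layout is not modelled, and the acknowledge is reduced to accepting the most favored pending priority (the lowest number) that is more favored than the CPPR, with 0xff meaning all priorities are accepted:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-CPU thread interrupt management context. */
typedef struct {
    uint8_t cppr;   /* current processor priority, 0xff accepts all */
    uint8_t ipb;    /* assumed: priority p pending sets bit (0x80 >> p) */
} SketchTctx;

/* One TIMA for the whole machine, one context per CPU behind it. */
typedef struct {
    SketchTctx *tctx;
    size_t      nr_cpus;
} SketchTima;

/* Acknowledge on behalf of the accessing CPU: returns the accepted
 * priority, or -1 when nothing is more favored than the CPPR. */
static int sketch_tima_ack(SketchTima *tima, size_t cpu_index)
{
    assert(cpu_index < tima->nr_cpus);
    SketchTctx *t = &tima->tctx[cpu_index];

    for (int prio = 0; prio < 8 && prio < t->cppr; prio++) {
        uint8_t bit = (uint8_t)(0x80 >> prio);
        if (t->ipb & bit) {
            t->ipb &= (uint8_t)~bit;
            t->cppr = (uint8_t)prio;
            return prio;
        }
    }
    return -1;
}
```

The memory region stays single, while every access is resolved against the context of the CPU that issued it.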
 
We should change the NVT name, it is confusing. NVT is an old name for the 
backing store struct in RAM, which was replaced by VP. 

XiveTCTX (Xive Thread Context) is better IMO and it reflects the names
of the IC registers.
 
>> KVM support will bring extra needs.
>>  
>>> On powernv we have:
>>> 	N XiveSource objects.  Some in PHBs, some extra ones on each
>>> 	chip
>>
>> yes and PSI to start with.  
>>
>>> 	(#chips) PowerXiveRouter objects.  This subclass of XiveRouter
>>> 	stores the register giving the IVT base address and migrates
>>> 	that, but the IVT contents are in RAM
>>>
>>> 	the XiveFabric get_router() implementation returns the right
>>> 	chip's router based on the irq number
>>>
>>> Obviously the router->EQ sides still needs a bunch of thought.
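A C sketch of a powernv-style get_router(), under assumptions stated here and in the code: the block number sits in the top bits of the global irq number at a hypothetical shift of 24 (the real layout is not settled in this thread), blocks map one-to-one to chips as OPAL does today, and each chip router only accepts indexes below its runtime-configured IVT size. On pseries the same lookup would degenerate to a single router with block 0:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_BLK_SHIFT 24   /* hypothetical position of the block */

/* Hypothetical per-chip router: its block and the IVT size that
 * firmware programmed at runtime. */
typedef struct {
    uint8_t  block;
    uint32_t ivt_entries;
} SketchChipRouter;

typedef struct {
    SketchChipRouter *chips;
    size_t            nr_chips;
} SketchMachine;

/* Map a global irq number to the router owning it, or NULL. */
static SketchChipRouter *sketch_get_router(SketchMachine *m, uint32_t girq)
{
    assert(m != NULL);
    uint8_t  blk = (uint8_t)(girq >> SKETCH_BLK_SHIFT);
    uint32_t idx = girq & ((1u << SKETCH_BLK_SHIFT) - 1);

    for (size_t i = 0; i < m->nr_chips; i++) {
        if (m->chips[i].block == blk) {
            return idx < m->chips[i].ivt_entries ? &m->chips[i] : NULL;
        }
    }
    return NULL;
}
```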
>>
>> These are all routing tables : 
>>
>> 	- IVT
>> 	- EQDT
>> 	- VPDT  
> 
> Thinking about what you've said, I'm thinking maybe what we need on
> the source side is two versions which don't exactly correspond to
> spapr vs powernv:
> 
> InBandXiveSource: this one's notify behaviour is to issue an mmio read
> to a configured address.  It's expected that mmio address belongs to a
> XiveRouter, but the source itself doesn't know anything about routers.
> 
> OutOfBandXiveSource: this one's notify behaviour is to explicitly poke
> a XiveRouter.  A link to the router and offset (which range of router
> irqs are used by this source) would be object properties.
> 
> powernv would use the first for "external" irq sources and the second
> for internal ones.  spapr would use the second for everything.
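The two notify behaviours proposed above could be sketched in C as two source variants sharing one notify hook, so that the source itself decides how to forward rather than its containing device. All names are hypothetical. The in-band variant only records the address and interrupt number it would send, since how the number is encoded in the bus transaction is not modelled, and the out-of-band variant's router pointer and offset stand in for QOM link and integer properties:

```c
#include <assert.h>
#include <stdint.h>

typedef struct SketchXiveSource SketchXiveSource;
struct SketchXiveSource {
    void (*notify)(SketchXiveSource *xsrc, uint32_t srcno);
};

/* Stand-ins for the targets of each variant. */
typedef struct { uint64_t last_addr; uint32_t last_isn; unsigned loads; } SketchBus;
typedef struct { uint32_t last_irq; unsigned count; } SketchRouter;

/* In-band: an MMIO load to a firmware-configured address; the source
 * knows nothing about what sits behind that address. */
typedef struct {
    SketchXiveSource parent;   /* must stay first */
    SketchBus *bus;
    uint64_t   notify_addr;    /* a device register on real hardware */
    uint32_t   offset;         /* also set at runtime by firmware */
} InBandSource;

static void inband_notify(SketchXiveSource *xsrc, uint32_t srcno)
{
    InBandSource *s = (InBandSource *)xsrc;
    assert(s->bus);
    s->bus->last_addr = s->notify_addr;
    s->bus->last_isn = s->offset + srcno;
    s->bus->loads++;
}

/* Out-of-band: poke the router directly. */
typedef struct {
    SketchXiveSource parent;   /* must stay first */
    SketchRouter *router;      /* would be a QOM link property */
    uint32_t      offset;      /* range of router irqs used by the source */
} OutOfBandSource;

static void outofband_notify(SketchXiveSource *xsrc, uint32_t srcno)
{
    OutOfBandSource *s = (OutOfBandSource *)xsrc;
    assert(s->router);
    s->router->last_irq = s->offset + srcno;
    s->router->count++;
}
```

Following the proposal, powernv "external" sources such as the PSI would be InBandSource instances, while powernv internal sources and all spapr sources would be OutOfBandSource instances.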
> 

OK. 

I think I will stick to my idea for the next version because there are 
quite a lot of changes in the models already : a new spapr irq backend,
a XiveRouter, an EQD table for sPAPR, a complete TIMA support for all
privileges, a renamed XiveTCTX, a simple Presenter scanning CAM lines, 
support for EQ ESBs (that no system uses), a pseries-2.13-xive machine,
etc.

I will share the powernv and PSIHB models to validate the common
XIVE model. We can change to the above if it is really not satisfying.

Thanks,

C.



Thread overview: 100+ messages
2018-04-19 12:42 [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 01/35] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
2018-04-20  7:10   ` David Gibson
2018-04-20  8:27     ` Cédric Le Goater
2018-04-23  3:59       ` David Gibson
2018-04-23  7:11         ` Cédric Le Goater
2018-04-24  1:24           ` David Gibson
2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 02/35] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
2018-04-23  6:44   ` David Gibson
2018-04-23  7:31     ` Cédric Le Goater
2018-04-24  6:41       ` David Gibson
2018-04-24  8:11         ` Cédric Le Goater
2018-04-26  3:28           ` David Gibson
2018-04-26 12:16             ` Cédric Le Goater
2018-04-27  2:43               ` David Gibson
2018-05-04 14:25                 ` Cédric Le Goater
2018-05-05  4:32                   ` David Gibson
2018-04-19 12:42 ` [Qemu-devel] [PATCH v3 03/35] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
2018-04-23  6:46   ` David Gibson
2018-04-23  7:58     ` Cédric Le Goater
2018-04-24  6:46       ` David Gibson
2018-04-24  9:33         ` Cédric Le Goater
2018-04-26  3:54           ` David Gibson
2018-04-26 10:30             ` Cédric Le Goater
2018-04-27  6:32               ` David Gibson
2018-05-02 15:28                 ` Cédric Le Goater
2018-05-03  5:13                   ` David Gibson
2018-05-23 10:12                     ` Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 04/35] spapr/xive: introduce a XIVE interrupt controller for sPAPR Cédric Le Goater
2018-04-24  6:51   ` David Gibson
2018-04-24  9:46     ` Cédric Le Goater
2018-04-26  4:20       ` David Gibson
2018-04-26 10:43         ` Cédric Le Goater
2018-05-03  5:22           ` David Gibson
2018-05-03 16:50             ` Cédric Le Goater
2018-05-04  3:33               ` David Gibson
2018-05-04 13:05                 ` Cédric Le Goater
2018-05-05  4:26                   ` David Gibson
2018-05-09  7:23                     ` Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 05/35] spapr/xive: add a single source block to the sPAPR XIVE model Cédric Le Goater
2018-04-24  6:58   ` David Gibson
2018-04-24  8:19     ` Cédric Le Goater
2018-04-26  4:46       ` David Gibson
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 06/35] spapr/xive: introduce a XIVE interrupt presenter model Cédric Le Goater
2018-04-26  7:11   ` David Gibson
2018-04-26  9:27     ` Cédric Le Goater
2018-04-26 17:15       ` Cédric Le Goater
2018-05-03  5:39         ` David Gibson
2018-05-03 15:10           ` Cédric Le Goater
2018-05-04  4:44             ` David Gibson
2018-05-04 14:15               ` Cédric Le Goater
2018-05-03  5:35       ` David Gibson
2018-05-03 16:06         ` Cédric Le Goater
2018-05-04  4:51           ` David Gibson
2018-05-04 13:11             ` Cédric Le Goater
2018-05-05  4:27               ` David Gibson
2018-05-09  7:27                 ` Cédric Le Goater
2018-05-02  7:39     ` Cédric Le Goater
2018-05-03  5:43       ` David Gibson
2018-05-03 14:42         ` Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 07/35] spapr/xive: introduce the XIVE Event Queues Cédric Le Goater
2018-04-26  7:25   ` David Gibson
2018-04-26  9:48     ` Cédric Le Goater
2018-05-03  5:45       ` David Gibson
2018-05-03  6:07         ` Cédric Le Goater
2018-05-03  6:25           ` David Gibson
2018-05-03 14:37             ` Cédric Le Goater
2018-05-04  5:19               ` David Gibson
2018-05-04 13:29                 ` Cédric Le Goater
2018-05-05  4:29                   ` David Gibson
2018-05-09  8:01                     ` Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 08/35] spapr: push the XIVE EQ data in OS event queue Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 09/35] spapr: notify the CPU when the XIVE interrupt priority is more privileged Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 10/35] spapr: add support for the SET_OS_PENDING command (XIVE) Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 11/35] spapr: introduce a 'xive_exploitation' option to enable XIVE Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 12/35] spapr: add a sPAPRXive object to the machine Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 13/35] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 14/35] spapr: add device tree support for the XIVE exploitation mode Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 15/35] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 16/35] spapr: introduce a helper to map the XIVE memory regions Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 17/35] spapr: add XIVE support to spapr_qirq() Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 18/35] spapr: introduce a spapr_icp_create() helper Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 19/35] spapr: toggle the ICP depending on the selected interrupt mode Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 20/35] spapr: add support to dump XIVE information Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 21/35] spapr: advertise XIVE exploitation mode in CAS Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 22/35] spapr: add classes for the XIVE models Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 23/35] target/ppc/kvm: add Linux KVM definitions for XIVE Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 24/35] spapr/xive: add common realize routine for KVM Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 25/35] spapr/xive: add KVM support Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 26/35] spapr/xive: add a XIVE KVM device to the machine Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 27/35] migration: discard non-migratable RAMBlocks Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 28/35] intc: introduce a CPUIntc interface Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 29/35] spapr/xive, xics: use the CPU_INTC handlers to reset KVM Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 30/35] spapr/xive, xics: reset KVM at machine reset Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 31/35] spapr/xive: raise migration priority of the machine Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 32/35] ppc/pnv: introduce a pnv_icp_create() helper Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 33/35] ppc: externalize ppc_get_vcpu_by_pir() Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 34/35] ppc/pnv: add XIVE support Cédric Le Goater
2018-04-19 12:43 ` [Qemu-devel] [PATCH v3 35/35] ppc/pnv: add a PSI bridge model for POWER9 processor Cédric Le Goater
2018-04-19 13:28 ` [Qemu-devel] [PATCH v3 00/35] ppc: support for the XIVE interrupt controller (POWER9) no-reply
