* [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9)
@ 2018-11-16 10:56 Cédric Le Goater
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
                   ` (35 more replies)
  0 siblings, 36 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Hello,

Here is version 5 of the QEMU models adding support for the XIVE
interrupt controller to the sPAPR machine, under TCG and KVM, and to
the PowerNV POWER9 machine.

The most important changes for sPAPR are the introduction of a new
'dual' pseries machine supporting both interrupt modes, XICS and XIVE,
under TCG and KVM, and fixes for the virtual LSIs.

The QEMU PowerNV POWER9 machine now has support for PHB4, bringing it
to the level of a "real" machine. It validates the model of the
baremetal XIVE interrupt controller proposed in this patchset. The other
PowerNV models will be part of another patchset.

Thanks,

C.


Changes in v5 :

 Common XIVE models :

 - renamed the XIVE structures to fit the changes of the XIVE
   architecture documents: IVE, EQD, VPD -> EAS, END, NVT   
 - reworked the monitor output to print the EQ contents

 sPAPR models :

 - introduced a XIVE Router 'reset' method for the Xive Thread Context
   to set the OS CAM line of the VCPU
 - introduced a spapr_irq_init() routine to the sPAPR IRQ backend
   and reworked the XIVE-only machine to fit mainline QEMU
 - introduced a reset() method to the sPAPR IRQ backend to handle
   changes in the interrupt mode after machine reset
 - introduced a 'dual' machine supporting both interrupt modes

 KVM :

 - introduced some more sPAPR NVT and END indexing helpers for KVM support
 - fixed the virtual LSIs in KVM by using the H_INT_ESB source flag
 - improved the KVM support with better common classes and cleaner
   QEMU<->KVM interfaces
 - improved KVM migration with better control over the capture sequence.
   Still some issues with 'ceded' VCPUs
 - introduced KVM support for the 'dual' machine

 PowerNV:

 - introduced address spaces for the IPI and END set translation
   tables


Changes in v4 :

   See https://lists.gnu.org/archive/html/qemu-devel/2018-06/msg01672.html


= XIVE =================================================================


The POWER9 processor comes with a new interrupt controller called
XIVE, for "eXternal Interrupt Virtualization Engine".


* Overall architecture


              XIVE Interrupt Controller
              +-------------------------------------+       IPIs
              | +---------+ +---------+ +---------+ |    +--------+
              | |VC       | |CQ       | |PC       |----> | CORES  |
              | |     esb | |         | |         |----> |        |
              | |     eas | |  Bridge | |         |----> |        |
              | |SC   end | |         | |     nvt | |    |        |
+------+      | +---------+ +----+----+ +---------+ |    +--+-+-+-+
| RAM  |      +------------------|------------------+       | | |
|      |                         |                          | | |
|      |                         |                          | | |
|      |   +---------------------v--------------------------v-v-v---+      other
|      <---+                       Power Bus                        +----> chips
|  esb |   +-----------+-----------------------+--------------------+
|  eas |               |                       |
|  end |               |                       |
|  nvt |           +---+----+              +---+----+
+------+           |SC      |              |SC      |
                   |        |              |        |
                   | 2-bits |              | 2-bits |
                   | local  |              |   VC   |
                   +--------+              +--------+
                     PCIe                  NX,NPU,CAPI

                  SC: Source Controller (aka. IVSE)
                  VC: Virtualization Controller (aka. IVRE)
                  CQ: Common Queue (Bridge)
                  PC: Presentation Controller (aka. IVPE)

              2-bits: source state machine (PQ bits)
                 esb: Event State Buffer (Array of PQ bits in an IVSE)
                 eas: Event Assignment Structure
                 end: Event Notification Descriptor
                 nvt: Notification Virtual Target

It is composed of three sub-engines :

  - Interrupt Virtualization Source Engine (IVSE), or Source
    Controller (SC). These are found in PCI PHBs, in the PSI host
    bridge controller, but also inside the main controller for the
    core IPIs and other sub-chips (NX, CAP, NPU) of the
    chip/processor. They are configured to feed the IVRE with events.

  - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
    Controller (VC). Its job is to match an event source with an Event
    Notification Descriptor (END).

  - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
    Controller (PC). It maintains the interrupt context state of each
    thread and handles the delivery of the external exception to the
    thread.


* XIVE internal tables

Each of the sub-engines uses a set of tables to redirect exceptions
from event sources to CPU threads.

                                             +-------+
   User or OS                                |  EQ   |
       or                            +------>|entries|
   Hypervisor                        |       |  ..   |
     Memory                          |       +-------+
                                     |           ^
                                     |           |
               +--------------------------------------------------+
                                     |           |
   Hypervisor        +------+    +---+--+    +---+--+   +------+
     Memory          | ESB  |    | EAT  |    | ENDT |   | NVTT |
    (skiboot)        +----+-+    +----+-+    +----+-+   +------+
                       ^  |        ^  |        ^  |       ^
                       |  |        |  |        |  |       |
               +--------------------------------------------------+
                       |  |        |  |        |  |       |
                       |  |        |  |        |  |       |
                 +-----|--|--------|--|--------|--|-+   +-|-----+    +------+
                 |     |  |        |  |        |  | |   | | tctx|    |Thread|
    IPI or   ----+     +  v        +  v        +  v |---| +  .. |----->     |
   HW events     |                                  |   |       |    |      |
                 |              IVRE                |   | IVPE  |    +------+
                 +----------------------------------+   +-------+

Each IVSE has a 2-bit state machine for each source, P for pending and
Q for queued, which determines whether a triggered event is let
through. The PQ bits are stored in an array, the Event State Buffer
(ESB), and are controlled by MMIOs.
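
As an illustration of the PQ state machine, here is a minimal
standalone sketch. The transitions follow the xive_esb_trigger() and
xive_esb_eoi() helpers of patch 01; the Esb type, the NR_SOURCES size
and the function names are hypothetical and only serve the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings: P is bit 1, Q is bit 0 */
enum {
    PQ_RESET   = 0x0, /* idle, the next trigger is forwarded */
    PQ_OFF     = 0x1, /* source is masked */
    PQ_PENDING = 0x2, /* forwarded, waiting for an EOI */
    PQ_QUEUED  = 0x3, /* triggered again while pending */
};

#define NR_SOURCES 4 /* hypothetical size, for the example only */

/* Hypothetical ESB: one PQ entry per source */
typedef struct {
    uint8_t pq[NR_SOURCES];
} Esb;

/* Returns true when the event must be forwarded to the IVRE */
bool esb_trigger(Esb *esb, uint32_t srcno)
{
    assert(srcno < NR_SOURCES);

    switch (esb->pq[srcno] & 0x3) {
    case PQ_RESET:
        esb->pq[srcno] = PQ_PENDING;
        return true;
    case PQ_PENDING:
    case PQ_QUEUED:
        esb->pq[srcno] = PQ_QUEUED; /* coalesce */
        return false;
    default: /* PQ_OFF: the event is dropped */
        return false;
    }
}

/* Returns true when a coalesced event must be replayed after the EOI */
bool esb_eoi(Esb *esb, uint32_t srcno)
{
    assert(srcno < NR_SOURCES);

    switch (esb->pq[srcno] & 0x3) {
    case PQ_QUEUED:
        esb->pq[srcno] = PQ_PENDING;
        return true;
    case PQ_OFF:
        return false;
    default: /* PQ_RESET or PQ_PENDING */
        esb->pq[srcno] = PQ_RESET;
        return false;
    }
}
```

A second trigger on a pending source only sets Q, so an event is
present at most once in a queue, and the EOI replays it.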

If the event is let through, the IVRE looks up in the Event Assignment
Structure (EAS) table the Event Notification Descriptor (END)
configured for the source. Each Event Notification Descriptor defines
a notification path to a CPU and an in-memory Event Queue, into which
an EQ data entry is pushed for the OS to pull.
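
The lookup and the queue push can be sketched as follows. This is only
an illustration under simplifying assumptions: the Eas and End
structures, their fields, the EQ_ENTRIES size and the placement of the
toggle bit in the top bit of each entry are hypothetical and do not
reflect the layouts used by the hardware or by the models.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EQ_ENTRIES 4 /* hypothetical, tiny queue for the example */

/* Hypothetical EAS: which END a source notifies, and with what data */
typedef struct {
    bool     valid;
    uint32_t end_idx;
    uint32_t eq_data;
} Eas;

/* Hypothetical END: an event queue with its index and toggle bit */
typedef struct {
    uint32_t queue[EQ_ENTRIES];
    uint32_t index;  /* next slot to write */
    uint32_t toggle; /* generation bit, flipped on each wrap */
} End;

/* Returns false when the source has no valid EAS */
bool route_event(const Eas *eat, uint32_t nr_eas,
                 End *endt, uint32_t nr_end, uint32_t srcno)
{
    assert(srcno < nr_eas);
    const Eas *eas = &eat[srcno];

    if (!eas->valid) {
        return false;
    }

    assert(eas->end_idx < nr_end);
    End *end = &endt[eas->end_idx];

    /* The top bit carries the generation so that the OS can tell
     * fresh entries from stale ones */
    end->queue[end->index] = (end->toggle << 31) |
                             (eas->eq_data & 0x7fffffff);

    end->index = (end->index + 1) % EQ_ENTRIES;
    if (end->index == 0) {
        end->toggle ^= 1;
    }
    return true;
}
```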

The IVPE determines if a Notification Virtual Target (NVT) can handle
the event by scanning the thread contexts of the VPs dispatched on the
processor HW threads. It maintains the interrupt context state of each
thread in a NVT table.


* Overview of the QEMU models for the XIVE sub-engines

The XiveSource models the IVSE in general, internal and external. It
handles the source ESBs and the MMIO interface to control them.

The XiveFabric is a small helper interface interconnecting the
XiveSource to the XiveRouter.

The XiveRouter is an abstract model acting as a combined IVRE and
IVPE. It routes event notifications using the EAS and END tables to
the IVPE sub-engine which does a CAM scan to find a CPU to deliver the
exception. Storage should be provided by the inheriting classes.

XiveENDSource is a special source object. It exposes the END ESB MMIOs of
the Event Queues which are used for coalescing event notifications and
for escalation. It is not used in the field, only to sync the EQ cache in
OPAL.

Finally, the XiveTCTX contains the interrupt state context of a thread,
four sets of registers, one for each exception that can be delivered
to a CPU. These contexts are scanned by the IVPE to find a matching VP
when a notification is triggered. It also models the Thread Interrupt
Management Area (TIMA), which exposes the thread context registers to
the CPU for interrupt management.


* XIVE for sPAPR

sPAPRXive models the XIVE interrupt controller of a sPAPR machine. It
inherits from the XiveRouter and provisions storage for the EAS and
END tables. The NVT table does not need a backend in sPAPR. It owns a
XiveSource object for the IPIs and the virtual device interrupts, a
memory region for the TIMA and a XiveENDSource to manage the END ESBs
(not used by Linux).

These choices were made to have a sPAPR interrupt controller
consistent with the one found on baremetal and to facilitate KVM
support, the main difficulty being the host memory regions exposed to
the guest.

The NVT and the END indexing need some care, and a set of helpers is
defined to ease the conversion between the CPU id as seen by the guest
and the XIVE identifiers manipulated by the models.
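
To give an idea of what such helpers look like, here is a hedged
sketch. Patch 11 uses the VCPU id as the NVT identifier; everything
else below is an assumption made for the example: the NVT_BASE offset,
the one-END-per-priority layout with 8 priorities, and all function
names are hypothetical and may differ from the helpers in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define NVT_BASE      0x400u /* hypothetical first NVT index for VCPUs */
#define NR_PRIORITIES 8u     /* assumed: one END per VCPU and priority */

/* VCPU id -> NVT index */
uint32_t vcpu_to_nvt(uint32_t vcpu_id)
{
    return NVT_BASE + vcpu_id;
}

/* NVT index -> VCPU id, false if the index is not a VCPU NVT */
bool nvt_to_vcpu(uint32_t nvt_idx, uint32_t *vcpu_id)
{
    if (nvt_idx < NVT_BASE) {
        return false;
    }
    *vcpu_id = nvt_idx - NVT_BASE;
    return true;
}

/* (VCPU id, priority) -> END index */
uint32_t vcpu_to_end(uint32_t vcpu_id, uint8_t priority)
{
    assert(priority < NR_PRIORITIES);
    return vcpu_id * NR_PRIORITIES + priority;
}

/* END index -> (VCPU id, priority) */
void end_to_vcpu(uint32_t end_idx, uint32_t *vcpu_id, uint8_t *priority)
{
    *vcpu_id = end_idx / NR_PRIORITIES;
    *priority = (uint8_t)(end_idx % NR_PRIORITIES);
}
```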


* Integration in the sPAPR machine, xive only and dual

A new sPAPR IRQ backend is defined for XIVE. It introduces a couple of
new operations to handle the differences in the creation of the device
tree and in the allocation of the CPU interrupt controller. A new
'xive' only pseries machine is defined using this XIVE backend.

Being able to support both interrupt modes in the same machine requires
some more changes. As the machine chooses the interrupt mode at CAS
time, it is activated after a reconfiguration done in a reset. This is
handled by a new 'dual' sPAPR IRQ backend which is built on top of the
XICS and XIVE backends. A new 'dual' pseries machine is defined using
this backend.
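
The CAS-then-reset sequence can be pictured with the small sketch
below. It is not the sPAPR IRQ backend interface of this series: the
IrqBackend and DualBackend types and the dual_*() functions are
hypothetical, and starting in XICS mode before CAS is an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { IRQ_MODE_XICS, IRQ_MODE_XIVE } IrqMode;

/* Hypothetical backend descriptor, reduced to a name */
typedef struct {
    const char *name;
} IrqBackend;

/* Hypothetical 'dual' backend wrapping the two others */
typedef struct {
    const IrqBackend *xics;
    const IrqBackend *xive;
    const IrqBackend *active; /* backend currently in use */
    IrqMode negotiated;       /* mode chosen by the guest at CAS */
} DualBackend;

void dual_init(DualBackend *d, const IrqBackend *xics,
               const IrqBackend *xive)
{
    assert(xics != NULL && xive != NULL);
    d->xics = xics;
    d->xive = xive;
    d->negotiated = IRQ_MODE_XICS; /* assumption: XICS until CAS */
    d->active = xics;
}

/* CAS only records the choice, nothing is switched yet */
void dual_cas(DualBackend *d, IrqMode mode)
{
    d->negotiated = mode;
}

/* The reset following CAS activates the negotiated backend */
void dual_reset(DualBackend *d)
{
    d->active = (d->negotiated == IRQ_MODE_XIVE) ? d->xive : d->xics;
}
```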


* KVM support

Support for KVM introduces a set of specific XIVE models, very much
like XICS does, which self-connect to their KVM counterparts in the
Linux kernel. Two host memory regions are exposed to the guest and
need special care at initialization :

  - ESB mmios
  - Thread Interrupt Management Area (TIMA)

The models use KVM accessors to synchronize the QEMU state with
KVM. The states are :

  - the source configuration (EAT)
  - the END configuration (ENDT)
  - the OS EQ state (toggle bit and index)
  - the thread interrupt context registers.

Hybrid guests using KVM and an emulated irqchip (kernel_irqchip=off) are
supported.

Migration under KVM is supported but still has some issues with the
pages of the OS event notification queues when the VCPUs are
'ceded'. They are mapped to the ZERO_PAGE on the receiving side. Work
in progress.

KVM support for the 'dual' machine required some more changes. Both
interrupt modes need to be initialized at the QEMU level to keep the
IRQ number space in sync and to allow switching from one mode to
another. At the KVM level, the whole initialization of the KVM device,
sources and presenters, needs to be done in the reset handler when the
interrupt mode is chosen. This is a major change in the KVM models.

Since KVM is initialized at reset, we lose the possibility to fall back to
the QEMU emulated mode in case of failure, and failures become fatal to
the machine.


* PowerNV models

The PnvXIVE model uses the XiveRouter abstract model just like
sPAPRXive. It provides accessors to the EAS, END and NVT tables which
are stored in the memory of the QEMU PowerNV machine and no longer in
QEMU itself. It owns a set of memory regions for the IC registers, the
ESBs, the END ESBs, the TIMA and the notification MMIO.

Multichip is supported and the available IVSEs are the internal one
for the IPIs, the PSI host bridge controller and PHB4.

The next interesting step would be to add escalation events and model
the VCPU dispatching to support emulated KVM guests.


* GitHub trees
 
QEMU sPAPR:

  https://github.com/legoater/qemu/commits/xive-3.1
  
QEMU PowerNV:

  https://github.com/legoater/qemu/commits/powernv-3.1

Linux/KVM:

  https://github.com/legoater/linux/commits/xive-4.20

OPAL:

  https://github.com/legoater/skiboot/commits/xive



Cédric Le Goater (36):
  ppc/xive: introduce a XIVE interrupt source model
  ppc/xive: add support for the LSI interrupt sources
  ppc/xive: introduce the XiveFabric interface
  ppc/xive: introduce the XiveRouter model
  ppc/xive: introduce the XIVE Event Notification Descriptors
  ppc/xive: add support for the END Event State buffers
  ppc/xive: introduce the XIVE interrupt thread context
  ppc/xive: introduce a simplified XIVE presenter
  ppc/xive: notify the CPU when the interrupt priority is more
    privileged
  spapr/xive: introduce a XIVE interrupt controller
  spapr/xive: use the VCPU id as a NVT identifier
  spapr: initialize VSMT before initializing the IRQ backend
  spapr: introduce a spapr_irq_init() routine
  spapr: modify the irq backend 'init' method
  spapr: introdude a new machine IRQ backend for XIVE
  spapr: add hcalls support for the XIVE exploitation interrupt mode
  spapr: add device tree support for the XIVE exploitation mode
  spapr: allocate the interrupt thread context under the CPU core
  spapr: add a 'pseries-3.1-xive' machine type
  spapr: add classes for the XIVE models
  spapr: extend the sPAPR IRQ backend for XICS migration
  spapr/xive: add models for KVM support
  spapr/xive: add migration support for KVM
  spapr: add a 'reset' method to the sPAPR IRQ backend
  spapr: set the interrupt presenter at reset
  spapr: add a 'pseries-3.1-dual' machine type
  sysbus: add a sysbus_mmio_unmap() helper
  ppc/xics: introduce a icp_kvm_init() routine
  ppc/xics: remove abort() in icp_kvm_init()
  spapr: check for KVM IRQ device activation
  spapr/xive: export the spapr_xive_kvm_init() routine
  spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers
  spapr: introduce routines to delete the KVM IRQ device
  spapr: add KVM support to the 'dual' machine
  ppc: externalize ppc_get_vcpu_by_pir()
  ppc/pnv: add XIVE support

 default-configs/ppc64-softmmu.mak |    3 +
 hw/intc/pnv_xive_regs.h           |  314 +++++
 include/hw/ppc/pnv.h              |   22 +-
 include/hw/ppc/pnv_xive.h         |  100 ++
 include/hw/ppc/pnv_xscom.h        |    3 +
 include/hw/ppc/ppc.h              |    1 +
 include/hw/ppc/spapr.h            |   18 +-
 include/hw/ppc/spapr_cpu_core.h   |    2 +
 include/hw/ppc/spapr_irq.h        |   16 +-
 include/hw/ppc/spapr_xive.h       |  113 ++
 include/hw/ppc/xics.h             |    1 +
 include/hw/ppc/xive.h             |  332 ++++++
 include/hw/ppc/xive_regs.h        |  183 +++
 include/hw/sysbus.h               |    1 +
 include/migration/vmstate.h       |    1 +
 linux-headers/asm-powerpc/kvm.h   |   45 +
 linux-headers/linux/kvm.h         |    6 +
 target/ppc/kvm_ppc.h              |    6 +
 hw/core/sysbus.c                  |   10 +
 hw/intc/pnv_xive.c                | 1612 +++++++++++++++++++++++++
 hw/intc/spapr_xive.c              |  523 +++++++++
 hw/intc/spapr_xive_hcall.c        |  954 +++++++++++++++
 hw/intc/spapr_xive_kvm.c          |  983 ++++++++++++++++
 hw/intc/xics_kvm.c                |  113 +-
 hw/intc/xive.c                    | 1811 +++++++++++++++++++++++++++++
 hw/ppc/pnv.c                      |   74 +-
 hw/ppc/ppc.c                      |   16 +
 hw/ppc/spapr.c                    |   84 +-
 hw/ppc/spapr_cpu_core.c           |   31 +-
 hw/ppc/spapr_hcall.c              |   16 +
 hw/ppc/spapr_irq.c                |  455 +++++++-
 hw/ppc/spapr_rtas.c               |    2 +-
 target/ppc/kvm.c                  |    7 +
 hw/intc/Makefile.objs             |    5 +-
 34 files changed, 7780 insertions(+), 83 deletions(-)
 create mode 100644 hw/intc/pnv_xive_regs.h
 create mode 100644 include/hw/ppc/pnv_xive.h
 create mode 100644 include/hw/ppc/spapr_xive.h
 create mode 100644 include/hw/ppc/xive.h
 create mode 100644 include/hw/ppc/xive_regs.h
 create mode 100644 hw/intc/pnv_xive.c
 create mode 100644 hw/intc/spapr_xive.c
 create mode 100644 hw/intc/spapr_xive_hcall.c
 create mode 100644 hw/intc/spapr_xive_kvm.c
 create mode 100644 hw/intc/xive.c

-- 
2.17.2


* [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
@ 2018-11-16 10:56 ` Cédric Le Goater
  2018-11-22  3:05   ` David Gibson
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
                   ` (34 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The first sub-engine of the overall XIVE architecture is the Interrupt
Virtualization Source Engine (IVSE). An IVSE can be integrated into
other logic, such as a PCI PHB, or into the main interrupt controller
to manage IPIs.

Each IVSE instance is associated with an Event State Buffer (ESB) that
contains a two-bit state entry for each possible event source. When an
event is signaled to the IVSE, by MMIO or some other means, the
associated interrupt state bits are fetched from the ESB and
modified. Depending on the resulting ESB state, the event is forwarded
to the IVRE sub-engine of the controller doing the routing.

Each supported ESB entry is associated with either a single page or an
even/odd pair of pages which provide commands to manage the source,
for instance to EOI it or to turn it off.
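
The page layout can be sketched with a few standalone helpers. They
mirror xive_source_esb_page(), xive_source_esb_mgmt() and
xive_source_is_trigger_page() from this patch but take the ESB shift
directly instead of a XiveSource; the esb_*() names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESB shift settings, same values as the XIVE_ESB_* defines */
#define ESB_4K        12
#define ESB_4K_2PAGE  13
#define ESB_64K       16
#define ESB_64K_2PAGE 17

bool esb_has_2page(uint32_t shift)
{
    return shift == ESB_4K_2PAGE || shift == ESB_64K_2PAGE;
}

/* Offset of the first (trigger) page of a source in the ESB region */
uint64_t esb_trigger_page(uint32_t shift, uint32_t srcno)
{
    assert(shift >= ESB_4K && shift <= ESB_64K_2PAGE);
    return ((uint64_t)1 << shift) * srcno;
}

/* Offset of the management page: the odd page in a two-page setting,
 * the same single page otherwise */
uint64_t esb_mgmt_page(uint32_t shift, uint32_t srcno)
{
    uint64_t addr = esb_trigger_page(shift, srcno);

    if (esb_has_2page(shift)) {
        addr += (uint64_t)1 << (shift - 1);
    }
    return addr;
}

/* Source number targeted by an access at 'addr' */
uint32_t esb_addr_to_srcno(uint32_t shift, uint64_t addr)
{
    return (uint32_t)(addr >> shift);
}

/* True when 'addr' falls in the even page of a two-page setting */
bool esb_addr_is_trigger_page(uint32_t shift, uint64_t addr)
{
    return esb_has_2page(shift) && !((addr >> (shift - 1)) & 1);
}
```

With the default setting of two 64k pages, source 3 has its trigger
page at offset 0x60000 and its management page at offset 0x70000.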

On a sPAPR machine, the O/S will obtain the page address of the ESB
entry associated with a source and its characteristics using the
H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is used.

The xive_source_notify() routine is in charge of forwarding the source
event notification to the routing engine. It will be filled in later on.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 default-configs/ppc64-softmmu.mak |   1 +
 include/hw/ppc/xive.h             | 130 ++++++++++
 hw/intc/xive.c                    | 379 ++++++++++++++++++++++++++++++
 hw/intc/Makefile.objs             |   1 +
 4 files changed, 511 insertions(+)
 create mode 100644 include/hw/ppc/xive.h
 create mode 100644 hw/intc/xive.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index aec2855750d6..2d1e7c5c4668 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -16,6 +16,7 @@ CONFIG_VIRTIO_VGA=y
 CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
+CONFIG_XIVE=$(CONFIG_PSERIES)
 CONFIG_MEM_DEVICE=y
 CONFIG_DIMM=y
 CONFIG_SPAPR_RNG=y
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
new file mode 100644
index 000000000000..5fec4b08705d
--- /dev/null
+++ b/include/hw/ppc/xive.h
@@ -0,0 +1,130 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_XIVE_H
+#define PPC_XIVE_H
+
+#include "hw/sysbus.h"
+
+/*
+ * XIVE Interrupt Source
+ */
+
+#define TYPE_XIVE_SOURCE "xive-source"
+#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
+
+/*
+ * XIVE Interrupt Source characteristics, which define how the ESB are
+ * controlled.
+ */
+#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
+#define XIVE_SRC_STORE_EOI     0x2 /* Store EOI supported */
+
+typedef struct XiveSource {
+    SysBusDevice parent;
+
+    /* IRQs */
+    uint32_t        nr_irqs;
+    qemu_irq        *qirqs;
+
+    /* PQ bits */
+    uint8_t         *status;
+
+    /* ESB memory region */
+    uint64_t        esb_flags;
+    uint32_t        esb_shift;
+    MemoryRegion    esb_mmio;
+} XiveSource;
+
+/*
+ * ESB MMIO setting. Can be one page, for both source triggering and
+ * source management, or two different pages. See below for magic
+ * values.
+ */
+#define XIVE_ESB_4K          12 /* PSI HB only */
+#define XIVE_ESB_4K_2PAGE    13
+#define XIVE_ESB_64K         16
+#define XIVE_ESB_64K_2PAGE   17
+
+static inline bool xive_source_esb_has_2page(XiveSource *xsrc)
+{
+    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE ||
+        xsrc->esb_shift == XIVE_ESB_4K_2PAGE;
+}
+
+/* The trigger page is always the first/even page */
+static inline hwaddr xive_source_esb_page(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return (1ull << xsrc->esb_shift) * srcno;
+}
+
+/* In a two pages ESB MMIO setting, the odd page is for management */
+static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
+{
+    hwaddr addr = xive_source_esb_page(xsrc, srcno);
+
+    if (xive_source_esb_has_2page(xsrc)) {
+        addr += (1 << (xsrc->esb_shift - 1));
+    }
+
+    return addr;
+}
+
+/*
+ * Each interrupt source has a 2-bit state machine which can be
+ * controlled by MMIO. P indicates that an interrupt is pending (has
+ * been sent to a queue and is waiting for an EOI). Q indicates that
+ * the interrupt has been triggered while pending.
+ *
+ * This acts as a coalescing mechanism in order to guarantee that a
+ * given interrupt only occurs at most once in a queue.
+ *
+ * When doing an EOI, the Q bit will indicate if the interrupt
+ * needs to be re-triggered.
+ */
+#define XIVE_ESB_VAL_P        0x2
+#define XIVE_ESB_VAL_Q        0x1
+
+#define XIVE_ESB_RESET        0x0
+#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
+#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
+#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
+
+/*
+ * "magic" Event State Buffer (ESB) MMIO offsets.
+ *
+ * The following offsets into the ESB MMIO allow to read or manipulate
+ * the PQ bits. They must be used with an 8-byte load instruction.
+ * They all return the previous state of the interrupt (atomically).
+ *
+ * Additionally, some ESB pages support doing an EOI via a store and
+ * some ESBs support doing a trigger via a separate trigger page.
+ */
+#define XIVE_ESB_STORE_EOI      0x400 /* Store */
+#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
+#define XIVE_ESB_GET            0x800 /* Load */
+#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
+#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
+#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
+#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
+
+uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno);
+uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
+
+void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset,
+                                Monitor *mon);
+
+static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return xsrc->qirqs[srcno];
+}
+
+#endif /* PPC_XIVE_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
new file mode 100644
index 000000000000..f7621f84828c
--- /dev/null
+++ b/hw/intc/xive.c
@@ -0,0 +1,379 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/dma.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/xive.h"
+
+/*
+ * XIVE ESB helpers
+ */
+
+static uint8_t xive_esb_set(uint8_t *pq, uint8_t value)
+{
+    uint8_t old_pq = *pq & 0x3;
+
+    *pq &= ~0x3;
+    *pq |= value & 0x3;
+
+    return old_pq;
+}
+
+static bool xive_esb_trigger(uint8_t *pq)
+{
+    uint8_t old_pq = *pq & 0x3;
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        xive_esb_set(pq, XIVE_ESB_PENDING);
+        return true;
+    case XIVE_ESB_PENDING:
+    case XIVE_ESB_QUEUED:
+        xive_esb_set(pq, XIVE_ESB_QUEUED);
+        return false;
+    case XIVE_ESB_OFF:
+        xive_esb_set(pq, XIVE_ESB_OFF);
+        return false;
+    default:
+         g_assert_not_reached();
+    }
+}
+
+static bool xive_esb_eoi(uint8_t *pq)
+{
+    uint8_t old_pq = *pq & 0x3;
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+    case XIVE_ESB_PENDING:
+        xive_esb_set(pq, XIVE_ESB_RESET);
+        return false;
+    case XIVE_ESB_QUEUED:
+        xive_esb_set(pq, XIVE_ESB_PENDING);
+        return true;
+    case XIVE_ESB_OFF:
+        xive_esb_set(pq, XIVE_ESB_OFF);
+        return false;
+    default:
+         g_assert_not_reached();
+    }
+}
+
+/*
+ * XIVE Interrupt Source (or IVSE)
+ */
+
+uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+
+    return xsrc->status[srcno] & 0x3;
+}
+
+uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
+{
+    assert(srcno < xsrc->nr_irqs);
+
+    return xive_esb_set(&xsrc->status[srcno], pq);
+}
+
+/*
+ * Returns whether the event notification should be forwarded.
+ */
+static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+
+    return xive_esb_trigger(&xsrc->status[srcno]);
+}
+
+/*
+ * Returns whether the event notification should be forwarded.
+ */
+static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+
+    return xive_esb_eoi(&xsrc->status[srcno]);
+}
+
+/*
+ * Forward the source event notification to the Router
+ */
+static void xive_source_notify(XiveSource *xsrc, int srcno)
+{
+
+}
+
+/*
+ * In a two pages ESB MMIO setting, even page is the trigger page, odd
+ * page is for management
+ */
+static inline bool addr_is_even(hwaddr addr, uint32_t shift)
+{
+    return !((addr >> shift) & 1);
+}
+
+static inline bool xive_source_is_trigger_page(XiveSource *xsrc, hwaddr addr)
+{
+    return xive_source_esb_has_2page(xsrc) &&
+        addr_is_even(addr, xsrc->esb_shift - 1);
+}
+
+/*
+ * ESB MMIO loads
+ *                      Trigger page    Management/EOI page
+ * 2 pages setting      even            odd
+ *
+ * 0x000 .. 0x3FF       -1              EOI and return 0|1
+ * 0x400 .. 0x7FF       -1              EOI and return 0|1
+ * 0x800 .. 0xBFF       -1              return PQ
+ * 0xC00 .. 0xCFF       -1              return PQ and atomically PQ=00
+ * 0xD00 .. 0xDFF       -1              return PQ and atomically PQ=01
+ * 0xE00 .. 0xEFF       -1              return PQ and atomically PQ=10
+ * 0xF00 .. 0xFFF       -1              return PQ and atomically PQ=11
+ */
+static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
+{
+    XiveSource *xsrc = XIVE_SOURCE(opaque);
+    uint32_t offset = addr & 0xFFF;
+    uint32_t srcno = addr >> xsrc->esb_shift;
+    uint64_t ret = -1;
+
+    /* In a two pages ESB MMIO setting, trigger page should not be read */
+    if (xive_source_is_trigger_page(xsrc, addr)) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "XIVE: invalid load on IRQ %d trigger page at "
+                      "0x%"HWADDR_PRIx"\n", srcno, addr);
+        return -1;
+    }
+
+    switch (offset) {
+    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
+        ret = xive_source_esb_eoi(xsrc, srcno);
+
+        /* Forward the source event notification for routing */
+        if (ret) {
+            xive_source_notify(xsrc, srcno);
+        }
+        break;
+
+    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
+        ret = xive_source_esb_get(xsrc, srcno);
+        break;
+
+    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
+    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
+    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
+    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
+        ret = xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB load addr %x\n",
+                      offset);
+    }
+
+    return ret;
+}
+
+/*
+ * ESB MMIO stores
+ *                      Trigger page    Management/EOI page
+ * 2 pages setting      even            odd
+ *
+ * 0x000 .. 0x3FF       Trigger         Trigger
+ * 0x400 .. 0x7FF       Trigger         EOI
+ * 0x800 .. 0xBFF       Trigger         undefined
+ * 0xC00 .. 0xCFF       Trigger         PQ=00
+ * 0xD00 .. 0xDFF       Trigger         PQ=01
+ * 0xE00 .. 0xEFF       Trigger         PQ=10
+ * 0xF00 .. 0xFFF       Trigger         PQ=11
+ */
+static void xive_source_esb_write(void *opaque, hwaddr addr,
+                                  uint64_t value, unsigned size)
+{
+    XiveSource *xsrc = XIVE_SOURCE(opaque);
+    uint32_t offset = addr & 0xFFF;
+    uint32_t srcno = addr >> xsrc->esb_shift;
+    bool notify = false;
+
+    /* In a two pages ESB MMIO setting, trigger page only triggers */
+    if (xive_source_is_trigger_page(xsrc, addr)) {
+        notify = xive_source_esb_trigger(xsrc, srcno);
+        goto out;
+    }
+
+    switch (offset) {
+    case 0 ... 0x3FF:
+        notify = xive_source_esb_trigger(xsrc, srcno);
+        break;
+
+    case XIVE_ESB_STORE_EOI ... XIVE_ESB_STORE_EOI + 0x3FF:
+        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
+            qemu_log_mask(LOG_GUEST_ERROR,
+                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
+            return;
+        }
+
+        notify = xive_source_esb_eoi(xsrc, srcno);
+        break;
+
+    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
+    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
+    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
+    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
+        xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
+        break;
+
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %x\n",
+                      offset);
+        return;
+    }
+
+out:
+    /* Forward the source event notification for routing */
+    if (notify) {
+        xive_source_notify(xsrc, srcno);
+    }
+}
+
+static const MemoryRegionOps xive_source_esb_ops = {
+    .read = xive_source_esb_read,
+    .write = xive_source_esb_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+static void xive_source_set_irq(void *opaque, int srcno, int val)
+{
+    XiveSource *xsrc = XIVE_SOURCE(opaque);
+    bool notify = false;
+
+    if (val) {
+        notify = xive_source_esb_trigger(xsrc, srcno);
+    }
+
+    /* Forward the source event notification for routing */
+    if (notify) {
+        xive_source_notify(xsrc, srcno);
+    }
+}
+
+void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
+{
+    int i;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        uint8_t pq = xive_source_esb_get(xsrc, i);
+
+        if (pq == XIVE_ESB_OFF) {
+            continue;
+        }
+
+        monitor_printf(mon, "  %08x %c%c\n", i + offset,
+                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
+                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
+    }
+}
+
+static void xive_source_reset(DeviceState *dev)
+{
+    XiveSource *xsrc = XIVE_SOURCE(dev);
+
+    /* PQs are initialized to 0b01 which corresponds to "ints off" */
+    memset(xsrc->status, 0x1, xsrc->nr_irqs);
+}
+
+static void xive_source_realize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE(dev);
+
+    if (!xsrc->nr_irqs) {
+        error_setg(errp, "Number of interrupt needs to be greater than 0");
+        return;
+    }
+
+    if (xsrc->esb_shift != XIVE_ESB_4K &&
+        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
+        xsrc->esb_shift != XIVE_ESB_64K &&
+        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
+        error_setg(errp, "Invalid ESB shift setting");
+        return;
+    }
+
+    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
+                                     xsrc->nr_irqs);
+
+    xsrc->status = g_malloc0(xsrc->nr_irqs);
+
+    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
+                          &xive_source_esb_ops, xsrc, "xive.esb",
+                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
+}
+
+static const VMStateDescription vmstate_xive_source = {
+    .name = TYPE_XIVE_SOURCE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
+        VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+/*
+ * The default XIVE interrupt source setting for the ESB MMIOs is two
+ * 64k pages without Store EOI, to be in sync with KVM.
+ */
+static Property xive_source_properties[] = {
+    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
+    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
+    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void xive_source_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->desc    = "XIVE Interrupt Source";
+    dc->props   = xive_source_properties;
+    dc->realize = xive_source_realize;
+    dc->reset   = xive_source_reset;
+    dc->vmsd    = &vmstate_xive_source;
+}
+
+static const TypeInfo xive_source_info = {
+    .name          = TYPE_XIVE_SOURCE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(XiveSource),
+    .class_init    = xive_source_class_init,
+};
+
+static void xive_register_types(void)
+{
+    type_register_static(&xive_source_info);
+}
+
+type_init(xive_register_types)
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 0e9963f5eecc..72a46ed91c31 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
 obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
+obj-$(CONFIG_XIVE) += xive.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
-- 
2.17.2

* [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
@ 2018-11-16 10:56 ` Cédric Le Goater
  2018-11-22  3:19   ` David Gibson
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 03/36] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
                   ` (33 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The 'sent' status of the LSI interrupt source is modeled with the 'P'
bit of the ESB and the assertion status of the source is maintained in
an array under the main sPAPRXive object. The type of the source is
stored in the same array for practical reasons.
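
As an illustration only, the LSI handling described above can be sketched
in standalone C. The bit values mirror the ones used in this patch; the
helper names (pq_get, lsi_set_level, lsi_eoi) are hypothetical and the EOI
path is simplified compared to xive_esb_eoi():

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as in the patch; everything else here is a hypothetical sketch. */
#define ESB_VAL_P        0x2
#define ESB_VAL_Q        0x1
#define STATUS_ASSERTED  0x4          /* extra bit kept next to the PQ pair */

#define ESB_RESET        0x0
#define ESB_PENDING      ESB_VAL_P    /* P=1, Q=0: event already sent */

static uint8_t pq_get(const uint8_t *status)
{
    return *status & 0x3;
}

static void pq_set(uint8_t *status, uint8_t pq)
{
    *status = (uint8_t)((*status & ~0x3) | (pq & 0x3));
}

/* Returns true when an event notification should be forwarded. */
static bool lsi_trigger(uint8_t *status)
{
    if (pq_get(status) == ESB_RESET) {
        pq_set(status, ESB_PENDING);  /* 'P' models the "sent" status */
        return true;
    }
    return false;                     /* already sent, or source is off */
}

/* Line level change: record the assertion, trigger only on assert. */
static bool lsi_set_level(uint8_t *status, bool level)
{
    if (level) {
        *status |= STATUS_ASSERTED;
        return lsi_trigger(status);
    }
    *status &= (uint8_t)~STATUS_ASSERTED;
    return false;
}

/* Simplified EOI: clear P, then re-trigger if the line is still asserted. */
static bool lsi_eoi(uint8_t *status)
{
    if (pq_get(status) == ESB_PENDING) {
        pq_set(status, ESB_RESET);
    }
    if (*status & STATUS_ASSERTED) {
        return lsi_trigger(status);
    }
    return false;
}
```

The point of the sketch is that an LSI never uses the Q bit: a second
notification is produced at EOI time from the assertion bit instead.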

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h | 20 ++++++++++++-
 hw/intc/xive.c        | 68 +++++++++++++++++++++++++++++++++++++++----
 2 files changed, 81 insertions(+), 7 deletions(-)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 5fec4b08705d..e118acd59f1e 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -32,8 +32,10 @@ typedef struct XiveSource {
     /* IRQs */
     uint32_t        nr_irqs;
     qemu_irq        *qirqs;
+    unsigned long   *lsi_map;
+    int32_t         lsi_map_size; /* for VMSTATE_BITMAP */
 
-    /* PQ bits */
+    /* PQ bits and LSI assertion bit */
     uint8_t         *status;
 
     /* ESB memory region */
@@ -89,6 +91,7 @@ static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
  * When doing an EOI, the Q bit will indicate if the interrupt
  * needs to be re-triggered.
  */
+#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
 #define XIVE_ESB_VAL_P        0x2
 #define XIVE_ESB_VAL_Q        0x1
 
@@ -127,4 +130,19 @@ static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
     return xsrc->qirqs[srcno];
 }
 
+static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return test_bit(srcno, xsrc->lsi_map);
+}
+
+static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
+                                       bool lsi)
+{
+    assert(srcno < xsrc->nr_irqs);
+    if (lsi) {
+        bitmap_set(xsrc->lsi_map, srcno, 1);
+    }
+}
+
 #endif /* PPC_XIVE_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index f7621f84828c..ac4605fee8b7 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -88,14 +88,40 @@ uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
     return xive_esb_set(&xsrc->status[srcno], pq);
 }
 
+/*
+ * Returns whether the event notification should be forwarded.
+ */
+static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
+{
+    uint8_t old_pq = xive_source_esb_get(xsrc, srcno);
+
+    switch (old_pq) {
+    case XIVE_ESB_RESET:
+        xive_source_esb_set(xsrc, srcno, XIVE_ESB_PENDING);
+        return true;
+    default:
+        return false;
+    }
+}
+
 /*
  * Returns whether the event notification should be forwarded.
  */
 static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
 {
+    bool ret;
+
     assert(srcno < xsrc->nr_irqs);
 
-    return xive_esb_trigger(&xsrc->status[srcno]);
+    ret = xive_esb_trigger(&xsrc->status[srcno]);
+
+    if (xive_source_irq_is_lsi(xsrc, srcno) &&
+        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
+    }
+
+    return ret;
 }
 
 /*
@@ -103,9 +129,22 @@ static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
  */
 static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
 {
+    bool ret;
+
     assert(srcno < xsrc->nr_irqs);
 
-    return xive_esb_eoi(&xsrc->status[srcno]);
+    ret = xive_esb_eoi(&xsrc->status[srcno]);
+
+    /* LSI sources do not set the Q bit but they can still be
+     * asserted, in which case we should forward a new event
+     * notification
+     */
+    if (xive_source_irq_is_lsi(xsrc, srcno) &&
+        xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
+        ret = xive_source_lsi_trigger(xsrc, srcno);
+    }
+
+    return ret;
 }
 
 /*
@@ -268,8 +307,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
     XiveSource *xsrc = XIVE_SOURCE(opaque);
     bool notify = false;
 
-    if (val) {
-        notify = xive_source_esb_trigger(xsrc, srcno);
+    if (xive_source_irq_is_lsi(xsrc, srcno)) {
+        if (val) {
+            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
+            notify = xive_source_lsi_trigger(xsrc, srcno);
+        } else {
+            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
+        }
+    } else {
+        if (val) {
+            notify = xive_source_esb_trigger(xsrc, srcno);
+        }
     }
 
     /* Forward the source event notification for routing */
@@ -289,9 +337,11 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
             continue;
         }
 
-        monitor_printf(mon, "  %08x %c%c\n", i + offset,
+        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
+                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
-                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
+                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
+                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
     }
 }
 
@@ -299,6 +349,8 @@ static void xive_source_reset(DeviceState *dev)
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
 
+    /* Do not clear the LSI bitmap */
+
     /* PQs are initialized to 0b01 which corresponds to "ints off" */
     memset(xsrc->status, 0x1, xsrc->nr_irqs);
 }
@@ -325,6 +377,9 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
 
     xsrc->status = g_malloc0(xsrc->nr_irqs);
 
+    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
+    xsrc->lsi_map_size = xsrc->nr_irqs;
+
     memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
                           &xive_source_esb_ops, xsrc, "xive.esb",
                           (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
@@ -338,6 +393,7 @@ static const VMStateDescription vmstate_xive_source = {
     .fields = (VMStateField[]) {
         VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
         VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
+        VMSTATE_BITMAP(lsi_map, XiveSource, 1, lsi_map_size),
         VMSTATE_END_OF_LIST()
     },
 };
-- 
2.17.2

* [Qemu-devel] [PATCH v5 03/36] ppc/xive: introduce the XiveFabric interface
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
@ 2018-11-16 10:56 ` Cédric Le Goater
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model Cédric Le Goater
                   ` (32 subsequent siblings)
  35 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The XiveFabric offers a simple interface between the XiveSource
object and the main interrupt controller of the machine. It will
forward event notifications to the XIVE Interrupt Virtualization
Routing Engine (IVRE).
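
The shape of such an interface can be sketched without QOM as a table of
function pointers. This is a minimal stand-in, not the QOM interface
declared in the patch: the types Fabric, FabricOps and Source and the
recording_notify() handler are hypothetical names used for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Fabric Fabric;

/* Stand-in for XiveFabricClass: a single optional 'notify' handler. */
typedef struct FabricOps {
    void (*notify)(Fabric *xf, uint32_t lisn);
} FabricOps;

struct Fabric {
    const FabricOps *ops;
    uint32_t last_lisn;   /* test bookkeeping only */
    unsigned count;       /* number of notifications received */
};

/* Stand-in for XiveSource: it only knows the fabric it is linked to. */
typedef struct Source {
    Fabric *xive;
} Source;

/* The source forwards the event and does not care who implements it. */
static void source_notify(Source *src, uint32_t srcno)
{
    assert(src->xive != NULL);

    const FabricOps *ops = src->xive->ops;
    if (ops && ops->notify) {
        ops->notify(src->xive, srcno);
    }
}

/* A trivial implementation that records what it was asked to route. */
static void recording_notify(Fabric *xf, uint32_t lisn)
{
    xf->last_lisn = lisn;
    xf->count++;
}
```

The source depends only on the 'notify' signature, so the router behind
it can be swapped per machine without touching the source model.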

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h | 23 +++++++++++++++++++++++
 hw/intc/xive.c        | 25 +++++++++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index e118acd59f1e..be93fae6317b 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -12,6 +12,27 @@
 
 #include "hw/sysbus.h"
 
+/*
+ * XIVE Fabric (Interface between Source and Router)
+ */
+
+typedef struct XiveFabric {
+    Object parent;
+} XiveFabric;
+
+#define TYPE_XIVE_FABRIC "xive-fabric"
+#define XIVE_FABRIC(obj)                                     \
+    OBJECT_CHECK(XiveFabric, (obj), TYPE_XIVE_FABRIC)
+#define XIVE_FABRIC_CLASS(klass)                                     \
+    OBJECT_CLASS_CHECK(XiveFabricClass, (klass), TYPE_XIVE_FABRIC)
+#define XIVE_FABRIC_GET_CLASS(obj)                                   \
+    OBJECT_GET_CLASS(XiveFabricClass, (obj), TYPE_XIVE_FABRIC)
+
+typedef struct XiveFabricClass {
+    InterfaceClass parent;
+    void (*notify)(XiveFabric *xf, uint32_t lisn);
+} XiveFabricClass;
+
 /*
  * XIVE Interrupt Source
  */
@@ -42,6 +63,8 @@ typedef struct XiveSource {
     uint64_t        esb_flags;
     uint32_t        esb_shift;
     MemoryRegion    esb_mmio;
+
+    XiveFabric      *xive;
 } XiveSource;
 
 /*
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index ac4605fee8b7..014a2e41f71f 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -152,7 +152,11 @@ static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
  */
 static void xive_source_notify(XiveSource *xsrc, int srcno)
 {
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xsrc->xive);
 
+    if (xfc->notify) {
+        xfc->notify(xsrc->xive, srcno);
+    }
 }
 
 /*
@@ -358,6 +362,17 @@ static void xive_source_reset(DeviceState *dev)
 static void xive_source_realize(DeviceState *dev, Error **errp)
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
+    Object *obj;
+    Error *local_err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'xive' not found: ");
+        return;
+    }
+
+    xsrc->xive = XIVE_FABRIC(obj);
 
     if (!xsrc->nr_irqs) {
         error_setg(errp, "Number of interrupt needs to be greater than 0");
@@ -427,9 +442,19 @@ static const TypeInfo xive_source_info = {
     .class_init    = xive_source_class_init,
 };
 
+/*
+ * XIVE Fabric
+ */
+static const TypeInfo xive_fabric_info = {
+    .name = TYPE_XIVE_FABRIC,
+    .parent = TYPE_INTERFACE,
+    .class_size = sizeof(XiveFabricClass),
+};
+
 static void xive_register_types(void)
 {
     type_register_static(&xive_source_info);
+    type_register_static(&xive_fabric_info);
 }
 
 type_init(xive_register_types)
-- 
2.17.2

* [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (2 preceding siblings ...)
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 03/36] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
@ 2018-11-16 10:56 ` Cédric Le Goater
  2018-11-22  4:11   ` David Gibson
  2018-11-22  4:44   ` David Gibson
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors Cédric Le Goater
                   ` (31 subsequent siblings)
  35 siblings, 2 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The XiveRouter models the second sub-engine of the overall XIVE
architecture: the Interrupt Virtualization Routing Engine (IVRE).

The IVRE handles event notifications of the IVSE through MMIO stores
and performs the interrupt routing process. For this purpose, it uses
a set of tables stored in system memory, the first of which is the
Event Assignment Structure (EAS) table.

The EAS table (EAT) associates an interrupt source number with an Event
Notification Descriptor (END), which is used in a second phase of the
routing process to identify a Notification Virtual Target (NVT).
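
A standalone sketch of the first routing phase, assuming the EAS layout
introduced by this patch (IBM bit numbering, bit 0 being the MSB). The
BIT64/BITMASK64/get_field/set_field helpers and the EasAction enum are
local stand-ins written for this example, not the QEMU macros:

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word. */
#define BIT64(n)         (0x8000000000000000ULL >> (n))
#define BITMASK64(s, e)  ((BIT64(s) - BIT64(e)) | BIT64(s))

/* Field layout as in the patch. */
#define EAS_VALID        BIT64(0)
#define EAS_END_BLOCK    BITMASK64(4, 7)
#define EAS_END_INDEX    BITMASK64(8, 31)
#define EAS_MASKED       BIT64(32)
#define EAS_END_DATA     BITMASK64(33, 63)

/* Position of the lowest set bit of a non-zero mask. */
static unsigned mask_shift(uint64_t mask)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

static uint64_t get_field(uint64_t mask, uint64_t w)
{
    return (w & mask) >> mask_shift(mask);
}

static uint64_t set_field(uint64_t mask, uint64_t w, uint64_t val)
{
    return (w & ~mask) | ((val << mask_shift(mask)) & mask);
}

typedef enum { EAS_DROP_INVALID, EAS_DROP_MASKED, EAS_FORWARD } EasAction;

/* Decide what to do with an event and extract the END coordinates. */
static EasAction eas_route(uint64_t w, uint8_t *blk, uint32_t *idx,
                           uint32_t *data)
{
    if (!(w & EAS_VALID)) {
        return EAS_DROP_INVALID;
    }
    if (w & EAS_MASKED) {
        return EAS_DROP_MASKED;       /* notification completed */
    }
    *blk  = (uint8_t)get_field(EAS_END_BLOCK, w);
    *idx  = (uint32_t)get_field(EAS_END_INDEX, w);
    *data = (uint32_t)get_field(EAS_END_DATA, w);
    return EAS_FORWARD;
}
```

Only the valid and masked checks are taken from this patch; forwarding the
END block, index and data to a second phase is sketched here as an
assumption about where the routing goes next.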

The XiveRouter is an abstract class: a concrete subclass must define
the storage for the EAT and for the other upcoming tables. The
'chip-id' attribute is not strictly necessary for the sPAPR and
PowerNV machines but it's a good way to test the routing algorithm.
Without this attribute, the XiveRouter could be a simple QOM
interface.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h      | 32 ++++++++++++++
 include/hw/ppc/xive_regs.h | 31 ++++++++++++++
 hw/intc/xive.c             | 86 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 149 insertions(+)
 create mode 100644 include/hw/ppc/xive_regs.h

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index be93fae6317b..5a0696366577 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -11,6 +11,7 @@
 #define PPC_XIVE_H
 
 #include "hw/sysbus.h"
+#include "hw/ppc/xive_regs.h"
 
 /*
  * XIVE Fabric (Interface between Source and Router)
@@ -168,4 +169,35 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
     }
 }
 
+/*
+ * XIVE Router
+ */
+
+typedef struct XiveRouter {
+    SysBusDevice    parent;
+
+    uint32_t        chip_id;
+} XiveRouter;
+
+#define TYPE_XIVE_ROUTER "xive-router"
+#define XIVE_ROUTER(obj)                                \
+    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
+#define XIVE_ROUTER_CLASS(klass)                                        \
+    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
+#define XIVE_ROUTER_GET_CLASS(obj)                              \
+    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
+
+typedef struct XiveRouterClass {
+    SysBusDeviceClass parent;
+
+    /* XIVE table accessors */
+    int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
+    int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
+} XiveRouterClass;
+
+void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
+
+int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
+int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
new file mode 100644
index 000000000000..12499b33614c
--- /dev/null
+++ b/include/hw/ppc/xive_regs.h
@@ -0,0 +1,31 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2016-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_XIVE_REGS_H
+#define PPC_XIVE_REGS_H
+
+/* EAS (Event Assignment Structure)
+ *
+ * One per interrupt source. Targets an interrupt to a given Event
+ * Notification Descriptor (END) and provides the corresponding
+ * logical interrupt number (END data)
+ */
+typedef struct XiveEAS {
+        /* Use a single 64-bit definition to make it easier to
+         * perform atomic updates
+         */
+        uint64_t        w;
+#define EAS_VALID       PPC_BIT(0)
+#define EAS_END_BLOCK   PPC_BITMASK(4, 7)        /* Destination END block# */
+#define EAS_END_INDEX   PPC_BITMASK(8, 31)       /* Destination END index */
+#define EAS_MASKED      PPC_BIT(32)              /* Masked */
+#define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
+} XiveEAS;
+
+#endif /* PPC_XIVE_REGS_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 014a2e41f71f..c4c90a25758e 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -442,6 +442,91 @@ static const TypeInfo xive_source_info = {
     .class_init    = xive_source_class_init,
 };
 
+/*
+ * XIVE Router (aka. Virtualization Controller or IVRE)
+ */
+
+int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
+{
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+    return xrc->get_eas(xrtr, lisn, eas);
+}
+
+int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
+{
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+    return xrc->set_eas(xrtr, lisn, eas);
+}
+
+static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
+{
+    XiveRouter *xrtr = XIVE_ROUTER(xf);
+    XiveEAS eas;
+
+    /* EAS cache lookup */
+    if (xive_router_get_eas(xrtr, lisn, &eas)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Unknown LISN %x\n", lisn);
+        return;
+    }
+
+    /* The IVRE has a State Bit Cache for its internal sources which
+     * is also involved at this point. We skip the SBC lookup because
+     * the state bits of the sources are modeled internally in QEMU.
+     */
+
+    if (!(eas.w & EAS_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
+        return;
+    }
+
+    if (eas.w & EAS_MASKED) {
+        /* Notification completed */
+        return;
+    }
+}
+
+static Property xive_router_properties[] = {
+    DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void xive_router_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
+
+    dc->desc    = "XIVE Router Engine";
+    dc->props   = xive_router_properties;
+    xfc->notify = xive_router_notify;
+}
+
+static const TypeInfo xive_router_info = {
+    .name          = TYPE_XIVE_ROUTER,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .abstract      = true,
+    .class_size    = sizeof(XiveRouterClass),
+    .class_init    = xive_router_class_init,
+    .interfaces    = (InterfaceInfo[]) {
+        { TYPE_XIVE_FABRIC },
+        { }
+    }
+};
+
+void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
+{
+    if (!(eas->w & EAS_VALID)) {
+        return;
+    }
+
+    monitor_printf(mon, "  %08x %s end:%02x/%04x data:%08x\n",
+                   lisn, eas->w & EAS_MASKED ? "M" : " ",
+                   (uint8_t)  GETFIELD(EAS_END_BLOCK, eas->w),
+                   (uint32_t) GETFIELD(EAS_END_INDEX, eas->w),
+                   (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
+}
+
 /*
  * XIVE Fabric
  */
@@ -455,6 +540,7 @@ static void xive_register_types(void)
 {
     type_register_static(&xive_source_info);
     type_register_static(&xive_fabric_info);
+    type_register_static(&xive_router_info);
 }
 
 type_init(xive_register_types)
-- 
2.17.2

* [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (3 preceding siblings ...)
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model Cédric Le Goater
@ 2018-11-16 10:56 ` Cédric Le Goater
  2018-11-22  4:41   ` David Gibson
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers Cédric Le Goater
                   ` (30 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

To complete the event routing, the IVRE sub-engine uses an internal
table containing Event Notification Descriptor (END) structures.

An END specifies on which Event Queue (EQ) the event notification
data, defined in the associated EAS, should be posted when an
exception occurs. It also defines which Notification Virtual Target
(NVT) should be notified.

The Event Queue is a memory page provided by the O/S defining a
circular buffer, one per (server, priority) pair, containing Event
Queue entries. These are 4 bytes long, the first bit being a
'generation' bit and the 31 following bits the END Data field. They
are pulled by the O/S when the exception occurs.
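
The queue layout can be illustrated with a small host-memory model. This
is only a sketch: the EventQueue type and the eq_* helpers are
hypothetical, the queue lives in a plain array instead of guest memory,
and the big-endian conversion done by the real model is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Host-side stand-in for an Event Queue page owned by the O/S. */
typedef struct EventQueue {
    uint32_t *entries;    /* circular buffer of 4-byte entries */
    uint32_t nentries;    /* must be a power of two */
    uint32_t index;       /* next entry to be written */
    uint32_t gen;         /* current generation bit, 0 or 1 */
} EventQueue;

/* Entry layout: top bit is the generation, low 31 bits are END data. */
static uint32_t eq_entry_gen(uint32_t entry)
{
    return entry >> 31;
}

static uint32_t eq_entry_data(uint32_t entry)
{
    return entry & 0x7fffffff;
}

/* Post one event and flip the generation bit when the buffer wraps. */
static void eq_push(EventQueue *eq, uint32_t data)
{
    assert(eq->nentries != 0 && (eq->nentries & (eq->nentries - 1)) == 0);

    eq->entries[eq->index] = (eq->gen << 31) | (data & 0x7fffffff);

    eq->index = (eq->index + 1) & (eq->nentries - 1);
    if (eq->index == 0) {
        eq->gen ^= 1;
    }
}
```

Because the generation bit toggles on every wrap, a consumer can tell a
fresh entry from a stale one by comparing the entry's top bit with the
generation it expects, without any shared producer index.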

The END Data field is a way to set an invariant logical event source
number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG hcall
when the EISN flag is used.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h      |  18 ++++
 include/hw/ppc/xive_regs.h |  48 ++++++++++
 hw/intc/xive.c             | 185 ++++++++++++++++++++++++++++++++++++-
 3 files changed, 248 insertions(+), 3 deletions(-)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 5a0696366577..ce62aaf28343 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -193,11 +193,29 @@ typedef struct XiveRouterClass {
     /* XIVE table accessors */
     int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
     int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
+    int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+                   XiveEND *end);
+    int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+                   XiveEND *end);
 } XiveRouterClass;
 
 void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
 
 int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
 int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
+int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+                        XiveEND *end);
+int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+                        XiveEND *end);
+
+/*
+ * For legacy compatibility, the exceptions define up to 256 different
+ * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
+ * and the least favored level 0xFF.
+ */
+#define XIVE_PRIORITY_MAX  7
+
+void xive_end_reset(XiveEND *end);
+void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
 
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index 12499b33614c..f97fb2b90bee 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -28,4 +28,52 @@ typedef struct XiveEAS {
 #define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
 } XiveEAS;
 
+/* Event Notification Descriptor (END) */
+typedef struct XiveEND {
+        uint32_t        w0;
+#define END_W0_VALID             PPC_BIT32(0) /* "v" bit */
+#define END_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
+#define END_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
+#define END_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
+#define END_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
+#define END_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
+#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
+#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
+#define END_W0_QSIZE             PPC_BITMASK32(12, 15)
+#define END_W0_SW0               PPC_BIT32(16)
+#define END_W0_FIRMWARE          END_W0_SW0 /* Owned by FW */
+#define END_QSIZE_4K             0
+#define END_QSIZE_64K            4
+#define END_W0_HWDEP             PPC_BITMASK32(24, 31)
+        uint32_t        w1;
+#define END_W1_ESn               PPC_BITMASK32(0, 1)
+#define END_W1_ESn_P             PPC_BIT32(0)
+#define END_W1_ESn_Q             PPC_BIT32(1)
+#define END_W1_ESe               PPC_BITMASK32(2, 3)
+#define END_W1_ESe_P             PPC_BIT32(2)
+#define END_W1_ESe_Q             PPC_BIT32(3)
+#define END_W1_GENERATION        PPC_BIT32(9)
+#define END_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
+        uint32_t        w2;
+#define END_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
+#define END_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
+        uint32_t        w3;
+#define END_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
+        uint32_t        w4;
+#define END_W4_ESC_END_BLOCK     PPC_BITMASK32(4, 7)
+#define END_W4_ESC_END_INDEX     PPC_BITMASK32(8, 31)
+        uint32_t        w5;
+#define END_W5_ESC_END_DATA      PPC_BITMASK32(1, 31)
+        uint32_t        w6;
+#define END_W6_FORMAT_BIT        PPC_BIT32(8)
+#define END_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
+#define END_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
+        uint32_t        w7;
+#define END_W7_F0_IGNORE         PPC_BIT32(0)
+#define END_W7_F0_BLK_GROUPING   PPC_BIT32(1)
+#define END_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
+#define END_W7_F1_WAKEZ          PPC_BIT32(0)
+#define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
+} XiveEND;
+
 #endif /* PPC_XIVE_REGS_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index c4c90a25758e..9cb001e7b540 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -442,6 +442,101 @@ static const TypeInfo xive_source_info = {
     .class_init    = xive_source_class_init,
 };
 
+/*
+ * XiveEND helpers
+ */
+
+void xive_end_reset(XiveEND *end)
+{
+    memset(end, 0, sizeof(*end));
+
+    /* switch off the escalation and notification ESBs */
+    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
+}
+
+static void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width,
+                                          Monitor *mon)
+{
+    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
+    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
+    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
+    uint32_t qentries = 1 << (qsize + 10);
+    int i;
+
+    /*
+     * print out the [ (qindex - (width - 1)) .. (qindex + 1)] window
+     */
+    monitor_printf(mon, " [ ");
+    qindex = (qindex - (width - 1)) & (qentries - 1);
+    for (i = 0; i < width; i++) {
+        uint64_t qaddr = qaddr_base + (qindex << 2);
+        uint32_t qdata = -1;
+
+        if (dma_memory_read(&address_space_memory, qaddr, &qdata,
+                            sizeof(qdata))) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ @0x%"
+                          HWADDR_PRIx "\n", qaddr);
+            return;
+        }
+        monitor_printf(mon, "%s%08x ", i == width - 1 ? "^" : "",
+                       be32_to_cpu(qdata));
+        qindex = (qindex + 1) & (qentries - 1);
+    }
+    monitor_printf(mon, "]\n");
+}
+
+void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon)
+{
+    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
+    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
+    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
+    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
+    uint32_t qentries = 1 << (qsize + 10);
+
+    uint32_t nvt = GETFIELD(END_W6_NVT_INDEX, end->w6);
+    uint8_t priority = GETFIELD(END_W7_F0_PRIORITY, end->w7);
+
+    if (!(end->w0 & END_W0_VALID)) {
+        return;
+    }
+
+    monitor_printf(mon, "  %08x %c%c%c%c%c prio:%d nvt:%04x eq:@%08"PRIx64
+                   "% 6d/%5d ^%d", end_idx,
+                   end->w0 & END_W0_VALID ? 'v' : '-',
+                   end->w0 & END_W0_ENQUEUE ? 'q' : '-',
+                   end->w0 & END_W0_UCOND_NOTIFY ? 'n' : '-',
+                   end->w0 & END_W0_BACKLOG ? 'b' : '-',
+                   end->w0 & END_W0_ESCALATE_CTL ? 'e' : '-',
+                   priority, nvt, qaddr_base, qindex, qentries, qgen);
+
+    xive_end_queue_pic_print_info(end, 6, mon);
+}
+
+static void xive_end_push(XiveEND *end, uint32_t data)
+{
+    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
+    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
+    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
+    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
+
+    uint64_t qaddr = qaddr_base + (qindex << 2);
+    uint32_t qdata = cpu_to_be32((qgen << 31) | (data & 0x7fffffff));
+    uint32_t qentries = 1 << (qsize + 10);
+
+    if (dma_memory_write(&address_space_memory, qaddr, &qdata, sizeof(qdata))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write END data @0x%"
+                      HWADDR_PRIx "\n", qaddr);
+        return;
+    }
+
+    qindex = (qindex + 1) & (qentries - 1);
+    if (qindex == 0) {
+        qgen ^= 1;
+        end->w1 = SETFIELD(END_W1_GENERATION, end->w1, qgen);
+    }
+    end->w1 = SETFIELD(END_W1_PAGE_OFF, end->w1, qindex);
+}
+
 /*
  * XIVE Router (aka. Virtualization Controller or IVRE)
  */
@@ -460,6 +555,82 @@ int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
     return xrc->set_eas(xrtr, lisn, eas);
 }
 
+int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+                        XiveEND *end)
+{
+   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+   return xrc->get_end(xrtr, end_blk, end_idx, end);
+}
+
+int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
+                        XiveEND *end)
+{
+   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+   return xrc->set_end(xrtr, end_blk, end_idx, end);
+}
+
+/*
+ * An END trigger can come from an event trigger (IPI or HW) or from
+ * another chip. We don't model the PowerBus but the END trigger
+ * message has the same parameters as the function below.
+ */
+static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
+                                   uint32_t end_idx, uint32_t end_data)
+{
+    XiveEND end;
+    uint8_t priority;
+    uint8_t format;
+
+    /* END cache lookup */
+    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
+                      end_idx);
+        return;
+    }
+
+    if (!(end.w0 & END_W0_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
+                      end_blk, end_idx);
+        return;
+    }
+
+    if (end.w0 & END_W0_ENQUEUE) {
+        xive_end_push(&end, end_data);
+        xive_router_set_end(xrtr, end_blk, end_idx, &end);
+    }
+
+    /*
+     * The W7 format depends on the F bit in W6. It defines the type
+     * of the notification :
+     *
+     *   F=0 : single or multiple NVT notification
+     *   F=1 : User level Event-Based Branch (EBB) notification, no
+     *         priority
+     */
+    format = GETFIELD(END_W6_FORMAT_BIT, end.w6);
+    priority = GETFIELD(END_W7_F0_PRIORITY, end.w7);
+
+    /* The END is masked */
+    if (format == 0 && priority == 0xff) {
+        return;
+    }
+
+    /*
+     * Check the END ESn (Event State Buffer for notification) for
+     * even further coalescing in the Router
+     */
+    if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
+        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
+        return;
+    }
+
+    /*
+     * Follows IVPE notification
+     */
+}
+
 static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
 {
     XiveRouter *xrtr = XIVE_ROUTER(xf);
@@ -471,9 +642,9 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
         return;
     }
 
-    /* The IVRE has a State Bit Cache for its internal sources which
-     * is also involed at this point. We skip the SBC lookup because
-     * the state bits of the sources are modeled internally in QEMU.
+    /* The IVRE checks the State Bit Cache at this point. We skip the
+     * SBC lookup because the state bits of the sources are modeled
+     * internally in QEMU.
      */
 
     if (!(eas.w & EAS_VALID)) {
@@ -485,6 +656,14 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
         /* Notification completed */
         return;
     }
+
+    /*
+     * The event trigger becomes an END trigger
+     */
+    xive_router_end_notify(xrtr,
+                           GETFIELD(EAS_END_BLOCK, eas.w),
+                           GETFIELD(EAS_END_INDEX, eas.w),
+                           GETFIELD(EAS_END_DATA,  eas.w));
 }
 
 static Property xive_router_properties[] = {
-- 
2.17.2

* [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (4 preceding siblings ...)
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors Cédric Le Goater
@ 2018-11-16 10:56 ` Cédric Le Goater
  2018-11-22  5:13   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context Cédric Le Goater
                   ` (29 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:56 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The Event Notification Descriptor also contains two Event State
Buffers providing further coalescing of interrupts, one for the
notification event (ESn) and one for the escalation events (ESe). An
MMIO page is assigned to each one to control the EOI through loads
only. Stores are not allowed.
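
The page assignment described above can be sketched as follows. This is
only an illustration, assuming the even/odd page pairing used by the patch
below (even page for ESn, odd page for ESe) and a 4K operation offset
within each page. The `EndEsbAddr` type and the `end_esb_decode()` helper
are hypothetical names and are not part of QEMU.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of an END ESB MMIO address */
typedef struct {
    uint32_t end_idx;   /* index of the END owning the page pair */
    bool     is_esn;    /* true: even page (ESn), false: odd page (ESe) */
    uint32_t offset;    /* offset selecting the ESB load operation */
} EndEsbAddr;

/*
 * Each END owns two consecutive pages of (1 << esb_shift) bytes, so the
 * END index is the address divided by the size of a page pair and the
 * low bit of the page number selects the ES field.
 */
static EndEsbAddr end_esb_decode(uint64_t addr, uint32_t esb_shift)
{
    EndEsbAddr d;

    d.end_idx = (uint32_t)(addr >> (esb_shift + 1));
    d.is_esn  = ((addr >> esb_shift) & 1) == 0;
    d.offset  = (uint32_t)(addr & 0xFFF);
    return d;
}
```

With 64K pages (a shift of 16), address 0x30800 would decode to END 1,
odd page (ESe), operation offset 0x800.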

The END ESBs are modeled through an object resembling the 'XiveSource'.
It is stateless, as the END state bits are stored in the XiveEND
structure under the XiveRouter, and the MMIO accesses follow the same
rules as for the standard source ESBs.
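
As a reminder of what "the same rules" means, here is a minimal sketch
of the two-bit PQ state machine behind an ESB. It assumes the usual XIVE
encoding (P is the high bit, Q the low bit) and is not copied from the
helpers introduced earlier in the series; `pq_trigger()` and `pq_eoi()`
are hypothetical names standing in for them.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encoding: P is bit 1, Q is bit 0 */
enum { PQ_RESET = 0x0, PQ_OFF = 0x1, PQ_PENDING = 0x2, PQ_QUEUED = 0x3 };

/* Returns true when the trigger must be forwarded (notification) */
static bool pq_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case PQ_RESET:
        *pq = PQ_PENDING;       /* first event: notify */
        return true;
    case PQ_PENDING:
    case PQ_QUEUED:
        *pq = PQ_QUEUED;        /* coalesce further events */
        return false;
    default:                    /* PQ_OFF: source is masked */
        return false;
    }
}

/* Returns true when a coalesced event must be replayed after the EOI */
static bool pq_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case PQ_RESET:
    case PQ_PENDING:
        *pq = PQ_RESET;
        return false;
    case PQ_QUEUED:
        *pq = PQ_PENDING;       /* an event was queued meanwhile */
        return true;
    default:                    /* PQ_OFF: stays masked */
        return false;
    }
}
```

In this sketch, two triggers followed by one EOI leave the ESB in the
pending state and report that a notification should be replayed.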

END ESBs are not supported by the Linux drivers, neither on OPAL nor on
sPAPR. Nevertheless, the model provides a means to study the question
in the future and validates the XIVE model a bit more.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h |  20 ++++++
 hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 178 insertions(+), 2 deletions(-)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index ce62aaf28343..24301bf2076d 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
 int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
                         XiveEND *end);
 
+/*
+ * XIVE END ESBs
+ */
+
+#define TYPE_XIVE_END_SOURCE "xive-end-source"
+#define XIVE_END_SOURCE(obj) \
+    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
+
+typedef struct XiveENDSource {
+    SysBusDevice parent;
+
+    uint32_t        nr_ends;
+
+    /* ESB memory region */
+    uint32_t        esb_shift;
+    MemoryRegion    esb_mmio;
+
+    XiveRouter      *xrtr;
+} XiveENDSource;
+
 /*
  * For legacy compatibility, the exceptions define up to 256 different
  * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 9cb001e7b540..5a8882d47a98 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
      * even further coalescing in the Router
      */
     if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
-        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
-        return;
+        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
+        bool notify = xive_esb_trigger(&pq);
+
+        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
+            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
+            xive_router_set_end(xrtr, end_blk, end_idx, &end);
+        }
+
+        /* ESn[Q]=1 : end of notification */
+        if (!notify) {
+            return;
+        }
     }
 
     /*
@@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
                    (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
 }
 
+/*
+ * END ESB MMIO loads
+ */
+static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
+{
+    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
+    XiveRouter *xrtr = xsrc->xrtr;
+    uint32_t offset = addr & 0xFFF;
+    uint8_t end_blk;
+    uint32_t end_idx;
+    XiveEND end;
+    uint32_t end_esmask;
+    uint8_t pq;
+    uint64_t ret = -1;
+
+    end_blk = xrtr->chip_id;
+    end_idx = addr >> (xsrc->esb_shift + 1);
+    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
+                      end_idx);
+        return -1;
+    }
+
+    if (!(end.w0 & END_W0_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
+                      end_blk, end_idx);
+        return -1;
+    }
+
+    end_esmask = addr_is_even(addr, xsrc->esb_shift) ? END_W1_ESn : END_W1_ESe;
+    pq = GETFIELD(end_esmask, end.w1);
+
+    switch (offset) {
+    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
+        ret = xive_esb_eoi(&pq);
+
+        /* Forward the source event notification for routing ?? */
+        break;
+
+    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
+        ret = pq;
+        break;
+
+    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
+    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
+    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
+    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
+        ret = xive_esb_set(&pq, (offset >> 8) & 0x3);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid END ESB load addr %x\n",
+                      offset);
+        return -1;
+    }
+
+    if (pq != GETFIELD(end_esmask, end.w1)) {
+        end.w1 = SETFIELD(end_esmask, end.w1, pq);
+        xive_router_set_end(xrtr, end_blk, end_idx, &end);
+    }
+
+    return ret;
+}
+
+/*
+ * END ESB MMIO stores are invalid
+ */
+static void xive_end_source_write(void *opaque, hwaddr addr,
+                                  uint64_t value, unsigned size)
+{
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr 0x%"
+                  HWADDR_PRIx"\n", addr);
+}
+
+static const MemoryRegionOps xive_end_source_ops = {
+    .read = xive_end_source_read,
+    .write = xive_end_source_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+static void xive_end_source_realize(DeviceState *dev, Error **errp)
+{
+    XiveENDSource *xsrc = XIVE_END_SOURCE(dev);
+    Object *obj;
+    Error *local_err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'xive' not found: ");
+        return;
+    }
+
+    xsrc->xrtr = XIVE_ROUTER(obj);
+
+    if (!xsrc->nr_ends) {
+        error_setg(errp, "Number of ENDs needs to be greater than 0");
+        return;
+    }
+
+    if (xsrc->esb_shift != XIVE_ESB_4K &&
+        xsrc->esb_shift != XIVE_ESB_64K) {
+        error_setg(errp, "Invalid ESB shift setting");
+        return;
+    }
+
+    /*
+     * Each END is assigned an even/odd pair of MMIO pages, the even page
+     * manages the ESn field while the odd page manages the ESe field.
+     */
+    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
+                          &xive_end_source_ops, xsrc, "xive.end",
+                          (1ull << (xsrc->esb_shift + 1)) * xsrc->nr_ends);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
+}
+
+static Property xive_end_source_properties[] = {
+    DEFINE_PROP_UINT32("nr-ends", XiveENDSource, nr_ends, 0),
+    DEFINE_PROP_UINT32("shift", XiveENDSource, esb_shift, XIVE_ESB_64K),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void xive_end_source_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->desc    = "XIVE END Source";
+    dc->props   = xive_end_source_properties;
+    dc->realize = xive_end_source_realize;
+}
+
+static const TypeInfo xive_end_source_info = {
+    .name          = TYPE_XIVE_END_SOURCE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .instance_size = sizeof(XiveENDSource),
+    .class_init    = xive_end_source_class_init,
+};
+
 /*
  * XIVE Fabric
  */
@@ -720,6 +875,7 @@ static void xive_register_types(void)
     type_register_static(&xive_source_info);
     type_register_static(&xive_fabric_info);
     type_register_static(&xive_router_info);
+    type_register_static(&xive_end_source_info);
 }
 
 type_init(xive_register_types)
-- 
2.17.2

* [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (5 preceding siblings ...)
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-23  5:08   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter Cédric Le Goater
                   ` (28 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Each POWER9 processor chip has a XIVE presenter that can generate four
different exceptions to its threads:

  - hypervisor exception
  - O/S exception
  - Event-Based Branch (EBB)
  - msgsnd (doorbell)

Each exception has its own state, independent from the others, called a
Thread Interrupt Management context. This context is a set of registers
which lets the thread handle priority management and interrupt
acknowledgment, among other things. The most important registers are:

  - Interrupt Priority Register  (PIPR)
  - Interrupt Pending Buffer     (IPB)
  - Current Processor Priority   (CPPR)
  - Notification Source Register (NSR)

These registers are accessible through a specific MMIO region, called
the Thread Interrupt Management Area (TIMA), made of four aligned pages,
each exposing a different view of the registers. The first page (page
address ending in 0b00) gives access to the entire context and is
reserved for the ring 0 security monitor. The second (page address
ending in 0b01) is for the hypervisor, ring 1. The third (page address
ending in 0b10) is for the operating system, ring 2. The fourth (page
address ending in 0b11) is for user level, ring 3.
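
The page selection described above can be sketched as follows. This is a
rough illustration, assuming 64K pages (a page shift of 16, as TM_SHIFT
in the patch below) and the 2K split between raw and "special" accesses
used by the MMIO handlers. The `TimaPage` type and the `tima_*()` helpers
are hypothetical names chosen for the example.

```c
#include <stdbool.h>
#include <stdint.h>

#define TIMA_PAGE_SHIFT 16      /* assumption: 64K TIMA pages */

/* One view per page, from most to least privileged */
typedef enum {
    TIMA_HW   = 0,              /* ring 0, security monitor */
    TIMA_HV   = 1,              /* ring 1, hypervisor */
    TIMA_OS   = 2,              /* ring 2, operating system */
    TIMA_USER = 3,              /* ring 3, user level */
} TimaPage;

/* The two low bits of the page number select the view */
static TimaPage tima_page(uint64_t offset)
{
    return (TimaPage)((offset >> TIMA_PAGE_SHIFT) & 0x3);
}

/* Each ring (QW0..QW3) covers 16 bytes of the 64-byte context */
static uint8_t tima_ring(uint64_t offset)
{
    return (uint8_t)(offset & 0x30);
}

/* Accesses at or above 2K in a page are "special" operations */
static bool tima_is_special(uint64_t offset)
{
    return (offset & 0x800) != 0;
}
```

For instance, offset 0x20810 falls in the OS page, targets the QW1 ring
(0x10) and is a special operation.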

The thread interrupt context is modeled with a XiveTCTX object
containing the values of the different exception registers. The TIMA
region is mapped at the same address for each CPU.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h      |  36 +++
 include/hw/ppc/xive_regs.h |  82 +++++++
 hw/intc/xive.c             | 443 +++++++++++++++++++++++++++++++++++++
 3 files changed, 561 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 24301bf2076d..5987f26ddb98 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -238,4 +238,40 @@ typedef struct XiveENDSource {
 void xive_end_reset(XiveEND *end);
 void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
 
+/*
+ * XIVE Thread interrupt Management (TM) context
+ */
+
+#define TYPE_XIVE_TCTX "xive-tctx"
+#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
+
+/*
+ * XIVE Thread interrupt Management register rings :
+ *
+ *   QW-0  User       event-based exception state
+ *   QW-1  O/S        OS context for priority management, interrupt acks
+ *   QW-2  Pool       hypervisor context for virtual processor being dispatched
+ *   QW-3  Physical   for the security monitor to manage the entire context
+ */
+#define TM_RING_COUNT           4
+#define TM_RING_SIZE            0x10
+
+typedef struct XiveTCTX {
+    DeviceState parent_obj;
+
+    CPUState    *cs;
+    qemu_irq    output;
+
+    uint8_t     regs[TM_RING_COUNT * TM_RING_SIZE];
+
+    XiveRouter  *xrtr;
+} XiveTCTX;
+
+/*
+ * XIVE Thread Interrupt Management Area (TIMA)
+ */
+extern const MemoryRegionOps xive_tm_ops;
+
+void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index f97fb2b90bee..2e3d6cb507da 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -10,6 +10,88 @@
 #ifndef PPC_XIVE_REGS_H
 #define PPC_XIVE_REGS_H
 
+#define TM_SHIFT                16
+
+/* TM register offsets */
+#define TM_QW0_USER             0x000 /* All rings */
+#define TM_QW1_OS               0x010 /* Ring 0..2 */
+#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
+#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
+
+/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
+#define TM_NSR                  0x0  /*  +   +   -   +  */
+#define TM_CPPR                 0x1  /*  -   +   -   +  */
+#define TM_IPB                  0x2  /*  -   +   +   +  */
+#define TM_LSMFB                0x3  /*  -   +   +   +  */
+#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
+#define TM_INC                  0x5  /*  -   +   -   +  */
+#define TM_AGE                  0x6  /*  -   +   -   +  */
+#define TM_PIPR                 0x7  /*  -   +   -   +  */
+
+#define TM_WORD0                0x0
+#define TM_WORD1                0x4
+
+/*
+ * QW word 2 contains the valid bit at the top and other fields
+ * depending on the QW.
+ */
+#define TM_WORD2                0x8
+#define   TM_QW0W2_VU           PPC_BIT32(0)
+#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
+#define   TM_QW1W2_VO           PPC_BIT32(0)
+#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
+#define   TM_QW2W2_VP           PPC_BIT32(0)
+#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
+#define   TM_QW3W2_VT           PPC_BIT32(0)
+#define   TM_QW3W2_LP           PPC_BIT32(6)
+#define   TM_QW3W2_LE           PPC_BIT32(7)
+#define   TM_QW3W2_T            PPC_BIT32(31)
+
+/*
+ * In addition to normal loads to "peek" and writes (only when invalid)
+ * using 4 and 8 bytes accesses, the above registers support these
+ * "special" byte operations:
+ *
+ *   - Byte load from QW0[NSR] - User level NSR (EBB)
+ *   - Byte store to QW0[NSR] - User level NSR (EBB)
+ *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
+ *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
+ *                                    otherwise VT||0000000
+ *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
+ *
+ * Then we have all these "special" CI ops at these offsets that trigger
+ * all sorts of side effects:
+ */
+#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg */
+#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
+#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
+#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
+                                         * context */
+#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
+#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
+                                         * context to reg */
+#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
+                                         * context to reg */
+#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
+#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
+                                         * line */
+#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
+#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
+                                         * line */
+#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
+/* XXX more... */
+
+/* NSR fields for the various QW ack types */
+#define TM_QW0_NSR_EB           PPC_BIT8(0)
+#define TM_QW1_NSR_EO           PPC_BIT8(0)
+#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
+#define  TM_QW3_NSR_HE_NONE     0
+#define  TM_QW3_NSR_HE_POOL     1
+#define  TM_QW3_NSR_HE_PHYS     2
+#define  TM_QW3_NSR_HE_LSI      3
+#define TM_QW3_NSR_I            PPC_BIT8(2)
+#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
+
 /* EAS (Event Assignment Structure)
  *
  * One per interrupt source. Targets an interrupt to a given Event
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 5a8882d47a98..4c6cb5d52975 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -15,6 +15,448 @@
 #include "sysemu/dma.h"
 #include "monitor/monitor.h"
 #include "hw/ppc/xive.h"
+#include "hw/ppc/xive_regs.h"
+
+/*
+ * XIVE Thread Interrupt Management context
+ */
+
+static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
+{
+    return 0;
+}
+
+static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
+{
+    if (cppr > XIVE_PRIORITY_MAX) {
+        cppr = 0xff;
+    }
+
+    tctx->regs[ring + TM_CPPR] = cppr;
+}
+
+/*
+ * XIVE Thread Interrupt Management Area (TIMA)
+ *
+ * This region gives access to the registers of the thread interrupt
+ * management context. It is four pages wide, each page providing a
+ * different view of the registers. The page with the lower offset is
+ * the most privileged and gives access to the entire context.
+ */
+
+#define XIVE_TM_HW_PAGE   0x0
+#define XIVE_TM_HV_PAGE   0x1
+#define XIVE_TM_OS_PAGE   0x2
+#define XIVE_TM_USER_PAGE 0x3
+
+/*
+ * Define an access map for each page of the TIMA that we will use in
+ * the memory region ops to filter values when doing loads and stores
+ * of raw registers values
+ *
+ * Registers accessibility bits :
+ *
+ *    0x0 - no access
+ *    0x1 - write only
+ *    0x2 - read only
+ *    0x3 - read/write
+ */
+
+static const uint8_t xive_tm_hw_view[] = {
+    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-1 OS   */   3, 3, 3, 3,   3, 3, 0, 3,   3, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-2 HV   */   0, 0, 3, 3,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-3 HW   */   3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   3, 3, 3, 0,
+};
+
+static const uint8_t xive_tm_hv_view[] = {
+    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-1 OS   */   3, 3, 3, 3,   3, 3, 0, 3,   3, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-2 HV   */   0, 0, 3, 3,   0, 0, 0, 0,   0, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-3 HW   */   3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   0, 0, 0, 0,
+};
+
+static const uint8_t xive_tm_os_view[] = {
+    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
+    /* QW-1 OS   */   2, 3, 2, 2,   2, 2, 0, 2,   0, 0, 0, 0,   0, 0, 0, 0,
+    /* QW-2 HV   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+    /* QW-3 HW   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 3, 3, 0,
+};
+
+static const uint8_t xive_tm_user_view[] = {
+    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+    /* QW-1 OS   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+    /* QW-2 HV   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+    /* QW-3 HW   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
+};
+
+/*
+ * Overall TIMA access map for the thread interrupt management context
+ * registers
+ */
+static const uint8_t *xive_tm_views[] = {
+    [XIVE_TM_HW_PAGE]   = xive_tm_hw_view,
+    [XIVE_TM_HV_PAGE]   = xive_tm_hv_view,
+    [XIVE_TM_OS_PAGE]   = xive_tm_os_view,
+    [XIVE_TM_USER_PAGE] = xive_tm_user_view,
+};
+
+/*
+ * Computes a register access mask for a given offset in the TIMA
+ */
+static uint64_t xive_tm_mask(hwaddr offset, unsigned size, bool write)
+{
+    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
+    uint8_t reg_offset = offset & 0x3F;
+    uint8_t reg_mask = write ? 0x1 : 0x2;
+    uint64_t mask = 0x0;
+    int i;
+
+    for (i = 0; i < size; i++) {
+        if (xive_tm_views[page_offset][reg_offset + i] & reg_mask) {
+            mask |= (uint64_t) 0xff << (8 * (size - i - 1));
+        }
+    }
+
+    return mask;
+}
+
+static void xive_tm_raw_write(XiveTCTX *tctx, hwaddr offset, uint64_t value,
+                              unsigned size)
+{
+    uint8_t ring_offset = offset & 0x30;
+    uint8_t reg_offset = offset & 0x3F;
+    uint64_t mask = xive_tm_mask(offset, size, true);
+    int i;
+
+    /*
+     * Only 4- or 8-byte stores are allowed and the User ring is
+     * excluded
+     */
+    if (size < 4 || !mask || ring_offset == TM_QW0_USER) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid write access at TIMA @%"
+                      HWADDR_PRIx"\n", offset);
+        return;
+    }
+
+    /*
+     * Use the register offset for the raw values and filter out
+     * reserved values
+     */
+    for (i = 0; i < size; i++) {
+        uint8_t byte_mask = (mask >> (8 * (size - i - 1)));
+        if (byte_mask) {
+            tctx->regs[reg_offset + i] = (value >> (8 * (size - i - 1))) &
+                byte_mask;
+        }
+    }
+}
+
+static uint64_t xive_tm_raw_read(XiveTCTX *tctx, hwaddr offset, unsigned size)
+{
+    uint8_t ring_offset = offset & 0x30;
+    uint8_t reg_offset = offset & 0x3F;
+    uint64_t mask = xive_tm_mask(offset, size, false);
+    uint64_t ret;
+    int i;
+
+    /*
+     * Only 4- or 8-byte loads are allowed and the User ring is
+     * excluded
+     */
+    if (size < 4 || !mask || ring_offset == TM_QW0_USER) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid read access at TIMA @%"
+                      HWADDR_PRIx"\n", offset);
+        return -1;
+    }
+
+    /* Use the register offset for the raw values */
+    ret = 0;
+    for (i = 0; i < size; i++) {
+        ret |= (uint64_t) tctx->regs[reg_offset + i] << (8 * (size - i - 1));
+    }
+
+    /* filter out reserved values */
+    return ret & mask;
+}
+
+/*
+ * The TM context is mapped twice within each page. Stores and loads
+ * to the first mapping below 2K write and read the specified values
+ * without modification. The second mapping above 2K performs specific
+ * state changes (side effects) in addition to setting/returning the
+ * interrupt management area context of the processor thread.
+ */
+static uint64_t xive_tm_ack_os_reg(XiveTCTX *tctx, hwaddr offset, unsigned size)
+{
+    return xive_tctx_accept(tctx, TM_QW1_OS);
+}
+
+static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
+                                uint64_t value, unsigned size)
+{
+    xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
+}
+
+/*
+ * Define a mapping of "special" operations depending on the TIMA page
+ * offset and the size of the operation.
+ */
+typedef struct XiveTmOp {
+    uint8_t  page_offset;
+    uint32_t op_offset;
+    unsigned size;
+    void     (*write_handler)(XiveTCTX *tctx, hwaddr offset, uint64_t value,
+                              unsigned size);
+    uint64_t (*read_handler)(XiveTCTX *tctx, hwaddr offset, unsigned size);
+} XiveTmOp;
+
+static const XiveTmOp xive_tm_operations[] = {
+    /*
+     * MMIOs below 2K : raw values and special operations without side
+     * effects
+     */
+    { XIVE_TM_OS_PAGE, TM_QW1_OS + TM_CPPR,   1, xive_tm_set_os_cppr, NULL },
+
+    /* MMIOs above 2K : special operations with side effects */
+    { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
+};
+
+static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
+{
+    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
+    uint32_t op_offset = offset & 0xFFF;
+    int i;
+
+    for (i = 0; i < ARRAY_SIZE(xive_tm_operations); i++) {
+        const XiveTmOp *xto = &xive_tm_operations[i];
+
+        /* Accesses done from a more privileged TIMA page are allowed */
+        if (xto->page_offset >= page_offset &&
+            xto->op_offset == op_offset &&
+            xto->size == size &&
+            ((write && xto->write_handler) || (!write && xto->read_handler))) {
+            return xto;
+        }
+    }
+    return NULL;
+}
+
+/*
+ * TIMA MMIO handlers
+ */
+static void xive_tm_write(void *opaque, hwaddr offset,
+                          uint64_t value, unsigned size)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
+    const XiveTmOp *xto;
+
+    /*
+     * TODO: check V bit in Q[0-3]W2, check PTER bit associated with CPU
+     */
+
+    /*
+     * First, check for special operations in the 2K region
+     */
+    if (offset & 0x800) {
+        xto = xive_tm_find_op(offset, size, true);
+        if (!xto) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid write access at TIMA"
+                          " @%"HWADDR_PRIx"\n", offset);
+        } else {
+            xto->write_handler(tctx, offset, value, size);
+        }
+        return;
+    }
+
+    /*
+     * Then, for special operations in the region below 2K.
+     */
+    xto = xive_tm_find_op(offset, size, true);
+    if (xto) {
+        xto->write_handler(tctx, offset, value, size);
+        return;
+    }
+
+    /*
+     * Finish with raw access to the register values
+     */
+    xive_tm_raw_write(tctx, offset, value, size);
+}
+
+static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
+    const XiveTmOp *xto;
+
+    /*
+     * TODO: check V bit in Q[0-3]W2, check PTER bit associated with CPU
+     */
+
+    /*
+     * First, check for special operations in the 2K region
+     */
+    if (offset & 0x800) {
+        xto = xive_tm_find_op(offset, size, false);
+        if (!xto) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid read access to TIMA"
+                          " @%"HWADDR_PRIx"\n", offset);
+            return -1;
+        }
+        return xto->read_handler(tctx, offset, size);
+    }
+
+    /*
+     * Then, for special operations in the region below 2K.
+     */
+    xto = xive_tm_find_op(offset, size, false);
+    if (xto) {
+        return xto->read_handler(tctx, offset, size);
+    }
+
+    /*
+     * Finish with raw access to the register values
+     */
+    return xive_tm_raw_read(tctx, offset, size);
+}
+
+const MemoryRegionOps xive_tm_ops = {
+    .read = xive_tm_read,
+    .write = xive_tm_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+static char *xive_tctx_ring_print(uint8_t *ring)
+{
+    uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
+
+    return g_strdup_printf("%02x   %02x  %02x    %02x   %02x  "
+                   "%02x  %02x   %02x  %08x",
+                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
+                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
+                   w2);
+}
+
+static const struct {
+    uint8_t    qw;
+    const char *name;
+} xive_tctx_ring_infos[TM_RING_COUNT] = {
+    { TM_QW3_HV_PHYS, "HW"   },
+    { TM_QW2_HV_POOL, "HV"   },
+    { TM_QW1_OS,      "OS"   },
+    { TM_QW0_USER,    "USER" },
+};
+
+void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
+{
+    int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
+    int i;
+
+    monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
+                   "  W2\n", cpu_index);
+
+    for (i = 0; i < TM_RING_COUNT; i++) {
+        char *s = xive_tctx_ring_print(&tctx->regs[xive_tctx_ring_infos[i].qw]);
+        monitor_printf(mon, "CPU[%04x]: %4s    %s\n", cpu_index,
+                       xive_tctx_ring_infos[i].name, s);
+        g_free(s);
+    }
+}
+
+static void xive_tctx_reset(void *dev)
+{
+    XiveTCTX *tctx = XIVE_TCTX(dev);
+
+    memset(tctx->regs, 0, sizeof(tctx->regs));
+
+    /* Set some defaults */
+    tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
+    tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
+    tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
+}
+
+static void xive_tctx_realize(DeviceState *dev, Error **errp)
+{
+    XiveTCTX *tctx = XIVE_TCTX(dev);
+    PowerPCCPU *cpu;
+    CPUPPCState *env;
+    Object *obj;
+    Error *local_err = NULL;
+
+    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'xive' not found: ");
+        return;
+    }
+    tctx->xrtr = XIVE_ROUTER(obj);
+
+    obj = object_property_get_link(OBJECT(dev), "cpu", &local_err);
+    if (!obj) {
+        error_propagate(errp, local_err);
+        error_prepend(errp, "required link 'cpu' not found: ");
+        return;
+    }
+
+    cpu = POWERPC_CPU(obj);
+    tctx->cs = CPU(obj);
+
+    env = &cpu->env;
+    switch (PPC_INPUT(env)) {
+    case PPC_FLAGS_INPUT_POWER7:
+        tctx->output = env->irq_inputs[POWER7_INPUT_INT];
+        break;
+
+    default:
+        error_setg(errp, "XIVE interrupt controller does not support "
+                   "this CPU bus model");
+        return;
+    }
+
+    qemu_register_reset(xive_tctx_reset, dev);
+}
+
+static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
+{
+    qemu_unregister_reset(xive_tctx_reset, dev);
+}
+
+static const VMStateDescription vmstate_xive_tctx = {
+    .name = TYPE_XIVE_TCTX,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_BUFFER(regs, XiveTCTX),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static void xive_tctx_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+
+    dc->realize = xive_tctx_realize;
+    dc->unrealize = xive_tctx_unrealize;
+    dc->desc = "XIVE Interrupt Thread Context";
+    dc->vmsd = &vmstate_xive_tctx;
+}
+
+static const TypeInfo xive_tctx_info = {
+    .name          = TYPE_XIVE_TCTX,
+    .parent        = TYPE_DEVICE,
+    .instance_size = sizeof(XiveTCTX),
+    .class_init    = xive_tctx_class_init,
+};
 
 /*
  * XIVE ESB helpers
@@ -876,6 +1318,7 @@ static void xive_register_types(void)
     type_register_static(&xive_fabric_info);
     type_register_static(&xive_router_info);
     type_register_static(&xive_end_source_info);
+    type_register_static(&xive_tctx_info);
 }
 
 type_init(xive_register_types)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (6 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-27 23:49   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged Cédric Le Goater
                   ` (27 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The last sub-engine of the XIVE architecture is the Interrupt
Virtualization Presentation Engine (IVPE). On HW, the IVRE and the
IVPE share elements (the Power Bus interface (CQ) and the routing
table descriptors) and can be combined in the same HW logic. We do the
same in QEMU and combine both engines in the XiveRouter for simplicity.

When the IVRE has completed its job of matching an event source with a
Notification Virtual Target (NVT) to notify, it forwards the event
notification to the IVPE sub-engine. The IVPE scans the thread
interrupt contexts of the Notification Virtual Targets (NVT)
dispatched on the HW processor threads and if a match is found, it
signals the thread. If not, the IVPE escalates the notification to
some other targets and records the notification in a backlog queue.

The IVPE maintains the thread interrupt context state for each of its
NVTs not dispatched on HW processor threads in the Notification
Virtual Target table (NVTT).

The model currently only supports single NVT notifications.
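
For illustration only, the single NVT match described above can be
sketched as below. This is a standalone sketch and not the code of the
patch: struct ring_ctx and find_dispatched() are hypothetical names,
and the 19-bit limit on the NVT index is an assumption derived from
the shift used in xive_tctx_cam_line().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for one ring of a thread interrupt context:
 * a valid bit and the CAM value programmed in word 2. */
struct ring_ctx {
    bool valid;
    uint32_t cam;
};

/* Same packing as xive_tctx_cam_line(): block above a 19-bit index
 * (the 19-bit width is an assumption taken from the shift). */
static uint32_t cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
{
    assert(nvt_idx < (1u << 19));
    return ((uint32_t)nvt_blk << 19) | nvt_idx;
}

/* Scan all contexts for the NVT. Returns the index of the single
 * match, -1 if the NVT is not dispatched, -2 on a duplicate match
 * (which the model reports as an error). */
static int find_dispatched(const struct ring_ctx *ctx, int n,
                           uint8_t nvt_blk, uint32_t nvt_idx)
{
    uint32_t cam = cam_line(nvt_blk, nvt_idx);
    int found = -1;

    for (int i = 0; i < n; i++) {
        if (ctx[i].valid && ctx[i].cam == cam) {
            if (found != -1) {
                return -2;
            }
            found = i;
        }
    }
    return found;
}
```

A context whose valid bit is clear never matches, even if its CAM
value is equal to the one being looked up.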

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/xive.h      |  13 +++
 include/hw/ppc/xive_regs.h |  22 ++++
 hw/intc/xive.c             | 223 +++++++++++++++++++++++++++++++++++++
 3 files changed, 258 insertions(+)

diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 5987f26ddb98..e715a6c6923d 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
                    XiveEND *end);
     int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
                    XiveEND *end);
+    int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+                   XiveNVT *nvt);
+    int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+                   XiveNVT *nvt);
 } XiveRouterClass;
 
 void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
@@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
                         XiveEND *end);
 int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
                         XiveEND *end);
+int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+                        XiveNVT *nvt);
+int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+                        XiveNVT *nvt);
 
 /*
  * XIVE END ESBs
@@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
 
 void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
 
+static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
+{
+    return (nvt_blk << 19) | nvt_idx;
+}
+
 #endif /* PPC_XIVE_H */
diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
index 2e3d6cb507da..05cb992d2815 100644
--- a/include/hw/ppc/xive_regs.h
+++ b/include/hw/ppc/xive_regs.h
@@ -158,4 +158,26 @@ typedef struct XiveEND {
 #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
 } XiveEND;
 
+/* Notification Virtual Target (NVT) */
+typedef struct XiveNVT {
+        uint32_t        w0;
+#define NVT_W0_VALID             PPC_BIT32(0)
+        uint32_t        w1;
+        uint32_t        w2;
+        uint32_t        w3;
+        uint32_t        w4;
+        uint32_t        w5;
+        uint32_t        w6;
+        uint32_t        w7;
+        uint32_t        w8;
+#define NVT_W8_GRP_VALID         PPC_BIT32(0)
+        uint32_t        w9;
+        uint32_t        wa;
+        uint32_t        wb;
+        uint32_t        wc;
+        uint32_t        wd;
+        uint32_t        we;
+        uint32_t        wf;
+} XiveNVT;
+
 #endif /* PPC_XIVE_REGS_H */
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 4c6cb5d52975..5ba3b06e6e25 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -373,6 +373,32 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
     }
 }
 
+/* The HW CAM (23bits) is hardwired to :
+ *
+ *   0x000||0b1||4Bit chip number||7Bit Thread number.
+ *
+ * and when the block grouping extension is enabled :
+ *
+ *   4Bit chip number||0x001||7Bit Thread number.
+ */
+static uint32_t tctx_hw_cam_line(bool block_group, uint8_t chip_id, uint8_t tid)
+{
+    if (block_group) {
+        return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
+    } else {
+        return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
+    }
+}
+
+static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
+{
+    PowerPCCPU *cpu = POWERPC_CPU(tctx->cs);
+    CPUPPCState *env = &cpu->env;
+    uint32_t pir = env->spr_cb[SPR_PIR].default_value;
+
+    return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
+}
+
 static void xive_tctx_reset(void *dev)
 {
     XiveTCTX *tctx = XIVE_TCTX(dev);
@@ -1013,6 +1039,195 @@ int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
    return xrc->set_end(xrtr, end_blk, end_idx, end);
 }
 
+int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+                        XiveNVT *nvt)
+{
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+    return xrc->get_nvt(xrtr, nvt_blk, nvt_idx, nvt);
+}
+
+int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
+                        XiveNVT *nvt)
+{
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
+
+    return xrc->set_nvt(xrtr, nvt_blk, nvt_idx, nvt);
+}
+
+static bool xive_tctx_ring_match(XiveTCTX *tctx, uint8_t ring,
+                                 uint8_t nvt_blk, uint32_t nvt_idx,
+                                 bool cam_ignore, uint32_t logic_serv)
+{
+    uint8_t *regs = &tctx->regs[ring];
+    uint32_t w2 = be32_to_cpu(*((uint32_t *) &regs[TM_WORD2]));
+    uint32_t cam = xive_tctx_cam_line(nvt_blk, nvt_idx);
+    bool block_group = false; /* TODO (PowerNV) */
+
+    /* TODO (PowerNV): ignore low order bits of nvt id */
+
+    switch (ring) {
+    case TM_QW3_HV_PHYS:
+        return (w2 & TM_QW3W2_VT) && xive_tctx_hw_cam_line(tctx, block_group) ==
+            tctx_hw_cam_line(block_group, nvt_blk, nvt_idx);
+
+    case TM_QW2_HV_POOL:
+        return (w2 & TM_QW2W2_VP) && (cam == GETFIELD(TM_QW2W2_POOL_CAM, w2));
+
+    case TM_QW1_OS:
+        return (w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2));
+
+    case TM_QW0_USER:
+        return ((w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2)) &&
+                (w2 & TM_QW0W2_VU) &&
+                (logic_serv == GETFIELD(TM_QW0W2_LOGIC_SERV, w2)));
+
+    default:
+        g_assert_not_reached();
+    }
+}
+
+static int xive_presenter_tctx_match(XiveTCTX *tctx, uint8_t format,
+                                     uint8_t nvt_blk, uint32_t nvt_idx,
+                                     bool cam_ignore, uint32_t logic_serv)
+{
+    if (format == 0) {
+        /* F=0 & i=1: Logical server notification */
+        if (cam_ignore == true) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for LS "
+                          "NVT %x/%x\n", nvt_blk, nvt_idx);
+            return -1;
+        }
+
+        /* F=0 & i=0: Specific NVT notification */
+        if (xive_tctx_ring_match(tctx, TM_QW3_HV_PHYS,
+                                nvt_blk, nvt_idx, false, 0)) {
+            return TM_QW3_HV_PHYS;
+        }
+        if (xive_tctx_ring_match(tctx, TM_QW2_HV_POOL,
+                                nvt_blk, nvt_idx, false, 0)) {
+            return TM_QW2_HV_POOL;
+        }
+        if (xive_tctx_ring_match(tctx, TM_QW1_OS,
+                                nvt_blk, nvt_idx, false, 0)) {
+            return TM_QW1_OS;
+        }
+    } else {
+        /* F=1 : User level Event-Based Branch (EBB) notification */
+        if (xive_tctx_ring_match(tctx, TM_QW0_USER,
+                                nvt_blk, nvt_idx, false, logic_serv)) {
+            return TM_QW0_USER;
+        }
+    }
+    return -1;
+}
+
+typedef struct XiveTCTXMatch {
+    XiveTCTX *tctx;
+    uint8_t ring;
+} XiveTCTXMatch;
+
+static bool xive_presenter_match(XiveRouter *xrtr, uint8_t format,
+                                 uint8_t nvt_blk, uint32_t nvt_idx,
+                                 bool cam_ignore, uint8_t priority,
+                                 uint32_t logic_serv, XiveTCTXMatch *match)
+{
+    CPUState *cs;
+
+    /* TODO (PowerNV): handle chip_id overwrite of block field for
+     * hardwired CAM compares */
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
+        int ring;
+
+        /*
+         * HW checks that the CPU is enabled in the Physical Thread
+         * Enable Register (PTER).
+         */
+
+        /*
+         * Check the thread context CAM lines and record matches. We
+         * will handle CPU exception delivery later
+         */
+        ring = xive_presenter_tctx_match(tctx, format, nvt_blk, nvt_idx,
+                                         cam_ignore, logic_serv);
+        /*
+         * Save the context and keep scanning to catch duplicates,
+         * which we don't support yet.
+         */
+        if (ring != -1) {
+            if (match->tctx) {
+                qemu_log_mask(LOG_GUEST_ERROR, "XIVE: already found a thread "
+                              "context NVT %x/%x\n", nvt_blk, nvt_idx);
+                return false;
+            }
+
+            match->ring = ring;
+            match->tctx = tctx;
+        }
+    }
+
+    if (!match->tctx) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
+                      nvt_blk, nvt_idx);
+        return false;
+    }
+
+    return true;
+}
+
+/*
+ * This is our simple Xive Presenter Engine model. It is merged in the
+ * Router as it does not require an extra object.
+ *
+ * It receives notification requests sent by the IVRE to find one
+ * matching NVT (or more) dispatched on the processor threads. In case
+ * of a single NVT notification, the process is abbreviated and the
+ * thread is signaled if a match is found. In case of a logical server
+ * notification (bits ignored at the end of the NVT identifier), the
+ * IVPE and IVRE select a winning thread using different filters. This
+ * involves 2 or 3 exchanges on the PowerBus that the model does not
+ * support.
+ *
+ * The parameters represent what is sent on the PowerBus
+ */
+static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
+                                  uint8_t nvt_blk, uint32_t nvt_idx,
+                                  bool cam_ignore, uint8_t priority,
+                                  uint32_t logic_serv)
+{
+    XiveNVT nvt;
+    XiveTCTXMatch match = { 0 };
+    bool found;
+
+    /* NVT cache lookup */
+    if (xive_router_get_nvt(xrtr, nvt_blk, nvt_idx, &nvt)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no NVT %x/%x\n",
+                      nvt_blk, nvt_idx);
+        return;
+    }
+
+    if (!(nvt.w0 & NVT_W0_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is invalid\n",
+                      nvt_blk, nvt_idx);
+        return;
+    }
+
+    found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
+                                 priority, logic_serv, &match);
+    if (found) {
+        return;
+    }
+
+    /* If no matching NVT is dispatched on a HW thread :
+     * - update the NVT structure if backlog is activated
+     * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
+     *   activated
+     */
+}
+
 /*
  * An END trigger can come from an event trigger (IPI or HW) or from
  * another chip. We don't model the PowerBus but the END trigger
@@ -1081,6 +1296,14 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
     /*
      * Follows IVPE notification
      */
+    xive_presenter_notify(xrtr, format,
+                          GETFIELD(END_W6_NVT_BLOCK, end.w6),
+                          GETFIELD(END_W6_NVT_INDEX, end.w6),
+                          GETFIELD(END_W7_F0_IGNORE, end.w7),
+                          priority,
+                          GETFIELD(END_W7_F1_LOG_SERVER_ID, end.w7));
+
+    /* TODO: Auto EOI. */
 }
 
 static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (7 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  0:13   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller Cédric Le Goater
                   ` (26 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

After the event data has been pushed to the O/S Event Queue, the IVPE
raises the bit corresponding to the priority of the pending interrupt
in the Interrupt Pending Buffer (IPB) register to indicate that there
is an event pending in one of the 8 priority queues. The Pending
Interrupt Priority Register (PIPR) is also updated using the IPB. This
register represents the priority of the most favored pending
notification.
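
As a rough standalone sketch of the IPB/PIPR relation (not the code of
the patch, which uses clz32()): PRIORITY_MAX is assumed to be 7, to
match the 8 priority queues, and the leading zero count is open-coded
so that the snippet does not depend on QEMU helpers.

```c
#include <assert.h>
#include <stdint.h>

#define PRIORITY_MAX 7 /* assumption: 8 priorities, 0 is most favored */

/* Priority 0 maps to the most significant bit of the 8-bit IPB;
 * out-of-range priorities set no bit. */
static uint8_t priority_to_ipb(uint8_t priority)
{
    return priority > PRIORITY_MAX ? 0 : 1 << (PRIORITY_MAX - priority);
}

/* The PIPR is the number of leading zero bits of the IPB, that is the
 * most favored pending priority, or 0xff when nothing is pending. */
static uint8_t ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio = 0;

    if (!ipb) {
        return 0xff;
    }
    while (!(ipb & 0x80)) {
        ipb <<= 1;
        prio++;
    }
    assert(prio <= PRIORITY_MAX);
    return prio;
}
```

With priorities 2 and 5 pending, the IPB is 0x24 and the resulting
PIPR is 2.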

The PIPR is then compared to the Current Processor Priority
Register (CPPR). If it is more favored (numerically less than), the
CPU interrupt line is raised and the EO bit of the Notification Source
Register (NSR) is updated to notify the presence of an exception for
the O/S. The check needs to be done whenever the PIPR or the CPPR is
changed.
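
The check itself is small. A minimal sketch, assuming a reduced ring
state: struct os_ring and ring_notify() are hypothetical names, and
the NSR_EO bit position is an assumption. The return value stands for
"raise the CPU interrupt line".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NSR_EO 0x80 /* assumed position of the EO bit in the NSR */

/* Hypothetical reduced view of the O/S ring registers */
struct os_ring {
    uint8_t nsr;
    uint8_t cppr;
    uint8_t pipr;
};

/* Flag an exception when the most favored pending priority is
 * numerically less than the current processor priority. */
static bool ring_notify(struct os_ring *r)
{
    assert(r != NULL);
    if (r->pipr < r->cppr) {
        r->nsr |= NSR_EO;
        return true;
    }
    return false;
}
```

A PIPR of 0xff (nothing pending) can never be more favored than any
CPPR, so no exception is flagged in that case.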

The O/S acknowledges the interrupt with a special load in the Thread
Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
takes the value of PIPR. The bit in the IPB corresponding to the
priority of the pending interrupt is reset and so is the EO bit of
the NSR.
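
The acknowledge sequence can be sketched in the same spirit. Again,
struct os_ring, ring_accept() and the helper names are hypothetical,
and NSR_EO and PRIORITY_MAX are assumed values. The 16-bit result
mimics what the special load returns: the NSR as it was before the
acknowledge in the high byte, the new CPPR in the low byte.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NSR_EO       0x80 /* assumed position of the EO bit */
#define PRIORITY_MAX 7    /* assumption: priorities 0..7 */

/* Hypothetical reduced view of the O/S ring registers */
struct os_ring {
    uint8_t nsr;
    uint8_t cppr;
    uint8_t ipb;
    uint8_t pipr;
};

static uint8_t prio_bit(uint8_t priority)
{
    return priority > PRIORITY_MAX ? 0 : 1 << (PRIORITY_MAX - priority);
}

static uint8_t highest_pending(uint8_t ipb)
{
    uint8_t prio = 0;

    if (!ipb) {
        return 0xff;
    }
    while (!(ipb & 0x80)) {
        ipb <<= 1;
        prio++;
    }
    return prio;
}

/* Acknowledge: when EO is set, CPPR takes the PIPR value, the
 * matching IPB bit is cleared, PIPR is recomputed and EO dropped. */
static uint16_t ring_accept(struct os_ring *r)
{
    uint8_t nsr;

    assert(r != NULL);
    nsr = r->nsr;
    if (r->nsr & NSR_EO) {
        r->cppr = r->pipr;
        r->ipb &= (uint8_t)~prio_bit(r->cppr);
        r->pipr = highest_pending(r->ipb);
        r->nsr &= (uint8_t)~NSR_EO;
    }
    return (uint16_t)((nsr << 8) | r->cppr);
}
```

A second acknowledge without a new exception leaves the state
untouched and returns a cleared NSR byte.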

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xive.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 93 insertions(+), 1 deletion(-)

diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 5ba3b06e6e25..c49932d2b799 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -21,9 +21,73 @@
  * XIVE Thread Interrupt Management context
  */
 
+/* Convert a priority number to an Interrupt Pending Buffer (IPB)
+ * register, which indicates a pending interrupt at the priority
+ * corresponding to the bit number
+ */
+static uint8_t priority_to_ipb(uint8_t priority)
+{
+    return priority > XIVE_PRIORITY_MAX ?
+        0 : 1 << (XIVE_PRIORITY_MAX - priority);
+}
+
+/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
+ * Interrupt Priority Register (PIPR), which contains the priority of
+ * the most favored pending notification.
+ */
+static uint8_t ipb_to_pipr(uint8_t ipb)
+{
+    return ipb ? clz32((uint32_t)ipb << 24) : 0xff;
+}
+
+static void ipb_update(uint8_t *regs, uint8_t priority)
+{
+    regs[TM_IPB] |= priority_to_ipb(priority);
+    regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
+}
+
+static uint8_t exception_mask(uint8_t ring)
+{
+    switch (ring) {
+    case TM_QW1_OS:
+        return TM_QW1_NSR_EO;
+    default:
+        g_assert_not_reached();
+    }
+}
+
 static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
 {
-    return 0;
+    uint8_t *regs = &tctx->regs[ring];
+    uint8_t nsr = regs[TM_NSR];
+    uint8_t mask = exception_mask(ring);
+
+    qemu_irq_lower(tctx->output);
+
+    if (regs[TM_NSR] & mask) {
+        uint8_t cppr = regs[TM_PIPR];
+
+        regs[TM_CPPR] = cppr;
+
+        /* Reset the pending buffer bit */
+        regs[TM_IPB] &= ~priority_to_ipb(cppr);
+        regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
+
+        /* Drop Exception bit */
+        regs[TM_NSR] &= ~mask;
+    }
+
+    return (nsr << 8) | regs[TM_CPPR];
+}
+
+static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
+{
+    uint8_t *regs = &tctx->regs[ring];
+
+    if (regs[TM_PIPR] < regs[TM_CPPR]) {
+        regs[TM_NSR] |= exception_mask(ring);
+        qemu_irq_raise(tctx->output);
+    }
 }
 
 static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
@@ -33,6 +97,9 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
     }
 
     tctx->regs[ring + TM_CPPR] = cppr;
+
+    /* CPPR has changed, check if we need to raise a pending exception */
+    xive_tctx_notify(tctx, ring);
 }
 
 /*
@@ -198,6 +265,17 @@ static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
     xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
 }
 
+/*
+ * Adjust the IPB to allow a CPU to process event queues of other
+ * priorities during one physical interrupt cycle.
+ */
+static void xive_tm_set_os_pending(XiveTCTX *tctx, hwaddr offset,
+                                   uint64_t value, unsigned size)
+{
+    ipb_update(&tctx->regs[TM_QW1_OS], value & 0xff);
+    xive_tctx_notify(tctx, TM_QW1_OS);
+}
+
 /*
  * Define a mapping of "special" operations depending on the TIMA page
  * offset and the size of the operation.
@@ -220,6 +298,7 @@ static const XiveTmOp xive_tm_operations[] = {
 
     /* MMIOs above 2K : special operations with side effects */
     { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
+    { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
 };
 
 static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
@@ -409,6 +488,13 @@ static void xive_tctx_reset(void *dev)
     tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
     tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
     tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
+
+    /*
+     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
+     * CPPR is first set.
+     */
+    tctx->regs[TM_QW1_OS + TM_PIPR] =
+        ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
 }
 
 static void xive_tctx_realize(DeviceState *dev, Error **errp)
@@ -1218,9 +1304,15 @@ static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
     found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
                                  priority, logic_serv, &match);
     if (found) {
+        ipb_update(&match.tctx->regs[match.ring], priority);
+        xive_tctx_notify(match.tctx, match.ring);
         return;
     }
 
+    /* Record the IPB in the associated NVT structure */
+    ipb_update((uint8_t *) &nvt.w4, priority);
+    xive_router_set_nvt(xrtr, nvt_blk, nvt_idx, &nvt);
+
     /* If no matching NVT is dispatched on a HW thread :
      * - update the NVT structure if backlog is activated
      * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
-- 
2.17.2


* [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (8 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  0:52   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier Cédric Le Goater
                   ` (25 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

sPAPRXive models the XIVE interrupt controller of the sPAPR machine.
It inherits from the XiveRouter and provisions storage for the routing
tables :

  - Event Assignment Structure (EAS)
  - Event Notification Descriptor (END)

The sPAPRXive model incorporates an internal XiveSource for the IPIs
and for the interrupts of the virtual devices of the guest. This model
is consistent with the XIVE architecture, which also incorporates an
internal IVSE for IPIs and accelerator interrupts in the IVRE
sub-engine.

The sPAPRXive model exports two memory regions, one for the ESB
trigger and management pages used to control the sources and one for
the TIMA pages. They are mapped by default at the addresses found on
chip 0 of a baremetal system. This is also consistent with the XIVE
architecture which defines a Virtualization Controller BAR for the
internal IVSE ESB pages and a Thread Management BAR for the TIMA.
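
To make the layout of the ESB region concrete: the END ESB pages are
mapped right after the source ESB pages, each source taking
1 << esb_shift bytes. A standalone sketch of that computation follows.
end_esb_base() is a hypothetical helper, and the shift and IRQ count
used in the note below are illustrative values, not defaults taken
from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Base address of the END ESB pages: the first address following the
 * ESB pages of all the sources. */
static uint64_t end_esb_base(uint64_t vc_base, unsigned esb_shift,
                             uint32_t nr_irqs)
{
    assert(esb_shift < 64);
    return vc_base + (1ull << esb_shift) * nr_irqs;
}
```

For example, with the default VC base of 0x0006010000000000, an
assumed shift of 17 and 1024 sources, the END ESB pages would start at
0x0006010008000000.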

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 default-configs/ppc64-softmmu.mak |   1 +
 include/hw/ppc/spapr_xive.h       |  46 +++++
 hw/intc/spapr_xive.c              | 323 ++++++++++++++++++++++++++++++
 hw/intc/Makefile.objs             |   1 +
 4 files changed, 371 insertions(+)
 create mode 100644 include/hw/ppc/spapr_xive.h
 create mode 100644 hw/intc/spapr_xive.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index 2d1e7c5c4668..7f34ad0528ed 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -17,6 +17,7 @@ CONFIG_XICS=$(CONFIG_PSERIES)
 CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_XIVE=$(CONFIG_PSERIES)
+CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
 CONFIG_MEM_DEVICE=y
 CONFIG_DIMM=y
 CONFIG_SPAPR_RNG=y
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
new file mode 100644
index 000000000000..06727bd86aa9
--- /dev/null
+++ b/include/hw/ppc/spapr_xive.h
@@ -0,0 +1,46 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_SPAPR_XIVE_H
+#define PPC_SPAPR_XIVE_H
+
+#include "hw/sysbus.h"
+#include "hw/ppc/xive.h"
+
+#define TYPE_SPAPR_XIVE "spapr-xive"
+#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
+
+typedef struct sPAPRXive {
+    XiveRouter    parent;
+
+    /* Internal interrupt source for IPIs and virtual devices */
+    XiveSource    source;
+    hwaddr        vc_base;
+
+    /* END ESB MMIOs */
+    XiveENDSource end_source;
+    hwaddr        end_base;
+
+    /* Routing table */
+    XiveEAS       *eat;
+    uint32_t      nr_irqs;
+    XiveEND       *endt;
+    uint32_t      nr_ends;
+
+    /* TIMA mapping address */
+    hwaddr        tm_base;
+    MemoryRegion  tm_mmio;
+} sPAPRXive;
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
+
+#endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
new file mode 100644
index 000000000000..5d038146c08e
--- /dev/null
+++ b/hw/intc/spapr_xive.c
@@ -0,0 +1,323 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive.h"
+#include "hw/ppc/xive_regs.h"
+
+/*
+ * XIVE Virtualization Controller BAR and Thread Management BAR that we
+ * use for the ESB pages and the TIMA pages
+ */
+#define SPAPR_XIVE_VC_BASE   0x0006010000000000ull
+#define SPAPR_XIVE_TM_BASE   0x0006030203180000ull
+
+void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
+{
+    int i;
+    uint32_t offset = 0;
+
+    monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
+                   offset + xive->source.nr_irqs - 1);
+    xive_source_pic_print_info(&xive->source, offset, mon);
+
+    monitor_printf(mon, "XIVE EAT %08x .. %08x\n", 0, xive->nr_irqs - 1);
+    for (i = 0; i < xive->nr_irqs; i++) {
+        xive_eas_pic_print_info(&xive->eat[i], i, mon);
+    }
+
+    monitor_printf(mon, "XIVE ENDT %08x .. %08x\n", 0, xive->nr_ends - 1);
+    for (i = 0; i < xive->nr_ends; i++) {
+        xive_end_pic_print_info(&xive->endt[i], i, mon);
+    }
+}
+
+/* Map the ESB pages and the TIMA pages */
+static void spapr_xive_mmio_map(sPAPRXive *xive)
+{
+    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
+    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);
+    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
+}
+
+static void spapr_xive_reset(DeviceState *dev)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+    int i;
+
+    /* Xive Source reset is done through SysBus, it should put all
+     * IRQs to OFF (!P|Q) */
+
+    /* Mask all valid EASs in the IRQ number space. */
+    for (i = 0; i < xive->nr_irqs; i++) {
+        XiveEAS *eas = &xive->eat[i];
+        if (eas->w & EAS_VALID) {
+            eas->w |= EAS_MASKED;
+        }
+    }
+
+    for (i = 0; i < xive->nr_ends; i++) {
+        xive_end_reset(&xive->endt[i]);
+    }
+
+    spapr_xive_mmio_map(xive);
+}
+
+static void spapr_xive_instance_init(Object *obj)
+{
+    sPAPRXive *xive = SPAPR_XIVE(obj);
+
+    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
+
+    object_initialize(&xive->end_source, sizeof(xive->end_source),
+                      TYPE_XIVE_END_SOURCE);
+    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
+                              NULL);
+}
+
+static void spapr_xive_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+    XiveSource *xsrc = &xive->source;
+    XiveENDSource *end_xsrc = &xive->end_source;
+    Error *local_err = NULL;
+
+    if (!xive->nr_irqs) {
+        error_setg(errp, "Number of interrupts needs to be greater than 0");
+        return;
+    }
+
+    if (!xive->nr_ends) {
+        error_setg(errp, "Number of ENDs needs to be greater than 0");
+        return;
+    }
+
+    /*
+     * Initialize the internal sources, for IPIs and virtual devices.
+     */
+    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
+
+    /*
+     * Initialize the END ESB source
+     */
+    object_property_set_int(OBJECT(end_xsrc), xive->nr_irqs, "nr-ends",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
+
+    /* Set the mapping address of the END ESB pages after the source ESBs */
+    xive->end_base = xive->vc_base + (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
+
+    /*
+     * Allocate the routing tables
+     */
+    xive->eat = g_new0(XiveEAS, xive->nr_irqs);
+    xive->endt = g_new0(XiveEND, xive->nr_ends);
+
+    /* TIMA initialization */
+    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
+                          "xive.tima", 4ull << TM_SHIFT);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
+}
+
+static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
+{
+    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+
+    if (lisn >= xive->nr_irqs) {
+        return -1;
+    }
+
+    *eas = xive->eat[lisn];
+    return 0;
+}
+
+static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
+{
+    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+
+    if (lisn >= xive->nr_irqs) {
+        return -1;
+    }
+
+    xive->eat[lisn] = *eas;
+    return 0;
+}
+
+static int spapr_xive_get_end(XiveRouter *xrtr,
+                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
+{
+    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+
+    if (end_idx >= xive->nr_ends) {
+        return -1;
+    }
+
+    memcpy(end, &xive->endt[end_idx], sizeof(XiveEND));
+    return 0;
+}
+
+static int spapr_xive_set_end(XiveRouter *xrtr,
+                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
+{
+    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+
+    if (end_idx >= xive->nr_ends) {
+        return -1;
+    }
+
+    memcpy(&xive->endt[end_idx], end, sizeof(XiveEND));
+    return 0;
+}
+
+static const VMStateDescription vmstate_spapr_xive_end = {
+    .name = TYPE_SPAPR_XIVE "/end",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT32(w0, XiveEND),
+        VMSTATE_UINT32(w1, XiveEND),
+        VMSTATE_UINT32(w2, XiveEND),
+        VMSTATE_UINT32(w3, XiveEND),
+        VMSTATE_UINT32(w4, XiveEND),
+        VMSTATE_UINT32(w5, XiveEND),
+        VMSTATE_UINT32(w6, XiveEND),
+        VMSTATE_UINT32(w7, XiveEND),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static const VMStateDescription vmstate_spapr_xive_eas = {
+    .name = TYPE_SPAPR_XIVE "/eas",
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField []) {
+        VMSTATE_UINT64(w, XiveEAS),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static const VMStateDescription vmstate_spapr_xive = {
+    .name = TYPE_SPAPR_XIVE,
+    .version_id = 1,
+    .minimum_version_id = 1,
+    .fields = (VMStateField[]) {
+        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
+        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
+                                     vmstate_spapr_xive_eas, XiveEAS),
+        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(endt, sPAPRXive, nr_ends,
+                                             vmstate_spapr_xive_end, XiveEND),
+        VMSTATE_END_OF_LIST()
+    },
+};
+
+static Property spapr_xive_properties[] = {
+    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
+    DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
+    DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
+    DEFINE_PROP_UINT64("tm-base", sPAPRXive, tm_base, SPAPR_XIVE_TM_BASE),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void spapr_xive_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
+
+    dc->desc    = "sPAPR XIVE Interrupt Controller";
+    dc->props   = spapr_xive_properties;
+    dc->realize = spapr_xive_realize;
+    dc->reset   = spapr_xive_reset;
+    dc->vmsd    = &vmstate_spapr_xive;
+
+    xrc->get_eas = spapr_xive_get_eas;
+    xrc->set_eas = spapr_xive_set_eas;
+    xrc->get_end = spapr_xive_get_end;
+    xrc->set_end = spapr_xive_set_end;
+}
+
+static const TypeInfo spapr_xive_info = {
+    .name = TYPE_SPAPR_XIVE,
+    .parent = TYPE_XIVE_ROUTER,
+    .instance_init = spapr_xive_instance_init,
+    .instance_size = sizeof(sPAPRXive),
+    .class_init = spapr_xive_class_init,
+};
+
+static void spapr_xive_register_types(void)
+{
+    type_register_static(&spapr_xive_info);
+}
+
+type_init(spapr_xive_register_types)
+
+bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
+{
+    XiveSource *xsrc = &xive->source;
+
+    if (lisn >= xive->nr_irqs) {
+        return false;
+    }
+
+    xive->eat[lisn].w |= EAS_VALID;
+    xive_source_irq_set(xsrc, lisn, lsi);
+    return true;
+}
+
+bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
+{
+    XiveSource *xsrc = &xive->source;
+
+    if (lisn >= xive->nr_irqs) {
+        return false;
+    }
+
+    xive->eat[lisn].w &= ~EAS_VALID;
+    xive_source_irq_set(xsrc, lisn, false);
+    return true;
+}
+
+qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
+{
+    XiveSource *xsrc = &xive->source;
+
+    if (lisn >= xive->nr_irqs) {
+        return NULL;
+    }
+
+    if (!(xive->eat[lisn].w & EAS_VALID)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
+        return NULL;
+    }
+
+    return xive_source_qirq(xsrc, lisn);
+}
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 72a46ed91c31..301a8e972d91 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
-- 
2.17.2


* [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (9 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  2:39   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend Cédric Le Goater
                   ` (24 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The IVPE scans the O/S CAM line of the XIVE thread interrupt contexts
to find a matching Notification Virtual Target (NVT) among the NVTs
dispatched on the HW processor threads.

On a real system, the thread interrupt contexts are updated by the
hypervisor when a Virtual Processor is scheduled to run on a HW
thread. Under QEMU, the model emulates the same behavior by hardwiring
the NVT identifier in the thread context registers at reset.
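
As a rough illustration of the reset step described above, here is a
minimal standalone sketch. The register offsets, the valid bit and the
CAM encoding (block number above a 19-bit index) are assumptions made
for this example; the real definitions live in the XIVE headers and
may differ. All sketch_* and SKETCH_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative values, assumed for this sketch only. */
#define SKETCH_TM_QW1_OS   0x10         /* offset of the OS ring */
#define SKETCH_TM_WORD2    0x08         /* offset of word 2 within a ring */
#define SKETCH_QW1W2_VO    0x80000000u  /* "valid" bit of the OS CAM word */

/* Assumed CAM encoding: block number placed above a 19-bit NVT index. */
uint32_t sketch_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
{
    assert(nvt_idx < (1u << 19));
    return ((uint32_t)nvt_blk << 19) | nvt_idx;
}

/* Hardwire the NVT identifier in the OS CAM word, in big-endian order. */
void sketch_reset_os_cam(uint8_t *regs, uint8_t nvt_blk, uint32_t nvt_idx)
{
    uint32_t cam = SKETCH_QW1W2_VO | sketch_cam_line(nvt_blk, nvt_idx);
    uint8_t be[4] = {
        (uint8_t)(cam >> 24), (uint8_t)(cam >> 16),
        (uint8_t)(cam >> 8),  (uint8_t)cam,
    };

    memcpy(&regs[SKETCH_TM_QW1_OS + SKETCH_TM_WORD2], be, sizeof(be));
}
```

Storing the word byte by byte keeps the register image big-endian on
any host, which is what the cpu_to_be32() + memcpy() pair in the patch
achieves.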

The NVT identifier used by the sPAPRXive model is the VCPU id. The END
identifier is also derived from the VCPU id. A set of helpers doing
the conversion between identifiers is provided for the hcalls
configuring the sources and the ENDs.
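
The identifier conversions can be sketched in isolation as below. The
base value 0x400 and the 8-priorities-per-CPU shift mirror the patch;
the sketch_* function names are hypothetical and are not the helpers
added by this patch.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_VP_BASE 0x400u   /* mirrors SPAPR_XIVE_VP_BASE */

/* NVT index <-> VCPU id: a fixed offset. */
uint32_t sketch_vcpu_to_nvt_idx(uint32_t vcpu_id)
{
    return SKETCH_VP_BASE + vcpu_id;
}

uint32_t sketch_nvt_idx_to_vcpu(uint32_t nvt_idx)
{
    assert(nvt_idx >= SKETCH_VP_BASE);
    return nvt_idx - SKETCH_VP_BASE;
}

/* END index <-> (VCPU id, priority): 8 ENDs per CPU. */
uint32_t sketch_vcpu_to_end_idx(uint32_t vcpu_id, uint8_t prio)
{
    assert(prio < 8);
    return (vcpu_id << 3) + prio;
}

void sketch_end_idx_to_target(uint32_t end_idx,
                              uint32_t *out_server, uint8_t *out_prio)
{
    if (out_server) {
        *out_server = end_idx >> 3;
    }
    if (out_prio) {
        *out_prio = (uint8_t)(end_idx & 0x7);
    }
}
```

For example, VCPU 5 maps to NVT index 0x405, and its priority 3 queue
maps to END index 43.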

The model does not need an NVT table, but the XiveRouter NVT operations
are provided to perform some extra checks in the routing algorithm.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_xive.h |  17 +++++
 include/hw/ppc/xive.h       |   3 +
 hw/intc/spapr_xive.c        | 136 ++++++++++++++++++++++++++++++++++++
 hw/intc/xive.c              |   9 +++
 4 files changed, 165 insertions(+)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 06727bd86aa9..3f65b8f485fd 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -43,4 +43,21 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
 
+/*
+ * sPAPR NVT and END indexing helpers
+ */
+uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
+                                  uint32_t nvt_idx);
+int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
+                            uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
+int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
+                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
+
+int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
+                             uint32_t *out_server, uint8_t *out_prio);
+int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
+                             uint8_t *out_end_blk, uint32_t *out_end_idx);
+int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
+                          uint8_t *out_end_blk, uint32_t *out_end_idx);
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index e715a6c6923d..e6931ddaa83f 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -187,6 +187,8 @@ typedef struct XiveRouter {
 #define XIVE_ROUTER_GET_CLASS(obj)                              \
     OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
 
+typedef struct XiveTCTX XiveTCTX;
+
 typedef struct XiveRouterClass {
     SysBusDeviceClass parent;
 
@@ -201,6 +203,7 @@ typedef struct XiveRouterClass {
                    XiveNVT *nvt);
     int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
                    XiveNVT *nvt);
+    void (*reset_tctx)(XiveRouter *xrtr, XiveTCTX *tctx);
 } XiveRouterClass;
 
 void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 5d038146c08e..3bf77ace11a2 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -199,6 +199,139 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
     return 0;
 }
 
+static int spapr_xive_get_nvt(XiveRouter *xrtr,
+                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
+{
+    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+    uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
+    PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
+
+    if (!cpu) {
+        return -1;
+    }
+
+    /*
+     * sPAPR does not maintain a NVT table. Return that the NVT is
+     * valid if we have found a matching CPU
+     */
+    nvt->w0 = NVT_W0_VALID;
+    return 0;
+}
+
+static int spapr_xive_set_nvt(XiveRouter *xrtr,
+                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
+{
+    /* no NVT table */
+    return 0;
+}
+
+/*
+ * When a Virtual Processor is scheduled to run on a HW thread, the
+ * hypervisor pushes its identifier in the OS CAM line. Under QEMU, we
+ * need to emulate the same behavior.
+ */
+static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
+{
+    uint8_t  nvt_blk;
+    uint32_t nvt_idx;
+    uint32_t nvt_cam;
+
+    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
+                          &nvt_blk, &nvt_idx);
+
+    nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
+    memcpy(&tctx->regs[TM_QW1_OS + TM_WORD2], &nvt_cam, 4);
+}
+
+/*
+ * The allocation of VP blocks is a complex operation in OPAL and the
+ * VP identifiers have a relation with the number of HW chips, the
+ * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
+ * controller model does not have the same constraints and can use a
+ * simple mapping scheme of the CPU vcpu_id
+ *
+ * These identifiers are never returned to the OS.
+ */
+
+#define SPAPR_XIVE_VP_BASE 0x400
+
+uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
+                                  uint32_t nvt_idx)
+{
+    return nvt_idx - SPAPR_XIVE_VP_BASE;
+}
+
+int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
+                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
+{
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+
+    if (!cpu) {
+        return -1;
+    }
+
+    if (out_nvt_blk) {
+        /* For testing purpose, we could use 0 for nvt_blk */
+        *out_nvt_blk = xrtr->chip_id;
+    }
+
+    if (out_nvt_idx) {
+        *out_nvt_idx = SPAPR_XIVE_VP_BASE + cpu->vcpu_id;
+    }
+    return 0;
+}
+
+int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
+                             uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
+{
+    return spapr_xive_cpu_to_nvt(xive, spapr_find_cpu(target), out_nvt_blk,
+                                 out_nvt_idx);
+}
+
+/*
+ * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
+ * priorities per CPU
+ */
+int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
+                             uint32_t *out_server, uint8_t *out_prio)
+{
+    if (out_server) {
+        *out_server = end_idx >> 3;
+    }
+
+    if (out_prio) {
+        *out_prio = end_idx & 0x7;
+    }
+    return 0;
+}
+
+int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
+                          uint8_t *out_end_blk, uint32_t *out_end_idx)
+{
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+
+    if (!cpu) {
+        return -1;
+    }
+
+    if (out_end_blk) {
+        /* For testing purpose, we could use 0 for end_blk */
+        *out_end_blk = xrtr->chip_id;
+    }
+
+    if (out_end_idx) {
+        *out_end_idx = (cpu->vcpu_id << 3) + prio;
+    }
+    return 0;
+}
+
+int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
+                             uint8_t *out_end_blk, uint32_t *out_end_idx)
+{
+    return spapr_xive_cpu_to_end(xive, spapr_find_cpu(target), prio,
+                                 out_end_blk, out_end_idx);
+}
+
 static const VMStateDescription vmstate_spapr_xive_end = {
     .name = TYPE_SPAPR_XIVE "/end",
     .version_id = 1,
@@ -263,6 +396,9 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
     xrc->set_eas = spapr_xive_set_eas;
     xrc->get_end = spapr_xive_get_end;
     xrc->set_end = spapr_xive_set_end;
+    xrc->get_nvt = spapr_xive_get_nvt;
+    xrc->set_nvt = spapr_xive_set_nvt;
+    xrc->reset_tctx = spapr_xive_reset_tctx;
 }
 
 static const TypeInfo spapr_xive_info = {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index c49932d2b799..fc6ef5895e6d 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -481,6 +481,7 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
 static void xive_tctx_reset(void *dev)
 {
     XiveTCTX *tctx = XIVE_TCTX(dev);
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
 
     memset(tctx->regs, 0, sizeof(tctx->regs));
 
@@ -495,6 +496,14 @@ static void xive_tctx_reset(void *dev)
      */
     tctx->regs[TM_QW1_OS + TM_PIPR] =
         ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
+
+    /*
+     * QEMU sPAPR XIVE only. To let the controller model reset the OS
+     * CAM line with the VP identifier.
+     */
+    if (xrc->reset_tctx) {
+        xrc->reset_tctx(tctx->xrtr, tctx);
+    }
 }
 
 static void xive_tctx_realize(DeviceState *dev, Error **errp)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (10 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  2:57   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine Cédric Le Goater
                   ` (23 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

We will need to use xics_max_server_number() to create the sPAPRXive
object modeling the interrupt controller of the machine which is
created before the CPUs.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr.c | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 7afd1a175bf2..50cb9f9f4a02 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
         boot_cores_nr = possible_cpus->len;
     }
 
-    /* VSMT must be set in order to be able to compute VCPU ids, ie to
-     * call xics_max_server_number() or spapr_vcpu_id().
-     */
-    spapr_set_vsmt_mode(spapr, &error_fatal);
-
     if (smc->pre_2_10_has_unused_icps) {
         int i;
 
@@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
     /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
     load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
 
+    /* VSMT must be set in order to be able to compute VCPU ids, ie to
+     * call xics_max_server_number() or spapr_vcpu_id().
+     */
+    spapr_set_vsmt_mode(spapr, &error_fatal);
+
     /* Set up Interrupt Controller before we create the VCPUs */
     smc->irq->init(spapr, &error_fatal);
 
-- 
2.17.2


* [Qemu-devel] [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (11 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  2:59   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 14/36] spapr: modify the irq backend 'init' method Cédric Le Goater
                   ` (22 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Introduce a spapr_irq_init() frontend routine and initialize the MSI
bitmap from it, as this will be necessary for the sPAPR IRQ backend
for XIVE.
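
The shape of the change is a frontend routine doing the common setup
before dispatching to the backend 'init' method. A minimal standalone
sketch, with hypothetical Sketch* types standing in for the machine
and sPAPRIrq structures:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct SketchMachine SketchMachine;

/* Stand-in for the IRQ backend descriptor (hypothetical). */
typedef struct SketchIrqBackend {
    unsigned nr_msis;
    void (*init)(SketchMachine *m);
} SketchIrqBackend;

struct SketchMachine {
    const SketchIrqBackend *irq;
    bool legacy_irq_allocation;
    unsigned msi_bitmap_size;   /* 0 until the MSI allocator is set up */
    int backend_inits;          /* counts calls to the backend method */
};

/* A backend 'init' that no longer has to care about the MSI bitmap. */
void sketch_backend_init(SketchMachine *m)
{
    m->backend_inits++;
}

/* Frontend: common MSI setup first, then the backend specific part. */
void sketch_irq_init(SketchMachine *m)
{
    assert(m && m->irq && m->irq->init);

    if (!m->legacy_irq_allocation) {
        m->msi_bitmap_size = m->irq->nr_msis;
    }
    m->irq->init(m);
}
```

With this split, every backend gets the MSI allocator for free and a
new backend only has to provide its own 'init'.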

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h |  1 +
 hw/ppc/spapr.c             |  2 +-
 hw/ppc/spapr_irq.c         | 16 +++++++++++-----
 3 files changed, 13 insertions(+), 6 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index a467ce696ee4..bd7301e6d9c6 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -43,6 +43,7 @@ typedef struct sPAPRIrq {
 extern sPAPRIrq spapr_irq_xics;
 extern sPAPRIrq spapr_irq_xics_legacy;
 
+void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 50cb9f9f4a02..e470efe7993c 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
     spapr_set_vsmt_mode(spapr, &error_fatal);
 
     /* Set up Interrupt Controller before we create the VCPUs */
-    smc->irq->init(spapr, &error_fatal);
+    spapr_irq_init(spapr, &error_fatal);
 
     /* Set up containers for ibm,client-architecture-support negotiated options
      */
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index e77b94cc685e..f8b651de0ec9 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -97,11 +97,6 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, Error **errp)
     int nr_irqs = smc->irq->nr_irqs;
     Error *local_err = NULL;
 
-    /* Initialize the MSI IRQ allocator. */
-    if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
-        spapr_irq_msi_init(spapr, smc->irq->nr_msis);
-    }
-
     if (kvm_enabled()) {
         if (machine_kernel_irqchip_allowed(machine) &&
             !xics_kvm_init(spapr, &local_err)) {
@@ -213,6 +208,17 @@ sPAPRIrq spapr_irq_xics = {
 /*
  * sPAPR IRQ frontend routines for devices
  */
+void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
+{
+    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+    /* Initialize the MSI IRQ allocator. */
+    if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
+        spapr_irq_msi_init(spapr, smc->irq->nr_msis);
+    }
+
+    smc->irq->init(spapr, errp);
+}
 
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
 {
-- 
2.17.2


* [Qemu-devel] [PATCH v5 14/36] spapr: modify the irq backend 'init' method
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (12 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 15/36] spapr: introduce a new machine IRQ backend for XIVE Cédric Le Goater
                   ` (21 subsequent siblings)
  35 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Add a 'nr_irqs' parameter to the 'init' method to remove the use of
the machine class.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h | 2 +-
 hw/ppc/spapr_irq.c         | 7 +++----
 2 files changed, 4 insertions(+), 5 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index bd7301e6d9c6..0e9229bf219e 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -33,7 +33,7 @@ typedef struct sPAPRIrq {
     uint32_t    nr_irqs;
     uint32_t    nr_msis;
 
-    void (*init)(sPAPRMachineState *spapr, Error **errp);
+    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
     int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
     void (*free)(sPAPRMachineState *spapr, int irq, int num);
     qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index f8b651de0ec9..bac450ffff23 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -90,11 +90,10 @@ error:
     return NULL;
 }
 
-static void spapr_irq_init_xics(sPAPRMachineState *spapr, Error **errp)
+static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
+                                Error **errp)
 {
     MachineState *machine = MACHINE(spapr);
-    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
-    int nr_irqs = smc->irq->nr_irqs;
     Error *local_err = NULL;
 
     if (kvm_enabled()) {
@@ -217,7 +216,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
         spapr_irq_msi_init(spapr, smc->irq->nr_msis);
     }
 
-    smc->irq->init(spapr, errp);
+    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
 }
 
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 15/36] spapr: introduce a new machine IRQ backend for XIVE
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (13 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 14/36] spapr: modify the irq backend 'init' method Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  3:28   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
                   ` (20 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE IRQ backend uses the same layout as the new XICS backend but
covers the full range of the IRQ number space. The IRQ numbers for the
CPU IPIs are allocated at the bottom of this space, below 4K, to
preserve compatibility with XICS which does not use that range.

This should be enough given that the maximum number of CPUs is 1024
for the sPAPR machine under QEMU. For the record, the biggest POWER8
or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
cores, SMT8).
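
The layout argument can be checked with a small standalone sketch.
The constants repeat values from this patch (IPI base 0x0, 4K XICS
base, 8K XIVE space) and the 1024 CPU limit quoted above; the
sketch_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

enum {
    SKETCH_IRQ_IPI      = 0x0,     /* SPAPR_IRQ_IPI in the patch */
    SKETCH_XICS_BASE    = 0x1000,  /* first IRQ number used by XICS (4K) */
    SKETCH_XIVE_NR_IRQS = 0x2000,  /* SPAPR_IRQ_XIVE_NR_IRQS in the patch */
    SKETCH_MAX_CPUS     = 1024     /* sPAPR machine limit quoted above */
};

/* IRQ number of the IPI of a given server (one IPI per CPU). */
int sketch_ipi_irq(int server)
{
    assert(server >= 0 && server < SKETCH_MAX_CPUS);
    return SKETCH_IRQ_IPI + server;
}

/* True if the IRQ number falls in the range left unused by XICS. */
bool sketch_irq_in_ipi_range(int irq)
{
    return irq >= SKETCH_IRQ_IPI && irq < SKETCH_XICS_BASE;
}
```

Even the 1536 HW threads of the largest system mentioned above
(192 cores x SMT8) would fit below the 4K boundary.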

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr.h     |   2 +
 include/hw/ppc/spapr_irq.h |   7 ++-
 hw/ppc/spapr.c             |   2 +-
 hw/ppc/spapr_irq.c         | 119 ++++++++++++++++++++++++++++++++++++-
 4 files changed, 124 insertions(+), 6 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 6279711fe8f7..1fbc2663e06c 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
 typedef struct sPAPREventSource sPAPREventSource;
 typedef struct sPAPRPendingHPT sPAPRPendingHPT;
 typedef struct ICSState ICSState;
+typedef struct sPAPRXive sPAPRXive;
 
 #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
 #define SPAPR_ENTRY_POINT       0x100
@@ -175,6 +176,7 @@ struct sPAPRMachineState {
     const char *icp_type;
     int32_t irq_map_nr;
     unsigned long *irq_map;
+    sPAPRXive  *xive;
 
     bool cmd_line_caps[SPAPR_CAP_NUM];
     sPAPRCapabilities def, eff, mig;
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 0e9229bf219e..c854ae527808 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -13,6 +13,7 @@
 /*
  * IRQ range offsets per device type
  */
+#define SPAPR_IRQ_IPI        0x0
 #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
 #define SPAPR_IRQ_HOTPLUG    0x1001
 #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
@@ -33,7 +34,8 @@ typedef struct sPAPRIrq {
     uint32_t    nr_irqs;
     uint32_t    nr_msis;
 
-    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
+    void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
+                 Error **errp);
     int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
     void (*free)(sPAPRMachineState *spapr, int irq, int num);
     qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
@@ -42,8 +44,9 @@ typedef struct sPAPRIrq {
 
 extern sPAPRIrq spapr_irq_xics;
 extern sPAPRIrq spapr_irq_xics_legacy;
+extern sPAPRIrq spapr_irq_xive;
 
-void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
+void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index e470efe7993c..9f8c19e56e7a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
     spapr_set_vsmt_mode(spapr, &error_fatal);
 
     /* Set up Interrupt Controller before we create the VCPUs */
-    spapr_irq_init(spapr, &error_fatal);
+    spapr_irq_init(spapr, xics_max_server_number(spapr), &error_fatal);
 
     /* Set up containers for ibm,client-architecture-support negotiated options
      */
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index bac450ffff23..2569ae1bc7f8 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -12,6 +12,7 @@
 #include "qemu/error-report.h"
 #include "qapi/error.h"
 #include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
 #include "hw/ppc/xics.h"
 #include "sysemu/kvm.h"
 
@@ -91,7 +92,7 @@ error:
 }
 
 static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
-                                Error **errp)
+                                int nr_servers, Error **errp)
 {
     MachineState *machine = MACHINE(spapr);
     Error *local_err = NULL;
@@ -204,10 +205,122 @@ sPAPRIrq spapr_irq_xics = {
     .print_info  = spapr_irq_print_info_xics,
 };
 
+/*
+ * XIVE IRQ backend.
+ */
+static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
+                                    const char *type_xive, int nr_irqs,
+                                    int nr_servers, Error **errp)
+{
+    sPAPRXive *xive;
+    Error *local_err = NULL;
+    Object *obj;
+    uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
+    int i;
+
+    obj = object_new(type_xive);
+    object_property_set_int(obj, nr_irqs, "nr-irqs", &error_abort);
+    object_property_set_int(obj, nr_ends, "nr-ends", &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
+    xive = SPAPR_XIVE(obj);
+
+    /* Enable the CPU IPIs */
+    for (i = 0; i < nr_servers; ++i) {
+        spapr_xive_irq_enable(xive, SPAPR_IRQ_IPI + i, false);
+    }
+
+    return xive;
+}
+
+static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
+                                int nr_servers, Error **errp)
+{
+    MachineState *machine = MACHINE(spapr);
+    Error *local_err = NULL;
+
+    /* KVM XIVE support */
+    if (kvm_enabled()) {
+        if (machine_kernel_irqchip_required(machine)) {
+            error_setg(errp, "kernel_irqchip requested. no XIVE support");
+            return;
+        }
+    }
+
+    /* QEMU XIVE support */
+    spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, nr_servers,
+                                    &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
+                                Error **errp)
+{
+    if (!spapr_xive_irq_enable(spapr->xive, irq, lsi)) {
+        error_setg(errp, "IRQ %d is invalid", irq);
+        return -1;
+    }
+    return 0;
+}
+
+static void spapr_irq_free_xive(sPAPRMachineState *spapr, int irq, int num)
+{
+    int i;
+
+    for (i = irq; i < irq + num; ++i) {
+        spapr_xive_irq_disable(spapr->xive, i);
+    }
+}
+
+static qemu_irq spapr_qirq_xive(sPAPRMachineState *spapr, int irq)
+{
+    return spapr_xive_qirq(spapr->xive, irq);
+}
+
+static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
+                                      Monitor *mon)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
+    }
+
+    spapr_xive_pic_print_info(spapr->xive, mon);
+}
+
+/*
+ * XIVE uses the full IRQ number space. Set it to 8K to be compatible
+ * with XICS.
+ */
+
+#define SPAPR_IRQ_XIVE_NR_IRQS     0x2000
+#define SPAPR_IRQ_XIVE_NR_MSIS     (SPAPR_IRQ_XIVE_NR_IRQS - SPAPR_IRQ_MSI)
+
+sPAPRIrq spapr_irq_xive = {
+    .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
+    .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
+
+    .init        = spapr_irq_init_xive,
+    .claim       = spapr_irq_claim_xive,
+    .free        = spapr_irq_free_xive,
+    .qirq        = spapr_qirq_xive,
+    .print_info  = spapr_irq_print_info_xive,
+};
+
 /*
  * sPAPR IRQ frontend routines for devices
  */
-void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
+void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp)
 {
     sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
 
@@ -216,7 +329,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
         spapr_irq_msi_init(spapr, smc->irq->nr_msis);
     }
 
-    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
+    smc->irq->init(spapr, smc->irq->nr_irqs, nr_servers, errp);
 }
 
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
-- 
2.17.2


* [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (14 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 15/36] spapr: introduce a new machine IRQ backend for XIVE Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  4:25   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode Cédric Le Goater
                   ` (19 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The different XIVE virtualization structures (sources and event queues)
are configured with a set of Hypervisor calls :

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (ESB) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines which "target" and "priority" are assigned to a source.

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification configuration associated
   with the queue. Only unconditional notification is supported for
   the moment. Reset is performed with a queue size of 0, and queueing
   is disabled in that case.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the guest's internal interrupt structures to their
   initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issues a synchronisation on a source to make sure all notifications
   have reached their queue.
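
For H_INT_SET_QUEUE_CONFIG in the list above, the size handling can be
sketched as follows. This is only an illustration and not code from the
patch: the helper names are hypothetical, and the accepted sizes (12, 16,
21, 24) and the "minus 12" encoding are taken from the switch statement
in h_int_set_queue_config() further down.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical constant: the smallest EQ is 4K, i.e. 2^12 bytes. */
#define EQ_SHIFT_MIN 12

/*
 * Map the "eventQueueSize" argument (log2 of the size in bytes) to the
 * value stored in the END queue size field. Returns false for a size
 * that is not supported. A size of 0 requests a queue reset, which is
 * reported through *reset.
 */
static bool eq_size_to_field(uint64_t qsize, uint32_t *field, bool *reset)
{
    assert(field && reset);

    *reset = false;
    switch (qsize) {
    case 0:
        *reset = true;
        *field = 0;
        return true;
    case 12:
    case 16:
    case 21:
    case 24:
        *field = (uint32_t)(qsize - EQ_SHIFT_MIN);
        return true;
    default:
        return false;
    }
}

/* Inverse mapping, used when the size is reported back to the guest. */
static uint64_t eq_field_to_size(uint32_t field)
{
    return (uint64_t)field + EQ_SHIFT_MIN;
}
```

With this encoding a 64K queue (size 16) is stored as 4 and read back as
16, while a size such as 13 is rejected.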

Calls that still need to be addressed :

   H_INT_SET_OS_REPORTING_LINE
   H_INT_GET_OS_REPORTING_LINE

See the code for more documentation on each hcall.
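
As an illustration of the flag conventions documented in the code, the
"flags" value returned in R4 by H_INT_GET_SOURCE_INFO could be decoded on
the guest side roughly as below. This is a sketch under assumptions: the
struct and function names are hypothetical, and PPC_BIT() is redefined
locally using IBM bit numbering (bit 0 is the most significant bit).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB of a 64-bit word. */
#define PPC_BIT(bit) (0x8000000000000000ULL >> (bit))

#define SRC_H_INT_ESB  PPC_BIT(60) /* ESB managed with H_INT_ESB */
#define SRC_LSI        PPC_BIT(61) /* 1 == LSI, 0 == MSI */
#define SRC_TRIGGER    PPC_BIT(62) /* full function page supports trigger */
#define SRC_STORE_EOI  PPC_BIT(63) /* Store EOI supported */

/* Hypothetical decoded view of the R4 output flags. */
struct src_info {
    bool use_hcall;
    bool is_lsi;
    bool trigger;
    bool store_eoi;
};

static struct src_info decode_source_flags(uint64_t flags)
{
    struct src_info info;

    /* Bits 0-59 are reserved and expected to be zero. */
    assert((flags & ~(SRC_H_INT_ESB | SRC_LSI |
                      SRC_TRIGGER | SRC_STORE_EOI)) == 0);

    info.use_hcall = (flags & SRC_H_INT_ESB) != 0;
    info.is_lsi    = (flags & SRC_LSI) != 0;
    info.trigger   = (flags & SRC_TRIGGER) != 0;
    info.store_eoi = (flags & SRC_STORE_EOI) != 0;
    return info;
}
```

In this patch an LSI source is reported with both the H_INT_ESB and LSI
bits set, so a guest decoding the flags this way would see use_hcall and
is_lsi both true and would go through H_INT_ESB instead of MMIO.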

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr.h      |  15 +-
 include/hw/ppc/spapr_xive.h |   6 +
 hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr_irq.c          |   2 +
 hw/intc/Makefile.objs       |   2 +-
 5 files changed, 915 insertions(+), 2 deletions(-)
 create mode 100644 hw/intc/spapr_xive_hcall.c

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 1fbc2663e06c..8415faea7b82 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -452,7 +452,20 @@ struct sPAPRMachineState {
 #define H_INVALIDATE_PID        0x378
 #define H_REGISTER_PROC_TBL     0x37C
 #define H_SIGNAL_SYS_RESET      0x380
-#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
+
+#define H_INT_GET_SOURCE_INFO   0x3A8
+#define H_INT_SET_SOURCE_CONFIG 0x3AC
+#define H_INT_GET_SOURCE_CONFIG 0x3B0
+#define H_INT_GET_QUEUE_INFO    0x3B4
+#define H_INT_SET_QUEUE_CONFIG  0x3B8
+#define H_INT_GET_QUEUE_CONFIG  0x3BC
+#define H_INT_SET_OS_REPORTING_LINE 0x3C0
+#define H_INT_GET_OS_REPORTING_LINE 0x3C4
+#define H_INT_ESB               0x3C8
+#define H_INT_SYNC              0x3CC
+#define H_INT_RESET             0x3D0
+
+#define MAX_HCALL_OPCODE        H_INT_RESET
 
 /* The hcalls above are standardized in PAPR and implemented by pHyp
  * as well.
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 3f65b8f485fd..418511f3dc10 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
 int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
                           uint8_t *out_end_blk, uint32_t *out_end_idx);
 
+bool spapr_xive_priority_is_valid(uint8_t priority);
+
+typedef struct sPAPRMachineState sPAPRMachineState;
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
new file mode 100644
index 000000000000..52e4e23995f5
--- /dev/null
+++ b/hw/intc/spapr_xive_hcall.c
@@ -0,0 +1,892 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "cpu.h"
+#include "hw/ppc/fdt.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive_regs.h"
+#include "monitor/monitor.h"
+
+/*
+ * OPAL uses the priority 7 EQ to automatically escalate interrupts
+ * for all other queues (DD2.X POWER9). So only priorities [0..6] are
+ * available for the guest.
+ */
+bool spapr_xive_priority_is_valid(uint8_t priority)
+{
+    switch (priority) {
+    case 0 ... 6:
+        return true;
+    case 7: /* OPAL escalation queue */
+    default:
+        return false;
+    }
+}
+
+/*
+ * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
+ * real address of the MMIO page through which the Event State Buffer
+ * entry associated with the value of the "lisn" parameter is managed.
+ *
+ * Parameters:
+ * Input
+ * - "flags"
+ *       Bits 0-63 reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *       "ibm,xive-lisn-ranges" properties, or as returned by the
+ *       ibm,query-interrupt-source-number RTAS call, or as returned
+ *       by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output
+ * - R4: "flags"
+ *       Bits 0-59: Reserved
+ *       Bit 60: H_INT_ESB must be used for Event State Buffer
+ *               management
+ *       Bit 61: 1 == LSI  0 == MSI
+ *       Bit 62: the full function page supports trigger
+ *       Bit 63: Store EOI Supported
+ * - R5: Logical Real address of full function Event State Buffer
+ *       management page, -1 if ESB hcall flag is set to 1.
+ * - R6: Logical Real Address of trigger only Event State Buffer
+ *       management page or -1.
+ * - R7: Power of 2 page size for the ESB management pages returned in
+ *       R5 and R6.
+ */
+
+#define SPAPR_XIVE_SRC_H_INT_ESB     PPC_BIT(60) /* ESB managed by H_INT_ESB */
+#define SPAPR_XIVE_SRC_LSI           PPC_BIT(61) /* Virtual LSI type */
+#define SPAPR_XIVE_SRC_TRIGGER       PPC_BIT(62) /* Trigger and management
+                                                    on same page */
+#define SPAPR_XIVE_SRC_STORE_EOI     PPC_BIT(63) /* Store EOI support */
+
+static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
+                                          sPAPRMachineState *spapr,
+                                          target_ulong opcode,
+                                          target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveSource *xsrc = &xive->source;
+    XiveEAS eas;
+    target_ulong flags  = args[0];
+    target_ulong lisn   = args[1];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
+        return H_P2;
+    }
+
+    if (!(eas.w & EAS_VALID)) {
+        return H_P2;
+    }
+
+    /* All sources are emulated under the main XIVE object and share
+     * the same characteristics.
+     */
+    args[0] = 0;
+    if (!xive_source_esb_has_2page(xsrc)) {
+        args[0] |= SPAPR_XIVE_SRC_TRIGGER;
+    }
+    if (xsrc->esb_flags & XIVE_SRC_STORE_EOI) {
+        args[0] |= SPAPR_XIVE_SRC_STORE_EOI;
+    }
+
+    /*
+     * Force the use of the H_INT_ESB hcall in case of an LSI
+     * interrupt. This is necessary under KVM to re-trigger the
+     * interrupt if the level is still asserted
+     */
+    if (xive_source_irq_is_lsi(xsrc, lisn)) {
+        args[0] |= SPAPR_XIVE_SRC_H_INT_ESB | SPAPR_XIVE_SRC_LSI;
+    }
+
+    if (!(args[0] & SPAPR_XIVE_SRC_H_INT_ESB)) {
+        args[1] = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn);
+    } else {
+        args[1] = -1;
+    }
+
+    if (xive_source_esb_has_2page(xsrc)) {
+        args[2] = xive->vc_base + xive_source_esb_page(xsrc, lisn);
+    } else {
+        args[2] = -1;
+    }
+
+    args[3] = TARGET_PAGE_SIZE;
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
+ * Interrupt Source to a target. The Logical Interrupt Source is
+ * designated with the "lisn" parameter and the target is designated
+ * with the "target" and "priority" parameters.  Upon return from the
+ * hcall(), no additional interrupts will be directed to the old EQ.
+ *
+ * TODO: The old EQ should be investigated for interrupts that
+ * occurred prior to or during the hcall().
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-61: Reserved
+ *      Bit 62: set the "eisn" in the EAS
+ *      Bit 63: masks the interrupt source in the hardware interrupt
+ *      control structure. An interrupt masked by this mechanism will
+ *      be dropped, but its source state bits will still be
+ *      set. There is no race-free way of unmasking and restoring the
+ *      source. Thus this should only be used in interrupts that are
+ *      also masked at the source, and only in cases where the
+ *      interrupt is not meant to be used for a large amount of time,
+ *      for example because no valid target exists for it
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as returned by
+ *      the H_ALLOCATE_VAS_WINDOW hcall
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *      "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *      "ibm,plat-res-int-priorities"
+ * - "eisn" is the guest EISN associated with the "lisn"
+ *
+ * Output:
+ * - None
+ */
+
+#define SPAPR_XIVE_SRC_SET_EISN PPC_BIT(62)
+#define SPAPR_XIVE_SRC_MASK     PPC_BIT(63)
+
+static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
+                                            sPAPRMachineState *spapr,
+                                            target_ulong opcode,
+                                            target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+    XiveEAS eas, new_eas;
+    target_ulong flags    = args[0];
+    target_ulong lisn     = args[1];
+    target_ulong target   = args[2];
+    target_ulong priority = args[3];
+    target_ulong eisn     = args[4];
+    uint8_t end_blk;
+    uint32_t end_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~(SPAPR_XIVE_SRC_SET_EISN | SPAPR_XIVE_SRC_MASK)) {
+        return H_PARAMETER;
+    }
+
+    if (xive_router_get_eas(xrtr, lisn, &eas)) {
+        return H_P2;
+    }
+
+    if (!(eas.w & EAS_VALID)) {
+        return H_P2;
+    }
+
+    /* priority 0xff is used to reset the EAS */
+    if (priority == 0xff) {
+        new_eas.w = EAS_VALID | EAS_MASKED;
+        goto out;
+    }
+
+    if (flags & SPAPR_XIVE_SRC_MASK) {
+        new_eas.w = eas.w | EAS_MASKED;
+    } else {
+        new_eas.w = eas.w & ~EAS_MASKED;
+    }
+
+    if (!spapr_xive_priority_is_valid(priority)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority " TARGET_FMT_ld
+                      " requested\n", priority);
+        return H_P4;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the END corresponding to the
+     * target.
+     */
+    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
+        return H_P3;
+    }
+
+    new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
+    new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
+
+    if (flags & SPAPR_XIVE_SRC_SET_EISN) {
+        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
+    }
+
+out:
+    if (xive_router_set_eas(xrtr, lisn, &new_eas)) {
+        return H_HARDWARE;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
+ * target/priority pair is assigned to the specified Logical Interrupt
+ * Source.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63 Reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output:
+ * - R4: Target to which the specified Logical Interrupt Source is
+ *       assigned
+ * - R5: Priority to which the specified Logical Interrupt Source is
+ *       assigned
+ * - R6: EISN for the specified Logical Interrupt Source (this will be
+ *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
+ */
+static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
+                                            sPAPRMachineState *spapr,
+                                            target_ulong opcode,
+                                            target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+    target_ulong flags = args[0];
+    target_ulong lisn = args[1];
+    XiveEAS eas;
+    XiveEND end;
+    uint8_t end_blk, nvt_blk;
+    uint32_t end_idx, nvt_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    if (xive_router_get_eas(xrtr, lisn, &eas)) {
+        return H_P2;
+    }
+
+    if (!(eas.w & EAS_VALID)) {
+        return H_P2;
+    }
+
+    end_blk = GETFIELD(EAS_END_BLOCK, eas.w);
+    end_idx = GETFIELD(EAS_END_INDEX, eas.w);
+    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
+        /* Not sure what to return here */
+        return H_HARDWARE;
+    }
+
+    nvt_blk = GETFIELD(END_W6_NVT_BLOCK, end.w6);
+    nvt_idx = GETFIELD(END_W6_NVT_INDEX, end.w6);
+    args[0] = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
+
+    if (eas.w & EAS_MASKED) {
+        args[1] = 0xff;
+    } else {
+        args[1] = GETFIELD(END_W7_F0_PRIORITY, end.w7);
+    }
+
+    args[2] = GETFIELD(EAS_END_DATA, eas.w);
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
+ * address of the notification management page associated with the
+ * specified target and priority.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *       Bits 0-63 Reserved
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ *
+ * Output:
+ * - R4: Logical real address of notification page
+ * - R5: Power of 2 page size of the notification page
+ */
+static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
+                                         sPAPRMachineState *spapr,
+                                         target_ulong opcode,
+                                         target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveENDSource *end_xsrc = &xive->end_source;
+    target_ulong flags = args[0];
+    target_ulong target = args[1];
+    target_ulong priority = args[2];
+    XiveEND end;
+    uint8_t end_blk;
+    uint32_t end_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!spapr_xive_priority_is_valid(priority)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority " TARGET_FMT_ld
+                      " requested\n", priority);
+        return H_P3;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the END corresponding to the
+     * target.
+     */
+    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
+        return H_P2;
+    }
+
+    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
+        return H_HARDWARE;
+    }
+
+    args[0] = xive->end_base + (1ull << (end_xsrc->esb_shift + 1)) * end_idx;
+    if (end.w0 & END_W0_ENQUEUE) {
+        args[1] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
+    } else {
+        args[1] = 0;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for
+ * a given "target" and "priority".  It is also used to set the
+ * notification config associated with the EQ.  An EQ size of 0 is
+ * used to reset the EQ config for a given target and priority. If
+ * resetting the EQ config, the END associated with the given "target"
+ * and "priority" will be changed to disable queueing.
+ *
+ * Upon return from the hcall(), no additional interrupts will be
+ * directed to the old EQ (if one was set). The old EQ (if one was
+ * set) should be investigated for interrupts that occurred prior to
+ * or during the hcall().
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Unconditional Notify (n) per the XIVE spec
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ * - "eventQueue": The logical real address of the start of the EQ
+ * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
+ *
+ * Output:
+ * - None
+ */
+
+#define SPAPR_XIVE_END_ALWAYS_NOTIFY PPC_BIT(63)
+
+static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
+                                           sPAPRMachineState *spapr,
+                                           target_ulong opcode,
+                                           target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+    target_ulong flags = args[0];
+    target_ulong target = args[1];
+    target_ulong priority = args[2];
+    target_ulong qpage = args[3];
+    target_ulong qsize = args[4];
+    XiveEND end;
+    uint8_t end_blk, nvt_blk;
+    uint32_t end_idx, nvt_idx;
+    uint32_t qdata;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~SPAPR_XIVE_END_ALWAYS_NOTIFY) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!spapr_xive_priority_is_valid(priority)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority " TARGET_FMT_ld
+                      " requested\n", priority);
+        return H_P3;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the END corresponding to the
+     * target.
+     */
+
+    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
+        return H_P2;
+    }
+
+    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
+        return H_HARDWARE;
+    }
+
+    switch (qsize) {
+    case 12:
+    case 16:
+    case 21:
+    case 24:
+        end.w3 = ((uint64_t)qpage) & 0xffffffff;
+        end.w2 = ((uint64_t)qpage >> 32) & 0x0fffffff;
+        end.w0 |= END_W0_ENQUEUE;
+        end.w0 = SETFIELD(END_W0_QSIZE, end.w0, qsize - 12);
+        break;
+    case 0:
+        /* reset queue and disable queueing */
+        xive_end_reset(&end);
+        goto out;
+
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
+                      qsize);
+        return H_P5;
+    }
+
+    if (qsize) {
+        /*
+         * Let's validate the EQ address with a read of the first EQ
+         * entry. We could also check that the full queue has been
+         * zeroed by the OS.
+         */
+        if (address_space_read(&address_space_memory, qpage,
+                               MEMTXATTRS_UNSPECIFIED,
+                               (uint8_t *) &qdata, sizeof(qdata))) {
+            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
+                          HWADDR_PRIx "\n", qpage);
+            return H_P4;
+        }
+    }
+
+    if (spapr_xive_target_to_nvt(xive, target, &nvt_blk, &nvt_idx)) {
+        return H_HARDWARE;
+    }
+
+    /* Ensure the priority and target are correctly set (they will not
+     * be right after allocation)
+     */
+    end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
+        SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
+    end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, priority);
+
+    if (flags & SPAPR_XIVE_END_ALWAYS_NOTIFY) {
+        end.w0 |= END_W0_UCOND_NOTIFY;
+    } else {
+        end.w0 &= ~END_W0_UCOND_NOTIFY;
+    }
+
+    /* The generation bit for the END starts at 1 and the END page
+     * offset counter starts at 0.
+     */
+    end.w1 = END_W1_GENERATION | SETFIELD(END_W1_PAGE_OFF, 0ul, 0ul);
+    end.w0 |= END_W0_VALID;
+
+    /* TODO: issue syncs required to ensure all in-flight interrupts
+     * are complete on the old END */
+out:
+    /* Update END */
+    if (xive_router_set_end(xrtr, end_blk, end_idx, &end)) {
+        return H_HARDWARE;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_GET_QUEUE_CONFIG hcall() is used to get an EQ for a given
+ * target and priority.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Debug: Return debug data
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "priority" is a valid priority not in
+ *       "ibm,plat-res-int-priorities"
+ *
+ * Output:
+ * - R4: "flags":
+ *       Bits 0-61: Reserved
+ *       Bit 62: The value of Event Queue Generation Number (g) per
+ *              the XIVE spec if "Debug" = 1
+ *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
+ * - R5: The logical real address of the start of the EQ
+ * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
+ * - R7: The value of Event Queue Offset Counter per XIVE spec
+ *       if "Debug" = 1, else 0
+ *
+ */
+
+#define SPAPR_XIVE_END_DEBUG     PPC_BIT(63)
+
+static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
+                                           sPAPRMachineState *spapr,
+                                           target_ulong opcode,
+                                           target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    target_ulong flags = args[0];
+    target_ulong target = args[1];
+    target_ulong priority = args[2];
+    XiveEND end;
+    uint8_t end_blk;
+    uint32_t end_idx;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~SPAPR_XIVE_END_DEBUG) {
+        return H_PARAMETER;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    if (!spapr_xive_priority_is_valid(priority)) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority " TARGET_FMT_ld
+                      " requested\n", priority);
+        return H_P3;
+    }
+
+    /* Validate that "target" is part of the list of threads allocated
+     * to the partition. For that, find the END corresponding to the
+     * target.
+     */
+    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
+        return H_P2;
+    }
+
+    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
+        return H_HARDWARE;
+    }
+
+    args[0] = 0;
+    if (end.w0 & END_W0_UCOND_NOTIFY) {
+        args[0] |= SPAPR_XIVE_END_ALWAYS_NOTIFY;
+    }
+
+    if (end.w0 & END_W0_ENQUEUE) {
+        args[1] =
+            (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
+        args[2] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
+    } else {
+        args[1] = 0;
+        args[2] = 0;
+    }
+
+    /* TODO: do we need any locking on the END ? */
+    if (flags & SPAPR_XIVE_END_DEBUG) {
+        /* Load the event queue generation number into the return flags */
+        args[0] |= (uint64_t)GETFIELD(END_W1_GENERATION, end.w1) << 62;
+
+        /* Load R7 with the event queue offset counter */
+        args[3] = GETFIELD(END_W1_PAGE_OFF, end.w1);
+    } else {
+        args[3] = 0;
+    }
+
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
+ * reporting cache line pair for the calling thread.  The reporting
+ * cache lines will contain the OS interrupt context when the OS
+ * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
+ * interrupt. The reporting cache lines can be reset by inputting -1
+ * in "reportingLine".  Issuing the CI store byte without reporting
+ * cache lines registered will result in the data not being accessible
+ * to the OS.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "reportingLine": The logical real address of the reporting cache
+ *    line pair
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
+                                                sPAPRMachineState *spapr,
+                                                target_ulong opcode,
+                                                target_ulong *args)
+{
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    /* TODO: H_INT_SET_OS_REPORTING_LINE */
+    return H_FUNCTION;
+}
+
+/*
+ * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
+ * real address of the reporting cache line pair set for the input
+ * "target".  If no reporting cache line pair has been set, -1 is
+ * returned.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "target" is per "ibm,ppc-interrupt-server#s" or
+ *       "ibm,ppc-interrupt-gserver#s"
+ * - "reportingLine": The logical real address of the reporting cache
+ *   line pair
+ *
+ * Output:
+ * - R4: The logical real address of the reporting line if set, else -1
+ */
+static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
+                                                sPAPRMachineState *spapr,
+                                                target_ulong opcode,
+                                                target_ulong *args)
+{
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    /* TODO: H_INT_GET_OS_REPORTING_LINE */
+    return H_FUNCTION;
+}
+
+/*
+ * The H_INT_ESB hcall() is used to issue a load or store to the ESB
+ * page for the input "lisn".  This hcall is only supported for LISNs
+ * that have the ESB hcall flag set to 1 when returned from hcall()
+ * H_INT_GET_SOURCE_INFO.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-62: Reserved
+ *      Bit 63: Store: Store=1, store operation, else load operation
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ * - "esbOffset" is the offset into the ESB page for the load or store operation
+ * - "storeData" is the data to write for a store operation
+ *
+ * Output:
+ * - R4: The value of the load if load operation, else -1
+ */
+
+#define SPAPR_XIVE_ESB_STORE PPC_BIT(63)
+
+static target_ulong h_int_esb(PowerPCCPU *cpu,
+                              sPAPRMachineState *spapr,
+                              target_ulong opcode,
+                              target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveEAS eas;
+    target_ulong flags  = args[0];
+    target_ulong lisn   = args[1];
+    target_ulong offset = args[2];
+    target_ulong data   = args[3];
+    hwaddr mmio_addr;
+    XiveSource *xsrc = &xive->source;
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags & ~SPAPR_XIVE_ESB_STORE) {
+        return H_PARAMETER;
+    }
+
+    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
+        return H_P2;
+    }
+
+    if (!(eas.w & EAS_VALID)) {
+        return H_P2;
+    }
+
+    if (offset > (1ull << xsrc->esb_shift)) {
+        return H_P3;
+    }
+
+    mmio_addr = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn) + offset;
+
+    if (dma_memory_rw(&address_space_memory, mmio_addr, &data, 8,
+                      (flags & SPAPR_XIVE_ESB_STORE))) {
+        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
+                      HWADDR_PRIx "\n", mmio_addr);
+        return H_HARDWARE;
+    }
+    args[0] = (flags & SPAPR_XIVE_ESB_STORE) ? -1 : data;
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_SYNC hcall() is used to issue hardware syncs that will
+ * ensure any in flight events for the input lisn are in the event
+ * queue.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ * - "lisn" is per "interrupts", "interrupt-map", or
+ *      "ibm,xive-lisn-ranges" properties, or as returned by the
+ *      ibm,query-interrupt-source-number RTAS call, or as
+ *      returned by the H_ALLOCATE_VAS_WINDOW hcall
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_sync(PowerPCCPU *cpu,
+                               sPAPRMachineState *spapr,
+                               target_ulong opcode,
+                               target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    XiveEAS eas;
+    target_ulong flags = args[0];
+    target_ulong lisn = args[1];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
+        return H_P2;
+    }
+
+    if (!(eas.w & EAS_VALID)) {
+        return H_P2;
+    }
+
+    /*
+     * H_STATE should be returned if a H_INT_RESET is in progress.
+     * This is not needed when running the emulation under QEMU
+     */
+
+    /* This is not real hardware. Nothing to be done */
+    return H_SUCCESS;
+}
+
+/*
+ * The H_INT_RESET hcall() is used to reset all of the partition's
+ * interrupt exploitation structures to their initial state.  This
+ * means losing all previously set interrupt state set via
+ * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
+ *
+ * Parameters:
+ * Input:
+ * - "flags"
+ *      Bits 0-63: Reserved
+ *
+ * Output:
+ * - None
+ */
+static target_ulong h_int_reset(PowerPCCPU *cpu,
+                                sPAPRMachineState *spapr,
+                                target_ulong opcode,
+                                target_ulong *args)
+{
+    sPAPRXive *xive = spapr->xive;
+    target_ulong flags   = args[0];
+
+    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        return H_FUNCTION;
+    }
+
+    if (flags) {
+        return H_PARAMETER;
+    }
+
+    device_reset(DEVICE(xive));
+    return H_SUCCESS;
+}
+
+void spapr_xive_hcall_init(sPAPRMachineState *spapr)
+{
+    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
+    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
+    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
+    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
+    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
+    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
+    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
+                             h_int_set_os_reporting_line);
+    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
+                             h_int_get_os_reporting_line);
+    spapr_register_hypercall(H_INT_ESB, h_int_esb);
+    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
+    spapr_register_hypercall(H_INT_RESET, h_int_reset);
+}
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 2569ae1bc7f8..da6fcfaa3c52 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -258,6 +258,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
         error_propagate(errp, local_err);
         return;
     }
+
+    spapr_xive_hcall_init(spapr);
 }
 
 static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index 301a8e972d91..eacd26836ebf 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -38,7 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
 obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
-obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
+obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
-- 
2.17.2


* [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (15 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  4:31   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 18/36] spapr: allocate the interrupt thread context under the CPU core Cédric Le Goater
                   ` (18 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE interface for the guest is described in the device tree under
the "interrupt-controller" node. A couple of new properties are
specific to XIVE :

 - "reg"

   contains the base address and size of the thread interrupt
   management areas (TIMA), for the User level and for the Guest OS
   level. Only the Guest OS level is taken into account today.

 - "ibm,xive-eq-sizes"

   the supported event queue sizes. One cell per supported size, each
   containing log2 of the size, in ascending order.

 - "ibm,xive-lisn-ranges"

   the interrupt number ranges assigned to the guest for the IPIs.

and also under the root node :

 - "ibm,plat-res-int-priorities"

   contains a list of priorities that the hypervisor has reserved for
   its own use. OPAL uses the priority 7 queue to automatically
   escalate interrupts for all other queues (DD2.X POWER9). So only
   priorities [0..6] are allowed for the guest.

Extend the sPAPR IRQ backend with a new handler to populate the DT
with the appropriate "interrupt-controller" node.
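
As a rough illustration of how a guest might read these properties,
here is a minimal standalone sketch. It assumes that each range in
"ibm,plat-res-int-priorities" is a (start, count) pair of big-endian
cells, as the plat_res_int_priorities[] array in this patch suggests.
The helper names (cell_to_cpu, prio_is_reserved, eq_size_bytes) are
hypothetical and are not part of QEMU or libfdt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode one big-endian 32-bit device tree cell. */
static uint32_t cell_to_cpu(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Hypothetical helper: return true if 'prio' falls in one of the
 * (start, count) ranges of a raw "ibm,plat-res-int-priorities" value.
 */
static bool prio_is_reserved(const uint8_t *prop, size_t len, uint32_t prio)
{
    size_t i;

    /* Assumption: the property is a whole number of 2-cell pairs. */
    assert(len % 8 == 0);

    for (i = 0; i + 8 <= len; i += 8) {
        uint32_t start = cell_to_cpu(prop + i);
        uint32_t count = cell_to_cpu(prop + i + 4);

        if (prio >= start && prio - start < count) {
            return true;
        }
    }
    return false;
}

/* Hypothetical helper: "ibm,xive-eq-sizes" cells hold log2 of the size. */
static uint64_t eq_size_bytes(uint32_t log2_size)
{
    return 1ull << log2_size;
}
```

With the values used in this patch (start 7, count 0xf8), priorities
0 to 6 are left to the guest, and a cell of 16 in "ibm,xive-eq-sizes"
stands for a 64K event queue.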

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h  |  2 ++
 include/hw/ppc/spapr_xive.h |  2 ++
 hw/intc/spapr_xive_hcall.c  | 62 +++++++++++++++++++++++++++++++++++++
 hw/ppc/spapr.c              |  3 +-
 hw/ppc/spapr_irq.c          | 17 ++++++++++
 5 files changed, 85 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index c854ae527808..cfdc1f86e713 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -40,6 +40,8 @@ typedef struct sPAPRIrq {
     void (*free)(sPAPRMachineState *spapr, int irq, int num);
     qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
     void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
+    void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
+                        void *fdt, uint32_t phandle);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 418511f3dc10..5b3fab192d41 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -65,5 +65,7 @@ bool spapr_xive_priority_is_valid(uint8_t priority);
 typedef struct sPAPRMachineState sPAPRMachineState;
 
 void spapr_xive_hcall_init(sPAPRMachineState *spapr);
+void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
+                   uint32_t phandle);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
index 52e4e23995f5..66c78aa88500 100644
--- a/hw/intc/spapr_xive_hcall.c
+++ b/hw/intc/spapr_xive_hcall.c
@@ -890,3 +890,65 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
     spapr_register_hypercall(H_INT_SYNC, h_int_sync);
     spapr_register_hypercall(H_INT_RESET, h_int_reset);
 }
+
+void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt, uint32_t phandle)
+{
+    int node;
+    uint64_t timas[2 * 2];
+    /* Interrupt number ranges for the IPIs */
+    uint32_t lisn_ranges[] = {
+        cpu_to_be32(0),
+        cpu_to_be32(nr_servers),
+    };
+    uint32_t eq_sizes[] = {
+        cpu_to_be32(12), /* 4K */
+        cpu_to_be32(16), /* 64K */
+        cpu_to_be32(21), /* 2M */
+        cpu_to_be32(24), /* 16M */
+    };
+    /* The following array is in sync with the 'spapr_xive_priority_is_valid'
+     * routine above. The O/S is expected to choose priority 6.
+     */
+    uint32_t plat_res_int_priorities[] = {
+        cpu_to_be32(7),    /* start */
+        cpu_to_be32(0xf8), /* count */
+    };
+    gchar *nodename;
+
+    /* Thread Interrupt Management Area : User (ring 3) and OS (ring 2) */
+    timas[0] = cpu_to_be64(xive->tm_base + 3 * (1ull << TM_SHIFT));
+    timas[1] = cpu_to_be64(1ull << TM_SHIFT);
+    timas[2] = cpu_to_be64(xive->tm_base + 2 * (1ull << TM_SHIFT));
+    timas[3] = cpu_to_be64(1ull << TM_SHIFT);
+
+    nodename = g_strdup_printf("interrupt-controller@%" PRIx64,
+                               xive->tm_base + 3 * (1 << TM_SHIFT));
+    _FDT(node = fdt_add_subnode(fdt, 0, nodename));
+    g_free(nodename);
+
+    _FDT(fdt_setprop_string(fdt, node, "device_type", "power-ivpe"));
+    _FDT(fdt_setprop(fdt, node, "reg", timas, sizeof(timas)));
+
+    _FDT(fdt_setprop_string(fdt, node, "compatible", "ibm,power-ivpe"));
+    _FDT(fdt_setprop(fdt, node, "ibm,xive-eq-sizes", eq_sizes,
+                     sizeof(eq_sizes)));
+    _FDT(fdt_setprop(fdt, node, "ibm,xive-lisn-ranges", lisn_ranges,
+                     sizeof(lisn_ranges)));
+
+    /* For Linux to link the LSIs to the main interrupt controller.
+     * These properties are not in XIVE exploitation mode sPAPR
+     * specs
+     */
+    _FDT(fdt_setprop(fdt, node, "interrupt-controller", NULL, 0));
+    _FDT(fdt_setprop_cell(fdt, node, "#interrupt-cells", 2));
+
+    /* For SLOF */
+    _FDT(fdt_setprop_cell(fdt, node, "linux,phandle", phandle));
+    _FDT(fdt_setprop_cell(fdt, node, "phandle", phandle));
+
+    /* The "ibm,plat-res-int-priorities" property defines the priority
+     * ranges reserved by the hypervisor
+     */
+    _FDT(fdt_setprop(fdt, 0, "ibm,plat-res-int-priorities",
+                     plat_res_int_priorities, sizeof(plat_res_int_priorities)));
+}
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 9f8c19e56e7a..ad1692cdcd0f 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1270,7 +1270,8 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr,
     _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
 
     /* /interrupt controller */
-    spapr_dt_xics(xics_max_server_number(spapr), fdt, PHANDLE_XICP);
+    smc->irq->dt_populate(spapr, xics_max_server_number(spapr), fdt,
+                          PHANDLE_XICP);
 
     ret = spapr_populate_memory(spapr, fdt);
     if (ret < 0) {
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index da6fcfaa3c52..d88a029d8c5c 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -190,6 +190,13 @@ static void spapr_irq_print_info_xics(sPAPRMachineState *spapr, Monitor *mon)
     ics_pic_print_info(spapr->ics, mon);
 }
 
+static void spapr_irq_dt_populate_xics(sPAPRMachineState *spapr,
+                                       uint32_t nr_servers, void *fdt,
+                                       uint32_t phandle)
+{
+    spapr_dt_xics(nr_servers, fdt, phandle);
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS     \
     (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -203,6 +210,7 @@ sPAPRIrq spapr_irq_xics = {
     .free        = spapr_irq_free_xics,
     .qirq        = spapr_qirq_xics,
     .print_info  = spapr_irq_print_info_xics,
+    .dt_populate = spapr_irq_dt_populate_xics,
 };
 
  /*
@@ -300,6 +308,13 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
     spapr_xive_pic_print_info(spapr->xive, mon);
 }
 
+static void spapr_irq_dt_populate_xive(sPAPRMachineState *spapr,
+                                       uint32_t nr_servers, void *fdt,
+                                       uint32_t phandle)
+{
+    spapr_dt_xive(spapr->xive, nr_servers, fdt, phandle);
+}
+
 /*
  * XIVE uses the full IRQ number space. Set it to 8K to be compatible
  * with XICS.
@@ -317,6 +332,7 @@ sPAPRIrq spapr_irq_xive = {
     .free        = spapr_irq_free_xive,
     .qirq        = spapr_qirq_xive,
     .print_info  = spapr_irq_print_info_xive,
+    .dt_populate = spapr_irq_dt_populate_xive,
 };
 
 /*
@@ -421,4 +437,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
     .free        = spapr_irq_free_xics,
     .qirq        = spapr_qirq_xics,
     .print_info  = spapr_irq_print_info_xics,
+    .dt_populate = spapr_irq_dt_populate_xics,
 };
-- 
2.17.2


* [Qemu-devel] [PATCH v5 18/36] spapr: allocate the interrupt thread context under the CPU core
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (16 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  4:39   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type Cédric Le Goater
                   ` (17 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Each interrupt mode has its own interrupt presenter object, which is
stored under the CPU object: one for XICS and one for XIVE.

Extend the sPAPR IRQ backend with a new handler to support them both.
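
The dispatch pattern can be sketched outside of QEMU as follows. This
is a reduced model, not the QEMU code: the Presenter and IrqBackend
types and the realize_vcpu() function are hypothetical stand-ins for
the QOM objects, the sPAPRIrq structure and spapr_realize_vcpu(), and
the type strings are only illustrative.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for the per-CPU interrupt presenter object. */
typedef struct Presenter {
    const char *type;
} Presenter;

/* Hypothetical, reduced backend holding a single handler. */
typedef struct IrqBackend {
    const char *name;
    Presenter *(*cpu_intc_create)(Presenter *storage);
} IrqBackend;

static Presenter *create_xics(Presenter *storage)
{
    storage->type = "icp";          /* illustrative type name */
    return storage;
}

static Presenter *create_xive(Presenter *storage)
{
    storage->type = "xive-tctx";    /* illustrative type name */
    return storage;
}

static const IrqBackend backend_xics = { "xics", create_xics };
static const IrqBackend backend_xive = { "xive", create_xive };

/* The CPU realize path only sees the backend, never the concrete type. */
static Presenter *realize_vcpu(const IrqBackend *irq, Presenter *storage)
{
    assert(irq != NULL && irq->cpu_intc_create != NULL);
    return irq->cpu_intc_create(storage);
}
```

The point of the indirection is that the CPU core code no longer needs
to include the XICS header or know which presenter it is creating.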

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr.h     |  1 +
 include/hw/ppc/spapr_irq.h |  2 ++
 include/hw/ppc/xive.h      |  2 ++
 hw/intc/xive.c             | 21 +++++++++++++++++++++
 hw/ppc/spapr_cpu_core.c    |  5 ++---
 hw/ppc/spapr_irq.c         | 17 +++++++++++++++++
 6 files changed, 45 insertions(+), 3 deletions(-)

diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
index 8415faea7b82..f43ef69d61bc 100644
--- a/include/hw/ppc/spapr.h
+++ b/include/hw/ppc/spapr.h
@@ -177,6 +177,7 @@ struct sPAPRMachineState {
     int32_t irq_map_nr;
     unsigned long *irq_map;
     sPAPRXive  *xive;
+    const char *xive_tctx_type;
 
     bool cmd_line_caps[SPAPR_CAP_NUM];
     sPAPRCapabilities def, eff, mig;
diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index cfdc1f86e713..c3b4c38145eb 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -42,6 +42,8 @@ typedef struct sPAPRIrq {
     void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
     void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
                         void *fdt, uint32_t phandle);
+    Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
+                               Error **errp);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index e6931ddaa83f..b74eb326dcd1 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -284,6 +284,8 @@ typedef struct XiveTCTX {
 extern const MemoryRegionOps xive_tm_ops;
 
 void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
+Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
+                         Error **errp);
 
 static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
 {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index fc6ef5895e6d..7d921023e2ee 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -579,6 +579,27 @@ static const TypeInfo xive_tctx_info = {
     .class_init    = xive_tctx_class_init,
 };
 
+Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
+                         Error **errp)
+{
+    Error *local_err = NULL;
+    Object *obj;
+
+    obj = object_new(type);
+    object_property_add_child(cpu, type, obj, &error_abort);
+    object_unref(obj);
+    object_property_add_const_link(obj, "cpu", cpu, &error_abort);
+    object_property_add_const_link(obj, "xive", OBJECT(xrtr), &error_abort);
+    object_property_set_bool(obj, true, "realized", &local_err);
+    if (local_err) {
+        object_unparent(obj);
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    return obj;
+}
+
 /*
  * XIVE ESB helpers
  */
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 2398ce62c0e7..1811cd48db90 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -11,7 +11,6 @@
 #include "hw/ppc/spapr_cpu_core.h"
 #include "target/ppc/cpu.h"
 #include "hw/ppc/spapr.h"
-#include "hw/ppc/xics.h" /* for icp_create() - to be removed */
 #include "hw/boards.h"
 #include "qapi/error.h"
 #include "sysemu/cpus.h"
@@ -215,6 +214,7 @@ static void spapr_cpu_core_unrealize(DeviceState *dev, Error **errp)
 static void spapr_realize_vcpu(PowerPCCPU *cpu, sPAPRMachineState *spapr,
                                sPAPRCPUCore *sc, Error **errp)
 {
+    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
     CPUPPCState *env = &cpu->env;
     CPUState *cs = CPU(cpu);
     Error *local_err = NULL;
@@ -233,8 +233,7 @@ static void spapr_realize_vcpu(PowerPCCPU *cpu, sPAPRMachineState *spapr,
     qemu_register_reset(spapr_cpu_reset, cpu);
     spapr_cpu_reset(cpu);
 
-    cpu->intc = icp_create(OBJECT(cpu), spapr->icp_type, XICS_FABRIC(spapr),
-                           &local_err);
+    cpu->intc = smc->irq->cpu_intc_create(spapr, OBJECT(cpu), &local_err);
     if (local_err) {
         goto error_unregister;
     }
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index d88a029d8c5c..253abc10e780 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -197,6 +197,12 @@ static void spapr_irq_dt_populate_xics(sPAPRMachineState *spapr,
     spapr_dt_xics(nr_servers, fdt, phandle);
 }
 
+static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
+                                              Object *cpu, Error **errp)
+{
+    return icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), errp);
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS     \
     (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -211,6 +217,7 @@ sPAPRIrq spapr_irq_xics = {
     .qirq        = spapr_qirq_xics,
     .print_info  = spapr_irq_print_info_xics,
     .dt_populate = spapr_irq_dt_populate_xics,
+    .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
 };
 
  /*
@@ -267,6 +274,7 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
         return;
     }
 
+    spapr->xive_tctx_type = TYPE_XIVE_TCTX;
     spapr_xive_hcall_init(spapr);
 }
 
@@ -315,6 +323,13 @@ static void spapr_irq_dt_populate_xive(sPAPRMachineState *spapr,
     spapr_dt_xive(spapr->xive, nr_servers, fdt, phandle);
 }
 
+static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
+                                              Object *cpu, Error **errp)
+{
+    return xive_tctx_create(cpu, spapr->xive_tctx_type,
+                            XIVE_ROUTER(spapr->xive), errp);
+}
+
 /*
  * XIVE uses the full IRQ number space. Set it to 8K to be compatible
  * with XICS.
@@ -333,6 +348,7 @@ sPAPRIrq spapr_irq_xive = {
     .qirq        = spapr_qirq_xive,
     .print_info  = spapr_irq_print_info_xive,
     .dt_populate = spapr_irq_dt_populate_xive,
+    .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
 };
 
 /*
@@ -438,4 +454,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
     .qirq        = spapr_qirq_xics,
     .print_info  = spapr_irq_print_info_xics,
     .dt_populate = spapr_irq_dt_populate_xics,
+    .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
 };
-- 
2.17.2


* [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (17 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 18/36] spapr: allocate the interrupt thread context under the CPU core Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  4:42   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models Cédric Le Goater
                   ` (16 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The interrupt mode of this machine is statically set to XIVE only.
The guest OS is required to have support for the XIVE exploitation
mode of the POWER9 interrupt controller.
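
For reference, the (byte index, value) pair that ends up in
"ibm,arch-vec-5-platform-support" can be modelled as below. This is a
standalone sketch: the macro and function names are hypothetical, and
only the two values set by this patch (0x00 and 0x40) are covered.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names for the values used in this patch. */
#define OV5_XIVE_BYTE  23    /* option vector 5 byte holding the mode */
#define OV5_XICS_ONLY  0x00
#define OV5_XIVE_ONLY  0x40

/* Fill the first (index, value) pair from the backend's ov5 field. */
static void fill_xive_pair(uint8_t pair[2], uint8_t backend_ov5)
{
    pair[0] = OV5_XIVE_BYTE;
    pair[1] = backend_ov5;
}

/* Describe the advertised mode, for the two values this sketch knows. */
static const char *ov5_mode_name(uint8_t value)
{
    switch (value) {
    case OV5_XICS_ONLY:
        return "XICS only";
    case OV5_XIVE_ONLY:
        return "XIVE exploitation mode only";
    default:
        return "not covered by this sketch";
    }
}
```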

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h |  1 +
 hw/ppc/spapr.c             | 36 +++++++++++++++++++++++++++++++-----
 hw/ppc/spapr_irq.c         |  3 +++
 3 files changed, 35 insertions(+), 5 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index c3b4c38145eb..b299dd794bff 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -33,6 +33,7 @@ void spapr_irq_msi_reset(sPAPRMachineState *spapr);
 typedef struct sPAPRIrq {
     uint32_t    nr_irqs;
     uint32_t    nr_msis;
+    uint8_t     ov5;
 
     void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
                  Error **errp);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index ad1692cdcd0f..8fbb743769db 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1097,12 +1097,14 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
     spapr_dt_rtas_tokens(fdt, rtas);
 }
 
-/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
- * that the guest may request and thus the valid values for bytes 24..26 of
- * option vector 5: */
-static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
+/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
+ * and the XIVE features that the guest may request and thus the valid
+ * values for bytes 23..26 of option vector 5: */
+static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
+                                          int chosen)
 {
     PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
+    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
 
     char val[2 * 4] = {
         23, 0x00, /* Xive mode, filled in below. */
@@ -1123,7 +1125,11 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
         } else {
             val[3] = 0x00; /* Hash */
         }
+        /* TODO: test KVM support */
+        val[1] = smc->irq->ov5;
     } else {
+        val[1] = smc->irq->ov5;
+
         /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
         val[3] = 0xC0;
     }
@@ -1191,7 +1197,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
         _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
     }
 
-    spapr_dt_ov5_platform_support(fdt, chosen);
+    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
 
     g_free(stdout_path);
     g_free(bootlist);
@@ -2622,6 +2628,11 @@ static void spapr_machine_init(MachineState *machine)
     /* advertise support for ibm,dyamic-memory-v2 */
     spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
 
+    /* advertise XIVE */
+    if (smc->irq->ov5) {
+        spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
+    }
+
     /* init CPUs */
     spapr_init_cpus(spapr);
 
@@ -3971,6 +3982,21 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
 
 DEFINE_SPAPR_MACHINE(3_1, "3.1", true);
 
+static void spapr_machine_3_1_xive_instance_options(MachineState *machine)
+{
+    spapr_machine_3_1_instance_options(machine);
+}
+
+static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
+{
+    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
+    spapr_machine_3_1_class_options(mc);
+    smc->irq = &spapr_irq_xive;
+}
+
+DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
+
 /*
  * pseries-3.0
  */
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 253abc10e780..42e73851b174 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -210,6 +210,7 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
 sPAPRIrq spapr_irq_xics = {
     .nr_irqs     = SPAPR_IRQ_XICS_NR_IRQS,
     .nr_msis     = SPAPR_IRQ_XICS_NR_MSIS,
+    .ov5         = 0x0, /* XICS only */
 
     .init        = spapr_irq_init_xics,
     .claim       = spapr_irq_claim_xics,
@@ -341,6 +342,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
 sPAPRIrq spapr_irq_xive = {
     .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
     .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
+    .ov5         = 0x40, /* XIVE exploitation mode only */
 
     .init        = spapr_irq_init_xive,
     .claim       = spapr_irq_claim_xive,
@@ -447,6 +449,7 @@ int spapr_irq_find(sPAPRMachineState *spapr, int num, bool align, Error **errp)
 sPAPRIrq spapr_irq_xics_legacy = {
     .nr_irqs     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
     .nr_msis     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
+    .ov5         = 0x0, /* XICS only */
 
     .init        = spapr_irq_init_xics,
     .claim       = spapr_irq_claim_xics,
-- 
2.17.2


* [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (18 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  5:13   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS migration Cédric Le Goater
                   ` (15 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The XIVE models for the QEMU and KVM accelerators will have a lot in
common. Introduce an abstract class for each of the source, the thread
context and the interrupt controller objects to handle the differences
in object initialization. These classes will also be used to define
state synchronization handlers for monitor and migration use.

This is very much like the XICS models.
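
The realize chaining this relies on can be sketched in plain C as
follows. This is not QOM code: the Dev and DevClass types and the
class_set_parent_realize() helper are hypothetical and only mimic what
device_class_set_parent_realize() does in the patch, which is to save
the parent handler before installing the child one.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Dev Dev;
typedef int (*RealizeFn)(Dev *dev);

/* Hypothetical class: current handler plus the saved parent handler. */
typedef struct DevClass {
    RealizeFn realize;
    RealizeFn parent_realize;
} DevClass;

struct Dev {
    DevClass *klass;
    int tables_ready;   /* set by the common (base) realize */
    int tima_ready;     /* set by the accelerator-specific realize */
};

/* Common part, shared by every concrete model. */
static int base_realize(Dev *dev)
{
    dev->tables_ready = 1;
    return 0;
}

/* Specific part: run the parent first, then add the TIMA setup. */
static int emulated_realize(Dev *dev)
{
    int ret;

    assert(dev->klass->parent_realize != NULL);
    ret = dev->klass->parent_realize(dev);
    if (ret < 0) {
        return ret;
    }
    dev->tima_ready = 1;
    return 0;
}

/* Save the inherited handler, then install the child handler. */
static void class_set_parent_realize(DevClass *dc, RealizeFn child)
{
    dc->parent_realize = dc->realize;
    dc->realize = child;
}
```

A KVM flavour would install a different child handler on top of the
same base handler, which is where the initialization differences are
expected to live.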

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_xive.h |  15 +++++
 include/hw/ppc/xive.h       |  30 ++++++++++
 hw/intc/spapr_xive.c        |  86 +++++++++++++++++++---------
 hw/intc/xive.c              | 109 +++++++++++++++++++++++++-----------
 hw/ppc/spapr_irq.c          |   4 +-
 5 files changed, 182 insertions(+), 62 deletions(-)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 5b3fab192d41..aca2969a09ab 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -13,6 +13,10 @@
 #include "hw/sysbus.h"
 #include "hw/ppc/xive.h"
 
+#define TYPE_SPAPR_XIVE_BASE "spapr-xive-base"
+#define SPAPR_XIVE_BASE(obj) \
+    OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_BASE)
+
 #define TYPE_SPAPR_XIVE "spapr-xive"
 #define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
 
@@ -38,6 +42,17 @@ typedef struct sPAPRXive {
     MemoryRegion  tm_mmio;
 } sPAPRXive;
 
+#define SPAPR_XIVE_BASE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(sPAPRXiveClass, (klass), TYPE_SPAPR_XIVE_BASE)
+#define SPAPR_XIVE_BASE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(sPAPRXiveClass, (obj), TYPE_SPAPR_XIVE_BASE)
+
+typedef struct sPAPRXiveClass {
+    XiveRouterClass parent_class;
+
+    DeviceRealize   parent_realize;
+} sPAPRXiveClass;
+
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index b74eb326dcd1..281ed370121c 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -38,6 +38,10 @@ typedef struct XiveFabricClass {
  * XIVE Interrupt Source
  */
 
+#define TYPE_XIVE_SOURCE_BASE "xive-source-base"
+#define XIVE_SOURCE_BASE(obj) \
+    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_BASE)
+
 #define TYPE_XIVE_SOURCE "xive-source"
 #define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
 
@@ -68,6 +72,18 @@ typedef struct XiveSource {
     XiveFabric      *xive;
 } XiveSource;
 
+#define XIVE_SOURCE_BASE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(XiveSourceClass, (klass), TYPE_XIVE_SOURCE_BASE)
+#define XIVE_SOURCE_BASE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(XiveSourceClass, (obj), TYPE_XIVE_SOURCE_BASE)
+
+typedef struct XiveSourceClass {
+    SysBusDeviceClass parent_class;
+
+    DeviceRealize     parent_realize;
+    DeviceReset       parent_reset;
+} XiveSourceClass;
+
 /*
  * ESB MMIO setting. Can be one page, for both source triggering and
  * source management, or two different pages. See below for magic
@@ -253,6 +269,9 @@ void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
  * XIVE Thread interrupt Management (TM) context
  */
 
+#define TYPE_XIVE_TCTX_BASE "xive-tctx-base"
+#define XIVE_TCTX_BASE(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_BASE)
+
 #define TYPE_XIVE_TCTX "xive-tctx"
 #define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
 
@@ -278,6 +297,17 @@ typedef struct XiveTCTX {
     XiveRouter  *xrtr;
 } XiveTCTX;
 
+#define XIVE_TCTX_BASE_CLASS(klass) \
+     OBJECT_CLASS_CHECK(XiveTCTXClass, (klass), TYPE_XIVE_TCTX_BASE)
+#define XIVE_TCTX_BASE_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(XiveTCTXClass, (obj), TYPE_XIVE_TCTX_BASE)
+
+typedef struct XiveTCTXClass {
+    DeviceClass       parent_class;
+
+    DeviceRealize     parent_realize;
+} XiveTCTXClass;
+
 /*
  * XIVE Thread Interrupt Management Aera (TIMA)
  */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index 3bf77ace11a2..ec85f7e4f88d 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -53,9 +53,9 @@ static void spapr_xive_mmio_map(sPAPRXive *xive)
     sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
 }
 
-static void spapr_xive_reset(DeviceState *dev)
+static void spapr_xive_base_reset(DeviceState *dev)
 {
-    sPAPRXive *xive = SPAPR_XIVE(dev);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
     int i;
 
     /* Xive Source reset is done through SysBus, it should put all
@@ -76,9 +76,9 @@ static void spapr_xive_reset(DeviceState *dev)
     spapr_xive_mmio_map(xive);
 }
 
-static void spapr_xive_instance_init(Object *obj)
+static void spapr_xive_base_instance_init(Object *obj)
 {
-    sPAPRXive *xive = SPAPR_XIVE(obj);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(obj);
 
     object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
     object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
@@ -89,9 +89,9 @@ static void spapr_xive_instance_init(Object *obj)
                               NULL);
 }
 
-static void spapr_xive_realize(DeviceState *dev, Error **errp)
+static void spapr_xive_base_realize(DeviceState *dev, Error **errp)
 {
-    sPAPRXive *xive = SPAPR_XIVE(dev);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
     XiveSource *xsrc = &xive->source;
     XiveENDSource *end_xsrc = &xive->end_source;
     Error *local_err = NULL;
@@ -142,16 +142,11 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
      */
     xive->eat = g_new0(XiveEAS, xive->nr_irqs);
     xive->endt = g_new0(XiveEND, xive->nr_ends);
-
-    /* TIMA initialization */
-    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
-                          "xive.tima", 4ull << TM_SHIFT);
-    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
 }
 
 static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
 {
-    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
 
     if (lisn >= xive->nr_irqs) {
         return -1;
@@ -163,7 +158,7 @@ static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
 
 static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
 {
-    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
 
     if (lisn >= xive->nr_irqs) {
         return -1;
@@ -176,7 +171,7 @@ static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
 static int spapr_xive_get_end(XiveRouter *xrtr,
                               uint8_t end_blk, uint32_t end_idx, XiveEND *end)
 {
-    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
 
     if (end_idx >= xive->nr_ends) {
         return -1;
@@ -189,7 +184,7 @@ static int spapr_xive_get_end(XiveRouter *xrtr,
 static int spapr_xive_set_end(XiveRouter *xrtr,
                               uint8_t end_blk, uint32_t end_idx, XiveEND *end)
 {
-    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
 
     if (end_idx >= xive->nr_ends) {
         return -1;
@@ -202,7 +197,7 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
 static int spapr_xive_get_nvt(XiveRouter *xrtr,
                               uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
 {
-    sPAPRXive *xive = SPAPR_XIVE(xrtr);
+    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
     uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
     PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
 
@@ -236,7 +231,7 @@ static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
     uint32_t nvt_idx;
     uint32_t nvt_cam;
 
-    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
+    spapr_xive_cpu_to_nvt(SPAPR_XIVE_BASE(xrtr), POWERPC_CPU(tctx->cs),
                           &nvt_blk, &nvt_idx);
 
     nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
@@ -359,7 +354,7 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
     },
 };
 
-static const VMStateDescription vmstate_spapr_xive = {
+static const VMStateDescription vmstate_spapr_xive_base = {
     .name = TYPE_SPAPR_XIVE,
     .version_id = 1,
     .minimum_version_id = 1,
@@ -373,7 +368,7 @@ static const VMStateDescription vmstate_spapr_xive = {
     },
 };
 
-static Property spapr_xive_properties[] = {
+static Property spapr_xive_base_properties[] = {
     DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
     DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
     DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
@@ -381,16 +376,16 @@ static Property spapr_xive_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static void spapr_xive_class_init(ObjectClass *klass, void *data)
+static void spapr_xive_base_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
     XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
 
     dc->desc    = "sPAPR XIVE Interrupt Controller";
-    dc->props   = spapr_xive_properties;
-    dc->realize = spapr_xive_realize;
-    dc->reset   = spapr_xive_reset;
-    dc->vmsd    = &vmstate_spapr_xive;
+    dc->props   = spapr_xive_base_properties;
+    dc->realize = spapr_xive_base_realize;
+    dc->reset   = spapr_xive_base_reset;
+    dc->vmsd    = &vmstate_spapr_xive_base;
 
     xrc->get_eas = spapr_xive_get_eas;
     xrc->set_eas = spapr_xive_set_eas;
@@ -401,16 +396,55 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
     xrc->reset_tctx = spapr_xive_reset_tctx;
 }
 
+static const TypeInfo spapr_xive_base_info = {
+    .name = TYPE_SPAPR_XIVE_BASE,
+    .parent = TYPE_XIVE_ROUTER,
+    .abstract = true,
+    .instance_init = spapr_xive_base_instance_init,
+    .instance_size = sizeof(sPAPRXive),
+    .class_init = spapr_xive_base_class_init,
+    .class_size = sizeof(sPAPRXiveClass),
+};
+
+static void spapr_xive_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE(dev);
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    sxc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* TIMA */
+    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
+                          "xive.tima", 4ull << TM_SHIFT);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
+}
+
+static void spapr_xive_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
+
+    device_class_set_parent_realize(dc, spapr_xive_realize,
+                                    &sxc->parent_realize);
+}
+
 static const TypeInfo spapr_xive_info = {
     .name = TYPE_SPAPR_XIVE,
-    .parent = TYPE_XIVE_ROUTER,
-    .instance_init = spapr_xive_instance_init,
+    .parent = TYPE_SPAPR_XIVE_BASE,
+    .instance_init = spapr_xive_base_instance_init,
     .instance_size = sizeof(sPAPRXive),
     .class_init = spapr_xive_class_init,
+    .class_size = sizeof(sPAPRXiveClass),
 };
 
 static void spapr_xive_register_types(void)
 {
+    type_register_static(&spapr_xive_base_info);
     type_register_static(&spapr_xive_info);
 }
 
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 7d921023e2ee..9bb37553c9ec 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -478,9 +478,9 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
     return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
 }
 
-static void xive_tctx_reset(void *dev)
+static void xive_tctx_base_reset(void *dev)
 {
-    XiveTCTX *tctx = XIVE_TCTX(dev);
+    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
     XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
 
     memset(tctx->regs, 0, sizeof(tctx->regs));
@@ -506,9 +506,9 @@ static void xive_tctx_reset(void *dev)
     }
 }
 
-static void xive_tctx_realize(DeviceState *dev, Error **errp)
+static void xive_tctx_base_realize(DeviceState *dev, Error **errp)
 {
-    XiveTCTX *tctx = XIVE_TCTX(dev);
+    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
     PowerPCCPU *cpu;
     CPUPPCState *env;
     Object *obj;
@@ -544,15 +544,15 @@ static void xive_tctx_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    qemu_register_reset(xive_tctx_reset, dev);
+    qemu_register_reset(xive_tctx_base_reset, dev);
 }
 
-static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
+static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
 {
-    qemu_unregister_reset(xive_tctx_reset, dev);
+    qemu_unregister_reset(xive_tctx_base_reset, dev);
 }
 
-static const VMStateDescription vmstate_xive_tctx = {
+static const VMStateDescription vmstate_xive_tctx_base = {
     .name = TYPE_XIVE_TCTX,
     .version_id = 1,
     .minimum_version_id = 1,
@@ -562,21 +562,28 @@ static const VMStateDescription vmstate_xive_tctx = {
     },
 };
 
-static void xive_tctx_class_init(ObjectClass *klass, void *data)
+static void xive_tctx_base_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
 
-    dc->realize = xive_tctx_realize;
-    dc->unrealize = xive_tctx_unrealize;
+    dc->realize = xive_tctx_base_realize;
+    dc->unrealize = xive_tctx_base_unrealize;
     dc->desc = "XIVE Interrupt Thread Context";
-    dc->vmsd = &vmstate_xive_tctx;
+    dc->vmsd = &vmstate_xive_tctx_base;
 }
 
-static const TypeInfo xive_tctx_info = {
-    .name          = TYPE_XIVE_TCTX,
+static const TypeInfo xive_tctx_base_info = {
+    .name          = TYPE_XIVE_TCTX_BASE,
     .parent        = TYPE_DEVICE,
+    .abstract      = true,
     .instance_size = sizeof(XiveTCTX),
-    .class_init    = xive_tctx_class_init,
+    .class_init    = xive_tctx_base_class_init,
+    .class_size    = sizeof(XiveTCTXClass),
+};
+
+static const TypeInfo xive_tctx_info = {
+    .name          = TYPE_XIVE_TCTX,
+    .parent        = TYPE_XIVE_TCTX_BASE,
 };
 
 Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
@@ -933,9 +940,9 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
     }
 }
 
-static void xive_source_reset(DeviceState *dev)
+static void xive_source_base_reset(DeviceState *dev)
 {
-    XiveSource *xsrc = XIVE_SOURCE(dev);
+    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
 
     /* Do not clear the LSI bitmap */
 
@@ -943,9 +950,9 @@ static void xive_source_reset(DeviceState *dev)
     memset(xsrc->status, 0x1, xsrc->nr_irqs);
 }
 
-static void xive_source_realize(DeviceState *dev, Error **errp)
+static void xive_source_base_realize(DeviceState *dev, Error **errp)
 {
-    XiveSource *xsrc = XIVE_SOURCE(dev);
+    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
     Object *obj;
     Error *local_err = NULL;
 
@@ -971,21 +978,14 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
         return;
     }
 
-    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
-                                     xsrc->nr_irqs);
-
     xsrc->status = g_malloc0(xsrc->nr_irqs);
 
     xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
     xsrc->lsi_map_size = xsrc->nr_irqs;
 
-    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
-                          &xive_source_esb_ops, xsrc, "xive.esb",
-                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
-    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
 }
 
-static const VMStateDescription vmstate_xive_source = {
+static const VMStateDescription vmstate_xive_source_base = {
     .name = TYPE_XIVE_SOURCE,
     .version_id = 1,
     .minimum_version_id = 1,
@@ -1001,29 +1001,68 @@ static const VMStateDescription vmstate_xive_source = {
  * The default XIVE interrupt source setting for the ESB MMIOs is two
  * 64k pages without Store EOI, to be in sync with KVM.
  */
-static Property xive_source_properties[] = {
+static Property xive_source_base_properties[] = {
     DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
     DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
     DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
     DEFINE_PROP_END_OF_LIST(),
 };
 
-static void xive_source_class_init(ObjectClass *klass, void *data)
+static void xive_source_base_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
 
     dc->desc    = "XIVE Interrupt Source";
-    dc->props   = xive_source_properties;
-    dc->realize = xive_source_realize;
-    dc->reset   = xive_source_reset;
-    dc->vmsd    = &vmstate_xive_source;
+    dc->props   = xive_source_base_properties;
+    dc->realize = xive_source_base_realize;
+    dc->reset   = xive_source_base_reset;
+    dc->vmsd    = &vmstate_xive_source_base;
+}
+
+static const TypeInfo xive_source_base_info = {
+    .name          = TYPE_XIVE_SOURCE_BASE,
+    .parent        = TYPE_SYS_BUS_DEVICE,
+    .abstract      = true,
+    .instance_size = sizeof(XiveSource),
+    .class_init    = xive_source_base_class_init,
+    .class_size    = sizeof(XiveSourceClass),
+};
+
+static void xive_source_realize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE(dev);
+    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    xsc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc, xsrc->nr_irqs);
+
+    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
+                          &xive_source_esb_ops, xsrc, "xive.esb",
+                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
+    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
+}
+
+static void xive_source_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
+
+    device_class_set_parent_realize(dc, xive_source_realize,
+                                    &xsc->parent_realize);
 }
 
 static const TypeInfo xive_source_info = {
     .name          = TYPE_XIVE_SOURCE,
-    .parent        = TYPE_SYS_BUS_DEVICE,
+    .parent        = TYPE_XIVE_SOURCE_BASE,
     .instance_size = sizeof(XiveSource),
     .class_init    = xive_source_class_init,
+    .class_size    = sizeof(XiveSourceClass),
 };
 
 /*
@@ -1659,10 +1698,12 @@ static const TypeInfo xive_fabric_info = {
 
 static void xive_register_types(void)
 {
+    type_register_static(&xive_source_base_info);
     type_register_static(&xive_source_info);
     type_register_static(&xive_fabric_info);
     type_register_static(&xive_router_info);
     type_register_static(&xive_end_source_info);
+    type_register_static(&xive_tctx_base_info);
     type_register_static(&xive_tctx_info);
 }
 
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 42e73851b174..f6e9e44d4cf9 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -243,7 +243,7 @@ static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
         return NULL;
     }
     qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
-    xive = SPAPR_XIVE(obj);
+    xive = SPAPR_XIVE_BASE(obj);
 
     /* Enable the CPU IPIs */
     for (i = 0; i < nr_servers; ++i) {
@@ -311,7 +311,7 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
+        xive_tctx_pic_print_info(XIVE_TCTX_BASE(cpu->intc), mon);
     }
 
     spapr_xive_pic_print_info(spapr->xive, mon);
-- 
2.17.2


* [Qemu-devel] [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS migration
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (19 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  5:54   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support Cédric Le Goater
                   ` (14 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

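Introduce a new sPAPR IRQ backend handler, post_load, to handle the
interrupt resend after migration. The XICS handler calls icp_resend()
on each CPU unless the machine uses the KVM XICS interrupt controller
model; the XIVE handler does nothing for now.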
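
As a rough illustration of the dispatch described above (not part of
the patch), the sketch below routes a post_load call through a
per-backend function table. Machine, IrqBackend, in_kernel and
resend_count are hypothetical stand-ins for sPAPRMachineState,
sPAPRIrq and the real controller state; only the shape of the hook
follows the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sPAPRMachineState and sPAPRIrq. */
typedef struct Machine Machine;

typedef struct IrqBackend {
    const char *name;
    int (*post_load)(Machine *m, int version_id);
} IrqBackend;

struct Machine {
    const IrqBackend *irq;  /* backend selected for this machine */
    int in_kernel;          /* non-zero when KVM owns the controller state */
    int resend_count;       /* counts emulated resends, for illustration */
};

/* Emulated XICS: replay interrupts, unless KVM owns the state. */
static int post_load_xics(Machine *m, int version_id)
{
    (void)version_id;
    if (!m->in_kernel) {
        m->resend_count++;
    }
    return 0;
}

/* XIVE: nothing to do at this point. */
static int post_load_xive(Machine *m, int version_id)
{
    (void)m;
    (void)version_id;
    return 0;
}

const IrqBackend backend_xics = { "xics", post_load_xics };
const IrqBackend backend_xive = { "xive", post_load_xive };

/* Generic entry point: the caller no longer tests the controller type. */
int irq_post_load(Machine *m, int version_id)
{
    assert(m->irq != NULL && m->irq->post_load != NULL);
    return m->irq->post_load(m, version_id);
}
```

The machine-level post_load code then only needs to call
irq_post_load() and propagate its return value.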

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h |  2 ++
 hw/ppc/spapr.c             | 13 +++++--------
 hw/ppc/spapr_irq.c         | 27 +++++++++++++++++++++++++++
 3 files changed, 34 insertions(+), 8 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index b299dd794bff..4e36c0984e1a 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -45,6 +45,7 @@ typedef struct sPAPRIrq {
                         void *fdt, uint32_t phandle);
     Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
                                Error **errp);
+    int (*post_load)(sPAPRMachineState *spapr, int version_id);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
@@ -55,6 +56,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
+int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
 
 /*
  * XICS legacy routines
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 8fbb743769db..f9cf2debff5a 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1738,14 +1738,6 @@ static int spapr_post_load(void *opaque, int version_id)
         return err;
     }
 
-    if (!object_dynamic_cast(OBJECT(spapr->ics), TYPE_ICS_KVM)) {
-        CPUState *cs;
-        CPU_FOREACH(cs) {
-            PowerPCCPU *cpu = POWERPC_CPU(cs);
-            icp_resend(ICP(cpu->intc));
-        }
-    }
-
     /* In earlier versions, there was no separate qdev for the PAPR
      * RTC, so the RTC offset was stored directly in sPAPREnvironment.
      * So when migrating from those versions, poke the incoming offset
@@ -1766,6 +1758,11 @@ static int spapr_post_load(void *opaque, int version_id)
         }
     }
 
+    err = spapr_irq_post_load(spapr, version_id);
+    if (err) {
+        return err;
+    }
+
     return err;
 }
 
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index f6e9e44d4cf9..33dd5da7d255 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -203,6 +203,18 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
     return icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), errp);
 }
 
+static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
+{
+    if (!object_dynamic_cast(OBJECT(spapr->ics), TYPE_ICS_KVM)) {
+        CPUState *cs;
+        CPU_FOREACH(cs) {
+            PowerPCCPU *cpu = POWERPC_CPU(cs);
+            icp_resend(ICP(cpu->intc));
+        }
+    }
+    return 0;
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS     \
     (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -219,6 +231,7 @@ sPAPRIrq spapr_irq_xics = {
     .print_info  = spapr_irq_print_info_xics,
     .dt_populate = spapr_irq_dt_populate_xics,
     .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
+    .post_load   = spapr_irq_post_load_xics,
 };
 
  /*
@@ -331,6 +344,11 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
                             XIVE_ROUTER(spapr->xive), errp);
 }
 
+static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
+{
+    return 0;
+}
+
 /*
  * XIVE uses the full IRQ number space. Set it to 8K to be compatible
  * with XICS.
@@ -351,6 +369,7 @@ sPAPRIrq spapr_irq_xive = {
     .print_info  = spapr_irq_print_info_xive,
     .dt_populate = spapr_irq_dt_populate_xive,
     .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
+    .post_load   = spapr_irq_post_load_xive,
 };
 
 /*
@@ -389,6 +408,13 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
     return smc->irq->qirq(spapr, irq);
 }
 
+int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id)
+{
+    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+    return smc->irq->post_load(spapr, version_id);
+}
+
 /*
  * XICS legacy routines - to deprecate one day
  */
@@ -458,4 +484,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
     .print_info  = spapr_irq_print_info_xics,
     .dt_populate = spapr_irq_dt_populate_xics,
     .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
+    .post_load   = spapr_irq_post_load_xics,
 };
-- 
2.17.2


* [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (20 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS migration Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-28  5:52   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM Cédric Le Goater
                   ` (13 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This introduces a set of XIVE models specific to KVM which derive from
the XIVE base models. The interfaces with KVM are a new capability and
a new KVM device for the XIVE native exploitation interrupt mode.
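
A minimal sketch of how a derived model can chain to its base model's
realize handler, in the spirit of device_class_set_parent_realize().
This is not QEMU code: the Device and DeviceClass types below are
simplified, hypothetical stand-ins, and class inheritance is emulated
by calling the base class_init first.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Device Device;
typedef int (*RealizeFn)(Device *dev);

typedef struct DeviceClass {
    RealizeFn realize;         /* handler invoked by the framework */
    RealizeFn parent_realize;  /* saved by a subclass overriding realize */
} DeviceClass;

struct Device {
    const DeviceClass *klass;
    int tables_ready;  /* set by the base model */
    int kvm_ready;     /* set by the KVM flavour */
};

/* Base model: common initialization shared by all flavours. */
static int base_realize(Device *dev)
{
    dev->tables_ready = 1;
    return 0;
}

/* KVM flavour: do the KVM setup, then run the base handler. */
static int kvm_realize(Device *dev)
{
    dev->kvm_ready = 1;
    return dev->klass->parent_realize(dev);
}

/* Save the current handler and install the subclass one. */
void class_set_parent_realize(DeviceClass *dc, RealizeFn child,
                              RealizeFn *saved)
{
    *saved = dc->realize;
    dc->realize = child;
}

void base_class_init(DeviceClass *dc)
{
    dc->realize = base_realize;
    dc->parent_realize = NULL;
}

void kvm_class_init(DeviceClass *dc)
{
    base_class_init(dc);  /* emulates inheriting the parent class */
    class_set_parent_realize(dc, kvm_realize, &dc->parent_realize);
}
```

Whether the derived handler runs before or after the base one is a
per-model choice; the sketch does the specific setup first.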

The KVM models handle the initialization of the TIMA and the source
ESB memory regions, which have a different type under KVM. These are
'ram device' memory mappings, similar to those used by VFIO, which
are exposed to the guest. The associated VMAs on the host are
populated dynamically with the appropriate pages using a fault handler.
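
The user-space side of such a mapping can be sketched as follows. An
unlinked temporary file stands in for the file descriptor returned by
the KVM device, which is an assumption made to keep the example
self-contained; map_shared(), make_backing_fd() and demo() are
hypothetical helper names. The kernel fault handler itself is not
modelled.

```c
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map 'len' bytes of 'fd' shared and read-write; NULL on failure. */
void *map_shared(int fd, size_t len)
{
    void *addr = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);

    return addr == MAP_FAILED ? NULL : addr;
}

/* Stand-in for the device fd: an unlinked temporary file of 'len' bytes. */
int make_backing_fd(size_t len)
{
    char path[] = "/tmp/xive-sketch-XXXXXX";
    int fd = mkstemp(path);

    if (fd < 0) {
        return -1;
    }
    unlink(path);
    if (ftruncate(fd, (off_t)len) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Returns 0 when the mapping is usable after the fd has been closed. */
int demo(void)
{
    size_t len = (size_t)sysconf(_SC_PAGESIZE);
    int fd = make_backing_fd(len);
    unsigned char *p;
    int ok;

    if (fd < 0) {
        return -1;
    }
    p = map_shared(fd, len);
    close(fd);  /* the mapping keeps its own reference to the file */
    if (!p) {
        return -1;
    }
    p[0] = 0x5a;
    ok = (p[0] == 0x5a);
    munmap(p, len);
    return ok ? 0 : -1;
}
```

Closing the descriptor right after mmap() is safe because the mapping
stays valid until munmap() is called.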

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 default-configs/ppc64-softmmu.mak |   1 +
 include/hw/ppc/spapr_xive.h       |  18 ++
 include/hw/ppc/xive.h             |   3 +
 linux-headers/asm-powerpc/kvm.h   |  12 +
 linux-headers/linux/kvm.h         |   4 +
 target/ppc/kvm_ppc.h              |   6 +
 hw/intc/spapr_xive_kvm.c          | 430 ++++++++++++++++++++++++++++++
 hw/ppc/spapr.c                    |   7 +-
 hw/ppc/spapr_irq.c                |  19 +-
 target/ppc/kvm.c                  |   7 +
 hw/intc/Makefile.objs             |   1 +
 11 files changed, 503 insertions(+), 5 deletions(-)
 create mode 100644 hw/intc/spapr_xive_kvm.c

diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
index 7f34ad0528ed..c1bf5cd951f5 100644
--- a/default-configs/ppc64-softmmu.mak
+++ b/default-configs/ppc64-softmmu.mak
@@ -18,6 +18,7 @@ CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
 CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_XIVE=$(CONFIG_PSERIES)
 CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
+CONFIG_XIVE_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
 CONFIG_MEM_DEVICE=y
 CONFIG_DIMM=y
 CONFIG_SPAPR_RNG=y
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index aca2969a09ab..9c817bb7ae74 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -40,6 +40,10 @@ typedef struct sPAPRXive {
     /* TIMA mapping address */
     hwaddr        tm_base;
     MemoryRegion  tm_mmio;
+
+    /* KVM support */
+    int           fd;
+    void          *tm_mmap;
 } sPAPRXive;
 
 #define SPAPR_XIVE_BASE_CLASS(klass) \
@@ -83,4 +87,18 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr);
 void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
                    uint32_t phandle);
 
+/*
+ * XIVE KVM models
+ */
+
+#define TYPE_SPAPR_XIVE_KVM  "spapr-xive-kvm"
+#define SPAPR_XIVE_KVM(obj)  OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_KVM)
+
+#define TYPE_XIVE_SOURCE_KVM "xive-source-kvm"
+#define XIVE_SOURCE_KVM(obj) \
+    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_KVM)
+
+#define TYPE_XIVE_TCTX_KVM   "xive-tctx-kvm"
+#define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 281ed370121c..7aaf5a182cb3 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -69,6 +69,9 @@ typedef struct XiveSource {
     uint32_t        esb_shift;
     MemoryRegion    esb_mmio;
 
+    /* KVM support */
+    void            *esb_mmap;
+
     XiveFabric      *xive;
 } XiveSource;
 
diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
index 8c876c166ef2..f34c971491dd 100644
--- a/linux-headers/asm-powerpc/kvm.h
+++ b/linux-headers/asm-powerpc/kvm.h
@@ -675,4 +675,16 @@ struct kvm_ppc_cpu_char {
 #define  KVM_XICS_PRESENTED		(1ULL << 43)
 #define  KVM_XICS_QUEUED		(1ULL << 44)
 
+/* POWER9 XIVE Native Interrupt Controller */
+#define KVM_DEV_XIVE_GRP_CTRL		1
+#define   KVM_DEV_XIVE_GET_ESB_FD	1
+#define   KVM_DEV_XIVE_GET_TIMA_FD	2
+#define   KVM_DEV_XIVE_VC_BASE		3
+#define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
+
+/* Layout of 64-bit XIVE source attribute values */
+#define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
+#define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
+
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index f11a7eb49cfa..59fa8d8d7f39 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -965,6 +965,8 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_COALESCED_PIO 162
 #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
 #define KVM_CAP_EXCEPTION_PAYLOAD 164
+#define KVM_CAP_ARM_VM_IPA_SIZE 165
+#define KVM_CAP_PPC_IRQ_XIVE 166
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1188,6 +1190,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
 	KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
+	KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
index bdfaa4e70a83..d2159660f9f2 100644
--- a/target/ppc/kvm_ppc.h
+++ b/target/ppc/kvm_ppc.h
@@ -59,6 +59,7 @@ bool kvmppc_has_cap_fixup_hcalls(void);
 bool kvmppc_has_cap_htm(void);
 bool kvmppc_has_cap_mmu_radix(void);
 bool kvmppc_has_cap_mmu_hash_v3(void);
+bool kvmppc_has_cap_xive(void);
 int kvmppc_get_cap_safe_cache(void);
 int kvmppc_get_cap_safe_bounds_check(void);
 int kvmppc_get_cap_safe_indirect_branch(void);
@@ -307,6 +308,11 @@ static inline bool kvmppc_has_cap_mmu_hash_v3(void)
     return false;
 }
 
+static inline bool kvmppc_has_cap_xive(void)
+{
+    return false;
+}
+
 static inline int kvmppc_get_cap_safe_cache(void)
 {
     return 0;
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
new file mode 100644
index 000000000000..767f90826e43
--- /dev/null
+++ b/hw/intc/spapr_xive_kvm.c
@@ -0,0 +1,430 @@
+/*
+ * QEMU PowerPC sPAPR XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qemu/error-report.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/kvm.h"
+#include "hw/ppc/spapr.h"
+#include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/xive.h"
+#include "kvm_ppc.h"
+
+#include <sys/ioctl.h>
+
+/*
+ * Helpers for CPU hotplug
+ */
+typedef struct KVMEnabledCPU {
+    unsigned long vcpu_id;
+    QLIST_ENTRY(KVMEnabledCPU) node;
+} KVMEnabledCPU;
+
+static QLIST_HEAD(, KVMEnabledCPU)
+    kvm_enabled_cpus = QLIST_HEAD_INITIALIZER(&kvm_enabled_cpus);
+
+static bool kvm_cpu_is_enabled(CPUState *cs)
+{
+    KVMEnabledCPU *enabled_cpu;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+
+    QLIST_FOREACH(enabled_cpu, &kvm_enabled_cpus, node) {
+        if (enabled_cpu->vcpu_id == vcpu_id) {
+            return true;
+        }
+    }
+    return false;
+}
+
+static void kvm_cpu_enable(CPUState *cs)
+{
+    KVMEnabledCPU *enabled_cpu;
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+
+    enabled_cpu = g_malloc(sizeof(*enabled_cpu));
+    enabled_cpu->vcpu_id = vcpu_id;
+    QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
+}
+
+/*
+ * XIVE Thread Interrupt Management context (KVM)
+ */
+
+static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
+{
+    sPAPRXive *xive;
+    unsigned long vcpu_id;
+    int ret;
+
+    /* Check if CPU was hot unplugged and replugged. */
+    if (kvm_cpu_is_enabled(tctx->cs)) {
+        return;
+    }
+
+    vcpu_id = kvm_arch_vcpu_id(tctx->cs);
+    xive = SPAPR_XIVE_KVM(tctx->xrtr);
+
+    ret = kvm_vcpu_enable_cap(tctx->cs, KVM_CAP_PPC_IRQ_XIVE, 0, xive->fd,
+                              vcpu_id, 0);
+    if (ret < 0) {
+        error_setg(errp, "Unable to connect CPU%lu to KVM XIVE device: %s",
+                   vcpu_id, strerror(errno));
+        return;
+    }
+
+    kvm_cpu_enable(tctx->cs);
+}
+
+static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
+{
+    XiveTCTX *tctx = XIVE_TCTX_KVM(dev);
+    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    xtc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    xive_tctx_kvm_init(tctx, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
+
+    dc->desc = "sPAPR XIVE KVM Interrupt Thread Context";
+
+    device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
+                                    &xtc->parent_realize);
+}
+
+static const TypeInfo xive_tctx_kvm_info = {
+    .name          = TYPE_XIVE_TCTX_KVM,
+    .parent        = TYPE_XIVE_TCTX_BASE,
+    .instance_size = sizeof(XiveTCTX),
+    .class_init    = xive_tctx_kvm_class_init,
+    .class_size    = sizeof(XiveTCTXClass),
+};
+
+/*
+ * XIVE Interrupt Source (KVM)
+ */
+
+static void xive_source_kvm_init(XiveSource *xsrc, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
+    int i;
+
+    /*
+     * At reset, interrupt sources are simply created and MASKED. We
+     * only need to inform the KVM device about their type: LSI or
+     * MSI.
+     */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        Error *local_err = NULL;
+        uint64_t state = 0;
+
+        if (xive_source_irq_is_lsi(xsrc, i)) {
+            state |= KVM_XIVE_LEVEL_SENSITIVE;
+            if (xsrc->status[i] & XIVE_STATUS_ASSERTED) {
+                state |= KVM_XIVE_LEVEL_ASSERTED;
+            }
+        }
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SOURCES, i, &state,
+                          true, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
+
+static void xive_source_kvm_reset(DeviceState *dev)
+{
+    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
+    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
+
+    xsc->parent_reset(dev);
+
+    xive_source_kvm_init(xsrc, &error_fatal);
+}
+
+static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
+{
+    XiveSource *xsrc = opaque;
+    struct kvm_irq_level args;
+    int rc;
+
+    args.irq = srcno;
+    if (!xive_source_irq_is_lsi(xsrc, srcno)) {
+        if (!val) {
+            return;
+        }
+        args.level = KVM_INTERRUPT_SET;
+    } else {
+        if (val) {
+            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
+            args.level = KVM_INTERRUPT_SET_LEVEL;
+        } else {
+            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
+            args.level = KVM_INTERRUPT_UNSET;
+        }
+    }
+    rc = kvm_vm_ioctl(kvm_state, KVM_IRQ_LINE, &args);
+    if (rc < 0) {
+        error_report("kvm_irq_line() failed: %s", strerror(errno));
+    }
+}
+
+static void *spapr_xive_kvm_mmap(sPAPRXive *xive, int ctrl, size_t len,
+                                 Error **errp)
+{
+    Error *local_err = NULL;
+    void *addr;
+    int fd;
+
+    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, ctrl, &fd, false,
+                      &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    addr = mmap(NULL, len, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
+    close(fd);
+    if (addr == MAP_FAILED) {
+        error_setg_errno(errp, errno, "Unable to set XIVE mmapping");
+        return NULL;
+    }
+
+    return addr;
+}
+
+/*
+ * The sPAPRXive KVM model should have initialized the KVM device
+ * before initializing the source
+ */
+static void xive_source_kvm_mmap(XiveSource *xsrc, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
+    Error *local_err = NULL;
+    size_t esb_len;
+
+    esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
+    xsrc->esb_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_ESB_FD,
+                                         esb_len, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    memory_region_init_ram_device_ptr(&xsrc->esb_mmio, OBJECT(xsrc),
+                                      "xive.esb", esb_len, xsrc->esb_mmap);
+    sysbus_init_mmio(SYS_BUS_DEVICE(xsrc), &xsrc->esb_mmio);
+}
+
+static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
+    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    xsc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    xsrc->qirqs = qemu_allocate_irqs(xive_source_kvm_set_irq, xsrc,
+                                     xsrc->nr_irqs);
+
+    xive_source_kvm_mmap(xsrc, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static void xive_source_kvm_unrealize(DeviceState *dev, Error **errp)
+{
+    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
+    size_t esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
+
+    munmap(xsrc->esb_mmap, esb_len);
+}
+
+static void xive_source_kvm_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
+
+    device_class_set_parent_realize(dc, xive_source_kvm_realize,
+                                    &xsc->parent_realize);
+    device_class_set_parent_reset(dc, xive_source_kvm_reset,
+                                  &xsc->parent_reset);
+
+    dc->desc = "sPAPR XIVE KVM Interrupt Source";
+    dc->unrealize = xive_source_kvm_unrealize;
+}
+
+static const TypeInfo xive_source_kvm_info = {
+    .name = TYPE_XIVE_SOURCE_KVM,
+    .parent = TYPE_XIVE_SOURCE_BASE,
+    .instance_size = sizeof(XiveSource),
+    .class_init    = xive_source_kvm_class_init,
+    .class_size    = sizeof(XiveSourceClass),
+};
+
+/*
+ * sPAPR XIVE Router (KVM)
+ */
+
+static void spapr_xive_kvm_instance_init(Object *obj)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(obj);
+
+    xive->fd = -1;
+
+    /* We need a KVM flavored source */
+    object_initialize(&xive->source, sizeof(xive->source),
+                      TYPE_XIVE_SOURCE_KVM);
+    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
+
+    /* No KVM support for END ESBs. OPAL doesn't support them either. */
+    object_initialize(&xive->end_source, sizeof(xive->end_source),
+                      TYPE_XIVE_END_SOURCE);
+    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
+                              NULL);
+}
+
+static void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
+{
+    Error *local_err = NULL;
+    size_t tima_len;
+
+    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
+        error_setg(errp,
+                   "IRQ_XIVE capability must be present for KVM XIVE device");
+        return;
+    }
+
+    /* First, create the KVM XIVE device */
+    xive->fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_XIVE, false);
+    if (xive->fd < 0) {
+        error_setg_errno(errp, -xive->fd, "error creating KVM XIVE device");
+        return;
+    }
+
+    /* Source ESBs KVM mapping
+     *
+     * Inform KVM where we will map the ESB pages. This is needed by
+     * the H_INT_GET_SOURCE_INFO hcall which returns the source
+     * characteristics, among which the ESB page address.
+     */
+    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, KVM_DEV_XIVE_VC_BASE,
+                      &xive->vc_base, true, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* Let the XiveSource KVM model handle the mapping for the moment */
+
+    /* TIMA KVM mapping
+     *
+     * We could also inform KVM where the TIMA will be mapped but as
+     * this is a fixed MMIO address for the system it does not seem
+     * necessary to provide a KVM ioctl to change it.
+     */
+    tima_len = 4ull << TM_SHIFT;
+    xive->tm_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_TIMA_FD,
+                                        tima_len, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    memory_region_init_ram_device_ptr(&xive->tm_mmio, OBJECT(xive),
+                                      "xive.tima", tima_len, xive->tm_mmap);
+    sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->tm_mmio);
+
+    kvm_kernel_irqchip = true;
+    kvm_msi_via_irqfd_allowed = true;
+    kvm_gsi_direct_mapping = true;
+}
+
+static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    spapr_xive_kvm_init(xive, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* Initialize the source and the local routing tables */
+    sxc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static void spapr_xive_kvm_unrealize(DeviceState *dev, Error **errp)
+{
+    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
+
+    close(xive->fd);
+    xive->fd = -1;
+
+    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
+}
+
+static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
+
+    device_class_set_parent_realize(dc, spapr_xive_kvm_realize,
+                                    &sxc->parent_realize);
+
+    dc->desc = "sPAPR XIVE KVM Interrupt Controller";
+    dc->unrealize = spapr_xive_kvm_unrealize;
+}
+
+static const TypeInfo spapr_xive_kvm_info = {
+    .name = TYPE_SPAPR_XIVE_KVM,
+    .parent = TYPE_SPAPR_XIVE_BASE,
+    .instance_init = spapr_xive_kvm_instance_init,
+    .instance_size = sizeof(sPAPRXive),
+    .class_init = spapr_xive_kvm_class_init,
+    .class_size = sizeof(sPAPRXiveClass),
+};
+
+static void xive_kvm_register_types(void)
+{
+    type_register_static(&spapr_xive_kvm_info);
+    type_register_static(&xive_source_kvm_info);
+    type_register_static(&xive_tctx_kvm_info);
+}
+
+type_init(xive_kvm_register_types)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index f9cf2debff5a..d1be2579cd9b 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1125,8 +1125,11 @@ static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
         } else {
             val[3] = 0x00; /* Hash */
         }
-        /* TODO: test KVM support */
-        val[1] = smc->irq->ov5;
+        if (kvmppc_has_cap_xive()) {
+            val[1] = smc->irq->ov5;
+        } else {
+            val[1] = 0x00;
+        }
     } else {
         val[1] = smc->irq->ov5;
 
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 33dd5da7d255..92ef53743b64 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -273,9 +273,22 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
     Error *local_err = NULL;
 
     /* KVM XIVE support */
-    if (kvm_enabled()) {
-        if (machine_kernel_irqchip_required(machine)) {
-            error_setg(errp, "kernel_irqchip requested. no XIVE support");
+    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+        spapr->xive_tctx_type = TYPE_XIVE_TCTX_KVM;
+        spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM, nr_irqs,
+                                        nr_servers, &local_err);
+
+        if (local_err && machine_kernel_irqchip_required(machine)) {
+            error_propagate(errp, local_err);
+            error_prepend(errp, "kernel_irqchip requested but init failed : ");
+            return;
+        }
+
+        /*
+         * XIVE support is activated under KVM. No need to initialize
+         * the fallback mode under QEMU
+         */
+        if (spapr->xive) {
             return;
         }
     }
diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
index f81327d6cd47..3b7cf106242b 100644
--- a/target/ppc/kvm.c
+++ b/target/ppc/kvm.c
@@ -86,6 +86,7 @@ static int cap_fixup_hcalls;
 static int cap_htm;             /* Hardware transactional memory support */
 static int cap_mmu_radix;
 static int cap_mmu_hash_v3;
+static int cap_xive;
 static int cap_resize_hpt;
 static int cap_ppc_pvr_compat;
 static int cap_ppc_safe_cache;
@@ -149,6 +150,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
     cap_htm = kvm_vm_check_extension(s, KVM_CAP_PPC_HTM);
     cap_mmu_radix = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_RADIX);
     cap_mmu_hash_v3 = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_HASH_V3);
+    cap_xive = kvm_vm_check_extension(s, KVM_CAP_PPC_IRQ_XIVE);
     cap_resize_hpt = kvm_vm_check_extension(s, KVM_CAP_SPAPR_RESIZE_HPT);
     kvmppc_get_cpu_characteristics(s);
     cap_ppc_nested_kvm_hv = kvm_vm_check_extension(s, KVM_CAP_PPC_NESTED_HV);
@@ -2385,6 +2387,11 @@ static int parse_cap_ppc_safe_indirect_branch(struct kvm_ppc_cpu_char c)
     return 0;
 }
 
+bool kvmppc_has_cap_xive(void)
+{
+    return cap_xive;
+}
+
 static void kvmppc_get_cpu_characteristics(KVMState *s)
 {
     struct kvm_ppc_cpu_char c;
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index eacd26836ebf..dd4d69db2bdd 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -39,6 +39,7 @@ obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
 obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
 obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
+obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
 obj-$(CONFIG_POWERNV) += xics_pnv.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
-- 
2.17.2


* [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (21 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  3:43   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend Cédric Le Goater
                   ` (12 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This extends the KVM XIVE models to handle the state synchronization
with KVM, for the monitor usage and for the migration.

The migration priority of the XIVE interrupt controller sPAPRXive is
raised for KVM. It operates first and orchestrates the capture
sequence of the states of all the XIVE models. The XIVE sources are
masked to quiesce the interrupt flow and a XIVE sync is performed to
stabilize the OS Event Queues. The state of the ENDs is then captured
by the XIVE interrupt controller model, sPAPRXive, and the state of
the thread contexts by the thread interrupt presenter model,
XiveTCTX. When done, a rollback is performed to restore the sources to
their initial state.
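
As an illustration only, the mask/capture/rollback sequence can be
modelled in a few lines of standalone C. This is a sketch, not the
code of this patch: hw_pq[], saved_pq[], esb_set_pq() and
capture_sequence() are hypothetical names, and the array stands in
for the ESB pages that the real code reads through its mmap.

```c
#include <assert.h>
#include <stdint.h>

#define NR_IRQS   4        /* hypothetical number of sources */
#define PQ_MASKED 0x1      /* PQ=01: the source is masked */

/* Hypothetical stand-in for the ESB state held by the hardware */
static uint8_t hw_pq[NR_IRQS] = { 0x0, 0x2, 0x1, 0x3 };

/* Local copy kept by the model: this is what would be migrated */
static uint8_t saved_pq[NR_IRQS];

/* Model of an ESB "magic load": set a new PQ value, return the old one */
static uint8_t esb_set_pq(int srcno, uint8_t new_pq)
{
    uint8_t old_pq;

    assert(srcno >= 0 && srcno < NR_IRQS);
    old_pq = hw_pq[srcno];
    hw_pq[srcno] = new_pq & 0x3;
    return old_pq;
}

/* Returns 0 on success, -1 if a source was no longer masked at rollback */
static int capture_sequence(void)
{
    int ret = 0;
    int i;

    /* Quiesce: mask every source and save its previous PQ bits */
    for (i = 0; i < NR_IRQS; i++) {
        saved_pq[i] = esb_set_pq(i, PQ_MASKED);
    }

    /* The sync and the EAT, END and thread context captures go here */

    /* Rollback: each source must still be masked when it is restored */
    for (i = 0; i < NR_IRQS; i++) {
        if (esb_set_pq(i, saved_pq[i]) != PQ_MASKED) {
            ret = -1;
        }
    }
    return ret;
}
```

After capture_sequence() returns, hw_pq[] holds the same values as
before the call and saved_pq[] holds the copy that would be sent to
the destination.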

The sPAPRXive 'post_load' method is called from the sPAPR machine,
after all XIVE device states have been transferred and loaded. First,
sPAPRXive restores the XIVE routing tables: ENDT and EAT. Next, the
thread interrupt context registers and the source PQ bits are
restored.
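
The restore order can be sketched as a fixed list of steps that stops
at the first failure. This is a simplified standalone model: the step
functions below are hypothetical placeholders that only record the
order in which they ran, they are not the routines added by this
patch.

```c
#include <assert.h>
#include <stddef.h>

enum restore_step { STEP_ENDT, STEP_EAT, STEP_TCTX, STEP_PQ, STEP_MAX };

/* Records the order in which the hypothetical steps ran */
static enum restore_step restore_trace[STEP_MAX];
static size_t restore_count;

static int trace_step(enum restore_step step)
{
    assert(restore_count < STEP_MAX);
    restore_trace[restore_count++] = step;
    return 0;
}

/* Placeholders for the real restore routines */
static int restore_ends(void)            { return trace_step(STEP_ENDT); }
static int restore_eas(void)             { return trace_step(STEP_EAT); }
static int restore_thread_contexts(void) { return trace_step(STEP_TCTX); }
static int restore_source_pq(void)       { return trace_step(STEP_PQ); }

/* Run the steps in order and stop at the first failure */
static int post_load_sequence(void)
{
    int (*const steps[])(void) = {
        restore_ends,            /* ENDs first: the targeting depends on them */
        restore_eas,
        restore_thread_contexts,
        restore_source_pq,       /* the source PQ bits are restored last */
    };
    size_t i;

    for (i = 0; i < sizeof(steps) / sizeof(steps[0]); i++) {
        if (steps[i]() != 0) {
            return -1;
        }
    }
    return 0;
}
```

Keeping the order in a single table makes the dependency explicit:
the ENDs have to be in place before the EAT entries that target them.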

The get/set operations rely on their KVM counterparts in the host
kernel, which acts as a proxy for OPAL, the host firmware.
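
For reference, the EQ get/set operations address a queue with an
attribute index encoding the (server, priority) tuple. The sketch
below shows that encoding in standalone C. The shift and mask values
are copied from the linux-headers hunk of this patch; the helper
names eq_index_encode(), eq_index_server() and eq_index_priority()
are hypothetical and do not exist in QEMU.

```c
#include <assert.h>
#include <stdint.h>

/* Values copied from the linux-headers/asm-powerpc/kvm.h hunk */
#define KVM_XIVE_EQ_PRIORITY_SHIFT  0
#define KVM_XIVE_EQ_PRIORITY_MASK   0x7
#define KVM_XIVE_EQ_SERVER_SHIFT    3
#define KVM_XIVE_EQ_SERVER_MASK     0xfffffff8ULL

/* Encode the tuple (server, priority) as a KVM EQ attribute index */
static uint64_t eq_index_encode(uint32_t server, uint8_t priority)
{
    uint64_t idx;

    assert(priority <= KVM_XIVE_EQ_PRIORITY_MASK);

    idx = ((uint64_t)priority << KVM_XIVE_EQ_PRIORITY_SHIFT) &
        KVM_XIVE_EQ_PRIORITY_MASK;
    idx |= ((uint64_t)server << KVM_XIVE_EQ_SERVER_SHIFT) &
        KVM_XIVE_EQ_SERVER_MASK;
    return idx;
}

/* Extract the server number from an EQ attribute index */
static uint32_t eq_index_server(uint64_t idx)
{
    return (idx & KVM_XIVE_EQ_SERVER_MASK) >> KVM_XIVE_EQ_SERVER_SHIFT;
}

/* Extract the priority from an EQ attribute index */
static uint8_t eq_index_priority(uint64_t idx)
{
    return (idx & KVM_XIVE_EQ_PRIORITY_MASK) >> KVM_XIVE_EQ_PRIORITY_SHIFT;
}
```

With these values, server 5 and priority 6 give the index
(5 << 3) | 6 = 46.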

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---

 WIP:
 
    If migration occurs when a VCPU is 'ceded', some of the OS event
    notification queues are mapped to the ZERO_PAGE on the receiving
    side, as if the HW had triggered a page fault before the dirty
    page was transferred from the source or as if we were not using
    the correct page table.

 include/hw/ppc/spapr_xive.h     |   5 +
 include/hw/ppc/xive.h           |   3 +
 include/migration/vmstate.h     |   1 +
 linux-headers/asm-powerpc/kvm.h |  33 +++
 hw/intc/spapr_xive.c            |  32 +++
 hw/intc/spapr_xive_kvm.c        | 494 ++++++++++++++++++++++++++++++++
 hw/intc/xive.c                  |  46 +++
 hw/ppc/spapr_irq.c              |   2 +-
 8 files changed, 615 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 9c817bb7ae74..d2517c040958 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -55,12 +55,17 @@ typedef struct sPAPRXiveClass {
     XiveRouterClass parent_class;
 
     DeviceRealize   parent_realize;
+
+    void (*synchronize_state)(sPAPRXive *xive);
+    int  (*pre_save)(sPAPRXive *xsrc);
+    int  (*post_load)(sPAPRXive *xsrc, int version_id);
 } sPAPRXiveClass;
 
 bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
 bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
 qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
+int spapr_xive_post_load(sPAPRXive *xive, int version_id);
 
 /*
  * sPAPR NVT and END indexing helpers
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index 7aaf5a182cb3..c8201462d698 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -309,6 +309,9 @@ typedef struct XiveTCTXClass {
     DeviceClass       parent_class;
 
     DeviceRealize     parent_realize;
+
+    void (*synchronize_state)(XiveTCTX *tctx);
+    int  (*post_load)(XiveTCTX *tctx, int version_id);
 } XiveTCTXClass;
 
 /*
diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
index 2b501d04669a..ee2e836cc1c1 100644
--- a/include/migration/vmstate.h
+++ b/include/migration/vmstate.h
@@ -154,6 +154,7 @@ typedef enum {
     MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
     MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
     MIG_PRI_GICV3,              /* Must happen before the ITS */
+    MIG_PRI_XIVE_IC,            /* Must happen before all XIVE models */
     MIG_PRI_MAX,
 } MigrationPriority;
 
diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
index f34c971491dd..9d55ade23634 100644
--- a/linux-headers/asm-powerpc/kvm.h
+++ b/linux-headers/asm-powerpc/kvm.h
@@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
 
+#define KVM_REG_PPC_NVT_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC		1
 #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
@@ -681,10 +683,41 @@ struct kvm_ppc_cpu_char {
 #define   KVM_DEV_XIVE_GET_TIMA_FD	2
 #define   KVM_DEV_XIVE_VC_BASE		3
 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
+#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
 
+/* Layout of 64-bit eas attribute values */
+#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
+#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
+#define KVM_XIVE_EAS_SERVER_SHIFT	3
+#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
+#define KVM_XIVE_EAS_MASK_SHIFT		32
+#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
+#define KVM_XIVE_EAS_EISN_SHIFT		33
+#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
+
+/* Layout of 64-bit eq attribute */
+#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
+#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
+#define KVM_XIVE_EQ_SERVER_SHIFT	3
+#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
+
+/* Layout of 64-bit eq attribute values */
+struct kvm_ppc_xive_eq {
+	__u32 flags;
+	__u32 qsize;
+	__u64 qpage;
+	__u32 qtoggle;
+	__u32 qindex;
+};
+
+#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
+#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
+#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index ec85f7e4f88d..c5c0e063dc33 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -27,9 +27,14 @@
 
 void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 {
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
     int i;
     uint32_t offset = 0;
 
+    if (sxc->synchronize_state) {
+        sxc->synchronize_state(xive);
+    }
+
     monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
                    offset + xive->source.nr_irqs - 1);
     xive_source_pic_print_info(&xive->source, offset, mon);
@@ -354,10 +359,37 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
     },
 };
 
+static int vmstate_spapr_xive_pre_save(void *opaque)
+{
+    sPAPRXive *xive = SPAPR_XIVE_BASE(opaque);
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
+
+    if (sxc->pre_save) {
+        return sxc->pre_save(xive);
+    }
+
+    return 0;
+}
+
+/* handled at the machine level */
+int spapr_xive_post_load(sPAPRXive *xive, int version_id)
+{
+    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
+
+    if (sxc->post_load) {
+        return sxc->post_load(xive, version_id);
+    }
+
+    return 0;
+}
+
 static const VMStateDescription vmstate_spapr_xive_base = {
     .name = TYPE_SPAPR_XIVE,
     .version_id = 1,
     .minimum_version_id = 1,
+    .pre_save = vmstate_spapr_xive_pre_save,
+    .post_load = NULL, /* handled at the machine level */
+    .priority = MIG_PRI_XIVE_IC,
     .fields = (VMStateField[]) {
         VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
         VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 767f90826e43..176083c37d61 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -58,6 +58,58 @@ static void kvm_cpu_enable(CPUState *cs)
 /*
  * XIVE Thread Interrupt Management context (KVM)
  */
+static void xive_tctx_kvm_set_state(XiveTCTX *tctx, Error **errp)
+{
+    uint64_t state[4];
+    int ret;
+
+    /* word0 and word1 of the OS ring. */
+    state[0] = *((uint64_t *) &tctx->regs[TM_QW1_OS]);
+
+    /* VP identifier. Only for KVM pr_debug() */
+    state[1] = *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]);
+
+    ret = kvm_set_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
+    if (ret != 0) {
+        error_setg_errno(errp, errno, "Could not restore KVM XIVE CPU %ld "
+                         "state", kvm_arch_vcpu_id(tctx->cs));
+    }
+}
+
+static void xive_tctx_kvm_get_state(XiveTCTX *tctx, Error **errp)
+{
+    uint64_t state[4] = { 0 };
+    int ret;
+
+    ret = kvm_get_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
+    if (ret != 0) {
+        error_setg_errno(errp, errno, "Could not capture KVM XIVE CPU %ld "
+                         "state", kvm_arch_vcpu_id(tctx->cs));
+        return;
+    }
+
+    /* word0 and word1 of the OS ring. */
+    *((uint64_t *) &tctx->regs[TM_QW1_OS]) = state[0];
+
+    /*
+     * KVM also returns word2 containing the VP CAM line value which
+     * is interesting to print out the VP identifier in the QEMU
+     * monitor. No need to restore it.
+     */
+    *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]) = state[1];
+}
+
+static void xive_tctx_kvm_do_synchronize_state(CPUState *cpu,
+                                              run_on_cpu_data arg)
+{
+    xive_tctx_kvm_get_state(arg.host_ptr, &error_fatal);
+}
+
+static void xive_tctx_kvm_synchronize_state(XiveTCTX *tctx)
+{
+    run_on_cpu(tctx->cs, xive_tctx_kvm_do_synchronize_state,
+               RUN_ON_CPU_HOST_PTR(tctx));
+}
 
 static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
 {
@@ -112,6 +164,8 @@ static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
 
     device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
                                     &xtc->parent_realize);
+
+    xtc->synchronize_state = xive_tctx_kvm_synchronize_state;
 }
 
 static const TypeInfo xive_tctx_kvm_info = {
@@ -166,6 +220,34 @@ static void xive_source_kvm_reset(DeviceState *dev)
     xive_source_kvm_init(xsrc, &error_fatal);
 }
 
+/*
+ * This is used to perform the magic loads on the ESB pages, described
+ * in xive.h.
+ */
+static uint8_t xive_esb_read(XiveSource *xsrc, int srcno, uint32_t offset)
+{
+    unsigned long addr = (unsigned long) xsrc->esb_mmap +
+        xive_source_esb_mgmt(xsrc, srcno) + offset;
+
+    /* Prevent the compiler from optimizing away the load */
+    volatile uint64_t value = *((uint64_t *) addr);
+
+    return be64_to_cpu(value) & 0x3;
+}
+
+static void xive_source_kvm_get_state(XiveSource *xsrc)
+{
+    int i;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        /* Perform a load without side effect to retrieve the PQ bits */
+        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_GET);
+
+        /* and save PQ locally */
+        xive_source_esb_set(xsrc, i, pq);
+    }
+}
+
 static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
 {
     XiveSource *xsrc = opaque;
@@ -295,6 +377,414 @@ static const TypeInfo xive_source_kvm_info = {
 /*
  * sPAPR XIVE Router (KVM)
  */
+static int spapr_xive_kvm_set_eq_state(sPAPRXive *xive, CPUState *cs,
+                                       Error **errp)
+{
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+    int ret;
+    int i;
+
+    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
+        Error *local_err = NULL;
+        XiveEND end;
+        uint8_t end_blk;
+        uint32_t end_idx;
+        struct kvm_ppc_xive_eq kvm_eq = { 0 };
+        uint64_t kvm_eq_idx;
+
+        if (!spapr_xive_priority_is_valid(i)) {
+            continue;
+        }
+
+        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
+
+        ret = xive_router_get_end(xrtr, end_blk, end_idx, &end);
+        if (ret) {
+            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
+                       vcpu_id, i);
+            return ret;
+        }
+
+        if (!(end.w0 & END_W0_VALID)) {
+            continue;
+        }
+
+        /* Build the KVM state from the local END structure */
+        kvm_eq.flags   = KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
+        kvm_eq.qsize   = GETFIELD(END_W0_QSIZE, end.w0) + 12;
+        kvm_eq.qpage   = (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
+        kvm_eq.qtoggle = GETFIELD(END_W1_GENERATION, end.w1);
+        kvm_eq.qindex  = GETFIELD(END_W1_PAGE_OFF, end.w1);
+
+        /* Encode the tuple (server, prio) as a KVM EQ index */
+        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
+            KVM_XIVE_EQ_PRIORITY_MASK;
+        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
+            KVM_XIVE_EQ_SERVER_MASK;
+
+        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
+                                &kvm_eq, true, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static int spapr_xive_kvm_get_eq_state(sPAPRXive *xive, CPUState *cs,
+                                       Error **errp)
+{
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
+    int ret;
+    int i;
+
+    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
+        Error *local_err = NULL;
+        struct kvm_ppc_xive_eq kvm_eq = { 0 };
+        uint64_t kvm_eq_idx;
+        XiveEND end = { 0 };
+        uint8_t end_blk, nvt_blk;
+        uint32_t end_idx, nvt_idx;
+
+        /* Skip priorities reserved for the hypervisor */
+        if (!spapr_xive_priority_is_valid(i)) {
+            continue;
+        }
+
+        /* Encode the tuple (server, prio) as a KVM EQ index */
+        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
+            KVM_XIVE_EQ_PRIORITY_MASK;
+        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
+            KVM_XIVE_EQ_SERVER_MASK;
+
+        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
+                                &kvm_eq, false, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return ret;
+        }
+
+        if (!(kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED)) {
+            continue;
+        }
+
+        /* Update the local END structure with the KVM input */
+        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED) {
+                end.w0 |= END_W0_VALID | END_W0_ENQUEUE;
+        }
+        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY) {
+                end.w0 |= END_W0_UCOND_NOTIFY;
+        }
+        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ESCALATE) {
+                end.w0 |= END_W0_ESCALATE_CTL;
+        }
+        end.w0 |= SETFIELD(END_W0_QSIZE, 0ul, kvm_eq.qsize - 12);
+
+        end.w1 = SETFIELD(END_W1_GENERATION, 0ul, kvm_eq.qtoggle) |
+            SETFIELD(END_W1_PAGE_OFF, 0ul, kvm_eq.qindex);
+        end.w2 = (kvm_eq.qpage >> 32) & 0x0fffffff;
+        end.w3 = kvm_eq.qpage & 0xffffffff;
+        end.w4 = 0;
+        end.w5 = 0;
+
+        ret = spapr_xive_cpu_to_nvt(xive, POWERPC_CPU(cs), &nvt_blk, &nvt_idx);
+        if (ret) {
+            error_setg(errp, "XIVE: No NVT for CPU %ld", vcpu_id);
+            return ret;
+        }
+
+        end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
+            SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
+        end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, i);
+
+        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
+
+        ret = xive_router_set_end(xrtr, end_blk, end_idx, &end);
+        if (ret) {
+            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
+                       vcpu_id, i);
+            return ret;
+        }
+    }
+
+    return 0;
+}
+
+static void spapr_xive_kvm_set_eas_state(sPAPRXive *xive, Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    int i;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        XiveEAS *eas = &xive->eat[i];
+        uint32_t end_idx;
+        uint32_t end_blk;
+        uint32_t eisn;
+        uint8_t priority;
+        uint32_t server;
+        uint64_t kvm_eas;
+        Error *local_err = NULL;
+
+        /* No need to set MASKED EAS, this is the default state after reset */
+        if (!(eas->w & EAS_VALID) || eas->w & EAS_MASKED) {
+            continue;
+        }
+
+        end_idx = GETFIELD(EAS_END_INDEX, eas->w);
+        end_blk = GETFIELD(EAS_END_BLOCK, eas->w);
+        eisn = GETFIELD(EAS_END_DATA, eas->w);
+
+        spapr_xive_end_to_target(xive, end_blk, end_idx, &server, &priority);
+
+        kvm_eas = priority << KVM_XIVE_EAS_PRIORITY_SHIFT &
+            KVM_XIVE_EAS_PRIORITY_MASK;
+        kvm_eas |= server << KVM_XIVE_EAS_SERVER_SHIFT &
+            KVM_XIVE_EAS_SERVER_MASK;
+        kvm_eas |= ((uint64_t)eisn << KVM_XIVE_EAS_EISN_SHIFT) &
+            KVM_XIVE_EAS_EISN_MASK;
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, true,
+                          &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
+
+static void spapr_xive_kvm_get_eas_state(sPAPRXive *xive, Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    int i;
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        XiveEAS *eas = &xive->eat[i];
+        XiveEAS new_eas;
+        uint64_t kvm_eas;
+        uint8_t priority;
+        uint32_t server;
+        uint32_t end_idx;
+        uint8_t end_blk;
+        uint32_t eisn;
+        Error *local_err = NULL;
+
+        if (!(eas->w & EAS_VALID)) {
+            continue;
+        }
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, false,
+                          &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+
+        priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
+            KVM_XIVE_EAS_PRIORITY_SHIFT;
+        server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
+            KVM_XIVE_EAS_SERVER_SHIFT;
+        eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
+
+        if (spapr_xive_target_to_end(xive, server, priority, &end_blk,
+                                     &end_idx)) {
+            error_setg(errp, "XIVE: invalid tuple CPU %d priority %d", server,
+                       priority);
+            return;
+        }
+
+        new_eas.w = EAS_VALID;
+        if (kvm_eas & KVM_XIVE_EAS_MASK_MASK) {
+            new_eas.w |= EAS_MASKED;
+        }
+
+        new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
+        new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
+        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
+
+        *eas = new_eas;
+    }
+}
+
+static void spapr_xive_kvm_sync_all(sPAPRXive *xive, Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
+    int i;
+
+    /* Sync the KVM source. This reaches the XIVE HW through OPAL */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        XiveEAS *eas = &xive->eat[i];
+
+        if (!(eas->w & EAS_VALID)) {
+            continue;
+        }
+
+        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SYNC, i, NULL, true,
+                          &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+}
+
+/*
+ * The sPAPRXive KVM model migration priority is higher to make sure
+ * its 'pre_save' method runs before all the other XIVE models. It
+ * orchestrates the capture sequence of the XIVE states in the
+ * following order:
+ *
+ *   1. mask all the sources by setting PQ=01, which returns the
+ *      previous value and save it.
+ *   2. sync the sources in KVM to stabilize all the queues
+ *      sync the ENDs to make sure END -> VP is fully completed
+ *   3. dump the EAS table
+ *   4. dump the END table
+ *   5. dump the thread context (IPB)
+ *
+ *  Rollback to restore the current configuration of the sources
+ */
+static int spapr_xive_kvm_pre_save(sPAPRXive *xive)
+{
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
+    CPUState *cs;
+    int i;
+    int ret = 0;
+
+    /* Quiesce the sources, to stop the flow of event notifications */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        /*
+         * Mask and save the ESB PQs locally in the XiveSource object.
+         */
+        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_01);
+        xive_source_esb_set(xsrc, i, pq);
+    }
+
+    /* Sync the sources in KVM */
+    spapr_xive_kvm_sync_all(xive, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        goto out;
+    }
+
+    /* Grab the EAT (could be done earlier ?) */
+    spapr_xive_kvm_get_eas_state(xive, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        goto out;
+    }
+
+    /*
+     * Grab the ENDs. The EQ index and the toggle bit are what we want
+     * to capture
+     */
+    CPU_FOREACH(cs) {
+        spapr_xive_kvm_get_eq_state(xive, cs, &local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            goto out;
+        }
+    }
+
+    /* Capture the thread interrupt contexts */
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        /* TODO: Check if this needs to run under run_on_cpu() */
+        xive_tctx_kvm_get_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            goto out;
+        }
+    }
+
+    /* All done. */
+
+out:
+    /* Restore the sources to their initial state */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        uint8_t pq = xive_source_esb_get(xsrc, i);
+        if (xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8)) != 0x1) {
+            error_report("XIVE: IRQ %d has an invalid state", i);
+        }
+    }
+
+    /*
+     * The XiveSource and the XiveTCTX states will be collected by
+     * their respective vmstate handlers afterwards.
+     */
+    return ret;
+}
+
+/*
+ * The sPAPRXive 'post_load' method is called by the sPAPR machine,
+ * after all XIVE device states have been transferred and loaded.
+ *
+ * All should be in place when the VCPUs resume execution.
+ */
+static int spapr_xive_kvm_post_load(sPAPRXive *xive, int version_id)
+{
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
+    CPUState *cs;
+    int i;
+
+    /* Set the ENDs first. The targeting depends on it. */
+    CPU_FOREACH(cs) {
+        spapr_xive_kvm_set_eq_state(xive, cs, &local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            return -1;
+        }
+    }
+
+    /* Restore the targeting, if any */
+    spapr_xive_kvm_set_eas_state(xive, &local_err);
+    if (local_err) {
+        error_report_err(local_err);
+        return -1;
+    }
+
+    /* Restore the thread interrupt contexts */
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        xive_tctx_kvm_set_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
+        if (local_err) {
+            error_report_err(local_err);
+            return -1;
+        }
+    }
+
+    /*
+     * Get the saved state from the XiveSource model and restore the
+     * PQ bits
+     */
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        uint8_t pq = xive_source_esb_get(xsrc, i);
+        xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8));
+    }
+    return 0;
+}
+
+static void spapr_xive_kvm_synchronize_state(sPAPRXive *xive)
+{
+    XiveSource *xsrc = &xive->source;
+    CPUState *cs;
+
+    xive_source_kvm_get_state(xsrc);
+
+    spapr_xive_kvm_get_eas_state(xive, &error_fatal);
+
+    CPU_FOREACH(cs) {
+        spapr_xive_kvm_get_eq_state(xive, cs, &error_fatal);
+    }
+}
 
 static void spapr_xive_kvm_instance_init(Object *obj)
 {
@@ -409,6 +899,10 @@ static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
 
     dc->desc = "sPAPR XIVE KVM Interrupt Controller";
     dc->unrealize = spapr_xive_kvm_unrealize;
+
+    sxc->synchronize_state = spapr_xive_kvm_synchronize_state;
+    sxc->pre_save = spapr_xive_kvm_pre_save;
+    sxc->post_load = spapr_xive_kvm_post_load;
 }
 
 static const TypeInfo spapr_xive_kvm_info = {
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index 9bb37553c9ec..c9aedecc8216 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -438,9 +438,14 @@ static const struct {
 
 void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
 {
+    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
     int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
     int i;
 
+    if (xtc->synchronize_state) {
+        xtc->synchronize_state(tctx);
+    }
+
     monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
                    "  W2\n", cpu_index);
 
@@ -552,10 +557,23 @@ static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
     qemu_unregister_reset(xive_tctx_base_reset, dev);
 }
 
+static int vmstate_xive_tctx_post_load(void *opaque, int version_id)
+{
+    XiveTCTX *tctx = XIVE_TCTX_BASE(opaque);
+    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
+
+    if (xtc->post_load) {
+        return xtc->post_load(tctx, version_id);
+    }
+
+    return 0;
+}
+
 static const VMStateDescription vmstate_xive_tctx_base = {
     .name = TYPE_XIVE_TCTX,
     .version_id = 1,
     .minimum_version_id = 1,
+    .post_load = vmstate_xive_tctx_post_load,
     .fields = (VMStateField[]) {
         VMSTATE_BUFFER(regs, XiveTCTX),
         VMSTATE_END_OF_LIST()
@@ -581,9 +599,37 @@ static const TypeInfo xive_tctx_base_info = {
     .class_size    = sizeof(XiveTCTXClass),
 };
 
+static int xive_tctx_post_load(XiveTCTX *tctx, int version_id)
+{
+    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
+
+    /*
+     * When we collect the states from KVM XIVE irqchip, we set word2
+     * of the thread context to print out the OS CAM line under the
+     * QEMU monitor.
+     *
+     * This breaks migration on a guest using TCG or not using a KVM
+     * irqchip. Fix with an extra reset of the thread contexts.
+     */
+    if (xrc->reset_tctx) {
+        xrc->reset_tctx(tctx->xrtr, tctx);
+    }
+    return 0;
+}
+
+static void xive_tctx_class_init(ObjectClass *klass, void *data)
+{
+    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
+
+    xtc->post_load = xive_tctx_post_load;
+}
+
 static const TypeInfo xive_tctx_info = {
     .name          = TYPE_XIVE_TCTX,
     .parent        = TYPE_XIVE_TCTX_BASE,
+    .instance_size = sizeof(XiveTCTX),
+    .class_init    = xive_tctx_class_init,
+    .class_size    = sizeof(XiveTCTXClass),
 };
 
 Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 92ef53743b64..6fac6ca70595 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -359,7 +359,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
 
 static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
 {
-    return 0;
+    return spapr_xive_post_load(spapr->xive, version_id);
 }
 
 /*
-- 
2.17.2

* [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (22 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  3:47   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset Cédric Le Goater
                   ` (11 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This method will become useful when the new machine supporting both
interrupt modes, XIVE and XICS, is introduced. In this machine, the
interrupt mode is chosen by the CAS negotiation process and activated
after a reset.

For the time being, the only thing that can be done in the XIVE reset
handler is to map the pages for the TIMA and for the source ESBs.
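
A minimal sketch of the dispatch this method relies on, assuming
simplified stand-ins for the machine and backend types (Machine,
IrqBackend, irq_reset() and the mmio_mapped flag are hypothetical names,
not QEMU's):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for sPAPRMachineState and sPAPRIrq. */
typedef struct Machine Machine;

typedef struct IrqBackend {
    const char *name;
    void (*reset)(Machine *m);      /* optional hook, may be NULL */
} IrqBackend;

struct Machine {
    const IrqBackend *irq;          /* backend selected for the machine */
    int mmio_mapped;                /* set once the XIVE pages are mapped */
};

/* The XIVE hook only maps the TIMA and ESB pages (modelled as a flag). */
static void xive_reset(Machine *m)
{
    m->mmio_mapped = 1;
}

static const IrqBackend backend_xics = { "xics", NULL };
static const IrqBackend backend_xive = { "xive", xive_reset };

/* Frontend wrapper: call the backend hook only when one is provided. */
static void irq_reset(Machine *m)
{
    assert(m->irq != NULL);
    if (m->irq->reset) {
        m->irq->reset(m);
    }
}
```

The machine reset path calls the wrapper once and does not need to know
which backend is active.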

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h  |  2 ++
 include/hw/ppc/spapr_xive.h |  1 +
 hw/intc/spapr_xive.c        |  4 +---
 hw/ppc/spapr.c              |  2 ++
 hw/ppc/spapr_irq.c          | 21 +++++++++++++++++++++
 5 files changed, 27 insertions(+), 3 deletions(-)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 4e36c0984e1a..34128976e21c 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -46,6 +46,7 @@ typedef struct sPAPRIrq {
     Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
                                Error **errp);
     int (*post_load)(sPAPRMachineState *spapr, int version_id);
+    void (*reset)(sPAPRMachineState *spapr, Error **errp);
 } sPAPRIrq;
 
 extern sPAPRIrq spapr_irq_xics;
@@ -57,6 +58,7 @@ int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
 void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
 qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
 int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
+void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp);
 
 /*
  * XICS legacy routines
diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index d2517c040958..fa7f3d7718da 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -91,6 +91,7 @@ typedef struct sPAPRMachineState sPAPRMachineState;
 void spapr_xive_hcall_init(sPAPRMachineState *spapr);
 void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
                    uint32_t phandle);
+void spapr_xive_mmio_map(sPAPRXive *xive);
 
 /*
  * XIVE KVM models
diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
index c5c0e063dc33..def43160e12a 100644
--- a/hw/intc/spapr_xive.c
+++ b/hw/intc/spapr_xive.c
@@ -51,7 +51,7 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
 }
 
 /* Map the ESB pages and the TIMA pages */
-static void spapr_xive_mmio_map(sPAPRXive *xive)
+void spapr_xive_mmio_map(sPAPRXive *xive)
 {
     sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
     sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);
@@ -77,8 +77,6 @@ static void spapr_xive_base_reset(DeviceState *dev)
     for (i = 0; i < xive->nr_ends; i++) {
         xive_end_reset(&xive->endt[i]);
     }
-
-    spapr_xive_mmio_map(xive);
 }
 
 static void spapr_xive_base_instance_init(Object *obj)
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index d1be2579cd9b..013e6ea8aa64 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1628,6 +1628,8 @@ static void spapr_machine_reset(void)
         spapr_irq_msi_reset(spapr);
     }
 
+    spapr_irq_reset(spapr, &error_fatal);
+
     qemu_devices_reset();
 
     /* DRC reset may cause a device to be unplugged. This will cause troubles
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 6fac6ca70595..984c6d60cd9f 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -13,6 +13,7 @@
 #include "qapi/error.h"
 #include "hw/ppc/spapr.h"
 #include "hw/ppc/spapr_xive.h"
+#include "hw/ppc/spapr_cpu_core.h"
 #include "hw/ppc/xics.h"
 #include "sysemu/kvm.h"
 
@@ -215,6 +216,10 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
     return 0;
 }
 
+static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
+{
+}
+
 #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
 #define SPAPR_IRQ_XICS_NR_MSIS     \
     (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
@@ -232,6 +237,7 @@ sPAPRIrq spapr_irq_xics = {
     .dt_populate = spapr_irq_dt_populate_xics,
     .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
     .post_load   = spapr_irq_post_load_xics,
+    .reset       = spapr_irq_reset_xics,
 };
 
  /*
@@ -362,6 +368,11 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
     return spapr_xive_post_load(spapr->xive, version_id);
 }
 
+static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
+{
+    spapr_xive_mmio_map(spapr->xive);
+}
+
 /*
  * XIVE uses the full IRQ number space. Set it to 8K to be compatible
  * with XICS.
@@ -383,6 +394,7 @@ sPAPRIrq spapr_irq_xive = {
     .dt_populate = spapr_irq_dt_populate_xive,
     .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
     .post_load   = spapr_irq_post_load_xive,
+    .reset       = spapr_irq_reset_xive,
 };
 
 /*
@@ -428,6 +440,15 @@ int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id)
     return smc->irq->post_load(spapr, version_id);
 }
 
+void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp)
+{
+    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+    if (smc->irq->reset) {
+        smc->irq->reset(spapr, errp);
+    }
+}
+
 /*
  * XICS legacy routines - to deprecate one day
  */
-- 
2.17.2

* [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (23 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:03   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 26/36] spapr: add a 'pseries-3.1-dual' machine type Cédric Le Goater
                   ` (10 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Currently, the interrupt presenter of the VCPU is set at realize
time. Setting it at reset will become useful when the new machine
supporting both interrupt modes is introduced. In this machine, the
interrupt mode is chosen at CAS time and activated after a reset.
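
The lookup can be sketched as follows, assuming a toy object model in
which each CPU owns an array of typed children (Child, Cpu,
child_foreach() and cpu_set_intc() are hypothetical names). It mirrors
the visitor convention the patch relies on: the walk stops as soon as
the callback returns non-zero.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical toy object model: a CPU with typed child objects. */
typedef struct Child {
    const char *type;
} Child;

typedef struct Cpu {
    Child *children;
    size_t nr_children;
    Child *intc;                    /* currently selected presenter */
} Cpu;

typedef int (*child_fn)(Child *child, void *opaque);

/* Visit each child; stop at the first callback returning non-zero. */
static int child_foreach(Cpu *cpu, child_fn fn, void *opaque)
{
    for (size_t i = 0; i < cpu->nr_children; i++) {
        int ret = fn(&cpu->children[i], opaque);
        if (ret) {
            return ret;
        }
    }
    return 0;
}

typedef struct FindArgs {
    const char *type;
    Child *found;
} FindArgs;

static int find_by_type(Child *child, void *opaque)
{
    FindArgs *args = opaque;

    if (strcmp(child->type, args->type) == 0) {
        args->found = child;
    }
    return args->found != NULL;
}

/* Select the presenter of the requested type; it must exist. */
static void cpu_set_intc(Cpu *cpu, const char *type)
{
    FindArgs args = { type, NULL };

    child_foreach(cpu, find_by_type, &args);
    assert(args.found);
    cpu->intc = args.found;
}
```

Because both presenters stay attached to the CPU, switching modes at
reset is only a pointer update.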

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_cpu_core.h |  2 ++
 hw/ppc/spapr_cpu_core.c         | 26 ++++++++++++++++++++++++++
 hw/ppc/spapr_irq.c              | 11 +++++++++++
 3 files changed, 39 insertions(+)

diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
index 9e2821e4b31f..fc8ea9021656 100644
--- a/include/hw/ppc/spapr_cpu_core.h
+++ b/include/hw/ppc/spapr_cpu_core.h
@@ -53,4 +53,6 @@ static inline sPAPRCPUState *spapr_cpu_state(PowerPCCPU *cpu)
     return (sPAPRCPUState *)cpu->machine_data;
 }
 
+void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type);
+
 #endif
diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
index 1811cd48db90..529de0b6b9c8 100644
--- a/hw/ppc/spapr_cpu_core.c
+++ b/hw/ppc/spapr_cpu_core.c
@@ -398,3 +398,29 @@ static const TypeInfo spapr_cpu_core_type_infos[] = {
 };
 
 DEFINE_TYPES(spapr_cpu_core_type_infos)
+
+typedef struct ForeachFindIntCArgs {
+    const char *intc_type;
+    Object *intc;
+} ForeachFindIntCArgs;
+
+static int spapr_cpu_core_find_intc(Object *child, void *opaque)
+{
+    ForeachFindIntCArgs *args = opaque;
+
+    if (object_dynamic_cast(child, args->intc_type)) {
+        args->intc = child;
+    }
+
+    return args->intc != NULL;
+}
+
+void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type)
+{
+    ForeachFindIntCArgs args = { intc_type, NULL };
+
+    object_child_foreach(OBJECT(cpu), spapr_cpu_core_find_intc, &args);
+    g_assert(args.intc);
+
+    cpu->intc = args.intc;
+}
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 984c6d60cd9f..969efad7e6e9 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -218,6 +218,11 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
 {
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
+    }
 }
 
 #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
@@ -370,6 +375,12 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
 {
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->xive_tctx_type);
+    }
+
     spapr_xive_mmio_map(spapr->xive);
 }
 
-- 
2.17.2

* [Qemu-devel] [PATCH v5 26/36] spapr: add a 'pseries-3.1-dual' machine type
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (24 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
                   ` (9 subsequent siblings)
  35 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This pseries machine makes use of a new sPAPR IRQ backend supporting
both interrupt modes, XIVE and XICS, the default being XICS.

The interrupt mode is chosen by the CAS negotiation process and
activated after a reset to take into account the required changes in
the machine. These impact the device tree layout, the interrupt
presenter object and the exposed MMIO regions in the case of XIVE.

KVM is not yet supported.
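
A simplified sketch of the dual backend idea, under the assumption that
a single boolean stands in for the OV5_XIVE_EXPLOIT bit negotiated by
CAS (Backend, Machine and the irq_*_dual() helpers here are hypothetical
names): IRQ numbers are claimed in both backends so that either mode can
take over, while mode-dependent operations go to the current backend
only.

```c
#include <assert.h>
#include <stdbool.h>

enum { NR_IRQS = 16 };

/* Hypothetical per-mode state: claimed IRQ numbers and a reset count. */
typedef struct Backend {
    bool claimed[NR_IRQS];
    int resets;
} Backend;

typedef struct Machine {
    Backend xics;
    Backend xive;
    bool xive_negotiated;           /* stand-in for the CAS result */
} Machine;

/* XICS is the default until CAS negotiates XIVE. */
static Backend *irq_current(Machine *m)
{
    return m->xive_negotiated ? &m->xive : &m->xics;
}

static int backend_claim(Backend *b, int irq)
{
    if (irq < 0 || irq >= NR_IRQS || b->claimed[irq]) {
        return -1;
    }
    b->claimed[irq] = true;
    return 0;
}

/* Claim in both backends; stop at the first failure. */
static int irq_claim_dual(Machine *m, int irq)
{
    int ret = backend_claim(&m->xive, irq);

    if (ret < 0) {
        return ret;
    }
    return backend_claim(&m->xics, irq);
}

/* Only the backend of the active interrupt mode is reset. */
static void irq_reset_dual(Machine *m)
{
    irq_current(m)->resets++;
}
```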

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_irq.h |   1 +
 hw/ppc/spapr.c             |  15 ++++
 hw/ppc/spapr_hcall.c       |  16 +++++
 hw/ppc/spapr_irq.c         | 140 +++++++++++++++++++++++++++++++++++++
 4 files changed, 172 insertions(+)

diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
index 34128976e21c..08f12ee3177b 100644
--- a/include/hw/ppc/spapr_irq.h
+++ b/include/hw/ppc/spapr_irq.h
@@ -52,6 +52,7 @@ typedef struct sPAPRIrq {
 extern sPAPRIrq spapr_irq_xics;
 extern sPAPRIrq spapr_irq_xics_legacy;
 extern sPAPRIrq spapr_irq_xive;
+extern sPAPRIrq spapr_irq_dual;
 
 void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
 int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 013e6ea8aa64..03f6fb93ed47 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -3999,6 +3999,21 @@ static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
 
 DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
 
+static void spapr_machine_3_1_dual_instance_options(MachineState *machine)
+{
+    spapr_machine_3_1_instance_options(machine);
+}
+
+static void spapr_machine_3_1_dual_class_options(MachineClass *mc)
+{
+    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
+
+    spapr_machine_3_1_class_options(mc);
+    smc->irq = &spapr_irq_dual;
+}
+
+DEFINE_SPAPR_MACHINE(3_1_dual, "3.1-dual", false);
+
 /*
  * pseries-3.0
  */
diff --git a/hw/ppc/spapr_hcall.c b/hw/ppc/spapr_hcall.c
index ae913d070f50..e9cc3697c02f 100644
--- a/hw/ppc/spapr_hcall.c
+++ b/hw/ppc/spapr_hcall.c
@@ -1654,6 +1654,22 @@ static target_ulong h_client_architecture_support(PowerPCCPU *cpu,
             (spapr_h_cas_compose_response(spapr, args[1], args[2],
                                           ov5_updates) != 0);
     }
+
+    /*
+     * Generate a machine reset when we have an update of the
+     * interrupt mode.
+     */
+    if (!spapr->cas_reboot) {
+        sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
+
+        /*
+         * The reset is not required when running under the XIVE-only
+         * machine. This test can certainly be improved.
+         */
+        spapr->cas_reboot = spapr_ovec_test(ov5_updates, OV5_XIVE_EXPLOIT)
+            && smc->irq != &spapr_irq_xive;
+    }
+
     spapr_ovec_cleanup(ov5_updates);
 
     if (spapr->cas_reboot) {
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 969efad7e6e9..79ead51c630d 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -408,6 +408,146 @@ sPAPRIrq spapr_irq_xive = {
     .reset       = spapr_irq_reset_xive,
 };
 
+/*
+ * Dual XIVE and XICS IRQ backend.
+ *
+ * Objects for both interrupt modes, XIVE and XICS, are created but
+ * the machine starts in legacy interrupt mode (XICS). It can be
+ * changed by the CAS negotiation process and, in that case, the new
+ * mode is activated after an extra machine reset.
+ */
+
+/*
+ * Returns the sPAPR IRQ backend negotiated by CAS. XICS is the
+ * default.
+ */
+static sPAPRIrq *spapr_irq_current(sPAPRMachineState *spapr)
+{
+    return spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT) ?
+        &spapr_irq_xive : &spapr_irq_xics;
+}
+
+static void spapr_irq_init_dual(sPAPRMachineState *spapr, int nr_irqs,
+                                int nr_servers, Error **errp)
+{
+    Error *local_err = NULL;
+
+    if (kvm_enabled()) {
+        error_setg(errp, "No KVM support for the 'dual' machine");
+        return;
+    }
+
+    spapr_irq_xics.init(spapr, spapr_irq_xics.nr_irqs, nr_servers, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    spapr_irq_xive.init(spapr, spapr_irq_xive.nr_irqs, nr_servers, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
+static int spapr_irq_claim_dual(sPAPRMachineState *spapr, int irq, bool lsi,
+                                Error **errp)
+{
+    int ret;
+    Error *local_err = NULL;
+
+    ret = spapr_irq_xive.claim(spapr, irq, lsi, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return ret;
+    }
+
+    ret = spapr_irq_xics.claim(spapr, irq, lsi, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+    }
+
+    return ret;
+}
+
+static void spapr_irq_free_dual(sPAPRMachineState *spapr, int irq, int num)
+{
+    spapr_irq_xive.free(spapr, irq, num);
+    spapr_irq_xics.free(spapr, irq, num);
+}
+
+static qemu_irq spapr_qirq_dual(sPAPRMachineState *spapr, int irq)
+{
+    return spapr_irq_current(spapr)->qirq(spapr, irq);
+}
+
+static void spapr_irq_print_info_dual(sPAPRMachineState *spapr, Monitor *mon)
+{
+    spapr_irq_current(spapr)->print_info(spapr, mon);
+}
+
+static void spapr_irq_dt_populate_dual(sPAPRMachineState *spapr,
+                                       uint32_t nr_servers, void *fdt,
+                                       uint32_t phandle)
+{
+    spapr_irq_current(spapr)->dt_populate(spapr, nr_servers, fdt, phandle);
+}
+
+static Object *spapr_irq_cpu_intc_create_dual(sPAPRMachineState *spapr,
+                                              Object *cpu, Error **errp)
+{
+    Error *local_err = NULL;
+
+    spapr_irq_xive.cpu_intc_create(spapr, cpu, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return NULL;
+    }
+
+    /* Default to XICS interrupt mode */
+    return spapr_irq_xics.cpu_intc_create(spapr, cpu, errp);
+}
+
+static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
+{
+    /*
+     * Force a reset of the XIVE backend after migration.
+     */
+    if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        spapr_irq_xive.reset(spapr, &error_fatal);
+    }
+
+    return spapr_irq_current(spapr)->post_load(spapr, version_id);
+}
+
+static void spapr_irq_reset_dual(sPAPRMachineState *spapr, Error **errp)
+{
+    /*
+     * Only XICS is reset at startup as it is the default interrupt
+     * mode.
+     */
+    spapr_irq_current(spapr)->reset(spapr, errp);
+}
+
+#define SPAPR_IRQ_DUAL_NR_IRQS     0x2000
+#define SPAPR_IRQ_DUAL_NR_MSIS     (SPAPR_IRQ_DUAL_NR_IRQS - SPAPR_IRQ_MSI)
+
+sPAPRIrq spapr_irq_dual = {
+    .nr_irqs     = SPAPR_IRQ_DUAL_NR_IRQS,
+    .nr_msis     = SPAPR_IRQ_DUAL_NR_MSIS,
+    .ov5         = 0x80, /* both modes */
+
+    .init        = spapr_irq_init_dual,
+    .claim       = spapr_irq_claim_dual,
+    .free        = spapr_irq_free_dual,
+    .qirq        = spapr_qirq_dual,
+    .print_info  = spapr_irq_print_info_dual,
+    .dt_populate = spapr_irq_dt_populate_dual,
+    .cpu_intc_create = spapr_irq_cpu_intc_create_dual,
+    .post_load   = spapr_irq_post_load_dual,
+    .reset       = spapr_irq_reset_dual,
+};
+
 /*
  * sPAPR IRQ frontend routines for devices
  */
-- 
2.17.2

* [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (25 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 26/36] spapr: add a 'pseries-3.1-dual' machine type Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:09   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine Cédric Le Goater
                   ` (8 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This will be used to remove the MMIO regions of the POWER9 XIVE
interrupt controller when the sPAPR machine is reset.
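
The helper depends on a sentinel address marking an unmapped region,
which makes the unmap idempotent. A standalone sketch of that
convention, assuming a toy device with a fixed region array (Device,
Region, region_del() and the dev_mmio_*() helpers are hypothetical
names; the real helper removes the region from system memory instead of
counting calls):

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/* Sentinel meaning "this region is not mapped". */
#define UNMAPPED ((hwaddr)-1)

typedef struct Region {
    hwaddr addr;
} Region;

typedef struct Device {
    int num_mmio;
    Region mmio[4];
} Device;

/* Counts removals so the example can show that unmap is idempotent. */
static int del_calls;

static void region_del(Region *r)
{
    (void)r;
    del_calls++;
}

static void dev_mmio_map(Device *dev, int n, hwaddr addr)
{
    assert(n >= 0 && n < dev->num_mmio);
    dev->mmio[n].addr = addr;
}

/* Remove the region only if it is currently mapped. */
static void dev_mmio_unmap(Device *dev, int n)
{
    assert(n >= 0 && n < dev->num_mmio);

    if (dev->mmio[n].addr != UNMAPPED) {
        region_del(&dev->mmio[n]);
        dev->mmio[n].addr = UNMAPPED;
    }
}
```

Calling the unmap twice in a row is therefore harmless, which suits a
reset path that may run with the regions already removed.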

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/sysbus.h |  1 +
 hw/core/sysbus.c    | 10 ++++++++++
 2 files changed, 11 insertions(+)

diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
index 0b59a3b8d605..bc641984b5da 100644
--- a/include/hw/sysbus.h
+++ b/include/hw/sysbus.h
@@ -92,6 +92,7 @@ qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
 void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
 void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
                              int priority);
+void sysbus_mmio_unmap(SysBusDevice *dev, int n);
 void sysbus_add_io(SysBusDevice *dev, hwaddr addr,
                    MemoryRegion *mem);
 MemoryRegion *sysbus_address_space(SysBusDevice *dev);
diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
index 7ac36ad3e707..09f202167dcb 100644
--- a/hw/core/sysbus.c
+++ b/hw/core/sysbus.c
@@ -153,6 +153,16 @@ static void sysbus_mmio_map_common(SysBusDevice *dev, int n, hwaddr addr,
     }
 }
 
+void sysbus_mmio_unmap(SysBusDevice *dev, int n)
+{
+    assert(n >= 0 && n < dev->num_mmio);
+
+    if (dev->mmio[n].addr != (hwaddr)-1) {
+        memory_region_del_subregion(get_system_memory(), dev->mmio[n].memory);
+        dev->mmio[n].addr = (hwaddr)-1;
+    }
+}
+
 void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
 {
     sysbus_mmio_map_common(dev, n, addr, false, 0);
-- 
2.17.2

* [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (26 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:08   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 29/36] ppc/xics: remove abort() in icp_kvm_init() Cédric Le Goater
                   ` (7 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This routine gathers all the KVM initialization of the XICS KVM
presenter. It will be useful when the initialization of the KVM XICS
device is moved to a global routine.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xics_kvm.c | 29 +++++++++++++++++++----------
 1 file changed, 19 insertions(+), 10 deletions(-)

diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index e8fa9a53aeba..efad1b19d821 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -123,11 +123,8 @@ static void icp_kvm_reset(DeviceState *dev)
     icp_set_kvm_state(ICP(dev), 1);
 }
 
-static void icp_kvm_realize(DeviceState *dev, Error **errp)
+static void icp_kvm_init(ICPState *icp, Error **errp)
 {
-    ICPState *icp = ICP(dev);
-    ICPStateClass *icpc = ICP_GET_CLASS(icp);
-    Error *local_err = NULL;
     CPUState *cs;
     KVMEnabledICP *enabled_icp;
     unsigned long vcpu_id;
@@ -137,12 +134,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
         abort();
     }
 
-    icpc->parent_realize(dev, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
-
     cs = icp->cs;
     vcpu_id = kvm_arch_vcpu_id(cs);
 
@@ -168,6 +159,24 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
     QLIST_INSERT_HEAD(&kvm_enabled_icps, enabled_icp, node);
 }
 
+static void icp_kvm_realize(DeviceState *dev, Error **errp)
+{
+    ICPStateClass *icpc = ICP_GET_CLASS(dev);
+    Error *local_err = NULL;
+
+    icpc->parent_realize(dev, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    icp_kvm_init(ICP(dev), &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+}
+
 static void icp_kvm_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
-- 
2.17.2

* [Qemu-devel] [PATCH v5 29/36] ppc/xics: remove abort() in icp_kvm_init()
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (27 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 30/36] spapr: check for KVM IRQ device activation Cédric Le Goater
                   ` (6 subsequent siblings)
  35 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Replace the abort with an error report which will be handled by the
caller.
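
The change in control flow can be sketched as below, assuming a
deliberately reduced error convention in which the error is a plain
string pointer instead of QEMU's Error object (device_fd, init_calls,
presenter_init() and presenter_realize() are hypothetical names):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical global: -1 means the in-kernel device was never opened. */
static int device_fd = -1;
static int init_calls;

/* Report the missing device to the caller instead of aborting. */
static void presenter_init(const char **errp)
{
    if (device_fd == -1) {
        if (errp) {
            *errp = "KVM XICS device is not initialized";
        }
        return;
    }
    init_calls++;
}

/* The caller collects the error locally and propagates it upwards. */
static void presenter_realize(const char **errp)
{
    const char *local_err = NULL;

    presenter_init(&local_err);
    if (local_err) {
        if (errp) {
            *errp = local_err;
        }
        return;
    }
}
```

With this shape the decision to fail, retry or fall back belongs to the
caller rather than to the initialization routine.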

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xics_kvm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index efad1b19d821..9662e208fa81 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -131,7 +131,8 @@ static void icp_kvm_init(ICPState *icp, Error **errp)
     int ret;
 
     if (kernel_xics_fd == -1) {
-        abort();
+        error_setg(errp, "KVM XICS device is not initialized");
+        return;
     }
 
     cs = icp->cs;
-- 
2.17.2

* [Qemu-devel] [PATCH v5 30/36] spapr: check for KVM IRQ device activation
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (28 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 29/36] ppc/xics: remove abort() in icp_kvm_init() Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 31/36] spapr/xive: export the spapr_xive_kvm_init() routine Cédric Le Goater
                   ` (5 subsequent siblings)
  35 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The activation of the KVM IRQ device depends on the interrupt mode
chosen at CAS time by the machine, so the methods used at reset or by
migration must first check that the device is in use.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive_kvm.c |  5 +++++
 hw/intc/xics_kvm.c       | 20 ++++++++++++++++++++
 2 files changed, 25 insertions(+)

diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 176083c37d61..b9fee4ea240f 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -656,6 +656,11 @@ static int spapr_xive_kvm_pre_save(sPAPRXive *xive)
     int i;
     int ret = 0;
 
+    /* The KVM XIVE device is not in use */
+    if (xive->fd == -1) {
+        return 0;
+    }
+
     /* Quiesce the sources, to stop the flow of event notifications */
     for (i = 0; i < xsrc->nr_irqs; i++) {
         /*
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index 9662e208fa81..eabc901a4556 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -58,6 +58,11 @@ static void icp_get_kvm_state(ICPState *icp)
     uint64_t state;
     int ret;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
     /* ICP for this CPU thread is not in use, exiting */
     if (!icp->cs) {
         return;
@@ -94,6 +99,11 @@ static int icp_set_kvm_state(ICPState *icp, int version_id)
     uint64_t state;
     int ret;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
     /* ICP for this CPU thread is not in use, exiting */
     if (!icp->cs) {
         return 0;
@@ -209,6 +219,11 @@ static void ics_get_kvm_state(ICSState *ics)
     uint64_t state;
     int i;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return;
+    }
+
     for (i = 0; i < ics->nr_irqs; i++) {
         ICSIRQState *irq = &ics->irqs[i];
 
@@ -268,6 +283,11 @@ static int ics_set_kvm_state(ICSState *ics, int version_id)
     int i;
     Error *local_err = NULL;
 
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
     for (i = 0; i < ics->nr_irqs; i++) {
         ICSIRQState *irq = &ics->irqs[i];
         int ret;
-- 
2.17.2

* [Qemu-devel] [PATCH v5 31/36] spapr/xive: export the spapr_xive_kvm_init() routine
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (29 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 30/36] spapr: check for KVM IRQ device activation Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:11   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers Cédric Le Goater
                   ` (4 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

We will need it to initialize the KVM XIVE device globally from the
machine when the XIVE interrupt mode is selected.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_xive.h | 2 ++
 hw/intc/spapr_xive_kvm.c    | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index fa7f3d7718da..1d134a681326 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -107,4 +107,6 @@ void spapr_xive_mmio_map(sPAPRXive *xive);
 #define TYPE_XIVE_TCTX_KVM   "xive-tctx-kvm"
 #define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
 
+void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp);
+
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index b9fee4ea240f..cb2aa6e81274 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -809,7 +809,7 @@ static void spapr_xive_kvm_instance_init(Object *obj)
                               NULL);
 }
 
-static void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
+void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
 {
     Error *local_err = NULL;
     size_t tima_len;
-- 
2.17.2

* [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (30 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 31/36] spapr/xive: export the spapr_xive_kvm_init() routine Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:12   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device Cédric Le Goater
                   ` (3 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

Removing RTAS handlers will become necessary when the new pseries
machine supporting multiple interrupt modes is introduced.
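
A standalone sketch of the relaxed check, assuming a small fixed-size
table and made-up token bounds (TOKEN_BASE, TOKEN_MAX, table and
rtas_register() are hypothetical names; the handler name is only an
example): registering with a NULL name clears the slot, while
registering a named handler over a live slot still trips the assertion.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token range for the example table. */
#define TOKEN_BASE 0x2000
#define TOKEN_MAX  8

typedef void (*rtas_fn)(void);

static struct {
    const char *name;
    rtas_fn fn;
} table[TOKEN_MAX];

/*
 * Register a handler, or remove one by passing a NULL name.
 * Overwriting a live entry with another named handler is a bug.
 */
static void rtas_register(int token, const char *name, rtas_fn fn)
{
    assert(token >= TOKEN_BASE && token < TOKEN_BASE + TOKEN_MAX);

    token -= TOKEN_BASE;

    assert(!name || !table[token].name);

    table[token].name = name;
    table[token].fn = fn;
}

static void handler(void)
{
}
```

A slot can then be registered, cleared and registered again across a
mode change.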

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/ppc/spapr_rtas.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
index d6a0952154ac..e005d5d08151 100644
--- a/hw/ppc/spapr_rtas.c
+++ b/hw/ppc/spapr_rtas.c
@@ -404,7 +404,7 @@ void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn)
 
     token -= RTAS_TOKEN_BASE;
 
-    assert(!rtas_table[token].name);
+    assert(!name || !rtas_table[token].name);
 
     rtas_table[token].name = name;
     rtas_table[token].fn = fn;
-- 
2.17.2

* [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (31 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:17   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine Cédric Le Goater
                   ` (2 subsequent siblings)
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

If a new interrupt mode is chosen by CAS, the machine generates a
reset to reconfigure. At this point, the connection with the previous
KVM device needs to be closed and a new connection needs to be opened
with the KVM device operating the chosen interrupt mode.

New routines are introduced to destroy the XICS and XIVE KVM
devices. They make use of a new KVM device ioctl which destroys the
device and also disconnects the IRQ presenters from the VCPUs.
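
The teardown pattern used by both routines can be sketched in
isolation: return early when no device is in use, release the device,
then empty the local list of connected presenters. The sketch below is
a simplified model and not the QEMU code. The struct and function names
are hypothetical, a plain singly linked list stands in for QLIST, and
the KVM ioctl and close() are replaced by a placeholder.

```c
#include <assert.h>
#include <stdlib.h>

struct presenter {
    int vcpu_id;
    struct presenter *next;
};

struct irq_device {
    int fd;                       /* -1 when no kernel device is in use */
    struct presenter *presenters; /* VCPUs connected to the device */
};

void irq_device_connect(struct irq_device *dev, int vcpu_id)
{
    struct presenter *p = malloc(sizeof(*p));

    assert(p);
    p->vcpu_id = vcpu_id;
    p->next = dev->presenters;
    dev->presenters = p;
}

/* Returns the number of presenters disconnected, 0 if the device was unused. */
int irq_device_fini(struct irq_device *dev)
{
    struct presenter *p, *next;
    int count = 0;

    if (dev->fd == -1) {
        return 0;
    }

    /* Placeholder for the destroy ioctl and close() of the real routines. */
    dev->fd = -1;

    for (p = dev->presenters; p; p = next) {
        next = p->next;   /* read the link before freeing the node */
        free(p);
        count++;
    }
    dev->presenters = NULL;
    return count;
}
```

Because of the early return, calling the fini routine a second time is
harmless, which is what allows a reset handler to call it
unconditionally.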

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/spapr_xive.h |  1 +
 include/hw/ppc/xics.h       |  1 +
 linux-headers/linux/kvm.h   |  2 ++
 hw/intc/spapr_xive_kvm.c    | 54 +++++++++++++++++++++++++++++++++++
 hw/intc/xics_kvm.c          | 57 +++++++++++++++++++++++++++++++++++++
 5 files changed, 115 insertions(+)

diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
index 1d134a681326..c913c0aed08a 100644
--- a/include/hw/ppc/spapr_xive.h
+++ b/include/hw/ppc/spapr_xive.h
@@ -108,5 +108,6 @@ void spapr_xive_mmio_map(sPAPRXive *xive);
 #define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
 
 void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp);
+void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp);
 
 #endif /* PPC_SPAPR_XIVE_H */
diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
index 9958443d1984..a5468c6eb6e3 100644
--- a/include/hw/ppc/xics.h
+++ b/include/hw/ppc/xics.h
@@ -205,6 +205,7 @@ void icp_resend(ICPState *ss);
 typedef struct sPAPRMachineState sPAPRMachineState;
 
 int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
+int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp);
 void xics_spapr_init(sPAPRMachineState *spapr);
 
 Object *icp_create(Object *cpu, const char *type, XICSFabric *xi,
diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
index 59fa8d8d7f39..b7a74c58d0db 100644
--- a/linux-headers/linux/kvm.h
+++ b/linux-headers/linux/kvm.h
@@ -1309,6 +1309,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DESTROY_DEVICE	  _IOWR(KVMIO,  0xf0, struct kvm_create_device)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index cb2aa6e81274..0672d8bcbc6b 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -55,6 +55,16 @@ static void kvm_cpu_enable(CPUState *cs)
     QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
 }
 
+static void kvm_cpu_disable_all(void)
+{
+    KVMEnabledCPU *enabled_cpu, *next;
+
+    QLIST_FOREACH_SAFE(enabled_cpu, &kvm_enabled_cpus, node, next) {
+        QLIST_REMOVE(enabled_cpu, node);
+        g_free(enabled_cpu);
+    }
+}
+
 /*
  * XIVE Thread Interrupt Management context (KVM)
  */
@@ -864,6 +874,50 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
     kvm_gsi_direct_mapping = true;
 }
 
+void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    struct kvm_create_device xive_destroy_device = {
+        .fd = xive->fd,
+        .type = KVM_DEV_TYPE_XIVE,
+        .flags = 0,
+    };
+    size_t esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
+    int rc;
+
+    /* The KVM XIVE device is not in use */
+    if (xive->fd == -1) {
+        return;
+    }
+
+    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
+        error_setg(errp,
+                   "IRQ_XIVE capability must be present for KVM XIVE device");
+        return;
+    }
+
+    /* Clear the KVM mapping */
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xsrc), 0);
+    munmap(xsrc->esb_mmap, esb_len);
+    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 0);
+    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
+
+    /* Destroy the KVM device. This also clears the VCPU presenters */
+    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xive_destroy_device);
+    if (rc < 0) {
+        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XIVE");
+    }
+    close(xive->fd);
+    xive->fd = -1;
+
+    kvm_kernel_irqchip = false;
+    kvm_msi_via_irqfd_allowed = false;
+    kvm_gsi_direct_mapping = false;
+
+    /* Clear the local list of presenter (hotplug) */
+    kvm_cpu_disable_all();
+}
+
 static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
 {
     sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index eabc901a4556..a7e3ec32a761 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -50,6 +50,16 @@ typedef struct KVMEnabledICP {
 static QLIST_HEAD(, KVMEnabledICP)
     kvm_enabled_icps = QLIST_HEAD_INITIALIZER(&kvm_enabled_icps);
 
+static void kvm_disable_icps(void)
+{
+    KVMEnabledICP *enabled_icp, *next;
+
+    QLIST_FOREACH_SAFE(enabled_icp, &kvm_enabled_icps, node, next) {
+        QLIST_REMOVE(enabled_icp, node);
+        g_free(enabled_icp);
+    }
+}
+
 /*
  * ICP-KVM
  */
@@ -475,6 +485,53 @@ fail:
     return -1;
 }
 
+int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp)
+{
+    int rc;
+    struct kvm_create_device xics_create_device = {
+        .fd = kernel_xics_fd,
+        .type = KVM_DEV_TYPE_XICS,
+        .flags = 0,
+    };
+
+    /* The KVM XICS device is not in use */
+    if (kernel_xics_fd == -1) {
+        return 0;
+    }
+
+    if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
+        error_setg(errp,
+                   "KVM and IRQ_XICS capability must be present for KVM XICS device");
+        return -1;
+    }
+
+    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xics_create_device);
+    if (rc < 0) {
+        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XICS");
+    }
+    close(kernel_xics_fd);
+    kernel_xics_fd = -1;
+
+    spapr_rtas_register(RTAS_IBM_SET_XIVE, NULL, 0);
+    spapr_rtas_register(RTAS_IBM_GET_XIVE, NULL, 0);
+    spapr_rtas_register(RTAS_IBM_INT_OFF, NULL, 0);
+    spapr_rtas_register(RTAS_IBM_INT_ON, NULL, 0);
+
+    kvmppc_define_rtas_kernel_token(0, "ibm,set-xive");
+    kvmppc_define_rtas_kernel_token(0, "ibm,get-xive");
+    kvmppc_define_rtas_kernel_token(0, "ibm,int-on");
+    kvmppc_define_rtas_kernel_token(0, "ibm,int-off");
+
+    kvm_kernel_irqchip = false;
+    kvm_msi_via_irqfd_allowed = false;
+    kvm_gsi_direct_mapping = false;
+
+    /* Clear the presenter from the VCPUs */
+    kvm_disable_icps();
+
+    return rc;
+}
+
 static void xics_kvm_register_types(void)
 {
     type_register_static(&ics_kvm_info);
-- 
2.17.2


* [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (32 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-29  4:22   ` David Gibson
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 35/36] ppc: externalize ppc_get_vcpu_by_pir() Cédric Le Goater
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support Cédric Le Goater
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

The interrupt mode is chosen by the CAS negotiation process and
activated after a reset to take into account the required changes in
the machine. This brings new constraints on how the associated KVM IRQ
device is initialized.

Currently, each model takes care of the initialization of the KVM
device in its realize method, but this is no longer possible as the
initialization needs to be done globally when the interrupt mode is
known, i.e. when the machine is reset. It also means that we need a
way to delete a KVM device when another mode is chosen.

Also, to support migration, the QEMU objects holding the state to
transfer should always be available but not necessarily activated.

The overall approach of this proposal is to initialize both interrupt
modes at the QEMU level and keep the IRQ number space in sync to allow
switching from one mode to another. For the KVM side of things, the
whole initialization of the KVM device, sources and presenters, is
grouped in a single routine. The XICS and XIVE sPAPR IRQ reset
handlers are modified accordingly to handle the init and delete
sequences of the KVM device. The post_load handlers are modified as
well, to take into account a possible change of interrupt mode after
transfer.

As KVM is now initialized at reset, we lose the possibility to fall
back to the QEMU emulated mode in case of failure, and failures become
fatal to the machine.
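
The reset sequence for the 'dual' machine can be summarized with a toy
model: destroy every KVM IRQ device first, then initialize only the
device matching the negotiated mode. This is a sketch under
assumptions and not the code of the patch. The struct, the enum and
machine_reset() are hypothetical names, and the two booleans stand in
for the real KVM devices.

```c
#include <assert.h>
#include <stdbool.h>

enum irq_mode { IRQ_MODE_XICS, IRQ_MODE_XIVE };

struct machine {
    bool xics_kvm_active;      /* stand-in for the KVM XICS device */
    bool xive_kvm_active;      /* stand-in for the KVM XIVE device */
    enum irq_mode negotiated;  /* mode chosen by CAS, XICS by default */
};

void machine_reset(struct machine *m)
{
    /* Destroy all the KVM IRQ devices, whichever one was in use. */
    m->xics_kvm_active = false;
    m->xive_kvm_active = false;

    /* Bring up only the device for the mode known at reset time. */
    if (m->negotiated == IRQ_MODE_XIVE) {
        m->xive_kvm_active = true;
    } else {
        m->xics_kvm_active = true;
    }

    /* At most one KVM IRQ device is connected after a reset. */
    assert(m->xics_kvm_active != m->xive_kvm_active);
}
```

In this model there is no fallback path: if the init step failed there
would be no working device left, which matches the remark above that
failures become fatal.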

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/spapr_xive_kvm.c | 48 +++++++++++-----------
 hw/intc/xics_kvm.c       | 18 ++++++---
 hw/ppc/spapr_irq.c       | 86 +++++++++++++++++++++++++++++-----------
 3 files changed, 98 insertions(+), 54 deletions(-)

diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
index 0672d8bcbc6b..9c7d36f51e3d 100644
--- a/hw/intc/spapr_xive_kvm.c
+++ b/hw/intc/spapr_xive_kvm.c
@@ -148,7 +148,6 @@ static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
 
 static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
 {
-    XiveTCTX *tctx = XIVE_TCTX_KVM(dev);
     XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(dev);
     Error *local_err = NULL;
 
@@ -157,12 +156,6 @@ static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
         error_propagate(errp, local_err);
         return;
     }
-
-    xive_tctx_kvm_init(tctx, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
 }
 
 static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
@@ -222,12 +215,9 @@ static void xive_source_kvm_init(XiveSource *xsrc, Error **errp)
 
 static void xive_source_kvm_reset(DeviceState *dev)
 {
-    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
     XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
 
     xsc->parent_reset(dev);
-
-    xive_source_kvm_init(xsrc, &error_fatal);
 }
 
 /*
@@ -346,12 +336,6 @@ static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
 
     xsrc->qirqs = qemu_allocate_irqs(xive_source_kvm_set_irq, xsrc,
                                      xsrc->nr_irqs);
-
-    xive_source_kvm_mmap(xsrc, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
 }
 
 static void xive_source_kvm_unrealize(DeviceState *dev, Error **errp)
@@ -823,6 +807,7 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
 {
     Error *local_err = NULL;
     size_t tima_len;
+    CPUState *cs;
 
     if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
         error_setg(errp,
@@ -850,7 +835,18 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
         return;
     }
 
-    /* Let the XiveSource KVM model handle the mapping for the moment */
+    xive_source_kvm_mmap(&xive->source, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+
+    /* Create the KVM interrupt sources */
+    xive_source_kvm_init(&xive->source, &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
 
     /* TIMA KVM mapping
      *
@@ -869,6 +865,17 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
                                       "xive.tima", tima_len, xive->tm_mmap);
     sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->tm_mmio);
 
+    /* Connect the presenters to the VCPU */
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        xive_tctx_kvm_init(XIVE_TCTX_BASE(cpu->intc), &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
+    }
+
     kvm_kernel_irqchip = true;
     kvm_msi_via_irqfd_allowed = true;
     kvm_gsi_direct_mapping = true;
@@ -920,16 +927,9 @@ void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp)
 
 static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
 {
-    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
     sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
     Error *local_err = NULL;
 
-    spapr_xive_kvm_init(xive, &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
-
     /* Initialize the source and the local routing tables */
     sxc->parent_realize(dev, &local_err);
     if (local_err) {
diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
index a7e3ec32a761..c89fa943847c 100644
--- a/hw/intc/xics_kvm.c
+++ b/hw/intc/xics_kvm.c
@@ -190,12 +190,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
         error_propagate(errp, local_err);
         return;
     }
-
-    icp_kvm_init(ICP(dev), &local_err);
-    if (local_err) {
-        error_propagate(errp, local_err);
-        return;
-    }
 }
 
 static void icp_kvm_class_init(ObjectClass *klass, void *data)
@@ -427,6 +421,8 @@ static void rtas_dummy(PowerPCCPU *cpu, sPAPRMachineState *spapr,
 int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
 {
     int rc;
+    CPUState *cs;
+    Error *local_err = NULL;
 
     if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
         error_setg(errp,
@@ -475,6 +471,16 @@ int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
     kvm_msi_via_irqfd_allowed = true;
     kvm_gsi_direct_mapping = true;
 
+    /* Connect the presenters to the VCPU */
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+
+        icp_kvm_init(ICP(cpu->intc), &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            goto fail;
+        }
+    }
     return 0;
 
 fail:
diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
index 79ead51c630d..f1720a8dda33 100644
--- a/hw/ppc/spapr_irq.c
+++ b/hw/ppc/spapr_irq.c
@@ -98,20 +98,14 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
     MachineState *machine = MACHINE(spapr);
     Error *local_err = NULL;
 
-    if (kvm_enabled()) {
-        if (machine_kernel_irqchip_allowed(machine) &&
-            !xics_kvm_init(spapr, &local_err)) {
-            spapr->icp_type = TYPE_KVM_ICP;
-            spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs,
-                                          &local_err);
-        }
-        if (machine_kernel_irqchip_required(machine) && !spapr->ics) {
-            error_prepend(&local_err,
-                          "kernel_irqchip requested but unavailable: ");
-            goto error;
+    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+        spapr->icp_type = TYPE_KVM_ICP;
+        spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs,
+                                      &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
         }
-        error_free(local_err);
-        local_err = NULL;
     }
 
     if (!spapr->ics) {
@@ -119,10 +113,11 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
         spapr->icp_type = TYPE_ICP;
         spapr->ics = spapr_ics_create(spapr, TYPE_ICS_SIMPLE, nr_irqs,
                                       &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            return;
+        }
     }
-
-error:
-    error_propagate(errp, local_err);
 }
 
 #define ICS_IRQ_FREE(ics, srcno)   \
@@ -218,11 +213,28 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
 {
+    MachineState *machine = MACHINE(spapr);
     CPUState *cs;
+    Error *local_err = NULL;
 
     CPU_FOREACH(cs) {
         spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
     }
+
+    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+        xics_kvm_fini(spapr, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_prepend(errp, "KVM XICS fini failed: ");
+            return;
+        }
+        xics_kvm_init(spapr, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_prepend(errp, "KVM XICS init failed: ");
+            return;
+        }
+    }
 }
 
 #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
@@ -288,10 +300,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
         spapr->xive_tctx_type = TYPE_XIVE_TCTX_KVM;
         spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM, nr_irqs,
                                         nr_servers, &local_err);
-
-        if (local_err && machine_kernel_irqchip_required(machine)) {
+        if (local_err) {
             error_propagate(errp, local_err);
-            error_prepend(errp, "kernel_irqchip requested but init failed : ");
             return;
         }
 
@@ -375,12 +385,29 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
 {
+    MachineState *machine = MACHINE(spapr);
     CPUState *cs;
+    Error *local_err = NULL;
 
     CPU_FOREACH(cs) {
         spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->xive_tctx_type);
     }
 
+    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+        spapr_xive_kvm_fini(spapr->xive, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_prepend(errp, "KVM XIVE fini failed: ");
+            return;
+        }
+        spapr_xive_kvm_init(spapr->xive, &local_err);
+        if (local_err) {
+            error_propagate(errp, local_err);
+            error_prepend(errp, "KVM XIVE init failed: ");
+            return;
+        }
+    }
+
     spapr_xive_mmio_map(spapr->xive);
 }
 
@@ -432,11 +459,6 @@ static void spapr_irq_init_dual(sPAPRMachineState *spapr, int nr_irqs,
 {
     Error *local_err = NULL;
 
-    if (kvm_enabled()) {
-        error_setg(errp, "No KVM support for the 'dual' machine");
-        return;
-    }
-
     spapr_irq_xics.init(spapr, spapr_irq_xics.nr_irqs, nr_servers, &local_err);
     if (local_err) {
         error_propagate(errp, local_err);
@@ -510,10 +532,15 @@ static Object *spapr_irq_cpu_intc_create_dual(sPAPRMachineState *spapr,
 
 static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
 {
+    MachineState *machine = MACHINE(spapr);
+
     /*
      * Force a reset of the XIVE backend after migration.
      */
     if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
+        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+            xics_kvm_fini(spapr, &error_fatal);
+        }
         spapr_irq_xive.reset(spapr, &error_fatal);
     }
 
@@ -522,6 +549,17 @@ static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
 
 static void spapr_irq_reset_dual(sPAPRMachineState *spapr, Error **errp)
 {
+    MachineState *machine = MACHINE(spapr);
+
+    /*
+     * Destroy all the KVM IRQ devices. This also clears the VCPU
+     * presenters
+     */
+    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
+        xics_kvm_fini(spapr, &error_fatal);
+        spapr_xive_kvm_fini(spapr->xive, &error_fatal);
+    }
+
     /*
      * Only XICS is reseted at startup as it is the default interrupt
      * mode.
-- 
2.17.2


* [Qemu-devel] [PATCH v5 35/36] ppc: externalize ppc_get_vcpu_by_pir()
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (33 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support Cédric Le Goater
  35 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

We will use it to get the CPU interrupt presenter in XIVE.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/ppc.h |  1 +
 hw/ppc/pnv.c         | 16 ----------------
 hw/ppc/ppc.c         | 16 ++++++++++++++++
 3 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/include/hw/ppc/ppc.h b/include/hw/ppc/ppc.h
index 298ec354a8a8..daaa04a22dbf 100644
--- a/include/hw/ppc/ppc.h
+++ b/include/hw/ppc/ppc.h
@@ -4,6 +4,7 @@
 #include "target/ppc/cpu-qom.h"
 
 void ppc_set_irq(PowerPCCPU *cpu, int n_IRQ, int level);
+PowerPCCPU *ppc_get_vcpu_by_pir(int pir);
 
 /* PowerPC hardware exceptions management helpers */
 typedef void (*clk_setup_cb)(void *opaque, uint32_t freq);
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 346f5e7aedb5..66f2301b4ece 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -1070,22 +1070,6 @@ static void pnv_ics_resend(XICSFabric *xi)
     }
 }
 
-static PowerPCCPU *ppc_get_vcpu_by_pir(int pir)
-{
-    CPUState *cs;
-
-    CPU_FOREACH(cs) {
-        PowerPCCPU *cpu = POWERPC_CPU(cs);
-        CPUPPCState *env = &cpu->env;
-
-        if (env->spr_cb[SPR_PIR].default_value == pir) {
-            return cpu;
-        }
-    }
-
-    return NULL;
-}
-
 static ICPState *pnv_icp_get(XICSFabric *xi, int pir)
 {
     PowerPCCPU *cpu = ppc_get_vcpu_by_pir(pir);
diff --git a/hw/ppc/ppc.c b/hw/ppc/ppc.c
index ec4be25f4994..9292f986eba7 100644
--- a/hw/ppc/ppc.c
+++ b/hw/ppc/ppc.c
@@ -1358,3 +1358,19 @@ void PPC_debug_write (void *opaque, uint32_t addr, uint32_t val)
         break;
     }
 }
+
+PowerPCCPU *ppc_get_vcpu_by_pir(int pir)
+{
+    CPUState *cs;
+
+    CPU_FOREACH(cs) {
+        PowerPCCPU *cpu = POWERPC_CPU(cs);
+        CPUPPCState *env = &cpu->env;
+
+        if (env->spr_cb[SPR_PIR].default_value == pir) {
+            return cpu;
+        }
+    }
+
+    return NULL;
+}
-- 
2.17.2


* [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support
  2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
                   ` (34 preceding siblings ...)
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 35/36] ppc: externalize ppc_get_vcpu_by_pir() Cédric Le Goater
@ 2018-11-16 10:57 ` Cédric Le Goater
  2018-12-03  2:26   ` David Gibson
  35 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-16 10:57 UTC (permalink / raw)
  To: David Gibson
  Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt, Cédric Le Goater

This is a simple model of the POWER9 XIVE interrupt controller for the
PowerNV machine. XIVE for baremetal is a complex controller and the
model only addresses the needs of the skiboot firmware.

* Overall architecture

              XIVE Interrupt Controller
              +-------------------------------------+       IPIs
              | +---------+ +---------+ +---------+ |    +--------+
              | |VC       | |CQ       | |PC       |----> | CORES  |
              | |     esb | |         | |         |----> |        |
              | |     eas | |  Bridge | |         |----> |        |
              | |SC   end | |         | |     nvt | |    |        |
+------+      | +---------+ +----+----+ +---------+ |    +--+-+-+-+
| RAM  |      +------------------|------------------+       | | |
|      |                         |                          | | |
|      |                         |                          | | |
|      |   +---------------------v--------------------------v-v-v---+      other
|      <---+                       Power Bus                        +----> chips
|  esb |   +-----------+-----------------------+--------------------+
|  eas |               |                       |
|  end |               |                       |
|  nvt |           +---+----+              +---+----+
+------+           |SC      |              |SC      |
                   |        |              |        |
                   | 2-bits |              | 2-bits |
                   | local  |              |   VC   |
                   +--------+              +--------+
                     PCIe                  NX,NPU,CAPI

                  SC: Source Controller (aka. IVSE)
                  VC: Virtualization Controller (aka. IVRE)
                  CQ: Common Queue (Bridge)
                  PC: Presentation Controller (aka. IVPE)

              2-bits: source state machine
                 esb: Event State Buffer (Array of PQ bits in an IVSE)
                 eas: Event Assignment Structure
                 end: Event Notification Descriptor
                 nvt: Notification Virtual Target

It is composed of three sub-engines :

  - Interrupt Virtualization Source Engine (IVSE), or Source
    Controller (SC). These are found in PCI PHBs, in the PSI host
    bridge controller, but also inside the main controller for the
    core IPIs and other sub-chips (NX, CAP, NPU) of the
    chip/processor. They are configured to feed the IVRE with events.

  - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
    Controller (VC). Its job is to match an event source with an Event
    Notification Descriptor (END).

  - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
    Controller (PC). It maintains the interrupt context state of each
    thread and handles the delivery of the external exception to the
    thread.

* XIVE internal tables

Each of the sub-engines uses a set of tables to redirect exceptions
from event sources to CPU threads.

                                             +-------+
   User or OS                                |  EQ   |
       or                            +------>|entries|
   Hypervisor                        |       |  ..   |
     Memory                          |       +-------+
                                     |           ^
                                     |           |
               +--------------------------------------------------+
                                     |           |
   Hypervisor        +------+    +---+--+    +---+--+   +------+
     Memory          | ESB  |    | EAT  |    | ENDT |   | NVTT |
    (skiboot)        +----+-+    +----+-+    +----+-+   +------+
                       ^  |        ^  |        ^  |       ^
                       |  |        |  |        |  |       |
               +--------------------------------------------------+
                       |  |        |  |        |  |       |
                       |  |        |  |        |  |       |
                 +-----|--|--------|--|--------|--|-+   +-|-----+    +------+
                 |     |  |        |  |        |  | |   | | tctx|    |Thread|
    IPI or   ----+     +  v        +  v        +  v |---| +  .. |----->     |
   HW events     |                                  |   |       |    |      |
                 |              IVRE                |   | IVPE  |    +------+
                 +----------------------------------+   +-------+

Each IVSE has a 2-bit state machine, P for pending and Q for queued,
for each source, which controls whether events can be triggered. The
PQ bits are stored in an array, the Event State Buffer (ESB), and are
controlled by MMIOs.
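
As an illustration, the P/Q state machine of a source can be sketched
as follows. The encoding (P as the high bit, Q as the low bit) and the
transitions are assumptions based on the usual description of XIVE
sources, and the names are chosen for the example. This is not
presented as the code of the model.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings: P is the high bit, Q the low bit (assumed layout). */
enum {
    ESB_RESET   = 0x0,  /* P=0 Q=0: idle, the next trigger is forwarded */
    ESB_OFF     = 0x1,  /* P=0 Q=1: source masked */
    ESB_PENDING = 0x2,  /* P=1 Q=0: event forwarded, waiting for EOI */
    ESB_QUEUED  = 0x3,  /* P=1 Q=1: another event arrived meanwhile */
};

/* Returns true when the event should be forwarded to the router. */
bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;
        return false;
    default: /* ESB_OFF: masked sources stay masked */
        *pq = ESB_OFF;
        return false;
    }
}

/* Returns true when a queued event must be replayed after the EOI. */
bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    case ESB_OFF:
        *pq = ESB_OFF;
        return false;
    default: /* ESB_RESET or ESB_PENDING */
        *pq = ESB_RESET;
        return false;
    }
}
```

With this model, a first trigger on an idle source is let through,
further triggers are coalesced in the Q bit, and the EOI either
replays one coalesced event or returns the source to idle.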

If the event is let through, the IVRE looks up the Event Assignment
Structure (EAS) table for an Event Notification Descriptor (END)
configured for the source. Each Event Notification Descriptor defines
a notification path to a CPU and an in-memory Event Queue, into which
the EQ data is pushed for the OS to pull.
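
The lookup from source to event queue can be sketched as below. The
struct layouts, the queue size and the use of a generation bit in the
top bit of each entry are assumptions made for the example. The real
EAS and END structures carry more fields and are defined by the XIVE
architecture.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EQ_ENTRIES 4   /* hypothetical queue size, must be a power of two */

struct eas {
    bool valid;
    uint32_t end_idx;  /* END assigned to this source */
    uint32_t data;     /* EQ data pushed when the source fires */
};

struct end {
    bool valid;
    uint32_t queue[EQ_ENTRIES]; /* in-memory event queue */
    uint32_t index;             /* next slot to write */
    uint32_t gen;               /* generation bit, flipped on each wrap */
};

/* Route an event from source 'srcno'. Returns true if an entry was queued. */
bool route_event(const struct eas *eat, uint32_t nr_eas,
                 struct end *endt, uint32_t nr_ends, uint32_t srcno)
{
    const struct eas *eas;
    struct end *end;

    if (srcno >= nr_eas || !eat[srcno].valid) {
        return false;
    }
    eas = &eat[srcno];

    if (eas->end_idx >= nr_ends || !endt[eas->end_idx].valid) {
        return false;
    }
    end = &endt[eas->end_idx];

    /* The top bit lets the consumer tell new entries from stale ones. */
    end->queue[end->index] = (end->gen << 31) | (eas->data & 0x7fffffff);
    end->index = (end->index + 1) & (EQ_ENTRIES - 1);
    if (end->index == 0) {
        end->gen ^= 1;
    }
    return true;
}
```

Starting the generation bit at 1 means that the first pass over a
zeroed queue produces entries the OS can distinguish from empty slots.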

The IVPE determines if a Notification Virtual Target (NVT) can handle
the event by scanning the thread contexts of the VPs dispatched on the
processor HW threads. It maintains the interrupt context state of each
thread in an NVT table.

* QEMU model for PowerNV

The PowerNV model reuses the common XIVE framework developed for sPAPR
and the fundamental aspects are much the same. The differences are
outlined below.

The initial BAR configuration of the controller is performed using the
XSCOM bus. From there, MMIOs are used for further configuration.

The MMIO regions exposed are :

 - Interrupt controller registers
 - ESB pages for IPIs and ENDs
 - Presenter MMIO (Not used)
 - Thread Interrupt Management Area MMIO, direct and indirect

The Virtualization Controller MMIO region, which contains the IPI ESB
pages and END ESB pages, is sub-divided into "sets" which map portions
of the VC region to the different ESB pages. It is configured at
runtime through the EDT set translation table to let the firmware
decide how to split the address space between IPI ESB pages and END
ESB pages.
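
One way to picture the set translation is the sketch below. It assumes
equally sized sets and that the sets of a given type are packed
contiguously in the order in which they appear in the table. The
function name, the parameters and the sizes are hypothetical. Only the
type values 0, 1 and 2 follow the CQ_TDR_EDT_* definitions of this
patch.

```c
#include <assert.h>
#include <stdint.h>

enum edt_type { EDT_INVALID = 0, EDT_IPI = 1, EDT_EQ = 2 };

/*
 * Translate an offset in the VC region into an offset in the ESB pages
 * of 'type'. Returns UINT64_MAX when the offset does not fall in a set
 * of that type.
 */
uint64_t edt_translate(const enum edt_type *edt, unsigned nr_sets,
                       uint64_t set_size, uint64_t vc_offset,
                       enum edt_type type)
{
    uint64_t set = vc_offset / set_size;
    uint64_t preceding = 0;
    uint64_t i;

    if (type == EDT_INVALID || set >= nr_sets || edt[set] != type) {
        return UINT64_MAX;
    }

    /* Count the sets of the same type located before this one. */
    for (i = 0; i < set; i++) {
        if (edt[i] == type) {
            preceding++;
        }
    }

    return preceding * set_size + vc_offset % set_size;
}
```

For a table { IPI, EQ, IPI, invalid } with sets of 0x1000 bytes, an
access at VC offset 0x2010 lands in the second IPI set and translates
to offset 0x1010 in the IPI ESB space.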

The XIVE tables are now in the machine RAM and not in the hypervisor
anymore. The firmware (skiboot) configures these tables using Virtual
Structure Descriptors defining the characteristics of each table : SBE,
EAS, END and NVT. These are later used to access the virtual interrupt
entries. The internal cache of these tables in the interrupt controller
is updated and invalidated using a set of registers.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/pnv_xive_regs.h    |  314 +++++++
 include/hw/ppc/pnv.h       |   22 +-
 include/hw/ppc/pnv_xive.h  |  100 +++
 include/hw/ppc/pnv_xscom.h |    3 +
 include/hw/ppc/xive.h      |    1 +
 hw/intc/pnv_xive.c         | 1612 ++++++++++++++++++++++++++++++++++++
 hw/intc/xive.c             |   63 +-
 hw/ppc/pnv.c               |   58 +-
 hw/intc/Makefile.objs      |    2 +-
 9 files changed, 2164 insertions(+), 11 deletions(-)
 create mode 100644 hw/intc/pnv_xive_regs.h
 create mode 100644 include/hw/ppc/pnv_xive.h
 create mode 100644 hw/intc/pnv_xive.c

diff --git a/hw/intc/pnv_xive_regs.h b/hw/intc/pnv_xive_regs.h
new file mode 100644
index 000000000000..509d5a18cdde
--- /dev/null
+++ b/hw/intc/pnv_xive_regs.h
@@ -0,0 +1,314 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_PNV_XIVE_REGS_H
+#define PPC_PNV_XIVE_REGS_H
+
+/* IC register offsets 0x0 - 0x400 */
+#define CQ_SWI_CMD_HIST         0x020
+#define CQ_SWI_CMD_POLL         0x028
+#define CQ_SWI_CMD_BCAST        0x030
+#define CQ_SWI_CMD_ASSIGN       0x038
+#define CQ_SWI_CMD_BLK_UPD      0x040
+#define CQ_SWI_RSP              0x048
+#define X_CQ_CFG_PB_GEN         0x0a
+#define CQ_CFG_PB_GEN           0x050
+#define   CQ_INT_ADDR_OPT       PPC_BITMASK(14, 15)
+#define X_CQ_IC_BAR             0x10
+#define X_CQ_MSGSND             0x0b
+#define CQ_MSGSND               0x058
+#define CQ_CNPM_SEL             0x078
+#define CQ_IC_BAR               0x080
+#define   CQ_IC_BAR_VALID       PPC_BIT(0)
+#define   CQ_IC_BAR_64K         PPC_BIT(1)
+#define X_CQ_TM1_BAR            0x12
+#define CQ_TM1_BAR              0x90
+#define X_CQ_TM2_BAR            0x014
+#define CQ_TM2_BAR              0x0a0
+#define   CQ_TM_BAR_VALID       PPC_BIT(0)
+#define   CQ_TM_BAR_64K         PPC_BIT(1)
+#define X_CQ_PC_BAR             0x16
+#define CQ_PC_BAR               0x0b0
+#define  CQ_PC_BAR_VALID        PPC_BIT(0)
+#define X_CQ_PC_BARM            0x17
+#define CQ_PC_BARM              0x0b8
+#define  CQ_PC_BARM_MASK        PPC_BITMASK(26, 38)
+#define X_CQ_VC_BAR             0x18
+#define CQ_VC_BAR               0x0c0
+#define  CQ_VC_BAR_VALID        PPC_BIT(0)
+#define X_CQ_VC_BARM            0x19
+#define CQ_VC_BARM              0x0c8
+#define  CQ_VC_BARM_MASK        PPC_BITMASK(21, 37)
+#define X_CQ_TAR                0x1e
+#define CQ_TAR                  0x0f0
+#define  CQ_TAR_TBL_AUTOINC     PPC_BIT(0)
+#define  CQ_TAR_TSEL            PPC_BITMASK(12, 15)
+#define  CQ_TAR_TSEL_BLK        PPC_BIT(12)
+#define  CQ_TAR_TSEL_MIG        PPC_BIT(13)
+#define  CQ_TAR_TSEL_VDT        PPC_BIT(14)
+#define  CQ_TAR_TSEL_EDT        PPC_BIT(15)
+#define  CQ_TAR_TSEL_INDEX      PPC_BITMASK(26, 31)
+#define X_CQ_TDR                0x1f
+#define CQ_TDR                  0x0f8
+#define  CQ_TDR_VDT_VALID       PPC_BIT(0)
+#define  CQ_TDR_VDT_BLK         PPC_BITMASK(11, 15)
+#define  CQ_TDR_VDT_INDEX       PPC_BITMASK(28, 31)
+#define  CQ_TDR_EDT_TYPE        PPC_BITMASK(0, 1)
+#define  CQ_TDR_EDT_INVALID     0
+#define  CQ_TDR_EDT_IPI         1
+#define  CQ_TDR_EDT_EQ          2
+#define  CQ_TDR_EDT_BLK         PPC_BITMASK(12, 15)
+#define  CQ_TDR_EDT_INDEX       PPC_BITMASK(26, 31)
+#define X_CQ_PBI_CTL            0x20
+#define CQ_PBI_CTL              0x100
+#define  CQ_PBI_PC_64K          PPC_BIT(5)
+#define  CQ_PBI_VC_64K          PPC_BIT(6)
+#define  CQ_PBI_LNX_TRIG        PPC_BIT(7)
+#define  CQ_PBI_FORCE_TM_LOCAL  PPC_BIT(22)
+#define CQ_PBO_CTL              0x108
+#define CQ_AIB_CTL              0x110
+#define X_CQ_RST_CTL            0x23
+#define CQ_RST_CTL              0x118
+#define X_CQ_FIRMASK            0x33
+#define CQ_FIRMASK              0x198
+#define X_CQ_FIRMASK_AND        0x34
+#define CQ_FIRMASK_AND          0x1a0
+#define X_CQ_FIRMASK_OR         0x35
+#define CQ_FIRMASK_OR           0x1a8
+
+/* PC LBS1 register offsets 0x400 - 0x800 */
+#define X_PC_TCTXT_CFG          0x100
+#define PC_TCTXT_CFG            0x400
+#define  PC_TCTXT_CFG_BLKGRP_EN         PPC_BIT(0)
+#define  PC_TCTXT_CFG_TARGET_EN         PPC_BIT(1)
+#define  PC_TCTXT_CFG_LGS_EN            PPC_BIT(2)
+#define  PC_TCTXT_CFG_STORE_ACK         PPC_BIT(3)
+#define  PC_TCTXT_CFG_HARD_CHIPID_BLK   PPC_BIT(8)
+#define  PC_TCTXT_CHIPID_OVERRIDE       PPC_BIT(9)
+#define  PC_TCTXT_CHIPID                PPC_BITMASK(12, 15)
+#define  PC_TCTXT_INIT_AGE              PPC_BITMASK(30, 31)
+#define X_PC_TCTXT_TRACK        0x101
+#define PC_TCTXT_TRACK          0x408
+#define  PC_TCTXT_TRACK_EN              PPC_BIT(0)
+#define X_PC_TCTXT_INDIR0       0x104
+#define PC_TCTXT_INDIR0         0x420
+#define  PC_TCTXT_INDIR_VALID           PPC_BIT(0)
+#define  PC_TCTXT_INDIR_THRDID          PPC_BITMASK(9, 15)
+#define X_PC_TCTXT_INDIR1       0x105
+#define PC_TCTXT_INDIR1         0x428
+#define X_PC_TCTXT_INDIR2       0x106
+#define PC_TCTXT_INDIR2         0x430
+#define X_PC_TCTXT_INDIR3       0x107
+#define PC_TCTXT_INDIR3         0x438
+#define X_PC_THREAD_EN_REG0     0x108
+#define PC_THREAD_EN_REG0       0x440
+#define X_PC_THREAD_EN_REG0_SET 0x109
+#define PC_THREAD_EN_REG0_SET   0x448
+#define X_PC_THREAD_EN_REG0_CLR 0x10a
+#define PC_THREAD_EN_REG0_CLR   0x450
+#define X_PC_THREAD_EN_REG1     0x10c
+#define PC_THREAD_EN_REG1       0x460
+#define X_PC_THREAD_EN_REG1_SET 0x10d
+#define PC_THREAD_EN_REG1_SET   0x468
+#define X_PC_THREAD_EN_REG1_CLR 0x10e
+#define PC_THREAD_EN_REG1_CLR   0x470
+#define X_PC_GLOBAL_CONFIG      0x110
+#define PC_GLOBAL_CONFIG        0x480
+#define  PC_GCONF_INDIRECT      PPC_BIT(32)
+#define  PC_GCONF_CHIPID_OVR    PPC_BIT(40)
+#define  PC_GCONF_CHIPID        PPC_BITMASK(44, 47)
+#define X_PC_VSD_TABLE_ADDR     0x111
+#define PC_VSD_TABLE_ADDR       0x488
+#define X_PC_VSD_TABLE_DATA     0x112
+#define PC_VSD_TABLE_DATA       0x490
+#define X_PC_AT_KILL            0x116
+#define PC_AT_KILL              0x4b0
+#define  PC_AT_KILL_VALID       PPC_BIT(0)
+#define  PC_AT_KILL_BLOCK_ID    PPC_BITMASK(27, 31)
+#define  PC_AT_KILL_OFFSET      PPC_BITMASK(48, 60)
+#define X_PC_AT_KILL_MASK       0x117
+#define PC_AT_KILL_MASK         0x4b8
+
+/* PC LBS2 register offsets */
+#define X_PC_VPC_CACHE_ENABLE   0x161
+#define PC_VPC_CACHE_ENABLE     0x708
+#define  PC_VPC_CACHE_EN_MASK   PPC_BITMASK(0, 31)
+#define X_PC_VPC_SCRUB_TRIG     0x162
+#define PC_VPC_SCRUB_TRIG       0x710
+#define X_PC_VPC_SCRUB_MASK     0x163
+#define PC_VPC_SCRUB_MASK       0x718
+#define  PC_SCRUB_VALID         PPC_BIT(0)
+#define  PC_SCRUB_WANT_DISABLE  PPC_BIT(1)
+#define  PC_SCRUB_WANT_INVAL    PPC_BIT(2)
+#define  PC_SCRUB_BLOCK_ID      PPC_BITMASK(27, 31)
+#define  PC_SCRUB_OFFSET        PPC_BITMASK(45, 63)
+#define X_PC_VPC_CWATCH_SPEC    0x167
+#define PC_VPC_CWATCH_SPEC      0x738
+#define  PC_VPC_CWATCH_CONFLICT PPC_BIT(0)
+#define  PC_VPC_CWATCH_FULL     PPC_BIT(8)
+#define  PC_VPC_CWATCH_BLOCKID  PPC_BITMASK(27, 31)
+#define  PC_VPC_CWATCH_OFFSET   PPC_BITMASK(45, 63)
+#define X_PC_VPC_CWATCH_DAT0    0x168
+#define PC_VPC_CWATCH_DAT0      0x740
+#define X_PC_VPC_CWATCH_DAT1    0x169
+#define PC_VPC_CWATCH_DAT1      0x748
+#define X_PC_VPC_CWATCH_DAT2    0x16a
+#define PC_VPC_CWATCH_DAT2      0x750
+#define X_PC_VPC_CWATCH_DAT3    0x16b
+#define PC_VPC_CWATCH_DAT3      0x758
+#define X_PC_VPC_CWATCH_DAT4    0x16c
+#define PC_VPC_CWATCH_DAT4      0x760
+#define X_PC_VPC_CWATCH_DAT5    0x16d
+#define PC_VPC_CWATCH_DAT5      0x768
+#define X_PC_VPC_CWATCH_DAT6    0x16e
+#define PC_VPC_CWATCH_DAT6      0x770
+#define X_PC_VPC_CWATCH_DAT7    0x16f
+#define PC_VPC_CWATCH_DAT7      0x778
+
+/* VC0 register offsets 0x800 - 0xFFF */
+#define X_VC_GLOBAL_CONFIG      0x200
+#define VC_GLOBAL_CONFIG        0x800
+#define  VC_GCONF_INDIRECT      PPC_BIT(32)
+#define X_VC_VSD_TABLE_ADDR     0x201
+#define VC_VSD_TABLE_ADDR       0x808
+#define X_VC_VSD_TABLE_DATA     0x202
+#define VC_VSD_TABLE_DATA       0x810
+#define VC_IVE_ISB_BLOCK_MODE   0x818
+#define VC_EQD_BLOCK_MODE       0x820
+#define VC_VPS_BLOCK_MODE       0x828
+#define X_VC_IRQ_CONFIG_IPI     0x208
+#define VC_IRQ_CONFIG_IPI       0x840
+#define  VC_IRQ_CONFIG_MEMB_EN  PPC_BIT(45)
+#define  VC_IRQ_CONFIG_MEMB_SZ  PPC_BITMASK(46, 51)
+#define VC_IRQ_CONFIG_HW        0x848
+#define VC_IRQ_CONFIG_CASCADE1  0x850
+#define VC_IRQ_CONFIG_CASCADE2  0x858
+#define VC_IRQ_CONFIG_REDIST    0x860
+#define VC_IRQ_CONFIG_IPI_CASC  0x868
+#define X_VC_AIB_TX_ORDER_TAG2  0x22d
+#define  VC_AIB_TX_ORDER_TAG2_REL_TF    PPC_BIT(20)
+#define VC_AIB_TX_ORDER_TAG2    0x890
+#define X_VC_AT_MACRO_KILL      0x23e
+#define VC_AT_MACRO_KILL        0x8b0
+#define X_VC_AT_MACRO_KILL_MASK 0x23f
+#define VC_AT_MACRO_KILL_MASK   0x8b8
+#define  VC_KILL_VALID          PPC_BIT(0)
+#define  VC_KILL_TYPE           PPC_BITMASK(14, 15)
+#define   VC_KILL_IRQ   0
+#define   VC_KILL_IVC   1
+#define   VC_KILL_SBC   2
+#define   VC_KILL_EQD   3
+#define  VC_KILL_BLOCK_ID       PPC_BITMASK(27, 31)
+#define  VC_KILL_OFFSET         PPC_BITMASK(48, 60)
+#define X_VC_EQC_CACHE_ENABLE   0x211
+#define VC_EQC_CACHE_ENABLE     0x908
+#define  VC_EQC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
+#define X_VC_EQC_SCRUB_TRIG     0x212
+#define VC_EQC_SCRUB_TRIG       0x910
+#define X_VC_EQC_SCRUB_MASK     0x213
+#define VC_EQC_SCRUB_MASK       0x918
+#define X_VC_EQC_CWATCH_SPEC    0x215
+#define VC_EQC_CONFIG           0x920
+#define X_VC_EQC_CONFIG         0x214
+#define  VC_EQC_CONF_SYNC_IPI           PPC_BIT(32)
+#define  VC_EQC_CONF_SYNC_HW            PPC_BIT(33)
+#define  VC_EQC_CONF_SYNC_ESC1          PPC_BIT(34)
+#define  VC_EQC_CONF_SYNC_ESC2          PPC_BIT(35)
+#define  VC_EQC_CONF_SYNC_REDI          PPC_BIT(36)
+#define  VC_EQC_CONF_EQP_INTERLEAVE     PPC_BIT(38)
+#define  VC_EQC_CONF_ENABLE_END_s_BIT   PPC_BIT(39)
+#define  VC_EQC_CONF_ENABLE_END_u_BIT   PPC_BIT(40)
+#define  VC_EQC_CONF_ENABLE_END_c_BIT   PPC_BIT(41)
+#define  VC_EQC_CONF_ENABLE_MORE_QSZ    PPC_BIT(42)
+#define  VC_EQC_CONF_SKIP_ESCALATE      PPC_BIT(43)
+#define VC_EQC_CWATCH_SPEC      0x928
+#define  VC_EQC_CWATCH_CONFLICT PPC_BIT(0)
+#define  VC_EQC_CWATCH_FULL     PPC_BIT(8)
+#define  VC_EQC_CWATCH_BLOCKID  PPC_BITMASK(28, 31)
+#define  VC_EQC_CWATCH_OFFSET   PPC_BITMASK(40, 63)
+#define X_VC_EQC_CWATCH_DAT0    0x216
+#define VC_EQC_CWATCH_DAT0      0x930
+#define X_VC_EQC_CWATCH_DAT1    0x217
+#define VC_EQC_CWATCH_DAT1      0x938
+#define X_VC_EQC_CWATCH_DAT2    0x218
+#define VC_EQC_CWATCH_DAT2      0x940
+#define X_VC_EQC_CWATCH_DAT3    0x219
+#define VC_EQC_CWATCH_DAT3      0x948
+#define X_VC_IVC_SCRUB_TRIG     0x222
+#define VC_IVC_SCRUB_TRIG       0x990
+#define X_VC_IVC_SCRUB_MASK     0x223
+#define VC_IVC_SCRUB_MASK       0x998
+#define X_VC_SBC_SCRUB_TRIG     0x232
+#define VC_SBC_SCRUB_TRIG       0xa10
+#define X_VC_SBC_SCRUB_MASK     0x233
+#define VC_SBC_SCRUB_MASK       0xa18
+#define  VC_SCRUB_VALID         PPC_BIT(0)
+#define  VC_SCRUB_WANT_DISABLE  PPC_BIT(1)
+#define  VC_SCRUB_WANT_INVAL    PPC_BIT(2) /* EQC and SBC only */
+#define  VC_SCRUB_BLOCK_ID      PPC_BITMASK(28, 31)
+#define  VC_SCRUB_OFFSET        PPC_BITMASK(40, 63)
+#define X_VC_IVC_CACHE_ENABLE   0x221
+#define VC_IVC_CACHE_ENABLE     0x988
+#define  VC_IVC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
+#define X_VC_SBC_CACHE_ENABLE   0x231
+#define VC_SBC_CACHE_ENABLE     0xa08
+#define  VC_SBC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
+#define VC_IVC_CACHE_SCRUB_TRIG 0x990
+#define VC_IVC_CACHE_SCRUB_MASK 0x998
+#define VC_SBC_CACHE_ENABLE     0xa08
+#define VC_SBC_CACHE_SCRUB_TRIG 0xa10
+#define VC_SBC_CACHE_SCRUB_MASK 0xa18
+#define VC_SBC_CONFIG           0xa20
+#define X_VC_SBC_CONFIG         0x234
+#define  VC_SBC_CONF_CPLX_CIST  PPC_BIT(44)
+#define  VC_SBC_CONF_CIST_BOTH  PPC_BIT(45)
+#define  VC_SBC_CONF_NO_UPD_PRF PPC_BIT(59)
+
+/* VC1 register offsets */
+
+/* VSD Table address register definitions (shared) */
+#define VST_ADDR_AUTOINC        PPC_BIT(0)
+#define VST_TABLE_SELECT        PPC_BITMASK(13, 15)
+#define  VST_TSEL_IVT   0
+#define  VST_TSEL_SBE   1
+#define  VST_TSEL_EQDT  2
+#define  VST_TSEL_VPDT  3
+#define  VST_TSEL_IRQ   4       /* VC only */
+#define VST_TABLE_BLOCK        PPC_BITMASK(27, 31)
+
+/* Number of queue overflow pages */
+#define VC_QUEUE_OVF_COUNT      6
+
+/* Bits in a VSD entry.
+ *
+ * Note: the address is naturally aligned, so we don't use a PPC_BITMASK,
+ *       but just a mask to apply to the address before OR'ing it in.
+ *
+ * Note: VSD_FIRMWARE is a SW bit ! It hijacks an unused bit in the
+ *       VSD and is only meant to be used in indirect mode !
+ */
+#define VSD_MODE                PPC_BITMASK(0, 1)
+#define  VSD_MODE_SHARED        1
+#define  VSD_MODE_EXCLUSIVE     2
+#define  VSD_MODE_FORWARD       3
+#define VSD_ADDRESS_MASK        0x0ffffffffffff000ull
+#define VSD_MIGRATION_REG       PPC_BITMASK(52, 55)
+#define VSD_INDIRECT            PPC_BIT(56)
+#define VSD_TSIZE               PPC_BITMASK(59, 63)
+#define VSD_FIRMWARE            PPC_BIT(2) /* Read warning above */
+
+#define VC_EQC_SYNC_MASK         \
+        (VC_EQC_CONF_SYNC_IPI  | \
+         VC_EQC_CONF_SYNC_HW   | \
+         VC_EQC_CONF_SYNC_ESC1 | \
+         VC_EQC_CONF_SYNC_ESC2 | \
+         VC_EQC_CONF_SYNC_REDI)
+
+
+#endif /* PPC_PNV_XIVE_REGS_H */
diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index 86d5f54e5459..402dd8f6452c 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -25,6 +25,7 @@
 #include "hw/ppc/pnv_lpc.h"
 #include "hw/ppc/pnv_psi.h"
 #include "hw/ppc/pnv_occ.h"
+#include "hw/ppc/pnv_xive.h"
 
 #define TYPE_PNV_CHIP "pnv-chip"
 #define PNV_CHIP(obj) OBJECT_CHECK(PnvChip, (obj), TYPE_PNV_CHIP)
@@ -82,6 +83,7 @@ typedef struct Pnv9Chip {
     PnvChip      parent_obj;
 
     /*< public >*/
+    PnvXive      xive;
 } Pnv9Chip;
 
 typedef struct PnvChipClass {
@@ -205,7 +207,6 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
 #define PNV_ICP_BASE(chip)                                              \
     (0x0003ffff80000000ull + (uint64_t) PNV_CHIP_INDEX(chip) * PNV_ICP_SIZE)
 
-
 #define PNV_PSIHB_SIZE       0x0000000000100000ull
 #define PNV_PSIHB_BASE(chip) \
     (0x0003fffe80000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_PSIHB_SIZE)
@@ -215,4 +216,23 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
     (0x0003ffe000000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * \
      PNV_PSIHB_FSP_SIZE)
 
+/*
+ * POWER9 MMIO base addresses
+ */
+#define PNV9_CHIP_BASE(chip, base)   \
+    ((base) + ((uint64_t) (chip)->chip_id << 42))
+
+#define PNV9_XIVE_VC_SIZE            0x0000008000000000ull
+#define PNV9_XIVE_VC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006010000000000ull)
+
+#define PNV9_XIVE_PC_SIZE            0x0000001000000000ull
+#define PNV9_XIVE_PC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006018000000000ull)
+
+#define PNV9_XIVE_IC_SIZE            0x0000000000080000ull
+#define PNV9_XIVE_IC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203100000ull)
+
+#define PNV9_XIVE_TM_SIZE            0x0000000000040000ull
+#define PNV9_XIVE_TM_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203180000ull)
+
+
 #endif /* _PPC_PNV_H */
diff --git a/include/hw/ppc/pnv_xive.h b/include/hw/ppc/pnv_xive.h
new file mode 100644
index 000000000000..5b64d4cafe8f
--- /dev/null
+++ b/include/hw/ppc/pnv_xive.h
@@ -0,0 +1,100 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#ifndef PPC_PNV_XIVE_H
+#define PPC_PNV_XIVE_H
+
+#include "hw/sysbus.h"
+#include "hw/ppc/xive.h"
+
+#define TYPE_PNV_XIVE "pnv-xive"
+#define PNV_XIVE(obj) OBJECT_CHECK(PnvXive, (obj), TYPE_PNV_XIVE)
+
+#define XIVE_BLOCK_MAX      16
+
+#define XIVE_XLATE_BLK_MAX  16  /* Block Scope Table (0-15) */
+#define XIVE_XLATE_MIG_MAX  16  /* Migration Register Table (1-15) */
+#define XIVE_XLATE_VDT_MAX  16  /* VDT Domain Table (0-15) */
+#define XIVE_XLATE_EDT_MAX  64  /* EDT Domain Table (0-63) */
+
+typedef struct PnvXive {
+    XiveRouter    parent_obj;
+
+    /* Can be overridden by XIVE configuration */
+    uint32_t      thread_chip_id;
+    uint32_t      chip_id;
+
+    /* Interrupt controller regs */
+    uint64_t      regs[0x300];
+    MemoryRegion  xscom_regs;
+
+    /* For IPIs and accelerator interrupts */
+    uint32_t      nr_irqs;
+    XiveSource    source;
+
+    uint32_t      nr_ends;
+    XiveENDSource end_source;
+
+    /* Cache update registers */
+    uint64_t      eqc_watch[4];
+    uint64_t      vpc_watch[8];
+
+    /* Virtual Structure Table Descriptors : EAT, SBE, ENDT, NVTT, IRQ */
+    uint64_t      vsds[5][XIVE_BLOCK_MAX];
+
+    /* Set Translation tables */
+    bool          set_xlate_autoinc;
+    uint64_t      set_xlate_index;
+    uint64_t      set_xlate;
+
+    uint64_t      set_xlate_blk[XIVE_XLATE_BLK_MAX];
+    uint64_t      set_xlate_mig[XIVE_XLATE_MIG_MAX];
+    uint64_t      set_xlate_vdt[XIVE_XLATE_VDT_MAX];
+    uint64_t      set_xlate_edt[XIVE_XLATE_EDT_MAX];
+
+    /* Interrupt controller MMIO */
+    hwaddr        ic_base;
+    uint32_t      ic_shift;
+    MemoryRegion  ic_mmio;
+    MemoryRegion  ic_reg_mmio;
+    MemoryRegion  ic_notify_mmio;
+
+    /* VC memory regions */
+    hwaddr        vc_base;
+    uint64_t      vc_size;
+    uint32_t      vc_shift;
+    MemoryRegion  vc_mmio;
+
+    /* IPI and END address space to model the EDT segmentation */
+    uint32_t      edt_shift;
+    MemoryRegion  ipi_mmio;
+    AddressSpace  ipi_as;
+    MemoryRegion  end_mmio;
+    AddressSpace  end_as;
+
+    /* PC memory regions */
+    hwaddr        pc_base;
+    uint64_t      pc_size;
+    uint32_t      pc_shift;
+    MemoryRegion  pc_mmio;
+    uint32_t      vdt_shift;
+
+    /* TIMA memory regions */
+    hwaddr        tm_base;
+    uint32_t      tm_shift;
+    MemoryRegion  tm_mmio;
+    MemoryRegion  tm_mmio_indirect;
+
+    /* CPU for indirect TIMA access */
+    PowerPCCPU    *cpu_ind;
+} PnvXive;
+
+void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon);
+
+#endif /* PPC_PNV_XIVE_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 255b26a5aaf6..6623ec54a7a8 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -73,6 +73,9 @@ typedef struct PnvXScomInterfaceClass {
 #define PNV_XSCOM_OCC_BASE        0x0066000
 #define PNV_XSCOM_OCC_SIZE        0x6000
 
+#define PNV9_XSCOM_XIVE_BASE      0x5013000
+#define PNV9_XSCOM_XIVE_SIZE      0x300
+
 extern void pnv_xscom_realize(PnvChip *chip, Error **errp);
 extern int pnv_dt_xscom(PnvChip *chip, void *fdt, int offset);
 
diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
index c8201462d698..6089511cff83 100644
--- a/include/hw/ppc/xive.h
+++ b/include/hw/ppc/xive.h
@@ -237,6 +237,7 @@ int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
                         XiveNVT *nvt);
 int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
                         XiveNVT *nvt);
+void xive_router_notify(XiveFabric *xf, uint32_t lisn);
 
 /*
  * XIVE END ESBs
diff --git a/hw/intc/pnv_xive.c b/hw/intc/pnv_xive.c
new file mode 100644
index 000000000000..9f0c41cdb750
--- /dev/null
+++ b/hw/intc/pnv_xive.c
@@ -0,0 +1,1612 @@
+/*
+ * QEMU PowerPC XIVE interrupt controller model
+ *
+ * Copyright (c) 2017-2018, IBM Corporation.
+ *
+ * This code is licensed under the GPL version 2 or later. See the
+ * COPYING file in the top-level directory.
+ */
+
+#include "qemu/osdep.h"
+#include "qemu/log.h"
+#include "qapi/error.h"
+#include "target/ppc/cpu.h"
+#include "sysemu/cpus.h"
+#include "sysemu/dma.h"
+#include "monitor/monitor.h"
+#include "hw/ppc/fdt.h"
+#include "hw/ppc/pnv.h"
+#include "hw/ppc/pnv_xscom.h"
+#include "hw/ppc/pnv_xive.h"
+#include "hw/ppc/xive_regs.h"
+#include "hw/ppc/ppc.h"
+
+#include <libfdt.h>
+
+#include "pnv_xive_regs.h"
+
+/*
+ * Interrupt source number encoding
+ */
+#define SRCNO_BLOCK(srcno)        (((srcno) >> 28) & 0xf)
+#define SRCNO_INDEX(srcno)        ((srcno) & 0x0fffffff)
+#define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))
+
+/*
+ * Virtual structures table accessors
+ */
+typedef struct XiveVstInfo {
+    const char *name;
+    uint32_t    size;
+    uint32_t    max_blocks;
+} XiveVstInfo;
+
+static const XiveVstInfo vst_infos[] = {
+    [VST_TSEL_IVT]  = { "EAT",  sizeof(XiveEAS), 16 },
+    [VST_TSEL_SBE]  = { "SBE",  0,               16 },
+    [VST_TSEL_EQDT] = { "ENDT", sizeof(XiveEND), 16 },
+    [VST_TSEL_VPDT] = { "VPDT", sizeof(XiveNVT),  32 },
+
+    /* Interrupt fifo backing store table :
+     *
+     * 0 - IPI,
+     * 1 - HWD,
+     * 2 - First escalate,
+     * 3 - Second escalate,
+     * 4 - Redistribution,
+     * 5 - IPI cascaded queue ?
+     */
+    [VST_TSEL_IRQ]  = { "IRQ",  0,               6  },
+};
+
+#define xive_error(xive, fmt, ...)                                      \
+    qemu_log_mask(LOG_GUEST_ERROR, "XIVE[%x] - " fmt "\n", (xive)->chip_id, \
+                  ## __VA_ARGS__)
+
+/*
+ * Our lookup routine for a remote XIVE IC. A simple scan of the chips.
+ */
+static PnvXive *pnv_xive_get_ic(PnvXive *xive, uint8_t blk)
+{
+    PnvMachineState *pnv = PNV_MACHINE(qdev_get_machine());
+    int i;
+
+    for (i = 0; i < pnv->num_chips; i++) {
+        Pnv9Chip *chip9 = PNV9_CHIP(pnv->chips[i]);
+        PnvXive *ic_xive = &chip9->xive;
+        bool chip_override =
+            ic_xive->regs[PC_GLOBAL_CONFIG >> 3] & PC_GCONF_CHIPID_OVR;
+
+        if (chip_override) {
+            if (ic_xive->chip_id == blk) {
+                return ic_xive;
+            }
+        } else {
+            ; /* TODO: Block scope support */
+        }
+    }
+    xive_error(xive, "VST: unknown chip/block %d !?", blk);
+    return NULL;
+}
+
+/*
+ * Virtual Structures Table accessors for SBE, EAT, ENDT, NVT
+ */
+static uint64_t pnv_xive_vst_addr_direct(PnvXive *xive,
+                                         const XiveVstInfo *info, uint64_t vsd,
+                                         uint8_t blk, uint32_t idx)
+{
+    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
+    uint64_t vst_tsize = 1ull << (GETFIELD(VSD_TSIZE, vsd) + 12);
+    uint32_t idx_max = (vst_tsize / info->size) - 1;
+
+    if (idx > idx_max) {
+#ifdef XIVE_DEBUG
+        xive_error(xive, "VST: %s entry %x/%x out of range !?", info->name,
+                   blk, idx);
+#endif
+        return 0;
+    }
+
+    return vst_addr + idx * info->size;
+}
+
+#define XIVE_VSD_SIZE 8
+
+static uint64_t pnv_xive_vst_addr_indirect(PnvXive *xive,
+                                           const XiveVstInfo *info,
+                                           uint64_t vsd, uint8_t blk,
+                                           uint32_t idx)
+{
+    uint64_t vsd_addr;
+    uint64_t vst_addr;
+    uint32_t page_shift;
+    uint32_t page_mask;
+    uint64_t vst_tsize = 1ull << (GETFIELD(VSD_TSIZE, vsd) + 12);
+    uint32_t idx_max = (vst_tsize / XIVE_VSD_SIZE) - 1;
+
+    if (idx > idx_max) {
+#ifdef XIVE_DEBUG
+        xive_error(xive, "VST: %s entry %x/%x out of range !?", info->name,
+                   blk, idx);
+#endif
+        return 0;
+    }
+
+    vsd_addr = vsd & VSD_ADDRESS_MASK;
+
+    /*
+     * Read the first descriptor to get the page size of each indirect
+     * table.
+     */
+    vsd = ldq_be_dma(&address_space_memory, vsd_addr);
+    page_shift = GETFIELD(VSD_TSIZE, vsd) + 12;
+    page_mask = (1ull << page_shift) - 1;
+
+    /* Indirect page size can be 4K, 64K, 2M. */
+    if (page_shift != 12 && page_shift != 16 && page_shift != 21) {
+        xive_error(xive, "VST: invalid %s table shift %d", info->name,
+                   page_shift);
+    }
+
+    if (!(vsd & VSD_ADDRESS_MASK)) {
+        xive_error(xive, "VST: invalid %s entry %x/%x !?", info->name,
+                   blk, 0);
+        return 0;
+    }
+
+    /* Load the descriptor we are looking for, if not already done */
+    if (idx) {
+        vsd_addr = vsd_addr + (idx >> page_shift);
+        vsd = ldq_be_dma(&address_space_memory, vsd_addr);
+
+        if (page_shift != GETFIELD(VSD_TSIZE, vsd) + 12) {
+            xive_error(xive, "VST: %s entry %x/%x indirect page size differ !?",
+                       info->name, blk, idx);
+            return 0;
+        }
+    }
+
+    vst_addr = vsd & VSD_ADDRESS_MASK;
+
+    return vst_addr + (idx & page_mask) * info->size;
+}
+
+static uint64_t pnv_xive_vst_addr(PnvXive *xive, uint8_t type, uint8_t blk,
+                                  uint32_t idx)
+{
+    uint64_t vsd;
+
+    if (blk >= vst_infos[type].max_blocks) {
+        xive_error(xive, "VST: invalid block id %d for VST %s %d !?",
+                   blk, vst_infos[type].name, idx);
+        return 0;
+    }
+
+    vsd = xive->vsds[type][blk];
+
+    /* Remote VST accesses */
+    if (GETFIELD(VSD_MODE, vsd) == VSD_MODE_FORWARD) {
+        xive = pnv_xive_get_ic(xive, blk);
+
+        return xive ? pnv_xive_vst_addr(xive, type, blk, idx) : 0;
+    }
+
+    if (VSD_INDIRECT & vsd) {
+        return pnv_xive_vst_addr_indirect(xive, &vst_infos[type], vsd,
+                                          blk, idx);
+    }
+
+    return pnv_xive_vst_addr_direct(xive, &vst_infos[type], vsd, blk, idx);
+}
+
+static int pnv_xive_get_end(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
+                           XiveEND *end)
+{
+    PnvXive *xive = PNV_XIVE(xrtr);
+    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
+
+    if (!end_addr) {
+        return -1;
+    }
+
+    cpu_physical_memory_read(end_addr, end, sizeof(XiveEND));
+    end->w0 = be32_to_cpu(end->w0);
+    end->w1 = be32_to_cpu(end->w1);
+    end->w2 = be32_to_cpu(end->w2);
+    end->w3 = be32_to_cpu(end->w3);
+    end->w4 = be32_to_cpu(end->w4);
+    end->w5 = be32_to_cpu(end->w5);
+    end->w6 = be32_to_cpu(end->w6);
+    end->w7 = be32_to_cpu(end->w7);
+
+    return 0;
+}
+
+static int pnv_xive_set_end(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
+                           XiveEND *in_end)
+{
+    PnvXive *xive = PNV_XIVE(xrtr);
+    XiveEND end;
+    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
+
+    if (!end_addr) {
+        return -1;
+    }
+
+    end.w0 = cpu_to_be32(in_end->w0);
+    end.w1 = cpu_to_be32(in_end->w1);
+    end.w2 = cpu_to_be32(in_end->w2);
+    end.w3 = cpu_to_be32(in_end->w3);
+    end.w4 = cpu_to_be32(in_end->w4);
+    end.w5 = cpu_to_be32(in_end->w5);
+    end.w6 = cpu_to_be32(in_end->w6);
+    end.w7 = cpu_to_be32(in_end->w7);
+    cpu_physical_memory_write(end_addr, &end, sizeof(XiveEND));
+    return 0;
+}
+
+static int pnv_xive_end_update(PnvXive *xive, uint8_t blk, uint32_t idx)
+{
+    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
+
+    if (!end_addr) {
+        return -1;
+    }
+
+    cpu_physical_memory_write(end_addr, xive->eqc_watch, sizeof(XiveEND));
+    return 0;
+}
+
+static int pnv_xive_get_nvt(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
+                           XiveNVT *nvt)
+{
+    PnvXive *xive = PNV_XIVE(xrtr);
+    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
+
+    if (!nvt_addr) {
+        return -1;
+    }
+
+    cpu_physical_memory_read(nvt_addr, nvt, sizeof(XiveNVT));
+    nvt->w0 = be32_to_cpu(nvt->w0);
+    nvt->w1 = be32_to_cpu(nvt->w1);
+    nvt->w2 = be32_to_cpu(nvt->w2);
+    nvt->w3 = be32_to_cpu(nvt->w3);
+    nvt->w4 = be32_to_cpu(nvt->w4);
+    nvt->w5 = be32_to_cpu(nvt->w5);
+    nvt->w6 = be32_to_cpu(nvt->w6);
+    nvt->w7 = be32_to_cpu(nvt->w7);
+
+    return 0;
+}
+
+static int pnv_xive_set_nvt(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
+                           XiveNVT *in_nvt)
+{
+    PnvXive *xive = PNV_XIVE(xrtr);
+    XiveNVT nvt;
+    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
+
+    if (!nvt_addr) {
+        return -1;
+    }
+
+    nvt.w0 = cpu_to_be32(in_nvt->w0);
+    nvt.w1 = cpu_to_be32(in_nvt->w1);
+    nvt.w2 = cpu_to_be32(in_nvt->w2);
+    nvt.w3 = cpu_to_be32(in_nvt->w3);
+    nvt.w4 = cpu_to_be32(in_nvt->w4);
+    nvt.w5 = cpu_to_be32(in_nvt->w5);
+    nvt.w6 = cpu_to_be32(in_nvt->w6);
+    nvt.w7 = cpu_to_be32(in_nvt->w7);
+    cpu_physical_memory_write(nvt_addr, &nvt, sizeof(XiveNVT));
+    return 0;
+}
+
+static int pnv_xive_nvt_update(PnvXive *xive, uint8_t blk, uint32_t idx)
+{
+    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
+
+    if (!nvt_addr) {
+        return -1;
+    }
+
+    cpu_physical_memory_write(nvt_addr, xive->vpc_watch, sizeof(XiveNVT));
+    return 0;
+}
+
+static int pnv_xive_get_eas(XiveRouter *xrtr, uint32_t srcno, XiveEAS *eas)
+{
+    PnvXive *xive = PNV_XIVE(xrtr);
+    uint8_t  blk = SRCNO_BLOCK(srcno);
+    uint32_t idx = SRCNO_INDEX(srcno);
+    uint64_t eas_addr;
+
+    /* TODO: check when remote EAS lookups are possible */
+    if (pnv_xive_get_ic(xive, blk) != xive) {
+        xive_error(xive, "VST: EAS %x is remote !?", srcno);
+        return -1;
+    }
+
+    eas_addr = pnv_xive_vst_addr(xive, VST_TSEL_IVT, blk, idx);
+    if (!eas_addr) {
+        return -1;
+    }
+
+    eas->w &= ~EAS_VALID;
+    *((uint64_t *) eas) = ldq_be_dma(&address_space_memory, eas_addr);
+    return 0;
+}
+
+static int pnv_xive_set_eas(XiveRouter *xrtr, uint32_t srcno, XiveEAS *ive)
+{
+    /* All done. */
+    return 0;
+}
+
+static int pnv_xive_eas_update(PnvXive *xive, uint32_t idx)
+{
+    /* All done. */
+    return 0;
+}
+
+/*
+ * XIVE Set Translation Table configuration
+ *
+ * The Virtualization Controller MMIO region containing the IPI ESB
+ * pages and END ESB pages is sub-divided into "sets" which map
+ * portions of the VC region to the different ESB pages. It is
+ * configured at runtime through the EDT set translation table to let
+ * the firmware decide how to split the address space between IPI ESB
+ * pages and END ESB pages.
+ */
+static int pnv_xive_set_xlate_update(PnvXive *xive, uint64_t val)
+{
+    uint8_t index = xive->set_xlate_autoinc ?
+        xive->set_xlate_index++ : xive->set_xlate_index;
+    uint8_t max_index;
+    uint64_t *xlate_table;
+
+    switch (xive->set_xlate) {
+    case CQ_TAR_TSEL_BLK:
+        max_index = ARRAY_SIZE(xive->set_xlate_blk);
+        xlate_table = xive->set_xlate_blk;
+        break;
+    case CQ_TAR_TSEL_MIG:
+        max_index = ARRAY_SIZE(xive->set_xlate_mig);
+        xlate_table = xive->set_xlate_mig;
+        break;
+    case CQ_TAR_TSEL_EDT:
+        max_index = ARRAY_SIZE(xive->set_xlate_edt);
+        xlate_table = xive->set_xlate_edt;
+        break;
+    case CQ_TAR_TSEL_VDT:
+        max_index = ARRAY_SIZE(xive->set_xlate_vdt);
+        xlate_table = xive->set_xlate_vdt;
+        break;
+    default:
+        xive_error(xive, "xlate: invalid table %d", (int) xive->set_xlate);
+        return -1;
+    }
+
+    if (index >= max_index) {
+        return -1;
+    }
+
+    xlate_table[index] = val;
+    return 0;
+}
+
+static int pnv_xive_set_xlate_select(PnvXive *xive, uint64_t val)
+{
+    xive->set_xlate_autoinc = val & CQ_TAR_TBL_AUTOINC;
+    xive->set_xlate = val & CQ_TAR_TSEL;
+    xive->set_xlate_index = GETFIELD(CQ_TAR_TSEL_INDEX, val);
+
+    return 0;
+}
+
+/*
+ * Computes the overall size of the IPI or the END ESB pages
+ */
+static uint64_t pnv_xive_set_xlate_edt_size(PnvXive *xive, uint64_t type)
+{
+    uint64_t edt_size = 1ull << xive->edt_shift;
+    uint64_t size = 0;
+    int i;
+
+    for (i = 0; i < XIVE_XLATE_EDT_MAX; i++) {
+        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
+
+        if (edt_type == type) {
+            size += edt_size;
+        }
+    }
+
+    return size;
+}
+
+/*
+ * Maps an offset of the VC region in the IPI or END region using the
+ * layout defined by the EDT table
+ */
+static uint64_t pnv_xive_set_xlate_edt_offset(PnvXive *xive, uint64_t vc_offset,
+                                              uint64_t type)
+{
+    int i;
+    uint64_t edt_size = (1ull << xive->edt_shift);
+    uint64_t edt_offset = vc_offset;
+
+    for (i = 0; i < XIVE_XLATE_EDT_MAX && (i * edt_size) < vc_offset; i++) {
+        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
+
+        if (edt_type != type) {
+            edt_offset -= edt_size;
+        }
+    }
+
+    return edt_offset;
+}
+
+/*
+ * IPI and END sources realize routines
+ *
+ * We use the EDT table to size the internal XiveSource object backing
+ * the IPIs and the XiveENDSource object backing the ENDs
+ */
+static void pnv_xive_source_realize(PnvXive *xive, Error **errp)
+{
+    XiveSource *xsrc = &xive->source;
+    Error *local_err = NULL;
+    uint64_t ipi_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_IPI);
+
+    /* Two pages per IRQ */
+    xive->nr_irqs = ipi_mmio_size / (1ull << (xive->vc_shift + 1));
+
+    /*
+     * Configure store EOI if required by firmware (skiboot has
+     * removed support recently though)
+     */
+    if (xive->regs[VC_SBC_CONFIG >> 3] &
+        (VC_SBC_CONF_CPLX_CIST | VC_SBC_CONF_CIST_BOTH)) {
+        object_property_set_int(OBJECT(xsrc), XIVE_SRC_STORE_EOI, "flags",
+                                &error_fatal);
+    }
+
+    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
+
+    /* Install the IPI ESB MMIO region in its VC region */
+    memory_region_add_subregion(&xive->ipi_mmio, 0, &xsrc->esb_mmio);
+
+    /* Start in a clean state */
+    device_reset(DEVICE(&xive->source));
+}
+
+static void pnv_xive_end_source_realize(PnvXive *xive, Error **errp)
+{
+    XiveENDSource *end_xsrc = &xive->end_source;
+    Error *local_err = NULL;
+    uint64_t end_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_EQ);
+
+    /* Two pages per END: ESn and ESe */
+    xive->nr_ends = end_mmio_size / (1ull << (xive->vc_shift + 1));
+
+    object_property_set_int(OBJECT(end_xsrc), xive->nr_ends, "nr-ends",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
+
+    /* Install the END ESB MMIO region in its VC region */
+    memory_region_add_subregion(&xive->end_mmio, 0, &end_xsrc->esb_mmio);
+}
+
+/*
+ * Virtual Structure Tables (VST) configuration
+ */
+static void pnv_xive_table_set_exclusive(PnvXive *xive, uint8_t type,
+                                         uint8_t blk, uint64_t vsd)
+{
+    bool gconf_indirect =
+        xive->regs[VC_GLOBAL_CONFIG >> 3] & VC_GCONF_INDIRECT;
+    uint32_t vst_shift = GETFIELD(VSD_TSIZE, vsd) + 12;
+    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
+
+    if (VSD_INDIRECT & vsd) {
+        if (!gconf_indirect) {
+            xive_error(xive, "VST: %s indirect tables not enabled",
+                       vst_infos[type].name);
+            return;
+        }
+    }
+
+    switch (type) {
+    case VST_TSEL_IVT:
+        /*
+         * This is our trigger to create the XiveSource object backing
+         * the IPIs.
+         */
+        pnv_xive_source_realize(xive, &error_fatal);
+        break;
+
+    case VST_TSEL_EQDT:
+        /* Same trigger but for the XiveENDSource object backing the ENDs. */
+        pnv_xive_end_source_realize(xive, &error_fatal);
+        break;
+
+    case VST_TSEL_VPDT:
+        /* FIXME (skiboot) : remove DD1 workaround on the NVT table size */
+        vst_shift = 16;
+        break;
+
+    case VST_TSEL_SBE: /* Not modeled */
+        /*
+         * Contains the backing store pages for the source PQ bits.
+         * The XiveSource object has its own. We would need a custom
+         * source object to use this backing.
+         */
+        break;
+
+    case VST_TSEL_IRQ: /* VC only. Not modeled */
+        /*
+         * These tables contain the backing store pages for the
+         * interrupt fifos of the VC sub-engine in case of overflow.
+         */
+        break;
+    default:
+        g_assert_not_reached();
+    }
+
+    if (!QEMU_IS_ALIGNED(vst_addr, 1ull << vst_shift)) {
+        xive_error(xive, "VST: %s table address 0x%"PRIx64" is not aligned with"
+                   " page shift %d", vst_infos[type].name, vst_addr, vst_shift);
+    }
+
+    /* Keep the VSD for later use */
+    xive->vsds[type][blk] = vsd;
+}
+
+/*
+ * Both PC and VC sub-engines are configured as each uses the Virtual
+ * Structure Tables : SBE, EAS, END and NVT.
+ */
+static void pnv_xive_table_set_data(PnvXive *xive, uint64_t vsd, bool pc_engine)
+{
+    uint8_t mode = GETFIELD(VSD_MODE, vsd);
+    uint8_t type = GETFIELD(VST_TABLE_SELECT,
+                            xive->regs[VC_VSD_TABLE_ADDR >> 3]);
+    uint8_t blk = GETFIELD(VST_TABLE_BLOCK,
+                           xive->regs[VC_VSD_TABLE_ADDR >> 3]);
+    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
+
+    if (type > VST_TSEL_IRQ) {
+        xive_error(xive, "VST: invalid table type %d", type);
+        return;
+    }
+
+    if (blk >= vst_infos[type].max_blocks) {
+        xive_error(xive, "VST: invalid block id %d for"
+                   " %s table", blk, vst_infos[type].name);
+        return;
+    }
+
+    /*
+     * Only take the VC sub-engine configuration into account because
+     * the XiveRouter model combines both VC and PC sub-engines
+     */
+    if (pc_engine) {
+        return;
+    }
+
+    if (!vst_addr) {
+        xive_error(xive, "VST: invalid %s table address", vst_infos[type].name);
+        return;
+    }
+
+    switch (mode) {
+    case VSD_MODE_FORWARD:
+        xive->vsds[type][blk] = vsd;
+        break;
+
+    case VSD_MODE_EXCLUSIVE:
+        pnv_xive_table_set_exclusive(xive, type, blk, vsd);
+        break;
+
+    default:
+        xive_error(xive, "VST: unsupported table mode %d", mode);
+        return;
+    }
+}
+
+/*
+ * When the TIMA is accessed from the indirect page, the thread id
+ * (PIR) has to be configured in the IC beforehand. This is used for
+ * resets and also for debug purposes.
+ */
+static void pnv_xive_thread_indirect_set(PnvXive *xive, uint64_t val)
+{
+    int pir = GETFIELD(PC_TCTXT_INDIR_THRDID, xive->regs[PC_TCTXT_INDIR0 >> 3]);
+
+    if (val & PC_TCTXT_INDIR_VALID) {
+        if (xive->cpu_ind) {
+            xive_error(xive, "IC: indirect access already set for "
+                       "PIR %d", pir);
+        }
+
+        pir = GETFIELD(PC_TCTXT_INDIR_THRDID, val) & 0xff;
+        xive->cpu_ind = ppc_get_vcpu_by_pir(pir);
+        if (!xive->cpu_ind) {
+            xive_error(xive, "IC: invalid PIR %d for indirect access", pir);
+        }
+    } else {
+        xive->cpu_ind = NULL;
+    }
+}
+
+/*
+ * Interrupt Controller registers MMIO
+ */
+static void pnv_xive_ic_reg_write(PnvXive *xive, uint32_t offset, uint64_t val,
+                                  bool mmio)
+{
+    MemoryRegion *sysmem = get_system_memory();
+    uint32_t reg = offset >> 3;
+
+    switch (offset) {
+
+    /*
+     * XIVE CQ (PowerBus bridge) settings
+     */
+    case CQ_MSGSND:     /* msgsnd for doorbells */
+    case CQ_FIRMASK_OR: /* FIR error reporting */
+        xive->regs[reg] = val;
+        break;
+    case CQ_PBI_CTL:
+        if (val & CQ_PBI_PC_64K) {
+            xive->pc_shift = 16;
+        }
+        if (val & CQ_PBI_VC_64K) {
+            xive->vc_shift = 16;
+        }
+        break;
+    case CQ_CFG_PB_GEN: /* PowerBus General Configuration */
+        /*
+         * TODO: CQ_INT_ADDR_OPT for 1-block-per-chip mode
+         */
+        xive->regs[reg] = val;
+        break;
+
+    /*
+     * XIVE Virtualization Controller settings
+     */
+    case VC_GLOBAL_CONFIG:
+        xive->regs[reg] = val;
+        break;
+
+    /*
+     * XIVE Presenter Controller settings
+     */
+    case PC_GLOBAL_CONFIG:
+        /* Overrides Int command Chip ID with the Chip ID field */
+        if (val & PC_GCONF_CHIPID_OVR) {
+            xive->chip_id = GETFIELD(PC_GCONF_CHIPID, val);
+        }
+        xive->regs[reg] = val;
+        break;
+    case PC_TCTXT_CFG:
+        /*
+         * TODO: PC_TCTXT_CFG_BLKGRP_EN for block group support
+         * TODO: PC_TCTXT_CFG_HARD_CHIPID_BLK
+         */
+
+        /*
+         * Moves the chipid into the block field for hardwired CAM
+         * compares. Block offset value is adjusted to 0b0..01 & ThrdId
+         */
+        if (val & PC_TCTXT_CHIPID_OVERRIDE) {
+            xive->thread_chip_id = GETFIELD(PC_TCTXT_CHIPID, val);
+        }
+        break;
+    case PC_TCTXT_TRACK: /* Enable block tracking (DD2) */
+        xive->regs[reg] = val;
+        break;
+
+    /*
+     * Misc settings
+     */
+    case VC_EQC_CONFIG: /* enable silent escalation */
+    case VC_SBC_CONFIG: /* Store EOI configuration */
+    case VC_AIB_TX_ORDER_TAG2:
+        xive->regs[reg] = val;
+        break;
+
+    /*
+     * XIVE BAR settings (XSCOM only)
+     */
+    case CQ_RST_CTL:
+        /* resets all bars */
+        break;
+
+    case CQ_IC_BAR: /* IC BAR. 8 pages */
+        xive->ic_shift = val & CQ_IC_BAR_64K ? 16 : 12;
+        if (!(val & CQ_IC_BAR_VALID)) {
+            xive->ic_base = 0;
+            if (xive->regs[reg] & CQ_IC_BAR_VALID) {
+                memory_region_del_subregion(&xive->ic_mmio,
+                                            &xive->ic_reg_mmio);
+                memory_region_del_subregion(&xive->ic_mmio,
+                                            &xive->ic_notify_mmio);
+                memory_region_del_subregion(sysmem, &xive->ic_mmio);
+                memory_region_del_subregion(sysmem, &xive->tm_mmio_indirect);
+            }
+        } else {
+            xive->ic_base = val & ~(CQ_IC_BAR_VALID | CQ_IC_BAR_64K);
+            if (!(xive->regs[reg] & CQ_IC_BAR_VALID)) {
+                memory_region_add_subregion(sysmem, xive->ic_base,
+                                            &xive->ic_mmio);
+                memory_region_add_subregion(&xive->ic_mmio, 0,
+                                            &xive->ic_reg_mmio);
+                memory_region_add_subregion(&xive->ic_mmio,
+                                            1ul << xive->ic_shift,
+                                            &xive->ic_notify_mmio);
+                memory_region_add_subregion(sysmem,
+                                   xive->ic_base + (4ull << xive->ic_shift),
+                                   &xive->tm_mmio_indirect);
+            }
+        }
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_TM1_BAR: /* TM BAR and page size. 4 pages */
+    case CQ_TM2_BAR: /* second TM BAR is for hotplug use */
+        xive->tm_shift = val & CQ_TM_BAR_64K ? 16 : 12;
+        if (!(val & CQ_TM_BAR_VALID)) {
+            xive->tm_base = 0;
+            if (xive->regs[reg] & CQ_TM_BAR_VALID) {
+                memory_region_del_subregion(sysmem, &xive->tm_mmio);
+            }
+        } else {
+            xive->tm_base = val & ~(CQ_TM_BAR_VALID | CQ_TM_BAR_64K);
+            if (!(xive->regs[reg] & CQ_TM_BAR_VALID)) {
+                memory_region_add_subregion(sysmem, xive->tm_base,
+                                            &xive->tm_mmio);
+            }
+        }
+        xive->regs[reg] = val;
+        break;
+
+    case CQ_PC_BAR:
+        if (!(val & CQ_PC_BAR_VALID)) {
+            xive->pc_base = 0;
+            if (xive->regs[reg] & CQ_PC_BAR_VALID) {
+                memory_region_del_subregion(sysmem, &xive->pc_mmio);
+            }
+        } else {
+            xive->pc_base = val & ~(CQ_PC_BAR_VALID);
+            if (!(xive->regs[reg] & CQ_PC_BAR_VALID)) {
+                memory_region_add_subregion(sysmem, xive->pc_base,
+                                            &xive->pc_mmio);
+            }
+        }
+        xive->regs[reg] = val;
+        break;
+    case CQ_PC_BARM: /* TODO: configure PC BAR size at runtime */
+        xive->pc_size = (~val + 1) & CQ_PC_BARM_MASK;
+        xive->regs[reg] = val;
+
+        /* Compute the size of the VDT sets */
+        xive->vdt_shift = ctz64(xive->pc_size / XIVE_XLATE_VDT_MAX);
+        break;
+
+    case CQ_VC_BAR: /* From 64M to 4TB */
+        if (!(val & CQ_VC_BAR_VALID)) {
+            xive->vc_base = 0;
+            if (xive->regs[reg] & CQ_VC_BAR_VALID) {
+                memory_region_del_subregion(sysmem, &xive->vc_mmio);
+            }
+        } else {
+            xive->vc_base = val & ~(CQ_VC_BAR_VALID);
+            if (!(xive->regs[reg] & CQ_VC_BAR_VALID)) {
+                memory_region_add_subregion(sysmem, xive->vc_base,
+                                            &xive->vc_mmio);
+            }
+        }
+        xive->regs[reg] = val;
+        break;
+    case CQ_VC_BARM: /* TODO: configure VC BAR size at runtime */
+        xive->vc_size = (~val + 1) & CQ_VC_BARM_MASK;
+        xive->regs[reg] = val;
+
+        /* Compute the size of the EDT sets */
+        xive->edt_shift = ctz64(xive->vc_size / XIVE_XLATE_EDT_MAX);
+        break;
+
+    /*
+     * XIVE Set Translation Table settings. Defines the layout of the
+     * VC BAR containing the ESB pages of the IPIs and of the ENDs
+     */
+    case CQ_TAR: /* Set Translation Table Address */
+        pnv_xive_set_xlate_select(xive, val);
+        break;
+    case CQ_TDR: /* Set Translation Table Data */
+        pnv_xive_set_xlate_update(xive, val);
+        break;
+
+    /*
+     * XIVE VC & PC Virtual Structure Table settings
+     */
+    case VC_VSD_TABLE_ADDR:
+    case PC_VSD_TABLE_ADDR: /* Virtual table selector */
+        xive->regs[reg] = val;
+        break;
+    case VC_VSD_TABLE_DATA: /* Virtual table setting */
+    case PC_VSD_TABLE_DATA:
+        pnv_xive_table_set_data(xive, val, offset == PC_VSD_TABLE_DATA);
+        break;
+
+    /*
+     * Interrupt fifo overflow in memory backing store. Not modeled
+     */
+    case VC_IRQ_CONFIG_IPI:
+    case VC_IRQ_CONFIG_HW:
+    case VC_IRQ_CONFIG_CASCADE1:
+    case VC_IRQ_CONFIG_CASCADE2:
+    case VC_IRQ_CONFIG_REDIST:
+    case VC_IRQ_CONFIG_IPI_CASC:
+        xive->regs[reg] = val;
+        break;
+
+    /*
+     * XIVE hardware thread enablement
+     */
+    case PC_THREAD_EN_REG0_SET: /* Physical Thread Enable */
+    case PC_THREAD_EN_REG1_SET: /* Physical Thread Enable (fused core) */
+        xive->regs[reg] |= val;
+        break;
+    case PC_THREAD_EN_REG0_CLR:
+        xive->regs[PC_THREAD_EN_REG0_SET >> 3] &= ~val;
+        break;
+    case PC_THREAD_EN_REG1_CLR:
+        xive->regs[PC_THREAD_EN_REG1_SET >> 3] &= ~val;
+        break;
+
+    /*
+     * Indirect TIMA access set up. Defines the HW thread to use.
+     */
+    case PC_TCTXT_INDIR0:
+        pnv_xive_thread_indirect_set(xive, val);
+        xive->regs[reg] = val;
+        break;
+    case PC_TCTXT_INDIR1:
+    case PC_TCTXT_INDIR2:
+    case PC_TCTXT_INDIR3:
+        /* TODO: check what PC_TCTXT_INDIR[123] are for */
+        xive->regs[reg] = val;
+        break;
+
+    /*
+     * XIVE PC & VC cache updates for EAS, NVT and END
+     */
+    case PC_VPC_SCRUB_MASK:
+    case PC_VPC_CWATCH_SPEC:
+    case VC_EQC_SCRUB_MASK:
+    case VC_EQC_CWATCH_SPEC:
+    case VC_IVC_SCRUB_MASK:
+        xive->regs[reg] = val;
+        break;
+    case VC_IVC_SCRUB_TRIG:
+        pnv_xive_eas_update(xive, GETFIELD(VC_SCRUB_OFFSET, val));
+        break;
+    case PC_VPC_CWATCH_DAT0:
+    case PC_VPC_CWATCH_DAT1:
+    case PC_VPC_CWATCH_DAT2:
+    case PC_VPC_CWATCH_DAT3:
+    case PC_VPC_CWATCH_DAT4:
+    case PC_VPC_CWATCH_DAT5:
+    case PC_VPC_CWATCH_DAT6:
+    case PC_VPC_CWATCH_DAT7:
+        xive->vpc_watch[(offset - PC_VPC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
+        break;
+    case PC_VPC_SCRUB_TRIG:
+        pnv_xive_nvt_update(xive, GETFIELD(PC_SCRUB_BLOCK_ID, val),
+                            GETFIELD(PC_SCRUB_OFFSET, val));
+        break;
+    case VC_EQC_CWATCH_DAT0:
+    case VC_EQC_CWATCH_DAT1:
+    case VC_EQC_CWATCH_DAT2:
+    case VC_EQC_CWATCH_DAT3:
+        xive->eqc_watch[(offset - VC_EQC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
+        break;
+    case VC_EQC_SCRUB_TRIG:
+        pnv_xive_end_update(xive, GETFIELD(VC_SCRUB_BLOCK_ID, val),
+                            GETFIELD(VC_SCRUB_OFFSET, val));
+        break;
+
+    /*
+     * XIVE PC & VC cache invalidation
+     */
+    case PC_AT_KILL:
+        xive->regs[reg] |= val;
+        break;
+    case VC_AT_MACRO_KILL:
+        xive->regs[reg] |= val;
+        break;
+    case PC_AT_KILL_MASK:
+    case VC_AT_MACRO_KILL_MASK:
+        xive->regs[reg] = val;
+        break;
+
+    default:
+        xive_error(xive, "IC: invalid write to reg=0x%08x mmio=%d", offset,
+                   mmio);
+    }
+}
+
+static uint64_t pnv_xive_ic_reg_read(PnvXive *xive, uint32_t offset, bool mmio)
+{
+    uint64_t val = 0;
+    uint32_t reg = offset >> 3;
+
+    switch (offset) {
+    case CQ_CFG_PB_GEN:
+    case CQ_IC_BAR:
+    case CQ_TM1_BAR:
+    case CQ_TM2_BAR:
+    case CQ_PC_BAR:
+    case CQ_PC_BARM:
+    case CQ_VC_BAR:
+    case CQ_VC_BARM:
+    case CQ_TAR:
+    case CQ_TDR:
+    case CQ_PBI_CTL:
+
+    case PC_TCTXT_CFG:
+    case PC_TCTXT_TRACK:
+    case PC_TCTXT_INDIR0:
+    case PC_TCTXT_INDIR1:
+    case PC_TCTXT_INDIR2:
+    case PC_TCTXT_INDIR3:
+    case PC_GLOBAL_CONFIG:
+
+    case PC_VPC_SCRUB_MASK:
+    case PC_VPC_CWATCH_SPEC:
+    case PC_VPC_CWATCH_DAT0:
+    case PC_VPC_CWATCH_DAT1:
+    case PC_VPC_CWATCH_DAT2:
+    case PC_VPC_CWATCH_DAT3:
+    case PC_VPC_CWATCH_DAT4:
+    case PC_VPC_CWATCH_DAT5:
+    case PC_VPC_CWATCH_DAT6:
+    case PC_VPC_CWATCH_DAT7:
+
+    case VC_GLOBAL_CONFIG:
+    case VC_AIB_TX_ORDER_TAG2:
+
+    case VC_IRQ_CONFIG_IPI:
+    case VC_IRQ_CONFIG_HW:
+    case VC_IRQ_CONFIG_CASCADE1:
+    case VC_IRQ_CONFIG_CASCADE2:
+    case VC_IRQ_CONFIG_REDIST:
+    case VC_IRQ_CONFIG_IPI_CASC:
+
+    case VC_EQC_SCRUB_MASK:
+    case VC_EQC_CWATCH_DAT0:
+    case VC_EQC_CWATCH_DAT1:
+    case VC_EQC_CWATCH_DAT2:
+    case VC_EQC_CWATCH_DAT3:
+
+    case VC_EQC_CWATCH_SPEC:
+    case VC_IVC_SCRUB_MASK:
+    case VC_SBC_CONFIG:
+    case VC_AT_MACRO_KILL_MASK:
+    case VC_VSD_TABLE_ADDR:
+    case PC_VSD_TABLE_ADDR:
+    case VC_VSD_TABLE_DATA:
+    case PC_VSD_TABLE_DATA:
+        val = xive->regs[reg];
+        break;
+
+    case CQ_MSGSND: /* Identifies which cores have msgsnd enabled.
+                     * Say all have. */
+        val = 0xffffff0000000000;
+        break;
+
+    /*
+     * XIVE PC & VC cache updates for EAS, NVT and END
+     */
+    case PC_VPC_SCRUB_TRIG:
+    case VC_IVC_SCRUB_TRIG:
+    case VC_EQC_SCRUB_TRIG:
+        xive->regs[reg] &= ~VC_SCRUB_VALID;
+        val = xive->regs[reg];
+        break;
+
+    /*
+     * XIVE PC & VC cache invalidation
+     */
+    case PC_AT_KILL:
+        xive->regs[reg] &= ~PC_AT_KILL_VALID;
+        val = xive->regs[reg];
+        break;
+    case VC_AT_MACRO_KILL:
+        xive->regs[reg] &= ~VC_KILL_VALID;
+        val = xive->regs[reg];
+        break;
+
+    /*
+     * XIVE synchronisation
+     */
+    case VC_EQC_CONFIG:
+        val = VC_EQC_SYNC_MASK;
+        break;
+
+    default:
+        xive_error(xive, "IC: invalid read reg=0x%08x mmio=%d", offset, mmio);
+    }
+
+    return val;
+}
+
+static void pnv_xive_ic_reg_write_mmio(void *opaque, hwaddr addr,
+                                       uint64_t val, unsigned size)
+{
+    pnv_xive_ic_reg_write(opaque, addr, val, true);
+}
+
+static uint64_t pnv_xive_ic_reg_read_mmio(void *opaque, hwaddr addr,
+                                          unsigned size)
+{
+    return pnv_xive_ic_reg_read(opaque, addr, true);
+}
+
+static const MemoryRegionOps pnv_xive_ic_reg_ops = {
+    .read = pnv_xive_ic_reg_read_mmio,
+    .write = pnv_xive_ic_reg_write_mmio,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Interrupt Controller MMIO: Notify port page (write only)
+ */
+#define PNV_XIVE_FORWARD_IPI        0x800 /* Forward IPI */
+#define PNV_XIVE_FORWARD_HW         0x880 /* Forward HW */
+#define PNV_XIVE_FORWARD_OS_ESC     0x900 /* Forward OS escalation */
+#define PNV_XIVE_FORWARD_HW_ESC     0x980 /* Forward Hyp escalation */
+#define PNV_XIVE_FORWARD_REDIS      0xa00 /* Forward Redistribution */
+#define PNV_XIVE_RESERVED5          0xa80 /* Cache line 5 PowerBUS operation */
+#define PNV_XIVE_RESERVED6          0xb00 /* Cache line 6 PowerBUS operation */
+#define PNV_XIVE_RESERVED7          0xb80 /* Cache line 7 PowerBUS operation */
+
+/* VC synchronisation */
+#define PNV_XIVE_SYNC_IPI           0xc00 /* Sync IPI */
+#define PNV_XIVE_SYNC_HW            0xc80 /* Sync HW */
+#define PNV_XIVE_SYNC_OS_ESC        0xd00 /* Sync OS escalation */
+#define PNV_XIVE_SYNC_HW_ESC        0xd80 /* Sync Hyp escalation */
+#define PNV_XIVE_SYNC_REDIS         0xe00 /* Sync Redistribution */
+
+/* PC synchronisation */
+#define PNV_XIVE_SYNC_PULL          0xe80 /* Sync pull context */
+#define PNV_XIVE_SYNC_PUSH          0xf00 /* Sync push context */
+#define PNV_XIVE_SYNC_VPC           0xf80 /* Sync remove VPC store */
+
+static void pnv_xive_ic_hw_trigger(PnvXive *xive, hwaddr addr, uint64_t val)
+{
+    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xive);
+
+    xfc->notify(XIVE_FABRIC(xive), val);
+}
+
+static void pnv_xive_ic_notify_write(void *opaque, hwaddr addr, uint64_t val,
+                                     unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    /* VC: HW triggers */
+    switch (addr) {
+    case 0x000 ... 0x7FF:
+        pnv_xive_ic_hw_trigger(xive, addr, val);
+        break;
+
+    /* VC: Forwarded IRQs */
+    case PNV_XIVE_FORWARD_IPI:
+    case PNV_XIVE_FORWARD_HW:
+    case PNV_XIVE_FORWARD_OS_ESC:
+    case PNV_XIVE_FORWARD_HW_ESC:
+    case PNV_XIVE_FORWARD_REDIS:
+        /* TODO: forwarded IRQs. Should be like HW triggers */
+        xive_error(xive, "IC: forwarded at @0x%"HWADDR_PRIx" IRQ 0x%"PRIx64,
+                   addr, val);
+        break;
+
+    /* VC syncs */
+    case PNV_XIVE_SYNC_IPI:
+    case PNV_XIVE_SYNC_HW:
+    case PNV_XIVE_SYNC_OS_ESC:
+    case PNV_XIVE_SYNC_HW_ESC:
+    case PNV_XIVE_SYNC_REDIS:
+        break;
+
+    /* PC sync */
+    case PNV_XIVE_SYNC_PULL:
+    case PNV_XIVE_SYNC_PUSH:
+    case PNV_XIVE_SYNC_VPC:
+        break;
+
+    default:
+        xive_error(xive, "IC: invalid notify write @%"HWADDR_PRIx, addr);
+    }
+}
+
+static uint64_t pnv_xive_ic_notify_read(void *opaque, hwaddr addr,
+                                        unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    /* loads are invalid */
+    xive_error(xive, "IC: invalid notify read @%"HWADDR_PRIx, addr);
+    return -1;
+}
+
+static const MemoryRegionOps pnv_xive_ic_notify_ops = {
+    .read = pnv_xive_ic_notify_read,
+    .write = pnv_xive_ic_notify_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Interrupt controller MMIO region. The layout is compatible between
+ * 4K and 64K pages :
+ *
+ * Page 0           sub-engine BARs
+ *  0x000 - 0x3FF   IC registers
+ *  0x400 - 0x7FF   PC registers
+ *  0x800 - 0xFFF   VC registers
+ *
+ * Page 1           Notify page
+ *  0x000 - 0x7FF   HW interrupt triggers (PSI, PHB)
+ *  0x800 - 0xFFF   forwards and syncs
+ *
+ * Page 2           LSI Trigger page (writes only) (not modeled)
+ * Page 3           LSI SB EOI page (reads only) (not modeled)
+ *
+ * Page 4-7         indirect TIMA (aliased to TIMA region)
+ */
+static void pnv_xive_ic_write(void *opaque, hwaddr addr,
+                              uint64_t val, unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    xive_error(xive, "IC: invalid write @%"HWADDR_PRIx, addr);
+}
+
+static uint64_t pnv_xive_ic_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    xive_error(xive, "IC: invalid read @%"HWADDR_PRIx, addr);
+    return -1;
+}
+
+static const MemoryRegionOps pnv_xive_ic_ops = {
+    .read = pnv_xive_ic_read,
+    .write = pnv_xive_ic_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Interrupt controller XSCOM region. Nearly all load accesses are
+ * done through the MMIO region.
+ */
+static uint64_t pnv_xive_xscom_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    switch (addr >> 3) {
+    case X_VC_EQC_CONFIG:
+        /*
+         * This is the only XSCOM load done in skiboot. Bizarre. To be
+         * checked.
+         */
+        return VC_EQC_SYNC_MASK;
+    default:
+        return pnv_xive_ic_reg_read(xive, addr, false);
+    }
+}
+
+static void pnv_xive_xscom_write(void *opaque, hwaddr addr,
+                                 uint64_t val, unsigned size)
+{
+    pnv_xive_ic_reg_write(opaque, addr, val, false);
+}
+
+static const MemoryRegionOps pnv_xive_xscom_ops = {
+    .read = pnv_xive_xscom_read,
+    .write = pnv_xive_xscom_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Virtualization Controller MMIO region containing the IPI and END ESB pages
+ */
+static uint64_t pnv_xive_vc_read(void *opaque, hwaddr offset,
+                                 unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+    uint64_t edt_index = offset >> xive->edt_shift;
+    uint64_t edt_type = 0;
+    uint64_t ret = -1;
+    uint64_t edt_offset;
+    MemTxResult result;
+    AddressSpace *edt_as = NULL;
+
+    if (edt_index < XIVE_XLATE_EDT_MAX) {
+        edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[edt_index]);
+    }
+
+    switch (edt_type) {
+    case CQ_TDR_EDT_IPI:
+        edt_as = &xive->ipi_as;
+        break;
+    case CQ_TDR_EDT_EQ:
+        edt_as = &xive->end_as;
+        break;
+    default:
+        xive_error(xive, "VC: invalid read @%"HWADDR_PRIx, offset);
+        return -1;
+    }
+
+    /* remap the offset for the targeted address space */
+    edt_offset = pnv_xive_set_xlate_edt_offset(xive, offset, edt_type);
+
+    ret = address_space_ldq(edt_as, edt_offset, MEMTXATTRS_UNSPECIFIED,
+                            &result);
+    if (result != MEMTX_OK) {
+        xive_error(xive, "VC: %s read failed at @0x%"HWADDR_PRIx " -> @0x%"
+                   HWADDR_PRIx, edt_type == CQ_TDR_EDT_IPI ? "IPI" : "END",
+                   offset, edt_offset);
+        return -1;
+    }
+
+    return ret;
+}
+
+static void pnv_xive_vc_write(void *opaque, hwaddr offset,
+                              uint64_t val, unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+    uint64_t edt_index = offset >> xive->edt_shift;
+    uint64_t edt_type = 0;
+    uint64_t edt_offset;
+    MemTxResult result;
+    AddressSpace *edt_as = NULL;
+
+    if (edt_index < XIVE_XLATE_EDT_MAX) {
+        edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[edt_index]);
+    }
+
+    switch (edt_type) {
+    case CQ_TDR_EDT_IPI:
+        edt_as = &xive->ipi_as;
+        break;
+    case CQ_TDR_EDT_EQ:
+        edt_as = &xive->end_as;
+        break;
+    default:
+        xive_error(xive, "VC: invalid write @%"HWADDR_PRIx, offset);
+        return;
+    }
+
+    /* remap the offset for the targeted address space */
+    edt_offset = pnv_xive_set_xlate_edt_offset(xive, offset, edt_type);
+
+    address_space_stq(edt_as, edt_offset, val, MEMTXATTRS_UNSPECIFIED, &result);
+    if (result != MEMTX_OK) {
+        xive_error(xive, "VC: write failed at @0x%"HWADDR_PRIx, edt_offset);
+    }
+}
+
+static const MemoryRegionOps pnv_xive_vc_ops = {
+    .read = pnv_xive_vc_read,
+    .write = pnv_xive_vc_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+/*
+ * Presenter Controller MMIO region. This is used by the Virtualization
+ * Controller to update the IPB in the NVT table when required. Not
+ * implemented.
+ */
+static uint64_t pnv_xive_pc_read(void *opaque, hwaddr addr,
+                                 unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    xive_error(xive, "PC: invalid read @%"HWADDR_PRIx, addr);
+    return -1;
+}
+
+static void pnv_xive_pc_write(void *opaque, hwaddr addr,
+                              uint64_t value, unsigned size)
+{
+    PnvXive *xive = PNV_XIVE(opaque);
+
+    xive_error(xive, "PC: invalid write @%"HWADDR_PRIx, addr);
+}
+
+static const MemoryRegionOps pnv_xive_pc_ops = {
+    .read = pnv_xive_pc_read,
+    .write = pnv_xive_pc_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 1,
+        .max_access_size = 8,
+    },
+};
+
+void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon)
+{
+    XiveRouter *xrtr = XIVE_ROUTER(xive);
+    XiveEAS eas;
+    XiveEND end;
+    uint32_t endno = 0;
+    uint32_t srcno0 = XIVE_SRCNO(xive->chip_id, 0);
+    uint32_t srcno = srcno0;
+
+    monitor_printf(mon, "XIVE[%x] Source %08x .. %08x\n", xive->chip_id,
+                   srcno0, srcno0 + xive->source.nr_irqs - 1);
+    xive_source_pic_print_info(&xive->source, srcno0, mon);
+
+    monitor_printf(mon, "XIVE[%x] EAT %08x .. %08x\n", xive->chip_id,
+                   srcno0, srcno0 + xive->nr_irqs - 1);
+    while (!xive_router_get_eas(xrtr, srcno, &eas)) {
+        if (!(eas.w & EAS_MASKED)) {
+            xive_eas_pic_print_info(&eas, srcno, mon);
+        }
+        srcno++;
+    }
+
+    monitor_printf(mon, "XIVE[%x] ENDT %08x .. %08x\n", xive->chip_id,
+                   0, xive->nr_ends - 1);
+    while (!xive_router_get_end(xrtr, xrtr->chip_id, endno, &end)) {
+        xive_end_pic_print_info(&end, endno++, mon);
+    }
+}
+
+static void pnv_xive_reset(DeviceState *dev)
+{
+    PnvXive *xive = PNV_XIVE(dev);
+    PnvChip *chip = PNV_CHIP(object_property_get_link(OBJECT(dev), "chip",
+                                                      &error_fatal));
+
+    /*
+     * Use the chip id to identify the XIVE interrupt controller. It
+     * can be overridden by configuration at runtime.
+     */
+    xive->chip_id = xive->thread_chip_id = chip->chip_id;
+
+    /* Default page size. Should be changed at runtime to 64k */
+    xive->ic_shift = xive->vc_shift = xive->pc_shift = 12;
+
+    /*
+     * PowerNV XIVE sources are realized at runtime when the set
+     * translation tables are configured.
+     */
+    if (DEVICE(&xive->source)->realized) {
+        object_property_set_bool(OBJECT(&xive->source), false, "realized",
+                                 &error_fatal);
+    }
+
+    if (DEVICE(&xive->end_source)->realized) {
+        object_property_set_bool(OBJECT(&xive->end_source), false, "realized",
+                                 &error_fatal);
+    }
+}
+
+/*
+ * The VC sub-engine incorporates a source controller for the IPIs.
+ * When triggered, we need to construct a source number with the
+ * chip/block identifier
+ */
+static void pnv_xive_notify(XiveFabric *xf, uint32_t srcno)
+{
+    PnvXive *xive = PNV_XIVE(xf);
+
+    xive_router_notify(xf, XIVE_SRCNO(xive->chip_id, srcno));
+}
+
+static void pnv_xive_init(Object *obj)
+{
+    PnvXive *xive = PNV_XIVE(obj);
+
+    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
+
+    object_initialize(&xive->end_source, sizeof(xive->end_source),
+                      TYPE_XIVE_END_SOURCE);
+    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
+                              NULL);
+}
+
+static void pnv_xive_realize(DeviceState *dev, Error **errp)
+{
+    PnvXive *xive = PNV_XIVE(dev);
+
+    /* Default page size. Generally changed at runtime to 64k */
+    xive->ic_shift = xive->vc_shift = xive->pc_shift = 12;
+
+    /* XSCOM region, used for initial configuration of the BARs */
+    memory_region_init_io(&xive->xscom_regs, OBJECT(dev), &pnv_xive_xscom_ops,
+                          xive, "xscom-xive", PNV9_XSCOM_XIVE_SIZE << 3);
+
+    /* Interrupt controller MMIO region */
+    memory_region_init_io(&xive->ic_mmio, OBJECT(dev), &pnv_xive_ic_ops, xive,
+                          "xive.ic", PNV9_XIVE_IC_SIZE);
+    memory_region_init_io(&xive->ic_reg_mmio, OBJECT(dev), &pnv_xive_ic_reg_ops,
+                          xive, "xive.ic.reg", 1 << xive->ic_shift);
+    memory_region_init_io(&xive->ic_notify_mmio, OBJECT(dev),
+                          &pnv_xive_ic_notify_ops,
+                          xive, "xive.ic.notify", 1 << xive->ic_shift);
+
+    /* The Pervasive LSI trigger and EOI pages are not modeled */
+
+    /*
+     * Overall Virtualization Controller MMIO region containing the
+     * IPI ESB pages and END ESB pages. The layout is defined by the
+     * EDT set translation table and the accesses are dispatched using
+     * address spaces for each.
+     */
+    memory_region_init_io(&xive->vc_mmio, OBJECT(xive), &pnv_xive_vc_ops, xive,
+                          "xive.vc", PNV9_XIVE_VC_SIZE);
+
+    memory_region_init(&xive->ipi_mmio, OBJECT(xive), "xive.vc.ipi",
+                       PNV9_XIVE_VC_SIZE);
+    address_space_init(&xive->ipi_as, &xive->ipi_mmio, "xive.vc.ipi");
+    memory_region_init(&xive->end_mmio, OBJECT(xive), "xive.vc.end",
+                       PNV9_XIVE_VC_SIZE);
+    address_space_init(&xive->end_as, &xive->end_mmio, "xive.vc.end");
+
+
+    /* Presenter Controller MMIO region (not implemented) */
+    memory_region_init_io(&xive->pc_mmio, OBJECT(xive), &pnv_xive_pc_ops, xive,
+                          "xive.pc", PNV9_XIVE_PC_SIZE);
+
+    /* Thread Interrupt Management Area, direct and indirect */
+    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops,
+                          &xive->cpu_ind, "xive.tima", PNV9_XIVE_TM_SIZE);
+    memory_region_init_alias(&xive->tm_mmio_indirect, OBJECT(xive),
+                             "xive.tima.indirect",
+                             &xive->tm_mmio, 0, PNV9_XIVE_TM_SIZE);
+}
+
+static int pnv_xive_dt_xscom(PnvXScomInterface *dev, void *fdt,
+                             int xscom_offset)
+{
+    const char compat[] = "ibm,power9-xive-x";
+    char *name;
+    int offset;
+    uint32_t lpc_pcba = PNV9_XSCOM_XIVE_BASE;
+    uint32_t reg[] = {
+        cpu_to_be32(lpc_pcba),
+        cpu_to_be32(PNV9_XSCOM_XIVE_SIZE)
+    };
+
+    name = g_strdup_printf("xive@%x", lpc_pcba);
+    offset = fdt_add_subnode(fdt, xscom_offset, name);
+    _FDT(offset);
+    g_free(name);
+
+    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
+    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
+                      sizeof(compat))));
+    return 0;
+}
+
+static Property pnv_xive_properties[] = {
+    DEFINE_PROP_UINT64("ic-bar", PnvXive, ic_base, 0),
+    DEFINE_PROP_UINT64("vc-bar", PnvXive, vc_base, 0),
+    DEFINE_PROP_UINT64("pc-bar", PnvXive, pc_base, 0),
+    DEFINE_PROP_UINT64("tm-bar", PnvXive, tm_base, 0),
+    DEFINE_PROP_END_OF_LIST(),
+};
+
+static void pnv_xive_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvXScomInterfaceClass *xdc = PNV_XSCOM_INTERFACE_CLASS(klass);
+    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
+    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
+
+    xdc->dt_xscom = pnv_xive_dt_xscom;
+
+    dc->desc = "PowerNV XIVE Interrupt Controller";
+    dc->realize = pnv_xive_realize;
+    dc->props = pnv_xive_properties;
+    dc->reset = pnv_xive_reset;
+
+    xrc->get_eas = pnv_xive_get_eas;
+    xrc->set_eas = pnv_xive_set_eas;
+    xrc->get_end = pnv_xive_get_end;
+    xrc->set_end = pnv_xive_set_end;
+    xrc->get_nvt  = pnv_xive_get_nvt;
+    xrc->set_nvt  = pnv_xive_set_nvt;
+
+    xfc->notify  = pnv_xive_notify;
+}
+
+static const TypeInfo pnv_xive_info = {
+    .name          = TYPE_PNV_XIVE,
+    .parent        = TYPE_XIVE_ROUTER,
+    .instance_init = pnv_xive_init,
+    .instance_size = sizeof(PnvXive),
+    .class_init    = pnv_xive_class_init,
+    .interfaces    = (InterfaceInfo[]) {
+        { TYPE_PNV_XSCOM_INTERFACE },
+        { }
+    }
+};
+
+static void pnv_xive_register_types(void)
+{
+    type_register_static(&pnv_xive_info);
+}
+
+type_init(pnv_xive_register_types)
diff --git a/hw/intc/xive.c b/hw/intc/xive.c
index c9aedecc8216..9925c90481ae 100644
--- a/hw/intc/xive.c
+++ b/hw/intc/xive.c
@@ -51,6 +51,8 @@ static uint8_t exception_mask(uint8_t ring)
     switch (ring) {
     case TM_QW1_OS:
         return TM_QW1_NSR_EO;
+    case TM_QW3_HV_PHYS:
+        return TM_QW3_NSR_HE;
     default:
         g_assert_not_reached();
     }
@@ -85,7 +87,17 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
     uint8_t *regs = &tctx->regs[ring];
 
     if (regs[TM_PIPR] < regs[TM_CPPR]) {
-        regs[TM_NSR] |= exception_mask(ring);
+        switch (ring) {
+        case TM_QW1_OS:
+            regs[TM_NSR] |= TM_QW1_NSR_EO;
+            break;
+        case TM_QW3_HV_PHYS:
+            regs[TM_NSR] |= SETFIELD(TM_QW3_NSR_HE, regs[TM_NSR],
+                                     TM_QW3_NSR_HE_PHYS);
+            break;
+        default:
+            g_assert_not_reached();
+        }
         qemu_irq_raise(tctx->output);
     }
 }
@@ -116,6 +128,38 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
 #define XIVE_TM_OS_PAGE   0x2
 #define XIVE_TM_USER_PAGE 0x3
 
+static void xive_tm_set_hv_cppr(XiveTCTX *tctx, hwaddr offset,
+                                uint64_t value, unsigned size)
+{
+    xive_tctx_set_cppr(tctx, TM_QW3_HV_PHYS, value & 0xff);
+}
+
+static uint64_t xive_tm_ack_hv_reg(XiveTCTX *tctx, hwaddr offset, unsigned size)
+{
+    return xive_tctx_accept(tctx, TM_QW3_HV_PHYS);
+}
+
+static uint64_t xive_tm_pull_pool_ctx(XiveTCTX *tctx, hwaddr offset,
+                                      unsigned size)
+{
+    uint64_t ret;
+
+    ret = tctx->regs[TM_QW2_HV_POOL + TM_WORD2] & TM_QW2W2_POOL_CAM;
+    tctx->regs[TM_QW2_HV_POOL + TM_WORD2] &= ~TM_QW2W2_POOL_CAM;
+    return ret;
+}
+
+static void xive_tm_vt_push(XiveTCTX *tctx, hwaddr offset,
+                            uint64_t value, unsigned size)
+{
+    tctx->regs[TM_QW3_HV_PHYS + TM_WORD2] = value & 0xff;
+}
+
+static uint64_t xive_tm_vt_poll(XiveTCTX *tctx, hwaddr offset, unsigned size)
+{
+    return tctx->regs[TM_QW3_HV_PHYS + TM_WORD2] & 0xff;
+}
+
 /*
  * Define an access map for each page of the TIMA that we will use in
  * the memory region ops to filter values when doing loads and stores
@@ -295,10 +339,16 @@ static const XiveTmOp xive_tm_operations[] = {
      * effects
      */
     { XIVE_TM_OS_PAGE, TM_QW1_OS + TM_CPPR,   1, xive_tm_set_os_cppr, NULL },
+    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_CPPR, 1, xive_tm_set_hv_cppr, NULL },
+    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_WORD2, 1, xive_tm_vt_push, NULL },
+    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_WORD2, 1, NULL, xive_tm_vt_poll },
 
     /* MMIOs above 2K : special operations with side effects */
     { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
     { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
+    { XIVE_TM_HV_PAGE, TM_SPC_ACK_HV_REG,     2, NULL, xive_tm_ack_hv_reg },
+    { XIVE_TM_HV_PAGE, TM_SPC_PULL_POOL_CTX,  4, NULL, xive_tm_pull_pool_ctx },
+    { XIVE_TM_HV_PAGE, TM_SPC_PULL_POOL_CTX,  8, NULL, xive_tm_pull_pool_ctx },
 };
 
 static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
@@ -327,7 +377,8 @@ static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
 static void xive_tm_write(void *opaque, hwaddr offset,
                           uint64_t value, unsigned size)
 {
-    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    PowerPCCPU **cpuptr = opaque;
+    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
     XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
     const XiveTmOp *xto;
 
@@ -366,7 +417,8 @@ static void xive_tm_write(void *opaque, hwaddr offset,
 
 static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
 {
-    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
+    PowerPCCPU **cpuptr = opaque;
+    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
     XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
     const XiveTmOp *xto;
 
@@ -501,6 +553,9 @@ static void xive_tctx_base_reset(void *dev)
      */
     tctx->regs[TM_QW1_OS + TM_PIPR] =
         ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
+    tctx->regs[TM_QW3_HV_PHYS + TM_PIPR] =
+        ipb_to_pipr(tctx->regs[TM_QW3_HV_PHYS + TM_IPB]);
+
 
     /*
      * QEMU sPAPR XIVE only. To let the controller model reset the OS
@@ -1513,7 +1568,7 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
     /* TODO: Auto EOI. */
 }
 
-static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
+void xive_router_notify(XiveFabric *xf, uint32_t lisn)
 {
     XiveRouter *xrtr = XIVE_ROUTER(xf);
     XiveEAS eas;
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index 66f2301b4ece..7b0bda652338 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -279,7 +279,10 @@ static void pnv_dt_chip(PnvChip *chip, void *fdt)
         pnv_dt_core(chip, pnv_core, fdt);
 
         /* Interrupt Control Presenters (ICP). One per core. */
-        pnv_dt_icp(chip, fdt, pnv_core->pir, CPU_CORE(pnv_core)->nr_threads);
+        if (!pnv_chip_is_power9(chip)) {
+            pnv_dt_icp(chip, fdt, pnv_core->pir,
+                       CPU_CORE(pnv_core)->nr_threads);
+        }
     }
 
     if (chip->ram_size) {
@@ -693,7 +696,15 @@ static uint32_t pnv_chip_core_pir_p9(PnvChip *chip, uint32_t core_id)
 static Object *pnv_chip_power9_intc_create(PnvChip *chip, Object *child,
                                            Error **errp)
 {
-    return NULL;
+    Pnv9Chip *chip9 = PNV9_CHIP(chip);
+
+    /*
+     * The core creates its interrupt presenter but the XIVE interrupt
+     * controller object is initialized afterwards. Hopefully, it's
+     * only used at runtime.
+     */
+    return xive_tctx_create(child, TYPE_XIVE_TCTX,
+                            XIVE_ROUTER(&chip9->xive), errp);
 }
 
 /* Allowed core identifiers on a POWER8 Processor Chip :
@@ -875,11 +886,19 @@ static void pnv_chip_power8nvl_class_init(ObjectClass *klass, void *data)
 
 static void pnv_chip_power9_instance_init(Object *obj)
 {
+    Pnv9Chip *chip9 = PNV9_CHIP(obj);
+
+    object_initialize(&chip9->xive, sizeof(chip9->xive), TYPE_PNV_XIVE);
+    object_property_add_child(obj, "xive", OBJECT(&chip9->xive), NULL);
+    object_property_add_const_link(OBJECT(&chip9->xive), "chip", obj,
+                                   &error_abort);
 }
 
 static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
 {
     PnvChipClass *pcc = PNV_CHIP_GET_CLASS(dev);
+    Pnv9Chip *chip9 = PNV9_CHIP(dev);
+    PnvChip *chip = PNV_CHIP(dev);
     Error *local_err = NULL;
 
     pcc->parent_realize(dev, &local_err);
@@ -887,6 +906,24 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
         error_propagate(errp, local_err);
         return;
     }
+
+    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_IC_BASE(chip),
+                            "ic-bar", &error_fatal);
+    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_VC_BASE(chip),
+                            "vc-bar", &error_fatal);
+    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_PC_BASE(chip),
+                            "pc-bar", &error_fatal);
+    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_TM_BASE(chip),
+                            "tm-bar", &error_fatal);
+    object_property_set_bool(OBJECT(&chip9->xive), true, "realized",
+                             &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(&chip9->xive), sysbus_get_default());
+    pnv_xscom_add_subregion(chip, PNV9_XSCOM_XIVE_BASE,
+                            &chip9->xive.xscom_regs);
 }
 
 static void pnv_chip_power9_class_init(ObjectClass *klass, void *data)
@@ -1087,12 +1124,23 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
     CPU_FOREACH(cs) {
         PowerPCCPU *cpu = POWERPC_CPU(cs);
 
-        icp_pic_print_info(ICP(cpu->intc), mon);
+        if (pnv_chip_is_power9(pnv->chips[0])) {
+            xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
+        } else {
+            icp_pic_print_info(ICP(cpu->intc), mon);
+        }
     }
 
     for (i = 0; i < pnv->num_chips; i++) {
-        Pnv8Chip *chip8 = PNV8_CHIP(pnv->chips[i]);
-        ics_pic_print_info(&chip8->psi.ics, mon);
+        PnvChip *chip = pnv->chips[i];
+
+        if (pnv_chip_is_power9(pnv->chips[i])) {
+            Pnv9Chip *chip9 = PNV9_CHIP(chip);
+            pnv_xive_pic_print_info(&chip9->xive, mon);
+        } else {
+            Pnv8Chip *chip8 = PNV8_CHIP(chip);
+            ics_pic_print_info(&chip8->psi.ics, mon);
+        }
     }
 }
 
diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
index dd4d69db2bdd..145bfaf44014 100644
--- a/hw/intc/Makefile.objs
+++ b/hw/intc/Makefile.objs
@@ -40,7 +40,7 @@ obj-$(CONFIG_XICS_KVM) += xics_kvm.o
 obj-$(CONFIG_XIVE) += xive.o
 obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
 obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
-obj-$(CONFIG_POWERNV) += xics_pnv.o
+obj-$(CONFIG_POWERNV) += xics_pnv.o pnv_xive.o
 obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
 obj-$(CONFIG_S390_FLIC) += s390_flic.o
 obj-$(CONFIG_S390_FLIC_KVM) += s390_flic_kvm.o
-- 
2.17.2

* Re: [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
@ 2018-11-22  3:05   ` David Gibson
  2018-11-22  7:25     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-22  3:05 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:56:54AM +0100, Cédric Le Goater wrote:
> The first sub-engine of the overall XIVE architecture is the Interrupt
> Virtualization Source Engine (IVSE). An IVSE can be integrated into
> another logic, like in a PCI PHB or in the main interrupt controller
> to manage IPIs.
> 
> Each IVSE instance is associated with an Event State Buffer (ESB) that
> contains a two bit state entry for each possible event source. When an
> event is signaled to the IVSE, by MMIO or some other means, the
> associated interrupt state bits are fetched from the ESB and
> modified. Depending on the resulting ESB state, the event is forwarded
> to the IVRE sub-engine of the controller doing the routing.
> 
> Each supported ESB entry is associated with either a single page or
> an even/odd pair of pages which provide commands to manage the
> source: to EOI or to turn off the source, for instance.
> 
> On a sPAPR machine, the O/S will obtain the page address of the ESB
> entry associated with a source and its characteristic using the
> H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is used.
> 
> The xive_source_notify() routine is in charge of forwarding the source
> event notification to the routing engine. It will be filled later on.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Ok, this is looking basically pretty good.  Few details to query
below.


> ---
>  default-configs/ppc64-softmmu.mak |   1 +
>  include/hw/ppc/xive.h             | 130 ++++++++++
>  hw/intc/xive.c                    | 379 ++++++++++++++++++++++++++++++
>  hw/intc/Makefile.objs             |   1 +
>  4 files changed, 511 insertions(+)
>  create mode 100644 include/hw/ppc/xive.h
>  create mode 100644 hw/intc/xive.c
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index aec2855750d6..2d1e7c5c4668 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -16,6 +16,7 @@ CONFIG_VIRTIO_VGA=y
>  CONFIG_XICS=$(CONFIG_PSERIES)
>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> +CONFIG_XIVE=$(CONFIG_PSERIES)
>  CONFIG_MEM_DEVICE=y
>  CONFIG_DIMM=y
>  CONFIG_SPAPR_RNG=y
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> new file mode 100644
> index 000000000000..5fec4b08705d
> --- /dev/null
> +++ b/include/hw/ppc/xive.h
> @@ -0,0 +1,130 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.

A cheat sheet at the top of this header with the old and new XIVE
terms would be quite nice to have.

> + */
> +
> +#ifndef PPC_XIVE_H
> +#define PPC_XIVE_H
> +
> +#include "hw/sysbus.h"

So, I'm a bit dubious about making the XiveSource a SysBus device -
I'm concerned it won't play well with tying it into the other devices
like PHB that "own" it in real hardware.

I think we'd be better off making it a direct descendent of
TYPE_DEVICE which constructs the MMIO region, but doesn't map it.
Then we can have a SysBusDevice (and/or other) wrapper which
instantiates the XiveSource core and maps it into somewhere
accessible.

> +
> +/*
> + * XIVE Interrupt Source
> + */
> +
> +#define TYPE_XIVE_SOURCE "xive-source"
> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
> +
> +/*
> + * XIVE Interrupt Source characteristics, which define how the ESB are
> + * controlled.
> + */
> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
> +#define XIVE_SRC_STORE_EOI     0x2 /* Store EOI supported */
> +
> +typedef struct XiveSource {
> +    SysBusDevice parent;
> +
> +    /* IRQs */
> +    uint32_t        nr_irqs;
> +    qemu_irq        *qirqs;
> +
> +    /* PQ bits */
> +    uint8_t         *status;
> +
> +    /* ESB memory region */
> +    uint64_t        esb_flags;
> +    uint32_t        esb_shift;
> +    MemoryRegion    esb_mmio;
> +} XiveSource;
> +
> +/*
> + * ESB MMIO setting. Can be one page, for both source triggering and
> + * source management, or two different pages. See below for magic
> + * values.
> + */
> +#define XIVE_ESB_4K          12 /* PSI HB only */
> +#define XIVE_ESB_4K_2PAGE    13
> +#define XIVE_ESB_64K         16
> +#define XIVE_ESB_64K_2PAGE   17
> +
> +static inline bool xive_source_esb_has_2page(XiveSource *xsrc)
> +{
> +    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE ||
> +        xsrc->esb_shift == XIVE_ESB_4K_2PAGE;
> +}
> +
> +/* The trigger page is always the first/even page */
> +static inline hwaddr xive_source_esb_page(XiveSource *xsrc, uint32_t srcno)

This function doesn't appear to be used anywhere except..

> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    return (1ull << xsrc->esb_shift) * srcno;
> +}
> +
> +/* In a two pages ESB MMIO setting, the odd page is for management */
> +static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)


..here, and this function doesn't appear to be used anywhere.

> +{
> +    hwaddr addr = xive_source_esb_page(xsrc, srcno);
> +
> +    if (xive_source_esb_has_2page(xsrc)) {
> +        addr += (1 << (xsrc->esb_shift - 1));
> +    }
> +
> +    return addr;
> +}
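
To make the layout concrete, here is a standalone sketch of the two
address helpers, using the shift values from the defines above. The
names (esb_page, esb_mgmt, ...) and the free-standing signatures are
made up for illustration and are not the patch's API. It also uses
uint32_t for srcno in both helpers, where the patch has int in one of
them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESB shift values, same numbers as the XIVE_ESB_* defines. */
#define ESB_4K          12
#define ESB_4K_2PAGE    13
#define ESB_64K         16
#define ESB_64K_2PAGE   17

static bool esb_has_2page(uint32_t esb_shift)
{
    return esb_shift == ESB_64K_2PAGE || esb_shift == ESB_4K_2PAGE;
}

/* Offset of the first (trigger) page of source 'srcno'. */
static uint64_t esb_page(uint32_t esb_shift, uint32_t nr_irqs, uint32_t srcno)
{
    assert(srcno < nr_irqs);
    return (UINT64_C(1) << esb_shift) * srcno;
}

/* Offset of the management page: the odd page in a two-page setting. */
static uint64_t esb_mgmt(uint32_t esb_shift, uint32_t nr_irqs, uint32_t srcno)
{
    uint64_t addr = esb_page(esb_shift, nr_irqs, srcno);

    if (esb_has_2page(esb_shift)) {
        addr += UINT64_C(1) << (esb_shift - 1);
    }
    return addr;
}
```

With two 64k pages per source, source 3 starts at 0x60000 and its
management page is at 0x70000. With a single page, both helpers return
the same offset.
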
> +
> +/*
> + * Each interrupt source has a 2-bit state machine which can be
> + * controlled by MMIO. P indicates that an interrupt is pending (has
> + * been sent to a queue and is waiting for an EOI). Q indicates that
> + * the interrupt has been triggered while pending.
> + *
> + * This acts as a coalescing mechanism in order to guarantee that a
> + * given interrupt only occurs at most once in a queue.
> + *
> + * When doing an EOI, the Q bit will indicate if the interrupt
> + * needs to be re-triggered.
> + */
> +#define XIVE_ESB_VAL_P        0x2
> +#define XIVE_ESB_VAL_Q        0x1
> +
> +#define XIVE_ESB_RESET        0x0
> +#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
> +#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
> +#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
> +
> +/*
> + * "magic" Event State Buffer (ESB) MMIO offsets.
> + *
> + * The following offsets into the ESB MMIO allow to read or manipulate
> + * the PQ bits. They must be used with an 8-byte load instruction.
> + * They all return the previous state of the interrupt (atomically).
> + *
> + * Additionally, some ESB pages support doing an EOI via a store and
> + * some ESBs support doing a trigger via a separate trigger page.
> + */
> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
> +#define XIVE_ESB_GET            0x800 /* Load */
> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
> +
> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno);
> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
> +
> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset,
> +                                Monitor *mon);
> +
> +static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    return xsrc->qirqs[srcno];
> +}
> +
> +#endif /* PPC_XIVE_H */
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> new file mode 100644
> index 000000000000..f7621f84828c
> --- /dev/null
> +++ b/hw/intc/xive.c
> @@ -0,0 +1,379 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/dma.h"
> +#include "monitor/monitor.h"
> +#include "hw/ppc/xive.h"
> +
> +/*
> + * XIVE ESB helpers
> + */
> +
> +static uint8_t xive_esb_set(uint8_t *pq, uint8_t value)
> +{
> +    uint8_t old_pq = *pq & 0x3;
> +
> +    *pq &= ~0x3;
> +    *pq |= value & 0x3;
> +
> +    return old_pq;
> +}
> +
> +static bool xive_esb_trigger(uint8_t *pq)
> +{
> +    uint8_t old_pq = *pq & 0x3;
> +
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +        xive_esb_set(pq, XIVE_ESB_PENDING);
> +        return true;
> +    case XIVE_ESB_PENDING:
> +    case XIVE_ESB_QUEUED:
> +        xive_esb_set(pq, XIVE_ESB_QUEUED);
> +        return false;
> +    case XIVE_ESB_OFF:
> +        xive_esb_set(pq, XIVE_ESB_OFF);
> +        return false;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +
> +static bool xive_esb_eoi(uint8_t *pq)
> +{
> +    uint8_t old_pq = *pq & 0x3;
> +
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +    case XIVE_ESB_PENDING:
> +        xive_esb_set(pq, XIVE_ESB_RESET);
> +        return false;
> +    case XIVE_ESB_QUEUED:
> +        xive_esb_set(pq, XIVE_ESB_PENDING);
> +        return true;
> +    case XIVE_ESB_OFF:
> +        xive_esb_set(pq, XIVE_ESB_OFF);
> +        return false;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
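
For reference, the trigger/EOI pair above can be exercised outside of
QEMU. This is a simplified standalone sketch, not the patch's code:
the names are hypothetical, and the default case stands in for the
OFF state instead of asserting.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESB_VAL_P    0x2
#define ESB_VAL_Q    0x1

#define ESB_RESET    0x0
#define ESB_PENDING  ESB_VAL_P
#define ESB_QUEUED   (ESB_VAL_P | ESB_VAL_Q)
#define ESB_OFF      ESB_VAL_Q

/* Replace the two PQ bits, leaving the upper bits of the byte alone. */
static void esb_set(uint8_t *pq, uint8_t value)
{
    *pq = (uint8_t)((*pq & ~0x3) | (value & 0x3));
}

/* Returns true when the event should be forwarded to the router. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        esb_set(pq, ESB_PENDING);
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        /* Coalesce: remember that another event arrived. */
        esb_set(pq, ESB_QUEUED);
        return false;
    default: /* ESB_OFF: source is masked, state unchanged */
        return false;
    }
}

/* Returns true when a queued event must be re-sent after the EOI. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
    case ESB_PENDING:
        esb_set(pq, ESB_RESET);
        return false;
    case ESB_QUEUED:
        esb_set(pq, ESB_PENDING);
        return true;
    default: /* ESB_OFF */
        return false;
    }
}
```

Starting from RESET, two triggers followed by two EOIs go through
PENDING, QUEUED, PENDING and back to RESET, with a notification on the
first trigger and on the first EOI only.
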
> +
> +/*
> + * XIVE Interrupt Source (or IVSE)
> + */
> +
> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +
> +    return xsrc->status[srcno] & 0x3;
> +}
> +
> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +
> +    return xive_esb_set(&xsrc->status[srcno], pq);
> +}
> +
> +/*
> + * Returns whether the event notification should be forwarded.
> + */
> +static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +
> +    return xive_esb_trigger(&xsrc->status[srcno]);
> +}
> +
> +/*
> + * Returns whether the event notification should be forwarded.
> + */
> +static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +
> +    return xive_esb_eoi(&xsrc->status[srcno]);
> +}
> +
> +/*
> + * Forward the source event notification to the Router
> + */
> +static void xive_source_notify(XiveSource *xsrc, int srcno)
> +{
> +
> +}
> +
> +/*
> + * In a two pages ESB MMIO setting, even page is the trigger page, odd
> + * page is for management
> + */
> +static inline bool addr_is_even(hwaddr addr, uint32_t shift)
> +{
> +    return !((addr >> shift) & 1);
> +}
> +
> +static inline bool xive_source_is_trigger_page(XiveSource *xsrc, hwaddr addr)
> +{
> +    return xive_source_esb_has_2page(xsrc) &&
> +        addr_is_even(addr, xsrc->esb_shift - 1);
> +}
> +
> +/*
> + * ESB MMIO loads
> + *                      Trigger page    Management/EOI page
> + * 2 pages setting      even            odd
> + *
> + * 0x000 .. 0x3FF       -1              EOI and return 0|1
> + * 0x400 .. 0x7FF       -1              EOI and return 0|1
> + * 0x800 .. 0xBFF       -1              return PQ
> + * 0xC00 .. 0xCFF       -1              return PQ and atomically PQ=00
> + * 0xD00 .. 0xDFF       -1              return PQ and atomically PQ=01
> + * 0xE00 .. 0xEFF       -1              return PQ and atomically PQ=10
> + * 0xF00 .. 0xFFF       -1              return PQ and atomically PQ=11
> + */

I can't quite make sense of this table.  What do the -1s represent,
and how does it relate to the non-2page case?
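
Going by the code just below, the -1 looks like the value returned by
a load from the trigger page, and in the single-page settings the
trigger page check never matches, so every load decodes as in the
right-hand column. A standalone sketch of that reading, with
hypothetical names rather than the patch's functions:

```c
#include <stdbool.h>
#include <stdint.h>

#define ESB_4K_2PAGE    13
#define ESB_64K_2PAGE   17

enum esb_load_op {
    ESB_LOAD_EOI,       /* 0x000 .. 0x7FF */
    ESB_LOAD_GET,       /* 0x800 .. 0xBFF */
    ESB_LOAD_SET_PQ,    /* 0xC00 .. 0xFFF */
};

/* Only two-page settings have a separate (even) trigger page. */
static bool esb_is_trigger_page(uint32_t esb_shift, uint64_t addr)
{
    bool two_pages = esb_shift == ESB_64K_2PAGE ||
                     esb_shift == ESB_4K_2PAGE;

    return two_pages && !((addr >> (esb_shift - 1)) & 1);
}

/*
 * Classify a load on a management page. For ESB_LOAD_SET_PQ, the new
 * PQ value is taken from bits 8-9 of the offset and stored in *new_pq.
 */
static enum esb_load_op esb_decode_load(uint64_t addr, uint8_t *new_pq)
{
    uint32_t offset = addr & 0xFFF;

    if (offset < 0x800) {
        return ESB_LOAD_EOI;
    }
    if (offset < 0xC00) {
        return ESB_LOAD_GET;
    }
    *new_pq = (offset >> 8) & 0x3;
    return ESB_LOAD_SET_PQ;
}
```

Only the low 12 bits of the address are decoded, so with 64k pages the
same command layout repeats every 4k within the page.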

> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> +    uint32_t offset = addr & 0xFFF;
> +    uint32_t srcno = addr >> xsrc->esb_shift;
> +    uint64_t ret = -1;
> +
> +    /* In a two pages ESB MMIO setting, trigger page should not be read */
> +    if (xive_source_is_trigger_page(xsrc, addr)) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +                      "XIVE: invalid load on IRQ %d trigger page at "
> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
> +        return -1;
> +    }
> +
> +    switch (offset) {
> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
> +        ret = xive_source_esb_eoi(xsrc, srcno);
> +
> +        /* Forward the source event notification for routing */
> +        if (ret) {
> +            xive_source_notify(xsrc, srcno);
> +        }
> +        break;
> +
> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
> +        ret = xive_source_esb_get(xsrc, srcno);
> +        break;
> +
> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
> +        ret = xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB load addr %x\n",
> +                      offset);
> +    }
> +
> +    return ret;
> +}
> +
> +/*
> + * ESB MMIO stores
> + *                      Trigger page    Management/EOI page
> + * 2 pages setting      even            odd

As with the previous table, I don't quite understand what the headings
above mean.

> + * 0x000 .. 0x3FF       Trigger         Trigger
> + * 0x400 .. 0x7FF       Trigger         EOI
> + * 0x800 .. 0xBFF       Trigger         undefined
> + * 0xC00 .. 0xCFF       Trigger         PQ=00
> + * 0xD00 .. 0xDFF       Trigger         PQ=01
> + * 0xE00 .. 0xEFF       Trigger         PQ=10
> + * 0xF00 .. 0xFFF       Trigger         PQ=11
> + */
> +static void xive_source_esb_write(void *opaque, hwaddr addr,
> +                                  uint64_t value, unsigned size)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> +    uint32_t offset = addr & 0xFFF;
> +    uint32_t srcno = addr >> xsrc->esb_shift;
> +    bool notify = false;
> +
> +    /* In a two pages ESB MMIO setting, trigger page only triggers */
> +    if (xive_source_is_trigger_page(xsrc, addr)) {
> +        notify = xive_source_esb_trigger(xsrc, srcno);
> +        goto out;
> +    }
> +
> +    switch (offset) {
> +    case 0 ... 0x3FF:
> +        notify = xive_source_esb_trigger(xsrc, srcno);
> +        break;
> +
> +    case XIVE_ESB_STORE_EOI ... XIVE_ESB_STORE_EOI + 0x3FF:
> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
> +            qemu_log_mask(LOG_GUEST_ERROR,
> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
> +            return;
> +        }
> +
> +        notify = xive_source_esb_eoi(xsrc, srcno);
> +        break;
> +
> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
> +        xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
> +        break;
> +
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %x\n",
> +                      offset);
> +        return;
> +    }
> +
> +out:
> +    /* Forward the source event notification for routing */
> +    if (notify) {
> +        xive_source_notify(xsrc, srcno);
> +    }
> +}
> +
> +static const MemoryRegionOps xive_source_esb_ops = {
> +    .read = xive_source_esb_read,
> +    .write = xive_source_esb_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static void xive_source_set_irq(void *opaque, int srcno, int val)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> +    bool notify = false;
> +
> +    if (val) {
> +        notify = xive_source_esb_trigger(xsrc, srcno);
> +    }
> +
> +    /* Forward the source event notification for routing */
> +    if (notify) {
> +        xive_source_notify(xsrc, srcno);
> +    }
> +}
> +
> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
> +{
> +    int i;
> +
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        uint8_t pq = xive_source_esb_get(xsrc, i);
> +
> +        if (pq == XIVE_ESB_OFF) {
> +            continue;
> +        }
> +
> +        monitor_printf(mon, "  %08x %c%c\n", i + offset,
> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> +    }
> +}
> +
> +static void xive_source_reset(DeviceState *dev)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> +
> +    /* PQs are initialized to 0b01 which corresponds to "ints off" */
> +    memset(xsrc->status, 0x1, xsrc->nr_irqs);

You've already got XIVE_ESB_OFF defined to make this a little clearer.
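
Something along these lines, shown as a standalone sketch with
hypothetical names so it builds on its own (in the patch it would
simply be the existing define passed to memset()):

```c
#include <stdint.h>
#include <string.h>

#define ESB_VAL_Q  0x1
#define ESB_OFF    ESB_VAL_Q   /* PQ = 0b01, "ints off" */

/* Put every source in the OFF state, one status byte per source. */
static void esb_status_reset(uint8_t *status, uint32_t nr_irqs)
{
    memset(status, ESB_OFF, nr_irqs);
}
```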

> +}
> +
> +static void xive_source_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> +
> +    if (!xsrc->nr_irqs) {
> +        error_setg(errp, "Number of interrupts needs to be greater than 0");
> +        return;
> +    }
> +
> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
> +        xsrc->esb_shift != XIVE_ESB_64K &&
> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
> +        error_setg(errp, "Invalid ESB shift setting");
> +        return;
> +    }
> +
> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> +                                     xsrc->nr_irqs);
> +
> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
> +
> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> +                          &xive_source_esb_ops, xsrc, "xive.esb",
> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> +}
> +
> +static const VMStateDescription vmstate_xive_source = {
> +    .name = TYPE_XIVE_SOURCE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
> +        VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +/*
> + * The default XIVE interrupt source setting for the ESB MMIOs is two
> + * 64k pages without Store EOI, to be in sync with KVM.
> + */
> +static Property xive_source_properties[] = {
> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void xive_source_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->desc    = "XIVE Interrupt Source";
> +    dc->props   = xive_source_properties;
> +    dc->realize = xive_source_realize;
> +    dc->reset   = xive_source_reset;
> +    dc->vmsd    = &vmstate_xive_source;
> +}
> +
> +static const TypeInfo xive_source_info = {
> +    .name          = TYPE_XIVE_SOURCE,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(XiveSource),
> +    .class_init    = xive_source_class_init,
> +};
> +
> +static void xive_register_types(void)
> +{
> +    type_register_static(&xive_source_info);
> +}
> +
> +type_init(xive_register_types)
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index 0e9963f5eecc..72a46ed91c31 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>  obj-$(CONFIG_XICS) += xics.o
>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> +obj-$(CONFIG_XIVE) += xive.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
@ 2018-11-22  3:19   ` David Gibson
  2018-11-22  7:39     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-22  3:19 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:56:55AM +0100, Cédric Le Goater wrote:
> The 'sent' status of the LSI interrupt source is modeled with the 'P'
> bit of the ESB and the assertion status of the source is maintained in
> an array under the main sPAPRXive object. The type of the source is
> stored in the same array for practical reasons.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Looks good except for some minor details.

> ---
>  include/hw/ppc/xive.h | 20 ++++++++++++-
>  hw/intc/xive.c        | 68 +++++++++++++++++++++++++++++++++++++++----
>  2 files changed, 81 insertions(+), 7 deletions(-)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 5fec4b08705d..e118acd59f1e 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -32,8 +32,10 @@ typedef struct XiveSource {
>      /* IRQs */
>      uint32_t        nr_irqs;
>      qemu_irq        *qirqs;
> +    unsigned long   *lsi_map;
> +    int32_t         lsi_map_size; /* for VMSTATE_BITMAP */

At some point it's possible we'll want XiveSource subclasses that just
know which irqs are LSI and which aren't without an explicit map.  But
this detail isn't exposed in the migration stream or the user
interface, so we can tweak it later as necessary.

> -    /* PQ bits */
> +    /* PQ bits and LSI assertion bit */
>      uint8_t         *status;
>  
>      /* ESB memory region */
> @@ -89,6 +91,7 @@ static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
>   * When doing an EOI, the Q bit will indicate if the interrupt
>   * needs to be re-triggered.
>   */
> +#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
>  #define XIVE_ESB_VAL_P        0x2
>  #define XIVE_ESB_VAL_Q        0x1
>  
> @@ -127,4 +130,19 @@ static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
>      return xsrc->qirqs[srcno];
>  }
>  
> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    return test_bit(srcno, xsrc->lsi_map);
> +}
> +
> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> +                                       bool lsi)

The function name doesn't make it obvious that this controls LSI
configuration. '..._irq_set_lsi' maybe?

> +{
> +    assert(srcno < xsrc->nr_irqs);
> +    if (lsi) {
> +        bitmap_set(xsrc->lsi_map, srcno, 1);
> +    }
> +}
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index f7621f84828c..ac4605fee8b7 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -88,14 +88,40 @@ uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
>      return xive_esb_set(&xsrc->status[srcno], pq);
>  }
>  
> +/*
> + * Returns whether the event notification should be forwarded.
> + */
> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)

What exactly "trigger" means isn't entirely obvious for an LSI.  Might
be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
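
To make the suggestion concrete, here is a rough standalone sketch of
what such a split could look like.  It is not a drop-in for the patch:
the LsiSketch type and the sketch_* names are made up for
illustration, and the PQ values (RESET = 0x0, PENDING = 0x2) are
assumed from the earlier patch in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values as in the quoted patch; RESET/PENDING are assumed */
#define XIVE_STATUS_ASSERTED  0x4   /* Extra bit for LSI */
#define XIVE_ESB_VAL_P        0x2
#define XIVE_ESB_VAL_Q        0x1
#define XIVE_ESB_RESET        0x0
#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P

/* Hypothetical stand-in for the relevant part of XiveSource */
typedef struct LsiSketch {
    uint32_t nr_irqs;
    uint8_t *status;   /* per source: PQ in bits 0-1, ASSERTED in bit 2 */
} LsiSketch;

static uint8_t sketch_esb_get(LsiSketch *s, uint32_t srcno)
{
    return s->status[srcno] & 0x3;
}

static void sketch_esb_set(LsiSketch *s, uint32_t srcno, uint8_t pq)
{
    s->status[srcno] = (s->status[srcno] & ~0x3) | (pq & 0x3);
}

/*
 * Record that the line is asserted.  Returns true when an event
 * notification should be forwarded, i.e. only when PQ was RESET.
 */
static bool sketch_lsi_assert(LsiSketch *s, uint32_t srcno)
{
    assert(srcno < s->nr_irqs);
    s->status[srcno] |= XIVE_STATUS_ASSERTED;
    if (sketch_esb_get(s, srcno) == XIVE_ESB_RESET) {
        sketch_esb_set(s, srcno, XIVE_ESB_PENDING);
        return true;
    }
    return false;
}

/* Record that the line went low; never generates a notification. */
static void sketch_lsi_deassert(LsiSketch *s, uint32_t srcno)
{
    assert(srcno < s->nr_irqs);
    s->status[srcno] &= ~XIVE_STATUS_ASSERTED;
}
```

With this shape the set_irq handler would call one helper or the
other depending on 'val', and the EOI path would only need to check
the ASSERTED bit before re-sending.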

> +{
> +    uint8_t old_pq = xive_source_esb_get(xsrc, srcno);
> +
> +    switch (old_pq) {
> +    case XIVE_ESB_RESET:
> +        xive_source_esb_set(xsrc, srcno, XIVE_ESB_PENDING);
> +        return true;
> +    default:
> +        return false;
> +    }
> +}
> +
>  /*
>   * Returns whether the event notification should be forwarded.
>   */
>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>  {
> +    bool ret;
> +
>      assert(srcno < xsrc->nr_irqs);
>  
> -    return xive_esb_trigger(&xsrc->status[srcno]);
> +    ret = xive_esb_trigger(&xsrc->status[srcno]);
> +
> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
> +    }
> +
> +    return ret;
>  }
>  
>  /*
> @@ -103,9 +129,22 @@ static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>   */
>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>  {
> +    bool ret;
> +
>      assert(srcno < xsrc->nr_irqs);
>  
> -    return xive_esb_eoi(&xsrc->status[srcno]);
> +    ret = xive_esb_eoi(&xsrc->status[srcno]);
> +
> +    /* LSI sources do not set the Q bit but they can still be
> +     * asserted, in which case we should forward a new event
> +     * notification
> +     */
> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
> +        xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
> +        ret = xive_source_lsi_trigger(xsrc, srcno);
> +    }
> +
> +    return ret;
>  }
>  
>  /*
> @@ -268,8 +307,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>      bool notify = false;
>  
> -    if (val) {
> -        notify = xive_source_esb_trigger(xsrc, srcno);
> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> +        if (val) {
> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> +            notify = xive_source_lsi_trigger(xsrc, srcno);
> +        } else {
> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> +        }
> +    } else {
> +        if (val) {
> +            notify = xive_source_esb_trigger(xsrc, srcno);
> +        }
>      }
>  
>      /* Forward the source event notification for routing */
> @@ -289,9 +337,11 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>              continue;
>          }
>  
> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
>      }
>  }
>  
> @@ -299,6 +349,8 @@ static void xive_source_reset(DeviceState *dev)
>  {
>      XiveSource *xsrc = XIVE_SOURCE(dev);
>  
> +    /* Do not clear the LSI bitmap */
> +
>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>  }
> @@ -325,6 +377,9 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>  
>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>  
> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
> +    xsrc->lsi_map_size = xsrc->nr_irqs;
> +
>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>                            &xive_source_esb_ops, xsrc, "xive.esb",
>                            (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> @@ -338,6 +393,7 @@ static const VMStateDescription vmstate_xive_source = {
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
>          VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
> +        VMSTATE_BITMAP(lsi_map, XiveSource, 1, lsi_map_size),

This shouldn't be here.  The lsi_map is all set up at machine
configuration time and then static, so it doesn't need to be migrated.

>          VMSTATE_END_OF_LIST()
>      },
>  };

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model Cédric Le Goater
@ 2018-11-22  4:11   ` David Gibson
  2018-11-22  7:53     ` Cédric Le Goater
  2018-11-22  4:44   ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-22  4:11 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater wrote:
> The XiveRouter models the second sub-engine of the overall XIVE
> architecture : the Interrupt Virtualization Routing Engine (IVRE).
> 
> The IVRE handles event notifications of the IVSE through MMIO stores
> and performs the interrupt routing process. For this purpose, it uses
> a set of tables stored in system memory, the first of which is the
> Event Assignment Structure (EAS) table.
> 
> The EAT associates an interrupt source number with an Event Notification
> Descriptor (END) which will be used in a second phase of the routing
> process to identify a Notification Virtual Target.
> 
> The XiveRouter is an abstract class which needs to be inherited from
> to define a storage for the EAT, and other upcoming tables. The
> 'chip-id' attribute is not strictly necessary for the sPAPR and
> PowerNV machines but it's a good way to test the routing algorithm.
> Without this attribute, the XiveRouter could be a simple QOM
> interface.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/xive.h      | 32 ++++++++++++++
>  include/hw/ppc/xive_regs.h | 31 ++++++++++++++
>  hw/intc/xive.c             | 86 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 149 insertions(+)
>  create mode 100644 include/hw/ppc/xive_regs.h
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index be93fae6317b..5a0696366577 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -11,6 +11,7 @@
>  #define PPC_XIVE_H
>  
>  #include "hw/sysbus.h"

Again, I don't think making this a SysBusDevice is quite right.
Even more so for the router than the source, because at least for PAPR
it might not have any MMIO presence at all.

> +#include "hw/ppc/xive_regs.h"
>  
>  /*
>   * XIVE Fabric (Interface between Source and Router)
> @@ -168,4 +169,35 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>      }
>  }
>  
> +/*
> + * XIVE Router
> + */
> +
> +typedef struct XiveRouter {
> +    SysBusDevice    parent;
> +
> +    uint32_t        chip_id;

I don't think this belongs in the base class.  The PowerNV specific
variants will need it, but it doesn't make sense for the PAPR version.

> +} XiveRouter;
> +
> +#define TYPE_XIVE_ROUTER "xive-router"
> +#define XIVE_ROUTER(obj)                                \
> +    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
> +#define XIVE_ROUTER_CLASS(klass)                                        \
> +    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
> +#define XIVE_ROUTER_GET_CLASS(obj)                              \
> +    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
> +
> +typedef struct XiveRouterClass {
> +    SysBusDeviceClass parent;
> +
> +    /* XIVE table accessors */
> +    int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +    int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +} XiveRouterClass;
> +
> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> +
> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> new file mode 100644
> index 000000000000..12499b33614c
> --- /dev/null
> +++ b/include/hw/ppc/xive_regs.h
> @@ -0,0 +1,31 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2016-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_XIVE_REGS_H
> +#define PPC_XIVE_REGS_H
> +
> +/* EAS (Event Assignment Structure)
> + *
> + * One per interrupt source. Targets an interrupt to a given Event
> + * Notification Descriptor (END) and provides the corresponding
> + * logical interrupt number (END data)
> + */
> +typedef struct XiveEAS {
> +        /* Use a single 64-bit definition to make it easier to
> +         * perform atomic updates
> +         */
> +        uint64_t        w;
> +#define EAS_VALID       PPC_BIT(0)
> +#define EAS_END_BLOCK   PPC_BITMASK(4, 7)        /* Destination END block# */
> +#define EAS_END_INDEX   PPC_BITMASK(8, 31)       /* Destination END index */
> +#define EAS_MASKED      PPC_BIT(32)              /* Masked */
> +#define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
> +} XiveEAS;
> +
> +#endif /* PPC_XIVE_REGS_H */
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 014a2e41f71f..c4c90a25758e 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -442,6 +442,91 @@ static const TypeInfo xive_source_info = {
>      .class_init    = xive_source_class_init,
>  };
>  
> +/*
> + * XIVE Router (aka. Virtualization Controller or IVRE)
> + */
> +
> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> +{
> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> +
> +    return xrc->get_eas(xrtr, lisn, eas);
> +}
> +
> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> +{
> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> +
> +    return xrc->set_eas(xrtr, lisn, eas);
> +}
> +
> +static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> +{
> +    XiveRouter *xrtr = XIVE_ROUTER(xf);
> +    XiveEAS eas;
> +
> +    /* EAS cache lookup */
> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Unknown LISN %x\n", lisn);
> +        return;
> +    }

AFAICT a bad LISN here means a qemu error (in the source, probably),
not a user or guest error, so an assert() would be more appropriate.

> +
> +    /* The IVRE has a State Bit Cache for its internal sources which
> +     * is also involed at this point. We skip the SBC lookup because
> +     * the state bits of the sources are modeled internally in QEMU.
> +     */
> +
> +    if (!(eas.w & EAS_VALID)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
> +        return;
> +    }
> +
> +    if (eas.w & EAS_MASKED) {
> +        /* Notification completed */
> +        return;
> +    }
> +}
> +
> +static Property xive_router_properties[] = {
> +    DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void xive_router_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
> +
> +    dc->desc    = "XIVE Router Engine";
> +    dc->props   = xive_router_properties;
> +    xfc->notify = xive_router_notify;
> +}
> +
> +static const TypeInfo xive_router_info = {
> +    .name          = TYPE_XIVE_ROUTER,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .abstract      = true,
> +    .class_size    = sizeof(XiveRouterClass),
> +    .class_init    = xive_router_class_init,
> +    .interfaces    = (InterfaceInfo[]) {
> +        { TYPE_XIVE_FABRIC },

So as far as I can see so far, the XiveFabric interface will
essentially have to be implemented on the router object, so I'm not
seeing much point to having the interface rather than just a direct
call on the router object.  But I haven't read the whole series yet,
so maybe I'm missing something.


> +        { }
> +    }
> +};
> +
> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
> +{
> +    if (!(eas->w & EAS_VALID)) {
> +        return;
> +    }
> +
> +    monitor_printf(mon, "  %08x %s end:%02x/%04x data:%08x\n",
> +                   lisn, eas->w & EAS_MASKED ? "M" : " ",
> +                   (uint8_t)  GETFIELD(EAS_END_BLOCK, eas->w),
> +                   (uint32_t) GETFIELD(EAS_END_INDEX, eas->w),
> +                   (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
> +}
> +
>  /*
>   * XIVE Fabric
>   */
> @@ -455,6 +540,7 @@ static void xive_register_types(void)
>  {
>      type_register_static(&xive_source_info);
>      type_register_static(&xive_fabric_info);
> +    type_register_static(&xive_router_info);
>  }
>  
>  type_init(xive_register_types)

* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors Cédric Le Goater
@ 2018-11-22  4:41   ` David Gibson
  2018-11-22  6:49     ` Benjamin Herrenschmidt
  2018-11-22 21:47     ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-22  4:41 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:56:58AM +0100, Cédric Le Goater wrote:
> To complete the event routing, the IVRE sub-engine uses an internal
> table containing Event Notification Descriptor (END) structures.
> 
> An END specifies on which Event Queue (EQ) the event notification
> data, defined in the associated EAS, should be posted when an
> exception occurs. It also defines which Notification Virtual Target
> (NVT) should be notified.
> 
> The Event Queue is a memory page provided by the O/S defining a
> circular buffer, one per server and priority couple, containing Event
> Queue entries. These are 4 bytes long, the first bit being a
> 'generation' bit and the 31 following bits the END Data field. They
> are pulled by the O/S when the exception occurs.
> 
> The END Data field is a way to set an invariant logical event source
> number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG hcall
> when the EISN flag is used.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/xive.h      |  18 ++++
>  include/hw/ppc/xive_regs.h |  48 ++++++++++
>  hw/intc/xive.c             | 185 ++++++++++++++++++++++++++++++++++++-
>  3 files changed, 248 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 5a0696366577..ce62aaf28343 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -193,11 +193,29 @@ typedef struct XiveRouterClass {
>      /* XIVE table accessors */
>      int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>      int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +    int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +                   XiveEND *end);
> +    int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +                   XiveEND *end);

Hrm.  So unlike the EAS, which is basically just a word, the END is a
pretty large structure.  It's unclear here if get/set are expected to
copy the whole thing out and in, or if get gives you a pointer into
a "live" structure and set just does any necessary barriers after an
update.

Really, for a non-atomic value like this, I'm not sure get/set is the
right model.

Also as I understand it nearly all the indices in XIVE are broken into
block/index.  Is there a reason those are folded together into lisn
for the EAS, but not for the END?
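
For context on the block/index question, this is roughly how the END
block and index come out of the single EAS word.  It is only a
sketch: the SK_* macros and sk_* helpers are illustrative names, not
the ones in the tree, and they assume the usual IBM (MSB-0) bit
numbering behind PPC_BIT()/PPC_BITMASK(), with the field positions
taken from the XiveEAS definition in patch 04.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of the word */
#define SK_PPC_BIT(bit)         (0x8000000000000000ULL >> (bit))
#define SK_PPC_BITMASK(bs, be)  \
    ((SK_PPC_BIT(bs) - SK_PPC_BIT(be)) | SK_PPC_BIT(bs))

/* Field layout copied from the XiveEAS definition in patch 04 */
#define SK_EAS_VALID       SK_PPC_BIT(0)
#define SK_EAS_END_BLOCK   SK_PPC_BITMASK(4, 7)
#define SK_EAS_END_INDEX   SK_PPC_BITMASK(8, 31)
#define SK_EAS_MASKED      SK_PPC_BIT(32)
#define SK_EAS_END_DATA    SK_PPC_BITMASK(33, 63)

/* Position of the least significant set bit of a non-zero mask */
static unsigned sk_mask_shift(uint64_t mask)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Extract the field selected by 'mask', right-justified */
static uint64_t sk_getfield(uint64_t mask, uint64_t word)
{
    return (word & mask) >> sk_mask_shift(mask);
}

/* Return 'word' with the field selected by 'mask' set to 'value' */
static uint64_t sk_setfield(uint64_t mask, uint64_t word, uint64_t value)
{
    return (word & ~mask) | ((value << sk_mask_shift(mask)) & mask);
}
```

So an EAS already carries its target as a 4-bit block plus a 24-bit
index; the asymmetry is only in how the accessors take the lookup key
(one lisn versus an explicit end_blk/end_idx pair).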

>  } XiveRouterClass;
>  
>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>  
>  int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>  int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +                        XiveEND *end);
> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +                        XiveEND *end);
> +
> +/*
> + * For legacy compatibility, the exceptions define up to 256 different
> + * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> + * and the least favored level 0xFF.
> + */
> +#define XIVE_PRIORITY_MAX  7
> +
> +void xive_end_reset(XiveEND *end);
> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>  
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index 12499b33614c..f97fb2b90bee 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -28,4 +28,52 @@ typedef struct XiveEAS {
>  #define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
>  } XiveEAS;
>  
> +/* Event Notification Descriptor (END) */
> +typedef struct XiveEND {
> +        uint32_t        w0;
> +#define END_W0_VALID             PPC_BIT32(0) /* "v" bit */
> +#define END_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
> +#define END_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
> +#define END_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
> +#define END_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
> +#define END_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
> +#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
> +#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
> +#define END_W0_QSIZE             PPC_BITMASK32(12, 15)
> +#define END_W0_SW0               PPC_BIT32(16)
> +#define END_W0_FIRMWARE          END_W0_SW0 /* Owned by FW */
> +#define END_QSIZE_4K             0
> +#define END_QSIZE_64K            4
> +#define END_W0_HWDEP             PPC_BITMASK32(24, 31)
> +        uint32_t        w1;
> +#define END_W1_ESn               PPC_BITMASK32(0, 1)
> +#define END_W1_ESn_P             PPC_BIT32(0)
> +#define END_W1_ESn_Q             PPC_BIT32(1)
> +#define END_W1_ESe               PPC_BITMASK32(2, 3)
> +#define END_W1_ESe_P             PPC_BIT32(2)
> +#define END_W1_ESe_Q             PPC_BIT32(3)
> +#define END_W1_GENERATION        PPC_BIT32(9)
> +#define END_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
> +        uint32_t        w2;
> +#define END_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
> +#define END_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
> +        uint32_t        w3;
> +#define END_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
> +        uint32_t        w4;
> +#define END_W4_ESC_END_BLOCK     PPC_BITMASK32(4, 7)
> +#define END_W4_ESC_END_INDEX     PPC_BITMASK32(8, 31)
> +        uint32_t        w5;
> +#define END_W5_ESC_END_DATA      PPC_BITMASK32(1, 31)
> +        uint32_t        w6;
> +#define END_W6_FORMAT_BIT        PPC_BIT32(8)
> +#define END_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
> +#define END_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
> +        uint32_t        w7;
> +#define END_W7_F0_IGNORE         PPC_BIT32(0)
> +#define END_W7_F0_BLK_GROUPING   PPC_BIT32(1)
> +#define END_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
> +#define END_W7_F1_WAKEZ          PPC_BIT32(0)
> +#define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
> +} XiveEND;
> +
>  #endif /* PPC_XIVE_REGS_H */
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index c4c90a25758e..9cb001e7b540 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -442,6 +442,101 @@ static const TypeInfo xive_source_info = {
>      .class_init    = xive_source_class_init,
>  };
>  
> +/*
> + * XiveEND helpers
> + */
> +
> +void xive_end_reset(XiveEND *end)
> +{
> +    memset(end, 0, sizeof(*end));
> +
> +    /* switch off the escalation and notification ESBs */
> +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;

It's not obvious to me what circumstances this would be called under.
Since the ENDs are in system memory, a memset() seems like an odd
thing for (virtual) hardware to be doing to it.

> +}
> +
> +static void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width,
> +                                          Monitor *mon)
> +{
> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
> +    uint32_t qentries = 1 << (qsize + 10);
> +    int i;
> +
> +    /*
> +     * print out the [ (qindex - (width - 1)) .. (qindex + 1)] window
> +     */
> +    monitor_printf(mon, " [ ");
> +    qindex = (qindex - (width - 1)) & (qentries - 1);
> +    for (i = 0; i < width; i++) {
> +        uint64_t qaddr = qaddr_base + (qindex << 2);
> +        uint32_t qdata = -1;
> +
> +        if (dma_memory_read(&address_space_memory, qaddr, &qdata,
> +                            sizeof(qdata))) {
> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ @0x%"
> +                          HWADDR_PRIx "\n", qaddr);
> +            return;
> +        }
> +        monitor_printf(mon, "%s%08x ", i == width - 1 ? "^" : "",
> +                       be32_to_cpu(qdata));
> +        qindex = (qindex + 1) & (qentries - 1);
> +    }
> +    monitor_printf(mon, "]\n");
> +}
> +
> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon)
> +{
> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
> +    uint32_t qentries = 1 << (qsize + 10);
> +
> +    uint32_t nvt = GETFIELD(END_W6_NVT_INDEX, end->w6);
> +    uint8_t priority = GETFIELD(END_W7_F0_PRIORITY, end->w7);
> +
> +    if (!(end->w0 & END_W0_VALID)) {
> +        return;
> +    }
> +
> +    monitor_printf(mon, "  %08x %c%c%c%c%c prio:%d nvt:%04x eq:@%08"PRIx64
> +                   "% 6d/%5d ^%d", end_idx,
> +                   end->w0 & END_W0_VALID ? 'v' : '-',
> +                   end->w0 & END_W0_ENQUEUE ? 'q' : '-',
> +                   end->w0 & END_W0_UCOND_NOTIFY ? 'n' : '-',
> +                   end->w0 & END_W0_BACKLOG ? 'b' : '-',
> +                   end->w0 & END_W0_ESCALATE_CTL ? 'e' : '-',
> +                   priority, nvt, qaddr_base, qindex, qentries, qgen);
> +
> +    xive_end_queue_pic_print_info(end, 6, mon);
> +}
> +
> +static void xive_end_push(XiveEND *end, uint32_t data)

s/push/enqueue/ please, "push" suggests a stack.  (Not to mention that
"push" and "pull" are used as terms elsewhere in XIVE).
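
For reference, the queue semantics described in the commit message
boil down to something like the sketch below: a power-of-two ring of
4-byte entries, top bit holding the generation, which flips each time
the index wraps.  This is a simplified model, not the patch code: the
EqSketch type is hypothetical, the entries live in a plain host array
rather than guest memory, and there is no byte swapping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical in-memory model of one event queue */
typedef struct EqSketch {
    uint32_t *entries;     /* queue page, host-endian for simplicity */
    uint32_t  nr_entries;  /* must be a power of two */
    uint32_t  index;       /* next slot to write */
    uint32_t  gen;         /* current generation bit, 0 or 1 */
} EqSketch;

/*
 * Append one entry: bit 31 carries the generation, the low 31 bits
 * carry the END data.  The generation toggles when the index wraps.
 */
static void eq_sketch_enqueue(EqSketch *eq, uint32_t data)
{
    assert(eq->nr_entries && !(eq->nr_entries & (eq->nr_entries - 1)));

    eq->entries[eq->index] = (eq->gen << 31) | (data & 0x7fffffff);

    eq->index = (eq->index + 1) & (eq->nr_entries - 1);
    if (eq->index == 0) {
        eq->gen ^= 1;
    }
}
```

The consumer can then tell fresh entries from stale ones by comparing
the top bit against the generation it expects, without any shared
producer index.  In the quoted patch the number of entries is
1 << (qsize + 10), so a 4K page (qsize 0) holds 1024 entries.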

> +{
> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
> +
> +    uint64_t qaddr = qaddr_base + (qindex << 2);
> +    uint32_t qdata = cpu_to_be32((qgen << 31) | (data & 0x7fffffff));
> +    uint32_t qentries = 1 << (qsize + 10);
> +
> +    if (dma_memory_write(&address_space_memory, qaddr, &qdata, sizeof(qdata))) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write END data @0x%"
> +                      HWADDR_PRIx "\n", qaddr);
> +        return;
> +    }
> +
> +    qindex = (qindex + 1) & (qentries - 1);
> +    if (qindex == 0) {
> +        qgen ^= 1;
> +        end->w1 = SETFIELD(END_W1_GENERATION, end->w1, qgen);
> +    }
> +    end->w1 = SETFIELD(END_W1_PAGE_OFF, end->w1, qindex);
> +}
> +
>  /*
>   * XIVE Router (aka. Virtualization Controller or IVRE)
>   */
> @@ -460,6 +555,82 @@ int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>      return xrc->set_eas(xrtr, lisn, eas);
>  }
>  
> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +                        XiveEND *end)
> +{
> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> +
> +   return xrc->get_end(xrtr, end_blk, end_idx, end);
> +}
> +
> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> +                        XiveEND *end)
> +{
> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> +
> +   return xrc->set_end(xrtr, end_blk, end_idx, end);
> +}
> +
> +/*
> + * An END trigger can come from an event trigger (IPI or HW) or from
> + * another chip. We don't model the PowerBus but the END trigger
> + * message has the same parameters than in the function below.
> + */
> +static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
> +                                   uint32_t end_idx, uint32_t end_data)
> +{
> +    XiveEND end;
> +    uint8_t priority;
> +    uint8_t format;
> +
> +    /* END cache lookup */
> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
> +                      end_idx);
> +        return;
> +    }
> +
> +    if (!(end.w0 & END_W0_VALID)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
> +                      end_blk, end_idx);
> +        return;
> +    }
> +
> +    if (end.w0 & END_W0_ENQUEUE) {
> +        xive_end_push(&end, end_data);
> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
> +    }
> +
> +    /*
> +     * The W7 format depends on the F bit in W6. It defines the type
> +     * of the notification :
> +     *
> +     *   F=0 : single or multiple NVT notification
> +     *   F=1 : User level Event-Based Branch (EBB) notification, no
> +     *         priority
> +     */
> +    format = GETFIELD(END_W6_FORMAT_BIT, end.w6);
> +    priority = GETFIELD(END_W7_F0_PRIORITY, end.w7);
> +
> +    /* The END is masked */
> +    if (format == 0 && priority == 0xff) {
> +        return;
> +    }
> +
> +    /*
> +     * Check the END ESn (Event State Buffer for notification) for
> +     * even futher coalescing in the Router
> +     */
> +    if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
> +        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> +        return;
> +    }
> +
> +    /*
> +     * Follows IVPE notification
> +     */
> +}
> +
>  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>  {
>      XiveRouter *xrtr = XIVE_ROUTER(xf);
> @@ -471,9 +642,9 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>          return;
>      }
>  
> -    /* The IVRE has a State Bit Cache for its internal sources which
> -     * is also involed at this point. We skip the SBC lookup because
> -     * the state bits of the sources are modeled internally in QEMU.
> +    /* The IVRE checks the State Bit Cache at this point. We skip the
> +     * SBC lookup because the state bits of the sources are modeled
> +     * internally in QEMU.

Replacing a comment about something we're not doing with a different
comment about something we're not doing doesn't seem very useful.
Maybe fold these together into one patch or the other.

>       */
>  
>      if (!(eas.w & EAS_VALID)) {
> @@ -485,6 +656,14 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>          /* Notification completed */
>          return;
>      }
> +
> +    /*
> +     * The event trigger becomes an END trigger
> +     */
> +    xive_router_end_notify(xrtr,
> +                           GETFIELD(EAS_END_BLOCK, eas.w),
> +                           GETFIELD(EAS_END_INDEX, eas.w),
> +                           GETFIELD(EAS_END_DATA,  eas.w));
>  }
>  
>  static Property xive_router_properties[] = {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model Cédric Le Goater
  2018-11-22  4:11   ` David Gibson
@ 2018-11-22  4:44   ` David Gibson
  2018-11-22  6:50     ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-22  4:44 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater wrote:
> The XiveRouter models the second sub-engine of the overall XIVE
> architecture : the Interrupt Virtualization Routing Engine (IVRE).
> 
> The IVRE handles event notifications of the IVSE through MMIO stores
> and performs the interrupt routing process. For this purpose, it uses
> a set of tables stored in system memory, the first of which is the
> Event Assignment Structure (EAS) table.
> 
> The EAT associates an interrupt source number with an Event Notification
> Descriptor (END) which will be used in a second phase of the routing
> process to identify a Notification Virtual Target.
> 
> The XiveRouter is an abstract class which needs to be inherited from
> to define a storage for the EAT, and other upcoming tables. The
> 'chip-id' attribute is not strictly necessary for the sPAPR and
> PowerNV machines but it's a good way to test the routing algorithm.
> Without this attribute, the XiveRouter could be a simple QOM
> interface.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/xive.h      | 32 ++++++++++++++
>  include/hw/ppc/xive_regs.h | 31 ++++++++++++++
>  hw/intc/xive.c             | 86 ++++++++++++++++++++++++++++++++++++++
>  3 files changed, 149 insertions(+)
>  create mode 100644 include/hw/ppc/xive_regs.h
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index be93fae6317b..5a0696366577 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -11,6 +11,7 @@
>  #define PPC_XIVE_H
>  
>  #include "hw/sysbus.h"
> +#include "hw/ppc/xive_regs.h"
>  
>  /*
>   * XIVE Fabric (Interface between Source and Router)
> @@ -168,4 +169,35 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>      }
>  }
>  
> +/*
> + * XIVE Router
> + */
> +
> +typedef struct XiveRouter {
> +    SysBusDevice    parent;
> +
> +    uint32_t        chip_id;
> +} XiveRouter;
> +
> +#define TYPE_XIVE_ROUTER "xive-router"
> +#define XIVE_ROUTER(obj)                                \
> +    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
> +#define XIVE_ROUTER_CLASS(klass)                                        \
> +    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
> +#define XIVE_ROUTER_GET_CLASS(obj)                              \
> +    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
> +
> +typedef struct XiveRouterClass {
> +    SysBusDeviceClass parent;
> +
> +    /* XIVE table accessors */
> +    int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> +    int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);

Sorry, didn't think of this in my first reply.

1) Does the hardware ever actually write back to the EAS?  I know it
does for the END, but it's not clear why it would need to for the
EAS.  If not, we don't need the setter.

2) The signatures are a bit odd here.  For the setter, a value would
make more sense than a (XiveEAS *), since it's just a word.  For the getter
you could return the EAS value directly rather than using a pointer -
there's already a valid bit in the EAS so you can construct a value
with that cleared if the lisn is out of bounds.
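
For illustration, a by-value getter along the lines suggested above
might look roughly like the sketch below.  DemoEAS, DemoRouter,
demo_get_eas() and the position of DEMO_EAS_VALID are all hypothetical
stand-ins, not the definitions from the patch; the only point is that an
out-of-bounds lisn yields an entry with the valid bit cleared, so the
caller needs no separate error code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the EAS is one 64-bit word with a valid bit in the top bit. */
#define DEMO_EAS_VALID (1ull << 63)

typedef struct DemoEAS {
    uint64_t w;
} DemoEAS;

/* Hypothetical router backing store: a flat table of EAS words. */
typedef struct DemoRouter {
    const DemoEAS *eat;
    uint32_t nr_irqs;
} DemoRouter;

/* Return the EAS by value; an out-of-range lisn gives an invalid EAS. */
static DemoEAS demo_get_eas(const DemoRouter *xrtr, uint32_t lisn)
{
    DemoEAS invalid = { .w = 0 };

    if (lisn >= xrtr->nr_irqs) {
        return invalid;
    }
    return xrtr->eat[lisn];
}

static bool demo_eas_is_valid(DemoEAS eas)
{
    return (eas.w & DEMO_EAS_VALID) != 0;
}
```

The caller then only tests demo_eas_is_valid() on the result, which
covers both "entry not configured" and "lisn out of bounds".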

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers Cédric Le Goater
@ 2018-11-22  5:13   ` David Gibson
  2018-11-22 21:58     ` Cédric Le Goater
  2018-11-29 22:06     ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-22  5:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
> The Event Notification Descriptor also contains two Event State
> Buffers providing further coalescing of interrupts, one for the
> notification event (ESn) and one for the escalation events (ESe). A
> MMIO page is assigned for each to control the EOI through loads
> only. Stores are not allowed.
> 
> The END ESBs are modeled through an object resembling the 'XiveSource'.
> It is stateless as the END state bits are backed into the XiveEND
> structure under the XiveRouter and the MMIO accesses follow the same
> rules as for the standard source ESBs.
> 
> END ESBs are not supported by the Linux drivers, on either OPAL or
> sPAPR. Nevertheless, this provides a means to study the question in the
> future and validates the XIVE model a bit more.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/xive.h |  20 ++++++
>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
>  2 files changed, 178 insertions(+), 2 deletions(-)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index ce62aaf28343..24301bf2076d 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>                          XiveEND *end);
>  
> +/*
> + * XIVE END ESBs
> + */
> +
> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
> +#define XIVE_END_SOURCE(obj) \
> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)

Is there a particular reason to make this a full QOM object, rather
than just embedding it in the XiveRouter?

> +typedef struct XiveENDSource {
> +    SysBusDevice parent;
> +
> +    uint32_t        nr_ends;
> +
> +    /* ESB memory region */
> +    uint32_t        esb_shift;
> +    MemoryRegion    esb_mmio;
> +
> +    XiveRouter      *xrtr;
> +} XiveENDSource;
> +
>  /*
>   * For legacy compatibility, the exceptions define up to 256 different
>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 9cb001e7b540..5a8882d47a98 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>       * even further coalescing in the Router
>       */
>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> -        return;
> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
> +        bool notify = xive_esb_trigger(&pq);
> +
> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
> +        }
> +
> +        /* ESn[Q]=1 : end of notification */
> +        if (!notify) {
> +            return;
> +        }
>      }
>  
>      /*
> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>  }
>  
> +/*
> + * END ESB MMIO loads
> + */
> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
> +    XiveRouter *xrtr = xsrc->xrtr;
> +    uint32_t offset = addr & 0xFFF;
> +    uint8_t end_blk;
> +    uint32_t end_idx;
> +    XiveEND end;
> +    uint32_t end_esmask;
> +    uint8_t pq;
> +    uint64_t ret = -1;
> +
> +    end_blk = xrtr->chip_id;
> +    end_idx = addr >> (xsrc->esb_shift + 1);
> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
> +                      end_idx);
> +        return -1;
> +    }
> +
> +    if (!(end.w0 & END_W0_VALID)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
> +                      end_blk, end_idx);
> +        return -1;
> +    }
> +
> +    end_esmask = addr_is_even(addr, xsrc->esb_shift) ? END_W1_ESn : END_W1_ESe;
> +    pq = GETFIELD(end_esmask, end.w1);
> +
> +    switch (offset) {
> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
> +        ret = xive_esb_eoi(&pq);
> +
> +        /* Forward the source event notification for routing ?? */
> +        break;
> +
> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
> +        ret = pq;
> +        break;
> +
> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
> +        ret = xive_esb_set(&pq, (offset >> 8) & 0x3);
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid END ESB load addr %d\n",
> +                      offset);
> +        return -1;
> +    }
> +
> +    if (pq != GETFIELD(end_esmask, end.w1)) {
> +        end.w1 = SETFIELD(end_esmask, end.w1, pq);
> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
> +    }

We can probably share some more code with XiveSource here, but that's
something that can be refined later.
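
As a rough sketch of what such sharing could look like: both load
handlers end up decoding the same 4K offset layout against a 2-bit PQ
value, so a helper that takes a pointer to the PQ bits could serve
either caller.  The name demo_esb_load() is hypothetical; the offsets
and the EOI transitions are copied from the quoted patches.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings from the patch: P is bit 1, Q is bit 0. */
#define DEMO_ESB_RESET   0x0
#define DEMO_ESB_OFF     0x1
#define DEMO_ESB_PENDING 0x2
#define DEMO_ESB_QUEUED  0x3

/* Replace the PQ bits and return the previous value. */
static uint8_t demo_esb_set(uint8_t *pq, uint8_t value)
{
    uint8_t old_pq = *pq & 0x3;

    *pq = (uint8_t)((*pq & ~0x3) | (value & 0x3));
    return old_pq;
}

/* EOI: returns true when a queued event must be re-notified. */
static bool demo_esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case DEMO_ESB_QUEUED:
        demo_esb_set(pq, DEMO_ESB_PENDING);
        return true;
    case DEMO_ESB_OFF:
        return false;               /* stays off */
    default:                        /* RESET or PENDING */
        demo_esb_set(pq, DEMO_ESB_RESET);
        return false;
    }
}

/* Hypothetical shared load decoder for a management page. */
static uint64_t demo_esb_load(uint8_t *pq, uint32_t offset)
{
    offset &= 0xFFF;

    if (offset < 0x800) {           /* 0x000 .. 0x7FF : load EOI */
        return demo_esb_eoi(pq);
    }
    if (offset < 0xC00) {           /* 0x800 .. 0xBFF : get PQ */
        return *pq & 0x3;
    }
    /* 0xC00 .. 0xFFF : set PQ to offset bits [9:8], return old PQ */
    return demo_esb_set(pq, (offset >> 8) & 0x3);
}
```

The XiveSource caller would pass &xsrc->status[srcno], while the END
caller would pass a local copy of the ESn or ESe field and write it back
to the END if it changed.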

> +
> +    return ret;
> +}
> +
> +/*
> + * END ESB MMIO stores are invalid
> + */
> +static void xive_end_source_write(void *opaque, hwaddr addr,
> +                                  uint64_t value, unsigned size)
> +{
> +    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr 0x%"
> +                  HWADDR_PRIx"\n", addr);
> +}
> +
> +static const MemoryRegionOps xive_end_source_ops = {
> +    .read = xive_end_source_read,
> +    .write = xive_end_source_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static void xive_end_source_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveENDSource *xsrc = XIVE_END_SOURCE(dev);
> +    Object *obj;
> +    Error *local_err = NULL;
> +
> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> +    if (!obj) {
> +        error_propagate(errp, local_err);
> +        error_prepend(errp, "required link 'xive' not found: ");
> +        return;
> +    }
> +
> +    xsrc->xrtr = XIVE_ROUTER(obj);
> +
> +    if (!xsrc->nr_ends) {
> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
> +        return;
> +    }
> +
> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
> +        xsrc->esb_shift != XIVE_ESB_64K) {
> +        error_setg(errp, "Invalid ESB shift setting");
> +        return;
> +    }
> +
> +    /*
> +     * Each END is assigned an even/odd pair of MMIO pages, the even page
> +     * manages the ESn field while the odd page manages the ESe field.
> +     */
> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> +                          &xive_end_source_ops, xsrc, "xive.end",
> +                          (1ull << (xsrc->esb_shift + 1)) * xsrc->nr_ends);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> +}
> +
> +static Property xive_end_source_properties[] = {
> +    DEFINE_PROP_UINT32("nr-ends", XiveENDSource, nr_ends, 0),
> +    DEFINE_PROP_UINT32("shift", XiveENDSource, esb_shift, XIVE_ESB_64K),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void xive_end_source_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->desc    = "XIVE END Source";
> +    dc->props   = xive_end_source_properties;
> +    dc->realize = xive_end_source_realize;
> +}
> +
> +static const TypeInfo xive_end_source_info = {
> +    .name          = TYPE_XIVE_END_SOURCE,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .instance_size = sizeof(XiveENDSource),
> +    .class_init    = xive_end_source_class_init,
> +};
> +
>  /*
>   * XIVE Fabric
>   */
> @@ -720,6 +875,7 @@ static void xive_register_types(void)
>      type_register_static(&xive_source_info);
>      type_register_static(&xive_fabric_info);
>      type_register_static(&xive_router_info);
> +    type_register_static(&xive_end_source_info);
>  }
>  
>  type_init(xive_register_types)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-22  4:41   ` David Gibson
@ 2018-11-22  6:49     ` Benjamin Herrenschmidt
  2018-11-23  3:51       ` David Gibson
  2018-11-22 21:47     ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2018-11-22  6:49 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: qemu-ppc, qemu-devel

On Thu, 2018-11-22 at 15:41 +1100, David Gibson wrote:
> 
> > +void xive_end_reset(XiveEND *end)
> > +{
> > +    memset(end, 0, sizeof(*end));
> > +
> > +    /* switch off the escalation and notification ESBs */
> > +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
> 
> It's not obvious to me what circumstances this would be called under.
> Since the ENDs are in system memory, a memset() seems like an odd
> thing for (virtual) hardware to be doing to it.
> 
> > +}

Not on PAPR ...

Ben.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-22  4:44   ` David Gibson
@ 2018-11-22  6:50     ` Benjamin Herrenschmidt
  2018-11-22  7:59       ` Cédric Le Goater
  2018-11-23  1:10       ` David Gibson
  0 siblings, 2 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2018-11-22  6:50 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: qemu-ppc, qemu-devel

On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> 
> Sorry, didn't think of this in my first reply.
> 
> 1) Does the hardware ever actually write back to the EAS?  I know it
> does for the END, but it's not clear why it would need to for the
> EAS.  If not, we don't need the setter.

Nope, though the PAPR model will, via hcalls.

> 
> 2) The signatures are a bit odd here.  For the setter, a value would
> make more sense than a (XiveEAS *), since it's just a word.  For the getter
> you could return the EAS value directly rather than using a pointer -
> there's already a valid bit in the EAS so you can construct a value
> with that cleared if the lisn is out of bounds.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model
  2018-11-22  3:05   ` David Gibson
@ 2018-11-22  7:25     ` Cédric Le Goater
  2018-11-23  0:31       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-22  7:25 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/22/18 4:05 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:56:54AM +0100, Cédric Le Goater wrote:
>> The first sub-engine of the overall XIVE architecture is the Interrupt
>> Virtualization Source Engine (IVSE). An IVSE can be integrated into
>> another logic, like in a PCI PHB or in the main interrupt controller
>> to manage IPIs.
>>
>> Each IVSE instance is associated with an Event State Buffer (ESB) that
>> contains a two bit state entry for each possible event source. When an
>> event is signaled to the IVSE, by MMIO or some other means, the
>> associated interrupt state bits are fetched from the ESB and
>> modified. Depending on the resulting ESB state, the event is forwarded
>> to the IVRE sub-engine of the controller doing the routing.
>>
>> Each supported ESB entry is associated with either a single page or an
>> even/odd pair of pages which provides commands to manage the source:
>> to EOI, to turn off the source for instance.
>>
>> On a sPAPR machine, the O/S will obtain the page address of the ESB
>> entry associated with a source and its characteristic using the
>> H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is used.
>>
>> The xive_source_notify() routine is in charge of forwarding the source
>> event notification to the routing engine. It will be filled later on.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> Ok, this is looking basically pretty good.  Few details to query
> below.
> 
> 
>> ---
>>  default-configs/ppc64-softmmu.mak |   1 +
>>  include/hw/ppc/xive.h             | 130 ++++++++++
>>  hw/intc/xive.c                    | 379 ++++++++++++++++++++++++++++++
>>  hw/intc/Makefile.objs             |   1 +
>>  4 files changed, 511 insertions(+)
>>  create mode 100644 include/hw/ppc/xive.h
>>  create mode 100644 hw/intc/xive.c
>>
>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>> index aec2855750d6..2d1e7c5c4668 100644
>> --- a/default-configs/ppc64-softmmu.mak
>> +++ b/default-configs/ppc64-softmmu.mak
>> @@ -16,6 +16,7 @@ CONFIG_VIRTIO_VGA=y
>>  CONFIG_XICS=$(CONFIG_PSERIES)
>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>> +CONFIG_XIVE=$(CONFIG_PSERIES)
>>  CONFIG_MEM_DEVICE=y
>>  CONFIG_DIMM=y
>>  CONFIG_SPAPR_RNG=y
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> new file mode 100644
>> index 000000000000..5fec4b08705d
>> --- /dev/null
>> +++ b/include/hw/ppc/xive.h
>> @@ -0,0 +1,130 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
> 
> A cheat sheet in the top of this header with the old and new XIVE
> terms would be quite nice to have.

Yes. It's a good place. I will put the XIVE acronyms here :
     
     EA		Event Assignment
     EISN	Effective Interrupt Source Number
     END	Event Notification Descriptor
     ESB	Event State Buffer
     EQ		Event Queue
     LISN	Logical Interrupt Source Number
     NVT	Notification Virtual Target
     TIMA	Thread Interrupt Management Area
     ...


>> + */
>> +
>> +#ifndef PPC_XIVE_H
>> +#define PPC_XIVE_H
>> +
>> +#include "hw/sysbus.h"
> 
> So, I'm a bit dubious about making the XiveSource a SysBus device -
> I'm concerned it won't play well with tying it into the other devices
> like PHB that "own" it in real hardware.

It does but I can take a look at changing it to a DeviceState. The 
reset handlers might be a concern.

> I think we'd be better off making it a direct descendent of
> TYPE_DEVICE which constructs the MMIO region, but doesn't map it.

At one point, I started working on a XiveESB object doing what I think
you are suggesting, and then I removed it. I am reluctant to add more
complexity now, the patchset is just growing and growing ...

But I agree there are fundamentals to get right for KVM. Let's talk 
about it after you have looked at the overall patchset, at least up 
to KVM initial support.

> Then we can have a SysBusDevice (and/or other) wrapper which
> instantiates the XiveSource core and maps it into somewhere
> accessible.

The XIVE controller model does the mapping of the source currently.
In the case of sPAPR, the controller model controls the TIMA and
for PowerNV, there are quite a few other MMIO regions to handle.

> 
>> +
>> +/*
>> + * XIVE Interrupt Source
>> + */
>> +
>> +#define TYPE_XIVE_SOURCE "xive-source"
>> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>> +
>> +/*
>> + * XIVE Interrupt Source characteristics, which define how the ESB are
>> + * controlled.
>> + */
>> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
>> +#define XIVE_SRC_STORE_EOI     0x2 /* Store EOI supported */
>> +
>> +typedef struct XiveSource {
>> +    SysBusDevice parent;
>> +
>> +    /* IRQs */
>> +    uint32_t        nr_irqs;
>> +    qemu_irq        *qirqs;
>> +
>> +    /* PQ bits */
>> +    uint8_t         *status;
>> +
>> +    /* ESB memory region */
>> +    uint64_t        esb_flags;
>> +    uint32_t        esb_shift;
>> +    MemoryRegion    esb_mmio;
>> +} XiveSource;
>> +
>> +/*
>> + * ESB MMIO setting. Can be one page, for both source triggering and
>> + * source management, or two different pages. See below for magic
>> + * values.
>> + */
>> +#define XIVE_ESB_4K          12 /* PSI HB only */
>> +#define XIVE_ESB_4K_2PAGE    13
>> +#define XIVE_ESB_64K         16
>> +#define XIVE_ESB_64K_2PAGE   17
>> +
>> +static inline bool xive_source_esb_has_2page(XiveSource *xsrc)
>> +{
>> +    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE ||
>> +        xsrc->esb_shift == XIVE_ESB_4K_2PAGE;
>> +}
>> +
>> +/* The trigger page is always the first/even page */
>> +static inline hwaddr xive_source_esb_page(XiveSource *xsrc, uint32_t srcno)
> 
> This function doesn't appear to be used anywhere except..

It's used in patch 16 adding the hcalls also.

>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    return (1ull << xsrc->esb_shift) * srcno;
>> +}
>> +
>> +/* In a two pages ESB MMIO setting, the odd page is for management */
>> +static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
> 
> 
> ..here, and this function doesn't appear to be used anywhere.

It's used in patch 16 adding the hcalls and patch 23 for KVM.

This is basic ESB support which I thought belonged in the patch on sources.
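
To make the arithmetic of these two helpers concrete, here is a
standalone sketch of the same computation.  The demo_* names are
hypothetical; the shift values and the formulas are taken from the
quoted header.  Each source owns 1 << esb_shift bytes of MMIO, and in a
two-page setting the management page is the second half of that window.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESB shift settings, as defined in the quoted header */
#define DEMO_ESB_4K          12
#define DEMO_ESB_4K_2PAGE    13
#define DEMO_ESB_64K         16
#define DEMO_ESB_64K_2PAGE   17

static bool demo_esb_has_2page(uint32_t esb_shift)
{
    return esb_shift == DEMO_ESB_64K_2PAGE ||
           esb_shift == DEMO_ESB_4K_2PAGE;
}

/* Offset of the first (trigger) page of source 'srcno' */
static uint64_t demo_esb_page(uint32_t esb_shift, uint32_t srcno)
{
    return (1ull << esb_shift) * srcno;
}

/* Offset of the management page: the odd page when there are two */
static uint64_t demo_esb_mgmt(uint32_t esb_shift, uint32_t srcno)
{
    uint64_t addr = demo_esb_page(esb_shift, srcno);

    if (demo_esb_has_2page(esb_shift)) {
        addr += 1ull << (esb_shift - 1);
    }
    return addr;
}
```

For source 3 with 64K pages in the two-page setting, the trigger page
sits at offset 0x60000 and the management page at 0x70000.  In the
one-page 64K setting both resolve to 0x30000.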
 
> 
>> +{
>> +    hwaddr addr = xive_source_esb_page(xsrc, srcno);
>> +
>> +    if (xive_source_esb_has_2page(xsrc)) {
>> +        addr += (1 << (xsrc->esb_shift - 1));
>> +    }
>> +
>> +    return addr;
>> +}
>> +
>> +/*
>> + * Each interrupt source has a 2-bit state machine which can be
>> + * controlled by MMIO. P indicates that an interrupt is pending (has
>> + * been sent to a queue and is waiting for an EOI). Q indicates that
>> + * the interrupt has been triggered while pending.
>> + *
>> + * This acts as a coalescing mechanism in order to guarantee that a
>> + * given interrupt only occurs at most once in a queue.
>> + *
>> + * When doing an EOI, the Q bit will indicate if the interrupt
>> + * needs to be re-triggered.
>> + */
>> +#define XIVE_ESB_VAL_P        0x2
>> +#define XIVE_ESB_VAL_Q        0x1
>> +
>> +#define XIVE_ESB_RESET        0x0
>> +#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
>> +#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
>> +#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
>> +
>> +/*
>> + * "magic" Event State Buffer (ESB) MMIO offsets.
>> + *
>> + * The following offsets into the ESB MMIO allow to read or manipulate
>> + * the PQ bits. They must be used with an 8-byte load instruction.
>> + * They all return the previous state of the interrupt (atomically).
>> + *
>> + * Additionally, some ESB pages support doing an EOI via a store and
>> + * some ESBs support doing a trigger via a separate trigger page.
>> + */
>> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
>> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
>> +#define XIVE_ESB_GET            0x800 /* Load */
>> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
>> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
>> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
>> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
>> +
>> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno);
>> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>> +
>> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset,
>> +                                Monitor *mon);
>> +
>> +static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    return xsrc->qirqs[srcno];
>> +}
>> +
>> +#endif /* PPC_XIVE_H */
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> new file mode 100644
>> index 000000000000..f7621f84828c
>> --- /dev/null
>> +++ b/hw/intc/xive.c
>> @@ -0,0 +1,379 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/dma.h"
>> +#include "monitor/monitor.h"
>> +#include "hw/ppc/xive.h"
>> +
>> +/*
>> + * XIVE ESB helpers
>> + */
>> +
>> +static uint8_t xive_esb_set(uint8_t *pq, uint8_t value)
>> +{
>> +    uint8_t old_pq = *pq & 0x3;
>> +
>> +    *pq &= ~0x3;
>> +    *pq |= value & 0x3;
>> +
>> +    return old_pq;
>> +}
>> +
>> +static bool xive_esb_trigger(uint8_t *pq)
>> +{
>> +    uint8_t old_pq = *pq & 0x3;
>> +
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +        xive_esb_set(pq, XIVE_ESB_PENDING);
>> +        return true;
>> +    case XIVE_ESB_PENDING:
>> +    case XIVE_ESB_QUEUED:
>> +        xive_esb_set(pq, XIVE_ESB_QUEUED);
>> +        return false;
>> +    case XIVE_ESB_OFF:
>> +        xive_esb_set(pq, XIVE_ESB_OFF);
>> +        return false;
>> +    default:
>> +         g_assert_not_reached();
>> +    }
>> +}
>> +
>> +static bool xive_esb_eoi(uint8_t *pq)
>> +{
>> +    uint8_t old_pq = *pq & 0x3;
>> +
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +    case XIVE_ESB_PENDING:
>> +        xive_esb_set(pq, XIVE_ESB_RESET);
>> +        return false;
>> +    case XIVE_ESB_QUEUED:
>> +        xive_esb_set(pq, XIVE_ESB_PENDING);
>> +        return true;
>> +    case XIVE_ESB_OFF:
>> +        xive_esb_set(pq, XIVE_ESB_OFF);
>> +        return false;
>> +    default:
>> +         g_assert_not_reached();
>> +    }
>> +}
>> +
>> +/*
>> + * XIVE Interrupt Source (or IVSE)
>> + */
>> +
>> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +
>> +    return xsrc->status[srcno] & 0x3;
>> +}
>> +
>> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +
>> +    return xive_esb_set(&xsrc->status[srcno], pq);
>> +}
>> +
>> +/*
>> + * Returns whether the event notification should be forwarded.
>> + */
>> +static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +
>> +    return xive_esb_trigger(&xsrc->status[srcno]);
>> +}
>> +
>> +/*
>> + * Returns whether the event notification should be forwarded.
>> + */
>> +static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +
>> +    return xive_esb_eoi(&xsrc->status[srcno]);
>> +}
>> +
>> +/*
>> + * Forward the source event notification to the Router
>> + */
>> +static void xive_source_notify(XiveSource *xsrc, int srcno)
>> +{
>> +
>> +}
>> +
>> +/*
>> + * In a two pages ESB MMIO setting, even page is the trigger page, odd
>> + * page is for management
>> + */
>> +static inline bool addr_is_even(hwaddr addr, uint32_t shift)
>> +{
>> +    return !((addr >> shift) & 1);
>> +}
>> +
>> +static inline bool xive_source_is_trigger_page(XiveSource *xsrc, hwaddr addr)
>> +{
>> +    return xive_source_esb_has_2page(xsrc) &&
>> +        addr_is_even(addr, xsrc->esb_shift - 1);
>> +}
>> +
>> +/*
>> + * ESB MMIO loads
>> + *                      Trigger page    Management/EOI page
>> + * 2 pages setting      even            odd
>> + *
>> + * 0x000 .. 0x3FF       -1              EOI and return 0|1
>> + * 0x400 .. 0x7FF       -1              EOI and return 0|1
>> + * 0x800 .. 0xBFF       -1              return PQ
>> + * 0xC00 .. 0xCFF       -1              return PQ and atomically PQ=00
>> + * 0xD00 .. 0xDFF       -1              return PQ and atomically PQ=01
>> + * 0xE00 .. 0xEFF       -1              return PQ and atomically PQ=10
>> + * 0xF00 .. 0xFFF       -1              return PQ and atomically PQ=11
>> + */
> 
> I can't quite make sense of this table.  What do the -1s represent,

the value returned by the load.

> and how does it relate to the non-2page case?

A one-page ESB supports trigger and management on the same page, so for
loads the odd page behavior applies.
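
A small sketch of how the page kind falls out of the address may help
here.  The demo_* names are hypothetical; the checks mirror
addr_is_even() and xive_source_is_trigger_page() from the patch.  Only a
two-page setting has a trigger page at all, so in the one-page case
every load is handled as a management access.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_ESB_4K_2PAGE    13
#define DEMO_ESB_64K         16
#define DEMO_ESB_64K_2PAGE   17

static bool demo_addr_is_even(uint64_t addr, uint32_t shift)
{
    return !((addr >> shift) & 1);
}

/* True only for the even page of a two-page ESB setting */
static bool demo_is_trigger_page(uint32_t esb_shift, uint64_t addr)
{
    bool two_pages = esb_shift == DEMO_ESB_64K_2PAGE ||
                     esb_shift == DEMO_ESB_4K_2PAGE;

    /* Each page of the pair covers 1 << (esb_shift - 1) bytes */
    return two_pages && demo_addr_is_even(addr, esb_shift - 1);
}

/* Source number owning the address: one window per source */
static uint32_t demo_srcno(uint32_t esb_shift, uint64_t addr)
{
    return (uint32_t)(addr >> esb_shift);
}
```

A load at 0x60000 in the 64K two-page setting hits the trigger page of
source 3 and returns -1, while a load at 0x70800 hits the management
page of the same source and returns the PQ bits.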

>> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>> +    uint32_t offset = addr & 0xFFF;
>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>> +    uint64_t ret = -1;
>> +
>> +    /* In a two pages ESB MMIO setting, trigger page should not be read */
>> +    if (xive_source_is_trigger_page(xsrc, addr)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR,
>> +                      "XIVE: invalid load on IRQ %d trigger page at "
>> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
>> +        return -1;
>> +    }
>> +
>> +    switch (offset) {
>> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
>> +        ret = xive_source_esb_eoi(xsrc, srcno);
>> +
>> +        /* Forward the source event notification for routing */
>> +        if (ret) {
>> +            xive_source_notify(xsrc, srcno);
>> +        }
>> +        break;
>> +
>> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
>> +        ret = xive_source_esb_get(xsrc, srcno);
>> +        break;
>> +
>> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
>> +        ret = xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB load addr %x\n",
>> +                      offset);
>> +    }
>> +
>> +    return ret;
>> +}
>> +
>> +/*
>> + * ESB MMIO stores
>> + *                      Trigger page    Management/EOI page
>> + * 2 pages setting      even            odd
> 
> As with the previous table, I don't quite understand what the headings
> above mean.

A one-page ESB supports both trigger and management on the same page, so
for stores the odd-page (Management/EOI) behavior applies.

The headings can be improved. I will think of something.
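
To make the management-page column concrete, here is a minimal sketch of
the store decoding for that page. The enum and function names are
hypothetical; only the offset ranges and the (offset >> 8) & 0x3
extraction are taken from the patch.

```c
#include <stdint.h>

/* Hypothetical names for the operations of the Management/EOI page */
typedef enum {
    ESB_ST_TRIGGER,    /* 0x000 .. 0x3FF */
    ESB_ST_EOI,        /* 0x400 .. 0x7FF */
    ESB_ST_UNDEFINED,  /* 0x800 .. 0xBFF */
    ESB_ST_SET_PQ,     /* 0xC00 .. 0xFFF */
} EsbStoreOp;

/*
 * Decode a store on the management page. For ESB_ST_SET_PQ, the new
 * PQ value (0..3) is returned through *pq; otherwise *pq is untouched.
 */
static EsbStoreOp esb_store_decode(uint32_t addr, uint8_t *pq)
{
    uint32_t offset = addr & 0xFFF;

    if (offset <= 0x3FF) {
        return ESB_ST_TRIGGER;
    }
    if (offset <= 0x7FF) {
        return ESB_ST_EOI;
    }
    if (offset <= 0xBFF) {
        return ESB_ST_UNDEFINED;
    }
    *pq = (offset >> 8) & 0x3;
    return ESB_ST_SET_PQ;
}
```

On a trigger page (two-page setting, even page) none of this decoding
applies: any store is a trigger.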

>> + * 0x000 .. 0x3FF       Trigger         Trigger
>> + * 0x400 .. 0x7FF       Trigger         EOI
>> + * 0x800 .. 0xBFF       Trigger         undefined
>> + * 0xC00 .. 0xCFF       Trigger         PQ=00
>> + * 0xD00 .. 0xDFF       Trigger         PQ=01
>> + * 0xE00 .. 0xEFF       Trigger         PQ=10
>> + * 0xF00 .. 0xFFF       Trigger         PQ=11
>> + */
>> +static void xive_source_esb_write(void *opaque, hwaddr addr,
>> +                                  uint64_t value, unsigned size)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>> +    uint32_t offset = addr & 0xFFF;
>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>> +    bool notify = false;
>> +
>> +    /* In a two pages ESB MMIO setting, trigger page only triggers */
>> +    if (xive_source_is_trigger_page(xsrc, addr)) {
>> +        notify = xive_source_esb_trigger(xsrc, srcno);
>> +        goto out;
>> +    }
>> +
>> +    switch (offset) {
>> +    case 0 ... 0x3FF:
>> +        notify = xive_source_esb_trigger(xsrc, srcno);
>> +        break;
>> +
>> +    case XIVE_ESB_STORE_EOI ... XIVE_ESB_STORE_EOI + 0x3FF:
>> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
>> +            qemu_log_mask(LOG_GUEST_ERROR,
>> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
>> +            return;
>> +        }
>> +
>> +        notify = xive_source_esb_eoi(xsrc, srcno);
>> +        break;
>> +
>> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
>> +        xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
>> +        break;
>> +
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %x\n",
>> +                      offset);
>> +        return;
>> +    }
>> +
>> +out:
>> +    /* Forward the source event notification for routing */
>> +    if (notify) {
>> +        xive_source_notify(xsrc, srcno);
>> +    }
>> +}
>> +
>> +static const MemoryRegionOps xive_source_esb_ops = {
>> +    .read = xive_source_esb_read,
>> +    .write = xive_source_esb_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static void xive_source_set_irq(void *opaque, int srcno, int val)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>> +    bool notify = false;
>> +
>> +    if (val) {
>> +        notify = xive_source_esb_trigger(xsrc, srcno);
>> +    }
>> +
>> +    /* Forward the source event notification for routing */
>> +    if (notify) {
>> +        xive_source_notify(xsrc, srcno);
>> +    }
>> +}
>> +
>> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        uint8_t pq = xive_source_esb_get(xsrc, i);
>> +
>> +        if (pq == XIVE_ESB_OFF) {
>> +            continue;
>> +        }
>> +
>> +        monitor_printf(mon, "  %08x %c%c\n", i + offset,
>> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>> +    }
>> +}
>> +
>> +static void xive_source_reset(DeviceState *dev)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +
>> +    /* PQs are initialized to 0b01 which corresponds to "ints off" */
>> +    memset(xsrc->status, 0x1, xsrc->nr_irqs);
> 
> You've already got XIVE_ESB_OFF defined to make this a little clearer.

Sure.
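
Something along these lines, as a standalone sketch. ESB_OFF and
source_reset_status() are stand-ins for XIVE_ESB_OFF and the body of
xive_source_reset(); the 0x1 value is the 0b01 "ints off" encoding from
the comment in the patch.

```c
#include <stdint.h>
#include <string.h>

#define ESB_OFF 0x1   /* P=0, Q=1: "ints off" */

/* Reset every source to the OFF state using the named constant */
static void source_reset_status(uint8_t *status, size_t nr_irqs)
{
    memset(status, ESB_OFF, nr_irqs);
}
```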

Thanks,

C. 


> 
>> +}
>> +
>> +static void xive_source_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +
>> +    if (!xsrc->nr_irqs) {
>> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
>> +        return;
>> +    }
>> +
>> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
>> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
>> +        xsrc->esb_shift != XIVE_ESB_64K &&
>> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
>> +        error_setg(errp, "Invalid ESB shift setting");
>> +        return;
>> +    }
>> +
>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>> +                                     xsrc->nr_irqs);
>> +
>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
>> +
>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>> +                          &xive_source_esb_ops, xsrc, "xive.esb",
>> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>> +}
>> +
>> +static const VMStateDescription vmstate_xive_source = {
>> +    .name = TYPE_XIVE_SOURCE,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
>> +        VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +/*
>> + * The default XIVE interrupt source setting for the ESB MMIOs is two
>> + * 64k pages without Store EOI, to be in sync with KVM.
>> + */
>> +static Property xive_source_properties[] = {
>> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void xive_source_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->desc    = "XIVE Interrupt Source";
>> +    dc->props   = xive_source_properties;
>> +    dc->realize = xive_source_realize;
>> +    dc->reset   = xive_source_reset;
>> +    dc->vmsd    = &vmstate_xive_source;
>> +}
>> +
>> +static const TypeInfo xive_source_info = {
>> +    .name          = TYPE_XIVE_SOURCE,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(XiveSource),
>> +    .class_init    = xive_source_class_init,
>> +};
>> +
>> +static void xive_register_types(void)
>> +{
>> +    type_register_static(&xive_source_info);
>> +}
>> +
>> +type_init(xive_register_types)
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index 0e9963f5eecc..72a46ed91c31 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>>  obj-$(CONFIG_XICS) += xics.o
>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>> +obj-$(CONFIG_XIVE) += xive.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> 


* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-22  3:19   ` David Gibson
@ 2018-11-22  7:39     ` Cédric Le Goater
  2018-11-23  1:08       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-22  7:39 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/22/18 4:19 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:56:55AM +0100, Cédric Le Goater wrote:
>> The 'sent' status of the LSI interrupt source is modeled with the 'P'
>> bit of the ESB and the assertion status of the source is maintained in
>> an array under the main sPAPRXive object. The type of the source is
>> stored in the same array for practical reasons.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> Looks good except for some minor details.
> 
>> ---
>>  include/hw/ppc/xive.h | 20 ++++++++++++-
>>  hw/intc/xive.c        | 68 +++++++++++++++++++++++++++++++++++++++----
>>  2 files changed, 81 insertions(+), 7 deletions(-)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 5fec4b08705d..e118acd59f1e 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -32,8 +32,10 @@ typedef struct XiveSource {
>>      /* IRQs */
>>      uint32_t        nr_irqs;
>>      qemu_irq        *qirqs;
>> +    unsigned long   *lsi_map;
>> +    int32_t         lsi_map_size; /* for VMSTATE_BITMAP */
> 
> At some point it's possible we'll want XiveSource subclasses that just
> know which irqs are LSI and which aren't without an explicit map.  But
> this detail isn't exposed in the migration stream or the user
> interface, so we can tweak it later as necessary.
> 
>> -    /* PQ bits */
>> +    /* PQ bits and LSI assertion bit */
>>      uint8_t         *status;
>>  
>>      /* ESB memory region */
>> @@ -89,6 +91,7 @@ static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
>>   * When doing an EOI, the Q bit will indicate if the interrupt
>>   * needs to be re-triggered.
>>   */
>> +#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
>>  #define XIVE_ESB_VAL_P        0x2
>>  #define XIVE_ESB_VAL_Q        0x1
>>  
>> @@ -127,4 +130,19 @@ static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
>>      return xsrc->qirqs[srcno];
>>  }
>>  
>> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    return test_bit(srcno, xsrc->lsi_map);
>> +}
>> +
>> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>> +                                       bool lsi)
> 
> The function name doesn't make it obvious that this controls the LSI
> configuration. '..._irq_set_lsi' maybe?

yes.


>> +{
>> +    assert(srcno < xsrc->nr_irqs);
>> +    if (lsi) {
>> +        bitmap_set(xsrc->lsi_map, srcno, 1);
>> +    }
>> +}
>> +
>>  #endif /* PPC_XIVE_H */
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index f7621f84828c..ac4605fee8b7 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -88,14 +88,40 @@ uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
>>      return xive_esb_set(&xsrc->status[srcno], pq);
>>  }
>>  
>> +/*
>> + * Returns whether the event notification should be forwarded.
>> + */
>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
> 
> What exactly "trigger" means isn't entirely obvious for an LSI.  Might
> be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.

This is called only when the interrupt is asserted. So it is a 
simplified LSI trigger depending only on the 'P' bit.
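
To spell out the intended LSI behavior, here is a simplified standalone
sketch, not the patch itself. The names are shortened stand-ins, the PQ
encodings (RESET=0b00, OFF=0b01, PENDING=0b10) are assumed from the P and
Q bit values, and the Q bit handling of MSIs is deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define ESB_PQ_MASK      0x3
#define ESB_RESET        0x0   /* P=0, Q=0 */
#define ESB_OFF          0x1   /* P=0, Q=1 */
#define ESB_PENDING      0x2   /* P=1, Q=0 */
#define STATUS_ASSERTED  0x4   /* extra bit: LSI line level */

static void set_pq(uint8_t *status, uint8_t pq)
{
    *status = (uint8_t)((*status & ~ESB_PQ_MASK) | pq);
}

/* Forward an event only if P is clear (RESET); never sets Q */
static bool lsi_trigger(uint8_t *status)
{
    if ((*status & ESB_PQ_MASK) == ESB_RESET) {
        set_pq(status, ESB_PENDING);
        return true;
    }
    return false;
}

/* Line level change: record the level, trigger only on assertion */
static bool lsi_set_level(uint8_t *status, bool level)
{
    if (level) {
        *status |= STATUS_ASSERTED;
        return lsi_trigger(status);
    }
    *status &= (uint8_t)~STATUS_ASSERTED;
    return false;
}

/* EOI: clear P, then re-trigger if the line is still asserted */
static bool lsi_eoi(uint8_t *status)
{
    if ((*status & ESB_PQ_MASK) == ESB_PENDING) {
        set_pq(status, ESB_RESET);
    }
    if (*status & STATUS_ASSERTED) {
        return lsi_trigger(status);
    }
    return false;
}
```

The deassert path only clears the level bit, which is why a separate
"lsi_deassert" helper would have nothing else to do in this model.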
 
> 
>> +{
>> +    uint8_t old_pq = xive_source_esb_get(xsrc, srcno);
>> +
>> +    switch (old_pq) {
>> +    case XIVE_ESB_RESET:
>> +        xive_source_esb_set(xsrc, srcno, XIVE_ESB_PENDING);
>> +        return true;
>> +    default:
>> +        return false;
>> +    }
>> +}
>> +
>>  /*
>>   * Returns whether the event notification should be forwarded.
>>   */
>>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>>  {
>> +    bool ret;
>> +
>>      assert(srcno < xsrc->nr_irqs);
>>  
>> -    return xive_esb_trigger(&xsrc->status[srcno]);
>> +    ret = xive_esb_trigger(&xsrc->status[srcno]);
>> +
>> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
>> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
>> +        qemu_log_mask(LOG_GUEST_ERROR,
>> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
>> +    }
>> +
>> +    return ret;
>>  }
>>  
>>  /*
>> @@ -103,9 +129,22 @@ static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>>   */
>>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>>  {
>> +    bool ret;
>> +
>>      assert(srcno < xsrc->nr_irqs);
>>  
>> -    return xive_esb_eoi(&xsrc->status[srcno]);
>> +    ret = xive_esb_eoi(&xsrc->status[srcno]);
>> +
>> +    /* LSI sources do not set the Q bit but they can still be
>> +     * asserted, in which case we should forward a new event
>> +     * notification
>> +     */
>> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
>> +        xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
>> +        ret = xive_source_lsi_trigger(xsrc, srcno);
>> +    }
>> +
>> +    return ret;
>>  }
>>  
>>  /*
>> @@ -268,8 +307,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>>      bool notify = false;
>>  
>> -    if (val) {
>> -        notify = xive_source_esb_trigger(xsrc, srcno);
>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>> +        if (val) {
>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
>> +            notify = xive_source_lsi_trigger(xsrc, srcno);
>> +        } else {
>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
>> +        }
>> +    } else {
>> +        if (val) {
>> +            notify = xive_source_esb_trigger(xsrc, srcno);
>> +        }
>>      }
>>  
>>      /* Forward the source event notification for routing */
>> @@ -289,9 +337,11 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>>              continue;
>>          }
>>  
>> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
>> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
>> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
>>      }
>>  }
>>  
>> @@ -299,6 +349,8 @@ static void xive_source_reset(DeviceState *dev)
>>  {
>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>  
>> +    /* Do not clear the LSI bitmap */
>> +
>>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
>>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>>  }
>> @@ -325,6 +377,9 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>>  
>>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>>  
>> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>> +    xsrc->lsi_map_size = xsrc->nr_irqs;
>> +
>>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>                            &xive_source_esb_ops, xsrc, "xive.esb",
>>                            (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>> @@ -338,6 +393,7 @@ static const VMStateDescription vmstate_xive_source = {
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
>>          VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
>> +        VMSTATE_BITMAP(lsi_map, XiveSource, 1, lsi_map_size),
> 
> This shouldn't be here.  The lsi_map is all set up at machine
> configuration time and then static, so it doesn't need to be migrated.

yes. of course ... I will get rid of it.

Thanks,

C. 
> 
>>          VMSTATE_END_OF_LIST()
>>      },
>>  };
> 


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-22  4:11   ` David Gibson
@ 2018-11-22  7:53     ` Cédric Le Goater
  2018-11-23  3:50       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-22  7:53 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 9407 bytes --]

On 11/22/18 5:11 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater wrote:
>> The XiveRouter models the second sub-engine of the overall XIVE
>> architecture : the Interrupt Virtualization Routing Engine (IVRE).
>>
>> The IVRE handles event notifications of the IVSE through MMIO stores
>> and performs the interrupt routing process. For this purpose, it uses
>> a set of tables stored in system memory, the first of which is the
>> Event Assignment Structure (EAS) table.
>>
>> The EAT associates an interrupt source number with an Event Notification
>> Descriptor (END) which will be used in a second phase of the routing
>> process to identify a Notification Virtual Target.
>>
>> The XiveRouter is an abstract class which needs to be inherited from
>> to define a storage for the EAT, and other upcoming tables. The
>> 'chip-id' attribute is not strictly necessary for the sPAPR and
>> PowerNV machines but it's a good way to test the routing algorithm.
>> Without this attribute, the XiveRouter could be a simple QOM
>> interface.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/xive.h      | 32 ++++++++++++++
>>  include/hw/ppc/xive_regs.h | 31 ++++++++++++++
>>  hw/intc/xive.c             | 86 ++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 149 insertions(+)
>>  create mode 100644 include/hw/ppc/xive_regs.h
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index be93fae6317b..5a0696366577 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -11,6 +11,7 @@
>>  #define PPC_XIVE_H
>>  
>>  #include "hw/sysbus.h"
> 
> Again, I don't think making this a SysBusDevice is quite right.
> Even more so for the router than the source, because at least for PAPR
> it might not have any MMIO presence at all.

The controller model inherits from the XiveRouter and manages the TIMA.

>> +#include "hw/ppc/xive_regs.h"
>>  
>>  /*
>>   * XIVE Fabric (Interface between Source and Router)
>> @@ -168,4 +169,35 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>      }
>>  }
>>  
>> +/*
>> + * XIVE Router
>> + */
>> +
>> +typedef struct XiveRouter {
>> +    SysBusDevice    parent;
>> +
>> +    uint32_t        chip_id;
> 
> I don't think this belongs in the base class.  The PowerNV specific
> variants will need it, but it doesn't make sense for the PAPR version.

Yeah. I am using it as an END and NVT block identifier, but it's not
required for sPAPR, where it could just be zero.

It was useful for testing the routing algorithm, which should not assume
that the block id is zero.
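
For reference, the block id travels in the EAS word next to the END index,
so the router can read it from there instead of assuming zero. A
standalone sketch of the field extraction follows; BIT_BE(), BITMASK_BE(),
get_field() and set_field() are local stand-ins written for this example
(IBM bit numbering, bit 0 being the MSB), not the QEMU macros themselves.

```c
#include <stdint.h>

/* IBM (big-endian) bit numbering on a 64-bit word: bit 0 is the MSB */
#define BIT_BE(b)         (0x8000000000000000ULL >> (b))
#define BITMASK_BE(s, e)  ((BIT_BE(s) - BIT_BE(e)) | BIT_BE(s))

#define EAS_VALID      BIT_BE(0)
#define EAS_END_BLOCK  BITMASK_BE(4, 7)    /* Destination END block# */
#define EAS_END_INDEX  BITMASK_BE(8, 31)   /* Destination END index */
#define EAS_MASKED     BIT_BE(32)
#define EAS_END_DATA   BITMASK_BE(33, 63)  /* Data written to the END */

/* Extract a field: mask it, then divide by the mask's lowest set bit */
static uint64_t get_field(uint64_t mask, uint64_t word)
{
    uint64_t lsb = mask & (~mask + 1);

    return (word & mask) / lsb;
}

/* Replace a field, leaving the other bits of the word untouched */
static uint64_t set_field(uint64_t mask, uint64_t word, uint64_t value)
{
    uint64_t lsb = mask & (~mask + 1);

    return (word & ~mask) | ((value * lsb) & mask);
}
```

A test EAS with a non-zero END block is then a one-liner with
set_field(EAS_END_BLOCK, w, blk), which is how a non-zero chip id
exercises the routing path.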
 
> 
>> +} XiveRouter;
>> +
>> +#define TYPE_XIVE_ROUTER "xive-router"
>> +#define XIVE_ROUTER(obj)                                \
>> +    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
>> +#define XIVE_ROUTER_CLASS(klass)                                        \
>> +    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
>> +#define XIVE_ROUTER_GET_CLASS(obj)                              \
>> +    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
>> +
>> +typedef struct XiveRouterClass {
>> +    SysBusDeviceClass parent;
>> +
>> +    /* XIVE table accessors */
>> +    int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>> +    int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>> +} XiveRouterClass;
>> +
>> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>> +
>> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>> +
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> new file mode 100644
>> index 000000000000..12499b33614c
>> --- /dev/null
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -0,0 +1,31 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2016-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_XIVE_REGS_H
>> +#define PPC_XIVE_REGS_H
>> +
>> +/* EAS (Event Assignment Structure)
>> + *
>> + * One per interrupt source. Targets an interrupt to a given Event
>> + * Notification Descriptor (END) and provides the corresponding
>> + * logical interrupt number (END data)
>> + */
>> +typedef struct XiveEAS {
>> +        /* Use a single 64-bit definition to make it easier to
>> +         * perform atomic updates
>> +         */
>> +        uint64_t        w;
>> +#define EAS_VALID       PPC_BIT(0)
>> +#define EAS_END_BLOCK   PPC_BITMASK(4, 7)        /* Destination END block# */
>> +#define EAS_END_INDEX   PPC_BITMASK(8, 31)       /* Destination END index */
>> +#define EAS_MASKED      PPC_BIT(32)              /* Masked */
>> +#define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
>> +} XiveEAS;
>> +
>> +#endif /* PPC_XIVE_REGS_H */
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 014a2e41f71f..c4c90a25758e 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -442,6 +442,91 @@ static const TypeInfo xive_source_info = {
>>      .class_init    = xive_source_class_init,
>>  };
>>  
>> +/*
>> + * XIVE Router (aka. Virtualization Controller or IVRE)
>> + */
>> +
>> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>> +{
>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>> +
>> +    return xrc->get_eas(xrtr, lisn, eas);
>> +}
>> +
>> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>> +{
>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>> +
>> +    return xrc->set_eas(xrtr, lisn, eas);
>> +}
>> +
>> +static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>> +{
>> +    XiveRouter *xrtr = XIVE_ROUTER(xf);
>> +    XiveEAS eas;
>> +
>> +    /* EAS cache lookup */
>> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Unknown LISN %x\n", lisn);
>> +        return;
>> +    }
> 
> AFAICT a bad LISN here means a qemu error (in the source, probably),
> not a user or guest error, so an assert() would be more appropriate.

Hmm, I would say no: in the case of PowerNV, the firmware could have
misconfigured the ISN offset of a source, which would then notify the
router with bad event notification data.
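
In other words, the notify path treats an out-of-range LISN as an input
error to report rather than an internal invariant to assert. A toy
standalone model of the checks is below; EatRouter, eat_get_eas() and the
result enum are hypothetical, and the two bit definitions are PPC_BIT(0)
and PPC_BIT(32) written out by hand.

```c
#include <stdint.h>

#define EAS_VALID   (1ULL << 63)   /* PPC_BIT(0) */
#define EAS_MASKED  (1ULL << 31)   /* PPC_BIT(32) */

typedef enum {
    NOTIFY_UNKNOWN_LISN,  /* lookup failed: log a guest error */
    NOTIFY_INVALID,       /* entry present but not valid */
    NOTIFY_MASKED,        /* notification completed, nothing to route */
    NOTIFY_FORWARD,       /* would go on to the END lookup */
} NotifyResult;

/* Toy router backed by an in-memory EAS table */
typedef struct {
    const uint64_t *eat;
    uint32_t        nr_entries;
} EatRouter;

static int eat_get_eas(const EatRouter *r, uint32_t lisn, uint64_t *eas)
{
    if (lisn >= r->nr_entries) {
        return -1;
    }
    *eas = r->eat[lisn];
    return 0;
}

static NotifyResult eat_router_notify(const EatRouter *r, uint32_t lisn)
{
    uint64_t eas;

    if (eat_get_eas(r, lisn, &eas)) {
        return NOTIFY_UNKNOWN_LISN;   /* no assert: input may be bad */
    }
    if (!(eas & EAS_VALID)) {
        return NOTIFY_INVALID;
    }
    if (eas & EAS_MASKED) {
        return NOTIFY_MASKED;
    }
    return NOTIFY_FORWARD;
}
```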


>> +
>> +    /* The IVRE has a State Bit Cache for its internal sources which
>> +     * is also involved at this point. We skip the SBC lookup because
>> +     * the state bits of the sources are modeled internally in QEMU.
>> +     */
>> +
>> +    if (!(eas.w & EAS_VALID)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
>> +        return;
>> +    }
>> +
>> +    if (eas.w & EAS_MASKED) {
>> +        /* Notification completed */
>> +        return;
>> +    }
>> +}
>> +
>> +static Property xive_router_properties[] = {
>> +    DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void xive_router_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
>> +
>> +    dc->desc    = "XIVE Router Engine";
>> +    dc->props   = xive_router_properties;
>> +    xfc->notify = xive_router_notify;
>> +}
>> +
>> +static const TypeInfo xive_router_info = {
>> +    .name          = TYPE_XIVE_ROUTER,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .abstract      = true,
>> +    .class_size    = sizeof(XiveRouterClass),
>> +    .class_init    = xive_router_class_init,
>> +    .interfaces    = (InterfaceInfo[]) {
>> +        { TYPE_XIVE_FABRIC },
> 
> So as far as I can see so far, the XiveFabric interface will
> essentially have to be implemented on the router object, so I'm not
> seeing much point to having the interface rather than just a direct
> call on the router object.  But I haven't read the whole series yet,
> so maybe I'm missing something.

The PSIHB and PHB4 models are using it but they are not in the series.

I can send the PSIHB patch in the next version if you like, it's the 
patch right after PnvXive. It's attached below for the moment. Look at 
pnv_psi_notify().
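
The general shape of that usage, reduced to a toy model, is sketched
below. Every type and function name here is hypothetical and the ISN
offset is an assumed per-device field; this only illustrates why the
notify hook sits on an interface that a device other than the router can
implement.

```c
#include <stdint.h>

/* Hypothetical fabric interface: a single notify hook */
typedef struct Fabric Fabric;
struct Fabric {
    void (*notify)(Fabric *xf, uint32_t lisn);
};

/* Toy router: records what it was notified with */
typedef struct {
    Fabric   fabric;      /* first member, so a Fabric* can be cast back */
    uint32_t last_lisn;
    unsigned count;
} ToyRouter;

static void toy_router_notify(Fabric *xf, uint32_t lisn)
{
    ToyRouter *r = (ToyRouter *)xf;

    r->last_lisn = lisn;
    r->count++;
}

/* Toy bridge device owning an interrupt source */
typedef struct {
    Fabric   fabric;      /* first member, same reason as above */
    Fabric  *router;      /* where translated events are forwarded */
    uint32_t isn_offset;  /* assumed: base LISN of this device's sources */
} ToyBridge;

/* The owning device turns a local source number into a global LISN */
static void toy_bridge_notify(Fabric *xf, uint32_t srcno)
{
    ToyBridge *b = (ToyBridge *)xf;

    b->router->notify(b->router, b->isn_offset + srcno);
}

/* The source only knows its owner through the fabric interface */
static void toy_source_trigger(Fabric *owner, uint32_t srcno)
{
    owner->notify(owner, srcno);
}
```

When the source is owned directly by the router, the same call lands in
the router's own notify handler with no translation step.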

Thanks,

C.

>> +        { }
>> +    }
>> +};
>> +
>> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>> +{
>> +    if (!(eas->w & EAS_VALID)) {
>> +        return;
>> +    }
>> +
>> +    monitor_printf(mon, "  %08x %s end:%02x/%04x data:%08x\n",
>> +                   lisn, eas->w & EAS_MASKED ? "M" : " ",
>> +                   (uint8_t)  GETFIELD(EAS_END_BLOCK, eas->w),
>> +                   (uint32_t) GETFIELD(EAS_END_INDEX, eas->w),
>> +                   (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>> +}
>> +
>>  /*
>>   * XIVE Fabric
>>   */
>> @@ -455,6 +540,7 @@ static void xive_register_types(void)
>>  {
>>      type_register_static(&xive_source_info);
>>      type_register_static(&xive_fabric_info);
>> +    type_register_static(&xive_router_info);
>>  }
>>  
>>  type_init(xive_register_types)
> 


[-- Warning: decoded text below may be mangled, UTF-8 assumed --]
[-- Attachment #2: 0001-ppc-pnv-add-a-PSI-bridge-model-for-POWER9-processor.patch --]
[-- Type: text/x-patch; name="0001-ppc-pnv-add-a-PSI-bridge-model-for-POWER9-processor.patch", Size: 25041 bytes --]

From 680fd6ff7c99e669708fbc5cfdbfcd95e83e7c07 Mon Sep 17 00:00:00 2001
From: Cédric Le Goater <clg@kaod.org>
Date: Wed, 21 Nov 2018 10:29:45 +0100
Subject: [PATCH] ppc/pnv: add a PSI bridge model for POWER9 processor
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The PSI bridge on POWER9 is very similar to POWER8. The BAR is still
set through XSCOM but the controls are now entirely done with MMIOs.
More interrupts are defined and the interrupt controller interface has
changed to XIVE. The POWER9 model is a first example of the usage of
the notify() handler of the XiveFabric interface, linking the PSI
XiveSource to its owning device model.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/hw/ppc/pnv.h       |   6 +
 include/hw/ppc/pnv_psi.h   |  50 ++++-
 include/hw/ppc/pnv_xscom.h |   3 +
 hw/ppc/pnv.c               |  20 +-
 hw/ppc/pnv_psi.c           | 390 ++++++++++++++++++++++++++++++++++---
 5 files changed, 444 insertions(+), 25 deletions(-)

diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
index c402e5d5844b..8be1147481f9 100644
--- a/include/hw/ppc/pnv.h
+++ b/include/hw/ppc/pnv.h
@@ -88,6 +88,7 @@ typedef struct Pnv9Chip {
 
     /*< public >*/
     PnvXive      xive;
+    PnvPsi       psi;
 } Pnv9Chip;
 
 typedef struct PnvChipClass {
@@ -250,11 +251,16 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
 #define PNV9_XIVE_PC_SIZE            0x0000001000000000ull
 #define PNV9_XIVE_PC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006018000000000ull)
 
+#define PNV9_PSIHB_SIZE              0x0000000000100000ull
+#define PNV9_PSIHB_BASE(chip)        PNV9_CHIP_BASE(chip, 0x0006030203000000ull)
+
 #define PNV9_XIVE_IC_SIZE            0x0000000000080000ull
 #define PNV9_XIVE_IC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203100000ull)
 
 #define PNV9_XIVE_TM_SIZE            0x0000000000040000ull
 #define PNV9_XIVE_TM_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203180000ull)
 
+#define PNV9_PSIHB_ESB_SIZE          0x0000000000010000ull
+#define PNV9_PSIHB_ESB_BASE(chip)    PNV9_CHIP_BASE(chip, 0x00060302031c0000ull)
 
 #endif /* _PPC_PNV_H */
diff --git a/include/hw/ppc/pnv_psi.h b/include/hw/ppc/pnv_psi.h
index f6af5eae1fa8..b8f8d082bcf9 100644
--- a/include/hw/ppc/pnv_psi.h
+++ b/include/hw/ppc/pnv_psi.h
@@ -21,10 +21,35 @@
 
 #include "hw/sysbus.h"
 #include "hw/ppc/xics.h"
+#include "hw/ppc/xive.h"
 
 #define TYPE_PNV_PSI "pnv-psi"
 #define PNV_PSI(obj) \
      OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI)
+#define PNV_PSI_CLASS(klass) \
+     OBJECT_CLASS_CHECK(PnvPsiClass, (klass), TYPE_PNV_PSI)
+#define PNV_PSI_GET_CLASS(obj) \
+     OBJECT_GET_CLASS(PnvPsiClass, (obj), TYPE_PNV_PSI)
+
+typedef struct PnvPsi PnvPsi;
+typedef struct PnvChip PnvChip;
+typedef struct PnvPsiClass {
+    SysBusDeviceClass parent_class;
+
+    int chip_type;
+    uint32_t xscom_pcba;
+    uint32_t xscom_size;
+
+    void (*irq_set)(PnvPsi *psi, int, bool state);
+} PnvPsiClass;
+
+#define TYPE_PNV_PSI_POWER8 TYPE_PNV_PSI "-POWER8"
+#define PNV_PSI_POWER8(obj) \
+    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER8)
+
+#define TYPE_PNV_PSI_POWER9 TYPE_PNV_PSI "-POWER9"
+#define PNV_PSI_POWER9(obj) \
+    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER9)
 
 #define PSIHB_XSCOM_MAX         0x20
 
@@ -38,9 +63,12 @@ typedef struct PnvPsi {
     /* MemoryRegion fsp_mr; */
     uint64_t fsp_bar;
 
-    /* Interrupt generation */
+    /* P8 Interrupt generation */
     ICSState ics;
 
+    /* P9 Interrupt generation */
+    XiveSource source;
+
     /* Registers */
     uint64_t regs[PSIHB_XSCOM_MAX];
 
@@ -60,6 +88,24 @@ typedef enum PnvPsiIrq {
 
 #define PSI_NUM_INTERRUPTS 6
 
-extern void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state);
+/* P9 PSI Interrupts */
+#define PSIHB9_IRQ_PSI          0
+#define PSIHB9_IRQ_OCC          1
+#define PSIHB9_IRQ_FSI          2
+#define PSIHB9_IRQ_LPCHC        3
+#define PSIHB9_IRQ_LOCAL_ERR    4
+#define PSIHB9_IRQ_GLOBAL_ERR   5
+#define PSIHB9_IRQ_TPM          6
+#define PSIHB9_IRQ_LPC_SIRQ0    7
+#define PSIHB9_IRQ_LPC_SIRQ1    8
+#define PSIHB9_IRQ_LPC_SIRQ2    9
+#define PSIHB9_IRQ_LPC_SIRQ3    10
+#define PSIHB9_IRQ_SBE_I2C      11
+#define PSIHB9_IRQ_DIO          12
+#define PSIHB9_IRQ_PSU          13
+#define PSIHB9_NUM_IRQS         14
+
+void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state);
+void pnv_psi_pic_print_info(PnvPsi *psi, Monitor *mon);
 
 #endif /* _PPC_PNV_PSI_H */
diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
index 5bd43467a1ab..019b45bf9189 100644
--- a/include/hw/ppc/pnv_xscom.h
+++ b/include/hw/ppc/pnv_xscom.h
@@ -82,6 +82,9 @@ typedef struct PnvXScomInterfaceClass {
 #define PNV_XSCOM_PBCQ_SPCI_BASE  0x9013c00
 #define PNV_XSCOM_PBCQ_SPCI_SIZE  0x5
 
+#define PNV9_XSCOM_PSIHB_BASE     0x5012900
+#define PNV9_XSCOM_PSIHB_SIZE     0x100
+
 #define PNV9_XSCOM_XIVE_BASE      0x5013000
 #define PNV9_XSCOM_XIVE_SIZE      0x300
 
diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
index b6af896c30e4..e67c9d7d3995 100644
--- a/hw/ppc/pnv.c
+++ b/hw/ppc/pnv.c
@@ -743,7 +743,7 @@ static void pnv_chip_power8_instance_init(Object *obj)
     PnvChipClass *pcc = PNV_CHIP_GET_CLASS(obj);
     int i;
 
-    object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI);
+    object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI_POWER8);
     object_property_add_child(obj, "psi", OBJECT(&chip8->psi), NULL);
     object_property_add_const_link(OBJECT(&chip8->psi), "xics",
                                    OBJECT(qdev_get_machine()), &error_abort);
@@ -923,6 +923,11 @@ static void pnv_chip_power9_instance_init(Object *obj)
     object_property_add_child(obj, "xive", OBJECT(&chip9->xive), NULL);
     object_property_add_const_link(OBJECT(&chip9->xive), "chip", obj,
                                    &error_abort);
+
+    object_initialize(&chip9->psi, sizeof(chip9->psi), TYPE_PNV_PSI_POWER9);
+    object_property_add_child(obj, "psi", OBJECT(&chip9->psi), NULL);
+    object_property_add_const_link(OBJECT(&chip9->psi), "chip", obj,
+                                   &error_abort);
 }
 
 static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
@@ -955,6 +960,18 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
     qdev_set_parent_bus(DEVICE(&chip9->xive), sysbus_get_default());
     pnv_xscom_add_subregion(chip, PNV9_XSCOM_XIVE_BASE,
                             &chip9->xive.xscom_regs);
+
+    /* Processor Service Interface (PSI) Host Bridge */
+    object_property_set_int(OBJECT(&chip9->psi), PNV9_PSIHB_BASE(chip),
+                            "bar", &error_fatal);
+    object_property_set_bool(OBJECT(&chip9->psi), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(&chip9->psi), sysbus_get_default());
+    pnv_xscom_add_subregion(chip, PNV9_XSCOM_PSIHB_BASE,
+                            &chip9->psi.xscom_regs);
 }
 
 static void pnv_chip_power9_class_init(ObjectClass *klass, void *data)
@@ -1188,6 +1205,7 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
             Pnv9Chip *chip9 = PNV9_CHIP(chip);
 
              pnv_xive_pic_print_info(&chip9->xive, mon);
+             pnv_psi_pic_print_info(&chip9->psi, mon);
         } else {
             Pnv8Chip *chip8 = PNV8_CHIP(chip);
 
diff --git a/hw/ppc/pnv_psi.c b/hw/ppc/pnv_psi.c
index 5b969127c303..8b85dd9555e8 100644
--- a/hw/ppc/pnv_psi.c
+++ b/hw/ppc/pnv_psi.c
@@ -22,6 +22,7 @@
 #include "target/ppc/cpu.h"
 #include "qemu/log.h"
 #include "qapi/error.h"
+#include "monitor/monitor.h"
 
 #include "exec/address-spaces.h"
 
@@ -114,12 +115,14 @@
 #define PSIHB_BAR_MASK                  0x0003fffffff00000ull
 #define PSIHB_FSPBAR_MASK               0x0003ffff00000000ull
 
+#define PSIHB_REG(addr) (((addr) >> 3) + PSIHB_XSCOM_BAR)
+
 static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
 {
     MemoryRegion *sysmem = get_system_memory();
     uint64_t old = psi->regs[PSIHB_XSCOM_BAR];
 
-    psi->regs[PSIHB_XSCOM_BAR] = bar & (PSIHB_BAR_MASK | PSIHB_BAR_EN);
+    psi->regs[PSIHB_XSCOM_BAR] = bar;
 
     /* Update MR, always remove it first */
     if (old & PSIHB_BAR_EN) {
@@ -128,7 +131,7 @@ static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
 
     /* Then add it back if needed */
     if (bar & PSIHB_BAR_EN) {
-        uint64_t addr = bar & PSIHB_BAR_MASK;
+        uint64_t addr = bar & ~PSIHB_BAR_EN;
         memory_region_add_subregion(sysmem, addr, &psi->regs_mr);
     }
 }
@@ -205,7 +208,12 @@ static const uint64_t stat_bits[] = {
     [PSIHB_IRQ_EXTERNAL]  = PSIHB_IRQ_STAT_EXT,
 };
 
-void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state)
+void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state)
+{
+    PNV_PSI_GET_CLASS(psi)->irq_set(psi, irq, state);
+}
+
+static void pnv_psi_power8_irq_set(PnvPsi *psi, int irq, bool state)
 {
     ICSState *ics = &psi->ics;
     uint32_t xivr_reg;
@@ -324,7 +332,7 @@ static uint64_t pnv_psi_reg_read(PnvPsi *psi, uint32_t offset, bool mmio)
         val = psi->regs[offset];
         break;
     default:
-        qemu_log_mask(LOG_UNIMP, "PSI: read at Ox%" PRIx32 "\n", offset);
+        qemu_log_mask(LOG_UNIMP, "PSI: read at 0x%" PRIx32 "\n", offset);
     }
     return val;
 }
@@ -383,7 +391,7 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
         pnv_psi_set_irsn(psi, val);
         break;
     default:
-        qemu_log_mask(LOG_UNIMP, "PSI: write at Ox%" PRIx32 "\n", offset);
+        qemu_log_mask(LOG_UNIMP, "PSI: write at 0x%" PRIx32 "\n", offset);
     }
 }
 
@@ -393,13 +401,13 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
  */
 static uint64_t pnv_psi_mmio_read(void *opaque, hwaddr addr, unsigned size)
 {
-    return pnv_psi_reg_read(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, true);
+    return pnv_psi_reg_read(opaque, PSIHB_REG(addr), true);
 }
 
 static void pnv_psi_mmio_write(void *opaque, hwaddr addr,
                               uint64_t val, unsigned size)
 {
-    pnv_psi_reg_write(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, val, true);
+    pnv_psi_reg_write(opaque, PSIHB_REG(addr), val, true);
 }
 
 static const MemoryRegionOps psi_mmio_ops = {
@@ -441,7 +449,7 @@ static const MemoryRegionOps pnv_psi_xscom_ops = {
     }
 };
 
-static void pnv_psi_init(Object *obj)
+static void pnv_psi_power8_instance_init(Object *obj)
 {
     PnvPsi *psi = PNV_PSI(obj);
 
@@ -458,7 +466,7 @@ static const uint8_t irq_to_xivr[] = {
     PSIHB_XSCOM_XIVR_EXT,
 };
 
-static void pnv_psi_realize(DeviceState *dev, Error **errp)
+static void pnv_psi_power8_realize(DeviceState *dev, Error **errp)
 {
     PnvPsi *psi = PNV_PSI(dev);
     ICSState *ics = &psi->ics;
@@ -510,28 +518,34 @@ static void pnv_psi_realize(DeviceState *dev, Error **errp)
     }
 }
 
+static const char compat_p8[] = "ibm,power8-psihb-x\0ibm,psihb-x";
+static const char compat_p9[] = "ibm,power9-psihb-x\0ibm,psihb-x";
+
 static int pnv_psi_dt_xscom(PnvXScomInterface *dev, void *fdt, int xscom_offset)
 {
-    const char compat[] = "ibm,power8-psihb-x\0ibm,psihb-x";
+    PnvPsiClass *ppc = PNV_PSI_GET_CLASS(dev);
     char *name;
     int offset;
-    uint32_t lpc_pcba = PNV_XSCOM_PSIHB_BASE;
     uint32_t reg[] = {
-        cpu_to_be32(lpc_pcba),
-        cpu_to_be32(PNV_XSCOM_PSIHB_SIZE)
+        cpu_to_be32(ppc->xscom_pcba),
+        cpu_to_be32(ppc->xscom_size)
     };
 
-    name = g_strdup_printf("psihb@%x", lpc_pcba);
+    name = g_strdup_printf("psihb@%x", ppc->xscom_pcba);
     offset = fdt_add_subnode(fdt, xscom_offset, name);
     _FDT(offset);
     g_free(name);
 
-    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
-
-    _FDT((fdt_setprop_cell(fdt, offset, "#address-cells", 2)));
-    _FDT((fdt_setprop_cell(fdt, offset, "#size-cells", 1)));
-    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
-                      sizeof(compat))));
+    _FDT(fdt_setprop(fdt, offset, "reg", reg, sizeof(reg)));
+    _FDT(fdt_setprop_cell(fdt, offset, "#address-cells", 2));
+    _FDT(fdt_setprop_cell(fdt, offset, "#size-cells", 1));
+    if (ppc->chip_type == PNV_CHIP_POWER9) {
+        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p9,
+                         sizeof(compat_p9)));
+    } else {
+        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p8,
+                         sizeof(compat_p8)));
+    }
     return 0;
 }
 
@@ -541,6 +555,324 @@ static Property pnv_psi_properties[] = {
     DEFINE_PROP_END_OF_LIST(),
 };
 
+static void pnv_psi_power8_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
+
+    dc->desc    = "PowerNV PSI Controller POWER8";
+    dc->realize = pnv_psi_power8_realize;
+
+    ppc->chip_type  = PNV_CHIP_POWER8;
+    ppc->xscom_pcba = PNV_XSCOM_PSIHB_BASE;
+    ppc->xscom_size = PNV_XSCOM_PSIHB_SIZE;
+    ppc->irq_set    = pnv_psi_power8_irq_set;
+}
+
+static const TypeInfo pnv_psi_power8_info = {
+    .name          = TYPE_PNV_PSI_POWER8,
+    .parent        = TYPE_PNV_PSI,
+    .instance_init = pnv_psi_power8_instance_init,
+    .class_init    = pnv_psi_power8_class_init,
+};
+
+/* Common registers */
+
+#define PSIHB9_CR                       0x20
+#define PSIHB9_SEMR                     0x28
+
+/* P9 registers */
+
+#define PSIHB9_INTERRUPT_CONTROL        0x58
+#define   PSIHB9_IRQ_METHOD             PPC_BIT(0)
+#define   PSIHB9_IRQ_RESET              PPC_BIT(1)
+#define PSIHB9_ESB_CI_BASE              0x60
+#define   PSIHB9_ESB_CI_VALID           1
+#define PSIHB9_ESB_NOTIF_ADDR           0x68
+#define   PSIHB9_ESB_NOTIF_VALID        1
+#define PSIHB9_IVT_OFFSET               0x70
+#define   PSIHB9_IVT_OFF_SHIFT          32
+
+#define PSIHB9_IRQ_LEVEL                0x78 /* assertion */
+#define   PSIHB9_IRQ_LEVEL_PSI          PPC_BIT(0)
+#define   PSIHB9_IRQ_LEVEL_OCC          PPC_BIT(1)
+#define   PSIHB9_IRQ_LEVEL_FSI          PPC_BIT(2)
+#define   PSIHB9_IRQ_LEVEL_LPCHC        PPC_BIT(3)
+#define   PSIHB9_IRQ_LEVEL_LOCAL_ERR    PPC_BIT(4)
+#define   PSIHB9_IRQ_LEVEL_GLOBAL_ERR   PPC_BIT(5)
+#define   PSIHB9_IRQ_LEVEL_TPM          PPC_BIT(6)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ1    PPC_BIT(7)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ2    PPC_BIT(8)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ3    PPC_BIT(9)
+#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ4    PPC_BIT(10)
+#define   PSIHB9_IRQ_LEVEL_SBE_I2C      PPC_BIT(11)
+#define   PSIHB9_IRQ_LEVEL_DIO          PPC_BIT(12)
+#define   PSIHB9_IRQ_LEVEL_PSU          PPC_BIT(13)
+#define   PSIHB9_IRQ_LEVEL_I2C_C        PPC_BIT(14)
+#define   PSIHB9_IRQ_LEVEL_I2C_D        PPC_BIT(15)
+#define   PSIHB9_IRQ_LEVEL_I2C_E        PPC_BIT(16)
+#define   PSIHB9_IRQ_LEVEL_SBE          PPC_BIT(19)
+
+#define PSIHB9_IRQ_STAT                 0x80 /* P bit */
+#define   PSIHB9_IRQ_STAT_PSI           PPC_BIT(0)
+#define   PSIHB9_IRQ_STAT_OCC           PPC_BIT(1)
+#define   PSIHB9_IRQ_STAT_FSI           PPC_BIT(2)
+#define   PSIHB9_IRQ_STAT_LPCHC         PPC_BIT(3)
+#define   PSIHB9_IRQ_STAT_LOCAL_ERR     PPC_BIT(4)
+#define   PSIHB9_IRQ_STAT_GLOBAL_ERR    PPC_BIT(5)
+#define   PSIHB9_IRQ_STAT_TPM           PPC_BIT(6)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ1     PPC_BIT(7)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ2     PPC_BIT(8)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ3     PPC_BIT(9)
+#define   PSIHB9_IRQ_STAT_LPC_SIRQ4     PPC_BIT(10)
+#define   PSIHB9_IRQ_STAT_SBE_I2C       PPC_BIT(11)
+#define   PSIHB9_IRQ_STAT_DIO           PPC_BIT(12)
+#define   PSIHB9_IRQ_STAT_PSU           PPC_BIT(13)
+
+static void pnv_psi_notify(XiveFabric *xf, uint32_t srcno)
+{
+    PnvPsi *psi = PNV_PSI(xf);
+    uint64_t notif_port = psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
+    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
+    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
+
+    uint32_t offset =
+        (psi->regs[PSIHB_REG(PSIHB9_IVT_OFFSET)] >> PSIHB9_IVT_OFF_SHIFT);
+    uint64_t lisn = cpu_to_be64(offset + srcno);
+
+    if (valid) {
+        cpu_physical_memory_write(notify_addr, &lisn, sizeof(lisn));
+    }
+}
+
+/*
+ * TODO : move to parent class
+ */
+static void pnv_psi_reset(DeviceState *dev)
+{
+    PnvPsi *psi = PNV_PSI(dev);
+
+    memset(psi->regs, 0x0, sizeof(psi->regs));
+
+    psi->regs[PSIHB_XSCOM_BAR] = psi->bar | PSIHB_BAR_EN;
+}
+
+static uint64_t pnv_psi_p9_mmio_read(void *opaque, hwaddr addr, unsigned size)
+{
+    PnvPsi *psi = PNV_PSI(opaque);
+    uint32_t reg = PSIHB_REG(addr);
+    uint64_t val = -1;
+
+    switch (addr) {
+    case PSIHB9_CR:
+    case PSIHB9_SEMR:
+        /* FSP stuff */
+    case PSIHB9_INTERRUPT_CONTROL:
+    case PSIHB9_ESB_CI_BASE:
+    case PSIHB9_ESB_NOTIF_ADDR:
+    case PSIHB9_IVT_OFFSET:
+        val = psi->regs[reg];
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: read at 0x%" PRIx64 "\n", addr);
+    }
+
+    return val;
+}
+
+static void pnv_psi_p9_mmio_write(void *opaque, hwaddr addr,
+                                  uint64_t val, unsigned size)
+{
+    PnvPsi *psi = PNV_PSI(opaque);
+    uint32_t reg = PSIHB_REG(addr);
+    MemoryRegion *sysmem = get_system_memory();
+
+    switch (addr) {
+    case PSIHB9_CR:
+    case PSIHB9_SEMR:
+        /* FSP stuff */
+        break;
+    case PSIHB9_INTERRUPT_CONTROL:
+        if (val & PSIHB9_IRQ_RESET) {
+            device_reset(DEVICE(&psi->source));
+        }
+        psi->regs[reg] = val;
+        break;
+
+    case PSIHB9_ESB_CI_BASE:
+        if (!(val & PSIHB9_ESB_CI_VALID)) {
+            if (psi->regs[reg] & PSIHB9_ESB_CI_VALID) {
+                memory_region_del_subregion(sysmem, &psi->source.esb_mmio);
+            }
+        } else {
+            if (!(psi->regs[reg] & PSIHB9_ESB_CI_VALID)) {
+                memory_region_add_subregion(sysmem,
+                                        val & ~PSIHB9_ESB_CI_VALID,
+                                        &psi->source.esb_mmio);
+            }
+        }
+        psi->regs[reg] = val;
+        break;
+
+    case PSIHB9_ESB_NOTIF_ADDR:
+        psi->regs[reg] = val;
+        break;
+    case PSIHB9_IVT_OFFSET:
+        psi->regs[reg] = val;
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: write at 0x%" PRIx64 "\n", addr);
+    }
+}
+
+static const MemoryRegionOps pnv_psi_p9_mmio_ops = {
+    .read = pnv_psi_p9_mmio_read,
+    .write = pnv_psi_p9_mmio_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+};
+
+static uint64_t pnv_psi_p9_xscom_read(void *opaque, hwaddr addr, unsigned size)
+{
+    /* No reads are expected */
+    qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom read at 0x%" PRIx64 "\n", addr);
+    return -1;
+}
+
+static void pnv_psi_p9_xscom_write(void *opaque, hwaddr addr,
+                                uint64_t val, unsigned size)
+{
+    PnvPsi *psi = PNV_PSI(opaque);
+
+    /* XSCOM is only used to set the PSIHB MMIO region */
+    switch (addr >> 3) {
+    case PSIHB_XSCOM_BAR:
+        pnv_psi_set_bar(psi, val);
+        break;
+    default:
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom write at 0x%" PRIx64 "\n",
+                      addr);
+    }
+}
+
+static const MemoryRegionOps pnv_psi_p9_xscom_ops = {
+    .read = pnv_psi_p9_xscom_read,
+    .write = pnv_psi_p9_xscom_write,
+    .endianness = DEVICE_BIG_ENDIAN,
+    .valid = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    },
+    .impl = {
+        .min_access_size = 8,
+        .max_access_size = 8,
+    }
+};
+
+static void pnv_psi_power9_irq_set(PnvPsi *psi, int irq, bool state)
+{
+    uint64_t irq_method = psi->regs[PSIHB_REG(PSIHB9_INTERRUPT_CONTROL)];
+
+    if (irq >= PSIHB9_NUM_IRQS) {
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: Unsupported irq %d\n", irq);
+        return;
+    }
+
+    if (irq_method & PSIHB9_IRQ_METHOD) {
+        qemu_log_mask(LOG_GUEST_ERROR, "PSI: LSI IRQ method not supported\n");
+        return;
+    }
+
+    /* Update LSI levels */
+    if (state) {
+        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] |= PPC_BIT(irq);
+    } else {
+        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] &= ~PPC_BIT(irq);
+    }
+
+    qemu_set_irq(xive_source_qirq(&psi->source, irq), state);
+}
+
+static void pnv_psi_power9_instance_init(Object *obj)
+{
+    PnvPsi *psi = PNV_PSI(obj);
+
+    object_initialize(&psi->source, sizeof(psi->source), TYPE_XIVE_SOURCE);
+    object_property_add_child(obj, "source", OBJECT(&psi->source), NULL);
+}
+
+static void pnv_psi_power9_realize(DeviceState *dev, Error **errp)
+{
+    PnvPsi *psi = PNV_PSI(dev);
+    XiveSource *xsrc = &psi->source;
+    Error *local_err = NULL;
+    int i;
+
+    /* This is the only device with 4k ESB pages */
+    object_property_set_int(OBJECT(xsrc), XIVE_ESB_4K, "shift",
+                            &error_fatal);
+    object_property_set_int(OBJECT(xsrc), PSIHB9_NUM_IRQS, "nr-irqs",
+                            &error_fatal);
+    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(psi),
+                                   &error_fatal);
+    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
+    if (local_err) {
+        error_propagate(errp, local_err);
+        return;
+    }
+    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
+
+    for (i = 0; i < xsrc->nr_irqs; i++) {
+        xive_source_irq_set(xsrc, i, true);
+    }
+
+    /* XSCOM region for PSI registers */
+    pnv_xscom_region_init(&psi->xscom_regs, OBJECT(dev), &pnv_psi_p9_xscom_ops,
+                psi, "xscom-psi", PNV9_XSCOM_PSIHB_SIZE);
+
+    /* Initialize MMIO region */
+    memory_region_init_io(&psi->regs_mr, OBJECT(dev), &pnv_psi_p9_mmio_ops, psi,
+                          "psihb", PNV9_PSIHB_SIZE);
+
+    /* Default BAR for MMIO region */
+    pnv_psi_set_bar(psi, psi->bar | PSIHB_BAR_EN);
+}
+
+static void pnv_psi_power9_class_init(ObjectClass *klass, void *data)
+{
+    DeviceClass *dc = DEVICE_CLASS(klass);
+    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
+    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
+
+    dc->desc    = "PowerNV PSI Controller POWER9";
+    dc->realize = pnv_psi_power9_realize;
+
+    ppc->chip_type  = PNV_CHIP_POWER9;
+    ppc->xscom_pcba = PNV9_XSCOM_PSIHB_BASE;
+    ppc->xscom_size = PNV9_XSCOM_PSIHB_SIZE;
+    ppc->irq_set    = pnv_psi_power9_irq_set;
+
+    xfc->notify      = pnv_psi_notify;
+}
+
+static const TypeInfo pnv_psi_power9_info = {
+    .name          = TYPE_PNV_PSI_POWER9,
+    .parent        = TYPE_PNV_PSI,
+    .instance_init = pnv_psi_power9_instance_init,
+    .class_init    = pnv_psi_power9_class_init,
+    .interfaces = (InterfaceInfo[]) {
+            { TYPE_XIVE_FABRIC },
+            { },
+    },
+};
+
 static void pnv_psi_class_init(ObjectClass *klass, void *data)
 {
     DeviceClass *dc = DEVICE_CLASS(klass);
@@ -548,16 +880,18 @@ static void pnv_psi_class_init(ObjectClass *klass, void *data)
 
     xdc->dt_xscom = pnv_psi_dt_xscom;
 
-    dc->realize = pnv_psi_realize;
+    dc->desc = "PowerNV PSI Controller";
     dc->props = pnv_psi_properties;
+    dc->reset  = pnv_psi_reset;
 }
 
 static const TypeInfo pnv_psi_info = {
     .name          = TYPE_PNV_PSI,
     .parent        = TYPE_SYS_BUS_DEVICE,
     .instance_size = sizeof(PnvPsi),
-    .instance_init = pnv_psi_init,
     .class_init    = pnv_psi_class_init,
+    .class_size    = sizeof(PnvPsiClass),
+    .abstract      = true,
     .interfaces    = (InterfaceInfo[]) {
         { TYPE_PNV_XSCOM_INTERFACE },
         { }
@@ -567,6 +901,18 @@ static const TypeInfo pnv_psi_info = {
 static void pnv_psi_register_types(void)
 {
     type_register_static(&pnv_psi_info);
+    type_register_static(&pnv_psi_power8_info);
+    type_register_static(&pnv_psi_power9_info);
 }
 
 type_init(pnv_psi_register_types)
+
+void pnv_psi_pic_print_info(PnvPsi *psi, Monitor *mon)
+{
+    uint32_t offset =
+        (psi->regs[PSIHB_REG(PSIHB9_IVT_OFFSET)] >> PSIHB9_IVT_OFF_SHIFT);
+
+    monitor_printf(mon, "PSIHB Source %08x .. %08x\n",
+                  offset, offset + psi->source.nr_irqs - 1);
+    xive_source_pic_print_info(&psi->source, offset, mon);
+}
-- 
2.17.2



* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-22  6:50     ` Benjamin Herrenschmidt
@ 2018-11-22  7:59       ` Cédric Le Goater
  2018-11-23  1:17         ` David Gibson
  2018-11-23  1:10       ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-22  7:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel

On 11/22/18 7:50 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
>>
>> Sorry, didn't think of this in my first reply.
>>
>> 1) Does the hardware ever actually write back to the EAS?  I know it
>> does for the END, but it's not clear why it would need to for the
>> EAS.  If not, we don't need the setter.
> 
> Nope, though the PAPR model will via hcalls

Indeed. The H_INT_SET_SOURCE_CONFIG hcall updates the EAT.

>> 2) The signatures are a bit odd here.  For the setter, a value would
>> make more sense than a (XiveEAS *), since it's just a word.  For the getter
>> you could return the EAS value directly rather than using a pointer -
>> there's already a valid bit in the EAS so you can construct a value
>> with that cleared if the lisn is out of bounds.

Yes we could. I think I made it that way to be consistent with the
other XIVE internal structures, which are bigger: END, NVT.

There might be other reasons in Pnv. One was to use generic accessors
to the guest RAM, but in the end I didn't do it. Take a look at the Pnv
model and we might decide to change the prototype then. I don't
think it's a major change.

Thanks,

C.


* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-22  4:41   ` David Gibson
  2018-11-22  6:49     ` Benjamin Herrenschmidt
@ 2018-11-22 21:47     ` Cédric Le Goater
  2018-11-23  4:35       ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-22 21:47 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/22/18 5:41 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:56:58AM +0100, Cédric Le Goater wrote:
>> To complete the event routing, the IVRE sub-engine uses an internal
>> table containing Event Notification Descriptor (END) structures.
>>
>> An END specifies on which Event Queue (EQ) the event notification
>> data, defined in the associated EAS, should be posted when an
>> exception occurs. It also defines which Notification Virtual Target
>> (NVT) should be notified.
>>
>> The Event Queue is a memory page provided by the O/S defining a
>> circular buffer, one per server and priority couple, containing Event
>> Queue entries. These are 4 bytes long, the first bit being a
>> 'generation' bit and the 31 following bits the END Data field. They
>> are pulled by the O/S when the exception occurs.
>>
>> The END Data field is a way to set an invariant logical event source
>> number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG hcall
>> when the EISN flag is used.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/xive.h      |  18 ++++
>>  include/hw/ppc/xive_regs.h |  48 ++++++++++
>>  hw/intc/xive.c             | 185 ++++++++++++++++++++++++++++++++++++-
>>  3 files changed, 248 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 5a0696366577..ce62aaf28343 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -193,11 +193,29 @@ typedef struct XiveRouterClass {
>>      /* XIVE table accessors */
>>      int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>      int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>> +    int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>> +                   XiveEND *end);
>> +    int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>> +                   XiveEND *end);
> 
> Hrm.  So unlike the EAS, which is basically just a word, the END is a
> pretty large structure.  

Yes, and so will the NVT.

> It's unclear here if get/set are expected to copy the whole thing out 
> and in, 

That's the plan. 

What I had in mind are memory accessors to the XIVE structures, which 
are local to QEMU for sPAPR and in the guest RAM for PowerNV (Please
take a look at the XIVE PowerNV model).
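
To make the copy-out/copy-in semantics concrete, here is a minimal
sketch, assuming a table that is local to the model as in the sPAPR
case. DemoEND, DemoRouter, demo_get_end() and demo_set_end() are
illustrative names only and are not the QEMU API; a PowerNV-style
backend would read and write guest RAM instead of a local array.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for XiveEND: eight 32-bit words. */
typedef struct {
    uint32_t w[8];
} DemoEND;

#define DEMO_NR_ENDS 16

/* Hypothetical router holding its END table locally. */
typedef struct {
    DemoEND table[DEMO_NR_ENDS];
} DemoRouter;

/* Copy the whole END out; the caller never sees a "live" pointer. */
static int demo_get_end(DemoRouter *r, uint32_t idx, DemoEND *end)
{
    assert(r != NULL && end != NULL);
    if (idx >= DEMO_NR_ENDS) {
        return -1;
    }
    memcpy(end, &r->table[idx], sizeof(*end));
    return 0;
}

/* Copy the whole END back in; this is the only way to commit changes. */
static int demo_set_end(DemoRouter *r, uint32_t idx, const DemoEND *end)
{
    assert(r != NULL && end != NULL);
    if (idx >= DEMO_NR_ENDS) {
        return -1;
    }
    memcpy(&r->table[idx], end, sizeof(*end));
    return 0;
}
```

With this shape, a caller modifies its private copy and the table only
changes when the setter is called, which is the behaviour described
above.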

> or if get gives you a pointer into a "live" structure 

no

> and set just does any necessary barriers after an update.
That would be too complex for the PowerNV model, I think. There is a cache
in between the software running on the (QEMU) machine and the XIVE HW, but
it would be hard to handle.
 
> Really, for a non-atomic value like this, I'm not sure get/set is the
> right model.

OK. We need something to get them out and in.

> Also as I understand it nearly all the indices in XIVE are broken into
> block/index.  Is there a reason those are folded together into lisn
> for the EAS, but not for the END?

The indexing of the EAT is global to the system and the index defines
which block to use. The IRQ source numbers on the PowerBus are architected
to be:

    #define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))

and XIVE can use different strategies to identify the XIVE IC in charge
of routing. It can be a one-to-one chip-to-block relation, as skiboot does.
Using a block-scoped table is also possible. Our model only supports one
block per chip and some shortcuts are taken, but not that many in fact.
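
As a small illustration of the encoding, the sketch below builds a
source number with the XIVE_SRCNO() macro quoted above and splits it
back. The inverse helpers srcno_block() and srcno_index(), and the
make_srcno() wrapper, are assumptions made for this example; they simply
undo the 4-bit block / 28-bit index split implied by the shift.

```c
#include <assert.h>
#include <stdint.h>

/* Macro as given in the message above. */
#define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))

/* Illustrative inverse: the block is the top 4 bits. */
static inline uint8_t srcno_block(uint32_t srcno)
{
    return (uint8_t)(srcno >> 28);
}

/* Illustrative inverse: the index is the low 28 bits. */
static inline uint32_t srcno_index(uint32_t srcno)
{
    return srcno & 0x0fffffffu;
}

/* Checked wrapper so out-of-range fields cannot silently overlap. */
static uint32_t make_srcno(uint8_t blk, uint32_t idx)
{
    assert(blk < 16 && idx < (1u << 28));
    return XIVE_SRCNO(blk, idx);
}
```

With a one-to-one chip-to-block relation, the block field of a global
source number directly names the chip whose XIVE IC owns the entry.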
 
Remote access to the XIVE structures of another chip is done through
MMIO (not modeled in PowerNV) and the blkid is used to partition the MMIO
regions. Being local is better for performance because the END and NVT
tables have a strong relation with the XIVE subengines using them
(VC and PC).

Maybe Ben can clarify if this is badly explained.

>>  } XiveRouterClass;
>>  
>>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>>  
>>  int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>  int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>> +                        XiveEND *end);
>> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>> +                        XiveEND *end);
>> +
>> +/*
>> + * For legacy compatibility, the exceptions define up to 256 different
>> + * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>> + * and the least favored level 0xFF.
>> + */
>> +#define XIVE_PRIORITY_MAX  7
>> +
>> +void xive_end_reset(XiveEND *end);
>> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>>  
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> index 12499b33614c..f97fb2b90bee 100644
>> --- a/include/hw/ppc/xive_regs.h
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -28,4 +28,52 @@ typedef struct XiveEAS {
>>  #define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
>>  } XiveEAS;
>>  
>> +/* Event Notification Descriptor (END) */
>> +typedef struct XiveEND {
>> +        uint32_t        w0;
>> +#define END_W0_VALID             PPC_BIT32(0) /* "v" bit */
>> +#define END_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
>> +#define END_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
>> +#define END_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
>> +#define END_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
>> +#define END_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
>> +#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
>> +#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
>> +#define END_W0_QSIZE             PPC_BITMASK32(12, 15)
>> +#define END_W0_SW0               PPC_BIT32(16)
>> +#define END_W0_FIRMWARE          END_W0_SW0 /* Owned by FW */
>> +#define END_QSIZE_4K             0
>> +#define END_QSIZE_64K            4
>> +#define END_W0_HWDEP             PPC_BITMASK32(24, 31)
>> +        uint32_t        w1;
>> +#define END_W1_ESn               PPC_BITMASK32(0, 1)
>> +#define END_W1_ESn_P             PPC_BIT32(0)
>> +#define END_W1_ESn_Q             PPC_BIT32(1)
>> +#define END_W1_ESe               PPC_BITMASK32(2, 3)
>> +#define END_W1_ESe_P             PPC_BIT32(2)
>> +#define END_W1_ESe_Q             PPC_BIT32(3)
>> +#define END_W1_GENERATION        PPC_BIT32(9)
>> +#define END_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
>> +        uint32_t        w2;
>> +#define END_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
>> +#define END_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
>> +        uint32_t        w3;
>> +#define END_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
>> +        uint32_t        w4;
>> +#define END_W4_ESC_END_BLOCK     PPC_BITMASK32(4, 7)
>> +#define END_W4_ESC_END_INDEX     PPC_BITMASK32(8, 31)
>> +        uint32_t        w5;
>> +#define END_W5_ESC_END_DATA      PPC_BITMASK32(1, 31)
>> +        uint32_t        w6;
>> +#define END_W6_FORMAT_BIT        PPC_BIT32(8)
>> +#define END_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
>> +#define END_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
>> +        uint32_t        w7;
>> +#define END_W7_F0_IGNORE         PPC_BIT32(0)
>> +#define END_W7_F0_BLK_GROUPING   PPC_BIT32(1)
>> +#define END_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
>> +#define END_W7_F1_WAKEZ          PPC_BIT32(0)
>> +#define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
>> +} XiveEND;
>> +
>>  #endif /* PPC_XIVE_REGS_H */
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index c4c90a25758e..9cb001e7b540 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -442,6 +442,101 @@ static const TypeInfo xive_source_info = {
>>      .class_init    = xive_source_class_init,
>>  };
>>  
>> +/*
>> + * XiveEND helpers
>> + */
>> +
>> +void xive_end_reset(XiveEND *end)
>> +{
>> +    memset(end, 0, sizeof(*end));
>> +
>> +    /* switch off the escalation and notification ESBs */
>> +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
> 
> It's not obvious to me what circumstances this would be called under.
> Since the ENDs are in system memory, a memset() seems like an odd
> thing for (virtual) hardware to be doing to it.

It makes sense on sPAPR if one day some OS starts using the END ESBs for
further coalescing of the events. None does for now, but I have added the
model anyway.

>> +}
>> +
>> +static void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width,
>> +                                          Monitor *mon)
>> +{
>> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
>> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
>> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
>> +    uint32_t qentries = 1 << (qsize + 10);
>> +    int i;
>> +
>> +    /*
>> +     * print out the [ (qindex - (width - 1)) .. (qindex + 1)] window
>> +     */
>> +    monitor_printf(mon, " [ ");
>> +    qindex = (qindex - (width - 1)) & (qentries - 1);
>> +    for (i = 0; i < width; i++) {
>> +        uint64_t qaddr = qaddr_base + (qindex << 2);
>> +        uint32_t qdata = -1;
>> +
>> +        if (dma_memory_read(&address_space_memory, qaddr, &qdata,
>> +                            sizeof(qdata))) {
>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ @0x%"
>> +                          HWADDR_PRIx "\n", qaddr);
>> +            return;
>> +        }
>> +        monitor_printf(mon, "%s%08x ", i == width - 1 ? "^" : "",
>> +                       be32_to_cpu(qdata));
>> +        qindex = (qindex + 1) & (qentries - 1);
>> +    }
>> +    monitor_printf(mon, "]\n");
>> +}
>> +
>> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon)
>> +{
>> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
>> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
>> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
>> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
>> +    uint32_t qentries = 1 << (qsize + 10);
>> +
>> +    uint32_t nvt = GETFIELD(END_W6_NVT_INDEX, end->w6);
>> +    uint8_t priority = GETFIELD(END_W7_F0_PRIORITY, end->w7);
>> +
>> +    if (!(end->w0 & END_W0_VALID)) {
>> +        return;
>> +    }
>> +
>> +    monitor_printf(mon, "  %08x %c%c%c%c%c prio:%d nvt:%04x eq:@%08"PRIx64
>> +                   "% 6d/%5d ^%d", end_idx,
>> +                   end->w0 & END_W0_VALID ? 'v' : '-',
>> +                   end->w0 & END_W0_ENQUEUE ? 'q' : '-',
>> +                   end->w0 & END_W0_UCOND_NOTIFY ? 'n' : '-',
>> +                   end->w0 & END_W0_BACKLOG ? 'b' : '-',
>> +                   end->w0 & END_W0_ESCALATE_CTL ? 'e' : '-',
>> +                   priority, nvt, qaddr_base, qindex, qentries, qgen);
>> +
>> +    xive_end_queue_pic_print_info(end, 6, mon);
>> +}
>> +
>> +static void xive_end_push(XiveEND *end, uint32_t data)
> 
> s/push/enqueue/ please, "push" suggests a stack.  (Not to mention that
> "push" and "pull" are used as terms elsewhere in XIVE).

Yes, you are right. I will change it.

>> +{
>> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
>> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
>> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
>> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
>> +
>> +    uint64_t qaddr = qaddr_base + (qindex << 2);
>> +    uint32_t qdata = cpu_to_be32((qgen << 31) | (data & 0x7fffffff));
>> +    uint32_t qentries = 1 << (qsize + 10);
>> +
>> +    if (dma_memory_write(&address_space_memory, qaddr, &qdata, sizeof(qdata))) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write END data @0x%"
>> +                      HWADDR_PRIx "\n", qaddr);
>> +        return;
>> +    }
>> +
>> +    qindex = (qindex + 1) & (qentries - 1);
>> +    if (qindex == 0) {
>> +        qgen ^= 1;
>> +        end->w1 = SETFIELD(END_W1_GENERATION, end->w1, qgen);
>> +    }
>> +    end->w1 = SETFIELD(END_W1_PAGE_OFF, end->w1, qindex);
>> +}
>> +
>>  /*
>>   * XIVE Router (aka. Virtualization Controller or IVRE)
>>   */
>> @@ -460,6 +555,82 @@ int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>      return xrc->set_eas(xrtr, lisn, eas);
>>  }
>>  
>> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>> +                        XiveEND *end)
>> +{
>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>> +
>> +   return xrc->get_end(xrtr, end_blk, end_idx, end);
>> +}
>> +
>> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>> +                        XiveEND *end)
>> +{
>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>> +
>> +   return xrc->set_end(xrtr, end_blk, end_idx, end);
>> +}
>> +
>> +/*
>> + * An END trigger can come from an event trigger (IPI or HW) or from
>> + * another chip. We don't model the PowerBus but the END trigger
>> + * message has the same parameters as the function below.
>> + */
>> +static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>> +                                   uint32_t end_idx, uint32_t end_data)
>> +{
>> +    XiveEND end;
>> +    uint8_t priority;
>> +    uint8_t format;
>> +
>> +    /* END cache lookup */
>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
>> +                      end_idx);
>> +        return;
>> +    }
>> +
>> +    if (!(end.w0 & END_W0_VALID)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
>> +                      end_blk, end_idx);
>> +        return;
>> +    }
>> +
>> +    if (end.w0 & END_W0_ENQUEUE) {
>> +        xive_end_push(&end, end_data);
>> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
>> +    }
>> +
>> +    /*
>> +     * The W7 format depends on the F bit in W6. It defines the type
>> +     * of the notification :
>> +     *
>> +     *   F=0 : single or multiple NVT notification
>> +     *   F=1 : User level Event-Based Branch (EBB) notification, no
>> +     *         priority
>> +     */
>> +    format = GETFIELD(END_W6_FORMAT_BIT, end.w6);
>> +    priority = GETFIELD(END_W7_F0_PRIORITY, end.w7);
>> +
>> +    /* The END is masked */
>> +    if (format == 0 && priority == 0xff) {
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Check the END ESn (Event State Buffer for notification) for
>> +     * even further coalescing in the Router
>> +     */
>> +    if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
>> +        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Follows IVPE notification
>> +     */
>> +}
>> +
>>  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>  {
>>      XiveRouter *xrtr = XIVE_ROUTER(xf);
>> @@ -471,9 +642,9 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>          return;
>>      }
>>  
>> -    /* The IVRE has a State Bit Cache for its internal sources which
>> -     * is also involed at this point. We skip the SBC lookup because
>> -     * the state bits of the sources are modeled internally in QEMU.
>> +    /* The IVRE checks the State Bit Cache at this point. We skip the
>> +     * SBC lookup because the state bits of the sources are modeled
>> +     * internally in QEMU.
> 
> Replacing a comment about something we're not doing with a different
> comment about something we're not doing doesn't seem very useful.
> Maybe fold these together into one patch or the other.

That's me rephrasing. It should indeed be in the previous patch.

Thanks,

C.

>>       */
>>  
>>      if (!(eas.w & EAS_VALID)) {
>> @@ -485,6 +656,14 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>          /* Notification completed */
>>          return;
>>      }
>> +
>> +    /*
>> +     * The event trigger becomes an END trigger
>> +     */
>> +    xive_router_end_notify(xrtr,
>> +                           GETFIELD(EAS_END_BLOCK, eas.w),
>> +                           GETFIELD(EAS_END_INDEX, eas.w),
>> +                           GETFIELD(EAS_END_DATA,  eas.w));
>>  }
>>  
>>  static Property xive_router_properties[] = {
> 


* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-22  5:13   ` David Gibson
@ 2018-11-22 21:58     ` Cédric Le Goater
  2018-11-23  4:36       ` David Gibson
  2018-11-29 22:06     ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-22 21:58 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/22/18 6:13 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
>> The Event Notification Descriptor also contains two Event State
>> Buffers providing further coalescing of interrupts, one for the
>> notification event (ESn) and one for the escalation events (ESe). An
>> MMIO page is assigned for each to control the EOI through loads
>> only. Stores are not allowed.
>>
>> The END ESBs are modeled through an object resembling the 'XiveSource'.
>> It is stateless, as the END state bits are stored in the XiveEND
>> structure under the XiveRouter, and the MMIO accesses follow the same
>> rules as for the standard source ESBs.
>>
>> END ESBs are not supported by the Linux drivers, on either OPAL or
>> sPAPR. Nevertheless, this provides a means to study the question in the
>> future and validates the XIVE model a bit more.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/xive.h |  20 ++++++
>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 178 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index ce62aaf28343..24301bf2076d 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>                          XiveEND *end);
>>  
>> +/*
>> + * XIVE END ESBs
>> + */
>> +
>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
>> +#define XIVE_END_SOURCE(obj) \
>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> 
> Is there a particular reason to make this a full QOM object, rather
> than just embedding it in the XiveRouter?

Yes, you are right. It should probably be under the XiveRouter, because
there is a direct link with the ENDT, which is in the XiveRouter.

But if I remove the chip_id field from the XiveRouter, it becomes a QOM
interface. Something to ponder.
 
>> +typedef struct XiveENDSource {
>> +    SysBusDevice parent;
>> +
>> +    uint32_t        nr_ends;
>> +
>> +    /* ESB memory region */
>> +    uint32_t        esb_shift;
>> +    MemoryRegion    esb_mmio;
>> +
>> +    XiveRouter      *xrtr;
>> +} XiveENDSource;
>> +
>>  /*
>>   * For legacy compatibility, the exceptions define up to 256 different
>>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 9cb001e7b540..5a8882d47a98 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>       * even further coalescing in the Router
>>       */
>>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
>> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>> -        return;
>> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
>> +        bool notify = xive_esb_trigger(&pq);
>> +
>> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
>> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
>> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
>> +        }
>> +
>> +        /* ESn[Q]=1 : end of notification */
>> +        if (!notify) {
>> +            return;
>> +        }
>>      }
>>  
>>      /*
>> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>>  }
>>  
>> +/*
>> + * END ESB MMIO loads
>> + */
>> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
>> +    XiveRouter *xrtr = xsrc->xrtr;
>> +    uint32_t offset = addr & 0xFFF;
>> +    uint8_t end_blk;
>> +    uint32_t end_idx;
>> +    XiveEND end;
>> +    uint32_t end_esmask;
>> +    uint8_t pq;
>> +    uint64_t ret = -1;
>> +
>> +    end_blk = xrtr->chip_id;
>> +    end_idx = addr >> (xsrc->esb_shift + 1);
>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
>> +                      end_idx);
>> +        return -1;
>> +    }
>> +
>> +    if (!(end.w0 & END_W0_VALID)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
>> +                      end_blk, end_idx);
>> +        return -1;
>> +    }
>> +
>> +    end_esmask = addr_is_even(addr, xsrc->esb_shift) ? END_W1_ESn : END_W1_ESe;
>> +    pq = GETFIELD(end_esmask, end.w1);
>> +
>> +    switch (offset) {
>> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
>> +        ret = xive_esb_eoi(&pq);
>> +
>> +        /* Forward the source event notification for routing ?? */
>> +        break;
>> +
>> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
>> +        ret = pq;
>> +        break;
>> +
>> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
>> +        ret = xive_esb_set(&pq, (offset >> 8) & 0x3);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid END ESB load addr %d\n",
>> +                      offset);
>> +        return -1;
>> +    }
>> +
>> +    if (pq != GETFIELD(end_esmask, end.w1)) {
>> +        end.w1 = SETFIELD(end_esmask, end.w1, pq);
>> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
>> +    }
> 
> We can probably share some more code with XiveSource here, but that's
> something that can be refined later.

Yes, clearly. The idea was to introduce a XiveESB model handling only the
MMIO aspects and relying on an interface to query/modify the underlying PQ bits.
These state bits are related to a device and the ESB pages are the XIVE way
to expose them.

I left that for later. I didn't want to add more complexity to the XiveSource
for a feature not used today.

Thanks,

C.

> 
>> +
>> +    return ret;
>> +}
>> +
>> +/*
>> + * END ESB MMIO stores are invalid
>> + */
>> +static void xive_end_source_write(void *opaque, hwaddr addr,
>> +                                  uint64_t value, unsigned size)
>> +{
>> +    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr 0x%"
>> +                  HWADDR_PRIx"\n", addr);
>> +}
>> +
>> +static const MemoryRegionOps xive_end_source_ops = {
>> +    .read = xive_end_source_read,
>> +    .write = xive_end_source_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static void xive_end_source_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(dev);
>> +    Object *obj;
>> +    Error *local_err = NULL;
>> +
>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>> +    if (!obj) {
>> +        error_propagate(errp, local_err);
>> +        error_prepend(errp, "required link 'xive' not found: ");
>> +        return;
>> +    }
>> +
>> +    xsrc->xrtr = XIVE_ROUTER(obj);
>> +
>> +    if (!xsrc->nr_ends) {
>> +        error_setg(errp, "Number of ENDs needs to be greater than 0");
>> +        return;
>> +    }
>> +
>> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
>> +        xsrc->esb_shift != XIVE_ESB_64K) {
>> +        error_setg(errp, "Invalid ESB shift setting");
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Each END is assigned an even/odd pair of MMIO pages, the even page
>> +     * manages the ESn field while the odd page manages the ESe field.
>> +     */
>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>> +                          &xive_end_source_ops, xsrc, "xive.end",
>> +                          (1ull << (xsrc->esb_shift + 1)) * xsrc->nr_ends);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>> +}
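
The even/odd page layout set up here is what the read handler earlier in the
patch decodes. As a rough illustration only, the address split can be written
as a pure function. The DemoEndEsbAccess type and demo_end_esb_decode() name
are hypothetical; the shifts and the 0xFFF mask are the ones used by
xive_end_source_read() and addr_is_even().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical decoded view of one access to the END ESB MMIO region. */
typedef struct {
    uint32_t end_idx;  /* which END the access targets */
    bool     is_esn;   /* true: even page (ESn), false: odd page (ESe) */
    uint32_t offset;   /* command offset, low 12 bits of the address */
} DemoEndEsbAccess;

static DemoEndEsbAccess demo_end_esb_decode(uint64_t addr, uint32_t esb_shift)
{
    DemoEndEsbAccess acc;

    /* The realize routine only accepts 4K (12) or 64K (16) pages. */
    assert(esb_shift == 12 || esb_shift == 16);

    /* Each END owns a pair of pages, hence the extra bit in the shift. */
    acc.end_idx = (uint32_t)(addr >> (esb_shift + 1));
    /* Bit esb_shift of the address selects the page within the pair. */
    acc.is_esn  = !((addr >> esb_shift) & 1);
    acc.offset  = (uint32_t)(addr & 0xFFF);
    return acc;
}
```

With 64K pages, for example, address 0x70800 lands on the odd (ESe) page of
END 3 at offset 0x800, which is the XIVE_ESB_GET window.
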
>> +
>> +static Property xive_end_source_properties[] = {
>> +    DEFINE_PROP_UINT32("nr-ends", XiveENDSource, nr_ends, 0),
>> +    DEFINE_PROP_UINT32("shift", XiveENDSource, esb_shift, XIVE_ESB_64K),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void xive_end_source_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->desc    = "XIVE END Source";
>> +    dc->props   = xive_end_source_properties;
>> +    dc->realize = xive_end_source_realize;
>> +}
>> +
>> +static const TypeInfo xive_end_source_info = {
>> +    .name          = TYPE_XIVE_END_SOURCE,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(XiveENDSource),
>> +    .class_init    = xive_end_source_class_init,
>> +};
>> +
>>  /*
>>   * XIVE Fabric
>>   */
>> @@ -720,6 +875,7 @@ static void xive_register_types(void)
>>      type_register_static(&xive_source_info);
>>      type_register_static(&xive_fabric_info);
>>      type_register_static(&xive_router_info);
>> +    type_register_static(&xive_end_source_info);
>>  }
>>  
>>  type_init(xive_register_types)
> 


* Re: [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model
  2018-11-22  7:25     ` Cédric Le Goater
@ 2018-11-23  0:31       ` David Gibson
  2018-11-23  8:21         ` Cédric Le Goater
  2018-11-26  8:14         ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-23  0:31 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 24713 bytes --]

On Thu, Nov 22, 2018 at 08:25:06AM +0100, Cédric Le Goater wrote:
> On 11/22/18 4:05 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:56:54AM +0100, Cédric Le Goater wrote:
> >> The first sub-engine of the overall XIVE architecture is the Interrupt
> >> Virtualization Source Engine (IVSE). An IVSE can be integrated into
> >> another logic, like in a PCI PHB or in the main interrupt controller
> >> to manage IPIs.
> >>
> >> Each IVSE instance is associated with an Event State Buffer (ESB) that
> >> contains a two bit state entry for each possible event source. When an
> >> event is signaled to the IVSE, by MMIO or some other means, the
> >> associated interrupt state bits are fetched from the ESB and
> >> modified. Depending on the resulting ESB state, the event is forwarded
> >> to the IVRE sub-engine of the controller doing the routing.
> >>
> >> Each supported ESB entry is associated with either a single or a
> >> even/odd pair of pages which provides commands to manage the source:
> >> to EOI, to turn off the source for instance.
> >>
> >> On a sPAPR machine, the O/S will obtain the page address of the ESB
> >> entry associated with a source and its characteristic using the
> >> H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is used.
> >>
> >> The xive_source_notify() routine is in charge of forwarding the source
> >> event notification to the routing engine. It will be filled later on.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > 
> > Ok, this is looking basically pretty good.  Few details to query
> > below.
> > 
> > 
> >> ---
> >>  default-configs/ppc64-softmmu.mak |   1 +
> >>  include/hw/ppc/xive.h             | 130 ++++++++++
> >>  hw/intc/xive.c                    | 379 ++++++++++++++++++++++++++++++
> >>  hw/intc/Makefile.objs             |   1 +
> >>  4 files changed, 511 insertions(+)
> >>  create mode 100644 include/hw/ppc/xive.h
> >>  create mode 100644 hw/intc/xive.c
> >>
> >> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> >> index aec2855750d6..2d1e7c5c4668 100644
> >> --- a/default-configs/ppc64-softmmu.mak
> >> +++ b/default-configs/ppc64-softmmu.mak
> >> @@ -16,6 +16,7 @@ CONFIG_VIRTIO_VGA=y
> >>  CONFIG_XICS=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> >> +CONFIG_XIVE=$(CONFIG_PSERIES)
> >>  CONFIG_MEM_DEVICE=y
> >>  CONFIG_DIMM=y
> >>  CONFIG_SPAPR_RNG=y
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> new file mode 100644
> >> index 000000000000..5fec4b08705d
> >> --- /dev/null
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -0,0 +1,130 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> > 
> > A cheat sheet at the top of this header with the old and new XIVE
> > terms would be quite nice to have.
> 
> Yes. It's a good place. I will put the XIVE acronyms here :
>      
>      EA		Event Assignment
>      EISN	Effective Interrupt Source Number
>      END	Event Notification Descriptor
>      ESB	Event State Buffer
>      EQ		Event Queue
>      LISN	Logical Interrupt Source Number
>      NVT	Notification Virtual Target
>      TIMA	Thread Interrupt Management Area
>      ...

That sounds good, but what I'd also like is showing that NVT == VP and
EAS == IVT and so forth.

> >> + */
> >> +
> >> +#ifndef PPC_XIVE_H
> >> +#define PPC_XIVE_H
> >> +
> >> +#include "hw/sysbus.h"
> > 
> > So, I'm a bit dubious about making the XiveSource a SysBus device -
> > I'm concerned it won't play well with tying it into the other devices
> > like PHB that "own" it in real hardware.
> 
> It does but I can take a look at changing it to a DeviceState. The 
> reset handlers might be a concern.

As a "non bus" device I think you'd need to register your own reset
handler rather than just setting dc->reset.  Otherwise, I think that
should work.

> > I think we'd be better off making it a direct descendent of
> > TYPE_DEVICE which constructs the MMIO region, but doesn't map it.
> 
> At one point, I started working on a XiveESB object doing what I think
> you are suggesting, and then I removed it. I am reluctant to add more
> complexity now; the patchset is just growing and growing ...
> 
> But I agree there are fundamentals to get right for KVM. Let's talk 
> about it after you have looked at the overall patchset, at least up 
> to KVM initial support.

Hm, ok.

> > Then we can have a SysBusDevice (and/or other) wrapper which
> > instantiates the XiveSource core and maps it into somewhere
> > accessible.
> 
> The XIVE controller model does the mapping of the source currently.

I'm.. I'm not sure what you mean by that.   We have a
sysbus_init_mmio() right here which effectively maps in the MMIO
region AFAICT.

> In the case of sPAPR, the controller model controls the TIMA and
> for PowerNV, there are quite a few other MMIO regions to handle.
> 
> > 
> >> +
> >> +/*
> >> + * XIVE Interrupt Source
> >> + */
> >> +
> >> +#define TYPE_XIVE_SOURCE "xive-source"
> >> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
> >> +
> >> +/*
> >> + * XIVE Interrupt Source characteristics, which define how the ESB are
> >> + * controlled.
> >> + */
> >> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
> >> +#define XIVE_SRC_STORE_EOI     0x2 /* Store EOI supported */
> >> +
> >> +typedef struct XiveSource {
> >> +    SysBusDevice parent;
> >> +
> >> +    /* IRQs */
> >> +    uint32_t        nr_irqs;
> >> +    qemu_irq        *qirqs;
> >> +
> >> +    /* PQ bits */
> >> +    uint8_t         *status;
> >> +
> >> +    /* ESB memory region */
> >> +    uint64_t        esb_flags;
> >> +    uint32_t        esb_shift;
> >> +    MemoryRegion    esb_mmio;
> >> +} XiveSource;
> >> +
> >> +/*
> >> + * ESB MMIO setting. Can be one page, for both source triggering and
> >> + * source management, or two different pages. See below for magic
> >> + * values.
> >> + */
> >> +#define XIVE_ESB_4K          12 /* PSI HB only */
> >> +#define XIVE_ESB_4K_2PAGE    13
> >> +#define XIVE_ESB_64K         16
> >> +#define XIVE_ESB_64K_2PAGE   17
> >> +
> >> +static inline bool xive_source_esb_has_2page(XiveSource *xsrc)
> >> +{
> >> +    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE ||
> >> +        xsrc->esb_shift == XIVE_ESB_4K_2PAGE;
> >> +}
> >> +
> >> +/* The trigger page is always the first/even page */
> >> +static inline hwaddr xive_source_esb_page(XiveSource *xsrc, uint32_t srcno)
> > 
> > This function doesn't appear to be used anywhere except..
> 
> It's used in patch 16 adding the hcalls also.
> 
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +    return (1ull << xsrc->esb_shift) * srcno;
> >> +}
> >> +
> >> +/* In a two pages ESB MMIO setting, the odd page is for management */
> >> +static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
> > 
> > 
> > ..here, and this function doesn't appear to be used anywhere.
> 
> It's used in patch 16 adding the hcalls and patch 23 for KVM.
> 
> This is basic ESB support which I thought belonged in the patch on sources.
>  
> > 
> >> +{
> >> +    hwaddr addr = xive_source_esb_page(xsrc, srcno);
> >> +
> >> +    if (xive_source_esb_has_2page(xsrc)) {
> >> +        addr += (1 << (xsrc->esb_shift - 1));
> >> +    }
> >> +
> >> +    return addr;
> >> +}
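
To make the two helpers above concrete, here is a small standalone sketch of
the same offset arithmetic. The demo_* names are hypothetical and the shift
values are passed as plain integers (12/13/16/17, matching the XIVE_ESB_*
defines) instead of being read from a XiveSource.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 13 and 17 are the two-page settings (4K and 64K pages respectively). */
static bool demo_esb_has_2page(uint32_t esb_shift)
{
    return esb_shift == 13 || esb_shift == 17;
}

/* Offset of the first (trigger) page of source 'srcno' in the ESB region. */
static uint64_t demo_esb_page(uint32_t esb_shift, uint32_t srcno)
{
    assert(esb_shift >= 12 && esb_shift <= 17);
    return (UINT64_C(1) << esb_shift) * srcno;
}

/* Offset of the management page: the odd page in a two-page setting. */
static uint64_t demo_esb_mgmt(uint32_t esb_shift, uint32_t srcno)
{
    uint64_t addr = demo_esb_page(esb_shift, srcno);

    if (demo_esb_has_2page(esb_shift)) {
        addr += UINT64_C(1) << (esb_shift - 1);
    }
    return addr;
}
```

In a one-page setting both functions return the same offset, since trigger
and management share the page.
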
> >> +
> >> +/*
> >> + * Each interrupt source has a 2-bit state machine which can be
> >> + * controlled by MMIO. P indicates that an interrupt is pending (has
> >> + * been sent to a queue and is waiting for an EOI). Q indicates that
> >> + * the interrupt has been triggered while pending.
> >> + *
> >> + * This acts as a coalescing mechanism in order to guarantee that a
> >> + * given interrupt only occurs at most once in a queue.
> >> + *
> >> + * When doing an EOI, the Q bit will indicate if the interrupt
> >> + * needs to be re-triggered.
> >> + */
> >> +#define XIVE_ESB_VAL_P        0x2
> >> +#define XIVE_ESB_VAL_Q        0x1
> >> +
> >> +#define XIVE_ESB_RESET        0x0
> >> +#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
> >> +#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
> >> +#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
> >> +
> >> +/*
> >> + * "magic" Event State Buffer (ESB) MMIO offsets.
> >> + *
> >> + * The following offsets into the ESB MMIO allow to read or manipulate
> >> + * the PQ bits. They must be used with an 8-byte load instruction.
> >> + * They all return the previous state of the interrupt (atomically).
> >> + *
> >> + * Additionally, some ESB pages support doing an EOI via a store and
> >> + * some ESBs support doing a trigger via a separate trigger page.
> >> + */
> >> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
> >> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
> >> +#define XIVE_ESB_GET            0x800 /* Load */
> >> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
> >> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
> >> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
> >> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
> >> +
> >> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno);
> >> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
> >> +
> >> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset,
> >> +                                Monitor *mon);
> >> +
> >> +static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +    return xsrc->qirqs[srcno];
> >> +}
> >> +
> >> +#endif /* PPC_XIVE_H */
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> new file mode 100644
> >> index 000000000000..f7621f84828c
> >> --- /dev/null
> >> +++ b/hw/intc/xive.c
> >> @@ -0,0 +1,379 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/log.h"
> >> +#include "qapi/error.h"
> >> +#include "target/ppc/cpu.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "sysemu/dma.h"
> >> +#include "monitor/monitor.h"
> >> +#include "hw/ppc/xive.h"
> >> +
> >> +/*
> >> + * XIVE ESB helpers
> >> + */
> >> +
> >> +static uint8_t xive_esb_set(uint8_t *pq, uint8_t value)
> >> +{
> >> +    uint8_t old_pq = *pq & 0x3;
> >> +
> >> +    *pq &= ~0x3;
> >> +    *pq |= value & 0x3;
> >> +
> >> +    return old_pq;
> >> +}
> >> +
> >> +static bool xive_esb_trigger(uint8_t *pq)
> >> +{
> >> +    uint8_t old_pq = *pq & 0x3;
> >> +
> >> +    switch (old_pq) {
> >> +    case XIVE_ESB_RESET:
> >> +        xive_esb_set(pq, XIVE_ESB_PENDING);
> >> +        return true;
> >> +    case XIVE_ESB_PENDING:
> >> +    case XIVE_ESB_QUEUED:
> >> +        xive_esb_set(pq, XIVE_ESB_QUEUED);
> >> +        return false;
> >> +    case XIVE_ESB_OFF:
> >> +        xive_esb_set(pq, XIVE_ESB_OFF);
> >> +        return false;
> >> +    default:
> >> +         g_assert_not_reached();
> >> +    }
> >> +}
> >> +
> >> +static bool xive_esb_eoi(uint8_t *pq)
> >> +{
> >> +    uint8_t old_pq = *pq & 0x3;
> >> +
> >> +    switch (old_pq) {
> >> +    case XIVE_ESB_RESET:
> >> +    case XIVE_ESB_PENDING:
> >> +        xive_esb_set(pq, XIVE_ESB_RESET);
> >> +        return false;
> >> +    case XIVE_ESB_QUEUED:
> >> +        xive_esb_set(pq, XIVE_ESB_PENDING);
> >> +        return true;
> >> +    case XIVE_ESB_OFF:
> >> +        xive_esb_set(pq, XIVE_ESB_OFF);
> >> +        return false;
> >> +    default:
> >> +         g_assert_not_reached();
> >> +    }
> >> +}
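
The two helpers above form the whole PQ state machine, so it may help to see
them exercised outside of QEMU. The following is only a sketch: the DEMO_*
names are hypothetical, the state byte is assumed to hold nothing but the two
PQ bits, and the transitions are copied from xive_esb_trigger() and
xive_esb_eoi().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Same encoding as the patch: P is bit 1, Q is bit 0. */
enum {
    DEMO_PQ_RESET   = 0x0,
    DEMO_PQ_OFF     = 0x1,
    DEMO_PQ_PENDING = 0x2,
    DEMO_PQ_QUEUED  = 0x3,
};

/* Returns true when the event must be forwarded to the router. */
static bool demo_pq_trigger(uint8_t *pq)
{
    assert((*pq & ~0x3) == 0);

    switch (*pq) {
    case DEMO_PQ_RESET:
        *pq = DEMO_PQ_PENDING;
        return true;
    case DEMO_PQ_PENDING:
    case DEMO_PQ_QUEUED:
        *pq = DEMO_PQ_QUEUED;   /* coalesce: remember one more trigger */
        return false;
    default:                    /* DEMO_PQ_OFF: source is masked */
        return false;
    }
}

/* Returns true when a trigger was queued and must be replayed. */
static bool demo_pq_eoi(uint8_t *pq)
{
    assert((*pq & ~0x3) == 0);

    switch (*pq) {
    case DEMO_PQ_QUEUED:
        *pq = DEMO_PQ_PENDING;
        return true;
    case DEMO_PQ_OFF:
        return false;
    default:                    /* RESET or PENDING */
        *pq = DEMO_PQ_RESET;
        return false;
    }
}
```

Any number of triggers arriving while an interrupt is pending collapse into a
single replay at EOI time, which is the coalescing guarantee described in the
header comment.
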
> >> +
> >> +/*
> >> + * XIVE Interrupt Source (or IVSE)
> >> + */
> >> +
> >> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +
> >> +    return xsrc->status[srcno] & 0x3;
> >> +}
> >> +
> >> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +
> >> +    return xive_esb_set(&xsrc->status[srcno], pq);
> >> +}
> >> +
> >> +/*
> >> + * Returns whether the event notification should be forwarded.
> >> + */
> >> +static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +
> >> +    return xive_esb_trigger(&xsrc->status[srcno]);
> >> +}
> >> +
> >> +/*
> >> + * Returns whether the event notification should be forwarded.
> >> + */
> >> +static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +
> >> +    return xive_esb_eoi(&xsrc->status[srcno]);
> >> +}
> >> +
> >> +/*
> >> + * Forward the source event notification to the Router
> >> + */
> >> +static void xive_source_notify(XiveSource *xsrc, int srcno)
> >> +{
> >> +
> >> +}
> >> +
> >> +/*
> >> + * In a two pages ESB MMIO setting, even page is the trigger page, odd
> >> + * page is for management
> >> + */
> >> +static inline bool addr_is_even(hwaddr addr, uint32_t shift)
> >> +{
> >> +    return !((addr >> shift) & 1);
> >> +}
> >> +
> >> +static inline bool xive_source_is_trigger_page(XiveSource *xsrc, hwaddr addr)
> >> +{
> >> +    return xive_source_esb_has_2page(xsrc) &&
> >> +        addr_is_even(addr, xsrc->esb_shift - 1);
> >> +}
> >> +
> >> +/*
> >> + * ESB MMIO loads
> >> + *                      Trigger page    Management/EOI page
> >> + * 2 pages setting      even            odd
> >> + *
> >> + * 0x000 .. 0x3FF       -1              EOI and return 0|1
> >> + * 0x400 .. 0x7FF       -1              EOI and return 0|1
> >> + * 0x800 .. 0xBFF       -1              return PQ
> >> + * 0xC00 .. 0xCFF       -1              return PQ and atomically PQ=00
> >> + * 0xD00 .. 0xDFF       -1              return PQ and atomically PQ=01
> >> + * 0xE00 .. 0xEFF       -1              return PQ and atomically PQ=10
> >> + * 0xF00 .. 0xFFF       -1              return PQ and atomically PQ=11
> >> + */
> > 
> > I can't quite make sense of this table.  What do the -1s represent,
> 
> the value returned by the load.
> 
> > and how does it relate to the non-2page case?
> 
> A one-page ESB supports trigger and management on the same page. So for loads,
> the odd page behavior applies.
> 
> >> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> >> +    uint32_t offset = addr & 0xFFF;
> >> +    uint32_t srcno = addr >> xsrc->esb_shift;
> >> +    uint64_t ret = -1;
> >> +
> >> +    /* In a two pages ESB MMIO setting, trigger page should not be read */
> >> +    if (xive_source_is_trigger_page(xsrc, addr)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR,
> >> +                      "XIVE: invalid load on IRQ %d trigger page at "
> >> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
> >> +        return -1;
> >> +    }
> >> +
> >> +    switch (offset) {
> >> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
> >> +        ret = xive_source_esb_eoi(xsrc, srcno);
> >> +
> >> +        /* Forward the source event notification for routing */
> >> +        if (ret) {
> >> +            xive_source_notify(xsrc, srcno);
> >> +        }
> >> +        break;
> >> +
> >> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
> >> +        ret = xive_source_esb_get(xsrc, srcno);
> >> +        break;
> >> +
> >> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
> >> +        ret = xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
> >> +        break;
> >> +    default:
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB load addr %x\n",
> >> +                      offset);
> >> +    }
> >> +
> >> +    return ret;
> >> +}
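
Since the table above raised questions, the load-side decoding of the "magic"
offsets can also be written as a small pure function. This is an illustrative
sketch only: the DemoEsbLoadOp type and demo_esb_load_decode() name are
hypothetical, and the ranges are the ones covered by the case labels in
xive_source_esb_read().

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    DEMO_ESB_EOI,     /* 0x000 .. 0x7FF : EOI, returns whether to replay */
    DEMO_ESB_GET,     /* 0x800 .. 0xBFF : return PQ, no side effect */
    DEMO_ESB_SET_PQ,  /* 0xC00 .. 0xFFF : return PQ and set a new value */
} DemoEsbLoadOp;

/*
 * Classify a load by its offset in the management page. For SET_PQ
 * loads, *new_pq receives the PQ value encoded in address bits 9:8.
 */
static DemoEsbLoadOp demo_esb_load_decode(uint32_t offset, uint8_t *new_pq)
{
    assert(offset <= 0xFFF);

    if (offset < 0x800) {
        return DEMO_ESB_EOI;
    }
    if (offset < 0xC00) {
        return DEMO_ESB_GET;
    }
    *new_pq = (offset >> 8) & 0x3;
    return DEMO_ESB_SET_PQ;
}
```

Because the four SET_PQ windows start at 0xC00, 0xD00, 0xE00 and 0xF00, bits
9:8 of the offset give the new PQ value directly (00, 01, 10, 11).
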
> >> +
> >> +/*
> >> + * ESB MMIO stores
> >> + *                      Trigger page    Management/EOI page
> >> + * 2 pages setting      even            odd
> > 
> > As with the previous table, I don't quite understand what the headings
> > above mean.
> 
> A one-page ESB supports trigger and management on the same page. So for stores,
> the odd page behavior applies.
> 
> The headings can be improved. I will think of something.
> 
> >> + * 0x000 .. 0x3FF       Trigger         Trigger
> >> + * 0x400 .. 0x7FF       Trigger         EOI
> >> + * 0x800 .. 0xBFF       Trigger         undefined
> >> + * 0xC00 .. 0xCFF       Trigger         PQ=00
> >> + * 0xD00 .. 0xDFF       Trigger         PQ=01
> >> + * 0xE00 .. 0xEFF       Trigger         PQ=10
> >> + * 0xF00 .. 0xFFF       Trigger         PQ=11
> >> + */
> >> +static void xive_source_esb_write(void *opaque, hwaddr addr,
> >> +                                  uint64_t value, unsigned size)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> >> +    uint32_t offset = addr & 0xFFF;
> >> +    uint32_t srcno = addr >> xsrc->esb_shift;
> >> +    bool notify = false;
> >> +
> >> +    /* In a two pages ESB MMIO setting, trigger page only triggers */
> >> +    if (xive_source_is_trigger_page(xsrc, addr)) {
> >> +        notify = xive_source_esb_trigger(xsrc, srcno);
> >> +        goto out;
> >> +    }
> >> +
> >> +    switch (offset) {
> >> +    case 0 ... 0x3FF:
> >> +        notify = xive_source_esb_trigger(xsrc, srcno);
> >> +        break;
> >> +
> >> +    case XIVE_ESB_STORE_EOI ... XIVE_ESB_STORE_EOI + 0x3FF:
> >> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
> >> +            qemu_log_mask(LOG_GUEST_ERROR,
> >> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
> >> +            return;
> >> +        }
> >> +
> >> +        notify = xive_source_esb_eoi(xsrc, srcno);
> >> +        break;
> >> +
> >> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
> >> +        xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
> >> +        break;
> >> +
> >> +    default:
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %x\n",
> >> +                      offset);
> >> +        return;
> >> +    }
> >> +
> >> +out:
> >> +    /* Forward the source event notification for routing */
> >> +    if (notify) {
> >> +        xive_source_notify(xsrc, srcno);
> >> +    }
> >> +}
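
To make the store table above easier to read, here is a minimal sketch of the management-page decode as a pure function. It assumes the offsets listed in the comment table (Store EOI at 0x400, PQ setters at 0xC00 through 0xFFF) and ignores the trigger page of a two-page setting, where every store is a trigger. The toy_* and TOY_* names are made up for illustration and are not part of the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names, for illustration only. */
typedef enum {
    TOY_ESB_STORE_TRIGGER,
    TOY_ESB_STORE_EOI,
    TOY_ESB_STORE_SET_PQ,
    TOY_ESB_STORE_INVALID,
} ToyEsbStoreOp;

/*
 * Decode a store to the management page. Only the low 12 bits of the
 * address select the operation. For TOY_ESB_STORE_SET_PQ, *pq receives
 * the new PQ value (0..3); otherwise *pq is left untouched.
 */
static ToyEsbStoreOp toy_esb_decode_store(uint64_t addr, uint8_t *pq)
{
    uint32_t offset = addr & 0xFFF;

    assert(pq != NULL);

    if (offset < 0x400) {
        return TOY_ESB_STORE_TRIGGER;
    }
    if (offset < 0x800) {
        return TOY_ESB_STORE_EOI;
    }
    if (offset < 0xC00) {
        return TOY_ESB_STORE_INVALID;   /* "undefined" in the table */
    }

    /* 0xC00 -> PQ=00, 0xD00 -> PQ=01, 0xE00 -> PQ=10, 0xF00 -> PQ=11 */
    *pq = (offset >> 8) & 0x3;
    return TOY_ESB_STORE_SET_PQ;
}
```

A store at offset 0xE00 decodes to TOY_ESB_STORE_SET_PQ with pq == 2, matching the PQ=10 row. The Store EOI case would still need the XIVE_SRC_STORE_EOI flag check done by the real handler.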
> >> +
> >> +static const MemoryRegionOps xive_source_esb_ops = {
> >> +    .read = xive_source_esb_read,
> >> +    .write = xive_source_esb_write,
> >> +    .endianness = DEVICE_BIG_ENDIAN,
> >> +    .valid = {
> >> +        .min_access_size = 8,
> >> +        .max_access_size = 8,
> >> +    },
> >> +    .impl = {
> >> +        .min_access_size = 8,
> >> +        .max_access_size = 8,
> >> +    },
> >> +};
> >> +
> >> +static void xive_source_set_irq(void *opaque, int srcno, int val)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
> >> +    bool notify = false;
> >> +
> >> +    if (val) {
> >> +        notify = xive_source_esb_trigger(xsrc, srcno);
> >> +    }
> >> +
> >> +    /* Forward the source event notification for routing */
> >> +    if (notify) {
> >> +        xive_source_notify(xsrc, srcno);
> >> +    }
> >> +}
> >> +
> >> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        uint8_t pq = xive_source_esb_get(xsrc, i);
> >> +
> >> +        if (pq == XIVE_ESB_OFF) {
> >> +            continue;
> >> +        }
> >> +
> >> +        monitor_printf(mon, "  %08x %c%c\n", i + offset,
> >> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >> +    }
> >> +}
> >> +
> >> +static void xive_source_reset(DeviceState *dev)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +
> >> +    /* PQs are initialized to 0b01 which corresponds to "ints off" */
> >> +    memset(xsrc->status, 0x1, xsrc->nr_irqs);
> > 
> > You've already got XIVE_ESB_OFF defined to make this a little clearer.
> 
> Sure.
> 
> Thanks,
> 
> C. 
> 
> 
> > 
> >> +}
> >> +
> >> +static void xive_source_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +
> >> +    if (!xsrc->nr_irqs) {
> >> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
> >> +        return;
> >> +    }
> >> +
> >> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
> >> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
> >> +        xsrc->esb_shift != XIVE_ESB_64K &&
> >> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
> >> +        error_setg(errp, "Invalid ESB shift setting");
> >> +        return;
> >> +    }
> >> +
> >> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> >> +                                     xsrc->nr_irqs);
> >> +
> >> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
> >> +
> >> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >> +                          &xive_source_esb_ops, xsrc, "xive.esb",
> >> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> >> +}
> >> +
> >> +static const VMStateDescription vmstate_xive_source = {
> >> +    .name = TYPE_XIVE_SOURCE,
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField[]) {
> >> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
> >> +        VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +/*
> >> + * The default XIVE interrupt source setting for the ESB MMIOs is two
> >> + * 64k pages without Store EOI, to be in sync with KVM.
> >> + */
> >> +static Property xive_source_properties[] = {
> >> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
> >> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
> >> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void xive_source_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +
> >> +    dc->desc    = "XIVE Interrupt Source";
> >> +    dc->props   = xive_source_properties;
> >> +    dc->realize = xive_source_realize;
> >> +    dc->reset   = xive_source_reset;
> >> +    dc->vmsd    = &vmstate_xive_source;
> >> +}
> >> +
> >> +static const TypeInfo xive_source_info = {
> >> +    .name          = TYPE_XIVE_SOURCE,
> >> +    .parent        = TYPE_SYS_BUS_DEVICE,
> >> +    .instance_size = sizeof(XiveSource),
> >> +    .class_init    = xive_source_class_init,
> >> +};
> >> +
> >> +static void xive_register_types(void)
> >> +{
> >> +    type_register_static(&xive_source_info);
> >> +}
> >> +
> >> +type_init(xive_register_types)
> >> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> >> index 0e9963f5eecc..72a46ed91c31 100644
> >> --- a/hw/intc/Makefile.objs
> >> +++ b/hw/intc/Makefile.objs
> >> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
> >>  obj-$(CONFIG_XICS) += xics.o
> >>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> >> +obj-$(CONFIG_XIVE) += xive.o
> >>  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-22  7:39     ` Cédric Le Goater
@ 2018-11-23  1:08       ` David Gibson
  2018-11-23 13:28         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-23  1:08 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 8916 bytes --]

On Thu, Nov 22, 2018 at 08:39:41AM +0100, Cédric Le Goater wrote:
> On 11/22/18 4:19 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:56:55AM +0100, Cédric Le Goater wrote:
> >> The 'sent' status of the LSI interrupt source is modeled with the 'P'
> >> bit of the ESB and the assertion status of the source is maintained in
> >> an array under the main sPAPRXive object. The type of the source is
> >> stored in the same array for practical reasons.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > 
> > Looks good except for some minor details.
> > 
> >> ---
> >>  include/hw/ppc/xive.h | 20 ++++++++++++-
> >>  hw/intc/xive.c        | 68 +++++++++++++++++++++++++++++++++++++++----
> >>  2 files changed, 81 insertions(+), 7 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 5fec4b08705d..e118acd59f1e 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -32,8 +32,10 @@ typedef struct XiveSource {
> >>      /* IRQs */
> >>      uint32_t        nr_irqs;
> >>      qemu_irq        *qirqs;
> >> +    unsigned long   *lsi_map;
> >> +    int32_t         lsi_map_size; /* for VMSTATE_BITMAP */
> > 
> > At some point it's possible we'll want XiveSource subclasses that just
> > know which irqs are LSI and which aren't without an explicit map.  But
> > this detail isn't exposed in the migration stream or the user
> > > interface, so we can tweak it later as necessary.
> > 
> >> -    /* PQ bits */
> >> +    /* PQ bits and LSI assertion bit */
> >>      uint8_t         *status;
> >>  
> >>      /* ESB memory region */
> >> @@ -89,6 +91,7 @@ static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
> >>   * When doing an EOI, the Q bit will indicate if the interrupt
> >>   * needs to be re-triggered.
> >>   */
> >> +#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
> >>  #define XIVE_ESB_VAL_P        0x2
> >>  #define XIVE_ESB_VAL_Q        0x1
> >>  
> >> @@ -127,4 +130,19 @@ static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
> >>      return xsrc->qirqs[srcno];
> >>  }
> >>  
> >> +static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +    return test_bit(srcno, xsrc->lsi_map);
> >> +}
> >> +
> >> +static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> >> +                                       bool lsi)
> > 
> > > The function name doesn't make it obvious that this controls LSI
> > > configuration. '..._irq_set_lsi' maybe?
> 
> yes.
> 
> 
> >> +{
> >> +    assert(srcno < xsrc->nr_irqs);
> >> +    if (lsi) {
> >> +        bitmap_set(xsrc->lsi_map, srcno, 1);
> >> +    }
> >> +}
> >> +
> >>  #endif /* PPC_XIVE_H */
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index f7621f84828c..ac4605fee8b7 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -88,14 +88,40 @@ uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
> >>      return xive_esb_set(&xsrc->status[srcno], pq);
> >>  }
> >>  
> >> +/*
> >> + * Returns whether the event notification should be forwarded.
> >> + */
> >> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
> > 
> > What exactly "trigger" means isn't entirely obvious for an LSI.  Might
> > be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
> 
> This is called only when the interrupt is asserted. So it is a 
> simplified LSI trigger depending only on the 'P' bit.

Yes, I see that.  But the result is that while the MSI logic is
encapsulated in the MSI trigger function, this leaves the LSI logic
split across the trigger function and set_irq() itself.  I think it
would be better to have assert and deassert helpers instead, which
handle both the trigger/notification and also the updating of the
ASSERTED bit.
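
Something along these lines is what I have in mind. This is only a rough sketch on a toy stand-in for XiveSource, not a drop-in replacement: the ToySource type and the toy_* helpers are hypothetical, the bit values are the ones from the patch, and the EOI path is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit values taken from the quoted patch */
#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
#define XIVE_ESB_VAL_P        0x2
#define XIVE_ESB_VAL_Q        0x1

#define XIVE_ESB_RESET        0x0

/* Toy stand-in for XiveSource: one status byte per source (hypothetical) */
#define TOY_NR_IRQS 4
typedef struct ToySource {
    uint8_t status[TOY_NR_IRQS];
} ToySource;

static uint8_t toy_source_pq(const ToySource *s, uint32_t srcno)
{
    return s->status[srcno] & (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q);
}

/*
 * Record the assertion and do the simplified LSI trigger in one place.
 * Returns whether the event notification should be forwarded.
 */
static bool toy_source_lsi_assert(ToySource *s, uint32_t srcno)
{
    assert(srcno < TOY_NR_IRQS);

    s->status[srcno] |= XIVE_STATUS_ASSERTED;

    if (toy_source_pq(s, srcno) == XIVE_ESB_RESET) {
        s->status[srcno] |= XIVE_ESB_VAL_P;     /* RESET -> PENDING */
        return true;
    }
    return false;
}

/* Deasserting only clears the extra bit; the PQ bits are left alone */
static void toy_source_lsi_deassert(ToySource *s, uint32_t srcno)
{
    assert(srcno < TOY_NR_IRQS);

    s->status[srcno] &= (uint8_t)~XIVE_STATUS_ASSERTED;
}
```

With helpers shaped like that, the LSI branch of set_irq() would reduce to calling one or the other depending on 'val', and the ASSERTED bookkeeping would no longer be open coded there.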

> > 
> >> +{
> >> +    uint8_t old_pq = xive_source_esb_get(xsrc, srcno);
> >> +
> >> +    switch (old_pq) {
> >> +    case XIVE_ESB_RESET:
> >> +        xive_source_esb_set(xsrc, srcno, XIVE_ESB_PENDING);
> >> +        return true;
> >> +    default:
> >> +        return false;
> >> +    }
> >> +}
> >> +
> >>  /*
> >>   * Returns whether the event notification should be forwarded.
> >>   */
> >>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> >>  {
> >> +    bool ret;
> >> +
> >>      assert(srcno < xsrc->nr_irqs);
> >>  
> >> -    return xive_esb_trigger(&xsrc->status[srcno]);
> >> +    ret = xive_esb_trigger(&xsrc->status[srcno]);
> >> +
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
> >> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR,
> >> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
> >> +    }
> >> +
> >> +    return ret;
> >>  }
> >>  
> >>  /*
> >> @@ -103,9 +129,22 @@ static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> >>   */
> >>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
> >>  {
> >> +    bool ret;
> >> +
> >>      assert(srcno < xsrc->nr_irqs);
> >>  
> >> -    return xive_esb_eoi(&xsrc->status[srcno]);
> >> +    ret = xive_esb_eoi(&xsrc->status[srcno]);
> >> +
> >> +    /* LSI sources do not set the Q bit but they can still be
> >> +     * asserted, in which case we should forward a new event
> >> +     * notification
> >> +     */
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
> >> +        xsrc->status[srcno] & XIVE_STATUS_ASSERTED) {
> >> +        ret = xive_source_lsi_trigger(xsrc, srcno);
> >> +    }
> >> +
> >> +    return ret;
> >>  }
> >>  
> >>  /*
> >> @@ -268,8 +307,17 @@ static void xive_source_set_irq(void *opaque, int srcno, int val)
> >>      XiveSource *xsrc = XIVE_SOURCE(opaque);
> >>      bool notify = false;
> >>  
> >> -    if (val) {
> >> -        notify = xive_source_esb_trigger(xsrc, srcno);
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >> +        if (val) {
> >> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> >> +            notify = xive_source_lsi_trigger(xsrc, srcno);
> >> +        } else {
> >> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> >> +        }
> >> +    } else {
> >> +        if (val) {
> >> +            notify = xive_source_esb_trigger(xsrc, srcno);
> >> +        }
> >>      }
> >>  
> >>      /* Forward the source event notification for routing */
> >> @@ -289,9 +337,11 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
> >>              continue;
> >>          }
> >>  
> >> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
> >> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
> >> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
> >>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
> >> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
> >>      }
> >>  }
> >>  
> >> @@ -299,6 +349,8 @@ static void xive_source_reset(DeviceState *dev)
> >>  {
> >>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >>  
> >> +    /* Do not clear the LSI bitmap */
> >> +
> >>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
> >>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
> >>  }
> >> @@ -325,6 +377,9 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
> >>  
> >>      xsrc->status = g_malloc0(xsrc->nr_irqs);
> >>  
> >> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
> >> +    xsrc->lsi_map_size = xsrc->nr_irqs;
> >> +
> >>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >>                            &xive_source_esb_ops, xsrc, "xive.esb",
> >>                            (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> >> @@ -338,6 +393,7 @@ static const VMStateDescription vmstate_xive_source = {
> >>      .fields = (VMStateField[]) {
> >>          VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
> >>          VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
> >> +        VMSTATE_BITMAP(lsi_map, XiveSource, 1, lsi_map_size),
> > 
> > This shouldn't be here.  The lsi_map is all set up at machine
> > configuration time and then static, so it doesn't need to be migrated.
> 
> yes. of course ... I will get rid of it.
> 
> Thanks,
> 
> C. 
> > 
> >>          VMSTATE_END_OF_LIST()
> >>      },
> >>  };
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-22  6:50     ` Benjamin Herrenschmidt
  2018-11-22  7:59       ` Cédric Le Goater
@ 2018-11-23  1:10       ` David Gibson
  2018-11-23 10:28         ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-23  1:10 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1221 bytes --]

On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> > 
> > Sorry, didn't think of this in my first reply.
> > 
> > 1) Does the hardware ever actually write back to the EAS?  I know it
> > does for the END, but it's not clear why it would need to for the
> > EAS.  If not, we don't need the setter.
> 
> Nope, though the PAPR model will via hcalls

Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
metal details.  Since the hcall knows it's PAPR it can just update the
backing information for the EAS directly, with no need for an
abstracted hook.
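
As a rough illustration of "update the backing information directly", assuming the PAPR side keeps its EAT as a plain array of 64-bit words. Every name below (ToyEAS, ToySpaprXive, toy_set_source_config) and the 0 / -1 return convention are hypothetical, not taken from the series.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Same shape as the XiveEAS in the patch: a single 64-bit word */
typedef struct ToyEAS {
    uint64_t w;
} ToyEAS;

/* Hypothetical PAPR-side device state owning the EAT */
typedef struct ToySpaprXive {
    ToyEAS   *eat;
    uint32_t  nr_irqs;
} ToySpaprXive;

/*
 * What an hcall handler could do: it already knows it is dealing with
 * the PAPR backend, so it writes the table entry itself instead of
 * going through a router class hook.
 * Returns 0 on success, -1 if the LISN is out of range.
 */
static int toy_set_source_config(ToySpaprXive *xive, uint32_t lisn,
                                 uint64_t new_w)
{
    assert(xive != NULL && xive->eat != NULL);

    if (lisn >= xive->nr_irqs) {
        return -1;
    }

    xive->eat[lisn].w = new_w;  /* one word, one store */
    return 0;
}
```

The router would then only need a getter for its lookup path.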

> 
> > 
> > 2) The signatures are a bit odd here.  For the setter, a value would
> > make more sense than a (XiveEAS *), since it's just a word.  For the getter
> > you could return the EAS value directly rather than using a pointer -
> > there's already a valid bit in the EAS so you can construct a value
> > with that cleared if the lisn is out of bounds.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-22  7:59       ` Cédric Le Goater
@ 2018-11-23  1:17         ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-23  1:17 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 1769 bytes --]

On Thu, Nov 22, 2018 at 08:59:32AM +0100, Cédric Le Goater wrote:
> On 11/22/18 7:50 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> >>
> >> Sorry, didn't think of this in my first reply.
> >>
> >> 1) Does the hardware ever actually write back to the EAS?  I know it
> >> does for the END, but it's not clear why it would need to for the
> >> EAS.  If not, we don't need the setter.
> > 
> > Nope, though the PAPR model will via hcalls
> 
> Indeed. The H_INT_SET_SOURCE_CONFIG hcall updates the EAT.
> 
> >> 2) The signatures are a bit odd here.  For the setter, a value would
> >> make more sense than a (XiveEAS *), since it's just a word.  For the getter
> >> you could return the EAS value directly rather than using a pointer -
> >> there's already a valid bit in the EAS so you can construct a value
> >> with that cleared if the lisn is out of bounds.
> 
> Yes we could. I think I made it that way to be consistent with the 
> other XIVE internal structures which are bigger : END, NVT

Yeah, but as noted elsewhere I don't really like the get/set model for
the bigger-than-word-size structures.  It gives the impression that
they're atomic updates when they can't be, as well as unnecessarily
copying a bunch of stuff, sometimes on hot paths.
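
For the EAS specifically, the by-value getter from point 2 could look roughly like the sketch below. It assumes the EAS stays a single 64-bit word with the valid bit at PPC bit 0 as in the patch; the ToyRouter type, its array-backed table, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* PPC bit 0 is the most significant bit of the 64-bit word */
#define TOY_EAS_VALID  0x8000000000000000ULL

typedef struct ToyEAS {
    uint64_t w;
} ToyEAS;

/* Hypothetical router with an array-backed EAT */
typedef struct ToyRouter {
    const ToyEAS *eat;
    uint32_t      nr_entries;
} ToyRouter;

/*
 * Return the EAS by value. An out-of-bounds LISN yields an entry with
 * the valid bit cleared, so the caller has a single condition to test.
 */
static ToyEAS toy_router_get_eas(const ToyRouter *xrtr, uint32_t lisn)
{
    ToyEAS invalid = { .w = 0 };    /* TOY_EAS_VALID is clear */

    assert(xrtr != NULL && xrtr->eat != NULL);

    if (lisn >= xrtr->nr_entries) {
        return invalid;
    }
    return xrtr->eat[lisn];
}
```

The caller then only checks the valid bit of the returned word, rather than both a return code and the valid bit. This trick does not carry over to END and NVT, which are too big to pass around by value cheaply.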

> There might be other reasons in Pnv. One was to use generic accessors 
> to the guest RAM but I didn't do it finally. Take a look at the Pnv
> model and we might decide to change the prototype then. I don't 
> think it's a major change.

Hmmm.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-22  7:53     ` Cédric Le Goater
@ 2018-11-23  3:50       ` David Gibson
  2018-11-23  8:06         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-23  3:50 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 37191 bytes --]

On Thu, Nov 22, 2018 at 08:53:00AM +0100, Cédric Le Goater wrote:
> On 11/22/18 5:11 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater wrote:
> >> The XiveRouter models the second sub-engine of the overall XIVE
> >> architecture : the Interrupt Virtualization Routing Engine (IVRE).
> >>
> >> The IVRE handles event notifications of the IVSE through MMIO stores
> >> and performs the interrupt routing process. For this purpose, it uses
> >> a set of tables stored in system memory, the first of which is the
> >> Event Assignment Structure (EAS) table.
> >>
> >> The EAT associates an interrupt source number with an Event Notification
> >> Descriptor (END) which will be used in a second phase of the routing
> >> process to identify a Notification Virtual Target.
> >>
> >> The XiveRouter is an abstract class which needs to be inherited from
> >> to define a storage for the EAT, and other upcoming tables. The
> >> 'chip-id' attribute is not strictly necessary for the sPAPR and
> >> PowerNV machines but it's a good way to test the routing algorithm.
> >> Without this attribute, the XiveRouter could be a simple QOM
> >> interface.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/xive.h      | 32 ++++++++++++++
> >>  include/hw/ppc/xive_regs.h | 31 ++++++++++++++
> >>  hw/intc/xive.c             | 86 ++++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 149 insertions(+)
> >>  create mode 100644 include/hw/ppc/xive_regs.h
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index be93fae6317b..5a0696366577 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -11,6 +11,7 @@
> >>  #define PPC_XIVE_H
> >>  
> >>  #include "hw/sysbus.h"
> > 
> > Again, I don't think making this a SysBusDevice is quite right.
> > Even more so for the router than the source, because at least for PAPR
> > it might not have any MMIO presence at all.
> 
> The controller model inherits from the XiveRouter and manages the
> TIMA.

Um.. I'm not sure what you mean by the "controller model".  Surely the
presenter should own the TIMA, not the router?

> 
> >> +#include "hw/ppc/xive_regs.h"
> >>  
> >>  /*
> >>   * XIVE Fabric (Interface between Source and Router)
> >> @@ -168,4 +169,35 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
> >>      }
> >>  }
> >>  
> >> +/*
> >> + * XIVE Router
> >> + */
> >> +
> >> +typedef struct XiveRouter {
> >> +    SysBusDevice    parent;
> >> +
> >> +    uint32_t        chip_id;
> > 
> > I don't think this belongs in the base class.  The PowerNV specific
> > variants will need it, but it doesn't make sense for the PAPR version.
> 
> > yeah. I am using it as an END and NVT block identifier but it's not 
> required for sPAPR, it could just be zero. 
> 
> It was good to test the routing algo which should not assume that the 
> block id is zero. 
>  
> > 
> >> +} XiveRouter;
> >> +
> >> +#define TYPE_XIVE_ROUTER "xive-router"
> >> +#define XIVE_ROUTER(obj)                                \
> >> +    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
> >> +#define XIVE_ROUTER_CLASS(klass)                                        \
> >> +    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
> >> +#define XIVE_ROUTER_GET_CLASS(obj)                              \
> >> +    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
> >> +
> >> +typedef struct XiveRouterClass {
> >> +    SysBusDeviceClass parent;
> >> +
> >> +    /* XIVE table accessors */
> >> +    int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >> +    int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >> +} XiveRouterClass;
> >> +
> >> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> >> +
> >> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >> +
> >>  #endif /* PPC_XIVE_H */
> >> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> >> new file mode 100644
> >> index 000000000000..12499b33614c
> >> --- /dev/null
> >> +++ b/include/hw/ppc/xive_regs.h
> >> @@ -0,0 +1,31 @@
> >> +/*
> >> + * QEMU PowerPC XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2016-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef PPC_XIVE_REGS_H
> >> +#define PPC_XIVE_REGS_H
> >> +
> >> +/* EAS (Event Assignment Structure)
> >> + *
> >> + * One per interrupt source. Targets an interrupt to a given Event
> >> + * Notification Descriptor (END) and provides the corresponding
> >> + * logical interrupt number (END data)
> >> + */
> >> +typedef struct XiveEAS {
> >> +        /* Use a single 64-bit definition to make it easier to
> >> +         * perform atomic updates
> >> +         */
> >> +        uint64_t        w;
> >> +#define EAS_VALID       PPC_BIT(0)
> >> +#define EAS_END_BLOCK   PPC_BITMASK(4, 7)        /* Destination END block# */
> >> +#define EAS_END_INDEX   PPC_BITMASK(8, 31)       /* Destination END index */
> >> +#define EAS_MASKED      PPC_BIT(32)              /* Masked */
> >> +#define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
> >> +} XiveEAS;
> >> +
> >> +#endif /* PPC_XIVE_REGS_H */
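
For readers not used to IBM bit numbering, a small sketch of how the EAS fields above map onto the 64-bit word. The field positions are the ones from the patch; the TOY_BIT, TOY_BITMASK, toy_getfield and toy_setfield helpers are stand-ins written for this example and are not the QEMU macros.

```c
#include <assert.h>
#include <stdint.h>

/* IBM numbering: bit 0 is the MSB of the doubleword, bit 63 the LSB */
#define TOY_BIT(bit)         (0x8000000000000000ULL >> (bit))
#define TOY_BITMASK(bs, be)  ((TOY_BIT(bs) - TOY_BIT(be)) | TOY_BIT(bs))

/* Field layout from the patch */
#define TOY_EAS_VALID      TOY_BIT(0)
#define TOY_EAS_END_BLOCK  TOY_BITMASK(4, 7)
#define TOY_EAS_END_INDEX  TOY_BITMASK(8, 31)
#define TOY_EAS_MASKED     TOY_BIT(32)
#define TOY_EAS_END_DATA   TOY_BITMASK(33, 63)

/* Number of zero bits below the lowest set bit of a non-zero mask */
static unsigned toy_mask_shift(uint64_t mask)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Extract a field and shift it down to bit position zero */
static uint64_t toy_getfield(uint64_t mask, uint64_t word)
{
    return (word & mask) >> toy_mask_shift(mask);
}

/* Return 'word' with the field replaced by 'value' */
static uint64_t toy_setfield(uint64_t mask, uint64_t word, uint64_t value)
{
    return (word & ~mask) | ((value << toy_mask_shift(mask)) & mask);
}
```

For example, a valid, unmasked entry targeting END block 0x3, index 0x1234 with END data 0x55 is the word 0x8300123400000055.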
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 014a2e41f71f..c4c90a25758e 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -442,6 +442,91 @@ static const TypeInfo xive_source_info = {
> >>      .class_init    = xive_source_class_init,
> >>  };
> >>  
> >> +/*
> >> + * XIVE Router (aka. Virtualization Controller or IVRE)
> >> + */
> >> +
> >> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >> +{
> >> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> >> +
> >> +    return xrc->get_eas(xrtr, lisn, eas);
> >> +}
> >> +
> >> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >> +{
> >> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> >> +
> >> +    return xrc->set_eas(xrtr, lisn, eas);
> >> +}
> >> +
> >> +static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> >> +{
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xf);
> >> +    XiveEAS eas;
> >> +
> >> +    /* EAS cache lookup */
> >> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Unknown LISN %x\n", lisn);
> >> +        return;
> >> +    }
> > 
> > AFAICT a bad LISN here means a qemu error (in the source, probably),
> > not a user or guest error, so an assert() would be more appropriate.
> 
> hmm, I would say no because in the case of PowerNV, the firmware could
> have badly configured the ISN offset of a source which would notify the 
> router with a bad notification event data.

Ah, good point.  That's fine as it is then.

> >> +
> >> +    /* The IVRE has a State Bit Cache for its internal sources which
> >> +     * is also involved at this point. We skip the SBC lookup because
> >> +     * the state bits of the sources are modeled internally in QEMU.
> >> +     */
> >> +
> >> +    if (!(eas.w & EAS_VALID)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
> >> +        return;
> >> +    }
> >> +
> >> +    if (eas.w & EAS_MASKED) {
> >> +        /* Notification completed */
> >> +        return;
> >> +    }
> >> +}
> >> +
> >> +static Property xive_router_properties[] = {
> >> +    DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void xive_router_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
> >> +
> >> +    dc->desc    = "XIVE Router Engine";
> >> +    dc->props   = xive_router_properties;
> >> +    xfc->notify = xive_router_notify;
> >> +}
> >> +
> >> +static const TypeInfo xive_router_info = {
> >> +    .name          = TYPE_XIVE_ROUTER,
> >> +    .parent        = TYPE_SYS_BUS_DEVICE,
> >> +    .abstract      = true,
> >> +    .class_size    = sizeof(XiveRouterClass),
> >> +    .class_init    = xive_router_class_init,
> >> +    .interfaces    = (InterfaceInfo[]) {
> >> +        { TYPE_XIVE_FABRIC },
> > 
> > So as far as I can see so far, the XiveFabric interface will
> > essentially have to be implemented on the router object, so I'm not
> > seeing much point to having the interface rather than just a direct
> > call on the router object.  But I haven't read the whole series yet,
> > so maybe I'm missing something.
> 
> The PSIHB and PHB4 models are using it but they are not in the series.
> 
> I can send the PSIHB patch in the next version if you like, it's the 
> patch right after PnvXive. It's attached below for the moment. Look at 
> pnv_psi_notify().

Hrm, I see.  This seems like a really convoluted way of achieving what
you need here.  We want to abstract exactly how the source delivers
notifies, but doing it with an interface on some object that's not
necessarily either the source or the router seems odd.  At the very
least the names need to change (of both interface and property for the
target object).

> 
> Thanks,
> 
> C.
> 
> >> +        { }
> >> +    }
> >> +};
> >> +
> >> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
> >> +{
> >> +    if (!(eas->w & EAS_VALID)) {
> >> +        return;
> >> +    }
> >> +
> >> +    monitor_printf(mon, "  %08x %s end:%02x/%04x data:%08x\n",
> >> +                   lisn, eas->w & EAS_MASKED ? "M" : " ",
> >> +                   (uint8_t)  GETFIELD(EAS_END_BLOCK, eas->w),
> >> +                   (uint32_t) GETFIELD(EAS_END_INDEX, eas->w),
> >> +                   (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
> >> +}
> >> +
> >>  /*
> >>   * XIVE Fabric
> >>   */
> >> @@ -455,6 +540,7 @@ static void xive_register_types(void)
> >>  {
> >>      type_register_static(&xive_source_info);
> >>      type_register_static(&xive_fabric_info);
> >> +    type_register_static(&xive_router_info);
> >>  }
> >>  
> >>  type_init(xive_register_types)
> > 
> 

> >From 680fd6ff7c99e669708fbc5cfdbfcd95e83e7c07 Mon Sep 17 00:00:00 2001
> From: =?UTF-8?q?C=C3=A9dric=20Le=20Goater?= <clg@kaod.org>
> Date: Wed, 21 Nov 2018 10:29:45 +0100
> Subject: [PATCH] ppc/pnv: add a PSI bridge model for POWER9 processor
> MIME-Version: 1.0
> Content-Type: text/plain; charset=UTF-8
> Content-Transfer-Encoding: 8bit
> 
> The PSI bridge on POWER9 is very similar to POWER8. The BAR is still
> set through XSCOM but the controls are now entirely done with MMIOs.
> More interrupts are defined and the interrupt controller interface has
> changed to XIVE. The POWER9 model is a first example of the usage of
> the notify() handler of the XiveFabric interface, linking the PSI
> XiveSource to its owning device model.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/pnv.h       |   6 +
>  include/hw/ppc/pnv_psi.h   |  50 ++++-
>  include/hw/ppc/pnv_xscom.h |   3 +
>  hw/ppc/pnv.c               |  20 +-
>  hw/ppc/pnv_psi.c           | 390 ++++++++++++++++++++++++++++++++++---
>  5 files changed, 444 insertions(+), 25 deletions(-)
> 
> diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
> index c402e5d5844b..8be1147481f9 100644
> --- a/include/hw/ppc/pnv.h
> +++ b/include/hw/ppc/pnv.h
> @@ -88,6 +88,7 @@ typedef struct Pnv9Chip {
>  
>      /*< public >*/
>      PnvXive      xive;
> +    PnvPsi       psi;
>  } Pnv9Chip;
>  
>  typedef struct PnvChipClass {
> @@ -250,11 +251,16 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
>  #define PNV9_XIVE_PC_SIZE            0x0000001000000000ull
>  #define PNV9_XIVE_PC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006018000000000ull)
>  
> +#define PNV9_PSIHB_SIZE              0x0000000000100000ull
> +#define PNV9_PSIHB_BASE(chip)        PNV9_CHIP_BASE(chip, 0x0006030203000000ull)
> +
>  #define PNV9_XIVE_IC_SIZE            0x0000000000080000ull
>  #define PNV9_XIVE_IC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203100000ull)
>  
>  #define PNV9_XIVE_TM_SIZE            0x0000000000040000ull
>  #define PNV9_XIVE_TM_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203180000ull)
>  
> +#define PNV9_PSIHB_ESB_SIZE          0x0000000000010000ull
> +#define PNV9_PSIHB_ESB_BASE(chip)    PNV9_CHIP_BASE(chip, 0x00060302031c0000ull)
>  
>  #endif /* _PPC_PNV_H */
> diff --git a/include/hw/ppc/pnv_psi.h b/include/hw/ppc/pnv_psi.h
> index f6af5eae1fa8..b8f8d082bcf9 100644
> --- a/include/hw/ppc/pnv_psi.h
> +++ b/include/hw/ppc/pnv_psi.h
> @@ -21,10 +21,35 @@
>  
>  #include "hw/sysbus.h"
>  #include "hw/ppc/xics.h"
> +#include "hw/ppc/xive.h"
>  
>  #define TYPE_PNV_PSI "pnv-psi"
>  #define PNV_PSI(obj) \
>       OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI)
> +#define PNV_PSI_CLASS(klass) \
> +     OBJECT_CLASS_CHECK(PnvPsiClass, (klass), TYPE_PNV_PSI)
> +#define PNV_PSI_GET_CLASS(obj) \
> +     OBJECT_GET_CLASS(PnvPsiClass, (obj), TYPE_PNV_PSI)
> +
> +typedef struct PnvPsi PnvPsi;
> +typedef struct PnvChip PnvChip;
> +typedef struct PnvPsiClass {
> +    SysBusDeviceClass parent_class;
> +
> +    int chip_type;
> +    uint32_t xscom_pcba;
> +    uint32_t xscom_size;
> +
> +    void (*irq_set)(PnvPsi *psi, int, bool state);
> +} PnvPsiClass;
> +
> +#define TYPE_PNV_PSI_POWER8 TYPE_PNV_PSI "-POWER8"
> +#define PNV_PSI_POWER8(obj) \
> +    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER8)
> +
> +#define TYPE_PNV_PSI_POWER9 TYPE_PNV_PSI "-POWER9"
> +#define PNV_PSI_POWER9(obj) \
> +    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER9)
>  
>  #define PSIHB_XSCOM_MAX         0x20
>  
> @@ -38,9 +63,12 @@ typedef struct PnvPsi {
>      /* MemoryRegion fsp_mr; */
>      uint64_t fsp_bar;
>  
> -    /* Interrupt generation */
> +    /* P8 Interrupt generation */
>      ICSState ics;
>  
> +    /* P9 Interrupt generation */
> +    XiveSource source;
> +
>      /* Registers */
>      uint64_t regs[PSIHB_XSCOM_MAX];
>  
> @@ -60,6 +88,24 @@ typedef enum PnvPsiIrq {
>  
>  #define PSI_NUM_INTERRUPTS 6
>  
> -extern void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state);
> +/* P9 PSI Interrupts */
> +#define PSIHB9_IRQ_PSI          0
> +#define PSIHB9_IRQ_OCC          1
> +#define PSIHB9_IRQ_FSI          2
> +#define PSIHB9_IRQ_LPCHC        3
> +#define PSIHB9_IRQ_LOCAL_ERR    4
> +#define PSIHB9_IRQ_GLOBAL_ERR   5
> +#define PSIHB9_IRQ_TPM          6
> +#define PSIHB9_IRQ_LPC_SIRQ0    7
> +#define PSIHB9_IRQ_LPC_SIRQ1    8
> +#define PSIHB9_IRQ_LPC_SIRQ2    9
> +#define PSIHB9_IRQ_LPC_SIRQ3    10
> +#define PSIHB9_IRQ_SBE_I2C      11
> +#define PSIHB9_IRQ_DIO          12
> +#define PSIHB9_IRQ_PSU          13
> +#define PSIHB9_NUM_IRQS         14
> +
> +void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state);
> +void pnv_psi_pic_print_info(PnvPsi *psi, Monitor *mon);
>  
>  #endif /* _PPC_PNV_PSI_H */
> diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
> index 5bd43467a1ab..019b45bf9189 100644
> --- a/include/hw/ppc/pnv_xscom.h
> +++ b/include/hw/ppc/pnv_xscom.h
> @@ -82,6 +82,9 @@ typedef struct PnvXScomInterfaceClass {
>  #define PNV_XSCOM_PBCQ_SPCI_BASE  0x9013c00
>  #define PNV_XSCOM_PBCQ_SPCI_SIZE  0x5
>  
> +#define PNV9_XSCOM_PSIHB_BASE     0x5012900
> +#define PNV9_XSCOM_PSIHB_SIZE     0x100
> +
>  #define PNV9_XSCOM_XIVE_BASE      0x5013000
>  #define PNV9_XSCOM_XIVE_SIZE      0x300
>  
> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
> index b6af896c30e4..e67c9d7d3995 100644
> --- a/hw/ppc/pnv.c
> +++ b/hw/ppc/pnv.c
> @@ -743,7 +743,7 @@ static void pnv_chip_power8_instance_init(Object *obj)
>      PnvChipClass *pcc = PNV_CHIP_GET_CLASS(obj);
>      int i;
>  
> -    object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI);
> +    object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI_POWER8);
>      object_property_add_child(obj, "psi", OBJECT(&chip8->psi), NULL);
>      object_property_add_const_link(OBJECT(&chip8->psi), "xics",
>                                     OBJECT(qdev_get_machine()), &error_abort);
> @@ -923,6 +923,11 @@ static void pnv_chip_power9_instance_init(Object *obj)
>      object_property_add_child(obj, "xive", OBJECT(&chip9->xive), NULL);
>      object_property_add_const_link(OBJECT(&chip9->xive), "chip", obj,
>                                     &error_abort);
> +
> +    object_initialize(&chip9->psi, sizeof(chip9->psi), TYPE_PNV_PSI_POWER9);
> +    object_property_add_child(obj, "psi", OBJECT(&chip9->psi), NULL);
> +    object_property_add_const_link(OBJECT(&chip9->psi), "chip", obj,
> +                                   &error_abort);
>  }
>  
>  static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
> @@ -955,6 +960,18 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>      qdev_set_parent_bus(DEVICE(&chip9->xive), sysbus_get_default());
>      pnv_xscom_add_subregion(chip, PNV9_XSCOM_XIVE_BASE,
>                              &chip9->xive.xscom_regs);
> +
> +    /* Processor Service Interface (PSI) Host Bridge */
> +    object_property_set_int(OBJECT(&chip9->psi), PNV9_PSIHB_BASE(chip),
> +                            "bar", &error_fatal);
> +    object_property_set_bool(OBJECT(&chip9->psi), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(&chip9->psi), sysbus_get_default());
> +    pnv_xscom_add_subregion(chip, PNV9_XSCOM_PSIHB_BASE,
> +                            &chip9->psi.xscom_regs);
>  }
>  
>  static void pnv_chip_power9_class_init(ObjectClass *klass, void *data)
> @@ -1188,6 +1205,7 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
>              Pnv9Chip *chip9 = PNV9_CHIP(chip);
>  
>               pnv_xive_pic_print_info(&chip9->xive, mon);
> +             pnv_psi_pic_print_info(&chip9->psi, mon);
>          } else {
>              Pnv8Chip *chip8 = PNV8_CHIP(chip);
>  
> diff --git a/hw/ppc/pnv_psi.c b/hw/ppc/pnv_psi.c
> index 5b969127c303..8b85dd9555e8 100644
> --- a/hw/ppc/pnv_psi.c
> +++ b/hw/ppc/pnv_psi.c
> @@ -22,6 +22,7 @@
>  #include "target/ppc/cpu.h"
>  #include "qemu/log.h"
>  #include "qapi/error.h"
> +#include "monitor/monitor.h"
>  
>  #include "exec/address-spaces.h"
>  
> @@ -114,12 +115,14 @@
>  #define PSIHB_BAR_MASK                  0x0003fffffff00000ull
>  #define PSIHB_FSPBAR_MASK               0x0003ffff00000000ull
>  
> +#define PSIHB_REG(addr) (((addr) >> 3) + PSIHB_XSCOM_BAR)
> +
>  static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
>  {
>      MemoryRegion *sysmem = get_system_memory();
>      uint64_t old = psi->regs[PSIHB_XSCOM_BAR];
>  
> -    psi->regs[PSIHB_XSCOM_BAR] = bar & (PSIHB_BAR_MASK | PSIHB_BAR_EN);
> +    psi->regs[PSIHB_XSCOM_BAR] = bar;
>  
>      /* Update MR, always remove it first */
>      if (old & PSIHB_BAR_EN) {
> @@ -128,7 +131,7 @@ static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
>  
>      /* Then add it back if needed */
>      if (bar & PSIHB_BAR_EN) {
> -        uint64_t addr = bar & PSIHB_BAR_MASK;
> +        uint64_t addr = bar & ~PSIHB_BAR_EN;
>          memory_region_add_subregion(sysmem, addr, &psi->regs_mr);
>      }
>  }
> @@ -205,7 +208,12 @@ static const uint64_t stat_bits[] = {
>      [PSIHB_IRQ_EXTERNAL]  = PSIHB_IRQ_STAT_EXT,
>  };
>  
> -void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state)
> +void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state)
> +{
> +    PNV_PSI_GET_CLASS(psi)->irq_set(psi, irq, state);
> +}
> +
> +static void pnv_psi_power8_irq_set(PnvPsi *psi, int irq, bool state)
>  {
>      ICSState *ics = &psi->ics;
>      uint32_t xivr_reg;
> @@ -324,7 +332,7 @@ static uint64_t pnv_psi_reg_read(PnvPsi *psi, uint32_t offset, bool mmio)
>          val = psi->regs[offset];
>          break;
>      default:
> -        qemu_log_mask(LOG_UNIMP, "PSI: read at Ox%" PRIx32 "\n", offset);
> +        qemu_log_mask(LOG_UNIMP, "PSI: read at 0x%" PRIx32 "\n", offset);
>      }
>      return val;
>  }
> @@ -383,7 +391,7 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
>          pnv_psi_set_irsn(psi, val);
>          break;
>      default:
> -        qemu_log_mask(LOG_UNIMP, "PSI: write at Ox%" PRIx32 "\n", offset);
> +        qemu_log_mask(LOG_UNIMP, "PSI: write at 0x%" PRIx32 "\n", offset);
>      }
>  }
>  
> @@ -393,13 +401,13 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
>   */
>  static uint64_t pnv_psi_mmio_read(void *opaque, hwaddr addr, unsigned size)
>  {
> -    return pnv_psi_reg_read(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, true);
> +    return pnv_psi_reg_read(opaque, PSIHB_REG(addr), true);
>  }
>  
>  static void pnv_psi_mmio_write(void *opaque, hwaddr addr,
>                                uint64_t val, unsigned size)
>  {
> -    pnv_psi_reg_write(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, val, true);
> +    pnv_psi_reg_write(opaque, PSIHB_REG(addr), val, true);
>  }
>  
>  static const MemoryRegionOps psi_mmio_ops = {
> @@ -441,7 +449,7 @@ static const MemoryRegionOps pnv_psi_xscom_ops = {
>      }
>  };
>  
> -static void pnv_psi_init(Object *obj)
> +static void pnv_psi_power8_instance_init(Object *obj)
>  {
>      PnvPsi *psi = PNV_PSI(obj);
>  
> @@ -458,7 +466,7 @@ static const uint8_t irq_to_xivr[] = {
>      PSIHB_XSCOM_XIVR_EXT,
>  };
>  
> -static void pnv_psi_realize(DeviceState *dev, Error **errp)
> +static void pnv_psi_power8_realize(DeviceState *dev, Error **errp)
>  {
>      PnvPsi *psi = PNV_PSI(dev);
>      ICSState *ics = &psi->ics;
> @@ -510,28 +518,34 @@ static void pnv_psi_realize(DeviceState *dev, Error **errp)
>      }
>  }
>  
> +static const char compat_p8[] = "ibm,power8-psihb-x\0ibm,psihb-x";
> +static const char compat_p9[] = "ibm,power9-psihb-x\0ibm,psihb-x";
> +
>  static int pnv_psi_dt_xscom(PnvXScomInterface *dev, void *fdt, int xscom_offset)
>  {
> -    const char compat[] = "ibm,power8-psihb-x\0ibm,psihb-x";
> +    PnvPsiClass *ppc = PNV_PSI_GET_CLASS(dev);
>      char *name;
>      int offset;
> -    uint32_t lpc_pcba = PNV_XSCOM_PSIHB_BASE;
>      uint32_t reg[] = {
> -        cpu_to_be32(lpc_pcba),
> -        cpu_to_be32(PNV_XSCOM_PSIHB_SIZE)
> +        cpu_to_be32(ppc->xscom_pcba),
> +        cpu_to_be32(ppc->xscom_size)
>      };
>  
> -    name = g_strdup_printf("psihb@%x", lpc_pcba);
> +    name = g_strdup_printf("psihb@%x", ppc->xscom_pcba);
>      offset = fdt_add_subnode(fdt, xscom_offset, name);
>      _FDT(offset);
>      g_free(name);
>  
> -    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
> -
> -    _FDT((fdt_setprop_cell(fdt, offset, "#address-cells", 2)));
> -    _FDT((fdt_setprop_cell(fdt, offset, "#size-cells", 1)));
> -    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
> -                      sizeof(compat))));
> +    _FDT(fdt_setprop(fdt, offset, "reg", reg, sizeof(reg)));
> +    _FDT(fdt_setprop_cell(fdt, offset, "#address-cells", 2));
> +    _FDT(fdt_setprop_cell(fdt, offset, "#size-cells", 1));
> +    if (ppc->chip_type == PNV_CHIP_POWER9) {
> +        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p9,
> +                         sizeof(compat_p9)));
> +    } else {
> +        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p8,
> +                         sizeof(compat_p8)));
> +    }
>      return 0;
>  }
>  
> @@ -541,6 +555,324 @@ static Property pnv_psi_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> +static void pnv_psi_power8_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
> +
> +    dc->desc    = "PowerNV PSI Controller POWER8";
> +    dc->realize = pnv_psi_power8_realize;
> +
> +    ppc->chip_type =  PNV_CHIP_POWER8;
> +    ppc->xscom_pcba = PNV_XSCOM_PSIHB_BASE;
> +    ppc->xscom_size = PNV_XSCOM_PSIHB_SIZE;
> +    ppc->irq_set    = pnv_psi_power8_irq_set;
> +}
> +
> +static const TypeInfo pnv_psi_power8_info = {
> +    .name          = TYPE_PNV_PSI_POWER8,
> +    .parent        = TYPE_PNV_PSI,
> +    .instance_init = pnv_psi_power8_instance_init,
> +    .class_init    = pnv_psi_power8_class_init,
> +};
> +
> +/* Common registers */
> +
> +#define PSIHB9_CR                       0x20
> +#define PSIHB9_SEMR                     0x28
> +
> +/* P9 registers */
> +
> +#define PSIHB9_INTERRUPT_CONTROL        0x58
> +#define   PSIHB9_IRQ_METHOD             PPC_BIT(0)
> +#define   PSIHB9_IRQ_RESET              PPC_BIT(1)
> +#define PSIHB9_ESB_CI_BASE              0x60
> +#define   PSIHB9_ESB_CI_VALID           1
> +#define PSIHB9_ESB_NOTIF_ADDR           0x68
> +#define   PSIHB9_ESB_NOTIF_VALID        1
> +#define PSIHB9_IVT_OFFSET               0x70
> +#define   PSIHB9_IVT_OFF_SHIFT          32
> +
> +#define PSIHB9_IRQ_LEVEL                0x78 /* assertion */
> +#define   PSIHB9_IRQ_LEVEL_PSI          PPC_BIT(0)
> +#define   PSIHB9_IRQ_LEVEL_OCC          PPC_BIT(1)
> +#define   PSIHB9_IRQ_LEVEL_FSI          PPC_BIT(2)
> +#define   PSIHB9_IRQ_LEVEL_LPCHC        PPC_BIT(3)
> +#define   PSIHB9_IRQ_LEVEL_LOCAL_ERR    PPC_BIT(4)
> +#define   PSIHB9_IRQ_LEVEL_GLOBAL_ERR   PPC_BIT(5)
> +#define   PSIHB9_IRQ_LEVEL_TPM          PPC_BIT(6)
> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ1    PPC_BIT(7)
> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ2    PPC_BIT(8)
> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ3    PPC_BIT(9)
> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ4    PPC_BIT(10)
> +#define   PSIHB9_IRQ_LEVEL_SBE_I2C      PPC_BIT(11)
> +#define   PSIHB9_IRQ_LEVEL_DIO          PPC_BIT(12)
> +#define   PSIHB9_IRQ_LEVEL_PSU          PPC_BIT(13)
> +#define   PSIHB9_IRQ_LEVEL_I2C_C        PPC_BIT(14)
> +#define   PSIHB9_IRQ_LEVEL_I2C_D        PPC_BIT(15)
> +#define   PSIHB9_IRQ_LEVEL_I2C_E        PPC_BIT(16)
> +#define   PSIHB9_IRQ_LEVEL_SBE          PPC_BIT(19)
> +
> +#define PSIHB9_IRQ_STAT                 0x80 /* P bit */
> +#define   PSIHB9_IRQ_STAT_PSI           PPC_BIT(0)
> +#define   PSIHB9_IRQ_STAT_OCC           PPC_BIT(1)
> +#define   PSIHB9_IRQ_STAT_FSI           PPC_BIT(2)
> +#define   PSIHB9_IRQ_STAT_LPCHC         PPC_BIT(3)
> +#define   PSIHB9_IRQ_STAT_LOCAL_ERR     PPC_BIT(4)
> +#define   PSIHB9_IRQ_STAT_GLOBAL_ERR    PPC_BIT(5)
> +#define   PSIHB9_IRQ_STAT_TPM           PPC_BIT(6)
> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ1     PPC_BIT(7)
> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ2     PPC_BIT(8)
> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ3     PPC_BIT(9)
> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ4     PPC_BIT(10)
> +#define   PSIHB9_IRQ_STAT_SBE_I2C       PPC_BIT(11)
> +#define   PSIHB9_IRQ_STAT_DIO           PPC_BIT(12)
> +#define   PSIHB9_IRQ_STAT_PSU           PPC_BIT(13)
> +
> +static void pnv_psi_notify(XiveFabric *xf, uint32_t srcno)
> +{
> +    PnvPsi *psi = PNV_PSI(xf);
> +    uint64_t notif_port = psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
> +    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
> +    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
> +
> +    uint32_t offset =
> +        (psi->regs[PSIHB_REG(PSIHB9_IVT_OFFSET)] >> PSIHB9_IVT_OFF_SHIFT);
> +    uint64_t lisn = cpu_to_be64(offset + srcno);
> +
> +    if (valid) {
> +        cpu_physical_memory_write(notify_addr, &lisn, sizeof(lisn));
> +    }
> +}
> +
> +/*
> + * TODO : move to parent class
> + */
> +static void pnv_psi_reset(DeviceState *dev)
> +{
> +    PnvPsi *psi = PNV_PSI(dev);
> +
> +    memset(psi->regs, 0x0, sizeof(psi->regs));
> +
> +    psi->regs[PSIHB_XSCOM_BAR] = psi->bar | PSIHB_BAR_EN;
> +}
> +
> +static uint64_t pnv_psi_p9_mmio_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    PnvPsi *psi = PNV_PSI(opaque);
> +    uint32_t reg = PSIHB_REG(addr);
> +    uint64_t val = -1;
> +
> +    switch (addr) {
> +    case PSIHB9_CR:
> +    case PSIHB9_SEMR:
> +        /* FSP stuff */
> +    case PSIHB9_INTERRUPT_CONTROL:
> +    case PSIHB9_ESB_CI_BASE:
> +    case PSIHB9_ESB_NOTIF_ADDR:
> +    case PSIHB9_IVT_OFFSET:
> +        val = psi->regs[reg];
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: read at 0x%" PRIx64 "\n", addr);
> +    }
> +
> +    return val;
> +}
> +
> +static void pnv_psi_p9_mmio_write(void *opaque, hwaddr addr,
> +                                  uint64_t val, unsigned size)
> +{
> +    PnvPsi *psi = PNV_PSI(opaque);
> +    uint32_t reg = PSIHB_REG(addr);
> +    MemoryRegion *sysmem = get_system_memory();
> +
> +    switch (addr) {
> +    case PSIHB9_CR:
> +    case PSIHB9_SEMR:
> +        /* FSP stuff */
> +        break;
> +    case PSIHB9_INTERRUPT_CONTROL:
> +        if (val & PSIHB9_IRQ_RESET) {
> +            device_reset(DEVICE(&psi->source));
> +        }
> +        psi->regs[reg] = val;
> +        break;
> +
> +    case PSIHB9_ESB_CI_BASE:
> +        if (!(val & PSIHB9_ESB_CI_VALID)) {
> +            if (psi->regs[reg] & PSIHB9_ESB_CI_VALID) {
> +                memory_region_del_subregion(sysmem, &psi->source.esb_mmio);
> +            }
> +        } else {
> +            if (!(psi->regs[reg] & PSIHB9_ESB_CI_VALID)) {
> +                memory_region_add_subregion(sysmem,
> +                                        val & ~PSIHB9_ESB_CI_VALID,
> +                                        &psi->source.esb_mmio);
> +            }
> +        }
> +        psi->regs[reg] = val;
> +        break;
> +
> +    case PSIHB9_ESB_NOTIF_ADDR:
> +        psi->regs[reg] = val;
> +        break;
> +    case PSIHB9_IVT_OFFSET:
> +        psi->regs[reg] = val;
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: write at 0x%" PRIx64 "\n", addr);
> +    }
> +}
> +
> +static const MemoryRegionOps pnv_psi_p9_mmio_ops = {
> +    .read = pnv_psi_p9_mmio_read,
> +    .write = pnv_psi_p9_mmio_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static uint64_t pnv_psi_p9_xscom_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    /* No reads are expected */
> +    qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom read at 0x%" PRIx64 "\n", addr);
> +    return -1;
> +}
> +
> +static void pnv_psi_p9_xscom_write(void *opaque, hwaddr addr,
> +                                uint64_t val, unsigned size)
> +{
> +    PnvPsi *psi = PNV_PSI(opaque);
> +
> +    /* XSCOM is only used to set the PSIHB MMIO region */
> +    switch (addr >> 3) {
> +    case PSIHB_XSCOM_BAR:
> +        pnv_psi_set_bar(psi, val);
> +        break;
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom write at 0x%" PRIx64 "\n",
> +                      addr);
> +    }
> +}
> +
> +static const MemoryRegionOps pnv_psi_p9_xscom_ops = {
> +    .read = pnv_psi_p9_xscom_read,
> +    .write = pnv_psi_p9_xscom_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +static void pnv_psi_power9_irq_set(PnvPsi *psi, int irq, bool state)
> +{
> +    uint64_t irq_method = psi->regs[PSIHB_REG(PSIHB9_INTERRUPT_CONTROL)];
> +
> +    if (irq >= PSIHB9_NUM_IRQS) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: Unsupported irq %d\n", irq);
> +        return;
> +    }
> +
> +    if (irq_method & PSIHB9_IRQ_METHOD) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: LSI IRQ method not supported\n");
> +        return;
> +    }
> +
> +    /* Update LSI levels */
> +    if (state) {
> +        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] |= PPC_BIT(irq);
> +    } else {
> +        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] &= ~PPC_BIT(irq);
> +    }
> +
> +    qemu_set_irq(xive_source_qirq(&psi->source, irq), state);
> +}
> +
> +static void pnv_psi_power9_instance_init(Object *obj)
> +{
> +    PnvPsi *psi = PNV_PSI(obj);
> +
> +    object_initialize(&psi->source, sizeof(psi->source), TYPE_XIVE_SOURCE);
> +    object_property_add_child(obj, "source", OBJECT(&psi->source), NULL);
> +}
> +
> +static void pnv_psi_power9_realize(DeviceState *dev, Error **errp)
> +{
> +    PnvPsi *psi = PNV_PSI(dev);
> +    XiveSource *xsrc = &psi->source;
> +    Error *local_err = NULL;
> +    int i;
> +
> +    /* This is the only device with 4k ESB pages */
> +    object_property_set_int(OBJECT(xsrc), XIVE_ESB_4K, "shift",
> +                            &error_fatal);
> +    object_property_set_int(OBJECT(xsrc), PSIHB9_NUM_IRQS, "nr-irqs",
> +                            &error_fatal);
> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(psi),
> +                                   &error_fatal);
> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
> +
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        xive_source_irq_set(xsrc, i, true);
> +    }
> +
> +    /* XSCOM region for PSI registers */
> +    pnv_xscom_region_init(&psi->xscom_regs, OBJECT(dev), &pnv_psi_p9_xscom_ops,
> +                psi, "xscom-psi", PNV9_XSCOM_PSIHB_SIZE);
> +
> +    /* Initialize MMIO region */
> +    memory_region_init_io(&psi->regs_mr, OBJECT(dev), &pnv_psi_p9_mmio_ops, psi,
> +                          "psihb", PNV9_PSIHB_SIZE);
> +
> +    /* Default BAR for MMIO region */
> +    pnv_psi_set_bar(psi, psi->bar | PSIHB_BAR_EN);
> +}
> +
> +static void pnv_psi_power9_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
> +
> +    dc->desc    = "PowerNV PSI Controller POWER9";
> +    dc->realize = pnv_psi_power9_realize;
> +
> +    ppc->chip_type  = PNV_CHIP_POWER9;
> +    ppc->xscom_pcba = PNV9_XSCOM_PSIHB_BASE;
> +    ppc->xscom_size = PNV9_XSCOM_PSIHB_SIZE;
> +    ppc->irq_set    = pnv_psi_power9_irq_set;
> +
> +    xfc->notify      = pnv_psi_notify;
> +}
> +
> +static const TypeInfo pnv_psi_power9_info = {
> +    .name          = TYPE_PNV_PSI_POWER9,
> +    .parent        = TYPE_PNV_PSI,
> +    .instance_init = pnv_psi_power9_instance_init,
> +    .class_init    = pnv_psi_power9_class_init,
> +    .interfaces = (InterfaceInfo[]) {
> +            { TYPE_XIVE_FABRIC },
> +            { },
> +    },
> +};
> +
>  static void pnv_psi_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
> @@ -548,16 +880,18 @@ static void pnv_psi_class_init(ObjectClass *klass, void *data)
>  
>      xdc->dt_xscom = pnv_psi_dt_xscom;
>  
> -    dc->realize = pnv_psi_realize;
> +    dc->desc = "PowerNV PSI Controller";
>      dc->props = pnv_psi_properties;
> +    dc->reset  = pnv_psi_reset;
>  }
>  
>  static const TypeInfo pnv_psi_info = {
>      .name          = TYPE_PNV_PSI,
>      .parent        = TYPE_SYS_BUS_DEVICE,
>      .instance_size = sizeof(PnvPsi),
> -    .instance_init = pnv_psi_init,
>      .class_init    = pnv_psi_class_init,
> +    .class_size    = sizeof(PnvPsiClass),
> +    .abstract      = true,
>      .interfaces    = (InterfaceInfo[]) {
>          { TYPE_PNV_XSCOM_INTERFACE },
>          { }
> @@ -567,6 +901,18 @@ static const TypeInfo pnv_psi_info = {
>  static void pnv_psi_register_types(void)
>  {
>      type_register_static(&pnv_psi_info);
> +    type_register_static(&pnv_psi_power8_info);
> +    type_register_static(&pnv_psi_power9_info);
>  }
>  
>  type_init(pnv_psi_register_types)
> +
> +void pnv_psi_pic_print_info(PnvPsi *psi, Monitor *mon)
> +{
> +    uint32_t offset =
> +        (psi->regs[PSIHB_REG(PSIHB9_IVT_OFFSET)] >> PSIHB9_IVT_OFF_SHIFT);
> +
> +    monitor_printf(mon, "PSIHB Source %08x .. %08x\n",
> +                  offset, offset + psi->source.nr_irqs - 1);
> +    xive_source_pic_print_info(&psi->source, offset, mon);
> +}
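
For readers following the notify() path in the patch above, here is a minimal
standalone sketch of how the forwarded source number is derived: the IVT
offset is taken from the upper word of the PSIHB9_IVT_OFFSET register, the
source number is added to it, and the result is stored as a 64-bit big-endian
value. The helper names sketch_psi_lisn() and sketch_store_be64() are
hypothetical and exist only for this illustration; only the shift value comes
from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shift taken from the patch: the IVT offset sits in the upper 32 bits. */
#define PSIHB9_IVT_OFF_SHIFT 32

/* Hypothetical helper: LISN forwarded for source 'srcno'. */
static uint32_t sketch_psi_lisn(uint64_t ivt_offset_reg, uint32_t srcno)
{
    uint32_t offset = (uint32_t)(ivt_offset_reg >> PSIHB9_IVT_OFF_SHIFT);

    return offset + srcno;
}

/*
 * Hypothetical helper: store a 64-bit value most significant byte first,
 * whatever the host endianness is. This mirrors what cpu_to_be64()
 * followed by a raw memory write achieves in the patch.
 */
static void sketch_store_be64(uint8_t *out, uint64_t val)
{
    int i;

    assert(out != NULL);
    for (i = 0; i < 8; i++) {
        out[i] = (uint8_t)(val >> (56 - 8 * i));
    }
}
```

With an IVT offset register of 0x0000001000000000 and source number 3, the
helper yields 0x13, which ends up in the last byte of the 8-byte buffer.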


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-22  6:49     ` Benjamin Herrenschmidt
@ 2018-11-23  3:51       ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-23  3:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel

On Thu, Nov 22, 2018 at 05:49:09PM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2018-11-22 at 15:41 +1100, David Gibson wrote:
> > 
> > > +void xive_end_reset(XiveEND *end)
> > > +{
> > > +    memset(end, 0, sizeof(*end));
> > > +
> > > +    /* switch off the escalation and notification ESBs */
> > > +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
> > 
> > It's not obvious to me what circumstances this would be called under.
> > Since the ENDs are in system memory, a memset() seems like an odd
> > thing for (virtual) hardware to be doing to it.
> > 
> > > +}
> 
> Not on PAPR ...

Right, so the memset() can go in PAPR specific code.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-22 21:47     ` Cédric Le Goater
@ 2018-11-23  4:35       ` David Gibson
  2018-11-23 11:01         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-23  4:35 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Nov 22, 2018 at 10:47:44PM +0100, Cédric Le Goater wrote:
> On 11/22/18 5:41 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:56:58AM +0100, Cédric Le Goater wrote:
> >> To complete the event routing, the IVRE sub-engine uses an internal
> >> table containing Event Notification Descriptor (END) structures.
> >>
> >> An END specifies on which Event Queue (EQ) the event notification
> >> data, defined in the associated EAS, should be posted when an
> >> exception occurs. It also defines which Notification Virtual Target
> >> (NVT) should be notified.
> >>
> >> The Event Queue is a memory page provided by the O/S defining a
> >> circular buffer, one per server and priority couple, containing Event
> >> Queue entries. These are 4 bytes long, the first bit being a
> >> 'generation' bit and the 31 following bits the END Data field. They
> >> are pulled by the O/S when the exception occurs.
> >>
> >> The END Data field is a way to set an invariant logical event source
> >> number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG hcall
> >> when the EISN flag is used.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/xive.h      |  18 ++++
> >>  include/hw/ppc/xive_regs.h |  48 ++++++++++
> >>  hw/intc/xive.c             | 185 ++++++++++++++++++++++++++++++++++++-
> >>  3 files changed, 248 insertions(+), 3 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 5a0696366577..ce62aaf28343 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -193,11 +193,29 @@ typedef struct XiveRouterClass {
> >>      /* XIVE table accessors */
> >>      int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >>      int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >> +    int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >> +                   XiveEND *end);
> >> +    int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >> +                   XiveEND *end);
> > 
> > Hrm.  So unlike the EAS, which is basically just a word, the END is a
> > pretty large structure.  
> 
> yes. and so will be the NVT.
> 
> > It's unclear here if get/set are expected to copy the whole thing out 
> > and in, 
> 
> That's the plan. 

Yeah, I don't think that's a good idea.  In some cases the updates are
on hot paths, so the extra copy isn't good, and more importantly it
makes it look like an atomic update, but it's not really.

Well... I guess it probably is because of the BQL, but I'd prefer not
to rely on that excessively.

> What I had in mind are memory accessors to the XIVE structures, which 
> are local to QEMU for sPAPR and in the guest RAM for PowerNV (Please
> take a look at the XIVE PowerNV model).
> 
> > or if get give you a pointer into a "live" structure 
> 
> no
> 
> > and set just does any necessary barriers after an update.
> that would be too complex for the PowerNV model I think. There is a cache
> in between the software running on the (QEMU) machine and the XIVE HW but
> it would be hard to handle. 
>  
> > Really, for a non-atomic value like this, I'm not sure get/set is the
> > right model.
> 
> ok. we need something to get them out and in.

I've thought about this a bit more.  What I think might work is
"end_read" and "end_write" callbacks, which take a word number in
addition to the parameters you have already.
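
A rough sketch of what such word-granular accessors could look like, to make
the idea concrete. Everything here is an assumption made for illustration:
the SketchRouter type, the table size, the eight-word END layout and the
function names are not taken from the series, and a real implementation
would go through XiveRouterClass and the backing store of each machine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define END_NR_WORDS 8   /* assumption: an END is eight 32-bit words */
#define NR_ENDS      4   /* hypothetical table size, illustration only */

/* Hypothetical stand-in for a router backed by a single END table. */
typedef struct SketchRouter {
    uint32_t end_table[NR_ENDS][END_NR_WORDS];
} SketchRouter;

/* Read one word of an END; returns 0 on success, -1 on a bad address. */
static int sketch_end_read(SketchRouter *xrtr, uint8_t end_blk,
                           uint32_t end_idx, unsigned int word,
                           uint32_t *val)
{
    assert(xrtr != NULL && val != NULL);
    /* This sketch only models block 0. */
    if (end_blk != 0 || end_idx >= NR_ENDS || word >= END_NR_WORDS) {
        return -1;
    }
    *val = xrtr->end_table[end_idx][word];
    return 0;
}

/* Write one word of an END; returns 0 on success, -1 on a bad address. */
static int sketch_end_write(SketchRouter *xrtr, uint8_t end_blk,
                            uint32_t end_idx, unsigned int word,
                            uint32_t val)
{
    assert(xrtr != NULL);
    if (end_blk != 0 || end_idx >= NR_ENDS || word >= END_NR_WORDS) {
        return -1;
    }
    xrtr->end_table[end_idx][word] = val;
    return 0;
}
```

The point of the per-word interface is that a caller updating, say, only w1
touches a single 32-bit word instead of copying the whole descriptor out and
back in.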

> > Also as I understand it nearly all the indices in XIVE are broken into
> > block/index.  Is there a reason those are folded together into lisn
> > for the EAS, but not for the END?
> 
> The indexing of the EAT is global to the system and the index defines
> which blk to use. The IRQ source numbers on the powerbus are architected 
> to be :
> 
>     #define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))
> 
> and XIVE can use different strategies to identify the XIVE IC in charge 
> of routing. It can be a one-to-one chip to block relation as skiboot does. 
> Using a block scope table is possible also. Our model only supports one 
> block per chip and some shortcuts are taken but not that much in fact.
>  
> Remote access to the XIVE structures of another chip are done through 
> MMIO (not modeled in PowerNV) and the blkid is used to partition the MMIO 
> regions. Being local is better for performance because the END and NVT 
> tables have a strong relation with the XIVE subengines using them 
> (VC and PC). 
> 
> Maybe Ben can clarify if this is badly explained.

Right.. I think I understand what the blocks are all about.

But my question is, why encode the block and index together for the
EAS, but separately for the END?
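
For reference, the combined encoding under discussion can be sketched as
follows. XIVE_SRCNO() is the macro quoted above; the two inverse macros and
the checked constructor are hypothetical names added only to show how a
4-bit block and a 28-bit index share one 32-bit source number.

```c
#include <assert.h>
#include <stdint.h>

/* Macro as quoted in the thread. */
#define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))

/* Hypothetical inverse helpers, not taken from the series. */
#define XIVE_SRCNO_BLOCK(srcno)   (((srcno) >> 28) & 0xf)
#define XIVE_SRCNO_INDEX(srcno)   ((srcno) & 0x0fffffff)

/* Hypothetical checked constructor: rejects values that would overlap. */
static uint32_t sketch_make_srcno(uint8_t blk, uint32_t idx)
{
    assert(blk < 16);            /* block must fit in the top 4 bits */
    assert(idx < (1u << 28));    /* index must fit in the low 28 bits */
    return XIVE_SRCNO(blk, idx);
}
```

For example, block 2 and index 0x10 combine into 0x20000010, and the inverse
helpers recover 2 and 0x10 from it.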

> 
> >>  } XiveRouterClass;
> >>  
> >>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> >>  
> >>  int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >>  int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
> >> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >> +                        XiveEND *end);
> >> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >> +                        XiveEND *end);
> >> +
> >> +/*
> >> + * For legacy compatibility, the exceptions define up to 256 different
> >> + * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> >> + * and the least favored level 0xFF.
> >> + */
> >> +#define XIVE_PRIORITY_MAX  7
> >> +
> >> +void xive_end_reset(XiveEND *end);
> >> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
> >>  
> >>  #endif /* PPC_XIVE_H */
> >> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> >> index 12499b33614c..f97fb2b90bee 100644
> >> --- a/include/hw/ppc/xive_regs.h
> >> +++ b/include/hw/ppc/xive_regs.h
> >> @@ -28,4 +28,52 @@ typedef struct XiveEAS {
> >>  #define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
> >>  } XiveEAS;
> >>  
> >> +/* Event Notification Descriptor (END) */
> >> +typedef struct XiveEND {
> >> +        uint32_t        w0;
> >> +#define END_W0_VALID             PPC_BIT32(0) /* "v" bit */
> >> +#define END_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
> >> +#define END_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
> >> +#define END_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
> >> +#define END_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
> >> +#define END_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
> >> +#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
> >> +#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
> >> +#define END_W0_QSIZE             PPC_BITMASK32(12, 15)
> >> +#define END_W0_SW0               PPC_BIT32(16)
> >> +#define END_W0_FIRMWARE          END_W0_SW0 /* Owned by FW */
> >> +#define END_QSIZE_4K             0
> >> +#define END_QSIZE_64K            4
> >> +#define END_W0_HWDEP             PPC_BITMASK32(24, 31)
> >> +        uint32_t        w1;
> >> +#define END_W1_ESn               PPC_BITMASK32(0, 1)
> >> +#define END_W1_ESn_P             PPC_BIT32(0)
> >> +#define END_W1_ESn_Q             PPC_BIT32(1)
> >> +#define END_W1_ESe               PPC_BITMASK32(2, 3)
> >> +#define END_W1_ESe_P             PPC_BIT32(2)
> >> +#define END_W1_ESe_Q             PPC_BIT32(3)
> >> +#define END_W1_GENERATION        PPC_BIT32(9)
> >> +#define END_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
> >> +        uint32_t        w2;
> >> +#define END_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
> >> +#define END_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
> >> +        uint32_t        w3;
> >> +#define END_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
> >> +        uint32_t        w4;
> >> +#define END_W4_ESC_END_BLOCK     PPC_BITMASK32(4, 7)
> >> +#define END_W4_ESC_END_INDEX     PPC_BITMASK32(8, 31)
> >> +        uint32_t        w5;
> >> +#define END_W5_ESC_END_DATA      PPC_BITMASK32(1, 31)
> >> +        uint32_t        w6;
> >> +#define END_W6_FORMAT_BIT        PPC_BIT32(8)
> >> +#define END_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
> >> +#define END_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
> >> +        uint32_t        w7;
> >> +#define END_W7_F0_IGNORE         PPC_BIT32(0)
> >> +#define END_W7_F0_BLK_GROUPING   PPC_BIT32(1)
> >> +#define END_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
> >> +#define END_W7_F1_WAKEZ          PPC_BIT32(0)
> >> +#define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
> >> +} XiveEND;
> >> +
> >>  #endif /* PPC_XIVE_REGS_H */
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index c4c90a25758e..9cb001e7b540 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -442,6 +442,101 @@ static const TypeInfo xive_source_info = {
> >>      .class_init    = xive_source_class_init,
> >>  };
> >>  
> >> +/*
> >> + * XiveEND helpers
> >> + */
> >> +
> >> +void xive_end_reset(XiveEND *end)
> >> +{
> >> +    memset(end, 0, sizeof(*end));
> >> +
> >> +    /* switch off the escalation and notification ESBs */
> >> +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
> > 
> > It's not obvious to me what circumstances this would be called under.
> > Since the ENDs are in system memory, a memset() seems like an odd
> > thing for (virtual) hardware to be doing to it.
> 
> It makes sense on sPAPR if one day some OS starts using the END ESBs for 
> further coalescing of the events. None does for now but I have added the 
> model though.

Hrm, I think that belongs in PAPR specific code.  It's not really part
of the router model - it's the PAPR stuff configuring the router at
reset time (much as firmware would configure it at reset time for bare
metal).

> 
> >> +}
> >> +
> >> +static void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width,
> >> +                                          Monitor *mon)
> >> +{
> >> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
> >> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
> >> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
> >> +    uint32_t qentries = 1 << (qsize + 10);
> >> +    int i;
> >> +
> >> +    /*
> >> +     * print out the [ (qindex - (width - 1)) .. (qindex + 1)] window
> >> +     */
> >> +    monitor_printf(mon, " [ ");
> >> +    qindex = (qindex - (width - 1)) & (qentries - 1);
> >> +    for (i = 0; i < width; i++) {
> >> +        uint64_t qaddr = qaddr_base + (qindex << 2);
> >> +        uint32_t qdata = -1;
> >> +
> >> +        if (dma_memory_read(&address_space_memory, qaddr, &qdata,
> >> +                            sizeof(qdata))) {
> >> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ @0x%"
> >> +                          HWADDR_PRIx "\n", qaddr);
> >> +            return;
> >> +        }
> >> +        monitor_printf(mon, "%s%08x ", i == width - 1 ? "^" : "",
> >> +                       be32_to_cpu(qdata));
> >> +        qindex = (qindex + 1) & (qentries - 1);
> >> +    }
> >> +    monitor_printf(mon, "]\n");
> >> +}
> >> +
> >> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon)
> >> +{
> >> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
> >> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
> >> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
> >> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
> >> +    uint32_t qentries = 1 << (qsize + 10);
> >> +
> >> +    uint32_t nvt = GETFIELD(END_W6_NVT_INDEX, end->w6);
> >> +    uint8_t priority = GETFIELD(END_W7_F0_PRIORITY, end->w7);
> >> +
> >> +    if (!(end->w0 & END_W0_VALID)) {
> >> +        return;
> >> +    }
> >> +
> >> +    monitor_printf(mon, "  %08x %c%c%c%c%c prio:%d nvt:%04x eq:@%08"PRIx64
> >> +                   "% 6d/%5d ^%d", end_idx,
> >> +                   end->w0 & END_W0_VALID ? 'v' : '-',
> >> +                   end->w0 & END_W0_ENQUEUE ? 'q' : '-',
> >> +                   end->w0 & END_W0_UCOND_NOTIFY ? 'n' : '-',
> >> +                   end->w0 & END_W0_BACKLOG ? 'b' : '-',
> >> +                   end->w0 & END_W0_ESCALATE_CTL ? 'e' : '-',
> >> +                   priority, nvt, qaddr_base, qindex, qentries, qgen);
> >> +
> >> +    xive_end_queue_pic_print_info(end, 6, mon);
> >> +}
> >> +
> >> +static void xive_end_push(XiveEND *end, uint32_t data)
> > 
> > s/push/enqueue/ please, "push" suggests a stack.  (Not to mention that
> > "push" and "pull" are used as terms elsewhere in XIVE).
> 
> yes. you are right. I will change.
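
Whatever the name, the routine is a circular-buffer enqueue with a
generation bit. A minimal sketch, assuming an in-memory array in place of
the DMA write and ignoring the big-endian conversion the patch performs;
EventQueue and event_queue_enqueue() are hypothetical names:

```c
#include <stdint.h>

/* Hypothetical in-memory stand-in for an END event queue. The patch
 * keeps these fields in END words w0/w1 and writes entries with DMA. */
typedef struct {
    uint32_t *entries;   /* backing store with nentries slots */
    uint32_t nentries;   /* power of two, 1 << (qsize + 10) in the patch */
    uint32_t index;      /* next slot to write (END_W1_PAGE_OFF) */
    uint32_t gen;        /* generation bit (END_W1_GENERATION) */
} EventQueue;

static void event_queue_enqueue(EventQueue *q, uint32_t data)
{
    /* Top bit carries the generation, low 31 bits the event data. */
    q->entries[q->index] = (q->gen << 31) | (data & 0x7fffffff);

    /* Advance with wrap-around; flip the generation on each wrap. */
    q->index = (q->index + 1) & (q->nentries - 1);
    if (q->index == 0) {
        q->gen ^= 1;
    }
}
```

With qsize = END_QSIZE_4K (0), the patch computes 1 << 10 = 1024 entries
of 4 bytes each, i.e. one 4K page.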
> 
> >> +{
> >> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
> >> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
> >> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
> >> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
> >> +
> >> +    uint64_t qaddr = qaddr_base + (qindex << 2);
> >> +    uint32_t qdata = cpu_to_be32((qgen << 31) | (data & 0x7fffffff));
> >> +    uint32_t qentries = 1 << (qsize + 10);
> >> +
> >> +    if (dma_memory_write(&address_space_memory, qaddr, &qdata, sizeof(qdata))) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write END data @0x%"
> >> +                      HWADDR_PRIx "\n", qaddr);
> >> +        return;
> >> +    }
> >> +
> >> +    qindex = (qindex + 1) & (qentries - 1);
> >> +    if (qindex == 0) {
> >> +        qgen ^= 1;
> >> +        end->w1 = SETFIELD(END_W1_GENERATION, end->w1, qgen);
> >> +    }
> >> +    end->w1 = SETFIELD(END_W1_PAGE_OFF, end->w1, qindex);
> >> +}
> >> +
> >>  /*
> >>   * XIVE Router (aka. Virtualization Controller or IVRE)
> >>   */
> >> @@ -460,6 +555,82 @@ int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >>      return xrc->set_eas(xrtr, lisn, eas);
> >>  }
> >>  
> >> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >> +                        XiveEND *end)
> >> +{
> >> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> >> +
> >> +   return xrc->get_end(xrtr, end_blk, end_idx, end);
> >> +}
> >> +
> >> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >> +                        XiveEND *end)
> >> +{
> >> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> >> +
> >> +   return xrc->set_end(xrtr, end_blk, end_idx, end);
> >> +}
> >> +
> >> +/*
> >> + * An END trigger can come from an event trigger (IPI or HW) or from
> >> + * another chip. We don't model the PowerBus but the END trigger
> >> + * message has the same parameters than in the function below.
> >> + */
> >> +static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
> >> +                                   uint32_t end_idx, uint32_t end_data)
> >> +{
> >> +    XiveEND end;
> >> +    uint8_t priority;
> >> +    uint8_t format;
> >> +
> >> +    /* END cache lookup */
> >> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
> >> +                      end_idx);
> >> +        return;
> >> +    }
> >> +
> >> +    if (!(end.w0 & END_W0_VALID)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
> >> +                      end_blk, end_idx);
> >> +        return;
> >> +    }
> >> +
> >> +    if (end.w0 & END_W0_ENQUEUE) {
> >> +        xive_end_push(&end, end_data);
> >> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
> >> +    }
> >> +
> >> +    /*
> >> +     * The W7 format depends on the F bit in W6. It defines the type
> >> +     * of the notification :
> >> +     *
> >> +     *   F=0 : single or multiple NVT notification
> >> +     *   F=1 : User level Event-Based Branch (EBB) notification, no
> >> +     *         priority
> >> +     */
> >> +    format = GETFIELD(END_W6_FORMAT_BIT, end.w6);
> >> +    priority = GETFIELD(END_W7_F0_PRIORITY, end.w7);
> >> +
> >> +    /* The END is masked */
> >> +    if (format == 0 && priority == 0xff) {
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Check the END ESn (Event State Buffer for notification) for
> >> +     * even futher coalescing in the Router
> >> +     */
> >> +    if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
> >> +        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Follows IVPE notification
> >> +     */
> >> +}
> >> +
> >>  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> >>  {
> >>      XiveRouter *xrtr = XIVE_ROUTER(xf);
> >> @@ -471,9 +642,9 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> >>          return;
> >>      }
> >>  
> >> -    /* The IVRE has a State Bit Cache for its internal sources which
> >> -     * is also involed at this point. We skip the SBC lookup because
> >> -     * the state bits of the sources are modeled internally in QEMU.
> >> +    /* The IVRE checks the State Bit Cache at this point. We skip the
> >> +     * SBC lookup because the state bits of the sources are modeled
> >> +     * internally in QEMU.
> > 
> > Replacing a comment about something we're not doing with a different
> > comment about something we're not doing doesn't seem very useful.
> > Maybe fold these together into one patch or the other.
> 
> That's me rephrasing. It should indeed be in the previous patch.
> 
> Thanks,
> 
> C.
> 
> >>       */
> >>  
> >>      if (!(eas.w & EAS_VALID)) {
> >> @@ -485,6 +656,14 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> >>          /* Notification completed */
> >>          return;
> >>      }
> >> +
> >> +    /*
> >> +     * The event trigger becomes an END trigger
> >> +     */
> >> +    xive_router_end_notify(xrtr,
> >> +                           GETFIELD(EAS_END_BLOCK, eas.w),
> >> +                           GETFIELD(EAS_END_INDEX, eas.w),
> >> +                           GETFIELD(EAS_END_DATA,  eas.w));
> >>  }
> >>  
> >>  static Property xive_router_properties[] = {
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-22 21:58     ` Cédric Le Goater
@ 2018-11-23  4:36       ` David Gibson
  2018-11-23  7:28         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-23  4:36 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Nov 22, 2018 at 10:58:56PM +0100, Cédric Le Goater wrote:
> On 11/22/18 6:13 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
> >> The Event Notification Descriptor also contains two Event State
> >> Buffers providing further coalescing of interrupts, one for the
> >> notification event (ESn) and one for the escalation events (ESe). A
> >> MMIO page is assigned for each to control the EOI through loads
> >> only. Stores are not allowed.
> >>
> >> The END ESBs are modeled through an object resembling the 'XiveSource'.
> >> It is stateless as the END state bits are kept in the XiveEND
> >> structure under the XiveRouter and the MMIO accesses follow the same
> >> rules as for the standard source ESBs.
> >>
> >> END ESBs are not supported by the Linux drivers, neither on OPAL nor on
> >> sPAPR. Nevertheless, the model provides a means to study the question
> >> in the future and validates the XIVE model a bit more.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/xive.h |  20 ++++++
> >>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 178 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index ce62aaf28343..24301bf2076d 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>                          XiveEND *end);
> >>  
> >> +/*
> >> + * XIVE END ESBs
> >> + */
> >> +
> >> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
> >> +#define XIVE_END_SOURCE(obj) \
> >> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> > 
> > Is there a particular reason to make this a full QOM object, rather
> > than just embedding it in the XiveRouter?
> 
> Yes, it should probably be under the XiveRouter, you are right, because
> there is a direct link with the ENDT which is in the XiveRouter. 
> 
> But if I remove the chip_id field from the XiveRouter, it becomes a QOM
> interface. something to ponder.

Huh?  I really don't understand what you're saying here.  What does
chip_id have to do with anything?

>  
> >> +typedef struct XiveENDSource {
> >> +    SysBusDevice parent;
> >> +
> >> +    uint32_t        nr_ends;
> >> +
> >> +    /* ESB memory region */
> >> +    uint32_t        esb_shift;
> >> +    MemoryRegion    esb_mmio;
> >> +
> >> +    XiveRouter      *xrtr;
> >> +} XiveENDSource;
> >> +
> >>  /*
> >>   * For legacy compatibility, the exceptions define up to 256 different
> >>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 9cb001e7b540..5a8882d47a98 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
> >>       * even futher coalescing in the Router
> >>       */
> >>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
> >> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> >> -        return;
> >> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
> >> +        bool notify = xive_esb_trigger(&pq);
> >> +
> >> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
> >> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
> >> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
> >> +        }
> >> +
> >> +        /* ESn[Q]=1 : end of notification */
> >> +        if (!notify) {
> >> +            return;
> >> +        }
> >>      }
> >>  
> >>      /*
> >> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
> >>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
> >>  }
> >>  
> >> +/*
> >> + * END ESB MMIO loads
> >> + */
> >> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
> >> +{
> >> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
> >> +    XiveRouter *xrtr = xsrc->xrtr;
> >> +    uint32_t offset = addr & 0xFFF;
> >> +    uint8_t end_blk;
> >> +    uint32_t end_idx;
> >> +    XiveEND end;
> >> +    uint32_t end_esmask;
> >> +    uint8_t pq;
> >> +    uint64_t ret = -1;
> >> +
> >> +    end_blk = xrtr->chip_id;
> >> +    end_idx = addr >> (xsrc->esb_shift + 1);
> >> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
> >> +                      end_idx);
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (!(end.w0 & END_W0_VALID)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
> >> +                      end_blk, end_idx);
> >> +        return -1;
> >> +    }
> >> +
> >> +    end_esmask = addr_is_even(addr, xsrc->esb_shift) ? END_W1_ESn : END_W1_ESe;
> >> +    pq = GETFIELD(end_esmask, end.w1);
> >> +
> >> +    switch (offset) {
> >> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
> >> +        ret = xive_esb_eoi(&pq);
> >> +
> >> +        /* Forward the source event notification for routing ?? */
> >> +        break;
> >> +
> >> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
> >> +        ret = pq;
> >> +        break;
> >> +
> >> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
> >> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
> >> +        ret = xive_esb_set(&pq, (offset >> 8) & 0x3);
> >> +        break;
> >> +    default:
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid END ESB load addr %d\n",
> >> +                      offset);
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (pq != GETFIELD(end_esmask, end.w1)) {
> >> +        end.w1 = SETFIELD(end_esmask, end.w1, pq);
> >> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
> >> +    }
> > 
> > We can probably share some more code with XiveSource here, but that's
> > something that can be refined later.
> 
> yes clearly. The idea was to introduce a XiveESB model handling only the 
> MMIO aspects and rely on an interface to query/modify the underlying PQ bits.
> These state bits are related to a device and the ESB pages are the XIVE way 
> to expose them. 
> 
> I left that for later. I didn't want to complicate the XiveSource 
> further with a feature not used today. 
> 
> Thanks,
> 
> C.
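
The PQ handling that such a XiveESB model would share can be sketched as
a small state machine. This is a minimal sketch, assuming the PQ
encodings of the XIVE_ESB_* values from the XiveSource patch earlier in
the series (not shown in this message); the esb_* names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings, assumed to match the XIVE_ESB_* values of the series. */
enum {
    ESB_RESET   = 0x0,  /* P=0 Q=0 */
    ESB_OFF     = 0x1,  /* P=0 Q=1 */
    ESB_PENDING = 0x2,  /* P=1 Q=0 */
    ESB_QUEUED  = 0x3,  /* P=1 Q=1 */
};

/* Trigger: returns true when the event should be forwarded. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce further triggers */
        return false;
    default:                /* ESB_OFF: source is masked */
        *pq = ESB_OFF;
        return false;
    }
}

/* EOI: returns true when a coalesced event must be replayed. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    default:                /* ESB_OFF stays off */
        *pq = ESB_OFF;
        return false;
    }
}
```

Under this encoding, the END_W1_ESn_Q | END_W1_ESe_Q value written by
xive_end_reset() corresponds to PQ=01, the "off" state.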
> 
> > 
> >> +
> >> +    return ret;
> >> +}
> >> +
> >> +/*
> >> + * END ESB MMIO stores are invalid
> >> + */
> >> +static void xive_end_source_write(void *opaque, hwaddr addr,
> >> +                                  uint64_t value, unsigned size)
> >> +{
> >> +    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr 0x%"
> >> +                  HWADDR_PRIx"\n", addr);
> >> +}
> >> +
> >> +static const MemoryRegionOps xive_end_source_ops = {
> >> +    .read = xive_end_source_read,
> >> +    .write = xive_end_source_write,
> >> +    .endianness = DEVICE_BIG_ENDIAN,
> >> +    .valid = {
> >> +        .min_access_size = 8,
> >> +        .max_access_size = 8,
> >> +    },
> >> +    .impl = {
> >> +        .min_access_size = 8,
> >> +        .max_access_size = 8,
> >> +    },
> >> +};
> >> +
> >> +static void xive_end_source_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    XiveENDSource *xsrc = XIVE_END_SOURCE(dev);
> >> +    Object *obj;
> >> +    Error *local_err = NULL;
> >> +
> >> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> >> +    if (!obj) {
> >> +        error_propagate(errp, local_err);
> >> +        error_prepend(errp, "required link 'xive' not found: ");
> >> +        return;
> >> +    }
> >> +
> >> +    xsrc->xrtr = XIVE_ROUTER(obj);
> >> +
> >> +    if (!xsrc->nr_ends) {
> >> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
> >> +        return;
> >> +    }
> >> +
> >> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
> >> +        xsrc->esb_shift != XIVE_ESB_64K) {
> >> +        error_setg(errp, "Invalid ESB shift setting");
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Each END is assigned an even/odd pair of MMIO pages, the even page
> >> +     * manages the ESn field while the odd page manages the ESe field.
> >> +     */
> >> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >> +                          &xive_end_source_ops, xsrc, "xive.end",
> >> +                          (1ull << (xsrc->esb_shift + 1)) * xsrc->nr_ends);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> >> +}
> >> +
> >> +static Property xive_end_source_properties[] = {
> >> +    DEFINE_PROP_UINT32("nr-ends", XiveENDSource, nr_ends, 0),
> >> +    DEFINE_PROP_UINT32("shift", XiveENDSource, esb_shift, XIVE_ESB_64K),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void xive_end_source_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +
> >> +    dc->desc    = "XIVE END Source";
> >> +    dc->props   = xive_end_source_properties;
> >> +    dc->realize = xive_end_source_realize;
> >> +}
> >> +
> >> +static const TypeInfo xive_end_source_info = {
> >> +    .name          = TYPE_XIVE_END_SOURCE,
> >> +    .parent        = TYPE_SYS_BUS_DEVICE,
> >> +    .instance_size = sizeof(XiveENDSource),
> >> +    .class_init    = xive_end_source_class_init,
> >> +};
> >> +
> >>  /*
> >>   * XIVE Fabric
> >>   */
> >> @@ -720,6 +875,7 @@ static void xive_register_types(void)
> >>      type_register_static(&xive_source_info);
> >>      type_register_static(&xive_fabric_info);
> >>      type_register_static(&xive_router_info);
> >> +    type_register_static(&xive_end_source_info);
> >>  }
> >>  
> >>  type_init(xive_register_types)
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context Cédric Le Goater
@ 2018-11-23  5:08   ` David Gibson
  2018-11-25 20:35     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-23  5:08 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:00AM +0100, Cédric Le Goater wrote:
> Each POWER9 processor chip has a XIVE presenter that can generate four
> different exceptions to its threads:
> 
>   - hypervisor exception,
>   - O/S exception
>   - Event-Based Branch (EBB)
>   - msgsnd (doorbell).
> 
> Each exception has a state independent from the others called a Thread
> Interrupt Management context. This context is a set of registers which
> lets the thread handle priority management and interrupt acknowledgment
> among other things. The most important ones being :
> 
>   - Interrupt Priority Register  (PIPR)
>   - Interrupt Pending Buffer     (IPB)
>   - Current Processor Priority   (CPPR)
>   - Notification Source Register (NSR)
> 
> These registers are accessible through a specific MMIO region, called
> the Thread Interrupt Management Area (TIMA), four aligned pages, each
> exposing a different view of the registers. First page (page address
> ending in 0b00) gives access to the entire context and is reserved for
> the ring 0 security monitor. The second (page address ending in 0b01)
> is for the hypervisor, ring 1. The third (page address ending in 0b10)
> is for the operating system, ring 2. The fourth (page address ending
> in 0b11) is for user level, ring 3.
> 
> The thread interrupt context is modeled with a XiveTCTX object
> containing the values of the different exception registers. The TIMA
> region is mapped at the same address for each CPU.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/xive.h      |  36 +++
>  include/hw/ppc/xive_regs.h |  82 +++++++
>  hw/intc/xive.c             | 443 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 561 insertions(+)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 24301bf2076d..5987f26ddb98 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -238,4 +238,40 @@ typedef struct XiveENDSource {
>  void xive_end_reset(XiveEND *end);
>  void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>  
> +/*
> + * XIVE Thread interrupt Management (TM) context
> + */
> +
> +#define TYPE_XIVE_TCTX "xive-tctx"
> +#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
> +
> +/*
> + * XIVE Thread interrupt Management register rings :
> + *
> + *   QW-0  User       event-based exception state
> + *   QW-1  O/S        OS context for priority management, interrupt acks
> + *   QW-2  Pool       hypervisor context for virtual processor being dispatched
> + *   QW-3  Physical   for the security monitor to manage the entire context

That last description is misleading, AIUI the hypervisor can and does
make use of the physical ring as well as the pool ring.

> + */
> +#define TM_RING_COUNT           4
> +#define TM_RING_SIZE            0x10
> +
> +typedef struct XiveTCTX {
> +    DeviceState parent_obj;
> +
> +    CPUState    *cs;
> +    qemu_irq    output;
> +
> +    uint8_t     regs[TM_RING_COUNT * TM_RING_SIZE];

I'm a bit dubious about representing the state with a full buffer like
this.  Isn't a fair bit of this space reserved or derived values which
aren't backed by real state?
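
To make the layout under discussion concrete, here is a minimal sketch of
how the flat buffer is indexed, assuming (as in the patch's
xive_tctx_set_cppr()) that 'ring' is the QW byte offset; ThreadContext
and tctx_set_cppr() are hypothetical stand-ins for XiveTCTX and its
helper:

```c
#include <stdint.h>

/* Constants copied from the patch. */
#define TM_RING_COUNT      4
#define TM_RING_SIZE       0x10
#define TM_QW1_OS          0x010
#define TM_QW3_HV_PHYS     0x030
#define TM_CPPR            0x1
#define XIVE_PRIORITY_MAX  7

/* Hypothetical stand-in for the register part of XiveTCTX:
 * four rings of 16 bytes each, 64 bytes in total. */
typedef struct {
    uint8_t regs[TM_RING_COUNT * TM_RING_SIZE];
} ThreadContext;

/* A register lives at (ring offset + byte offset inside the QW). */
static void tctx_set_cppr(ThreadContext *tctx, uint8_t ring, uint8_t cppr)
{
    /* Priorities above the 8 active levels collapse to 0xff. */
    if (cppr > XIVE_PRIORITY_MAX) {
        cppr = 0xff;
    }
    tctx->regs[ring + TM_CPPR] = cppr;
}
```

Every byte of the 64-byte buffer is addressable this way, whether or not
it backs real state, which is the concern raised above.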

> +
> +    XiveRouter  *xrtr;

What's this for?  AFAIK a TCTX isn't associated with a particular
routing unit.

> +} XiveTCTX;
> +
> +/*
> + * XIVE Thread Interrupt Management Aera (TIMA)

Typo s/Aera/Area/

> + */
> +extern const MemoryRegionOps xive_tm_ops;
> +
> +void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index f97fb2b90bee..2e3d6cb507da 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -10,6 +10,88 @@
>  #ifndef PPC_XIVE_REGS_H
>  #define PPC_XIVE_REGS_H
>  
> +#define TM_SHIFT                16
> +
> +/* TM register offsets */
> +#define TM_QW0_USER             0x000 /* All rings */
> +#define TM_QW1_OS               0x010 /* Ring 0..2 */
> +#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
> +#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
> +
> +/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
> +#define TM_NSR                  0x0  /*  +   +   -   +  */
> +#define TM_CPPR                 0x1  /*  -   +   -   +  */
> +#define TM_IPB                  0x2  /*  -   +   +   +  */
> +#define TM_LSMFB                0x3  /*  -   +   +   +  */
> +#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
> +#define TM_INC                  0x5  /*  -   +   -   +  */
> +#define TM_AGE                  0x6  /*  -   +   -   +  */
> +#define TM_PIPR                 0x7  /*  -   +   -   +  */
> +
> +#define TM_WORD0                0x0
> +#define TM_WORD1                0x4
> +
> +/*
> + * QW word 2 contains the valid bit at the top and other fields
> + * depending on the QW.
> + */
> +#define TM_WORD2                0x8
> +#define   TM_QW0W2_VU           PPC_BIT32(0)
> +#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
> +#define   TM_QW1W2_VO           PPC_BIT32(0)
> +#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
> +#define   TM_QW2W2_VP           PPC_BIT32(0)
> +#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
> +#define   TM_QW3W2_VT           PPC_BIT32(0)
> +#define   TM_QW3W2_LP           PPC_BIT32(6)
> +#define   TM_QW3W2_LE           PPC_BIT32(7)
> +#define   TM_QW3W2_T            PPC_BIT32(31)
> +
> +/*
> + * In addition to normal loads to "peek" and writes (only when invalid)
> + * using 4 and 8 bytes accesses, the above registers support these
> + * "special" byte operations:
> + *
> + *   - Byte load from QW0[NSR] - User level NSR (EBB)
> + *   - Byte store to QW0[NSR] - User level NSR (EBB)
> + *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
> + *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
> + *                                    otherwise VT||0000000
> + *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
> + *
> + * Then we have all these "special" CI ops at these offset that trigger
> + * all sorts of side effects:
> + */
> +#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg*/
> +#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
> +#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
> +#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
> +                                         * context */
> +#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
> +#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
> +                                         * context to reg */
> +#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
> +                                         * context to reg*/
> +#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
> +#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
> +                                         * line */
> +#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
> +#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
> +                                         * line */
> +#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
> +/* XXX more... */
> +
> +/* NSR fields for the various QW ack types */
> +#define TM_QW0_NSR_EB           PPC_BIT8(0)
> +#define TM_QW1_NSR_EO           PPC_BIT8(0)
> +#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
> +#define  TM_QW3_NSR_HE_NONE     0
> +#define  TM_QW3_NSR_HE_POOL     1
> +#define  TM_QW3_NSR_HE_PHYS     2
> +#define  TM_QW3_NSR_HE_LSI      3
> +#define TM_QW3_NSR_I            PPC_BIT8(2)
> +#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
> +
>  /* EAS (Event Assignment Structure)
>   *
>   * One per interrupt source. Targets an interrupt to a given Event
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 5a8882d47a98..4c6cb5d52975 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -15,6 +15,448 @@
>  #include "sysemu/dma.h"
>  #include "monitor/monitor.h"
>  #include "hw/ppc/xive.h"
> +#include "hw/ppc/xive_regs.h"
> +
> +/*
> + * XIVE Thread Interrupt Management context
> + */
> +
> +static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
> +{
> +    return 0;
> +}
> +
> +static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
> +{
> +    if (cppr > XIVE_PRIORITY_MAX) {
> +        cppr = 0xff;
> +    }
> +
> +    tctx->regs[ring + TM_CPPR] = cppr;
> +}
> +
> +/*
> + * XIVE Thread Interrupt Management Area (TIMA)
> + *
> + * This region gives access to the registers of the thread interrupt
> + * management context. It is four page wide, each page providing a
> + * different view of the registers. The page with the lower offset is
> + * the most privileged and gives access to the entire context.
> + */
> +
> +#define XIVE_TM_HW_PAGE   0x0
> +#define XIVE_TM_HV_PAGE   0x1
> +#define XIVE_TM_OS_PAGE   0x2
> +#define XIVE_TM_USER_PAGE 0x3
> +
> +/*
> + * Define an access map for each page of the TIMA that we will use in
> + * the memory region ops to filter values when doing loads and stores
> + * of raw registers values
> + *
> + * Registers accessibility bits :
> + *
> + *    0x0 - no access
> + *    0x1 - write only
> + *    0x2 - read only
> + *    0x3 - read/write
> + */
> +
> +static const uint8_t xive_tm_hw_view[] = {
> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-1 OS   */   3, 3, 3, 3,   3, 3, 0, 3,   3, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-2 HV   */   0, 0, 3, 3,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-3 HW   */   3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   3, 3, 3, 0,

Can we stick to the "Pool" / "Phys" names rather than inventing HV and
HW?  XIVE already has too many names for things.  To clarify, that's
for the naming of the QWs; the view names are fine.

> +};
> +
> +static const uint8_t xive_tm_hv_view[] = {
> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-1 OS   */   3, 3, 3, 3,   3, 3, 0, 3,   3, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-2 HV   */   0, 0, 3, 3,   0, 0, 0, 0,   0, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-3 HW   */   3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   0, 0, 0, 0,
> +};
> +
> +static const uint8_t xive_tm_os_view[] = {
> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
> +    /* QW-1 OS   */   2, 3, 2, 2,   2, 2, 0, 2,   0, 0, 0, 0,   0, 0, 0, 0,
> +    /* QW-2 HV   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
> +    /* QW-3 HW   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 3, 3, 0,

Are those bytes near the end of QW-3 really accessible in OS but not
hypervisor view?
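
To make the question concrete, here is a small standalone sketch (not
part of the patch) that compares the two QW-3 rows quoted above and
counts the bytes the OS view exposes while the HV view hides them. The
helper name tm_view_escapes() is made up for illustration; only the
table values come from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* QW-3 rows copied from the quoted xive_tm_hv_view / xive_tm_os_view. */
static const uint8_t hv_view_qw3[16] = {
    3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   0, 0, 0, 0,
};
static const uint8_t os_view_qw3[16] = {
    0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 3, 3, 0,
};

/*
 * Hypothetical helper: count the bytes for which the less privileged
 * view grants an access bit that the more privileged view does not.
 */
static size_t tm_view_escapes(const uint8_t *more_priv,
                              const uint8_t *less_priv, size_t len)
{
    size_t i, n = 0;

    assert(more_priv && less_priv);
    for (i = 0; i < len; i++) {
        if (less_priv[i] & ~more_priv[i]) {
            n++;
        }
    }
    return n;
}
```

With the rows as posted, tm_view_escapes(hv_view_qw3, os_view_qw3, 16)
reports 2 bytes (offsets 13 and 14 of QW-3), which is what prompted the
question.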

> +};
> +
> +static const uint8_t xive_tm_user_view[] = {
> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
> +    /* QW-1 OS   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
> +    /* QW-2 HV   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
> +    /* QW-3 HW   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
> +};
> +
> +/*
> + * Overall TIMA access map for the thread interrupt management context
> + * registers
> + */
> +static const uint8_t *xive_tm_views[] = {
> +    [XIVE_TM_HW_PAGE]   = xive_tm_hw_view,
> +    [XIVE_TM_HV_PAGE]   = xive_tm_hv_view,
> +    [XIVE_TM_OS_PAGE]   = xive_tm_os_view,
> +    [XIVE_TM_USER_PAGE] = xive_tm_user_view,
> +};
> +
> +/*
> + * Computes a register access mask for a given offset in the TIMA
> + */
> +static uint64_t xive_tm_mask(hwaddr offset, unsigned size, bool write)
> +{
> +    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
> +    uint8_t reg_offset = offset & 0x3F;
> +    uint8_t reg_mask = write ? 0x1 : 0x2;
> +    uint64_t mask = 0x0;
> +    int i;
> +
> +    for (i = 0; i < size; i++) {
> +        if (xive_tm_views[page_offset][reg_offset + i] & reg_mask) {
> +            mask |= (uint64_t) 0xff << (8 * (size - i - 1));
> +        }
> +    }
> +
> +    return mask;
> +}
> +
> +static void xive_tm_raw_write(XiveTCTX *tctx, hwaddr offset, uint64_t value,
> +                              unsigned size)
> +{
> +    uint8_t ring_offset = offset & 0x30;
> +    uint8_t reg_offset = offset & 0x3F;
> +    uint64_t mask = xive_tm_mask(offset, size, true);
> +    int i;
> +
> +    /*
> +     * Only 4 or 8 bytes stores are allowed and the User ring is
> +     * excluded
> +     */
> +    if (size < 4 || !mask || ring_offset == TM_QW0_USER) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid write access at TIMA @%"
> +                      HWADDR_PRIx"\n", offset);
> +        return;
> +    }
> +
> +    /*
> +     * Use the register offset for the raw values and filter out
> +     * reserved values
> +     */
> +    for (i = 0; i < size; i++) {
> +        uint8_t byte_mask = (mask >> (8 * (size - i - 1)));
> +        if (byte_mask) {
> +            tctx->regs[reg_offset + i] = (value >> (8 * (size - i - 1))) &
> +                byte_mask;
> +        }
> +    }
> +}
> +
> +static uint64_t xive_tm_raw_read(XiveTCTX *tctx, hwaddr offset, unsigned size)
> +{
> +    uint8_t ring_offset = offset & 0x30;
> +    uint8_t reg_offset = offset & 0x3F;
> +    uint64_t mask = xive_tm_mask(offset, size, false);
> +    uint64_t ret;
> +    int i;
> +
> +    /*
> +     * Only 4 or 8 bytes loads are allowed and the User ring is
> +     * excluded
> +     */
> +    if (size < 4 || !mask || ring_offset == TM_QW0_USER) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid read access at TIMA @%"
> +                      HWADDR_PRIx"\n", offset);
> +        return -1;
> +    }
> +
> +    /* Use the register offset for the raw values */
> +    ret = 0;
> +    for (i = 0; i < size; i++) {
> +        ret |= (uint64_t) tctx->regs[reg_offset + i] << (8 * (size - i - 1));
> +    }
> +
> +    /* filter out reserved values */
> +    return ret & mask;
> +}
> +
> +/*
> + * The TM context is mapped twice within each page. Stores and loads
> + * to the first mapping below 2K write and read the specified values
> + * without modification. The second mapping above 2K performs specific
> + * state changes (side effects) in addition to setting/returning the
> + * interrupt management area context of the processor thread.
> + */
> +static uint64_t xive_tm_ack_os_reg(XiveTCTX *tctx, hwaddr offset, unsigned size)
> +{
> +    return xive_tctx_accept(tctx, TM_QW1_OS);
> +}
> +
> +static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
> +                                uint64_t value, unsigned size)
> +{
> +    xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
> +}
> +
> +/*
> + * Define a mapping of "special" operations depending on the TIMA page
> + * offset and the size of the operation.
> + */
> +typedef struct XiveTmOp {
> +    uint8_t  page_offset;
> +    uint32_t op_offset;
> +    unsigned size;
> +    void     (*write_handler)(XiveTCTX *tctx, hwaddr offset, uint64_t value,
> +                              unsigned size);
> +    uint64_t (*read_handler)(XiveTCTX *tctx, hwaddr offset, unsigned size);
> +} XiveTmOp;
> +
> +static const XiveTmOp xive_tm_operations[] = {
> +    /*
> +     * MMIOs below 2K : raw values and special operations without side
> +     * effects
> +     */
> +    { XIVE_TM_OS_PAGE, TM_QW1_OS + TM_CPPR,   1, xive_tm_set_os_cppr, NULL },
> +
> +    /* MMIOs above 2K : special operations with side effects */
> +    { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
> +};
> +
> +static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
> +{
> +    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
> +    uint32_t op_offset = offset & 0xFFF;
> +    int i;
> +
> +    for (i = 0; i < ARRAY_SIZE(xive_tm_operations); i++) {
> +        const XiveTmOp *xto = &xive_tm_operations[i];
> +
> +        /* Accesses done from a more privileged TIMA page is allowed */
> +        if (xto->page_offset >= page_offset &&
> +            xto->op_offset == op_offset &&
> +            xto->size == size &&
> +            ((write && xto->write_handler) || (!write && xto->read_handler))) {
> +            return xto;
> +        }
> +    }
> +    return NULL;
> +}
> +
> +/*
> + * TIMA MMIO handlers
> + */
> +static void xive_tm_write(void *opaque, hwaddr offset,
> +                          uint64_t value, unsigned size)
> +{
> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +    XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
> +    const XiveTmOp *xto;
> +
> +    /*
> +     * TODO: check V bit in Q[0-3]W2, check PTER bit associated with CPU
> +     */
> +
> +    /*
> +     * First, check for special operations in the 2K region
> +     */
> +    if (offset & 0x800) {
> +        xto = xive_tm_find_op(offset, size, true);
> +        if (!xto) {
> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid write access at TIMA"
> +                          "@%"HWADDR_PRIx"\n", offset);
> +        } else {
> +            xto->write_handler(tctx, offset, value, size);
> +        }
> +        return;
> +    }
> +
> +    /*
> +     * Then, for special operations in the region below 2K.
> +     */
> +    xto = xive_tm_find_op(offset, size, true);
> +    if (xto) {
> +        xto->write_handler(tctx, offset, value, size);
> +        return;
> +    }
> +
> +    /*
> +     * Finish with raw access to the register values
> +     */
> +    xive_tm_raw_write(tctx, offset, value, size);
> +}
> +
> +static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
> +{
> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +    XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
> +    const XiveTmOp *xto;
> +
> +    /*
> +     * TODO: check V bit in Q[0-3]W2, check PTER bit associated with CPU
> +     */
> +
> +    /*
> +     * First, check for special operations in the 2K region
> +     */
> +    if (offset & 0x800) {
> +        xto = xive_tm_find_op(offset, size, false);
> +        if (!xto) {
> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid read access to TIMA"
> +                          "@%"HWADDR_PRIx"\n", offset);
> +            return -1;
> +        }
> +        return xto->read_handler(tctx, offset, size);
> +    }
> +
> +    /*
> +     * Then, for special operations in the region below 2K.
> +     */
> +    xto = xive_tm_find_op(offset, size, false);
> +    if (xto) {
> +        return xto->read_handler(tctx, offset, size);
> +    }
> +
> +    /*
> +     * Finish with raw access to the register values
> +     */
> +    return xive_tm_raw_read(tctx, offset, size);
> +}
> +
> +const MemoryRegionOps xive_tm_ops = {
> +    .read = xive_tm_read,
> +    .write = xive_tm_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +static char *xive_tctx_ring_print(uint8_t *ring)
> +{
> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
> +
> +    return g_strdup_printf("%02x   %02x  %02x    %02x   %02x  "
> +                   "%02x  %02x   %02x  %08x",
> +                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
> +                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
> +                   w2);
> +}
> +
> +static const struct {
> +    uint8_t    qw;
> +    const char *name;
> +} xive_tctx_ring_infos[TM_RING_COUNT] = {
> +    { TM_QW3_HV_PHYS, "HW"   },
> +    { TM_QW2_HV_POOL, "HV"   },
> +    { TM_QW1_OS,      "OS"   },
> +    { TM_QW0_USER,    "USER" },

Likewise here, can we stick to PHYS and POOL rather than HW and HV?

Also, the qw field takes exactly the values 0..3, so why not just use an
array of names indexed by the ring number?
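
Something along these lines, as a rough sketch only. It assumes the
TM_QW* constants are the 16-byte-aligned ring offsets (their real
definitions live in xive_regs.h and are not shown here), so the ring
number is the offset shifted right by 4. The helper
xive_tctx_ring_name() is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Assumed ring offsets, one 16-byte quad word per ring. These values
 * are an assumption made for this sketch.
 */
#define TM_QW0_USER     0x00
#define TM_QW1_OS       0x10
#define TM_QW2_HV_POOL  0x20
#define TM_QW3_HV_PHYS  0x30
#define TM_RING_COUNT   4

/* Ring names indexed by ring number (offset >> 4). */
static const char * const xive_tctx_ring_names[TM_RING_COUNT] = {
    "USER", "OS", "POOL", "PHYS",
};

/* Hypothetical helper: map a ring offset to its display name. */
static const char *xive_tctx_ring_name(uint8_t ring_offset)
{
    unsigned ring = ring_offset >> 4;

    assert((ring_offset & 0xf) == 0 && ring < TM_RING_COUNT);
    return xive_tctx_ring_names[ring];
}
```

The print loop could then walk the rings from TM_RING_COUNT - 1 down to
0 to keep the PHYS-first output order, without the extra struct.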

> +};
> +
> +void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
> +{
> +    int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
> +    int i;
> +
> +    monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
> +                   "  W2\n", cpu_index);
> +
> +    for (i = 0; i < TM_RING_COUNT; i++) {
> +        char *s = xive_tctx_ring_print(&tctx->regs[xive_tctx_ring_infos[i].qw]);
> +        monitor_printf(mon, "CPU[%04x]: %4s    %s\n", cpu_index,
> +                       xive_tctx_ring_infos[i].name, s);
> +        g_free(s);
> +    }
> +}
> +
> +static void xive_tctx_reset(void *dev)
> +{
> +    XiveTCTX *tctx = XIVE_TCTX(dev);
> +
> +    memset(tctx->regs, 0, sizeof(tctx->regs));
> +
> +    /* Set some defaults */
> +    tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
> +    tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
> +    tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
> +}
> +
> +static void xive_tctx_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveTCTX *tctx = XIVE_TCTX(dev);
> +    PowerPCCPU *cpu;
> +    CPUPPCState *env;
> +    Object *obj;
> +    Error *local_err = NULL;
> +
> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
> +    if (!obj) {
> +        error_propagate(errp, local_err);
> +        error_prepend(errp, "required link 'xive' not found: ");
> +        return;
> +    }
> +    tctx->xrtr = XIVE_ROUTER(obj);
> +
> +    obj = object_property_get_link(OBJECT(dev), "cpu", &local_err);
> +    if (!obj) {
> +        error_propagate(errp, local_err);
> +        error_prepend(errp, "required link 'cpu' not found: ");
> +        return;
> +    }
> +
> +    cpu = POWERPC_CPU(obj);
> +    tctx->cs = CPU(obj);
> +
> +    env = &cpu->env;
> +    switch (PPC_INPUT(env)) {
> +    case PPC_FLAGS_INPUT_POWER7:
> +        tctx->output = env->irq_inputs[POWER7_INPUT_INT];
> +        break;
> +
> +    default:
> +        error_setg(errp, "XIVE interrupt controller does not support "
> +                   "this CPU bus model");
> +        return;
> +    }
> +
> +    qemu_register_reset(xive_tctx_reset, dev);
> +}
> +
> +static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
> +{
> +    qemu_unregister_reset(xive_tctx_reset, dev);
> +}
> +
> +static const VMStateDescription vmstate_xive_tctx = {
> +    .name = TYPE_XIVE_TCTX,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_BUFFER(regs, XiveTCTX),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static void xive_tctx_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +
> +    dc->realize = xive_tctx_realize;
> +    dc->unrealize = xive_tctx_unrealize;
> +    dc->desc = "XIVE Interrupt Thread Context";
> +    dc->vmsd = &vmstate_xive_tctx;
> +}
> +
> +static const TypeInfo xive_tctx_info = {
> +    .name          = TYPE_XIVE_TCTX,
> +    .parent        = TYPE_DEVICE,
> +    .instance_size = sizeof(XiveTCTX),
> +    .class_init    = xive_tctx_class_init,
> +};
>  
>  /*
>   * XIVE ESB helpers
> @@ -876,6 +1318,7 @@ static void xive_register_types(void)
>      type_register_static(&xive_fabric_info);
>      type_register_static(&xive_router_info);
>      type_register_static(&xive_end_source_info);
> +    type_register_static(&xive_tctx_info);
>  }
>  
>  type_init(xive_register_types)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-23  4:36       ` David Gibson
@ 2018-11-23  7:28         ` Cédric Le Goater
  2018-11-26  5:54           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-23  7:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/23/18 5:36 AM, David Gibson wrote:
> On Thu, Nov 22, 2018 at 10:58:56PM +0100, Cédric Le Goater wrote:
>> On 11/22/18 6:13 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
>>>> The Event Notification Descriptor also contains two Event State
>>>> Buffers providing further coalescing of interrupts, one for the
>>>> notification event (ESn) and one for the escalation events (ESe). A
>>>> MMIO page is assigned for each to control the EOI through loads
>>>> only. Stores are not allowed.
>>>>
>>>> The END ESBs are modeled through an object resembling the 'XiveSource'
>>>> It is stateless as the END state bits are backed into the XiveEND
>>>> structure under the XiveRouter and the MMIO accesses follow the same
>>>> rules as for the standard source ESBs.
>>>>
>>>> END ESBs are not supported by the Linux drivers neither on OPAL nor on
>>>> sPAPR. Nevetherless, it provides a mean to study the question in the
>>>> future and validates a bit more the XIVE model.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/xive.h |  20 ++++++
>>>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 178 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index ce62aaf28343..24301bf2076d 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>                          XiveEND *end);
>>>>  
>>>> +/*
>>>> + * XIVE END ESBs
>>>> + */
>>>> +
>>>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
>>>> +#define XIVE_END_SOURCE(obj) \
>>>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
>>>
>>> Is there a particular reason to make this a full QOM object, rather
>>> than just embedding it in the XiveRouter?
>>
>> yes, it should probably be under the XiveRouter you are right because
>> there is a direct link with the ENDT which is in the XiverRouter. 
>>
>> But if I remove the chip_id field from the XiveRouter, it becomes a QOM
>> interface. something to ponder.
> 
> Huh?  I really don't understand what you're saying here.  What does
> chip_id have to do with anything?

I am quoting a comment of yours:

	> +/*
	> + * XIVE Router
	> + */
	> +
	> +typedef struct XiveRouter {
	> +    SysBusDevice    parent;
	> +
	> +    uint32_t        chip_id;

	I don't think this belongs in the base class.  The PowerNV specific
	variants will need it, but it doesn't make sense for the PAPR version.


If we remove 'chip_id' from XiveRouter, it can become a stateless QOM
interface, like XiveFabric.

C. 


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-23  3:50       ` David Gibson
@ 2018-11-23  8:06         ` Cédric Le Goater
  2018-11-27  1:54           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-23  8:06 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/23/18 4:50 AM, David Gibson wrote:
> On Thu, Nov 22, 2018 at 08:53:00AM +0100, Cédric Le Goater wrote:
>> On 11/22/18 5:11 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater wrote:
>>>> The XiveRouter models the second sub-engine of the overall XIVE
>>>> architecture : the Interrupt Virtualization Routing Engine (IVRE).
>>>>
>>>> The IVRE handles event notifications of the IVSE through MMIO stores
>>>> and performs the interrupt routing process. For this purpose, it uses
>>>> a set of table stored in system memory, the first of which being the
>>>> Event Assignment Structure (EAS) table.
>>>>
>>>> The EAT associates an interrupt source number with an Event Notification
>>>> Descriptor (END) which will be used in a second phase of the routing
>>>> process to identify a Notification Virtual Target.
>>>>
>>>> The XiveRouter is an abstract class which needs to be inherited from
>>>> to define a storage for the EAT, and other upcoming tables. The
>>>> 'chip-id' atttribute is not strictly necessary for the sPAPR and
>>>> PowerNV machines but it's a good way to test the routing algorithm.
>>>> Without this atttribute, the XiveRouter could be a simple QOM
>>>> interface.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/xive.h      | 32 ++++++++++++++
>>>>  include/hw/ppc/xive_regs.h | 31 ++++++++++++++
>>>>  hw/intc/xive.c             | 86 ++++++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 149 insertions(+)
>>>>  create mode 100644 include/hw/ppc/xive_regs.h
>>>>
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index be93fae6317b..5a0696366577 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -11,6 +11,7 @@
>>>>  #define PPC_XIVE_H
>>>>  
>>>>  #include "hw/sysbus.h"
>>>
>>> Again, I don't think making this a SysBusDevice is quite right.
>>> Even more so for the router than the source, because at least for PAPR
>>> it might not have any MMIO presence at all.
>>
>> The controller model inherits from the XiveRouter and manages the
>> TIMA.
> 
> Um.. I'm not sure what you mean by the "controller model".  Surely the
> presenter should own the TIMA, not the router?

The XIVE VC (routing) and PC (presenter) sub-engines are merged under the
controller model sPAPRXIVE, and the PC matching algorithm looking for a
target is reduced to one function. We don't want to model the exchanges on
the PowerBUS.

The TIMA MMIO region exposing the thread interrupt context is managed by
the controller sPAPRXIVE (easier for KVM), and the interrupt context
registers are under the XiveTCTX model. This is our XICS/ICP equivalent
in XIVE.

>>>> +#include "hw/ppc/xive_regs.h"
>>>>  
>>>>  /*
>>>>   * XIVE Fabric (Interface between Source and Router)
>>>> @@ -168,4 +169,35 @@ static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
>>>>      }
>>>>  }
>>>>  
>>>> +/*
>>>> + * XIVE Router
>>>> + */
>>>> +
>>>> +typedef struct XiveRouter {
>>>> +    SysBusDevice    parent;
>>>> +
>>>> +    uint32_t        chip_id;
>>>
>>> I don't think this belongs in the base class.  The PowerNV specific
>>> variants will need it, but it doesn't make sense for the PAPR version.
>>
>> yeah. I am using it as a END and NVT block identifier but it's not 
>> required for sPAPR, it could just be zero. 
>>
>> It was good to test the routing algo which should not assume that the 
>> block id is zero. 
>>  
>>>
>>>> +} XiveRouter;
>>>> +
>>>> +#define TYPE_XIVE_ROUTER "xive-router"
>>>> +#define XIVE_ROUTER(obj)                                \
>>>> +    OBJECT_CHECK(XiveRouter, (obj), TYPE_XIVE_ROUTER)
>>>> +#define XIVE_ROUTER_CLASS(klass)                                        \
>>>> +    OBJECT_CLASS_CHECK(XiveRouterClass, (klass), TYPE_XIVE_ROUTER)
>>>> +#define XIVE_ROUTER_GET_CLASS(obj)                              \
>>>> +    OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
>>>> +
>>>> +typedef struct XiveRouterClass {
>>>> +    SysBusDeviceClass parent;
>>>> +
>>>> +    /* XIVE table accessors */
>>>> +    int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>> +    int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>> +} XiveRouterClass;
>>>> +
>>>> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>>>> +
>>>> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>> +
>>>>  #endif /* PPC_XIVE_H */
>>>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>>>> new file mode 100644
>>>> index 000000000000..12499b33614c
>>>> --- /dev/null
>>>> +++ b/include/hw/ppc/xive_regs.h
>>>> @@ -0,0 +1,31 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2016-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#ifndef PPC_XIVE_REGS_H
>>>> +#define PPC_XIVE_REGS_H
>>>> +
>>>> +/* EAS (Event Assignment Structure)
>>>> + *
>>>> + * One per interrupt source. Targets an interrupt to a given Event
>>>> + * Notification Descriptor (END) and provides the corresponding
>>>> + * logical interrupt number (END data)
>>>> + */
>>>> +typedef struct XiveEAS {
>>>> +        /* Use a single 64-bit definition to make it easier to
>>>> +         * perform atomic updates
>>>> +         */
>>>> +        uint64_t        w;
>>>> +#define EAS_VALID       PPC_BIT(0)
>>>> +#define EAS_END_BLOCK   PPC_BITMASK(4, 7)        /* Destination END block# */
>>>> +#define EAS_END_INDEX   PPC_BITMASK(8, 31)       /* Destination END index */
>>>> +#define EAS_MASKED      PPC_BIT(32)              /* Masked */
>>>> +#define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
>>>> +} XiveEAS;
>>>> +
>>>> +#endif /* PPC_XIVE_REGS_H */
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index 014a2e41f71f..c4c90a25758e 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -442,6 +442,91 @@ static const TypeInfo xive_source_info = {
>>>>      .class_init    = xive_source_class_init,
>>>>  };
>>>>  
>>>> +/*
>>>> + * XIVE Router (aka. Virtualization Controller or IVRE)
>>>> + */
>>>> +
>>>> +int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>> +{
>>>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>>>> +
>>>> +    return xrc->get_eas(xrtr, lisn, eas);
>>>> +}
>>>> +
>>>> +int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>> +{
>>>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>>>> +
>>>> +    return xrc->set_eas(xrtr, lisn, eas);
>>>> +}
>>>> +
>>>> +static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>>> +{
>>>> +    XiveRouter *xrtr = XIVE_ROUTER(xf);
>>>> +    XiveEAS eas;
>>>> +
>>>> +    /* EAS cache lookup */
>>>> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: Unknown LISN %x\n", lisn);
>>>> +        return;
>>>> +    }
>>>
>>> AFAICT a bad LISN here means a qemu error (in the source, probably),
>>> not a user or guest error, so an assert() would be more appropriate.
>>
>> hmm, I would say no because in the case of PowerNV, the firmware could
>> have badly configured the ISN offset of a source which would notify the 
>> router with a bad notification event data.
> 
> Ah, good point.  That's fine as it is then.
> 
>>>> +
>>>> +    /* The IVRE has a State Bit Cache for its internal sources which
>>>> +     * is also involed at this point. We skip the SBC lookup because
>>>> +     * the state bits of the sources are modeled internally in QEMU.
>>>> +     */
>>>> +
>>>> +    if (!(eas.w & EAS_VALID)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (eas.w & EAS_MASKED) {
>>>> +        /* Notification completed */
>>>> +        return;
>>>> +    }
>>>> +}
>>>> +
>>>> +static Property xive_router_properties[] = {
>>>> +    DEFINE_PROP_UINT32("chip-id", XiveRouter, chip_id, 0),
>>>> +    DEFINE_PROP_END_OF_LIST(),
>>>> +};
>>>> +
>>>> +static void xive_router_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
>>>> +
>>>> +    dc->desc    = "XIVE Router Engine";
>>>> +    dc->props   = xive_router_properties;
>>>> +    xfc->notify = xive_router_notify;
>>>> +}
>>>> +
>>>> +static const TypeInfo xive_router_info = {
>>>> +    .name          = TYPE_XIVE_ROUTER,
>>>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>>>> +    .abstract      = true,
>>>> +    .class_size    = sizeof(XiveRouterClass),
>>>> +    .class_init    = xive_router_class_init,
>>>> +    .interfaces    = (InterfaceInfo[]) {
>>>> +        { TYPE_XIVE_FABRIC },
>>>
>>> So as far as I can see so far, the XiveFabric interface will
>>> essentially have to be implemented on the router object, so I'm not
>>> seeing much point to having the interface rather than just a direct
>>> call on the router object.  But I haven't read the whole series yet,
>>> so maybe I'm missing something.
>>
>> The PSIHB and PHB4 models are using it but there are not in the series.
>>
>> I can send the PSIHB patch in the next version if you like, it's the 
>> patch right after PnvXive. It's attached below for the moment. Look at 
>> pnv_psi_notify().
> 
> Hrm, I see.  This seems like a really convoluted way of achieving what
> you need here.  We want to abstract exactly how the source delivers
> notifies, 

On sPAPR, I agree that the forwarding of event notifications could be a
simple XiveRouter call, but the XiveRouter covers both machines :/

On PowerNV, HW uses MMIOs to forward events, and only the device knows
about the IRQ number offset in the global IRQ number space and the
notification port to use for the MMIO store. A PowerNV XIVE source
would forward the event notification to a piece of logic which sends
a PowerBUS event notification message. How it reaches the XIVE IC is
beyond QEMU, as it would mean modeling the PowerBUS.

> but doing it with an interface on some object that's not necessarily
> either the source or the router seems odd.

There is no direct link between the device owning the source and the
XIVE controller: they could be on the same Power chip, but the routing
could be done by some other chip. This scenario is covered, btw.

See it as a connector object.
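
A minimal sketch of that connector idea in plain C, outside of QOM. All
names here (Notifier, Device, record_notify, device_source_notify) are
hypothetical and only illustrate the shape of the call; the real code
goes through the notify() handler of the XiveFabric interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the notify interface: a connector callback. */
typedef struct Notifier {
    void (*notify)(struct Notifier *n, uint32_t lisn);
    uint32_t last_lisn;   /* recorded for illustration only */
    unsigned count;
} Notifier;

/* Hypothetical device owning a source; only it knows its IRQ offset. */
typedef struct Device {
    uint32_t irq_offset;  /* base of this device in the global IRQ space */
    uint32_t nr_irqs;
    Notifier *target;     /* whoever routes the event; may be off-chip */
} Device;

/* Trivial target that just remembers what it was told. */
static void record_notify(Notifier *n, uint32_t lisn)
{
    n->last_lisn = lisn;
    n->count++;
}

/* Translate a source-local number to a global one and forward it. */
static void device_source_notify(Device *dev, uint32_t srcno)
{
    assert(dev && dev->target && dev->target->notify);
    if (srcno >= dev->nr_irqs) {
        return;           /* out of range: drop the event */
    }
    dev->target->notify(dev->target, dev->irq_offset + srcno);
}
```

The source only ever passes its local number; the offset stays private
to the owning device, and the target can be swapped without the source
noticing.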

> At the very least the names need to change (of both interface and
> property for the target object).

I am fine with renaming it. With the above explanations, if they are
clear enough, how do you see them?

Thanks,

C. 

> 
>>
>> Thanks,
>>
>> C.
>>
>>>> +        { }
>>>> +    }
>>>> +};
>>>> +
>>>> +void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>>>> +{
>>>> +    if (!(eas->w & EAS_VALID)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    monitor_printf(mon, "  %08x %s end:%02x/%04x data:%08x\n",
>>>> +                   lisn, eas->w & EAS_MASKED ? "M" : " ",
>>>> +                   (uint8_t)  GETFIELD(EAS_END_BLOCK, eas->w),
>>>> +                   (uint32_t) GETFIELD(EAS_END_INDEX, eas->w),
>>>> +                   (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>>>> +}
>>>> +
>>>>  /*
>>>>   * XIVE Fabric
>>>>   */
>>>> @@ -455,6 +540,7 @@ static void xive_register_types(void)
>>>>  {
>>>>      type_register_static(&xive_source_info);
>>>>      type_register_static(&xive_fabric_info);
>>>> +    type_register_static(&xive_router_info);
>>>>  }
>>>>  
>>>>  type_init(xive_register_types)
>>>
>>
> 
>> From 680fd6ff7c99e669708fbc5cfdbfcd95e83e7c07 Mon Sep 17 00:00:00 2001
>> From: =?UTF-8?q?C=C3=A9dric=20Le=20Goater?= <clg@kaod.org>
>> Date: Wed, 21 Nov 2018 10:29:45 +0100
>> Subject: [PATCH] ppc/pnv: add a PSI bridge model for POWER9 processor
>> MIME-Version: 1.0
>> Content-Type: text/plain; charset=UTF-8
>> Content-Transfer-Encoding: 8bit
>>
>> The PSI bridge on POWER9 is very similar to POWER8. The BAR is still
>> set through XSCOM but the controls are now entirely done with MMIOs.
>> More interrupts are defined and the interrupt controller interface has
>> changed to XIVE. The POWER9 model is a first example of the usage of
>> the notify() handler of the XiveFabric interface, linking the PSI
>> XiveSource to its owning device model.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/pnv.h       |   6 +
>>  include/hw/ppc/pnv_psi.h   |  50 ++++-
>>  include/hw/ppc/pnv_xscom.h |   3 +
>>  hw/ppc/pnv.c               |  20 +-
>>  hw/ppc/pnv_psi.c           | 390 ++++++++++++++++++++++++++++++++++---
>>  5 files changed, 444 insertions(+), 25 deletions(-)
>>
>> diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
>> index c402e5d5844b..8be1147481f9 100644
>> --- a/include/hw/ppc/pnv.h
>> +++ b/include/hw/ppc/pnv.h
>> @@ -88,6 +88,7 @@ typedef struct Pnv9Chip {
>>  
>>      /*< public >*/
>>      PnvXive      xive;
>> +    PnvPsi       psi;
>>  } Pnv9Chip;
>>  
>>  typedef struct PnvChipClass {
>> @@ -250,11 +251,16 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
>>  #define PNV9_XIVE_PC_SIZE            0x0000001000000000ull
>>  #define PNV9_XIVE_PC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006018000000000ull)
>>  
>> +#define PNV9_PSIHB_SIZE              0x0000000000100000ull
>> +#define PNV9_PSIHB_BASE(chip)        PNV9_CHIP_BASE(chip, 0x0006030203000000ull)
>> +
>>  #define PNV9_XIVE_IC_SIZE            0x0000000000080000ull
>>  #define PNV9_XIVE_IC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203100000ull)
>>  
>>  #define PNV9_XIVE_TM_SIZE            0x0000000000040000ull
>>  #define PNV9_XIVE_TM_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203180000ull)
>>  
>> +#define PNV9_PSIHB_ESB_SIZE          0x0000000000010000ull
>> +#define PNV9_PSIHB_ESB_BASE(chip)    PNV9_CHIP_BASE(chip, 0x00060302031c0000ull)
>>  
>>  #endif /* _PPC_PNV_H */
>> diff --git a/include/hw/ppc/pnv_psi.h b/include/hw/ppc/pnv_psi.h
>> index f6af5eae1fa8..b8f8d082bcf9 100644
>> --- a/include/hw/ppc/pnv_psi.h
>> +++ b/include/hw/ppc/pnv_psi.h
>> @@ -21,10 +21,35 @@
>>  
>>  #include "hw/sysbus.h"
>>  #include "hw/ppc/xics.h"
>> +#include "hw/ppc/xive.h"
>>  
>>  #define TYPE_PNV_PSI "pnv-psi"
>>  #define PNV_PSI(obj) \
>>       OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI)
>> +#define PNV_PSI_CLASS(klass) \
>> +     OBJECT_CLASS_CHECK(PnvPsiClass, (klass), TYPE_PNV_PSI)
>> +#define PNV_PSI_GET_CLASS(obj) \
>> +     OBJECT_GET_CLASS(PnvPsiClass, (obj), TYPE_PNV_PSI)
>> +
>> +typedef struct PnvPsi PnvPsi;
>> +typedef struct PnvChip PnvChip;
>> +typedef struct PnvPsiClass {
>> +    SysBusDeviceClass parent_class;
>> +
>> +    int chip_type;
>> +    uint32_t xscom_pcba;
>> +    uint32_t xscom_size;
>> +
>> +    void (*irq_set)(PnvPsi *psi, int, bool state);
>> +} PnvPsiClass;
>> +
>> +#define TYPE_PNV_PSI_POWER8 TYPE_PNV_PSI "-POWER8"
>> +#define PNV_PSI_POWER8(obj) \
>> +    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER8)
>> +
>> +#define TYPE_PNV_PSI_POWER9 TYPE_PNV_PSI "-POWER9"
>> +#define PNV_PSI_POWER9(obj) \
>> +    OBJECT_CHECK(PnvPsi, (obj), TYPE_PNV_PSI_POWER9)
>>  
>>  #define PSIHB_XSCOM_MAX         0x20
>>  
>> @@ -38,9 +63,12 @@ typedef struct PnvPsi {
>>      /* MemoryRegion fsp_mr; */
>>      uint64_t fsp_bar;
>>  
>> -    /* Interrupt generation */
>> +    /* P8 Interrupt generation */
>>      ICSState ics;
>>  
>> +    /* P9 Interrupt generation */
>> +    XiveSource source;
>> +
>>      /* Registers */
>>      uint64_t regs[PSIHB_XSCOM_MAX];
>>  
>> @@ -60,6 +88,24 @@ typedef enum PnvPsiIrq {
>>  
>>  #define PSI_NUM_INTERRUPTS 6
>>  
>> -extern void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state);
>> +/* P9 PSI Interrupts */
>> +#define PSIHB9_IRQ_PSI          0
>> +#define PSIHB9_IRQ_OCC          1
>> +#define PSIHB9_IRQ_FSI          2
>> +#define PSIHB9_IRQ_LPCHC        3
>> +#define PSIHB9_IRQ_LOCAL_ERR    4
>> +#define PSIHB9_IRQ_GLOBAL_ERR   5
>> +#define PSIHB9_IRQ_TPM          6
>> +#define PSIHB9_IRQ_LPC_SIRQ0    7
>> +#define PSIHB9_IRQ_LPC_SIRQ1    8
>> +#define PSIHB9_IRQ_LPC_SIRQ2    9
>> +#define PSIHB9_IRQ_LPC_SIRQ3    10
>> +#define PSIHB9_IRQ_SBE_I2C      11
>> +#define PSIHB9_IRQ_DIO          12
>> +#define PSIHB9_IRQ_PSU          13
>> +#define PSIHB9_NUM_IRQS         14
>> +
>> +void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state);
>> +void pnv_psi_pic_print_info(PnvPsi *psi, Monitor *mon);
>>  
>>  #endif /* _PPC_PNV_PSI_H */
>> diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
>> index 5bd43467a1ab..019b45bf9189 100644
>> --- a/include/hw/ppc/pnv_xscom.h
>> +++ b/include/hw/ppc/pnv_xscom.h
>> @@ -82,6 +82,9 @@ typedef struct PnvXScomInterfaceClass {
>>  #define PNV_XSCOM_PBCQ_SPCI_BASE  0x9013c00
>>  #define PNV_XSCOM_PBCQ_SPCI_SIZE  0x5
>>  
>> +#define PNV9_XSCOM_PSIHB_BASE     0x5012900
>> +#define PNV9_XSCOM_PSIHB_SIZE     0x100
>> +
>>  #define PNV9_XSCOM_XIVE_BASE      0x5013000
>>  #define PNV9_XSCOM_XIVE_SIZE      0x300
>>  
>> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
>> index b6af896c30e4..e67c9d7d3995 100644
>> --- a/hw/ppc/pnv.c
>> +++ b/hw/ppc/pnv.c
>> @@ -743,7 +743,7 @@ static void pnv_chip_power8_instance_init(Object *obj)
>>      PnvChipClass *pcc = PNV_CHIP_GET_CLASS(obj);
>>      int i;
>>  
>> -    object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI);
>> +    object_initialize(&chip8->psi, sizeof(chip8->psi), TYPE_PNV_PSI_POWER8);
>>      object_property_add_child(obj, "psi", OBJECT(&chip8->psi), NULL);
>>      object_property_add_const_link(OBJECT(&chip8->psi), "xics",
>>                                     OBJECT(qdev_get_machine()), &error_abort);
>> @@ -923,6 +923,11 @@ static void pnv_chip_power9_instance_init(Object *obj)
>>      object_property_add_child(obj, "xive", OBJECT(&chip9->xive), NULL);
>>      object_property_add_const_link(OBJECT(&chip9->xive), "chip", obj,
>>                                     &error_abort);
>> +
>> +    object_initialize(&chip9->psi, sizeof(chip9->psi), TYPE_PNV_PSI_POWER9);
>> +    object_property_add_child(obj, "psi", OBJECT(&chip9->psi), NULL);
>> +    object_property_add_const_link(OBJECT(&chip9->psi), "chip", obj,
>> +                                   &error_abort);
>>  }
>>  
>>  static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>> @@ -955,6 +960,18 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>>      qdev_set_parent_bus(DEVICE(&chip9->xive), sysbus_get_default());
>>      pnv_xscom_add_subregion(chip, PNV9_XSCOM_XIVE_BASE,
>>                              &chip9->xive.xscom_regs);
>> +
>> +    /* Processor Service Interface (PSI) Host Bridge */
>> +    object_property_set_int(OBJECT(&chip9->psi), PNV9_PSIHB_BASE(chip),
>> +                            "bar", &error_fatal);
>> +    object_property_set_bool(OBJECT(&chip9->psi), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(&chip9->psi), sysbus_get_default());
>> +    pnv_xscom_add_subregion(chip, PNV9_XSCOM_PSIHB_BASE,
>> +                            &chip9->psi.xscom_regs);
>>  }
>>  
>>  static void pnv_chip_power9_class_init(ObjectClass *klass, void *data)
>> @@ -1188,6 +1205,7 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
>>              Pnv9Chip *chip9 = PNV9_CHIP(chip);
>>  
>>               pnv_xive_pic_print_info(&chip9->xive, mon);
>> +             pnv_psi_pic_print_info(&chip9->psi, mon);
>>          } else {
>>              Pnv8Chip *chip8 = PNV8_CHIP(chip);
>>  
>> diff --git a/hw/ppc/pnv_psi.c b/hw/ppc/pnv_psi.c
>> index 5b969127c303..8b85dd9555e8 100644
>> --- a/hw/ppc/pnv_psi.c
>> +++ b/hw/ppc/pnv_psi.c
>> @@ -22,6 +22,7 @@
>>  #include "target/ppc/cpu.h"
>>  #include "qemu/log.h"
>>  #include "qapi/error.h"
>> +#include "monitor/monitor.h"
>>  
>>  #include "exec/address-spaces.h"
>>  
>> @@ -114,12 +115,14 @@
>>  #define PSIHB_BAR_MASK                  0x0003fffffff00000ull
>>  #define PSIHB_FSPBAR_MASK               0x0003ffff00000000ull
>>  
>> +#define PSIHB_REG(addr) (((addr) >> 3) + PSIHB_XSCOM_BAR)
>> +
>>  static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
>>  {
>>      MemoryRegion *sysmem = get_system_memory();
>>      uint64_t old = psi->regs[PSIHB_XSCOM_BAR];
>>  
>> -    psi->regs[PSIHB_XSCOM_BAR] = bar & (PSIHB_BAR_MASK | PSIHB_BAR_EN);
>> +    psi->regs[PSIHB_XSCOM_BAR] = bar;
>>  
>>      /* Update MR, always remove it first */
>>      if (old & PSIHB_BAR_EN) {
>> @@ -128,7 +131,7 @@ static void pnv_psi_set_bar(PnvPsi *psi, uint64_t bar)
>>  
>>      /* Then add it back if needed */
>>      if (bar & PSIHB_BAR_EN) {
>> -        uint64_t addr = bar & PSIHB_BAR_MASK;
>> +        uint64_t addr = bar & ~PSIHB_BAR_EN;
>>          memory_region_add_subregion(sysmem, addr, &psi->regs_mr);
>>      }
>>  }
>> @@ -205,7 +208,12 @@ static const uint64_t stat_bits[] = {
>>      [PSIHB_IRQ_EXTERNAL]  = PSIHB_IRQ_STAT_EXT,
>>  };
>>  
>> -void pnv_psi_irq_set(PnvPsi *psi, PnvPsiIrq irq, bool state)
>> +void pnv_psi_irq_set(PnvPsi *psi, int irq, bool state)
>> +{
>> +    PNV_PSI_GET_CLASS(psi)->irq_set(psi, irq, state);
>> +}
>> +
>> +static void pnv_psi_power8_irq_set(PnvPsi *psi, int irq, bool state)
>>  {
>>      ICSState *ics = &psi->ics;
>>      uint32_t xivr_reg;
>> @@ -324,7 +332,7 @@ static uint64_t pnv_psi_reg_read(PnvPsi *psi, uint32_t offset, bool mmio)
>>          val = psi->regs[offset];
>>          break;
>>      default:
>> -        qemu_log_mask(LOG_UNIMP, "PSI: read at Ox%" PRIx32 "\n", offset);
>> +        qemu_log_mask(LOG_UNIMP, "PSI: read at 0x%" PRIx32 "\n", offset);
>>      }
>>      return val;
>>  }
>> @@ -383,7 +391,7 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
>>          pnv_psi_set_irsn(psi, val);
>>          break;
>>      default:
>> -        qemu_log_mask(LOG_UNIMP, "PSI: write at Ox%" PRIx32 "\n", offset);
>> +        qemu_log_mask(LOG_UNIMP, "PSI: write at 0x%" PRIx32 "\n", offset);
>>      }
>>  }
>>  
>> @@ -393,13 +401,13 @@ static void pnv_psi_reg_write(PnvPsi *psi, uint32_t offset, uint64_t val,
>>   */
>>  static uint64_t pnv_psi_mmio_read(void *opaque, hwaddr addr, unsigned size)
>>  {
>> -    return pnv_psi_reg_read(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, true);
>> +    return pnv_psi_reg_read(opaque, PSIHB_REG(addr), true);
>>  }
>>  
>>  static void pnv_psi_mmio_write(void *opaque, hwaddr addr,
>>                                uint64_t val, unsigned size)
>>  {
>> -    pnv_psi_reg_write(opaque, (addr >> 3) + PSIHB_XSCOM_BAR, val, true);
>> +    pnv_psi_reg_write(opaque, PSIHB_REG(addr), val, true);
>>  }
>>  
>>  static const MemoryRegionOps psi_mmio_ops = {
>> @@ -441,7 +449,7 @@ static const MemoryRegionOps pnv_psi_xscom_ops = {
>>      }
>>  };
>>  
>> -static void pnv_psi_init(Object *obj)
>> +static void pnv_psi_power8_instance_init(Object *obj)
>>  {
>>      PnvPsi *psi = PNV_PSI(obj);
>>  
>> @@ -458,7 +466,7 @@ static const uint8_t irq_to_xivr[] = {
>>      PSIHB_XSCOM_XIVR_EXT,
>>  };
>>  
>> -static void pnv_psi_realize(DeviceState *dev, Error **errp)
>> +static void pnv_psi_power8_realize(DeviceState *dev, Error **errp)
>>  {
>>      PnvPsi *psi = PNV_PSI(dev);
>>      ICSState *ics = &psi->ics;
>> @@ -510,28 +518,34 @@ static void pnv_psi_realize(DeviceState *dev, Error **errp)
>>      }
>>  }
>>  
>> +static const char compat_p8[] = "ibm,power8-psihb-x\0ibm,psihb-x";
>> +static const char compat_p9[] = "ibm,power9-psihb-x\0ibm,psihb-x";
>> +
>>  static int pnv_psi_dt_xscom(PnvXScomInterface *dev, void *fdt, int xscom_offset)
>>  {
>> -    const char compat[] = "ibm,power8-psihb-x\0ibm,psihb-x";
>> +    PnvPsiClass *ppc = PNV_PSI_GET_CLASS(dev);
>>      char *name;
>>      int offset;
>> -    uint32_t lpc_pcba = PNV_XSCOM_PSIHB_BASE;
>>      uint32_t reg[] = {
>> -        cpu_to_be32(lpc_pcba),
>> -        cpu_to_be32(PNV_XSCOM_PSIHB_SIZE)
>> +        cpu_to_be32(ppc->xscom_pcba),
>> +        cpu_to_be32(ppc->xscom_size)
>>      };
>>  
>> -    name = g_strdup_printf("psihb@%x", lpc_pcba);
>> +    name = g_strdup_printf("psihb@%x", ppc->xscom_pcba);
>>      offset = fdt_add_subnode(fdt, xscom_offset, name);
>>      _FDT(offset);
>>      g_free(name);
>>  
>> -    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
>> -
>> -    _FDT((fdt_setprop_cell(fdt, offset, "#address-cells", 2)));
>> -    _FDT((fdt_setprop_cell(fdt, offset, "#size-cells", 1)));
>> -    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
>> -                      sizeof(compat))));
>> +    _FDT(fdt_setprop(fdt, offset, "reg", reg, sizeof(reg)));
>> +    _FDT(fdt_setprop_cell(fdt, offset, "#address-cells", 2));
>> +    _FDT(fdt_setprop_cell(fdt, offset, "#size-cells", 1));
>> +    if (ppc->chip_type == PNV_CHIP_POWER9) {
>> +        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p9,
>> +                         sizeof(compat_p9)));
>> +    } else {
>> +        _FDT(fdt_setprop(fdt, offset, "compatible", compat_p8,
>> +                         sizeof(compat_p8)));
>> +    }
>>      return 0;
>>  }
>>  
>> @@ -541,6 +555,324 @@ static Property pnv_psi_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> +static void pnv_psi_power8_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
>> +
>> +    dc->desc    = "PowerNV PSI Controller POWER8";
>> +    dc->realize = pnv_psi_power8_realize;
>> +
>> +    ppc->chip_type  = PNV_CHIP_POWER8;
>> +    ppc->xscom_pcba = PNV_XSCOM_PSIHB_BASE;
>> +    ppc->xscom_size = PNV_XSCOM_PSIHB_SIZE;
>> +    ppc->irq_set    = pnv_psi_power8_irq_set;
>> +}
>> +
>> +static const TypeInfo pnv_psi_power8_info = {
>> +    .name          = TYPE_PNV_PSI_POWER8,
>> +    .parent        = TYPE_PNV_PSI,
>> +    .instance_init = pnv_psi_power8_instance_init,
>> +    .class_init    = pnv_psi_power8_class_init,
>> +};
>> +
>> +/* Common registers */
>> +
>> +#define PSIHB9_CR                       0x20
>> +#define PSIHB9_SEMR                     0x28
>> +
>> +/* P9 registers */
>> +
>> +#define PSIHB9_INTERRUPT_CONTROL        0x58
>> +#define   PSIHB9_IRQ_METHOD             PPC_BIT(0)
>> +#define   PSIHB9_IRQ_RESET              PPC_BIT(1)
>> +#define PSIHB9_ESB_CI_BASE              0x60
>> +#define   PSIHB9_ESB_CI_VALID           1
>> +#define PSIHB9_ESB_NOTIF_ADDR           0x68
>> +#define   PSIHB9_ESB_NOTIF_VALID        1
>> +#define PSIHB9_IVT_OFFSET               0x70
>> +#define   PSIHB9_IVT_OFF_SHIFT          32
>> +
>> +#define PSIHB9_IRQ_LEVEL                0x78 /* assertion */
>> +#define   PSIHB9_IRQ_LEVEL_PSI          PPC_BIT(0)
>> +#define   PSIHB9_IRQ_LEVEL_OCC          PPC_BIT(1)
>> +#define   PSIHB9_IRQ_LEVEL_FSI          PPC_BIT(2)
>> +#define   PSIHB9_IRQ_LEVEL_LPCHC        PPC_BIT(3)
>> +#define   PSIHB9_IRQ_LEVEL_LOCAL_ERR    PPC_BIT(4)
>> +#define   PSIHB9_IRQ_LEVEL_GLOBAL_ERR   PPC_BIT(5)
>> +#define   PSIHB9_IRQ_LEVEL_TPM          PPC_BIT(6)
>> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ1    PPC_BIT(7)
>> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ2    PPC_BIT(8)
>> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ3    PPC_BIT(9)
>> +#define   PSIHB9_IRQ_LEVEL_LPC_SIRQ4    PPC_BIT(10)
>> +#define   PSIHB9_IRQ_LEVEL_SBE_I2C      PPC_BIT(11)
>> +#define   PSIHB9_IRQ_LEVEL_DIO          PPC_BIT(12)
>> +#define   PSIHB9_IRQ_LEVEL_PSU          PPC_BIT(13)
>> +#define   PSIHB9_IRQ_LEVEL_I2C_C        PPC_BIT(14)
>> +#define   PSIHB9_IRQ_LEVEL_I2C_D        PPC_BIT(15)
>> +#define   PSIHB9_IRQ_LEVEL_I2C_E        PPC_BIT(16)
>> +#define   PSIHB9_IRQ_LEVEL_SBE          PPC_BIT(19)
>> +
>> +#define PSIHB9_IRQ_STAT                 0x80 /* P bit */
>> +#define   PSIHB9_IRQ_STAT_PSI           PPC_BIT(0)
>> +#define   PSIHB9_IRQ_STAT_OCC           PPC_BIT(1)
>> +#define   PSIHB9_IRQ_STAT_FSI           PPC_BIT(2)
>> +#define   PSIHB9_IRQ_STAT_LPCHC         PPC_BIT(3)
>> +#define   PSIHB9_IRQ_STAT_LOCAL_ERR     PPC_BIT(4)
>> +#define   PSIHB9_IRQ_STAT_GLOBAL_ERR    PPC_BIT(5)
>> +#define   PSIHB9_IRQ_STAT_TPM           PPC_BIT(6)
>> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ1     PPC_BIT(7)
>> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ2     PPC_BIT(8)
>> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ3     PPC_BIT(9)
>> +#define   PSIHB9_IRQ_STAT_LPC_SIRQ4     PPC_BIT(10)
>> +#define   PSIHB9_IRQ_STAT_SBE_I2C       PPC_BIT(11)
>> +#define   PSIHB9_IRQ_STAT_DIO           PPC_BIT(12)
>> +#define   PSIHB9_IRQ_STAT_PSU           PPC_BIT(13)
>> +
>> +static void pnv_psi_notify(XiveFabric *xf, uint32_t srcno)
>> +{
>> +    PnvPsi *psi = PNV_PSI(xf);
>> +    uint64_t notif_port = psi->regs[PSIHB_REG(PSIHB9_ESB_NOTIF_ADDR)];
>> +    bool valid = notif_port & PSIHB9_ESB_NOTIF_VALID;
>> +    uint64_t notify_addr = notif_port & ~PSIHB9_ESB_NOTIF_VALID;
>> +
>> +    uint32_t offset =
>> +        (psi->regs[PSIHB_REG(PSIHB9_IVT_OFFSET)] >> PSIHB9_IVT_OFF_SHIFT);
>> +    uint64_t lisn = cpu_to_be64(offset + srcno);
>> +
>> +    if (valid) {
>> +        cpu_physical_memory_write(notify_addr, &lisn, sizeof(lisn));
>> +    }
>> +}
>> +
>> +/*
>> + * TODO : move to parent class
>> + */
>> +static void pnv_psi_reset(DeviceState *dev)
>> +{
>> +    PnvPsi *psi = PNV_PSI(dev);
>> +
>> +    memset(psi->regs, 0x0, sizeof(psi->regs));
>> +
>> +    psi->regs[PSIHB_XSCOM_BAR] = psi->bar | PSIHB_BAR_EN;
>> +}
>> +
>> +static uint64_t pnv_psi_p9_mmio_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    PnvPsi *psi = PNV_PSI(opaque);
>> +    uint32_t reg = PSIHB_REG(addr);
>> +    uint64_t val = -1;
>> +
>> +    switch (addr) {
>> +    case PSIHB9_CR:
>> +    case PSIHB9_SEMR:
>> +        /* FSP stuff */
>> +    case PSIHB9_INTERRUPT_CONTROL:
>> +    case PSIHB9_ESB_CI_BASE:
>> +    case PSIHB9_ESB_NOTIF_ADDR:
>> +    case PSIHB9_IVT_OFFSET:
>> +        val = psi->regs[reg];
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: read at 0x%" PRIx64 "\n", addr);
>> +    }
>> +
>> +    return val;
>> +}
>> +
>> +static void pnv_psi_p9_mmio_write(void *opaque, hwaddr addr,
>> +                                  uint64_t val, unsigned size)
>> +{
>> +    PnvPsi *psi = PNV_PSI(opaque);
>> +    uint32_t reg = PSIHB_REG(addr);
>> +    MemoryRegion *sysmem = get_system_memory();
>> +
>> +    switch (addr) {
>> +    case PSIHB9_CR:
>> +    case PSIHB9_SEMR:
>> +        /* FSP stuff */
>> +        break;
>> +    case PSIHB9_INTERRUPT_CONTROL:
>> +        if (val & PSIHB9_IRQ_RESET) {
>> +            device_reset(DEVICE(&psi->source));
>> +        }
>> +        psi->regs[reg] = val;
>> +        break;
>> +
>> +    case PSIHB9_ESB_CI_BASE:
>> +        if (!(val & PSIHB9_ESB_CI_VALID)) {
>> +            if (psi->regs[reg] & PSIHB9_ESB_CI_VALID) {
>> +                memory_region_del_subregion(sysmem, &psi->source.esb_mmio);
>> +            }
>> +        } else {
>> +            if (!(psi->regs[reg] & PSIHB9_ESB_CI_VALID)) {
>> +                memory_region_add_subregion(sysmem,
>> +                                        val & ~PSIHB9_ESB_CI_VALID,
>> +                                        &psi->source.esb_mmio);
>> +            }
>> +        }
>> +        psi->regs[reg] = val;
>> +        break;
>> +
>> +    case PSIHB9_ESB_NOTIF_ADDR:
>> +        psi->regs[reg] = val;
>> +        break;
>> +    case PSIHB9_IVT_OFFSET:
>> +        psi->regs[reg] = val;
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: write at 0x%" PRIx64 "\n", addr);
>> +    }
>> +}
>> +
>> +static const MemoryRegionOps pnv_psi_p9_mmio_ops = {
>> +    .read = pnv_psi_p9_mmio_read,
>> +    .write = pnv_psi_p9_mmio_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static uint64_t pnv_psi_p9_xscom_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    /* No reads are expected */
>> +    qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom read at 0x%" PRIx64 "\n", addr);
>> +    return -1;
>> +}
>> +
>> +static void pnv_psi_p9_xscom_write(void *opaque, hwaddr addr,
>> +                                uint64_t val, unsigned size)
>> +{
>> +    PnvPsi *psi = PNV_PSI(opaque);
>> +
>> +    /* XSCOM is only used to set the PSIHB MMIO region */
>> +    switch (addr >> 3) {
>> +    case PSIHB_XSCOM_BAR:
>> +        pnv_psi_set_bar(psi, val);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: xscom write at 0x%" PRIx64 "\n",
>> +                      addr);
>> +    }
>> +}
>> +
>> +static const MemoryRegionOps pnv_psi_p9_xscom_ops = {
>> +    .read = pnv_psi_p9_xscom_read,
>> +    .write = pnv_psi_p9_xscom_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    }
>> +};
>> +
>> +static void pnv_psi_power9_irq_set(PnvPsi *psi, int irq, bool state)
>> +{
>> +    uint32_t irq_method = psi->regs[PSIHB_REG(PSIHB9_INTERRUPT_CONTROL)];
>> +
>> +    if (irq >= PSIHB9_NUM_IRQS) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: Unsupported irq %d\n", irq);
>> +        return;
>> +    }
>> +
>> +    if (irq_method & PSIHB9_IRQ_METHOD) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "PSI: LSI IRQ method not supported\n");
>> +        return;
>> +    }
>> +
>> +    /* Update LSI levels */
>> +    if (state) {
>> +        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] |= PPC_BIT(irq);
>> +    } else {
>> +        psi->regs[PSIHB_REG(PSIHB9_IRQ_LEVEL)] &= ~PPC_BIT(irq);
>> +    }
>> +
>> +    qemu_set_irq(xive_source_qirq(&psi->source, irq), state);
>> +}
>> +
>> +static void pnv_psi_power9_instance_init(Object *obj)
>> +{
>> +    PnvPsi *psi = PNV_PSI(obj);
>> +
>> +    object_initialize(&psi->source, sizeof(psi->source), TYPE_XIVE_SOURCE);
>> +    object_property_add_child(obj, "source", OBJECT(&psi->source), NULL);
>> +}
>> +
>> +static void pnv_psi_power9_realize(DeviceState *dev, Error **errp)
>> +{
>> +    PnvPsi *psi = PNV_PSI(dev);
>> +    XiveSource *xsrc = &psi->source;
>> +    Error *local_err = NULL;
>> +    int i;
>> +
>> +    /* This is the only device with 4k ESB pages */
>> +    object_property_set_int(OBJECT(xsrc), XIVE_ESB_4K, "shift",
>> +                            &error_fatal);
>> +    object_property_set_int(OBJECT(xsrc), PSIHB9_NUM_IRQS, "nr-irqs",
>> +                            &error_fatal);
>> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(psi),
>> +                                   &error_fatal);
>> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
>> +
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        xive_source_irq_set(xsrc, i, true);
>> +    }
>> +
>> +    /* XSCOM region for PSI registers */
>> +    pnv_xscom_region_init(&psi->xscom_regs, OBJECT(dev), &pnv_psi_p9_xscom_ops,
>> +                psi, "xscom-psi", PNV9_XSCOM_PSIHB_SIZE);
>> +
>> +    /* Initialize MMIO region */
>> +    memory_region_init_io(&psi->regs_mr, OBJECT(dev), &pnv_psi_p9_mmio_ops, psi,
>> +                          "psihb", PNV9_PSIHB_SIZE);
>> +
>> +    /* Default BAR for MMIO region */
>> +    pnv_psi_set_bar(psi, psi->bar | PSIHB_BAR_EN);
>> +}
>> +
>> +static void pnv_psi_power9_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    PnvPsiClass *ppc = PNV_PSI_CLASS(klass);
>> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
>> +
>> +    dc->desc    = "PowerNV PSI Controller POWER9";
>> +    dc->realize = pnv_psi_power9_realize;
>> +
>> +    ppc->chip_type  = PNV_CHIP_POWER9;
>> +    ppc->xscom_pcba = PNV9_XSCOM_PSIHB_BASE;
>> +    ppc->xscom_size = PNV9_XSCOM_PSIHB_SIZE;
>> +    ppc->irq_set    = pnv_psi_power9_irq_set;
>> +
>> +    xfc->notify      = pnv_psi_notify;
>> +}
>> +
>> +static const TypeInfo pnv_psi_power9_info = {
>> +    .name          = TYPE_PNV_PSI_POWER9,
>> +    .parent        = TYPE_PNV_PSI,
>> +    .instance_init = pnv_psi_power9_instance_init,
>> +    .class_init    = pnv_psi_power9_class_init,
>> +    .interfaces = (InterfaceInfo[]) {
>> +            { TYPE_XIVE_FABRIC },
>> +            { },
>> +    },
>> +};
>> +
>>  static void pnv_psi_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>> @@ -548,16 +880,18 @@ static void pnv_psi_class_init(ObjectClass *klass, void *data)
>>  
>>      xdc->dt_xscom = pnv_psi_dt_xscom;
>>  
>> -    dc->realize = pnv_psi_realize;
>> +    dc->desc = "PowerNV PSI Controller";
>>      dc->props = pnv_psi_properties;
>> +    dc->reset  = pnv_psi_reset;
>>  }
>>  
>>  static const TypeInfo pnv_psi_info = {
>>      .name          = TYPE_PNV_PSI,
>>      .parent        = TYPE_SYS_BUS_DEVICE,
>>      .instance_size = sizeof(PnvPsi),
>> -    .instance_init = pnv_psi_init,
>>      .class_init    = pnv_psi_class_init,
>> +    .class_size    = sizeof(PnvPsiClass),
>> +    .abstract      = true,
>>      .interfaces    = (InterfaceInfo[]) {
>>          { TYPE_PNV_XSCOM_INTERFACE },
>>          { }
>> @@ -567,6 +901,18 @@ static const TypeInfo pnv_psi_info = {
>>  static void pnv_psi_register_types(void)
>>  {
>>      type_register_static(&pnv_psi_info);
>> +    type_register_static(&pnv_psi_power8_info);
>> +    type_register_static(&pnv_psi_power9_info);
>>  }
>>  
>>  type_init(pnv_psi_register_types)
>> +
>> +void pnv_psi_pic_print_info(PnvPsi *psi, Monitor *mon)
>> +{
>> +    uint32_t offset =
>> +        (psi->regs[PSIHB_REG(PSIHB9_IVT_OFFSET)] >> PSIHB9_IVT_OFF_SHIFT);
>> +
>> +    monitor_printf(mon, "PSIHB Source %08x .. %08x\n",
>> +                  offset, offset + psi->source.nr_irqs - 1);
>> +    xive_source_pic_print_info(&psi->source, offset, mon);
>> +}
> 
> 
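
A note for readers of the PSI patch quoted above: the notify() path boils
down to a small address and LISN computation. The sketch below is
standalone C, not QEMU code. psi_lisn(), psi_notif_target() and
store_be64() are hypothetical helper names; only the register layout
(valid flag in the low bit of the notification address register, IVT
offset in the upper 32 bits, big-endian store) is taken from
pnv_psi_notify() in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define NOTIF_VALID    1ull   /* mirrors PSIHB9_ESB_NOTIF_VALID */
#define IVT_OFF_SHIFT  32     /* mirrors PSIHB9_IVT_OFF_SHIFT */

/* LISN = IVT offset (upper 32 bits of the register) + source number */
static uint32_t psi_lisn(uint64_t ivt_offset_reg, uint32_t srcno)
{
    return (uint32_t)(ivt_offset_reg >> IVT_OFF_SHIFT) + srcno;
}

/* Return true and fill *addr when the notification port is valid */
static bool psi_notif_target(uint64_t notif_reg, uint64_t *addr)
{
    assert(addr != NULL);
    if (!(notif_reg & NOTIF_VALID)) {
        return false;
    }
    *addr = notif_reg & ~NOTIF_VALID;
    return true;
}

/* Serialize a 64-bit value most significant byte first (big-endian) */
static void store_be64(uint8_t out[8], uint64_t val)
{
    for (int i = 0; i < 8; i++) {
        out[i] = (uint8_t)(val >> (56 - 8 * i));
    }
}
```

With an IVT offset register of 0x0000002000000000 (an illustrative value,
not one from the patch), source 3 would be reported as LISN 0x23.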


* Re: [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model
  2018-11-23  0:31       ` David Gibson
@ 2018-11-23  8:21         ` Cédric Le Goater
  2018-11-26  8:14         ` Cédric Le Goater
  1 sibling, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-23  8:21 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/23/18 1:31 AM, David Gibson wrote:
> On Thu, Nov 22, 2018 at 08:25:06AM +0100, Cédric Le Goater wrote:
>> On 11/22/18 4:05 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:56:54AM +0100, Cédric Le Goater wrote:
>>>> The first sub-engine of the overall XIVE architecture is the Interrupt
>>>> Virtualization Source Engine (IVSE). An IVSE can be integrated into
>>>> another logic, like in a PCI PHB or in the main interrupt controller
>>>> to manage IPIs.
>>>>
>>>> Each IVSE instance is associated with an Event State Buffer (ESB) that
>>>> contains a two bit state entry for each possible event source. When an
>>>> event is signaled to the IVSE, by MMIO or some other means, the
>>>> associated interrupt state bits are fetched from the ESB and
>>>> modified. Depending on the resulting ESB state, the event is forwarded
>>>> to the IVRE sub-engine of the controller doing the routing.
>>>>
>>>> Each supported ESB entry is associated with either a single page or an
>>>> even/odd pair of pages which provide commands to manage the source:
>>>> to EOI or to turn off the source, for instance.
>>>>
>>>> On a sPAPR machine, the O/S will obtain the page address of the ESB
>>>> entry associated with a source and its characteristic using the
>>>> H_INT_GET_SOURCE_INFO hcall. On PowerNV, a similar OPAL call is used.
>>>>
>>>> The xive_source_notify() routine is in charge of forwarding the source
>>>> event notification to the routing engine. It will be filled later on.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>
>>> Ok, this is looking basically pretty good.  Few details to query
>>> below.
>>>
>>>
>>>> ---
>>>>  default-configs/ppc64-softmmu.mak |   1 +
>>>>  include/hw/ppc/xive.h             | 130 ++++++++++
>>>>  hw/intc/xive.c                    | 379 ++++++++++++++++++++++++++++++
>>>>  hw/intc/Makefile.objs             |   1 +
>>>>  4 files changed, 511 insertions(+)
>>>>  create mode 100644 include/hw/ppc/xive.h
>>>>  create mode 100644 hw/intc/xive.c
>>>>
>>>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>>>> index aec2855750d6..2d1e7c5c4668 100644
>>>> --- a/default-configs/ppc64-softmmu.mak
>>>> +++ b/default-configs/ppc64-softmmu.mak
>>>> @@ -16,6 +16,7 @@ CONFIG_VIRTIO_VGA=y
>>>>  CONFIG_XICS=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>>> +CONFIG_XIVE=$(CONFIG_PSERIES)
>>>>  CONFIG_MEM_DEVICE=y
>>>>  CONFIG_DIMM=y
>>>>  CONFIG_SPAPR_RNG=y
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> new file mode 100644
>>>> index 000000000000..5fec4b08705d
>>>> --- /dev/null
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -0,0 +1,130 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>
>>> A cheat sheet in the top of this header with the old and new XIVE
>>> terms would be quite nice to have.
>>
>> Yes. It's a good place. I will put the XIVE acronyms here :
>>      
>>      EA		Event Assignment
>>      EISN	Effective Interrupt Source Number
>>      END	Event Notification Descriptor
>>      ESB	Event State Buffer
>>      EQ		Event Queue
>>      LISN	Logical Interrupt Source Number
>>      NVT	Notification Virtual Target
>>      TIMA	Thread Interrupt Management Area
>>      ...
> 
> That sounds good, but what I'd also like is showing that NVT == VP and
> EAS == IVT and so forth.

sure. I will add that. 

skiboot and PowerNV Linux use a mix of the old and new terms, and I
wonder whether I should clarify that there as well, at least in skiboot.

>>>> + */
>>>> +
>>>> +#ifndef PPC_XIVE_H
>>>> +#define PPC_XIVE_H
>>>> +
>>>> +#include "hw/sysbus.h"
>>>
>>> So, I'm a bit dubious about making the XiveSource a SysBus device -
>>> I'm concerned it won't play well with tying it into the other devices
>>> like PHB that "own" it in real hardware.
>>
>> It does but I can take a look at changing it to a DeviceState. The 
>> reset handlers might be a concern.
> 
> As "non bus" device I think you'd need to register your own reset
> handler rather than just setting dc->reset.  Otherwise, I think that
> should work.

Yes. I will give it a try; it might not be such a problem.
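
To illustrate the suggestion above (a device that is not on a bus
registers its own reset handler instead of relying on dc->reset), here is
a minimal standalone sketch. It does not use QEMU headers: register_reset()
and run_resets() are hypothetical stand-ins modeled on the
qemu_register_reset(func, opaque) pattern, and ToySource is an invented
type with one state byte per IRQ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a global list of reset handlers */
typedef void ResetHandler(void *opaque);

#define MAX_RESET_HANDLERS 8

static struct {
    ResetHandler *func;
    void *opaque;
} reset_handlers[MAX_RESET_HANDLERS];
static size_t nr_reset_handlers;

/* Modeled on the qemu_register_reset(func, opaque) calling convention */
static void register_reset(ResetHandler *func, void *opaque)
{
    assert(nr_reset_handlers < MAX_RESET_HANDLERS);
    reset_handlers[nr_reset_handlers].func = func;
    reset_handlers[nr_reset_handlers].opaque = opaque;
    nr_reset_handlers++;
}

/* What a machine reset would do: call every registered handler */
static void run_resets(void)
{
    for (size_t i = 0; i < nr_reset_handlers; i++) {
        reset_handlers[i].func(reset_handlers[i].opaque);
    }
}

/* Toy interrupt source: one state byte per IRQ */
typedef struct ToySource {
    uint8_t status[4];
} ToySource;

static void toy_source_reset(void *opaque)
{
    ToySource *src = opaque;

    /* Placeholder reset value; a real model picks its own initial state */
    memset(src->status, 0, sizeof(src->status));
}

/* The "realize" step registers the handler explicitly */
static void toy_source_realize(ToySource *src)
{
    register_reset(toy_source_reset, src);
}
```

The point of the pattern is only that the registration happens in the
device's own realize step, so the handler runs at machine reset even
though no bus walks the device.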

>>> I think we'd be better off making it a direct descendent of
>>> TYPE_DEVICE which constructs the MMIO region, but doesn't map it.
>>
>> At one point, I started working on a XiveESB object doing what I think
>> you are suggesting, and then I removed it. I am reluctant to add more
>> complexity now; the patchset is just growing and growing ...
>>
>> But I agree there are fundamentals to get right for KVM. Let's talk 
>> about it after you have looked at the overall patchset, at least up 
>> to KVM initial support.
> 
> Hm, ok.
> 
>>> Then we can have a SysBusDevice (and/or other) wrapper which
>>> instantiates the XiveSource core and maps it into somewhere
>>> accessible.
>>
>> The XIVE controller model does the mapping of the source currently.
> 
> I'm.. I'm not sure what you mean by that.   We have a
> sysbus_init_mmio() right here which effectively maps in the MMIO
> region AFAICT.

Yes. What I meant is that the XIVE controller model does all the
mapping, for the TIMA and for the ESB pages of the XiveSource.

This is a 'critical' part of the XIVE model because the regions have
a different nature under KVM, which requires the KVM device to be
created before the regions are, and the regions to be destroyed when
the device is.

It took me a while to get everything in place to support all aspects of
the model: KVM and non-KVM, switching of interrupt controller, machine
reset, post_load, PowerNV devices, etc. So expect some resistance
from me on that topic.
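
As an illustration of the ESB pages being mapped, here is a minimal
standalone sketch of the per-source offsets inside the ESB MMIO region.
The shift values and the even/odd page convention come from the quoted
xive.h; the helper names are hypothetical, and both the management page
offset (second half of the per-source window) and the region size
(window size times number of sources) are assumptions of this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* ESB shift values, as defined in the quoted xive.h */
#define ESB_4K          12
#define ESB_4K_2PAGE    13
#define ESB_64K         16
#define ESB_64K_2PAGE   17

static bool esb_has_2page(uint32_t esb_shift)
{
    return esb_shift == ESB_64K_2PAGE || esb_shift == ESB_4K_2PAGE;
}

/* Offset of the first (trigger) page of a source in the ESB region */
static uint64_t esb_page(uint32_t esb_shift, uint32_t nr_irqs, uint32_t srcno)
{
    assert(srcno < nr_irqs);
    return (1ull << esb_shift) * srcno;
}

/*
 * Offset of the management page. In a two-page setting this is the odd
 * page (assumed to be the second half of the window), else the same page.
 */
static uint64_t esb_mgmt(uint32_t esb_shift, uint32_t nr_irqs, uint32_t srcno)
{
    uint64_t addr = esb_page(esb_shift, nr_irqs, srcno);

    if (esb_has_2page(esb_shift)) {
        addr += 1ull << (esb_shift - 1);
    }
    return addr;
}

/* Assumed total size of the ESB MMIO region to be mapped */
static uint64_t esb_region_size(uint32_t esb_shift, uint32_t nr_irqs)
{
    return (1ull << esb_shift) * nr_irqs;
}
```

With the PSI settings from the POWER9 PSI patch in this thread (4K pages,
14 sources), this gives a region of 0xE000 bytes, which fits in the
0x10000 bytes of PNV9_PSIHB_ESB_SIZE.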

Thanks,

C.

>> In the case of sPAPR, the controller model controls the TIMA and 
>> for PowerNV, there are quite a few other MMIO regions to handle.
>>
>>>
>>>> +
>>>> +/*
>>>> + * XIVE Interrupt Source
>>>> + */
>>>> +
>>>> +#define TYPE_XIVE_SOURCE "xive-source"
>>>> +#define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>>>> +
>>>> +/*
>>>> + * XIVE Interrupt Source characteristics, which define how the ESB are
>>>> + * controlled.
>>>> + */
>>>> +#define XIVE_SRC_H_INT_ESB     0x1 /* ESB managed with hcall H_INT_ESB */
>>>> +#define XIVE_SRC_STORE_EOI     0x2 /* Store EOI supported */
>>>> +
>>>> +typedef struct XiveSource {
>>>> +    SysBusDevice parent;
>>>> +
>>>> +    /* IRQs */
>>>> +    uint32_t        nr_irqs;
>>>> +    qemu_irq        *qirqs;
>>>> +
>>>> +    /* PQ bits */
>>>> +    uint8_t         *status;
>>>> +
>>>> +    /* ESB memory region */
>>>> +    uint64_t        esb_flags;
>>>> +    uint32_t        esb_shift;
>>>> +    MemoryRegion    esb_mmio;
>>>> +} XiveSource;
>>>> +
>>>> +/*
>>>> + * ESB MMIO setting. Can be one page, for both source triggering and
>>>> + * source management, or two different pages. See below for magic
>>>> + * values.
>>>> + */
>>>> +#define XIVE_ESB_4K          12 /* PSI HB only */
>>>> +#define XIVE_ESB_4K_2PAGE    13
>>>> +#define XIVE_ESB_64K         16
>>>> +#define XIVE_ESB_64K_2PAGE   17
>>>> +
>>>> +static inline bool xive_source_esb_has_2page(XiveSource *xsrc)
>>>> +{
>>>> +    return xsrc->esb_shift == XIVE_ESB_64K_2PAGE ||
>>>> +        xsrc->esb_shift == XIVE_ESB_4K_2PAGE;
>>>> +}
>>>> +
>>>> +/* The trigger page is always the first/even page */
>>>> +static inline hwaddr xive_source_esb_page(XiveSource *xsrc, uint32_t srcno)
>>>
>>> This function doesn't appear to be used anywhere except..
>>
>> It's used in patch 16 adding the hcalls also.
>>
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +    return (1ull << xsrc->esb_shift) * srcno;
>>>> +}
>>>> +
>>>> +/* In a two pages ESB MMIO setting, the odd page is for management */
>>>> +static inline hwaddr xive_source_esb_mgmt(XiveSource *xsrc, int srcno)
>>>
>>>
>>> ..here, and this function doesn't appear to be used anywhere.
>>
>> It's used in patch 16 adding the hcalls and patch 23 for KVM.
>>
>> This is basic ESB support, which I thought belonged in the patch on sources.
>>
>>>
>>>> +{
>>>> +    hwaddr addr = xive_source_esb_page(xsrc, srcno);
>>>> +
>>>> +    if (xive_source_esb_has_2page(xsrc)) {
>>>> +        addr += (1 << (xsrc->esb_shift - 1));
>>>> +    }
>>>> +
>>>> +    return addr;
>>>> +}
>>>> +
>>>> +/*
>>>> + * Each interrupt source has a 2-bit state machine which can be
>>>> + * controlled by MMIO. P indicates that an interrupt is pending (has
>>>> + * been sent to a queue and is waiting for an EOI). Q indicates that
>>>> + * the interrupt has been triggered while pending.
>>>> + *
>>>> + * This acts as a coalescing mechanism in order to guarantee that a
>>>> + * given interrupt only occurs at most once in a queue.
>>>> + *
>>>> + * When doing an EOI, the Q bit will indicate if the interrupt
>>>> + * needs to be re-triggered.
>>>> + */
>>>> +#define XIVE_ESB_VAL_P        0x2
>>>> +#define XIVE_ESB_VAL_Q        0x1
>>>> +
>>>> +#define XIVE_ESB_RESET        0x0
>>>> +#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
>>>> +#define XIVE_ESB_QUEUED       (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q)
>>>> +#define XIVE_ESB_OFF          XIVE_ESB_VAL_Q
>>>> +
>>>> +/*
>>>> + * "magic" Event State Buffer (ESB) MMIO offsets.
>>>> + *
>>>> + * The following offsets into the ESB MMIO allow to read or manipulate
>>>> + * the PQ bits. They must be used with an 8-byte load instruction.
>>>> + * They all return the previous state of the interrupt (atomically).
>>>> + *
>>>> + * Additionally, some ESB pages support doing an EOI via a store and
>>>> + * some ESBs support doing a trigger via a separate trigger page.
>>>> + */
>>>> +#define XIVE_ESB_STORE_EOI      0x400 /* Store */
>>>> +#define XIVE_ESB_LOAD_EOI       0x000 /* Load */
>>>> +#define XIVE_ESB_GET            0x800 /* Load */
>>>> +#define XIVE_ESB_SET_PQ_00      0xc00 /* Load */
>>>> +#define XIVE_ESB_SET_PQ_01      0xd00 /* Load */
>>>> +#define XIVE_ESB_SET_PQ_10      0xe00 /* Load */
>>>> +#define XIVE_ESB_SET_PQ_11      0xf00 /* Load */
>>>> +
>>>> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno);
>>>> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq);
>>>> +
>>>> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset,
>>>> +                                Monitor *mon);
>>>> +
>>>> +static inline qemu_irq xive_source_qirq(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +    return xsrc->qirqs[srcno];
>>>> +}
>>>> +
>>>> +#endif /* PPC_XIVE_H */
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> new file mode 100644
>>>> index 000000000000..f7621f84828c
>>>> --- /dev/null
>>>> +++ b/hw/intc/xive.c
>>>> @@ -0,0 +1,379 @@
>>>> +/*
>>>> + * QEMU PowerPC XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/log.h"
>>>> +#include "qapi/error.h"
>>>> +#include "target/ppc/cpu.h"
>>>> +#include "sysemu/cpus.h"
>>>> +#include "sysemu/dma.h"
>>>> +#include "monitor/monitor.h"
>>>> +#include "hw/ppc/xive.h"
>>>> +
>>>> +/*
>>>> + * XIVE ESB helpers
>>>> + */
>>>> +
>>>> +static uint8_t xive_esb_set(uint8_t *pq, uint8_t value)
>>>> +{
>>>> +    uint8_t old_pq = *pq & 0x3;
>>>> +
>>>> +    *pq &= ~0x3;
>>>> +    *pq |= value & 0x3;
>>>> +
>>>> +    return old_pq;
>>>> +}
>>>> +
>>>> +static bool xive_esb_trigger(uint8_t *pq)
>>>> +{
>>>> +    uint8_t old_pq = *pq & 0x3;
>>>> +
>>>> +    switch (old_pq) {
>>>> +    case XIVE_ESB_RESET:
>>>> +        xive_esb_set(pq, XIVE_ESB_PENDING);
>>>> +        return true;
>>>> +    case XIVE_ESB_PENDING:
>>>> +    case XIVE_ESB_QUEUED:
>>>> +        xive_esb_set(pq, XIVE_ESB_QUEUED);
>>>> +        return false;
>>>> +    case XIVE_ESB_OFF:
>>>> +        xive_esb_set(pq, XIVE_ESB_OFF);
>>>> +        return false;
>>>> +    default:
>>>> +        g_assert_not_reached();
>>>> +    }
>>>> +}
>>>> +
>>>> +static bool xive_esb_eoi(uint8_t *pq)
>>>> +{
>>>> +    uint8_t old_pq = *pq & 0x3;
>>>> +
>>>> +    switch (old_pq) {
>>>> +    case XIVE_ESB_RESET:
>>>> +    case XIVE_ESB_PENDING:
>>>> +        xive_esb_set(pq, XIVE_ESB_RESET);
>>>> +        return false;
>>>> +    case XIVE_ESB_QUEUED:
>>>> +        xive_esb_set(pq, XIVE_ESB_PENDING);
>>>> +        return true;
>>>> +    case XIVE_ESB_OFF:
>>>> +        xive_esb_set(pq, XIVE_ESB_OFF);
>>>> +        return false;
>>>> +    default:
>>>> +        g_assert_not_reached();
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * XIVE Interrupt Source (or IVSE)
>>>> + */
>>>> +
>>>> +uint8_t xive_source_esb_get(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +
>>>> +    return xsrc->status[srcno] & 0x3;
>>>> +}
>>>> +
>>>> +uint8_t xive_source_esb_set(XiveSource *xsrc, uint32_t srcno, uint8_t pq)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +
>>>> +    return xive_esb_set(&xsrc->status[srcno], pq);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Returns whether the event notification should be forwarded.
>>>> + */
>>>> +static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +
>>>> +    return xive_esb_trigger(&xsrc->status[srcno]);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Returns whether the event notification should be forwarded.
>>>> + */
>>>> +static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>>>> +{
>>>> +    assert(srcno < xsrc->nr_irqs);
>>>> +
>>>> +    return xive_esb_eoi(&xsrc->status[srcno]);
>>>> +}
>>>> +
>>>> +/*
>>>> + * Forward the source event notification to the Router
>>>> + */
>>>> +static void xive_source_notify(XiveSource *xsrc, int srcno)
>>>> +{
>>>> +
>>>> +}
>>>> +
>>>> +/*
>>>> + * In a two pages ESB MMIO setting, even page is the trigger page, odd
>>>> + * page is for management
>>>> + */
>>>> +static inline bool addr_is_even(hwaddr addr, uint32_t shift)
>>>> +{
>>>> +    return !((addr >> shift) & 1);
>>>> +}
>>>> +
>>>> +static inline bool xive_source_is_trigger_page(XiveSource *xsrc, hwaddr addr)
>>>> +{
>>>> +    return xive_source_esb_has_2page(xsrc) &&
>>>> +        addr_is_even(addr, xsrc->esb_shift - 1);
>>>> +}
>>>> +
>>>> +/*
>>>> + * ESB MMIO loads
>>>> + *                      Trigger page    Management/EOI page
>>>> + * 2 pages setting      even            odd
>>>> + *
>>>> + * 0x000 .. 0x3FF       -1              EOI and return 0|1
>>>> + * 0x400 .. 0x7FF       -1              EOI and return 0|1
>>>> + * 0x800 .. 0xBFF       -1              return PQ
>>>> + * 0xC00 .. 0xCFF       -1              return PQ and atomically PQ=00
>>>> + * 0xD00 .. 0xDFF       -1              return PQ and atomically PQ=01
>>>> + * 0xE00 .. 0xEFF       -1              return PQ and atomically PQ=10
>>>> + * 0xF00 .. 0xFFF       -1              return PQ and atomically PQ=11
>>>> + */
>>>
>>> I can't quite make sense of this table.  What do the -1s represent,
>>
>> the value returned by the load.
>>
>>> and how does it relate to the non-2page case?
>>
>> A one-page ESB supports trigger and management on the same page. So for loads,
>> the odd page behavior applies.
>>
>>>> +static uint64_t xive_source_esb_read(void *opaque, hwaddr addr, unsigned size)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>> +    uint32_t offset = addr & 0xFFF;
>>>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>>>> +    uint64_t ret = -1;
>>>> +
>>>> +    /* In a two pages ESB MMIO setting, trigger page should not be read */
>>>> +    if (xive_source_is_trigger_page(xsrc, addr)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR,
>>>> +                      "XIVE: invalid load on IRQ %d trigger page at "
>>>> +                      "0x%"HWADDR_PRIx"\n", srcno, addr);
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    switch (offset) {
>>>> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
>>>> +        ret = xive_source_esb_eoi(xsrc, srcno);
>>>> +
>>>> +        /* Forward the source event notification for routing */
>>>> +        if (ret) {
>>>> +            xive_source_notify(xsrc, srcno);
>>>> +        }
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
>>>> +        ret = xive_source_esb_get(xsrc, srcno);
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
>>>> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
>>>> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
>>>> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
>>>> +        ret = xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
>>>> +        break;
>>>> +    default:
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB load addr %x\n",
>>>> +                      offset);
>>>> +    }
>>>> +
>>>> +    return ret;
>>>> +}
>>>> +
>>>> +/*
>>>> + * ESB MMIO stores
>>>> + *                      Trigger page    Management/EOI page
>>>> + * 2 pages setting      even            odd
>>>
>>> As with the previous table, I don't quite understand what the headings
>>> above mean.
>>
>> A one-page ESB supports trigger and management on the same page. So for stores,
>> the odd page behavior applies.
>>
>> The headings can be improved. I will think of something.
>>
>>>> + * 0x000 .. 0x3FF       Trigger         Trigger
>>>> + * 0x400 .. 0x7FF       Trigger         EOI
>>>> + * 0x800 .. 0xBFF       Trigger         undefined
>>>> + * 0xC00 .. 0xCFF       Trigger         PQ=00
>>>> + * 0xD00 .. 0xDFF       Trigger         PQ=01
>>>> + * 0xE00 .. 0xEFF       Trigger         PQ=10
>>>> + * 0xF00 .. 0xFFF       Trigger         PQ=11
>>>> + */
>>>> +static void xive_source_esb_write(void *opaque, hwaddr addr,
>>>> +                                  uint64_t value, unsigned size)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>> +    uint32_t offset = addr & 0xFFF;
>>>> +    uint32_t srcno = addr >> xsrc->esb_shift;
>>>> +    bool notify = false;
>>>> +
>>>> +    /* In a two pages ESB MMIO setting, trigger page only triggers */
>>>> +    if (xive_source_is_trigger_page(xsrc, addr)) {
>>>> +        notify = xive_source_esb_trigger(xsrc, srcno);
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    switch (offset) {
>>>> +    case 0 ... 0x3FF:
>>>> +        notify = xive_source_esb_trigger(xsrc, srcno);
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_STORE_EOI ... XIVE_ESB_STORE_EOI + 0x3FF:
>>>> +        if (!(xsrc->esb_flags & XIVE_SRC_STORE_EOI)) {
>>>> +            qemu_log_mask(LOG_GUEST_ERROR,
>>>> +                          "XIVE: invalid Store EOI for IRQ %d\n", srcno);
>>>> +            return;
>>>> +        }
>>>> +
>>>> +        notify = xive_source_esb_eoi(xsrc, srcno);
>>>> +        break;
>>>> +
>>>> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
>>>> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
>>>> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
>>>> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
>>>> +        xive_source_esb_set(xsrc, srcno, (offset >> 8) & 0x3);
>>>> +        break;
>>>> +
>>>> +    default:
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr %x\n",
>>>> +                      offset);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +out:
>>>> +    /* Forward the source event notification for routing */
>>>> +    if (notify) {
>>>> +        xive_source_notify(xsrc, srcno);
>>>> +    }
>>>> +}
>>>> +
>>>> +static const MemoryRegionOps xive_source_esb_ops = {
>>>> +    .read = xive_source_esb_read,
>>>> +    .write = xive_source_esb_write,
>>>> +    .endianness = DEVICE_BIG_ENDIAN,
>>>> +    .valid = {
>>>> +        .min_access_size = 8,
>>>> +        .max_access_size = 8,
>>>> +    },
>>>> +    .impl = {
>>>> +        .min_access_size = 8,
>>>> +        .max_access_size = 8,
>>>> +    },
>>>> +};
>>>> +
>>>> +static void xive_source_set_irq(void *opaque, int srcno, int val)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>> +    bool notify = false;
>>>> +
>>>> +    if (val) {
>>>> +        notify = xive_source_esb_trigger(xsrc, srcno);
>>>> +    }
>>>> +
>>>> +    /* Forward the source event notification for routing */
>>>> +    if (notify) {
>>>> +        xive_source_notify(xsrc, srcno);
>>>> +    }
>>>> +}
>>>> +
>>>> +void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>>>> +        uint8_t pq = xive_source_esb_get(xsrc, i);
>>>> +
>>>> +        if (pq == XIVE_ESB_OFF) {
>>>> +            continue;
>>>> +        }
>>>> +
>>>> +        monitor_printf(mon, "  %08x %c%c\n", i + offset,
>>>> +                       pq & XIVE_ESB_VAL_P ? 'P' : '-',
>>>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>>>> +    }
>>>> +}
>>>> +
>>>> +static void xive_source_reset(DeviceState *dev)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +
>>>> +    /* PQs are initialized to 0b01 which corresponds to "ints off" */
>>>> +    memset(xsrc->status, 0x1, xsrc->nr_irqs);
>>>
>>> You've already got XIVE_ESB_OFF defined to make this a little clearer.
>>
>> Sure.
>>
>> Thanks,
>>
>> C. 
>>
>>
>>>
>>>> +}
>>>> +
>>>> +static void xive_source_realize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +
>>>> +    if (!xsrc->nr_irqs) {
>>>> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
>>>> +        xsrc->esb_shift != XIVE_ESB_4K_2PAGE &&
>>>> +        xsrc->esb_shift != XIVE_ESB_64K &&
>>>> +        xsrc->esb_shift != XIVE_ESB_64K_2PAGE) {
>>>> +        error_setg(errp, "Invalid ESB shift setting");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>>>> +                                     xsrc->nr_irqs);
>>>> +
>>>> +    xsrc->status = g_malloc0(xsrc->nr_irqs);
>>>> +
>>>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>>> +                          &xive_source_esb_ops, xsrc, "xive.esb",
>>>> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>>>> +}
>>>> +
>>>> +static const VMStateDescription vmstate_xive_source = {
>>>> +    .name = TYPE_XIVE_SOURCE,
>>>> +    .version_id = 1,
>>>> +    .minimum_version_id = 1,
>>>> +    .fields = (VMStateField[]) {
>>>> +        VMSTATE_UINT32_EQUAL(nr_irqs, XiveSource, NULL),
>>>> +        VMSTATE_VBUFFER_UINT32(status, XiveSource, 1, NULL, nr_irqs),
>>>> +        VMSTATE_END_OF_LIST()
>>>> +    },
>>>> +};
>>>> +
>>>> +/*
>>>> + * The default XIVE interrupt source setting for the ESB MMIOs is two
>>>> + * 64k pages without Store EOI, to be in sync with KVM.
>>>> + */
>>>> +static Property xive_source_properties[] = {
>>>> +    DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>>>> +    DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>>>> +    DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>>>> +    DEFINE_PROP_END_OF_LIST(),
>>>> +};
>>>> +
>>>> +static void xive_source_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +
>>>> +    dc->desc    = "XIVE Interrupt Source";
>>>> +    dc->props   = xive_source_properties;
>>>> +    dc->realize = xive_source_realize;
>>>> +    dc->reset   = xive_source_reset;
>>>> +    dc->vmsd    = &vmstate_xive_source;
>>>> +}
>>>> +
>>>> +static const TypeInfo xive_source_info = {
>>>> +    .name          = TYPE_XIVE_SOURCE,
>>>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>>>> +    .instance_size = sizeof(XiveSource),
>>>> +    .class_init    = xive_source_class_init,
>>>> +};
>>>> +
>>>> +static void xive_register_types(void)
>>>> +{
>>>> +    type_register_static(&xive_source_info);
>>>> +}
>>>> +
>>>> +type_init(xive_register_types)
>>>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>>>> index 0e9963f5eecc..72a46ed91c31 100644
>>>> --- a/hw/intc/Makefile.objs
>>>> +++ b/hw/intc/Makefile.objs
>>>> @@ -37,6 +37,7 @@ obj-$(CONFIG_SH4) += sh_intc.o
>>>>  obj-$(CONFIG_XICS) += xics.o
>>>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>>> +obj-$(CONFIG_XIVE) += xive.o
>>>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>>>
>>
> 

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-23  1:10       ` David Gibson
@ 2018-11-23 10:28         ` Cédric Le Goater
  2018-11-26  5:44           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-23 10:28 UTC (permalink / raw)
  To: David Gibson, Benjamin Herrenschmidt; +Cc: qemu-ppc, qemu-devel

On 11/23/18 2:10 AM, David Gibson wrote:
> On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
>> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
>>>
>>> Sorry, didn't think of this in my first reply.
>>>
>>> 1) Does the hardware ever actually write back to the EAS?  I know it
>>> does for the END, but it's not clear why it would need to for the
>>> EAS.  If not, we don't need the setter.
>>
>> Nope, though the PAPR model will via hcalls
> 
> Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
> metal details.  Since the hcall knows it's PAPR it can just update the
> backing information for the EAS directly, and no need for an
> abstracted hook.

Indeed, the first versions of the XIVE patchset did not use such hooks, 
but when discussed we said we wanted abstract methods for the router 
to validate the overall XIVE model, which is useful for PowerNV. 

We can change again and have the hcalls get/set directly in the EAT
and ENDT. It would certainly simplify the sPAPR model. 
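
To illustrate what that could look like, here is a minimal sketch, not
the code from the patchset. The DemoSpaprXive structure, its eat[]
array and the demo_* helpers are hypothetical names; the only thing
assumed from the model is that an EAS is a single 64-bit word with its
valid bit at PPC bit 0 (the most significant bit). The getter follows
the suggestion quoted below of returning the value directly, with the
valid bit cleared when the LISN is out of bounds:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: valid bit is PPC bit 0, i.e. the MSB of the 64-bit word */
#define DEMO_EAS_VALID   (1ull << 63)

/* Hypothetical sPAPR-side backing store for the EAT */
typedef struct DemoSpaprXive {
    uint32_t nr_irqs;
    uint64_t *eat;          /* one EAS word per LISN */
} DemoSpaprXive;

/* Return the EAS by value; out-of-range LISNs read as "invalid" */
static uint64_t demo_eat_get(const DemoSpaprXive *xive, uint32_t lisn)
{
    if (lisn >= xive->nr_irqs) {
        return 0;           /* valid bit cleared */
    }
    return xive->eat[lisn];
}

/* The hcall knows it is PAPR, so it can update the table directly */
static void demo_eat_set(DemoSpaprXive *xive, uint32_t lisn, uint64_t eas)
{
    assert(lisn < xive->nr_irqs);
    xive->eat[lisn] = eas;
}
```

With this shape the router would only need a read accessor; whether
that is enough for PowerNV is a separate question.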

C.


> 
>>
>>>
>>> 2) The signatures are a bit odd here.  For the setter, a value would
>>> make more sense than a (XiveEAS *), since it's just a word.  For the getter
>>> you could return the EAS value directly rather than using a pointer -
>>> there's already a valid bit in the EAS so you can construct a value
>>> with that cleared if the lisn is out of bounds.
>>
> 

* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-23  4:35       ` David Gibson
@ 2018-11-23 11:01         ` Cédric Le Goater
  2018-11-29  4:46           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-23 11:01 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/23/18 5:35 AM, David Gibson wrote:
> On Thu, Nov 22, 2018 at 10:47:44PM +0100, Cédric Le Goater wrote:
>> On 11/22/18 5:41 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:56:58AM +0100, Cédric Le Goater wrote:
>>>> To complete the event routing, the IVRE sub-engine uses an internal
>>>> table containing Event Notification Descriptor (END) structures.
>>>>
>>>> An END specifies on which Event Queue (EQ) the event notification
>>>> data, defined in the associated EAS, should be posted when an
>>>> exception occurs. It also defines which Notification Virtual Target
>>>> (NVT) should be notified.
>>>>
>>>> The Event Queue is a memory page provided by the O/S defining a
>>>> circular buffer, one per server and priority couple, containing Event
>>>> Queue entries. These are 4 bytes long, the first bit being a
>>>> 'generation' bit and the 31 following bits the END Data field. They
>>>> are pulled by the O/S when the exception occurs.
>>>>
>>>> The END Data field is a way to set an invariant logical event source
>>>> number for an IRQ. It is set with the H_INT_SET_SOURCE_CONFIG hcall
>>>> when the EISN flag is used.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/xive.h      |  18 ++++
>>>>  include/hw/ppc/xive_regs.h |  48 ++++++++++
>>>>  hw/intc/xive.c             | 185 ++++++++++++++++++++++++++++++++++++-
>>>>  3 files changed, 248 insertions(+), 3 deletions(-)
>>>>
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index 5a0696366577..ce62aaf28343 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -193,11 +193,29 @@ typedef struct XiveRouterClass {
>>>>      /* XIVE table accessors */
>>>>      int (*get_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>>      int (*set_eas)(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>> +    int (*get_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>> +                   XiveEND *end);
>>>> +    int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>> +                   XiveEND *end);
>>>
>>> Hrm.  So unlike the EAS, which is basically just a word, the END is a
>>> pretty large structure.  
>>
>> yes. and so will be the NVT.
>>
>>> It's unclear here if get/set are expected to copy the whole thing out 
>>> and in, 
>>
>> That's the plan. 
> 
> Yeah, I don't think that's a good idea.  In some cases the updates are
> on hot paths, so the extra copy isn't good, and more importantly it
> makes it look like an atomic update, but it's not really.
> 
> Well... I guess it probably is because of the BQL, but I'd prefer not
> to rely on that excessively.
> 
>> What I had in mind are memory accessors to the XIVE structures, which 
>> are local to QEMU for sPAPR and in the guest RAM for PowerNV (Please
>> take a look at the XIVE PowerNV model).
>>
>>> or if get give you a pointer into a "live" structure 
>>
>> no
>>
>>> and set just does any necessary barriers after an update.
>> that would be too complex for the PowerNV model I think. There is a cache
>> in between the software running on the (QEMU) machine and the XIVE HW but
>> it would be hard to handle. 
>>  
>>> Really, for a non-atomic value like this, I'm not sure get/set is the
>>> right model.
>>
>> ok. we need something to get them out and in.
> 
> I've thought about this a bit more.  What I think might work is
> "end_read" and "end_write" callbacks, which take a word number in
> addition to the parameters you have already

ouch. This is not going to simplify the routing algo where all the get and
set are done. The whole END is needed in the END trigger. So we will need
routines to get and set the whole END.
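
To show why several END fields are involved at once, here is a
simplified sketch of pushing one event into the EQ. It is not the code
of the patch: DemoEND and demo_end_push() are hypothetical, the queue
is a plain host array instead of guest memory, and the big-endian
conversion is left out. The entry layout (generation bit first, then
31 bits of END data) and the qentries computation come from the
patch; flipping the generation bit when the index wraps is my
assumption of how the generation is maintained.

```c
#include <assert.h>
#include <stdint.h>

/* Simplified stand-in for the END: only the fields a push needs */
typedef struct DemoEND {
    uint32_t qsize;     /* log2(entries) - 10, as in END_W0_QSIZE */
    uint32_t qindex;    /* next entry to write, as in END_W1_PAGE_OFF */
    uint32_t qgen;      /* generation bit, as in END_W1_GENERATION */
    uint32_t *queue;    /* stand-in for the EQ page in guest RAM */
} DemoEND;

static void demo_end_push(DemoEND *end, uint32_t data)
{
    uint32_t qentries = 1u << (end->qsize + 10);

    /* Entry: generation bit on top, 31 bits of END data below */
    end->queue[end->qindex] = (end->qgen << 31) | (data & 0x7fffffff);

    /* Advance the index; assumed: the generation flips on wrap */
    end->qindex = (end->qindex + 1) & (qentries - 1);
    if (end->qindex == 0) {
        end->qgen ^= 1;
    }
}
```

A single push reads the size, index, generation and queue address and
then writes back the index and generation, which is why a word-by-word
accessor would make the routing code heavier.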

The NVT is not used too much for the moment. 

>>> Also as I understand it nearly all the indices in XIVE are broken into
>>> block/index.  Is there a reason those are folded together into lisn
>>> for the EAS, but not for the END?
>>
>> The indexing of the EAT is global to the system and the index defines
>> which blk to use. The IRQ source numbers on the powerbus are architected 
>> to be :
>>
>>     #define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))
>>
>> and XIVE can use different strategies to identify the XIVE IC in charge 
>> of routing. It can be a one-to-one chip to block relation as skiboot does. 
>> Using a block scope table is possible also. Our model only supports one 
>> block per chip and some shortcuts are taken but not that much in fact.
>>  
>> Remote access to the XIVE structures of another chip is done through
>> MMIO (not modeled in PowerNV) and the blkid is used to partition the MMIO 
>> regions. Being local is better for performance because the END and NVT 
>> tables have a strong relation with the XIVE subengines using them 
>> (VC and PC).  
>>
>> Maybe Ben can clarify if this is badly explained.
> 
> Right.. I think I understand what the blocks are all about.
> 
> But my question is, why encode the block and index together for the
> EAS, but separately for the END?

mostly because when the PowerNV devices forward a notification event, they
do so by writing the full IRQ number on a MMIO notify port and this number
reaches the XiveRouter routing algo without being modified. On sPAPR it's
even simpler.

But we could decode the full IRQ number in the router before querying the 
associated EAS. The EAS accessor would then use the same interface. 
I will look into it. Not a big change. I think. 
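
As a rough sketch of that decoding, assuming the XIVE_SRCNO() layout
quoted above (a 4-bit block number above a 28-bit index). The two
helper names are hypothetical and not part of the posted patch:

```c
#include <assert.h>
#include <stdint.h>

/* Encoding quoted earlier in this thread */
#define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))

/* Hypothetical helpers splitting a full IRQ number back into its parts */
static inline uint8_t demo_srcno_block(uint32_t lisn)
{
    return (lisn >> 28) & 0xf;
}

static inline uint32_t demo_srcno_index(uint32_t lisn)
{
    return lisn & 0x0fffffff;
}
```

The router could call these once on the number received from the
notify port and then pass (blk, idx) to the EAS accessor, the same way
it is done for the END.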

> 
>>
>>>>  } XiveRouterClass;
>>>>  
>>>>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>>>>  
>>>>  int xive_router_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>>  int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas);
>>>> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>> +                        XiveEND *end);
>>>> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>> +                        XiveEND *end);
>>>> +
>>>> +/*
>>>> + * For legacy compatibility, the exceptions define up to 256 different
>>>> + * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>>>> + * and the least favored level 0xFF.
>>>> + */
>>>> +#define XIVE_PRIORITY_MAX  7
>>>> +
>>>> +void xive_end_reset(XiveEND *end);
>>>> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>>>>  
>>>>  #endif /* PPC_XIVE_H */
>>>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>>>> index 12499b33614c..f97fb2b90bee 100644
>>>> --- a/include/hw/ppc/xive_regs.h
>>>> +++ b/include/hw/ppc/xive_regs.h
>>>> @@ -28,4 +28,52 @@ typedef struct XiveEAS {
>>>>  #define EAS_END_DATA    PPC_BITMASK(33, 63)      /* Data written to the END */
>>>>  } XiveEAS;
>>>>  
>>>> +/* Event Notification Descriptor (END) */
>>>> +typedef struct XiveEND {
>>>> +        uint32_t        w0;
>>>> +#define END_W0_VALID             PPC_BIT32(0) /* "v" bit */
>>>> +#define END_W0_ENQUEUE           PPC_BIT32(1) /* "q" bit */
>>>> +#define END_W0_UCOND_NOTIFY      PPC_BIT32(2) /* "n" bit */
>>>> +#define END_W0_BACKLOG           PPC_BIT32(3) /* "b" bit */
>>>> +#define END_W0_PRECL_ESC_CTL     PPC_BIT32(4) /* "p" bit */
>>>> +#define END_W0_ESCALATE_CTL      PPC_BIT32(5) /* "e" bit */
>>>> +#define END_W0_UNCOND_ESCALATE   PPC_BIT32(6) /* "u" bit - DD2.0 */
>>>> +#define END_W0_SILENT_ESCALATE   PPC_BIT32(7) /* "s" bit - DD2.0 */
>>>> +#define END_W0_QSIZE             PPC_BITMASK32(12, 15)
>>>> +#define END_W0_SW0               PPC_BIT32(16)
>>>> +#define END_W0_FIRMWARE          END_W0_SW0 /* Owned by FW */
>>>> +#define END_QSIZE_4K             0
>>>> +#define END_QSIZE_64K            4
>>>> +#define END_W0_HWDEP             PPC_BITMASK32(24, 31)
>>>> +        uint32_t        w1;
>>>> +#define END_W1_ESn               PPC_BITMASK32(0, 1)
>>>> +#define END_W1_ESn_P             PPC_BIT32(0)
>>>> +#define END_W1_ESn_Q             PPC_BIT32(1)
>>>> +#define END_W1_ESe               PPC_BITMASK32(2, 3)
>>>> +#define END_W1_ESe_P             PPC_BIT32(2)
>>>> +#define END_W1_ESe_Q             PPC_BIT32(3)
>>>> +#define END_W1_GENERATION        PPC_BIT32(9)
>>>> +#define END_W1_PAGE_OFF          PPC_BITMASK32(10, 31)
>>>> +        uint32_t        w2;
>>>> +#define END_W2_MIGRATION_REG     PPC_BITMASK32(0, 3)
>>>> +#define END_W2_OP_DESC_HI        PPC_BITMASK32(4, 31)
>>>> +        uint32_t        w3;
>>>> +#define END_W3_OP_DESC_LO        PPC_BITMASK32(0, 31)
>>>> +        uint32_t        w4;
>>>> +#define END_W4_ESC_END_BLOCK     PPC_BITMASK32(4, 7)
>>>> +#define END_W4_ESC_END_INDEX     PPC_BITMASK32(8, 31)
>>>> +        uint32_t        w5;
>>>> +#define END_W5_ESC_END_DATA      PPC_BITMASK32(1, 31)
>>>> +        uint32_t        w6;
>>>> +#define END_W6_FORMAT_BIT        PPC_BIT32(8)
>>>> +#define END_W6_NVT_BLOCK         PPC_BITMASK32(9, 12)
>>>> +#define END_W6_NVT_INDEX         PPC_BITMASK32(13, 31)
>>>> +        uint32_t        w7;
>>>> +#define END_W7_F0_IGNORE         PPC_BIT32(0)
>>>> +#define END_W7_F0_BLK_GROUPING   PPC_BIT32(1)
>>>> +#define END_W7_F0_PRIORITY       PPC_BITMASK32(8, 15)
>>>> +#define END_W7_F1_WAKEZ          PPC_BIT32(0)
>>>> +#define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
>>>> +} XiveEND;
>>>> +
>>>>  #endif /* PPC_XIVE_REGS_H */
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index c4c90a25758e..9cb001e7b540 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -442,6 +442,101 @@ static const TypeInfo xive_source_info = {
>>>>      .class_init    = xive_source_class_init,
>>>>  };
>>>>  
>>>> +/*
>>>> + * XiveEND helpers
>>>> + */
>>>> +
>>>> +void xive_end_reset(XiveEND *end)
>>>> +{
>>>> +    memset(end, 0, sizeof(*end));
>>>> +
>>>> +    /* switch off the escalation and notification ESBs */
>>>> +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
>>>
>>> It's not obvious to me what circumstances this would be called under.
>>> Since the ENDs are in system memory, a memset() seems like an odd
>>> thing for (virtual) hardware to be doing to it.
>>
>> It makes sense on sPAPR if one day some OS starts using the END ESBs for 
>> further coalescing of the events. None does for now but I have added the 
>> model though.
> 
> Hrm, I think that belongs in PAPR specific code.  It's not really part
> of the router model - it's the PAPR stuff configuring the router at
> reset time (much as firmware would configure it at reset time for bare
> metal).

It is true that this routine is only used by the H_INT_RESET hcall and by
the reset handler of the sPAPR controller model. But it made sense to put 
this END helper routine with the other END routines. Don't you think so ? 
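
For reference, a small self-contained check of what the reset value of
w1 encodes. The DEMO_* macros are local copies written for this
example (IBM bit numbering, bit 0 being the MSB), and the two decode
helpers are hypothetical; only the bit positions of ESn_Q and ESe_Q
are taken from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Local copy of the IBM bit-numbering helper: bit 0 is the MSB */
#define DEMO_PPC_BIT32(bit)   (0x80000000u >> (bit))

#define DEMO_END_W1_ESn_Q     DEMO_PPC_BIT32(1)
#define DEMO_END_W1_ESe_Q     DEMO_PPC_BIT32(3)

#define DEMO_ESB_OFF          0x1   /* PQ=01, same value as XIVE_ESB_OFF */

/* Value that xive_end_reset() stores in w1 */
static inline uint32_t demo_end_reset_w1(void)
{
    return DEMO_END_W1_ESe_Q | DEMO_END_W1_ESn_Q;
}

/* ESn lives in bits 0-1 of w1, ESe in bits 2-3 (IBM numbering) */
static inline uint8_t demo_end_esn(uint32_t w1)
{
    return (w1 >> 30) & 0x3;
}

static inline uint8_t demo_end_ese(uint32_t w1)
{
    return (w1 >> 28) & 0x3;
}
```

Both fields decode to PQ=01, so after reset the notification and
escalation ESBs of the END are in the same "off" state as a freshly
reset source.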

C.

> 
>>
>>>> +}
>>>> +
>>>> +static void xive_end_queue_pic_print_info(XiveEND *end, uint32_t width,
>>>> +                                          Monitor *mon)
>>>> +{
>>>> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
>>>> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
>>>> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
>>>> +    uint32_t qentries = 1 << (qsize + 10);
>>>> +    int i;
>>>> +
>>>> +    /*
>>>> +     * print out the [ (qindex - (width - 1)) .. (qindex + 1)] window
>>>> +     */
>>>> +    monitor_printf(mon, " [ ");
>>>> +    qindex = (qindex - (width - 1)) & (qentries - 1);
>>>> +    for (i = 0; i < width; i++) {
>>>> +        uint64_t qaddr = qaddr_base + (qindex << 2);
>>>> +        uint32_t qdata = -1;
>>>> +
>>>> +        if (dma_memory_read(&address_space_memory, qaddr, &qdata,
>>>> +                            sizeof(qdata))) {
>>>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ @0x%"
>>>> +                          HWADDR_PRIx "\n", qaddr);
>>>> +            return;
>>>> +        }
>>>> +        monitor_printf(mon, "%s%08x ", i == width - 1 ? "^" : "",
>>>> +                       be32_to_cpu(qdata));
>>>> +        qindex = (qindex + 1) & (qentries - 1);
>>>> +    }
>>>> +    monitor_printf(mon, "]\n");
>>>> +}
>>>> +
>>>> +void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon)
>>>> +{
>>>> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
>>>> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
>>>> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
>>>> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
>>>> +    uint32_t qentries = 1 << (qsize + 10);
>>>> +
>>>> +    uint32_t nvt = GETFIELD(END_W6_NVT_INDEX, end->w6);
>>>> +    uint8_t priority = GETFIELD(END_W7_F0_PRIORITY, end->w7);
>>>> +
>>>> +    if (!(end->w0 & END_W0_VALID)) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    monitor_printf(mon, "  %08x %c%c%c%c%c prio:%d nvt:%04x eq:@%08"PRIx64
>>>> +                   "% 6d/%5d ^%d", end_idx,
>>>> +                   end->w0 & END_W0_VALID ? 'v' : '-',
>>>> +                   end->w0 & END_W0_ENQUEUE ? 'q' : '-',
>>>> +                   end->w0 & END_W0_UCOND_NOTIFY ? 'n' : '-',
>>>> +                   end->w0 & END_W0_BACKLOG ? 'b' : '-',
>>>> +                   end->w0 & END_W0_ESCALATE_CTL ? 'e' : '-',
>>>> +                   priority, nvt, qaddr_base, qindex, qentries, qgen);
>>>> +
>>>> +    xive_end_queue_pic_print_info(end, 6, mon);
>>>> +}
>>>> +
>>>> +static void xive_end_push(XiveEND *end, uint32_t data)
>>>
>>> s/push/enqueue/ please, "push" suggests a stack.  (Not to mention that
>>> "push" and "pull" are used as terms elsewhere in XIVE).
>>
>> yes. you are right. I will change.
>>
>>>> +{
>>>> +    uint64_t qaddr_base = (((uint64_t)(end->w2 & 0x0fffffff)) << 32) | end->w3;
>>>> +    uint32_t qsize = GETFIELD(END_W0_QSIZE, end->w0);
>>>> +    uint32_t qindex = GETFIELD(END_W1_PAGE_OFF, end->w1);
>>>> +    uint32_t qgen = GETFIELD(END_W1_GENERATION, end->w1);
>>>> +
>>>> +    uint64_t qaddr = qaddr_base + (qindex << 2);
>>>> +    uint32_t qdata = cpu_to_be32((qgen << 31) | (data & 0x7fffffff));
>>>> +    uint32_t qentries = 1 << (qsize + 10);
>>>> +
>>>> +    if (dma_memory_write(&address_space_memory, qaddr, &qdata, sizeof(qdata))) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to write END data @0x%"
>>>> +                      HWADDR_PRIx "\n", qaddr);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    qindex = (qindex + 1) & (qentries - 1);
>>>> +    if (qindex == 0) {
>>>> +        qgen ^= 1;
>>>> +        end->w1 = SETFIELD(END_W1_GENERATION, end->w1, qgen);
>>>> +    }
>>>> +    end->w1 = SETFIELD(END_W1_PAGE_OFF, end->w1, qindex);
>>>> +}
>>>> +
>>>>  /*
>>>>   * XIVE Router (aka. Virtualization Controller or IVRE)
>>>>   */
>>>> @@ -460,6 +555,82 @@ int xive_router_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>>      return xrc->set_eas(xrtr, lisn, eas);
>>>>  }
>>>>  
>>>> +int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>> +                        XiveEND *end)
>>>> +{
>>>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>>>> +
>>>> +   return xrc->get_end(xrtr, end_blk, end_idx, end);
>>>> +}
>>>> +
>>>> +int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>> +                        XiveEND *end)
>>>> +{
>>>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>>>> +
>>>> +   return xrc->set_end(xrtr, end_blk, end_idx, end);
>>>> +}
>>>> +
>>>> +/*
>>>> + * An END trigger can come from an event trigger (IPI or HW) or from
>>>> + * another chip. We don't model the PowerBus but the END trigger
>>>> + * message has the same parameters as the function below.
>>>> + */
>>>> +static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>>> +                                   uint32_t end_idx, uint32_t end_data)
>>>> +{
>>>> +    XiveEND end;
>>>> +    uint8_t priority;
>>>> +    uint8_t format;
>>>> +
>>>> +    /* END cache lookup */
>>>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
>>>> +                      end_idx);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (!(end.w0 & END_W0_VALID)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
>>>> +                      end_blk, end_idx);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    if (end.w0 & END_W0_ENQUEUE) {
>>>> +        xive_end_push(&end, end_data);
>>>> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * The W7 format depends on the F bit in W6. It defines the type
>>>> +     * of the notification :
>>>> +     *
>>>> +     *   F=0 : single or multiple NVT notification
>>>> +     *   F=1 : User level Event-Based Branch (EBB) notification, no
>>>> +     *         priority
>>>> +     */
>>>> +    format = GETFIELD(END_W6_FORMAT_BIT, end.w6);
>>>> +    priority = GETFIELD(END_W7_F0_PRIORITY, end.w7);
>>>> +
>>>> +    /* The END is masked */
>>>> +    if (format == 0 && priority == 0xff) {
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Check the END ESn (Event State Buffer for notification) for
>>>> +     * even further coalescing in the Router
>>>> +     */
>>>> +    if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
>>>> +        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Follows IVPE notification
>>>> +     */
>>>> +}
>>>> +
>>>>  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>>>  {
>>>>      XiveRouter *xrtr = XIVE_ROUTER(xf);
>>>> @@ -471,9 +642,9 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>>>          return;
>>>>      }
>>>>  
>>>> -    /* The IVRE has a State Bit Cache for its internal sources which
>>>> -     * is also involed at this point. We skip the SBC lookup because
>>>> -     * the state bits of the sources are modeled internally in QEMU.
>>>> +    /* The IVRE checks the State Bit Cache at this point. We skip the
>>>> +     * SBC lookup because the state bits of the sources are modeled
>>>> +     * internally in QEMU.
>>>
>>> Replacing a comment about something we're not doing with a different
>>> comment about something we're not doing doesn't seem very useful.
>>> Maybe fold these together into one patch or the other.
>>
>> That's me rephrasing. It should indeed be in the previous patch.
>>
>> Thanks,
>>
>> C.
>>
>>>>       */
>>>>  
>>>>      if (!(eas.w & EAS_VALID)) {
>>>> @@ -485,6 +656,14 @@ static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>>>          /* Notification completed */
>>>>          return;
>>>>      }
>>>> +
>>>> +    /*
>>>> +     * The event trigger becomes an END trigger
>>>> +     */
>>>> +    xive_router_end_notify(xrtr,
>>>> +                           GETFIELD(EAS_END_BLOCK, eas.w),
>>>> +                           GETFIELD(EAS_END_INDEX, eas.w),
>>>> +                           GETFIELD(EAS_END_DATA,  eas.w));
>>>>  }
>>>>  
>>>>  static Property xive_router_properties[] = {
>>>
>>
> 
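
For reference, the event queue handling in xive_end_push() quoted above
boils down to a power-of-two ring buffer whose entries carry a generation
bit. Below is a minimal standalone sketch of that idea. It assumes a
hypothetical DemoQueue structure in place of the END words, and it stores
entries in host byte order rather than doing the big-endian DMA write of
the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the END queue state; not the XiveEND layout. */
typedef struct DemoQueue {
    uint32_t *entries;   /* backing storage, nentries words */
    uint32_t nentries;   /* number of entries, must be a power of two */
    uint32_t index;      /* next slot to write */
    uint32_t gen;        /* generation bit, 0 or 1 */
} DemoQueue;

/* Store one event: generation in the top bit, data in the low 31 bits. */
static void demo_queue_enqueue(DemoQueue *q, uint32_t data)
{
    assert(q->nentries && (q->nentries & (q->nentries - 1)) == 0);
    assert(q->gen <= 1);

    q->entries[q->index] = (q->gen << 31) | (data & 0x7fffffff);

    /* Advance with wrap-around and flip the generation on each wrap. */
    q->index = (q->index + 1) & (q->nentries - 1);
    if (q->index == 0) {
        q->gen ^= 1;
    }
}
```

With a four-entry queue and a generation of 1, four calls bring the index
back to 0 and flip the generation to 0, so the fifth entry overwrites
slot 0 with its top bit cleared.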


* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-23  1:08       ` David Gibson
@ 2018-11-23 13:28         ` Cédric Le Goater
  2018-11-26  5:39           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-23 13:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt


>>>> +/*
>>>> + * Returns whether the event notification should be forwarded.
>>>> + */
>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t srcno)
>>>
>>> What exactly "trigger" means isn't entirely obvious for an LSI.  Might
>>> be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
>>
>> This is called only when the interrupt is asserted. So it is a 
>> simplified LSI trigger depending only on the 'P' bit.
> 
> Yes, I see that.  But the result is that while the MSI logic is
> encapsulated in the MSI trigger function, this leaves the LSI logic
> split across the trigger function and set_irq() itself. I think it 
> would be better to have assert and deassert helpers instead, which
> handle both the trigger/notification and also the updating of the
> ASSERTED bit.

Something like the xive_source_set_irq_lsi() below?

Thanks,

C.


Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 hw/intc/xive.c        |   58 ++++++++++++++++++++++++++++++++++++++++++++------
 include/hw/ppc/xive.h |   19 +++++++++++++++-
 2 files changed, 70 insertions(+), 7 deletions(-)

Index: qemu.git/include/hw/ppc/xive.h
===================================================================
--- qemu.git.orig/include/hw/ppc/xive.h
+++ qemu.git/include/hw/ppc/xive.h
@@ -32,8 +32,9 @@ typedef struct XiveSource {
     /* IRQs */
     uint32_t        nr_irqs;
     qemu_irq        *qirqs;
+    unsigned long   *lsi_map;
 
-    /* PQ bits */
+    /* PQ bits and LSI assertion bit */
     uint8_t         *status;
 
     /* ESB memory region */
@@ -89,6 +90,7 @@ static inline hwaddr xive_source_esb_mgm
  * When doing an EOI, the Q bit will indicate if the interrupt
  * needs to be re-triggered.
  */
+#define XIVE_STATUS_ASSERTED  0x4  /* Extra bit for LSI */
 #define XIVE_ESB_VAL_P        0x2
 #define XIVE_ESB_VAL_Q        0x1
 
@@ -127,4 +129,19 @@ static inline qemu_irq xive_source_qirq(
     return xsrc->qirqs[srcno];
 }
 
+static inline bool xive_source_irq_is_lsi(XiveSource *xsrc, uint32_t srcno)
+{
+    assert(srcno < xsrc->nr_irqs);
+    return test_bit(srcno, xsrc->lsi_map);
+}
+
+static inline void xive_source_irq_set(XiveSource *xsrc, uint32_t srcno,
+                                       bool lsi)
+{
+    assert(srcno < xsrc->nr_irqs);
+    if (lsi) {
+        bitmap_set(xsrc->lsi_map, srcno, 1);
+    }
+}
+
 #endif /* PPC_XIVE_H */
Index: qemu.git/hw/intc/xive.c
===================================================================
--- qemu.git.orig/hw/intc/xive.c
+++ qemu.git/hw/intc/xive.c
@@ -91,11 +91,35 @@ uint8_t xive_source_esb_set(XiveSource *
 /*
  * Returns whether the event notification should be forwarded.
  */
+static bool xive_source_set_irq_lsi(XiveSource *xsrc, uint32_t srcno, int val)
+{
+    if (!val)  {
+        xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
+        return false;
+    }
+
+    xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
+    return xive_source_esb_get(xsrc, srcno) == XIVE_ESB_RESET;
+}
+
+/*
+ * Returns whether the event notification should be forwarded.
+ */
 static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
 {
+    bool notify;
+
     assert(srcno < xsrc->nr_irqs);
 
-    return xive_esb_trigger(&xsrc->status[srcno]);
+    notify = xive_esb_trigger(&xsrc->status[srcno]);
+
+    if (xive_source_irq_is_lsi(xsrc, srcno) &&
+        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
+        qemu_log_mask(LOG_GUEST_ERROR,
+                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
+    }
+
+    return notify;
 }
 
 /*
@@ -103,9 +127,22 @@ static bool xive_source_esb_trigger(Xive
  */
 static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
 {
+    bool notify;
+
     assert(srcno < xsrc->nr_irqs);
 
-    return xive_esb_eoi(&xsrc->status[srcno]);
+    notify = xive_esb_eoi(&xsrc->status[srcno]);
+
+    /* LSI sources do not set the Q bit but they can still be
+     * asserted, in which case we should forward a new event
+     * notification
+     */
+    if (xive_source_irq_is_lsi(xsrc, srcno)) {
+        bool level = xsrc->status[srcno] & XIVE_STATUS_ASSERTED;
+        notify = xive_source_set_irq_lsi(xsrc, srcno, level);
+    }
+
+    return notify;
 }
 
 /*
@@ -268,8 +305,12 @@ static void xive_source_set_irq(void *op
     XiveSource *xsrc = XIVE_SOURCE(opaque);
     bool notify = false;
 
-    if (val) {
-        notify = xive_source_esb_trigger(xsrc, srcno);
+    if (xive_source_irq_is_lsi(xsrc, srcno)) {
+        notify = xive_source_set_irq_lsi(xsrc, srcno, val);
+    } else {
+        if (val) {
+            notify = xive_source_esb_trigger(xsrc, srcno);
+        }
     }
 
     /* Forward the source event notification for routing */
@@ -289,9 +330,11 @@ void xive_source_pic_print_info(XiveSour
             continue;
         }
 
-        monitor_printf(mon, "  %08x %c%c\n", i + offset,
+        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
+                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
                        pq & XIVE_ESB_VAL_P ? 'P' : '-',
-                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
+                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
+                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
     }
 }
 
@@ -299,6 +342,8 @@ static void xive_source_reset(DeviceStat
 {
     XiveSource *xsrc = XIVE_SOURCE(dev);
 
+    /* Do not clear the LSI bitmap */
+
     /* PQs are initialized to 0b01 which corresponds to "ints off" */
     memset(xsrc->status, 0x1, xsrc->nr_irqs);
 }
@@ -324,6 +369,7 @@ static void xive_source_realize(DeviceSt
                                      xsrc->nr_irqs);
 
     xsrc->status = g_malloc0(xsrc->nr_irqs);
+    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
 
     memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
                           &xive_source_esb_ops, xsrc, "xive.esb",
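
The assert/deassert handling discussed above can be summarised as a small
state machine. The sketch below is standalone and only illustrative.
DemoLsi and the demo_* helpers are hypothetical names, and the PQ
encodings (00 reset, 01 off, 10 pending) are assumed. Moving PQ to pending
when a notification is forwarded is an assumption about the intended
behaviour, not something the patch above does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed 2-bit PQ encodings (P is 0x2, Q is 0x1, as in the patch). */
#define DEMO_PQ_RESET    0x0
#define DEMO_PQ_OFF      0x1
#define DEMO_PQ_PENDING  0x2

/* Hypothetical per-source state: PQ bits plus the level of the line. */
typedef struct DemoLsi {
    uint8_t pq;
    bool    asserted;
} DemoLsi;

/* Returns true when an event notification should be forwarded. */
static bool demo_lsi_set_level(DemoLsi *s, bool level)
{
    assert(s->pq <= 0x3);
    s->asserted = level;

    if (!level) {
        return false;           /* a deassert never notifies */
    }
    if (s->pq != DEMO_PQ_RESET) {
        return false;           /* already pending, or source is off */
    }
    s->pq = DEMO_PQ_PENDING;    /* assumption: P is set on notification */
    return true;
}

/* EOI clears P, then re-notifies if the line is still asserted. */
static bool demo_lsi_eoi(DemoLsi *s)
{
    if (s->pq == DEMO_PQ_OFF) {
        return false;           /* a masked source stays masked */
    }
    s->pq = DEMO_PQ_RESET;
    return demo_lsi_set_level(s, s->asserted);
}
```

Keeping the level update and the notification decision in one helper means
set_irq() and the EOI path share the same logic, which is the point of the
assert/deassert suggestion.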


* Re: [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context
  2018-11-23  5:08   ` David Gibson
@ 2018-11-25 20:35     ` Cédric Le Goater
  2018-11-27  5:07       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-25 20:35 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/23/18 6:08 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:00AM +0100, Cédric Le Goater wrote:
>> Each POWER9 processor chip has a XIVE presenter that can generate four
>> different exceptions to its threads:
>>
>>   - hypervisor exception,
>>   - O/S exception
>>   - Event-Based Branch (EBB)
>>   - msgsnd (doorbell).
>>
>> Each exception has a state independent from the others called a Thread
>> Interrupt Management context. This context is a set of registers which
>> lets the thread handle priority management and interrupt acknowledgment
>> among other things. The most important ones being :
>>
>>   - Interrupt Priority Register  (PIPR)
>>   - Interrupt Pending Buffer     (IPB)
>>   - Current Processor Priority   (CPPR)
>>   - Notification Source Register (NSR)
>>
>> These registers are accessible through a specific MMIO region, called
>> the Thread Interrupt Management Area (TIMA), four aligned pages, each
>> exposing a different view of the registers. First page (page address
>> ending in 0b00) gives access to the entire context and is reserved for
>> the ring 0 security monitor. The second (page address ending in 0b01)
>> is for the hypervisor, ring 1. The third (page address ending in 0b10)
>> is for the operating system, ring 2. The fourth (page address ending
>> in 0b11) is for user level, ring 3.
>>
>> The thread interrupt context is modeled with a XiveTCTX object
>> containing the values of the different exception registers. The TIMA
>> region is mapped at the same address for each CPU.
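
As a rough illustration of the four views described above, the page of a
TIMA access can be derived from its offset, and an operation defined for a
given page can also be allowed from any more privileged (lower numbered)
page. This is a standalone sketch. The DEMO_* names and helpers are
hypothetical, and the shift of 16 is taken from the TM_SHIFT define in
this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_TM_SHIFT   16      /* same value as TM_SHIFT in the patch */

/* Page numbers, most privileged first. */
enum {
    DEMO_PAGE_HW   = 0,
    DEMO_PAGE_HV   = 1,
    DEMO_PAGE_OS   = 2,
    DEMO_PAGE_USER = 3,
};

/* Extract the page number (0..3) from an offset in the TIMA region. */
static unsigned demo_tima_page(uint64_t offset)
{
    return (offset >> DEMO_TM_SHIFT) & 0x3;
}

/*
 * An operation defined for 'op_page' may be used from 'page' when
 * 'page' is at least as privileged, that is numerically lower or equal.
 */
static bool demo_page_allows(unsigned page, unsigned op_page)
{
    assert(page <= DEMO_PAGE_USER && op_page <= DEMO_PAGE_USER);
    return op_page >= page;
}
```

For instance, offset 0x20810 decodes to the OS page, and an operation
defined for the OS page is allowed from the HV page but not from the user
page.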
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/xive.h      |  36 +++
>>  include/hw/ppc/xive_regs.h |  82 +++++++
>>  hw/intc/xive.c             | 443 +++++++++++++++++++++++++++++++++++++
>>  3 files changed, 561 insertions(+)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 24301bf2076d..5987f26ddb98 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -238,4 +238,40 @@ typedef struct XiveENDSource {
>>  void xive_end_reset(XiveEND *end);
>>  void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>>  
>> +/*
>> + * XIVE Thread interrupt Management (TM) context
>> + */
>> +
>> +#define TYPE_XIVE_TCTX "xive-tctx"
>> +#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
>> +
>> +/*
>> + * XIVE Thread interrupt Management register rings :
>> + *
>> + *   QW-0  User       event-based exception state
>> + *   QW-1  O/S        OS context for priority management, interrupt acks
>> + *   QW-2  Pool       hypervisor context for virtual processor being dispatched
>> + *   QW-3  Physical   for the security monitor to manage the entire context
> 
> That last description is misleading, AIUI the hypervisor can and does
> make use of the physical ring as well as the pool ring.

Yes. The description is from the spec. I will rephrase.

> 
>> + */
>> +#define TM_RING_COUNT           4
>> +#define TM_RING_SIZE            0x10
>> +
>> +typedef struct XiveTCTX {
>> +    DeviceState parent_obj;
>> +
>> +    CPUState    *cs;
>> +    qemu_irq    output;
>> +
>> +    uint8_t     regs[TM_RING_COUNT * TM_RING_SIZE];
> 
> I'm a bit dubious about representing the state with a full buffer like
> this.  Isn't a fair bit of this space reserved or derived values which
> aren't backed by real state?

Under sPAPR, only the TM_QW1_OS ring is accessed, but TM_QW0_USER will
be as well once we support EBB.

When running under the PowerNV machine, all rings could be accessed.
Today only 2 and 3 are.

It seemed correct to expose all the registers in the thread interrupt
context model and to filter the accesses with the TIMA. It fits the HW
well.
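
To illustrate the filtering, here is a minimal standalone sketch of how a
per-byte access map can be turned into a mask for a load or a store. The
DEMO_* names and demo_access_mask() are hypothetical, and the bounds
assertions are an addition of the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Per-byte accessibility bits, as in the view tables of the patch. */
#define DEMO_ACCESS_WRITE  0x1
#define DEMO_ACCESS_READ   0x2

/*
 * Build a byte mask for an access of 'size' bytes at 'reg_offset'.
 * Byte 0 of the access maps to the most significant byte of the mask.
 */
static uint64_t demo_access_mask(const uint8_t *view, size_t view_len,
                                 size_t reg_offset, unsigned size,
                                 bool write)
{
    uint8_t want = write ? DEMO_ACCESS_WRITE : DEMO_ACCESS_READ;
    uint64_t mask = 0;
    unsigned i;

    assert(size >= 1 && size <= 8);
    assert(reg_offset + size <= view_len);

    for (i = 0; i < size; i++) {
        if (view[reg_offset + i] & want) {
            mask |= (uint64_t)0xff << (8 * (size - i - 1));
        }
    }
    return mask;
}
```

With the QW-1 row of the OS view (2, 3, 2, 2, 2, 2, 0, 2, ...), a 4-byte
load at offset 0 gets a full mask, while a 4-byte store only keeps the
CPPR byte.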

> 
>> +
>> +    XiveRouter  *xrtr;
> 
> What's this for?  AFAIK a TCTX isn't associated with a particular
> routing unit.

I should have added the pointer in patch 11 where it is used. It lets
the sPAPR XIVE controller model reset the OS CAM line with the VP
identifier.

The TCTX belongs to the IVPE XIVE subengine and, as the IVRE and IVPE are
modeled under the XiveRouter, it's not surprising to see this '*xrtr'
back pointer. But I agree we might not need it. Let's talk about it when
you reach patch 11.

> 
>> +} XiveTCTX;
>> +
>> +/*
>> + * XIVE Thread Interrupt Management Aera (TIMA)
> 
> Typo s/Aera/Area/

yep.

> 
>> + */
>> +extern const MemoryRegionOps xive_tm_ops;
>> +
>> +void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
>> +
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> index f97fb2b90bee..2e3d6cb507da 100644
>> --- a/include/hw/ppc/xive_regs.h
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -10,6 +10,88 @@
>>  #ifndef PPC_XIVE_REGS_H
>>  #define PPC_XIVE_REGS_H
>>  
>> +#define TM_SHIFT                16
>> +
>> +/* TM register offsets */
>> +#define TM_QW0_USER             0x000 /* All rings */
>> +#define TM_QW1_OS               0x010 /* Ring 0..2 */
>> +#define TM_QW2_HV_POOL          0x020 /* Ring 0..1 */
>> +#define TM_QW3_HV_PHYS          0x030 /* Ring 0..1 */
>> +
>> +/* Byte offsets inside a QW             QW0 QW1 QW2 QW3 */
>> +#define TM_NSR                  0x0  /*  +   +   -   +  */
>> +#define TM_CPPR                 0x1  /*  -   +   -   +  */
>> +#define TM_IPB                  0x2  /*  -   +   +   +  */
>> +#define TM_LSMFB                0x3  /*  -   +   +   +  */
>> +#define TM_ACK_CNT              0x4  /*  -   +   -   -  */
>> +#define TM_INC                  0x5  /*  -   +   -   +  */
>> +#define TM_AGE                  0x6  /*  -   +   -   +  */
>> +#define TM_PIPR                 0x7  /*  -   +   -   +  */
>> +
>> +#define TM_WORD0                0x0
>> +#define TM_WORD1                0x4
>> +
>> +/*
>> + * QW word 2 contains the valid bit at the top and other fields
>> + * depending on the QW.
>> + */
>> +#define TM_WORD2                0x8
>> +#define   TM_QW0W2_VU           PPC_BIT32(0)
>> +#define   TM_QW0W2_LOGIC_SERV   PPC_BITMASK32(1, 31) /* XX 2,31 ? */
>> +#define   TM_QW1W2_VO           PPC_BIT32(0)
>> +#define   TM_QW1W2_OS_CAM       PPC_BITMASK32(8, 31)
>> +#define   TM_QW2W2_VP           PPC_BIT32(0)
>> +#define   TM_QW2W2_POOL_CAM     PPC_BITMASK32(8, 31)
>> +#define   TM_QW3W2_VT           PPC_BIT32(0)
>> +#define   TM_QW3W2_LP           PPC_BIT32(6)
>> +#define   TM_QW3W2_LE           PPC_BIT32(7)
>> +#define   TM_QW3W2_T            PPC_BIT32(31)
>> +
>> +/*
>> + * In addition to normal loads to "peek" and writes (only when invalid)
>> + * using 4 and 8 bytes accesses, the above registers support these
>> + * "special" byte operations:
>> + *
>> + *   - Byte load from QW0[NSR] - User level NSR (EBB)
>> + *   - Byte store to QW0[NSR] - User level NSR (EBB)
>> + *   - Byte load/store to QW1[CPPR] and QW3[CPPR] - CPPR access
>> + *   - Byte load from QW3[TM_WORD2] - Read VT||00000||LP||LE on thrd 0
>> + *                                    otherwise VT||0000000
>> + *   - Byte store to QW3[TM_WORD2] - Set VT bit (and LP/LE if present)
>> + *
>> + * Then we have all these "special" CI ops at these offset that trigger
>> + * all sorts of side effects:
>> + */
>> +#define TM_SPC_ACK_EBB          0x800   /* Load8 ack EBB to reg*/
>> +#define TM_SPC_ACK_OS_REG       0x810   /* Load16 ack OS irq to reg */
>> +#define TM_SPC_PUSH_USR_CTX     0x808   /* Store32 Push/Validate user context */
>> +#define TM_SPC_PULL_USR_CTX     0x808   /* Load32 Pull/Invalidate user
>> +                                         * context */
>> +#define TM_SPC_SET_OS_PENDING   0x812   /* Store8 Set OS irq pending bit */
>> +#define TM_SPC_PULL_OS_CTX      0x818   /* Load32/Load64 Pull/Invalidate OS
>> +                                         * context to reg */
>> +#define TM_SPC_PULL_POOL_CTX    0x828   /* Load32/Load64 Pull/Invalidate Pool
>> +                                         * context to reg*/
>> +#define TM_SPC_ACK_HV_REG       0x830   /* Load16 ack HV irq to reg */
>> +#define TM_SPC_PULL_USR_CTX_OL  0xc08   /* Store8 Pull/Inval usr ctx to odd
>> +                                         * line */
>> +#define TM_SPC_ACK_OS_EL        0xc10   /* Store8 ack OS irq to even line */
>> +#define TM_SPC_ACK_HV_POOL_EL   0xc20   /* Store8 ack HV evt pool to even
>> +                                         * line */
>> +#define TM_SPC_ACK_HV_EL        0xc30   /* Store8 ack HV irq to even line */
>> +/* XXX more... */
>> +
>> +/* NSR fields for the various QW ack types */
>> +#define TM_QW0_NSR_EB           PPC_BIT8(0)
>> +#define TM_QW1_NSR_EO           PPC_BIT8(0)
>> +#define TM_QW3_NSR_HE           PPC_BITMASK8(0, 1)
>> +#define  TM_QW3_NSR_HE_NONE     0
>> +#define  TM_QW3_NSR_HE_POOL     1
>> +#define  TM_QW3_NSR_HE_PHYS     2
>> +#define  TM_QW3_NSR_HE_LSI      3
>> +#define TM_QW3_NSR_I            PPC_BIT8(2)
>> +#define TM_QW3_NSR_GRP_LVL      PPC_BITMASK8(3, 7)
>> +
>>  /* EAS (Event Assignment Structure)
>>   *
>>   * One per interrupt source. Targets an interrupt to a given Event
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 5a8882d47a98..4c6cb5d52975 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -15,6 +15,448 @@
>>  #include "sysemu/dma.h"
>>  #include "monitor/monitor.h"
>>  #include "hw/ppc/xive.h"
>> +#include "hw/ppc/xive_regs.h"
>> +
>> +/*
>> + * XIVE Thread Interrupt Management context
>> + */
>> +
>> +static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
>> +{
>> +    return 0;
>> +}
>> +
>> +static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>> +{
>> +    if (cppr > XIVE_PRIORITY_MAX) {
>> +        cppr = 0xff;
>> +    }
>> +
>> +    tctx->regs[ring + TM_CPPR] = cppr;
>> +}
>> +
>> +/*
>> + * XIVE Thread Interrupt Management Area (TIMA)
>> + *
>> + * This region gives access to the registers of the thread interrupt
>> + * management context. It is four pages wide, each page providing a
>> + * different view of the registers. The page with the lower offset is
>> + * the most privileged and gives access to the entire context.
>> + */
>> +
>> +#define XIVE_TM_HW_PAGE   0x0
>> +#define XIVE_TM_HV_PAGE   0x1
>> +#define XIVE_TM_OS_PAGE   0x2
>> +#define XIVE_TM_USER_PAGE 0x3
>> +
>> +/*
>> + * Define an access map for each page of the TIMA that we will use in
>> + * the memory region ops to filter values when doing loads and stores
>> + * of raw registers values
>> + *
>> + * Registers accessibility bits :
>> + *
>> + *    0x0 - no access
>> + *    0x1 - write only
>> + *    0x2 - read only
>> + *    0x3 - read/write
>> + */
>> +
>> +static const uint8_t xive_tm_hw_view[] = {
>> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-1 OS   */   3, 3, 3, 3,   3, 3, 0, 3,   3, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-2 HV   */   0, 0, 3, 3,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-3 HW   */   3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   3, 3, 3, 0,
> 
> Can we stick to the "Pool" / "Phys" names rather than inventing HV and
> HW.  XIVE already has too many names for things.  To clarify that's
> for the naming of the QWs, the view names are fine.

sure. I will fix the names.
 
>> +};
>> +
>> +static const uint8_t xive_tm_hv_view[] = {
>> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-1 OS   */   3, 3, 3, 3,   3, 3, 0, 3,   3, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-2 HV   */   0, 0, 3, 3,   0, 0, 0, 0,   0, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-3 HW   */   3, 3, 3, 3,   0, 3, 0, 3,   3, 0, 0, 3,   0, 0, 0, 0,
>> +};
>> +
>> +static const uint8_t xive_tm_os_view[] = {
>> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   3, 3, 3, 3,   0, 0, 0, 0,
>> +    /* QW-1 OS   */   2, 3, 2, 2,   2, 2, 0, 2,   0, 0, 0, 0,   0, 0, 0, 0,
>> +    /* QW-2 HV   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
>> +    /* QW-3 HW   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 3, 3, 0,
> 
> Are those bytes near the end of QW-3 really accessible in OS but not
> hypervisor view?

No. It's a copy/paste error. Thx.

>> +};
>> +
>> +static const uint8_t xive_tm_user_view[] = {
>> +    /* QW-0 User */   3, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
>> +    /* QW-1 OS   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
>> +    /* QW-2 HV   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
>> +    /* QW-3 HW   */   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,   0, 0, 0, 0,
>> +};
>> +
>> +/*
>> + * Overall TIMA access map for the thread interrupt management context
>> + * registers
>> + */
>> +static const uint8_t *xive_tm_views[] = {
>> +    [XIVE_TM_HW_PAGE]   = xive_tm_hw_view,
>> +    [XIVE_TM_HV_PAGE]   = xive_tm_hv_view,
>> +    [XIVE_TM_OS_PAGE]   = xive_tm_os_view,
>> +    [XIVE_TM_USER_PAGE] = xive_tm_user_view,
>> +};
>> +
>> +/*
>> + * Computes a register access mask for a given offset in the TIMA
>> + */
>> +static uint64_t xive_tm_mask(hwaddr offset, unsigned size, bool write)
>> +{
>> +    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
>> +    uint8_t reg_offset = offset & 0x3F;
>> +    uint8_t reg_mask = write ? 0x1 : 0x2;
>> +    uint64_t mask = 0x0;
>> +    int i;
>> +
>> +    for (i = 0; i < size; i++) {
>> +        if (xive_tm_views[page_offset][reg_offset + i] & reg_mask) {
>> +            mask |= (uint64_t) 0xff << (8 * (size - i - 1));
>> +        }
>> +    }
>> +
>> +    return mask;
>> +}
>> +
>> +static void xive_tm_raw_write(XiveTCTX *tctx, hwaddr offset, uint64_t value,
>> +                              unsigned size)
>> +{
>> +    uint8_t ring_offset = offset & 0x30;
>> +    uint8_t reg_offset = offset & 0x3F;
>> +    uint64_t mask = xive_tm_mask(offset, size, true);
>> +    int i;
>> +
>> +    /*
>> +     * Only 4 or 8 bytes stores are allowed and the User ring is
>> +     * excluded
>> +     */
>> +    if (size < 4 || !mask || ring_offset == TM_QW0_USER) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid write access at TIMA @%"
>> +                      HWADDR_PRIx"\n", offset);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Use the register offset for the raw values and filter out
>> +     * reserved values
>> +     */
>> +    for (i = 0; i < size; i++) {
>> +        uint8_t byte_mask = (mask >> (8 * (size - i - 1)));
>> +        if (byte_mask) {
>> +            tctx->regs[reg_offset + i] = (value >> (8 * (size - i - 1))) &
>> +                byte_mask;
>> +        }
>> +    }
>> +}
>> +
>> +static uint64_t xive_tm_raw_read(XiveTCTX *tctx, hwaddr offset, unsigned size)
>> +{
>> +    uint8_t ring_offset = offset & 0x30;
>> +    uint8_t reg_offset = offset & 0x3F;
>> +    uint64_t mask = xive_tm_mask(offset, size, false);
>> +    uint64_t ret;
>> +    int i;
>> +
>> +    /*
>> +     * Only 4 or 8 bytes loads are allowed and the User ring is
>> +     * excluded
>> +     */
>> +    if (size < 4 || !mask || ring_offset == TM_QW0_USER) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid read access at TIMA @%"
>> +                      HWADDR_PRIx"\n", offset);
>> +        return -1;
>> +    }
>> +
>> +    /* Use the register offset for the raw values */
>> +    ret = 0;
>> +    for (i = 0; i < size; i++) {
>> +        ret |= (uint64_t) tctx->regs[reg_offset + i] << (8 * (size - i - 1));
>> +    }
>> +
>> +    /* filter out reserved values */
>> +    return ret & mask;
>> +}
>> +
>> +/*
>> + * The TM context is mapped twice within each page. Stores and loads
>> + * to the first mapping below 2K write and read the specified values
>> + * without modification. The second mapping above 2K performs specific
>> + * state changes (side effects) in addition to setting/returning the
>> + * interrupt management area context of the processor thread.
>> + */
>> +static uint64_t xive_tm_ack_os_reg(XiveTCTX *tctx, hwaddr offset, unsigned size)
>> +{
>> +    return xive_tctx_accept(tctx, TM_QW1_OS);
>> +}
>> +
>> +static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
>> +                                uint64_t value, unsigned size)
>> +{
>> +    xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
>> +}
>> +
>> +/*
>> + * Define a mapping of "special" operations depending on the TIMA page
>> + * offset and the size of the operation.
>> + */
>> +typedef struct XiveTmOp {
>> +    uint8_t  page_offset;
>> +    uint32_t op_offset;
>> +    unsigned size;
>> +    void     (*write_handler)(XiveTCTX *tctx, hwaddr offset, uint64_t value,
>> +                              unsigned size);
>> +    uint64_t (*read_handler)(XiveTCTX *tctx, hwaddr offset, unsigned size);
>> +} XiveTmOp;
>> +
>> +static const XiveTmOp xive_tm_operations[] = {
>> +    /*
>> +     * MMIOs below 2K : raw values and special operations without side
>> +     * effects
>> +     */
>> +    { XIVE_TM_OS_PAGE, TM_QW1_OS + TM_CPPR,   1, xive_tm_set_os_cppr, NULL },
>> +
>> +    /* MMIOs above 2K : special operations with side effects */
>> +    { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
>> +};
>> +
>> +static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
>> +{
>> +    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
>> +    uint32_t op_offset = offset & 0xFFF;
>> +    int i;
>> +
>> +    for (i = 0; i < ARRAY_SIZE(xive_tm_operations); i++) {
>> +        const XiveTmOp *xto = &xive_tm_operations[i];
>> +
>> +        /* Accesses done from a more privileged TIMA page are allowed */
>> +        if (xto->page_offset >= page_offset &&
>> +            xto->op_offset == op_offset &&
>> +            xto->size == size &&
>> +            ((write && xto->write_handler) || (!write && xto->read_handler))) {
>> +            return xto;
>> +        }
>> +    }
>> +    return NULL;
>> +}
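
The matching rule in xive_tm_find_op() can be sketched in isolation as
below. This is only an illustration: the TM_SHIFT value and the page
numbers are assumptions (lower page number taken to mean more
privileged), and TmOpKey/tm_op_matches() are hypothetical names that do
not exist in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for illustration only; not taken from the patch */
#define TM_SHIFT            16   /* one TIMA view per 64K page */
#define XIVE_TM_OS_PAGE     0x2  /* lower page number = more privileged */
#define XIVE_TM_USER_PAGE   0x3

/* Hypothetical reduced form of XiveTmOp: only the lookup key */
typedef struct {
    uint8_t  page_offset;
    uint32_t op_offset;
    unsigned size;
} TmOpKey;

/*
 * Returns true when an access at 'offset' of 'size' bytes selects 'op'.
 * The page index comes from the bits above TM_SHIFT, the operation
 * offset from the low 12 bits. An operation defined for a given page is
 * also reachable from any page with a lower (more privileged) index.
 */
static bool tm_op_matches(const TmOpKey *op, uint64_t offset, unsigned size)
{
    uint8_t page_offset = (offset >> TM_SHIFT) & 0x3;
    uint32_t op_offset = offset & 0xFFF;

    return op->page_offset >= page_offset &&
           op->op_offset == op_offset &&
           op->size == size;
}
```

With an operation declared for the OS page, the same offset matches
from the OS page and from the more privileged pages, but not from the
USER page.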
>> +
>> +/*
>> + * TIMA MMIO handlers
>> + */
>> +static void xive_tm_write(void *opaque, hwaddr offset,
>> +                          uint64_t value, unsigned size)
>> +{
>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>> +    XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>> +    const XiveTmOp *xto;
>> +
>> +    /*
>> +     * TODO: check V bit in Q[0-3]W2, check PTER bit associated with CPU
>> +     */
>> +
>> +    /*
>> +     * First, check for special operations in the 2K region
>> +     */
>> +    if (offset & 0x800) {
>> +        xto = xive_tm_find_op(offset, size, true);
>> +        if (!xto) {
>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid write access at TIMA"
>> +                          "@%"HWADDR_PRIx"\n", offset);
>> +        } else {
>> +            xto->write_handler(tctx, offset, value, size);
>> +        }
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Then, check for special operations in the region below 2K.
>> +     */
>> +    xto = xive_tm_find_op(offset, size, true);
>> +    if (xto) {
>> +        xto->write_handler(tctx, offset, value, size);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Finish with raw access to the register values
>> +     */
>> +    xive_tm_raw_write(tctx, offset, value, size);
>> +}
>> +
>> +static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
>> +{
>> +    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>> +    XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>> +    const XiveTmOp *xto;
>> +
>> +    /*
>> +     * TODO: check V bit in Q[0-3]W2, check PTER bit associated with CPU
>> +     */
>> +
>> +    /*
>> +     * First, check for special operations in the 2K region
>> +     */
>> +    if (offset & 0x800) {
>> +        xto = xive_tm_find_op(offset, size, false);
>> +        if (!xto) {
>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid read access to TIMA"
>> +                          "@%"HWADDR_PRIx"\n", offset);
>> +            return -1;
>> +        }
>> +        return xto->read_handler(tctx, offset, size);
>> +    }
>> +
>> +    /*
>> +     * Then, check for special operations in the region below 2K.
>> +     */
>> +    xto = xive_tm_find_op(offset, size, false);
>> +    if (xto) {
>> +        return xto->read_handler(tctx, offset, size);
>> +    }
>> +
>> +    /*
>> +     * Finish with raw access to the register values
>> +     */
>> +    return xive_tm_raw_read(tctx, offset, size);
>> +}
>> +
>> +const MemoryRegionOps xive_tm_ops = {
>> +    .read = xive_tm_read,
>> +    .write = xive_tm_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static char *xive_tctx_ring_print(uint8_t *ring)
>> +{
>> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &ring[TM_WORD2]));
>> +
>> +    return g_strdup_printf("%02x   %02x  %02x    %02x   %02x  "
>> +                   "%02x  %02x   %02x  %08x",
>> +                   ring[TM_NSR], ring[TM_CPPR], ring[TM_IPB], ring[TM_LSMFB],
>> +                   ring[TM_ACK_CNT], ring[TM_INC], ring[TM_AGE], ring[TM_PIPR],
>> +                   w2);
>> +}
>> +
>> +static const struct {
>> +    uint8_t    qw;
>> +    const char *name;
>> +} xive_tctx_ring_infos[TM_RING_COUNT] = {
>> +    { TM_QW3_HV_PHYS, "HW"   },
>> +    { TM_QW2_HV_POOL, "HV"   },
>> +    { TM_QW1_OS,      "OS"   },
>> +    { TM_QW0_USER,    "USER" },
> 
> Likewise here if we can stick to PHYS and POOL rather than HW and HV.

I don't remember why I changed the names :/ I will fix.

> Also, the qw field takes exactly the values 0..3, why not just an
> array of names indexed by the ring number.

yes. There should be room for simplification. 
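
A minimal sketch of what that could look like, assuming the four rings
are laid out back to back with a 16-byte stride in regs[]. The stride,
the TM_RING_SIZE name and both helpers are assumptions made for the
example and are not part of the patch:

```c
#include <assert.h>
#include <stddef.h>

/* Assumed layout for illustration: four rings, 16 bytes each */
#define TM_RING_COUNT   4
#define TM_RING_SIZE    0x10

/* Names indexed directly by ring number, using PHYS/POOL as suggested */
static const char * const xive_tctx_ring_names[TM_RING_COUNT] = {
    "USER", "OS", "POOL", "PHYS",
};

/* Hypothetical helper: name of ring number 'ring' (0..3) */
static const char *xive_tctx_ring_name(unsigned ring)
{
    assert(ring < TM_RING_COUNT);
    return xive_tctx_ring_names[ring];
}

/* Hypothetical helper: byte offset of ring number 'ring' in regs[] */
static size_t xive_tctx_ring_offset(unsigned ring)
{
    assert(ring < TM_RING_COUNT);
    return (size_t)ring * TM_RING_SIZE;
}
```

The print loop would then iterate over the ring number and derive both
the name and the register offset from it, without the intermediate
{ qw, name } table.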

Thanks,

C.

>> +};
>> +
>> +void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
>> +{
>> +    int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
>> +    int i;
>> +
>> +    monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
>> +                   "  W2\n", cpu_index);
>> +
>> +    for (i = 0; i < TM_RING_COUNT; i++) {
>> +        char *s = xive_tctx_ring_print(&tctx->regs[xive_tctx_ring_infos[i].qw]);
>> +        monitor_printf(mon, "CPU[%04x]: %4s    %s\n", cpu_index,
>> +                       xive_tctx_ring_infos[i].name, s);
>> +        g_free(s);
>> +    }
>> +}
>> +
>> +static void xive_tctx_reset(void *dev)
>> +{
>> +    XiveTCTX *tctx = XIVE_TCTX(dev);
>> +
>> +    memset(tctx->regs, 0, sizeof(tctx->regs));
>> +
>> +    /* Set some defaults */
>> +    tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
>> +    tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
>> +    tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
>> +}
>> +
>> +static void xive_tctx_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveTCTX *tctx = XIVE_TCTX(dev);
>> +    PowerPCCPU *cpu;
>> +    CPUPPCState *env;
>> +    Object *obj;
>> +    Error *local_err = NULL;
>> +
>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>> +    if (!obj) {
>> +        error_propagate(errp, local_err);
>> +        error_prepend(errp, "required link 'xive' not found: ");
>> +        return;
>> +    }
>> +    tctx->xrtr = XIVE_ROUTER(obj);
>> +
>> +    obj = object_property_get_link(OBJECT(dev), "cpu", &local_err);
>> +    if (!obj) {
>> +        error_propagate(errp, local_err);
>> +        error_prepend(errp, "required link 'cpu' not found: ");
>> +        return;
>> +    }
>> +
>> +    cpu = POWERPC_CPU(obj);
>> +    tctx->cs = CPU(obj);
>> +
>> +    env = &cpu->env;
>> +    switch (PPC_INPUT(env)) {
>> +    case PPC_FLAGS_INPUT_POWER7:
>> +        tctx->output = env->irq_inputs[POWER7_INPUT_INT];
>> +        break;
>> +
>> +    default:
>> +        error_setg(errp, "XIVE interrupt controller does not support "
>> +                   "this CPU bus model");
>> +        return;
>> +    }
>> +
>> +    qemu_register_reset(xive_tctx_reset, dev);
>> +}
>> +
>> +static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
>> +{
>> +    qemu_unregister_reset(xive_tctx_reset, dev);
>> +}
>> +
>> +static const VMStateDescription vmstate_xive_tctx = {
>> +    .name = TYPE_XIVE_TCTX,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_BUFFER(regs, XiveTCTX),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static void xive_tctx_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->realize = xive_tctx_realize;
>> +    dc->unrealize = xive_tctx_unrealize;
>> +    dc->desc = "XIVE Interrupt Thread Context";
>> +    dc->vmsd = &vmstate_xive_tctx;
>> +}
>> +
>> +static const TypeInfo xive_tctx_info = {
>> +    .name          = TYPE_XIVE_TCTX,
>> +    .parent        = TYPE_DEVICE,
>> +    .instance_size = sizeof(XiveTCTX),
>> +    .class_init    = xive_tctx_class_init,
>> +};
>>  
>>  /*
>>   * XIVE ESB helpers
>> @@ -876,6 +1318,7 @@ static void xive_register_types(void)
>>      type_register_static(&xive_fabric_info);
>>      type_register_static(&xive_router_info);
>>      type_register_static(&xive_end_source_info);
>> +    type_register_static(&xive_tctx_info);
>>  }
>>  
>>  type_init(xive_register_types)
> 


* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-23 13:28         ` Cédric Le Goater
@ 2018-11-26  5:39           ` David Gibson
  2018-11-26 11:20             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-26  5:39 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 23, 2018 at 02:28:35PM +0100, Cédric Le Goater wrote:
> 
> >>>> +/*
> >>>> + * Returns whether the event notification should be forwarded.
> >>>> + */
> >>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t
> >>>> srcno)
> >>>
> >>> What exactly "trigger" means isn't entirely obvious for an LSI.  Might
> >>> be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
> >>
> >> This is called only when the interrupt is asserted. So it is a 
> >> simplified LSI trigger depending only on the 'P' bit.
> > 
> > Yes, I see that.  But the result is that while the MSI logic is
> > encapsulated in the MSI trigger function, this leaves the LSI logic
> > split across the trigger function and set_irq() itself. I think it 
> > would be better to have assert and deassert helpers instead, which
> > handle both the trigger/notification and also the updating of the
> > ASSERTED bit.
> 
> Something like the xive_source_set_irq_lsi() below ?

Uh.. not exactly what I had in mind, but close enough.

[snip]
+/*
> + * Returns whether the event notification should be forwarded.
> + */
>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>  {
> +    bool notify;
> +
>      assert(srcno < xsrc->nr_irqs);
>  
> -    return xive_esb_trigger(&xsrc->status[srcno]);
> +    notify = xive_esb_trigger(&xsrc->status[srcno]);
> +
> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&

Except that this block can go, since this function is no longer called
for LSIs.

> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
> +        qemu_log_mask(LOG_GUEST_ERROR,
> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
> +    }
> +
> +    return notify;
>  }
>  
>  /*
> @@ -103,9 +127,22 @@ static bool xive_source_esb_trigger(Xive
>   */
>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>  {
> +    bool notify;
> +
>      assert(srcno < xsrc->nr_irqs);
>  
> -    return xive_esb_eoi(&xsrc->status[srcno]);
> +    notify = xive_esb_eoi(&xsrc->status[srcno]);
> +
> +    /* LSI sources do not set the Q bit but they can still be
> +     * asserted, in which case we should forward a new event
> +     * notification
> +     */
> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> +        bool level = xsrc->status[srcno] & XIVE_STATUS_ASSERTED;
> +        notify = xive_source_set_irq_lsi(xsrc, srcno, level);
> +    }
> +
> +    return notify;
>  }
>  
>  /*
> @@ -268,8 +305,12 @@ static void xive_source_set_irq(void *op
>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>      bool notify = false;
>  
> -    if (val) {
> -        notify = xive_source_esb_trigger(xsrc, srcno);
> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> +        notify = xive_source_set_irq_lsi(xsrc, srcno, val);
> +    } else {
> +        if (val) {
> +            notify = xive_source_esb_trigger(xsrc, srcno);
> +        }
>      }
>  
>      /* Forward the source event notification for routing */
> @@ -289,9 +330,11 @@ void xive_source_pic_print_info(XiveSour
>              continue;
>          }
>  
> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
>      }
>  }
>  
> @@ -299,6 +342,8 @@ static void xive_source_reset(DeviceStat
>  {
>      XiveSource *xsrc = XIVE_SOURCE(dev);
>  
> +    /* Do not clear the LSI bitmap */
> +
>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>  }
> @@ -324,6 +369,7 @@ static void xive_source_realize(DeviceSt
>                                       xsrc->nr_irqs);
>  
>      xsrc->status = g_malloc0(xsrc->nr_irqs);
> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>  
>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>                            &xive_source_esb_ops, xsrc, "xive.esb",
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-23 10:28         ` Cédric Le Goater
@ 2018-11-26  5:44           ` David Gibson
  2018-11-26  9:39             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-26  5:44 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

On Fri, Nov 23, 2018 at 11:28:24AM +0100, Cédric Le Goater wrote:
> On 11/23/18 2:10 AM, David Gibson wrote:
> > On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
> >> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> >>>
> >>> Sorry, didn't think of this in my first reply.
> >>>
> >>> 1) Does the hardware ever actually write back to the EAS?  I know it
> >>> does for the END, but it's not clear why it would need to for the
> >>> EAS.  If not, we don't need the setter.
> >>
> >> Nope, though the PAPR model will via hcalls
> > 
> > Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
> > metal details.  Since the hcall knows it's PAPR it can just update the
> > backing information for the EAS directly, and no need for an
> > abstracted hook.
> 
> Indeed, the first versions of the XIVE patchset did not use such hooks, 
> but when discussed we said we wanted abstract methods for the router 
> to validate the overall XIVE model, which is useful for PowerNV. 
> 
> We can change again and have the hcalls get/set directly in the EAT
> and ENDT. It would certainly simplify the sPAPR model.

I think that's the better approach.

> >>> 2) The signatures are a bit odd here.  For the setter, a value would
> >>> make more sense than a (XiveEAS *), since it's just a word.  For the getter
> >>> you could return the EAS value directly rather than using a pointer -
> >>> there's already a valid bit in the EAS so you can construct a value
> >>> with that cleared if the lisn is out of bounds.
> >>
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-23  7:28         ` Cédric Le Goater
@ 2018-11-26  5:54           ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-26  5:54 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 23, 2018 at 08:28:42AM +0100, Cédric Le Goater wrote:
> On 11/23/18 5:36 AM, David Gibson wrote:
> > On Thu, Nov 22, 2018 at 10:58:56PM +0100, Cédric Le Goater wrote:
> >> On 11/22/18 6:13 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
> >>>> The Event Notification Descriptor also contains two Event State
> >>>> Buffers providing further coalescing of interrupts, one for the
> >>>> notification event (ESn) and one for the escalation events (ESe). A
> >>>> MMIO page is assigned for each to control the EOI through loads
> >>>> only. Stores are not allowed.
> >>>>
> >>>> The END ESBs are modeled through an object resembling the 'XiveSource'.
> >>>> It is stateless as the END state bits are backed into the XiveEND
> >>>> structure under the XiveRouter and the MMIO accesses follow the same
> >>>> rules as for the standard source ESBs.
> >>>>
> >>>> END ESBs are not supported by the Linux drivers, neither on OPAL nor on
> >>>> sPAPR. Nevertheless, the model provides a means to study the question
> >>>> in the future and validates the XIVE model a bit more.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  include/hw/ppc/xive.h |  20 ++++++
> >>>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
> >>>>  2 files changed, 178 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >>>> index ce62aaf28343..24301bf2076d 100644
> >>>> --- a/include/hw/ppc/xive.h
> >>>> +++ b/include/hw/ppc/xive.h
> >>>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>>>                          XiveEND *end);
> >>>>  
> >>>> +/*
> >>>> + * XIVE END ESBs
> >>>> + */
> >>>> +
> >>>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
> >>>> +#define XIVE_END_SOURCE(obj) \
> >>>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> >>>
> >>> Is there a particular reason to make this a full QOM object, rather
> >>> than just embedding it in the XiveRouter?
> >>
> >> yes, it should probably be under the XiveRouter you are right because
> >> there is a direct link with the ENDT which is in the XiveRouter. 
> >>
> >> But if I remove the chip_id field from the XiveRouter, it becomes a QOM
> >> interface. something to ponder.
> > 
> > Huh?  I really don't understand what you're saying here.  What does
> > chip_id have to do with anything?
> 
> I am quoting a comment of yours :
> 
> 	> +/*
> 	> + * XIVE Router
> 	> + */
> 	> +
> 	> +typedef struct XiveRouter {
> 	> +    SysBusDevice    parent;
> 	> +
> 	> +    uint32_t        chip_id;
> 
> 	I don't think this belongs in the base class.  The PowerNV specific
> 	variants will need it, but it doesn't make sense for the PAPR version.
> 
> 
> If we remove 'chip_id' from XiveRouter, it can become a QOM interface 
> without state, like the XiveFabric is.

Hm, not really.  At this stage the object doesn't have any state, but
at least the pnv variants will have state for the registers which
point it to the EAST and ENDT, so a "real" object still makes more
sense than an interface.  If it were an interface it's not clear what
real object it would be attached to.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model
  2018-11-23  0:31       ` David Gibson
  2018-11-23  8:21         ` Cédric Le Goater
@ 2018-11-26  8:14         ` Cédric Le Goater
  1 sibling, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-26  8:14 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

>>>> + */
>>>> +
>>>> +#ifndef PPC_XIVE_H
>>>> +#define PPC_XIVE_H
>>>> +
>>>> +#include "hw/sysbus.h"
>>>
>>> So, I'm a bit dubious about making the XiveSource a SysBus device -
>>> I'm concerned it won't play well with tying it into the other devices
>>> like PHB that "own" it in real hardware.
>>
>> It does but I can take a look at changing it to a DeviceState. The 
>> reset handlers might be a concern.
> 
> As a "non bus" device, I think you'd need to register your own reset
> handler rather than just setting dc->reset.  Otherwise, I think that
> should work.

I removed from XIVE the SysBus dependencies and indeed it's better 
not to rely on the default reset and mapping behavior of sysbus. 

I am addressing your comments in a WIP v6 branch on github.   

C.


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-26  5:44           ` David Gibson
@ 2018-11-26  9:39             ` Cédric Le Goater
  2018-11-27  0:11               ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-26  9:39 UTC (permalink / raw)
  To: David Gibson; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

On 11/26/18 6:44 AM, David Gibson wrote:
> On Fri, Nov 23, 2018 at 11:28:24AM +0100, Cédric Le Goater wrote:
>> On 11/23/18 2:10 AM, David Gibson wrote:
>>> On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
>>>> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
>>>>>
>>>>> Sorry, didn't think of this in my first reply.
>>>>>
>>>>> 1) Does the hardware ever actually write back to the EAS?  I know it
>>>>> does for the END, but it's not clear why it would need to for the
>>>>> EAS.  If not, we don't need the setter.
>>>>
>>>> Nope, though the PAPR model will via hcalls
>>>
>>> Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
>>> metal details.  Since the hcall knows it's PAPR it can just update the
>>> backing information for the EAS directly, and no need for an
>>> abstracted hook.
>>
>> Indeed, the first versions of the XIVE patchset did not use such hooks, 
>> but when discussed we said we wanted abstract methods for the router 
>> to validate the overall XIVE model, which is useful for PowerNV. 
>>
>> We can change again and have the hcalls get/set directly in the EAT
>> and ENDT. It would certainly simplify the sPAPR model.
> 
> I think that's the better approach.

ok. let's keep that in mind for  : 

 [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
 [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation

which are using the XiveRouter methods to access the controller EAT 
and ENDT. I thought that was good practice to validate the model but 
we can use direct sPAPR table accessors or none at all.


I think these prereq patches could be merged now :

 [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ
 [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine
 [PATCH v5 14/36] spapr: modify the irq backend 'init' method

This one also :

 [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS

Thanks,

C. 


* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-26  5:39           ` David Gibson
@ 2018-11-26 11:20             ` Cédric Le Goater
  2018-11-26 23:48               ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-26 11:20 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/26/18 6:39 AM, David Gibson wrote:
> On Fri, Nov 23, 2018 at 02:28:35PM +0100, Cédric Le Goater wrote:
>>
>>>>>> +/*
>>>>>> + * Returns whether the event notification should be forwarded.
>>>>>> + */
>>>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t
>>>>>> srcno)
>>>>>
>>>>> What exactly "trigger" means isn't entirely obvious for an LSI.  Might
>>>>> be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
>>>>
>>>> This is called only when the interrupt is asserted. So it is a 
>>>> simplified LSI trigger depending only on the 'P' bit.
>>>
>>> Yes, I see that.  But the result is that while the MSI logic is
>>> encapsulated in the MSI trigger function, this leaves the LSI logic
>>> split across the trigger function and set_irq() itself. I think it 
>>> would be better to have assert and deassert helpers instead, which
>>> handle both the trigger/notification and also the updating of the
>>> ASSERTED bit.
>>
>> Something like the xive_source_set_irq_lsi() below ?
> 
> Uh.. not exactly what I had in mind, but close enough.
> 
> [snip]
> +/*
>> + * Returns whether the event notification should be forwarded.
>> + */
>>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>>  {
>> +    bool notify;
>> +
>>      assert(srcno < xsrc->nr_irqs);
>>  
>> -    return xive_esb_trigger(&xsrc->status[srcno]);
>> +    notify = xive_esb_trigger(&xsrc->status[srcno]);
>> +
>> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
> 
> Except that this block can go, since this function is no longer called
> for LSIs.

It still can be through the ESB MMIOs, if the guest does a load on the 
trigger page. 

C.

 
> 
>> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
>> +        qemu_log_mask(LOG_GUEST_ERROR,
>> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
>> +    }
>> +
>> +    return notify;
>>  }
>>  
>>  /*
>> @@ -103,9 +127,22 @@ static bool xive_source_esb_trigger(Xive
>>   */
>>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>>  {
>> +    bool notify;
>> +
>>      assert(srcno < xsrc->nr_irqs);
>>  
>> -    return xive_esb_eoi(&xsrc->status[srcno]);
>> +    notify = xive_esb_eoi(&xsrc->status[srcno]);
>> +
>> +    /* LSI sources do not set the Q bit but they can still be
>> +     * asserted, in which case we should forward a new event
>> +     * notification
>> +     */
>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>> +        bool level = xsrc->status[srcno] & XIVE_STATUS_ASSERTED;
>> +        notify = xive_source_set_irq_lsi(xsrc, srcno, level);
>> +    }
>> +
>> +    return notify;
>>  }
>>  
>>  /*
>> @@ -268,8 +305,12 @@ static void xive_source_set_irq(void *op
>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>>      bool notify = false;
>>  
>> -    if (val) {
>> -        notify = xive_source_esb_trigger(xsrc, srcno);
>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>> +        notify = xive_source_set_irq_lsi(xsrc, srcno, val);
>> +    } else {
>> +        if (val) {
>> +            notify = xive_source_esb_trigger(xsrc, srcno);
>> +        }
>>      }
>>  
>>      /* Forward the source event notification for routing */
>> @@ -289,9 +330,11 @@ void xive_source_pic_print_info(XiveSour
>>              continue;
>>          }
>>  
>> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
>> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
>> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
>>      }
>>  }
>>  
>> @@ -299,6 +342,8 @@ static void xive_source_reset(DeviceStat
>>  {
>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>  
>> +    /* Do not clear the LSI bitmap */
>> +
>>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
>>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>>  }
>> @@ -324,6 +369,7 @@ static void xive_source_realize(DeviceSt
>>                                       xsrc->nr_irqs);
>>  
>>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>>  
>>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>                            &xive_source_esb_ops, xsrc, "xive.esb",
>>
> 


* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-26 11:20             ` Cédric Le Goater
@ 2018-11-26 23:48               ` David Gibson
  2018-11-27  7:30                 ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-26 23:48 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Mon, Nov 26, 2018 at 12:20:19PM +0100, Cédric Le Goater wrote:
> On 11/26/18 6:39 AM, David Gibson wrote:
> > On Fri, Nov 23, 2018 at 02:28:35PM +0100, Cédric Le Goater wrote:
> >>
> >>>>>> +/*
> >>>>>> + * Returns whether the event notification should be forwarded.
> >>>>>> + */
> >>>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t
> >>>>>> srcno)
> >>>>>
> >>>>> What exactly "trigger" means isn't entirely obvious for an LSI.  Might
> >>>>> be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
> >>>>
> >>>> This is called only when the interrupt is asserted. So it is a 
> >>>> simplified LSI trigger depending only on the 'P' bit.
> >>>
> >>> Yes, I see that.  But the result is that while the MSI logic is
> >>> encapsulated in the MSI trigger function, this leaves the LSI logic
> >>> split across the trigger function and set_irq() itself. I think it 
> >>> would be better to have assert and deassert helpers instead, which
> >>> handle both the trigger/notification and also the updating of the
> >>> ASSERTED bit.
> >>
> >> Something like the xive_source_set_irq_lsi() below ?
> > 
> > Uh.. not exactly what I had in mind, but close enough.
> > 
> > [snip]
> > +/*
> >> + * Returns whether the event notification should be forwarded.
> >> + */
> >>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
> >>  {
> >> +    bool notify;
> >> +
> >>      assert(srcno < xsrc->nr_irqs);
> >>  
> >> -    return xive_esb_trigger(&xsrc->status[srcno]);
> >> +    notify = xive_esb_trigger(&xsrc->status[srcno]);
> >> +
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
> > 
> > Except that this block can go, since this function is no longer called
> > for LSIs.
> 
> It still can be through the ESB MMIOs, if the guest does a load on the 
> trigger page.

Oh, good point.  That makes me rethink all my comments on this matter.

In that case I think your original code was fine, except that I'd
prefer to see the setting of the ASSERTED bit inside the trigger
function, instead of in the set_irq() caller.
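
Roughly along these lines, as a self-contained sketch rather than the
real code. The bit encodings are assumptions for the example (PQ in the
low two bits of the status byte, ASSERTED as a separate flag), and
lsi_trigger()/lsi_deassert() are hypothetical stand-ins operating on a
single status byte:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encodings for illustration only */
#define XIVE_ESB_VAL_P        0x2
#define XIVE_ESB_VAL_Q        0x1
#define XIVE_ESB_RESET        0x0
#define XIVE_ESB_PENDING      XIVE_ESB_VAL_P
#define XIVE_STATUS_ASSERTED  0x4

/*
 * Hypothetical stand-in for the LSI trigger: it records the asserted
 * level itself, then sets P only if the source was in the RESET state.
 * Returns whether the event notification should be forwarded.
 */
static bool lsi_trigger(uint8_t *status)
{
    uint8_t old_pq = *status & (XIVE_ESB_VAL_P | XIVE_ESB_VAL_Q);

    *status |= XIVE_STATUS_ASSERTED;

    if (old_pq == XIVE_ESB_RESET) {
        *status = (*status & ~0x3) | XIVE_ESB_PENDING;
        return true;
    }
    return false;
}

/* Hypothetical deassert: only drops the recorded level, PQ is untouched */
static void lsi_deassert(uint8_t *status)
{
    *status &= ~XIVE_STATUS_ASSERTED;
}
```

set_irq() would then only pick between the two based on the level, and
all of the LSI state handling stays in one place.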


> 
> C.
> 
>  
> > 
> >> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR,
> >> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
> >> +    }
> >> +
> >> +    return notify;
> >>  }
> >>  
> >>  /*
> >> @@ -103,9 +127,22 @@ static bool xive_source_esb_trigger(Xive
> >>   */
> >>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
> >>  {
> >> +    bool notify;
> >> +
> >>      assert(srcno < xsrc->nr_irqs);
> >>  
> >> -    return xive_esb_eoi(&xsrc->status[srcno]);
> >> +    notify = xive_esb_eoi(&xsrc->status[srcno]);
> >> +
> >> +    /* LSI sources do not set the Q bit but they can still be
> >> +     * asserted, in which case we should forward a new event
> >> +     * notification
> >> +     */
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >> +        bool level = xsrc->status[srcno] & XIVE_STATUS_ASSERTED;
> >> +        notify = xive_source_set_irq_lsi(xsrc, srcno, level);
> >> +    }
> >> +
> >> +    return notify;
> >>  }
> >>  
> >>  /*
> >> @@ -268,8 +305,12 @@ static void xive_source_set_irq(void *op
> >>      XiveSource *xsrc = XIVE_SOURCE(opaque);
> >>      bool notify = false;
> >>  
> >> -    if (val) {
> >> -        notify = xive_source_esb_trigger(xsrc, srcno);
> >> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
> >> +        notify = xive_source_set_irq_lsi(xsrc, srcno, val);
> >> +    } else {
> >> +        if (val) {
> >> +            notify = xive_source_esb_trigger(xsrc, srcno);
> >> +        }
> >>      }
> >>  
> >>      /* Forward the source event notification for routing */
> >> @@ -289,9 +330,11 @@ void xive_source_pic_print_info(XiveSour
> >>              continue;
> >>          }
> >>  
> >> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
> >> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
> >> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
> >>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
> >> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
> >> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
> >> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
> >>      }
> >>  }
> >>  
> >> @@ -299,6 +342,8 @@ static void xive_source_reset(DeviceStat
> >>  {
> >>      XiveSource *xsrc = XIVE_SOURCE(dev);
> >>  
> >> +    /* Do not clear the LSI bitmap */
> >> +
> >>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
> >>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
> >>  }
> >> @@ -324,6 +369,7 @@ static void xive_source_realize(DeviceSt
> >>                                       xsrc->nr_irqs);
> >>  
> >>      xsrc->status = g_malloc0(xsrc->nr_irqs);
> >> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
> >>  
> >>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >>                            &xive_source_esb_ops, xsrc, "xive.esb",
> >>
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-26  9:39             ` Cédric Le Goater
@ 2018-11-27  0:11               ` David Gibson
  2018-11-27  7:30                 ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-27  0:11 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2958 bytes --]

On Mon, Nov 26, 2018 at 10:39:44AM +0100, Cédric Le Goater wrote:
> On 11/26/18 6:44 AM, David Gibson wrote:
> > On Fri, Nov 23, 2018 at 11:28:24AM +0100, Cédric Le Goater wrote:
> >> On 11/23/18 2:10 AM, David Gibson wrote:
> >>> On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
> >>>> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> >>>>>
> >>>>> Sorry, didn't think of this in my first reply.
> >>>>>
> >>>>> 1) Does the hardware ever actually write back to the EAS?  I know it
> >>>>> does for the END, but it's not clear why it would need to for the
> >>>>> EAS.  If not, we don't need the setter.
> >>>>
> >>>> Nope, though the PAPR model will via hcalls
> >>>
> >>> Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
> >>> metal details.  Since the hcall knows it's PAPR it can just update the
> >>> backing information for the EAS directly, and no need for an
> >>> abstracted hook.
> >>
> >> Indeed, the first versions of the XIVE patchset did not use such hooks, 
> >> but when discussed we said we wanted abstract methods for the router 
> >> to validate the overall XIVE model, which is useful for PowerNV. 
> >>
> >> We can change again and have the hcalls get/set directly in the EAT
> >> and ENDT. It would certainly simplify the sPAPR model.
> > 
> > I think that's the better approach.
> 
> ok. let's keep that in mind for  : 
> 
>  [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
>  [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation
> 
> which are using the XiveRouter methods to access the controller EAT 
> and ENDT. I thought that was good practice to validate the model but 
> we can use direct sPAPR table accessors or none at all.

Ok.  Consistency is good as a general rule, but I don't think it makes
sense to force the EAT and the ENDT into the same model.  The EAT is
pure configuration, whereas the ENDT has both configuration and
status.  Or to look at it another way, the EAT is purely software
controlled, whereas the ENDT is at least partially hardware
controlled.

(I realize that gets a bit fuzzy when considering PAPR, but I think
from the point of view of the XIVE model it makes sense to treat the
PAPR hypervisor logic as "software", even though it's "hardware" from the
guest point of view).
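
To make the distinction concrete, here is a rough sketch (not code from
the series) of a router whose EAT is read-only while only the ENDT has
a write-back path.  All names (DemoEAS, DemoEND, DemoTables, demo_route)
and the field layouts are hypothetical and much simpler than the real
XIVE structures.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified table entries; not the real XIVE layouts. */
typedef struct DemoEAS {
    bool     valid;
    uint32_t end_idx;   /* configuration only: which END to target */
} DemoEAS;

typedef struct DemoEND {
    bool     enabled;   /* configuration */
    uint32_t qindex;    /* status: advanced for each delivered event */
} DemoEND;

typedef struct DemoTables {
    DemoEAS eat[4];
    DemoEND endt[2];
} DemoTables;

/* EAT: getter only, the routing logic never writes it back. */
static bool demo_get_eas(const DemoTables *t, uint32_t lisn, DemoEAS *eas)
{
    if (lisn >= sizeof(t->eat) / sizeof(t->eat[0])) {
        return false;
    }
    *eas = t->eat[lisn];
    return true;
}

/* ENDT: getter and setter, since routing updates the END status. */
static bool demo_get_end(const DemoTables *t, uint32_t idx, DemoEND *end)
{
    if (idx >= sizeof(t->endt) / sizeof(t->endt[0])) {
        return false;
    }
    *end = t->endt[idx];
    return true;
}

static bool demo_set_end(DemoTables *t, uint32_t idx, const DemoEND *end)
{
    if (idx >= sizeof(t->endt) / sizeof(t->endt[0])) {
        return false;
    }
    t->endt[idx] = *end;
    return true;
}

/* Route one event: read the EAS, then update the END through the setter. */
static bool demo_route(DemoTables *t, uint32_t lisn)
{
    DemoEAS eas;
    DemoEND end;

    assert(t != NULL);

    if (!demo_get_eas(t, lisn, &eas) || !eas.valid) {
        return false;
    }
    if (!demo_get_end(t, eas.end_idx, &end) || !end.enabled) {
        return false;
    }
    end.qindex++;
    return demo_set_end(t, eas.end_idx, &end);
}
```

In this sketch a PAPR hcall would simply fill t->eat[] directly, as
"software", without going through any router hook.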

> 
> 
> I think these prereq patches could be merged now :
> 
>  [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ
>  [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine
>  [PATCH v5 14/36] spapr: modify the irq backend 'init' method
> 
> This one also :
> 
>  [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS
> 
> Thanks,
> 
> C. 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-23  8:06         ` Cédric Le Goater
@ 2018-11-27  1:54           ` David Gibson
  2018-11-27  8:45             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-27  1:54 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 3838 bytes --]

On Fri, Nov 23, 2018 at 09:06:07AM +0100, Cédric Le Goater wrote:
> On 11/23/18 4:50 AM, David Gibson wrote:
> > On Thu, Nov 22, 2018 at 08:53:00AM +0100, Cédric Le Goater wrote:
> >> On 11/22/18 5:11 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater
> wrote:
[snip]
> >>> So as far as I can see so far, the XiveFabric interface will
> >>> essentially have to be implemented on the router object, so I'm not
> >>> seeing much point to having the interface rather than just a direct
> >>> call on the router object.  But I haven't read the whole series yet,
> >>> so maybe I'm missing something.
> >>
> >> The PSIHB and PHB4 models are using it but they are not in the series.
> >>
> >> I can send the PSIHB patch in the next version if you like, it's the 
> >> patch right after PnvXive. It's attached below for the moment. Look at 
> >> pnv_psi_notify().
> > 
> > Hrm, I see.  This seems like a really convoluted way of achieving what
> > you need here.  We want to abstract exactly how the source delivers
> > notifies, 
> 
> on sPAPR, I agree that the forwarding of event notification could be a 
> simple XiveRouter call but the XiveRouter covers both machines :/
> 
> On PowerNV, HW uses MMIOs to forward events and only the device knows 
> about the IRQ number offset in the global IRQ number space and the 
> notification port to use for the MMIO store. A PowerNV XIVE source 
> would forward the event notification to a piece of logic which sends 
> a PowerBUS event notification message. How it reaches the XIVE IC is
> beyond QEMU as it would mean modeling the PowerBUS. 
> 
> > but doing it with an interface on some object that's not necessarily
> > either the source or the router seems odd.  
> There is no direct link between the device owning the source and the 
> XIVE controller, they could be on the same Power chip but the routing 
> could be done by some other chips. This scenario is covered btw.
> 
> See it as a connector object.
> 
> > At the very least the names need to change (of both interface and
> > property for the target object).
> 
> I am fine with renaming it. With the above explanations, if they are 
> clear enough, how do you see them?

TBH, I didn't find the info above particularly illuminating.  However,
I think perusing the code has finally gotten my head around the model
(sorry it's taken so long).  I think two things were confusing me.

1) The name had me thinking in terms of the XicsFabric, but the
function here is totally different.

2) I was thinking of the XiveSource as handling all source-side irq
related logic, but I guess its real function is a bit more limited.
As I now understand it, it's only really handling the ESB and
immediately surrounding logic - the "owning" device (e.g. PHB or PSI)
is responsible for the connection "up the stack" as it were.

So, I'm ok with the model.  Just to verify that my understanding is
correct, can you confirm my reasoning below:

  * For PowerNV, we'd generally expect the notify function to be
    implemented by the "owning" device.  For the XIVE internal source,
    that would be the XiveRouter itself, immediately triggering the
    right EAS.  For the PHB and PSI irq sources, that will be code in the
    PHB/PSI which performs the MMIO to poke a router.

  * For PAPR, for simplicity, we'd expect to wire all sources direct
    to a single system-wide router object.

I definitely think it needs a name change though.  "XiveNotify"
perhaps?  And the property to configure it on the XiveSource, maybe
"notify" or "notify_via".
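
A rough sketch of that wiring, for illustration only: the source holds
a notify hook, PAPR points it straight at the router, and a PowerNV
owning device interposes to add its IRQ offset.  Every name here
(DemoNotifyFn, DemoSource, DemoDevice, DemoRouter) is hypothetical, and
the direct function call in demo_device_notify() merely stands in for
the MMIO store real hardware would do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical notifier hook: "forward event srcno up the stack". */
typedef void (*DemoNotifyFn)(void *opaque, uint32_t srcno);

typedef struct DemoRouter {
    uint32_t last_lisn;   /* last global IRQ number seen by the router */
    unsigned nr_events;
} DemoRouter;

typedef struct DemoSource {
    DemoNotifyFn notify;          /* the "notify" property */
    void        *notify_opaque;   /* router (PAPR) or owning device (PowerNV) */
} DemoSource;

/* Owning device on the PowerNV side, e.g. a PSI or PHB model. */
typedef struct DemoDevice {
    uint32_t    irq_base;   /* offset in the global IRQ number space */
    DemoRouter *router;     /* stands in for the MMIO notification port */
} DemoDevice;

/* Router end of the chain: record the event for routing. */
static void demo_router_notify(void *opaque, uint32_t lisn)
{
    DemoRouter *r = opaque;

    r->last_lisn = lisn;
    r->nr_events++;
}

/* Device end: only the device knows its offset and notification port. */
static void demo_device_notify(void *opaque, uint32_t srcno)
{
    DemoDevice *dev = opaque;

    /* Real hardware would perform an MMIO store here. */
    demo_router_notify(dev->router, dev->irq_base + srcno);
}

/* Called by the source once the ESB logic decides to forward an event. */
static void demo_source_fire(DemoSource *src, uint32_t srcno)
{
    assert(src->notify != NULL);
    src->notify(src->notify_opaque, srcno);
}
```

For PAPR, every DemoSource would be set up with demo_router_notify and
the single system-wide router as opaque; for PowerNV, with
demo_device_notify and the owning device.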

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context
  2018-11-25 20:35     ` Cédric Le Goater
@ 2018-11-27  5:07       ` David Gibson
  2018-11-27 12:47         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-27  5:07 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 5344 bytes --]

On Sun, Nov 25, 2018 at 09:35:08PM +0100, Cédric Le Goater wrote:
> On 11/23/18 6:08 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:00AM +0100, Cédric Le Goater wrote:
> >> Each POWER9 processor chip has a XIVE presenter that can generate four
> >> different exceptions to its threads:
> >>
> >>   - hypervisor exception,
> >>   - O/S exception
> >>   - Event-Based Branch (EBB)
> >>   - msgsnd (doorbell).
> >>
> >> Each exception has a state independent from the others called a Thread
> >> Interrupt Management context. This context is a set of registers which
> >> lets the thread handle priority management and interrupt acknowledgment
> >> among other things. The most important ones being :
> >>
> >>   - Interrupt Priority Register  (PIPR)
> >>   - Interrupt Pending Buffer     (IPB)
> >>   - Current Processor Priority   (CPPR)
> >>   - Notification Source Register (NSR)
> >>
> >> These registers are accessible through a specific MMIO region, called
> >> the Thread Interrupt Management Area (TIMA), four aligned pages, each
> >> exposing a different view of the registers. First page (page address
> >> ending in 0b00) gives access to the entire context and is reserved for
> >> the ring 0 security monitor. The second (page address ending in 0b01)
> >> is for the hypervisor, ring 1. The third (page address ending in 0b10)
> >> is for the operating system, ring 2. The fourth (page address ending
> >> in 0b11) is for user level, ring 3.
> >>
> >> The thread interrupt context is modeled with a XiveTCTX object
> >> containing the values of the different exception registers. The TIMA
> >> region is mapped at the same address for each CPU.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/xive.h      |  36 +++
> >>  include/hw/ppc/xive_regs.h |  82 +++++++
> >>  hw/intc/xive.c             | 443 +++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 561 insertions(+)
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 24301bf2076d..5987f26ddb98 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -238,4 +238,40 @@ typedef struct XiveENDSource {
> >>  void xive_end_reset(XiveEND *end);
> >>  void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
> >>  
> >> +/*
> >> + * XIVE Thread interrupt Management (TM) context
> >> + */
> >> +
> >> +#define TYPE_XIVE_TCTX "xive-tctx"
> >> +#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
> >> +
> >> +/*
> >> + * XIVE Thread interrupt Management register rings :
> >> + *
> >> + *   QW-0  User       event-based exception state
> >> + *   QW-1  O/S        OS context for priority management, interrupt acks
> >> + *   QW-2  Pool       hypervisor context for virtual processor being dispatched
> >> + *   QW-3  Physical   for the security monitor to manage the entire context
> > 
> > That last description is misleading, AIUI the hypervisor can and does
> > make use of the physical ring as well as the pool ring.
> 
> yes. The description is from the spec. I will rephrase. 
> 
> > 
> >> + */
> >> +#define TM_RING_COUNT           4
> >> +#define TM_RING_SIZE            0x10
> >> +
> >> +typedef struct XiveTCTX {
> >> +    DeviceState parent_obj;
> >> +
> >> +    CPUState    *cs;
> >> +    qemu_irq    output;
> >> +
> >> +    uint8_t     regs[TM_RING_COUNT * TM_RING_SIZE];
> > 
> > I'm a bit dubious about representing the state with a full buffer like
> > this.  Isn't a fair bit of this space reserved or derived values which
> > aren't backed by real state?
> 
> Under sPAPR only the TM_QW1_OS ring is accessed but the TM_QW0_USER 
> will also be when we support EBB.
> 
> When running under the PowerNV machine, all rings could be accessed.
> Today only 2 and 3 are.

It's not the fact that all rings are exposed that I'm talking about.
What I mean is that some of the logical register address space isn't
actually backed by modifiable registers.  For example the HW CAM
bits are hardwired to a value which can be derived from elsewhere.
Likewise the "highest pending priority" register or whatever it's
called is AFAICT calculated on read from the CPPR and IPB.

So the thing I'm questioning is having actual storage bytes associated
with logical registers which are actually derived from other state.
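
As a sketch of the alternative (hypothetical names, not code from the
series): keep only the IPB and CPPR as storage and compute the pending
priority on read.  The IPB layout used here, one bit per priority with
the most significant bit meaning priority 0, and the "signal when the
pending priority is more favored than the CPPR" rule are assumptions
made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-ring state: only what is really backed by storage. */
typedef struct DemoRing {
    uint8_t ipb;    /* pending priorities, one bit each (MSB = priority 0) */
    uint8_t cppr;   /* current processor priority */
} DemoRing;

/* Derived on read: most favored (lowest) pending priority, 0xff if none. */
static uint8_t demo_pipr(const DemoRing *ring)
{
    for (uint8_t prio = 0; prio < 8; prio++) {
        if (ring->ipb & (0x80 >> prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Record a pending event at the given priority. */
static void demo_set_pending(DemoRing *ring, uint8_t prio)
{
    assert(prio < 8);
    ring->ipb |= (uint8_t)(0x80 >> prio);
}

/* Raise the exception only if something more favored than CPPR is pending. */
static bool demo_should_signal(const DemoRing *ring)
{
    return demo_pipr(ring) < ring->cppr;
}
```

A TIMA load of the PIPR offset would then call demo_pipr() instead of
reading a byte out of a register array.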

> It seemed correct to expose all registers under the thread interrupt
> context model and filter the accesses with the TIMA. It fits the HW
> well.

> >> +
> >> +    XiveRouter  *xrtr;
> > 
> > What's this for?  AFAIK a TCTX isn't associated with a particular
> > routing unit.
> 
> I should have added the pointer in patch 11 where it is used. This is 
> to let the sPAPR XIVE controller model reset the OS CAM line with the 
> VP identifier.

> The TCTX belong to the IVPE XIVE subengine and as the IVRE and IVPE are 
> modeled under the XiveRouter, it's not surprising to see this '*xrtr' 
> back pointer. But I agree we might not need it.  Let's talk about it when 
> you reach patch 11.

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-27  0:11               ` David Gibson
@ 2018-11-27  7:30                 ` Cédric Le Goater
  2018-11-27 22:56                   ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-27  7:30 UTC (permalink / raw)
  To: David Gibson; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

On 11/27/18 1:11 AM, David Gibson wrote:
> On Mon, Nov 26, 2018 at 10:39:44AM +0100, Cédric Le Goater wrote:
>> On 11/26/18 6:44 AM, David Gibson wrote:
>>> On Fri, Nov 23, 2018 at 11:28:24AM +0100, Cédric Le Goater wrote:
>>>> On 11/23/18 2:10 AM, David Gibson wrote:
>>>>> On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
>>>>>> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
>>>>>>>
>>>>>>> Sorry, didn't think of this in my first reply.
>>>>>>>
>>>>>>> 1) Does the hardware ever actually write back to the EAS?  I know it
>>>>>>> does for the END, but it's not clear why it would need to for the
>>>>>>> EAS.  If not, we don't need the setter.
>>>>>>
>>>>>> Nope, though the PAPR model will via hcalls
>>>>>
>>>>> Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
>>>>> metal details.  Since the hcall knows it's PAPR it can just update the
>>>>> backing information for the EAS directly, and no need for an
>>>>> abstracted hook.
>>>>
>>>> Indeed, the first versions of the XIVE patchset did not use such hooks, 
>>>> but when discussed we said we wanted abstract methods for the router 
>>>> to validate the overall XIVE model, which is useful for PowerNV. 
>>>>
>>>> We can change again and have the hcalls get/set directly in the EAT
>>>> and ENDT. It would certainly simplify the sPAPR model.
>>>
>>> I think that's the better approach.
>>
>> ok. let's keep that in mind for  : 
>>
>>  [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
>>  [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation
>>
>> which are using the XiveRouter methods to access the controller EAT 
>> and ENDT. I thought that was good practice to validate the model but 
>> we can use direct sPAPR table accessors or none at all.
> 
> Ok.  Consistency is good as a general rule, but I don't think it makes
> sense to force the EAT and the ENDT into the same model.  

What do you mean by model ? the QEMU machine IC model ?

> The EAT is
> pure configuration, whereas the ENDT has both configuration and
> status.  Or to look at it another way, the EAT is purely software
> controlled, whereas the ENDT is at least partially hardware
> controlled.

yes but the EAT and the ENDT are XIVE internal tables of the same XIVE 
sub-engine, the IVRE, Interrupt Virtualization Routing Engine, formerly 
known as VC, for Virtualization Controller. The EAS is just an entry 
point to the ENDT. I don't see why we would use different models for 
them.

> (I realize that gets a bit fuzzy when considering PAPR, but I think
> from the point of view of the XIVE model it makes sense to treat the
> PAPR hypervisor logic as "software", even though it's "hardware" from the
> guest point of view).

I agree with that, but the resulting code is too ugly in the hcalls. Tell me 
when you reach patch 11, which links the machine IC model sPAPRXive to 
the generic XiveRouter and also check patch 16 introducing the hcalls, 
which update the XIVE internal tables.

Thanks,

C. 

 
>>
>>
>> I think these prereq patches could be merged now :
>>
>>  [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ
>>  [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine
>>  [PATCH v5 14/36] spapr: modify the irq backend 'init' method
>>
>> This one also :
>>
>>  [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS
>>
>> Thanks,
>>
>> C. 
>>
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources
  2018-11-26 23:48               ` David Gibson
@ 2018-11-27  7:30                 ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-27  7:30 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/27/18 12:48 AM, David Gibson wrote:
> On Mon, Nov 26, 2018 at 12:20:19PM +0100, Cédric Le Goater wrote:
>> On 11/26/18 6:39 AM, David Gibson wrote:
>>> On Fri, Nov 23, 2018 at 02:28:35PM +0100, Cédric Le Goater wrote:
>>>>
>>>>>>>> +/*
>>>>>>>> + * Returns whether the event notification should be forwarded.
>>>>>>>> + */
>>>>>>>> +static bool xive_source_lsi_trigger(XiveSource *xsrc, uint32_t
>>>>>>>> srcno)
>>>>>>>
>>>>>>> What exactly "trigger" means isn't entirely obvious for an LSI.  Might
>>>>>>> be clearer to have "lsi_assert" and "lsi_deassert" helpers instead.
>>>>>>
>>>>>> This is called only when the interrupt is asserted. So it is a 
>>>>>> simplified LSI trigger depending only on the 'P' bit.
>>>>>
>>>>> Yes, I see that.  But the result is that while the MSI logic is
>>>>> encapsulated in the MSI trigger function, this leaves the LSI logic
>>>>> split across the trigger function and set_irq() itself. I think it 
>>>>> would be better to have assert and deassert helpers instead, which
>>>>> handle both the trigger/notification and also the updating of the
>>>>> ASSERTED bit.
>>>>
>>>> Something like the xive_source_set_irq_lsi() below ?
>>>
>>> Uh.. not exactly what I had in mind, but close enough.
>>>
>>> [snip]
>>> +/*
>>>> + * Returns whether the event notification should be forwarded.
>>>> + */
>>>>  static bool xive_source_esb_trigger(XiveSource *xsrc, uint32_t srcno)
>>>>  {
>>>> +    bool notify;
>>>> +
>>>>      assert(srcno < xsrc->nr_irqs);
>>>>  
>>>> -    return xive_esb_trigger(&xsrc->status[srcno]);
>>>> +    notify = xive_esb_trigger(&xsrc->status[srcno]);
>>>> +
>>>> +    if (xive_source_irq_is_lsi(xsrc, srcno) &&
>>>
>>> Except that this block can go, since this function is no longer called
>>> for LSIs.
>>
>> It still can be through the ESB MMIOs, if the guest does a load on the 
>> trigger page.
> 
> Oh, good point.  That makes me rethink all my comments on this matter.
> 
> In that case I think your original code was fine, except that I'd
> prefer to see the setting of the ASSERTED bit inside the trigger
> function, instead of in the set_irq() caller.

ok. I will change that.
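
For illustration, a minimal sketch of what that could look like, with
the ASSERTED bit handled inside the trigger helper.  This is not the
code of the next version; the names are hypothetical and the bit
position chosen for DEMO_ASSERTED is an assumption.  The P and Q
positions follow the "0b01 is ints off" comment in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_ESB_P     0x2   /* PQ = 0b01 is "ints off", so Q is bit 0 */
#define DEMO_ESB_Q     0x1
#define DEMO_ASSERTED  0x4   /* assumed position of the extra status bit */

/*
 * Assert the line and record it in the status byte.  Returns whether
 * the event notification should be forwarded.
 */
static bool demo_lsi_trigger(uint8_t *status)
{
    assert(status != NULL);

    *status |= DEMO_ASSERTED;

    /* Only PQ = 00 presents; pending (P set) or off (01) does not. */
    if (*status & (DEMO_ESB_P | DEMO_ESB_Q)) {
        return false;
    }
    *status |= DEMO_ESB_P;   /* LSIs never set the Q bit */
    return true;
}

/* The device lowered the line. */
static void demo_lsi_deassert(uint8_t *status)
{
    *status &= (uint8_t)~DEMO_ASSERTED;
}

/* EOI: clear P, and present again if the line is still asserted. */
static bool demo_lsi_eoi(uint8_t *status)
{
    *status &= (uint8_t)~DEMO_ESB_P;

    if (*status & DEMO_ASSERTED) {
        return demo_lsi_trigger(status);
    }
    return false;
}
```

With this shape, set_irq() only has to pick between the trigger and
deassert helpers based on 'val'.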

Thanks,

C. 


> 
> 
>>
>> C.
>>
>>  
>>>
>>>> +        xive_source_esb_get(xsrc, srcno) == XIVE_ESB_QUEUED) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR,
>>>> +                      "XIVE: queued an event on LSI IRQ %d\n", srcno);
>>>> +    }
>>>> +
>>>> +    return notify;
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -103,9 +127,22 @@ static bool xive_source_esb_trigger(Xive
>>>>   */
>>>>  static bool xive_source_esb_eoi(XiveSource *xsrc, uint32_t srcno)
>>>>  {
>>>> +    bool notify;
>>>> +
>>>>      assert(srcno < xsrc->nr_irqs);
>>>>  
>>>> -    return xive_esb_eoi(&xsrc->status[srcno]);
>>>> +    notify = xive_esb_eoi(&xsrc->status[srcno]);
>>>> +
>>>> +    /* LSI sources do not set the Q bit but they can still be
>>>> +     * asserted, in which case we should forward a new event
>>>> +     * notification
>>>> +     */
>>>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>> +        bool level = xsrc->status[srcno] & XIVE_STATUS_ASSERTED;
>>>> +        notify = xive_source_set_irq_lsi(xsrc, srcno, level);
>>>> +    }
>>>> +
>>>> +    return notify;
>>>>  }
>>>>  
>>>>  /*
>>>> @@ -268,8 +305,12 @@ static void xive_source_set_irq(void *op
>>>>      XiveSource *xsrc = XIVE_SOURCE(opaque);
>>>>      bool notify = false;
>>>>  
>>>> -    if (val) {
>>>> -        notify = xive_source_esb_trigger(xsrc, srcno);
>>>> +    if (xive_source_irq_is_lsi(xsrc, srcno)) {
>>>> +        notify = xive_source_set_irq_lsi(xsrc, srcno, val);
>>>> +    } else {
>>>> +        if (val) {
>>>> +            notify = xive_source_esb_trigger(xsrc, srcno);
>>>> +        }
>>>>      }
>>>>  
>>>>      /* Forward the source event notification for routing */
>>>> @@ -289,9 +330,11 @@ void xive_source_pic_print_info(XiveSour
>>>>              continue;
>>>>          }
>>>>  
>>>> -        monitor_printf(mon, "  %08x %c%c\n", i + offset,
>>>> +        monitor_printf(mon, "  %08x %s %c%c%c\n", i + offset,
>>>> +                       xive_source_irq_is_lsi(xsrc, i) ? "LSI" : "MSI",
>>>>                         pq & XIVE_ESB_VAL_P ? 'P' : '-',
>>>> -                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-');
>>>> +                       pq & XIVE_ESB_VAL_Q ? 'Q' : '-',
>>>> +                       xsrc->status[i] & XIVE_STATUS_ASSERTED ? 'A' : ' ');
>>>>      }
>>>>  }
>>>>  
>>>> @@ -299,6 +342,8 @@ static void xive_source_reset(DeviceStat
>>>>  {
>>>>      XiveSource *xsrc = XIVE_SOURCE(dev);
>>>>  
>>>> +    /* Do not clear the LSI bitmap */
>>>> +
>>>>      /* PQs are initialized to 0b01 which corresponds to "ints off" */
>>>>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>>>>  }
>>>> @@ -324,6 +369,7 @@ static void xive_source_realize(DeviceSt
>>>>                                       xsrc->nr_irqs);
>>>>  
>>>>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>>>> +    xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>>>>  
>>>>      memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>>>                            &xive_source_esb_ops, xsrc, "xive.esb",
>>>>
>>>
>>
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-27  1:54           ` David Gibson
@ 2018-11-27  8:45             ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-27  8:45 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/27/18 2:54 AM, David Gibson wrote:
> On Fri, Nov 23, 2018 at 09:06:07AM +0100, Cédric Le Goater wrote:
>> On 11/23/18 4:50 AM, David Gibson wrote:
>>> On Thu, Nov 22, 2018 at 08:53:00AM +0100, Cédric Le Goater wrote:
>>>> On 11/22/18 5:11 AM, David Gibson wrote:
>>>>> On Fri, Nov 16, 2018 at 11:56:57AM +0100, Cédric Le Goater
>> wrote:
> [snip]
>>>>> So as far as I can see so far, the XiveFabric interface will
>>>>> essentially have to be implemented on the router object, so I'm not
>>>>> seeing much point to having the interface rather than just a direct
>>>>> call on the router object.  But I haven't read the whole series yet,
>>>>> so maybe I'm missing something.
>>>>
>>>> The PSIHB and PHB4 models are using it but they are not in the series.
>>>>
>>>> I can send the PSIHB patch in the next version if you like, it's the 
>>>> patch right after PnvXive. It's attached below for the moment. Look at 
>>>> pnv_psi_notify().
>>>
>>> Hrm, I see.  This seems like a really convoluted way of achieving what
>>> you need here.  We want to abstract exactly how the source delivers
>>> notifies, 
>>
>> on sPAPR, I agree that the forwarding of event notification could be a 
>> simple XiveRouter call but the XiveRouter covers both machines :/
>>
>> On PowerNV, HW uses MMIOs to forward events and only the device knows 
>> about the IRQ number offset in the global IRQ number space and the 
>> notification port to use for the MMIO store. A PowerNV XIVE source 
>> would forward the event notification to a piece of logic which sends 
>> a PowerBUS event notification message. How it reaches the XIVE IC is
>> beyond QEMU as it would mean modeling the PowerBUS. 
>>
>>> but doing it with an interface on some object that's not necessarily
>>> either the source or the router seems odd.  
>> There is no direct link between the device owning the source and the 
>> XIVE controller, they could be on the same Power chip but the routing 
>> could be done by some other chips. This scenario is covered btw.
>>
>> See it as a connector object.
>>
>>> At the very least the names need to change (of both interface and
>>> property for the target object).
>>
>> I am fine with renaming it. With the above explanations, if they are 
>> clear enough, how do you see them?
> 
> TBH, I didn't find the info above particularly illuminating.

This is really a PowerNV need. So, I can reshuffle the code and make 
a direct link between the XiveSource and the XiveRouter models for sPAPR.
It's a small change, and I can reintroduce XiveFabric (or whatever name we
choose) later, before the PowerNV XIVE and PSIHB models.

>  However,
> I think perusing the code has finally gotten my head around the model
> (sorry it's taken so long).  I think two things were confusing me.
> 
> 1) The name had me thinking in terms of the XicsFabric, but the
> function here is totally different.

Yes. I agree. It's not the same thing at all.

> 2) I was thinking of the XiveSource as handling all source-side irq
> related logic, but I guess its real function is a bit more limited.
> As I now understand it, it's only really handling the ESB and
> immediately surrounding logic - the "owning" device (e.g. PHB or PSI)
> is responsible for the connection "up the stack" as it were.

yes.

> So, I'm ok with the model.  Just to verify that my understanding is
> correct, can you confirm my reasoning below:
> 
>   * For PowerNV, we'd generally expect the notify function to be
>     implemented by the "owning" device.  For the XIVE internal source,
>     that would be the XiveRouter itself, immediately triggering the
>     right EAS.  For the PHB and PSI irq sources, that will be code in the
>     PHB/PSI which performs the MMIO to poke a router.

exactly.
 
>   * For PAPR, for simplicity, we'd expect to wire all sources direct
>     to a single system-wide router object.

yes. 

> 
> I definitely think it needs a name change though.  "XiveNotify"
> perhaps?  

Yes. XiveNotifier maybe? To use a noun and not a verb.

> And the property to configure it on the XiveSource, maybe "notify" 
> or "notify_via".

What the XIVE engines are doing is forwarding a trigger event to the 
next engine that can possibly do the routing to the final target. 
  
In the specs, the verbs 'trigger', 'forward', 'notify', 'route' are 
commonly used. I think 'notify' is the most frequent.

ok for 'notify'.

Thanks,

C.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context
  2018-11-27  5:07       ` David Gibson
@ 2018-11-27 12:47         ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-27 12:47 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/27/18 6:07 AM, David Gibson wrote:
> On Sun, Nov 25, 2018 at 09:35:08PM +0100, Cédric Le Goater wrote:
>> On 11/23/18 6:08 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:57:00AM +0100, Cédric Le Goater wrote:
>>>> Each POWER9 processor chip has a XIVE presenter that can generate four
>>>> different exceptions to its threads:
>>>>
>>>>   - hypervisor exception,
>>>>   - O/S exception
>>>>   - Event-Based Branch (EBB)
>>>>   - msgsnd (doorbell).
>>>>
>>>> Each exception has a state independent from the others called a Thread
>>>> Interrupt Management context. This context is a set of registers which
>>>> lets the thread handle priority management and interrupt acknowledgment
>>>> among other things. The most important ones being :
>>>>
>>>>   - Interrupt Priority Register  (PIPR)
>>>>   - Interrupt Pending Buffer     (IPB)
>>>>   - Current Processor Priority   (CPPR)
>>>>   - Notification Source Register (NSR)
>>>>
>>>> These registers are accessible through a specific MMIO region, called
>>>> the Thread Interrupt Management Area (TIMA), four aligned pages, each
>>>> exposing a different view of the registers. First page (page address
>>>> ending in 0b00) gives access to the entire context and is reserved for
>>>> the ring 0 security monitor. The second (page address ending in 0b01)
>>>> is for the hypervisor, ring 1. The third (page address ending in 0b10)
>>>> is for the operating system, ring 2. The fourth (page address ending
>>>> in 0b11) is for user level, ring 3.
>>>>
>>>> The thread interrupt context is modeled with a XiveTCTX object
>>>> containing the values of the different exception registers. The TIMA
>>>> region is mapped at the same address for each CPU.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/xive.h      |  36 +++
>>>>  include/hw/ppc/xive_regs.h |  82 +++++++
>>>>  hw/intc/xive.c             | 443 +++++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 561 insertions(+)
>>>>
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index 24301bf2076d..5987f26ddb98 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -238,4 +238,40 @@ typedef struct XiveENDSource {
>>>>  void xive_end_reset(XiveEND *end);
>>>>  void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>>>>  
>>>> +/*
>>>> + * XIVE Thread interrupt Management (TM) context
>>>> + */
>>>> +
>>>> +#define TYPE_XIVE_TCTX "xive-tctx"
>>>> +#define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
>>>> +
>>>> +/*
>>>> + * XIVE Thread interrupt Management register rings :
>>>> + *
>>>> + *   QW-0  User       event-based exception state
>>>> + *   QW-1  O/S        OS context for priority management, interrupt acks
>>>> + *   QW-2  Pool       hypervisor context for virtual processor being dispatched
>>>> + *   QW-3  Physical   for the security monitor to manage the entire context
>>>
>>> That last description is misleading, AIUI the hypervisor can and does
>>> make use of the physical ring as well as the pool ring.
>>
>> yes. The description is from the spec. I will rephrase. 
>>
>>>
>>>> + */
>>>> +#define TM_RING_COUNT           4
>>>> +#define TM_RING_SIZE            0x10
>>>> +
>>>> +typedef struct XiveTCTX {
>>>> +    DeviceState parent_obj;
>>>> +
>>>> +    CPUState    *cs;
>>>> +    qemu_irq    output;
>>>> +
>>>> +    uint8_t     regs[TM_RING_COUNT * TM_RING_SIZE];
>>>
>>> I'm a bit dubious about representing the state with a full buffer like
>>> this.  Isn't a fair bit of this space reserved or derived values which
>>> aren't backed by real state?
>>
>> Under sPAPR only the TM_QW1_OS ring is accessed but the TM_QW0_USER 
>> will also be when we support EBB.
>>
>> When running under the PowerNV machine, all rings could be accessed.
>> Today only 2 and 3 are.
> 
> It's not the fact that all rings are exposed that I'm talking about.
> What I mean is that some of the logical register address space isn't
> actually backed by modifiable registers.  For example the HW CAM
> bits are hardwired to a value which can be derived from elsewhere.

yes indeed.

> Likewise the "highest pending priority" register or whatever it's
> called is AFAICT calculated on read from the CPPR and IPB.

yes. the PIPR is computed from the IPB. (patch 10)

> So the thing I'm questioning is having actual storage bytes associated
> with logical registers which are actually derived from other state.

having storage for word2 and word3 on all rings is questionable I agree,
but I think the eight (NSR CPPR IPB LSMFB ACK# INC AGE PIPR) registers 
are good to have explicitly for the model.  

It fits well the HW and all the TIMA accessors which would be much
more complex if we only represented the interesting bits we need
to transfer. 

We also use the word0 and word1 to store the KVM state (os ring, maybe 
user one day) directly pulled from the KVM VCPU structure. word2 is also 
collected to print out the KVM VP identifier in the QEMU monitor.  

So, I think these four quad-words are useful and their storage is 
relatively small compared to the EAT (8bytes * nr_irqs) and ENDT 
(32bytes * 8 * nr_cpus).
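
As a rough back-of-the-envelope sketch of that comparison (the helper
names and the example machine size are only illustrative assumptions,
not code from the patchset), the per-table storage can be computed as:

```c
#include <stddef.h>

/* Ring geometry as defined in this patch. */
#define TM_RING_COUNT 4
#define TM_RING_SIZE  0x10

/* Hypothetical helper: bytes of TIMA register storage for all CPUs. */
static size_t tctx_bytes(unsigned nr_cpus)
{
    return (size_t)nr_cpus * TM_RING_COUNT * TM_RING_SIZE;
}

/* Hypothetical helper: EAT storage, one 8-byte EAS per interrupt. */
static size_t eat_bytes(unsigned nr_irqs)
{
    return (size_t)nr_irqs * 8;
}

/* Hypothetical helper: ENDT storage, 8 priorities of 32 bytes per CPU. */
static size_t endt_bytes(unsigned nr_cpus)
{
    return (size_t)nr_cpus * 8 * 32;
}
```

For an assumed guest with 8 CPUs and 4096 interrupt numbers, this gives
512 bytes for the thread contexts against 2048 bytes for the ENDT and
32768 bytes for the EAT.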

Thanks,

C. 

>> It seemed correct to expose all registers under the thread interrupt
>> context model and filter the accesses with the TIMA. It fits the HW
>> well.
> 
>>>> +
>>>> +    XiveRouter  *xrtr;
>>>
>>> What's this for?  AFAIK a TCTX isn't associated with a particular
>>> routing unit.
>>
>> I should have added the pointer in patch 11 where it is used. This is 
>> to let the sPAPR XIVE controller model reset the OS CAM line with the 
>> VP identifier.
> 
>> The TCTX belongs to the IVPE XIVE subengine and as the IVRE and IVPE are 
>> modeled under the XiveRouter, it's not surprising to see this '*xrtr' 
>> back pointer. But I agree we might not need it.  Let's talk about it when 
>> you reach patch 11.
> 
> Ok.
> 


* Re: [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model
  2018-11-27  7:30                 ` Cédric Le Goater
@ 2018-11-27 22:56                   ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-27 22:56 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

On Tue, Nov 27, 2018 at 08:30:15AM +0100, Cédric Le Goater wrote:
> On 11/27/18 1:11 AM, David Gibson wrote:
> > On Mon, Nov 26, 2018 at 10:39:44AM +0100, Cédric Le Goater wrote:
> >> On 11/26/18 6:44 AM, David Gibson wrote:
> >>> On Fri, Nov 23, 2018 at 11:28:24AM +0100, Cédric Le Goater wrote:
> >>>> On 11/23/18 2:10 AM, David Gibson wrote:
> >>>>> On Thu, Nov 22, 2018 at 05:50:07PM +1100, Benjamin Herrenschmidt wrote:
> >>>>>> On Thu, 2018-11-22 at 15:44 +1100, David Gibson wrote:
> >>>>>>>
> >>>>>>> Sorry, didn't think of this in my first reply.
> >>>>>>>
> >>>>>>> 1) Does the hardware ever actually write back to the EAS?  I know it
> >>>>>>> does for the END, but it's not clear why it would need to for the
> >>>>>>> EAS.  If not, we don't need the setter.
> >>>>>>
> >>>>>> Nope, though the PAPR model will via hcalls
> >>>>>
> >>>>> Right, but AIUI the set_eas hook is about abstracting PAPR vs bare
> >>>>> metal details.  Since the hcall knows it's PAPR it can just update the
> >>>>> backing information for the EAS directly, and no need for an
> >>>>> abstracted hook.
> >>>>
> >>>> Indeed, the first versions of the XIVE patchset did not use such hooks, 
> >>>> but when discussed we said we wanted abstract methods for the router 
> >>>> to validate the overall XIVE model, which is useful for PowerNV. 
> >>>>
> >>>> We can change again and have the hcalls get/set directly in the EAT
> >>>> and ENDT. It would certainly simplify the sPAPR model.
> >>>
> >>> I think that's the better approach.
> >>
> >> ok. let's keep that in mind for  : 
> >>
> >>  [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
> >>  [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation
> >>
> >> which are using the XiveRouter methods to access the controller EAT 
> >> and ENDT. I thought that was good practice to validate the model but 
> >> we can use direct sPAPR table accessors or none at all.
> > 
> > Ok.  Consistency is good as a general rule, but I don't think it makes
> > sense to force the EAT and the ENDT into the same model.  
> 
> What do you mean by model ? the QEMU machine IC model ?

Oh, sorry, nothing that formal.  By "model" I just meant the same
pattern of accessor hooks.  They certainly do belong in the same qemu
object.

> > The EAT is
> > pure configuration, whereas the ENDT has both configuration and
> > status.  Or to look at it another way, the EAT is purely software
> > controlled, whereas the ENDT is at least partially hardware
> > controlled.
> 
> yes but the EAT and the ENDT are XIVE internal tables of the same XIVE 
> sub-engine, the IVRE, Interrupt Virtualization Routing Engine, formerly 
> known as VC, for Virtualization Controller. the EAS is just an entry 
> point to the ENDT. I don't see why we would use different models for 
> them.
> 
> > (I realize that gets a bit fuzzy when considering PAPR, but I think
> > from the point of view of the XIVE model it makes sense to treat the
> > PAPR hypervisor logic as "software", even though it's "hardware" from the
> > guest point of view).
> 
> That I agree but the resulting code is too ugly in the hcalls. Tell me 
> when you reach patch 11, which links the machine IC model sPAPRXive to 
> the generic XiveRouter and also check patch 16 introducing the hcalls, 
> which update the XIVE internal tables.
> 
> Thanks,
> 
> C. 
> 
>  
> >>
> >>
> >> I think these prereq patches could be merged now :
> >>
> >>  [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ
> >>  [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine
> >>  [PATCH v5 14/36] spapr: modify the irq backend 'init' method
> >>
> >> This one also :
> >>
> >>  [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS
> >>
> >> Thanks,
> >>
> >> C. 
> >>
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter Cédric Le Goater
@ 2018-11-27 23:49   ` David Gibson
  2018-11-28  2:34     ` Benjamin Herrenschmidt
  2018-11-28 10:59     ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-27 23:49 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
> The last sub-engine of the XIVE architecture is the Interrupt
> Virtualization Presentation Engine (IVPE). On HW, they share elements,
> the Power Bus interface (CQ), the routing table descriptors, and they
> can be combined in the same HW logic. We do the same in QEMU and
> combine both engines in the XiveRouter for simplicity.

Ok, I'm not entirely convinced combining the IVPE and IVRE into a
single object is a good idea, but we can probably discuss that once
I've read further.

> When the IVRE has completed its job of matching an event source with a
> Notification Virtual Target (NVT) to notify, it forwards the event
> notification to the IVPE sub-engine. The IVPE scans the thread
> interrupt contexts of the Notification Virtual Targets (NVT)
> dispatched on the HW processor threads and if a match is found, it
> signals the thread. If not, the IVPE escalates the notification to
> some other targets and records the notification in a backlog queue.
> 
> The IVPE maintains the thread interrupt context state for each of its
> NVTs not dispatched on HW processor threads in the Notification
> Virtual Target table (NVTT).
> 
> The model currently only supports single NVT notifications.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/xive.h      |  13 +++
>  include/hw/ppc/xive_regs.h |  22 ++++
>  hw/intc/xive.c             | 223 +++++++++++++++++++++++++++++++++++++
>  3 files changed, 258 insertions(+)
> 
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 5987f26ddb98..e715a6c6923d 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
>                     XiveEND *end);
>      int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>                     XiveEND *end);
> +    int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> +                   XiveNVT *nvt);
> +    int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> +                   XiveNVT *nvt);

As with the ENDs, I don't think get/set is a good interface for a
bigger-than-word-size object.

>  } XiveRouterClass;
>  
>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> @@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>                          XiveEND *end);
>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>                          XiveEND *end);
> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> +                        XiveNVT *nvt);
> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> +                        XiveNVT *nvt);
>  
>  /*
>   * XIVE END ESBs
> @@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
>  
>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
>  
> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
> +{
> +    return (nvt_blk << 19) | nvt_idx;

I'm guessing this formula is the standard way of combining the NVT
block and index into a single word?  If so, I think we should
standardize on passing a single word "nvt_id" around and only
splitting it when we need to use the block separately.  Same goes for
the end_id, assuming there's a standard way of putting that into a
single word.  That will address the point I raised earlier about lisn
being passed around as a single word, but these later stage ids being
split.

We'll probably want some inlines or macros to build an
nvt/end/lisn/whatever id from block and index as well.
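
Something along these lines, as a minimal sketch only. The helper names
are hypothetical, the 19-bit index comes from the shift used in
xive_tctx_cam_line() above, and the 4-bit block width is an assumption
on my part:

```c
#include <stdint.h>

/* Assumed layout: block number above a 19-bit index, as in
 * xive_tctx_cam_line(). The 4-bit block width is an assumption. */
#define XIVE_NVT_IDX_BITS  19
#define XIVE_NVT_IDX_MASK  ((1u << XIVE_NVT_IDX_BITS) - 1)
#define XIVE_NVT_BLK_MASK  0xfu

/* Hypothetical helper: build a single-word NVT id from block and index. */
static inline uint32_t xive_nvt_id(uint8_t blk, uint32_t idx)
{
    return ((uint32_t)(blk & XIVE_NVT_BLK_MASK) << XIVE_NVT_IDX_BITS) |
           (idx & XIVE_NVT_IDX_MASK);
}

/* Hypothetical helper: extract the block number from an NVT id. */
static inline uint8_t xive_nvt_id_blk(uint32_t id)
{
    return (id >> XIVE_NVT_IDX_BITS) & XIVE_NVT_BLK_MASK;
}

/* Hypothetical helper: extract the index from an NVT id. */
static inline uint32_t xive_nvt_id_idx(uint32_t id)
{
    return id & XIVE_NVT_IDX_MASK;
}
```

Callers would then carry the single word around and only split it at
the point where the block is needed on its own.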

> +}
> +
>  #endif /* PPC_XIVE_H */
> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> index 2e3d6cb507da..05cb992d2815 100644
> --- a/include/hw/ppc/xive_regs.h
> +++ b/include/hw/ppc/xive_regs.h
> @@ -158,4 +158,26 @@ typedef struct XiveEND {
>  #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
>  } XiveEND;
>  
> +/* Notification Virtual Target (NVT) */
> +typedef struct XiveNVT {
> +        uint32_t        w0;
> +#define NVT_W0_VALID             PPC_BIT32(0)
> +        uint32_t        w1;
> +        uint32_t        w2;
> +        uint32_t        w3;
> +        uint32_t        w4;
> +        uint32_t        w5;
> +        uint32_t        w6;
> +        uint32_t        w7;
> +        uint32_t        w8;
> +#define NVT_W8_GRP_VALID         PPC_BIT32(0)
> +        uint32_t        w9;
> +        uint32_t        wa;
> +        uint32_t        wb;
> +        uint32_t        wc;
> +        uint32_t        wd;
> +        uint32_t        we;
> +        uint32_t        wf;
> +} XiveNVT;
> +
>  #endif /* PPC_XIVE_REGS_H */
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 4c6cb5d52975..5ba3b06e6e25 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -373,6 +373,32 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
>      }
>  }
>  
> +/* The HW CAM (23bits) is hardwired to :
> + *
> + *   0x000||0b1||4Bit chip number||7Bit Thread number.
> + *
> + * and when the block grouping extension is enabled :
> + *
> + *   4Bit chip number||0x001||7Bit Thread number.
> + */
> +static uint32_t tctx_hw_cam_line(bool block_group, uint8_t chip_id, uint8_t tid)
> +{
> +    if (block_group) {
> +        return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
> +    } else {
> +        return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
> +    }
> +}
> +
> +static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
> +{
> +    PowerPCCPU *cpu = POWERPC_CPU(tctx->cs);
> +    CPUPPCState *env = &cpu->env;
> +    uint32_t pir = env->spr_cb[SPR_PIR].default_value;

I don't much like reaching into the cpu state itself.  I think a
better idea would be to have the TCTX have its HW CAM id set during
initialization (via a property) and then use that.  This will mean
less mucking about if future cpu revisions don't split the PIR into
chip and tid ids in the same way.

> +    return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
> +}
> +
>  static void xive_tctx_reset(void *dev)
>  {
>      XiveTCTX *tctx = XIVE_TCTX(dev);
> @@ -1013,6 +1039,195 @@ int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>     return xrc->set_end(xrtr, end_blk, end_idx, end);
>  }
>  
> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> +                        XiveNVT *nvt)
> +{
> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> +
> +   return xrc->get_nvt(xrtr, nvt_blk, nvt_idx, nvt);
> +}
> +
> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> +                        XiveNVT *nvt)
> +{
> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> +
> +   return xrc->set_nvt(xrtr, nvt_blk, nvt_idx, nvt);
> +}
> +
> +static bool xive_tctx_ring_match(XiveTCTX *tctx, uint8_t ring,
> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
> +                                 bool cam_ignore, uint32_t logic_serv)
> +{
> +    uint8_t *regs = &tctx->regs[ring];
> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &regs[TM_WORD2]));
> +    uint32_t cam = xive_tctx_cam_line(nvt_blk, nvt_idx);
> +    bool block_group = false; /* TODO (PowerNV) */
> +
> +    /* TODO (PowerNV): ignore low order bits of nvt id */
> +
> +    switch (ring) {
> +    case TM_QW3_HV_PHYS:
> +        return (w2 & TM_QW3W2_VT) && xive_tctx_hw_cam_line(tctx, block_group) ==
> +            tctx_hw_cam_line(block_group, nvt_blk, nvt_idx);

The difference between "xive_tctx_hw_cam_line" and "tctx_hw_cam_line"
here is far from obvious.  Remember that namespacing prefixes aren't
necessary for static functions, which can let you give more
descriptive names without getting excessively long.

> +    case TM_QW2_HV_POOL:
> +        return (w2 & TM_QW2W2_VP) && (cam == GETFIELD(TM_QW2W2_POOL_CAM, w2));
> +
> +    case TM_QW1_OS:
> +        return (w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2));
> +
> +    case TM_QW0_USER:
> +        return ((w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2)) &&
> +                (w2 & TM_QW0W2_VU) &&
> +                (logic_serv == GETFIELD(TM_QW0W2_LOGIC_SERV, w2)));
> +
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +
> +static int xive_presenter_tctx_match(XiveTCTX *tctx, uint8_t format,
> +                                     uint8_t nvt_blk, uint32_t nvt_idx,
> +                                     bool cam_ignore, uint32_t logic_serv)
> +{
> +    if (format == 0) {
> +        /* F=0 & i=1: Logical server notification */
> +        if (cam_ignore == true) {
> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for LS "
> +                          "NVT %x/%x\n", nvt_blk, nvt_idx);
> +             return -1;
> +        }
> +
> +        /* F=0 & i=0: Specific NVT notification */
> +        if (xive_tctx_ring_match(tctx, TM_QW3_HV_PHYS,
> +                                nvt_blk, nvt_idx, false, 0)) {
> +            return TM_QW3_HV_PHYS;
> +        }
> +        if (xive_tctx_ring_match(tctx, TM_QW2_HV_POOL,
> +                                nvt_blk, nvt_idx, false, 0)) {
> +            return TM_QW2_HV_POOL;
> +        }
> +        if (xive_tctx_ring_match(tctx, TM_QW1_OS,
> +                                nvt_blk, nvt_idx, false, 0)) {
> +            return TM_QW1_OS;
> +        }

Hm.  It's a bit pointless to iterate through each ring calling a
common function, when that "common" function consists entirely of a
switch which makes it not really common at all.

So I think you want separate helper functions for each ring's match,
or even just fold the previous function into this one.

> +    } else {
> +        /* F=1 : User level Event-Based Branch (EBB) notification */
> +        if (xive_tctx_ring_match(tctx, TM_QW0_USER,
> +                                nvt_blk, nvt_idx, false, logic_serv)) {
> +            return TM_QW0_USER;
> +        }
> +    }
> +    return -1;
> +}
> +
> +typedef struct XiveTCTXMatch {
> +    XiveTCTX *tctx;
> +    uint8_t ring;
> +} XiveTCTXMatch;
> +
> +static bool xive_presenter_match(XiveRouter *xrtr, uint8_t format,
> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
> +                                 bool cam_ignore, uint8_t priority,
> +                                 uint32_t logic_serv, XiveTCTXMatch *match)
> +{
> +    CPUState *cs;
> +
> +    /* TODO (PowerNV): handle chip_id overwrite of block field for
> +     * hardwired CAM compares */
> +
> +    CPU_FOREACH(cs) {
> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> +        XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
> +        int ring;
> +
> +        /*
> +         * HW checks that the CPU is enabled in the Physical Thread
> +         * Enable Register (PTER).
> +         */
> +
> +        /*
> +         * Check the thread context CAM lines and record matches. We
> +         * will handle CPU exception delivery later
> +         */
> +        ring = xive_presenter_tctx_match(tctx, format, nvt_blk, nvt_idx,
> +                                         cam_ignore, logic_serv);
> +        /*
> +         * Save the context and follow on to catch duplicates, that we
> +         * don't support yet.
> +         */
> +        if (ring != -1) {
> +            if (match->tctx) {
> +                qemu_log_mask(LOG_GUEST_ERROR, "XIVE: already found a thread "
> +                              "context NVT %x/%x\n", nvt_blk, nvt_idx);
> +                return false;
> +            }
> +
> +            match->ring = ring;
> +            match->tctx = tctx;
> +        }
> +    }
> +
> +    if (!match->tctx) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
> +                      nvt_blk, nvt_idx);
> +        return false;

Hmm.. this isn't actually an error, is it?  At least not for powernv
- that just means the NVT isn't currently dispatched, so we'll need to
trigger the escalation interrupt.  Does this get changed later in the
series?

> +    }
> +
> +    return true;
> +}
> +
> +/*
> + * This is our simple Xive Presenter Engine model. It is merged in the
> + * Router as it does not require an extra object.
> + *
> + * It receives notification requests sent by the IVRE to find one
> + * matching NVT (or more) dispatched on the processor threads. In case
> + * of a single NVT notification, the process is abbreviated and the
> + * thread is signaled if a match is found. In case of a logical server
> + * notification (bits ignored at the end of the NVT identifier), the
> + * IVPE and IVRE select a winning thread using different filters. This
> + * involves 2 or 3 exchanges on the PowerBus that the model does not
> + * support.
> + *
> + * The parameters represent what is sent on the PowerBus
> + */
> +static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
> +                                  uint8_t nvt_blk, uint32_t nvt_idx,
> +                                  bool cam_ignore, uint8_t priority,
> +                                  uint32_t logic_serv)
> +{
> +    XiveNVT nvt;
> +    XiveTCTXMatch match = { 0 };
> +    bool found;
> +
> +    /* NVT cache lookup */
> +    if (xive_router_get_nvt(xrtr, nvt_blk, nvt_idx, &nvt)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no NVT %x/%x\n",
> +                      nvt_blk, nvt_idx);
> +        return;
> +    }
> +
> +    if (!(nvt.w0 & NVT_W0_VALID)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is invalid\n",
> +                      nvt_blk, nvt_idx);
> +        return;
> +    }
> +
> +    found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
> +                                 priority, logic_serv, &match);
> +    if (found) {
> +        return;
> +    }
> +
> +    /* If no matching NVT is dispatched on a HW thread :
> +     * - update the NVT structure if backlog is activated
> +     * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
> +     *   activated
> +     */
> +}
> +
>  /*
>   * An END trigger can come from an event trigger (IPI or HW) or from
>   * another chip. We don't model the PowerBus but the END trigger
> @@ -1081,6 +1296,14 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>      /*
>       * Follows IVPE notification
>       */
> +    xive_presenter_notify(xrtr, format,
> +                          GETFIELD(END_W6_NVT_BLOCK, end.w6),
> +                          GETFIELD(END_W6_NVT_INDEX, end.w6),
> +                          GETFIELD(END_W7_F0_IGNORE, end.w7),
> +                          priority,
> +                          GETFIELD(END_W7_F1_LOG_SERVER_ID, end.w7));
> +
> +    /* TODO: Auto EOI. */
>  }
>  
>  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged Cédric Le Goater
@ 2018-11-28  0:13   ` David Gibson
  2018-11-28  2:32     ` Benjamin Herrenschmidt
  2018-11-28 11:30     ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-28  0:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:02AM +0100, Cédric Le Goater wrote:
> After the event data was pushed in the O/S Event Queue, the IVPE
> raises the bit corresponding to the priority of the pending interrupt
> in the register IPB (Interrupt Pending Buffer) to indicate there is an
> event pending in one of the 8 priority queues. The Pending Interrupt
> Priority Register (PIPR) is also updated using the IPB. This register
> represents the priority of the most favored pending notification.
> 
> The PIPR is then compared to the Current Processor Priority
> Register (CPPR). If it is more favored (numerically less than), the
> CPU interrupt line is raised and the EO bit of the Notification Source
> Register (NSR) is updated to notify the presence of an exception for
> the O/S. The check needs to be done whenever the PIPR or the CPPR are
> changed.
> 
> The O/S acknowledges the interrupt with a special load in the Thread
> Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
> takes the value of PIPR. The bit number in the IPB corresponding to
> the priority of the pending interrupt is reset and so is the EO bit
> of the NSR.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/intc/xive.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  1 file changed, 93 insertions(+), 1 deletion(-)
> 
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 5ba3b06e6e25..c49932d2b799 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -21,9 +21,73 @@
>   * XIVE Thread Interrupt Management context
>   */
>  
> +/* Convert a priority number to an Interrupt Pending Buffer (IPB)
> + * register, which indicates a pending interrupt at the priority
> + * corresponding to the bit number
> + */
> +static uint8_t priority_to_ipb(uint8_t priority)
> +{
> +    return priority > XIVE_PRIORITY_MAX ?
> +        0 : 1 << (XIVE_PRIORITY_MAX - priority);
> +}
> +
> +/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
> + * Interrupt Priority Register (PIPR), which contains the priority of
> + * the most favored pending notification.
> + */
> +static uint8_t ipb_to_pipr(uint8_t ibp)
> +{
> +    return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
> +}
> +
> +static void ipb_update(uint8_t *regs, uint8_t priority)
> +{
> +    regs[TM_IPB] |= priority_to_ipb(priority);
> +    regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
> +}
> +
> +static uint8_t exception_mask(uint8_t ring)
> +{
> +    switch (ring) {
> +    case TM_QW1_OS:
> +        return TM_QW1_NSR_EO;
> +    default:
> +        g_assert_not_reached();
> +    }
> +}
> +
>  static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
>  {
> -    return 0;
> +    uint8_t *regs = &tctx->regs[ring];
> +    uint8_t nsr = regs[TM_NSR];
> +    uint8_t mask = exception_mask(ring);
> +
> +    qemu_irq_lower(tctx->output);
> +
> +    if (regs[TM_NSR] & mask) {
> +        uint8_t cppr = regs[TM_PIPR];
> +
> +        regs[TM_CPPR] = cppr;
> +
> +        /* Reset the pending buffer bit */
> +        regs[TM_IPB] &= ~priority_to_ipb(cppr);
> +        regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
> +
> +        /* Drop Exception bit */
> +        regs[TM_NSR] &= ~mask;
> +    }
> +
> +    return (nsr << 8) | regs[TM_CPPR];

Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?
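
For reference, a standalone sketch of that expression (the function
name is hypothetical, not from the patch). As far as I can tell, C
promotes a uint8_t operand to int before shifting, so the 8-bit shift
itself is well defined and the open question is only one of
readability:

```c
#include <stdint.h>

/* Hypothetical stand-in for the value returned by xive_tctx_accept():
 * NSR in the high byte, CPPR in the low byte. Both uint8_t operands
 * are promoted to int before the shift and the OR are evaluated. */
static uint64_t tm_ack_value(uint8_t nsr, uint8_t cppr)
{
    return (nsr << 8) | cppr;
}
```

An explicit (uint16_t) or (uint64_t) cast on nsr would still make the
intent more obvious to the reader.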

> +}
> +
> +static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
> +{
> +    uint8_t *regs = &tctx->regs[ring];
> +
> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {
> +        regs[TM_NSR] |= exception_mask(ring);
> +        qemu_irq_raise(tctx->output);
> +    }
>  }
>  
>  static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
> @@ -33,6 +97,9 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>      }
>  
>      tctx->regs[ring + TM_CPPR] = cppr;
> +
> +    /* CPPR has changed, check if we need to raise a pending exception */
> +    xive_tctx_notify(tctx, ring);
>  }
>  
>  /*
> @@ -198,6 +265,17 @@ static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
>      xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
>  }
>  
> +/*
> + * Adjust the IPB to allow a CPU to process event queues of other
> + * priorities during one physical interrupt cycle.
> + */
> +static void xive_tm_set_os_pending(XiveTCTX *tctx, hwaddr offset,
> +                                   uint64_t value, unsigned size)
> +{
> +    ipb_update(&tctx->regs[TM_QW1_OS], value & 0xff);
> +    xive_tctx_notify(tctx, TM_QW1_OS);
> +}
> +
>  /*
>   * Define a mapping of "special" operations depending on the TIMA page
>   * offset and the size of the operation.
> @@ -220,6 +298,7 @@ static const XiveTmOp xive_tm_operations[] = {
>  
>      /* MMIOs above 2K : special operations with side effects */
>      { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
> +    { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
>  };
>  
>  static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
> @@ -409,6 +488,13 @@ static void xive_tctx_reset(void *dev)
>      tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
>      tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
>      tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
> +
> +    /*
> +     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
> +     * CPPR is first set.
> +     */
> +    tctx->regs[TM_QW1_OS + TM_PIPR] =
> +        ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
>  }
>  
>  static void xive_tctx_realize(DeviceState *dev, Error **errp)
> @@ -1218,9 +1304,15 @@ static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
>      found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
>                                   priority, logic_serv, &match);
>      if (found) {
> +        ipb_update(&match.tctx->regs[match.ring], priority);
> +        xive_tctx_notify(match.tctx, match.ring);
>          return;
>      }
>  
> +    /* Record the IPB in the associated NVT structure */
> +    ipb_update((uint8_t *) &nvt.w4, priority);
> +    xive_router_set_nvt(xrtr, nvt_blk, nvt_idx, &nvt);

You're only writing back the NVT in the !found case.  Don't you still
need to update it in the found case?

>      /* If no matching NVT is dispatched on a HW thread :
>       * - update the NVT structure if backlog is activated
>       * - escalate (ESe PQ bits and EAS in w4-5) if escalation is

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller Cédric Le Goater
@ 2018-11-28  0:52   ` David Gibson
  2018-11-28 16:27     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  0:52 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:03AM +0100, Cédric Le Goater wrote:
> sPAPRXive models the XIVE interrupt controller of the sPAPR machine.
> It inherits from the XiveRouter and provisions storage for the routing
> tables :
> 
>   - Event Assignment Structure (EAS)
>   - Event Notification Descriptor (END)
> 
> The sPAPRXive model incorporates an internal XiveSource for the IPIs
> and for the interrupts of the virtual devices of the guest. This model
> is consistent with XIVE architecture which also incorporates an
> internal IVSE for IPIs and accelerator interrupts in the IVRE
> sub-engine.
> 
> The sPAPRXive model exports two memory regions, one for the ESB
> trigger and management pages used to control the sources and one for
> the TIMA pages. They are mapped by default at the addresses found on
> chip 0 of a baremetal system. This is also consistent with the XIVE
> architecture which defines a Virtualization Controller BAR for the
> internal IVSE ESB pages and a Thread Managment BAR for the TIMA.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  default-configs/ppc64-softmmu.mak |   1 +
>  include/hw/ppc/spapr_xive.h       |  46 +++++
>  hw/intc/spapr_xive.c              | 323 ++++++++++++++++++++++++++++++
>  hw/intc/Makefile.objs             |   1 +
>  4 files changed, 371 insertions(+)
>  create mode 100644 include/hw/ppc/spapr_xive.h
>  create mode 100644 hw/intc/spapr_xive.c
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index 2d1e7c5c4668..7f34ad0528ed 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -17,6 +17,7 @@ CONFIG_XICS=$(CONFIG_PSERIES)
>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>  CONFIG_XIVE=$(CONFIG_PSERIES)
> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_MEM_DEVICE=y
>  CONFIG_DIMM=y
>  CONFIG_SPAPR_RNG=y
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> new file mode 100644
> index 000000000000..06727bd86aa9
> --- /dev/null
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -0,0 +1,46 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_SPAPR_XIVE_H
> +#define PPC_SPAPR_XIVE_H
> +
> +#include "hw/sysbus.h"
> +#include "hw/ppc/xive.h"
> +
> +#define TYPE_SPAPR_XIVE "spapr-xive"
> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> +
> +typedef struct sPAPRXive {
> +    XiveRouter    parent;
> +
> +    /* Internal interrupt source for IPIs and virtual devices */
> +    XiveSource    source;
> +    hwaddr        vc_base;
> +
> +    /* END ESB MMIOs */
> +    XiveENDSource end_source;
> +    hwaddr        end_base;
> +
> +    /* Routing table */
> +    XiveEAS       *eat;
> +    uint32_t      nr_irqs;
> +    XiveEND       *endt;
> +    uint32_t      nr_ends;
> +
> +    /* TIMA mapping address */
> +    hwaddr        tm_base;
> +    MemoryRegion  tm_mmio;
> +} sPAPRXive;
> +
> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
> +
> +#endif /* PPC_SPAPR_XIVE_H */
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> new file mode 100644
> index 000000000000..5d038146c08e
> --- /dev/null
> +++ b/hw/intc/spapr_xive.c
> @@ -0,0 +1,323 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "monitor/monitor.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/xive.h"
> +#include "hw/ppc/xive_regs.h"
> +
> +/*
> + * XIVE Virtualization Controller BAR and Thread Managment BAR that we
> + * use for the ESB pages and the TIMA pages
> + */
> +#define SPAPR_XIVE_VC_BASE   0x0006010000000000ull
> +#define SPAPR_XIVE_TM_BASE   0x0006030203180000ull
> +
> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> +{
> +    int i;
> +    uint32_t offset = 0;
> +
> +    monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
> +                   offset + xive->source.nr_irqs - 1);
> +    xive_source_pic_print_info(&xive->source, offset, mon);
> +
> +    monitor_printf(mon, "XIVE EAT %08x .. %08x\n", 0, xive->nr_irqs - 1);
> +    for (i = 0; i < xive->nr_irqs; i++) {
> +        xive_eas_pic_print_info(&xive->eat[i], i, mon);
> +    }
> +
> +    monitor_printf(mon, "XIVE ENDT %08x .. %08x\n", 0, xive->nr_ends - 1);
> +    for (i = 0; i < xive->nr_ends; i++) {
> +        xive_end_pic_print_info(&xive->endt[i], i, mon);
> +    }

AIUI the PAPR model hides the details of ENDs, EQs and NVTs - instead
each logical EAS just points at a (thread, priority) pair, which under
the hood has exactly one END and one NVT bound to it.

Given that, would it make more sense to reformat the info here to show
things in terms of those (thread, priority) pairs, rather than the
internal EAS and END details?

> +}
> +
> +/* Map the ESB pages and the TIMA pages */
> +static void spapr_xive_mmio_map(sPAPRXive *xive)
> +{
> +    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
> +    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);

Uh.. I didn't think the PAPR model exposed the END sources to the guest?

> +    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
> +}
> +
> +static void spapr_xive_reset(DeviceState *dev)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    int i;
> +
> +    /* Xive Source reset is done through SysBus, it should put all
> +     * IRQs to OFF (!P|Q) */
> +
> +    /* Mask all valid EASs in the IRQ number space. */
> +    for (i = 0; i < xive->nr_irqs; i++) {
> +        XiveEAS *eas = &xive->eat[i];
> +        if (eas->w & EAS_VALID) {
> +            eas->w |= EAS_MASKED;

To ensure consistent behaviour across reboots, it would be better to
reset the whole of the EAS, except those which have to be preserved
across reboots (which would be VALID, and maybe nothing else?).
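
Roughly what I have in mind, as an untested sketch. eas_reset() is a
made-up name, and the bit values below are placeholders standing in
for the real EAS_VALID / EAS_MASKED definitions in xive_regs.h:

```c
#include <stdint.h>

/* Placeholder bit positions for illustration only; the real
 * definitions live in include/hw/ppc/xive_regs.h and may differ. */
#define EAS_VALID   (1ull << 63)
#define EAS_MASKED  (1ull << 31)

typedef struct XiveEAS {
    uint64_t w;
} XiveEAS;

/* Hypothetical helper: preserve only VALID across reset, drop every
 * other field, and leave valid entries masked. */
static void eas_reset(XiveEAS *eas)
{
    eas->w &= EAS_VALID;        /* forget everything but VALID */
    if (eas->w & EAS_VALID) {
        eas->w |= EAS_MASKED;   /* valid sources restart masked */
    }
}
```

Invalid entries end up zeroed, and valid ones come back masked with
no stale contents left over from the previous boot.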

> +        }
> +    }
> +
> +    for (i = 0; i < xive->nr_ends; i++) {
> +        xive_end_reset(&xive->endt[i]);
> +    }
> +
> +    spapr_xive_mmio_map(xive);

You shouldn't need to re-establish MMIO mappings at reset time, only
during initialization.

> +}
> +
> +static void spapr_xive_instance_init(Object *obj)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(obj);
> +
> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);

Yeah, embedding the source here makes sense, but it's a strong
indication that XiveSource should not be a SysBusDevice subclass.  I
really think it wants to be a TYPE_DEVICE subclass - and, in fact, I
think it can be object_initialize() embedded everywhere it's used.

I've also said elsewhere that I suspect XiveRouter should also not be a
SysBusDevice.  With that approach it might make sense to embed it
here, rather than subclassing it (the old composition vs. inheritance
debate).

> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> +
> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
> +                      TYPE_XIVE_END_SOURCE);
> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
> +                              NULL);
> +}
> +
> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    XiveSource *xsrc = &xive->source;
> +    XiveENDSource *end_xsrc = &xive->end_source;
> +    Error *local_err = NULL;
> +
> +    if (!xive->nr_irqs) {
> +        error_setg(errp, "Number of interrupt needs to be greater 0");
> +        return;
> +    }
> +
> +    if (!xive->nr_ends) {
> +        error_setg(errp, "Number of interrupt needs to be greater 0");
> +        return;
> +    }
> +
> +    /*
> +     * Initialize the internal sources, for IPIs and virtual devices.
> +     */
> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
> +                            &error_fatal);
> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
> +                                   &error_fatal);
> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
> +
> +    /*
> +     * Initialize the END ESB source
> +     */
> +    object_property_set_int(OBJECT(end_xsrc), xive->nr_irqs, "nr-ends",
> +                            &error_fatal);
> +    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
> +                                   &error_fatal);
> +    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
> +
> +    /* Set the mapping address of the END ESB pages after the source ESBs */
> +    xive->end_base = xive->vc_base + (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
> +
> +    /*
> +     * Allocate the routing tables
> +     */
> +    xive->eat = g_new0(XiveEAS, xive->nr_irqs);
> +    xive->endt = g_new0(XiveEND, xive->nr_ends);
> +
> +    /* TIMA initialization */
> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
> +                          "xive.tima", 4ull << TM_SHIFT);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
> +}
> +
> +static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +
> +    if (lisn >= xive->nr_irqs) {
> +        return -1;
> +    }
> +
> +    *eas = xive->eat[lisn];
> +    return 0;
> +}
> +
> +static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +
> +    if (lisn >= xive->nr_irqs) {
> +        return -1;
> +    }
> +
> +    xive->eat[lisn] = *eas;
> +    return 0;
> +}
> +
> +static int spapr_xive_get_end(XiveRouter *xrtr,
> +                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +
> +    if (end_idx >= xive->nr_ends) {
> +        return -1;
> +    }
> +
> +    memcpy(end, &xive->endt[end_idx], sizeof(XiveEND));
> +    return 0;
> +}
> +
> +static int spapr_xive_set_end(XiveRouter *xrtr,
> +                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +
> +    if (end_idx >= xive->nr_ends) {
> +        return -1;
> +    }
> +
> +    memcpy(&xive->endt[end_idx], end, sizeof(XiveEND));
> +    return 0;
> +}
> +
> +static const VMStateDescription vmstate_spapr_xive_end = {
> +    .name = TYPE_SPAPR_XIVE "/end",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField []) {
> +        VMSTATE_UINT32(w0, XiveEND),
> +        VMSTATE_UINT32(w1, XiveEND),
> +        VMSTATE_UINT32(w2, XiveEND),
> +        VMSTATE_UINT32(w3, XiveEND),
> +        VMSTATE_UINT32(w4, XiveEND),
> +        VMSTATE_UINT32(w5, XiveEND),
> +        VMSTATE_UINT32(w6, XiveEND),
> +        VMSTATE_UINT32(w7, XiveEND),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static const VMStateDescription vmstate_spapr_xive_eas = {
> +    .name = TYPE_SPAPR_XIVE "/eas",
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField []) {
> +        VMSTATE_UINT64(w, XiveEAS),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static const VMStateDescription vmstate_spapr_xive = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .version_id = 1,
> +    .minimum_version_id = 1,
> +    .fields = (VMStateField[]) {
> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
> +                                     vmstate_spapr_xive_eas, XiveEAS),
> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(endt, sPAPRXive, nr_ends,
> +                                             vmstate_spapr_xive_end, XiveEND),
> +        VMSTATE_END_OF_LIST()
> +    },
> +};
> +
> +static Property spapr_xive_properties[] = {
> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> +    DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
> +    DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
> +    DEFINE_PROP_UINT64("tm-base", sPAPRXive, tm_base, SPAPR_XIVE_TM_BASE),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
> +
> +    dc->desc    = "sPAPR XIVE Interrupt Controller";
> +    dc->props   = spapr_xive_properties;
> +    dc->realize = spapr_xive_realize;
> +    dc->reset   = spapr_xive_reset;
> +    dc->vmsd    = &vmstate_spapr_xive;
> +
> +    xrc->get_eas = spapr_xive_get_eas;
> +    xrc->set_eas = spapr_xive_set_eas;
> +    xrc->get_end = spapr_xive_get_end;
> +    xrc->set_end = spapr_xive_set_end;
> +}
> +
> +static const TypeInfo spapr_xive_info = {
> +    .name = TYPE_SPAPR_XIVE,
> +    .parent = TYPE_XIVE_ROUTER,
> +    .instance_init = spapr_xive_instance_init,
> +    .instance_size = sizeof(sPAPRXive),
> +    .class_init = spapr_xive_class_init,
> +};
> +
> +static void spapr_xive_register_types(void)
> +{
> +    type_register_static(&spapr_xive_info);
> +}
> +
> +type_init(spapr_xive_register_types)
> +
> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
> +{
> +    XiveSource *xsrc = &xive->source;
> +
> +    if (lisn >= xive->nr_irqs) {
> +        return false;
> +    }
> +
> +    xive->eat[lisn].w |= EAS_VALID;
> +    xive_source_irq_set(xsrc, lisn, lsi);
> +    return true;
> +}
> +
> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> +{
> +    XiveSource *xsrc = &xive->source;
> +
> +    if (lisn >= xive->nr_irqs) {
> +        return false;
> +    }
> +
> +    xive->eat[lisn].w &= ~EAS_VALID;
> +    xive_source_irq_set(xsrc, lisn, false);
> +    return true;
> +}
> +
> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
> +{
> +    XiveSource *xsrc = &xive->source;
> +
> +    if (lisn >= xive->nr_irqs) {
> +        return NULL;
> +    }
> +
> +    if (!(xive->eat[lisn].w & EAS_VALID)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);

I don't think this is a guest error - getting the qirq by number
should generally be something qemu code does.

> +        return NULL;
> +    }
> +
> +    return xive_source_qirq(xsrc, lisn);
> +}
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index 72a46ed91c31..301a8e972d91 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>  obj-$(CONFIG_XIVE) += xive.o
> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o


* Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-28  0:13   ` David Gibson
@ 2018-11-28  2:32     ` Benjamin Herrenschmidt
  2018-11-28  2:41       ` David Gibson
  2018-11-28 11:30     ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2018-11-28  2:32 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: qemu-ppc, qemu-devel

On Wed, 2018-11-28 at 11:13 +1100, David Gibson wrote:
> Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?

Shouldn't be a problem as long as it fits in an int, no ?
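
To spell the reasoning out with a standalone sketch (the helper name
and the uint8_t operand types are assumptions for illustration, not
the patch code): C promotes an operand narrower than int to int
before shifting, so the shift is done at int width rather than at
8 bits.

```c
#include <stdint.h>

/* Hypothetical helper: combine two byte-sized register values into
 * one 16-bit word. 'nsr' is promoted to int before the shift, so
 * (nsr << 8) is at most 0xff00 and fits comfortably in a 32-bit int. */
static uint16_t pack_nsr_cppr(uint8_t nsr, uint8_t cppr)
{
    return (uint16_t)((nsr << 8) | cppr);
}
```

A cast would only become necessary if the shifted value could exceed
what an int can hold.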

Cheers,
Ben.


* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-27 23:49   ` David Gibson
@ 2018-11-28  2:34     ` Benjamin Herrenschmidt
  2018-11-28 10:59     ` Cédric Le Goater
  1 sibling, 0 replies; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2018-11-28  2:34 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: qemu-ppc, qemu-devel

On Wed, 2018-11-28 at 10:49 +1100, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
> > The last sub-engine of the XIVE architecture is the Interrupt
> > Virtualization Presentation Engine (IVPE). On HW, they share elements,
> > the Power Bus interface (CQ), the routing table descriptors, and they
> > can be combined in the same HW logic. We do the same in QEMU and
> > combine both engines in the XiveRouter for simplicity.
> 
> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
> single object is a good idea, but we can probably discuss that once
> I've read further.

Keep in mind that the communication between the two is a bit more hairy
than simple notifications, though. Especially once we start
implementing escalation interrupts or worse, groups...

> > When the IVRE has completed its job of matching an event source with a
> > Notification Virtual Target (NVT) to notify, it forwards the event
> > notification to the IVPE sub-engine. The IVPE scans the thread
> > interrupt contexts of the Notification Virtual Targets (NVT)
> > dispatched on the HW processor threads and if a match is found, it
> > signals the thread. If not, the IVPE escalates the notification to
> > some other targets and records the notification in a backlog queue.
> > 
> > The IVPE maintains the thread interrupt context state for each of its
> > NVTs not dispatched on HW processor threads in the Notification
> > Virtual Target table (NVTT).
> > 
> > The model currently only supports single NVT notifications.
> > 
> > Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > ---
> >  include/hw/ppc/xive.h      |  13 +++
> >  include/hw/ppc/xive_regs.h |  22 ++++
> >  hw/intc/xive.c             | 223 +++++++++++++++++++++++++++++++++++++
> >  3 files changed, 258 insertions(+)
> > 
> > diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> > index 5987f26ddb98..e715a6c6923d 100644
> > --- a/include/hw/ppc/xive.h
> > +++ b/include/hw/ppc/xive.h
> > @@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
> >                     XiveEND *end);
> >      int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >                     XiveEND *end);
> > +    int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +                   XiveNVT *nvt);
> > +    int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +                   XiveNVT *nvt);
> 
> As with the ENDs, I don't think get/set is a good interface for a
> bigger-than-word-size object.
> 
> >  } XiveRouterClass;
> >  
> >  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> > @@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >                          XiveEND *end);
> >  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >                          XiveEND *end);
> > +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +                        XiveNVT *nvt);
> > +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +                        XiveNVT *nvt);
> >  
> >  /*
> >   * XIVE END ESBs
> > @@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
> >  
> >  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
> >  
> > +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
> > +{
> > +    return (nvt_blk << 19) | nvt_idx;
> 
> I'm guessing this formula is the standard way of combining the NVT
> block and index into a single word?  If so, I think we should
> standardize on passing a single word "nvt_id" around and only
> splitting it when we need to use the block separately.  Same goes for
> the end_id, assuming there's a standard way of putting that into a
> single word.  That will address the point I raised earlier about lisn
> being passed around as a single word, but these later stage ids being
> split.
> 
> We'll probably want some inlines or macros to build an
> nvt/end/lisn/whatever id from block and index as well.
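
A possible shape for such helpers, as a sketch only. The names are
hypothetical, and the 19-bit index width is simply taken from the
(nvt_blk << 19) in xive_tctx_cam_line() above:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: index width as implied by xive_tctx_cam_line() */
#define NVT_IDX_BITS  19
#define NVT_IDX_MASK  ((1u << NVT_IDX_BITS) - 1)

/* Build a single-word NVT id from block and index */
static inline uint32_t nvt_id_make(uint8_t blk, uint32_t idx)
{
    assert(idx <= NVT_IDX_MASK);    /* index must not spill into block */
    return ((uint32_t)blk << NVT_IDX_BITS) | idx;
}

/* Split the id again where the block is needed separately */
static inline uint8_t nvt_id_block(uint32_t id)
{
    return (uint8_t)(id >> NVT_IDX_BITS);
}

static inline uint32_t nvt_id_index(uint32_t id)
{
    return id & NVT_IDX_MASK;
}
```

Callers would then pass one uint32_t around, the same way the lisn is
passed, and split it only at the point of use.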
> 
> > +}
> > +
> >  #endif /* PPC_XIVE_H */
> > diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> > index 2e3d6cb507da..05cb992d2815 100644
> > --- a/include/hw/ppc/xive_regs.h
> > +++ b/include/hw/ppc/xive_regs.h
> > @@ -158,4 +158,26 @@ typedef struct XiveEND {
> >  #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
> >  } XiveEND;
> >  
> > +/* Notification Virtual Target (NVT) */
> > +typedef struct XiveNVT {
> > +        uint32_t        w0;
> > +#define NVT_W0_VALID             PPC_BIT32(0)
> > +        uint32_t        w1;
> > +        uint32_t        w2;
> > +        uint32_t        w3;
> > +        uint32_t        w4;
> > +        uint32_t        w5;
> > +        uint32_t        w6;
> > +        uint32_t        w7;
> > +        uint32_t        w8;
> > +#define NVT_W8_GRP_VALID         PPC_BIT32(0)
> > +        uint32_t        w9;
> > +        uint32_t        wa;
> > +        uint32_t        wb;
> > +        uint32_t        wc;
> > +        uint32_t        wd;
> > +        uint32_t        we;
> > +        uint32_t        wf;
> > +} XiveNVT;
> > +
> >  #endif /* PPC_XIVE_REGS_H */
> > diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> > index 4c6cb5d52975..5ba3b06e6e25 100644
> > --- a/hw/intc/xive.c
> > +++ b/hw/intc/xive.c
> > @@ -373,6 +373,32 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
> >      }
> >  }
> >  
> > +/* The HW CAM (23bits) is hardwired to :
> > + *
> > + *   0x000||0b1||4Bit chip number||7Bit Thread number.
> > + *
> > + * and when the block grouping extension is enabled :
> > + *
> > + *   4Bit chip number||0x001||7Bit Thread number.
> > + */
> > +static uint32_t tctx_hw_cam_line(bool block_group, uint8_t chip_id, uint8_t tid)
> > +{
> > +    if (block_group) {
> > +        return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
> > +    } else {
> > +        return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
> > +    }
> > +}
> > +
> > +static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
> > +{
> > +    PowerPCCPU *cpu = POWERPC_CPU(tctx->cs);
> > +    CPUPPCState *env = &cpu->env;
> > +    uint32_t pir = env->spr_cb[SPR_PIR].default_value;
> 
> I don't much like reaching into the cpu state itself.  I think a
> better idea would be to have the TCTX have its HW CAM id set during
> initialization (via a property) and then use that.  This will mean
> less mucking about if future cpu revisions don't split the PIR into
> chip and tid ids in the same way.
> 
> > +    return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
> > +}
> > +
> >  static void xive_tctx_reset(void *dev)
> >  {
> >      XiveTCTX *tctx = XIVE_TCTX(dev);
> > @@ -1013,6 +1039,195 @@ int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >     return xrc->set_end(xrtr, end_blk, end_idx, end);
> >  }
> >  
> > +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +                        XiveNVT *nvt)
> > +{
> > +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> > +
> > +   return xrc->get_nvt(xrtr, nvt_blk, nvt_idx, nvt);
> > +}
> > +
> > +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> > +                        XiveNVT *nvt)
> > +{
> > +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> > +
> > +   return xrc->set_nvt(xrtr, nvt_blk, nvt_idx, nvt);
> > +}
> > +
> > +static bool xive_tctx_ring_match(XiveTCTX *tctx, uint8_t ring,
> > +                                 uint8_t nvt_blk, uint32_t nvt_idx,
> > +                                 bool cam_ignore, uint32_t logic_serv)
> > +{
> > +    uint8_t *regs = &tctx->regs[ring];
> > +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &regs[TM_WORD2]));
> > +    uint32_t cam = xive_tctx_cam_line(nvt_blk, nvt_idx);
> > +    bool block_group = false; /* TODO (PowerNV) */
> > +
> > +    /* TODO (PowerNV): ignore low order bits of nvt id */
> > +
> > +    switch (ring) {
> > +    case TM_QW3_HV_PHYS:
> > +        return (w2 & TM_QW3W2_VT) && xive_tctx_hw_cam_line(tctx, block_group) ==
> > +            tctx_hw_cam_line(block_group, nvt_blk, nvt_idx);
> 
> The difference between "xive_tctx_hw_cam_line" and "tctx_hw_cam_line"
> here is far from obvious.  Remember that namespacing prefixes aren't
> necessary for static functions, which can let you give more
> descriptive names without getting excessively long.
> 
> > +    case TM_QW2_HV_POOL:
> > +        return (w2 & TM_QW2W2_VP) && (cam == GETFIELD(TM_QW2W2_POOL_CAM, w2));
> > +
> > +    case TM_QW1_OS:
> > +        return (w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2));
> > +
> > +    case TM_QW0_USER:
> > +        return ((w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2)) &&
> > +                (w2 & TM_QW0W2_VU) &&
> > +                (logic_serv == GETFIELD(TM_QW0W2_LOGIC_SERV, w2)));
> > +
> > +    default:
> > +        g_assert_not_reached();
> > +    }
> > +}
> > +
> > +static int xive_presenter_tctx_match(XiveTCTX *tctx, uint8_t format,
> > +                                     uint8_t nvt_blk, uint32_t nvt_idx,
> > +                                     bool cam_ignore, uint32_t logic_serv)
> > +{
> > +    if (format == 0) {
> > +        /* F=0 & i=1: Logical server notification */
> > +        if (cam_ignore == true) {
> > +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for LS "
> > +                          "NVT %x/%x\n", nvt_blk, nvt_idx);
> > +             return -1;
> > +        }
> > +
> > +        /* F=0 & i=0: Specific NVT notification */
> > +        if (xive_tctx_ring_match(tctx, TM_QW3_HV_PHYS,
> > +                                nvt_blk, nvt_idx, false, 0)) {
> > +            return TM_QW3_HV_PHYS;
> > +        }
> > +        if (xive_tctx_ring_match(tctx, TM_QW2_HV_POOL,
> > +                                nvt_blk, nvt_idx, false, 0)) {
> > +            return TM_QW2_HV_POOL;
> > +        }
> > +        if (xive_tctx_ring_match(tctx, TM_QW1_OS,
> > +                                nvt_blk, nvt_idx, false, 0)) {
> > +            return TM_QW1_OS;
> > +        }
> 
> Hm.  It's a bit pointless to iterate through each ring calling a
> common function, when that "common" function consists entirely of a
> switch which makes it not really common at all.
> 
> So I think you want separate helper functions for each ring's match,
> or even just fold the previous function into this one.
> 
> > +    } else {
> > +        /* F=1 : User level Event-Based Branch (EBB) notification */
> > +        if (xive_tctx_ring_match(tctx, TM_QW0_USER,
> > +                                nvt_blk, nvt_idx, false, logic_serv)) {
> > +            return TM_QW0_USER;
> > +        }
> > +    }
> > +    return -1;
> > +}
> > +
> > +typedef struct XiveTCTXMatch {
> > +    XiveTCTX *tctx;
> > +    uint8_t ring;
> > +} XiveTCTXMatch;
> > +
> > +static bool xive_presenter_match(XiveRouter *xrtr, uint8_t format,
> > +                                 uint8_t nvt_blk, uint32_t nvt_idx,
> > +                                 bool cam_ignore, uint8_t priority,
> > +                                 uint32_t logic_serv, XiveTCTXMatch *match)
> > +{
> > +    CPUState *cs;
> > +
> > +    /* TODO (PowerNV): handle chip_id overwrite of block field for
> > +     * hardwired CAM compares */
> > +
> > +    CPU_FOREACH(cs) {
> > +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> > +        XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
> > +        int ring;
> > +
> > +        /*
> > +         * HW checks that the CPU is enabled in the Physical Thread
> > +         * Enable Register (PTER).
> > +         */
> > +
> > +        /*
> > +         * Check the thread context CAM lines and record matches. We
> > +         * will handle CPU exception delivery later
> > +         */
> > +        ring = xive_presenter_tctx_match(tctx, format, nvt_blk, nvt_idx,
> > +                                         cam_ignore, logic_serv);
> > +        /*
> > +         * Save the context and follow on to catch duplicates, that we
> > +         * don't support yet.
> > +         */
> > +        if (ring != -1) {
> > +            if (match->tctx) {
> > +                qemu_log_mask(LOG_GUEST_ERROR, "XIVE: already found a thread "
> > +                              "context NVT %x/%x\n", nvt_blk, nvt_idx);
> > +                return false;
> > +            }
> > +
> > +            match->ring = ring;
> > +            match->tctx = tctx;
> > +        }
> > +    }
> > +
> > +    if (!match->tctx) {
> > +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
> > +                      nvt_blk, nvt_idx);
> > +        return false;
> 
> Hmm.. this isn't actually an error isn't it?  At least not for powernv
> - that just means the NVT isn't currently dispatched, so we'll need to
> trigger the escalation interrupt.  Does this get changed later in the
> series?
> 
> > +    }
> > +
> > +    return true;
> > +}
> > +
> > +/*
> > + * This is our simple Xive Presenter Engine model. It is merged in the
> > + * Router as it does not require an extra object.
> > + *
> > + * It receives notification requests sent by the IVRE to find one
> > + * matching NVT (or more) dispatched on the processor threads. In case
> > + * of a single NVT notification, the process is abreviated and the
> > + * thread is signaled if a match is found. In case of a logical server
> > + * notification (bits ignored at the end of the NVT identifier), the
> > + * IVPE and IVRE select a winning thread using different filters. This
> > + * involves 2 or 3 exchanges on the PowerBus that the model does not
> > + * support.
> > + *
> > + * The parameters represent what is sent on the PowerBus
> > + */
> > +static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
> > +                                  uint8_t nvt_blk, uint32_t nvt_idx,
> > +                                  bool cam_ignore, uint8_t priority,
> > +                                  uint32_t logic_serv)
> > +{
> > +    XiveNVT nvt;
> > +    XiveTCTXMatch match = { 0 };
> > +    bool found;
> > +
> > +    /* NVT cache lookup */
> > +    if (xive_router_get_nvt(xrtr, nvt_blk, nvt_idx, &nvt)) {
> > +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no NVT %x/%x\n",
> > +                      nvt_blk, nvt_idx);
> > +        return;
> > +    }
> > +
> > +    if (!(nvt.w0 & NVT_W0_VALID)) {
> > +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is invalid\n",
> > +                      nvt_blk, nvt_idx);
> > +        return;
> > +    }
> > +
> > +    found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
> > +                                 priority, logic_serv, &match);
> > +    if (found) {
> > +        return;
> > +    }
> > +
> > +    /* If no matching NVT is dispatched on a HW thread :
> > +     * - update the NVT structure if backlog is activated
> > +     * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
> > +     *   activated
> > +     */
> > +}
> > +
> >  /*
> >   * An END trigger can come from an event trigger (IPI or HW) or from
> >   * another chip. We don't model the PowerBus but the END trigger
> > @@ -1081,6 +1296,14 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
> >      /*
> >       * Follows IVPE notification
> >       */
> > +    xive_presenter_notify(xrtr, format,
> > +                          GETFIELD(END_W6_NVT_BLOCK, end.w6),
> > +                          GETFIELD(END_W6_NVT_INDEX, end.w6),
> > +                          GETFIELD(END_W7_F0_IGNORE, end.w7),
> > +                          priority,
> > +                          GETFIELD(END_W7_F1_LOG_SERVER_ID, end.w7));
> > +
> > +    /* TODO: Auto EOI. */
> >  }
> >  
> >  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)

* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier Cédric Le Goater
@ 2018-11-28  2:39   ` David Gibson
  2018-11-28 16:48     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  2:39 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:04AM +0100, Cédric Le Goater wrote:
> The IVPE scans the O/S CAM line of the XIVE thread interrupt contexts
> to find a matching Notification Virtual Target (NVT) among the NVTs
> dispatched on the HW processor threads.
> 
> On a real system, the thread interrupt contexts are updated by the
> hypervisor when a Virtual Processor is scheduled to run on a HW
> thread. Under QEMU, the model emulates the same behavior by hardwiring
> the NVT identifier in the thread context registers at reset.
> 
> The NVT identifier used by the sPAPRXive model is the VCPU id. The END
> identifier is also derived from the VCPU id. A set of helpers doing
> the conversion between identifiers are provided for the hcalls
> configuring the sources and the ENDs.
> 
> The model does not need a NVT table but the XiveRouter NVT operations
> are provided to perform some extra checks in the routing algorithm.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_xive.h |  17 +++++
>  include/hw/ppc/xive.h       |   3 +
>  hw/intc/spapr_xive.c        | 136 ++++++++++++++++++++++++++++++++++++
>  hw/intc/xive.c              |   9 +++
>  4 files changed, 165 insertions(+)
> 
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 06727bd86aa9..3f65b8f485fd 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -43,4 +43,21 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>  qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
>  
> +/*
> + * sPAPR NVT and END indexing helpers
> + */
> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
> +                                  uint32_t nvt_idx);
> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
> +                            uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
> +
> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
> +                             uint32_t *out_server, uint8_t *out_prio);
> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> +                             uint8_t *out_end_blk, uint32_t *out_end_idx);
> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> +                          uint8_t *out_end_blk, uint32_t *out_end_idx);
> +
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index e715a6c6923d..e6931ddaa83f 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -187,6 +187,8 @@ typedef struct XiveRouter {
>  #define XIVE_ROUTER_GET_CLASS(obj)                              \
>      OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
>  
> +typedef struct XiveTCTX XiveTCTX;
> +
>  typedef struct XiveRouterClass {
>      SysBusDeviceClass parent;
>  
> @@ -201,6 +203,7 @@ typedef struct XiveRouterClass {
>                     XiveNVT *nvt);
>      int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>                     XiveNVT *nvt);
> +    void (*reset_tctx)(XiveRouter *xrtr, XiveTCTX *tctx);
>  } XiveRouterClass;
>  
>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index 5d038146c08e..3bf77ace11a2 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -199,6 +199,139 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
>      return 0;
>  }
>  
> +static int spapr_xive_get_nvt(XiveRouter *xrtr,
> +                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +    uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
> +    PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
> +
> +    if (!cpu) {
> +        return -1;
> +    }
> +
> +    /*
> +     * sPAPR does not maintain a NVT table. Return that the NVT is
> +     * valid if we have found a matching CPU
> +     */
> +    nvt->w0 = NVT_W0_VALID;
> +    return 0;
> +}
> +
> +static int spapr_xive_set_nvt(XiveRouter *xrtr,
> +                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
> +{
> +    /* no NVT table */
> +    return 0;
> +}
> +
> +/*
> + * When a Virtual Processor is scheduled to run on a HW thread, the
> + * hypervisor pushes its identifier in the OS CAM line. Under QEMU, we
> + * need to emulate the same behavior.
> + */
> +static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
> +{
> +    uint8_t  nvt_blk;
> +    uint32_t nvt_idx;
> +    uint32_t nvt_cam;
> +
> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
> +                          &nvt_blk, &nvt_idx);
> +
> +    nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
> +    memcpy(&tctx->regs[TM_QW1_OS + TM_WORD2], &nvt_cam, 4);
> +}
> +
> +/*
> + * The allocation of VP blocks is a complex operation in OPAL and the
> + * VP identifiers have a relation with the number of HW chips, the
> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
> + * controller model does not have the same constraints and can use a
> + * simple mapping scheme of the CPU vcpu_id
> + *
> + * These identifiers are never returned to the OS.
> + */
> +
> +#define SPAPR_XIVE_VP_BASE 0x400

0x400 == 1024.  Could we ever have the possibility of needing to
consider both physical NVTs and PAPR NVTs at the same time?  If so,
does this base leave enough space for the physical ones?
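
To make the question concrete, here is a minimal sketch of the index
layout the patch implies. Only SPAPR_XIVE_VP_BASE comes from the patch;
the sketch_* names and the 1024 VCPU bound are assumptions for
illustration (VCPU ids can be sparse depending on VSMT, so the real
upper bound may differ).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SPAPR_XIVE_VP_BASE 0x400   /* value from the patch */
#define SKETCH_MAX_VCPUS   1024    /* assumption: upper bound on VCPU ids */

/* vcpu_id -> NVT index, mirroring the arithmetic in the patch */
static uint32_t sketch_vcpu_to_nvt_idx(uint32_t vcpu_id)
{
    assert(vcpu_id < SKETCH_MAX_VCPUS);
    return SPAPR_XIVE_VP_BASE + vcpu_id;
}

/* True when an NVT index falls in the window used for PAPR VCPUs */
static bool sketch_nvt_idx_is_papr(uint32_t nvt_idx)
{
    return nvt_idx >= SPAPR_XIVE_VP_BASE &&
           nvt_idx < SPAPR_XIVE_VP_BASE + SKETCH_MAX_VCPUS;
}
```

Under these assumptions, indexes 0x0 to 0x3ff (1024 entries) stay
below the PAPR window and 0x400 to 0x7ff are taken by VCPUs.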

> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
> +                                  uint32_t nvt_idx)
> +{
> +    return nvt_idx - SPAPR_XIVE_VP_BASE;
> +}
> +
> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)

A number of these conversions will come out a bit simpler if we pass
the block and index around as a single word in most places.
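
As a rough sketch of what I mean, the pair could travel as one 32-bit
word with a pair of accessors. The sketch_* names and the 19-bit index
width are assumptions for illustration, not something this patch
defines.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the index occupies the low 19 bits, the block sits above */
#define SKETCH_NVT_IDX_BITS 19
#define SKETCH_NVT_IDX_MASK ((1u << SKETCH_NVT_IDX_BITS) - 1)

/* Pack a (block, index) pair into a single word */
static uint32_t sketch_nvt_pack(uint8_t blk, uint32_t idx)
{
    assert(idx <= SKETCH_NVT_IDX_MASK);
    return ((uint32_t)blk << SKETCH_NVT_IDX_BITS) | idx;
}

/* Extract the block number from a packed word */
static uint8_t sketch_nvt_blk(uint32_t nvt)
{
    return (uint8_t)(nvt >> SKETCH_NVT_IDX_BITS);
}

/* Extract the index from a packed word */
static uint32_t sketch_nvt_idx(uint32_t nvt)
{
    return nvt & SKETCH_NVT_IDX_MASK;
}
```

With that, helpers only need one return value instead of two output
pointers, and the split is done once at the point where the hardware
format is needed.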

> +{
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +
> +    if (!cpu) {
> +        return -1;
> +    }
> +
> +    if (out_nvt_blk) {
> +        /* For testing purpose, we could use 0 for nvt_blk */
> +        *out_nvt_blk = xrtr->chip_id;

I don't see any point using the chip_id here, which is currently
always set to 0 for PAPR anyway.  If we just hardwire this to 0 it
removes the only use here of xrtr, which will allow some further
simplifications in the caller, I think.
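
Something along these lines, as a sketch only. SketchCPU is a stand-in
for PowerPCCPU and sketch_cpu_to_nvt is a hypothetical name; the point
is just that the helper no longer needs the sPAPRXive or XiveRouter
pointer once the block is fixed at 0.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPAPR_XIVE_VP_BASE 0x400   /* value from the patch */

/* Stand-in for QEMU's PowerPCCPU, reduced to the field used here */
typedef struct SketchCPU {
    int vcpu_id;
} SketchCPU;

/* Hypothetical simplified helper: no router argument, block is always 0 */
static int sketch_cpu_to_nvt(const SketchCPU *cpu,
                             uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
{
    if (!cpu) {
        return -1;
    }
    assert(cpu->vcpu_id >= 0);

    if (out_nvt_blk) {
        *out_nvt_blk = 0;   /* PAPR has no meaningful chip id */
    }
    if (out_nvt_idx) {
        *out_nvt_idx = SPAPR_XIVE_VP_BASE + (uint32_t)cpu->vcpu_id;
    }
    return 0;
}
```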

> +    }
> +
> +    if (out_nvt_idx) {
> +        *out_nvt_idx = SPAPR_XIVE_VP_BASE + cpu->vcpu_id;
> +    }
> +    return 0;
> +}
> +
> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
> +                             uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)

I suspect some, maybe most of these conversion functions could be static.

> +{
> +    return spapr_xive_cpu_to_nvt(xive, spapr_find_cpu(target), out_nvt_blk,
> +                                 out_nvt_idx);
> +}
> +
> +/*
> + * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
> + * priorities per CPU
> + */
> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
> +                             uint32_t *out_server, uint8_t *out_prio)
> +{
> +    if (out_server) {
> +        *out_server = end_idx >> 3;
> +    }
> +
> +    if (out_prio) {
> +        *out_prio = end_idx & 0x7;
> +    }
> +    return 0;
> +}
> +
> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> +                          uint8_t *out_end_blk, uint32_t *out_end_idx)
> +{
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +
> +    if (!cpu) {
> +        return -1;

Is there ever a reason this would be called with cpu == NULL?  If not,
we might as well just assert() here rather than pushing the error
handling back to the caller.
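
A minimal sketch of the assert() variant, together with the inverse
mapping, under the assumption that callers which look the CPU up from
a guest-supplied target check the lookup result themselves before
calling in. SketchCPU and the sketch_* names are hypothetical
stand-ins; the (vcpu_id << 3) + prio layout is the one from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for QEMU's PowerPCCPU, reduced to the field used here */
typedef struct SketchCPU {
    int vcpu_id;
} SketchCPU;

/* 8 priority ENDs per CPU: end_idx = (vcpu_id << 3) + prio */
static void sketch_cpu_to_end(const SketchCPU *cpu, uint8_t prio,
                              uint8_t *out_end_blk, uint32_t *out_end_idx)
{
    assert(cpu);          /* a NULL cpu is a caller bug, not a runtime error */
    assert(prio < 8);

    if (out_end_blk) {
        *out_end_blk = 0; /* block hardwired to 0 for PAPR */
    }
    if (out_end_idx) {
        *out_end_idx = ((uint32_t)cpu->vcpu_id << 3) + prio;
    }
}

/* Inverse mapping: recover the server (vcpu_id) and priority */
static void sketch_end_to_target(uint32_t end_idx,
                                 uint32_t *out_server, uint8_t *out_prio)
{
    if (out_server) {
        *out_server = end_idx >> 3;
    }
    if (out_prio) {
        *out_prio = end_idx & 0x7;
    }
}
```

The target-based wrapper would then be the only place that has to
report an unknown server back to the hcall.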

> +    }
> +
> +    if (out_end_blk) {
> +        /* For testing purpose, we could use 0 for end_blk */
> +        *out_end_blk = xrtr->chip_id;

Again, I don't see any point to using the chip_id, which is pretty
meaningless for PAPR.

> +    }
> +
> +    if (out_end_idx) {
> +        *out_end_idx = (cpu->vcpu_id << 3) + prio;
> +    }
> +    return 0;
> +}
> +
> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> +                             uint8_t *out_end_blk, uint32_t *out_end_idx)
> +{
> +    return spapr_xive_cpu_to_end(xive, spapr_find_cpu(target), prio,
> +                                 out_end_blk, out_end_idx);
> +}
> +
>  static const VMStateDescription vmstate_spapr_xive_end = {
>      .name = TYPE_SPAPR_XIVE "/end",
>      .version_id = 1,
> @@ -263,6 +396,9 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>      xrc->set_eas = spapr_xive_set_eas;
>      xrc->get_end = spapr_xive_get_end;
>      xrc->set_end = spapr_xive_set_end;
> +    xrc->get_nvt = spapr_xive_get_nvt;
> +    xrc->set_nvt = spapr_xive_set_nvt;
> +    xrc->reset_tctx = spapr_xive_reset_tctx;
>  }
>  
>  static const TypeInfo spapr_xive_info = {
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index c49932d2b799..fc6ef5895e6d 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -481,6 +481,7 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>  static void xive_tctx_reset(void *dev)
>  {
>      XiveTCTX *tctx = XIVE_TCTX(dev);
> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>  
>      memset(tctx->regs, 0, sizeof(tctx->regs));
>  
> @@ -495,6 +496,14 @@ static void xive_tctx_reset(void *dev)
>       */
>      tctx->regs[TM_QW1_OS + TM_PIPR] =
>          ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
> +
> +    /*
> +     * QEMU sPAPR XIVE only. To let the controller model reset the OS
> +     * CAM line with the VP identifier.
> +     */
> +    if (xrc->reset_tctx) {
> +        xrc->reset_tctx(tctx->xrtr, tctx);
> +    }

AFAICT this whole function is only used from PAPR, so you can just
move the whole thing to the papr code and avoid the hook function.

>  }
>  
>  static void xive_tctx_realize(DeviceState *dev, Error **errp)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-28  2:32     ` Benjamin Herrenschmidt
@ 2018-11-28  2:41       ` David Gibson
  2018-11-28  3:00         ` Eric Blake
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  2:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel

On Wed, Nov 28, 2018 at 01:32:21PM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2018-11-28 at 11:13 +1100, David Gibson wrote:
> > Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?
> 
> Shouldn't be a problem as long as it fits in an int, no ?

I dunno, I can never remember the rules about when C extends and when
it doesn't.  I'm not sure anybody does, which is kind of the point.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend Cédric Le Goater
@ 2018-11-28  2:57   ` David Gibson
  2018-11-28  9:35     ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  2:57 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:05AM +0100, Cédric Le Goater wrote:
> We will need to use xics_max_server_number() to create the sPAPRXive
> object modeling the interrupt controller of the machine which is
> created before the CPUs.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

My only concern here is that this moves the spapr_set_vsmt_mode()
before some of the sanity checks in spapr_init_cpus().  Are we certain
there are no edge cases that could cause badness?

> ---
>  hw/ppc/spapr.c | 10 +++++-----
>  1 file changed, 5 insertions(+), 5 deletions(-)
> 
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 7afd1a175bf2..50cb9f9f4a02 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
>          boot_cores_nr = possible_cpus->len;
>      }
>  
> -    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> -     * call xics_max_server_number() or spapr_vcpu_id().
> -     */
> -    spapr_set_vsmt_mode(spapr, &error_fatal);
> -
>      if (smc->pre_2_10_has_unused_icps) {
>          int i;
>  
> @@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
>      /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
>      load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
>  
> +    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> +     * call xics_max_server_number() or spapr_vcpu_id().
> +     */
> +    spapr_set_vsmt_mode(spapr, &error_fatal);
> +
>      /* Set up Interrupt Controller before we create the VCPUs */
>      smc->irq->init(spapr, &error_fatal);
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine Cédric Le Goater
@ 2018-11-28  2:59   ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-28  2:59 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:06AM +0100, Cédric Le Goater wrote:
> Initialize the MSI bitmap from it as this will be necessary for the
> sPAPR IRQ backend for XIVE.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/hw/ppc/spapr_irq.h |  1 +
>  hw/ppc/spapr.c             |  2 +-
>  hw/ppc/spapr_irq.c         | 16 +++++++++++-----
>  3 files changed, 13 insertions(+), 6 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index a467ce696ee4..bd7301e6d9c6 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -43,6 +43,7 @@ typedef struct sPAPRIrq {
>  extern sPAPRIrq spapr_irq_xics;
>  extern sPAPRIrq spapr_irq_xics_legacy;
>  
> +void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 50cb9f9f4a02..e470efe7993c 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
>      spapr_set_vsmt_mode(spapr, &error_fatal);
>  
>      /* Set up Interrupt Controller before we create the VCPUs */
> -    smc->irq->init(spapr, &error_fatal);
> +    spapr_irq_init(spapr, &error_fatal);
>  
>      /* Set up containers for ibm,client-architecture-support negotiated options
>       */
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index e77b94cc685e..f8b651de0ec9 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -97,11 +97,6 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, Error **errp)
>      int nr_irqs = smc->irq->nr_irqs;
>      Error *local_err = NULL;
>  
> -    /* Initialize the MSI IRQ allocator. */
> -    if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
> -        spapr_irq_msi_init(spapr, smc->irq->nr_msis);
> -    }
> -
>      if (kvm_enabled()) {
>          if (machine_kernel_irqchip_allowed(machine) &&
>              !xics_kvm_init(spapr, &local_err)) {
> @@ -213,6 +208,17 @@ sPAPRIrq spapr_irq_xics = {
>  /*
>   * sPAPR IRQ frontend routines for devices
>   */
> +void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
> +{
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +
> +    /* Initialize the MSI IRQ allocator. */
> +    if (!SPAPR_MACHINE_GET_CLASS(spapr)->legacy_irq_allocation) {
> +        spapr_irq_msi_init(spapr, smc->irq->nr_msis);
> +    }
> +
> +    smc->irq->init(spapr, errp);
> +}
>  
>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
>  {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-28  2:41       ` David Gibson
@ 2018-11-28  3:00         ` Eric Blake
  0 siblings, 0 replies; 184+ messages in thread
From: Eric Blake @ 2018-11-28  3:00 UTC (permalink / raw)
  To: David Gibson, Benjamin Herrenschmidt
  Cc: qemu-ppc, Cédric Le Goater, qemu-devel

On 11/27/18 8:41 PM, David Gibson wrote:
> On Wed, Nov 28, 2018 at 01:32:21PM +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2018-11-28 at 11:13 +1100, David Gibson wrote:
>>> Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?
>>
>> Shouldn't be a problem as long as it fits in an int, no ?
> 
> I dunno, I can never remember the rules about when C extends and when
> it doesn't.  I'm not sure anybody does, which is kind of the point.

As soon as you perform arithmetic on a type narrower than int, it is 
first promoted to at least int (possibly wider, depending on the type of 
the other operand to the binary operator).  And, since C guarantees that 
int is larger than 8 bits, 'uint8_t << 8' and 'int8_t << 8' are both 
well-defined, without needing a cast of nsr.
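
A small self-contained sketch of the rule, for illustration only. The
function name and the 'low' operand are hypothetical and are not taken
from the patch under discussion.

```c
#include <assert.h>
#include <stdint.h>

/* Combine two 8-bit values into one 16-bit word without any cast on nsr */
static uint16_t sketch_combine(uint8_t nsr, uint8_t low)
{
    /*
     * nsr is promoted to int before the shift, so the result is
     * computed in int and the top bits of nsr are not lost.
     */
    int shifted = nsr << 8;

    assert(shifted >= 0 && shifted <= 0xFF00);
    return (uint16_t)(shifted | low);
}
```

For a shift, the result has the type of the promoted left operand.
The same reasoning does not stretch to 'uint8_t << 24' on a 32-bit
int, where a value with the top bit set would no longer fit in int; a
cast to an unsigned type is the safe choice there.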

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org

* Re: [Qemu-devel] [PATCH v5 15/36] spapr: introdude a new machine IRQ backend for XIVE
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 15/36] spapr: introdude a new machine IRQ backend for XIVE Cédric Le Goater
@ 2018-11-28  3:28   ` David Gibson
  2018-11-28 17:16     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  3:28 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:08AM +0100, Cédric Le Goater wrote:
> The XIVE IRQ backend uses the same layout as the new XICS backend but
> covers the full range of the IRQ number space. The IRQ numbers for the
> CPU IPIs are allocated at the bottom of this space, below 4K, to
> preserve compatibility with XICS which does not use that range.
> 
> This should be enough given that the maximum number of CPUs is 1024
> for the sPAPR machine under QEMU. For the record, the biggest POWER8
> or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
> cores, SMT8).
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr.h     |   2 +
>  include/hw/ppc/spapr_irq.h |   7 ++-
>  hw/ppc/spapr.c             |   2 +-
>  hw/ppc/spapr_irq.c         | 119 ++++++++++++++++++++++++++++++++++++-
>  4 files changed, 124 insertions(+), 6 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 6279711fe8f7..1fbc2663e06c 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
>  typedef struct sPAPREventSource sPAPREventSource;
>  typedef struct sPAPRPendingHPT sPAPRPendingHPT;
>  typedef struct ICSState ICSState;
> +typedef struct sPAPRXive sPAPRXive;
>  
>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
>  #define SPAPR_ENTRY_POINT       0x100
> @@ -175,6 +176,7 @@ struct sPAPRMachineState {
>      const char *icp_type;
>      int32_t irq_map_nr;
>      unsigned long *irq_map;
> +    sPAPRXive  *xive;
>  
>      bool cmd_line_caps[SPAPR_CAP_NUM];
>      sPAPRCapabilities def, eff, mig;
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index 0e9229bf219e..c854ae527808 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -13,6 +13,7 @@
>  /*
>   * IRQ range offsets per device type
>   */
> +#define SPAPR_IRQ_IPI        0x0
>  #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
>  #define SPAPR_IRQ_HOTPLUG    0x1001
>  #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
> @@ -33,7 +34,8 @@ typedef struct sPAPRIrq {
>      uint32_t    nr_irqs;
>      uint32_t    nr_msis;
>  
> -    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
> +    void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
> +                 Error **errp);
>      int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
> @@ -42,8 +44,9 @@ typedef struct sPAPRIrq {
>  
>  extern sPAPRIrq spapr_irq_xics;
>  extern sPAPRIrq spapr_irq_xics_legacy;
> +extern sPAPRIrq spapr_irq_xive;
>  
> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);

I don't see why nr_servers needs to become a parameter, since it can
be derived from spapr within this routine.

>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index e470efe7993c..9f8c19e56e7a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
>      spapr_set_vsmt_mode(spapr, &error_fatal);
>  
>      /* Set up Interrupt Controller before we create the VCPUs */
> -    spapr_irq_init(spapr, &error_fatal);
> +    spapr_irq_init(spapr, xics_max_server_number(spapr), &error_fatal);

We should rename xics_max_server_number() since it's no longer xics
specific.

>      /* Set up containers for ibm,client-architecture-support negotiated options
>       */
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index bac450ffff23..2569ae1bc7f8 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -12,6 +12,7 @@
>  #include "qemu/error-report.h"
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_xive.h"
>  #include "hw/ppc/xics.h"
>  #include "sysemu/kvm.h"
>  
> @@ -91,7 +92,7 @@ error:
>  }
>  
>  static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
> -                                Error **errp)
> +                                int nr_servers, Error **errp)
>  {
>      MachineState *machine = MACHINE(spapr);
>      Error *local_err = NULL;
> @@ -204,10 +205,122 @@ sPAPRIrq spapr_irq_xics = {
>      .print_info  = spapr_irq_print_info_xics,
>  };
>  
> + /*
> + * XIVE IRQ backend.
> + */
> +static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
> +                                    const char *type_xive, int nr_irqs,
> +                                    int nr_servers, Error **errp)
> +{
> +    sPAPRXive *xive;
> +    Error *local_err = NULL;
> +    Object *obj;
> +    uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
> +    int i;
> +
> +    obj = object_new(type_xive);

What's the reason for making the type a parameter, rather than just
using the #define here?

> +    object_property_set_int(obj, nr_irqs, "nr-irqs", &error_abort);
> +    object_property_set_int(obj, nr_ends, "nr-ends", &error_abort);

This is still within the sPAPR code, and you have a pointer to the
MachineState, so I don't see why you couldn't just derive nr_irqs and
nr_servers from that, rather than having them passed in.

> +    object_property_set_bool(obj, true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());

Whereas the XiveSource and XiveRouter I think make more sense as
"device components" rather than SysBusDevice subclasses, I think it
*does* make sense for the PAPR-XIVE object to be a full-fledged
SysBusDevice.

And for that reason, I think it makes more sense to create it with
qdev_create(), which should avoid having to manually fiddle with the
parent bus.

> +    xive = SPAPR_XIVE(obj);
> +
> +    /* Enable the CPU IPIs */
> +    for (i = 0; i < nr_servers; ++i) {
> +        spapr_xive_irq_enable(xive, SPAPR_IRQ_IPI + i, false);

This comment possibly belonged on an earlier patch.  I don't love the
"..._enable" name - to me that suggests something runtime rather than
configuration time.  A better option isn't quickly occurring to me
though :/.

> +    }
> +
> +    return xive;
> +}
> +
> +static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
> +                                int nr_servers, Error **errp)
> +{
> +    MachineState *machine = MACHINE(spapr);
> +    Error *local_err = NULL;
> +
> +    /* KVM XIVE support */
> +    if (kvm_enabled()) {
> +        if (machine_kernel_irqchip_required(machine)) {
> +            error_setg(errp, "kernel_irqchip requested. no XIVE support");
> +            return;
> +        }
> +    }
> +
> +    /* QEMU XIVE support */
> +    spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, nr_servers,
> +                                    &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +}
> +
> +static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
> +                                Error **errp)
> +{
> +    if (!spapr_xive_irq_enable(spapr->xive, irq, lsi)) {
> +        error_setg(errp, "IRQ %d is invalid", irq);
> +        return -1;
> +    }
> +    return 0;
> +}
> +
> +static void spapr_irq_free_xive(sPAPRMachineState *spapr, int irq, int num)
> +{
> +    int i;
> +
> +    for (i = irq; i < irq + num; ++i) {
> +        spapr_xive_irq_disable(spapr->xive, i);
> +    }
> +}
> +
> +static qemu_irq spapr_qirq_xive(sPAPRMachineState *spapr, int irq)
> +{
> +    return spapr_xive_qirq(spapr->xive, irq);
> +}
> +
> +static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
> +                                      Monitor *mon)
> +{
> +    CPUState *cs;
> +
> +    CPU_FOREACH(cs) {
> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
> +    }
> +
> +    spapr_xive_pic_print_info(spapr->xive, mon);

Any reason the info dumping routines are split into two?

> +}
> +
> +/*
> + * XIVE uses the full IRQ number space. Set it to 8K to be compatible
> + * with XICS.
> + */
> +
> +#define SPAPR_IRQ_XIVE_NR_IRQS     0x2000
> +#define SPAPR_IRQ_XIVE_NR_MSIS     (SPAPR_IRQ_XIVE_NR_IRQS - SPAPR_IRQ_MSI)
> +
> +sPAPRIrq spapr_irq_xive = {
> +    .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
> +    .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
> +
> +    .init        = spapr_irq_init_xive,
> +    .claim       = spapr_irq_claim_xive,
> +    .free        = spapr_irq_free_xive,
> +    .qirq        = spapr_qirq_xive,
> +    .print_info  = spapr_irq_print_info_xive,
> +};
> +
>  /*
>   * sPAPR IRQ frontend routines for devices
>   */
> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp)
>  {
>      sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>  
> @@ -216,7 +329,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
>          spapr_irq_msi_init(spapr, smc->irq->nr_msis);
>      }
>  
> -    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
> +    smc->irq->init(spapr, smc->irq->nr_irqs, nr_servers, errp);
>  }
>  
>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
@ 2018-11-28  4:25   ` David Gibson
  2018-11-28 22:21     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  4:25 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
> The different XIVE virtualization structures (sources and event queues)
> are configured with a set of Hypervisor calls :
> 
>  - H_INT_GET_SOURCE_INFO
> 
>    used to obtain the address of the MMIO page of the Event State
>    Buffer (ESB) entry associated with the source.
> 
>  - H_INT_SET_SOURCE_CONFIG
> 
>    assigns a source to a "target".
> 
>  - H_INT_GET_SOURCE_CONFIG
> 
>    determines which "target" and "priority" is assigned to a source
> 
>  - H_INT_GET_QUEUE_INFO
> 
>    returns the address of the notification management page associated
>    with the specified "target" and "priority".
> 
>  - H_INT_SET_QUEUE_CONFIG
> 
>    sets or resets the event queue for a given "target" and "priority".
>    It is also used to set the notification configuration associated
>    with the queue, only unconditional notification is supported for
>    the moment. Reset is performed with a queue size of 0 and queueing
>    is disabled in that case.
> 
>  - H_INT_GET_QUEUE_CONFIG
> 
>    returns the queue settings for a given "target" and "priority".
> 
>  - H_INT_RESET
> 
>    resets all of the guest's internal interrupt structures to their
>    initial state, losing all configuration set via the hcalls
>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> 
>  - H_INT_SYNC
> 
>    issue a synchronisation on a source to make sure all notifications
>    have reached their queue.
> 
> Calls that still need to be addressed :
> 
>    H_INT_SET_OS_REPORTING_LINE
>    H_INT_GET_OS_REPORTING_LINE
> 
> See the code for more documentation on each hcall.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr.h      |  15 +-
>  include/hw/ppc/spapr_xive.h |   6 +
>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr_irq.c          |   2 +
>  hw/intc/Makefile.objs       |   2 +-
>  5 files changed, 915 insertions(+), 2 deletions(-)
>  create mode 100644 hw/intc/spapr_xive_hcall.c
> 
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 1fbc2663e06c..8415faea7b82 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
>  #define H_INVALIDATE_PID        0x378
>  #define H_REGISTER_PROC_TBL     0x37C
>  #define H_SIGNAL_SYS_RESET      0x380
> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
> +
> +#define H_INT_GET_SOURCE_INFO   0x3A8
> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
> +#define H_INT_GET_QUEUE_INFO    0x3B4
> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
> +#define H_INT_ESB               0x3C8
> +#define H_INT_SYNC              0x3CC
> +#define H_INT_RESET             0x3D0
> +
> +#define MAX_HCALL_OPCODE        H_INT_RESET
>  
>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>   * as well.
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 3f65b8f485fd..418511f3dc10 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
>  
> +bool spapr_xive_priority_is_valid(uint8_t priority);

AFAICT this could be a local function.

> +
> +typedef struct sPAPRMachineState sPAPRMachineState;
> +
> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
> +
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
> new file mode 100644
> index 000000000000..52e4e23995f5
> --- /dev/null
> +++ b/hw/intc/spapr_xive_hcall.c
> @@ -0,0 +1,892 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "cpu.h"
> +#include "hw/ppc/fdt.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/xive_regs.h"
> +#include "monitor/monitor.h"

Fwiw, I don't think it's particularly necessary to split the hcall
handling out into a separate .c file.

> +/*
> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
> + * available for the guest.

Referencing OPAL behaviour doesn't really make sense in the context of
PAPR.  What I think you're getting at is that the PAPR spec only
allows a PAPR guest to use priorities 0..6 (or at least it will if the
updated XIVE spec ever gets published).  The fact that this allows the
host to use 7 for escalations is a design rationale but not really
relevant to the guest device itself.
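
A minimal sketch of how the check could look as a file-local helper
that only refers to the PAPR limit.  The macro name is made up for
illustration and is not an existing QEMU definition:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Highest priority a PAPR guest may use (0..6).  Hypothetical name,
 * introduced only for this sketch.
 */
#define SPAPR_XIVE_GUEST_PRIO_MAX 6

/* File-local: nothing outside the hcall code needs this check. */
static bool spapr_xive_priority_is_valid(uint8_t priority)
{
    return priority <= SPAPR_XIVE_GUEST_PRIO_MAX;
}
```

With this shape, priority 7 and the 0xff "reset" value are both
rejected by the same comparison, without a per-case comment about OPAL.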

> + */
> +bool spapr_xive_priority_is_valid(uint8_t priority)
> +{
> +    switch (priority) {
> +    case 0 ... 6:
> +        return true;
> +    case 7: /* OPAL escalation queue */
> +    default:
> +        return false;
> +    }
> +}
> +
> +/*
> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
> + * real address of the MMIO page through which the Event State Buffer
> + * entry associated with the value of the "lisn" parameter is managed.
> + *
> + * Parameters:
> + * Input
> + * - "flags"
> + *       Bits 0-63 reserved
> + * - "lisn" is per "interrupts", "interrupt-map", or
> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
> + *       ibm,query-interrupt-source-number RTAS call, or as returned
> + *       by the H_ALLOCATE_VAS_WINDOW hcall

I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
to implement in kvm/qemu, or is it only of interest for PowerVM?

Also, putting the register numbers on the inputs as well as the
outputs would be helpful.

> + *
> + * Output
> + * - R4: "flags"
> + *       Bits 0-59: Reserved
> + *       Bit 60: H_INT_ESB must be used for Event State Buffer
> + *               management
> + *       Bit 61: 1 == LSI  0 == MSI
> + *       Bit 62: the full function page supports trigger
> + *       Bit 63: Store EOI Supported
> + * - R5: Logical Real address of full function Event State Buffer
> + *       management page, -1 if ESB hcall flag is set to 1.

You've defined what H_INT_ESB means above, so it will be clearer if
you reference that by name here.

> + * - R6: Logical Real Address of trigger only Event State Buffer
> + *       management page or -1.
> + * - R7: Power of 2 page size for the ESB management pages returned in
> + *       R5 and R6.
> + */
> +
> +#define SPAPR_XIVE_SRC_H_INT_ESB     PPC_BIT(60) /* ESB manage with H_INT_ESB */
> +#define SPAPR_XIVE_SRC_LSI           PPC_BIT(61) /* Virtual LSI type */
> +#define SPAPR_XIVE_SRC_TRIGGER       PPC_BIT(62) /* Trigger and management
> +                                                    on same page */
> +#define SPAPR_XIVE_SRC_STORE_EOI     PPC_BIT(63) /* Store EOI support */

Probably makes sense to put these #defines in spapr.h since they form
part of the PAPR interface definition.
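
For readers less used to IBM bit numbering, here is a small standalone
sketch of what these flags evaluate to.  The PPC_BIT() body below is my
assumption of how the macro is defined (bit 0 is the most significant
bit of a 64-bit word), and spapr_xive_src_flags() is a hypothetical
helper that only mirrors the flag logic of the hcall quoted below:

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed definition: IBM numbering, bit 0 is the MSB of 64 bits. */
#define PPC_BIT(bit)  (0x8000000000000000ULL >> (bit))

#define SPAPR_XIVE_SRC_H_INT_ESB  PPC_BIT(60)  /* 0x8 */
#define SPAPR_XIVE_SRC_LSI        PPC_BIT(61)  /* 0x4 */
#define SPAPR_XIVE_SRC_TRIGGER    PPC_BIT(62)  /* 0x2 */
#define SPAPR_XIVE_SRC_STORE_EOI  PPC_BIT(63)  /* 0x1 */

/* Hypothetical helper: build the R4 flags for one source. */
static uint64_t spapr_xive_src_flags(bool is_lsi, bool has_2page,
                                     bool store_eoi)
{
    uint64_t flags = 0;

    if (!has_2page) {
        /* A single page serves both trigger and management. */
        flags |= SPAPR_XIVE_SRC_TRIGGER;
    }
    if (store_eoi) {
        flags |= SPAPR_XIVE_SRC_STORE_EOI;
    }
    if (is_lsi) {
        /* LSIs are forced through the H_INT_ESB hcall. */
        flags |= SPAPR_XIVE_SRC_H_INT_ESB | SPAPR_XIVE_SRC_LSI;
    }
    return flags;
}
```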

> +static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
> +                                          sPAPRMachineState *spapr,
> +                                          target_ulong opcode,
> +                                          target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveSource *xsrc = &xive->source;
> +    XiveEAS eas;
> +    target_ulong flags  = args[0];
> +    target_ulong lisn   = args[1];
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
> +        return H_P2;
> +    }
> +
> +    if (!(eas.w & EAS_VALID)) {
> +        return H_P2;
> +    }
> +
> +    /* All sources are emulated under the main XIVE object and share
> +     * the same characteristics.
> +     */
> +    args[0] = 0;
> +    if (!xive_source_esb_has_2page(xsrc)) {
> +        args[0] |= SPAPR_XIVE_SRC_TRIGGER;
> +    }
> +    if (xsrc->esb_flags & XIVE_SRC_STORE_EOI) {
> +        args[0] |= SPAPR_XIVE_SRC_STORE_EOI;
> +    }
> +
> +    /*
> +     * Force the use of the H_INT_ESB hcall in case of an LSI
> +     * interrupt. This is necessary under KVM to re-trigger the
> +     * interrupt if the level is still asserted
> +     */
> +    if (xive_source_irq_is_lsi(xsrc, lisn)) {
> +        args[0] |= SPAPR_XIVE_SRC_H_INT_ESB | SPAPR_XIVE_SRC_LSI;
> +    }
> +
> +    if (!(args[0] & SPAPR_XIVE_SRC_H_INT_ESB)) {
> +        args[1] = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn);
> +    } else {
> +        args[1] = -1;
> +    }
> +
> +    if (xive_source_esb_has_2page(xsrc)) {
> +        args[2] = xive->vc_base + xive_source_esb_page(xsrc, lisn);
> +    } else {
> +        args[2] = -1;
> +    }

Do we also need to keep this address clear in the H_INT_ESB case?

> +    args[3] = TARGET_PAGE_SIZE;

That seems wrong.  TARGET_PAGE_SIZE is generally 4KiB, but won't these
usually actually be 64KiB?
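
A rough sketch of one way to derive the value from the source itself
instead of TARGET_PAGE_SIZE.  It assumes R7 is meant to carry the log2
of the page size (which is how I read "Power of 2 page size") and that
the source's shift covers both pages in the two-page layout.  The
constant and function names are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-source window shifts, for illustration only. */
#define ESB_SHIFT_64K        16  /* one 64 KiB page per source  */
#define ESB_SHIFT_64K_2PAGE  17  /* two 64 KiB pages per source */

/* log2 of the size of a single ESB page for this source layout. */
static uint32_t esb_page_shift(uint32_t esb_shift, bool has_2page)
{
    return has_2page ? esb_shift - 1 : esb_shift;
}
```

Either layout then reports 16, i.e. a 64 KiB page, regardless of the
target's MMU page size.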

> +
> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
> + * Interrupt Source to a target. The Logical Interrupt Source is
> + * designated with the "lisn" parameter and the target is designated
> + * with the "target" and "priority" parameters.  Upon return from the
> + * hcall(), no additional interrupts will be directed to the old EQ.
> + *
> + * TODO: The old EQ should be investigated for interrupts that
> + * occurred prior to or during the hcall().

Isn't that the responsibility of the guest?

> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-61: Reserved
> + *      Bit 62: set the "eisn" in the EA

What's the "EA"?  Do you mean the EAS?

> + *      Bit 63: masks the interrupt source in the hardware interrupt
> + *      control structure. An interrupt masked by this mechanism will
> + *      be dropped, but its source state bits will still be
> + *      set. There is no race-free way of unmasking and restoring the
> + *      source. Thus this should only be used in interrupts that are
> + *      also masked at the source, and only in cases where the
> + *      interrupt is not meant to be used for a large amount of time
> + *      because no valid target exists for it for example
> + * - "lisn" is per "interrupts", "interrupt-map", or
> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> + *      ibm,query-interrupt-source-number RTAS call, or as returned by
> + *      the H_ALLOCATE_VAS_WINDOW hcall
> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> + *      "ibm,ppc-interrupt-gserver#s"
> + * - "priority" is a valid priority not in
> + *      "ibm,plat-res-int-priorities"
> + * - "eisn" is the guest EISN associated with the "lisn"

I don't think the EISN term has been used before in the series.  I'm
guessing this is the guest-assigned global interrupt number?

> + *
> + * Output:
> + * - None
> + */
> +
> +#define SPAPR_XIVE_SRC_SET_EISN PPC_BIT(62)
> +#define SPAPR_XIVE_SRC_MASK     PPC_BIT(63)
> +
> +static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
> +                                            sPAPRMachineState *spapr,
> +                                            target_ulong opcode,
> +                                            target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +    XiveEAS eas, new_eas;
> +    target_ulong flags    = args[0];
> +    target_ulong lisn     = args[1];
> +    target_ulong target   = args[2];
> +    target_ulong priority = args[3];
> +    target_ulong eisn     = args[4];
> +    uint8_t end_blk;
> +    uint32_t end_idx;
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags & ~(SPAPR_XIVE_SRC_SET_EISN | SPAPR_XIVE_SRC_MASK)) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
> +        return H_P2;
> +    }
> +
> +    if (!(eas.w & EAS_VALID)) {
> +        return H_P2;
> +    }
> +
> +    /* priority 0xff is used to reset the EAS */
> +    if (priority == 0xff) {
> +        new_eas.w = EAS_VALID | EAS_MASKED;
> +        goto out;
> +    }
> +
> +    if (flags & SPAPR_XIVE_SRC_MASK) {
> +        new_eas.w = eas.w | EAS_MASKED;
> +    } else {
> +        new_eas.w = eas.w & ~EAS_MASKED;
> +    }
> +
> +    if (!spapr_xive_priority_is_valid(priority)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> +                      priority);
> +        return H_P4;
> +    }
> +
> +    /* Validate that "target" is part of the list of threads allocated
> +     * to the partition. For that, find the END corresponding to the
> +     * target.
> +     */
> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> +        return H_P3;
> +    }
> +
> +    new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
> +    new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
> +
> +    if (flags & SPAPR_XIVE_SRC_SET_EISN) {
> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
> +    }
> +
> +out:
> +    if (xive_router_set_eas(xrtr, lisn, &new_eas)) {
> +        return H_HARDWARE;
> +    }

As noted earlier in the series, the spapr specific code owns the
memory backing the EAT, so you can just access it directly rather than
using a method here.

> +
> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
> + * target/priority pair is assigned to the specified Logical Interrupt
> + * Source.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-63 Reserved
> + * - "lisn" is per "interrupts", "interrupt-map", or
> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> + *      ibm,query-interrupt-source-number RTAS call, or as
> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
> + *
> + * Output:
> + * - R4: Target to which the specified Logical Interrupt Source is
> + *       assigned
> + * - R5: Priority to which the specified Logical Interrupt Source is
> + *       assigned
> + * - R6: EISN for the specified Logical Interrupt Source (this will be
> + *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
> + */
> +static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
> +                                            sPAPRMachineState *spapr,
> +                                            target_ulong opcode,
> +                                            target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +    target_ulong flags = args[0];
> +    target_ulong lisn = args[1];
> +    XiveEAS eas;
> +    XiveEND end;
> +    uint8_t end_blk, nvt_blk;
> +    uint32_t end_idx, nvt_idx;
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
> +        return H_P2;
> +    }
> +
> +    if (!(eas.w & EAS_VALID)) {
> +        return H_P2;
> +    }
> +
> +    end_blk = GETFIELD(EAS_END_BLOCK, eas.w);
> +    end_idx = GETFIELD(EAS_END_INDEX, eas.w);
> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> +        /* Not sure what to return here */
> +        return H_HARDWARE;

IIUC this indicates a bug in the PAPR specific code, not the guest, so
an assert() is probably the right answer.

> +    }
> +
> +    nvt_blk = GETFIELD(END_W6_NVT_BLOCK, end.w6);
> +    nvt_idx = GETFIELD(END_W6_NVT_INDEX, end.w6);
> +    args[0] = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);

AIUI there's a specific END for each target & priority, so you could
avoid this second level lookup, although I guess this might be
valuable if we do more complicated internal routing in the future.

> +    if (eas.w & EAS_MASKED) {
> +        args[1] = 0xff;
> +    } else {
> +        args[1] = GETFIELD(END_W7_F0_PRIORITY, end.w7);
> +    }
> +
> +    args[2] = GETFIELD(EAS_END_DATA, eas.w);
> +
> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
> + * address of the notification management page associated with the
> + * specified target and priority.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *       Bits 0-63 Reserved
> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> + *       "ibm,ppc-interrupt-gserver#s"
> + * - "priority" is a valid priority not in
> + *       "ibm,plat-res-int-priorities"
> + *
> + * Output:
> + * - R4: Logical real address of notification page
> + * - R5: Power of 2 page size of the notification page
> + */
> +static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
> +                                         sPAPRMachineState *spapr,
> +                                         target_ulong opcode,
> +                                         target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveENDSource *end_xsrc = &xive->end_source;
> +    target_ulong flags = args[0];
> +    target_ulong target = args[1];
> +    target_ulong priority = args[2];
> +    XiveEND end;
> +    uint8_t end_blk;
> +    uint32_t end_idx;
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags) {
> +        return H_PARAMETER;
> +    }
> +
> +    /*
> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> +     * This is not needed when running the emulation under QEMU
> +     */
> +
> +    if (!spapr_xive_priority_is_valid(priority)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> +                      priority);
> +        return H_P3;
> +    }
> +
> +    /* Validate that "target" is part of the list of threads allocated
> +     * to the partition. For that, find the END corresponding to the
> +     * target.
> +     */
> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> +        return H_P2;
> +    }
> +
> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
> +        return H_HARDWARE;
> +    }
> +
> +    args[0] = xive->end_base + (1ull << (end_xsrc->esb_shift + 1)) * end_idx;
> +    if (end.w0 & END_W0_ENQUEUE) {
> +        args[1] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
> +    } else {
> +        args[1] = 0;
> +    }
> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset a EQ for
> + * a given "target" and "priority".  It is also used to set the
> + * notification config associated with the EQ.  An EQ size of 0 is
> + * used to reset the EQ config for a given target and priority. If
> + * resetting the EQ config, the END associated with the given "target"
> + * and "priority" will be changed to disable queueing.
> + *
> + * Upon return from the hcall(), no additional interrupts will be
> + * directed to the old EQ (if one was set). The old EQ (if one was
> + * set) should be investigated for interrupts that occurred prior to
> + * or during the hcall().
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-62: Reserved
> + *      Bit 63: Unconditional Notify (n) per the XIVE spec
> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> + *       "ibm,ppc-interrupt-gserver#s"
> + * - "priority" is a valid priority not in
> + *       "ibm,plat-res-int-priorities"
> + * - "eventQueue": The logical real address of the start of the EQ
> + * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
> + *
> + * Output:
> + * - None
> + */
> +
> +#define SPAPR_XIVE_END_ALWAYS_NOTIFY PPC_BIT(63)
> +
> +static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
> +                                           sPAPRMachineState *spapr,
> +                                           target_ulong opcode,
> +                                           target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +    target_ulong flags = args[0];
> +    target_ulong target = args[1];
> +    target_ulong priority = args[2];
> +    target_ulong qpage = args[3];
> +    target_ulong qsize = args[4];
> +    XiveEND end;
> +    uint8_t end_blk, nvt_blk;
> +    uint32_t end_idx, nvt_idx;
> +    uint32_t qdata;
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags & ~SPAPR_XIVE_END_ALWAYS_NOTIFY) {
> +        return H_PARAMETER;
> +    }
> +
> +    /*
> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> +     * This is not needed when running the emulation under QEMU
> +     */
> +
> +    if (!spapr_xive_priority_is_valid(priority)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> +                      priority);
> +        return H_P3;
> +    }
> +
> +    /* Validate that "target" is part of the list of threads allocated
> +     * to the partition. For that, find the END corresponding to the
> +     * target.
> +     */
> +
> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> +        return H_P2;
> +    }
> +
> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> +        return H_HARDWARE;

Again, I think this indicates a qemu (spapr) code bug, so could be an assert().

> +    }
> +
> +    switch (qsize) {
> +    case 12:
> +    case 16:
> +    case 21:
> +    case 24:
> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;

It just occurred to me that I haven't been looking for this across any
of these reviews.  Don't you need byteswaps when accessing these
in-memory structures?
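
To make the question concrete, here is a standalone sketch of storing
the queue address with explicit conversions, assuming the END words are
kept big-endian in memory.  The *_sketch helpers are stand-ins written
for this example (in QEMU proper one would use the existing byteswap
helpers), and struct end_sketch only models the two words involved:

```c
#include <stdint.h>
#include <string.h>

/* Stand-in: host value to big-endian storage, host-endian agnostic. */
static uint32_t cpu_to_be32_sketch(uint32_t v)
{
    uint8_t b[4] = {
        (uint8_t)(v >> 24), (uint8_t)(v >> 16),
        (uint8_t)(v >> 8),  (uint8_t)v,
    };
    uint32_t out;

    memcpy(&out, b, sizeof(out));
    return out;
}

/* Stand-in: big-endian storage back to a host value. */
static uint32_t be32_to_cpu_sketch(uint32_t v)
{
    uint8_t b[4];

    memcpy(b, &v, sizeof(b));
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  | (uint32_t)b[3];
}

/* Only the two END words that hold the queue address. */
struct end_sketch {
    uint32_t w2;  /* upper 28 bits of the EQ address */
    uint32_t w3;  /* lower 32 bits of the EQ address */
};

static void end_set_qpage(struct end_sketch *end, uint64_t qpage)
{
    end->w2 = cpu_to_be32_sketch((uint32_t)(qpage >> 32) & 0x0fffffff);
    end->w3 = cpu_to_be32_sketch((uint32_t)(qpage & 0xffffffff));
}

static uint64_t end_get_qpage(const struct end_sketch *end)
{
    uint64_t hi = be32_to_cpu_sketch(end->w2) & 0x0fffffff;

    return (hi << 32) | be32_to_cpu_sketch(end->w3);
}
```

The setter and getter then round-trip on either host endianness, and
the bytes in memory have the same layout in both cases.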

> +        end.w2 = (((uint64_t)qpage)) >> 32 & 0x0fffffff;
> +        end.w0 |= END_W0_ENQUEUE;
> +        end.w0 = SETFIELD(END_W0_QSIZE, end.w0, qsize - 12);
> +        break;
> +    case 0:
> +        /* reset queue and disable queueing */
> +        xive_end_reset(&end);
> +        goto out;
> +
> +    default:
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
> +                      qsize);
> +        return H_P5;
> +    }
> +
> +    if (qsize) {
> +        /*
> +         * Let's validate the EQ address with a read of the first EQ
> +         * entry. We could also check that the full queue has been
> +         * zeroed by the OS.
> +         */
> +        if (address_space_read(&address_space_memory, qpage,
> +                               MEMTXATTRS_UNSPECIFIED,
> +                               (uint8_t *) &qdata, sizeof(qdata))) {
> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
> +                          HWADDR_PRIx "\n", qpage);
> +            return H_P4;

Just checking the first entry doesn't seem entirely safe.  Using
address_space_map() and making sure the returned plen doesn't get
reduced below the queue size might be a better option.
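
The idea, sketched against a toy memory model so that it compiles on
its own.  toy_map() is a hypothetical stand-in for a mapping call that
may shrink the requested length, which is the behaviour I'm relying on
from address_space_map(); real code would also have to unmap the region
afterwards:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy guest RAM: 128 KiB starting at guest address 0. */
#define TOY_RAM_SIZE 0x20000ULL
static uint8_t toy_ram[TOY_RAM_SIZE];

/*
 * Hypothetical mapping call: returns a host pointer for 'addr' and
 * reduces *plen to the contiguous length actually backed by RAM.
 */
static void *toy_map(uint64_t addr, uint64_t *plen)
{
    if (addr >= TOY_RAM_SIZE) {
        *plen = 0;
        return NULL;
    }
    if (*plen > TOY_RAM_SIZE - addr) {
        *plen = TOY_RAM_SIZE - addr;
    }
    return &toy_ram[addr];
}

/* True only if the whole 2^qshift byte queue is backed by memory. */
static bool eq_is_fully_backed(uint64_t qpage, unsigned int qshift)
{
    uint64_t want = 1ULL << qshift;
    uint64_t plen = want;
    void *p = toy_map(qpage, &plen);

    return p != NULL && plen == want;
}
```

A queue whose first entry is readable but whose tail runs off the end
of RAM is rejected here, which a read of the first entry alone would
not catch.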

> +        }
> +    }
> +
> +    if (spapr_xive_target_to_nvt(xive, target, &nvt_blk, &nvt_idx)) {
> +        return H_HARDWARE;

That could be caused by a bogus 'target' value, couldn't it?  In which
case it a) should probably be checked earlier and b) should be
H_PARAMETER or similar, not H_HARDWARE, yes?

> +    }
> +
> +    /* Ensure the priority and target are correctly set (they will not
> +     * be right after allocation)

AIUI there's a static association from END to target in the PAPR
model.  So it seems to make more sense to get that set up right at
initialization / reset, rather than doing it lazily when the queue is configured.

> +     */
> +    end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
> +        SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
> +    end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, priority);
> +
> +    if (flags & SPAPR_XIVE_END_ALWAYS_NOTIFY) {
> +        end.w0 |= END_W0_UCOND_NOTIFY;
> +    } else {
> +        end.w0 &= ~END_W0_UCOND_NOTIFY;
> +    }
> +
> +    /* The generation bit for the END starts at 1 and The END page
> +     * offset counter starts at 0.
> +     */
> +    end.w1 = END_W1_GENERATION | SETFIELD(END_W1_PAGE_OFF, 0ul, 0ul);
> +    end.w0 |= END_W0_VALID;
> +
> +    /* TODO: issue syncs required to ensure all in-flight interrupts
> +     * are complete on the old END */
> +out:
> +    /* Update END */
> +    if (xive_router_set_end(xrtr, end_blk, end_idx, &end)) {
> +        return H_HARDWARE;
> +    }

Again the PAPR code owns the ENDs, so it can update them directly
rather than going through an abstraction.

> +
> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_GET_QUEUE_CONFIG hcall() is used to get a EQ for a given
> + * target and priority.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-62: Reserved
> + *      Bit 63: Debug: Return debug data
> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> + *       "ibm,ppc-interrupt-gserver#s"
> + * - "priority" is a valid priority not in
> + *       "ibm,plat-res-int-priorities"
> + *
> + * Output:
> + * - R4: "flags":
> + *       Bits 0-61: Reserved
> + *       Bit 62: The value of Event Queue Generation Number (g) per
> + *              the XIVE spec if "Debug" = 1
> + *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
> + * - R5: The logical real address of the start of the EQ
> + * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
> + * - R7: The value of Event Queue Offset Counter per XIVE spec
> + *       if "Debug" = 1, else 0
> + *
> + */
> +
> +#define SPAPR_XIVE_END_DEBUG     PPC_BIT(63)
> +
> +static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
> +                                           sPAPRMachineState *spapr,
> +                                           target_ulong opcode,
> +                                           target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    target_ulong flags = args[0];
> +    target_ulong target = args[1];
> +    target_ulong priority = args[2];
> +    XiveEND end;
> +    uint8_t end_blk;
> +    uint32_t end_idx;
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags & ~SPAPR_XIVE_END_DEBUG) {
> +        return H_PARAMETER;
> +    }
> +
> +    /*
> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> +     * This is not needed when running the emulation under QEMU
> +     */
> +
> +    if (!spapr_xive_priority_is_valid(priority)) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> +                      priority);
> +        return H_P3;
> +    }
> +
> +    /* Validate that "target" is part of the list of threads allocated
> +     * to the partition. For that, find the END corresponding to the
> +     * target.
> +     */
> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> +        return H_P2;
> +    }
> +
> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
> +        return H_HARDWARE;

Again, assert() seems appropriate here.

> +    }
> +
> +    args[0] = 0;
> +    if (end.w0 & END_W0_UCOND_NOTIFY) {
> +        args[0] |= SPAPR_XIVE_END_ALWAYS_NOTIFY;
> +    }
> +
> +    if (end.w0 & END_W0_ENQUEUE) {
> +        args[1] =
> +            (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
> +        args[2] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
> +    } else {
> +        args[1] = 0;
> +        args[2] = 0;
> +    }
> +
> +    /* TODO: do we need any locking on the END ? */
> +    if (flags & SPAPR_XIVE_END_DEBUG) {
> +        /* Load the event queue generation number into the return flags */
> +        args[0] |= (uint64_t)GETFIELD(END_W1_GENERATION, end.w1) << 62;
> +
> +        /* Load R7 with the event queue offset counter */
> +        args[3] = GETFIELD(END_W1_PAGE_OFF, end.w1);
> +    } else {
> +        args[3] = 0;
> +    }
> +
> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
> + * reporting cache line pair for the calling thread.  The reporting
> + * cache lines will contain the OS interrupt context when the OS
> + * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
> + * interrupt. The reporting cache lines can be reset by inputting -1
> + * in "reportingLine".  Issuing the CI store byte without reporting
> + * cache lines registered will result in the data not being accessible
> + * to the OS.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-63: Reserved
> + * - "reportingLine": The logical real address of the reporting cache
> + *    line pair
> + *
> + * Output:
> + * - None
> + */
> +static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
> +                                                sPAPRMachineState *spapr,
> +                                                target_ulong opcode,
> +                                                target_ulong *args)
> +{
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    /*
> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> +     * This is not needed when running the emulation under QEMU
> +     */
> +
> +    /* TODO: H_INT_SET_OS_REPORTING_LINE */
> +    return H_FUNCTION;
> +}
> +
> +/*
> + * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
> + * real address of the reporting cache line pair set for the input
> + * "target".  If no reporting cache line pair has been set, -1 is
> + * returned.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-63: Reserved
> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> + *       "ibm,ppc-interrupt-gserver#s"
> + * - "reportingLine": The logical real address of the reporting cache
> + *   line pair
> + *
> + * Output:
> + * - R4: The logical real address of the reporting line if set, else -1
> + */
> +static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
> +                                                sPAPRMachineState *spapr,
> +                                                target_ulong opcode,
> +                                                target_ulong *args)
> +{
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    /*
> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> +     * This is not needed when running the emulation under QEMU
> +     */
> +
> +    /* TODO: H_INT_GET_OS_REPORTING_LINE */
> +    return H_FUNCTION;
> +}
> +
> +/*
> + * The H_INT_ESB hcall() is used to issue a load or store to the ESB
> + * page for the input "lisn".  This hcall is only supported for LISNs
> + * that have the ESB hcall flag set to 1 when returned from hcall()
> + * H_INT_GET_SOURCE_INFO.

Is there a reason for specifically restricting this to LISNs which
advertise it, rather than allowing it for anything?  Obviously using
the direct MMIOs will generally be a faster option when possible, but
I could see occasions where it might be simpler for the guest to
always use H_INT_ESB (e.g. for micro-guests like kvm-unit-tests).

> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-62: Reserved
> + *      bit 63: Store: Store=1, store operation, else load operation
> + * - "lisn" is per "interrupts", "interrupt-map", or
> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> + *      ibm,query-interrupt-source-number RTAS call, or as
> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
> + * - "esbOffset" is the offset into the ESB page for the load or store operation
> + * - "storeData" is the data to write for a store operation
> + *
> + * Output:
> + * - R4: The value of the load if load operation, else -1
> + */
> +
> +#define SPAPR_XIVE_ESB_STORE PPC_BIT(63)
> +
> +static target_ulong h_int_esb(PowerPCCPU *cpu,
> +                              sPAPRMachineState *spapr,
> +                              target_ulong opcode,
> +                              target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveEAS eas;
> +    target_ulong flags  = args[0];
> +    target_ulong lisn   = args[1];
> +    target_ulong offset = args[2];
> +    target_ulong data   = args[3];
> +    hwaddr mmio_addr;
> +    XiveSource *xsrc = &xive->source;
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags & ~SPAPR_XIVE_ESB_STORE) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
> +        return H_P2;
> +    }
> +
> +    if (!(eas.w & EAS_VALID)) {
> +        return H_P2;
> +    }
> +
> +    if (offset > (1ull << xsrc->esb_shift)) {
> +        return H_P3;
> +    }
> +
> +    mmio_addr = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn) + offset;
> +
> +    if (dma_memory_rw(&address_space_memory, mmio_addr, &data, 8,
> +                      (flags & SPAPR_XIVE_ESB_STORE))) {
> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
> +                      HWADDR_PRIx "\n", mmio_addr);
> +        return H_HARDWARE;
> +    }
> +    args[0] = (flags & SPAPR_XIVE_ESB_STORE) ? -1 : data;
> +    return H_SUCCESS;
> +}
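
For reference, the argument checks in h_int_esb() can be exercised
outside QEMU.  The following is a minimal standalone sketch, not the
patch's code: PPC_BIT follows the IBM bit numbering the patch relies
on (bit 0 is the most significant bit), while esb_check_args(),
esb_is_store() and the esb_status enum are hypothetical names made up
for illustration.  The sketch assumes the fixed 8-byte access used by
the dma_memory_rw() call above and rejects any offset whose access
would run past the end of the ESB page, which is stricter than the
"offset > (1ull << xsrc->esb_shift)" test in the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB of a 64-bit word, bit 63 the LSB. */
#define PPC_BIT(bit)   (0x8000000000000000ULL >> (bit))
#define ESB_STORE      PPC_BIT(63)

/* Hypothetical status codes standing in for H_PARAMETER / H_P3. */
enum esb_status { ESB_OK, ESB_BAD_FLAGS, ESB_BAD_OFFSET };

/* Validate the "flags" and "esbOffset" inputs of an H_INT_ESB-like call. */
static enum esb_status esb_check_args(uint64_t flags, uint64_t offset,
                                      unsigned esb_shift)
{
    const uint64_t access_size = 8;              /* assumption: 8-byte access */
    uint64_t page_size;

    assert(esb_shift >= 3 && esb_shift < 64);
    page_size = 1ULL << esb_shift;

    if (flags & ~ESB_STORE) {
        return ESB_BAD_FLAGS;                    /* reserved bits 0-62 set */
    }
    if (offset > page_size - access_size) {
        return ESB_BAD_OFFSET;                   /* access would leave the page */
    }
    return ESB_OK;
}

/* True when the call requests a store rather than a load. */
static bool esb_is_store(uint64_t flags)
{
    return (flags & ESB_STORE) != 0;
}
```

With a 64K ESB page (esb_shift = 16), offset 0xFFF8 is the last one
accepted by this sketch and 0x10000 is rejected.
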
> +
> +/*
> + * The H_INT_SYNC hcall() is used to issue hardware syncs that will
> + * ensure any in flight events for the input lisn are in the event
> + * queue.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-63: Reserved
> + * - "lisn" is per "interrupts", "interrupt-map", or
> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> + *      ibm,query-interrupt-source-number RTAS call, or as
> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
> + *
> + * Output:
> + * - None
> + */
> +static target_ulong h_int_sync(PowerPCCPU *cpu,
> +                               sPAPRMachineState *spapr,
> +                               target_ulong opcode,
> +                               target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    XiveEAS eas;
> +    target_ulong flags = args[0];
> +    target_ulong lisn = args[1];
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags) {
> +        return H_PARAMETER;
> +    }
> +
> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
> +        return H_P2;
> +    }
> +
> +    if (!(eas.w & EAS_VALID)) {
> +        return H_P2;
> +    }
> +
> +    /*
> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> +     * This is not needed when running the emulation under QEMU
> +     */
> +
> +    /* This is not real hardware. Nothing to be done */

At least, not as long as all the XIVE operations are under the BQL.

> +    return H_SUCCESS;
> +}
> +
> +/*
> + * The H_INT_RESET hcall() is used to reset all of the partition's
> + * interrupt exploitation structures to their initial state.  This
> + * means losing all previously set interrupt state set via
> + * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> + *
> + * Parameters:
> + * Input:
> + * - "flags"
> + *      Bits 0-63: Reserved
> + *
> + * Output:
> + * - None
> + */
> +static target_ulong h_int_reset(PowerPCCPU *cpu,
> +                                sPAPRMachineState *spapr,
> +                                target_ulong opcode,
> +                                target_ulong *args)
> +{
> +    sPAPRXive *xive = spapr->xive;
> +    target_ulong flags   = args[0];
> +
> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        return H_FUNCTION;
> +    }
> +
> +    if (flags) {
> +        return H_PARAMETER;
> +    }
> +
> +    device_reset(DEVICE(xive));
> +    return H_SUCCESS;
> +}
> +
> +void spapr_xive_hcall_init(sPAPRMachineState *spapr)
> +{
> +    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
> +    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
> +    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
> +    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
> +    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
> +    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
> +    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
> +                             h_int_set_os_reporting_line);
> +    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
> +                             h_int_get_os_reporting_line);
> +    spapr_register_hypercall(H_INT_ESB, h_int_esb);
> +    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
> +    spapr_register_hypercall(H_INT_RESET, h_int_reset);
> +}
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 2569ae1bc7f8..da6fcfaa3c52 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -258,6 +258,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>          error_propagate(errp, local_err);
>          return;
>      }
> +
> +    spapr_xive_hcall_init(spapr);
>  }
>  
>  static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index 301a8e972d91..eacd26836ebf 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -38,7 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>  obj-$(CONFIG_XIVE) += xive.o
> -obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode Cédric Le Goater
@ 2018-11-28  4:31   ` David Gibson
  2018-11-28 22:26     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  4:31 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:10AM +0100, Cédric Le Goater wrote:
> The XIVE interface for the guest is described in the device tree under
> the "interrupt-controller" node. A couple of new properties are
> specific to XIVE :
> 
>  - "reg"
> 
>    contains the base address and size of the thread interrupt
>    management areas (TIMA), for the User level and for the Guest OS
>    level. Only the Guest OS level is taken into account today.
> 
>  - "ibm,xive-eq-sizes"
> 
>    the size of the event queues. One cell per size supported, contains
>    log2 of size, in ascending order.
> 
>  - "ibm,xive-lisn-ranges"
> 
>    the IRQ interrupt number ranges assigned to the guest for the IPIs.
> 
> and also under the root node :
> 
>  - "ibm,plat-res-int-priorities"
> 
>    contains a list of priorities that the hypervisor has reserved for
>    its own use. OPAL uses the priority 7 queue to automatically
>    escalate interrupts for all other queues (DD2.X POWER9). So only
>    priorities [0..6] are allowed for the guest.
> 
> Extend the sPAPR IRQ backend with a new handler to populate the DT
> with the appropriate "interrupt-controller" node.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_irq.h  |  2 ++
>  include/hw/ppc/spapr_xive.h |  2 ++
>  hw/intc/spapr_xive_hcall.c  | 62 +++++++++++++++++++++++++++++++++++++
>  hw/ppc/spapr.c              |  3 +-
>  hw/ppc/spapr_irq.c          | 17 ++++++++++
>  5 files changed, 85 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index c854ae527808..cfdc1f86e713 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -40,6 +40,8 @@ typedef struct sPAPRIrq {
>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
>      void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
> +    void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
> +                        void *fdt, uint32_t phandle);
>  } sPAPRIrq;
>  
>  extern sPAPRIrq spapr_irq_xics;
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 418511f3dc10..5b3fab192d41 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -65,5 +65,7 @@ bool spapr_xive_priority_is_valid(uint8_t priority);
>  typedef struct sPAPRMachineState sPAPRMachineState;
>  
>  void spapr_xive_hcall_init(sPAPRMachineState *spapr);
> +void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
> +                   uint32_t phandle);
>  
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
> index 52e4e23995f5..66c78aa88500 100644
> --- a/hw/intc/spapr_xive_hcall.c
> +++ b/hw/intc/spapr_xive_hcall.c
> @@ -890,3 +890,65 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
>      spapr_register_hypercall(H_INT_SYNC, h_int_sync);
>      spapr_register_hypercall(H_INT_RESET, h_int_reset);
>  }
> +
> +void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt, uint32_t phandle)
> +{
> +    int node;
> +    uint64_t timas[2 * 2];
> +    /* Interrupt number ranges for the IPIs */
> +    uint32_t lisn_ranges[] = {
> +        cpu_to_be32(0),
> +        cpu_to_be32(nr_servers),
> +    };
> +    uint32_t eq_sizes[] = {
> +        cpu_to_be32(12), /* 4K */
> +        cpu_to_be32(16), /* 64K */
> +        cpu_to_be32(21), /* 2M */
> +        cpu_to_be32(24), /* 16M */
> +    };
> +    /* The following array is in sync with the 'spapr_xive_priority_is_valid'
> +     * routine above. The O/S is expected to choose priority 6.
> +     */
> +    uint32_t plat_res_int_priorities[] = {
> +        cpu_to_be32(7),    /* start */
> +        cpu_to_be32(0xf8), /* count */
> +    };
> +    gchar *nodename;
> +
> +    /* Thread Interrupt Management Area : User (ring 3) and OS (ring 2) */
> +    timas[0] = cpu_to_be64(xive->tm_base + 3 * (1ull << TM_SHIFT));
> +    timas[1] = cpu_to_be64(1ull << TM_SHIFT);
> +    timas[2] = cpu_to_be64(xive->tm_base + 2 * (1ull << TM_SHIFT));

Don't you have symbolic constants for the ring numbers, instead of '2'
and '3' above?
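
Something along these lines would do.  This is only a sketch: the
names TM_PAGE_OS, TM_PAGE_USER and tm_page_addr() are hypothetical,
and TM_SHIFT = 16 (64K pages) is an assumption made so the example
compiles on its own.  The page numbers 2 and 3 are the ones used in
the quoted code.

```c
#include <assert.h>
#include <stdint.h>

#define TM_SHIFT 16   /* assumption: one 64K page per TIMA view */

/* Hypothetical symbolic names for the TIMA pages used above. */
enum tm_page {
    TM_PAGE_OS   = 2, /* Guest OS level view */
    TM_PAGE_USER = 3, /* User level view */
};

/* Address of one TIMA page, computed entirely in 64-bit arithmetic. */
static uint64_t tm_page_addr(uint64_t tm_base, enum tm_page page)
{
    assert(page == TM_PAGE_OS || page == TM_PAGE_USER);
    return tm_base + ((uint64_t)page << TM_SHIFT);
}

/* Size of one TIMA page. */
static uint64_t tm_page_size(void)
{
    return 1ULL << TM_SHIFT;
}
```

The "reg" cells would then be built from tm_page_addr(base,
TM_PAGE_USER) and tm_page_addr(base, TM_PAGE_OS), each followed by
tm_page_size(), before the cpu_to_be64() conversion.
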

> +    timas[3] = cpu_to_be64(1ull << TM_SHIFT);
> +
> +    nodename = g_strdup_printf("interrupt-controller@%" PRIx64,
> +                               xive->tm_base + 3 * (1 << TM_SHIFT));
> +    _FDT(node = fdt_add_subnode(fdt, 0, nodename));
> +    g_free(nodename);
> +
> +    _FDT(fdt_setprop_string(fdt, node, "device_type", "power-ivpe"));
> +    _FDT(fdt_setprop(fdt, node, "reg", timas, sizeof(timas)));
> +
> +    _FDT(fdt_setprop_string(fdt, node, "compatible", "ibm,power-ivpe"));
> +    _FDT(fdt_setprop(fdt, node, "ibm,xive-eq-sizes", eq_sizes,
> +                     sizeof(eq_sizes)));
> +    _FDT(fdt_setprop(fdt, node, "ibm,xive-lisn-ranges", lisn_ranges,
> +                     sizeof(lisn_ranges)));
> +
> +    /* For Linux to link the LSIs to the main interrupt controller.

What's the "main interrupt controller" in this context?

> +     * These properties are not in XIVE exploitation mode sPAPR
> +     * specs
> +     */
> +    _FDT(fdt_setprop(fdt, node, "interrupt-controller", NULL, 0));
> +    _FDT(fdt_setprop_cell(fdt, node, "#interrupt-cells", 2));
> +
> +    /* For SLOF */
> +    _FDT(fdt_setprop_cell(fdt, node, "linux,phandle", phandle));
> +    _FDT(fdt_setprop_cell(fdt, node, "phandle", phandle));
> +
> +    /* The "ibm,plat-res-int-priorities" property defines the priority
> +     * ranges reserved by the hypervisor
> +     */
> +    _FDT(fdt_setprop(fdt, 0, "ibm,plat-res-int-priorities",
> +                     plat_res_int_priorities, sizeof(plat_res_int_priorities)));
> +}
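
The encodings of the two properties above can be sanity-checked with
a small standalone sketch.  It works on host-endian values (the FDT
cells themselves are big-endian, hence the cpu_to_be32() calls in the
patch), and struct prio_range, prio_is_reserved() and eq_size_bytes()
are hypothetical helpers, not QEMU functions.  It assumes that
"ibm,plat-res-int-priorities" is a list of (start, count) pairs, as
the "start" and "count" comments in the patch suggest.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* One (start, count) pair of "ibm,plat-res-int-priorities". */
struct prio_range {
    uint32_t start;
    uint32_t count;
};

/* True if 'prio' falls inside any range reserved by the hypervisor. */
static bool prio_is_reserved(const struct prio_range *ranges, size_t n,
                             uint32_t prio)
{
    assert(ranges != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (prio >= ranges[i].start &&
            prio - ranges[i].start < ranges[i].count) {
            return true;
        }
    }
    return false;
}

/* "ibm,xive-eq-sizes" cells hold log2 of the queue size in bytes. */
static uint64_t eq_size_bytes(uint32_t log2_size)
{
    assert(log2_size < 64);
    return 1ULL << log2_size;
}
```

With the values from the patch, { start = 7, count = 0xf8 } covers
priorities 7 through 0xfe, which leaves [0..6] for the guest, and the
eq_sizes cells 12, 16, 21 and 24 decode to 4K, 64K, 2M and 16M.
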
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 9f8c19e56e7a..ad1692cdcd0f 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1270,7 +1270,8 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr,
>      _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
>  
>      /* /interrupt controller */
> -    spapr_dt_xics(xics_max_server_number(spapr), fdt, PHANDLE_XICP);
> +    smc->irq->dt_populate(spapr, xics_max_server_number(spapr), fdt,
> +                          PHANDLE_XICP);
>  
>      ret = spapr_populate_memory(spapr, fdt);
>      if (ret < 0) {
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index da6fcfaa3c52..d88a029d8c5c 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -190,6 +190,13 @@ static void spapr_irq_print_info_xics(sPAPRMachineState *spapr, Monitor *mon)
>      ics_pic_print_info(spapr->ics, mon);
>  }
>  
> +static void spapr_irq_dt_populate_xics(sPAPRMachineState *spapr,
> +                                       uint32_t nr_servers, void *fdt,
> +                                       uint32_t phandle)
> +{
> +    spapr_dt_xics(nr_servers, fdt, phandle);
> +}
> +

It'd be nicer to change the signature of spapr_dt_xics, rather than
having this one line wrapper.

>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>  #define SPAPR_IRQ_XICS_NR_MSIS     \
>      (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
> @@ -203,6 +210,7 @@ sPAPRIrq spapr_irq_xics = {
>      .free        = spapr_irq_free_xics,
>      .qirq        = spapr_qirq_xics,
>      .print_info  = spapr_irq_print_info_xics,
> +    .dt_populate = spapr_irq_dt_populate_xics,
>  };
>  
>   /*
> @@ -300,6 +308,13 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>      spapr_xive_pic_print_info(spapr->xive, mon);
>  }
>  
> +static void spapr_irq_dt_populate_xive(sPAPRMachineState *spapr,
> +                                       uint32_t nr_servers, void *fdt,
> +                                       uint32_t phandle)
> +{
> +    spapr_dt_xive(spapr->xive, nr_servers, fdt, phandle);

Uh.. and make the hook signature match what we actually need, rather
than having trivial wrappers in both cases.
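
To illustrate the idea outside QEMU: if both populate routines share
the hook's signature, they can be plugged into the backend table
directly.  The sketch below is standalone and every name in it
(machine_sketch, irq_backend_sketch, dt_xics_sketch, dt_xive_sketch)
is hypothetical; the real routines would take sPAPRMachineState and
write FDT nodes instead of recording which backend ran.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum intc_kind { INTC_NONE, INTC_XICS, INTC_XIVE };

/* Hypothetical stand-in for the machine state. */
struct machine_sketch {
    enum intc_kind populated;  /* which backend filled in the DT */
    uint32_t       phandle;    /* phandle it was given */
};

/* A single signature shared by the hook and by every backend routine. */
typedef void (*dt_populate_fn)(struct machine_sketch *m, uint32_t nr_servers,
                               void *fdt, uint32_t phandle);

static void dt_xics_sketch(struct machine_sketch *m, uint32_t nr_servers,
                           void *fdt, uint32_t phandle)
{
    (void)nr_servers;
    (void)fdt;
    assert(m != NULL);
    m->populated = INTC_XICS;
    m->phandle = phandle;
}

static void dt_xive_sketch(struct machine_sketch *m, uint32_t nr_servers,
                           void *fdt, uint32_t phandle)
{
    (void)nr_servers;
    (void)fdt;
    assert(m != NULL);
    m->populated = INTC_XIVE;
    m->phandle = phandle;
}

struct irq_backend_sketch {
    const char    *name;
    dt_populate_fn dt_populate;
};

/* No wrapper functions: the routines are assigned to the hook as-is. */
static const struct irq_backend_sketch irq_xics_sketch = {
    .name = "xics", .dt_populate = dt_xics_sketch,
};
static const struct irq_backend_sketch irq_xive_sketch = {
    .name = "xive", .dt_populate = dt_xive_sketch,
};
```

The caller then only does backend->dt_populate(m, nr_servers, fdt,
phandle), as spapr_build_fdt() does through smc->irq in this patch.
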

> +}
> +
>  /*
>   * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>   * with XICS.
> @@ -317,6 +332,7 @@ sPAPRIrq spapr_irq_xive = {
>      .free        = spapr_irq_free_xive,
>      .qirq        = spapr_qirq_xive,
>      .print_info  = spapr_irq_print_info_xive,
> +    .dt_populate = spapr_irq_dt_populate_xive,
>  };
>  
>  /*
> @@ -421,4 +437,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
>      .free        = spapr_irq_free_xics,
>      .qirq        = spapr_qirq_xics,
>      .print_info  = spapr_irq_print_info_xics,
> +    .dt_populate = spapr_irq_dt_populate_xics,
>  };


* Re: [Qemu-devel] [PATCH v5 18/36] spapr: allocate the interrupt thread context under the CPU core
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 18/36] spapr: allocate the interrupt thread context under the CPU core Cédric Le Goater
@ 2018-11-28  4:39   ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-28  4:39 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:11AM +0100, Cédric Le Goater wrote:
> Each interrupt mode has its own specific interrupt presenter object,
> that we store under the CPU object, one for XICS and one for XIVE.
> 
> Extend the sPAPR IRQ backend with a new handler to support them both.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/hw/ppc/spapr.h     |  1 +
>  include/hw/ppc/spapr_irq.h |  2 ++
>  include/hw/ppc/xive.h      |  2 ++
>  hw/intc/xive.c             | 21 +++++++++++++++++++++
>  hw/ppc/spapr_cpu_core.c    |  5 ++---
>  hw/ppc/spapr_irq.c         | 17 +++++++++++++++++
>  6 files changed, 45 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> index 8415faea7b82..f43ef69d61bc 100644
> --- a/include/hw/ppc/spapr.h
> +++ b/include/hw/ppc/spapr.h
> @@ -177,6 +177,7 @@ struct sPAPRMachineState {
>      int32_t irq_map_nr;
>      unsigned long *irq_map;
>      sPAPRXive  *xive;
> +    const char *xive_tctx_type;
>  
>      bool cmd_line_caps[SPAPR_CAP_NUM];
>      sPAPRCapabilities def, eff, mig;
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index cfdc1f86e713..c3b4c38145eb 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -42,6 +42,8 @@ typedef struct sPAPRIrq {
>      void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
>      void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
>                          void *fdt, uint32_t phandle);
> +    Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
> +                               Error **errp);
>  } sPAPRIrq;
>  
>  extern sPAPRIrq spapr_irq_xics;
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index e6931ddaa83f..b74eb326dcd1 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -284,6 +284,8 @@ typedef struct XiveTCTX {
>  extern const MemoryRegionOps xive_tm_ops;
>  
>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
> +Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
> +                         Error **errp);
>  
>  static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
>  {
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index fc6ef5895e6d..7d921023e2ee 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -579,6 +579,27 @@ static const TypeInfo xive_tctx_info = {
>      .class_init    = xive_tctx_class_init,
>  };
>  
> +Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
> +                         Error **errp)
> +{
> +    Error *local_err = NULL;
> +    Object *obj;
> +
> +    obj = object_new(type);
> +    object_property_add_child(cpu, type, obj, &error_abort);
> +    object_unref(obj);
> +    object_property_add_const_link(obj, "cpu", cpu, &error_abort);
> +    object_property_add_const_link(obj, "xive", OBJECT(xrtr), &error_abort);
> +    object_property_set_bool(obj, true, "realized", &local_err);
> +    if (local_err) {
> +        object_unparent(obj);
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +
> +    return obj;
> +}
> +
>  /*
>   * XIVE ESB helpers
>   */
> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> index 2398ce62c0e7..1811cd48db90 100644
> --- a/hw/ppc/spapr_cpu_core.c
> +++ b/hw/ppc/spapr_cpu_core.c
> @@ -11,7 +11,6 @@
>  #include "hw/ppc/spapr_cpu_core.h"
>  #include "target/ppc/cpu.h"
>  #include "hw/ppc/spapr.h"
> -#include "hw/ppc/xics.h" /* for icp_create() - to be removed */
>  #include "hw/boards.h"
>  #include "qapi/error.h"
>  #include "sysemu/cpus.h"
> @@ -215,6 +214,7 @@ static void spapr_cpu_core_unrealize(DeviceState *dev, Error **errp)
>  static void spapr_realize_vcpu(PowerPCCPU *cpu, sPAPRMachineState *spapr,
>                                 sPAPRCPUCore *sc, Error **errp)
>  {
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>      CPUPPCState *env = &cpu->env;
>      CPUState *cs = CPU(cpu);
>      Error *local_err = NULL;
> @@ -233,8 +233,7 @@ static void spapr_realize_vcpu(PowerPCCPU *cpu, sPAPRMachineState *spapr,
>      qemu_register_reset(spapr_cpu_reset, cpu);
>      spapr_cpu_reset(cpu);
>  
> -    cpu->intc = icp_create(OBJECT(cpu), spapr->icp_type, XICS_FABRIC(spapr),
> -                           &local_err);
> +    cpu->intc = smc->irq->cpu_intc_create(spapr, OBJECT(cpu), &local_err);
>      if (local_err) {
>          goto error_unregister;
>      }
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index d88a029d8c5c..253abc10e780 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -197,6 +197,12 @@ static void spapr_irq_dt_populate_xics(sPAPRMachineState *spapr,
>      spapr_dt_xics(nr_servers, fdt, phandle);
>  }
>  
> +static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
> +                                              Object *cpu, Error **errp)
> +{
> +    return icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), errp);
> +}
> +
>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>  #define SPAPR_IRQ_XICS_NR_MSIS     \
>      (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
> @@ -211,6 +217,7 @@ sPAPRIrq spapr_irq_xics = {
>      .qirq        = spapr_qirq_xics,
>      .print_info  = spapr_irq_print_info_xics,
>      .dt_populate = spapr_irq_dt_populate_xics,
> +    .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
>  };
>  
>   /*
> @@ -267,6 +274,7 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>          return;
>      }
>  
> +    spapr->xive_tctx_type = TYPE_XIVE_TCTX;
>      spapr_xive_hcall_init(spapr);
>  }
>  
> @@ -315,6 +323,13 @@ static void spapr_irq_dt_populate_xive(sPAPRMachineState *spapr,
>      spapr_dt_xive(spapr->xive, nr_servers, fdt, phandle);
>  }
>  
> +static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
> +                                              Object *cpu, Error **errp)
> +{
> +    return xive_tctx_create(cpu, spapr->xive_tctx_type,
> +                            XIVE_ROUTER(spapr->xive), errp);
> +}
> +
>  /*
>   * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>   * with XICS.
> @@ -333,6 +348,7 @@ sPAPRIrq spapr_irq_xive = {
>      .qirq        = spapr_qirq_xive,
>      .print_info  = spapr_irq_print_info_xive,
>      .dt_populate = spapr_irq_dt_populate_xive,
> +    .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
>  };
>  
>  /*
> @@ -438,4 +454,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
>      .qirq        = spapr_qirq_xics,
>      .print_info  = spapr_irq_print_info_xics,
>      .dt_populate = spapr_irq_dt_populate_xics,
> +    .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
>  };


* Re: [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type Cédric Le Goater
@ 2018-11-28  4:42   ` David Gibson
  2018-11-28 22:37     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  4:42 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:12AM +0100, Cédric Le Goater wrote:
> The interrupt mode is statically defined to XIVE only for this machine.
> The guest OS is required to have support for the XIVE exploitation
> mode of the POWER9 interrupt controller.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_irq.h |  1 +
>  hw/ppc/spapr.c             | 36 +++++++++++++++++++++++++++++++-----
>  hw/ppc/spapr_irq.c         |  3 +++
>  3 files changed, 35 insertions(+), 5 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index c3b4c38145eb..b299dd794bff 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -33,6 +33,7 @@ void spapr_irq_msi_reset(sPAPRMachineState *spapr);
>  typedef struct sPAPRIrq {
>      uint32_t    nr_irqs;
>      uint32_t    nr_msis;
> +    uint8_t     ov5;

I'm a bit confused as to what exactly this represents..

>      void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
>                   Error **errp);
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index ad1692cdcd0f..8fbb743769db 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1097,12 +1097,14 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
>      spapr_dt_rtas_tokens(fdt, rtas);
>  }
>  
> -/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
> - * that the guest may request and thus the valid values for bytes 24..26 of
> - * option vector 5: */
> -static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
> +/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
> + * and the XIVE features that the guest may request and thus the valid
> + * values for bytes 23..26 of option vector 5: */
> +static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
> +                                          int chosen)
>  {
>      PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>  
>      char val[2 * 4] = {
>          23, 0x00, /* Xive mode, filled in below. */
> @@ -1123,7 +1125,11 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
>          } else {
>              val[3] = 0x00; /* Hash */
>          }
> +        /* TODO: test KVM support */
> +        val[1] = smc->irq->ov5;
>      } else {
> +        val[1] = smc->irq->ov5;

..here it seems to be a specific value for this OV5 byte, indicating the
supported intc...

> +
>          /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
>          val[3] = 0xC0;
>      }
> @@ -1191,7 +1197,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
>          _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
>      }
>  
> -    spapr_dt_ov5_platform_support(fdt, chosen);
> +    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>  
>      g_free(stdout_path);
>      g_free(bootlist);
> @@ -2622,6 +2628,11 @@ static void spapr_machine_init(MachineState *machine)
>      /* advertise support for ibm,dyamic-memory-v2 */
>      spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
>  
> +    /* advertise XIVE */
> +    if (smc->irq->ov5) {

..but here it seems to be a bool indicating XIVE support specifically.

> +        spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
> +    }
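
One way to keep the two roles apart is to name the byte values and
derive the "has XIVE" answer through an explicit predicate instead of
testing the raw byte for non-zero.  A minimal sketch, using only the
two values that appear in this patch (0x00 and 0x40); the macro and
function names are hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values of OV5 byte 23 as set by the backends in this patch. */
#define OV5_INTC_XICS_ONLY_SKETCH  0x00 /* XICS only */
#define OV5_INTC_XIVE_ONLY_SKETCH  0x40 /* XIVE exploitation mode only */

/* True for the byte values this sketch knows how to interpret. */
static bool ov5_intc_is_known(uint8_t ov5)
{
    return ov5 == OV5_INTC_XICS_ONLY_SKETCH ||
           ov5 == OV5_INTC_XIVE_ONLY_SKETCH;
}

/* Explicit predicate: does this platform-support byte include XIVE? */
static bool ov5_intc_has_xive(uint8_t ov5)
{
    assert(ov5_intc_is_known(ov5));
    return ov5 == OV5_INTC_XIVE_ONLY_SKETCH;
}
```

The machine init code would then test ov5_intc_has_xive(smc->irq->ov5)
before setting OV5_XIVE_EXPLOIT, while the device tree code keeps
using the byte value itself.
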
> +
>      /* init CPUs */
>      spapr_init_cpus(spapr);
>  
> @@ -3971,6 +3982,21 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>  
>  DEFINE_SPAPR_MACHINE(3_1, "3.1", true);
>  
> +static void spapr_machine_3_1_xive_instance_options(MachineState *machine)
> +{
> +    spapr_machine_3_1_instance_options(machine);
> +}
> +
> +static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
> +{
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
> +
> +    spapr_machine_3_1_class_options(mc);
> +    smc->irq = &spapr_irq_xive;
> +}
> +
> +DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
> +
>  /*
>   * pseries-3.0
>   */
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 253abc10e780..42e73851b174 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -210,6 +210,7 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
>  sPAPRIrq spapr_irq_xics = {
>      .nr_irqs     = SPAPR_IRQ_XICS_NR_IRQS,
>      .nr_msis     = SPAPR_IRQ_XICS_NR_MSIS,
> +    .ov5         = 0x0, /* XICS only */
>  
>      .init        = spapr_irq_init_xics,
>      .claim       = spapr_irq_claim_xics,
> @@ -341,6 +342,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
>  sPAPRIrq spapr_irq_xive = {
>      .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
>      .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
> +    .ov5         = 0x40, /* XIVE exploitation mode only */
>  
>      .init        = spapr_irq_init_xive,
>      .claim       = spapr_irq_claim_xive,
> @@ -447,6 +449,7 @@ int spapr_irq_find(sPAPRMachineState *spapr, int num, bool align, Error **errp)
>  sPAPRIrq spapr_irq_xics_legacy = {
>      .nr_irqs     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
>      .nr_msis     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
> +    .ov5         = 0x0, /* XICS only */
>  
>      .init        = spapr_irq_init_xics,
>      .claim       = spapr_irq_claim_xics,


* Re: [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models Cédric Le Goater
@ 2018-11-28  5:13   ` David Gibson
  2018-11-28 22:38     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  5:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:13AM +0100, Cédric Le Goater wrote:
> The XIVE models for the QEMU and KVM accelerators will have a lot in
> common. Introduce an abstract class for the source, the thread context
> and the interrupt controller object to handle the differences in the
> object initialization. These classes will also be used to define state
> synchronization handlers for the monitor and migration usage.
> 
> This is very much like the XICS models.

Yeah.. so I know it's my code, but in hindsight I think making
separate subclasses for TCG and KVM was a mistake.  The distinction
between the emulated and KVM versions is supposed to be invisible to
the guest and (almost) invisible to the user, whereas a subclass
usually indicates a visibly different device.

> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_xive.h |  15 +++++
>  include/hw/ppc/xive.h       |  30 ++++++++++
>  hw/intc/spapr_xive.c        |  86 +++++++++++++++++++---------
>  hw/intc/xive.c              | 109 +++++++++++++++++++++++++-----------
>  hw/ppc/spapr_irq.c          |   4 +-
>  5 files changed, 182 insertions(+), 62 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 5b3fab192d41..aca2969a09ab 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -13,6 +13,10 @@
>  #include "hw/sysbus.h"
>  #include "hw/ppc/xive.h"
>  
> +#define TYPE_SPAPR_XIVE_BASE "spapr-xive-base"
> +#define SPAPR_XIVE_BASE(obj) \
> +    OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_BASE)
> +
>  #define TYPE_SPAPR_XIVE "spapr-xive"
>  #define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>  
> @@ -38,6 +42,17 @@ typedef struct sPAPRXive {
>      MemoryRegion  tm_mmio;
>  } sPAPRXive;
>  
> +#define SPAPR_XIVE_BASE_CLASS(klass) \
> +     OBJECT_CLASS_CHECK(sPAPRXiveClass, (klass), TYPE_SPAPR_XIVE_BASE)
> +#define SPAPR_XIVE_BASE_GET_CLASS(obj) \
> +     OBJECT_GET_CLASS(sPAPRXiveClass, (obj), TYPE_SPAPR_XIVE_BASE)
> +
> +typedef struct sPAPRXiveClass {
> +    XiveRouterClass parent_class;
> +
> +    DeviceRealize   parent_realize;
> +} sPAPRXiveClass;
> +
>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index b74eb326dcd1..281ed370121c 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -38,6 +38,10 @@ typedef struct XiveFabricClass {
>   * XIVE Interrupt Source
>   */
>  
> +#define TYPE_XIVE_SOURCE_BASE "xive-source-base"
> +#define XIVE_SOURCE_BASE(obj) \
> +    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_BASE)
> +
>  #define TYPE_XIVE_SOURCE "xive-source"
>  #define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>  
> @@ -68,6 +72,18 @@ typedef struct XiveSource {
>      XiveFabric      *xive;
>  } XiveSource;
>  
> +#define XIVE_SOURCE_BASE_CLASS(klass) \
> +     OBJECT_CLASS_CHECK(XiveSourceClass, (klass), TYPE_XIVE_SOURCE_BASE)
> +#define XIVE_SOURCE_BASE_GET_CLASS(obj) \
> +     OBJECT_GET_CLASS(XiveSourceClass, (obj), TYPE_XIVE_SOURCE_BASE)
> +
> +typedef struct XiveSourceClass {
> +    SysBusDeviceClass parent_class;
> +
> +    DeviceRealize     parent_realize;
> +    DeviceReset       parent_reset;
> +} XiveSourceClass;
> +
>  /*
>   * ESB MMIO setting. Can be one page, for both source triggering and
>   * source management, or two different pages. See below for magic
> @@ -253,6 +269,9 @@ void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>   * XIVE Thread interrupt Management (TM) context
>   */
>  
> +#define TYPE_XIVE_TCTX_BASE "xive-tctx-base"
> +#define XIVE_TCTX_BASE(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_BASE)
> +
>  #define TYPE_XIVE_TCTX "xive-tctx"
>  #define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
>  
> @@ -278,6 +297,17 @@ typedef struct XiveTCTX {
>      XiveRouter  *xrtr;
>  } XiveTCTX;
>  
> +#define XIVE_TCTX_BASE_CLASS(klass) \
> +     OBJECT_CLASS_CHECK(XiveTCTXClass, (klass), TYPE_XIVE_TCTX_BASE)
> +#define XIVE_TCTX_BASE_GET_CLASS(obj) \
> +     OBJECT_GET_CLASS(XiveTCTXClass, (obj), TYPE_XIVE_TCTX_BASE)
> +
> +typedef struct XiveTCTXClass {
> +    DeviceClass       parent_class;
> +
> +    DeviceRealize     parent_realize;
> +} XiveTCTXClass;
> +
>  /*
>   * XIVE Thread Interrupt Management Aera (TIMA)
>   */
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index 3bf77ace11a2..ec85f7e4f88d 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -53,9 +53,9 @@ static void spapr_xive_mmio_map(sPAPRXive *xive)
>      sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
>  }
>  
> -static void spapr_xive_reset(DeviceState *dev)
> +static void spapr_xive_base_reset(DeviceState *dev)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
>      int i;
>  
>      /* Xive Source reset is done through SysBus, it should put all
> @@ -76,9 +76,9 @@ static void spapr_xive_reset(DeviceState *dev)
>      spapr_xive_mmio_map(xive);
>  }
>  
> -static void spapr_xive_instance_init(Object *obj)
> +static void spapr_xive_base_instance_init(Object *obj)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(obj);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(obj);
>  
>      object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
>      object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> @@ -89,9 +89,9 @@ static void spapr_xive_instance_init(Object *obj)
>                                NULL);
>  }
>  
> -static void spapr_xive_realize(DeviceState *dev, Error **errp)
> +static void spapr_xive_base_realize(DeviceState *dev, Error **errp)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
>      XiveSource *xsrc = &xive->source;
>      XiveENDSource *end_xsrc = &xive->end_source;
>      Error *local_err = NULL;
> @@ -142,16 +142,11 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>       */
>      xive->eat = g_new0(XiveEAS, xive->nr_irqs);
>      xive->endt = g_new0(XiveEND, xive->nr_ends);
> -
> -    /* TIMA initialization */
> -    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
> -                          "xive.tima", 4ull << TM_SHIFT);
> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
>  }
>  
>  static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>  
>      if (lisn >= xive->nr_irqs) {
>          return -1;
> @@ -163,7 +158,7 @@ static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>  
>  static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>  
>      if (lisn >= xive->nr_irqs) {
>          return -1;
> @@ -176,7 +171,7 @@ static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>  static int spapr_xive_get_end(XiveRouter *xrtr,
>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>  
>      if (end_idx >= xive->nr_ends) {
>          return -1;
> @@ -189,7 +184,7 @@ static int spapr_xive_get_end(XiveRouter *xrtr,
>  static int spapr_xive_set_end(XiveRouter *xrtr,
>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>  
>      if (end_idx >= xive->nr_ends) {
>          return -1;
> @@ -202,7 +197,7 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
>  static int spapr_xive_get_nvt(XiveRouter *xrtr,
>                                uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>      uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
>      PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
>  
> @@ -236,7 +231,7 @@ static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
>      uint32_t nvt_idx;
>      uint32_t nvt_cam;
>  
> -    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE_BASE(xrtr), POWERPC_CPU(tctx->cs),
>                            &nvt_blk, &nvt_idx);
>  
>      nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
> @@ -359,7 +354,7 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
>      },
>  };
>  
> -static const VMStateDescription vmstate_spapr_xive = {
> +static const VMStateDescription vmstate_spapr_xive_base = {
>      .name = TYPE_SPAPR_XIVE,
>      .version_id = 1,
>      .minimum_version_id = 1,
> @@ -373,7 +368,7 @@ static const VMStateDescription vmstate_spapr_xive = {
>      },
>  };
>  
> -static Property spapr_xive_properties[] = {
> +static Property spapr_xive_base_properties[] = {
>      DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>      DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
>      DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
> @@ -381,16 +376,16 @@ static Property spapr_xive_properties[] = {
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> -static void spapr_xive_class_init(ObjectClass *klass, void *data)
> +static void spapr_xive_base_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
>      XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
>  
>      dc->desc    = "sPAPR XIVE Interrupt Controller";
> -    dc->props   = spapr_xive_properties;
> -    dc->realize = spapr_xive_realize;
> -    dc->reset   = spapr_xive_reset;
> -    dc->vmsd    = &vmstate_spapr_xive;
> +    dc->props   = spapr_xive_base_properties;
> +    dc->realize = spapr_xive_base_realize;
> +    dc->reset   = spapr_xive_base_reset;
> +    dc->vmsd    = &vmstate_spapr_xive_base;
>  
>      xrc->get_eas = spapr_xive_get_eas;
>      xrc->set_eas = spapr_xive_set_eas;
> @@ -401,16 +396,55 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>      xrc->reset_tctx = spapr_xive_reset_tctx;
>  }
>  
> +static const TypeInfo spapr_xive_base_info = {
> +    .name = TYPE_SPAPR_XIVE_BASE,
> +    .parent = TYPE_XIVE_ROUTER,
> +    .abstract = true,
> +    .instance_init = spapr_xive_base_instance_init,
> +    .instance_size = sizeof(sPAPRXive),
> +    .class_init = spapr_xive_base_class_init,
> +    .class_size = sizeof(sPAPRXiveClass),
> +};
> +
> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
> +    Error *local_err = NULL;
> +
> +    sxc->parent_realize(dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    /* TIMA */
> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
> +                          "xive.tima", 4ull << TM_SHIFT);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
> +}
> +
> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
> +
> +    device_class_set_parent_realize(dc, spapr_xive_realize,
> +                                    &sxc->parent_realize);
> +}
> +
>  static const TypeInfo spapr_xive_info = {
>      .name = TYPE_SPAPR_XIVE,
> -    .parent = TYPE_XIVE_ROUTER,
> -    .instance_init = spapr_xive_instance_init,
> +    .parent = TYPE_SPAPR_XIVE_BASE,
> +    .instance_init = spapr_xive_base_instance_init,
>      .instance_size = sizeof(sPAPRXive),
>      .class_init = spapr_xive_class_init,
> +    .class_size = sizeof(sPAPRXiveClass),
>  };
>  
>  static void spapr_xive_register_types(void)
>  {
> +    type_register_static(&spapr_xive_base_info);
>      type_register_static(&spapr_xive_info);
>  }
>  
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 7d921023e2ee..9bb37553c9ec 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -478,9 +478,9 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>      return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
>  }
>  
> -static void xive_tctx_reset(void *dev)
> +static void xive_tctx_base_reset(void *dev)
>  {
> -    XiveTCTX *tctx = XIVE_TCTX(dev);
> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
>      XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>  
>      memset(tctx->regs, 0, sizeof(tctx->regs));
> @@ -506,9 +506,9 @@ static void xive_tctx_reset(void *dev)
>      }
>  }
>  
> -static void xive_tctx_realize(DeviceState *dev, Error **errp)
> +static void xive_tctx_base_realize(DeviceState *dev, Error **errp)
>  {
> -    XiveTCTX *tctx = XIVE_TCTX(dev);
> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
>      PowerPCCPU *cpu;
>      CPUPPCState *env;
>      Object *obj;
> @@ -544,15 +544,15 @@ static void xive_tctx_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    qemu_register_reset(xive_tctx_reset, dev);
> +    qemu_register_reset(xive_tctx_base_reset, dev);
>  }
>  
> -static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
> +static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
>  {
> -    qemu_unregister_reset(xive_tctx_reset, dev);
> +    qemu_unregister_reset(xive_tctx_base_reset, dev);
>  }
>  
> -static const VMStateDescription vmstate_xive_tctx = {
> +static const VMStateDescription vmstate_xive_tctx_base = {
>      .name = TYPE_XIVE_TCTX,
>      .version_id = 1,
>      .minimum_version_id = 1,
> @@ -562,21 +562,28 @@ static const VMStateDescription vmstate_xive_tctx = {
>      },
>  };
>  
> -static void xive_tctx_class_init(ObjectClass *klass, void *data)
> +static void xive_tctx_base_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
>  
> -    dc->realize = xive_tctx_realize;
> -    dc->unrealize = xive_tctx_unrealize;
> +    dc->realize = xive_tctx_base_realize;
> +    dc->unrealize = xive_tctx_base_unrealize;
>      dc->desc = "XIVE Interrupt Thread Context";
> -    dc->vmsd = &vmstate_xive_tctx;
> +    dc->vmsd = &vmstate_xive_tctx_base;
>  }
>  
> -static const TypeInfo xive_tctx_info = {
> -    .name          = TYPE_XIVE_TCTX,
> +static const TypeInfo xive_tctx_base_info = {
> +    .name          = TYPE_XIVE_TCTX_BASE,
>      .parent        = TYPE_DEVICE,
> +    .abstract      = true,
>      .instance_size = sizeof(XiveTCTX),
> -    .class_init    = xive_tctx_class_init,
> +    .class_init    = xive_tctx_base_class_init,
> +    .class_size    = sizeof(XiveTCTXClass),
> +};
> +
> +static const TypeInfo xive_tctx_info = {
> +    .name          = TYPE_XIVE_TCTX,
> +    .parent        = TYPE_XIVE_TCTX_BASE,
>  };
>  
>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
> @@ -933,9 +940,9 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>      }
>  }
>  
> -static void xive_source_reset(DeviceState *dev)
> +static void xive_source_base_reset(DeviceState *dev)
>  {
> -    XiveSource *xsrc = XIVE_SOURCE(dev);
> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
>  
>      /* Do not clear the LSI bitmap */
>  
> @@ -943,9 +950,9 @@ static void xive_source_reset(DeviceState *dev)
>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>  }
>  
> -static void xive_source_realize(DeviceState *dev, Error **errp)
> +static void xive_source_base_realize(DeviceState *dev,  Error **errp)
>  {
> -    XiveSource *xsrc = XIVE_SOURCE(dev);
> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
>      Object *obj;
>      Error *local_err = NULL;
>  
> @@ -971,21 +978,14 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>          return;
>      }
>  
> -    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> -                                     xsrc->nr_irqs);
> -
>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>  
>      xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>      xsrc->lsi_map_size = xsrc->nr_irqs;
>  
> -    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> -                          &xive_source_esb_ops, xsrc, "xive.esb",
> -                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>  }
>  
> -static const VMStateDescription vmstate_xive_source = {
> +static const VMStateDescription vmstate_xive_source_base = {
>      .name = TYPE_XIVE_SOURCE,
>      .version_id = 1,
>      .minimum_version_id = 1,
> @@ -1001,29 +1001,68 @@ static const VMStateDescription vmstate_xive_source = {
>   * The default XIVE interrupt source setting for the ESB MMIOs is two
>   * 64k pages without Store EOI, to be in sync with KVM.
>   */
> -static Property xive_source_properties[] = {
> +static Property xive_source_base_properties[] = {
>      DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>      DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>      DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>      DEFINE_PROP_END_OF_LIST(),
>  };
>  
> -static void xive_source_class_init(ObjectClass *klass, void *data)
> +static void xive_source_base_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);
>  
>      dc->desc    = "XIVE Interrupt Source";
> -    dc->props   = xive_source_properties;
> -    dc->realize = xive_source_realize;
> -    dc->reset   = xive_source_reset;
> -    dc->vmsd    = &vmstate_xive_source;
> +    dc->props   = xive_source_base_properties;
> +    dc->realize = xive_source_base_realize;
> +    dc->reset   = xive_source_base_reset;
> +    dc->vmsd    = &vmstate_xive_source_base;
> +}
> +
> +static const TypeInfo xive_source_base_info = {
> +    .name          = TYPE_XIVE_SOURCE_BASE,
> +    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .abstract      = true,
> +    .instance_size = sizeof(XiveSource),
> +    .class_init    = xive_source_base_class_init,
> +    .class_size    = sizeof(XiveSourceClass),
> +};
> +
> +static void xive_source_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
> +    Error *local_err = NULL;
> +
> +    xsc->parent_realize(dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc, xsrc->nr_irqs);
> +
> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> +                          &xive_source_esb_ops, xsrc, "xive.esb",
> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> +}
> +
> +static void xive_source_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
> +
> +    device_class_set_parent_realize(dc, xive_source_realize,
> +                                    &xsc->parent_realize);
>  }
>  
>  static const TypeInfo xive_source_info = {
>      .name          = TYPE_XIVE_SOURCE,
> -    .parent        = TYPE_SYS_BUS_DEVICE,
> +    .parent        = TYPE_XIVE_SOURCE_BASE,
>      .instance_size = sizeof(XiveSource),
>      .class_init    = xive_source_class_init,
> +    .class_size    = sizeof(XiveSourceClass),
>  };
>  
>  /*
> @@ -1659,10 +1698,12 @@ static const TypeInfo xive_fabric_info = {
>  
>  static void xive_register_types(void)
>  {
> +    type_register_static(&xive_source_base_info);
>      type_register_static(&xive_source_info);
>      type_register_static(&xive_fabric_info);
>      type_register_static(&xive_router_info);
>      type_register_static(&xive_end_source_info);
> +    type_register_static(&xive_tctx_base_info);
>      type_register_static(&xive_tctx_info);
>  }
>  
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 42e73851b174..f6e9e44d4cf9 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -243,7 +243,7 @@ static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
>          return NULL;
>      }
>      qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
> -    xive = SPAPR_XIVE(obj);
> +    xive = SPAPR_XIVE_BASE(obj);
>  
>      /* Enable the CPU IPIs */
>      for (i = 0; i < nr_servers; ++i) {
> @@ -311,7 +311,7 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>      CPU_FOREACH(cs) {
>          PowerPCCPU *cpu = POWERPC_CPU(cs);
>  
> -        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
> +        xive_tctx_pic_print_info(XIVE_TCTX_BASE(cpu->intc), mon);
>      }
>  
>      spapr_xive_pic_print_info(spapr->xive, mon);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support Cédric Le Goater
@ 2018-11-28  5:52   ` David Gibson
  2018-11-28 22:45     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-28  5:52 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:15AM +0100, Cédric Le Goater wrote:
> This introduces a set of XIVE models specific to KVM which derive from
> the XIVE base models. The interfaces with KVM are a new capability and
> a new KVM device for the XIVE native exploitation interrupt mode.
> 
> They handle the initialization of the TIMA and the source ESB memory
> regions which have a different type under KVM. These are 'ram device'
> memory mappings, similarly to VFIO, exposed to the guest and the
> associated VMAs on the host are populated dynamically with the
> appropriate pages using a fault handler.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

The logic here looks fine, but I think it would be better to activate
it with explicit if (kvm)-style checks rather than using a subclass.
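
Roughly, that would mean a single device type whose realize picks the
backend at run time. A minimal sketch in plain C, where XiveDev,
RegionKind and the use_kvm parameter are hypothetical stand-ins (in
QEMU the condition would presumably come from something like
kvm_enabled(), which is an assumption here):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical: which kind of memory region backs the TIMA. */
typedef enum {
    REGION_NONE,
    REGION_IO_EMULATED,   /* MMIO accesses handled by the emulation */
    REGION_RAM_DEVICE     /* pages mmap'ed from the KVM device */
} RegionKind;

/* Hypothetical single device type shared by both backends. */
typedef struct XiveDev {
    RegionKind tima_kind;
} XiveDev;

/* One realize for both cases; only the region setup differs. */
static void xive_dev_realize(XiveDev *dev, bool use_kvm)
{
    assert(dev != NULL);

    /* Common setup (tables, properties, ...) would go here. */

    if (use_kvm) {
        dev->tima_kind = REGION_RAM_DEVICE;
    } else {
        dev->tima_kind = REGION_IO_EMULATED;
    }
}
```

The type seen by the guest and the user stays the same either way;
the backend choice is an internal detail of realize.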

> ---
>  default-configs/ppc64-softmmu.mak |   1 +
>  include/hw/ppc/spapr_xive.h       |  18 ++
>  include/hw/ppc/xive.h             |   3 +
>  linux-headers/asm-powerpc/kvm.h   |  12 +
>  linux-headers/linux/kvm.h         |   4 +
>  target/ppc/kvm_ppc.h              |   6 +
>  hw/intc/spapr_xive_kvm.c          | 430 ++++++++++++++++++++++++++++++
>  hw/ppc/spapr.c                    |   7 +-
>  hw/ppc/spapr_irq.c                |  19 +-
>  target/ppc/kvm.c                  |   7 +
>  hw/intc/Makefile.objs             |   1 +
>  11 files changed, 503 insertions(+), 5 deletions(-)
>  create mode 100644 hw/intc/spapr_xive_kvm.c
> 
> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> index 7f34ad0528ed..c1bf5cd951f5 100644
> --- a/default-configs/ppc64-softmmu.mak
> +++ b/default-configs/ppc64-softmmu.mak
> @@ -18,6 +18,7 @@ CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>  CONFIG_XIVE=$(CONFIG_PSERIES)
>  CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
> +CONFIG_XIVE_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>  CONFIG_MEM_DEVICE=y
>  CONFIG_DIMM=y
>  CONFIG_SPAPR_RNG=y
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index aca2969a09ab..9c817bb7ae74 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -40,6 +40,10 @@ typedef struct sPAPRXive {
>      /* TIMA mapping address */
>      hwaddr        tm_base;
>      MemoryRegion  tm_mmio;
> +
> +    /* KVM support */
> +    int           fd;
> +    void          *tm_mmap;
>  } sPAPRXive;
>  
>  #define SPAPR_XIVE_BASE_CLASS(klass) \
> @@ -83,4 +87,18 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>  void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
>                     uint32_t phandle);
>  
> +/*
> + * XIVE KVM models
> + */
> +
> +#define TYPE_SPAPR_XIVE_KVM  "spapr-xive-kvm"
> +#define SPAPR_XIVE_KVM(obj)  OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_KVM)
> +
> +#define TYPE_XIVE_SOURCE_KVM "xive-source-kvm"
> +#define XIVE_SOURCE_KVM(obj) \
> +    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_KVM)
> +
> +#define TYPE_XIVE_TCTX_KVM   "xive-tctx-kvm"
> +#define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
> +
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 281ed370121c..7aaf5a182cb3 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -69,6 +69,9 @@ typedef struct XiveSource {
>      uint32_t        esb_shift;
>      MemoryRegion    esb_mmio;
>  
> +    /* KVM support */
> +    void            *esb_mmap;
> +
>      XiveFabric      *xive;
>  } XiveSource;
>  
> diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
> index 8c876c166ef2..f34c971491dd 100644
> --- a/linux-headers/asm-powerpc/kvm.h
> +++ b/linux-headers/asm-powerpc/kvm.h

Updates to linux-headers need to be split out into a separate patch.
Eventually (i.e. by the time we merge) they should be just an "update
headers to SHA XXX" commit, not a hand-picked selection of pieces.

> @@ -675,4 +675,16 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>  
> +/* POWER9 XIVE Native Interrupt Controller */
> +#define KVM_DEV_XIVE_GRP_CTRL		1
> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> +#define   KVM_DEV_XIVE_GET_TIMA_FD	2
> +#define   KVM_DEV_XIVE_VC_BASE		3
> +#define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> +
> +/* Layout of 64-bit XIVE source attribute values */
> +#define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> +#define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> +
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index f11a7eb49cfa..59fa8d8d7f39 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h
> @@ -965,6 +965,8 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_COALESCED_PIO 162
>  #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
>  #define KVM_CAP_EXCEPTION_PAYLOAD 164
> +#define KVM_CAP_ARM_VM_IPA_SIZE 165
> +#define KVM_CAP_PPC_IRQ_XIVE 166
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1188,6 +1190,8 @@ enum kvm_device_type {
>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
> +	KVM_DEV_TYPE_XIVE,
> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
> index bdfaa4e70a83..d2159660f9f2 100644
> --- a/target/ppc/kvm_ppc.h
> +++ b/target/ppc/kvm_ppc.h
> @@ -59,6 +59,7 @@ bool kvmppc_has_cap_fixup_hcalls(void);
>  bool kvmppc_has_cap_htm(void);
>  bool kvmppc_has_cap_mmu_radix(void);
>  bool kvmppc_has_cap_mmu_hash_v3(void);
> +bool kvmppc_has_cap_xive(void);
>  int kvmppc_get_cap_safe_cache(void);
>  int kvmppc_get_cap_safe_bounds_check(void);
>  int kvmppc_get_cap_safe_indirect_branch(void);
> @@ -307,6 +308,11 @@ static inline bool kvmppc_has_cap_mmu_hash_v3(void)
>      return false;
>  }
>  
> +static inline bool kvmppc_has_cap_xive(void)
> +{
> +    return false;
> +}
> +
>  static inline int kvmppc_get_cap_safe_cache(void)
>  {
>      return 0;
> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
> new file mode 100644
> index 000000000000..767f90826e43
> --- /dev/null
> +++ b/hw/intc/spapr_xive_kvm.c
> @@ -0,0 +1,430 @@
> +/*
> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qemu/error-report.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/kvm.h"
> +#include "hw/ppc/spapr.h"
> +#include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/xive.h"
> +#include "kvm_ppc.h"
> +
> +#include <sys/ioctl.h>
> +
> +/*
> + * Helpers for CPU hotplug
> + */
> +typedef struct KVMEnabledCPU {
> +    unsigned long vcpu_id;
> +    QLIST_ENTRY(KVMEnabledCPU) node;
> +} KVMEnabledCPU;
> +
> +static QLIST_HEAD(, KVMEnabledCPU)
> +    kvm_enabled_cpus = QLIST_HEAD_INITIALIZER(&kvm_enabled_cpus);
> +
> +static bool kvm_cpu_is_enabled(CPUState *cs)
> +{
> +    KVMEnabledCPU *enabled_cpu;
> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
> +
> +    QLIST_FOREACH(enabled_cpu, &kvm_enabled_cpus, node) {
> +        if (enabled_cpu->vcpu_id == vcpu_id) {
> +            return true;
> +        }
> +    }
> +    return false;
> +}
> +
> +static void kvm_cpu_enable(CPUState *cs)
> +{
> +    KVMEnabledCPU *enabled_cpu;
> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
> +
> +    enabled_cpu = g_malloc(sizeof(*enabled_cpu));
> +    enabled_cpu->vcpu_id = vcpu_id;
> +    QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
> +}

Blech, I hope we can find a better way of tracking this than an ugly
list.
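
One possibility, purely as a sketch: a fixed-size bitmap indexed by
vcpu id instead of a linked list. MAX_VCPUS and the function names
below are hypothetical, and whether a static bound is acceptable here
is an assumption:

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_VCPUS      2048   /* hypothetical upper bound on vcpu ids */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)

/* One bit per vcpu id; set once the vcpu is connected to the device. */
static unsigned long
    enabled_map[(MAX_VCPUS + BITS_PER_WORD - 1) / BITS_PER_WORD];

static bool vcpu_is_enabled(unsigned long vcpu_id)
{
    assert(vcpu_id < MAX_VCPUS);
    return (enabled_map[vcpu_id / BITS_PER_WORD]
            >> (vcpu_id % BITS_PER_WORD)) & 1UL;
}

static void vcpu_enable(unsigned long vcpu_id)
{
    assert(vcpu_id < MAX_VCPUS);
    enabled_map[vcpu_id / BITS_PER_WORD] |= 1UL << (vcpu_id % BITS_PER_WORD);
}
```

Lookup becomes constant time and nothing is allocated per CPU, at the
cost of fixing a maximum vcpu id up front.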

> +
> +/*
> + * XIVE Thread Interrupt Management context (KVM)
> + */
> +
> +static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
> +{
> +    sPAPRXive *xive;
> +    unsigned long vcpu_id;
> +    int ret;
> +
> +    /* Check if CPU was hot unplugged and replugged. */
> +    if (kvm_cpu_is_enabled(tctx->cs)) {
> +        return;
> +    }
> +
> +    vcpu_id = kvm_arch_vcpu_id(tctx->cs);
> +    xive = SPAPR_XIVE_KVM(tctx->xrtr);

Is this the first use of tctx->xrtr?

> +    ret = kvm_vcpu_enable_cap(tctx->cs, KVM_CAP_PPC_IRQ_XIVE, 0, xive->fd,
> +                              vcpu_id, 0);
> +    if (ret < 0) {
> +        error_setg(errp, "Unable to connect CPU%ld to KVM XIVE device: %s",
> +                   vcpu_id, strerror(errno));
> +        return;
> +    }
> +
> +    kvm_cpu_enable(tctx->cs);
> +}
> +
> +static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveTCTX *tctx = XIVE_TCTX_KVM(dev);
> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(dev);
> +    Error *local_err = NULL;
> +
> +    xtc->parent_realize(dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    xive_tctx_kvm_init(tctx, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +}
> +
> +static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
> +
> +    dc->desc = "sPAPR XIVE KVM Interrupt Thread Context";
> +
> +    device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
> +                                    &xtc->parent_realize);
> +}
> +
> +static const TypeInfo xive_tctx_kvm_info = {
> +    .name          = TYPE_XIVE_TCTX_KVM,
> +    .parent        = TYPE_XIVE_TCTX_BASE,
> +    .instance_size = sizeof(XiveTCTX),
> +    .class_init    = xive_tctx_kvm_class_init,
> +    .class_size    = sizeof(XiveTCTXClass),
> +};
> +
> +/*
> + * XIVE Interrupt Source (KVM)
> + */
> +
> +static void xive_source_kvm_init(XiveSource *xsrc, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
> +    int i;
> +
> +    /*
> +     * At reset, interrupt sources are simply created and MASKED. We
> +     * only need to inform the KVM device about their type: LSI or
> +     * MSI.
> +     */
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        Error *local_err = NULL;
> +        uint64_t state = 0;
> +
> +        if (xive_source_irq_is_lsi(xsrc, i)) {
> +            state |= KVM_XIVE_LEVEL_SENSITIVE;
> +            if (xsrc->status[i] & XIVE_STATUS_ASSERTED) {
> +                state |= KVM_XIVE_LEVEL_ASSERTED;
> +            }
> +        }
> +
> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SOURCES, i, &state,
> +                          true, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +    }
> +}
> +
> +static void xive_source_kvm_reset(DeviceState *dev)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
> +
> +    xsc->parent_reset(dev);
> +
> +    xive_source_kvm_init(xsrc, &error_fatal);
> +}
> +
> +static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
> +{
> +    XiveSource *xsrc = opaque;
> +    struct kvm_irq_level args;
> +    int rc;
> +
> +    args.irq = srcno;
> +    if (!xive_source_irq_is_lsi(xsrc, srcno)) {
> +        if (!val) {
> +            return;
> +        }
> +        args.level = KVM_INTERRUPT_SET;
> +    } else {
> +        if (val) {
> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
> +            args.level = KVM_INTERRUPT_SET_LEVEL;
> +        } else {
> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
> +            args.level = KVM_INTERRUPT_UNSET;
> +        }
> +    }
> +    rc = kvm_vm_ioctl(kvm_state, KVM_IRQ_LINE, &args);
> +    if (rc < 0) {
> +        error_report("kvm_irq_line() failed : %s", strerror(errno));
> +    }
> +}
> +
> +static void *spapr_xive_kvm_mmap(sPAPRXive *xive, int ctrl, size_t len,
> +                                 Error **errp)
> +{
> +    Error *local_err = NULL;
> +    void *addr;
> +    int fd;
> +
> +    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, ctrl, &fd, false,
> +                      &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return NULL;
> +    }
> +
> +    addr = mmap(NULL, len, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
> +    close(fd);
> +    if (addr == MAP_FAILED) {
> +        error_setg_errno(errp, errno, "Unable to set XIVE mmaping");
> +        return NULL;
> +    }
> +
> +    return addr;
> +}
> +
> +/*
> + * The sPAPRXive KVM model should have initialized the KVM device
> + * before initializing the source
> + */
> +static void xive_source_kvm_mmap(XiveSource *xsrc, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
> +    Error *local_err = NULL;
> +    size_t esb_len;
> +
> +    esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
> +    xsrc->esb_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_ESB_FD,
> +                                         esb_len, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    memory_region_init_ram_device_ptr(&xsrc->esb_mmio, OBJECT(xsrc),
> +                                      "xive.esb", esb_len, xsrc->esb_mmap);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(xsrc), &xsrc->esb_mmio);
> +}
> +
> +static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
> +    Error *local_err = NULL;
> +
> +    xsc->parent_realize(dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_kvm_set_irq, xsrc,
> +                                     xsrc->nr_irqs);
> +
> +    xive_source_kvm_mmap(xsrc, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +}
> +
> +static void xive_source_kvm_unrealize(DeviceState *dev, Error **errp)
> +{
> +    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
> +    size_t esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
> +
> +    munmap(xsrc->esb_mmap, esb_len);
> +}
> +
> +static void xive_source_kvm_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
> +
> +    device_class_set_parent_realize(dc, xive_source_kvm_realize,
> +                                    &xsc->parent_realize);
> +    device_class_set_parent_reset(dc, xive_source_kvm_reset,
> +                                  &xsc->parent_reset);
> +
> +    dc->desc = "sPAPR XIVE KVM Interrupt Source";
> +    dc->unrealize = xive_source_kvm_unrealize;
> +}
> +
> +static const TypeInfo xive_source_kvm_info = {
> +    .name = TYPE_XIVE_SOURCE_KVM,
> +    .parent = TYPE_XIVE_SOURCE_BASE,
> +    .instance_size = sizeof(XiveSource),
> +    .class_init    = xive_source_kvm_class_init,
> +    .class_size    = sizeof(XiveSourceClass),
> +};
> +
> +/*
> + * sPAPR XIVE Router (KVM)
> + */
> +
> +static void spapr_xive_kvm_instance_init(Object *obj)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE_KVM(obj);
> +
> +    xive->fd = -1;
> +
> +    /* We need a KVM flavored source */
> +    object_initialize(&xive->source, sizeof(xive->source),
> +                      TYPE_XIVE_SOURCE_KVM);
> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> +
> +    /* No KVM support for END ESBs. OPAL doesn't either */
> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
> +                      TYPE_XIVE_END_SOURCE);
> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
> +                              NULL);
> +}
> +
> +static void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
> +{
> +    Error *local_err = NULL;
> +    size_t tima_len;
> +
> +    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
> +        error_setg(errp,
> +                   "IRQ_XIVE capability must be present for KVM XIVE device");
> +        return;
> +    }
> +
> +    /* First, create the KVM XIVE device */
> +    xive->fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_XIVE, false);
> +    if (xive->fd < 0) {
> +        error_setg_errno(errp, -xive->fd, "error creating KVM XIVE device");
> +        return;
> +    }
> +
> +    /* Source ESBs KVM mapping
> +     *
> +     * Inform KVM where we will map the ESB pages. This is needed by
> +     * the H_INT_GET_SOURCE_INFO hcall which returns the source
> +     * characteristics, among which the ESB page address.
> +     */
> +    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, KVM_DEV_XIVE_VC_BASE,
> +                      &xive->vc_base, true, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    /* Let the XiveSource KVM model handle the mapping for the moment */
> +
> +    /* TIMA KVM mapping
> +     *
> +     * We could also inform KVM where the TIMA will be mapped but as
> +     * this is a fixed MMIO address for the system it does not seem
> +     * necessary to provide a KVM ioctl to change it.
> +     */
> +    tima_len = 4ull << TM_SHIFT;
> +    xive->tm_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_TIMA_FD,
> +                                        tima_len, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    memory_region_init_ram_device_ptr(&xive->tm_mmio, OBJECT(xive),
> +                                      "xive.tima", tima_len, xive->tm_mmap);
> +    sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->tm_mmio);
> +
> +    kvm_kernel_irqchip = true;
> +    kvm_msi_via_irqfd_allowed = true;
> +    kvm_gsi_direct_mapping = true;
> +}
> +
> +static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
> +    Error *local_err = NULL;
> +
> +    spapr_xive_kvm_init(xive, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    /* Initialize the source and the local routing tables */
> +    sxc->parent_realize(dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +}
> +
> +static void spapr_xive_kvm_unrealize(DeviceState *dev, Error **errp)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
> +
> +    close(xive->fd);
> +    xive->fd = -1;
> +
> +    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
> +}
> +
> +static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
> +
> +    device_class_set_parent_realize(dc, spapr_xive_kvm_realize,
> +                                    &sxc->parent_realize);
> +
> +    dc->desc = "sPAPR XIVE KVM Interrupt Controller";
> +    dc->unrealize = spapr_xive_kvm_unrealize;
> +}
> +
> +static const TypeInfo spapr_xive_kvm_info = {
> +    .name = TYPE_SPAPR_XIVE_KVM,
> +    .parent = TYPE_SPAPR_XIVE_BASE,
> +    .instance_init = spapr_xive_kvm_instance_init,
> +    .instance_size = sizeof(sPAPRXive),
> +    .class_init = spapr_xive_kvm_class_init,
> +    .class_size = sizeof(sPAPRXiveClass),
> +};
> +
> +static void xive_kvm_register_types(void)
> +{
> +    type_register_static(&spapr_xive_kvm_info);
> +    type_register_static(&xive_source_kvm_info);
> +    type_register_static(&xive_tctx_kvm_info);
> +}
> +
> +type_init(xive_kvm_register_types)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index f9cf2debff5a..d1be2579cd9b 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1125,8 +1125,11 @@ static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
>          } else {
>              val[3] = 0x00; /* Hash */
>          }
> -        /* TODO: test KVM support */
> -        val[1] = smc->irq->ov5;
> +        if (kvmppc_has_cap_xive()) {
> +            val[1] = smc->irq->ov5;
> +        } else {
> +            val[1] = 0x00;
> +        }
>      } else {
>          val[1] = smc->irq->ov5;
>  
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 33dd5da7d255..92ef53743b64 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -273,9 +273,22 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>      Error *local_err = NULL;
>  
>      /* KVM XIVE support */
> -    if (kvm_enabled()) {
> -        if (machine_kernel_irqchip_required(machine)) {
> -            error_setg(errp, "kernel_irqchip requested. no XIVE support");
> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
> +        spapr->xive_tctx_type = TYPE_XIVE_TCTX_KVM;
> +        spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM, nr_irqs,
> +                                        nr_servers, &local_err);
> +
> +        if (local_err && machine_kernel_irqchip_required(machine)) {
> +            error_propagate(errp, local_err);
> +            error_prepend(errp, "kernel_irqchip requested but init failed : ");
> +            return;
> +        }
> +
> +        /*
> +         * XIVE support is activated under KVM. No need to initialize
> +         * the fallback mode under QEMU
> +         */
> +        if (spapr->xive) {
>              return;
>          }
>      }
> diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
> index f81327d6cd47..3b7cf106242b 100644
> --- a/target/ppc/kvm.c
> +++ b/target/ppc/kvm.c
> @@ -86,6 +86,7 @@ static int cap_fixup_hcalls;
>  static int cap_htm;             /* Hardware transactional memory support */
>  static int cap_mmu_radix;
>  static int cap_mmu_hash_v3;
> +static int cap_xive;
>  static int cap_resize_hpt;
>  static int cap_ppc_pvr_compat;
>  static int cap_ppc_safe_cache;
> @@ -149,6 +150,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>      cap_htm = kvm_vm_check_extension(s, KVM_CAP_PPC_HTM);
>      cap_mmu_radix = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_RADIX);
>      cap_mmu_hash_v3 = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_HASH_V3);
> +    cap_xive = kvm_vm_check_extension(s, KVM_CAP_PPC_IRQ_XIVE);
>      cap_resize_hpt = kvm_vm_check_extension(s, KVM_CAP_SPAPR_RESIZE_HPT);
>      kvmppc_get_cpu_characteristics(s);
>      cap_ppc_nested_kvm_hv = kvm_vm_check_extension(s, KVM_CAP_PPC_NESTED_HV);
> @@ -2385,6 +2387,11 @@ static int parse_cap_ppc_safe_indirect_branch(struct kvm_ppc_cpu_char c)
>      return 0;
>  }
>  
> +bool kvmppc_has_cap_xive(void)
> +{
> +    return cap_xive;
> +}
> +
>  static void kvmppc_get_cpu_characteristics(KVMState *s)
>  {
>      struct kvm_ppc_cpu_char c;
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index eacd26836ebf..dd4d69db2bdd 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -39,6 +39,7 @@ obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>  obj-$(CONFIG_XIVE) += xive.o
>  obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
> +obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS migration
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS migration Cédric Le Goater
@ 2018-11-28  5:54   ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-28  5:54 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:14AM +0100, Cédric Le Goater wrote:
> Introduce a new sPAPR IRQ handler to handle resend after migration
> when the machine is using a KVM XICS interrupt controller model.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  include/hw/ppc/spapr_irq.h |  2 ++
>  hw/ppc/spapr.c             | 13 +++++--------
>  hw/ppc/spapr_irq.c         | 27 +++++++++++++++++++++++++++
>  3 files changed, 34 insertions(+), 8 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index b299dd794bff..4e36c0984e1a 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -45,6 +45,7 @@ typedef struct sPAPRIrq {
>                          void *fdt, uint32_t phandle);
>      Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
>                                 Error **errp);
> +    int (*post_load)(sPAPRMachineState *spapr, int version_id);
>  } sPAPRIrq;
>  
>  extern sPAPRIrq spapr_irq_xics;
> @@ -55,6 +56,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
> +int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
>  
>  /*
>   * XICS legacy routines
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 8fbb743769db..f9cf2debff5a 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1738,14 +1738,6 @@ static int spapr_post_load(void *opaque, int version_id)
>          return err;
>      }
>  
> -    if (!object_dynamic_cast(OBJECT(spapr->ics), TYPE_ICS_KVM)) {
> -        CPUState *cs;
> -        CPU_FOREACH(cs) {
> -            PowerPCCPU *cpu = POWERPC_CPU(cs);
> -            icp_resend(ICP(cpu->intc));
> -        }
> -    }
> -
>      /* In earlier versions, there was no separate qdev for the PAPR
>       * RTC, so the RTC offset was stored directly in sPAPREnvironment.
>       * So when migrating from those versions, poke the incoming offset
> @@ -1766,6 +1758,11 @@ static int spapr_post_load(void *opaque, int version_id)
>          }
>      }
>  
> +    err = spapr_irq_post_load(spapr, version_id);
> +    if (err) {
> +        return err;
> +    }
> +
>      return err;
>  }
>  
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index f6e9e44d4cf9..33dd5da7d255 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -203,6 +203,18 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
>      return icp_create(cpu, spapr->icp_type, XICS_FABRIC(spapr), errp);
>  }
>  
> +static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
> +{
> +    if (!object_dynamic_cast(OBJECT(spapr->ics), TYPE_ICS_KVM)) {
> +        CPUState *cs;
> +        CPU_FOREACH(cs) {
> +            PowerPCCPU *cpu = POWERPC_CPU(cs);
> +            icp_resend(ICP(cpu->intc));
> +        }
> +    }
> +    return 0;
> +}
> +
>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>  #define SPAPR_IRQ_XICS_NR_MSIS     \
>      (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
> @@ -219,6 +231,7 @@ sPAPRIrq spapr_irq_xics = {
>      .print_info  = spapr_irq_print_info_xics,
>      .dt_populate = spapr_irq_dt_populate_xics,
>      .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
> +    .post_load   = spapr_irq_post_load_xics,
>  };
>  
>   /*
> @@ -331,6 +344,11 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
>                              XIVE_ROUTER(spapr->xive), errp);
>  }
>  
> +static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
> +{
> +    return 0;
> +}
> +
>  /*
>   * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>   * with XICS.
> @@ -351,6 +369,7 @@ sPAPRIrq spapr_irq_xive = {
>      .print_info  = spapr_irq_print_info_xive,
>      .dt_populate = spapr_irq_dt_populate_xive,
>      .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
> +    .post_load   = spapr_irq_post_load_xive,
>  };
>  
>  /*
> @@ -389,6 +408,13 @@ qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq)
>      return smc->irq->qirq(spapr, irq);
>  }
>  
> +int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id)
> +{
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +
> +    return smc->irq->post_load(spapr, version_id);
> +}
> +
>  /*
>   * XICS legacy routines - to deprecate one day
>   */
> @@ -458,4 +484,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
>      .print_info  = spapr_irq_print_info_xics,
>      .dt_populate = spapr_irq_dt_populate_xics,
>      .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
> +    .post_load   = spapr_irq_post_load_xics,
>  };

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-28  2:57   ` David Gibson
@ 2018-11-28  9:35     ` Greg Kurz
  2018-11-28 16:50       ` Cédric Le Goater
  2018-11-29  1:02       ` David Gibson
  0 siblings, 2 replies; 184+ messages in thread
From: Greg Kurz @ 2018-11-28  9:35 UTC (permalink / raw)
  To: David Gibson; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel

On Wed, 28 Nov 2018 13:57:14 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Fri, Nov 16, 2018 at 11:57:05AM +0100, Cédric Le Goater wrote:
> > We will need to use xics_max_server_number() to create the sPAPRXive
> > object modeling the interrupt controller of the machine which is
> > created before the CPUs.
> > 
> > Signed-off-by: Cédric Le Goater <clg@kaod.org>  
> 
> My only concern here is that this moves the spapr_set_vsmt_mode()
> before some of the sanity checks in spapr_init_cpus().  Are we certain
> there are no edge cases that could cause badness?
> 

The early checks in spapr_init_cpus() filter out topologies that would
result in partially filled cores. They're only related to the rest of
the code that creates the boot CPUs. Before commit 1a5008fc17,
spapr_set_vsmt_mode() was even being called before spapr_init_cpus().
The rationale to move it there was to ensure it is called before the
first user of spapr->vsmt, which happens to be a call to
xics_max_server_number().

Now that xics_max_server_number() needs to be called even earlier, I think a
better change is to have xics_max_server_number() call spapr_set_vsmt_mode()
if spapr->vsmt isn't set.
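
To make the suggestion concrete, here is a minimal standalone sketch of the
lazy-initialization idea. It is not the patch code: MachineSketch,
sketch_set_vsmt_mode() and sketch_max_server_number() are hypothetical
stand-ins, and the "default to 8" policy is only a placeholder for whatever
spapr_set_vsmt_mode() really decides.

```c
#include <stdint.h>

/* Hypothetical stand-in for the few machine fields that matter here. */
typedef struct MachineSketch {
    uint32_t vsmt;        /* 0 means "not configured yet" */
    uint32_t smp_threads; /* guest threads per core */
    uint32_t max_cpus;    /* maximum number of guest CPUs */
} MachineSketch;

/*
 * Placeholder for spapr_set_vsmt_mode(): choose a VSMT value once.
 * The policy below is an assumption made for the example only.
 */
static void sketch_set_vsmt_mode(MachineSketch *m)
{
    if (m->vsmt == 0) {
        m->vsmt = m->smp_threads > 8 ? m->smp_threads : 8;
    }
}

/*
 * Lazily make sure VSMT is set before it is used, so that callers no
 * longer depend on the ordering of the machine init code.
 */
static uint32_t sketch_max_server_number(MachineSketch *m)
{
    sketch_set_vsmt_mode(m);
    /* Round up: max_cpus * vsmt / smp_threads */
    return (m->max_cpus * m->vsmt + m->smp_threads - 1) / m->smp_threads;
}
```

With this shape the first caller pays for the VSMT setup and later callers
see the cached value, at the price of hiding a side effect inside what
looks like a pure getter.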

> > ---
> >  hw/ppc/spapr.c | 10 +++++-----
> >  1 file changed, 5 insertions(+), 5 deletions(-)
> > 
> > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > index 7afd1a175bf2..50cb9f9f4a02 100644
> > --- a/hw/ppc/spapr.c
> > +++ b/hw/ppc/spapr.c
> > @@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
> >          boot_cores_nr = possible_cpus->len;
> >      }
> >  
> > -    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> > -     * call xics_max_server_number() or spapr_vcpu_id().
> > -     */
> > -    spapr_set_vsmt_mode(spapr, &error_fatal);
> > -
> >      if (smc->pre_2_10_has_unused_icps) {
> >          int i;
> >  
> > @@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
> >      /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
> >      load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
> >  
> > +    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> > +     * call xics_max_server_number() or spapr_vcpu_id().
> > +     */
> > +    spapr_set_vsmt_mode(spapr, &error_fatal);
> > +
> >      /* Set up Interrupt Controller before we create the VCPUs */
> >      smc->irq->init(spapr, &error_fatal);
> >    
> 



* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-27 23:49   ` David Gibson
  2018-11-28  2:34     ` Benjamin Herrenschmidt
@ 2018-11-28 10:59     ` Cédric Le Goater
  2018-11-29  0:47       ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 10:59 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 12:49 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
>> The last sub-engine of the XIVE architecture is the Interrupt
>> Virtualization Presentation Engine (IVPE). On HW, they share elements,
>> the Power Bus interface (CQ), the routing table descriptors, and they
>> can be combined in the same HW logic. We do the same in QEMU and
>> combine both engines in the XiveRouter for simplicity.
> 
> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
> single object is a good idea, but we can probably discuss that once
> I've read further.

We could introduce a simplified presenter for sPAPR but I am not even
sure of that as it will get more complex if we support the EBB one day. 
 
>> When the IVRE has completed its job of matching an event source with a
>> Notification Virtual Target (NVT) to notify, it forwards the event
>> notification to the IVPE sub-engine. The IVPE scans the thread
>> interrupt contexts of the Notification Virtual Targets (NVT)
>> dispatched on the HW processor threads and if a match is found, it
>> signals the thread. If not, the IVPE escalates the notification to
>> some other targets and records the notification in a backlog queue.
>>
>> The IVPE maintains the thread interrupt context state for each of its
>> NVTs not dispatched on HW processor threads in the Notification
>> Virtual Target table (NVTT).
>>
>> The model currently only supports single NVT notifications.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/xive.h      |  13 +++
>>  include/hw/ppc/xive_regs.h |  22 ++++
>>  hw/intc/xive.c             | 223 +++++++++++++++++++++++++++++++++++++
>>  3 files changed, 258 insertions(+)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 5987f26ddb98..e715a6c6923d 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
>>                     XiveEND *end);
>>      int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>                     XiveEND *end);
>> +    int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>> +                   XiveNVT *nvt);
>> +    int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>> +                   XiveNVT *nvt);
> 
> As with the ENDs, I don't think get/set is a good interface for a
> bigger-than-word-size object.

We need to agree on this interface before I respin. So you would like
to add an extra argument specifying the word being accessed?
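
Something along these lines, perhaps. This is only a sketch of one possible
shape for such an accessor, not a proposal for the final prototype:
NvtSketch, NVT_WORDS and sketch_write_nvt() are made-up names, and the flat
array stands in for whatever backs the table.

```c
#include <stdint.h>

#define NVT_WORDS 16 /* the NVT structure quoted below has w0..wf */

/* Hypothetical flat view of an NVT as an array of 32-bit words. */
typedef struct NvtSketch {
    uint32_t w[NVT_WORDS];
} NvtSketch;

/*
 * Update a single word of the NVT at 'idx' in a backing table of 'nr'
 * entries. Only the requested word is copied, so a caller cannot
 * accidentally overwrite fields that changed since it read the entry.
 * Returns 0 on success, -1 on an out-of-range index or word number.
 */
static int sketch_write_nvt(NvtSketch *table, uint32_t nr, uint32_t idx,
                            const NvtSketch *nvt, uint8_t word)
{
    if (idx >= nr || word >= NVT_WORDS) {
        return -1;
    }
    table[idx].w[word] = nvt->w[word];
    return 0;
}
```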

> 
>>  } XiveRouterClass;
>>  
>>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>> @@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>                          XiveEND *end);
>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>                          XiveEND *end);
>> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>> +                        XiveNVT *nvt);
>> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>> +                        XiveNVT *nvt);
>>  
>>  /*
>>   * XIVE END ESBs
>> @@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
>>  
>>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
>>  
>> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
>> +{
>> +    return (nvt_blk << 19) | nvt_idx;
> 
> I'm guessing this formula is the standard way of combining the NVT
> block and index into a single word?  

That number is the VP/NVT identifier which is written in the CAM value.
The index is 19 bits wide because of the NVT definition in the END
structure. It is being increased to 24 bits on Power10.
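
For reference, the packing and its inverse could be sketched as below. The
19-bit index width comes from the formula in the patch; the helper names
(nvt_id_make, nvt_id_block, nvt_id_index) are hypothetical and not part of
the series.

```c
#include <stdint.h>

#define NVT_IDX_BITS 19
#define NVT_IDX_MASK ((1u << NVT_IDX_BITS) - 1)

/* Combine block and index the same way xive_tctx_cam_line() does. */
static inline uint32_t nvt_id_make(uint8_t blk, uint32_t idx)
{
    return ((uint32_t)blk << NVT_IDX_BITS) | (idx & NVT_IDX_MASK);
}

/* Recover the block number from a combined identifier. */
static inline uint8_t nvt_id_block(uint32_t id)
{
    return (uint8_t)(id >> NVT_IDX_BITS);
}

/* Recover the index from a combined identifier. */
static inline uint32_t nvt_id_index(uint32_t id)
{
    return id & NVT_IDX_MASK;
}
```

Keeping the width in a single define would also make the Power10 change a
one-line edit in such helpers.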

> If so, I think we should
> standardize on passing a single word "nvt_id" around and only
> splitting it when we need to use the block separately.  

This is really the only place where we concatenate the two NVT values,
block and index. 

> Same goes for
> the end_id, assuming there's a standard way of putting that into a
> single word.  That will address the point I raised earlier about lisn
> being passed around as a single word, but these later stage ids being
> split.

Hmm, I am not sure this is a good option. It is not how the PowerNV
model would use it: skiboot is very much aware of these blocks and
indexes, and for remote accesses chips are identified using the block.
I will take a look at it but I am not fond of it. I can add helpers
in some places though.

I agree we have some kind of issue linking the HW model with the sPAPR
machine. The guest interface is only about IRQ numbers, priorities and
CPU numbers. We really don't care about XIVE blocks and indexes in that
case. We can clarify the code by bypassing the XiveRouter interfaces
to the table and directly using the sPAPR interrupt controller. That
should help a bit for the hcalls but we would still have to fill in
the EAT and the END with some index values if we want to use the router
algorithm.



> We'll probably want some inlines or macros to build an
> nvt/end/lisn/whatever id from block and index as well.
> 
>> +}
>> +
>>  #endif /* PPC_XIVE_H */
>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>> index 2e3d6cb507da..05cb992d2815 100644
>> --- a/include/hw/ppc/xive_regs.h
>> +++ b/include/hw/ppc/xive_regs.h
>> @@ -158,4 +158,26 @@ typedef struct XiveEND {
>>  #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
>>  } XiveEND;
>>  
>> +/* Notification Virtual Target (NVT) */
>> +typedef struct XiveNVT {
>> +        uint32_t        w0;
>> +#define NVT_W0_VALID             PPC_BIT32(0)
>> +        uint32_t        w1;
>> +        uint32_t        w2;
>> +        uint32_t        w3;
>> +        uint32_t        w4;
>> +        uint32_t        w5;
>> +        uint32_t        w6;
>> +        uint32_t        w7;
>> +        uint32_t        w8;
>> +#define NVT_W8_GRP_VALID         PPC_BIT32(0)
>> +        uint32_t        w9;
>> +        uint32_t        wa;
>> +        uint32_t        wb;
>> +        uint32_t        wc;
>> +        uint32_t        wd;
>> +        uint32_t        we;
>> +        uint32_t        wf;
>> +} XiveNVT;
>> +
>>  #endif /* PPC_XIVE_REGS_H */
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 4c6cb5d52975..5ba3b06e6e25 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -373,6 +373,32 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
>>      }
>>  }
>>  
>> +/* The HW CAM (23bits) is hardwired to :
>> + *
>> + *   0x000||0b1||4Bit chip number||7Bit Thread number.
>> + *
>> + * and when the block grouping extension is enabled :
>> + *
>> + *   4Bit chip number||0x001||7Bit Thread number.
>> + */
>> +static uint32_t tctx_hw_cam_line(bool block_group, uint8_t chip_id, uint8_t tid)
>> +{
>> +    if (block_group) {
>> +        return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
>> +    } else {
>> +        return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
>> +    }
>> +}
>> +
>> +static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>> +{
>> +    PowerPCCPU *cpu = POWERPC_CPU(tctx->cs);
>> +    CPUPPCState *env = &cpu->env;
>> +    uint32_t pir = env->spr_cb[SPR_PIR].default_value;
> 
> I don't much like reaching into the cpu state itself.  I think a
> better idea would be to have the TCTX have its HW CAM id set during
> initialization (via a property) and then use that.  This will mean
> less mucking about if future cpu revisions don't split the PIR into
> chip and tid ids in the same way.

Yes, good idea. I will see how to handle the block_group boolean. Maybe we
can leave it out of the model for now as it is not used.
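
A rough sketch of what that could look like, assuming block grouping is
left out and the HW CAM follows the first layout given in the comment above
(0b1 || 4-bit chip number || 7-bit thread number). TctxSketch and the
function names are hypothetical; in the real model the value would
presumably be set through a property at realize time.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical thread context holding a precomputed HW CAM line. */
typedef struct TctxSketch {
    uint32_t hw_cam;
} TctxSketch;

/* 0b1 || 4-bit chip number || 7-bit thread number */
static uint32_t hw_cam_line(uint8_t chip_id, uint8_t tid)
{
    return (1u << 11) | ((uint32_t)(chip_id & 0xf) << 7) | (tid & 0x7f);
}

/* Done once at initialization, so matching never touches CPU state. */
static void tctx_sketch_init(TctxSketch *tctx, uint8_t chip_id, uint8_t tid)
{
    tctx->hw_cam = hw_cam_line(chip_id, tid);
}

/* Compare the stored CAM line with the one derived from the NVT. */
static bool tctx_sketch_hw_match(const TctxSketch *tctx,
                                 uint8_t nvt_blk, uint8_t nvt_idx)
{
    return tctx->hw_cam == hw_cam_line(nvt_blk, nvt_idx);
}
```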

> 
>> +    return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
>> +}
>> +
>>  static void xive_tctx_reset(void *dev)
>>  {
>>      XiveTCTX *tctx = XIVE_TCTX(dev);
>> @@ -1013,6 +1039,195 @@ int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>     return xrc->set_end(xrtr, end_blk, end_idx, end);
>>  }
>>  
>> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>> +                        XiveNVT *nvt)
>> +{
>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>> +
>> +   return xrc->get_nvt(xrtr, nvt_blk, nvt_idx, nvt);
>> +}
>> +
>> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>> +                        XiveNVT *nvt)
>> +{
>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>> +
>> +   return xrc->set_nvt(xrtr, nvt_blk, nvt_idx, nvt);
>> +}
>> +
>> +static bool xive_tctx_ring_match(XiveTCTX *tctx, uint8_t ring,
>> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
>> +                                 bool cam_ignore, uint32_t logic_serv)
>> +{
>> +    uint8_t *regs = &tctx->regs[ring];
>> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &regs[TM_WORD2]));
>> +    uint32_t cam = xive_tctx_cam_line(nvt_blk, nvt_idx);
>> +    bool block_group = false; /* TODO (PowerNV) */
>> +
>> +    /* TODO (PowerNV): ignore low order bits of nvt id */
>> +
>> +    switch (ring) {
>> +    case TM_QW3_HV_PHYS:
>> +        return (w2 & TM_QW3W2_VT) && xive_tctx_hw_cam_line(tctx, block_group) ==
>> +            tctx_hw_cam_line(block_group, nvt_blk, nvt_idx);
> 
> The difference between "xive_tctx_hw_cam_line" and "tctx_hw_cam_line"
> here is far from obvious.  

yes. I lacked inspiration ...

> Remember that namespacing prefixes aren't
> necessary for static functions, which can let you give more
> descriptive names without getting excessively long.

OK.
 
>> +    case TM_QW2_HV_POOL:
>> +        return (w2 & TM_QW2W2_VP) && (cam == GETFIELD(TM_QW2W2_POOL_CAM, w2));
>> +
>> +    case TM_QW1_OS:
>> +        return (w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2));
>> +
>> +    case TM_QW0_USER:
>> +        return ((w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2)) &&
>> +                (w2 & TM_QW0W2_VU) &&
>> +                (logic_serv == GETFIELD(TM_QW0W2_LOGIC_SERV, w2)));
>> +
>> +    default:
>> +        g_assert_not_reached();
>> +    }
>> +}
>> +
>> +static int xive_presenter_tctx_match(XiveTCTX *tctx, uint8_t format,
>> +                                     uint8_t nvt_blk, uint32_t nvt_idx,
>> +                                     bool cam_ignore, uint32_t logic_serv)
>> +{
>> +    if (format == 0) {
>> +        /* F=0 & i=1: Logical server notification */
>> +        if (cam_ignore == true) {
>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for LS "
>> +                          "NVT %x/%x\n", nvt_blk, nvt_idx);
>> +             return -1;
>> +        }
>> +
>> +        /* F=0 & i=0: Specific NVT notification */
>> +        if (xive_tctx_ring_match(tctx, TM_QW3_HV_PHYS,
>> +                                nvt_blk, nvt_idx, false, 0)) {
>> +            return TM_QW3_HV_PHYS;
>> +        }
>> +        if (xive_tctx_ring_match(tctx, TM_QW2_HV_POOL,
>> +                                nvt_blk, nvt_idx, false, 0)) {
>> +            return TM_QW2_HV_POOL;
>> +        }
>> +        if (xive_tctx_ring_match(tctx, TM_QW1_OS,
>> +                                nvt_blk, nvt_idx, false, 0)) {
>> +            return TM_QW1_OS;
>> +        }
> 
> Hm.  It's a bit pointless to iterate through each ring calling a
> common function, when that "common" function consists entirely of a
> switch which makes it not really common at all.
> 
> So I think you want separate helper functions for each ring's match,
> or even just fold the previous function into this one.

Yes, it can be improved. I did try different layouts. I might just fold
both routines into one, as you propose.
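
As an illustration of the "fold both routines into one" option, here is a
minimal standalone sketch. It is not the patch code: the Sketch* types, the
SKETCH_RING_* identifiers and the single per-ring CAM value are hypothetical
simplifications (the real model decodes the valid bits and CAM fields from
word 2 of each ring with GETFIELD). It only keeps the scan order and the
match conditions visible in the quoted code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical ring identifiers; the patch uses the TM_QW* offsets. */
enum {
    SKETCH_RING_USER,
    SKETCH_RING_OS,
    SKETCH_RING_HV_POOL,
    SKETCH_RING_HV_PHYS,
    SKETCH_NR_RINGS,
};

/* Simplified per-ring state: a valid bit and one CAM value. */
typedef struct SketchRing {
    bool valid;
    uint32_t cam;
} SketchRing;

typedef struct SketchTCTX {
    SketchRing ring[SKETCH_NR_RINGS];
    bool user_valid;          /* stands in for the VU bit */
    uint32_t user_logic_serv; /* stands in for the LOGIC_SERV field */
} SketchTCTX;

/* Return the matching ring, or -1 when nothing matches. */
static int sketch_tctx_match(const SketchTCTX *tctx, uint8_t format,
                             uint32_t cam, bool cam_ignore,
                             uint32_t logic_serv)
{
    /* F=0 scan order used by the patch: HV physical, HV pool, then OS. */
    static const int f0_order[] = {
        SKETCH_RING_HV_PHYS, SKETCH_RING_HV_POOL, SKETCH_RING_OS,
    };
    const SketchRing *os = &tctx->ring[SKETCH_RING_OS];
    size_t i;

    if (format == 0) {
        if (cam_ignore) {
            return -1; /* logical server notification: not supported */
        }
        for (i = 0; i < sizeof(f0_order) / sizeof(f0_order[0]); i++) {
            const SketchRing *r = &tctx->ring[f0_order[i]];

            if (r->valid && r->cam == cam) {
                return f0_order[i];
            }
        }
        return -1;
    }

    /* F=1: the OS CAM must match and the user ring must be valid too. */
    if (os->valid && os->cam == cam &&
        tctx->user_valid && tctx->user_logic_serv == logic_serv) {
        return SKETCH_RING_USER;
    }
    return -1;
}
```

With a single function, the per-ring switch disappears and the F=0 priority
order is expressed once, in the f0_order table.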

>> +    } else {
>> +        /* F=1 : User level Event-Based Branch (EBB) notification */
>> +        if (xive_tctx_ring_match(tctx, TM_QW0_USER,
>> +                                nvt_blk, nvt_idx, false, logic_serv)) {
>> +            return TM_QW0_USER;
>> +        }
>> +    }
>> +    return -1;
>> +}
>> +
>> +typedef struct XiveTCTXMatch {
>> +    XiveTCTX *tctx;
>> +    uint8_t ring;
>> +} XiveTCTXMatch;
>> +
>> +static bool xive_presenter_match(XiveRouter *xrtr, uint8_t format,
>> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
>> +                                 bool cam_ignore, uint8_t priority,
>> +                                 uint32_t logic_serv, XiveTCTXMatch *match)
>> +{
>> +    CPUState *cs;
>> +
>> +    /* TODO (PowerNV): handle chip_id overwrite of block field for
>> +     * hardwired CAM compares */
>> +
>> +    CPU_FOREACH(cs) {
>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>> +        XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>> +        int ring;
>> +
>> +        /*
>> +         * HW checks that the CPU is enabled in the Physical Thread
>> +         * Enable Register (PTER).
>> +         */
>> +
>> +        /*
>> +         * Check the thread context CAM lines and record matches. We
>> +         * will handle CPU exception delivery later
>> +         */
>> +        ring = xive_presenter_tctx_match(tctx, format, nvt_blk, nvt_idx,
>> +                                         cam_ignore, logic_serv);
>> +        /*
>> +         * Save the context and follow on to catch duplicates, that we
>> +         * don't support yet.
>> +         */
>> +        if (ring != -1) {
>> +            if (match->tctx) {
>> +                qemu_log_mask(LOG_GUEST_ERROR, "XIVE: already found a thread "
>> +                              "context NVT %x/%x\n", nvt_blk, nvt_idx);
>> +                return false;
>> +            }
>> +
>> +            match->ring = ring;
>> +            match->tctx = tctx;
>> +        }
>> +    }
>> +
>> +    if (!match->tctx) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
>> +                      nvt_blk, nvt_idx);
>> +        return false;
> 
> Hmm.. this isn't actually an error, is it? At least not for powernv

It is on sPAPR: it would mean the END was configured with an unknown CPU.

It is not an error on PowerNV, when we support escalations.

> - that just means the NVT isn't currently dispatched, so we'll need to
> trigger the escalation interrupt.  

Yes.

> Does this get changed later in the series?

No.

Thanks,

C.

>> +    }
>> +
>> +    return true;
>> +}
>> +
>> +/*
>> + * This is our simple Xive Presenter Engine model. It is merged in the
>> + * Router as it does not require an extra object.
>> + *
>> + * It receives notification requests sent by the IVRE to find one
>> + * matching NVT (or more) dispatched on the processor threads. In case
>> + * of a single NVT notification, the process is abbreviated and the
>> + * thread is signaled if a match is found. In case of a logical server
>> + * notification (bits ignored at the end of the NVT identifier), the
>> + * IVPE and IVRE select a winning thread using different filters. This
>> + * involves 2 or 3 exchanges on the PowerBus that the model does not
>> + * support.
>> + *
>> + * The parameters represent what is sent on the PowerBus
>> + */
>> +static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
>> +                                  uint8_t nvt_blk, uint32_t nvt_idx,
>> +                                  bool cam_ignore, uint8_t priority,
>> +                                  uint32_t logic_serv)
>> +{
>> +    XiveNVT nvt;
>> +    XiveTCTXMatch match = { 0 };
>> +    bool found;
>> +
>> +    /* NVT cache lookup */
>> +    if (xive_router_get_nvt(xrtr, nvt_blk, nvt_idx, &nvt)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no NVT %x/%x\n",
>> +                      nvt_blk, nvt_idx);
>> +        return;
>> +    }
>> +
>> +    if (!(nvt.w0 & NVT_W0_VALID)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is invalid\n",
>> +                      nvt_blk, nvt_idx);
>> +        return;
>> +    }
>> +
>> +    found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
>> +                                 priority, logic_serv, &match);
>> +    if (found) {
>> +        return;
>> +    }
>> +
>> +    /* If no matching NVT is dispatched on a HW thread :
>> +     * - update the NVT structure if backlog is activated
>> +     * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
>> +     *   activated
>> +     */
>> +}
>> +
>>  /*
>>   * An END trigger can come from an event trigger (IPI or HW) or from
>>   * another chip. We don't model the PowerBus but the END trigger
>> @@ -1081,6 +1296,14 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>      /*
>>       * Follows IVPE notification
>>       */
>> +    xive_presenter_notify(xrtr, format,
>> +                          GETFIELD(END_W6_NVT_BLOCK, end.w6),
>> +                          GETFIELD(END_W6_NVT_INDEX, end.w6),
>> +                          GETFIELD(END_W7_F0_IGNORE, end.w7),
>> +                          priority,
>> +                          GETFIELD(END_W7_F1_LOG_SERVER_ID, end.w7));
>> +
>> +    /* TODO: Auto EOI. */
>>  }
>>  
>>  static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> 

* Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-28  0:13   ` David Gibson
  2018-11-28  2:32     ` Benjamin Herrenschmidt
@ 2018-11-28 11:30     ` Cédric Le Goater
  2018-11-29  0:49       ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 11:30 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 1:13 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:02AM +0100, Cédric Le Goater wrote:
>> After the event data was pushed in the O/S Event Queue, the IVPE
>> raises the bit corresponding to the priority of the pending interrupt
>> in the register IPB (Interrupt Pending Buffer) to indicate there is an
>> event pending in one of the 8 priority queues. The Pending Interrupt
>> Priority Register (PIPR) is also updated using the IPB. This register
>> represents the priority of the most favored pending notification.
>>
>> The PIPR is then compared to the Current Processor Priority
>> Register (CPPR). If it is more favored (numerically less than), the
>> CPU interrupt line is raised and the EO bit of the Notification Source
>> Register (NSR) is updated to notify the presence of an exception for
>> the O/S. The check needs to be done whenever the PIPR or the CPPR are
>> changed.
>>
>> The O/S acknowledges the interrupt with a special load in the Thread
>> Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
>> takes the value of PIPR. The bit number in the IPB corresponding to
>> the priority of the pending interrupt is reset and so is the EO bit
>> of the NSR.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/intc/xive.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
>>  1 file changed, 93 insertions(+), 1 deletion(-)
>>
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 5ba3b06e6e25..c49932d2b799 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -21,9 +21,73 @@
>>   * XIVE Thread Interrupt Management context
>>   */
>>  
>> +/* Convert a priority number to an Interrupt Pending Buffer (IPB)
>> + * register, which indicates a pending interrupt at the priority
>> + * corresponding to the bit number
>> + */
>> +static uint8_t priority_to_ipb(uint8_t priority)
>> +{
>> +    return priority > XIVE_PRIORITY_MAX ?
>> +        0 : 1 << (XIVE_PRIORITY_MAX - priority);
>> +}
>> +
>> +/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
>> + * Interrupt Priority Register (PIPR), which contains the priority of
>> + * the most favored pending notification.
>> + */
>> +static uint8_t ipb_to_pipr(uint8_t ibp)
>> +{
>> +    return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
>> +}
>> +
>> +static void ipb_update(uint8_t *regs, uint8_t priority)
>> +{
>> +    regs[TM_IPB] |= priority_to_ipb(priority);
>> +    regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
>> +}
>> +
>> +static uint8_t exception_mask(uint8_t ring)
>> +{
>> +    switch (ring) {
>> +    case TM_QW1_OS:
>> +        return TM_QW1_NSR_EO;
>> +    default:
>> +        g_assert_not_reached();
>> +    }
>> +}
>> +
>>  static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
>>  {
>> -    return 0;
>> +    uint8_t *regs = &tctx->regs[ring];
>> +    uint8_t nsr = regs[TM_NSR];
>> +    uint8_t mask = exception_mask(ring);
>> +
>> +    qemu_irq_lower(tctx->output);
>> +
>> +    if (regs[TM_NSR] & mask) {
>> +        uint8_t cppr = regs[TM_PIPR];
>> +
>> +        regs[TM_CPPR] = cppr;
>> +
>> +        /* Reset the pending buffer bit */
>> +        regs[TM_IPB] &= ~priority_to_ipb(cppr);
>> +        regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
>> +
>> +        /* Drop Exception bit */
>> +        regs[TM_NSR] &= ~mask;
>> +    }
>> +
>> +    return (nsr << 8) | regs[TM_CPPR];
> 
> Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?

I will check. 
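
On the cast question: in C, operands narrower than int undergo integer
promotion before a shift, so a uint8_t value is promoted to int and a shift
by 8 stays well within the width of the promoted type. A minimal standalone
sketch, where pack_nsr_cppr() is a hypothetical helper name and not part of
the patch:

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the return expression of
 * xive_tctx_accept(): NSR in the high byte, CPPR in the low byte.
 * Both uint8_t operands are promoted to int before the shift and
 * the OR, so the largest intermediate value is 0xffff.
 */
static uint64_t pack_nsr_cppr(uint8_t nsr, uint8_t cppr)
{
    return (nsr << 8) | cppr;
}
```

An explicit cast would only become necessary if the shift count could reach
the width of int, which is not the case for an 8-bit shift.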

> 
>> +}
>> +
>> +static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
>> +{
>> +    uint8_t *regs = &tctx->regs[ring];
>> +
>> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {
>> +        regs[TM_NSR] |= exception_mask(ring);
>> +        qemu_irq_raise(tctx->output);
>> +    }
>>  }
>>  
>>  static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>> @@ -33,6 +97,9 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>>      }
>>  
>>      tctx->regs[ring + TM_CPPR] = cppr;
>> +
>> +    /* CPPR has changed, check if we need to raise a pending exception */
>> +    xive_tctx_notify(tctx, ring);
>>  }
>>  
>>  /*
>> @@ -198,6 +265,17 @@ static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
>>      xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
>>  }
>>  
>> +/*
>> + * Adjust the IPB to allow a CPU to process event queues of other
>> + * priorities during one physical interrupt cycle.
>> + */
>> +static void xive_tm_set_os_pending(XiveTCTX *tctx, hwaddr offset,
>> +                                   uint64_t value, unsigned size)
>> +{
>> +    ipb_update(&tctx->regs[TM_QW1_OS], value & 0xff);
>> +    xive_tctx_notify(tctx, TM_QW1_OS);
>> +}
>> +
>>  /*
>>   * Define a mapping of "special" operations depending on the TIMA page
>>   * offset and the size of the operation.
>> @@ -220,6 +298,7 @@ static const XiveTmOp xive_tm_operations[] = {
>>  
>>      /* MMIOs above 2K : special operations with side effects */
>>      { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
>> +    { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
>>  };
>>  
>>  static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
>> @@ -409,6 +488,13 @@ static void xive_tctx_reset(void *dev)
>>      tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
>>      tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
>>      tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
>> +
>> +    /*
>> +     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
>> +     * CPPR is first set.
>> +     */
>> +    tctx->regs[TM_QW1_OS + TM_PIPR] =
>> +        ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
>>  }
>>  
>>  static void xive_tctx_realize(DeviceState *dev, Error **errp)
>> @@ -1218,9 +1304,15 @@ static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
>>      found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
>>                                   priority, logic_serv, &match);
>>      if (found) {
>> +        ipb_update(&match.tctx->regs[match.ring], priority);
>> +        xive_tctx_notify(match.tctx, match.ring);
>>          return;
>>      }
>>  
>> +    /* Record the IPB in the associated NVT structure */
>> +    ipb_update((uint8_t *) &nvt.w4, priority);
>> +    xive_router_set_nvt(xrtr, nvt_blk, nvt_idx, &nvt);
> 
> You're only writing back the NVT in the !found case.  Don't you still
> need to update it in the found case?

I would say no, unless we add support for redistribution, which would
mean the model supports logical servers.

These are much more complex scenarios in which the IVPE returns multiple
matching targets, the IVRE selects one, but then the context changes.
 

C.

> 
>>      /* If no matching NVT is dispatched on a HW thread :
>>       * - update the NVT structure if backlog is activated
>>       * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
> 

* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-28  0:52   ` David Gibson
@ 2018-11-28 16:27     ` Cédric Le Goater
  2018-11-29  0:54       ` David Gibson
  2018-12-04 17:12       ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 16:27 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 1:52 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:03AM +0100, Cédric Le Goater wrote:
>> sPAPRXive models the XIVE interrupt controller of the sPAPR machine.
>> It inherits from the XiveRouter and provisions storage for the routing
>> tables :
>>
>>   - Event Assignment Structure (EAS)
>>   - Event Notification Descriptor (END)
>>
>> The sPAPRXive model incorporates an internal XiveSource for the IPIs
>> and for the interrupts of the virtual devices of the guest. This model
>> is consistent with XIVE architecture which also incorporates an
>> internal IVSE for IPIs and accelerator interrupts in the IVRE
>> sub-engine.
>>
>> The sPAPRXive model exports two memory regions, one for the ESB
>> trigger and management pages used to control the sources and one for
>> the TIMA pages. They are mapped by default at the addresses found on
>> chip 0 of a baremetal system. This is also consistent with the XIVE
>> architecture which defines a Virtualization Controller BAR for the
>> internal IVSE ESB pages and a Thread Management BAR for the TIMA.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  default-configs/ppc64-softmmu.mak |   1 +
>>  include/hw/ppc/spapr_xive.h       |  46 +++++
>>  hw/intc/spapr_xive.c              | 323 ++++++++++++++++++++++++++++++
>>  hw/intc/Makefile.objs             |   1 +
>>  4 files changed, 371 insertions(+)
>>  create mode 100644 include/hw/ppc/spapr_xive.h
>>  create mode 100644 hw/intc/spapr_xive.c
>>
>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>> index 2d1e7c5c4668..7f34ad0528ed 100644
>> --- a/default-configs/ppc64-softmmu.mak
>> +++ b/default-configs/ppc64-softmmu.mak
>> @@ -17,6 +17,7 @@ CONFIG_XICS=$(CONFIG_PSERIES)
>>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>  CONFIG_XIVE=$(CONFIG_PSERIES)
>> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_MEM_DEVICE=y
>>  CONFIG_DIMM=y
>>  CONFIG_SPAPR_RNG=y
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> new file mode 100644
>> index 000000000000..06727bd86aa9
>> --- /dev/null
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -0,0 +1,46 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_SPAPR_XIVE_H
>> +#define PPC_SPAPR_XIVE_H
>> +
>> +#include "hw/sysbus.h"
>> +#include "hw/ppc/xive.h"
>> +
>> +#define TYPE_SPAPR_XIVE "spapr-xive"
>> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>> +
>> +typedef struct sPAPRXive {
>> +    XiveRouter    parent;
>> +
>> +    /* Internal interrupt source for IPIs and virtual devices */
>> +    XiveSource    source;
>> +    hwaddr        vc_base;
>> +
>> +    /* END ESB MMIOs */
>> +    XiveENDSource end_source;
>> +    hwaddr        end_base;
>> +
>> +    /* Routing table */
>> +    XiveEAS       *eat;
>> +    uint32_t      nr_irqs;
>> +    XiveEND       *endt;
>> +    uint32_t      nr_ends;
>> +
>> +    /* TIMA mapping address */
>> +    hwaddr        tm_base;
>> +    MemoryRegion  tm_mmio;
>> +} sPAPRXive;
>> +
>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
>> +
>> +#endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> new file mode 100644
>> index 000000000000..5d038146c08e
>> --- /dev/null
>> +++ b/hw/intc/spapr_xive.c
>> @@ -0,0 +1,323 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "monitor/monitor.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/ppc/spapr_xive.h"
>> +#include "hw/ppc/xive.h"
>> +#include "hw/ppc/xive_regs.h"
>> +
>> +/*
>> + * XIVE Virtualization Controller BAR and Thread Management BAR that we
>> + * use for the ESB pages and the TIMA pages
>> + */
>> +#define SPAPR_XIVE_VC_BASE   0x0006010000000000ull
>> +#define SPAPR_XIVE_TM_BASE   0x0006030203180000ull
>> +
>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>> +{
>> +    int i;
>> +    uint32_t offset = 0;
>> +
>> +    monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
>> +                   offset + xive->source.nr_irqs - 1);
>> +    xive_source_pic_print_info(&xive->source, offset, mon);
>> +
>> +    monitor_printf(mon, "XIVE EAT %08x .. %08x\n", 0, xive->nr_irqs - 1);
>> +    for (i = 0; i < xive->nr_irqs; i++) {
>> +        xive_eas_pic_print_info(&xive->eat[i], i, mon);
>> +    }
>> +
>> +    monitor_printf(mon, "XIVE ENDT %08x .. %08x\n", 0, xive->nr_ends - 1);
>> +    for (i = 0; i < xive->nr_ends; i++) {
>> +        xive_end_pic_print_info(&xive->endt[i], i, mon);
>> +    }
> 
> AIUI the PAPR model hides the details of ENDs, EQs and NVTs - instead
> each logical EAS just points at a (thread, priority) pair, which under
> the hood has exactly one END and one NVT bound to it.
> 
> Given that, would it make more sense to reformat the info here to show
> things in terms of those (thread, priority) pairs, rather than the
> internal EAS and END details?

Yes. I had a version doing something like that before. I will rework
the output a little for sPAPR.


>> +}
>> +
>> +/* Map the ESB pages and the TIMA pages */
>> +static void spapr_xive_mmio_map(sPAPRXive *xive)
>> +{
>> +    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
>> +    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);
> 
> Uh.. I didn't think the PAPR model exposed the END sources to the guest?

Well, it should if they were being used, but that is not the case for any
of the sPAPR guest OSes today. So I think it's preferable to remove the
mapping until someone wants to experiment with it. We can keep the
XiveENDSource object though. This is harmless.

There is no KVM side to the END ESBs either, as OPAL does not use them.

> 
>> +    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
>> +}
>> +
>> +static void spapr_xive_reset(DeviceState *dev)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    int i;
>> +
>> +    /* Xive Source reset is done through SysBus, it should put all
>> +     * IRQs to OFF (!P|Q) */
>> +
>> +    /* Mask all valid EASs in the IRQ number space. */
>> +    for (i = 0; i < xive->nr_irqs; i++) {
>> +        XiveEAS *eas = &xive->eat[i];
>> +        if (eas->w & EAS_VALID) {
>> +            eas->w |= EAS_MASKED;
> 
> To ensure consistent behaviour across reboots, it would be better to
> reset the whole of the EAS, except those which have to be preserved
> across reboots (which would be VALID, and maybe nothing else?).

A VALID EAS corresponds to an IRQ number claimed by a device of the machine.
So we should keep the valid bit but reset all other settings, which this
reset method is not doing. I will fix.
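
A minimal sketch of such a reset, under stated assumptions: the SketchEAS
type and the SKETCH_EAS_* bit positions are hypothetical and chosen for
illustration only, byte order is ignored, and leaving valid entries masked
after reset is carried over from the quoted code rather than decided here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit positions, for illustration only. */
#define SKETCH_EAS_VALID   (1ull << 63)
#define SKETCH_EAS_MASKED  (1ull << 31)

typedef struct SketchEAS {
    uint64_t w;
} SketchEAS;

/*
 * Keep only the VALID bit of claimed entries and leave them masked;
 * clear every other setting so that behaviour is the same across
 * reboots.
 */
static void sketch_eat_reset(SketchEAS *eat, size_t nr_irqs)
{
    size_t i;

    for (i = 0; i < nr_irqs; i++) {
        if (eat[i].w & SKETCH_EAS_VALID) {
            eat[i].w = SKETCH_EAS_VALID | SKETCH_EAS_MASKED;
        } else {
            eat[i].w = 0;
        }
    }
}
```

Entries that were never claimed end up fully cleared, and claimed entries
lose any routing information a previous guest configured.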

>> +        }
>> +    }
>> +
>> +    for (i = 0; i < xive->nr_ends; i++) {
>> +        xive_end_reset(&xive->endt[i]);
>> +    }
>> +
>> +    spapr_xive_mmio_map(xive);
> 
> You shouldn't need to re-establish MMIO mappings at reset time, only
> during initialization.

Yes. Not for now indeed, but the patch is anticipating the switch
of the interrupt mode at reset. I will move the mapping to the
realize method in that patch and move it back to reset when
we reach that part of the patchset (dual machine).

>> +}
>> +
>> +static void spapr_xive_instance_init(Object *obj)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(obj);
>> +
>> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
> 
> Yeah, embedding the source here makes sense, but it's a strong
> indication that XiveSource should not be a SysBusDevice subclass.  I
> really think it wants to be a TYPE_DEVICE subclass - and, in fact, I
> think it can be object_initialize() embedded everywhere it's used.

I have changed XiveSource to be a TYPE_DEVICE.
 
> I've also said elsewhere that I suspect XiveRouter should also not be a
> SysBusDevice.  

I have changed XiveRouter to be a TYPE_DEVICE.

> With that approach it might make sense to embed it
> here, rather than subclassing it 

Ah, why not indeed. I have to think about it.

> (the old composition vs. inheritance debate).

Heh, but then the XiveRouter needs to become a QOM interface if we
want to be able to define XIVE table accessors for sPAPRXive. See
the spapr_xive_class_init() routine.

>> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>> +
>> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
>> +                      TYPE_XIVE_END_SOURCE);
>> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
>> +                              NULL);
>> +}
>> +
>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    XiveSource *xsrc = &xive->source;
>> +    XiveENDSource *end_xsrc = &xive->end_source;
>> +    Error *local_err = NULL;
>> +
>> +    if (!xive->nr_irqs) {
>> +        error_setg(errp, "Number of interrupt needs to be greater 0");
>> +        return;
>> +    }
>> +
>> +    if (!xive->nr_ends) {
>> +        error_setg(errp, "Number of interrupt needs to be greater 0");
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Initialize the internal sources, for IPIs and virtual devices.
>> +     */
>> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
>> +                            &error_fatal);
>> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
>> +                                   &error_fatal);
>> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
>> +
>> +    /*
>> +     * Initialize the END ESB source
>> +     */
>> +    object_property_set_int(OBJECT(end_xsrc), xive->nr_irqs, "nr-ends",
>> +                            &error_fatal);
>> +    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
>> +                                   &error_fatal);
>> +    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
>> +
>> +    /* Set the mapping address of the END ESB pages after the source ESBs */
>> +    xive->end_base = xive->vc_base + (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
>> +
>> +    /*
>> +     * Allocate the routing tables
>> +     */
>> +    xive->eat = g_new0(XiveEAS, xive->nr_irqs);
>> +    xive->endt = g_new0(XiveEND, xive->nr_ends);
>> +
>> +    /* TIMA initialization */
>> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
>> +                          "xive.tima", 4ull << TM_SHIFT);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
>> +}
>> +
>> +static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +
>> +    if (lisn >= xive->nr_irqs) {
>> +        return -1;
>> +    }
>> +
>> +    *eas = xive->eat[lisn];
>> +    return 0;
>> +}
>> +
>> +static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +
>> +    if (lisn >= xive->nr_irqs) {
>> +        return -1;
>> +    }
>> +
>> +    xive->eat[lisn] = *eas;
>> +    return 0;
>> +}
>> +
>> +static int spapr_xive_get_end(XiveRouter *xrtr,
>> +                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +
>> +    if (end_idx >= xive->nr_ends) {
>> +        return -1;
>> +    }
>> +
>> +    memcpy(end, &xive->endt[end_idx], sizeof(XiveEND));
>> +    return 0;
>> +}
>> +
>> +static int spapr_xive_set_end(XiveRouter *xrtr,
>> +                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +
>> +    if (end_idx >= xive->nr_ends) {
>> +        return -1;
>> +    }
>> +
>> +    memcpy(&xive->endt[end_idx], end, sizeof(XiveEND));
>> +    return 0;
>> +}
>> +
>> +static const VMStateDescription vmstate_spapr_xive_end = {
>> +    .name = TYPE_SPAPR_XIVE "/end",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField []) {
>> +        VMSTATE_UINT32(w0, XiveEND),
>> +        VMSTATE_UINT32(w1, XiveEND),
>> +        VMSTATE_UINT32(w2, XiveEND),
>> +        VMSTATE_UINT32(w3, XiveEND),
>> +        VMSTATE_UINT32(w4, XiveEND),
>> +        VMSTATE_UINT32(w5, XiveEND),
>> +        VMSTATE_UINT32(w6, XiveEND),
>> +        VMSTATE_UINT32(w7, XiveEND),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static const VMStateDescription vmstate_spapr_xive_eas = {
>> +    .name = TYPE_SPAPR_XIVE "/eas",
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField []) {
>> +        VMSTATE_UINT64(w, XiveEAS),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static const VMStateDescription vmstate_spapr_xive = {
>> +    .name = TYPE_SPAPR_XIVE,
>> +    .version_id = 1,
>> +    .minimum_version_id = 1,
>> +    .fields = (VMStateField[]) {
>> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
>> +                                     vmstate_spapr_xive_eas, XiveEAS),
>> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(endt, sPAPRXive, nr_ends,
>> +                                             vmstate_spapr_xive_end, XiveEND),
>> +        VMSTATE_END_OF_LIST()
>> +    },
>> +};
>> +
>> +static Property spapr_xive_properties[] = {
>> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>> +    DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
>> +    DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
>> +    DEFINE_PROP_UINT64("tm-base", sPAPRXive, tm_base, SPAPR_XIVE_TM_BASE),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
>> +
>> +    dc->desc    = "sPAPR XIVE Interrupt Controller";
>> +    dc->props   = spapr_xive_properties;
>> +    dc->realize = spapr_xive_realize;
>> +    dc->reset   = spapr_xive_reset;
>> +    dc->vmsd    = &vmstate_spapr_xive;
>> +
>> +    xrc->get_eas = spapr_xive_get_eas;
>> +    xrc->set_eas = spapr_xive_set_eas;
>> +    xrc->get_end = spapr_xive_get_end;
>> +    xrc->set_end = spapr_xive_set_end;
>> +}
>> +
>> +static const TypeInfo spapr_xive_info = {
>> +    .name = TYPE_SPAPR_XIVE,
>> +    .parent = TYPE_XIVE_ROUTER,
>> +    .instance_init = spapr_xive_instance_init,
>> +    .instance_size = sizeof(sPAPRXive),
>> +    .class_init = spapr_xive_class_init,
>> +};
>> +
>> +static void spapr_xive_register_types(void)
>> +{
>> +    type_register_static(&spapr_xive_info);
>> +}
>> +
>> +type_init(spapr_xive_register_types)
>> +
>> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +
>> +    if (lisn >= xive->nr_irqs) {
>> +        return false;
>> +    }
>> +
>> +    xive->eat[lisn].w |= EAS_VALID;
>> +    xive_source_irq_set(xsrc, lisn, lsi);
>> +    return true;
>> +}
>> +
>> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +
>> +    if (lisn >= xive->nr_irqs) {
>> +        return false;
>> +    }
>> +
>> +    xive->eat[lisn].w &= ~EAS_VALID;
>> +    xive_source_irq_set(xsrc, lisn, false);
>> +    return true;
>> +}
>> +
>> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +
>> +    if (lisn >= xive->nr_irqs) {
>> +        return NULL;
>> +    }
>> +
>> +    if (!(xive->eat[lisn].w & EAS_VALID)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
> 
> I don't think this is a guest error - getting the qirq by number
> should generally be something qemu code does.

Even if the IRQ was not defined by the machine? The EAS_VALID bit is
raised when the IRQ is enabled at the XIVE level, which means that the
IRQ number has been claimed by some device of the machine. You cannot
get a qirq by number for some random IRQ number, can you?

Thanks,

C. 

> 
>> +        return NULL;
>> +    }
>> +
>> +    return xive_source_qirq(xsrc, lisn);
>> +}
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index 72a46ed91c31..301a8e972d91 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>  obj-$(CONFIG_XIVE) += xive.o
>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> 

* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-28  2:39   ` David Gibson
@ 2018-11-28 16:48     ` Cédric Le Goater
  2018-11-29  1:00       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 16:48 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 3:39 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:04AM +0100, Cédric Le Goater wrote:
>> The IVPE scans the O/S CAM line of the XIVE thread interrupt contexts
>> to find a matching Notification Virtual Target (NVT) among the NVTs
>> dispatched on the HW processor threads.
>>
>> On a real system, the thread interrupt contexts are updated by the
>> hypervisor when a Virtual Processor is scheduled to run on a HW
>> thread. Under QEMU, the model emulates the same behavior by hardwiring
>> the NVT identifier in the thread context registers at reset.
>>
>> The NVT identifier used by the sPAPRXive model is the VCPU id. The END
>> identifier is also derived from the VCPU id. A set of helpers doing
>> the conversion between identifiers are provided for the hcalls
>> configuring the sources and the ENDs.
>>
>> The model does not need a NVT table but the XiveRouter NVT operations
>> are provided to perform some extra checks in the routing algorithm.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_xive.h |  17 +++++
>>  include/hw/ppc/xive.h       |   3 +
>>  hw/intc/spapr_xive.c        | 136 ++++++++++++++++++++++++++++++++++++
>>  hw/intc/xive.c              |   9 +++
>>  4 files changed, 165 insertions(+)
>>
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 06727bd86aa9..3f65b8f485fd 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -43,4 +43,21 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>>  qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
>>  
>> +/*
>> + * sPAPR NVT and END indexing helpers
>> + */
>> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
>> +                                  uint32_t nvt_idx);
>> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
>> +                            uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
>> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
>> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
>> +
>> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
>> +                             uint32_t *out_server, uint8_t *out_prio);
>> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>> +                             uint8_t *out_end_blk, uint32_t *out_end_idx);
>> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>> +                          uint8_t *out_end_blk, uint32_t *out_end_idx);
>> +
>>  #endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index e715a6c6923d..e6931ddaa83f 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -187,6 +187,8 @@ typedef struct XiveRouter {
>>  #define XIVE_ROUTER_GET_CLASS(obj)                              \
>>      OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
>>  
>> +typedef struct XiveTCTX XiveTCTX;
>> +
>>  typedef struct XiveRouterClass {
>>      SysBusDeviceClass parent;
>>  
>> @@ -201,6 +203,7 @@ typedef struct XiveRouterClass {
>>                     XiveNVT *nvt);
>>      int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>                     XiveNVT *nvt);
>> +    void (*reset_tctx)(XiveRouter *xrtr, XiveTCTX *tctx);
>>  } XiveRouterClass;
>>  
>>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index 5d038146c08e..3bf77ace11a2 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -199,6 +199,139 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
>>      return 0;
>>  }
>>  
>> +static int spapr_xive_get_nvt(XiveRouter *xrtr,
>> +                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +    uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
>> +    PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
>> +
>> +    if (!cpu) {
>> +        return -1;
>> +    }
>> +
>> +    /*
>> +     * sPAPR does not maintain a NVT table. Return that the NVT is
>> +     * valid if we have found a matching CPU
>> +     */
>> +    nvt->w0 = NVT_W0_VALID;
>> +    return 0;
>> +}
>> +
>> +static int spapr_xive_set_nvt(XiveRouter *xrtr,
>> +                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
>> +{
>> +    /* no NVT table */
>> +    return 0;
>> +}
>> +
>> +/*
>> + * When a Virtual Processor is scheduled to run on a HW thread, the
>> + * hypervisor pushes its identifier in the OS CAM line. Under QEMU, we
>> + * need to emulate the same behavior.
>> + */
>> +static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
>> +{
>> +    uint8_t  nvt_blk;
>> +    uint32_t nvt_idx;
>> +    uint32_t nvt_cam;
>> +
>> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
>> +                          &nvt_blk, &nvt_idx);
>> +
>> +    nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
>> +    memcpy(&tctx->regs[TM_QW1_OS + TM_WORD2], &nvt_cam, 4);
>> +}
>> +
>> +/*
>> + * The allocation of VP blocks is a complex operation in OPAL and the
>> + * VP identifiers have a relation with the number of HW chips, the
>> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
>> + * controller model does not have the same constraints and can use a
>> + * simple mapping scheme of the CPU vcpu_id
>> + *
>> + * These identifiers are never returned to the OS.
>> + */
>> +
>> +#define SPAPR_XIVE_VP_BASE 0x400
> 
> 0x400 == 1024.  Could we ever have the possibility of needing to
> consider both physical NVTs and PAPR NVTs at the same time?  

They would not be in the same CAM line: OS ring vs. PHYS ring. 

> If so, does this base leave enough space for the physical ones?

I only used 0x400 to map the VP identifier to the ones allocated by KVM. 
0x0 would be fine but to exercise the model, it's better having a different 
base. 
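
A minimal standalone sketch of the mapping, assuming only the 0x400 base
from the patch. The helper names are hypothetical and do not match the
prototypes in spapr_xive.h:

```c
#include <assert.h>
#include <stdint.h>

/* Base taken from the patch; any value works as long as both
 * directions of the mapping agree on it. */
#define SPAPR_XIVE_VP_BASE 0x400

/* Hypothetical helper: VCPU id to NVT index. */
static inline uint32_t vcpu_id_to_nvt_idx(uint32_t vcpu_id)
{
    return SPAPR_XIVE_VP_BASE + vcpu_id;
}

/* Hypothetical helper: NVT index back to VCPU id. */
static inline uint32_t nvt_idx_to_vcpu_id(uint32_t nvt_idx)
{
    /* An index below the base cannot have come from the mapping. */
    assert(nvt_idx >= SPAPR_XIVE_VP_BASE);
    return nvt_idx - SPAPR_XIVE_VP_BASE;
}
```

A non-zero base makes a mix-up between a raw VCPU id and an NVT index
visible right away, which is the only reason for it in this sketch.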

>> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
>> +                                  uint32_t nvt_idx)
>> +{
>> +    return nvt_idx - SPAPR_XIVE_VP_BASE;
>> +}
>> +
>> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
>> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
> 
> A number of these conversions will come out a bit simpler if we pass
> the block and index around as a single word in most places.

Yes, but I have to check the whole patchset first. These prototype changes
are not too difficult in terms of code complexity but they do break
how patches apply, and PowerNV uses the idx and blk much more
explicitly: the block has a meaning on bare metal. So I am a bit
reluctant to do so. I will check.
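
For the sake of discussion, a single-word form could look like the sketch
below. The 19-bit index width and the nvt_pack/nvt_unpack_* names are
assumptions made for illustration, not something this patchset defines:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout, for illustration only: index in the low 19 bits,
 * block number in the bits above. */
#define NVT_IDX_BITS 19
#define NVT_IDX_MASK ((1u << NVT_IDX_BITS) - 1)

/* Combine block and index into one word. */
static inline uint32_t nvt_pack(uint8_t blk, uint32_t idx)
{
    assert(idx <= NVT_IDX_MASK);
    return ((uint32_t)blk << NVT_IDX_BITS) | idx;
}

/* Extract the block number from a packed word. */
static inline uint8_t nvt_unpack_blk(uint32_t word)
{
    return (uint8_t)(word >> NVT_IDX_BITS);
}

/* Extract the index from a packed word. */
static inline uint32_t nvt_unpack_idx(uint32_t word)
{
    return word & NVT_IDX_MASK;
}
```

Callers that care about the block, as on bare metal, would still have to
unpack it, which is the cost being weighed above.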

>> +{
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +
>> +    if (!cpu) {
>> +        return -1;
>> +    }
>> +
>> +    if (out_nvt_blk) {
>> +        /* For testing purpose, we could use 0 for nvt_blk */
>> +        *out_nvt_blk = xrtr->chip_id;
> 
> I don't see any point using the chip_id here, which is currently
> always set to 0 for PAPR anyway.  If we just hardwire this to 0 it
> removes the only use here of xrtr, which will allow some further
> simplifications in the caller, I think.

You are right about the simplification. It was one way to exercise 
the router model and remove any shortcuts in the indexing. I kept 
it to be sure I was not tempted to invent new ones. I think we can
remove it before merging. 

> 
>> +    }
>> +
>> +    if (out_nvt_idx) {
>> +        *out_nvt_idx = SPAPR_XIVE_VP_BASE + cpu->vcpu_id;
>> +    }
>> +    return 0;
>> +}
>> +
>> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
>> +                             uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
> 
> I suspect some, maybe most of these conversion functions could be static.

static inline ? 

> 
>> +{
>> +    return spapr_xive_cpu_to_nvt(xive, spapr_find_cpu(target), out_nvt_blk,
>> +                                 out_nvt_idx);
>> +}
>> +
>> +/*
>> + * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
>> + * priorities per CPU
>> + */
>> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
>> +                             uint32_t *out_server, uint8_t *out_prio)
>> +{
>> +    if (out_server) {
>> +        *out_server = end_idx >> 3;
>> +    }
>> +
>> +    if (out_prio) {
>> +        *out_prio = end_idx & 0x7;
>> +    }
>> +    return 0;
>> +}
>> +
>> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>> +                          uint8_t *out_end_blk, uint32_t *out_end_idx)
>> +{
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +
>> +    if (!cpu) {
>> +        return -1;
> 
> Is there ever a reason this would be called with cpu == NULL?  If not
> might as well just assert() here rather than pushing the error
> handling back to the caller.

ok. yes.
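
A standalone sketch of the assert() variant together with the 8-ENDs-per-CPU
indexing from the patch. DemoCPU and the demo_* functions are hypothetical
stand-ins, not the QEMU types or prototypes:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for PowerPCCPU; only vcpu_id matters here. */
typedef struct DemoCPU {
    uint32_t vcpu_id;
} DemoCPU;

/* 8 priority ENDs per CPU: end_idx = vcpu_id * 8 + prio */
static inline uint32_t demo_cpu_to_end_idx(const DemoCPU *cpu, uint8_t prio)
{
    assert(cpu != NULL);   /* a NULL CPU is a caller bug, not a runtime error */
    assert(prio < 8);
    return (cpu->vcpu_id << 3) + prio;
}

/* Reverse mapping: split an END index into server and priority. */
static inline void demo_end_idx_to_target(uint32_t end_idx,
                                          uint32_t *out_server,
                                          uint8_t *out_prio)
{
    if (out_server) {
        *out_server = end_idx >> 3;
    }
    if (out_prio) {
        *out_prio = end_idx & 0x7;
    }
}
```

With the assert, the function cannot fail anymore and can return the index
directly instead of an error code plus an out parameter.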

> 
>> +    }
>> +
>> +    if (out_end_blk) {
>> +        /* For testing purpose, we could use 0 for end_blk */
>> +        *out_end_blk = xrtr->chip_id;
> 
> Again, I don't see any point to using the chip_id, which is pretty
> meaningless for PAPR.
> 
>> +    }
>> +
>> +    if (out_end_idx) {
>> +        *out_end_idx = (cpu->vcpu_id << 3) + prio;
>> +    }
>> +    return 0;
>> +}
>> +
>> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>> +                             uint8_t *out_end_blk, uint32_t *out_end_idx)
>> +{
>> +    return spapr_xive_cpu_to_end(xive, spapr_find_cpu(target), prio,
>> +                                 out_end_blk, out_end_idx);
>> +}
>> +
>>  static const VMStateDescription vmstate_spapr_xive_end = {
>>      .name = TYPE_SPAPR_XIVE "/end",
>>      .version_id = 1,
>> @@ -263,6 +396,9 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>      xrc->set_eas = spapr_xive_set_eas;
>>      xrc->get_end = spapr_xive_get_end;
>>      xrc->set_end = spapr_xive_set_end;
>> +    xrc->get_nvt = spapr_xive_get_nvt;
>> +    xrc->set_nvt = spapr_xive_set_nvt;
>> +    xrc->reset_tctx = spapr_xive_reset_tctx;
>>  }
>>  
>>  static const TypeInfo spapr_xive_info = {
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index c49932d2b799..fc6ef5895e6d 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -481,6 +481,7 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>>  static void xive_tctx_reset(void *dev)
>>  {
>>      XiveTCTX *tctx = XIVE_TCTX(dev);
>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>>  
>>      memset(tctx->regs, 0, sizeof(tctx->regs));
>>  
>> @@ -495,6 +496,14 @@ static void xive_tctx_reset(void *dev)
>>       */
>>      tctx->regs[TM_QW1_OS + TM_PIPR] =
>>          ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
>> +
>> +    /*
>> +     * QEMU sPAPR XIVE only. To let the controller model reset the OS
>> +     * CAM line with the VP identifier.
>> +     */
>> +    if (xrc->reset_tctx) {
>> +        xrc->reset_tctx(tctx->xrtr, tctx);
>> +    }
> 
> AFAICT this whole function is only used from PAPR, so you can just
> move the whole thing to the papr code and avoid the hook function.

Yes, we could add a loop on all CPUs and reset all the XiveTCTX from
the machine or from a spapr_irq->reset handler. We will need it at some
point anyhow.

Thanks,

C.


> 
>>  }
>>  
>>  static void xive_tctx_realize(DeviceState *dev, Error **errp)
> 

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-28  9:35     ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
@ 2018-11-28 16:50       ` Cédric Le Goater
  2018-11-28 16:59         ` Greg Kurz
  2018-11-29  1:02       ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 16:50 UTC (permalink / raw)
  To: Greg Kurz, David Gibson; +Cc: qemu-ppc, qemu-devel

On 11/28/18 10:35 AM, Greg Kurz wrote:
> On Wed, 28 Nov 2018 13:57:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
>> On Fri, Nov 16, 2018 at 11:57:05AM +0100, Cédric Le Goater wrote:
>>> We will need to use xics_max_server_number() to create the sPAPRXive
>>> object modeling the interrupt controller of the machine which is
>>> created before the CPUs.
>>>
>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>  
>>
>> My only concern here is that this moves the spapr_set_vsmt_mode()
>> before some of the sanity checks in spapr_init_cpus().  Are we certain
>> there are no edge cases that could cause badness?
>>
> 
> The early checks in spapr_init_cpus() filter out topologies that would
> result in partially filled cores. They're only related to the rest of
> the code that creates the boot CPUs. Before commit 1a5008fc17,
> spapr_set_vsmt_mode() was even being called before spapr_init_cpus().
> The rationale to move it there was to ensure it is called before the
> first user of spapr->vsmt, which happens to be a call to
> xics_max_server_number().
> 
> Now that xics_max_server_number() needs to be called even earlier, I think a
> better change is to have xics_max_server_number() call spapr_set_vsmt_mode()
> if spapr->vsmt isn't set.

That 'smt' routine is black magic to me and I would not dare touch it.

C.
 
> 
>>> ---
>>>  hw/ppc/spapr.c | 10 +++++-----
>>>  1 file changed, 5 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index 7afd1a175bf2..50cb9f9f4a02 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
>>>          boot_cores_nr = possible_cpus->len;
>>>      }
>>>  
>>> -    /* VSMT must be set in order to be able to compute VCPU ids, ie to
>>> -     * call xics_max_server_number() or spapr_vcpu_id().
>>> -     */
>>> -    spapr_set_vsmt_mode(spapr, &error_fatal);
>>> -
>>>      if (smc->pre_2_10_has_unused_icps) {
>>>          int i;
>>>  
>>> @@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
>>>      /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
>>>      load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
>>>  
>>> +    /* VSMT must be set in order to be able to compute VCPU ids, ie to
>>> +     * call xics_max_server_number() or spapr_vcpu_id().
>>> +     */
>>> +    spapr_set_vsmt_mode(spapr, &error_fatal);
>>> +
>>>      /* Set up Interrupt Controller before we create the VCPUs */
>>>      smc->irq->init(spapr, &error_fatal);
>>>    
>>
> 

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-28 16:50       ` Cédric Le Goater
@ 2018-11-28 16:59         ` Greg Kurz
  0 siblings, 0 replies; 184+ messages in thread
From: Greg Kurz @ 2018-11-28 16:59 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: David Gibson, qemu-ppc, qemu-devel

On Wed, 28 Nov 2018 17:50:21 +0100
Cédric Le Goater <clg@kaod.org> wrote:

> On 11/28/18 10:35 AM, Greg Kurz wrote:
> > On Wed, 28 Nov 2018 13:57:14 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> >> On Fri, Nov 16, 2018 at 11:57:05AM +0100, Cédric Le Goater wrote:  
> >>> We will need to use xics_max_server_number() to create the sPAPRXive
> >>> object modeling the interrupt controller of the machine which is
> >>> created before the CPUs.
> >>>
> >>> Signed-off-by: Cédric Le Goater <clg@kaod.org>    
> >>
> >> My only concern here is that this moves the spapr_set_vsmt_mode()
> >> before some of the sanity checks in spapr_init_cpus().  Are we certain
> >> there are no edge cases that could cause badness?
> >>  
> > 
> > The early checks in spapr_init_cpus() filter out topologies that would
> > result in partially filled cores. They're only related to the rest of
> > the code that creates the boot CPUs. Before commit 1a5008fc17,
> > spapr_set_vsmt_mode() was even being called before spapr_init_cpus().
> > The rationale to move it there was to ensure it is called before the
> > first user of spapr->vsmt, which happens to be a call to
> > xics_max_server_number().
> > 
> > Now that xics_max_server_number() needs to be called even earlier, I think a
> > better change is to have xics_max_server_number() call spapr_set_vsmt_mode()
> > if spapr->vsmt isn't set.
> 
> That 'smt' routine is black magic to me and I would not dare touch it.
> 

I'm not suggesting changing it, just ensuring it gets called before spapr->vsmt
gets used. But don't worry, I'll have a deeper look and send a patch :)
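
Roughly, the idea is the lazy-initialization pattern sketched below. This is
a standalone illustration with hypothetical names (DemoMachine, demo_*); the
VSMT policy is a placeholder, and the rounding formula is an assumption
rather than a copy of the real xics_max_server_number():

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for sPAPRMachineState. */
typedef struct DemoMachine {
    uint32_t vsmt;          /* 0 means "not computed yet" */
    uint32_t smp_threads;   /* guest threads per core */
    uint32_t max_cpus;
    int      vsmt_calls;    /* counts how often the setup ran */
} DemoMachine;

/* Placeholder policy; the real spapr_set_vsmt_mode() is more involved. */
static inline void demo_set_vsmt_mode(DemoMachine *m)
{
    m->vsmt = m->smp_threads > 8 ? m->smp_threads : 8;
    m->vsmt_calls++;
}

/* The first user of vsmt triggers its computation, whoever that is. */
static inline uint32_t demo_max_server_number(DemoMachine *m)
{
    assert(m->smp_threads > 0);
    if (!m->vsmt) {
        demo_set_vsmt_mode(m);
    }
    /* Round up max_cpus * vsmt / smp_threads. */
    return (m->max_cpus * m->vsmt + m->smp_threads - 1) / m->smp_threads;
}
```

The call order in spapr_machine_init() then stops mattering, since any
early caller computes VSMT on demand and later callers reuse the value.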

> C.
>  
> >   
> >>> ---
> >>>  hw/ppc/spapr.c | 10 +++++-----
> >>>  1 file changed, 5 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index 7afd1a175bf2..50cb9f9f4a02 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
> >>>          boot_cores_nr = possible_cpus->len;
> >>>      }
> >>>  
> >>> -    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> >>> -     * call xics_max_server_number() or spapr_vcpu_id().
> >>> -     */
> >>> -    spapr_set_vsmt_mode(spapr, &error_fatal);
> >>> -
> >>>      if (smc->pre_2_10_has_unused_icps) {
> >>>          int i;
> >>>  
> >>> @@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
> >>>      /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
> >>>      load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
> >>>  
> >>> +    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> >>> +     * call xics_max_server_number() or spapr_vcpu_id().
> >>> +     */
> >>> +    spapr_set_vsmt_mode(spapr, &error_fatal);
> >>> +
> >>>      /* Set up Interrupt Controller before we create the VCPUs */
> >>>      smc->irq->init(spapr, &error_fatal);
> >>>      
> >>  
> >   
> 

* Re: [Qemu-devel] [PATCH v5 15/36] spapr: introdude a new machine IRQ backend for XIVE
  2018-11-28  3:28   ` David Gibson
@ 2018-11-28 17:16     ` Cédric Le Goater
  2018-11-29  1:07       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 17:16 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 4:28 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:08AM +0100, Cédric Le Goater wrote:
>> The XIVE IRQ backend uses the same layout as the new XICS backend but
>> covers the full range of the IRQ number space. The IRQ numbers for the
>> CPU IPIs are allocated at the bottom of this space, below 4K, to
>> preserve compatibility with XICS which does not use that range.
>>
>> This should be enough given that the maximum number of CPUs is 1024
>> for the sPAPR machine under QEMU. For the record, the biggest POWER8
>> or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
>> cores, SMT8).
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr.h     |   2 +
>>  include/hw/ppc/spapr_irq.h |   7 ++-
>>  hw/ppc/spapr.c             |   2 +-
>>  hw/ppc/spapr_irq.c         | 119 ++++++++++++++++++++++++++++++++++++-
>>  4 files changed, 124 insertions(+), 6 deletions(-)
>>
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 6279711fe8f7..1fbc2663e06c 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
>>  typedef struct sPAPREventSource sPAPREventSource;
>>  typedef struct sPAPRPendingHPT sPAPRPendingHPT;
>>  typedef struct ICSState ICSState;
>> +typedef struct sPAPRXive sPAPRXive;
>>  
>>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
>>  #define SPAPR_ENTRY_POINT       0x100
>> @@ -175,6 +176,7 @@ struct sPAPRMachineState {
>>      const char *icp_type;
>>      int32_t irq_map_nr;
>>      unsigned long *irq_map;
>> +    sPAPRXive  *xive;
>>  
>>      bool cmd_line_caps[SPAPR_CAP_NUM];
>>      sPAPRCapabilities def, eff, mig;
>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>> index 0e9229bf219e..c854ae527808 100644
>> --- a/include/hw/ppc/spapr_irq.h
>> +++ b/include/hw/ppc/spapr_irq.h
>> @@ -13,6 +13,7 @@
>>  /*
>>   * IRQ range offsets per device type
>>   */
>> +#define SPAPR_IRQ_IPI        0x0
>>  #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
>>  #define SPAPR_IRQ_HOTPLUG    0x1001
>>  #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
>> @@ -33,7 +34,8 @@ typedef struct sPAPRIrq {
>>      uint32_t    nr_irqs;
>>      uint32_t    nr_msis;
>>  
>> -    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
>> +    void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
>> +                 Error **errp);
>>      int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
>>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
>> @@ -42,8 +44,9 @@ typedef struct sPAPRIrq {
>>  
>>  extern sPAPRIrq spapr_irq_xics;
>>  extern sPAPRIrq spapr_irq_xics_legacy;
>> +extern sPAPRIrq spapr_irq_xive;
>>  
>> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
>> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
> 
> I don't see why nr_servers needs to become a parameter, since it can
> be derived from spapr within this routine.

OK, this is true. We can use xics_max_server_number(spapr) directly.

>>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index e470efe7993c..9f8c19e56e7a 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
>>      spapr_set_vsmt_mode(spapr, &error_fatal);
>>  
>>      /* Set up Interrupt Controller before we create the VCPUs */
>> -    spapr_irq_init(spapr, &error_fatal);
>> +    spapr_irq_init(spapr, xics_max_server_number(spapr), &error_fatal);
> 
> We should rename xics_max_server_number() since it's no longer xics
> specific.

yes.

>>      /* Set up containers for ibm,client-architecture-support negotiated options
>>       */
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index bac450ffff23..2569ae1bc7f8 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -12,6 +12,7 @@
>>  #include "qemu/error-report.h"
>>  #include "qapi/error.h"
>>  #include "hw/ppc/spapr.h"
>> +#include "hw/ppc/spapr_xive.h"
>>  #include "hw/ppc/xics.h"
>>  #include "sysemu/kvm.h"
>>  
>> @@ -91,7 +92,7 @@ error:
>>  }
>>  
>>  static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
>> -                                Error **errp)
>> +                                int nr_servers, Error **errp)
>>  {
>>      MachineState *machine = MACHINE(spapr);
>>      Error *local_err = NULL;
>> @@ -204,10 +205,122 @@ sPAPRIrq spapr_irq_xics = {
>>      .print_info  = spapr_irq_print_info_xics,
>>  };
>>  
>> + /*
>> + * XIVE IRQ backend.
>> + */
>> +static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
>> +                                    const char *type_xive, int nr_irqs,
>> +                                    int nr_servers, Error **errp)
>> +{
>> +    sPAPRXive *xive;
>> +    Error *local_err = NULL;
>> +    Object *obj;
>> +    uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
>> +    int i;
>> +
>> +    obj = object_new(type_xive);
> 
> What's the reason for making the type a parameter, rather than just
> using the #define here?

KVM.

>> +    object_property_set_int(obj, nr_irqs, "nr-irqs", &error_abort);
>> +    object_property_set_int(obj, nr_ends, "nr-ends", &error_abort);
> 
> This is still within the sPAPR code, and you have a pointer to the
> MachineState, so I don't see why you couldn't just derive nr_irqs and
> nr_servers from that, rather than having them passed in.

For nr_servers, I agree. nr_irqs comes from the machine class, and it would
not make sense to use the machine class in the init routine of the
'dual' sPAPR IRQ backend supporting both modes. See patch 34, which
initializes both backends for the 'dual' machine.
 
>> +    object_property_set_bool(obj, true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return NULL;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
> 
> Whereas the XiveSource and XiveRouter I think make more sense as
> "device components" rather than SysBusDevice subclasses, 

Yes. I changed that.

> I think it
> *does* make sense for the PAPR-XIVE object to be a full fledged
> SysBusDevice.

Ah. That I didn't do but thinking of it, it makes sense as it is the
object managing the TIMA and ESB memory region mapping for the machine. 

> And for that reason, I think it makes more sense to create it with
> qdev_create(), which should avoid having to manually fiddle with the
> parent bus.

OK. I will give it a try. 

>> +    xive = SPAPR_XIVE(obj);
>> +
>> +    /* Enable the CPU IPIs */
>> +    for (i = 0; i < nr_servers; ++i) {
>> +        spapr_xive_irq_enable(xive, SPAPR_IRQ_IPI + i, false);
> 
> This comment possibly belonged on an earlier patch.  I don't love the
> "..._enable" name - to me that suggests something runtime rather than
> configuration time.  A better option isn't quickly occurring to me
> though :/.

Instead, I could call the sPAPR IRQ claim method:

    for (i = 0; i < nr_servers; ++i) {
        spapr_irq_xive.claim(spapr, SPAPR_IRQ_IPI + i, false, &local_err);
    }

What it does is set the EAS_VALID bit in the EAT (it also sets the
LSI bit). What about:

    spapr_xive_irq_validate()
    spapr_xive_irq_invalidate()

or, to match the sPAPR IRQ backend names:

    spapr_xive_irq_claim()
    spapr_xive_irq_free()


> 
>> +    }
>> +
>> +    return xive;
>> +}
>> +
>> +static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>> +                                int nr_servers, Error **errp)
>> +{
>> +    MachineState *machine = MACHINE(spapr);
>> +    Error *local_err = NULL;
>> +
>> +    /* KVM XIVE support */
>> +    if (kvm_enabled()) {
>> +        if (machine_kernel_irqchip_required(machine)) {
>> +            error_setg(errp, "kernel_irqchip requested. no XIVE support");
>> +            return;
>> +        }
>> +    }
>> +
>> +    /* QEMU XIVE support */
>> +    spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, nr_servers,
>> +                                    &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +}
>> +
>> +static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
>> +                                Error **errp)
>> +{
>> +    if (!spapr_xive_irq_enable(spapr->xive, irq, lsi)) {
>> +        error_setg(errp, "IRQ %d is invalid", irq);
>> +        return -1;
>> +    }
>> +    return 0;
>> +}
>> +
>> +static void spapr_irq_free_xive(sPAPRMachineState *spapr, int irq, int num)
>> +{
>> +    int i;
>> +
>> +    for (i = irq; i < irq + num; ++i) {
>> +        spapr_xive_irq_disable(spapr->xive, i);
>> +    }
>> +}
>> +
>> +static qemu_irq spapr_qirq_xive(sPAPRMachineState *spapr, int irq)
>> +{
>> +    return spapr_xive_qirq(spapr->xive, irq);
>> +}
>> +
>> +static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>> +                                      Monitor *mon)
>> +{
>> +    CPUState *cs;
>> +
>> +    CPU_FOREACH(cs) {
>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>> +
>> +        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
>> +    }
>> +
>> +    spapr_xive_pic_print_info(spapr->xive, mon);
> 
> Any reason the info dumping routines are split into two?

They are not the same objects. Are you suggesting that we could print all the
info from the sPAPR XIVE model, including the XiveTCTX? I thought of doing
that also. Fine for me if it's ok for you.

Thanks,

C.

> 
>> +}
>> +
>> +/*
>> + * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>> + * with XICS.
>> + */
>> +
>> +#define SPAPR_IRQ_XIVE_NR_IRQS     0x2000
>> +#define SPAPR_IRQ_XIVE_NR_MSIS     (SPAPR_IRQ_XIVE_NR_IRQS - SPAPR_IRQ_MSI)
>> +
>> +sPAPRIrq spapr_irq_xive = {
>> +    .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
>> +    .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
>> +
>> +    .init        = spapr_irq_init_xive,
>> +    .claim       = spapr_irq_claim_xive,
>> +    .free        = spapr_irq_free_xive,
>> +    .qirq        = spapr_qirq_xive,
>> +    .print_info  = spapr_irq_print_info_xive,
>> +};
>> +
>>  /*
>>   * sPAPR IRQ frontend routines for devices
>>   */
>> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
>> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp)
>>  {
>>      sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>  
>> @@ -216,7 +329,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
>>          spapr_irq_msi_init(spapr, smc->irq->nr_msis);
>>      }
>>  
>> -    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
>> +    smc->irq->init(spapr, smc->irq->nr_irqs, nr_servers, errp);
>>  }
>>  
>>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
> 

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-28  4:25   ` David Gibson
@ 2018-11-28 22:21     ` Cédric Le Goater
  2018-11-29  1:23       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 22:21 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 5:25 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
>> The different XIVE virtualization structures (sources and event queues)
>> are configured with a set of Hypervisor calls :
>>
>>  - H_INT_GET_SOURCE_INFO
>>
>>    used to obtain the address of the MMIO page of the Event State
>>    Buffer (ESB) entry associated with the source.
>>
>>  - H_INT_SET_SOURCE_CONFIG
>>
>>    assigns a source to a "target".
>>
>>  - H_INT_GET_SOURCE_CONFIG
>>
>>    determines which "target" and "priority" is assigned to a source
>>
>>  - H_INT_GET_QUEUE_INFO
>>
>>    returns the address of the notification management page associated
>>    with the specified "target" and "priority".
>>
>>  - H_INT_SET_QUEUE_CONFIG
>>
>>    sets or resets the event queue for a given "target" and "priority".
>>    It is also used to set the notification configuration associated
>>    with the queue, only unconditional notification is supported for
>>    the moment. Reset is performed with a queue size of 0 and queueing
>>    is disabled in that case.
>>
>>  - H_INT_GET_QUEUE_CONFIG
>>
>>    returns the queue settings for a given "target" and "priority".
>>
>>  - H_INT_RESET
>>
>>    resets all of the guest's internal interrupt structures to their
>>    initial state, losing all configuration set via the hcalls
>>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
>>
>>  - H_INT_SYNC
>>
>>    issue a synchronisation on a source to make sure all notifications
>>    have reached their queue.
>>
>> Calls that still need to be addressed :
>>
>>    H_INT_SET_OS_REPORTING_LINE
>>    H_INT_GET_OS_REPORTING_LINE
>>
>> See the code for more documentation on each hcall.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr.h      |  15 +-
>>  include/hw/ppc/spapr_xive.h |   6 +
>>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
>>  hw/ppc/spapr_irq.c          |   2 +
>>  hw/intc/Makefile.objs       |   2 +-
>>  5 files changed, 915 insertions(+), 2 deletions(-)
>>  create mode 100644 hw/intc/spapr_xive_hcall.c
>>
>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>> index 1fbc2663e06c..8415faea7b82 100644
>> --- a/include/hw/ppc/spapr.h
>> +++ b/include/hw/ppc/spapr.h
>> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
>>  #define H_INVALIDATE_PID        0x378
>>  #define H_REGISTER_PROC_TBL     0x37C
>>  #define H_SIGNAL_SYS_RESET      0x380
>> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
>> +
>> +#define H_INT_GET_SOURCE_INFO   0x3A8
>> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
>> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
>> +#define H_INT_GET_QUEUE_INFO    0x3B4
>> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
>> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
>> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
>> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
>> +#define H_INT_ESB               0x3C8
>> +#define H_INT_SYNC              0x3CC
>> +#define H_INT_RESET             0x3D0
>> +
>> +#define MAX_HCALL_OPCODE        H_INT_RESET
>>  
>>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>>   * as well.
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 3f65b8f485fd..418511f3dc10 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
>>  
>> +bool spapr_xive_priority_is_valid(uint8_t priority);
> 
> AFAICT this could be a local function.

The KVM model also uses it, when collecting state from the KVM device
to build the QEMU ENDT.

>> +
>> +typedef struct sPAPRMachineState sPAPRMachineState;
>> +
>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>> +
>>  #endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
>> new file mode 100644
>> index 000000000000..52e4e23995f5
>> --- /dev/null
>> +++ b/hw/intc/spapr_xive_hcall.c
>> @@ -0,0 +1,892 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "cpu.h"
>> +#include "hw/ppc/fdt.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/ppc/spapr_xive.h"
>> +#include "hw/ppc/xive_regs.h"
>> +#include "monitor/monitor.h"
> 
> Fwiw, I don't think it's particularly necessary to split the hcall
> handling out into a separate .c file.

ok. Let's move it to spapr_xive then? It might help reduce the
exported functions.

>> +/*
>> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
>> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
>> + * available for the guest.
> 
> Referencing OPAL behaviour doesn't really make sense in the context of
> PAPR.  

It's an OPAL constraint which pHyp doesn't have. So it's a QEMU/KVM
constraint also.

> What I think you're getting at is that the PAPR spec only
> allows a PAPR guest to use priorities 0..6 (or at least it will if the
> XIVE updated spec ever gets published).  

It's not in the spec. The XIVE sPAPR spec should be frozen soon, btw.
 
> The fact that this allows the
> host use 7 for escalations is a design rationale 
> but not really relevant to the guest device itself. 

The guest should be aware of which priorities are reserved for
the hypervisor though.
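
To make that concrete, here is a minimal sketch of how the reserved
priorities could be expressed as a bitmask and checked against it.
This is standalone C, not the patch code: RESERVED_PRIO_MASK, MAX_PRIO
and prio_is_valid() are hypothetical names, and reserving only
priority 7 is the assumption discussed above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical mask: bit N set means priority N is reserved for the
 * hypervisor. Only priority 7 is reserved here (assumption).
 */
#define RESERVED_PRIO_MASK  (1u << 7)
#define MAX_PRIO            7

static bool prio_is_valid(uint64_t priority)
{
    if (priority > MAX_PRIO) {
        return false;           /* out of range, e.g. 0xff */
    }
    return !(RESERVED_PRIO_MASK & (1u << priority));
}
```

A mask like this would also be a natural source for what the guest is
told about reserved priorities, but that part is not shown here.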

>> + */
>> +bool spapr_xive_priority_is_valid(uint8_t priority)
>> +{
>> +    switch (priority) {
>> +    case 0 ... 6:
>> +        return true;
>> +    case 7: /* OPAL escalation queue */
>> +    default:
>> +        return false;
>> +    }
>> +}
>> +
>> +/*
>> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
>> + * real address of the MMIO page through which the Event State Buffer
>> + * entry associated with the value of the "lisn" parameter is managed.
>> + *
>> + * Parameters:
>> + * Input
>> + * - "flags"
>> + *       Bits 0-63 reserved
>> + * - "lisn" is per "interrupts", "interrupt-map", or
>> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
>> + *       ibm,query-interrupt-source-number RTAS call, or as returned
>> + *       by the H_ALLOCATE_VAS_WINDOW hcall
> 
> I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
> to implement in kvm/qemu, or is it only of interest for PowerVM?

The hcall is part of the PAPR NX Interfaces and it returns interrupt
numbers. I don't know if any work has been done on the topic.  
 
> Also, putting the register numbers on the inputs as well as the
> outputs would be helpful.

yes. I will add them.

>> + *
>> + * Output
>> + * - R4: "flags"
>> + *       Bits 0-59: Reserved
>> + *       Bit 60: H_INT_ESB must be used for Event State Buffer
>> + *               management
>> + *       Bit 61: 1 == LSI  0 == MSI
>> + *       Bit 62: the full function page supports trigger
>> + *       Bit 63: Store EOI Supported
>> + * - R5: Logical Real address of full function Event State Buffer
>> + *       management page, -1 if ESB hcall flag is set to 1.
> 
> You've defined what H_INT_ESB means above, so it will be clearer if
> you reference that by name here.

yes. 

>> + * - R6: Logical Real Address of trigger only Event State Buffer
>> + *       management page or -1.
>> + * - R7: Power of 2 page size for the ESB management pages returned in
>> + *       R5 and R6.
>> + */
>> +
>> +#define SPAPR_XIVE_SRC_H_INT_ESB     PPC_BIT(60) /* ESB manage with H_INT_ESB */
>> +#define SPAPR_XIVE_SRC_LSI           PPC_BIT(61) /* Virtual LSI type */
>> +#define SPAPR_XIVE_SRC_TRIGGER       PPC_BIT(62) /* Trigger and management
>> +                                                    on same page */
>> +#define SPAPR_XIVE_SRC_STORE_EOI     PPC_BIT(63) /* Store EOI support */
> 
> Probably makes sense to put these #defines in spapr.h since they form
> part of the PAPR interface definition.

ok.
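
For readers not used to the IBM bit numbering of these flags, a small
standalone sketch. PPC_BIT() is redefined locally the way I understand
the QEMU macro to work (bit 0 is the most significant bit), and the
SRC_* names and lsi_source_flags() are shortened, hypothetical
stand-ins for the defines quoted above.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit of 64 */
#define PPC_BIT(bit)   (0x8000000000000000ULL >> (bit))

/* Shortened stand-ins for the H_INT_GET_SOURCE_INFO output flags */
#define SRC_H_INT_ESB  PPC_BIT(60)  /* ESB managed with H_INT_ESB */
#define SRC_LSI        PPC_BIT(61)  /* virtual LSI */
#define SRC_TRIGGER    PPC_BIT(62)  /* trigger on the management page */
#define SRC_STORE_EOI  PPC_BIT(63)  /* store EOI supported */

/* Hypothetical helper: flags reported for an LSI source */
static uint64_t lsi_source_flags(void)
{
    return SRC_H_INT_ESB | SRC_LSI;
}
```

So bit 63 is the value 1 and bit 60 is the value 8, which is easy to
get backwards when reading the hcall description.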

> 
>> +static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
>> +                                          sPAPRMachineState *spapr,
>> +                                          target_ulong opcode,
>> +                                          target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveSource *xsrc = &xive->source;
>> +    XiveEAS eas;
>> +    target_ulong flags  = args[0];
>> +    target_ulong lisn   = args[1];
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (!(eas.w & EAS_VALID)) {
>> +        return H_P2;
>> +    }
>> +
>> +    /* All sources are emulated under the main XIVE object and share
>> +     * the same characteristics.
>> +     */
>> +    args[0] = 0;
>> +    if (!xive_source_esb_has_2page(xsrc)) {
>> +        args[0] |= SPAPR_XIVE_SRC_TRIGGER;
>> +    }
>> +    if (xsrc->esb_flags & XIVE_SRC_STORE_EOI) {
>> +        args[0] |= SPAPR_XIVE_SRC_STORE_EOI;
>> +    }
>> +
>> +    /*
>> +     * Force the use of the H_INT_ESB hcall in case of an LSI
>> +     * interrupt. This is necessary under KVM to re-trigger the
>> +     * interrupt if the level is still asserted
>> +     */
>> +    if (xive_source_irq_is_lsi(xsrc, lisn)) {
>> +        args[0] |= SPAPR_XIVE_SRC_H_INT_ESB | SPAPR_XIVE_SRC_LSI;
>> +    }
>> +
>> +    if (!(args[0] & SPAPR_XIVE_SRC_H_INT_ESB)) {
>> +        args[1] = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn);
>> +    } else {
>> +        args[1] = -1;
>> +    }
>> +
>> +    if (xive_source_esb_has_2page(xsrc)) {
>> +        args[2] = xive->vc_base + xive_source_esb_page(xsrc, lisn);
>> +    } else {
>> +        args[2] = -1;
>> +    }
> 
> Do we also need to keep this address clear in the H_INT_ESB case?

I think not, but the specs are not very clear on that topic. I will
ask for clarification and use a -1 for now. We cannot do loads on
the trigger page, so it cannot be used by the H_INT_ESB hcall.

> 
>> +    args[3] = TARGET_PAGE_SIZE;
> 
> That seems wrong.  

This is utterly wrong. It should be a power of 2 number ... I got
it right under KVM though. I guess that ioremap() under Linux rounds
up the size to the page size in use, so that's why it didn't blow
up under TCG.

> TARGET_PAGE_SIZE is generally 4kiB, but won't these usually
> actually be 64kiB?

yes. So what should I use to get a PAGE_SHIFT instead ? 
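
One option might be to return the shift of the ESB pages themselves
rather than anything derived from TARGET_PAGE_SIZE. As a rough,
standalone sketch of the conversion (page_size_to_shift() is a
hypothetical helper, not an existing QEMU function; in the model the
shift could probably come straight from the source, like the
esb_shift used for the END source):

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: return log2 of a power-of-2 page size,
 * or -1 if the size is zero or not a power of 2.
 */
static int page_size_to_shift(uint64_t size)
{
    int shift = 0;

    if (size == 0 || (size & (size - 1))) {
        return -1;
    }
    while (size >>= 1) {
        shift++;
    }
    return shift;
}
```

With 64K ESB pages this gives 16 for R7, where TARGET_PAGE_SIZE was
returning the byte size 4096.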
 
>> +
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
>> + * Interrupt Source to a target. The Logical Interrupt Source is
>> + * designated with the "lisn" parameter and the target is designated
>> + * with the "target" and "priority" parameters.  Upon return from the
>> + * hcall(), no additional interrupts will be directed to the old EQ.
>> + *
>> + * TODO: The old EQ should be investigated for interrupts that
>> + * occurred prior to or during the hcall().
> 
> Isn't that the responsibility of the guest?

It should, yes.

> 
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-61: Reserved
>> + *      Bit 62: set the "eisn" in the EA
> 
> What's the "EA"?  Do you mean the EAS?

Another XIVE acronym, EA for Event Assignment. I think we can forget
this one and just use EAS.
 
> 
>> + *      Bit 63: masks the interrupt source in the hardware interrupt
>> + *      control structure. An interrupt masked by this mechanism will
>> + *      be dropped, but its source state bits will still be
>> + *      set. There is no race-free way of unmasking and restoring the
>> + *      source. Thus this should only be used in interrupts that are
>> + *      also masked at the source, and only in cases where the
>> + *      interrupt is not meant to be used for a large amount of time
>> + *      because no valid target exists for it for example
>> + * - "lisn" is per "interrupts", "interrupt-map", or
>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>> + *      ibm,query-interrupt-source-number RTAS call, or as returned by
>> + *      the H_ALLOCATE_VAS_WINDOW hcall
>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>> + *      "ibm,ppc-interrupt-gserver#s"
>> + * - "priority" is a valid priority not in
>> + *      "ibm,plat-res-int-priorities"
>> + * - "eisn" is the guest EISN associated with the "lisn"
> 
> I don't think the EISN term has been used before in the series.  

Effective Interrupt Source Number, which is the event data enqueued
in the OS EQ.

I'm planning on adding some more acronyms used by the sPAPR hcalls
in this file. There are only a couple.
 
> I'm guessing this is the guest-assigned global interrupt number?

yes 

>> + *
>> + * Output:
>> + * - None
>> + */
>> +
>> +#define SPAPR_XIVE_SRC_SET_EISN PPC_BIT(62)
>> +#define SPAPR_XIVE_SRC_MASK     PPC_BIT(63)
>> +
>> +static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
>> +                                            sPAPRMachineState *spapr,
>> +                                            target_ulong opcode,
>> +                                            target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +    XiveEAS eas, new_eas;
>> +    target_ulong flags    = args[0];
>> +    target_ulong lisn     = args[1];
>> +    target_ulong target   = args[2];
>> +    target_ulong priority = args[3];
>> +    target_ulong eisn     = args[4];
>> +    uint8_t end_blk;
>> +    uint32_t end_idx;
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags & ~(SPAPR_XIVE_SRC_SET_EISN | SPAPR_XIVE_SRC_MASK)) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (!(eas.w & EAS_VALID)) {
>> +        return H_P2;
>> +    }
>> +
>> +    /* priority 0xff is used to reset the EAS */
>> +    if (priority == 0xff) {
>> +        new_eas.w = EAS_VALID | EAS_MASKED;
>> +        goto out;
>> +    }
>> +
>> +    if (flags & SPAPR_XIVE_SRC_MASK) {
>> +        new_eas.w = eas.w | EAS_MASKED;
>> +    } else {
>> +        new_eas.w = eas.w & ~EAS_MASKED;
>> +    }
>> +
>> +    if (!spapr_xive_priority_is_valid(priority)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>> +                      priority);
>> +        return H_P4;
>> +    }
>> +
>> +    /* Validate that "target" is part of the list of threads allocated
>> +     * to the partition. For that, find the END corresponding to the
>> +     * target.
>> +     */
>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>> +        return H_P3;
>> +    }
>> +
>> +    new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
>> +    new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
>> +
>> +    if (flags & SPAPR_XIVE_SRC_SET_EISN) {
>> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
>> +    }
>> +
>> +out:
>> +    if (xive_router_set_eas(xrtr, lisn, &new_eas)) {
>> +        return H_HARDWARE;
>> +    }
> 
> As noted earlier in the series, the spapr specific code owns the
> memory backing the EAT, so you can just access it directly rather than
> using a method here.

Yes. I will give it a try. I wonder if I need accessors for the tables?

> 
>> +
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
>> + * target/priority pair is assigned to the specified Logical Interrupt
>> + * Source.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-63 Reserved
>> + * - "lisn" is per "interrupts", "interrupt-map", or
>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>> + *      ibm,query-interrupt-source-number RTAS call, or as
>> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
>> + *
>> + * Output:
>> + * - R4: Target to which the specified Logical Interrupt Source is
>> + *       assigned
>> + * - R5: Priority to which the specified Logical Interrupt Source is
>> + *       assigned
>> + * - R6: EISN for the specified Logical Interrupt Source (this will be
>> + *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
>> + */
>> +static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
>> +                                            sPAPRMachineState *spapr,
>> +                                            target_ulong opcode,
>> +                                            target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +    target_ulong flags = args[0];
>> +    target_ulong lisn = args[1];
>> +    XiveEAS eas;
>> +    XiveEND end;
>> +    uint8_t end_blk, nvt_blk;
>> +    uint32_t end_idx, nvt_idx;
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (!(eas.w & EAS_VALID)) {
>> +        return H_P2;
>> +    }
>> +
>> +    end_blk = GETFIELD(EAS_END_BLOCK, eas.w);
>> +    end_idx = GETFIELD(EAS_END_INDEX, eas.w);
>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>> +        /* Not sure what to return here */
>> +        return H_HARDWARE;
> 
> IIUC this indicates a bug in the PAPR specific code, not the guest, so
> an assert() is probably the right answer.

ok

>> +    }
>> +
>> +    nvt_blk = GETFIELD(END_W6_NVT_BLOCK, end.w6);
>> +    nvt_idx = GETFIELD(END_W6_NVT_INDEX, end.w6);
>> +    args[0] = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
> 
> AIUI there's a specific END for each target & priority, so you could
> avoid this second level lookup, 

yes 

> although I guess this might be
> valuable if we do more complicated internal routing in the future.

I am not sure of that but I'd rather keep these converting helpers
for the moment.
 
>> +    if (eas.w & EAS_MASKED) {
>> +        args[1] = 0xff;
>> +    } else {
>> +        args[1] = GETFIELD(END_W7_F0_PRIORITY, end.w7);
>> +    }
>> +
>> +    args[2] = GETFIELD(EAS_END_DATA, eas.w);
>> +
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
>> + * address of the notification management page associated with the
>> + * specified target and priority.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *       Bits 0-63 Reserved
>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>> + *       "ibm,ppc-interrupt-gserver#s"
>> + * - "priority" is a valid priority not in
>> + *       "ibm,plat-res-int-priorities"
>> + *
>> + * Output:
>> + * - R4: Logical real address of notification page
>> + * - R5: Power of 2 page size of the notification page
>> + */
>> +static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
>> +                                         sPAPRMachineState *spapr,
>> +                                         target_ulong opcode,
>> +                                         target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveENDSource *end_xsrc = &xive->end_source;
>> +    target_ulong flags = args[0];
>> +    target_ulong target = args[1];
>> +    target_ulong priority = args[2];
>> +    XiveEND end;
>> +    uint8_t end_blk;
>> +    uint32_t end_idx;
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    /*
>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>> +     * This is not needed when running the emulation under QEMU
>> +     */
>> +
>> +    if (!spapr_xive_priority_is_valid(priority)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>> +                      priority);
>> +        return H_P3;
>> +    }
>> +
>> +    /* Validate that "target" is part of the list of threads allocated
>> +     * to the partition. For that, find the END corresponding to the
>> +     * target.
>> +     */
>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
>> +        return H_HARDWARE;
>> +    }
>> +
>> +    args[0] = xive->end_base + (1ull << (end_xsrc->esb_shift + 1)) * end_idx;
>> +    if (end.w0 & END_W0_ENQUEUE) {
>> +        args[1] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
>> +    } else {
>> +        args[1] = 0;
>> +    }
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset a EQ for
>> + * a given "target" and "priority".  It is also used to set the
>> + * notification config associated with the EQ.  An EQ size of 0 is
>> + * used to reset the EQ config for a given target and priority. If
>> + * resetting the EQ config, the END associated with the given "target"
>> + * and "priority" will be changed to disable queueing.
>> + *
>> + * Upon return from the hcall(), no additional interrupts will be
>> + * directed to the old EQ (if one was set). The old EQ (if one was
>> + * set) should be investigated for interrupts that occurred prior to
>> + * or during the hcall().
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-62: Reserved
>> + *      Bit 63: Unconditional Notify (n) per the XIVE spec
>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>> + *       "ibm,ppc-interrupt-gserver#s"
>> + * - "priority" is a valid priority not in
>> + *       "ibm,plat-res-int-priorities"
>> + * - "eventQueue": The logical real address of the start of the EQ
>> + * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
>> + *
>> + * Output:
>> + * - None
>> + */
>> +
>> +#define SPAPR_XIVE_END_ALWAYS_NOTIFY PPC_BIT(63)
>> +
>> +static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
>> +                                           sPAPRMachineState *spapr,
>> +                                           target_ulong opcode,
>> +                                           target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +    target_ulong flags = args[0];
>> +    target_ulong target = args[1];
>> +    target_ulong priority = args[2];
>> +    target_ulong qpage = args[3];
>> +    target_ulong qsize = args[4];
>> +    XiveEND end;
>> +    uint8_t end_blk, nvt_blk;
>> +    uint32_t end_idx, nvt_idx;
>> +    uint32_t qdata;
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags & ~SPAPR_XIVE_END_ALWAYS_NOTIFY) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    /*
>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>> +     * This is not needed when running the emulation under QEMU
>> +     */
>> +
>> +    if (!spapr_xive_priority_is_valid(priority)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>> +                      priority);
>> +        return H_P3;
>> +    }
>> +
>> +    /* Validate that "target" is part of the list of threads allocated
>> +     * to the partition. For that, find the END corresponding to the
>> +     * target.
>> +     */
>> +
>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>> +        return H_HARDWARE;
> 
> Again, I think this indicates a qemu (spapr) code bug, so could be an assert().

ok

> 
>> +    }
>> +
>> +    switch (qsize) {
>> +    case 12:
>> +    case 16:
>> +    case 21:
>> +    case 24:
>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
> 
> It just occurred to me that I haven't been looking for this across any
> of these reviews.  Don't you need byteswaps when accessing these
> in-memory structures?

Yes, this is done when event data is enqueued in the EQ.
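
Roughly, the enqueue path looks like the standalone sketch below. It
is a model, not the code of the series: store_be32() and eq_push() are
hypothetical names, and the entry layout (generation bit in the top
bit, event data in the low 31 bits) is my reading of the EQ format.

```c
#include <assert.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian order, whatever the host is */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Hypothetical enqueue: build the entry from the generation bit and
 * the event data, then write it at slot 'index' of the queue.
 */
static void eq_push(uint8_t *queue, uint32_t index, uint32_t gen,
                    uint32_t data)
{
    uint32_t entry = ((gen & 1u) << 31) | (data & 0x7fffffffu);

    store_be32(queue + index * 4, entry);
}
```

The END words themselves are a separate question from the queue
contents; the sketch only covers the data seen by the guest.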

> 
>> +        end.w2 = (((uint64_t)qpage)) >> 32 & 0x0fffffff;
>> +        end.w0 |= END_W0_ENQUEUE;
>> +        end.w0 = SETFIELD(END_W0_QSIZE, end.w0, qsize - 12);
>> +        break;
>> +    case 0:
>> +        /* reset queue and disable queueing */
>> +        xive_end_reset(&end);
>> +        goto out;
>> +
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
>> +                      qsize);
>> +        return H_P5;
>> +    }
>> +
>> +    if (qsize) {
>> +        /*
>> +         * Let's validate the EQ address with a read of the first EQ
>> +         * entry. We could also check that the full queue has been
>> +         * zeroed by the OS.
>> +         */
>> +        if (address_space_read(&address_space_memory, qpage,
>> +                               MEMTXATTRS_UNSPECIFIED,
>> +                               (uint8_t *) &qdata, sizeof(qdata))) {
>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
>> +                          HWADDR_PRIx "\n", qpage);
>> +            return H_P4;
> 
> Just checking the first entry doesn't seem entirely safe.  Using
> address_space_map() and making sure the returned plen doesn't get
> reduced below the queue size might be a better option.

ok. That was on my todo list.
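
The check I have in mind is along these lines. This is only a
standalone model of the range validation: the real code would rely on
address_space_map() and compare the returned length with the queue
size, as suggested. eq_range_is_valid() and its ram_size parameter are
hypothetical, and requiring natural alignment of the queue is an
assumption on my side.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical check: the EQ must be naturally aligned on its size
 * and lie entirely inside guest RAM [0, ram_size).
 */
static bool eq_range_is_valid(uint64_t qpage, unsigned qshift,
                              uint64_t ram_size)
{
    uint64_t qlen = 1ull << qshift;

    if (qpage & (qlen - 1)) {
        return false;       /* not aligned on the queue size */
    }
    if (qpage >= ram_size || qlen > ram_size - qpage) {
        return false;       /* queue runs past the end of RAM */
    }
    return true;
}
```

This covers the whole queue and not just the first entry, which is
the point of the remark.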

> 
>> +        }
>> +    }
>> +
>> +    if (spapr_xive_target_to_nvt(xive, target, &nvt_blk, &nvt_idx)) {
>> +        return H_HARDWARE;
> 
> That could be caused by a bogus 'target' value, couldn't it?  

yes. It should have returned H_P2 above when spapr_xive_target_to_end() 
is called.

> In which
> case it a) should probably be checked earlier and b) should be
> H_PARAMETER or similar, not H_HARDWARE, yes?

H_P2 again, maybe. It should be checked earlier.

> 
>> +    }
>> +
>> +    /* Ensure the priority and target are correctly set (they will not
>> +     * be right after allocation)
> 
> AIUI there's a static association from END to target in the PAPR
> model. 

yes. 8 priorities per cpu.

> So it seems to make more sense to get that set up right at
> initialization / reset, rather than doing it lazily when the 
> queue is configured.

Ah. You would preconfigure the word6 and word7 then. Yes, it would
save us some of the conversion fuss. I will look at it.
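
With a static layout of 8 ENDs per cpu, the conversions reduce to
simple arithmetic. A minimal standalone sketch, assuming the END index
is just target * 8 + priority (the helper names are hypothetical and
the real helpers also deal with the block number):

```c
#include <assert.h>
#include <stdint.h>

#define PRIO_PER_CPU 8      /* 8 priorities per cpu */

/* Hypothetical: END index for a given target and priority */
static uint32_t target_to_end_idx(uint32_t target, uint8_t prio)
{
    return target * PRIO_PER_CPU + prio;
}

/* Hypothetical: reverse conversions from an END index */
static uint32_t end_idx_to_target(uint32_t end_idx)
{
    return end_idx / PRIO_PER_CPU;
}

static uint8_t end_idx_to_prio(uint32_t end_idx)
{
    return (uint8_t)(end_idx % PRIO_PER_CPU);
}
```

If word6 and word7 are preset from this layout at reset, the hcall
would only have to touch the queue related words.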

>> +     */
>> +    end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
>> +        SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
>> +    end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, priority);
>> +
>> +    if (flags & SPAPR_XIVE_END_ALWAYS_NOTIFY) {
>> +        end.w0 |= END_W0_UCOND_NOTIFY;
>> +    } else {
>> +        end.w0 &= ~END_W0_UCOND_NOTIFY;
>> +    }
>> +
>> +    /* The generation bit for the END starts at 1 and the END page
>> +     * offset counter starts at 0.
>> +     */
>> +    end.w1 = END_W1_GENERATION | SETFIELD(END_W1_PAGE_OFF, 0ul, 0ul);
>> +    end.w0 |= END_W0_VALID;
>> +
>> +    /* TODO: issue syncs required to ensure all in-flight interrupts
>> +     * are complete on the old END */
>> +out:
>> +    /* Update END */
>> +    if (xive_router_set_end(xrtr, end_blk, end_idx, &end)) {
>> +        return H_HARDWARE;
>> +    }
> 
> Again the PAPR code owns the ENDs, so it can update them directly
> rather than going through an abstraction.

ok.

> 
>> +
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_GET_QUEUE_CONFIG hcall() is used to get a EQ for a given
>> + * target and priority.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-62: Reserved
>> + *      Bit 63: Debug: Return debug data
>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>> + *       "ibm,ppc-interrupt-gserver#s"
>> + * - "priority" is a valid priority not in
>> + *       "ibm,plat-res-int-priorities"
>> + *
>> + * Output:
>> + * - R4: "flags":
>> + *       Bits 0-61: Reserved
>> + *       Bit 62: The value of Event Queue Generation Number (g) per
>> + *              the XIVE spec if "Debug" = 1
>> + *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
>> + * - R5: The logical real address of the start of the EQ
>> + * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
>> + * - R7: The value of Event Queue Offset Counter per XIVE spec
>> + *       if "Debug" = 1, else 0
>> + *
>> + */
>> +
>> +#define SPAPR_XIVE_END_DEBUG     PPC_BIT(63)
>> +
>> +static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
>> +                                           sPAPRMachineState *spapr,
>> +                                           target_ulong opcode,
>> +                                           target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    target_ulong flags = args[0];
>> +    target_ulong target = args[1];
>> +    target_ulong priority = args[2];
>> +    XiveEND end;
>> +    uint8_t end_blk;
>> +    uint32_t end_idx;
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags & ~SPAPR_XIVE_END_DEBUG) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    /*
>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>> +     * This is not needed when running the emulation under QEMU
>> +     */
>> +
>> +    if (!spapr_xive_priority_is_valid(priority)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>> +                      priority);
>> +        return H_P3;
>> +    }
>> +
>> +    /* Validate that "target" is part of the list of threads allocated
>> +     * to the partition. For that, find the END corresponding to the
>> +     * target.
>> +     */
>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
>> +        return H_HARDWARE;
> 
> Again, assert() seems appropriate here.

ok

> 
>> +    }
>> +
>> +    args[0] = 0;
>> +    if (end.w0 & END_W0_UCOND_NOTIFY) {
>> +        args[0] |= SPAPR_XIVE_END_ALWAYS_NOTIFY;
>> +    }
>> +
>> +    if (end.w0 & END_W0_ENQUEUE) {
>> +        args[1] =
>> +            (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
>> +        args[2] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
>> +    } else {
>> +        args[1] = 0;
>> +        args[2] = 0;
>> +    }
>> +
>> +    /* TODO: do we need any locking on the END ? */
>> +    if (flags & SPAPR_XIVE_END_DEBUG) {
>> +        /* Load the event queue generation number into the return flags */
>> +        args[0] |= (uint64_t)GETFIELD(END_W1_GENERATION, end.w1) << 62;
>> +
>> +        /* Load R7 with the event queue offset counter */
>> +        args[3] = GETFIELD(END_W1_PAGE_OFF, end.w1);
>> +    } else {
>> +        args[3] = 0;
>> +    }
>> +
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
>> + * reporting cache line pair for the calling thread.  The reporting
>> + * cache lines will contain the OS interrupt context when the OS
>> + * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
>> + * interrupt. The reporting cache lines can be reset by inputting -1
>> + * in "reportingLine".  Issuing the CI store byte without reporting
>> + * cache lines registered will result in the data not being accessible
>> + * to the OS.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-63: Reserved
>> + * - "reportingLine": The logical real address of the reporting cache
>> + *    line pair
>> + *
>> + * Output:
>> + * - None
>> + */
>> +static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
>> +                                                sPAPRMachineState *spapr,
>> +                                                target_ulong opcode,
>> +                                                target_ulong *args)
>> +{
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    /*
>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>> +     * This is not needed when running the emulation under QEMU
>> +     */
>> +
>> +    /* TODO: H_INT_SET_OS_REPORTING_LINE */
>> +    return H_FUNCTION;
>> +}
>> +
>> +/*
>> + * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
>> + * real address of the reporting cache line pair set for the input
>> + * "target".  If no reporting cache line pair has been set, -1 is
>> + * returned.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-63: Reserved
>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>> + *       "ibm,ppc-interrupt-gserver#s"
>> + * - "reportingLine": The logical real address of the reporting cache
>> + *   line pair
>> + *
>> + * Output:
>> + * - R4: The logical real address of the reporting line if set, else -1
>> + */
>> +static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
>> +                                                sPAPRMachineState *spapr,
>> +                                                target_ulong opcode,
>> +                                                target_ulong *args)
>> +{
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    /*
>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>> +     * This is not needed when running the emulation under QEMU
>> +     */
>> +
>> +    /* TODO: H_INT_GET_OS_REPORTING_LINE */
>> +    return H_FUNCTION;
>> +}
>> +
>> +/*
>> + * The H_INT_ESB hcall() is used to issue a load or store to the ESB
>> + * page for the input "lisn".  This hcall is only supported for LISNs
>> + * that have the ESB hcall flag set to 1 when returned from hcall()
>> + * H_INT_GET_SOURCE_INFO.
> 
> Is there a reason for specifically restricting this to LISNs which
> advertise it, rather than allowing it for anything? 

It's in the specs but I did not implement the check. So H_INT_ESB can be 
used today by the OS for any interrupt number. Same under KVM.

But I should say so somewhere.
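
For reference, a minimal sketch of what enforcing the restriction could look like. All names below (SketchSource, SKETCH_SRC_ESB_HCALL, sketch_esb_check) are hypothetical and not part of the patch, and the error code that the spec would require for a LISN without the flag is an assumption (an H_P2 stand-in is used here).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical status codes standing in for H_SUCCESS and H_P2. */
typedef enum {
    SKETCH_OK,
    SKETCH_BAD_LISN,
} SketchStatus;

/* Hypothetical per-source flag: "ESB hcall" advertised for this LISN. */
#define SKETCH_SRC_ESB_HCALL (1u << 0)

typedef struct {
    bool     valid;   /* stands in for the EAS_VALID test */
    uint32_t flags;   /* flags as reported by a get-source-info call */
} SketchSource;

/*
 * Validate a LISN before an ESB access. With 'enforce' false this
 * mirrors the behaviour described above (any valid LISN is accepted);
 * with 'enforce' true only LISNs advertising the flag are accepted.
 */
static SketchStatus sketch_esb_check(const SketchSource *srcs, size_t nr,
                                     uint64_t lisn, bool enforce)
{
    assert(srcs != NULL);

    if (lisn >= nr || !srcs[lisn].valid) {
        return SKETCH_BAD_LISN;
    }
    if (enforce && !(srcs[lisn].flags & SKETCH_SRC_ESB_HCALL)) {
        return SKETCH_BAD_LISN;
    }
    return SKETCH_OK;
}
```

Keeping the check behind a single predicate would make it easy to document the current permissive behaviour and to tighten it later.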

> Obviously using
> the direct MMIOs will generally be a faster option when possible, but
> I could see occasions where it might be simpler for the guest to
> always use H_INT_ESB (e.g. for micro-guests like kvm-unit-tests).

Can't you use direct loads and stores in these guests? I haven't
looked at how they are implemented.

> 
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-62: Reserved
>> + *      bit 63: Store: Store=1, store operation, else load operation
>> + * - "lisn" is per "interrupts", "interrupt-map", or
>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>> + *      ibm,query-interrupt-source-number RTAS call, or as
>> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
>> + * - "esbOffset" is the offset into the ESB page for the load or store operation
>> + * - "storeData" is the data to write for a store operation
>> + *
>> + * Output:
>> + * - R4: The value of the load if load operation, else -1
>> + */
>> +
>> +#define SPAPR_XIVE_ESB_STORE PPC_BIT(63)
>> +
>> +static target_ulong h_int_esb(PowerPCCPU *cpu,
>> +                              sPAPRMachineState *spapr,
>> +                              target_ulong opcode,
>> +                              target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveEAS eas;
>> +    target_ulong flags  = args[0];
>> +    target_ulong lisn   = args[1];
>> +    target_ulong offset = args[2];
>> +    target_ulong data   = args[3];
>> +    hwaddr mmio_addr;
>> +    XiveSource *xsrc = &xive->source;
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags & ~SPAPR_XIVE_ESB_STORE) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (!(eas.w & EAS_VALID)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (offset > (1ull << xsrc->esb_shift)) {
>> +        return H_P3;
>> +    }
>> +
>> +    mmio_addr = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn) + offset;
>> +
>> +    if (dma_memory_rw(&address_space_memory, mmio_addr, &data, 8,
>> +                      (flags & SPAPR_XIVE_ESB_STORE))) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
>> +                      HWADDR_PRIx "\n", mmio_addr);
>> +        return H_HARDWARE;
>> +    }
>> +    args[0] = (flags & SPAPR_XIVE_ESB_STORE) ? -1 : data;
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_SYNC hcall() is used to issue hardware syncs that will
>> + * ensure any in flight events for the input lisn are in the event
>> + * queue.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-63: Reserved
>> + * - "lisn" is per "interrupts", "interrupt-map", or
>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>> + *      ibm,query-interrupt-source-number RTAS call, or as
>> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
>> + *
>> + * Output:
>> + * - None
>> + */
>> +static target_ulong h_int_sync(PowerPCCPU *cpu,
>> +                               sPAPRMachineState *spapr,
>> +                               target_ulong opcode,
>> +                               target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    XiveEAS eas;
>> +    target_ulong flags = args[0];
>> +    target_ulong lisn = args[1];
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
>> +        return H_P2;
>> +    }
>> +
>> +    if (!(eas.w & EAS_VALID)) {
>> +        return H_P2;
>> +    }
>> +
>> +    /*
>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>> +     * This is not needed when running the emulation under QEMU
>> +     */
>> +
>> +    /* This is not real hardware. Nothing to be done */
> 
> At least, not as long as all the XIVE operations are under the BQL.

yes.

> 
>> +    return H_SUCCESS;
>> +}
>> +
>> +/*
>> + * The H_INT_RESET hcall() is used to reset all of the partition's
>> + * interrupt exploitation structures to their initial state.  This
>> + * means losing all previously set interrupt state set via
>> + * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
>> + *
>> + * Parameters:
>> + * Input:
>> + * - "flags"
>> + *      Bits 0-63: Reserved
>> + *
>> + * Output:
>> + * - None
>> + */
>> +static target_ulong h_int_reset(PowerPCCPU *cpu,
>> +                                sPAPRMachineState *spapr,
>> +                                target_ulong opcode,
>> +                                target_ulong *args)
>> +{
>> +    sPAPRXive *xive = spapr->xive;
>> +    target_ulong flags   = args[0];
>> +
>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        return H_FUNCTION;
>> +    }
>> +
>> +    if (flags) {
>> +        return H_PARAMETER;
>> +    }
>> +
>> +    device_reset(DEVICE(xive));
>> +    return H_SUCCESS;
>> +}
>> +
>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr)
>> +{
>> +    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
>> +    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
>> +    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
>> +    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
>> +    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
>> +    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
>> +    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
>> +                             h_int_set_os_reporting_line);
>> +    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
>> +                             h_int_get_os_reporting_line);
>> +    spapr_register_hypercall(H_INT_ESB, h_int_esb);
>> +    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
>> +    spapr_register_hypercall(H_INT_RESET, h_int_reset);
>> +}
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 2569ae1bc7f8..da6fcfaa3c52 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -258,6 +258,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>>          error_propagate(errp, local_err);
>>          return;
>>      }
>> +
>> +    spapr_xive_hcall_init(spapr);
>>  }
>>  
>>  static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index 301a8e972d91..eacd26836ebf 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -38,7 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>  obj-$(CONFIG_XIVE) += xive.o
>> -obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> 

* Re: [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode
  2018-11-28  4:31   ` David Gibson
@ 2018-11-28 22:26     ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 22:26 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 5:31 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:10AM +0100, Cédric Le Goater wrote:
>> The XIVE interface for the guest is described in the device tree under
>> the "interrupt-controller" node. A couple of new properties are
>> specific to XIVE :
>>
>>  - "reg"
>>
>>    contains the base address and size of the thread interrupt
>>    management areas (TIMA), for the User level and for the Guest OS
>>    level. Only the Guest OS level is taken into account today.
>>
>>  - "ibm,xive-eq-sizes"
>>
>>    the size of the event queues. One cell per size supported, contains
>>    log2 of size, in ascending order.
>>
>>  - "ibm,xive-lisn-ranges"
>>
>>    the IRQ interrupt number ranges assigned to the guest for the IPIs.
>>
>> and also under the root node :
>>
>>  - "ibm,plat-res-int-priorities"
>>
>>    contains a list of priorities that the hypervisor has reserved for
>>    its own use. OPAL uses the priority 7 queue to automatically
>>    escalate interrupts for all other queues (DD2.X POWER9). So only
>>    priorities [0..6] are allowed for the guest.
>>
>> Extend the sPAPR IRQ backend with a new handler to populate the DT
>> with the appropriate "interrupt-controller" node.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_irq.h  |  2 ++
>>  include/hw/ppc/spapr_xive.h |  2 ++
>>  hw/intc/spapr_xive_hcall.c  | 62 +++++++++++++++++++++++++++++++++++++
>>  hw/ppc/spapr.c              |  3 +-
>>  hw/ppc/spapr_irq.c          | 17 ++++++++++
>>  5 files changed, 85 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>> index c854ae527808..cfdc1f86e713 100644
>> --- a/include/hw/ppc/spapr_irq.h
>> +++ b/include/hw/ppc/spapr_irq.h
>> @@ -40,6 +40,8 @@ typedef struct sPAPRIrq {
>>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
>>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
>>      void (*print_info)(sPAPRMachineState *spapr, Monitor *mon);
>> +    void (*dt_populate)(sPAPRMachineState *spapr, uint32_t nr_servers,
>> +                        void *fdt, uint32_t phandle);
>>  } sPAPRIrq;
>>  
>>  extern sPAPRIrq spapr_irq_xics;
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 418511f3dc10..5b3fab192d41 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -65,5 +65,7 @@ bool spapr_xive_priority_is_valid(uint8_t priority);
>>  typedef struct sPAPRMachineState sPAPRMachineState;
>>  
>>  void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>> +void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
>> +                   uint32_t phandle);
>>  
>>  #endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
>> index 52e4e23995f5..66c78aa88500 100644
>> --- a/hw/intc/spapr_xive_hcall.c
>> +++ b/hw/intc/spapr_xive_hcall.c
>> @@ -890,3 +890,65 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr)
>>      spapr_register_hypercall(H_INT_SYNC, h_int_sync);
>>      spapr_register_hypercall(H_INT_RESET, h_int_reset);
>>  }
>> +
>> +void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt, uint32_t phandle)
>> +{
>> +    int node;
>> +    uint64_t timas[2 * 2];
>> +    /* Interrupt number ranges for the IPIs */
>> +    uint32_t lisn_ranges[] = {
>> +        cpu_to_be32(0),
>> +        cpu_to_be32(nr_servers),
>> +    };
>> +    uint32_t eq_sizes[] = {
>> +        cpu_to_be32(12), /* 4K */
>> +        cpu_to_be32(16), /* 64K */
>> +        cpu_to_be32(21), /* 2M */
>> +        cpu_to_be32(24), /* 16M */
>> +    };
>> +    /* The following array is in sync with the 'spapr_xive_priority_is_valid'
>> +     * routine above. The O/S is expected to choose priority 6.
>> +     */
>> +    uint32_t plat_res_int_priorities[] = {
>> +        cpu_to_be32(7),    /* start */
>> +        cpu_to_be32(0xf8), /* count */
>> +    };
>> +    gchar *nodename;
>> +
>> +    /* Thread Interrupt Management Area : User (ring 3) and OS (ring 2) */
>> +    timas[0] = cpu_to_be64(xive->tm_base + 3 * (1ull << TM_SHIFT));
>> +    timas[1] = cpu_to_be64(1ull << TM_SHIFT);
>> +    timas[2] = cpu_to_be64(xive->tm_base + 2 * (1ull << TM_SHIFT));
> 
> Don't you have symbolic constants for the ring numbers, instead of '2'
> and '3' above?

I have offsets in the TIMA, 0x0, 0x10, ... We could add constants
for the ring numbers as they are used in the TIMA memory ops.
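
A rough sketch of what such constants could look like. The names (SKETCH_TM_RING_OS, SKETCH_TM_RING_USER, sketch_tima_page_addr) are hypothetical, and SKETCH_TM_SHIFT = 16 assumes 64K TIMA pages; only the ring numbers 2 (OS) and 3 (User) come from the patch comment.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: one TIMA page per ring, 64K each. */
#define SKETCH_TM_SHIFT 16

/* Ring numbers as described in the patch comment. */
enum {
    SKETCH_TM_RING_OS   = 2,
    SKETCH_TM_RING_USER = 3,
};

/* Address of the TIMA page for a given ring, computed in 64-bit. */
static uint64_t sketch_tima_page_addr(uint64_t tm_base, unsigned ring)
{
    assert(ring <= SKETCH_TM_RING_USER);
    return tm_base + (uint64_t)ring * (1ull << SKETCH_TM_SHIFT);
}
```

With a helper like this, the "reg" entries and the node name would be computed from the same expression instead of repeating '2' and '3' in several places.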
 
> 
>> +    timas[3] = cpu_to_be64(1ull << TM_SHIFT);
>> +
>> +    nodename = g_strdup_printf("interrupt-controller@%" PRIx64,
>> +                               xive->tm_base + 3 * (1 << TM_SHIFT));
>> +    _FDT(node = fdt_add_subnode(fdt, 0, nodename));
>> +    g_free(nodename);
>> +
>> +    _FDT(fdt_setprop_string(fdt, node, "device_type", "power-ivpe"));
>> +    _FDT(fdt_setprop(fdt, node, "reg", timas, sizeof(timas)));
>> +
>> +    _FDT(fdt_setprop_string(fdt, node, "compatible", "ibm,power-ivpe"));
>> +    _FDT(fdt_setprop(fdt, node, "ibm,xive-eq-sizes", eq_sizes,
>> +                     sizeof(eq_sizes)));
>> +    _FDT(fdt_setprop(fdt, node, "ibm,xive-lisn-ranges", lisn_ranges,
>> +                     sizeof(lisn_ranges)));
>> +
>> +    /* For Linux to link the LSIs to the main interrupt controller.
> 
> What's the "main interrupt controller" in this context?

There is only one. This is just how it is referenced in the Linux code.
I will remove the comment.

> 
>> +     * These properties are not in XIVE exploitation mode sPAPR
>> +     * specs

These properties have been added now.

>> +     */
>> +    _FDT(fdt_setprop(fdt, node, "interrupt-controller", NULL, 0));
>> +    _FDT(fdt_setprop_cell(fdt, node, "#interrupt-cells", 2));
>> +
>> +    /* For SLOF */
>> +    _FDT(fdt_setprop_cell(fdt, node, "linux,phandle", phandle));
>> +    _FDT(fdt_setprop_cell(fdt, node, "phandle", phandle));
>> +
>> +    /* The "ibm,plat-res-int-priorities" property defines the priority
>> +     * ranges reserved by the hypervisor
>> +     */
>> +    _FDT(fdt_setprop(fdt, 0, "ibm,plat-res-int-priorities",
>> +                     plat_res_int_priorities, sizeof(plat_res_int_priorities)));
>> +}
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index 9f8c19e56e7a..ad1692cdcd0f 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1270,7 +1270,8 @@ static void *spapr_build_fdt(sPAPRMachineState *spapr,
>>      _FDT(fdt_setprop_cell(fdt, 0, "#size-cells", 2));
>>  
>>      /* /interrupt controller */
>> -    spapr_dt_xics(xics_max_server_number(spapr), fdt, PHANDLE_XICP);
>> +    smc->irq->dt_populate(spapr, xics_max_server_number(spapr), fdt,
>> +                          PHANDLE_XICP);
>>  
>>      ret = spapr_populate_memory(spapr, fdt);
>>      if (ret < 0) {
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index da6fcfaa3c52..d88a029d8c5c 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -190,6 +190,13 @@ static void spapr_irq_print_info_xics(sPAPRMachineState *spapr, Monitor *mon)
>>      ics_pic_print_info(spapr->ics, mon);
>>  }
>>  
>> +static void spapr_irq_dt_populate_xics(sPAPRMachineState *spapr,
>> +                                       uint32_t nr_servers, void *fdt,
>> +                                       uint32_t phandle)
>> +{
>> +    spapr_dt_xics(nr_servers, fdt, phandle);
>> +}
>> +
> 
> It'd be nicer to change the signature of spapr_dt_xics, rather than
> having this one line wrapper.

But this is a sPAPR IRQ method. So you would use spapr_dt_xics directly
for the definition of the backend? If you prefer, it's fine with me.

> 
>>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>>  #define SPAPR_IRQ_XICS_NR_MSIS     \
>>      (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
>> @@ -203,6 +210,7 @@ sPAPRIrq spapr_irq_xics = {
>>      .free        = spapr_irq_free_xics,
>>      .qirq        = spapr_qirq_xics,
>>      .print_info  = spapr_irq_print_info_xics,
>> +    .dt_populate = spapr_irq_dt_populate_xics,
>>  };
>>  
>>   /*
>> @@ -300,6 +308,13 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>>      spapr_xive_pic_print_info(spapr->xive, mon);
>>  }
>>  
>> +static void spapr_irq_dt_populate_xive(sPAPRMachineState *spapr,
>> +                                       uint32_t nr_servers, void *fdt,
>> +                                       uint32_t phandle)
>> +{
>> +    spapr_dt_xive(spapr->xive, nr_servers, fdt, phandle);
> 
> Uh.. and to make the hook signature just match what we need rather
> than having to have trivial wrappers in both cases.

ok.
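
To make sure I understand the suggestion, here is a reduced sketch with stand-in types. SketchMachine, SketchIrqBackend and the sketch_dt_* functions are hypothetical and only model the shape: the DT routines take the hook's argument list, so each backend points at them directly and no wrapper is needed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef enum {
    SKETCH_BACKEND_NONE,
    SKETCH_BACKEND_XICS,
    SKETCH_BACKEND_XIVE,
} SketchBackendId;

/* Stand-in for the machine state; records which routine ran. */
typedef struct SketchMachine {
    SketchBackendId last_backend;
    uint32_t        last_nr_servers;
} SketchMachine;

/* The hook signature that both DT routines are made to match. */
typedef void (*SketchDtPopulate)(SketchMachine *m, uint32_t nr_servers,
                                 void *fdt, uint32_t phandle);

typedef struct {
    SketchDtPopulate dt_populate;
} SketchIrqBackend;

static void sketch_dt_xics(SketchMachine *m, uint32_t nr_servers,
                           void *fdt, uint32_t phandle)
{
    (void)fdt;
    (void)phandle;
    assert(m != NULL);
    m->last_backend = SKETCH_BACKEND_XICS;
    m->last_nr_servers = nr_servers;
}

static void sketch_dt_xive(SketchMachine *m, uint32_t nr_servers,
                           void *fdt, uint32_t phandle)
{
    (void)fdt;
    (void)phandle;
    assert(m != NULL);
    m->last_backend = SKETCH_BACKEND_XIVE;
    m->last_nr_servers = nr_servers;
}

/* No one-line wrappers: the backends reference the routines directly. */
static const SketchIrqBackend sketch_irq_xics = { .dt_populate = sketch_dt_xics };
static const SketchIrqBackend sketch_irq_xive = { .dt_populate = sketch_dt_xive };
```

The XIVE routine would then fetch its controller from the machine state itself rather than receiving it as an argument.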

> 
>> +}
>> +
>>  /*
>>   * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>>   * with XICS.
>> @@ -317,6 +332,7 @@ sPAPRIrq spapr_irq_xive = {
>>      .free        = spapr_irq_free_xive,
>>      .qirq        = spapr_qirq_xive,
>>      .print_info  = spapr_irq_print_info_xive,
>> +    .dt_populate = spapr_irq_dt_populate_xive,
>>  };
>>  
>>  /*
>> @@ -421,4 +437,5 @@ sPAPRIrq spapr_irq_xics_legacy = {
>>      .free        = spapr_irq_free_xics,
>>      .qirq        = spapr_qirq_xics,
>>      .print_info  = spapr_irq_print_info_xics,
>> +    .dt_populate = spapr_irq_dt_populate_xics,
>>  };
> 

* Re: [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type
  2018-11-28  4:42   ` David Gibson
@ 2018-11-28 22:37     ` Cédric Le Goater
  2018-12-04 15:14       ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 22:37 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 5:42 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:12AM +0100, Cédric Le Goater wrote:
>> The interrupt mode is statically defined to XIVE only for this machine.
>> The guest OS is required to have support for the XIVE exploitation
>> mode of the POWER9 interrupt controller.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_irq.h |  1 +
>>  hw/ppc/spapr.c             | 36 +++++++++++++++++++++++++++++++-----
>>  hw/ppc/spapr_irq.c         |  3 +++
>>  3 files changed, 35 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>> index c3b4c38145eb..b299dd794bff 100644
>> --- a/include/hw/ppc/spapr_irq.h
>> +++ b/include/hw/ppc/spapr_irq.h
>> @@ -33,6 +33,7 @@ void spapr_irq_msi_reset(sPAPRMachineState *spapr);
>>  typedef struct sPAPRIrq {
>>      uint32_t    nr_irqs;
>>      uint32_t    nr_msis;
>> +    uint8_t     ov5;
> 
> I'm a bit confused as to what exactly this represents..

These are the option vector 5 bits advertised by CAS for the platform,
i.e. what the hypervisor supports.
 
> 
>>      void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
>>                   Error **errp);
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index ad1692cdcd0f..8fbb743769db 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1097,12 +1097,14 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
>>      spapr_dt_rtas_tokens(fdt, rtas);
>>  }
>>  
>> -/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
>> - * that the guest may request and thus the valid values for bytes 24..26 of
>> - * option vector 5: */
>> -static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
>> +/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
>> + * and the XIVE features that the guest may request and thus the valid
>> + * values for bytes 23..26 of option vector 5: */
>> +static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
>> +                                          int chosen)
>>  {
>>      PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
>> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>  
>>      char val[2 * 4] = {
>>          23, 0x00, /* Xive mode, filled in below. */
>> @@ -1123,7 +1125,11 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
>>          } else {
>>              val[3] = 0x00; /* Hash */
>>          }
>> +        /* TODO: test KVM support */
>> +        val[1] = smc->irq->ov5;
>>      } else {
>> +        val[1] = smc->irq->ov5;
> 
> ..here it seems to be a specific value for this OV5 byte, indicating the
> supported intc...

yes.

> 
>> +
>>          /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
>>          val[3] = 0xC0;
>>      }
>> @@ -1191,7 +1197,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
>>          _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
>>      }
>>  
>> -    spapr_dt_ov5_platform_support(fdt, chosen);
>> +    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>>  
>>      g_free(stdout_path);
>>      g_free(bootlist);
>> @@ -2622,6 +2628,11 @@ static void spapr_machine_init(MachineState *machine)
>>      /* advertise support for ibm,dyamic-memory-v2 */
>>      spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
>>  
>> +    /* advertise XIVE */
>> +    if (smc->irq->ov5) {
> 
> ..but here it seems to be a bool indicating XIVE support specifically.

ah. yes. I need to check this part. That was a while ago.
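
One way to remove the ambiguity would be to keep the field as the OV5 byte value and derive the boolean through a named helper. A sketch, using only the two values that appear in this patch (0x0 and 0x40); the SKETCH_* names and the helper are hypothetical, and any other encoding of that byte is outside what is shown here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Values taken from the patch: byte 23 of option vector 5. */
#define SKETCH_OV5_XICS_ONLY 0x00
#define SKETCH_OV5_XIVE_ONLY 0x40

typedef struct {
    uint8_t ov5;   /* value advertised in ibm,arch-vec-5-platform-support */
} SketchIrq;

/* Explicit predicate instead of testing the raw byte as a bool. */
static bool sketch_irq_advertises_xive(const SketchIrq *irq)
{
    assert(irq != NULL);
    return irq->ov5 != SKETCH_OV5_XICS_ONLY;
}
```

The device tree code would keep using the raw byte, while the machine init code would call the predicate before setting OV5_XIVE_EXPLOIT.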

>> +        spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
>> +    }
>> +
>>      /* init CPUs */
>>      spapr_init_cpus(spapr);
>>  
>> @@ -3971,6 +3982,21 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>>  
>>  DEFINE_SPAPR_MACHINE(3_1, "3.1", true);
>>  
>> +static void spapr_machine_3_1_xive_instance_options(MachineState *machine)
>> +{
>> +    spapr_machine_3_1_instance_options(machine);
>> +}
>> +
>> +static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
>> +{
>> +    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>> +
>> +    spapr_machine_3_1_class_options(mc);
>> +    smc->irq = &spapr_irq_xive;
>> +}
>> +
>> +DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
>> +
>>  /*
>>   * pseries-3.0
>>   */
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 253abc10e780..42e73851b174 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -210,6 +210,7 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
>>  sPAPRIrq spapr_irq_xics = {
>>      .nr_irqs     = SPAPR_IRQ_XICS_NR_IRQS,
>>      .nr_msis     = SPAPR_IRQ_XICS_NR_MSIS,
>> +    .ov5         = 0x0, /* XICS only */
>>  
>>      .init        = spapr_irq_init_xics,
>>      .claim       = spapr_irq_claim_xics,
>> @@ -341,6 +342,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
>>  sPAPRIrq spapr_irq_xive = {
>>      .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
>>      .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
>> +    .ov5         = 0x40, /* XIVE exploitation mode only */
>>  
>>      .init        = spapr_irq_init_xive,
>>      .claim       = spapr_irq_claim_xive,
>> @@ -447,6 +449,7 @@ int spapr_irq_find(sPAPRMachineState *spapr, int num, bool align, Error **errp)
>>  sPAPRIrq spapr_irq_xics_legacy = {
>>      .nr_irqs     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
>>      .nr_msis     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
>> +    .ov5         = 0x0, /* XICS only */
>>  
>>      .init        = spapr_irq_init_xics,
>>      .claim       = spapr_irq_claim_xics,
> 

* Re: [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models
  2018-11-28  5:13   ` David Gibson
@ 2018-11-28 22:38     ` Cédric Le Goater
  2018-11-29  2:59       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 22:38 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 6:13 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:13AM +0100, Cédric Le Goater wrote:
>> The XIVE models for the QEMU and KVM accelerators will have a lot in
>> common. Introduce an abstract class for the source, the thread context
>> and the interrupt controller object to handle the differences in the
>> object initialization. These classes will also be used to define state
>> synchronization handlers for the monitor and migration usage.
>>
>> This is very much like the XICS models.
> 
> Yeah.. so I know it's my code, but in hindsight I think making
> separate subclasses for TCG and KVM was a mistake.  The distinction
> between emulated and KVM version is supposed to be invisible to both
> guest and (almost) to user, whereas a subclass usually indicates a
> visibly different device.

So how do you want to model the KVM part? With a single object and
kvm_enabled() sections?
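
Something along these lines, as a reduced sketch of the single-object option. sketch_kvm_enabled() is a stand-in for QEMU's kvm_enabled(), and SketchXive with its fields is hypothetical; which parts of realize would actually differ between the two accelerators is an assumption made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for QEMU's kvm_enabled(); toggled by hand in this sketch. */
static bool sketch_use_kvm;

static bool sketch_kvm_enabled(void)
{
    return sketch_use_kvm;
}

typedef struct {
    bool realized;
    bool kvm_connected;   /* controller state is handled by the kernel */
    bool tima_emulated;   /* TIMA MMIO region is handled by the emulation */
} SketchXive;

/* One realize routine for one device type, branching on the accelerator. */
static void sketch_xive_realize(SketchXive *xive)
{
    assert(xive != NULL);

    if (sketch_kvm_enabled()) {
        xive->kvm_connected = true;
    } else {
        xive->tima_emulated = true;
    }
    xive->realized = true;
}
```

The guest and the user would see a single device type either way; only the internals of realize, reset and the state synchronization would branch.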

> 
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_xive.h |  15 +++++
>>  include/hw/ppc/xive.h       |  30 ++++++++++
>>  hw/intc/spapr_xive.c        |  86 +++++++++++++++++++---------
>>  hw/intc/xive.c              | 109 +++++++++++++++++++++++++-----------
>>  hw/ppc/spapr_irq.c          |   4 +-
>>  5 files changed, 182 insertions(+), 62 deletions(-)
>>
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 5b3fab192d41..aca2969a09ab 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -13,6 +13,10 @@
>>  #include "hw/sysbus.h"
>>  #include "hw/ppc/xive.h"
>>  
>> +#define TYPE_SPAPR_XIVE_BASE "spapr-xive-base"
>> +#define SPAPR_XIVE_BASE(obj) \
>> +    OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_BASE)
>> +
>>  #define TYPE_SPAPR_XIVE "spapr-xive"
>>  #define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>>  
>> @@ -38,6 +42,17 @@ typedef struct sPAPRXive {
>>      MemoryRegion  tm_mmio;
>>  } sPAPRXive;
>>  
>> +#define SPAPR_XIVE_BASE_CLASS(klass) \
>> +     OBJECT_CLASS_CHECK(sPAPRXiveClass, (klass), TYPE_SPAPR_XIVE_BASE)
>> +#define SPAPR_XIVE_BASE_GET_CLASS(obj) \
>> +     OBJECT_GET_CLASS(sPAPRXiveClass, (obj), TYPE_SPAPR_XIVE_BASE)
>> +
>> +typedef struct sPAPRXiveClass {
>> +    XiveRouterClass parent_class;
>> +
>> +    DeviceRealize   parent_realize;
>> +} sPAPRXiveClass;
>> +
>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index b74eb326dcd1..281ed370121c 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -38,6 +38,10 @@ typedef struct XiveFabricClass {
>>   * XIVE Interrupt Source
>>   */
>>  
>> +#define TYPE_XIVE_SOURCE_BASE "xive-source-base"
>> +#define XIVE_SOURCE_BASE(obj) \
>> +    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_BASE)
>> +
>>  #define TYPE_XIVE_SOURCE "xive-source"
>>  #define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>>  
>> @@ -68,6 +72,18 @@ typedef struct XiveSource {
>>      XiveFabric      *xive;
>>  } XiveSource;
>>  
>> +#define XIVE_SOURCE_BASE_CLASS(klass) \
>> +     OBJECT_CLASS_CHECK(XiveSourceClass, (klass), TYPE_XIVE_SOURCE_BASE)
>> +#define XIVE_SOURCE_BASE_GET_CLASS(obj) \
>> +     OBJECT_GET_CLASS(XiveSourceClass, (obj), TYPE_XIVE_SOURCE_BASE)
>> +
>> +typedef struct XiveSourceClass {
>> +    SysBusDeviceClass parent_class;
>> +
>> +    DeviceRealize     parent_realize;
>> +    DeviceReset       parent_reset;
>> +} XiveSourceClass;
>> +
>>  /*
>>   * ESB MMIO setting. Can be one page, for both source triggering and
>>   * source management, or two different pages. See below for magic
>> @@ -253,6 +269,9 @@ void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>>   * XIVE Thread interrupt Management (TM) context
>>   */
>>  
>> +#define TYPE_XIVE_TCTX_BASE "xive-tctx-base"
>> +#define XIVE_TCTX_BASE(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_BASE)
>> +
>>  #define TYPE_XIVE_TCTX "xive-tctx"
>>  #define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
>>  
>> @@ -278,6 +297,17 @@ typedef struct XiveTCTX {
>>      XiveRouter  *xrtr;
>>  } XiveTCTX;
>>  
>> +#define XIVE_TCTX_BASE_CLASS(klass) \
>> +     OBJECT_CLASS_CHECK(XiveTCTXClass, (klass), TYPE_XIVE_TCTX_BASE)
>> +#define XIVE_TCTX_BASE_GET_CLASS(obj) \
>> +     OBJECT_GET_CLASS(XiveTCTXClass, (obj), TYPE_XIVE_TCTX_BASE)
>> +
>> +typedef struct XiveTCTXClass {
>> +    DeviceClass       parent_class;
>> +
>> +    DeviceRealize     parent_realize;
>> +} XiveTCTXClass;
>> +
>>  /*
>>   * XIVE Thread Interrupt Management Aera (TIMA)
>>   */
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index 3bf77ace11a2..ec85f7e4f88d 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -53,9 +53,9 @@ static void spapr_xive_mmio_map(sPAPRXive *xive)
>>      sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
>>  }
>>  
>> -static void spapr_xive_reset(DeviceState *dev)
>> +static void spapr_xive_base_reset(DeviceState *dev)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
>>      int i;
>>  
>>      /* Xive Source reset is done through SysBus, it should put all
>> @@ -76,9 +76,9 @@ static void spapr_xive_reset(DeviceState *dev)
>>      spapr_xive_mmio_map(xive);
>>  }
>>  
>> -static void spapr_xive_instance_init(Object *obj)
>> +static void spapr_xive_base_instance_init(Object *obj)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(obj);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(obj);
>>  
>>      object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
>>      object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>> @@ -89,9 +89,9 @@ static void spapr_xive_instance_init(Object *obj)
>>                                NULL);
>>  }
>>  
>> -static void spapr_xive_realize(DeviceState *dev, Error **errp)
>> +static void spapr_xive_base_realize(DeviceState *dev, Error **errp)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
>>      XiveSource *xsrc = &xive->source;
>>      XiveENDSource *end_xsrc = &xive->end_source;
>>      Error *local_err = NULL;
>> @@ -142,16 +142,11 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>       */
>>      xive->eat = g_new0(XiveEAS, xive->nr_irqs);
>>      xive->endt = g_new0(XiveEND, xive->nr_ends);
>> -
>> -    /* TIMA initialization */
>> -    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
>> -                          "xive.tima", 4ull << TM_SHIFT);
>> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
>>  }
>>  
>>  static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>  
>>      if (lisn >= xive->nr_irqs) {
>>          return -1;
>> @@ -163,7 +158,7 @@ static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>  
>>  static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>  
>>      if (lisn >= xive->nr_irqs) {
>>          return -1;
>> @@ -176,7 +171,7 @@ static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>  static int spapr_xive_get_end(XiveRouter *xrtr,
>>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>  
>>      if (end_idx >= xive->nr_ends) {
>>          return -1;
>> @@ -189,7 +184,7 @@ static int spapr_xive_get_end(XiveRouter *xrtr,
>>  static int spapr_xive_set_end(XiveRouter *xrtr,
>>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>  
>>      if (end_idx >= xive->nr_ends) {
>>          return -1;
>> @@ -202,7 +197,7 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
>>  static int spapr_xive_get_nvt(XiveRouter *xrtr,
>>                                uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>      uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
>>      PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
>>  
>> @@ -236,7 +231,7 @@ static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
>>      uint32_t nvt_idx;
>>      uint32_t nvt_cam;
>>  
>> -    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
>> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE_BASE(xrtr), POWERPC_CPU(tctx->cs),
>>                            &nvt_blk, &nvt_idx);
>>  
>>      nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
>> @@ -359,7 +354,7 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
>>      },
>>  };
>>  
>> -static const VMStateDescription vmstate_spapr_xive = {
>> +static const VMStateDescription vmstate_spapr_xive_base = {
>>      .name = TYPE_SPAPR_XIVE,
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>> @@ -373,7 +368,7 @@ static const VMStateDescription vmstate_spapr_xive = {
>>      },
>>  };
>>  
>> -static Property spapr_xive_properties[] = {
>> +static Property spapr_xive_base_properties[] = {
>>      DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>>      DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
>>      DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
>> @@ -381,16 +376,16 @@ static Property spapr_xive_properties[] = {
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> -static void spapr_xive_class_init(ObjectClass *klass, void *data)
>> +static void spapr_xive_base_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>      XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
>>  
>>      dc->desc    = "sPAPR XIVE Interrupt Controller";
>> -    dc->props   = spapr_xive_properties;
>> -    dc->realize = spapr_xive_realize;
>> -    dc->reset   = spapr_xive_reset;
>> -    dc->vmsd    = &vmstate_spapr_xive;
>> +    dc->props   = spapr_xive_base_properties;
>> +    dc->realize = spapr_xive_base_realize;
>> +    dc->reset   = spapr_xive_base_reset;
>> +    dc->vmsd    = &vmstate_spapr_xive_base;
>>  
>>      xrc->get_eas = spapr_xive_get_eas;
>>      xrc->set_eas = spapr_xive_set_eas;
>> @@ -401,16 +396,55 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>      xrc->reset_tctx = spapr_xive_reset_tctx;
>>  }
>>  
>> +static const TypeInfo spapr_xive_base_info = {
>> +    .name = TYPE_SPAPR_XIVE_BASE,
>> +    .parent = TYPE_XIVE_ROUTER,
>> +    .abstract = true,
>> +    .instance_init = spapr_xive_base_instance_init,
>> +    .instance_size = sizeof(sPAPRXive),
>> +    .class_init = spapr_xive_base_class_init,
>> +    .class_size = sizeof(sPAPRXiveClass),
>> +};
>> +
>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
>> +    Error *local_err = NULL;
>> +
>> +    sxc->parent_realize(dev, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    /* TIMA */
>> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
>> +                          "xive.tima", 4ull << TM_SHIFT);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
>> +}
>> +
>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
>> +
>> +    device_class_set_parent_realize(dc, spapr_xive_realize,
>> +                                    &sxc->parent_realize);
>> +}
>> +
>>  static const TypeInfo spapr_xive_info = {
>>      .name = TYPE_SPAPR_XIVE,
>> -    .parent = TYPE_XIVE_ROUTER,
>> -    .instance_init = spapr_xive_instance_init,
>> +    .parent = TYPE_SPAPR_XIVE_BASE,
>> +    .instance_init = spapr_xive_base_instance_init,
>>      .instance_size = sizeof(sPAPRXive),
>>      .class_init = spapr_xive_class_init,
>> +    .class_size = sizeof(sPAPRXiveClass),
>>  };
>>  
>>  static void spapr_xive_register_types(void)
>>  {
>> +    type_register_static(&spapr_xive_base_info);
>>      type_register_static(&spapr_xive_info);
>>  }
>>  
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 7d921023e2ee..9bb37553c9ec 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -478,9 +478,9 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>>      return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
>>  }
>>  
>> -static void xive_tctx_reset(void *dev)
>> +static void xive_tctx_base_reset(void *dev)
>>  {
>> -    XiveTCTX *tctx = XIVE_TCTX(dev);
>> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
>>      XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>>  
>>      memset(tctx->regs, 0, sizeof(tctx->regs));
>> @@ -506,9 +506,9 @@ static void xive_tctx_reset(void *dev)
>>      }
>>  }
>>  
>> -static void xive_tctx_realize(DeviceState *dev, Error **errp)
>> +static void xive_tctx_base_realize(DeviceState *dev, Error **errp)
>>  {
>> -    XiveTCTX *tctx = XIVE_TCTX(dev);
>> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
>>      PowerPCCPU *cpu;
>>      CPUPPCState *env;
>>      Object *obj;
>> @@ -544,15 +544,15 @@ static void xive_tctx_realize(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>  
>> -    qemu_register_reset(xive_tctx_reset, dev);
>> +    qemu_register_reset(xive_tctx_base_reset, dev);
>>  }
>>  
>> -static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
>> +static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
>>  {
>> -    qemu_unregister_reset(xive_tctx_reset, dev);
>> +    qemu_unregister_reset(xive_tctx_base_reset, dev);
>>  }
>>  
>> -static const VMStateDescription vmstate_xive_tctx = {
>> +static const VMStateDescription vmstate_xive_tctx_base = {
>>      .name = TYPE_XIVE_TCTX,
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>> @@ -562,21 +562,28 @@ static const VMStateDescription vmstate_xive_tctx = {
>>      },
>>  };
>>  
>> -static void xive_tctx_class_init(ObjectClass *klass, void *data)
>> +static void xive_tctx_base_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>  
>> -    dc->realize = xive_tctx_realize;
>> -    dc->unrealize = xive_tctx_unrealize;
>> +    dc->realize = xive_tctx_base_realize;
>> +    dc->unrealize = xive_tctx_base_unrealize;
>>      dc->desc = "XIVE Interrupt Thread Context";
>> -    dc->vmsd = &vmstate_xive_tctx;
>> +    dc->vmsd = &vmstate_xive_tctx_base;
>>  }
>>  
>> -static const TypeInfo xive_tctx_info = {
>> -    .name          = TYPE_XIVE_TCTX,
>> +static const TypeInfo xive_tctx_base_info = {
>> +    .name          = TYPE_XIVE_TCTX_BASE,
>>      .parent        = TYPE_DEVICE,
>> +    .abstract      = true,
>>      .instance_size = sizeof(XiveTCTX),
>> -    .class_init    = xive_tctx_class_init,
>> +    .class_init    = xive_tctx_base_class_init,
>> +    .class_size    = sizeof(XiveTCTXClass),
>> +};
>> +
>> +static const TypeInfo xive_tctx_info = {
>> +    .name          = TYPE_XIVE_TCTX,
>> +    .parent        = TYPE_XIVE_TCTX_BASE,
>>  };
>>  
>>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
>> @@ -933,9 +940,9 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>>      }
>>  }
>>  
>> -static void xive_source_reset(DeviceState *dev)
>> +static void xive_source_base_reset(DeviceState *dev)
>>  {
>> -    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
>>  
>>      /* Do not clear the LSI bitmap */
>>  
>> @@ -943,9 +950,9 @@ static void xive_source_reset(DeviceState *dev)
>>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>>  }
>>  
>> -static void xive_source_realize(DeviceState *dev, Error **errp)
>> +static void xive_source_base_realize(DeviceState *dev,  Error **errp)
>>  {
>> -    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
>>      Object *obj;
>>      Error *local_err = NULL;
>>  
>> @@ -971,21 +978,14 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>>          return;
>>      }
>>  
>> -    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>> -                                     xsrc->nr_irqs);
>> -
>>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>>  
>>      xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>>      xsrc->lsi_map_size = xsrc->nr_irqs;
>>  
>> -    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>> -                          &xive_source_esb_ops, xsrc, "xive.esb",
>> -                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>>  }
>>  
>> -static const VMStateDescription vmstate_xive_source = {
>> +static const VMStateDescription vmstate_xive_source_base = {
>>      .name = TYPE_XIVE_SOURCE,
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>> @@ -1001,29 +1001,68 @@ static const VMStateDescription vmstate_xive_source = {
>>   * The default XIVE interrupt source setting for the ESB MMIOs is two
>>   * 64k pages without Store EOI, to be in sync with KVM.
>>   */
>> -static Property xive_source_properties[] = {
>> +static Property xive_source_base_properties[] = {
>>      DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>>      DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>>      DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>>      DEFINE_PROP_END_OF_LIST(),
>>  };
>>  
>> -static void xive_source_class_init(ObjectClass *klass, void *data)
>> +static void xive_source_base_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>  
>>      dc->desc    = "XIVE Interrupt Source";
>> -    dc->props   = xive_source_properties;
>> -    dc->realize = xive_source_realize;
>> -    dc->reset   = xive_source_reset;
>> -    dc->vmsd    = &vmstate_xive_source;
>> +    dc->props   = xive_source_base_properties;
>> +    dc->realize = xive_source_base_realize;
>> +    dc->reset   = xive_source_base_reset;
>> +    dc->vmsd    = &vmstate_xive_source_base;
>> +}
>> +
>> +static const TypeInfo xive_source_base_info = {
>> +    .name          = TYPE_XIVE_SOURCE_BASE,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .abstract      = true,
>> +    .instance_size = sizeof(XiveSource),
>> +    .class_init    = xive_source_base_class_init,
>> +    .class_size    = sizeof(XiveSourceClass),
>> +};
>> +
>> +static void xive_source_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
>> +    Error *local_err = NULL;
>> +
>> +    xsc->parent_realize(dev, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc, xsrc->nr_irqs);
>> +
>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>> +                          &xive_source_esb_ops, xsrc, "xive.esb",
>> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>> +}
>> +
>> +static void xive_source_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
>> +
>> +    device_class_set_parent_realize(dc, xive_source_realize,
>> +                                    &xsc->parent_realize);
>>  }
>>  
>>  static const TypeInfo xive_source_info = {
>>      .name          = TYPE_XIVE_SOURCE,
>> -    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .parent        = TYPE_XIVE_SOURCE_BASE,
>>      .instance_size = sizeof(XiveSource),
>>      .class_init    = xive_source_class_init,
>> +    .class_size    = sizeof(XiveSourceClass),
>>  };
>>  
>>  /*
>> @@ -1659,10 +1698,12 @@ static const TypeInfo xive_fabric_info = {
>>  
>>  static void xive_register_types(void)
>>  {
>> +    type_register_static(&xive_source_base_info);
>>      type_register_static(&xive_source_info);
>>      type_register_static(&xive_fabric_info);
>>      type_register_static(&xive_router_info);
>>      type_register_static(&xive_end_source_info);
>> +    type_register_static(&xive_tctx_base_info);
>>      type_register_static(&xive_tctx_info);
>>  }
>>  
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 42e73851b174..f6e9e44d4cf9 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -243,7 +243,7 @@ static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
>>          return NULL;
>>      }
>>      qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
>> -    xive = SPAPR_XIVE(obj);
>> +    xive = SPAPR_XIVE_BASE(obj);
>>  
>>      /* Enable the CPU IPIs */
>>      for (i = 0; i < nr_servers; ++i) {
>> @@ -311,7 +311,7 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>>      CPU_FOREACH(cs) {
>>          PowerPCCPU *cpu = POWERPC_CPU(cs);
>>  
>> -        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
>> +        xive_tctx_pic_print_info(XIVE_TCTX_BASE(cpu->intc), mon);
>>      }
>>  
>>      spapr_xive_pic_print_info(spapr->xive, mon);
> 


* Re: [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support
  2018-11-28  5:52   ` David Gibson
@ 2018-11-28 22:45     ` Cédric Le Goater
  2018-11-29  3:33       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-28 22:45 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 6:52 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:15AM +0100, Cédric Le Goater wrote:
>> This introduces a set of XIVE models specific to KVM which derive from
>> the XIVE base models. The interfaces with KVM are a new capability and
>> a new KVM device for the XIVE native exploitation interrupt mode.
>>
>> They handle the initialization of the TIMA and the source ESB memory
>> regions which have a different type under KVM. These are 'ram device'
>> memory mappings, similarly to VFIO, exposed to the guest and the
>> associated VMAs on the host are populated dynamically with the
>> appropriate pages using a fault handler.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> The logic here looks fine, but I think it would be better to activate
> it with explicit if (kvm) type logic rather than using a subclass.

OK. ARM has taken a different path, the one proposed below, but it should
be possible to use "if (kvm)" type logic instead. That should leave less
noise in the object design.
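
To make the comparison concrete, here is a minimal standalone sketch of what
"if (kvm)" type logic could look like for the TIMA region: a single realize
routine picks the backing at run time instead of relying on a KVM subclass.
Every name in it (mock_kvm_enabled, MockXive, TimaBacking, the placeholder fd
value) is a stand-in invented for illustration. It is not QEMU code and only
shows the control flow.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for QEMU's kvm_enabled(); toggled by hand here. */
static bool mock_kvm_on;

static bool mock_kvm_enabled(void)
{
    return mock_kvm_on;
}

/* How the TIMA region ends up being backed. */
typedef enum {
    TIMA_UNSET,
    TIMA_EMULATED_IO,   /* emulated MMIO region (TCG case) */
    TIMA_RAM_DEVICE,    /* mmap'ed 'ram device' region (KVM case) */
} TimaBacking;

/* Hypothetical, much reduced version of the sPAPRXive state. */
typedef struct MockXive {
    int fd;             /* KVM device fd, -1 when emulated */
    TimaBacking tima;
} MockXive;

/* Single realize path: common setup first, then one explicit branch. */
static void mock_xive_realize(MockXive *xive)
{
    assert(xive != NULL);

    xive->fd = -1;
    if (mock_kvm_enabled()) {
        xive->fd = 42;                  /* placeholder for the KVM device fd */
        xive->tima = TIMA_RAM_DEVICE;
    } else {
        xive->tima = TIMA_EMULATED_IO;
    }
}
```

The point of the sketch is only that the base/KVM class split collapses into
one branch per region; whether that reads better than the subclasses is the
question raised above.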

 
>> ---
>>  default-configs/ppc64-softmmu.mak |   1 +
>>  include/hw/ppc/spapr_xive.h       |  18 ++
>>  include/hw/ppc/xive.h             |   3 +
>>  linux-headers/asm-powerpc/kvm.h   |  12 +
>>  linux-headers/linux/kvm.h         |   4 +
>>  target/ppc/kvm_ppc.h              |   6 +
>>  hw/intc/spapr_xive_kvm.c          | 430 ++++++++++++++++++++++++++++++
>>  hw/ppc/spapr.c                    |   7 +-
>>  hw/ppc/spapr_irq.c                |  19 +-
>>  target/ppc/kvm.c                  |   7 +
>>  hw/intc/Makefile.objs             |   1 +
>>  11 files changed, 503 insertions(+), 5 deletions(-)
>>  create mode 100644 hw/intc/spapr_xive_kvm.c
>>
>> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
>> index 7f34ad0528ed..c1bf5cd951f5 100644
>> --- a/default-configs/ppc64-softmmu.mak
>> +++ b/default-configs/ppc64-softmmu.mak
>> @@ -18,6 +18,7 @@ CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
>>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>  CONFIG_XIVE=$(CONFIG_PSERIES)
>>  CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
>> +CONFIG_XIVE_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
>>  CONFIG_MEM_DEVICE=y
>>  CONFIG_DIMM=y
>>  CONFIG_SPAPR_RNG=y
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index aca2969a09ab..9c817bb7ae74 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -40,6 +40,10 @@ typedef struct sPAPRXive {
>>      /* TIMA mapping address */
>>      hwaddr        tm_base;
>>      MemoryRegion  tm_mmio;
>> +
>> +    /* KVM support */
>> +    int           fd;
>> +    void          *tm_mmap;
>>  } sPAPRXive;
>>  
>>  #define SPAPR_XIVE_BASE_CLASS(klass) \
>> @@ -83,4 +87,18 @@ void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>>  void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
>>                     uint32_t phandle);
>>  
>> +/*
>> + * XIVE KVM models
>> + */
>> +
>> +#define TYPE_SPAPR_XIVE_KVM  "spapr-xive-kvm"
>> +#define SPAPR_XIVE_KVM(obj)  OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_KVM)
>> +
>> +#define TYPE_XIVE_SOURCE_KVM "xive-source-kvm"
>> +#define XIVE_SOURCE_KVM(obj) \
>> +    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_KVM)
>> +
>> +#define TYPE_XIVE_TCTX_KVM   "xive-tctx-kvm"
>> +#define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
>> +
>>  #endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 281ed370121c..7aaf5a182cb3 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -69,6 +69,9 @@ typedef struct XiveSource {
>>      uint32_t        esb_shift;
>>      MemoryRegion    esb_mmio;
>>  
>> +    /* KVM support */
>> +    void            *esb_mmap;
>> +
>>      XiveFabric      *xive;
>>  } XiveSource;
>>  
>> diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
>> index 8c876c166ef2..f34c971491dd 100644
>> --- a/linux-headers/asm-powerpc/kvm.h
>> +++ b/linux-headers/asm-powerpc/kvm.h
> 
> Updates to linux-headers need to be split out into a separate patch.
> Eventually (i.e. by the time we merge) they should be just "update
> headers to SHA XXX" not picking and choosing pieces.

OK. I am starting to separate the KVM definitions from the patch now
that the interface is stabilizing.

>> @@ -675,4 +675,16 @@ struct kvm_ppc_cpu_char {
>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>  
>> +/* POWER9 XIVE Native Interrupt Controller */
>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
>> +#define   KVM_DEV_XIVE_GET_TIMA_FD	2
>> +#define   KVM_DEV_XIVE_VC_BASE		3
>> +#define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>> +
>> +/* Layout of 64-bit XIVE source attribute values */
>> +#define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>> +#define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>> +
>> +
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
>> index f11a7eb49cfa..59fa8d8d7f39 100644
>> --- a/linux-headers/linux/kvm.h
>> +++ b/linux-headers/linux/kvm.h
>> @@ -965,6 +965,8 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_COALESCED_PIO 162
>>  #define KVM_CAP_HYPERV_ENLIGHTENED_VMCS 163
>>  #define KVM_CAP_EXCEPTION_PAYLOAD 164
>> +#define KVM_CAP_ARM_VM_IPA_SIZE 165
>> +#define KVM_CAP_PPC_IRQ_XIVE 166
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>  
>> @@ -1188,6 +1190,8 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
>>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
>> +	KVM_DEV_TYPE_XIVE,
>> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> diff --git a/target/ppc/kvm_ppc.h b/target/ppc/kvm_ppc.h
>> index bdfaa4e70a83..d2159660f9f2 100644
>> --- a/target/ppc/kvm_ppc.h
>> +++ b/target/ppc/kvm_ppc.h
>> @@ -59,6 +59,7 @@ bool kvmppc_has_cap_fixup_hcalls(void);
>>  bool kvmppc_has_cap_htm(void);
>>  bool kvmppc_has_cap_mmu_radix(void);
>>  bool kvmppc_has_cap_mmu_hash_v3(void);
>> +bool kvmppc_has_cap_xive(void);
>>  int kvmppc_get_cap_safe_cache(void);
>>  int kvmppc_get_cap_safe_bounds_check(void);
>>  int kvmppc_get_cap_safe_indirect_branch(void);
>> @@ -307,6 +308,11 @@ static inline bool kvmppc_has_cap_mmu_hash_v3(void)
>>      return false;
>>  }
>>  
>> +static inline bool kvmppc_has_cap_xive(void)
>> +{
>> +    return false;
>> +}
>> +
>>  static inline int kvmppc_get_cap_safe_cache(void)
>>  {
>>      return 0;
>> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
>> new file mode 100644
>> index 000000000000..767f90826e43
>> --- /dev/null
>> +++ b/hw/intc/spapr_xive_kvm.c
>> @@ -0,0 +1,430 @@
>> +/*
>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qemu/error-report.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/kvm.h"
>> +#include "hw/ppc/spapr.h"
>> +#include "hw/ppc/spapr_xive.h"
>> +#include "hw/ppc/xive.h"
>> +#include "kvm_ppc.h"
>> +
>> +#include <sys/ioctl.h>
>> +
>> +/*
>> + * Helpers for CPU hotplug
>> + */
>> +typedef struct KVMEnabledCPU {
>> +    unsigned long vcpu_id;
>> +    QLIST_ENTRY(KVMEnabledCPU) node;
>> +} KVMEnabledCPU;
>> +
>> +static QLIST_HEAD(, KVMEnabledCPU)
>> +    kvm_enabled_cpus = QLIST_HEAD_INITIALIZER(&kvm_enabled_cpus);
>> +
>> +static bool kvm_cpu_is_enabled(CPUState *cs)
>> +{
>> +    KVMEnabledCPU *enabled_cpu;
>> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
>> +
>> +    QLIST_FOREACH(enabled_cpu, &kvm_enabled_cpus, node) {
>> +        if (enabled_cpu->vcpu_id == vcpu_id) {
>> +            return true;
>> +        }
>> +    }
>> +    return false;
>> +}
>> +
>> +static void kvm_cpu_enable(CPUState *cs)
>> +{
>> +    KVMEnabledCPU *enabled_cpu;
>> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
>> +
>> +    enabled_cpu = g_malloc(sizeof(*enabled_cpu));
>> +    enabled_cpu->vcpu_id = vcpu_id;
>> +    QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
>> +}
> 
> Blech, I hope we can find a better way of tracking this than an ugly
> list.

Yes ... we have a similar one for XICS.
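
One possible alternative to the list, sketched standalone below, is a bitmap
indexed by vcpu id. This assumes the vcpu ids are bounded by a known maximum;
MOCK_MAX_VCPUS and the mock_* helpers are made up for the example and do not
exist in QEMU.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MOCK_MAX_VCPUS 1024   /* hypothetical upper bound on vCPU ids */
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define NR_WORDS       ((MOCK_MAX_VCPUS + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per vCPU id: set once the vCPU is connected to the KVM device. */
static unsigned long mock_enabled_cpus[NR_WORDS];

/* Returns true if the vCPU was already connected. */
static bool mock_cpu_is_enabled(unsigned long vcpu_id)
{
    assert(vcpu_id < MOCK_MAX_VCPUS);
    return (mock_enabled_cpus[vcpu_id / BITS_PER_WORD] >>
            (vcpu_id % BITS_PER_WORD)) & 1UL;
}

/* Records that the vCPU is now connected; idempotent. */
static void mock_cpu_enable(unsigned long vcpu_id)
{
    assert(vcpu_id < MOCK_MAX_VCPUS);
    mock_enabled_cpus[vcpu_id / BITS_PER_WORD] |=
        1UL << (vcpu_id % BITS_PER_WORD);
}
```

Lookup becomes constant time and nothing is allocated per CPU, at the cost of
fixing the maximum id up front.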

>> +
>> +/*
>> + * XIVE Thread Interrupt Management context (KVM)
>> + */
>> +
>> +static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
>> +{
>> +    sPAPRXive *xive;
>> +    unsigned long vcpu_id;
>> +    int ret;
>> +
>> +    /* Check if CPU was hot unplugged and replugged. */
>> +    if (kvm_cpu_is_enabled(tctx->cs)) {
>> +        return;
>> +    }
>> +
>> +    vcpu_id = kvm_arch_vcpu_id(tctx->cs);
>> +    xive = SPAPR_XIVE_KVM(tctx->xrtr);
> 
> Is this the first use of tctx->xrtr?

No, the second. The first is the reset_tctx() op doing the CAM reset,
but we said that we could remove it.

> 
>> +    ret = kvm_vcpu_enable_cap(tctx->cs, KVM_CAP_PPC_IRQ_XIVE, 0, xive->fd,
>> +                              vcpu_id, 0);
>> +    if (ret < 0) {
>> +        error_setg(errp, "Unable to connect CPU%ld to KVM XIVE device: %s",
>> +                   vcpu_id, strerror(errno));
>> +        return;
>> +    }
>> +
>> +    kvm_cpu_enable(tctx->cs);
>> +}
>> +
>> +static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveTCTX *tctx = XIVE_TCTX_KVM(dev);
>> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(dev);
>> +    Error *local_err = NULL;
>> +
>> +    xtc->parent_realize(dev, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    xive_tctx_kvm_init(tctx, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +}
>> +
>> +static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
>> +
>> +    dc->desc = "sPAPR XIVE KVM Interrupt Thread Context";
>> +
>> +    device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
>> +                                    &xtc->parent_realize);
>> +}
>> +
>> +static const TypeInfo xive_tctx_kvm_info = {
>> +    .name          = TYPE_XIVE_TCTX_KVM,
>> +    .parent        = TYPE_XIVE_TCTX_BASE,
>> +    .instance_size = sizeof(XiveTCTX),
>> +    .class_init    = xive_tctx_kvm_class_init,
>> +    .class_size    = sizeof(XiveTCTXClass),
>> +};
>> +
>> +/*
>> + * XIVE Interrupt Source (KVM)
>> + */
>> +
>> +static void xive_source_kvm_init(XiveSource *xsrc, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
>> +    int i;
>> +
>> +    /*
>> +     * At reset, interrupt sources are simply created and MASKED. We
>> +     * only need to inform the KVM device about their type: LSI or
>> +     * MSI.
>> +     */
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        Error *local_err = NULL;
>> +        uint64_t state = 0;
>> +
>> +        if (xive_source_irq_is_lsi(xsrc, i)) {
>> +            state |= KVM_XIVE_LEVEL_SENSITIVE;
>> +            if (xsrc->status[i] & XIVE_STATUS_ASSERTED) {
>> +                state |= KVM_XIVE_LEVEL_ASSERTED;
>> +            }
>> +        }
>> +
>> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SOURCES, i, &state,
>> +                          true, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +    }
>> +}
>> +
>> +static void xive_source_kvm_reset(DeviceState *dev)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
>> +
>> +    xsc->parent_reset(dev);
>> +
>> +    xive_source_kvm_init(xsrc, &error_fatal);
>> +}
>> +
>> +static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
>> +{
>> +    XiveSource *xsrc = opaque;
>> +    struct kvm_irq_level args;
>> +    int rc;
>> +
>> +    args.irq = srcno;
>> +    if (!xive_source_irq_is_lsi(xsrc, srcno)) {
>> +        if (!val) {
>> +            return;
>> +        }
>> +        args.level = KVM_INTERRUPT_SET;
>> +    } else {
>> +        if (val) {
>> +            xsrc->status[srcno] |= XIVE_STATUS_ASSERTED;
>> +            args.level = KVM_INTERRUPT_SET_LEVEL;
>> +        } else {
>> +            xsrc->status[srcno] &= ~XIVE_STATUS_ASSERTED;
>> +            args.level = KVM_INTERRUPT_UNSET;
>> +        }
>> +    }
>> +    rc = kvm_vm_ioctl(kvm_state, KVM_IRQ_LINE, &args);
>> +    if (rc < 0) {
>> +        error_report("kvm_irq_line() failed : %s", strerror(errno));
>> +    }
>> +}
>> +
>> +static void *spapr_xive_kvm_mmap(sPAPRXive *xive, int ctrl, size_t len,
>> +                                 Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    void *addr;
>> +    int fd;
>> +
>> +    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, ctrl, &fd, false,
>> +                      &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return NULL;
>> +    }
>> +
>> +    addr = mmap(NULL, len, PROT_WRITE | PROT_READ, MAP_SHARED, fd, 0);
>> +    close(fd);
>> +    if (addr == MAP_FAILED) {
>> +        error_setg_errno(errp, errno, "Unable to set XIVE mmaping");
>> +        return NULL;
>> +    }
>> +
>> +    return addr;
>> +}
>> +
>> +/*
>> + * The sPAPRXive KVM model should have initialized the KVM device
>> + * before initializing the source
>> + */
>> +static void xive_source_kvm_mmap(XiveSource *xsrc, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE_KVM(xsrc->xive);
>> +    Error *local_err = NULL;
>> +    size_t esb_len;
>> +
>> +    esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
>> +    xsrc->esb_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_ESB_FD,
>> +                                         esb_len, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    memory_region_init_ram_device_ptr(&xsrc->esb_mmio, OBJECT(xsrc),
>> +                                      "xive.esb", esb_len, xsrc->esb_mmap);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(xsrc), &xsrc->esb_mmio);
>> +}
>> +
>> +static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
>> +    Error *local_err = NULL;
>> +
>> +    xsc->parent_realize(dev, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_kvm_set_irq, xsrc,
>> +                                     xsrc->nr_irqs);
>> +
>> +    xive_source_kvm_mmap(xsrc, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +}
>> +
>> +static void xive_source_kvm_unrealize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
>> +    size_t esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
>> +
>> +    munmap(xsrc->esb_mmap, esb_len);
>> +}
>> +
>> +static void xive_source_kvm_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
>> +
>> +    device_class_set_parent_realize(dc, xive_source_kvm_realize,
>> +                                    &xsc->parent_realize);
>> +    device_class_set_parent_reset(dc, xive_source_kvm_reset,
>> +                                  &xsc->parent_reset);
>> +
>> +    dc->desc = "sPAPR XIVE KVM Interrupt Source";
>> +    dc->unrealize = xive_source_kvm_unrealize;
>> +}
>> +
>> +static const TypeInfo xive_source_kvm_info = {
>> +    .name = TYPE_XIVE_SOURCE_KVM,
>> +    .parent = TYPE_XIVE_SOURCE_BASE,
>> +    .instance_size = sizeof(XiveSource),
>> +    .class_init    = xive_source_kvm_class_init,
>> +    .class_size    = sizeof(XiveSourceClass),
>> +};
>> +
>> +/*
>> + * sPAPR XIVE Router (KVM)
>> + */
>> +
>> +static void spapr_xive_kvm_instance_init(Object *obj)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE_KVM(obj);
>> +
>> +    xive->fd = -1;
>> +
>> +    /* We need a KVM flavored source */
>> +    object_initialize(&xive->source, sizeof(xive->source),
>> +                      TYPE_XIVE_SOURCE_KVM);
>> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>> +
>> +    /* No KVM support for END ESBs. OPAL doesn't either */
>> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
>> +                      TYPE_XIVE_END_SOURCE);
>> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
>> +                              NULL);
>> +}
>> +
>> +static void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>> +{
>> +    Error *local_err = NULL;
>> +    size_t tima_len;
>> +
>> +    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
>> +        error_setg(errp,
>> +                   "IRQ_XIVE capability must be present for KVM XIVE device");
>> +        return;
>> +    }
>> +
>> +    /* First, create the KVM XIVE device */
>> +    xive->fd = kvm_create_device(kvm_state, KVM_DEV_TYPE_XIVE, false);
>> +    if (xive->fd < 0) {
>> +        error_setg_errno(errp, -xive->fd, "error creating KVM XIVE device");
>> +        return;
>> +    }
>> +
>> +    /* Source ESBs KVM mapping
>> +     *
>> +     * Inform KVM where we will map the ESB pages. This is needed by
>> +     * the H_INT_GET_SOURCE_INFO hcall which returns the source
>> +     * characteristics, among which the ESB page address.
>> +     */
>> +    kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_CTRL, KVM_DEV_XIVE_VC_BASE,
>> +                      &xive->vc_base, true, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    /* Let the XiveSource KVM model handle the mapping for the moment */
>> +
>> +    /* TIMA KVM mapping
>> +     *
>> +     * We could also inform KVM where the TIMA will be mapped but as
>> +     * this is a fixed MMIO address for the system it does not seem
>> +     * necessary to provide a KVM ioctl to change it.
>> +     */
>> +    tima_len = 4ull << TM_SHIFT;
>> +    xive->tm_mmap = spapr_xive_kvm_mmap(xive, KVM_DEV_XIVE_GET_TIMA_FD,
>> +                                        tima_len, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    memory_region_init_ram_device_ptr(&xive->tm_mmio, OBJECT(xive),
>> +                                      "xive.tima", tima_len, xive->tm_mmap);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->tm_mmio);
>> +
>> +    kvm_kernel_irqchip = true;
>> +    kvm_msi_via_irqfd_allowed = true;
>> +    kvm_gsi_direct_mapping = true;
>> +}
>> +
>> +static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
>> +    Error *local_err = NULL;
>> +
>> +    spapr_xive_kvm_init(xive, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    /* Initialize the source and the local routing tables */
>> +    sxc->parent_realize(dev, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +}
>> +
>> +static void spapr_xive_kvm_unrealize(DeviceState *dev, Error **errp)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
>> +
>> +    close(xive->fd);
>> +    xive->fd = -1;
>> +
>> +    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
>> +}
>> +
>> +static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
>> +
>> +    device_class_set_parent_realize(dc, spapr_xive_kvm_realize,
>> +                                    &sxc->parent_realize);
>> +
>> +    dc->desc = "sPAPR XIVE KVM Interrupt Controller";
>> +    dc->unrealize = spapr_xive_kvm_unrealize;
>> +}
>> +
>> +static const TypeInfo spapr_xive_kvm_info = {
>> +    .name = TYPE_SPAPR_XIVE_KVM,
>> +    .parent = TYPE_SPAPR_XIVE_BASE,
>> +    .instance_init = spapr_xive_kvm_instance_init,
>> +    .instance_size = sizeof(sPAPRXive),
>> +    .class_init = spapr_xive_kvm_class_init,
>> +    .class_size = sizeof(sPAPRXiveClass),
>> +};
>> +
>> +static void xive_kvm_register_types(void)
>> +{
>> +    type_register_static(&spapr_xive_kvm_info);
>> +    type_register_static(&xive_source_kvm_info);
>> +    type_register_static(&xive_tctx_kvm_info);
>> +}
>> +
>> +type_init(xive_kvm_register_types)
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index f9cf2debff5a..d1be2579cd9b 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1125,8 +1125,11 @@ static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
>>          } else {
>>              val[3] = 0x00; /* Hash */
>>          }
>> -        /* TODO: test KVM support */
>> -        val[1] = smc->irq->ov5;
>> +        if (kvmppc_has_cap_xive()) {
>> +            val[1] = smc->irq->ov5;
>> +        } else {
>> +            val[1] = 0x00;
>> +        }
>>      } else {
>>          val[1] = smc->irq->ov5;
>>  
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 33dd5da7d255..92ef53743b64 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -273,9 +273,22 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>>      Error *local_err = NULL;
>>  
>>      /* KVM XIVE support */
>> -    if (kvm_enabled()) {
>> -        if (machine_kernel_irqchip_required(machine)) {
>> -            error_setg(errp, "kernel_irqchip requested. no XIVE support");
>> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
>> +        spapr->xive_tctx_type = TYPE_XIVE_TCTX_KVM;
>> +        spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM, nr_irqs,
>> +                                        nr_servers, &local_err);
>> +
>> +        if (local_err && machine_kernel_irqchip_required(machine)) {
>> +            error_propagate(errp, local_err);
>> +            error_prepend(errp, "kernel_irqchip requested but init failed : ");
>> +            return;
>> +        }
>> +
>> +        /*
>> +         * XIVE support is activated under KVM. No need to initialize
>> +         * the fallback mode under QEMU
>> +         */
>> +        if (spapr->xive) {
>>              return;
>>          }
>>      }
>> diff --git a/target/ppc/kvm.c b/target/ppc/kvm.c
>> index f81327d6cd47..3b7cf106242b 100644
>> --- a/target/ppc/kvm.c
>> +++ b/target/ppc/kvm.c
>> @@ -86,6 +86,7 @@ static int cap_fixup_hcalls;
>>  static int cap_htm;             /* Hardware transactional memory support */
>>  static int cap_mmu_radix;
>>  static int cap_mmu_hash_v3;
>> +static int cap_xive;
>>  static int cap_resize_hpt;
>>  static int cap_ppc_pvr_compat;
>>  static int cap_ppc_safe_cache;
>> @@ -149,6 +150,7 @@ int kvm_arch_init(MachineState *ms, KVMState *s)
>>      cap_htm = kvm_vm_check_extension(s, KVM_CAP_PPC_HTM);
>>      cap_mmu_radix = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_RADIX);
>>      cap_mmu_hash_v3 = kvm_vm_check_extension(s, KVM_CAP_PPC_MMU_HASH_V3);
>> +    cap_xive = kvm_vm_check_extension(s, KVM_CAP_PPC_IRQ_XIVE);
>>      cap_resize_hpt = kvm_vm_check_extension(s, KVM_CAP_SPAPR_RESIZE_HPT);
>>      kvmppc_get_cpu_characteristics(s);
>>      cap_ppc_nested_kvm_hv = kvm_vm_check_extension(s, KVM_CAP_PPC_NESTED_HV);
>> @@ -2385,6 +2387,11 @@ static int parse_cap_ppc_safe_indirect_branch(struct kvm_ppc_cpu_char c)
>>      return 0;
>>  }
>>  
>> +bool kvmppc_has_cap_xive(void)
>> +{
>> +    return cap_xive;
>> +}
>> +
>>  static void kvmppc_get_cpu_characteristics(KVMState *s)
>>  {
>>      struct kvm_ppc_cpu_char c;
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index eacd26836ebf..dd4d69db2bdd 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -39,6 +39,7 @@ obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>  obj-$(CONFIG_XIVE) += xive.o
>>  obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
>> +obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> 


* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-28 10:59     ` Cédric Le Goater
@ 2018-11-29  0:47       ` David Gibson
  2018-11-29  3:39         ` Benjamin Herrenschmidt
  2018-12-03 17:05         ` Cédric Le Goater
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29  0:47 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Wed, Nov 28, 2018 at 11:59:58AM +0100, Cédric Le Goater wrote:
> On 11/28/18 12:49 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
> >> The last sub-engine of the XIVE architecture is the Interrupt
> >> Virtualization Presentation Engine (IVPE). On HW, they share elements,
> >> the Power Bus interface (CQ), the routing table descriptors, and they
> >> can be combined in the same HW logic. We do the same in QEMU and
> >> combine both engines in the XiveRouter for simplicity.
> > 
> > Ok, I'm not entirely convinced combining the IVPE and IVRE into a
> > single object is a good idea, but we can probably discuss that once
> > I've read further.
> 
> We could introduce a simplified presenter for sPAPR but I am not even
> sure of that as it will get more complex if we support the EBB one day. 

I wasn't really thinking about PAPR for this comment.

> >> When the IVRE has completed its job of matching an event source with a
> >> Notification Virtual Target (NVT) to notify, it forwards the event
> >> notification to the IVPE sub-engine. The IVPE scans the thread
> >> interrupt contexts of the Notification Virtual Targets (NVT)
> >> dispatched on the HW processor threads and if a match is found, it
> >> signals the thread. If not, the IVPE escalates the notification to
> >> some other targets and records the notification in a backlog queue.
> >>
> >> The IVPE maintains the thread interrupt context state for each of its
> >> NVTs not dispatched on HW processor threads in the Notification
> >> Virtual Target table (NVTT).
> >>
> >> The model currently only supports single NVT notifications.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/xive.h      |  13 +++
> >>  include/hw/ppc/xive_regs.h |  22 ++++
> >>  hw/intc/xive.c             | 223 +++++++++++++++++++++++++++++++++++++
> >>  3 files changed, 258 insertions(+)
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 5987f26ddb98..e715a6c6923d 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
> >>                     XiveEND *end);
> >>      int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>                     XiveEND *end);
> >> +    int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                   XiveNVT *nvt);
> >> +    int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                   XiveNVT *nvt);
> > 
> > As with the ENDs, I don't think get/set is a good interface for a
> > bigger-than-word-size object.
> 
> We need to agree on this interface before I respin. So you would like 
> to add an extra argument specifying the word being accessed?

Yes.  Ok, 3 options I can see at this point:

1) read/write accessors which take a word number

2) A "get" accessor which copies the whole structure, but "write"
accessor which takes a word number.  The asymmetry is a bit ugly, but
it's the non-atomic writeback of the whole structure which I'm most
uncomfortable with.

3) A map/unmap interface which gives you / releases a pointer to the
"live" structure.  For powernv that would become
address_space_map()/unmap().  For PAPR it would just be return pointer
/ no-op.
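
A minimal sketch of what option 1 could look like, assuming a flat
in-memory table as a PAPR-style backend might use.  All demo_* names
and the table size are hypothetical and not part of the patch; only
the 16-word layout is taken from the XiveNVT definition quoted below.

```c
#include <stdint.h>

#define NVT_NUM_WORDS 16   /* w0..wf, matching the XiveNVT layout */
#define DEMO_NR_NVTS  8    /* hypothetical table size */

/* Hypothetical stand-in for XiveNVT, indexed by word number */
typedef struct DemoNVT {
    uint32_t w[NVT_NUM_WORDS];
} DemoNVT;

static DemoNVT demo_nvt_table[DEMO_NR_NVTS];

/* Read a single 32-bit word; returns 0 on success, -1 on a bad argument */
static int demo_nvt_read_word(uint32_t nvt_idx, unsigned word, uint32_t *val)
{
    if (nvt_idx >= DEMO_NR_NVTS || word >= NVT_NUM_WORDS || !val) {
        return -1;
    }
    *val = demo_nvt_table[nvt_idx].w[word];
    return 0;
}

/* Write a single 32-bit word, leaving the other words untouched */
static int demo_nvt_write_word(uint32_t nvt_idx, unsigned word, uint32_t val)
{
    if (nvt_idx >= DEMO_NR_NVTS || word >= NVT_NUM_WORDS) {
        return -1;
    }
    demo_nvt_table[nvt_idx].w[word] = val;
    return 0;
}
```

The point of the per-word write is that a caller updating, say, only
w4 never writes back stale copies of the other fifteen words.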

> 
> > 
> >>  } XiveRouterClass;
> >>  
> >>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> >> @@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>                          XiveEND *end);
> >>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>                          XiveEND *end);
> >> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                        XiveNVT *nvt);
> >> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                        XiveNVT *nvt);
> >>  
> >>  /*
> >>   * XIVE END ESBs
> >> @@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
> >>  
> >>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
> >>  
> >> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
> >> +{
> >> +    return (nvt_blk << 19) | nvt_idx;
> > 
> > I'm guessing this formula is the standard way of combining the NVT
> > block and index into a single word?  
> 
> That number is the VP/NVT identifier which is written in the CAM value. 
> The index is on 19 bits because of the NVT definition in the END
> structure. It is being increased to 24 bits on Power10.
> 
> > If so, I think we should
> > standardize on passing a single word "nvt_id" around and only
> > splitting it when we need to use the block separately.  
> 
> This is really the only place where we concatenate the two NVT values,
> block and index. 

Hm, ok.  I know we don't model them (yet, maybe ever) but could
combined values appear in the PowerBUS messages that handle remote
notifications?
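
For illustration, helpers to build and split such a combined id might
look like the sketch below.  It assumes the 19-bit index width
mentioned above; the demo_* names are hypothetical, and unlike
xive_tctx_cam_line() in the patch this version masks the index so that
it cannot spill into the block field.

```c
#include <stdint.h>

#define DEMO_NVT_IDX_BITS 19
#define DEMO_NVT_IDX_MASK ((1u << DEMO_NVT_IDX_BITS) - 1)

/* Combine block and index into a single word: blk || 19-bit idx */
static inline uint32_t demo_nvt_id(uint8_t blk, uint32_t idx)
{
    return ((uint32_t)blk << DEMO_NVT_IDX_BITS) | (idx & DEMO_NVT_IDX_MASK);
}

/* Extract the block number from a combined id */
static inline uint8_t demo_nvt_id_blk(uint32_t id)
{
    return (uint8_t)(id >> DEMO_NVT_IDX_BITS);
}

/* Extract the index from a combined id */
static inline uint32_t demo_nvt_id_idx(uint32_t id)
{
    return id & DEMO_NVT_IDX_MASK;
}
```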

> > Same goes for
> > the end_id, assuming there's a standard way of putting that into a
> > single word.  That will address the point I raised earlier about lisn
> > being passed around as a single word, but these later stage ids being
> > split.
> 
> Hmm, I am not sure this is a good option. It is not how the PowerNV 
> model would use it, skiboot is very much aware of these blocks and 
> indexes and for remote accesses chips are identified using the block. 
> I will take a look at it but I am not fond of it. I can add helpers 
> in some places though.    

Hm, ok.  Do the block and index appear as an (effectively) single
field in the EAS?

> I agree we have some kind of issue linking the HW model with the sPAPR 
> machine. The guest interface is only  about IRQ numbers, priorities and
> cpu numbers. We really don't care about XIVE blocks and indexes in that 
> case. We can clarify the code by bypassing the XiveRouter interfaces
> to the table and directly use the sPAPR interrupt controller. That 
> should help a bit for the hcalls but we would still have to fill in 
> the EAT and the END with some index values if we want to use the router
> algorithm.

I don't think this is too much of a problem.  These are essentially
machine internal details so we can choose an allocation to suit us.
The obvious one is to put everything in a single block, at least as
long as that won't limit our numbers too much.

> > We'll probably want some inlines or macros to build an
> > nvt/end/lisn/whatever id from block and index as well.
> > 
> >> +}
> >> +
> >>  #endif /* PPC_XIVE_H */
> >> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
> >> index 2e3d6cb507da..05cb992d2815 100644
> >> --- a/include/hw/ppc/xive_regs.h
> >> +++ b/include/hw/ppc/xive_regs.h
> >> @@ -158,4 +158,26 @@ typedef struct XiveEND {
> >>  #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
> >>  } XiveEND;
> >>  
> >> +/* Notification Virtual Target (NVT) */
> >> +typedef struct XiveNVT {
> >> +        uint32_t        w0;
> >> +#define NVT_W0_VALID             PPC_BIT32(0)
> >> +        uint32_t        w1;
> >> +        uint32_t        w2;
> >> +        uint32_t        w3;
> >> +        uint32_t        w4;
> >> +        uint32_t        w5;
> >> +        uint32_t        w6;
> >> +        uint32_t        w7;
> >> +        uint32_t        w8;
> >> +#define NVT_W8_GRP_VALID         PPC_BIT32(0)
> >> +        uint32_t        w9;
> >> +        uint32_t        wa;
> >> +        uint32_t        wb;
> >> +        uint32_t        wc;
> >> +        uint32_t        wd;
> >> +        uint32_t        we;
> >> +        uint32_t        wf;
> >> +} XiveNVT;
> >> +
> >>  #endif /* PPC_XIVE_REGS_H */
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 4c6cb5d52975..5ba3b06e6e25 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -373,6 +373,32 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
> >>      }
> >>  }
> >>  
> >> +/* The HW CAM (23bits) is hardwired to :
> >> + *
> >> + *   0x000||0b1||4Bit chip number||7Bit Thread number.
> >> + *
> >> + * and when the block grouping extension is enabled :
> >> + *
> >> + *   4Bit chip number||0x001||7Bit Thread number.
> >> + */
> >> +static uint32_t tctx_hw_cam_line(bool block_group, uint8_t chip_id, uint8_t tid)
> >> +{
> >> +    if (block_group) {
> >> +        return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
> >> +    } else {
> >> +        return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
> >> +    }
> >> +}
> >> +
> >> +static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
> >> +{
> >> +    PowerPCCPU *cpu = POWERPC_CPU(tctx->cs);
> >> +    CPUPPCState *env = &cpu->env;
> >> +    uint32_t pir = env->spr_cb[SPR_PIR].default_value;
> > 
> > I don't much like reaching into the cpu state itself.  I think a
> > better idea would be to have the TCTX have its HW CAM id set during
> > initialization (via a property) and then use that.  This will mean
> > less mucking about if future cpu revisions don't split the PIR into
> > chip and tid ids in the same way.
> 
> Yes, good idea. I will see how to handle the block_group boolean. Maybe we
> can leave it out of the model for now as it is not used.

Yes, it would be nice to leave the block_group stuff as a later
extension when/if we need it.  If we put it in as a stub and nothing
is using/testing it, it's likely it will be broken if we ever do
actually try to use it.

> 
> > 
> >> +    return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
> >> +}
> >> +
> >>  static void xive_tctx_reset(void *dev)
> >>  {
> >>      XiveTCTX *tctx = XIVE_TCTX(dev);
> >> @@ -1013,6 +1039,195 @@ int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>     return xrc->set_end(xrtr, end_blk, end_idx, end);
> >>  }
> >>  
> >> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                        XiveNVT *nvt)
> >> +{
> >> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> >> +
> >> +   return xrc->get_nvt(xrtr, nvt_blk, nvt_idx, nvt);
> >> +}
> >> +
> >> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                        XiveNVT *nvt)
> >> +{
> >> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
> >> +
> >> +   return xrc->set_nvt(xrtr, nvt_blk, nvt_idx, nvt);
> >> +}
> >> +
> >> +static bool xive_tctx_ring_match(XiveTCTX *tctx, uint8_t ring,
> >> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                                 bool cam_ignore, uint32_t logic_serv)
> >> +{
> >> +    uint8_t *regs = &tctx->regs[ring];
> >> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &regs[TM_WORD2]));
> >> +    uint32_t cam = xive_tctx_cam_line(nvt_blk, nvt_idx);
> >> +    bool block_group = false; /* TODO (PowerNV) */
> >> +
> >> +    /* TODO (PowerNV): ignore low order bits of nvt id */
> >> +
> >> +    switch (ring) {
> >> +    case TM_QW3_HV_PHYS:
> >> +        return (w2 & TM_QW3W2_VT) && xive_tctx_hw_cam_line(tctx, block_group) ==
> >> +            tctx_hw_cam_line(block_group, nvt_blk, nvt_idx);
> > 
> > The difference between "xive_tctx_hw_cam_line" and "tctx_hw_cam_line"
> > here is far from obvious.  
> 
> yes. I lacked inspiration ...

I'd suggest that the one which takes the tctx as a parameter be
tctx_hw_cam_line() and the other be nvt_hw_cam_line() or similar.  The
crucial difference here is that one is what the thread is looking for,
the other is what the NVT is advertising.
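
Roughly, and only as a sketch: the version below covers just the
layout without block grouping (0b1 || 4-bit chip || 7-bit thread), and
assumes the thread context carries a precomputed hw_cam value set at
initialization, as suggested earlier.  DemoTCTX and demo_hw_cam_match()
are hypothetical names.

```c
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical thread context holding its HW CAM id, set at init time */
typedef struct DemoTCTX {
    uint32_t hw_cam;
} DemoTCTX;

/* CAM line computed from the notification side: 0b1 || chip || thread */
static uint32_t nvt_hw_cam_line(uint8_t chip_id, uint8_t tid)
{
    return (1u << 11) | ((uint32_t)(chip_id & 0xf) << 7) | (tid & 0x7f);
}

/* CAM line held by the thread context */
static uint32_t tctx_hw_cam_line(const DemoTCTX *tctx)
{
    return tctx->hw_cam;
}

/* The physical ring matches when both CAM lines are equal */
static bool demo_hw_cam_match(const DemoTCTX *tctx, uint8_t chip_id,
                              uint8_t tid)
{
    return tctx_hw_cam_line(tctx) == nvt_hw_cam_line(chip_id, tid);
}
```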

> > Remember that namespacing prefixes aren't
> > necessary for static functions, which can let you give more
> > descriptive names without getting excessively long.
> 
> OK.
>  
> >> +    case TM_QW2_HV_POOL:
> >> +        return (w2 & TM_QW2W2_VP) && (cam == GETFIELD(TM_QW2W2_POOL_CAM, w2));
> >> +
> >> +    case TM_QW1_OS:
> >> +        return (w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2));
> >> +
> >> +    case TM_QW0_USER:
> >> +        return ((w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2)) &&
> >> +                (w2 & TM_QW0W2_VU) &&
> >> +                (logic_serv == GETFIELD(TM_QW0W2_LOGIC_SERV, w2)));
> >> +
> >> +    default:
> >> +        g_assert_not_reached();
> >> +    }
> >> +}
> >> +
> >> +static int xive_presenter_tctx_match(XiveTCTX *tctx, uint8_t format,
> >> +                                     uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                                     bool cam_ignore, uint32_t logic_serv)
> >> +{
> >> +    if (format == 0) {
> >> +        /* F=0 & i=1: Logical server notification */
> >> +        if (cam_ignore == true) {
> >> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for LS "
> >> +                          "NVT %x/%x\n", nvt_blk, nvt_idx);
> >> +             return -1;
> >> +        }
> >> +
> >> +        /* F=0 & i=0: Specific NVT notification */
> >> +        if (xive_tctx_ring_match(tctx, TM_QW3_HV_PHYS,
> >> +                                nvt_blk, nvt_idx, false, 0)) {
> >> +            return TM_QW3_HV_PHYS;
> >> +        }
> >> +        if (xive_tctx_ring_match(tctx, TM_QW2_HV_POOL,
> >> +                                nvt_blk, nvt_idx, false, 0)) {
> >> +            return TM_QW2_HV_POOL;
> >> +        }
> >> +        if (xive_tctx_ring_match(tctx, TM_QW1_OS,
> >> +                                nvt_blk, nvt_idx, false, 0)) {
> >> +            return TM_QW1_OS;
> >> +        }
> > 
> > Hm.  It's a bit pointless to iterate through each ring calling a
> > common function, when that "common" function consists entirely of a
> > switch which makes it not really common at all.
> > 
> > So I think you want separate helper functions for each ring's match,
> > or even just fold the previous function into this one.
> 
> yes. It can be improved. I did try different layouts. I might just fold 
> both routines into one as you propose.
> 
> >> +    } else {
> >> +        /* F=1 : User level Event-Based Branch (EBB) notification */
> >> +        if (xive_tctx_ring_match(tctx, TM_QW0_USER,
> >> +                                nvt_blk, nvt_idx, false, logic_serv)) {
> >> +            return TM_QW0_USER;
> >> +        }
> >> +    }
> >> +    return -1;
> >> +}
> >> +
> >> +typedef struct XiveTCTXMatch {
> >> +    XiveTCTX *tctx;
> >> +    uint8_t ring;
> >> +} XiveTCTXMatch;
> >> +
> >> +static bool xive_presenter_match(XiveRouter *xrtr, uint8_t format,
> >> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
> >> +                                 bool cam_ignore, uint8_t priority,
> >> +                                 uint32_t logic_serv, XiveTCTXMatch *match)
> >> +{
> >> +    CPUState *cs;
> >> +
> >> +    /* TODO (PowerNV): handle chip_id overwrite of block field for
> >> +     * hardwired CAM compares */
> >> +
> >> +    CPU_FOREACH(cs) {
> >> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> >> +        XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
> >> +        int ring;
> >> +
> >> +        /*
> >> +         * HW checks that the CPU is enabled in the Physical Thread
> >> +         * Enable Register (PTER).
> >> +         */
> >> +
> >> +        /*
> >> +         * Check the thread context CAM lines and record matches. We
> >> +         * will handle CPU exception delivery later
> >> +         */
> >> +        ring = xive_presenter_tctx_match(tctx, format, nvt_blk, nvt_idx,
> >> +                                         cam_ignore, logic_serv);
> >> +        /*
> >> +         * Save the context and follow on to catch duplicates, that we
> >> +         * don't support yet.
> >> +         */
> >> +        if (ring != -1) {
> >> +            if (match->tctx) {
> >> +                qemu_log_mask(LOG_GUEST_ERROR, "XIVE: already found a thread "
> >> +                              "context NVT %x/%x\n", nvt_blk, nvt_idx);
> >> +                return false;
> >> +            }
> >> +
> >> +            match->ring = ring;
> >> +            match->tctx = tctx;
> >> +        }
> >> +    }
> >> +
> >> +    if (!match->tctx) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
> >> +                      nvt_blk, nvt_idx);
> >> +        return false;
> > 
> > Hmm.. this isn't actually an error, is it? At least not for powernv
> 
> It is on sPAPR, it would mean the END was configured with an unknown CPU.

Right.

> It is not an error on PowerNV, when we support escalations.
> 
> > - that just means the NVT isn't currently dispatched, so we'll need to
> > trigger the escalation interrupt.  
> 
> Yes.
> 
> > Does this get changed later in the series?
> 
> No.

But this code is common to PAPR and powernv, yes, so it will need to?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged
  2018-11-28 11:30     ` Cédric Le Goater
@ 2018-11-29  0:49       ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29  0:49 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Wed, Nov 28, 2018 at 12:30:45PM +0100, Cédric Le Goater wrote:
> On 11/28/18 1:13 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:02AM +0100, Cédric Le Goater wrote:
> >> After the event data was pushed in the O/S Event Queue, the IVPE
> >> raises the bit corresponding to the priority of the pending interrupt
> >> in the register IPB (Interrupt Pending Buffer) to indicate there is an
> >> event pending in one of the 8 priority queues. The Pending Interrupt
> >> Priority Register (PIPR) is also updated using the IPB. This register
> >> represents the priority of the most favored pending notification.
> >>
> >> The PIPR is then compared to the Current Processor Priority
> >> Register (CPPR). If it is more favored (numerically less than), the
> >> CPU interrupt line is raised and the EO bit of the Notification Source
> >> Register (NSR) is updated to notify the presence of an exception for
> >> the O/S. The check needs to be done whenever the PIPR or the CPPR are
> >> changed.
> >>
> >> The O/S acknowledges the interrupt with a special load in the Thread
> >> Interrupt Management Area. If the EO bit of the NSR is set, the CPPR
> >> takes the value of PIPR. The bit number in the IPB corresponding to
> >> the priority of the pending interrupt is reset and so is the EO bit
> >> of the NSR.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  hw/intc/xive.c | 94 +++++++++++++++++++++++++++++++++++++++++++++++++-
> >>  1 file changed, 93 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 5ba3b06e6e25..c49932d2b799 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -21,9 +21,73 @@
> >>   * XIVE Thread Interrupt Management context
> >>   */
> >>  
> >> +/* Convert a priority number to an Interrupt Pending Buffer (IPB)
> >> + * register, which indicates a pending interrupt at the priority
> >> + * corresponding to the bit number
> >> + */
> >> +static uint8_t priority_to_ipb(uint8_t priority)
> >> +{
> >> +    return priority > XIVE_PRIORITY_MAX ?
> >> +        0 : 1 << (XIVE_PRIORITY_MAX - priority);
> >> +}
> >> +
> >> +/* Convert an Interrupt Pending Buffer (IPB) register to a Pending
> >> + * Interrupt Priority Register (PIPR), which contains the priority of
> >> + * the most favored pending notification.
> >> + */
> >> +static uint8_t ipb_to_pipr(uint8_t ibp)
> >> +{
> >> +    return ibp ? clz32((uint32_t)ibp << 24) : 0xff;
> >> +}
> >> +
> >> +static void ipb_update(uint8_t *regs, uint8_t priority)
> >> +{
> >> +    regs[TM_IPB] |= priority_to_ipb(priority);
> >> +    regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
> >> +}
> >> +
> >> +static uint8_t exception_mask(uint8_t ring)
> >> +{
> >> +    switch (ring) {
> >> +    case TM_QW1_OS:
> >> +        return TM_QW1_NSR_EO;
> >> +    default:
> >> +        g_assert_not_reached();
> >> +    }
> >> +}
> >> +
> >>  static uint64_t xive_tctx_accept(XiveTCTX *tctx, uint8_t ring)
> >>  {
> >> -    return 0;
> >> +    uint8_t *regs = &tctx->regs[ring];
> >> +    uint8_t nsr = regs[TM_NSR];
> >> +    uint8_t mask = exception_mask(ring);
> >> +
> >> +    qemu_irq_lower(tctx->output);
> >> +
> >> +    if (regs[TM_NSR] & mask) {
> >> +        uint8_t cppr = regs[TM_PIPR];
> >> +
> >> +        regs[TM_CPPR] = cppr;
> >> +
> >> +        /* Reset the pending buffer bit */
> >> +        regs[TM_IPB] &= ~priority_to_ipb(cppr);
> >> +        regs[TM_PIPR] = ipb_to_pipr(regs[TM_IPB]);
> >> +
> >> +        /* Drop Exception bit */
> >> +        regs[TM_NSR] &= ~mask;
> >> +    }
> >> +
> >> +    return (nsr << 8) | regs[TM_CPPR];
> > 
> > Don't you need a cast to avoid (nsr << 8) being a shift-wider-than-size?
> 
> I will check.

According to Eric, it doesn't, and given the compiler isn't
complaining I'm pretty sure that's right.  Makes me a bit nervous
though.
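
For what it's worth, the reason is C's integer promotion: a uint8_t
operand is promoted to int before the shift, so the shift happens at
int width and the result is at most 0xff00.  A standalone sketch
(demo_ack_value() is a hypothetical name, not from the patch):

```c
#include <stdint.h>

/*
 * Mirror of the return expression in xive_tctx_accept(): nsr is
 * promoted to int before the shift, so (nsr << 8) cannot overflow.
 */
static uint64_t demo_ack_value(uint8_t nsr, uint8_t cppr)
{
    return (nsr << 8) | cppr;
}
```

An explicit (uint64_t) cast would only become necessary if the shift
count could reach the width of int.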

> 
> > 
> >> +}
> >> +
> >> +static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
> >> +{
> >> +    uint8_t *regs = &tctx->regs[ring];
> >> +
> >> +    if (regs[TM_PIPR] < regs[TM_CPPR]) {
> >> +        regs[TM_NSR] |= exception_mask(ring);
> >> +        qemu_irq_raise(tctx->output);
> >> +    }
> >>  }
> >>  
> >>  static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
> >> @@ -33,6 +97,9 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
> >>      }
> >>  
> >>      tctx->regs[ring + TM_CPPR] = cppr;
> >> +
> >> +    /* CPPR has changed, check if we need to raise a pending exception */
> >> +    xive_tctx_notify(tctx, ring);
> >>  }
> >>  
> >>  /*
> >> @@ -198,6 +265,17 @@ static void xive_tm_set_os_cppr(XiveTCTX *tctx, hwaddr offset,
> >>      xive_tctx_set_cppr(tctx, TM_QW1_OS, value & 0xff);
> >>  }
> >>  
> >> +/*
> >> + * Adjust the IPB to allow a CPU to process event queues of other
> >> + * priorities during one physical interrupt cycle.
> >> + */
> >> +static void xive_tm_set_os_pending(XiveTCTX *tctx, hwaddr offset,
> >> +                                   uint64_t value, unsigned size)
> >> +{
> >> +    ipb_update(&tctx->regs[TM_QW1_OS], value & 0xff);
> >> +    xive_tctx_notify(tctx, TM_QW1_OS);
> >> +}
> >> +
> >>  /*
> >>   * Define a mapping of "special" operations depending on the TIMA page
> >>   * offset and the size of the operation.
> >> @@ -220,6 +298,7 @@ static const XiveTmOp xive_tm_operations[] = {
> >>  
> >>      /* MMIOs above 2K : special operations with side effects */
> >>      { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
> >> +    { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
> >>  };
> >>  
> >>  static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
> >> @@ -409,6 +488,13 @@ static void xive_tctx_reset(void *dev)
> >>      tctx->regs[TM_QW1_OS + TM_LSMFB] = 0xFF;
> >>      tctx->regs[TM_QW1_OS + TM_ACK_CNT] = 0xFF;
> >>      tctx->regs[TM_QW1_OS + TM_AGE] = 0xFF;
> >> +
> >> +    /*
> >> +     * Initialize PIPR to 0xFF to avoid phantom interrupts when the
> >> +     * CPPR is first set.
> >> +     */
> >> +    tctx->regs[TM_QW1_OS + TM_PIPR] =
> >> +        ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
> >>  }
> >>  
> >>  static void xive_tctx_realize(DeviceState *dev, Error **errp)
> >> @@ -1218,9 +1304,15 @@ static void xive_presenter_notify(XiveRouter *xrtr, uint8_t format,
> >>      found = xive_presenter_match(xrtr, format, nvt_blk, nvt_idx, cam_ignore,
> >>                                   priority, logic_serv, &match);
> >>      if (found) {
> >> +        ipb_update(&match.tctx->regs[match.ring], priority);
> >> +        xive_tctx_notify(match.tctx, match.ring);
> >>          return;
> >>      }
> >>  
> >> +    /* Record the IPB in the associated NVT structure */
> >> +    ipb_update((uint8_t *) &nvt.w4, priority);
> >> +    xive_router_set_nvt(xrtr, nvt_blk, nvt_idx, &nvt);
> > 
> > You're only writing back the NVT in the !found case.  Don't you still
> > need to update it in the found case?
> 
> I would say no unless we add support for redistribution which would
> mean the model supports logical servers. 

Oh, sorry, I think I missed that the ipb_update() was only touching
the NVT in the !found case.
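
For reference, the IPB/PIPR bookkeeping that ipb_update() relies on can
be sketched standalone as below.  It assumes 8 priorities (0 most
favored, 7 least), replaces clz32() with a portable loop, and uses
hypothetical demo_* names.

```c
#include <stdint.h>

#define DEMO_PRIORITY_MAX 7   /* assumed: priorities 0..7 */

/* Priority 0 maps to the most significant bit of the 8-bit IPB */
static uint8_t demo_priority_to_ipb(uint8_t priority)
{
    return priority > DEMO_PRIORITY_MAX ?
        0 : 1 << (DEMO_PRIORITY_MAX - priority);
}

/* PIPR is the most favored (lowest numbered) pending priority, else 0xff */
static uint8_t demo_ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio;

    for (prio = 0; prio <= DEMO_PRIORITY_MAX; prio++) {
        if (ipb & demo_priority_to_ipb(prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* An exception is raised only when PIPR is more favored than CPPR */
static int demo_should_notify(uint8_t pipr, uint8_t cppr)
{
    return pipr < cppr;
}
```

With an empty IPB the PIPR is 0xff, which is never less than any CPPR,
so no notification is raised.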

> These are much more complex scenarios in which the IVPE returns multiple 
> matching targets, the IVRE selects one but then the context changes.
>  
> 
> C.
> 
> > 
> >>      /* If no matching NVT is dispatched on a HW thread :
> >>       * - update the NVT structure if backlog is activated
> >>       * - escalate (ESe PQ bits and EAS in w4-5) if escalation is
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-28 16:27     ` Cédric Le Goater
@ 2018-11-29  0:54       ` David Gibson
  2018-11-29 14:37         ` Cédric Le Goater
  2018-12-04 17:12       ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  0:54 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 20091 bytes --]

On Wed, Nov 28, 2018 at 05:27:29PM +0100, Cédric Le Goater wrote:
> On 11/28/18 1:52 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:03AM +0100, Cédric Le Goater wrote:
> >> sPAPRXive models the XIVE interrupt controller of the sPAPR machine.
> >> It inherits from the XiveRouter and provisions storage for the routing
> >> tables :
> >>
> >>   - Event Assignment Structure (EAS)
> >>   - Event Notification Descriptor (END)
> >>
> >> The sPAPRXive model incorporates an internal XiveSource for the IPIs
> >> and for the interrupts of the virtual devices of the guest. This model
> >> is consistent with XIVE architecture which also incorporates an
> >> internal IVSE for IPIs and accelerator interrupts in the IVRE
> >> sub-engine.
> >>
> >> The sPAPRXive model exports two memory regions, one for the ESB
> >> trigger and management pages used to control the sources and one for
> >> the TIMA pages. They are mapped by default at the addresses found on
> >> chip 0 of a baremetal system. This is also consistent with the XIVE
> >> architecture which defines a Virtualization Controller BAR for the
> >> internal IVSE ESB pages and a Thread Management BAR for the TIMA.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  default-configs/ppc64-softmmu.mak |   1 +
> >>  include/hw/ppc/spapr_xive.h       |  46 +++++
> >>  hw/intc/spapr_xive.c              | 323 ++++++++++++++++++++++++++++++
> >>  hw/intc/Makefile.objs             |   1 +
> >>  4 files changed, 371 insertions(+)
> >>  create mode 100644 include/hw/ppc/spapr_xive.h
> >>  create mode 100644 hw/intc/spapr_xive.c
> >>
> >> diff --git a/default-configs/ppc64-softmmu.mak b/default-configs/ppc64-softmmu.mak
> >> index 2d1e7c5c4668..7f34ad0528ed 100644
> >> --- a/default-configs/ppc64-softmmu.mak
> >> +++ b/default-configs/ppc64-softmmu.mak
> >> @@ -17,6 +17,7 @@ CONFIG_XICS=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_XICS_KVM=$(call land,$(CONFIG_PSERIES),$(CONFIG_KVM))
> >>  CONFIG_XIVE=$(CONFIG_PSERIES)
> >> +CONFIG_XIVE_SPAPR=$(CONFIG_PSERIES)
> >>  CONFIG_MEM_DEVICE=y
> >>  CONFIG_DIMM=y
> >>  CONFIG_SPAPR_RNG=y
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> new file mode 100644
> >> index 000000000000..06727bd86aa9
> >> --- /dev/null
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -0,0 +1,46 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#ifndef PPC_SPAPR_XIVE_H
> >> +#define PPC_SPAPR_XIVE_H
> >> +
> >> +#include "hw/sysbus.h"
> >> +#include "hw/ppc/xive.h"
> >> +
> >> +#define TYPE_SPAPR_XIVE "spapr-xive"
> >> +#define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> >> +
> >> +typedef struct sPAPRXive {
> >> +    XiveRouter    parent;
> >> +
> >> +    /* Internal interrupt source for IPIs and virtual devices */
> >> +    XiveSource    source;
> >> +    hwaddr        vc_base;
> >> +
> >> +    /* END ESB MMIOs */
> >> +    XiveENDSource end_source;
> >> +    hwaddr        end_base;
> >> +
> >> +    /* Routing table */
> >> +    XiveEAS       *eat;
> >> +    uint32_t      nr_irqs;
> >> +    XiveEND       *endt;
> >> +    uint32_t      nr_ends;
> >> +
> >> +    /* TIMA mapping address */
> >> +    hwaddr        tm_base;
> >> +    MemoryRegion  tm_mmio;
> >> +} sPAPRXive;
> >> +
> >> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> >> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> >> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> >> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
> >> +
> >> +#endif /* PPC_SPAPR_XIVE_H */
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> new file mode 100644
> >> index 000000000000..5d038146c08e
> >> --- /dev/null
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -0,0 +1,323 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/log.h"
> >> +#include "qapi/error.h"
> >> +#include "target/ppc/cpu.h"
> >> +#include "sysemu/cpus.h"
> >> +#include "monitor/monitor.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/ppc/spapr_xive.h"
> >> +#include "hw/ppc/xive.h"
> >> +#include "hw/ppc/xive_regs.h"
> >> +
> >> +/*
> >> + * XIVE Virtualization Controller BAR and Thread Management BAR that we
> >> + * use for the ESB pages and the TIMA pages
> >> + */
> >> +#define SPAPR_XIVE_VC_BASE   0x0006010000000000ull
> >> +#define SPAPR_XIVE_TM_BASE   0x0006030203180000ull
> >> +
> >> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> >> +{
> >> +    int i;
> >> +    uint32_t offset = 0;
> >> +
> >> +    monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
> >> +                   offset + xive->source.nr_irqs - 1);
> >> +    xive_source_pic_print_info(&xive->source, offset, mon);
> >> +
> >> +    monitor_printf(mon, "XIVE EAT %08x .. %08x\n", 0, xive->nr_irqs - 1);
> >> +    for (i = 0; i < xive->nr_irqs; i++) {
> >> +        xive_eas_pic_print_info(&xive->eat[i], i, mon);
> >> +    }
> >> +
> >> +    monitor_printf(mon, "XIVE ENDT %08x .. %08x\n", 0, xive->nr_ends - 1);
> >> +    for (i = 0; i < xive->nr_ends; i++) {
> >> +        xive_end_pic_print_info(&xive->endt[i], i, mon);
> >> +    }
> > 
> > AIUI the PAPR model hides the details of ENDs, EQs and NVTs - instead
> > each logical EAS just points at a (thread, priority) pair, which under
> > the hood has exactly one END and one NVT bound to it.
> > 
> > Given that, would it make more sense to reformat the info here to show
> > things in terms of those (thread, priority) pairs, rather than the
> > internal EAS and END details?
> 
> Yes. I had a version doing something like that before. I will rework
> the output a little for sPAPR.
> 
> 
> >> +}
> >> +
> >> +/* Map the ESB pages and the TIMA pages */
> >> +static void spapr_xive_mmio_map(sPAPRXive *xive)
> >> +{
> >> +    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
> >> +    sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);
> > 
> > Uh.. I didn't think the PAPR model exposed the END sources to the guest?
> 
> Well, it should if it was being used but it's not the case for any of the 
> sPAPR guest OS today. So I think it's preferable to remove the mapping until 
> someone wants to experiment with it. We can keep the XiveENDSource object 
> though. This is harmless.

So, having now read later patches I see that the model does at least
theoretically expose these.  And yes, it makes sense to keep the
ENDSource objects internally whether or not we expose them.

> There is no KVM side to the END ESBs either as OPAL does not use them. 

Ah.. that's a pretty good argument for not exposing them for now.

> >> +    sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
> >> +}
> >> +
> >> +static void spapr_xive_reset(DeviceState *dev)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    int i;
> >> +
> >> +    /* Xive Source reset is done through SysBus, it should put all
> >> +     * IRQs to OFF (!P|Q) */
> >> +
> >> +    /* Mask all valid EASs in the IRQ number space. */
> >> +    for (i = 0; i < xive->nr_irqs; i++) {
> >> +        XiveEAS *eas = &xive->eat[i];
> >> +        if (eas->w & EAS_VALID) {
> >> +            eas->w |= EAS_MASKED;
> > 
> > To ensure consistent behaviour across reboots, it would be better to
> > reset the whole of the EAS, except those which have to be preserved
> > across reboots (which would be VALID, and maybe nothing else?).
> 
> VALID EAS corresponds to IRQ numbers claimed by the devices of the machine.
> So we should keep the valid bit but reset all other settings which this
> reset method is not doing. I will fix.

Ok.
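
A minimal sketch of the reset behaviour agreed on above: keep only the
VALID bit of each claimed EAS, leave it masked, and clear everything
else. The XiveEAS type and the EAS_VALID/EAS_MASKED bit positions here
are stand-ins so the example is self-contained, and eat_reset() is a
hypothetical helper name; whether a valid entry should come out of
reset masked is an assumption carried over from the current loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in definitions; the real ones live in the XIVE headers. */
#define EAS_VALID   (1ull << 63)
#define EAS_MASKED  (1ull << 31)

typedef struct XiveEAS {
    uint64_t w;
} XiveEAS;

/* Hypothetical helper: reset the whole EAT, preserving only VALID. */
static void eat_reset(XiveEAS *eat, uint32_t nr_irqs)
{
    uint32_t i;

    assert(eat != NULL || nr_irqs == 0);

    for (i = 0; i < nr_irqs; i++) {
        if (eat[i].w & EAS_VALID) {
            /* IRQ claimed by a device: keep it valid, but masked */
            eat[i].w = EAS_VALID | EAS_MASKED;
        } else {
            /* Unclaimed entry: drop any stale routing information */
            eat[i].w = 0;
        }
    }
}
```

This way the table content after a reboot depends only on which IRQ
numbers were claimed, not on what the previous guest configured.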

> >> +        }
> >> +    }
> >> +
> >> +    for (i = 0; i < xive->nr_ends; i++) {
> >> +        xive_end_reset(&xive->endt[i]);
> >> +    }
> >> +
> >> +    spapr_xive_mmio_map(xive);
> > 
> > You shouldn't need to re-establish MMIO mappings at reset time, only
> > during initialization.
> 
> Yes. Not for now indeed, but the patch is anticipating the switch
> of the interrupt mode at reset. I will move the mapping to the
> realize method in that patch and move it back to reset when
> we reach that part of the patchset (dual machine).

Ok.

> >> +}
> >> +
> >> +static void spapr_xive_instance_init(Object *obj)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(obj);
> >> +
> >> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
> > 
> > Yeah, embedding the source here makes sense, but it's a strong
> > indication that XiveSource should not be a SysBusDevice subclass.  I
> > really think it wants to be a TYPE_DEVICE subclass - and, in fact, I
> > think it can be object_initialize() embedded everywhere it's used.
> 
> I have changed XiveSource to be a TYPE_DEVICE.

Great.

> > I've also said elsewhere that I suspect XiveRouter should also not be a
> > SysBusDevice.  
> 
> I have changed XiveRouter to be a TYPE_DEVICE.

Great.

> > With that approach it might make sense to embed it
> > here, rather than subclassing it 
> 
> ah. why not indeed. I have to think about it. 
> 
> > (the old composition vs. inheritance debate).
> 
> he. but then the XiveRouter needs to become a QOM interface if we 
> want to be able to define XIVE table accessors for sPAPRXive. See
> the  spapr_xive_class_init() routine.

Erm.. I'm not really sure what you're getting at here.

> 
> >> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> >> +
> >> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
> >> +                      TYPE_XIVE_END_SOURCE);
> >> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
> >> +                              NULL);
> >> +}
> >> +
> >> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    XiveSource *xsrc = &xive->source;
> >> +    XiveENDSource *end_xsrc = &xive->end_source;
> >> +    Error *local_err = NULL;
> >> +
> >> +    if (!xive->nr_irqs) {
> >> +        error_setg(errp, "Number of interrupt needs to be greater 0");
> >> +        return;
> >> +    }
> >> +
> >> +    if (!xive->nr_ends) {
> >> +        error_setg(errp, "Number of ENDs needs to be greater than 0");
> >> +        return;
> >> +    }
> >> +
> >> +    /*
> >> +     * Initialize the internal sources, for IPIs and virtual devices.
> >> +     */
> >> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
> >> +                            &error_fatal);
> >> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
> >> +                                   &error_fatal);
> >> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
> >> +
> >> +    /*
> >> +     * Initialize the END ESB source
> >> +     */
> >> +    object_property_set_int(OBJECT(end_xsrc), xive->nr_irqs, "nr-ends",
> >> +                            &error_fatal);
> >> +    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
> >> +                                   &error_fatal);
> >> +    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
> >> +
> >> +    /* Set the mapping address of the END ESB pages after the source ESBs */
> >> +    xive->end_base = xive->vc_base + (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
> >> +
> >> +    /*
> >> +     * Allocate the routing tables
> >> +     */
> >> +    xive->eat = g_new0(XiveEAS, xive->nr_irqs);
> >> +    xive->endt = g_new0(XiveEND, xive->nr_ends);
> >> +
> >> +    /* TIMA initialization */
> >> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
> >> +                          "xive.tima", 4ull << TM_SHIFT);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
> >> +}
> >> +
> >> +static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +
> >> +    if (lisn >= xive->nr_irqs) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    *eas = xive->eat[lisn];
> >> +    return 0;
> >> +}
> >> +
> >> +static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +
> >> +    if (lisn >= xive->nr_irqs) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    xive->eat[lisn] = *eas;
> >> +    return 0;
> >> +}
> >> +
> >> +static int spapr_xive_get_end(XiveRouter *xrtr,
> >> +                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +
> >> +    if (end_idx >= xive->nr_ends) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    memcpy(end, &xive->endt[end_idx], sizeof(XiveEND));
> >> +    return 0;
> >> +}
> >> +
> >> +static int spapr_xive_set_end(XiveRouter *xrtr,
> >> +                              uint8_t end_blk, uint32_t end_idx, XiveEND *end)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +
> >> +    if (end_idx >= xive->nr_ends) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    memcpy(&xive->endt[end_idx], end, sizeof(XiveEND));
> >> +    return 0;
> >> +}
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive_end = {
> >> +    .name = TYPE_SPAPR_XIVE "/end",
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField []) {
> >> +        VMSTATE_UINT32(w0, XiveEND),
> >> +        VMSTATE_UINT32(w1, XiveEND),
> >> +        VMSTATE_UINT32(w2, XiveEND),
> >> +        VMSTATE_UINT32(w3, XiveEND),
> >> +        VMSTATE_UINT32(w4, XiveEND),
> >> +        VMSTATE_UINT32(w5, XiveEND),
> >> +        VMSTATE_UINT32(w6, XiveEND),
> >> +        VMSTATE_UINT32(w7, XiveEND),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive_eas = {
> >> +    .name = TYPE_SPAPR_XIVE "/eas",
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField []) {
> >> +        VMSTATE_UINT64(w, XiveEAS),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static const VMStateDescription vmstate_spapr_xive = {
> >> +    .name = TYPE_SPAPR_XIVE,
> >> +    .version_id = 1,
> >> +    .minimum_version_id = 1,
> >> +    .fields = (VMStateField[]) {
> >> +        VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> >> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
> >> +                                     vmstate_spapr_xive_eas, XiveEAS),
> >> +        VMSTATE_STRUCT_VARRAY_POINTER_UINT32(endt, sPAPRXive, nr_ends,
> >> +                                             vmstate_spapr_xive_end, XiveEND),
> >> +        VMSTATE_END_OF_LIST()
> >> +    },
> >> +};
> >> +
> >> +static Property spapr_xive_properties[] = {
> >> +    DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> >> +    DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
> >> +    DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
> >> +    DEFINE_PROP_UINT64("tm-base", sPAPRXive, tm_base, SPAPR_XIVE_TM_BASE),
> >> +    DEFINE_PROP_END_OF_LIST(),
> >> +};
> >> +
> >> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
> >> +
> >> +    dc->desc    = "sPAPR XIVE Interrupt Controller";
> >> +    dc->props   = spapr_xive_properties;
> >> +    dc->realize = spapr_xive_realize;
> >> +    dc->reset   = spapr_xive_reset;
> >> +    dc->vmsd    = &vmstate_spapr_xive;
> >> +
> >> +    xrc->get_eas = spapr_xive_get_eas;
> >> +    xrc->set_eas = spapr_xive_set_eas;
> >> +    xrc->get_end = spapr_xive_get_end;
> >> +    xrc->set_end = spapr_xive_set_end;
> >> +}
> >> +
> >> +static const TypeInfo spapr_xive_info = {
> >> +    .name = TYPE_SPAPR_XIVE,
> >> +    .parent = TYPE_XIVE_ROUTER,
> >> +    .instance_init = spapr_xive_instance_init,
> >> +    .instance_size = sizeof(sPAPRXive),
> >> +    .class_init = spapr_xive_class_init,
> >> +};
> >> +
> >> +static void spapr_xive_register_types(void)
> >> +{
> >> +    type_register_static(&spapr_xive_info);
> >> +}
> >> +
> >> +type_init(spapr_xive_register_types)
> >> +
> >> +bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +
> >> +    if (lisn >= xive->nr_irqs) {
> >> +        return false;
> >> +    }
> >> +
> >> +    xive->eat[lisn].w |= EAS_VALID;
> >> +    xive_source_irq_set(xsrc, lisn, lsi);
> >> +    return true;
> >> +}
> >> +
> >> +bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +
> >> +    if (lisn >= xive->nr_irqs) {
> >> +        return false;
> >> +    }
> >> +
> >> +    xive->eat[lisn].w &= ~EAS_VALID;
> >> +    xive_source_irq_set(xsrc, lisn, false);
> >> +    return true;
> >> +}
> >> +
> >> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +
> >> +    if (lisn >= xive->nr_irqs) {
> >> +        return NULL;
> >> +    }
> >> +
> >> +    if (!(xive->eat[lisn].w & EAS_VALID)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
> > 
> > I don't think this is a guest error - getting the qirq by number
> > should generally be something qemu code does.
> 
> Even if the IRQ was not defined by the machine ? The EAS_VALID bit is
> raised when the IRQ is enabled at the XIVE level, which means that the
> IRQ number has been claimed by some device of the machine. You cannot
> get a qirq by number for some random IRQ number. Can you?

Well, you shouldn't.  The point is that it is qemu code (specifically
the machine setup stuff) that will be calling this, and it shouldn't
be calling it with irq numbers that haven't been
enabled/claimed/whatever.

> 
> Thanks,
> 
> C. 
> 
> > 
> >> +        return NULL;
> >> +    }
> >> +
> >> +    return xive_source_qirq(xsrc, lisn);
> >> +}
> >> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> >> index 72a46ed91c31..301a8e972d91 100644
> >> --- a/hw/intc/Makefile.objs
> >> +++ b/hw/intc/Makefile.objs
> >> @@ -38,6 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
> >>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> >>  obj-$(CONFIG_XIVE) += xive.o
> >> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
> >>  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-28 16:48     ` Cédric Le Goater
@ 2018-11-29  1:00       ` David Gibson
  2018-11-29 15:27         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  1:00 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 13236 bytes --]

On Wed, Nov 28, 2018 at 05:48:32PM +0100, Cédric Le Goater wrote:
> On 11/28/18 3:39 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:04AM +0100, Cédric Le Goater wrote:
> >> The IVPE scans the O/S CAM line of the XIVE thread interrupt contexts
> >> to find a matching Notification Virtual Target (NVT) among the NVTs
> >> dispatched on the HW processor threads.
> >>
> >> On a real system, the thread interrupt contexts are updated by the
> >> hypervisor when a Virtual Processor is scheduled to run on a HW
> >> thread. Under QEMU, the model emulates the same behavior by hardwiring
> >> the NVT identifier in the thread context registers at reset.
> >>
> >> The NVT identifier used by the sPAPRXive model is the VCPU id. The END
> >> identifier is also derived from the VCPU id. A set of helpers doing
> >> the conversion between identifiers are provided for the hcalls
> >> configuring the sources and the ENDs.
> >>
> >> The model does not need a NVT table but the XiveRouter NVT operations
> >> are provided to perform some extra checks in the routing algorithm.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/spapr_xive.h |  17 +++++
> >>  include/hw/ppc/xive.h       |   3 +
> >>  hw/intc/spapr_xive.c        | 136 ++++++++++++++++++++++++++++++++++++
> >>  hw/intc/xive.c              |   9 +++
> >>  4 files changed, 165 insertions(+)
> >>
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> index 06727bd86aa9..3f65b8f485fd 100644
> >> --- a/include/hw/ppc/spapr_xive.h
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -43,4 +43,21 @@ bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> >>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> >>  qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
> >>  
> >> +/*
> >> + * sPAPR NVT and END indexing helpers
> >> + */
> >> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
> >> +                                  uint32_t nvt_idx);
> >> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
> >> +                            uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
> >> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
> >> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx);
> >> +
> >> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
> >> +                             uint32_t *out_server, uint8_t *out_prio);
> >> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> >> +                             uint8_t *out_end_blk, uint32_t *out_end_idx);
> >> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> >> +                          uint8_t *out_end_blk, uint32_t *out_end_idx);
> >> +
> >>  #endif /* PPC_SPAPR_XIVE_H */
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index e715a6c6923d..e6931ddaa83f 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -187,6 +187,8 @@ typedef struct XiveRouter {
> >>  #define XIVE_ROUTER_GET_CLASS(obj)                              \
> >>      OBJECT_GET_CLASS(XiveRouterClass, (obj), TYPE_XIVE_ROUTER)
> >>  
> >> +typedef struct XiveTCTX XiveTCTX;
> >> +
> >>  typedef struct XiveRouterClass {
> >>      SysBusDeviceClass parent;
> >>  
> >> @@ -201,6 +203,7 @@ typedef struct XiveRouterClass {
> >>                     XiveNVT *nvt);
> >>      int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
> >>                     XiveNVT *nvt);
> >> +    void (*reset_tctx)(XiveRouter *xrtr, XiveTCTX *tctx);
> >>  } XiveRouterClass;
> >>  
> >>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> index 5d038146c08e..3bf77ace11a2 100644
> >> --- a/hw/intc/spapr_xive.c
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -199,6 +199,139 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
> >>      return 0;
> >>  }
> >>  
> >> +static int spapr_xive_get_nvt(XiveRouter *xrtr,
> >> +                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +    uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
> >> +    PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
> >> +
> >> +    if (!cpu) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    /*
> >> +     * sPAPR does not maintain a NVT table. Return that the NVT is
> >> +     * valid if we have found a matching CPU
> >> +     */
> >> +    nvt->w0 = NVT_W0_VALID;
> >> +    return 0;
> >> +}
> >> +
> >> +static int spapr_xive_set_nvt(XiveRouter *xrtr,
> >> +                              uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
> >> +{
> >> +    /* no NVT table */
> >> +    return 0;
> >> +}
> >> +
> >> +/*
> >> + * When a Virtual Processor is scheduled to run on a HW thread, the
> >> + * hypervisor pushes its identifier in the OS CAM line. Under QEMU, we
> >> + * need to emulate the same behavior.
> >> + */
> >> +static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
> >> +{
> >> +    uint8_t  nvt_blk;
> >> +    uint32_t nvt_idx;
> >> +    uint32_t nvt_cam;
> >> +
> >> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
> >> +                          &nvt_blk, &nvt_idx);
> >> +
> >> +    nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
> >> +    memcpy(&tctx->regs[TM_QW1_OS + TM_WORD2], &nvt_cam, 4);
> >> +}
> >> +
> >> +/*
> >> + * The allocation of VP blocks is a complex operation in OPAL and the
> >> + * VP identifiers have a relation with the number of HW chips, the
> >> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
> >> + * controller model does not have the same constraints and can use a
> >> + * simple mapping scheme of the CPU vcpu_id
> >> + *
> >> + * These identifiers are never returned to the OS.
> >> + */
> >> +
> >> +#define SPAPR_XIVE_VP_BASE 0x400
> > 
> > 0x400 == 1024.  Could we ever have the possibility of needing to
> > consider both physical NVTs and PAPR NVTs at the same time?  
> 
> They would not be in the same CAM line: OS ring vs. PHYS ring. 

Hm.  They still inhabit the same NVT number space though, don't they?
I'm thinking about the END->NVT stage of the process here, rather than
the NVT->TCTX stage.

Oh, also, you're using "VP" here which IIUC == "NVT".  Can we
standardize on one, please?
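
For reference, the mapping under discussion boils down to a fixed
offset in the NVT index space. A minimal sketch using "NVT" naming
throughout; SPAPR_XIVE_NVT_BASE and both helper names are hypothetical,
only the 0x400 value comes from the patch.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical rename of SPAPR_XIVE_VP_BASE; same value as the patch. */
#define SPAPR_XIVE_NVT_BASE 0x400

/* vcpu_id -> NVT index: a plain offset into the NVT number space. */
static uint32_t vcpu_to_nvt_idx(uint32_t vcpu_id)
{
    return SPAPR_XIVE_NVT_BASE + vcpu_id;
}

/* NVT index -> vcpu_id: indexes below the base are not sPAPR NVTs. */
static uint32_t nvt_idx_to_vcpu(uint32_t nvt_idx)
{
    assert(nvt_idx >= SPAPR_XIVE_NVT_BASE);
    return nvt_idx - SPAPR_XIVE_NVT_BASE;
}
```

Anything below the base would be left for other users of the same
number space, which is why the size of that gap matters.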

> > If so, does this base leave enough space for the physical ones?
> 
> I only used 0x400 to map the VP identifier to the ones allocated by KVM. 
> 0x0 would be fine but to exercise the model, it's better having a different 
> base. 
> 
> >> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
> >> +                                  uint32_t nvt_idx)
> >> +{
> >> +    return nvt_idx - SPAPR_XIVE_VP_BASE;
> >> +}
> >> +
> >> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
> >> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
> > 
> > A number of these conversions will come out a bit simpler if we pass
> > the block and index around as a single word in most places.
> 
> Yes, I have to check the whole patchset first. These prototype changes
> are not too difficult in terms of code complexity but they do break
> how patches apply and PowerNV is also using the idx and blk much more
> explicitly. The block has a meaning on bare metal. So I am a bit
> reluctant to do so. I will check.

Yeah, based on your comments here and earlier, I'm not sure that's a
good idea any more either.

> >> +{
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +
> >> +    if (!cpu) {
> >> +        return -1;
> >> +    }
> >> +
> >> +    if (out_nvt_blk) {
> >> +        /* For testing purpose, we could use 0 for nvt_blk */
> >> +        *out_nvt_blk = xrtr->chip_id;
> > 
> > I don't see any point using the chip_id here, which is currently
> > always set to 0 for PAPR anyway.  If we just hardwire this to 0 it
> > removes the only use here of xrtr, which will allow some further
> > simplifications in the caller, I think.
> 
> You are right about the simplification. It was one way to exercise 
> the router model and remove any shortcuts in the indexing. I kept 
> it to be sure I was not tempted to invent new ones. I think we can
> remove it before merging. 
> 
> > 
> >> +    }
> >> +
> >> +    if (out_nvt_idx) {
> >> +        *out_nvt_idx = SPAPR_XIVE_VP_BASE + cpu->vcpu_id;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
> >> +                             uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
> > 
> > I suspect some, maybe most of these conversion functions could be static.
> 
> static inline ? 

It's in a .c file so you don't need the "inline" - the compiler can
work out whether it's a good idea to inline on its own.
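
Something along these lines, as a sketch only: file-local helpers with
no "inline", using the END indexing from the patch (8 priorities per
vcpu). The helper names are made up for the example, and the assert()
is one possible way to treat a bad priority as a caller bug rather
than a returned error.

```c
#include <assert.h>
#include <stdint.h>

/* (vcpu_id, priority) -> END index, 8 priorities per vcpu. */
static uint32_t vcpu_prio_to_end_idx(uint32_t vcpu_id, uint8_t prio)
{
    assert(prio < 8);
    return (vcpu_id << 3) + prio;
}

/* END index -> (vcpu_id, priority); either output may be NULL. */
static void end_idx_to_vcpu_prio(uint32_t end_idx,
                                 uint32_t *out_vcpu, uint8_t *out_prio)
{
    if (out_vcpu) {
        *out_vcpu = end_idx >> 3;
    }
    if (out_prio) {
        *out_prio = end_idx & 0x7;
    }
}
```

Being static, these stay out of the header and the compiler is free to
inline them where it sees fit.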

> > 
> >> +{
> >> +    return spapr_xive_cpu_to_nvt(xive, spapr_find_cpu(target), out_nvt_blk,
> >> +                                 out_nvt_idx);
> >> +}
> >> +
> >> +/*
> >> + * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
> >> + * priorities per CPU
> >> + */
> >> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
> >> +                             uint32_t *out_server, uint8_t *out_prio)
> >> +{
> >> +    if (out_server) {
> >> +        *out_server = end_idx >> 3;
> >> +    }
> >> +
> >> +    if (out_prio) {
> >> +        *out_prio = end_idx & 0x7;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> >> +                          uint8_t *out_end_blk, uint32_t *out_end_idx)
> >> +{
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +
> >> +    if (!cpu) {
> >> +        return -1;
> > 
> > Is there ever a reason this would be called with cpu == NULL?  If not
> > might as well just assert() here rather than pushing the error
> > handling back to the caller.
> 
> ok. yes.
> 
> > 
> >> +    }
> >> +
> >> +    if (out_end_blk) {
> >> +        /* For testing purpose, we could use 0 for nvt_blk */
> >> +        *out_end_blk = xrtr->chip_id;
> > 
> > Again, I don't see any point to using the chip_id, which is pretty
> > meaningless for PAPR.
> > 
> >> +    }
> >> +
> >> +    if (out_end_idx) {
> >> +        *out_end_idx = (cpu->vcpu_id << 3) + prio;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> >> +                             uint8_t *out_end_blk, uint32_t *out_end_idx)
> >> +{
> >> +    return spapr_xive_cpu_to_end(xive, spapr_find_cpu(target), prio,
> >> +                                 out_end_blk, out_end_idx);
> >> +}
> >> +
> >>  static const VMStateDescription vmstate_spapr_xive_end = {
> >>      .name = TYPE_SPAPR_XIVE "/end",
> >>      .version_id = 1,
> >> @@ -263,6 +396,9 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >>      xrc->set_eas = spapr_xive_set_eas;
> >>      xrc->get_end = spapr_xive_get_end;
> >>      xrc->set_end = spapr_xive_set_end;
> >> +    xrc->get_nvt = spapr_xive_get_nvt;
> >> +    xrc->set_nvt = spapr_xive_set_nvt;
> >> +    xrc->reset_tctx = spapr_xive_reset_tctx;
> >>  }
> >>  
> >>  static const TypeInfo spapr_xive_info = {
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index c49932d2b799..fc6ef5895e6d 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -481,6 +481,7 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
> >>  static void xive_tctx_reset(void *dev)
> >>  {
> >>      XiveTCTX *tctx = XIVE_TCTX(dev);
> >> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
> >>  
> >>      memset(tctx->regs, 0, sizeof(tctx->regs));
> >>  
> >> @@ -495,6 +496,14 @@ static void xive_tctx_reset(void *dev)
> >>       */
> >>      tctx->regs[TM_QW1_OS + TM_PIPR] =
> >>          ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
> >> +
> >> +    /*
> >> +     * QEMU sPAPR XIVE only. To let the controller model reset the OS
> >> +     * CAM line with the VP identifier.
> >> +     */
> >> +    if (xrc->reset_tctx) {
> >> +        xrc->reset_tctx(tctx->xrtr, tctx);
> >> +    }
> > 
> > AFAICT this whole function is only used from PAPR, so you can just
> > move the whole thing to the papr code and avoid the hook function.
> 
> Yes, we could add a loop on all CPUs and reset all the XiveTCTX from
> the machine or a spapr_irq->reset handler. We will need it at some
> point anyhow.
> 
> Thanks,
> 
> C.
> 
> 
> > 
> >>  }
> >>  
> >>  static void xive_tctx_realize(DeviceState *dev, Error **errp)
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [Qemu-ppc] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-28  9:35     ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
  2018-11-28 16:50       ` Cédric Le Goater
@ 2018-11-29  1:02       ` David Gibson
  2018-11-29  6:56         ` Greg Kurz
  1 sibling, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  1:02 UTC (permalink / raw)
  To: Greg Kurz; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 3170 bytes --]

On Wed, Nov 28, 2018 at 10:35:51AM +0100, Greg Kurz wrote:
> On Wed, 28 Nov 2018 13:57:14 +1100
> David Gibson <david@gibson.dropbear.id.au> wrote:
> 
> > On Fri, Nov 16, 2018 at 11:57:05AM +0100, Cédric Le Goater wrote:
> > > We will need to use xics_max_server_number() to create the sPAPRXive
> > > object modeling the interrupt controller of the machine which is
> > > created before the CPUs.
> > > 
> > > Signed-off-by: Cédric Le Goater <clg@kaod.org>  
> > 
> > My only concern here is that this moves the spapr_set_vsmt_mode()
> > before some of the sanity checks in spapr_init_cpus().  Are we certain
> > there are no edge cases that could cause badness?
> > 
> 
> The early checks in spapr_init_cpus() filter out topologies that would
> result in partially filled cores. They're only related to the rest of
> the code that creates the boot CPUs. Before commit 1a5008fc17,
> spapr_set_vsmt_mode() was even being called before spapr_init_cpus().
> The rationale to move it there was to ensure it is called before the
> first user of spapr->vsmt, which happens to be a call to
> xics_max_server_number().

Ok.

> Now that xics_max_server_number() needs to be called even earlier, I think a
> better change is to have xics_max_server_number() call spapr_set_vsmt_mode()
> if spapr->vsmt isn't set.

I'd rather not do that, but instead move it statically to where it
needs to be.  That sort of lazy/on-demand initialization can result in
really confusing behaviours depending on when a seemingly innocuous
data-returning function is called, so I consider it a code smell.
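
To make the concern concrete, here is a minimal sketch of the explicit
ordering. All names (toy_machine, toy_set_vsmt, toy_max_server_number)
are made up for illustration and the arithmetic is simplified; this is
not the real spapr code. The getter stays side-effect free and only
asserts that initialization already happened, so an ordering bug fails
loudly instead of silently picking a value.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the machine state, not sPAPRMachineState. */
struct toy_machine {
    uint32_t vsmt;      /* 0 means "not set yet" */
    uint32_t nr_cores;
};

/* Explicit initialization step, called once from the init path. */
static void toy_set_vsmt(struct toy_machine *m, uint32_t vsmt)
{
    assert(m->vsmt == 0);   /* catch accidental double initialization */
    m->vsmt = vsmt;
}

/* Pure getter: no hidden side effects, only a precondition check. */
static uint32_t toy_max_server_number(const struct toy_machine *m)
{
    assert(m->vsmt != 0);   /* fires if the init order is wrong */
    return m->nr_cores * m->vsmt;
}

/* The ordering is spelled out in exactly one place. */
static uint32_t toy_machine_init(struct toy_machine *m)
{
    toy_set_vsmt(m, 8);
    return toy_max_server_number(m);
}
```

With lazy initialization inside the getter, the first caller would decide
when the mode gets fixed; here that decision is visible in
toy_machine_init().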

> 
> > > ---
> > >  hw/ppc/spapr.c | 10 +++++-----
> > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > index 7afd1a175bf2..50cb9f9f4a02 100644
> > > --- a/hw/ppc/spapr.c
> > > +++ b/hw/ppc/spapr.c
> > > @@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
> > >          boot_cores_nr = possible_cpus->len;
> > >      }
> > >  
> > > -    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> > > -     * call xics_max_server_number() or spapr_vcpu_id().
> > > -     */
> > > -    spapr_set_vsmt_mode(spapr, &error_fatal);
> > > -
> > >      if (smc->pre_2_10_has_unused_icps) {
> > >          int i;
> > >  
> > > @@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
> > >      /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
> > >      load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
> > >  
> > > +    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> > > +     * call xics_max_server_number() or spapr_vcpu_id().
> > > +     */
> > > +    spapr_set_vsmt_mode(spapr, &error_fatal);
> > > +
> > >      /* Set up Interrupt Controller before we create the VCPUs */
> > >      smc->irq->init(spapr, &error_fatal);
> > >    
> > 
> 



-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 15/36] spapr: introduce a new machine IRQ backend for XIVE
  2018-11-28 17:16     ` Cédric Le Goater
@ 2018-11-29  1:07       ` David Gibson
  2018-11-29 15:34         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  1:07 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 12874 bytes --]

On Wed, Nov 28, 2018 at 06:16:58PM +0100, Cédric Le Goater wrote:
> On 11/28/18 4:28 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:08AM +0100, Cédric Le Goater wrote:
> >> The XIVE IRQ backend uses the same layout as the new XICS backend but
> >> covers the full range of the IRQ number space. The IRQ numbers for the
> >> CPU IPIs are allocated at the bottom of this space, below 4K, to
> >> preserve compatibility with XICS which does not use that range.
> >>
> >> This should be enough given that the maximum number of CPUs is 1024
> >> for the sPAPR machine under QEMU. For the record, the biggest POWER8
> >> or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
> >> cores, SMT8).
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/spapr.h     |   2 +
> >>  include/hw/ppc/spapr_irq.h |   7 ++-
> >>  hw/ppc/spapr.c             |   2 +-
> >>  hw/ppc/spapr_irq.c         | 119 ++++++++++++++++++++++++++++++++++++-
> >>  4 files changed, 124 insertions(+), 6 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 6279711fe8f7..1fbc2663e06c 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
> >>  typedef struct sPAPREventSource sPAPREventSource;
> >>  typedef struct sPAPRPendingHPT sPAPRPendingHPT;
> >>  typedef struct ICSState ICSState;
> >> +typedef struct sPAPRXive sPAPRXive;
> >>  
> >>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
> >>  #define SPAPR_ENTRY_POINT       0x100
> >> @@ -175,6 +176,7 @@ struct sPAPRMachineState {
> >>      const char *icp_type;
> >>      int32_t irq_map_nr;
> >>      unsigned long *irq_map;
> >> +    sPAPRXive  *xive;
> >>  
> >>      bool cmd_line_caps[SPAPR_CAP_NUM];
> >>      sPAPRCapabilities def, eff, mig;
> >> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> >> index 0e9229bf219e..c854ae527808 100644
> >> --- a/include/hw/ppc/spapr_irq.h
> >> +++ b/include/hw/ppc/spapr_irq.h
> >> @@ -13,6 +13,7 @@
> >>  /*
> >>   * IRQ range offsets per device type
> >>   */
> >> +#define SPAPR_IRQ_IPI        0x0
> >>  #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
> >>  #define SPAPR_IRQ_HOTPLUG    0x1001
> >>  #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
> >> @@ -33,7 +34,8 @@ typedef struct sPAPRIrq {
> >>      uint32_t    nr_irqs;
> >>      uint32_t    nr_msis;
> >>  
> >> -    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
> >> +    void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
> >> +                 Error **errp);
> >>      int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
> >>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
> >>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
> >> @@ -42,8 +44,9 @@ typedef struct sPAPRIrq {
> >>  
> >>  extern sPAPRIrq spapr_irq_xics;
> >>  extern sPAPRIrq spapr_irq_xics_legacy;
> >> +extern sPAPRIrq spapr_irq_xive;
> >>  
> >> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
> >> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
> > 
> > I don't see why nr_servers needs to become a parameter, since it can
> > be derived from spapr within this routine.
> 
> ok. This is true. We can use directly xics_max_server_number(spapr).
> 
> >>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
> >>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
> >>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
> >> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >> index e470efe7993c..9f8c19e56e7a 100644
> >> --- a/hw/ppc/spapr.c
> >> +++ b/hw/ppc/spapr.c
> >> @@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
> >>      spapr_set_vsmt_mode(spapr, &error_fatal);
> >>  
> >>      /* Set up Interrupt Controller before we create the VCPUs */
> >> -    spapr_irq_init(spapr, &error_fatal);
> >> +    spapr_irq_init(spapr, xics_max_server_number(spapr), &error_fatal);
> > 
> > We should rename xics_max_server_number() since it's no longer xics
> > specific.
> 
> yes.
> 
> >>      /* Set up containers for ibm,client-architecture-support negotiated options
> >>       */
> >> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> >> index bac450ffff23..2569ae1bc7f8 100644
> >> --- a/hw/ppc/spapr_irq.c
> >> +++ b/hw/ppc/spapr_irq.c
> >> @@ -12,6 +12,7 @@
> >>  #include "qemu/error-report.h"
> >>  #include "qapi/error.h"
> >>  #include "hw/ppc/spapr.h"
> >> +#include "hw/ppc/spapr_xive.h"
> >>  #include "hw/ppc/xics.h"
> >>  #include "sysemu/kvm.h"
> >>  
> >> @@ -91,7 +92,7 @@ error:
> >>  }
> >>  
> >>  static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
> >> -                                Error **errp)
> >> +                                int nr_servers, Error **errp)
> >>  {
> >>      MachineState *machine = MACHINE(spapr);
> >>      Error *local_err = NULL;
> >> @@ -204,10 +205,122 @@ sPAPRIrq spapr_irq_xics = {
> >>      .print_info  = spapr_irq_print_info_xics,
> >>  };
> >>  
> >> + /*
> >> + * XIVE IRQ backend.
> >> + */
> >> +static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
> >> +                                    const char *type_xive, int nr_irqs,
> >> +                                    int nr_servers, Error **errp)
> >> +{
> >> +    sPAPRXive *xive;
> >> +    Error *local_err = NULL;
> >> +    Object *obj;
> >> +    uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
> >> +    int i;
> >> +
> >> +    obj = object_new(type_xive);
> > 
> > What's the reason for making the type a parameter, rather than just
> > using the #define here.
> 
> KVM.

Yeah, I realised that when I'd read a few patches further on.  As I
commented there, I don't think having separate KVM/TCG subclasses is
actually a good pattern to follow.

> >> +    object_property_set_int(obj, nr_irqs, "nr-irqs", &error_abort);
> >> +    object_property_set_int(obj, nr_ends, "nr-ends", &error_abort);
> > 
> > This is still within the sPAPR code, and you have a pointer to the
> > MachineState, so I don't see why you couldn't just derive nr_irqs and
> > nr_servers from that, rather than having them passed in.
> 
> for nr_servers I agree. nr_irqs comes from the machine class and it will
> not make any sense to use the machine class in the init routine of the
> 'dual' sPAPR IRQ backend supporting both modes. See patch 34 which
> initializes both backends for the 'dual' machine.

Uh.. I guess I'll comment when I get to that patch, but I don't see
why accessing the machine class would be a problem.  If we have the
MachineState we can get to the MachineClass.
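
As a rough illustration of what I mean, assuming a toy object model
(toy_machine, toy_machine_class and toy_irq_backend are invented names,
not QOM types): the init routine takes only the machine and derives both
counts itself, so neither needs to be threaded through as a parameter.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical IRQ backend description, loosely modelled on sPAPRIrq. */
struct toy_irq_backend {
    uint32_t nr_irqs;
};

/* Hypothetical class: holds the per-machine-type backend. */
struct toy_machine_class {
    const struct toy_irq_backend *irq;
};

/* Hypothetical instance: every instance can reach its class. */
struct toy_machine {
    const struct toy_machine_class *klass;
    uint32_t max_cpus;
    uint32_t nr_irqs;   /* filled in by toy_irq_init() */
    uint32_t nr_ends;   /* filled in by toy_irq_init() */
};

/* Only the machine is passed in; everything else is derived from it. */
static void toy_irq_init(struct toy_machine *m)
{
    assert(m->klass && m->klass->irq);
    m->nr_irqs = m->klass->irq->nr_irqs;
    m->nr_ends = m->max_cpus << 3;   /* 8 priority ENDs per CPU */
}
```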

> >> +    object_property_set_bool(obj, true, "realized", &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return NULL;
> >> +    }
> >> +    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
> > 
> > Whereas the XiveSource and XiveRouter I think make more sense as
> > "device components" rather than SysBusDevice subclasses, 
> 
> Yes. I changed that.
> 
> > I think it
> > *does* make sense for the PAPR-XIVE object to be a full fledged
> > SysBusDevice.
> 
> Ah. That I didn't do but thinking of it, it makes sense as it is the
> object managing the TIMA and ESB memory region mapping for the machine. 
> 
> > And for that reason, I think it makes more sense to create it with
> > qdev_create(), which should avoid having to manually fiddle with the
> > parent bus.
> 
> OK. I will give it a try. 
> 
> >> +    xive = SPAPR_XIVE(obj);
> >> +
> >> +    /* Enable the CPU IPIs */
> >> +    for (i = 0; i < nr_servers; ++i) {
> >> +        spapr_xive_irq_enable(xive, SPAPR_IRQ_IPI + i, false);
> > 
> > This comment possibly belonged on an earlier patch.  I don't love the
> > "..._enable" name - to me that suggests something runtime rather than
> > configuration time.  A better option isn't quickly occurring to me
> > though :/.
> 
> Instead, I could call the sPAPR IRQ claim method  : 
> 
>     for (i = 0; i < nr_servers; ++i) {
> 	spapr_irq_xive.claim(spapr, SPAPR_IRQ_IPI + i, false, &local_err);
>     }
> 
> 
> What it does is set the EAS_VALID bit in the EAT (it also sets the
> LSI bit). What about:
> 	
> 	spapr_xive_irq_validate() 
> 	spapr_xive_irq_invalidate() 
> 
> or to map the sPAPR IRQ backend names :
> 
> 	spapr_xive_irq_claim() 
> 	spapr_xive_irq_free()

Let's use claim/free to match the terms spapr already uses.
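
For reference, a minimal sketch of what the renamed pair could look
like, based on your description that claiming sets the valid bit in the
EAT and records the LSI type. The table layout, the TOY_EAS_VALID bit
position and the toy_ names are assumptions for illustration only, not
the actual spapr_xive implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EAS_VALID  (1ULL << 63)   /* hypothetical bit position */
#define TOY_NR_IRQS    16

/* Hypothetical controller state: one EAS word and one LSI flag per LISN. */
struct toy_xive {
    uint64_t eat[TOY_NR_IRQS];
    bool lsi[TOY_NR_IRQS];
};

/* Claim: mark the source valid at configuration time. */
static bool toy_xive_irq_claim(struct toy_xive *x, uint32_t lisn, bool lsi)
{
    assert(x != NULL);
    if (lisn >= TOY_NR_IRQS) {
        return false;
    }
    x->eat[lisn] |= TOY_EAS_VALID;
    x->lsi[lisn] = lsi;
    return true;
}

/* Free: the inverse operation, clearing the valid bit and the LSI flag. */
static bool toy_xive_irq_free(struct toy_xive *x, uint32_t lisn)
{
    assert(x != NULL);
    if (lisn >= TOY_NR_IRQS) {
        return false;
    }
    x->eat[lisn] &= ~TOY_EAS_VALID;
    x->lsi[lisn] = false;
    return true;
}
```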


> 
> 
> > 
> >> +    }
> >> +
> >> +    return xive;
> >> +}
> >> +
> >> +static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
> >> +                                int nr_servers, Error **errp)
> >> +{
> >> +    MachineState *machine = MACHINE(spapr);
> >> +    Error *local_err = NULL;
> >> +
> >> +    /* KVM XIVE support */
> >> +    if (kvm_enabled()) {
> >> +        if (machine_kernel_irqchip_required(machine)) {
> >> +            error_setg(errp, "kernel_irqchip requested. no XIVE support");
> >> +            return;
> >> +        }
> >> +    }
> >> +
> >> +    /* QEMU XIVE support */
> >> +    spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, nr_servers,
> >> +                                    &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +}
> >> +
> >> +static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
> >> +                                Error **errp)
> >> +{
> >> +    if (!spapr_xive_irq_enable(spapr->xive, irq, lsi)) {
> >> +        error_setg(errp, "IRQ %d is invalid", irq);
> >> +        return -1;
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static void spapr_irq_free_xive(sPAPRMachineState *spapr, int irq, int num)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = irq; i < irq + num; ++i) {
> >> +        spapr_xive_irq_disable(spapr->xive, i);
> >> +    }
> >> +}
> >> +
> >> +static qemu_irq spapr_qirq_xive(sPAPRMachineState *spapr, int irq)
> >> +{
> >> +    return spapr_xive_qirq(spapr->xive, irq);
> >> +}
> >> +
> >> +static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
> >> +                                      Monitor *mon)
> >> +{
> >> +    CPUState *cs;
> >> +
> >> +    CPU_FOREACH(cs) {
> >> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> >> +
> >> +        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
> >> +    }
> >> +
> >> +    spapr_xive_pic_print_info(spapr->xive, mon);
> > 
> > Any reason the info dumping routines are split into two?
> 
> Not the same objects. Are you suggesting that we could print all the info 
> from the sPAPR XIVE model ? including the XiveTCTX. I thought of doing 
> that also. Fine for me if it's ok for you.

Ah.. I think I got xive_pic_print_info() and
xive_tctx_pic_print_info() mixed up.  Never mind.

> 
> Thanks,
> 
> C.
> 
> > 
> >> +}
> >> +
> >> +/*
> >> + * XIVE uses the full IRQ number space. Set it to 8K to be compatible
> >> + * with XICS.
> >> + */
> >> +
> >> +#define SPAPR_IRQ_XIVE_NR_IRQS     0x2000
> >> +#define SPAPR_IRQ_XIVE_NR_MSIS     (SPAPR_IRQ_XIVE_NR_IRQS - SPAPR_IRQ_MSI)
> >> +
> >> +sPAPRIrq spapr_irq_xive = {
> >> +    .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
> >> +    .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
> >> +
> >> +    .init        = spapr_irq_init_xive,
> >> +    .claim       = spapr_irq_claim_xive,
> >> +    .free        = spapr_irq_free_xive,
> >> +    .qirq        = spapr_qirq_xive,
> >> +    .print_info  = spapr_irq_print_info_xive,
> >> +};
> >> +
> >>  /*
> >>   * sPAPR IRQ frontend routines for devices
> >>   */
> >> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
> >> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp)
> >>  {
> >>      sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> >>  
> >> @@ -216,7 +329,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
> >>          spapr_irq_msi_init(spapr, smc->irq->nr_msis);
> >>      }
> >>  
> >> -    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
> >> +    smc->irq->init(spapr, smc->irq->nr_irqs, nr_servers, errp);
> >>  }
> >>  
> >>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-28 22:21     ` Cédric Le Goater
@ 2018-11-29  1:23       ` David Gibson
  2018-11-29 16:04         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  1:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 46861 bytes --]

On Wed, Nov 28, 2018 at 11:21:37PM +0100, Cédric Le Goater wrote:
> On 11/28/18 5:25 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
> >> The different XIVE virtualization structures (sources and event queues)
> >> are configured with a set of Hypervisor calls :
> >>
> >>  - H_INT_GET_SOURCE_INFO
> >>
> >>    used to obtain the address of the MMIO page of the Event State
> >>    Buffer (ESB) entry associated with the source.
> >>
> >>  - H_INT_SET_SOURCE_CONFIG
> >>
> >>    assigns a source to a "target".
> >>
> >>  - H_INT_GET_SOURCE_CONFIG
> >>
> >>    determines which "target" and "priority" are assigned to a source
> >>
> >>  - H_INT_GET_QUEUE_INFO
> >>
> >>    returns the address of the notification management page associated
> >>    with the specified "target" and "priority".
> >>
> >>  - H_INT_SET_QUEUE_CONFIG
> >>
> >>    sets or resets the event queue for a given "target" and "priority".
> >>    It is also used to set the notification configuration associated
> >>    with the queue, only unconditional notification is supported for
> >>    the moment. Reset is performed with a queue size of 0 and queueing
> >>    is disabled in that case.
> >>
> >>  - H_INT_GET_QUEUE_CONFIG
> >>
> >>    returns the queue settings for a given "target" and "priority".
> >>
> >>  - H_INT_RESET
> >>
> >>    resets all of the guest's internal interrupt structures to their
> >>    initial state, losing all configuration set via the hcalls
> >>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> >>
> >>  - H_INT_SYNC
> >>
> >>    issues a synchronisation on a source to make sure all notifications
> >>    have reached their queue.
> >>
> >> Calls that still need to be addressed :
> >>
> >>    H_INT_SET_OS_REPORTING_LINE
> >>    H_INT_GET_OS_REPORTING_LINE
> >>
> >> See the code for more documentation on each hcall.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/spapr.h      |  15 +-
> >>  include/hw/ppc/spapr_xive.h |   6 +
> >>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
> >>  hw/ppc/spapr_irq.c          |   2 +
> >>  hw/intc/Makefile.objs       |   2 +-
> >>  5 files changed, 915 insertions(+), 2 deletions(-)
> >>  create mode 100644 hw/intc/spapr_xive_hcall.c
> >>
> >> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >> index 1fbc2663e06c..8415faea7b82 100644
> >> --- a/include/hw/ppc/spapr.h
> >> +++ b/include/hw/ppc/spapr.h
> >> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
> >>  #define H_INVALIDATE_PID        0x378
> >>  #define H_REGISTER_PROC_TBL     0x37C
> >>  #define H_SIGNAL_SYS_RESET      0x380
> >> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
> >> +
> >> +#define H_INT_GET_SOURCE_INFO   0x3A8
> >> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
> >> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
> >> +#define H_INT_GET_QUEUE_INFO    0x3B4
> >> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
> >> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
> >> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
> >> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
> >> +#define H_INT_ESB               0x3C8
> >> +#define H_INT_SYNC              0x3CC
> >> +#define H_INT_RESET             0x3D0
> >> +
> >> +#define MAX_HCALL_OPCODE        H_INT_RESET
> >>  
> >>  /* The hcalls above are standardized in PAPR and implemented by pHyp
> >>   * as well.
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> index 3f65b8f485fd..418511f3dc10 100644
> >> --- a/include/hw/ppc/spapr_xive.h
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> >>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> >>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
> >>  
> >> +bool spapr_xive_priority_is_valid(uint8_t priority);
> > 
> > AFAICT this could be a local function.
> 
> the KVM model uses it also, when collecting state from the KVM device 
> to build the QEMU ENDT.
> 
> >> +
> >> +typedef struct sPAPRMachineState sPAPRMachineState;
> >> +
> >> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
> >> +
> >>  #endif /* PPC_SPAPR_XIVE_H */
> >> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
> >> new file mode 100644
> >> index 000000000000..52e4e23995f5
> >> --- /dev/null
> >> +++ b/hw/intc/spapr_xive_hcall.c
> >> @@ -0,0 +1,892 @@
> >> +/*
> >> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >> + *
> >> + * Copyright (c) 2017-2018, IBM Corporation.
> >> + *
> >> + * This code is licensed under the GPL version 2 or later. See the
> >> + * COPYING file in the top-level directory.
> >> + */
> >> +
> >> +#include "qemu/osdep.h"
> >> +#include "qemu/log.h"
> >> +#include "qapi/error.h"
> >> +#include "cpu.h"
> >> +#include "hw/ppc/fdt.h"
> >> +#include "hw/ppc/spapr.h"
> >> +#include "hw/ppc/spapr_xive.h"
> >> +#include "hw/ppc/xive_regs.h"
> >> +#include "monitor/monitor.h"
> > 
> > Fwiw, I don't think it's particularly necessary to split the hcall
> > handling out into a separate .c file.
> 
> ok. let's move it to spapr_xive then ? It might help in reducing the 
> exported functions. 

Yes, I think so.

> >> +/*
> >> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
> >> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
> >> + * available for the guest.
> > 
> > Referencing OPAL behaviour doesn't really make sense in the context of
> > PAPR.  
> 
> It's an OPAL constraint which pHyp doesn't have. So it's a QEMU/KVM 
> constraint also.

Right, I realized that a few patches on.  Maybe rephrase this to

   Linux hosts under OPAL reserve priority 7 for their own escalation
   interrupts.  So we only allow the guest to use priorities [0..6].

The point here is that we're emphasizing that this is a design
decision to make the host implementation easier, rather than a
fundamental constraint.
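
A small sketch of how that policy could be expressed, using a plain
comparison instead of the GNU case-range extension. The toy_ names and
the TOY_PRIO_RESERVED constant are made up here; the index packing
mirrors the "(vcpu_id << 3) + prio" encoding used elsewhere in the
series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_PRIO_BITS      3   /* 8 priorities per VCPU */
#define TOY_PRIO_RESERVED  7   /* kept by the host for escalations */

/* Guest-usable priorities are [0..6]; 7 and anything above are rejected. */
static bool toy_priority_is_valid(uint8_t prio)
{
    return prio < TOY_PRIO_RESERVED;
}

/* Pack a (vcpu, priority) pair into an END index. */
static uint32_t toy_end_index(uint32_t vcpu_id, uint8_t prio)
{
    assert(toy_priority_is_valid(prio));
    return (vcpu_id << TOY_PRIO_BITS) + prio;
}

/* Unpack an END index back into its server number and priority. */
static void toy_end_to_target(uint32_t end_idx, uint32_t *server,
                              uint8_t *prio)
{
    *server = end_idx >> TOY_PRIO_BITS;
    *prio = end_idx & 0x7;
}
```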

> > What I think you're getting at is that the PAPR spec only
> > allows a PAPR guest to use priorities 0..6 (or at least it will if the
> > XIVE updated spec ever gets published).  
> 
> It's not in the spec. the XIVE sPAPR spec should be frozen soon btw. 
>  
> > The fact that this allows the
> > host use 7 for escalations is a design rationale 
> > but not really relevant to the guest device itself. 
> 
> The guest should be aware of which priorities are reserved for
> the hypervisor though.
> 
> >> + */
> >> +bool spapr_xive_priority_is_valid(uint8_t priority)
> >> +{
> >> +    switch (priority) {
> >> +    case 0 ... 6:
> >> +        return true;
> >> +    case 7: /* OPAL escalation queue */
> >> +    default:
> >> +        return false;
> >> +    }
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
> >> + * real address of the MMIO page through which the Event State Buffer
> >> + * entry associated with the value of the "lisn" parameter is managed.
> >> + *
> >> + * Parameters:
> >> + * Input
> >> + * - "flags"
> >> + *       Bits 0-63 reserved
> >> + * - "lisn" is per "interrupts", "interrupt-map", or
> >> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
> >> + *       ibm,query-interrupt-source-number RTAS call, or as returned
> >> + *       by the H_ALLOCATE_VAS_WINDOW hcall
> > 
> > I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
> > to implement in kvm/qemu, or is it only of interest for PowerVM?
> 
> The hcall is part of the PAPR NX Interfaces and it returns interrupt
> numbers. I don't know if any work has been done on the topic.  

What's a "PAPR NX"?

> > Also, putting the register numbers on the inputs as well as the
> > outputs would be helpful.
> 
> yes. I will add them.
> 
> >> + *
> >> + * Output
> >> + * - R4: "flags"
> >> + *       Bits 0-59: Reserved
> >> + *       Bit 60: H_INT_ESB must be used for Event State Buffer
> >> + *               management
> >> + *       Bit 61: 1 == LSI  0 == MSI
> >> + *       Bit 62: the full function page supports trigger
> >> + *       Bit 63: Store EOI Supported
> >> + * - R5: Logical Real address of full function Event State Buffer
> >> + *       management page, -1 if ESB hcall flag is set to 1.
> > 
> > You've defined what H_INT_ESB means above, so it will be clearer if
> > you reference that by name here.
> 
> yes. 
> 
> >> + * - R6: Logical Real Address of trigger only Event State Buffer
> >> + *       management page or -1.
> >> + * - R7: Power of 2 page size for the ESB management pages returned in
> >> + *       R5 and R6.
> >> + */
> >> +
> >> +#define SPAPR_XIVE_SRC_H_INT_ESB     PPC_BIT(60) /* ESB manage with H_INT_ESB */
> >> +#define SPAPR_XIVE_SRC_LSI           PPC_BIT(61) /* Virtual LSI type */
> >> +#define SPAPR_XIVE_SRC_TRIGGER       PPC_BIT(62) /* Trigger and management
> >> +                                                    on same page */
> >> +#define SPAPR_XIVE_SRC_STORE_EOI     PPC_BIT(63) /* Store EOI support */
> > 
> > Probably makes sense to put these #defines in spapr.h since they form
> > part of the PAPR interface definition.
> 
> ok.
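
Since these flags use IBM bit numbering (bit 0 is the most significant
bit of the 64-bit register), it may help to see the resulting values. A
minimal sketch: TOY_PPC_BIT follows the usual PPC_BIT definition, and
toy_source_flags() is a made-up helper that only mirrors the flag logic
of the quoted hcall.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IBM bit numbering on a 64-bit word: bit 0 is the most significant. */
#define TOY_PPC_BIT(bit)     (0x8000000000000000ULL >> (bit))

#define TOY_SRC_H_INT_ESB    TOY_PPC_BIT(60)   /* 0x8 */
#define TOY_SRC_LSI          TOY_PPC_BIT(61)   /* 0x4 */
#define TOY_SRC_TRIGGER      TOY_PPC_BIT(62)   /* 0x2 */
#define TOY_SRC_STORE_EOI    TOY_PPC_BIT(63)   /* 0x1 */

/* Hypothetical helper computing the R4 "flags" output for one source. */
static uint64_t toy_source_flags(bool two_page, bool store_eoi, bool lsi)
{
    uint64_t flags = 0;

    if (!two_page) {
        flags |= TOY_SRC_TRIGGER;      /* single page also triggers */
    }
    if (store_eoi) {
        flags |= TOY_SRC_STORE_EOI;
    }
    if (lsi) {
        /* LSIs are forced through the H_INT_ESB hcall. */
        flags |= TOY_SRC_H_INT_ESB | TOY_SRC_LSI;
    }
    return flags;
}
```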
> 
> > 
> >> +static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
> >> +                                          sPAPRMachineState *spapr,
> >> +                                          target_ulong opcode,
> >> +                                          target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveSource *xsrc = &xive->source;
> >> +    XiveEAS eas;
> >> +    target_ulong flags  = args[0];
> >> +    target_ulong lisn   = args[1];
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (!(eas.w & EAS_VALID)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    /* All sources are emulated under the main XIVE object and share
> >> +     * the same characteristics.
> >> +     */
> >> +    args[0] = 0;
> >> +    if (!xive_source_esb_has_2page(xsrc)) {
> >> +        args[0] |= SPAPR_XIVE_SRC_TRIGGER;
> >> +    }
> >> +    if (xsrc->esb_flags & XIVE_SRC_STORE_EOI) {
> >> +        args[0] |= SPAPR_XIVE_SRC_STORE_EOI;
> >> +    }
> >> +
> >> +    /*
> >> +     * Force the use of the H_INT_ESB hcall in case of an LSI
> >> +     * interrupt. This is necessary under KVM to re-trigger the
> >> +     * interrupt if the level is still asserted
> >> +     */
> >> +    if (xive_source_irq_is_lsi(xsrc, lisn)) {
> >> +        args[0] |= SPAPR_XIVE_SRC_H_INT_ESB | SPAPR_XIVE_SRC_LSI;
> >> +    }
> >> +
> >> +    if (!(args[0] & SPAPR_XIVE_SRC_H_INT_ESB)) {
> >> +        args[1] = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn);
> >> +    } else {
> >> +        args[1] = -1;
> >> +    }
> >> +
> >> +    if (xive_source_esb_has_2page(xsrc)) {
> >> +        args[2] = xive->vc_base + xive_source_esb_page(xsrc, lisn);
> >> +    } else {
> >> +        args[2] = -1;
> >> +    }
> > 
> > Do we also need to keep this address clear in the H_INT_ESB case?
> 
> I think not, but the specs are not very clear on that topic. I will
> ask for clarification and use a -1 for now. We can not do loads on
> the trigger page so it can not be used by the H_INT_ESB hcall.
> 
> > 
> >> +    args[3] = TARGET_PAGE_SIZE;
> > 
> > That seems wrong.  
> 
> This is utterly wrong. it should be a power of 2 number ... I got
> it right under KVM though. I guess that ioremap() under Linux rounds 
> up the size to the page size in use, so, that's why it didn't blow
> up under TCG.
> 
> > TARGET_PAGE_SIZE is generally 4kiB, but won't these usually
> > actually be 64kiB?
> 
> yes. So what should I use to get a PAGE_SHIFT instead ? 

Erm, that gets a bit tricky, since qemu in a sense doesn't know the
guest's page size.

But.. don't you actually want the esb_shift here, not PAGE_SHIFT - it
could matter for the 2 page * 64kiB variant, yes?
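
To illustrate why the per-source shift is the interesting quantity, here
is a toy model of the ESB layout. It assumes a shift of 16 for a single
64kiB page per source and 17 for a pair of 64kiB pages, with the
trigger page first and the management page second; the struct and
function names are invented for this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical source description. */
struct toy_source {
    uint32_t esb_shift;   /* log2 of the per-source ESB window size */
    bool two_page;        /* separate trigger and management pages */
};

/* Offset of the first ESB page of a source. */
static uint64_t toy_esb_page(const struct toy_source *s, uint32_t lisn)
{
    return ((uint64_t)1 << s->esb_shift) * lisn;
}

/* Offset of the management page: second half of the window if two_page. */
static uint64_t toy_esb_mgmt(const struct toy_source *s, uint32_t lisn)
{
    uint64_t off = toy_esb_page(s, lisn);

    assert(s->esb_shift >= 1);
    if (s->two_page) {
        off += (uint64_t)1 << (s->esb_shift - 1);
    }
    return off;
}
```

None of these offsets depend on TARGET_PAGE_SIZE, which is why a value
derived from the source's own shift looks like the better candidate for
R7.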

> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
> >> + * Interrupt Source to a target. The Logical Interrupt Source is
> >> + * designated with the "lisn" parameter and the target is designated
> >> + * with the "target" and "priority" parameters.  Upon return from the
> >> + * hcall(), no additional interrupts will be directed to the old EQ.
> >> + *
> >> + * TODO: The old EQ should be investigated for interrupts that
> >> + * occurred prior to or during the hcall().
> > 
> > Isn't that the responsibility of the guest?
> 
> It should yes.

Right, so not a TODO for the qemu code.

> 
> > 
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-61: Reserved
> >> + *      Bit 62: set the "eisn" in the EA
> > 
> > What's the "EA"?  Do you mean the EAS?
> 
> Another XIVE acronym, EA for Event Assignment. I think we can forget
> this one and just use EAS.
>  
> > 
> >> + *      Bit 63: masks the interrupt source in the hardware interrupt
> >> + *      control structure. An interrupt masked by this mechanism will
> >> + *      be dropped, but its source state bits will still be
> >> + *      set. There is no race-free way of unmasking and restoring the
> >> + *      source. Thus this should only be used in interrupts that are
> >> + *      also masked at the source, and only in cases where the
> >> + *      interrupt is not meant to be used for a large amount of time
> >> + *      because no valid target exists for it for example
> >> + * - "lisn" is per "interrupts", "interrupt-map", or
> >> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> >> + *      ibm,query-interrupt-source-number RTAS call, or as returned by
> >> + *      the H_ALLOCATE_VAS_WINDOW hcall
> >> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> >> + *      "ibm,ppc-interrupt-gserver#s"
> >> + * - "priority" is a valid priority not in
> >> + *      "ibm,plat-res-int-priorities"
> >> + * - "eisn" is the guest EISN associated with the "lisn"
> > 
> > I don't think the EISN term has been used before in the series.  
> 
> Effective Interrupt Source Number, which is the event data enqueued
> in the OS EQ.
> 
> I'm planning on adding some more acronyms used by the sPAPR hcalls
> in this file. There are only a couple.

That would be helpful.

> > I'm guessing this is the guest-assigned global interrupt number?
> 
> yes 
> 
> >> + *
> >> + * Output:
> >> + * - None
> >> + */
> >> +
> >> +#define SPAPR_XIVE_SRC_SET_EISN PPC_BIT(62)
> >> +#define SPAPR_XIVE_SRC_MASK     PPC_BIT(63)
> >> +
> >> +static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
> >> +                                            sPAPRMachineState *spapr,
> >> +                                            target_ulong opcode,
> >> +                                            target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +    XiveEAS eas, new_eas;
> >> +    target_ulong flags    = args[0];
> >> +    target_ulong lisn     = args[1];
> >> +    target_ulong target   = args[2];
> >> +    target_ulong priority = args[3];
> >> +    target_ulong eisn     = args[4];
> >> +    uint8_t end_blk;
> >> +    uint32_t end_idx;
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags & ~(SPAPR_XIVE_SRC_SET_EISN | SPAPR_XIVE_SRC_MASK)) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (!(eas.w & EAS_VALID)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    /* priority 0xff is used to reset the EAS */
> >> +    if (priority == 0xff) {
> >> +        new_eas.w = EAS_VALID | EAS_MASKED;
> >> +        goto out;
> >> +    }
> >> +
> >> +    if (flags & SPAPR_XIVE_SRC_MASK) {
> >> +        new_eas.w = eas.w | EAS_MASKED;
> >> +    } else {
> >> +        new_eas.w = eas.w & ~EAS_MASKED;
> >> +    }
> >> +
> >> +    if (!spapr_xive_priority_is_valid(priority)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> >> +                      priority);
> >> +        return H_P4;
> >> +    }
> >> +
> >> +    /* Validate that "target" is part of the list of threads allocated
> >> +     * to the partition. For that, find the END corresponding to the
> >> +     * target.
> >> +     */
> >> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> >> +        return H_P3;
> >> +    }
> >> +
> >> +    new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
> >> +    new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
> >> +
> >> +    if (flags & SPAPR_XIVE_SRC_SET_EISN) {
> >> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
> >> +    }
> >> +
> >> +out:
> >> +    if (xive_router_set_eas(xrtr, lisn, &new_eas)) {
> >> +        return H_HARDWARE;
> >> +    }
> > 
> > As noted earlier in the series, the spapr specific code owns the
> > memory backing the EAT, so you can just access it directly rather than
> > using a method here.
> 
> Yes. I will give it a try. I wonder if I need accessors for the tables?

You'll still need the read accessor since the routing core uses that.
I don't think you need a write accessor though.
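Roughly what that split could look like, as a sketch only. All `Demo*` / `demo_*` names are hypothetical stand-ins, not the types or functions from the series: the read accessor stays because the generic routing code calls it, while the hcall writes the table it owns directly.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_IRQS   8
#define DEMO_EAS_VALID (1ull << 63)

typedef struct DemoEAS { uint64_t w; } DemoEAS;

/* Hypothetical machine-level controller that owns the EAT storage. */
typedef struct DemoSpaprXive {
    DemoEAS eat[DEMO_NR_IRQS];
} DemoSpaprXive;

/* Read accessor: still needed, the generic routing code goes through it. */
static int demo_get_eas(const DemoSpaprXive *xive, uint32_t lisn, DemoEAS *eas)
{
    if (lisn >= DEMO_NR_IRQS) {
        return -1;
    }
    *eas = xive->eat[lisn];
    return 0;
}

/* hcall side: the owner updates its own table, no write accessor involved. */
static int demo_set_source_config(DemoSpaprXive *xive, uint32_t lisn,
                                  uint64_t new_w)
{
    if (lisn >= DEMO_NR_IRQS || !(xive->eat[lisn].w & DEMO_EAS_VALID)) {
        return -1;
    }
    xive->eat[lisn].w = new_w | DEMO_EAS_VALID;
    return 0;
}
```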

> 
> > 
> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
> >> + * target/priority pair is assigned to the specified Logical Interrupt
> >> + * Source.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-63 Reserved
> >> + * - "lisn" is per "interrupts", "interrupt-map", or
> >> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> >> + *      ibm,query-interrupt-source-number RTAS call, or as
> >> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
> >> + *
> >> + * Output:
> >> + * - R4: Target to which the specified Logical Interrupt Source is
> >> + *       assigned
> >> + * - R5: Priority to which the specified Logical Interrupt Source is
> >> + *       assigned
> >> + * - R6: EISN for the specified Logical Interrupt Source (this will be
> >> + *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
> >> + */
> >> +static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
> >> +                                            sPAPRMachineState *spapr,
> >> +                                            target_ulong opcode,
> >> +                                            target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +    target_ulong flags = args[0];
> >> +    target_ulong lisn = args[1];
> >> +    XiveEAS eas;
> >> +    XiveEND end;
> >> +    uint8_t end_blk, nvt_blk;
> >> +    uint32_t end_idx, nvt_idx;
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (!(eas.w & EAS_VALID)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    end_blk = GETFIELD(EAS_END_BLOCK, eas.w);
> >> +    end_idx = GETFIELD(EAS_END_INDEX, eas.w);
> >> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> >> +        /* Not sure what to return here */
> >> +        return H_HARDWARE;
> > 
> > IIUC this indicates a bug in the PAPR specific code, not the guest, so
> > an assert() is probably the right answer.
> 
> ok
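For illustration, a sketch of the assert() variant under the assumption that the END index was derived from an already validated EAS, so a failed lookup can only be an internal bug. `demo_get_end` and `demo_read_end_w0` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_ENDS 16

typedef struct DemoEND { uint32_t w0; } DemoEND;

static DemoEND demo_ends[DEMO_NR_ENDS];

/* Hypothetical lookup: fails only for an index outside the table. */
static int demo_get_end(uint32_t idx, DemoEND *end)
{
    if (idx >= DEMO_NR_ENDS) {
        return -1;
    }
    *end = demo_ends[idx];
    return 0;
}

/*
 * The index comes from a validated EAS, so a failure here is a bug in the
 * machine code rather than a guest error: assert instead of H_HARDWARE.
 */
static uint32_t demo_read_end_w0(uint32_t end_idx)
{
    DemoEND end = { 0 };
    int ret = demo_get_end(end_idx, &end);

    assert(ret == 0);
    (void)ret;
    return end.w0;
}
```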
> 
> >> +    }
> >> +
> >> +    nvt_blk = GETFIELD(END_W6_NVT_BLOCK, end.w6);
> >> +    nvt_idx = GETFIELD(END_W6_NVT_INDEX, end.w6);
> >> +    args[0] = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
> > 
> > AIUI there's a specific END for each target & priority, so you could
> > avoid this second level lookup, 
> 
> yes 
> 
> > although I guess this might be
> > valuable if we do more complicated internal routing in the future.
> 
> I am not sure of that but I'd rather keep these converting helpers
> for the moment.

Ok.

> >> +    if (eas.w & EAS_MASKED) {
> >> +        args[1] = 0xff;
> >> +    } else {
> >> +        args[1] = GETFIELD(END_W7_F0_PRIORITY, end.w7);
> >> +    }
> >> +
> >> +    args[2] = GETFIELD(EAS_END_DATA, eas.w);
> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
> >> + * address of the notification management page associated with the
> >> + * specified target and priority.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *       Bits 0-63 Reserved
> >> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> >> + *       "ibm,ppc-interrupt-gserver#s"
> >> + * - "priority" is a valid priority not in
> >> + *       "ibm,plat-res-int-priorities"
> >> + *
> >> + * Output:
> >> + * - R4: Logical real address of notification page
> >> + * - R5: Power of 2 page size of the notification page
> >> + */
> >> +static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
> >> +                                         sPAPRMachineState *spapr,
> >> +                                         target_ulong opcode,
> >> +                                         target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveENDSource *end_xsrc = &xive->end_source;
> >> +    target_ulong flags = args[0];
> >> +    target_ulong target = args[1];
> >> +    target_ulong priority = args[2];
> >> +    XiveEND end;
> >> +    uint8_t end_blk;
> >> +    uint32_t end_idx;
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    /*
> >> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> >> +     * This is not needed when running the emulation under QEMU
> >> +     */
> >> +
> >> +    if (!spapr_xive_priority_is_valid(priority)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> >> +                      priority);
> >> +        return H_P3;
> >> +    }
> >> +
> >> +    /* Validate that "target" is part of the list of threads allocated
> >> +     * to the partition. For that, find the END corresponding to the
> >> +     * target.
> >> +     */
> >> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
> >> +        return H_HARDWARE;
> >> +    }
> >> +
> >> +    args[0] = xive->end_base + (1ull << (end_xsrc->esb_shift + 1)) * end_idx;
> >> +    if (end.w0 & END_W0_ENQUEUE) {
> >> +        args[1] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
> >> +    } else {
> >> +        args[1] = 0;
> >> +    }
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for
> >> + * a given "target" and "priority".  It is also used to set the
> >> + * notification config associated with the EQ.  An EQ size of 0 is
> >> + * used to reset the EQ config for a given target and priority. If
> >> + * resetting the EQ config, the END associated with the given "target"
> >> + * and "priority" will be changed to disable queueing.
> >> + *
> >> + * Upon return from the hcall(), no additional interrupts will be
> >> + * directed to the old EQ (if one was set). The old EQ (if one was
> >> + * set) should be investigated for interrupts that occurred prior to
> >> + * or during the hcall().
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-62: Reserved
> >> + *      Bit 63: Unconditional Notify (n) per the XIVE spec
> >> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> >> + *       "ibm,ppc-interrupt-gserver#s"
> >> + * - "priority" is a valid priority not in
> >> + *       "ibm,plat-res-int-priorities"
> >> + * - "eventQueue": The logical real address of the start of the EQ
> >> + * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
> >> + *
> >> + * Output:
> >> + * - None
> >> + */
> >> +
> >> +#define SPAPR_XIVE_END_ALWAYS_NOTIFY PPC_BIT(63)
> >> +
> >> +static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
> >> +                                           sPAPRMachineState *spapr,
> >> +                                           target_ulong opcode,
> >> +                                           target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +    target_ulong flags = args[0];
> >> +    target_ulong target = args[1];
> >> +    target_ulong priority = args[2];
> >> +    target_ulong qpage = args[3];
> >> +    target_ulong qsize = args[4];
> >> +    XiveEND end;
> >> +    uint8_t end_blk, nvt_blk;
> >> +    uint32_t end_idx, nvt_idx;
> >> +    uint32_t qdata;
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags & ~SPAPR_XIVE_END_ALWAYS_NOTIFY) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    /*
> >> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> >> +     * This is not needed when running the emulation under QEMU
> >> +     */
> >> +
> >> +    if (!spapr_xive_priority_is_valid(priority)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> >> +                      priority);
> >> +        return H_P3;
> >> +    }
> >> +
> >> +    /* Validate that "target" is part of the list of threads allocated
> >> +     * to the partition. For that, find the END corresponding to the
> >> +     * target.
> >> +     */
> >> +
> >> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> >> +        return H_HARDWARE;
> > 
> > Again, I think this indicates a qemu (spapr) code bug, so could be an assert().
> 
> ok
> 
> > 
> >> +    }
> >> +
> >> +    switch (qsize) {
> >> +    case 12:
> >> +    case 16:
> >> +    case 21:
> >> +    case 24:
> >> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
> > 
> > It just occurred to me that I haven't been looking for this across any
> > of these reviews.  Don't you need byteswaps when accessing these
> > in-memory structures?
> 
> yes this is done when some event data is enqueued in the EQ.

I'm not talking about the data in the EQ itself, but the fields in the
END (and the NVT).
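To spell out what I mean, a sketch assuming the END words are kept big-endian in memory, as a table read by hardware would be. QEMU has its own byte-order helpers for this; the `demo_*` functions below are self-contained stand-ins and `DemoEND` is a hypothetical two-word cut of the structure.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical END fragment: both words are stored big-endian. */
typedef struct DemoEND {
    uint32_t w2;
    uint32_t w3;
} DemoEND;

/* Host to big-endian, independent of the host byte order. */
static uint32_t demo_cpu_to_be32(uint32_t v)
{
    uint8_t b[4] = {
        (uint8_t)(v >> 24), (uint8_t)(v >> 16), (uint8_t)(v >> 8), (uint8_t)v
    };
    uint32_t out;

    memcpy(&out, b, sizeof(out));
    return out;
}

/* Big-endian to host, independent of the host byte order. */
static uint32_t demo_be32_to_cpu(uint32_t v)
{
    uint8_t b[4];

    memcpy(b, &v, sizeof(b));
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Store the EQ address: swap on the way into the structure. */
static void demo_end_set_qpage(DemoEND *end, uint64_t qpage)
{
    end->w2 = demo_cpu_to_be32((uint32_t)(qpage >> 32) & 0x0fffffff);
    end->w3 = demo_cpu_to_be32((uint32_t)(qpage & 0xffffffff));
}

/* Load the EQ address: swap on the way out. */
static uint64_t demo_end_get_qpage(const DemoEND *end)
{
    return ((uint64_t)(demo_be32_to_cpu(end->w2) & 0x0fffffff) << 32) |
           demo_be32_to_cpu(end->w3);
}
```

The point is that every read or write of a field goes through a conversion, not only the event data pushed into the EQ.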

> 
> > 
> >> +        end.w2 = (((uint64_t)qpage)) >> 32 & 0x0fffffff;
> >> +        end.w0 |= END_W0_ENQUEUE;
> >> +        end.w0 = SETFIELD(END_W0_QSIZE, end.w0, qsize - 12);
> >> +        break;
> >> +    case 0:
> >> +        /* reset queue and disable queueing */
> >> +        xive_end_reset(&end);
> >> +        goto out;
> >> +
> >> +    default:
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
> >> +                      qsize);
> >> +        return H_P5;
> >> +    }
> >> +
> >> +    if (qsize) {
> >> +        /*
> >> +         * Let's validate the EQ address with a read of the first EQ
> >> +         * entry. We could also check that the full queue has been
> >> +         * zeroed by the OS.
> >> +         */
> >> +        if (address_space_read(&address_space_memory, qpage,
> >> +                               MEMTXATTRS_UNSPECIFIED,
> >> +                               (uint8_t *) &qdata, sizeof(qdata))) {
> >> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
> >> +                          HWADDR_PRIx "\n", qpage);
> >> +            return H_P4;
> > 
> > Just checking the first entry doesn't seem entirely safe.  Using
> > address_space_map() and making sure the returned plen doesn't get
> > reduced below the queue size might be a better option.
> 
> ok. That was on my todo list.
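A sketch of the length check, assuming a map call that may shorten the length it is given. `demo_map` is a hypothetical stand-in that models only that behaviour, not the real memory API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_RAM_SIZE 0x20000u

static uint8_t demo_ram[DEMO_RAM_SIZE];

/*
 * Hypothetical map call: returns a host pointer and shrinks *plen when the
 * requested range runs past the end of guest RAM.
 */
static void *demo_map(uint64_t addr, uint64_t *plen)
{
    if (addr >= DEMO_RAM_SIZE) {
        *plen = 0;
        return NULL;
    }
    if (*plen > DEMO_RAM_SIZE - addr) {
        *plen = DEMO_RAM_SIZE - addr;
    }
    return &demo_ram[addr];
}

/* Accept the EQ only if the whole 2^qsize range could be mapped. */
static bool demo_eq_range_is_valid(uint64_t qpage, uint32_t qsize)
{
    uint64_t want = 1ull << qsize;
    uint64_t plen = want;
    void *p = demo_map(qpage, &plen);

    return p != NULL && plen == want;
}
```

Real code would also have to unmap the range again once the check is done.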
> 
> > 
> >> +        }
> >> +    }
> >> +
> >> +    if (spapr_xive_target_to_nvt(xive, target, &nvt_blk, &nvt_idx)) {
> >> +        return H_HARDWARE;
> > 
> > That could be caused by a bogus 'target' value, couldn't it?  
> 
> yes. It should have returned H_P2 above when spapr_xive_target_to_end() 
> is called.
> 
> > In which
> > case it a) should probably be checked earlier and b) should be
> > H_PARAMETER or similar, not H_HARDWARE, yes?
> 
> H_P2 again, maybe. It should be checked earlier.
> 
> > 
> >> +    }
> >> +
> >> +    /* Ensure the priority and target are correctly set (they will not
> >> +     * be right after allocation)
> > 
> > AIUI there's a static association from END to target in the PAPR
> > model. 
> 
> yes. 8 priorities per cpu.
> 
> > So it seems to make more sense to get that set up right at
> > initialization / reset, rather than doing it lazily when the 
> > queue is configured.
> 
> Ah. You would preconfigure the word6 and word7 then. Yes, it would
> save us some of the conversion fuss. I will look at it.
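Something like the following, as a sketch under the assumption of 8 priorities per target and one NVT per target sharing the target's index. `DemoEND` and the `demo_*` helpers are hypothetical, and the two fields stand in for the word6 and word7 contents.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NR_PRIORITIES 8

/* Hypothetical END: only the statically known routing fields. */
typedef struct DemoEND {
    uint32_t nvt_idx;
    uint8_t  priority;
} DemoEND;

/* Static association: each target owns a block of 8 consecutive ENDs. */
static uint32_t demo_end_index(uint32_t target, uint8_t priority)
{
    return target * DEMO_NR_PRIORITIES + priority;
}

/* Fill in target and priority once, at init or reset time. */
static void demo_end_table_reset(DemoEND *ends, uint32_t nr_targets)
{
    for (uint32_t t = 0; t < nr_targets; t++) {
        for (uint8_t p = 0; p < DEMO_NR_PRIORITIES; p++) {
            DemoEND *end = &ends[demo_end_index(t, p)];

            end->nvt_idx = t;   /* assumption: NVT index equals target */
            end->priority = p;
        }
    }
}
```

The queue configuration hcall then only has to touch the queue words and can leave the routing fields alone.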
> 
> >> +     */
> >> +    end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
> >> +        SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
> >> +    end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, priority);
> >> +
> >> +    if (flags & SPAPR_XIVE_END_ALWAYS_NOTIFY) {
> >> +        end.w0 |= END_W0_UCOND_NOTIFY;
> >> +    } else {
> >> +        end.w0 &= ~END_W0_UCOND_NOTIFY;
> >> +    }
> >> +
> >> +    /* The generation bit for the END starts at 1 and the END page
> >> +     * offset counter starts at 0.
> >> +     */
> >> +    end.w1 = END_W1_GENERATION | SETFIELD(END_W1_PAGE_OFF, 0ul, 0ul);
> >> +    end.w0 |= END_W0_VALID;
> >> +
> >> +    /* TODO: issue syncs required to ensure all in-flight interrupts
> >> +     * are complete on the old END */
> >> +out:
> >> +    /* Update END */
> >> +    if (xive_router_set_end(xrtr, end_blk, end_idx, &end)) {
> >> +        return H_HARDWARE;
> >> +    }
> > 
> > Again the PAPR code owns the ENDs, so it can update them directly
> > rather than going through an abstraction.
> 
> ok.
> 
> > 
> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_GET_QUEUE_CONFIG hcall() is used to get an EQ for a given
> >> + * target and priority.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-62: Reserved
> >> + *      Bit 63: Debug: Return debug data
> >> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> >> + *       "ibm,ppc-interrupt-gserver#s"
> >> + * - "priority" is a valid priority not in
> >> + *       "ibm,plat-res-int-priorities"
> >> + *
> >> + * Output:
> >> + * - R4: "flags":
> >> + *       Bits 0-61: Reserved
> >> + *       Bit 62: The value of Event Queue Generation Number (g) per
> >> + *              the XIVE spec if "Debug" = 1
> >> + *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
> >> + * - R5: The logical real address of the start of the EQ
> >> + * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
> >> + * - R7: The value of Event Queue Offset Counter per XIVE spec
> >> + *       if "Debug" = 1, else 0
> >> + *
> >> + */
> >> +
> >> +#define SPAPR_XIVE_END_DEBUG     PPC_BIT(63)
> >> +
> >> +static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
> >> +                                           sPAPRMachineState *spapr,
> >> +                                           target_ulong opcode,
> >> +                                           target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    target_ulong flags = args[0];
> >> +    target_ulong target = args[1];
> >> +    target_ulong priority = args[2];
> >> +    XiveEND end;
> >> +    uint8_t end_blk;
> >> +    uint32_t end_idx;
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags & ~SPAPR_XIVE_END_DEBUG) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    /*
> >> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> >> +     * This is not needed when running the emulation under QEMU
> >> +     */
> >> +
> >> +    if (!spapr_xive_priority_is_valid(priority)) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
> >> +                      priority);
> >> +        return H_P3;
> >> +    }
> >> +
> >> +    /* Validate that "target" is part of the list of threads allocated
> >> +     * to the partition. For that, find the END corresponding to the
> >> +     * target.
> >> +     */
> >> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
> >> +        return H_HARDWARE;
> > 
> > Again, assert() seems appropriate here.
> 
> ok
> 
> > 
> >> +    }
> >> +
> >> +    args[0] = 0;
> >> +    if (end.w0 & END_W0_UCOND_NOTIFY) {
> >> +        args[0] |= SPAPR_XIVE_END_ALWAYS_NOTIFY;
> >> +    }
> >> +
> >> +    if (end.w0 & END_W0_ENQUEUE) {
> >> +        args[1] =
> >> +            (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
> >> +        args[2] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
> >> +    } else {
> >> +        args[1] = 0;
> >> +        args[2] = 0;
> >> +    }
> >> +
> >> +    /* TODO: do we need any locking on the END ? */
> >> +    if (flags & SPAPR_XIVE_END_DEBUG) {
> >> +        /* Load the event queue generation number into the return flags */
> >> +        args[0] |= (uint64_t)GETFIELD(END_W1_GENERATION, end.w1) << 62;
> >> +
> >> +        /* Load R7 with the event queue offset counter */
> >> +        args[3] = GETFIELD(END_W1_PAGE_OFF, end.w1);
> >> +    } else {
> >> +        args[3] = 0;
> >> +    }
> >> +
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
> >> + * reporting cache line pair for the calling thread.  The reporting
> >> + * cache lines will contain the OS interrupt context when the OS
> >> + * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
> >> + * interrupt. The reporting cache lines can be reset by inputting -1
> >> + * in "reportingLine".  Issuing the CI store byte without reporting
> >> + * cache lines registered will result in the data not being accessible
> >> + * to the OS.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-63: Reserved
> >> + * - "reportingLine": The logical real address of the reporting cache
> >> + *    line pair
> >> + *
> >> + * Output:
> >> + * - None
> >> + */
> >> +static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
> >> +                                                sPAPRMachineState *spapr,
> >> +                                                target_ulong opcode,
> >> +                                                target_ulong *args)
> >> +{
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    /*
> >> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> >> +     * This is not needed when running the emulation under QEMU
> >> +     */
> >> +
> >> +    /* TODO: H_INT_SET_OS_REPORTING_LINE */
> >> +    return H_FUNCTION;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
> >> + * real address of the reporting cache line pair set for the input
> >> + * "target".  If no reporting cache line pair has been set, -1 is
> >> + * returned.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-63: Reserved
> >> + * - "target" is per "ibm,ppc-interrupt-server#s" or
> >> + *       "ibm,ppc-interrupt-gserver#s"
> >> + * - "reportingLine": The logical real address of the reporting cache
> >> + *   line pair
> >> + *
> >> + * Output:
> >> + * - R4: The logical real address of the reporting line if set, else -1
> >> + */
> >> +static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
> >> +                                                sPAPRMachineState *spapr,
> >> +                                                target_ulong opcode,
> >> +                                                target_ulong *args)
> >> +{
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    /*
> >> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> >> +     * This is not needed when running the emulation under QEMU
> >> +     */
> >> +
> >> +    /* TODO: H_INT_GET_OS_REPORTING_LINE */
> >> +    return H_FUNCTION;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_ESB hcall() is used to issue a load or store to the ESB
> >> + * page for the input "lisn".  This hcall is only supported for LISNs
> >> + * that have the ESB hcall flag set to 1 when returned from hcall()
> >> + * H_INT_GET_SOURCE_INFO.
> > 
> > Is there a reason for specifically restricting this to LISNs which
> > advertise it, rather than allowing it for anything? 
> 
> It's in the specs but I did not implement the check. So H_INT_ESB can be 
> used today by the OS for any interrupt number. Same under KVM.
> 
> But I should say so somewhere.
> 
> > Obviously using
> > the direct MMIOs will generally be a faster option when possible, but
> > I could see occasions where it might be simpler for the guest to
> > always use H_INT_ESB (e.g. for micro-guests like kvm-unit-tests).
> 
> Can't you use direct loads and stores in these guests? I haven't
> looked at how they are implemented.

It's not that you can't, but that might involve setting up mappings
and so forth which could be more trouble than using an hcall.  At the
very least they'll also need H_INT_ESB support for the irqs that
require it, so allowing it for everything avoids one code variant.

> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-62: Reserved
> >> + *      Bit 63: Store: Store=1, store operation, else load operation
> >> + * - "lisn" is per "interrupts", "interrupt-map", or
> >> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> >> + *      ibm,query-interrupt-source-number RTAS call, or as
> >> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
> >> + * - "esbOffset" is the offset into the ESB page for the load or store operation
> >> + * - "storeData" is the data to write for a store operation
> >> + *
> >> + * Output:
> >> + * - R4: The value of the load if load operation, else -1
> >> + */
> >> +
> >> +#define SPAPR_XIVE_ESB_STORE PPC_BIT(63)
> >> +
> >> +static target_ulong h_int_esb(PowerPCCPU *cpu,
> >> +                              sPAPRMachineState *spapr,
> >> +                              target_ulong opcode,
> >> +                              target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveEAS eas;
> >> +    target_ulong flags  = args[0];
> >> +    target_ulong lisn   = args[1];
> >> +    target_ulong offset = args[2];
> >> +    target_ulong data   = args[3];
> >> +    hwaddr mmio_addr;
> >> +    XiveSource *xsrc = &xive->source;
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags & ~SPAPR_XIVE_ESB_STORE) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (!(eas.w & EAS_VALID)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (offset > (1ull << xsrc->esb_shift)) {
> >> +        return H_P3;
> >> +    }
> >> +
> >> +    mmio_addr = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn) + offset;
> >> +
> >> +    if (dma_memory_rw(&address_space_memory, mmio_addr, &data, 8,
> >> +                      (flags & SPAPR_XIVE_ESB_STORE))) {
> >> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
> >> +                      HWADDR_PRIx "\n", mmio_addr);
> >> +        return H_HARDWARE;
> >> +    }
> >> +    args[0] = (flags & SPAPR_XIVE_ESB_STORE) ? -1 : data;
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_SYNC hcall() is used to issue hardware syncs that will
> >> + * ensure any in flight events for the input lisn are in the event
> >> + * queue.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-63: Reserved
> >> + * - "lisn" is per "interrupts", "interrupt-map", or
> >> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
> >> + *      ibm,query-interrupt-source-number RTAS call, or as
> >> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
> >> + *
> >> + * Output:
> >> + * - None
> >> + */
> >> +static target_ulong h_int_sync(PowerPCCPU *cpu,
> >> +                               sPAPRMachineState *spapr,
> >> +                               target_ulong opcode,
> >> +                               target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    XiveEAS eas;
> >> +    target_ulong flags = args[0];
> >> +    target_ulong lisn = args[1];
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    if (!(eas.w & EAS_VALID)) {
> >> +        return H_P2;
> >> +    }
> >> +
> >> +    /*
> >> +     * H_STATE should be returned if a H_INT_RESET is in progress.
> >> +     * This is not needed when running the emulation under QEMU
> >> +     */
> >> +
> >> +    /* This is not real hardware. Nothing to be done */
> > 
> > At least, not as long as all the XIVE operations are under the BQL.
> 
> yes.
> 
> > 
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +/*
> >> + * The H_INT_RESET hcall() is used to reset all of the partition's
> >> + * interrupt exploitation structures to their initial state.  This
> >> + * means losing all previously set interrupt state set via
> >> + * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> >> + *
> >> + * Parameters:
> >> + * Input:
> >> + * - "flags"
> >> + *      Bits 0-63: Reserved
> >> + *
> >> + * Output:
> >> + * - None
> >> + */
> >> +static target_ulong h_int_reset(PowerPCCPU *cpu,
> >> +                                sPAPRMachineState *spapr,
> >> +                                target_ulong opcode,
> >> +                                target_ulong *args)
> >> +{
> >> +    sPAPRXive *xive = spapr->xive;
> >> +    target_ulong flags   = args[0];
> >> +
> >> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> >> +        return H_FUNCTION;
> >> +    }
> >> +
> >> +    if (flags) {
> >> +        return H_PARAMETER;
> >> +    }
> >> +
> >> +    device_reset(DEVICE(xive));
> >> +    return H_SUCCESS;
> >> +}
> >> +
> >> +void spapr_xive_hcall_init(sPAPRMachineState *spapr)
> >> +{
> >> +    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
> >> +    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
> >> +    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
> >> +    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
> >> +    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
> >> +    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
> >> +    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
> >> +                             h_int_set_os_reporting_line);
> >> +    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
> >> +                             h_int_get_os_reporting_line);
> >> +    spapr_register_hypercall(H_INT_ESB, h_int_esb);
> >> +    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
> >> +    spapr_register_hypercall(H_INT_RESET, h_int_reset);
> >> +}
> >> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> >> index 2569ae1bc7f8..da6fcfaa3c52 100644
> >> --- a/hw/ppc/spapr_irq.c
> >> +++ b/hw/ppc/spapr_irq.c
> >> @@ -258,6 +258,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
> >>          error_propagate(errp, local_err);
> >>          return;
> >>      }
> >> +
> >> +    spapr_xive_hcall_init(spapr);
> >>  }
> >>  
> >>  static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
> >> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> >> index 301a8e972d91..eacd26836ebf 100644
> >> --- a/hw/intc/Makefile.objs
> >> +++ b/hw/intc/Makefile.objs
> >> @@ -38,7 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
> >>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
> >>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
> >>  obj-$(CONFIG_XIVE) += xive.o
> >> -obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
> >> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
> >>  obj-$(CONFIG_POWERNV) += xics_pnv.o
> >>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
> >>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models
  2018-11-28 22:38     ` Cédric Le Goater
@ 2018-11-29  2:59       ` David Gibson
  2018-11-29 16:06         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  2:59 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Wed, Nov 28, 2018 at 11:38:50PM +0100, Cédric Le Goater wrote:
> On 11/28/18 6:13 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:13AM +0100, Cédric Le Goater wrote:
> >> The XIVE models for the QEMU and KVM accelerators will have a lot in
> >> common. Introduce an abstract class for the source, the thread context
> >> and the interrupt controller object to handle the differences in the
> >> object initialization. These classes will also be used to define state
> >> synchronization handlers for the monitor and migration usage.
> >>
> >> This is very much like the XICS models.
> > 
> > Yeah.. so I know it's my code, but in hindsight I think making
> > separate subclasses for TCG and KVM was a mistake.  The distinction
> > between emulated and KVM version is supposed to be invisible to both
> > guest and (almost) to user, whereas a subclass usually indicates a
> > visibly different device.
> 
> So how do you want to model the KVM part? With a single object and
> kvm_enabled() sections?

Basically, yes.  In practice I think you probably want a helper called
xive_is_kvm() or something, which would evaluate to (kvm_enabled() &&
kvm_irqchip_in_kernel()).
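
A minimal sketch of such a helper, for illustration only.  In QEMU the
two predicates come from "sysemu/kvm.h"; here they are replaced by
stand-ins driven by two hypothetical flags so that the fragment
compiles on its own.  xive_is_kvm() is only the name floated above, and
xive_tima_backend() is a made-up caller showing the "if (kvm)" style:

```c
#include <stdbool.h>

/* Hypothetical flags, present only so this sketch is self-contained. */
bool fake_kvm_enabled;
bool fake_irqchip_in_kernel;

/* Stand-ins for QEMU's kvm_enabled() and kvm_irqchip_in_kernel(). */
bool kvm_enabled(void)
{
    return fake_kvm_enabled;
}

bool kvm_irqchip_in_kernel(void)
{
    return fake_irqchip_in_kernel;
}

/*
 * True only when the interrupt controller is backed by the in-kernel
 * KVM device; false under TCG or with a userspace irqchip.
 */
bool xive_is_kvm(void)
{
    return kvm_enabled() && kvm_irqchip_in_kernel();
}

/*
 * Made-up caller: a single object picks its TIMA backing at realize
 * time by branching, instead of relying on a KVM subclass.
 */
const char *xive_tima_backend(void)
{
    return xive_is_kvm() ? "ram-device mapping" : "emulated MMIO";
}
```

The point is that the choice stays an internal detail of one device
type, so neither the guest nor the user sees a different object.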

> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/spapr_xive.h |  15 +++++
> >>  include/hw/ppc/xive.h       |  30 ++++++++++
> >>  hw/intc/spapr_xive.c        |  86 +++++++++++++++++++---------
> >>  hw/intc/xive.c              | 109 +++++++++++++++++++++++++-----------
> >>  hw/ppc/spapr_irq.c          |   4 +-
> >>  5 files changed, 182 insertions(+), 62 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> index 5b3fab192d41..aca2969a09ab 100644
> >> --- a/include/hw/ppc/spapr_xive.h
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -13,6 +13,10 @@
> >>  #include "hw/sysbus.h"
> >>  #include "hw/ppc/xive.h"
> >>  
> >> +#define TYPE_SPAPR_XIVE_BASE "spapr-xive-base"
> >> +#define SPAPR_XIVE_BASE(obj) \
> >> +    OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_BASE)
> >> +
> >>  #define TYPE_SPAPR_XIVE "spapr-xive"
> >>  #define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
> >>  
> >> @@ -38,6 +42,17 @@ typedef struct sPAPRXive {
> >>      MemoryRegion  tm_mmio;
> >>  } sPAPRXive;
> >>  
> >> +#define SPAPR_XIVE_BASE_CLASS(klass) \
> >> +     OBJECT_CLASS_CHECK(sPAPRXiveClass, (klass), TYPE_SPAPR_XIVE_BASE)
> >> +#define SPAPR_XIVE_BASE_GET_CLASS(obj) \
> >> +     OBJECT_GET_CLASS(sPAPRXiveClass, (obj), TYPE_SPAPR_XIVE_BASE)
> >> +
> >> +typedef struct sPAPRXiveClass {
> >> +    XiveRouterClass parent_class;
> >> +
> >> +    DeviceRealize   parent_realize;
> >> +} sPAPRXiveClass;
> >> +
> >>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> >>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> >>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index b74eb326dcd1..281ed370121c 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -38,6 +38,10 @@ typedef struct XiveFabricClass {
> >>   * XIVE Interrupt Source
> >>   */
> >>  
> >> +#define TYPE_XIVE_SOURCE_BASE "xive-source-base"
> >> +#define XIVE_SOURCE_BASE(obj) \
> >> +    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_BASE)
> >> +
> >>  #define TYPE_XIVE_SOURCE "xive-source"
> >>  #define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
> >>  
> >> @@ -68,6 +72,18 @@ typedef struct XiveSource {
> >>      XiveFabric      *xive;
> >>  } XiveSource;
> >>  
> >> +#define XIVE_SOURCE_BASE_CLASS(klass) \
> >> +     OBJECT_CLASS_CHECK(XiveSourceClass, (klass), TYPE_XIVE_SOURCE_BASE)
> >> +#define XIVE_SOURCE_BASE_GET_CLASS(obj) \
> >> +     OBJECT_GET_CLASS(XiveSourceClass, (obj), TYPE_XIVE_SOURCE_BASE)
> >> +
> >> +typedef struct XiveSourceClass {
> >> +    SysBusDeviceClass parent_class;
> >> +
> >> +    DeviceRealize     parent_realize;
> >> +    DeviceReset       parent_reset;
> >> +} XiveSourceClass;
> >> +
> >>  /*
> >>   * ESB MMIO setting. Can be one page, for both source triggering and
> >>   * source management, or two different pages. See below for magic
> >> @@ -253,6 +269,9 @@ void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
> >>   * XIVE Thread interrupt Management (TM) context
> >>   */
> >>  
> >> +#define TYPE_XIVE_TCTX_BASE "xive-tctx-base"
> >> +#define XIVE_TCTX_BASE(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_BASE)
> >> +
> >>  #define TYPE_XIVE_TCTX "xive-tctx"
> >>  #define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
> >>  
> >> @@ -278,6 +297,17 @@ typedef struct XiveTCTX {
> >>      XiveRouter  *xrtr;
> >>  } XiveTCTX;
> >>  
> >> +#define XIVE_TCTX_BASE_CLASS(klass) \
> >> +     OBJECT_CLASS_CHECK(XiveTCTXClass, (klass), TYPE_XIVE_TCTX_BASE)
> >> +#define XIVE_TCTX_BASE_GET_CLASS(obj) \
> >> +     OBJECT_GET_CLASS(XiveTCTXClass, (obj), TYPE_XIVE_TCTX_BASE)
> >> +
> >> +typedef struct XiveTCTXClass {
> >> +    DeviceClass       parent_class;
> >> +
> >> +    DeviceRealize     parent_realize;
> >> +} XiveTCTXClass;
> >> +
> >>  /*
> >>   * XIVE Thread Interrupt Management Aera (TIMA)
> >>   */
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> index 3bf77ace11a2..ec85f7e4f88d 100644
> >> --- a/hw/intc/spapr_xive.c
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -53,9 +53,9 @@ static void spapr_xive_mmio_map(sPAPRXive *xive)
> >>      sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
> >>  }
> >>  
> >> -static void spapr_xive_reset(DeviceState *dev)
> >> +static void spapr_xive_base_reset(DeviceState *dev)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
> >>      int i;
> >>  
> >>      /* Xive Source reset is done through SysBus, it should put all
> >> @@ -76,9 +76,9 @@ static void spapr_xive_reset(DeviceState *dev)
> >>      spapr_xive_mmio_map(xive);
> >>  }
> >>  
> >> -static void spapr_xive_instance_init(Object *obj)
> >> +static void spapr_xive_base_instance_init(Object *obj)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(obj);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(obj);
> >>  
> >>      object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
> >>      object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> >> @@ -89,9 +89,9 @@ static void spapr_xive_instance_init(Object *obj)
> >>                                NULL);
> >>  }
> >>  
> >> -static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >> +static void spapr_xive_base_realize(DeviceState *dev, Error **errp)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
> >>      XiveSource *xsrc = &xive->source;
> >>      XiveENDSource *end_xsrc = &xive->end_source;
> >>      Error *local_err = NULL;
> >> @@ -142,16 +142,11 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >>       */
> >>      xive->eat = g_new0(XiveEAS, xive->nr_irqs);
> >>      xive->endt = g_new0(XiveEND, xive->nr_ends);
> >> -
> >> -    /* TIMA initialization */
> >> -    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
> >> -                          "xive.tima", 4ull << TM_SHIFT);
> >> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
> >>  }
> >>  
> >>  static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
> >>  
> >>      if (lisn >= xive->nr_irqs) {
> >>          return -1;
> >> @@ -163,7 +158,7 @@ static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >>  
> >>  static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
> >>  
> >>      if (lisn >= xive->nr_irqs) {
> >>          return -1;
> >> @@ -176,7 +171,7 @@ static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
> >>  static int spapr_xive_get_end(XiveRouter *xrtr,
> >>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
> >>  
> >>      if (end_idx >= xive->nr_ends) {
> >>          return -1;
> >> @@ -189,7 +184,7 @@ static int spapr_xive_get_end(XiveRouter *xrtr,
> >>  static int spapr_xive_set_end(XiveRouter *xrtr,
> >>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
> >>  
> >>      if (end_idx >= xive->nr_ends) {
> >>          return -1;
> >> @@ -202,7 +197,7 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
> >>  static int spapr_xive_get_nvt(XiveRouter *xrtr,
> >>                                uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
> >>  {
> >> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
> >>      uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
> >>      PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
> >>  
> >> @@ -236,7 +231,7 @@ static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
> >>      uint32_t nvt_idx;
> >>      uint32_t nvt_cam;
> >>  
> >> -    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
> >> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE_BASE(xrtr), POWERPC_CPU(tctx->cs),
> >>                            &nvt_blk, &nvt_idx);
> >>  
> >>      nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
> >> @@ -359,7 +354,7 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
> >>      },
> >>  };
> >>  
> >> -static const VMStateDescription vmstate_spapr_xive = {
> >> +static const VMStateDescription vmstate_spapr_xive_base = {
> >>      .name = TYPE_SPAPR_XIVE,
> >>      .version_id = 1,
> >>      .minimum_version_id = 1,
> >> @@ -373,7 +368,7 @@ static const VMStateDescription vmstate_spapr_xive = {
> >>      },
> >>  };
> >>  
> >> -static Property spapr_xive_properties[] = {
> >> +static Property spapr_xive_base_properties[] = {
> >>      DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
> >>      DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
> >>      DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
> >> @@ -381,16 +376,16 @@ static Property spapr_xive_properties[] = {
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> -static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >> +static void spapr_xive_base_class_init(ObjectClass *klass, void *data)
> >>  {
> >>      DeviceClass *dc = DEVICE_CLASS(klass);
> >>      XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
> >>  
> >>      dc->desc    = "sPAPR XIVE Interrupt Controller";
> >> -    dc->props   = spapr_xive_properties;
> >> -    dc->realize = spapr_xive_realize;
> >> -    dc->reset   = spapr_xive_reset;
> >> -    dc->vmsd    = &vmstate_spapr_xive;
> >> +    dc->props   = spapr_xive_base_properties;
> >> +    dc->realize = spapr_xive_base_realize;
> >> +    dc->reset   = spapr_xive_base_reset;
> >> +    dc->vmsd    = &vmstate_spapr_xive_base;
> >>  
> >>      xrc->get_eas = spapr_xive_get_eas;
> >>      xrc->set_eas = spapr_xive_set_eas;
> >> @@ -401,16 +396,55 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >>      xrc->reset_tctx = spapr_xive_reset_tctx;
> >>  }
> >>  
> >> +static const TypeInfo spapr_xive_base_info = {
> >> +    .name = TYPE_SPAPR_XIVE_BASE,
> >> +    .parent = TYPE_XIVE_ROUTER,
> >> +    .abstract = true,
> >> +    .instance_init = spapr_xive_base_instance_init,
> >> +    .instance_size = sizeof(sPAPRXive),
> >> +    .class_init = spapr_xive_base_class_init,
> >> +    .class_size = sizeof(sPAPRXiveClass),
> >> +};
> >> +
> >> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE(dev);
> >> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
> >> +    Error *local_err = NULL;
> >> +
> >> +    sxc->parent_realize(dev, &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +
> >> +    /* TIMA */
> >> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
> >> +                          "xive.tima", 4ull << TM_SHIFT);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
> >> +}
> >> +
> >> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
> >> +
> >> +    device_class_set_parent_realize(dc, spapr_xive_realize,
> >> +                                    &sxc->parent_realize);
> >> +}
> >> +
> >>  static const TypeInfo spapr_xive_info = {
> >>      .name = TYPE_SPAPR_XIVE,
> >> -    .parent = TYPE_XIVE_ROUTER,
> >> -    .instance_init = spapr_xive_instance_init,
> >> +    .parent = TYPE_SPAPR_XIVE_BASE,
> >> +    .instance_init = spapr_xive_base_instance_init,
> >>      .instance_size = sizeof(sPAPRXive),
> >>      .class_init = spapr_xive_class_init,
> >> +    .class_size = sizeof(sPAPRXiveClass),
> >>  };
> >>  
> >>  static void spapr_xive_register_types(void)
> >>  {
> >> +    type_register_static(&spapr_xive_base_info);
> >>      type_register_static(&spapr_xive_info);
> >>  }
> >>  
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 7d921023e2ee..9bb37553c9ec 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -478,9 +478,9 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
> >>      return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
> >>  }
> >>  
> >> -static void xive_tctx_reset(void *dev)
> >> +static void xive_tctx_base_reset(void *dev)
> >>  {
> >> -    XiveTCTX *tctx = XIVE_TCTX(dev);
> >> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
> >>      XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
> >>  
> >>      memset(tctx->regs, 0, sizeof(tctx->regs));
> >> @@ -506,9 +506,9 @@ static void xive_tctx_reset(void *dev)
> >>      }
> >>  }
> >>  
> >> -static void xive_tctx_realize(DeviceState *dev, Error **errp)
> >> +static void xive_tctx_base_realize(DeviceState *dev, Error **errp)
> >>  {
> >> -    XiveTCTX *tctx = XIVE_TCTX(dev);
> >> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
> >>      PowerPCCPU *cpu;
> >>      CPUPPCState *env;
> >>      Object *obj;
> >> @@ -544,15 +544,15 @@ static void xive_tctx_realize(DeviceState *dev, Error **errp)
> >>          return;
> >>      }
> >>  
> >> -    qemu_register_reset(xive_tctx_reset, dev);
> >> +    qemu_register_reset(xive_tctx_base_reset, dev);
> >>  }
> >>  
> >> -static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
> >> +static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
> >>  {
> >> -    qemu_unregister_reset(xive_tctx_reset, dev);
> >> +    qemu_unregister_reset(xive_tctx_base_reset, dev);
> >>  }
> >>  
> >> -static const VMStateDescription vmstate_xive_tctx = {
> >> +static const VMStateDescription vmstate_xive_tctx_base = {
> >>      .name = TYPE_XIVE_TCTX,
> >>      .version_id = 1,
> >>      .minimum_version_id = 1,
> >> @@ -562,21 +562,28 @@ static const VMStateDescription vmstate_xive_tctx = {
> >>      },
> >>  };
> >>  
> >> -static void xive_tctx_class_init(ObjectClass *klass, void *data)
> >> +static void xive_tctx_base_class_init(ObjectClass *klass, void *data)
> >>  {
> >>      DeviceClass *dc = DEVICE_CLASS(klass);
> >>  
> >> -    dc->realize = xive_tctx_realize;
> >> -    dc->unrealize = xive_tctx_unrealize;
> >> +    dc->realize = xive_tctx_base_realize;
> >> +    dc->unrealize = xive_tctx_base_unrealize;
> >>      dc->desc = "XIVE Interrupt Thread Context";
> >> -    dc->vmsd = &vmstate_xive_tctx;
> >> +    dc->vmsd = &vmstate_xive_tctx_base;
> >>  }
> >>  
> >> -static const TypeInfo xive_tctx_info = {
> >> -    .name          = TYPE_XIVE_TCTX,
> >> +static const TypeInfo xive_tctx_base_info = {
> >> +    .name          = TYPE_XIVE_TCTX_BASE,
> >>      .parent        = TYPE_DEVICE,
> >> +    .abstract      = true,
> >>      .instance_size = sizeof(XiveTCTX),
> >> -    .class_init    = xive_tctx_class_init,
> >> +    .class_init    = xive_tctx_base_class_init,
> >> +    .class_size    = sizeof(XiveTCTXClass),
> >> +};
> >> +
> >> +static const TypeInfo xive_tctx_info = {
> >> +    .name          = TYPE_XIVE_TCTX,
> >> +    .parent        = TYPE_XIVE_TCTX_BASE,
> >>  };
> >>  
> >>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
> >> @@ -933,9 +940,9 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
> >>      }
> >>  }
> >>  
> >> -static void xive_source_reset(DeviceState *dev)
> >> +static void xive_source_base_reset(DeviceState *dev)
> >>  {
> >> -    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
> >>  
> >>      /* Do not clear the LSI bitmap */
> >>  
> >> @@ -943,9 +950,9 @@ static void xive_source_reset(DeviceState *dev)
> >>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
> >>  }
> >>  
> >> -static void xive_source_realize(DeviceState *dev, Error **errp)
> >> +static void xive_source_base_realize(DeviceState *dev,  Error **errp)
> >>  {
> >> -    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
> >>      Object *obj;
> >>      Error *local_err = NULL;
> >>  
> >> @@ -971,21 +978,14 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
> >>          return;
> >>      }
> >>  
> >> -    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
> >> -                                     xsrc->nr_irqs);
> >> -
> >>      xsrc->status = g_malloc0(xsrc->nr_irqs);
> >>  
> >>      xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
> >>      xsrc->lsi_map_size = xsrc->nr_irqs;
> >>  
> >> -    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >> -                          &xive_source_esb_ops, xsrc, "xive.esb",
> >> -                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> >> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> >>  }
> >>  
> >> -static const VMStateDescription vmstate_xive_source = {
> >> +static const VMStateDescription vmstate_xive_source_base = {
> >>      .name = TYPE_XIVE_SOURCE,
> >>      .version_id = 1,
> >>      .minimum_version_id = 1,
> >> @@ -1001,29 +1001,68 @@ static const VMStateDescription vmstate_xive_source = {
> >>   * The default XIVE interrupt source setting for the ESB MMIOs is two
> >>   * 64k pages without Store EOI, to be in sync with KVM.
> >>   */
> >> -static Property xive_source_properties[] = {
> >> +static Property xive_source_base_properties[] = {
> >>      DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
> >>      DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
> >>      DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
> >>      DEFINE_PROP_END_OF_LIST(),
> >>  };
> >>  
> >> -static void xive_source_class_init(ObjectClass *klass, void *data)
> >> +static void xive_source_base_class_init(ObjectClass *klass, void *data)
> >>  {
> >>      DeviceClass *dc = DEVICE_CLASS(klass);
> >>  
> >>      dc->desc    = "XIVE Interrupt Source";
> >> -    dc->props   = xive_source_properties;
> >> -    dc->realize = xive_source_realize;
> >> -    dc->reset   = xive_source_reset;
> >> -    dc->vmsd    = &vmstate_xive_source;
> >> +    dc->props   = xive_source_base_properties;
> >> +    dc->realize = xive_source_base_realize;
> >> +    dc->reset   = xive_source_base_reset;
> >> +    dc->vmsd    = &vmstate_xive_source_base;
> >> +}
> >> +
> >> +static const TypeInfo xive_source_base_info = {
> >> +    .name          = TYPE_XIVE_SOURCE_BASE,
> >> +    .parent        = TYPE_SYS_BUS_DEVICE,
> >> +    .abstract      = true,
> >> +    .instance_size = sizeof(XiveSource),
> >> +    .class_init    = xive_source_base_class_init,
> >> +    .class_size    = sizeof(XiveSourceClass),
> >> +};
> >> +
> >> +static void xive_source_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    XiveSource *xsrc = XIVE_SOURCE(dev);
> >> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
> >> +    Error *local_err = NULL;
> >> +
> >> +    xsc->parent_realize(dev, &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +
> >> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc, xsrc->nr_irqs);
> >> +
> >> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
> >> +                          &xive_source_esb_ops, xsrc, "xive.esb",
> >> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
> >> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
> >> +}
> >> +
> >> +static void xive_source_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    DeviceClass *dc = DEVICE_CLASS(klass);
> >> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
> >> +
> >> +    device_class_set_parent_realize(dc, xive_source_realize,
> >> +                                    &xsc->parent_realize);
> >>  }
> >>  
> >>  static const TypeInfo xive_source_info = {
> >>      .name          = TYPE_XIVE_SOURCE,
> >> -    .parent        = TYPE_SYS_BUS_DEVICE,
> >> +    .parent        = TYPE_XIVE_SOURCE_BASE,
> >>      .instance_size = sizeof(XiveSource),
> >>      .class_init    = xive_source_class_init,
> >> +    .class_size    = sizeof(XiveSourceClass),
> >>  };
> >>  
> >>  /*
> >> @@ -1659,10 +1698,12 @@ static const TypeInfo xive_fabric_info = {
> >>  
> >>  static void xive_register_types(void)
> >>  {
> >> +    type_register_static(&xive_source_base_info);
> >>      type_register_static(&xive_source_info);
> >>      type_register_static(&xive_fabric_info);
> >>      type_register_static(&xive_router_info);
> >>      type_register_static(&xive_end_source_info);
> >> +    type_register_static(&xive_tctx_base_info);
> >>      type_register_static(&xive_tctx_info);
> >>  }
> >>  
> >> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> >> index 42e73851b174..f6e9e44d4cf9 100644
> >> --- a/hw/ppc/spapr_irq.c
> >> +++ b/hw/ppc/spapr_irq.c
> >> @@ -243,7 +243,7 @@ static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
> >>          return NULL;
> >>      }
> >>      qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
> >> -    xive = SPAPR_XIVE(obj);
> >> +    xive = SPAPR_XIVE_BASE(obj);
> >>  
> >>      /* Enable the CPU IPIs */
> >>      for (i = 0; i < nr_servers; ++i) {
> >> @@ -311,7 +311,7 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
> >>      CPU_FOREACH(cs) {
> >>          PowerPCCPU *cpu = POWERPC_CPU(cs);
> >>  
> >> -        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
> >> +        xive_tctx_pic_print_info(XIVE_TCTX_BASE(cpu->intc), mon);
> >>      }
> >>  
> >>      spapr_xive_pic_print_info(spapr->xive, mon);
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support
  2018-11-28 22:45     ` Cédric Le Goater
@ 2018-11-29  3:33       ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29  3:33 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Wed, Nov 28, 2018 at 11:45:46PM +0100, Cédric Le Goater wrote:
> On 11/28/18 6:52 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:15AM +0100, Cédric Le Goater wrote:
> >> This introduces a set of XIVE models specific to KVM which derive from
> >> the XIVE base models. The interfaces with KVM are a new capability and
> >> a new KVM device for the XIVE native exploitation interrupt mode.
> >>
> >> They handle the initialization of the TIMA and the source ESB memory
> >> regions which have a different type under KVM. These are 'ram device'
> >> memory mappings, similarly to VFIO, exposed to the guest and the
> >> associated VMAs on the host are populated dynamically with the
> >> appropriate pages using a fault handler.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > 
> > The logic here looks fine, but I think it would be better to activate
> > it with explicit if (kvm) type logic rather than using a subclass.
> 
> OK. ARM has taken a different path, the one proposed below, but it should
> be possible to use "if (kvm)" type logic. There should be less noise
> in the object design.

Yeah, it seemed like a good path when I wrote the XICS code, but
experience with that has led me to decide it wasn't a good idea.  I'm
not sure if the ARM people copied that or came up with it on their
own.

[snip]
> >> +/*
> >> + * XIVE Thread Interrupt Management context (KVM)
> >> + */
> >> +
> >> +static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
> >> +{
> >> +    sPAPRXive *xive;
> >> +    unsigned long vcpu_id;
> >> +    int ret;
> >> +
> >> +    /* Check if CPU was hot unplugged and replugged. */
> >> +    if (kvm_cpu_is_enabled(tctx->cs)) {
> >> +        return;
> >> +    }
> >> +
> >> +    vcpu_id = kvm_arch_vcpu_id(tctx->cs);
> >> +    xive = SPAPR_XIVE_KVM(tctx->xrtr);
> > 
> > Is this the first use of tctx->xrtr?
> 
> No, the second. The first is the reset_tctx() op doing the CAM reset.
> But we said that we could remove it.

And I think we can remove it here too.  We know this is PAPR specific
so we can go qdev_get_machine() -> PAPR xive object.

I normally don't like using qdev_get_machine(), but I think it's a
little less ugly than including this backlink (and it is sometimes
necessary to deal with the different abstraction boundaries between
qemu and KVM).
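
Roughly what I have in mind, as a self-contained sketch.  The types
below are cut-down stand-ins for the QEMU ones, machine_stub and
get_spapr_machine() stand in for qdev_get_machine() plus the
SPAPR_MACHINE() cast, and xive_tctx_get_spapr_xive() is a hypothetical
name.  Only the spapr->xive field is taken from the series itself:

```c
#include <stddef.h>

/* Cut-down stand-ins; the real structures carry many more fields. */
typedef struct sPAPRXive {
    unsigned nr_irqs;
} sPAPRXive;

typedef struct sPAPRMachineState {
    sPAPRXive *xive;
} sPAPRMachineState;

typedef struct XiveTCTX {
    void *cs;           /* note: no 'xrtr' backlink any more */
} XiveTCTX;

/* Hypothetical global standing in for the machine singleton. */
sPAPRMachineState *machine_stub;

/* Stand-in for SPAPR_MACHINE(qdev_get_machine()). */
sPAPRMachineState *get_spapr_machine(void)
{
    return machine_stub;
}

/*
 * PAPR-specific lookup: reach the XIVE controller through the machine
 * rather than through a pointer stored in every thread context.
 */
sPAPRXive *xive_tctx_get_spapr_xive(XiveTCTX *tctx)
{
    (void)tctx;         /* the context itself is not needed */
    return get_spapr_machine()->xive;
}
```

This only works because the KVM init path is known to run on a PAPR
machine; generic XIVE code could not make that assumption.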

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-29  0:47       ` David Gibson
@ 2018-11-29  3:39         ` Benjamin Herrenschmidt
  2018-11-29 17:51           ` Cédric Le Goater
  2018-12-03 17:05         ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: Benjamin Herrenschmidt @ 2018-11-29  3:39 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: qemu-ppc, qemu-devel

On Thu, 2018-11-29 at 11:47 +1100, David Gibson wrote:
> 
> 1) read/write accessors which take a word number
> 
> 2) A "get" accessor which copies the whole structure, but "write"
> accessor which takes a word number.  The asymmetry is a bit ugly, but
> it's the non-atomic writeback of the whole structure which I'm most
> uncomfortable with.

It shouldn't be a big deal though, there are HW facilities to access
the structures "atomically" anyway, due to the caching done by the
XIVE.

> 3) A map/unmap interface which gives you / releases a pointer to the
> "live" structure.  For powernv that would become
> address_space_map()/unmap().  For PAPR it would just be return pointer
> / no-op.
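
For comparison, the three shapes listed above could look roughly like
this for a PAPR-style END table held in QEMU memory.  This is only a
sketch: the function names, the table size and the number of words per
END are assumptions made for the example, not the interface of the
series.

```c
#include <stddef.h>
#include <stdint.h>

#define NR_ENDS   4     /* arbitrary table size for the sketch */
#define END_WORDS 8     /* assumed number of 32-bit words per END */

typedef struct XiveEND {
    uint32_t w[END_WORDS];
} XiveEND;

XiveEND endt[NR_ENDS];  /* the "live" table */

/* Option 1: read/write accessors taking a word number. */
int end_read_word(uint32_t idx, unsigned word, uint32_t *val)
{
    if (idx >= NR_ENDS || word >= END_WORDS) {
        return -1;
    }
    *val = endt[idx].w[word];
    return 0;
}

int end_write_word(uint32_t idx, unsigned word, uint32_t val)
{
    if (idx >= NR_ENDS || word >= END_WORDS) {
        return -1;
    }
    endt[idx].w[word] = val;
    return 0;
}

/* Option 2: "get" copies the whole structure; writes stay per word
 * (end_write_word() above), so nothing is written back wholesale. */
int end_get(uint32_t idx, XiveEND *out)
{
    if (idx >= NR_ENDS) {
        return -1;
    }
    *out = endt[idx];
    return 0;
}

/* Option 3: map/unmap hands out a pointer to the live entry.  With the
 * table in QEMU memory this is "return pointer" / no-op. */
XiveEND *end_map(uint32_t idx)
{
    return idx < NR_ENDS ? &endt[idx] : NULL;
}

void end_unmap(XiveEND *end)
{
    (void)end;          /* nothing to release in the PAPR case */
}
```

With option 2 a caller that modifies its copy does not touch the table
until it writes individual words back, which is the asymmetry being
discussed.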


* Re: [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM Cédric Le Goater
@ 2018-11-29  3:43   ` David Gibson
  2018-11-29 16:19     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  3:43 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:16AM +0100, Cédric Le Goater wrote:
> This extends the KVM XIVE models to handle the state synchronization
> with KVM, for the monitor usage and for the migration.
> 
> The migration priority of the XIVE interrupt controller sPAPRXive is
> raised for KVM. It operates first and orchestrates the capture
> sequence of the states of all the XIVE models. The XIVE sources are
> masked to quiesce the interrupt flow and a XIVE sync is performed to
> stabilize the OS Event Queues. The state of the ENDs is then captured
> by the XIVE interrupt controller model, sPAPRXive, and the state of
> the thread contexts by the thread interrupt presenter model,
> XiveTCTX. When done, a rollback is performed to restore the sources to
> their initial state.
> 
> The sPAPRXive 'post_load' method is called from the sPAPR machine,
> after all XIVE device states have been transferred and loaded. First,
> sPAPRXive restores the XIVE routing tables: ENDT and EAT. Next, the
> thread interrupt context registers and the source PQ bits are
> restored.
> 
> The get/set operations rely on their KVM counterpart in the host
> kernel which acts as a proxy for OPAL, the host firmware.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
> 
>  WIP:
>  
>     If migration occurs when a VCPU is 'ceded', some of the OS event
>     notification queues are mapped to the ZERO_PAGE on the receiving
>     side. As if the HW had triggered a page fault before the dirty
>     page was transferred from the source or as if we were not using
>     the correct page table.
> 
>  include/hw/ppc/spapr_xive.h     |   5 +
>  include/hw/ppc/xive.h           |   3 +
>  include/migration/vmstate.h     |   1 +
>  linux-headers/asm-powerpc/kvm.h |  33 +++
>  hw/intc/spapr_xive.c            |  32 +++
>  hw/intc/spapr_xive_kvm.c        | 494 ++++++++++++++++++++++++++++++++
>  hw/intc/xive.c                  |  46 +++
>  hw/ppc/spapr_irq.c              |   2 +-
>  8 files changed, 615 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 9c817bb7ae74..d2517c040958 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -55,12 +55,17 @@ typedef struct sPAPRXiveClass {
>      XiveRouterClass parent_class;
>  
>      DeviceRealize   parent_realize;
> +
> +    void (*synchronize_state)(sPAPRXive *xive);
> +    int  (*pre_save)(sPAPRXive *xsrc);
> +    int  (*post_load)(sPAPRXive *xsrc, int version_id);

This should go away if the KVM and non-KVM versions are in the same
object.
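
To make that concrete, here is a rough, self-contained sketch of the
shape I have in mind. Everything in it is a toy stand-in: ToyXive, the
kvm_in_use flag and the toy_* functions are hypothetical names, not
QEMU API. In the real code the flag would presumably be replaced by a
runtime check for the in-kernel irqchip.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, simplified stand-in for sPAPRXive. */
typedef struct ToyXive {
    bool kvm_in_use;     /* assumption: a runtime KVM check in real code */
    unsigned kvm_syncs;  /* counts calls into the KVM-specific helper */
    unsigned prints;     /* counts calls to the common print routine */
} ToyXive;

/* Stand-in for the KVM-specific state capture. */
static void toy_xive_kvm_synchronize_state(ToyXive *xive)
{
    xive->kvm_syncs++;
}

/*
 * Common code calls the KVM helper directly when KVM is in use.
 * No synchronize_state class hook is needed because there is only
 * one object type.
 */
static void toy_xive_pic_print_info(ToyXive *xive)
{
    assert(xive);

    if (xive->kvm_in_use) {
        toy_xive_kvm_synchronize_state(xive);
    }
    xive->prints++;
}
```

The same pattern would apply to pre_save and post_load: a direct call
guarded by the KVM check instead of an optional method on the class.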

>  } sPAPRXiveClass;
>  
>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>  qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
> +int spapr_xive_post_load(sPAPRXive *xive, int version_id);
>  
>  /*
>   * sPAPR NVT and END indexing helpers
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index 7aaf5a182cb3..c8201462d698 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -309,6 +309,9 @@ typedef struct XiveTCTXClass {
>      DeviceClass       parent_class;
>  
>      DeviceRealize     parent_realize;
> +
> +    void (*synchronize_state)(XiveTCTX *tctx);
> +    int  (*post_load)(XiveTCTX *tctx, int version_id);

.. and this too.

>  } XiveTCTXClass;
>  
>  /*
> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> index 2b501d04669a..ee2e836cc1c1 100644
> --- a/include/migration/vmstate.h
> +++ b/include/migration/vmstate.h
> @@ -154,6 +154,7 @@ typedef enum {
>      MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
>      MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
>      MIG_PRI_GICV3,              /* Must happen before the ITS */
> +    MIG_PRI_XIVE_IC,            /* Must happen before all XIVE models */

Ugh.. explicit priority / order levels are a pretty bad code smell.
Usually migration ordering can be handled by getting the object
hierarchy right.  What exactly is the problem you're addressing with
this?


>      MIG_PRI_MAX,
>  } MigrationPriority;
>  
> diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
> index f34c971491dd..9d55ade23634 100644
> --- a/linux-headers/asm-powerpc/kvm.h
> +++ b/linux-headers/asm-powerpc/kvm.h

Again, linux-headers need to be split out.

> @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
>  #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
>  
> +#define KVM_REG_PPC_NVT_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
> +
>  /* Device control API: PPC-specific devices */
>  #define KVM_DEV_MPIC_GRP_MISC		1
>  #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
> @@ -681,10 +683,41 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>  #define   KVM_DEV_XIVE_VC_BASE		3
>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> +#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> +#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>  
> +/* Layout of 64-bit eas attribute values */
> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> +
> +/* Layout of 64-bit eq attribute */
> +#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
> +#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
> +#define KVM_XIVE_EQ_SERVER_SHIFT	3
> +#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
> +
> +/* Layout of 64-bit eq attribute values */
> +struct kvm_ppc_xive_eq {
> +	__u32 flags;
> +	__u32 qsize;
> +	__u64 qpage;
> +	__u32 qtoggle;
> +	__u32 qindex;
> +};
> +
> +#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
> +#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
>  
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index ec85f7e4f88d..c5c0e063dc33 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -27,9 +27,14 @@
>  
>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>  {
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
>      int i;
>      uint32_t offset = 0;
>  
> +    if (sxc->synchronize_state) {
> +        sxc->synchronize_state(xive);
> +    }
> +
>      monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
>                     offset + xive->source.nr_irqs - 1);
>      xive_source_pic_print_info(&xive->source, offset, mon);
> @@ -354,10 +359,37 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
>      },
>  };
>  
> +static int vmstate_spapr_xive_pre_save(void *opaque)
> +{
> +    sPAPRXive *xive = SPAPR_XIVE_BASE(opaque);
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
> +
> +    if (sxc->pre_save) {
> +        return sxc->pre_save(xive);
> +    }
> +
> +    return 0;
> +}
> +
> +/* handled at the machine level */
> +int spapr_xive_post_load(sPAPRXive *xive, int version_id)
> +{
> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
> +
> +    if (sxc->post_load) {
> +        return sxc->post_load(xive, version_id);
> +    }
> +
> +    return 0;
> +}
> +
>  static const VMStateDescription vmstate_spapr_xive_base = {
>      .name = TYPE_SPAPR_XIVE,
>      .version_id = 1,
>      .minimum_version_id = 1,
> +    .pre_save = vmstate_spapr_xive_pre_save,
> +    .post_load = NULL, /* handled at the machine level */
> +    .priority = MIG_PRI_XIVE_IC,
>      .fields = (VMStateField[]) {
>          VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>          VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
> index 767f90826e43..176083c37d61 100644
> --- a/hw/intc/spapr_xive_kvm.c
> +++ b/hw/intc/spapr_xive_kvm.c
> @@ -58,6 +58,58 @@ static void kvm_cpu_enable(CPUState *cs)
>  /*
>   * XIVE Thread Interrupt Management context (KVM)
>   */
> +static void xive_tctx_kvm_set_state(XiveTCTX *tctx, Error **errp)
> +{
> +    uint64_t state[4];
> +    int ret;
> +
> +    /* word0 and word1 of the OS ring. */
> +    state[0] = *((uint64_t *) &tctx->regs[TM_QW1_OS]);
> +
> +    /* VP identifier. Only for KVM pr_debug() */
> +    state[1] = *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]);
> +
> +    ret = kvm_set_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
> +    if (ret != 0) {
> +        error_setg_errno(errp, errno,
> +                         "Could not restore KVM XIVE CPU %ld state",
> +                         kvm_arch_vcpu_id(tctx->cs));
> +    }
> +}
> +
> +static void xive_tctx_kvm_get_state(XiveTCTX *tctx, Error **errp)
> +{
> +    uint64_t state[4] = { 0 };
> +    int ret;
> +
> +    ret = kvm_get_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
> +    if (ret != 0) {
> +        error_setg_errno(errp, errno,
> +                         "Could not capture KVM XIVE CPU %ld state",
> +                         kvm_arch_vcpu_id(tctx->cs));
> +        return;
> +    }
> +
> +    /* word0 and word1 of the OS ring. */
> +    *((uint64_t *) &tctx->regs[TM_QW1_OS]) = state[0];
> +
> +    /*
> +     * KVM also returns word2 containing the VP CAM line value which
> +     * is interesting to print out the VP identifier in the QEMU
> +     * monitor. No need to restore it.
> +     */
> +    *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]) = state[1];
> +}
> +
> +static void xive_tctx_kvm_do_synchronize_state(CPUState *cpu,
> +                                              run_on_cpu_data arg)
> +{
> +    xive_tctx_kvm_get_state(arg.host_ptr, &error_fatal);
> +}
> +
> +static void xive_tctx_kvm_synchronize_state(XiveTCTX *tctx)
> +{
> +    run_on_cpu(tctx->cs, xive_tctx_kvm_do_synchronize_state,
> +               RUN_ON_CPU_HOST_PTR(tctx));
> +}
>  
>  static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
>  {
> @@ -112,6 +164,8 @@ static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
>  
>      device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
>                                      &xtc->parent_realize);
> +
> +    xtc->synchronize_state = xive_tctx_kvm_synchronize_state;
>  }
>  
>  static const TypeInfo xive_tctx_kvm_info = {
> @@ -166,6 +220,34 @@ static void xive_source_kvm_reset(DeviceState *dev)
>      xive_source_kvm_init(xsrc, &error_fatal);
>  }
>  
> +/*
> + * This is used to perform the magic loads on the ESB pages, described
> + * in xive.h.
> + */
> +static uint8_t xive_esb_read(XiveSource *xsrc, int srcno, uint32_t offset)
> +{
> +    unsigned long addr = (unsigned long) xsrc->esb_mmap +
> +        xive_source_esb_mgmt(xsrc, srcno) + offset;
> +
> +    /* Prevent the compiler from optimizing away the load */
> +    volatile uint64_t value = *((uint64_t *) addr);
> +
> +    return be64_to_cpu(value) & 0x3;
> +}
> +
> +static void xive_source_kvm_get_state(XiveSource *xsrc)
> +{
> +    int i;
> +
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        /* Perform a load without side effect to retrieve the PQ bits */
> +        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_GET);
> +
> +        /* and save PQ locally */
> +        xive_source_esb_set(xsrc, i, pq);
> +    }
> +}
> +
>  static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
>  {
>      XiveSource *xsrc = opaque;
> @@ -295,6 +377,414 @@ static const TypeInfo xive_source_kvm_info = {
>  /*
>   * sPAPR XIVE Router (KVM)
>   */
> +static int spapr_xive_kvm_set_eq_state(sPAPRXive *xive, CPUState *cs,
> +                                       Error **errp)
> +{
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
> +    int ret;
> +    int i;
> +
> +    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
> +        Error *local_err = NULL;
> +        XiveEND end;
> +        uint8_t end_blk;
> +        uint32_t end_idx;
> +        struct kvm_ppc_xive_eq kvm_eq = { 0 };
> +        uint64_t kvm_eq_idx;
> +
> +        if (!spapr_xive_priority_is_valid(i)) {
> +            continue;
> +        }
> +
> +        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
> +
> +        ret = xive_router_get_end(xrtr, end_blk, end_idx, &end);
> +        if (ret) {
> +            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
> +                       vcpu_id, i);
> +            return ret;
> +        }
> +
> +        if (!(end.w0 & END_W0_VALID)) {
> +            continue;
> +        }
> +
> +        /* Build the KVM state from the local END structure */
> +        kvm_eq.flags   = KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
> +        kvm_eq.qsize   = GETFIELD(END_W0_QSIZE, end.w0) + 12;
> +        kvm_eq.qpage   = (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
> +        kvm_eq.qtoggle = GETFIELD(END_W1_GENERATION, end.w1);
> +        kvm_eq.qindex  = GETFIELD(END_W1_PAGE_OFF, end.w1);
> +
> +        /* Encode the tuple (server, prio) as a KVM EQ index */
> +        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
> +            KVM_XIVE_EQ_PRIORITY_MASK;
> +        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
> +            KVM_XIVE_EQ_SERVER_MASK;
> +
> +        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
> +                                &kvm_eq, true, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static int spapr_xive_kvm_get_eq_state(sPAPRXive *xive, CPUState *cs,
> +                                       Error **errp)
> +{
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
> +    int ret;
> +    int i;
> +
> +    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
> +        Error *local_err = NULL;
> +        struct kvm_ppc_xive_eq kvm_eq = { 0 };
> +        uint64_t kvm_eq_idx;
> +        XiveEND end = { 0 };
> +        uint8_t end_blk, nvt_blk;
> +        uint32_t end_idx, nvt_idx;
> +
> +        /* Skip priorities reserved for the hypervisor */
> +        if (!spapr_xive_priority_is_valid(i)) {
> +            continue;
> +        }
> +
> +        /* Encode the tuple (server, prio) as a KVM EQ index */
> +        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
> +            KVM_XIVE_EQ_PRIORITY_MASK;
> +        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
> +            KVM_XIVE_EQ_SERVER_MASK;
> +
> +        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
> +                                &kvm_eq, false, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return ret;
> +        }
> +
> +        if (!(kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED)) {
> +            continue;
> +        }
> +
> +        /* Update the local END structure with the KVM input */
> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED) {
> +                end.w0 |= END_W0_VALID | END_W0_ENQUEUE;
> +        }
> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY) {
> +                end.w0 |= END_W0_UCOND_NOTIFY;
> +        }
> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ESCALATE) {
> +                end.w0 |= END_W0_ESCALATE_CTL;
> +        }
> +        end.w0 |= SETFIELD(END_W0_QSIZE, 0ul, kvm_eq.qsize - 12);
> +
> +        end.w1 = SETFIELD(END_W1_GENERATION, 0ul, kvm_eq.qtoggle) |
> +            SETFIELD(END_W1_PAGE_OFF, 0ul, kvm_eq.qindex);
> +        end.w2 = (kvm_eq.qpage >> 32) & 0x0fffffff;
> +        end.w3 = kvm_eq.qpage & 0xffffffff;
> +        end.w4 = 0;
> +        end.w5 = 0;
> +
> +        ret = spapr_xive_cpu_to_nvt(xive, POWERPC_CPU(cs), &nvt_blk, &nvt_idx);
> +        if (ret) {
> +            error_setg(errp, "XIVE: No NVT for CPU %ld", vcpu_id);
> +            return ret;
> +        }
> +
> +        end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
> +            SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
> +        end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, i);
> +
> +        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
> +
> +        ret = xive_router_set_end(xrtr, end_blk, end_idx, &end);
> +        if (ret) {
> +            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
> +                       vcpu_id, i);
> +            return ret;
> +        }
> +    }
> +
> +    return 0;
> +}
> +
> +static void spapr_xive_kvm_set_eas_state(sPAPRXive *xive, Error **errp)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    int i;
> +
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        XiveEAS *eas = &xive->eat[i];
> +        uint32_t end_idx;
> +        uint32_t end_blk;
> +        uint32_t eisn;
> +        uint8_t priority;
> +        uint32_t server;
> +        uint64_t kvm_eas;
> +        Error *local_err = NULL;
> +
> +        /* No need to set MASKED EAS, this is the default state after reset */
> +        if (!(eas->w & EAS_VALID) || eas->w & EAS_MASKED) {
> +            continue;
> +        }
> +
> +        end_idx = GETFIELD(EAS_END_INDEX, eas->w);
> +        end_blk = GETFIELD(EAS_END_BLOCK, eas->w);
> +        eisn = GETFIELD(EAS_END_DATA, eas->w);
> +
> +        spapr_xive_end_to_target(xive, end_blk, end_idx, &server, &priority);
> +
> +        kvm_eas = priority << KVM_XIVE_EAS_PRIORITY_SHIFT &
> +            KVM_XIVE_EAS_PRIORITY_MASK;
> +        kvm_eas |= server << KVM_XIVE_EAS_SERVER_SHIFT &
> +            KVM_XIVE_EAS_SERVER_MASK;
> +        kvm_eas |= ((uint64_t)eisn << KVM_XIVE_EAS_EISN_SHIFT) &
> +            KVM_XIVE_EAS_EISN_MASK;
> +
> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, true,
> +                          &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +    }
> +}
> +
> +static void spapr_xive_kvm_get_eas_state(sPAPRXive *xive, Error **errp)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    int i;
> +
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        XiveEAS *eas = &xive->eat[i];
> +        XiveEAS new_eas;
> +        uint64_t kvm_eas;
> +        uint8_t priority;
> +        uint32_t server;
> +        uint32_t end_idx;
> +        uint8_t end_blk;
> +        uint32_t eisn;
> +        Error *local_err = NULL;
> +
> +        if (!(eas->w & EAS_VALID)) {
> +            continue;
> +        }
> +
> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, false,
> +                          &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +
> +        priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
> +            KVM_XIVE_EAS_PRIORITY_SHIFT;
> +        server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
> +            KVM_XIVE_EAS_SERVER_SHIFT;
> +        eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
> +
> +        if (spapr_xive_target_to_end(xive, server, priority, &end_blk,
> +                                     &end_idx)) {
> +            error_setg(errp, "XIVE: invalid tuple CPU %d priority %d", server,
> +                       priority);
> +            return;
> +        }
> +
> +        new_eas.w = EAS_VALID;
> +        if (kvm_eas & KVM_XIVE_EAS_MASK_MASK) {
> +            new_eas.w |= EAS_MASKED;
> +        }
> +
> +        new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
> +        new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
> +
> +        *eas = new_eas;
> +    }
> +}
> +
> +static void spapr_xive_kvm_sync_all(sPAPRXive *xive, Error **errp)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    Error *local_err = NULL;
> +    int i;
> +
> +    /* Sync the KVM source. This reaches the XIVE HW through OPAL */
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        XiveEAS *eas = &xive->eat[i];
> +
> +        if (!(eas->w & EAS_VALID)) {
> +            continue;
> +        }
> +
> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SYNC, i, NULL, true,
> +                          &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +    }
> +}
> +
> +/*
> + * The sPAPRXive KVM model migration priority is higher to make sure

Higher than what?

> + * its 'pre_save' method runs before all the other XIVE models. It

If the other XIVE components are children of sPAPRXive (which I think
they are or could be), then I believe the parent object's pre_save
will automatically be called first.

> + * orchestrates the capture sequence of the XIVE states in the
> + * following order:
> + *
> + *   1. mask all the sources by setting PQ=01, which returns the
> + *      previous value and save it.
> + *   2. sync the sources in KVM to stabilize all the queues
> + *      sync the ENDs to make sure END -> VP is fully completed
> + *   3. dump the EAS table
> + *   4. dump the END table
> + *   5. dump the thread context (IPB)
> + *
> + *  Rollback to restore the current configuration of the sources



> + */
> +static int spapr_xive_kvm_pre_save(sPAPRXive *xive)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    Error *local_err = NULL;
> +    CPUState *cs;
> +    int i;
> +    int ret = 0;
> +
> +    /* Quiesce the sources, to stop the flow of event notifications */
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        /*
> +         * Mask and save the ESB PQs locally in the XiveSource object.
> +         */
> +        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_01);
> +        xive_source_esb_set(xsrc, i, pq);
> +    }
> +
> +    /* Sync the sources in KVM */
> +    spapr_xive_kvm_sync_all(xive, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        goto out;
> +    }
> +
> +    /* Grab the EAT (could be done earlier ?) */
> +    spapr_xive_kvm_get_eas_state(xive, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        goto out;
> +    }
> +
> +    /*
> +     * Grab the ENDs. The EQ index and the toggle bit are what we want
> +     * to capture
> +     */
> +    CPU_FOREACH(cs) {
> +        spapr_xive_kvm_get_eq_state(xive, cs, &local_err);
> +        if (local_err) {
> +            error_report_err(local_err);
> +            goto out;
> +        }
> +    }
> +
> +    /* Capture the thread interrupt contexts */
> +    CPU_FOREACH(cs) {
> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +        /* TODO: Check if we need to use under run_on_cpu() ? */
> +        xive_tctx_kvm_get_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
> +        if (local_err) {
> +            error_report_err(local_err);
> +            goto out;
> +        }
> +    }
> +
> +    /* All done. */
> +
> +out:
> +    /* Restore the sources to their initial state */
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        uint8_t pq = xive_source_esb_get(xsrc, i);
> +        if (xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8)) != 0x1) {
> +            error_report("XIVE: IRQ %d has an invalid state", i);
> +        }
> +    }
> +
> +    /*
> +     * The XiveSource and the XiveTCTX states will be collected by
> +     * their respective vmstate handlers afterwards.
> +     */
> +    return ret;
> +}
> +
> +/*
> + * The sPAPRXive 'post_load' method is called by the sPAPR machine,
> + * after all XIVE device states have been transferred and loaded.
> + *
> + * All should be in place when the VCPUs resume execution.
> + */
> +static int spapr_xive_kvm_post_load(sPAPRXive *xive, int version_id)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    Error *local_err = NULL;
> +    CPUState *cs;
> +    int i;
> +
> +    /* Set the ENDs first. The targeting depends on it. */
> +    CPU_FOREACH(cs) {
> +        spapr_xive_kvm_set_eq_state(xive, cs, &local_err);
> +        if (local_err) {
> +            error_report_err(local_err);
> +            return -1;
> +        }
> +    }
> +
> +    /* Restore the targeting, if any */
> +    spapr_xive_kvm_set_eas_state(xive, &local_err);
> +    if (local_err) {
> +        error_report_err(local_err);
> +        return -1;
> +    }
> +
> +    /* Restore the thread interrupt contexts */
> +    CPU_FOREACH(cs) {
> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +        xive_tctx_kvm_set_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
> +        if (local_err) {
> +            error_report_err(local_err);
> +            return -1;
> +        }
> +    }
> +
> +    /*
> +     * Get the saved state from the XiveSource model and restore the
> +     * PQ bits
> +     */
> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> +        uint8_t pq = xive_source_esb_get(xsrc, i);
> +        xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8));
> +    }
> +    return 0;
> +}
> +
> +static void spapr_xive_kvm_synchronize_state(sPAPRXive *xive)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    CPUState *cs;
> +
> +    xive_source_kvm_get_state(xsrc);
> +
> +    spapr_xive_kvm_get_eas_state(xive, &error_fatal);
> +
> +    CPU_FOREACH(cs) {
> +        spapr_xive_kvm_get_eq_state(xive, cs, &error_fatal);
> +    }
> +}
>  
>  static void spapr_xive_kvm_instance_init(Object *obj)
>  {
> @@ -409,6 +899,10 @@ static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
>  
>      dc->desc = "sPAPR XIVE KVM Interrupt Controller";
>      dc->unrealize = spapr_xive_kvm_unrealize;
> +
> +    sxc->synchronize_state = spapr_xive_kvm_synchronize_state;
> +    sxc->pre_save = spapr_xive_kvm_pre_save;
> +    sxc->post_load = spapr_xive_kvm_post_load;
>  }
>  
>  static const TypeInfo spapr_xive_kvm_info = {
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index 9bb37553c9ec..c9aedecc8216 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -438,9 +438,14 @@ static const struct {
>  
>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
>  {
> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
>      int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
>      int i;
>  
> +    if (xtc->synchronize_state) {
> +        xtc->synchronize_state(tctx);
> +    }
> +
>      monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
>                     "  W2\n", cpu_index);
>  
> @@ -552,10 +557,23 @@ static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
>      qemu_unregister_reset(xive_tctx_base_reset, dev);
>  }
>  
> +static int vmstate_xive_tctx_post_load(void *opaque, int version_id)
> +{
> +    XiveTCTX *tctx = XIVE_TCTX_BASE(opaque);
> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
> +
> +    if (xtc->post_load) {
> +        return xtc->post_load(tctx, version_id);
> +    }
> +
> +    return 0;
> +}
> +
>  static const VMStateDescription vmstate_xive_tctx_base = {
>      .name = TYPE_XIVE_TCTX,
>      .version_id = 1,
>      .minimum_version_id = 1,
> +    .post_load = vmstate_xive_tctx_post_load,
>      .fields = (VMStateField[]) {
>          VMSTATE_BUFFER(regs, XiveTCTX),
>          VMSTATE_END_OF_LIST()
> @@ -581,9 +599,37 @@ static const TypeInfo xive_tctx_base_info = {
>      .class_size    = sizeof(XiveTCTXClass),
>  };
>  
> +static int xive_tctx_post_load(XiveTCTX *tctx, int version_id)
> +{
> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
> +
> +    /*
> +     * When we collect the states from KVM XIVE irqchip, we set word2
> +     * of the thread context to print out the OS CAM line under the
> +     * QEMU monitor.
> +     *
> +     * This breaks migration on a guest using TCG or not using a KVM
> +     * irqchip. Fix with an extra reset of the thread contexts.
> +     */
> +    if (xrc->reset_tctx) {
> +        xrc->reset_tctx(tctx->xrtr, tctx);
> +    }
> +    return 0;
> +}
> +
> +static void xive_tctx_class_init(ObjectClass *klass, void *data)
> +{
> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
> +
> +    xtc->post_load = xive_tctx_post_load;
> +}
> +
>  static const TypeInfo xive_tctx_info = {
>      .name          = TYPE_XIVE_TCTX,
>      .parent        = TYPE_XIVE_TCTX_BASE,
> +    .instance_size = sizeof(XiveTCTX),
> +    .class_init    = xive_tctx_class_init,
> +    .class_size    = sizeof(XiveTCTXClass),
>  };
>  
>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 92ef53743b64..6fac6ca70595 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -359,7 +359,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
>  
>  static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>  {
> -    return 0;
> +    return spapr_xive_post_load(spapr->xive, version_id);
>  }
>  
>  /*
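
For reference, the 64-bit EAS attribute layout from the kvm.h hunk
above can be exercised stand-alone as below. This is only a sketch:
the KVM_XIVE_EAS_* values are copied from the patch, while the toy_*
helpers and the 'masked' parameter are hypothetical names added for
illustration (the patch itself never sets the mask bit on restore).
Note the casts to uint64_t before shifting, so that no field is
shifted in a narrower type.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Layout of 64-bit eas attribute values, as defined in the patch */
#define KVM_XIVE_EAS_PRIORITY_SHIFT 0
#define KVM_XIVE_EAS_PRIORITY_MASK  0x7
#define KVM_XIVE_EAS_SERVER_SHIFT   3
#define KVM_XIVE_EAS_SERVER_MASK    0xfffffff8ULL
#define KVM_XIVE_EAS_MASK_SHIFT     32
#define KVM_XIVE_EAS_MASK_MASK      0x100000000ULL
#define KVM_XIVE_EAS_EISN_SHIFT     33
#define KVM_XIVE_EAS_EISN_MASK      0xfffffffe00000000ULL

/* Pack (server, priority, eisn, masked) into one attribute value. */
static uint64_t toy_eas_encode(uint32_t server, uint8_t priority,
                               uint32_t eisn, bool masked)
{
    uint64_t v = 0;

    v |= ((uint64_t)priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
        KVM_XIVE_EAS_PRIORITY_MASK;
    v |= ((uint64_t)server << KVM_XIVE_EAS_SERVER_SHIFT) &
        KVM_XIVE_EAS_SERVER_MASK;
    v |= ((uint64_t)eisn << KVM_XIVE_EAS_EISN_SHIFT) &
        KVM_XIVE_EAS_EISN_MASK;
    if (masked) {
        v |= KVM_XIVE_EAS_MASK_MASK;
    }
    return v;
}

/* Field extractors, mirroring the decode side of the patch. */
static uint8_t toy_eas_priority(uint64_t v)
{
    return (v & KVM_XIVE_EAS_PRIORITY_MASK) >> KVM_XIVE_EAS_PRIORITY_SHIFT;
}

static uint32_t toy_eas_server(uint64_t v)
{
    return (v & KVM_XIVE_EAS_SERVER_MASK) >> KVM_XIVE_EAS_SERVER_SHIFT;
}

static uint32_t toy_eas_eisn(uint64_t v)
{
    return (v & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
}

static bool toy_eas_masked(uint64_t v)
{
    return (v & KVM_XIVE_EAS_MASK_MASK) != 0;
}
```

With this layout the server field is 29 bits wide and the EISN field
31 bits wide; larger values are silently truncated by the masks.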

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend Cédric Le Goater
@ 2018-11-29  3:47   ` David Gibson
  2018-11-29 16:21     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  3:47 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:17AM +0100, Cédric Le Goater wrote:
> This method will become useful when the new machine supporting both
> interrupt modes, XIVE and XICS, is introduced. In this machine, the
> interrupt mode is chosen by the CAS negotiation process and activated
> after a reset.
> 
> For the time being, the only thing that can be done in the XIVE reset
> handler is to map the pages for the TIMA and for the source ESBs.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_irq.h  |  2 ++
>  include/hw/ppc/spapr_xive.h |  1 +
>  hw/intc/spapr_xive.c        |  4 +---
>  hw/ppc/spapr.c              |  2 ++
>  hw/ppc/spapr_irq.c          | 21 +++++++++++++++++++++
>  5 files changed, 27 insertions(+), 3 deletions(-)
> 
> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> index 4e36c0984e1a..34128976e21c 100644
> --- a/include/hw/ppc/spapr_irq.h
> +++ b/include/hw/ppc/spapr_irq.h
> @@ -46,6 +46,7 @@ typedef struct sPAPRIrq {
>      Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
>                                 Error **errp);
>      int (*post_load)(sPAPRMachineState *spapr, int version_id);
> +    void (*reset)(sPAPRMachineState *spapr, Error **errp);
>  } sPAPRIrq;
>  
>  extern sPAPRIrq spapr_irq_xics;
> @@ -57,6 +58,7 @@ int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
>  int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
> +void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp);
>  
>  /*
>   * XICS legacy routines
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index d2517c040958..fa7f3d7718da 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -91,6 +91,7 @@ typedef struct sPAPRMachineState sPAPRMachineState;
>  void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>  void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
>                     uint32_t phandle);
> +void spapr_xive_mmio_map(sPAPRXive *xive);
>  
>  /*
>   * XIVE KVM models
> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> index c5c0e063dc33..def43160e12a 100644
> --- a/hw/intc/spapr_xive.c
> +++ b/hw/intc/spapr_xive.c
> @@ -51,7 +51,7 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>  }
>  
>  /* Map the ESB pages and the TIMA pages */
> -static void spapr_xive_mmio_map(sPAPRXive *xive)
> +void spapr_xive_mmio_map(sPAPRXive *xive)
>  {
>      sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
>      sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);
> @@ -77,8 +77,6 @@ static void spapr_xive_base_reset(DeviceState *dev)
>      for (i = 0; i < xive->nr_ends; i++) {
>          xive_end_reset(&xive->endt[i]);
>      }
> -
> -    spapr_xive_mmio_map(xive);
>  }
>  
>  static void spapr_xive_base_instance_init(Object *obj)
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index d1be2579cd9b..013e6ea8aa64 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1628,6 +1628,8 @@ static void spapr_machine_reset(void)
>          spapr_irq_msi_reset(spapr);
>      }
>  
> +    spapr_irq_reset(spapr, &error_fatal);
> +
>      qemu_devices_reset();
>  
>      /* DRC reset may cause a device to be unplugged. This will cause troubles
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 6fac6ca70595..984c6d60cd9f 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -13,6 +13,7 @@
>  #include "qapi/error.h"
>  #include "hw/ppc/spapr.h"
>  #include "hw/ppc/spapr_xive.h"
> +#include "hw/ppc/spapr_cpu_core.h"
>  #include "hw/ppc/xics.h"
>  #include "sysemu/kvm.h"
>  
> @@ -215,6 +216,10 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
>      return 0;
>  }
>  
> +static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
> +{
> +}
> +
>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>  #define SPAPR_IRQ_XICS_NR_MSIS     \
>      (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
> @@ -232,6 +237,7 @@ sPAPRIrq spapr_irq_xics = {
>      .dt_populate = spapr_irq_dt_populate_xics,
>      .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
>      .post_load   = spapr_irq_post_load_xics,
> +    .reset       = spapr_irq_reset_xics,
>  };
>  
>   /*
> @@ -362,6 +368,11 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>      return spapr_xive_post_load(spapr->xive, version_id);
>  }
>  
> +static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
> +{
> +    spapr_xive_mmio_map(spapr->xive);

It's usually not a good idea to actually construct different
MemoryRegions at run time.  Instead, map them all in, but disable the
ones you don't want (with memory_region_set_enabled()).

I think your current version will also leave the TIMA etc. still
mapped if you reboot from a XIVE guest to a XICS guest.
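
To make the "map everything once, toggle at reset" idea concrete, here
is a minimal standalone sketch. It is a toy model, not QEMU code:
ToyRegion, ToyMachine, the toy_machine_*() helpers and the addresses
are all hypothetical, and the 'enabled' flag only stands in for what
memory_region_set_enabled() would control on a real MemoryRegion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an MMIO window. This is NOT QEMU's MemoryRegion. */
typedef struct ToyRegion {
    uint64_t base;
    uint64_t size;
    bool enabled;
} ToyRegion;

/* Hypothetical window indexes, loosely named after the XIVE regions. */
enum { TOY_ESB, TOY_END, TOY_TIMA, TOY_NR_REGIONS };

typedef struct ToyMachine {
    ToyRegion regions[TOY_NR_REGIONS];
} ToyMachine;

/* Called once at machine creation: every window is mapped, but disabled. */
static void toy_machine_map_all(ToyMachine *m)
{
    /* Addresses are made up for the example. */
    m->regions[TOY_ESB]  = (ToyRegion){ 0x10000, 0x1000, false };
    m->regions[TOY_END]  = (ToyRegion){ 0x20000, 0x1000, false };
    m->regions[TOY_TIMA] = (ToyRegion){ 0x30000, 0x1000, false };
}

/* Called at every reset: the layout is untouched, only the flag flips. */
static void toy_machine_reset(ToyMachine *m, bool xive_mode)
{
    assert(m != NULL);
    for (size_t i = 0; i < TOY_NR_REGIONS; i++) {
        m->regions[i].enabled = xive_mode;
    }
}

/* Address decoding only sees the windows that are currently enabled. */
static const ToyRegion *toy_machine_find(const ToyMachine *m, uint64_t addr)
{
    for (size_t i = 0; i < TOY_NR_REGIONS; i++) {
        const ToyRegion *r = &m->regions[i];

        if (r->enabled && addr >= r->base && addr - r->base < r->size) {
            return r;
        }
    }
    return NULL;
}
```

In this model, a reset into XICS mode leaves nothing visible at the
TIMA address, because the reset handler always writes the flag for
every window rather than only mapping the ones it wants.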

> +}
> +
>  /*
>   * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>   * with XICS.
> @@ -383,6 +394,7 @@ sPAPRIrq spapr_irq_xive = {
>      .dt_populate = spapr_irq_dt_populate_xive,
>      .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
>      .post_load   = spapr_irq_post_load_xive,
> +    .reset       = spapr_irq_reset_xive,
>  };
>  
>  /*
> @@ -428,6 +440,15 @@ int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id)
>      return smc->irq->post_load(spapr, version_id);
>  }
>  
> +void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp)
> +{
> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> +
> +    if (smc->irq->reset) {
> +        smc->irq->reset(spapr, errp);
> +    }
> +}
> +
>  /*
>   * XICS legacy routines - to deprecate one day
>   */

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset Cédric Le Goater
@ 2018-11-29  4:03   ` David Gibson
  2018-11-29 16:28     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:03 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:18AM +0100, Cédric Le Goater wrote:
> Currently, the interrupt presenter of the VCPU is set at realize
> time. Setting it at reset will become useful when the new machine
> supporting both interrupt modes is introduced. In this machine, the
> interrupt mode is chosen at CAS time and activated after a reset.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_cpu_core.h |  2 ++
>  hw/ppc/spapr_cpu_core.c         | 26 ++++++++++++++++++++++++++
>  hw/ppc/spapr_irq.c              | 11 +++++++++++
>  3 files changed, 39 insertions(+)
> 
> diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
> index 9e2821e4b31f..fc8ea9021656 100644
> --- a/include/hw/ppc/spapr_cpu_core.h
> +++ b/include/hw/ppc/spapr_cpu_core.h
> @@ -53,4 +53,6 @@ static inline sPAPRCPUState *spapr_cpu_state(PowerPCCPU *cpu)
>      return (sPAPRCPUState *)cpu->machine_data;
>  }
>  
> +void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type);
> +
>  #endif
> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
> index 1811cd48db90..529de0b6b9c8 100644
> --- a/hw/ppc/spapr_cpu_core.c
> +++ b/hw/ppc/spapr_cpu_core.c
> @@ -398,3 +398,29 @@ static const TypeInfo spapr_cpu_core_type_infos[] = {
>  };
>  
>  DEFINE_TYPES(spapr_cpu_core_type_infos)
> +
> +typedef struct ForeachFindIntCArgs {
> +    const char *intc_type;
> +    Object *intc;
> +} ForeachFindIntCArgs;
> +
> +static int spapr_cpu_core_find_intc(Object *child, void *opaque)
> +{
> +    ForeachFindIntCArgs *args = opaque;
> +
> +    if (object_dynamic_cast(child, args->intc_type)) {
> +        args->intc = child;
> +    }
> +
> +    return args->intc != NULL;
> +}
> +
> +void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type)
> +{
> +    ForeachFindIntCArgs args = { intc_type, NULL };
> +
> +    object_child_foreach(OBJECT(cpu), spapr_cpu_core_find_intc, &args);
> +    g_assert(args.intc);

We could create some extra links on the cpu to avoid scanning all the
children, but I guess that's a refinement.

Then again, what do we actually use the cpu->intc pointer for in the
XIVE context?  I had a feeling that, because of the different way
notifications are handled, we might never need to go from a cpu handle
to the associated TCTX.
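
For illustration, the "extra links" refinement could look roughly like
the standalone sketch below. It is a toy model, not QEMU's object
model: ToyCpu, ToyIntc, ToyLink and the toy_cpu_*() helpers are
hypothetical names. It only contrasts a scan over the children with a
direct link recorded when the presenter is created.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_MAX_CHILDREN 4

/* Toy interrupt presenter, identified by a type name. */
typedef struct ToyIntc {
    const char *type;
} ToyIntc;

/* One link slot per interrupt mode (hypothetical). */
typedef enum { TOY_LINK_XICS, TOY_LINK_XIVE, TOY_NR_LINKS } ToyLink;

typedef struct ToyCpu {
    ToyIntc *children[TOY_MAX_CHILDREN];
    size_t nr_children;
    ToyIntc *links[TOY_NR_LINKS];   /* filled in once, at creation */
    ToyIntc *intc;                  /* currently active presenter */
} ToyCpu;

/* Attach a presenter as a child and record a direct link to it. */
static void toy_cpu_add_intc(ToyCpu *cpu, ToyIntc *intc, ToyLink link)
{
    assert(cpu->nr_children < TOY_MAX_CHILDREN);
    cpu->children[cpu->nr_children++] = intc;
    cpu->links[link] = intc;
}

/* Scanning variant: walk all children and compare type names. */
static ToyIntc *toy_cpu_find_intc(const ToyCpu *cpu, const char *type)
{
    for (size_t i = 0; i < cpu->nr_children; i++) {
        if (strcmp(cpu->children[i]->type, type) == 0) {
            return cpu->children[i];
        }
    }
    return NULL;
}

/* Link variant: selecting the presenter at reset is a pointer copy. */
static void toy_cpu_set_intc(ToyCpu *cpu, ToyLink link)
{
    assert(cpu->links[link] != NULL);
    cpu->intc = cpu->links[link];
}
```

Both variants end up with the same pointer; the link variant just moves
the lookup cost from every reset to the one-time creation path.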


> +    cpu->intc = args.intc;
> +}
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 984c6d60cd9f..969efad7e6e9 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -218,6 +218,11 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
>  
>  static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
>  {
> +    CPUState *cs;
> +
> +    CPU_FOREACH(cs) {
> +        spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
> +    }
>  }
>  
>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
> @@ -370,6 +375,12 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>  
>  static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
>  {
> +    CPUState *cs;
> +
> +    CPU_FOREACH(cs) {
> +        spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->xive_tctx_type);
> +    }
> +
>      spapr_xive_mmio_map(spapr->xive);
>  }
>  


* Re: [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine Cédric Le Goater
@ 2018-11-29  4:08   ` David Gibson
  2018-11-29 16:36     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:08 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:21AM +0100, Cédric Le Goater wrote:
> This routine gathers all the KVM initialization of the XICS KVM
> presenter. It will be useful when the initialization of the KVM XICS
> device is moved to a global routine.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

I dislike calling things *_init() because it's not clear which of
QEMU's many "init" hooks it belongs with.

> ---
>  hw/intc/xics_kvm.c | 29 +++++++++++++++++++----------
>  1 file changed, 19 insertions(+), 10 deletions(-)
> 
> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
> index e8fa9a53aeba..efad1b19d821 100644
> --- a/hw/intc/xics_kvm.c
> +++ b/hw/intc/xics_kvm.c
> @@ -123,11 +123,8 @@ static void icp_kvm_reset(DeviceState *dev)
>      icp_set_kvm_state(ICP(dev), 1);
>  }
>  
> -static void icp_kvm_realize(DeviceState *dev, Error **errp)
> +static void icp_kvm_init(ICPState *icp, Error **errp)
>  {
> -    ICPState *icp = ICP(dev);
> -    ICPStateClass *icpc = ICP_GET_CLASS(icp);
> -    Error *local_err = NULL;
>      CPUState *cs;
>      KVMEnabledICP *enabled_icp;
>      unsigned long vcpu_id;
> @@ -137,12 +134,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
>          abort();
>      }
>  
> -    icpc->parent_realize(dev, &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> -        return;
> -    }
> -
>      cs = icp->cs;
>      vcpu_id = kvm_arch_vcpu_id(cs);
>  
> @@ -168,6 +159,24 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
>      QLIST_INSERT_HEAD(&kvm_enabled_icps, enabled_icp, node);
>  }
>  
> +static void icp_kvm_realize(DeviceState *dev, Error **errp)
> +{
> +    ICPStateClass *icpc = ICP_GET_CLASS(dev);
> +    Error *local_err = NULL;
> +
> +    icpc->parent_realize(dev, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    icp_kvm_init(ICP(dev), &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +}
> +
>  static void icp_kvm_class_init(ObjectClass *klass, void *data)
>  {
>      DeviceClass *dc = DEVICE_CLASS(klass);


* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
@ 2018-11-29  4:09   ` David Gibson
  2018-11-29 16:36     ` Cédric Le Goater
  2018-12-03 17:48     ` Peter Maydell
  0 siblings, 2 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:09 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:20AM +0100, Cédric Le Goater wrote:
> This will be used to remove the MMIO regions of the POWER9 XIVE
> interrupt controller when the sPAPR machine is reset.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

Since the code looks sane.

However, I think using memory_region_set_enabled() would be a better
idea for our purposes than actually adding/deleting the subregion.

> ---
>  include/hw/sysbus.h |  1 +
>  hw/core/sysbus.c    | 10 ++++++++++
>  2 files changed, 11 insertions(+)
> 
> diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
> index 0b59a3b8d605..bc641984b5da 100644
> --- a/include/hw/sysbus.h
> +++ b/include/hw/sysbus.h
> @@ -92,6 +92,7 @@ qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
>  void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
>  void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>                               int priority);
> +void sysbus_mmio_unmap(SysBusDevice *dev, int n);
>  void sysbus_add_io(SysBusDevice *dev, hwaddr addr,
>                     MemoryRegion *mem);
>  MemoryRegion *sysbus_address_space(SysBusDevice *dev);
> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
> index 7ac36ad3e707..09f202167dcb 100644
> --- a/hw/core/sysbus.c
> +++ b/hw/core/sysbus.c
> @@ -153,6 +153,16 @@ static void sysbus_mmio_map_common(SysBusDevice *dev, int n, hwaddr addr,
>      }
>  }
>  
> +void sysbus_mmio_unmap(SysBusDevice *dev, int n)
> +{
> +    assert(n >= 0 && n < dev->num_mmio);
> +
> +    if (dev->mmio[n].addr != (hwaddr)-1) {
> +        memory_region_del_subregion(get_system_memory(), dev->mmio[n].memory);
> +        dev->mmio[n].addr = (hwaddr)-1;
> +    }
> +}
> +
>  void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
>  {
>      sysbus_mmio_map_common(dev, n, addr, false, 0);


* Re: [Qemu-devel] [PATCH v5 31/36] spapr/xive: export the spapr_xive_kvm_init() routine
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 31/36] spapr/xive: export the spapr_xive_kvm_init() routine Cédric Le Goater
@ 2018-11-29  4:11   ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:11 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:24AM +0100, Cédric Le Goater wrote:
> We will need it to initialize the KVM XIVE device globally from the
> machine when the XIVE interrupt mode is selected.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

This is so trivial, I think it's better to fold it into the patch
which uses it.

> ---
>  include/hw/ppc/spapr_xive.h | 2 ++
>  hw/intc/spapr_xive_kvm.c    | 2 +-
>  2 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index fa7f3d7718da..1d134a681326 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -107,4 +107,6 @@ void spapr_xive_mmio_map(sPAPRXive *xive);
>  #define TYPE_XIVE_TCTX_KVM   "xive-tctx-kvm"
>  #define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
>  
> +void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp);
> +
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
> index b9fee4ea240f..cb2aa6e81274 100644
> --- a/hw/intc/spapr_xive_kvm.c
> +++ b/hw/intc/spapr_xive_kvm.c
> @@ -809,7 +809,7 @@ static void spapr_xive_kvm_instance_init(Object *obj)
>                                NULL);
>  }
>  
> -static void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
> +void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>  {
>      Error *local_err = NULL;
>      size_t tima_len;


* Re: [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers Cédric Le Goater
@ 2018-11-29  4:12   ` David Gibson
  2018-11-29 16:40     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:12 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:25AM +0100, Cédric Le Goater wrote:
> Removing RTAS handlers will become necessary when the new pseries
> machine supporting multiple interrupt modes is introduced.

I'd prefer this to be done as a separate spapr_rtas_unregister()
helper, just to improve greppability.
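
Something along these lines, as a standalone sketch. This is a toy
table, not the real spapr_rtas.c: the toy_rtas_*() names, the token
range and the handler signature are hypothetical. The point is only
that registration keeps its strict "slot must be free" assert, while
removal gets its own name that is easy to grep for.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token range for the toy table. */
#define TOY_TOKEN_BASE 0x2000
#define TOY_TOKEN_MAX  0x2010

typedef int (*toy_rtas_fn)(int arg);

static struct {
    const char *name;
    toy_rtas_fn fn;
} toy_rtas_table[TOY_TOKEN_MAX - TOY_TOKEN_BASE];

/* Registration stays strict: a name and handler are required,
 * and the slot must be free. */
static void toy_rtas_register(int token, const char *name, toy_rtas_fn fn)
{
    assert(token >= TOY_TOKEN_BASE && token < TOY_TOKEN_MAX);
    assert(name != NULL && fn != NULL);

    token -= TOY_TOKEN_BASE;
    assert(toy_rtas_table[token].name == NULL);

    toy_rtas_table[token].name = name;
    toy_rtas_table[token].fn = fn;
}

/* Removal is a separate helper, so callers never pass a NULL name. */
static void toy_rtas_unregister(int token)
{
    assert(token >= TOY_TOKEN_BASE && token < TOY_TOKEN_MAX);

    token -= TOY_TOKEN_BASE;
    toy_rtas_table[token].name = NULL;
    toy_rtas_table[token].fn = NULL;
}

/* Returns the handler for a token, or NULL if none is registered. */
static toy_rtas_fn toy_rtas_lookup(int token)
{
    assert(token >= TOY_TOKEN_BASE && token < TOY_TOKEN_MAX);
    return toy_rtas_table[token - TOY_TOKEN_BASE].fn;
}

/* Sample handler used to exercise the table. */
static int toy_handler(int arg)
{
    return arg + 1;
}
```

With this split, the assert in the register path does not need to be
relaxed to accept a NULL name.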

> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/ppc/spapr_rtas.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> index d6a0952154ac..e005d5d08151 100644
> --- a/hw/ppc/spapr_rtas.c
> +++ b/hw/ppc/spapr_rtas.c
> @@ -404,7 +404,7 @@ void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn)
>  
>      token -= RTAS_TOKEN_BASE;
>  
> -    assert(!rtas_table[token].name);
> +    assert(!name || !rtas_table[token].name);
>  
>      rtas_table[token].name = name;
>      rtas_table[token].fn = fn;


* Re: [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device Cédric Le Goater
@ 2018-11-29  4:17   ` David Gibson
  2018-11-29 16:41     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:17 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:26AM +0100, Cédric Le Goater wrote:
> If a new interrupt mode is chosen by CAS, the machine generates a
> reset to reconfigure. At this point, the connection with the previous
> KVM device needs to be closed and a new connection needs to be opened
> with the KVM device operating the chosen interrupt mode.
> 
> New routines are introduced to destroy the XICS and XIVE KVM
> devices. They make use of a new KVM device ioctl which destroys the
> device and also disconnects the IRQ presenters from the VCPUs.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  include/hw/ppc/spapr_xive.h |  1 +
>  include/hw/ppc/xics.h       |  1 +
>  linux-headers/linux/kvm.h   |  2 ++
>  hw/intc/spapr_xive_kvm.c    | 54 +++++++++++++++++++++++++++++++++++
>  hw/intc/xics_kvm.c          | 57 +++++++++++++++++++++++++++++++++++++
>  5 files changed, 115 insertions(+)
> 
> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> index 1d134a681326..c913c0aed08a 100644
> --- a/include/hw/ppc/spapr_xive.h
> +++ b/include/hw/ppc/spapr_xive.h
> @@ -108,5 +108,6 @@ void spapr_xive_mmio_map(sPAPRXive *xive);
>  #define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
>  
>  void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp);
> +void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp);
>  
>  #endif /* PPC_SPAPR_XIVE_H */
> diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
> index 9958443d1984..a5468c6eb6e3 100644
> --- a/include/hw/ppc/xics.h
> +++ b/include/hw/ppc/xics.h
> @@ -205,6 +205,7 @@ void icp_resend(ICPState *ss);
>  typedef struct sPAPRMachineState sPAPRMachineState;
>  
>  int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
> +int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp);
>  void xics_spapr_init(sPAPRMachineState *spapr);
>  
>  Object *icp_create(Object *cpu, const char *type, XICSFabric *xi,
> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
> index 59fa8d8d7f39..b7a74c58d0db 100644
> --- a/linux-headers/linux/kvm.h
> +++ b/linux-headers/linux/kvm.h

Please put linux-headers updates in a separate patch.

> @@ -1309,6 +1309,8 @@ struct kvm_s390_ucas_mapping {
>  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>  
> +#define KVM_DESTROY_DEVICE	  _IOWR(KVMIO,  0xf0, struct kvm_create_device)
> +
>  /*
>   * ioctls for vcpu fds
>   */
> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
> index cb2aa6e81274..0672d8bcbc6b 100644
> --- a/hw/intc/spapr_xive_kvm.c
> +++ b/hw/intc/spapr_xive_kvm.c
> @@ -55,6 +55,16 @@ static void kvm_cpu_enable(CPUState *cs)
>      QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
>  }
>  
> +static void kvm_cpu_disable_all(void)
> +{
> +    KVMEnabledCPU *enabled_cpu, *next;
> +
> +    QLIST_FOREACH_SAFE(enabled_cpu, &kvm_enabled_cpus, node, next) {
> +        QLIST_REMOVE(enabled_cpu, node);
> +        g_free(enabled_cpu);
> +    }
> +}
> +
>  /*
>   * XIVE Thread Interrupt Management context (KVM)
>   */
> @@ -864,6 +874,50 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>      kvm_gsi_direct_mapping = true;
>  }
>  
> +void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    struct kvm_create_device xive_destroy_device = {
> +        .fd = xive->fd,
> +        .type = KVM_DEV_TYPE_XIVE,
> +        .flags = 0,
> +    };
> +    size_t esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
> +    int rc;
> +
> +    /* The KVM XIVE device is not in use */
> +    if (xive->fd == -1) {
> +        return;
> +    }
> +
> +    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
> +        error_setg(errp,
> +                   "IRQ_XIVE capability must be present for KVM XIVE device");
> +        return;

If we get here, xive->fd, checked above, was valid, which should be
impossible without KVM and the XIVE capability, so you can just
assert().
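
Roughly, as a standalone sketch. This is a toy device, not the real
spapr_xive_kvm_fini(): ToyKvmDevice, its 'capable' field and the
toy_kvm_*() helpers are hypothetical stand-ins for the fd and the
kvm_enabled()/capability checks.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct ToyKvmDevice {
    int fd;         /* -1 while the KVM device is not in use */
    bool capable;   /* stands in for the KVM capability checks */
} ToyKvmDevice;

/* Init refuses to run without the capability, so fd stays at -1. */
static bool toy_kvm_init(ToyKvmDevice *dev, int fd)
{
    if (!dev->capable) {
        return false;
    }
    dev->fd = fd;
    return true;
}

/* Returns true if a device was actually torn down. */
static bool toy_kvm_fini(ToyKvmDevice *dev)
{
    /* The KVM device is not in use */
    if (dev->fd == -1) {
        return false;
    }

    /*
     * A valid fd can only come from a successful init, which already
     * required the capability. A failure here is a programming error,
     * not a runtime condition to report to the caller.
     */
    assert(dev->capable);

    dev->fd = -1;
    return true;
}
```

The early return still handles the "device not in use" case, so the
assert only documents an invariant and removes an unreachable error
path.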

> +    }
> +
> +    /* Clear the KVM mapping */
> +    sysbus_mmio_unmap(SYS_BUS_DEVICE(xsrc), 0);
> +    munmap(xsrc->esb_mmap, esb_len);
> +    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 0);
> +    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
> +
> +    /* Destroy the KVM device. This also clears the VCPU presenters */
> +    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xive_destroy_device);
> +    if (rc < 0) {
> +        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XIVE");
> +    }
> +    close(xive->fd);
> +    xive->fd = -1;
> +
> +    kvm_kernel_irqchip = false;
> +    kvm_msi_via_irqfd_allowed = false;
> +    kvm_gsi_direct_mapping = false;
> +
> +    /* Clear the local list of presenter (hotplug) */
> +    kvm_cpu_disable_all();
> +}
> +
>  static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
>  {
>      sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
> index eabc901a4556..a7e3ec32a761 100644
> --- a/hw/intc/xics_kvm.c
> +++ b/hw/intc/xics_kvm.c
> @@ -50,6 +50,16 @@ typedef struct KVMEnabledICP {
>  static QLIST_HEAD(, KVMEnabledICP)
>      kvm_enabled_icps = QLIST_HEAD_INITIALIZER(&kvm_enabled_icps);
>  
> +static void kvm_disable_icps(void)
> +{
> +    KVMEnabledICP *enabled_icp, *next;
> +
> +    QLIST_FOREACH_SAFE(enabled_icp, &kvm_enabled_icps, node, next) {
> +        QLIST_REMOVE(enabled_icp, node);
> +        g_free(enabled_icp);
> +    }
> +}
> +
>  /*
>   * ICP-KVM
>   */
> @@ -475,6 +485,53 @@ fail:
>      return -1;
>  }
>  
> +int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp)
> +{
> +    int rc;
> +    struct kvm_create_device xics_create_device = {
> +        .fd = kernel_xics_fd,
> +        .type = KVM_DEV_TYPE_XICS,
> +        .flags = 0,
> +    };
> +
> +    /* The KVM XICS device is not in use */
> +    if (kernel_xics_fd == -1) {
> +        return 0;
> +    }
> +
> +    if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
> +        error_setg(errp,
> +                   "KVM and IRQ_XICS capability must be present for KVM XICS device");
> +        return -1;

Same comment as above.

> +    }
> +
> +    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xics_create_device);
> +    if (rc < 0) {
> +        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XICS");
> +    }
> +    close(kernel_xics_fd);
> +    kernel_xics_fd = -1;
> +
> +    spapr_rtas_register(RTAS_IBM_SET_XIVE, NULL, 0);
> +    spapr_rtas_register(RTAS_IBM_GET_XIVE, NULL, 0);
> +    spapr_rtas_register(RTAS_IBM_INT_OFF, NULL, 0);
> +    spapr_rtas_register(RTAS_IBM_INT_ON, NULL, 0);
> +
> +    kvmppc_define_rtas_kernel_token(0, "ibm,set-xive");
> +    kvmppc_define_rtas_kernel_token(0, "ibm,get-xive");
> +    kvmppc_define_rtas_kernel_token(0, "ibm,int-on");
> +    kvmppc_define_rtas_kernel_token(0, "ibm,int-off");
> +
> +    kvm_kernel_irqchip = false;
> +    kvm_msi_via_irqfd_allowed = false;
> +    kvm_gsi_direct_mapping = false;
> +
> +    /* Clear the presenter from the VCPUs */
> +    kvm_disable_icps();
> +
> +    return rc;
> +}
> +
>  static void xics_kvm_register_types(void)
>  {
>      type_register_static(&ics_kvm_info);


* Re: [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine Cédric Le Goater
@ 2018-11-29  4:22   ` David Gibson
  2018-11-29 17:07     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:22 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:27AM +0100, Cédric Le Goater wrote:
> The interrupt mode is chosen by the CAS negotiation process and
> activated after a reset to take into account the required changes in
> the machine. This brings new constraints on how the associated KVM IRQ
> device is initialized.
> 
> Currently, each model takes care of the initialization of the KVM
> device in its realize method, but this is no longer possible as the
> initialization needs to be done globally when the interrupt mode is
> known, i.e. when the machine is reset. It also means that we need a
> way to delete a KVM device when another mode is chosen.
> 
> Also, to support migration, the QEMU objects holding the state to
> transfer should always be available but not necessarily activated.
> 
> The overall approach of this proposal is to initialize both interrupt
> modes at the QEMU level and keep the IRQ number space in sync to allow
> switching from one mode to another. For the KVM side of things, the
> whole initialization of the KVM device, sources and presenters, is
> grouped in a single routine. The XICS and XIVE sPAPR IRQ reset
> handlers are modified accordingly to handle the init and delete
> sequences of the KVM device. The post_load handlers also are, to take
> into account a possible change of interrupt mode after transfer.
> 
> As KVM is now initialized at reset, we lose the possibility to
> fall back to the QEMU emulated mode in case of failure and failures
> become fatal to the machine.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/intc/spapr_xive_kvm.c | 48 +++++++++++-----------
>  hw/intc/xics_kvm.c       | 18 ++++++---
>  hw/ppc/spapr_irq.c       | 86 +++++++++++++++++++++++++++++-----------
>  3 files changed, 98 insertions(+), 54 deletions(-)
> 
> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
> index 0672d8bcbc6b..9c7d36f51e3d 100644
> --- a/hw/intc/spapr_xive_kvm.c
> +++ b/hw/intc/spapr_xive_kvm.c
> @@ -148,7 +148,6 @@ static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
>  
>  static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
>  {
> -    XiveTCTX *tctx = XIVE_TCTX_KVM(dev);
>      XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(dev);
>      Error *local_err = NULL;
>  
> @@ -157,12 +156,6 @@ static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
>          error_propagate(errp, local_err);
>          return;
>      }
> -
> -    xive_tctx_kvm_init(tctx, &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> -        return;
> -    }
>  }
>  
>  static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
> @@ -222,12 +215,9 @@ static void xive_source_kvm_init(XiveSource *xsrc, Error **errp)
>  
>  static void xive_source_kvm_reset(DeviceState *dev)
>  {
> -    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
>      XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
>  
>      xsc->parent_reset(dev);
> -
> -    xive_source_kvm_init(xsrc, &error_fatal);
>  }
>  
>  /*
> @@ -346,12 +336,6 @@ static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
>  
>      xsrc->qirqs = qemu_allocate_irqs(xive_source_kvm_set_irq, xsrc,
>                                       xsrc->nr_irqs);
> -
> -    xive_source_kvm_mmap(xsrc, &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> -        return;
> -    }
>  }
>  
>  static void xive_source_kvm_unrealize(DeviceState *dev, Error **errp)
> @@ -823,6 +807,7 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>  {
>      Error *local_err = NULL;
>      size_t tima_len;
> +    CPUState *cs;
>  
>      if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
>          error_setg(errp,
> @@ -850,7 +835,18 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>          return;
>      }
>  
> -    /* Let the XiveSource KVM model handle the mapping for the moment */
> +    xive_source_kvm_mmap(&xive->source, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +
> +    /* Create the KVM interrupt sources */
> +    xive_source_kvm_init(&xive->source, &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
>  
>      /* TIMA KVM mapping
>       *
> @@ -869,6 +865,17 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>                                        "xive.tima", tima_len, xive->tm_mmap);
>      sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->tm_mmio);
>  
> +    /* Connect the presenters to the VCPU */
> +    CPU_FOREACH(cs) {
> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +        xive_tctx_kvm_init(XIVE_TCTX_BASE(cpu->intc), &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
> +    }
> +
>      kvm_kernel_irqchip = true;
>      kvm_msi_via_irqfd_allowed = true;
>      kvm_gsi_direct_mapping = true;
> @@ -920,16 +927,9 @@ void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp)
>  
>  static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
>  {
> -    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
>      sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
>      Error *local_err = NULL;
>  
> -    spapr_xive_kvm_init(xive, &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> -        return;
> -    }
> -
>      /* Initialize the source and the local routing tables */
>      sxc->parent_realize(dev, &local_err);
>      if (local_err) {
> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
> index a7e3ec32a761..c89fa943847c 100644
> --- a/hw/intc/xics_kvm.c
> +++ b/hw/intc/xics_kvm.c
> @@ -190,12 +190,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
>          error_propagate(errp, local_err);
>          return;
>      }
> -
> -    icp_kvm_init(ICP(dev), &local_err);
> -    if (local_err) {
> -        error_propagate(errp, local_err);
> -        return;
> -    }
>  }
>  
>  static void icp_kvm_class_init(ObjectClass *klass, void *data)
> @@ -427,6 +421,8 @@ static void rtas_dummy(PowerPCCPU *cpu, sPAPRMachineState *spapr,
>  int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
>  {
>      int rc;
> +    CPUState *cs;
> +    Error *local_err = NULL;
>  
>      if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
>          error_setg(errp,
> @@ -475,6 +471,16 @@ int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
>      kvm_msi_via_irqfd_allowed = true;
>      kvm_gsi_direct_mapping = true;
>  
> +    /* Connect the presenters to the VCPU */
> +    CPU_FOREACH(cs) {
> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> +
> +        icp_kvm_init(ICP(cpu->intc), &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            goto fail;
> +        }
> +    }
>      return 0;
>  
>  fail:
> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> index 79ead51c630d..f1720a8dda33 100644
> --- a/hw/ppc/spapr_irq.c
> +++ b/hw/ppc/spapr_irq.c
> @@ -98,20 +98,14 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
>      MachineState *machine = MACHINE(spapr);
>      Error *local_err = NULL;
>  
> -    if (kvm_enabled()) {
> -        if (machine_kernel_irqchip_allowed(machine) &&
> -            !xics_kvm_init(spapr, &local_err)) {
> -            spapr->icp_type = TYPE_KVM_ICP;
> -            spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs,
> -                                          &local_err);
> -        }
> -        if (machine_kernel_irqchip_required(machine) && !spapr->ics) {
> -            error_prepend(&local_err,
> -                          "kernel_irqchip requested but unavailable: ");
> -            goto error;
> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
> +        spapr->icp_type = TYPE_KVM_ICP;
> +        spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs,
> +                                      &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
>          }
> -        error_free(local_err);
> -        local_err = NULL;
>      }
>  
>      if (!spapr->ics) {
> @@ -119,10 +113,11 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
>          spapr->icp_type = TYPE_ICP;
>          spapr->ics = spapr_ics_create(spapr, TYPE_ICS_SIMPLE, nr_irqs,
>                                        &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            return;
> +        }
>      }
> -
> -error:
> -    error_propagate(errp, local_err);
>  }
>  
>  #define ICS_IRQ_FREE(ics, srcno)   \
> @@ -218,11 +213,28 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
>  
>  static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
>  {
> +    MachineState *machine = MACHINE(spapr);
>      CPUState *cs;
> +    Error *local_err = NULL;
>  
>      CPU_FOREACH(cs) {
>          spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
>      }
> +
> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {

Aren't both devices '_fini'-ed by the machine level reset handler?  Why
does it need a _fini here as well as an init?

> +        xics_kvm_fini(spapr, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_prepend(errp, "KVM XICS fini failed: ");
> +            return;
> +        }
> +        xics_kvm_init(spapr, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_prepend(errp, "KVM XICS init failed: ");
> +            return;
> +        }
> +    }
>  }
>  
>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
> @@ -288,10 +300,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>          spapr->xive_tctx_type = TYPE_XIVE_TCTX_KVM;
>          spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM, nr_irqs,
>                                          nr_servers, &local_err);
> -
> -        if (local_err && machine_kernel_irqchip_required(machine)) {
> +        if (local_err) {
>              error_propagate(errp, local_err);
> -            error_prepend(errp, "kernel_irqchip requested but init failed : ");
>              return;
>          }
>  
> @@ -375,12 +385,29 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>  
>  static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
>  {
> +    MachineState *machine = MACHINE(spapr);
>      CPUState *cs;
> +    Error *local_err = NULL;
>  
>      CPU_FOREACH(cs) {
>          spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->xive_tctx_type);
>      }
>  
> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
> +        spapr_xive_kvm_fini(spapr->xive, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_prepend(errp, "KVM XIVE fini failed: ");
> +            return;
> +        }
> +        spapr_xive_kvm_init(spapr->xive, &local_err);
> +        if (local_err) {
> +            error_propagate(errp, local_err);
> +            error_prepend(errp, "KVM XIVE init failed: ");
> +            return;
> +        }
> +    }
> +
>      spapr_xive_mmio_map(spapr->xive);
>  }
>  
> @@ -432,11 +459,6 @@ static void spapr_irq_init_dual(sPAPRMachineState *spapr, int nr_irqs,
>  {
>      Error *local_err = NULL;
>  
> -    if (kvm_enabled()) {
> -        error_setg(errp, "No KVM support for the 'dual' machine");
> -        return;
> -    }
> -
>      spapr_irq_xics.init(spapr, spapr_irq_xics.nr_irqs, nr_servers, &local_err);
>      if (local_err) {
>          error_propagate(errp, local_err);
> @@ -510,10 +532,15 @@ static Object *spapr_irq_cpu_intc_create_dual(sPAPRMachineState *spapr,
>  
>  static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
>  {
> +    MachineState *machine = MACHINE(spapr);
> +
>      /*
>       * Force a reset of the XIVE backend after migration.
>       */
>      if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
> +        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
> +            xics_kvm_fini(spapr, &error_fatal);
> +        }
>          spapr_irq_xive.reset(spapr, &error_fatal);
>      }
>  
> @@ -522,6 +549,17 @@ static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
>  
>  static void spapr_irq_reset_dual(sPAPRMachineState *spapr, Error **errp)
>  {
> +    MachineState *machine = MACHINE(spapr);
> +
> +    /*
> +     * Destroy all the KVM IRQ devices. This also clears the VCPU
> +     * presenters
> +     */
> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
> +        xics_kvm_fini(spapr, &error_fatal);
> +        spapr_xive_kvm_fini(spapr->xive, &error_fatal);
> +    }
> +
>      /*
>       * Only XICS is reseted at startup as it is the default interrupt
>       * mode.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors
  2018-11-23 11:01         ` Cédric Le Goater
@ 2018-11-29  4:46           ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29  4:46 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 23, 2018 at 12:01:27PM +0100, Cédric Le Goater wrote:
> On 11/23/18 5:35 AM, David Gibson wrote:
> > On Thu, Nov 22, 2018 at 10:47:44PM +0100, Cédric Le Goater wrote:
> >> On 11/22/18 5:41 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:56:58AM +0100, Cédric Le Goater wrote:
[snip]
> >>>> +/*
> >>>> + * XiveEND helpers
> >>>> + */
> >>>> +
> >>>> +void xive_end_reset(XiveEND *end)
> >>>> +{
> >>>> +    memset(end, 0, sizeof(*end));
> >>>> +
> >>>> +    /* switch off the escalation and notification ESBs */
> >>>> +    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
> >>>
> >>> It's not obvious to me what circumstances this would be called under.
> >>> Since the ENDs are in system memory, a memset() seems like an odd
> >>> thing for (virtual) hardware to be doing to it.
> >>
> >> It makes sense on sPAPR if one day some OS starts using the END ESBs for 
> >> further coalescing of the events. None does for now but I have added the 
> >> model though.
> > 
> > Hrm, I think that belongs in PAPR specific code.  It's not really part
> > of the router model - it's the PAPR stuff configuring the router at
> > reset time (much as firmware would configure it at reset time for bare
> > metal).
> 
> It is true that this routine is only used by the H_INT_RESET hcall and by
> the reset handler of the sPAPR controller model. But it made sense to put
> this END helper routine with the other END routines. Don't you think so?

Actually, no.  In real hardware this would be handled by a different
component - the system firmware would do this setup of the ENDs as it
configures the XIVE.  So it makes sense to do this in a separate
component for PAPR as well.  In this case that's another piece of qemu
(the spapr stuff) rather than being within the VM, but the difference
isn't important to the END handling itself.
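
A minimal sketch of that split, assuming a simplified four-word END and
assumed bit positions for the two ESB 'Q' bits (the real XiveEND layout and
END_W1_* values come from the patchset, not from this example). The helper
name spapr_xive_end_reset() is hypothetical; the point is only that the
reset policy sits with the sPAPR code, which plays the role firmware has on
bare metal:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified stand-in for XiveEND; the real structure has more words. */
typedef struct XiveEND {
    uint32_t w0;
    uint32_t w1;
    uint32_t w2;
    uint32_t w3;
} XiveEND;

/* Assumed bit positions, for illustration only. */
#define END_W1_ESn_Q (1u << 30)
#define END_W1_ESe_Q (1u << 28)

/*
 * Hypothetical sPAPR-side helper: puts an END into a known state at
 * reset time, as firmware would when configuring the XIVE.
 */
void spapr_xive_end_reset(XiveEND *end)
{
    assert(end != NULL);

    memset(end, 0, sizeof(*end));

    /* switch off the escalation and notification ESBs */
    end->w1 = END_W1_ESe_Q | END_W1_ESn_Q;
}
```

The generic router code would then never need to know what a freshly reset
END looks like.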

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [Qemu-ppc] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend
  2018-11-29  1:02       ` David Gibson
@ 2018-11-29  6:56         ` Greg Kurz
  0 siblings, 0 replies; 184+ messages in thread
From: Greg Kurz @ 2018-11-29  6:56 UTC (permalink / raw)
  To: David Gibson; +Cc: Cédric Le Goater, qemu-ppc, qemu-devel

On Thu, 29 Nov 2018 12:02:48 +1100
David Gibson <david@gibson.dropbear.id.au> wrote:

> On Wed, Nov 28, 2018 at 10:35:51AM +0100, Greg Kurz wrote:
> > On Wed, 28 Nov 2018 13:57:14 +1100
> > David Gibson <david@gibson.dropbear.id.au> wrote:
> >   
> > > On Fri, Nov 16, 2018 at 11:57:05AM +0100, Cédric Le Goater wrote:  
> > > > We will need to use xics_max_server_number() to create the sPAPRXive
> > > > object modeling the interrupt controller of the machine which is
> > > > created before the CPUs.
> > > > 
> > > > Signed-off-by: Cédric Le Goater <clg@kaod.org>    
> > > 
> > > My only concern here is that this moves the spapr_set_vsmt_mode()
> > > before some of the sanity checks in spapr_init_cpus().  Are we certain
> > > there are no edge cases that could cause badness?
> > >   
> > 
> > The early checks in spapr_init_cpus() filter out topologies that would
> > result in partially filled cores. They're only related to the rest of
> > the code that creates the boot CPUs. Before commit 1a5008fc17,
> > spapr_set_vsmt_mode() was even being called before spapr_init_cpus().
> > The rationale to move it there was to ensure it is called before the
> > first user of spapr->vsmt, which happens to be a call to
> > xics_max_server_number().  
> 
> Ok.
> 
> > Now that xics_max_server_number() needs to be called even earlier, I think a
> > better change is to have xics_max_server_number() call spapr_set_vsmt_mode()
> > if spapr->vsmt isn't set.  
> 
> I'd rather not do that, but instead move it statically to where it
> needs to be.  That sort of lazy/on-demand initialization can result in
> really confusing behaviours depending on when a seemingly innocuous
> data-returning function is called, so I consider it a code smell.
> 

Fair enough, then:

Reviewed-by: Greg Kurz <groug@kaod.org>
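
For reference, the static-initialization approach discussed above could be
sketched as follows. This is a toy model, not the QEMU code: the Machine
type, both function names and the arithmetic in the getter are placeholders
chosen for illustration. The idea is that the getter stays a pure
data-returning function and a call made too early is caught by an
assertion instead of triggering a hidden setup step:

```c
#include <assert.h>
#include <stdint.h>

/* Toy machine state; `vsmt` stays 0 until it has been configured. */
typedef struct Machine {
    uint32_t vsmt;
    uint32_t max_cpus;
} Machine;

/* Explicit setup step, called once from the machine init path. */
void machine_set_vsmt(Machine *m, uint32_t vsmt)
{
    assert(vsmt != 0);
    m->vsmt = vsmt;
}

/*
 * Pure getter: it never initializes anything behind the caller's
 * back. The formula is a placeholder, not the real computation.
 */
uint32_t machine_max_server_number(const Machine *m)
{
    assert(m->vsmt != 0);   /* catches calls made before setup */
    return m->max_cpus * m->vsmt;
}
```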

> >   
> > > > ---
> > > >  hw/ppc/spapr.c | 10 +++++-----
> > > >  1 file changed, 5 insertions(+), 5 deletions(-)
> > > > 
> > > > diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> > > > index 7afd1a175bf2..50cb9f9f4a02 100644
> > > > --- a/hw/ppc/spapr.c
> > > > +++ b/hw/ppc/spapr.c
> > > > @@ -2466,11 +2466,6 @@ static void spapr_init_cpus(sPAPRMachineState *spapr)
> > > >          boot_cores_nr = possible_cpus->len;
> > > >      }
> > > >  
> > > > -    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> > > > -     * call xics_max_server_number() or spapr_vcpu_id().
> > > > -     */
> > > > -    spapr_set_vsmt_mode(spapr, &error_fatal);
> > > > -
> > > >      if (smc->pre_2_10_has_unused_icps) {
> > > >          int i;
> > > >  
> > > > @@ -2593,6 +2588,11 @@ static void spapr_machine_init(MachineState *machine)
> > > >      /* Setup a load limit for the ramdisk leaving room for SLOF and FDT */
> > > >      load_limit = MIN(spapr->rma_size, RTAS_MAX_ADDR) - FW_OVERHEAD;
> > > >  
> > > > +    /* VSMT must be set in order to be able to compute VCPU ids, ie to
> > > > +     * call xics_max_server_number() or spapr_vcpu_id().
> > > > +     */
> > > > +    spapr_set_vsmt_mode(spapr, &error_fatal);
> > > > +
> > > >      /* Set up Interrupt Controller before we create the VCPUs */
> > > >      smc->irq->init(spapr, &error_fatal);
> > > >      
> > >   
> >   
> 
> 
> 



* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-29  0:54       ` David Gibson
@ 2018-11-29 14:37         ` Cédric Le Goater
  2018-11-29 22:36           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 14:37 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[ ... ]
 
>>> With that approach it might make sense to embed it
>>> here, rather than subclassing it 
>>
>> ah. why not indeed. I have to think about it. 
>>
>>> (the old composition vs. inheritance debate).
>>
>> he. but then the XiveRouter needs to become a QOM interface if we 
>> want to be able to define XIVE table accessors for sPAPRXive. See
>> the  spapr_xive_class_init() routine.
> 
> Erm.. I'm not really sure what you're getting at here.

If we compose a sPAPRXive object with a XiveSource object and a XiveRouter
object, how will the XiveRouter object access the XIVE internal tables,
which are in the sPAPRXive object?

Thinking about it, I am not sure a QOM interface would solve the problem now.
So we are stuck with inheritance.
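
To make the constraint concrete, here is a plain C sketch of the
inheritance pattern, without QOM. All names (Router, Controller,
router_get_eas, ...) are hypothetical simplifications: the generic router
only calls an accessor, and the controller that owns the table provides it
and recovers its own state from the router pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct XiveEAS { uint64_t w; } XiveEAS;

typedef struct Router Router;

/* Table accessor supplied by whichever controller owns the tables. */
typedef int (*get_eas_fn)(Router *r, uint32_t lisn, XiveEAS *out);

struct Router {
    get_eas_fn get_eas;
};

/* Generic routing code only ever goes through the accessor. */
int router_get_eas(Router *r, uint32_t lisn, XiveEAS *out)
{
    assert(out != NULL);
    return r->get_eas(r, lisn, out);
}

#define NR_IRQS 8

/* "Subclass": the router is the first member, so pointers convert. */
typedef struct Controller {
    Router parent;
    XiveEAS eat[NR_IRQS];
} Controller;

static int controller_get_eas(Router *r, uint32_t lisn, XiveEAS *out)
{
    Controller *c = (Controller *)r;   /* valid: parent is first member */

    if (lisn >= NR_IRQS) {
        return -1;
    }
    *out = c->eat[lisn];
    return 0;
}

void controller_init(Controller *c)
{
    c->parent.get_eas = controller_get_eas;
    for (size_t i = 0; i < NR_IRQS; i++) {
        c->eat[i].w = 0;
    }
}
```

With plain composition the router would be a separate object and would
need some other way to reach the table, which is the question raised
above.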

[ ... ]

>>>> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
>>>> +{
>>>> +    XiveSource *xsrc = &xive->source;
>>>> +
>>>> +    if (lisn >= xive->nr_irqs) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    if (!(xive->eat[lisn].w & EAS_VALID)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
>>>
>>> I don't think this is a guest error - getting the qirq by number
>>> should generally be something qemu code does.
>>
>> Even if the IRQ was not defined by the machine ? The EAS_VALID bit is
>> raised when the IRQ is enabled at the XIVE level, which means that the
>> IRQ number has been claimed by some device of the machine. You cannot
>> get a qirq by number for some random IRQ number. Can you?
> 
> Well, you shouldn't.  The point is that it is qemu code (specifically
> the machine setup stuff) that will be calling this, and it shouldn't
> be calling it with irq numbers that haven't been
> enabled/claimed/whatever.

So it should be an assert?
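
If so, the lookup might look like the sketch below. It uses a toy
controller (ToyXive, toy_xive_qirq and the int array standing in for the
qemu_irq array are all hypothetical), and the EAS_VALID bit position is an
assumption:

```c
#include <assert.h>
#include <stdint.h>

#define EAS_VALID (1ull << 63)   /* assumed bit position */
#define NR_IRQS   16

typedef struct ToyXive {
    uint64_t eat[NR_IRQS];   /* one EAS word per LISN */
    int      qirqs[NR_IRQS]; /* stand-in for the qemu_irq array */
} ToyXive;

int *toy_xive_qirq(ToyXive *xive, uint32_t lisn)
{
    /* Callers are machine setup code, so a bad LISN is a QEMU bug. */
    assert(lisn < NR_IRQS);
    assert(xive->eat[lisn] & EAS_VALID);

    return &xive->qirqs[lisn];
}
```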

Thanks,

C. 


* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-29  1:00       ` David Gibson
@ 2018-11-29 15:27         ` Cédric Le Goater
  2018-11-30  1:11           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 15:27 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[ ... ] 

>>>> +/*
>>>> + * The allocation of VP blocks is a complex operation in OPAL and the
>>>> + * VP identifiers have a relation with the number of HW chips, the
>>>> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
>>>> + * controller model does not have the same constraints and can use a
>>>> + * simple mapping scheme of the CPU vcpu_id
>>>> + *
>>>> + * These identifiers are never returned to the OS.
>>>> + */
>>>> +
>>>> +#define SPAPR_XIVE_VP_BASE 0x400
>>>
>>> 0x400 == 1024.  Could we ever have the possibility of needing to
>>> consider both physical NVTs and PAPR NVTs at the same time?  
>>
>> They would not be in the same CAM line: OS ring vs. PHYS ring. 
> 
> Hm.  They still inhabit the same NVT number space though, don't they?

No. skiboot reserves the range of VPs for the HW at init.

https://github.com/open-power/skiboot/blob/master/hw/xive.c#L1093

> I'm thinking about the END->NVT stage of the process here, rather than
> the NVT->TCTX stage.
>
> Oh, also, you're using "VP" here which IIUC == "NVT".  Can we
> standardize on one, please.

VP is used in Linux/KVM, Linux/Native and skiboot. Yes, it's a mess.
Let's have consistent naming in QEMU and use NVT.
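
With NVT naming, the mapping could be sketched like this. The names
SPAPR_XIVE_NVT_BASE, toy_vcpu_to_nvt() and toy_nvt_to_vcpu() are
hypothetical, and hardwiring the block to 0 follows the review suggestion
rather than the posted patch. Each output pointer is guarded by its own
check:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SPAPR_XIVE_NVT_BASE 0x400   /* hypothetical rename of ..._VP_BASE */

/* vcpu_id -> (block, index); either output pointer may be NULL. */
void toy_vcpu_to_nvt(uint32_t vcpu_id, uint8_t *out_nvt_blk,
                     uint32_t *out_nvt_idx)
{
    if (out_nvt_blk) {
        *out_nvt_blk = 0;   /* assumption: a single block on sPAPR */
    }
    if (out_nvt_idx) {
        *out_nvt_idx = SPAPR_XIVE_NVT_BASE + vcpu_id;
    }
}

/* Inverse mapping: NVT index -> vcpu_id. */
uint32_t toy_nvt_to_vcpu(uint32_t nvt_idx)
{
    assert(nvt_idx >= SPAPR_XIVE_NVT_BASE);
    return nvt_idx - SPAPR_XIVE_NVT_BASE;
}
```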

>>> If so, does this base leave enough space for the physical ones?
>>
>> I only used 0x400 to map the VP identifier to the ones allocated by KVM. 
>> 0x0 would be fine but to exercise the model, it's better having a different 
>> base. 
>>
>>>> +uint32_t spapr_xive_nvt_to_target(sPAPRXive *xive, uint8_t nvt_blk,
>>>> +                                  uint32_t nvt_idx)
>>>> +{
>>>> +    return nvt_idx - SPAPR_XIVE_VP_BASE;
>>>> +}
>>>> +
>>>> +int spapr_xive_cpu_to_nvt(sPAPRXive *xive, PowerPCCPU *cpu,
>>>> +                          uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
>>>
>>> A number of these conversions will come out a bit simpler if we pass
>>> the block and index around as a single word in most places.
>>
>> Yes I have to check the whole patchset first. These prototype changes
>> are not too difficult in terms of code complexity but they do break
>> how patches apply and PowerNV is also using the idx and blk much more 
>> explicitly. the block has a meaning on bare metal. So I am a bit 
>> reluctant to do so. I will check.
> 
> Yeah, based on your comments here and earlier, I'm not sure that's a
> good idea any more either.
> 
>>>> +{
>>>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>>>> +
>>>> +    if (!cpu) {
>>>> +        return -1;
>>>> +    }
>>>> +
>>>> +    if (out_nvt_blk) {
>>>> +        /* For testing purpose, we could use 0 for nvt_blk */
>>>> +        *out_nvt_blk = xrtr->chip_id;
>>>
>>> I don't see any point using the chip_id here, which is currently
>>> always set to 0 for PAPR anyway.  If we just hardwire this to 0 it
>>> removes the only use here of xrtr, which will allow some further
>>> simplifications in the caller, I think.
>>
>> You are right about the simplification. It was one way to exercise 
>> the router model and remove any shortcuts in the indexing. I kept 
>> it to be sure I was not tempted to invent new ones. I think we can
>> remove it before merging. 
>>
>>>
>>>> +    }
>>>> +
>>>> +    if (out_nvt_blk) {
>>>> +        *out_nvt_idx = SPAPR_XIVE_VP_BASE + cpu->vcpu_id;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int spapr_xive_target_to_nvt(sPAPRXive *xive, uint32_t target,
>>>> +                             uint8_t *out_nvt_blk, uint32_t *out_nvt_idx)
>>>
>>> I suspect some, maybe most of these conversion functions could be static.
>>
>> static inline ? 
> 
> It's in a .c file so you don't need the "inline" - the compiler can
> work out whether it's a good idea to inline on its own.

It is used in the hcall file also. But we are going to change that.

Thanks,

C.

> 
>>>
>>>> +{
>>>> +    return spapr_xive_cpu_to_nvt(xive, spapr_find_cpu(target), out_nvt_blk,
>>>> +                                 out_nvt_idx);
>>>> +}
>>>> +
>>>> +/*
>>>> + * sPAPR END indexing uses a simple mapping of the CPU vcpu_id, 8
>>>> + * priorities per CPU
>>>> + */
>>>> +int spapr_xive_end_to_target(sPAPRXive *xive, uint8_t end_blk, uint32_t end_idx,
>>>> +                             uint32_t *out_server, uint8_t *out_prio)
>>>> +{
>>>> +    if (out_server) {
>>>> +        *out_server = end_idx >> 3;
>>>> +    }
>>>> +
>>>> +    if (out_prio) {
>>>> +        *out_prio = end_idx & 0x7;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>>>> +                          uint8_t *out_end_blk, uint32_t *out_end_idx)
>>>> +{
>>>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>>>> +
>>>> +    if (!cpu) {
>>>> +        return -1;
>>>
>>> Is there ever a reason this would be called with cpu == NULL?  If not
>>> might as well just assert() here rather than pushing the error
>>> handling back to the caller.
>>
>> ok. yes.
>>
>>>
>>>> +    }
>>>> +
>>>> +    if (out_end_blk) {
>>>> +        /* For testing purpose, we could use 0 for nvt_blk */
>>>> +        *out_end_blk = xrtr->chip_id;
>>>
>>> Again, I don't see any point to using the chip_id, which is pretty
>>> meaningless for PAPR.
>>>
>>>> +    }
>>>> +
>>>> +    if (out_end_idx) {
>>>> +        *out_end_idx = (cpu->vcpu_id << 3) + prio;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>>>> +                             uint8_t *out_end_blk, uint32_t *out_end_idx)
>>>> +{
>>>> +    return spapr_xive_cpu_to_end(xive, spapr_find_cpu(target), prio,
>>>> +                                 out_end_blk, out_end_idx);
>>>> +}
>>>> +
>>>>  static const VMStateDescription vmstate_spapr_xive_end = {
>>>>      .name = TYPE_SPAPR_XIVE "/end",
>>>>      .version_id = 1,
>>>> @@ -263,6 +396,9 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>      xrc->set_eas = spapr_xive_set_eas;
>>>>      xrc->get_end = spapr_xive_get_end;
>>>>      xrc->set_end = spapr_xive_set_end;
>>>> +    xrc->get_nvt = spapr_xive_get_nvt;
>>>> +    xrc->set_nvt = spapr_xive_set_nvt;
>>>> +    xrc->reset_tctx = spapr_xive_reset_tctx;
>>>>  }
>>>>  
>>>>  static const TypeInfo spapr_xive_info = {
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index c49932d2b799..fc6ef5895e6d 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -481,6 +481,7 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>>>>  static void xive_tctx_reset(void *dev)
>>>>  {
>>>>      XiveTCTX *tctx = XIVE_TCTX(dev);
>>>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>>>>  
>>>>      memset(tctx->regs, 0, sizeof(tctx->regs));
>>>>  
>>>> @@ -495,6 +496,14 @@ static void xive_tctx_reset(void *dev)
>>>>       */
>>>>      tctx->regs[TM_QW1_OS + TM_PIPR] =
>>>>          ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
>>>> +
>>>> +    /*
>>>> +     * QEMU sPAPR XIVE only. To let the controller model reset the OS
>>>> +     * CAM line with the VP identifier.
>>>> +     */
>>>> +    if (xrc->reset_tctx) {
>>>> +        xrc->reset_tctx(tctx->xrtr, tctx);
>>>> +    }
>>>
>>> AFAICT this whole function is only used from PAPR, so you can just
>>> move the whole thing to the papr code and avoid the hook function.
>>
>> Yes we could add a loop on all CPUs and reset all the XiveTCTX from
>> the machine or a spapr_irq->reset handler. We will need at some time
>> anyhow.
>>
>> Thanks,
>>
>> C.
>>
>>
>>>
>>>>  }
>>>>  
>>>>  static void xive_tctx_realize(DeviceState *dev, Error **errp)
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v5 15/36] spapr: introdude a new machine IRQ backend for XIVE
  2018-11-29  1:07       ` David Gibson
@ 2018-11-29 15:34         ` Cédric Le Goater
  2018-11-29 22:39           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 15:34 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 2:07 AM, David Gibson wrote:
> On Wed, Nov 28, 2018 at 06:16:58PM +0100, Cédric Le Goater wrote:
>> On 11/28/18 4:28 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:57:08AM +0100, Cédric Le Goater wrote:
>>>> The XIVE IRQ backend uses the same layout as the new XICS backend but
>>>> covers the full range of the IRQ number space. The IRQ numbers for the
>>>> CPU IPIs are allocated at the bottom of this space, below 4K, to
>>>> preserve compatibility with XICS which does not use that range.
>>>>
>>>> This should be enough given that the maximum number of CPUs is 1024
>>>> for the sPAPR machine under QEMU. For the record, the biggest POWER8
>>>> or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
>>>> cores, SMT8).
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/spapr.h     |   2 +
>>>>  include/hw/ppc/spapr_irq.h |   7 ++-
>>>>  hw/ppc/spapr.c             |   2 +-
>>>>  hw/ppc/spapr_irq.c         | 119 ++++++++++++++++++++++++++++++++++++-
>>>>  4 files changed, 124 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 6279711fe8f7..1fbc2663e06c 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
>>>>  typedef struct sPAPREventSource sPAPREventSource;
>>>>  typedef struct sPAPRPendingHPT sPAPRPendingHPT;
>>>>  typedef struct ICSState ICSState;
>>>> +typedef struct sPAPRXive sPAPRXive;
>>>>  
>>>>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
>>>>  #define SPAPR_ENTRY_POINT       0x100
>>>> @@ -175,6 +176,7 @@ struct sPAPRMachineState {
>>>>      const char *icp_type;
>>>>      int32_t irq_map_nr;
>>>>      unsigned long *irq_map;
>>>> +    sPAPRXive  *xive;
>>>>  
>>>>      bool cmd_line_caps[SPAPR_CAP_NUM];
>>>>      sPAPRCapabilities def, eff, mig;
>>>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>>>> index 0e9229bf219e..c854ae527808 100644
>>>> --- a/include/hw/ppc/spapr_irq.h
>>>> +++ b/include/hw/ppc/spapr_irq.h
>>>> @@ -13,6 +13,7 @@
>>>>  /*
>>>>   * IRQ range offsets per device type
>>>>   */
>>>> +#define SPAPR_IRQ_IPI        0x0
>>>>  #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
>>>>  #define SPAPR_IRQ_HOTPLUG    0x1001
>>>>  #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
>>>> @@ -33,7 +34,8 @@ typedef struct sPAPRIrq {
>>>>      uint32_t    nr_irqs;
>>>>      uint32_t    nr_msis;
>>>>  
>>>> -    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
>>>> +    void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
>>>> +                 Error **errp);
>>>>      int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>>>>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
>>>>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
>>>> @@ -42,8 +44,9 @@ typedef struct sPAPRIrq {
>>>>  
>>>>  extern sPAPRIrq spapr_irq_xics;
>>>>  extern sPAPRIrq spapr_irq_xics_legacy;
>>>> +extern sPAPRIrq spapr_irq_xive;
>>>>  
>>>> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
>>>> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
>>>
>>> I don't see why nr_servers needs to become a parameter, since it can
>>> be derived from spapr within this routine.
>>
>> ok. This is true. We can use directly xics_max_server_number(spapr).
>>
>>>>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>>>>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>>>>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
>>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>>> index e470efe7993c..9f8c19e56e7a 100644
>>>> --- a/hw/ppc/spapr.c
>>>> +++ b/hw/ppc/spapr.c
>>>> @@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
>>>>      spapr_set_vsmt_mode(spapr, &error_fatal);
>>>>  
>>>>      /* Set up Interrupt Controller before we create the VCPUs */
>>>> -    spapr_irq_init(spapr, &error_fatal);
>>>> +    spapr_irq_init(spapr, xics_max_server_number(spapr), &error_fatal);
>>>
>>> We should rename xics_max_server_number() since it's no longer xics
>>> specific.
>>
>> yes.
>>
>>>>      /* Set up containers for ibm,client-architecture-support negotiated options
>>>>       */
>>>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>>>> index bac450ffff23..2569ae1bc7f8 100644
>>>> --- a/hw/ppc/spapr_irq.c
>>>> +++ b/hw/ppc/spapr_irq.c
>>>> @@ -12,6 +12,7 @@
>>>>  #include "qemu/error-report.h"
>>>>  #include "qapi/error.h"
>>>>  #include "hw/ppc/spapr.h"
>>>> +#include "hw/ppc/spapr_xive.h"
>>>>  #include "hw/ppc/xics.h"
>>>>  #include "sysemu/kvm.h"
>>>>  
>>>> @@ -91,7 +92,7 @@ error:
>>>>  }
>>>>  
>>>>  static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
>>>> -                                Error **errp)
>>>> +                                int nr_servers, Error **errp)
>>>>  {
>>>>      MachineState *machine = MACHINE(spapr);
>>>>      Error *local_err = NULL;
>>>> @@ -204,10 +205,122 @@ sPAPRIrq spapr_irq_xics = {
>>>>      .print_info  = spapr_irq_print_info_xics,
>>>>  };
>>>>  
>>>> + /*
>>>> + * XIVE IRQ backend.
>>>> + */
>>>> +static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
>>>> +                                    const char *type_xive, int nr_irqs,
>>>> +                                    int nr_servers, Error **errp)
>>>> +{
>>>> +    sPAPRXive *xive;
>>>> +    Error *local_err = NULL;
>>>> +    Object *obj;
>>>> +    uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
>>>> +    int i;
>>>> +
>>>> +    obj = object_new(type_xive);
>>>
>>> What's the reason for making the type a parameter, rather than just
>>> using the #define here.
>>
>> KVM.
> 
> Yeah, I realised that when I'd read a few patches further on.  As I
> commented there, I don't think the separate KVM/TCG subclasses is
> actually a good pattern to follow.

I will use the simple pattern in the next spin: if (kvm) { }

We might want to do that for XICS also but it would break migratability.

>>>> +    object_property_set_int(obj, nr_irqs, "nr-irqs", &error_abort);
>>>> +    object_property_set_int(obj, nr_ends, "nr-ends", &error_abort);
>>>
>>> This is still within the sPAPR code, and you have a pointer to the
>>> MachineState, so I don't see why you couldn't just derive nr_irqs and
>>> nr_servers from that, rather than having them passed in.
>>
>> for nr_servers I agree. nr_irqs comes from the machine class and it will
>> not make any sense using the machine class in the init routine of the
>> 'dual' sPAPR IRQ backend supporting both modes. See patch 34 which
>> initializes both backend for the 'dual' machine.
> 
> Uh.. I guess I'll comment when I get to that patch, but I don't see
> why accessing the machine class would be a problem.  If we have the
> MachineState we can get to the MachineClass.
> 
>>>> +    object_property_set_bool(obj, true, "realized", &local_err);
>>>> +    if (local_err) {
>>>> +        error_propagate(errp, local_err);
>>>> +        return NULL;
>>>> +    }
>>>> +    qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
>>>
>>> Whereas the XiveSource and XiveRouter I think make more sense as
>>> "device components" rather than SysBusDevice subclasses, 
>>
>> Yes. I changed that.
>>
>>> I think it
>>> *does* make sense for the PAPR-XIVE object to be a full fledged
>>> SysBusDevice.
>>
>> Ah. That I didn't do but thinking of it, it makes sense as it is the
>> object managing the TIMA and ESB memory region mapping for the machine. 
>>
>>> And for that reason, I think it makes more sense to create it with
>>> qdev_create(), which should avoid having to manually fiddle with the
>>> parent bus.
>>
>> OK. I will give it a try. 
>>
>>>> +    xive = SPAPR_XIVE(obj);
>>>> +
>>>> +    /* Enable the CPU IPIs */
>>>> +    for (i = 0; i < nr_servers; ++i) {
>>>> +        spapr_xive_irq_enable(xive, SPAPR_IRQ_IPI + i, false);
>>>
>>> This comment possibly belonged on an earlier patch.  I don't love the
>>> "..._enable" name - to me that suggests something runtime rather than
>>> configuration time.  A better option isn't quickly occurring to me
>>> though :/.
>>
>> Instead, I could call the sPAPR IRQ claim method  : 
>>
>>     for (i = 0; i < nr_servers; ++i) {
>> 	spapr_irq_xive.claim(spapr, SPAPR_IRQ_IPI + i, false, &local_err);
>>     }
>>
>>
>> What it does is to set the EAS_VALID bit in the EAT (it also sets the 
>> LSI bit). what about :
>> 	
>> 	spapr_xive_irq_validate() 
>> 	spapr_xive_irq_invalidate() 
>>
>> or to map the sPAPR IRQ backend names :
>>
>> 	spapr_xive_irq_claim() 
>> 	spapr_xive_irq_free()
> 
> Let's use claim/free to match the terms spapr already uses.

OK.

Thanks,

C.

> 
> 
>>
>>
>>>
>>>> +    }
>>>> +
>>>> +    return xive;
>>>> +}
>>>> +
>>>> +static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>>>> +                                int nr_servers, Error **errp)
>>>> +{
>>>> +    MachineState *machine = MACHINE(spapr);
>>>> +    Error *local_err = NULL;
>>>> +
>>>> +    /* KVM XIVE support */
>>>> +    if (kvm_enabled()) {
>>>> +        if (machine_kernel_irqchip_required(machine)) {
>>>> +            error_setg(errp, "kernel_irqchip requested. no XIVE support");
>>>> +            return;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    /* QEMU XIVE support */
>>>> +    spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE, nr_irqs, nr_servers,
>>>> +                                    &local_err);
>>>> +    if (local_err) {
>>>> +        error_propagate(errp, local_err);
>>>> +        return;
>>>> +    }
>>>> +}
>>>> +
>>>> +static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
>>>> +                                Error **errp)
>>>> +{
>>>> +    if (!spapr_xive_irq_enable(spapr->xive, irq, lsi)) {
>>>> +        error_setg(errp, "IRQ %d is invalid", irq);
>>>> +        return -1;
>>>> +    }
>>>> +    return 0;
>>>> +}
>>>> +
>>>> +static void spapr_irq_free_xive(sPAPRMachineState *spapr, int irq, int num)
>>>> +{
>>>> +    int i;
>>>> +
>>>> +    for (i = irq; i < irq + num; ++i) {
>>>> +        spapr_xive_irq_disable(spapr->xive, i);
>>>> +    }
>>>> +}
>>>> +
>>>> +static qemu_irq spapr_qirq_xive(sPAPRMachineState *spapr, int irq)
>>>> +{
>>>> +    return spapr_xive_qirq(spapr->xive, irq);
>>>> +}
>>>> +
>>>> +static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>>>> +                                      Monitor *mon)
>>>> +{
>>>> +    CPUState *cs;
>>>> +
>>>> +    CPU_FOREACH(cs) {
>>>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>>>> +
>>>> +        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
>>>> +    }
>>>> +
>>>> +    spapr_xive_pic_print_info(spapr->xive, mon);
>>>
>>> Any reason the info dumping routines are split into two?
>>
>> They are not the same objects. Are you suggesting that we could print all
>> the info from the sPAPR XIVE model, including the XiveTCTX? I thought of
>> doing that also. Fine with me if it's OK for you.
> 
> Ah.. I think I got xive_pic_print_info() and
> xive_tctx_pic_print_info() mixed up.  Never mind.
> 
>>
>> Thanks,
>>
>> C.
>>
>>>
>>>> +}
>>>> +
>>>> +/*
>>>> + * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>>>> + * with XICS.
>>>> + */
>>>> +
>>>> +#define SPAPR_IRQ_XIVE_NR_IRQS     0x2000
>>>> +#define SPAPR_IRQ_XIVE_NR_MSIS     (SPAPR_IRQ_XIVE_NR_IRQS - SPAPR_IRQ_MSI)
>>>> +
>>>> +sPAPRIrq spapr_irq_xive = {
>>>> +    .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
>>>> +    .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
>>>> +
>>>> +    .init        = spapr_irq_init_xive,
>>>> +    .claim       = spapr_irq_claim_xive,
>>>> +    .free        = spapr_irq_free_xive,
>>>> +    .qirq        = spapr_qirq_xive,
>>>> +    .print_info  = spapr_irq_print_info_xive,
>>>> +};
>>>> +
>>>>  /*
>>>>   * sPAPR IRQ frontend routines for devices
>>>>   */
>>>> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
>>>> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp)
>>>>  {
>>>>      sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>>>  
>>>> @@ -216,7 +329,7 @@ void spapr_irq_init(sPAPRMachineState *spapr, Error **errp)
>>>>          spapr_irq_msi_init(spapr, smc->irq->nr_msis);
>>>>      }
>>>>  
>>>> -    smc->irq->init(spapr, smc->irq->nr_irqs, errp);
>>>> +    smc->irq->init(spapr, smc->irq->nr_irqs, nr_servers, errp);
>>>>  }
>>>>  
>>>>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp)
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-29  1:23       ` David Gibson
@ 2018-11-29 16:04         ` Cédric Le Goater
  2018-11-30  1:23           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:04 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 2:23 AM, David Gibson wrote:
> On Wed, Nov 28, 2018 at 11:21:37PM +0100, Cédric Le Goater wrote:
>> On 11/28/18 5:25 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
>>>> The different XIVE virtualization structures (sources and event queues)
>>>> are configured with a set of Hypervisor calls :
>>>>
>>>>  - H_INT_GET_SOURCE_INFO
>>>>
>>>>    used to obtain the address of the MMIO page of the Event State
>>>>    Buffer (ESB) entry associated with the source.
>>>>
>>>>  - H_INT_SET_SOURCE_CONFIG
>>>>
>>>>    assigns a source to a "target".
>>>>
>>>>  - H_INT_GET_SOURCE_CONFIG
>>>>
>>>>    determines which "target" and "priority" is assigned to a source
>>>>
>>>>  - H_INT_GET_QUEUE_INFO
>>>>
>>>>    returns the address of the notification management page associated
>>>>    with the specified "target" and "priority".
>>>>
>>>>  - H_INT_SET_QUEUE_CONFIG
>>>>
>>>>    sets or resets the event queue for a given "target" and "priority".
>>>>    It is also used to set the notification configuration associated
>>>>    with the queue, only unconditional notification is supported for
>>>>    the moment. Reset is performed with a queue size of 0 and queueing
>>>>    is disabled in that case.
>>>>
>>>>  - H_INT_GET_QUEUE_CONFIG
>>>>
>>>>    returns the queue settings for a given "target" and "priority".
>>>>
>>>>  - H_INT_RESET
>>>>
>>>>    resets all of the guest's internal interrupt structures to their
>>>>    initial state, losing all configuration set via the hcalls
>>>>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
>>>>
>>>>  - H_INT_SYNC
>>>>
>>>>    issue a synchronisation on a source to make sure all notifications
>>>>    have reached their queue.
>>>>
>>>> Calls that still need to be addressed :
>>>>
>>>>    H_INT_SET_OS_REPORTING_LINE
>>>>    H_INT_GET_OS_REPORTING_LINE
>>>>
>>>> See the code for more documentation on each hcall.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/spapr.h      |  15 +-
>>>>  include/hw/ppc/spapr_xive.h |   6 +
>>>>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
>>>>  hw/ppc/spapr_irq.c          |   2 +
>>>>  hw/intc/Makefile.objs       |   2 +-
>>>>  5 files changed, 915 insertions(+), 2 deletions(-)
>>>>  create mode 100644 hw/intc/spapr_xive_hcall.c
>>>>
>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>> index 1fbc2663e06c..8415faea7b82 100644
>>>> --- a/include/hw/ppc/spapr.h
>>>> +++ b/include/hw/ppc/spapr.h
>>>> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
>>>>  #define H_INVALIDATE_PID        0x378
>>>>  #define H_REGISTER_PROC_TBL     0x37C
>>>>  #define H_SIGNAL_SYS_RESET      0x380
>>>> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
>>>> +
>>>> +#define H_INT_GET_SOURCE_INFO   0x3A8
>>>> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
>>>> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
>>>> +#define H_INT_GET_QUEUE_INFO    0x3B4
>>>> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
>>>> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
>>>> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
>>>> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
>>>> +#define H_INT_ESB               0x3C8
>>>> +#define H_INT_SYNC              0x3CC
>>>> +#define H_INT_RESET             0x3D0
>>>> +
>>>> +#define MAX_HCALL_OPCODE        H_INT_RESET
>>>>  
>>>>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>>>>   * as well.
>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>>> index 3f65b8f485fd..418511f3dc10 100644
>>>> --- a/include/hw/ppc/spapr_xive.h
>>>> +++ b/include/hw/ppc/spapr_xive.h
>>>> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>>>>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>>>>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
>>>>  
>>>> +bool spapr_xive_priority_is_valid(uint8_t priority);
>>>
>>> AFAICT this could be a local function.
>>
>> The KVM model also uses it, when collecting state from the KVM device
>> to build the QEMU ENDT.
>>
>>>> +
>>>> +typedef struct sPAPRMachineState sPAPRMachineState;
>>>> +
>>>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>>>> +
>>>>  #endif /* PPC_SPAPR_XIVE_H */
>>>> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
>>>> new file mode 100644
>>>> index 000000000000..52e4e23995f5
>>>> --- /dev/null
>>>> +++ b/hw/intc/spapr_xive_hcall.c
>>>> @@ -0,0 +1,892 @@
>>>> +/*
>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>>>> + *
>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>> + *
>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>> + * COPYING file in the top-level directory.
>>>> + */
>>>> +
>>>> +#include "qemu/osdep.h"
>>>> +#include "qemu/log.h"
>>>> +#include "qapi/error.h"
>>>> +#include "cpu.h"
>>>> +#include "hw/ppc/fdt.h"
>>>> +#include "hw/ppc/spapr.h"
>>>> +#include "hw/ppc/spapr_xive.h"
>>>> +#include "hw/ppc/xive_regs.h"
>>>> +#include "monitor/monitor.h"
>>>
>>> Fwiw, I don't think it's particularly necessary to split the hcall
>>> handling out into a separate .c file.
>>
>> OK. Let's move it to spapr_xive then? It might help reduce the
>> exported functions.
> 
> Yes, I think so.
> 
>>>> +/*
>>>> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
>>>> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
>>>> + * available for the guest.
>>>
>>> Referencing OPAL behaviour doesn't really make sense in the context of
>>> PAPR.  
>>
>> It's an OPAL constraint which pHyp doesn't have. So it's also a QEMU/KVM
>> constraint.
> 
> Right, I realized that a few patches on.  Maybe rephrase this to
> 
>    Linux hosts under OPAL reserve priority 7 for their own escalation
>    interrupts.  So we only allow the guest to use priorities [0..6].

OK.

> The point here is that we're emphasizing that this is a design
> decision to make the host implementation easier, rather than a
> fundamental constraint.
> 
>>> What I think you're getting at is that the PAPR spec only
>>> allows a PAPR guest to use priorities 0..6 (or at least it will if the
>>> XIVE updated spec ever gets published).  
>>
>> It's not in the spec. The XIVE sPAPR spec should be frozen soon, btw.
>>  
>>> The fact that this allows the
>>> host to use 7 for escalations is a design rationale
>>> but not really relevant to the guest device itself. 
>>
>> The guest should be aware of which priorities are reserved for
>> the hypervisor though.
>>
>>>> + */
>>>> +bool spapr_xive_priority_is_valid(uint8_t priority)
>>>> +{
>>>> +    switch (priority) {
>>>> +    case 0 ... 6:
>>>> +        return true;
>>>> +    case 7: /* OPAL escalation queue */
>>>> +    default:
>>>> +        return false;
>>>> +    }
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
>>>> + * real address of the MMIO page through which the Event State Buffer
>>>> + * entry associated with the value of the "lisn" parameter is managed.
>>>> + *
>>>> + * Parameters:
>>>> + * Input
>>>> + * - "flags"
>>>> + *       Bits 0-63 reserved
>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
>>>> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
>>>> + *       ibm,query-interrupt-source-number RTAS call, or as returned
>>>> + *       by the H_ALLOCATE_VAS_WINDOW hcall
>>>
>>> I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
>>> to implement in kvm/qemu, or is it only of interest for PowerVM?
>>
>> The hcall is part of the PAPR NX Interfaces and it returns interrupt
>> numbers. I don't know if any work has been done on the topic.  
> 
> What's a "PAPR NX"?

A way for the PAPR guests to access the POWER coprocessors doing 
compression and encryption. I really don't know much about this.

>>> Also, putting the register numbers on the inputs as well as the
>>> outputs would be helpful.
>>
>> yes. I will add them.
>>
>>>> + *
>>>> + * Output
>>>> + * - R4: "flags"
>>>> + *       Bits 0-59: Reserved
>>>> + *       Bit 60: H_INT_ESB must be used for Event State Buffer
>>>> + *               management
>>>> + *       Bit 61: 1 == LSI  0 == MSI
>>>> + *       Bit 62: the full function page supports trigger
>>>> + *       Bit 63: Store EOI Supported
>>>> + * - R5: Logical Real address of full function Event State Buffer
>>>> + *       management page, -1 if ESB hcall flag is set to 1.
>>>
>>> You've defined what H_INT_ESB means above, so it will be clearer if
>>> you reference that by name here.
>>
>> yes. 
>>
>>>> + * - R6: Logical Real Address of trigger only Event State Buffer
>>>> + *       management page or -1.
>>>> + * - R7: Power of 2 page size for the ESB management pages returned in
>>>> + *       R5 and R6.
>>>> + */
>>>> +
>>>> +#define SPAPR_XIVE_SRC_H_INT_ESB     PPC_BIT(60) /* ESB manage with H_INT_ESB */
>>>> +#define SPAPR_XIVE_SRC_LSI           PPC_BIT(61) /* Virtual LSI type */
>>>> +#define SPAPR_XIVE_SRC_TRIGGER       PPC_BIT(62) /* Trigger and management
>>>> +                                                    on same page */
>>>> +#define SPAPR_XIVE_SRC_STORE_EOI     PPC_BIT(63) /* Store EOI support */
>>>
>>> Probably makes sense to put these #defines in spapr.h since they form
>>> part of the PAPR interface definition.
>>
>> ok.
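
For readers mapping the "Bit 60..63" numbering in the comment above to
mask values: the flags use IBM bit numbering, where bit 0 is the most
significant bit of the 64-bit register. A minimal sketch, with SKETCH_
prefixed names standing in for the patch's macros and a helper
(sketch_source_flags) that is purely illustrative:

```c
#include <stdint.h>

/* IBM numbering: bit 0 is the MSB, so bit 63 has the value 1. */
#define SKETCH_PPC_BIT(bit)   (0x8000000000000000ULL >> (bit))

#define SKETCH_SRC_H_INT_ESB  SKETCH_PPC_BIT(60) /* manage ESB with H_INT_ESB */
#define SKETCH_SRC_LSI        SKETCH_PPC_BIT(61) /* virtual LSI */
#define SKETCH_SRC_TRIGGER    SKETCH_PPC_BIT(62) /* trigger on management page */
#define SKETCH_SRC_STORE_EOI  SKETCH_PPC_BIT(63) /* store EOI supported */

/*
 * Build the R4 "flags" output the same way the hcall above does:
 * single-page ESBs support trigger on the full function page, and
 * LSIs are forced to use H_INT_ESB.
 */
static uint64_t sketch_source_flags(int is_lsi, int two_page, int store_eoi)
{
    uint64_t flags = 0;

    if (!two_page) {
        flags |= SKETCH_SRC_TRIGGER;
    }
    if (store_eoi) {
        flags |= SKETCH_SRC_STORE_EOI;
    }
    if (is_lsi) {
        flags |= SKETCH_SRC_H_INT_ESB | SKETCH_SRC_LSI;
    }
    return flags;
}
```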
>>
>>>
>>>> +static target_ulong h_int_get_source_info(PowerPCCPU *cpu,
>>>> +                                          sPAPRMachineState *spapr,
>>>> +                                          target_ulong opcode,
>>>> +                                          target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveSource *xsrc = &xive->source;
>>>> +    XiveEAS eas;
>>>> +    target_ulong flags  = args[0];
>>>> +    target_ulong lisn   = args[1];
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (!(eas.w & EAS_VALID)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    /* All sources are emulated under the main XIVE object and share
>>>> +     * the same characteristics.
>>>> +     */
>>>> +    args[0] = 0;
>>>> +    if (!xive_source_esb_has_2page(xsrc)) {
>>>> +        args[0] |= SPAPR_XIVE_SRC_TRIGGER;
>>>> +    }
>>>> +    if (xsrc->esb_flags & XIVE_SRC_STORE_EOI) {
>>>> +        args[0] |= SPAPR_XIVE_SRC_STORE_EOI;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * Force the use of the H_INT_ESB hcall in case of an LSI
>>>> +     * interrupt. This is necessary under KVM to re-trigger the
>>>> +     * interrupt if the level is still asserted
>>>> +     */
>>>> +    if (xive_source_irq_is_lsi(xsrc, lisn)) {
>>>> +        args[0] |= SPAPR_XIVE_SRC_H_INT_ESB | SPAPR_XIVE_SRC_LSI;
>>>> +    }
>>>> +
>>>> +    if (!(args[0] & SPAPR_XIVE_SRC_H_INT_ESB)) {
>>>> +        args[1] = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn);
>>>> +    } else {
>>>> +        args[1] = -1;
>>>> +    }
>>>> +
>>>> +    if (xive_source_esb_has_2page(xsrc)) {
>>>> +        args[2] = xive->vc_base + xive_source_esb_page(xsrc, lisn);
>>>> +    } else {
>>>> +        args[2] = -1;
>>>> +    }
>>>
>>> Do we also need to keep this address clear in the H_INT_ESB case?
>>
>> I think not, but the specs are not very clear on that topic. I will
>> ask for clarification and use a -1 for now. We can not do loads on
>> the trigger page so it can not be used by the H_INT_ESB hcall.
>>
>>>
>>>> +    args[3] = TARGET_PAGE_SIZE;
>>>
>>> That seems wrong.  
>>
>> This is utterly wrong. It should be the power-of-2 exponent of the page
>> size, not the size itself. I got it right under KVM though. I guess that
>> ioremap() under Linux rounds up the size to the page size in use, which
>> is why it didn't blow up under TCG.
>>
>>> TARGET_PAGE_SIZE is generally 4kiB, but won't these usually
>>> actually be 64kiB?
>>
>> yes. So what should I use to get a PAGE_SHIFT instead ? 
> 
> Erm, that gets a bit tricky, since qemu in a sense doesn't know the
> guest's page size.
> 
> But.. don't you actually want the esb_shift here, not PAGE_SHIFT - it
> could matter for the 2 page * 64kiB variant, yes?

Yes. We just want the page shift of a single ESB page, whether there are
one or two pages. The other registers tell the guest whether one or
two ESB pages are in use.
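
A minimal sketch of what R7 could return, assuming esb_shift describes
the whole per-source ESB window (so it is one larger than the page shift
in the two-page variant). The constants and helper names below are
illustrative, not the ones from the XiveSource model:

```c
#include <stdint.h>

/* Hypothetical shift values for 64kiB ESB pages. */
#define SKETCH_ESB_64K        16  /* one 64kiB page per source */
#define SKETCH_ESB_64K_2PAGE  17  /* two 64kiB pages per source */

static int sketch_esb_has_2page(uint32_t esb_shift)
{
    return esb_shift == SKETCH_ESB_64K_2PAGE;
}

/*
 * Return the power-of-2 exponent of a single ESB page, which is what
 * the guest expects in R7, rather than a size in bytes.
 */
static uint32_t sketch_esb_page_shift(uint32_t esb_shift)
{
    return sketch_esb_has_2page(esb_shift) ? esb_shift - 1 : esb_shift;
}
```

Under these assumptions both variants report 16 to the guest, and the
guest learns from R5/R6 whether a separate trigger page exists.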


>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_SET_SOURCE_CONFIG hcall() is used to assign a Logical
>>>> + * Interrupt Source to a target. The Logical Interrupt Source is
>>>> + * designated with the "lisn" parameter and the target is designated
>>>> + * with the "target" and "priority" parameters.  Upon return from the
>>>> + * hcall(), no additional interrupts will be directed to the old EQ.
>>>> + *
>>>> + * TODO: The old EQ should be investigated for interrupts that
>>>> + * occurred prior to or during the hcall().
>>>
>>> Isn't that the responsibility of the guest?
>>
>> It should yes.
> 
> Right, so not a TODO for the qemu code.

yes

> 
>>
>>>
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-61: Reserved
>>>> + *      Bit 62: set the "eisn" in the EA
>>>
>>> What's the "EA"?  Do you mean the EAS?
>>
>> Another XIVE acronym, EA for Event Assignment. I think we can forget
>> this one and just use EAS.
>>  
>>>
>>>> + *      Bit 63: masks the interrupt source in the hardware interrupt
>>>> + *      control structure. An interrupt masked by this mechanism will
>>>> + *      be dropped, but its source state bits will still be
>>>> + *      set. There is no race-free way of unmasking and restoring the
>>>> + *      source. Thus this should only be used in interrupts that are
>>>> + *      also masked at the source, and only in cases where the
>>>> + *      interrupt is not meant to be used for a large amount of time
>>>> + *      because no valid target exists for it for example
>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
>>>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>>>> + *      ibm,query-interrupt-source-number RTAS call, or as returned by
>>>> + *      the H_ALLOCATE_VAS_WINDOW hcall
>>>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>>>> + *      "ibm,ppc-interrupt-gserver#s"
>>>> + * - "priority" is a valid priority not in
>>>> + *      "ibm,plat-res-int-priorities"
>>>> + * - "eisn" is the guest EISN associated with the "lisn"
>>>
>>> I don't think the EISN term has been used before in the series.  
>>
>> Effective Interrupt Source Number, which is the event data enqueued
>> in the OS EQ.
>>
>> I'm planning on adding some more acronyms used by the sPAPR hcalls
>> in this file. There are only a couple.
> 
> That would be helpful.
> 
>>> I'm guessing this is the guest-assigned global interrupt number?
>>
>> yes 
>>
>>>> + *
>>>> + * Output:
>>>> + * - None
>>>> + */
>>>> +
>>>> +#define SPAPR_XIVE_SRC_SET_EISN PPC_BIT(62)
>>>> +#define SPAPR_XIVE_SRC_MASK     PPC_BIT(63)
>>>> +
>>>> +static target_ulong h_int_set_source_config(PowerPCCPU *cpu,
>>>> +                                            sPAPRMachineState *spapr,
>>>> +                                            target_ulong opcode,
>>>> +                                            target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>>>> +    XiveEAS eas, new_eas;
>>>> +    target_ulong flags    = args[0];
>>>> +    target_ulong lisn     = args[1];
>>>> +    target_ulong target   = args[2];
>>>> +    target_ulong priority = args[3];
>>>> +    target_ulong eisn     = args[4];
>>>> +    uint8_t end_blk;
>>>> +    uint32_t end_idx;
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags & ~(SPAPR_XIVE_SRC_SET_EISN | SPAPR_XIVE_SRC_MASK)) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (!(eas.w & EAS_VALID)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    /* priority 0xff is used to reset the EAS */
>>>> +    if (priority == 0xff) {
>>>> +        new_eas.w = EAS_VALID | EAS_MASKED;
>>>> +        goto out;
>>>> +    }
>>>> +
>>>> +    if (flags & SPAPR_XIVE_SRC_MASK) {
>>>> +        new_eas.w = eas.w | EAS_MASKED;
>>>> +    } else {
>>>> +        new_eas.w = eas.w & ~EAS_MASKED;
>>>> +    }
>>>> +
>>>> +    if (!spapr_xive_priority_is_valid(priority)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>>>> +                      priority);
>>>> +        return H_P4;
>>>> +    }
>>>> +
>>>> +    /* Validate that "target" is part of the list of threads allocated
>>>> +     * to the partition. For that, find the END corresponding to the
>>>> +     * target.
>>>> +     */
>>>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>>>> +        return H_P3;
>>>> +    }
>>>> +
>>>> +    new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
>>>> +    new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
>>>> +
>>>> +    if (flags & SPAPR_XIVE_SRC_SET_EISN) {
>>>> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
>>>> +    }
>>>> +
>>>> +out:
>>>> +    if (xive_router_set_eas(xrtr, lisn, &new_eas)) {
>>>> +        return H_HARDWARE;
>>>> +    }
>>>
>>> As noted earlier in the series, the spapr specific code owns the
>>> memory backing the EAT, so you can just access it directly rather than
>>> using a method here.
>>
>> Yes. I will give it a try. I wonder if I need accessors for the tables?
> 
> You'll still need the read accessor since the routing core uses that.
> I don't think you need a write accessor though.
> 
>>
>>>
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_GET_SOURCE_CONFIG hcall() is used to determine which
>>>> + * target/priority pair is assigned to the specified Logical Interrupt
>>>> + * Source.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-63 Reserved
>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
>>>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>>>> + *      ibm,query-interrupt-source-number RTAS call, or as
>>>> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
>>>> + *
>>>> + * Output:
>>>> + * - R4: Target to which the specified Logical Interrupt Source is
>>>> + *       assigned
>>>> + * - R5: Priority to which the specified Logical Interrupt Source is
>>>> + *       assigned
>>>> + * - R6: EISN for the specified Logical Interrupt Source (this will be
>>>> + *       equivalent to the LISN if not changed by H_INT_SET_SOURCE_CONFIG)
>>>> + */
>>>> +static target_ulong h_int_get_source_config(PowerPCCPU *cpu,
>>>> +                                            sPAPRMachineState *spapr,
>>>> +                                            target_ulong opcode,
>>>> +                                            target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>>>> +    target_ulong flags = args[0];
>>>> +    target_ulong lisn = args[1];
>>>> +    XiveEAS eas;
>>>> +    XiveEND end;
>>>> +    uint8_t end_blk, nvt_blk;
>>>> +    uint32_t end_idx, nvt_idx;
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_eas(xrtr, lisn, &eas)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (!(eas.w & EAS_VALID)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    end_blk = GETFIELD(EAS_END_BLOCK, eas.w);
>>>> +    end_idx = GETFIELD(EAS_END_INDEX, eas.w);
>>>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>>>> +        /* Not sure what to return here */
>>>> +        return H_HARDWARE;
>>>
>>> IIUC this indicates a bug in the PAPR specific code, not the guest, so
>>> an assert() is probably the right answer.
>>
>> ok
>>
>>>> +    }
>>>> +
>>>> +    nvt_blk = GETFIELD(END_W6_NVT_BLOCK, end.w6);
>>>> +    nvt_idx = GETFIELD(END_W6_NVT_INDEX, end.w6);
>>>> +    args[0] = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
>>>
>>> AIUI there's a specific END for each target & priority, so you could
>>> avoid this second level lookup, 
>>
>> yes 
>>
>>> although I guess this might be
>>> valuable if we do more complicated internal routing in the future.
>>
>> I am not sure of that but I'd rather keep these converting helpers
>> for the moment.
> 
> Ok.
> 
>>>> +    if (eas.w & EAS_MASKED) {
>>>> +        args[1] = 0xff;
>>>> +    } else {
>>>> +        args[1] = GETFIELD(END_W7_F0_PRIORITY, end.w7);
>>>> +    }
>>>> +
>>>> +    args[2] = GETFIELD(EAS_END_DATA, eas.w);
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_GET_QUEUE_INFO hcall() is used to get the logical real
>>>> + * address of the notification management page associated with the
>>>> + * specified target and priority.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *       Bits 0-63 Reserved
>>>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>>>> + *       "ibm,ppc-interrupt-gserver#s"
>>>> + * - "priority" is a valid priority not in
>>>> + *       "ibm,plat-res-int-priorities"
>>>> + *
>>>> + * Output:
>>>> + * - R4: Logical real address of notification page
>>>> + * - R5: Power of 2 page size of the notification page
>>>> + */
>>>> +static target_ulong h_int_get_queue_info(PowerPCCPU *cpu,
>>>> +                                         sPAPRMachineState *spapr,
>>>> +                                         target_ulong opcode,
>>>> +                                         target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveENDSource *end_xsrc = &xive->end_source;
>>>> +    target_ulong flags = args[0];
>>>> +    target_ulong target = args[1];
>>>> +    target_ulong priority = args[2];
>>>> +    XiveEND end;
>>>> +    uint8_t end_blk;
>>>> +    uint32_t end_idx;
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>>>> +     * This is not needed when running the emulation under QEMU
>>>> +     */
>>>> +
>>>> +    if (!spapr_xive_priority_is_valid(priority)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>>>> +                      priority);
>>>> +        return H_P3;
>>>> +    }
>>>> +
>>>> +    /* Validate that "target" is part of the list of threads allocated
>>>> +     * to the partition. For that, find the END corresponding to the
>>>> +     * target.
>>>> +     */
>>>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
>>>> +        return H_HARDWARE;
>>>> +    }
>>>> +
>>>> +    args[0] = xive->end_base + (1ull << (end_xsrc->esb_shift + 1)) * end_idx;
>>>> +    if (end.w0 & END_W0_ENQUEUE) {
>>>> +        args[1] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
>>>> +    } else {
>>>> +        args[1] = 0;
>>>> +    }
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_SET_QUEUE_CONFIG hcall() is used to set or reset an EQ for
>>>> + * a given "target" and "priority".  It is also used to set the
>>>> + * notification config associated with the EQ.  An EQ size of 0 is
>>>> + * used to reset the EQ config for a given target and priority. If
>>>> + * resetting the EQ config, the END associated with the given "target"
>>>> + * and "priority" will be changed to disable queueing.
>>>> + *
>>>> + * Upon return from the hcall(), no additional interrupts will be
>>>> + * directed to the old EQ (if one was set). The old EQ (if one was
>>>> + * set) should be investigated for interrupts that occurred prior to
>>>> + * or during the hcall().
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-62: Reserved
>>>> + *      Bit 63: Unconditional Notify (n) per the XIVE spec
>>>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>>>> + *       "ibm,ppc-interrupt-gserver#s"
>>>> + * - "priority" is a valid priority not in
>>>> + *       "ibm,plat-res-int-priorities"
>>>> + * - "eventQueue": The logical real address of the start of the EQ
>>>> + * - "eventQueueSize": The power of 2 EQ size per "ibm,xive-eq-sizes"
>>>> + *
>>>> + * Output:
>>>> + * - None
>>>> + */
>>>> +
>>>> +#define SPAPR_XIVE_END_ALWAYS_NOTIFY PPC_BIT(63)
>>>> +
>>>> +static target_ulong h_int_set_queue_config(PowerPCCPU *cpu,
>>>> +                                           sPAPRMachineState *spapr,
>>>> +                                           target_ulong opcode,
>>>> +                                           target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>>>> +    target_ulong flags = args[0];
>>>> +    target_ulong target = args[1];
>>>> +    target_ulong priority = args[2];
>>>> +    target_ulong qpage = args[3];
>>>> +    target_ulong qsize = args[4];
>>>> +    XiveEND end;
>>>> +    uint8_t end_blk, nvt_blk;
>>>> +    uint32_t end_idx, nvt_idx;
>>>> +    uint32_t qdata;
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags & ~SPAPR_XIVE_END_ALWAYS_NOTIFY) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>>>> +     * This is not needed when running the emulation under QEMU
>>>> +     */
>>>> +
>>>> +    if (!spapr_xive_priority_is_valid(priority)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>>>> +                      priority);
>>>> +        return H_P3;
>>>> +    }
>>>> +
>>>> +    /* Validate that "target" is part of the list of threads allocated
>>>> +     * to the partition. For that, find the END corresponding to the
>>>> +     * target.
>>>> +     */
>>>> +
>>>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>>>> +        return H_HARDWARE;
>>>
>>> Again, I think this indicates a qemu (spapr) code bug, so could be an assert().
>>
>> ok
>>
>>>
>>>> +    }
>>>> +
>>>> +    switch (qsize) {
>>>> +    case 12:
>>>> +    case 16:
>>>> +    case 21:
>>>> +    case 24:
>>>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
>>>
>>> It just occurred to me that I haven't been looking for this across any
>>> of these reviews.  Don't you need byteswaps when accessing these
>>> in-memory structures?
>>
>> Yes, this is done when event data is enqueued in the EQ.
> 
> I'm not talking about the data in the EQ itself, but the fields in the
> END (and the NVT).

XIVE is all BE.
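
To illustrate the byteswap question: if the END words are kept in
big-endian byte order in memory, every access from a little-endian host
needs a conversion. The sketch below is host-independent and uses
hypothetical helper names; in QEMU the equivalent conversions would be
done with cpu_to_be32()/be32_to_cpu(). Whether the model in this series
stores its END words this way is an assumption here:

```c
#include <stdint.h>

/* Store a 32-bit value in big-endian byte order, whatever the host is. */
static void sketch_store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/* Load a 32-bit big-endian value into host order. */
static uint32_t sketch_load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Split the EQ address across two END words as the hcall does:
 * the upper bits (masked to 28 bits) go to w2, the lower 32 bits to w3.
 */
static void sketch_end_set_qpage(uint8_t *w2, uint8_t *w3, uint64_t qpage)
{
    sketch_store_be32(w2, (uint32_t)(qpage >> 32) & 0x0fffffff);
    sketch_store_be32(w3, (uint32_t)(qpage & 0xffffffff));
}
```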

> 
>>
>>>
>>>> +        end.w2 = (((uint64_t)qpage)) >> 32 & 0x0fffffff;
>>>> +        end.w0 |= END_W0_ENQUEUE;
>>>> +        end.w0 = SETFIELD(END_W0_QSIZE, end.w0, qsize - 12);
>>>> +        break;
>>>> +    case 0:
>>>> +        /* reset queue and disable queueing */
>>>> +        xive_end_reset(&end);
>>>> +        goto out;
>>>> +
>>>> +    default:
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid EQ size %"PRIx64"\n",
>>>> +                      qsize);
>>>> +        return H_P5;
>>>> +    }
>>>> +
>>>> +    if (qsize) {
>>>> +        /*
>>>> +         * Let's validate the EQ address with a read of the first EQ
>>>> +         * entry. We could also check that the full queue has been
>>>> +         * zeroed by the OS.
>>>> +         */
>>>> +        if (address_space_read(&address_space_memory, qpage,
>>>> +                               MEMTXATTRS_UNSPECIFIED,
>>>> +                               (uint8_t *) &qdata, sizeof(qdata))) {
>>>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to read EQ data @0x%"
>>>> +                          HWADDR_PRIx "\n", qpage);
>>>> +            return H_P4;
>>>
>>> Just checking the first entry doesn't seem entirely safe.  Using
>>> address_space_map() and making sure the returned plen doesn't get
>>> reduced below the queue size might be a better option.
>>
>> ok. That was on my todo list.
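
A rough sketch of the whole-range check suggested above, under stated
assumptions: fake_map() is a mock that only imitates the relevant
behaviour of address_space_map(), namely shrinking the returned length
when the range runs past the end of the backing memory. RAM_BASE,
RAM_SIZE, fake_ram and eq_range_is_valid() are all hypothetical names
for illustration. Real code would call address_space_map() and release
the mapping with address_space_unmap().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest RAM window: [RAM_BASE, RAM_BASE + RAM_SIZE). */
#define RAM_BASE 0x10000ull
#define RAM_SIZE 0x20000ull

static uint8_t fake_ram[RAM_SIZE];

/*
 * Mock of a mapping call: returns NULL when addr is outside RAM and
 * reduces *plen when the requested range crosses the end of RAM.
 */
static void *fake_map(uint64_t addr, uint64_t *plen)
{
    if (addr < RAM_BASE || addr >= RAM_BASE + RAM_SIZE) {
        return NULL;
    }

    uint64_t avail = RAM_BASE + RAM_SIZE - addr;
    if (*plen > avail) {
        *plen = avail;
    }
    return &fake_ram[addr - RAM_BASE];
}

/* Accept the EQ only if the full 2^qshift bytes could be mapped. */
static bool eq_range_is_valid(uint64_t qpage, unsigned qshift)
{
    uint64_t len = 1ull << qshift;
    uint64_t plen = len;
    void *p = fake_map(qpage, &plen);

    return p != NULL && plen == len;
}
```

A 64K queue placed 32K before the end of the mock RAM is rejected here,
whereas a read of the first entry alone would have succeeded.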
>>
>>>
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (spapr_xive_target_to_nvt(xive, target, &nvt_blk, &nvt_idx)) {
>>>> +        return H_HARDWARE;
>>>
>>> That could be caused by a bogus 'target' value, couldn't it?  
>>
>> yes. It should have returned H_P2 above when spapr_xive_target_to_end() 
>> is called.
>>
>>> In which
>>> case it a) should probably be checked earlier and b) should be
>>> H_PARAMETER or similar, not H_HARDWARE, yes?
>>
>> H_P2 again, maybe. It should be checked earlier.
>>
>>>
>>>> +    }
>>>> +
>>>> +    /* Ensure the priority and target are correctly set (they will not
>>>> +     * be right after allocation)
>>>
>>> AIUI there's a static association from END to target in the PAPR
>>> model. 
>>
>> yes. 8 priorities per cpu.
>>
>>> So it seems to make more sense to get that set up right at
>>> initialization / reset, rather than doing it lazily when the 
>>> queue is configured.
>>
>> Ah. You would preconfigure the word6 and word7 then. Yes, it would
>> save us some of the conversion fuss. I will look at it.
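
A minimal sketch of what preconfiguring the target and priority at
reset could look like. It assumes a table laid out with 8 consecutive
ENDs per vCPU and an NVT index equal to the vCPU id; both the layout
and the names sketch_end, PRIOS_PER_CPU and sketch_end_table_reset()
are illustrative only. Plain struct fields stand in for the
END_W6_NVT_INDEX and END_W7_F0_PRIORITY bit fields of the real XiveEND.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PRIOS_PER_CPU 8u

/* Simplified END: plain fields instead of the w6/w7 bit fields. */
typedef struct {
    uint32_t nvt_idx;   /* stands in for END_W6_NVT_INDEX */
    uint8_t  priority;  /* stands in for END_W7_F0_PRIORITY */
    uint8_t  valid;     /* stands in for END_W0_VALID */
} sketch_end;

/*
 * Fill in the static END -> (target, priority) association once, so
 * that configuring a queue later only touches the queue fields.
 */
static void sketch_end_table_reset(sketch_end *endt, uint32_t nr_cpus)
{
    assert(endt != NULL);

    for (uint32_t cpu = 0; cpu < nr_cpus; cpu++) {
        for (uint32_t prio = 0; prio < PRIOS_PER_CPU; prio++) {
            sketch_end *end = &endt[cpu * PRIOS_PER_CPU + prio];

            end->nvt_idx = cpu;            /* assumption: NVT idx == vCPU id */
            end->priority = (uint8_t)prio;
            end->valid = 0;                /* queue not configured yet */
        }
    }
}
```

With the association fixed at reset, the set-queue-config path would no
longer need the target-to-NVT conversion.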
>>
>>>> +     */
>>>> +    end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
>>>> +        SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
>>>> +    end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, priority);
>>>> +
>>>> +    if (flags & SPAPR_XIVE_END_ALWAYS_NOTIFY) {
>>>> +        end.w0 |= END_W0_UCOND_NOTIFY;
>>>> +    } else {
>>>> +        end.w0 &= ~END_W0_UCOND_NOTIFY;
>>>> +    }
>>>> +
>>>> +    /* The generation bit for the END starts at 1 and The END page
>>>> +     * offset counter starts at 0.
>>>> +     */
>>>> +    end.w1 = END_W1_GENERATION | SETFIELD(END_W1_PAGE_OFF, 0ul, 0ul);
>>>> +    end.w0 |= END_W0_VALID;
>>>> +
>>>> +    /* TODO: issue syncs required to ensure all in-flight interrupts
>>>> +     * are complete on the old END */
>>>> +out:
>>>> +    /* Update END */
>>>> +    if (xive_router_set_end(xrtr, end_blk, end_idx, &end)) {
>>>> +        return H_HARDWARE;
>>>> +    }
>>>
>>> Again the PAPR code owns the ENDs, so it can update them directly
>>> rather than going through an abstraction.
>>
>> ok.
>>
>>>
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_GET_QUEUE_CONFIG hcall() is used to get a EQ for a given
>>>> + * target and priority.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-62: Reserved
>>>> + *      Bit 63: Debug: Return debug data
>>>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>>>> + *       "ibm,ppc-interrupt-gserver#s"
>>>> + * - "priority" is a valid priority not in
>>>> + *       "ibm,plat-res-int-priorities"
>>>> + *
>>>> + * Output:
>>>> + * - R4: "flags":
>>>> + *       Bits 0-61: Reserved
>>>> + *       Bit 62: The value of Event Queue Generation Number (g) per
>>>> + *              the XIVE spec if "Debug" = 1
>>>> + *       Bit 63: The value of Unconditional Notify (n) per the XIVE spec
>>>> + * - R5: The logical real address of the start of the EQ
>>>> + * - R6: The power of 2 EQ size per "ibm,xive-eq-sizes"
>>>> + * - R7: The value of Event Queue Offset Counter per XIVE spec
>>>> + *       if "Debug" = 1, else 0
>>>> + *
>>>> + */
>>>> +
>>>> +#define SPAPR_XIVE_END_DEBUG     PPC_BIT(63)
>>>> +
>>>> +static target_ulong h_int_get_queue_config(PowerPCCPU *cpu,
>>>> +                                           sPAPRMachineState *spapr,
>>>> +                                           target_ulong opcode,
>>>> +                                           target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    target_ulong flags = args[0];
>>>> +    target_ulong target = args[1];
>>>> +    target_ulong priority = args[2];
>>>> +    XiveEND end;
>>>> +    uint8_t end_blk;
>>>> +    uint32_t end_idx;
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags & ~SPAPR_XIVE_END_DEBUG) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>>>> +     * This is not needed when running the emulation under QEMU
>>>> +     */
>>>> +
>>>> +    if (!spapr_xive_priority_is_valid(priority)) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid priority %ld requested\n",
>>>> +                      priority);
>>>> +        return H_P3;
>>>> +    }
>>>> +
>>>> +    /* Validate that "target" is part of the list of threads allocated
>>>> +     * to the partition. For that, find the END corresponding to the
>>>> +     * target.
>>>> +     */
>>>> +    if (spapr_xive_target_to_end(xive, target, priority, &end_blk, &end_idx)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_end(XIVE_ROUTER(xive), end_blk, end_idx, &end)) {
>>>> +        return H_HARDWARE;
>>>
>>> Again, assert() seems appropriate here.
>>
>> ok
>>
>>>
>>>> +    }
>>>> +
>>>> +    args[0] = 0;
>>>> +    if (end.w0 & END_W0_UCOND_NOTIFY) {
>>>> +        args[0] |= SPAPR_XIVE_END_ALWAYS_NOTIFY;
>>>> +    }
>>>> +
>>>> +    if (end.w0 & END_W0_ENQUEUE) {
>>>> +        args[1] =
>>>> +            (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
>>>> +        args[2] = GETFIELD(END_W0_QSIZE, end.w0) + 12;
>>>> +    } else {
>>>> +        args[1] = 0;
>>>> +        args[2] = 0;
>>>> +    }
>>>> +
>>>> +    /* TODO: do we need any locking on the END ? */
>>>> +    if (flags & SPAPR_XIVE_END_DEBUG) {
>>>> +        /* Load the event queue generation number into the return flags */
>>>> +        args[0] |= (uint64_t)GETFIELD(END_W1_GENERATION, end.w1) << 62;
>>>> +
>>>> +        /* Load R7 with the event queue offset counter */
>>>> +        args[3] = GETFIELD(END_W1_PAGE_OFF, end.w1);
>>>> +    } else {
>>>> +        args[3] = 0;
>>>> +    }
>>>> +
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_SET_OS_REPORTING_LINE hcall() is used to set the
>>>> + * reporting cache line pair for the calling thread.  The reporting
>>>> + * cache lines will contain the OS interrupt context when the OS
>>>> + * issues a CI store byte to @TIMA+0xC10 to acknowledge the OS
>>>> + * interrupt. The reporting cache lines can be reset by inputting -1
>>>> + * in "reportingLine".  Issuing the CI store byte without reporting
>>>> + * cache lines registered will result in the data not being accessible
>>>> + * to the OS.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-63: Reserved
>>>> + * - "reportingLine": The logical real address of the reporting cache
>>>> + *    line pair
>>>> + *
>>>> + * Output:
>>>> + * - None
>>>> + */
>>>> +static target_ulong h_int_set_os_reporting_line(PowerPCCPU *cpu,
>>>> +                                                sPAPRMachineState *spapr,
>>>> +                                                target_ulong opcode,
>>>> +                                                target_ulong *args)
>>>> +{
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>>>> +     * This is not needed when running the emulation under QEMU
>>>> +     */
>>>> +
>>>> +    /* TODO: H_INT_SET_OS_REPORTING_LINE */
>>>> +    return H_FUNCTION;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_GET_OS_REPORTING_LINE hcall() is used to get the logical
>>>> + * real address of the reporting cache line pair set for the input
>>>> + * "target".  If no reporting cache line pair has been set, -1 is
>>>> + * returned.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-63: Reserved
>>>> + * - "target" is per "ibm,ppc-interrupt-server#s" or
>>>> + *       "ibm,ppc-interrupt-gserver#s"
>>>> + * - "reportingLine": The logical real address of the reporting cache
>>>> + *   line pair
>>>> + *
>>>> + * Output:
>>>> + * - R4: The logical real address of the reporting line if set, else -1
>>>> + */
>>>> +static target_ulong h_int_get_os_reporting_line(PowerPCCPU *cpu,
>>>> +                                                sPAPRMachineState *spapr,
>>>> +                                                target_ulong opcode,
>>>> +                                                target_ulong *args)
>>>> +{
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>>>> +     * This is not needed when running the emulation under QEMU
>>>> +     */
>>>> +
>>>> +    /* TODO: H_INT_GET_OS_REPORTING_LINE */
>>>> +    return H_FUNCTION;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_ESB hcall() is used to issue a load or store to the ESB
>>>> + * page for the input "lisn".  This hcall is only supported for LISNs
>>>> + * that have the ESB hcall flag set to 1 when returned from hcall()
>>>> + * H_INT_GET_SOURCE_INFO.
>>>
>>> Is there a reason for specifically restricting this to LISNs which
>>> advertise it, rather than allowing it for anything? 
>>
>> It's in the specs but I did not implement the check. So H_INT_ESB can be 
>> used today by the OS for any interrupt number. Same under KVM.
>>
>> But I should say so somewhere.
>>
>>> Obviously using
>>> the direct MMIOs will generally be a faster option when possible, but
>>> I could see occasions where it might be simpler for the guest to
>>> always use H_INT_ESB (e.g. for micro-guests like kvm-unit-tests).
>>
>> Can't you use direct loads and stores in these guests? I haven't
>> looked at how they are implemented.
> 
> It's not that you can't, but that might involve setting up mappings
> and so forth which could be more trouble than using an hcall.  At the
> very least they'll also need H_INT_ESB support for the irqs that
> require it, so allowing it for everything avoids one code variant.

ok. All good then.
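
For reference, the flag handling of the hcall boils down to something
like the sketch below. It is a toy under stated assumptions: the ESB
page is a plain array (real ESB loads and stores have side effects that
are not modelled), the return codes are local enum values rather than
the PAPR ones, and every SKETCH_/sketch_ name is made up. The bound and
alignment checks are deliberately stricter than the quoted hunk so the
8-byte mock access stays inside the array.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the most significant bit. */
#define SKETCH_PPC_BIT(bit)  (0x8000000000000000ULL >> (bit))
#define SKETCH_ESB_STORE     SKETCH_PPC_BIT(63)

/* Local return codes for the sketch, not the PAPR values. */
enum sketch_rc { SK_SUCCESS, SK_PARAMETER, SK_P3 };

#define SKETCH_ESB_SIZE 0x1000u          /* hypothetical page size */
static uint64_t sketch_esb[SKETCH_ESB_SIZE / 8];

static enum sketch_rc sketch_h_int_esb(uint64_t flags, uint64_t offset,
                                       uint64_t data, uint64_t *ret)
{
    /* Only the Store flag (bit 63) may be set. */
    if (flags & ~SKETCH_ESB_STORE) {
        return SK_PARAMETER;
    }

    /* Keep the 8-byte access inside the mock page. */
    if (offset > SKETCH_ESB_SIZE - 8 || (offset & 7)) {
        return SK_P3;
    }

    if (flags & SKETCH_ESB_STORE) {
        sketch_esb[offset / 8] = data;
        *ret = UINT64_MAX;               /* -1 on a store */
    } else {
        *ret = sketch_esb[offset / 8];   /* loaded value */
    }
    return SK_SUCCESS;
}
```

A guest that always goes through the hcall needs only this one entry
point for both loads and stores, which is the code-variant saving
mentioned above.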

Thanks,

C.

> 
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-62: Reserved
>>>> + *      bit 63: Store: Store=1, store operation, else load operation
>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
>>>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>>>> + *      ibm,query-interrupt-source-number RTAS call, or as
>>>> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
>>>> + * - "esbOffset" is the offset into the ESB page for the load or store operation
>>>> + * - "storeData" is the data to write for a store operation
>>>> + *
>>>> + * Output:
>>>> + * - R4: R4: The value of the load if load operation, else -1
>>>> + */
>>>> +
>>>> +#define SPAPR_XIVE_ESB_STORE PPC_BIT(63)
>>>> +
>>>> +static target_ulong h_int_esb(PowerPCCPU *cpu,
>>>> +                              sPAPRMachineState *spapr,
>>>> +                              target_ulong opcode,
>>>> +                              target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveEAS eas;
>>>> +    target_ulong flags  = args[0];
>>>> +    target_ulong lisn   = args[1];
>>>> +    target_ulong offset = args[2];
>>>> +    target_ulong data   = args[3];
>>>> +    hwaddr mmio_addr;
>>>> +    XiveSource *xsrc = &xive->source;
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags & ~SPAPR_XIVE_ESB_STORE) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (!(eas.w & EAS_VALID)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (offset > (1ull << xsrc->esb_shift)) {
>>>> +        return H_P3;
>>>> +    }
>>>> +
>>>> +    mmio_addr = xive->vc_base + xive_source_esb_mgmt(xsrc, lisn) + offset;
>>>> +
>>>> +    if (dma_memory_rw(&address_space_memory, mmio_addr, &data, 8,
>>>> +                      (flags & SPAPR_XIVE_ESB_STORE))) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: failed to access ESB @0x%"
>>>> +                      HWADDR_PRIx "\n", mmio_addr);
>>>> +        return H_HARDWARE;
>>>> +    }
>>>> +    args[0] = (flags & SPAPR_XIVE_ESB_STORE) ? -1 : data;
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_SYNC hcall() is used to issue hardware syncs that will
>>>> + * ensure any in flight events for the input lisn are in the event
>>>> + * queue.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-63: Reserved
>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
>>>> + *      "ibm,xive-lisn-ranges" properties, or as returned by the
>>>> + *      ibm,query-interrupt-source-number RTAS call, or as
>>>> + *      returned by the H_ALLOCATE_VAS_WINDOW hcall
>>>> + *
>>>> + * Output:
>>>> + * - None
>>>> + */
>>>> +static target_ulong h_int_sync(PowerPCCPU *cpu,
>>>> +                               sPAPRMachineState *spapr,
>>>> +                               target_ulong opcode,
>>>> +                               target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    XiveEAS eas;
>>>> +    target_ulong flags = args[0];
>>>> +    target_ulong lisn = args[1];
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    if (xive_router_get_eas(XIVE_ROUTER(xive), lisn, &eas)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    if (!(eas.w & EAS_VALID)) {
>>>> +        return H_P2;
>>>> +    }
>>>> +
>>>> +    /*
>>>> +     * H_STATE should be returned if a H_INT_RESET is in progress.
>>>> +     * This is not needed when running the emulation under QEMU
>>>> +     */
>>>> +
>>>> +    /* This is not real hardware. Nothing to be done */
>>>
>>> At least, not as long as all the XIVE operations are under the BQL.
>>
>> yes.
>>
>>>
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +/*
>>>> + * The H_INT_RESET hcall() is used to reset all of the partition's
>>>> + * interrupt exploitation structures to their initial state.  This
>>>> + * means losing all previously set interrupt state set via
>>>> + * H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
>>>> + *
>>>> + * Parameters:
>>>> + * Input:
>>>> + * - "flags"
>>>> + *      Bits 0-63: Reserved
>>>> + *
>>>> + * Output:
>>>> + * - None
>>>> + */
>>>> +static target_ulong h_int_reset(PowerPCCPU *cpu,
>>>> +                                sPAPRMachineState *spapr,
>>>> +                                target_ulong opcode,
>>>> +                                target_ulong *args)
>>>> +{
>>>> +    sPAPRXive *xive = spapr->xive;
>>>> +    target_ulong flags   = args[0];
>>>> +
>>>> +    if (!spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>>>> +        return H_FUNCTION;
>>>> +    }
>>>> +
>>>> +    if (flags) {
>>>> +        return H_PARAMETER;
>>>> +    }
>>>> +
>>>> +    device_reset(DEVICE(xive));
>>>> +    return H_SUCCESS;
>>>> +}
>>>> +
>>>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr)
>>>> +{
>>>> +    spapr_register_hypercall(H_INT_GET_SOURCE_INFO, h_int_get_source_info);
>>>> +    spapr_register_hypercall(H_INT_SET_SOURCE_CONFIG, h_int_set_source_config);
>>>> +    spapr_register_hypercall(H_INT_GET_SOURCE_CONFIG, h_int_get_source_config);
>>>> +    spapr_register_hypercall(H_INT_GET_QUEUE_INFO, h_int_get_queue_info);
>>>> +    spapr_register_hypercall(H_INT_SET_QUEUE_CONFIG, h_int_set_queue_config);
>>>> +    spapr_register_hypercall(H_INT_GET_QUEUE_CONFIG, h_int_get_queue_config);
>>>> +    spapr_register_hypercall(H_INT_SET_OS_REPORTING_LINE,
>>>> +                             h_int_set_os_reporting_line);
>>>> +    spapr_register_hypercall(H_INT_GET_OS_REPORTING_LINE,
>>>> +                             h_int_get_os_reporting_line);
>>>> +    spapr_register_hypercall(H_INT_ESB, h_int_esb);
>>>> +    spapr_register_hypercall(H_INT_SYNC, h_int_sync);
>>>> +    spapr_register_hypercall(H_INT_RESET, h_int_reset);
>>>> +}
>>>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>>>> index 2569ae1bc7f8..da6fcfaa3c52 100644
>>>> --- a/hw/ppc/spapr_irq.c
>>>> +++ b/hw/ppc/spapr_irq.c
>>>> @@ -258,6 +258,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>>>>          error_propagate(errp, local_err);
>>>>          return;
>>>>      }
>>>> +
>>>> +    spapr_xive_hcall_init(spapr);
>>>>  }
>>>>  
>>>>  static int spapr_irq_claim_xive(sPAPRMachineState *spapr, int irq, bool lsi,
>>>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>>>> index 301a8e972d91..eacd26836ebf 100644
>>>> --- a/hw/intc/Makefile.objs
>>>> +++ b/hw/intc/Makefile.objs
>>>> @@ -38,7 +38,7 @@ obj-$(CONFIG_XICS) += xics.o
>>>>  obj-$(CONFIG_XICS_SPAPR) += xics_spapr.o
>>>>  obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>>>  obj-$(CONFIG_XIVE) += xive.o
>>>> -obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o
>>>> +obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
>>>>  obj-$(CONFIG_POWERNV) += xics_pnv.o
>>>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models
  2018-11-29  2:59       ` David Gibson
@ 2018-11-29 16:06         ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:06 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 3:59 AM, David Gibson wrote:
> On Wed, Nov 28, 2018 at 11:38:50PM +0100, Cédric Le Goater wrote:
>> On 11/28/18 6:13 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:57:13AM +0100, Cédric Le Goater wrote:
>>>> The XIVE models for the QEMU and KVM accelerators will have a lot in
>>>> common. Introduce an abstract class for the source, the thread context
>>>> and the interrupt controller object to handle the differences in the
>>>> object initialization. These classes will also be used to define state
>>>> synchronization handlers for the monitor and migration usage.
>>>>
>>>> This is very much like the XICS models.
>>>
>>> Yeah.. so I know it's my code, but in hindsight I think making
>>> separate subclasses for TCG and KVM was a mistake.  The distinction
>>> between emulated and KVM version is supposed to be invisible to both
>>> guest and (almost) to user, whereas a subclass usually indicates a
>>> visibly different device.
>>
>> So how do you want to model the KVM part? With a single object and
>> kvm_enabled() sections?
> 
> Basically, yes. 

OK. Let's take that path then. It should improve code readability. 

> In practice I think you probably want a helper called
> xive_is_kvm() or something, which would evaluate to (kvm_enabled() &&
> kvm_irqchip_in_kernel()).

yes.
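
Something along these lines, as a standalone sketch. The two stub_*
functions and their backing variables are stand-ins so the example
compiles on its own; in QEMU they would be the real kvm_enabled() and
kvm_irqchip_in_kernel(). sketch_xive_backend() is a hypothetical caller
showing a single object type branching on the helper instead of using
separate TCG and KVM subclasses.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Stand-ins for QEMU's kvm_enabled() and kvm_irqchip_in_kernel(). */
static bool stub_kvm_enabled_val;
static bool stub_irqchip_val;

static bool stub_kvm_enabled(void)
{
    return stub_kvm_enabled_val;
}

static bool stub_kvm_irqchip_in_kernel(void)
{
    return stub_irqchip_val;
}

/* The helper suggested in the review. */
static bool xive_is_kvm(void)
{
    return stub_kvm_enabled() && stub_kvm_irqchip_in_kernel();
}

/* One code path for both cases, branching only where they differ. */
static const char *sketch_xive_backend(void)
{
    return xive_is_kvm() ? "kvm" : "emulated";
}
```

KVM with a userspace irqchip falls back to the emulated path, which is
why the helper tests both conditions.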

Thanks,

C.

> 
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/spapr_xive.h |  15 +++++
>>>>  include/hw/ppc/xive.h       |  30 ++++++++++
>>>>  hw/intc/spapr_xive.c        |  86 +++++++++++++++++++---------
>>>>  hw/intc/xive.c              | 109 +++++++++++++++++++++++++-----------
>>>>  hw/ppc/spapr_irq.c          |   4 +-
>>>>  5 files changed, 182 insertions(+), 62 deletions(-)
>>>>
>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>>> index 5b3fab192d41..aca2969a09ab 100644
>>>> --- a/include/hw/ppc/spapr_xive.h
>>>> +++ b/include/hw/ppc/spapr_xive.h
>>>> @@ -13,6 +13,10 @@
>>>>  #include "hw/sysbus.h"
>>>>  #include "hw/ppc/xive.h"
>>>>  
>>>> +#define TYPE_SPAPR_XIVE_BASE "spapr-xive-base"
>>>> +#define SPAPR_XIVE_BASE(obj) \
>>>> +    OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE_BASE)
>>>> +
>>>>  #define TYPE_SPAPR_XIVE "spapr-xive"
>>>>  #define SPAPR_XIVE(obj) OBJECT_CHECK(sPAPRXive, (obj), TYPE_SPAPR_XIVE)
>>>>  
>>>> @@ -38,6 +42,17 @@ typedef struct sPAPRXive {
>>>>      MemoryRegion  tm_mmio;
>>>>  } sPAPRXive;
>>>>  
>>>> +#define SPAPR_XIVE_BASE_CLASS(klass) \
>>>> +     OBJECT_CLASS_CHECK(sPAPRXiveClass, (klass), TYPE_SPAPR_XIVE_BASE)
>>>> +#define SPAPR_XIVE_BASE_GET_CLASS(obj) \
>>>> +     OBJECT_GET_CLASS(sPAPRXiveClass, (obj), TYPE_SPAPR_XIVE_BASE)
>>>> +
>>>> +typedef struct sPAPRXiveClass {
>>>> +    XiveRouterClass parent_class;
>>>> +
>>>> +    DeviceRealize   parent_realize;
>>>> +} sPAPRXiveClass;
>>>> +
>>>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>>>>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index b74eb326dcd1..281ed370121c 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -38,6 +38,10 @@ typedef struct XiveFabricClass {
>>>>   * XIVE Interrupt Source
>>>>   */
>>>>  
>>>> +#define TYPE_XIVE_SOURCE_BASE "xive-source-base"
>>>> +#define XIVE_SOURCE_BASE(obj) \
>>>> +    OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE_BASE)
>>>> +
>>>>  #define TYPE_XIVE_SOURCE "xive-source"
>>>>  #define XIVE_SOURCE(obj) OBJECT_CHECK(XiveSource, (obj), TYPE_XIVE_SOURCE)
>>>>  
>>>> @@ -68,6 +72,18 @@ typedef struct XiveSource {
>>>>      XiveFabric      *xive;
>>>>  } XiveSource;
>>>>  
>>>> +#define XIVE_SOURCE_BASE_CLASS(klass) \
>>>> +     OBJECT_CLASS_CHECK(XiveSourceClass, (klass), TYPE_XIVE_SOURCE_BASE)
>>>> +#define XIVE_SOURCE_BASE_GET_CLASS(obj) \
>>>> +     OBJECT_GET_CLASS(XiveSourceClass, (obj), TYPE_XIVE_SOURCE_BASE)
>>>> +
>>>> +typedef struct XiveSourceClass {
>>>> +    SysBusDeviceClass parent_class;
>>>> +
>>>> +    DeviceRealize     parent_realize;
>>>> +    DeviceReset       parent_reset;
>>>> +} XiveSourceClass;
>>>> +
>>>>  /*
>>>>   * ESB MMIO setting. Can be one page, for both source triggering and
>>>>   * source management, or two different pages. See below for magic
>>>> @@ -253,6 +269,9 @@ void xive_end_pic_print_info(XiveEND *end, uint32_t end_idx, Monitor *mon);
>>>>   * XIVE Thread interrupt Management (TM) context
>>>>   */
>>>>  
>>>> +#define TYPE_XIVE_TCTX_BASE "xive-tctx-base"
>>>> +#define XIVE_TCTX_BASE(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_BASE)
>>>> +
>>>>  #define TYPE_XIVE_TCTX "xive-tctx"
>>>>  #define XIVE_TCTX(obj) OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX)
>>>>  
>>>> @@ -278,6 +297,17 @@ typedef struct XiveTCTX {
>>>>      XiveRouter  *xrtr;
>>>>  } XiveTCTX;
>>>>  
>>>> +#define XIVE_TCTX_BASE_CLASS(klass) \
>>>> +     OBJECT_CLASS_CHECK(XiveTCTXClass, (klass), TYPE_XIVE_TCTX_BASE)
>>>> +#define XIVE_TCTX_BASE_GET_CLASS(obj) \
>>>> +     OBJECT_GET_CLASS(XiveTCTXClass, (obj), TYPE_XIVE_TCTX_BASE)
>>>> +
>>>> +typedef struct XiveTCTXClass {
>>>> +    DeviceClass       parent_class;
>>>> +
>>>> +    DeviceRealize     parent_realize;
>>>> +} XiveTCTXClass;
>>>> +
>>>>  /*
>>>>   * XIVE Thread Interrupt Management Aera (TIMA)
>>>>   */
>>>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>>>> index 3bf77ace11a2..ec85f7e4f88d 100644
>>>> --- a/hw/intc/spapr_xive.c
>>>> +++ b/hw/intc/spapr_xive.c
>>>> @@ -53,9 +53,9 @@ static void spapr_xive_mmio_map(sPAPRXive *xive)
>>>>      sysbus_mmio_map(SYS_BUS_DEVICE(xive), 0, xive->tm_base);
>>>>  }
>>>>  
>>>> -static void spapr_xive_reset(DeviceState *dev)
>>>> +static void spapr_xive_base_reset(DeviceState *dev)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
>>>>      int i;
>>>>  
>>>>      /* Xive Source reset is done through SysBus, it should put all
>>>> @@ -76,9 +76,9 @@ static void spapr_xive_reset(DeviceState *dev)
>>>>      spapr_xive_mmio_map(xive);
>>>>  }
>>>>  
>>>> -static void spapr_xive_instance_init(Object *obj)
>>>> +static void spapr_xive_base_instance_init(Object *obj)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(obj);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(obj);
>>>>  
>>>>      object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
>>>>      object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>>>> @@ -89,9 +89,9 @@ static void spapr_xive_instance_init(Object *obj)
>>>>                                NULL);
>>>>  }
>>>>  
>>>> -static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>> +static void spapr_xive_base_realize(DeviceState *dev, Error **errp)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(dev);
>>>>      XiveSource *xsrc = &xive->source;
>>>>      XiveENDSource *end_xsrc = &xive->end_source;
>>>>      Error *local_err = NULL;
>>>> @@ -142,16 +142,11 @@ static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>>       */
>>>>      xive->eat = g_new0(XiveEAS, xive->nr_irqs);
>>>>      xive->endt = g_new0(XiveEND, xive->nr_ends);
>>>> -
>>>> -    /* TIMA initialization */
>>>> -    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
>>>> -                          "xive.tima", 4ull << TM_SHIFT);
>>>> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
>>>>  }
>>>>  
>>>>  static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>>>  
>>>>      if (lisn >= xive->nr_irqs) {
>>>>          return -1;
>>>> @@ -163,7 +158,7 @@ static int spapr_xive_get_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>>  
>>>>  static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>>>  
>>>>      if (lisn >= xive->nr_irqs) {
>>>>          return -1;
>>>> @@ -176,7 +171,7 @@ static int spapr_xive_set_eas(XiveRouter *xrtr, uint32_t lisn, XiveEAS *eas)
>>>>  static int spapr_xive_get_end(XiveRouter *xrtr,
>>>>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>>>  
>>>>      if (end_idx >= xive->nr_ends) {
>>>>          return -1;
>>>> @@ -189,7 +184,7 @@ static int spapr_xive_get_end(XiveRouter *xrtr,
>>>>  static int spapr_xive_set_end(XiveRouter *xrtr,
>>>>                                uint8_t end_blk, uint32_t end_idx, XiveEND *end)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>>>  
>>>>      if (end_idx >= xive->nr_ends) {
>>>>          return -1;
>>>> @@ -202,7 +197,7 @@ static int spapr_xive_set_end(XiveRouter *xrtr,
>>>>  static int spapr_xive_get_nvt(XiveRouter *xrtr,
>>>>                                uint8_t nvt_blk, uint32_t nvt_idx, XiveNVT *nvt)
>>>>  {
>>>> -    sPAPRXive *xive = SPAPR_XIVE(xrtr);
>>>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(xrtr);
>>>>      uint32_t vcpu_id = spapr_xive_nvt_to_target(xive, nvt_blk, nvt_idx);
>>>>      PowerPCCPU *cpu = spapr_find_cpu(vcpu_id);
>>>>  
>>>> @@ -236,7 +231,7 @@ static void spapr_xive_reset_tctx(XiveRouter *xrtr, XiveTCTX *tctx)
>>>>      uint32_t nvt_idx;
>>>>      uint32_t nvt_cam;
>>>>  
>>>> -    spapr_xive_cpu_to_nvt(SPAPR_XIVE(xrtr), POWERPC_CPU(tctx->cs),
>>>> +    spapr_xive_cpu_to_nvt(SPAPR_XIVE_BASE(xrtr), POWERPC_CPU(tctx->cs),
>>>>                            &nvt_blk, &nvt_idx);
>>>>  
>>>>      nvt_cam = cpu_to_be32(TM_QW1W2_VO | xive_tctx_cam_line(nvt_blk, nvt_idx));
>>>> @@ -359,7 +354,7 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
>>>>      },
>>>>  };
>>>>  
>>>> -static const VMStateDescription vmstate_spapr_xive = {
>>>> +static const VMStateDescription vmstate_spapr_xive_base = {
>>>>      .name = TYPE_SPAPR_XIVE,
>>>>      .version_id = 1,
>>>>      .minimum_version_id = 1,
>>>> @@ -373,7 +368,7 @@ static const VMStateDescription vmstate_spapr_xive = {
>>>>      },
>>>>  };
>>>>  
>>>> -static Property spapr_xive_properties[] = {
>>>> +static Property spapr_xive_base_properties[] = {
>>>>      DEFINE_PROP_UINT32("nr-irqs", sPAPRXive, nr_irqs, 0),
>>>>      DEFINE_PROP_UINT32("nr-ends", sPAPRXive, nr_ends, 0),
>>>>      DEFINE_PROP_UINT64("vc-base", sPAPRXive, vc_base, SPAPR_XIVE_VC_BASE),
>>>> @@ -381,16 +376,16 @@ static Property spapr_xive_properties[] = {
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> -static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>> +static void spapr_xive_base_class_init(ObjectClass *klass, void *data)
>>>>  {
>>>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>>>      XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
>>>>  
>>>>      dc->desc    = "sPAPR XIVE Interrupt Controller";
>>>> -    dc->props   = spapr_xive_properties;
>>>> -    dc->realize = spapr_xive_realize;
>>>> -    dc->reset   = spapr_xive_reset;
>>>> -    dc->vmsd    = &vmstate_spapr_xive;
>>>> +    dc->props   = spapr_xive_base_properties;
>>>> +    dc->realize = spapr_xive_base_realize;
>>>> +    dc->reset   = spapr_xive_base_reset;
>>>> +    dc->vmsd    = &vmstate_spapr_xive_base;
>>>>  
>>>>      xrc->get_eas = spapr_xive_get_eas;
>>>>      xrc->set_eas = spapr_xive_set_eas;
>>>> @@ -401,16 +396,55 @@ static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>>      xrc->reset_tctx = spapr_xive_reset_tctx;
>>>>  }
>>>>  
>>>> +static const TypeInfo spapr_xive_base_info = {
>>>> +    .name = TYPE_SPAPR_XIVE_BASE,
>>>> +    .parent = TYPE_XIVE_ROUTER,
>>>> +    .abstract = true,
>>>> +    .instance_init = spapr_xive_base_instance_init,
>>>> +    .instance_size = sizeof(sPAPRXive),
>>>> +    .class_init = spapr_xive_base_class_init,
>>>> +    .class_size = sizeof(sPAPRXiveClass),
>>>> +};
>>>> +
>>>> +static void spapr_xive_realize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    sPAPRXive *xive = SPAPR_XIVE(dev);
>>>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
>>>> +    Error *local_err = NULL;
>>>> +
>>>> +    sxc->parent_realize(dev, &local_err);
>>>> +    if (local_err) {
>>>> +        error_propagate(errp, local_err);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    /* TIMA */
>>>> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops, xive,
>>>> +                          "xive.tima", 4ull << TM_SHIFT);
>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xive->tm_mmio);
>>>> +}
>>>> +
>>>> +static void spapr_xive_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_CLASS(klass);
>>>> +
>>>> +    device_class_set_parent_realize(dc, spapr_xive_realize,
>>>> +                                    &sxc->parent_realize);
>>>> +}
>>>> +
>>>>  static const TypeInfo spapr_xive_info = {
>>>>      .name = TYPE_SPAPR_XIVE,
>>>> -    .parent = TYPE_XIVE_ROUTER,
>>>> -    .instance_init = spapr_xive_instance_init,
>>>> +    .parent = TYPE_SPAPR_XIVE_BASE,
>>>> +    .instance_init = spapr_xive_base_instance_init,
>>>>      .instance_size = sizeof(sPAPRXive),
>>>>      .class_init = spapr_xive_class_init,
>>>> +    .class_size = sizeof(sPAPRXiveClass),
>>>>  };
>>>>  
>>>>  static void spapr_xive_register_types(void)
>>>>  {
>>>> +    type_register_static(&spapr_xive_base_info);
>>>>      type_register_static(&spapr_xive_info);
>>>>  }
>>>>  
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index 7d921023e2ee..9bb37553c9ec 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -478,9 +478,9 @@ static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>>>>      return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
>>>>  }
>>>>  
>>>> -static void xive_tctx_reset(void *dev)
>>>> +static void xive_tctx_base_reset(void *dev)
>>>>  {
>>>> -    XiveTCTX *tctx = XIVE_TCTX(dev);
>>>> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
>>>>      XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>>>>  
>>>>      memset(tctx->regs, 0, sizeof(tctx->regs));
>>>> @@ -506,9 +506,9 @@ static void xive_tctx_reset(void *dev)
>>>>      }
>>>>  }
>>>>  
>>>> -static void xive_tctx_realize(DeviceState *dev, Error **errp)
>>>> +static void xive_tctx_base_realize(DeviceState *dev, Error **errp)
>>>>  {
>>>> -    XiveTCTX *tctx = XIVE_TCTX(dev);
>>>> +    XiveTCTX *tctx = XIVE_TCTX_BASE(dev);
>>>>      PowerPCCPU *cpu;
>>>>      CPUPPCState *env;
>>>>      Object *obj;
>>>> @@ -544,15 +544,15 @@ static void xive_tctx_realize(DeviceState *dev, Error **errp)
>>>>          return;
>>>>      }
>>>>  
>>>> -    qemu_register_reset(xive_tctx_reset, dev);
>>>> +    qemu_register_reset(xive_tctx_base_reset, dev);
>>>>  }
>>>>  
>>>> -static void xive_tctx_unrealize(DeviceState *dev, Error **errp)
>>>> +static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
>>>>  {
>>>> -    qemu_unregister_reset(xive_tctx_reset, dev);
>>>> +    qemu_unregister_reset(xive_tctx_base_reset, dev);
>>>>  }
>>>>  
>>>> -static const VMStateDescription vmstate_xive_tctx = {
>>>> +static const VMStateDescription vmstate_xive_tctx_base = {
>>>>      .name = TYPE_XIVE_TCTX,
>>>>      .version_id = 1,
>>>>      .minimum_version_id = 1,
>>>> @@ -562,21 +562,28 @@ static const VMStateDescription vmstate_xive_tctx = {
>>>>      },
>>>>  };
>>>>  
>>>> -static void xive_tctx_class_init(ObjectClass *klass, void *data)
>>>> +static void xive_tctx_base_class_init(ObjectClass *klass, void *data)
>>>>  {
>>>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>>>  
>>>> -    dc->realize = xive_tctx_realize;
>>>> -    dc->unrealize = xive_tctx_unrealize;
>>>> +    dc->realize = xive_tctx_base_realize;
>>>> +    dc->unrealize = xive_tctx_base_unrealize;
>>>>      dc->desc = "XIVE Interrupt Thread Context";
>>>> -    dc->vmsd = &vmstate_xive_tctx;
>>>> +    dc->vmsd = &vmstate_xive_tctx_base;
>>>>  }
>>>>  
>>>> -static const TypeInfo xive_tctx_info = {
>>>> -    .name          = TYPE_XIVE_TCTX,
>>>> +static const TypeInfo xive_tctx_base_info = {
>>>> +    .name          = TYPE_XIVE_TCTX_BASE,
>>>>      .parent        = TYPE_DEVICE,
>>>> +    .abstract      = true,
>>>>      .instance_size = sizeof(XiveTCTX),
>>>> -    .class_init    = xive_tctx_class_init,
>>>> +    .class_init    = xive_tctx_base_class_init,
>>>> +    .class_size    = sizeof(XiveTCTXClass),
>>>> +};
>>>> +
>>>> +static const TypeInfo xive_tctx_info = {
>>>> +    .name          = TYPE_XIVE_TCTX,
>>>> +    .parent        = TYPE_XIVE_TCTX_BASE,
>>>>  };
>>>>  
>>>>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
>>>> @@ -933,9 +940,9 @@ void xive_source_pic_print_info(XiveSource *xsrc, uint32_t offset, Monitor *mon)
>>>>      }
>>>>  }
>>>>  
>>>> -static void xive_source_reset(DeviceState *dev)
>>>> +static void xive_source_base_reset(DeviceState *dev)
>>>>  {
>>>> -    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
>>>>  
>>>>      /* Do not clear the LSI bitmap */
>>>>  
>>>> @@ -943,9 +950,9 @@ static void xive_source_reset(DeviceState *dev)
>>>>      memset(xsrc->status, 0x1, xsrc->nr_irqs);
>>>>  }
>>>>  
>>>> -static void xive_source_realize(DeviceState *dev, Error **errp)
>>>> +static void xive_source_base_realize(DeviceState *dev,  Error **errp)
>>>>  {
>>>> -    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +    XiveSource *xsrc = XIVE_SOURCE_BASE(dev);
>>>>      Object *obj;
>>>>      Error *local_err = NULL;
>>>>  
>>>> @@ -971,21 +978,14 @@ static void xive_source_realize(DeviceState *dev, Error **errp)
>>>>          return;
>>>>      }
>>>>  
>>>> -    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc,
>>>> -                                     xsrc->nr_irqs);
>>>> -
>>>>      xsrc->status = g_malloc0(xsrc->nr_irqs);
>>>>  
>>>>      xsrc->lsi_map = bitmap_new(xsrc->nr_irqs);
>>>>      xsrc->lsi_map_size = xsrc->nr_irqs;
>>>>  
>>>> -    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>>> -                          &xive_source_esb_ops, xsrc, "xive.esb",
>>>> -                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>>>> -    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>>>>  }
>>>>  
>>>> -static const VMStateDescription vmstate_xive_source = {
>>>> +static const VMStateDescription vmstate_xive_source_base = {
>>>>      .name = TYPE_XIVE_SOURCE,
>>>>      .version_id = 1,
>>>>      .minimum_version_id = 1,
>>>> @@ -1001,29 +1001,68 @@ static const VMStateDescription vmstate_xive_source = {
>>>>   * The default XIVE interrupt source setting for the ESB MMIOs is two
>>>>   * 64k pages without Store EOI, to be in sync with KVM.
>>>>   */
>>>> -static Property xive_source_properties[] = {
>>>> +static Property xive_source_base_properties[] = {
>>>>      DEFINE_PROP_UINT64("flags", XiveSource, esb_flags, 0),
>>>>      DEFINE_PROP_UINT32("nr-irqs", XiveSource, nr_irqs, 0),
>>>>      DEFINE_PROP_UINT32("shift", XiveSource, esb_shift, XIVE_ESB_64K_2PAGE),
>>>>      DEFINE_PROP_END_OF_LIST(),
>>>>  };
>>>>  
>>>> -static void xive_source_class_init(ObjectClass *klass, void *data)
>>>> +static void xive_source_base_class_init(ObjectClass *klass, void *data)
>>>>  {
>>>>      DeviceClass *dc = DEVICE_CLASS(klass);
>>>>  
>>>>      dc->desc    = "XIVE Interrupt Source";
>>>> -    dc->props   = xive_source_properties;
>>>> -    dc->realize = xive_source_realize;
>>>> -    dc->reset   = xive_source_reset;
>>>> -    dc->vmsd    = &vmstate_xive_source;
>>>> +    dc->props   = xive_source_base_properties;
>>>> +    dc->realize = xive_source_base_realize;
>>>> +    dc->reset   = xive_source_base_reset;
>>>> +    dc->vmsd    = &vmstate_xive_source_base;
>>>> +}
>>>> +
>>>> +static const TypeInfo xive_source_base_info = {
>>>> +    .name          = TYPE_XIVE_SOURCE_BASE,
>>>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>>>> +    .abstract      = true,
>>>> +    .instance_size = sizeof(XiveSource),
>>>> +    .class_init    = xive_source_base_class_init,
>>>> +    .class_size    = sizeof(XiveSourceClass),
>>>> +};
>>>> +
>>>> +static void xive_source_realize(DeviceState *dev, Error **errp)
>>>> +{
>>>> +    XiveSource *xsrc = XIVE_SOURCE(dev);
>>>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
>>>> +    Error *local_err = NULL;
>>>> +
>>>> +    xsc->parent_realize(dev, &local_err);
>>>> +    if (local_err) {
>>>> +        error_propagate(errp, local_err);
>>>> +        return;
>>>> +    }
>>>> +
>>>> +    xsrc->qirqs = qemu_allocate_irqs(xive_source_set_irq, xsrc, xsrc->nr_irqs);
>>>> +
>>>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>>>> +                          &xive_source_esb_ops, xsrc, "xive.esb",
>>>> +                          (1ull << xsrc->esb_shift) * xsrc->nr_irqs);
>>>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>>>> +}
>>>> +
>>>> +static void xive_source_class_init(ObjectClass *klass, void *data)
>>>> +{
>>>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>>>> +    XiveSourceClass *xsc = XIVE_SOURCE_BASE_CLASS(klass);
>>>> +
>>>> +    device_class_set_parent_realize(dc, xive_source_realize,
>>>> +                                    &xsc->parent_realize);
>>>>  }
>>>>  
>>>>  static const TypeInfo xive_source_info = {
>>>>      .name          = TYPE_XIVE_SOURCE,
>>>> -    .parent        = TYPE_SYS_BUS_DEVICE,
>>>> +    .parent        = TYPE_XIVE_SOURCE_BASE,
>>>>      .instance_size = sizeof(XiveSource),
>>>>      .class_init    = xive_source_class_init,
>>>> +    .class_size    = sizeof(XiveSourceClass),
>>>>  };
>>>>  
>>>>  /*
>>>> @@ -1659,10 +1698,12 @@ static const TypeInfo xive_fabric_info = {
>>>>  
>>>>  static void xive_register_types(void)
>>>>  {
>>>> +    type_register_static(&xive_source_base_info);
>>>>      type_register_static(&xive_source_info);
>>>>      type_register_static(&xive_fabric_info);
>>>>      type_register_static(&xive_router_info);
>>>>      type_register_static(&xive_end_source_info);
>>>> +    type_register_static(&xive_tctx_base_info);
>>>>      type_register_static(&xive_tctx_info);
>>>>  }
>>>>  
>>>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>>>> index 42e73851b174..f6e9e44d4cf9 100644
>>>> --- a/hw/ppc/spapr_irq.c
>>>> +++ b/hw/ppc/spapr_irq.c
>>>> @@ -243,7 +243,7 @@ static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
>>>>          return NULL;
>>>>      }
>>>>      qdev_set_parent_bus(DEVICE(obj), sysbus_get_default());
>>>> -    xive = SPAPR_XIVE(obj);
>>>> +    xive = SPAPR_XIVE_BASE(obj);
>>>>  
>>>>      /* Enable the CPU IPIs */
>>>>      for (i = 0; i < nr_servers; ++i) {
>>>> @@ -311,7 +311,7 @@ static void spapr_irq_print_info_xive(sPAPRMachineState *spapr,
>>>>      CPU_FOREACH(cs) {
>>>>          PowerPCCPU *cpu = POWERPC_CPU(cs);
>>>>  
>>>> -        xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
>>>> +        xive_tctx_pic_print_info(XIVE_TCTX_BASE(cpu->intc), mon);
>>>>      }
>>>>  
>>>>      spapr_xive_pic_print_info(spapr->xive, mon);
>>>
>>
> 


* Re: [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM
  2018-11-29  3:43   ` David Gibson
@ 2018-11-29 16:19     ` Cédric Le Goater
  2018-11-30  1:24       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:19 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

David,

Could you tell me what you think about the KVM interfaces for migration,
the ones capturing and restoring the states?
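
For reference, the 64-bit EAS attribute layout proposed in the kvm.h hunk
quoted below can be exercised on its own. This is only a sketch: the macro
values are copied from the patch, but eas_pack() and the eas_*() accessors
are hypothetical helpers written for illustration and are not part of QEMU
or of the kernel interface.

```c
#include <assert.h>
#include <stdint.h>

/* Layout of the 64-bit EAS attribute, copied from the proposed kvm.h */
#define KVM_XIVE_EAS_PRIORITY_SHIFT 0
#define KVM_XIVE_EAS_PRIORITY_MASK  0x7
#define KVM_XIVE_EAS_SERVER_SHIFT   3
#define KVM_XIVE_EAS_SERVER_MASK    0xfffffff8ULL
#define KVM_XIVE_EAS_MASK_SHIFT     32
#define KVM_XIVE_EAS_MASK_MASK      0x100000000ULL
#define KVM_XIVE_EAS_EISN_SHIFT     33
#define KVM_XIVE_EAS_EISN_MASK      0xfffffffe00000000ULL

/* Hypothetical helper: pack (priority, server, masked, eisn) into one word */
static uint64_t eas_pack(uint8_t priority, uint32_t server, int masked,
                         uint32_t eisn)
{
    uint64_t v;

    /* Only 3 bits are available for the priority */
    assert(priority <= KVM_XIVE_EAS_PRIORITY_MASK);

    v = ((uint64_t)priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
        KVM_XIVE_EAS_PRIORITY_MASK;
    v |= ((uint64_t)server << KVM_XIVE_EAS_SERVER_SHIFT) &
        KVM_XIVE_EAS_SERVER_MASK;
    if (masked) {
        v |= KVM_XIVE_EAS_MASK_MASK;
    }
    v |= ((uint64_t)eisn << KVM_XIVE_EAS_EISN_SHIFT) &
        KVM_XIVE_EAS_EISN_MASK;
    return v;
}

/* Hypothetical accessors: extract each field from the packed word */
static uint8_t eas_priority(uint64_t v)
{
    return (v & KVM_XIVE_EAS_PRIORITY_MASK) >> KVM_XIVE_EAS_PRIORITY_SHIFT;
}

static uint32_t eas_server(uint64_t v)
{
    return (v & KVM_XIVE_EAS_SERVER_MASK) >> KVM_XIVE_EAS_SERVER_SHIFT;
}

static int eas_masked(uint64_t v)
{
    return (v & KVM_XIVE_EAS_MASK_MASK) != 0;
}

static uint32_t eas_eisn(uint64_t v)
{
    return (v & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
}
```

With these masks, the server field is 29 bits wide and the EISN field is
31 bits wide. The sketch casts to uint64_t before shifting so that no field
is truncated by 32-bit arithmetic.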

On 11/29/18 4:43 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:16AM +0100, Cédric Le Goater wrote:
>> This extends the KVM XIVE models to handle the state synchronization
>> with KVM, for the monitor usage and for the migration.
>>
>> The migration priority of the XIVE interrupt controller sPAPRXive is
>> raised for KVM. It operates first and orchestrates the capture
>> sequence of the states of all the XIVE models. The XIVE sources are
>> masked to quiesce the interrupt flow and a XIVE xync is performed to
>> stabilize the OS Event Queues. The state of the ENDs is then captured
>> by the XIVE interrupt controller model, sPAPRXive, and the state of
>> the thread contexts by the thread interrupt presenter model,
>> XiveTCTX. When done, a rollback is performed to restore the sources to
>> their initial state.
>>
>> The sPAPRXive 'post_load' method is called from the sPAPR machine,
>> after all XIVE device states have been transferred and loaded. First,
>> sPAPRXive restores the XIVE routing tables: ENDT and EAT. Next, the
>> thread interrupt context registers and the source PQ bits are
>> restored.
>>
>> The get/set operations rely on their KVM counterpart in the host
>> kernel which acts as a proxy for OPAL, the host firmware.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>
>>  WIP:
>>  
>>     If migration occurs when a VCPU is 'ceded', some of the OS event
>>     notification queues are mapped to the ZERO_PAGE on the receiving
>>     side. As if the HW had triggered a page fault before the dirty
>>     page was transferred from the source or as if we were not using
>>     the correct page table.


v6 adds a VM change state handler to make XIVE reach a quiescent state. 
The sequence is a little more sophisticated and an extra KVM call 
marks the EQ page dirty.
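
The general shape of the capture sequence (mask the sources while saving
their previous PQ bits, capture the state, then roll the sources back
whether or not the capture succeeded) can be sketched as a standalone toy
model. Nothing here is QEMU code: struct toy_source, toy_esb_set_pq(),
toy_pre_save() and the two capture callbacks are hypothetical names, and
restoring by writing the saved PQ value back is an assumption made for
the sketch.

```c
#include <assert.h>
#include <stdint.h>

#define NR_IRQS   4
#define PQ_MASKED 0x1   /* PQ=01: source is off */

/* Toy model of a source: hw_pq stands in for the ESB state held by HW/KVM */
struct toy_source {
    uint8_t hw_pq[NR_IRQS];
    uint8_t saved_pq[NR_IRQS];   /* local copy kept for migration */
};

/* Set the PQ bits of source 'i' and return the previous value */
static uint8_t toy_esb_set_pq(struct toy_source *s, int i, uint8_t pq)
{
    uint8_t old;

    assert(i >= 0 && i < NR_IRQS);
    old = s->hw_pq[i];
    s->hw_pq[i] = pq & 0x3;
    return old;
}

typedef int (*toy_capture_fn)(void *opaque);

/* Example capture step: succeeds only if every source is masked */
static int toy_capture_all_masked(void *opaque)
{
    struct toy_source *s = opaque;
    int i;

    for (i = 0; i < NR_IRQS; i++) {
        if (s->hw_pq[i] != PQ_MASKED) {
            return -1;
        }
    }
    return 0;
}

/* Example capture step that always fails, to exercise the rollback */
static int toy_capture_fail(void *opaque)
{
    (void)opaque;
    return -1;
}

static int toy_pre_save(struct toy_source *s, toy_capture_fn capture,
                        void *opaque)
{
    int i;
    int ret;

    /* 1. Quiesce: mask every source and remember its previous PQ bits */
    for (i = 0; i < NR_IRQS; i++) {
        s->saved_pq[i] = toy_esb_set_pq(s, i, PQ_MASKED);
    }

    /* 2. Capture the rest of the state while nothing can fire */
    ret = capture(opaque);

    /* 3. Rollback, on success and on failure alike */
    for (i = 0; i < NR_IRQS; i++) {
        toy_esb_set_pq(s, i, s->saved_pq[i]);
    }

    return ret;
}
```

The point of the single exit path is that the sources never stay masked
after the capture, whatever its outcome, while saved_pq[] still holds the
values to migrate.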

>>
>>  include/hw/ppc/spapr_xive.h     |   5 +
>>  include/hw/ppc/xive.h           |   3 +
>>  include/migration/vmstate.h     |   1 +
>>  linux-headers/asm-powerpc/kvm.h |  33 +++
>>  hw/intc/spapr_xive.c            |  32 +++
>>  hw/intc/spapr_xive_kvm.c        | 494 ++++++++++++++++++++++++++++++++
>>  hw/intc/xive.c                  |  46 +++
>>  hw/ppc/spapr_irq.c              |   2 +-
>>  8 files changed, 615 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 9c817bb7ae74..d2517c040958 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -55,12 +55,17 @@ typedef struct sPAPRXiveClass {
>>      XiveRouterClass parent_class;
>>  
>>      DeviceRealize   parent_realize;
>> +
>> +    void (*synchronize_state)(sPAPRXive *xive);
>> +    int  (*pre_save)(sPAPRXive *xsrc);
>> +    int  (*post_load)(sPAPRXive *xsrc, int version_id);
> 
> This should go away if the KVM and non-KVM versions are in the same
> object.

yes.

>>  } sPAPRXiveClass;
>>  
>>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
>>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
>>  qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
>> +int spapr_xive_post_load(sPAPRXive *xive, int version_id);
>>  
>>  /*
>>   * sPAPR NVT and END indexing helpers
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index 7aaf5a182cb3..c8201462d698 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -309,6 +309,9 @@ typedef struct XiveTCTXClass {
>>      DeviceClass       parent_class;
>>  
>>      DeviceRealize     parent_realize;
>> +
>> +    void (*synchronize_state)(XiveTCTX *tctx);
>> +    int  (*post_load)(XiveTCTX *tctx, int version_id);
> 
> .. and this too.
> 
>>  } XiveTCTXClass;
>>  
>>  /*
>> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
>> index 2b501d04669a..ee2e836cc1c1 100644
>> --- a/include/migration/vmstate.h
>> +++ b/include/migration/vmstate.h
>> @@ -154,6 +154,7 @@ typedef enum {
>>      MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
>>      MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
>>      MIG_PRI_GICV3,              /* Must happen before the ITS */
>> +    MIG_PRI_XIVE_IC,            /* Must happen before all XIVE models */
> 
> Ugh.. explicit priority / order levels are a pretty bad code smell.
> Usually migration ordering can be handled by getting the object
> hierarchy right.  What exactly is the problem you're addressing with
> this?

I wanted sPAPRXive to capture the state on behalf of all XIVE models. 
But with the addition of the VM change state handler I think I can 
remove this priority. I will check. 

> 
>>      MIG_PRI_MAX,
>>  } MigrationPriority;
>>  
>> diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
>> index f34c971491dd..9d55ade23634 100644
>> --- a/linux-headers/asm-powerpc/kvm.h
>> +++ b/linux-headers/asm-powerpc/kvm.h
> 
> Again, linux-headers need to be split out.
> 
>> @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
>>  #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
>>  #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
>>  
>> +#define KVM_REG_PPC_NVT_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
>> +
>>  /* Device control API: PPC-specific devices */
>>  #define KVM_DEV_MPIC_GRP_MISC		1
>>  #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
>> @@ -681,10 +683,41 @@ struct kvm_ppc_cpu_char {
>>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>>  #define   KVM_DEV_XIVE_VC_BASE		3
>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>> +#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>> +#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
>>  
>>  /* Layout of 64-bit XIVE source attribute values */
>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>  
>> +/* Layout of 64-bit eas attribute values */
>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>> +
>> +/* Layout of 64-bit eq attribute */
>> +#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
>> +#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
>> +#define KVM_XIVE_EQ_SERVER_SHIFT	3
>> +#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
>> +
>> +/* Layout of 64-bit eq attribute values */
>> +struct kvm_ppc_xive_eq {
>> +	__u32 flags;
>> +	__u32 qsize;
>> +	__u64 qpage;
>> +	__u32 qtoggle;
>> +	__u32 qindex;
>> +};
>> +
>> +#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
>> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
>> +#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
>>  
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index ec85f7e4f88d..c5c0e063dc33 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -27,9 +27,14 @@
>>  
>>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>  {
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
>>      int i;
>>      uint32_t offset = 0;
>>  
>> +    if (sxc->synchronize_state) {
>> +        sxc->synchronize_state(xive);
>> +    }
>> +
>>      monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
>>                     offset + xive->source.nr_irqs - 1);
>>      xive_source_pic_print_info(&xive->source, offset, mon);
>> @@ -354,10 +359,37 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
>>      },
>>  };
>>  
>> +static int vmstate_spapr_xive_pre_save(void *opaque)
>> +{
>> +    sPAPRXive *xive = SPAPR_XIVE_BASE(opaque);
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
>> +
>> +    if (sxc->pre_save) {
>> +        return sxc->pre_save(xive);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +/* handled at the machine level */
>> +int spapr_xive_post_load(sPAPRXive *xive, int version_id)
>> +{
>> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
>> +
>> +    if (sxc->post_load) {
>> +        return sxc->post_load(xive, version_id);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  static const VMStateDescription vmstate_spapr_xive_base = {
>>      .name = TYPE_SPAPR_XIVE,
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>> +    .pre_save = vmstate_spapr_xive_pre_save,
>> +    .post_load = NULL, /* handled at the machine level */
>> +    .priority = MIG_PRI_XIVE_IC,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
>>          VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
>> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
>> index 767f90826e43..176083c37d61 100644
>> --- a/hw/intc/spapr_xive_kvm.c
>> +++ b/hw/intc/spapr_xive_kvm.c
>> @@ -58,6 +58,58 @@ static void kvm_cpu_enable(CPUState *cs)
>>  /*
>>   * XIVE Thread Interrupt Management context (KVM)
>>   */
>> +static void xive_tctx_kvm_set_state(XiveTCTX *tctx, Error **errp)
>> +{
>> +    uint64_t state[4];
>> +    int ret;
>> +
>> +    /* word0 and word1 of the OS ring. */
>> +    state[0] = *((uint64_t *) &tctx->regs[TM_QW1_OS]);
>> +
>> +    /* VP identifier. Only for KVM pr_debug() */
>> +    state[1] = *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]);
>> +
>> +    ret = kvm_set_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
>> +    if (ret != 0) {
>> +        error_setg_errno(errp, errno, "Could not restore KVM XIVE CPU %ld state",
>> +                         kvm_arch_vcpu_id(tctx->cs));
>> +    }
>> +}
>> +
>> +static void xive_tctx_kvm_get_state(XiveTCTX *tctx, Error **errp)
>> +{
>> +    uint64_t state[4] = { 0 };
>> +    int ret;
>> +
>> +    ret = kvm_get_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
>> +    if (ret != 0) {
>> +        error_setg_errno(errp, errno, "Could not capture KVM XIVE CPU %ld state",
>> +                         kvm_arch_vcpu_id(tctx->cs));
>> +        return;
>> +    }
>> +
>> +    /* word0 and word1 of the OS ring. */
>> +    *((uint64_t *) &tctx->regs[TM_QW1_OS]) = state[0];
>> +
>> +    /*
>> +     * KVM also returns word2 containing the VP CAM line value which
>> +     * is interesting to print out the VP identifier in the QEMU
>> +     * monitor. No need to restore it.
>> +     */
>> +    *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]) = state[1];
>> +}
>> +
>> +static void xive_tctx_kvm_do_synchronize_state(CPUState *cpu,
>> +                                              run_on_cpu_data arg)
>> +{
>> +    xive_tctx_kvm_get_state(arg.host_ptr, &error_fatal);
>> +}
>> +
>> +static void xive_tctx_kvm_synchronize_state(XiveTCTX *tctx)
>> +{
>> +    run_on_cpu(tctx->cs, xive_tctx_kvm_do_synchronize_state,
>> +               RUN_ON_CPU_HOST_PTR(tctx));
>> +}
>>  
>>  static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
>>  {
>> @@ -112,6 +164,8 @@ static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
>>  
>>      device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
>>                                      &xtc->parent_realize);
>> +
>> +    xtc->synchronize_state = xive_tctx_kvm_synchronize_state;
>>  }
>>  
>>  static const TypeInfo xive_tctx_kvm_info = {
>> @@ -166,6 +220,34 @@ static void xive_source_kvm_reset(DeviceState *dev)
>>      xive_source_kvm_init(xsrc, &error_fatal);
>>  }
>>  
>> +/*
>> + * This is used to perform the magic loads on the ESB pages, described
>> + * in xive.h.
>> + */
>> +static uint8_t xive_esb_read(XiveSource *xsrc, int srcno, uint32_t offset)
>> +{
>> +    unsigned long addr = (unsigned long) xsrc->esb_mmap +
>> +        xive_source_esb_mgmt(xsrc, srcno) + offset;
>> +
>> +    /* Prevent the compiler from optimizing away the load */
>> +    volatile uint64_t value = *((uint64_t *) addr);
>> +
>> +    return be64_to_cpu(value) & 0x3;
>> +}
>> +
>> +static void xive_source_kvm_get_state(XiveSource *xsrc)
>> +{
>> +    int i;
>> +
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        /* Perform a load without side effect to retrieve the PQ bits */
>> +        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_GET);
>> +
>> +        /* and save PQ locally */
>> +        xive_source_esb_set(xsrc, i, pq);
>> +    }
>> +}
>> +
>>  static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
>>  {
>>      XiveSource *xsrc = opaque;
>> @@ -295,6 +377,414 @@ static const TypeInfo xive_source_kvm_info = {
>>  /*
>>   * sPAPR XIVE Router (KVM)
>>   */
>> +static int spapr_xive_kvm_set_eq_state(sPAPRXive *xive, CPUState *cs,
>> +                                       Error **errp)
>> +{
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
>> +    int ret;
>> +    int i;
>> +
>> +    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
>> +        Error *local_err = NULL;
>> +        XiveEND end;
>> +        uint8_t end_blk;
>> +        uint32_t end_idx;
>> +        struct kvm_ppc_xive_eq kvm_eq = { 0 };
>> +        uint64_t kvm_eq_idx;
>> +
>> +        if (!spapr_xive_priority_is_valid(i)) {
>> +            continue;
>> +        }
>> +
>> +        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
>> +
>> +        ret = xive_router_get_end(xrtr, end_blk, end_idx, &end);
>> +        if (ret) {
>> +            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
>> +                       vcpu_id, i);
>> +            return ret;
>> +        }
>> +
>> +        if (!(end.w0 & END_W0_VALID)) {
>> +            continue;
>> +        }
>> +
>> +        /* Build the KVM state from the local END structure */
>> +        kvm_eq.flags   = KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
>> +        kvm_eq.qsize   = GETFIELD(END_W0_QSIZE, end.w0) + 12;
>> +        kvm_eq.qpage   = (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
>> +        kvm_eq.qtoggle = GETFIELD(END_W1_GENERATION, end.w1);
>> +        kvm_eq.qindex  = GETFIELD(END_W1_PAGE_OFF, end.w1);
>> +
>> +        /* Encode the tuple (server, prio) as a KVM EQ index */
>> +        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
>> +            KVM_XIVE_EQ_PRIORITY_MASK;
>> +        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
>> +            KVM_XIVE_EQ_SERVER_MASK;
>> +
>> +        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
>> +                                &kvm_eq, true, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static int spapr_xive_kvm_get_eq_state(sPAPRXive *xive, CPUState *cs,
>> +                                       Error **errp)
>> +{
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
>> +    int ret;
>> +    int i;
>> +
>> +    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
>> +        Error *local_err = NULL;
>> +        struct kvm_ppc_xive_eq kvm_eq = { 0 };
>> +        uint64_t kvm_eq_idx;
>> +        XiveEND end = { 0 };
>> +        uint8_t end_blk, nvt_blk;
>> +        uint32_t end_idx, nvt_idx;
>> +
>> +        /* Skip priorities reserved for the hypervisor */
>> +        if (!spapr_xive_priority_is_valid(i)) {
>> +            continue;
>> +        }
>> +
>> +        /* Encode the tuple (server, prio) as a KVM EQ index */
>> +        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
>> +            KVM_XIVE_EQ_PRIORITY_MASK;
>> +        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
>> +            KVM_XIVE_EQ_SERVER_MASK;
>> +
>> +        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
>> +                                &kvm_eq, false, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return ret;
>> +        }
>> +
>> +        if (!(kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED)) {
>> +            continue;
>> +        }
>> +
>> +        /* Update the local END structure with the KVM input */
>> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED) {
>> +                end.w0 |= END_W0_VALID | END_W0_ENQUEUE;
>> +        }
>> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY) {
>> +                end.w0 |= END_W0_UCOND_NOTIFY;
>> +        }
>> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ESCALATE) {
>> +                end.w0 |= END_W0_ESCALATE_CTL;
>> +        }
>> +        end.w0 |= SETFIELD(END_W0_QSIZE, 0ul, kvm_eq.qsize - 12);
>> +
>> +        end.w1 = SETFIELD(END_W1_GENERATION, 0ul, kvm_eq.qtoggle) |
>> +            SETFIELD(END_W1_PAGE_OFF, 0ul, kvm_eq.qindex);
>> +        end.w2 = (kvm_eq.qpage >> 32) & 0x0fffffff;
>> +        end.w3 = kvm_eq.qpage & 0xffffffff;
>> +        end.w4 = 0;
>> +        end.w5 = 0;
>> +
>> +        ret = spapr_xive_cpu_to_nvt(xive, POWERPC_CPU(cs), &nvt_blk, &nvt_idx);
>> +        if (ret) {
>> +            error_setg(errp, "XIVE: No NVT for CPU %ld", vcpu_id);
>> +            return ret;
>> +        }
>> +
>> +        end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
>> +            SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
>> +        end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, i);
>> +
>> +        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
>> +
>> +        ret = xive_router_set_end(xrtr, end_blk, end_idx, &end);
>> +        if (ret) {
>> +            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
>> +                       vcpu_id, i);
>> +            return ret;
>> +        }
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>> +static void spapr_xive_kvm_set_eas_state(sPAPRXive *xive, Error **errp)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    int i;
>> +
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        XiveEAS *eas = &xive->eat[i];
>> +        uint32_t end_idx;
>> +        uint32_t end_blk;
>> +        uint32_t eisn;
>> +        uint8_t priority;
>> +        uint32_t server;
>> +        uint64_t kvm_eas;
>> +        Error *local_err = NULL;
>> +
>> +        /* No need to set MASKED EAS, this is the default state after reset */
>> +        if (!(eas->w & EAS_VALID) || eas->w & EAS_MASKED) {
>> +            continue;
>> +        }
>> +
>> +        end_idx = GETFIELD(EAS_END_INDEX, eas->w);
>> +        end_blk = GETFIELD(EAS_END_BLOCK, eas->w);
>> +        eisn = GETFIELD(EAS_END_DATA, eas->w);
>> +
>> +        spapr_xive_end_to_target(xive, end_blk, end_idx, &server, &priority);
>> +
>> +        kvm_eas = priority << KVM_XIVE_EAS_PRIORITY_SHIFT &
>> +            KVM_XIVE_EAS_PRIORITY_MASK;
>> +        kvm_eas |= server << KVM_XIVE_EAS_SERVER_SHIFT &
>> +            KVM_XIVE_EAS_SERVER_MASK;
>> +        kvm_eas |= ((uint64_t)eisn << KVM_XIVE_EAS_EISN_SHIFT) &
>> +            KVM_XIVE_EAS_EISN_MASK;
>> +
>> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, true,
>> +                          &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +    }
>> +}
>> +
>> +static void spapr_xive_kvm_get_eas_state(sPAPRXive *xive, Error **errp)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    int i;
>> +
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        XiveEAS *eas = &xive->eat[i];
>> +        XiveEAS new_eas;
>> +        uint64_t kvm_eas;
>> +        uint8_t priority;
>> +        uint32_t server;
>> +        uint32_t end_idx;
>> +        uint8_t end_blk;
>> +        uint32_t eisn;
>> +        Error *local_err = NULL;
>> +
>> +        if (!(eas->w & EAS_VALID)) {
>> +            continue;
>> +        }
>> +
>> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, false,
>> +                          &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +
>> +        priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
>> +            KVM_XIVE_EAS_PRIORITY_SHIFT;
>> +        server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
>> +            KVM_XIVE_EAS_SERVER_SHIFT;
>> +        eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
>> +
>> +        if (spapr_xive_target_to_end(xive, server, priority, &end_blk,
>> +                                     &end_idx)) {
>> +            error_setg(errp, "XIVE: invalid tuple CPU %d priority %d", server,
>> +                       priority);
>> +            return;
>> +        }
>> +
>> +        new_eas.w = EAS_VALID;
>> +        if (kvm_eas & KVM_XIVE_EAS_MASK_MASK) {
>> +            new_eas.w |= EAS_MASKED;
>> +        }
>> +
>> +        new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
>> +        new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
>> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
>> +
>> +        *eas = new_eas;
>> +    }
>> +}
>> +
>> +static void spapr_xive_kvm_sync_all(sPAPRXive *xive, Error **errp)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    Error *local_err = NULL;
>> +    int i;
>> +
>> +    /* Sync the KVM source. This reaches the XIVE HW through OPAL */
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        XiveEAS *eas = &xive->eat[i];
>> +
>> +        if (!(eas->w & EAS_VALID)) {
>> +            continue;
>> +        }
>> +
>> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SYNC, i, NULL, true,
>> +                          &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +    }
>> +}
>> +
>> +/*
>> + * The sPAPRXive KVM model migration priority is higher to make sure
> 
> Higher than what?

Than the XiveTCTX and XiveSource models.

>> + * its 'pre_save' method runs before all the other XIVE models. It
> 
> If the other XIVE components are children of sPAPRXive (which I think
> they are or could be), then I believe the parent object's pre_save
> will automatically be called first.

OK. The XiveTCTX objects are not children of sPAPRXive, but that might
no longer be a problem with the VM change state handler.

Thanks

C.

>> + * orchestrates the capture sequence of the XIVE states in the
>> + * following order:
>> + *
>> + *   1. mask all the sources by setting PQ=01, which returns the
>> + *      previous value and save it.
>> + *   2. sync the sources in KVM to stabilize all the queues
>> + *      sync the ENDs to make sure END -> VP is fully completed
>> + *   3. dump the EAS table
>> + *   4. dump the END table
>> + *   5. dump the thread context (IPB)
>> + *
>> + *  Rollback to restore the current configuration of the sources
> 
> 
> 
>> + */
>> +static int spapr_xive_kvm_pre_save(sPAPRXive *xive)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    Error *local_err = NULL;
>> +    CPUState *cs;
>> +    int i;
>> +    int ret = 0;
>> +
>> +    /* Quiesce the sources, to stop the flow of event notifications */
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        /*
>> +         * Mask and save the ESB PQs locally in the XiveSource object.
>> +         */
>> +        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_01);
>> +        xive_source_esb_set(xsrc, i, pq);
>> +    }
>> +
>> +    /* Sync the sources in KVM */
>> +    spapr_xive_kvm_sync_all(xive, &local_err);
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        goto out;
>> +    }
>> +
>> +    /* Grab the EAT (could be done earlier ?) */
>> +    spapr_xive_kvm_get_eas_state(xive, &local_err);
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        goto out;
>> +    }
>> +
>> +    /*
>> +     * Grab the ENDs. The EQ index and the toggle bit are what we want
>> +     * to capture
>> +     */
>> +    CPU_FOREACH(cs) {
>> +        spapr_xive_kvm_get_eq_state(xive, cs, &local_err);
>> +        if (local_err) {
>> +            error_report_err(local_err);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* Capture the thread interrupt contexts */
>> +    CPU_FOREACH(cs) {
>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>> +
>> +        /* TODO: Check if we need to use under run_on_cpu() ? */
>> +        xive_tctx_kvm_get_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
>> +        if (local_err) {
>> +            error_report_err(local_err);
>> +            goto out;
>> +        }
>> +    }
>> +
>> +    /* All done. */
>> +
>> +out:
>> +    /* Restore the sources to their initial state */
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        uint8_t pq = xive_source_esb_get(xsrc, i);
>> +        if (xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8)) != 0x1) {
>> +            error_report("XIVE: IRQ %d has an invalid state", i);
>> +        }
>> +    }
>> +
>> +    /*
>> +     * The XiveSource and the XiveTCTX states will be collected by
>> +     * their respective vmstate handlers afterwards.
>> +     */
>> +    return ret;
>> +}
>> +
>> +/*
>> + * The sPAPRXive 'post_load' method is called by the sPAPR machine,
>> + * after all XIVE device states have been transferred and loaded.
>> + *
>> + * All should be in place when the VCPUs resume execution.
>> + */
>> +static int spapr_xive_kvm_post_load(sPAPRXive *xive, int version_id)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    Error *local_err = NULL;
>> +    CPUState *cs;
>> +    int i;
>> +
>> +    /* Set the ENDs first. The targeting depends on it. */
>> +    CPU_FOREACH(cs) {
>> +        spapr_xive_kvm_set_eq_state(xive, cs, &local_err);
>> +        if (local_err) {
>> +            error_report_err(local_err);
>> +            return -1;
>> +        }
>> +    }
>> +
>> +    /* Restore the targeting, if any */
>> +    spapr_xive_kvm_set_eas_state(xive, &local_err);
>> +    if (local_err) {
>> +        error_report_err(local_err);
>> +        return -1;
>> +    }
>> +
>> +    /* Restore the thread interrupt contexts */
>> +    CPU_FOREACH(cs) {
>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>> +
>> +        xive_tctx_kvm_set_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
>> +        if (local_err) {
>> +            error_report_err(local_err);
>> +            return -1;
>> +        }
>> +    }
>> +
>> +    /*
>> +     * Get the saved state from the XiveSource model and restore the
>> +     * PQ bits
>> +     */
>> +    for (i = 0; i < xsrc->nr_irqs; i++) {
>> +        uint8_t pq = xive_source_esb_get(xsrc, i);
>> +        xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8));
>> +    }
>> +    return 0;
>> +}
>> +
>> +static void spapr_xive_kvm_synchronize_state(sPAPRXive *xive)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    CPUState *cs;
>> +
>> +    xive_source_kvm_get_state(xsrc);
>> +
>> +    spapr_xive_kvm_get_eas_state(xive, &error_fatal);
>> +
>> +    CPU_FOREACH(cs) {
>> +        spapr_xive_kvm_get_eq_state(xive, cs, &error_fatal);
>> +    }
>> +}
>>  
>>  static void spapr_xive_kvm_instance_init(Object *obj)
>>  {
>> @@ -409,6 +899,10 @@ static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
>>  
>>      dc->desc = "sPAPR XIVE KVM Interrupt Controller";
>>      dc->unrealize = spapr_xive_kvm_unrealize;
>> +
>> +    sxc->synchronize_state = spapr_xive_kvm_synchronize_state;
>> +    sxc->pre_save = spapr_xive_kvm_pre_save;
>> +    sxc->post_load = spapr_xive_kvm_post_load;
>>  }
>>  
>>  static const TypeInfo spapr_xive_kvm_info = {
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 9bb37553c9ec..c9aedecc8216 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -438,9 +438,14 @@ static const struct {
>>  
>>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
>>  {
>> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
>>      int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
>>      int i;
>>  
>> +    if (xtc->synchronize_state) {
>> +        xtc->synchronize_state(tctx);
>> +    }
>> +
>>      monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
>>                     "  W2\n", cpu_index);
>>  
>> @@ -552,10 +557,23 @@ static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
>>      qemu_unregister_reset(xive_tctx_base_reset, dev);
>>  }
>>  
>> +static int vmstate_xive_tctx_post_load(void *opaque, int version_id)
>> +{
>> +    XiveTCTX *tctx = XIVE_TCTX_BASE(opaque);
>> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
>> +
>> +    if (xtc->post_load) {
>> +        return xtc->post_load(tctx, version_id);
>> +    }
>> +
>> +    return 0;
>> +}
>> +
>>  static const VMStateDescription vmstate_xive_tctx_base = {
>>      .name = TYPE_XIVE_TCTX,
>>      .version_id = 1,
>>      .minimum_version_id = 1,
>> +    .post_load = vmstate_xive_tctx_post_load,
>>      .fields = (VMStateField[]) {
>>          VMSTATE_BUFFER(regs, XiveTCTX),
>>          VMSTATE_END_OF_LIST()
>> @@ -581,9 +599,37 @@ static const TypeInfo xive_tctx_base_info = {
>>      .class_size    = sizeof(XiveTCTXClass),
>>  };
>>  
>> +static int xive_tctx_post_load(XiveTCTX *tctx, int version_id)
>> +{
>> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
>> +
>> +    /*
>> +     * When we collect the states from KVM XIVE irqchip, we set word2
>> +     * of the thread context to print out the OS CAM line under the
>> +     * QEMU monitor.
>> +     *
>> +     * This breaks migration on a guest using TCG or not using a KVM
>> +     * irqchip. Fix with an extra reset of the thread contexts.
>> +     */
>> +    if (xrc->reset_tctx) {
>> +        xrc->reset_tctx(tctx->xrtr, tctx);
>> +    }
>> +    return 0;
>> +}
>> +
>> +static void xive_tctx_class_init(ObjectClass *klass, void *data)
>> +{
>> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
>> +
>> +    xtc->post_load = xive_tctx_post_load;
>> +}
>> +
>>  static const TypeInfo xive_tctx_info = {
>>      .name          = TYPE_XIVE_TCTX,
>>      .parent        = TYPE_XIVE_TCTX_BASE,
>> +    .instance_size = sizeof(XiveTCTX),
>> +    .class_init    = xive_tctx_class_init,
>> +    .class_size    = sizeof(XiveTCTXClass),
>>  };
>>  
>>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 92ef53743b64..6fac6ca70595 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -359,7 +359,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
>>  
>>  static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>>  {
>> -    return 0;
>> +    return spapr_xive_post_load(spapr->xive, version_id);
>>  }
>>  
>>  /*
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend
  2018-11-29  3:47   ` David Gibson
@ 2018-11-29 16:21     ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:21 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 4:47 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:17AM +0100, Cédric Le Goater wrote:
>> This method will become useful when the new machine supporting both
>> interrupt modes, XIVE and XICS, is introduced. In this machine, the
>> interrupt mode is chosen by the CAS negotiation process and activated
>> after a reset.
>>
>> For the time being, the only thing that can be done in the XIVE reset
>> handler is to map the pages for the TIMA and for the source ESBs.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_irq.h  |  2 ++
>>  include/hw/ppc/spapr_xive.h |  1 +
>>  hw/intc/spapr_xive.c        |  4 +---
>>  hw/ppc/spapr.c              |  2 ++
>>  hw/ppc/spapr_irq.c          | 21 +++++++++++++++++++++
>>  5 files changed, 27 insertions(+), 3 deletions(-)
>>
>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>> index 4e36c0984e1a..34128976e21c 100644
>> --- a/include/hw/ppc/spapr_irq.h
>> +++ b/include/hw/ppc/spapr_irq.h
>> @@ -46,6 +46,7 @@ typedef struct sPAPRIrq {
>>      Object *(*cpu_intc_create)(sPAPRMachineState *spapr, Object *cpu,
>>                                 Error **errp);
>>      int (*post_load)(sPAPRMachineState *spapr, int version_id);
>> +    void (*reset)(sPAPRMachineState *spapr, Error **errp);
>>  } sPAPRIrq;
>>  
>>  extern sPAPRIrq spapr_irq_xics;
>> @@ -57,6 +58,7 @@ int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
>>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
>>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
>>  int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id);
>> +void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp);
>>  
>>  /*
>>   * XICS legacy routines
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index d2517c040958..fa7f3d7718da 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -91,6 +91,7 @@ typedef struct sPAPRMachineState sPAPRMachineState;
>>  void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>>  void spapr_dt_xive(sPAPRXive *xive, int nr_servers, void *fdt,
>>                     uint32_t phandle);
>> +void spapr_xive_mmio_map(sPAPRXive *xive);
>>  
>>  /*
>>   * XIVE KVM models
>> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
>> index c5c0e063dc33..def43160e12a 100644
>> --- a/hw/intc/spapr_xive.c
>> +++ b/hw/intc/spapr_xive.c
>> @@ -51,7 +51,7 @@ void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>  }
>>  
>>  /* Map the ESB pages and the TIMA pages */
>> -static void spapr_xive_mmio_map(sPAPRXive *xive)
>> +void spapr_xive_mmio_map(sPAPRXive *xive)
>>  {
>>      sysbus_mmio_map(SYS_BUS_DEVICE(&xive->source), 0, xive->vc_base);
>>      sysbus_mmio_map(SYS_BUS_DEVICE(&xive->end_source), 0, xive->end_base);
>> @@ -77,8 +77,6 @@ static void spapr_xive_base_reset(DeviceState *dev)
>>      for (i = 0; i < xive->nr_ends; i++) {
>>          xive_end_reset(&xive->endt[i]);
>>      }
>> -
>> -    spapr_xive_mmio_map(xive);
>>  }
>>  
>>  static void spapr_xive_base_instance_init(Object *obj)
>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>> index d1be2579cd9b..013e6ea8aa64 100644
>> --- a/hw/ppc/spapr.c
>> +++ b/hw/ppc/spapr.c
>> @@ -1628,6 +1628,8 @@ static void spapr_machine_reset(void)
>>          spapr_irq_msi_reset(spapr);
>>      }
>>  
>> +    spapr_irq_reset(spapr, &error_fatal);
>> +
>>      qemu_devices_reset();
>>  
>>      /* DRC reset may cause a device to be unplugged. This will cause troubles
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 6fac6ca70595..984c6d60cd9f 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -13,6 +13,7 @@
>>  #include "qapi/error.h"
>>  #include "hw/ppc/spapr.h"
>>  #include "hw/ppc/spapr_xive.h"
>> +#include "hw/ppc/spapr_cpu_core.h"
>>  #include "hw/ppc/xics.h"
>>  #include "sysemu/kvm.h"
>>  
>> @@ -215,6 +216,10 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
>>      return 0;
>>  }
>>  
>> +static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
>> +{
>> +}
>> +
>>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>>  #define SPAPR_IRQ_XICS_NR_MSIS     \
>>      (XICS_IRQ_BASE + SPAPR_IRQ_XICS_NR_IRQS - SPAPR_IRQ_MSI)
>> @@ -232,6 +237,7 @@ sPAPRIrq spapr_irq_xics = {
>>      .dt_populate = spapr_irq_dt_populate_xics,
>>      .cpu_intc_create = spapr_irq_cpu_intc_create_xics,
>>      .post_load   = spapr_irq_post_load_xics,
>> +    .reset       = spapr_irq_reset_xics,
>>  };
>>  
>>   /*
>> @@ -362,6 +368,11 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>>      return spapr_xive_post_load(spapr->xive, version_id);
>>  }
>>  
>> +static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
>> +{
>> +    spapr_xive_mmio_map(spapr->xive);
> 
> It's usually not a good idea to actually construct different
> MemoryRegion's at run time.  Instead map them all in, but disable the
> ones you don't want (with memory_region_set_enabled()).

Yes. I realized that.
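
A rough sketch of the idea, as I understand it. This is a toy model, not
QEMU code: ToyRegion, toy_region_set_enabled(), toy_lookup() and
toy_irq_reset() are made-up names standing in for MemoryRegion,
memory_region_set_enabled() and the address decoding. The assumption is
that all the regions are mapped once and that the reset handler only
flips an 'enabled' flag.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for a mapped MMIO region (hypothetical, not MemoryRegion) */
typedef struct ToyRegion {
    const char *name;
    uint64_t base;
    uint64_t size;
    bool enabled;
} ToyRegion;

/* Mirrors the intent of memory_region_set_enabled(): no map/unmap */
static void toy_region_set_enabled(ToyRegion *r, bool enabled)
{
    r->enabled = enabled;
}

/* Return the enabled region decoding 'addr', or NULL if there is none */
static const ToyRegion *toy_lookup(const ToyRegion *regions, size_t n,
                                   uint64_t addr)
{
    for (size_t i = 0; i < n; i++) {
        const ToyRegion *r = &regions[i];

        if (r->enabled && addr >= r->base && addr - r->base < r->size) {
            return r;
        }
    }
    return NULL;
}

/* Reset handler: the regions stay mapped, only their visibility changes */
static void toy_irq_reset(ToyRegion *regions, size_t n, bool xive_mode)
{
    assert(regions != NULL);

    for (size_t i = 0; i < n; i++) {
        toy_region_set_enabled(&regions[i], xive_mode);
    }
}
```

With this layout, a reboot from a XIVE guest to a XICS guest would call
toy_irq_reset(..., false) and the TIMA and ESB pages would stop decoding
without any region being added or deleted.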

> I think your current version will also leave the TIMA etc. still
> mapped if you reboot from a XIVE guest to a XICS guest.

The sysbus MMIOs are cleared, I think.

C.

>> +}
>> +
>>  /*
>>   * XIVE uses the full IRQ number space. Set it to 8K to be compatible
>>   * with XICS.
>> @@ -383,6 +394,7 @@ sPAPRIrq spapr_irq_xive = {
>>      .dt_populate = spapr_irq_dt_populate_xive,
>>      .cpu_intc_create = spapr_irq_cpu_intc_create_xive,
>>      .post_load   = spapr_irq_post_load_xive,
>> +    .reset       = spapr_irq_reset_xive,
>>  };
>>  
>>  /*
>> @@ -428,6 +440,15 @@ int spapr_irq_post_load(sPAPRMachineState *spapr, int version_id)
>>      return smc->irq->post_load(spapr, version_id);
>>  }
>>  
>> +void spapr_irq_reset(sPAPRMachineState *spapr, Error **errp)
>> +{
>> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>> +
>> +    if (smc->irq->reset) {
>> +        smc->irq->reset(spapr, errp);
>> +    }
>> +}
>> +
>>  /*
>>   * XICS legacy routines - to deprecate one day
>>   */
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset
  2018-11-29  4:03   ` David Gibson
@ 2018-11-29 16:28     ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:28 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:03 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:18AM +0100, Cédric Le Goater wrote:
>> Currently, the interrupt presenter of the VCPU is set at realize
>> time. Setting it at reset will become useful when the new machine
>> supporting both interrupt modes is introduced. In this machine, the
>> interrupt mode is chosen at CAS time and activated after a reset.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_cpu_core.h |  2 ++
>>  hw/ppc/spapr_cpu_core.c         | 26 ++++++++++++++++++++++++++
>>  hw/ppc/spapr_irq.c              | 11 +++++++++++
>>  3 files changed, 39 insertions(+)
>>
>> diff --git a/include/hw/ppc/spapr_cpu_core.h b/include/hw/ppc/spapr_cpu_core.h
>> index 9e2821e4b31f..fc8ea9021656 100644
>> --- a/include/hw/ppc/spapr_cpu_core.h
>> +++ b/include/hw/ppc/spapr_cpu_core.h
>> @@ -53,4 +53,6 @@ static inline sPAPRCPUState *spapr_cpu_state(PowerPCCPU *cpu)
>>      return (sPAPRCPUState *)cpu->machine_data;
>>  }
>>  
>> +void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type);
>> +
>>  #endif
>> diff --git a/hw/ppc/spapr_cpu_core.c b/hw/ppc/spapr_cpu_core.c
>> index 1811cd48db90..529de0b6b9c8 100644
>> --- a/hw/ppc/spapr_cpu_core.c
>> +++ b/hw/ppc/spapr_cpu_core.c
>> @@ -398,3 +398,29 @@ static const TypeInfo spapr_cpu_core_type_infos[] = {
>>  };
>>  
>>  DEFINE_TYPES(spapr_cpu_core_type_infos)
>> +
>> +typedef struct ForeachFindIntCArgs {
>> +    const char *intc_type;
>> +    Object *intc;
>> +} ForeachFindIntCArgs;
>> +
>> +static int spapr_cpu_core_find_intc(Object *child, void *opaque)
>> +{
>> +    ForeachFindIntCArgs *args = opaque;
>> +
>> +    if (object_dynamic_cast(child, args->intc_type)) {
>> +        args->intc = child;
>> +    }
>> +
>> +    return args->intc != NULL;
>> +}
>> +
>> +void spapr_cpu_core_set_intc(PowerPCCPU *cpu, const char *intc_type)
>> +{
>> +    ForeachFindIntCArgs args = { intc_type, NULL };
>> +
>> +    object_child_foreach(OBJECT(cpu), spapr_cpu_core_find_intc, &args);
>> +    g_assert(args.intc);
> 
> We could create some extra links on the cpu to avoid scanning all the
> children, but I guess that's a refinement.

Yes, like an extra ->intc for XIVE. But as we can have only one interrupt
controller active, having only one presenter per CPU seems reasonable.
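
To illustrate the single-presenter approach, here is a toy sketch. It
does not use QOM: ToyChild, ToyCPU and toy_cpu_set_intc() are made-up
names, and matching on a type string is only an assumption standing in
for object_dynamic_cast(). The CPU keeps both presenters as children and
the reset handler repoints the single 'intc' field.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy child object, identified by its type name (hypothetical) */
typedef struct ToyChild {
    const char *type;
} ToyChild;

/* Toy CPU: several presenter children, only one active at a time */
typedef struct ToyCPU {
    ToyChild *children;
    size_t nr_children;
    ToyChild *intc;
} ToyCPU;

/* Scan the children and make the first one of 'intc_type' the active one */
static void toy_cpu_set_intc(ToyCPU *cpu, const char *intc_type)
{
    ToyChild *found = NULL;

    for (size_t i = 0; i < cpu->nr_children && !found; i++) {
        if (strcmp(cpu->children[i].type, intc_type) == 0) {
            found = &cpu->children[i];
        }
    }

    /* As in the patch, a missing presenter is a programming error */
    assert(found);
    cpu->intc = found;
}
```

The scan is linear in the number of children, which is small, so extra
per-mode links would only be a refinement.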

> Then again.. what do we actually use the cpu->intc pointer for in XIVE
> context?  

It is used for the TIMA MMIOs and also when running the matching algorithm
in the XIVE presenter.

C.

> I had a feeling because of the different way notifications
> are handled we might not ever need to go from a cpu handle to the
> associated TCTX.
>
> 
>> +    cpu->intc = args.intc;
>> +}
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 984c6d60cd9f..969efad7e6e9 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -218,6 +218,11 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
>>  
>>  static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
>>  {
>> +    CPUState *cs;
>> +
>> +    CPU_FOREACH(cs) {
>> +        spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
>> +    }
>>  }
>>  
>>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>> @@ -370,6 +375,12 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>>  
>>  static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
>>  {
>> +    CPUState *cs;
>> +
>> +    CPU_FOREACH(cs) {
>> +        spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->xive_tctx_type);
>> +    }
>> +
>>      spapr_xive_mmio_map(spapr->xive);
>>  }
>>  
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine
  2018-11-29  4:08   ` David Gibson
@ 2018-11-29 16:36     ` Cédric Le Goater
  2018-11-29 22:43       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:36 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:08 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:21AM +0100, Cédric Le Goater wrote:
>> This routine gathers all the KVM initialization of the XICS KVM
>> presenter. It will be useful when the initialization of the KVM XICS
>> device is moved to a global routine.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> I dislike calling things *_init() because it's not clear which of
> qemu's many "init" hooks it belongs with.

We could use 'icp_kvm_connect' instead, which is what QEMU is doing:
connecting a QEMU model to a KVM one.

C.

 
>> ---
>>  hw/intc/xics_kvm.c | 29 +++++++++++++++++++----------
>>  1 file changed, 19 insertions(+), 10 deletions(-)
>>
>> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
>> index e8fa9a53aeba..efad1b19d821 100644
>> --- a/hw/intc/xics_kvm.c
>> +++ b/hw/intc/xics_kvm.c
>> @@ -123,11 +123,8 @@ static void icp_kvm_reset(DeviceState *dev)
>>      icp_set_kvm_state(ICP(dev), 1);
>>  }
>>  
>> -static void icp_kvm_realize(DeviceState *dev, Error **errp)
>> +static void icp_kvm_init(ICPState *icp, Error **errp)
>>  {
>> -    ICPState *icp = ICP(dev);
>> -    ICPStateClass *icpc = ICP_GET_CLASS(icp);
>> -    Error *local_err = NULL;
>>      CPUState *cs;
>>      KVMEnabledICP *enabled_icp;
>>      unsigned long vcpu_id;
>> @@ -137,12 +134,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
>>          abort();
>>      }
>>  
>> -    icpc->parent_realize(dev, &local_err);
>> -    if (local_err) {
>> -        error_propagate(errp, local_err);
>> -        return;
>> -    }
>> -
>>      cs = icp->cs;
>>      vcpu_id = kvm_arch_vcpu_id(cs);
>>  
>> @@ -168,6 +159,24 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
>>      QLIST_INSERT_HEAD(&kvm_enabled_icps, enabled_icp, node);
>>  }
>>  
>> +static void icp_kvm_realize(DeviceState *dev, Error **errp)
>> +{
>> +    ICPStateClass *icpc = ICP_GET_CLASS(dev);
>> +    Error *local_err = NULL;
>> +
>> +    icpc->parent_realize(dev, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    icp_kvm_init(ICP(dev), &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +}
>> +
>>  static void icp_kvm_class_init(ObjectClass *klass, void *data)
>>  {
>>      DeviceClass *dc = DEVICE_CLASS(klass);
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-11-29  4:09   ` David Gibson
@ 2018-11-29 16:36     ` Cédric Le Goater
  2018-12-03 15:52       ` Cédric Le Goater
  2018-12-03 17:48     ` Peter Maydell
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:36 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:09 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:20AM +0100, Cédric Le Goater wrote:
>> This will be used to remove the MMIO regions of the POWER9 XIVE
>> interrupt controller when the sPAPR machine is reset.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> 
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> 
> Since the code looks sane.
> 
> However, I think using memory_region_set_enabled() would be a better
> idea for our purposes than actually adding/deleting the subregion.

Yes and we might not need this one anymore. 

Thanks,

C.

>> ---
>>  include/hw/sysbus.h |  1 +
>>  hw/core/sysbus.c    | 10 ++++++++++
>>  2 files changed, 11 insertions(+)
>>
>> diff --git a/include/hw/sysbus.h b/include/hw/sysbus.h
>> index 0b59a3b8d605..bc641984b5da 100644
>> --- a/include/hw/sysbus.h
>> +++ b/include/hw/sysbus.h
>> @@ -92,6 +92,7 @@ qemu_irq sysbus_get_connected_irq(SysBusDevice *dev, int n);
>>  void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr);
>>  void sysbus_mmio_map_overlap(SysBusDevice *dev, int n, hwaddr addr,
>>                               int priority);
>> +void sysbus_mmio_unmap(SysBusDevice *dev, int n);
>>  void sysbus_add_io(SysBusDevice *dev, hwaddr addr,
>>                     MemoryRegion *mem);
>>  MemoryRegion *sysbus_address_space(SysBusDevice *dev);
>> diff --git a/hw/core/sysbus.c b/hw/core/sysbus.c
>> index 7ac36ad3e707..09f202167dcb 100644
>> --- a/hw/core/sysbus.c
>> +++ b/hw/core/sysbus.c
>> @@ -153,6 +153,16 @@ static void sysbus_mmio_map_common(SysBusDevice *dev, int n, hwaddr addr,
>>      }
>>  }
>>  
>> +void sysbus_mmio_unmap(SysBusDevice *dev, int n)
>> +{
>> +    assert(n >= 0 && n < dev->num_mmio);
>> +
>> +    if (dev->mmio[n].addr != (hwaddr)-1) {
>> +        memory_region_del_subregion(get_system_memory(), dev->mmio[n].memory);
>> +        dev->mmio[n].addr = (hwaddr)-1;
>> +    }
>> +}
>> +
>>  void sysbus_mmio_map(SysBusDevice *dev, int n, hwaddr addr)
>>  {
>>      sysbus_mmio_map_common(dev, n, addr, false, 0);
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers
  2018-11-29  4:12   ` David Gibson
@ 2018-11-29 16:40     ` Cédric Le Goater
  2018-11-29 22:44       ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:40 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:12 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:25AM +0100, Cédric Le Goater wrote:
>> Removing RTAS handlers will become necessary when the new pseries
>> machine supporting multiple interrupt modes is introduced.
> 
> I'd prefer this to be done as a separate spapr_rtas_unregister()
> helper, just to improve greppability.

OK. I should propose an inline:

static inline void spapr_rtas_unregister(int token) 
{ 
	spapr_rtas_register(token, NULL, NULL); 
}
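
For reference, a self-contained toy version of the register/unregister
pair. The toy_* names, the token range and the table are made up for the
sketch (the range check is my addition); only the relaxed assert is taken
from the patch below.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token range, for the sketch only */
#define TOY_TOKEN_BASE 0x2000
#define TOY_TOKEN_MAX  0x2010

typedef void (*toy_rtas_fn)(void);

static struct {
    const char *name;
    toy_rtas_fn fn;
} toy_rtas_table[TOY_TOKEN_MAX - TOY_TOKEN_BASE];

static void toy_rtas_register(int token, const char *name, toy_rtas_fn fn)
{
    assert(token >= TOY_TOKEN_BASE && token < TOY_TOKEN_MAX);

    token -= TOY_TOKEN_BASE;

    /* Only a real registration (name != NULL) requires a free slot */
    assert(!name || !toy_rtas_table[token].name);

    toy_rtas_table[token].name = name;
    toy_rtas_table[token].fn = fn;
}

/* Greppable wrapper: clearing a slot is a registration of NULL */
static inline void toy_rtas_unregister(int token)
{
    toy_rtas_register(token, NULL, NULL);
}

/* Dummy handler used to exercise the table */
static void toy_handler(void)
{
}
```

A token can then be registered, unregistered and registered again without
tripping the assert, which is what the mode switch at reset needs.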
  
> 
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/ppc/spapr_rtas.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
>> index d6a0952154ac..e005d5d08151 100644
>> --- a/hw/ppc/spapr_rtas.c
>> +++ b/hw/ppc/spapr_rtas.c
>> @@ -404,7 +404,7 @@ void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn)
>>  
>>      token -= RTAS_TOKEN_BASE;
>>  
>> -    assert(!rtas_table[token].name);
>> +    assert(!name || !rtas_table[token].name);
>>  
>>      rtas_table[token].name = name;
>>      rtas_table[token].fn = fn;
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device
  2018-11-29  4:17   ` David Gibson
@ 2018-11-29 16:41     ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 16:41 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:17 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:26AM +0100, Cédric Le Goater wrote:
>> If a new interrupt mode is chosen by CAS, the machine generates a
>> reset to reconfigure. At this point, the connection with the previous
>> KVM device needs to be closed and a new connection needs to be opened
>> with the KVM device operating the chosen interrupt mode.
>>
>> New routines are introduced to destroy the XICS and XIVE KVM
>> devices. They make use of a new KVM device ioctl which destroys the
>> device and also disconnects the IRQ presenters from the VCPUs.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/spapr_xive.h |  1 +
>>  include/hw/ppc/xics.h       |  1 +
>>  linux-headers/linux/kvm.h   |  2 ++
>>  hw/intc/spapr_xive_kvm.c    | 54 +++++++++++++++++++++++++++++++++++
>>  hw/intc/xics_kvm.c          | 57 +++++++++++++++++++++++++++++++++++++
>>  5 files changed, 115 insertions(+)
>>
>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>> index 1d134a681326..c913c0aed08a 100644
>> --- a/include/hw/ppc/spapr_xive.h
>> +++ b/include/hw/ppc/spapr_xive.h
>> @@ -108,5 +108,6 @@ void spapr_xive_mmio_map(sPAPRXive *xive);
>>  #define XIVE_TCTX_KVM(obj)   OBJECT_CHECK(XiveTCTX, (obj), TYPE_XIVE_TCTX_KVM)
>>  
>>  void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp);
>> +void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp);
>>  
>>  #endif /* PPC_SPAPR_XIVE_H */
>> diff --git a/include/hw/ppc/xics.h b/include/hw/ppc/xics.h
>> index 9958443d1984..a5468c6eb6e3 100644
>> --- a/include/hw/ppc/xics.h
>> +++ b/include/hw/ppc/xics.h
>> @@ -205,6 +205,7 @@ void icp_resend(ICPState *ss);
>>  typedef struct sPAPRMachineState sPAPRMachineState;
>>  
>>  int xics_kvm_init(sPAPRMachineState *spapr, Error **errp);
>> +int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp);
>>  void xics_spapr_init(sPAPRMachineState *spapr);
>>  
>>  Object *icp_create(Object *cpu, const char *type, XICSFabric *xi,
>> diff --git a/linux-headers/linux/kvm.h b/linux-headers/linux/kvm.h
>> index 59fa8d8d7f39..b7a74c58d0db 100644
>> --- a/linux-headers/linux/kvm.h
>> +++ b/linux-headers/linux/kvm.h
> 
> linux-headers updates separate.

yes. 

> 
>> @@ -1309,6 +1309,8 @@ struct kvm_s390_ucas_mapping {
>>  #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
>>  #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
>>  
>> +#define KVM_DESTROY_DEVICE	  _IOWR(KVMIO,  0xf0, struct kvm_create_device)
>> +
>>  /*
>>   * ioctls for vcpu fds
>>   */
>> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
>> index cb2aa6e81274..0672d8bcbc6b 100644
>> --- a/hw/intc/spapr_xive_kvm.c
>> +++ b/hw/intc/spapr_xive_kvm.c
>> @@ -55,6 +55,16 @@ static void kvm_cpu_enable(CPUState *cs)
>>      QLIST_INSERT_HEAD(&kvm_enabled_cpus, enabled_cpu, node);
>>  }
>>  
>> +static void kvm_cpu_disable_all(void)
>> +{
>> +    KVMEnabledCPU *enabled_cpu, *next;
>> +
>> +    QLIST_FOREACH_SAFE(enabled_cpu, &kvm_enabled_cpus, node, next) {
>> +        QLIST_REMOVE(enabled_cpu, node);
>> +        g_free(enabled_cpu);
>> +    }
>> +}
>> +
>>  /*
>>   * XIVE Thread Interrupt Management context (KVM)
>>   */
>> @@ -864,6 +874,50 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>>      kvm_gsi_direct_mapping = true;
>>  }
>>  
>> +void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    struct kvm_create_device xive_destroy_device = {
>> +        .fd = xive->fd,
>> +        .type = KVM_DEV_TYPE_XIVE,
>> +        .flags = 0,
>> +    };
>> +    size_t esb_len = (1ull << xsrc->esb_shift) * xsrc->nr_irqs;
>> +    int rc;
>> +
>> +    /* The KVM XIVE device is not in use */
>> +    if (xive->fd == -1) {
>> +        return;
>> +    }
>> +
>> +    if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
>> +        error_setg(errp,
>> +                   "IRQ_XIVE capability must be present for KVM XIVE device");
>> +        return;
> 
> If we're here, xive->fd, checked above, definitely shouldn't have been
> valid, so you can just assert().

ok.

> 
>> +    }
>> +
>> +    /* Clear the KVM mapping */
>> +    sysbus_mmio_unmap(SYS_BUS_DEVICE(xsrc), 0);
>> +    munmap(xsrc->esb_mmap, esb_len);
>> +    sysbus_mmio_unmap(SYS_BUS_DEVICE(xive), 0);
>> +    munmap(xive->tm_mmap, 4ull << TM_SHIFT);
>> +
>> +    /* Destroy the KVM device. This also clears the VCPU presenters */
>> +    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xive_destroy_device);
>> +    if (rc < 0) {
>> +        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XIVE");
>> +    }
>> +    close(xive->fd);
>> +    xive->fd = -1;
>> +
>> +    kvm_kernel_irqchip = false;
>> +    kvm_msi_via_irqfd_allowed = false;
>> +    kvm_gsi_direct_mapping = false;
>> +
>> +    /* Clear the local list of presenter (hotplug) */
>> +    kvm_cpu_disable_all();
>> +}
>> +
>>  static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
>>  {
>>      sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
>> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
>> index eabc901a4556..a7e3ec32a761 100644
>> --- a/hw/intc/xics_kvm.c
>> +++ b/hw/intc/xics_kvm.c
>> @@ -50,6 +50,16 @@ typedef struct KVMEnabledICP {
>>  static QLIST_HEAD(, KVMEnabledICP)
>>      kvm_enabled_icps = QLIST_HEAD_INITIALIZER(&kvm_enabled_icps);
>>  
>> +static void kvm_disable_icps(void)
>> +{
>> +    KVMEnabledICP *enabled_icp, *next;
>> +
>> +    QLIST_FOREACH_SAFE(enabled_icp, &kvm_enabled_icps, node, next) {
>> +        QLIST_REMOVE(enabled_icp, node);
>> +        g_free(enabled_icp);
>> +    }
>> +}
>> +
>>  /*
>>   * ICP-KVM
>>   */
>> @@ -475,6 +485,53 @@ fail:
>>      return -1;
>>  }
>>  
>> +int xics_kvm_fini(sPAPRMachineState *spapr, Error **errp)
>> +{
>> +    int rc;
>> +    struct kvm_create_device xics_create_device = {
>> +        .fd = kernel_xics_fd,
>> +        .type = KVM_DEV_TYPE_XICS,
>> +        .flags = 0,
>> +    };
>> +
>> +    /* The KVM XICS device is not in use */
>> +    if (kernel_xics_fd == -1) {
>> +        return 0;
>> +    }
>> +
>> +    if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
>> +        error_setg(errp,
>> +                   "KVM and IRQ_XICS capability must be present for KVM XICS device");
>> +        return -1;
> 
> Same comment as above.
> 
>> +    }
>> +
>> +    rc = kvm_vm_ioctl(kvm_state, KVM_DESTROY_DEVICE, &xics_create_device);
>> +    if (rc < 0) {
>> +        error_setg_errno(errp, -rc, "Error on KVM_DESTROY_DEVICE for XICS");
>> +    }
>> +    close(kernel_xics_fd);
>> +    kernel_xics_fd = -1;
>> +
>> +    spapr_rtas_register(RTAS_IBM_SET_XIVE, NULL, 0);
>> +    spapr_rtas_register(RTAS_IBM_GET_XIVE, NULL, 0);
>> +    spapr_rtas_register(RTAS_IBM_INT_OFF, NULL, 0);
>> +    spapr_rtas_register(RTAS_IBM_INT_ON, NULL, 0);
>> +
>> +    kvmppc_define_rtas_kernel_token(0, "ibm,set-xive");
>> +    kvmppc_define_rtas_kernel_token(0, "ibm,get-xive");
>> +    kvmppc_define_rtas_kernel_token(0, "ibm,int-on");
>> +    kvmppc_define_rtas_kernel_token(0, "ibm,int-off");
>> +
>> +    kvm_kernel_irqchip = false;
>> +    kvm_msi_via_irqfd_allowed = false;
>> +    kvm_gsi_direct_mapping = false;
>> +
>> +    /* Clear the presenter from the VCPUs */
>> +    kvm_disable_icps();
>> +
>> +    return rc;
>> +}
>> +
>>  static void xics_kvm_register_types(void)
>>  {
>>      type_register_static(&ics_kvm_info);
> 

* Re: [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine
  2018-11-29  4:22   ` David Gibson
@ 2018-11-29 17:07     ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 17:07 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:22 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:27AM +0100, Cédric Le Goater wrote:
>> The interrupt mode is chosen by the CAS negotiation process and
>> activated after a reset to take into account the required changes in
>> the machine. This brings new constraints on how the associated KVM IRQ
>> device is initialized.
>>
>> Currently, each model takes care of the initialization of the KVM
>> device in their realize method but this is not possible anymore as the
>> initialization needs to be done globally when the interrupt mode is known,
>> i.e. when the machine is reset. It also means that we need a way to
>> delete a KVM device when another mode is chosen.
>>
>> Also, to support migration, the QEMU objects holding the state to
>> transfer should always be available but not necessarily activated.
>>
>> The overall approach of this proposal is to initialize both interrupt
>> modes at the QEMU level and keep the IRQ number space in sync to allow
>> switching from one mode to another. For the KVM side of things, the
>> whole initialization of the KVM device, sources and presenters, is
>> grouped in a single routine. The XICS and XIVE sPAPR IRQ reset
>> handlers are modified accordingly to handle the init and delete
>> sequences of the KVM device. The post_load handlers also are, to take
>> into account a possible change of interrupt mode after transfer.
>>
>> As KVM is now initialized at reset, we lose the possibility to
>> fall back to the QEMU emulated mode in case of failure and failures
>> become fatal to the machine.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/intc/spapr_xive_kvm.c | 48 +++++++++++-----------
>>  hw/intc/xics_kvm.c       | 18 ++++++---
>>  hw/ppc/spapr_irq.c       | 86 +++++++++++++++++++++++++++++-----------
>>  3 files changed, 98 insertions(+), 54 deletions(-)
>>
>> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
>> index 0672d8bcbc6b..9c7d36f51e3d 100644
>> --- a/hw/intc/spapr_xive_kvm.c
>> +++ b/hw/intc/spapr_xive_kvm.c
>> @@ -148,7 +148,6 @@ static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
>>  
>>  static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
>>  {
>> -    XiveTCTX *tctx = XIVE_TCTX_KVM(dev);
>>      XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(dev);
>>      Error *local_err = NULL;
>>  
>> @@ -157,12 +156,6 @@ static void xive_tctx_kvm_realize(DeviceState *dev, Error **errp)
>>          error_propagate(errp, local_err);
>>          return;
>>      }
>> -
>> -    xive_tctx_kvm_init(tctx, &local_err);
>> -    if (local_err) {
>> -        error_propagate(errp, local_err);
>> -        return;
>> -    }
>>  }
>>  
>>  static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
>> @@ -222,12 +215,9 @@ static void xive_source_kvm_init(XiveSource *xsrc, Error **errp)
>>  
>>  static void xive_source_kvm_reset(DeviceState *dev)
>>  {
>> -    XiveSource *xsrc = XIVE_SOURCE_KVM(dev);
>>      XiveSourceClass *xsc = XIVE_SOURCE_BASE_GET_CLASS(dev);
>>  
>>      xsc->parent_reset(dev);
>> -
>> -    xive_source_kvm_init(xsrc, &error_fatal);
>>  }
>>  
>>  /*
>> @@ -346,12 +336,6 @@ static void xive_source_kvm_realize(DeviceState *dev, Error **errp)
>>  
>>      xsrc->qirqs = qemu_allocate_irqs(xive_source_kvm_set_irq, xsrc,
>>                                       xsrc->nr_irqs);
>> -
>> -    xive_source_kvm_mmap(xsrc, &local_err);
>> -    if (local_err) {
>> -        error_propagate(errp, local_err);
>> -        return;
>> -    }
>>  }
>>  
>>  static void xive_source_kvm_unrealize(DeviceState *dev, Error **errp)
>> @@ -823,6 +807,7 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>>  {
>>      Error *local_err = NULL;
>>      size_t tima_len;
>> +    CPUState *cs;
>>  
>>      if (!kvm_enabled() || !kvmppc_has_cap_xive()) {
>>          error_setg(errp,
>> @@ -850,7 +835,18 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>>          return;
>>      }
>>  
>> -    /* Let the XiveSource KVM model handle the mapping for the moment */
>> +    xive_source_kvm_mmap(&xive->source, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +
>> +    /* Create the KVM interrupt sources */
>> +    xive_source_kvm_init(&xive->source, &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>>  
>>      /* TIMA KVM mapping
>>       *
>> @@ -869,6 +865,17 @@ void spapr_xive_kvm_init(sPAPRXive *xive, Error **errp)
>>                                        "xive.tima", tima_len, xive->tm_mmap);
>>      sysbus_init_mmio(SYS_BUS_DEVICE(xive), &xive->tm_mmio);
>>  
>> +    /* Connect the presenters to the VCPU */
>> +    CPU_FOREACH(cs) {
>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>> +
>> +        xive_tctx_kvm_init(XIVE_TCTX_BASE(cpu->intc), &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>> +    }
>> +
>>      kvm_kernel_irqchip = true;
>>      kvm_msi_via_irqfd_allowed = true;
>>      kvm_gsi_direct_mapping = true;
>> @@ -920,16 +927,9 @@ void spapr_xive_kvm_fini(sPAPRXive *xive, Error **errp)
>>  
>>  static void spapr_xive_kvm_realize(DeviceState *dev, Error **errp)
>>  {
>> -    sPAPRXive *xive = SPAPR_XIVE_KVM(dev);
>>      sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(dev);
>>      Error *local_err = NULL;
>>  
>> -    spapr_xive_kvm_init(xive, &local_err);
>> -    if (local_err) {
>> -        error_propagate(errp, local_err);
>> -        return;
>> -    }
>> -
>>      /* Initialize the source and the local routing tables */
>>      sxc->parent_realize(dev, &local_err);
>>      if (local_err) {
>> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
>> index a7e3ec32a761..c89fa943847c 100644
>> --- a/hw/intc/xics_kvm.c
>> +++ b/hw/intc/xics_kvm.c
>> @@ -190,12 +190,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
>>          error_propagate(errp, local_err);
>>          return;
>>      }
>> -
>> -    icp_kvm_init(ICP(dev), &local_err);
>> -    if (local_err) {
>> -        error_propagate(errp, local_err);
>> -        return;
>> -    }
>>  }
>>  
>>  static void icp_kvm_class_init(ObjectClass *klass, void *data)
>> @@ -427,6 +421,8 @@ static void rtas_dummy(PowerPCCPU *cpu, sPAPRMachineState *spapr,
>>  int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
>>  {
>>      int rc;
>> +    CPUState *cs;
>> +    Error *local_err = NULL;
>>  
>>      if (!kvm_enabled() || !kvm_check_extension(kvm_state, KVM_CAP_IRQ_XICS)) {
>>          error_setg(errp,
>> @@ -475,6 +471,16 @@ int xics_kvm_init(sPAPRMachineState *spapr, Error **errp)
>>      kvm_msi_via_irqfd_allowed = true;
>>      kvm_gsi_direct_mapping = true;
>>  
>> +    /* Connect the presenters to the VCPU */
>> +    CPU_FOREACH(cs) {
>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>> +
>> +        icp_kvm_init(ICP(cpu->intc), &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            goto fail;
>> +        }
>> +    }
>>      return 0;
>>  
>>  fail:
>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>> index 79ead51c630d..f1720a8dda33 100644
>> --- a/hw/ppc/spapr_irq.c
>> +++ b/hw/ppc/spapr_irq.c
>> @@ -98,20 +98,14 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
>>      MachineState *machine = MACHINE(spapr);
>>      Error *local_err = NULL;
>>  
>> -    if (kvm_enabled()) {
>> -        if (machine_kernel_irqchip_allowed(machine) &&
>> -            !xics_kvm_init(spapr, &local_err)) {
>> -            spapr->icp_type = TYPE_KVM_ICP;
>> -            spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs,
>> -                                          &local_err);
>> -        }
>> -        if (machine_kernel_irqchip_required(machine) && !spapr->ics) {
>> -            error_prepend(&local_err,
>> -                          "kernel_irqchip requested but unavailable: ");
>> -            goto error;
>> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
>> +        spapr->icp_type = TYPE_KVM_ICP;
>> +        spapr->ics = spapr_ics_create(spapr, TYPE_ICS_KVM, nr_irqs,
>> +                                      &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>>          }
>> -        error_free(local_err);
>> -        local_err = NULL;
>>      }
>>  
>>      if (!spapr->ics) {
>> @@ -119,10 +113,11 @@ static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
>>          spapr->icp_type = TYPE_ICP;
>>          spapr->ics = spapr_ics_create(spapr, TYPE_ICS_SIMPLE, nr_irqs,
>>                                        &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            return;
>> +        }
>>      }
>> -
>> -error:
>> -    error_propagate(errp, local_err);
>>  }
>>  
>>  #define ICS_IRQ_FREE(ics, srcno)   \
>> @@ -218,11 +213,28 @@ static int spapr_irq_post_load_xics(sPAPRMachineState *spapr, int version_id)
>>  
>>  static void spapr_irq_reset_xics(sPAPRMachineState *spapr, Error **errp)
>>  {
>> +    MachineState *machine = MACHINE(spapr);
>>      CPUState *cs;
>> +    Error *local_err = NULL;
>>  
>>      CPU_FOREACH(cs) {
>>          spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->icp_type);
>>      }
>> +
>> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
> 
> Aren't both devices '_fini'-ed by the machine level reset handler,

This is it. Maybe you mean that we should destroy all KVM devices
at the machine-level reset before calling the sPAPR IRQ reset
handler, which would then only do the KVM init?

I agree there are a few too many of these _fini calls.


> why does it need a _fini here as well as an init?

Each single interrupt mode machine (xics, xive) starts from a clean 
KVM state by destroying the KVM device and recreating it at reset.
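
To make the fini-then-init sequence concrete, here is a minimal sketch.
It is not the QEMU code: ToyKvmDev and the toy_* helpers are hypothetical
stand-ins for the KVM device and for the *_kvm_fini()/*_kvm_init()
routines, and the "descriptor" is just a made-up integer.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for an in-kernel IRQ device; fd == -1 means
 * "not created", mirroring the kernel_xics_fd / xive->fd convention. */
typedef struct ToyKvmDev {
    int fd;
    int generation;     /* bumped on every (re)creation */
} ToyKvmDev;

/* Destroying a device that was never created is a no-op. */
static int toy_kvm_fini(ToyKvmDev *dev)
{
    if (dev->fd == -1) {
        return 0;
    }
    dev->fd = -1;
    return 0;
}

/* Creation refuses to run on top of a live device. */
static int toy_kvm_init(ToyKvmDev *dev)
{
    if (dev->fd != -1) {
        return -1;
    }
    dev->fd = 3 + dev->generation;   /* pretend file descriptor */
    dev->generation++;
    return 0;
}

/* Reset handler: always start from a clean state, then recreate. */
static int toy_irq_reset(ToyKvmDev *dev)
{
    if (toy_kvm_fini(dev) < 0) {
        fprintf(stderr, "toy fini failed\n");
        return -1;
    }
    return toy_kvm_init(dev);
}
```

Calling toy_irq_reset() on a device with fd == -1 creates it, and every
further call destroys and recreates it, which is the behaviour described
above for the single-mode machines.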

> 
>> +        xics_kvm_fini(spapr, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            error_prepend(errp, "KVM XICS fini failed: ");
>> +            return;
>> +        }
>> +        xics_kvm_init(spapr, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            error_prepend(errp, "KVM XICS init failed: ");
>> +            return;
>> +        }
>> +    }
>>  }
>>  
>>  #define SPAPR_IRQ_XICS_NR_IRQS     0x1000
>> @@ -288,10 +300,8 @@ static void spapr_irq_init_xive(sPAPRMachineState *spapr, int nr_irqs,
>>          spapr->xive_tctx_type = TYPE_XIVE_TCTX_KVM;
>>          spapr->xive = spapr_xive_create(spapr, TYPE_SPAPR_XIVE_KVM, nr_irqs,
>>                                          nr_servers, &local_err);
>> -
>> -        if (local_err && machine_kernel_irqchip_required(machine)) {
>> +        if (local_err) {
>>              error_propagate(errp, local_err);
>> -            error_prepend(errp, "kernel_irqchip requested but init failed : ");
>>              return;
>>          }
>>  
>> @@ -375,12 +385,29 @@ static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
>>  
>>  static void spapr_irq_reset_xive(sPAPRMachineState *spapr, Error **errp)
>>  {
>> +    MachineState *machine = MACHINE(spapr);
>>      CPUState *cs;
>> +    Error *local_err = NULL;
>>  
>>      CPU_FOREACH(cs) {
>>          spapr_cpu_core_set_intc(POWERPC_CPU(cs), spapr->xive_tctx_type);
>>      }
>>  
>> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
>> +        spapr_xive_kvm_fini(spapr->xive, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            error_prepend(errp, "KVM XIVE fini failed: ");
>> +            return;
>> +        }
>> +        spapr_xive_kvm_init(spapr->xive, &local_err);
>> +        if (local_err) {
>> +            error_propagate(errp, local_err);
>> +            error_prepend(errp, "KVM XIVE init failed: ");
>> +            return;
>> +        }
>> +    }
>> +
>>      spapr_xive_mmio_map(spapr->xive);
>>  }
>>  
>> @@ -432,11 +459,6 @@ static void spapr_irq_init_dual(sPAPRMachineState *spapr, int nr_irqs,
>>  {
>>      Error *local_err = NULL;
>>  
>> -    if (kvm_enabled()) {
>> -        error_setg(errp, "No KVM support for the 'dual' machine");
>> -        return;
>> -    }
>> -
>>      spapr_irq_xics.init(spapr, spapr_irq_xics.nr_irqs, nr_servers, &local_err);
>>      if (local_err) {
>>          error_propagate(errp, local_err);
>> @@ -510,10 +532,15 @@ static Object *spapr_irq_cpu_intc_create_dual(sPAPRMachineState *spapr,
>>  
>>  static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
>>  {
>> +    MachineState *machine = MACHINE(spapr);
>> +
>>      /*
>>       * Force a reset of the XIVE backend after migration.
>>       */
>>      if (spapr_ovec_test(spapr->ov5_cas, OV5_XIVE_EXPLOIT)) {
>> +        if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
>> +            xics_kvm_fini(spapr, &error_fatal);
>> +        }
>>          spapr_irq_xive.reset(spapr, &error_fatal);
>>      }
>>  
>> @@ -522,6 +549,17 @@ static int spapr_irq_post_load_dual(sPAPRMachineState *spapr, int version_id)
>>  
>>  static void spapr_irq_reset_dual(sPAPRMachineState *spapr, Error **errp)
>>  {
>> +    MachineState *machine = MACHINE(spapr);
>> +
>> +    /*
>> +     * Destroy all the KVM IRQ devices. This also clears the VCPU
>> +     * presenters
>> +     */
>> +    if (kvm_enabled() && machine_kernel_irqchip_allowed(machine)) {
>> +        xics_kvm_fini(spapr, &error_fatal);
>> +        spapr_xive_kvm_fini(spapr->xive, &error_fatal);
>> +    }
>> +
>>      /*
>>       * Only XICS is reseted at startup as it is the default interrupt
>>       * mode.
> 

* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-29  3:39         ` Benjamin Herrenschmidt
@ 2018-11-29 17:51           ` Cédric Le Goater
  2018-11-30  1:09             ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 17:51 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: qemu-ppc, qemu-devel

On 11/29/18 4:39 AM, Benjamin Herrenschmidt wrote:
> On Thu, 2018-11-29 at 11:47 +1100, David Gibson wrote:
>>
>> 1) read/write accessors which take a word number

ok for single word updates of the structures.

>> 2) A "get" accessor which copies the whole structure, 

ok

>> but "write"
>> accessor which takes a word number.  The asymmetry is a bit ugly, but
>> it's the non-atomic writeback of the whole structure which I'm most
>> uncomfortable with.

And how would you make the update of the whole structure in RAM look
"atomic" under QEMU?

> It shouldn't be a big deal though, there are HW facilities to access
> the structures "atomically" anyway, due to the caching done by the
> XIVE.

Are you suggesting that the PowerNV model should update the VPC, EQC and
IVC in the VST accessors before updating the VSTs in RAM?
>> 3) A map/unmap interface which gives you / releases a pointer to the
>> "live" structure.  For powernv that would become
>> address_space_map()/unmap().  

yes.

>> For PAPR it would just be return pointer / no-op.

ok

I think I will introduce these handlers progressively in the patchset.
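
For the record, the three accessor styles discussed above could look
roughly like the sketch below. Everything here is hypothetical (ToyDesc,
ToyTable and the toy_* names are not part of the patchset); the table is
plain host memory, so map/unmap degenerates to "return pointer / no-op"
as suggested for PAPR.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 4-word descriptor standing in for an END or NVT. */
typedef struct ToyDesc {
    uint32_t w[4];
} ToyDesc;

typedef struct ToyTable {
    ToyDesc *entries;   /* backing store */
    uint32_t nr;        /* number of entries */
} ToyTable;

/* Option 1: read/write accessors taking a word number. */
static int toy_read_word(const ToyTable *t, uint32_t idx, unsigned word,
                         uint32_t *val)
{
    if (idx >= t->nr || word >= 4) {
        return -1;
    }
    *val = t->entries[idx].w[word];
    return 0;
}

static int toy_write_word(ToyTable *t, uint32_t idx, unsigned word,
                          uint32_t val)
{
    if (idx >= t->nr || word >= 4) {
        return -1;
    }
    t->entries[idx].w[word] = val;
    return 0;
}

/* Option 2: "get" copies the whole structure; writes stay per-word. */
static int toy_get(const ToyTable *t, uint32_t idx, ToyDesc *out)
{
    if (idx >= t->nr) {
        return -1;
    }
    *out = t->entries[idx];
    return 0;
}

/* Option 3: map/unmap hands out a pointer to the live structure. */
static ToyDesc *toy_map(ToyTable *t, uint32_t idx)
{
    return idx < t->nr ? &t->entries[idx] : NULL;
}

static void toy_unmap(ToyTable *t, ToyDesc *d)
{
    (void)t;    /* nothing to release for a host-memory table */
    (void)d;
}
```

With option 2, a caller would take a snapshot with toy_get() and write
back only the words it changed with toy_write_word(), which avoids the
non-atomic writeback of the whole structure.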

Thanks,

C. 

* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-22  5:13   ` David Gibson
  2018-11-22 21:58     ` Cédric Le Goater
@ 2018-11-29 22:06     ` Cédric Le Goater
  2018-11-30  1:04       ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-29 22:06 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/22/18 6:13 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
>> The Event Notification Descriptor also contains two Event State
>> Buffers providing further coalescing of interrupts, one for the
>> notification event (ESn) and one for the escalation events (ESe). A
>> MMIO page is assigned for each to control the EOI through loads
>> only. Stores are not allowed.
>>
>> The END ESBs are modeled through an object resembling the 'XiveSource'.
>> It is stateless as the END state bits are backed into the XiveEND
>> structure under the XiveRouter and the MMIO accesses follow the same
>> rules as for the standard source ESBs.
>>
>> END ESBs are not supported by the Linux drivers, either on OPAL or on
>> sPAPR. Nevertheless, this model provides a means to study the question
>> in the future and validates the XIVE model a bit further.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  include/hw/ppc/xive.h |  20 ++++++
>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
>>  2 files changed, 178 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index ce62aaf28343..24301bf2076d 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>                          XiveEND *end);
>>  
>> +/*
>> + * XIVE END ESBs
>> + */
>> +
>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
>> +#define XIVE_END_SOURCE(obj) \
>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> 
> Is there a particular reason to make this a full QOM object, rather
> than just embedding it in the XiveRouter?

Coming back to this question because removing the chip_id from the
router is a problem for the END triggering, at least with the current
design. See below for the comment.

>> +typedef struct XiveENDSource {
>> +    SysBusDevice parent;
>> +
>> +    uint32_t        nr_ends;
>> +
>> +    /* ESB memory region */
>> +    uint32_t        esb_shift;
>> +    MemoryRegion    esb_mmio;
>> +
>> +    XiveRouter      *xrtr;
>> +} XiveENDSource;
>> +
>>  /*
>>   * For legacy compatibility, the exceptions define up to 256 different
>>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index 9cb001e7b540..5a8882d47a98 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>       * even futher coalescing in the Router
>>       */
>>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
>> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>> -        return;
>> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
>> +        bool notify = xive_esb_trigger(&pq);
>> +
>> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
>> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
>> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
>> +        }
>> +
>> +        /* ESn[Q]=1 : end of notification */
>> +        if (!notify) {
>> +            return;
>> +        }
>>      }
>>  
>>      /*
>> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>>  }
>>  
>> +/*
>> + * END ESB MMIO loads
>> + */
>> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
>> +    XiveRouter *xrtr = xsrc->xrtr;
>> +    uint32_t offset = addr & 0xFFF;
>> +    uint8_t end_blk;
>> +    uint32_t end_idx;
>> +    XiveEND end;
>> +    uint32_t end_esmask;
>> +    uint8_t pq;
>> +    uint64_t ret = -1;
>> +
>> +    end_blk = xrtr->chip_id;
>> +    end_idx = addr >> (xsrc->esb_shift + 1);
>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {

The current END accessors require a block identifier, hence xrtr->chip_id, 
but in this case, we don't really need it because we are using the ENDT 
local to the router/chip. 

I don't know how to handle this case simply without keeping chip_id :/
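
To illustrate what the MMIO address alone gives us: the sketch below
redoes the decoding of the quoted read handler with toy names
(ToyEndRef and toy_end_esb_decode are hypothetical). It assumes that
addr_is_even() tests the page-select bit at esb_shift, and that the
shift values are 12 for 4K pages and 16 for 64K pages. The END index
and the ESn/ESe selector come out of the address; the block number does
not, which is why the handler falls back on xrtr->chip_id.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* What can be recovered from an END ESB MMIO address. */
typedef struct ToyEndRef {
    uint32_t end_idx;   /* index in the local END table */
    bool     is_esn;    /* even page: ESn, odd page: ESe */
    uint32_t offset;    /* offset selecting the ESB operation */
} ToyEndRef;

static ToyEndRef toy_end_esb_decode(uint64_t addr, uint32_t esb_shift)
{
    ToyEndRef ref;

    /* Each END owns an even/odd pair of pages of size 1 << esb_shift. */
    ref.end_idx = (uint32_t)(addr >> (esb_shift + 1));
    ref.is_esn  = !((addr >> esb_shift) & 1);
    ref.offset  = (uint32_t)(addr & 0xFFF);
    return ref;
}
```

For example, with 64K pages the address 0x70800 decodes to END index 3,
odd page (ESe), offset 0x800.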

C.

>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: No END %x/%x\n", end_blk,
>> +                      end_idx);
>> +        return -1;
>> +    }
>> +
>> +    if (!(end.w0 & END_W0_VALID)) {
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: END %x/%x is invalid\n",
>> +                      end_blk, end_idx);
>> +        return -1;
>> +    }
>> +
>> +    end_esmask = addr_is_even(addr, xsrc->esb_shift) ? END_W1_ESn : END_W1_ESe;
>> +    pq = GETFIELD(end_esmask, end.w1);
>> +
>> +    switch (offset) {
>> +    case XIVE_ESB_LOAD_EOI ... XIVE_ESB_LOAD_EOI + 0x7FF:
>> +        ret = xive_esb_eoi(&pq);
>> +
>> +        /* Forward the source event notification for routing ?? */
>> +        break;
>> +
>> +    case XIVE_ESB_GET ... XIVE_ESB_GET + 0x3FF:
>> +        ret = pq;
>> +        break;
>> +
>> +    case XIVE_ESB_SET_PQ_00 ... XIVE_ESB_SET_PQ_00 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_01 ... XIVE_ESB_SET_PQ_01 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_10 ... XIVE_ESB_SET_PQ_10 + 0x0FF:
>> +    case XIVE_ESB_SET_PQ_11 ... XIVE_ESB_SET_PQ_11 + 0x0FF:
>> +        ret = xive_esb_set(&pq, (offset >> 8) & 0x3);
>> +        break;
>> +    default:
>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid END ESB load addr %d\n",
>> +                      offset);
>> +        return -1;
>> +    }
>> +
>> +    if (pq != GETFIELD(end_esmask, end.w1)) {
>> +        end.w1 = SETFIELD(end_esmask, end.w1, pq);
>> +        xive_router_set_end(xrtr, end_blk, end_idx, &end);
>> +    }
> 
> We can probably share some more code with XiveSource here, but that's
> something that can be refined later.
> 
>> +
>> +    return ret;
>> +}
>> +
>> +/*
>> + * END ESB MMIO stores are invalid
>> + */
>> +static void xive_end_source_write(void *opaque, hwaddr addr,
>> +                                  uint64_t value, unsigned size)
>> +{
>> +    qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid ESB write addr 0x%"
>> +                  HWADDR_PRIx"\n", addr);
>> +}
>> +
>> +static const MemoryRegionOps xive_end_source_ops = {
>> +    .read = xive_end_source_read,
>> +    .write = xive_end_source_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +static void xive_end_source_realize(DeviceState *dev, Error **errp)
>> +{
>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(dev);
>> +    Object *obj;
>> +    Error *local_err = NULL;
>> +
>> +    obj = object_property_get_link(OBJECT(dev), "xive", &local_err);
>> +    if (!obj) {
>> +        error_propagate(errp, local_err);
>> +        error_prepend(errp, "required link 'xive' not found: ");
>> +        return;
>> +    }
>> +
>> +    xsrc->xrtr = XIVE_ROUTER(obj);
>> +
>> +    if (!xsrc->nr_ends) {
>> +        error_setg(errp, "Number of interrupt needs to be greater than 0");
>> +        return;
>> +    }
>> +
>> +    if (xsrc->esb_shift != XIVE_ESB_4K &&
>> +        xsrc->esb_shift != XIVE_ESB_64K) {
>> +        error_setg(errp, "Invalid ESB shift setting");
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Each END is assigned an even/odd pair of MMIO pages, the even page
>> +     * manages the ESn field while the odd page manages the ESe field.
>> +     */
>> +    memory_region_init_io(&xsrc->esb_mmio, OBJECT(xsrc),
>> +                          &xive_end_source_ops, xsrc, "xive.end",
>> +                          (1ull << (xsrc->esb_shift + 1)) * xsrc->nr_ends);
>> +    sysbus_init_mmio(SYS_BUS_DEVICE(dev), &xsrc->esb_mmio);
>> +}
>> +
>> +static Property xive_end_source_properties[] = {
>> +    DEFINE_PROP_UINT32("nr-ends", XiveENDSource, nr_ends, 0),
>> +    DEFINE_PROP_UINT32("shift", XiveENDSource, esb_shift, XIVE_ESB_64K),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void xive_end_source_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +
>> +    dc->desc    = "XIVE END Source";
>> +    dc->props   = xive_end_source_properties;
>> +    dc->realize = xive_end_source_realize;
>> +}
>> +
>> +static const TypeInfo xive_end_source_info = {
>> +    .name          = TYPE_XIVE_END_SOURCE,
>> +    .parent        = TYPE_SYS_BUS_DEVICE,
>> +    .instance_size = sizeof(XiveENDSource),
>> +    .class_init    = xive_end_source_class_init,
>> +};
>> +
>>  /*
>>   * XIVE Fabric
>>   */
>> @@ -720,6 +875,7 @@ static void xive_register_types(void)
>>      type_register_static(&xive_source_info);
>>      type_register_static(&xive_fabric_info);
>>      type_register_static(&xive_router_info);
>> +    type_register_static(&xive_end_source_info);
>>  }
>>  
>>  type_init(xive_register_types)
> 

* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-29 14:37         ` Cédric Le Goater
@ 2018-11-29 22:36           ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29 22:36 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Nov 29, 2018 at 03:37:12PM +0100, Cédric Le Goater wrote:
> [ ... ]
>  
> >>> With that approach it might make sense to embed it
> >>> here, rather than subclassing it 
> >>
> >> ah. why not indeed. I have to think about it. 
> >>
> >>> (the old composition vs. inheritance debate).
> >>
> >> he. but then the XiveRouter needs to become a QOM interface if we 
> >> want to be able to define XIVE table accessors for sPAPRXive. See
> >> the  spapr_xive_class_init() routine.
> > 
> > Erm.. I'm not really sure what you're getting at here.
> 
> if we compose a sPAPRXive object with a XiveSource object and a XiveRouter 
> object, how will the  XiveRouter object access the XIVE internal tables 
> which are in the sPAPRXive object ? 
> 
> Thinking of it, I am not sure a QOM interface would solve the problem now. 
> So we are stuck with inheritance.

Uh.. true.  There are ways around it, but it gets a bit complicated.

> [ ... ]
> 
> >>>> +qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn)
> >>>> +{
> >>>> +    XiveSource *xsrc = &xive->source;
> >>>> +
> >>>> +    if (lisn >= xive->nr_irqs) {
> >>>> +        return NULL;
> >>>> +    }
> >>>> +
> >>>> +    if (!(xive->eat[lisn].w & EAS_VALID)) {
> >>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: invalid LISN %x\n", lisn);
> >>>
> >>> I don't think this is a guest error - getting the qirq by number
> >>> should generally be something qemu code does.
> >>
> >> Even if the IRQ was not defined by the machine ? The EAS_VALID bit is
> >> raised when the IRQ is enabled at the XIVE level, which means that the
> >> IRQ number has been claimed by some device of the machine. You cannot
> >> get a qirq by number for  some random IRQ number. Can you ?
> > 
> > Well, you shouldn't.  The point is that it is qemu code (specifically
> > the machine setup stuff) that will be calling this, and it shouldn't
> > be calling it with irq numbers that haven't been
> > enabled/claimed/whatever.
> 
> so it should be an assert ?

Yes.
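
Roughly along these lines, as a sketch with toy types only (ToyXive,
TOY_EAS_VALID and toy_qirq are hypothetical, and an int stands in for a
qemu_irq). The out-of-range check is kept as in the quoted code; only
the "not claimed" case turns into an assertion, since it can only be
hit by a bug in the machine setup code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_EAS_VALID 0x1u      /* hypothetical "IRQ claimed" flag */

typedef struct ToyXive {
    uint32_t        nr_irqs;
    const uint32_t *eat;        /* one flag word per LISN */
    int            *qirqs;      /* stand-in for the qemu_irq array */
} ToyXive;

static int *toy_qirq(const ToyXive *xive, uint32_t lisn)
{
    if (lisn >= xive->nr_irqs) {
        return NULL;
    }

    /* The caller is QEMU code: an unclaimed LISN is a programming error. */
    assert(xive->eat[lisn] & TOY_EAS_VALID);

    return &xive->qirqs[lisn];
}
```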
-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 15/36] spapr: introdude a new machine IRQ backend for XIVE
  2018-11-29 15:34         ` Cédric Le Goater
@ 2018-11-29 22:39           ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29 22:39 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Nov 29, 2018 at 04:34:51PM +0100, Cédric Le Goater wrote:
> On 11/29/18 2:07 AM, David Gibson wrote:
> > On Wed, Nov 28, 2018 at 06:16:58PM +0100, Cédric Le Goater wrote:
> >> On 11/28/18 4:28 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:57:08AM +0100, Cédric Le Goater wrote:
> >>>> The XIVE IRQ backend uses the same layout as the new XICS backend but
> >>>> covers the full range of the IRQ number space. The IRQ numbers for the
> >>>> CPU IPIs are allocated at the bottom of this space, below 4K, to
> >>>> preserve compatibility with XICS which does not use that range.
> >>>>
> >>>> This should be enough given that the maximum number of CPUs is 1024
> >>>> for the sPAPR machine under QEMU. For the record, the biggest POWER8
> >>>> or POWER9 system has a maximum of 1536 HW threads (16 sockets, 192
> >>>> cores, SMT8).
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  include/hw/ppc/spapr.h     |   2 +
> >>>>  include/hw/ppc/spapr_irq.h |   7 ++-
> >>>>  hw/ppc/spapr.c             |   2 +-
> >>>>  hw/ppc/spapr_irq.c         | 119 ++++++++++++++++++++++++++++++++++++-
> >>>>  4 files changed, 124 insertions(+), 6 deletions(-)
> >>>>
> >>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>> index 6279711fe8f7..1fbc2663e06c 100644
> >>>> --- a/include/hw/ppc/spapr.h
> >>>> +++ b/include/hw/ppc/spapr.h
> >>>> @@ -16,6 +16,7 @@ typedef struct sPAPREventLogEntry sPAPREventLogEntry;
> >>>>  typedef struct sPAPREventSource sPAPREventSource;
> >>>>  typedef struct sPAPRPendingHPT sPAPRPendingHPT;
> >>>>  typedef struct ICSState ICSState;
> >>>> +typedef struct sPAPRXive sPAPRXive;
> >>>>  
> >>>>  #define HPTE64_V_HPTE_DIRTY     0x0000000000000040ULL
> >>>>  #define SPAPR_ENTRY_POINT       0x100
> >>>> @@ -175,6 +176,7 @@ struct sPAPRMachineState {
> >>>>      const char *icp_type;
> >>>>      int32_t irq_map_nr;
> >>>>      unsigned long *irq_map;
> >>>> +    sPAPRXive  *xive;
> >>>>  
> >>>>      bool cmd_line_caps[SPAPR_CAP_NUM];
> >>>>      sPAPRCapabilities def, eff, mig;
> >>>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> >>>> index 0e9229bf219e..c854ae527808 100644
> >>>> --- a/include/hw/ppc/spapr_irq.h
> >>>> +++ b/include/hw/ppc/spapr_irq.h
> >>>> @@ -13,6 +13,7 @@
> >>>>  /*
> >>>>   * IRQ range offsets per device type
> >>>>   */
> >>>> +#define SPAPR_IRQ_IPI        0x0
> >>>>  #define SPAPR_IRQ_EPOW       0x1000  /* XICS_IRQ_BASE offset */
> >>>>  #define SPAPR_IRQ_HOTPLUG    0x1001
> >>>>  #define SPAPR_IRQ_VIO        0x1100  /* 256 VIO devices */
> >>>> @@ -33,7 +34,8 @@ typedef struct sPAPRIrq {
> >>>>      uint32_t    nr_irqs;
> >>>>      uint32_t    nr_msis;
> >>>>  
> >>>> -    void (*init)(sPAPRMachineState *spapr, int nr_irqs, Error **errp);
> >>>> +    void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
> >>>> +                 Error **errp);
> >>>>      int (*claim)(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
> >>>>      void (*free)(sPAPRMachineState *spapr, int irq, int num);
> >>>>      qemu_irq (*qirq)(sPAPRMachineState *spapr, int irq);
> >>>> @@ -42,8 +44,9 @@ typedef struct sPAPRIrq {
> >>>>  
> >>>>  extern sPAPRIrq spapr_irq_xics;
> >>>>  extern sPAPRIrq spapr_irq_xics_legacy;
> >>>> +extern sPAPRIrq spapr_irq_xive;
> >>>>  
> >>>> -void spapr_irq_init(sPAPRMachineState *spapr, Error **errp);
> >>>> +void spapr_irq_init(sPAPRMachineState *spapr, int nr_servers, Error **errp);
> >>>
> >>> I don't see why nr_servers needs to become a parameter, since it can
> >>> be derived from spapr within this routine.
> >>
> >> ok. This is true. We can directly use xics_max_server_number(spapr).
> >>
> >>>>  int spapr_irq_claim(sPAPRMachineState *spapr, int irq, bool lsi, Error **errp);
> >>>>  void spapr_irq_free(sPAPRMachineState *spapr, int irq, int num);
> >>>>  qemu_irq spapr_qirq(sPAPRMachineState *spapr, int irq);
> >>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>>> index e470efe7993c..9f8c19e56e7a 100644
> >>>> --- a/hw/ppc/spapr.c
> >>>> +++ b/hw/ppc/spapr.c
> >>>> @@ -2594,7 +2594,7 @@ static void spapr_machine_init(MachineState *machine)
> >>>>      spapr_set_vsmt_mode(spapr, &error_fatal);
> >>>>  
> >>>>      /* Set up Interrupt Controller before we create the VCPUs */
> >>>> -    spapr_irq_init(spapr, &error_fatal);
> >>>> +    spapr_irq_init(spapr, xics_max_server_number(spapr), &error_fatal);
> >>>
> >>> We should rename xics_max_server_number() since it's no longer xics
> >>> specific.
> >>
> >> yes.
> >>
> >>>>      /* Set up containers for ibm,client-architecture-support negotiated options
> >>>>       */
> >>>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> >>>> index bac450ffff23..2569ae1bc7f8 100644
> >>>> --- a/hw/ppc/spapr_irq.c
> >>>> +++ b/hw/ppc/spapr_irq.c
> >>>> @@ -12,6 +12,7 @@
> >>>>  #include "qemu/error-report.h"
> >>>>  #include "qapi/error.h"
> >>>>  #include "hw/ppc/spapr.h"
> >>>> +#include "hw/ppc/spapr_xive.h"
> >>>>  #include "hw/ppc/xics.h"
> >>>>  #include "sysemu/kvm.h"
> >>>>  
> >>>> @@ -91,7 +92,7 @@ error:
> >>>>  }
> >>>>  
> >>>>  static void spapr_irq_init_xics(sPAPRMachineState *spapr, int nr_irqs,
> >>>> -                                Error **errp)
> >>>> +                                int nr_servers, Error **errp)
> >>>>  {
> >>>>      MachineState *machine = MACHINE(spapr);
> >>>>      Error *local_err = NULL;
> >>>> @@ -204,10 +205,122 @@ sPAPRIrq spapr_irq_xics = {
> >>>>      .print_info  = spapr_irq_print_info_xics,
> >>>>  };
> >>>>  
> >>>> + /*
> >>>> + * XIVE IRQ backend.
> >>>> + */
> >>>> +static sPAPRXive *spapr_xive_create(sPAPRMachineState *spapr,
> >>>> +                                    const char *type_xive, int nr_irqs,
> >>>> +                                    int nr_servers, Error **errp)
> >>>> +{
> >>>> +    sPAPRXive *xive;
> >>>> +    Error *local_err = NULL;
> >>>> +    Object *obj;
> >>>> +    uint32_t nr_ends = nr_servers << 3; /* 8 priority ENDs per CPU */
> >>>> +    int i;
> >>>> +
> >>>> +    obj = object_new(type_xive);
> >>>
> >>> What's the reason for making the type a parameter, rather than just
> >>> using the #define here?
> >>
> >> KVM.
> > 
> > Yeah, I realised that when I'd read a few patches further on.  As I
> > commented there, I don't think having separate KVM/TCG subclasses is
> > actually a good pattern to follow.
> 
> I will use the simple pattern in the next spin: if (kvm) { }

Great.

> We might want to do that for XICS also but it would break migratability.

Well, if that breaks migration, we already have a problem migrating
between KVM and non-KVM guests (or even KVM-with-irqchip and
KVM-without-irqchip guests).  I think we put the actual migratable
state in the base class to avoid that, but we should check.  If we
ever get time.
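
For illustration, a minimal sketch of the "if (kvm) { }" pattern with a
single device type, as opposed to separate KVM/TCG subclasses. This is
not code from the series: demo_kvm_enabled stands in for QEMU's
kvm_enabled(), and every demo_* name is hypothetical. The point it tries
to show is that the state lives in one struct whichever accelerator is
in use.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for QEMU's kvm_enabled(); settable for the demo. */
static bool demo_kvm_enabled;

/* One device type: the state is the same struct under TCG and KVM. */
typedef struct DemoXive {
    uint32_t nr_irqs;
    bool     kvm_connected;   /* set once the in-kernel device is attached */
} DemoXive;

/* Hypothetical KVM hook; a real one would create the in-kernel device. */
static int demo_kvm_connect(DemoXive *xive)
{
    xive->kvm_connected = true;
    return 0;
}

/* Common init: shared setup first, then the KVM-only step if needed. */
static int demo_xive_init(DemoXive *xive, uint32_t nr_irqs)
{
    xive->nr_irqs = nr_irqs;
    xive->kvm_connected = false;

    if (demo_kvm_enabled) {
        return demo_kvm_connect(xive);
    }
    return 0;
}
```

With this shape the accelerator choice only affects which extra step
runs, not the type of the object.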

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine
  2018-11-29 16:36     ` Cédric Le Goater
@ 2018-11-29 22:43       ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29 22:43 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 2905 bytes --]

On Thu, Nov 29, 2018 at 05:36:19PM +0100, Cédric Le Goater wrote:
> On 11/29/18 5:08 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:21AM +0100, Cédric Le Goater wrote:
> >> This routine gathers all the KVM initialization of the XICS KVM
> >> presenter. It will be useful when the initialization of the KVM XICS
> >> device is moved to a global routine.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > 
> > I dislike calling things *_init() because it's not clear which of
> > qemu's many "init" hooks it belongs with.
> 
> we could use 'icp_kvm_connect' instead, which is what QEMU is doing:
> connecting a QEMU model to a KVM one.

Works for me.

> 
> C.
> 
>  
> >> ---
> >>  hw/intc/xics_kvm.c | 29 +++++++++++++++++++----------
> >>  1 file changed, 19 insertions(+), 10 deletions(-)
> >>
> >> diff --git a/hw/intc/xics_kvm.c b/hw/intc/xics_kvm.c
> >> index e8fa9a53aeba..efad1b19d821 100644
> >> --- a/hw/intc/xics_kvm.c
> >> +++ b/hw/intc/xics_kvm.c
> >> @@ -123,11 +123,8 @@ static void icp_kvm_reset(DeviceState *dev)
> >>      icp_set_kvm_state(ICP(dev), 1);
> >>  }
> >>  
> >> -static void icp_kvm_realize(DeviceState *dev, Error **errp)
> >> +static void icp_kvm_init(ICPState *icp, Error **errp)
> >>  {
> >> -    ICPState *icp = ICP(dev);
> >> -    ICPStateClass *icpc = ICP_GET_CLASS(icp);
> >> -    Error *local_err = NULL;
> >>      CPUState *cs;
> >>      KVMEnabledICP *enabled_icp;
> >>      unsigned long vcpu_id;
> >> @@ -137,12 +134,6 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
> >>          abort();
> >>      }
> >>  
> >> -    icpc->parent_realize(dev, &local_err);
> >> -    if (local_err) {
> >> -        error_propagate(errp, local_err);
> >> -        return;
> >> -    }
> >> -
> >>      cs = icp->cs;
> >>      vcpu_id = kvm_arch_vcpu_id(cs);
> >>  
> >> @@ -168,6 +159,24 @@ static void icp_kvm_realize(DeviceState *dev, Error **errp)
> >>      QLIST_INSERT_HEAD(&kvm_enabled_icps, enabled_icp, node);
> >>  }
> >>  
> >> +static void icp_kvm_realize(DeviceState *dev, Error **errp)
> >> +{
> >> +    ICPStateClass *icpc = ICP_GET_CLASS(dev);
> >> +    Error *local_err = NULL;
> >> +
> >> +    icpc->parent_realize(dev, &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +
> >> +    icp_kvm_init(ICP(dev), &local_err);
> >> +    if (local_err) {
> >> +        error_propagate(errp, local_err);
> >> +        return;
> >> +    }
> >> +}
> >> +
> >>  static void icp_kvm_class_init(ObjectClass *klass, void *data)
> >>  {
> >>      DeviceClass *dc = DEVICE_CLASS(klass);
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers
  2018-11-29 16:40     ` Cédric Le Goater
@ 2018-11-29 22:44       ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-29 22:44 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 1498 bytes --]

On Thu, Nov 29, 2018 at 05:40:18PM +0100, Cédric Le Goater wrote:
> On 11/29/18 5:12 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:25AM +0100, Cédric Le Goater wrote:
> >> Removing RTAS handlers will become necessary when the new pseries
> >> machine supporting multiple interrupt modes is introduced.
> > 
> > I'd prefer this to be done as a separate spapr_rtas_unregister()
> > helper, just to improve greppability.
> 
> ok. I should propose an inline :
> 
> static inline void spapr_rtas_unregister(int token) 
> { 
> 	spapr_rtas_register(token, NULL, NULL); 
> }

Fair enough.
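
For illustration, a self-contained sketch of how the relaxed assert and
the unregister helper fit together. The table, token range and demo_*
names are stand-ins invented for this sketch, not the real
hw/ppc/spapr_rtas.c definitions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_TOKEN_BASE 0x2000   /* assumed value, for the sketch only */
#define DEMO_TOKEN_MAX  0x2040   /* assumed value, for the sketch only */

typedef int (*demo_rtas_fn)(int arg);

static struct {
    const char *name;
    demo_rtas_fn fn;
} demo_rtas_table[DEMO_TOKEN_MAX - DEMO_TOKEN_BASE];

/* Trivial handler used to exercise the table. */
static int demo_handler(int arg)
{
    return arg + 1;
}

static void demo_rtas_register(int token, const char *name, demo_rtas_fn fn)
{
    assert(token >= DEMO_TOKEN_BASE && token < DEMO_TOKEN_MAX);
    token -= DEMO_TOKEN_BASE;

    /*
     * Registering over a live handler is still a bug, but a NULL name
     * (i.e. an unregister) is allowed to clear an occupied slot.
     */
    assert(!name || !demo_rtas_table[token].name);

    demo_rtas_table[token].name = name;
    demo_rtas_table[token].fn = fn;
}

/* Separate helper so that removals can be found with grep. */
static inline void demo_rtas_unregister(int token)
{
    demo_rtas_register(token, NULL, NULL);
}
```

After an unregister the slot is empty again, so a later register of the
same token passes the assert.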

>   
> > 
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  hw/ppc/spapr_rtas.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/hw/ppc/spapr_rtas.c b/hw/ppc/spapr_rtas.c
> >> index d6a0952154ac..e005d5d08151 100644
> >> --- a/hw/ppc/spapr_rtas.c
> >> +++ b/hw/ppc/spapr_rtas.c
> >> @@ -404,7 +404,7 @@ void spapr_rtas_register(int token, const char *name, spapr_rtas_fn fn)
> >>  
> >>      token -= RTAS_TOKEN_BASE;
> >>  
> >> -    assert(!rtas_table[token].name);
> >> +    assert(!name || !rtas_table[token].name);
> >>  
> >>      rtas_table[token].name = name;
> >>      rtas_table[token].fn = fn;
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-29 22:06     ` Cédric Le Goater
@ 2018-11-30  1:04       ` David Gibson
  2018-11-30  6:41         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-30  1:04 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 5143 bytes --]

On Thu, Nov 29, 2018 at 11:06:13PM +0100, Cédric Le Goater wrote:
> On 11/22/18 6:13 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
> >> The Event Notification Descriptor also contains two Event State
> >> Buffers providing further coalescing of interrupts, one for the
> >> notification event (ESn) and one for the escalation events (ESe). A
> >> MMIO page is assigned for each to control the EOI through loads
> >> only. Stores are not allowed.
> >>
> >> The END ESBs are modeled through an object resembling the 'XiveSource'.
> >> It is stateless as the END state bits are backed into the XiveEND
> >> structure under the XiveRouter and the MMIO accesses follow the same
> >> rules as for the standard source ESBs.
> >>
> >> END ESBs are not supported by the Linux drivers, either on OPAL or on
> >> sPAPR. Nevertheless, this provides a means to study the question in the
> >> future and validates the XIVE model a bit more.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  include/hw/ppc/xive.h |  20 ++++++
> >>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
> >>  2 files changed, 178 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index ce62aaf28343..24301bf2076d 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>                          XiveEND *end);
> >>  
> >> +/*
> >> + * XIVE END ESBs
> >> + */
> >> +
> >> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
> >> +#define XIVE_END_SOURCE(obj) \
> >> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> > 
> > Is there a particular reason to make this a full QOM object, rather
> > than just embedding it in the XiveRouter?
> 
> Coming back on this question because removing the chip_id from the
> router is a problem for the END triggering. At least with the current
> design. See below for the comment.
> 
> >> +typedef struct XiveENDSource {
> >> +    SysBusDevice parent;
> >> +
> >> +    uint32_t        nr_ends;
> >> +
> >> +    /* ESB memory region */
> >> +    uint32_t        esb_shift;
> >> +    MemoryRegion    esb_mmio;
> >> +
> >> +    XiveRouter      *xrtr;
> >> +} XiveENDSource;
> >> +
> >>  /*
> >>   * For legacy compatibility, the exceptions define up to 256 different
> >>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 9cb001e7b540..5a8882d47a98 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
> >>       * even futher coalescing in the Router
> >>       */
> >>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
> >> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> >> -        return;
> >> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
> >> +        bool notify = xive_esb_trigger(&pq);
> >> +
> >> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
> >> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
> >> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
> >> +        }
> >> +
> >> +        /* ESn[Q]=1 : end of notification */
> >> +        if (!notify) {
> >> +            return;
> >> +        }
> >>      }
> >>  
> >>      /*
> >> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
> >>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
> >>  }
> >>  
> >> +/*
> >> + * END ESB MMIO loads
> >> + */
> >> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
> >> +{
> >> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
> >> +    XiveRouter *xrtr = xsrc->xrtr;
> >> +    uint32_t offset = addr & 0xFFF;
> >> +    uint8_t end_blk;
> >> +    uint32_t end_idx;
> >> +    XiveEND end;
> >> +    uint32_t end_esmask;
> >> +    uint8_t pq;
> >> +    uint64_t ret = -1;
> >> +
> >> +    end_blk = xrtr->chip_id;
> >> +    end_idx = addr >> (xsrc->esb_shift + 1);
> >> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> 
> The current END accessors require a block identifier, hence xrtr->chip_id, 
> but in this case, we don't really need it because we are using the ENDT 
> local to the router/chip. 

> I don't know how to handle this case simply without keeping chip_id :/

I don't really follow how chip_id is relevant here.  AFAICT the END
accessors take a block id and the back end is responsible for
interpreting them.  The powernv one will map it to chip id, but the
PAPR one can just ignore it or only use block 0.
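
For illustration, a sketch of that split: the generic accessor just
forwards (block, index) and the back end decides what a block means.
The types and demo_* names are hypothetical and much simpler than the
real XiveRouter/XiveEND; the PAPR-style back end here accepts only
block 0.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Reduced stand-in for an END; the real structure has more words. */
typedef struct DemoEND {
    uint32_t w0, w1;
} DemoEND;

typedef struct DemoRouter DemoRouter;
struct DemoRouter {
    /* Back end hook: 0 on success, -1 if (blk, idx) is not valid. */
    int (*get_end)(DemoRouter *r, uint8_t blk, uint32_t idx, DemoEND *end);
    DemoEND *endt;
    uint32_t nr_ends;
};

/* PAPR-style back end: one local table, so only block 0 is accepted. */
static int demo_papr_get_end(DemoRouter *r, uint8_t blk, uint32_t idx,
                             DemoEND *end)
{
    if (blk != 0 || idx >= r->nr_ends) {
        return -1;
    }
    memcpy(end, &r->endt[idx], sizeof(*end));
    return 0;
}

/* Generic accessor: no chip_id in the router, the back end interprets blk. */
static int demo_router_get_end(DemoRouter *r, uint8_t blk, uint32_t idx,
                               DemoEND *end)
{
    return r->get_end(r, blk, idx, end);
}
```

A powernv-style back end would plug a different get_end hook into the
same router and map the block to a chip there.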

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-29 17:51           ` Cédric Le Goater
@ 2018-11-30  1:09             ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-11-30  1:09 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: Benjamin Herrenschmidt, qemu-ppc, qemu-devel

[-- Attachment #1: Type: text/plain, Size: 2075 bytes --]

On Thu, Nov 29, 2018 at 06:51:53PM +0100, Cédric Le Goater wrote:
> On 11/29/18 4:39 AM, Benjamin Herrenschmidt wrote:
> > On Thu, 2018-11-29 at 11:47 +1100, David Gibson wrote:
> >>
> >> 1) read/write accessors which take a word number
> 
> ok for single word updates of the structures.
> 
> >> 2) A "get" accessor which copies the whole structure, 
> 
> ok
> 
> >> but "write"
> >> accessor which takes a word number.  The asymmetry is a bit ugly, but
> >> it's the non-atomic writeback of the whole structure which I'm most
> >> uncomfortable with.
> 
> And, how would you make the update of the whole structure in RAM look 
> "atomic" under QEMU ? 

So, the BQL means it actually is atomic now (at least for PAPR where
the guest doesn't have access to it), but I don't want to rely on that
always being the case - there are moves to put less stuff under the
BQL, and with KVM we might be mapping some of these things such that
real hardware can touch it.

But the real point is that we don't *need* it to be atomic.  Perhaps
the individual field updates need to be atomic, but not writes to the
END as a whole.  Writing back the whole thing is also a whole heap of
unnecessary stores.
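
For illustration, a sketch of option 2: a "get" that copies the whole
structure and a "write" that stores back exactly one word. The table,
the eight-word layout and the demo_* names are assumptions made for the
sketch, not the accessors from the series.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_END_WORDS 8   /* assumed: END modelled as eight 32-bit words */
#define DEMO_NR_ENDS   4

typedef struct DemoEND {
    uint32_t w[DEMO_END_WORDS];
} DemoEND;

static DemoEND demo_endt[DEMO_NR_ENDS];   /* stand-in for the table in RAM */
static unsigned demo_store_count;         /* counts stores to the table */

/* "get": copy the whole structure out for inspection. */
static int demo_get_end(uint32_t idx, DemoEND *end)
{
    if (idx >= DEMO_NR_ENDS) {
        return -1;
    }
    *end = demo_endt[idx];
    return 0;
}

/* "write": store back a single word, leaving the other words untouched. */
static int demo_write_end(uint32_t idx, const DemoEND *end, unsigned word)
{
    if (idx >= DEMO_NR_ENDS || word >= DEMO_END_WORDS) {
        return -1;
    }
    demo_endt[idx].w[word] = end->w[word];
    demo_store_count++;
    return 0;
}
```

A caller that changed only w[1] in its local copy issues one store, and
any other word it happened to modify locally never reaches the table.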

> > It shouldn't be a big deal though, there are HW facilities to access
> > the structures "atomically" anyway, due to the caching done by the
> > XIVE.
> 
> Are you suggesting that the PowerNV model should update the VPC, EQC, 
> IVC in the VST accessors before updating the VSTs in RAM ?
> >> 3) A map/unmap interface which gives you / releases a pointer to the
> >> "live" structure.  For powernv that would become
> >> address_space_map()/unmap().  
> 
> yes.
> 
> >> For PAPR it would just be return pointer  / no-op.
> 
> ok
> 
> I think I will introduce these handlers progressively in the patchset.
> 
> Thanks,
> 
> C. 
> 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-29 15:27         ` Cédric Le Goater
@ 2018-11-30  1:11           ` David Gibson
  2018-11-30  6:56             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-30  1:11 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 1918 bytes --]

On Thu, Nov 29, 2018 at 04:27:31PM +0100, Cédric Le Goater wrote:
> [ ... ] 
> 
> >>>> +/*
> >>>> + * The allocation of VP blocks is a complex operation in OPAL and the
> >>>> + * VP identifiers have a relation with the number of HW chips, the
> >>>> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
> >>>> + * controller model does not have the same constraints and can use a
> >>>> + * simple mapping scheme of the CPU vcpu_id
> >>>> + *
> >>>> + * These identifiers are never returned to the OS.
> >>>> + */
> >>>> +
> >>>> +#define SPAPR_XIVE_VP_BASE 0x400
> >>>
> >>> 0x400 == 1024.  Could we ever have the possibility of needing to
> >>> consider both physical NVTs and PAPR NVTs at the same time?  
> >>
> >> They would not be in the same CAM line: OS ring vs. PHYS ring. 
> > 
> > Hm.  They still inhabit the same NVT number space though, don't they?
> 
> No. skiboot reserves the range of VPs for the HW at init.
> 
> https://github.com/open-power/skiboot/blob/master/hw/xive.c#L1093

Uh.. I don't see how the way they're reserved is relevant.

What I mean is that the ENDs address the NVTs for HW endpoints by the
same (block, index) tuples as the NVTs for virtualized endpoints, yes?

> > I'm thinking about the END->NVT stage of the process here, rather than
> > the NVT->TCTX stage.
> >
> > Oh, also, you're using "VP" here which IIUC == "NVT".  Can we
> > standardize on one, please.
> 
> VP is used in Linux/KVM Linux/Native and skiboot. Yes. it's a mess. 
> Let's have consistent naming in QEMU and use NVT. 

Right.  And to cover any inevitable missed ones is why I'd like to see
a cheatsheet giving both terms in the header comments somewhere.

[snip]

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-29 16:04         ` Cédric Le Goater
@ 2018-11-30  1:23           ` David Gibson
  2018-11-30  8:07             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-30  1:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 11506 bytes --]

On Thu, Nov 29, 2018 at 05:04:50PM +0100, Cédric Le Goater wrote:
> On 11/29/18 2:23 AM, David Gibson wrote:
> > On Wed, Nov 28, 2018 at 11:21:37PM +0100, Cédric Le Goater wrote:
> >> On 11/28/18 5:25 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
> >>>> The different XIVE virtualization structures (sources and event queues)
> >>>> are configured with a set of Hypervisor calls :
> >>>>
> >>>>  - H_INT_GET_SOURCE_INFO
> >>>>
> >>>>    used to obtain the address of the MMIO page of the Event State
> >>>>    Buffer (ESB) entry associated with the source.
> >>>>
> >>>>  - H_INT_SET_SOURCE_CONFIG
> >>>>
> >>>>    assigns a source to a "target".
> >>>>
> >>>>  - H_INT_GET_SOURCE_CONFIG
> >>>>
> >>>>    determines which "target" and "priority" is assigned to a source
> >>>>
> >>>>  - H_INT_GET_QUEUE_INFO
> >>>>
> >>>>    returns the address of the notification management page associated
> >>>>    with the specified "target" and "priority".
> >>>>
> >>>>  - H_INT_SET_QUEUE_CONFIG
> >>>>
> >>>>    sets or resets the event queue for a given "target" and "priority".
> >>>>    It is also used to set the notification configuration associated
> >>>>    with the queue, only unconditional notification is supported for
> >>>>    the moment. Reset is performed with a queue size of 0 and queueing
> >>>>    is disabled in that case.
> >>>>
> >>>>  - H_INT_GET_QUEUE_CONFIG
> >>>>
> >>>>    returns the queue settings for a given "target" and "priority".
> >>>>
> >>>>  - H_INT_RESET
> >>>>
> >>>>    resets all of the guest's internal interrupt structures to their
> >>>>    initial state, losing all configuration set via the hcalls
> >>>>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> >>>>
> >>>>  - H_INT_SYNC
> >>>>
> >>>>    issue a synchronisation on a source to make sure all notifications
> >>>>    have reached their queue.
> >>>>
> >>>> Calls that still need to be addressed :
> >>>>
> >>>>    H_INT_SET_OS_REPORTING_LINE
> >>>>    H_INT_GET_OS_REPORTING_LINE
> >>>>
> >>>> See the code for more documentation on each hcall.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  include/hw/ppc/spapr.h      |  15 +-
> >>>>  include/hw/ppc/spapr_xive.h |   6 +
> >>>>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
> >>>>  hw/ppc/spapr_irq.c          |   2 +
> >>>>  hw/intc/Makefile.objs       |   2 +-
> >>>>  5 files changed, 915 insertions(+), 2 deletions(-)
> >>>>  create mode 100644 hw/intc/spapr_xive_hcall.c
> >>>>
> >>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>> index 1fbc2663e06c..8415faea7b82 100644
> >>>> --- a/include/hw/ppc/spapr.h
> >>>> +++ b/include/hw/ppc/spapr.h
> >>>> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
> >>>>  #define H_INVALIDATE_PID        0x378
> >>>>  #define H_REGISTER_PROC_TBL     0x37C
> >>>>  #define H_SIGNAL_SYS_RESET      0x380
> >>>> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
> >>>> +
> >>>> +#define H_INT_GET_SOURCE_INFO   0x3A8
> >>>> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
> >>>> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
> >>>> +#define H_INT_GET_QUEUE_INFO    0x3B4
> >>>> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
> >>>> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
> >>>> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
> >>>> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
> >>>> +#define H_INT_ESB               0x3C8
> >>>> +#define H_INT_SYNC              0x3CC
> >>>> +#define H_INT_RESET             0x3D0
> >>>> +
> >>>> +#define MAX_HCALL_OPCODE        H_INT_RESET
> >>>>  
> >>>>  /* The hcalls above are standardized in PAPR and implemented by pHyp
> >>>>   * as well.
> >>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >>>> index 3f65b8f485fd..418511f3dc10 100644
> >>>> --- a/include/hw/ppc/spapr_xive.h
> >>>> +++ b/include/hw/ppc/spapr_xive.h
> >>>> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> >>>>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> >>>>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
> >>>>  
> >>>> +bool spapr_xive_priority_is_valid(uint8_t priority);
> >>>
> >>> AFAICT this could be a local function.
> >>
> >> the KVM model uses it also, when collecting state from the KVM device 
> >> to build the QEMU ENDT.
> >>
> >>>> +
> >>>> +typedef struct sPAPRMachineState sPAPRMachineState;
> >>>> +
> >>>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
> >>>> +
> >>>>  #endif /* PPC_SPAPR_XIVE_H */
> >>>> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
> >>>> new file mode 100644
> >>>> index 000000000000..52e4e23995f5
> >>>> --- /dev/null
> >>>> +++ b/hw/intc/spapr_xive_hcall.c
> >>>> @@ -0,0 +1,892 @@
> >>>> +/*
> >>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >>>> + *
> >>>> + * Copyright (c) 2017-2018, IBM Corporation.
> >>>> + *
> >>>> + * This code is licensed under the GPL version 2 or later. See the
> >>>> + * COPYING file in the top-level directory.
> >>>> + */
> >>>> +
> >>>> +#include "qemu/osdep.h"
> >>>> +#include "qemu/log.h"
> >>>> +#include "qapi/error.h"
> >>>> +#include "cpu.h"
> >>>> +#include "hw/ppc/fdt.h"
> >>>> +#include "hw/ppc/spapr.h"
> >>>> +#include "hw/ppc/spapr_xive.h"
> >>>> +#include "hw/ppc/xive_regs.h"
> >>>> +#include "monitor/monitor.h"
> >>>
> >>> Fwiw, I don't think it's particularly necessary to split the hcall
> >>> handling out into a separate .c file.
> >>
> >> ok. let's move it to spapr_xive then ? It might help in reducing the 
> >> exported functions.
> > 
> > Yes, I think so.
> > 
> >>>> +/*
> >>>> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
> >>>> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
> >>>> + * available for the guest.
> >>>
> >>> Referencing OPAL behaviour doesn't really make sense in the context of
> >>> PAPR.  
> >>
> >> It's an OPAL constraint which pHyp doesn't have. So it's a QEMU/KVM
> >> constraint also.
> > 
> > Right, I realized that a few patches on.  Maybe rephrase this to
> > 
> >    Linux hosts under OPAL reserve priority 7 for their own escalation
> >    interrupts.  So we only allow the guest to use priorities [0..6].
> 
> OK.
> 
> > The point here is that we're emphasizing that this is a design
> > decision to make the host implementation easier, rather than a
> > fundamental constraint.
> > 
> >>> What I think you're getting at is that the PAPR spec only
> >>> allows a PAPR guest to use priorities 0..6 (or at least it will if the
> >>> XIVE updated spec ever gets published).  
> >>
> >> It's not in the spec. the XIVE sPAPR spec should be frozen soon btw. 
> >>  
> >>> The fact that this allows the
> >>> host to use 7 for escalations is a design rationale
> >>> but not really relevant to the guest device itself. 
> >>
> >> The guest should be aware of which priorities are reserved for
> >> the hypervisor though.
> >>
> >>>> + */
> >>>> +bool spapr_xive_priority_is_valid(uint8_t priority)
> >>>> +{
> >>>> +    switch (priority) {
> >>>> +    case 0 ... 6:
> >>>> +        return true;
> >>>> +    case 7: /* OPAL escalation queue */
> >>>> +    default:
> >>>> +        return false;
> >>>> +    }
> >>>> +}
> >>>> +
> >>>> +/*
> >>>> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
> >>>> + * real address of the MMIO page through which the Event State Buffer
> >>>> + * entry associated with the value of the "lisn" parameter is managed.
> >>>> + *
> >>>> + * Parameters:
> >>>> + * Input
> >>>> + * - "flags"
> >>>> + *       Bits 0-63 reserved
> >>>> + * - "lisn" is per "interrupts", "interrupt-map", or
> >>>> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
> >>>> + *       ibm,query-interrupt-source-number RTAS call, or as returned
> >>>> + *       by the H_ALLOCATE_VAS_WINDOW hcall
> >>>
> >>> I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
> >>> to implement in kvm/qemu, or is it only of interest for PowerVM?
> >>
> >> The hcall is part of the PAPR NX Interfaces and it returns interrupt
> >> numbers. I don't know if any work has been done on the topic.  
> > 
> > What's a "PAPR NX"?
> 
> A way for the PAPR guests to access the POWER coprocessors doing 
> compression and encryption. I really don't know much about this.

Ah, ok.

[snip]
> >> I think not, but the specs are not very clear on that topic. I will
> >> ask for clarification and use a -1 for now. We can not do loads on
> >> the trigger page so it can not be used by the H_INT_ESB hcall.
> >>
> >>>
> >>>> +    args[3] = TARGET_PAGE_SIZE;
> >>>
> >>> That seems wrong.  
> >>
> >> This is utterly wrong. it should be a power of 2 number ... I got
> >> it right under KVM though. I guess that ioremap() under Linux rounds 
> >> up the size to the page size in use, so, that's why it didn't blow
> >> up under TCG.
> >>
> >>> TARGET_PAGE_SIZE is generally 4kiB, but won't these usually
> >>> actually be 64kiB?
> >>
> >> yes. So what should I use to get a PAGE_SHIFT instead ? 
> > 
> > Erm, that gets a bit tricky, since qemu in a sense doesn't know the
> > guest's page size.
> > 
> > But.. don't you actually want the esb_shift here, not PAGE_SHIFT - it
> > could matter for the 2 page * 64kiB variant, yes?
> 
> Yes. we just want the page_shift of the ESB page, whether it's one or
> two pages. The other registers inform the guest if there are one or 
> two ESB pages in use.

Ok, still sounds like you should base it on esb_shift, just adjust for
the two page case.
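
For illustration, a sketch of deriving the per-page shift from
esb_shift. It assumes that esb_shift describes the whole per-source ESB
window, so the two-page variants are one bit larger than a single page;
the DEMO_* values and helper names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding: the shift covers the whole per-source ESB window. */
#define DEMO_ESB_4K         12
#define DEMO_ESB_4K_2PAGE   13
#define DEMO_ESB_64K        16
#define DEMO_ESB_64K_2PAGE  17

static bool demo_esb_has_2page(uint32_t esb_shift)
{
    return esb_shift == DEMO_ESB_4K_2PAGE ||
           esb_shift == DEMO_ESB_64K_2PAGE;
}

/* Shift of one ESB page: a power-of-2 exponent, not a size in bytes. */
static uint32_t demo_esb_page_shift(uint32_t esb_shift)
{
    return demo_esb_has_2page(esb_shift) ? esb_shift - 1 : esb_shift;
}
```

Under that assumption, both 64kiB layouts report a page shift of 16 to
the guest.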

> >>>> +    }
> >>>> +
> >>>> +    switch (qsize) {
> >>>> +    case 12:
> >>>> +    case 16:
> >>>> +    case 21:
> >>>> +    case 24:
> >>>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
> >>>
> >>> It just occurred to me that I haven't been looking for this across any
> >>> of these reviews.  Don't you need byteswaps when accessing these
> >>> in-memory structures?
> >>
> >> yes this is done when some event data is enqueued in the EQ.
> > 
> > I'm not talking about the data in the EQ itself, but the fields in the
> > END (and the NVT).
> 
> XIVE is all BE.

Yes... the qemu host might not be, which is why you need byteswaps.

I realized eventually you have the swaps in your pnv get/set
accessors.  I don't like that at all for a couple of reasons:

1) Although the END structure is made up of word-sized fields because
that's convenient, the END really is made of a bunch of subfields of
different sizes.  Knowing that, it wouldn't be unreasonable for people
to expect they can look into the XIVE by byte offsets; that will break
if you're working with a copy that has already been byte-swapped on
word-sized units.

2) At different points in the code you're storing both BE and
native-endian data in the same struct.  That's both confusing to
someone reading the code (if they see that struct they don't know if
it's byteswapped already) and also means you can't use sparse
annotations to make sure you have it right.
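
As a rough illustration of the alternative, here is a sketch that
keeps the in-memory copy big-endian at all times and converts only
when a field is read or written. The struct and helper names are
made up for the example; in QEMU the conversion would more likely go
through the existing be32_to_cpu() / cpu_to_be32() helpers.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Illustrative END-like structure, always stored in big-endian layout.
 * The type and field names are hypothetical; only the idea of keeping
 * the in-memory copy in BE comes from the discussion.
 */
typedef struct {
    uint8_t w0[4];
    uint8_t w1[4];
} ExampleEnd;

/* Read a 32-bit big-endian field into host byte order. */
static inline uint32_t be32_load(const uint8_t *p)
{
    assert(p);
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Write a host-order value into a 32-bit big-endian field. */
static inline void be32_store(uint8_t *p, uint32_t v)
{
    assert(p);
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

Because the bytes never change order in memory, looking at the
structure by byte offset gives the same answer on any host, and the
struct only ever holds data of one endianness.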

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson


* Re: [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM
  2018-11-29 16:19     ` Cédric Le Goater
@ 2018-11-30  1:24       ` David Gibson
  2018-11-30  7:04         ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-11-30  1:24 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Thu, Nov 29, 2018 at 05:19:51PM +0100, Cédric Le Goater wrote:
> David,
> 
> Could you tell what you think about the KVM interfaces for migration,
> the ones capturing and restoring the states ? 
> 
> On 11/29/18 4:43 AM, David Gibson wrote:
> > On Fri, Nov 16, 2018 at 11:57:16AM +0100, Cédric Le Goater wrote:
> >> This extends the KVM XIVE models to handle the state synchronization
> >> with KVM, for the monitor usage and for the migration.
> >>
> >> The migration priority of the XIVE interrupt controller sPAPRXive is
> >> raised for KVM. It operates first and orchestrates the capture
> >> sequence of the states of all the XIVE models. The XIVE sources are
> >> masked to quiesce the interrupt flow and a XIVE xync is performed to
> >> stabilize the OS Event Queues. The state of the ENDs is then captured
> >> by the XIVE interrupt controller model, sPAPRXive, and the state of
> >> the thread contexts by the thread interrupt presenter model,
> >> XiveTCTX. When done, a rollback is performed to restore the sources to
> >> their initial state.
> >>
> >> The sPAPRXive 'post_load' method is called from the sPAPR machine,
> >> after all XIVE device states have been transferred and loaded. First,
> >> sPAPRXive restores the XIVE routing tables: ENDT and EAT. Next, the
> >> thread interrupt context registers and the source PQ bits are
> >> restored.
> >>
> >> The get/set operations rely on their KVM counterpart in the host
> >> kernel which acts as a proxy for OPAL, the host firmware.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>
> >>  WIP:
> >>  
> >>     If migration occurs when a VCPU is 'ceded', some of the OS event
> >>     notification queues are mapped to the ZERO_PAGE on the receiving
> >>     side. As if the HW had triggered a page fault before the dirty
> >>     page was transferred from the source or as if we were not using
> >>     the correct page table.
> 
> 
> v6 adds a VM change state handler to make XIVE reach a quiescent state. 
> The sequence is a little more sophisticated and an extra KVM call 
> marks the EQ page dirty.

Ok.

> 
> >>
> >>  include/hw/ppc/spapr_xive.h     |   5 +
> >>  include/hw/ppc/xive.h           |   3 +
> >>  include/migration/vmstate.h     |   1 +
> >>  linux-headers/asm-powerpc/kvm.h |  33 +++
> >>  hw/intc/spapr_xive.c            |  32 +++
> >>  hw/intc/spapr_xive_kvm.c        | 494 ++++++++++++++++++++++++++++++++
> >>  hw/intc/xive.c                  |  46 +++
> >>  hw/ppc/spapr_irq.c              |   2 +-
> >>  8 files changed, 615 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >> index 9c817bb7ae74..d2517c040958 100644
> >> --- a/include/hw/ppc/spapr_xive.h
> >> +++ b/include/hw/ppc/spapr_xive.h
> >> @@ -55,12 +55,17 @@ typedef struct sPAPRXiveClass {
> >>      XiveRouterClass parent_class;
> >>  
> >>      DeviceRealize   parent_realize;
> >> +
> >> +    void (*synchronize_state)(sPAPRXive *xive);
> >> +    int  (*pre_save)(sPAPRXive *xsrc);
> >> +    int  (*post_load)(sPAPRXive *xsrc, int version_id);
> > 
> > This should go away if the KVM and non-KVM versions are in the same
> > object.
> 
> yes.
> 
> >>  } sPAPRXiveClass;
> >>  
> >>  bool spapr_xive_irq_enable(sPAPRXive *xive, uint32_t lisn, bool lsi);
> >>  bool spapr_xive_irq_disable(sPAPRXive *xive, uint32_t lisn);
> >>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon);
> >>  qemu_irq spapr_xive_qirq(sPAPRXive *xive, uint32_t lisn);
> >> +int spapr_xive_post_load(sPAPRXive *xive, int version_id);
> >>  
> >>  /*
> >>   * sPAPR NVT and END indexing helpers
> >> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >> index 7aaf5a182cb3..c8201462d698 100644
> >> --- a/include/hw/ppc/xive.h
> >> +++ b/include/hw/ppc/xive.h
> >> @@ -309,6 +309,9 @@ typedef struct XiveTCTXClass {
> >>      DeviceClass       parent_class;
> >>  
> >>      DeviceRealize     parent_realize;
> >> +
> >> +    void (*synchronize_state)(XiveTCTX *tctx);
> >> +    int  (*post_load)(XiveTCTX *tctx, int version_id);
> > 
> > .. and this too.
> > 
> >>  } XiveTCTXClass;
> >>  
> >>  /*
> >> diff --git a/include/migration/vmstate.h b/include/migration/vmstate.h
> >> index 2b501d04669a..ee2e836cc1c1 100644
> >> --- a/include/migration/vmstate.h
> >> +++ b/include/migration/vmstate.h
> >> @@ -154,6 +154,7 @@ typedef enum {
> >>      MIG_PRI_PCI_BUS,            /* Must happen before IOMMU */
> >>      MIG_PRI_GICV3_ITS,          /* Must happen before PCI devices */
> >>      MIG_PRI_GICV3,              /* Must happen before the ITS */
> >> +    MIG_PRI_XIVE_IC,            /* Must happen before all XIVE models */
> > 
> > Ugh.. explicit priority / order levels are a pretty bad code smell.
> > Usually migration ordering can be handled by getting the object
> > hierarchy right.  What exactly is the problem you're addressing with
> > this?
> 
> I wanted sPAPRXive to capture the state on behalf of all XIVE models. 
> But with the addition of the VMState change handler I think I can 
> remove this priority. I will check. 
> 
> > 
> >>      MIG_PRI_MAX,
> >>  } MigrationPriority;
> >>  
> >> diff --git a/linux-headers/asm-powerpc/kvm.h b/linux-headers/asm-powerpc/kvm.h
> >> index f34c971491dd..9d55ade23634 100644
> >> --- a/linux-headers/asm-powerpc/kvm.h
> >> +++ b/linux-headers/asm-powerpc/kvm.h
> > 
> > Again, linux-headers need to be split out.
> > 
> >> @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
> >>  #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
> >>  #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
> >>  
> >> +#define KVM_REG_PPC_NVT_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
> >> +
> >>  /* Device control API: PPC-specific devices */
> >>  #define KVM_DEV_MPIC_GRP_MISC		1
> >>  #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
> >> @@ -681,10 +683,41 @@ struct kvm_ppc_cpu_char {
> >>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
> >>  #define   KVM_DEV_XIVE_VC_BASE		3
> >>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> >> +#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> >> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> >> +#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
> >>  
> >>  /* Layout of 64-bit XIVE source attribute values */
> >>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> >>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> >>  
> >> +/* Layout of 64-bit eas attribute values */
> >> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> >> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> >> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> >> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> >> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> >> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> >> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> >> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> >> +
> >> +/* Layout of 64-bit eq attribute */
> >> +#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
> >> +#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
> >> +#define KVM_XIVE_EQ_SERVER_SHIFT	3
> >> +#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
> >> +
> >> +/* Layout of 64-bit eq attribute values */
> >> +struct kvm_ppc_xive_eq {
> >> +	__u32 flags;
> >> +	__u32 qsize;
> >> +	__u64 qpage;
> >> +	__u32 qtoggle;
> >> +	__u32 qindex;
> >> +};
> >> +
> >> +#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
> >> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
> >> +#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
> >>  
> >>  #endif /* __LINUX_KVM_POWERPC_H */
> >> diff --git a/hw/intc/spapr_xive.c b/hw/intc/spapr_xive.c
> >> index ec85f7e4f88d..c5c0e063dc33 100644
> >> --- a/hw/intc/spapr_xive.c
> >> +++ b/hw/intc/spapr_xive.c
> >> @@ -27,9 +27,14 @@
> >>  
> >>  void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> >>  {
> >> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
> >>      int i;
> >>      uint32_t offset = 0;
> >>  
> >> +    if (sxc->synchronize_state) {
> >> +        sxc->synchronize_state(xive);
> >> +    }
> >> +
> >>      monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
> >>                     offset + xive->source.nr_irqs - 1);
> >>      xive_source_pic_print_info(&xive->source, offset, mon);
> >> @@ -354,10 +359,37 @@ static const VMStateDescription vmstate_spapr_xive_eas = {
> >>      },
> >>  };
> >>  
> >> +static int vmstate_spapr_xive_pre_save(void *opaque)
> >> +{
> >> +    sPAPRXive *xive = SPAPR_XIVE_BASE(opaque);
> >> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
> >> +
> >> +    if (sxc->pre_save) {
> >> +        return sxc->pre_save(xive);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +/* handled at the machine level */
> >> +int spapr_xive_post_load(sPAPRXive *xive, int version_id)
> >> +{
> >> +    sPAPRXiveClass *sxc = SPAPR_XIVE_BASE_GET_CLASS(xive);
> >> +
> >> +    if (sxc->post_load) {
> >> +        return sxc->post_load(xive, version_id);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>  static const VMStateDescription vmstate_spapr_xive_base = {
> >>      .name = TYPE_SPAPR_XIVE,
> >>      .version_id = 1,
> >>      .minimum_version_id = 1,
> >> +    .pre_save = vmstate_spapr_xive_pre_save,
> >> +    .post_load = NULL, /* handled at the machine level */
> >> +    .priority = MIG_PRI_XIVE_IC,
> >>      .fields = (VMStateField[]) {
> >>          VMSTATE_UINT32_EQUAL(nr_irqs, sPAPRXive, NULL),
> >>          VMSTATE_STRUCT_VARRAY_POINTER_UINT32(eat, sPAPRXive, nr_irqs,
> >> diff --git a/hw/intc/spapr_xive_kvm.c b/hw/intc/spapr_xive_kvm.c
> >> index 767f90826e43..176083c37d61 100644
> >> --- a/hw/intc/spapr_xive_kvm.c
> >> +++ b/hw/intc/spapr_xive_kvm.c
> >> @@ -58,6 +58,58 @@ static void kvm_cpu_enable(CPUState *cs)
> >>  /*
> >>   * XIVE Thread Interrupt Management context (KVM)
> >>   */
> >> +static void xive_tctx_kvm_set_state(XiveTCTX *tctx, Error **errp)
> >> +{
> >> +    uint64_t state[4];
> >> +    int ret;
> >> +
> >> +    /* word0 and word1 of the OS ring. */
> >> +    state[0] = *((uint64_t *) &tctx->regs[TM_QW1_OS]);
> >> +
> >> +    /* VP identifier. Only for KVM pr_debug() */
> >> +    state[1] = *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]);
> >> +
> >> +    ret = kvm_set_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
> >> +    if (ret != 0) {
> >> +        error_setg_errno(errp, errno, "Could not restore KVM XIVE CPU %ld state",
> >> +                         kvm_arch_vcpu_id(tctx->cs));
> >> +    }
> >> +}
> >> +
> >> +static void xive_tctx_kvm_get_state(XiveTCTX *tctx, Error **errp)
> >> +{
> >> +    uint64_t state[4] = { 0 };
> >> +    int ret;
> >> +
> >> +    ret = kvm_get_one_reg(tctx->cs, KVM_REG_PPC_NVT_STATE, state);
> >> +    if (ret != 0) {
> >> +        error_setg_errno(errp, errno, "Could not capture KVM XIVE CPU %ld state",
> >> +                         kvm_arch_vcpu_id(tctx->cs));
> >> +        return;
> >> +    }
> >> +
> >> +    /* word0 and word1 of the OS ring. */
> >> +    *((uint64_t *) &tctx->regs[TM_QW1_OS]) = state[0];
> >> +
> >> +    /*
> >> +     * KVM also returns word2 containing the VP CAM line value which
> >> +     * is interesting to print out the VP identifier in the QEMU
> >> +     * monitor. No need to restore it.
> >> +     */
> >> +    *((uint64_t *) &tctx->regs[TM_QW1_OS + TM_WORD2]) = state[1];
> >> +}
> >> +
> >> +static void xive_tctx_kvm_do_synchronize_state(CPUState *cpu,
> >> +                                              run_on_cpu_data arg)
> >> +{
> >> +    xive_tctx_kvm_get_state(arg.host_ptr, &error_fatal);
> >> +}
> >> +
> >> +static void xive_tctx_kvm_synchronize_state(XiveTCTX *tctx)
> >> +{
> >> +    run_on_cpu(tctx->cs, xive_tctx_kvm_do_synchronize_state,
> >> +               RUN_ON_CPU_HOST_PTR(tctx));
> >> +}
> >>  
> >>  static void xive_tctx_kvm_init(XiveTCTX *tctx, Error **errp)
> >>  {
> >> @@ -112,6 +164,8 @@ static void xive_tctx_kvm_class_init(ObjectClass *klass, void *data)
> >>  
> >>      device_class_set_parent_realize(dc, xive_tctx_kvm_realize,
> >>                                      &xtc->parent_realize);
> >> +
> >> +    xtc->synchronize_state = xive_tctx_kvm_synchronize_state;
> >>  }
> >>  
> >>  static const TypeInfo xive_tctx_kvm_info = {
> >> @@ -166,6 +220,34 @@ static void xive_source_kvm_reset(DeviceState *dev)
> >>      xive_source_kvm_init(xsrc, &error_fatal);
> >>  }
> >>  
> >> +/*
> >> + * This is used to perform the magic loads on the ESB pages, described
> >> + * in xive.h.
> >> + */
> >> +static uint8_t xive_esb_read(XiveSource *xsrc, int srcno, uint32_t offset)
> >> +{
> >> +    unsigned long addr = (unsigned long) xsrc->esb_mmap +
> >> +        xive_source_esb_mgmt(xsrc, srcno) + offset;
> >> +
> >> +    /* Prevent the compiler from optimizing away the load */
> >> +    volatile uint64_t value = *((uint64_t *) addr);
> >> +
> >> +    return be64_to_cpu(value) & 0x3;
> >> +}
> >> +
> >> +static void xive_source_kvm_get_state(XiveSource *xsrc)
> >> +{
> >> +    int i;
> >> +
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        /* Perform a load without side effect to retrieve the PQ bits */
> >> +        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_GET);
> >> +
> >> +        /* and save PQ locally */
> >> +        xive_source_esb_set(xsrc, i, pq);
> >> +    }
> >> +}
> >> +
> >>  static void xive_source_kvm_set_irq(void *opaque, int srcno, int val)
> >>  {
> >>      XiveSource *xsrc = opaque;
> >> @@ -295,6 +377,414 @@ static const TypeInfo xive_source_kvm_info = {
> >>  /*
> >>   * sPAPR XIVE Router (KVM)
> >>   */
> >> +static int spapr_xive_kvm_set_eq_state(sPAPRXive *xive, CPUState *cs,
> >> +                                       Error **errp)
> >> +{
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
> >> +    int ret;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
> >> +        Error *local_err = NULL;
> >> +        XiveEND end;
> >> +        uint8_t end_blk;
> >> +        uint32_t end_idx;
> >> +        struct kvm_ppc_xive_eq kvm_eq = { 0 };
> >> +        uint64_t kvm_eq_idx;
> >> +
> >> +        if (!spapr_xive_priority_is_valid(i)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
> >> +
> >> +        ret = xive_router_get_end(xrtr, end_blk, end_idx, &end);
> >> +        if (ret) {
> >> +            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
> >> +                       vcpu_id, i);
> >> +            return ret;
> >> +        }
> >> +
> >> +        if (!(end.w0 & END_W0_VALID)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        /* Build the KVM state from the local END structure */
> >> +        kvm_eq.flags   = KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
> >> +        kvm_eq.qsize   = GETFIELD(END_W0_QSIZE, end.w0) + 12;
> >> +        kvm_eq.qpage   = (((uint64_t)(end.w2 & 0x0fffffff)) << 32) | end.w3;
> >> +        kvm_eq.qtoggle = GETFIELD(END_W1_GENERATION, end.w1);
> >> +        kvm_eq.qindex  = GETFIELD(END_W1_PAGE_OFF, end.w1);
> >> +
> >> +        /* Encode the tuple (server, prio) as a KVM EQ index */
> >> +        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
> >> +            KVM_XIVE_EQ_PRIORITY_MASK;
> >> +        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
> >> +            KVM_XIVE_EQ_SERVER_MASK;
> >> +
> >> +        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
> >> +                                &kvm_eq, true, &local_err);
> >> +        if (local_err) {
> >> +            error_propagate(errp, local_err);
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static int spapr_xive_kvm_get_eq_state(sPAPRXive *xive, CPUState *cs,
> >> +                                       Error **errp)
> >> +{
> >> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> >> +    unsigned long vcpu_id = kvm_arch_vcpu_id(cs);
> >> +    int ret;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < XIVE_PRIORITY_MAX + 1; i++) {
> >> +        Error *local_err = NULL;
> >> +        struct kvm_ppc_xive_eq kvm_eq = { 0 };
> >> +        uint64_t kvm_eq_idx;
> >> +        XiveEND end = { 0 };
> >> +        uint8_t end_blk, nvt_blk;
> >> +        uint32_t end_idx, nvt_idx;
> >> +
> >> +        /* Skip priorities reserved for the hypervisor */
> >> +        if (!spapr_xive_priority_is_valid(i)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        /* Encode the tuple (server, prio) as a KVM EQ index */
> >> +        kvm_eq_idx = i << KVM_XIVE_EQ_PRIORITY_SHIFT &
> >> +            KVM_XIVE_EQ_PRIORITY_MASK;
> >> +        kvm_eq_idx |= vcpu_id << KVM_XIVE_EQ_SERVER_SHIFT &
> >> +            KVM_XIVE_EQ_SERVER_MASK;
> >> +
> >> +        ret = kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EQ, kvm_eq_idx,
> >> +                                &kvm_eq, false, &local_err);
> >> +        if (local_err) {
> >> +            error_propagate(errp, local_err);
> >> +            return ret;
> >> +        }
> >> +
> >> +        if (!(kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        /* Update the local END structure with the KVM input */
> >> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ENABLED) {
> >> +                end.w0 |= END_W0_VALID | END_W0_ENQUEUE;
> >> +        }
> >> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY) {
> >> +                end.w0 |= END_W0_UCOND_NOTIFY;
> >> +        }
> >> +        if (kvm_eq.flags & KVM_XIVE_EQ_FLAG_ESCALATE) {
> >> +                end.w0 |= END_W0_ESCALATE_CTL;
> >> +        }
> >> +        end.w0 |= SETFIELD(END_W0_QSIZE, 0ul, kvm_eq.qsize - 12);
> >> +
> >> +        end.w1 = SETFIELD(END_W1_GENERATION, 0ul, kvm_eq.qtoggle) |
> >> +            SETFIELD(END_W1_PAGE_OFF, 0ul, kvm_eq.qindex);
> >> +        end.w2 = (kvm_eq.qpage >> 32) & 0x0fffffff;
> >> +        end.w3 = kvm_eq.qpage & 0xffffffff;
> >> +        end.w4 = 0;
> >> +        end.w5 = 0;
> >> +
> >> +        ret = spapr_xive_cpu_to_nvt(xive, POWERPC_CPU(cs), &nvt_blk, &nvt_idx);
> >> +        if (ret) {
> >> +            error_setg(errp, "XIVE: No NVT for CPU %ld", vcpu_id);
> >> +            return ret;
> >> +        }
> >> +
> >> +        end.w6 = SETFIELD(END_W6_NVT_BLOCK, 0ul, nvt_blk) |
> >> +            SETFIELD(END_W6_NVT_INDEX, 0ul, nvt_idx);
> >> +        end.w7 = SETFIELD(END_W7_F0_PRIORITY, 0ul, i);
> >> +
> >> +        spapr_xive_cpu_to_end(xive, POWERPC_CPU(cs), i, &end_blk, &end_idx);
> >> +
> >> +        ret = xive_router_set_end(xrtr, end_blk, end_idx, &end);
> >> +        if (ret) {
> >> +            error_setg(errp, "XIVE: No END for CPU %ld priority %d",
> >> +                       vcpu_id, i);
> >> +            return ret;
> >> +        }
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >> +static void spapr_xive_kvm_set_eas_state(sPAPRXive *xive, Error **errp)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        XiveEAS *eas = &xive->eat[i];
> >> +        uint32_t end_idx;
> >> +        uint32_t end_blk;
> >> +        uint32_t eisn;
> >> +        uint8_t priority;
> >> +        uint32_t server;
> >> +        uint64_t kvm_eas;
> >> +        Error *local_err = NULL;
> >> +
> >> +        /* No need to set MASKED EAS, this is the default state after reset */
> >> +        if (!(eas->w & EAS_VALID) || eas->w & EAS_MASKED) {
> >> +            continue;
> >> +        }
> >> +
> >> +        end_idx = GETFIELD(EAS_END_INDEX, eas->w);
> >> +        end_blk = GETFIELD(EAS_END_BLOCK, eas->w);
> >> +        eisn = GETFIELD(EAS_END_DATA, eas->w);
> >> +
> >> +        spapr_xive_end_to_target(xive, end_blk, end_idx, &server, &priority);
> >> +
> >> +        kvm_eas = priority << KVM_XIVE_EAS_PRIORITY_SHIFT &
> >> +            KVM_XIVE_EAS_PRIORITY_MASK;
> >> +        kvm_eas |= server << KVM_XIVE_EAS_SERVER_SHIFT &
> >> +            KVM_XIVE_EAS_SERVER_MASK;
> >> +        kvm_eas |= ((uint64_t)eisn << KVM_XIVE_EAS_EISN_SHIFT) &
> >> +            KVM_XIVE_EAS_EISN_MASK;
> >> +
> >> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, true,
> >> +                          &local_err);
> >> +        if (local_err) {
> >> +            error_propagate(errp, local_err);
> >> +            return;
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +static void spapr_xive_kvm_get_eas_state(sPAPRXive *xive, Error **errp)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +    int i;
> >> +
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        XiveEAS *eas = &xive->eat[i];
> >> +        XiveEAS new_eas;
> >> +        uint64_t kvm_eas;
> >> +        uint8_t priority;
> >> +        uint32_t server;
> >> +        uint32_t end_idx;
> >> +        uint8_t end_blk;
> >> +        uint32_t eisn;
> >> +        Error *local_err = NULL;
> >> +
> >> +        if (!(eas->w & EAS_VALID)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_EAS, i, &kvm_eas, false,
> >> +                          &local_err);
> >> +        if (local_err) {
> >> +            error_propagate(errp, local_err);
> >> +            return;
> >> +        }
> >> +
> >> +        priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
> >> +            KVM_XIVE_EAS_PRIORITY_SHIFT;
> >> +        server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
> >> +            KVM_XIVE_EAS_SERVER_SHIFT;
> >> +        eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
> >> +
> >> +        if (spapr_xive_target_to_end(xive, server, priority, &end_blk,
> >> +                                     &end_idx)) {
> >> +            error_setg(errp, "XIVE: invalid tuple CPU %d priority %d", server,
> >> +                       priority);
> >> +            return;
> >> +        }
> >> +
> >> +        new_eas.w = EAS_VALID;
> >> +        if (kvm_eas & KVM_XIVE_EAS_MASK_MASK) {
> >> +            new_eas.w |= EAS_MASKED;
> >> +        }
> >> +
> >> +        new_eas.w = SETFIELD(EAS_END_INDEX, new_eas.w, end_idx);
> >> +        new_eas.w = SETFIELD(EAS_END_BLOCK, new_eas.w, end_blk);
> >> +        new_eas.w = SETFIELD(EAS_END_DATA, new_eas.w, eisn);
> >> +
> >> +        *eas = new_eas;
> >> +    }
> >> +}
> >> +
> >> +static void spapr_xive_kvm_sync_all(sPAPRXive *xive, Error **errp)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +    Error *local_err = NULL;
> >> +    int i;
> >> +
> >> +    /* Sync the KVM source. This reaches the XIVE HW through OPAL */
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        XiveEAS *eas = &xive->eat[i];
> >> +
> >> +        if (!(eas->w & EAS_VALID)) {
> >> +            continue;
> >> +        }
> >> +
> >> +        kvm_device_access(xive->fd, KVM_DEV_XIVE_GRP_SYNC, i, NULL, true,
> >> +                          &local_err);
> >> +        if (local_err) {
> >> +            error_propagate(errp, local_err);
> >> +            return;
> >> +        }
> >> +    }
> >> +}
> >> +
> >> +/*
> >> + * The sPAPRXive KVM model migration priority is higher to make sure
> > 
> > Higher than what?
> 
> Than the XiveTCTX and XiveSource models.
> 
> >> + * its 'pre_save' method runs before all the other XIVE models. It
> > 
> > If the other XIVE components are children of sPAPRXive (which I think
> > they are or could be), then I believe the parent object's pre_save
> > will automatically be called first.
> 
> ok. XiveTCTX are not children of sPAPRXive but that might not be 
> a problem anymore with the VMState change handler.

Ah, right.  You might need the handler in the machine itself then - we
already have something like that for XICS, IIRC.

> 
> Thanks
> 
> C.
> 
> >> + * orchestrates the capture sequence of the XIVE states in the
> >> + * following order:
> >> + *
> >> + *   1. mask all the sources by setting PQ=01, which returns the
> >> + *      previous value and save it.
> >> + *   2. sync the sources in KVM to stabilize all the queues
> >> + *      sync the ENDs to make sure END -> VP is fully completed
> >> + *   3. dump the EAS table
> >> + *   4. dump the END table
> >> + *   5. dump the thread context (IPB)
> >> + *
> >> + *  Rollback to restore the current configuration of the sources
> > 
> > 
> > 
> >> + */
> >> +static int spapr_xive_kvm_pre_save(sPAPRXive *xive)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +    Error *local_err = NULL;
> >> +    CPUState *cs;
> >> +    int i;
> >> +    int ret = 0;
> >> +
> >> +    /* Quiesce the sources, to stop the flow of event notifications */
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        /*
> >> +         * Mask and save the ESB PQs locally in the XiveSource object.
> >> +         */
> >> +        uint8_t pq = xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_01);
> >> +        xive_source_esb_set(xsrc, i, pq);
> >> +    }
> >> +
> >> +    /* Sync the sources in KVM */
> >> +    spapr_xive_kvm_sync_all(xive, &local_err);
> >> +    if (local_err) {
> >> +        error_report_err(local_err);
> >> +        goto out;
> >> +    }
> >> +
> >> +    /* Grab the EAT (could be done earlier ?) */
> >> +    spapr_xive_kvm_get_eas_state(xive, &local_err);
> >> +    if (local_err) {
> >> +        error_report_err(local_err);
> >> +        goto out;
> >> +    }
> >> +
> >> +    /*
> >> +     * Grab the ENDs. The EQ index and the toggle bit are what we want
> >> +     * to capture
> >> +     */
> >> +    CPU_FOREACH(cs) {
> >> +        spapr_xive_kvm_get_eq_state(xive, cs, &local_err);
> >> +        if (local_err) {
> >> +            error_report_err(local_err);
> >> +            goto out;
> >> +        }
> >> +    }
> >> +
> >> +    /* Capture the thread interrupt contexts */
> >> +    CPU_FOREACH(cs) {
> >> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> >> +
> >> +        /* TODO: Check if we need to use under run_on_cpu() ? */
> >> +        xive_tctx_kvm_get_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
> >> +        if (local_err) {
> >> +            error_report_err(local_err);
> >> +            goto out;
> >> +        }
> >> +    }
> >> +
> >> +    /* All done. */
> >> +
> >> +out:
> >> +    /* Restore the sources to their initial state */
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        uint8_t pq = xive_source_esb_get(xsrc, i);
> >> +        if (xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8)) != 0x1) {
> >> +            error_report("XIVE: IRQ %d has an invalid state", i);
> >> +        }
> >> +    }
> >> +
> >> +    /*
> >> +     * The XiveSource and the XiveTCTX states will be collected by
> >> +     * their respective vmstate handlers afterwards.
> >> +     */
> >> +    return ret;
> >> +}
> >> +
> >> +/*
> >> + * The sPAPRXive 'post_load' method is called by the sPAPR machine,
> >> + * after all XIVE device states have been transferred and loaded.
> >> + *
> >> + * All should be in place when the VCPUs resume execution.
> >> + */
> >> +static int spapr_xive_kvm_post_load(sPAPRXive *xive, int version_id)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +    Error *local_err = NULL;
> >> +    CPUState *cs;
> >> +    int i;
> >> +
> >> +    /* Set the ENDs first. The targeting depends on it. */
> >> +    CPU_FOREACH(cs) {
> >> +        spapr_xive_kvm_set_eq_state(xive, cs, &local_err);
> >> +        if (local_err) {
> >> +            error_report_err(local_err);
> >> +            return -1;
> >> +        }
> >> +    }
> >> +
> >> +    /* Restore the targeting, if any */
> >> +    spapr_xive_kvm_set_eas_state(xive, &local_err);
> >> +    if (local_err) {
> >> +        error_report_err(local_err);
> >> +        return -1;
> >> +    }
> >> +
> >> +    /* Restore the thread interrupt contexts */
> >> +    CPU_FOREACH(cs) {
> >> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
> >> +
> >> +        xive_tctx_kvm_set_state(XIVE_TCTX_KVM(cpu->intc), &local_err);
> >> +        if (local_err) {
> >> +            error_report_err(local_err);
> >> +            return -1;
> >> +        }
> >> +    }
> >> +
> >> +    /*
> >> +     * Get the saved state from the XiveSource model and restore the
> >> +     * PQ bits
> >> +     */
> >> +    for (i = 0; i < xsrc->nr_irqs; i++) {
> >> +        uint8_t pq = xive_source_esb_get(xsrc, i);
> >> +        xive_esb_read(xsrc, i, XIVE_ESB_SET_PQ_00 + (pq << 8));
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static void spapr_xive_kvm_synchronize_state(sPAPRXive *xive)
> >> +{
> >> +    XiveSource *xsrc = &xive->source;
> >> +    CPUState *cs;
> >> +
> >> +    xive_source_kvm_get_state(xsrc);
> >> +
> >> +    spapr_xive_kvm_get_eas_state(xive, &error_fatal);
> >> +
> >> +    CPU_FOREACH(cs) {
> >> +        spapr_xive_kvm_get_eq_state(xive, cs, &error_fatal);
> >> +    }
> >> +}
> >>  
> >>  static void spapr_xive_kvm_instance_init(Object *obj)
> >>  {
> >> @@ -409,6 +899,10 @@ static void spapr_xive_kvm_class_init(ObjectClass *klass, void *data)
> >>  
> >>      dc->desc = "sPAPR XIVE KVM Interrupt Controller";
> >>      dc->unrealize = spapr_xive_kvm_unrealize;
> >> +
> >> +    sxc->synchronize_state = spapr_xive_kvm_synchronize_state;
> >> +    sxc->pre_save = spapr_xive_kvm_pre_save;
> >> +    sxc->post_load = spapr_xive_kvm_post_load;
> >>  }
> >>  
> >>  static const TypeInfo spapr_xive_kvm_info = {
> >> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >> index 9bb37553c9ec..c9aedecc8216 100644
> >> --- a/hw/intc/xive.c
> >> +++ b/hw/intc/xive.c
> >> @@ -438,9 +438,14 @@ static const struct {
> >>  
> >>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
> >>  {
> >> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
> >>      int cpu_index = tctx->cs ? tctx->cs->cpu_index : -1;
> >>      int i;
> >>  
> >> +    if (xtc->synchronize_state) {
> >> +        xtc->synchronize_state(tctx);
> >> +    }
> >> +
> >>      monitor_printf(mon, "CPU[%04x]:   QW   NSR CPPR IPB LSMFB ACK# INC AGE PIPR"
> >>                     "  W2\n", cpu_index);
> >>  
> >> @@ -552,10 +557,23 @@ static void xive_tctx_base_unrealize(DeviceState *dev, Error **errp)
> >>      qemu_unregister_reset(xive_tctx_base_reset, dev);
> >>  }
> >>  
> >> +static int vmstate_xive_tctx_post_load(void *opaque, int version_id)
> >> +{
> >> +    XiveTCTX *tctx = XIVE_TCTX_BASE(opaque);
> >> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_GET_CLASS(tctx);
> >> +
> >> +    if (xtc->post_load) {
> >> +        return xtc->post_load(tctx, version_id);
> >> +    }
> >> +
> >> +    return 0;
> >> +}
> >> +
> >>  static const VMStateDescription vmstate_xive_tctx_base = {
> >>      .name = TYPE_XIVE_TCTX,
> >>      .version_id = 1,
> >>      .minimum_version_id = 1,
> >> +    .post_load = vmstate_xive_tctx_post_load,
> >>      .fields = (VMStateField[]) {
> >>          VMSTATE_BUFFER(regs, XiveTCTX),
> >>          VMSTATE_END_OF_LIST()
> >> @@ -581,9 +599,37 @@ static const TypeInfo xive_tctx_base_info = {
> >>      .class_size    = sizeof(XiveTCTXClass),
> >>  };
> >>  
> >> +static int xive_tctx_post_load(XiveTCTX *tctx, int version_id)
> >> +{
> >> +    XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(tctx->xrtr);
> >> +
> >> +    /*
> >> +     * When we collect the states from KVM XIVE irqchip, we set word2
> >> +     * of the thread context to print out the OS CAM line under the
> >> +     * QEMU monitor.
> >> +     *
> >> +     * This breaks migration on a guest using TCG or not using a KVM
> >> +     * irqchip. Fix with an extra reset of the thread contexts.
> >> +     */
> >> +    if (xrc->reset_tctx) {
> >> +        xrc->reset_tctx(tctx->xrtr, tctx);
> >> +    }
> >> +    return 0;
> >> +}
> >> +
> >> +static void xive_tctx_class_init(ObjectClass *klass, void *data)
> >> +{
> >> +    XiveTCTXClass *xtc = XIVE_TCTX_BASE_CLASS(klass);
> >> +
> >> +    xtc->post_load = xive_tctx_post_load;
> >> +}
> >> +
> >>  static const TypeInfo xive_tctx_info = {
> >>      .name          = TYPE_XIVE_TCTX,
> >>      .parent        = TYPE_XIVE_TCTX_BASE,
> >> +    .instance_size = sizeof(XiveTCTX),
> >> +    .class_init    = xive_tctx_class_init,
> >> +    .class_size    = sizeof(XiveTCTXClass),
> >>  };
> >>  
> >>  Object *xive_tctx_create(Object *cpu, const char *type, XiveRouter *xrtr,
> >> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
> >> index 92ef53743b64..6fac6ca70595 100644
> >> --- a/hw/ppc/spapr_irq.c
> >> +++ b/hw/ppc/spapr_irq.c
> >> @@ -359,7 +359,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
> >>  
> >>  static int spapr_irq_post_load_xive(sPAPRMachineState *spapr, int version_id)
> >>  {
> >> -    return 0;
> >> +    return spapr_xive_post_load(spapr->xive, version_id);
> >>  }
> >>  
> >>  /*
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-30  1:04       ` David Gibson
@ 2018-11-30  6:41         ` Cédric Le Goater
  2018-12-03  1:14           ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-30  6:41 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/30/18 2:04 AM, David Gibson wrote:
> On Thu, Nov 29, 2018 at 11:06:13PM +0100, Cédric Le Goater wrote:
>> On 11/22/18 6:13 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
>>>> The Event Notification Descriptor also contains two Event State
>>>> Buffers providing further coalescing of interrupts, one for the
>>>> notification event (ESn) and one for the escalation events (ESe). A
>>>> MMIO page is assigned for each to control the EOI through loads
>>>> only. Stores are not allowed.
>>>>
>>>> The END ESBs are modeled through an object resembling the 'XiveSource'
>>>> It is stateless as the END state bits are backed into the XiveEND
>>>> structure under the XiveRouter and the MMIO accesses follow the same
>>>> rules as for the standard source ESBs.
>>>>
>>>> END ESBs are not supported by the Linux drivers, neither on OPAL nor on
>>>> sPAPR. Nevertheless, it provides a means to study the question in the
>>>> future and validates the XIVE model a bit more.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/xive.h |  20 ++++++
>>>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
>>>>  2 files changed, 178 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index ce62aaf28343..24301bf2076d 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>                          XiveEND *end);
>>>>  
>>>> +/*
>>>> + * XIVE END ESBs
>>>> + */
>>>> +
>>>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
>>>> +#define XIVE_END_SOURCE(obj) \
>>>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
>>>
>>> Is there a particular reason to make this a full QOM object, rather
>>> than just embedding it in the XiveRouter?
>>
>> Coming back on this question because removing the chip_id from the
>> router is a problem for the END triggering. At least with the current
>> design. See below for the comment.
>>
>>>> +typedef struct XiveENDSource {
>>>> +    SysBusDevice parent;
>>>> +
>>>> +    uint32_t        nr_ends;
>>>> +
>>>> +    /* ESB memory region */
>>>> +    uint32_t        esb_shift;
>>>> +    MemoryRegion    esb_mmio;
>>>> +
>>>> +    XiveRouter      *xrtr;
>>>> +} XiveENDSource;
>>>> +
>>>>  /*
>>>>   * For legacy compatibility, the exceptions define up to 256 different
>>>>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index 9cb001e7b540..5a8882d47a98 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>>>       * even futher coalescing in the Router
>>>>       */
>>>>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
>>>> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>>>> -        return;
>>>> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
>>>> +        bool notify = xive_esb_trigger(&pq);
>>>> +
>>>> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
>>>> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
>>>> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
>>>> +        }
>>>> +
>>>> +        /* ESn[Q]=1 : end of notification */
>>>> +        if (!notify) {
>>>> +            return;
>>>> +        }
>>>>      }
>>>>  
>>>>      /*
>>>> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>>>>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>>>>  }
>>>>  
>>>> +/*
>>>> + * END ESB MMIO loads
>>>> + */
>>>> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
>>>> +{
>>>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
>>>> +    XiveRouter *xrtr = xsrc->xrtr;
>>>> +    uint32_t offset = addr & 0xFFF;
>>>> +    uint8_t end_blk;
>>>> +    uint32_t end_idx;
>>>> +    XiveEND end;
>>>> +    uint32_t end_esmask;
>>>> +    uint8_t pq;
>>>> +    uint64_t ret = -1;
>>>> +
>>>> +    end_blk = xrtr->chip_id;
>>>> +    end_idx = addr >> (xsrc->esb_shift + 1);
>>>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>>
>> The current END accessors require a block identifier, hence xrtr->chip_id, 
>> but in this case, we don't really need it because we are using the ENDT 
>> local to the router/chip. 
> 
>> I don't know how to handle simply this case without keeping chip_id :/
> 
> I don't really follow how chip_id is relevant here.  AFAICT the END
> accessors take a block id and the back end is responsible for
> interpreting them.  The powernv one will map it to chip id, but the
> PAPR one can just ignore it or only use block 0.

Yes. But the block value comes from the xrtr->chip_id today, on PAPR and
PowerNV, even if it's block 0. 

What I could do is add a "chip-id" property to XiveENDSource possibly.

C. 

* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-30  1:11           ` David Gibson
@ 2018-11-30  6:56             ` Cédric Le Goater
  2018-12-03  1:18               ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-30  6:56 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/30/18 2:11 AM, David Gibson wrote:
> On Thu, Nov 29, 2018 at 04:27:31PM +0100, Cédric Le Goater wrote:
>> [ ... ] 
>>
>>>>>> +/*
>>>>>> + * The allocation of VP blocks is a complex operation in OPAL and the
>>>>>> + * VP identifiers have a relation with the number of HW chips, the
>>>>>> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
>>>>>> + * controller model does not have the same constraints and can use a
>>>>>> + * simple mapping scheme of the CPU vcpu_id
>>>>>> + *
>>>>>> + * These identifiers are never returned to the OS.
>>>>>> + */
>>>>>> +
>>>>>> +#define SPAPR_XIVE_VP_BASE 0x400
>>>>>
>>>>> 0x400 == 1024.  Could we ever have the possibility of needing to
>>>>> consider both physical NVTs and PAPR NVTs at the same time?  
>>>>
>>>> They would not be in the same CAM line: OS ring vs. PHYS ring. 
>>>
>>> Hm.  They still inhabit the same NVT number space though, don't they?
>>
>> No. skiboot reserves the range of VPs for the HW at init.
>>
>> https://github.com/open-power/skiboot/blob/master/hw/xive.c#L1093
> 
> Uh.. I don't see how they're reserved is relevant.
> 
> What I mean is that the ENDs address the NVTs for HW endpoints by the
> same (block, index) tuples as the NVTs for virtualized endpoints, yes?

Ah. Yes. The (block, index) tuples, fields END_W6_NVT_BLOCK and 
END_W6_NVT_INDEX in the END structure, are all in the same number space.

skiboot defines some ranges though.


>>> I'm thinking about the END->NVT stage of the process here, rather than
>>> the NVT->TCTX stage.
>>>
>>> Oh, also, you're using "VP" here which IIUC == "NVT".  Can we
>>> standardize on one, please.
>>
>> VP is used in Linux/KVM Linux/Native and skiboot. Yes. it's a mess. 
>> Let's have consistent naming in QEMU and use NVT. 
> 
> Right.  And to cover any inevitable missed ones is why I'd like to see
> a cheatsheet giving both terms in the header comments somewhere.

yes. I have added a list of names in xive.h. 

I was wondering if I should put the diagram below somewhere in a .h file 
or under doc/specs/.

Thanks,

C.  


= XIVE =================================================================

The POWER9 processor comes with a new interrupt controller called
XIVE, for "eXternal Interrupt Virtualization Engine".


* Overall architecture


             XIVE Interrupt Controller
             +------------------------------------+      IPIs
             | +---------+ +---------+ +--------+ |    +-------+
             | |VC       | |CQ       | |PC      |----> | CORES |
             | |     esb | |         | |        |----> |       |
             | |     eas | |  Bridge | |   tctx |----> |       |
             | |SC   end | |         | |    nvt | |    |       |
 +------+    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
 | RAM  |    +------------------|-----------------+      | | |
 |      |                       |                        | | |
 |      |                       |                        | | |
 |      |  +--------------------v------------------------v-v-v--+    other
 |      <--+                     Power Bus                      +--> chips
 |  esb |  +---------+-----------------------+------------------+
 |  eas |            |                       |
 |  end |        +---|-----+                 |
 |  nvt |       +----+----+|            +----+----+
 +------+       |SC       ||            |SC       |
                |         ||            |         |
                | PQ-bits ||            | PQ-bits |
                | local   |+            |  in VC  |
                +---------+             +---------+
                   PCIe                 NX,NPU,CAPI

                  SC: Source Controller (aka. IVSE)
                  VC: Virtualization Controller (aka. IVRE)
                  PC: Presentation Controller (aka. IVPE)
                  CQ: Common Queue (Bridge)

             PQ-bits: 2 bits source state machine (P:pending Q:queued)
                 esb: Event State Buffer (Array of PQ bits in an IVSE)
                 eas: Event Assignment Structure
                 end: Event Notification Descriptor
                 nvt: Notification Virtual Target
                tctx: Thread interrupt Context


The XIVE IC is composed of three sub-engines :

  - Interrupt Virtualization Source Engine (IVSE), or Source
    Controller (SC). These are found in PCI PHBs, in the PSI host
    bridge controller, but also inside the main controller for the
    core IPIs and other sub-chips (NX, CAP, NPU) of the
    chip/processor. They are configured to feed the IVRE with events.

  - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
    Controller (VC). Its job is to match an event source with an Event
    Notification Descriptor (END).

  - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
    Controller (PC). It maintains the interrupt context state of each
    thread and handles the delivery of the external exception to the
    thread.


* XIVE internal tables

Each of the sub-engines uses a set of tables to redirect exceptions
from event sources to CPU threads.

                                          +-------+
  User or OS                              |  EQ   |
      or                          +------>|entries|
  Hypervisor                      |       |  ..   |
    Memory                        |       +-------+
                                  |           ^
                                  |           |
             +-------------------------------------------------+
                                  |           |
  Hypervisor      +------+    +---+--+    +---+--+   +------+
    Memory        | ESB  |    | EAT  |    | ENDT |   | NVTT |
   (skiboot)      +----+-+    +----+-+    +----+-+   +------+
                    ^  |        ^  |        ^  |       ^
                    |  |        |  |        |  |       |
             +-------------------------------------------------+
                    |  |        |  |        |  |       |
                    |  |        |  |        |  |       |
               +----|--|--------|--|--------|--|-+   +-|-----+    +------+
               |    |  |        |  |        |  | |   | | tctx|    |Thread|
   IPI or   ---+    +  v        +  v        +  v |---| +  .. |----->     |
  HW events    |                                 |   |       |    |      |
               |             IVRE                |   | IVPE  |    +------+
               +---------------------------------+   +-------+
            


The IVSE has a 2-bit state machine for each source, P for pending and
Q for queued, which controls whether events are let through. The PQ
bits are stored in an array, the Event State Buffer (ESB), and are
controlled by MMIOs.

If the event is let through, the IVRE looks up the Event Assignment
Structure (EAS) table for the Event Notification Descriptor (END)
configured for the source. Each Event Notification Descriptor defines
a notification path to a CPU and an in-memory Event Queue, into which
an EQ data entry is pushed for the OS to pull.

The IVPE determines if a Notification Virtual Target (NVT) can handle
the event by scanning the thread contexts of the VPs dispatched on the
processor HW threads. It maintains the interrupt context state of each
thread in a NVT table.
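
The PQ state machine of a source can be sketched in a few lines of C.
This is only an illustration of the coalescing described above, not
the model code: the enum and function names are made up for the
example, and the encoding assumes P is bit 1 and Q is bit 0.

```c
#include <stdbool.h>
#include <stdint.h>

/* PQ encodings (illustrative names, P is bit 1 and Q is bit 0) */
enum {
    ESB_RESET   = 0x0,  /* P=0 Q=0: idle, a trigger is let through */
    ESB_OFF     = 0x1,  /* P=0 Q=1: source is masked */
    ESB_PENDING = 0x2,  /* P=1 Q=0: one event in flight */
    ESB_QUEUED  = 0x3,  /* P=1 Q=1: event in flight, another coalesced */
};

/* Returns true when the event should be forwarded to the IVRE. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce, nothing is forwarded */
        return false;
    default:                /* ESB_OFF: masked, event is dropped */
        *pq = ESB_OFF;
        return false;
    }
}

/* Returns true when a coalesced event must be replayed after the EOI. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;  /* the queued event becomes the pending one */
        return true;
    default:                /* ESB_OFF: stays masked */
        *pq = ESB_OFF;
        return false;
    }
}
```

A second trigger arriving before the EOI only sets Q, which is how
several events from the same source collapse into a single replay.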

* Re: [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM
  2018-11-30  1:24       ` David Gibson
@ 2018-11-30  7:04         ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-30  7:04 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[ ... ]

>>>> +/*
>>>> + * The sPAPRXive KVM model migration priority is higher to make sure
>>>
>>> Higher than what?
>>
>> Than the XiveTCTX and XiveSource models.
>>
>>>> + * its 'pre_save' method runs before all the other XIVE models. It
>>>
>>> If the other XIVE components are children of sPAPRXive (which I think
>>> they are or could be), then I believe the parent object's pre_save
>>> will automatically be called first.
>>
>> ok. XiveTCTX are not children of sPAPRXive but that might not be 
>> a problem anymore with the VMState change handler.
> 
> Ah, right.  You might need the handler in the machine itself then - we
> already have something like that for XICS, IIRC.

exactly. For XIVE, I am using the post_load method at the machine level, 
which should be last. The XIVE sources PQs are restored when the 
machine starts running again in the VM state change handler. So I don't
need the priority at all on the destination. I will try to remove the
prio, I agree it's a bit ugly.  

C.

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-30  1:23           ` David Gibson
@ 2018-11-30  8:07             ` Cédric Le Goater
  2018-12-03  1:36               ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-11-30  8:07 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/30/18 2:23 AM, David Gibson wrote:
> On Thu, Nov 29, 2018 at 05:04:50PM +0100, Cédric Le Goater wrote:
>> On 11/29/18 2:23 AM, David Gibson wrote:
>>> On Wed, Nov 28, 2018 at 11:21:37PM +0100, Cédric Le Goater wrote:
>>>> On 11/28/18 5:25 AM, David Gibson wrote:
>>>>> On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
>>>>>> The different XIVE virtualization structures (sources and event queues)
>>>>>> are configured with a set of Hypervisor calls :
>>>>>>
>>>>>>  - H_INT_GET_SOURCE_INFO
>>>>>>
>>>>>>    used to obtain the address of the MMIO page of the Event State
>>>>>>    Buffer (ESB) entry associated with the source.
>>>>>>
>>>>>>  - H_INT_SET_SOURCE_CONFIG
>>>>>>
>>>>>>    assigns a source to a "target".
>>>>>>
>>>>>>  - H_INT_GET_SOURCE_CONFIG
>>>>>>
>>>>>>    determines which "target" and "priority" is assigned to a source
>>>>>>
>>>>>>  - H_INT_GET_QUEUE_INFO
>>>>>>
>>>>>>    returns the address of the notification management page associated
>>>>>>    with the specified "target" and "priority".
>>>>>>
>>>>>>  - H_INT_SET_QUEUE_CONFIG
>>>>>>
>>>>>>    sets or resets the event queue for a given "target" and "priority".
>>>>>>    It is also used to set the notification configuration associated
>>>>>>    with the queue, only unconditional notification is supported for
>>>>>>    the moment. Reset is performed with a queue size of 0 and queueing
>>>>>>    is disabled in that case.
>>>>>>
>>>>>>  - H_INT_GET_QUEUE_CONFIG
>>>>>>
>>>>>>    returns the queue settings for a given "target" and "priority".
>>>>>>
>>>>>>  - H_INT_RESET
>>>>>>
>>>>>>    resets all of the guest's internal interrupt structures to their
>>>>>>    initial state, losing all configuration set via the hcalls
>>>>>>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
>>>>>>
>>>>>>  - H_INT_SYNC
>>>>>>
>>>>>>    issue a synchronisation on a source to make sure all notifications
>>>>>>    have reached their queue.
>>>>>>
>>>>>> Calls that still need to be addressed :
>>>>>>
>>>>>>    H_INT_SET_OS_REPORTING_LINE
>>>>>>    H_INT_GET_OS_REPORTING_LINE
>>>>>>
>>>>>> See the code for more documentation on each hcall.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>  include/hw/ppc/spapr.h      |  15 +-
>>>>>>  include/hw/ppc/spapr_xive.h |   6 +
>>>>>>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
>>>>>>  hw/ppc/spapr_irq.c          |   2 +
>>>>>>  hw/intc/Makefile.objs       |   2 +-
>>>>>>  5 files changed, 915 insertions(+), 2 deletions(-)
>>>>>>  create mode 100644 hw/intc/spapr_xive_hcall.c
>>>>>>
>>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
>>>>>> index 1fbc2663e06c..8415faea7b82 100644
>>>>>> --- a/include/hw/ppc/spapr.h
>>>>>> +++ b/include/hw/ppc/spapr.h
>>>>>> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
>>>>>>  #define H_INVALIDATE_PID        0x378
>>>>>>  #define H_REGISTER_PROC_TBL     0x37C
>>>>>>  #define H_SIGNAL_SYS_RESET      0x380
>>>>>> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
>>>>>> +
>>>>>> +#define H_INT_GET_SOURCE_INFO   0x3A8
>>>>>> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
>>>>>> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
>>>>>> +#define H_INT_GET_QUEUE_INFO    0x3B4
>>>>>> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
>>>>>> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
>>>>>> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
>>>>>> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
>>>>>> +#define H_INT_ESB               0x3C8
>>>>>> +#define H_INT_SYNC              0x3CC
>>>>>> +#define H_INT_RESET             0x3D0
>>>>>> +
>>>>>> +#define MAX_HCALL_OPCODE        H_INT_RESET
>>>>>>  
>>>>>>  /* The hcalls above are standardized in PAPR and implemented by pHyp
>>>>>>   * as well.
>>>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
>>>>>> index 3f65b8f485fd..418511f3dc10 100644
>>>>>> --- a/include/hw/ppc/spapr_xive.h
>>>>>> +++ b/include/hw/ppc/spapr_xive.h
>>>>>> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
>>>>>>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
>>>>>>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
>>>>>>  
>>>>>> +bool spapr_xive_priority_is_valid(uint8_t priority);
>>>>>
>>>>> AFAICT this could be a local function.
>>>>
>>>> the KVM model uses it also, when collecting state from the KVM device 
>>>> to build the QEMU ENDT.
>>>>
>>>>>> +
>>>>>> +typedef struct sPAPRMachineState sPAPRMachineState;
>>>>>> +
>>>>>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
>>>>>> +
>>>>>>  #endif /* PPC_SPAPR_XIVE_H */
>>>>>> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
>>>>>> new file mode 100644
>>>>>> index 000000000000..52e4e23995f5
>>>>>> --- /dev/null
>>>>>> +++ b/hw/intc/spapr_xive_hcall.c
>>>>>> @@ -0,0 +1,892 @@
>>>>>> +/*
>>>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
>>>>>> + *
>>>>>> + * Copyright (c) 2017-2018, IBM Corporation.
>>>>>> + *
>>>>>> + * This code is licensed under the GPL version 2 or later. See the
>>>>>> + * COPYING file in the top-level directory.
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "qemu/log.h"
>>>>>> +#include "qapi/error.h"
>>>>>> +#include "cpu.h"
>>>>>> +#include "hw/ppc/fdt.h"
>>>>>> +#include "hw/ppc/spapr.h"
>>>>>> +#include "hw/ppc/spapr_xive.h"
>>>>>> +#include "hw/ppc/xive_regs.h"
>>>>>> +#include "monitor/monitor.h"
>>>>>
>>>>> Fwiw, I don't think it's particularly necessary to split the hcall
>>>>> handling out into a separate .c file.
>>>>
>>>> ok. let's move it to spapr_xive then ? It might help in reducing the 
>>>> exported functions.
>>>
>>> Yes, I think so.
>>>
>>>>>> +/*
>>>>>> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
>>>>>> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
>>>>>> + * available for the guest.
>>>>>
>>>>> Referencing OPAL behaviour doesn't really make sense in the context of
>>>>> PAPR.  
>>>>
>>>> It's an OPAL constraint which pHyp doesn't have. So its a QEMU/KVM 
>>>> constraint also.
>>>
>>> Right, I realized that a few patches on.  Maybe rephrase this to
>>>
>>>    Linux hosts under OPAL reserve priority 7 for their own escalation
>>>    interrupts.  So we only allow the guest to use priorities [0..6].
>>
>> OK.
>>
>>> The point here is that we're emphasizing that this is a design
>>> decision to make the host implementation easier, rather than a
>>> fundamental constraint.
>>>
>>>>> What I think you're getting at is that the PAPR spec only
>>>>> allows a PAPR guest to use priorities 0..6 (or at least it will if the
>>>>> XIVE updated spec ever gets published).  
>>>>
>>>> It's not in the spec. the XIVE sPAPR spec should be frozen soon btw. 
>>>>  
>>>>> The fact that this allows the
>>>>> host use 7 for escalations is a design rationale 
>>>>> but not really relevant to the guest device itself. 
>>>>
>>>> The guest should be aware of which priorities are reserved for
>>>> the hypervisor though.
>>>>
>>>>>> + */
>>>>>> +bool spapr_xive_priority_is_valid(uint8_t priority)
>>>>>> +{
>>>>>> +    switch (priority) {
>>>>>> +    case 0 ... 6:
>>>>>> +        return true;
>>>>>> +    case 7: /* OPAL escalation queue */
>>>>>> +    default:
>>>>>> +        return false;
>>>>>> +    }
>>>>>> +}
>>>>>> +
>>>>>> +/*
>>>>>> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
>>>>>> + * real address of the MMIO page through which the Event State Buffer
>>>>>> + * entry associated with the value of the "lisn" parameter is managed.
>>>>>> + *
>>>>>> + * Parameters:
>>>>>> + * Input
>>>>>> + * - "flags"
>>>>>> + *       Bits 0-63 reserved
>>>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
>>>>>> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
>>>>>> + *       ibm,query-interrupt-source-number RTAS call, or as returned
>>>>>> + *       by the H_ALLOCATE_VAS_WINDOW hcall
>>>>>
>>>>> I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
>>>>> to implement in kvm/qemu, or is it only of interest for PowerVM?
>>>>
>>>> The hcall is part of the PAPR NX Interfaces and it returns interrupt
>>>> numbers. I don't know if any work has been done on the topic.  
>>>
>>> What's a "PAPR NX"?
>>
>> A way for the PAPR guests to access the POWER coprocessors doing 
>> compression and encryption. I really don't know much about this.
> 
> Ah, ok.
> 
> [snip]
>>>> I think not, but the specs are not very clear on that topic. I will
>>>> ask for clarification and use a -1 for now. We can not do loads on
>>>> the trigger page so it can not be used by the H_INT_ESB hcall.
>>>>
>>>>>
>>>>>> +    args[3] = TARGET_PAGE_SIZE;
>>>>>
>>>>> That seems wrong.  
>>>>
>>>> This is utterly wrong. it should be a power of 2 number ... I got
>>>> it right under KVM though. I guess that ioremap() under Linux rounds 
>>>> up the size to the page size in use, so, that's why it didn't blow
>>>> up under TCG.
>>>>
>>>>> TARGET_PAGE_SIZE is generally 4kiB, but won't these usually
>>>>> actually be 64kiB?
>>>>
>>>> yes. So what should I use to get a PAGE_SHIFT instead ? 
>>>
>>> Erm, that gets a bit tricky, since qemu in a sense doesn't know the
>>> guest's page size.
>>>
>>> But.. don't you actually want the esb_shift here, not PAGE_SHIFT - it
>>> could matter for the 2 page * 64kiB variant, yes?
>>
>> Yes. we just want the page_shift of the ESB page, whether it's one or
>> two pages. The other registers inform the guest if there are one or 
>> two ESB page in use. 
> 
> Ok, still sounds like you should base it on esb_shift, just adjust for
> the two page case.

yes.
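
Something along these lines, probably. This is a rough sketch only: it
assumes the source keeps the shift of its whole ESB window in esb_shift
and that a flag says whether the window is split in two pages. The
struct and helper names are invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of a source for this example */
struct esb_geometry {
    uint32_t esb_shift;   /* log2 of the ESB window of one source */
    bool     two_pages;   /* trigger page + management page */
};

/*
 * log2 of the size of a single ESB page, which is the value the hcall
 * would report instead of TARGET_PAGE_SIZE.
 */
static uint32_t esb_page_shift(const struct esb_geometry *g)
{
    /* With two pages, each page is half of the window */
    return g->two_pages ? g->esb_shift - 1 : g->esb_shift;
}
```

With a single 64kiB page the window shift is 16 and so is the result.
With two 64kiB pages the window shift is 17 and the result is still 16.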


>>>>>> +    }
>>>>>> +
>>>>>> +    switch (qsize) {
>>>>>> +    case 12:
>>>>>> +    case 16:
>>>>>> +    case 21:
>>>>>> +    case 24:
>>>>>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
>>>>>
>>>>> It just occurred to me that I haven't been looking for this across any
>>>>> of these reviews.  Don't you need byteswaps when accessing these
>>>>> in-memory structures?
>>>>
>>>> yes this is done when some event data is enqueued in the EQ.
>>>
>>> I'm not talking about the data in the EQ itself, but the fields in the
>>> END (and the NVT).
>>
>> XIVE is all BE.
> 
> Yes... the qemu host might not be, which is why you need byteswaps.

ok. I understand.

> I realized eventually you have the swaps in your pnv get/set
> accessors.  

Yes. because skiboot is BE, like the XIVE structures.  

> I don't like that at all for a couple of reasons:
> 
> 1) Although the END structure is made up of word-sized fields because
> that's convenient, the END really is made of a bunch of subfields of
> different sizes.  Knowing that it wouldn't be unreasonable for people
> to expect they can look into the XIVE by byte offsets; 

These structures should be accessed with the GETFIELD and SETFIELD macros
using the XIVE definitions in the xive_regs.h header file. I would want
to keep that common with skiboot for sure.

Are you suggesting we should define each field of the XIVE structures 
with C attributes ? That would be very unfortunate. 
 
> that will break
> if you're working with a copy that has already been byte-swapped on
> word-sized units.

I am not sure I understand the last sentence. 

The code working with a copy would necessarily know that the structure
has been byteswapped and use the correct offsets for the expected
endianness, no? Why would it break?

> 2) At different points in the code you're storing both BE and
> native-endian data in the same struct. 

on sPAPR, it's all native (which is a violation I agree). TIMA is BE.

> That's both confusing to
> someone reading the code (if they see that struct they don't know if
> it's byteswapped already) and also means you can't use sparse
> annotations to make sure you have it right.

XIVE structures are architected to be BE. That's immutable.

It's not a problem for skiboot, which is BE. The PnvXIVE model for the
QEMU PowerNV machine reads these VSTs (Virtual Structure Tables) from 
the guest RAM and byteswaps the structure before using it. I think
that's fine. Isn't it ? 

It becomes a problem with the sPAPR model which is using the XIVE structures 
in native endianness and not BE anymore. But the guest OS never manipulates
these structures, so under the hood, I think we are free to use them in 
native and keep the common definitions.

Except that the event data entries in the OS EQs are BE. So the only place 
where we convert is when an event data is enqueued. 
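
To illustrate that single conversion point, here is a minimal sketch
of an enqueue that writes the entry in BE whatever the endianness of
the QEMU host. The helper names are made up, and the entry layout
(generation bit in the top bit, event data in the low 31 bits) is an
assumption of the example rather than a quote of the model.

```c
#include <stddef.h>
#include <stdint.h>

/* Store a 32-bit value in big-endian order on any host */
static void store_be32(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}

/*
 * Write one EQ entry. 'eq' points to the queue page, 'qindex' is the
 * entry number, 'qgen' the generation bit, 'data' the event data.
 */
static void eq_push(uint8_t *eq, uint32_t qindex, uint32_t qgen,
                    uint32_t data)
{
    uint32_t entry = ((qgen & 1u) << 31) | (data & 0x7fffffffu);

    store_be32(eq + (size_t)qindex * 4, entry);
}
```

Everything else, the END and NVT words included, would stay in host
order inside QEMU in that scheme.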

What would you put in place if you think this is too strong a violation
of the architecture? I am afraid of something too complex to manipulate,
to be honest. Maybe we can drop the map/unmap access methods and only keep
the very basic ones.

C.

* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-11-30  6:41         ` Cédric Le Goater
@ 2018-12-03  1:14           ` David Gibson
  2018-12-03 16:19             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-12-03  1:14 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 30, 2018 at 07:41:33AM +0100, Cédric Le Goater wrote:
> On 11/30/18 2:04 AM, David Gibson wrote:
> > On Thu, Nov 29, 2018 at 11:06:13PM +0100, Cédric Le Goater wrote:
> >> On 11/22/18 6:13 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
> >>>> The Event Notification Descriptor also contains two Event State
> >>>> Buffers providing further coalescing of interrupts, one for the
> >>>> notification event (ESn) and one for the escalation events (ESe). A
> >>>> MMIO page is assigned for each to control the EOI through loads
> >>>> only. Stores are not allowed.
> >>>>
> >>>> The END ESBs are modeled through an object resembling the 'XiveSource'
> >>>> It is stateless as the END state bits are backed into the XiveEND
> >>>> structure under the XiveRouter and the MMIO accesses follow the same
> >>>> rules as for the standard source ESBs.
> >>>>
> >>>> END ESBs are not supported by the Linux drivers, neither on OPAL nor on
> >>>> sPAPR. Nevertheless, it provides a means to study the question in the
> >>>> future and validates the XIVE model a bit more.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  include/hw/ppc/xive.h |  20 ++++++
> >>>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
> >>>>  2 files changed, 178 insertions(+), 2 deletions(-)
> >>>>
> >>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> >>>> index ce62aaf28343..24301bf2076d 100644
> >>>> --- a/include/hw/ppc/xive.h
> >>>> +++ b/include/hw/ppc/xive.h
> >>>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
> >>>>                          XiveEND *end);
> >>>>  
> >>>> +/*
> >>>> + * XIVE END ESBs
> >>>> + */
> >>>> +
> >>>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
> >>>> +#define XIVE_END_SOURCE(obj) \
> >>>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
> >>>
> >>> Is there a particular reason to make this a full QOM object, rather
> >>> than just embedding it in the XiveRouter?
> >>
> >> Coming back on this question because removing the chip_id from the
> >> router is a problem for the END triggering. At least with the current
> >> design. See below for the comment.
> >>
> >>>> +typedef struct XiveENDSource {
> >>>> +    SysBusDevice parent;
> >>>> +
> >>>> +    uint32_t        nr_ends;
> >>>> +
> >>>> +    /* ESB memory region */
> >>>> +    uint32_t        esb_shift;
> >>>> +    MemoryRegion    esb_mmio;
> >>>> +
> >>>> +    XiveRouter      *xrtr;
> >>>> +} XiveENDSource;
> >>>> +
> >>>>  /*
> >>>>   * For legacy compatibility, the exceptions define up to 256 different
> >>>>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
> >>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> >>>> index 9cb001e7b540..5a8882d47a98 100644
> >>>> --- a/hw/intc/xive.c
> >>>> +++ b/hw/intc/xive.c
> >>>> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
> >>>>       * even futher coalescing in the Router
> >>>>       */
> >>>>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
> >>>> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
> >>>> -        return;
> >>>> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
> >>>> +        bool notify = xive_esb_trigger(&pq);
> >>>> +
> >>>> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
> >>>> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
> >>>> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
> >>>> +        }
> >>>> +
> >>>> +        /* ESn[Q]=1 : end of notification */
> >>>> +        if (!notify) {
> >>>> +            return;
> >>>> +        }
> >>>>      }
> >>>>  
> >>>>      /*
> >>>> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
> >>>>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
> >>>>  }
> >>>>  
> >>>> +/*
> >>>> + * END ESB MMIO loads
> >>>> + */
> >>>> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
> >>>> +{
> >>>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
> >>>> +    XiveRouter *xrtr = xsrc->xrtr;
> >>>> +    uint32_t offset = addr & 0xFFF;
> >>>> +    uint8_t end_blk;
> >>>> +    uint32_t end_idx;
> >>>> +    XiveEND end;
> >>>> +    uint32_t end_esmask;
> >>>> +    uint8_t pq;
> >>>> +    uint64_t ret = -1;
> >>>> +
> >>>> +    end_blk = xrtr->chip_id;
> >>>> +    end_idx = addr >> (xsrc->esb_shift + 1);
> >>>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
> >>
> >> The current END accessors require a block identifier, hence xrtr->chip_id, 
> >> but in this case, we don't really need it because we are using the ENDT 
> >> local to the router/chip. 
> > 
> >> I don't know how to handle this case simply without keeping chip_id :/
> > 
> > I don't really follow how chip_id is relevant here.  AFAICT the END
> > accessors take a block id and the back end is responsible for
> > interpreting them.  The powernv one will map it to chip id, but the
> > PAPR one can just ignore it or only use block 0.
> 
> Yes. But the block value comes from the xrtr->chip_id today, on PAPR and
> PowerNV, even if it's block 0. 
> 
> What I could do is add a "chip-id" property to XiveENDSource possibly.

This still seems wrong for the PAPR model.  Why can't you configure
the end_block value directly in the Xive components, then just set it
equal to the chip_id when you build the powernv machine?
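
To make the suggestion concrete, here is a rough sketch of what I have
in mind. It is not code from the series: DemoENDSource, its block_id
field and the two init helpers are hypothetical names, and only the
index computation mirrors the patch quoted above. The point is that
the component owns the block number and the machine decides what goes
in it.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for XiveENDSource, carrying its own END block. */
typedef struct DemoENDSource {
    uint8_t  block_id;   /* END block served by this source */
    uint32_t esb_shift;  /* log2 of the ESB page size */
} DemoENDSource;

/* PowerNV machine build: the block is simply the chip id. */
static void demo_end_source_init_powernv(DemoENDSource *s, uint32_t chip_id,
                                         uint32_t esb_shift)
{
    assert(chip_id <= UINT8_MAX);
    s->block_id = (uint8_t)chip_id;
    s->esb_shift = esb_shift;
}

/* PAPR machine build: a single block, no chip id involved at all. */
static void demo_end_source_init_papr(DemoENDSource *s, uint32_t esb_shift)
{
    s->block_id = 0;
    s->esb_shift = esb_shift;
}

/* MMIO decode: the block comes from the component, never from chip_id. */
static void demo_end_source_decode(const DemoENDSource *s, uint64_t addr,
                                   uint8_t *end_blk, uint32_t *end_idx)
{
    *end_blk = s->block_id;
    /* Two ESB pages per END, as in the patch (esb_shift + 1). */
    *end_idx = (uint32_t)(addr >> (s->esb_shift + 1));
}
```

With something along these lines the read handler never needs to look
at xrtr->chip_id, and the PAPR model does not have to pretend it has
one.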

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-11-30  6:56             ` Cédric Le Goater
@ 2018-12-03  1:18               ` David Gibson
  2018-12-03 16:30                 ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-12-03  1:18 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 30, 2018 at 07:56:02AM +0100, Cédric Le Goater wrote:
> On 11/30/18 2:11 AM, David Gibson wrote:
> > On Thu, Nov 29, 2018 at 04:27:31PM +0100, Cédric Le Goater wrote:
> >> [ ... ] 
> >>
> >>>>>> +/*
> >>>>>> + * The allocation of VP blocks is a complex operation in OPAL and the
> >>>>>> + * VP identifiers have a relation with the number of HW chips, the
> >>>>>> + * size of the VP blocks, VP grouping, etc. The QEMU sPAPR XIVE
> >>>>>> + * controller model does not have the same constraints and can use a
> >>>>>> + * simple mapping scheme of the CPU vcpu_id
> >>>>>> + *
> >>>>>> + * These identifiers are never returned to the OS.
> >>>>>> + */
> >>>>>> +
> >>>>>> +#define SPAPR_XIVE_VP_BASE 0x400
> >>>>>
> >>>>> 0x400 == 1024.  Could we ever have the possibility of needing to
> >>>>> consider both physical NVTs and PAPR NVTs at the same time?  
> >>>>
> >>>> They would not be in the same CAM line: OS ring vs. PHYS ring. 
> >>>
> >>> Hm.  They still inhabit the same NVT number space though, don't they?
> >>
> >> No. skiboot reserves the range of VPs for the HW at init.
> >>
> >> https://github.com/open-power/skiboot/blob/master/hw/xive.c#L1093
> > 
> > Uh.. I don't see how the way they're reserved is relevant.
> > 
> > What I mean is that the ENDs address the NVTs for HW endpoints by the
> > same (block, index) tuples as the NVTs for virtualized endpoints, yes?
> 
> Ah. Yes. The (block, index) tuples, fields END_W6_NVT_BLOCK and 
> END_W6_NVT_INDEX in the END structure, are all in the same number space.

Right.

> skiboot defines some ranges though.

Ok.  I guess we can rely on that for PAPR, but not for PowerNV.

> >>> I'm thinking about the END->NVT stage of the process here, rather than
> >>> the NVT->TCTX stage.
> >>>
> >>> Oh, also, you're using "VP" here which IIUC == "NVT".  Can we
> >>> standardize on one, please.
> >>
> >> VP is used in Linux/KVM Linux/Native and skiboot. Yes. it's a mess. 
> >> Let's have consistent naming in QEMU and use NVT. 
> > 
> > Right.  And to cover any inevitable missed ones is why I'd like to see
> > a cheatsheet giving both terms in the header comments somewhere.
> 
> yes. I have added a list of names in xive.h. 

Great.  Oh BTW - this is getting big enough, that I wonder if it makes
sense to create a hw/intc/xive subdir to put things in, then splitting
IVSE, IVRE, IVPE related code into separate .c files (I'd still expect
a common .h though).

> I was wondering if I should put the diagram below somewhere in a .h file 
> or under doc/specs/.

I'd prefer it in the .h file.

> 
> Thanks,
> 
> C.  
> 
> 
> = XIVE =================================================================
> 
> The POWER9 processor comes with a new interrupt controller, called
> XIVE as "eXternal Interrupt Virtualization Engine".
> 
> 
> * Overall architecture
> 
> 
>              XIVE Interrupt Controller
>              +------------------------------------+      IPIs
>              | +---------+ +---------+ +--------+ |    +-------+
>              | |VC       | |CQ       | |PC      |----> | CORES |
>              | |     esb | |         | |        |----> |       |
>              | |     eas | |  Bridge | |   tctx |----> |       |
>              | |SC   end | |         | |    nvt | |    |       |
>  +------+    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
>  | RAM  |    +------------------|-----------------+      | | |
>  |      |                       |                        | | |
>  |      |                       |                        | | |
>  |      |  +--------------------v------------------------v-v-v--+    other
>  |      <--+                     Power Bus                      +--> chips
>  |  esb |  +---------+-----------------------+------------------+
>  |  eas |            |                       |
>  |  end |        +---|-----+                 |
>  |  nvt |       +----+----+|            +----+----+
>  +------+       |SC       ||            |SC       |
>                 |         ||            |         |
>                 | PQ-bits ||            | PQ-bits |
>                 | local   |+            |  in VC  |
>                 +---------+             +---------+
>                    PCIe                 NX,NPU,CAPI
> 
>                   SC: Source Controller (aka. IVSE)
>                   VC: Virtualization Controller (aka. IVRE)
>                   PC: Presentation Controller (aka. IVPE)
>                   CQ: Common Queue (Bridge)
> 
>              PQ-bits: 2 bits source state machine (P:pending Q:queued)
>                  esb: Event State Buffer (Array of PQ bits in an IVSE)
>                  eas: Event Assignment Structure
>                  end: Event Notification Descriptor
>                  nvt: Notification Virtual Target
>                 tctx: Thread interrupt Context
> 
> 
> The XIVE IC is composed of three sub-engines :
> 
>   - Interrupt Virtualization Source Engine (IVSE), or Source
>     Controller (SC). These are found in PCI PHBs, in the PSI host
>     bridge controller, but also inside the main controller for the
>     core IPIs and other sub-chips (NX, CAP, NPU) of the
>     chip/processor. They are configured to feed the IVRE with events.
> 
>   - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
>     Controller (VC). Its job is to match an event source with an Event
>     Notification Descriptor (END).
> 
>   - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
>     Controller (PC). It maintains the interrupt context state of each
>     thread and handles the delivery of the external exception to the
>     thread.
> 
> 
> * XIVE internal tables
> 
> Each of the sub-engines uses a set of tables to redirect exceptions
> from event sources to CPU threads.
> 
>                                           +-------+
>   User or OS                              |  EQ   |
>       or                          +------>|entries|
>   Hypervisor                      |       |  ..   |
>     Memory                        |       +-------+
>                                   |           ^
>                                   |           |
>              +-------------------------------------------------+
>                                   |           |
>   Hypervisor      +------+    +---+--+    +---+--+   +------+
>     Memory        | ESB  |    | EAT  |    | ENDT |   | NVTT |
>    (skiboot)      +----+-+    +----+-+    +----+-+   +------+
>                     ^  |        ^  |        ^  |       ^
>                     |  |        |  |        |  |       |
>              +-------------------------------------------------+
>                     |  |        |  |        |  |       |
>                     |  |        |  |        |  |       |
>                +----|--|--------|--|--------|--|-+   +-|-----+    +------+
>                |    |  |        |  |        |  | |   | | tctx|    |Thread|
>    IPI or   ---+    +  v        +  v        +  v |---| +  .. |----->     |
>   HW events    |                                 |   |       |    |      |
>                |             IVRE                |   | IVPE  |    +------+
>                +---------------------------------+   +-------+
>             
> 
> 
> Each IVSE has a 2-bit state machine per source, P for pending and Q
> for queued, that allows events to be triggered. The bits are stored in
> an array, the Event State Buffer (ESB), and controlled by MMIOs.
> 
> If the event is let through, the IVRE looks up the Event Assignment
> Structure (EAS) table for an Event Notification Descriptor (END)
> configured for the source. Each Event Notification Descriptor defines
> a notification path to a CPU and an in-memory Event Queue, into which
> an EQ data entry is pushed for the OS to pull.
> 
> The IVPE determines if a Notification Virtual Target (NVT) can handle
> the event by scanning the thread contexts of the VPs dispatched on the
> processor HW threads. It maintains the interrupt context state of each
> thread in a NVT table.
> 
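
For the header comment, the trigger side of the 2-bit PQ state machine
described above could be illustrated roughly as follows. This is only a
sketch: the DEMO_ESB_* names and their encodings (P in bit 1, Q in
bit 0, with PQ=01 meaning "off") are assumptions made for the example,
and the function just follows the shape of the xive_esb_trigger() call
used in the END ESB patch, returning whether the event should be
forwarded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed encodings for this example: bit 1 is P, bit 0 is Q. */
#define DEMO_ESB_RESET   0x0  /* P=0 Q=0: idle, next trigger is forwarded */
#define DEMO_ESB_OFF     0x1  /* P=0 Q=1: source disabled */
#define DEMO_ESB_PENDING 0x2  /* P=1 Q=0: one event forwarded, not yet EOId */
#define DEMO_ESB_QUEUED  0x3  /* P=1 Q=1: further events coalesced */

/* Apply one trigger to the PQ bits; returns true if the event goes on. */
static bool demo_esb_trigger(uint8_t *pq)
{
    assert(pq != NULL);

    switch (*pq & 0x3) {
    case DEMO_ESB_RESET:
        *pq = DEMO_ESB_PENDING;
        return true;
    case DEMO_ESB_PENDING:
    case DEMO_ESB_QUEUED:
        /* Already pending: remember that more events arrived. */
        *pq = DEMO_ESB_QUEUED;
        return false;
    default:
        /* DEMO_ESB_OFF: triggers are dropped. */
        *pq = DEMO_ESB_OFF;
        return false;
    }
}
```

Only the first trigger from the idle state is let through; every later
one is coalesced into the Q bit until the source is EOId.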

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-11-30  8:07             ` Cédric Le Goater
@ 2018-12-03  1:36               ` David Gibson
  2018-12-03 16:49                 ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-12-03  1:36 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 30, 2018 at 09:07:19AM +0100, Cédric Le Goater wrote:
> On 11/30/18 2:23 AM, David Gibson wrote:
> > On Thu, Nov 29, 2018 at 05:04:50PM +0100, Cédric Le Goater wrote:
> >> On 11/29/18 2:23 AM, David Gibson wrote:
> >>> On Wed, Nov 28, 2018 at 11:21:37PM +0100, Cédric Le Goater wrote:
> >>>> On 11/28/18 5:25 AM, David Gibson wrote:
> >>>>> On Fri, Nov 16, 2018 at 11:57:09AM +0100, Cédric Le Goater wrote:
> >>>>>> The different XIVE virtualization structures (sources and event queues)
> >>>>>> are configured with a set of Hypervisor calls :
> >>>>>>
> >>>>>>  - H_INT_GET_SOURCE_INFO
> >>>>>>
> >>>>>>    used to obtain the address of the MMIO page of the Event State
> >>>>>>    Buffer (ESB) entry associated with the source.
> >>>>>>
> >>>>>>  - H_INT_SET_SOURCE_CONFIG
> >>>>>>
> >>>>>>    assigns a source to a "target".
> >>>>>>
> >>>>>>  - H_INT_GET_SOURCE_CONFIG
> >>>>>>
> >>>>>>    determines which "target" and "priority" is assigned to a source
> >>>>>>
> >>>>>>  - H_INT_GET_QUEUE_INFO
> >>>>>>
> >>>>>>    returns the address of the notification management page associated
> >>>>>>    with the specified "target" and "priority".
> >>>>>>
> >>>>>>  - H_INT_SET_QUEUE_CONFIG
> >>>>>>
> >>>>>>    sets or resets the event queue for a given "target" and "priority".
> >>>>>>    It is also used to set the notification configuration associated
> >>>>>>    with the queue, only unconditional notification is supported for
> >>>>>>    the moment. Reset is performed with a queue size of 0 and queueing
> >>>>>>    is disabled in that case.
> >>>>>>
> >>>>>>  - H_INT_GET_QUEUE_CONFIG
> >>>>>>
> >>>>>>    returns the queue settings for a given "target" and "priority".
> >>>>>>
> >>>>>>  - H_INT_RESET
> >>>>>>
> >>>>>>    resets all of the guest's internal interrupt structures to their
> >>>>>>    initial state, losing all configuration set via the hcalls
> >>>>>>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> >>>>>>
> >>>>>>  - H_INT_SYNC
> >>>>>>
> >>>>>>    issue a synchronisation on a source to make sure all notifications
> >>>>>>    have reached their queue.
> >>>>>>
> >>>>>> Calls that still need to be addressed :
> >>>>>>
> >>>>>>    H_INT_SET_OS_REPORTING_LINE
> >>>>>>    H_INT_GET_OS_REPORTING_LINE
> >>>>>>
> >>>>>> See the code for more documentation on each hcall.
> >>>>>>
> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>> ---
> >>>>>>  include/hw/ppc/spapr.h      |  15 +-
> >>>>>>  include/hw/ppc/spapr_xive.h |   6 +
> >>>>>>  hw/intc/spapr_xive_hcall.c  | 892 ++++++++++++++++++++++++++++++++++++
> >>>>>>  hw/ppc/spapr_irq.c          |   2 +
> >>>>>>  hw/intc/Makefile.objs       |   2 +-
> >>>>>>  5 files changed, 915 insertions(+), 2 deletions(-)
> >>>>>>  create mode 100644 hw/intc/spapr_xive_hcall.c
> >>>>>>
> >>>>>> diff --git a/include/hw/ppc/spapr.h b/include/hw/ppc/spapr.h
> >>>>>> index 1fbc2663e06c..8415faea7b82 100644
> >>>>>> --- a/include/hw/ppc/spapr.h
> >>>>>> +++ b/include/hw/ppc/spapr.h
> >>>>>> @@ -452,7 +452,20 @@ struct sPAPRMachineState {
> >>>>>>  #define H_INVALIDATE_PID        0x378
> >>>>>>  #define H_REGISTER_PROC_TBL     0x37C
> >>>>>>  #define H_SIGNAL_SYS_RESET      0x380
> >>>>>> -#define MAX_HCALL_OPCODE        H_SIGNAL_SYS_RESET
> >>>>>> +
> >>>>>> +#define H_INT_GET_SOURCE_INFO   0x3A8
> >>>>>> +#define H_INT_SET_SOURCE_CONFIG 0x3AC
> >>>>>> +#define H_INT_GET_SOURCE_CONFIG 0x3B0
> >>>>>> +#define H_INT_GET_QUEUE_INFO    0x3B4
> >>>>>> +#define H_INT_SET_QUEUE_CONFIG  0x3B8
> >>>>>> +#define H_INT_GET_QUEUE_CONFIG  0x3BC
> >>>>>> +#define H_INT_SET_OS_REPORTING_LINE 0x3C0
> >>>>>> +#define H_INT_GET_OS_REPORTING_LINE 0x3C4
> >>>>>> +#define H_INT_ESB               0x3C8
> >>>>>> +#define H_INT_SYNC              0x3CC
> >>>>>> +#define H_INT_RESET             0x3D0
> >>>>>> +
> >>>>>> +#define MAX_HCALL_OPCODE        H_INT_RESET
> >>>>>>  
> >>>>>>  /* The hcalls above are standardized in PAPR and implemented by pHyp
> >>>>>>   * as well.
> >>>>>> diff --git a/include/hw/ppc/spapr_xive.h b/include/hw/ppc/spapr_xive.h
> >>>>>> index 3f65b8f485fd..418511f3dc10 100644
> >>>>>> --- a/include/hw/ppc/spapr_xive.h
> >>>>>> +++ b/include/hw/ppc/spapr_xive.h
> >>>>>> @@ -60,4 +60,10 @@ int spapr_xive_target_to_end(sPAPRXive *xive, uint32_t target, uint8_t prio,
> >>>>>>  int spapr_xive_cpu_to_end(sPAPRXive *xive, PowerPCCPU *cpu, uint8_t prio,
> >>>>>>                            uint8_t *out_end_blk, uint32_t *out_end_idx);
> >>>>>>  
> >>>>>> +bool spapr_xive_priority_is_valid(uint8_t priority);
> >>>>>
> >>>>> AFAICT this could be a local function.
> >>>>
> >>>> the KVM model uses it also, when collecting state from the KVM device 
> >>>> to build the QEMU ENDT.
> >>>>
> >>>>>> +
> >>>>>> +typedef struct sPAPRMachineState sPAPRMachineState;
> >>>>>> +
> >>>>>> +void spapr_xive_hcall_init(sPAPRMachineState *spapr);
> >>>>>> +
> >>>>>>  #endif /* PPC_SPAPR_XIVE_H */
> >>>>>> diff --git a/hw/intc/spapr_xive_hcall.c b/hw/intc/spapr_xive_hcall.c
> >>>>>> new file mode 100644
> >>>>>> index 000000000000..52e4e23995f5
> >>>>>> --- /dev/null
> >>>>>> +++ b/hw/intc/spapr_xive_hcall.c
> >>>>>> @@ -0,0 +1,892 @@
> >>>>>> +/*
> >>>>>> + * QEMU PowerPC sPAPR XIVE interrupt controller model
> >>>>>> + *
> >>>>>> + * Copyright (c) 2017-2018, IBM Corporation.
> >>>>>> + *
> >>>>>> + * This code is licensed under the GPL version 2 or later. See the
> >>>>>> + * COPYING file in the top-level directory.
> >>>>>> + */
> >>>>>> +
> >>>>>> +#include "qemu/osdep.h"
> >>>>>> +#include "qemu/log.h"
> >>>>>> +#include "qapi/error.h"
> >>>>>> +#include "cpu.h"
> >>>>>> +#include "hw/ppc/fdt.h"
> >>>>>> +#include "hw/ppc/spapr.h"
> >>>>>> +#include "hw/ppc/spapr_xive.h"
> >>>>>> +#include "hw/ppc/xive_regs.h"
> >>>>>> +#include "monitor/monitor.h"
> >>>>>
> >>>>> Fwiw, I don't think it's particularly necessary to split the hcall
> >>>>> handling out into a separate .c file.
> >>>>
> >>>> ok. let's move it to spapr_xive then ? It might help in reducing the 
> >>>> exported functions. 
> >>>
> >>> Yes, I think so.
> >>>
> >>>>>> +/*
> >>>>>> + * OPAL uses the priority 7 EQ to automatically escalate interrupts
> >>>>>> + * for all other queues (DD2.X POWER9). So only priorities [0..6] are
> >>>>>> + * available for the guest.
> >>>>>
> >>>>> Referencing OPAL behaviour doesn't really make sense in the context of
> >>>>> PAPR.  
> >>>>
> >>>> It's an OPAL constraint which pHyp doesn't have. So it's a QEMU/KVM 
> >>>> constraint also.
> >>>
> >>> Right, I realized that a few patches on.  Maybe rephrase this to
> >>>
> >>>    Linux hosts under OPAL reserve priority 7 for their own escalation
> >>>    interrupts.  So we only allow the guest to use priorities [0..6].
> >>
> >> OK.
> >>
> >>> The point here is that we're emphasizing that this is a design
> >>> decision to make the host implementation easier, rather than a
> >>> fundamental constraint.
> >>>
> >>>>> What I think you're getting at is that the PAPR spec only
> >>>>> allows a PAPR guest to use priorities 0..6 (or at least it will if the
> >>>>> XIVE updated spec ever gets published).  
> >>>>
> >>>> It's not in the spec. the XIVE sPAPR spec should be frozen soon btw. 
> >>>>  
> >>>>> The fact that this allows the
> >>>>> host use 7 for escalations is a design rationale 
> >>>>> but not really relevant to the guest device itself. 
> >>>>
> >>>> The guest should be aware of which priorities are reserved for
> >>>> the hypervisor though.
> >>>>
> >>>>>> + */
> >>>>>> +bool spapr_xive_priority_is_valid(uint8_t priority)
> >>>>>> +{
> >>>>>> +    switch (priority) {
> >>>>>> +    case 0 ... 6:
> >>>>>> +        return true;
> >>>>>> +    case 7: /* OPAL escalation queue */
> >>>>>> +    default:
> >>>>>> +        return false;
> >>>>>> +    }
> >>>>>> +}
> >>>>>> +
> >>>>>> +/*
> >>>>>> + * The H_INT_GET_SOURCE_INFO hcall() is used to obtain the logical
> >>>>>> + * real address of the MMIO page through which the Event State Buffer
> >>>>>> + * entry associated with the value of the "lisn" parameter is managed.
> >>>>>> + *
> >>>>>> + * Parameters:
> >>>>>> + * Input
> >>>>>> + * - "flags"
> >>>>>> + *       Bits 0-63 reserved
> >>>>>> + * - "lisn" is per "interrupts", "interrupt-map", or
> >>>>>> + *       "ibm,xive-lisn-ranges" properties, or as returned by the
> >>>>>> + *       ibm,query-interrupt-source-number RTAS call, or as returned
> >>>>>> + *       by the H_ALLOCATE_VAS_WINDOW hcall
> >>>>>
> >>>>> I've not heard of H_ALLOCATE_VAS_WINDOW.  Is that something we intend
> >>>>> to implement in kvm/qemu, or is it only of interest for PowerVM?
> >>>>
> >>>> The hcall is part of the PAPR NX Interfaces and it returns interrupt
> >>>> numbers. I don't know if any work has been done on the topic.  
> >>>
> >>> What's a "PAPR NX"?
> >>
> >> A way for the PAPR guests to access the POWER coprocessors doing 
> >> compression and encryption. I really don't know much about this.
> > 
> > Ah, ok.
> > 
> > [snip]
> >>>> I think not, but the specs are not very clear on that topic. I will
> >>>> ask for clarification and use a -1 for now. We can not do loads on
> >>>> the trigger page so it can not be used by the H_INT_ESB hcall.
> >>>>
> >>>>>
> >>>>>> +    args[3] = TARGET_PAGE_SIZE;
> >>>>>
> >>>>> That seems wrong.  
> >>>>
> >>>> This is utterly wrong. it should be a power of 2 number ... I got
> >>>> it right under KVM though. I guess that ioremap() under Linux rounds 
> >>>> up the size to the page size in use, so, that's why it didn't blow
> >>>> up under TCG.
> >>>>
> >>>>> TARGET_PAGE_SIZE is generally 4kiB, but won't these usually
> >>>>> actually be 64kiB?
> >>>>
> >>>> yes. So what should I use to get a PAGE_SHIFT instead ? 
> >>>
> >>> Erm, that gets a bit tricky, since qemu in a sense doesn't know the
> >>> guest's page size.
> >>>
> >>> But.. don't you actually want the esb_shift here, not PAGE_SHIFT - it
> >>> could matter for the 2 page * 64kiB variant, yes?
> >>
> >> Yes. we just want the page_shift of the ESB page, whether it's one or
> >> two pages. The other registers inform the guest if there are one or 
> >> two ESB page in use. 
> > 
> > Ok, still sounds like you should base it on esb_shift, just adjust for
> > the two page case.
> 
> yes.
> 
> 
> >>>>>> +    }
> >>>>>> +
> >>>>>> +    switch (qsize) {
> >>>>>> +    case 12:
> >>>>>> +    case 16:
> >>>>>> +    case 21:
> >>>>>> +    case 24:
> >>>>>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
> >>>>>
> >>>>> It just occurred to me that I haven't been looking for this across any
> >>>>> of these reviews.  Don't you need byteswaps when accessing these
> >>>>> in-memory structures?
> >>>>
> >>>> yes this is done when some event data is enqueued in the EQ.
> >>>
> >>> I'm not talking about the data in the EQ itself, but the fields in the
> >>> END (and the NVT).
> >>
> >> XIVE is all BE.
> > 
> > Yes... the qemu host might not be, which is why you need byteswaps.
> 
> ok. I understand.
> 
> > I realized eventually you have the swaps in your pnv get/set
> > accessors.  
> 
> Yes. because skiboot is BE, like the XIVE structures.

skiboot's endianness isn't really relevant, because we're modelling
below that level.

> > I don't like that at all for a couple of reasons:
> > 
> > 1) Although the END structure is made up of word-sized fields because
> > that's convenient, the END really is made of a bunch of subfields of
> > different sizes.  Knowing that, it wouldn't be unreasonable for people
> > to expect they can look into the XIVE by byte offsets; 
> 
> These structures should be accessed with GETFIELD and SETFIELD macros
> using the XIVE definitions in the xive_regs.h header file. I would want 
> to keep that common with skiboot  for sure.

Right.  It might make sense to make some helper macros or inlines that
include both the GETFIELD/SETFIELD and the byteswap.

> Are you suggesting we should define each field of the XIVE structures 
> with C attributes ? That would be very unfortunate.

Oh no, bitfields are a complete mess.

> > that will break
> > if you're working with a copy that has already been byte-swapped on
> > word-sized units.
> 
> I am not sure I understand the last sentence.

I mean that GETFIELD/SETFIELD only work on values that are already
native endian, but using byte offsets would only work on values that
are still in BE.
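
A small illustration of the mismatch, as a sketch only (the helper
names are made up for the example and nothing here is taken from the
series): a subfield read by its architected byte offset is stable on
the big-endian image, but once the whole image has been swapped
word-by-word, the same offset lands on a host-dependent byte.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Load a 32-bit word from big-endian storage, independent of the host. */
static uint32_t demo_load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Read a 1-byte subfield by its byte offset in the BE image. */
static uint8_t demo_byte_field(const uint8_t *image, unsigned offset)
{
    assert(image != NULL);
    return image[offset];
}

/* "Byteswap the whole structure once": rewrite each word in host order. */
static void demo_swap_words_in_place(uint8_t *image, unsigned nwords)
{
    for (unsigned i = 0; i < nwords; i++) {
        uint32_t host = demo_load_be32(image + 4 * i);
        memcpy(image + 4 * i, &host, sizeof(host));
    }
}
```

Before the swap, demo_byte_field(image, 0) always returns the most
significant byte of the first word. After demo_swap_words_in_place()
it returns the least significant byte on a little-endian host, while
mask-and-shift access on the word is now the only thing that works.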

> the code working with a copy would necessarily know that the structure 
> has been byteswapped and use correct offsets for the expected endianness. 
> no ? why would it break ?  
> 
> > 2) At different points in the code you're storing both BE and
> > native-endian data in the same struct. 
> 
> on sPAPR, it's all native (which is a violation I agree).

Don't do that.  Having the same structure be BE in some situations and
native endian in other situations is a sure path to madness.

> TIMA is BE.
> 
> > That's both confusing to
> > someone reading the code (if they see that struct they don't know if
> > it's byteswapped already) and also means you can't use sparse
> > annotations to make sure you have it right.
> 
> XIVE structures are architected to be BE. That's immutable.

Yes, absolutely.  So don't represent them in C structs that are in
native endian.  Ever, even temporarily.

> It's not a problem for skiboot which is BE. The PnvXIVE model for the 
> QEMU PowerNV machine reads these VSTs (Virtual Structure Tables) from 
> the guest RAM and byteswaps the structure before using it. I think
> that's fine. Isn't it ?

Byteswapping structures - rather than individual fields as you use
them - is almost always a bad idea.  It's insanely easy to lose track
of whether this particular instance of the structure is swapped yet or
not, and you can't use sparse (or whatever) to check it for you.

Stick to one endianness for a struct, and do the byteswaps when you
access the fields (using helpers if that's, well, helpful).
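
Something along these lines is what I mean by helpers. Treat it as a
sketch under assumptions, not a proposal for the final API: DemoEND,
DEMO_W1_ESN and the demo_* functions are hypothetical names, the mask
value is chosen only for the example, and the open-coded swaps stand in
for whatever byteswap primitives the real code would use. The struct
stays big-endian at all times and the swap happens inside the accessor.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical END-like structure: words are ALWAYS stored big-endian. */
typedef struct DemoEND {
    uint32_t w1_be;
} DemoEND;

/* Example mask for a 2-bit field in the top bits of the word (assumed). */
#define DEMO_W1_ESN 0xC0000000u

/* Convert a big-endian stored word to host order, portably. */
static inline uint32_t demo_be32_to_cpu(uint32_t be)
{
    const uint8_t *p = (const uint8_t *)&be;
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Convert a host-order word to big-endian storage, portably. */
static inline uint32_t demo_cpu_to_be32(uint32_t v)
{
    uint32_t be;
    uint8_t *p = (uint8_t *)&be;
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
    return be;
}

/* Position of the lowest set bit of a field mask. */
static inline unsigned demo_mask_shift(uint32_t mask)
{
    unsigned shift = 0;
    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Swap, then extract: callers never see a native-endian copy. */
static inline uint32_t demo_end_get(uint32_t word_be, uint32_t mask)
{
    return (demo_be32_to_cpu(word_be) & mask) >> demo_mask_shift(mask);
}

/* Swap, update the field, swap back: the result is big-endian again. */
static inline uint32_t demo_end_set(uint32_t word_be, uint32_t mask,
                                    uint32_t val)
{
    uint32_t w = demo_be32_to_cpu(word_be);
    w = (w & ~mask) | ((val << demo_mask_shift(mask)) & mask);
    return demo_cpu_to_be32(w);
}
```

A caller would then write end.w1_be = demo_end_set(end.w1_be,
DEMO_W1_ESN, pq) and never hold a swapped copy of the structure, which
is also the shape that sparse-style endian annotations can check.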

> It becomes a problem with the sPAPR model which is using the XIVE structures 
> in native endianness and not BE anymore. But the guest OS never manipulates 
> these structures, so under the hood, I think we are free to use them in 
> native and keep the common definitions.

Free to in the sense that it can theoretically work, yes.  But there's
no upside (byteswaps are essentially free on POWER, and of trivial
cost compared to memory access basically everywhere).  The downside is
that having the same variables / structures have data in different
endianness in different situations makes it exceedingly easy to forget
which one you're dealing with right now and therefore forget some
swaps or put in extra ones.

> Except that the event data entries in the OS EQs are BE. So the only place 
> where we convert is when an event data is enqueued. 
> 
> What would you put in place if you think this is too strong a violation
> of the architecture? I am afraid of something too complex to manipulate,
> to be honest. Maybe we can drop the map/unmap access methods and keep
> only the very basic ones.

The complexity of having extra swaps is almost always less than the
complexity of having those swaps not be in a consistent place.
Especially if you use helpers (including the swaps) to access your
structure.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support
  2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support Cédric Le Goater
@ 2018-12-03  2:26   ` David Gibson
  2018-12-06 15:14     ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-12-03  2:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On Fri, Nov 16, 2018 at 11:57:29AM +0100, Cédric Le Goater wrote:
> This is a simple model of the POWER9 XIVE interrupt controller for the
> PowerNV machine. XIVE for baremetal is a complex controller and the
> model only addresses the needs of the skiboot firmware.
> 
> * Overall architecture
> 
>               XIVE Interrupt Controller
>               +-------------------------------------+       IPIs
>               | +---------+ +---------+ +---------+ |    +--------+
>               | |VC       | |CQ       | |PC       |----> | CORES  |
>               | |     esb | |         | |         |----> |        |
>               | |     eas | |  Bridge | |         |----> |        |
>               | |SC   end | |         | |     nvt | |    |        |
> +------+      | +---------+ +----+----+ +---------+ |    +--+-+-+-+
> | RAM  |      +------------------|------------------+       | | |
> |      |                         |                          | | |
> |      |                         |                          | | |
> |      |   +---------------------v--------------------------v-v-v---+      other
> |      <---+                       Power Bus                        +----> chips
> |  esb |   +-----------+-----------------------+--------------------+
> |  eas |               |                       |
> |  end |               |                       |
> |  nvt |           +---+----+              +---+----+
> +------+           |SC      |              |SC      |
>                    |        |              |        |
>                    | 2-bits |              | 2-bits |
>                    | local  |              |   VC   |
>                    +--------+              +--------+
>                      PCIe                  NX,NPU,CAPI
> 
>                   SC: Source Controller (aka. IVSE)
>                   VC: Virtualization Controller (aka. IVRE)
>                   CQ: Common Queue (Bridge)
>                   PC: Presentation Controller (aka. IVPE)
> 
>               2-bits: source state machine
>                  esb: Event State Buffer (Array of PQ bits in an IVSE)
>                  eas: Event Assignment Structure
>                  end: Event Notification Descriptor
>                  nvt: Notification Virtual Target
> 
> It is composed of three sub-engines :
> 
>   - Interrupt Virtualization Source Engine (IVSE), or Source
>     Controller (SC). These are found in PCI PHBs, in the PSI host
>     bridge controller, but also inside the main controller for the
>     core IPIs and other sub-chips (NX, CAP, NPU) of the
>     chip/processor. They are configured to feed the IVRE with events.
> 
>   - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
>     Controller (VC). Its job is to match an event source with an Event
>     Notification Descriptor (END).
> 
>   - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
>     Controller (PC). It maintains the interrupt context state of each
>     thread and handles the delivery of the external exception to the
>     thread.
> 
> * XIVE internal tables
> 
> Each of the sub-engines uses a set of tables to redirect exceptions
> from event sources to CPU threads.
> 
>                                              +-------+
>    User or OS                                |  EQ   |
>        or                            +------>|entries|
>    Hypervisor                        |       |  ..   |
>      Memory                          |       +-------+
>                                      |           ^
>                                      |           |
>                +--------------------------------------------------+
>                                      |           |
>    Hypervisor        +------+    +---+--+    +---+--+   +------+
>      Memory          | ESB  |    | EAT  |    | ENDT |   | NVTT |
>     (skiboot)        +----+-+    +----+-+    +----+-+   +------+
>                        ^  |        ^  |        ^  |       ^
>                        |  |        |  |        |  |       |
>                +--------------------------------------------------+
>                        |  |        |  |        |  |       |
>                        |  |        |  |        |  |       |
>                  +-----|--|--------|--|--------|--|-+   +-|-----+    +------+
>                  |     |  |        |  |        |  | |   | | tctx|    |Thread|
>     IPI or   ----+     +  v        +  v        +  v |---| +  .. |----->     |
>    HW events     |                                  |   |       |    |      |
>                  |              IVRE                |   | IVPE  |    +------+
>                  +----------------------------------+   +-------+
> 
> Each IVSE has a 2-bit state machine per source, P for pending and Q
> for queued, that allows events to be triggered. The bits are stored in
> an array, the Event State Buffer (ESB), and controlled by MMIOs.
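
For reference, a minimal sketch of how such a P/Q state machine could
behave. The encoding (P as the high bit, Q as the low bit) and the
helper names esb_trigger() and esb_eoi() are assumptions made for this
illustration, not code taken from the patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encodings: P is bit 1, Q is bit 0. */
enum {
    ESB_RESET   = 0x0, /* P=0 Q=0: ready to take an event */
    ESB_OFF     = 0x1, /* P=0 Q=1: source masked */
    ESB_PENDING = 0x2, /* P=1 Q=0: event forwarded, not yet EOI'd */
    ESB_QUEUED  = 0x3, /* P=1 Q=1: another event arrived meanwhile */
};

/* Hypothetical helper: returns true when the event is let through. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;
        return false;
    default: /* ESB_OFF: masked sources stay masked */
        return false;
    }
}

/* Hypothetical helper: returns true when a queued event must be replayed. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    default: /* ESB_OFF */
        return false;
    }
}
```

Only the RESET to PENDING transition lets an event through to the
IVRE; a second trigger is coalesced into the Q bit until the EOI.
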
> 
> If the event is let through, the IVRE looks up the Event Assignment
> Structure (EAS) table for an Event Notification Descriptor (END)
> configured for the source. Each Event Notification Descriptor defines
> a notification path to a CPU and an in-memory Event Queue, into which
> an EQ data entry is pushed for the OS to pull.
> 
> The IVPE determines if a Notification Virtual Target (NVT) can handle
> the event by scanning the thread contexts of the VPs dispatched on the
> processor HW threads. It maintains the interrupt context state of each
> thread in an NVT table.
> 
> * QEMU model for PowerNV
> 
> The PowerNV model reuses the common XIVE framework developed for sPAPR
> and the fundamental aspects are much the same. The differences are
> outlined below.
> 
> The controller's initial BAR configuration is performed using the XSCOM
> bus. From there, MMIOs are used for further configuration.
> 
> The MMIO regions exposed are :
> 
>  - Interrupt controller registers
>  - ESB pages for IPIs and ENDs
>  - Presenter MMIO (Not used)
>  - Thread Interrupt Management Area MMIO, direct and indirect
> 
> The Virtualization Controller MMIO region containing the IPI ESB pages and
> END ESB pages is sub-divided into "sets" which map portions of the VC
> region to the different ESB pages. It is configured at runtime through
> the EDT set translation table to let the firmware decide how to split
> the address space between IPI ESB pages and END ESB pages.
> 
> The XIVE tables are now in the machine RAM and not in the hypervisor
> anymore. The firmware (skiboot) configures these tables using Virtual
> Structure Descriptors defining the characteristics of each table : SBE,
> EAS, END and NVT. These are later used to access the virtual interrupt
> entries. The internal cache of these tables in the interrupt controller
> is updated and invalidated using a set of registers.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  hw/intc/pnv_xive_regs.h    |  314 +++++++
>  include/hw/ppc/pnv.h       |   22 +-
>  include/hw/ppc/pnv_xive.h  |  100 +++
>  include/hw/ppc/pnv_xscom.h |    3 +
>  include/hw/ppc/xive.h      |    1 +
>  hw/intc/pnv_xive.c         | 1612 ++++++++++++++++++++++++++++++++++++
>  hw/intc/xive.c             |   63 +-
>  hw/ppc/pnv.c               |   58 +-
>  hw/intc/Makefile.objs      |    2 +-
>  9 files changed, 2164 insertions(+), 11 deletions(-)
>  create mode 100644 hw/intc/pnv_xive_regs.h
>  create mode 100644 include/hw/ppc/pnv_xive.h
>  create mode 100644 hw/intc/pnv_xive.c
> 
> diff --git a/hw/intc/pnv_xive_regs.h b/hw/intc/pnv_xive_regs.h
> new file mode 100644
> index 000000000000..509d5a18cdde
> --- /dev/null
> +++ b/hw/intc/pnv_xive_regs.h
> @@ -0,0 +1,314 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_PNV_XIVE_REGS_H
> +#define PPC_PNV_XIVE_REGS_H
> +
> +/* IC register offsets 0x0 - 0x400 */
> +#define CQ_SWI_CMD_HIST         0x020
> +#define CQ_SWI_CMD_POLL         0x028
> +#define CQ_SWI_CMD_BCAST        0x030
> +#define CQ_SWI_CMD_ASSIGN       0x038
> +#define CQ_SWI_CMD_BLK_UPD      0x040
> +#define CQ_SWI_RSP              0x048
> +#define X_CQ_CFG_PB_GEN         0x0a
> +#define CQ_CFG_PB_GEN           0x050
> +#define   CQ_INT_ADDR_OPT       PPC_BITMASK(14, 15)
> +#define X_CQ_IC_BAR             0x10
> +#define X_CQ_MSGSND             0x0b
> +#define CQ_MSGSND               0x058
> +#define CQ_CNPM_SEL             0x078
> +#define CQ_IC_BAR               0x080
> +#define   CQ_IC_BAR_VALID       PPC_BIT(0)
> +#define   CQ_IC_BAR_64K         PPC_BIT(1)
> +#define X_CQ_TM1_BAR            0x12
> +#define CQ_TM1_BAR              0x90
> +#define X_CQ_TM2_BAR            0x014
> +#define CQ_TM2_BAR              0x0a0
> +#define   CQ_TM_BAR_VALID       PPC_BIT(0)
> +#define   CQ_TM_BAR_64K         PPC_BIT(1)
> +#define X_CQ_PC_BAR             0x16
> +#define CQ_PC_BAR               0x0b0
> +#define  CQ_PC_BAR_VALID        PPC_BIT(0)
> +#define X_CQ_PC_BARM            0x17
> +#define CQ_PC_BARM              0x0b8
> +#define  CQ_PC_BARM_MASK        PPC_BITMASK(26, 38)
> +#define X_CQ_VC_BAR             0x18
> +#define CQ_VC_BAR               0x0c0
> +#define  CQ_VC_BAR_VALID        PPC_BIT(0)
> +#define X_CQ_VC_BARM            0x19
> +#define CQ_VC_BARM              0x0c8
> +#define  CQ_VC_BARM_MASK        PPC_BITMASK(21, 37)
> +#define X_CQ_TAR                0x1e
> +#define CQ_TAR                  0x0f0
> +#define  CQ_TAR_TBL_AUTOINC     PPC_BIT(0)
> +#define  CQ_TAR_TSEL            PPC_BITMASK(12, 15)
> +#define  CQ_TAR_TSEL_BLK        PPC_BIT(12)
> +#define  CQ_TAR_TSEL_MIG        PPC_BIT(13)
> +#define  CQ_TAR_TSEL_VDT        PPC_BIT(14)
> +#define  CQ_TAR_TSEL_EDT        PPC_BIT(15)
> +#define  CQ_TAR_TSEL_INDEX      PPC_BITMASK(26, 31)
> +#define X_CQ_TDR                0x1f
> +#define CQ_TDR                  0x0f8
> +#define  CQ_TDR_VDT_VALID       PPC_BIT(0)
> +#define  CQ_TDR_VDT_BLK         PPC_BITMASK(11, 15)
> +#define  CQ_TDR_VDT_INDEX       PPC_BITMASK(28, 31)
> +#define  CQ_TDR_EDT_TYPE        PPC_BITMASK(0, 1)
> +#define  CQ_TDR_EDT_INVALID     0
> +#define  CQ_TDR_EDT_IPI         1
> +#define  CQ_TDR_EDT_EQ          2
> +#define  CQ_TDR_EDT_BLK         PPC_BITMASK(12, 15)
> +#define  CQ_TDR_EDT_INDEX       PPC_BITMASK(26, 31)
> +#define X_CQ_PBI_CTL            0x20
> +#define CQ_PBI_CTL              0x100
> +#define  CQ_PBI_PC_64K          PPC_BIT(5)
> +#define  CQ_PBI_VC_64K          PPC_BIT(6)
> +#define  CQ_PBI_LNX_TRIG        PPC_BIT(7)
> +#define  CQ_PBI_FORCE_TM_LOCAL  PPC_BIT(22)
> +#define CQ_PBO_CTL              0x108
> +#define CQ_AIB_CTL              0x110
> +#define X_CQ_RST_CTL            0x23
> +#define CQ_RST_CTL              0x118
> +#define X_CQ_FIRMASK            0x33
> +#define CQ_FIRMASK              0x198
> +#define X_CQ_FIRMASK_AND        0x34
> +#define CQ_FIRMASK_AND          0x1a0
> +#define X_CQ_FIRMASK_OR         0x35
> +#define CQ_FIRMASK_OR           0x1a8
> +
> +/* PC LBS1 register offsets 0x400 - 0x800 */
> +#define X_PC_TCTXT_CFG          0x100
> +#define PC_TCTXT_CFG            0x400
> +#define  PC_TCTXT_CFG_BLKGRP_EN         PPC_BIT(0)
> +#define  PC_TCTXT_CFG_TARGET_EN         PPC_BIT(1)
> +#define  PC_TCTXT_CFG_LGS_EN            PPC_BIT(2)
> +#define  PC_TCTXT_CFG_STORE_ACK         PPC_BIT(3)
> +#define  PC_TCTXT_CFG_HARD_CHIPID_BLK   PPC_BIT(8)
> +#define  PC_TCTXT_CHIPID_OVERRIDE       PPC_BIT(9)
> +#define  PC_TCTXT_CHIPID                PPC_BITMASK(12, 15)
> +#define  PC_TCTXT_INIT_AGE              PPC_BITMASK(30, 31)
> +#define X_PC_TCTXT_TRACK        0x101
> +#define PC_TCTXT_TRACK          0x408
> +#define  PC_TCTXT_TRACK_EN              PPC_BIT(0)
> +#define X_PC_TCTXT_INDIR0       0x104
> +#define PC_TCTXT_INDIR0         0x420
> +#define  PC_TCTXT_INDIR_VALID           PPC_BIT(0)
> +#define  PC_TCTXT_INDIR_THRDID          PPC_BITMASK(9, 15)
> +#define X_PC_TCTXT_INDIR1       0x105
> +#define PC_TCTXT_INDIR1         0x428
> +#define X_PC_TCTXT_INDIR2       0x106
> +#define PC_TCTXT_INDIR2         0x430
> +#define X_PC_TCTXT_INDIR3       0x107
> +#define PC_TCTXT_INDIR3         0x438
> +#define X_PC_THREAD_EN_REG0     0x108
> +#define PC_THREAD_EN_REG0       0x440
> +#define X_PC_THREAD_EN_REG0_SET 0x109
> +#define PC_THREAD_EN_REG0_SET   0x448
> +#define X_PC_THREAD_EN_REG0_CLR 0x10a
> +#define PC_THREAD_EN_REG0_CLR   0x450
> +#define X_PC_THREAD_EN_REG1     0x10c
> +#define PC_THREAD_EN_REG1       0x460
> +#define X_PC_THREAD_EN_REG1_SET 0x10d
> +#define PC_THREAD_EN_REG1_SET   0x468
> +#define X_PC_THREAD_EN_REG1_CLR 0x10e
> +#define PC_THREAD_EN_REG1_CLR   0x470
> +#define X_PC_GLOBAL_CONFIG      0x110
> +#define PC_GLOBAL_CONFIG        0x480
> +#define  PC_GCONF_INDIRECT      PPC_BIT(32)
> +#define  PC_GCONF_CHIPID_OVR    PPC_BIT(40)
> +#define  PC_GCONF_CHIPID        PPC_BITMASK(44, 47)
> +#define X_PC_VSD_TABLE_ADDR     0x111
> +#define PC_VSD_TABLE_ADDR       0x488
> +#define X_PC_VSD_TABLE_DATA     0x112
> +#define PC_VSD_TABLE_DATA       0x490
> +#define X_PC_AT_KILL            0x116
> +#define PC_AT_KILL              0x4b0
> +#define  PC_AT_KILL_VALID       PPC_BIT(0)
> +#define  PC_AT_KILL_BLOCK_ID    PPC_BITMASK(27, 31)
> +#define  PC_AT_KILL_OFFSET      PPC_BITMASK(48, 60)
> +#define X_PC_AT_KILL_MASK       0x117
> +#define PC_AT_KILL_MASK         0x4b8
> +
> +/* PC LBS2 register offsets */
> +#define X_PC_VPC_CACHE_ENABLE   0x161
> +#define PC_VPC_CACHE_ENABLE     0x708
> +#define  PC_VPC_CACHE_EN_MASK   PPC_BITMASK(0, 31)
> +#define X_PC_VPC_SCRUB_TRIG     0x162
> +#define PC_VPC_SCRUB_TRIG       0x710
> +#define X_PC_VPC_SCRUB_MASK     0x163
> +#define PC_VPC_SCRUB_MASK       0x718
> +#define  PC_SCRUB_VALID         PPC_BIT(0)
> +#define  PC_SCRUB_WANT_DISABLE  PPC_BIT(1)
> +#define  PC_SCRUB_WANT_INVAL    PPC_BIT(2)
> +#define  PC_SCRUB_BLOCK_ID      PPC_BITMASK(27, 31)
> +#define  PC_SCRUB_OFFSET        PPC_BITMASK(45, 63)
> +#define X_PC_VPC_CWATCH_SPEC    0x167
> +#define PC_VPC_CWATCH_SPEC      0x738
> +#define  PC_VPC_CWATCH_CONFLICT PPC_BIT(0)
> +#define  PC_VPC_CWATCH_FULL     PPC_BIT(8)
> +#define  PC_VPC_CWATCH_BLOCKID  PPC_BITMASK(27, 31)
> +#define  PC_VPC_CWATCH_OFFSET   PPC_BITMASK(45, 63)
> +#define X_PC_VPC_CWATCH_DAT0    0x168
> +#define PC_VPC_CWATCH_DAT0      0x740
> +#define X_PC_VPC_CWATCH_DAT1    0x169
> +#define PC_VPC_CWATCH_DAT1      0x748
> +#define X_PC_VPC_CWATCH_DAT2    0x16a
> +#define PC_VPC_CWATCH_DAT2      0x750
> +#define X_PC_VPC_CWATCH_DAT3    0x16b
> +#define PC_VPC_CWATCH_DAT3      0x758
> +#define X_PC_VPC_CWATCH_DAT4    0x16c
> +#define PC_VPC_CWATCH_DAT4      0x760
> +#define X_PC_VPC_CWATCH_DAT5    0x16d
> +#define PC_VPC_CWATCH_DAT5      0x768
> +#define X_PC_VPC_CWATCH_DAT6    0x16e
> +#define PC_VPC_CWATCH_DAT6      0x770
> +#define X_PC_VPC_CWATCH_DAT7    0x16f
> +#define PC_VPC_CWATCH_DAT7      0x778
> +
> +/* VC0 register offsets 0x800 - 0xFFF */
> +#define X_VC_GLOBAL_CONFIG      0x200
> +#define VC_GLOBAL_CONFIG        0x800
> +#define  VC_GCONF_INDIRECT      PPC_BIT(32)
> +#define X_VC_VSD_TABLE_ADDR     0x201
> +#define VC_VSD_TABLE_ADDR       0x808
> +#define X_VC_VSD_TABLE_DATA     0x202
> +#define VC_VSD_TABLE_DATA       0x810
> +#define VC_IVE_ISB_BLOCK_MODE   0x818
> +#define VC_EQD_BLOCK_MODE       0x820
> +#define VC_VPS_BLOCK_MODE       0x828
> +#define X_VC_IRQ_CONFIG_IPI     0x208
> +#define VC_IRQ_CONFIG_IPI       0x840
> +#define  VC_IRQ_CONFIG_MEMB_EN  PPC_BIT(45)
> +#define  VC_IRQ_CONFIG_MEMB_SZ  PPC_BITMASK(46, 51)
> +#define VC_IRQ_CONFIG_HW        0x848
> +#define VC_IRQ_CONFIG_CASCADE1  0x850
> +#define VC_IRQ_CONFIG_CASCADE2  0x858
> +#define VC_IRQ_CONFIG_REDIST    0x860
> +#define VC_IRQ_CONFIG_IPI_CASC  0x868
> +#define X_VC_AIB_TX_ORDER_TAG2  0x22d
> +#define  VC_AIB_TX_ORDER_TAG2_REL_TF    PPC_BIT(20)
> +#define VC_AIB_TX_ORDER_TAG2    0x890
> +#define X_VC_AT_MACRO_KILL      0x23e
> +#define VC_AT_MACRO_KILL        0x8b0
> +#define X_VC_AT_MACRO_KILL_MASK 0x23f
> +#define VC_AT_MACRO_KILL_MASK   0x8b8
> +#define  VC_KILL_VALID          PPC_BIT(0)
> +#define  VC_KILL_TYPE           PPC_BITMASK(14, 15)
> +#define   VC_KILL_IRQ   0
> +#define   VC_KILL_IVC   1
> +#define   VC_KILL_SBC   2
> +#define   VC_KILL_EQD   3
> +#define  VC_KILL_BLOCK_ID       PPC_BITMASK(27, 31)
> +#define  VC_KILL_OFFSET         PPC_BITMASK(48, 60)
> +#define X_VC_EQC_CACHE_ENABLE   0x211
> +#define VC_EQC_CACHE_ENABLE     0x908
> +#define  VC_EQC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
> +#define X_VC_EQC_SCRUB_TRIG     0x212
> +#define VC_EQC_SCRUB_TRIG       0x910
> +#define X_VC_EQC_SCRUB_MASK     0x213
> +#define VC_EQC_SCRUB_MASK       0x918
> +#define X_VC_EQC_CWATCH_SPEC    0x215
> +#define VC_EQC_CONFIG           0x920
> +#define X_VC_EQC_CONFIG         0x214
> +#define  VC_EQC_CONF_SYNC_IPI           PPC_BIT(32)
> +#define  VC_EQC_CONF_SYNC_HW            PPC_BIT(33)
> +#define  VC_EQC_CONF_SYNC_ESC1          PPC_BIT(34)
> +#define  VC_EQC_CONF_SYNC_ESC2          PPC_BIT(35)
> +#define  VC_EQC_CONF_SYNC_REDI          PPC_BIT(36)
> +#define  VC_EQC_CONF_EQP_INTERLEAVE     PPC_BIT(38)
> +#define  VC_EQC_CONF_ENABLE_END_s_BIT   PPC_BIT(39)
> +#define  VC_EQC_CONF_ENABLE_END_u_BIT   PPC_BIT(40)
> +#define  VC_EQC_CONF_ENABLE_END_c_BIT   PPC_BIT(41)
> +#define  VC_EQC_CONF_ENABLE_MORE_QSZ    PPC_BIT(42)
> +#define  VC_EQC_CONF_SKIP_ESCALATE      PPC_BIT(43)
> +#define VC_EQC_CWATCH_SPEC      0x928
> +#define  VC_EQC_CWATCH_CONFLICT PPC_BIT(0)
> +#define  VC_EQC_CWATCH_FULL     PPC_BIT(8)
> +#define  VC_EQC_CWATCH_BLOCKID  PPC_BITMASK(28, 31)
> +#define  VC_EQC_CWATCH_OFFSET   PPC_BITMASK(40, 63)
> +#define X_VC_EQC_CWATCH_DAT0    0x216
> +#define VC_EQC_CWATCH_DAT0      0x930
> +#define X_VC_EQC_CWATCH_DAT1    0x217
> +#define VC_EQC_CWATCH_DAT1      0x938
> +#define X_VC_EQC_CWATCH_DAT2    0x218
> +#define VC_EQC_CWATCH_DAT2      0x940
> +#define X_VC_EQC_CWATCH_DAT3    0x219
> +#define VC_EQC_CWATCH_DAT3      0x948
> +#define X_VC_IVC_SCRUB_TRIG     0x222
> +#define VC_IVC_SCRUB_TRIG       0x990
> +#define X_VC_IVC_SCRUB_MASK     0x223
> +#define VC_IVC_SCRUB_MASK       0x998
> +#define X_VC_SBC_SCRUB_TRIG     0x232
> +#define VC_SBC_SCRUB_TRIG       0xa10
> +#define X_VC_SBC_SCRUB_MASK     0x233
> +#define VC_SBC_SCRUB_MASK       0xa18
> +#define  VC_SCRUB_VALID         PPC_BIT(0)
> +#define  VC_SCRUB_WANT_DISABLE  PPC_BIT(1)
> +#define  VC_SCRUB_WANT_INVAL    PPC_BIT(2) /* EQC and SBC only */
> +#define  VC_SCRUB_BLOCK_ID      PPC_BITMASK(28, 31)
> +#define  VC_SCRUB_OFFSET        PPC_BITMASK(40, 63)
> +#define X_VC_IVC_CACHE_ENABLE   0x221
> +#define VC_IVC_CACHE_ENABLE     0x988
> +#define  VC_IVC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
> +#define X_VC_SBC_CACHE_ENABLE   0x231
> +#define VC_SBC_CACHE_ENABLE     0xa08
> +#define  VC_SBC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
> +#define VC_IVC_CACHE_SCRUB_TRIG 0x990
> +#define VC_IVC_CACHE_SCRUB_MASK 0x998
> +#define VC_SBC_CACHE_ENABLE     0xa08
> +#define VC_SBC_CACHE_SCRUB_TRIG 0xa10
> +#define VC_SBC_CACHE_SCRUB_MASK 0xa18
> +#define VC_SBC_CONFIG           0xa20
> +#define X_VC_SBC_CONFIG         0x234
> +#define  VC_SBC_CONF_CPLX_CIST  PPC_BIT(44)
> +#define  VC_SBC_CONF_CIST_BOTH  PPC_BIT(45)
> +#define  VC_SBC_CONF_NO_UPD_PRF PPC_BIT(59)
> +
> +/* VC1 register offsets */
> +
> +/* VSD Table address register definitions (shared) */
> +#define VST_ADDR_AUTOINC        PPC_BIT(0)
> +#define VST_TABLE_SELECT        PPC_BITMASK(13, 15)
> +#define  VST_TSEL_IVT   0
> +#define  VST_TSEL_SBE   1
> +#define  VST_TSEL_EQDT  2
> +#define  VST_TSEL_VPDT  3
> +#define  VST_TSEL_IRQ   4       /* VC only */
> +#define VST_TABLE_BLOCK        PPC_BITMASK(27, 31)
> +
> +/* Number of queue overflow pages */
> +#define VC_QUEUE_OVF_COUNT      6
> +
> +/* Bits in a VSD entry.
> + *
> + * Note: the address is naturally aligned,  we don't use a PPC_BITMASK,
> + *       but just a mask to apply to the address before OR'ing it in.
> + *
> + * Note: VSD_FIRMWARE is a SW bit ! It hijacks an unused bit in the
> + *       VSD and is only meant to be used in indirect mode !
> + */
> +#define VSD_MODE                PPC_BITMASK(0, 1)
> +#define  VSD_MODE_SHARED        1
> +#define  VSD_MODE_EXCLUSIVE     2
> +#define  VSD_MODE_FORWARD       3
> +#define VSD_ADDRESS_MASK        0x0ffffffffffff000ull
> +#define VSD_MIGRATION_REG       PPC_BITMASK(52, 55)
> +#define VSD_INDIRECT            PPC_BIT(56)
> +#define VSD_TSIZE               PPC_BITMASK(59, 63)
> +#define VSD_FIRMWARE            PPC_BIT(2) /* Read warning above */
> +
> +#define VC_EQC_SYNC_MASK         \
> +        (VC_EQC_CONF_SYNC_IPI  | \
> +         VC_EQC_CONF_SYNC_HW   | \
> +         VC_EQC_CONF_SYNC_ESC1 | \
> +         VC_EQC_CONF_SYNC_ESC2 | \
> +         VC_EQC_CONF_SYNC_REDI)
> +
> +
> +#endif /* PPC_PNV_XIVE_REGS_H */
> diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
> index 86d5f54e5459..402dd8f6452c 100644
> --- a/include/hw/ppc/pnv.h
> +++ b/include/hw/ppc/pnv.h
> @@ -25,6 +25,7 @@
>  #include "hw/ppc/pnv_lpc.h"
>  #include "hw/ppc/pnv_psi.h"
>  #include "hw/ppc/pnv_occ.h"
> +#include "hw/ppc/pnv_xive.h"
>  
>  #define TYPE_PNV_CHIP "pnv-chip"
>  #define PNV_CHIP(obj) OBJECT_CHECK(PnvChip, (obj), TYPE_PNV_CHIP)
> @@ -82,6 +83,7 @@ typedef struct Pnv9Chip {
>      PnvChip      parent_obj;
>  
>      /*< public >*/
> +    PnvXive      xive;
>  } Pnv9Chip;
>  
>  typedef struct PnvChipClass {
> @@ -205,7 +207,6 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
>  #define PNV_ICP_BASE(chip)                                              \
>      (0x0003ffff80000000ull + (uint64_t) PNV_CHIP_INDEX(chip) * PNV_ICP_SIZE)
>  
> -
>  #define PNV_PSIHB_SIZE       0x0000000000100000ull
>  #define PNV_PSIHB_BASE(chip) \
>      (0x0003fffe80000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_PSIHB_SIZE)
> @@ -215,4 +216,23 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
>      (0x0003ffe000000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * \
>       PNV_PSIHB_FSP_SIZE)
>  
> +/*
> + * POWER9 MMIO base addresses
> + */
> +#define PNV9_CHIP_BASE(chip, base)   \
> +    ((base) + ((uint64_t) (chip)->chip_id << 42))
> +
> +#define PNV9_XIVE_VC_SIZE            0x0000008000000000ull
> +#define PNV9_XIVE_VC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006010000000000ull)
> +
> +#define PNV9_XIVE_PC_SIZE            0x0000001000000000ull
> +#define PNV9_XIVE_PC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006018000000000ull)
> +
> +#define PNV9_XIVE_IC_SIZE            0x0000000000080000ull
> +#define PNV9_XIVE_IC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203100000ull)
> +
> +#define PNV9_XIVE_TM_SIZE            0x0000000000040000ull
> +#define PNV9_XIVE_TM_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203180000ull)
> +
> +
>  #endif /* _PPC_PNV_H */
> diff --git a/include/hw/ppc/pnv_xive.h b/include/hw/ppc/pnv_xive.h
> new file mode 100644
> index 000000000000..5b64d4cafe8f
> --- /dev/null
> +++ b/include/hw/ppc/pnv_xive.h
> @@ -0,0 +1,100 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#ifndef PPC_PNV_XIVE_H
> +#define PPC_PNV_XIVE_H
> +
> +#include "hw/sysbus.h"
> +#include "hw/ppc/xive.h"
> +
> +#define TYPE_PNV_XIVE "pnv-xive"
> +#define PNV_XIVE(obj) OBJECT_CHECK(PnvXive, (obj), TYPE_PNV_XIVE)
> +
> +#define XIVE_BLOCK_MAX      16
> +
> +#define XIVE_XLATE_BLK_MAX  16  /* Block Scope Table (0-15) */
> +#define XIVE_XLATE_MIG_MAX  16  /* Migration Register Table (1-15) */
> +#define XIVE_XLATE_VDT_MAX  16  /* VDT Domain Table (0-15) */
> +#define XIVE_XLATE_EDT_MAX  64  /* EDT Domain Table (0-63) */
> +
> +typedef struct PnvXive {
> +    XiveRouter    parent_obj;
> +
> +    /* Can be overridden by XIVE configuration */
> +    uint32_t      thread_chip_id;
> +    uint32_t      chip_id;

These have similar names but they're very different AFAICT - one is
static configuration, the other runtime state.  I'd generally order
structures so that configuration information is in one block, computed
at initialization then static in another, then runtime state in a
third - it's both clearer and (usually) more cache efficient.

Sometimes that's less important than other logical groupings, but I
don't think that's the case here.
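
A rough sketch of that ordering. The struct name ExampleIntc and the
choice of fields are invented for illustration, and which group each
field belongs to is an assumption (chip_id taken as static
configuration, thread_chip_id as runtime state); it is not meant as the
exact PnvXive layout.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device state, grouped by lifetime of the data. */
typedef struct ExampleIntc {
    /* 1. Static configuration, set before the device is realized */
    uint32_t chip_id;
    uint32_t nr_irqs;
    uint32_t nr_ends;

    /* 2. Computed at initialization, constant afterwards */
    uint64_t ic_base;
    uint32_t ic_shift;

    /* 3. Runtime state, modified by firmware or guest accesses */
    uint32_t thread_chip_id;
    uint64_t regs[0x300];
} ExampleIntc;
```

With this grouping the two similarly named IDs end up in different
blocks, which makes their different roles visible at a glance.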

> +
> +    /* Interrupt controller regs */
> +    uint64_t      regs[0x300];
> +    MemoryRegion  xscom_regs;
> +
> +    /* For IPIs and accelerator interrupts */
> +    uint32_t      nr_irqs;
> +    XiveSource    source;
> +
> +    uint32_t      nr_ends;
> +    XiveENDSource end_source;
> +
> +    /* Cache update registers */
> +    uint64_t      eqc_watch[4];
> +    uint64_t      vpc_watch[8];
> +
> +    /* Virtual Structure Table Descriptors : EAT, SBE, ENDT, NVTT, IRQ */
> +    uint64_t      vsds[5][XIVE_BLOCK_MAX];
> +
> +    /* Set Translation tables */
> +    bool          set_xlate_autoinc;
> +    uint64_t      set_xlate_index;
> +    uint64_t      set_xlate;
> +
> +    uint64_t      set_xlate_blk[XIVE_XLATE_BLK_MAX];
> +    uint64_t      set_xlate_mig[XIVE_XLATE_MIG_MAX];
> +    uint64_t      set_xlate_vdt[XIVE_XLATE_VDT_MAX];
> +    uint64_t      set_xlate_edt[XIVE_XLATE_EDT_MAX];
> +
> +    /* Interrupt controller MMIO */
> +    hwaddr        ic_base;
> +    uint32_t      ic_shift;
> +    MemoryRegion  ic_mmio;
> +    MemoryRegion  ic_reg_mmio;
> +    MemoryRegion  ic_notify_mmio;
> +
> +    /* VC memory regions */
> +    hwaddr        vc_base;
> +    uint64_t      vc_size;
> +    uint32_t      vc_shift;
> +    MemoryRegion  vc_mmio;
> +
> +    /* IPI and END address space to model the EDT segmentation */
> +    uint32_t      edt_shift;
> +    MemoryRegion  ipi_mmio;
> +    AddressSpace  ipi_as;
> +    MemoryRegion  end_mmio;
> +    AddressSpace  end_as;
> +
> +    /* PC memory regions */
> +    hwaddr        pc_base;
> +    uint64_t      pc_size;
> +    uint32_t      pc_shift;
> +    MemoryRegion  pc_mmio;
> +    uint32_t      vdt_shift;
> +
> +    /* TIMA memory regions */
> +    hwaddr        tm_base;
> +    uint32_t      tm_shift;
> +    MemoryRegion  tm_mmio;
> +    MemoryRegion  tm_mmio_indirect;
> +
> +    /* CPU for indirect TIMA access */
> +    PowerPCCPU    *cpu_ind;
> +} PnvXive;
> +
> +void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon);
> +
> +#endif /* PPC_PNV_XIVE_H */
> diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
> index 255b26a5aaf6..6623ec54a7a8 100644
> --- a/include/hw/ppc/pnv_xscom.h
> +++ b/include/hw/ppc/pnv_xscom.h
> @@ -73,6 +73,9 @@ typedef struct PnvXScomInterfaceClass {
>  #define PNV_XSCOM_OCC_BASE        0x0066000
>  #define PNV_XSCOM_OCC_SIZE        0x6000
>  
> +#define PNV9_XSCOM_XIVE_BASE      0x5013000
> +#define PNV9_XSCOM_XIVE_SIZE      0x300
> +
>  extern void pnv_xscom_realize(PnvChip *chip, Error **errp);
>  extern int pnv_dt_xscom(PnvChip *chip, void *fdt, int offset);
>  
> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
> index c8201462d698..6089511cff83 100644
> --- a/include/hw/ppc/xive.h
> +++ b/include/hw/ppc/xive.h
> @@ -237,6 +237,7 @@ int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>                          XiveNVT *nvt);
>  int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>                          XiveNVT *nvt);
> +void xive_router_notify(XiveFabric *xf, uint32_t lisn);
>  
>  /*
>   * XIVE END ESBs
> diff --git a/hw/intc/pnv_xive.c b/hw/intc/pnv_xive.c
> new file mode 100644
> index 000000000000..9f0c41cdb750
> --- /dev/null
> +++ b/hw/intc/pnv_xive.c
> @@ -0,0 +1,1612 @@
> +/*
> + * QEMU PowerPC XIVE interrupt controller model
> + *
> + * Copyright (c) 2017-2018, IBM Corporation.
> + *
> + * This code is licensed under the GPL version 2 or later. See the
> + * COPYING file in the top-level directory.
> + */
> +
> +#include "qemu/osdep.h"
> +#include "qemu/log.h"
> +#include "qapi/error.h"
> +#include "target/ppc/cpu.h"
> +#include "sysemu/cpus.h"
> +#include "sysemu/dma.h"
> +#include "monitor/monitor.h"
> +#include "hw/ppc/fdt.h"
> +#include "hw/ppc/pnv.h"
> +#include "hw/ppc/pnv_xscom.h"
> +#include "hw/ppc/pnv_xive.h"
> +#include "hw/ppc/xive_regs.h"
> +#include "hw/ppc/ppc.h"
> +
> +#include <libfdt.h>
> +
> +#include "pnv_xive_regs.h"
> +
> +/*
> + * Interrupt source number encoding
> + */
> +#define SRCNO_BLOCK(srcno)        (((srcno) >> 28) & 0xf)
> +#define SRCNO_INDEX(srcno)        ((srcno) & 0x0fffffff)
> +#define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))
> +
> +/*
> + * Virtual structures table accessors
> + */
> +typedef struct XiveVstInfo {
> +    const char *name;
> +    uint32_t    size;
> +    uint32_t    max_blocks;
> +} XiveVstInfo;
> +
> +static const XiveVstInfo vst_infos[] = {
> +    [VST_TSEL_IVT]  = { "EAT",  sizeof(XiveEAS), 16 },
> +    [VST_TSEL_SBE]  = { "SBE",  0,               16 },
> +    [VST_TSEL_EQDT] = { "ENDT", sizeof(XiveEND), 16 },
> +    [VST_TSEL_VPDT] = { "VPDT", sizeof(XiveNVT),  32 },

Are those VST_TSEL_* things named in the XIVE documentation?  If not,
you probably want to rename them to reflect the new-style naming.

> +    /* Interrupt fifo backing store table :
> +     *
> +     * 0 - IPI,
> +     * 1 - HWD,
> +     * 2 - First escalate,
> +     * 3 - Second escalate,
> +     * 4 - Redistribution,
> +     * 5 - IPI cascaded queue ?
> +     */
> +    [VST_TSEL_IRQ]  = { "IRQ",  0,               6  },
> +};
> +
> +#define xive_error(xive, fmt, ...)                                      \
> +    qemu_log_mask(LOG_GUEST_ERROR, "XIVE[%x] - " fmt "\n", (xive)->chip_id, \
> +                  ## __VA_ARGS__);
> +
> +/*
> + * Our lookup routine for a remote XIVE IC. A simple scan of the chips.
> + */
> +static PnvXive *pnv_xive_get_ic(PnvXive *xive, uint8_t blk)
> +{
> +    PnvMachineState *pnv = PNV_MACHINE(qdev_get_machine());
> +    int i;
> +
> +    for (i = 0; i < pnv->num_chips; i++) {
> +        Pnv9Chip *chip9 = PNV9_CHIP(pnv->chips[i]);
> +        PnvXive *ic_xive = &chip9->xive;
> +        bool chip_override =
> +            ic_xive->regs[PC_GLOBAL_CONFIG >> 3] & PC_GCONF_CHIPID_OVR;
> +
> +        if (chip_override) {
> +            if (ic_xive->chip_id == blk) {
> +                return ic_xive;
> +            }
> +        } else {
> +            ; /* TODO: Block scope support */
> +        }
> +    }
> +    xive_error(xive, "VST: unknown chip/block %d !?", blk);
> +    return NULL;
> +}
> +
> +/*
> + * Virtual Structures Table accessors for SBE, EAT, ENDT, NVT
> + */
> +static uint64_t pnv_xive_vst_addr_direct(PnvXive *xive,
> +                                         const XiveVstInfo *info, uint64_t vsd,
> +                                         uint8_t blk, uint32_t idx)
> +{
> +    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
> +    uint64_t vst_tsize = 1ull << (GETFIELD(VSD_TSIZE, vsd) + 12);
> +    uint32_t idx_max = (vst_tsize / info->size) - 1;
> +
> +    if (idx > idx_max) {
> +#ifdef XIVE_DEBUG
> +        xive_error(xive, "VST: %s entry %x/%x out of range !?", info->name,
> +                   blk, idx);
> +#endif
> +        return 0;
> +    }
> +
> +    return vst_addr + idx * info->size;
> +}
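
A worked example of the direct lookup arithmetic, as a standalone
sketch. VSD_TSIZE_MASK spells out PPC_BITMASK(59, 63) as the low five
bits, and vst_addr_direct() is a hypothetical stand-in that takes the
entry size directly instead of a XiveVstInfo; rejecting a zero entry
size is a choice made for this sketch only.

```c
#include <stdint.h>

#define VSD_ADDRESS_MASK 0x0ffffffffffff000ull
#define VSD_TSIZE_MASK   0x1full /* IBM bits 59..63, i.e. the low 5 bits */

/*
 * Hypothetical helper: returns the address of entry 'idx' in a direct
 * table described by 'vsd', or 0 when the index is out of range.
 */
static uint64_t vst_addr_direct(uint64_t vsd, uint32_t entry_size,
                                uint32_t idx)
{
    uint64_t base = vsd & VSD_ADDRESS_MASK;
    /* Table size in bytes: 4K << TSIZE */
    uint64_t tsize = 1ull << ((vsd & VSD_TSIZE_MASK) + 12);

    if (entry_size == 0 || idx >= tsize / entry_size) {
        return 0;
    }
    return base + (uint64_t)idx * entry_size;
}
```

For instance, with a TSIZE field of 4 the table is 64K; with 32-byte
entries (eight 32-bit words, as in the END accessors) that gives 2048
entries, so index 2047 is the last valid one.
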
> +
> +#define XIVE_VSD_SIZE 8
> +
> +static uint64_t pnv_xive_vst_addr_indirect(PnvXive *xive,
> +                                           const XiveVstInfo *info,
> +                                           uint64_t vsd, uint8_t blk,
> +                                           uint32_t idx)
> +{
> +    uint64_t vsd_addr;
> +    uint64_t vst_addr;
> +    uint32_t page_shift;
> +    uint32_t page_mask;
> +    uint64_t vst_tsize = 1ull << (GETFIELD(VSD_TSIZE, vsd) + 12);
> +    uint32_t idx_max = (vst_tsize / XIVE_VSD_SIZE) - 1;
> +
> +    if (idx > idx_max) {
> +#ifdef XIVE_DEBUG
> +        xive_error(xive, "VET: %s entry %x/%x out of range !?", info->name,
> +                   blk, idx);
> +#endif
> +        return 0;
> +    }
> +
> +    vsd_addr = vsd & VSD_ADDRESS_MASK;
> +
> +    /*
> +     * Read the first descriptor to get the page size of each indirect
> +     * table.
> +     */
> +    vsd = ldq_be_dma(&address_space_memory, vsd_addr);
> +    page_shift = GETFIELD(VSD_TSIZE, vsd) + 12;
> +    page_mask = (1ull << page_shift) - 1;
> +
> +    /* Indirect page size can be 4K, 64K, 2M. */
> +    if (page_shift != 12 && page_shift != 16 && page_shift != 23) {

page_shift == 23?? That's 8 MiB.
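
Spelling out the arithmetic: a page size of 2^shift bytes gives 4 KiB
for 12, 64 KiB for 16, 2 MiB for 21 and 8 MiB for 23. The sketch below
assumes the comment (4K, 64K, 2M) states the intent; the helper names
are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Page size in bytes for a given shift. */
static uint64_t page_size_from_shift(uint32_t shift)
{
    return 1ull << shift;
}

/* Hypothetical check, assuming 4K, 64K and 2M are the valid sizes. */
static bool indirect_page_shift_valid(uint32_t shift)
{
    return shift == 12 || shift == 16 || shift == 21;
}
```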

> +        xive_error(xive, "VST: invalid %s table shift %d", info->name,
> +                   page_shift);
> +    }
> +
> +    if (!(vsd & VSD_ADDRESS_MASK)) {
> +        xive_error(xive, "VST: invalid %s entry %x/%x !?", info->name,
> +                   blk, 0);
> +        return 0;
> +    }
> +
> +    /* Load the descriptor we are looking for, if not already done */
> +    if (idx) {
> +        vsd_addr = vsd_addr + (idx >> page_shift);
> +        vsd = ldq_be_dma(&address_space_memory, vsd_addr);
> +
> +        if (page_shift != GETFIELD(VSD_TSIZE, vsd) + 12) {
> +            xive_error(xive, "VST: %s entry %x/%x indirect page size differ !?",
> +                       info->name, blk, idx);
> +            return 0;
> +        }
> +    }
> +
> +    vst_addr = vsd & VSD_ADDRESS_MASK;
> +
> +    return vst_addr + (idx & page_mask) * info->size;
> +}
> +
> +static uint64_t pnv_xive_vst_addr(PnvXive *xive, uint8_t type, uint8_t blk,
> +                                  uint32_t idx)
> +{
> +    uint64_t vsd;
> +
> +    if (blk >= vst_infos[type].max_blocks) {
> +        xive_error(xive, "VST: invalid block id %d for VST %s %d !?",
> +                   blk, vst_infos[type].name, idx);
> +        return 0;
> +    }
> +
> +    vsd = xive->vsds[type][blk];
> +
> +    /* Remote VST accesses */
> +    if (GETFIELD(VSD_MODE, vsd) == VSD_MODE_FORWARD) {
> +        xive = pnv_xive_get_ic(xive, blk);
> +
> +        return xive ? pnv_xive_vst_addr(xive, type, blk, idx) : 0;
> +    }
> +
> +    if (VSD_INDIRECT & vsd) {
> +        return pnv_xive_vst_addr_indirect(xive, &vst_infos[type], vsd,
> +                                          blk, idx);
> +    }
> +
> +    return pnv_xive_vst_addr_direct(xive, &vst_infos[type], vsd, blk, idx);
> +}
> +
> +static int pnv_xive_get_end(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
> +                           XiveEND *end)
> +{
> +    PnvXive *xive = PNV_XIVE(xrtr);
> +    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
> +
> +    if (!end_addr) {
> +        return -1;
> +    }
> +
> +    cpu_physical_memory_read(end_addr, end, sizeof(XiveEND));
> +    end->w0 = be32_to_cpu(end->w0);
> +    end->w1 = be32_to_cpu(end->w1);
> +    end->w2 = be32_to_cpu(end->w2);
> +    end->w3 = be32_to_cpu(end->w3);
> +    end->w4 = be32_to_cpu(end->w4);
> +    end->w5 = be32_to_cpu(end->w5);
> +    end->w6 = be32_to_cpu(end->w6);
> +    end->w7 = be32_to_cpu(end->w7);
> +
> +    return 0;
> +}
> +
> +static int pnv_xive_set_end(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
> +                           XiveEND *in_end)
> +{
> +    PnvXive *xive = PNV_XIVE(xrtr);
> +    XiveEND end;
> +    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
> +
> +    if (!end_addr) {
> +        return -1;
> +    }
> +
> +    end.w0 = cpu_to_be32(in_end->w0);
> +    end.w1 = cpu_to_be32(in_end->w1);
> +    end.w2 = cpu_to_be32(in_end->w2);
> +    end.w3 = cpu_to_be32(in_end->w3);
> +    end.w4 = cpu_to_be32(in_end->w4);
> +    end.w5 = cpu_to_be32(in_end->w5);
> +    end.w6 = cpu_to_be32(in_end->w6);
> +    end.w7 = cpu_to_be32(in_end->w7);
> +    cpu_physical_memory_write(end_addr, &end, sizeof(XiveEND));
> +    return 0;
> +}
> +
> +static int pnv_xive_end_update(PnvXive *xive, uint8_t blk, uint32_t idx)
> +{
> +    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
> +
> +    if (!end_addr) {
> +        return -1;
> +    }
> +
> +    cpu_physical_memory_write(end_addr, xive->eqc_watch, sizeof(XiveEND));
> +    return 0;
> +}
> +
> +static int pnv_xive_get_nvt(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
> +                           XiveNVT *nvt)
> +{
> +    PnvXive *xive = PNV_XIVE(xrtr);
> +    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
> +
> +    if (!nvt_addr) {
> +        return -1;
> +    }
> +
> +    cpu_physical_memory_read(nvt_addr, nvt, sizeof(XiveNVT));
> +    nvt->w0 = cpu_to_be32(nvt->w0);
> +    nvt->w1 = cpu_to_be32(nvt->w1);
> +    nvt->w2 = cpu_to_be32(nvt->w2);
> +    nvt->w3 = cpu_to_be32(nvt->w3);
> +    nvt->w4 = cpu_to_be32(nvt->w4);
> +    nvt->w5 = cpu_to_be32(nvt->w5);
> +    nvt->w6 = cpu_to_be32(nvt->w6);
> +    nvt->w7 = cpu_to_be32(nvt->w7);
> +
> +    return 0;
> +}
> +
> +static int pnv_xive_set_nvt(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
> +                           XiveNVT *in_nvt)
> +{
> +    PnvXive *xive = PNV_XIVE(xrtr);
> +    XiveNVT nvt;
> +    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
> +
> +    if (!nvt_addr) {
> +        return -1;
> +    }
> +
> +    nvt.w0 = cpu_to_be32(in_nvt->w0);
> +    nvt.w1 = cpu_to_be32(in_nvt->w1);
> +    nvt.w2 = cpu_to_be32(in_nvt->w2);
> +    nvt.w3 = cpu_to_be32(in_nvt->w3);
> +    nvt.w4 = cpu_to_be32(in_nvt->w4);
> +    nvt.w5 = cpu_to_be32(in_nvt->w5);
> +    nvt.w6 = cpu_to_be32(in_nvt->w6);
> +    nvt.w7 = cpu_to_be32(in_nvt->w7);
> +    cpu_physical_memory_write(nvt_addr, &nvt, sizeof(XiveNVT));
> +    return 0;
> +}
> +
> +static int pnv_xive_nvt_update(PnvXive *xive, uint8_t blk, uint32_t idx)
> +{
> +    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
> +
> +    if (!nvt_addr) {
> +        return -1;
> +    }
> +
> +    cpu_physical_memory_write(nvt_addr, xive->vpc_watch, sizeof(XiveNVT));
> +    return 0;
> +}
> +
> +static int pnv_xive_get_eas(XiveRouter *xrtr, uint32_t srcno, XiveEAS *eas)
> +{
> +    PnvXive *xive = PNV_XIVE(xrtr);
> +    uint8_t  blk = SRCNO_BLOCK(srcno);
> +    uint32_t idx = SRCNO_INDEX(srcno);
> +    uint64_t eas_addr;
> +
> +    /* TODO: check when remote EAS lookups are possible */
> +    if (pnv_xive_get_ic(xive, blk) != xive) {
> +        xive_error(xive, "VST: EAS %x is remote !?", srcno);
> +        return -1;
> +    }
> +
> +    eas_addr = pnv_xive_vst_addr(xive, VST_TSEL_IVT, blk, idx);
> +    if (!eas_addr) {
> +        return -1;
> +    }
> +
> +    eas->w &= ~EAS_VALID;

Doesn't this get overwritten by the next statement?

> +    *((uint64_t *) eas) = ldq_be_dma(&address_space_memory, eas_addr);

eas->w = ldq... would surely be simpler.
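
A sketch of the simpler form, using stand-ins so it compiles outside
QEMU.  EasSketch assumes the EAS is a single 64-bit word named 'w' (as
the quoted code suggests), and load_be64() is a hypothetical
replacement for ldq_be_dma().

```c
#include <assert.h>
#include <stdint.h>

/* Stand-in for the EAS structure: a single 64-bit word. */
typedef struct {
    uint64_t w;
} EasSketch;

/* Hypothetical stand-in for ldq_be_dma(): assemble a big-endian value. */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}

/* Assign the field by name: no pointer cast, and no prior masking,
 * since the whole word is replaced anyway. */
static void eas_load(EasSketch *eas, const uint8_t *mem)
{
    assert(eas && mem);
    eas->w = load_be64(mem);
}
```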

> +    return 0;
> +}
> +
> +static int pnv_xive_set_eas(XiveRouter *xrtr, uint32_t srcno, XiveEAS *ive)
> +{
> +    /* All done. */

Uh.. what?  This is wrong, although I guess it doesn't matter because
the pnv model never uses set_eas.  Another argument for not
abstracting this path - just write directly in the PAPR code.

> +    return 0;
> +}
> +
> +static int pnv_xive_eas_update(PnvXive *xive, uint32_t idx)
> +{
> +    /* All done. */
> +    return 0;
> +}
> +
> +/*
> + * XIVE Set Translation Table configuration
> + *
> + * The Virtualization Controller MMIO region containing the IPI ESB
> + * pages and END ESB pages is sub-divided into "sets" which map
> + * portions of the VC region to the different ESB pages. It is
> + * configured at runtime through the EDT set translation table to let
> + * the firmware decide how to split the address space between IPI ESB
> + * pages and END ESB pages.
> + */
> +static int pnv_xive_set_xlate_update(PnvXive *xive, uint64_t val)
> +{
> +    uint8_t index = xive->set_xlate_autoinc ?
> +        xive->set_xlate_index++ : xive->set_xlate_index;

What's the correct hardware behaviour when the index runs off the end
with autoincrement mode?

> +    uint8_t max_index;
> +    uint64_t *xlate_table;
> +
> +    switch (xive->set_xlate) {
> +    case CQ_TAR_TSEL_BLK:
> +        max_index = ARRAY_SIZE(xive->set_xlate_blk);
> +        xlate_table = xive->set_xlate_blk;
> +        break;
> +    case CQ_TAR_TSEL_MIG:
> +        max_index = ARRAY_SIZE(xive->set_xlate_mig);
> +        xlate_table = xive->set_xlate_mig;
> +        break;
> +    case CQ_TAR_TSEL_EDT:
> +        max_index = ARRAY_SIZE(xive->set_xlate_edt);
> +        xlate_table = xive->set_xlate_edt;
> +        break;
> +    case CQ_TAR_TSEL_VDT:
> +        max_index = ARRAY_SIZE(xive->set_xlate_vdt);
> +        xlate_table = xive->set_xlate_vdt;
> +        break;
> +    default:
> +        xive_error(xive, "xlate: invalid table %d", (int) xive->set_xlate);

In the error case is it correct for the autoincrement to go ahead?
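
One possible ordering, as a sketch only: validate first and advance the
index only after a successful store.  Whether the hardware behaves this
way is exactly the open question above, so treat the policy as an
assumption.  All names here (XlateSketch, XLATE_SKETCH_ENTRIES,
xlate_update) are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define XLATE_SKETCH_ENTRIES 4   /* hypothetical table size */

typedef struct {
    uint64_t table[XLATE_SKETCH_ENTRIES];
    uint8_t index;
    int autoinc;
    int table_valid;             /* stands in for the table selector check */
} XlateSketch;

/* Returns 0 on success, -1 on error; the index moves only on success. */
static int xlate_update(XlateSketch *s, uint64_t val)
{
    assert(s);
    if (!s->table_valid) {
        return -1;               /* invalid table: index left untouched */
    }
    if (s->index >= XLATE_SKETCH_ENTRIES) {
        return -1;               /* ran off the end: no wrap, no store */
    }
    s->table[s->index] = val;
    if (s->autoinc) {
        s->index++;
    }
    return 0;
}
```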

> +        return -1;
> +    }
> +
> +    if (index >= max_index) {
> +        return -1;
> +    }
> +
> +    xlate_table[index] = val;
> +    return 0;
> +}
> +
> +static int pnv_xive_set_xlate_select(PnvXive *xive, uint64_t val)
> +{
> +    xive->set_xlate_autoinc = val & CQ_TAR_TBL_AUTOINC;
> +    xive->set_xlate = val & CQ_TAR_TSEL;
> +    xive->set_xlate_index = GETFIELD(CQ_TAR_TSEL_INDEX, val);

Why split this here, rather than just storing the MMIOed value direct
in the regs[] array, then parsing out the bits when you need them?

To expand a bit, there are two models you can use for modelling
registers in qemu.  You can have a big regs[] with all the registers
and make the accessors just read/write that, plus side-effect and
special case handling.  Or you can have specific fields in your state
for the crucial register values, then have the MMIO access do all the
translation into those underlying registers based on the offset.

Either model can make sense, depending on how many side effects and
special cases there are.  Mixing the two models, which is kind of what
you're doing here, is usually not a good idea.
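
A minimal sketch of the first model applied to a select register: the
write stores the raw value and consumers decode fields on demand.  The
bit layout and every name below are hypothetical, chosen only for the
example; they are not the real CQ_TAR definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, chosen only for this sketch. */
#define TAR_SK_AUTOINC     (UINT64_C(1) << 63)
#define TAR_SK_INDEX_MASK  UINT64_C(0x3f)
#define TAR_SK_REG         0        /* slot in regs[] */
#define TAR_SK_NR_REGS     1

typedef struct {
    uint64_t regs[TAR_SK_NR_REGS];
} RegsSketch;

/* MMIO write: store the raw value, nothing else. */
static void tar_write(RegsSketch *s, uint64_t val)
{
    assert(s);
    s->regs[TAR_SK_REG] = val;
}

/* Consumers decode the fields when they need them. */
static int tar_autoinc(const RegsSketch *s)
{
    return (s->regs[TAR_SK_REG] & TAR_SK_AUTOINC) != 0;
}

static unsigned tar_index(const RegsSketch *s)
{
    return (unsigned)(s->regs[TAR_SK_REG] & TAR_SK_INDEX_MASK);
}
```

The state then holds a single source of truth for the register, and a
read of it can return regs[] unchanged.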

> +
> +    return 0;
> +}
> +
> +/*
> + * Computes the overall size of the IPI or the END ESB pages
> + */
> +static uint64_t pnv_xive_set_xlate_edt_size(PnvXive *xive, uint64_t type)
> +{
> +    uint64_t edt_size = 1ull << xive->edt_shift;
> +    uint64_t size = 0;
> +    int i;
> +
> +    for (i = 0; i < XIVE_XLATE_EDT_MAX; i++) {
> +        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
> +
> +        if (edt_type == type) {
> +            size += edt_size;
> +        }
> +    }
> +
> +    return size;
> +}
> +
> +/*
> + * Maps an offset of the VC region in the IPI or END region using the
> + * layout defined by the EDT table
> + */
> +static uint64_t pnv_xive_set_xlate_edt_offset(PnvXive *xive, uint64_t vc_offset,
> +                                              uint64_t type)
> +{
> +    int i;
> +    uint64_t edt_size = (1ull << xive->edt_shift);
> +    uint64_t edt_offset = vc_offset;
> +
> +    for (i = 0; i < XIVE_XLATE_EDT_MAX && (i * edt_size) < vc_offset; i++) {
> +        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
> +
> +        if (edt_type != type) {
> +            edt_offset -= edt_size;
> +        }
> +    }
> +
> +    return edt_offset;
> +}
> +
> +/*
> + * IPI and END sources realize routines
> + *
> + * We use the EDT table to size the internal XiveSource object backing
> + * the IPIs and the XiveENDSource object backing the ENDs
> + */
> +static void pnv_xive_source_realize(PnvXive *xive, Error **errp)
> +{
> +    XiveSource *xsrc = &xive->source;
> +    Error *local_err = NULL;
> +    uint64_t ipi_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_IPI);
> +
> +    /* Two pages per IRQ */
> +    xive->nr_irqs = ipi_mmio_size / (1ull << (xive->vc_shift + 1));
> +
> +    /*
> +     * Configure store EOI if required by firwmare (skiboot has
> +     * removed support recently though)
> +     */
> +    if (xive->regs[VC_SBC_CONFIG >> 3] &
> +        (VC_SBC_CONF_CPLX_CIST | VC_SBC_CONF_CIST_BOTH)) {
> +        object_property_set_int(OBJECT(xsrc), XIVE_SRC_STORE_EOI, "flags",
> +                                &error_fatal);
> +    }
> +
> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
> +                            &error_fatal);
> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
> +                                   &error_fatal);
> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
> +
> +    /* Install the IPI ESB MMIO region in its VC region */
> +    memory_region_add_subregion(&xive->ipi_mmio, 0, &xsrc->esb_mmio);
> +
> +    /* Start in a clean state */
> +    device_reset(DEVICE(&xive->source));

I don't think you should need that.  During qemu start up all the
device reset handlers should be called after realize but before
starting the VM anyway.

> +}
> +
> +static void pnv_xive_end_source_realize(PnvXive *xive, Error **errp)
> +{
> +    XiveENDSource *end_xsrc = &xive->end_source;
> +    Error *local_err = NULL;
> +    uint64_t end_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_EQ);
> +
> +    /* Two pages per END: ESn and ESe */
> +    xive->nr_ends  = end_mmio_size / (1ull << (xive->vc_shift + 1));
> +
> +    object_property_set_int(OBJECT(end_xsrc), xive->nr_ends, "nr-ends",
> +                            &error_fatal);
> +    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
> +                                   &error_fatal);
> +    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
> +
> +    /* Install the END ESB MMIO region in its VC region */
> +    memory_region_add_subregion(&xive->end_mmio, 0, &end_xsrc->esb_mmio);
> +}
> +
> +/*
> + * Virtual Structure Tables (VST) configuration
> + */
> +static void pnv_xive_table_set_exclusive(PnvXive *xive, uint8_t type,
> +                                         uint8_t blk, uint64_t vsd)
> +{
> +    bool gconf_indirect =
> +        xive->regs[VC_GLOBAL_CONFIG >> 3] & VC_GCONF_INDIRECT;
> +    uint32_t vst_shift = GETFIELD(VSD_TSIZE, vsd) + 12;
> +    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
> +
> +    if (VSD_INDIRECT & vsd) {
> +        if (!gconf_indirect) {
> +            xive_error(xive, "VST: %s indirect tables not enabled",
> +                       vst_infos[type].name);
> +            return;
> +        }
> +    }
> +
> +    switch (type) {
> +    case VST_TSEL_IVT:
> +        /*
> +         * This is our trigger to create the XiveSource object backing
> +         * the IPIs.
> +         */
> +        pnv_xive_source_realize(xive, &error_fatal);

IIUC this gets called in response to an MMIO.  Realizing devices in
response to a runtime MMIO looks very wrong.

> +        break;
> +
> +    case VST_TSEL_EQDT:
> +        /* Same trigger but for the XiveENDSource object backing the ENDs. */
> +        pnv_xive_end_source_realize(xive, &error_fatal);
> +        break;
> +
> +    case VST_TSEL_VPDT:
> +        /* FIXME (skiboot) : remove DD1 workaround on the NVT table size */
> +        vst_shift = 16;
> +        break;
> +
> +    case VST_TSEL_SBE: /* Not modeled */
> +        /*
> +         * Contains the backing store pages for the source PQ bits.
> +         * The XiveSource object has its own. We would need a custom
> +         * source object to use this backing.
> +         */
> +        break;
> +
> +    case VST_TSEL_IRQ: /* VC only. Not modeled */
> +        /*
> +         * These tables contains the backing store pages for the
> +         * interrupt fifos of the VC sub-engine in case of overflow.
> +         */
> +        break;
> +    default:
> +        g_assert_not_reached();
> +    }
> +
> +    if (!QEMU_IS_ALIGNED(vst_addr, 1ull << vst_shift)) {
> +        xive_error(xive, "VST: %s table address 0x%"PRIx64" is not aligned with"
> +                   " page shift %d", vst_infos[type].name, vst_addr, vst_shift);
> +    }
> +
> +    /* Keep the VSD for later use */
> +    xive->vsds[type][blk] = vsd;
> +}
> +
> +/*
> + * Both PC and VC sub-engines are configured as each use the Virtual
> + * Structure Tables : SBE, EAS, END and NVT.
> + */
> +static void pnv_xive_table_set_data(PnvXive *xive, uint64_t vsd, bool pc_engine)
> +{
> +    uint8_t mode = GETFIELD(VSD_MODE, vsd);
> +    uint8_t type = GETFIELD(VST_TABLE_SELECT,
> +                            xive->regs[VC_VSD_TABLE_ADDR >> 3]);
> +    uint8_t blk = GETFIELD(VST_TABLE_BLOCK,
> +                             xive->regs[VC_VSD_TABLE_ADDR >> 3]);
> +    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
> +
> +    if (type > VST_TSEL_IRQ) {
> +        xive_error(xive, "VST: invalid table type %d", type);
> +        return;
> +    }
> +
> +    if (blk >= vst_infos[type].max_blocks) {
> +        xive_error(xive, "VST: invalid block id %d for"
> +                      " %s table", blk, vst_infos[type].name);
> +        return;
> +    }
> +
> +    /*
> +     * Only take the VC sub-engine configuration into account because
> +     * the XiveRouter model combines both VC and PC sub-engines
> +     */
> +    if (pc_engine) {
> +        return;
> +    }
> +
> +    if (!vst_addr) {
> +        xive_error(xive, "VST: invalid %s table address", vst_infos[type].name);
> +        return;
> +    }
> +
> +    switch (mode) {
> +    case VSD_MODE_FORWARD:
> +        xive->vsds[type][blk] = vsd;
> +        break;
> +
> +    case VSD_MODE_EXCLUSIVE:
> +        pnv_xive_table_set_exclusive(xive, type, blk, vsd);
> +        break;
> +
> +    default:
> +        xive_error(xive, "VST: unsupported table mode %d", mode);
> +        return;
> +    }
> +}
> +
> +/*
> + * When the TIMA is accessed from the indirect page, the thread id
> + * (PIR) has to be configured in the IC before. This is used for
> + * resets and for debug purpose also.
> + */
> +static void pnv_xive_thread_indirect_set(PnvXive *xive, uint64_t val)
> +{
> +    int pir = GETFIELD(PC_TCTXT_INDIR_THRDID, xive->regs[PC_TCTXT_INDIR0 >> 3]);
> +
> +    if (val & PC_TCTXT_INDIR_VALID) {
> +        if (xive->cpu_ind) {
> +            xive_error(xive, "IC: indirect access already set for "
> +                       "invalid PIR %d", pir);
> +        }
> +
> +        pir = GETFIELD(PC_TCTXT_INDIR_THRDID, val) & 0xff;
> +        xive->cpu_ind = ppc_get_vcpu_by_pir(pir);
> +        if (!xive->cpu_ind) {
> +            xive_error(xive, "IC: invalid PIR %d for indirect access", pir);
> +        }
> +    } else {
> +        xive->cpu_ind = NULL;
> +    }
> +}
> +
> +/*
> + * Interrupt Controller registers MMIO
> + */
> +static void pnv_xive_ic_reg_write(PnvXive *xive, uint32_t offset, uint64_t val,
> +                                  bool mmio)
> +{
> +    MemoryRegion *sysmem = get_system_memory();
> +    uint32_t reg = offset >> 3;
> +
> +    switch (offset) {
> +
> +    /*
> +     * XIVE CQ (PowerBus bridge) settings
> +     */
> +    case CQ_MSGSND:     /* msgsnd for doorbells */
> +    case CQ_FIRMASK_OR: /* FIR error reporting */
> +        xive->regs[reg] = val;

Can you do that generic update outside the switch?  If that leaves too
many special cases that might be a sign you shouldn't use the
big-array-of-regs model.
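
Roughly what that could look like, as a sketch with made-up register
names and a made-up side effect: the switch keeps only the special
cases, and the store into regs[] happens once for every valid
register.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical register indexes for the sketch. */
enum { IC_SK_REG_A, IC_SK_REG_B, IC_SK_REG_CTL, IC_SK_NR_REGS };

typedef struct {
    uint64_t regs[IC_SK_NR_REGS];
    unsigned shift;
} IcSketch;

/* Returns 0 on success, -1 for an unknown register. */
static int ic_write(IcSketch *s, unsigned reg, uint64_t val)
{
    assert(s);
    if (reg >= IC_SK_NR_REGS) {
        return -1;
    }

    /* Side effects only. */
    switch (reg) {
    case IC_SK_REG_CTL:
        s->shift = (val & 1) ? 16 : 12;
        break;
    default:
        break;
    }

    /* Generic update, done once for every valid register. */
    s->regs[reg] = val;
    return 0;
}
```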

> +        break;
> +    case CQ_PBI_CTL:
> +        if (val & CQ_PBI_PC_64K) {
> +            xive->pc_shift = 16;
> +        }
> +        if (val & CQ_PBI_VC_64K) {
> +            xive->vc_shift = 16;
> +        }
> +        break;
> +    case CQ_CFG_PB_GEN: /* PowerBus General Configuration */
> +        /*
> +         * TODO: CQ_INT_ADDR_OPT for 1-block-per-chip mode
> +         */
> +        xive->regs[reg] = val;
> +        break;
> +
> +    /*
> +     * XIVE Virtualization Controller settings
> +     */
> +    case VC_GLOBAL_CONFIG:
> +        xive->regs[reg] = val;
> +        break;
> +
> +    /*
> +     * XIVE Presenter Controller settings
> +     */
> +    case PC_GLOBAL_CONFIG:
> +        /* Overrides Int command Chip ID with the Chip ID field */
> +        if (val & PC_GCONF_CHIPID_OVR) {
> +            xive->chip_id = GETFIELD(PC_GCONF_CHIPID, val);
> +        }
> +        xive->regs[reg] = val;
> +        break;
> +    case PC_TCTXT_CFG:
> +        /*
> +         * TODO: PC_TCTXT_CFG_BLKGRP_EN for block group support
> +         * TODO: PC_TCTXT_CFG_HARD_CHIPID_BLK
> +         */
> +
> +        /*
> +         * Moves the chipid into block field for hardwired CAM
> +         * compares Block offset value is adjusted to 0b0..01 & ThrdId
> +         */
> +        if (val & PC_TCTXT_CHIPID_OVERRIDE) {
> +            xive->thread_chip_id = GETFIELD(PC_TCTXT_CHIPID, val);
> +        }
> +        break;
> +    case PC_TCTXT_TRACK: /* Enable block tracking (DD2) */
> +        xive->regs[reg] = val;
> +        break;
> +
> +    /*
> +     * Misc settings
> +     */
> +    case VC_EQC_CONFIG: /* enable silent escalation */
> +    case VC_SBC_CONFIG: /* Store EOI configuration */
> +    case VC_AIB_TX_ORDER_TAG2:
> +        xive->regs[reg] = val;
> +        break;
> +
> +    /*
> +     * XIVE BAR settings (XSCOM only)
> +     */
> +    case CQ_RST_CTL:
> +        /* resets all bars */
> +        break;
> +
> +    case CQ_IC_BAR: /* IC BAR. 8 pages */
> +        xive->ic_shift = val & CQ_IC_BAR_64K ? 16 : 12;
> +        if (!(val & CQ_IC_BAR_VALID)) {
> +            xive->ic_base = 0;
> +            if (xive->regs[reg] & CQ_IC_BAR_VALID) {
> +                memory_region_del_subregion(&xive->ic_mmio,
> +                                            &xive->ic_reg_mmio);
> +                memory_region_del_subregion(&xive->ic_mmio,
> +                                            &xive->ic_notify_mmio);
> +                memory_region_del_subregion(sysmem, &xive->ic_mmio);
> +                memory_region_del_subregion(sysmem, &xive->tm_mmio_indirect);
> +            }
> +        } else {
> +            xive->ic_base  = val & ~(CQ_IC_BAR_VALID | CQ_IC_BAR_64K);
> +            if (!(xive->regs[reg] & CQ_IC_BAR_VALID)) {
> +                memory_region_add_subregion(sysmem, xive->ic_base,
> +                                            &xive->ic_mmio);
> +                memory_region_add_subregion(&xive->ic_mmio,  0,
> +                                            &xive->ic_reg_mmio);
> +                memory_region_add_subregion(&xive->ic_mmio,
> +                                            1ul << xive->ic_shift,
> +                                            &xive->ic_notify_mmio);
> +                memory_region_add_subregion(sysmem,
> +                                   xive->ic_base + (4ull << xive->ic_shift),
> +                                   &xive->tm_mmio_indirect);
> +            }
> +        }
> +        xive->regs[reg] = val;
> +        break;
> +
> +    case CQ_TM1_BAR: /* TM BAR and page size. 4 pages */
> +    case CQ_TM2_BAR: /* second TM BAR is for hotplug use */
> +        xive->tm_shift = val & CQ_TM_BAR_64K ? 16 : 12;
> +        if (!(val & CQ_TM_BAR_VALID)) {
> +            xive->tm_base = 0;
> +            if (xive->regs[reg] & CQ_TM_BAR_VALID) {
> +                memory_region_del_subregion(sysmem, &xive->tm_mmio);
> +            }
> +        } else {
> +            xive->tm_base  = val & ~(CQ_TM_BAR_VALID | CQ_TM_BAR_64K);
> +            if (!(xive->regs[reg] & CQ_TM_BAR_VALID)) {
> +                memory_region_add_subregion(sysmem, xive->tm_base,
> +                                            &xive->tm_mmio);
> +            }
> +        }
> +        xive->regs[reg] = val;
> +       break;

Something funny with your indentation here.

> +    case CQ_PC_BAR:
> +        if (!(val & CQ_PC_BAR_VALID)) {
> +            xive->pc_base = 0;
> +            if (xive->regs[reg] & CQ_PC_BAR_VALID) {
> +                memory_region_del_subregion(sysmem, &xive->pc_mmio);
> +            }
> +        } else {
> +            xive->pc_base = val & ~(CQ_PC_BAR_VALID);
> +            if (!(xive->regs[reg] & CQ_PC_BAR_VALID)) {
> +                memory_region_add_subregion(sysmem, xive->pc_base,
> +                                            &xive->pc_mmio);
> +            }
> +        }
> +        xive->regs[reg] = val;
> +        break;
> +    case CQ_PC_BARM: /* TODO: configure PC BAR size at runtime */
> +        xive->pc_size =  (~val + 1) & CQ_PC_BARM_MASK;
> +        xive->regs[reg] = val;
> +
> +        /* Compute the size of the VDT sets */
> +        xive->vdt_shift = ctz64(xive->pc_size / XIVE_XLATE_VDT_MAX);
> +        break;
> +
> +    case CQ_VC_BAR: /* From 64M to 4TB */
> +        if (!(val & CQ_VC_BAR_VALID)) {
> +            xive->vc_base = 0;
> +            if (xive->regs[reg] & CQ_VC_BAR_VALID) {
> +                memory_region_del_subregion(sysmem, &xive->vc_mmio);
> +            }
> +        } else {
> +            xive->vc_base = val & ~(CQ_VC_BAR_VALID);
> +            if (!(xive->regs[reg] & CQ_VC_BAR_VALID)) {
> +                memory_region_add_subregion(sysmem, xive->vc_base,
> +                                            &xive->vc_mmio);
> +            }
> +        }
> +        xive->regs[reg] = val;
> +        break;
> +    case CQ_VC_BARM: /* TODO: configure VC BAR size at runtime */
> +        xive->vc_size = (~val + 1) & CQ_VC_BARM_MASK;

Any reason to precompute that, rather than work it out from
regs[CQ_VC_BARM] when you need it?
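
Something along these lines, as a sketch: a small helper that derives
the size from the stored mask register each time it is needed.  The
mask value below is hypothetical (the real CQ_VC_BARM_MASK is defined
elsewhere in the patch), and in the patch the register would be read
from regs[CQ_VC_BARM >> 3].

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical mask value, for illustration only. */
#define BARM_SK_MASK UINT64_C(0x0000ffffffffe000)

/* Derive the BAR size from the raw mask register, using the same
 * expression as the quoted code: (~val + 1) & mask. */
static uint64_t bar_size_from_barm(uint64_t barm)
{
    return (~barm + 1) & BARM_SK_MASK;
}
```

A caller would then use bar_size_from_barm(regs[...]) instead of a
cached vc_size field.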

> +        xive->regs[reg] = val;
> +
> +        /* Compute the size of the EDT sets */
> +        xive->edt_shift = ctz64(xive->vc_size / XIVE_XLATE_EDT_MAX);
> +        break;
> +
> +    /*
> +     * XIVE Set Translation Table settings. Defines the layout of the
> +     * VC BAR containing the ESB pages of the IPIs and of the ENDs
> +     */
> +    case CQ_TAR: /* Set Translation Table Address */
> +        pnv_xive_set_xlate_select(xive, val);
> +        break;
> +    case CQ_TDR: /* Set Translation Table Data */
> +        pnv_xive_set_xlate_update(xive, val);
> +        break;
> +
> +    /*
> +     * XIVE VC & PC Virtual Structure Table settings
> +     */
> +    case VC_VSD_TABLE_ADDR:
> +    case PC_VSD_TABLE_ADDR: /* Virtual table selector */
> +        xive->regs[reg] = val;
> +        break;
> +    case VC_VSD_TABLE_DATA: /* Virtual table setting */
> +    case PC_VSD_TABLE_DATA:
> +        pnv_xive_table_set_data(xive, val, offset == PC_VSD_TABLE_DATA);
> +        break;
> +
> +    /*
> +     * Interrupt fifo overflow in memory backing store. Not modeled
> +     */
> +    case VC_IRQ_CONFIG_IPI:
> +    case VC_IRQ_CONFIG_HW:
> +    case VC_IRQ_CONFIG_CASCADE1:
> +    case VC_IRQ_CONFIG_CASCADE2:
> +    case VC_IRQ_CONFIG_REDIST:
> +    case VC_IRQ_CONFIG_IPI_CASC:
> +        xive->regs[reg] = val;
> +        break;
> +
> +    /*
> +     * XIVE hardware thread enablement
> +     */
> +    case PC_THREAD_EN_REG0_SET: /* Physical Thread Enable */
> +    case PC_THREAD_EN_REG1_SET: /* Physical Thread Enable (fused core) */
> +        xive->regs[reg] |= val;
> +        break;
> +    case PC_THREAD_EN_REG0_CLR:
> +        xive->regs[PC_THREAD_EN_REG0_SET >> 3] &= ~val;
> +        break;
> +    case PC_THREAD_EN_REG1_CLR:
> +        xive->regs[PC_THREAD_EN_REG1_SET >> 3] &= ~val;
> +        break;
> +
> +    /*
> +     * Indirect TIMA access set up. Defines the HW thread to use.
> +     */
> +    case PC_TCTXT_INDIR0:
> +        pnv_xive_thread_indirect_set(xive, val);
> +        xive->regs[reg] = val;
> +        break;
> +    case PC_TCTXT_INDIR1:
> +    case PC_TCTXT_INDIR2:
> +    case PC_TCTXT_INDIR3:
> +        /* TODO: check what PC_TCTXT_INDIR[123] are for */
> +        xive->regs[reg] = val;
> +        break;
> +
> +    /*
> +     * XIVE PC & VC cache updates for EAS, NVT and END
> +     */
> +    case PC_VPC_SCRUB_MASK:
> +    case PC_VPC_CWATCH_SPEC:
> +    case VC_EQC_SCRUB_MASK:
> +    case VC_EQC_CWATCH_SPEC:
> +    case VC_IVC_SCRUB_MASK:
> +        xive->regs[reg] = val;
> +        break;
> +    case VC_IVC_SCRUB_TRIG:
> +        pnv_xive_eas_update(xive, GETFIELD(VC_SCRUB_OFFSET, val));
> +        break;
> +    case PC_VPC_CWATCH_DAT0:
> +    case PC_VPC_CWATCH_DAT1:
> +    case PC_VPC_CWATCH_DAT2:
> +    case PC_VPC_CWATCH_DAT3:
> +    case PC_VPC_CWATCH_DAT4:
> +    case PC_VPC_CWATCH_DAT5:
> +    case PC_VPC_CWATCH_DAT6:
> +    case PC_VPC_CWATCH_DAT7:
> +        xive->vpc_watch[(offset - PC_VPC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
> +        break;
> +    case PC_VPC_SCRUB_TRIG:
> +        pnv_xive_nvt_update(xive, GETFIELD(PC_SCRUB_BLOCK_ID, val),
> +                           GETFIELD(PC_SCRUB_OFFSET, val));
> +        break;
> +    case VC_EQC_CWATCH_DAT0:
> +    case VC_EQC_CWATCH_DAT1:
> +    case VC_EQC_CWATCH_DAT2:
> +    case VC_EQC_CWATCH_DAT3:
> +        xive->eqc_watch[(offset - VC_EQC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
> +        break;
> +    case VC_EQC_SCRUB_TRIG:
> +        pnv_xive_end_update(xive, GETFIELD(VC_SCRUB_BLOCK_ID, val),
> +                            GETFIELD(VC_SCRUB_OFFSET, val));
> +        break;
> +
> +    /*
> +     * XIVE PC & VC cache invalidation
> +     */
> +    case PC_AT_KILL:
> +        xive->regs[reg] |= val;
> +        break;
> +    case VC_AT_MACRO_KILL:
> +        xive->regs[reg] |= val;
> +        break;
> +    case PC_AT_KILL_MASK:
> +    case VC_AT_MACRO_KILL_MASK:
> +        xive->regs[reg] = val;
> +        break;
> +
> +    default:
> +        xive_error(xive, "IC: invalid write to reg=0x%08x mmio=%d", offset,
> +                   mmio);
> +    }
> +}
> +
> +static uint64_t pnv_xive_ic_reg_read(PnvXive *xive, uint32_t offset, bool mmio)
> +{
> +    uint64_t val = 0;
> +    uint32_t reg = offset >> 3;
> +
> +    switch (offset) {
> +    case CQ_CFG_PB_GEN:
> +    case CQ_IC_BAR:
> +    case CQ_TM1_BAR:
> +    case CQ_TM2_BAR:
> +    case CQ_PC_BAR:
> +    case CQ_PC_BARM:
> +    case CQ_VC_BAR:
> +    case CQ_VC_BARM:
> +    case CQ_TAR:
> +    case CQ_TDR:
> +    case CQ_PBI_CTL:
> +
> +    case PC_TCTXT_CFG:
> +    case PC_TCTXT_TRACK:
> +    case PC_TCTXT_INDIR0:
> +    case PC_TCTXT_INDIR1:
> +    case PC_TCTXT_INDIR2:
> +    case PC_TCTXT_INDIR3:
> +    case PC_GLOBAL_CONFIG:
> +
> +    case PC_VPC_SCRUB_MASK:
> +    case PC_VPC_CWATCH_SPEC:
> +    case PC_VPC_CWATCH_DAT0:
> +    case PC_VPC_CWATCH_DAT1:
> +    case PC_VPC_CWATCH_DAT2:
> +    case PC_VPC_CWATCH_DAT3:
> +    case PC_VPC_CWATCH_DAT4:
> +    case PC_VPC_CWATCH_DAT5:
> +    case PC_VPC_CWATCH_DAT6:
> +    case PC_VPC_CWATCH_DAT7:
> +
> +    case VC_GLOBAL_CONFIG:
> +    case VC_AIB_TX_ORDER_TAG2:
> +
> +    case VC_IRQ_CONFIG_IPI:
> +    case VC_IRQ_CONFIG_HW:
> +    case VC_IRQ_CONFIG_CASCADE1:
> +    case VC_IRQ_CONFIG_CASCADE2:
> +    case VC_IRQ_CONFIG_REDIST:
> +    case VC_IRQ_CONFIG_IPI_CASC:
> +
> +    case VC_EQC_SCRUB_MASK:
> +    case VC_EQC_CWATCH_DAT0:
> +    case VC_EQC_CWATCH_DAT1:
> +    case VC_EQC_CWATCH_DAT2:
> +    case VC_EQC_CWATCH_DAT3:
> +
> +    case VC_EQC_CWATCH_SPEC:
> +    case VC_IVC_SCRUB_MASK:
> +    case VC_SBC_CONFIG:
> +    case VC_AT_MACRO_KILL_MASK:
> +    case VC_VSD_TABLE_ADDR:
> +    case PC_VSD_TABLE_ADDR:
> +    case VC_VSD_TABLE_DATA:
> +    case PC_VSD_TABLE_DATA:
> +        val = xive->regs[reg];
> +        break;
> +
> +    case CQ_MSGSND: /* Identifies which cores have msgsnd enabled.
> +                     * Say all have. */
> +        val = 0xffffff0000000000;
> +        break;
> +
> +    /*
> +     * XIVE PC & VC cache updates for EAS, NVT and END
> +     */
> +    case PC_VPC_SCRUB_TRIG:
> +    case VC_IVC_SCRUB_TRIG:
> +    case VC_EQC_SCRUB_TRIG:
> +        xive->regs[reg] &= ~VC_SCRUB_VALID;
> +        val = xive->regs[reg];
> +        break;
> +
> +    /*
> +     * XIVE PC & VC cache invalidation
> +     */
> +    case PC_AT_KILL:
> +        xive->regs[reg] &= ~PC_AT_KILL_VALID;
> +        val = xive->regs[reg];
> +        break;
> +    case VC_AT_MACRO_KILL:
> +        xive->regs[reg] &= ~VC_KILL_VALID;
> +        val = xive->regs[reg];
> +        break;
> +
> +    /*
> +     * XIVE synchronisation
> +     */
> +    case VC_EQC_CONFIG:
> +        val = VC_EQC_SYNC_MASK;
> +        break;
> +
> +    default:
> +        xive_error(xive, "IC: invalid read reg=0x%08x mmio=%d", offset, mmio);
> +    }
> +
> +    return val;
> +}
> +
> +static void pnv_xive_ic_reg_write_mmio(void *opaque, hwaddr addr,
> +                                       uint64_t val, unsigned size)
> +{
> +    pnv_xive_ic_reg_write(opaque, addr, val, true);

AFAICT the underlying write function never uses that 'mmio' parameter
except for debug, so it's probably not worth the bother of having
these wrappers.
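
A sketch of the wrapper-free shape, with stand-in types so it builds
outside QEMU: the accessors take the callback signature directly and
go straight into the ops table.  OpsSketch only mimics the read/write
members of MemoryRegionOps; every name here is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr_sketch;          /* stand-in for hwaddr */

#define DEV_SK_NR_REGS 4

typedef struct {
    uint64_t regs[DEV_SK_NR_REGS];
} DevSketch;

/* Accessors already have the callback signature: no wrapper needed. */
static void dev_reg_write(void *opaque, hwaddr_sketch addr, uint64_t val,
                          unsigned size)
{
    DevSketch *d = opaque;
    (void)size;
    assert((addr >> 3) < DEV_SK_NR_REGS);
    d->regs[addr >> 3] = val;
}

static uint64_t dev_reg_read(void *opaque, hwaddr_sketch addr, unsigned size)
{
    DevSketch *d = opaque;
    (void)size;
    assert((addr >> 3) < DEV_SK_NR_REGS);
    return d->regs[addr >> 3];
}

/* Mimics only the .read/.write members of an ops table. */
typedef struct {
    uint64_t (*read)(void *opaque, hwaddr_sketch addr, unsigned size);
    void (*write)(void *opaque, hwaddr_sketch addr, uint64_t val,
                  unsigned size);
} OpsSketch;

static const OpsSketch dev_ops = {
    .read = dev_reg_read,
    .write = dev_reg_write,
};
```

Any other caller can pass a fixed size of 8 to the same functions.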

> +}
> +
> +static uint64_t pnv_xive_ic_reg_read_mmio(void *opaque, hwaddr addr,
> +                                      unsigned size)
> +{
> +    return pnv_xive_ic_reg_read(opaque, addr, true);
> +}
> +
> +static const MemoryRegionOps pnv_xive_ic_reg_ops = {
> +    .read = pnv_xive_ic_reg_read_mmio,
> +    .write = pnv_xive_ic_reg_write_mmio,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +/*
> + * Interrupt Controller MMIO: Notify port page (write only)
> + */
> +#define PNV_XIVE_FORWARD_IPI        0x800 /* Forward IPI */
> +#define PNV_XIVE_FORWARD_HW         0x880 /* Forward HW */
> +#define PNV_XIVE_FORWARD_OS_ESC     0x900 /* Forward OS escalation */
> +#define PNV_XIVE_FORWARD_HW_ESC     0x980 /* Forward Hyp escalation */
> +#define PNV_XIVE_FORWARD_REDIS      0xa00 /* Forward Redistribution */
> +#define PNV_XIVE_RESERVED5          0xa80 /* Cache line 5 PowerBUS operation */
> +#define PNV_XIVE_RESERVED6          0xb00 /* Cache line 6 PowerBUS operation */
> +#define PNV_XIVE_RESERVED7          0xb80 /* Cache line 7 PowerBUS operation */
> +
> +/* VC synchronisation */
> +#define PNV_XIVE_SYNC_IPI           0xc00 /* Sync IPI */
> +#define PNV_XIVE_SYNC_HW            0xc80 /* Sync HW */
> +#define PNV_XIVE_SYNC_OS_ESC        0xd00 /* Sync OS escalation */
> +#define PNV_XIVE_SYNC_HW_ESC        0xd80 /* Sync Hyp escalation */
> +#define PNV_XIVE_SYNC_REDIS         0xe00 /* Sync Redistribution */
> +
> +/* PC synchronisation */
> +#define PNV_XIVE_SYNC_PULL          0xe80 /* Sync pull context */
> +#define PNV_XIVE_SYNC_PUSH          0xf00 /* Sync push context */
> +#define PNV_XIVE_SYNC_VPC           0xf80 /* Sync remove VPC store */
> +
> +static void pnv_xive_ic_hw_trigger(PnvXive *xive, hwaddr addr, uint64_t val)
> +{
> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xive);
> +
> +    xfc->notify(XIVE_FABRIC(xive), val);
> +}
> +
> +static void pnv_xive_ic_notify_write(void *opaque, hwaddr addr, uint64_t val,
> +                                     unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    /* VC: HW triggers */
> +    switch (addr) {
> +    case 0x000 ... 0x7FF:
> +        pnv_xive_ic_hw_trigger(opaque, addr, val);
> +        break;
> +
> +    /* VC: Forwarded IRQs */
> +    case PNV_XIVE_FORWARD_IPI:
> +    case PNV_XIVE_FORWARD_HW:
> +    case PNV_XIVE_FORWARD_OS_ESC:
> +    case PNV_XIVE_FORWARD_HW_ESC:
> +    case PNV_XIVE_FORWARD_REDIS:
> +        /* TODO: forwarded IRQs. Should be like HW triggers */
> +        xive_error(xive, "IC: forwarded at @0x%"HWADDR_PRIx" IRQ 0x%"PRIx64,
> +                   addr, val);
> +        break;
> +
> +    /* VC syncs */
> +    case PNV_XIVE_SYNC_IPI:
> +    case PNV_XIVE_SYNC_HW:
> +    case PNV_XIVE_SYNC_OS_ESC:
> +    case PNV_XIVE_SYNC_HW_ESC:
> +    case PNV_XIVE_SYNC_REDIS:
> +        break;
> +
> +    /* PC sync */
> +    case PNV_XIVE_SYNC_PULL:
> +    case PNV_XIVE_SYNC_PUSH:
> +    case PNV_XIVE_SYNC_VPC:
> +        break;
> +
> +    default:
> +        xive_error(xive, "IC: invalid notify write @%"HWADDR_PRIx, addr);
> +    }
> +}
> +
> +static uint64_t pnv_xive_ic_notify_read(void *opaque, hwaddr addr,
> +                                        unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    /* loads are invalid */
> +    xive_error(xive, "IC: invalid notify read @%"HWADDR_PRIx, addr);
> +    return -1;
> +}
> +
> +static const MemoryRegionOps pnv_xive_ic_notify_ops = {
> +    .read = pnv_xive_ic_notify_read,
> +    .write = pnv_xive_ic_notify_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +/*
> + * Interrupt controller MMIO region. The layout is compatible between
> + * 4K and 64K pages :
> + *
> + * Page 0           sub-engine BARs
> + *  0x000 - 0x3FF   IC registers
> + *  0x400 - 0x7FF   PC registers
> + *  0x800 - 0xFFF   VC registers
> + *
> + * Page 1           Notify page
> + *  0x000 - 0x7FF   HW interrupt triggers (PSI, PHB)
> + *  0x800 - 0xFFF   forwards and syncs
> + *
> + * Page 2           LSI Trigger page (writes only) (not modeled)
> + * Page 3           LSI SB EOI page (reads only) (not modeled)
> + *
> + * Page 4-7         indirect TIMA (aliased to TIMA region)
> + */
> +static void pnv_xive_ic_write(void *opaque, hwaddr addr,
> +                              uint64_t val, unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    xive_error(xive, "IC: invalid write @%"HWADDR_PRIx, addr);
> +}
> +
> +static uint64_t pnv_xive_ic_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    xive_error(xive, "IC: invalid read @%"HWADDR_PRIx, addr);
> +    return -1;
> +}
> +
> +static const MemoryRegionOps pnv_xive_ic_ops = {
> +    .read = pnv_xive_ic_read,
> +    .write = pnv_xive_ic_write,

Erm.. it's not clear to me what this achieves, since the read/write
accessors just error every time.

> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +/*
> + * Interrupt controller XSCOM region. Load accesses are nearly all
> + * done through the MMIO region.
> + */
> +static uint64_t pnv_xive_xscom_read(void *opaque, hwaddr addr, unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    switch (addr >> 3) {
> +    case X_VC_EQC_CONFIG:
> +        /*
> +         * This is the only XSCOM load done in skiboot. Bizarre. To be
> +         * checked.
> +         */
> +        return VC_EQC_SYNC_MASK;
> +    default:
> +        return pnv_xive_ic_reg_read(xive, addr, false);
> +    }
> +}
> +
> +static void pnv_xive_xscom_write(void *opaque, hwaddr addr,
> +                                uint64_t val, unsigned size)
> +{
> +    pnv_xive_ic_reg_write(opaque, addr, val, false);
> +}
> +
> +static const MemoryRegionOps pnv_xive_xscom_ops = {
> +    .read = pnv_xive_xscom_read,
> +    .write = pnv_xive_xscom_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    }
> +};
> +
> +/*
> + * Virtualization Controller MMIO region containing the IPI and END ESB pages
> + */
> +static uint64_t pnv_xive_vc_read(void *opaque, hwaddr offset,
> +                                 unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +    uint64_t edt_index = offset >> xive->edt_shift;
> +    uint64_t edt_type = 0;
> +    uint64_t ret = -1;
> +    uint64_t edt_offset;
> +    MemTxResult result;
> +    AddressSpace *edt_as = NULL;
> +
> +    if (edt_index < XIVE_XLATE_EDT_MAX) {
> +        edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[edt_index]);
> +    }
> +
> +    switch (edt_type) {
> +    case CQ_TDR_EDT_IPI:
> +        edt_as = &xive->ipi_as;
> +        break;
> +    case CQ_TDR_EDT_EQ:
> +        edt_as = &xive->end_as;
> +        break;
> +    default:
> +        xive_error(xive, "VC: invalid read @%"HWADDR_PRIx, offset);
> +        return -1;
> +    }
> +
> +    /* remap the offset for the targeted address space */
> +    edt_offset = pnv_xive_set_xlate_edt_offset(xive, offset, edt_type);
> +
> +    ret = address_space_ldq(edt_as, edt_offset, MEMTXATTRS_UNSPECIFIED,
> +                            &result);

I think there needs to be a byteswap here somewhere.  This is loading
a value from a BE table, AFAICT...

> +    if (result != MEMTX_OK) {
> +        xive_error(xive, "VC: %s read failed at @0x%"HWADDR_PRIx " -> @0x%"
> +                   HWADDR_PRIx, edt_type == CQ_TDR_EDT_IPI ? "IPI" : "END",
> +                   offset, edt_offset);
> +        return -1;
> +    }

... but these helpers are expected to return host-native values.
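
To make the concern concrete, here is a minimal standalone sketch of
turning an 8-byte big-endian table entry into a host-native value. It
is not QEMU code: load_be64() is a hypothetical helper written only
for illustration, and whether the swap belongs in this accessor or in
the memory core depends on how the region is declared.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not a QEMU API): build a host-native 64-bit value
 * from 8 bytes stored in big-endian order. Shifting byte by byte gives
 * the same result on little-endian and big-endian hosts.
 */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    size_t i;

    assert(p != NULL);
    for (i = 0; i < 8; i++) {
        v = (v << 8) | p[i];
    }
    return v;
}
```

A raw 8-byte copy of the same bytes would only give this value on a
big-endian host, which is why a table kept in BE order needs an
explicit conversion somewhere on the load path.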

> +    return ret;
> +}
> +
> +static void pnv_xive_vc_write(void *opaque, hwaddr offset,
> +                              uint64_t val, unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +    uint64_t edt_index = offset >> xive->edt_shift;
> +    uint64_t edt_type = 0;
> +    uint64_t edt_offset;
> +    MemTxResult result;
> +    AddressSpace *edt_as = NULL;
> +
> +    if (edt_index < XIVE_XLATE_EDT_MAX) {
> +        edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[edt_index]);
> +    }
> +
> +    switch (edt_type) {
> +    case CQ_TDR_EDT_IPI:
> +        edt_as = &xive->ipi_as;
> +        break;
> +    case CQ_TDR_EDT_EQ:
> +        edt_as = &xive->end_as;
> +        break;
> +    default:
> +        xive_error(xive, "VC: invalid read @%"HWADDR_PRIx, offset);
> +        return;
> +    }
> +
> +    /* remap the offset for the targeted address space */
> +    edt_offset = pnv_xive_set_xlate_edt_offset(xive, offset, edt_type);
> +
> +    address_space_stq(edt_as, edt_offset, val, MEMTXATTRS_UNSPECIFIED, &result);
> +    if (result != MEMTX_OK) {
> +        xive_error(xive, "VC: write failed at @0x%"HWADDR_PRIx, edt_offset);
> +    }
> +}
> +
> +static const MemoryRegionOps pnv_xive_vc_ops = {
> +    .read = pnv_xive_vc_read,
> +    .write = pnv_xive_vc_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 8,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +/*
> + * Presenter Controller MMIO region. This is used by the Virtualization
> + * Controller to update the IPB in the NVT table when required. Not
> + * implemented.
> + */
> +static uint64_t pnv_xive_pc_read(void *opaque, hwaddr addr,
> +                                 unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    xive_error(xive, "PC: invalid read @%"HWADDR_PRIx, addr);
> +    return -1;
> +}
> +
> +static void pnv_xive_pc_write(void *opaque, hwaddr addr,
> +                              uint64_t value, unsigned size)
> +{
> +    PnvXive *xive = PNV_XIVE(opaque);
> +
> +    xive_error(xive, "PC: invalid write to VC @%"HWADDR_PRIx, addr);
> +}
> +
> +static const MemoryRegionOps pnv_xive_pc_ops = {
> +    .read = pnv_xive_pc_read,
> +    .write = pnv_xive_pc_write,
> +    .endianness = DEVICE_BIG_ENDIAN,
> +    .valid = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +    .impl = {
> +        .min_access_size = 1,
> +        .max_access_size = 8,
> +    },
> +};
> +
> +void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon)
> +{
> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
> +    XiveEAS eas;
> +    XiveEND end;
> +    uint32_t endno = 0;
> +    uint32_t srcno0 = XIVE_SRCNO(xive->chip_id, 0);
> +    uint32_t srcno = srcno0;
> +
> +    monitor_printf(mon, "XIVE[%x] Source %08x .. %08x\n", xive->chip_id,
> +                  srcno0, srcno0 + xive->source.nr_irqs - 1);
> +    xive_source_pic_print_info(&xive->source, srcno0, mon);
> +
> +    monitor_printf(mon, "XIVE[%x] EAT %08x .. %08x\n", xive->chip_id,
> +                   srcno0, srcno0 + xive->nr_irqs - 1);
> +    while (!xive_router_get_eas(xrtr, srcno, &eas)) {
> +        if (!(eas.w & EAS_MASKED)) {
> +            xive_eas_pic_print_info(&eas, srcno, mon);
> +        }
> +        srcno++;
> +    }
> +
> +    monitor_printf(mon, "XIVE[%x] ENDT %08x .. %08x\n", xive->chip_id,
> +                   0, xive->nr_ends - 1);
> +    while (!xive_router_get_end(xrtr, xrtr->chip_id, endno, &end)) {
> +        xive_end_pic_print_info(&end, endno++, mon);
> +    }
> +}
> +
> +static void pnv_xive_reset(DeviceState *dev)
> +{
> +    PnvXive *xive = PNV_XIVE(dev);
> +    PnvChip *chip = PNV_CHIP(object_property_get_link(OBJECT(dev), "chip",
> +                                                      &error_fatal));
> +
> +    /*
> +     * Use the chip id to identify the XIVE interrupt controller. It
> +     * can be overridden by configuration at runtime.
> +     */
> +    xive->chip_id = xive->thread_chip_id = chip->chip_id;

You shouldn't need to touch this at reset, only at init/realize.
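
As a rough illustration of that split, here is a toy sketch assuming
the chip id really is fixed for the lifetime of the device. ToyXive
and its functions are hypothetical names, not the QEMU device model.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_DEFAULT_PAGE_SHIFT 12

/* Toy device; not QEMU's PnvXive. */
typedef struct ToyXive {
    uint32_t chip_id;     /* identity, set once when the device is realized */
    uint32_t page_shift;  /* runtime state the guest may reconfigure */
} ToyXive;

/* Realize: establish identity and the initial runtime state. */
static void toy_xive_realize(ToyXive *x, uint32_t chip_id)
{
    assert(x != NULL);
    x->chip_id = chip_id;
    x->page_shift = TOY_DEFAULT_PAGE_SHIFT;
}

/* Reset: restore runtime state only, leave the identity alone. */
static void toy_xive_reset(ToyXive *x)
{
    assert(x != NULL);
    x->page_shift = TOY_DEFAULT_PAGE_SHIFT;
}
```

In this model a reset after the guest has switched to 64k pages puts
page_shift back to 12 while chip_id keeps the value given at realize.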

> +    /* Default page size. Should be changed at runtime to 64k */
> +    xive->ic_shift = xive->vc_shift = xive->pc_shift = 12;
> +
> +    /*
> +     * PowerNV XIVE sources are realized at runtime when the set
> +     * translation tables are configured.

Yeah.. that seems unlikely to be a good idea.

> +     */
> +    if (DEVICE(&xive->source)->realized) {
> +        object_property_set_bool(OBJECT(&xive->source), false, "realized",
> +                                 &error_fatal);
> +    }
> +
> +    if (DEVICE(&xive->end_source)->realized) {
> +        object_property_set_bool(OBJECT(&xive->end_source), false, "realized",
> +                                 &error_fatal);
> +    }
> +}
> +
> +/*
> + * The VC sub-engine incorporates a source controller for the IPIs.
> + * When triggered, we need to construct a source number with the
> + * chip/block identifier
> + */
> +static void pnv_xive_notify(XiveFabric *xf, uint32_t srcno)
> +{
> +    PnvXive *xive = PNV_XIVE(xf);
> +
> +    xive_router_notify(xf, XIVE_SRCNO(xive->chip_id, srcno));
> +}
> +
> +static void pnv_xive_init(Object *obj)
> +{
> +    PnvXive *xive = PNV_XIVE(obj);
> +
> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
> +
> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
> +                      TYPE_XIVE_END_SOURCE);
> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
> +                              NULL);
> +}
> +
> +static void pnv_xive_realize(DeviceState *dev, Error **errp)
> +{
> +    PnvXive *xive = PNV_XIVE(dev);
> +
> +    /* Default page size. Generally changed at runtime to 64k */
> +    xive->ic_shift = xive->vc_shift = xive->pc_shift = 12;
> +
> +    /* XSCOM region, used for initial configuration of the BARs */
> +    memory_region_init_io(&xive->xscom_regs, OBJECT(dev), &pnv_xive_xscom_ops,
> +                          xive, "xscom-xive", PNV9_XSCOM_XIVE_SIZE << 3);
> +
> +    /* Interrupt controller MMIO region */
> +    memory_region_init_io(&xive->ic_mmio, OBJECT(dev), &pnv_xive_ic_ops, xive,
> +                          "xive.ic", PNV9_XIVE_IC_SIZE);
> +    memory_region_init_io(&xive->ic_reg_mmio, OBJECT(dev), &pnv_xive_ic_reg_ops,
> +                          xive, "xive.ic.reg", 1 << xive->ic_shift);
> +    memory_region_init_io(&xive->ic_notify_mmio, OBJECT(dev),
> +                          &pnv_xive_ic_notify_ops,
> +                          xive, "xive.ic.notify", 1 << xive->ic_shift);
> +
> +    /* The Pervasive LSI trigger and EOI pages are not modeled */
> +
> +    /*
> +     * Overall Virtualization Controller MMIO region containing the
> +     * IPI ESB pages and END ESB pages. The layout is defined by the
> +     * EDT set translation table and the accesses are dispatched using
> +     * address spaces for each.
> +     */
> +    memory_region_init_io(&xive->vc_mmio, OBJECT(xive), &pnv_xive_vc_ops, xive,
> +                          "xive.vc", PNV9_XIVE_VC_SIZE);
> +
> +    memory_region_init(&xive->ipi_mmio, OBJECT(xive), "xive.vc.ipi",
> +                       PNV9_XIVE_VC_SIZE);
> +    address_space_init(&xive->ipi_as, &xive->ipi_mmio, "xive.vc.ipi");
> +    memory_region_init(&xive->end_mmio, OBJECT(xive), "xive.vc.end",
> +                       PNV9_XIVE_VC_SIZE);
> +    address_space_init(&xive->end_as, &xive->end_mmio, "xive.vc.end");
> +
> +
> +    /* Presenter Controller MMIO region (not implemented) */
> +    memory_region_init_io(&xive->pc_mmio, OBJECT(xive), &pnv_xive_pc_ops, xive,
> +                          "xive.pc", PNV9_XIVE_PC_SIZE);
> +
> +    /* Thread Interrupt Management Area, direct and indirect */
> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops,
> +                          &xive->cpu_ind, "xive.tima", PNV9_XIVE_TM_SIZE);
> +    memory_region_init_alias(&xive->tm_mmio_indirect, OBJECT(xive),
> +                             "xive.tima.indirect",
> +                             &xive->tm_mmio, 0, PNV9_XIVE_TM_SIZE);

I'm not quite sure how aliasing to the TIMA can work.  AIUI the TIMA
via its normal access magically tests the requesting CPU to work out
which TCTX it should manipulate.  Isn't the idea of the indirect
access to access some other thread's TIMA for debugging?  In that case
you need to override that thread id somehow.
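
A toy model of the distinction, under the assumption that the indirect
window has its own notion of a selected target thread. All names here
(ToyTima, toy_tima_resolve, indirect_target) are hypothetical and only
stand in for the real TCTX lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_NR_THREADS 4

/* Toy stand-in for a per-thread interrupt context; not QEMU's XiveTCTX. */
typedef struct ToyTctx {
    uint8_t cppr;
} ToyTctx;

typedef struct ToyTima {
    ToyTctx tctx[TOY_NR_THREADS];
    int indirect_target;    /* thread selected for indirect access, -1 if none */
} ToyTima;

/*
 * Resolve which context an access touches. A direct access always uses
 * the accessing thread; only an indirect access honours the override.
 * Returns NULL when an indirect access is made with no target selected.
 */
static ToyTctx *toy_tima_resolve(ToyTima *tima, unsigned accessor, bool indirect)
{
    assert(tima != NULL && accessor < TOY_NR_THREADS);

    if (!indirect) {
        return &tima->tctx[accessor];
    }
    if (tima->indirect_target < 0 || tima->indirect_target >= TOY_NR_THREADS) {
        return NULL;
    }
    return &tima->tctx[tima->indirect_target];
}
```

The point of keeping the two paths apart is that a plain alias cannot
tell them apart: both would reach the same handler with the same
opaque pointer, so an override set for the indirect window would also
change what the direct window sees.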

> +}
> +
> +static int pnv_xive_dt_xscom(PnvXScomInterface *dev, void *fdt,
> +                             int xscom_offset)
> +{
> +    const char compat[] = "ibm,power9-xive-x";
> +    char *name;
> +    int offset;
> +    uint32_t lpc_pcba = PNV9_XSCOM_XIVE_BASE;
> +    uint32_t reg[] = {
> +        cpu_to_be32(lpc_pcba),
> +        cpu_to_be32(PNV9_XSCOM_XIVE_SIZE)
> +    };
> +
> +    name = g_strdup_printf("xive@%x", lpc_pcba);
> +    offset = fdt_add_subnode(fdt, xscom_offset, name);
> +    _FDT(offset);
> +    g_free(name);
> +
> +    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
> +    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
> +                      sizeof(compat))));
> +    return 0;
> +}
> +
> +static Property pnv_xive_properties[] = {
> +    DEFINE_PROP_UINT64("ic-bar", PnvXive, ic_base, 0),
> +    DEFINE_PROP_UINT64("vc-bar", PnvXive, vc_base, 0),
> +    DEFINE_PROP_UINT64("pc-bar", PnvXive, pc_base, 0),
> +    DEFINE_PROP_UINT64("tm-bar", PnvXive, tm_base, 0),
> +    DEFINE_PROP_END_OF_LIST(),
> +};
> +
> +static void pnv_xive_class_init(ObjectClass *klass, void *data)
> +{
> +    DeviceClass *dc = DEVICE_CLASS(klass);
> +    PnvXScomInterfaceClass *xdc = PNV_XSCOM_INTERFACE_CLASS(klass);
> +    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
> +
> +    xdc->dt_xscom = pnv_xive_dt_xscom;
> +
> +    dc->desc = "PowerNV XIVE Interrupt Controller";
> +    dc->realize = pnv_xive_realize;
> +    dc->props = pnv_xive_properties;
> +    dc->reset = pnv_xive_reset;
> +
> +    xrc->get_eas = pnv_xive_get_eas;
> +    xrc->set_eas = pnv_xive_set_eas;
> +    xrc->get_end = pnv_xive_get_end;
> +    xrc->set_end = pnv_xive_set_end;
> +    xrc->get_nvt  = pnv_xive_get_nvt;
> +    xrc->set_nvt  = pnv_xive_set_nvt;
> +
> +    xfc->notify  = pnv_xive_notify;
> +};
> +
> +static const TypeInfo pnv_xive_info = {
> +    .name          = TYPE_PNV_XIVE,
> +    .parent        = TYPE_XIVE_ROUTER,
> +    .instance_init = pnv_xive_init,
> +    .instance_size = sizeof(PnvXive),
> +    .class_init    = pnv_xive_class_init,
> +    .interfaces    = (InterfaceInfo[]) {
> +        { TYPE_PNV_XSCOM_INTERFACE },
> +        { }
> +    }
> +};
> +
> +static void pnv_xive_register_types(void)
> +{
> +    type_register_static(&pnv_xive_info);
> +}
> +
> +type_init(pnv_xive_register_types)
> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
> index c9aedecc8216..9925c90481ae 100644
> --- a/hw/intc/xive.c
> +++ b/hw/intc/xive.c
> @@ -51,6 +51,8 @@ static uint8_t exception_mask(uint8_t ring)
>      switch (ring) {
>      case TM_QW1_OS:
>          return TM_QW1_NSR_EO;
> +    case TM_QW3_HV_PHYS:
> +        return TM_QW3_NSR_HE;
>      default:
>          g_assert_not_reached();
>      }
> @@ -85,7 +87,17 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
>      uint8_t *regs = &tctx->regs[ring];
>  
>      if (regs[TM_PIPR] < regs[TM_CPPR]) {
> -        regs[TM_NSR] |= exception_mask(ring);
> +        switch (ring) {
> +        case TM_QW1_OS:
> +            regs[TM_NSR] |= TM_QW1_NSR_EO;
> +            break;
> +        case TM_QW3_HV_PHYS:
> +            regs[TM_NSR] |= SETFIELD(TM_QW3_NSR_HE, regs[TM_NSR],
> +                                     TM_QW3_NSR_HE_PHYS);
> +            break;
> +        default:
> +            g_assert_not_reached();
> +        }
>          qemu_irq_raise(tctx->output);
>      }
>  }
> @@ -116,6 +128,38 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>  #define XIVE_TM_OS_PAGE   0x2
>  #define XIVE_TM_USER_PAGE 0x3
>  
> +static void xive_tm_set_hv_cppr(XiveTCTX *tctx, hwaddr offset,
> +                                uint64_t value, unsigned size)
> +{
> +    xive_tctx_set_cppr(tctx, TM_QW3_HV_PHYS, value & 0xff);
> +}
> +
> +static uint64_t xive_tm_ack_hv_reg(XiveTCTX *tctx, hwaddr offset, unsigned size)
> +{
> +    return xive_tctx_accept(tctx, TM_QW3_HV_PHYS);
> +}
> +
> +static uint64_t xive_tm_pull_pool_ctx(XiveTCTX *tctx, hwaddr offset,
> +                                      unsigned size)
> +{
> +    uint64_t ret;
> +
> +    ret = tctx->regs[TM_QW2_HV_POOL + TM_WORD2] & TM_QW2W2_POOL_CAM;
> +    tctx->regs[TM_QW2_HV_POOL + TM_WORD2] &= ~TM_QW2W2_POOL_CAM;
> +    return ret;
> +}
> +
> +static void xive_tm_vt_push(XiveTCTX *tctx, hwaddr offset,
> +                            uint64_t value, unsigned size)
> +{
> +    tctx->regs[TM_QW3_HV_PHYS + TM_WORD2] = value & 0xff;
> +}
> +
> +static uint64_t xive_tm_vt_poll(XiveTCTX *tctx, hwaddr offset, unsigned size)
> +{
> +    return tctx->regs[TM_QW3_HV_PHYS + TM_WORD2] & 0xff;
> +}
> +
>  /*
>   * Define an access map for each page of the TIMA that we will use in
>   * the memory region ops to filter values when doing loads and stores
> @@ -295,10 +339,16 @@ static const XiveTmOp xive_tm_operations[] = {
>       * effects
>       */
>      { XIVE_TM_OS_PAGE, TM_QW1_OS + TM_CPPR,   1, xive_tm_set_os_cppr, NULL },
> +    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_CPPR, 1, xive_tm_set_hv_cppr, NULL },
> +    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_WORD2, 1, xive_tm_vt_push, NULL },
> +    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_WORD2, 1, NULL, xive_tm_vt_poll },
>  
>      /* MMIOs above 2K : special operations with side effects */
>      { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
>      { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
> +    { XIVE_TM_HV_PAGE, TM_SPC_ACK_HV_REG,     2, NULL, xive_tm_ack_hv_reg },
> +    { XIVE_TM_HV_PAGE, TM_SPC_PULL_POOL_CTX,  4, NULL, xive_tm_pull_pool_ctx },
> +    { XIVE_TM_HV_PAGE, TM_SPC_PULL_POOL_CTX,  8, NULL, xive_tm_pull_pool_ctx },
>  };
>  
>  static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
> @@ -327,7 +377,8 @@ static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
>  static void xive_tm_write(void *opaque, hwaddr offset,
>                            uint64_t value, unsigned size)
>  {
> -    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +    PowerPCCPU **cpuptr = opaque;
> +    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>      XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>      const XiveTmOp *xto;
>  
> @@ -366,7 +417,8 @@ static void xive_tm_write(void *opaque, hwaddr offset,
>  
>  static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
>  {
> -    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
> +    PowerPCCPU **cpuptr = opaque;
> +    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>      XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>      const XiveTmOp *xto;
>  
> @@ -501,6 +553,9 @@ static void xive_tctx_base_reset(void *dev)
>       */
>      tctx->regs[TM_QW1_OS + TM_PIPR] =
>          ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
> +    tctx->regs[TM_QW3_HV_PHYS + TM_PIPR] =
> +        ipb_to_pipr(tctx->regs[TM_QW3_HV_PHYS + TM_IPB]);
> +
>  
>      /*
>       * QEMU sPAPR XIVE only. To let the controller model reset the OS
> @@ -1513,7 +1568,7 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>      /* TODO: Auto EOI. */
>  }
>  
> -static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
> +void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>  {
>      XiveRouter *xrtr = XIVE_ROUTER(xf);
>      XiveEAS eas;
> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
> index 66f2301b4ece..7b0bda652338 100644
> --- a/hw/ppc/pnv.c
> +++ b/hw/ppc/pnv.c
> @@ -279,7 +279,10 @@ static void pnv_dt_chip(PnvChip *chip, void *fdt)
>          pnv_dt_core(chip, pnv_core, fdt);
>  
>          /* Interrupt Control Presenters (ICP). One per core. */
> -        pnv_dt_icp(chip, fdt, pnv_core->pir, CPU_CORE(pnv_core)->nr_threads);
> +        if (!pnv_chip_is_power9(chip)) {
> +            pnv_dt_icp(chip, fdt, pnv_core->pir,
> +                       CPU_CORE(pnv_core)->nr_threads);
> +        }
>      }
>  
>      if (chip->ram_size) {
> @@ -693,7 +696,15 @@ static uint32_t pnv_chip_core_pir_p9(PnvChip *chip, uint32_t core_id)
>  static Object *pnv_chip_power9_intc_create(PnvChip *chip, Object *child,
>                                             Error **errp)
>  {
> -    return NULL;
> +    Pnv9Chip *chip9 = PNV9_CHIP(chip);
> +
> +    /*
> +     * The core creates its interrupt presenter but the XIVE interrupt
> +     * controller object is initialized afterwards. Hopefully, it's
> +     * only used at runtime.
> +     */
> +    return xive_tctx_create(child, TYPE_XIVE_TCTX,
> +                            XIVE_ROUTER(&chip9->xive), errp);
>  }
>  
>  /* Allowed core identifiers on a POWER8 Processor Chip :
> @@ -875,11 +886,19 @@ static void pnv_chip_power8nvl_class_init(ObjectClass *klass, void *data)
>  
>  static void pnv_chip_power9_instance_init(Object *obj)
>  {
> +    Pnv9Chip *chip9 = PNV9_CHIP(obj);
> +
> +    object_initialize(&chip9->xive, sizeof(chip9->xive), TYPE_PNV_XIVE);
> +    object_property_add_child(obj, "xive", OBJECT(&chip9->xive), NULL);
> +    object_property_add_const_link(OBJECT(&chip9->xive), "chip", obj,
> +                                   &error_abort);
>  }
>  
>  static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>  {
>      PnvChipClass *pcc = PNV_CHIP_GET_CLASS(dev);
> +    Pnv9Chip *chip9 = PNV9_CHIP(dev);
> +    PnvChip *chip = PNV_CHIP(dev);
>      Error *local_err = NULL;
>  
>      pcc->parent_realize(dev, &local_err);
> @@ -887,6 +906,24 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>          error_propagate(errp, local_err);
>          return;
>      }
> +
> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_IC_BASE(chip),
> +                            "ic-bar", &error_fatal);
> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_VC_BASE(chip),
> +                            "vc-bar", &error_fatal);
> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_PC_BASE(chip),
> +                            "pc-bar", &error_fatal);
> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_TM_BASE(chip),
> +                            "tm-bar", &error_fatal);
> +    object_property_set_bool(OBJECT(&chip9->xive), true, "realized",
> +                             &local_err);
> +    if (local_err) {
> +        error_propagate(errp, local_err);
> +        return;
> +    }
> +    qdev_set_parent_bus(DEVICE(&chip9->xive), sysbus_get_default());
> +    pnv_xscom_add_subregion(chip, PNV9_XSCOM_XIVE_BASE,
> +                            &chip9->xive.xscom_regs);
>  }
>  
>  static void pnv_chip_power9_class_init(ObjectClass *klass, void *data)
> @@ -1087,12 +1124,23 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
>      CPU_FOREACH(cs) {
>          PowerPCCPU *cpu = POWERPC_CPU(cs);
>  
> -        icp_pic_print_info(ICP(cpu->intc), mon);
> +        if (pnv_chip_is_power9(pnv->chips[0])) {
> +            xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
> +        } else {
> +            icp_pic_print_info(ICP(cpu->intc), mon);
> +        }
>      }
>  
>      for (i = 0; i < pnv->num_chips; i++) {
> -        Pnv8Chip *chip8 = PNV8_CHIP(pnv->chips[i]);
> -        ics_pic_print_info(&chip8->psi.ics, mon);
> +        PnvChip *chip = pnv->chips[i];
> +
> +        if (pnv_chip_is_power9(pnv->chips[i])) {
> +            Pnv9Chip *chip9 = PNV9_CHIP(chip);
> +            pnv_xive_pic_print_info(&chip9->xive, mon);
> +        } else {
> +            Pnv8Chip *chip8 = PNV8_CHIP(chip);
> +            ics_pic_print_info(&chip8->psi.ics, mon);
> +        }
>      }
>  }
>  
> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
> index dd4d69db2bdd..145bfaf44014 100644
> --- a/hw/intc/Makefile.objs
> +++ b/hw/intc/Makefile.objs
> @@ -40,7 +40,7 @@ obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>  obj-$(CONFIG_XIVE) += xive.o
>  obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
>  obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
> -obj-$(CONFIG_POWERNV) += xics_pnv.o
> +obj-$(CONFIG_POWERNV) += xics_pnv.o pnv_xive.o
>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>  obj-$(CONFIG_S390_FLIC_KVM) += s390_flic_kvm.o

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-11-29 16:36     ` Cédric Le Goater
@ 2018-12-03 15:52       ` Cédric Le Goater
  2018-12-04  1:59         ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-03 15:52 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/29/18 5:36 PM, Cédric Le Goater wrote:
> On 11/29/18 5:09 AM, David Gibson wrote:
>> On Fri, Nov 16, 2018 at 11:57:20AM +0100, Cédric Le Goater wrote:
>>> This will be used to remove the MMIO regions of the POWER9 XIVE
>>> interrupt controller when the sPAPR machine is reset.
>>>
>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>
>> Since the code looks sane.
>>
>> However, I think using memory_region_set_enabled() would be a better
>> idea for our purposes than actually adding/deleting the subregion.
> 
> Yes and we might not need this one anymore. 

As we are destroying the KVM device, we also need to remove the mmap
in QEMU, else we will have a VMA with a page fault handler pointing
to a bogus KVM device. That means destroying the memory region, so
we cannot use memory_region_set_enabled().

Anyhow mapping/unmapping works well.

Thanks,

C.

* Re: [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers
  2018-12-03  1:14           ` David Gibson
@ 2018-12-03 16:19             ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-03 16:19 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 12/3/18 2:14 AM, David Gibson wrote:
> On Fri, Nov 30, 2018 at 07:41:33AM +0100, Cédric Le Goater wrote:
>> On 11/30/18 2:04 AM, David Gibson wrote:
>>> On Thu, Nov 29, 2018 at 11:06:13PM +0100, Cédric Le Goater wrote:
>>>> On 11/22/18 6:13 AM, David Gibson wrote:
>>>>> On Fri, Nov 16, 2018 at 11:56:59AM +0100, Cédric Le Goater wrote:
>>>>>> The Event Notification Descriptor also contains two Event State
>>>>>> Buffers providing further coalescing of interrupts, one for the
>>>>>> notification event (ESn) and one for the escalation events (ESe). A
>>>>>> MMIO page is assigned for each to control the EOI through loads
>>>>>> only. Stores are not allowed.
>>>>>>
>>>>>> The END ESBs are modeled through an object resembling the 'XiveSource'.
>>>>>> It is stateless as the END state bits are backed into the XiveEND
>>>>>> structure under the XiveRouter and the MMIO accesses follow the same
>>>>>> rules as for the standard source ESBs.
>>>>>>
>>>>>> END ESBs are not supported by the Linux drivers, either on OPAL or on
>>>>>> sPAPR. Nevertheless, this provides a means to study the question in the
>>>>>> future and validates the XIVE model a bit more.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>  include/hw/ppc/xive.h |  20 ++++++
>>>>>>  hw/intc/xive.c        | 160 +++++++++++++++++++++++++++++++++++++++++-
>>>>>>  2 files changed, 178 insertions(+), 2 deletions(-)
>>>>>>
>>>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>>>> index ce62aaf28343..24301bf2076d 100644
>>>>>> --- a/include/hw/ppc/xive.h
>>>>>> +++ b/include/hw/ppc/xive.h
>>>>>> @@ -208,6 +208,26 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>>>                          XiveEND *end);
>>>>>>  
>>>>>> +/*
>>>>>> + * XIVE END ESBs
>>>>>> + */
>>>>>> +
>>>>>> +#define TYPE_XIVE_END_SOURCE "xive-end-source"
>>>>>> +#define XIVE_END_SOURCE(obj) \
>>>>>> +    OBJECT_CHECK(XiveENDSource, (obj), TYPE_XIVE_END_SOURCE)
>>>>>
>>>>> Is there a particular reason to make this a full QOM object, rather
>>>>> than just embedding it in the XiveRouter?
>>>>
>>>> Coming back on this question because removing the chip_id from the
>>>> router is a problem for the END triggering. At least with the current
>>>> design. See below for the comment.
>>>>
>>>>>> +typedef struct XiveENDSource {
>>>>>> +    SysBusDevice parent;
>>>>>> +
>>>>>> +    uint32_t        nr_ends;
>>>>>> +
>>>>>> +    /* ESB memory region */
>>>>>> +    uint32_t        esb_shift;
>>>>>> +    MemoryRegion    esb_mmio;
>>>>>> +
>>>>>> +    XiveRouter      *xrtr;
>>>>>> +} XiveENDSource;
>>>>>> +
>>>>>>  /*
>>>>>>   * For legacy compatibility, the exceptions define up to 256 different
>>>>>>   * priorities. P9 implements only 9 levels : 8 active levels [0 - 7]
>>>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>>>> index 9cb001e7b540..5a8882d47a98 100644
>>>>>> --- a/hw/intc/xive.c
>>>>>> +++ b/hw/intc/xive.c
>>>>>> @@ -622,8 +622,18 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>>>>>       * even further coalescing in the Router
>>>>>>       */
>>>>>>      if (!(end.w0 & END_W0_UCOND_NOTIFY)) {
>>>>>> -        qemu_log_mask(LOG_UNIMP, "XIVE: !UCOND_NOTIFY not implemented\n");
>>>>>> -        return;
>>>>>> +        uint8_t pq = GETFIELD(END_W1_ESn, end.w1);
>>>>>> +        bool notify = xive_esb_trigger(&pq);
>>>>>> +
>>>>>> +        if (pq != GETFIELD(END_W1_ESn, end.w1)) {
>>>>>> +            end.w1 = SETFIELD(END_W1_ESn, end.w1, pq);
>>>>>> +            xive_router_set_end(xrtr, end_blk, end_idx, &end);
>>>>>> +        }
>>>>>> +
>>>>>> +        /* ESn[Q]=1 : end of notification */
>>>>>> +        if (!notify) {
>>>>>> +            return;
>>>>>> +        }
>>>>>>      }
>>>>>>  
>>>>>>      /*
>>>>>> @@ -706,6 +716,151 @@ void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon)
>>>>>>                     (uint32_t) GETFIELD(EAS_END_DATA, eas->w));
>>>>>>  }
>>>>>>  
>>>>>> +/*
>>>>>> + * END ESB MMIO loads
>>>>>> + */
>>>>>> +static uint64_t xive_end_source_read(void *opaque, hwaddr addr, unsigned size)
>>>>>> +{
>>>>>> +    XiveENDSource *xsrc = XIVE_END_SOURCE(opaque);
>>>>>> +    XiveRouter *xrtr = xsrc->xrtr;
>>>>>> +    uint32_t offset = addr & 0xFFF;
>>>>>> +    uint8_t end_blk;
>>>>>> +    uint32_t end_idx;
>>>>>> +    XiveEND end;
>>>>>> +    uint32_t end_esmask;
>>>>>> +    uint8_t pq;
>>>>>> +    uint64_t ret = -1;
>>>>>> +
>>>>>> +    end_blk = xrtr->chip_id;
>>>>>> +    end_idx = addr >> (xsrc->esb_shift + 1);
>>>>>> +    if (xive_router_get_end(xrtr, end_blk, end_idx, &end)) {
>>>>
>>>> The current END accessors require a block identifier, hence xrtr->chip_id, 
>>>> but in this case, we don't really need it because we are using the ENDT 
>>>> local to the router/chip. 
>>>
>>>> I don't know how to simply handle this case without keeping chip_id :/
>>>
>>> I don't really follow how chip_id is relevant here.  AFAICT the END
>>> accessors take a block id and the back end is responsible for
>>> interpreting them.  The powernv one will map it to chip id, but the
>>> PAPR one can just ignore it or only use block 0.
>>
>> Yes. But the block value comes from the xrtr->chip_id today, on PAPR and
>> PowerNV, even if it's block 0. 
>>
>> What I could do is add a "chip-id" property to XiveENDSource possibly.
> 
> This still seems wrong for the PAPR model. 

We don't really care for PAPR. We can use 0 but the model is common
with PowerNV.

> Why can't you configure the end_block value directly in the Xive 
> components, 

That is what I am proposing: add a "chip-id" property to XiveENDSource.

> then just set it equal to the chip_id when you build the powernv 
> machine?

Yes, that's more or less how it's built. It's a little more complex
on PowerNV because there are low level settings on how the chip_id is
used in the PC and VC, but the modeling is minimal.

C.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier
  2018-12-03  1:18               ` David Gibson
@ 2018-12-03 16:30                 ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-03 16:30 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

>>>>> I'm thinking about the END->NVT stage of the process here, rather than
>>>>> the NVT->TCTX stage.
>>>>>
>>>>> Oh, also, you're using "VP" here which IIUC == "NVT".  Can we
>>>>> standardize on one, please.
>>>>
>>>> VP is used in Linux/KVM, Linux/Native and skiboot. Yes, it's a mess.
>>>> Let's have consistent naming in QEMU and use NVT. 
>>>
>>> Right.  And to cover any inevitable missed ones is why I'd like to see
>>> a cheatsheet giving both terms in the header comments somewhere.
>>
>> yes. I have added a list of names in xive.h. 
> 
> Great.  Oh BTW - this is getting big enough, that I wonder if it makes
> sense to create a hw/intc/xive subdir to put things in, then splitting
> IVSE, IVRE, IVPE related code into separate .c files (I'd still expect
> a common .h though).

Here is what I have for now, for the new files only :

   190 include/hw/ppc/xive_regs.h
   343 include/hw/ppc/xive.h
    78 include/hw/ppc/spapr_xive.h
  1453 hw/intc/spapr_xive.c (sPAPRXive + hcalls)
  1678 hw/intc/xive.c       (XiveSource, XiveRouter, XiveENDSource, XiveTCTX, 
			     END helpers)
   864 hw/intc/spapr_xive_kvm.c
  4606 total

I am putting the KVM export definitions in spapr_xive.h and xive.h but all 
the code is in spapr_xive_kvm.c.

So you would rather have something like :
 
   include/hw/ppc/xive_regs.h
   include/hw/ppc/xive.h
   include/hw/ppc/spapr_xive.h

   hw/intc/xive/spapr.c 
   hw/intc/xive/spapr_kvm.c 
   hw/intc/xive/source.c 
   hw/intc/xive/router.c 
   hw/intc/xive/presenter.c 
   hw/intc/xive/tcontext.c 

>> I was wondering if I should put the diagram below somewhere in a .h file 
>> or under doc/specs/.
> 
> I'd prefer it in the .h file.

OK. will do in  xive.h
 
>>
>> Thanks,
>>
>> C.  
>>
>>
>> = XIVE =================================================================
>>
>> The POWER9 processor comes with a new interrupt controller, called
>> XIVE, for "eXternal Interrupt Virtualization Engine".
>>
>>
>> * Overall architecture
>>
>>
>>              XIVE Interrupt Controller
>>              +------------------------------------+      IPIs
>>              | +---------+ +---------+ +--------+ |    +-------+
>>              | |VC       | |CQ       | |PC      |----> | CORES |
>>              | |     esb | |         | |        |----> |       |
>>              | |     eas | |  Bridge | |   tctx |----> |       |
>>              | |SC   end | |         | |    nvt | |    |       |
>>  +------+    | +---------+ +----+----+ +--------+ |    +-+-+-+-+
>>  | RAM  |    +------------------|-----------------+      | | |
>>  |      |                       |                        | | |
>>  |      |                       |                        | | |
>>  |      |  +--------------------v------------------------v-v-v--+    other
>>  |      <--+                     Power Bus                      +--> chips
>>  |  esb |  +---------+-----------------------+------------------+
>>  |  eas |            |                       |
>>  |  end |        +---|-----+                 |
>>  |  nvt |       +----+----+|            +----+----+
>>  +------+       |SC       ||            |SC       |
>>                 |         ||            |         |
>>                 | PQ-bits ||            | PQ-bits |
>>                 | local   |+            |  in VC  |
>>                 +---------+             +---------+
>>                    PCIe                 NX,NPU,CAPI
>>
>>                   SC: Source Controller (aka. IVSE)
>>                   VC: Virtualization Controller (aka. IVRE)
>>                   PC: Presentation Controller (aka. IVPE)
>>                   CQ: Common Queue (Bridge)
>>
>>              PQ-bits: 2 bits source state machine (P:pending Q:queued)
>>                  esb: Event State Buffer (Array of PQ bits in an IVSE)
>>                  eas: Event Assignment Structure
>>                  end: Event Notification Descriptor
>>                  nvt: Notification Virtual Target
>>                 tctx: Thread interrupt Context
>>
>>
>> The XIVE IC is composed of three sub-engines :
>>
>>   - Interrupt Virtualization Source Engine (IVSE), or Source
>>     Controller (SC). These are found in PCI PHBs, in the PSI host
>>     bridge controller, but also inside the main controller for the
>>     core IPIs and other sub-chips (NX, CAPI, NPU) of the
>>     chip/processor. They are configured to feed the IVRE with events.
>>
>>   - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
>>     Controller (VC). Its job is to match an event source with an Event
>>     Notification Descriptor (END).
>>
>>   - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
>>     Controller (PC). It maintains the interrupt context state of each
>>     thread and handles the delivery of the external exception to the
>>     thread.
>>
>>
>> * XIVE internal tables
>>
>> Each of the sub-engines uses a set of tables to redirect exceptions
>> from event sources to CPU threads.
>>
>>                                           +-------+
>>   User or OS                              |  EQ   |
>>       or                          +------>|entries|
>>   Hypervisor                      |       |  ..   |
>>     Memory                        |       +-------+
>>                                   |           ^
>>                                   |           |
>>              +-------------------------------------------------+
>>                                   |           |
>>   Hypervisor      +------+    +---+--+    +---+--+   +------+
>>     Memory        | ESB  |    | EAT  |    | ENDT |   | NVTT |
>>    (skiboot)      +----+-+    +----+-+    +----+-+   +------+
>>                     ^  |        ^  |        ^  |       ^
>>                     |  |        |  |        |  |       |
>>              +-------------------------------------------------+
>>                     |  |        |  |        |  |       |
>>                     |  |        |  |        |  |       |
>>                +----|--|--------|--|--------|--|-+   +-|-----+    +------+
>>                |    |  |        |  |        |  | |   | | tctx|    |Thread|
>>    IPI or   ---+    +  v        +  v        +  v |---| +  .. |----->     |
>>   HW events    |                                 |   |       |    |      |
>>                |             IVRE                |   | IVPE  |    +------+
>>                +---------------------------------+   +-------+
>>             
>>
>>
>> Each IVSE has a 2-bit state machine per source, P for pending and Q
>> for queued, that allows events to be triggered. The bits are stored
>> in an array, the Event State Buffer (ESB), and controlled by MMIOs.
>>
>> If the event is let through, the IVRE looks in the Event Assignment
>> Structure (EAS) table for an Event Notification Descriptor (END)
>> configured for the source. Each Event Notification Descriptor defines
>> a notification path to a CPU and an in-memory Event Queue, into which
>> an EQ data entry is pushed for the OS to pull.
>>
>> The IVPE determines if a Notification Virtual Target (NVT) can handle
>> the event by scanning the thread contexts of the VPs dispatched on the
>> processor HW threads. It maintains the interrupt context state of each
>> thread in an NVT table.
>>
> 
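
For illustration, the 2-bit PQ state machine described in the text above
can be sketched as below. This is a minimal standalone sketch, not the
model's code: the esb_trigger() name and the ESB_* encodings (P as the
high bit, Q as the low bit) are assumptions made for the example. A
trigger on a source in the reset state lets the event through; any
further trigger is coalesced into the Q bit.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encodings: P is the high bit, Q is the low bit */
enum {
    ESB_RESET   = 0x0, /* P=0 Q=0: ready to take an event        */
    ESB_OFF     = 0x1, /* P=0 Q=1: source is masked              */
    ESB_PENDING = 0x2, /* P=1 Q=0: one event forwarded, no EOI   */
    ESB_QUEUED  = 0x3, /* P=1 Q=1: more events coalesced meanwhile */
};

/*
 * Hypothetical trigger helper: updates the PQ bits in place and
 * returns true when the event should be forwarded to the router.
 */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce, do not notify again */
        return false;
    default:                /* ESB_OFF: masked, stays off */
        *pq = ESB_OFF;
        return false;
    }
}
```

The same helper shape applies to the END ESBs (ESn/ESe), since they
follow the same rules as the source ESBs.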

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-12-03  1:36               ` David Gibson
@ 2018-12-03 16:49                 ` Cédric Le Goater
  2018-12-04  1:56                   ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-03 16:49 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

>>>>>>>> +    }
>>>>>>>> +
>>>>>>>> +    switch (qsize) {
>>>>>>>> +    case 12:
>>>>>>>> +    case 16:
>>>>>>>> +    case 21:
>>>>>>>> +    case 24:
>>>>>>>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
>>>>>>>
>>>>>>> It just occurred to me that I haven't been looking for this across any
>>>>>>> of these reviews.  Don't you need byteswaps when accessing these
>>>>>>> in-memory structures?
>>>>>>
>>>>>> yes this is done when some event data is enqueued in the EQ.
>>>>>
>>>>> I'm not talking about the data in the EQ itself, but the fields in the
>>>>> END (and the NVT).
>>>>
>>>> XIVE is all BE.
>>>
>>> Yes... the qemu host might not be, which is why you need byteswaps.
>>
>> ok. I understand.
>>
>>> I realized eventually you have the swaps in your pnv get/set
>>> accessors.  
>>
>> Yes. because skiboot is BE, like the XIVE structures.
> 
> skiboot's endianness isn't really relevant, because we're modelling
> below that level.
> 
>>> I don't like that at all for a couple of reasons:
>>>
>>> 1) Although the END structure is made up of word-sized fields because
>>> that's convenient, the END really is made of a bunch of subfields of
>>> different sizes.  Knowing that it wouldn't be unreasonable for people
>>> to expect they can look into the XIVE by byte offsets; 
>>
>> These structures should be accessed with GETFIELD and SETFIELD macros
>> using the XIVE definitions in the xive_regs.h header file. I would want 
>> to keep that common with skiboot  for sure.
> 
> Right.  It might make sense to make some helper macros or inlines that
> include both the GETFIELD/SETFIELD and the byteswap.

ah. I have to evaluate the added complexity, because we don't really
have a struct. it's just an array of BE words. 

So for each field or bit we are interested in, we would have a helper
routine picking the correct word from the XIVE structure, doing the 
byteswap and extracting the value ?  

sigh. 

C.

>> Are you suggesting we should define each field of the XIVE structures 
>> with C attributes ? That would be very unfortunate.
> 
> Oh no, bitfields are a complete mess.
>
>>> that will break
>>> if you're working with a copy that has already been byte-swapped on
>>> word-sized units.
>>
>> I am not sure I understand the last sentence.
> 
> I mean that GETFIELD/SETFIELD only work on values that are already
> native endian, but using byte offsets would only work on values that
> are still in BE.
> 
>> the code working with a copy would necessarily know that the structure 
>> has been byteswapped and use correct offsets for the expected endianness.
>> no ? why would it break ?  
>>
>>> 2) At different points in the code you're storing both BE and
>>> native-endian data in the same struct. 
>>
>> on sPAPR, it's all native (which is a violation I agree).
> 
> Don't do that.  Having the same structure be BE in some situations and
> native endian in other situations is a sure path to madness.
> 
>> TIMA is BE.
>>
>>> That's both confusing to
>>> someone reading the code (if they see that struct they don't know if
>>> it's byteswapped already) and also means you can't use sparse
>>> annotations to make sure you have it right.
>>
>> XIVE structures are architected to be BE. That's immutable.
> 
> Yes, absolutely.  So don't represent them in C structs that are in
> native endian.  Ever, even temporarily.
> 
>> It's not a problem for skiboot which is BE. The PnvXIVE model for the
>> QEMU PowerNV machine reads these VSTs (Virtual Structure Tables) from 
>> the guest RAM and byteswaps the structure before using it. I think
>> that's fine. Isn't it ?
> 
> Byteswapping structures - rather than individual fields as you use
> them - is almost always a bad idea.  It's insanely easy to lose track
> of whether this particular instance of the structure is swapped yet or
> not, and you can't use sparse (or whatever) to check it for you.
> 
> Stick to one endianness for a struct, and do the byteswaps when you
> access the fields (using helpers if that's, well, helpful).
> 
>> It becomes a problem with the sPAPR model which is using the XIVE structures 
>> in native endianness and not BE anymore. But the guest OS never manipulates
>> these structures, so under the hood, I think we are free to use them in 
>> native and keep the common definitions.
> 
> Free to in the sense that it can theoretically work, yes.  But there's
> no upside (byteswaps are essentially free on POWER, and of trivial
> cost compared to memory access basically everywhere).  The downside is
> that having the same variables / structures have data in different
> endianness in different situations makes it exceedingly easy to forget
> which one you're dealing with right now and therefore forget some
> swaps or put in extra ones.
> 
>> Except that the event data entries in the OS EQs are BE. So the only place 
>> where we convert is when an event data is enqueued. 
>>
>> What would you put in place if you think this is a too strong violation 
>> of the architecture ? I am afraid of something too complex to manipulate
>> to be honest. Maybe we can drop the map/unmap access methods and only keep
>> the very basic ones.
> 
> The complexity of having extra swaps is almost always less than the
> complexity of having those swaps not be in a consistent place.
> Especially if you use helpers (including the swaps) to access your
> structure.
> 
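
A helper that includes the swap, as suggested above, could look roughly
like the sketch below. It is only an illustration under assumptions:
be32_t, be_getfield(), be_setfield() and the BIT32/BITMASK32 macros are
hypothetical names for this example, not the definitions from
xive_regs.h. The point is that the structure word stays big-endian in
memory at all times and the conversion happens inside the accessor.

```c
#include <assert.h>
#include <stdint.h>

/* IBM bit numbering: bit 0 is the MSB of the 32-bit word */
#define BIT32(b)          (0x80000000u >> (b))
#define BITMASK32(bs, be) ((BIT32(bs) - BIT32(be)) | BIT32(bs))

/* A word that is always big-endian in memory (hypothetical type) */
typedef struct { uint8_t b[4]; } be32_t;

static uint32_t be32_load(const be32_t *p)
{
    return ((uint32_t)p->b[0] << 24) | ((uint32_t)p->b[1] << 16) |
           ((uint32_t)p->b[2] << 8)  |  (uint32_t)p->b[3];
}

static void be32_store(be32_t *p, uint32_t v)
{
    p->b[0] = v >> 24;
    p->b[1] = v >> 16;
    p->b[2] = v >> 8;
    p->b[3] = v;
}

/* Position of the least significant set bit of the mask */
static unsigned mask_shift(uint32_t mask)
{
    unsigned shift = 0;

    assert(mask != 0);
    while (!(mask & 1)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Read a field: swap to host order, then extract */
static uint32_t be_getfield(uint32_t mask, const be32_t *word)
{
    return (be32_load(word) & mask) >> mask_shift(mask);
}

/* Write a field: read-modify-write, storing back in big-endian */
static void be_setfield(uint32_t mask, be32_t *word, uint32_t val)
{
    uint32_t v = be32_load(word);

    v = (v & ~mask) | ((val << mask_shift(mask)) & mask);
    be32_store(word, v);
}
```

Because be32_t is a distinct type, a raw native-endian access to the
word does not compile, which gives some of the checking that sparse
annotations would otherwise provide.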

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-11-29  0:47       ` David Gibson
  2018-11-29  3:39         ` Benjamin Herrenschmidt
@ 2018-12-03 17:05         ` Cédric Le Goater
  2018-12-04  1:54           ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-03 17:05 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

I forgot to reply to this one.

On 11/29/18 1:47 AM, David Gibson wrote:
> On Wed, Nov 28, 2018 at 11:59:58AM +0100, Cédric Le Goater wrote:
>> On 11/28/18 12:49 AM, David Gibson wrote:
>>> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
>>>> The last sub-engine of the XIVE architecture is the Interrupt
>>>> Virtualization Presentation Engine (IVPE). On HW, they share elements,
>>>> the Power Bus interface (CQ), the routing table descriptors, and they
>>>> can be combined in the same HW logic. We do the same in QEMU and
>>>> combine both engines in the XiveRouter for simplicity.
>>>
>>> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
>>> single object is a good idea, but we can probably discuss that once
>>> I've read further.
>>
>> We could introduce a simplified presenter for sPAPR but I am not even
>> sure of that as it will get more complex if we support the EBB one day. 
> 
> I wasn't really thinking about PAPR for this comment.
> 
>>>> When the IVRE has completed its job of matching an event source with a
>>>> Notification Virtual Target (NVT) to notify, it forwards the event
>>>> notification to the IVPE sub-engine. The IVPE scans the thread
>>>> interrupt contexts of the Notification Virtual Targets (NVT)
>>>> dispatched on the HW processor threads and if a match is found, it
>>>> signals the thread. If not, the IVPE escalates the notification to
>>>> some other targets and records the notification in a backlog queue.
>>>>
>>>> The IVPE maintains the thread interrupt context state for each of its
>>>> NVTs not dispatched on HW processor threads in the Notification
>>>> Virtual Target table (NVTT).
>>>>
>>>> The model currently only supports single NVT notifications.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  include/hw/ppc/xive.h      |  13 +++
>>>>  include/hw/ppc/xive_regs.h |  22 ++++
>>>>  hw/intc/xive.c             | 223 +++++++++++++++++++++++++++++++++++++
>>>>  3 files changed, 258 insertions(+)
>>>>
>>>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>>>> index 5987f26ddb98..e715a6c6923d 100644
>>>> --- a/include/hw/ppc/xive.h
>>>> +++ b/include/hw/ppc/xive.h
>>>> @@ -197,6 +197,10 @@ typedef struct XiveRouterClass {
>>>>                     XiveEND *end);
>>>>      int (*set_end)(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>                     XiveEND *end);
>>>> +    int (*get_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                   XiveNVT *nvt);
>>>> +    int (*set_nvt)(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                   XiveNVT *nvt);
>>>
>>> As with the ENDs, I don't think get/set is a good interface for a
>>> bigger-than-word-size object.
>>
>> We need to agree on this interface before I respin. So you would like 
>> to add an extra argument specifying the word being accessed ?
> 
> Yes.  Ok, 3 options I can see at this point:
> 
> 1) read/write accessors which take a word number
> 
> 2) A "get" accessor which copies the whole structure, but "write"
> accessor which takes a word number.  The asymmetry is a bit ugly, but
> it's the non-atomic writeback of the whole structure which I'm most
> uncomfortable with.
> 
> 3) A map/unmap interface which gives you / releases a pointer to the
> "live" structure.  For powernv that would become
> address_space_map()/unmap().  For PAPR it would just be return pointer
> / no-op.

This discussion is in progress in another subthread.
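
For reference, option 1 (accessors taking a word number) could be
sketched as below. All names here (NvtBackend, flat_read_word,
flat_write_word) are hypothetical and only illustrate the shape of the
interface; the flat-array backend stands in for a machine that keeps
everything in a single block and ignores the block number.

```c
#include <stdint.h>

#define NVT_WORDS 16    /* w0..wf in the XiveNVT definition */

typedef struct NvtBackend NvtBackend;

/* Hypothetical per-word accessors, one word updated per call */
struct NvtBackend {
    int (*read_word)(NvtBackend *nb, uint8_t blk, uint32_t idx,
                     unsigned word, uint32_t *val);
    int (*write_word)(NvtBackend *nb, uint8_t blk, uint32_t idx,
                      unsigned word, uint32_t val);
    uint32_t *table;    /* nr_nvts * NVT_WORDS words */
    uint32_t nr_nvts;
};

/* Flat backend: a single local table, the block number is ignored */
static int flat_read_word(NvtBackend *nb, uint8_t blk, uint32_t idx,
                          unsigned word, uint32_t *val)
{
    (void)blk;
    if (idx >= nb->nr_nvts || word >= NVT_WORDS) {
        return -1;
    }
    *val = nb->table[idx * NVT_WORDS + word];
    return 0;
}

static int flat_write_word(NvtBackend *nb, uint8_t blk, uint32_t idx,
                           unsigned word, uint32_t val)
{
    (void)blk;
    if (idx >= nb->nr_nvts || word >= NVT_WORDS) {
        return -1;
    }
    nb->table[idx * NVT_WORDS + word] = val;
    return 0;
}
```

With this shape, a caller that only changes one word never writes back
the other fifteen, which avoids the non-atomic writeback of the whole
structure.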

>>
>>>
>>>>  } XiveRouterClass;
>>>>  
>>>>  void xive_eas_pic_print_info(XiveEAS *eas, uint32_t lisn, Monitor *mon);
>>>> @@ -207,6 +211,10 @@ int xive_router_get_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>                          XiveEND *end);
>>>>  int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>                          XiveEND *end);
>>>> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                        XiveNVT *nvt);
>>>> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                        XiveNVT *nvt);
>>>>  
>>>>  /*
>>>>   * XIVE END ESBs
>>>> @@ -274,4 +282,9 @@ extern const MemoryRegionOps xive_tm_ops;
>>>>  
>>>>  void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon);
>>>>  
>>>> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
>>>> +{
>>>> +    return (nvt_blk << 19) | nvt_idx;
>>>
>>> I'm guessing this formula is the standard way of combining the NVT
>>> block and index into a single word?  
>>
>> That number is the VP/NVT identifier which is written in the CAM value. 
>> The index is on 19 bits because of the NVT  definition in the END 
>> structure. It is being increased to 24 bits on Power10.
>>
>>> If so, I think we should
>>> standardize on passing a single word "nvt_id" around and only
>>> splitting it when we need to use the block separately.  
>>
>> This is really the only place where we concatenate the two NVT values,
>> block and index. 
> 
> Hm, ok.  I know we don't model them (yet, maybe ever) but could
> combined values appear in the PowerBUS messages that handle remote
> notifications?

They do. 
 
>>> Same goes for
>>> the end_id, assuming there's a standard way of putting that into a
>>> single word.  That will address the point I raised earlier about lisn
>>> being passed around as a single word, but these later stage ids being
>>> split.
>>
>> Hmm, I am not sure this is a good option. It is not how the PowerNV 
>> model would use it, skiboot is very much aware of these blocks and 
>> indexes and for remote accesses chips are identified using the block. 
>> I will take a look at it but I am not fond of it. I can add helpers
>> in some places though.    
> 
> Hm, ok.  Do the block and index appear as an (effectively) single
> field in the EAS?

no. In all XIVE structures, block and index are always distinct.
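
For the one place where the two values are concatenated, helpers along
these lines could build and split the identifier. This is a sketch only:
the nvt_id_* names are hypothetical, and the 19-bit index width is taken
from the (nvt_blk << 19) | nvt_idx formula in the patch.

```c
#include <stdint.h>

#define NVT_IDX_BITS 19
#define NVT_IDX_MASK ((1u << NVT_IDX_BITS) - 1)

/* Combine block and index the same way as xive_tctx_cam_line() */
static inline uint32_t nvt_id_make(uint8_t blk, uint32_t idx)
{
    return ((uint32_t)blk << NVT_IDX_BITS) | (idx & NVT_IDX_MASK);
}

/* Block part of a combined identifier */
static inline uint8_t nvt_id_blk(uint32_t id)
{
    return id >> NVT_IDX_BITS;
}

/* Index part of a combined identifier */
static inline uint32_t nvt_id_idx(uint32_t id)
{
    return id & NVT_IDX_MASK;
}
```

Unlike the formula in the patch, nvt_id_make() masks the index so that
an out-of-range value cannot spill into the block bits.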

>> I agree we have some kind of issue linking the HW model with the sPAPR 
>> machine. The guest interface is only  about IRQ numbers, priorities and
>> cpu numbers. We really don't care about XIVE blocks and indexes in that 
>> case. we can clarify the code by bypassing the XiveRouter interfaces
>> to the table and directly use the sPAPR interrupt controller. That 
>> should help a bit for the hcalls but we would still have to fill in 
>> the EAT and the END with some index values if we want to use the router
>> algorithm.
> 
> I don't think this is too much of a problem.  These are essentially
> machine internal details so we can choose an allocation to suit us.
> The obvious one is to put everything in a single block, at least as
> long as that won't limit our numbers too much.
> 
>>> We'll probably want some inlines or macros to build an
>>> nvt/end/lisn/whatever id from block and index as well.
>>>
>>>> +}
>>>> +
>>>>  #endif /* PPC_XIVE_H */
>>>> diff --git a/include/hw/ppc/xive_regs.h b/include/hw/ppc/xive_regs.h
>>>> index 2e3d6cb507da..05cb992d2815 100644
>>>> --- a/include/hw/ppc/xive_regs.h
>>>> +++ b/include/hw/ppc/xive_regs.h
>>>> @@ -158,4 +158,26 @@ typedef struct XiveEND {
>>>>  #define END_W7_F1_LOG_SERVER_ID  PPC_BITMASK32(1, 31)
>>>>  } XiveEND;
>>>>  
>>>> +/* Notification Virtual Target (NVT) */
>>>> +typedef struct XiveNVT {
>>>> +        uint32_t        w0;
>>>> +#define NVT_W0_VALID             PPC_BIT32(0)
>>>> +        uint32_t        w1;
>>>> +        uint32_t        w2;
>>>> +        uint32_t        w3;
>>>> +        uint32_t        w4;
>>>> +        uint32_t        w5;
>>>> +        uint32_t        w6;
>>>> +        uint32_t        w7;
>>>> +        uint32_t        w8;
>>>> +#define NVT_W8_GRP_VALID         PPC_BIT32(0)
>>>> +        uint32_t        w9;
>>>> +        uint32_t        wa;
>>>> +        uint32_t        wb;
>>>> +        uint32_t        wc;
>>>> +        uint32_t        wd;
>>>> +        uint32_t        we;
>>>> +        uint32_t        wf;
>>>> +} XiveNVT;
>>>> +
>>>>  #endif /* PPC_XIVE_REGS_H */
>>>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>>>> index 4c6cb5d52975..5ba3b06e6e25 100644
>>>> --- a/hw/intc/xive.c
>>>> +++ b/hw/intc/xive.c
>>>> @@ -373,6 +373,32 @@ void xive_tctx_pic_print_info(XiveTCTX *tctx, Monitor *mon)
>>>>      }
>>>>  }
>>>>  
>>>> +/* The HW CAM (23bits) is hardwired to :
>>>> + *
>>>> + *   0x000||0b1||4Bit chip number||7Bit Thread number.
>>>> + *
>>>> + * and when the block grouping extension is enabled :
>>>> + *
>>>> + *   4Bit chip number||0x001||7Bit Thread number.
>>>> + */
>>>> +static uint32_t tctx_hw_cam_line(bool block_group, uint8_t chip_id, uint8_t tid)
>>>> +{
>>>> +    if (block_group) {
>>>> +        return 1 << 11 | (chip_id & 0xf) << 7 | (tid & 0x7f);
>>>> +    } else {
>>>> +        return (chip_id & 0xf) << 11 | 1 << 7 | (tid & 0x7f);
>>>> +    }
>>>> +}
>>>> +
>>>> +static uint32_t xive_tctx_hw_cam_line(XiveTCTX *tctx, bool block_group)
>>>> +{
>>>> +    PowerPCCPU *cpu = POWERPC_CPU(tctx->cs);
>>>> +    CPUPPCState *env = &cpu->env;
>>>> +    uint32_t pir = env->spr_cb[SPR_PIR].default_value;
>>>
>>> I don't much like reaching into the cpu state itself.  I think a
>>> better idea would be to have the TCTX have its HW CAM id set during
>>> initialization (via a property) and then use that.  This will mean
>>> less mucking about if future cpu revisions don't split the PIR into
>>> chip and tid ids in the same way.
>>
>> yes good idea. I will see how to handle the block_group boolean. Maybe we
>> can leave it out of the model for now as it is not used.
> 
> Yes, it would be nice to leave the block_group stuff as a later
> extension when/if we need it.  If we put it in as a stub and nothing
> is using/testing it, it's likely it will be broken if we ever do
> actually try to use it.
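
Without the block grouping extension, the HW CAM computation reduces to
the first layout given in the comment of the patch
(0b1 || 4-bit chip number || 7-bit thread number). A minimal sketch,
with hypothetical function names, and with the PIR split taken from the
patch ((pir >> 8) & 0xf for the chip, pir & 0x7f for the thread):

```c
#include <stdint.h>

/* 0b1 || 4-bit chip number || 7-bit thread number */
static uint32_t hw_cam_line(uint8_t chip_id, uint8_t tid)
{
    return 1u << 11 | (uint32_t)(chip_id & 0xf) << 7 | (tid & 0x7f);
}

/*
 * Computed once, for instance when the thread context is created,
 * so that the matching code does not reach into the CPU state.
 */
static uint32_t hw_cam_line_from_pir(uint32_t pir)
{
    return hw_cam_line((pir >> 8) & 0xf, pir & 0x7f);
}
```

The resulting value could then be handed to the thread context through
a property at initialization time, as proposed above.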

> 
>>
>>>
>>>> +    return tctx_hw_cam_line(block_group, (pir >> 8) & 0xf, pir & 0x7f);
>>>> +}
>>>> +
>>>>  static void xive_tctx_reset(void *dev)
>>>>  {
>>>>      XiveTCTX *tctx = XIVE_TCTX(dev);
>>>> @@ -1013,6 +1039,195 @@ int xive_router_set_end(XiveRouter *xrtr, uint8_t end_blk, uint32_t end_idx,
>>>>     return xrc->set_end(xrtr, end_blk, end_idx, end);
>>>>  }
>>>>  
>>>> +int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                        XiveNVT *nvt)
>>>> +{
>>>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>>>> +
>>>> +   return xrc->get_nvt(xrtr, nvt_blk, nvt_idx, nvt);
>>>> +}
>>>> +
>>>> +int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                        XiveNVT *nvt)
>>>> +{
>>>> +   XiveRouterClass *xrc = XIVE_ROUTER_GET_CLASS(xrtr);
>>>> +
>>>> +   return xrc->set_nvt(xrtr, nvt_blk, nvt_idx, nvt);
>>>> +}
>>>> +
>>>> +static bool xive_tctx_ring_match(XiveTCTX *tctx, uint8_t ring,
>>>> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                                 bool cam_ignore, uint32_t logic_serv)
>>>> +{
>>>> +    uint8_t *regs = &tctx->regs[ring];
>>>> +    uint32_t w2 = be32_to_cpu(*((uint32_t *) &regs[TM_WORD2]));
>>>> +    uint32_t cam = xive_tctx_cam_line(nvt_blk, nvt_idx);
>>>> +    bool block_group = false; /* TODO (PowerNV) */
>>>> +
>>>> +    /* TODO (PowerNV): ignore low order bits of nvt id */
>>>> +
>>>> +    switch (ring) {
>>>> +    case TM_QW3_HV_PHYS:
>>>> +        return (w2 & TM_QW3W2_VT) && xive_tctx_hw_cam_line(tctx, block_group) ==
>>>> +            tctx_hw_cam_line(block_group, nvt_blk, nvt_idx);
>>>
>>> The difference between "xive_tctx_hw_cam_line" and "tctx_hw_cam_line"
>>> here is far from obvious.  
>>
>> yes. I lacked inspiration ...
> 
> I'd suggest that the one which takes the tctx as a parameter be
> tctx_hw_cam_line() and the other be nvt_hw_cam_line() or similar.  The
> crucial difference here is that one is what the thread is looking for,
> the other is what the NVT is advertising.
> 
>>> Remember that namespacing prefixes aren't
>>> necessary for static functions, which can let you give more
>>> descriptive names without getting excessively long.
>>
>> OK.
>>  
>>>> +    case TM_QW2_HV_POOL:
>>>> +        return (w2 & TM_QW2W2_VP) && (cam == GETFIELD(TM_QW2W2_POOL_CAM, w2));
>>>> +
>>>> +    case TM_QW1_OS:
>>>> +        return (w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2));
>>>> +
>>>> +    case TM_QW0_USER:
>>>> +        return ((w2 & TM_QW1W2_VO) && (cam == GETFIELD(TM_QW1W2_OS_CAM, w2)) &&
>>>> +                (w2 & TM_QW0W2_VU) &&
>>>> +                (logic_serv == GETFIELD(TM_QW0W2_LOGIC_SERV, w2)));
>>>> +
>>>> +    default:
>>>> +        g_assert_not_reached();
>>>> +    }
>>>> +}
>>>> +
>>>> +static int xive_presenter_tctx_match(XiveTCTX *tctx, uint8_t format,
>>>> +                                     uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                                     bool cam_ignore, uint32_t logic_serv)
>>>> +{
>>>> +    if (format == 0) {
>>>> +        /* F=0 & i=1: Logical server notification */
>>>> +        if (cam_ignore == true) {
>>>> +            qemu_log_mask(LOG_GUEST_ERROR, "XIVE: no support for LS "
>>>> +                          "NVT %x/%x\n", nvt_blk, nvt_idx);
>>>> +             return -1;
>>>> +        }
>>>> +
>>>> +        /* F=0 & i=0: Specific NVT notification */
>>>> +        if (xive_tctx_ring_match(tctx, TM_QW3_HV_PHYS,
>>>> +                                nvt_blk, nvt_idx, false, 0)) {
>>>> +            return TM_QW3_HV_PHYS;
>>>> +        }
>>>> +        if (xive_tctx_ring_match(tctx, TM_QW2_HV_POOL,
>>>> +                                nvt_blk, nvt_idx, false, 0)) {
>>>> +            return TM_QW2_HV_POOL;
>>>> +        }
>>>> +        if (xive_tctx_ring_match(tctx, TM_QW1_OS,
>>>> +                                nvt_blk, nvt_idx, false, 0)) {
>>>> +            return TM_QW1_OS;
>>>> +        }
>>>
>>> Hm.  It's a bit pointless to iterate through each ring calling a
>>> common function, when that "common" function consists entirely of a
>>> switch which makes it not really common at all.
>>>
>>> So I think you want separate helper functions for each ring's match,
>>> or even just fold the previous function into this one.
>>
>> yes. It can be improved. I did try different layouts. I might just fold 
>> both routines into one as you propose.
>>
>>>> +    } else {
>>>> +        /* F=1 : User level Event-Based Branch (EBB) notification */
>>>> +        if (xive_tctx_ring_match(tctx, TM_QW0_USER,
>>>> +                                nvt_blk, nvt_idx, false, logic_serv)) {
>>>> +            return TM_QW0_USER;
>>>> +        }
>>>> +    }
>>>> +    return -1;
>>>> +}
>>>> +
>>>> +typedef struct XiveTCTXMatch {
>>>> +    XiveTCTX *tctx;
>>>> +    uint8_t ring;
>>>> +} XiveTCTXMatch;
>>>> +
>>>> +static bool xive_presenter_match(XiveRouter *xrtr, uint8_t format,
>>>> +                                 uint8_t nvt_blk, uint32_t nvt_idx,
>>>> +                                 bool cam_ignore, uint8_t priority,
>>>> +                                 uint32_t logic_serv, XiveTCTXMatch *match)
>>>> +{
>>>> +    CPUState *cs;
>>>> +
>>>> +    /* TODO (PowerNV): handle chip_id overwrite of block field for
>>>> +     * hardwired CAM compares */
>>>> +
>>>> +    CPU_FOREACH(cs) {
>>>> +        PowerPCCPU *cpu = POWERPC_CPU(cs);
>>>> +        XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>>>> +        int ring;
>>>> +
>>>> +        /*
>>>> +         * HW checks that the CPU is enabled in the Physical Thread
>>>> +         * Enable Register (PTER).
>>>> +         */
>>>> +
>>>> +        /*
>>>> +         * Check the thread context CAM lines and record matches. We
>>>> +         * will handle CPU exception delivery later
>>>> +         */
>>>> +        ring = xive_presenter_tctx_match(tctx, format, nvt_blk, nvt_idx,
>>>> +                                         cam_ignore, logic_serv);
>>>> +        /*
>>>> +         * Save the context and follow on to catch duplicates, that we
>>>> +         * don't support yet.
>>>> +         */
>>>> +        if (ring != -1) {
>>>> +            if (match->tctx) {
>>>> +                qemu_log_mask(LOG_GUEST_ERROR, "XIVE: already found a thread "
>>>> +                              "context NVT %x/%x\n", nvt_blk, nvt_idx);
>>>> +                return false;
>>>> +            }
>>>> +
>>>> +            match->ring = ring;
>>>> +            match->tctx = tctx;
>>>> +        }
>>>> +    }
>>>> +
>>>> +    if (!match->tctx) {
>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
>>>> +                      nvt_blk, nvt_idx);
>>>> +        return false;
>>>
>>> Hmm.. this isn't actually an error, is it? At least not for powernv
>>
>> It is on sPAPR, it would mean the END was configured with an unknown CPU.
> 
> Right.
> 
>> It is not an error on PowerNV, when we support escalations.
>>
>>> - that just means the NVT isn't currently dispatched, so we'll need to
>>> trigger the escalation interrupt.  
>>
>> Yes.
>>
>>> Does this get changed later in the series?
>>
>> No.
> 
> But this code is common to PAPR and powernv, yes, so it will need to?

When we add support for escalations, yes, it will change. Would you rather
use an error_report() until then ? 

C.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-11-29  4:09   ` David Gibson
  2018-11-29 16:36     ` Cédric Le Goater
@ 2018-12-03 17:48     ` Peter Maydell
  2018-12-04 12:33       ` Cédric Le Goater
  1 sibling, 1 reply; 184+ messages in thread
From: Peter Maydell @ 2018-12-03 17:48 UTC (permalink / raw)
  To: David Gibson; +Cc: Cédric Le Goater, qemu-ppc, QEMU Developers

On Thu, 29 Nov 2018 at 04:55, David Gibson <david@gibson.dropbear.id.au> wrote:
>
> On Fri, Nov 16, 2018 at 11:57:20AM +0100, Cédric Le Goater wrote:
> > This will be used to remove the MMIO regions of the POWER9 XIVE
> > interrupt controller when the sPAPR machine is reset.
> >
> > Signed-off-by: Cédric Le Goater <clg@kaod.org>
>
> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>
> Since the code looks sane.
>
> However, I think using memory_region_set_enabled() would be a better
> idea for our purposes than actually adding/deleting the subregion.

The other approach I've used in the past is to use
sysbus_mmio_get_region() and then just map and unmap
that directly, rather than using the sysbus_mmio_map()
convenience function. (Often the kind of device that's
doing complicated things like this will be working in
a setup where it doesn't necessarily want to be mapping
directly into system memory rather than an SoC or similar
container MemoryRegion anyway.)

thanks
-- PMM

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-12-03 17:05         ` Cédric Le Goater
@ 2018-12-04  1:54           ` David Gibson
  2018-12-04 17:04             ` Cédric Le Goater
  0 siblings, 1 reply; 184+ messages in thread
From: David Gibson @ 2018-12-04  1:54 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 4289 bytes --]

On Mon, Dec 03, 2018 at 06:05:12PM +0100, Cédric Le Goater wrote:
> I forgot to reply to this one.
> 
> On 11/29/18 1:47 AM, David Gibson wrote:
> > On Wed, Nov 28, 2018 at 11:59:58AM +0100, Cédric Le Goater wrote:
> >> On 11/28/18 12:49 AM, David Gibson wrote:
> >>> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
> >>>> The last sub-engine of the XIVE architecture is the Interrupt
> >>>> Virtualization Presentation Engine (IVPE). On HW, they share elements,
> >>>> the Power Bus interface (CQ), the routing table descriptors, and they
> >>>> can be combined in the same HW logic. We do the same in QEMU and
> >>>> combine both engines in the XiveRouter for simplicity.
> >>>
> >>> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
> >>> single object is a good idea, but we can probably discuss that once
> >>> I've read further.
> >>
> >> We could introduce a simplified presenter for sPAPR but I am not even
> >> sure of that as it will get more complex if we support the EBB
> >> one day.

[snip]
> >>>> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
> >>>> +{
> >>>> +    return (nvt_blk << 19) | nvt_idx;
> >>>
> >>> I'm guessing this formula is the standard way of combining the NVT
> >>> block and index into a single word?  
> >>
> >> That number is the VP/NVT identifier which is written in the CAM value. 
> >> The index is on 19 bits because of the NVT  definition in the END 
> >> structure. It is being increased to 24 bits on Power10 
> >>
> >>> If so, I think we should
> >>> standardize on passing a single word "nvt_id" around and only
> >>> splitting it when we need to use the block separately.  
> >>
> >> This is really the only place where we concatenate the two NVT values,
> >> block and index. 
> > 
> > Hm, ok.  I know we don't model them (yet, maybe ever) but could
> > combined values appear in the PowerBUS messages that handle remote
> > notifications?
> 
> They do. 
>  
> >>> Same goes for
> >>> the end_id, assuming there's a standard way of putting that into a
> >>> single word.  That will address the point I raised earlier about lisn
> >>> being passed around as a single word, but these later stage ids being
> >>> split.
> >>
> >> Hmm, I am not sure this is a good option. It is not how the PowerNV 
> >> model would use it, skiboot is very much aware of these blocks and 
> >> indexes and for remote accesses chips are identified using the block. 
> >> I will take a look at it but I am not fond of it. I can add helpers
> >> in some places though.    
> > 
> > Hm, ok.  Do the block and index appear as an (effectively) single
> > field in the EAS?
> 
> no. In all XIVE structures, block and index are always distinct.

Hm.  Distinct in what sense?  I get that the fields are labelled
separately in the documentation, but if the fields are adjacent, you
could equally well treat them as one.
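
To make the "treat them as one" idea concrete, here is a minimal sketch of
pack/unpack helpers built on the same (nvt_blk << 19) | nvt_idx formula
quoted above. The names nvt_id_pack(), nvt_id_block() and nvt_id_index()
are hypothetical and do not exist in the series; the 19-bit index width is
taken from the discussion above and would need to change for Power10.

```c
#include <stdint.h>

/* Index width on POWER9, per the thread; an assumption for any other CPU. */
#define NVT_IDX_BITS 19u
#define NVT_IDX_MASK ((1u << NVT_IDX_BITS) - 1u)

/* Hypothetical helper: combine block and index into a single identifier. */
static inline uint32_t nvt_id_pack(uint8_t nvt_blk, uint32_t nvt_idx)
{
    return ((uint32_t)nvt_blk << NVT_IDX_BITS) | (nvt_idx & NVT_IDX_MASK);
}

/* Hypothetical helper: recover the block from a combined identifier. */
static inline uint8_t nvt_id_block(uint32_t nvt_id)
{
    return (uint8_t)(nvt_id >> NVT_IDX_BITS);
}

/* Hypothetical helper: recover the index from a combined identifier. */
static inline uint32_t nvt_id_index(uint32_t nvt_id)
{
    return nvt_id & NVT_IDX_MASK;
}
```

With helpers like these, callers would pass one word around and split it
only where the block is needed on its own, e.g. for remote accesses.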

> >>>> +    if (!match->tctx) {
> >>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
> >>>> +                      nvt_blk, nvt_idx);
> >>>> +        return false;
> >>>
> >>> Hmm.. this isn't actually an error, is it? At least not for powernv
> >>
> >> It is on sPAPR, it would mean the END was configured with an unknown CPU.
> > 
> > Right.
> > 
> >> It is not an error on PowerNV, when we support escalations.
> >>
> >>> - that just means the NVT isn't currently dispatched, so we'll need to
> >>> trigger the escalation interrupt.  
> >>
> >> Yes.
> >>
> >>> Does this get changed later in the series?
> >>
> >> No.
> > 
> > But this code is common to PAPR and powernv, yes, so it will need to?
> 
> When we add support for escalations, yes, it will change. Would you rather
> use an error_report() until then ?

Ah, I guess leaving an error until we implement escalation makes
sense.  It shouldn't be LOG_GUEST_ERROR, though, the guest didn't do
anything wrong, and error_report() doesn't really make sense for the
same reason.

LOG_UNIMP, I guess?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode
  2018-12-03 16:49                 ` Cédric Le Goater
@ 2018-12-04  1:56                   ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-12-04  1:56 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 2662 bytes --]

On Mon, Dec 03, 2018 at 05:49:37PM +0100, Cédric Le Goater wrote:
> >>>>>>>> +    }
> >>>>>>>> +
> >>>>>>>> +    switch (qsize) {
> >>>>>>>> +    case 12:
> >>>>>>>> +    case 16:
> >>>>>>>> +    case 21:
> >>>>>>>> +    case 24:
> >>>>>>>> +        end.w3 = ((uint64_t)qpage) & 0xffffffff;
> >>>>>>>
> >>>>>>> It just occurred to me that I haven't been looking for this across any
> >>>>>>> of these reviews.  Don't you need byteswaps when accessing these
> >>>>>>> in-memory structures?
> >>>>>>
> >>>>>> yes this is done when some event data is enqueued in the EQ.
> >>>>>
> >>>>> I'm not talking about the data in the EQ itself, but the fields in the
> >>>>> END (and the NVT).
> >>>>
> >>>> XIVE is all BE.
> >>>
> >>> Yes... the qemu host might not be, which is why you need byteswaps.
> >>
> >> ok. I understand.
> >>
> >>> I realized eventually you have the swaps in your pnv get/set
> >>> accessors.  
> >>
> >> Yes. because skiboot is BE, like the XIVE structures.
> > 
> > skiboot's endianness isn't really relevant, because we're modelling
> > below that level.
> > 
> >>> I don't like that at all for a couple of reasons:
> >>>
> >>> 1) Although the END structure is made up of word-sized fields because
> >>> that's convenient, the END really is made of a bunch of subfields of
> >>> different sizes.  Knowing that it wouldn't be unreasonable for people
> >>> to expect they can look into the XIVE by byte offsets; 
> >>
> >> These structures should be accessed with GETFIELD and SETFIELD macros
> >> using the XIVE definitions in the xive_regs.h header file. I would want 
> >> to keep that common with skiboot  for sure.
> > 
> > Right.  It might make sense to make some helper macros or inlines that
> > include both the GETFIELD/SETFIELD and the byteswap.
> 
> ah. I have to evaluate the added complexity, because we don't really
> have a struct. it's just an array of BE words.

You're still treating it as a struct which reflects an in-memory layout,
even if all the fields are words of the same size.

> So for each field or bit we are interested in, we would have a helper
> routine picking the correct word from the XIVE structure, doing the 
> byteswap and extracting the value ?
> 
> sigh.

Oh, no, I was just thinking a version of GETFIELD that byteswaps the
value before extracting the given field.  Likewise a SETFIELD variant
that swaps, deposits the field, then swaps back.
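
Roughly, as a sketch only: the helpers below read a big-endian word,
extract or deposit a field, and write it back big-endian, so they behave
the same on either host endianness. All names (load_be32(), store_be32(),
getfield_be32(), setfield_be32()) are made up for illustration and the
mask is passed as a plain 32-bit value; in QEMU proper one would more
likely build on be32_to_cpu()/cpu_to_be32() and the existing
GETFIELD/SETFIELD macros.

```c
#include <stdint.h>

/* Load a 32-bit big-endian word from memory, independent of the host. */
static inline uint32_t load_be32(const uint32_t *p)
{
    const uint8_t *b = (const uint8_t *)p;
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8) | (uint32_t)b[3];
}

/* Store a 32-bit value to memory in big-endian byte order. */
static inline void store_be32(uint32_t *p, uint32_t v)
{
    uint8_t *b = (uint8_t *)p;
    b[0] = (uint8_t)(v >> 24);
    b[1] = (uint8_t)(v >> 16);
    b[2] = (uint8_t)(v >> 8);
    b[3] = (uint8_t)v;
}

/* Position of the lowest set bit of the mask (0 for an empty mask). */
static inline unsigned mask_shift(uint32_t mask)
{
    unsigned shift = 0;
    if (mask == 0) {
        return 0;
    }
    while (!(mask & 1u)) {
        mask >>= 1;
        shift++;
    }
    return shift;
}

/* Swap to host order, then extract the field selected by mask. */
static inline uint32_t getfield_be32(uint32_t mask, const uint32_t *word)
{
    return (load_be32(word) & mask) >> mask_shift(mask);
}

/* Swap to host order, deposit the field, then swap back to big-endian. */
static inline void setfield_be32(uint32_t mask, uint32_t *word, uint32_t val)
{
    uint32_t w = load_be32(word);
    w = (w & ~mask) | ((val << mask_shift(mask)) & mask);
    store_be32(word, w);
}
```

That keeps the "array of BE words" layout as it is and puts the byteswap
in exactly one place per access.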

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-12-03 15:52       ` Cédric Le Goater
@ 2018-12-04  1:59         ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-12-04  1:59 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 1264 bytes --]

On Mon, Dec 03, 2018 at 04:52:46PM +0100, Cédric Le Goater wrote:
> On 11/29/18 5:36 PM, Cédric Le Goater wrote:
> > On 11/29/18 5:09 AM, David Gibson wrote:
> >> On Fri, Nov 16, 2018 at 11:57:20AM +0100, Cédric Le Goater wrote:
> >>> This will be used to remove the MMIO regions of the POWER9 XIVE
> >>> interrupt controller when the sPAPR machine is reset.
> >>>
> >>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>
> >> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
> >>
> >> Since the code looks sane.
> >>
> >> However, I think using memory_region_set_enabled() would be a better
> >> idea for our purposes than actually adding/deleting the subregion.
> > 
> > Yes and we might not need this one anymore. 
> 
> As we are destroying the KVM device, we also need to remove the mmap
> in QEMU, else we will have a VMA with a page fault handler pointing
> to a bogus KVM device. That means destroying the memory region, so
> we cannot use memory_region_set_enabled().
> 
> Anyhow mapping/unmapping works well.

Ah, ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-12-03 17:48     ` Peter Maydell
@ 2018-12-04 12:33       ` Cédric Le Goater
  2018-12-04 13:04         ` Peter Maydell
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-04 12:33 UTC (permalink / raw)
  To: Peter Maydell, David Gibson; +Cc: qemu-ppc, QEMU Developers

Hello Peter,

On 12/3/18 6:48 PM, Peter Maydell wrote:
> On Thu, 29 Nov 2018 at 04:55, David Gibson <david@gibson.dropbear.id.au> wrote:
>>
>> On Fri, Nov 16, 2018 at 11:57:20AM +0100, Cédric Le Goater wrote:
>>> This will be used to remove the MMIO regions of the POWER9 XIVE
>>> interrupt controller when the sPAPR machine is reset.
>>>
>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>
>> Reviewed-by: David Gibson <david@gibson.dropbear.id.au>
>>
>> Since the code looks sane.
>>
>> However, I think using memory_region_set_enabled() would be a better
>> idea for our purposes than actually adding/deleting the subregion.
> 
> The other approach I've used in the past is to use
> sysbus_mmio_get_region() and then just map and unmap
> that directly, rather than using the sysbus_mmio_map()
> convenience function. (Often the kind of device that's
> doing complicated things like this will be working in
> a setup where it doesn't necessarily want to be mapping
> directly into system memory rather than an SoC or similar
> container MemoryRegion anyway.)

Thanks for chiming in on that patch. Here is some background on 
what we are trying to model. Maybe you have some suggestions.

A completely new interrupt controller was introduced on the POWER9 
processor and it uses MMIO regions for interrupt management. These
regions are backed by simple MRs in QEMU, when using TCG, and backed
by ram_device_ptr MRs under KVM.

Difficulties arise with the fact that POWER9 pseries guests need
to support the old mode (XICS, no MMIOs) and the new mode XIVE.
The interrupt mode is negotiated at boot between the hypervisor
and the guest and a reset is generated to take into account 
the changes. Which means that, at reset, we may need to disconnect 
from a KVM IC device and reconnect to another. 

When switching from XICS to XIVE mode : 

  if kvm
    - destroy KVM XICS device
    - create KVM XIVE device 
    - get fd, mmap, init ram_device_ptr MRs
    - map mmio
  - enable MMIOs

When switching from XIVE to XICS  : 
  
  - disable MMIOs 
  if kvm
    - delete MRs
    - munmap
    - destroy KVM XIVE device
    - create KVM XICS device 
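
The two sequences above can be sketched as a small state model. This is
only an illustration of the ordering, not the QEMU implementation: the
IrqModel type, its fields and the switch_to_xive()/switch_to_xics()
functions are all hypothetical, and each assignment stands in for the
real operation named in the comment next to it.

```c
#include <stdbool.h>

typedef enum { IRQ_MODE_XICS, IRQ_MODE_XIVE } IrqMode;

/* Hypothetical bookkeeping for the interrupt controller backend. */
typedef struct {
    IrqMode mode;
    bool kvm;            /* true under KVM, false under TCG */
    bool kvm_xics_dev;   /* a KVM XICS device currently exists */
    bool kvm_xive_dev;   /* a KVM XIVE device currently exists */
    bool mrs_mapped;     /* ram_device_ptr MRs are mmap'ed and mapped */
    bool mmio_enabled;   /* XIVE MMIO regions are visible to the guest */
} IrqModel;

/* XICS -> XIVE: set up the KVM side first, then enable the MMIOs. */
static void switch_to_xive(IrqModel *m)
{
    if (m->kvm) {
        m->kvm_xics_dev = false;  /* destroy KVM XICS device */
        m->kvm_xive_dev = true;   /* create KVM XIVE device */
        m->mrs_mapped = true;     /* get fd, mmap, init MRs, map mmio */
    }
    m->mmio_enabled = true;       /* enable MMIOs */
    m->mode = IRQ_MODE_XIVE;
}

/* XIVE -> XICS: disable the MMIOs first, then tear down the KVM side. */
static void switch_to_xics(IrqModel *m)
{
    m->mmio_enabled = false;      /* disable MMIOs */
    if (m->kvm) {
        m->mrs_mapped = false;    /* delete MRs, munmap */
        m->kvm_xive_dev = false;  /* destroy KVM XIVE device */
        m->kvm_xics_dev = true;   /* create KVM XICS device */
    }
    m->mode = IRQ_MODE_XICS;
}
```

The point of the ordering is that the mappings must be gone before the
KVM XIVE device they point at is destroyed.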


Thanks,

C.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper
  2018-12-04 12:33       ` Cédric Le Goater
@ 2018-12-04 13:04         ` Peter Maydell
  0 siblings, 0 replies; 184+ messages in thread
From: Peter Maydell @ 2018-12-04 13:04 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: David Gibson, qemu-ppc, QEMU Developers

On Tue, 4 Dec 2018 at 12:33, Cédric Le Goater <clg@kaod.org> wrote:
> A completely new interrupt controller was introduced on the POWER9
> processor and it uses MMIO regions for interrupt management. These
> regions are backed by simple MRs in QEMU, when using TCG, and backed
> by ram_device_ptr MRs under KVM.
>
> Difficulties arise with the fact that POWER9 pseries guests need
> to support the old mode (XICS, no MMIOs) and the new mode XIVE.
> The interrupt mode is negotiated at boot between the hypervisor
> and the guest and a reset is generated to take into account
> the changes. Which means that, at reset, we may need to disconnect
> from a KVM IC device and reconnect to another.

This is a painful API for QEMU to implement, incidentally,
because we don't have any concept really of a warm reset. In
theory reset should get you back to exactly the same state
as if you'd just started QEMU. You can probably bodge something
together, though.

> When switching from XICS to XIVE mode :
>
>   if kvm
>     - destroy KVM XICS device
>     - create KVM XIVE device
>     - get fd, mmap, init ram_device_ptr MRs
>     - map mmio
>   - enable MMIOs
>
> When switching from XIVE to XICS  :
>
>   - disable MMIOs
>   if kvm
>     - delete MRs
>     - munmap
>     - destroy KVM XIVE device
>     - create KVM XICS device

This seems basically OK, I think.

thanks
-- PMM

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type
  2018-11-28 22:37     ` Cédric Le Goater
@ 2018-12-04 15:14       ` Cédric Le Goater
  2018-12-05  1:44         ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-04 15:14 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 11/28/18 11:37 PM, Cédric Le Goater wrote:
> On 11/28/18 5:42 AM, David Gibson wrote:
>> On Fri, Nov 16, 2018 at 11:57:12AM +0100, Cédric Le Goater wrote:
>>> The interrupt mode is statically defined to XIVE only for this machine.
>>> The guest OS is required to have support for the XIVE exploitation
>>> mode of the POWER9 interrupt controller.
>>>
>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>> ---
>>>  include/hw/ppc/spapr_irq.h |  1 +
>>>  hw/ppc/spapr.c             | 36 +++++++++++++++++++++++++++++++-----
>>>  hw/ppc/spapr_irq.c         |  3 +++
>>>  3 files changed, 35 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
>>> index c3b4c38145eb..b299dd794bff 100644
>>> --- a/include/hw/ppc/spapr_irq.h
>>> +++ b/include/hw/ppc/spapr_irq.h
>>> @@ -33,6 +33,7 @@ void spapr_irq_msi_reset(sPAPRMachineState *spapr);
>>>  typedef struct sPAPRIrq {
>>>      uint32_t    nr_irqs;
>>>      uint32_t    nr_msis;
>>> +    uint8_t     ov5;
>>
>> I'm a bit confused as to what exactly this represents..
> 
> The option vector 5 bits advertised by CAS for the platform. What the
> hypervisor supports.

0x80 both modes
0x40 XIVE only
0x00 XICS only
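
As a sketch of how those three values could be tested explicitly, rather
than treating any non-zero ov5 as "XIVE supported": the enum names and
the two helpers below are hypothetical and not part of the patch, only
the numeric values come from the list above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the OV5 byte 23 values listed above. */
enum {
    OV5_IRQ_XICS_ONLY = 0x00,
    OV5_IRQ_XIVE_ONLY = 0x40,
    OV5_IRQ_BOTH      = 0x80,
};

/* True when the platform advertises the XIVE exploitation mode. */
static inline bool ov5_irq_allows_xive(uint8_t ov5)
{
    return ov5 == OV5_IRQ_XIVE_ONLY || ov5 == OV5_IRQ_BOTH;
}

/* True when the platform advertises the legacy XICS mode. */
static inline bool ov5_irq_allows_xics(uint8_t ov5)
{
    return ov5 == OV5_IRQ_XICS_ONLY || ov5 == OV5_IRQ_BOTH;
}
```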

>>
>>>      void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
>>>                   Error **errp);
>>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
>>> index ad1692cdcd0f..8fbb743769db 100644
>>> --- a/hw/ppc/spapr.c
>>> +++ b/hw/ppc/spapr.c
>>> @@ -1097,12 +1097,14 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
>>>      spapr_dt_rtas_tokens(fdt, rtas);
>>>  }
>>>  
>>> -/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
>>> - * that the guest may request and thus the valid values for bytes 24..26 of
>>> - * option vector 5: */
>>> -static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
>>> +/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
>>> + * and the XIVE features that the guest may request and thus the valid
>>> + * values for bytes 23..26 of option vector 5: */
>>> +static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
>>> +                                          int chosen)
>>>  {
>>>      PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
>>> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
>>>  
>>>      char val[2 * 4] = {
>>>          23, 0x00, /* Xive mode, filled in below. */
>>> @@ -1123,7 +1125,11 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
>>>          } else {
>>>              val[3] = 0x00; /* Hash */
>>>          }
>>> +        /* TODO: test KVM support */
>>> +        val[1] = smc->irq->ov5;
>>>      } else {
>>> +        val[1] = smc->irq->ov5;
>>
>> ..here it seems to be a specific value for this OV5 byte, indicating the
>> supported intc...
> 
> yes.
> 
>>
>>> +
>>>          /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
>>>          val[3] = 0xC0;
>>>      }
>>> @@ -1191,7 +1197,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
>>>          _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
>>>      }
>>>  
>>> -    spapr_dt_ov5_platform_support(fdt, chosen);
>>> +    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
>>>  
>>>      g_free(stdout_path);
>>>      g_free(bootlist);
>>> @@ -2622,6 +2628,11 @@ static void spapr_machine_init(MachineState *machine)
>>>      /* advertise support for ibm,dyamic-memory-v2 */
>>>      spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
>>>  
>>> +    /* advertise XIVE */
>>> +    if (smc->irq->ov5) {
>>
>> ..but here it seems to be a bool indicating XIVE support specifically.
> 
> ah. yes. I need to check this part. That was a while ago.

This advertises XIVE again if the machine supports it. We need to
populate the DT node "ibm,arch-vec-5-platform-support" in
spapr_dt_ov5_platform_support() *and* also update the machine field
spapr->ov5, which seems redundant to me.

Shouldn't spapr->ov5 be used to build the DT? Or have I really
missed something?


Thanks, 

C.

 
>>> +        spapr_ovec_set(spapr->ov5, OV5_XIVE_EXPLOIT);
>>> +    }
>>> +
>>>      /* init CPUs */
>>>      spapr_init_cpus(spapr);
>>>  
>>> @@ -3971,6 +3982,21 @@ static void spapr_machine_3_1_class_options(MachineClass *mc)
>>>  
>>>  DEFINE_SPAPR_MACHINE(3_1, "3.1", true);
>>>  
>>> +static void spapr_machine_3_1_xive_instance_options(MachineState *machine)
>>> +{
>>> +    spapr_machine_3_1_instance_options(machine);
>>> +}
>>> +
>>> +static void spapr_machine_3_1_xive_class_options(MachineClass *mc)
>>> +{
>>> +    sPAPRMachineClass *smc = SPAPR_MACHINE_CLASS(mc);
>>> +
>>> +    spapr_machine_3_1_class_options(mc);
>>> +    smc->irq = &spapr_irq_xive;
>>> +}
>>> +
>>> +DEFINE_SPAPR_MACHINE(3_1_xive, "3.1-xive", false);
>>> +
>>>  /*
>>>   * pseries-3.0
>>>   */
>>> diff --git a/hw/ppc/spapr_irq.c b/hw/ppc/spapr_irq.c
>>> index 253abc10e780..42e73851b174 100644
>>> --- a/hw/ppc/spapr_irq.c
>>> +++ b/hw/ppc/spapr_irq.c
>>> @@ -210,6 +210,7 @@ static Object *spapr_irq_cpu_intc_create_xics(sPAPRMachineState *spapr,
>>>  sPAPRIrq spapr_irq_xics = {
>>>      .nr_irqs     = SPAPR_IRQ_XICS_NR_IRQS,
>>>      .nr_msis     = SPAPR_IRQ_XICS_NR_MSIS,
>>> +    .ov5         = 0x0, /* XICS only */
>>>  
>>>      .init        = spapr_irq_init_xics,
>>>      .claim       = spapr_irq_claim_xics,
>>> @@ -341,6 +342,7 @@ static Object *spapr_irq_cpu_intc_create_xive(sPAPRMachineState *spapr,
>>>  sPAPRIrq spapr_irq_xive = {
>>>      .nr_irqs     = SPAPR_IRQ_XIVE_NR_IRQS,
>>>      .nr_msis     = SPAPR_IRQ_XIVE_NR_MSIS,
>>> +    .ov5         = 0x40, /* XIVE exploitation mode only */
>>>  
>>>      .init        = spapr_irq_init_xive,
>>>      .claim       = spapr_irq_claim_xive,
>>> @@ -447,6 +449,7 @@ int spapr_irq_find(sPAPRMachineState *spapr, int num, bool align, Error **errp)
>>>  sPAPRIrq spapr_irq_xics_legacy = {
>>>      .nr_irqs     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
>>>      .nr_msis     = SPAPR_IRQ_XICS_LEGACY_NR_IRQS,
>>> +    .ov5         = 0x0, /* XICS only */
>>>  
>>>      .init        = spapr_irq_init_xics,
>>>      .claim       = spapr_irq_claim_xics,
>>
> 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-12-04  1:54           ` David Gibson
@ 2018-12-04 17:04             ` Cédric Le Goater
  2018-12-05  1:40               ` David Gibson
  0 siblings, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-04 17:04 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 12/4/18 2:54 AM, David Gibson wrote:
> On Mon, Dec 03, 2018 at 06:05:12PM +0100, Cédric Le Goater wrote:
>> I forgot to reply to this one.
>>
>> On 11/29/18 1:47 AM, David Gibson wrote:
>>> On Wed, Nov 28, 2018 at 11:59:58AM +0100, Cédric Le Goater wrote:
>>>> On 11/28/18 12:49 AM, David Gibson wrote:
>>>>> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
>>>>>> The last sub-engine of the XIVE architecture is the Interrupt
>>>>>> Virtualization Presentation Engine (IVPE). On HW, they share elements,
>>>>>> the Power Bus interface (CQ), the routing table descriptors, and they
>>>>>> can be combined in the same HW logic. We do the same in QEMU and
>>>>>> combine both engines in the XiveRouter for simplicity.
>>>>>
>>>>> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
>>>>> single object is a good idea, but we can probably discuss that once
>>>>> I've read further.
>>>>
>>>> We could introduce a simplified presenter for sPAPR but I am not even
>>>> sure of that as it will get more complex if we support the EBB
>>>> one day.
> 
> [snip]
>>>>>> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
>>>>>> +{
>>>>>> +    return (nvt_blk << 19) | nvt_idx;
>>>>>
>>>>> I'm guessing this formula is the standard way of combining the NVT
>>>>> block and index into a single word?  
>>>>
>>>> That number is the VP/NVT identifier which is written in the CAM value. 
>>>> The index is on 19 bits because of the NVT  definition in the END 
>>>> structure. It is being increased to 24 bits on Power10 
>>>>
>>>>> If so, I think we should
>>>>> standardize on passing a single word "nvt_id" around and only
>>>>> splitting it when we need to use the block separately.  
>>>>
>>>> This is really the only place where we concatenate the two NVT values,
>>>> block and index. 
>>>
>>> Hm, ok.  I know we don't model them (yet, maybe ever) but could
>>> combined values appear in the PowerBUS messages that handle remote
>>> notifications?
>>
>> They do. 
>>  
>>>>> Same goes for
>>>>> the end_id, assuming there's a standard way of putting that into a
>>>>> single word.  That will address the point I raised earlier about lisn
>>>>> being passed around as a single word, but these later stage ids being
>>>>> split.
>>>>
>>>> Hmm, I am not sure this is a good option. It is not how the PowerNV 
>>>> model would use it, skiboot is very much aware of these blocks and 
>>>> indexes and for remote accesses chips are identified using the block. 
>>>> I will take a look at it but I am not fond of it. I can add helpers
>>>> in some places though.    
>>>
>>> Hm, ok.  Do the block and index appear as an (effectively) single
>>> field in the EAS?
>>
>> no. In all XIVE structures, block and index are always distinct.
> 
> Hm.  Distinct in what sense?  I get that the fields are labelled
> separately in the documentation, but if the fields are adjacent, you
> could equally well treat them as one.

yes. Indeed. They are adjacent. The size of the index is subject to 
change in P10. 

I am not sure that treating them as one will be of any help because 
we need to extract them from their XIVE structure with the *_INDEX 
and *_BLOCK masks first. I will take a look. Maybe not in v6.


>>>>>> +    if (!match->tctx) {
>>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
>>>>>> +                      nvt_blk, nvt_idx);
>>>>>> +        return false;
>>>>>
>>>>> Hmm.. this isn't actually an error, is it? At least not for powernv
>>>>
>>>> It is on sPAPR, it would mean the END was configured with an unknown CPU.
>>>
>>> Right.
>>>
>>>> It is not an error on PowerNV, when we support escalations.
>>>>
>>>>> - that just means the NVT isn't currently dispatched, so we'll need to
>>>>> trigger the escalation interrupt.  
>>>>
>>>> Yes.
>>>>
>>>>> Does this get changed later in the series?
>>>>
>>>> No.
>>>
>>> But this code is common to PAPR and powernv, yes, so it will need to?
>>
>> When we add support for escalations, yes, it will change. Would you rather
>> use an error_report() until then ?
> 
> Ah, I guess leaving an error until we implement escalation makes
> sense.  It shouldn't be LOG_GUEST_ERROR, though, the guest didn't do
> anything wrong, and error_report() doesn't really make sense for the
> same reason.
> 
> LOG_UNIMP, I guess?

OK. will do.

Thanks,

C.

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-11-28 16:27     ` Cédric Le Goater
  2018-11-29  0:54       ` David Gibson
@ 2018-12-04 17:12       ` Cédric Le Goater
  2018-12-05  1:41         ` David Gibson
  1 sibling, 1 reply; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-04 17:12 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[ ... ]

>>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
>>> +{
>>> +    int i;
>>> +    uint32_t offset = 0;
>>> +
>>> +    monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
>>> +                   offset + xive->source.nr_irqs - 1);
>>> +    xive_source_pic_print_info(&xive->source, offset, mon);
>>> +
>>> +    monitor_printf(mon, "XIVE EAT %08x .. %08x\n", 0, xive->nr_irqs - 1);
>>> +    for (i = 0; i < xive->nr_irqs; i++) {
>>> +        xive_eas_pic_print_info(&xive->eat[i], i, mon);
>>> +    }
>>> +
>>> +    monitor_printf(mon, "XIVE ENDT %08x .. %08x\n", 0, xive->nr_ends - 1);
>>> +    for (i = 0; i < xive->nr_ends; i++) {
>>> +        xive_end_pic_print_info(&xive->endt[i], i, mon);
>>> +    }
>>
>> AIUI the PAPR model hides the details of ENDs, EQs and NVTs - instead
>> each logical EAS just points at a (thread, priority) pair, which under
>> the hood has exactly one END and one NVT bound to it.
>>
>> Given that, would it make more sense to reformat the info here to show
>> things in terms of those (thread, priority) pairs, rather than the
>> internal EAS and END details?
> 
> Yes. I had a version doing something like that before. I will rework
> the output a little for sPAPR.

I would like to keep the 'advanced' monitor output in some form and have
two possible outputs: simple and long.

Is it possible to add command line options or arguments to the Monitor
interface?

Thanks,

C. 

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter
  2018-12-04 17:04             ` Cédric Le Goater
@ 2018-12-05  1:40               ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-12-05  1:40 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 5085 bytes --]

On Tue, Dec 04, 2018 at 06:04:13PM +0100, Cédric Le Goater wrote:
> On 12/4/18 2:54 AM, David Gibson wrote:
> > On Mon, Dec 03, 2018 at 06:05:12PM +0100, Cédric Le Goater wrote:
> >> I forgot to reply to this one.
> >>
> >> On 11/29/18 1:47 AM, David Gibson wrote:
> >>> On Wed, Nov 28, 2018 at 11:59:58AM +0100, Cédric Le Goater wrote:
> >>>> On 11/28/18 12:49 AM, David Gibson wrote:
> >>>>> On Fri, Nov 16, 2018 at 11:57:01AM +0100, Cédric Le Goater wrote:
> >>>>>> The last sub-engine of the XIVE architecture is the Interrupt
> >>>>>> Virtualization Presentation Engine (IVPE). On HW, they share elements,
> >>>>>> the Power Bus interface (CQ), the routing table descriptors, and they
> >>>>>> can be combined in the same HW logic. We do the same in QEMU and
> >>>>>> combine both engines in the XiveRouter for simplicity.
> >>>>>
> >>>>> Ok, I'm not entirely convinced combining the IVPE and IVRE into a
> >>>>> single object is a good idea, but we can probably discuss that once
> >>>>> I've read further.
> >>>>
> >>>> We could introduce a simplified presenter for sPAPR but I am not even
> >>>> sure of that as it will get more complex if we support the EBB
> >>>> one day.
> > 
> > [snip]
> >>>>>> +static inline uint32_t xive_tctx_cam_line(uint8_t nvt_blk, uint32_t nvt_idx)
> >>>>>> +{
> >>>>>> +    return (nvt_blk << 19) | nvt_idx;
> >>>>>
> >>>>> I'm guessing this formula is the standard way of combining the NVT
> >>>>> block and index into a single word?  
> >>>>
> >>>> That number is the VP/NVT identifier which is written in the CAM value. 
> >>>> The index is on 19 bits because of the NVT  definition in the END 
> >>>> structure. It is being increased to 24 bits on Power10.
> >>>>
> >>>>> If so, I think we should
> >>>>> standardize on passing a single word "nvt_id" around and only
> >>>>> splitting it when we need to use the block separately.  
> >>>>
> >>>> This is really the only place where we concatenate the two NVT values,
> >>>> block and index. 
> >>>
> >>> Hm, ok.  I know we don't model them (yet, maybe ever) but could
> >>> combined values appear in the PowerBUS messages that handle remote
> >>> notifications?
> >>
> >> They do. 
> >>  
> >>>>> Same goes for
> >>>>> the end_id, assuming there's a standard way of putting that into a
> >>>>> single word.  That will address the point I raised earlier about lisn
> >>>>> being passed around as a single word, but these later stage ids being
> >>>>> split.
> >>>>
> >>>> Hmm, I am not sure this is a good option. It is not how the PowerNV 
> >>>> model would use it, skiboot is very much aware of these blocks and 
> >>>> indexes and for remote accesses chips are identified using the block. 
> >>>> I will take a look at it but I am not fond of it. I can add helpers 
> >>>> in some places though.    
> >>>
> >>> Hm, ok.  Do the block and index appear as an (effectively) single
> >>> field in the EAS?
> >>
> >> no. In all XIVE structures, block and index are always distinct.
> > 
> > Hm.  Distinct in what sense?  I get that the fields are labelled
> > separately in the documentation, but if the fields are adjacent, you
> > could equally well treat them as one.
> 
> yes. Indeed. They are adjacent. The size of the index is subject to 
> change in P10. 

Ah, ok.  If the boundary might change in P10 that is indeed a reason
to keep them separate.
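
For illustration only, here is a minimal sketch of how the block/index
boundary could be kept in one place, so that a later change of the index
width only touches a single constant. The helper names (nvt_id_pack,
nvt_id_block, nvt_id_index) and the NVT_INDEX_BITS macro are hypothetical
and not part of the patch; the 19-bit width is taken from the quoted
xive_tctx_cam_line().

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 19-bit NVT index, as in the quoted xive_tctx_cam_line(). */
#define NVT_INDEX_BITS  19
#define NVT_INDEX_MASK  ((1u << NVT_INDEX_BITS) - 1)

/* Hypothetical helper: pack block and index into a single identifier. */
static inline uint32_t nvt_id_pack(uint8_t nvt_blk, uint32_t nvt_idx)
{
    /* An index wider than the field would corrupt the block bits. */
    assert((nvt_idx & ~NVT_INDEX_MASK) == 0);
    return ((uint32_t)nvt_blk << NVT_INDEX_BITS) | nvt_idx;
}

/* Hypothetical helper: extract the block from a packed identifier. */
static inline uint8_t nvt_id_block(uint32_t nvt_id)
{
    return (uint8_t)(nvt_id >> NVT_INDEX_BITS);
}

/* Hypothetical helper: extract the index from a packed identifier. */
static inline uint32_t nvt_id_index(uint32_t nvt_id)
{
    return nvt_id & NVT_INDEX_MASK;
}
```

With helpers of this kind, callers never open-code the shift, which is
the part that would have to change if the index grows.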

> I am not sure that treating them as one will be of any help because 
> we need to extract them from their XIVE structure with the *_INDEX 
> and *_BLOCK masks first. I will take a look. Maybe not in v6.

Ok.

> >>>>>> +    if (!match->tctx) {
> >>>>>> +        qemu_log_mask(LOG_GUEST_ERROR, "XIVE: NVT %x/%x is not dispatched\n",
> >>>>>> +                      nvt_blk, nvt_idx);
> >>>>>> +        return false;
> >>>>>
> >>>>> Hmm.. this isn't actually an error, is it? At least not for powernv
> >>>>
> >>>> It is on sPAPR, it would mean the END was configured with an unknown CPU. 
> >>>
> >>> Right.
> >>>
> >>>> It is not an error on PowerNV, when we support escalations.
> >>>>
> >>>>> - that just means the NVT isn't currently dispatched, so we'll need to
> >>>>> trigger the escalation interrupt.  
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Does this get changed later in the series?
> >>>>
> >>>> No.
> >>>
> >>> But this code is common to PAPR and powernv, yes, so it will need to?
> >>
> >> When we add support for escalations, yes, it will change. Would you rather
> >> use an error_report() until then ?
> > 
> > Ah, I guess leaving an error until we implement escalation makes
> > sense.  It shouldn't be LOG_GUEST_ERROR, though, the guest didn't do
> > anything wrong, and error_report() doesn't really make sense for the
> > same reason.
> > 
> > LOG_UNIMP, I guess?
> 
> OK. will do.
> 
> Thanks,
> 
> C.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller
  2018-12-04 17:12       ` Cédric Le Goater
@ 2018-12-05  1:41         ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-12-05  1:41 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 1930 bytes --]

On Tue, Dec 04, 2018 at 06:12:11PM +0100, Cédric Le Goater wrote:
> [ ... ]
> 
> >>> +void spapr_xive_pic_print_info(sPAPRXive *xive, Monitor *mon)
> >>> +{
> >>> +    int i;
> >>> +    uint32_t offset = 0;
> >>> +
> >>> +    monitor_printf(mon, "XIVE Source %08x .. %08x\n", offset,
> >>> +                   offset + xive->source.nr_irqs - 1);
> >>> +    xive_source_pic_print_info(&xive->source, offset, mon);
> >>> +
> >>> +    monitor_printf(mon, "XIVE EAT %08x .. %08x\n", 0, xive->nr_irqs - 1);
> >>> +    for (i = 0; i < xive->nr_irqs; i++) {
> >>> +        xive_eas_pic_print_info(&xive->eat[i], i, mon);
> >>> +    }
> >>> +
> >>> +    monitor_printf(mon, "XIVE ENDT %08x .. %08x\n", 0, xive->nr_ends - 1);
> >>> +    for (i = 0; i < xive->nr_ends; i++) {
> >>> +        xive_end_pic_print_info(&xive->endt[i], i, mon);
> >>> +    }
> >>
> >> AIUI the PAPR model hides the details of ENDs, EQs and NVTs - instead
> >> each logical EAS just points at a (thread, priority) pair, which under
> >> the hood has exactly one END and one NVT bound to it.
> >>
> >> Given that, would it make more sense to reformat the info here to show
> >> things in terms of those (thread, priority) pairs, rather than the
> >> internal EAS and END details?
> > 
> > Yes. I had a version doing something like that before. I will rework
> > the output a little for sPAPR.
> 
> I would like to keep the 'advanced' monitor output in some ways and have
> two possible outputs : simple and long.
> 
> Is it possible to add command line options or arguments to the Monitor 
> interface ?

Not to the "info pic" command specifically, no.  Or at least, not
without a lot of work.

> 
> Thanks,
> 
> C. 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type
  2018-12-04 15:14       ` Cédric Le Goater
@ 2018-12-05  1:44         ` David Gibson
  0 siblings, 0 replies; 184+ messages in thread
From: David Gibson @ 2018-12-05  1:44 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

[-- Attachment #1: Type: text/plain, Size: 5040 bytes --]

On Tue, Dec 04, 2018 at 04:14:12PM +0100, Cédric Le Goater wrote:
> On 11/28/18 11:37 PM, Cédric Le Goater wrote:
> > On 11/28/18 5:42 AM, David Gibson wrote:
> >> On Fri, Nov 16, 2018 at 11:57:12AM +0100, Cédric Le Goater wrote:
> >>> The interrupt mode is statically defined to XIVE only for this machine.
> >>> The guest OS is required to have support for the XIVE exploitation
> >>> mode of the POWER9 interrupt controller.
> >>>
> >>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>> ---
> >>>  include/hw/ppc/spapr_irq.h |  1 +
> >>>  hw/ppc/spapr.c             | 36 +++++++++++++++++++++++++++++++-----
> >>>  hw/ppc/spapr_irq.c         |  3 +++
> >>>  3 files changed, 35 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/include/hw/ppc/spapr_irq.h b/include/hw/ppc/spapr_irq.h
> >>> index c3b4c38145eb..b299dd794bff 100644
> >>> --- a/include/hw/ppc/spapr_irq.h
> >>> +++ b/include/hw/ppc/spapr_irq.h
> >>> @@ -33,6 +33,7 @@ void spapr_irq_msi_reset(sPAPRMachineState *spapr);
> >>>  typedef struct sPAPRIrq {
> >>>      uint32_t    nr_irqs;
> >>>      uint32_t    nr_msis;
> >>> +    uint8_t     ov5;
> >>
> >> I'm a bit confused as to what exactly this represents..
> > 
> > The option vector 5 bits advertised by CAS for the platform. What the
> > hypervisor supports.
> 
> 0x80 both mode
> 0x40 XIVE only
> 0x00 XICS only

Yes....

> 
> >>
> >>>      void (*init)(sPAPRMachineState *spapr, int nr_irqs, int nr_servers,
> >>>                   Error **errp);
> >>> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> >>> index ad1692cdcd0f..8fbb743769db 100644
> >>> --- a/hw/ppc/spapr.c
> >>> +++ b/hw/ppc/spapr.c
> >>> @@ -1097,12 +1097,14 @@ static void spapr_dt_rtas(sPAPRMachineState *spapr, void *fdt)
> >>>      spapr_dt_rtas_tokens(fdt, rtas);
> >>>  }
> >>>  
> >>> -/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU features
> >>> - * that the guest may request and thus the valid values for bytes 24..26 of
> >>> - * option vector 5: */
> >>> -static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
> >>> +/* Prepare ibm,arch-vec-5-platform-support, which indicates the MMU
> >>> + * and the XIVE features that the guest may request and thus the valid
> >>> + * values for bytes 23..26 of option vector 5: */
> >>> +static void spapr_dt_ov5_platform_support(sPAPRMachineState *spapr, void *fdt,
> >>> +                                          int chosen)
> >>>  {
> >>>      PowerPCCPU *first_ppc_cpu = POWERPC_CPU(first_cpu);
> >>> +    sPAPRMachineClass *smc = SPAPR_MACHINE_GET_CLASS(spapr);
> >>>  
> >>>      char val[2 * 4] = {
> >>>          23, 0x00, /* Xive mode, filled in below. */
> >>> @@ -1123,7 +1125,11 @@ static void spapr_dt_ov5_platform_support(void *fdt, int chosen)
> >>>          } else {
> >>>              val[3] = 0x00; /* Hash */
> >>>          }
> >>> +        /* TODO: test KVM support */
> >>> +        val[1] = smc->irq->ov5;
> >>>      } else {
> >>> +        val[1] = smc->irq->ov5;
> >>
> >> ..here it seems to be a specific value for this OV5 byte, indicating the
> >> supported intc...
> > 
> > yes.
> >>
> >>> +
> >>>          /* V3 MMU supports both hash and radix in tcg (with dynamic switching) */
> >>>          val[3] = 0xC0;
> >>>      }
> >>> @@ -1191,7 +1197,7 @@ static void spapr_dt_chosen(sPAPRMachineState *spapr, void *fdt)
> >>>          _FDT(fdt_setprop_string(fdt, chosen, "stdout-path", stdout_path));
> >>>      }
> >>>  
> >>> -    spapr_dt_ov5_platform_support(fdt, chosen);
> >>> +    spapr_dt_ov5_platform_support(spapr, fdt, chosen);
> >>>  
> >>>      g_free(stdout_path);
> >>>      g_free(bootlist);
> >>> @@ -2622,6 +2628,11 @@ static void spapr_machine_init(MachineState *machine)
> >>>      /* advertise support for ibm,dyamic-memory-v2 */
> >>>      spapr_ovec_set(spapr->ov5, OV5_DRMEM_V2);
> >>>  
> >>> +    /* advertise XIVE */
> >>> +    if (smc->irq->ov5) {
> >>
> >> ..but here it seems to be a bool indicating XIVE support specifically.
> > 
> > ah. yes. I need to check this part. That was a while ago.
> 
> This is advertising XIVE again if the machine supports it. We need to 
> populate the DT node "ibm,arch-vec-5-platform-support" in routine
> spapr_dt_ov5_platform_support() *and* also to update the machine field 
> spapr->ov5. But it seems redundant to me. 
> 
> spapr->ov5 should be used to build the DT. Shouldn't it ? Or I really 
> missed something.

Possibly, but we are talking PAPR here, which is the king of putting
the same information in multiple places, differently encoded.  You'll
need to check it.

Regardless please don't use if (smc->irq->ov5) as a shortcut for if
(smc->irq->ov5 != XICS_ONLY).  The latter is much clearer and doesn't
mislead as to the type of ...->ov5.
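
As a rough sketch of what an explicit comparison could look like: the
enum and helper names below are hypothetical (the patch does not define
them), and the values are the OV5 encodings listed earlier in this mail
(0x80 both modes, 0x40 XIVE only, 0x00 XICS only).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names for the interrupt-mode byte of option vector 5. */
enum {
    IRQ_OV5_XICS_ONLY = 0x00,   /* legacy XICS only */
    IRQ_OV5_XIVE_ONLY = 0x40,   /* XIVE exploitation mode only */
    IRQ_OV5_BOTH      = 0x80,   /* guest may pick either mode */
};

/*
 * Hypothetical helper: spell out the comparison rather than relying on
 * the XICS-only encoding happening to be zero.
 */
static inline bool irq_ov5_advertises_xive(uint8_t ov5)
{
    return ov5 != IRQ_OV5_XICS_ONLY;
}
```

The behaviour is the same as the truthiness test, but the reader can see
that ov5 is an encoded mode and not a boolean.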

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 184+ messages in thread

* Re: [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support
  2018-12-03  2:26   ` David Gibson
@ 2018-12-06 15:14     ` Cédric Le Goater
  0 siblings, 0 replies; 184+ messages in thread
From: Cédric Le Goater @ 2018-12-06 15:14 UTC (permalink / raw)
  To: David Gibson; +Cc: qemu-ppc, qemu-devel, Benjamin Herrenschmidt

On 12/3/18 3:26 AM, David Gibson wrote:
> On Fri, Nov 16, 2018 at 11:57:29AM +0100, Cédric Le Goater wrote:
>> This is a simple model of the POWER9 XIVE interrupt controller for the
>> PowerNV machine. XIVE for baremetal is a complex controller and the
>> model only addresses the needs of the skiboot firmware.
>>
>> * Overall architecture
>>
>>               XIVE Interrupt Controller
>>               +-------------------------------------+       IPIs
>>               | +---------+ +---------+ +---------+ |    +--------+
>>               | |VC       | |CQ       | |PC       |----> | CORES  |
>>               | |     esb | |         | |         |----> |        |
>>               | |     eas | |  Bridge | |         |----> |        |
>>               | |SC   end | |         | |     nvt | |    |        |
>> +------+      | +---------+ +----+----+ +---------+ |    +--+-+-+-+
>> | RAM  |      +------------------|------------------+       | | |
>> |      |                         |                          | | |
>> |      |                         |                          | | |
>> |      |   +---------------------v--------------------------v-v-v---+      other
>> |      <---+                       Power Bus                        +----> chips
>> |  esb |   +-----------+-----------------------+--------------------+
>> |  eas |               |                       |
>> |  end |               |                       |
>> |  nvt |           +---+----+              +---+----+
>> +------+           |SC      |              |SC      |
>>                    |        |              |        |
>>                    | 2-bits |              | 2-bits |
>>                    | local  |              |   VC   |
>>                    +--------+              +--------+
>>                      PCIe                  NX,NPU,CAPI
>>
>>                   SC: Source Controller (aka. IVSE)
>>                   VC: Virtualization Controller (aka. IVRE)
>>                   CQ: Common Queue (Bridge)
>>                   PC: Presentation Controller (aka. IVPE)
>>
>>               2-bits: source state machine
>>                  esb: Event State Buffer (Array of PQ bits in an IVSE)
>>                  eas: Event Assignment Structure
>>                  end: Event Notification Descriptor
>>                  nvt: Notification Virtual Target
>>
>> It is composed of three sub-engines :
>>
>>   - Interrupt Virtualization Source Engine (IVSE), or Source
>>     Controller (SC). These are found in PCI PHBs, in the PSI host
>>     bridge controller, but also inside the main controller for the
>>     core IPIs and other sub-chips (NX, CAP, NPU) of the
>>     chip/processor. They are configured to feed the IVRE with events.
>>
>>   - Interrupt Virtualization Routing Engine (IVRE) or Virtualization
>>     Controller (VC). Its job is to match an event source with an Event
>>     Notification Descriptor (END).
>>
>>   - Interrupt Virtualization Presentation Engine (IVPE) or Presentation
>>     Controller (PC). It maintains the interrupt context state of each
>>     thread and handles the delivery of the external exception to the
>>     thread.
>>
>> * XIVE internal tables
>>
>> Each of the sub-engines uses a set of tables to redirect exceptions
>> from event sources to CPU threads.
>>
>>                                              +-------+
>>    User or OS                                |  EQ   |
>>        or                            +------>|entries|
>>    Hypervisor                        |       |  ..   |
>>      Memory                          |       +-------+
>>                                      |           ^
>>                                      |           |
>>                +--------------------------------------------------+
>>                                      |           |
>>    Hypervisor        +------+    +---+--+    +---+--+   +------+
>>      Memory          | ESB  |    | EAT  |    | ENDT |   | NVTT |
>>     (skiboot)        +----+-+    +----+-+    +----+-+   +------+
>>                        ^  |        ^  |        ^  |       ^
>>                        |  |        |  |        |  |       |
>>                +--------------------------------------------------+
>>                        |  |        |  |        |  |       |
>>                        |  |        |  |        |  |       |
>>                  +-----|--|--------|--|--------|--|-+   +-|-----+    +------+
>>                  |     |  |        |  |        |  | |   | | tctx|    |Thread|
>>     IPI or   ----+     +  v        +  v        +  v |---| +  .. |----->     |
>>    HW events     |                                  |   |       |    |      |
>>                  |              IVRE                |   | IVPE  |    +------+
>>                  +----------------------------------+   +-------+
>>
>> Each IVSE has a 2-bit state machine per source, P for pending and Q
>> for queued, that gates the triggering of events. The PQ bits are
>> stored in an array, the Event State Buffer (ESB), and are controlled
>> by MMIOs.
>>
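
To make the 2-bit source state machine concrete, here is a small
standalone sketch. It is not taken from the patch: the names are
hypothetical, and the transitions (including the encoding with P as the
high bit and Q as the low bit, and P=0 Q=1 meaning "off") reflect my
understanding of the usual XIVE PQ semantics, so treat them as an
assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed PQ encodings: P is the high bit, Q is the low bit. */
enum {
    ESB_RESET   = 0x0,  /* P=0 Q=0: idle, the next trigger is forwarded */
    ESB_OFF     = 0x1,  /* P=0 Q=1: source is masked */
    ESB_PENDING = 0x2,  /* P=1 Q=0: event forwarded, EOI not yet done */
    ESB_QUEUED  = 0x3,  /* P=1 Q=1: another event arrived while pending */
};

/* Returns true when the trigger should be forwarded to the router. */
static bool esb_trigger(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_RESET:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
    case ESB_QUEUED:
        *pq = ESB_QUEUED;   /* coalesce further events */
        return false;
    default:                /* ESB_OFF: masked, state unchanged */
        return false;
    }
}

/* Returns true when a queued event must be replayed after the EOI. */
static bool esb_eoi(uint8_t *pq)
{
    switch (*pq & 0x3) {
    case ESB_QUEUED:
        *pq = ESB_PENDING;
        return true;
    case ESB_PENDING:
        *pq = ESB_RESET;
        return false;
    default:                /* ESB_RESET or ESB_OFF: state unchanged */
        return false;
    }
}
```
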
>> If the event is let through, the IVRE looks up the Event Assignment
>> Structure (EAS) table to find the Event Notification Descriptor (END)
>> configured for the source. Each Event Notification Descriptor defines
>> a notification path to a CPU and an in-memory Event Queue, into which
>> an EQ data entry is pushed for the OS to pull.
>>
>> The IVPE determines if a Notification Virtual Target (NVT) can handle
>> the event by scanning the thread contexts of the VPs dispatched on the
>> processor HW threads. It maintains the interrupt context state of each
>> thread in a NVT table.
>>
>> * QEMU model for PowerNV
>>
>> The PowerNV model reuses the common XIVE framework developed for sPAPR
>> and the fundamental aspects are much the same. The differences are
>> outlined below.
>>
>> The controller's initial BAR configuration is performed using the XSCOM
>> bus. From there, MMIOs are used for further configuration.
>>
>> The MMIO regions exposed are :
>>
>>  - Interrupt controller registers
>>  - ESB pages for IPIs and ENDs
>>  - Presenter MMIO (Not used)
>>  - Thread Interrupt Management Area MMIO, direct and indirect
>>
>> The Virtualization Controller MMIO region containing the IPI ESB pages and
>> END ESB pages is sub-divided into "sets" which map portions of the VC
>> region to the different ESB pages. It is configured at runtime through
>> the EDT set translation table to let the firmware decide how to split
>> the address space between IPI ESB pages and END ESB pages.
>>
>> The XIVE tables are now in the machine RAM and not in the hypervisor
>> anymore. The firmware (skiboot) configures these tables using Virtual
>> Structure Descriptors defining the characteristics of each table : SBE,
>> EAS, END and NVT. These are later used to access the virtual interrupt
>> entries. The internal cache of these tables in the interrupt controller
>> is updated and invalidated using a set of registers.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  hw/intc/pnv_xive_regs.h    |  314 +++++++
>>  include/hw/ppc/pnv.h       |   22 +-
>>  include/hw/ppc/pnv_xive.h  |  100 +++
>>  include/hw/ppc/pnv_xscom.h |    3 +
>>  include/hw/ppc/xive.h      |    1 +
>>  hw/intc/pnv_xive.c         | 1612 ++++++++++++++++++++++++++++++++++++
>>  hw/intc/xive.c             |   63 +-
>>  hw/ppc/pnv.c               |   58 +-
>>  hw/intc/Makefile.objs      |    2 +-
>>  9 files changed, 2164 insertions(+), 11 deletions(-)
>>  create mode 100644 hw/intc/pnv_xive_regs.h
>>  create mode 100644 include/hw/ppc/pnv_xive.h
>>  create mode 100644 hw/intc/pnv_xive.c
>>
>> diff --git a/hw/intc/pnv_xive_regs.h b/hw/intc/pnv_xive_regs.h
>> new file mode 100644
>> index 000000000000..509d5a18cdde
>> --- /dev/null
>> +++ b/hw/intc/pnv_xive_regs.h
>> @@ -0,0 +1,314 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_PNV_XIVE_REGS_H
>> +#define PPC_PNV_XIVE_REGS_H
>> +
>> +/* IC register offsets 0x0 - 0x400 */
>> +#define CQ_SWI_CMD_HIST         0x020
>> +#define CQ_SWI_CMD_POLL         0x028
>> +#define CQ_SWI_CMD_BCAST        0x030
>> +#define CQ_SWI_CMD_ASSIGN       0x038
>> +#define CQ_SWI_CMD_BLK_UPD      0x040
>> +#define CQ_SWI_RSP              0x048
>> +#define X_CQ_CFG_PB_GEN         0x0a
>> +#define CQ_CFG_PB_GEN           0x050
>> +#define   CQ_INT_ADDR_OPT       PPC_BITMASK(14, 15)
>> +#define X_CQ_IC_BAR             0x10
>> +#define X_CQ_MSGSND             0x0b
>> +#define CQ_MSGSND               0x058
>> +#define CQ_CNPM_SEL             0x078
>> +#define CQ_IC_BAR               0x080
>> +#define   CQ_IC_BAR_VALID       PPC_BIT(0)
>> +#define   CQ_IC_BAR_64K         PPC_BIT(1)
>> +#define X_CQ_TM1_BAR            0x12
>> +#define CQ_TM1_BAR              0x90
>> +#define X_CQ_TM2_BAR            0x014
>> +#define CQ_TM2_BAR              0x0a0
>> +#define   CQ_TM_BAR_VALID       PPC_BIT(0)
>> +#define   CQ_TM_BAR_64K         PPC_BIT(1)
>> +#define X_CQ_PC_BAR             0x16
>> +#define CQ_PC_BAR               0x0b0
>> +#define  CQ_PC_BAR_VALID        PPC_BIT(0)
>> +#define X_CQ_PC_BARM            0x17
>> +#define CQ_PC_BARM              0x0b8
>> +#define  CQ_PC_BARM_MASK        PPC_BITMASK(26, 38)
>> +#define X_CQ_VC_BAR             0x18
>> +#define CQ_VC_BAR               0x0c0
>> +#define  CQ_VC_BAR_VALID        PPC_BIT(0)
>> +#define X_CQ_VC_BARM            0x19
>> +#define CQ_VC_BARM              0x0c8
>> +#define  CQ_VC_BARM_MASK        PPC_BITMASK(21, 37)
>> +#define X_CQ_TAR                0x1e
>> +#define CQ_TAR                  0x0f0
>> +#define  CQ_TAR_TBL_AUTOINC     PPC_BIT(0)
>> +#define  CQ_TAR_TSEL            PPC_BITMASK(12, 15)
>> +#define  CQ_TAR_TSEL_BLK        PPC_BIT(12)
>> +#define  CQ_TAR_TSEL_MIG        PPC_BIT(13)
>> +#define  CQ_TAR_TSEL_VDT        PPC_BIT(14)
>> +#define  CQ_TAR_TSEL_EDT        PPC_BIT(15)
>> +#define  CQ_TAR_TSEL_INDEX      PPC_BITMASK(26, 31)
>> +#define X_CQ_TDR                0x1f
>> +#define CQ_TDR                  0x0f8
>> +#define  CQ_TDR_VDT_VALID       PPC_BIT(0)
>> +#define  CQ_TDR_VDT_BLK         PPC_BITMASK(11, 15)
>> +#define  CQ_TDR_VDT_INDEX       PPC_BITMASK(28, 31)
>> +#define  CQ_TDR_EDT_TYPE        PPC_BITMASK(0, 1)
>> +#define  CQ_TDR_EDT_INVALID     0
>> +#define  CQ_TDR_EDT_IPI         1
>> +#define  CQ_TDR_EDT_EQ          2
>> +#define  CQ_TDR_EDT_BLK         PPC_BITMASK(12, 15)
>> +#define  CQ_TDR_EDT_INDEX       PPC_BITMASK(26, 31)
>> +#define X_CQ_PBI_CTL            0x20
>> +#define CQ_PBI_CTL              0x100
>> +#define  CQ_PBI_PC_64K          PPC_BIT(5)
>> +#define  CQ_PBI_VC_64K          PPC_BIT(6)
>> +#define  CQ_PBI_LNX_TRIG        PPC_BIT(7)
>> +#define  CQ_PBI_FORCE_TM_LOCAL  PPC_BIT(22)
>> +#define CQ_PBO_CTL              0x108
>> +#define CQ_AIB_CTL              0x110
>> +#define X_CQ_RST_CTL            0x23
>> +#define CQ_RST_CTL              0x118
>> +#define X_CQ_FIRMASK            0x33
>> +#define CQ_FIRMASK              0x198
>> +#define X_CQ_FIRMASK_AND        0x34
>> +#define CQ_FIRMASK_AND          0x1a0
>> +#define X_CQ_FIRMASK_OR         0x35
>> +#define CQ_FIRMASK_OR           0x1a8
>> +
>> +/* PC LBS1 register offsets 0x400 - 0x800 */
>> +#define X_PC_TCTXT_CFG          0x100
>> +#define PC_TCTXT_CFG            0x400
>> +#define  PC_TCTXT_CFG_BLKGRP_EN         PPC_BIT(0)
>> +#define  PC_TCTXT_CFG_TARGET_EN         PPC_BIT(1)
>> +#define  PC_TCTXT_CFG_LGS_EN            PPC_BIT(2)
>> +#define  PC_TCTXT_CFG_STORE_ACK         PPC_BIT(3)
>> +#define  PC_TCTXT_CFG_HARD_CHIPID_BLK   PPC_BIT(8)
>> +#define  PC_TCTXT_CHIPID_OVERRIDE       PPC_BIT(9)
>> +#define  PC_TCTXT_CHIPID                PPC_BITMASK(12, 15)
>> +#define  PC_TCTXT_INIT_AGE              PPC_BITMASK(30, 31)
>> +#define X_PC_TCTXT_TRACK        0x101
>> +#define PC_TCTXT_TRACK          0x408
>> +#define  PC_TCTXT_TRACK_EN              PPC_BIT(0)
>> +#define X_PC_TCTXT_INDIR0       0x104
>> +#define PC_TCTXT_INDIR0         0x420
>> +#define  PC_TCTXT_INDIR_VALID           PPC_BIT(0)
>> +#define  PC_TCTXT_INDIR_THRDID          PPC_BITMASK(9, 15)
>> +#define X_PC_TCTXT_INDIR1       0x105
>> +#define PC_TCTXT_INDIR1         0x428
>> +#define X_PC_TCTXT_INDIR2       0x106
>> +#define PC_TCTXT_INDIR2         0x430
>> +#define X_PC_TCTXT_INDIR3       0x107
>> +#define PC_TCTXT_INDIR3         0x438
>> +#define X_PC_THREAD_EN_REG0     0x108
>> +#define PC_THREAD_EN_REG0       0x440
>> +#define X_PC_THREAD_EN_REG0_SET 0x109
>> +#define PC_THREAD_EN_REG0_SET   0x448
>> +#define X_PC_THREAD_EN_REG0_CLR 0x10a
>> +#define PC_THREAD_EN_REG0_CLR   0x450
>> +#define X_PC_THREAD_EN_REG1     0x10c
>> +#define PC_THREAD_EN_REG1       0x460
>> +#define X_PC_THREAD_EN_REG1_SET 0x10d
>> +#define PC_THREAD_EN_REG1_SET   0x468
>> +#define X_PC_THREAD_EN_REG1_CLR 0x10e
>> +#define PC_THREAD_EN_REG1_CLR   0x470
>> +#define X_PC_GLOBAL_CONFIG      0x110
>> +#define PC_GLOBAL_CONFIG        0x480
>> +#define  PC_GCONF_INDIRECT      PPC_BIT(32)
>> +#define  PC_GCONF_CHIPID_OVR    PPC_BIT(40)
>> +#define  PC_GCONF_CHIPID        PPC_BITMASK(44, 47)
>> +#define X_PC_VSD_TABLE_ADDR     0x111
>> +#define PC_VSD_TABLE_ADDR       0x488
>> +#define X_PC_VSD_TABLE_DATA     0x112
>> +#define PC_VSD_TABLE_DATA       0x490
>> +#define X_PC_AT_KILL            0x116
>> +#define PC_AT_KILL              0x4b0
>> +#define  PC_AT_KILL_VALID       PPC_BIT(0)
>> +#define  PC_AT_KILL_BLOCK_ID    PPC_BITMASK(27, 31)
>> +#define  PC_AT_KILL_OFFSET      PPC_BITMASK(48, 60)
>> +#define X_PC_AT_KILL_MASK       0x117
>> +#define PC_AT_KILL_MASK         0x4b8
>> +
>> +/* PC LBS2 register offsets */
>> +#define X_PC_VPC_CACHE_ENABLE   0x161
>> +#define PC_VPC_CACHE_ENABLE     0x708
>> +#define  PC_VPC_CACHE_EN_MASK   PPC_BITMASK(0, 31)
>> +#define X_PC_VPC_SCRUB_TRIG     0x162
>> +#define PC_VPC_SCRUB_TRIG       0x710
>> +#define X_PC_VPC_SCRUB_MASK     0x163
>> +#define PC_VPC_SCRUB_MASK       0x718
>> +#define  PC_SCRUB_VALID         PPC_BIT(0)
>> +#define  PC_SCRUB_WANT_DISABLE  PPC_BIT(1)
>> +#define  PC_SCRUB_WANT_INVAL    PPC_BIT(2)
>> +#define  PC_SCRUB_BLOCK_ID      PPC_BITMASK(27, 31)
>> +#define  PC_SCRUB_OFFSET        PPC_BITMASK(45, 63)
>> +#define X_PC_VPC_CWATCH_SPEC    0x167
>> +#define PC_VPC_CWATCH_SPEC      0x738
>> +#define  PC_VPC_CWATCH_CONFLICT PPC_BIT(0)
>> +#define  PC_VPC_CWATCH_FULL     PPC_BIT(8)
>> +#define  PC_VPC_CWATCH_BLOCKID  PPC_BITMASK(27, 31)
>> +#define  PC_VPC_CWATCH_OFFSET   PPC_BITMASK(45, 63)
>> +#define X_PC_VPC_CWATCH_DAT0    0x168
>> +#define PC_VPC_CWATCH_DAT0      0x740
>> +#define X_PC_VPC_CWATCH_DAT1    0x169
>> +#define PC_VPC_CWATCH_DAT1      0x748
>> +#define X_PC_VPC_CWATCH_DAT2    0x16a
>> +#define PC_VPC_CWATCH_DAT2      0x750
>> +#define X_PC_VPC_CWATCH_DAT3    0x16b
>> +#define PC_VPC_CWATCH_DAT3      0x758
>> +#define X_PC_VPC_CWATCH_DAT4    0x16c
>> +#define PC_VPC_CWATCH_DAT4      0x760
>> +#define X_PC_VPC_CWATCH_DAT5    0x16d
>> +#define PC_VPC_CWATCH_DAT5      0x768
>> +#define X_PC_VPC_CWATCH_DAT6    0x16e
>> +#define PC_VPC_CWATCH_DAT6      0x770
>> +#define X_PC_VPC_CWATCH_DAT7    0x16f
>> +#define PC_VPC_CWATCH_DAT7      0x778
>> +
>> +/* VC0 register offsets 0x800 - 0xFFF */
>> +#define X_VC_GLOBAL_CONFIG      0x200
>> +#define VC_GLOBAL_CONFIG        0x800
>> +#define  VC_GCONF_INDIRECT      PPC_BIT(32)
>> +#define X_VC_VSD_TABLE_ADDR     0x201
>> +#define VC_VSD_TABLE_ADDR       0x808
>> +#define X_VC_VSD_TABLE_DATA     0x202
>> +#define VC_VSD_TABLE_DATA       0x810
>> +#define VC_IVE_ISB_BLOCK_MODE   0x818
>> +#define VC_EQD_BLOCK_MODE       0x820
>> +#define VC_VPS_BLOCK_MODE       0x828
>> +#define X_VC_IRQ_CONFIG_IPI     0x208
>> +#define VC_IRQ_CONFIG_IPI       0x840
>> +#define  VC_IRQ_CONFIG_MEMB_EN  PPC_BIT(45)
>> +#define  VC_IRQ_CONFIG_MEMB_SZ  PPC_BITMASK(46, 51)
>> +#define VC_IRQ_CONFIG_HW        0x848
>> +#define VC_IRQ_CONFIG_CASCADE1  0x850
>> +#define VC_IRQ_CONFIG_CASCADE2  0x858
>> +#define VC_IRQ_CONFIG_REDIST    0x860
>> +#define VC_IRQ_CONFIG_IPI_CASC  0x868
>> +#define X_VC_AIB_TX_ORDER_TAG2  0x22d
>> +#define  VC_AIB_TX_ORDER_TAG2_REL_TF    PPC_BIT(20)
>> +#define VC_AIB_TX_ORDER_TAG2    0x890
>> +#define X_VC_AT_MACRO_KILL      0x23e
>> +#define VC_AT_MACRO_KILL        0x8b0
>> +#define X_VC_AT_MACRO_KILL_MASK 0x23f
>> +#define VC_AT_MACRO_KILL_MASK   0x8b8
>> +#define  VC_KILL_VALID          PPC_BIT(0)
>> +#define  VC_KILL_TYPE           PPC_BITMASK(14, 15)
>> +#define   VC_KILL_IRQ   0
>> +#define   VC_KILL_IVC   1
>> +#define   VC_KILL_SBC   2
>> +#define   VC_KILL_EQD   3
>> +#define  VC_KILL_BLOCK_ID       PPC_BITMASK(27, 31)
>> +#define  VC_KILL_OFFSET         PPC_BITMASK(48, 60)
>> +#define X_VC_EQC_CACHE_ENABLE   0x211
>> +#define VC_EQC_CACHE_ENABLE     0x908
>> +#define  VC_EQC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
>> +#define X_VC_EQC_SCRUB_TRIG     0x212
>> +#define VC_EQC_SCRUB_TRIG       0x910
>> +#define X_VC_EQC_SCRUB_MASK     0x213
>> +#define VC_EQC_SCRUB_MASK       0x918
>> +#define X_VC_EQC_CWATCH_SPEC    0x215
>> +#define VC_EQC_CONFIG           0x920
>> +#define X_VC_EQC_CONFIG         0x214
>> +#define  VC_EQC_CONF_SYNC_IPI           PPC_BIT(32)
>> +#define  VC_EQC_CONF_SYNC_HW            PPC_BIT(33)
>> +#define  VC_EQC_CONF_SYNC_ESC1          PPC_BIT(34)
>> +#define  VC_EQC_CONF_SYNC_ESC2          PPC_BIT(35)
>> +#define  VC_EQC_CONF_SYNC_REDI          PPC_BIT(36)
>> +#define  VC_EQC_CONF_EQP_INTERLEAVE     PPC_BIT(38)
>> +#define  VC_EQC_CONF_ENABLE_END_s_BIT   PPC_BIT(39)
>> +#define  VC_EQC_CONF_ENABLE_END_u_BIT   PPC_BIT(40)
>> +#define  VC_EQC_CONF_ENABLE_END_c_BIT   PPC_BIT(41)
>> +#define  VC_EQC_CONF_ENABLE_MORE_QSZ    PPC_BIT(42)
>> +#define  VC_EQC_CONF_SKIP_ESCALATE      PPC_BIT(43)
>> +#define VC_EQC_CWATCH_SPEC      0x928
>> +#define  VC_EQC_CWATCH_CONFLICT PPC_BIT(0)
>> +#define  VC_EQC_CWATCH_FULL     PPC_BIT(8)
>> +#define  VC_EQC_CWATCH_BLOCKID  PPC_BITMASK(28, 31)
>> +#define  VC_EQC_CWATCH_OFFSET   PPC_BITMASK(40, 63)
>> +#define X_VC_EQC_CWATCH_DAT0    0x216
>> +#define VC_EQC_CWATCH_DAT0      0x930
>> +#define X_VC_EQC_CWATCH_DAT1    0x217
>> +#define VC_EQC_CWATCH_DAT1      0x938
>> +#define X_VC_EQC_CWATCH_DAT2    0x218
>> +#define VC_EQC_CWATCH_DAT2      0x940
>> +#define X_VC_EQC_CWATCH_DAT3    0x219
>> +#define VC_EQC_CWATCH_DAT3      0x948
>> +#define X_VC_IVC_SCRUB_TRIG     0x222
>> +#define VC_IVC_SCRUB_TRIG       0x990
>> +#define X_VC_IVC_SCRUB_MASK     0x223
>> +#define VC_IVC_SCRUB_MASK       0x998
>> +#define X_VC_SBC_SCRUB_TRIG     0x232
>> +#define VC_SBC_SCRUB_TRIG       0xa10
>> +#define X_VC_SBC_SCRUB_MASK     0x233
>> +#define VC_SBC_SCRUB_MASK       0xa18
>> +#define  VC_SCRUB_VALID         PPC_BIT(0)
>> +#define  VC_SCRUB_WANT_DISABLE  PPC_BIT(1)
>> +#define  VC_SCRUB_WANT_INVAL    PPC_BIT(2) /* EQC and SBC only */
>> +#define  VC_SCRUB_BLOCK_ID      PPC_BITMASK(28, 31)
>> +#define  VC_SCRUB_OFFSET        PPC_BITMASK(40, 63)
>> +#define X_VC_IVC_CACHE_ENABLE   0x221
>> +#define VC_IVC_CACHE_ENABLE     0x988
>> +#define  VC_IVC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
>> +#define X_VC_SBC_CACHE_ENABLE   0x231
>> +#define VC_SBC_CACHE_ENABLE     0xa08
>> +#define  VC_SBC_CACHE_EN_MASK   PPC_BITMASK(0, 15)
>> +#define VC_IVC_CACHE_SCRUB_TRIG 0x990
>> +#define VC_IVC_CACHE_SCRUB_MASK 0x998
>> +#define VC_SBC_CACHE_ENABLE     0xa08
>> +#define VC_SBC_CACHE_SCRUB_TRIG 0xa10
>> +#define VC_SBC_CACHE_SCRUB_MASK 0xa18
>> +#define VC_SBC_CONFIG           0xa20
>> +#define X_VC_SBC_CONFIG         0x234
>> +#define  VC_SBC_CONF_CPLX_CIST  PPC_BIT(44)
>> +#define  VC_SBC_CONF_CIST_BOTH  PPC_BIT(45)
>> +#define  VC_SBC_CONF_NO_UPD_PRF PPC_BIT(59)
>> +
>> +/* VC1 register offsets */
>> +
>> +/* VSD Table address register definitions (shared) */
>> +#define VST_ADDR_AUTOINC        PPC_BIT(0)
>> +#define VST_TABLE_SELECT        PPC_BITMASK(13, 15)
>> +#define  VST_TSEL_IVT   0
>> +#define  VST_TSEL_SBE   1
>> +#define  VST_TSEL_EQDT  2
>> +#define  VST_TSEL_VPDT  3
>> +#define  VST_TSEL_IRQ   4       /* VC only */
>> +#define VST_TABLE_BLOCK        PPC_BITMASK(27, 31)
>> +
>> +/* Number of queue overflow pages */
>> +#define VC_QUEUE_OVF_COUNT      6
>> +
>> +/* Bits in a VSD entry.
>> + *
>> + * Note: the address is naturally aligned,  we don't use a PPC_BITMASK,
>> + *       but just a mask to apply to the address before OR'ing it in.
>> + *
>> + * Note: VSD_FIRMWARE is a SW bit ! It hijacks an unused bit in the
>> + *       VSD and is only meant to be used in indirect mode !
>> + */
>> +#define VSD_MODE                PPC_BITMASK(0, 1)
>> +#define  VSD_MODE_SHARED        1
>> +#define  VSD_MODE_EXCLUSIVE     2
>> +#define  VSD_MODE_FORWARD       3
>> +#define VSD_ADDRESS_MASK        0x0ffffffffffff000ull
>> +#define VSD_MIGRATION_REG       PPC_BITMASK(52, 55)
>> +#define VSD_INDIRECT            PPC_BIT(56)
>> +#define VSD_TSIZE               PPC_BITMASK(59, 63)
>> +#define VSD_FIRMWARE            PPC_BIT(2) /* Read warning above */
>> +
>> +#define VC_EQC_SYNC_MASK         \
>> +        (VC_EQC_CONF_SYNC_IPI  | \
>> +         VC_EQC_CONF_SYNC_HW   | \
>> +         VC_EQC_CONF_SYNC_ESC1 | \
>> +         VC_EQC_CONF_SYNC_ESC2 | \
>> +         VC_EQC_CONF_SYNC_REDI)
>> +
>> +
>> +#endif /* PPC_PNV_XIVE_REGS_H */
>> diff --git a/include/hw/ppc/pnv.h b/include/hw/ppc/pnv.h
>> index 86d5f54e5459..402dd8f6452c 100644
>> --- a/include/hw/ppc/pnv.h
>> +++ b/include/hw/ppc/pnv.h
>> @@ -25,6 +25,7 @@
>>  #include "hw/ppc/pnv_lpc.h"
>>  #include "hw/ppc/pnv_psi.h"
>>  #include "hw/ppc/pnv_occ.h"
>> +#include "hw/ppc/pnv_xive.h"
>>  
>>  #define TYPE_PNV_CHIP "pnv-chip"
>>  #define PNV_CHIP(obj) OBJECT_CHECK(PnvChip, (obj), TYPE_PNV_CHIP)
>> @@ -82,6 +83,7 @@ typedef struct Pnv9Chip {
>>      PnvChip      parent_obj;
>>  
>>      /*< public >*/
>> +    PnvXive      xive;
>>  } Pnv9Chip;
>>  
>>  typedef struct PnvChipClass {
>> @@ -205,7 +207,6 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
>>  #define PNV_ICP_BASE(chip)                                              \
>>      (0x0003ffff80000000ull + (uint64_t) PNV_CHIP_INDEX(chip) * PNV_ICP_SIZE)
>>  
>> -
>>  #define PNV_PSIHB_SIZE       0x0000000000100000ull
>>  #define PNV_PSIHB_BASE(chip) \
>>      (0x0003fffe80000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * PNV_PSIHB_SIZE)
>> @@ -215,4 +216,23 @@ void pnv_bmc_powerdown(IPMIBmc *bmc);
>>      (0x0003ffe000000000ull + (uint64_t)PNV_CHIP_INDEX(chip) * \
>>       PNV_PSIHB_FSP_SIZE)
>>  
>> +/*
>> + * POWER9 MMIO base addresses
>> + */
>> +#define PNV9_CHIP_BASE(chip, base)   \
>> +    ((base) + ((uint64_t) (chip)->chip_id << 42))
>> +
>> +#define PNV9_XIVE_VC_SIZE            0x0000008000000000ull
>> +#define PNV9_XIVE_VC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006010000000000ull)
>> +
>> +#define PNV9_XIVE_PC_SIZE            0x0000001000000000ull
>> +#define PNV9_XIVE_PC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006018000000000ull)
>> +
>> +#define PNV9_XIVE_IC_SIZE            0x0000000000080000ull
>> +#define PNV9_XIVE_IC_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203100000ull)
>> +
>> +#define PNV9_XIVE_TM_SIZE            0x0000000000040000ull
>> +#define PNV9_XIVE_TM_BASE(chip)      PNV9_CHIP_BASE(chip, 0x0006030203180000ull)
>> +
>> +
>>  #endif /* _PPC_PNV_H */
>> diff --git a/include/hw/ppc/pnv_xive.h b/include/hw/ppc/pnv_xive.h
>> new file mode 100644
>> index 000000000000..5b64d4cafe8f
>> --- /dev/null
>> +++ b/include/hw/ppc/pnv_xive.h
>> @@ -0,0 +1,100 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#ifndef PPC_PNV_XIVE_H
>> +#define PPC_PNV_XIVE_H
>> +
>> +#include "hw/sysbus.h"
>> +#include "hw/ppc/xive.h"
>> +
>> +#define TYPE_PNV_XIVE "pnv-xive"
>> +#define PNV_XIVE(obj) OBJECT_CHECK(PnvXive, (obj), TYPE_PNV_XIVE)
>> +
>> +#define XIVE_BLOCK_MAX      16
>> +
>> +#define XIVE_XLATE_BLK_MAX  16  /* Block Scope Table (0-15) */
>> +#define XIVE_XLATE_MIG_MAX  16  /* Migration Register Table (1-15) */
>> +#define XIVE_XLATE_VDT_MAX  16  /* VDT Domain Table (0-15) */
>> +#define XIVE_XLATE_EDT_MAX  64  /* EDT Domain Table (0-63) */
>> +
>> +typedef struct PnvXive {
>> +    XiveRouter    parent_obj;
>> +
>> +    /* Can be overridden by XIVE configuration */
>> +    uint32_t      thread_chip_id;
>> +    uint32_t      chip_id;
> 
> These have similar names but they're very different AFAICT - one is
> static configuration, the other runtime state.  I'd generally order
> structures so that configuration information is in one block, computed
> at initialization then static in another, then runtime state in a
> third - it's both clearer and (usually) more cache efficient.

Yes, this is good practice.
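
A rough sketch of that grouping, with hypothetical names. This is not the actual PnvXive layout, and which field lands in which group is an assumption made only for illustration:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical example state, not the real PnvXive structure */
typedef struct ExXiveState {
    /* 1. Static configuration: set once, before the device is realized */
    uint32_t chip_id;
    uint64_t ic_base;
    uint64_t vc_base;

    /* 2. Computed at initialization, constant afterwards (assumed) */
    uint32_t ic_shift;
    uint32_t vc_shift;

    /* 3. Runtime state: changes with guest register accesses */
    uint32_t thread_chip_id;
    uint64_t regs[0x300];
} ExXiveState;
```

Keeping the three blocks apart also makes it easier to see what would have to go in a vmstate description, since only the third block changes while the machine runs.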

 
> Sometimes that's less important than other logical groupings, but I
> don't think that's the case here.
> 
>> +
>> +    /* Interrupt controller regs */
>> +    uint64_t      regs[0x300];
>> +    MemoryRegion  xscom_regs;
>> +
>> +    /* For IPIs and accelerator interrupts */
>> +    uint32_t      nr_irqs;
>> +    XiveSource    source;
>> +
>> +    uint32_t      nr_ends;
>> +    XiveENDSource end_source;
>> +
>> +    /* Cache update registers */
>> +    uint64_t      eqc_watch[4];
>> +    uint64_t      vpc_watch[8];
>> +
>> +    /* Virtual Structure Table Descriptors : EAT, SBE, ENDT, NVTT, IRQ */
>> +    uint64_t      vsds[5][XIVE_BLOCK_MAX];
>> +
>> +    /* Set Translation tables */
>> +    bool          set_xlate_autoinc;
>> +    uint64_t      set_xlate_index;
>> +    uint64_t      set_xlate;
>> +
>> +    uint64_t      set_xlate_blk[XIVE_XLATE_BLK_MAX];
>> +    uint64_t      set_xlate_mig[XIVE_XLATE_MIG_MAX];
>> +    uint64_t      set_xlate_vdt[XIVE_XLATE_VDT_MAX];
>> +    uint64_t      set_xlate_edt[XIVE_XLATE_EDT_MAX];
>> +
>> +    /* Interrupt controller MMIO */
>> +    hwaddr        ic_base;
>> +    uint32_t      ic_shift;
>> +    MemoryRegion  ic_mmio;
>> +    MemoryRegion  ic_reg_mmio;
>> +    MemoryRegion  ic_notify_mmio;
>> +
>> +    /* VC memory regions */
>> +    hwaddr        vc_base;
>> +    uint64_t      vc_size;
>> +    uint32_t      vc_shift;
>> +    MemoryRegion  vc_mmio;
>> +
>> +    /* IPI and END address space to model the EDT segmentation */
>> +    uint32_t      edt_shift;
>> +    MemoryRegion  ipi_mmio;
>> +    AddressSpace  ipi_as;
>> +    MemoryRegion  end_mmio;
>> +    AddressSpace  end_as;
>> +
>> +    /* PC memory regions */
>> +    hwaddr        pc_base;
>> +    uint64_t      pc_size;
>> +    uint32_t      pc_shift;
>> +    MemoryRegion  pc_mmio;
>> +    uint32_t      vdt_shift;
>> +
>> +    /* TIMA memory regions */
>> +    hwaddr        tm_base;
>> +    uint32_t      tm_shift;
>> +    MemoryRegion  tm_mmio;
>> +    MemoryRegion  tm_mmio_indirect;
>> +
>> +    /* CPU for indirect TIMA access */
>> +    PowerPCCPU    *cpu_ind;
>> +} PnvXive;
>> +
>> +void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon);
>> +
>> +#endif /* PPC_PNV_XIVE_H */
>> diff --git a/include/hw/ppc/pnv_xscom.h b/include/hw/ppc/pnv_xscom.h
>> index 255b26a5aaf6..6623ec54a7a8 100644
>> --- a/include/hw/ppc/pnv_xscom.h
>> +++ b/include/hw/ppc/pnv_xscom.h
>> @@ -73,6 +73,9 @@ typedef struct PnvXScomInterfaceClass {
>>  #define PNV_XSCOM_OCC_BASE        0x0066000
>>  #define PNV_XSCOM_OCC_SIZE        0x6000
>>  
>> +#define PNV9_XSCOM_XIVE_BASE      0x5013000
>> +#define PNV9_XSCOM_XIVE_SIZE      0x300
>> +
>>  extern void pnv_xscom_realize(PnvChip *chip, Error **errp);
>>  extern int pnv_dt_xscom(PnvChip *chip, void *fdt, int offset);
>>  
>> diff --git a/include/hw/ppc/xive.h b/include/hw/ppc/xive.h
>> index c8201462d698..6089511cff83 100644
>> --- a/include/hw/ppc/xive.h
>> +++ b/include/hw/ppc/xive.h
>> @@ -237,6 +237,7 @@ int xive_router_get_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>                          XiveNVT *nvt);
>>  int xive_router_set_nvt(XiveRouter *xrtr, uint8_t nvt_blk, uint32_t nvt_idx,
>>                          XiveNVT *nvt);
>> +void xive_router_notify(XiveFabric *xf, uint32_t lisn);
>>  
>>  /*
>>   * XIVE END ESBs
>> diff --git a/hw/intc/pnv_xive.c b/hw/intc/pnv_xive.c
>> new file mode 100644
>> index 000000000000..9f0c41cdb750
>> --- /dev/null
>> +++ b/hw/intc/pnv_xive.c
>> @@ -0,0 +1,1612 @@
>> +/*
>> + * QEMU PowerPC XIVE interrupt controller model
>> + *
>> + * Copyright (c) 2017-2018, IBM Corporation.
>> + *
>> + * This code is licensed under the GPL version 2 or later. See the
>> + * COPYING file in the top-level directory.
>> + */
>> +
>> +#include "qemu/osdep.h"
>> +#include "qemu/log.h"
>> +#include "qapi/error.h"
>> +#include "target/ppc/cpu.h"
>> +#include "sysemu/cpus.h"
>> +#include "sysemu/dma.h"
>> +#include "monitor/monitor.h"
>> +#include "hw/ppc/fdt.h"
>> +#include "hw/ppc/pnv.h"
>> +#include "hw/ppc/pnv_xscom.h"
>> +#include "hw/ppc/pnv_xive.h"
>> +#include "hw/ppc/xive_regs.h"
>> +#include "hw/ppc/ppc.h"
>> +
>> +#include <libfdt.h>
>> +
>> +#include "pnv_xive_regs.h"
>> +
>> +/*
>> + * Interrupt source number encoding
>> + */
>> +#define SRCNO_BLOCK(srcno)        (((srcno) >> 28) & 0xf)
>> +#define SRCNO_INDEX(srcno)        ((srcno) & 0x0fffffff)
>> +#define XIVE_SRCNO(blk, idx)      ((uint32_t)(blk) << 28 | (idx))
>> +
>> +/*
>> + * Virtual structures table accessors
>> + */
>> +typedef struct XiveVstInfo {
>> +    const char *name;
>> +    uint32_t    size;
>> +    uint32_t    max_blocks;
>> +} XiveVstInfo;
>> +
>> +static const XiveVstInfo vst_infos[] = {
>> +    [VST_TSEL_IVT]  = { "EAT",  sizeof(XiveEAS), 16 },
>> +    [VST_TSEL_SBE]  = { "SBE",  0,               16 },
>> +    [VST_TSEL_EQDT] = { "ENDT", sizeof(XiveEND), 16 },
>> +    [VST_TSEL_VPDT] = { "VPDT", sizeof(XiveNVT),  32 },
> 
> Are those VST_TSEL_* things named in the XIVE documentation?  If not,
> you probably want to rename them to reflect the new-style naming.

The register documentation still refers to IVE, ESB, EQD, VPD ...

> 
>> +    /* Interrupt fifo backing store table :
>> +     *
>> +     * 0 - IPI,
>> +     * 1 - HWD,
>> +     * 2 - First escalate,
>> +     * 3 - Second escalate,
>> +     * 4 - Redistribution,
>> +     * 5 - IPI cascaded queue ?
>> +     */
>> +    [VST_TSEL_IRQ]  = { "IRQ",  0,               6  },
>> +};
>> +
>> +#define xive_error(xive, fmt, ...)                                      \
>> +    qemu_log_mask(LOG_GUEST_ERROR, "XIVE[%x] - " fmt "\n", (xive)->chip_id, \
>> +                  ## __VA_ARGS__);
>> +
>> +/*
>> + * Our lookup routine for a remote XIVE IC. A simple scan of the chips.
>> + */
>> +static PnvXive *pnv_xive_get_ic(PnvXive *xive, uint8_t blk)
>> +{
>> +    PnvMachineState *pnv = PNV_MACHINE(qdev_get_machine());
>> +    int i;
>> +
>> +    for (i = 0; i < pnv->num_chips; i++) {
>> +        Pnv9Chip *chip9 = PNV9_CHIP(pnv->chips[i]);
>> +        PnvXive *ic_xive = &chip9->xive;
>> +        bool chip_override =
>> +            ic_xive->regs[PC_GLOBAL_CONFIG >> 3] & PC_GCONF_CHIPID_OVR;
>> +
>> +        if (chip_override) {
>> +            if (ic_xive->chip_id == blk) {
>> +                return ic_xive;
>> +            }
>> +        } else {
>> +            ; /* TODO: Block scope support */
>> +        }
>> +    }
>> +    xive_error(xive, "VST: unknown chip/block %d !?", blk);
>> +    return NULL;
>> +}
>> +
>> +/*
>> + * Virtual Structures Table accessors for SBE, EAT, ENDT, NVT
>> + */
>> +static uint64_t pnv_xive_vst_addr_direct(PnvXive *xive,
>> +                                         const XiveVstInfo *info, uint64_t vsd,
>> +                                         uint8_t blk, uint32_t idx)
>> +{
>> +    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
>> +    uint64_t vst_tsize = 1ull << (GETFIELD(VSD_TSIZE, vsd) + 12);
>> +    uint32_t idx_max = (vst_tsize / info->size) - 1;
>> +
>> +    if (idx > idx_max) {
>> +#ifdef XIVE_DEBUG
>> +        xive_error(xive, "VST: %s entry %x/%x out of range !?", info->name,
>> +                   blk, idx);
>> +#endif
>> +        return 0;
>> +    }
>> +
>> +    return vst_addr + idx * info->size;
>> +}
>> +
>> +#define XIVE_VSD_SIZE 8
>> +
>> +static uint64_t pnv_xive_vst_addr_indirect(PnvXive *xive,
>> +                                           const XiveVstInfo *info,
>> +                                           uint64_t vsd, uint8_t blk,
>> +                                           uint32_t idx)
>> +{
>> +    uint64_t vsd_addr;
>> +    uint64_t vst_addr;
>> +    uint32_t page_shift;
>> +    uint32_t page_mask;
>> +    uint64_t vst_tsize = 1ull << (GETFIELD(VSD_TSIZE, vsd) + 12);
>> +    uint32_t idx_max = (vst_tsize / XIVE_VSD_SIZE) - 1;
>> +
>> +    if (idx > idx_max) {
>> +#ifdef XIVE_DEBUG
>> +        xive_error(xive, "VST: %s entry %x/%x out of range !?", info->name,
>> +                   blk, idx);
>> +#endif
>> +        return 0;
>> +    }
>> +
>> +    vsd_addr = vsd & VSD_ADDRESS_MASK;
>> +
>> +    /*
>> +     * Read the first descriptor to get the page size of each indirect
>> +     * table.
>> +     */
>> +    vsd = ldq_be_dma(&address_space_memory, vsd_addr);
>> +    page_shift = GETFIELD(VSD_TSIZE, vsd) + 12;
>> +    page_mask = (1ull << page_shift) - 1;
>> +
>> +    /* Indirect page size can be 4K, 64K, 2M. */
>> +    if (page_shift != 12 && page_shift != 16 && page_shift != 23) {
> 
> page_shift == 23?? That's 8 MiB.

Oops, it should be 21 for 2M. And I should add 16M (24) as well.
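
For reference, the shifts for the four sizes are 12 (4K), 16 (64K), 21 (2M) and 24 (16M). A minimal sketch of the check, assuming these four are the only supported indirect page sizes; the helper name is hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: returns true if page_shift matches one of the
 * assumed indirect page sizes 4K, 64K, 2M or 16M.
 */
static bool ex_vsd_page_shift_is_valid(uint32_t page_shift)
{
    return page_shift == 12 ||  /* 4K  */
           page_shift == 16 ||  /* 64K */
           page_shift == 21 ||  /* 2M  */
           page_shift == 24;    /* 16M */
}
```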

> 
>> +        xive_error(xive, "VST: invalid %s table shift %d", info->name,
>> +                   page_shift);
>> +    }
>> +
>> +    if (!(vsd & VSD_ADDRESS_MASK)) {
>> +        xive_error(xive, "VST: invalid %s entry %x/%x !?", info->name,
>> +                   blk, 0);
>> +        return 0;
>> +    }
>> +
>> +    /* Load the descriptor we are looking for, if not already done */
>> +    if (idx) {
>> +        vsd_addr = vsd_addr + (idx >> page_shift);
>> +        vsd = ldq_be_dma(&address_space_memory, vsd_addr);
>> +
>> +        if (page_shift != GETFIELD(VSD_TSIZE, vsd) + 12) {
>> +            xive_error(xive, "VST: %s entry %x/%x indirect page size differ !?",
>> +                       info->name, blk, idx);
>> +            return 0;
>> +        }
>> +    }
>> +
>> +    vst_addr = vsd & VSD_ADDRESS_MASK;
>> +
>> +    return vst_addr + (idx & page_mask) * info->size;
>> +}
>> +
>> +static uint64_t pnv_xive_vst_addr(PnvXive *xive, uint8_t type, uint8_t blk,
>> +                                  uint32_t idx)
>> +{
>> +    uint64_t vsd;
>> +
>> +    if (blk >= vst_infos[type].max_blocks) {
>> +        xive_error(xive, "VST: invalid block id %d for VST %s %d !?",
>> +                   blk, vst_infos[type].name, idx);
>> +        return 0;
>> +    }
>> +
>> +    vsd = xive->vsds[type][blk];
>> +
>> +    /* Remote VST accesses */
>> +    if (GETFIELD(VSD_MODE, vsd) == VSD_MODE_FORWARD) {
>> +        xive = pnv_xive_get_ic(xive, blk);
>> +
>> +        return xive ? pnv_xive_vst_addr(xive, type, blk, idx) : 0;
>> +    }
>> +
>> +    if (VSD_INDIRECT & vsd) {
>> +        return pnv_xive_vst_addr_indirect(xive, &vst_infos[type], vsd,
>> +                                          blk, idx);
>> +    }
>> +
>> +    return pnv_xive_vst_addr_direct(xive, &vst_infos[type], vsd, blk, idx);
>> +}
>> +
>> +static int pnv_xive_get_end(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
>> +                           XiveEND *end)
>> +{
>> +    PnvXive *xive = PNV_XIVE(xrtr);
>> +    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
>> +
>> +    if (!end_addr) {
>> +        return -1;
>> +    }
>> +
>> +    cpu_physical_memory_read(end_addr, end, sizeof(XiveEND));
>> +    end->w0 = be32_to_cpu(end->w0);
>> +    end->w1 = be32_to_cpu(end->w1);
>> +    end->w2 = be32_to_cpu(end->w2);
>> +    end->w3 = be32_to_cpu(end->w3);
>> +    end->w4 = be32_to_cpu(end->w4);
>> +    end->w5 = be32_to_cpu(end->w5);
>> +    end->w6 = be32_to_cpu(end->w6);
>> +    end->w7 = be32_to_cpu(end->w7);
>> +
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_set_end(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
>> +                           XiveEND *in_end)
>> +{
>> +    PnvXive *xive = PNV_XIVE(xrtr);
>> +    XiveEND end;
>> +    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
>> +
>> +    if (!end_addr) {
>> +        return -1;
>> +    }
>> +
>> +    end.w0 = cpu_to_be32(in_end->w0);
>> +    end.w1 = cpu_to_be32(in_end->w1);
>> +    end.w2 = cpu_to_be32(in_end->w2);
>> +    end.w3 = cpu_to_be32(in_end->w3);
>> +    end.w4 = cpu_to_be32(in_end->w4);
>> +    end.w5 = cpu_to_be32(in_end->w5);
>> +    end.w6 = cpu_to_be32(in_end->w6);
>> +    end.w7 = cpu_to_be32(in_end->w7);
>> +    cpu_physical_memory_write(end_addr, &end, sizeof(XiveEND));
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_end_update(PnvXive *xive, uint8_t blk, uint32_t idx)
>> +{
>> +    uint64_t end_addr = pnv_xive_vst_addr(xive, VST_TSEL_EQDT, blk, idx);
>> +
>> +    if (!end_addr) {
>> +        return -1;
>> +    }
>> +
>> +    cpu_physical_memory_write(end_addr, xive->eqc_watch, sizeof(XiveEND));
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_get_nvt(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
>> +                           XiveNVT *nvt)
>> +{
>> +    PnvXive *xive = PNV_XIVE(xrtr);
>> +    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
>> +
>> +    if (!nvt_addr) {
>> +        return -1;
>> +    }
>> +
>> +    cpu_physical_memory_read(nvt_addr, nvt, sizeof(XiveNVT));
>> +    nvt->w0 = cpu_to_be32(nvt->w0);
>> +    nvt->w1 = cpu_to_be32(nvt->w1);
>> +    nvt->w2 = cpu_to_be32(nvt->w2);
>> +    nvt->w3 = cpu_to_be32(nvt->w3);
>> +    nvt->w4 = cpu_to_be32(nvt->w4);
>> +    nvt->w5 = cpu_to_be32(nvt->w5);
>> +    nvt->w6 = cpu_to_be32(nvt->w6);
>> +    nvt->w7 = cpu_to_be32(nvt->w7);
>> +
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_set_nvt(XiveRouter *xrtr, uint8_t blk, uint32_t idx,
>> +                           XiveNVT *in_nvt)
>> +{
>> +    PnvXive *xive = PNV_XIVE(xrtr);
>> +    XiveNVT nvt;
>> +    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
>> +
>> +    if (!nvt_addr) {
>> +        return -1;
>> +    }
>> +
>> +    nvt.w0 = cpu_to_be32(in_nvt->w0);
>> +    nvt.w1 = cpu_to_be32(in_nvt->w1);
>> +    nvt.w2 = cpu_to_be32(in_nvt->w2);
>> +    nvt.w3 = cpu_to_be32(in_nvt->w3);
>> +    nvt.w4 = cpu_to_be32(in_nvt->w4);
>> +    nvt.w5 = cpu_to_be32(in_nvt->w5);
>> +    nvt.w6 = cpu_to_be32(in_nvt->w6);
>> +    nvt.w7 = cpu_to_be32(in_nvt->w7);
>> +    cpu_physical_memory_write(nvt_addr, &nvt, sizeof(XiveNVT));
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_nvt_update(PnvXive *xive, uint8_t blk, uint32_t idx)
>> +{
>> +    uint64_t nvt_addr = pnv_xive_vst_addr(xive, VST_TSEL_VPDT, blk, idx);
>> +
>> +    if (!nvt_addr) {
>> +        return -1;
>> +    }
>> +
>> +    cpu_physical_memory_write(nvt_addr, xive->vpc_watch, sizeof(XiveNVT));
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_get_eas(XiveRouter *xrtr, uint32_t srcno, XiveEAS *eas)
>> +{
>> +    PnvXive *xive = PNV_XIVE(xrtr);
>> +    uint8_t  blk = SRCNO_BLOCK(srcno);
>> +    uint32_t idx = SRCNO_INDEX(srcno);
>> +    uint64_t eas_addr;
>> +
>> +    /* TODO: check when remote EAS lookups are possible */
>> +    if (pnv_xive_get_ic(xive, blk) != xive) {
>> +        xive_error(xive, "VST: EAS %x is remote !?", srcno);
>> +        return -1;
>> +    }
>> +
>> +    eas_addr = pnv_xive_vst_addr(xive, VST_TSEL_IVT, blk, idx);
>> +    if (!eas_addr) {
>> +        return -1;
>> +    }
>> +
>> +    eas->w &= ~EAS_VALID;
> 
> Doesn't this get overwritten by the next statement?

yes ..
 
>> +    *((uint64_t *) eas) = ldq_be_dma(&address_space_memory, eas_addr);
> 
> eas->w = ldq... would surely be simpler.

Yes.

We are changing the XIVE core layer to keep the XIVE structures in
big-endian (BE) format. That should simplify all the accessors.

> 
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_set_eas(XiveRouter *xrtr, uint32_t srcno, XiveEAS *ive)
>> +{
>> +    /* All done. */
> 
> Uh.. what?  This is wrong, although I guess it doesn't matter because
> the pnv model never uses set_eas.  Another argument for not
> abstracting this path - just write directly in the PAPR code.

heh :)

> 
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_eas_update(PnvXive *xive, uint32_t idx)
>> +{
>> +    /* All done. */
>> +    return 0;
>> +}
>> +
>> +/*
>> + * XIVE Set Translation Table configuration
>> + *
>> + * The Virtualization Controller MMIO region containing the IPI ESB
>> + * pages and END ESB pages is sub-divided into "sets" which map
>> + * portions of the VC region to the different ESB pages. It is
>> + * configured at runtime through the EDT set translation table to let
>> + * the firmware decide how to split the address space between IPI ESB
>> + * pages and END ESB pages.
>> + */
>> +static int pnv_xive_set_xlate_update(PnvXive *xive, uint64_t val)
>> +{
>> +    uint8_t index = xive->set_xlate_autoinc ?
>> +        xive->set_xlate_index++ : xive->set_xlate_index;
> 
> What's the correct hardware behaviour when the index runs off the end
> with autoincrement mode?

The documentation doesn't say. The index is 6 bits wide, so I suppose
it should wrap around after 63.
 
>> +    uint8_t max_index;
>> +    uint64_t *xlate_table;
>> +
>> +    switch (xive->set_xlate) {
>> +    case CQ_TAR_TSEL_BLK:
>> +        max_index = ARRAY_SIZE(xive->set_xlate_blk);
>> +        xlate_table = xive->set_xlate_blk;
>> +        break;
>> +    case CQ_TAR_TSEL_MIG:
>> +        max_index = ARRAY_SIZE(xive->set_xlate_mig);
>> +        xlate_table = xive->set_xlate_mig;
>> +        break;
>> +    case CQ_TAR_TSEL_EDT:
>> +        max_index = ARRAY_SIZE(xive->set_xlate_edt);
>> +        xlate_table = xive->set_xlate_edt;
>> +        break;
>> +    case CQ_TAR_TSEL_VDT:
>> +        max_index = ARRAY_SIZE(xive->set_xlate_vdt);
>> +        xlate_table = xive->set_xlate_vdt;
>> +        break;
>> +    default:
>> +        xive_error(xive, "xlate: invalid table %d", (int) xive->set_xlate);
> 
> In the error case is it correct for the autoincrement to go ahead?

Ah, no, it isn't. I need to change the logic.
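
Roughly what I have in mind: validate the table selection first, and only bump the index after a successful write. This is a sketch with hypothetical names and only two tables; the wrap on 6 bits is my guess, since the documentation does not specify it:

```c
#include <stdbool.h>
#include <stdint.h>

#define EX_INDEX_MASK 0x3f      /* 6-bit index field (assumed to wrap) */

enum { EX_TSEL_BLK, EX_TSEL_EDT };

/* Hypothetical state, reduced to what the example needs */
typedef struct ExXlateState {
    bool     autoinc;
    uint8_t  index;
    uint64_t blk[16];
    uint64_t edt[64];
} ExXlateState;

static int ex_xlate_update(ExXlateState *s, int tsel, uint64_t val)
{
    uint64_t *table;
    uint8_t max_index;

    switch (tsel) {
    case EX_TSEL_BLK:
        table = s->blk;
        max_index = 16;
        break;
    case EX_TSEL_EDT:
        table = s->edt;
        max_index = 64;
        break;
    default:
        return -1;              /* invalid table: index left untouched */
    }

    if (s->index >= max_index) {
        return -1;              /* out of range: index left untouched */
    }

    table[s->index] = val;

    /* Auto-increment only after a successful write, wrapping on 6 bits */
    if (s->autoinc) {
        s->index = (s->index + 1) & EX_INDEX_MASK;
    }
    return 0;
}
```

Whether the index should still advance when it is out of range for a small table is another open question; the sketch simply leaves it alone.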

>> +        return -1;
>> +    }
>> +
>> +    if (index >= max_index) {
>> +        return -1;
>> +    }
>> +
>> +    xlate_table[index] = val;
>> +    return 0;
>> +}
>> +
>> +static int pnv_xive_set_xlate_select(PnvXive *xive, uint64_t val)
>> +{
>> +    xive->set_xlate_autoinc = val & CQ_TAR_TBL_AUTOINC;
>> +    xive->set_xlate = val & CQ_TAR_TSEL;
>> +    xive->set_xlate_index = GETFIELD(CQ_TAR_TSEL_INDEX, val);
> 
> Why split this here, rather than just storing the MMIOed value direct
> in the regs[] array, then parsing out the bits when you need them?

There is no strong reason really. Mostly because the "Set Translation
Table Address" register is set before "Set Translation Table" and
it was one way to prepare the work to be done. But it doesn't do
much, so I could use the MMIOed value directly as you propose.


> To expand a bit, there are two models you can use for modelling
> registers in qemu.  You can have a big regs[] with all the registers
> make the accessors just read/write that, plus side-effect and special
> case handling.  Or you can have specific fields in your state for the
> crucial register values, then have the MMIO access do all the
> translation into those underlying registers based on the offset.
> 
> Either model can make sense, depending on how many side effects and
> special cases there are.  Mixing the two models, which is kind of what
> you're doing here, is usually not a good idea.

I agree. It also makes the vmstate more complex to figure out, even
if we don't care about it for PowerNV.
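
So, with the regs[] model, the select register would just be stored as written and the fields parsed out when the data register is accessed. A minimal sketch, where the register slot, the helper names and the bit positions are all assumptions made for illustration:

```c
#include <stdbool.h>
#include <stdint.h>

/* Extract IBM-numbered bits bs..be (bit 0 is the MSB) of a 64-bit value */
#define EX_FIELD(bs, be, v) \
    (((v) >> (63 - (be))) & ((1ull << ((be) - (bs) + 1)) - 1))

#define EX_TAR_REG 0            /* hypothetical slot in regs[] */

typedef struct ExRegState {
    uint64_t regs[4];
} ExRegState;

/* MMIO write: store the raw value, no decoding here */
static void ex_tar_write(ExRegState *s, uint64_t val)
{
    s->regs[EX_TAR_REG] = val;
}

/* Fields are decoded from the stored value only when needed */
static bool ex_tar_autoinc(const ExRegState *s)
{
    return EX_FIELD(0, 0, s->regs[EX_TAR_REG]);
}

static uint8_t ex_tar_tsel(const ExRegState *s)
{
    return EX_FIELD(12, 15, s->regs[EX_TAR_REG]);
}

static uint8_t ex_tar_index(const ExRegState *s)
{
    return EX_FIELD(26, 31, s->regs[EX_TAR_REG]);
}
```

The state then is only regs[], which is also what a vmstate would need to describe.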
 
>> +
>> +    return 0;
>> +}
>> +
>> +/*
>> + * Computes the overall size of the IPI or the END ESB pages
>> + */
>> +static uint64_t pnv_xive_set_xlate_edt_size(PnvXive *xive, uint64_t type)
>> +{
>> +    uint64_t edt_size = 1ull << xive->edt_shift;
>> +    uint64_t size = 0;
>> +    int i;
>> +
>> +    for (i = 0; i < XIVE_XLATE_EDT_MAX; i++) {
>> +        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
>> +
>> +        if (edt_type == type) {
>> +            size += edt_size;
>> +        }
>> +    }
>> +
>> +    return size;
>> +}
>> +
>> +/*
>> + * Maps an offset of the VC region in the IPI or END region using the
>> + * layout defined by the EDT table
>> + */
>> +static uint64_t pnv_xive_set_xlate_edt_offset(PnvXive *xive, uint64_t vc_offset,
>> +                                              uint64_t type)
>> +{
>> +    int i;
>> +    uint64_t edt_size = (1ull << xive->edt_shift);
>> +    uint64_t edt_offset = vc_offset;
>> +
>> +    for (i = 0; i < XIVE_XLATE_EDT_MAX && (i * edt_size) < vc_offset; i++) {
>> +        uint64_t edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[i]);
>> +
>> +        if (edt_type != type) {
>> +            edt_offset -= edt_size;
>> +        }
>> +    }
>> +
>> +    return edt_offset;
>> +}
>> +
>> +/*
>> + * IPI and END sources realize routines
>> + *
>> + * We use the EDT table to size the internal XiveSource object backing
>> + * the IPIs and the XiveENDSource object backing the ENDs
>> + */
>> +static void pnv_xive_source_realize(PnvXive *xive, Error **errp)
>> +{
>> +    XiveSource *xsrc = &xive->source;
>> +    Error *local_err = NULL;
>> +    uint64_t ipi_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_IPI);
>> +
>> +    /* Two pages per IRQ */
>> +    xive->nr_irqs = ipi_mmio_size / (1ull << (xive->vc_shift + 1));
>> +
>> +    /*
>> +     * Configure store EOI if required by firmware (skiboot has
>> +     * removed support recently though)
>> +     */
>> +    if (xive->regs[VC_SBC_CONFIG >> 3] &
>> +        (VC_SBC_CONF_CPLX_CIST | VC_SBC_CONF_CIST_BOTH)) {
>> +        object_property_set_int(OBJECT(xsrc), XIVE_SRC_STORE_EOI, "flags",
>> +                                &error_fatal);
>> +    }
>> +
>> +    object_property_set_int(OBJECT(xsrc), xive->nr_irqs, "nr-irqs",
>> +                            &error_fatal);
>> +    object_property_add_const_link(OBJECT(xsrc), "xive", OBJECT(xive),
>> +                                   &error_fatal);
>> +    object_property_set_bool(OBJECT(xsrc), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(xsrc), sysbus_get_default());
>> +
>> +    /* Install the IPI ESB MMIO region in its VC region */
>> +    memory_region_add_subregion(&xive->ipi_mmio, 0, &xsrc->esb_mmio);
>> +
>> +    /* Start in a clean state */
>> +    device_reset(DEVICE(&xive->source));
> 
> I don't think you should need that.  During qemu start up all the
> device reset handlers should be called after reset but before starting
> the VM anyway.

Yes, but I chose to realize the source after reset ... I will explain
why later.
 
 
>> +}
>> +
>> +static void pnv_xive_end_source_realize(PnvXive *xive, Error **errp)
>> +{
>> +    XiveENDSource *end_xsrc = &xive->end_source;
>> +    Error *local_err = NULL;
>> +    uint64_t end_mmio_size = pnv_xive_set_xlate_edt_size(xive, CQ_TDR_EDT_EQ);
>> +
>> +    /* Two pages per END: ESn and ESe */
>> +    xive->nr_ends  = end_mmio_size / (1ull << (xive->vc_shift + 1));
>> +
>> +    object_property_set_int(OBJECT(end_xsrc), xive->nr_ends, "nr-ends",
>> +                            &error_fatal);
>> +    object_property_add_const_link(OBJECT(end_xsrc), "xive", OBJECT(xive),
>> +                                   &error_fatal);
>> +    object_property_set_bool(OBJECT(end_xsrc), true, "realized", &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(end_xsrc), sysbus_get_default());
>> +
>> +    /* Install the END ESB MMIO region in its VC region */
>> +    memory_region_add_subregion(&xive->end_mmio, 0, &end_xsrc->esb_mmio);
>> +}
>> +
>> +/*
>> + * Virtual Structure Tables (VST) configuration
>> + */
>> +static void pnv_xive_table_set_exclusive(PnvXive *xive, uint8_t type,
>> +                                         uint8_t blk, uint64_t vsd)
>> +{
>> +    bool gconf_indirect =
>> +        xive->regs[VC_GLOBAL_CONFIG >> 3] & VC_GCONF_INDIRECT;
>> +    uint32_t vst_shift = GETFIELD(VSD_TSIZE, vsd) + 12;
>> +    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
>> +
>> +    if (VSD_INDIRECT & vsd) {
>> +        if (!gconf_indirect) {
>> +            xive_error(xive, "VST: %s indirect tables not enabled",
>> +                       vst_infos[type].name);
>> +            return;
>> +        }
>> +    }
>> +
>> +    switch (type) {
>> +    case VST_TSEL_IVT:
>> +        /*
>> +         * This is our trigger to create the XiveSource object backing
>> +         * the IPIs.
>> +         */
>> +        pnv_xive_source_realize(xive, &error_fatal);
> 
> IIUC this gets called in response to an MMIO.  Realizing devices in
> response to a runtime MMIO looks very wrong.

It does. I agree, but I didn't find an appropriate model for
the problem I am trying to solve, which is to create a XiveSource
object of the appropriate size, depending on how the software
configured the XIVE IC.

We could choose to cover the maximum that the VC MMIO region can
cover. I might do that in the next version.
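
The upper bound would follow from the VC window size and the ESB page size. A minimal sketch, assuming two ESB pages per source as in the patch; the helper name is hypothetical:

```c
#include <stdint.h>

/*
 * Hypothetical helper: maximum number of sources a VC window of
 * vc_size bytes can hold, with two ESB pages of (1 << page_shift)
 * bytes per source.
 */
static uint64_t ex_max_esb_sources(uint64_t vc_size, uint32_t page_shift)
{
    return vc_size >> (page_shift + 1);
}
```

With PNV9_XIVE_VC_SIZE (0x0000008000000000) and 64K pages, this gives 4194304 sources. Since the window is shared between the IPI and the END ESB pages, this is only an upper bound for each of them.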

> 
>> +        break;
>> +
>> +    case VST_TSEL_EQDT:
>> +        /* Same trigger but for the XiveENDSource object backing the ENDs. */
>> +        pnv_xive_end_source_realize(xive, &error_fatal);
>> +        break;
>> +
>> +    case VST_TSEL_VPDT:
>> +        /* FIXME (skiboot) : remove DD1 workaround on the NVT table size */
>> +        vst_shift = 16;
>> +        break;
>> +
>> +    case VST_TSEL_SBE: /* Not modeled */
>> +        /*
>> +         * Contains the backing store pages for the source PQ bits.
>> +         * The XiveSource object has its own. We would need a custom
>> +         * source object to use this backing.
>> +         */
>> +        break;
>> +
>> +    case VST_TSEL_IRQ: /* VC only. Not modeled */
>> +        /*
>> +         * These tables contains the backing store pages for the
>> +         * interrupt fifos of the VC sub-engine in case of overflow.
>> +         */
>> +        break;
>> +    default:
>> +        g_assert_not_reached();
>> +    }
>> +
>> +    if (!QEMU_IS_ALIGNED(vst_addr, 1ull << vst_shift)) {
>> +        xive_error(xive, "VST: %s table address 0x%"PRIx64" is not aligned with"
>> +                   " page shift %d", vst_infos[type].name, vst_addr, vst_shift);
>> +    }
>> +
>> +    /* Keep the VSD for later use */
>> +    xive->vsds[type][blk] = vsd;

So I suppose I should also use the value written through MMIO for this 
configuration, rather than storing it in its own field?


>> +}
>> +
>> +/*
>> + * Both PC and VC sub-engines are configured as each use the Virtual
>> + * Structure Tables : SBE, EAS, END and NVT.
>> + */
>> +static void pnv_xive_table_set_data(PnvXive *xive, uint64_t vsd, bool pc_engine)
>> +{
>> +    uint8_t mode = GETFIELD(VSD_MODE, vsd);
>> +    uint8_t type = GETFIELD(VST_TABLE_SELECT,
>> +                            xive->regs[VC_VSD_TABLE_ADDR >> 3]);
>> +    uint8_t blk = GETFIELD(VST_TABLE_BLOCK,
>> +                             xive->regs[VC_VSD_TABLE_ADDR >> 3]);
>> +    uint64_t vst_addr = vsd & VSD_ADDRESS_MASK;
>> +
>> +    if (type > VST_TSEL_IRQ) {
>> +        xive_error(xive, "VST: invalid table type %d", type);
>> +        return;
>> +    }
>> +
>> +    if (blk >= vst_infos[type].max_blocks) {
>> +        xive_error(xive, "VST: invalid block id %d for"
>> +                      " %s table", blk, vst_infos[type].name);
>> +        return;
>> +    }
>> +
>> +    /*
>> +     * Only take the VC sub-engine configuration into account because
>> +     * the XiveRouter model combines both VC and PC sub-engines
>> +     */
>> +    if (pc_engine) {
>> +        return;
>> +    }
>> +
>> +    if (!vst_addr) {
>> +        xive_error(xive, "VST: invalid %s table address", vst_infos[type].name);
>> +        return;
>> +    }
>> +
>> +    switch (mode) {
>> +    case VSD_MODE_FORWARD:
>> +        xive->vsds[type][blk] = vsd;
>> +        break;
>> +
>> +    case VSD_MODE_EXCLUSIVE:
>> +        pnv_xive_table_set_exclusive(xive, type, blk, vsd);
>> +        break;
>> +
>> +    default:
>> +        xive_error(xive, "VST: unsupported table mode %d", mode);
>> +        return;
>> +    }
>> +}
>> +
>> +/*
>> + * When the TIMA is accessed from the indirect page, the thread id
>> + * (PIR) has to be configured in the IC before. This is used for
>> + * resets and for debug purpose also.
>> + */
>> +static void pnv_xive_thread_indirect_set(PnvXive *xive, uint64_t val)
>> +{
>> +    int pir = GETFIELD(PC_TCTXT_INDIR_THRDID, xive->regs[PC_TCTXT_INDIR0 >> 3]);
>> +
>> +    if (val & PC_TCTXT_INDIR_VALID) {
>> +        if (xive->cpu_ind) {
>> +            xive_error(xive, "IC: indirect access already set for "
>> +                       "invalid PIR %d", pir);
>> +        }
>> +
>> +        pir = GETFIELD(PC_TCTXT_INDIR_THRDID, val) & 0xff;
>> +        xive->cpu_ind = ppc_get_vcpu_by_pir(pir);
>> +        if (!xive->cpu_ind) {
>> +            xive_error(xive, "IC: invalid PIR %d for indirect access", pir);
>> +        }
>> +    } else {
>> +        xive->cpu_ind = NULL;
>> +    }
>> +}
>> +
>> +/*
>> + * Interrupt Controller registers MMIO
>> + */
>> +static void pnv_xive_ic_reg_write(PnvXive *xive, uint32_t offset, uint64_t val,
>> +                                  bool mmio)
>> +{
>> +    MemoryRegion *sysmem = get_system_memory();
>> +    uint32_t reg = offset >> 3;
>> +
>> +    switch (offset) {
>> +
>> +    /*
>> +     * XIVE CQ (PowerBus bridge) settings
>> +     */
>> +    case CQ_MSGSND:     /* msgsnd for doorbells */
>> +    case CQ_FIRMASK_OR: /* FIR error reporting */
>> +        xive->regs[reg] = val;
> 
> Can you do that generic update outside the switch?  If that leaves too
> many special cases that might be a sign you shouldn't use the
> big-array-of-regs model.

I also use the 'switch' to segment the different registers of the
controller by sub-engine. I think the big-array-of-regs model should 
work. I need to add a couple more helper routines.
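
Something along these lines, as a rough sketch only. All the names are
hypothetical (they are not the PnvXive ones); the point is that the
switch only handles the special cases and the generic store is done once
at the end.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All names below are hypothetical and only illustrate the pattern. */
#define NREGS        8
#define PAGE_CTL_64K (1ull << 0)

/* Register indexes (byte offset >> 3) */
enum { REG_PLAIN = 0, REG_PAGE_CTL = 1, REG_EN_SET = 2, REG_EN_CLR = 3 };

typedef struct Ctrl {
    uint64_t regs[NREGS];
    unsigned page_shift;
} Ctrl;

/* Returns 0 on success, -1 if the offset is outside the register file. */
static int ctrl_reg_write(Ctrl *c, uint32_t offset, uint64_t val)
{
    uint32_t reg = offset >> 3;

    assert(c != NULL);
    if (reg >= NREGS) {
        return -1;
    }

    switch (reg) {
    case REG_EN_SET:            /* write-one-to-set */
        c->regs[REG_EN_SET] |= val;
        return 0;
    case REG_EN_CLR:            /* write-one-to-clear, aliased on the SET reg */
        c->regs[REG_EN_SET] &= ~val;
        return 0;
    case REG_PAGE_CTL:          /* side effect, then the generic store below */
        c->page_shift = (val & PAGE_CTL_64K) ? 16 : 12;
        break;
    default:
        break;
    }

    c->regs[reg] = val;         /* generic update, done once for all regs */
    return 0;
}
```

The SET/CLR pair is the only case that bypasses the generic store, 
because both offsets update the same backing word.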

>> +        break;
>> +    case CQ_PBI_CTL:
>> +        if (val & CQ_PBI_PC_64K) {
>> +            xive->pc_shift = 16;
>> +        }
>> +        if (val & CQ_PBI_VC_64K) {
>> +            xive->vc_shift = 16;
>> +        }
>> +        break;
>> +    case CQ_CFG_PB_GEN: /* PowerBus General Configuration */
>> +        /*
>> +         * TODO: CQ_INT_ADDR_OPT for 1-block-per-chip mode
>> +         */
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    /*
>> +     * XIVE Virtualization Controller settings
>> +     */
>> +    case VC_GLOBAL_CONFIG:
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    /*
>> +     * XIVE Presenter Controller settings
>> +     */
>> +    case PC_GLOBAL_CONFIG:
>> +        /* Overrides Int command Chip ID with the Chip ID field */
>> +        if (val & PC_GCONF_CHIPID_OVR) {
>> +            xive->chip_id = GETFIELD(PC_GCONF_CHIPID, val);
>> +        }
>> +        xive->regs[reg] = val;
>> +        break;
>> +    case PC_TCTXT_CFG:
>> +        /*
>> +         * TODO: PC_TCTXT_CFG_BLKGRP_EN for block group support
>> +         * TODO: PC_TCTXT_CFG_HARD_CHIPID_BLK
>> +         */
>> +
>> +        /*
>> +         * Moves the chipid into block field for hardwired CAM
>> +         * compares Block offset value is adjusted to 0b0..01 & ThrdId
>> +         */
>> +        if (val & PC_TCTXT_CHIPID_OVERRIDE) {
>> +            xive->thread_chip_id = GETFIELD(PC_TCTXT_CHIPID, val);
>> +        }
>> +        break;
>> +    case PC_TCTXT_TRACK: /* Enable block tracking (DD2) */
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    /*
>> +     * Misc settings
>> +     */
>> +    case VC_EQC_CONFIG: /* enable silent escalation */
>> +    case VC_SBC_CONFIG: /* Store EOI configuration */
>> +    case VC_AIB_TX_ORDER_TAG2:
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    /*
>> +     * XIVE BAR settings (XSCOM only)
>> +     */
>> +    case CQ_RST_CTL:
>> +        /* resets all bars */
>> +        break;
>> +
>> +    case CQ_IC_BAR: /* IC BAR. 8 pages */
>> +        xive->ic_shift = val & CQ_IC_BAR_64K ? 16 : 12;
>> +        if (!(val & CQ_IC_BAR_VALID)) {
>> +            xive->ic_base = 0;
>> +            if (xive->regs[reg] & CQ_IC_BAR_VALID) {
>> +                memory_region_del_subregion(&xive->ic_mmio,
>> +                                            &xive->ic_reg_mmio);
>> +                memory_region_del_subregion(&xive->ic_mmio,
>> +                                            &xive->ic_notify_mmio);
>> +                memory_region_del_subregion(sysmem, &xive->ic_mmio);
>> +                memory_region_del_subregion(sysmem, &xive->tm_mmio_indirect);
>> +            }
>> +        } else {
>> +            xive->ic_base  = val & ~(CQ_IC_BAR_VALID | CQ_IC_BAR_64K);
>> +            if (!(xive->regs[reg] & CQ_IC_BAR_VALID)) {
>> +                memory_region_add_subregion(sysmem, xive->ic_base,
>> +                                            &xive->ic_mmio);
>> +                memory_region_add_subregion(&xive->ic_mmio,  0,
>> +                                            &xive->ic_reg_mmio);
>> +                memory_region_add_subregion(&xive->ic_mmio,
>> +                                            1ul << xive->ic_shift,
>> +                                            &xive->ic_notify_mmio);
>> +                memory_region_add_subregion(sysmem,
>> +                                   xive->ic_base + (4ull << xive->ic_shift),
>> +                                   &xive->tm_mmio_indirect);
>> +            }
>> +        }
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    case CQ_TM1_BAR: /* TM BAR and page size. 4 pages */
>> +    case CQ_TM2_BAR: /* second TM BAR is for hotplug use */
>> +        xive->tm_shift = val & CQ_TM_BAR_64K ? 16 : 12;
>> +        if (!(val & CQ_TM_BAR_VALID)) {
>> +            xive->tm_base = 0;
>> +            if (xive->regs[reg] & CQ_TM_BAR_VALID) {
>> +                memory_region_del_subregion(sysmem, &xive->tm_mmio);
>> +            }
>> +        } else {
>> +            xive->tm_base  = val & ~(CQ_TM_BAR_VALID | CQ_TM_BAR_64K);
>> +            if (!(xive->regs[reg] & CQ_TM_BAR_VALID)) {
>> +                memory_region_add_subregion(sysmem, xive->tm_base,
>> +                                            &xive->tm_mmio);
>> +            }
>> +        }
>> +        xive->regs[reg] = val;
>> +       break;
> 
> Something funny with your indentation here.
> 
>> +    case CQ_PC_BAR:
>> +        if (!(val & CQ_PC_BAR_VALID)) {
>> +            xive->pc_base = 0;
>> +            if (xive->regs[reg] & CQ_PC_BAR_VALID) {
>> +                memory_region_del_subregion(sysmem, &xive->pc_mmio);
>> +            }
>> +        } else {
>> +            xive->pc_base = val & ~(CQ_PC_BAR_VALID);
>> +            if (!(xive->regs[reg] & CQ_PC_BAR_VALID)) {
>> +                memory_region_add_subregion(sysmem, xive->pc_base,
>> +                                            &xive->pc_mmio);
>> +            }
>> +        }
>> +        xive->regs[reg] = val;
>> +        break;
>> +    case CQ_PC_BARM: /* TODO: configure PC BAR size at runtime */
>> +        xive->pc_size =  (~val + 1) & CQ_PC_BARM_MASK;
>> +        xive->regs[reg] = val;
>> +
>> +        /* Compute the size of the VDT sets */
>> +        xive->vdt_shift = ctz64(xive->pc_size / XIVE_XLATE_VDT_MAX);
>> +        break;
>> +
>> +    case CQ_VC_BAR: /* From 64M to 4TB */
>> +        if (!(val & CQ_VC_BAR_VALID)) {
>> +            xive->vc_base = 0;
>> +            if (xive->regs[reg] & CQ_VC_BAR_VALID) {
>> +                memory_region_del_subregion(sysmem, &xive->vc_mmio);
>> +            }
>> +        } else {
>> +            xive->vc_base = val & ~(CQ_VC_BAR_VALID);
>> +            if (!(xive->regs[reg] & CQ_VC_BAR_VALID)) {
>> +                memory_region_add_subregion(sysmem, xive->vc_base,
>> +                                            &xive->vc_mmio);
>> +            }
>> +        }
>> +        xive->regs[reg] = val;
>> +        break;
>> +    case CQ_VC_BARM: /* TODO: configure VC BAR size at runtime */
>> +        xive->vc_size = (~val + 1) & CQ_VC_BARM_MASK;
> 
> Any reason to precompute that, rather than work it out from
> regs[CQ_VC_BARM] when you need it?

Apart from having a name making more sense (it matches vc_mmio), no. 

I can remove these. 
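
The computation would then move into small helpers reading the register
value. A minimal sketch, assuming the mask value and the number of EDT
sets given below (both are placeholders for CQ_VC_BARM_MASK and
XIVE_XLATE_EDT_MAX, and the helper names are hypothetical):

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, standing in for CQ_VC_BARM_MASK and XIVE_XLATE_EDT_MAX */
#define VC_BARM_MASK 0x03ffffffff000000ull
#define EDT_MAX      64

/* Size of the BAR, derived from the mask register on demand */
static uint64_t vc_size_from_barm(uint64_t barm)
{
    return (~barm + 1) & VC_BARM_MASK;
}

/* Portable count of trailing zero bits; 'v' must not be zero */
static unsigned ctz64_portable(uint64_t v)
{
    unsigned n = 0;

    assert(v != 0);
    while (!(v & 1)) {
        v >>= 1;
        n++;
    }
    return n;
}

/* Shift giving the size of one EDT set; the BAR size must be non-zero */
static unsigned edt_shift_from_barm(uint64_t barm)
{
    return ctz64_portable(vc_size_from_barm(barm) / EDT_MAX);
}
```

For example, a mask of 0x03fffffff0000000 yields a 256M BAR and an EDT
set shift of 22 (4M per set).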

>> +        xive->regs[reg] = val;
>> +
>> +        /* Compute the size of the EDT sets */
>> +        xive->edt_shift = ctz64(xive->vc_size / XIVE_XLATE_EDT_MAX);
>> +        break;
>> +
>> +    /*
>> +     * XIVE Set Translation Table settings. Defines the layout of the
>> +     * VC BAR containing the ESB pages of the IPIs and of the ENDs
>> +     */
>> +    case CQ_TAR: /* Set Translation Table Address */
>> +        pnv_xive_set_xlate_select(xive, val);
>> +        break;
>> +    case CQ_TDR: /* Set Translation Table Data */
>> +        pnv_xive_set_xlate_update(xive, val);
>> +        break;
>> +
>> +    /*
>> +     * XIVE VC & PC Virtual Structure Table settings
>> +     */
>> +    case VC_VSD_TABLE_ADDR:
>> +    case PC_VSD_TABLE_ADDR: /* Virtual table selector */
>> +        xive->regs[reg] = val;
>> +        break;
>> +    case VC_VSD_TABLE_DATA: /* Virtual table setting */
>> +    case PC_VSD_TABLE_DATA:
>> +        pnv_xive_table_set_data(xive, val, offset == PC_VSD_TABLE_DATA);
>> +        break;
>> +
>> +    /*
>> +     * Interrupt fifo overflow in memory backing store. Not modeled
>> +     */
>> +    case VC_IRQ_CONFIG_IPI:
>> +    case VC_IRQ_CONFIG_HW:
>> +    case VC_IRQ_CONFIG_CASCADE1:
>> +    case VC_IRQ_CONFIG_CASCADE2:
>> +    case VC_IRQ_CONFIG_REDIST:
>> +    case VC_IRQ_CONFIG_IPI_CASC:
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    /*
>> +     * XIVE hardware thread enablement
>> +     */
>> +    case PC_THREAD_EN_REG0_SET: /* Physical Thread Enable */
>> +    case PC_THREAD_EN_REG1_SET: /* Physical Thread Enable (fused core) */
>> +        xive->regs[reg] |= val;
>> +        break;
>> +    case PC_THREAD_EN_REG0_CLR:
>> +        xive->regs[PC_THREAD_EN_REG0_SET >> 3] &= ~val;
>> +        break;
>> +    case PC_THREAD_EN_REG1_CLR:
>> +        xive->regs[PC_THREAD_EN_REG1_SET >> 3] &= ~val;
>> +        break;
>> +
>> +    /*
>> +     * Indirect TIMA access set up. Defines the HW thread to use.
>> +     */
>> +    case PC_TCTXT_INDIR0:
>> +        pnv_xive_thread_indirect_set(xive, val);
>> +        xive->regs[reg] = val;
>> +        break;
>> +    case PC_TCTXT_INDIR1:
>> +    case PC_TCTXT_INDIR2:
>> +    case PC_TCTXT_INDIR3:
>> +        /* TODO: check what PC_TCTXT_INDIR[123] are for */
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    /*
>> +     * XIVE PC & VC cache updates for EAS, NVT and END
>> +     */
>> +    case PC_VPC_SCRUB_MASK:
>> +    case PC_VPC_CWATCH_SPEC:
>> +    case VC_EQC_SCRUB_MASK:
>> +    case VC_EQC_CWATCH_SPEC:
>> +    case VC_IVC_SCRUB_MASK:
>> +        xive->regs[reg] = val;
>> +        break;
>> +    case VC_IVC_SCRUB_TRIG:
>> +        pnv_xive_eas_update(xive, GETFIELD(VC_SCRUB_OFFSET, val));
>> +        break;
>> +    case PC_VPC_CWATCH_DAT0:
>> +    case PC_VPC_CWATCH_DAT1:
>> +    case PC_VPC_CWATCH_DAT2:
>> +    case PC_VPC_CWATCH_DAT3:
>> +    case PC_VPC_CWATCH_DAT4:
>> +    case PC_VPC_CWATCH_DAT5:
>> +    case PC_VPC_CWATCH_DAT6:
>> +    case PC_VPC_CWATCH_DAT7:
>> +        xive->vpc_watch[(offset - PC_VPC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
>> +        break;
>> +    case PC_VPC_SCRUB_TRIG:
>> +        pnv_xive_nvt_update(xive, GETFIELD(PC_SCRUB_BLOCK_ID, val),
>> +                           GETFIELD(PC_SCRUB_OFFSET, val));
>> +        break;
>> +    case VC_EQC_CWATCH_DAT0:
>> +    case VC_EQC_CWATCH_DAT1:
>> +    case VC_EQC_CWATCH_DAT2:
>> +    case VC_EQC_CWATCH_DAT3:
>> +        xive->eqc_watch[(offset - VC_EQC_CWATCH_DAT0) / 8] = cpu_to_be64(val);
>> +        break;
>> +    case VC_EQC_SCRUB_TRIG:
>> +        pnv_xive_end_update(xive, GETFIELD(VC_SCRUB_BLOCK_ID, val),
>> +                            GETFIELD(VC_SCRUB_OFFSET, val));
>> +        break;
>> +
>> +    /*
>> +     * XIVE PC & VC cache invalidation
>> +     */
>> +    case PC_AT_KILL:
>> +        xive->regs[reg] |= val;
>> +        break;
>> +    case VC_AT_MACRO_KILL:
>> +        xive->regs[reg] |= val;
>> +        break;
>> +    case PC_AT_KILL_MASK:
>> +    case VC_AT_MACRO_KILL_MASK:
>> +        xive->regs[reg] = val;
>> +        break;
>> +
>> +    default:
>> +        xive_error(xive, "IC: invalid write to reg=0x%08x mmio=%d", offset,
>> +                   mmio);
>> +    }
>> +}
>> +
>> +static uint64_t pnv_xive_ic_reg_read(PnvXive *xive, uint32_t offset, bool mmio)
>> +{
>> +    uint64_t val = 0;
>> +    uint32_t reg = offset >> 3;
>> +
>> +    switch (offset) {
>> +    case CQ_CFG_PB_GEN:
>> +    case CQ_IC_BAR:
>> +    case CQ_TM1_BAR:
>> +    case CQ_TM2_BAR:
>> +    case CQ_PC_BAR:
>> +    case CQ_PC_BARM:
>> +    case CQ_VC_BAR:
>> +    case CQ_VC_BARM:
>> +    case CQ_TAR:
>> +    case CQ_TDR:
>> +    case CQ_PBI_CTL:
>> +
>> +    case PC_TCTXT_CFG:
>> +    case PC_TCTXT_TRACK:
>> +    case PC_TCTXT_INDIR0:
>> +    case PC_TCTXT_INDIR1:
>> +    case PC_TCTXT_INDIR2:
>> +    case PC_TCTXT_INDIR3:
>> +    case PC_GLOBAL_CONFIG:
>> +
>> +    case PC_VPC_SCRUB_MASK:
>> +    case PC_VPC_CWATCH_SPEC:
>> +    case PC_VPC_CWATCH_DAT0:
>> +    case PC_VPC_CWATCH_DAT1:
>> +    case PC_VPC_CWATCH_DAT2:
>> +    case PC_VPC_CWATCH_DAT3:
>> +    case PC_VPC_CWATCH_DAT4:
>> +    case PC_VPC_CWATCH_DAT5:
>> +    case PC_VPC_CWATCH_DAT6:
>> +    case PC_VPC_CWATCH_DAT7:
>> +
>> +    case VC_GLOBAL_CONFIG:
>> +    case VC_AIB_TX_ORDER_TAG2:
>> +
>> +    case VC_IRQ_CONFIG_IPI:
>> +    case VC_IRQ_CONFIG_HW:
>> +    case VC_IRQ_CONFIG_CASCADE1:
>> +    case VC_IRQ_CONFIG_CASCADE2:
>> +    case VC_IRQ_CONFIG_REDIST:
>> +    case VC_IRQ_CONFIG_IPI_CASC:
>> +
>> +    case VC_EQC_SCRUB_MASK:
>> +    case VC_EQC_CWATCH_DAT0:
>> +    case VC_EQC_CWATCH_DAT1:
>> +    case VC_EQC_CWATCH_DAT2:
>> +    case VC_EQC_CWATCH_DAT3:
>> +
>> +    case VC_EQC_CWATCH_SPEC:
>> +    case VC_IVC_SCRUB_MASK:
>> +    case VC_SBC_CONFIG:
>> +    case VC_AT_MACRO_KILL_MASK:
>> +    case VC_VSD_TABLE_ADDR:
>> +    case PC_VSD_TABLE_ADDR:
>> +    case VC_VSD_TABLE_DATA:
>> +    case PC_VSD_TABLE_DATA:
>> +        val = xive->regs[reg];
>> +        break;
>> +
>> +    case CQ_MSGSND: /* Identifies which cores have msgsnd enabled.
>> +                     * Say all have. */
>> +        val = 0xffffff0000000000;
>> +        break;
>> +
>> +    /*
>> +     * XIVE PC & VC cache updates for EAS, NVT and END
>> +     */
>> +    case PC_VPC_SCRUB_TRIG:
>> +    case VC_IVC_SCRUB_TRIG:
>> +    case VC_EQC_SCRUB_TRIG:
>> +        xive->regs[reg] &= ~VC_SCRUB_VALID;
>> +        val = xive->regs[reg];
>> +        break;
>> +
>> +    /*
>> +     * XIVE PC & VC cache invalidation
>> +     */
>> +    case PC_AT_KILL:
>> +        xive->regs[reg] &= ~PC_AT_KILL_VALID;
>> +        val = xive->regs[reg];
>> +        break;
>> +    case VC_AT_MACRO_KILL:
>> +        xive->regs[reg] &= ~VC_KILL_VALID;
>> +        val = xive->regs[reg];
>> +        break;
>> +
>> +    /*
>> +     * XIVE synchronisation
>> +     */
>> +    case VC_EQC_CONFIG:
>> +        val = VC_EQC_SYNC_MASK;
>> +        break;
>> +
>> +    default:
>> +        xive_error(xive, "IC: invalid read reg=0x%08x mmio=%d", offset, mmio);
>> +    }
>> +
>> +    return val;
>> +}
>> +
>> +static void pnv_xive_ic_reg_write_mmio(void *opaque, hwaddr addr,
>> +                                       uint64_t val, unsigned size)
>> +{
>> +    pnv_xive_ic_reg_write(opaque, addr, val, true);
> 
> AFAICT the underlaying write function never uses that 'mmio' parameter
> except for debug, so it's probably not worth the bother of having
> these wrappers.

OK. I will check.

>> +}
>> +
>> +static uint64_t pnv_xive_ic_reg_read_mmio(void *opaque, hwaddr addr,
>> +                                      unsigned size)
>> +{
>> +    return pnv_xive_ic_reg_read(opaque, addr, true);
>> +}
>> +
>> +static const MemoryRegionOps pnv_xive_ic_reg_ops = {
>> +    .read = pnv_xive_ic_reg_read_mmio,
>> +    .write = pnv_xive_ic_reg_write_mmio,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +/*
>> + * Interrupt Controller MMIO: Notify port page (write only)
>> + */
>> +#define PNV_XIVE_FORWARD_IPI        0x800 /* Forward IPI */
>> +#define PNV_XIVE_FORWARD_HW         0x880 /* Forward HW */
>> +#define PNV_XIVE_FORWARD_OS_ESC     0x900 /* Forward OS escalation */
>> +#define PNV_XIVE_FORWARD_HW_ESC     0x980 /* Forward Hyp escalation */
>> +#define PNV_XIVE_FORWARD_REDIS      0xa00 /* Forward Redistribution */
>> +#define PNV_XIVE_RESERVED5          0xa80 /* Cache line 5 PowerBUS operation */
>> +#define PNV_XIVE_RESERVED6          0xb00 /* Cache line 6 PowerBUS operation */
>> +#define PNV_XIVE_RESERVED7          0xb80 /* Cache line 7 PowerBUS operation */
>> +
>> +/* VC synchronisation */
>> +#define PNV_XIVE_SYNC_IPI           0xc00 /* Sync IPI */
>> +#define PNV_XIVE_SYNC_HW            0xc80 /* Sync HW */
>> +#define PNV_XIVE_SYNC_OS_ESC        0xd00 /* Sync OS escalation */
>> +#define PNV_XIVE_SYNC_HW_ESC        0xd80 /* Sync Hyp escalation */
>> +#define PNV_XIVE_SYNC_REDIS         0xe00 /* Sync Redistribution */
>> +
>> +/* PC synchronisation */
>> +#define PNV_XIVE_SYNC_PULL          0xe80 /* Sync pull context */
>> +#define PNV_XIVE_SYNC_PUSH          0xf00 /* Sync push context */
>> +#define PNV_XIVE_SYNC_VPC           0xf80 /* Sync remove VPC store */
>> +
>> +static void pnv_xive_ic_hw_trigger(PnvXive *xive, hwaddr addr, uint64_t val)
>> +{
>> +    XiveFabricClass *xfc = XIVE_FABRIC_GET_CLASS(xive);
>> +
>> +    xfc->notify(XIVE_FABRIC(xive), val);
>> +}
>> +
>> +static void pnv_xive_ic_notify_write(void *opaque, hwaddr addr, uint64_t val,
>> +                                     unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    /* VC: HW triggers */
>> +    switch (addr) {
>> +    case 0x000 ... 0x7FF:
>> +        pnv_xive_ic_hw_trigger(opaque, addr, val);
>> +        break;
>> +
>> +    /* VC: Forwarded IRQs */
>> +    case PNV_XIVE_FORWARD_IPI:
>> +    case PNV_XIVE_FORWARD_HW:
>> +    case PNV_XIVE_FORWARD_OS_ESC:
>> +    case PNV_XIVE_FORWARD_HW_ESC:
>> +    case PNV_XIVE_FORWARD_REDIS:
>> +        /* TODO: forwarded IRQs. Should be like HW triggers */
>> +        xive_error(xive, "IC: forwarded at @0x%"HWADDR_PRIx" IRQ 0x%"PRIx64,
>> +                   addr, val);
>> +        break;
>> +
>> +    /* VC syncs */
>> +    case PNV_XIVE_SYNC_IPI:
>> +    case PNV_XIVE_SYNC_HW:
>> +    case PNV_XIVE_SYNC_OS_ESC:
>> +    case PNV_XIVE_SYNC_HW_ESC:
>> +    case PNV_XIVE_SYNC_REDIS:
>> +        break;
>> +
>> +    /* PC sync */
>> +    case PNV_XIVE_SYNC_PULL:
>> +    case PNV_XIVE_SYNC_PUSH:
>> +    case PNV_XIVE_SYNC_VPC:
>> +        break;
>> +
>> +    default:
>> +        xive_error(xive, "IC: invalid notify write @%"HWADDR_PRIx, addr);
>> +    }
>> +}
>> +
>> +static uint64_t pnv_xive_ic_notify_read(void *opaque, hwaddr addr,
>> +                                        unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    /* loads are invalid */
>> +    xive_error(xive, "IC: invalid notify read @%"HWADDR_PRIx, addr);
>> +    return -1;
>> +}
>> +
>> +static const MemoryRegionOps pnv_xive_ic_notify_ops = {
>> +    .read = pnv_xive_ic_notify_read,
>> +    .write = pnv_xive_ic_notify_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +/*
>> + * Interrupt controller MMIO region. The layout is compatible between
>> + * 4K and 64K pages :
>> + *
>> + * Page 0           sub-engine BARs
>> + *  0x000 - 0x3FF   IC registers
>> + *  0x400 - 0x7FF   PC registers
>> + *  0x800 - 0xFFF   VC registers
>> + *
>> + * Page 1           Notify page
>> + *  0x000 - 0x7FF   HW interrupt triggers (PSI, PHB)
>> + *  0x800 - 0xFFF   forwards and syncs
>> + *
>> + * Page 2           LSI Trigger page (writes only) (not modeled)
>> + * Page 3           LSI SB EOI page (reads only) (not modeled)
>> + *
>> + * Page 4-7         indirect TIMA (aliased to TIMA region)
>> + */
>> +static void pnv_xive_ic_write(void *opaque, hwaddr addr,
>> +                              uint64_t val, unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    xive_error(xive, "IC: invalid write @%"HWADDR_PRIx, addr);
>> +}
>> +
>> +static uint64_t pnv_xive_ic_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    xive_error(xive, "IC: invalid read @%"HWADDR_PRIx, addr);
>> +    return -1;
>> +}
>> +
>> +static const MemoryRegionOps pnv_xive_ic_ops = {
>> +    .read = pnv_xive_ic_read,
>> +    .write = pnv_xive_ic_write,
> 
> Erm.. it's not clear to me what this achieves, since the read/write
> accessors just error every time.

These are the ops for the main IC MMIO region (8 pages), which contains 
the subregions ic_reg_mmio and ic_notify_mmio, one page each. Pages 2-3 
are not implemented while pages 4-7 are, so we have a hole. The 
container ops catch accesses to that hole.
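
To make the layout explicit, here is a small sketch of how an offset in
the IC region decodes to a page, following the layout described in the
comment above pnv_xive_ic_write(). The enum and function names are
hypothetical; in the model the dispatch is done by the memory API through
the subregions, not by code like this.

```c
#include <assert.h>
#include <stdint.h>

typedef enum {
    IC_PAGE_REGS,           /* page 0: sub-engine registers */
    IC_PAGE_NOTIFY,         /* page 1: notify port */
    IC_PAGE_UNMODELED,      /* pages 2-3: LSI trigger and EOI, the hole */
    IC_PAGE_TIMA_INDIRECT,  /* pages 4-7: indirect TIMA */
    IC_PAGE_OUT_OF_RANGE,   /* beyond the 8 pages of the region */
} IcPageKind;

/* Classify an offset within the IC region for 4K or 64K pages */
static IcPageKind ic_page_kind(uint64_t addr, unsigned page_shift)
{
    uint64_t page;

    assert(page_shift == 12 || page_shift == 16);
    page = addr >> page_shift;

    if (page == 0) {
        return IC_PAGE_REGS;
    } else if (page == 1) {
        return IC_PAGE_NOTIFY;
    } else if (page <= 3) {
        return IC_PAGE_UNMODELED;
    } else if (page <= 7) {
        return IC_PAGE_TIMA_INDIRECT;
    }
    return IC_PAGE_OUT_OF_RANGE;
}
```

Only the IC_PAGE_UNMODELED case ends up in the erroring accessors of the
container region.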


>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +/*
>> + * Interrupt controller XSCOM region. Load accesses are nearly all
>> + * done through the MMIO region.
>> + */
>> +static uint64_t pnv_xive_xscom_read(void *opaque, hwaddr addr, unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    switch (addr >> 3) {
>> +    case X_VC_EQC_CONFIG:
>> +        /*
>> +         * This is the only XSCOM load done in skiboot. Bizarre. To be
>> +         * checked.
>> +         */
>> +        return VC_EQC_SYNC_MASK;
>> +    default:
>> +        return pnv_xive_ic_reg_read(xive, addr, false);
>> +    }
>> +}
>> +
>> +static void pnv_xive_xscom_write(void *opaque, hwaddr addr,
>> +                                uint64_t val, unsigned size)
>> +{
>> +    pnv_xive_ic_reg_write(opaque, addr, val, false);
>> +}
>> +
>> +static const MemoryRegionOps pnv_xive_xscom_ops = {
>> +    .read = pnv_xive_xscom_read,
>> +    .write = pnv_xive_xscom_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    }
>> +};
>> +
>> +/*
>> + * Virtualization Controller MMIO region containing the IPI and END ESB pages
>> + */
>> +static uint64_t pnv_xive_vc_read(void *opaque, hwaddr offset,
>> +                                 unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +    uint64_t edt_index = offset >> xive->edt_shift;
>> +    uint64_t edt_type = 0;
>> +    uint64_t ret = -1;
>> +    uint64_t edt_offset;
>> +    MemTxResult result;
>> +    AddressSpace *edt_as = NULL;
>> +
>> +    if (edt_index < XIVE_XLATE_EDT_MAX) {
>> +        edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[edt_index]);
>> +    }
>> +
>> +    switch (edt_type) {
>> +    case CQ_TDR_EDT_IPI:
>> +        edt_as = &xive->ipi_as;
>> +        break;
>> +    case CQ_TDR_EDT_EQ:
>> +        edt_as = &xive->end_as;
>> +        break;
>> +    default:
>> +        xive_error(xive, "VC: invalid read @%"HWADDR_PRIx, offset);
>> +        return -1;
>> +    }
>> +
>> +    /* remap the offset for the targeted address space */
>> +    edt_offset = pnv_xive_set_xlate_edt_offset(xive, offset, edt_type);
>> +
>> +    ret = address_space_ldq(edt_as, edt_offset, MEMTXATTRS_UNSPECIFIED,
>> +                            &result);
> 
> I think there needs to be a byteswap here somewhere.  This is loading
> a value from a BE table, AFAICT...
> 
>> +    if (result != MEMTX_OK) {
>> +        xive_error(xive, "VC: %s read failed at @0x%"HWADDR_PRIx " -> @0x%"
>> +                   HWADDR_PRIx, edt_type == CQ_TDR_EDT_IPI ? "IPI" : "END",
>> +                   offset, edt_offset);
>> +        return -1;
>> +    }
> 
> ... but these helpers are expected to return host-native values.

hmm, yes. 

This works today because these address spaces are backed by the ESB pages 
of the XiveSource and the XiveENDSource for which the data can be ignored.
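
For the general case where the data does matter, the conversion would
look like the sketch below: the 8 bytes are read in big-endian order and
the result is a host-native value, whatever the host endianness. The
helper names are hypothetical. In the model itself, an endian-explicit
accessor such as address_space_ldq_be() would probably be the natural
fit.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Load a 64-bit big-endian value from memory into a host-native integer */
static uint64_t load_be64(const uint8_t *p)
{
    uint64_t v = 0;
    size_t i;

    assert(p != NULL);
    for (i = 0; i < 8; i++) {
        v = (v << 8) | p[i];    /* most significant byte comes first */
    }
    return v;
}

/* Store a host-native 64-bit integer to memory in big-endian order */
static void store_be64(uint8_t *p, uint64_t v)
{
    size_t i;

    assert(p != NULL);
    for (i = 0; i < 8; i++) {
        p[i] = (uint8_t)(v >> (56 - 8 * i));
    }
}
```

Because the helpers work byte by byte, they do not depend on the
endianness of the host.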


>> +    return ret;
>> +}
>> +
>> +static void pnv_xive_vc_write(void *opaque, hwaddr offset,
>> +                              uint64_t val, unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +    uint64_t edt_index = offset >> xive->edt_shift;
>> +    uint64_t edt_type = 0;
>> +    uint64_t edt_offset;
>> +    MemTxResult result;
>> +    AddressSpace *edt_as = NULL;
>> +
>> +    if (edt_index < XIVE_XLATE_EDT_MAX) {
>> +        edt_type = GETFIELD(CQ_TDR_EDT_TYPE, xive->set_xlate_edt[edt_index]);
>> +    }
>> +
>> +    switch (edt_type) {
>> +    case CQ_TDR_EDT_IPI:
>> +        edt_as = &xive->ipi_as;
>> +        break;
>> +    case CQ_TDR_EDT_EQ:
>> +        edt_as = &xive->end_as;
>> +        break;
>> +    default:
>> +        xive_error(xive, "VC: invalid write @%"HWADDR_PRIx, offset);
>> +        return;
>> +    }
>> +
>> +    /* remap the offset for the targeted address space */
>> +    edt_offset = pnv_xive_set_xlate_edt_offset(xive, offset, edt_type);
>> +
>> +    address_space_stq(edt_as, edt_offset, val, MEMTXATTRS_UNSPECIFIED, &result);
>> +    if (result != MEMTX_OK) {
>> +        xive_error(xive, "VC: write failed at @0x%"HWADDR_PRIx, edt_offset);
>> +    }
>> +}
>> +
>> +static const MemoryRegionOps pnv_xive_vc_ops = {
>> +    .read = pnv_xive_vc_read,
>> +    .write = pnv_xive_vc_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 8,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +/*
>> + * Presenter Controller MMIO region. This is used by the Virtualization
>> + * Controller to update the IPB in the NVT table when required. Not
>> + * implemented.
>> + */
>> +static uint64_t pnv_xive_pc_read(void *opaque, hwaddr addr,
>> +                                 unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    xive_error(xive, "PC: invalid read @%"HWADDR_PRIx, addr);
>> +    return -1;
>> +}
>> +
>> +static void pnv_xive_pc_write(void *opaque, hwaddr addr,
>> +                              uint64_t value, unsigned size)
>> +{
>> +    PnvXive *xive = PNV_XIVE(opaque);
>> +
>> +    xive_error(xive, "PC: invalid write to VC @%"HWADDR_PRIx, addr);
>> +}
>> +
>> +static const MemoryRegionOps pnv_xive_pc_ops = {
>> +    .read = pnv_xive_pc_read,
>> +    .write = pnv_xive_pc_write,
>> +    .endianness = DEVICE_BIG_ENDIAN,
>> +    .valid = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +    .impl = {
>> +        .min_access_size = 1,
>> +        .max_access_size = 8,
>> +    },
>> +};
>> +
>> +void pnv_xive_pic_print_info(PnvXive *xive, Monitor *mon)
>> +{
>> +    XiveRouter *xrtr = XIVE_ROUTER(xive);
>> +    XiveEAS eas;
>> +    XiveEND end;
>> +    uint32_t endno = 0;
>> +    uint32_t srcno0 = XIVE_SRCNO(xive->chip_id, 0);
>> +    uint32_t srcno = srcno0;
>> +
>> +    monitor_printf(mon, "XIVE[%x] Source %08x .. %08x\n", xive->chip_id,
>> +                  srcno0, srcno0 + xive->source.nr_irqs - 1);
>> +    xive_source_pic_print_info(&xive->source, srcno0, mon);
>> +
>> +    monitor_printf(mon, "XIVE[%x] EAT %08x .. %08x\n", xive->chip_id,
>> +                   srcno0, srcno0 + xive->nr_irqs - 1);
>> +    while (!xive_router_get_eas(xrtr, srcno, &eas)) {
>> +        if (!(eas.w & EAS_MASKED)) {
>> +            xive_eas_pic_print_info(&eas, srcno, mon);
>> +        }
>> +        srcno++;
>> +    }
>> +
>> +    monitor_printf(mon, "XIVE[%x] ENDT %08x .. %08x\n", xive->chip_id,
>> +                   0, xive->nr_ends - 1);
>> +    while (!xive_router_get_end(xrtr, xrtr->chip_id, endno, &end)) {
>> +        xive_end_pic_print_info(&end, endno++, mon);
>> +    }
>> +}
>> +
>> +static void pnv_xive_reset(DeviceState *dev)
>> +{
>> +    PnvXive *xive = PNV_XIVE(dev);
>> +    PnvChip *chip = PNV_CHIP(object_property_get_link(OBJECT(dev), "chip",
>> +                                                      &error_fatal));
>> +
>> +    /*
>> +     * Use the chip id to identify the XIVE interrupt controller. It
>> +     * can be overridden by configuration at runtime.
>> +     */
>> +    xive->chip_id = xive->thread_chip_id = chip->chip_id;
> 
> You shouldn't need to touch this at reset, only at init/realize.

yes apart from the thread_chip_id.

> 
>> +    /* Default page size. Should be changed at runtime to 64k */
>> +    xive->ic_shift = xive->vc_shift = xive->pc_shift = 12;
>> +
>> +    /*
>> +     * PowerNV XIVE sources are realized at runtime when the set
>> +     * translation tables are configured.
> 
> Yeah.. that seems unlikely to be a good idea.

I can try to allocate the maximum IRQ number space for the XiveSource. 
I don't know how to size the XiveENDSource though. Hmm, or maybe I 
should just map portions of the IPI MMIO regions and the END MMIO 
regions depending on how the firmware configures them. 

>> +     */
>> +    if (DEVICE(&xive->source)->realized) {
>> +        object_property_set_bool(OBJECT(&xive->source), false, "realized",
>> +                                 &error_fatal);
>> +    }
>> +
>> +    if (DEVICE(&xive->end_source)->realized) {
>> +        object_property_set_bool(OBJECT(&xive->end_source), false, "realized",
>> +                                 &error_fatal);
>> +    }
>> +}
>> +
>> +/*
>> + * The VC sub-engine incorporates a source controller for the IPIs.
>> + * When triggered, we need to construct a source number with the
>> + * chip/block identifier
>> + */
>> +static void pnv_xive_notify(XiveFabric *xf, uint32_t srcno)

We won't need this routine anymore in the version of the model you have
merged. The decoding of the interrupt number is handled by the XiveRouter
now.
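
For reference, the kind of encoding that XIVE_SRCNO(chip_id, srcno) performs
can be sketched as below. The 4-bit block / 28-bit index split and the
sketch_* names are assumptions made for illustration; the authoritative
layout is whatever the XIVE register definitions in the tree say:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: block id in the top nibble, source index below it. */
#define SKETCH_SRCNO_INDEX_BITS 28
#define SKETCH_SRCNO_INDEX_MASK ((UINT32_C(1) << SKETCH_SRCNO_INDEX_BITS) - 1)

/* Build a global source number from a block (chip) id and a local index. */
static uint32_t sketch_srcno(uint8_t blk, uint32_t idx)
{
    assert(blk < 16);
    assert(idx <= SKETCH_SRCNO_INDEX_MASK);
    return ((uint32_t)blk << SKETCH_SRCNO_INDEX_BITS) | idx;
}

/* Recover the block id, as a router would when decoding the number. */
static uint8_t sketch_srcno_block(uint32_t srcno)
{
    return (uint8_t)(srcno >> SKETCH_SRCNO_INDEX_BITS);
}

/* Recover the local index within the block. */
static uint32_t sketch_srcno_index(uint32_t srcno)
{
    return srcno & SKETCH_SRCNO_INDEX_MASK;
}
```

Once the router does the decoding itself, the controller only has to hand
over the composed number, so a per-controller notify hook adds nothing.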

>> +{
>> +    PnvXive *xive = PNV_XIVE(xf);
>> +
>> +    xive_router_notify(xf, XIVE_SRCNO(xive->chip_id, srcno));
>> +}
>> +
>> +static void pnv_xive_init(Object *obj)
>> +{
>> +    PnvXive *xive = PNV_XIVE(obj);
>> +
>> +    object_initialize(&xive->source, sizeof(xive->source), TYPE_XIVE_SOURCE);
>> +    object_property_add_child(obj, "source", OBJECT(&xive->source), NULL);
>> +
>> +    object_initialize(&xive->end_source, sizeof(xive->end_source),
>> +                      TYPE_XIVE_END_SOURCE);
>> +    object_property_add_child(obj, "end_source", OBJECT(&xive->end_source),
>> +                              NULL);
>> +}
>> +
>> +static void pnv_xive_realize(DeviceState *dev, Error **errp)
>> +{
>> +    PnvXive *xive = PNV_XIVE(dev);
>> +
>> +    /* Default page size. Generally changed at runtime to 64k */
>> +    xive->ic_shift = xive->vc_shift = xive->pc_shift = 12;
>> +
>> +    /* XSCOM region, used for initial configuration of the BARs */
>> +    memory_region_init_io(&xive->xscom_regs, OBJECT(dev), &pnv_xive_xscom_ops,
>> +                          xive, "xscom-xive", PNV9_XSCOM_XIVE_SIZE << 3);
>> +
>> +    /* Interrupt controller MMIO region */
>> +    memory_region_init_io(&xive->ic_mmio, OBJECT(dev), &pnv_xive_ic_ops, xive,
>> +                          "xive.ic", PNV9_XIVE_IC_SIZE);
>> +    memory_region_init_io(&xive->ic_reg_mmio, OBJECT(dev), &pnv_xive_ic_reg_ops,
>> +                          xive, "xive.ic.reg", 1 << xive->ic_shift);
>> +    memory_region_init_io(&xive->ic_notify_mmio, OBJECT(dev),
>> +                          &pnv_xive_ic_notify_ops,
>> +                          xive, "xive.ic.notify", 1 << xive->ic_shift);
>> +
>> +    /* The Pervasive LSI trigger and EOI pages are not modeled */
>> +
>> +    /*
>> +     * Overall Virtualization Controller MMIO region containing the
>> +     * IPI ESB pages and END ESB pages. The layout is defined by the
>> +     * EDT set translation table and the accesses are dispatched using
>> +     * address spaces for each.
>> +     */
>> +    memory_region_init_io(&xive->vc_mmio, OBJECT(xive), &pnv_xive_vc_ops, xive,
>> +                          "xive.vc", PNV9_XIVE_VC_SIZE);
>> +
>> +    memory_region_init(&xive->ipi_mmio, OBJECT(xive), "xive.vc.ipi",
>> +                       PNV9_XIVE_VC_SIZE);
>> +    address_space_init(&xive->ipi_as, &xive->ipi_mmio, "xive.vc.ipi");
>> +    memory_region_init(&xive->end_mmio, OBJECT(xive), "xive.vc.end",
>> +                       PNV9_XIVE_VC_SIZE);
>> +    address_space_init(&xive->end_as, &xive->end_mmio, "xive.vc.end");
>> +
>> +
>> +    /* Presenter Controller MMIO region (not implemented) */
>> +    memory_region_init_io(&xive->pc_mmio, OBJECT(xive), &pnv_xive_pc_ops, xive,
>> +                          "xive.pc", PNV9_XIVE_PC_SIZE);
>> +
>> +    /* Thread Interrupt Management Area, direct and indirect */
>> +    memory_region_init_io(&xive->tm_mmio, OBJECT(xive), &xive_tm_ops,
>> +                          &xive->cpu_ind, "xive.tima", PNV9_XIVE_TM_SIZE);
>> +    memory_region_init_alias(&xive->tm_mmio_indirect, OBJECT(xive),
>> +                             "xive.tima.indirect",
>> +                             &xive->tm_mmio, 0, PNV9_XIVE_TM_SIZE);
> 
> I'm not quite sure how aliasing to the TIMA can work.  AIUI the TIMA
> via its normal access magically tests the requesting CPU to work out
> which TCTX it should manipulate.  Isn't the idea of the indirect
> access to access some other thread's TIMA for debugging, in which case
> you need to override that thread id somehow?

Yes. Indirect TIMA accesses need to be discussed.

Check the pnv_xive_thread_indirect_set() routine in the PnvXive controller
and the associated changes in xive_tm_write() and xive_tm_read() of the
TIMA, which are below.
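
The selection done in those two handlers boils down to the following. This is
a stand-alone sketch with stand-in types (SketchCPU and sketch_tm_target are
hypothetical names), not the model code itself:

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for PowerPCCPU; only object identity matters in this sketch. */
typedef struct SketchCPU {
    int pir;
} SketchCPU;

/*
 * The opaque of the TIMA region points at a slot holding the indirect
 * target. An empty slot means "use the CPU performing the access".
 */
static SketchCPU *sketch_tm_target(SketchCPU **indirect_slot,
                                   SketchCPU *current)
{
    assert(indirect_slot != NULL);
    return *indirect_slot ? *indirect_slot : current;
}
```

Since the indirect region is an alias of the direct one, both share the same
opaque. In this sketch that means a non-empty slot redirects direct accesses
as well, which is part of what needs discussing.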

Thanks,

C.

>> +}
>> +
>> +static int pnv_xive_dt_xscom(PnvXScomInterface *dev, void *fdt,
>> +                             int xscom_offset)
>> +{
>> +    const char compat[] = "ibm,power9-xive-x";
>> +    char *name;
>> +    int offset;
>> +    uint32_t lpc_pcba = PNV9_XSCOM_XIVE_BASE;
>> +    uint32_t reg[] = {
>> +        cpu_to_be32(lpc_pcba),
>> +        cpu_to_be32(PNV9_XSCOM_XIVE_SIZE)
>> +    };
>> +
>> +    name = g_strdup_printf("xive@%x", lpc_pcba);
>> +    offset = fdt_add_subnode(fdt, xscom_offset, name);
>> +    _FDT(offset);
>> +    g_free(name);
>> +
>> +    _FDT((fdt_setprop(fdt, offset, "reg", reg, sizeof(reg))));
>> +    _FDT((fdt_setprop(fdt, offset, "compatible", compat,
>> +                      sizeof(compat))));
>> +    return 0;
>> +}
>> +
>> +static Property pnv_xive_properties[] = {
>> +    DEFINE_PROP_UINT64("ic-bar", PnvXive, ic_base, 0),
>> +    DEFINE_PROP_UINT64("vc-bar", PnvXive, vc_base, 0),
>> +    DEFINE_PROP_UINT64("pc-bar", PnvXive, pc_base, 0),
>> +    DEFINE_PROP_UINT64("tm-bar", PnvXive, tm_base, 0),
>> +    DEFINE_PROP_END_OF_LIST(),
>> +};
>> +
>> +static void pnv_xive_class_init(ObjectClass *klass, void *data)
>> +{
>> +    DeviceClass *dc = DEVICE_CLASS(klass);
>> +    PnvXScomInterfaceClass *xdc = PNV_XSCOM_INTERFACE_CLASS(klass);
>> +    XiveRouterClass *xrc = XIVE_ROUTER_CLASS(klass);
>> +    XiveFabricClass *xfc = XIVE_FABRIC_CLASS(klass);
>> +
>> +    xdc->dt_xscom = pnv_xive_dt_xscom;
>> +
>> +    dc->desc = "PowerNV XIVE Interrupt Controller";
>> +    dc->realize = pnv_xive_realize;
>> +    dc->props = pnv_xive_properties;
>> +    dc->reset = pnv_xive_reset;
>> +
>> +    xrc->get_eas = pnv_xive_get_eas;
>> +    xrc->set_eas = pnv_xive_set_eas;
>> +    xrc->get_end = pnv_xive_get_end;
>> +    xrc->set_end = pnv_xive_set_end;
>> +    xrc->get_nvt  = pnv_xive_get_nvt;
>> +    xrc->set_nvt  = pnv_xive_set_nvt;
>> +
>> +    xfc->notify  = pnv_xive_notify;
>> +};
>> +
>> +static const TypeInfo pnv_xive_info = {
>> +    .name          = TYPE_PNV_XIVE,
>> +    .parent        = TYPE_XIVE_ROUTER,
>> +    .instance_init = pnv_xive_init,
>> +    .instance_size = sizeof(PnvXive),
>> +    .class_init    = pnv_xive_class_init,
>> +    .interfaces    = (InterfaceInfo[]) {
>> +        { TYPE_PNV_XSCOM_INTERFACE },
>> +        { }
>> +    }
>> +};
>> +
>> +static void pnv_xive_register_types(void)
>> +{
>> +    type_register_static(&pnv_xive_info);
>> +}
>> +
>> +type_init(pnv_xive_register_types)
>> diff --git a/hw/intc/xive.c b/hw/intc/xive.c
>> index c9aedecc8216..9925c90481ae 100644
>> --- a/hw/intc/xive.c
>> +++ b/hw/intc/xive.c
>> @@ -51,6 +51,8 @@ static uint8_t exception_mask(uint8_t ring)
>>      switch (ring) {
>>      case TM_QW1_OS:
>>          return TM_QW1_NSR_EO;
>> +    case TM_QW3_HV_PHYS:
>> +        return TM_QW3_NSR_HE;
>>      default:
>>          g_assert_not_reached();
>>      }
>> @@ -85,7 +87,17 @@ static void xive_tctx_notify(XiveTCTX *tctx, uint8_t ring)
>>      uint8_t *regs = &tctx->regs[ring];
>>  
>>      if (regs[TM_PIPR] < regs[TM_CPPR]) {
>> -        regs[TM_NSR] |= exception_mask(ring);
>> +        switch (ring) {
>> +        case TM_QW1_OS:
>> +            regs[TM_NSR] |= TM_QW1_NSR_EO;
>> +            break;
>> +        case TM_QW3_HV_PHYS:
>> +            regs[TM_NSR] |= SETFIELD(TM_QW3_NSR_HE, regs[TM_NSR],
>> +                                     TM_QW3_NSR_HE_PHYS);
>> +            break;
>> +        default:
>> +            g_assert_not_reached();
>> +        }
>>          qemu_irq_raise(tctx->output);
>>      }
>>  }
>> @@ -116,6 +128,38 @@ static void xive_tctx_set_cppr(XiveTCTX *tctx, uint8_t ring, uint8_t cppr)
>>  #define XIVE_TM_OS_PAGE   0x2
>>  #define XIVE_TM_USER_PAGE 0x3
>>  
>> +static void xive_tm_set_hv_cppr(XiveTCTX *tctx, hwaddr offset,
>> +                                uint64_t value, unsigned size)
>> +{
>> +    xive_tctx_set_cppr(tctx, TM_QW3_HV_PHYS, value & 0xff);
>> +}
>> +
>> +static uint64_t xive_tm_ack_hv_reg(XiveTCTX *tctx, hwaddr offset, unsigned size)
>> +{
>> +    return xive_tctx_accept(tctx, TM_QW3_HV_PHYS);
>> +}
>> +
>> +static uint64_t xive_tm_pull_pool_ctx(XiveTCTX *tctx, hwaddr offset,
>> +                                      unsigned size)
>> +{
>> +    uint64_t ret;
>> +
>> +    ret = tctx->regs[TM_QW2_HV_POOL + TM_WORD2] & TM_QW2W2_POOL_CAM;
>> +    tctx->regs[TM_QW2_HV_POOL + TM_WORD2] &= ~TM_QW2W2_POOL_CAM;
>> +    return ret;
>> +}
>> +
>> +static void xive_tm_vt_push(XiveTCTX *tctx, hwaddr offset,
>> +                            uint64_t value, unsigned size)
>> +{
>> +    tctx->regs[TM_QW3_HV_PHYS + TM_WORD2] = value & 0xff;
>> +}
>> +
>> +static uint64_t xive_tm_vt_poll(XiveTCTX *tctx, hwaddr offset, unsigned size)
>> +{
>> +    return tctx->regs[TM_QW3_HV_PHYS + TM_WORD2] & 0xff;
>> +}
>> +
>>  /*
>>   * Define an access map for each page of the TIMA that we will use in
>>   * the memory region ops to filter values when doing loads and stores
>> @@ -295,10 +339,16 @@ static const XiveTmOp xive_tm_operations[] = {
>>       * effects
>>       */
>>      { XIVE_TM_OS_PAGE, TM_QW1_OS + TM_CPPR,   1, xive_tm_set_os_cppr, NULL },
>> +    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_CPPR, 1, xive_tm_set_hv_cppr, NULL },
>> +    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_WORD2, 1, xive_tm_vt_push, NULL },
>> +    { XIVE_TM_HV_PAGE, TM_QW3_HV_PHYS + TM_WORD2, 1, NULL, xive_tm_vt_poll },
>>  
>>      /* MMIOs above 2K : special operations with side effects */
>>      { XIVE_TM_OS_PAGE, TM_SPC_ACK_OS_REG,     2, NULL, xive_tm_ack_os_reg },
>>      { XIVE_TM_OS_PAGE, TM_SPC_SET_OS_PENDING, 1, xive_tm_set_os_pending, NULL },
>> +    { XIVE_TM_HV_PAGE, TM_SPC_ACK_HV_REG,     2, NULL, xive_tm_ack_hv_reg },
>> +    { XIVE_TM_HV_PAGE, TM_SPC_PULL_POOL_CTX,  4, NULL, xive_tm_pull_pool_ctx },
>> +    { XIVE_TM_HV_PAGE, TM_SPC_PULL_POOL_CTX,  8, NULL, xive_tm_pull_pool_ctx },
>>  };
>>  
>>  static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
>> @@ -327,7 +377,8 @@ static const XiveTmOp *xive_tm_find_op(hwaddr offset, unsigned size, bool write)
>>  static void xive_tm_write(void *opaque, hwaddr offset,
>>                            uint64_t value, unsigned size)
>>  {
>> -    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>> +    PowerPCCPU **cpuptr = opaque;
>> +    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>>      XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>>      const XiveTmOp *xto;
>>  
>> @@ -366,7 +417,8 @@ static void xive_tm_write(void *opaque, hwaddr offset,
>>  
>>  static uint64_t xive_tm_read(void *opaque, hwaddr offset, unsigned size)
>>  {
>> -    PowerPCCPU *cpu = POWERPC_CPU(current_cpu);
>> +    PowerPCCPU **cpuptr = opaque;
>> +    PowerPCCPU *cpu = *cpuptr ? *cpuptr : POWERPC_CPU(current_cpu);
>>      XiveTCTX *tctx = XIVE_TCTX(cpu->intc);
>>      const XiveTmOp *xto;
>>  
>> @@ -501,6 +553,9 @@ static void xive_tctx_base_reset(void *dev)
>>       */
>>      tctx->regs[TM_QW1_OS + TM_PIPR] =
>>          ipb_to_pipr(tctx->regs[TM_QW1_OS + TM_IPB]);
>> +    tctx->regs[TM_QW3_HV_PHYS + TM_PIPR] =
>> +        ipb_to_pipr(tctx->regs[TM_QW3_HV_PHYS + TM_IPB]);
>> +
>>  
>>      /*
>>       * QEMU sPAPR XIVE only. To let the controller model reset the OS
>> @@ -1513,7 +1568,7 @@ static void xive_router_end_notify(XiveRouter *xrtr, uint8_t end_blk,
>>      /* TODO: Auto EOI. */
>>  }
>>  
>> -static void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>> +void xive_router_notify(XiveFabric *xf, uint32_t lisn)
>>  {
>>      XiveRouter *xrtr = XIVE_ROUTER(xf);
>>      XiveEAS eas;
>> diff --git a/hw/ppc/pnv.c b/hw/ppc/pnv.c
>> index 66f2301b4ece..7b0bda652338 100644
>> --- a/hw/ppc/pnv.c
>> +++ b/hw/ppc/pnv.c
>> @@ -279,7 +279,10 @@ static void pnv_dt_chip(PnvChip *chip, void *fdt)
>>          pnv_dt_core(chip, pnv_core, fdt);
>>  
>>          /* Interrupt Control Presenters (ICP). One per core. */
>> -        pnv_dt_icp(chip, fdt, pnv_core->pir, CPU_CORE(pnv_core)->nr_threads);
>> +        if (!pnv_chip_is_power9(chip)) {
>> +            pnv_dt_icp(chip, fdt, pnv_core->pir,
>> +                       CPU_CORE(pnv_core)->nr_threads);
>> +        }
>>      }
>>  
>>      if (chip->ram_size) {
>> @@ -693,7 +696,15 @@ static uint32_t pnv_chip_core_pir_p9(PnvChip *chip, uint32_t core_id)
>>  static Object *pnv_chip_power9_intc_create(PnvChip *chip, Object *child,
>>                                             Error **errp)
>>  {
>> -    return NULL;
>> +    Pnv9Chip *chip9 = PNV9_CHIP(chip);
>> +
>> +    /*
>> +     * The core creates its interrupt presenter but the XIVE interrupt
>> +     * controller object is initialized afterwards. Hopefully, it's
>> +     * only used at runtime.
>> +     */
>> +    return xive_tctx_create(child, TYPE_XIVE_TCTX,
>> +                            XIVE_ROUTER(&chip9->xive), errp);
>>  }
>>  
>>  /* Allowed core identifiers on a POWER8 Processor Chip :
>> @@ -875,11 +886,19 @@ static void pnv_chip_power8nvl_class_init(ObjectClass *klass, void *data)
>>  
>>  static void pnv_chip_power9_instance_init(Object *obj)
>>  {
>> +    Pnv9Chip *chip9 = PNV9_CHIP(obj);
>> +
>> +    object_initialize(&chip9->xive, sizeof(chip9->xive), TYPE_PNV_XIVE);
>> +    object_property_add_child(obj, "xive", OBJECT(&chip9->xive), NULL);
>> +    object_property_add_const_link(OBJECT(&chip9->xive), "chip", obj,
>> +                                   &error_abort);
>>  }
>>  
>>  static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>>  {
>>      PnvChipClass *pcc = PNV_CHIP_GET_CLASS(dev);
>> +    Pnv9Chip *chip9 = PNV9_CHIP(dev);
>> +    PnvChip *chip = PNV_CHIP(dev);
>>      Error *local_err = NULL;
>>  
>>      pcc->parent_realize(dev, &local_err);
>> @@ -887,6 +906,24 @@ static void pnv_chip_power9_realize(DeviceState *dev, Error **errp)
>>          error_propagate(errp, local_err);
>>          return;
>>      }
>> +
>> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_IC_BASE(chip),
>> +                            "ic-bar", &error_fatal);
>> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_VC_BASE(chip),
>> +                            "vc-bar", &error_fatal);
>> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_PC_BASE(chip),
>> +                            "pc-bar", &error_fatal);
>> +    object_property_set_int(OBJECT(&chip9->xive), PNV9_XIVE_TM_BASE(chip),
>> +                            "tm-bar", &error_fatal);
>> +    object_property_set_bool(OBJECT(&chip9->xive), true, "realized",
>> +                             &local_err);
>> +    if (local_err) {
>> +        error_propagate(errp, local_err);
>> +        return;
>> +    }
>> +    qdev_set_parent_bus(DEVICE(&chip9->xive), sysbus_get_default());
>> +    pnv_xscom_add_subregion(chip, PNV9_XSCOM_XIVE_BASE,
>> +                            &chip9->xive.xscom_regs);
>>  }
>>  
>>  static void pnv_chip_power9_class_init(ObjectClass *klass, void *data)
>> @@ -1087,12 +1124,23 @@ static void pnv_pic_print_info(InterruptStatsProvider *obj,
>>      CPU_FOREACH(cs) {
>>          PowerPCCPU *cpu = POWERPC_CPU(cs);
>>  
>> -        icp_pic_print_info(ICP(cpu->intc), mon);
>> +        if (pnv_chip_is_power9(pnv->chips[0])) {
>> +            xive_tctx_pic_print_info(XIVE_TCTX(cpu->intc), mon);
>> +        } else {
>> +            icp_pic_print_info(ICP(cpu->intc), mon);
>> +        }
>>      }
>>  
>>      for (i = 0; i < pnv->num_chips; i++) {
>> -        Pnv8Chip *chip8 = PNV8_CHIP(pnv->chips[i]);
>> -        ics_pic_print_info(&chip8->psi.ics, mon);
>> +        PnvChip *chip = pnv->chips[i];
>> +
>> +        if (pnv_chip_is_power9(pnv->chips[i])) {
>> +            Pnv9Chip *chip9 = PNV9_CHIP(chip);
>> +            pnv_xive_pic_print_info(&chip9->xive, mon);
>> +        } else {
>> +            Pnv8Chip *chip8 = PNV8_CHIP(chip);
>> +            ics_pic_print_info(&chip8->psi.ics, mon);
>> +        }
>>      }
>>  }
>>  
>> diff --git a/hw/intc/Makefile.objs b/hw/intc/Makefile.objs
>> index dd4d69db2bdd..145bfaf44014 100644
>> --- a/hw/intc/Makefile.objs
>> +++ b/hw/intc/Makefile.objs
>> @@ -40,7 +40,7 @@ obj-$(CONFIG_XICS_KVM) += xics_kvm.o
>>  obj-$(CONFIG_XIVE) += xive.o
>>  obj-$(CONFIG_XIVE_SPAPR) += spapr_xive.o spapr_xive_hcall.o
>>  obj-$(CONFIG_XIVE_KVM) += spapr_xive_kvm.o
>> -obj-$(CONFIG_POWERNV) += xics_pnv.o
>> +obj-$(CONFIG_POWERNV) += xics_pnv.o pnv_xive.o
>>  obj-$(CONFIG_ALLWINNER_A10_PIC) += allwinner-a10-pic.o
>>  obj-$(CONFIG_S390_FLIC) += s390_flic.o
>>  obj-$(CONFIG_S390_FLIC_KVM) += s390_flic_kvm.o
> 
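
As a reading aid for the xive_tctx_notify() and xive_tctx_base_reset() hunks
quoted above: the PIPR is derived from the pending-priority bitmap (IPB), and
the thread is notified when PIPR is more favoured (numerically lower) than
CPPR. A minimal sketch, assuming IPB bit 7 stands for priority 0 and bit 0
for priority 7; the sketch_* helpers are illustrative and are not the
functions in xive.c:

```c
#include <assert.h>
#include <stdint.h>

/* Mark priority prio (0 is the most favoured) as pending in the IPB. */
static uint8_t sketch_ipb_set_pending(uint8_t ipb, uint8_t prio)
{
    assert(prio <= 7);
    return (uint8_t)(ipb | (0x80u >> prio));
}

/* Most favoured pending priority, or 0xff when nothing is pending. */
static uint8_t sketch_ipb_to_pipr(uint8_t ipb)
{
    uint8_t prio;

    for (prio = 0; prio < 8; prio++) {
        if (ipb & (0x80u >> prio)) {
            return prio;
        }
    }
    return 0xff;
}

/* Notification condition used in the quoted xive_tctx_notify(). */
static int sketch_should_notify(uint8_t pipr, uint8_t cppr)
{
    return pipr < cppr;
}
```

With an empty IPB the PIPR is 0xff under this assumption, so the comparison
never succeeds after reset, whatever the CPPR value.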

end of thread, other threads:[~2018-12-06 15:25 UTC | newest]

Thread overview: 184+ messages
2018-11-16 10:56 [Qemu-devel] [PATCH v5 00/36] ppc: support for the XIVE interrupt controller (POWER9) Cédric Le Goater
2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 01/36] ppc/xive: introduce a XIVE interrupt source model Cédric Le Goater
2018-11-22  3:05   ` David Gibson
2018-11-22  7:25     ` Cédric Le Goater
2018-11-23  0:31       ` David Gibson
2018-11-23  8:21         ` Cédric Le Goater
2018-11-26  8:14         ` Cédric Le Goater
2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 02/36] ppc/xive: add support for the LSI interrupt sources Cédric Le Goater
2018-11-22  3:19   ` David Gibson
2018-11-22  7:39     ` Cédric Le Goater
2018-11-23  1:08       ` David Gibson
2018-11-23 13:28         ` Cédric Le Goater
2018-11-26  5:39           ` David Gibson
2018-11-26 11:20             ` Cédric Le Goater
2018-11-26 23:48               ` David Gibson
2018-11-27  7:30                 ` Cédric Le Goater
2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 03/36] ppc/xive: introduce the XiveFabric interface Cédric Le Goater
2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 04/36] ppc/xive: introduce the XiveRouter model Cédric Le Goater
2018-11-22  4:11   ` David Gibson
2018-11-22  7:53     ` Cédric Le Goater
2018-11-23  3:50       ` David Gibson
2018-11-23  8:06         ` Cédric Le Goater
2018-11-27  1:54           ` David Gibson
2018-11-27  8:45             ` Cédric Le Goater
2018-11-22  4:44   ` David Gibson
2018-11-22  6:50     ` Benjamin Herrenschmidt
2018-11-22  7:59       ` Cédric Le Goater
2018-11-23  1:17         ` David Gibson
2018-11-23  1:10       ` David Gibson
2018-11-23 10:28         ` Cédric Le Goater
2018-11-26  5:44           ` David Gibson
2018-11-26  9:39             ` Cédric Le Goater
2018-11-27  0:11               ` David Gibson
2018-11-27  7:30                 ` Cédric Le Goater
2018-11-27 22:56                   ` David Gibson
2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 05/36] ppc/xive: introduce the XIVE Event Notification Descriptors Cédric Le Goater
2018-11-22  4:41   ` David Gibson
2018-11-22  6:49     ` Benjamin Herrenschmidt
2018-11-23  3:51       ` David Gibson
2018-11-22 21:47     ` Cédric Le Goater
2018-11-23  4:35       ` David Gibson
2018-11-23 11:01         ` Cédric Le Goater
2018-11-29  4:46           ` David Gibson
2018-11-16 10:56 ` [Qemu-devel] [PATCH v5 06/36] ppc/xive: add support for the END Event State buffers Cédric Le Goater
2018-11-22  5:13   ` David Gibson
2018-11-22 21:58     ` Cédric Le Goater
2018-11-23  4:36       ` David Gibson
2018-11-23  7:28         ` Cédric Le Goater
2018-11-26  5:54           ` David Gibson
2018-11-29 22:06     ` Cédric Le Goater
2018-11-30  1:04       ` David Gibson
2018-11-30  6:41         ` Cédric Le Goater
2018-12-03  1:14           ` David Gibson
2018-12-03 16:19             ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 07/36] ppc/xive: introduce the XIVE interrupt thread context Cédric Le Goater
2018-11-23  5:08   ` David Gibson
2018-11-25 20:35     ` Cédric Le Goater
2018-11-27  5:07       ` David Gibson
2018-11-27 12:47         ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 08/36] ppc/xive: introduce a simplified XIVE presenter Cédric Le Goater
2018-11-27 23:49   ` David Gibson
2018-11-28  2:34     ` Benjamin Herrenschmidt
2018-11-28 10:59     ` Cédric Le Goater
2018-11-29  0:47       ` David Gibson
2018-11-29  3:39         ` Benjamin Herrenschmidt
2018-11-29 17:51           ` Cédric Le Goater
2018-11-30  1:09             ` David Gibson
2018-12-03 17:05         ` Cédric Le Goater
2018-12-04  1:54           ` David Gibson
2018-12-04 17:04             ` Cédric Le Goater
2018-12-05  1:40               ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 09/36] ppc/xive: notify the CPU when the interrupt priority is more privileged Cédric Le Goater
2018-11-28  0:13   ` David Gibson
2018-11-28  2:32     ` Benjamin Herrenschmidt
2018-11-28  2:41       ` David Gibson
2018-11-28  3:00         ` Eric Blake
2018-11-28 11:30     ` Cédric Le Goater
2018-11-29  0:49       ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 10/36] spapr/xive: introduce a XIVE interrupt controller Cédric Le Goater
2018-11-28  0:52   ` David Gibson
2018-11-28 16:27     ` Cédric Le Goater
2018-11-29  0:54       ` David Gibson
2018-11-29 14:37         ` Cédric Le Goater
2018-11-29 22:36           ` David Gibson
2018-12-04 17:12       ` Cédric Le Goater
2018-12-05  1:41         ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 11/36] spapr/xive: use the VCPU id as a NVT identifier Cédric Le Goater
2018-11-28  2:39   ` David Gibson
2018-11-28 16:48     ` Cédric Le Goater
2018-11-29  1:00       ` David Gibson
2018-11-29 15:27         ` Cédric Le Goater
2018-11-30  1:11           ` David Gibson
2018-11-30  6:56             ` Cédric Le Goater
2018-12-03  1:18               ` David Gibson
2018-12-03 16:30                 ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 12/36] spapr: initialize VSMT before initializing the IRQ backend Cédric Le Goater
2018-11-28  2:57   ` David Gibson
2018-11-28  9:35     ` [Qemu-devel] [Qemu-ppc] " Greg Kurz
2018-11-28 16:50       ` Cédric Le Goater
2018-11-28 16:59         ` Greg Kurz
2018-11-29  1:02       ` David Gibson
2018-11-29  6:56         ` Greg Kurz
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 13/36] spapr: introduce a spapr_irq_init() routine Cédric Le Goater
2018-11-28  2:59   ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 14/36] spapr: modify the irq backend 'init' method Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 15/36] spapr: introdude a new machine IRQ backend for XIVE Cédric Le Goater
2018-11-28  3:28   ` David Gibson
2018-11-28 17:16     ` Cédric Le Goater
2018-11-29  1:07       ` David Gibson
2018-11-29 15:34         ` Cédric Le Goater
2018-11-29 22:39           ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 16/36] spapr: add hcalls support for the XIVE exploitation interrupt mode Cédric Le Goater
2018-11-28  4:25   ` David Gibson
2018-11-28 22:21     ` Cédric Le Goater
2018-11-29  1:23       ` David Gibson
2018-11-29 16:04         ` Cédric Le Goater
2018-11-30  1:23           ` David Gibson
2018-11-30  8:07             ` Cédric Le Goater
2018-12-03  1:36               ` David Gibson
2018-12-03 16:49                 ` Cédric Le Goater
2018-12-04  1:56                   ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 17/36] spapr: add device tree support for the XIVE exploitation mode Cédric Le Goater
2018-11-28  4:31   ` David Gibson
2018-11-28 22:26     ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 18/36] spapr: allocate the interrupt thread context under the CPU core Cédric Le Goater
2018-11-28  4:39   ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 19/36] spapr: add a 'pseries-3.1-xive' machine type Cédric Le Goater
2018-11-28  4:42   ` David Gibson
2018-11-28 22:37     ` Cédric Le Goater
2018-12-04 15:14       ` Cédric Le Goater
2018-12-05  1:44         ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 20/36] spapr: add classes for the XIVE models Cédric Le Goater
2018-11-28  5:13   ` David Gibson
2018-11-28 22:38     ` Cédric Le Goater
2018-11-29  2:59       ` David Gibson
2018-11-29 16:06         ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 21/36] spapr: extend the sPAPR IRQ backend for XICS migration Cédric Le Goater
2018-11-28  5:54   ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 22/36] spapr/xive: add models for KVM support Cédric Le Goater
2018-11-28  5:52   ` David Gibson
2018-11-28 22:45     ` Cédric Le Goater
2018-11-29  3:33       ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 23/36] spapr/xive: add migration support for KVM Cédric Le Goater
2018-11-29  3:43   ` David Gibson
2018-11-29 16:19     ` Cédric Le Goater
2018-11-30  1:24       ` David Gibson
2018-11-30  7:04         ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 24/36] spapr: add a 'reset' method to the sPAPR IRQ backend Cédric Le Goater
2018-11-29  3:47   ` David Gibson
2018-11-29 16:21     ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 25/36] spapr: set the interrupt presenter at reset Cédric Le Goater
2018-11-29  4:03   ` David Gibson
2018-11-29 16:28     ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 26/36] spapr: add a 'pseries-3.1-dual' machine type Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 27/36] sysbus: add a sysbus_mmio_unmap() helper Cédric Le Goater
2018-11-29  4:09   ` David Gibson
2018-11-29 16:36     ` Cédric Le Goater
2018-12-03 15:52       ` Cédric Le Goater
2018-12-04  1:59         ` David Gibson
2018-12-03 17:48     ` Peter Maydell
2018-12-04 12:33       ` Cédric Le Goater
2018-12-04 13:04         ` Peter Maydell
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 28/36] ppc/xics: introduce a icp_kvm_init() routine Cédric Le Goater
2018-11-29  4:08   ` David Gibson
2018-11-29 16:36     ` Cédric Le Goater
2018-11-29 22:43       ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 29/36] ppc/xics: remove abort() in icp_kvm_init() Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 30/36] spapr: check for KVM IRQ device activation Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 31/36] spapr/xive: export the spapr_xive_kvm_init() routine Cédric Le Goater
2018-11-29  4:11   ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 32/36] spapr/rtas: modify spapr_rtas_register() to remove RTAS handlers Cédric Le Goater
2018-11-29  4:12   ` David Gibson
2018-11-29 16:40     ` Cédric Le Goater
2018-11-29 22:44       ` David Gibson
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 33/36] spapr: introduce routines to delete the KVM IRQ device Cédric Le Goater
2018-11-29  4:17   ` David Gibson
2018-11-29 16:41     ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 34/36] spapr: add KVM support to the 'dual' machine Cédric Le Goater
2018-11-29  4:22   ` David Gibson
2018-11-29 17:07     ` Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 35/36] ppc: externalize ppc_get_vcpu_by_pir() Cédric Le Goater
2018-11-16 10:57 ` [Qemu-devel] [PATCH v5 36/36] ppc/pnv: add XIVE support Cédric Le Goater
2018-12-03  2:26   ` David Gibson
2018-12-06 15:14     ` Cédric Le Goater
