LinuxPPC-Dev Archive on lore.kernel.org
 help / Atom feed
* [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
@ 2019-01-07 18:43 Cédric Le Goater
  2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
                   ` (17 more replies)
  0 siblings, 18 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

Hello,

On the POWER9 processor, the XIVE interrupt controller can control
interrupt sources using MMIO to trigger events, to EOI or to turn off
the sources. Priority management and interrupt acknowledgment is also
controlled by MMIO in the CPU presenter subengine.

PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
special support from the hypervisor to do the same. This is called the
XIVE native exploitation mode and today, it can be activated under the
PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
and still offers the old interrupt mode interface using a
XICS-over-XIVE glue which implements the XICS hcalls.

The following series is proposal to add the same support under KVM.

A new KVM device is introduced for the XIVE native exploitation
mode. It reuses most of the XICS-over-XIVE glue implementation
structures which are internal to KVM but has a completely different
interface. A set of Hypervisor calls configures the sources and the
event queues and from there, all control is done by the guest through
MMIOs.

These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
similarly to VFIO, and the associated VMAs are populated dynamically
with the appropriate pages using a fault handler. This is implemented
with a couple of KVM device ioctls.

On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
negotiation process determines whether the guest operates with a
interrupt controller using the XICS legacy model, as found on POWER8,
or in XIVE exploitation mode. Which means that the KVM interrupt
device should be created at runtime, after the machine as started.
This requires extra KVM support to create/destroy KVM devices. The
last patches are an attempt to solve that problem.

Migration has its own specific needs. The patchset provides the
necessary routines to quiesce XIVE, to capture and restore the state
of the different structures used by KVM, OPAL and HW. Extra OPAL
support is required for these.

GitHub trees available here :
 
QEMU sPAPR:

  https://github.com/legoater/qemu/commits/xive-next
  
Linux/KVM:

  https://github.com/legoater/linux/commits/xive-5.0

OPAL:

  https://github.com/legoater/skiboot/commits/xive

Best wishes for 2019 !

C.


Cédric Le Goater (19):
  powerpc/xive: export flags for the XIVE native exploitation mode
    hcalls
  powerpc/xive: add OPAL extensions for the XIVE native exploitation
    support
  KVM: PPC: Book3S HV: check the IRQ controller type
  KVM: PPC: Book3S HV: export services for the XIVE native exploitation
    device
  KVM: PPC: Book3S HV: add a new KVM device for the XIVE native
    exploitation mode
  KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native
    device
  KVM: PPC: Book3S HV: add a GET_TIMA_FD control to XIVE native device
  KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native
    device
  KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state
  KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode
    hcalls
  KVM: PPC: Book3S HV: record guest queue page address
  KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
  KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
  KVM: PPC: Book3S HV: add get/set accessors for the source
    configuration
  KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
  KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  KVM: PPC: Book3S HV: add passthrough support
  KVM: introduce a KVM_DELETE_DEVICE ioctl

 arch/powerpc/include/asm/kvm_host.h           |    2 +
 arch/powerpc/include/asm/kvm_ppc.h            |   69 +
 arch/powerpc/include/asm/opal-api.h           |   11 +-
 arch/powerpc/include/asm/opal.h               |    7 +
 arch/powerpc/include/asm/xive.h               |   40 +
 arch/powerpc/include/uapi/asm/kvm.h           |   47 +
 arch/powerpc/kvm/book3s_xive.h                |   82 +
 include/linux/kvm_host.h                      |    2 +
 include/uapi/linux/kvm.h                      |    5 +
 arch/powerpc/kvm/book3s.c                     |   31 +-
 arch/powerpc/kvm/book3s_hv.c                  |   29 +
 arch/powerpc/kvm/book3s_hv_builtin.c          |  196 +++
 arch/powerpc/kvm/book3s_hv_rm_xive_native.c   |   47 +
 arch/powerpc/kvm/book3s_xive.c                |  149 +-
 arch/powerpc/kvm/book3s_xive_native.c         | 1406 +++++++++++++++++
 .../powerpc/kvm/book3s_xive_native_template.c |  398 +++++
 arch/powerpc/kvm/powerpc.c                    |   30 +
 arch/powerpc/sysdev/xive/native.c             |  110 ++
 arch/powerpc/sysdev/xive/spapr.c              |   28 +-
 virt/kvm/kvm_main.c                           |   39 +
 arch/powerpc/kvm/Makefile                     |    4 +-
 arch/powerpc/kvm/book3s_hv_rmhandlers.S       |   52 +
 .../powerpc/platforms/powernv/opal-wrappers.S |    3 +
 23 files changed, 2722 insertions(+), 65 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive_native.c
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
 create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c

-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-09  3:33   ` David Gibson
  2019-01-09 13:08   ` Michael Ellerman
  2019-01-07 18:43 ` [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support Cédric Le Goater
                   ` (16 subsequent siblings)
  17 siblings, 2 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

These flags are shared between Linux/KVM implementing the hypervisor
calls for the XIVE native exploitation mode and the driver for the
sPAPR guests.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/xive.h  | 23 +++++++++++++++++++++++
 arch/powerpc/sysdev/xive/spapr.c | 28 ++++++++--------------------
 2 files changed, 31 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 3c704f5dd3ae..32f033bfbf42 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -93,6 +93,29 @@ extern void xive_flush_interrupt(void);
 /* xmon hook */
 extern void xmon_xive_do_dump(int cpu);
 
+/*
+ * Hcall flags shared by the sPAPR backend and KVM
+ */
+
+/* H_INT_GET_SOURCE_INFO */
+#define XIVE_SPAPR_SRC_H_INT_ESB	PPC_BIT(60)
+#define XIVE_SPAPR_SRC_LSI		PPC_BIT(61)
+#define XIVE_SPAPR_SRC_TRIGGER		PPC_BIT(62)
+#define XIVE_SPAPR_SRC_STORE_EOI	PPC_BIT(63)
+
+/* H_INT_SET_SOURCE_CONFIG */
+#define XIVE_SPAPR_SRC_SET_EISN		PPC_BIT(62)
+#define XIVE_SPAPR_SRC_MASK		PPC_BIT(63) /* unused */
+
+/* H_INT_SET_QUEUE_CONFIG */
+#define XIVE_SPAPR_EQ_ALWAYS_NOTIFY	PPC_BIT(63)
+
+/* H_INT_SET_QUEUE_CONFIG */
+#define XIVE_SPAPR_EQ_DEBUG		PPC_BIT(63)
+
+/* H_INT_ESB */
+#define XIVE_SPAPR_ESB_STORE		PPC_BIT(63)
+
 /* APIs used by KVM */
 extern u32 xive_native_default_eq_shift(void);
 extern u32 xive_native_alloc_vp_block(u32 max_vcpus);
diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
index 575db3b06a6b..730284f838c8 100644
--- a/arch/powerpc/sysdev/xive/spapr.c
+++ b/arch/powerpc/sysdev/xive/spapr.c
@@ -184,9 +184,6 @@ static long plpar_int_get_source_info(unsigned long flags,
 	return 0;
 }
 
-#define XIVE_SRC_SET_EISN (1ull << (63 - 62))
-#define XIVE_SRC_MASK     (1ull << (63 - 63)) /* unused */
-
 static long plpar_int_set_source_config(unsigned long flags,
 					unsigned long lisn,
 					unsigned long target,
@@ -243,8 +240,6 @@ static long plpar_int_get_queue_info(unsigned long flags,
 	return 0;
 }
 
-#define XIVE_EQ_ALWAYS_NOTIFY (1ull << (63 - 63))
-
 static long plpar_int_set_queue_config(unsigned long flags,
 				       unsigned long target,
 				       unsigned long priority,
@@ -286,8 +281,6 @@ static long plpar_int_sync(unsigned long flags, unsigned long lisn)
 	return 0;
 }
 
-#define XIVE_ESB_FLAG_STORE (1ull << (63 - 63))
-
 static long plpar_int_esb(unsigned long flags,
 			  unsigned long lisn,
 			  unsigned long offset,
@@ -321,7 +314,7 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write)
 	unsigned long read_data;
 	long rc;
 
-	rc = plpar_int_esb(write ? XIVE_ESB_FLAG_STORE : 0,
+	rc = plpar_int_esb(write ? XIVE_SPAPR_ESB_STORE : 0,
 			   lisn, offset, data, &read_data);
 	if (rc)
 		return -1;
@@ -329,11 +322,6 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write)
 	return write ? 0 : read_data;
 }
 
-#define XIVE_SRC_H_INT_ESB     (1ull << (63 - 60))
-#define XIVE_SRC_LSI           (1ull << (63 - 61))
-#define XIVE_SRC_TRIGGER       (1ull << (63 - 62))
-#define XIVE_SRC_STORE_EOI     (1ull << (63 - 63))
-
 static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
 {
 	long rc;
@@ -349,11 +337,11 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
 	if (rc)
 		return  -EINVAL;
 
-	if (flags & XIVE_SRC_H_INT_ESB)
+	if (flags & XIVE_SPAPR_SRC_H_INT_ESB)
 		data->flags  |= XIVE_IRQ_FLAG_H_INT_ESB;
-	if (flags & XIVE_SRC_STORE_EOI)
+	if (flags & XIVE_SPAPR_SRC_STORE_EOI)
 		data->flags  |= XIVE_IRQ_FLAG_STORE_EOI;
-	if (flags & XIVE_SRC_LSI)
+	if (flags & XIVE_SPAPR_SRC_LSI)
 		data->flags  |= XIVE_IRQ_FLAG_LSI;
 	data->eoi_page  = eoi_page;
 	data->esb_shift = esb_shift;
@@ -374,7 +362,7 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
 	data->hw_irq = hw_irq;
 
 	/* Full function page supports trigger */
-	if (flags & XIVE_SRC_TRIGGER) {
+	if (flags & XIVE_SPAPR_SRC_TRIGGER) {
 		data->trig_mmio = data->eoi_mmio;
 		return 0;
 	}
@@ -391,8 +379,8 @@ static int xive_spapr_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq)
 {
 	long rc;
 
-	rc = plpar_int_set_source_config(XIVE_SRC_SET_EISN, hw_irq, target,
-					 prio, sw_irq);
+	rc = plpar_int_set_source_config(XIVE_SPAPR_SRC_SET_EISN, hw_irq,
+					 target, prio, sw_irq);
 
 	return rc == 0 ? 0 : -ENXIO;
 }
@@ -432,7 +420,7 @@ static int xive_spapr_configure_queue(u32 target, struct xive_q *q, u8 prio,
 	q->eoi_phys = esn_page;
 
 	/* Default is to always notify */
-	flags = XIVE_EQ_ALWAYS_NOTIFY;
+	flags = XIVE_SPAPR_EQ_ALWAYS_NOTIFY;
 
 	/* Configure and enable the queue in HW */
 	rc = plpar_int_set_queue_config(flags, target, prio, qpage_phys, order);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
  2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-09  4:26   ` David Gibson
  2019-01-07 18:43 ` [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type Cédric Le Goater
                   ` (15 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

The support for XIVE native exploitation mode in Linux/KVM needs a
couple more OPAL calls to configure the sPAPR guest and to get/set the
state of the XIVE internal structures.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/opal-api.h           | 11 ++-
 arch/powerpc/include/asm/opal.h               |  7 ++
 arch/powerpc/include/asm/xive.h               | 14 +++
 arch/powerpc/sysdev/xive/native.c             | 99 +++++++++++++++++++
 .../powerpc/platforms/powernv/opal-wrappers.S |  3 +
 5 files changed, 130 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
index 870fb7b239ea..cdfc54f78101 100644
--- a/arch/powerpc/include/asm/opal-api.h
+++ b/arch/powerpc/include/asm/opal-api.h
@@ -186,8 +186,8 @@
 #define OPAL_XIVE_FREE_IRQ			140
 #define OPAL_XIVE_SYNC				141
 #define OPAL_XIVE_DUMP				142
-#define OPAL_XIVE_RESERVED3			143
-#define OPAL_XIVE_RESERVED4			144
+#define OPAL_XIVE_GET_QUEUE_STATE		143
+#define OPAL_XIVE_SET_QUEUE_STATE		144
 #define OPAL_SIGNAL_SYSTEM_RESET		145
 #define OPAL_NPU_INIT_CONTEXT			146
 #define OPAL_NPU_DESTROY_CONTEXT		147
@@ -209,8 +209,11 @@
 #define OPAL_SENSOR_GROUP_ENABLE		163
 #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
 #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
-#define	OPAL_NX_COPROC_INIT			167
-#define OPAL_LAST				167
+#define OPAL_HANDLE_HMI2			166
+#define OPAL_NX_COPROC_INIT			167
+#define OPAL_NPU_SET_RELAXED_ORDER		168
+#define OPAL_NPU_GET_RELAXED_ORDER		169
+#define OPAL_XIVE_GET_VP_STATE			170
 
 #define QUIESCE_HOLD			1 /* Spin all calls at entry */
 #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
index a55b01c90bb1..4e978d4dea5c 100644
--- a/arch/powerpc/include/asm/opal.h
+++ b/arch/powerpc/include/asm/opal.h
@@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id);
 int64_t opal_xive_free_irq(uint32_t girq);
 int64_t opal_xive_sync(uint32_t type, uint32_t id);
 int64_t opal_xive_dump(uint32_t type, uint32_t id);
+int64_t opal_xive_get_queue_state(uint64_t vp, uint32_t prio,
+				  __be32 *out_qtoggle,
+				  __be32 *out_qindex);
+int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio,
+				  uint32_t qtoggle,
+				  uint32_t qindex);
+int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01);
 int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target,
 			uint64_t desc, uint16_t pe_number);
 
diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 32f033bfbf42..d6be3e4d9fa4 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -132,12 +132,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
 extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio);
 
 extern void xive_native_sync_source(u32 hw_irq);
+extern void xive_native_sync_queue(u32 hw_irq);
 extern bool is_xive_irq(struct irq_chip *chip);
 extern int xive_native_enable_vp(u32 vp_id, bool single_escalation);
 extern int xive_native_disable_vp(u32 vp_id);
 extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id);
 extern bool xive_native_has_single_escalation(void);
 
+extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio,
+				      u64 *out_qpage,
+				      u64 *out_qsize,
+				      u64 *out_qeoi_page,
+				      u32 *out_escalate_irq,
+				      u64 *out_qflags);
+
+extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
+				       u32 *qindex);
+extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
+				       u32 qindex);
+extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
+
 #else
 
 static inline bool xive_enabled(void) { return false; }
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 1ca127d052a6..0c037e933e55 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -437,6 +437,12 @@ void xive_native_sync_source(u32 hw_irq)
 }
 EXPORT_SYMBOL_GPL(xive_native_sync_source);
 
+void xive_native_sync_queue(u32 hw_irq)
+{
+	opal_xive_sync(XIVE_SYNC_QUEUE, hw_irq);
+}
+EXPORT_SYMBOL_GPL(xive_native_sync_queue);
+
 static const struct xive_ops xive_native_ops = {
 	.populate_irq_data	= xive_native_populate_irq_data,
 	.configure_irq		= xive_native_configure_irq,
@@ -711,3 +717,96 @@ bool xive_native_has_single_escalation(void)
 	return xive_has_single_esc;
 }
 EXPORT_SYMBOL_GPL(xive_native_has_single_escalation);
+
+int xive_native_get_queue_info(u32 vp_id, u32 prio,
+			       u64 *out_qpage,
+			       u64 *out_qsize,
+			       u64 *out_qeoi_page,
+			       u32 *out_escalate_irq,
+			       u64 *out_qflags)
+{
+	__be64 qpage;
+	__be64 qsize;
+	__be64 qeoi_page;
+	__be32 escalate_irq;
+	__be64 qflags;
+	s64 rc;
+
+	rc = opal_xive_get_queue_info(vp_id, prio, &qpage, &qsize,
+				      &qeoi_page, &escalate_irq, &qflags);
+	if (rc) {
+		pr_err("OPAL failed to get queue info for VCPU %d/%d : %lld\n",
+		       vp_id, prio, rc);
+		return -EIO;
+	}
+
+	if (out_qpage)
+		*out_qpage = be64_to_cpu(qpage);
+	if (out_qsize)
+		*out_qsize = be32_to_cpu(qsize);
+	if (out_qeoi_page)
+		*out_qeoi_page = be64_to_cpu(qeoi_page);
+	if (out_escalate_irq)
+		*out_escalate_irq = be32_to_cpu(escalate_irq);
+	if (out_qflags)
+		*out_qflags = be64_to_cpu(qflags);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xive_native_get_queue_info);
+
+int xive_native_get_queue_state(u32 vp_id, u32 prio, u32 *qtoggle, u32 *qindex)
+{
+	__be32 opal_qtoggle;
+	__be32 opal_qindex;
+	s64 rc;
+
+	rc = opal_xive_get_queue_state(vp_id, prio, &opal_qtoggle,
+				       &opal_qindex);
+	if (rc) {
+		pr_err("OPAL failed to get queue state for VCPU %d/%d : %lld\n",
+		       vp_id, prio, rc);
+		return -EIO;
+	}
+
+	if (qtoggle)
+		*qtoggle = be32_to_cpu(opal_qtoggle);
+	if (qindex)
+		*qindex = be32_to_cpu(opal_qindex);
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xive_native_get_queue_state);
+
+int xive_native_set_queue_state(u32 vp_id, u32 prio, u32 qtoggle, u32 qindex)
+{
+	s64 rc;
+
+	rc = opal_xive_set_queue_state(vp_id, prio, qtoggle, qindex);
+	if (rc) {
+		pr_err("OPAL failed to set queue state for VCPU %d/%d : %lld\n",
+		       vp_id, prio, rc);
+		return -EIO;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xive_native_set_queue_state);
+
+int xive_native_get_vp_state(u32 vp_id, u64 *out_state)
+{
+	__be64 state;
+	s64 rc;
+
+	rc = opal_xive_get_vp_state(vp_id, &state);
+	if (rc) {
+		pr_err("OPAL failed to get vp state for VCPU %d : %lld\n",
+		       vp_id, rc);
+		return -EIO;
+	}
+
+	if (out_state)
+		*out_state = be64_to_cpu(state);
+	return 0;
+}
+EXPORT_SYMBOL_GPL(xive_native_get_vp_state);
diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
index f4875fe3f8ff..3179953d6b56 100644
--- a/arch/powerpc/platforms/powernv/opal-wrappers.S
+++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
@@ -309,6 +309,9 @@ OPAL_CALL(opal_xive_get_vp_info,		OPAL_XIVE_GET_VP_INFO);
 OPAL_CALL(opal_xive_set_vp_info,		OPAL_XIVE_SET_VP_INFO);
 OPAL_CALL(opal_xive_sync,			OPAL_XIVE_SYNC);
 OPAL_CALL(opal_xive_dump,			OPAL_XIVE_DUMP);
+OPAL_CALL(opal_xive_get_queue_state,		OPAL_XIVE_GET_QUEUE_STATE);
+OPAL_CALL(opal_xive_set_queue_state,		OPAL_XIVE_SET_QUEUE_STATE);
+OPAL_CALL(opal_xive_get_vp_state,		OPAL_XIVE_GET_VP_STATE);
 OPAL_CALL(opal_signal_system_reset,		OPAL_SIGNAL_SYSTEM_RESET);
 OPAL_CALL(opal_npu_init_context,		OPAL_NPU_INIT_CONTEXT);
 OPAL_CALL(opal_npu_destroy_context,		OPAL_NPU_DESTROY_CONTEXT);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
  2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
  2019-01-07 18:43 ` [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-09  4:27   ` David Gibson
  2019-01-22  4:56   ` Paul Mackerras
  2019-01-07 18:43 ` [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device Cédric Le Goater
                   ` (14 subsequent siblings)
  17 siblings, 2 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

We will have different KVM devices for interrupts, one for the
XICS-over-XIVE mode and one for the XIVE native exploitation
mode. Let's add some checks to make sure we are not mixing the
interfaces in KVM.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index f78d002f0fe0..8a4fa45f07f8 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
 
+	if (!kvmppc_xics_enabled(vcpu))
+		return -EPERM;
+
 	if (!xc)
 		return 0;
 
@@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
 	u8 cppr, mfrr;
 	u32 xisr;
 
+	if (!kvmppc_xics_enabled(vcpu))
+		return -EPERM;
+
 	if (!xc || !xive)
 		return -ENOENT;
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (2 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-11  4:09   ` David Gibson
  2019-01-07 18:43 ` [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
                   ` (13 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

The KVM device for the XIVE native exploitation mode will reuse the
structures of the XICS-over-XIVE glue implementation. Some code will
also be shared : source block creation and destruction, target
selection and escalation attachment.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.h | 11 +++++
 arch/powerpc/kvm/book3s_xive.c | 89 +++++++++++++++++++---------------
 2 files changed, 62 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index a08ae6fd4c51..10c4aa5cd010 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -248,5 +248,16 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server,
 extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr);
 extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
 
+/*
+ * Common Xive routines for XICS-over-XIVE and XIVE native
+ */
+struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
+	struct kvmppc_xive *xive, int irq);
+void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
+void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
+int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio);
+int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
+
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 8a4fa45f07f8..bb5d32f7e4e6 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -166,7 +166,7 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
 	return IRQ_HANDLED;
 }
 
-static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
+int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
 	struct xive_q *q = &xc->queues[prio];
@@ -291,7 +291,7 @@ static int xive_check_provisioning(struct kvm *kvm, u8 prio)
 			continue;
 		rc = xive_provision_queue(vcpu, prio);
 		if (rc == 0 && !xive->single_escalation)
-			xive_attach_escalation(vcpu, prio);
+			kvmppc_xive_attach_escalation(vcpu, prio);
 		if (rc)
 			return rc;
 	}
@@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 prio)
 	return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY;
 }
 
-static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
+int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 {
 	struct kvm_vcpu *vcpu;
 	int i, rc;
@@ -535,7 +535,7 @@ static int xive_target_interrupt(struct kvm *kvm,
 	 * priority. The count for that new target will have
 	 * already been incremented.
 	 */
-	rc = xive_select_target(kvm, &server, prio);
+	rc = kvmppc_xive_select_target(kvm, &server, prio);
 
 	/*
 	 * We failed to find a target ? Not much we can do
@@ -1055,7 +1055,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
 }
 EXPORT_SYMBOL_GPL(kvmppc_xive_clr_mapped);
 
-static void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
+void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
 	struct kvm *kvm = vcpu->kvm;
@@ -1225,7 +1225,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		if (xive->qmap & (1 << i)) {
 			r = xive_provision_queue(vcpu, i);
 			if (r == 0 && !xive->single_escalation)
-				xive_attach_escalation(vcpu, i);
+				kvmppc_xive_attach_escalation(vcpu, i);
 			if (r)
 				goto bail;
 		} else {
@@ -1240,7 +1240,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	}
 
 	/* If not done above, attach priority 0 escalation */
-	r = xive_attach_escalation(vcpu, 0);
+	r = kvmppc_xive_attach_escalation(vcpu, 0);
 	if (r)
 		goto bail;
 
@@ -1491,8 +1491,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr)
 	return 0;
 }
 
-static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive,
-							   int irq)
+struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
+	struct kvmppc_xive *xive, int irq)
 {
 	struct kvm *kvm = xive->kvm;
 	struct kvmppc_xive_src_block *sb;
@@ -1571,7 +1571,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr)
 	sb = kvmppc_xive_find_source(xive, irq, &idx);
 	if (!sb) {
 		pr_devel("No source, creating source block...\n");
-		sb = xive_create_src_block(xive, irq);
+		sb = kvmppc_xive_create_src_block(xive, irq);
 		if (!sb) {
 			pr_devel("Failed to create block...\n");
 			return -ENOMEM;
@@ -1795,7 +1795,7 @@ static void kvmppc_xive_cleanup_irq(u32 hw_num, struct xive_irq_data *xd)
 	xive_cleanup_irq_data(xd);
 }
 
-static void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
+void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
 {
 	int i;
 
@@ -1824,6 +1824,8 @@ static void kvmppc_xive_free(struct kvm_device *dev)
 
 	debugfs_remove(xive->dentry);
 
+	pr_devel("Destroying xive for partition\n");
+
 	if (kvm)
 		kvm->arch.xive = NULL;
 
@@ -1889,6 +1891,43 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 	return 0;
 }
 
+int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	unsigned int i;
+
+	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
+		struct xive_q *q = &xc->queues[i];
+		u32 i0, i1, idx;
+
+		if (!q->qpage && !xc->esc_virq[i])
+			continue;
+
+		seq_printf(m, " [q%d]: ", i);
+
+		if (q->qpage) {
+			idx = q->idx;
+			i0 = be32_to_cpup(q->qpage + idx);
+			idx = (idx + 1) & q->msk;
+			i1 = be32_to_cpup(q->qpage + idx);
+			seq_printf(m, "T=%d %08x %08x...\n", q->toggle,
+				   i0, i1);
+		}
+		if (xc->esc_virq[i]) {
+			struct irq_data *d = irq_get_irq_data(xc->esc_virq[i]);
+			struct xive_irq_data *xd =
+				irq_data_get_irq_handler_data(d);
+			u64 pq = xive_vm_esb_load(xd, XIVE_ESB_GET);
+
+			seq_printf(m, "E:%c%c I(%d:%llx:%llx)",
+				   (pq & XIVE_ESB_VAL_P) ? 'P' : 'p',
+				   (pq & XIVE_ESB_VAL_Q) ? 'Q' : 'q',
+				   xc->esc_virq[i], pq, xd->eoi_page);
+			seq_puts(m, "\n");
+		}
+	}
+	return 0;
+}
 
 static int xive_debug_show(struct seq_file *m, void *private)
 {
@@ -1914,7 +1953,6 @@ static int xive_debug_show(struct seq_file *m, void *private)
 
 	kvm_for_each_vcpu(i, vcpu, kvm) {
 		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-		unsigned int i;
 
 		if (!xc)
 			continue;
@@ -1924,33 +1962,8 @@ static int xive_debug_show(struct seq_file *m, void *private)
 			   xc->server_num, xc->cppr, xc->hw_cppr,
 			   xc->mfrr, xc->pending,
 			   xc->stat_rm_h_xirr, xc->stat_vm_h_xirr);
-		for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
-			struct xive_q *q = &xc->queues[i];
-			u32 i0, i1, idx;
 
-			if (!q->qpage && !xc->esc_virq[i])
-				continue;
-
-			seq_printf(m, " [q%d]: ", i);
-
-			if (q->qpage) {
-				idx = q->idx;
-				i0 = be32_to_cpup(q->qpage + idx);
-				idx = (idx + 1) & q->msk;
-				i1 = be32_to_cpup(q->qpage + idx);
-				seq_printf(m, "T=%d %08x %08x... \n", q->toggle, i0, i1);
-			}
-			if (xc->esc_virq[i]) {
-				struct irq_data *d = irq_get_irq_data(xc->esc_virq[i]);
-				struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
-				u64 pq = xive_vm_esb_load(xd, XIVE_ESB_GET);
-				seq_printf(m, "E:%c%c I(%d:%llx:%llx)",
-					   (pq & XIVE_ESB_VAL_P) ? 'P' : 'p',
-					   (pq & XIVE_ESB_VAL_Q) ? 'Q' : 'q',
-					   xc->esc_virq[i], pq, xd->eoi_page);
-				seq_printf(m, "\n");
-			}
-		}
+		kvmppc_xive_debug_show_queues(m, vcpu);
 
 		t_rm_h_xirr += xc->stat_rm_h_xirr;
 		t_rm_h_ipoll += xc->stat_rm_h_ipoll;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (3 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-22  5:05   ` Paul Mackerras
  2019-02-04  4:25   ` David Gibson
  2019-01-07 18:43 ` [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device Cédric Le Goater
                   ` (12 subsequent siblings)
  17 siblings, 2 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

This is the basic framework for the new KVM device supporting the XIVE
native exploitation mode. The user interface exposes a new capability
and a new KVM device to be used by QEMU.

Internally, the interface to the new KVM device is protected with a
new interrupt mode: KVMPPC_IRQ_XIVE.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/kvm_host.h   |   2 +
 arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
 arch/powerpc/kvm/book3s_xive.h        |   3 +
 include/uapi/linux/kvm.h              |   3 +
 arch/powerpc/kvm/book3s.c             |   7 +-
 arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
 arch/powerpc/kvm/powerpc.c            |  30 +++
 arch/powerpc/kvm/Makefile             |   2 +-
 8 files changed, 398 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_xive_native.c

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 0f98f00da2ea..c522e8274ad9 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
 struct kvmppc_xive;
 struct kvmppc_xive_vcpu;
 extern struct kvm_device_ops kvm_xive_ops;
+extern struct kvm_device_ops kvm_xive_native_ops;
 
 struct kvmppc_passthru_irqmap;
 
@@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
 #define KVMPPC_IRQ_DEFAULT	0
 #define KVMPPC_IRQ_MPIC		1
 #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
+#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
 
 #define MMIO_HPTE_CACHE_SIZE	4
 
diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index eb0d79f0ca45..1bb313f238fe 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
 extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 			       int level, bool line_status);
 extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
+
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+{
+	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
+}
+
+extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+				    struct kvm_vcpu *vcpu, u32 cpu);
+extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
+extern void kvmppc_xive_native_init_module(void);
+extern void kvmppc_xive_native_exit_module(void);
+
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
 				       u32 priority) { return -1; }
@@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
 static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
 				      int level, bool line_status) { return -ENODEV; }
 static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
+
+static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
+	{ return 0; }
+static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
+static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
+static inline void kvmppc_xive_native_init_module(void) { }
+static inline void kvmppc_xive_native_exit_module(void) { }
+
 #endif /* CONFIG_KVM_XIVE */
 
 /*
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 10c4aa5cd010..5f22415520b4 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -12,6 +12,9 @@
 #ifdef CONFIG_KVM_XICS
 #include "book3s_xics.h"
 
+#define KVMPPC_XIVE_FIRST_IRQ	0
+#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
+
 /*
  * State for one guest irq source.
  *
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 6d4ea4b6c922..52bf74a1616e 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
 #define KVM_CAP_ARM_VM_IPA_SIZE 165
 #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
 #define KVM_CAP_HYPERV_CPUID 167
+#define KVM_CAP_PPC_IRQ_XIVE 168
 
 #ifdef KVM_CAP_IRQ_ROUTING
 
@@ -1211,6 +1212,8 @@ enum kvm_device_type {
 #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
 	KVM_DEV_TYPE_ARM_VGIC_ITS,
 #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
+	KVM_DEV_TYPE_XIVE,
+#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
 	KVM_DEV_TYPE_MAX,
 };
 
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index bd1a677dd9e4..de7eed191107 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
 #ifdef CONFIG_KVM_XIVE
 	if (xive_enabled()) {
 		kvmppc_xive_init_module();
+		kvmppc_xive_native_init_module();
 		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
+		kvm_register_device_ops(&kvm_xive_native_ops,
+					KVM_DEV_TYPE_XIVE);
 	} else
 #endif
 		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
@@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
 static void kvmppc_book3s_exit(void)
 {
 #ifdef CONFIG_KVM_XICS
-	if (xive_enabled())
+	if (xive_enabled()) {
 		kvmppc_xive_exit_module();
+		kvmppc_xive_native_exit_module();
+	}
 #endif
 #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
 	kvmppc_book3s_exit_pr();
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
new file mode 100644
index 000000000000..115143e76c45
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -0,0 +1,332 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2017-2019, IBM Corporation.
+ */
+
+#define pr_fmt(fmt) "xive-kvm: " fmt
+
+#include <linux/anon_inodes.h>
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/err.h>
+#include <linux/gfp.h>
+#include <linux/spinlock.h>
+#include <linux/delay.h>
+#include <linux/percpu.h>
+#include <linux/cpumask.h>
+#include <asm/uaccess.h>
+#include <asm/kvm_book3s.h>
+#include <asm/kvm_ppc.h>
+#include <asm/hvcall.h>
+#include <asm/xics.h>
+#include <asm/xive.h>
+#include <asm/xive-regs.h>
+#include <asm/debug.h>
+#include <asm/debugfs.h>
+#include <asm/time.h>
+#include <asm/opal.h>
+
+#include <linux/debugfs.h>
+#include <linux/seq_file.h>
+
+#include "book3s_xive.h"
+
+static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	struct xive_q *q = &xc->queues[prio];
+
+	xive_native_disable_queue(xc->vp_id, q, prio);
+	if (q->qpage) {
+		put_page(virt_to_page(q->qpage));
+		q->qpage = NULL;
+	}
+}
+
+void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	int i;
+
+	if (!kvmppc_xive_enabled(vcpu))
+		return;
+
+	if (!xc)
+		return;
+
+	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
+
+	/* Ensure no interrupt is still routed to that VP */
+	xc->valid = false;
+	kvmppc_xive_disable_vcpu_interrupts(vcpu);
+
+	/* Disable the VP */
+	xive_native_disable_vp(xc->vp_id);
+
+	/* Free the queues & associated interrupts */
+	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
+		/* Free the escalation irq */
+		if (xc->esc_virq[i]) {
+			free_irq(xc->esc_virq[i], vcpu);
+			irq_dispose_mapping(xc->esc_virq[i]);
+			kfree(xc->esc_virq_names[i]);
+			xc->esc_virq[i] = 0;
+		}
+
+		/* Free the queue */
+		xive_native_cleanup_queue(vcpu, i);
+	}
+
+	/* Free the VP */
+	kfree(xc);
+
+	/* Cleanup the vcpu */
+	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+	vcpu->arch.xive_vcpu = NULL;
+}
+
+int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
+				    struct kvm_vcpu *vcpu, u32 cpu)
+{
+	struct kvmppc_xive *xive = dev->private;
+	struct kvmppc_xive_vcpu *xc;
+	int rc;
+
+	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
+
+	if (dev->ops != &kvm_xive_native_ops) {
+		pr_devel("Wrong ops !\n");
+		return -EPERM;
+	}
+	if (xive->kvm != vcpu->kvm)
+		return -EPERM;
+	if (vcpu->arch.irq_type)
+		return -EBUSY;
+	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
+		pr_devel("Duplicate !\n");
+		return -EEXIST;
+	}
+	if (cpu >= KVM_MAX_VCPUS) {
+		pr_devel("Out of bounds !\n");
+		return -EINVAL;
+	}
+	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
+	if (!xc)
+		return -ENOMEM;
+
+	mutex_lock(&vcpu->kvm->lock);
+	vcpu->arch.xive_vcpu = xc;
+	xc->xive = xive;
+	xc->vcpu = vcpu;
+	xc->server_num = cpu;
+	xc->vp_id = xive->vp_base + cpu;
+	xc->valid = true;
+
+	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
+	if (rc) {
+		pr_err("Failed to get VP info from OPAL: %d\n", rc);
+		goto bail;
+	}
+
+	/*
+	 * Enable the VP first as the single escalation mode will
+	 * affect escalation interrupts numbering
+	 */
+	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
+	if (rc) {
+		pr_err("Failed to enable VP in OPAL: %d\n", rc);
+		goto bail;
+	}
+
+	/* Configure VCPU fields for use by assembly push/pull */
+	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
+	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
+
+	/* TODO: initialize queues ? */
+
+bail:
+	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
+	mutex_unlock(&vcpu->kvm->lock);
+	if (rc)
+		kvmppc_xive_native_cleanup_vcpu(vcpu);
+
+	return rc;
+}
+
+static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
+				       struct kvm_device_attr *attr)
+{
+	return -ENXIO;
+}
+
+static void kvmppc_xive_native_free(struct kvm_device *dev)
+{
+	struct kvmppc_xive *xive = dev->private;
+	struct kvm *kvm = xive->kvm;
+	int i;
+
+	debugfs_remove(xive->dentry);
+
+	pr_devel("Destroying xive native for partition\n");
+
+	if (kvm)
+		kvm->arch.xive = NULL;
+
+	/* Mask and free interrupts */
+	for (i = 0; i <= xive->max_sbid; i++) {
+		if (xive->src_blocks[i])
+			kvmppc_xive_free_sources(xive->src_blocks[i]);
+		kfree(xive->src_blocks[i]);
+		xive->src_blocks[i] = NULL;
+	}
+
+	if (xive->vp_base != XIVE_INVALID_VP)
+		xive_native_free_vp_block(xive->vp_base);
+
+	kfree(xive);
+	kfree(dev);
+}
+
+static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
+{
+	struct kvmppc_xive *xive;
+	struct kvm *kvm = dev->kvm;
+	int ret = 0;
+
+	pr_devel("Creating xive native for partition\n");
+
+	if (kvm->arch.xive)
+		return -EEXIST;
+
+	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
+	if (!xive)
+		return -ENOMEM;
+
+	dev->private = xive;
+	xive->dev = dev;
+	xive->kvm = kvm;
+	kvm->arch.xive = xive;
+
+	/* We use the default queue size set by the host */
+	xive->q_order = xive_native_default_eq_shift();
+	if (xive->q_order < PAGE_SHIFT)
+		xive->q_page_order = 0;
+	else
+		xive->q_page_order = xive->q_order - PAGE_SHIFT;
+
+	/* Allocate a bunch of VPs */
+	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
+	pr_devel("VP_Base=%x\n", xive->vp_base);
+
+	if (xive->vp_base == XIVE_INVALID_VP)
+		ret = -ENOMEM;
+
+	xive->single_escalation = xive_native_has_single_escalation();
+
+	if (ret)
+		kfree(xive);
+
+	return ret;
+}
+
+static int xive_native_debug_show(struct seq_file *m, void *private)
+{
+	struct kvmppc_xive *xive = m->private;
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	unsigned int i;
+
+	if (!kvm)
+		return 0;
+
+	seq_puts(m, "=========\nVCPU state\n=========\n");
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+
+		if (!xc)
+			continue;
+
+		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
+			   xc->server_num,
+			   vcpu->arch.xive_saved_state.nsr,
+			   vcpu->arch.xive_saved_state.cppr,
+			   vcpu->arch.xive_saved_state.ipb,
+			   vcpu->arch.xive_saved_state.pipr,
+			   vcpu->arch.xive_saved_state.w01,
+			   (u32) vcpu->arch.xive_cam_word);
+
+		kvmppc_xive_debug_show_queues(m, vcpu);
+	}
+
+	return 0;
+}
+
+static int xive_native_debug_open(struct inode *inode, struct file *file)
+{
+	return single_open(file, xive_native_debug_show, inode->i_private);
+}
+
+static const struct file_operations xive_native_debug_fops = {
+	.open = xive_native_debug_open,
+	.read = seq_read,
+	.llseek = seq_lseek,
+	.release = single_release,
+};
+
+static void xive_native_debugfs_init(struct kvmppc_xive *xive)
+{
+	char *name;
+
+	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
+	if (!name) {
+		pr_err("%s: no memory for name\n", __func__);
+		return;
+	}
+
+	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
+					   xive, &xive_native_debug_fops);
+
+	pr_debug("%s: created %s\n", __func__, name);
+	kfree(name);
+}
+
+static void kvmppc_xive_native_init(struct kvm_device *dev)
+{
+	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
+
+	/* Register some debug interfaces */
+	xive_native_debugfs_init(xive);
+}
+
+struct kvm_device_ops kvm_xive_native_ops = {
+	.name = "kvm-xive-native",
+	.create = kvmppc_xive_native_create,
+	.init = kvmppc_xive_native_init,
+	.destroy = kvmppc_xive_native_free,
+	.set_attr = kvmppc_xive_native_set_attr,
+	.get_attr = kvmppc_xive_native_get_attr,
+	.has_attr = kvmppc_xive_native_has_attr,
+};
+
+void kvmppc_xive_native_init_module(void)
+{
+	;
+}
+
+void kvmppc_xive_native_exit_module(void)
+{
+	;
+}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index b90a7d154180..01d526e15e9d 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_PPC_ENABLE_HCALL:
 #ifdef CONFIG_KVM_XICS
 	case KVM_CAP_IRQ_XICS:
+#endif
+#ifdef CONFIG_KVM_XIVE
+	case KVM_CAP_PPC_IRQ_XIVE:
 #endif
 	case KVM_CAP_PPC_GET_CPU_CHAR:
 		r = 1;
@@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
 		else
 			kvmppc_xics_free_icp(vcpu);
 		break;
+	case KVMPPC_IRQ_XIVE:
+		kvmppc_xive_native_cleanup_vcpu(vcpu);
+		break;
 	}
 
 	kvmppc_core_vcpu_free(vcpu);
@@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 		break;
 	}
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+	case KVM_CAP_PPC_IRQ_XIVE: {
+		struct fd f;
+		struct kvm_device *dev;
+
+		r = -EBADF;
+		f = fdget(cap->args[0]);
+		if (!f.file)
+			break;
+
+		r = -ENXIO;
+		if (!xive_enabled())
+			break;
+
+		r = -EPERM;
+		dev = kvm_device_from_filp(f.file);
+		if (dev)
+			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
+							    cap->args[1]);
+
+		fdput(f);
+		break;
+	}
+#endif /* CONFIG_KVM_XIVE */
 #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
 	case KVM_CAP_PPC_FWNMI:
 		r = -EINVAL;
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 64f1135e7732..806cbe488410 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -99,7 +99,7 @@ endif
 kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
 	book3s_xics.o
 
-kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
+kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
 kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
 
 kvm-book3s_64-module-objs := \
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (4 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-22  5:09   ` Paul Mackerras
  2019-02-04  4:45   ` David Gibson
  2019-01-07 18:43 ` [PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to " Cédric Le Goater
                   ` (11 subsequent siblings)
  17 siblings, 2 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

This will let the guest create a memory mapping to expose the ESB MMIO
regions used to control the interrupt sources, to trigger events, to
EOI or to turn off the sources.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
 arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
 2 files changed, 101 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 8c876c166ef2..6bb61ba141c2 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_XICS_PRESENTED		(1ULL << 43)
 #define  KVM_XICS_QUEUED		(1ULL << 44)
 
+/* POWER9 XIVE Native Interrupt Controller */
+#define KVM_DEV_XIVE_GRP_CTRL		1
+#define   KVM_DEV_XIVE_GET_ESB_FD	1
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 115143e76c45..e20081f0c8d4 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	return rc;
 }
 
+static int xive_native_esb_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+	struct kvmppc_xive *xive = vma->vm_file->private_data;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+	u64 page;
+	unsigned long irq;
+
+	/*
+	 * Linux/KVM uses a two pages ESB setting, one for trigger and
+	 * one for EOI
+	 */
+	irq = vmf->pgoff / 2;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb) {
+		pr_err("%s: source %lx not found !\n", __func__, irq);
+		return VM_FAULT_SIGBUS;
+	}
+
+	state = &sb->irq_state[src];
+	kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+	arch_spin_lock(&sb->lock);
+
+	/*
+	 * first/even page is for trigger
+	 * second/odd page is for EOI and management.
+	 */
+	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
+	arch_spin_unlock(&sb->lock);
+
+	if (!page) {
+		pr_err("%s: acessing invalid ESB page for source %lx !\n",
+		       __func__, irq);
+		return VM_FAULT_SIGBUS;
+	}
+
+	vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
+	return VM_FAULT_NOPAGE;
+}
+
+static const struct vm_operations_struct xive_native_esb_vmops = {
+	.fault = xive_native_esb_fault,
+};
+
+static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/* There are two ESB pages (trigger and EOI) per IRQ */
+	if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
+		return -EINVAL;
+
+	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
+	vma->vm_ops = &xive_native_esb_vmops;
+	return 0;
+}
+
+static const struct file_operations xive_native_esb_fops = {
+	.mmap = xive_native_esb_mmap,
+};
+
+static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr)
+{
+	u64 __user *ubufp = (u64 __user *) addr;
+	int ret;
+
+	ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive,
+				O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		return ret;
+
+	return put_user(ret, ubufp);
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
+	struct kvmppc_xive *xive = dev->private;
+
+	switch (attr->group) {
+	case KVM_DEV_XIVE_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_XIVE_GET_ESB_FD:
+			return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
+		}
+		break;
+	}
 	return -ENXIO;
 }
 
 static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
+	switch (attr->group) {
+	case KVM_DEV_XIVE_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_XIVE_GET_ESB_FD:
+			return 0;
+		}
+		break;
+	}
 	return -ENXIO;
 }
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to XIVE native device
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (5 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device Cédric Le Goater
@ 2019-01-07 18:43 ` " Cédric Le Goater
  2019-01-07 18:43 ` [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the " Cédric Le Goater
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

This will let the guest create a memory mapping to expose the XIVE
MMIO region (TIMA) used for interrupt management at the CPU level.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/xive.h       |  1 +
 arch/powerpc/include/uapi/asm/kvm.h   |  1 +
 arch/powerpc/kvm/book3s_xive_native.c | 57 +++++++++++++++++++++++++++
 arch/powerpc/sysdev/xive/native.c     | 11 ++++++
 4 files changed, 70 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index d6be3e4d9fa4..7a7aa22d8258 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -23,6 +23,7 @@
  * same offset regardless of where the code is executing
  */
 extern void __iomem *xive_tima;
+extern unsigned long xive_tima_os;
 
 /*
  * Offset in the TM area of our current execution level (provided by
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 6bb61ba141c2..89c140cb9e79 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -678,5 +678,6 @@ struct kvm_ppc_cpu_char {
 /* POWER9 XIVE Native Interrupt Controller */
 #define KVM_DEV_XIVE_GRP_CTRL		1
 #define   KVM_DEV_XIVE_GET_ESB_FD	1
+#define   KVM_DEV_XIVE_GET_TIMA_FD	2
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index e20081f0c8d4..ee9d12bf2dae 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -232,6 +232,60 @@ static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr)
 	return put_user(ret, ubufp);
 }
 
+static int xive_native_tima_fault(struct vm_fault *vmf)
+{
+	struct vm_area_struct *vma = vmf->vma;
+
+	switch (vmf->pgoff) {
+	case 0: /* HW - forbid access */
+	case 1: /* HV - forbid access */
+		return VM_FAULT_SIGBUS;
+	case 2: /* OS */
+		vmf_insert_pfn(vma, vmf->address, xive_tima_os >> PAGE_SHIFT);
+		return VM_FAULT_NOPAGE;
+	case 3: /* USER - TODO */
+	default:
+		return VM_FAULT_SIGBUS;
+	}
+}
+
+static const struct vm_operations_struct xive_native_tima_vmops = {
+	.fault = xive_native_tima_fault,
+};
+
+static int xive_native_tima_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	/*
+	 * The TIMA is four pages wide but only the last two pages (OS
+	 * and User view) are accessible to the guest. The page fault
+	 * handler will handle the permissions.
+	 */
+	if (vma_pages(vma) + vma->vm_pgoff > 4)
+		return -EINVAL;
+
+	vma->vm_flags |= VM_IO | VM_PFNMAP;
+	vma->vm_page_prot = pgprot_noncached_wc(vma->vm_page_prot);
+	vma->vm_ops = &xive_native_tima_vmops;
+	return 0;
+}
+
+static const struct file_operations xive_native_tima_fops = {
+	.mmap = xive_native_tima_mmap,
+};
+
+static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
+{
+	u64 __user *ubufp = (u64 __user *) addr;
+	int ret;
+
+	ret = anon_inode_getfd("[xive-tima]", &xive_native_tima_fops, xive,
+			       O_RDWR | O_CLOEXEC);
+	if (ret < 0)
+		return ret;
+
+	return put_user(ret, ubufp);
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -248,6 +302,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_XIVE_GET_ESB_FD:
 			return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
+		case KVM_DEV_XIVE_GET_TIMA_FD:
+			return kvmppc_xive_native_get_tima_fd(xive, attr->addr);
 		}
 		break;
 	}
@@ -261,6 +317,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 	case KVM_DEV_XIVE_GRP_CTRL:
 		switch (attr->attr) {
 		case KVM_DEV_XIVE_GET_ESB_FD:
+		case KVM_DEV_XIVE_GET_TIMA_FD:
 			return 0;
 		}
 		break;
diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
index 0c037e933e55..7782201e5fe8 100644
--- a/arch/powerpc/sysdev/xive/native.c
+++ b/arch/powerpc/sysdev/xive/native.c
@@ -521,6 +521,9 @@ u32 xive_native_default_eq_shift(void)
 }
 EXPORT_SYMBOL_GPL(xive_native_default_eq_shift);
 
+unsigned long xive_tima_os;
+EXPORT_SYMBOL_GPL(xive_tima_os);
+
 bool __init xive_native_init(void)
 {
 	struct device_node *np;
@@ -573,6 +576,14 @@ bool __init xive_native_init(void)
 	for_each_possible_cpu(cpu)
 		kvmppc_set_xive_tima(cpu, r.start, tima);
 
+	/* Resource 2 is OS window */
+	if (of_address_to_resource(np, 2, &r)) {
+		pr_err("Failed to get thread mgmnt area resource\n");
+		return false;
+	}
+
+	xive_tima_os = r.start;
+
 	/* Grab size of provisionning pages */
 	xive_parse_provisioning(np);
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (6 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to " Cédric Le Goater
@ 2019-01-07 18:43 ` " Cédric Le Goater
  2019-01-22  5:14   ` Paul Mackerras
  2019-01-07 18:43 ` [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE " Cédric Le Goater
                   ` (9 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

The ESB MMIO region controls the interrupt sources of the guest. QEMU
will query an fd (GET_ESB_FD ioctl) and map this region at a specific
address for the guest to use. The guest will obtain this information
using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
setting used by QEMU, add a VC_BASE control to the KVM XIVE device

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h   |  1 +
 arch/powerpc/kvm/book3s_xive.h        |  3 +++
 arch/powerpc/kvm/book3s_xive_native.c | 39 +++++++++++++++++++++++++++
 3 files changed, 43 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 89c140cb9e79..8b78b12aa118 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -679,5 +679,6 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_CTRL		1
 #define   KVM_DEV_XIVE_GET_ESB_FD	1
 #define   KVM_DEV_XIVE_GET_TIMA_FD	2
+#define   KVM_DEV_XIVE_VC_BASE		3
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 5f22415520b4..ae4a670eea63 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -125,6 +125,9 @@ struct kvmppc_xive {
 
 	/* Flags */
 	u8	single_escalation;
+
+	/* VC base address for ESBs */
+	u64     vc_base;
 };
 
 #define KVMPPC_XIVE_Q_COUNT	8
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index ee9d12bf2dae..29a62914de55 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -153,6 +153,25 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	return rc;
 }
 
+static int kvmppc_xive_native_set_vc_base(struct kvmppc_xive *xive, u64 addr)
+{
+	u64 __user *ubufp = (u64 __user *) addr;
+
+	if (get_user(xive->vc_base, ubufp))
+		return -EFAULT;
+	return 0;
+}
+
+static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr)
+{
+	u64 __user *ubufp = (u64 __user *) addr;
+
+	if (put_user(xive->vc_base, ubufp))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int xive_native_esb_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -289,6 +308,16 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
+	struct kvmppc_xive *xive = dev->private;
+
+	switch (attr->group) {
+	case KVM_DEV_XIVE_GRP_CTRL:
+		switch (attr->attr) {
+		case KVM_DEV_XIVE_VC_BASE:
+			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
+		}
+		break;
+	}
 	return -ENXIO;
 }
 
@@ -304,6 +333,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
 			return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
 		case KVM_DEV_XIVE_GET_TIMA_FD:
 			return kvmppc_xive_native_get_tima_fd(xive, attr->addr);
+		case KVM_DEV_XIVE_VC_BASE:
+			return kvmppc_xive_native_get_vc_base(xive, attr->addr);
 		}
 		break;
 	}
@@ -318,6 +349,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_XIVE_GET_ESB_FD:
 		case KVM_DEV_XIVE_GET_TIMA_FD:
+		case KVM_DEV_XIVE_VC_BASE:
 			return 0;
 		}
 		break;
@@ -353,6 +385,11 @@ static void kvmppc_xive_native_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+/*
+ * ESB MMIO address of chip 0
+ */
+#define XIVE_VC_BASE   0x0006010000000000ull
+
 static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
@@ -387,6 +424,8 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	if (xive->vp_base == XIVE_INVALID_VP)
 		ret = -ENOMEM;
 
+	xive->vc_base = XIVE_VC_BASE;
+
 	xive->single_escalation = xive_native_has_single_escalation();
 
 	if (ret)
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (7 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the " Cédric Le Goater
@ 2019-01-07 18:43 ` " Cédric Le Goater
  2019-02-04  4:57   ` David Gibson
  2019-01-07 18:43 ` [PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state Cédric Le Goater
                   ` (8 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

Interrupt sources are simply created at the OPAL level and then
MASKED. KVM only needs to know about their type: LSI or MSI.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h           |  5 +
 arch/powerpc/kvm/book3s_xive_native.c         | 98 +++++++++++++++++++
 .../powerpc/kvm/book3s_xive_native_template.c | 27 +++++
 3 files changed, 130 insertions(+)
 create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 8b78b12aa118..6fc9660c5aec 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -680,5 +680,10 @@ struct kvm_ppc_cpu_char {
 #define   KVM_DEV_XIVE_GET_ESB_FD	1
 #define   KVM_DEV_XIVE_GET_TIMA_FD	2
 #define   KVM_DEV_XIVE_VC_BASE		3
+#define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
+
+/* Layout of 64-bit XIVE source attribute values */
+#define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
+#define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
 
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 29a62914de55..2518640d4a58 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -31,6 +31,24 @@
 
 #include "book3s_xive.h"
 
+/*
+ * We still instantiate them here because we use some of the
+ * generated utility functions as well in this file.
+ */
+#define XIVE_RUNTIME_CHECKS
+#define X_PFX xive_vm_
+#define X_STATIC static
+#define X_STAT_PFX stat_vm_
+#define __x_tima		xive_tima
+#define __x_eoi_page(xd)	((void __iomem *)((xd)->eoi_mmio))
+#define __x_trig_page(xd)	((void __iomem *)((xd)->trig_mmio))
+#define __x_writeb	__raw_writeb
+#define __x_readw	__raw_readw
+#define __x_readq	__raw_readq
+#define __x_writeq	__raw_writeq
+
+#include "book3s_xive_native_template.c"
+
 static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
@@ -305,6 +323,78 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
 	return put_user(ret, ubufp);
 }
 
+static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
+					 u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u64 val;
+	u16 idx;
+
+	pr_devel("%s irq=0x%lx\n", __func__, irq);
+
+	if (irq < KVMPPC_XIVE_FIRST_IRQ || irq >= KVMPPC_XIVE_NR_IRQS)
+		return -ENOENT;
+
+	sb = kvmppc_xive_find_source(xive, irq, &idx);
+	if (!sb) {
+		pr_debug("No source, creating source block...\n");
+		sb = kvmppc_xive_create_src_block(xive, irq);
+		if (!sb) {
+			pr_err("Failed to create block...\n");
+			return -ENOMEM;
+		}
+	}
+	state = &sb->irq_state[idx];
+
+	if (get_user(val, ubufp)) {
+		pr_err("fault getting user info !\n");
+		return -EFAULT;
+	}
+
+	/*
+	 * If the source doesn't already have an IPI, allocate
+	 * one and get the corresponding data
+	 */
+	if (!state->ipi_number) {
+		state->ipi_number = xive_native_alloc_irq();
+		if (state->ipi_number == 0) {
+			pr_err("Failed to allocate IRQ !\n");
+			return -ENOMEM;
+		}
+		xive_native_populate_irq_data(state->ipi_number,
+					      &state->ipi_data);
+		pr_debug("%s allocated hw_irq=0x%x for irq=0x%lx\n", __func__,
+			 state->ipi_number, irq);
+	}
+
+	arch_spin_lock(&sb->lock);
+
+	/* Restore LSI state */
+	if (val & KVM_XIVE_LEVEL_SENSITIVE) {
+		state->lsi = true;
+		if (val & KVM_XIVE_LEVEL_ASSERTED)
+			state->asserted = true;
+		pr_devel("  LSI ! Asserted=%d\n", state->asserted);
+	}
+
+	/* Mask IRQ to start with */
+	state->act_server = 0;
+	state->act_priority = MASKED;
+	xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
+	xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
+
+	/* Increment the number of valid sources and mark this one valid */
+	if (!state->valid)
+		xive->src_count++;
+	state->valid = true;
+
+	arch_spin_unlock(&sb->lock);
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -317,6 +407,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
 		}
 		break;
+	case KVM_DEV_XIVE_GRP_SOURCES:
+		return kvmppc_xive_native_set_source(xive, attr->attr,
+						     attr->addr);
 	}
 	return -ENXIO;
 }
@@ -353,6 +446,11 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 			return 0;
 		}
 		break;
+	case KVM_DEV_XIVE_GRP_SOURCES:
+		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
+		    attr->attr < KVMPPC_XIVE_NR_IRQS)
+			return 0;
+		break;
 	}
 	return -ENXIO;
 }
diff --git a/arch/powerpc/kvm/book3s_xive_native_template.c b/arch/powerpc/kvm/book3s_xive_native_template.c
new file mode 100644
index 000000000000..e7260da4a596
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_xive_native_template.c
@@ -0,0 +1,27 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Copyright (c) 2017-2019, IBM Corporation.
+ */
+
+/* File to be included by other .c files */
+
+#define XGLUE(a, b) a##b
+#define GLUE(a, b) XGLUE(a, b)
+
+/*
+ * TODO: introduce a common template file with the XIVE native layer
+ * and the XICS-on-XIVE glue for the utility functions
+ */
+static u8 GLUE(X_PFX, esb_load)(struct xive_irq_data *xd, u32 offset)
+{
+	u64 val;
+
+	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
+		offset |= offset << 4;
+
+	val = __x_readq(__x_eoi_page(xd) + offset);
+#ifdef __LITTLE_ENDIAN__
+	val >>= 64-8;
+#endif
+	return (u8)val;
+}
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (8 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE " Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-07 18:43 ` [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls Cédric Le Goater
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

The Effective IRQ Source Number is the interrupt number pushed in the
event queue that the guest OS will use to dispatch events internally.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.h | 3 +++
 arch/powerpc/kvm/book3s_xive.c | 1 +
 2 files changed, 4 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index ae4a670eea63..67e07b41061d 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -57,6 +57,9 @@ struct kvmppc_xive_irq_state {
 	bool saved_p;
 	bool saved_q;
 	u8 saved_scan_prio;
+
+	/* Xive native */
+	u32 eisn;			/* Guest Effective IRQ number */
 };
 
 /* Select the "right" interrupt (IPI vs. passthrough) */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index bb5d32f7e4e6..e9f05d9c9ad5 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1515,6 +1515,7 @@ struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
 
 	for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
 		sb->irq_state[i].number = (bid << KVMPPC_XICS_ICS_SHIFT) | i;
+		sb->irq_state[i].eisn = 0;
 		sb->irq_state[i].guest_priority = MASKED;
 		sb->irq_state[i].saved_priority = MASKED;
 		sb->irq_state[i].act_priority = MASKED;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (9 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-01-22  5:23   ` Paul Mackerras
  2019-01-07 18:43 ` [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address Cédric Le Goater
                   ` (6 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

The XIVE native exploitation mode specs define a set of Hypervisor
calls to configure the sources and the event queues :

 - H_INT_GET_SOURCE_INFO

   used to obtain the address of the MMIO page of the Event State
   Buffer (PQ bits) entry associated with the source.

 - H_INT_SET_SOURCE_CONFIG

   assigns a source to a "target".

 - H_INT_GET_SOURCE_CONFIG

   determines which "target" and "priority" is assigned to a source

 - H_INT_GET_QUEUE_INFO

   returns the address of the notification management page associated
   with the specified "target" and "priority".

 - H_INT_SET_QUEUE_CONFIG

   sets or resets the event queue for a given "target" and "priority".
   It is also used to set the notification configuration associated
   with the queue, only unconditional notification is supported for
   the moment. Reset is performed with a queue size of 0 and queueing
   is disabled in that case.

 - H_INT_GET_QUEUE_CONFIG

   returns the queue settings for a given "target" and "priority".

 - H_INT_RESET

   resets all of the guest's internal interrupt structures to their
   initial state, losing all configuration set via the hcalls
   H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.

 - H_INT_SYNC

   issue a synchronisation on a source to make sure all notifications
   have reached their queue.

Calls that still need to be addressed :

   H_INT_SET_OS_REPORTING_LINE
   H_INT_GET_OS_REPORTING_LINE

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/kvm_ppc.h            |  43 ++
 arch/powerpc/kvm/book3s_xive.h                |  54 +++
 arch/powerpc/kvm/book3s_hv.c                  |  29 ++
 arch/powerpc/kvm/book3s_hv_builtin.c          | 196 +++++++++
 arch/powerpc/kvm/book3s_hv_rm_xive_native.c   |  47 +++
 arch/powerpc/kvm/book3s_xive_native.c         | 326 ++++++++++++++-
 .../powerpc/kvm/book3s_xive_native_template.c | 371 ++++++++++++++++++
 arch/powerpc/kvm/Makefile                     |   2 +
 arch/powerpc/kvm/book3s_hv_rmhandlers.S       |  52 +++
 9 files changed, 1118 insertions(+), 2 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_rm_xive_native.c

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 1bb313f238fe..4cc897039485 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -602,6 +602,7 @@ extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
+extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
 
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
@@ -634,6 +635,8 @@ static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
 static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
+static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
+	{ return 0; }
 
 #endif /* CONFIG_KVM_XIVE */
 
@@ -682,6 +685,46 @@ int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
 int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr);
 void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu);
 
+int kvmppc_rm_h_int_get_source_info(struct kvm_vcpu *vcpu,
+				    unsigned long flag,
+				    unsigned long lisn);
+int kvmppc_rm_h_int_set_source_config(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long lisn,
+				      unsigned long target,
+				      unsigned long priority,
+				      unsigned long eisn);
+int kvmppc_rm_h_int_get_source_config(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long lisn);
+int kvmppc_rm_h_int_get_queue_info(struct kvm_vcpu *vcpu,
+				   unsigned long flag,
+				   unsigned long target,
+				   unsigned long priority);
+int kvmppc_rm_h_int_set_queue_config(struct kvm_vcpu *vcpu,
+				     unsigned long flag,
+				     unsigned long target,
+				     unsigned long priority,
+				     unsigned long qpage,
+				     unsigned long qsize);
+int kvmppc_rm_h_int_get_queue_config(struct kvm_vcpu *vcpu,
+				     unsigned long flag,
+				     unsigned long target,
+				     unsigned long priority);
+int kvmppc_rm_h_int_set_os_reporting_line(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long reportingline);
+int kvmppc_rm_h_int_get_os_reporting_line(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long target,
+					  unsigned long reportingline);
+int kvmppc_rm_h_int_esb(struct kvm_vcpu *vcpu, unsigned long flag,
+			unsigned long lisn, unsigned long offset,
+			unsigned long data);
+int kvmppc_rm_h_int_sync(struct kvm_vcpu *vcpu, unsigned long flag,
+			 unsigned long lisn);
+int kvmppc_rm_h_int_reset(struct kvm_vcpu *vcpu, unsigned long flag);
+
 /*
  * Host-side operations we want to set up while running in real
  * mode in the guest operating on the xics.
diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 67e07b41061d..31e598e62589 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -268,5 +268,59 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
 int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio);
 int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
 
+int xive_rm_h_int_get_source_info(struct kvm_vcpu *vcpu,
+				    unsigned long flag,
+				    unsigned long lisn);
+int xive_rm_h_int_get_source_config(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long lisn);
+int xive_rm_h_int_get_queue_info(struct kvm_vcpu *vcpu,
+				   unsigned long flag,
+				   unsigned long target,
+				   unsigned long priority);
+int xive_rm_h_int_get_queue_config(struct kvm_vcpu *vcpu,
+				     unsigned long flag,
+				     unsigned long target,
+				     unsigned long priority);
+int xive_rm_h_int_set_os_reporting_line(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long reportingline);
+int xive_rm_h_int_get_os_reporting_line(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long target,
+					  unsigned long reportingline);
+int xive_rm_h_int_esb(struct kvm_vcpu *vcpu, unsigned long flag,
+			unsigned long lisn, unsigned long offset,
+			unsigned long data);
+int xive_rm_h_int_sync(struct kvm_vcpu *vcpu, unsigned long flag,
+			 unsigned long lisn);
+
+extern int (*__xive_vm_h_int_get_source_info)(struct kvm_vcpu *vcpu,
+				    unsigned long flag,
+				    unsigned long lisn);
+extern int (*__xive_vm_h_int_get_source_config)(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long lisn);
+extern int (*__xive_vm_h_int_get_queue_info)(struct kvm_vcpu *vcpu,
+				   unsigned long flag,
+				   unsigned long target,
+				   unsigned long priority);
+extern int (*__xive_vm_h_int_get_queue_config)(struct kvm_vcpu *vcpu,
+				     unsigned long flag,
+				     unsigned long target,
+				     unsigned long priority);
+extern int (*__xive_vm_h_int_set_os_reporting_line)(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long reportingline);
+extern int (*__xive_vm_h_int_get_os_reporting_line)(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long target,
+					  unsigned long reportingline);
+extern int (*__xive_vm_h_int_esb)(struct kvm_vcpu *vcpu, unsigned long flag,
+			unsigned long lisn, unsigned long offset,
+			unsigned long data);
+extern int (*__xive_vm_h_int_sync)(struct kvm_vcpu *vcpu, unsigned long flag,
+			 unsigned long lisn);
+
 #endif /* CONFIG_KVM_XICS */
 #endif /* _KVM_PPC_BOOK3S_XICS_H */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 5a066fc299e1..1fb17d529a88 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -930,6 +930,22 @@ int kvmppc_pseries_do_hcall(struct kvm_vcpu *vcpu)
 			break;
 		}
 		return RESUME_HOST;
+	case H_INT_GET_SOURCE_INFO:
+	case H_INT_SET_SOURCE_CONFIG:
+	case H_INT_GET_SOURCE_CONFIG:
+	case H_INT_GET_QUEUE_INFO:
+	case H_INT_SET_QUEUE_CONFIG:
+	case H_INT_GET_QUEUE_CONFIG:
+	case H_INT_SET_OS_REPORTING_LINE:
+	case H_INT_GET_OS_REPORTING_LINE:
+	case H_INT_ESB:
+	case H_INT_SYNC:
+	case H_INT_RESET:
+		if (kvmppc_xive_enabled(vcpu)) {
+			ret = kvmppc_xive_native_hcall(vcpu, req);
+			break;
+		}
+		return RESUME_HOST;
 	case H_SET_DABR:
 		ret = kvmppc_h_set_dabr(vcpu, kvmppc_get_gpr(vcpu, 4));
 		break;
@@ -5153,6 +5169,19 @@ static unsigned int default_hcall_list[] = {
 	H_IPOLL,
 	H_XIRR,
 	H_XIRR_X,
+#endif
+#ifdef CONFIG_KVM_XIVE
+	H_INT_GET_SOURCE_INFO,
+	H_INT_SET_SOURCE_CONFIG,
+	H_INT_GET_SOURCE_CONFIG,
+	H_INT_GET_QUEUE_INFO,
+	H_INT_SET_QUEUE_CONFIG,
+	H_INT_GET_QUEUE_CONFIG,
+	H_INT_SET_OS_REPORTING_LINE,
+	H_INT_GET_OS_REPORTING_LINE,
+	H_INT_ESB,
+	H_INT_SYNC,
+	H_INT_RESET,
 #endif
 	0
 };
diff --git a/arch/powerpc/kvm/book3s_hv_builtin.c b/arch/powerpc/kvm/book3s_hv_builtin.c
index a71e2fc00a4e..db690f914d78 100644
--- a/arch/powerpc/kvm/book3s_hv_builtin.c
+++ b/arch/powerpc/kvm/book3s_hv_builtin.c
@@ -51,6 +51,42 @@ EXPORT_SYMBOL_GPL(__xive_vm_h_ipi);
 EXPORT_SYMBOL_GPL(__xive_vm_h_cppr);
 EXPORT_SYMBOL_GPL(__xive_vm_h_eoi);
 
+int (*__xive_vm_h_int_get_source_info)(struct kvm_vcpu *vcpu,
+				       unsigned long flag,
+				       unsigned long lisn);
+int (*__xive_vm_h_int_get_source_config)(struct kvm_vcpu *vcpu,
+					 unsigned long flag,
+					 unsigned long lisn);
+int (*__xive_vm_h_int_get_queue_info)(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long target,
+				      unsigned long priority);
+int (*__xive_vm_h_int_get_queue_config)(struct kvm_vcpu *vcpu,
+					unsigned long flag,
+					unsigned long target,
+					unsigned long priority);
+int (*__xive_vm_h_int_set_os_reporting_line)(struct kvm_vcpu *vcpu,
+					     unsigned long flag,
+					     unsigned long line);
+int (*__xive_vm_h_int_get_os_reporting_line)(struct kvm_vcpu *vcpu,
+					     unsigned long flag,
+					     unsigned long target,
+					     unsigned long line);
+int (*__xive_vm_h_int_esb)(struct kvm_vcpu *vcpu, unsigned long flag,
+			   unsigned long lisn, unsigned long offset,
+			   unsigned long data);
+int (*__xive_vm_h_int_sync)(struct kvm_vcpu *vcpu, unsigned long flag,
+			    unsigned long lisn);
+
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_get_source_info);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_get_source_config);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_get_queue_info);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_get_queue_config);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_set_os_reporting_line);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_get_os_reporting_line);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_esb);
+EXPORT_SYMBOL_GPL(__xive_vm_h_int_sync);
+
 /*
  * Hash page table alignment on newer cpus(CPU_FTR_ARCH_206)
  * should be power of 2.
@@ -660,6 +696,166 @@ int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr)
 }
 #endif /* CONFIG_KVM_XICS */
 
+#ifdef CONFIG_KVM_XIVE
+int kvmppc_rm_h_int_get_source_info(struct kvm_vcpu *vcpu,
+				    unsigned long flag,
+				    unsigned long lisn)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_get_source_info(vcpu, flag, lisn);
+	if (unlikely(!__xive_vm_h_int_get_source_info))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_get_source_info(vcpu, flag, lisn);
+}
+
+int kvmppc_rm_h_int_set_source_config(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long lisn,
+				      unsigned long target,
+				      unsigned long priority,
+				      unsigned long eisn)
+{
+	return H_TOO_HARD;
+}
+
+int kvmppc_rm_h_int_get_source_config(struct kvm_vcpu *vcpu,
+				      unsigned long flag,
+				      unsigned long lisn)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_get_source_config(vcpu, flag, lisn);
+	if (unlikely(!__xive_vm_h_int_get_source_config))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_get_source_config(vcpu, flag, lisn);
+}
+
+int kvmppc_rm_h_int_get_queue_info(struct kvm_vcpu *vcpu,
+				   unsigned long flag,
+				   unsigned long target,
+				   unsigned long priority)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_get_queue_info(vcpu, flag, target,
+						    priority);
+	if (unlikely(!__xive_vm_h_int_get_queue_info))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_get_queue_info(vcpu, flag, target, priority);
+}
+
+int kvmppc_rm_h_int_set_queue_config(struct kvm_vcpu *vcpu,
+				     unsigned long flag,
+				     unsigned long target,
+				     unsigned long priority,
+				     unsigned long qpage,
+				     unsigned long qsize)
+{
+	return H_TOO_HARD;
+}
+
+int kvmppc_rm_h_int_get_queue_config(struct kvm_vcpu *vcpu,
+				     unsigned long flag,
+				     unsigned long target,
+				     unsigned long priority)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_get_queue_config(vcpu, flag, target,
+						      priority);
+	if (unlikely(!__xive_vm_h_int_get_queue_config))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_get_queue_config(vcpu, flag, target, priority);
+}
+
+int kvmppc_rm_h_int_set_os_reporting_line(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long line)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_set_os_reporting_line(vcpu, flag, line);
+	if (unlikely(!__xive_vm_h_int_set_os_reporting_line))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_set_os_reporting_line(vcpu, flag, line);
+}
+
+int kvmppc_rm_h_int_get_os_reporting_line(struct kvm_vcpu *vcpu,
+					  unsigned long flag,
+					  unsigned long target,
+					  unsigned long line)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_get_os_reporting_line(vcpu,
+							   flag, target, line);
+	if (unlikely(!__xive_vm_h_int_get_os_reporting_line))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_get_os_reporting_line(vcpu, flag, target, line);
+}
+
+int kvmppc_rm_h_int_esb(struct kvm_vcpu *vcpu, unsigned long flag,
+			 unsigned long lisn, unsigned long offset,
+			 unsigned long data)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_esb(vcpu, flag, lisn, offset, data);
+	if (unlikely(!__xive_vm_h_int_esb))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_esb(vcpu, flag, lisn, offset, data);
+}
+
+int kvmppc_rm_h_int_sync(struct kvm_vcpu *vcpu, unsigned long flag,
+			 unsigned long lisn)
+{
+	if (!kvmppc_xive_enabled(vcpu))
+		return H_TOO_HARD;
+	if (!xive_enabled())
+		return H_TOO_HARD;
+
+	if (is_rm())
+		return xive_rm_h_int_sync(vcpu, flag, lisn);
+	if (unlikely(!__xive_vm_h_int_sync))
+		return H_NOT_AVAILABLE;
+	return __xive_vm_h_int_sync(vcpu, flag, lisn);
+}
+
+int kvmppc_rm_h_int_reset(struct kvm_vcpu *vcpu, unsigned long flag)
+{
+	return H_TOO_HARD;
+}
+#endif /* CONFIG_KVM_XIVE */
+
 void kvmppc_bad_interrupt(struct pt_regs *regs)
 {
 	/*
diff --git a/arch/powerpc/kvm/book3s_hv_rm_xive_native.c b/arch/powerpc/kvm/book3s_hv_rm_xive_native.c
new file mode 100644
index 000000000000..0e72a6ae0f07
--- /dev/null
+++ b/arch/powerpc/kvm/book3s_hv_rm_xive_native.c
@@ -0,0 +1,47 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/kernel.h>
+#include <linux/kvm_host.h>
+#include <linux/err.h>
+#include <linux/kernel_stat.h>
+
+#include <asm/kvm_book3s.h>
+#include <asm/kvm_ppc.h>
+#include <asm/hvcall.h>
+#include <asm/xics.h>
+#include <asm/debug.h>
+#include <asm/synch.h>
+#include <asm/cputhreads.h>
+#include <asm/pgtable.h>
+#include <asm/ppc-opcode.h>
+#include <asm/pnv-pci.h>
+#include <asm/opal.h>
+#include <asm/smp.h>
+#include <asm/asm-prototypes.h>
+#include <asm/xive.h>
+#include <asm/xive-regs.h>
+
+#include "book3s_xive.h"
+
+/* XXX */
+#include <asm/udbg.h>
+//#define DBG(fmt...) udbg_printf(fmt)
+#define DBG(fmt...) do { } while (0)
+
+static inline void __iomem *get_tima_phys(void)
+{
+	return local_paca->kvm_hstate.xive_tima_phys;
+}
+
+#undef XIVE_RUNTIME_CHECKS
+#define X_PFX xive_rm_
+#define X_STATIC
+#define X_STAT_PFX stat_rm_
+#define __x_tima		get_tima_phys()
+#define __x_eoi_page(xd)	((void __iomem *)((xd)->eoi_page))
+#define __x_trig_page(xd)	((void __iomem *)((xd)->trig_page))
+#define __x_writeb	__raw_rm_writeb
+#define __x_readw	__raw_rm_readw
+#define __x_readq	__raw_rm_readq
+#define __x_writeq	__raw_rm_writeq
+
+#include "book3s_xive_native_template.c"
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 2518640d4a58..35d806740c3a 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -171,6 +171,56 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
 	return rc;
 }
 
+static int kvmppc_xive_native_set_source_config(struct kvmppc_xive *xive,
+					struct kvmppc_xive_src_block *sb,
+					struct kvmppc_xive_irq_state *state,
+					u32 server,
+					u8 priority,
+					u32 eisn)
+{
+	struct kvm *kvm = xive->kvm;
+	u32 hw_num;
+	int rc = 0;
+
+	/*
+	 * TODO: Do we need to safely mask and unmask a source ? can
+	 * we just let the guest handle the possible races ?
+	 */
+	arch_spin_lock(&sb->lock);
+
+	if (state->act_server == server && state->act_priority == priority &&
+	    state->eisn == eisn)
+		goto unlock;
+
+	pr_devel("new_act_prio=%d new_act_server=%d act_server=%d act_prio=%d\n",
+		 priority, server, state->act_server, state->act_priority);
+
+	kvmppc_xive_select_irq(state, &hw_num, NULL);
+
+	if (priority != MASKED) {
+		rc = kvmppc_xive_select_target(kvm, &server, priority);
+		if (rc)
+			goto unlock;
+
+		state->act_priority = priority;
+		state->act_server = server;
+		state->eisn = eisn;
+
+		rc = xive_native_configure_irq(hw_num, xive->vp_base + server,
+					       priority, eisn);
+	} else {
+		state->act_priority = MASKED;
+		state->act_server = 0;
+		state->eisn = 0;
+
+		rc = xive_native_configure_irq(hw_num, 0, MASKED, 0);
+	}
+
+unlock:
+	arch_spin_unlock(&sb->lock);
+	return rc;
+}
+
 static int kvmppc_xive_native_set_vc_base(struct kvmppc_xive *xive, u64 addr)
 {
 	u64 __user *ubufp = (u64 __user *) addr;
@@ -323,6 +373,20 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
 	return put_user(ret, ubufp);
 }
 
+static int xive_native_validate_queue_size(u32 qsize)
+{
+	switch (qsize) {
+	case 12:
+	case 16:
+	case 21:
+	case 24:
+	case 0:
+		return 0;
+	default:
+		return -EINVAL;
+	}
+}
+
 static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
 					 u64 addr)
 {
@@ -532,6 +596,248 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	return ret;
 }
 
+static int kvmppc_h_int_set_source_config(struct kvm_vcpu *vcpu,
+					  unsigned long flags,
+					  unsigned long irq,
+					  unsigned long server,
+					  unsigned long priority,
+					  unsigned long eisn)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	int rc = 0;
+	u16 idx;
+
+	pr_devel("H_INT_SET_SOURCE_CONFIG flags=%08lx irq=%lx server=%ld priority=%ld eisn=%lx\n",
+		 flags, irq, server, priority, eisn);
+
+	if (flags & ~(XIVE_SPAPR_SRC_SET_EISN | XIVE_SPAPR_SRC_MASK))
+		return H_PARAMETER;
+
+	sb = kvmppc_xive_find_source(xive, irq, &idx);
+	if (!sb)
+		return H_P2;
+	state = &sb->irq_state[idx];
+
+	if (!(flags & XIVE_SPAPR_SRC_SET_EISN))
+		eisn = state->eisn;
+
+	if (priority != xive_prio_from_guest(priority)) {
+		pr_err("invalid priority for queue %ld for VCPU %ld\n",
+		       priority, server);
+		return H_P3;
+	}
+
+	/* TODO: handle XIVE_SPAPR_SRC_MASK */
+
+	rc = kvmppc_xive_native_set_source_config(xive, sb, state, server,
+						  priority, eisn);
+	if (!rc)
+		return H_SUCCESS;
+	else if (rc == -EINVAL)
+		return H_P4; /* no server found */
+	else
+		return H_HARDWARE;
+}
+
+static int kvmppc_h_int_set_queue_config(struct kvm_vcpu *vcpu,
+					 unsigned long flags,
+					 unsigned long server,
+					 unsigned long priority,
+					 unsigned long qpage,
+					 unsigned long qsize)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	struct xive_q *q;
+	int rc;
+	__be32 *qaddr = 0;
+	struct page *page;
+
+	pr_devel("H_INT_SET_QUEUE_CONFIG flags=%08lx server=%ld priority=%ld qpage=%08lx qsize=%ld\n",
+		 flags, server, priority, qpage, qsize);
+
+	if (flags & ~XIVE_SPAPR_EQ_ALWAYS_NOTIFY)
+		return H_PARAMETER;
+
+	if (xc->server_num != server) {
+		vcpu = kvmppc_xive_find_server(kvm, server);
+		if (!vcpu) {
+			pr_debug("Can't find server %ld\n", server);
+			return H_P2;
+		}
+		xc = vcpu->arch.xive_vcpu;
+	}
+
+	if (priority != xive_prio_from_guest(priority) || priority == MASKED) {
+		pr_err("invalid priority for queue %ld for VCPU %d\n",
+		       priority, xc->server_num);
+		return H_P3;
+	}
+	q = &xc->queues[priority];
+
+	rc = xive_native_validate_queue_size(qsize);
+	if (rc) {
+		pr_err("invalid queue size %ld\n", qsize);
+		return H_P5;
+	}
+
+	/* reset queue and disable queueing */
+	if (!qsize) {
+		rc = xive_native_configure_queue(xc->vp_id, q, priority,
+						 NULL, 0, true);
+		if (rc) {
+			pr_err("Failed to reset queue %ld for VCPU %d: %d\n",
+			       priority, xc->server_num, rc);
+			return H_HARDWARE;
+		}
+
+		if (q->qpage) {
+			put_page(virt_to_page(q->qpage));
+			q->qpage = NULL;
+		}
+
+		return H_SUCCESS;
+	}
+
+	page = gfn_to_page(kvm, gpa_to_gfn(qpage));
+	if (is_error_page(page)) {
+		pr_warn("Couldn't get guest page for %lx!\n", qpage);
+		return H_P4;
+	}
+	qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK);
+
+	rc = xive_native_configure_queue(xc->vp_id, q, priority,
+					 (__be32 *) qaddr, qsize, true);
+	if (rc) {
+		pr_err("Failed to configure queue %ld for VCPU %d: %d\n",
+		       priority, xc->server_num, rc);
+		put_page(page);
+		return H_HARDWARE;
+	}
+
+	rc = kvmppc_xive_attach_escalation(vcpu, priority);
+	if (rc) {
+		xive_native_cleanup_queue(vcpu, priority);
+		return H_HARDWARE;
+	}
+
+	return H_SUCCESS;
+}
+
+static void kvmppc_xive_reset_sources(struct kvmppc_xive_src_block *sb)
+{
+	int i;
+
+	for (i = 0; i < KVMPPC_XICS_IRQ_PER_ICS; i++) {
+		struct kvmppc_xive_irq_state *state = &sb->irq_state[i];
+
+		if (!state->valid)
+			continue;
+
+		if (state->act_priority == MASKED)
+			continue;
+
+		arch_spin_lock(&sb->lock);
+		state->eisn = 0;
+		state->act_server = 0;
+		state->act_priority = MASKED;
+		xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
+		xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
+		if (state->pt_number) {
+			xive_vm_esb_load(state->pt_data, XIVE_ESB_SET_PQ_01);
+			xive_native_configure_irq(state->pt_number,
+						  0, MASKED, 0);
+		}
+		arch_spin_unlock(&sb->lock);
+	}
+}
+
+static int kvmppc_h_int_reset(struct kvmppc_xive *xive, unsigned long flags)
+{
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	unsigned int i;
+
+	pr_devel("H_INT_RESET flags=%08lx\n", flags);
+
+	if (flags)
+		return H_PARAMETER;
+
+	mutex_lock(&kvm->lock);
+
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+		unsigned int prio;
+
+		if (!xc)
+			continue;
+
+		kvmppc_xive_disable_vcpu_interrupts(vcpu);
+
+		for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+
+			if (xc->esc_virq[prio]) {
+				free_irq(xc->esc_virq[prio], vcpu);
+				irq_dispose_mapping(xc->esc_virq[prio]);
+				kfree(xc->esc_virq_names[prio]);
+				xc->esc_virq[prio] = 0;
+			}
+
+			xive_native_cleanup_queue(vcpu, prio);
+		}
+	}
+
+	for (i = 0; i <= xive->max_sbid; i++) {
+		if (xive->src_blocks[i])
+			kvmppc_xive_reset_sources(xive->src_blocks[i]);
+	}
+
+	mutex_unlock(&kvm->lock);
+
+	return H_SUCCESS;
+}
+
+int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 req)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	int rc;
+
+	if (!xive || !vcpu->arch.xive_vcpu)
+		return H_FUNCTION;
+
+	switch (req) {
+	case H_INT_SET_QUEUE_CONFIG:
+		rc = kvmppc_h_int_set_queue_config(vcpu,
+						   kvmppc_get_gpr(vcpu, 4),
+						   kvmppc_get_gpr(vcpu, 5),
+						   kvmppc_get_gpr(vcpu, 6),
+						   kvmppc_get_gpr(vcpu, 7),
+						   kvmppc_get_gpr(vcpu, 8));
+		break;
+
+	case H_INT_SET_SOURCE_CONFIG:
+		rc = kvmppc_h_int_set_source_config(vcpu,
+						    kvmppc_get_gpr(vcpu, 4),
+						    kvmppc_get_gpr(vcpu, 5),
+						    kvmppc_get_gpr(vcpu, 6),
+						    kvmppc_get_gpr(vcpu, 7),
+						    kvmppc_get_gpr(vcpu, 8));
+		break;
+
+	case H_INT_RESET:
+		rc = kvmppc_h_int_reset(xive, kvmppc_get_gpr(vcpu, 4));
+		break;
+
+	default:
+		rc =  H_NOT_AVAILABLE;
+	}
+
+	return rc;
+}
+EXPORT_SYMBOL_GPL(kvmppc_xive_native_hcall);
+
 static int xive_native_debug_show(struct seq_file *m, void *private)
 {
 	struct kvmppc_xive *xive = m->private;
@@ -614,10 +920,26 @@ struct kvm_device_ops kvm_xive_native_ops = {
 
 void kvmppc_xive_native_init_module(void)
 {
-	;
+	__xive_vm_h_int_get_source_info = xive_vm_h_int_get_source_info;
+	__xive_vm_h_int_get_source_config = xive_vm_h_int_get_source_config;
+	__xive_vm_h_int_get_queue_info = xive_vm_h_int_get_queue_info;
+	__xive_vm_h_int_get_queue_config = xive_vm_h_int_get_queue_config;
+	__xive_vm_h_int_set_os_reporting_line =
+		xive_vm_h_int_set_os_reporting_line;
+	__xive_vm_h_int_get_os_reporting_line =
+		xive_vm_h_int_get_os_reporting_line;
+	__xive_vm_h_int_esb = xive_vm_h_int_esb;
+	__xive_vm_h_int_sync = xive_vm_h_int_sync;
 }
 
 void kvmppc_xive_native_exit_module(void)
 {
-	;
+	__xive_vm_h_int_get_source_info = NULL;
+	__xive_vm_h_int_get_source_config = NULL;
+	__xive_vm_h_int_get_queue_info = NULL;
+	__xive_vm_h_int_get_queue_config = NULL;
+	__xive_vm_h_int_set_os_reporting_line = NULL;
+	__xive_vm_h_int_get_os_reporting_line = NULL;
+	__xive_vm_h_int_esb = NULL;
+	__xive_vm_h_int_sync = NULL;
 }
diff --git a/arch/powerpc/kvm/book3s_xive_native_template.c b/arch/powerpc/kvm/book3s_xive_native_template.c
index e7260da4a596..ccde2786d203 100644
--- a/arch/powerpc/kvm/book3s_xive_native_template.c
+++ b/arch/powerpc/kvm/book3s_xive_native_template.c
@@ -8,6 +8,279 @@
 #define XGLUE(a, b) a##b
 #define GLUE(a, b) XGLUE(a, b)
 
+X_STATIC int GLUE(X_PFX, h_int_get_source_info)(struct kvm_vcpu *vcpu,
+						unsigned long flags,
+						unsigned long irq)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+	unsigned long esb_addr;
+
+	pr_devel("H_INT_GET_SOURCE_INFO flags=%08lx irq=%lx\n", flags, irq);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb) {
+		pr_debug("source %lx not found !\n", irq);
+		return H_P2;
+	}
+	state = &sb->irq_state[src];
+
+	arch_spin_lock(&sb->lock);
+	kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+	vcpu->arch.regs.gpr[4] = 0;
+	if (xd->flags & XIVE_IRQ_FLAG_STORE_EOI)
+		vcpu->arch.regs.gpr[4] |= XIVE_SPAPR_SRC_STORE_EOI;
+
+	/*
+	 * Force the use of the H_INT_ESB hcall in case of a Virtual
+	 * LSI interrupt. This is necessary under KVM to re-trigger
+	 * the interrupt if the level is still asserted
+	 */
+	if (state->lsi) {
+		vcpu->arch.regs.gpr[4] |= XIVE_SPAPR_SRC_LSI;
+		vcpu->arch.regs.gpr[4] |= XIVE_SPAPR_SRC_H_INT_ESB;
+	}
+
+	/*
+	 * Linux/KVM uses a two pages ESB setting, one for trigger and
+	 * one for EOI
+	 */
+	esb_addr = xive->vc_base + (irq << (PAGE_SHIFT + 1));
+
+	/* EOI/management page is the second/odd page */
+	if (xd->eoi_page &&
+	    !(vcpu->arch.regs.gpr[4] & XIVE_SPAPR_SRC_H_INT_ESB))
+		vcpu->arch.regs.gpr[5] = esb_addr + (1ull << PAGE_SHIFT);
+	else
+		vcpu->arch.regs.gpr[5] = -1;
+
+	/* Trigger page is always the first/even page */
+	if (xd->trig_page)
+		vcpu->arch.regs.gpr[6] = esb_addr;
+	else
+		vcpu->arch.regs.gpr[6] = -1;
+
+	vcpu->arch.regs.gpr[7] = PAGE_SHIFT;
+	arch_spin_unlock(&sb->lock);
+	return H_SUCCESS;
+}
+
+X_STATIC int GLUE(X_PFX, h_int_get_source_config)(struct kvm_vcpu *vcpu,
+						  unsigned long flags,
+						  unsigned long irq)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u16 src;
+
+	pr_devel("H_INT_GET_SOURCE_CONFIG flags=%08lx irq=%lx\n", flags, irq);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb) {
+		pr_debug("source %lx not found !\n", irq);
+		return H_P2;
+	}
+	state = &sb->irq_state[src];
+
+	arch_spin_lock(&sb->lock);
+	vcpu->arch.regs.gpr[4] = state->act_server;
+	vcpu->arch.regs.gpr[5] = state->act_priority;
+	vcpu->arch.regs.gpr[6] = state->number;
+	arch_spin_unlock(&sb->lock);
+
+	return H_SUCCESS;
+}
+
+X_STATIC int GLUE(X_PFX, h_int_get_queue_info)(struct kvm_vcpu *vcpu,
+					       unsigned long flags,
+					       unsigned long server,
+					       unsigned long priority)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	struct xive_q *q;
+
+	pr_devel("H_INT_GET_QUEUE_INFO flags=%08lx server=%ld priority=%ld\n",
+		 flags, server, priority);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	if (xc->server_num != server) {
+		struct kvm_vcpu *vc;
+
+		vc = kvmppc_xive_find_server(vcpu->kvm, server);
+		if (!vc) {
+			pr_debug("server %ld not found\n", server);
+			return H_P2;
+		}
+		xc = vc->arch.xive_vcpu;
+	}
+
+	if (priority != xive_prio_from_guest(priority) || priority == MASKED) {
+		pr_debug("invalid priority for queue %ld for VCPU %ld\n",
+		       priority, server);
+		return H_P3;
+	}
+	q = &xc->queues[priority];
+
+	vcpu->arch.regs.gpr[4] = q->eoi_phys;
+	/* TODO: Power of 2 page size of the notification page */
+	vcpu->arch.regs.gpr[5] = 0;
+	return H_SUCCESS;
+}
+
+X_STATIC int GLUE(X_PFX, get_queue_state)(struct kvm_vcpu *vcpu,
+					  struct kvmppc_xive_vcpu *xc,
+					  unsigned long prio)
+{
+	int rc;
+	u32 qtoggle;
+	u32 qindex;
+
+	rc = xive_native_get_queue_state(xc->vp_id, prio, &qtoggle, &qindex);
+	if (rc)
+		return rc;
+
+	vcpu->arch.regs.gpr[4] |= ((unsigned long) qtoggle) << 62;
+	vcpu->arch.regs.gpr[7] = qindex;
+	return 0;
+}
+
+X_STATIC int GLUE(X_PFX, h_int_get_queue_config)(struct kvm_vcpu *vcpu,
+						 unsigned long flags,
+						 unsigned long server,
+						 unsigned long priority)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	struct xive_q *q;
+	u64 qpage;
+	u64 qsize;
+	u64 qeoi_page;
+	u32 escalate_irq;
+	u64 qflags;
+	int rc;
+
+	pr_devel("H_INT_GET_QUEUE_CONFIG flags=%08lx server=%ld priority=%ld\n",
+		 flags, server, priority);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags & ~XIVE_SPAPR_EQ_DEBUG)
+		return H_PARAMETER;
+
+	if (xc->server_num != server) {
+		struct kvm_vcpu *vc;
+
+		vc = kvmppc_xive_find_server(vcpu->kvm, server);
+		if (!vc) {
+			pr_debug("server %ld not found\n", server);
+			return H_P2;
+		}
+		xc = vc->arch.xive_vcpu;
+	}
+
+	if (priority != xive_prio_from_guest(priority) || priority == MASKED) {
+		pr_debug("invalid priority for queue %ld for VCPU %ld\n",
+		       priority, server);
+		return H_P3;
+	}
+	q = &xc->queues[priority];
+
+	rc = xive_native_get_queue_info(xc->vp_id, priority, &qpage, &qsize,
+					&qeoi_page, &escalate_irq, &qflags);
+	if (rc)
+		return H_HARDWARE;
+
+	vcpu->arch.regs.gpr[4] = 0;
+	if (qflags & OPAL_XIVE_EQ_ALWAYS_NOTIFY)
+		vcpu->arch.regs.gpr[4] |= XIVE_SPAPR_EQ_ALWAYS_NOTIFY;
+
+	vcpu->arch.regs.gpr[5] = qpage;
+	vcpu->arch.regs.gpr[6] = qsize;
+	if (flags & XIVE_SPAPR_EQ_DEBUG) {
+		rc = GLUE(X_PFX, get_queue_state)(vcpu, xc, priority);
+		if (rc)
+			return H_HARDWARE;
+	}
+	return H_SUCCESS;
+}
+
+/* TODO H_INT_SET_OS_REPORTING_LINE */
+X_STATIC int GLUE(X_PFX, h_int_set_os_reporting_line)(struct kvm_vcpu *vcpu,
+						      unsigned long flags,
+						      unsigned long line)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+
+	pr_devel("H_INT_SET_OS_REPORTING_LINE flags=%08lx line=%ld\n",
+		 flags, line);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	return H_FUNCTION;
+}
+
+/* TODO H_INT_GET_OS_REPORTING_LINE*/
+X_STATIC int GLUE(X_PFX, h_int_get_os_reporting_line)(struct kvm_vcpu *vcpu,
+						      unsigned long flags,
+						      unsigned long server,
+						      unsigned long line)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+
+	pr_devel("H_INT_GET_OS_REPORTING_LINE flags=%08lx server=%ld line=%ld\n",
+		 flags, server, line);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	if (xc->server_num != server) {
+		struct kvm_vcpu *vc;
+
+		vc = kvmppc_xive_find_server(vcpu->kvm, server);
+		if (!vc) {
+			pr_debug("server %ld not found\n", server);
+			return H_P2;
+		}
+		xc = vc->arch.xive_vcpu;
+	}
+
+	return H_FUNCTION;
+
+}
+
 /*
  * TODO: introduce a common template file with the XIVE native layer
  * and the XICS-on-XIVE glue for the utility functions
@@ -25,3 +298,101 @@ static u8 GLUE(X_PFX, esb_load)(struct xive_irq_data *xd, u32 offset)
 #endif
 	return (u8)val;
 }
+
+static u8 GLUE(X_PFX, esb_store)(struct xive_irq_data *xd, u32 offset, u64 data)
+{
+	u64 val;
+
+	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
+		offset |= offset << 4;
+
+	val = __x_readq(__x_eoi_page(xd) + offset);
+#ifdef __LITTLE_ENDIAN__
+	val >>= 64-8;
+#endif
+	return (u8)val;
+}
+
+X_STATIC int GLUE(X_PFX, h_int_esb)(struct kvm_vcpu *vcpu, unsigned long flags,
+				    unsigned long irq, unsigned long offset,
+				    unsigned long data)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb) {
+		pr_debug("source %lx not found !\n", irq);
+		return H_P2;
+	}
+	state = &sb->irq_state[src];
+
+	if (offset > (1ull << PAGE_SHIFT))
+		return H_P3;
+
+	arch_spin_lock(&sb->lock);
+	kvmppc_xive_select_irq(state, &hw_num, &xd);
+
+	if (flags & XIVE_SPAPR_ESB_STORE) {
+		GLUE(X_PFX, esb_store)(xd, offset, data);
+		vcpu->arch.regs.gpr[4] = -1;
+	} else {
+		/* Virtual LSI EOI handling */
+		if (state->lsi && offset == XIVE_ESB_LOAD_EOI) {
+			GLUE(X_PFX, esb_load)(xd, XIVE_ESB_SET_PQ_00);
+			if (state->asserted && __x_trig_page(xd))
+				__x_writeq(0, __x_trig_page(xd));
+			vcpu->arch.regs.gpr[4] = 0;
+		} else {
+			vcpu->arch.regs.gpr[4] =
+				GLUE(X_PFX, esb_load)(xd, offset);
+		}
+	}
+	arch_spin_unlock(&sb->lock);
+
+	return H_SUCCESS;
+}
+
+X_STATIC int GLUE(X_PFX, h_int_sync)(struct kvm_vcpu *vcpu, unsigned long flags,
+				     unsigned long irq)
+{
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+
+	pr_devel("H_INT_SYNC flags=%08lx irq=%lx\n", flags, irq);
+
+	if (!xive)
+		return H_FUNCTION;
+
+	if (flags)
+		return H_PARAMETER;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb) {
+		pr_debug("source %lx not found !\n", irq);
+		return H_P2;
+	}
+	state = &sb->irq_state[src];
+
+	arch_spin_lock(&sb->lock);
+
+	kvmppc_xive_select_irq(state, &hw_num, &xd);
+	xive_native_sync_source(hw_num);
+
+	arch_spin_unlock(&sb->lock);
+	return H_SUCCESS;
+}
diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
index 806cbe488410..1a5c65c59b13 100644
--- a/arch/powerpc/kvm/Makefile
+++ b/arch/powerpc/kvm/Makefile
@@ -81,6 +81,8 @@ kvm-hv-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \
 
 kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XICS) := \
 	book3s_hv_rm_xics.o book3s_hv_rm_xive.o
+kvm-book3s_64-builtin-xics-objs-$(CONFIG_KVM_XIVE) += \
+	book3s_hv_rm_xive_native.o
 
 kvm-book3s_64-builtin-tm-objs-$(CONFIG_PPC_TRANSACTIONAL_MEM) += \
 	book3s_hv_tm_builtin.o
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 9b8d50a7cbaf..25b9489de249 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -2462,6 +2462,58 @@ hcall_real_table:
 	.long	0		/* 0x2fc - H_XIRR_X*/
 #endif
 	.long	DOTSYM(kvmppc_h_random) - hcall_real_table
+	.long	0		/* 0x304 */
+	.long	0		/* 0x308 */
+	.long	0		/* 0x30c */
+	.long	0		/* 0x310 */
+	.long	0		/* 0x314 */
+	.long	0		/* 0x318 */
+	.long	0		/* 0x31c */
+	.long	0		/* 0x320 */
+	.long	0		/* 0x324 */
+	.long	0		/* 0x328 */
+	.long	0		/* 0x32c */
+	.long	0		/* 0x330 */
+	.long	0		/* 0x334 */
+	.long	0		/* 0x338 */
+	.long	0		/* 0x33c */
+	.long	0		/* 0x340 */
+	.long	0		/* 0x344 */
+	.long	0		/* 0x348 */
+	.long	0		/* 0x34c */
+	.long	0		/* 0x350 */
+	.long	0		/* 0x354 */
+	.long	0		/* 0x358 */
+	.long	0		/* 0x35c */
+	.long	0		/* 0x360 */
+	.long	0		/* 0x364 */
+	.long	0		/* 0x368 */
+	.long	0		/* 0x36c */
+	.long	0		/* 0x370 */
+	.long	0		/* 0x374 */
+	.long	0		/* 0x378 */
+	.long	0		/* 0x37c */
+	.long	0		/* 0x380 */
+	.long	0		/* 0x384 */
+	.long	0		/* 0x388 */
+	.long	0		/* 0x38c */
+	.long	0		/* 0x390 */
+	.long	0		/* 0x394 */
+	.long	0		/* 0x398 */
+	.long	0		/* 0x39c */
+	.long	0		/* 0x3a0 */
+	.long	0		/* 0x3a4 */
+	.long	DOTSYM(kvmppc_rm_h_int_get_source_info) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_set_source_config) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_get_source_config) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_get_queue_info) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_set_queue_config) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_get_queue_config) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_set_os_reporting_line) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_get_os_reporting_line) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_esb) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_sync) - hcall_real_table
+	.long	DOTSYM(kvmppc_rm_h_int_reset) - hcall_real_table
 	.globl	hcall_real_table_end
 hcall_real_table_end:
 
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (10 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-02-04  5:15   ` David Gibson
  2019-01-07 18:43 ` [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration Cédric Le Goater
                   ` (5 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

The guest physical address of the event queue will be part of the
state to transfer in the migration. Cache its value when the queue is
configured, it will save us an OPAL call.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/xive.h       | 2 ++
 arch/powerpc/kvm/book3s_xive_native.c | 4 ++++
 2 files changed, 6 insertions(+)

diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
index 7a7aa22d8258..e90c3c5d9533 100644
--- a/arch/powerpc/include/asm/xive.h
+++ b/arch/powerpc/include/asm/xive.h
@@ -74,6 +74,8 @@ struct xive_q {
 	u32			esc_irq;
 	atomic_t		count;
 	atomic_t		pending_count;
+	u64			guest_qpage;
+	u32			guest_qsize;
 };
 
 /* Global enable flags for the XIVE support */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 35d806740c3a..4ca75aade069 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -708,6 +708,10 @@ static int kvmppc_h_int_set_queue_config(struct kvm_vcpu *vcpu,
 	}
 	qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK);
 
+	/* Backup queue page address and size for migration */
+	q->guest_qpage = qpage;
+	q->guest_qsize = qsize;
+
 	rc = xive_native_configure_queue(xc->vp_id, q, priority,
 					 (__be32 *) qaddr, qsize, true);
 	if (rc) {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (11 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-02-04  5:17   ` David Gibson
  2019-01-07 18:43 ` [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty Cédric Le Goater
                   ` (4 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

When migration of a VM is initiated, a first copy of the RAM is
transferred to the destination before the VM is stopped. At that time,
QEMU needs to perform a XIVE quiesce sequence to stop the flow of
event notifications and stabilize the EQs. The sources are masked and
the XIVE IC is synced with the KVM ioctl KVM_DEV_XIVE_GRP_SYNC.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h   |  1 +
 arch/powerpc/kvm/book3s_xive_native.c | 32 +++++++++++++++++++++++++++
 2 files changed, 33 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 6fc9660c5aec..f3b859223b80 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
 #define   KVM_DEV_XIVE_GET_TIMA_FD	2
 #define   KVM_DEV_XIVE_VC_BASE		3
 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 4ca75aade069..a8052867afc1 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -459,6 +459,35 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
 	return 0;
 }
 
+static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	struct xive_irq_data *xd;
+	u32 hw_num;
+	u16 src;
+
+	pr_devel("%s irq=0x%lx\n", __func__, irq);
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -ENOENT;
+
+	arch_spin_lock(&sb->lock);
+
+	kvmppc_xive_select_irq(state, &hw_num, &xd);
+	xive_native_sync_source(hw_num);
+	xive_native_sync_queue(hw_num);
+
+	arch_spin_unlock(&sb->lock);
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -474,6 +503,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 	case KVM_DEV_XIVE_GRP_SOURCES:
 		return kvmppc_xive_native_set_source(xive, attr->attr,
 						     attr->addr);
+	case KVM_DEV_XIVE_GRP_SYNC:
+		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
 	}
 	return -ENXIO;
 }
@@ -511,6 +542,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 		}
 		break;
 	case KVM_DEV_XIVE_GRP_SOURCES:
+	case KVM_DEV_XIVE_GRP_SYNC:
 		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
 		    attr->attr < KVMPPC_XIVE_NR_IRQS)
 			return 0;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (12 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-02-04  5:18   ` David Gibson
  2019-01-07 18:43 ` [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration Cédric Le Goater
                   ` (3 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

When the VM is stopped in a migration sequence, the sources are masked
and the XIVE IC is synced to stabilize the EQs. When done, the KVM
ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages.

The migration can then transfer the remaining dirty pages to the
destination and start collecting the state of the devices.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h   |  1 +
 arch/powerpc/kvm/book3s_xive_native.c | 40 +++++++++++++++++++++++++++
 2 files changed, 41 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index f3b859223b80..1a8740629acf 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char {
 #define   KVM_DEV_XIVE_GET_ESB_FD	1
 #define   KVM_DEV_XIVE_GET_TIMA_FD	2
 #define   KVM_DEV_XIVE_VC_BASE		3
+#define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
 #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
 
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index a8052867afc1..f2de1bcf3b35 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -373,6 +373,43 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
 	return put_user(ret, ubufp);
 }
 
+static int kvmppc_xive_native_vcpu_save_eq_pages(struct kvm_vcpu *vcpu)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	unsigned int prio;
+
+	if (!xc)
+		return -ENOENT;
+
+	for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
+		struct xive_q *q = &xc->queues[prio];
+
+		if (!q->qpage)
+			continue;
+
+		/* Mark EQ page dirty for migration */
+		mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage));
+	}
+	return 0;
+}
+
+static int kvmppc_xive_native_save_eq_pages(struct kvmppc_xive *xive)
+{
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	unsigned int i;
+
+	pr_devel("%s\n", __func__);
+
+	mutex_lock(&kvm->lock);
+	kvm_for_each_vcpu(i, vcpu, kvm) {
+		kvmppc_xive_native_vcpu_save_eq_pages(vcpu);
+	}
+	mutex_unlock(&kvm->lock);
+
+	return 0;
+}
+
 static int xive_native_validate_queue_size(u32 qsize)
 {
 	switch (qsize) {
@@ -498,6 +535,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 		switch (attr->attr) {
 		case KVM_DEV_XIVE_VC_BASE:
 			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
+		case KVM_DEV_XIVE_SAVE_EQ_PAGES:
+			return kvmppc_xive_native_save_eq_pages(xive);
 		}
 		break;
 	case KVM_DEV_XIVE_GRP_SOURCES:
@@ -538,6 +577,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 		case KVM_DEV_XIVE_GET_ESB_FD:
 		case KVM_DEV_XIVE_GET_TIMA_FD:
 		case KVM_DEV_XIVE_VC_BASE:
+		case KVM_DEV_XIVE_SAVE_EQ_PAGES:
 			return 0;
 		}
 		break;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (13 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-02-04  5:21   ` David Gibson
  2019-01-07 18:43 ` [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration Cédric Le Goater
                   ` (2 subsequent siblings)
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

Theses are use to capure the XIVE EAS table of the KVM device, the
configuration of the source targets.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
 arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
 2 files changed, 98 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 1a8740629acf..faf024f39858 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
 #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
 #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
+#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
 #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
 
+/* Layout of 64-bit eas attribute values */
+#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
+#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
+#define KVM_XIVE_EAS_SERVER_SHIFT	3
+#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
+#define KVM_XIVE_EAS_MASK_SHIFT		32
+#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
+#define KVM_XIVE_EAS_EISN_SHIFT		33
+#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index f2de1bcf3b35..0468b605baa7 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
 	return 0;
 }
 
+static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
+				      u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u16 src;
+	u64 kvm_eas;
+	u32 server;
+	u8 priority;
+	u32 eisn;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -EINVAL;
+
+	if (get_user(kvm_eas, ubufp))
+		return -EFAULT;
+
+	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
+
+	priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
+		KVM_XIVE_EAS_PRIORITY_SHIFT;
+	server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
+		KVM_XIVE_EAS_SERVER_SHIFT;
+	eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
+
+	if (priority != xive_prio_from_guest(priority)) {
+		pr_err("invalid priority for queue %d for VCPU %d\n",
+		       priority, server);
+		return -EINVAL;
+	}
+
+	return kvmppc_xive_native_set_source_config(xive, sb, state, server,
+						    priority, eisn);
+}
+
+static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
+				      u64 addr)
+{
+	struct kvmppc_xive_src_block *sb;
+	struct kvmppc_xive_irq_state *state;
+	u64 __user *ubufp = (u64 __user *) addr;
+	u16 src;
+	u64 kvm_eas;
+
+	sb = kvmppc_xive_find_source(xive, irq, &src);
+	if (!sb)
+		return -ENOENT;
+
+	state = &sb->irq_state[src];
+
+	if (!state->valid)
+		return -EINVAL;
+
+	arch_spin_lock(&sb->lock);
+
+	if (state->act_priority == MASKED)
+		kvm_eas = KVM_XIVE_EAS_MASK_MASK;
+	else {
+		kvm_eas = (state->act_priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
+			KVM_XIVE_EAS_PRIORITY_MASK;
+		kvm_eas |= (state->act_server << KVM_XIVE_EAS_SERVER_SHIFT) &
+			KVM_XIVE_EAS_SERVER_MASK;
+		kvm_eas |= ((u64) state->eisn << KVM_XIVE_EAS_EISN_SHIFT) &
+			KVM_XIVE_EAS_EISN_MASK;
+	}
+	arch_spin_unlock(&sb->lock);
+
+	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
+
+	if (put_user(kvm_eas, ubufp))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -544,6 +626,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 						     attr->addr);
 	case KVM_DEV_XIVE_GRP_SYNC:
 		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
+	case KVM_DEV_XIVE_GRP_EAS:
+		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
 	}
 	return -ENXIO;
 }
@@ -564,6 +648,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
 			return kvmppc_xive_native_get_vc_base(xive, attr->addr);
 		}
 		break;
+	case KVM_DEV_XIVE_GRP_EAS:
+		return kvmppc_xive_native_get_eas(xive, attr->attr, attr->addr);
 	}
 	return -ENXIO;
 }
@@ -583,6 +669,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 		break;
 	case KVM_DEV_XIVE_GRP_SOURCES:
 	case KVM_DEV_XIVE_GRP_SYNC:
+	case KVM_DEV_XIVE_GRP_EAS:
 		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
 		    attr->attr < KVMPPC_XIVE_NR_IRQS)
 			return 0;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (14 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration Cédric Le Goater
@ 2019-01-07 18:43 ` Cédric Le Goater
  2019-02-04  5:24   ` David Gibson
  2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
  2019-01-22  4:46 ` [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Paul Mackerras
  17 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 18:43 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

These are used to capture the XIVE END table of the KVM device. It
relies on an OPAL call to retrieve from the XIVE IC the EQ toggle bit
and index which are updated by the HW when events are enqueued in the
guest RAM.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/uapi/asm/kvm.h   |  21 ++++
 arch/powerpc/kvm/book3s_xive_native.c | 166 ++++++++++++++++++++++++++
 2 files changed, 187 insertions(+)

diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index faf024f39858..95302558ce10 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -684,6 +684,7 @@ struct kvm_ppc_cpu_char {
 #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
 #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
 #define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
+#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
 
 /* Layout of 64-bit XIVE source attribute values */
 #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
@@ -699,4 +700,24 @@ struct kvm_ppc_cpu_char {
 #define KVM_XIVE_EAS_EISN_SHIFT		33
 #define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
 
+/* Layout of 64-bit eq attribute */
+#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
+#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
+#define KVM_XIVE_EQ_SERVER_SHIFT	3
+#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
+
+/* Layout of 64-bit eq attribute values */
+struct kvm_ppc_xive_eq {
+	__u32 flags;
+	__u32 qsize;
+	__u64 qpage;
+	__u32 qtoggle;
+	__u32 qindex;
+};
+
+#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
+#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
+#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
+
+
 #endif /* __LINUX_KVM_POWERPC_H */
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 0468b605baa7..f4eb71eafc57 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -607,6 +607,164 @@ static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
 	return 0;
 }
 
+static int kvmppc_xive_native_set_queue(struct kvmppc_xive *xive, long eq_idx,
+				      u64 addr)
+{
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	struct kvmppc_xive_vcpu *xc;
+	void __user *ubufp = (u64 __user *) addr;
+	u32 server;
+	u8 priority;
+	struct kvm_ppc_xive_eq kvm_eq;
+	int rc;
+	__be32 *qaddr = 0;
+	struct page *page;
+	struct xive_q *q;
+
+	/*
+	 * Demangle priority/server tuple from the EQ index
+	 */
+	priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
+		KVM_XIVE_EQ_PRIORITY_SHIFT;
+	server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
+		KVM_XIVE_EQ_SERVER_SHIFT;
+
+	if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq)))
+		return -EFAULT;
+
+	vcpu = kvmppc_xive_find_server(kvm, server);
+	if (!vcpu) {
+		pr_err("Can't find server %d\n", server);
+		return -ENOENT;
+	}
+	xc = vcpu->arch.xive_vcpu;
+
+	if (priority != xive_prio_from_guest(priority)) {
+		pr_err("Trying to restore invalid queue %d for VCPU %d\n",
+		       priority, server);
+		return -EINVAL;
+	}
+	q = &xc->queues[priority];
+
+	pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
+		 __func__, server, priority, kvm_eq.flags,
+		 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
+
+	rc = xive_native_validate_queue_size(kvm_eq.qsize);
+	if (rc || !kvm_eq.qsize) {
+		pr_err("invalid queue size %d\n", kvm_eq.qsize);
+		return rc;
+	}
+
+	page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage));
+	if (is_error_page(page)) {
+		pr_warn("Couldn't get guest page for %llx!\n", kvm_eq.qpage);
+		return -ENOMEM;
+	}
+	qaddr = page_to_virt(page) + (kvm_eq.qpage & ~PAGE_MASK);
+
+	/* Backup queue page guest address for migration */
+	q->guest_qpage = kvm_eq.qpage;
+	q->guest_qsize = kvm_eq.qsize;
+
+	rc = xive_native_configure_queue(xc->vp_id, q, priority,
+					 (__be32 *) qaddr, kvm_eq.qsize, true);
+	if (rc) {
+		pr_err("Failed to configure queue %d for VCPU %d: %d\n",
+		       priority, xc->server_num, rc);
+		put_page(page);
+		return rc;
+	}
+
+	rc = xive_native_set_queue_state(xc->vp_id, priority, kvm_eq.qtoggle,
+					 kvm_eq.qindex);
+	if (rc)
+		goto error;
+
+	rc = kvmppc_xive_attach_escalation(vcpu, priority);
+error:
+	if (rc)
+		xive_native_cleanup_queue(vcpu, priority);
+	return rc;
+}
+
+static int kvmppc_xive_native_get_queue(struct kvmppc_xive *xive, long eq_idx,
+				      u64 addr)
+{
+	struct kvm *kvm = xive->kvm;
+	struct kvm_vcpu *vcpu;
+	struct kvmppc_xive_vcpu *xc;
+	struct xive_q *q;
+	void __user *ubufp = (u64 __user *) addr;
+	u32 server;
+	u8 priority;
+	struct kvm_ppc_xive_eq kvm_eq;
+	u64 qpage;
+	u64 qsize;
+	u64 qeoi_page;
+	u32 escalate_irq;
+	u64 qflags;
+	int rc;
+
+	/*
+	 * Demangle priority/server tuple from the EQ index
+	 */
+	priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
+		KVM_XIVE_EQ_PRIORITY_SHIFT;
+	server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
+		KVM_XIVE_EQ_SERVER_SHIFT;
+
+	vcpu = kvmppc_xive_find_server(kvm, server);
+	if (!vcpu) {
+		pr_err("Can't find server %d\n", server);
+		return -ENOENT;
+	}
+	xc = vcpu->arch.xive_vcpu;
+
+	if (priority != xive_prio_from_guest(priority)) {
+		pr_err("invalid priority for queue %d for VCPU %d\n",
+		       priority, server);
+		return -EINVAL;
+	}
+	q = &xc->queues[priority];
+
+	memset(&kvm_eq, 0, sizeof(kvm_eq));
+
+	if (!q->qpage)
+		return 0;
+
+	rc = xive_native_get_queue_info(xc->vp_id, priority, &qpage, &qsize,
+					&qeoi_page, &escalate_irq, &qflags);
+	if (rc)
+		return rc;
+
+	kvm_eq.flags = 0;
+	if (qflags & OPAL_XIVE_EQ_ENABLED)
+		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ENABLED;
+	if (qflags & OPAL_XIVE_EQ_ALWAYS_NOTIFY)
+		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
+	if (qflags & OPAL_XIVE_EQ_ESCALATE)
+		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ESCALATE;
+
+	kvm_eq.qsize = q->guest_qsize;
+	kvm_eq.qpage = q->guest_qpage;
+
+	rc = xive_native_get_queue_state(xc->vp_id, priority, &kvm_eq.qtoggle,
+					 &kvm_eq.qindex);
+	if (rc)
+		return rc;
+
+	pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
+		 __func__, server, priority, kvm_eq.flags,
+		 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
+
+	if (copy_to_user(ubufp, &kvm_eq, sizeof(kvm_eq)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 				       struct kvm_device_attr *attr)
 {
@@ -628,6 +786,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
 		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
 	case KVM_DEV_XIVE_GRP_EAS:
 		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
+	case KVM_DEV_XIVE_GRP_EQ:
+		return kvmppc_xive_native_set_queue(xive, attr->attr,
+						    attr->addr);
 	}
 	return -ENXIO;
 }
@@ -650,6 +811,9 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
 		break;
 	case KVM_DEV_XIVE_GRP_EAS:
 		return kvmppc_xive_native_get_eas(xive, attr->attr, attr->addr);
+	case KVM_DEV_XIVE_GRP_EQ:
+		return kvmppc_xive_native_get_queue(xive, attr->attr,
+						    attr->addr);
 	}
 	return -ENXIO;
 }
@@ -674,6 +838,8 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
 		    attr->attr < KVMPPC_XIVE_NR_IRQS)
 			return 0;
 		break;
+	case KVM_DEV_XIVE_GRP_EQ:
+		return 0;
 	}
 	return -ENXIO;
 }
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (15 preceding siblings ...)
  2019-01-07 18:43 ` [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration Cédric Le Goater
@ 2019-01-07 19:10 ` Cédric Le Goater
  2019-01-07 19:10   ` [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support Cédric Le Goater
                     ` (2 more replies)
  2019-01-22  4:46 ` [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Paul Mackerras
  17 siblings, 3 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 19:10 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

At a VCPU level, the state of the thread context interrupt management
registers needs to be collected. These registers are cached under the
'xive_saved_state.w01' field of the VCPU when the VPCU context is
pulled from the HW thread. An OPAL call retrieves the backup of the
IPB register in the NVT structure and merges it in the KVM state.

The structures of the interface between QEMU and KVM provisions some
extra room (two u64) for further extensions if more state needs to be
transferred back to QEMU.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/include/asm/kvm_ppc.h    |  5 ++
 arch/powerpc/include/uapi/asm/kvm.h   |  2 +
 arch/powerpc/kvm/book3s.c             | 24 +++++++++
 arch/powerpc/kvm/book3s_xive_native.c | 78 +++++++++++++++++++++++++++
 4 files changed, 109 insertions(+)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 4cc897039485..49c488af168c 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -270,6 +270,7 @@ union kvmppc_one_reg {
 		u64	addr;
 		u64	length;
 	}	vpaval;
+	u64	xive_timaval[4];
 };
 
 struct kvmppc_ops {
@@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
 extern void kvmppc_xive_native_init_module(void);
 extern void kvmppc_xive_native_exit_module(void);
 extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
+extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
+extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
 
 #else
 static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
@@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { }
 static inline void kvmppc_xive_native_exit_module(void) { }
 static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
 	{ return 0; }
+static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; }
+static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return -ENOENT; }
 
 #endif /* CONFIG_KVM_XIVE */
 
diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
index 95302558ce10..3c958c39a782 100644
--- a/arch/powerpc/include/uapi/asm/kvm.h
+++ b/arch/powerpc/include/uapi/asm/kvm.h
@@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
 #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
 #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
 
+#define KVM_REG_PPC_VP_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
+
 /* Device control API: PPC-specific devices */
 #define KVM_DEV_MPIC_GRP_MISC		1
 #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index de7eed191107..5ad658077a35 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
 				*val = get_reg_val(id, kvmppc_xics_get_icp(vcpu));
 			break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+		case KVM_REG_PPC_VP_STATE:
+			if (!vcpu->arch.xive_vcpu) {
+				r = -ENXIO;
+				break;
+			}
+			if (xive_enabled())
+				r = kvmppc_xive_native_get_vp(vcpu, val);
+			else
+				r = -ENXIO;
+			break;
+#endif /* CONFIG_KVM_XIVE */
 		case KVM_REG_PPC_FSCR:
 			*val = get_reg_val(id, vcpu->arch.fscr);
 			break;
@@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
 				r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val));
 			break;
 #endif /* CONFIG_KVM_XICS */
+#ifdef CONFIG_KVM_XIVE
+		case KVM_REG_PPC_VP_STATE:
+			if (!vcpu->arch.xive_vcpu) {
+				r = -ENXIO;
+				break;
+			}
+			if (xive_enabled())
+				r = kvmppc_xive_native_set_vp(vcpu, val);
+			else
+				r = -ENXIO;
+			break;
+#endif /* CONFIG_KVM_XIVE */
 		case KVM_REG_PPC_FSCR:
 			vcpu->arch.fscr = set_reg_val(id, *val);
 			break;
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index f4eb71eafc57..1aefb366df0b 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -424,6 +424,84 @@ static int xive_native_validate_queue_size(u32 qsize)
 	}
 }
 
+#define TM_IPB_SHIFT 40
+#define TM_IPB_MASK  (((u64) 0xFF) << TM_IPB_SHIFT)
+
+int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	u64 opal_state;
+	int rc;
+
+	if (!kvmppc_xive_enabled(vcpu))
+		return -EPERM;
+
+	if (!xc)
+		return -ENOENT;
+
+	/* Thread context registers. We only care about IPB and CPPR */
+	val->xive_timaval[0] = vcpu->arch.xive_saved_state.w01;
+
+	/*
+	 * Return the OS CAM line to print out the VP identifier in
+	 * the QEMU monitor. This is not restored.
+	 */
+	val->xive_timaval[1] = vcpu->arch.xive_cam_word;
+
+	/* Get the VP state from OPAL */
+	rc = xive_native_get_vp_state(xc->vp_id, &opal_state);
+	if (rc)
+		return rc;
+
+	/*
+	 * Capture the backup of IPB register in the NVT structure and
+	 * merge it in our KVM VP state.
+	 *
+	 * TODO: P10 support.
+	 */
+	val->xive_timaval[0] |= cpu_to_be64(opal_state & TM_IPB_MASK);
+
+	pr_devel("%s NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x opal=%016llx\n",
+		 __func__,
+		 vcpu->arch.xive_saved_state.nsr,
+		 vcpu->arch.xive_saved_state.cppr,
+		 vcpu->arch.xive_saved_state.ipb,
+		 vcpu->arch.xive_saved_state.pipr,
+		 vcpu->arch.xive_saved_state.w01,
+		 (u32) vcpu->arch.xive_cam_word, opal_state);
+
+	return 0;
+}
+
+int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
+{
+	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
+	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
+
+	pr_devel("%s w01=%016llx vp=%016llx\n", __func__,
+		 val->xive_timaval[0], val->xive_timaval[1]);
+
+	if (!kvmppc_xive_enabled(vcpu))
+		return -EPERM;
+
+	if (!xc || !xive)
+		return -ENOENT;
+
+	/* We can't update the state of a "pushed" VCPU	 */
+	if (WARN_ON(vcpu->arch.xive_pushed))
+		return -EIO;
+
+	/* Thread context registers. only restore IPB and CPPR ? */
+	vcpu->arch.xive_saved_state.w01 = val->xive_timaval[0];
+
+	/*
+	 * There is no need to restore the XIVE internal state (IPB
+	 * stored in the NVT) as the IPB register was merged in KVM VP
+	 * state.
+	 */
+	return 0;
+}
+
 static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
 					 u64 addr)
 {
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
@ 2019-01-07 19:10   ` Cédric Le Goater
  2019-01-22  5:26     ` Paul Mackerras
  2019-01-07 19:10   ` [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl Cédric Le Goater
  2019-02-04  5:26   ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state David Gibson
  2 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 19:10 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

Clear the ESB pages from the VMA of the IRQ being pass through to the
guest and let the fault handler repopulate the VMA when the ESB pages
are accessed for an EOI or for a trigger.

Storing the VMA under the KVM XIVE device is a little ugly.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 arch/powerpc/kvm/book3s_xive.h        |  8 +++++++
 arch/powerpc/kvm/book3s_xive.c        | 15 ++++++++++++++
 arch/powerpc/kvm/book3s_xive_native.c | 30 +++++++++++++++++++++++++++
 3 files changed, 53 insertions(+)

diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
index 31e598e62589..6e64d3496a2c 100644
--- a/arch/powerpc/kvm/book3s_xive.h
+++ b/arch/powerpc/kvm/book3s_xive.h
@@ -90,6 +90,11 @@ struct kvmppc_xive_src_block {
 	struct kvmppc_xive_irq_state irq_state[KVMPPC_XICS_IRQ_PER_ICS];
 };
 
+struct kvmppc_xive;
+
+struct kvmppc_xive_ops {
+	int (*reset_mapped)(struct kvm *kvm, unsigned long guest_irq);
+};
 
 struct kvmppc_xive {
 	struct kvm *kvm;
@@ -131,6 +136,9 @@ struct kvmppc_xive {
 
 	/* VC base address for ESBs */
 	u64     vc_base;
+
+	struct kvmppc_xive_ops *ops;
+	struct vm_area_struct *vma;
 };
 
 #define KVMPPC_XIVE_Q_COUNT	8
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index e9f05d9c9ad5..9b4751713554 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -946,6 +946,13 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
 	/* Turn the IPI hard off */
 	xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
 
+	/*
+	 * Reset ESB guest mapping. Needed when ESB pages are exposed
+	 * to the guest in XIVE native mode
+	 */
+	if (xive->ops && xive->ops->reset_mapped)
+		xive->ops->reset_mapped(kvm, guest_irq);
+
 	/* Grab info about irq */
 	state->pt_number = hw_irq;
 	state->pt_data = irq_data_get_irq_handler_data(host_data);
@@ -1031,6 +1038,14 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
 	state->pt_number = 0;
 	state->pt_data = NULL;
 
+	/*
+	 * Reset ESB guest mapping. Needed when ESB pages are exposed
+	 * to the guest in XIVE native mode
+	 */
+	if (xive->ops && xive->ops->reset_mapped) {
+		xive->ops->reset_mapped(kvm, guest_irq);
+	}
+
 	/* Reconfigure the IPI */
 	xive_native_configure_irq(state->ipi_number,
 				  xive_vp(xive, state->act_server),
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 1aefb366df0b..12edac29995e 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -240,6 +240,32 @@ static int kvmppc_xive_native_get_vc_base(struct kvmppc_xive *xive, u64 addr)
 	return 0;
 }
 
+static int kvmppc_xive_native_reset_mapped(struct kvm *kvm, unsigned long irq)
+{
+	struct kvmppc_xive *xive = kvm->arch.xive;
+	struct mm_struct *mm = kvm->mm;
+	struct vm_area_struct *vma = xive->vma;
+	unsigned long address;
+
+	if (irq >= KVMPPC_XIVE_NR_IRQS)
+		return -EINVAL;
+
+	pr_debug("clearing esb pages for girq 0x%lx\n", irq);
+
+	down_read(&mm->mmap_sem);
+	/* TODO: can we clear the PTEs without keeping a VMA pointer ? */
+	if (vma) {
+		address = vma->vm_start + irq * (2ull << PAGE_SHIFT);
+		zap_vma_ptes(vma, address, 2ull << PAGE_SHIFT);
+	}
+	up_read(&mm->mmap_sem);
+	return 0;
+}
+
+static struct kvmppc_xive_ops kvmppc_xive_native_ops =  {
+	.reset_mapped = kvmppc_xive_native_reset_mapped,
+};
+
 static int xive_native_esb_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
@@ -292,6 +318,8 @@ static const struct vm_operations_struct xive_native_esb_vmops = {
 
 static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
 {
+	struct kvmppc_xive *xive = vma->vm_file->private_data;
+
 	/* There are two ESB pages (trigger and EOI) per IRQ */
 	if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
 		return -EINVAL;
@@ -299,6 +327,7 @@ static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
 	vma->vm_flags |= VM_IO | VM_PFNMAP;
 	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
 	vma->vm_ops = &xive_native_esb_vmops;
+	xive->vma = vma; /* TODO: get rid of the VMA pointer */
 	return 0;
 }
 
@@ -992,6 +1021,7 @@ static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
 	xive->vc_base = XIVE_VC_BASE;
 
 	xive->single_escalation = xive_native_has_single_escalation();
+	xive->ops = &kvmppc_xive_native_ops;
 
 	if (ret)
 		kfree(xive);
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl
  2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
  2019-01-07 19:10   ` [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support Cédric Le Goater
@ 2019-01-07 19:10   ` Cédric Le Goater
  2019-01-22  5:42     ` Paul Mackerras
  2019-02-04  5:26   ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state David Gibson
  2 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-07 19:10 UTC (permalink / raw)
  To: kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

This will be used to destroy the KVM XICS or XIVE device when the
sPAPR machine is reseted. When the VM boots, the CAS negotiation
process will determine which interrupt mode to use and the appropriate
KVM device will then be created.

Signed-off-by: Cédric Le Goater <clg@kaod.org>
---
 include/linux/kvm_host.h              |  2 ++
 include/uapi/linux/kvm.h              |  2 ++
 arch/powerpc/kvm/book3s_xive.c        | 38 +++++++++++++++++++++++++-
 arch/powerpc/kvm/book3s_xive_native.c | 24 +++++++++++++++++
 virt/kvm/kvm_main.c                   | 39 +++++++++++++++++++++++++++
 5 files changed, 104 insertions(+), 1 deletion(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c38cc5eb7e73..259b6885dc74 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1218,6 +1218,8 @@ struct kvm_device_ops {
 	 */
 	void (*destroy)(struct kvm_device *dev);
 
+	int (*delete)(struct kvm_device *dev);
+
 	int (*set_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
 	int (*get_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
 	int (*has_attr)(struct kvm_device *dev, struct kvm_device_attr *attr);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 52bf74a1616e..b00cb4d986cf 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1331,6 +1331,8 @@ struct kvm_s390_ucas_mapping {
 #define KVM_GET_DEVICE_ATTR	  _IOW(KVMIO,  0xe2, struct kvm_device_attr)
 #define KVM_HAS_DEVICE_ATTR	  _IOW(KVMIO,  0xe3, struct kvm_device_attr)
 
+#define KVM_DELETE_DEVICE	  _IOWR(KVMIO,  0xf0, struct kvm_create_device)
+
 /*
  * ioctls for vcpu fds
  */
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index 9b4751713554..5449fb4c87f9 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -1109,11 +1109,19 @@ void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
 void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 {
 	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
-	struct kvmppc_xive *xive = xc->xive;
+	struct kvmppc_xive *xive;
 	int i;
 
+	if (!kvmppc_xics_enabled(vcpu))
+		return;
+
+	if (!xc)
+		return;
+
 	pr_devel("cleanup_vcpu(cpu=%d)\n", xc->server_num);
 
+	xive = xc->xive;
+
 	/* Ensure no interrupt is still routed to that VP */
 	xc->valid = false;
 	kvmppc_xive_disable_vcpu_interrupts(vcpu);
@@ -1150,6 +1158,10 @@ void kvmppc_xive_cleanup_vcpu(struct kvm_vcpu *vcpu)
 	}
 	/* Free the VP */
 	kfree(xc);
+
+	/* Cleanup the vcpu */
+	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
+	vcpu->arch.xive_vcpu = NULL;
 }
 
 int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
@@ -1861,6 +1873,29 @@ static void kvmppc_xive_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+static int kvmppc_xive_delete(struct kvm_device *dev)
+{
+	struct kvm *kvm = dev->kvm;
+	unsigned int i;
+	struct kvm_vcpu *vcpu;
+
+	if (!kvm->arch.xive)
+		return -EPERM;
+
+	/*
+	 * call kick_all_cpus_sync() to ensure that all CPUs have
+	 * executed any pending interrupts
+	 */
+	if (is_kvmppc_hv_enabled(kvm))
+		kick_all_cpus_sync();
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmppc_xive_cleanup_vcpu(vcpu);
+
+	kvmppc_xive_free(dev);
+	return 0;
+}
+
 static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
 {
 	struct kvmppc_xive *xive;
@@ -2035,6 +2070,7 @@ struct kvm_device_ops kvm_xive_ops = {
 	.create = kvmppc_xive_create,
 	.init = kvmppc_xive_init,
 	.destroy = kvmppc_xive_free,
+	.delete = kvmppc_xive_delete,
 	.set_attr = xive_set_attr,
 	.get_attr = xive_get_attr,
 	.has_attr = xive_has_attr,
diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
index 12edac29995e..7367962e670a 100644
--- a/arch/powerpc/kvm/book3s_xive_native.c
+++ b/arch/powerpc/kvm/book3s_xive_native.c
@@ -979,6 +979,29 @@ static void kvmppc_xive_native_free(struct kvm_device *dev)
 	kfree(dev);
 }
 
+static int kvmppc_xive_native_delete(struct kvm_device *dev)
+{
+	struct kvm *kvm = dev->kvm;
+	unsigned int i;
+	struct kvm_vcpu *vcpu;
+
+	if (!kvm->arch.xive)
+		return -EPERM;
+
+	/*
+	 * call kick_all_cpus_sync() to ensure that all CPUs have
+	 * executed any pending interrupts
+	 */
+	if (is_kvmppc_hv_enabled(kvm))
+		kick_all_cpus_sync();
+
+	kvm_for_each_vcpu(i, vcpu, kvm)
+		kvmppc_xive_native_cleanup_vcpu(vcpu);
+
+	kvmppc_xive_native_free(dev);
+	return 0;
+}
+
 /*
  * ESB MMIO address of chip 0
  */
@@ -1350,6 +1373,7 @@ struct kvm_device_ops kvm_xive_native_ops = {
 	.create = kvmppc_xive_native_create,
 	.init = kvmppc_xive_native_init,
 	.destroy = kvmppc_xive_native_free,
+	.delete = kvmppc_xive_native_delete,
 	.set_attr = kvmppc_xive_native_set_attr,
 	.get_attr = kvmppc_xive_native_get_attr,
 	.has_attr = kvmppc_xive_native_has_attr,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1f888a103f78..c93c35c43675 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3009,6 +3009,31 @@ static int kvm_ioctl_create_device(struct kvm *kvm,
 	return 0;
 }
 
+static int kvm_ioctl_delete_device(struct kvm *kvm,
+				   struct kvm_create_device *cd)
+{
+	struct fd f;
+	struct kvm_device *dev;
+	int ret;
+
+	f = fdget(cd->fd);
+	if (!f.file)
+		return -EBADF;
+
+	dev = kvm_device_from_filp(f.file);
+	fdput(f);
+
+	if (!dev)
+		return -EPERM;
+
+	mutex_lock(&kvm->lock);
+	list_del(&dev->vm_node);
+	mutex_unlock(&kvm->lock);
+	ret = dev->ops->delete(dev);
+
+	return ret;
+}
+
 static long kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 {
 	switch (arg) {
@@ -3253,6 +3278,20 @@ static long kvm_vm_ioctl(struct file *filp,
 		r = 0;
 		break;
 	}
+	case KVM_DELETE_DEVICE: {
+		struct kvm_create_device cd;
+
+		r = -EFAULT;
+		if (copy_from_user(&cd, argp, sizeof(cd)))
+			goto out;
+
+		r = kvm_ioctl_delete_device(kvm, &cd);
+		if (r)
+			goto out;
+
+		r = 0;
+		break;
+	}
 	case KVM_CHECK_EXTENSION:
 		r = kvm_vm_ioctl_check_extension_generic(kvm, arg);
 		break;
-- 
2.20.1


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls
  2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
@ 2019-01-09  3:33   ` David Gibson
  2019-01-09 13:08   ` Michael Ellerman
  1 sibling, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-01-09  3:33 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 5524 bytes --]

On Mon, Jan 07, 2019 at 07:43:13PM +0100, Cédric Le Goater wrote:
> These flags are shared between Linux/KVM implementing the hypervisor
> calls for the XIVE native exploitation mode and the driver for the
> sPAPR guests.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/xive.h  | 23 +++++++++++++++++++++++
>  arch/powerpc/sysdev/xive/spapr.c | 28 ++++++++--------------------
>  2 files changed, 31 insertions(+), 20 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 3c704f5dd3ae..32f033bfbf42 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -93,6 +93,29 @@ extern void xive_flush_interrupt(void);
>  /* xmon hook */
>  extern void xmon_xive_do_dump(int cpu);
>  
> +/*
> + * Hcall flags shared by the sPAPR backend and KVM
> + */
> +
> +/* H_INT_GET_SOURCE_INFO */
> +#define XIVE_SPAPR_SRC_H_INT_ESB	PPC_BIT(60)
> +#define XIVE_SPAPR_SRC_LSI		PPC_BIT(61)
> +#define XIVE_SPAPR_SRC_TRIGGER		PPC_BIT(62)
> +#define XIVE_SPAPR_SRC_STORE_EOI	PPC_BIT(63)
> +
> +/* H_INT_SET_SOURCE_CONFIG */
> +#define XIVE_SPAPR_SRC_SET_EISN		PPC_BIT(62)
> +#define XIVE_SPAPR_SRC_MASK		PPC_BIT(63) /* unused */
> +
> +/* H_INT_SET_QUEUE_CONFIG */
> +#define XIVE_SPAPR_EQ_ALWAYS_NOTIFY	PPC_BIT(63)
> +
> +/* H_INT_SET_QUEUE_CONFIG */
> +#define XIVE_SPAPR_EQ_DEBUG		PPC_BIT(63)
> +
> +/* H_INT_ESB */
> +#define XIVE_SPAPR_ESB_STORE		PPC_BIT(63)
> +
>  /* APIs used by KVM */
>  extern u32 xive_native_default_eq_shift(void);
>  extern u32 xive_native_alloc_vp_block(u32 max_vcpus);
> diff --git a/arch/powerpc/sysdev/xive/spapr.c b/arch/powerpc/sysdev/xive/spapr.c
> index 575db3b06a6b..730284f838c8 100644
> --- a/arch/powerpc/sysdev/xive/spapr.c
> +++ b/arch/powerpc/sysdev/xive/spapr.c
> @@ -184,9 +184,6 @@ static long plpar_int_get_source_info(unsigned long flags,
>  	return 0;
>  }
>  
> -#define XIVE_SRC_SET_EISN (1ull << (63 - 62))
> -#define XIVE_SRC_MASK     (1ull << (63 - 63)) /* unused */
> -
>  static long plpar_int_set_source_config(unsigned long flags,
>  					unsigned long lisn,
>  					unsigned long target,
> @@ -243,8 +240,6 @@ static long plpar_int_get_queue_info(unsigned long flags,
>  	return 0;
>  }
>  
> -#define XIVE_EQ_ALWAYS_NOTIFY (1ull << (63 - 63))
> -
>  static long plpar_int_set_queue_config(unsigned long flags,
>  				       unsigned long target,
>  				       unsigned long priority,
> @@ -286,8 +281,6 @@ static long plpar_int_sync(unsigned long flags, unsigned long lisn)
>  	return 0;
>  }
>  
> -#define XIVE_ESB_FLAG_STORE (1ull << (63 - 63))
> -
>  static long plpar_int_esb(unsigned long flags,
>  			  unsigned long lisn,
>  			  unsigned long offset,
> @@ -321,7 +314,7 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write)
>  	unsigned long read_data;
>  	long rc;
>  
> -	rc = plpar_int_esb(write ? XIVE_ESB_FLAG_STORE : 0,
> +	rc = plpar_int_esb(write ? XIVE_SPAPR_ESB_STORE : 0,
>  			   lisn, offset, data, &read_data);
>  	if (rc)
>  		return -1;
> @@ -329,11 +322,6 @@ static u64 xive_spapr_esb_rw(u32 lisn, u32 offset, u64 data, bool write)
>  	return write ? 0 : read_data;
>  }
>  
> -#define XIVE_SRC_H_INT_ESB     (1ull << (63 - 60))
> -#define XIVE_SRC_LSI           (1ull << (63 - 61))
> -#define XIVE_SRC_TRIGGER       (1ull << (63 - 62))
> -#define XIVE_SRC_STORE_EOI     (1ull << (63 - 63))
> -
>  static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
>  {
>  	long rc;
> @@ -349,11 +337,11 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
>  	if (rc)
>  		return  -EINVAL;
>  
> -	if (flags & XIVE_SRC_H_INT_ESB)
> +	if (flags & XIVE_SPAPR_SRC_H_INT_ESB)
>  		data->flags  |= XIVE_IRQ_FLAG_H_INT_ESB;
> -	if (flags & XIVE_SRC_STORE_EOI)
> +	if (flags & XIVE_SPAPR_SRC_STORE_EOI)
>  		data->flags  |= XIVE_IRQ_FLAG_STORE_EOI;
> -	if (flags & XIVE_SRC_LSI)
> +	if (flags & XIVE_SPAPR_SRC_LSI)
>  		data->flags  |= XIVE_IRQ_FLAG_LSI;
>  	data->eoi_page  = eoi_page;
>  	data->esb_shift = esb_shift;
> @@ -374,7 +362,7 @@ static int xive_spapr_populate_irq_data(u32 hw_irq, struct xive_irq_data *data)
>  	data->hw_irq = hw_irq;
>  
>  	/* Full function page supports trigger */
> -	if (flags & XIVE_SRC_TRIGGER) {
> +	if (flags & XIVE_SPAPR_SRC_TRIGGER) {
>  		data->trig_mmio = data->eoi_mmio;
>  		return 0;
>  	}
> @@ -391,8 +379,8 @@ static int xive_spapr_configure_irq(u32 hw_irq, u32 target, u8 prio, u32 sw_irq)
>  {
>  	long rc;
>  
> -	rc = plpar_int_set_source_config(XIVE_SRC_SET_EISN, hw_irq, target,
> -					 prio, sw_irq);
> +	rc = plpar_int_set_source_config(XIVE_SPAPR_SRC_SET_EISN, hw_irq,
> +					 target, prio, sw_irq);
>  
>  	return rc == 0 ? 0 : -ENXIO;
>  }
> @@ -432,7 +420,7 @@ static int xive_spapr_configure_queue(u32 target, struct xive_q *q, u8 prio,
>  	q->eoi_phys = esn_page;
>  
>  	/* Default is to always notify */
> -	flags = XIVE_EQ_ALWAYS_NOTIFY;
> +	flags = XIVE_SPAPR_EQ_ALWAYS_NOTIFY;
>  
>  	/* Configure and enable the queue in HW */
>  	rc = plpar_int_set_queue_config(flags, target, prio, qpage_phys, order);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support
  2019-01-07 18:43 ` [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support Cédric Le Goater
@ 2019-01-09  4:26   ` David Gibson
  0 siblings, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-01-09  4:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 8614 bytes --]

On Mon, Jan 07, 2019 at 07:43:14PM +0100, Cédric Le Goater wrote:
> The support for XIVE native exploitation mode in Linux/KVM needs a
> couple more OPAL calls to configure the sPAPR guest and to get/set the
> state of the XIVE internal structures.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/include/asm/opal-api.h           | 11 ++-
>  arch/powerpc/include/asm/opal.h               |  7 ++
>  arch/powerpc/include/asm/xive.h               | 14 +++
>  arch/powerpc/sysdev/xive/native.c             | 99 +++++++++++++++++++
>  .../powerpc/platforms/powernv/opal-wrappers.S |  3 +
>  5 files changed, 130 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/opal-api.h b/arch/powerpc/include/asm/opal-api.h
> index 870fb7b239ea..cdfc54f78101 100644
> --- a/arch/powerpc/include/asm/opal-api.h
> +++ b/arch/powerpc/include/asm/opal-api.h
> @@ -186,8 +186,8 @@
>  #define OPAL_XIVE_FREE_IRQ			140
>  #define OPAL_XIVE_SYNC				141
>  #define OPAL_XIVE_DUMP				142
> -#define OPAL_XIVE_RESERVED3			143
> -#define OPAL_XIVE_RESERVED4			144
> +#define OPAL_XIVE_GET_QUEUE_STATE		143
> +#define OPAL_XIVE_SET_QUEUE_STATE		144
>  #define OPAL_SIGNAL_SYSTEM_RESET		145
>  #define OPAL_NPU_INIT_CONTEXT			146
>  #define OPAL_NPU_DESTROY_CONTEXT		147
> @@ -209,8 +209,11 @@
>  #define OPAL_SENSOR_GROUP_ENABLE		163
>  #define OPAL_PCI_GET_PBCQ_TUNNEL_BAR		164
>  #define OPAL_PCI_SET_PBCQ_TUNNEL_BAR		165
> -#define	OPAL_NX_COPROC_INIT			167
> -#define OPAL_LAST				167
> +#define OPAL_HANDLE_HMI2			166
> +#define OPAL_NX_COPROC_INIT			167
> +#define OPAL_NPU_SET_RELAXED_ORDER		168
> +#define OPAL_NPU_GET_RELAXED_ORDER		169
> +#define OPAL_XIVE_GET_VP_STATE			170
>  
>  #define QUIESCE_HOLD			1 /* Spin all calls at entry */
>  #define QUIESCE_REJECT			2 /* Fail all calls with OPAL_BUSY */
> diff --git a/arch/powerpc/include/asm/opal.h b/arch/powerpc/include/asm/opal.h
> index a55b01c90bb1..4e978d4dea5c 100644
> --- a/arch/powerpc/include/asm/opal.h
> +++ b/arch/powerpc/include/asm/opal.h
> @@ -279,6 +279,13 @@ int64_t opal_xive_allocate_irq(uint32_t chip_id);
>  int64_t opal_xive_free_irq(uint32_t girq);
>  int64_t opal_xive_sync(uint32_t type, uint32_t id);
>  int64_t opal_xive_dump(uint32_t type, uint32_t id);
> +int64_t opal_xive_get_queue_state(uint64_t vp, uint32_t prio,
> +				  __be32 *out_qtoggle,
> +				  __be32 *out_qindex);
> +int64_t opal_xive_set_queue_state(uint64_t vp, uint32_t prio,
> +				  uint32_t qtoggle,
> +				  uint32_t qindex);
> +int64_t opal_xive_get_vp_state(uint64_t vp, __be64 *out_w01);
>  int64_t opal_pci_set_p2p(uint64_t phb_init, uint64_t phb_target,
>  			uint64_t desc, uint16_t pe_number);
>  
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 32f033bfbf42..d6be3e4d9fa4 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -132,12 +132,26 @@ extern int xive_native_configure_queue(u32 vp_id, struct xive_q *q, u8 prio,
>  extern void xive_native_disable_queue(u32 vp_id, struct xive_q *q, u8 prio);
>  
>  extern void xive_native_sync_source(u32 hw_irq);
> +extern void xive_native_sync_queue(u32 hw_irq);
>  extern bool is_xive_irq(struct irq_chip *chip);
>  extern int xive_native_enable_vp(u32 vp_id, bool single_escalation);
>  extern int xive_native_disable_vp(u32 vp_id);
>  extern int xive_native_get_vp_info(u32 vp_id, u32 *out_cam_id, u32 *out_chip_id);
>  extern bool xive_native_has_single_escalation(void);
>  
> +extern int xive_native_get_queue_info(u32 vp_id, uint32_t prio,
> +				      u64 *out_qpage,
> +				      u64 *out_qsize,
> +				      u64 *out_qeoi_page,
> +				      u32 *out_escalate_irq,
> +				      u64 *out_qflags);
> +
> +extern int xive_native_get_queue_state(u32 vp_id, uint32_t prio, u32 *qtoggle,
> +				       u32 *qindex);
> +extern int xive_native_set_queue_state(u32 vp_id, uint32_t prio, u32 qtoggle,
> +				       u32 qindex);
> +extern int xive_native_get_vp_state(u32 vp_id, u64 *out_state);
> +
>  #else
>  
>  static inline bool xive_enabled(void) { return false; }
> diff --git a/arch/powerpc/sysdev/xive/native.c b/arch/powerpc/sysdev/xive/native.c
> index 1ca127d052a6..0c037e933e55 100644
> --- a/arch/powerpc/sysdev/xive/native.c
> +++ b/arch/powerpc/sysdev/xive/native.c
> @@ -437,6 +437,12 @@ void xive_native_sync_source(u32 hw_irq)
>  }
>  EXPORT_SYMBOL_GPL(xive_native_sync_source);
>  
> +void xive_native_sync_queue(u32 hw_irq)
> +{
> +	opal_xive_sync(XIVE_SYNC_QUEUE, hw_irq);
> +}
> +EXPORT_SYMBOL_GPL(xive_native_sync_queue);
> +
>  static const struct xive_ops xive_native_ops = {
>  	.populate_irq_data	= xive_native_populate_irq_data,
>  	.configure_irq		= xive_native_configure_irq,
> @@ -711,3 +717,96 @@ bool xive_native_has_single_escalation(void)
>  	return xive_has_single_esc;
>  }
>  EXPORT_SYMBOL_GPL(xive_native_has_single_escalation);
> +
> +int xive_native_get_queue_info(u32 vp_id, u32 prio,
> +			       u64 *out_qpage,
> +			       u64 *out_qsize,
> +			       u64 *out_qeoi_page,
> +			       u32 *out_escalate_irq,
> +			       u64 *out_qflags)
> +{
> +	__be64 qpage;
> +	__be64 qsize;
> +	__be64 qeoi_page;
> +	__be32 escalate_irq;
> +	__be64 qflags;
> +	s64 rc;
> +
> +	rc = opal_xive_get_queue_info(vp_id, prio, &qpage, &qsize,
> +				      &qeoi_page, &escalate_irq, &qflags);
> +	if (rc) {
> +		pr_err("OPAL failed to get queue info for VCPU %d/%d : %lld\n",
> +		       vp_id, prio, rc);
> +		return -EIO;
> +	}
> +
> +	if (out_qpage)
> +		*out_qpage = be64_to_cpu(qpage);
> +	if (out_qsize)
> +		*out_qsize = be32_to_cpu(qsize);
> +	if (out_qeoi_page)
> +		*out_qeoi_page = be64_to_cpu(qeoi_page);
> +	if (out_escalate_irq)
> +		*out_escalate_irq = be32_to_cpu(escalate_irq);
> +	if (out_qflags)
> +		*out_qflags = be64_to_cpu(qflags);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(xive_native_get_queue_info);
> +
> +int xive_native_get_queue_state(u32 vp_id, u32 prio, u32 *qtoggle, u32 *qindex)
> +{
> +	__be32 opal_qtoggle;
> +	__be32 opal_qindex;
> +	s64 rc;
> +
> +	rc = opal_xive_get_queue_state(vp_id, prio, &opal_qtoggle,
> +				       &opal_qindex);
> +	if (rc) {
> +		pr_err("OPAL failed to get queue state for VCPU %d/%d : %lld\n",
> +		       vp_id, prio, rc);
> +		return -EIO;
> +	}
> +
> +	if (qtoggle)
> +		*qtoggle = be32_to_cpu(opal_qtoggle);
> +	if (qindex)
> +		*qindex = be32_to_cpu(opal_qindex);
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(xive_native_get_queue_state);
> +
> +int xive_native_set_queue_state(u32 vp_id, u32 prio, u32 qtoggle, u32 qindex)
> +{
> +	s64 rc;
> +
> +	rc = opal_xive_set_queue_state(vp_id, prio, qtoggle, qindex);
> +	if (rc) {
> +		pr_err("OPAL failed to set queue state for VCPU %d/%d : %lld\n",
> +		       vp_id, prio, rc);
> +		return -EIO;
> +	}
> +
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(xive_native_set_queue_state);
> +
> +int xive_native_get_vp_state(u32 vp_id, u64 *out_state)
> +{
> +	__be64 state;
> +	s64 rc;
> +
> +	rc = opal_xive_get_vp_state(vp_id, &state);
> +	if (rc) {
> +		pr_err("OPAL failed to get vp state for VCPU %d : %lld\n",
> +		       vp_id, rc);
> +		return -EIO;
> +	}
> +
> +	if (out_state)
> +		*out_state = be64_to_cpu(state);
> +	return 0;
> +}
> +EXPORT_SYMBOL_GPL(xive_native_get_vp_state);
> diff --git a/arch/powerpc/platforms/powernv/opal-wrappers.S b/arch/powerpc/platforms/powernv/opal-wrappers.S
> index f4875fe3f8ff..3179953d6b56 100644
> --- a/arch/powerpc/platforms/powernv/opal-wrappers.S
> +++ b/arch/powerpc/platforms/powernv/opal-wrappers.S
> @@ -309,6 +309,9 @@ OPAL_CALL(opal_xive_get_vp_info,		OPAL_XIVE_GET_VP_INFO);
>  OPAL_CALL(opal_xive_set_vp_info,		OPAL_XIVE_SET_VP_INFO);
>  OPAL_CALL(opal_xive_sync,			OPAL_XIVE_SYNC);
>  OPAL_CALL(opal_xive_dump,			OPAL_XIVE_DUMP);
> +OPAL_CALL(opal_xive_get_queue_state,		OPAL_XIVE_GET_QUEUE_STATE);
> +OPAL_CALL(opal_xive_set_queue_state,		OPAL_XIVE_SET_QUEUE_STATE);
> +OPAL_CALL(opal_xive_get_vp_state,		OPAL_XIVE_GET_VP_STATE);
>  OPAL_CALL(opal_signal_system_reset,		OPAL_SIGNAL_SYSTEM_RESET);
>  OPAL_CALL(opal_npu_init_context,		OPAL_NPU_INIT_CONTEXT);
>  OPAL_CALL(opal_npu_destroy_context,		OPAL_NPU_DESTROY_CONTEXT);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
  2019-01-07 18:43 ` [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type Cédric Le Goater
@ 2019-01-09  4:27   ` David Gibson
  2019-01-22  4:56   ` Paul Mackerras
  1 sibling, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-01-09  4:27 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1363 bytes --]

On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
> We will have different KVM devices for interrupts, one for the
> XICS-over-XIVE mode and one for the XIVE native exploitation
> mode. Let's add some checks to make sure we are not mixing the
> interfaces in KVM.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s_xive.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index f78d002f0fe0..8a4fa45f07f8 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>  
> +	if (!kvmppc_xics_enabled(vcpu))
> +		return -EPERM;
> +
>  	if (!xc)
>  		return 0;
>  
> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
>  	u8 cppr, mfrr;
>  	u32 xisr;
>  
> +	if (!kvmppc_xics_enabled(vcpu))
> +		return -EPERM;
> +
>  	if (!xc || !xive)
>  		return -ENOENT;
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls
  2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
  2019-01-09  3:33   ` David Gibson
@ 2019-01-09 13:08   ` Michael Ellerman
  2019-01-09 13:38     ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: Michael Ellerman @ 2019-01-09 13:08 UTC (permalink / raw)
  To: Cédric Le Goater, kvm-ppc
  Cc: kvm, Paul Mackerras, Cédric Le Goater, linuxppc-dev, David Gibson

Cédric Le Goater <clg@kaod.org> writes:

> These flags are shared between Linux/KVM implementing the hypervisor
> calls for the XIVE native exploitation mode and the driver for the
> sPAPR guests.
>
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/asm/xive.h  | 23 +++++++++++++++++++++++
>  arch/powerpc/sysdev/xive/spapr.c | 28 ++++++++--------------------
>  2 files changed, 31 insertions(+), 20 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 3c704f5dd3ae..32f033bfbf42 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -93,6 +93,29 @@ extern void xive_flush_interrupt(void);
>  /* xmon hook */
>  extern void xmon_xive_do_dump(int cpu);
>  
> +/*
> + * Hcall flags shared by the sPAPR backend and KVM
> + */
> +
> +/* H_INT_GET_SOURCE_INFO */
> +#define XIVE_SPAPR_SRC_H_INT_ESB	PPC_BIT(60)
> +#define XIVE_SPAPR_SRC_LSI		PPC_BIT(61)
> +#define XIVE_SPAPR_SRC_TRIGGER		PPC_BIT(62)
> +#define XIVE_SPAPR_SRC_STORE_EOI	PPC_BIT(63)

I have an (irrational) hatred of PPC_BIT, because it obfuscates what's
going on and makes PPC seem weirder than it needs to be. It could at
least be called IBM_BIT().

I know it helps people compare the code vs the documentation, but
basically no one has the documentation, and everyone has the code.

Anyway it's not a show stopper, just a pet-peeve of mine :)

cheers

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls
  2019-01-09 13:08   ` Michael Ellerman
@ 2019-01-09 13:38     ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-09 13:38 UTC (permalink / raw)
  To: Michael Ellerman, kvm-ppc; +Cc: Paul Mackerras, linuxppc-dev, kvm, David Gibson

On 1/9/19 2:08 PM, Michael Ellerman wrote:
> Cédric Le Goater <clg@kaod.org> writes:
> 
>> These flags are shared between Linux/KVM implementing the hypervisor
>> calls for the XIVE native exploitation mode and the driver for the
>> sPAPR guests.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/asm/xive.h  | 23 +++++++++++++++++++++++
>>  arch/powerpc/sysdev/xive/spapr.c | 28 ++++++++--------------------
>>  2 files changed, 31 insertions(+), 20 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
>> index 3c704f5dd3ae..32f033bfbf42 100644
>> --- a/arch/powerpc/include/asm/xive.h
>> +++ b/arch/powerpc/include/asm/xive.h
>> @@ -93,6 +93,29 @@ extern void xive_flush_interrupt(void);
>>  /* xmon hook */
>>  extern void xmon_xive_do_dump(int cpu);
>>  
>> +/*
>> + * Hcall flags shared by the sPAPR backend and KVM
>> + */
>> +
>> +/* H_INT_GET_SOURCE_INFO */
>> +#define XIVE_SPAPR_SRC_H_INT_ESB	PPC_BIT(60)
>> +#define XIVE_SPAPR_SRC_LSI		PPC_BIT(61)
>> +#define XIVE_SPAPR_SRC_TRIGGER		PPC_BIT(62)
>> +#define XIVE_SPAPR_SRC_STORE_EOI	PPC_BIT(63)
> 
> I have an (irrational) hatred of PPC_BIT, because it obfuscates what's
> going on and makes PPC seem weirder than it needs to be. It could at
> least be called IBM_BIT().
> 
> I know it helps people compare the code vs the documentation, but
> basically no one has the documentation, and everyone has the code.
> 
> Anyway it's not a show stopper, just a pet-peeve of mine :)

Only the define matters, I can change that back to the non-PPC_BIT
version in v2. Not a problem. 

Cheers,

C. 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device
  2019-01-07 18:43 ` [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device Cédric Le Goater
@ 2019-01-11  4:09   ` David Gibson
  0 siblings, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-01-11  4:09 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 8687 bytes --]

On Mon, Jan 07, 2019 at 07:43:16PM +0100, Cédric Le Goater wrote:
> The KVM device for the XIVE native exploitation mode will reuse the
> structures of the XICS-over-XIVE glue implementation. Some code will
> also be shared : source block creation and destruction, target
> selection and escalation attachment.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>

Reviewed-by: David Gibson <david@gibson.dropbear.id.au>

> ---
>  arch/powerpc/kvm/book3s_xive.h | 11 +++++
>  arch/powerpc/kvm/book3s_xive.c | 89 +++++++++++++++++++---------------
>  2 files changed, 62 insertions(+), 38 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index a08ae6fd4c51..10c4aa5cd010 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -248,5 +248,16 @@ extern int (*__xive_vm_h_ipi)(struct kvm_vcpu *vcpu, unsigned long server,
>  extern int (*__xive_vm_h_cppr)(struct kvm_vcpu *vcpu, unsigned long cppr);
>  extern int (*__xive_vm_h_eoi)(struct kvm_vcpu *vcpu, unsigned long xirr);
>  
> +/*
> + * Common Xive routines for XICS-over-XIVE and XIVE native
> + */
> +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
> +	struct kvmppc_xive *xive, int irq);
> +void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb);
> +int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio);
> +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu);
> +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio);
> +int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu);
> +
>  #endif /* CONFIG_KVM_XICS */
>  #endif /* _KVM_PPC_BOOK3S_XICS_H */
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index 8a4fa45f07f8..bb5d32f7e4e6 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -166,7 +166,7 @@ static irqreturn_t xive_esc_irq(int irq, void *data)
>  	return IRQ_HANDLED;
>  }
>  
> -static int xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
> +int kvmppc_xive_attach_escalation(struct kvm_vcpu *vcpu, u8 prio)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>  	struct xive_q *q = &xc->queues[prio];
> @@ -291,7 +291,7 @@ static int xive_check_provisioning(struct kvm *kvm, u8 prio)
>  			continue;
>  		rc = xive_provision_queue(vcpu, prio);
>  		if (rc == 0 && !xive->single_escalation)
> -			xive_attach_escalation(vcpu, prio);
> +			kvmppc_xive_attach_escalation(vcpu, prio);
>  		if (rc)
>  			return rc;
>  	}
> @@ -342,7 +342,7 @@ static int xive_try_pick_queue(struct kvm_vcpu *vcpu, u8 prio)
>  	return atomic_add_unless(&q->count, 1, max) ? 0 : -EBUSY;
>  }
>  
> -static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
> +int kvmppc_xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
>  {
>  	struct kvm_vcpu *vcpu;
>  	int i, rc;
> @@ -535,7 +535,7 @@ static int xive_target_interrupt(struct kvm *kvm,
>  	 * priority. The count for that new target will have
>  	 * already been incremented.
>  	 */
> -	rc = xive_select_target(kvm, &server, prio);
> +	rc = kvmppc_xive_select_target(kvm, &server, prio);
>  
>  	/*
>  	 * We failed to find a target ? Not much we can do
> @@ -1055,7 +1055,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
>  }
>  EXPORT_SYMBOL_GPL(kvmppc_xive_clr_mapped);
>  
> -static void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
> +void kvmppc_xive_disable_vcpu_interrupts(struct kvm_vcpu *vcpu)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>  	struct kvm *kvm = vcpu->kvm;
> @@ -1225,7 +1225,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  		if (xive->qmap & (1 << i)) {
>  			r = xive_provision_queue(vcpu, i);
>  			if (r == 0 && !xive->single_escalation)
> -				xive_attach_escalation(vcpu, i);
> +				kvmppc_xive_attach_escalation(vcpu, i);
>  			if (r)
>  				goto bail;
>  		} else {
> @@ -1240,7 +1240,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
>  	}
>  
>  	/* If not done above, attach priority 0 escalation */
> -	r = xive_attach_escalation(vcpu, 0);
> +	r = kvmppc_xive_attach_escalation(vcpu, 0);
>  	if (r)
>  		goto bail;
>  
> @@ -1491,8 +1491,8 @@ static int xive_get_source(struct kvmppc_xive *xive, long irq, u64 addr)
>  	return 0;
>  }
>  
> -static struct kvmppc_xive_src_block *xive_create_src_block(struct kvmppc_xive *xive,
> -							   int irq)
> +struct kvmppc_xive_src_block *kvmppc_xive_create_src_block(
> +	struct kvmppc_xive *xive, int irq)
>  {
>  	struct kvm *kvm = xive->kvm;
>  	struct kvmppc_xive_src_block *sb;
> @@ -1571,7 +1571,7 @@ static int xive_set_source(struct kvmppc_xive *xive, long irq, u64 addr)
>  	sb = kvmppc_xive_find_source(xive, irq, &idx);
>  	if (!sb) {
>  		pr_devel("No source, creating source block...\n");
> -		sb = xive_create_src_block(xive, irq);
> +		sb = kvmppc_xive_create_src_block(xive, irq);
>  		if (!sb) {
>  			pr_devel("Failed to create block...\n");
>  			return -ENOMEM;
> @@ -1795,7 +1795,7 @@ static void kvmppc_xive_cleanup_irq(u32 hw_num, struct xive_irq_data *xd)
>  	xive_cleanup_irq_data(xd);
>  }
>  
> -static void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
> +void kvmppc_xive_free_sources(struct kvmppc_xive_src_block *sb)
>  {
>  	int i;
>  
> @@ -1824,6 +1824,8 @@ static void kvmppc_xive_free(struct kvm_device *dev)
>  
>  	debugfs_remove(xive->dentry);
>  
> +	pr_devel("Destroying xive for partition\n");
> +
>  	if (kvm)
>  		kvm->arch.xive = NULL;
>  
> @@ -1889,6 +1891,43 @@ static int kvmppc_xive_create(struct kvm_device *dev, u32 type)
>  	return 0;
>  }
>  
> +int kvmppc_xive_debug_show_queues(struct seq_file *m, struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	unsigned int i;
> +
> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
> +		struct xive_q *q = &xc->queues[i];
> +		u32 i0, i1, idx;
> +
> +		if (!q->qpage && !xc->esc_virq[i])
> +			continue;
> +
> +		seq_printf(m, " [q%d]: ", i);
> +
> +		if (q->qpage) {
> +			idx = q->idx;
> +			i0 = be32_to_cpup(q->qpage + idx);
> +			idx = (idx + 1) & q->msk;
> +			i1 = be32_to_cpup(q->qpage + idx);
> +			seq_printf(m, "T=%d %08x %08x...\n", q->toggle,
> +				   i0, i1);
> +		}
> +		if (xc->esc_virq[i]) {
> +			struct irq_data *d = irq_get_irq_data(xc->esc_virq[i]);
> +			struct xive_irq_data *xd =
> +				irq_data_get_irq_handler_data(d);
> +			u64 pq = xive_vm_esb_load(xd, XIVE_ESB_GET);
> +
> +			seq_printf(m, "E:%c%c I(%d:%llx:%llx)",
> +				   (pq & XIVE_ESB_VAL_P) ? 'P' : 'p',
> +				   (pq & XIVE_ESB_VAL_Q) ? 'Q' : 'q',
> +				   xc->esc_virq[i], pq, xd->eoi_page);
> +			seq_puts(m, "\n");
> +		}
> +	}
> +	return 0;
> +}
>  
>  static int xive_debug_show(struct seq_file *m, void *private)
>  {
> @@ -1914,7 +1953,6 @@ static int xive_debug_show(struct seq_file *m, void *private)
>  
>  	kvm_for_each_vcpu(i, vcpu, kvm) {
>  		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> -		unsigned int i;
>  
>  		if (!xc)
>  			continue;
> @@ -1924,33 +1962,8 @@ static int xive_debug_show(struct seq_file *m, void *private)
>  			   xc->server_num, xc->cppr, xc->hw_cppr,
>  			   xc->mfrr, xc->pending,
>  			   xc->stat_rm_h_xirr, xc->stat_vm_h_xirr);
> -		for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
> -			struct xive_q *q = &xc->queues[i];
> -			u32 i0, i1, idx;
>  
> -			if (!q->qpage && !xc->esc_virq[i])
> -				continue;
> -
> -			seq_printf(m, " [q%d]: ", i);
> -
> -			if (q->qpage) {
> -				idx = q->idx;
> -				i0 = be32_to_cpup(q->qpage + idx);
> -				idx = (idx + 1) & q->msk;
> -				i1 = be32_to_cpup(q->qpage + idx);
> -				seq_printf(m, "T=%d %08x %08x... \n", q->toggle, i0, i1);
> -			}
> -			if (xc->esc_virq[i]) {
> -				struct irq_data *d = irq_get_irq_data(xc->esc_virq[i]);
> -				struct xive_irq_data *xd = irq_data_get_irq_handler_data(d);
> -				u64 pq = xive_vm_esb_load(xd, XIVE_ESB_GET);
> -				seq_printf(m, "E:%c%c I(%d:%llx:%llx)",
> -					   (pq & XIVE_ESB_VAL_P) ? 'P' : 'p',
> -					   (pq & XIVE_ESB_VAL_Q) ? 'Q' : 'q',
> -					   xc->esc_virq[i], pq, xd->eoi_page);
> -				seq_printf(m, "\n");
> -			}
> -		}
> +		kvmppc_xive_debug_show_queues(m, vcpu);
>  
>  		t_rm_h_xirr += xc->stat_rm_h_xirr;
>  		t_rm_h_ipoll += xc->stat_rm_h_ipoll;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
                   ` (16 preceding siblings ...)
  2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
@ 2019-01-22  4:46 ` Paul Mackerras
  2019-01-23 19:07   ` Cédric Le Goater
  17 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  4:46 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
> Hello,
> 
> On the POWER9 processor, the XIVE interrupt controller can control
> interrupt sources using MMIO to trigger events, to EOI or to turn off
> the sources. Priority management and interrupt acknowledgment is also
> controlled by MMIO in the CPU presenter subengine.
> 
> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
> special support from the hypervisor to do the same. This is called the
> XIVE native exploitation mode and today, it can be activated under the
> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
> and still offers the old interrupt mode interface using a
> XICS-over-XIVE glue which implements the XICS hcalls.
> 
> The following series is proposal to add the same support under KVM.
> 
> A new KVM device is introduced for the XIVE native exploitation
> mode. It reuses most of the XICS-over-XIVE glue implementation
> structures which are internal to KVM but has a completely different
> interface. A set of Hypervisor calls configures the sources and the
> event queues and from there, all control is done by the guest through
> MMIOs.
> 
> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
> similarly to VFIO, and the associated VMAs are populated dynamically
> with the appropriate pages using a fault handler. This is implemented
> with a couple of KVM device ioctls.
> 
> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
> negotiation process determines whether the guest operates with a
> interrupt controller using the XICS legacy model, as found on POWER8,
> or in XIVE exploitation mode. Which means that the KVM interrupt
> device should be created at runtime, after the machine as started.
> This requires extra KVM support to create/destroy KVM devices. The
> last patches are an attempt to solve that problem.
> 
> Migration has its own specific needs. The patchset provides the
> necessary routines to quiesce XIVE, to capture and restore the state
> of the different structures used by KVM, OPAL and HW. Extra OPAL
> support is required for these.

Thanks for the patchset.  It mostly looks good, but there are some
more things we need to consider, and I think a v2 will be needed.

One general comment I have is that there are a lot of acronyms in this
code and you mostly seem to assume that people will know what they all
mean.  It would make the code more readable if you provide the
expansion of the acronym on first use in a comment or whatever.  For
example, one of the patches in this series talks about the "EAS"
without ever expanding it in any comment or in the patch description,
and I have forgotten just at the moment what EAS stands for (I just
know that understanding the XIVE is not eas-y. :)

Another general comment is that you seem to have written all this
code assuming we are using HV KVM in a host running bare-metal.
However, we could be using PR KVM (either in a bare-metal host or in a
guest), or we could be doing nested HV KVM where we are using the
kvm_hv module inside a KVM guest and using special hypercalls for
controlling our guests.

It would be perfectly acceptable for now to say that we don't yet
support XIVE exploitation in those scenarios, as long as we then make
sure that the new KVM capability reports false in those scenarios, and
any attempt to use the XIVE exploitation interfaces fails cleanly.
I don't see that either of those is true in the patch set as it
stands, so that is one area that needs to be fixed.

A third general comment is that the new KVM interfaces you have added
need to be documented in the files under Documentation/virtual/kvm.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
  2019-01-07 18:43 ` [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type Cédric Le Goater
  2019-01-09  4:27   ` David Gibson
@ 2019-01-22  4:56   ` Paul Mackerras
  2019-01-23 16:24     ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  4:56 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
> We will have different KVM devices for interrupts, one for the
> XICS-over-XIVE mode and one for the XIVE native exploitation
> mode. Let's add some checks to make sure we are not mixing the
> interfaces in KVM.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/kvm/book3s_xive.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> index f78d002f0fe0..8a4fa45f07f8 100644
> --- a/arch/powerpc/kvm/book3s_xive.c
> +++ b/arch/powerpc/kvm/book3s_xive.c
> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>  
> +	if (!kvmppc_xics_enabled(vcpu))
> +		return -EPERM;
> +
>  	if (!xc)
>  		return 0;
>  
> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
>  	u8 cppr, mfrr;
>  	u32 xisr;
>  
> +	if (!kvmppc_xics_enabled(vcpu))
> +		return -EPERM;
> +
>  	if (!xc || !xive)
>  		return -ENOENT;

I can't see how these new checks could ever trigger in the code as it
stands.  Is there a way at present?  Do following patches ever add a
path where the new checks could trigger, or is this just an excess of
caution?  (Your patch description should ideally have answered these
questions for me.)

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-07 18:43 ` [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
@ 2019-01-22  5:05   ` Paul Mackerras
  2019-01-23 16:28     ` Cédric Le Goater
  2019-01-28 17:35     ` Cédric Le Goater
  2019-02-04  4:25   ` David Gibson
  1 sibling, 2 replies; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  5:05 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> This is the basic framework for the new KVM device supporting the XIVE
> native exploitation mode. The user interface exposes a new capability
> and a new KVM device to be used by QEMU.

[snip]
> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>  #ifdef CONFIG_KVM_XIVE
>  	if (xive_enabled()) {
>  		kvmppc_xive_init_module();
> +		kvmppc_xive_native_init_module();
>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> +		kvm_register_device_ops(&kvm_xive_native_ops,
> +					KVM_DEV_TYPE_XIVE);

I think we want tighter conditions on initializing the xive_native
stuff and creating the xive device class.  We could have
xive_enabled() returning true in a guest, and this code will get
called both by PR KVM and HV KVM (and HV KVM no longer implies that we
are running bare metal).

> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>  static void kvmppc_book3s_exit(void)
>  {
>  #ifdef CONFIG_KVM_XICS
> -	if (xive_enabled())
> +	if (xive_enabled()) {
>  		kvmppc_xive_exit_module();
> +		kvmppc_xive_native_exit_module();

Same comment here.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-01-07 18:43 ` [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device Cédric Le Goater
@ 2019-01-22  5:09   ` Paul Mackerras
  2019-01-23 16:48     ` Cédric Le Goater
  2019-02-04  4:45   ` David Gibson
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  5:09 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> This will let the guest create a memory mapping to expose the ESB MMIO
> regions used to control the interrupt sources, to trigger events, to
> EOI or to turn off the sources.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 8c876c166ef2..6bb61ba141c2 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>  
> +/* POWER9 XIVE Native Interrupt Controller */
> +#define KVM_DEV_XIVE_GRP_CTRL		1
> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 115143e76c45..e20081f0c8d4 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  	return rc;
>  }
>  
> +static int xive_native_esb_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct kvmppc_xive *xive = vma->vm_file->private_data;
> +	struct kvmppc_xive_src_block *sb;
> +	struct kvmppc_xive_irq_state *state;
> +	struct xive_irq_data *xd;
> +	u32 hw_num;
> +	u16 src;
> +	u64 page;
> +	unsigned long irq;
> +
> +	/*
> +	 * Linux/KVM uses a two pages ESB setting, one for trigger and
> +	 * one for EOI
> +	 */
> +	irq = vmf->pgoff / 2;
> +
> +	sb = kvmppc_xive_find_source(xive, irq, &src);
> +	if (!sb) {
> +		pr_err("%s: source %lx not found !\n", __func__, irq);

In general it's a bad idea to have a printk that userspace can trigger
at will without any rate-limiting.  Is there a real reason why this
printk is needed (and can't be pr_devel)?

> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	state = &sb->irq_state[src];
> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
> +
> +	arch_spin_lock(&sb->lock);
> +
> +	/*
> +	 * first/even page is for trigger
> +	 * second/odd page is for EOI and management.
> +	 */
> +	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
> +	arch_spin_unlock(&sb->lock);
> +
> +	if (!page) {
> +		pr_err("%s: acessing invalid ESB page for source %lx !\n",
> +		       __func__, irq);

Does this represent a exceptional condition that userspace can't just
trigger at will (i.e. it implies the presence of a kernel bug)?  If
not then the same comment as above applies.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  2019-01-07 18:43 ` [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the " Cédric Le Goater
@ 2019-01-22  5:14   ` Paul Mackerras
  2019-01-23 16:56     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  5:14 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 07:43:20PM +0100, Cédric Le Goater wrote:
> The ESB MMIO region controls the interrupt sources of the guest. QEMU
> will query an fd (GET_ESB_FD ioctl) and map this region at a specific
> address for the guest to use. The guest will obtain this information
> using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
> setting used by QEMU, add a VC_BASE control to the KVM XIVE device

This needs a little more explanation.  I *think* the only way this
gets used is that it gets returned to the guest by the new
hypercalls.  If that is indeed the case it would be useful to mention
that in the patch description, because otherwise taking a value that
userspace provides and which looks like it is an address, and not
doing any validation on it, looks a bit scary.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-07 18:43 ` [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls Cédric Le Goater
@ 2019-01-22  5:23   ` Paul Mackerras
  2019-01-23  6:44     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  5:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 07:43:23PM +0100, Cédric Le Goater wrote:
> The XIVE native exploitation mode specs define a set of Hypervisor
> calls to configure the sources and the event queues :
> 
>  - H_INT_GET_SOURCE_INFO
> 
>    used to obtain the address of the MMIO page of the Event State
>    Buffer (PQ bits) entry associated with the source.
> 
>  - H_INT_SET_SOURCE_CONFIG
> 
>    assigns a source to a "target".
> 
>  - H_INT_GET_SOURCE_CONFIG
> 
>    determines which "target" and "priority" is assigned to a source
> 
>  - H_INT_GET_QUEUE_INFO
> 
>    returns the address of the notification management page associated
>    with the specified "target" and "priority".
> 
>  - H_INT_SET_QUEUE_CONFIG
> 
>    sets or resets the event queue for a given "target" and "priority".
>    It is also used to set the notification configuration associated
>    with the queue, only unconditional notification is supported for
>    the moment. Reset is performed with a queue size of 0 and queueing
>    is disabled in that case.
> 
>  - H_INT_GET_QUEUE_CONFIG
> 
>    returns the queue settings for a given "target" and "priority".
> 
>  - H_INT_RESET
> 
>    resets all of the guest's internal interrupt structures to their
>    initial state, losing all configuration set via the hcalls
>    H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG.
> 
>  - H_INT_SYNC
> 
>    issue a synchronisation on a source to make sure all notifications
>    have reached their queue.

Which ones of these could be implemented in QEMU?  Are there any that
can't possibly be implemented in QEMU because they need to do things
that require calling internal interfaces that userspace doesn't have
access to?

How often do we expect each of these hypercalls to be called?

[snip]

> @@ -682,6 +685,46 @@ int kvmppc_rm_h_cppr(struct kvm_vcpu *vcpu, unsigned long cppr);
>  int kvmppc_rm_h_eoi(struct kvm_vcpu *vcpu, unsigned long xirr);
>  void kvmppc_guest_entry_inject_int(struct kvm_vcpu *vcpu);
>  
> +int kvmppc_rm_h_int_get_source_info(struct kvm_vcpu *vcpu,
> +				    unsigned long flag,
> +				    unsigned long lisn);
> +int kvmppc_rm_h_int_set_source_config(struct kvm_vcpu *vcpu,
> +				      unsigned long flag,
> +				      unsigned long lisn,
> +				      unsigned long target,
> +				      unsigned long priority,
> +				      unsigned long eisn);
> +int kvmppc_rm_h_int_get_source_config(struct kvm_vcpu *vcpu,
> +				      unsigned long flag,
> +				      unsigned long lisn);
> +int kvmppc_rm_h_int_get_queue_info(struct kvm_vcpu *vcpu,
> +				   unsigned long flag,
> +				   unsigned long target,
> +				   unsigned long priority);
> +int kvmppc_rm_h_int_set_queue_config(struct kvm_vcpu *vcpu,
> +				     unsigned long flag,
> +				     unsigned long target,
> +				     unsigned long priority,
> +				     unsigned long qpage,
> +				     unsigned long qsize);
> +int kvmppc_rm_h_int_get_queue_config(struct kvm_vcpu *vcpu,
> +				     unsigned long flag,
> +				     unsigned long target,
> +				     unsigned long priority);
> +int kvmppc_rm_h_int_set_os_reporting_line(struct kvm_vcpu *vcpu,
> +					  unsigned long flag,
> +					  unsigned long reportingline);
> +int kvmppc_rm_h_int_get_os_reporting_line(struct kvm_vcpu *vcpu,
> +					  unsigned long flag,
> +					  unsigned long target,
> +					  unsigned long reportingline);
> +int kvmppc_rm_h_int_esb(struct kvm_vcpu *vcpu, unsigned long flag,
> +			unsigned long lisn, unsigned long offset,
> +			unsigned long data);
> +int kvmppc_rm_h_int_sync(struct kvm_vcpu *vcpu, unsigned long flag,
> +			 unsigned long lisn);
> +int kvmppc_rm_h_int_reset(struct kvm_vcpu *vcpu, unsigned long flag);

Why do we need to provide real-mode versions of these hypercall
handlers?  I thought these hypercalls would only get called
infrequently, and in any case certainly much less frequently than once
per interrupt delivered.  If they are infrequent, then let's leave out
the real-mode version and just handle them in book3s_hv.c.

> @@ -5153,6 +5169,19 @@ static unsigned int default_hcall_list[] = {
>  	H_IPOLL,
>  	H_XIRR,
>  	H_XIRR_X,
> +#endif
> +#ifdef CONFIG_KVM_XIVE
> +	H_INT_GET_SOURCE_INFO,
> +	H_INT_SET_SOURCE_CONFIG,
> +	H_INT_GET_SOURCE_CONFIG,
> +	H_INT_GET_QUEUE_INFO,
> +	H_INT_SET_QUEUE_CONFIG,
> +	H_INT_GET_QUEUE_CONFIG,
> +	H_INT_SET_OS_REPORTING_LINE,
> +	H_INT_GET_OS_REPORTING_LINE,
> +	H_INT_ESB,
> +	H_INT_SYNC,
> +	H_INT_RESET,
>  #endif

The policy is not to add new hcalls to default_hcall_list[].  Is there
a strong reason for adding them here?

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-07 19:10   ` [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support Cédric Le Goater
@ 2019-01-22  5:26     ` Paul Mackerras
  2019-01-23  6:45       ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  5:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> Clear the ESB pages from the VMA of the IRQ being pass through to the
> guest and let the fault handler repopulate the VMA when the ESB pages
> are accessed for an EOI or for a trigger.

Why do we want to do this?

I don't see any possible advantage to removing the PTEs from the
userspace mapping.  You'll need to explain further.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl
  2019-01-07 19:10   ` [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl Cédric Le Goater
@ 2019-01-22  5:42     ` Paul Mackerras
  2019-01-23 18:39       ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-22  5:42 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 07, 2019 at 08:10:06PM +0100, Cédric Le Goater wrote:
> This will be used to destroy the KVM XICS or XIVE device when the
> sPAPR machine is reseted. When the VM boots, the CAS negotiation
> process will determine which interrupt mode to use and the appropriate
> KVM device will then be created.

What would be the consequence if we didn't destroy the device?

The reason I ask is that we will have to be much more careful about
memory allocation lifetimes with this patch.  Having KVM devices last
until the KVM instance is destroyed means that we generally avoid
use-after-free bugs.  With this patch we will have to do a careful
analysis of the lifetime of the xive structures vs. possible accesses
on other threads to prove there are no use-after-free bugs.

For example, it is not sufficient to set any pointers in struct kvm or
struct kvm_vcpu that point into xive structures to NULL before freeing
the structures.  There could be code on another CPU that has read the
pointer value before you set it to NULL and then goes and accesses it
after you have freed it.  You need to prove that can't happen,
possibly using some sort of explicit synchronization that ensures that
no other CPU could still be accessing the structure at the time when
you free it.  RCU can help with this, but in general means you need
RCU synchronization primitives (rcu_read_lock() etc.) at all the
places where you use the pointer, which I don't think you currently
have.

If there is a good fundamental reason why this can't happen, even
though you don't have explicit synchronization, then at a minimum you
need to explain that in the patch description, and ideally also in
code comments.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-22  5:23   ` Paul Mackerras
@ 2019-01-23  6:44     ` Benjamin Herrenschmidt
  2019-01-23  8:48       ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-01-23  6:44 UTC (permalink / raw)
  To: Paul Mackerras, Cédric Le Goater
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
> 
> Which ones of these could be implemented in QEMU?  Are there any that
> can't possibly be implemented in QEMU because they need to do things
> that require calling internal interfaces that userspace doesn't have
> access to?

Implementing them in qemu doesn't make a lot of sense. Qemu doesn't
have access to most of the XIVE HW state. There's a XIVE model for full
emulation but when using the real thing, almost none of it is visible.
A lot of those hcalls effectively turn into OPAL calls.

> How often do we expect each of these hypercalls to be called?

It depends, I can't tell for AIX. For Linux, not often with one
exception: H_INT_ESB which will be used whenever you EOI an emulated
LSI.

 .../...

> Why do we need to provide real-mode versions of these hypercall
> handlers?  I thought these hypercalls would only get called
> infrequently, and in any case certainly much less frequently than once
> per interrupt delivered.  If they are infrequent, then let's leave out
> the real-mode version and just handle them in book3s_hv.c.

Agreed with the exception maybe of H_INT_ESB

> > @@ -5153,6 +5169,19 @@ static unsigned int default_hcall_list[] = {
> >        H_IPOLL,
> >        H_XIRR,
> >        H_XIRR_X,
> > +#endif
> > +#ifdef CONFIG_KVM_XIVE
> > +     H_INT_GET_SOURCE_INFO,
> > +     H_INT_SET_SOURCE_CONFIG,
> > +     H_INT_GET_SOURCE_CONFIG,
> > +     H_INT_GET_QUEUE_INFO,
> > +     H_INT_SET_QUEUE_CONFIG,
> > +     H_INT_GET_QUEUE_CONFIG,
> > +     H_INT_SET_OS_REPORTING_LINE,
> > +     H_INT_GET_OS_REPORTING_LINE,
> > +     H_INT_ESB,
> > +     H_INT_SYNC,
> > +     H_INT_RESET,
> >   #endif
> 
> The policy is not to add new hcalls to default_hcall_list[].  Is there
> a strong reason for adding them here?
> 
> Paul.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-22  5:26     ` Paul Mackerras
@ 2019-01-23  6:45       ` Benjamin Herrenschmidt
  2019-01-23 10:30         ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-01-23  6:45 UTC (permalink / raw)
  To: Paul Mackerras, Cédric Le Goater
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> > Clear the ESB pages from the VMA of the IRQ being pass through to the
> > guest and let the fault handler repopulate the VMA when the ESB pages
> > are accessed for an EOI or for a trigger.
> 
> Why do we want to do this?
> 
> I don't see any possible advantage to removing the PTEs from the
> userspace mapping.  You'll need to explain further.

Afaik bcs we change the mapping to point to the real HW irq ESB page
instead of the "IPI" that was there at VM init time.

Cedric, is that correct ?

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-23  6:44     ` Benjamin Herrenschmidt
@ 2019-01-23  8:48       ` Cédric Le Goater
  2019-01-23 10:26         ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23  8:48 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On 1/23/19 7:44 AM, Benjamin Herrenschmidt wrote:
> On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
>>
>> Which ones of these could be implemented in QEMU?  Are there any that
>> can't possibly be implemented in QEMU because they need to do things
>> that require calling internal interfaces that userspace doesn't have
>> access to?
> 
> Implementing them in qemu doesn't make a lot of sense. Qemu doesn't
> have access to most of the XIVE HW state. There's a XIVE model for full
> emulation but when using the real thing, almost none of it is visible.
> A lot of those hcalls effectively turn into OPAL calls.
> 
>> How often do we expect each of these hypercalls to be called?
> 
> It depends, I can't tell for AIX. For Linux, not often with one
> exception: H_INT_ESB which will be used whenever you EOI an emulated
> LSI.

yes. This one is only doing loads and stores.

>  .../...
> 
>> Why do we need to provide real-mode versions of these hypercall
>> handlers?  I thought these hypercalls would only get called
>> infrequently, and in any case certainly much less frequently than once
>> per interrupt delivered.  If they are infrequent, then let's leave out
>> the real-mode version and just handle them in book3s_hv.c.
> 
> Agreed with the exception maybe of H_INT_ESB

ok. 

Some of these hcalls are really simple and only getting local info from 
the host (h_int_get_*). I thought handling the hcall ASAP was a preferred 
practice, even if the hcall is not called frequently. Isn't it ?

 
>>> @@ -5153,6 +5169,19 @@ static unsigned int default_hcall_list[] = {
>>>        H_IPOLL,
>>>        H_XIRR,
>>>        H_XIRR_X,
>>> +#endif
>>> +#ifdef CONFIG_KVM_XIVE
>>> +     H_INT_GET_SOURCE_INFO,
>>> +     H_INT_SET_SOURCE_CONFIG,
>>> +     H_INT_GET_SOURCE_CONFIG,
>>> +     H_INT_GET_QUEUE_INFO,
>>> +     H_INT_SET_QUEUE_CONFIG,
>>> +     H_INT_GET_QUEUE_CONFIG,
>>> +     H_INT_SET_OS_REPORTING_LINE,
>>> +     H_INT_GET_OS_REPORTING_LINE,
>>> +     H_INT_ESB,
>>> +     H_INT_SYNC,
>>> +     H_INT_RESET,
>>>   #endif
>>
>> The policy is not to add new hcalls to default_hcall_list[].  Is there
>> a strong reason for adding them here?

I don't remember. I will check for v2.

Thanks, 

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-23  8:48       ` Cédric Le Goater
@ 2019-01-23 10:26         ` Paul Mackerras
  2019-01-23 10:48           ` Cédric Le Goater
  2019-01-23 21:23           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 135+ messages in thread
From: Paul Mackerras @ 2019-01-23 10:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Wed, Jan 23, 2019 at 09:48:31AM +0100, Cédric Le Goater wrote:
> On 1/23/19 7:44 AM, Benjamin Herrenschmidt wrote:
> > On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
> >> Why do we need to provide real-mode versions of these hypercall
> >> handlers?  I thought these hypercalls would only get called
> >> infrequently, and in any case certainly much less frequently than once
> >> per interrupt delivered.  If they are infrequent, then let's leave out
> >> the real-mode version and just handle them in book3s_hv.c.
> > 
> > Agreed with the exception maybe of H_INT_ESB
> 
> ok. 
> 
> Some of these hcalls are really simple and only getting local info from 
> the host (h_int_get_*). I thought handling the hcall ASAP was a preferred 
> practice, even if the hcall is not called frequently. Isn't it ?

If we are going to handle a given hcall in the kernel at all, then we
have to have a virtual mode handler.  If we have a real-mode handler
as well then we in general incur a certain amount of code duplication
with consequent maintenance costs and possibility of bugs.  So we
generally only have real-mode handlers for the hcalls where it is
critical to minimize the latency.  From what Ben is saying that would
only be H_INT_ESB, and maybe not even that.

If H_INT_ESB is only used for LSIs, then is a guest going to be using
it at all?  My understanding was that with XIVE, only a small number
of interrupts that are to do with system management functions are
LSIs; all of the interrupts relating to PCI-e devices are MSIs.  So do
we actually have a real high-frequency use case for LSIs in a guest?

For now I would prefer that you remove all the real-mode hcall
handlers.  We can add them later if we get performance data showing
that they are needed.

Regarding whether or not to have a given hcall handler in the kernel
at all - if there is for example an hcall which is just called once
on guest startup, and its function is just to provide information to
the guest, and QEMU has that information, then why not have that hcall
implemented by QEMU?  Are any of the hcalls like that?

For example, if H_INT_GET_SOURCE_INFO was implemented in QEMU, could
we then remove the VC_BASE thing from the xive device?

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-23  6:45       ` Benjamin Herrenschmidt
@ 2019-01-23 10:30         ` Paul Mackerras
  2019-01-23 11:07           ` Cédric Le Goater
  2019-01-23 21:25           ` Benjamin Herrenschmidt
  0 siblings, 2 replies; 135+ messages in thread
From: Paul Mackerras @ 2019-01-23 10:30 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev, David Gibson

On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> > > Clear the ESB pages from the VMA of the IRQ being pass through to the
> > > guest and let the fault handler repopulate the VMA when the ESB pages
> > > are accessed for an EOI or for a trigger.
> > 
> > Why do we want to do this?
> > 
> > I don't see any possible advantage to removing the PTEs from the
> > userspace mapping.  You'll need to explain further.
> 
> Afaik bcs we change the mapping to point to the real HW irq ESB page
> instead of the "IPI" that was there at VM init time.

So that makes it sound like there is a whole lot going on that hasn't
even been hinted at in the patch descriptions...  It sounds like we
need a good description of how all this works and fits together
somewhere under Documentation/.

In any case we need much more informative patch descriptions.  I
realize that it's all currently in Cedric's head, but I bet that in
two or three years' time when we come to try to debug something, it
won't be in anyone's head...

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-23 10:26         ` Paul Mackerras
@ 2019-01-23 10:48           ` Cédric Le Goater
  2019-01-23 21:23           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 10:48 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/23/19 11:26 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 09:48:31AM +0100, Cédric Le Goater wrote:
>> On 1/23/19 7:44 AM, Benjamin Herrenschmidt wrote:
>>> On Tue, 2019-01-22 at 16:23 +1100, Paul Mackerras wrote:
>>>> Why do we need to provide real-mode versions of these hypercall
>>>> handlers?  I thought these hypercalls would only get called
>>>> infrequently, and in any case certainly much less frequently than once
>>>> per interrupt delivered.  If they are infrequent, then let's leave out
>>>> the real-mode version and just handle them in book3s_hv.c.
>>>
>>> Agreed with the exception maybe of H_INT_ESB
>>
>> ok. 
>>
>> Some of these hcalls are really simple and only getting local info from 
>> the host (h_int_get_*). I thought handling the hcall ASAP was a preferred 
>> practice, even if the hcall is not called frequently. Isn't it ?
> 
> If we are going to handle a given hcall in the kernel at all, then we
> have to have a virtual mode handler.  If we have a real-mode handler
> as well then we in general incur a certain amount of code duplication
> with consequent maintenance costs and possibility of bugs.  So we
> generally only have real-mode handlers for the hcalls where it is
> critical to minimize the latency.  From what Ben is saying that would
> only be H_INT_ESB, and maybe not even that.

ok. and yes, even the H_INT_ESB is questionable as this is really a rare
configuration. 

> If H_INT_ESB is only used for LSIs, then is a guest going to be using
> it at all?  My understanding was that with XIVE, only a small number
> of interrupts that are to do with system management functions are
> LSIs; all of the interrupts relating to PCI-e devices are MSIs.  So do
> we actually have a real high-frequency use case for LSIs in a guest?

The guest should be using a rtl8139 or a e1000 NIC under QEMU/KVM, which 
is not the common scenario. 

> For now I would prefer that you remove all the real-mode hcall
> handlers.  We can add them later if we get performance data showing
> that they are needed.

ok. I will.

> Regarding whether or not to have a given hcall handler in the kernel
> at all - if there is for example an hcall which is just called once
> on guest startup, and its function is just to provide information to
> the guest, and QEMU has that information, then why not have that hcall
> implemented by QEMU?  Are any of the hcalls like that?
> 
> For example, if H_INT_GET_SOURCE_INFO was implemented in QEMU, could
> we then remove the VC_BASE thing from the xive device?

Yes. H_INT_GET_SOURCE_INFO looks like a good candidate, all info should
be in QEMU and there are no OPAL calls, and we would get rid of the 
VC_BASE kvm device ioctl at the same time.

Thanks,

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-23 10:30         ` Paul Mackerras
@ 2019-01-23 11:07           ` Cédric Le Goater
  2019-01-28  6:13             ` Paul Mackerras
  2019-01-23 21:25           ` Benjamin Herrenschmidt
  1 sibling, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 11:07 UTC (permalink / raw)
  To: Paul Mackerras, Benjamin Herrenschmidt
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On 1/23/19 11:30 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
>> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
>>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
>>>> guest and let the fault handler repopulate the VMA when the ESB pages
>>>> are accessed for an EOI or for a trigger.
>>>
>>> Why do we want to do this?
>>>
>>> I don't see any possible advantage to removing the PTEs from the
>>> userspace mapping.  You'll need to explain further.
>>
>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>> instead of the "IPI" that was there at VM init time.

yes exactly. You need to clean up the pages each time.
 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.

OK. I have started doing so for the models merged in QEMU but not yet 
for KVM. I will work on it.

> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

I agree. 


So, storing the ESB VMA under the KVM device is not shocking anyone ?  

Thanks,

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
  2019-01-22  4:56   ` Paul Mackerras
@ 2019-01-23 16:24     ` Cédric Le Goater
  2019-02-04  0:50       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 16:24 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 5:56 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
>> We will have different KVM devices for interrupts, one for the
>> XICS-over-XIVE mode and one for the XIVE native exploitation
>> mode. Let's add some checks to make sure we are not mixing the
>> interfaces in KVM.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/kvm/book3s_xive.c | 6 ++++++
>>  1 file changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>> index f78d002f0fe0..8a4fa45f07f8 100644
>> --- a/arch/powerpc/kvm/book3s_xive.c
>> +++ b/arch/powerpc/kvm/book3s_xive.c
>> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
>>  {
>>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>>  
>> +	if (!kvmppc_xics_enabled(vcpu))
>> +		return -EPERM;
>> +
>>  	if (!xc)
>>  		return 0;
>>  
>> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
>>  	u8 cppr, mfrr;
>>  	u32 xisr;
>>  
>> +	if (!kvmppc_xics_enabled(vcpu))
>> +		return -EPERM;
>> +
>>  	if (!xc || !xive)
>>  		return -ENOENT;
> 
> I can't see how these new checks could ever trigger in the code as it
> stands.  Is there a way at present? 

It would require some custom QEMU doing silly things : create the XICS 
KVM device, and then call kvm_get_one_reg(KVM_REG_PPC_ICP_STATE) or 
kvm_set_one_reg(icp->cs, KVM_REG_PPC_ICP_STATE) without connecting the
vCPU to its presenter. 

Today, you get a ENOENT.

> Do following patches ever add a path where the new checks could trigger, 
> or is this just an excess of caution? 

With the following patches, QEMU could to do something even more silly,
which is to mix the interrupt mode interfaces : create a KVM XICS device    
and call KVM CPU ioctls of the KVM XIVE device, or the opposite.  

> (Your patch description should ideally have answered these questions > for me.)

Yes. I also think that I introduced this patch to early in the series.
It make more sense when the XICS and the XIVE KVM devices are available.  

Thanks,

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-22  5:05   ` Paul Mackerras
@ 2019-01-23 16:28     ` Cédric Le Goater
  2019-01-28 17:35     ` Cédric Le Goater
  1 sibling, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 16:28 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 6:05 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
> 
> [snip]
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  	if (xive_enabled()) {
>>  		kvmppc_xive_init_module();
>> +		kvmppc_xive_native_init_module();
>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>> +					KVM_DEV_TYPE_XIVE);
> 
> I think we want tighter conditions on initializing the xive_native
> stuff and creating the xive device class.  We could have
> xive_enabled() returning true in a guest, and this code will get
> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> are running bare metal).

Ah yes, I agree. I haven't addressed at all the nested flavor. I have 
some questions about this that I will ask in summary email you sent. 

Thanks,

C. 
  

> 
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -	if (xive_enabled())
>> +	if (xive_enabled()) {
>>  		kvmppc_xive_exit_module();
>> +		kvmppc_xive_native_exit_module();
> 
> Same comment here.
> 
> Paul.
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-01-22  5:09   ` Paul Mackerras
@ 2019-01-23 16:48     ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 16:48 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 6:09 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>> This will let the guest create a memory mapping to expose the ESB MMIO
>> regions used to control the interrupt sources, to trigger events, to
>> EOI or to turn off the sources.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>>  2 files changed, 101 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 8c876c166ef2..6bb61ba141c2 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>  
>> +/* POWER9 XIVE Native Interrupt Controller */
>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
>> +
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index 115143e76c45..e20081f0c8d4 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>>  	return rc;
>>  }
>>  
>> +static int xive_native_esb_fault(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	struct kvmppc_xive *xive = vma->vm_file->private_data;
>> +	struct kvmppc_xive_src_block *sb;
>> +	struct kvmppc_xive_irq_state *state;
>> +	struct xive_irq_data *xd;
>> +	u32 hw_num;
>> +	u16 src;
>> +	u64 page;
>> +	unsigned long irq;
>> +
>> +	/*
>> +	 * Linux/KVM uses a two pages ESB setting, one for trigger and
>> +	 * one for EOI
>> +	 */
>> +	irq = vmf->pgoff / 2;
>> +
>> +	sb = kvmppc_xive_find_source(xive, irq, &src);
>> +	if (!sb) {
>> +		pr_err("%s: source %lx not found !\n", __func__, irq);
> 
> In general it's a bad idea to have a printk that userspace can trigger
> at will without any rate-limiting.  Is there a real reason why this
> printk is needed (and can't be pr_devel)?

yes. It should. The SIGBUS is enough to know what's going on. 

> 
>> +		return VM_FAULT_SIGBUS;
>> +	}
>> +
>> +	state = &sb->irq_state[src];
>> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
>> +
>> +	arch_spin_lock(&sb->lock);
>> +
>> +	/*
>> +	 * first/even page is for trigger
>> +	 * second/odd page is for EOI and management.
>> +	 */
>> +	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
>> +	arch_spin_unlock(&sb->lock);
>> +
>> +	if (!page) {
>> +		pr_err("%s: acessing invalid ESB page for source %lx !\n",
>> +		       __func__, irq);
> 
> Does this represent a exceptional condition that userspace can't just
> trigger at will (i.e. it implies the presence of a kernel bug)?  If
> not then the same comment as above applies.

Not having an ESB page (trigger or EOI) implies that the xive_irq_data 
for the source is bogus. This probably deserves a WARN().

Thanks,

C. 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  2019-01-22  5:14   ` Paul Mackerras
@ 2019-01-23 16:56     ` Cédric Le Goater
  2019-02-04  4:49       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 16:56 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 6:14 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:20PM +0100, Cédric Le Goater wrote:
>> The ESB MMIO region controls the interrupt sources of the guest. QEMU
>> will query an fd (GET_ESB_FD ioctl) and map this region at a specific
>> address for the guest to use. The guest will obtain this information
>> using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
>> setting used by QEMU, add a VC_BASE control to the KVM XIVE device
> 
> This needs a little more explanation.  I *think* the only way this
> gets used is that it gets returned to the guest by the new
> hypercalls.  If that is indeed the case it would be useful to mention
> that in the patch description, because otherwise taking a value that
> userspace provides and which looks like it is an address, and not
> doing any validation on it, looks a bit scary.

I think we have solved this problem in another email thread. 

The H_INT_GET_SOURCE_INFO hcall does not need to be implemented in KVM
as all the source information should already be available in QEMU. In
that case, there is no need to inform KVM of where the ESB pages are 
mapped in the guest address space. So we don't need that extra control
on the KVM device. This is good news. 

Thanks,

C. 




^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl
  2019-01-22  5:42     ` Paul Mackerras
@ 2019-01-23 18:39       ` Cédric Le Goater
  2019-01-23 21:32         ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 18:39 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 6:42 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 08:10:06PM +0100, Cédric Le Goater wrote:
>> This will be used to destroy the KVM XICS or XIVE device when the
>> sPAPR machine is reseted. When the VM boots, the CAS negotiation
>> process will determine which interrupt mode to use and the appropriate
>> KVM device will then be created.
> 
> What would be the consequence if we didn't destroy the device?

So, if we don't destroy the device, it would mean that we are 
maintaining its availability under the KVM PPC structures, VM and
vCPUs, I think the changes would be significant to have two interrupt 
devices unde the VM. We would also need a way to activate one or 
the other depending on the interrupt mode chosen by CAS. In other 
words, it's moving all the interrupt mode politics from QEMU to KVM. 
It's possible of course but I would prefer to leave the ugly details 
in QEMU.  

Let's suppose now that we keep the device alive but disconnect the 
presenters from it, and from the VM also. We would have an unused 
device in the VM. We would need way to keep an handle on it (fd 
certainly) and a KVM interface to soft reset a KVM device partially 
initialized. That's one other option.

It seemed easier to do an hard reset : create/destroy.  

> The reason I ask is that we will have to be much more careful about
> memory allocation lifetimes with this patch. 

yes. bad refcounting will lead the host kernel to a crash. 

> Having KVM devices last
> until the KVM instance is destroyed means that we generally avoid
> use-after-free bugs.  With this patch we will have to do a careful
> analysis of the lifetime of the xive structures vs. possible accesses
> on other threads to prove there are no use-after-free bugs.
> 
> For example, it is not sufficient to set any pointers in struct kvm or
> struct kvm_vcpu that point into xive structures to NULL before freeing
> the structures.  There could be code on another CPU that has read the
> pointer value before you set it to NULL and then goes and accesses it
> after you have freed it.  You need to prove that can't happen,
> possibly using some sort of explicit synchronization that ensures that
> no other CPU could still be accessing the structure at the time when
> you free it.  RCU can help with this, but in general means you need
> RCU synchronization primitives (rcu_read_lock() etc.) at all the
> places where you use the pointer, which I don't think you currently
> have.

no. indeed. I have overlooked the synchronization aspect.

> If there is a good fundamental reason why this can't happen, even
> though you don't have explicit synchronization, then at a minimum you
> need to explain that in the patch description, and ideally also in
> code comments.

OK. I did leave that patch at the end for one reason. It needs more care.

Thanks,

C.
 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-22  4:46 ` [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Paul Mackerras
@ 2019-01-23 19:07   ` Cédric Le Goater
  2019-01-23 21:35     ` Benjamin Herrenschmidt
  2019-01-28  5:51     ` Paul Mackerras
  0 siblings, 2 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-23 19:07 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 5:46 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
>> Hello,
>>
>> On the POWER9 processor, the XIVE interrupt controller can control
>> interrupt sources using MMIO to trigger events, to EOI or to turn off
>> the sources. Priority management and interrupt acknowledgment is also
>> controlled by MMIO in the CPU presenter subengine.
>>
>> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
>> special support from the hypervisor to do the same. This is called the
>> XIVE native exploitation mode and today, it can be activated under the
>> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
>> and still offers the old interrupt mode interface using a
>> XICS-over-XIVE glue which implements the XICS hcalls.
>>
>> The following series is proposal to add the same support under KVM.
>>
>> A new KVM device is introduced for the XIVE native exploitation
>> mode. It reuses most of the XICS-over-XIVE glue implementation
>> structures which are internal to KVM but has a completely different
>> interface. A set of Hypervisor calls configures the sources and the
>> event queues and from there, all control is done by the guest through
>> MMIOs.
>>
>> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
>> similarly to VFIO, and the associated VMAs are populated dynamically
>> with the appropriate pages using a fault handler. This is implemented
>> with a couple of KVM device ioctls.
>>
>> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
>> negotiation process determines whether the guest operates with a
>> interrupt controller using the XICS legacy model, as found on POWER8,
>> or in XIVE exploitation mode. Which means that the KVM interrupt
>> device should be created at runtime, after the machine as started.
>> This requires extra KVM support to create/destroy KVM devices. The
>> last patches are an attempt to solve that problem.
>>
>> Migration has its own specific needs. The patchset provides the
>> necessary routines to quiesce XIVE, to capture and restore the state
>> of the different structures used by KVM, OPAL and HW. Extra OPAL
>> support is required for these.
> 
> Thanks for the patchset.  It mostly looks good, but there are some
> more things we need to consider, and I think a v2 will be needed.
>> One general comment I have is that there are a lot of acronyms in this
> code and you mostly seem to assume that people will know what they all
> mean.  It would make the code more readable if you provide the
> expansion of the acronym on first use in a comment or whatever.  For
> example, one of the patches in this series talks about the "EAS"

 Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)

All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
Linux should be adjusted ...

> without ever expanding it in any comment or in the patch description,
> and I have forgotten just at the moment what EAS stands for (I just
> know that understanding the XIVE is not eas-y. :)
ah ! yes. But we have great documentation :)

We pushed some high level description of XIVE in QEMU :

  https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD

I should do the same for Linux with a KVM section to explain the 
interfaces which do not directly expose the underlying XIVE concepts. 
It's better to understand a little what is happening under the hood.

> Another general comment is that you seem to have written all this
> code assuming we are using HV KVM in a host running bare-metal.

Yes. I didn't look at the other configurations. I thought that we could
use the kernel_irqchip=off option to begin with. A couple of checks
are indeed missing.

> However, we could be using PR KVM (either in a bare-metal host or in a
> guest), or we could be doing nested HV KVM where we are using the
> kvm_hv module inside a KVM guest and using special hypercalls for
> controlling our guests.

Yes. 

It would be good to talk a little about the nested support (offline 
may be) to make sure that we are not missing some major interface that 
would require a lot of change. If we need to prepare ground, I think
the timing is good.

The size of the IRQ number space might be a problem. It seems we 
would need to increase it considerably to support multiple nested 
guests. That said I haven't look much how nested is designed.  

> It would be perfectly acceptable for now to say that we don't yet
> support XIVE exploitation in those scenarios, as long as we then make
> sure that the new KVM capability reports false in those scenarios, and
> any attempt to use the XIVE exploitation interfaces fails cleanly.

ok. That looks the best approach for now.

> I don't see that either of those is true in the patch set as it
> stands, so that is one area that needs to be fixed.
> 
> A third general comment is that the new KVM interfaces you have added
> need to be documented in the files under Documentation/virtual/kvm.

ok. 

Thanks,

C. 



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls
  2019-01-23 10:26         ` Paul Mackerras
  2019-01-23 10:48           ` Cédric Le Goater
@ 2019-01-23 21:23           ` Benjamin Herrenschmidt
  1 sibling, 0 replies; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-01-23 21:23 UTC (permalink / raw)
  To: Paul Mackerras, Cédric Le Goater
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On Wed, 2019-01-23 at 21:26 +1100, Paul Mackerras wrote:
> If H_INT_ESB is only used for LSIs, then is a guest going to be using
> it at all?  

*emulated* LSIs, ie LSIs coming from emulated devices. It will depends
in practice of what kind of emulated device you put in your guest. We
need that because under the hood, we send a XIVE MSI, so we need to be
notified of the EOI so we can resend if the emulated LSI is still
asserted. 

> My understanding was that with XIVE, only a small number
> of interrupts that are to do with system management functions are
> LSIs; all of the interrupts relating to PCI-e devices are MSIs.  So do
> we actually have a real high-frequency use case for LSIs in a guest?
> 
> For now I would prefer that you remove all the real-mode hcall
> handlers.  We can add them later if we get performance data showing
> that they are needed.
> 
> Regarding whether or not to have a given hcall handler in the kernel
> at all - if there is for example an hcall which is just called once
> on guest startup, and its function is just to provide information to
> the guest, and QEMU has that information, then why not have that hcall
> implemented by QEMU?  Are any of the hcalls like that?
> 
> For example, if H_INT_GET_SOURCE_INFO was implemented in QEMU, could
> we then remove the VC_BASE thing from the xive device?

Ben.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-23 10:30         ` Paul Mackerras
  2019-01-23 11:07           ` Cédric Le Goater
@ 2019-01-23 21:25           ` Benjamin Herrenschmidt
  2019-01-24  8:41             ` Cédric Le Goater
  2019-01-28  4:43             ` Paul Mackerras
  1 sibling, 2 replies; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-01-23 21:25 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev, David Gibson

On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
> > Afaik bcs we change the mapping to point to the real HW irq ESB page
> > instead of the "IPI" that was there at VM init time.
> 
> So that makes it sound like there is a whole lot going on that hasn't
> even been hinted at in the patch descriptions...  It sounds like we
> need a good description of how all this works and fits together
> somewhere under Documentation/.
> 
> In any case we need much more informative patch descriptions.  I
> realize that it's all currently in Cedric's head, but I bet that in
> two or three years' time when we come to try to debug something, it
> won't be in anyone's head...

The main problem is understanding XIVE itself. It's not realistic to
ask Cedric to write a proper documentation for XIVE as part of the
patch series, but sadly IBM doesn't have a good one to provide either.

Ben.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl
  2019-01-23 18:39       ` Cédric Le Goater
@ 2019-01-23 21:32         ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-01-23 21:32 UTC (permalink / raw)
  To: Cédric Le Goater, Paul Mackerras
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On Wed, 2019-01-23 at 19:39 +0100, Cédric Le Goater wrote:
> > The reason I ask is that we will have to be much more careful about
> > memory allocation lifetimes with this patch. 
> 
> yes. bad refcounting will lead the host kernel to a crash. 

One way to alleviate that is to make sure this is only supported on
selected devices such as XICS via some flag or the presence of a
callback.

Cheers,
Ben.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-23 19:07   ` Cédric Le Goater
@ 2019-01-23 21:35     ` Benjamin Herrenschmidt
  2019-01-26  8:25       ` Cédric Le Goater
  2019-01-28  5:51     ` Paul Mackerras
  1 sibling, 1 reply; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-01-23 21:35 UTC (permalink / raw)
  To: Cédric Le Goater, Paul Mackerras
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> 
> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> Linux should be adjusted ...

All the names changed between the HW design and the "architecture"
document. The HW guys use the old names, the architecture the new
names, and Linux & OPAL mostly use the old ones because frankly the new
names suck big time.

> It would be good to talk a little about the nested support (offline 
> may be) to make sure that we are not missing some major interface that 
> would require a lot of change. If we need to prepare ground, I think
> the timing is good.
> 
> The size of the IRQ number space might be a problem. It seems we 
> would need to increase it considerably to support multiple nested 
> guests. That said I haven't look much how nested is designed.  

The size of the VP space is a bigger concern. Even today. We really
need qemu to tell the max #cpu to KVM so we can allocate less of them.

As for nesting, I suggest for the foreseeable future we stick to XICS
emulation in nested guests.

Cheers,
Ben.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-23 21:25           ` Benjamin Herrenschmidt
@ 2019-01-24  8:41             ` Cédric Le Goater
  2019-01-28  4:43             ` Paul Mackerras
  1 sibling, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-24  8:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On 1/23/19 10:25 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>> instead of the "IPI" that was there at VM init time.
>>
>> So that makes it sound like there is a whole lot going on that hasn't
>> even been hinted at in the patch descriptions...  It sounds like we
>> need a good description of how all this works and fits together
>> somewhere under Documentation/.
>>
>> In any case we need much more informative patch descriptions.  I
>> realize that it's all currently in Cedric's head, but I bet that in
>> two or three years' time when we come to try to debug something, it
>> won't be in anyone's head...
> 
> The main problem is understanding XIVE itself. It's not realistic to
> ask Cedric to write a proper documentation for XIVE as part of the
> patch series, but sadly IBM doesn't have a good one to provide either.

QEMU has a preliminary introduction we could use :

https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD

With some extensions for sPAPR and KVM, the resulting file could 
be moved to the Linux documentation directory. This would be an
iterative process over time of course. 

Cheers,

C.  

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-23 21:35     ` Benjamin Herrenschmidt
@ 2019-01-26  8:25       ` Cédric Le Goater
  2019-02-04  5:36         ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-26  8:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, Paul Mackerras
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

Was there a crashing.org shutdown ? 

  Received: from gate.crashing.org (gate.crashing.org [63.228.1.57])
	by in5.mail.ovh.net (Postfix) with ESMTPS id 43mYnj0nrlz1N7KC
	for <clg@kaod.org>; Fri, 25 Jan 2019 22:38:00 +0000 (UTC)
  Received: from localhost (localhost.localdomain [127.0.0.1])
	by gate.crashing.org (8.14.1/8.14.1) with ESMTP id x0NLZf4K021092;
	Wed, 23 Jan 2019 15:35:43 -0600


On 1/23/19 10:35 PM, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
>>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
>>
>> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
>> Linux should be adjusted ...
> 
> All the names changed between the HW design and the "architecture"
> document. The HW guys use the old names, the architecture the new
> names, and Linux & OPAL mostly use the old ones because frankly the new
> names suck big time.

Well, It does not make XIVE any clearer ... I did prefer the v1 names
but there was some naming overlap in the concepts. 

>> It would be good to talk a little about the nested support (offline 
>> may be) to make sure that we are not missing some major interface that 
>> would require a lot of change. If we need to prepare ground, I think
>> the timing is good.
>>
>> The size of the IRQ number space might be a problem. It seems we 
>> would need to increase it considerably to support multiple nested 
>> guests. That said I haven't look much how nested is designed.  
> 
> The size of the VP space is a bigger concern. Even today. We really
> need qemu to tell the max #cpu to KVM so we can allocate less of them.

ah yes. we would also need to reduce the number of available priorities 
per CPU to have more EQ descriptors available if I recall well. 

> As for nesting, I suggest for the foreseeable future we stick to XICS
> emulation in nested guests.

ok. so no kernel_irqchip at all. hmm.  

I was wondering how possible it was to have L2 initialize the underlying 
OPAL structures in the L0 hypervisor. May be with a sort of proxy hcall 
which would perform the initialization in QEMU L1 on behalf of L2.

Cheers,
C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-23 21:25           ` Benjamin Herrenschmidt
  2019-01-24  8:41             ` Cédric Le Goater
@ 2019-01-28  4:43             ` Paul Mackerras
  2019-01-29 13:46               ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-28  4:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev, David Gibson

On Thu, Jan 24, 2019 at 08:25:15AM +1100, Benjamin Herrenschmidt wrote:
> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
> > > Afaik bcs we change the mapping to point to the real HW irq ESB page
> > > instead of the "IPI" that was there at VM init time.
> > 
> > So that makes it sound like there is a whole lot going on that hasn't
> > even been hinted at in the patch descriptions...  It sounds like we
> > need a good description of how all this works and fits together
> > somewhere under Documentation/.
> > 
> > In any case we need much more informative patch descriptions.  I
> > realize that it's all currently in Cedric's head, but I bet that in
> > two or three years' time when we come to try to debug something, it
> > won't be in anyone's head...
> 
> The main problem is understanding XIVE itself. It's not realistic to
> ask Cedric to write a proper documentation for XIVE as part of the
> patch series, but sadly IBM doesn't have a good one to provide either.

There are: (a) the XIVE hardware, (b) the definition of the XIVE
hypercalls that guests use, and (c) the design decisions around how to
implement that hypercall interface.  We need to get (b) published
somehow, but it is mostly (c) that I would expect the patch
descriptions to explain.

It sounds like there will be a mapping to userspace where the pages
can sometimes point to an IPI page and sometimes point to a real HW
irq ESB page.  That is, the same guest "hardware" irq number sometimes
refers to a software-generated interrupt (what you called an "IPI"
above) and sometimes to a hardware-generated interrupt.  That fact,
the reason why it is so and the consequences all need to be explained
somewhere.  They are really not obvious and I don't believe they are
part of either the XIVE hardware spec or the XIVE hypercall spec.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-23 19:07   ` Cédric Le Goater
  2019-01-23 21:35     ` Benjamin Herrenschmidt
@ 2019-01-28  5:51     ` Paul Mackerras
  2019-01-29 13:51       ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-28  5:51 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Wed, Jan 23, 2019 at 08:07:33PM +0100, Cédric Le Goater wrote:
> On 1/22/19 5:46 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:12PM +0100, Cédric Le Goater wrote:
> >> Hello,
> >>
> >> On the POWER9 processor, the XIVE interrupt controller can control
> >> interrupt sources using MMIO to trigger events, to EOI or to turn off
> >> the sources. Priority management and interrupt acknowledgment is also
> >> controlled by MMIO in the CPU presenter subengine.
> >>
> >> PowerNV/baremetal Linux runs natively under XIVE but sPAPR guests need
> >> special support from the hypervisor to do the same. This is called the
> >> XIVE native exploitation mode and today, it can be activated under the
> >> PowerPC Hypervisor, pHyp. However, Linux/KVM lacks XIVE native support
> >> and still offers the old interrupt mode interface using a
> >> XICS-over-XIVE glue which implements the XICS hcalls.
> >>
> >> The following series is proposal to add the same support under KVM.
> >>
> >> A new KVM device is introduced for the XIVE native exploitation
> >> mode. It reuses most of the XICS-over-XIVE glue implementation
> >> structures which are internal to KVM but has a completely different
> >> interface. A set of Hypervisor calls configures the sources and the
> >> event queues and from there, all control is done by the guest through
> >> MMIOs.
> >>
> >> These MMIO regions (ESB and TIMA) are exposed to guests in QEMU,
> >> similarly to VFIO, and the associated VMAs are populated dynamically
> >> with the appropriate pages using a fault handler. This is implemented
> >> with a couple of KVM device ioctls.
> >>
> >> On a POWER9 sPAPR machine, the Client Architecture Support (CAS)
> >> negotiation process determines whether the guest operates with a
> >> interrupt controller using the XICS legacy model, as found on POWER8,
> >> or in XIVE exploitation mode. Which means that the KVM interrupt
> >> device should be created at runtime, after the machine as started.
> >> This requires extra KVM support to create/destroy KVM devices. The
> >> last patches are an attempt to solve that problem.
> >>
> >> Migration has its own specific needs. The patchset provides the
> >> necessary routines to quiesce XIVE, to capture and restore the state
> >> of the different structures used by KVM, OPAL and HW. Extra OPAL
> >> support is required for these.
> > 
> > Thanks for the patchset.  It mostly looks good, but there are some
> > more things we need to consider, and I think a v2 will be needed.
> >> One general comment I have is that there are a lot of acronyms in this
> > code and you mostly seem to assume that people will know what they all
> > mean.  It would make the code more readable if you provide the
> > expansion of the acronym on first use in a comment or whatever.  For
> > example, one of the patches in this series talks about the "EAS"
> 
>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> 
> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> Linux should be adjusted ...
> 
> > without ever expanding it in any comment or in the patch description,
> > and I have forgotten just at the moment what EAS stands for (I just
> > know that understanding the XIVE is not eas-y. :)
> ah ! yes. But we have great documentation :)
> 
> We pushed some high level description of XIVE in QEMU :
> 
>   https://git.qemu.org/?p=qemu.git;a=blob;f=include/hw/ppc/xive.h;h=ec23253ba448e25c621356b55a7777119a738f8e;hb=HEAD
> 
> I should do the same for Linux with a KVM section to explain the 
> interfaces which do not directly expose the underlying XIVE concepts. 
> It's better to understand a little what is happening under the hood.
> 
> > Another general comment is that you seem to have written all this
> > code assuming we are using HV KVM in a host running bare-metal.
> 
> Yes. I didn't look at the other configurations. I thought that we could
> use the kernel_irqchip=off option to begin with. A couple of checks
> are indeed missing.

Using kernel_irqchip=off would mean that we would not be able to use
the in-kernel XICS emulation, which would have a performance impact.

We need an explicit capability for XIVE exploitation that can be
enabled or disabled on the qemu command line, so that we can enforce a
uniform set of capabilities across all the hosts in a migration
domain.  And it's no good to say we have the capability when all
attempts to use it will fail.  Therefore the kernel needs to say that
it doesn't have the capability in a PR KVM guest or in a nested HV
guest.

> > However, we could be using PR KVM (either in a bare-metal host or in a
> > guest), or we could be doing nested HV KVM where we are using the
> > kvm_hv module inside a KVM guest and using special hypercalls for
> > controlling our guests.
> 
> Yes. 
> 
> It would be good to talk a little about the nested support (offline 
> may be) to make sure that we are not missing some major interface that 
> would require a lot of change. If we need to prepare ground, I think
> the timing is good.
> 
> The size of the IRQ number space might be a problem. It seems we 
> would need to increase it considerably to support multiple nested 
> guests. That said I haven't look much how nested is designed.  

The current design of nested HV is that the entire non-volatile state
of all the nested guests is encapsulated within the state and
resources of the L1 hypervisor.  That means that if the L1 hypervisor
gets migrated, all of its guests go across inside it and there is no
extra state that L0 needs to be aware of.  That would imply that the
VP number space for the nested guests would need to come from within
the VP number space for L1; but the amount of VP space we allocate to
each guest doesn't seem to be large enough for that to be practical.

> > It would be perfectly acceptable for now to say that we don't yet
> > support XIVE exploitation in those scenarios, as long as we then make
> > sure that the new KVM capability reports false in those scenarios, and
> > any attempt to use the XIVE exploitation interfaces fails cleanly.
> 
> ok. That looks the best approach for now.
> 
> > I don't see that either of those is true in the patch set as it
> > stands, so that is one area that needs to be fixed.
> > 
> > A third general comment is that the new KVM interfaces you have added
> > need to be documented in the files under Documentation/virtual/kvm.
> 
> ok. 
> 
> Thanks,
> 
> C. 
> 

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-23 11:07           ` Cédric Le Goater
@ 2019-01-28  6:13             ` Paul Mackerras
  2019-01-28 18:26               ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-28  6:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Wed, Jan 23, 2019 at 12:07:19PM +0100, Cédric Le Goater wrote:
> On 1/23/19 11:30 AM, Paul Mackerras wrote:
> > On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
> >> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
> >>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
> >>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
> >>>> guest and let the fault handler repopulate the VMA when the ESB pages
> >>>> are accessed for an EOI or for a trigger.
> >>>
> >>> Why do we want to do this?
> >>>
> >>> I don't see any possible advantage to removing the PTEs from the
> >>> userspace mapping.  You'll need to explain further.
> >>
> >> Afaik bcs we change the mapping to point to the real HW irq ESB page
> >> instead of the "IPI" that was there at VM init time.
> 
> yes exactly. You need to clean up the pages each time.
>  
> > So that makes it sound like there is a whole lot going on that hasn't
> > even been hinted at in the patch descriptions...  It sounds like we
> > need a good description of how all this works and fits together
> > somewhere under Documentation/.
> 
> OK. I have started doing so for the models merged in QEMU but not yet 
> for KVM. I will work on it.
> 
> > In any case we need much more informative patch descriptions.  I
> > realize that it's all currently in Cedric's head, but I bet that in
> > two or three years' time when we come to try to debug something, it
> > won't be in anyone's head...
> 
> I agree. 
> 
> 
> So, storing the ESB VMA under the KVM device is not shocking anyone ?  

Actually, now that I think of it, why can't userspace (QEMU) manage
this using mmap()?  Based on what Ben has said, I assume there would
be a pair of pages for each interrupt that a PCI pass-through device
has.  Would we end up with too many VMAs if we just used mmap() to
change the mappings from the software-generated pages to the
hardware-generated interrupt pages?  Are the necessary pages for a PCI
passthrough device contiguous in both host real space and guest real
space?  If so we'd only need one mmap() for all the device's interrupt
pages.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-22  5:05   ` Paul Mackerras
  2019-01-23 16:28     ` Cédric Le Goater
@ 2019-01-28 17:35     ` Cédric Le Goater
  2019-01-30  4:29       ` Paul Mackerras
  1 sibling, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-28 17:35 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/22/19 6:05 AM, Paul Mackerras wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
> 
> [snip]
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  	if (xive_enabled()) {
>>  		kvmppc_xive_init_module();
>> +		kvmppc_xive_native_init_module();
>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>> +					KVM_DEV_TYPE_XIVE);
> 
> I think we want tighter conditions on initializing the xive_native
> stuff and creating the xive device class.  We could have
> xive_enabled() returning true in a guest, and this code will get
> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> are running bare metal).

So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
(L1) obviously crashes trying to call OPAL. I have tighten the test with : 

	if (xive_enabled() && !kvmhv_on_pseries()) {

for now.

As this is a problem today in 5.0.x, I will send a patch for it if you think
it is correct. I don't think we should bother taking care of the PR case
on P9. Should we ? 

Thanks,

C.
 
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -	if (xive_enabled())
>> +	if (xive_enabled()) {
>>  		kvmppc_xive_exit_module();
>> +		kvmppc_xive_native_exit_module();
> 
> Same comment here.
> 
> Paul.
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-28  6:13             ` Paul Mackerras
@ 2019-01-28 18:26               ` Cédric Le Goater
  2019-01-29  2:45                 ` Paul Mackerras
  2019-01-29  4:12                 ` Paul Mackerras
  0 siblings, 2 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-28 18:26 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/28/19 7:13 AM, Paul Mackerras wrote:
> On Wed, Jan 23, 2019 at 12:07:19PM +0100, Cédric Le Goater wrote:
>> On 1/23/19 11:30 AM, Paul Mackerras wrote:
>>> On Wed, Jan 23, 2019 at 05:45:24PM +1100, Benjamin Herrenschmidt wrote:
>>>> On Tue, 2019-01-22 at 16:26 +1100, Paul Mackerras wrote:
>>>>> On Mon, Jan 07, 2019 at 08:10:05PM +0100, Cédric Le Goater wrote:
>>>>>> Clear the ESB pages from the VMA of the IRQ being pass through to the
>>>>>> guest and let the fault handler repopulate the VMA when the ESB pages
>>>>>> are accessed for an EOI or for a trigger.
>>>>>
>>>>> Why do we want to do this?
>>>>>
>>>>> I don't see any possible advantage to removing the PTEs from the
>>>>> userspace mapping.  You'll need to explain further.
>>>>
>>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>>> instead of the "IPI" that was there at VM init time.
>>
>> yes exactly. You need to clean up the pages each time.
>>  
>>> So that makes it sound like there is a whole lot going on that hasn't
>>> even been hinted at in the patch descriptions...  It sounds like we
>>> need a good description of how all this works and fits together
>>> somewhere under Documentation/.
>>
>> OK. I have started doing so for the models merged in QEMU but not yet 
>> for KVM. I will work on it.
>>
>>> In any case we need much more informative patch descriptions.  I
>>> realize that it's all currently in Cedric's head, but I bet that in
>>> two or three years' time when we come to try to debug something, it
>>> won't be in anyone's head...
>>
>> I agree. 
>>
>>
>> So, storing the ESB VMA under the KVM device is not shocking anyone ?  
> 
> Actually, now that I think of it, why can't userspace (QEMU) manage
> this using mmap()?  Based on what Ben has said, I assume there would
> be a pair of pages for each interrupt that a PCI pass-through device
> has. 

Yes. there is a pair of ESB pages per IRQ number.

> Would we end up with too many VMAs if we just used mmap() to
> change the mappings from the software-generated pages to the
> hardware-generated interrupt pages?  
The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
dedicated to CPU IPIs and the remaining 4K are for devices. We can 
extend the last range if needed as these are for MSIs. Dynamic 
extensions under KVM should work also.

This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
lot of mmap(), too much. Also the KVM model needs to be compatible
with the QEMU emulated one and it was simpler to have one overall
memory region for the IPI ESBs, one for the END ESBs (if we support
that one day) and one for the TIMA.

> Are the necessary pages for a PCI
> passthrough device contiguous in both host real space 

They should as they are the PHB4 ESBs.

> and guest real space ? 

also. That's how we organized the mapping. 

> If so we'd only need one mmap() for all the device's interrupt
> pages.

Ah. So we would want to make a special case for the passthrough 
device and have a mmap() and a memory region for its ESBs. Hmm.

Wouldn't that require to hot plug a memory region under the guest ? 
which means that we need to provision an address space/container 
region for theses regions. What are the benefits ? 

Is clearing the PTEs and repopulating the VMA unsafe ? 

Thanks,     

C.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-28 18:26               ` Cédric Le Goater
@ 2019-01-29  2:45                 ` Paul Mackerras
  2019-01-29 13:47                   ` Cédric Le Goater
  2019-01-29  4:12                 ` Paul Mackerras
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-29  2:45 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> > Would we end up with too many VMAs if we just used mmap() to
> > change the mappings from the software-generated pages to the
> > hardware-generated interrupt pages?  
> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> dedicated to CPU IPIs and the remaining 4K are for devices. We can 

Confused.  You say the number space has 32768 entries but then imply
there are only 8K entries.  Do you mean that the API allows for 15-bit
IRQ numbers but we are only making using of 8192 of them?

> extend the last range if needed as these are for MSIs. Dynamic 
> extensions under KVM should work also.
> 
> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> lot of mmap(), too much. Also the KVM model needs to be compatible

I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
space would be covered by a single mmap, overlaid by subsequent mmaps
where we need to map real device interrupts.

> with the QEMU emulated one and it was simpler to have one overall
> memory region for the IPI ESBs, one for the END ESBs (if we support
> that one day) and one for the TIMA.
> 
> > Are the necessary pages for a PCI
> > passthrough device contiguous in both host real space 
> 
> They should as they are the PHB4 ESBs.
> 
> > and guest real space ? 
> 
> also. That's how we organized the mapping. 

"How we organized the mapping" is a significant design decision that I
haven't seen documented anywhere, and is really needed for
understanding what's going on.

> 
> > If so we'd only need one mmap() for all the device's interrupt
> > pages.
> 
> Ah. So we would want to make a special case for the passthrough 
> device and have a mmap() and a memory region for its ESBs. Hmm.
> 
> Wouldn't that require to hot plug a memory region under the guest ? 

No; the way that a memory region works is that userspace can do
whatever disparate mappings it likes within the region on the user
process side, and the corresponding region of guest real address space
follows that automatically.

> which means that we need to provision an address space/container 
> region for theses regions. What are the benefits ? 
> 
> Is clearing the PTEs and repopulating the VMA unsafe ? 

Explicitly unmapping parts of the VMA seems like the wrong way to do
it.  If you have a device mmap where the device wants to change the
physical page underlying parts of the mapping, there should be a way
for it to do that explicitly (but I don't know off the top of my head
what the interface to do that is).

However, I still haven't seen a clear and concise explanation of what
is being changed, when, and why we need to do that.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-28 18:26               ` Cédric Le Goater
  2019-01-29  2:45                 ` Paul Mackerras
@ 2019-01-29  4:12                 ` Paul Mackerras
  2019-01-29 17:44                   ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-29  4:12 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> 
> Is clearing the PTEs and repopulating the VMA unsafe ? 

Actually, now that I come to think of it, there could be any number of
VMAs (well, up to almost 64k of them), since once you have a file
descriptor you can call mmap on it multiple times.

The more I think about it, the more I think that getting userspace to
manage the mappings with mmap() and munmap() is the right idea if it
can be made to work.

Paul.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-28  4:43             ` Paul Mackerras
@ 2019-01-29 13:46               ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-29 13:46 UTC (permalink / raw)
  To: Paul Mackerras, Benjamin Herrenschmidt
  Cc: linuxppc-dev, kvm, kvm-ppc, David Gibson

On 1/28/19 5:43 AM, Paul Mackerras wrote:
> On Thu, Jan 24, 2019 at 08:25:15AM +1100, Benjamin Herrenschmidt wrote:
>> On Wed, 2019-01-23 at 21:30 +1100, Paul Mackerras wrote:
>>>> Afaik bcs we change the mapping to point to the real HW irq ESB page
>>>> instead of the "IPI" that was there at VM init time.
>>>
>>> So that makes it sound like there is a whole lot going on that hasn't
>>> even been hinted at in the patch descriptions...  It sounds like we
>>> need a good description of how all this works and fits together
>>> somewhere under Documentation/.
>>>
>>> In any case we need much more informative patch descriptions.  I
>>> realize that it's all currently in Cedric's head, but I bet that in
>>> two or three years' time when we come to try to debug something, it
>>> won't be in anyone's head...
>>
>> The main problem is understanding XIVE itself. It's not realistic to
>> ask Cedric to write a proper documentation for XIVE as part of the
>> patch series, but sadly IBM doesn't have a good one to provide either.
> 
> There are: (a) the XIVE hardware, (b) the definition of the XIVE
> hypercalls that guests use, and (c) the design decisions around how to
> implement that hypercall interface.  We need to get (b) published
> somehow, but it is mostly (c) that I would expect the patch
> descriptions to explain.
> 
> It sounds like there will be a mapping to userspace where the pages
> can sometimes point to an IPI page and sometimes point to a real HW
> irq ESB page.  

Just to be clear. In both cases, these pages are real HW ESB pages. 
They are just attached to a different controller : the XIVE IC for 
the IPIs and the PHB4 for the others. 

> That is, the same guest "hardware" irq number sometimes
> refers to a software-generated interrupt (what you called an "IPI"
> above) and sometimes to a hardware-generated interrupt.  That fact,> the reason why it is so and the consequences all need to be explained
> somewhere.  They are really not obvious and I don't believe they are
> part of either the XIVE hardware spec or the XIVE hypercall spec.

I tried to put the reasons behind the current approach in another 
thread, not saying this is the correct one.

Thanks,

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-29  2:45                 ` Paul Mackerras
@ 2019-01-29 13:47                   ` Cédric Le Goater
  2019-01-30  6:20                     ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-29 13:47 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/29/19 3:45 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>> Would we end up with too many VMAs if we just used mmap() to
>>> change the mappings from the software-generated pages to the
>>> hardware-generated interrupt pages?  
>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> 
> Confused.  You say the number space has 32768 entries but then imply
> there are only 8K entries.  Do you mean that the API allows for 15-bit
> IRQ numbers but we are only making using of 8192 of them?

Ouch. My bad. Let's do it again. 

The sPAPR IRQ number space is 0x2000 wide :

https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396

The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
devices (which can be extended if needed).

So that's 8192 x 2 ESB pages.

>> extend the last range if needed as these are for MSIs. Dynamic 
>> extensions under KVM should work also.
>>
>> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
>> lot of mmap(), too much. Also the KVM model needs to be compatible
> 
> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> space would be covered by a single mmap, overlaid by subsequent mmaps
> where we need to map real device interrupts.

ok. The same fault handler could be used to populate the VMA with the 
ESB pages. 

But it would mean extra work on the QEMU side, which is not needed 
with this patch. 

>> with the QEMU emulated one and it was simpler to have one overall
>> memory region for the IPI ESBs, one for the END ESBs (if we support
>> that one day) and one for the TIMA.
>>
>>> Are the necessary pages for a PCI
>>> passthrough device contiguous in both host real space 
>>
>> They should as they are the PHB4 ESBs.
>>
>>> and guest real space ? 
>>
>> also. That's how we organized the mapping. 
> 
> "How we organized the mapping" is a significant design decision that I
> haven't seen documented anywhere, and is really needed for
> understanding what's going on.

OK. I will add comments on that. See below for some description.

There is nothing fancy, it's simply indexed with the interrupt number,
like for HW, or for the QEMU XIVE emulated model.

>>> If so we'd only need one mmap() for all the device's interrupt
>>> pages.
>>
>> Ah. So we would want to make a special case for the passthrough 
>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>
>> Wouldn't that require to hot plug a memory region under the guest ? 
> 
> No; the way that a memory region works is that userspace can do
> whatever disparate mappings it likes within the region on the user
> process side, and the corresponding region of guest real address space
> follows that automatically.

yes. I suppose this should work also for 'ram device' memory mappings.

So when the passthrough device is added to the guest, we would add a 
new 'ram device' memory region for the device interrupt ESB pages 
that would overlap with the initial guest ESB pages.  

This is really a different approach.

>> which means that we need to provision an address space/container 
>> region for theses regions. What are the benefits ? 
>>
>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> 
> Explicitly unmapping parts of the VMA seems like the wrong way to do
> it.  If you have a device mmap where the device wants to change the
> physical page underlying parts of the mapping, there should be a way
> for it to do that explicitly (but I don't know off the top of my head
> what the interface to do that is).
> 
> However, I still haven't seen a clear and concise explanation of what
> is being changed, when, and why we need to do that.

Yes. I agree on that. The problem is not very different from what we 
have today with the XICS-over-XIVE glue in KVM. Let me try to explain.


The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
implement an IRQ space for the guest using the generic IPI interrupts 
of the XIVE IC controller. These interrupts are allocated at the OPAL
level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
Interrupt management is performed in the XIVE way: using loads and 
stores on the addresses of the XIVE IPI interrupt ESB pages.

Both KVM devices share the same internal structure caching information 
on the interrupts, among which the xive_irq_data struct containing the 
addresses of the IPI ESB pages and an extra one in case of passthrough. 
The later contains the addresses of the ESB pages of the underlying HW 
controller interrupts, PHB4 in all cases for now.    

A guest when running in the XICS legacy interrupt mode lets the KVM 
XICS-over-XIVE device "handle" interrupt management, that is to perform  
the loads and stores on the addresses of the ESB pages of the guest 
interrupts. 

However, when running in XIVE native exploitation mode, the KVM XIVE 
native device exposes the interrupt ESB pages to the guest and lets 
the guest perform directly the loads and stores. 

The VMA exposing the ESB pages make use of a custom VM fault handler
which role is to populate the VMA with appropriate pages. When a fault
occurs, the guest IRQ number is deduced from the offset, and the ESB 
pages of associated XIVE IPI interrupt are inserted in the VMA (using
the internal structure caching information on the interrupts).

Supporting device passthrough in the guest running in XIVE native 
exploitation mode adds some extra refinements because the ESB pages 
of a different HW controller (PHB4) need to be exposed to the guest 
along with the initial IPI ESB pages of the XIVE IC controller. But
the overall mechanic is the same. 

When the device HW irqs are mapped into or unmapped from the guest
IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
and kvmppc_xive_clr_mapped(), are called to record or clear the 
passthrough interrupt information and to perform the switch.

The approach taken by this patch is to clear the ESB pages of the 
guest IRQ number being mapped and let the VM fault handler repopulate. 
The handler will insert the ESB page corresponding to the HW interrupt 
of the device being passed-through or the initial IPI ESB page if the
device is being removed.   



Thanks,

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-28  5:51     ` Paul Mackerras
@ 2019-01-29 13:51       ` Cédric Le Goater
  2019-01-30  5:40         ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-29 13:51 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

>>> Another general comment is that you seem to have written all this
>>> code assuming we are using HV KVM in a host running bare-metal.
>>
>> Yes. I didn't look at the other configurations. I thought that we could
>> use the kernel_irqchip=off option to begin with. A couple of checks
>> are indeed missing.
> 
> Using kernel_irqchip=off would mean that we would not be able to use
> the in-kernel XICS emulation, which would have a performance impact.

yes. But it is not supported today. Correct ? 

> We need an explicit capability for XIVE exploitation that can be
> enabled or disabled on the qemu command line, so that we can enforce a
> uniform set of capabilities across all the hosts in a migration
> domain.  And it's no good to say we have the capability when all
> attempts to use it will fail.  Therefore the kernel needs to say that
> it doesn't have the capability in a PR KVM guest or in a nested HV
> guest.

OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
for future use.

>>> However, we could be using PR KVM (either in a bare-metal host or in a
>>> guest), or we could be doing nested HV KVM where we are using the
>>> kvm_hv module inside a KVM guest and using special hypercalls for
>>> controlling our guests.
>>
>> Yes. 
>>
>> It would be good to talk a little about the nested support (offline 
>> may be) to make sure that we are not missing some major interface that 
>> would require a lot of change. If we need to prepare ground, I think
>> the timing is good.
>>
>> The size of the IRQ number space might be a problem. It seems we 
>> would need to increase it considerably to support multiple nested 
>> guests. That said I haven't look much how nested is designed.  
> 
> The current design of nested HV is that the entire non-volatile state
> of all the nested guests is encapsulated within the state and
> resources of the L1 hypervisor.  That means that if the L1 hypervisor
> gets migrated, all of its guests go across inside it and there is no
> extra state that L0 needs to be aware of.  That would imply that the
> VP number space for the nested guests would need to come from within
> the VP number space for L1; but the amount of VP space we allocate to
> each guest doesn't seem to be large enough for that to be practical.

If the KVM XIVE device had some information on the max number of CPUs 
provisioned for the guest, we could optimize the VP allocation.

That might be a larger KVM topic though. There are some static limits 
on the number of CPUs in QEMU and in KVM, which have no relation AFAICT. 

Thanks,

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-29  4:12                 ` Paul Mackerras
@ 2019-01-29 17:44                   ` Cédric Le Goater
  2019-01-30  5:55                     ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-29 17:44 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/29/19 5:12 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>
>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> 
> Actually, now that I come to think of it, there could be any number of
> VMAs (well, up to almost 64k of them), since once you have a file
> descriptor you can call mmap on it multiple times.
> 
> The more I think about it, the more I think that getting userspace to
> manage the mappings with mmap() and munmap() is the right idea if it
> can be made to work.

We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
call "ibm,change-msi".  I think that's the right spot for it. 

Thanks,

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-28 17:35     ` Cédric Le Goater
@ 2019-01-30  4:29       ` Paul Mackerras
  2019-01-30  7:01         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-30  4:29 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
> On 1/22/19 6:05 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >> This is the basic framework for the new KVM device supporting the XIVE
> >> native exploitation mode. The user interface exposes a new capability
> >> and a new KVM device to be used by QEMU.
> > 
> > [snip]
> >> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>  #ifdef CONFIG_KVM_XIVE
> >>  	if (xive_enabled()) {
> >>  		kvmppc_xive_init_module();
> >> +		kvmppc_xive_native_init_module();
> >>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >> +		kvm_register_device_ops(&kvm_xive_native_ops,
> >> +					KVM_DEV_TYPE_XIVE);
> > 
> > I think we want tighter conditions on initializing the xive_native
> > stuff and creating the xive device class.  We could have
> > xive_enabled() returning true in a guest, and this code will get
> > called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> > are running bare metal).
> 
> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
> 
> 	if (xive_enabled() && !kvmhv_on_pseries()) {
> 
> for now.
> 
> As this is a problem today in 5.0.x, I will send a patch for it if you think

How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
with kernel_irqchip=on in a nested guest and it works just fine.  What
exactly did you test?

> it is correct. I don't think we should bother taking care of the PR case
> on P9. Should we ? 

We do need to take care of PR KVM on P9, since it is the only form of
nested KVM that works inside a host in HPT mode.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-29 13:51       ` Cédric Le Goater
@ 2019-01-30  5:40         ` Paul Mackerras
  2019-01-30 15:36           ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-30  5:40 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
> >>> Another general comment is that you seem to have written all this
> >>> code assuming we are using HV KVM in a host running bare-metal.
> >>
> >> Yes. I didn't look at the other configurations. I thought that we could
> >> use the kernel_irqchip=off option to begin with. A couple of checks
> >> are indeed missing.
> > 
> > Using kernel_irqchip=off would mean that we would not be able to use
> > the in-kernel XICS emulation, which would have a performance impact.
> 
> yes. But it is not supported today. Correct ? 

Not correct, it has been working for years, and works in v5.0-rc1 (I
just tested it), at both L0 and L1.

> > We need an explicit capability for XIVE exploitation that can be
> > enabled or disabled on the qemu command line, so that we can enforce a
> > uniform set of capabilities across all the hosts in a migration
> > domain.  And it's no good to say we have the capability when all
> > attempts to use it will fail.  Therefore the kernel needs to say that
> > it doesn't have the capability in a PR KVM guest or in a nested HV
> > guest.
> 
> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
> for future use.

That's not what I meant.  Why do we need that?  I meant that querying
the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
guest.  It should only return 1 if we are running bare-metal on a P9.

> >>> However, we could be using PR KVM (either in a bare-metal host or in a
> >>> guest), or we could be doing nested HV KVM where we are using the
> >>> kvm_hv module inside a KVM guest and using special hypercalls for
> >>> controlling our guests.
> >>
> >> Yes. 
> >>
> >> It would be good to talk a little about the nested support (offline 
> >> may be) to make sure that we are not missing some major interface that 
> >> would require a lot of change. If we need to prepare ground, I think
> >> the timing is good.
> >>
> >> The size of the IRQ number space might be a problem. It seems we 
> >> would need to increase it considerably to support multiple nested 
> >> guests. That said I haven't look much how nested is designed.  
> > 
> > The current design of nested HV is that the entire non-volatile state
> > of all the nested guests is encapsulated within the state and
> > resources of the L1 hypervisor.  That means that if the L1 hypervisor
> > gets migrated, all of its guests go across inside it and there is no
> > extra state that L0 needs to be aware of.  That would imply that the
> > VP number space for the nested guests would need to come from within
> > the VP number space for L1; but the amount of VP space we allocate to
> > each guest doesn't seem to be large enough for that to be practical.
> 
> If the KVM XIVE device had some information on the max number of CPUs 
> provisioned for the guest, we could optimize the VP allocation.

The problem is that we might have 1000 guests running under L0, or we
might have 1 guest running under L0 and 1000 guests running under it,
and we have no way to know which situation to optimize for at the
point where an L1 guest starts.  If we had an enormous VP space then
we could just give each L1 guest a large amount of VP space and solve
it that way; but we don't.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-29 17:44                   ` Cédric Le Goater
@ 2019-01-30  5:55                     ` Paul Mackerras
  2019-01-30  7:06                       ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-30  5:55 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Tue, Jan 29, 2019 at 06:44:50PM +0100, Cédric Le Goater wrote:
> On 1/29/19 5:12 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Actually, now that I come to think of it, there could be any number of
> > VMAs (well, up to almost 64k of them), since once you have a file
> > descriptor you can call mmap on it multiple times.
> > 
> > The more I think about it, the more I think that getting userspace to
> > manage the mappings with mmap() and munmap() is the right idea if it
> > can be made to work.
> 
> We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
> call "ibm,change-msi".  I think that's the right spot for it. 

I was thinking that the ESB pages should be mmapped for device
interrupts at VM startup or when a device is hot-plugged in, and
munmapped when the device is hot-removed.  Maybe the mmap could be
done in conjunction with the KVM_IRQFD call?

What is the reasoning behind doing it in the ibm,change-msi call?

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-29 13:47                   ` Cédric Le Goater
@ 2019-01-30  6:20                     ` Paul Mackerras
  2019-01-30 15:54                       ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-30  6:20 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> >>> Would we end up with too many VMAs if we just used mmap() to
> >>> change the mappings from the software-generated pages to the
> >>> hardware-generated interrupt pages?  
> >> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> >> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> > 
> > Confused.  You say the number space has 32768 entries but then imply
> > there are only 8K entries.  Do you mean that the API allows for 15-bit
> > IRQ numbers but we are only making using of 8192 of them?
> 
> Ouch. My bad. Let's do it again. 
> 
> The sPAPR IRQ number space is 0x2000 wide :
> 
> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> 
> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> devices (which can be extended if needed).
> 
> So that's 8192 x 2 ESB pages.
> 
> >> extend the last range if needed as these are for MSIs. Dynamic 
> >> extensions under KVM should work also.
> >>
> >> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> >> lot of mmap(), too much. Also the KVM model needs to be compatible
> > 
> > I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> > space would be covered by a single mmap, overlaid by subsequent mmaps
> > where we need to map real device interrupts.
> 
> ok. The same fault handler could be used to populate the VMA with the 
> ESB pages. 
> 
> But it would mean extra work on the QEMU side, which is not needed 
> with this patch. 

Maybe, but just storing a single vma pointer in our private data is
not a feasible approach.  First, you have no control on the lifetime
of the vma and thus this is a use-after-free waiting to happen, and
secondly, there could be multiple vmas that you need to worry about.
Userspace could do multiple mmaps, or could do mprotect or similar on
part of an existing mmap, which would split the vma for the mmap into
multiple vmas.  You don't get notified about munmap either as far as I
can tell, so the vma is liable to go away.  And it doesn't matter if
QEMU would never do such things; if userspace can provoke a
use-after-free in the kernel using KVM then someone will write a
specially crafted user program to do that.

So we either solve it in userspace, or we have to write and maintain
complex kernel code with deep links into the MM subsystem.  I'd much
rather solve it in userspace.

> >> with the QEMU emulated one and it was simpler to have one overall
> >> memory region for the IPI ESBs, one for the END ESBs (if we support
> >> that one day) and one for the TIMA.
> >>
> >>> Are the necessary pages for a PCI
> >>> passthrough device contiguous in both host real space 
> >>
> >> They should as they are the PHB4 ESBs.
> >>
> >>> and guest real space ? 
> >>
> >> also. That's how we organized the mapping. 
> > 
> > "How we organized the mapping" is a significant design decision that I
> > haven't seen documented anywhere, and is really needed for
> > understanding what's going on.
> 
> OK. I will add comments on that. See below for some description.
> 
> There is nothing fancy, it's simply indexed with the interrupt number,
> like for HW, or for the QEMU XIVE emulated model.
> 
> >>> If so we'd only need one mmap() for all the device's interrupt
> >>> pages.
> >>
> >> Ah. So we would want to make a special case for the passthrough 
> >> device and have a mmap() and a memory region for its ESBs. Hmm.
> >>
> >> Wouldn't that require to hot plug a memory region under the guest ? 
> > 
> > No; the way that a memory region works is that userspace can do
> > whatever disparate mappings it likes within the region on the user
> > process side, and the corresponding region of guest real address space
> > follows that automatically.
> 
> yes. I suppose this should work also for 'ram device' memory mappings.
> 
> So when the passthrough device is added to the guest, we would add a 
> new 'ram device' memory region for the device interrupt ESB pages 
> that would overlap with the initial guest ESB pages.  

Not knowing the QEMU internals all that well, I don't at all
understand why a new ram device is necesssary.  I would see it as a
single virtual area mapping the ESB pages of guest hwirqs that are in
use, and we manage those mappings with mmap and munmap.

> This is really a different approach.
> 
> >> which means that we need to provision an address space/container 
> >> region for theses regions. What are the benefits ? 
> >>
> >> Is clearing the PTEs and repopulating the VMA unsafe ? 
> > 
> > Explicitly unmapping parts of the VMA seems like the wrong way to do
> > it.  If you have a device mmap where the device wants to change the
> > physical page underlying parts of the mapping, there should be a way
> > for it to do that explicitly (but I don't know off the top of my head
> > what the interface to do that is).
> > 
> > However, I still haven't seen a clear and concise explanation of what
> > is being changed, when, and why we need to do that.
> 
> Yes. I agree on that. The problem is not very different from what we 
> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
> 
> 
> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
> implement an IRQ space for the guest using the generic IPI interrupts 
> of the XIVE IC controller. These interrupts are allocated at the OPAL
> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
> Interrupt management is performed in the XIVE way: using loads and 
> stores on the addresses of the XIVE IPI interrupt ESB pages.
> 
> Both KVM devices share the same internal structure caching information 
> on the interrupts, among which the xive_irq_data struct containing the 
> addresses of the IPI ESB pages and an extra one in case of passthrough. 
> The later contains the addresses of the ESB pages of the underlying HW 
> controller interrupts, PHB4 in all cases for now.    
> 
> A guest when running in the XICS legacy interrupt mode lets the KVM 
> XICS-over-XIVE device "handle" interrupt management, that is to perform  
> the loads and stores on the addresses of the ESB pages of the guest 
> interrupts. 
> 
> However, when running in XIVE native exploitation mode, the KVM XIVE 
> native device exposes the interrupt ESB pages to the guest and lets 
> the guest perform directly the loads and stores. 
> 
> The VMA exposing the ESB pages make use of a custom VM fault handler
> which role is to populate the VMA with appropriate pages. When a fault
> occurs, the guest IRQ number is deduced from the offset, and the ESB 
> pages of associated XIVE IPI interrupt are inserted in the VMA (using
> the internal structure caching information on the interrupts).
> 
> Supporting device passthrough in the guest running in XIVE native 
> exploitation mode adds some extra refinements because the ESB pages 
> of a different HW controller (PHB4) need to be exposed to the guest 
> along with the initial IPI ESB pages of the XIVE IC controller. But
> the overall mechanic is the same. 
> 
> When the device HW irqs are mapped into or unmapped from the guest
> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
> and kvmppc_xive_clr_mapped(), are called to record or clear the 
> passthrough interrupt information and to perform the switch.
> 
> The approach taken by this patch is to clear the ESB pages of the 
> guest IRQ number being mapped and let the VM fault handler repopulate. 
> The handler will insert the ESB page corresponding to the HW interrupt 
> of the device being passed-through or the initial IPI ESB page if the
> device is being removed.   

That's a much better write-up.  Thanks.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-30  4:29       ` Paul Mackerras
@ 2019-01-30  7:01         ` Cédric Le Goater
  2019-01-31  3:01           ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-30  7:01 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/30/19 5:29 AM, Paul Mackerras wrote:
> On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
>> On 1/22/19 6:05 AM, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>>>> This is the basic framework for the new KVM device supporting the XIVE
>>>> native exploitation mode. The user interface exposes a new capability
>>>> and a new KVM device to be used by QEMU.
>>>
>>> [snip]
>>>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>>>  #ifdef CONFIG_KVM_XIVE
>>>>  	if (xive_enabled()) {
>>>>  		kvmppc_xive_init_module();
>>>> +		kvmppc_xive_native_init_module();
>>>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>>>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>>>> +					KVM_DEV_TYPE_XIVE);
>>>
>>> I think we want tighter conditions on initializing the xive_native
>>> stuff and creating the xive device class.  We could have
>>> xive_enabled() returning true in a guest, and this code will get
>>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
>>> are running bare metal).
>>
>> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
>> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
>>
>> 	if (xive_enabled() && !kvmhv_on_pseries()) {
>>
>> for now.
>>
>> As this is a problem today in 5.0.x, I will send a patch for it if you think
> 
> How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
> with kernel_irqchip=on in a nested guest and it works just fine.  What
> exactly did you test?

L0: Linux 5.0.0-rc3 (+ KVM HV)
L1:     QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
L2:          QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3

L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
it does an OPAL call and its running under SLOF. See below.

I don't understand how L2 can work with kernel_irqchip=on. Could you
please explain ? 

>> it is correct. I don't think we should bother taking care of the PR case
>> on P9. Should we ? 
> 
> We do need to take care of PR KVM on P9, since it is the only form of
> nested KVM that works inside a host in HPT mode.

ok. That is the test case. There are quite a few combinations now.

Thanks,

C.

[   49.547056] Oops: Exception in kernel mode, sig: 4 [#1]
[   49.555101] LE SMP NR_CPUS=2048 NUMA pSeries
[   49.555132] Modules linked in: xt_CHECKSUM iptable_mangle ipt_MASQUERADE iptable_nat nf_nat_ipv4 nf_nat xt_conntrack nf_conntrack nf_defrag_ipv6 libcrc32c nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 xt_tcpudp bridge stp llc ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter vmx_crypto crct10dif_vpmsum crc32c_vpmsum kvm_hv kvm sch_fq_codel ip_tables x_tables autofs4 virtio_net net_failover failover virtio_scsi
[   49.555335] CPU: 9 PID: 2162 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.0.0-rc3+ #53
[   49.555378] NIP:  c0000000000a7548 LR: c0000000000a4044 CTR: c0000000000a24b0
[   49.555421] REGS: c0000003ad71f8a0 TRAP: 0700   Not tainted  (5.0.0-rc3+)
[   49.555456] MSR:  8000000000041033 <SF,ME,IR,DR,RI,LE>  CR: 44222822  XER: 20040000
[   49.555501] CFAR: c0000000000a2508 IRQMASK: 0 
[   49.555501] GPR00: 0000000000000087 c0000003ad71fb30 c00000000175f700 000000000000000b 
[   49.555501] GPR04: 0000000000000000 0000000000000000 c0000003f88d4000 000000000000000b 
[   49.555501] GPR08: 00000003fd800000 000000000000000b 0000000000000800 0000000000000031 
[   49.555501] GPR12: 8000000000001002 c000000007ff3280 0000000000000000 0000000000000000 
[   49.555501] GPR16: 00007ffff8d2bd60 0000000000000000 000002c9896d7800 00007ffff8d2b970 
[   49.555501] GPR20: 000002c95c876f90 000002c95c876fa0 000002c95c876f80 000002c95c876f70 
[   49.555501] GPR24: 000002c95cf4f648 ffffffffffffffff c0000003ab3e4058 00000000006000c0 
[   49.555501] GPR28: 000000000000000b c0000003ab3e0000 0000000000000000 c0000003f88d0000 
[   49.555883] NIP [c0000000000a7548] opal_xive_alloc_vp_block+0x50/0x68
[   49.555919] LR [c0000000000a4044] opal_return+0x0/0x48
[   49.555947] Call Trace:
[   49.555964] [c0000003ad71fb30] [c0000000000a250c] xive_native_alloc_vp_block+0x5c/0x1c0 (unreliable)
[   49.556019] [c0000003ad71fbc0] [c00800000430c0c0] kvmppc_xive_create+0x98/0x168 [kvm]
[   49.556065] [c0000003ad71fc00] [c0080000042f9fcc] kvm_vm_ioctl+0x474/0xa00 [kvm]
[   49.556113] [c0000003ad71fd10] [c000000000423a64] do_vfs_ioctl+0xd4/0x8e0
[   49.556153] [c0000003ad71fdb0] [c000000000424334] ksys_ioctl+0xc4/0x110
[   49.556190] [c0000003ad71fe00] [c0000000004243a8] sys_ioctl+0x28/0x80
[   49.556230] [c0000003ad71fe20] [c00000000000b288] system_call+0x5c/0x70
[   49.556265] Instruction dump:
[   49.556288] 60000000 7d600026 91610008 39600000 616b8000 f98d0980 7d8c5878 7d810164 
[   49.556332] e9628098 7d6803a6 39600031 7d8c5878 <7d9b4ba6> e96280b0 e98b0008 e84b0000 
[   49.556378] ---[ end trace ac7420a6784de93b ]---

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-30  5:55                     ` Paul Mackerras
@ 2019-01-30  7:06                       ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-30  7:06 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/30/19 6:55 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 06:44:50PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 5:12 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe ? 
>>>
>>> Actually, now that I come to think of it, there could be any number of
>>> VMAs (well, up to almost 64k of them), since once you have a file
>>> descriptor you can call mmap on it multiple times.
>>>
>>> The more I think about it, the more I think that getting userspace to
>>> manage the mappings with mmap() and munmap() is the right idea if it
>>> can be made to work.
>>
>> We might be able to mmap() and munmap() regions of ESB pages in the RTAS 
>> call "ibm,change-msi".  I think that's the right spot for it. 
> 
> I was thinking that the ESB pages should be mmapped for device
> interrupts at VM startup or when a device is hot-plugged in, and
> munmapped when the device is hot-removed.  Maybe the mmap could be
> done in conjunction with the KVM_IRQFD call?
> 
> What is the reasoning behind doing it in the ibm,change-msi call?

Because when the device is plugged in, it has no interrupts. These are 
allocated on demand only when the driver is loaded in the guest. 

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-30  5:40         ` Paul Mackerras
@ 2019-01-30 15:36           ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-30 15:36 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/30/19 6:40 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:51:05PM +0100, Cédric Le Goater wrote:
>>>>> Another general comment is that you seem to have written all this
>>>>> code assuming we are using HV KVM in a host running bare-metal.
>>>>
>>>> Yes. I didn't look at the other configurations. I thought that we could
>>>> use the kernel_irqchip=off option to begin with. A couple of checks
>>>> are indeed missing.
>>>
>>> Using kernel_irqchip=off would mean that we would not be able to use
>>> the in-kernel XICS emulation, which would have a performance impact.
>>
>> yes. But it is not supported today. Correct ? 
> 
> Not correct, it has been working for years, and works in v5.0-rc1 (I
> just tested it), at both L0 and L1.

Please see other email for the test is did.

>>> We need an explicit capability for XIVE exploitation that can be
>>> enabled or disabled on the qemu command line, so that we can enforce a
>>> uniform set of capabilities across all the hosts in a migration
>>> domain.  And it's no good to say we have the capability when all
>>> attempts to use it will fail.  Therefore the kernel needs to say that
>>> it doesn't have the capability in a PR KVM guest or in a nested HV
>>> guest.
>>
>> OK. I will work on adding a KVM_CAP_PPC_NESTED_IRQ_HV capability 
>> for future use.
> 
> That's not what I meant.  Why do we need that?  I meant that querying
> the new KVM_CAP_PPC_IRQ_XIVE capability should return 0 if we are in a
> guest.  It should only return 1 if we are running bare-metal on a P9.

ok. I guess I need to understand first how the nested guest uses the 
KVM IRQ device. That is a question in another email thread.   

>>>>> However, we could be using PR KVM (either in a bare-metal host or in a
>>>>> guest), or we could be doing nested HV KVM where we are using the
>>>>> kvm_hv module inside a KVM guest and using special hypercalls for
>>>>> controlling our guests.
>>>>
>>>> Yes. 
>>>>
>>>> It would be good to talk a little about the nested support (offline 
>>>> may be) to make sure that we are not missing some major interface that 
>>>> would require a lot of change. If we need to prepare ground, I think
>>>> the timing is good.
>>>>
>>>> The size of the IRQ number space might be a problem. It seems we 
>>>> would need to increase it considerably to support multiple nested 
>>>> guests. That said I haven't look much how nested is designed.  
>>>
>>> The current design of nested HV is that the entire non-volatile state
>>> of all the nested guests is encapsulated within the state and
>>> resources of the L1 hypervisor.  That means that if the L1 hypervisor
>>> gets migrated, all of its guests go across inside it and there is no
>>> extra state that L0 needs to be aware of.  That would imply that the
>>> VP number space for the nested guests would need to come from within
>>> the VP number space for L1; but the amount of VP space we allocate to
>>> each guest doesn't seem to be large enough for that to be practical.
>>
>> If the KVM XIVE device had some information on the max number of CPUs 
>> provisioned for the guest, we could optimize the VP allocation.
> 
> The problem is that we might have 1000 guests running under L0, or we
> might have 1 guest running under L0 and 1000 guests running under it,
> and we have no way to know which situation to optimize for at the
> point where an L1 guest starts.  If we had an enormous VP space then
> we could just give each L1 guest a large amount of VP space and solve
> it that way; but we don't.

There are some ideas to increase our VP space size. Using multiblock 
per XIVE chip in skiboot is one I think. It's not an obvious change. 
Also, XIVE2 will add more bits to the NVT index so we will be free 
to allocate more at once when P10 is available.

On the same topic, may be we could move the VP allocator from skiboot
to KVM, allocate the full VP space at the KVM level and let KVM do 
the VP segmentation. 

Any how, I think that if we knew how much VPs we need to provision for 
when the KVM XIVE device is created, we would make a better use of the 
available space. Shouldn't we ?

Thanks,

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-30  6:20                     ` Paul Mackerras
@ 2019-01-30 15:54                       ` Cédric Le Goater
  2019-01-31  2:48                         ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-01-30 15:54 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/30/19 7:20 AM, Paul Mackerras wrote:
> On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
>> On 1/29/19 3:45 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
>>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
>>>>> Would we end up with too many VMAs if we just used mmap() to
>>>>> change the mappings from the software-generated pages to the
>>>>> hardware-generated interrupt pages?  
>>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
>>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
>>>
>>> Confused.  You say the number space has 32768 entries but then imply
>>> there are only 8K entries.  Do you mean that the API allows for 15-bit
>>> IRQ numbers but we are only making using of 8192 of them?
>>
>> Ouch. My bad. Let's do it again. 
>>
>> The sPAPR IRQ number space is 0x2000 wide :
>>
>> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
>>
>> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
>> devices (which can be extended if needed).
>>
>> So that's 8192 x 2 ESB pages.
>>
>>>> extend the last range if needed as these are for MSIs. Dynamic 
>>>> extensions under KVM should work also.
>>>>
>>>> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
>>>> lot of mmap(), too much. Also the KVM model needs to be compatible
>>>
>>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
>>> space would be covered by a single mmap, overlaid by subsequent mmaps
>>> where we need to map real device interrupts.
>>
>> ok. The same fault handler could be used to populate the VMA with the 
>> ESB pages. 
>>
>> But it would mean extra work on the QEMU side, which is not needed 
>> with this patch. 
> 
> Maybe, but just storing a single vma pointer in our private data is
> not a feasible approach.  First, you have no control on the lifetime
> of the vma and thus this is a use-after-free waiting to happen, and
> secondly, there could be multiple vmas that you need to worry about.

I fully agree. That's why I was uncomfortable with the solution. There 
are a few other drivers (GPUs if I recall) doing that but it feels wrong.

> Userspace could do multiple mmaps, or could do mprotect or similar on
> part of an existing mmap, which would split the vma for the mmap into
> multiple vmas.  You don't get notified about munmap either as far as I
> can tell, so the vma is liable to go away.  

yes ...

> And it doesn't matter if
> QEMU would never do such things; if userspace can provoke a
> use-after-free in the kernel using KVM then someone will write a
> specially crafted user program to do that.
> 
> So we either solve it in userspace, or we have to write and maintain
> complex kernel code with deep links into the MM subsystem.  
>
> I'd much rather solve it in userspace.

OK, then. I have been reluctant doing so but it seems there are no
other in-kernel solution. 

>>>> with the QEMU emulated one and it was simpler to have one overall
>>>> memory region for the IPI ESBs, one for the END ESBs (if we support
>>>> that one day) and one for the TIMA.
>>>>
>>>>> Are the necessary pages for a PCI
>>>>> passthrough device contiguous in both host real space 
>>>>
>>>> They should as they are the PHB4 ESBs.
>>>>
>>>>> and guest real space ? 
>>>>
>>>> also. That's how we organized the mapping. 
>>>
>>> "How we organized the mapping" is a significant design decision that I
>>> haven't seen documented anywhere, and is really needed for
>>> understanding what's going on.
>>
>> OK. I will add comments on that. See below for some description.
>>
>> There is nothing fancy, it's simply indexed with the interrupt number,
>> like for HW, or for the QEMU XIVE emulated model.
>>
>>>>> If so we'd only need one mmap() for all the device's interrupt
>>>>> pages.
>>>>
>>>> Ah. So we would want to make a special case for the passthrough 
>>>> device and have a mmap() and a memory region for its ESBs. Hmm.
>>>>
>>>> Wouldn't that require to hot plug a memory region under the guest ? 
>>>
>>> No; the way that a memory region works is that userspace can do
>>> whatever disparate mappings it likes within the region on the user
>>> process side, and the corresponding region of guest real address space
>>> follows that automatically.
>>
>> yes. I suppose this should work also for 'ram device' memory mappings.
>>
>> So when the passthrough device is added to the guest, we would add a 
>> new 'ram device' memory region for the device interrupt ESB pages 
>> that would overlap with the initial guest ESB pages.  
> 
> Not knowing the QEMU internals all that well, I don't at all
> understand why a new ram device is necesssary. 

'ram device' memory regions are of a special type which is used to 
directly map into the guest the result of a mmap() in QEMU. 

This is how we propagate the XIVE ESB pages from HW (and the TIMA) 
to the guest and the Linux kernel. It has other use with VFIO.   

> I would see it as a
> single virtual area mapping the ESB pages of guest hwirqs that are in
> use, and we manage those mappings with mmap and munmap.

Yes I think I understand the idea. I will give a try. I need to find the 
right place to do so in QEMU. See other email thread.

>> This is really a different approach.
>>
>>>> which means that we need to provision an address space/container 
>>>> region for theses regions. What are the benefits ? 
>>>>
>>>> Is clearing the PTEs and repopulating the VMA unsafe ? 
>>>
>>> Explicitly unmapping parts of the VMA seems like the wrong way to do
>>> it.  If you have a device mmap where the device wants to change the
>>> physical page underlying parts of the mapping, there should be a way
>>> for it to do that explicitly (but I don't know off the top of my head
>>> what the interface to do that is).
>>>
>>> However, I still haven't seen a clear and concise explanation of what
>>> is being changed, when, and why we need to do that.
>>
>> Yes. I agree on that. The problem is not very different from what we 
>> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
>>
>>
>> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
>> implement an IRQ space for the guest using the generic IPI interrupts 
>> of the XIVE IC controller. These interrupts are allocated at the OPAL
>> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
>> Interrupt management is performed in the XIVE way: using loads and 
>> stores on the addresses of the XIVE IPI interrupt ESB pages.
>>
>> Both KVM devices share the same internal structure caching information 
>> on the interrupts, among which the xive_irq_data struct containing the 
>> addresses of the IPI ESB pages and an extra one in case of passthrough. 
>> The later contains the addresses of the ESB pages of the underlying HW 
>> controller interrupts, PHB4 in all cases for now.    
>>
>> A guest when running in the XICS legacy interrupt mode lets the KVM 
>> XICS-over-XIVE device "handle" interrupt management, that is to perform  
>> the loads and stores on the addresses of the ESB pages of the guest 
>> interrupts. 
>>
>> However, when running in XIVE native exploitation mode, the KVM XIVE 
>> native device exposes the interrupt ESB pages to the guest and lets 
>> the guest perform directly the loads and stores. 
>>
>> The VMA exposing the ESB pages make use of a custom VM fault handler
>> which role is to populate the VMA with appropriate pages. When a fault
>> occurs, the guest IRQ number is deduced from the offset, and the ESB 
>> pages of associated XIVE IPI interrupt are inserted in the VMA (using
>> the internal structure caching information on the interrupts).
>>
>> Supporting device passthrough in the guest running in XIVE native 
>> exploitation mode adds some extra refinements because the ESB pages 
>> of a different HW controller (PHB4) need to be exposed to the guest 
>> along with the initial IPI ESB pages of the XIVE IC controller. But
>> the overall mechanic is the same. 
>>
>> When the device HW irqs are mapped into or unmapped from the guest
>> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
>> and kvmppc_xive_clr_mapped(), are called to record or clear the 
>> passthrough interrupt information and to perform the switch.
>>
>> The approach taken by this patch is to clear the ESB pages of the 
>> guest IRQ number being mapped and let the VM fault handler repopulate. 
>> The handler will insert the ESB page corresponding to the HW interrupt 
>> of the device being passed-through or the initial IPI ESB page if the
>> device is being removed.   
> 
> That's a much better write-up.  Thanks.

OK. I will reuse it when time comes.

Thanks,

C. 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support
  2019-01-30 15:54                       ` Cédric Le Goater
@ 2019-01-31  2:48                         ` Paul Mackerras
  0 siblings, 0 replies; 135+ messages in thread
From: Paul Mackerras @ 2019-01-31  2:48 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Wed, Jan 30, 2019 at 04:54:23PM +0100, Cédric Le Goater wrote:
> On 1/30/19 7:20 AM, Paul Mackerras wrote:
> > On Tue, Jan 29, 2019 at 02:47:55PM +0100, Cédric Le Goater wrote:
> >> On 1/29/19 3:45 AM, Paul Mackerras wrote:
> >>> On Mon, Jan 28, 2019 at 07:26:00PM +0100, Cédric Le Goater wrote:
> >>>> On 1/28/19 7:13 AM, Paul Mackerras wrote:
> >>>>> Would we end up with too many VMAs if we just used mmap() to
> >>>>> change the mappings from the software-generated pages to the
> >>>>> hardware-generated interrupt pages?  
> >>>> The sPAPR IRQ number space is 0x8000 wide now. The first 4K are 
> >>>> dedicated to CPU IPIs and the remaining 4K are for devices. We can 
> >>>
> >>> Confused.  You say the number space has 32768 entries but then imply
> >>> there are only 8K entries.  Do you mean that the API allows for 15-bit
> >>> IRQ numbers but we are only making using of 8192 of them?
> >>
> >> Ouch. My bad. Let's do it again. 
> >>
> >> The sPAPR IRQ number space is 0x2000 wide :
> >>
> >> https://git.qemu.org/?p=qemu.git;a=blob;f=hw/ppc/spapr_irq.c;h=1da7a32348fced0bd638717022fc37a83fc5e279;hb=HEAD#l396
> >>
> >> The first 4K are dedicated to the CPU IPIs and the remaining 4K are for 
> >> devices (which can be extended if needed).
> >>
> >> So that's 8192 x 2 ESB pages.
> >>
> >>>> extend the last range if needed as these are for MSIs. Dynamic 
> >>>> extensions under KVM should work also.
> >>>>
> >>>> This to say that we have with 8K x 2 (trigger+EOI) pages. This is a
> >>>> lot of mmap(), too much. Also the KVM model needs to be compatible
> >>>
> >>> I wasn't suggesting an mmap per IRQ, I meant that the bulk of the
> >>> space would be covered by a single mmap, overlaid by subsequent mmaps
> >>> where we need to map real device interrupts.
> >>
> >> ok. The same fault handler could be used to populate the VMA with the 
> >> ESB pages. 
> >>
> >> But it would mean extra work on the QEMU side, which is not needed 
> >> with this patch. 
> > 
> > Maybe, but just storing a single vma pointer in our private data is
> > not a feasible approach.  First, you have no control on the lifetime
> > of the vma and thus this is a use-after-free waiting to happen, and
> > secondly, there could be multiple vmas that you need to worry about.
> 
> I fully agree. That's why I was uncomfortable with the solution. There 
> are a few other drivers (GPUs if I recall) doing that but it feels wrong.

There is the HMM infrastructure (heterogeneous memory model) which
could possibly be used to do this, but it's very heavyweight for the
problem we have here.

> > Userspace could do multiple mmaps, or could do mprotect or similar on
> > part of an existing mmap, which would split the vma for the mmap into
> > multiple vmas.  You don't get notified about munmap either as far as I
> > can tell, so the vma is liable to go away.  
> 
> yes ...
> 
> > And it doesn't matter if
> > QEMU would never do such things; if userspace can provoke a
> > use-after-free in the kernel using KVM then someone will write a
> > specially crafted user program to do that.
> > 
> > So we either solve it in userspace, or we have to write and maintain
> > complex kernel code with deep links into the MM subsystem.  
> >
> > I'd much rather solve it in userspace.
> 
> OK, then. I have been reluctant doing so but it seems there are no
> other in-kernel solution. 

I discussed the problem today with David Gibson and he mentioned that
QEMU does have a lot of freedom in how it assigns the guest hwirq
numbers.  So you may be able to avoid the problem by (for example)
never assigning a hwirq to a VFIO device that has previously been used
for an emulated device (or something like that).

> >>>> with the QEMU emulated one and it was simpler to have one overall
> >>>> memory region for the IPI ESBs, one for the END ESBs (if we support
> >>>> that one day) and one for the TIMA.
> >>>>
> >>>>> Are the necessary pages for a PCI
> >>>>> passthrough device contiguous in both host real space 
> >>>>
> >>>> They should as they are the PHB4 ESBs.
> >>>>
> >>>>> and guest real space ? 
> >>>>
> >>>> also. That's how we organized the mapping. 
> >>>
> >>> "How we organized the mapping" is a significant design decision that I
> >>> haven't seen documented anywhere, and is really needed for
> >>> understanding what's going on.
> >>
> >> OK. I will add comments on that. See below for some description.
> >>
> >> There is nothing fancy, it's simply indexed with the interrupt number,
> >> like for HW, or for the QEMU XIVE emulated model.
> >>
> >>>>> If so we'd only need one mmap() for all the device's interrupt
> >>>>> pages.
> >>>>
> >>>> Ah. So we would want to make a special case for the passthrough 
> >>>> device and have a mmap() and a memory region for its ESBs. Hmm.
> >>>>
> >>>> Wouldn't that require to hot plug a memory region under the guest ? 
> >>>
> >>> No; the way that a memory region works is that userspace can do
> >>> whatever disparate mappings it likes within the region on the user
> >>> process side, and the corresponding region of guest real address space
> >>> follows that automatically.
> >>
> >> yes. I suppose this should work also for 'ram device' memory mappings.
> >>
> >> So when the passthrough device is added to the guest, we would add a 
> >> new 'ram device' memory region for the device interrupt ESB pages 
> >> that would overlap with the initial guest ESB pages.  
> > 
> > Not knowing the QEMU internals all that well, I don't at all
> > understand why a new ram device is necesssary. 
> 
> 'ram device' memory regions are of a special type which is used to 
> directly map into the guest the result of a mmap() in QEMU. 
> 
> This is how we propagate the XIVE ESB pages from HW (and the TIMA) 
> to the guest and the Linux kernel. It has other use with VFIO.   
> 
> > I would see it as a
> > single virtual area mapping the ESB pages of guest hwirqs that are in
> > use, and we manage those mappings with mmap and munmap.
> 
> Yes I think I understand the idea. I will give a try. I need to find the 
> right place to do so in QEMU. See other email thread.
> 
> >> This is really a different approach.
> >>
> >>>> which means that we need to provision an address space/container 
> >>>> region for theses regions. What are the benefits ? 
> >>>>
> >>>> Is clearing the PTEs and repopulating the VMA unsafe ? 
> >>>
> >>> Explicitly unmapping parts of the VMA seems like the wrong way to do
> >>> it.  If you have a device mmap where the device wants to change the
> >>> physical page underlying parts of the mapping, there should be a way
> >>> for it to do that explicitly (but I don't know off the top of my head
> >>> what the interface to do that is).
> >>>
> >>> However, I still haven't seen a clear and concise explanation of what
> >>> is being changed, when, and why we need to do that.
> >>
> >> Yes. I agree on that. The problem is not very different from what we 
> >> have today with the XICS-over-XIVE glue in KVM. Let me try to explain.
> >>
> >>
> >> The KVM XICS-over-XIVE device and the proposed KVM XIVE native device 
> >> implement an IRQ space for the guest using the generic IPI interrupts 
> >> of the XIVE IC controller. These interrupts are allocated at the OPAL
> >> level and "mapped" into the guest IRQ number space in the range 0-0x1FFF.
> >> Interrupt management is performed in the XIVE way: using loads and 
> >> stores on the addresses of the XIVE IPI interrupt ESB pages.
> >>
> >> Both KVM devices share the same internal structure caching information 
> >> on the interrupts, among which the xive_irq_data struct containing the 
> >> addresses of the IPI ESB pages and an extra one in case of passthrough. 
> >> The later contains the addresses of the ESB pages of the underlying HW 
> >> controller interrupts, PHB4 in all cases for now.    
> >>
> >> A guest when running in the XICS legacy interrupt mode lets the KVM 
> >> XICS-over-XIVE device "handle" interrupt management, that is to perform  
> >> the loads and stores on the addresses of the ESB pages of the guest 
> >> interrupts. 
> >>
> >> However, when running in XIVE native exploitation mode, the KVM XIVE 
> >> native device exposes the interrupt ESB pages to the guest and lets 
> >> the guest perform directly the loads and stores. 
> >>
> >> The VMA exposing the ESB pages make use of a custom VM fault handler
> >> which role is to populate the VMA with appropriate pages. When a fault
> >> occurs, the guest IRQ number is deduced from the offset, and the ESB 
> >> pages of associated XIVE IPI interrupt are inserted in the VMA (using
> >> the internal structure caching information on the interrupts).
> >>
> >> Supporting device passthrough in the guest running in XIVE native 
> >> exploitation mode adds some extra refinements because the ESB pages 
> >> of a different HW controller (PHB4) need to be exposed to the guest 
> >> along with the initial IPI ESB pages of the XIVE IC controller. But
> >> the overall mechanic is the same. 
> >>
> >> When the device HW irqs are mapped into or unmapped from the guest
> >> IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped()
> >> and kvmppc_xive_clr_mapped(), are called to record or clear the 
> >> passthrough interrupt information and to perform the switch.
> >>
> >> The approach taken by this patch is to clear the ESB pages of the 
> >> guest IRQ number being mapped and let the VM fault handler repopulate. 
> >> The handler will insert the ESB page corresponding to the HW interrupt 
> >> of the device being passed-through or the initial IPI ESB page if the
> >> device is being removed.   
> > 
> > That's a much better write-up.  Thanks.
> 
> OK. I will reuse it when time comes.
> 
> Thanks,
> 
> C. 

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-30  7:01         ` Cédric Le Goater
@ 2019-01-31  3:01           ` Paul Mackerras
  2019-02-01 17:03             ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-01-31  3:01 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Wed, Jan 30, 2019 at 08:01:22AM +0100, Cédric Le Goater wrote:
> On 1/30/19 5:29 AM, Paul Mackerras wrote:
> > On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
> >> On 1/22/19 6:05 AM, Paul Mackerras wrote:
> >>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >>>> This is the basic framework for the new KVM device supporting the XIVE
> >>>> native exploitation mode. The user interface exposes a new capability
> >>>> and a new KVM device to be used by QEMU.
> >>>
> >>> [snip]
> >>>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>>>  #ifdef CONFIG_KVM_XIVE
> >>>>  	if (xive_enabled()) {
> >>>>  		kvmppc_xive_init_module();
> >>>> +		kvmppc_xive_native_init_module();
> >>>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >>>> +		kvm_register_device_ops(&kvm_xive_native_ops,
> >>>> +					KVM_DEV_TYPE_XIVE);
> >>>
> >>> I think we want tighter conditions on initializing the xive_native
> >>> stuff and creating the xive device class.  We could have
> >>> xive_enabled() returning true in a guest, and this code will get
> >>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
> >>> are running bare metal).
> >>
> >> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
> >> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
> >>
> >> 	if (xive_enabled() && !kvmhv_on_pseries()) {
> >>
> >> for now.
> >>
> >> As this is a problem today in 5.0.x, I will send a patch for it if you think
> > 
> > How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
> > with kernel_irqchip=on in a nested guest and it works just fine.  What
> > exactly did you test?
> 
> L0: Linux 5.0.0-rc3 (+ KVM HV)
> L1:     QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
> L2:          QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3
> 
> L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
> it does an OPAL call and its running under SLOF. See below.

OK, you must have a QEMU that advertises XIVE to the guest (L1).  In
that case I can see that L1 would try to do XICS-on-XIVE, which won't
work.  We need to fix that.  Unfortunately the XICS-on-XICS emulation
won't work as is in L1 either, but I think we can fix that by
disabling the real-mode XICS hcall handling.

> I don't understand how L2 can work with kernel_irqchip=on. Could you
> please explain ? 

If QEMU decides to advertise XIVE to the L2 guest and the L2 guest can
do XIVE, then the only possibility is to use the XIVE software
emulation in QEMU, and if kernel_irqchip=on has been specified
explicitly, maybe QEMU decides to terminate the guest rather than
implicitly turning off kernel_irqchip.

If QEMU decides not to advertise XIVE to the L2 guest, or the L2 guest
can't do XIVE, then we could use the XICS-on-XICS emulation in L1 as
long as either (a) L1 is not using XIVE, or (b) we modify the
XICS-on-XICS code to avoid using any XICS or XIVE access (i.e. just
using calls to generic kernel facilities).

Ultimately, if the spapr xive backend code in the kernel could be
extended to provide all the low-level functions that the XICS-on-XIVE
code needs, then we could do XICS-on-XIVE in a guest.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-31  3:01           ` Paul Mackerras
@ 2019-02-01 17:03             ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-01 17:03 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 1/31/19 4:01 AM, Paul Mackerras wrote:
> On Wed, Jan 30, 2019 at 08:01:22AM +0100, Cédric Le Goater wrote:
>> On 1/30/19 5:29 AM, Paul Mackerras wrote:
>>> On Mon, Jan 28, 2019 at 06:35:34PM +0100, Cédric Le Goater wrote:
>>>> On 1/22/19 6:05 AM, Paul Mackerras wrote:
>>>>> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>>>>>> This is the basic framework for the new KVM device supporting the XIVE
>>>>>> native exploitation mode. The user interface exposes a new capability
>>>>>> and a new KVM device to be used by QEMU.
>>>>>
>>>>> [snip]
>>>>>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>>>>>  #ifdef CONFIG_KVM_XIVE
>>>>>>  	if (xive_enabled()) {
>>>>>>  		kvmppc_xive_init_module();
>>>>>> +		kvmppc_xive_native_init_module();
>>>>>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>>>>>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>>>>>> +					KVM_DEV_TYPE_XIVE);
>>>>>
>>>>> I think we want tighter conditions on initializing the xive_native
>>>>> stuff and creating the xive device class.  We could have
>>>>> xive_enabled() returning true in a guest, and this code will get
>>>>> called both by PR KVM and HV KVM (and HV KVM no longer implies that we
>>>>> are running bare metal).
>>>>
>>>> So yes, I gave nested a try with kernel_irqchip=on and the nested hypervisor 
>>>> (L1) obviously crashes trying to call OPAL. I have tighten the test with : 
>>>>
>>>> 	if (xive_enabled() && !kvmhv_on_pseries()) {
>>>>
>>>> for now.
>>>>
>>>> As this is a problem today in 5.0.x, I will send a patch for it if you think
>>>
>>> How do you mean this is a problem today in 5.0?  I just tried 5.0-rc1
>>> with kernel_irqchip=on in a nested guest and it works just fine.  What
>>> exactly did you test?
>>
>> L0: Linux 5.0.0-rc3 (+ KVM HV)
>> L1:     QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3 (+ KVM HV)
>> L2:          QEMU pseries-4.0 (kernel_irqchip=on) - Linux 5.0.0-rc3
>>
>> L1 crashes when L2 starts and tries to initialize the KVM IRQ device as 
>> it does an OPAL call and its running under SLOF. See below.
> 
> OK, you must have a QEMU that advertises XIVE to the guest (L1). 

XIVE is not advertised if QEMU is started with 'ic-mode=xics' 

> In
> that case I can see that L1 would try to do XICS-on-XIVE, which won't
> work.  We need to fix that.  Unfortunately the XICS-on-XICS emulation
> won't work as is in L1 either, but I think we can fix that by
> disabling the real-mode XICS hcall handling.

I have added some tests on kvm-hv, using kvmhv_on_pseries(), to disable 
the KVM XICS-on-XIVE device in a L1 guest running as hypervisor and 
to instead register the old KVM XICS device. 

If the L1 is started in KVM XICS mode, L2 can now run with KVM XICS.
All seem fine. I booted two guests with disk and network. 

But I am still "a bit" confused with what is being done at each 
hypervisor level. It's not obvious to follow at all even with traces.
 
>> I don't understand how L2 can work with kernel_irqchip=on. Could you
>> please explain ? 
> 
> If QEMU decides to advertise XIVE to the L2 guest and the L2 guest can
> do XIVE, then the only possibility is to use the XIVE software
> emulation in QEMU, and if kernel_irqchip=on has been specified
> explicitly, maybe QEMU decides to terminate the guest rather than
> implicitly turning off kernel_irqchip.

we can do that by disabling the KVM XIVE device when under kvmhv_on_pseries().

> If QEMU decides not to advertise XIVE to the L2 guest, or the L2 guest
> can't do XIVE, then we could use the XICS-on-XICS emulation in L1 as
> long as either (a) L1 is not using XIVE, or (b) we modify the
> XICS-on-XICS code to avoid using any XICS or XIVE access (i.e. just
> using calls to generic kernel facilities).

(a) is what I did above I think

May be we should consider having nested version of the KVM devices 
when under kvmhv_on_pseries(). With some sort of backend ops to
modify the relation with the parent hypervisor : PowerNV/Linux or 
pseries/Linux. 

> Ultimately, if the spapr xive backend code in the kernel could be
> extended to provide all the low-level functions that the XICS-on-XIVE
> code needs, then we could do XICS-on-XIVE in a guest.

What about a XIVE on XIVE ? 

Propagating the ESB pages to a nested guest seems feasible if not 
already done. The hcalls could be forwarded to the L1 QEMU ? The 
problematic part is handling the XIVE VP block.

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
  2019-01-23 16:24     ` Cédric Le Goater
@ 2019-02-04  0:50       ` David Gibson
  2019-02-04 10:16         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  0:50 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2742 bytes --]

On Wed, Jan 23, 2019 at 05:24:13PM +0100, Cédric Le Goater wrote:
> On 1/22/19 5:56 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
> >> We will have different KVM devices for interrupts, one for the
> >> XICS-over-XIVE mode and one for the XIVE native exploitation
> >> mode. Let's add some checks to make sure we are not mixing the
> >> interfaces in KVM.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  arch/powerpc/kvm/book3s_xive.c | 6 ++++++
> >>  1 file changed, 6 insertions(+)
> >>
> >> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
> >> index f78d002f0fe0..8a4fa45f07f8 100644
> >> --- a/arch/powerpc/kvm/book3s_xive.c
> >> +++ b/arch/powerpc/kvm/book3s_xive.c
> >> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
> >>  {
> >>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >>  
> >> +	if (!kvmppc_xics_enabled(vcpu))
> >> +		return -EPERM;
> >> +
> >>  	if (!xc)
> >>  		return 0;
> >>  
> >> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
> >>  	u8 cppr, mfrr;
> >>  	u32 xisr;
> >>  
> >> +	if (!kvmppc_xics_enabled(vcpu))
> >> +		return -EPERM;
> >> +
> >>  	if (!xc || !xive)
> >>  		return -ENOENT;
> > 
> > I can't see how these new checks could ever trigger in the code as it
> > stands.  Is there a way at present? 
> 
> It would require some custom QEMU doing silly things : create the XICS 
> KVM device, and then call kvm_get_one_reg(KVM_REG_PPC_ICP_STATE) or 
> kvm_set_one_reg(icp->cs, KVM_REG_PPC_ICP_STATE) without connecting the
> vCPU to its presenter. 
> 
> Today, you get a ENOENT.

TBH, ENOENT seems fine to me.

> > Do following patches ever add a path where the new checks could trigger, 
> > or is this just an excess of caution? 
> 
> With the following patches, QEMU could to do something even more silly,
> which is to mix the interrupt mode interfaces : create a KVM XICS device    
> and call KVM CPU ioctls of the KVM XIVE device, or the opposite.

AFAICT, like above, that won't really differ from calling the XIVE CPU
ioctl()s when no irqchip is set up at all, and should be covered by
just a !xive check.

> 
> > (Your patch description should ideally have answered these questions > for me.)
> 
> Yes. I also think that I introduced this patch to early in the series.
> It make more sense when the XICS and the XIVE KVM devices are available.  
> 
> Thanks,
> 
> C.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-01-07 18:43 ` [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
  2019-01-22  5:05   ` Paul Mackerras
@ 2019-02-04  4:25   ` David Gibson
  2019-02-04 11:19     ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  4:25 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 17236 bytes --]

On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> This is the basic framework for the new KVM device supporting the XIVE
> native exploitation mode. The user interface exposes a new capability
> and a new KVM device to be used by QEMU.
> 
> Internally, the interface to the new KVM device is protected with a
> new interrupt mode: KVMPPC_IRQ_XIVE.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/asm/kvm_host.h   |   2 +
>  arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
>  arch/powerpc/kvm/book3s_xive.h        |   3 +
>  include/uapi/linux/kvm.h              |   3 +
>  arch/powerpc/kvm/book3s.c             |   7 +-
>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
>  arch/powerpc/kvm/powerpc.c            |  30 +++
>  arch/powerpc/kvm/Makefile             |   2 +-
>  8 files changed, 398 insertions(+), 2 deletions(-)
>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
> 
> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> index 0f98f00da2ea..c522e8274ad9 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
>  struct kvmppc_xive;
>  struct kvmppc_xive_vcpu;
>  extern struct kvm_device_ops kvm_xive_ops;
> +extern struct kvm_device_ops kvm_xive_native_ops;
>  
>  struct kvmppc_passthru_irqmap;
>  
> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
>  #define KVMPPC_IRQ_DEFAULT	0
>  #define KVMPPC_IRQ_MPIC		1
>  #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
> +#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
>  
>  #define MMIO_HPTE_CACHE_SIZE	4
>  
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index eb0d79f0ca45..1bb313f238fe 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>  			       int level, bool line_status);
>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
> +
> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
> +}
> +
> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +				    struct kvm_vcpu *vcpu, u32 cpu);
> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> +extern void kvmppc_xive_native_init_module(void);
> +extern void kvmppc_xive_native_exit_module(void);
> +
>  #else
>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>  				       u32 priority) { return -1; }
> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>  				      int level, bool line_status) { return -ENODEV; }
>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
> +
> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> +	{ return 0; }
> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
> +static inline void kvmppc_xive_native_init_module(void) { }
> +static inline void kvmppc_xive_native_exit_module(void) { }
> +
>  #endif /* CONFIG_KVM_XIVE */
>  
>  /*
> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> index 10c4aa5cd010..5f22415520b4 100644
> --- a/arch/powerpc/kvm/book3s_xive.h
> +++ b/arch/powerpc/kvm/book3s_xive.h
> @@ -12,6 +12,9 @@
>  #ifdef CONFIG_KVM_XICS
>  #include "book3s_xics.h"
>  
> +#define KVMPPC_XIVE_FIRST_IRQ	0
> +#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
> +
>  /*
>   * State for one guest irq source.
>   *
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 6d4ea4b6c922..52bf74a1616e 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
>  #define KVM_CAP_HYPERV_CPUID 167
> +#define KVM_CAP_PPC_IRQ_XIVE 168
>  
>  #ifdef KVM_CAP_IRQ_ROUTING
>  
> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
> +	KVM_DEV_TYPE_XIVE,
> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>  	KVM_DEV_TYPE_MAX,
>  };
>  
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index bd1a677dd9e4..de7eed191107 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>  #ifdef CONFIG_KVM_XIVE
>  	if (xive_enabled()) {
>  		kvmppc_xive_init_module();
> +		kvmppc_xive_native_init_module();
>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> +		kvm_register_device_ops(&kvm_xive_native_ops,
> +					KVM_DEV_TYPE_XIVE);
>  	} else
>  #endif
>  		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>  static void kvmppc_book3s_exit(void)
>  {
>  #ifdef CONFIG_KVM_XICS
> -	if (xive_enabled())
> +	if (xive_enabled()) {
>  		kvmppc_xive_exit_module();
> +		kvmppc_xive_native_exit_module();
> +	}
>  #endif
>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
>  	kvmppc_book3s_exit_pr();
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> new file mode 100644
> index 000000000000..115143e76c45
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -0,0 +1,332 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2017-2019, IBM Corporation.
> + */
> +
> +#define pr_fmt(fmt) "xive-kvm: " fmt
> +
> +#include <linux/anon_inodes.h>
> +#include <linux/kernel.h>
> +#include <linux/kvm_host.h>
> +#include <linux/err.h>
> +#include <linux/gfp.h>
> +#include <linux/spinlock.h>
> +#include <linux/delay.h>
> +#include <linux/percpu.h>
> +#include <linux/cpumask.h>
> +#include <asm/uaccess.h>
> +#include <asm/kvm_book3s.h>
> +#include <asm/kvm_ppc.h>
> +#include <asm/hvcall.h>
> +#include <asm/xics.h>
> +#include <asm/xive.h>
> +#include <asm/xive-regs.h>
> +#include <asm/debug.h>
> +#include <asm/debugfs.h>
> +#include <asm/time.h>
> +#include <asm/opal.h>
> +
> +#include <linux/debugfs.h>
> +#include <linux/seq_file.h>
> +
> +#include "book3s_xive.h"
> +
> +static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	struct xive_q *q = &xc->queues[prio];
> +
> +	xive_native_disable_queue(xc->vp_id, q, prio);
> +	if (q->qpage) {
> +		put_page(virt_to_page(q->qpage));
> +		q->qpage = NULL;
> +	}
> +}
> +
> +void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	int i;
> +
> +	if (!kvmppc_xive_enabled(vcpu))
> +		return;
> +
> +	if (!xc)
> +		return;
> +
> +	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
> +
> +	/* Ensure no interrupt is still routed to that VP */
> +	xc->valid = false;
> +	kvmppc_xive_disable_vcpu_interrupts(vcpu);
> +
> +	/* Disable the VP */
> +	xive_native_disable_vp(xc->vp_id);
> +
> +	/* Free the queues & associated interrupts */
> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
> +		/* Free the escalation irq */
> +		if (xc->esc_virq[i]) {
> +			free_irq(xc->esc_virq[i], vcpu);
> +			irq_dispose_mapping(xc->esc_virq[i]);
> +			kfree(xc->esc_virq_names[i]);
> +			xc->esc_virq[i] = 0;
> +		}
> +
> +		/* Free the queue */
> +		xive_native_cleanup_queue(vcpu, i);
> +	}
> +
> +	/* Free the VP */
> +	kfree(xc);
> +
> +	/* Cleanup the vcpu */
> +	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
> +	vcpu->arch.xive_vcpu = NULL;
> +}
> +
> +int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> +				    struct kvm_vcpu *vcpu, u32 cpu)

Why do we need both a *vcpu and a cpu number as an integer?

> +{
> +	struct kvmppc_xive *xive = dev->private;
> +	struct kvmppc_xive_vcpu *xc;
> +	int rc;
> +
> +	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
> +
> +	if (dev->ops != &kvm_xive_native_ops) {
> +		pr_devel("Wrong ops !\n");
> +		return -EPERM;
> +	}
> +	if (xive->kvm != vcpu->kvm)
> +		return -EPERM;
> +	if (vcpu->arch.irq_type)

Please use an explicit == / != here so we don't have to remember which
symbolic value corresponds to 0.

> +		return -EBUSY;
> +	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
> +		pr_devel("Duplicate !\n");
> +		return -EEXIST;
> +	}
> +	if (cpu >= KVM_MAX_VCPUS) {
> +		pr_devel("Out of bounds !\n");
> +		return -EINVAL;
> +	}
> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> +	if (!xc)
> +		return -ENOMEM;
> +
> +	mutex_lock(&vcpu->kvm->lock);
> +	vcpu->arch.xive_vcpu = xc;
> +	xc->xive = xive;
> +	xc->vcpu = vcpu;
> +	xc->server_num = cpu;
> +	xc->vp_id = xive->vp_base + cpu;
> +	xc->valid = true;
> +
> +	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
> +	if (rc) {
> +		pr_err("Failed to get VP info from OPAL: %d\n", rc);
> +		goto bail;
> +	}
> +
> +	/*
> +	 * Enable the VP first as the single escalation mode will
> +	 * affect escalation interrupts numbering
> +	 */
> +	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
> +	if (rc) {
> +		pr_err("Failed to enable VP in OPAL: %d\n", rc);
> +		goto bail;
> +	}
> +
> +	/* Configure VCPU fields for use by assembly push/pull */
> +	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
> +	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
> +
> +	/* TODO: initialize queues ? */
> +
> +bail:
> +	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
> +	mutex_unlock(&vcpu->kvm->lock);
> +	if (rc)
> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> +
> +	return rc;
> +}
> +
> +static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
> +				       struct kvm_device_attr *attr)
> +{
> +	return -ENXIO;
> +}
> +
> +static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
> +				       struct kvm_device_attr *attr)
> +{
> +	return -ENXIO;
> +}
> +
> +static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
> +				       struct kvm_device_attr *attr)
> +{
> +	return -ENXIO;
> +}
> +
> +static void kvmppc_xive_native_free(struct kvm_device *dev)
> +{
> +	struct kvmppc_xive *xive = dev->private;
> +	struct kvm *kvm = xive->kvm;
> +	int i;
> +
> +	debugfs_remove(xive->dentry);
> +
> +	pr_devel("Destroying xive native for partition\n");
> +
> +	if (kvm)
> +		kvm->arch.xive = NULL;
> +
> +	/* Mask and free interrupts */
> +	for (i = 0; i <= xive->max_sbid; i++) {
> +		if (xive->src_blocks[i])
> +			kvmppc_xive_free_sources(xive->src_blocks[i]);
> +		kfree(xive->src_blocks[i]);
> +		xive->src_blocks[i] = NULL;
> +	}
> +
> +	if (xive->vp_base != XIVE_INVALID_VP)
> +		xive_native_free_vp_block(xive->vp_base);
> +
> +	kfree(xive);
> +	kfree(dev);
> +}
> +
> +static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
> +{
> +	struct kvmppc_xive *xive;
> +	struct kvm *kvm = dev->kvm;
> +	int ret = 0;
> +
> +	pr_devel("Creating xive native for partition\n");
> +
> +	if (kvm->arch.xive)
> +		return -EEXIST;
> +
> +	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
> +	if (!xive)
> +		return -ENOMEM;
> +
> +	dev->private = xive;
> +	xive->dev = dev;
> +	xive->kvm = kvm;
> +	kvm->arch.xive = xive;
> +
> +	/* We use the default queue size set by the host */
> +	xive->q_order = xive_native_default_eq_shift();
> +	if (xive->q_order < PAGE_SHIFT)
> +		xive->q_page_order = 0;
> +	else
> +		xive->q_page_order = xive->q_order - PAGE_SHIFT;
> +
> +	/* Allocate a bunch of VPs */
> +	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> +	pr_devel("VP_Base=%x\n", xive->vp_base);
> +
> +	if (xive->vp_base == XIVE_INVALID_VP)
> +		ret = -ENOMEM;
> +
> +	xive->single_escalation = xive_native_has_single_escalation();
> +
> +	if (ret)
> +		kfree(xive);
> +
> +	return ret;
> +}
> +
> +static int xive_native_debug_show(struct seq_file *m, void *private)
> +{
> +	struct kvmppc_xive *xive = m->private;
> +	struct kvm *kvm = xive->kvm;
> +	struct kvm_vcpu *vcpu;
> +	unsigned int i;
> +
> +	if (!kvm)
> +		return 0;
> +
> +	seq_puts(m, "=========\nVCPU state\n=========\n");
> +
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +
> +		if (!xc)
> +			continue;
> +
> +		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
> +			   xc->server_num,
> +			   vcpu->arch.xive_saved_state.nsr,
> +			   vcpu->arch.xive_saved_state.cppr,
> +			   vcpu->arch.xive_saved_state.ipb,
> +			   vcpu->arch.xive_saved_state.pipr,
> +			   vcpu->arch.xive_saved_state.w01,
> +			   (u32) vcpu->arch.xive_cam_word);
> +
> +		kvmppc_xive_debug_show_queues(m, vcpu);
> +	}
> +
> +	return 0;
> +}
> +
> +static int xive_native_debug_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, xive_native_debug_show, inode->i_private);
> +}
> +
> +static const struct file_operations xive_native_debug_fops = {
> +	.open = xive_native_debug_open,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = single_release,
> +};
> +
> +static void xive_native_debugfs_init(struct kvmppc_xive *xive)
> +{
> +	char *name;
> +
> +	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
> +	if (!name) {
> +		pr_err("%s: no memory for name\n", __func__);
> +		return;
> +	}
> +
> +	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
> +					   xive, &xive_native_debug_fops);
> +
> +	pr_debug("%s: created %s\n", __func__, name);
> +	kfree(name);
> +}
> +
> +static void kvmppc_xive_native_init(struct kvm_device *dev)
> +{
> +	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
> +
> +	/* Register some debug interfaces */
> +	xive_native_debugfs_init(xive);
> +}
> +
> +struct kvm_device_ops kvm_xive_native_ops = {
> +	.name = "kvm-xive-native",
> +	.create = kvmppc_xive_native_create,
> +	.init = kvmppc_xive_native_init,
> +	.destroy = kvmppc_xive_native_free,
> +	.set_attr = kvmppc_xive_native_set_attr,
> +	.get_attr = kvmppc_xive_native_get_attr,
> +	.has_attr = kvmppc_xive_native_has_attr,
> +};
> +
> +void kvmppc_xive_native_init_module(void)
> +{
> +	;
> +}
> +
> +void kvmppc_xive_native_exit_module(void)
> +{
> +	;
> +}
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index b90a7d154180..01d526e15e9d 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>  	case KVM_CAP_PPC_ENABLE_HCALL:
>  #ifdef CONFIG_KVM_XICS
>  	case KVM_CAP_IRQ_XICS:
> +#endif
> +#ifdef CONFIG_KVM_XIVE
> +	case KVM_CAP_PPC_IRQ_XIVE:
>  #endif
>  	case KVM_CAP_PPC_GET_CPU_CHAR:
>  		r = 1;
> @@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>  		else
>  			kvmppc_xics_free_icp(vcpu);
>  		break;
> +	case KVMPPC_IRQ_XIVE:
> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> +		break;
>  	}
>  
>  	kvmppc_core_vcpu_free(vcpu);
> @@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>  		break;
>  	}
>  #endif /* CONFIG_KVM_XICS */
> +#ifdef CONFIG_KVM_XIVE
> +	case KVM_CAP_PPC_IRQ_XIVE: {
> +		struct fd f;
> +		struct kvm_device *dev;
> +
> +		r = -EBADF;
> +		f = fdget(cap->args[0]);
> +		if (!f.file)
> +			break;
> +
> +		r = -ENXIO;
> +		if (!xive_enabled())
> +			break;
> +
> +		r = -EPERM;
> +		dev = kvm_device_from_filp(f.file);
> +		if (dev)
> +			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
> +							    cap->args[1]);
> +
> +		fdput(f);
> +		break;
> +	}
> +#endif /* CONFIG_KVM_XIVE */
>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>  	case KVM_CAP_PPC_FWNMI:
>  		r = -EINVAL;
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index 64f1135e7732..806cbe488410 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -99,7 +99,7 @@ endif
>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>  	book3s_xics.o
>  
> -kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
> +kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
>  kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
>  
>  kvm-book3s_64-module-objs := \

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-01-07 18:43 ` [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device Cédric Le Goater
  2019-01-22  5:09   ` Paul Mackerras
@ 2019-02-04  4:45   ` David Gibson
  2019-02-04 11:30     ` Cédric Le Goater
  1 sibling, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  4:45 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4916 bytes --]

On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> This will let the guest create a memory mapping to expose the ESB MMIO
> regions used to control the interrupt sources, to trigger events, to
> EOI or to turn off the sources.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>  2 files changed, 101 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 8c876c166ef2..6bb61ba141c2 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>  
> +/* POWER9 XIVE Native Interrupt Controller */
> +#define KVM_DEV_XIVE_GRP_CTRL		1
> +#define   KVM_DEV_XIVE_GET_ESB_FD	1

Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
to both with an mmap() directly on the xive device fd?  Using the
offset to distinguish which one to map, obviously.

>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 115143e76c45..e20081f0c8d4 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>  	return rc;
>  }
>  
> +static int xive_native_esb_fault(struct vm_fault *vmf)
> +{
> +	struct vm_area_struct *vma = vmf->vma;
> +	struct kvmppc_xive *xive = vma->vm_file->private_data;
> +	struct kvmppc_xive_src_block *sb;
> +	struct kvmppc_xive_irq_state *state;
> +	struct xive_irq_data *xd;
> +	u32 hw_num;
> +	u16 src;
> +	u64 page;
> +	unsigned long irq;
> +
> +	/*
> +	 * Linux/KVM uses a two pages ESB setting, one for trigger and
> +	 * one for EOI
> +	 */
> +	irq = vmf->pgoff / 2;
> +
> +	sb = kvmppc_xive_find_source(xive, irq, &src);
> +	if (!sb) {
> +		pr_err("%s: source %lx not found !\n", __func__, irq);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	state = &sb->irq_state[src];
> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
> +
> +	arch_spin_lock(&sb->lock);
> +
> +	/*
> +	 * first/even page is for trigger
> +	 * second/odd page is for EOI and management.
> +	 */
> +	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
> +	arch_spin_unlock(&sb->lock);
> +
> +	if (!page) {
> +		pr_err("%s: acessing invalid ESB page for source %lx !\n",
> +		       __func__, irq);
> +		return VM_FAULT_SIGBUS;
> +	}
> +
> +	vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
> +	return VM_FAULT_NOPAGE;
> +}
> +
> +static const struct vm_operations_struct xive_native_esb_vmops = {
> +	.fault = xive_native_esb_fault,
> +};
> +
> +static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
> +{
> +	/* There are two ESB pages (trigger and EOI) per IRQ */
> +	if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
> +		return -EINVAL;
> +
> +	vma->vm_flags |= VM_IO | VM_PFNMAP;
> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> +	vma->vm_ops = &xive_native_esb_vmops;
> +	return 0;
> +}
> +
> +static const struct file_operations xive_native_esb_fops = {
> +	.mmap = xive_native_esb_mmap,
> +};
> +
> +static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr)
> +{
> +	u64 __user *ubufp = (u64 __user *) addr;
> +	int ret;
> +
> +	ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive,
> +				O_RDWR | O_CLOEXEC);
> +	if (ret < 0)
> +		return ret;
> +
> +	return put_user(ret, ubufp);
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> @@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> +	struct kvmppc_xive *xive = dev->private;
> +
> +	switch (attr->group) {
> +	case KVM_DEV_XIVE_GRP_CTRL:
> +		switch (attr->attr) {
> +		case KVM_DEV_XIVE_GET_ESB_FD:
> +			return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
> +		}
> +		break;
> +	}
>  	return -ENXIO;
>  }
>  
>  static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> +	switch (attr->group) {
> +	case KVM_DEV_XIVE_GRP_CTRL:
> +		switch (attr->attr) {
> +		case KVM_DEV_XIVE_GET_ESB_FD:
> +			return 0;
> +		}
> +		break;
> +	}
>  	return -ENXIO;
>  }
>  

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  2019-01-23 16:56     ` Cédric Le Goater
@ 2019-02-04  4:49       ` David Gibson
  2019-02-04 15:36         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  4:49 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1630 bytes --]

On Wed, Jan 23, 2019 at 05:56:26PM +0100, Cédric Le Goater wrote:
> On 1/22/19 6:14 AM, Paul Mackerras wrote:
> > On Mon, Jan 07, 2019 at 07:43:20PM +0100, Cédric Le Goater wrote:
> >> The ESB MMIO region controls the interrupt sources of the guest. QEMU
> >> will query an fd (GET_ESB_FD ioctl) and map this region at a specific
> >> address for the guest to use. The guest will obtain this information
> >> using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
> >> setting used by QEMU, add a VC_BASE control to the KVM XIVE device
> > 
> > This needs a little more explanation.  I *think* the only way this
> > gets used is that it gets returned to the guest by the new
> > hypercalls.  If that is indeed the case it would be useful to mention
> > that in the patch description, because otherwise taking a value that
> > userspace provides and which looks like it is an address, and not
> > doing any validation on it, looks a bit scary.
> 
> I think we have solved this problem in another email thread. 
> 
> The H_INT_GET_SOURCE_INFO hcall does not need to be implemented in KVM
> as all the source information should already be available in QEMU. In
> that case, there is no need to inform KVM of where the ESB pages are 
> mapped in the guest address space. So we don't need that extra control
> on the KVM device. This is good news.

Ah, good to hear.  I thought this looked strange.


-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
  2019-01-07 18:43 ` [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE " Cédric Le Goater
@ 2019-02-04  4:57   ` David Gibson
  2019-02-04 19:07     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  4:57 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6693 bytes --]

On Mon, Jan 07, 2019 at 07:43:21PM +0100, Cédric Le Goater wrote:
> Interrupt sources are simply created at the OPAL level and then
> MASKED. KVM only needs to know about their type: LSI or MSI.

This commit message isn't very illuminating.

> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h           |  5 +
>  arch/powerpc/kvm/book3s_xive_native.c         | 98 +++++++++++++++++++
>  .../powerpc/kvm/book3s_xive_native_template.c | 27 +++++
>  3 files changed, 130 insertions(+)
>  create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 8b78b12aa118..6fc9660c5aec 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -680,5 +680,10 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_ESB_FD	1
>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>  #define   KVM_DEV_XIVE_VC_BASE		3
> +#define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> +
> +/* Layout of 64-bit XIVE source attribute values */
> +#define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> +#define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>  
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 29a62914de55..2518640d4a58 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -31,6 +31,24 @@
>  
>  #include "book3s_xive.h"
>  
> +/*
> + * We still instantiate them here because we use some of the
> + * generated utility functions as well in this file.

And this comment is downright cryptic.

> + */
> +#define XIVE_RUNTIME_CHECKS
> +#define X_PFX xive_vm_
> +#define X_STATIC static
> +#define X_STAT_PFX stat_vm_
> +#define __x_tima		xive_tima
> +#define __x_eoi_page(xd)	((void __iomem *)((xd)->eoi_mmio))
> +#define __x_trig_page(xd)	((void __iomem *)((xd)->trig_mmio))
> +#define __x_writeb	__raw_writeb
> +#define __x_readw	__raw_readw
> +#define __x_readq	__raw_readq
> +#define __x_writeq	__raw_writeq
> +
> +#include "book3s_xive_native_template.c"
> +
>  static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
>  {
>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> @@ -305,6 +323,78 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
>  	return put_user(ret, ubufp);
>  }
>  
> +static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
> +					 u64 addr)
> +{
> +	struct kvmppc_xive_src_block *sb;
> +	struct kvmppc_xive_irq_state *state;
> +	u64 __user *ubufp = (u64 __user *) addr;
> +	u64 val;
> +	u16 idx;
> +
> +	pr_devel("%s irq=0x%lx\n", __func__, irq);
> +
> +	if (irq < KVMPPC_XIVE_FIRST_IRQ || irq >= KVMPPC_XIVE_NR_IRQS)
> +		return -ENOENT;
> +
> +	sb = kvmppc_xive_find_source(xive, irq, &idx);
> +	if (!sb) {
> +		pr_debug("No source, creating source block...\n");

Doesn't this need to be protected by some lock?

> +		sb = kvmppc_xive_create_src_block(xive, irq);
> +		if (!sb) {
> +			pr_err("Failed to create block...\n");
> +			return -ENOMEM;
> +		}
> +	}
> +	state = &sb->irq_state[idx];
> +
> +	if (get_user(val, ubufp)) {
> +		pr_err("fault getting user info !\n");
> +		return -EFAULT;
> +	}
> +
> +	/*
> +	 * If the source doesn't already have an IPI, allocate
> +	 * one and get the corresponding data
> +	 */
> +	if (!state->ipi_number) {
> +		state->ipi_number = xive_native_alloc_irq();
> +		if (state->ipi_number == 0) {
> +			pr_err("Failed to allocate IRQ !\n");
> +			return -ENOMEM;
> +		}

Am I right in thinking this is the point at which a specific guest irq
number gets bound to a specific host irq number?

> +		xive_native_populate_irq_data(state->ipi_number,
> +					      &state->ipi_data);
> +		pr_debug("%s allocated hw_irq=0x%x for irq=0x%lx\n", __func__,
> +			 state->ipi_number, irq);
> +	}
> +
> +	arch_spin_lock(&sb->lock);
> +
> +	/* Restore LSI state */
> +	if (val & KVM_XIVE_LEVEL_SENSITIVE) {
> +		state->lsi = true;
> +		if (val & KVM_XIVE_LEVEL_ASSERTED)
> +			state->asserted = true;
> +		pr_devel("  LSI ! Asserted=%d\n", state->asserted);
> +	}
> +
> +	/* Mask IRQ to start with */
> +	state->act_server = 0;
> +	state->act_priority = MASKED;
> +	xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
> +	xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
> +
> +	/* Increment the number of valid sources and mark this one valid */
> +	if (!state->valid)
> +		xive->src_count++;
> +	state->valid = true;
> +
> +	arch_spin_unlock(&sb->lock);
> +
> +	return 0;
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> @@ -317,6 +407,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
>  		}
>  		break;
> +	case KVM_DEV_XIVE_GRP_SOURCES:
> +		return kvmppc_xive_native_set_source(xive, attr->attr,
> +						     attr->addr);
>  	}
>  	return -ENXIO;
>  }
> @@ -353,6 +446,11 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  			return 0;
>  		}
>  		break;
> +	case KVM_DEV_XIVE_GRP_SOURCES:
> +		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
> +		    attr->attr < KVMPPC_XIVE_NR_IRQS)
> +			return 0;
> +		break;
>  	}
>  	return -ENXIO;
>  }
> diff --git a/arch/powerpc/kvm/book3s_xive_native_template.c b/arch/powerpc/kvm/book3s_xive_native_template.c
> new file mode 100644
> index 000000000000..e7260da4a596
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_xive_native_template.c
> @@ -0,0 +1,27 @@
> +// SPDX-License-Identifier: GPL-2.0
> +/*
> + * Copyright (c) 2017-2019, IBM Corporation.
> + */
> +
> +/* File to be included by other .c files */
> +
> +#define XGLUE(a, b) a##b
> +#define GLUE(a, b) XGLUE(a, b)
> +
> +/*
> + * TODO: introduce a common template file with the XIVE native layer
> + * and the XICS-on-XIVE glue for the utility functions
> + */
> +static u8 GLUE(X_PFX, esb_load)(struct xive_irq_data *xd, u32 offset)
> +{
> +	u64 val;
> +
> +	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
> +		offset |= offset << 4;
> +
> +	val = __x_readq(__x_eoi_page(xd) + offset);
> +#ifdef __LITTLE_ENDIAN__
> +	val >>= 64-8;
> +#endif
> +	return (u8)val;
> +}

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address
  2019-01-07 18:43 ` [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address Cédric Le Goater
@ 2019-02-04  5:15   ` David Gibson
  2019-02-04 15:37     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:15 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1885 bytes --]

On Mon, Jan 07, 2019 at 07:43:24PM +0100, Cédric Le Goater wrote:
> The guest physical address of the event queue will be part of the
> state to transfer in the migration. Cache its value when the queue is
> configured, it will save us an OPAL call.

That doesn't sound like a very compelling case - migration is already
a hundreds of milliseconds type operation, I wouldn't expect a few
extra OPAL calls to be an issue.

> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/asm/xive.h       | 2 ++
>  arch/powerpc/kvm/book3s_xive_native.c | 4 ++++
>  2 files changed, 6 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
> index 7a7aa22d8258..e90c3c5d9533 100644
> --- a/arch/powerpc/include/asm/xive.h
> +++ b/arch/powerpc/include/asm/xive.h
> @@ -74,6 +74,8 @@ struct xive_q {
>  	u32			esc_irq;
>  	atomic_t		count;
>  	atomic_t		pending_count;
> +	u64			guest_qpage;
> +	u32			guest_qsize;
>  };
>  
>  /* Global enable flags for the XIVE support */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 35d806740c3a..4ca75aade069 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -708,6 +708,10 @@ static int kvmppc_h_int_set_queue_config(struct kvm_vcpu *vcpu,
>  	}
>  	qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK);
>  
> +	/* Backup queue page address and size for migration */
> +	q->guest_qpage = qpage;
> +	q->guest_qsize = qsize;
> +
>  	rc = xive_native_configure_queue(xc->vp_id, q, priority,
>  					 (__be32 *) qaddr, qsize, true);
>  	if (rc) {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
  2019-01-07 18:43 ` [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration Cédric Le Goater
@ 2019-02-04  5:17   ` David Gibson
  2019-02-04 15:39     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:17 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3368 bytes --]

On Mon, Jan 07, 2019 at 07:43:25PM +0100, Cédric Le Goater wrote:
> When migration of a VM is initiated, a first copy of the RAM is
> transferred to the destination before the VM is stopped. At that time,
> QEMU needs to perform a XIVE quiesce sequence to stop the flow of
> event notifications and stabilize the EQs. The sources are masked and
> the XIVE IC is synced with the KVM ioctl KVM_DEV_XIVE_GRP_SYNC.
>

Don't you also need to make sure the guests queue pages are marked
dirty here, in case they were already migrated?

> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>  arch/powerpc/kvm/book3s_xive_native.c | 32 +++++++++++++++++++++++++++
>  2 files changed, 33 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 6fc9660c5aec..f3b859223b80 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>  #define   KVM_DEV_XIVE_VC_BASE		3
>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> +#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 4ca75aade069..a8052867afc1 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -459,6 +459,35 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
>  	return 0;
>  }
>  
> +static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
> +{
> +	struct kvmppc_xive_src_block *sb;
> +	struct kvmppc_xive_irq_state *state;
> +	struct xive_irq_data *xd;
> +	u32 hw_num;
> +	u16 src;
> +
> +	pr_devel("%s irq=0x%lx\n", __func__, irq);
> +
> +	sb = kvmppc_xive_find_source(xive, irq, &src);
> +	if (!sb)
> +		return -ENOENT;
> +
> +	state = &sb->irq_state[src];
> +
> +	if (!state->valid)
> +		return -ENOENT;
> +
> +	arch_spin_lock(&sb->lock);
> +
> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
> +	xive_native_sync_source(hw_num);
> +	xive_native_sync_queue(hw_num);
> +
> +	arch_spin_unlock(&sb->lock);
> +	return 0;
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> @@ -474,6 +503,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  	case KVM_DEV_XIVE_GRP_SOURCES:
>  		return kvmppc_xive_native_set_source(xive, attr->attr,
>  						     attr->addr);
> +	case KVM_DEV_XIVE_GRP_SYNC:
> +		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
>  	}
>  	return -ENXIO;
>  }
> @@ -511,6 +542,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  		}
>  		break;
>  	case KVM_DEV_XIVE_GRP_SOURCES:
> +	case KVM_DEV_XIVE_GRP_SYNC:
>  		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
>  		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>  			return 0;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
  2019-01-07 18:43 ` [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty Cédric Le Goater
@ 2019-02-04  5:18   ` David Gibson
  2019-02-04 15:46     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:18 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3484 bytes --]

On Mon, Jan 07, 2019 at 07:43:26PM +0100, Cédric Le Goater wrote:
> When the VM is stopped in a migration sequence, the sources are masked
> and the XIVE IC is synced to stabilize the EQs. When done, the KVM
> ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages.
> 
> The migration can then transfer the remaining dirty pages to the
> destination and start collecting the state of the devices.

Is there a reason to make this a separate step from the SYNC
operation?

> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>  arch/powerpc/kvm/book3s_xive_native.c | 40 +++++++++++++++++++++++++++
>  2 files changed, 41 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index f3b859223b80..1a8740629acf 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_GET_ESB_FD	1
>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>  #define   KVM_DEV_XIVE_VC_BASE		3
> +#define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>  
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index a8052867afc1..f2de1bcf3b35 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -373,6 +373,43 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
>  	return put_user(ret, ubufp);
>  }
>  
> +static int kvmppc_xive_native_vcpu_save_eq_pages(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	unsigned int prio;
> +
> +	if (!xc)
> +		return -ENOENT;
> +
> +	for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
> +		struct xive_q *q = &xc->queues[prio];
> +
> +		if (!q->qpage)
> +			continue;
> +
> +		/* Mark EQ page dirty for migration */
> +		mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage));
> +	}
> +	return 0;
> +}
> +
> +static int kvmppc_xive_native_save_eq_pages(struct kvmppc_xive *xive)
> +{
> +	struct kvm *kvm = xive->kvm;
> +	struct kvm_vcpu *vcpu;
> +	unsigned int i;
> +
> +	pr_devel("%s\n", __func__);
> +
> +	mutex_lock(&kvm->lock);
> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> +		kvmppc_xive_native_vcpu_save_eq_pages(vcpu);
> +	}
> +	mutex_unlock(&kvm->lock);
> +
> +	return 0;
> +}
> +
>  static int xive_native_validate_queue_size(u32 qsize)
>  {
>  	switch (qsize) {
> @@ -498,6 +535,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  		switch (attr->attr) {
>  		case KVM_DEV_XIVE_VC_BASE:
>  			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
> +		case KVM_DEV_XIVE_SAVE_EQ_PAGES:
> +			return kvmppc_xive_native_save_eq_pages(xive);
>  		}
>  		break;
>  	case KVM_DEV_XIVE_GRP_SOURCES:
> @@ -538,6 +577,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  		case KVM_DEV_XIVE_GET_ESB_FD:
>  		case KVM_DEV_XIVE_GET_TIMA_FD:
>  		case KVM_DEV_XIVE_VC_BASE:
> +		case KVM_DEV_XIVE_SAVE_EQ_PAGES:
>  			return 0;
>  		}
>  		break;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-01-07 18:43 ` [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration Cédric Le Goater
@ 2019-02-04  5:21   ` David Gibson
  2019-02-04 16:07     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:21 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 5861 bytes --]

On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> Theses are use to capure the XIVE EAS table of the KVM device, the
> configuration of the source targets.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
>  2 files changed, 98 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 1a8740629acf..faf024f39858 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>  
> +/* Layout of 64-bit eas attribute values */
> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index f2de1bcf3b35..0468b605baa7 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>  	return 0;
>  }
>  
> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> +				      u64 addr)

I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
values, but rather essentially the "source config" in the terminology
of the PAPR hcalls.  Which, yes, is basically implemented by setting
the EAS, but since it's the PAPR architected state that we need to
preserve across migration, I'd prefer to stick as close as we can to
the PAPR terminology.

> +{
> +	struct kvmppc_xive_src_block *sb;
> +	struct kvmppc_xive_irq_state *state;
> +	u64 __user *ubufp = (u64 __user *) addr;
> +	u16 src;
> +	u64 kvm_eas;
> +	u32 server;
> +	u8 priority;
> +	u32 eisn;
> +
> +	sb = kvmppc_xive_find_source(xive, irq, &src);
> +	if (!sb)
> +		return -ENOENT;
> +
> +	state = &sb->irq_state[src];
> +
> +	if (!state->valid)
> +		return -EINVAL;
> +
> +	if (get_user(kvm_eas, ubufp))
> +		return -EFAULT;
> +
> +	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
> +
> +	priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
> +		KVM_XIVE_EAS_PRIORITY_SHIFT;
> +	server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
> +		KVM_XIVE_EAS_SERVER_SHIFT;
> +	eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
> +
> +	if (priority != xive_prio_from_guest(priority)) {
> +		pr_err("invalid priority for queue %d for VCPU %d\n",
> +		       priority, server);
> +		return -EINVAL;
> +	}
> +
> +	return kvmppc_xive_native_set_source_config(xive, sb, state, server,
> +						    priority, eisn);
> +}
> +
> +static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
> +				      u64 addr)
> +{
> +	struct kvmppc_xive_src_block *sb;
> +	struct kvmppc_xive_irq_state *state;
> +	u64 __user *ubufp = (u64 __user *) addr;
> +	u16 src;
> +	u64 kvm_eas;
> +
> +	sb = kvmppc_xive_find_source(xive, irq, &src);
> +	if (!sb)
> +		return -ENOENT;
> +
> +	state = &sb->irq_state[src];
> +
> +	if (!state->valid)
> +		return -EINVAL;
> +
> +	arch_spin_lock(&sb->lock);
> +
> +	if (state->act_priority == MASKED)
> +		kvm_eas = KVM_XIVE_EAS_MASK_MASK;
> +	else {
> +		kvm_eas = (state->act_priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
> +			KVM_XIVE_EAS_PRIORITY_MASK;
> +		kvm_eas |= (state->act_server << KVM_XIVE_EAS_SERVER_SHIFT) &
> +			KVM_XIVE_EAS_SERVER_MASK;
> +		kvm_eas |= ((u64) state->eisn << KVM_XIVE_EAS_EISN_SHIFT) &
> +			KVM_XIVE_EAS_EISN_MASK;
> +	}
> +	arch_spin_unlock(&sb->lock);
> +
> +	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
> +
> +	if (put_user(kvm_eas, ubufp))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> @@ -544,6 +626,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  						     attr->addr);
>  	case KVM_DEV_XIVE_GRP_SYNC:
>  		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
> +	case KVM_DEV_XIVE_GRP_EAS:
> +		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
>  	}
>  	return -ENXIO;
>  }
> @@ -564,6 +648,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>  			return kvmppc_xive_native_get_vc_base(xive, attr->addr);
>  		}
>  		break;
> +	case KVM_DEV_XIVE_GRP_EAS:
> +		return kvmppc_xive_native_get_eas(xive, attr->attr, attr->addr);
>  	}
>  	return -ENXIO;
>  }
> @@ -583,6 +669,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  		break;
>  	case KVM_DEV_XIVE_GRP_SOURCES:
>  	case KVM_DEV_XIVE_GRP_SYNC:
> +	case KVM_DEV_XIVE_GRP_EAS:
>  		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
>  		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>  			return 0;

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
  2019-01-07 18:43 ` [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration Cédric Le Goater
@ 2019-02-04  5:24   ` David Gibson
  2019-02-05 17:45     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:24 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 8455 bytes --]

On Mon, Jan 07, 2019 at 07:43:28PM +0100, Cédric Le Goater wrote:
> These are used to capture the XIVE END table of the KVM device. It
> relies on an OPAL call to retrieve from the XIVE IC the EQ toggle bit
> and index which are updated by the HW when events are enqueued in the
> guest RAM.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/uapi/asm/kvm.h   |  21 ++++
>  arch/powerpc/kvm/book3s_xive_native.c | 166 ++++++++++++++++++++++++++
>  2 files changed, 187 insertions(+)
> 
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index faf024f39858..95302558ce10 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -684,6 +684,7 @@ struct kvm_ppc_cpu_char {
>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>  #define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> +#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
>  
>  /* Layout of 64-bit XIVE source attribute values */
>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> @@ -699,4 +700,24 @@ struct kvm_ppc_cpu_char {
>  #define KVM_XIVE_EAS_EISN_SHIFT		33
>  #define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>  
> +/* Layout of 64-bit eq attribute */
> +#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
> +#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
> +#define KVM_XIVE_EQ_SERVER_SHIFT	3
> +#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
> +
> +/* Layout of 64-bit eq attribute values */
> +struct kvm_ppc_xive_eq {
> +	__u32 flags;
> +	__u32 qsize;
> +	__u64 qpage;
> +	__u32 qtoggle;
> +	__u32 qindex;

Should we pad this in case a) we discover some fields in the EQ that
we thought weren't relevant to the guest actually are or b) future
XIVE extensions add something we need to migrate.

> +};
> +
> +#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
> +#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
> +
> +
>  #endif /* __LINUX_KVM_POWERPC_H */
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index 0468b605baa7..f4eb71eafc57 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -607,6 +607,164 @@ static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
>  	return 0;
>  }
>  
> +static int kvmppc_xive_native_set_queue(struct kvmppc_xive *xive, long eq_idx,
> +				      u64 addr)
> +{
> +	struct kvm *kvm = xive->kvm;
> +	struct kvm_vcpu *vcpu;
> +	struct kvmppc_xive_vcpu *xc;
> +	void __user *ubufp = (u64 __user *) addr;
> +	u32 server;
> +	u8 priority;
> +	struct kvm_ppc_xive_eq kvm_eq;
> +	int rc;
> +	__be32 *qaddr = 0;
> +	struct page *page;
> +	struct xive_q *q;
> +
> +	/*
> +	 * Demangle priority/server tuple from the EQ index
> +	 */
> +	priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
> +		KVM_XIVE_EQ_PRIORITY_SHIFT;
> +	server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
> +		KVM_XIVE_EQ_SERVER_SHIFT;
> +
> +	if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq)))
> +		return -EFAULT;
> +
> +	vcpu = kvmppc_xive_find_server(kvm, server);
> +	if (!vcpu) {
> +		pr_err("Can't find server %d\n", server);
> +		return -ENOENT;
> +	}
> +	xc = vcpu->arch.xive_vcpu;
> +
> +	if (priority != xive_prio_from_guest(priority)) {
> +		pr_err("Trying to restore invalid queue %d for VCPU %d\n",
> +		       priority, server);
> +		return -EINVAL;
> +	}
> +	q = &xc->queues[priority];
> +
> +	pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
> +		 __func__, server, priority, kvm_eq.flags,
> +		 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
> +
> +	rc = xive_native_validate_queue_size(kvm_eq.qsize);
> +	if (rc || !kvm_eq.qsize) {
> +		pr_err("invalid queue size %d\n", kvm_eq.qsize);
> +		return rc;
> +	}
> +
> +	page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage));
> +	if (is_error_page(page)) {
> +		pr_warn("Couldn't get guest page for %llx!\n", kvm_eq.qpage);
> +		return -ENOMEM;
> +	}
> +	qaddr = page_to_virt(page) + (kvm_eq.qpage & ~PAGE_MASK);
> +
> +	/* Backup queue page guest address for migration */
> +	q->guest_qpage = kvm_eq.qpage;
> +	q->guest_qsize = kvm_eq.qsize;
> +
> +	rc = xive_native_configure_queue(xc->vp_id, q, priority,
> +					 (__be32 *) qaddr, kvm_eq.qsize, true);
> +	if (rc) {
> +		pr_err("Failed to configure queue %d for VCPU %d: %d\n",
> +		       priority, xc->server_num, rc);
> +		put_page(page);
> +		return rc;
> +	}
> +
> +	rc = xive_native_set_queue_state(xc->vp_id, priority, kvm_eq.qtoggle,
> +					 kvm_eq.qindex);
> +	if (rc)
> +		goto error;
> +
> +	rc = kvmppc_xive_attach_escalation(vcpu, priority);
> +error:
> +	if (rc)
> +		xive_native_cleanup_queue(vcpu, priority);
> +	return rc;
> +}
> +
> +static int kvmppc_xive_native_get_queue(struct kvmppc_xive *xive, long eq_idx,
> +				      u64 addr)
> +{
> +	struct kvm *kvm = xive->kvm;
> +	struct kvm_vcpu *vcpu;
> +	struct kvmppc_xive_vcpu *xc;
> +	struct xive_q *q;
> +	void __user *ubufp = (u64 __user *) addr;
> +	u32 server;
> +	u8 priority;
> +	struct kvm_ppc_xive_eq kvm_eq;
> +	u64 qpage;
> +	u64 qsize;
> +	u64 qeoi_page;
> +	u32 escalate_irq;
> +	u64 qflags;
> +	int rc;
> +
> +	/*
> +	 * Demangle priority/server tuple from the EQ index
> +	 */
> +	priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
> +		KVM_XIVE_EQ_PRIORITY_SHIFT;
> +	server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
> +		KVM_XIVE_EQ_SERVER_SHIFT;
> +
> +	vcpu = kvmppc_xive_find_server(kvm, server);
> +	if (!vcpu) {
> +		pr_err("Can't find server %d\n", server);
> +		return -ENOENT;
> +	}
> +	xc = vcpu->arch.xive_vcpu;
> +
> +	if (priority != xive_prio_from_guest(priority)) {
> +		pr_err("invalid priority for queue %d for VCPU %d\n",
> +		       priority, server);
> +		return -EINVAL;
> +	}
> +	q = &xc->queues[priority];
> +
> +	memset(&kvm_eq, 0, sizeof(kvm_eq));
> +
> +	if (!q->qpage)
> +		return 0;
> +
> +	rc = xive_native_get_queue_info(xc->vp_id, priority, &qpage, &qsize,
> +					&qeoi_page, &escalate_irq, &qflags);
> +	if (rc)
> +		return rc;
> +
> +	kvm_eq.flags = 0;
> +	if (qflags & OPAL_XIVE_EQ_ENABLED)
> +		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ENABLED;
> +	if (qflags & OPAL_XIVE_EQ_ALWAYS_NOTIFY)
> +		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
> +	if (qflags & OPAL_XIVE_EQ_ESCALATE)
> +		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ESCALATE;
> +
> +	kvm_eq.qsize = q->guest_qsize;
> +	kvm_eq.qpage = q->guest_qpage;
> +
> +	rc = xive_native_get_queue_state(xc->vp_id, priority, &kvm_eq.qtoggle,
> +					 &kvm_eq.qindex);
> +	if (rc)
> +		return rc;
> +
> +	pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
> +		 __func__, server, priority, kvm_eq.flags,
> +		 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
> +
> +	if (copy_to_user(ubufp, &kvm_eq, sizeof(kvm_eq)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +
>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  				       struct kvm_device_attr *attr)
>  {
> @@ -628,6 +786,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>  		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
>  	case KVM_DEV_XIVE_GRP_EAS:
>  		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
> +	case KVM_DEV_XIVE_GRP_EQ:
> +		return kvmppc_xive_native_set_queue(xive, attr->attr,
> +						    attr->addr);
>  	}
>  	return -ENXIO;
>  }
> @@ -650,6 +811,9 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>  		break;
>  	case KVM_DEV_XIVE_GRP_EAS:
>  		return kvmppc_xive_native_get_eas(xive, attr->attr, attr->addr);
> +	case KVM_DEV_XIVE_GRP_EQ:
> +		return kvmppc_xive_native_get_queue(xive, attr->attr,
> +						    attr->addr);
>  	}
>  	return -ENXIO;
>  }
> @@ -674,6 +838,8 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>  		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>  			return 0;
>  		break;
> +	case KVM_DEV_XIVE_GRP_EQ:
> +		return 0;
>  	}
>  	return -ENXIO;
>  }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
  2019-01-07 19:10   ` [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support Cédric Le Goater
  2019-01-07 19:10   ` [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl Cédric Le Goater
@ 2019-02-04  5:26   ` David Gibson
  2019-02-04 18:57     ` Cédric Le Goater
  2 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 7408 bytes --]

On Mon, Jan 07, 2019 at 08:10:04PM +0100, Cédric Le Goater wrote:
> At a VCPU level, the state of the thread context interrupt management
> registers needs to be collected. These registers are cached under the
> 'xive_saved_state.w01' field of the VCPU when the VPCU context is
> pulled from the HW thread. An OPAL call retrieves the backup of the
> IPB register in the NVT structure and merges it in the KVM state.
> 
> The structures of the interface between QEMU and KVM provisions some
> extra room (two u64) for further extensions if more state needs to be
> transferred back to QEMU.
> 
> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> ---
>  arch/powerpc/include/asm/kvm_ppc.h    |  5 ++
>  arch/powerpc/include/uapi/asm/kvm.h   |  2 +
>  arch/powerpc/kvm/book3s.c             | 24 +++++++++
>  arch/powerpc/kvm/book3s_xive_native.c | 78 +++++++++++++++++++++++++++
>  4 files changed, 109 insertions(+)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index 4cc897039485..49c488af168c 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -270,6 +270,7 @@ union kvmppc_one_reg {
>  		u64	addr;
>  		u64	length;
>  	}	vpaval;
> +	u64	xive_timaval[4];
>  };
>  
>  struct kvmppc_ops {
> @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
>  extern void kvmppc_xive_native_init_module(void);
>  extern void kvmppc_xive_native_exit_module(void);
>  extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
> +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
> +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
>  
>  #else
>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
> @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { }
>  static inline void kvmppc_xive_native_exit_module(void) { }
>  static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
>  	{ return 0; }
> +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; }
> +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return -ENOENT; }

IIRC "VP" is the old name for "TCTX".  Since we're using tctx in the
rest of the XIVE code, can we use it here as well.

>  #endif /* CONFIG_KVM_XIVE */
>  
> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> index 95302558ce10..3c958c39a782 100644
> --- a/arch/powerpc/include/uapi/asm/kvm.h
> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
>  #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
>  #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
>  
> +#define KVM_REG_PPC_VP_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
> +
>  /* Device control API: PPC-specific devices */
>  #define KVM_DEV_MPIC_GRP_MISC		1
>  #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> index de7eed191107..5ad658077a35 100644
> --- a/arch/powerpc/kvm/book3s.c
> +++ b/arch/powerpc/kvm/book3s.c
> @@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
>  				*val = get_reg_val(id, kvmppc_xics_get_icp(vcpu));
>  			break;
>  #endif /* CONFIG_KVM_XICS */
> +#ifdef CONFIG_KVM_XIVE
> +		case KVM_REG_PPC_VP_STATE:
> +			if (!vcpu->arch.xive_vcpu) {
> +				r = -ENXIO;
> +				break;
> +			}
> +			if (xive_enabled())
> +				r = kvmppc_xive_native_get_vp(vcpu, val);
> +			else
> +				r = -ENXIO;
> +			break;
> +#endif /* CONFIG_KVM_XIVE */
>  		case KVM_REG_PPC_FSCR:
>  			*val = get_reg_val(id, vcpu->arch.fscr);
>  			break;
> @@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>  				r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val));
>  			break;
>  #endif /* CONFIG_KVM_XICS */
> +#ifdef CONFIG_KVM_XIVE
> +		case KVM_REG_PPC_VP_STATE:
> +			if (!vcpu->arch.xive_vcpu) {
> +				r = -ENXIO;
> +				break;
> +			}
> +			if (xive_enabled())
> +				r = kvmppc_xive_native_set_vp(vcpu, val);
> +			else
> +				r = -ENXIO;
> +			break;
> +#endif /* CONFIG_KVM_XIVE */
>  		case KVM_REG_PPC_FSCR:
>  			vcpu->arch.fscr = set_reg_val(id, *val);
>  			break;
> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> index f4eb71eafc57..1aefb366df0b 100644
> --- a/arch/powerpc/kvm/book3s_xive_native.c
> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> @@ -424,6 +424,84 @@ static int xive_native_validate_queue_size(u32 qsize)
>  	}
>  }
>  
> +#define TM_IPB_SHIFT 40
> +#define TM_IPB_MASK  (((u64) 0xFF) << TM_IPB_SHIFT)
> +
> +int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	u64 opal_state;
> +	int rc;
> +
> +	if (!kvmppc_xive_enabled(vcpu))
> +		return -EPERM;
> +
> +	if (!xc)
> +		return -ENOENT;
> +
> +	/* Thread context registers. We only care about IPB and CPPR */
> +	val->xive_timaval[0] = vcpu->arch.xive_saved_state.w01;
> +
> +	/*
> +	 * Return the OS CAM line to print out the VP identifier in
> +	 * the QEMU monitor. This is not restored.
> +	 */
> +	val->xive_timaval[1] = vcpu->arch.xive_cam_word;
> +
> +	/* Get the VP state from OPAL */
> +	rc = xive_native_get_vp_state(xc->vp_id, &opal_state);
> +	if (rc)
> +		return rc;
> +
> +	/*
> +	 * Capture the backup of IPB register in the NVT structure and
> +	 * merge it in our KVM VP state.
> +	 *
> +	 * TODO: P10 support.
> +	 */
> +	val->xive_timaval[0] |= cpu_to_be64(opal_state & TM_IPB_MASK);
> +
> +	pr_devel("%s NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x opal=%016llx\n",
> +		 __func__,
> +		 vcpu->arch.xive_saved_state.nsr,
> +		 vcpu->arch.xive_saved_state.cppr,
> +		 vcpu->arch.xive_saved_state.ipb,
> +		 vcpu->arch.xive_saved_state.pipr,
> +		 vcpu->arch.xive_saved_state.w01,
> +		 (u32) vcpu->arch.xive_cam_word, opal_state);
> +
> +	return 0;
> +}
> +
> +int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
> +{
> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> +	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
> +
> +	pr_devel("%s w01=%016llx vp=%016llx\n", __func__,
> +		 val->xive_timaval[0], val->xive_timaval[1]);
> +
> +	if (!kvmppc_xive_enabled(vcpu))
> +		return -EPERM;
> +
> +	if (!xc || !xive)
> +		return -ENOENT;
> +
> +	/* We can't update the state of a "pushed" VCPU	 */
> +	if (WARN_ON(vcpu->arch.xive_pushed))
> +		return -EIO;
> +
> +	/* Thread context registers. only restore IPB and CPPR ? */
> +	vcpu->arch.xive_saved_state.w01 = val->xive_timaval[0];
> +
> +	/*
> +	 * There is no need to restore the XIVE internal state (IPB
> +	 * stored in the NVT) as the IPB register was merged in KVM VP
> +	 * state.
> +	 */
> +	return 0;
> +}
> +
>  static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
>  					 u64 addr)
>  {

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-01-26  8:25       ` Cédric Le Goater
@ 2019-02-04  5:36         ` David Gibson
  2019-02-05 11:31           ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-04  5:36 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3072 bytes --]

On Sat, Jan 26, 2019 at 09:25:04AM +0100, Cédric Le Goater wrote:
> Was there a crashing.org shutdown ? 
> 
>   Received: from gate.crashing.org (gate.crashing.org [63.228.1.57])
> 	by in5.mail.ovh.net (Postfix) with ESMTPS id 43mYnj0nrlz1N7KC
> 	for <clg@kaod.org>; Fri, 25 Jan 2019 22:38:00 +0000 (UTC)
>   Received: from localhost (localhost.localdomain [127.0.0.1])
> 	by gate.crashing.org (8.14.1/8.14.1) with ESMTP id x0NLZf4K021092;
> 	Wed, 23 Jan 2019 15:35:43 -0600
> 
> 
> On 1/23/19 10:35 PM, Benjamin Herrenschmidt wrote:
> > On Wed, 2019-01-23 at 20:07 +0100, Cédric Le Goater wrote:
> >>  Event Assignment Structure, a.k.a IVE (Interrupt Virtualization Entry)
> >>
> >> All the names changed somewhere between XIVE v1 and XIVE v2. OPAL and
> >> Linux should be adjusted ...
> > 
> > All the names changed between the HW design and the "architecture"
> > document. The HW guys use the old names, the architecture the new
> > names, and Linux & OPAL mostly use the old ones because frankly the new
> > names suck big time.
> 
> Well, It does not make XIVE any clearer ... I did prefer the v1 names
> but there was some naming overlap in the concepts. 
> 
> >> It would be good to talk a little about the nested support (offline 
> >> may be) to make sure that we are not missing some major interface that 
> >> would require a lot of change. If we need to prepare ground, I think
> >> the timing is good.
> >>
> >> The size of the IRQ number space might be a problem. It seems we 
> >> would need to increase it considerably to support multiple nested 
> >> guests. That said I haven't look much how nested is designed.  
> > 
> > The size of the VP space is a bigger concern. Even today. We really
> > need qemu to tell the max #cpu to KVM so we can allocate less of them.
> 
> ah yes. we would also need to reduce the number of available priorities 
> per CPU to have more EQ descriptors available if I recall well. 
> 
> > As for nesting, I suggest for the foreseeable future we stick to XICS
> > emulation in nested guests.
> 
> ok. so no kernel_irqchip at all. hmm.  

That would certainly be step 0, making sure the capability advertises
this correctly.  I think we do want to make XICs-on-XIVE emulation
work in a KVM L1 (so we'd need to have it make XIVE hcalls to the L0
instead of OPAL calls).

XIVE-on-XIVE for L1 would be nice too, which would mean implementing
the XIVE hcalls from the L2 in terms of XIVE hcalls to the L0.  I
think it's ok to delay this indefinitely as long as the caps advertise
correctly so that qemu will use userspace emulation until its ready.

> I was wondering how possible it was to have L2 initialize the underlying 
> OPAL structures in the L0 hypervisor. May be with a sort of proxy hcall 
> which would perform the initialization in QEMU L1 on behalf of L2.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type
  2019-02-04  0:50       ` David Gibson
@ 2019-02-04 10:16         ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 10:16 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, linuxppc-dev

On 2/4/19 1:50 AM, David Gibson wrote:
> On Wed, Jan 23, 2019 at 05:24:13PM +0100, Cédric Le Goater wrote:
>> On 1/22/19 5:56 AM, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 07:43:15PM +0100, Cédric Le Goater wrote:
>>>> We will have different KVM devices for interrupts, one for the
>>>> XICS-over-XIVE mode and one for the XIVE native exploitation
>>>> mode. Let's add some checks to make sure we are not mixing the
>>>> interfaces in KVM.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  arch/powerpc/kvm/book3s_xive.c | 6 ++++++
>>>>  1 file changed, 6 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
>>>> index f78d002f0fe0..8a4fa45f07f8 100644
>>>> --- a/arch/powerpc/kvm/book3s_xive.c
>>>> +++ b/arch/powerpc/kvm/book3s_xive.c
>>>> @@ -819,6 +819,9 @@ u64 kvmppc_xive_get_icp(struct kvm_vcpu *vcpu)
>>>>  {
>>>>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>>>>  
>>>> +	if (!kvmppc_xics_enabled(vcpu))
>>>> +		return -EPERM;
>>>> +
>>>>  	if (!xc)
>>>>  		return 0;
>>>>  
>>>> @@ -835,6 +838,9 @@ int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval)
>>>>  	u8 cppr, mfrr;
>>>>  	u32 xisr;
>>>>  
>>>> +	if (!kvmppc_xics_enabled(vcpu))
>>>> +		return -EPERM;
>>>> +
>>>>  	if (!xc || !xive)
>>>>  		return -ENOENT;
>>>
>>> I can't see how these new checks could ever trigger in the code as it
>>> stands.  Is there a way at present? 
>>
>> It would require some custom QEMU doing silly things : create the XICS 
>> KVM device, and then call kvm_get_one_reg(KVM_REG_PPC_ICP_STATE) or 
>> kvm_set_one_reg(icp->cs, KVM_REG_PPC_ICP_STATE) without connecting the
>> vCPU to its presenter. 
>>
>> Today, you get a ENOENT.
> 
> TBH, ENOENT seems fine to me.
> 
>>> Do following patches ever add a path where the new checks could trigger, 
>>> or is this just an excess of caution? 
>>
>> With the following patches, QEMU could to do something even more silly,
>> which is to mix the interrupt mode interfaces : create a KVM XICS device    
>> and call KVM CPU ioctls of the KVM XIVE device, or the opposite.
> 
> AFAICT, like above, that won't really differ from calling the XIVE CPU
> ioctl()s when no irqchip is set up at all, and should be covered by
> just a !xive check.

we can drop that patch. It does not bring much.

Thanks,

C.

> 
>>
>>> (Your patch description should ideally have answered these questions > for me.)
>>
>> Yes. I also think that I introduced this patch to early in the series.
>> It make more sense when the XICS and the XIVE KVM devices are available.  
>>
>> Thanks,
>>
>> C.
>>
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-02-04  4:25   ` David Gibson
@ 2019-02-04 11:19     ` Cédric Le Goater
  2019-02-05  5:26       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 11:19 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 5:25 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
>> This is the basic framework for the new KVM device supporting the XIVE
>> native exploitation mode. The user interface exposes a new capability
>> and a new KVM device to be used by QEMU.
>>
>> Internally, the interface to the new KVM device is protected with a
>> new interrupt mode: KVMPPC_IRQ_XIVE.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/asm/kvm_host.h   |   2 +
>>  arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
>>  arch/powerpc/kvm/book3s_xive.h        |   3 +
>>  include/uapi/linux/kvm.h              |   3 +
>>  arch/powerpc/kvm/book3s.c             |   7 +-
>>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
>>  arch/powerpc/kvm/powerpc.c            |  30 +++
>>  arch/powerpc/kvm/Makefile             |   2 +-
>>  8 files changed, 398 insertions(+), 2 deletions(-)
>>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
>>
>> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
>> index 0f98f00da2ea..c522e8274ad9 100644
>> --- a/arch/powerpc/include/asm/kvm_host.h
>> +++ b/arch/powerpc/include/asm/kvm_host.h
>> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
>>  struct kvmppc_xive;
>>  struct kvmppc_xive_vcpu;
>>  extern struct kvm_device_ops kvm_xive_ops;
>> +extern struct kvm_device_ops kvm_xive_native_ops;
>>  
>>  struct kvmppc_passthru_irqmap;
>>  
>> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
>>  #define KVMPPC_IRQ_DEFAULT	0
>>  #define KVMPPC_IRQ_MPIC		1
>>  #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
>> +#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
>>  
>>  #define MMIO_HPTE_CACHE_SIZE	4
>>  
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index eb0d79f0ca45..1bb313f238fe 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
>>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>>  			       int level, bool line_status);
>>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
>> +
>> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
>> +{
>> +	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
>> +}
>> +
>> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>> +				    struct kvm_vcpu *vcpu, u32 cpu);
>> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
>> +extern void kvmppc_xive_native_init_module(void);
>> +extern void kvmppc_xive_native_exit_module(void);
>> +
>>  #else
>>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>>  				       u32 priority) { return -1; }
>> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
>>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
>>  				      int level, bool line_status) { return -ENODEV; }
>>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
>> +
>> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
>> +	{ return 0; }
>> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>> +						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
>> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
>> +static inline void kvmppc_xive_native_init_module(void) { }
>> +static inline void kvmppc_xive_native_exit_module(void) { }
>> +
>>  #endif /* CONFIG_KVM_XIVE */
>>  
>>  /*
>> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
>> index 10c4aa5cd010..5f22415520b4 100644
>> --- a/arch/powerpc/kvm/book3s_xive.h
>> +++ b/arch/powerpc/kvm/book3s_xive.h
>> @@ -12,6 +12,9 @@
>>  #ifdef CONFIG_KVM_XICS
>>  #include "book3s_xics.h"
>>  
>> +#define KVMPPC_XIVE_FIRST_IRQ	0
>> +#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
>> +
>>  /*
>>   * State for one guest irq source.
>>   *
>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>> index 6d4ea4b6c922..52bf74a1616e 100644
>> --- a/include/uapi/linux/kvm.h
>> +++ b/include/uapi/linux/kvm.h
>> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
>>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
>>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
>>  #define KVM_CAP_HYPERV_CPUID 167
>> +#define KVM_CAP_PPC_IRQ_XIVE 168
>>  
>>  #ifdef KVM_CAP_IRQ_ROUTING
>>  
>> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
>>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
>>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
>>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
>> +	KVM_DEV_TYPE_XIVE,
>> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
>>  	KVM_DEV_TYPE_MAX,
>>  };
>>  
>> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
>> index bd1a677dd9e4..de7eed191107 100644
>> --- a/arch/powerpc/kvm/book3s.c
>> +++ b/arch/powerpc/kvm/book3s.c
>> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
>>  #ifdef CONFIG_KVM_XIVE
>>  	if (xive_enabled()) {
>>  		kvmppc_xive_init_module();
>> +		kvmppc_xive_native_init_module();
>>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
>> +		kvm_register_device_ops(&kvm_xive_native_ops,
>> +					KVM_DEV_TYPE_XIVE);
>>  	} else
>>  #endif
>>  		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
>> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
>>  static void kvmppc_book3s_exit(void)
>>  {
>>  #ifdef CONFIG_KVM_XICS
>> -	if (xive_enabled())
>> +	if (xive_enabled()) {
>>  		kvmppc_xive_exit_module();
>> +		kvmppc_xive_native_exit_module();
>> +	}
>>  #endif
>>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
>>  	kvmppc_book3s_exit_pr();
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> new file mode 100644
>> index 000000000000..115143e76c45
>> --- /dev/null
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -0,0 +1,332 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2017-2019, IBM Corporation.
>> + */
>> +
>> +#define pr_fmt(fmt) "xive-kvm: " fmt
>> +
>> +#include <linux/anon_inodes.h>
>> +#include <linux/kernel.h>
>> +#include <linux/kvm_host.h>
>> +#include <linux/err.h>
>> +#include <linux/gfp.h>
>> +#include <linux/spinlock.h>
>> +#include <linux/delay.h>
>> +#include <linux/percpu.h>
>> +#include <linux/cpumask.h>
>> +#include <asm/uaccess.h>
>> +#include <asm/kvm_book3s.h>
>> +#include <asm/kvm_ppc.h>
>> +#include <asm/hvcall.h>
>> +#include <asm/xics.h>
>> +#include <asm/xive.h>
>> +#include <asm/xive-regs.h>
>> +#include <asm/debug.h>
>> +#include <asm/debugfs.h>
>> +#include <asm/time.h>
>> +#include <asm/opal.h>
>> +
>> +#include <linux/debugfs.h>
>> +#include <linux/seq_file.h>
>> +
>> +#include "book3s_xive.h"
>> +
>> +static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	struct xive_q *q = &xc->queues[prio];
>> +
>> +	xive_native_disable_queue(xc->vp_id, q, prio);
>> +	if (q->qpage) {
>> +		put_page(virt_to_page(q->qpage));
>> +		q->qpage = NULL;
>> +	}
>> +}
>> +
>> +void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	int i;
>> +
>> +	if (!kvmppc_xive_enabled(vcpu))
>> +		return;
>> +
>> +	if (!xc)
>> +		return;
>> +
>> +	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
>> +
>> +	/* Ensure no interrupt is still routed to that VP */
>> +	xc->valid = false;
>> +	kvmppc_xive_disable_vcpu_interrupts(vcpu);
>> +
>> +	/* Disable the VP */
>> +	xive_native_disable_vp(xc->vp_id);
>> +
>> +	/* Free the queues & associated interrupts */
>> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
>> +		/* Free the escalation irq */
>> +		if (xc->esc_virq[i]) {
>> +			free_irq(xc->esc_virq[i], vcpu);
>> +			irq_dispose_mapping(xc->esc_virq[i]);
>> +			kfree(xc->esc_virq_names[i]);
>> +			xc->esc_virq[i] = 0;
>> +		}
>> +
>> +		/* Free the queue */
>> +		xive_native_cleanup_queue(vcpu, i);
>> +	}
>> +
>> +	/* Free the VP */
>> +	kfree(xc);
>> +
>> +	/* Cleanup the vcpu */
>> +	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
>> +	vcpu->arch.xive_vcpu = NULL;
>> +}
>> +
>> +int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>> +				    struct kvm_vcpu *vcpu, u32 cpu)
> 
> Why do we need both a *vcpu and a cpu number as an integer?

To be in sync with the other similar routines : kvmppc_xics_connect_vcpu() 
and kvmppc_xive_connect_vcpu().

But if we consider that this 'cpu' parameter is always in sync with 
vcpu->vcpu_id, we could remove it from the KVM ioctl call I suppose.

Should we do the same for the other routines ? 
 
>> +{
>> +	struct kvmppc_xive *xive = dev->private;
>> +	struct kvmppc_xive_vcpu *xc;
>> +	int rc;
>> +
>> +	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
>> +
>> +	if (dev->ops != &kvm_xive_native_ops) {
>> +		pr_devel("Wrong ops !\n");
>> +		return -EPERM;
>> +	}
>> +	if (xive->kvm != vcpu->kvm)
>> +		return -EPERM;
>> +	if (vcpu->arch.irq_type)
> 
> Please use an explicit == / != here so we don't have to remember which
> symbolic value corresponds to 0.

ok. I agree.

Thanks,

C. 


> 
>> +		return -EBUSY;
>> +	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
>> +		pr_devel("Duplicate !\n");
>> +		return -EEXIST;
>> +	}
>> +	if (cpu >= KVM_MAX_VCPUS) {
>> +		pr_devel("Out of bounds !\n");
>> +		return -EINVAL;
>> +	}
>> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
>> +	if (!xc)
>> +		return -ENOMEM;
>> +
>> +	mutex_lock(&vcpu->kvm->lock);
>> +	vcpu->arch.xive_vcpu = xc;
>> +	xc->xive = xive;
>> +	xc->vcpu = vcpu;
>> +	xc->server_num = cpu;
>> +	xc->vp_id = xive->vp_base + cpu;
>> +	xc->valid = true;
>> +
>> +	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
>> +	if (rc) {
>> +		pr_err("Failed to get VP info from OPAL: %d\n", rc);
>> +		goto bail;
>> +	}
>> +
>> +	/*
>> +	 * Enable the VP first as the single escalation mode will
>> +	 * affect escalation interrupts numbering
>> +	 */
>> +	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
>> +	if (rc) {
>> +		pr_err("Failed to enable VP in OPAL: %d\n", rc);
>> +		goto bail;
>> +	}
>> +
>> +	/* Configure VCPU fields for use by assembly push/pull */
>> +	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
>> +	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
>> +
>> +	/* TODO: initialize queues ? */
>> +
>> +bail:
>> +	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
>> +	mutex_unlock(&vcpu->kvm->lock);
>> +	if (rc)
>> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
>> +
>> +	return rc;
>> +}
>> +
>> +static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>> +				       struct kvm_device_attr *attr)
>> +{
>> +	return -ENXIO;
>> +}
>> +
>> +static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>> +				       struct kvm_device_attr *attr)
>> +{
>> +	return -ENXIO;
>> +}
>> +
>> +static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>> +				       struct kvm_device_attr *attr)
>> +{
>> +	return -ENXIO;
>> +}
>> +
>> +static void kvmppc_xive_native_free(struct kvm_device *dev)
>> +{
>> +	struct kvmppc_xive *xive = dev->private;
>> +	struct kvm *kvm = xive->kvm;
>> +	int i;
>> +
>> +	debugfs_remove(xive->dentry);
>> +
>> +	pr_devel("Destroying xive native for partition\n");
>> +
>> +	if (kvm)
>> +		kvm->arch.xive = NULL;
>> +
>> +	/* Mask and free interrupts */
>> +	for (i = 0; i <= xive->max_sbid; i++) {
>> +		if (xive->src_blocks[i])
>> +			kvmppc_xive_free_sources(xive->src_blocks[i]);
>> +		kfree(xive->src_blocks[i]);
>> +		xive->src_blocks[i] = NULL;
>> +	}
>> +
>> +	if (xive->vp_base != XIVE_INVALID_VP)
>> +		xive_native_free_vp_block(xive->vp_base);
>> +
>> +	kfree(xive);
>> +	kfree(dev);
>> +}
>> +
>> +static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
>> +{
>> +	struct kvmppc_xive *xive;
>> +	struct kvm *kvm = dev->kvm;
>> +	int ret = 0;
>> +
>> +	pr_devel("Creating xive native for partition\n");
>> +
>> +	if (kvm->arch.xive)
>> +		return -EEXIST;
>> +
>> +	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
>> +	if (!xive)
>> +		return -ENOMEM;
>> +
>> +	dev->private = xive;
>> +	xive->dev = dev;
>> +	xive->kvm = kvm;
>> +	kvm->arch.xive = xive;
>> +
>> +	/* We use the default queue size set by the host */
>> +	xive->q_order = xive_native_default_eq_shift();
>> +	if (xive->q_order < PAGE_SHIFT)
>> +		xive->q_page_order = 0;
>> +	else
>> +		xive->q_page_order = xive->q_order - PAGE_SHIFT;
>> +
>> +	/* Allocate a bunch of VPs */
>> +	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
>> +	pr_devel("VP_Base=%x\n", xive->vp_base);
>> +
>> +	if (xive->vp_base == XIVE_INVALID_VP)
>> +		ret = -ENOMEM;
>> +
>> +	xive->single_escalation = xive_native_has_single_escalation();
>> +
>> +	if (ret)
>> +		kfree(xive);
>> +
>> +	return ret;
>> +}
>> +
>> +static int xive_native_debug_show(struct seq_file *m, void *private)
>> +{
>> +	struct kvmppc_xive *xive = m->private;
>> +	struct kvm *kvm = xive->kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	unsigned int i;
>> +
>> +	if (!kvm)
>> +		return 0;
>> +
>> +	seq_puts(m, "=========\nVCPU state\n=========\n");
>> +
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +
>> +		if (!xc)
>> +			continue;
>> +
>> +		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
>> +			   xc->server_num,
>> +			   vcpu->arch.xive_saved_state.nsr,
>> +			   vcpu->arch.xive_saved_state.cppr,
>> +			   vcpu->arch.xive_saved_state.ipb,
>> +			   vcpu->arch.xive_saved_state.pipr,
>> +			   vcpu->arch.xive_saved_state.w01,
>> +			   (u32) vcpu->arch.xive_cam_word);
>> +
>> +		kvmppc_xive_debug_show_queues(m, vcpu);
>> +	}
>> +
>> +	return 0;
>> +}
>> +
>> +static int xive_native_debug_open(struct inode *inode, struct file *file)
>> +{
>> +	return single_open(file, xive_native_debug_show, inode->i_private);
>> +}
>> +
>> +static const struct file_operations xive_native_debug_fops = {
>> +	.open = xive_native_debug_open,
>> +	.read = seq_read,
>> +	.llseek = seq_lseek,
>> +	.release = single_release,
>> +};
>> +
>> +static void xive_native_debugfs_init(struct kvmppc_xive *xive)
>> +{
>> +	char *name;
>> +
>> +	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
>> +	if (!name) {
>> +		pr_err("%s: no memory for name\n", __func__);
>> +		return;
>> +	}
>> +
>> +	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
>> +					   xive, &xive_native_debug_fops);
>> +
>> +	pr_debug("%s: created %s\n", __func__, name);
>> +	kfree(name);
>> +}
>> +
>> +static void kvmppc_xive_native_init(struct kvm_device *dev)
>> +{
>> +	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
>> +
>> +	/* Register some debug interfaces */
>> +	xive_native_debugfs_init(xive);
>> +}
>> +
>> +struct kvm_device_ops kvm_xive_native_ops = {
>> +	.name = "kvm-xive-native",
>> +	.create = kvmppc_xive_native_create,
>> +	.init = kvmppc_xive_native_init,
>> +	.destroy = kvmppc_xive_native_free,
>> +	.set_attr = kvmppc_xive_native_set_attr,
>> +	.get_attr = kvmppc_xive_native_get_attr,
>> +	.has_attr = kvmppc_xive_native_has_attr,
>> +};
>> +
>> +void kvmppc_xive_native_init_module(void)
>> +{
>> +	;
>> +}
>> +
>> +void kvmppc_xive_native_exit_module(void)
>> +{
>> +	;
>> +}
>> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
>> index b90a7d154180..01d526e15e9d 100644
>> --- a/arch/powerpc/kvm/powerpc.c
>> +++ b/arch/powerpc/kvm/powerpc.c
>> @@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
>>  	case KVM_CAP_PPC_ENABLE_HCALL:
>>  #ifdef CONFIG_KVM_XICS
>>  	case KVM_CAP_IRQ_XICS:
>> +#endif
>> +#ifdef CONFIG_KVM_XIVE
>> +	case KVM_CAP_PPC_IRQ_XIVE:
>>  #endif
>>  	case KVM_CAP_PPC_GET_CPU_CHAR:
>>  		r = 1;
>> @@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
>>  		else
>>  			kvmppc_xics_free_icp(vcpu);
>>  		break;
>> +	case KVMPPC_IRQ_XIVE:
>> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
>> +		break;
>>  	}
>>  
>>  	kvmppc_core_vcpu_free(vcpu);
>> @@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>>  		break;
>>  	}
>>  #endif /* CONFIG_KVM_XICS */
>> +#ifdef CONFIG_KVM_XIVE
>> +	case KVM_CAP_PPC_IRQ_XIVE: {
>> +		struct fd f;
>> +		struct kvm_device *dev;
>> +
>> +		r = -EBADF;
>> +		f = fdget(cap->args[0]);
>> +		if (!f.file)
>> +			break;
>> +
>> +		r = -ENXIO;
>> +		if (!xive_enabled())
>> +			break;
>> +
>> +		r = -EPERM;
>> +		dev = kvm_device_from_filp(f.file);
>> +		if (dev)
>> +			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
>> +							    cap->args[1]);
>> +
>> +		fdput(f);
>> +		break;
>> +	}
>> +#endif /* CONFIG_KVM_XIVE */
>>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
>>  	case KVM_CAP_PPC_FWNMI:
>>  		r = -EINVAL;
>> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
>> index 64f1135e7732..806cbe488410 100644
>> --- a/arch/powerpc/kvm/Makefile
>> +++ b/arch/powerpc/kvm/Makefile
>> @@ -99,7 +99,7 @@ endif
>>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
>>  	book3s_xics.o
>>  
>> -kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
>> +kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
>>  kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
>>  
>>  kvm-book3s_64-module-objs := \
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-04  4:45   ` David Gibson
@ 2019-02-04 11:30     ` Cédric Le Goater
  2019-02-05  5:28       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 11:30 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 5:45 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>> This will let the guest create a memory mapping to expose the ESB MMIO
>> regions used to control the interrupt sources, to trigger events, to
>> EOI or to turn off the sources.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>>  2 files changed, 101 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 8c876c166ef2..6bb61ba141c2 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>  
>> +/* POWER9 XIVE Native Interrupt Controller */
>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> 
> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
> to both with an mmap() directly on the xive device fd?  Using the
> offset to distinguish which one to map, obviously.

The page offset would define some sort of user API. It seems feasible.
But I am not sure this would be practical in the future if we need to 
tune the length.

The TIMA has two pages that can be exposed at guest level for interrupt 
management : the OS and the USER page. That should be OK.

But we might want to map only portions of the interrupt ESB space, for 
PCI passthrough for instance as Paul proposed. I am still looking at that.

Thanks,

C.

>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index 115143e76c45..e20081f0c8d4 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
>>  	return rc;
>>  }
>>  
>> +static int xive_native_esb_fault(struct vm_fault *vmf)
>> +{
>> +	struct vm_area_struct *vma = vmf->vma;
>> +	struct kvmppc_xive *xive = vma->vm_file->private_data;
>> +	struct kvmppc_xive_src_block *sb;
>> +	struct kvmppc_xive_irq_state *state;
>> +	struct xive_irq_data *xd;
>> +	u32 hw_num;
>> +	u16 src;
>> +	u64 page;
>> +	unsigned long irq;
>> +
>> +	/*
>> +	 * Linux/KVM uses a two pages ESB setting, one for trigger and
>> +	 * one for EOI
>> +	 */
>> +	irq = vmf->pgoff / 2;
>> +
>> +	sb = kvmppc_xive_find_source(xive, irq, &src);
>> +	if (!sb) {
>> +		pr_err("%s: source %lx not found !\n", __func__, irq);
>> +		return VM_FAULT_SIGBUS;
>> +	}
>> +
>> +	state = &sb->irq_state[src];
>> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
>> +
>> +	arch_spin_lock(&sb->lock);
>> +
>> +	/*
>> +	 * first/even page is for trigger
>> +	 * second/odd page is for EOI and management.
>> +	 */
>> +	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
>> +	arch_spin_unlock(&sb->lock);
>> +
>> +	if (!page) {
>> +		pr_err("%s: acessing invalid ESB page for source %lx !\n",
>> +		       __func__, irq);
>> +		return VM_FAULT_SIGBUS;
>> +	}
>> +
>> +	vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
>> +	return VM_FAULT_NOPAGE;
>> +}
>> +
>> +static const struct vm_operations_struct xive_native_esb_vmops = {
>> +	.fault = xive_native_esb_fault,
>> +};
>> +
>> +static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
>> +{
>> +	/* There are two ESB pages (trigger and EOI) per IRQ */
>> +	if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
>> +		return -EINVAL;
>> +
>> +	vma->vm_flags |= VM_IO | VM_PFNMAP;
>> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
>> +	vma->vm_ops = &xive_native_esb_vmops;
>> +	return 0;
>> +}
>> +
>> +static const struct file_operations xive_native_esb_fops = {
>> +	.mmap = xive_native_esb_mmap,
>> +};
>> +
>> +static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr)
>> +{
>> +	u64 __user *ubufp = (u64 __user *) addr;
>> +	int ret;
>> +
>> +	ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive,
>> +				O_RDWR | O_CLOEXEC);
>> +	if (ret < 0)
>> +		return ret;
>> +
>> +	return put_user(ret, ubufp);
>> +}
>> +
>>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> @@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> +	struct kvmppc_xive *xive = dev->private;
>> +
>> +	switch (attr->group) {
>> +	case KVM_DEV_XIVE_GRP_CTRL:
>> +		switch (attr->attr) {
>> +		case KVM_DEV_XIVE_GET_ESB_FD:
>> +			return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
>> +		}
>> +		break;
>> +	}
>>  	return -ENXIO;
>>  }
>>  
>>  static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> +	switch (attr->group) {
>> +	case KVM_DEV_XIVE_GRP_CTRL:
>> +		switch (attr->attr) {
>> +		case KVM_DEV_XIVE_GET_ESB_FD:
>> +			return 0;
>> +		}
>> +		break;
>> +	}
>>  	return -ENXIO;
>>  }
>>  
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the XIVE native device
  2019-02-04  4:49       ` David Gibson
@ 2019-02-04 15:36         ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 15:36 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, linuxppc-dev

On 2/4/19 5:49 AM, David Gibson wrote:
> On Wed, Jan 23, 2019 at 05:56:26PM +0100, Cédric Le Goater wrote:
>> On 1/22/19 6:14 AM, Paul Mackerras wrote:
>>> On Mon, Jan 07, 2019 at 07:43:20PM +0100, Cédric Le Goater wrote:
>>>> The ESB MMIO region controls the interrupt sources of the guest. QEMU
>>>> will query an fd (GET_ESB_FD ioctl) and map this region at a specific
>>>> address for the guest to use. The guest will obtain this information
>>>> using the H_INT_GET_SOURCE_INFO hcall. To inform KVM of the address
>>>> setting used by QEMU, add a VC_BASE control to the KVM XIVE device
>>>
>>> This needs a little more explanation.  I *think* the only way this
>>> gets used is that it gets returned to the guest by the new
>>> hypercalls.  If that is indeed the case it would be useful to mention
>>> that in the patch description, because otherwise taking a value that
>>> userspace provides and which looks like it is an address, and not
>>> doing any validation on it, looks a bit scary.
>>
>> I think we have solved this problem in another email thread. 
>>
>> The H_INT_GET_SOURCE_INFO hcall does not need to be implemented in KVM
>> as all the source information should already be available in QEMU. In
>> that case, there is no need to inform KVM of where the ESB pages are 
>> mapped in the guest address space. So we don't need that extra control
>> on the KVM device. This is good news.
> 
> Ah, good to hear.  I thought this looked strange.

yes. I didn't know which path to choose between HV real mode, HV, QEMU. 
It's clarified now. 

But now, we have nested, and this is adding quite a bit of strangeness 
to the hcall possibilities.

C.  

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address
  2019-02-04  5:15   ` David Gibson
@ 2019-02-04 15:37     ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 15:37 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 6:15 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:24PM +0100, Cédric Le Goater wrote:
>> The guest physical address of the event queue will be part of the
>> state to transfer in the migration. Cache its value when the queue is
>> configured, it will save us an OPAL call.
> 
> That doesn't sound like a very compelling case - migration is already
> a hundreds of milliseconds type operation, I wouldn't expect a few
> extra OPAL calls to be an issue.

OK. I don't think this is much a problem anyhow. Let's call OPAL.

C. 

 
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/asm/xive.h       | 2 ++
>>  arch/powerpc/kvm/book3s_xive_native.c | 4 ++++
>>  2 files changed, 6 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/xive.h b/arch/powerpc/include/asm/xive.h
>> index 7a7aa22d8258..e90c3c5d9533 100644
>> --- a/arch/powerpc/include/asm/xive.h
>> +++ b/arch/powerpc/include/asm/xive.h
>> @@ -74,6 +74,8 @@ struct xive_q {
>>  	u32			esc_irq;
>>  	atomic_t		count;
>>  	atomic_t		pending_count;
>> +	u64			guest_qpage;
>> +	u32			guest_qsize;
>>  };
>>  
>>  /* Global enable flags for the XIVE support */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index 35d806740c3a..4ca75aade069 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -708,6 +708,10 @@ static int kvmppc_h_int_set_queue_config(struct kvm_vcpu *vcpu,
>>  	}
>>  	qaddr = page_to_virt(page) + (qpage & ~PAGE_MASK);
>>  
>> +	/* Backup queue page address and size for migration */
>> +	q->guest_qpage = qpage;
>> +	q->guest_qsize = qsize;
>> +
>>  	rc = xive_native_configure_queue(xc->vp_id, q, priority,
>>  					 (__be32 *) qaddr, qsize, true);
>>  	if (rc) {
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration
  2019-02-04  5:17   ` David Gibson
@ 2019-02-04 15:39     ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 15:39 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 6:17 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:25PM +0100, Cédric Le Goater wrote:
>> When migration of a VM is initiated, a first copy of the RAM is
>> transferred to the destination before the VM is stopped. At that time,
>> QEMU needs to perform a XIVE quiesce sequence to stop the flow of
>> event notifications and stabilize the EQs. The sources are masked and
>> the XIVE IC is synced with the KVM ioctl KVM_DEV_XIVE_GRP_SYNC.
>>
> 
> Don't you also need to make sure the guests queue pages are marked
> dirty here, in case they were already migrated?

I have added an extra KVM service to mark the EQ pages dirty. That 
might be overkill as it seems you are suggesting.

C. 
 
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>>  arch/powerpc/kvm/book3s_xive_native.c | 32 +++++++++++++++++++++++++++
>>  2 files changed, 33 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 6fc9660c5aec..f3b859223b80 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -681,6 +681,7 @@ struct kvm_ppc_cpu_char {
>>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>>  #define   KVM_DEV_XIVE_VC_BASE		3
>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>> +#define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>  
>>  /* Layout of 64-bit XIVE source attribute values */
>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index 4ca75aade069..a8052867afc1 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -459,6 +459,35 @@ static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
>>  	return 0;
>>  }
>>  
>> +static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>> +{
>> +	struct kvmppc_xive_src_block *sb;
>> +	struct kvmppc_xive_irq_state *state;
>> +	struct xive_irq_data *xd;
>> +	u32 hw_num;
>> +	u16 src;
>> +
>> +	pr_devel("%s irq=0x%lx\n", __func__, irq);
>> +
>> +	sb = kvmppc_xive_find_source(xive, irq, &src);
>> +	if (!sb)
>> +		return -ENOENT;
>> +
>> +	state = &sb->irq_state[src];
>> +
>> +	if (!state->valid)
>> +		return -ENOENT;
>> +
>> +	arch_spin_lock(&sb->lock);
>> +
>> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
>> +	xive_native_sync_source(hw_num);
>> +	xive_native_sync_queue(hw_num);
>> +
>> +	arch_spin_unlock(&sb->lock);
>> +	return 0;
>> +}
>> +
>>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> @@ -474,6 +503,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  	case KVM_DEV_XIVE_GRP_SOURCES:
>>  		return kvmppc_xive_native_set_source(xive, attr->attr,
>>  						     attr->addr);
>> +	case KVM_DEV_XIVE_GRP_SYNC:
>> +		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
>>  	}
>>  	return -ENXIO;
>>  }
>> @@ -511,6 +542,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>>  		}
>>  		break;
>>  	case KVM_DEV_XIVE_GRP_SOURCES:
>> +	case KVM_DEV_XIVE_GRP_SYNC:
>>  		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
>>  		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>>  			return 0;
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
  2019-02-04  5:18   ` David Gibson
@ 2019-02-04 15:46     ` Cédric Le Goater
  2019-02-05  5:30       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 15:46 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 6:18 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:26PM +0100, Cédric Le Goater wrote:
>> When the VM is stopped in a migration sequence, the sources are masked
>> and the XIVE IC is synced to stabilize the EQs. When done, the KVM
>> ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages.
>>
>> The migration can then transfer the remaining dirty pages to the
>> destination and start collecting the state of the devices.
> 
> Is there a reason to make this a separate step from the SYNC
> operation?

Hmm, apart from letting QEMU orchestrate the migration step by step, no.

We could merge the SYNC and the SAVE_EQ_PAGES in a single KVM operation. 
I think that should be fine. 

However, it does not make sense to call this operation without the VM 
being stopped. I wonder how this can checked from KVM. May be we can't. 

C. 

> 
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   |  1 +
>>  arch/powerpc/kvm/book3s_xive_native.c | 40 +++++++++++++++++++++++++++
>>  2 files changed, 41 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index f3b859223b80..1a8740629acf 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -680,6 +680,7 @@ struct kvm_ppc_cpu_char {
>>  #define   KVM_DEV_XIVE_GET_ESB_FD	1
>>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>>  #define   KVM_DEV_XIVE_VC_BASE		3
>> +#define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>  
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index a8052867afc1..f2de1bcf3b35 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -373,6 +373,43 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
>>  	return put_user(ret, ubufp);
>>  }
>>  
>> +static int kvmppc_xive_native_vcpu_save_eq_pages(struct kvm_vcpu *vcpu)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	unsigned int prio;
>> +
>> +	if (!xc)
>> +		return -ENOENT;
>> +
>> +	for (prio = 0; prio < KVMPPC_XIVE_Q_COUNT; prio++) {
>> +		struct xive_q *q = &xc->queues[prio];
>> +
>> +		if (!q->qpage)
>> +			continue;
>> +
>> +		/* Mark EQ page dirty for migration */
>> +		mark_page_dirty(vcpu->kvm, gpa_to_gfn(q->guest_qpage));
>> +	}
>> +	return 0;
>> +}
>> +
>> +static int kvmppc_xive_native_save_eq_pages(struct kvmppc_xive *xive)
>> +{
>> +	struct kvm *kvm = xive->kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	unsigned int i;
>> +
>> +	pr_devel("%s\n", __func__);
>> +
>> +	mutex_lock(&kvm->lock);
>> +	kvm_for_each_vcpu(i, vcpu, kvm) {
>> +		kvmppc_xive_native_vcpu_save_eq_pages(vcpu);
>> +	}
>> +	mutex_unlock(&kvm->lock);
>> +
>> +	return 0;
>> +}
>> +
>>  static int xive_native_validate_queue_size(u32 qsize)
>>  {
>>  	switch (qsize) {
>> @@ -498,6 +535,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  		switch (attr->attr) {
>>  		case KVM_DEV_XIVE_VC_BASE:
>>  			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
>> +		case KVM_DEV_XIVE_SAVE_EQ_PAGES:
>> +			return kvmppc_xive_native_save_eq_pages(xive);
>>  		}
>>  		break;
>>  	case KVM_DEV_XIVE_GRP_SOURCES:
>> @@ -538,6 +577,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>>  		case KVM_DEV_XIVE_GET_ESB_FD:
>>  		case KVM_DEV_XIVE_GET_TIMA_FD:
>>  		case KVM_DEV_XIVE_VC_BASE:
>> +		case KVM_DEV_XIVE_SAVE_EQ_PAGES:
>>  			return 0;
>>  		}
>>  		break;
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-04  5:21   ` David Gibson
@ 2019-02-04 16:07     ` Cédric Le Goater
  2019-02-05  5:32       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 16:07 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 6:21 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
>> Theses are use to capure the XIVE EAS table of the KVM device, the
>> configuration of the source targets.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
>>  2 files changed, 98 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 1a8740629acf..faf024f39858 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>>  
>>  /* Layout of 64-bit XIVE source attribute values */
>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>  
>> +/* Layout of 64-bit eas attribute values */
>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>> +
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index f2de1bcf3b35..0468b605baa7 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>>  	return 0;
>>  }
>>  
>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
>> +				      u64 addr)
> 
> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
> values, but rather essentially the "source config" in the terminology
> of the PAPR hcalls.  Which, yes, is basically implemented by setting
> the EAS, but since it's the PAPR architected state that we need to
> preserve across migration, I'd prefer to stick as close as we can to
> the PAPR terminology.

But we don't have an equivalent name in the PAPR specs for the tuple 
(prio, server). We could use the generic 'target' name may be ? even 
if this is usually referring to a CPU number.

Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
This is was the former name in HW. I think we recycle it for KVM. 
 
C.  

> 
>> +{
>> +	struct kvmppc_xive_src_block *sb;
>> +	struct kvmppc_xive_irq_state *state;
>> +	u64 __user *ubufp = (u64 __user *) addr;
>> +	u16 src;
>> +	u64 kvm_eas;
>> +	u32 server;
>> +	u8 priority;
>> +	u32 eisn;
>> +
>> +	sb = kvmppc_xive_find_source(xive, irq, &src);
>> +	if (!sb)
>> +		return -ENOENT;
>> +
>> +	state = &sb->irq_state[src];
>> +
>> +	if (!state->valid)
>> +		return -EINVAL;
>> +
>> +	if (get_user(kvm_eas, ubufp))
>> +		return -EFAULT;
>> +
>> +	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
>> +
>> +	priority = (kvm_eas & KVM_XIVE_EAS_PRIORITY_MASK) >>
>> +		KVM_XIVE_EAS_PRIORITY_SHIFT;
>> +	server = (kvm_eas & KVM_XIVE_EAS_SERVER_MASK) >>
>> +		KVM_XIVE_EAS_SERVER_SHIFT;
>> +	eisn = (kvm_eas & KVM_XIVE_EAS_EISN_MASK) >> KVM_XIVE_EAS_EISN_SHIFT;
>> +
>> +	if (priority != xive_prio_from_guest(priority)) {
>> +		pr_err("invalid priority for queue %d for VCPU %d\n",
>> +		       priority, server);
>> +		return -EINVAL;
>> +	}
>> +
>> +	return kvmppc_xive_native_set_source_config(xive, sb, state, server,
>> +						    priority, eisn);
>> +}
>> +
>> +static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
>> +				      u64 addr)
>> +{
>> +	struct kvmppc_xive_src_block *sb;
>> +	struct kvmppc_xive_irq_state *state;
>> +	u64 __user *ubufp = (u64 __user *) addr;
>> +	u16 src;
>> +	u64 kvm_eas;
>> +
>> +	sb = kvmppc_xive_find_source(xive, irq, &src);
>> +	if (!sb)
>> +		return -ENOENT;
>> +
>> +	state = &sb->irq_state[src];
>> +
>> +	if (!state->valid)
>> +		return -EINVAL;
>> +
>> +	arch_spin_lock(&sb->lock);
>> +
>> +	if (state->act_priority == MASKED)
>> +		kvm_eas = KVM_XIVE_EAS_MASK_MASK;
>> +	else {
>> +		kvm_eas = (state->act_priority << KVM_XIVE_EAS_PRIORITY_SHIFT) &
>> +			KVM_XIVE_EAS_PRIORITY_MASK;
>> +		kvm_eas |= (state->act_server << KVM_XIVE_EAS_SERVER_SHIFT) &
>> +			KVM_XIVE_EAS_SERVER_MASK;
>> +		kvm_eas |= ((u64) state->eisn << KVM_XIVE_EAS_EISN_SHIFT) &
>> +			KVM_XIVE_EAS_EISN_MASK;
>> +	}
>> +	arch_spin_unlock(&sb->lock);
>> +
>> +	pr_devel("%s irq=0x%lx eas=%016llx\n", __func__, irq, kvm_eas);
>> +
>> +	if (put_user(kvm_eas, ubufp))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
>> +
>>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> @@ -544,6 +626,8 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  						     attr->addr);
>>  	case KVM_DEV_XIVE_GRP_SYNC:
>>  		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
>> +	case KVM_DEV_XIVE_GRP_EAS:
>> +		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
>>  	}
>>  	return -ENXIO;
>>  }
>> @@ -564,6 +648,8 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>>  			return kvmppc_xive_native_get_vc_base(xive, attr->addr);
>>  		}
>>  		break;
>> +	case KVM_DEV_XIVE_GRP_EAS:
>> +		return kvmppc_xive_native_get_eas(xive, attr->attr, attr->addr);
>>  	}
>>  	return -ENXIO;
>>  }
>> @@ -583,6 +669,7 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>>  		break;
>>  	case KVM_DEV_XIVE_GRP_SOURCES:
>>  	case KVM_DEV_XIVE_GRP_SYNC:
>> +	case KVM_DEV_XIVE_GRP_EAS:
>>  		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
>>  		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>>  			return 0;
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  2019-02-04  5:26   ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state David Gibson
@ 2019-02-04 18:57     ` Cédric Le Goater
  2019-02-05  5:33       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 18:57 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 6:26 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 08:10:04PM +0100, Cédric Le Goater wrote:
>> At a VCPU level, the state of the thread context interrupt management
>> registers needs to be collected. These registers are cached under the
>> 'xive_saved_state.w01' field of the VCPU when the VPCU context is
>> pulled from the HW thread. An OPAL call retrieves the backup of the
>> IPB register in the NVT structure and merges it in the KVM state.
>>
>> The structures of the interface between QEMU and KVM provisions some
>> extra room (two u64) for further extensions if more state needs to be
>> transferred back to QEMU.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/asm/kvm_ppc.h    |  5 ++
>>  arch/powerpc/include/uapi/asm/kvm.h   |  2 +
>>  arch/powerpc/kvm/book3s.c             | 24 +++++++++
>>  arch/powerpc/kvm/book3s_xive_native.c | 78 +++++++++++++++++++++++++++
>>  4 files changed, 109 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>> index 4cc897039485..49c488af168c 100644
>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>> @@ -270,6 +270,7 @@ union kvmppc_one_reg {
>>  		u64	addr;
>>  		u64	length;
>>  	}	vpaval;
>> +	u64	xive_timaval[4];
>>  };
>>  
>>  struct kvmppc_ops {
>> @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
>>  extern void kvmppc_xive_native_init_module(void);
>>  extern void kvmppc_xive_native_exit_module(void);
>>  extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
>> +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
>> +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
>>  
>>  #else
>>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>> @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { }
>>  static inline void kvmppc_xive_native_exit_module(void) { }
>>  static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
>>  	{ return 0; }
>> +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; }
>> +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return -ENOENT; }
> 
> IIRC "VP" is the old name for "TCTX".  Since we're using tctx in the
> rest of the XIVE code, can we use it here as well.

OK. The state we are getting or setting is indeed related to the thread 
interrupt  context registers. 

The name VP is related to an identifier to some interrupt context under 
OPAL (NVT in HW to be precise).  

C.

> 
>>  #endif /* CONFIG_KVM_XIVE */
>>  
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 95302558ce10..3c958c39a782 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -480,6 +480,8 @@ struct kvm_ppc_cpu_char {
>>  #define  KVM_REG_PPC_ICP_PPRI_SHIFT	16	/* pending irq priority */
>>  #define  KVM_REG_PPC_ICP_PPRI_MASK	0xff
>>  
>> +#define KVM_REG_PPC_VP_STATE	(KVM_REG_PPC | KVM_REG_SIZE_U256 | 0x8d)
>> +
>>  /* Device control API: PPC-specific devices */
>>  #define KVM_DEV_MPIC_GRP_MISC		1
>>  #define   KVM_DEV_MPIC_BASE_ADDR	0	/* 64-bit */
>> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
>> index de7eed191107..5ad658077a35 100644
>> --- a/arch/powerpc/kvm/book3s.c
>> +++ b/arch/powerpc/kvm/book3s.c
>> @@ -641,6 +641,18 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
>>  				*val = get_reg_val(id, kvmppc_xics_get_icp(vcpu));
>>  			break;
>>  #endif /* CONFIG_KVM_XICS */
>> +#ifdef CONFIG_KVM_XIVE
>> +		case KVM_REG_PPC_VP_STATE:
>> +			if (!vcpu->arch.xive_vcpu) {
>> +				r = -ENXIO;
>> +				break;
>> +			}
>> +			if (xive_enabled())
>> +				r = kvmppc_xive_native_get_vp(vcpu, val);
>> +			else
>> +				r = -ENXIO;
>> +			break;
>> +#endif /* CONFIG_KVM_XIVE */
>>  		case KVM_REG_PPC_FSCR:
>>  			*val = get_reg_val(id, vcpu->arch.fscr);
>>  			break;
>> @@ -714,6 +726,18 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
>>  				r = kvmppc_xics_set_icp(vcpu, set_reg_val(id, *val));
>>  			break;
>>  #endif /* CONFIG_KVM_XICS */
>> +#ifdef CONFIG_KVM_XIVE
>> +		case KVM_REG_PPC_VP_STATE:
>> +			if (!vcpu->arch.xive_vcpu) {
>> +				r = -ENXIO;
>> +				break;
>> +			}
>> +			if (xive_enabled())
>> +				r = kvmppc_xive_native_set_vp(vcpu, val);
>> +			else
>> +				r = -ENXIO;
>> +			break;
>> +#endif /* CONFIG_KVM_XIVE */
>>  		case KVM_REG_PPC_FSCR:
>>  			vcpu->arch.fscr = set_reg_val(id, *val);
>>  			break;
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index f4eb71eafc57..1aefb366df0b 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -424,6 +424,84 @@ static int xive_native_validate_queue_size(u32 qsize)
>>  	}
>>  }
>>  
>> +#define TM_IPB_SHIFT 40
>> +#define TM_IPB_MASK  (((u64) 0xFF) << TM_IPB_SHIFT)
>> +
>> +int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	u64 opal_state;
>> +	int rc;
>> +
>> +	if (!kvmppc_xive_enabled(vcpu))
>> +		return -EPERM;
>> +
>> +	if (!xc)
>> +		return -ENOENT;
>> +
>> +	/* Thread context registers. We only care about IPB and CPPR */
>> +	val->xive_timaval[0] = vcpu->arch.xive_saved_state.w01;
>> +
>> +	/*
>> +	 * Return the OS CAM line to print out the VP identifier in
>> +	 * the QEMU monitor. This is not restored.
>> +	 */
>> +	val->xive_timaval[1] = vcpu->arch.xive_cam_word;
>> +
>> +	/* Get the VP state from OPAL */
>> +	rc = xive_native_get_vp_state(xc->vp_id, &opal_state);
>> +	if (rc)
>> +		return rc;
>> +
>> +	/*
>> +	 * Capture the backup of IPB register in the NVT structure and
>> +	 * merge it in our KVM VP state.
>> +	 *
>> +	 * TODO: P10 support.
>> +	 */
>> +	val->xive_timaval[0] |= cpu_to_be64(opal_state & TM_IPB_MASK);
>> +
>> +	pr_devel("%s NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x opal=%016llx\n",
>> +		 __func__,
>> +		 vcpu->arch.xive_saved_state.nsr,
>> +		 vcpu->arch.xive_saved_state.cppr,
>> +		 vcpu->arch.xive_saved_state.ipb,
>> +		 vcpu->arch.xive_saved_state.pipr,
>> +		 vcpu->arch.xive_saved_state.w01,
>> +		 (u32) vcpu->arch.xive_cam_word, opal_state);
>> +
>> +	return 0;
>> +}
>> +
>> +int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val)
>> +{
>> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> +	struct kvmppc_xive *xive = vcpu->kvm->arch.xive;
>> +
>> +	pr_devel("%s w01=%016llx vp=%016llx\n", __func__,
>> +		 val->xive_timaval[0], val->xive_timaval[1]);
>> +
>> +	if (!kvmppc_xive_enabled(vcpu))
>> +		return -EPERM;
>> +
>> +	if (!xc || !xive)
>> +		return -ENOENT;
>> +
>> +	/* We can't update the state of a "pushed" VCPU	 */
>> +	if (WARN_ON(vcpu->arch.xive_pushed))
>> +		return -EIO;
>> +
>> +	/* Thread context registers. only restore IPB and CPPR ? */
>> +	vcpu->arch.xive_saved_state.w01 = val->xive_timaval[0];
>> +
>> +	/*
>> +	 * There is no need to restore the XIVE internal state (IPB
>> +	 * stored in the NVT) as the IPB register was merged in KVM VP
>> +	 * state.
>> +	 */
>> +	return 0;
>> +}
>> +
>>  static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
>>  					 u64 addr)
>>  {
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
  2019-02-04  4:57   ` David Gibson
@ 2019-02-04 19:07     ` Cédric Le Goater
  2019-02-05  5:35       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-04 19:07 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 5:57 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:21PM +0100, Cédric Le Goater wrote:
>> Interrupt sources are simply created at the OPAL level and then
>> MASKED. KVM only needs to know about their type: LSI or MSI.
> 
> This commit message isn't very illuminating.

There is room for improvement certainly.
 
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h           |  5 +
>>  arch/powerpc/kvm/book3s_xive_native.c         | 98 +++++++++++++++++++
>>  .../powerpc/kvm/book3s_xive_native_template.c | 27 +++++
>>  3 files changed, 130 insertions(+)
>>  create mode 100644 arch/powerpc/kvm/book3s_xive_native_template.c
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index 8b78b12aa118..6fc9660c5aec 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -680,5 +680,10 @@ struct kvm_ppc_cpu_char {
>>  #define   KVM_DEV_XIVE_GET_ESB_FD	1
>>  #define   KVM_DEV_XIVE_GET_TIMA_FD	2
>>  #define   KVM_DEV_XIVE_VC_BASE		3
>> +#define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>> +
>> +/* Layout of 64-bit XIVE source attribute values */
>> +#define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>> +#define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>  
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index 29a62914de55..2518640d4a58 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -31,6 +31,24 @@
>>  
>>  #include "book3s_xive.h"
>>  
>> +/*
>> + * We still instantiate them here because we use some of the
>> + * generated utility functions as well in this file.
> 
> And this comment is downright cryptic.

I have removed this part now that the hcalls are not done under
real mode anymore.
 
> 
>> + */
>> +#define XIVE_RUNTIME_CHECKS
>> +#define X_PFX xive_vm_
>> +#define X_STATIC static
>> +#define X_STAT_PFX stat_vm_
>> +#define __x_tima		xive_tima
>> +#define __x_eoi_page(xd)	((void __iomem *)((xd)->eoi_mmio))
>> +#define __x_trig_page(xd)	((void __iomem *)((xd)->trig_mmio))
>> +#define __x_writeb	__raw_writeb
>> +#define __x_readw	__raw_readw
>> +#define __x_readq	__raw_readq
>> +#define __x_writeq	__raw_writeq
>> +
>> +#include "book3s_xive_native_template.c"
>> +
>>  static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
>>  {
>>  	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
>> @@ -305,6 +323,78 @@ static int kvmppc_xive_native_get_tima_fd(struct kvmppc_xive *xive, u64 addr)
>>  	return put_user(ret, ubufp);
>>  }
>>  
>> +static int kvmppc_xive_native_set_source(struct kvmppc_xive *xive, long irq,
>> +					 u64 addr)
>> +{
>> +	struct kvmppc_xive_src_block *sb;
>> +	struct kvmppc_xive_irq_state *state;
>> +	u64 __user *ubufp = (u64 __user *) addr;
>> +	u64 val;
>> +	u16 idx;
>> +
>> +	pr_devel("%s irq=0x%lx\n", __func__, irq);
>> +
>> +	if (irq < KVMPPC_XIVE_FIRST_IRQ || irq >= KVMPPC_XIVE_NR_IRQS)
>> +		return -ENOENT;
>> +
>> +	sb = kvmppc_xive_find_source(xive, irq, &idx);
>> +	if (!sb) {
>> +		pr_debug("No source, creating source block...\n");
> 
> Doesn't this need to be protected by some lock?
> 
>> +		sb = kvmppc_xive_create_src_block(xive, irq);
>> +		if (!sb) {
>> +			pr_err("Failed to create block...\n");
>> +			return -ENOMEM;
>> +		}
>> +	}
>> +	state = &sb->irq_state[idx];
>> +
>> +	if (get_user(val, ubufp)) {
>> +		pr_err("fault getting user info !\n");
>> +		return -EFAULT;
>> +	}
>> +
>> +	/*
>> +	 * If the source doesn't already have an IPI, allocate
>> +	 * one and get the corresponding data
>> +	 */
>> +	if (!state->ipi_number) {
>> +		state->ipi_number = xive_native_alloc_irq();
>> +		if (state->ipi_number == 0) {
>> +			pr_err("Failed to allocate IRQ !\n");
>> +			return -ENOMEM;
>> +		}
> 
> Am I right in thinking this is the point at which a specific guest irq
> number gets bound to a specific host irq number?

yes. the XIVE IRQ state caches this information and 'state' should be 
protected before being assigned, indeed ... The XICS-over-XIVE device
also has the same race issue.

It's not showing because where initializing the KVM device sequentially
from QEMU and only once.

Thanks,

C. 
 

> 
>> +		xive_native_populate_irq_data(state->ipi_number,
>> +					      &state->ipi_data);
>> +		pr_debug("%s allocated hw_irq=0x%x for irq=0x%lx\n", __func__,
>> +			 state->ipi_number, irq);
>> +	}
>> +
>> +	arch_spin_lock(&sb->lock);
>> +
>> +	/* Restore LSI state */
>> +	if (val & KVM_XIVE_LEVEL_SENSITIVE) {
>> +		state->lsi = true;
>> +		if (val & KVM_XIVE_LEVEL_ASSERTED)
>> +			state->asserted = true;
>> +		pr_devel("  LSI ! Asserted=%d\n", state->asserted);
>> +	}
>> +
>> +	/* Mask IRQ to start with */
>> +	state->act_server = 0;
>> +	state->act_priority = MASKED;
>> +	xive_vm_esb_load(&state->ipi_data, XIVE_ESB_SET_PQ_01);
>> +	xive_native_configure_irq(state->ipi_number, 0, MASKED, 0);
>> +
>> +	/* Increment the number of valid sources and mark this one valid */
>> +	if (!state->valid)
>> +		xive->src_count++;
>> +	state->valid = true;
>> +
>> +	arch_spin_unlock(&sb->lock);
>> +
>> +	return 0;
>> +}
>> +
>>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> @@ -317,6 +407,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  			return kvmppc_xive_native_set_vc_base(xive, attr->addr);
>>  		}
>>  		break;
>> +	case KVM_DEV_XIVE_GRP_SOURCES:
>> +		return kvmppc_xive_native_set_source(xive, attr->attr,
>> +						     attr->addr);
>>  	}
>>  	return -ENXIO;
>>  }
>> @@ -353,6 +446,11 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>>  			return 0;
>>  		}
>>  		break;
>> +	case KVM_DEV_XIVE_GRP_SOURCES:
>> +		if (attr->attr >= KVMPPC_XIVE_FIRST_IRQ &&
>> +		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>> +			return 0;
>> +		break;
>>  	}
>>  	return -ENXIO;
>>  }
>> diff --git a/arch/powerpc/kvm/book3s_xive_native_template.c b/arch/powerpc/kvm/book3s_xive_native_template.c
>> new file mode 100644
>> index 000000000000..e7260da4a596
>> --- /dev/null
>> +++ b/arch/powerpc/kvm/book3s_xive_native_template.c
>> @@ -0,0 +1,27 @@
>> +// SPDX-License-Identifier: GPL-2.0
>> +/*
>> + * Copyright (c) 2017-2019, IBM Corporation.
>> + */
>> +
>> +/* File to be included by other .c files */
>> +
>> +#define XGLUE(a, b) a##b
>> +#define GLUE(a, b) XGLUE(a, b)
>> +
>> +/*
>> + * TODO: introduce a common template file with the XIVE native layer
>> + * and the XICS-on-XIVE glue for the utility functions
>> + */
>> +static u8 GLUE(X_PFX, esb_load)(struct xive_irq_data *xd, u32 offset)
>> +{
>> +	u64 val;
>> +
>> +	if (xd->flags & XIVE_IRQ_FLAG_SHIFT_BUG)
>> +		offset |= offset << 4;
>> +
>> +	val = __x_readq(__x_eoi_page(xd) + offset);
>> +#ifdef __LITTLE_ENDIAN__
>> +	val >>= 64-8;
>> +#endif
>> +	return (u8)val;
>> +}
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode
  2019-02-04 11:19     ` Cédric Le Goater
@ 2019-02-05  5:26       ` David Gibson
  0 siblings, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-02-05  5:26 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 19688 bytes --]

On Mon, Feb 04, 2019 at 12:19:07PM +0100, Cédric Le Goater wrote:
> On 2/4/19 5:25 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 07:43:17PM +0100, Cédric Le Goater wrote:
> >> This is the basic framework for the new KVM device supporting the XIVE
> >> native exploitation mode. The user interface exposes a new capability
> >> and a new KVM device to be used by QEMU.
> >>
> >> Internally, the interface to the new KVM device is protected with a
> >> new interrupt mode: KVMPPC_IRQ_XIVE.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  arch/powerpc/include/asm/kvm_host.h   |   2 +
> >>  arch/powerpc/include/asm/kvm_ppc.h    |  21 ++
> >>  arch/powerpc/kvm/book3s_xive.h        |   3 +
> >>  include/uapi/linux/kvm.h              |   3 +
> >>  arch/powerpc/kvm/book3s.c             |   7 +-
> >>  arch/powerpc/kvm/book3s_xive_native.c | 332 ++++++++++++++++++++++++++
> >>  arch/powerpc/kvm/powerpc.c            |  30 +++
> >>  arch/powerpc/kvm/Makefile             |   2 +-
> >>  8 files changed, 398 insertions(+), 2 deletions(-)
> >>  create mode 100644 arch/powerpc/kvm/book3s_xive_native.c
> >>
> >> diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
> >> index 0f98f00da2ea..c522e8274ad9 100644
> >> --- a/arch/powerpc/include/asm/kvm_host.h
> >> +++ b/arch/powerpc/include/asm/kvm_host.h
> >> @@ -220,6 +220,7 @@ extern struct kvm_device_ops kvm_xics_ops;
> >>  struct kvmppc_xive;
> >>  struct kvmppc_xive_vcpu;
> >>  extern struct kvm_device_ops kvm_xive_ops;
> >> +extern struct kvm_device_ops kvm_xive_native_ops;
> >>  
> >>  struct kvmppc_passthru_irqmap;
> >>  
> >> @@ -446,6 +447,7 @@ struct kvmppc_passthru_irqmap {
> >>  #define KVMPPC_IRQ_DEFAULT	0
> >>  #define KVMPPC_IRQ_MPIC		1
> >>  #define KVMPPC_IRQ_XICS		2 /* Includes a XIVE option */
> >> +#define KVMPPC_IRQ_XIVE		3 /* XIVE native exploitation mode */
> >>  
> >>  #define MMIO_HPTE_CACHE_SIZE	4
> >>  
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index eb0d79f0ca45..1bb313f238fe 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -591,6 +591,18 @@ extern int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval);
> >>  extern int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
> >>  			       int level, bool line_status);
> >>  extern void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu);
> >> +
> >> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> >> +{
> >> +	return vcpu->arch.irq_type == KVMPPC_IRQ_XIVE;
> >> +}
> >> +
> >> +extern int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >> +				    struct kvm_vcpu *vcpu, u32 cpu);
> >> +extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> >> +extern void kvmppc_xive_native_init_module(void);
> >> +extern void kvmppc_xive_native_exit_module(void);
> >> +
> >>  #else
> >>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
> >>  				       u32 priority) { return -1; }
> >> @@ -614,6 +626,15 @@ static inline int kvmppc_xive_set_icp(struct kvm_vcpu *vcpu, u64 icpval) { retur
> >>  static inline int kvmppc_xive_set_irq(struct kvm *kvm, int irq_source_id, u32 irq,
> >>  				      int level, bool line_status) { return -ENODEV; }
> >>  static inline void kvmppc_xive_push_vcpu(struct kvm_vcpu *vcpu) { }
> >> +
> >> +static inline int kvmppc_xive_enabled(struct kvm_vcpu *vcpu)
> >> +	{ return 0; }
> >> +static inline int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >> +						  struct kvm_vcpu *vcpu, u32 cpu) { return -EBUSY; }
> >> +static inline void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu) { }
> >> +static inline void kvmppc_xive_native_init_module(void) { }
> >> +static inline void kvmppc_xive_native_exit_module(void) { }
> >> +
> >>  #endif /* CONFIG_KVM_XIVE */
> >>  
> >>  /*
> >> diff --git a/arch/powerpc/kvm/book3s_xive.h b/arch/powerpc/kvm/book3s_xive.h
> >> index 10c4aa5cd010..5f22415520b4 100644
> >> --- a/arch/powerpc/kvm/book3s_xive.h
> >> +++ b/arch/powerpc/kvm/book3s_xive.h
> >> @@ -12,6 +12,9 @@
> >>  #ifdef CONFIG_KVM_XICS
> >>  #include "book3s_xics.h"
> >>  
> >> +#define KVMPPC_XIVE_FIRST_IRQ	0
> >> +#define KVMPPC_XIVE_NR_IRQS	KVMPPC_XICS_NR_IRQS
> >> +
> >>  /*
> >>   * State for one guest irq source.
> >>   *
> >> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> >> index 6d4ea4b6c922..52bf74a1616e 100644
> >> --- a/include/uapi/linux/kvm.h
> >> +++ b/include/uapi/linux/kvm.h
> >> @@ -988,6 +988,7 @@ struct kvm_ppc_resize_hpt {
> >>  #define KVM_CAP_ARM_VM_IPA_SIZE 165
> >>  #define KVM_CAP_MANUAL_DIRTY_LOG_PROTECT 166
> >>  #define KVM_CAP_HYPERV_CPUID 167
> >> +#define KVM_CAP_PPC_IRQ_XIVE 168
> >>  
> >>  #ifdef KVM_CAP_IRQ_ROUTING
> >>  
> >> @@ -1211,6 +1212,8 @@ enum kvm_device_type {
> >>  #define KVM_DEV_TYPE_ARM_VGIC_V3	KVM_DEV_TYPE_ARM_VGIC_V3
> >>  	KVM_DEV_TYPE_ARM_VGIC_ITS,
> >>  #define KVM_DEV_TYPE_ARM_VGIC_ITS	KVM_DEV_TYPE_ARM_VGIC_ITS
> >> +	KVM_DEV_TYPE_XIVE,
> >> +#define KVM_DEV_TYPE_XIVE		KVM_DEV_TYPE_XIVE
> >>  	KVM_DEV_TYPE_MAX,
> >>  };
> >>  
> >> diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
> >> index bd1a677dd9e4..de7eed191107 100644
> >> --- a/arch/powerpc/kvm/book3s.c
> >> +++ b/arch/powerpc/kvm/book3s.c
> >> @@ -1039,7 +1039,10 @@ static int kvmppc_book3s_init(void)
> >>  #ifdef CONFIG_KVM_XIVE
> >>  	if (xive_enabled()) {
> >>  		kvmppc_xive_init_module();
> >> +		kvmppc_xive_native_init_module();
> >>  		kvm_register_device_ops(&kvm_xive_ops, KVM_DEV_TYPE_XICS);
> >> +		kvm_register_device_ops(&kvm_xive_native_ops,
> >> +					KVM_DEV_TYPE_XIVE);
> >>  	} else
> >>  #endif
> >>  		kvm_register_device_ops(&kvm_xics_ops, KVM_DEV_TYPE_XICS);
> >> @@ -1050,8 +1053,10 @@ static int kvmppc_book3s_init(void)
> >>  static void kvmppc_book3s_exit(void)
> >>  {
> >>  #ifdef CONFIG_KVM_XICS
> >> -	if (xive_enabled())
> >> +	if (xive_enabled()) {
> >>  		kvmppc_xive_exit_module();
> >> +		kvmppc_xive_native_exit_module();
> >> +	}
> >>  #endif
> >>  #ifdef CONFIG_KVM_BOOK3S_32_HANDLER
> >>  	kvmppc_book3s_exit_pr();
> >> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >> new file mode 100644
> >> index 000000000000..115143e76c45
> >> --- /dev/null
> >> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >> @@ -0,0 +1,332 @@
> >> +// SPDX-License-Identifier: GPL-2.0
> >> +/*
> >> + * Copyright (c) 2017-2019, IBM Corporation.
> >> + */
> >> +
> >> +#define pr_fmt(fmt) "xive-kvm: " fmt
> >> +
> >> +#include <linux/anon_inodes.h>
> >> +#include <linux/kernel.h>
> >> +#include <linux/kvm_host.h>
> >> +#include <linux/err.h>
> >> +#include <linux/gfp.h>
> >> +#include <linux/spinlock.h>
> >> +#include <linux/delay.h>
> >> +#include <linux/percpu.h>
> >> +#include <linux/cpumask.h>
> >> +#include <asm/uaccess.h>
> >> +#include <asm/kvm_book3s.h>
> >> +#include <asm/kvm_ppc.h>
> >> +#include <asm/hvcall.h>
> >> +#include <asm/xics.h>
> >> +#include <asm/xive.h>
> >> +#include <asm/xive-regs.h>
> >> +#include <asm/debug.h>
> >> +#include <asm/debugfs.h>
> >> +#include <asm/time.h>
> >> +#include <asm/opal.h>
> >> +
> >> +#include <linux/debugfs.h>
> >> +#include <linux/seq_file.h>
> >> +
> >> +#include "book3s_xive.h"
> >> +
> >> +static void xive_native_cleanup_queue(struct kvm_vcpu *vcpu, int prio)
> >> +{
> >> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >> +	struct xive_q *q = &xc->queues[prio];
> >> +
> >> +	xive_native_disable_queue(xc->vp_id, q, prio);
> >> +	if (q->qpage) {
> >> +		put_page(virt_to_page(q->qpage));
> >> +		q->qpage = NULL;
> >> +	}
> >> +}
> >> +
> >> +void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu)
> >> +{
> >> +	struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >> +	int i;
> >> +
> >> +	if (!kvmppc_xive_enabled(vcpu))
> >> +		return;
> >> +
> >> +	if (!xc)
> >> +		return;
> >> +
> >> +	pr_devel("native_cleanup_vcpu(cpu=%d)\n", xc->server_num);
> >> +
> >> +	/* Ensure no interrupt is still routed to that VP */
> >> +	xc->valid = false;
> >> +	kvmppc_xive_disable_vcpu_interrupts(vcpu);
> >> +
> >> +	/* Disable the VP */
> >> +	xive_native_disable_vp(xc->vp_id);
> >> +
> >> +	/* Free the queues & associated interrupts */
> >> +	for (i = 0; i < KVMPPC_XIVE_Q_COUNT; i++) {
> >> +		/* Free the escalation irq */
> >> +		if (xc->esc_virq[i]) {
> >> +			free_irq(xc->esc_virq[i], vcpu);
> >> +			irq_dispose_mapping(xc->esc_virq[i]);
> >> +			kfree(xc->esc_virq_names[i]);
> >> +			xc->esc_virq[i] = 0;
> >> +		}
> >> +
> >> +		/* Free the queue */
> >> +		xive_native_cleanup_queue(vcpu, i);
> >> +	}
> >> +
> >> +	/* Free the VP */
> >> +	kfree(xc);
> >> +
> >> +	/* Cleanup the vcpu */
> >> +	vcpu->arch.irq_type = KVMPPC_IRQ_DEFAULT;
> >> +	vcpu->arch.xive_vcpu = NULL;
> >> +}
> >> +
> >> +int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >> +				    struct kvm_vcpu *vcpu, u32 cpu)
> > 
> > Why do we need both a *vcpu and a cpu number as an integer?
> 
> To be in sync with the other similar routines : kvmppc_xics_connect_vcpu() 
> and kvmppc_xive_connect_vcpu().
> 
> But if we consider that this 'cpu' parameter is always in sync with 
> vcpu->vcpu_id, we could remove it from the KVM ioctl call I suppose.
> 
> Should we do the same for the other routines ? 

Well.. I don't know why they are that way.  Is that int parameter the
XICS server number, which need not be the same as the vcpu_id ?  Can
we set that arbitrarily in XIVE as well?

It looks like these parameters need a name change at least to make it
clearer what the distinction is.

> >> +{
> >> +	struct kvmppc_xive *xive = dev->private;
> >> +	struct kvmppc_xive_vcpu *xc;
> >> +	int rc;
> >> +
> >> +	pr_devel("native_connect_vcpu(cpu=%d)\n", cpu);
> >> +
> >> +	if (dev->ops != &kvm_xive_native_ops) {
> >> +		pr_devel("Wrong ops !\n");
> >> +		return -EPERM;
> >> +	}
> >> +	if (xive->kvm != vcpu->kvm)
> >> +		return -EPERM;
> >> +	if (vcpu->arch.irq_type)
> > 
> > Please use an explicit == / != here so we don't have to remember which
> > symbolic value corresponds to 0.
> 
> ok. I agree.
> 
> Thanks,
> 
> C. 
> 
> 
> > 
> >> +		return -EBUSY;
> >> +	if (kvmppc_xive_find_server(vcpu->kvm, cpu)) {
> >> +		pr_devel("Duplicate !\n");
> >> +		return -EEXIST;
> >> +	}
> >> +	if (cpu >= KVM_MAX_VCPUS) {
> >> +		pr_devel("Out of bounds !\n");
> >> +		return -EINVAL;
> >> +	}
> >> +	xc = kzalloc(sizeof(*xc), GFP_KERNEL);
> >> +	if (!xc)
> >> +		return -ENOMEM;
> >> +
> >> +	mutex_lock(&vcpu->kvm->lock);
> >> +	vcpu->arch.xive_vcpu = xc;
> >> +	xc->xive = xive;
> >> +	xc->vcpu = vcpu;
> >> +	xc->server_num = cpu;
> >> +	xc->vp_id = xive->vp_base + cpu;
> >> +	xc->valid = true;
> >> +
> >> +	rc = xive_native_get_vp_info(xc->vp_id, &xc->vp_cam, &xc->vp_chip_id);
> >> +	if (rc) {
> >> +		pr_err("Failed to get VP info from OPAL: %d\n", rc);
> >> +		goto bail;
> >> +	}
> >> +
> >> +	/*
> >> +	 * Enable the VP first as the single escalation mode will
> >> +	 * affect escalation interrupts numbering
> >> +	 */
> >> +	rc = xive_native_enable_vp(xc->vp_id, xive->single_escalation);
> >> +	if (rc) {
> >> +		pr_err("Failed to enable VP in OPAL: %d\n", rc);
> >> +		goto bail;
> >> +	}
> >> +
> >> +	/* Configure VCPU fields for use by assembly push/pull */
> >> +	vcpu->arch.xive_saved_state.w01 = cpu_to_be64(0xff000000);
> >> +	vcpu->arch.xive_cam_word = cpu_to_be32(xc->vp_cam | TM_QW1W2_VO);
> >> +
> >> +	/* TODO: initialize queues ? */
> >> +
> >> +bail:
> >> +	vcpu->arch.irq_type = KVMPPC_IRQ_XIVE;
> >> +	mutex_unlock(&vcpu->kvm->lock);
> >> +	if (rc)
> >> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> >> +
> >> +	return rc;
> >> +}
> >> +
> >> +static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
> >> +				       struct kvm_device_attr *attr)
> >> +{
> >> +	return -ENXIO;
> >> +}
> >> +
> >> +static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
> >> +				       struct kvm_device_attr *attr)
> >> +{
> >> +	return -ENXIO;
> >> +}
> >> +
> >> +static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
> >> +				       struct kvm_device_attr *attr)
> >> +{
> >> +	return -ENXIO;
> >> +}
> >> +
> >> +static void kvmppc_xive_native_free(struct kvm_device *dev)
> >> +{
> >> +	struct kvmppc_xive *xive = dev->private;
> >> +	struct kvm *kvm = xive->kvm;
> >> +	int i;
> >> +
> >> +	debugfs_remove(xive->dentry);
> >> +
> >> +	pr_devel("Destroying xive native for partition\n");
> >> +
> >> +	if (kvm)
> >> +		kvm->arch.xive = NULL;
> >> +
> >> +	/* Mask and free interrupts */
> >> +	for (i = 0; i <= xive->max_sbid; i++) {
> >> +		if (xive->src_blocks[i])
> >> +			kvmppc_xive_free_sources(xive->src_blocks[i]);
> >> +		kfree(xive->src_blocks[i]);
> >> +		xive->src_blocks[i] = NULL;
> >> +	}
> >> +
> >> +	if (xive->vp_base != XIVE_INVALID_VP)
> >> +		xive_native_free_vp_block(xive->vp_base);
> >> +
> >> +	kfree(xive);
> >> +	kfree(dev);
> >> +}
> >> +
> >> +static int kvmppc_xive_native_create(struct kvm_device *dev, u32 type)
> >> +{
> >> +	struct kvmppc_xive *xive;
> >> +	struct kvm *kvm = dev->kvm;
> >> +	int ret = 0;
> >> +
> >> +	pr_devel("Creating xive native for partition\n");
> >> +
> >> +	if (kvm->arch.xive)
> >> +		return -EEXIST;
> >> +
> >> +	xive = kzalloc(sizeof(*xive), GFP_KERNEL);
> >> +	if (!xive)
> >> +		return -ENOMEM;
> >> +
> >> +	dev->private = xive;
> >> +	xive->dev = dev;
> >> +	xive->kvm = kvm;
> >> +	kvm->arch.xive = xive;
> >> +
> >> +	/* We use the default queue size set by the host */
> >> +	xive->q_order = xive_native_default_eq_shift();
> >> +	if (xive->q_order < PAGE_SHIFT)
> >> +		xive->q_page_order = 0;
> >> +	else
> >> +		xive->q_page_order = xive->q_order - PAGE_SHIFT;
> >> +
> >> +	/* Allocate a bunch of VPs */
> >> +	xive->vp_base = xive_native_alloc_vp_block(KVM_MAX_VCPUS);
> >> +	pr_devel("VP_Base=%x\n", xive->vp_base);
> >> +
> >> +	if (xive->vp_base == XIVE_INVALID_VP)
> >> +		ret = -ENOMEM;
> >> +
> >> +	xive->single_escalation = xive_native_has_single_escalation();
> >> +
> >> +	if (ret)
> >> +		kfree(xive);
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static int xive_native_debug_show(struct seq_file *m, void *private)
> >> +{
> >> +	struct kvmppc_xive *xive = m->private;
> >> +	struct kvm *kvm = xive->kvm;
> >> +	struct kvm_vcpu *vcpu;
> >> +	unsigned int i;
> >> +
> >> +	if (!kvm)
> >> +		return 0;
> >> +
> >> +	seq_puts(m, "=========\nVCPU state\n=========\n");
> >> +
> >> +	kvm_for_each_vcpu(i, vcpu, kvm) {
> >> +		struct kvmppc_xive_vcpu *xc = vcpu->arch.xive_vcpu;
> >> +
> >> +		if (!xc)
> >> +			continue;
> >> +
> >> +		seq_printf(m, "cpu server %#x NSR=%02x CPPR=%02x IBP=%02x PIPR=%02x w01=%016llx w2=%08x\n",
> >> +			   xc->server_num,
> >> +			   vcpu->arch.xive_saved_state.nsr,
> >> +			   vcpu->arch.xive_saved_state.cppr,
> >> +			   vcpu->arch.xive_saved_state.ipb,
> >> +			   vcpu->arch.xive_saved_state.pipr,
> >> +			   vcpu->arch.xive_saved_state.w01,
> >> +			   (u32) vcpu->arch.xive_cam_word);
> >> +
> >> +		kvmppc_xive_debug_show_queues(m, vcpu);
> >> +	}
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +static int xive_native_debug_open(struct inode *inode, struct file *file)
> >> +{
> >> +	return single_open(file, xive_native_debug_show, inode->i_private);
> >> +}
> >> +
> >> +static const struct file_operations xive_native_debug_fops = {
> >> +	.open = xive_native_debug_open,
> >> +	.read = seq_read,
> >> +	.llseek = seq_lseek,
> >> +	.release = single_release,
> >> +};
> >> +
> >> +static void xive_native_debugfs_init(struct kvmppc_xive *xive)
> >> +{
> >> +	char *name;
> >> +
> >> +	name = kasprintf(GFP_KERNEL, "kvm-xive-%p", xive);
> >> +	if (!name) {
> >> +		pr_err("%s: no memory for name\n", __func__);
> >> +		return;
> >> +	}
> >> +
> >> +	xive->dentry = debugfs_create_file(name, 0444, powerpc_debugfs_root,
> >> +					   xive, &xive_native_debug_fops);
> >> +
> >> +	pr_debug("%s: created %s\n", __func__, name);
> >> +	kfree(name);
> >> +}
> >> +
> >> +static void kvmppc_xive_native_init(struct kvm_device *dev)
> >> +{
> >> +	struct kvmppc_xive *xive = (struct kvmppc_xive *)dev->private;
> >> +
> >> +	/* Register some debug interfaces */
> >> +	xive_native_debugfs_init(xive);
> >> +}
> >> +
> >> +struct kvm_device_ops kvm_xive_native_ops = {
> >> +	.name = "kvm-xive-native",
> >> +	.create = kvmppc_xive_native_create,
> >> +	.init = kvmppc_xive_native_init,
> >> +	.destroy = kvmppc_xive_native_free,
> >> +	.set_attr = kvmppc_xive_native_set_attr,
> >> +	.get_attr = kvmppc_xive_native_get_attr,
> >> +	.has_attr = kvmppc_xive_native_has_attr,
> >> +};
> >> +
> >> +void kvmppc_xive_native_init_module(void)
> >> +{
> >> +	;
> >> +}
> >> +
> >> +void kvmppc_xive_native_exit_module(void)
> >> +{
> >> +	;
> >> +}
> >> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> >> index b90a7d154180..01d526e15e9d 100644
> >> --- a/arch/powerpc/kvm/powerpc.c
> >> +++ b/arch/powerpc/kvm/powerpc.c
> >> @@ -566,6 +566,9 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
> >>  	case KVM_CAP_PPC_ENABLE_HCALL:
> >>  #ifdef CONFIG_KVM_XICS
> >>  	case KVM_CAP_IRQ_XICS:
> >> +#endif
> >> +#ifdef CONFIG_KVM_XIVE
> >> +	case KVM_CAP_PPC_IRQ_XIVE:
> >>  #endif
> >>  	case KVM_CAP_PPC_GET_CPU_CHAR:
> >>  		r = 1;
> >> @@ -753,6 +756,9 @@ void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
> >>  		else
> >>  			kvmppc_xics_free_icp(vcpu);
> >>  		break;
> >> +	case KVMPPC_IRQ_XIVE:
> >> +		kvmppc_xive_native_cleanup_vcpu(vcpu);
> >> +		break;
> >>  	}
> >>  
> >>  	kvmppc_core_vcpu_free(vcpu);
> >> @@ -1941,6 +1947,30 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
> >>  		break;
> >>  	}
> >>  #endif /* CONFIG_KVM_XICS */
> >> +#ifdef CONFIG_KVM_XIVE
> >> +	case KVM_CAP_PPC_IRQ_XIVE: {
> >> +		struct fd f;
> >> +		struct kvm_device *dev;
> >> +
> >> +		r = -EBADF;
> >> +		f = fdget(cap->args[0]);
> >> +		if (!f.file)
> >> +			break;
> >> +
> >> +		r = -ENXIO;
> >> +		if (!xive_enabled())
> >> +			break;
> >> +
> >> +		r = -EPERM;
> >> +		dev = kvm_device_from_filp(f.file);
> >> +		if (dev)
> >> +			r = kvmppc_xive_native_connect_vcpu(dev, vcpu,
> >> +							    cap->args[1]);
> >> +
> >> +		fdput(f);
> >> +		break;
> >> +	}
> >> +#endif /* CONFIG_KVM_XIVE */
> >>  #ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> >>  	case KVM_CAP_PPC_FWNMI:
> >>  		r = -EINVAL;
> >> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> >> index 64f1135e7732..806cbe488410 100644
> >> --- a/arch/powerpc/kvm/Makefile
> >> +++ b/arch/powerpc/kvm/Makefile
> >> @@ -99,7 +99,7 @@ endif
> >>  kvm-book3s_64-objs-$(CONFIG_KVM_XICS) += \
> >>  	book3s_xics.o
> >>  
> >> -kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o
> >> +kvm-book3s_64-objs-$(CONFIG_KVM_XIVE) += book3s_xive.o book3s_xive_native.o
> >>  kvm-book3s_64-objs-$(CONFIG_SPAPR_TCE_IOMMU) += book3s_64_vio.o
> >>  
> >>  kvm-book3s_64-module-objs := \
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-04 11:30     ` Cédric Le Goater
@ 2019-02-05  5:28       ` David Gibson
  2019-02-05 12:55         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-05  5:28 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6190 bytes --]

On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
> On 2/4/19 5:45 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> >> This will let the guest create a memory mapping to expose the ESB MMIO
> >> regions used to control the interrupt sources, to trigger events, to
> >> EOI or to turn off the sources.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
> >>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
> >>  2 files changed, 101 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 8c876c166ef2..6bb61ba141c2 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
> >>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
> >>  #define  KVM_XICS_QUEUED		(1ULL << 44)
> >>  
> >> +/* POWER9 XIVE Native Interrupt Controller */
> >> +#define KVM_DEV_XIVE_GRP_CTRL		1
> >> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> > 
> > Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
> > to both with an mmap() directly on the xive device fd?  Using the
> > offset to distinguish which one to map, obviously.
> 
> The page offset would define some sort of user API. It seems feasible.
> But I am not sure this would be practical in the future if we need to 
> tune the length.

Um.. why not?  I mean, yes the XIVE supports rather a lot of
interrupts, but we have 64-bits of offset we can play with - we can
leave room for billions of ESB slots and still have room for billions
of VPs.

> The TIMA has two pages that can be exposed at guest level for interrupt 
> management : the OS and the USER page. That should be OK.
> 
> But we might want to map only portions of the interrupt ESB space, for 
> PCI passthrough for instance as Paul proposed. I am still looking at that.
> 
> Thanks,
> 
> C.
> 
> >>  #endif /* __LINUX_KVM_POWERPC_H */
> >> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >> index 115143e76c45..e20081f0c8d4 100644
> >> --- a/arch/powerpc/kvm/book3s_xive_native.c
> >> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >> @@ -153,6 +153,85 @@ int kvmppc_xive_native_connect_vcpu(struct kvm_device *dev,
> >>  	return rc;
> >>  }
> >>  
> >> +static int xive_native_esb_fault(struct vm_fault *vmf)
> >> +{
> >> +	struct vm_area_struct *vma = vmf->vma;
> >> +	struct kvmppc_xive *xive = vma->vm_file->private_data;
> >> +	struct kvmppc_xive_src_block *sb;
> >> +	struct kvmppc_xive_irq_state *state;
> >> +	struct xive_irq_data *xd;
> >> +	u32 hw_num;
> >> +	u16 src;
> >> +	u64 page;
> >> +	unsigned long irq;
> >> +
> >> +	/*
> >> +	 * Linux/KVM uses a two pages ESB setting, one for trigger and
> >> +	 * one for EOI
> >> +	 */
> >> +	irq = vmf->pgoff / 2;
> >> +
> >> +	sb = kvmppc_xive_find_source(xive, irq, &src);
> >> +	if (!sb) {
> >> +		pr_err("%s: source %lx not found !\n", __func__, irq);
> >> +		return VM_FAULT_SIGBUS;
> >> +	}
> >> +
> >> +	state = &sb->irq_state[src];
> >> +	kvmppc_xive_select_irq(state, &hw_num, &xd);
> >> +
> >> +	arch_spin_lock(&sb->lock);
> >> +
> >> +	/*
> >> +	 * first/even page is for trigger
> >> +	 * second/odd page is for EOI and management.
> >> +	 */
> >> +	page = vmf->pgoff % 2 ? xd->eoi_page : xd->trig_page;
> >> +	arch_spin_unlock(&sb->lock);
> >> +
> >> +	if (!page) {
> >> +		pr_err("%s: acessing invalid ESB page for source %lx !\n",
> >> +		       __func__, irq);
> >> +		return VM_FAULT_SIGBUS;
> >> +	}
> >> +
> >> +	vmf_insert_pfn(vma, vmf->address, page >> PAGE_SHIFT);
> >> +	return VM_FAULT_NOPAGE;
> >> +}
> >> +
> >> +static const struct vm_operations_struct xive_native_esb_vmops = {
> >> +	.fault = xive_native_esb_fault,
> >> +};
> >> +
> >> +static int xive_native_esb_mmap(struct file *file, struct vm_area_struct *vma)
> >> +{
> >> +	/* There are two ESB pages (trigger and EOI) per IRQ */
> >> +	if (vma_pages(vma) + vma->vm_pgoff > KVMPPC_XIVE_NR_IRQS * 2)
> >> +		return -EINVAL;
> >> +
> >> +	vma->vm_flags |= VM_IO | VM_PFNMAP;
> >> +	vma->vm_page_prot = pgprot_noncached(vma->vm_page_prot);
> >> +	vma->vm_ops = &xive_native_esb_vmops;
> >> +	return 0;
> >> +}
> >> +
> >> +static const struct file_operations xive_native_esb_fops = {
> >> +	.mmap = xive_native_esb_mmap,
> >> +};
> >> +
> >> +static int kvmppc_xive_native_get_esb_fd(struct kvmppc_xive *xive, u64 addr)
> >> +{
> >> +	u64 __user *ubufp = (u64 __user *) addr;
> >> +	int ret;
> >> +
> >> +	ret = anon_inode_getfd("[xive-esb]", &xive_native_esb_fops, xive,
> >> +				O_RDWR | O_CLOEXEC);
> >> +	if (ret < 0)
> >> +		return ret;
> >> +
> >> +	return put_user(ret, ubufp);
> >> +}
> >> +
> >>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
> >>  				       struct kvm_device_attr *attr)
> >>  {
> >> @@ -162,12 +241,30 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
> >>  static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
> >>  				       struct kvm_device_attr *attr)
> >>  {
> >> +	struct kvmppc_xive *xive = dev->private;
> >> +
> >> +	switch (attr->group) {
> >> +	case KVM_DEV_XIVE_GRP_CTRL:
> >> +		switch (attr->attr) {
> >> +		case KVM_DEV_XIVE_GET_ESB_FD:
> >> +			return kvmppc_xive_native_get_esb_fd(xive, attr->addr);
> >> +		}
> >> +		break;
> >> +	}
> >>  	return -ENXIO;
> >>  }
> >>  
> >>  static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
> >>  				       struct kvm_device_attr *attr)
> >>  {
> >> +	switch (attr->group) {
> >> +	case KVM_DEV_XIVE_GRP_CTRL:
> >> +		switch (attr->attr) {
> >> +		case KVM_DEV_XIVE_GET_ESB_FD:
> >> +			return 0;
> >> +		}
> >> +		break;
> >> +	}
> >>  	return -ENXIO;
> >>  }
> >>  
> > 
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty
  2019-02-04 15:46     ` Cédric Le Goater
@ 2019-02-05  5:30       ` David Gibson
  0 siblings, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-02-05  5:30 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1451 bytes --]

On Mon, Feb 04, 2019 at 04:46:00PM +0100, Cédric Le Goater wrote:
> On 2/4/19 6:18 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 07:43:26PM +0100, Cédric Le Goater wrote:
> >> When the VM is stopped in a migration sequence, the sources are masked
> >> and the XIVE IC is synced to stabilize the EQs. When done, the KVM
> >> ioctl KVM_DEV_XIVE_SAVE_EQ_PAGES is called to mark dirty the EQ pages.
> >>
> >> The migration can then transfer the remaining dirty pages to the
> >> destination and start collecting the state of the devices.
> > 
> > Is there a reason to make this a separate step from the SYNC
> > operation?
> 
> Hmm, apart from letting QEMU orchestrate the migration step by step, no.
> 
> We could merge the SYNC and the SAVE_EQ_PAGES in a single KVM operation. 
> I think that should be fine.

I think that makes sense.  SYNC is supposed to complete delivery of
any in-flight interrupts, and to me writing to the queue page and
marking it dirty as a result is a logical part of that.

> However, it does not make sense to call this operation without the VM 
> being stopped. I wonder how this can checked from KVM. May be we
> can't.

I don't think it matters.  qemu is allowed to shoot itself in the
foot.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-04 16:07     ` Cédric Le Goater
@ 2019-02-05  5:32       ` David Gibson
  2019-02-05 13:03         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-05  5:32 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3391 bytes --]

On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
> On 2/4/19 6:21 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> >> Theses are use to capure the XIVE EAS table of the KVM device, the
> >> configuration of the source targets.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
> >>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
> >>  2 files changed, 98 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >> index 1a8740629acf..faf024f39858 100644
> >> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
> >>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
> >>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> >>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> >> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> >>  
> >>  /* Layout of 64-bit XIVE source attribute values */
> >>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> >>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> >>  
> >> +/* Layout of 64-bit eas attribute values */
> >> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> >> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> >> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> >> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> >> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> >> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> >> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> >> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> >> +
> >>  #endif /* __LINUX_KVM_POWERPC_H */
> >> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >> index f2de1bcf3b35..0468b605baa7 100644
> >> --- a/arch/powerpc/kvm/book3s_xive_native.c
> >> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
> >>  	return 0;
> >>  }
> >>  
> >> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> >> +				      u64 addr)
> > 
> > I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
> > values, but rather essentially the "source config" in the terminology
> > of the PAPR hcalls.  Which, yes, is basically implemented by setting
> > the EAS, but since it's the PAPR architected state that we need to
> > preserve across migration, I'd prefer to stick as close as we can to
> > the PAPR terminology.
> 
> But we don't have an equivalent name in the PAPR specs for the tuple 
> (prio, server). We could use the generic 'target' name may be ? even 
> if this is usually referring to a CPU number.

Um.. what?  That's about terminology for one of the fields in this
thing, not about the name for the thing itself.

> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
> This is was the former name in HW. I think we recycle it for KVM.

That's a terrible idea, which will make a confusing situation even
more confusing.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  2019-02-04 18:57     ` Cédric Le Goater
@ 2019-02-05  5:33       ` David Gibson
  2019-02-05 11:58         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-05  5:33 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3259 bytes --]

On Mon, Feb 04, 2019 at 07:57:26PM +0100, Cédric Le Goater wrote:
> On 2/4/19 6:26 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 08:10:04PM +0100, Cédric Le Goater wrote:
> >> At a VCPU level, the state of the thread context interrupt management
> >> registers needs to be collected. These registers are cached under the
> >> 'xive_saved_state.w01' field of the VCPU when the VPCU context is
> >> pulled from the HW thread. An OPAL call retrieves the backup of the
> >> IPB register in the NVT structure and merges it in the KVM state.
> >>
> >> The structures of the interface between QEMU and KVM provisions some
> >> extra room (two u64) for further extensions if more state needs to be
> >> transferred back to QEMU.
> >>
> >> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >> ---
> >>  arch/powerpc/include/asm/kvm_ppc.h    |  5 ++
> >>  arch/powerpc/include/uapi/asm/kvm.h   |  2 +
> >>  arch/powerpc/kvm/book3s.c             | 24 +++++++++
> >>  arch/powerpc/kvm/book3s_xive_native.c | 78 +++++++++++++++++++++++++++
> >>  4 files changed, 109 insertions(+)
> >>
> >> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >> index 4cc897039485..49c488af168c 100644
> >> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >> @@ -270,6 +270,7 @@ union kvmppc_one_reg {
> >>  		u64	addr;
> >>  		u64	length;
> >>  	}	vpaval;
> >> +	u64	xive_timaval[4];
> >>  };
> >>  
> >>  struct kvmppc_ops {
> >> @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> >>  extern void kvmppc_xive_native_init_module(void);
> >>  extern void kvmppc_xive_native_exit_module(void);
> >>  extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
> >> +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
> >> +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
> >>  
> >>  #else
> >>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
> >> @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { }
> >>  static inline void kvmppc_xive_native_exit_module(void) { }
> >>  static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
> >>  	{ return 0; }
> >> +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; }
> >> +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return -ENOENT; }
> > 
> > IIRC "VP" is the old name for "TCTX".  Since we're using tctx in the
> > rest of the XIVE code, can we use it here as well.
> 
> OK. The state we are getting or setting is indeed related to the thread 
> interrupt  context registers. 
> 
> The name VP is related to an identifier to some interrupt context under 
> OPAL (NVT in HW to be precise).

Oh, sorry, "NVT" was the name I was looking for, not "TCTX".  But in
any case, please lets standardize on one.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
  2019-02-04 19:07     ` Cédric Le Goater
@ 2019-02-05  5:35       ` David Gibson
  2019-02-05 13:39         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-05  5:35 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1692 bytes --]

On Mon, Feb 04, 2019 at 08:07:20PM +0100, Cédric Le Goater wrote:
> On 2/4/19 5:57 AM, David Gibson wrote:
> > On Mon, Jan 07, 2019 at 07:43:21PM +0100, Cédric Le Goater wrote:
[snip]
> >> +		sb = kvmppc_xive_create_src_block(xive, irq);
> >> +		if (!sb) {
> >> +			pr_err("Failed to create block...\n");
> >> +			return -ENOMEM;
> >> +		}
> >> +	}
> >> +	state = &sb->irq_state[idx];
> >> +
> >> +	if (get_user(val, ubufp)) {
> >> +		pr_err("fault getting user info !\n");
> >> +		return -EFAULT;
> >> +	}
> >> +
> >> +	/*
> >> +	 * If the source doesn't already have an IPI, allocate
> >> +	 * one and get the corresponding data
> >> +	 */
> >> +	if (!state->ipi_number) {
> >> +		state->ipi_number = xive_native_alloc_irq();
> >> +		if (state->ipi_number == 0) {
> >> +			pr_err("Failed to allocate IRQ !\n");
> >> +			return -ENOMEM;
> >> +		}
> > 
> > Am I right in thinking this is the point at which a specific guest irq
> > number gets bound to a specific host irq number?
> 
> yes. the XIVE IRQ state caches this information and 'state' should be 
> protected before being assigned, indeed ... The XICS-over-XIVE device
> also has the same race issue.
> 
> It's not showing because where initializing the KVM device sequentially
> from QEMU and only once.

Ok.

So, for the passthrough case, what's the point at which we know that a
particular guest interrupt needs to be bound to a specific real
hardware interrupt, rather than a generic IPI?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-04  5:36         ` David Gibson
@ 2019-02-05 11:31           ` Cédric Le Goater
  2019-02-05 22:13             ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-05 11:31 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, linuxppc-dev

>>> As for nesting, I suggest for the foreseeable future we stick to XICS
>>> emulation in nested guests.
>>
>> ok. so no kernel_irqchip at all. hmm. 

I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
device KVM uses when under a P9 processor. 

> That would certainly be step 0, making sure the capability advertises
> this correctly.  I think we do want to make XICs-on-XIVE emulation
> work in a KVM L1 (so we'd need to have it make XIVE hcalls to the L0
> instead of OPAL calls).

With the latest patch of Paul, the KVM XICS device is available for L2
and it works quite well. 

I also want to test it when L1 runs in KVM XIVE native mode, with the 
current patchset, to see how it behaves.

> XIVE-on-XIVE for L1 would be nice too, which would mean implementing
> the XIVE hcalls from the L2 in terms of XIVE hcalls to the L0.  I
> think it's ok to delay this indefinitely as long as the caps advertise
> correctly so that qemu will use userspace emulation until its ready.

ok. I need to fix this in the current patchset.

Thanks,

C. 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  2019-02-05  5:33       ` David Gibson
@ 2019-02-05 11:58         ` Cédric Le Goater
  2019-02-06  1:19           ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-05 11:58 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/5/19 6:33 AM, David Gibson wrote:
> On Mon, Feb 04, 2019 at 07:57:26PM +0100, Cédric Le Goater wrote:
>> On 2/4/19 6:26 AM, David Gibson wrote:
>>> On Mon, Jan 07, 2019 at 08:10:04PM +0100, Cédric Le Goater wrote:
>>>> At a VCPU level, the state of the thread context interrupt management
>>>> registers needs to be collected. These registers are cached under the
>>>> 'xive_saved_state.w01' field of the VCPU when the VPCU context is
>>>> pulled from the HW thread. An OPAL call retrieves the backup of the
>>>> IPB register in the NVT structure and merges it in the KVM state.
>>>>
>>>> The structures of the interface between QEMU and KVM provisions some
>>>> extra room (two u64) for further extensions if more state needs to be
>>>> transferred back to QEMU.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  arch/powerpc/include/asm/kvm_ppc.h    |  5 ++
>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  2 +
>>>>  arch/powerpc/kvm/book3s.c             | 24 +++++++++
>>>>  arch/powerpc/kvm/book3s_xive_native.c | 78 +++++++++++++++++++++++++++
>>>>  4 files changed, 109 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
>>>> index 4cc897039485..49c488af168c 100644
>>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
>>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
>>>> @@ -270,6 +270,7 @@ union kvmppc_one_reg {
>>>>  		u64	addr;
>>>>  		u64	length;
>>>>  	}	vpaval;
>>>> +	u64	xive_timaval[4];
>>>>  };
>>>>  
>>>>  struct kvmppc_ops {
>>>> @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
>>>>  extern void kvmppc_xive_native_init_module(void);
>>>>  extern void kvmppc_xive_native_exit_module(void);
>>>>  extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
>>>> +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
>>>> +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
>>>>  
>>>>  #else
>>>>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
>>>> @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { }
>>>>  static inline void kvmppc_xive_native_exit_module(void) { }
>>>>  static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
>>>>  	{ return 0; }
>>>> +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; }
>>>> +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return -ENOENT; }
>>>
>>> IIRC "VP" is the old name for "TCTX".  Since we're using tctx in the
>>> rest of the XIVE code, can we use it here as well.
>>
>> OK. The state we are getting or setting is indeed related to the thread 
>> interrupt  context registers. 
>>
>> The name VP is related to an identifier to some interrupt context under 
>> OPAL (NVT in HW to be precise).
> 
> Oh, sorry, "NVT" was the name I was looking for, not "TCTX".  But in
> any case, please lets standardize on one.

There is some confusion in the naming for :

 - VP    Virtual Processor (XIVE 1)
 - VPD   Virtual Processor Descriptor (XIVE 1)
 - TCTX  Thread interrupt context registers
 - NVT   Notify Virtual Target. Former VP. 
 - NVTS  Notify Virtual Target Structure. Where the TCTX regs are cached.


I am fine with using NVT because this is indeed the name of the XIVE 
structure where the HW caches the thread interrupt context registers.

But the XIVE native layer and the XICS-over-XIVE KVM device use the
name VP (the old one). I don't think we want to change these now.

C. 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-05  5:28       ` David Gibson
@ 2019-02-05 12:55         ` Cédric Le Goater
  2019-02-06  1:23           ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-05 12:55 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/5/19 6:28 AM, David Gibson wrote:
> On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
>> On 2/4/19 5:45 AM, David Gibson wrote:
>>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>>>> This will let the guest create a memory mapping to expose the ESB MMIO
>>>> regions used to control the interrupt sources, to trigger events, to
>>>> EOI or to turn off the sources.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>>>>  2 files changed, 101 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>> index 8c876c166ef2..6bb61ba141c2 100644
>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>>>  
>>>> +/* POWER9 XIVE Native Interrupt Controller */
>>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
>>>
>>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
>>> to both with an mmap() directly on the xive device fd?  Using the
>>> offset to distinguish which one to map, obviously.
>>
>> The page offset would define some sort of user API. It seems feasible.
>> But I am not sure this would be practical in the future if we need to 
>> tune the length.
> 
> Um.. why not?  I mean, yes the XIVE supports rather a lot of
> interrupts, but we have 64-bits of offset we can play with - we can
> leave room for billions of ESB slots and still have room for billions
> of VPs.

So the first 4 pages could be the TIMA pages and then would come  
the pages for the interrupt ESBs. I think that we can have different 
vm_fault handler for each mapping.
 
I wonder how this will work out with pass-through. As Paul said in 
a previous email, it would be better to let QEMU request a new 
mapping to handle the ESB pages of the device being passed through.
I guess this is not a special case, just another offset and length.

I will give it a try.

Thanks,

C. 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-05  5:32       ` David Gibson
@ 2019-02-05 13:03         ` Cédric Le Goater
  2019-02-06  1:23           ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-05 13:03 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/5/19 6:32 AM, David Gibson wrote:
> On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
>> On 2/4/19 6:21 AM, David Gibson wrote:
>>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
>>>> Theses are use to capure the XIVE EAS table of the KVM device, the
>>>> configuration of the source targets.
>>>>
>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>> ---
>>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
>>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
>>>>  2 files changed, 98 insertions(+)
>>>>
>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>> index 1a8740629acf..faf024f39858 100644
>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>>>>  
>>>>  /* Layout of 64-bit XIVE source attribute values */
>>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>>>  
>>>> +/* Layout of 64-bit eas attribute values */
>>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
>>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
>>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
>>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
>>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
>>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
>>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
>>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>>>> +
>>>>  #endif /* __LINUX_KVM_POWERPC_H */
>>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>>>> index f2de1bcf3b35..0468b605baa7 100644
>>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>>>>  	return 0;
>>>>  }
>>>>  
>>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
>>>> +				      u64 addr)
>>>
>>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
>>> values, but rather essentially the "source config" in the terminology
>>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
>>> the EAS, but since it's the PAPR architected state that we need to
>>> preserve across migration, I'd prefer to stick as close as we can to
>>> the PAPR terminology.
>>
>> But we don't have an equivalent name in the PAPR specs for the tuple 
>> (prio, server). We could use the generic 'target' name may be ? even 
>> if this is usually referring to a CPU number.
> 
> Um.. what?  That's about terminology for one of the fields in this
> thing, not about the name for the thing itself.
> 
>> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
>> This is was the former name in HW. I think we recycle it for KVM.
> 
> That's a terrible idea, which will make a confusing situation even
> more confusing.

Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
similar to the hcalls anyhow.

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE control to the XIVE native device
  2019-02-05  5:35       ` David Gibson
@ 2019-02-05 13:39         ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-05 13:39 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/5/19 6:35 AM, David Gibson wrote:
> On Mon, Feb 04, 2019 at 08:07:20PM +0100, Cédric Le Goater wrote:
>> On 2/4/19 5:57 AM, David Gibson wrote:
>>> On Mon, Jan 07, 2019 at 07:43:21PM +0100, Cédric Le Goater wrote:
> [snip]
>>>> +		sb = kvmppc_xive_create_src_block(xive, irq);
>>>> +		if (!sb) {
>>>> +			pr_err("Failed to create block...\n");
>>>> +			return -ENOMEM;
>>>> +		}
>>>> +	}
>>>> +	state = &sb->irq_state[idx];
>>>> +
>>>> +	if (get_user(val, ubufp)) {
>>>> +		pr_err("fault getting user info !\n");
>>>> +		return -EFAULT;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * If the source doesn't already have an IPI, allocate
>>>> +	 * one and get the corresponding data
>>>> +	 */
>>>> +	if (!state->ipi_number) {
>>>> +		state->ipi_number = xive_native_alloc_irq();
>>>> +		if (state->ipi_number == 0) {
>>>> +			pr_err("Failed to allocate IRQ !\n");
>>>> +			return -ENOMEM;
>>>> +		}
>>>
>>> Am I right in thinking this is the point at which a specific guest irq
>>> number gets bound to a specific host irq number?
>>
>> yes. the XIVE IRQ state caches this information and 'state' should be 
>> protected before being assigned, indeed ... The XICS-over-XIVE device
>> also has the same race issue.
>>
>> It's not showing because where initializing the KVM device sequentially
>> from QEMU and only once.
> 
> Ok.
> 
> So, for the passthrough case, what's the point at which we know that a
> particular guest interrupt needs to be bound to a specific real
> hardware interrupt, rather than a generic IPI?

when the guest driver requests MSIs, VFIO requests a mapping of the 
HW irqs in the guest IRQ space. This is very briefly said as VFIO is 
a huge framework. 

Patch 18 adds some initial support to handle the ESB pages but this 
should be done at the QEMU level.

C. 

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration
  2019-02-04  5:24   ` David Gibson
@ 2019-02-05 17:45     ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-05 17:45 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/4/19 6:24 AM, David Gibson wrote:
> On Mon, Jan 07, 2019 at 07:43:28PM +0100, Cédric Le Goater wrote:
>> These are used to capture the XIVE END table of the KVM device. It
>> relies on an OPAL call to retrieve from the XIVE IC the EQ toggle bit
>> and index which are updated by the HW when events are enqueued in the
>> guest RAM.
>>
>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>> ---
>>  arch/powerpc/include/uapi/asm/kvm.h   |  21 ++++
>>  arch/powerpc/kvm/book3s_xive_native.c | 166 ++++++++++++++++++++++++++
>>  2 files changed, 187 insertions(+)
>>
>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>> index faf024f39858..95302558ce10 100644
>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>> @@ -684,6 +684,7 @@ struct kvm_ppc_cpu_char {
>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>  #define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>> +#define KVM_DEV_XIVE_GRP_EQ		5	/* 64-bit eq attributes */
>>  
>>  /* Layout of 64-bit XIVE source attribute values */
>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>> @@ -699,4 +700,24 @@ struct kvm_ppc_cpu_char {
>>  #define KVM_XIVE_EAS_EISN_SHIFT		33
>>  #define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>>  
>> +/* Layout of 64-bit eq attribute */
>> +#define KVM_XIVE_EQ_PRIORITY_SHIFT	0
>> +#define KVM_XIVE_EQ_PRIORITY_MASK	0x7
>> +#define KVM_XIVE_EQ_SERVER_SHIFT	3
>> +#define KVM_XIVE_EQ_SERVER_MASK		0xfffffff8ULL
>> +
>> +/* Layout of 64-bit eq attribute values */
>> +struct kvm_ppc_xive_eq {
>> +	__u32 flags;
>> +	__u32 qsize;
>> +	__u64 qpage;
>> +	__u32 qtoggle;
>> +	__u32 qindex;
> 
> Should we pad this in case a) we discover some fields in the EQ that
> we thought weren't relevant to the guest actually are or b) future
> XIVE extensions add something we need to migrate.

The underlying XIVE structure is composed of 32bytes. I will double the
size.

Thanks,

C.


> 
>> +};
>> +
>> +#define KVM_XIVE_EQ_FLAG_ENABLED	0x00000001
>> +#define KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY	0x00000002
>> +#define KVM_XIVE_EQ_FLAG_ESCALATE	0x00000004
>> +
>> +
>>  #endif /* __LINUX_KVM_POWERPC_H */
>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>> index 0468b605baa7..f4eb71eafc57 100644
>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>> @@ -607,6 +607,164 @@ static int kvmppc_xive_native_get_eas(struct kvmppc_xive *xive, long irq,
>>  	return 0;
>>  }
>>  
>> +static int kvmppc_xive_native_set_queue(struct kvmppc_xive *xive, long eq_idx,
>> +				      u64 addr)
>> +{
>> +	struct kvm *kvm = xive->kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	struct kvmppc_xive_vcpu *xc;
>> +	void __user *ubufp = (u64 __user *) addr;
>> +	u32 server;
>> +	u8 priority;
>> +	struct kvm_ppc_xive_eq kvm_eq;
>> +	int rc;
>> +	__be32 *qaddr = 0;
>> +	struct page *page;
>> +	struct xive_q *q;
>> +
>> +	/*
>> +	 * Demangle priority/server tuple from the EQ index
>> +	 */
>> +	priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
>> +		KVM_XIVE_EQ_PRIORITY_SHIFT;
>> +	server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
>> +		KVM_XIVE_EQ_SERVER_SHIFT;
>> +
>> +	if (copy_from_user(&kvm_eq, ubufp, sizeof(kvm_eq)))
>> +		return -EFAULT;
>> +
>> +	vcpu = kvmppc_xive_find_server(kvm, server);
>> +	if (!vcpu) {
>> +		pr_err("Can't find server %d\n", server);
>> +		return -ENOENT;
>> +	}
>> +	xc = vcpu->arch.xive_vcpu;
>> +
>> +	if (priority != xive_prio_from_guest(priority)) {
>> +		pr_err("Trying to restore invalid queue %d for VCPU %d\n",
>> +		       priority, server);
>> +		return -EINVAL;
>> +	}
>> +	q = &xc->queues[priority];
>> +
>> +	pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
>> +		 __func__, server, priority, kvm_eq.flags,
>> +		 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
>> +
>> +	rc = xive_native_validate_queue_size(kvm_eq.qsize);
>> +	if (rc || !kvm_eq.qsize) {
>> +		pr_err("invalid queue size %d\n", kvm_eq.qsize);
>> +		return rc;
>> +	}
>> +
>> +	page = gfn_to_page(kvm, gpa_to_gfn(kvm_eq.qpage));
>> +	if (is_error_page(page)) {
>> +		pr_warn("Couldn't get guest page for %llx!\n", kvm_eq.qpage);
>> +		return -ENOMEM;
>> +	}
>> +	qaddr = page_to_virt(page) + (kvm_eq.qpage & ~PAGE_MASK);
>> +
>> +	/* Backup queue page guest address for migration */
>> +	q->guest_qpage = kvm_eq.qpage;
>> +	q->guest_qsize = kvm_eq.qsize;
>> +
>> +	rc = xive_native_configure_queue(xc->vp_id, q, priority,
>> +					 (__be32 *) qaddr, kvm_eq.qsize, true);
>> +	if (rc) {
>> +		pr_err("Failed to configure queue %d for VCPU %d: %d\n",
>> +		       priority, xc->server_num, rc);
>> +		put_page(page);
>> +		return rc;
>> +	}
>> +
>> +	rc = xive_native_set_queue_state(xc->vp_id, priority, kvm_eq.qtoggle,
>> +					 kvm_eq.qindex);
>> +	if (rc)
>> +		goto error;
>> +
>> +	rc = kvmppc_xive_attach_escalation(vcpu, priority);
>> +error:
>> +	if (rc)
>> +		xive_native_cleanup_queue(vcpu, priority);
>> +	return rc;
>> +}
>> +
>> +static int kvmppc_xive_native_get_queue(struct kvmppc_xive *xive, long eq_idx,
>> +				      u64 addr)
>> +{
>> +	struct kvm *kvm = xive->kvm;
>> +	struct kvm_vcpu *vcpu;
>> +	struct kvmppc_xive_vcpu *xc;
>> +	struct xive_q *q;
>> +	void __user *ubufp = (u64 __user *) addr;
>> +	u32 server;
>> +	u8 priority;
>> +	struct kvm_ppc_xive_eq kvm_eq;
>> +	u64 qpage;
>> +	u64 qsize;
>> +	u64 qeoi_page;
>> +	u32 escalate_irq;
>> +	u64 qflags;
>> +	int rc;
>> +
>> +	/*
>> +	 * Demangle priority/server tuple from the EQ index
>> +	 */
>> +	priority = (eq_idx & KVM_XIVE_EQ_PRIORITY_MASK) >>
>> +		KVM_XIVE_EQ_PRIORITY_SHIFT;
>> +	server = (eq_idx & KVM_XIVE_EQ_SERVER_MASK) >>
>> +		KVM_XIVE_EQ_SERVER_SHIFT;
>> +
>> +	vcpu = kvmppc_xive_find_server(kvm, server);
>> +	if (!vcpu) {
>> +		pr_err("Can't find server %d\n", server);
>> +		return -ENOENT;
>> +	}
>> +	xc = vcpu->arch.xive_vcpu;
>> +
>> +	if (priority != xive_prio_from_guest(priority)) {
>> +		pr_err("invalid priority for queue %d for VCPU %d\n",
>> +		       priority, server);
>> +		return -EINVAL;
>> +	}
>> +	q = &xc->queues[priority];
>> +
>> +	memset(&kvm_eq, 0, sizeof(kvm_eq));
>> +
>> +	if (!q->qpage)
>> +		return 0;
>> +
>> +	rc = xive_native_get_queue_info(xc->vp_id, priority, &qpage, &qsize,
>> +					&qeoi_page, &escalate_irq, &qflags);
>> +	if (rc)
>> +		return rc;
>> +
>> +	kvm_eq.flags = 0;
>> +	if (qflags & OPAL_XIVE_EQ_ENABLED)
>> +		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ENABLED;
>> +	if (qflags & OPAL_XIVE_EQ_ALWAYS_NOTIFY)
>> +		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ALWAYS_NOTIFY;
>> +	if (qflags & OPAL_XIVE_EQ_ESCALATE)
>> +		kvm_eq.flags |= KVM_XIVE_EQ_FLAG_ESCALATE;
>> +
>> +	kvm_eq.qsize = q->guest_qsize;
>> +	kvm_eq.qpage = q->guest_qpage;
>> +
>> +	rc = xive_native_get_queue_state(xc->vp_id, priority, &kvm_eq.qtoggle,
>> +					 &kvm_eq.qindex);
>> +	if (rc)
>> +		return rc;
>> +
>> +	pr_devel("%s VCPU %d priority %d fl:%x sz:%d addr:%llx g:%d idx:%d\n",
>> +		 __func__, server, priority, kvm_eq.flags,
>> +		 kvm_eq.qsize, kvm_eq.qpage, kvm_eq.qtoggle, kvm_eq.qindex);
>> +
>> +	if (copy_to_user(ubufp, &kvm_eq, sizeof(kvm_eq)))
>> +		return -EFAULT;
>> +
>> +	return 0;
>> +}
>> +
>>  static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  				       struct kvm_device_attr *attr)
>>  {
>> @@ -628,6 +786,9 @@ static int kvmppc_xive_native_set_attr(struct kvm_device *dev,
>>  		return kvmppc_xive_native_sync(xive, attr->attr, attr->addr);
>>  	case KVM_DEV_XIVE_GRP_EAS:
>>  		return kvmppc_xive_native_set_eas(xive, attr->attr, attr->addr);
>> +	case KVM_DEV_XIVE_GRP_EQ:
>> +		return kvmppc_xive_native_set_queue(xive, attr->attr,
>> +						    attr->addr);
>>  	}
>>  	return -ENXIO;
>>  }
>> @@ -650,6 +811,9 @@ static int kvmppc_xive_native_get_attr(struct kvm_device *dev,
>>  		break;
>>  	case KVM_DEV_XIVE_GRP_EAS:
>>  		return kvmppc_xive_native_get_eas(xive, attr->attr, attr->addr);
>> +	case KVM_DEV_XIVE_GRP_EQ:
>> +		return kvmppc_xive_native_get_queue(xive, attr->attr,
>> +						    attr->addr);
>>  	}
>>  	return -ENXIO;
>>  }
>> @@ -674,6 +838,8 @@ static int kvmppc_xive_native_has_attr(struct kvm_device *dev,
>>  		    attr->attr < KVMPPC_XIVE_NR_IRQS)
>>  			return 0;
>>  		break;
>> +	case KVM_DEV_XIVE_GRP_EQ:
>> +		return 0;
>>  	}
>>  	return -ENXIO;
>>  }
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-05 11:31           ` Cédric Le Goater
@ 2019-02-05 22:13             ` Paul Mackerras
  2019-02-06  1:18               ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-02-05 22:13 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> >>> As for nesting, I suggest for the foreseeable future we stick to XICS
> >>> emulation in nested guests.
> >>
> >> ok. so no kernel_irqchip at all. hmm. 
> 
> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> device KVM uses when under a P9 processor. 

Actually there are two separate implementations of XICS emulation in
KVM.  The first (older) one is almost entirely a software emulation
but does have some cases where it accesses an underlying XICS device
in order to make some things faster (IPIs and pass-through of a device
interrupt to a guest).  The other, newer one is the XICS-on-XIVE
emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
My patch was about making the the older code work when there is no
XICS available to the host.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-05 22:13             ` Paul Mackerras
@ 2019-02-06  1:18               ` David Gibson
  2019-02-06  7:35                 ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-06  1:18 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, Cédric Le Goater, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 1547 bytes --]

On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> > >>> As for nesting, I suggest for the foreseeable future we stick to XICS
> > >>> emulation in nested guests.
> > >>
> > >> ok. so no kernel_irqchip at all. hmm. 
> > 
> > I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> > XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> > device KVM uses when under a P9 processor. 
> 
> Actually there are two separate implementations of XICS emulation in
> KVM.  The first (older) one is almost entirely a software emulation
> but does have some cases where it accesses an underlying XICS device
> in order to make some things faster (IPIs and pass-through of a device
> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
> My patch was about making the the older code work when there is no
> XICS available to the host.

Ah, right.  To clarify my earlier statements in light of this:

 * We definitely want some sort of kernel-XICS available in a nested
   guest.  AIUI, this is now accomplished, so, Yay!

 * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
   bonus, but it's a much lower priority.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state
  2019-02-05 11:58         ` Cédric Le Goater
@ 2019-02-06  1:19           ` David Gibson
  0 siblings, 0 replies; 135+ messages in thread
From: David Gibson @ 2019-02-06  1:19 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4284 bytes --]

On Tue, Feb 05, 2019 at 12:58:54PM +0100, Cédric Le Goater wrote:
> On 2/5/19 6:33 AM, David Gibson wrote:
> > On Mon, Feb 04, 2019 at 07:57:26PM +0100, Cédric Le Goater wrote:
> >> On 2/4/19 6:26 AM, David Gibson wrote:
> >>> On Mon, Jan 07, 2019 at 08:10:04PM +0100, Cédric Le Goater wrote:
> >>>> At a VCPU level, the state of the thread context interrupt management
> >>>> registers needs to be collected. These registers are cached under the
> >>>> 'xive_saved_state.w01' field of the VCPU when the VPCU context is
> >>>> pulled from the HW thread. An OPAL call retrieves the backup of the
> >>>> IPB register in the NVT structure and merges it in the KVM state.
> >>>>
> >>>> The structures of the interface between QEMU and KVM provisions some
> >>>> extra room (two u64) for further extensions if more state needs to be
> >>>> transferred back to QEMU.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  arch/powerpc/include/asm/kvm_ppc.h    |  5 ++
> >>>>  arch/powerpc/include/uapi/asm/kvm.h   |  2 +
> >>>>  arch/powerpc/kvm/book3s.c             | 24 +++++++++
> >>>>  arch/powerpc/kvm/book3s_xive_native.c | 78 +++++++++++++++++++++++++++
> >>>>  4 files changed, 109 insertions(+)
> >>>>
> >>>> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> index 4cc897039485..49c488af168c 100644
> >>>> --- a/arch/powerpc/include/asm/kvm_ppc.h
> >>>> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> >>>> @@ -270,6 +270,7 @@ union kvmppc_one_reg {
> >>>>  		u64	addr;
> >>>>  		u64	length;
> >>>>  	}	vpaval;
> >>>> +	u64	xive_timaval[4];
> >>>>  };
> >>>>  
> >>>>  struct kvmppc_ops {
> >>>> @@ -603,6 +604,8 @@ extern void kvmppc_xive_native_cleanup_vcpu(struct kvm_vcpu *vcpu);
> >>>>  extern void kvmppc_xive_native_init_module(void);
> >>>>  extern void kvmppc_xive_native_exit_module(void);
> >>>>  extern int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd);
> >>>> +extern int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
> >>>> +extern int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val);
> >>>>  
> >>>>  #else
> >>>>  static inline int kvmppc_xive_set_xive(struct kvm *kvm, u32 irq, u32 server,
> >>>> @@ -637,6 +640,8 @@ static inline void kvmppc_xive_native_init_module(void) { }
> >>>>  static inline void kvmppc_xive_native_exit_module(void) { }
> >>>>  static inline int kvmppc_xive_native_hcall(struct kvm_vcpu *vcpu, u32 cmd)
> >>>>  	{ return 0; }
> >>>> +static inline int kvmppc_xive_native_get_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return 0; }
> >>>> +static inline int kvmppc_xive_native_set_vp(struct kvm_vcpu *vcpu, union kvmppc_one_reg *val) { return -ENOENT; }
> >>>
> >>> IIRC "VP" is the old name for "TCTX".  Since we're using tctx in the
> >>> rest of the XIVE code, can we use it here as well.
> >>
> >> OK. The state we are getting or setting is indeed related to the thread 
> >> interrupt  context registers. 
> >>
> >> The name VP is related to an identifier to some interrupt context under 
> >> OPAL (NVT in HW to be precise).
> > 
> > Oh, sorry, "NVT" was the name I was looking for, not "TCTX".  But in
> > any case, please lets standardize on one.
> 
> There is some confusion in the naming for :
> 
>  - VP    Virtual Processor (XIVE 1)
>  - VPD   Virtual Processor Descriptor (XIVE 1)
>  - TCTX  Thread interrupt context registers
>  - NVT   Notify Virtual Target. Former VP. 
>  - NVTS  Notify Virtual Target Structure. Where the TCTX regs are cached.
> 
> 
> I am fine with using NVT because this is indeed the name of the XIVE 
> structure where the HW caches the thread interrupt context registers.
> 
> But the XIVE native layer and the XICS-over-XIVE KVM device use the
> name VP (the old one). I don't think we want to change these now.

Ah, right.  It now occurs to me that the place I've already seen NVT
used is in the qemu code, whereas this is kernel.  In that case
sticking to VP here makes sense.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-05 12:55         ` Cédric Le Goater
@ 2019-02-06  1:23           ` David Gibson
  2019-02-06  7:21             ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-06  1:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2877 bytes --]

On Tue, Feb 05, 2019 at 01:55:40PM +0100, Cédric Le Goater wrote:
> On 2/5/19 6:28 AM, David Gibson wrote:
> > On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
> >> On 2/4/19 5:45 AM, David Gibson wrote:
> >>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> >>>> This will let the guest create a memory mapping to expose the ESB MMIO
> >>>> regions used to control the interrupt sources, to trigger events, to
> >>>> EOI or to turn off the sources.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
> >>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
> >>>>  2 files changed, 101 insertions(+)
> >>>>
> >>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >>>> index 8c876c166ef2..6bb61ba141c2 100644
> >>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
> >>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
> >>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
> >>>>  
> >>>> +/* POWER9 XIVE Native Interrupt Controller */
> >>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
> >>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> >>>
> >>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
> >>> to both with an mmap() directly on the xive device fd?  Using the
> >>> offset to distinguish which one to map, obviously.
> >>
> >> The page offset would define some sort of user API. It seems feasible.
> >> But I am not sure this would be practical in the future if we need to 
> >> tune the length.
> > 
> > Um.. why not?  I mean, yes the XIVE supports rather a lot of
> > interrupts, but we have 64-bits of offset we can play with - we can
> > leave room for billions of ESB slots and still have room for billions
> > of VPs.
> 
> So the first 4 pages could be the TIMA pages and then would come  
> the pages for the interrupt ESBs. I think that we can have different 
> vm_fault handler for each mapping.

Um.. no, I'm saying you don't need to tightly pack them.  So you could
have the ESB pages at 0, the TIMA at, say offset 2^60.

> I wonder how this will work out with pass-through. As Paul said in 
> a previous email, it would be better to let QEMU request a new 
> mapping to handle the ESB pages of the device being passed through.
> I guess this is not a special case, just another offset and length.

Right, if we need multiple "chunks" of ESB pages we can given them
each their own terabyte or several.  No need to be stingy with address
space.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-05 13:03         ` Cédric Le Goater
@ 2019-02-06  1:23           ` David Gibson
  2019-02-06  1:24             ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-06  1:23 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3798 bytes --]

On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
> On 2/5/19 6:32 AM, David Gibson wrote:
> > On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
> >> On 2/4/19 6:21 AM, David Gibson wrote:
> >>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> >>>> Theses are use to capure the XIVE EAS table of the KVM device, the
> >>>> configuration of the source targets.
> >>>>
> >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>> ---
> >>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
> >>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
> >>>>  2 files changed, 98 insertions(+)
> >>>>
> >>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >>>> index 1a8740629acf..faf024f39858 100644
> >>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
> >>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
> >>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> >>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> >>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> >>>>  
> >>>>  /* Layout of 64-bit XIVE source attribute values */
> >>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> >>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> >>>>  
> >>>> +/* Layout of 64-bit eas attribute values */
> >>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> >>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> >>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> >>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> >>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> >>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> >>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> >>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> >>>> +
> >>>>  #endif /* __LINUX_KVM_POWERPC_H */
> >>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >>>> index f2de1bcf3b35..0468b605baa7 100644
> >>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
> >>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
> >>>>  	return 0;
> >>>>  }
> >>>>  
> >>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> >>>> +				      u64 addr)
> >>>
> >>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
> >>> values, but rather essentially the "source config" in the terminology
> >>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
> >>> the EAS, but since it's the PAPR architected state that we need to
> >>> preserve across migration, I'd prefer to stick as close as we can to
> >>> the PAPR terminology.
> >>
> >> But we don't have an equivalent name in the PAPR specs for the tuple 
> >> (prio, server). We could use the generic 'target' name may be ? even 
> >> if this is usually referring to a CPU number.
> > 
> > Um.. what?  That's about terminology for one of the fields in this
> > thing, not about the name for the thing itself.
> > 
> >> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
> >> This is was the former name in HW. I think we recycle it for KVM.
> > 
> > That's a terrible idea, which will make a confusing situation even
> > more confusing.
> 
> Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
> similar to the hcalls anyhow.

Yes, I think that's a good idea.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-06  1:23           ` David Gibson
@ 2019-02-06  1:24             ` David Gibson
  2019-02-06  7:07               ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-06  1:24 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4277 bytes --]

On Wed, Feb 06, 2019 at 12:23:29PM +1100, David Gibson wrote:
> On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
> > On 2/5/19 6:32 AM, David Gibson wrote:
> > > On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
> > >> On 2/4/19 6:21 AM, David Gibson wrote:
> > >>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> > >>>> Theses are use to capure the XIVE EAS table of the KVM device, the
> > >>>> configuration of the source targets.
> > >>>>
> > >>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> > >>>> ---
> > >>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
> > >>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
> > >>>>  2 files changed, 98 insertions(+)
> > >>>>
> > >>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> > >>>> index 1a8740629acf..faf024f39858 100644
> > >>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> > >>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> > >>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
> > >>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
> > >>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> > >>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> > >>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> > >>>>  
> > >>>>  /* Layout of 64-bit XIVE source attribute values */
> > >>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> > >>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> > >>>>  
> > >>>> +/* Layout of 64-bit eas attribute values */
> > >>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> > >>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> > >>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> > >>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> > >>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> > >>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> > >>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> > >>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> > >>>> +
> > >>>>  #endif /* __LINUX_KVM_POWERPC_H */
> > >>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> > >>>> index f2de1bcf3b35..0468b605baa7 100644
> > >>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
> > >>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> > >>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
> > >>>>  	return 0;
> > >>>>  }
> > >>>>  
> > >>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> > >>>> +				      u64 addr)
> > >>>
> > >>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
> > >>> values, but rather essentially the "source config" in the terminology
> > >>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
> > >>> the EAS, but since it's the PAPR architected state that we need to
> > >>> preserve across migration, I'd prefer to stick as close as we can to
> > >>> the PAPR terminology.
> > >>
> > >> But we don't have an equivalent name in the PAPR specs for the tuple 
> > >> (prio, server). We could use the generic 'target' name may be ? even 
> > >> if this is usually referring to a CPU number.
> > > 
> > > Um.. what?  That's about terminology for one of the fields in this
> > > thing, not about the name for the thing itself.
> > > 
> > >> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
> > >> This is was the former name in HW. I think we recycle it for KVM.
> > > 
> > > That's a terrible idea, which will make a confusing situation even
> > > more confusing.
> > 
> > Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
> > similar to the hcalls anyhow.
> 
> Yes, I think that's a good idea.

Actually... AIUI the SET_CONFIG hcalls shouldn't be a fast path.  Can
we simplify things further by removing the hcall implementation from
the kernel entirely, and have qemu implement them by basically just
forwarding them to the appropriate SET_CONFIG ioctl()?

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-06  1:24             ` David Gibson
@ 2019-02-06  7:07               ` Cédric Le Goater
  2019-02-07  2:48                 ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-06  7:07 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/6/19 2:24 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 12:23:29PM +1100, David Gibson wrote:
>> On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
>>> On 2/5/19 6:32 AM, David Gibson wrote:
>>>> On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
>>>>> On 2/4/19 6:21 AM, David Gibson wrote:
>>>>>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
>>>>>>> Theses are use to capure the XIVE EAS table of the KVM device, the
>>>>>>> configuration of the source targets.
>>>>>>>
>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>> ---
>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
>>>>>>>  2 files changed, 98 insertions(+)
>>>>>>>
>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>> index 1a8740629acf..faf024f39858 100644
>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>>>>>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>>>>>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>>>>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>>>>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>>>>>>>  
>>>>>>>  /* Layout of 64-bit XIVE source attribute values */
>>>>>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>>>>>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>>>>>>  
>>>>>>> +/* Layout of 64-bit eas attribute values */
>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
>>>>>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
>>>>>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
>>>>>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
>>>>>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
>>>>>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
>>>>>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>>>>>>> +
>>>>>>>  #endif /* __LINUX_KVM_POWERPC_H */
>>>>>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>> index f2de1bcf3b35..0468b605baa7 100644
>>>>>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>>>>>>>  	return 0;
>>>>>>>  }
>>>>>>>  
>>>>>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
>>>>>>> +				      u64 addr)
>>>>>>
>>>>>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
>>>>>> values, but rather essentially the "source config" in the terminology
>>>>>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
>>>>>> the EAS, but since it's the PAPR architected state that we need to
>>>>>> preserve across migration, I'd prefer to stick as close as we can to
>>>>>> the PAPR terminology.
>>>>>
>>>>> But we don't have an equivalent name in the PAPR specs for the tuple 
>>>>> (prio, server). We could use the generic 'target' name may be ? even 
>>>>> if this is usually referring to a CPU number.
>>>>
>>>> Um.. what?  That's about terminology for one of the fields in this
>>>> thing, not about the name for the thing itself.
>>>>
>>>>> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
>>>>> This is was the former name in HW. I think we recycle it for KVM.
>>>>
>>>> That's a terrible idea, which will make a confusing situation even
>>>> more confusing.
>>>
>>> Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
>>> similar to the hcalls anyhow.
>>
>> Yes, I think that's a good idea.
> 
> Actually... AIUI the SET_CONFIG hcalls shouldn't be a fast path.  

No indeed. I have move them to standard hcalls in the current version.

> Can
> we simplify things further by removing the hcall implementation from
> the kernel entirely, and have qemu implement them by basically just
> forwarding them to the appropriate SET_CONFIG ioctl()?

Yes. I think we could. 

The hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG and 
the KVM ioctls to set the EQ and the SOURCE configuration have a 
lot in common. I need to look at how we can plug the KVM ioctl in 
the hcalls under QEMU.

We will have to convert the returned error to respect the PAPR 
specs or have the ioctls return H_* errors.


Let's dig that idea. If we choose that path, QEMU will have an 
up-to-date EAT and so we won't need to synchronize its state anymore 
for migration.
 
H_INT_GET_SOURCE_CONFIG can be implemented in QEMU without any KVM 
ioctl.

H_INT_GET_QUEUE_INFO could be implemented in QEMU. I need to check 
how we return the address of the END ESB in sPAPR. We haven't paid 
much attention to these pages because they are not used under Linux
and today the address is returned by OPAL. 

H_INT_GET_QUEUE_CONFIG is a little more problematic because we need
to query into the XIVE HW the EQ index and toggle bit. OPAL support
is required for that. But we could reduce the KVM support to the 
ioctl querying these EQ information.

H_INT_ESB could be entirely done under QEMU.

H_INT_SYNC and H_INT_RESET can not.

H_INT_GET_OS_REPORTING_LINE and H_INT_SET_OS_REPORTING_LINE are not
implemented.

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-06  1:23           ` David Gibson
@ 2019-02-06  7:21             ` Cédric Le Goater
  2019-02-07  2:49               ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-06  7:21 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/6/19 2:23 AM, David Gibson wrote:
> On Tue, Feb 05, 2019 at 01:55:40PM +0100, Cédric Le Goater wrote:
>> On 2/5/19 6:28 AM, David Gibson wrote:
>>> On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
>>>> On 2/4/19 5:45 AM, David Gibson wrote:
>>>>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>>>>>> This will let the guest create a memory mapping to expose the ESB MMIO
>>>>>> regions used to control the interrupt sources, to trigger events, to
>>>>>> EOI or to turn off the sources.
>>>>>>
>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>> ---
>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>>>>>>  2 files changed, 101 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>> index 8c876c166ef2..6bb61ba141c2 100644
>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>>>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>>>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>>>>>  
>>>>>> +/* POWER9 XIVE Native Interrupt Controller */
>>>>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>>>>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
>>>>>
>>>>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
>>>>> to both with an mmap() directly on the xive device fd?  Using the
>>>>> offset to distinguish which one to map, obviously.
>>>>
>>>> The page offset would define some sort of user API. It seems feasible.
>>>> But I am not sure this would be practical in the future if we need to 
>>>> tune the length.
>>>
>>> Um.. why not?  I mean, yes the XIVE supports rather a lot of
>>> interrupts, but we have 64-bits of offset we can play with - we can
>>> leave room for billions of ESB slots and still have room for billions
>>> of VPs.
>>
>> So the first 4 pages could be the TIMA pages and then would come  
>> the pages for the interrupt ESBs. I think that we can have different 
>> vm_fault handler for each mapping.
> 
> Um.. no, I'm saying you don't need to tightly pack them.  So you could
> have the ESB pages at 0, the TIMA at, say offset 2^60.

Well, we know that the TIMA is 4 pages wide and is "directly" related
with the KVM interrupt device. So being at offset 0 seems a good idea.
While the ESB segment is of a variable size depending on the number
of IRQs and it can come after I think.

>> I wonder how this will work out with pass-through. As Paul said in 
>> a previous email, it would be better to let QEMU request a new 
>> mapping to handle the ESB pages of the device being passed through.
>> I guess this is not a special case, just another offset and length.
> 
> Right, if we need multiple "chunks" of ESB pages we can given them
> each their own terabyte or several.  No need to be stingy with address
> space.

You can not put them anywhere. They should map the same interrupt range
of ESB pages, overlapping with the underlying segment of IPI ESB pages. 

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-06  1:18               ` David Gibson
@ 2019-02-06  7:35                 ` Cédric Le Goater
  2019-02-07  2:51                   ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-06  7:35 UTC (permalink / raw)
  To: David Gibson, Paul Mackerras; +Cc: linuxppc-dev, kvm, kvm-ppc

On 2/6/19 2:18 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
>> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
>>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
>>>>>> emulation in nested guests.
>>>>>
>>>>> ok. so no kernel_irqchip at all. hmm. 
>>>
>>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
>>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
>>> device KVM uses when under a P9 processor. 
>>
>> Actually there are two separate implementations of XICS emulation in
>> KVM.  The first (older) one is almost entirely a software emulation
>> but does have some cases where it accesses an underlying XICS device
>> in order to make some things faster (IPIs and pass-through of a device
>> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
>> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
>> My patch was about making the the older code work when there is no
>> XICS available to the host.
> 
> Ah, right.  To clarify my earlier statements in light of this:
> 
>  * We definitely want some sort of kernel-XICS available in a nested
>    guest.  AIUI, this is now accomplished, so, Yay!
> 
>  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
>    bonus, but it's a much lower priority.

Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
to QEMU which will restrict CAS to the XICS only interrupt mode.

C.



^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-06  7:07               ` Cédric Le Goater
@ 2019-02-07  2:48                 ` David Gibson
  2019-02-07  9:13                   ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-07  2:48 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6395 bytes --]

On Wed, Feb 06, 2019 at 08:07:36AM +0100, Cédric Le Goater wrote:
> On 2/6/19 2:24 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 12:23:29PM +1100, David Gibson wrote:
> >> On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
> >>> On 2/5/19 6:32 AM, David Gibson wrote:
> >>>> On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
> >>>>> On 2/4/19 6:21 AM, David Gibson wrote:
> >>>>>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> >>>>>>> Theses are use to capure the XIVE EAS table of the KVM device, the
> >>>>>>> configuration of the source targets.
> >>>>>>>
> >>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>>> ---
> >>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
> >>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
> >>>>>>>  2 files changed, 98 insertions(+)
> >>>>>>>
> >>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>> index 1a8740629acf..faf024f39858 100644
> >>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
> >>>>>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
> >>>>>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> >>>>>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> >>>>>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> >>>>>>>  
> >>>>>>>  /* Layout of 64-bit XIVE source attribute values */
> >>>>>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> >>>>>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> >>>>>>>  
> >>>>>>> +/* Layout of 64-bit eas attribute values */
> >>>>>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> >>>>>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> >>>>>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> >>>>>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> >>>>>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> >>>>>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> >>>>>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> >>>>>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> >>>>>>> +
> >>>>>>>  #endif /* __LINUX_KVM_POWERPC_H */
> >>>>>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >>>>>>> index f2de1bcf3b35..0468b605baa7 100644
> >>>>>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
> >>>>>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >>>>>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
> >>>>>>>  	return 0;
> >>>>>>>  }
> >>>>>>>  
> >>>>>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> >>>>>>> +				      u64 addr)
> >>>>>>
> >>>>>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
> >>>>>> values, but rather essentially the "source config" in the terminology
> >>>>>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
> >>>>>> the EAS, but since it's the PAPR architected state that we need to
> >>>>>> preserve across migration, I'd prefer to stick as close as we can to
> >>>>>> the PAPR terminology.
> >>>>>
> >>>>> But we don't have an equivalent name in the PAPR specs for the tuple 
> >>>>> (prio, server). We could use the generic 'target' name may be ? even 
> >>>>> if this is usually referring to a CPU number.
> >>>>
> >>>> Um.. what?  That's about terminology for one of the fields in this
> >>>> thing, not about the name for the thing itself.
> >>>>
> >>>>> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
> >>>>> This is was the former name in HW. I think we recycle it for KVM.
> >>>>
> >>>> That's a terrible idea, which will make a confusing situation even
> >>>> more confusing.
> >>>
> >>> Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
> >>> similar to the hcalls anyhow.
> >>
> >> Yes, I think that's a good idea.
> > 
> > Actually... AIUI the SET_CONFIG hcalls shouldn't be a fast path.  
> 
> No indeed. I have move them to standard hcalls in the current version.
> 
> > Can
> > we simplify things further by removing the hcall implementation from
> > the kernel entirely, and have qemu implement them by basically just
> > forwarding them to the appropriate SET_CONFIG ioctl()?
> 
> Yes. I think we could. 

Great!

> The hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG and 
> the KVM ioctls to set the EQ and the SOURCE configuration have a 
> lot in common. I need to look at how we can plug the KVM ioctl in 
> the hcalls under QEMU.
> 
> We will have to convert the returned error to respect the PAPR 
> specs or have the ioctls return H_* errors.

I don't think returning H_* values from a kernel call is a good idea.
Converting errors is kinda ugly, but I still think it's the better
option.  Note that we already have something like this for the HPT
resizing hcalls.

> Let's dig that idea. If we choose that path, QEMU will have an 
> up-to-date EAT and so we won't need to synchronize its state anymore 
> for migration.

I guess so, though I don't see that as essential.

> H_INT_GET_SOURCE_CONFIG can be implemented in QEMU without any KVM 
> ioctl.
> 
> H_INT_GET_QUEUE_INFO could be implemented in QEMU. I need to check 
> how we return the address of the END ESB in sPAPR. We haven't paid 
> much attention to these pages because they are not used under Linux
> and today the address is returned by OPAL. 
> 
> H_INT_GET_QUEUE_CONFIG is a little more problematic because we need
> to query into the XIVE HW the EQ index and toggle bit. OPAL support
> is required for that. But we could reduce the KVM support to the 
> ioctl querying these EQ information.

Right, and we'd need an ioctl() like that for migration anyway, yes?

> H_INT_ESB could be entirely done under QEMU.

This one can actually happen on fairly hot paths, so I think doing
that in qemu probably isn't a good idea.

> H_INT_SYNC and H_INT_RESET can not.
> 
> H_INT_GET_OS_REPORTING_LINE and H_INT_SET_OS_REPORTING_LINE are not
> implemented.
> 
> C.
> 

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-06  7:21             ` Cédric Le Goater
@ 2019-02-07  2:49               ` David Gibson
  2019-02-07  9:03                 ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-07  2:49 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3588 bytes --]

On Wed, Feb 06, 2019 at 08:21:10AM +0100, Cédric Le Goater wrote:
> On 2/6/19 2:23 AM, David Gibson wrote:
> > On Tue, Feb 05, 2019 at 01:55:40PM +0100, Cédric Le Goater wrote:
> >> On 2/5/19 6:28 AM, David Gibson wrote:
> >>> On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
> >>>> On 2/4/19 5:45 AM, David Gibson wrote:
> >>>>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> >>>>>> This will let the guest create a memory mapping to expose the ESB MMIO
> >>>>>> regions used to control the interrupt sources, to trigger events, to
> >>>>>> EOI or to turn off the sources.
> >>>>>>
> >>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>> ---
> >>>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
> >>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
> >>>>>>  2 files changed, 101 insertions(+)
> >>>>>>
> >>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>> index 8c876c166ef2..6bb61ba141c2 100644
> >>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
> >>>>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
> >>>>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
> >>>>>>  
> >>>>>> +/* POWER9 XIVE Native Interrupt Controller */
> >>>>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
> >>>>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> >>>>>
> >>>>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
> >>>>> to both with an mmap() directly on the xive device fd?  Using the
> >>>>> offset to distinguish which one to map, obviously.
> >>>>
> >>>> The page offset would define some sort of user API. It seems feasible.
> >>>> But I am not sure this would be practical in the future if we need to 
> >>>> tune the length.
> >>>
> >>> Um.. why not?  I mean, yes the XIVE supports rather a lot of
> >>> interrupts, but we have 64-bits of offset we can play with - we can
> >>> leave room for billions of ESB slots and still have room for billions
> >>> of VPs.
> >>
> >> So the first 4 pages could be the TIMA pages and then would come  
> >> the pages for the interrupt ESBs. I think that we can have different 
> >> vm_fault handler for each mapping.
> > 
> > Um.. no, I'm saying you don't need to tightly pack them.  So you could
> > have the ESB pages at 0, the TIMA at, say offset 2^60.
> 
> Well, we know that the TIMA is 4 pages wide and is "directly" related
> with the KVM interrupt device. So being at offset 0 seems a good idea.
> While the ESB segment is of a variable size depending on the number
> of IRQs and it can come after I think.
> 
> >> I wonder how this will work out with pass-through. As Paul said in 
> >> a previous email, it would be better to let QEMU request a new 
> >> mapping to handle the ESB pages of the device being passed through.
> >> I guess this is not a special case, just another offset and length.
> > 
> > Right, if we need multiple "chunks" of ESB pages we can given them
> > each their own terabyte or several.  No need to be stingy with address
> > space.
> 
> You can not put them anywhere. They should map the same interrupt range
> of ESB pages, overlapping with the underlying segment of IPI ESB pages. 

I don't really follow what you're saying here.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-06  7:35                 ` Cédric Le Goater
@ 2019-02-07  2:51                   ` David Gibson
  2019-02-07  8:31                     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-07  2:51 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 2125 bytes --]

On Wed, Feb 06, 2019 at 08:35:24AM +0100, Cédric Le Goater wrote:
> On 2/6/19 2:18 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
> >> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> >>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
> >>>>>> emulation in nested guests.
> >>>>>
> >>>>> ok. so no kernel_irqchip at all. hmm. 
> >>>
> >>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> >>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> >>> device KVM uses when under a P9 processor. 
> >>
> >> Actually there are two separate implementations of XICS emulation in
> >> KVM.  The first (older) one is almost entirely a software emulation
> >> but does have some cases where it accesses an underlying XICS device
> >> in order to make some things faster (IPIs and pass-through of a device
> >> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
> >> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
> >> My patch was about making the the older code work when there is no
> >> XICS available to the host.
> > 
> > Ah, right.  To clarify my earlier statements in light of this:
> > 
> >  * We definitely want some sort of kernel-XICS available in a nested
> >    guest.  AIUI, this is now accomplished, so, Yay!
> > 
> >  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
> >    bonus, but it's a much lower priority.
> 
> Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
> to QEMU which will restrict CAS to the XICS only interrupt mode.

Uh... no... we shouldn't change what's available to the guest based on
host configuration only.  We should just stop advertising the CAP
saying that *KVM implemented* is available so that qemu will fall back
to userspace XIVE emulation.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-07  2:51                   ` David Gibson
@ 2019-02-07  8:31                     ` Cédric Le Goater
  2019-02-08  5:07                       ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-07  8:31 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, linuxppc-dev

On 2/7/19 3:51 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 08:35:24AM +0100, Cédric Le Goater wrote:
>> On 2/6/19 2:18 AM, David Gibson wrote:
>>> On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
>>>> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
>>>>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
>>>>>>>> emulation in nested guests.
>>>>>>>
>>>>>>> ok. so no kernel_irqchip at all. hmm. 
>>>>>
>>>>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
>>>>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
>>>>> device KVM uses when under a P9 processor. 
>>>>
>>>> Actually there are two separate implementations of XICS emulation in
>>>> KVM.  The first (older) one is almost entirely a software emulation
>>>> but does have some cases where it accesses an underlying XICS device
>>>> in order to make some things faster (IPIs and pass-through of a device
>>>> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
>>>> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
>>>> My patch was about making the the older code work when there is no
>>>> XICS available to the host.
>>>
>>> Ah, right.  To clarify my earlier statements in light of this:
>>>
>>>  * We definitely want some sort of kernel-XICS available in a nested
>>>    guest.  AIUI, this is now accomplished, so, Yay!
>>>
>>>  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
>>>    bonus, but it's a much lower priority.
>>
>> Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
>> to QEMU which will restrict CAS to the XICS only interrupt mode.
> 
> Uh... no... we shouldn't change what's available to the guest based on
> host configuration only.  We should just stop advertising the CAP
> saying that *KVM implemented* is available 

yes. that is what I meant.

> so that qemu will fall back to userspace XIVE emulation.

even if kernel_irqchip is required ? 

Today, QEMU just fails to start. With the dual mode, the interrupt mode 
is negotiated at CAS time and when merged, the KVM device will be created 
at reset. In case of failure, QEMU will abort. 

I am not saying it is not possible but we will need some internal 
infrastructure to handle dynamically the fall back to userspace emulation.

C.


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-07  2:49               ` David Gibson
@ 2019-02-07  9:03                 ` Cédric Le Goater
  2019-02-08  5:15                   ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-07  9:03 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/7/19 3:49 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 08:21:10AM +0100, Cédric Le Goater wrote:
>> On 2/6/19 2:23 AM, David Gibson wrote:
>>> On Tue, Feb 05, 2019 at 01:55:40PM +0100, Cédric Le Goater wrote:
>>>> On 2/5/19 6:28 AM, David Gibson wrote:
>>>>> On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
>>>>>> On 2/4/19 5:45 AM, David Gibson wrote:
>>>>>>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>>>>>>>> This will let the guest create a memory mapping to expose the ESB MMIO
>>>>>>>> regions used to control the interrupt sources, to trigger events, to
>>>>>>>> EOI or to turn off the sources.
>>>>>>>>
>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>> ---
>>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>>>>>>>>  2 files changed, 101 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>> index 8c876c166ef2..6bb61ba141c2 100644
>>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>>>>>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>>>>>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>>>>>>>  
>>>>>>>> +/* POWER9 XIVE Native Interrupt Controller */
>>>>>>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>>>>>>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
>>>>>>>
>>>>>>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
>>>>>>> to both with an mmap() directly on the xive device fd?  Using the
>>>>>>> offset to distinguish which one to map, obviously.
>>>>>>
>>>>>> The page offset would define some sort of user API. It seems feasible.
>>>>>> But I am not sure this would be practical in the future if we need to 
>>>>>> tune the length.
>>>>>
>>>>> Um.. why not?  I mean, yes the XIVE supports rather a lot of
>>>>> interrupts, but we have 64-bits of offset we can play with - we can
>>>>> leave room for billions of ESB slots and still have room for billions
>>>>> of VPs.
>>>>
>>>> So the first 4 pages could be the TIMA pages and then would come  
>>>> the pages for the interrupt ESBs. I think that we can have different 
>>>> vm_fault handler for each mapping.
>>>
>>> Um.. no, I'm saying you don't need to tightly pack them.  So you could
>>> have the ESB pages at 0, the TIMA at, say offset 2^60.
>>
>> Well, we know that the TIMA is 4 pages wide and is "directly" related
>> with the KVM interrupt device. So being at offset 0 seems a good idea.
>> While the ESB segment is of a variable size depending on the number
>> of IRQs and it can come after I think.
>>
>>>> I wonder how this will work out with pass-through. As Paul said in 
>>>> a previous email, it would be better to let QEMU request a new 
>>>> mapping to handle the ESB pages of the device being passed through.
>>>> I guess this is not a special case, just another offset and length.
>>>
>>> Right, if we need multiple "chunks" of ESB pages we can given them
>>> each their own terabyte or several.  No need to be stingy with address
>>> space.
>>
>> You can not put them anywhere. They should map the same interrupt range
>> of ESB pages, overlapping with the underlying segment of IPI ESB pages. 
> 
> I don't really follow what you're saying here.


What we want the guest to access in terms of ESB pages is something like 
below, VMA0 being the initial mapping done by QEMU at offset 0x0, the IPI 
ESB pages being populated on the demand with the loads and the stores from 
the guest :


                 0x0                   0x1100  0x1200    0x1300     
      
         ranges   |       CPU IPIs   .. |  VIO  | PCI LSI |  MSIs
       	  
                  +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
 VMA0    IPI ESB  | | | | | | |     | | | | | | | | | | | | | | | | | |
          pages   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....



A device is passed through and the driver requests MSIs. 

We now want the guest to access the HW ESB pages for the requested IRQs 
but still the initial IPI ESB pages for the others. Something like below : 


                 0x0                   0x1100  0x1200    0x1300     
      
         ranges   |       CPU IPIs   .. |  VIO  | PCI LSI |  MSIs

                  +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
 VMA0    IPI ESB  | | | | | | |     | | | | | | | | | | | | | | | | | |
          pages   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
                                                                  
 VMA1    PHB ESB                                          +-------+
          pages                                           | | | | | 
                                                          +-------+

The VMA1 is the result of a new mmap() being done at an offset depending on 
the first IRQ number requested by the driver. 

This is because the vm_fault handler uses the page offset to find the 
associated KVM IRQ struct containing the addresses of the EOI and trigger 
pages in the underlying hardware, which will be the PHB in case of a 
passthrough device.  

From there, the VMA1 mmap() pointer will be used to create a 'ram device'
memory region which will be mapped on top of the initial ESB memory region 
in QEMU. This will override the initial IPI ESB pages with the PHB ESB pages 
in the guest ESB address space. 

That's the plan I have in mind as suggested by Paul if I understood it well.
The mechanics are more complex than the patch zapping the PTEs from the VMA
but it's also safer. 


C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-07  2:48                 ` David Gibson
@ 2019-02-07  9:13                   ` Cédric Le Goater
  2019-02-08  5:15                     ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-07  9:13 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/7/19 3:48 AM, David Gibson wrote:
> On Wed, Feb 06, 2019 at 08:07:36AM +0100, Cédric Le Goater wrote:
>> On 2/6/19 2:24 AM, David Gibson wrote:
>>> On Wed, Feb 06, 2019 at 12:23:29PM +1100, David Gibson wrote:
>>>> On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
>>>>> On 2/5/19 6:32 AM, David Gibson wrote:
>>>>>> On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
>>>>>>> On 2/4/19 6:21 AM, David Gibson wrote:
>>>>>>>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
>>>>>>>>> Theses are use to capure the XIVE EAS table of the KVM device, the
>>>>>>>>> configuration of the source targets.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>>> ---
>>>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
>>>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
>>>>>>>>>  2 files changed, 98 insertions(+)
>>>>>>>>>
>>>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>> index 1a8740629acf..faf024f39858 100644
>>>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>>>>>>>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>>>>>>>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>>>>>>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>>>>>>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>>>>>>>>>  
>>>>>>>>>  /* Layout of 64-bit XIVE source attribute values */
>>>>>>>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>>>>>>>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>>>>>>>>  
>>>>>>>>> +/* Layout of 64-bit eas attribute values */
>>>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
>>>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
>>>>>>>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
>>>>>>>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
>>>>>>>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
>>>>>>>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
>>>>>>>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
>>>>>>>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>>>>>>>>> +
>>>>>>>>>  #endif /* __LINUX_KVM_POWERPC_H */
>>>>>>>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>>>> index f2de1bcf3b35..0468b605baa7 100644
>>>>>>>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>>>>>>>>>  	return 0;
>>>>>>>>>  }
>>>>>>>>>  
>>>>>>>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
>>>>>>>>> +				      u64 addr)
>>>>>>>>
>>>>>>>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
>>>>>>>> values, but rather essentially the "source config" in the terminology
>>>>>>>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
>>>>>>>> the EAS, but since it's the PAPR architected state that we need to
>>>>>>>> preserve across migration, I'd prefer to stick as close as we can to
>>>>>>>> the PAPR terminology.
>>>>>>>
>>>>>>> But we don't have an equivalent name in the PAPR specs for the tuple 
>>>>>>> (prio, server). We could use the generic 'target' name may be ? even 
>>>>>>> if this is usually referring to a CPU number.
>>>>>>
>>>>>> Um.. what?  That's about terminology for one of the fields in this
>>>>>> thing, not about the name for the thing itself.
>>>>>>
>>>>>>> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
>>>>>>> This is was the former name in HW. I think we recycle it for KVM.
>>>>>>
>>>>>> That's a terrible idea, which will make a confusing situation even
>>>>>> more confusing.
>>>>>
>>>>> Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
>>>>> similar to the hcalls anyhow.
>>>>
>>>> Yes, I think that's a good idea.
>>>
>>> Actually... AIUI the SET_CONFIG hcalls shouldn't be a fast path.  
>>
>> No indeed. I have move them to standard hcalls in the current version.
>>
>>> Can
>>> we simplify things further by removing the hcall implementation from
>>> the kernel entirely, and have qemu implement them by basically just
>>> forwarding them to the appropriate SET_CONFIG ioctl()?
>>
>> Yes. I think we could. 
> 
> Great!
> 
>> The hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG and 
>> the KVM ioctls to set the EQ and the SOURCE configuration have a 
>> lot in common. I need to look at how we can plug the KVM ioctl in 
>> the hcalls under QEMU.
>>
>> We will have to convert the returned error to respect the PAPR 
>> specs or have the ioctls return H_* errors.
> 
> I don't think returning H_* values from a kernel call is a good idea.
> Converting errors is kinda ugly, but I still think it's the better
> option.  Note that we already have something like this for the HPT
> resizing hcalls.

ok.
 
>> Let's dig that idea. If we choose that path, QEMU will have an 
>> up-to-date EAT and so we won't need to synchronize its state anymore 
>> for migration.
> 
> I guess so, though I don't see that as essential.
> 
>> H_INT_GET_SOURCE_CONFIG can be implemented in QEMU without any KVM 
>> ioctl.
>>
>> H_INT_GET_QUEUE_INFO could be implemented in QEMU. I need to check 
>> how we return the address of the END ESB in sPAPR. We haven't paid 
>> much attention to these pages because they are not used under Linux
>> and today the address is returned by OPAL. 
>>
>> H_INT_GET_QUEUE_CONFIG is a little more problematic because we need
>> to query into the XIVE HW the EQ index and toggle bit. OPAL support
>> is required for that. But we could reduce the KVM support to the 
>> ioctl querying these EQ information.
> 
> Right, and we'd need an ioctl() like that for migration anyway, yes?

Yes. it is the same need.

>> H_INT_ESB could be entirely done under QEMU.
> 
> This one can actually happen on fairly hot paths, so I think doing
> that in qemu probably isn't a good idea.

I agree It would nice to have some performance.

This hcall is used when LSIs are involved, which is not really a common 
configuration. There are no OPAL calls involved. And we are duplicating 
code at the KVM level to retrigger the interrupt when the level is still
asserted.

I will benchmark the two options before making a choice. 

C.


>> H_INT_SYNC and H_INT_RESET can not.
>>
>> H_INT_GET_OS_REPORTING_LINE and H_INT_SET_OS_REPORTING_LINE are not
>> implemented.
>>
>> C.
>>
> 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-07  8:31                     ` Cédric Le Goater
@ 2019-02-08  5:07                       ` David Gibson
  2019-02-08  7:38                         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-08  5:07 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3139 bytes --]

On Thu, Feb 07, 2019 at 09:31:06AM +0100, Cédric Le Goater wrote:
> On 2/7/19 3:51 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 08:35:24AM +0100, Cédric Le Goater wrote:
> >> On 2/6/19 2:18 AM, David Gibson wrote:
> >>> On Wed, Feb 06, 2019 at 09:13:15AM +1100, Paul Mackerras wrote:
> >>>> On Tue, Feb 05, 2019 at 12:31:28PM +0100, Cédric Le Goater wrote:
> >>>>>>>> As for nesting, I suggest for the foreseeable future we stick to XICS
> >>>>>>>> emulation in nested guests.
> >>>>>>>
> >>>>>>> ok. so no kernel_irqchip at all. hmm. 
> >>>>>
> >>>>> I was confused with what Paul calls 'XICS emulation'. It's not the QEMU
> >>>>> XICS emulated device but the XICS-over-XIVE KVM device, the KVM XICS 
> >>>>> device KVM uses when under a P9 processor. 
> >>>>
> >>>> Actually there are two separate implementations of XICS emulation in
> >>>> KVM.  The first (older) one is almost entirely a software emulation
> >>>> but does have some cases where it accesses an underlying XICS device
> >>>> in order to make some things faster (IPIs and pass-through of a device
> >>>> interrupt to a guest).  The other, newer one is the XICS-on-XIVE
> >>>> emulation that Ben wrote, which uses the XIVE hardware pretty heavily.
> >>>> My patch was about making the the older code work when there is no
> >>>> XICS available to the host.
> >>>
> >>> Ah, right.  To clarify my earlier statements in light of this:
> >>>
> >>>  * We definitely want some sort of kernel-XICS available in a nested
> >>>    guest.  AIUI, this is now accomplished, so, Yay!
> >>>
> >>>  * Implementing the L2 XICS in terms of L1's PAPR-XIVE would be a
> >>>    bonus, but it's a much lower priority.
> >>
> >> Yes. In this case, the L1 KVM-HV should not advertise KVM_CAP_PPC_IRQ_XIVE
> >> to QEMU which will restrict CAS to the XICS only interrupt mode.
> > 
> > Uh... no... we shouldn't change what's available to the guest based on
> > host configuration only.  We should just stop advertising the CAP
> > saying that *KVM implemented* is available 
> 
> yes. that is what I meant.
> 
> > so that qemu will fall back to userspace XIVE emulation.
> 
> even if kernel_irqchip is required ? 

Well, no, but if we don't specify.

> Today, QEMU just fails to start.

If we specify kernel_irqchip=on but the kernel can't support that I
think that's the right thing to do.

> With the dual mode, the interrupt mode 
> is negotiated at CAS time and when merged, the KVM device will be created 
> at reset. In case of failure, QEMU will abort. 
> 
> I am not saying it is not possible but we will need some internal 
> infrastructure to handle dynamically the fall back to userspace
> emulation.

Uh.. we do?  I think in all cases we need to make the XICS vs. XIVE
decision first (i.e. what we present to the guest), then we should
decide how to implement it (userspace, KVM accelerated, impossible and
give up).

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-07  9:03                 ` Cédric Le Goater
@ 2019-02-08  5:15                   ` David Gibson
  2019-02-08  7:58                     ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-08  5:15 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 6973 bytes --]

On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
> On 2/7/19 3:49 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 08:21:10AM +0100, Cédric Le Goater wrote:
> >> On 2/6/19 2:23 AM, David Gibson wrote:
> >>> On Tue, Feb 05, 2019 at 01:55:40PM +0100, Cédric Le Goater wrote:
> >>>> On 2/5/19 6:28 AM, David Gibson wrote:
> >>>>> On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
> >>>>>> On 2/4/19 5:45 AM, David Gibson wrote:
> >>>>>>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
> >>>>>>>> This will let the guest create a memory mapping to expose the ESB MMIO
> >>>>>>>> regions used to control the interrupt sources, to trigger events, to
> >>>>>>>> EOI or to turn off the sources.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>>>> ---
> >>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
> >>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
> >>>>>>>>  2 files changed, 101 insertions(+)
> >>>>>>>>
> >>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>>> index 8c876c166ef2..6bb61ba141c2 100644
> >>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
> >>>>>>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
> >>>>>>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
> >>>>>>>>  
> >>>>>>>> +/* POWER9 XIVE Native Interrupt Controller */
> >>>>>>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
> >>>>>>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
> >>>>>>>
> >>>>>>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
> >>>>>>> to both with an mmap() directly on the xive device fd?  Using the
> >>>>>>> offset to distinguish which one to map, obviously.
> >>>>>>
> >>>>>> The page offset would define some sort of user API. It seems feasible.
> >>>>>> But I am not sure this would be practical in the future if we need to 
> >>>>>> tune the length.
> >>>>>
> >>>>> Um.. why not?  I mean, yes the XIVE supports rather a lot of
> >>>>> interrupts, but we have 64-bits of offset we can play with - we can
> >>>>> leave room for billions of ESB slots and still have room for billions
> >>>>> of VPs.
> >>>>
> >>>> So the first 4 pages could be the TIMA pages and then would come  
> >>>> the pages for the interrupt ESBs. I think that we can have different 
> >>>> vm_fault handler for each mapping.
> >>>
> >>> Um.. no, I'm saying you don't need to tightly pack them.  So you could
> >>> have the ESB pages at 0, the TIMA at, say offset 2^60.
> >>
> >> Well, we know that the TIMA is 4 pages wide and is "directly" related
> >> with the KVM interrupt device. So being at offset 0 seems a good idea.
> >> While the ESB segment is of a variable size depending on the number
> >> of IRQs and it can come after I think.
> >>
> >>>> I wonder how this will work out with pass-through. As Paul said in 
> >>>> a previous email, it would be better to let QEMU request a new 
> >>>> mapping to handle the ESB pages of the device being passed through.
> >>>> I guess this is not a special case, just another offset and length.
> >>>
> >>> Right, if we need multiple "chunks" of ESB pages we can given them
> >>> each their own terabyte or several.  No need to be stingy with address
> >>> space.
> >>
> >> You can not put them anywhere. They should map the same interrupt range
> >> of ESB pages, overlapping with the underlying segment of IPI ESB pages. 
> > 
> > I don't really follow what you're saying here.
> 
> 
> What we want the guest to access in terms of ESB pages is something like 
> below, VMA0 being the initial mapping done by QEMU at offset 0x0, the IPI 
> ESB pages being populated on the demand with the loads and the stores from 
> the guest :
> 
> 
>                  0x0                   0x1100  0x1200    0x1300     
>       
>          ranges   |       CPU IPIs   .. |  VIO  | PCI LSI |  MSIs
>        	  
>                   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>  VMA0    IPI ESB  | | | | | | |     | | | | | | | | | | | | | | | | | |
>           pages   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
> 
> 
> 
> A device is passed through and the driver requests MSIs. 
> 
> We now want the guest to access the HW ESB pages for the requested IRQs 
> but still the initial IPI ESB pages for the others. Something like below : 
> 
> 
>                  0x0                   0x1100  0x1200    0x1300     
>       
>          ranges   |       CPU IPIs   .. |  VIO  | PCI LSI |  MSIs
> 
>                   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>  VMA0    IPI ESB  | | | | | | |     | | | | | | | | | | | | | | | | | |
>           pages   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>                                                                   
>  VMA1    PHB ESB                                          +-------+
>           pages                                           | | | | | 
>                                                           +-------+

Right, except of course VMA0 will be split into two pieces by
performing the mmap() over it.

> The VMA1 is the result of a new mmap() being done at an offset depending on 
> the first IRQ number requested by the driver.

Right... that's one way we could do it.  But the irq numbers are all
dynamically allocated here, so could we instead just put the
passthrough MSIs in a separate range?  We'd still need a separate
mmap() for them, but we wouldn't have to deal with mapping over and
unmapping if the device is removed or whatever.

> This is because the vm_fault handler uses the page offset to find the 
> associated KVM IRQ struct containing the addresses of the EOI and trigger 
> pages in the underlying hardware, which will be the PHB in case of a 
> passthrough device.  
> 
> >From there, the VMA1 mmap() pointer will be used to create a 'ram device'
> memory region which will be mapped on top of the initial ESB memory region 
> in QEMU. This will override the initial IPI ESB pages with the PHB ESB pages 
> in the guest ESB address space.

Um.. what?  If that qemu memory range is already mapped into the guest
we don't need to create new RAM devices or anything for the
overmapping.  If we overmap in qemu that will just get carried into
the guest.

> That's the plan I have in mind as suggested by Paul if I understood it well.
> The mechanics are more complex than the patch zapping the PTEs from the VMA
> but it's also safer.

Well, yes, where "safer" means "has the possibility to be correct".

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-07  9:13                   ` Cédric Le Goater
@ 2019-02-08  5:15                     ` David Gibson
  2019-02-14 16:50                       ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-08  5:15 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 7088 bytes --]

On Thu, Feb 07, 2019 at 10:13:48AM +0100, Cédric Le Goater wrote:
> On 2/7/19 3:48 AM, David Gibson wrote:
> > On Wed, Feb 06, 2019 at 08:07:36AM +0100, Cédric Le Goater wrote:
> >> On 2/6/19 2:24 AM, David Gibson wrote:
> >>> On Wed, Feb 06, 2019 at 12:23:29PM +1100, David Gibson wrote:
> >>>> On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
> >>>>> On 2/5/19 6:32 AM, David Gibson wrote:
> >>>>>> On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
> >>>>>>> On 2/4/19 6:21 AM, David Gibson wrote:
> >>>>>>>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
> >>>>>>>>> Theses are use to capure the XIVE EAS table of the KVM device, the
> >>>>>>>>> configuration of the source targets.
> >>>>>>>>>
> >>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
> >>>>>>>>> ---
> >>>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
> >>>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
> >>>>>>>>>  2 files changed, 98 insertions(+)
> >>>>>>>>>
> >>>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>>>> index 1a8740629acf..faf024f39858 100644
> >>>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
> >>>>>>>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
> >>>>>>>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
> >>>>>>>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
> >>>>>>>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
> >>>>>>>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
> >>>>>>>>>  
> >>>>>>>>>  /* Layout of 64-bit XIVE source attribute values */
> >>>>>>>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
> >>>>>>>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
> >>>>>>>>>  
> >>>>>>>>> +/* Layout of 64-bit eas attribute values */
> >>>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
> >>>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
> >>>>>>>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
> >>>>>>>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
> >>>>>>>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
> >>>>>>>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
> >>>>>>>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
> >>>>>>>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
> >>>>>>>>> +
> >>>>>>>>>  #endif /* __LINUX_KVM_POWERPC_H */
> >>>>>>>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
> >>>>>>>>> index f2de1bcf3b35..0468b605baa7 100644
> >>>>>>>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
> >>>>>>>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
> >>>>>>>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
> >>>>>>>>>  	return 0;
> >>>>>>>>>  }
> >>>>>>>>>  
> >>>>>>>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
> >>>>>>>>> +				      u64 addr)
> >>>>>>>>
> >>>>>>>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
> >>>>>>>> values, but rather essentially the "source config" in the terminology
> >>>>>>>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
> >>>>>>>> the EAS, but since it's the PAPR architected state that we need to
> >>>>>>>> preserve across migration, I'd prefer to stick as close as we can to
> >>>>>>>> the PAPR terminology.
> >>>>>>>
> >>>>>>> But we don't have an equivalent name in the PAPR specs for the tuple 
> >>>>>>> (prio, server). We could use the generic 'target' name may be ? even 
> >>>>>>> if this is usually referring to a CPU number.
> >>>>>>
> >>>>>> Um.. what?  That's about terminology for one of the fields in this
> >>>>>> thing, not about the name for the thing itself.
> >>>>>>
> >>>>>>> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
> >>>>>>> This is was the former name in HW. I think we recycle it for KVM.
> >>>>>>
> >>>>>> That's a terrible idea, which will make a confusing situation even
> >>>>>> more confusing.
> >>>>>
> >>>>> Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
> >>>>> similar to the hcalls anyhow.
> >>>>
> >>>> Yes, I think that's a good idea.
> >>>
> >>> Actually... AIUI the SET_CONFIG hcalls shouldn't be a fast path.  
> >>
> >> No indeed. I have move them to standard hcalls in the current version.
> >>
> >>> Can
> >>> we simplify things further by removing the hcall implementation from
> >>> the kernel entirely, and have qemu implement them by basically just
> >>> forwarding them to the appropriate SET_CONFIG ioctl()?
> >>
> >> Yes. I think we could. 
> > 
> > Great!
> > 
> >> The hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG and 
> >> the KVM ioctls to set the EQ and the SOURCE configuration have a 
> >> lot in common. I need to look at how we can plug the KVM ioctl in 
> >> the hcalls under QEMU.
> >>
> >> We will have to convert the returned error to respect the PAPR 
> >> specs or have the ioctls return H_* errors.
> > 
> > I don't think returning H_* values from a kernel call is a good idea.
> > Converting errors is kinda ugly, but I still think it's the better
> > option.  Note that we already have something like this for the HPT
> > resizing hcalls.
> 
> ok.
>  
> >> Let's dig that idea. If we choose that path, QEMU will have an 
> >> up-to-date EAT and so we won't need to synchronize its state anymore 
> >> for migration.
> > 
> > I guess so, though I don't see that as essential.
> > 
> >> H_INT_GET_SOURCE_CONFIG can be implemented in QEMU without any KVM 
> >> ioctl.
> >>
> >> H_INT_GET_QUEUE_INFO could be implemented in QEMU. I need to check 
> >> how we return the address of the END ESB in sPAPR. We haven't paid 
> >> much attention to these pages because they are not used under Linux
> >> and today the address is returned by OPAL. 
> >>
> >> H_INT_GET_QUEUE_CONFIG is a little more problematic because we need
> >> to query into the XIVE HW the EQ index and toggle bit. OPAL support
> >> is required for that. But we could reduce the KVM support to the 
> >> ioctl querying these EQ information.
> > 
> > Right, and we'd need an ioctl() like that for migration anyway, yes?
> 
> Yes. it is the same need.
> 
> >> H_INT_ESB could be entirely done under QEMU.
> > 
> > This one can actually happen on fairly hot paths, so I think doing
> > that in qemu probably isn't a good idea.
> 
> I agree It would nice to have some performance.
> 
> This hcall is used when LSIs are involved, which is not really a common 
> configuration. There are no OPAL calls involved. And we are duplicating 
> code at the KVM level to retrigger the interrupt when the level is still
> asserted.
> 
> I will benchmark the two options before making a choice.

Ok.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode
  2019-02-08  5:07                       ` David Gibson
@ 2019-02-08  7:38                         ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-08  7:38 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, linuxppc-dev

>> With the dual mode, the interrupt mode 
>> is negotiated at CAS time and when merged, the KVM device will be created 
>> at reset. In case of failure, QEMU will abort. 
>>
>> I am not saying it is not possible but we will need some internal 
>> infrastructure to handle dynamically the fall back to userspace
>> emulation.
> 
> Uh.. we do?  I think in all cases we need to make the XICS vs. XIVE
> decision first (i.e. what we present to the guest), then we should
> decide how to implement it (userspace, KVM accelerated, impossible and
> give up).

I am changing things with the addition of KM support for dual mode but 
that might not be the right approach. Let's talk over it when you reach 
the end of the QEMU patchset.

I will keep in mind that we should know exactly what KVM supports
before the machine starts. That is : not to abort QEMU if we can not 
satisfy the interrupt mode chosen at CAS time. It might be possible
to fallback to XIVE emulated mode, I think that is where the problem
is but I haven't looked at it closely.

C.  


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-08  5:15                   ` David Gibson
@ 2019-02-08  7:58                     ` Cédric Le Goater
  2019-02-08 21:53                       ` Paul Mackerras
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-08  7:58 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/8/19 6:15 AM, David Gibson wrote:
> On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
>> On 2/7/19 3:49 AM, David Gibson wrote:
>>> On Wed, Feb 06, 2019 at 08:21:10AM +0100, Cédric Le Goater wrote:
>>>> On 2/6/19 2:23 AM, David Gibson wrote:
>>>>> On Tue, Feb 05, 2019 at 01:55:40PM +0100, Cédric Le Goater wrote:
>>>>>> On 2/5/19 6:28 AM, David Gibson wrote:
>>>>>>> On Mon, Feb 04, 2019 at 12:30:39PM +0100, Cédric Le Goater wrote:
>>>>>>>> On 2/4/19 5:45 AM, David Gibson wrote:
>>>>>>>>> On Mon, Jan 07, 2019 at 07:43:18PM +0100, Cédric Le Goater wrote:
>>>>>>>>>> This will let the guest create a memory mapping to expose the ESB MMIO
>>>>>>>>>> regions used to control the interrupt sources, to trigger events, to
>>>>>>>>>> EOI or to turn off the sources.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>>>> ---
>>>>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   |  4 ++
>>>>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 97 +++++++++++++++++++++++++++
>>>>>>>>>>  2 files changed, 101 insertions(+)
>>>>>>>>>>
>>>>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>>> index 8c876c166ef2..6bb61ba141c2 100644
>>>>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>>> @@ -675,4 +675,8 @@ struct kvm_ppc_cpu_char {
>>>>>>>>>>  #define  KVM_XICS_PRESENTED		(1ULL << 43)
>>>>>>>>>>  #define  KVM_XICS_QUEUED		(1ULL << 44)
>>>>>>>>>>  
>>>>>>>>>> +/* POWER9 XIVE Native Interrupt Controller */
>>>>>>>>>> +#define KVM_DEV_XIVE_GRP_CTRL		1
>>>>>>>>>> +#define   KVM_DEV_XIVE_GET_ESB_FD	1
>>>>>>>>>
>>>>>>>>> Introducing a new FD for ESB and TIMA seems overkill.  Can't you get
>>>>>>>>> to both with an mmap() directly on the xive device fd?  Using the
>>>>>>>>> offset to distinguish which one to map, obviously.
>>>>>>>>
>>>>>>>> The page offset would define some sort of user API. It seems feasible.
>>>>>>>> But I am not sure this would be practical in the future if we need to 
>>>>>>>> tune the length.
>>>>>>>
>>>>>>> Um.. why not?  I mean, yes the XIVE supports rather a lot of
>>>>>>> interrupts, but we have 64-bits of offset we can play with - we can
>>>>>>> leave room for billions of ESB slots and still have room for billions
>>>>>>> of VPs.
>>>>>>
>>>>>> So the first 4 pages could be the TIMA pages and then would come  
>>>>>> the pages for the interrupt ESBs. I think that we can have different 
>>>>>> vm_fault handler for each mapping.
>>>>>
>>>>> Um.. no, I'm saying you don't need to tightly pack them.  So you could
>>>>> have the ESB pages at 0, the TIMA at, say offset 2^60.
>>>>
>>>> Well, we know that the TIMA is 4 pages wide and is "directly" related
>>>> with the KVM interrupt device. So being at offset 0 seems a good idea.
>>>> While the ESB segment is of a variable size depending on the number
>>>> of IRQs and it can come after I think.
>>>>
>>>>>> I wonder how this will work out with pass-through. As Paul said in 
>>>>>> a previous email, it would be better to let QEMU request a new 
>>>>>> mapping to handle the ESB pages of the device being passed through.
>>>>>> I guess this is not a special case, just another offset and length.
>>>>>
>>>>> Right, if we need multiple "chunks" of ESB pages we can given them
>>>>> each their own terabyte or several.  No need to be stingy with address
>>>>> space.
>>>>
>>>> You can not put them anywhere. They should map the same interrupt range
>>>> of ESB pages, overlapping with the underlying segment of IPI ESB pages. 
>>>
>>> I don't really follow what you're saying here.
>>
>>
>> What we want the guest to access in terms of ESB pages is something like 
>> below, VMA0 being the initial mapping done by QEMU at offset 0x0, the IPI 
>> ESB pages being populated on the demand with the loads and the stores from 
>> the guest :
>>
>>
>>                  0x0                   0x1100  0x1200    0x1300     
>>       
>>          ranges   |       CPU IPIs   .. |  VIO  | PCI LSI |  MSIs
>>        	  
>>                   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>>  VMA0    IPI ESB  | | | | | | |     | | | | | | | | | | | | | | | | | |
>>           pages   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>>
>>
>>
>> A device is passed through and the driver requests MSIs. 
>>
>> We now want the guest to access the HW ESB pages for the requested IRQs 
>> but still the initial IPI ESB pages for the others. Something like below : 
>>
>>
>>                  0x0                   0x1100  0x1200    0x1300     
>>       
>>          ranges   |       CPU IPIs   .. |  VIO  | PCI LSI |  MSIs
>>
>>                   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>>  VMA0    IPI ESB  | | | | | | |     | | | | | | | | | | | | | | | | | |
>>           pages   +-+-+-+-+-+-+-...-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+- ....
>>                                                                   
>>  VMA1    PHB ESB                                          +-------+
>>           pages                                           | | | | | 
>>                                                           +-------+
> 
> Right, except of course VMA0 will be split into two pieces by
> performing the mmap() over it.
> 
>> The VMA1 is the result of a new mmap() being done at an offset depending on 
>> the first IRQ number requested by the driver.
> 
> Right... that's one way we could do it.  But the irq numbers are all
> dynamically allocated here, so could we instead just put the
> passthrough MSIs in a separate range?  

Hmm, yes. These are still MSIs. I am not sure of the benefits. See below.

> We'd still need a separate
> mmap() for them, but we wouldn't have to deal with mapping over and
> unmapping if the device is removed or whatever.

How would we handle multiples devices being hot-plugged, hot-unplugged 
and hot-replugged ? The ESB pages would be populated the first time 
they are touched and might not be the correct ones if a new device is 
hot-plugged to the machine. 

>> This is because the vm_fault handler uses the page offset to find the 
>> associated KVM IRQ struct containing the addresses of the EOI and trigger 
>> pages in the underlying hardware, which will be the PHB in case of a 
>> passthrough device.  
>>
>> From there, the VMA1 mmap() pointer will be used to create a 'ram device'
>> memory region which will be mapped on top of the initial ESB memory region 
>> in QEMU. This will override the initial IPI ESB pages with the PHB ESB pages 
>> in the guest ESB address space.
> 
> Um.. what?  If that qemu memory range is already mapped into the guest
> we don't need to create new RAM devices or anything for the
> overmapping.  If we overmap in qemu that will just get carried into
> the guest.

yes, that's the goal. 

When the guest accesses the region, the vm_fault handler will be invoked 
and the VMA will be populated with the ESB pages of the device being 
passthrough. When the device is removed from the machine, we only need 
to delete the region from QEMU and munmap() the VMA to clear the mappings.
The underlying pages will be the ones for the XIVE IC IPIs. 

And the IRQ numbers can be safely recycled for another passthrough device.

>> That's the plan I have in mind as suggested by Paul if I understood it well.
>> The mechanics are more complex than the patch zapping the PTEs from the VMA
>> but it's also safer.
> 
> Well, yes, where "safer" means "has the possibility to be correct".

Well, the only problem with the kernel approach is keeping a pointer on 
the VMA. If we could call find_vma(), it would be perfectly safe and much 
more simpler.

C. 
 


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-08  7:58                     ` Cédric Le Goater
@ 2019-02-08 21:53                       ` Paul Mackerras
  2019-02-09  9:41                         ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Paul Mackerras @ 2019-02-08 21:53 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On Fri, Feb 08, 2019 at 08:58:14AM +0100, Cédric Le Goater wrote:
> On 2/8/19 6:15 AM, David Gibson wrote:
> > On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
> >> That's the plan I have in mind as suggested by Paul if I understood it well.
> >> The mechanics are more complex than the patch zapping the PTEs from the VMA
> >> but it's also safer.
> > 
> > Well, yes, where "safer" means "has the possibility to be correct".
> 
> Well, the only problem with the kernel approach is keeping a pointer on 
> the VMA. If we could call find_vma(), it would be perfectly safe and much 
> more simpler.

You seem to be assuming that the kernel can easily work out a single
virtual address which will be the only place where a given set of
interrupt pages are mapped.  But that is really not possible in the
general case, because userspace could have mapped the fd at many
different offsets in many different places.

QEMU doesn't do that; in QEMU, the mmaps are sufficiently limited that
it can work out a single virtual address that needs to be changed.
The way that QEMU should tell the kernel what that address is and what
the mapping should be changed to, is via the existing munmap()/mmap()
interface.

Paul.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-08 21:53                       ` Paul Mackerras
@ 2019-02-09  9:41                         ` Cédric Le Goater
  2019-02-11  2:38                           ` David Gibson
  0 siblings, 1 reply; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-09  9:41 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: kvm, kvm-ppc, linuxppc-dev, David Gibson

On 2/8/19 10:53 PM, Paul Mackerras wrote:
> On Fri, Feb 08, 2019 at 08:58:14AM +0100, Cédric Le Goater wrote:
>> On 2/8/19 6:15 AM, David Gibson wrote:
>>> On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
>>>> That's the plan I have in mind as suggested by Paul if I understood it well.
>>>> The mechanics are more complex than the patch zapping the PTEs from the VMA
>>>> but it's also safer.
>>>
>>> Well, yes, where "safer" means "has the possibility to be correct".
>>
>> Well, the only problem with the kernel approach is keeping a pointer on 
>> the VMA. If we could call find_vma(), it would be perfectly safe and much 
>> more simpler.
> 
> You seem to be assuming that the kernel can easily work out a single
> virtual address which will be the only place where a given set of
> interrupt pages are mapped.  But that is really not possible in the
> general case, because userspace could have mapped the fd at many
> different offsets in many different places.
> 
> QEMU doesn't do that; in QEMU, the mmaps are sufficiently limited that
> it can work out a single virtual address that needs to be changed.
> The way that QEMU should tell the kernel what that address is and what
> the mapping should be changed to, is via the existing munmap()/mmap()
> interface.

Yes. We agreed on that. QEMU should handle these mappings somewhere in 
VFIO. It's me grumbling, that's all.

The discussion has moved to the mmap() interface of the KVM device. The 
current proposal adds controls on the device creating fds to mmap() the 
TIMA pages and the ESB pages. David is proposing to use directly the fd 
of the KVM device to mmap() these pages with a different offset for each 
set. 

I think that should work pretty well, for passthrough also. The fault 
handler should take care of populating the VMA(s) with the appropriate 
pages. 

We might support END notification one day, so we should have room for 
these pages. And nested might require IRQ space extensions at L1. 
something to keep in mind.

C.
  

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-09  9:41                         ` Cédric Le Goater
@ 2019-02-11  2:38                           ` David Gibson
  2019-02-11  6:42                             ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 135+ messages in thread
From: David Gibson @ 2019-02-11  2:38 UTC (permalink / raw)
  To: Cédric Le Goater; +Cc: kvm, kvm-ppc, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 4507 bytes --]

On Sat, Feb 09, 2019 at 10:41:38AM +0100, Cédric Le Goater wrote:
> On 2/8/19 10:53 PM, Paul Mackerras wrote:
> > On Fri, Feb 08, 2019 at 08:58:14AM +0100, Cédric Le Goater wrote:
> >> On 2/8/19 6:15 AM, David Gibson wrote:
> >>> On Thu, Feb 07, 2019 at 10:03:15AM +0100, Cédric Le Goater wrote:
> >>>> That's the plan I have in mind as suggested by Paul if I understood it well.
> >>>> The mechanics are more complex than the patch zapping the PTEs from the VMA
> >>>> but it's also safer.
> >>>
> >>> Well, yes, where "safer" means "has the possibility to be correct".
> >>
> >> Well, the only problem with the kernel approach is keeping a pointer on 
> >> the VMA. If we could call find_vma(), it would be perfectly safe and much 
> >> more simpler.
> > 
> > You seem to be assuming that the kernel can easily work out a single
> > virtual address which will be the only place where a given set of
> > interrupt pages are mapped.  But that is really not possible in the
> > general case, because userspace could have mapped the fd at many
> > different offsets in many different places.
> > 
> > QEMU doesn't do that; in QEMU, the mmaps are sufficiently limited that
> > it can work out a single virtual address that needs to be changed.
> > The way that QEMU should tell the kernel what that address is and what
> > the mapping should be changed to, is via the existing munmap()/mmap()
> > interface.
> 
> Yes. We agreed on that. QEMU should handle these mappings somewhere in 
> VFIO. It's me grumbling, that's all.
> 
> The discussion has moved to the mmap() interface of the KVM device. The 
> current proposal adds controls on the device creating fds to mmap() the 
> TIMA pages and the ESB pages. David is proposing to use directly the fd 
> of the KVM device to mmap() these pages with a different offset for each 
> set. 
> 
> I think that should work pretty well, for passthrough also. The fault 
> handler should take care of populating the VMA(s) with the appropriate 
> pages. 
> 
> We might support END notification one day, so we should have room for 
> these pages. And nested might require IRQ space extensions at L1. 
> something to keep in mind.

I had some more thoughts on this topic.  I think there's been some
confusion because there are more ways of tackling this than I
previously realized:

1) All in kernel

The offset always maps directly to guest irq number and the kernel
somehow binds it either to an IPI or a host irq as necessary.
Cédric's original code attempts this, but the mechanism of keeping a
pointer to the VMA can't work.

But.. remapping the irqs should be sufficiently infrequent that it
might be ok to consider simply stepping through all the hosting
process's VMAs to do this.

2) Remapped in qemu (using memory regions)

I _think_ (in hindsight) was Cédric's been discussing as the
alternative in more recent posts.

Qemu maps the IPI pages at one place and the passthrough IRQ pages
somewhere else.  The IPIs are mapped into the guest as one memory
region, then any passthrough IRQ pages are mapped over that using
overlapping memory regions.

I don't think this approach will work well, because it could require a
bunch of separate KVM memory slots, which are fairly scarce.

3) Remapped in qemu (using mmap())

This is the approach I (and I think Paul) have been suggested in
contrast to (1).

Qemu maps the IPI pages and maps those into the guest.  When we need
to set up a passthrough IRQ, qemu mmap()s its pages directly over the
IPI pages, and it remains mapped into the guest with the same memory
region / memslot as the IPIs are already using.  If the passthrough
device is removed we have to remap the IPI pages back into place.

4) Dedicated irq numbers

We never re-use regular guest irq numbers for passthrough irqs,
instead we put them somewhere else and keep those mapped to the
passthrough irq pages.

I was favouring this approach, but it does mean there will be a guest
visible difference between kernel_irqchip=on and off which isn't
great.


(1) is the most elegant _interface_, but as we've seen it's
problematic to implement.  Looking at the for_all_vmas() approach
could be interesting, but otherwise option (3) might be the most
practical.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-11  2:38                           ` David Gibson
@ 2019-02-11  6:42                             ` Benjamin Herrenschmidt
  2019-02-12 22:07                               ` Cédric Le Goater
  0 siblings, 1 reply; 135+ messages in thread
From: Benjamin Herrenschmidt @ 2019-02-11  6:42 UTC (permalink / raw)
  To: David Gibson, Cédric Le Goater; +Cc: linuxppc-dev, kvm, kvm-ppc

On Mon, 2019-02-11 at 13:38 +1100, David Gibson wrote:
> 
> 1) All in kernel
> 
> The offset always maps directly to guest irq number and the kernel
> somehow binds it either to an IPI or a host irq as necessary.
> Cédric's original code attempts this, but the mechanism of keeping a
> pointer to the VMA can't work.

Why do you need a pointer to the VMA anyway ? unmap_mapping_range()
doesn't need a VMA for the unmap part, and faults/mmaps have the VMA.

> But.. remapping the irqs should be sufficiently infrequent that it
> might be ok to consider simply stepping through all the hosting
> process's VMAs to do this.

Which unmap_mapping_range() does for you as I explained previously. You
only need the address space. See how spufs does it (among others).

> 2) Remapped in qemu (using memory regions)
> 
> I _think_ (in hindsight) was Cédric's been discussing as the
> alternative in more recent posts.
> 
> Qemu maps the IPI pages at one place and the passthrough IRQ pages
> somewhere else.  The IPIs are mapped into the guest as one memory
> region, then any passthrough IRQ pages are mapped over that using
> overlapping memory regions.
> 
> I don't think this approach will work well, because it could require a
> bunch of separate KVM memory slots, which are fairly scarce.
> 
> 3) Remapped in qemu (using mmap())
> 
> This is the approach I (and I think Paul) have been suggested in
> contrast to (1).
> 
> Qemu maps the IPI pages and maps those into the guest.  When we need
> to set up a passthrough IRQ, qemu mmap()s its pages directly over the
> IPI pages, and it remains mapped into the guest with the same memory
> region / memslot as the IPIs are already using.  If the passthrough
> device is removed we have to remap the IPI pages back into place.
> 
> 4) Dedicated irq numbers
> 
> We never re-use regular guest irq numbers for passthrough irqs,
> instead we put them somewhere else and keep those mapped to the
> passthrough irq pages.
> 
> I was favouring this approach, but it does mean there will be a guest
> visible difference between kernel_irqchip=on and off which isn't
> great.
> 
> 
> (1) is the most elegant _interface_, but as we've seen it's
> problematic to implement.  Looking at the for_all_vmas() approach
> could be interesting, but otherwise option (3) might be the most
> practical.
> 
> --


^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device
  2019-02-11  6:42                             ` Benjamin Herrenschmidt
@ 2019-02-12 22:07                               ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-12 22:07 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, David Gibson; +Cc: linuxppc-dev, kvm, kvm-ppc

On 2/11/19 7:42 AM, Benjamin Herrenschmidt wrote:
> On Mon, 2019-02-11 at 13:38 +1100, David Gibson wrote:
>>
>> 1) All in kernel
>>
>> The offset always maps directly to guest irq number and the kernel
>> somehow binds it either to an IPI or a host irq as necessary.
>> Cédric's original code attempts this, but the mechanism of keeping a
>> pointer to the VMA can't work.
> 
> Why do you need a pointer to the VMA anyway ? unmap_mapping_range()
> doesn't need a VMA for the unmap part, and faults/mmaps have the VMA.
> 
>> But.. remapping the irqs should be sufficiently infrequent that it
>> might be ok to consider simply stepping through all the hosting
>> process's VMAs to do this.
> 
> Which unmap_mapping_range() does for you as I explained previously. You
> only need the address space. See how spufs does it (among others).

and the different CAPI drivers. This is much better and it works fine.

On the same topic, the XIVE IC on P10 will use IPI ESB pages for the 
PHB interrupts sources. We will still need this kind of remapping but 
the pages will be from the same controller.

Thanks,

C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

* Re: [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration
  2019-02-08  5:15                     ` David Gibson
@ 2019-02-14 16:50                       ` Cédric Le Goater
  0 siblings, 0 replies; 135+ messages in thread
From: Cédric Le Goater @ 2019-02-14 16:50 UTC (permalink / raw)
  To: David Gibson; +Cc: kvm, kvm-ppc, Paul Mackerras, linuxppc-dev

On 2/8/19 6:15 AM, David Gibson wrote:
> On Thu, Feb 07, 2019 at 10:13:48AM +0100, Cédric Le Goater wrote:
>> On 2/7/19 3:48 AM, David Gibson wrote:
>>> On Wed, Feb 06, 2019 at 08:07:36AM +0100, Cédric Le Goater wrote:
>>>> On 2/6/19 2:24 AM, David Gibson wrote:
>>>>> On Wed, Feb 06, 2019 at 12:23:29PM +1100, David Gibson wrote:
>>>>>> On Tue, Feb 05, 2019 at 02:03:11PM +0100, Cédric Le Goater wrote:
>>>>>>> On 2/5/19 6:32 AM, David Gibson wrote:
>>>>>>>> On Mon, Feb 04, 2019 at 05:07:28PM +0100, Cédric Le Goater wrote:
>>>>>>>>> On 2/4/19 6:21 AM, David Gibson wrote:
>>>>>>>>>> On Mon, Jan 07, 2019 at 07:43:27PM +0100, Cédric Le Goater wrote:
>>>>>>>>>>> Theses are use to capure the XIVE EAS table of the KVM device, the
>>>>>>>>>>> configuration of the source targets.
>>>>>>>>>>>
>>>>>>>>>>> Signed-off-by: Cédric Le Goater <clg@kaod.org>
>>>>>>>>>>> ---
>>>>>>>>>>>  arch/powerpc/include/uapi/asm/kvm.h   | 11 ++++
>>>>>>>>>>>  arch/powerpc/kvm/book3s_xive_native.c | 87 +++++++++++++++++++++++++++
>>>>>>>>>>>  2 files changed, 98 insertions(+)
>>>>>>>>>>>
>>>>>>>>>>> diff --git a/arch/powerpc/include/uapi/asm/kvm.h b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>>>> index 1a8740629acf..faf024f39858 100644
>>>>>>>>>>> --- a/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>>>> +++ b/arch/powerpc/include/uapi/asm/kvm.h
>>>>>>>>>>> @@ -683,9 +683,20 @@ struct kvm_ppc_cpu_char {
>>>>>>>>>>>  #define   KVM_DEV_XIVE_SAVE_EQ_PAGES	4
>>>>>>>>>>>  #define KVM_DEV_XIVE_GRP_SOURCES	2	/* 64-bit source attributes */
>>>>>>>>>>>  #define KVM_DEV_XIVE_GRP_SYNC		3	/* 64-bit source attributes */
>>>>>>>>>>> +#define KVM_DEV_XIVE_GRP_EAS		4	/* 64-bit eas attributes */
>>>>>>>>>>>  
>>>>>>>>>>>  /* Layout of 64-bit XIVE source attribute values */
>>>>>>>>>>>  #define KVM_XIVE_LEVEL_SENSITIVE	(1ULL << 0)
>>>>>>>>>>>  #define KVM_XIVE_LEVEL_ASSERTED		(1ULL << 1)
>>>>>>>>>>>  
>>>>>>>>>>> +/* Layout of 64-bit eas attribute values */
>>>>>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_SHIFT	0
>>>>>>>>>>> +#define KVM_XIVE_EAS_PRIORITY_MASK	0x7
>>>>>>>>>>> +#define KVM_XIVE_EAS_SERVER_SHIFT	3
>>>>>>>>>>> +#define KVM_XIVE_EAS_SERVER_MASK	0xfffffff8ULL
>>>>>>>>>>> +#define KVM_XIVE_EAS_MASK_SHIFT		32
>>>>>>>>>>> +#define KVM_XIVE_EAS_MASK_MASK		0x100000000ULL
>>>>>>>>>>> +#define KVM_XIVE_EAS_EISN_SHIFT		33
>>>>>>>>>>> +#define KVM_XIVE_EAS_EISN_MASK		0xfffffffe00000000ULL
>>>>>>>>>>> +
>>>>>>>>>>>  #endif /* __LINUX_KVM_POWERPC_H */
>>>>>>>>>>> diff --git a/arch/powerpc/kvm/book3s_xive_native.c b/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>>>>>> index f2de1bcf3b35..0468b605baa7 100644
>>>>>>>>>>> --- a/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>>>>>> +++ b/arch/powerpc/kvm/book3s_xive_native.c
>>>>>>>>>>> @@ -525,6 +525,88 @@ static int kvmppc_xive_native_sync(struct kvmppc_xive *xive, long irq, u64 addr)
>>>>>>>>>>>  	return 0;
>>>>>>>>>>>  }
>>>>>>>>>>>  
>>>>>>>>>>> +static int kvmppc_xive_native_set_eas(struct kvmppc_xive *xive, long irq,
>>>>>>>>>>> +				      u64 addr)
>>>>>>>>>>
>>>>>>>>>> I'd prefer to avoid the name "EAS" here.  IIUC these aren't "raw" EAS
>>>>>>>>>> values, but rather essentially the "source config" in the terminology
>>>>>>>>>> of the PAPR hcalls.  Which, yes, is basically implemented by setting
>>>>>>>>>> the EAS, but since it's the PAPR architected state that we need to
>>>>>>>>>> preserve across migration, I'd prefer to stick as close as we can to
>>>>>>>>>> the PAPR terminology.
>>>>>>>>>
>>>>>>>>> But we don't have an equivalent name in the PAPR specs for the tuple 
>>>>>>>>> (prio, server). We could use the generic 'target' name may be ? even 
>>>>>>>>> if this is usually referring to a CPU number.
>>>>>>>>
>>>>>>>> Um.. what?  That's about terminology for one of the fields in this
>>>>>>>> thing, not about the name for the thing itself.
>>>>>>>>
>>>>>>>>> Or, IVE (Interrupt Vector Entry) ? which makes some sense. 
>>>>>>>>> This is was the former name in HW. I think we recycle it for KVM.
>>>>>>>>
>>>>>>>> That's a terrible idea, which will make a confusing situation even
>>>>>>>> more confusing.
>>>>>>>
>>>>>>> Let's use SOURCE_CONFIG and QUEUE_CONFIG. The KVM ioctls are very 
>>>>>>> similar to the hcalls anyhow.
>>>>>>
>>>>>> Yes, I think that's a good idea.
>>>>>
>>>>> Actually... AIUI the SET_CONFIG hcalls shouldn't be a fast path.  
>>>>
>>>> No indeed. I have move them to standard hcalls in the current version.
>>>>
>>>>> Can
>>>>> we simplify things further by removing the hcall implementation from
>>>>> the kernel entirely, and have qemu implement them by basically just
>>>>> forwarding them to the appropriate SET_CONFIG ioctl()?
>>>>
>>>> Yes. I think we could. 
>>>
>>> Great!
>>>
>>>> The hcalls H_INT_SET_SOURCE_CONFIG and H_INT_SET_QUEUE_CONFIG and 
>>>> the KVM ioctls to set the EQ and the SOURCE configuration have a 
>>>> lot in common. I need to look at how we can plug the KVM ioctl in 
>>>> the hcalls under QEMU.
>>>>
>>>> We will have to convert the returned error to respect the PAPR 
>>>> specs or have the ioctls return H_* errors.
>>>
>>> I don't think returning H_* values from a kernel call is a good idea.
>>> Converting errors is kinda ugly, but I still think it's the better
>>> option.  Note that we already have something like this for the HPT
>>> resizing hcalls.
>>
>> ok.
>>  
>>>> Let's dig that idea. If we choose that path, QEMU will have an 
>>>> up-to-date EAT and so we won't need to synchronize its state anymore 
>>>> for migration.
>>>
>>> I guess so, though I don't see that as essential.
>>>
>>>> H_INT_GET_SOURCE_CONFIG can be implemented in QEMU without any KVM 
>>>> ioctl.
>>>>
>>>> H_INT_GET_QUEUE_INFO could be implemented in QEMU. I need to check 
>>>> how we return the address of the END ESB in sPAPR. We haven't paid 
>>>> much attention to these pages because they are not used under Linux
>>>> and today the address is returned by OPAL. 
>>>>
>>>> H_INT_GET_QUEUE_CONFIG is a little more problematic because we need
>>>> to query into the XIVE HW the EQ index and toggle bit. OPAL support
>>>> is required for that. But we could reduce the KVM support to the 
>>>> ioctl querying these EQ information.
>>>
>>> Right, and we'd need an ioctl() like that for migration anyway, yes?
>>
>> Yes. it is the same need.
>>
>>>> H_INT_ESB could be entirely done under QEMU.
>>>
>>> This one can actually happen on fairly hot paths, so I think doing
>>> that in qemu probably isn't a good idea.
>>
>> I agree It would nice to have some performance.
>>
>> This hcall is used when LSIs are involved, which is not really a common 
>> configuration. There are no OPAL calls involved. And we are duplicating 
>> code at the KVM level to retrigger the interrupt when the level is still
>> asserted.
>>
>> I will benchmark the two options before making a choice.
> 
> Ok.
 

Here are some iperf results for a 4 vCPUs guest running a 5.0.0 kernel
on a small initrd image. I didn't do any kind of tuning like CPU pinning. 
So these are really rough figures :   


  kernel irqchip            OFF      ON    ON (*)

  rtl8139  (LSI)           1.19    1.24   1.23    Gbits/sec
  VIRTIO                  31.80   42.30    --     Gbits/sec


There is not much benefit in handling the H_INT_ESB hcall under KVM it seems. 
I think we can leave it under QEMU.  


C.

^ permalink raw reply	[flat|nested] 135+ messages in thread

end of thread, back to index

Thread overview: 135+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-01-07 18:43 [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Cédric Le Goater
2019-01-07 18:43 ` [PATCH 01/19] powerpc/xive: export flags for the XIVE native exploitation mode hcalls Cédric Le Goater
2019-01-09  3:33   ` David Gibson
2019-01-09 13:08   ` Michael Ellerman
2019-01-09 13:38     ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 02/19] powerpc/xive: add OPAL extensions for the XIVE native exploitation support Cédric Le Goater
2019-01-09  4:26   ` David Gibson
2019-01-07 18:43 ` [PATCH 03/19] KVM: PPC: Book3S HV: check the IRQ controller type Cédric Le Goater
2019-01-09  4:27   ` David Gibson
2019-01-22  4:56   ` Paul Mackerras
2019-01-23 16:24     ` Cédric Le Goater
2019-02-04  0:50       ` David Gibson
2019-02-04 10:16         ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 04/19] KVM: PPC: Book3S HV: export services for the XIVE native exploitation device Cédric Le Goater
2019-01-11  4:09   ` David Gibson
2019-01-07 18:43 ` [PATCH 05/19] KVM: PPC: Book3S HV: add a new KVM device for the XIVE native exploitation mode Cédric Le Goater
2019-01-22  5:05   ` Paul Mackerras
2019-01-23 16:28     ` Cédric Le Goater
2019-01-28 17:35     ` Cédric Le Goater
2019-01-30  4:29       ` Paul Mackerras
2019-01-30  7:01         ` Cédric Le Goater
2019-01-31  3:01           ` Paul Mackerras
2019-02-01 17:03             ` Cédric Le Goater
2019-02-04  4:25   ` David Gibson
2019-02-04 11:19     ` Cédric Le Goater
2019-02-05  5:26       ` David Gibson
2019-01-07 18:43 ` [PATCH 06/19] KVM: PPC: Book3S HV: add a GET_ESB_FD control to the XIVE native device Cédric Le Goater
2019-01-22  5:09   ` Paul Mackerras
2019-01-23 16:48     ` Cédric Le Goater
2019-02-04  4:45   ` David Gibson
2019-02-04 11:30     ` Cédric Le Goater
2019-02-05  5:28       ` David Gibson
2019-02-05 12:55         ` Cédric Le Goater
2019-02-06  1:23           ` David Gibson
2019-02-06  7:21             ` Cédric Le Goater
2019-02-07  2:49               ` David Gibson
2019-02-07  9:03                 ` Cédric Le Goater
2019-02-08  5:15                   ` David Gibson
2019-02-08  7:58                     ` Cédric Le Goater
2019-02-08 21:53                       ` Paul Mackerras
2019-02-09  9:41                         ` Cédric Le Goater
2019-02-11  2:38                           ` David Gibson
2019-02-11  6:42                             ` Benjamin Herrenschmidt
2019-02-12 22:07                               ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 07/19] KVM: PPC: Book3S HV: add a GET_TIMA_FD control to " Cédric Le Goater
2019-01-07 18:43 ` [PATCH 08/19] KVM: PPC: Book3S HV: add a VC_BASE control to the " Cédric Le Goater
2019-01-22  5:14   ` Paul Mackerras
2019-01-23 16:56     ` Cédric Le Goater
2019-02-04  4:49       ` David Gibson
2019-02-04 15:36         ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 09/19] KVM: PPC: Book3S HV: add a SET_SOURCE " Cédric Le Goater
2019-02-04  4:57   ` David Gibson
2019-02-04 19:07     ` Cédric Le Goater
2019-02-05  5:35       ` David Gibson
2019-02-05 13:39         ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 10/19] KVM: PPC: Book3S HV: add a EISN attribute to kvmppc_xive_irq_state Cédric Le Goater
2019-01-07 18:43 ` [PATCH 11/19] KVM: PPC: Book3S HV: add support for the XIVE native exploitation mode hcalls Cédric Le Goater
2019-01-22  5:23   ` Paul Mackerras
2019-01-23  6:44     ` Benjamin Herrenschmidt
2019-01-23  8:48       ` Cédric Le Goater
2019-01-23 10:26         ` Paul Mackerras
2019-01-23 10:48           ` Cédric Le Goater
2019-01-23 21:23           ` Benjamin Herrenschmidt
2019-01-07 18:43 ` [PATCH 12/19] KVM: PPC: Book3S HV: record guest queue page address Cédric Le Goater
2019-02-04  5:15   ` David Gibson
2019-02-04 15:37     ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 13/19] KVM: PPC: Book3S HV: add a SYNC control for the XIVE native migration Cédric Le Goater
2019-02-04  5:17   ` David Gibson
2019-02-04 15:39     ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 14/19] KVM: PPC: Book3S HV: add a control to make the XIVE EQ pages dirty Cédric Le Goater
2019-02-04  5:18   ` David Gibson
2019-02-04 15:46     ` Cédric Le Goater
2019-02-05  5:30       ` David Gibson
2019-01-07 18:43 ` [PATCH 15/19] KVM: PPC: Book3S HV: add get/set accessors for the source configuration Cédric Le Goater
2019-02-04  5:21   ` David Gibson
2019-02-04 16:07     ` Cédric Le Goater
2019-02-05  5:32       ` David Gibson
2019-02-05 13:03         ` Cédric Le Goater
2019-02-06  1:23           ` David Gibson
2019-02-06  1:24             ` David Gibson
2019-02-06  7:07               ` Cédric Le Goater
2019-02-07  2:48                 ` David Gibson
2019-02-07  9:13                   ` Cédric Le Goater
2019-02-08  5:15                     ` David Gibson
2019-02-14 16:50                       ` Cédric Le Goater
2019-01-07 18:43 ` [PATCH 16/19] KVM: PPC: Book3S HV: add get/set accessors for the EQ configuration Cédric Le Goater
2019-02-04  5:24   ` David Gibson
2019-02-05 17:45     ` Cédric Le Goater
2019-01-07 19:10 ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state Cédric Le Goater
2019-01-07 19:10   ` [PATCH 18/19] KVM: PPC: Book3S HV: add passthrough support Cédric Le Goater
2019-01-22  5:26     ` Paul Mackerras
2019-01-23  6:45       ` Benjamin Herrenschmidt
2019-01-23 10:30         ` Paul Mackerras
2019-01-23 11:07           ` Cédric Le Goater
2019-01-28  6:13             ` Paul Mackerras
2019-01-28 18:26               ` Cédric Le Goater
2019-01-29  2:45                 ` Paul Mackerras
2019-01-29 13:47                   ` Cédric Le Goater
2019-01-30  6:20                     ` Paul Mackerras
2019-01-30 15:54                       ` Cédric Le Goater
2019-01-31  2:48                         ` Paul Mackerras
2019-01-29  4:12                 ` Paul Mackerras
2019-01-29 17:44                   ` Cédric Le Goater
2019-01-30  5:55                     ` Paul Mackerras
2019-01-30  7:06                       ` Cédric Le Goater
2019-01-23 21:25           ` Benjamin Herrenschmidt
2019-01-24  8:41             ` Cédric Le Goater
2019-01-28  4:43             ` Paul Mackerras
2019-01-29 13:46               ` Cédric Le Goater
2019-01-07 19:10   ` [PATCH 19/19] KVM: introduce a KVM_DELETE_DEVICE ioctl Cédric Le Goater
2019-01-22  5:42     ` Paul Mackerras
2019-01-23 18:39       ` Cédric Le Goater
2019-01-23 21:32         ` Benjamin Herrenschmidt
2019-02-04  5:26   ` [PATCH 17/19] KVM: PPC: Book3S HV: add get/set accessors for the VP XIVE state David Gibson
2019-02-04 18:57     ` Cédric Le Goater
2019-02-05  5:33       ` David Gibson
2019-02-05 11:58         ` Cédric Le Goater
2019-02-06  1:19           ` David Gibson
2019-01-22  4:46 ` [PATCH 00/19] KVM: PPC: Book3S HV: add XIVE native exploitation mode Paul Mackerras
2019-01-23 19:07   ` Cédric Le Goater
2019-01-23 21:35     ` Benjamin Herrenschmidt
2019-01-26  8:25       ` Cédric Le Goater
2019-02-04  5:36         ` David Gibson
2019-02-05 11:31           ` Cédric Le Goater
2019-02-05 22:13             ` Paul Mackerras
2019-02-06  1:18               ` David Gibson
2019-02-06  7:35                 ` Cédric Le Goater
2019-02-07  2:51                   ` David Gibson
2019-02-07  8:31                     ` Cédric Le Goater
2019-02-08  5:07                       ` David Gibson
2019-02-08  7:38                         ` Cédric Le Goater
2019-01-28  5:51     ` Paul Mackerras
2019-01-29 13:51       ` Cédric Le Goater
2019-01-30  5:40         ` Paul Mackerras
2019-01-30 15:36           ` Cédric Le Goater

LinuxPPC-Dev Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/linuxppc-dev/0 linuxppc-dev/git/0.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 linuxppc-dev linuxppc-dev/ https://lore.kernel.org/linuxppc-dev \
		linuxppc-dev@lists.ozlabs.org linuxppc-dev@ozlabs.org linuxppc-dev@archiver.kernel.org
	public-inbox-index linuxppc-dev


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.ozlabs.lists.linuxppc-dev


AGPL code for this site: git clone https://public-inbox.org/ public-inbox